WU appears to be stuck

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109410697841
RAC: 35022522

At one point (in a different thread) you mentioned that you had tried running with 2 cards, a 470 as well as the 570. If your 470 is still available, it would be worthwhile replacing the current 570 with it, just to see whether or not the errors continue. If the errors stop, your 570 may have a hardware issue. If they continue as before, the 570 hardware is unlikely to be the problem, since two different GPUs are unlikely to share the same fault. That would be the easiest way of ruling the card itself in or out.

Once that gives you an answer, the trick is then to work out how to rule in or out the other parts of the complete chain.

Let's really pin some things down. To have run 2 GPUs previously, your board must have (mechanically) 2 x16 slots (unless you were running the 2nd card on some sort of extension (riser) from an x1 slot). Assuming you do have 2 x16 slots, it would be common for the primary slot to be x16 electrically, but the other to be only x4 (or perhaps you can have an x8/x8 configuration). Can you confirm which particular slot your current single card is in? It should be in the primary slot, which is the one closest to the CPU.

There are two ways that power is delivered to a GPU.  Some comes from the PCIe slot itself.  The rest comes from an 8 pin PCIe power connector to the card.   The slot power is sourced through the 2 connectors that deliver power to the motherboard.  There is the 24 pin main power connector and (usually these days) an 8 pin ATX12V connector.  On older or more budget boards there may only be a 4 pin socket so the PSU cable usually has a 'split' connector that can accommodate either 4 pin or 8 pin sockets.  I want you to very carefully check all of these power connectors for proper seating.  Unplug and replug each one making sure the 'latches' click properly into place.  I have seen examples of problems caused by connectors that are not properly seated.

At some point in carrying out the above, please enter the firmware setup (either BIOS or UEFI, whatever your board uses) and see if you can find a voltage reporting page which lists the actual voltages for each of 3.3V, 5V, and 12V. Are any of these voltages on the low side? In particular, is the 12V reading significantly less than 12V? While you're there, check CPU fan speed. When the machine is actually running, check voltages and fan speeds again with your Windows utility of choice to see if anything changes much or looks abnormal.
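As a rough guide to what "on the low side" means, the ATX specification allows about ±5% on the 3.3V, 5V and 12V rails. The sketch below just applies that tolerance. The readings in it are made-up placeholders, not values from this machine, so substitute whatever the firmware page or your monitoring utility reports.

```python
# Nominal rail voltages and the +/-5% tolerance allowed by the ATX spec.
TOLERANCE = 0.05
NOMINAL_RAILS = (3.3, 5.0, 12.0)

def check_rail(nominal, measured, tolerance=TOLERANCE):
    """Return (status, deviation_percent) for one rail reading."""
    deviation = (measured - nominal) / nominal
    if deviation < -tolerance:
        status = "LOW"
    elif deviation > tolerance:
        status = "HIGH"
    else:
        status = "ok"
    return status, round(deviation * 100, 2)

# Hypothetical readings, for illustration only.
readings = {3.3: 3.31, 5.0: 5.02, 12.0: 11.30}

for nominal in NOMINAL_RAILS:
    status, pct = check_rail(nominal, readings[nominal])
    print(f"{nominal:>4}V rail: {readings[nominal]:.2f}V ({pct:+.2f}%) {status}")
```

Firmware and software sensors are not precision instruments, so a reading just outside the band is a reason to look closer rather than proof of a bad PSU.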

Another thing to look at is system memory. In Linux, I've used a utility called memtest86+ which is pretty good at exercising the RAM and identifying single bit errors. A complete run can take 20-30 mins or more. There is probably some similar utility available for Windows - I don't know. It would be useful to rule out RAM as any sort of potential problem. Without a proper test utility, and if you have at least 2 sticks of RAM, a useful thing to do is remove all the sticks and check the gold plating on the pins for any signs of 'dulling' caused by contaminant buildup. I've found that gently 'cleaning' the pins with a 'hard' eraser (so as not to leave 'soft' residue behind) can occasionally restore sticks that throw errors in memtest. If you have any doubt about the condition of the RAM, either try alternate sticks, or just use a single stick at a time to see if the problem is with one stick only. If you can isolate the problem to just one stick of a pair that way, then just replace the dodgy stick.

Once you try out some or all of the above, report back and we'll try to devise a strategy to link the problem to one of the 5 different most likely areas - GPU hardware, PSU/power system, motherboard/CPU, system memory, software/drivers.  Sometimes, the only way to find the culprit is to replace single items at a time until you find the bit that clears the problem.  That's not easy if you don't have a nice big spare parts box full of suitable replacement components :-).

 

Cheers,
Gary.

lunkerlander
Joined: 25 Jul 18
Posts: 46
Credit: 31464094
RAC: 3

Thank you for all your help everyone! 

I no longer have the RX470, or I would definitely give it a try. I was able to sell it for what I paid for it. I've tried the following things, in this order, to resolve the computation errors:

1.) I did a clean install of the newest AMD drivers through AMD Radeon settings app.

2.) I ran 1 task at a time instead of 2.

3.) I used MSI Afterburner to underclock the GPU core clock from 1280 MHz to 1120 MHz.

4.) I changed the graphics mode from "graphics" to "compute" under the Radeon settings app.

5.) I used DDU to completely remove my display drivers and then did another clean install.

6.) Finally, I used MSI Afterburner to underclock the GPU core clock from 1280 MHz down to 1050 MHz. This has been the most successful. I did this 3 hours ago, and haven't had any errors since. It's too early to tell if it worked, but the initial results look good. Task completion times went up by 1-1.5 minutes, but there haven't been errors yet.
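To put the two underclocks and the longer run times on the same scale, here is a small arithmetic sketch. The clock figures come from the steps above. The 600-650 second normal run time is the figure mentioned later in this thread, and pairing it with the 1-1.5 minute increase (60 s with 600 s, 90 s with 650 s) is only an assumption made to bracket the range.

```python
def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100

stock_mhz = 1280
for target_mhz in (1120, 1050):
    change = percent_change(stock_mhz, target_mhz)
    print(f"{stock_mhz} -> {target_mhz} MHz: {change:+.1f}%")

# Assumed pairing of normal run time (s) with the reported extra time (s).
for base_s, extra_s in ((600, 60), (650, 90)):
    change = percent_change(base_s, base_s + extra_s)
    print(f"{base_s} s -> {base_s + extra_s} s: {change:+.1f}%")
```

Under those assumptions the run time grows by somewhat less than the clock was cut, which suggests the core clock is not the only thing setting the task time.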

System info:

Motherboard ( https://www.asrock.com/mb/Intel/H97M%20Pro4/ ) has 2 PCIe 16x slots. One is a PCIe 3.0 x16 and the other runs as a PCIe 2.0 x4. The RX570 has always been in the top PCIe 16x slot (the main one). 

PSU: 850W Bronze 

RAM: 8GB DDR3 1600 (2 x 4GB)

If everything starts working, I'd like to add a 2nd GPU. If it doesn't, I could still get a 2nd GPU to test if it works better than this one.

lunkerlander
Joined: 25 Jul 18
Posts: 46
Credit: 31464094
RAC: 3

10 minutes after writing that post, I got my first error:

https://einsteinathome.org/task/818909017

Hopefully there won't be as many as before. I'll update again in a day.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109410697841
RAC: 35022522

Joshua wrote:
...Motherboard ( https://www.asrock.com/mb/Intel/H97M%20Pro4/ ) has 2 PCIe 16x slots. One is a PCIe 3.0 x16 and the other runs as a PCIe 2.0 x4. The RX570 has always been in the top PCIe 16x slot (the main one).

I have a somewhat similar board - an Asrock B250M PRO4 with the same pair of PCIe slots x16 and x4.  I'm running dual RX 560 GPUs, each x2 (4 concurrent GPU tasks).  The CPU is a G4560 - Pentium dual core with HT - 4 threads - but I'm only running GPU tasks.  During winter it used to run 2xCPU and 4xGPU tasks.  It doesn't have any trouble running either configuration.
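For a sense of scale between the two slots described in the quote (PCIe 3.0 x16 and PCIe 2.0 x4), the sketch below works out the theoretical one-direction link bandwidth from the standard per-lane rates and encodings. These are link figures only and say nothing about how much bandwidth these particular tasks actually need.

```python
# Per-lane transfer rate (GT/s) and line-encoding efficiency per PCIe generation.
PCIE_GEN = {
    2: (5.0, 8 / 10),     # PCIe 2.0: 5 GT/s, 8b/10b encoding
    3: (8.0, 128 / 130),  # PCIe 3.0: 8 GT/s, 128b/130b encoding
}

def link_bandwidth_gb_s(generation, lanes):
    """Theoretical one-direction bandwidth in GB/s, before protocol overhead."""
    rate_gt_s, efficiency = PCIE_GEN[generation]
    return rate_gt_s * efficiency * lanes / 8  # 8 bits per byte

primary = link_bandwidth_gb_s(3, 16)
secondary = link_bandwidth_gb_s(2, 4)
print(f"PCIe 3.0 x16: {primary:.2f} GB/s")
print(f"PCIe 2.0 x4:  {secondary:.2f} GB/s")
print(f"ratio: {primary / secondary:.1f}x")
```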

If downclocking from 1280 to 1050 MHz continues to give significantly reduced error rates, that's a pointer towards a faulty GPU.  I imagine the card is under warranty so you might like to explore options whilst you collect enough data on the new error rate.

I took a look at your most recent page of errors and could see the latest one.  It was indeed quite close to finishing.  On that same page, a few entries earlier, I took a look at the aborted task (by clicking on the task ID) and this bit hit me:-

% Binary point 936/1061
% Starting semicoherent search over f0 and f1.
% nf1dots: 41 df1dot: 2.51785325e-015 f1dot_start: -1e-013 f1dot_band: 1e-013
% Filling array of photon pairs
.
.
(lots of dots removed)
.

Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x00007FFFFFE1C5D2

Engaging BOINC Windows Runtime Debugger...

....

....

Now I know nothing of what happens when you decide to abort a running task or whether engaging the runtime debugger is normal or not but a few lines down from the above, I saw


ModLoad: 00000000e5910000 0000000000014000 C:\Program Files\AVAST Software\Avast\x64\aswhooka.dll ...

and wondered if the cause of some of your tasks getting stuck to the point you needed to abort them might be something to do with running an anti-virus scanner?  In thinking about this, I also wonder if virus scanning could cause a task to completely crash rather than just stop crunching.  Perhaps you would like to exclude the BOINC tree from scanning activities and see if your various problems go away?
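If you want to check other failed tasks for the same thing, a minimal sketch along these lines could pull the module paths out of the debugger dump and list anything loaded from outside the usual places. The "expected" directory list is an assumption (adjust it for your own install), and the second sample line is a hypothetical example added only to show a module that gets filtered out.

```python
import re

# Matches lines of the form: ModLoad: <base> <size> <path to module>
MODLOAD_RE = re.compile(
    r"^ModLoad:\s+[0-9a-fA-F]+\s+[0-9a-fA-F]+\s+(.+?\.dll)", re.IGNORECASE
)

# Assumed "expected" locations; change these to match your system.
EXPECTED_PREFIXES = (
    "c:\\windows\\",
    "c:\\programdata\\boinc\\",
    "c:\\program files\\boinc\\",
)

def third_party_modules(stderr_text):
    """Return module paths loaded from outside the expected directories."""
    found = []
    for line in stderr_text.splitlines():
        match = MODLOAD_RE.match(line.strip())
        if not match:
            continue
        path = match.group(1)
        if not path.lower().startswith(EXPECTED_PREFIXES):
            found.append(path)
    return found

# First line is from the task output quoted above; second is hypothetical.
sample = "\n".join([
    r"ModLoad: 00000000e5910000 0000000000014000 C:\Program Files\AVAST Software\Avast\x64\aswhooka.dll ...",
    r"ModLoad: 0000000077000000 00000000001a0000 C:\Windows\SYSTEM32\ntdll.dll",
])

for path in third_party_modules(sample):
    print(path)
```

A module showing up in this list only means it was loaded into the process. It is a lead to follow up, not evidence that it caused the failure.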

Quote:
If everything starts working, I'd like to add a 2nd GPU. If it doesn't, I could still get a 2nd GPU to test if it works better than this one.

If you do end up with a second card (and still problems with the first), try it on its own and running 2x and at standard clocks.  If you get no errors with it that's pretty firm evidence that something is amiss with your current card.  If so, you should push for a warranty replacement.

 

Cheers,
Gary.

lunkerlander
Joined: 25 Jul 18
Posts: 46
Credit: 31464094
RAC: 3

Thanks for your help! 

I added C:\ProgramData\BOINC to the excluded list on my antivirus.

I have had just as many errors with the downclock to 1050 MHz as before. I decided to change the core clock back to 1280 MHz since the problem isn't going away. Over the night, I noticed a task that ran for

I think what I'm going to try next is ordering another GPU and see if this RX570 is the problem. I think this is starting to look like the case. The reason why I don't think it's my PSU or RAM or Motherboard is that I could run SETI with 2 GPUs just fine. I also ran Einstein with an RX560 back in August without any problems. 

I'll order another GPU this week and see how it helps. If it works, then I know this GPU is bad. If it has the same issues, then I know it's my PC or Windows or the AMD drivers.

Thanks again!

Also, I made the above changes, and before I finished writing this post I already had another computation error. I'm going to just order the GPU today so I can get to the bottom of what's causing this.

lunkerlander
Joined: 25 Jul 18
Posts: 46
Credit: 31464094
RAC: 3

I ordered a new RX570 that should arrive on Wednesday. I'm looking forward to installing it and seeing if it works.

One interesting thing I found was a task that took 9939 seconds to run (normally around 600-650 seconds), and actually was valid. https://einsteinathome.org/task/818974595

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109410697841
RAC: 35022522

Joshua wrote:
... and actually was valid. https://einsteinathome.org/task/818974595

I'm not surprised by that.  Under Linux, it doesn't seem to happen much now but 12-18 months ago, not long after I started using Polaris GPUs in earnest, driver crashes were rather frequent and in some of those cases, the 'in-progress' crunching didn't completely stop but seemed just to transfer to the 'slow lane'.  If I rebooted the machine completely (stopping BOINC alone wouldn't cure it) the computation would resume (back in the fast lane) and the task would complete and validate, even after having spent quite a time in the slow lane.

I certainly don't know enough about how things really work at a low level but I always interpreted this type of behaviour as the computation being transferred from the GPU to a CPU core, since the design of OpenCL is to allow the same code to run on different but compatible devices.  So, the code could run on a CPU or a GPU - just much faster on a GPU.  This was just a guess on my part - I don't know if there is any merit in it.

Another thing I've noticed, particularly on Asrock H61 and H81 motherboards, is that if a GPU crashes, a warm reboot is not sufficient to clear whatever caused the lockup.  With other types (Asus, Gigabyte) a warm reboot mostly seems to work.  When a GPU locks up, the machine is still fully responsive to ssh connections over the LAN but it is unresponsive to the attached keyboard/mouse and nothing shows on the local screen.  A nice feature of Linux is the ability to do a 'safe' local reboot or shutdown, irrespective of the state of the local screen/keyboard/mouse. It's a special key combination that bypasses the desktop environment and talks directly to the kernel to safely stop running processes, sync disks, etc, before initiating a reboot or shutdown.  Much safer than hitting a reset button.

On boards other than Asrock, a warm reboot usually gives a fully functional screen display.  On the mentioned Asrock boards, the screen still doesn't display.  It has happened enough times for me to know to use the key combination to shut the machine down rather than rebooting. I then do a cold reboot - pull the power cord, hit the on button to completely drain storage caps in the PSU and then restore power and boot normally.  I'm guessing there must be something missing in the firmware of budget Asrock boards that doesn't allow the GPU to be fully reset with a warm reboot.

So next time you notice a task running longer than normal, try a warm reboot to see if that puts the task back in the fast lane. By watching the increments in %done every second, you can easily tell if a task is in the fast lane or the slow lane :-). If it is still in the slow lane (or if you get nothing on screen as you reboot) you will know to do a cold reboot. If you can restore normal crunching on such tasks, there's every chance that a future driver update may fix this. On Linux, I've noticed annoying things like this getting fixed over time.
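The fast lane / slow lane check can be put into numbers with a couple of %done readings taken a few seconds apart. This is only a sketch. The 625 second normal run time is the midpoint of the 600-650 seconds quoted earlier in the thread, the factor of 3 used as the "slow" threshold is an arbitrary assumption, and the sample readings are hypothetical.

```python
def progress_rate(samples):
    """Average progress in percent per second between first and last sample.

    samples: list of (seconds_elapsed, percent_done) tuples.
    """
    (t0, p0), (t1, p1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need samples spread over time")
    return (p1 - p0) / (t1 - t0)

def lane(samples, normal_runtime_s, slow_factor=3.0):
    """Classify a task as 'fast' or 'slow' against the rate of a normal run."""
    expected = 100.0 / normal_runtime_s  # percent per second for a normal task
    if progress_rate(samples) < expected / slow_factor:
        return "slow"
    return "fast"

NORMAL_RUNTIME_S = 625  # midpoint of the 600-650 s mentioned in the thread

# Hypothetical readings: (seconds since first reading, %done shown in BOINC).
healthy = [(0, 40.0), (10, 41.6)]
crawling = [(0, 40.0), (10, 40.1)]

print(lane(healthy, NORMAL_RUNTIME_S))
print(lane(crawling, NORMAL_RUNTIME_S))
```

For comparison, a task that takes 9939 seconds overall averages about 0.01% per second, against roughly 0.16% per second for a normal 625 second run.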

 

Cheers,
Gary.

lunkerlander
Joined: 25 Jul 18
Posts: 46
Credit: 31464094
RAC: 3

I have good news! My new RX570 has finished all of the previous tasks without any errors. I think the problem was my GPU. It's 2 months old, so I'm going to try to see if they will replace it under warranty.

My new problem is that BOINC hasn't downloaded any new Gamma-ray pulsar binary search #1 on GPUs v1.18 (FGRPopencl1K-ati) windows_x86_64 tasks for my GPU.

So right now, my PC is only working on CPU tasks. I have my account preferences set the same as before. What should I do?

Edit: I wanted to check if BOINC saw my AMD GPU, so I allowed SETI to send new tasks and it did send GPU tasks that were run by my GPU. So for some reason, Einstein isn't sending any GPU tasks. Should I reset Einstein?

lunkerlander
Joined: 25 Jul 18
Posts: 46
Credit: 31464094
RAC: 3

I solved the problem!

I had to abort the rest of my CPU tasks before I reset the project in BOINC. This didn't work, so I completely uninstalled BOINC and deleted all the ProgramData files from my computer. Then I reinstalled BOINC and Einstein started downloading GPU tasks again.

Thanks everyone for your help!

lunkerlander
Joined: 25 Jul 18
Posts: 46
Credit: 31464094
RAC: 3

After 1 day, I've had no errors with my new RX 570! It definitely was the other GPU that was bad. It's only 3 months old, so I'm going to try to RMA it.
