AMD GPU WU's 'hang'

eskercurve
Joined: 15 Mar 07
Posts: 4
Credit: 81982945
RAC: 0
Topic 216292

I've noticed several work units for my computer's AMD GPU (https://einsteinathome.org/host/11456264) 'hang.'  That is, they progress to anywhere between 95-100% and then just ... stop.  Just today I noticed one work unit with over 24 hrs of computation time but it was at 100% completion (they usually take ~2 hr to complete).  Any ideas on what may be causing this?

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5845
Credit: 109967369128
RAC: 30545752

Patrick_130 wrote:
I've noticed several work units for my computer's AMD GPU (https://einsteinathome.org/host/11456264) 'hang.'  That is, they progress to anywhere between 95-100% and then just ... stop.  Just today I noticed one work unit with over 24 hrs of computation time but it was at 100% completion (they usually take ~2 hr to complete).  Any ideas on what may be causing this?

I'm sorry for the delayed response.  I wasn't able to monitor the message boards at all on Sunday.

I've now spent some time looking through the details for your host - 11456264 - from the link you kindly provided.  I think there may be several things going on, rather than just a GPU 'hang'.

Firstly, your machine has a dual core CPU (core2duo E6750) trying to support two crunching GPUs - an nvidia GTX650 as well as an AMD HD 5800/5900 series.  In the defaults for both GPUs, the support of a single CPU core is required for each.  Whilst that is very necessary for the GTX650, the other GPU doesn't require that much support at all.  You can see this by looking at tasks on the website and seeing how much CPU time was used for each GPU type.

As currently configured, you seem to be allowing both CPU tasks and GPU tasks which means that when each GPU is crunching a task, a CPU task will be prevented from running by BOINC, as you only have two cores to support everything.  Eventually, the machine would likely go into high priority mode as a CPU task approached its deadline.  A CPU task in high priority would stop a GPU task from running (I imagine).  I don't know for sure exactly what happens because I deliberately don't allow that situation to arise.  I suspect you may have to abort some CPU tasks because they can't be done within the deadline.  So, in short, you are tending to create a workload that will adversely affect your machine's performance.

Since GPUs are much more efficient at crunching, you might like to consider changing your preferences to exclude CPU tasks completely.  Without the competition from a CPU task, you might find that your GPU task performance improves quite a bit.  You certainly won't need to abort excess CPU tasks that way.

A second possibility for your observations is the nature of GPU tasks in general.  They require CPU support but they also have two stages to the computation that may create symptoms similar to what you seem to be describing.  If you are not aware of these two stages, please read this pinned thread.  The answer I gave to Q1 describes the two stages.

When you say crunching 'hangs' at ~95% are you perhaps noticing an apparent pause in progress at ~90%?  Within the last few days, there has been a significant change in the data so that this pause is very much in evidence once again.  A week or two ago, there was no evidence of the second (follow-up) stage of crunching so no pauses.

A third possibility is that GPUs do indeed 'hang' occasionally, because of the stress that they are under.  If you see a GPU task running for significantly (hours) longer than normal, just reboot the machine.  There are several factors which increase the frequency of these 'hangs'.  From my own experience, they are, (1) the age of the device, (2) the quality of the output of the PSU, (3) the cleanliness and efficiency of the cooling system, (4) the age and stability of the motherboard, and (5) what time of year it is :-).  In summer, I have GPU 'hangs' many times more frequently than in winter :-).  When was the last time you pulled your two GPUs out and gave them a good clean? :-).  Are you sure your PSU is up to the job of supporting two crunching GPUs?

 

Cheers,
Gary.

eskercurve
Joined: 15 Mar 07
Posts: 4
Credit: 81982945
RAC: 0

Gary, thanks for taking the time to write a reply.  What I am seeing is not so much a "hang" as the % completion becoming asymptotic as it approaches 100%.  Just today, one AMD task had been running for 13 hours and was at 99.75%.  Normally it would take ~45 minutes.

And, yes, you hit the nail on the head: the mobo and all the components on it are OLD.  This computer was donated to me by my brother around 5 years ago (or more! how time flies!  I don't even think my son was born yet!), and it was old then.  The only thing new-ish on it is a GTX 650.  It is running stably and reliably (if much slower than my other computer, which has a single 1060 6GB).  

The PSU might be overstressed a bit as it is a 650 W supply feeding two cards, one of which takes two 6 pin connectors (the AMD). So, I will try disconnecting the GTX650 and see how it performs this weekend.

So, I think I will give it a good cleaning too.  I vacuumed the room and gave it a good cleaning.  Now time for my computers.  

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5845
Credit: 109967369128
RAC: 30545752

eskercurve wrote:
Gary, thanks for the time to write a reply.  What I am seeing is that it doesn't so much "hang" as the rate at which the % completion becomes asymptotic as it reaches 100%.  Just today, one AMD task took 13 hours, and was at 99.75%.  Normally would take ~45 minutes.

If a task normally takes 45 minutes, you should be able to see ~2% progress each minute.  A checkpoint is made approximately each minute so a good test at any stage is to select the task you want to check on by clicking it in the tasks tab of BOINC Manager - Advanced view.  With a task selected, you can click on a 'properties' button which will give you useful information about the state of the task.

I'm guessing that if you had checked the properties of that 13hr task at any point during that whole period, you would have seen that no checkpoint had ever been written.  The field for "CPU time at last checkpoint" would not have been filled in.

I believe (as I've seen it often enough myself) that when no checkpoint information is available, BOINC simulates progress so as not to have a supposedly crunching task sitting at 0.00% for as long as it takes until a checkpoint is written.  As soon as a checkpoint is written and the task has made a known amount of progress in a known time interval, BOINC will stop using simulated progress and use the real numbers.  If a checkpoint is never written (the task has hung) BOINC would continue with simulated progress, probably until a time limit was exceeded.

This would also explain why you would have seen progress numbers between 90% and 100%.  You would never see those values for a task that is crunching properly.   You should see %completed figures refreshing every second until the figure reaches 89.997%.  This is the end of the primary stage.  The follow-up stage which uses double precision to re-calculate the 'toplist', the 10 most likely candidate signals, then starts.  There is no increasing %completed figure until the follow-up stage is complete, at which point the display jumps immediately to 100% and the results are uploaded.

I always keep in mind the approximate progress a task should be making every second.  In your case that number is 100/(45 x 60), which is approximately 0.037% per second.  It only takes a few seconds of observation to identify a hung task because the simulated progress will be very much less than that.  As soon as you notice a problem of very slow progress, the only solution I know of is to reboot the machine.  Sometimes it needs to be a cold reboot for everything to start working properly again.
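For anyone who would rather not do that arithmetic in their head, here is a rough Python sketch of the same check.  It is only an illustration: the function names, the 25% tolerance and the idea of typing in two readings by hand are my own assumptions, not anything BOINC provides.  The 89.997% figure is the end of the primary stage mentioned above.

```python
# Hypothetical helper, not part of BOINC: compares two hand-taken progress
# readings against the rate a healthy task should show.

PRIMARY_STAGE_END = 89.997  # percent; end of the primary stage as quoted above


def expected_rate_per_second(normal_runtime_minutes):
    """Percent progress per second for a task that normally takes this long."""
    return 100.0 / (normal_runtime_minutes * 60.0)


def looks_hung(t0, p0, t1, p1, normal_runtime_minutes, tolerance=0.25):
    """Return True if two readings (seconds, percent done) suggest a stalled task.

    tolerance is an assumed fudge factor: progress slower than this fraction
    of the expected rate is treated as suspicious.
    """
    if t1 <= t0:
        raise ValueError("readings must be in time order")
    # A healthy task jumps from the end of the primary stage straight to 100%,
    # so a value strictly in between points to simulated progress.
    if PRIMARY_STAGE_END < p1 < 100.0:
        return True
    # During the follow-up stage a flat display is expected, so don't judge it.
    if p0 >= PRIMARY_STAGE_END:
        return False
    observed = (p1 - p0) / (t1 - t0)
    return observed < tolerance * expected_rate_per_second(normal_runtime_minutes)


if __name__ == "__main__":
    print(round(expected_rate_per_second(45), 3))   # percent per second
    print(looks_hung(0, 99.74, 10, 99.75, 45))      # a reading like the 13 hr task
```

Two readings taken ten seconds or so apart should be enough.  Note that the sketch deliberately says nothing about a task sitting at 89.997%, since that pause is normal.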

Quote:
The PSU might be overstressed a bit as it is a 650 W supply feeding two cards, one of which takes two 6 pin connectors (the AMD). So, I will try disconnecting the GTX650 and see how it performs this weekend.

So that's 3 x 6 pin connectors needed. Perhaps you may be using a dual molex to 6 pin PCIe power adapter cable as well as the standard 2 PCIe connectors that I presume the PSU would have had initially.

At a guess, the machine might be drawing around 300W from the wall so a 650W PSU should be OK as long as it is a good quality unit in good condition.  If it's the original PSU, I'd be quite concerned about the condition of the internal electrolytic capacitors on the secondary side.  Bulging caps might be the cause of tasks locking up.  The best way to check is to open the PSU and look at the condition of all caps, particularly those clustered close to the different low voltage rails - 3.3V, 5V, 12V (orange, red and yellow wires respectively).  A cap can fail without bulging so you won't always be able to tell just from inspection.  However, if you can see obvious signs of bulging, you do need to replace the PSU (or repair it).

Exactly the same comments apply to any electrolytic caps on the motherboard.  Modern boards are pretty much all polymer caps these days but a lot less so for older machines.  To keep older equipment functioning correctly, I've replaced lots of these caps both in PSUs and on the motherboard.

Good luck with sorting out why tasks are stalling.

 

Cheers,
Gary.
