GPU work units fail, AMD gpu

Ole Kristian
Joined: 30 Nov 17
Posts: 8
Credit: 40,617,894
RAC: 91,281
Topic 221740

A Seti user here, and I got good help in the Seti Orphans thread.

A summary of the problems and how I solved them.

First of all, Einstein together with other projects was painful.  Now it is alone on one computer.

I was advised to run only a couple of applications, namely:

Gamma-ray pulsar binary search #1 (GPU)
Gravitational Wave search O2 Multi-Directional

I reduced the CPU load to 44%, so I am running 4 CPU and 3 GPU wu at a time.  No longer waiting for memory.

One problem still occurs: sometimes a GPU unit goes on and on.  It will never end; after 10-12 hrs it is at 99.999%, if Einstein is the only project.  The GPU is still at 100% but the fans are off.  The other GPU wu finish in double the normal time.  The only solution so far for me is to abort such a wu.  When everything is good a wu takes 26 minutes, and the bad ones are easily spotted if I look at the running tasks.  But normally I wouldn't babysit BOINC, so a better solution would be preferred.

 

Keith Myers
Joined: 11 Feb 11
Posts: 838
Credit: 789,041,118
RAC: 1,401,175

Sounds like the GPU tasks are being starved of CPU resources.  Depending on the sub-project, the GPU wu do need to use the CPU at the end of the calculation on the GPU, typically at 90% for Gamma Ray and at 99% for Gravity Wave.

 

Ole Kristian
Joined: 30 Nov 17
Posts: 8
Credit: 40,617,894
RAC: 91,281

Keith Myers wrote:
Sounds like the GPU tasks are being starved of CPU resources.  Depending on the sub-project, the GPU wu do need to use the CPU at the end of the calculation on the GPU, typically at 90% for Gamma Ray and at 99% for Gravity Wave.

 

No, I don't think so.  There are lots of available CPU resources.  And it is not like it goes to 90% and then gets a problem.  It never stops, and I can spot the troublemakers early on.

Keith Myers
Joined: 11 Feb 11
Posts: 838
Credit: 789,041,118
RAC: 1,401,175

I'm sure I've read of similar posts in the forums and I believe the issue was resolved.  A forum search might be required to find them and see if they can help you resolve the issue.

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,231
Credit: 44,615,823,179
RAC: 39,416,278

Hi Kristian, Welcome to Einstein!  Thank you for deciding to devote resources here.

Yesterday, I posted a small guide in the Getting Started forum that might be useful to review.  Since you are successfully running, you may be needing more targeted information to get the best performance.

You have a pair of very capable machines with AMD GPUs.  I use very similar AMD GPUs (no nvidia) so I can give some advice about how to optimise.

Ole Kristian wrote:

First of all, Einstein together with other projects was painful.  Now it is alone on one computer.

I was recommended to only use a couple of applications ...

A lot depends on your objectives.  At point 2. in the guide, I posed a question.  At the end of the guide, I tried to suggest why detecting continuous GW is likely to be a big deal and therefore is very much why the project would love you to run the GW app.  If you make a decision about what you would like to run, I'll help you get the best out of what you decide to run, on whichever machine you want to run it.

Think about it and let me know what you want to do.

The GW GPU app is relatively new and still with some raw edges.  There are improvements but they are quite slow to arrive.  The biggest problem is that there are still some procedures/functions in the algorithm that can only run on a CPU core so GPU utilisation appears low and crunch times are slow compared to the much more efficient GRP GPU app.  However, you can considerably improve performance of the GW app by running concurrent tasks.  It does make a substantial difference to output.

For example, I have a Ryzen 5 2600 (6C/12T) host with a 4GB RX 570.  It runs 3 concurrent instances of the GW GPU tasks in a bit over 30 mins currently - a completed task every 10-12 mins on average.  It runs GPU tasks only and has a current RAC of over 200K.  This is probably close to double what could be achieved by running GPU tasks one at a time.
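
The arithmetic behind those figures can be checked with a short sketch. This is only an illustration: the function names are made up for this example, and 33 minutes is an assumed stand-in for "a bit over 30 mins".

```python
def minutes_per_completed_task(concurrent_tasks: int, batch_minutes: float) -> float:
    """Average wall-clock minutes between completed tasks when
    `concurrent_tasks` run side by side and each takes `batch_minutes`."""
    return batch_minutes / concurrent_tasks


def tasks_per_day(concurrent_tasks: int, batch_minutes: float) -> float:
    """Completed tasks per 24 hours at a steady rate."""
    return concurrent_tasks * 24 * 60 / batch_minutes


if __name__ == "__main__":
    # Assumed value: 33 min per task with 3 tasks sharing the GPU.
    print(minutes_per_completed_task(3, 33.0))
    print(round(tasks_per_day(3, 33.0), 1))
```

With three concurrent tasks, run times between 30 and 36 minutes correspond to a finished task every 10 to 12 minutes, which matches the range quoted above.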

Ole Kristian wrote:
One problem still occurs: sometimes a GPU unit goes on and on.  It will never end; after 10-12 hrs it is at 99.999%, if Einstein is the only project.  The GPU is still at 100% but the fans are off.  The other GPU wu finish in double the normal time.  The only solution so far for me is to abort such a wu.  When everything is good a wu takes 26 minutes, and the bad ones are easily spotted if I look at the running tasks.  But normally I wouldn't babysit BOINC, so a better solution would be preferred.

I run Linux exclusively, not Windows.  I used to frequently see a similar problem - it essentially got fixed with a new amdgpu kernel module (the GPU driver) sometime in early 2018 if I remember correctly.  It still happens very occasionally to a couple of machines and it's the GPU crashing (fans stopped) whilst the CPU in the machine runs normally.  Connecting to the machine remotely shows the symptoms you describe - the task in question struggling extremely slowly towards 99.999%.

It never gets to that for me these days.  I have control scripts running on a server machine which closely monitor the app running on each host.  If a GPU crashes, the CPU clock ticks being used to support the task(s) running on that GPU decline to zero per second so the situation can be detected quite quickly.  A cold restart of the affected machine always solves the problem.  The BOINC client then restarts the tasks that were 'spinning their wheels', using the last saved checkpoint.  The tasks always complete successfully without further issue.

Unfortunately, stopping and restarting BOINC without completely resetting the GPU is not enough.  Hence the cold boot.  Next time you see this issue, try stopping and restarting BOINC.  Things might work differently with Windows.  If the task doesn't immediately start making normal progress, you will need to reboot the machine.
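
For anyone wanting to automate the detection described above, here is a minimal Python sketch of the idea, not the actual control scripts. It assumes a Linux host, where `/proc/<pid>/stat` reports the CPU clock ticks a process has used; the function names, the 10-second interval and the 1 tick/s threshold are all assumptions chosen for illustration.

```python
import time


def parse_cpu_ticks(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from a /proc/<pid>/stat line."""
    # The command name (field 2) is in parentheses and may contain spaces,
    # so cut at the last ')' before splitting the remaining fields.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(rest[11]) + int(rest[12])


def ticks_per_second(before: int, after: int, interval_s: float) -> float:
    """CPU ticks consumed per second between two samples."""
    return (after - before) / interval_s


def looks_stalled(rates, threshold: float = 1.0) -> bool:
    """True if there is at least one sample and all are below the threshold."""
    return len(rates) > 0 and all(r < threshold for r in rates)


def sample_rate(pid: int, interval_s: float = 10.0) -> float:
    """Measure the tick rate of a running process (Linux only)."""
    def read() -> int:
        with open(f"/proc/{pid}/stat") as f:
            return parse_cpu_ticks(f.read())

    before = read()
    time.sleep(interval_s)
    return ticks_per_second(before, read(), interval_s)
```

A monitoring loop could call `sample_rate` on the PID of each GPU task a few times in a row and flag the host for a reboot once `looks_stalled` returns true for the collected rates. Requiring several consecutive low samples is a design choice to avoid reacting to a single quiet interval.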

Cheers,
Gary.

Ole Kristian
Joined: 30 Nov 17
Posts: 8
Credit: 40,617,894
RAC: 91,281

Thanks.  I will look into this, especially what the different applications do and what their scientific importance is.

Right now I just let BOINC run and it is getting much better.  For some reason no GPU wu has gone bad in the last day.
