GPU crashes with multiple tasks

cecht
Joined: 7 Mar 18
Posts: 1537
Credit: 2915241970
RAC: 2122288
Topic 215601

Whenever I try to run two simultaneous FGRPB1G tasks on my ATI GPU, the GPU crashes within about 30 sec. of starting the tasks. It then automatically resets but stops processing the GPU tasks, even though the BOINC task timer keeps ticking. While that bit of nothing is going on, the GPU is pegged at 100% usage, GPU memory speed drops to around 300 MHz, and the temperature drops 20 °C below what it is during normal task processing. The GPU works great running single GPU tasks, but I have to abort all the queued "1 CPU + 0.5 AMD/ATI GPU" tasks for things to get back to normal.

My system is https://einsteinathome.org/host/12632329, a 6-core Xeon, 6 GB, Windows 10, with an NVIDIA Quadro 600 (used only for display) and the Radeon RX 460 that I can't get to run more than one task at a time. There is no cc_config.xml, so BOINC correctly loads GPU tasks only on the ATI/AMD Radeon card. I'm using MSI Afterburner only to adjust the cooling fan speed; there is no overclocking. The AMD driver is recent: 24.20.11021.1000 (Adrenalin 18.6.1).

I have app_config.xml set as:
<app_config>
  <app>
      <name>hsgamma_FGRPB1G</name>
      <max_concurrent>1</max_concurrent>
      <gpu_versions>
          <gpu_usage>0.5</gpu_usage>
          <cpu_usage>1</cpu_usage>
      </gpu_versions>
  </app>
</app_config>

I get the same problem when, instead of the above instructions, I use the app_version ngpus flag set to 0.5.

Sooooo, is the problem me, or the card, or something else?

Ideas are not fixed, nor should they be; we live in model-dependent reality.

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

I thought there was an issue with running more than 1 work unit at a time on ATI cards? Maybe I'm wrong, but I could have sworn I heard that somewhere.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117783871940
RAC: 34685181

cecht wrote:
Whenever I try to run two simultaneous FGRPB1G tasks on my ATI GPU, the GPU crashes within about 30 sec. of starting the tasks.

Hi Craig,

Welcome to the forums!

I run lots of Polaris cards (but under Linux) - 460s, 560s, 570s, 580s - and none of them have any problems running concurrent GPU tasks.  A few years back (from memory) the R9 Fury (or something like that) wouldn't run multiple tasks but I don't have any of those so have no experience.  However, I've not seen anyone having issues with multiple tasks on any Polaris series GPUs.

In the app_config.xml file you posted, there is a conflict but I wouldn't have thought it would create a crash - just not run the two tasks you expect.  The <gpu_usage> of 0.5 would allow two tasks to run on the GPU but the <max_concurrent> will restrict it to one anyway.  Try deleting that line completely and see what effect that has.
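As a rough sketch of that suggestion, the posted file with the <max_concurrent> line removed and nothing else changed might look like this (assuming the app name and usage values stay exactly as posted):

```xml
<app_config>
  <app>
      <name>hsgamma_FGRPB1G</name>
      <!-- no max_concurrent line, so gpu_usage alone decides how many tasks share the GPU -->
      <gpu_versions>
          <gpu_usage>0.5</gpu_usage>
          <cpu_usage>1</cpu_usage>
      </gpu_versions>
  </app>
</app_config>
```

With <gpu_usage> at 0.5, BOINC should then schedule up to two of these tasks on the one GPU.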

Also, you never have to abort tasks in the cache to get things working again.  If a crash has occurred, just stop BOINC, edit the app_config.xml and restart BOINC.  In normal circumstances, if you want to test out changes in settings in app_config.xml, just let BOINC continue to run whilst you edit the file to make the change.  When you are ready to have the change applied, open BOINC Manager - advanced view and under the Advanced menu item you'll find a 'read config files' option.  Click that and BOINC will incorporate your changes 'on the run'.  Don't worry if tasks seem to be listed with incorrect CPU + GPU numbers.  That will fix itself in time - it's not an issue.

Are you running any other projects with GPU apps on that GPU?  I have no experience with sharing a GPU between projects but there might be issues if you were doing that.  It might be quite OK 'one at a time' but I could imagine some issues if two different apps were trying to share the one GPU - for example, I imagine the second project might need to know it could only use 0.5 GPUs as well.  As I say, I've no experience with that.

 

Cheers,
Gary.

cecht

Okay, thank you for the feedback. Yes, I realized late last night that the <max_concurrent> I posted above was wrong - I meant for the value to be 2. With the value set to 1, it ran only 1 task on 0.5 of the GPU just fine, but with the value at 2, it ran two concurrent tasks that crashed. I haven't tried it without the line and will give that a go when I get back into the office Monday and post the results.

I'm not running other BOINC or other computing projects on the GPU.  Windows and other apps (browsers, Dropbox) do occasionally toss stuff onto the NVIDIA GPU, but I don't know whether anything is running in the background on the AMD.

Thanks for the tip about not needing to abort tasks and editing app_config on the fly.  I realize now I was too impatient and didn't give BOINC time to set things right. 

Hopefully I will report back with good news.

Cheers,

Craig


Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

I would also try running without MSI Afterburner involved to see if that changes anything.

mikey
Joined: 22 Jan 05
Posts: 12702
Credit: 1839107786
RAC: 3605

cecht wrote:

Okay, thank you for the feedback. Yes, I realized late last night that the <max_concurrent> I posted above was wrong - I meant for the value to be 2. With the value set to 1, it ran only 1 task on 0.5 of the GPU just fine, but with the value at 2, it ran two concurrent tasks that crashed. I haven't tried it without the line and will give that a go when I get back into the office Monday and post the results.

I'm not running other BOINC or other computing projects on the GPU.  Windows and other apps (browsers, Dropbox) do occasionally toss stuff onto the NVIDIA GPU, but I don't know whether anything is running in the background on the AMD.

Thanks for the tip about not needing to abort tasks and editing app_config on the fly.  I realize now I was too impatient and didn't give BOINC time to set things right. 

Hopefully I will report back with good news.

Cheers,

Craig

Are you running any BOINC CPU work units on that PC? Are you leaving any cores free for the GPU to use? If you run two GPU tasks and aren't leaving at least one CPU core free, that could be part of the problem. I leave a CPU core free for each GPU task to start with, then test whether only one free CPU core is enough, with no slowdowns, when I run multiple GPU tasks at once.

cecht

Richie_9 wrote:
I would also try running without MSI Afterburner involved to see if that changes anything.

Good thought. I will try this.


cecht

mikey wrote:
Are you running any BOINC CPU work units on that PC? Are you leaving any cores free for the GPU to use? If you run two GPU tasks and aren't leaving at least one CPU core free, that could be part of the problem. I leave a CPU core free for each GPU task to start with, then test whether only one free CPU core is enough, with no slowdowns, when I run multiple GPU tasks at once.

Yes, I normally run the Continuous Gravitational Wave search O2 All-Sky tasks on four CPUs with one CPU reserved for the Gamma-ray pulsar binary search GPU task and leave one CPU open for non-BOINC work. When I set  app_config to use 0.5 GPU, BOINC ran two GPU tasks with 1 CPU each (briefly, before the GPU froze up), which put one of the four O2 CPU tasks on hold (again, briefly). 

I will set my account project preferences to not load or run CPU-only tasks and see if dual GPU tasks can work with free run of the CPUs. Although, the GPU tasks, when run singly, have ~1,300 sec run time, but only ~200 s CPU time, so I had always assumed that the CPUs had plenty of head room for the GPU tasks.

I'm going with web-based preferences because I don't understand how to instruct app_config or cc_config to not load or not run CPU-only tasks. Um, any thoughts on how to go about that?

Zalster

<app_config>
  <app>
      <name>hsgamma_FGRPB1G</name>
      <gpu_versions>
          <gpu_usage>1</gpu_usage>
          <cpu_usage>1</cpu_usage>
      </gpu_versions>
  </app>
  <app>
      <name>einstein_O2AS20-500</name>
      <max_concurrent>4</max_concurrent>
  </app>
  <project_max_concurrent>5</project_max_concurrent>
</app_config>

This will run 4 CPU work units and 1 GPU work unit. If you want to run more than 1 GPU work unit, change <gpu_usage>1</gpu_usage> to 0.5 and reduce the number of CPU work units from 4 to 3 in <max_concurrent>4</max_concurrent>.

The <project_max_concurrent> setting limits the total number of work units of any kind running at once to 5.
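For illustration only, the two-GPU-task variant described above would look something like this, assuming the same two app names and leaving the project-wide limit at 5 (2 GPU + 3 CPU work units):

```xml
<app_config>
  <app>
      <name>hsgamma_FGRPB1G</name>
      <gpu_versions>
          <!-- 0.5 GPU per task lets two tasks share one GPU -->
          <gpu_usage>0.5</gpu_usage>
          <cpu_usage>1</cpu_usage>
      </gpu_versions>
  </app>
  <app>
      <name>einstein_O2AS20-500</name>
      <!-- reduced from 4 to 3 to free a core for the second GPU task -->
      <max_concurrent>3</max_concurrent>
  </app>
  <project_max_concurrent>5</project_max_concurrent>
</app_config>
```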

Gary Roberts

cecht wrote:
I will set my account project preferences to not load or run CPU-only tasks and see if dual GPU tasks can work with free run of the CPUs. Although, the GPU tasks, when run singly, have ~1,300 sec run time, but only ~200 s CPU time, so I had always assumed that the CPUs had plenty of head room for the GPU tasks.

Your assumptions are correct for AMD GPUs (which you are using on the host in question) but not for NVIDIA, where the full 1 CPU core per GPU task instance really does seem to be needed. With the proviso that I'm talking about GPU tasks under Linux, I have a Pentium dual-core host (with HT, so 4 virtual cores/threads) running 4 GPU tasks on 2 RX 560s in addition to 2 CPU tasks - so, effectively just one 'real' core supporting 4 GPU tasks.

I originally started it as an experiment (many months ago) and had forgotten about it until I had a look just now to see how many CPU tasks were running.  I had expected to see a time penalty for GPU tasks but the current results are pretty much in line with other 560s running singly with plenty of available CPU support.  I think your issue is something other than the number of CPU tasks, but it would be a useful data point to see if 2 GPU tasks will run with no CPU tasks running at the time.

Quote:
I'm going with web-based preferences because I don't understand how to instruct app_config or cc_config to not load or not run CPU-only tasks. Um, any thoughts on how to go about that?

You have to be a bit careful with the (essentially) three ways of changing things.  If you have ever used local preferences, your website preferences for compute stuff will be ignored.  If you want to revert to website prefs you have to open the local prefs window in BOINC Manager and at the top you will see a warning and a button to click to remove the local prefs and go back to website prefs.

The third way of setting some prefs is through config files.  Some stuff in config files gets incorporated into the state file (client_state.xml) and this will still override the equivalent website pref.  Deleting the config file doesn't help because that doesn't remove what has been incorporated into the state file.  Editing the config file and 're-reading its contents' through BOINC manager will fix things but still not allow you to change further through website prefs (for those particular settings) - it's complicated :-).

Ultimately if you want to permanently get back to website prefs only, you need to 'reset the project' in BOINC Manager but I hate that option because it throws away everything, including all tasks on board and downloads everything afresh - apps, data, the works.  If you know what you are doing, it's possible to remove the offending bits from the state file without doing a full project reset.  That's what I tend to do but I'm not recommending it at all unless you really do fully understand the structure and content of the state file.  If you damage that, you may well be starting from scratch.

For most people, website prefs are fine and easy to use.  There are two main reasons to prefer local prefs.  If you have a lot of computers you need to configure differently, you will quickly run out of available 'locations' where you can set up different preference sets.  The limit is 4 - default, home, work and school are their names.  The second reason is the fiddling around and waiting if you are experimenting with preference changes whilst fixing problems or optimising things.  Local prefs make this a lot easier.

For your situation, you have already used app_config.xml so your state file has already been modified as described above. If I were you, I'd stay with local prefs until you have your problem sorted. The easiest way for you to experiment with variable numbers of CPU tasks is to change locally the number of cores (a % setting) that BOINC is allowed to use. You have 6 cores so the % values to use for 0 to 6 tasks respectively would be 0%, 17%, 34%, 50%, 67%, 84%, 100%. You can take the GPU tasks right out of the mix altogether by changing (in app_config.xml) the <cpu_usage> to something like 0.4. That way, if you get 2 GPU tasks running, no additional cores will be 'reserved' by BOINC since 2x0.4=0.8 is still less than a full core.
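A minimal sketch of an app_config.xml along those lines, assuming the hsgamma_FGRPB1G app name from earlier in the thread and 0.5 GPU per task (the 0.4 is just the example figure mentioned above, not a tuned value):

```xml
<app_config>
  <app>
      <name>hsgamma_FGRPB1G</name>
      <gpu_versions>
          <gpu_usage>0.5</gpu_usage>
          <!-- two tasks budget 2 x 0.4 = 0.8 CPU, still less than one full core -->
          <cpu_usage>0.4</cpu_usage>
      </gpu_versions>
  </app>
</app_config>
```

The number of CPU tasks is then controlled purely by the % of cores setting, and the file is picked up by re-reading the config files in BOINC Manager.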

The above scheme allows you to try 2 GPU tasks and no CPU tasks temporarily just by continuing to use the modified app_config.xml (remember to re-read config files whenever you make a change) along with a 0% setting for cores BOINC is allowed to use.  The real benefit is that you can change things very quickly and see immediately what happens.

You could achieve the same result by suspending (in BOINC Manager) all CPU tasks on board and if the GPU would crunch 2 GPU tasks in that configuration you could see what happened if you 'released' a single CPU task, and then another, etc.  However, I suspect it's not the number of CPU tasks so I expect it won't work, even with no CPU tasks running.  As I said, you should do the experiment, though :-).

Please ask if anything is not clear.

 

Cheers,
Gary.
