Help with Configuring multi core & multi GPU

Dougga
Joined: 27 Nov 06
Posts: 27
Credit: 24844941
RAC: 0
Topic 207529

I've read loads of posts on this topic but haven't been able to figure out how to configure Einstein for peak performance.

 

I'm running a 4-core Intel chip with an NVIDIA GTX 690 GPU, which is essentially two 680s in an SLI setup within the dual-processor card.

 

 

Here is my app_config.xml:

<app_config>
<app>
        <name>hsgamma_FGRPB1</name>
        <gpu_versions>
                <gpu_usage>1</gpu_usage>
                <cpu_usage>.33</cpu_usage>
        </gpu_versions>
</app>
<app>
        <name>hsgamma_FGRPB1G</name>
        <gpu_versions>
                <gpu_usage>1</gpu_usage>
                <cpu_usage>.33</cpu_usage>
        </gpu_versions>
</app>
<app>
        <name>einstein_O1Spot1THi</name>
        <gpu_versions>
                <gpu_usage>.5</gpu_usage>
                <cpu_usage>.33</cpu_usage>
        </gpu_versions>
</app>
</app_config>

I've tried pretty much every combination of CPU and GPU usage options and I'm still getting 4 CPUs maxed out on separate work units with both GPUs working on their own.

 

top output:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
11685 boinc     30  10 28.468g  78720  31196 R 100.2  0.5   0:15.23 hsgamma_FGRPB1G
11676 boinc     30  10 28.468g  78692  31172 R  99.6  0.5   0:15.26 hsgamma_FGRPB1G
11654 boinc     39  19  523940 514472   4996 R  50.2  3.1   0:09.53 hsgamma_FGRPB1_
11656 boinc     39  19  523944 514756   5288 R  48.0  3.1   0:09.57 hsgamma_FGRPB1_
11650 boinc     39  19  523940 514620   5048 R  47.8  3.1   0:09.65 hsgamma_FGRPB1_
11652 boinc     39  19  523944 514468   4996 R  45.7  3.1   0:09.86 hsgamma_FGRPB1_
 1402 root      20   0  814720 148340 125768 S   6.7  0.9 138:38.52 Xorg
11527 doug      20   0 1076300  60832  38136 S   0.4  0.4   0:00.68 terminator

 

My understanding is that this is not optimal, as the GPUs need to use the CPUs to function optimally.

Can someone help me configure this?

 

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

OK, it took me a couple of reads through this to figure out what is going on.

First thing, hsgamma is going to use a full core no matter what you put into your app_config. That is just the way it is.

You can say 0.33 all you want; it is still going to use a full core. So there are 2 cores of the 4 being used.

Next, einstein_01SpotTHi isn't a GPU work unit, it's a CPU work unit. So your app_config is wrong in labeling it. 

Since it is a CPU work unit, any free core not occupied by the hsgamma is going to be snagged up and used to crunch.

In order of priority, GPU will crunch first, CPU will crunch second. If you run out of GPU work, then all 4 cores will be used by the einstein_01 app.  If you do have GPU work, then einstein_01 will surrender only those CPU cores that are required by the GPU work units. In this case, that is 2 full cores for 2 GPU chips. In the past, when we were not using a full core (i.e. when the other apps used 0.33), a CPU core could be shared by both the CPU and GPU work units, but that is not possible with the current apps we have.

So now the question is: how do we free up your cores? Which is the priority for you, the CPU or the GPU work units? If you want both, then you are going to have to restrict the number of CPU work units running to 1 to leave 1 free core.  I've rewritten your app_config to reflect everything I have spoken about. You can try it and see if it does restrict total work to only 3 cores.

<app_config>
<app>
        <name>hsgamma_FGRPB1G</name>
        <gpu_versions>
                <gpu_usage>1</gpu_usage>
                <cpu_usage>1</cpu_usage>
        </gpu_versions>
</app>
<app>
        <name>einstein_O1Spot1THi</name>
        <max_concurrent>1</max_concurrent>
</app>
</app_config>

 

Let me know how it goes.

 

Zalster

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109406527927
RAC: 35373471

doug_20 wrote:
Can someone help me configure this?

Your best bet (for starters) would be to completely delete your app_config.xml.  If you did that you should then see 2 GPU tasks running (one on each GPU instance) and 2 CPU tasks running.  The default (which really does seem to be needed for NVIDIA GPUs) is to require a full CPU core to support each GPU task.  Because you seem to be setting 0.33 CPUs instead of 1.0 CPUs, you are not reserving a full CPU core for support and so all 4 are still available to crunch CPU tasks - which is why you see 4 active CPU tasks.

Please be aware that the app_config.xml feature is to control a particular application.  At the moment, there is only one application that might need adjustment of the allocated resources and that is the FGRPB1G GPU application.  FGRPB1 is CPU only and doesn't need any fractional adjustments and O1Spot doesn't have a GPU version.  So if you want to use the app_config.xml mechanism for GPU control, you only need to worry about the FGRPB1G application.

I know nothing about SLI other than to say it's not needed for crunching purposes.  I don't know if having it enabled interferes with crunching in any way.  BOINC sees your GPU as two instances and it gives the RAM as 2047MB.  If that is 4GB total (and not 2GB total) you could consider running 2 GPU tasks per instance (4 total) but that would tie up all 4 CPU cores (no CPU tasks crunching) and the GPU output would probably increase somewhat.

If you want to support the gravity wave search, leaving everything at default might well be the best for you.  If you want to maximise GPU output by running 2 GPU tasks per instance, you won't be able to run any CPU tasks without hurting GPU performance.  You would use app_config.xml (just for FGRPB1G) to set gpu_usage to 0.5.  You would leave cpu_usage at 1.  Four CPU cores would be reserved for GPU support and no CPU tasks would run (initially).  Eventually, any CPU tasks in your work cache would be at risk of deadline miss and BOINC would go into panic mode to get them crunched.  This would interfere with GPU performance.  To avoid this repeatedly happening, the best action would be to set your preferences to restrict BOINC to 0% of CPU cores and abort any CPU tasks you had on board.  BOINC would not fetch any further CPU tasks and your GPUs could crunch without further interference.

I'm not recommending any particular course of action.  I'm just trying to explain what might happen with certain choices.  I have no knowledge about the efficacy of running two tasks per GPU instance on your GTX690.  I don't know for sure if you could run 4 GPU tasks in total with just 3 CPU cores for support (or even just 2).  You could arrange for all these alternative combinations using app_config.xml.  I suspect that these non-default combinations might hurt performance.  The only way to find out is from others with the same hardware or to do the experiments yourself.

 

Cheers,
Gary.
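
The resource arithmetic described in the replies above can be sketched in a few lines of Python. This is only a simplified model of what the thread describes, not BOINC's actual scheduler: the function name `plan` and the rule that fractional CPU reservations only remove a core once they add up to a whole one are assumptions made for illustration.

```python
import math

def plan(cpu_cores, gpus, gpu_usage, cpu_usage):
    """Rough model of the settings discussed in this thread (an assumption,
    not BOINC's real scheduling logic).

    Returns (gpu_tasks, cores_reserved, cores_left_for_cpu_tasks).
    """
    # gpu_usage 1 -> one task per GPU, 0.5 -> two tasks per GPU, and so on
    tasks_per_gpu = math.floor(1.0 / gpu_usage + 1e-9)
    gpu_tasks = gpus * tasks_per_gpu
    # each running GPU task reserves cpu_usage of a core
    cores_reserved = gpu_tasks * cpu_usage
    # assumption: only whole reserved cores are taken away from CPU tasks
    cores_left = max(0, cpu_cores - math.floor(cores_reserved + 1e-9))
    return gpu_tasks, cores_reserved, cores_left

# 4 cores, 2 GPU instances, as in the original question
print(plan(4, 2, 1.0, 1.0))   # defaults described by Gary
print(plan(4, 2, 0.5, 1.0))   # two tasks per GPU instance
print(plan(4, 2, 1.0, 0.33))  # the 0.33 setting from the first post
```

Under this model the defaults give 2 GPU tasks with 2 cores left over, gpu_usage 0.5 gives 4 GPU tasks with no cores left, and cpu_usage 0.33 leaves all 4 cores available to CPU tasks, which matches what was reported.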

Dougga
Joined: 27 Nov 06
Posts: 27
Credit: 24844941
RAC: 0

Hi and thanks for your thoughtful responses.

 

I do have 4GB of memory between the two GPUs, but running them both creates loads of computational errors and lost WUs for some reason.  When running, I do see 4 GPU WUs active and no CPUs active as expected, but the results are a mess.

 

Implementing Zalster's app_config got me 2 GPUs and 2 CPUs for a moment or two, but it reverted to only one GPU and 4 CPUs, which obviously is not optimal. There is no CPU assisting the one GPU that's active, and one GPU is idle.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                           
 3171 boinc     30  10 28.468g 144188  31132 R  98.6  0.9   6:29.33 hsgamma_FGRPB1G                   
 3167 boinc     39  19  523500 514200   5008 R  81.4  3.1   3:17.54 hsgamma_FGRPB1_                   
 3165 boinc     39  19  523944 514644   5068 R  78.0  3.1   4:35.19 hsgamma_FGRPB1_                   
 3163 boinc     39  19  523940 514624   5048 R  63.3  3.1   4:37.35 hsgamma_FGRPB1_                   
 3268 boinc     39  19  775588 773864   5524 R  63.0  4.7   3:12.75 einstein_O1Spot 

Renaming the app_config.xml to app_config.old and restarting seems to change nothing.

Still flummoxed.

 

Any further thoughts?

DVDL
Joined: 5 May 17
Posts: 12
Credit: 17139825
RAC: 0

Maybe BOINC doesn't understand how to use 4 units on a card with 2 GPUs.

Have you tried turning off SLI in the driver? I thought that should be possible. Maybe it works?

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

doug_20 wrote:

Hi and thanks for your thoughtful responses.

 

I do have 4GB of memory between the two GPUs, but running them both creates loads of computational errors and lost WUs for some reason.  When running, I do see 4 GPU WUs active and no CPUs active as expected, but the results are a mess.

 

Implementing Zalster's app_config got me 2 GPUs and 2 CPUs for a moment or two, but it reverted to only one GPU and 4 CPUs, which obviously is not optimal. There is no CPU assisting the one GPU that's active, and one GPU is idle.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                           
 3171 boinc     30  10 28.468g 144188  31132 R  98.6  0.9   6:29.33 hsgamma_FGRPB1G                   
 3167 boinc     39  19  523500 514200   5008 R  81.4  3.1   3:17.54 hsgamma_FGRPB1_                   
 3165 boinc     39  19  523944 514644   5068 R  78.0  3.1   4:35.19 hsgamma_FGRPB1_                   
 3163 boinc     39  19  523940 514624   5048 R  63.3  3.1   4:37.35 hsgamma_FGRPB1_                   
 3268 boinc     39  19  775588 773864   5524 R  63.0  4.7   3:12.75 einstein_O1Spot 

Renaming the app_config.xml to app_config.old and restarting seems to change nothing.

Still flummoxed.

 

Any further thoughts?

 

Can you post your new app_config here so I can look at it?

Until then, try this:

<app_config>
<app>
        <name>hsgamma_FGRPB1G</name>
        <gpu_versions>
                <gpu_usage>1</gpu_usage>
                <cpu_usage>1</cpu_usage>
        </gpu_versions>
</app>
<app>
        <name>einstein_01SpotTHi</name>
        <max_concurrent>1</max_concurrent>
</app>
<project_max_concurrent>3</project_max_concurrent>
</app_config>

 

This should allow for only 2 GPU work units and 1 CPU work unit. See if that does the trick
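
Before restarting BOINC with an edited file, it can help to confirm that the XML is well-formed and to see which limits it actually sets. The following is a minimal Python sketch using only the standard library; the function name `summarize` and the embedded sample text are illustrative assumptions, and the script only reads the tags discussed in this thread.

```python
import xml.etree.ElementTree as ET

# Sample text using the tags discussed in this thread (illustrative only)
SAMPLE = """<app_config>
<app>
        <name>hsgamma_FGRPB1G</name>
        <gpu_versions>
                <gpu_usage>1</gpu_usage>
                <cpu_usage>1</cpu_usage>
        </gpu_versions>
</app>
<app>
        <name>einstein_O1Spot1THi</name>
        <max_concurrent>1</max_concurrent>
</app>
<project_max_concurrent>3</project_max_concurrent>
</app_config>"""

def summarize(xml_text):
    """Return (per-app settings, project_max_concurrent or None).

    ET.fromstring raises ET.ParseError if the text is not well-formed,
    for example when an opening '<' is missing.
    """
    root = ET.fromstring(xml_text)
    apps = {}
    for app in root.findall("app"):
        settings = {}
        limit = app.findtext("max_concurrent")
        if limit is not None:
            settings["max_concurrent"] = int(limit)
        versions = app.find("gpu_versions")
        if versions is not None:
            for tag in ("gpu_usage", "cpu_usage"):
                value = versions.findtext(tag)
                if value is not None:
                    settings[tag] = float(value)
        apps[app.findtext("name")] = settings
    project_limit = root.findtext("project_max_concurrent")
    return apps, (int(project_limit) if project_limit is not None else None)

apps, project_limit = summarize(SAMPLE)
for name, settings in apps.items():
    print(name, settings)
print("project_max_concurrent:", project_limit)
```

To check a real file, the text could be read from disk and passed to `summarize`; the location of app_config.xml depends on the installation, so no path is assumed here.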

 

 

Edit: Is the above, in what you posted, what you see in the BOINC monitor? Looks like 4 GPU tasks and 1 CPU task.

Dougga
Joined: 27 Nov 06
Posts: 27
Credit: 24844941
RAC: 0

Hi Zalster,

So the app_config.xml that I was using was the one you posted.

Your latest is the same with the addition of the one line:

<project_max_concurrent>3</project_max_concurrent>

Here is the whole file:

<app_config>
<app>
        <name>hsgamma_FGRPB1G</name>
        <gpu_versions>
                <gpu_usage>1</gpu_usage>
                <cpu_usage>1</cpu_usage>
        </gpu_versions>
</app>
<app>
        <name>einstein_O1Spot1THi</name>
        <max_concurrent>1</max_concurrent>
</app>
<project_max_concurrent>3</project_max_concurrent>
</app_config>

 

You are right. I now have 2 GPU WUs and one CPU WU which is what we were looking for.

If this is optimal, I should see a rise in User & Host Average Work.

Here is the top output now:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 3789 boinc     39  19 1213992 487628   5524 R 100.2  3.0   0:35.07 einstein_O1Spot
 3813 boinc     30  10 28.468g  78656  31132 R 100.2  0.5   0:33.26 hsgamma_FGRPB1G
 3822 boinc     30  10 28.468g  78540  31016 R 100.0  0.5   0:33.10 hsgamma_FGRPB1G
 1428 root      20   0  810064 142812 125176 R   5.5  0.9  36:08.30 Xorg

As to your other question, yes, there were 4 CPU WUs and 1 GPU WU active.  Now there are 2 GPU and 1 CPU.

 

We may have succeeded.

Thanks!

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

Glad to hear it worked.  Yes I saved that last bit just in case. 

The previous app_config should have worked but I've noticed lately that Einstein is starting to ignore certain commands.  Not sure why since it works fine with other projects.  

I know the powers that be don't like us using them, but if I don't, Einstein seizes all of my cores for their work units instead of what I decide they can use.  Only so much a CPU cooler can do when all 20 cores are screaming at 100%... Kind of an expensive science experiment if it failed due to overheating. That's why I prefer to restrict it to a much lower amount.

I'll keep looking into why those commands failed, and you are welcome.

 

Zalster

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

Ok Dougga, 

 

I think I figured out what is going on with the app_config and the GW app.

I manually typed their name into the app_config and thought that was enough. Evidently there is something in its name that I can't see or that is slightly off. What I found works is to force BOINC to reread the config files. In the event log it throws an error saying that type of work unit is not found, and it lists all the current work units. I highlighted that line in the BOINC log, copied it to a TextEdit file and then removed all the other work types listed until I was left with 'einstein_O1Spot1THi'. I highlighted the name between the ' ', pasted it into the app_config and saved it. I then forced BOINC to reread the config. Once that was done, I noticed the error message at startup was gone.  I can't see any difference in the names, but evidently there is something hidden or just slightly off, enough that BOINC won't recognize it if you type it by hand.  I tried it on 2 different computers that were doing the same thing you described and it corrected the issue without my having to add the <project_max_concurrent> line.

For now, you can leave that project max concurrent line in but thought I would give you an update on what I found.
Zalster
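
The copy-and-paste step described above can also be done with a short script instead of trimming the line by hand. This Python sketch assumes only what the post says, namely that the event log line lists application names between single quotes; the sample line below is hypothetical and the real BOINC wording may differ. The helper names `quoted_names` and `odd_characters` are made up for illustration.

```python
import re

def quoted_names(log_line):
    """Return every piece of text found between single quotes."""
    return re.findall(r"'([^']+)'", log_line)

def odd_characters(name):
    """List (position, code point) for anything that is not an ASCII
    letter, digit or underscore, e.g. an invisible pasted character."""
    return [
        (i, hex(ord(c)))
        for i, c in enumerate(name)
        if not (c.isascii() and (c.isalnum() or c == "_"))
    ]

# Hypothetical log line, loosely modelled on the description in the post
SAMPLE_LINE = (
    "unknown application 'einstein_01SpotTHi'. "
    "Known applications: 'hsgamma_FGRPB1G', 'einstein_O1Spot1THi'"
)

for name in quoted_names(SAMPLE_LINE):
    print(repr(name), odd_characters(name))
```

Printing each name with `repr` and listing any unexpected code points makes a hidden character visible if there is one; an empty list means the name contains only plain letters, digits and underscores.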

Holmis
Joined: 4 Jan 05
Posts: 1118
Credit: 1055935564
RAC: 0

Zalster:

When you typed the name did you use an O or a 0 (zero) after the _?

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

I used a 0 (zero), but I'm guessing they are using an O, which is probably why it's not accepting it.

So like I said, I've just highlighted and copied it from the event log, then pasted it into the app_config.

Kind of weird if it turns out they are using a capital O instead of a 0 (zero) to designate a work unit.
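
A mistyped name like this is easy to catch by comparing what was typed against the names BOINC reports. The following Python sketch uses the standard library's `difflib` to suggest the closest known name; the `KNOWN` list is simply the application names that appear in this thread, and `check_name` is a hypothetical helper, not part of BOINC.

```python
import difflib

# Application names as they appear in this thread; in practice these
# would be copied from the BOINC event log.
KNOWN = ["hsgamma_FGRPB1", "hsgamma_FGRPB1G", "einstein_O1Spot1THi"]

def check_name(name, known):
    """Return (is_exact_match, closest_known_name_or_None)."""
    if name in known:
        return True, None
    matches = difflib.get_close_matches(name, known, n=1)
    return False, (matches[0] if matches else None)

for typed in ("hsgamma_FGRPB1G", "einstein_01SpotTHi"):
    ok, suggestion = check_name(typed, KNOWN)
    if ok:
        print(typed, "-> matches a known application")
    else:
        print(typed, "-> not found, closest known name:", suggestion)
```

With the names above, the hand-typed `einstein_01SpotTHi` is reported as not found and `einstein_O1Spot1THi` is suggested instead.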
