Optimising GPU-usage

Alex
Joined: 1 Mar 05
Posts: 451
Credit: 500648366
RAC: 37047


Hi Mike, Hi Gary,

For about one week now we have been able to test the settings for running more than one CUDA app at a time. Speaking for myself, I can say that not a single WU has failed to validate.

Now I kindly ask you to forward two issues to the developers:
- please remove the reset of the settings in the client_info.xml. SETI and Milkyway can live without resetting it, so it should be possible for Einstein also.
- The settings about CPU usage: does it really make sense to run ABP CPU apps when I can run them as a CUDA app? I tried the 'no CPU' setting; on my main system that resulted in getting no more Einstein WUs, not even for GC. The setting should be separated for the two applications.

Thank you in advance,

Alexander

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110031966741
RAC: 22405216


Message 99232 in response to message 99231

Hi Alex,

Quote:
Speaking for myself, I can say that not a single WU has failed to validate.


That's very impressive but it really doesn't tell us the things we need to know.

At various points, various ideas were mentioned but I was expecting that you would follow up with a detailed message showing exactly the techniques you have followed. You need to document everything very thoroughly and you can't assume that your readers (myself included) will get the correct picture from just a sketchy outline of what you have done. For instance, are you just editing the state file (client_state.xml)? Exactly what are you editing? How and when do these edits get 'undone'? Can you prevent that by downloading a whole bunch of work and then suspending further work requests until most of the tasks on board have been completed? Does your technique involve the use of the anonymous platform mechanism (AP) at all? If so, exactly what app_info.xml did you use?

In addition to fully documenting your techniques, you need to make an objective assessment of what you actually gain by doing this. What improvement in throughput do you actually achieve? How many extra tasks can you complete in a certain total elapsed time? Some hard numbers please.

Quote:
- please remove the reset of the settings in the client_info.xml. SETI and Milkyway can live without resetting it, so it should be possible for Einstein also.


There is no file called 'client_info.xml'. There is a 'client_state.xml' and an 'app_info.xml'. I assume you must mean the state file because the contents of that do get modified all the time when there are exchanges with the server. As far as I'm aware, the contents of app_info.xml are never modified by anything received from the server.

I'm no expert about this since my knowledge is based on what I have read and what I have personally experienced. AP is a function of BOINC and not of any specific project. It is documented to some extent on the BOINC website. There are people around who would be much more knowledgeable about it because they read the source code for light entertainment. I'm certainly not one of those :-). You may be able to get help from them on the BOINC website or the BOINC mailing lists.

I'm only mentioning this because I believe that your statement that "Seti and Milkyway can live without resetting ..." probably follows from the use of AP with the tasks for those projects. I'm reluctant to make statements where I clearly don't know the full details, so the following should be taken with a grain of salt.

When you use AP, whatever 'details' you insert into your customised app_info.xml will get entered by BOINC into your state file and will override the same 'details' that would otherwise come from a server exchange. I'm assuming that you should be able to craft your own app_info.xml so as to set up the CPU and GPU 'counts' that you wish to use without them being reset by the server. As I say, I don't know for sure, and I have no way of experimenting as I don't own the requisite hardware.

Quote:
- The settings about CPU usage: does it really make sense to run ABP CPU apps when I can run them as a CUDA app?


When I first read this I didn't understand your point. I thought you were saying that the project should do away with using the CPU altogether. I'm now assuming that what you are really asking for is for the ability to do ABP tasks with the CUDA app only (thereby using both CPU and GPU) and to have a mechanism where you avoid ABP tasks that are destined for the CPU only. Once again, this is something that you can do without making changes to the server. The answer is to use AP.

Quote:
I tried the 'no CPU' setting; on my main system that resulted in getting no more Einstein WUs, not even for GC. The setting should be separated for the two applications.


I don't know if it can be done this way. I think it might require substantial enhancements to BOINC itself. You could do it through AP by not providing the ABP CPU app in app_info.xml. For maximising efficient use of your CPUs, I imagine you would need to convince BOINC that somewhat less than 1.00 CPUs are required in total for all the GPU tasks. I'm assuming that you could set up an app_info.xml that would allow GC1 tasks on the CPUs together with ABP tasks that required the CUDA app and fractional CPU and GPU values, perhaps something like 0.46 CPUs and 0.22 GPUs. This would allow you to run 1 SETI and 2 ABP CUDA tasks simultaneously on 1 CPU + 1 GPU, since 0.46+0.46+0.04<1.00 and 0.22+0.22+0.55<1.00 as well. I would guess that on a quad-core machine, this might allow 3 GC1 tasks, 2 ABP CUDA tasks and 1 SETI task to run simultaneously. Since the ABP CUDA tasks would require more than 0.46 of a CPU, they would each steal cycles from a GC1 task when they needed to. I have no idea if this would actually work but I'd think it would be worth a try.
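For what it's worth, the arithmetic for that illustrative mix can be sketched like this (the 0.46/0.22 and 0.04/0.55 counts are my guesses for illustration, not measured requirements):

```python
# Hypothetical fractional resource counts per task (illustrative
# values only, not measured requirements).
abp_cpu, abp_gpu = 0.46, 0.22    # one ABP2 CUDA task
seti_cpu, seti_gpu = 0.04, 0.55  # one SETI CUDA task (assumed)

# One SETI task plus two ABP2 CUDA tasks sharing 1 CPU core + 1 GPU:
cpu_used = 2 * abp_cpu + seti_cpu
gpu_used = 2 * abp_gpu + seti_gpu

# BOINC would schedule all three concurrently only while both
# sums stay below 1.00.
print(round(cpu_used, 2), round(gpu_used, 2))  # 0.96 0.99
```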

Cheers,
Gary.

Alex
Joined: 1 Mar 05
Posts: 451
Credit: 500648366
RAC: 37047


Message 99233 in response to message 99232

Hi Gary,

Quote:

At various points, various ideas were mentioned but I was expecting that you would follow up with a detailed message showing exactly the techniques you have followed. You need to document everything very thoroughly and you can't assume that your readers (myself included) will get the correct picture from just a sketchy outline of what you have done. For instance, are you just editing the state file (client_state.xml)? Exactly what are you editing? How and when do these edits get 'undone'? .... Does your technique involve the use of the anonymous platform mechanism (AP) at all? If so, exactly what app_info.xml did you use?

I'm sorry Gary, you are making this too complicated, much, much too complicated.

This is the only change that must be done. Nothing else: no app_info.xml, no anonymous platform, nothing. I hope this can be accepted as fully documented.
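To make it explicit (the value 0.5 below is only an example of a reduced count): the change is the <count> value inside the <coproc> block of the CUDA <app_version> entry in client_state.xml, roughly like this:

```xml
<app_version>
    <app_name>einsteinbinary_ABP2</app_name>
    <plan_class>ABP2cuda23</plan_class>
    <coproc>
        <type>CUDA</type>
        <count>0.500000</count>  <!-- was 1.000000; 0.5 allows two tasks per GPU -->
    </coproc>
</app_version>
```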

This count is reset to 1.0000 every time the server sends new WUs to the client.
I have changed these numbers for SETI and MW as well; both projects can live with that and make no changes to them.

tolapho posted on 27 Aug:
Hi,
I changed the client_state.xml and set the values to 0.22.

He understood all the postings; I really have no idea why you call it 'sketchy'.

Quote:
Can you prevent that by downloading a whole bunch of work and then suspending further work requests until most of the tasks on board have been completed?


This is exactly what I do. I can upload completed tasks at any time without running into trouble.

Quote:
There is no file called 'client_info.xml'.

Yes, you're right. I wrote that at midnight after a 16-hour workday; my mistake. It's only client_state.xml that is affected.

Quote:
"Seti and Milkyway can live without resetting ..." probably follows from the use of AP with the tasks for those projects.

I checked that. My MW folder does not contain an app_info.xml. SETI has an app_info.xml; for some reason it contains my setting for the <count> value, but believe me, I did not change that. I only changed the settings in client_state.xml. But I will keep an eye on that when SETI is online again.

Setting of CPU use:
There is a thread 'Selection of Arecibo tasks possible?' in the wish list.
It links to a long discussion about 'must-run' tasks and 'optional' tasks. I have seen that I got 3.08-type CPU-only tasks and 3.11-type ABP2cuda23 tasks. In my mind it makes no sense to run that app on a CPU only when I can run it as a CPU/GPU app. This brought up the idea of disabling 'CPU only' for the Arecibo Binary Pulsar Search (STSP) apps. But a setting of 'no CPU' resulted in getting no more 3.04 (S5GCESSE2) WUs, which also affects my main system (equipped with ATI, not nVidia). Both systems stopped crunching Global Correlations S5 Engineering, which is, by definition, against the interests of Einstein (as far as I understand it).

Quote:
The answer is to use AP.

Yes, of course. As you stated,

Quote:
I'm no expert about this since my knowledge is based on what I have read and what I have personally experienced.


This is true for me too.
I searched for weeks to find an app_info for this project. I found a lot of posts requesting such a file, and two posts with incomplete app_info.xml files, which means only the CUDA part is implemented; and again, this is against the interests of Einstein.

Quote:
There are people around who would be much more knowledgeable about it because they read the source code for light entertainment.

That's true. And this is why I'm posting it here, where people with an interest in Einstein's work may read it. So if there is someone out there who can create an app_info.xml, PLEASE HELP!
Otherwise, please let me repeat my statement from 27 Aug:
A first and VERY helpful step would be not to change the settings in the client_state.xml.

I hope this has made things a bit clearer. Please excuse my bumpy English; it's not my native language, and I could not bring all of this to the point as well as would be necessary.

Kind regards,

Alexander

P.S.
Do not forget to set your Windows to 'no automatic updates' (or whatever it is called in the English version). Today Microsoft decided to update my display driver, which destroyed all my CUDA WUs. SETI, MW and Einstein were all affected.

(retired account)
Joined: 28 Mar 05
Posts: 6
Credit: 91492
RAC: 0


Message 99234 in response to message 99233

Quote:
So if there is someone out there who can create an app_info.xml , PLEASE HELP !

Well, actually it is pretty easy. The structure of the app_info.xml is documented here: http://boinc.berkeley.edu/wiki/Anonymous_platform and everything else you need you'll find in the existing client_state.xml in your BOINC data directory (which has the same structure, only some additional information).

It might take a bit longer to create a file for Einstein@Home than for other projects, because more applications and files are involved here, but it is not more difficult.

For a start, this is the part you'll need for ABP2 CUDA:

  <app>
      <name>einsteinbinary_ABP2</name>
      <user_friendly_name>Arecibo Binary Pulsar Search (STSP)</user_friendly_name>
  </app>
  <file_info>
      <name>einsteinbinary_ABP2_3.11_windows_intelx86__ABP2cuda23.exe</name>
      <executable/>
  </file_info>
  <file_info>
      <name>cudart32_23.dll</name>
      <executable/>
  </file_info>
  <file_info>
      <name>cufft32_23.dll</name>
      <executable/>
  </file_info>
  <file_info>
      <name>einsteinbinary_ABP2_3.03_graphics_windows_intelx86.exe</name>
      <executable/>
  </file_info>
  <app_version>
      <app_name>einsteinbinary_ABP2</app_name>
      <version_num>311</version_num>
      <platform>windows_intelx86</platform>
      <avg_ncpus>1</avg_ncpus>
      <max_ncpus>1</max_ncpus>
      <plan_class>ABP2cuda23</plan_class>
      <file_ref>
          <file_name>einsteinbinary_ABP2_3.11_windows_intelx86__ABP2cuda23.exe</file_name>
          <main_program/>
      </file_ref>
      <file_ref>
          <file_name>cudart32_23.dll</file_name>
          <open_name>cudart.dll</open_name>
      </file_ref>
      <file_ref>
          <file_name>cufft32_23.dll</file_name>
          <open_name>cufft.dll</open_name>
      </file_ref>
      <file_ref>
          <file_name>einsteinbinary_ABP2_3.03_graphics_windows_intelx86.exe</file_name>
          <open_name>graphics_app</open_name>
      </file_ref>
      <coproc>
          <type>CUDA</type>
          <count>1.00</count>
      </coproc>
  </app_version>



Regards

(retired account)
Joined: 28 Mar 05
Posts: 6
Credit: 91492
RAC: 0


Message 99235 in response to message 99232

Quote:
What improvement in throughput do you actually achieve? How many extra tasks can you complete in a certain total elapsed time? Some hard numbers please.

Gary, your question was aimed at the other Alexander, but I might give my numbers here, too. I already posted the link to my task list earlier in this thread. Based on those numbers, I get the following picture on a 3.2 GHz quad-core without hyperthreading and a GTX 260-216:

1. One ABP2cuda23 v3.11 task per GPU
GPU load: ~6%
GPU temp.: ~59 °C
Average run time: 5,415 s (100%)
Average CPU time: 4,922 s (100%)
CPU time / run time ratio: 0.91
(based on 6 results)

2. Two ABP2cuda23 v3.11 tasks per GPU
GPU load: ~18%
GPU temp.: ~61 °C
Average run time: 5,264 s (97%)
Average CPU time: 5,007 s (102%)
CPU time / run time ratio: 0.95
(based on 20 results)

3. Three ABP2cuda23 v3.11 tasks per GPU
GPU load: ~25%
GPU temp.: ~65 °C
Average run time: 5,457 s (101%)
Average CPU time: 5,228 s (106%)
CPU time / run time ratio: 0.96
(based on 18 results)

The video memory load varies, but if you allow for a maximum of 200 MB per task, you should be well on the safe side (I actually measured less).

Taking the accuracy of this test into account, one can assume that there is no visible performance penalty when increasing the number of tasks on such a video card from one to three. The CPU time seemingly increases slightly, but the run time per task does not. So, considering GPU resources only, you get a staggering 200% increase in throughput (meaning three tasks instead of one in the same time). Of course this 'costs' you two additional CPU cores, so the overall throughput does not rise by 200% (that would be too good to be true *g*). But clearly, for someone wanting to maximize ABP2 throughput, this is the way to go, especially if you combine a low-range multicore CPU with a medium- to high-range Nvidia card.
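Put differently, in tasks per hour on the GPU side (using the average run times above):

```python
# GPU-side throughput from the average run times reported above
# (seconds per task; the comma in the post is a thousands separator).
avg_run = {1: 5415, 2: 5264, 3: 5457}  # tasks per GPU -> avg run time (s)

for n, t in sorted(avg_run.items()):
    print(f"{n} task(s) per GPU: {n * 3600 / t:.2f} tasks/hour")
# -> 0.66, 1.37 and 1.98 tasks/hour for 1, 2 and 3 concurrent tasks
```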

Quote:
The answer is to use AP.

Agreed.

Regards

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5779100
RAC: 0


Message 99236 in response to message 99233

Quote:
In my mind it makes no sense to run that app on a CPU only when I can run it as a CPU/GPU app. This brought up the idea of disabling 'CPU only' for the Arecibo Binary Pulsar Search (STSP) apps. But a setting of 'no CPU' resulted in getting no more 3.04 (S5GCESSE2) WUs, which also affects my main system (equipped with ATI, not nVidia). Both systems stopped crunching Global Correlations S5 Engineering, which is, by definition, against the interests of Einstein (as far as I understand it).


Set up different venues: 'home' for the computer where you want to use the GPU only, and 'default' (or any other venue) for the computer that needs to use the CPU. Setting up different venues has always been an option in the BOINC back-end.

By going AP you're telling BOINC to use only the options in the app_info.xml file and to ignore what the server is telling it. It's like using the BOINC Manager Advanced Preferences, which will override all the same preferences from the web site. Changing the preferences on the web site will no longer have any effect, unless it's a preference not in the local preferences.

JohnDK
Joined: 25 Jun 10
Posts: 109
Credit: 2096011162
RAC: 2024201


Message 99237 in response to message 99234

Quote:
Quote:
So if there is someone out there who can create an app_info.xml , PLEASE HELP !

Well, actually it is pretty easy. The structure of the app_info.xml is documented here: http://boinc.berkeley.edu/wiki/Anonymous_platform and everything else you need you'll find in the existing client_state.xml in your BOINC data directory (which has the same structure, only some additional information).


I've tried using app_info.xml before, and tried it again now. I still get this message:

Quote:
Message from server: To get more Einstein@Home work, finish current work, stop BOINC, remove app_info.xml file, and restart.


And then the project communication is deferred for 4 hours. Not sure if I *can* get more work without removing the file, I only use Einstein as a backup project.

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 691068031
RAC: 265188


Message 99238 in response to message 99237

Quote:

Quote:
Message from server: To get more Einstein@Home work, finish current work, stop BOINC, remove app_info.xml file, and restart.

And then the project communication is deferred for 4 hours. Not sure if I *can* get more work without removing the file, I only use Einstein as a backup project.

When I do some alpha testing of optimizations, I also use AP; I get these warnings and still receive new work in spite of them, so this warning doesn't seem to be 100% accurate in all cases.

HB

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 540334125
RAC: 129497


This appears to be a very interesting topic. Let's try to be productive. And apologies in advance for writing so much.. but I think it's worth it!

The current GPU app s*cks. The GPU utilization is so low that anyone except die-hard Einstein fans will prefer to use the GPU for other projects. That's certainly a problem the project will want to address.

It's clear that one cannot just recode the entire Einstein app(s) for GPUs. The co-processor approach should be, and is (IMO), fine. It just has to be executed in a smart way, so that it doesn't block the GPU all the time.

In this thread it's been shown that this is possible without a substantial performance loss (~3 GHz quad-core CPU, GTX 260, 4 Einstein CUDAs). So what could an ideal Einstein setup look like?

Let's consider a typical quad-core CPU and a midrange/performance GPU (probably anything >= G92). Here we could run 4 Einstein CUDA tasks and have all of them share the GPU. CPU utilization would be 100%, and much lower for the GPU. The remaining GPU time could be filled by a pure GPU project (GPU-Grid, SETI CUDA, Milkyway, Collatz etc.). Ideally the pure GPU project would run at a lower priority than the Einstein CUDAs, but I don't think that's possible yet; they'll probably run in a round-robin fashion (or whatever the nVidia driver decides). On Fermi hardware (compute capability 2.0 and up) it should also be possible to distribute individual shader multiprocessors between different programs, providing more flexibility and enhancing overall efficiency.

And to round things off: it would be nice if the CPU part of the Einstein CUDAs could sleep until the GPU is finished. This way one could run up to 5 Einstein CUDAs on this machine and achieve a higher throughput than with 4 of them. A 6th task should not improve anything in a single-GPU config, however, since (currently) only one of the Einstein CUDAs can be waiting for the GPU at any one time.

In this example any of the 5 Einstein CUDAs could be substituted by regular CPU projects without a loss in efficiency, as long as at least one Einstein CUDA is present. Otherwise 4 CPU tasks would be the way to go.

So what needs to be done?

One could just change the server setting "coproc CUDA count" for Einstein CUDAs to a fixed low value. This would enable several CUDA WUs in parallel without using the AP and without any changes by the user. However, there are bound to be cases where this setting will be wrong (e.g. too low for an i7 with a low end 16 shader GPU and too large for a high end GPU).

Therefore we need an appropriate setting for each host based on the actual hardware. From GPU-Grid I know that their CPU count value varies between different systems, so it's possible. Determining this value would have to take into account:

- CPU speed: using the BOINC benchmark one could get a rough estimate of how often each thread requires GPU assistance
- GPU speed: using the BOINC benchmark and the CUDA compute capability one can get an estimate of how long it will take the GPU to crunch through its share
-> combining both yields the approximate GPU time fraction

- GPU memory should also be taken into account to try not to overload it (difficult if the memory consumption of the pure GPU project running along the Einsteins is not known)

- configurations with different GPUs will be tricky, but probably not a show stopper

If we could make this work, people could just run Einstein CUDAs alongside their current pure GPU projects. They'd only lose the GPU time that Einstein is actually using instead of losing 100% of it. This would partly be made up for by increased RAC from Einstein and the satisfaction of running a cool project on the GPU. I'd certainly go for such a config on my GPU rigs, which currently don't run any Einstein CUDAs due to the inefficiency.

Furthermore, the CPU count would be "1 - GPU count" if the CPU were sent to sleep while the GPU is busy. This should make BOINC run "nCPUs + nGPUs" CPU-heavy projects. I suspect that this will not be easy, though, and that the current Einstein CUDA app is continuously polling the GPU for maximum performance. If each GPU work package takes more than ~1 ms to complete, one could instead estimate the completion time and sleep until then (similar to GPU-Grid's Swan Sync or what Gipsel does at Milkyway and Collatz), if it's not already done like this. On the other hand, those packages should not exceed 40 ms, to stay above 25 fps for the screen refresh.
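The sleep-instead-of-poll idea could be sketched roughly like this (all names here are made up for illustration; the real app's internals are not known to me):

```python
import time

# Hypothetical sketch of "sleep instead of poll" GPU synchronisation.
# gpu_done is a callable that reports whether the GPU package finished;
# est_seconds is the predicted package duration (e.g. the previous
# package's measured time).
def wait_for_gpu(gpu_done, est_seconds):
    """Sleep through most of the estimated kernel time, then poll
    briefly for the remainder; returns the measured elapsed time,
    which can serve as the estimate for the next package."""
    start = time.perf_counter()
    # Sleep for ~90% of the expected duration instead of busy-waiting,
    # freeing the CPU core for another BOINC task in the meantime.
    time.sleep(max(0.0, est_seconds * 0.9))
    while not gpu_done():          # short final poll for the last bit
        time.sleep(0.0005)
    return time.perf_counter() - start
```

With packages longer than ~1 ms this would free most of a CPU core during GPU work; the 40 ms bound mentioned above caps how long any single package should run anyway.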

So.. what do you guys think? Any official comment?

Best regards,
MrS

Scanning for our furry friends since Jan 2002

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 691068031
RAC: 265188


Message 99240 in response to message 99239

Hi!

That sounds very interesting.

For me, the following thing would have to be clarified before doing changes to the server settings that would affect everyone (as opposed to AP tricks that affect only individual PCs where users want to optimize GPU utilization):

Can this be done in a way that is "safe", in the sense that the video RAM limits are observed even for people with 512 MB cards, or for people who have other processes running that consume a lot of video RAM? It would be kind of unfair (and probably seen as a rather rude thing) if E@H ate up lots of CPU cores, filled up video RAM as well, and caused other GPU apps to fail for lack of video RAM :-(

Note that many people have reported problems even running a single instance of the ABP2 app for lack of video RAM, and I would hate to see volunteers' PCs become less stable just to squeeze out more GFLOPS for the project.

As for making the ABP2 app sleep while the GPU is working: Bernd and Oliver have stated that this is planned for ABP3, and whether it's worth the effort to change ABP2 now will depend on how far away ABP3 is.

HB
