Low GPU Load

dskagcommunity
Joined: 16 Mar 11
Posts: 89
Credit: 1219701683
RAC: 36277
Topic 195766

Hi all on this Easter Monday! :)

I installed a new 24/7 BOINC-only machine with a 2.6 GHz AMD Opteron CPU (single core) and an NVIDIA 9800 GTX+ (1 GB RAM, Win XP SP3, BOINC 6.10.58, newest Detonator drivers). I'm now running my first Binary Radio Pulsar searches on a GPU, but it reaches only 55% GPU load with at most 52% CPU load. Do I need to change any settings to get 70+% GPU load? The CPU has no other projects running.

Thx for any help :)

DSKAG Austria Research Team: [LINK]http://www.research.dskag.at[/LINK]

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 762348793
RAC: 1084717

Low GPU Load

Hi!

Volunteers have experimented with customized app_info.xml files to run more than one task in parallel on a single GPU, leading to longer runtimes per task but higher GPU utilization and overall increased throughput. See this thread:

http://einsteinathome.org/node/195553

Make sure to read to the end of the thread (or read it in reverse order :-) ), because some of the app_info.xml files posted at the beginning are now outdated due to newer app versions.
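To give an idea what such a file looks like, here is a heavily trimmed sketch of the relevant part of an app_info.xml. The app name, file name and version number below are placeholders only (take the real ones from the thread above; they change with every app release, which is exactly why the early postings went stale). The key element is <count>: a value of 0.5 tells BOINC that each task needs half a GPU, so two tasks run side by side.

<app_info>
  <app>
    <name>einsteinbinary_BRP3</name>  <!-- placeholder app name -->
  </app>
  <file_info>
    <name>einsteinbinary_BRP3_x.yz_windows_intelx86__BRP3cuda32.exe</name>  <!-- placeholder -->
    <executable/>
  </file_info>
  <app_version>
    <app_name>einsteinbinary_BRP3</app_name>
    <version_num>xyz</version_num>  <!-- placeholder version -->
    <coproc>
      <type>CUDA</type>
      <count>0.5</count>  <!-- half a GPU per task = two tasks in parallel -->
    </coproc>
    <file_ref>
      <file_name>einsteinbinary_BRP3_x.yz_windows_intelx86__BRP3cuda32.exe</file_name>
      <main_program/>
    </file_ref>
  </app_version>
</app_info>

(The real file list is longer, e.g. the cuFFT DLLs also need file_info entries; again, see the thread for complete, current versions.)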

HB

dskagcommunity
Joined: 16 Mar 11
Posts: 89
Credit: 1219701683
RAC: 36277

Ok thx, then it does not work

Ok thx, then it does not work for me, because this graphics card has only 512 MB of RAM, so it can't run 2 WUs at the same time :/

DSKAG Austria Research Team: [LINK]http://www.research.dskag.at[/LINK]

FrankHagen
Joined: 13 Feb 08
Posts: 102
Credit: 272200
RAC: 0

RE: Volunteers have

Quote:
Volunteers have experimented with customized app_info.xml files to run more than one task in parallel on a single GPU, leading to longer runtimes per task but higher GPU utilization and overall increased throughput.

of course this is macgyvering things once again..

in fact i'm pretty sure it's mainly because the number of threads running on the GPU is simply too low.

i remember watching this on every project moving into GPU development, and it took quite some effort to get the apps to run efficiently.

do you know that paper?

http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 762348793
RAC: 1084717

Yup, I know this

Yup, I know this presentation; it's very interesting, as it states that some of the performance recommendations in the NVIDIA documentation are not the full truth and sometimes should be ignored to get better performance.

As for the BRP3 app, a considerable part of the work is done in NVIDIA's own FFT library, cuFFT, which is also discussed in that paper.

I've always wanted to play around with some of the ideas in this paper but haven't had the time yet. Given that the cuFFT part takes a considerable share of the runtime, though, the potential for optimization without rewriting the FFT part (noooooooooooo!) is somewhat limited.
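Just to illustrate what that means (a minimal sketch, not the actual BRP3 source): once the data is on the card, the app essentially creates a plan and calls the library, so the inner loop is NVIDIA's code, not ours.

/* Minimal cuFFT usage sketch -- illustrative only, not BRP3 code.
   Runs a batch of 1-D complex-to-complex FFTs in place on data
   that is already in device memory. Link with -lcufft. */
#include <cufft.h>

int run_ffts(cufftComplex *d_data, int length, int batch)
{
    cufftHandle plan;
    if (cufftPlan1d(&plan, length, CUFFT_C2C, batch) != CUFFT_SUCCESS)
        return -1;
    /* all the heavy lifting happens inside this one library call */
    if (cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD) != CUFFT_SUCCESS) {
        cufftDestroy(plan);
        return -1;
    }
    cufftDestroy(plan);
    return 0;
}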

Having said that, there certainly remains some room for optimization, and just as with the GW app, there will be improvements with each iteration.

Quote:

in fact i'm pretty sure it's mainly because the number of threads running on the GPU is simply too low.

Actually, the paper that you reference states the exact opposite: sometimes you get better performance by using FEWER threads!
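The idea, roughly, is to trade thread count for instruction-level parallelism: give each thread several independent operations so it can hide latency on its own. A toy sketch of my own (nothing to do with the actual BRP3 kernels):

/* Each thread produces ILP outputs instead of one, so the kernel
   needs only 1/ILP as many threads, and every thread has several
   independent memory operations in flight to hide latency. */
#define ILP 4

__global__ void scale_ilp(float *out, const float *in, float s, int n)
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * ILP;
#pragma unroll
    for (int i = 0; i < ILP; i++) {
        int idx = base + i;
        if (idx < n)
            out[idx] = s * in[idx];
    }
}

/* launched with n/ILP threads instead of n, e.g.
   scale_ilp<<<(n/ILP + 255) / 256, 256>>>(d_out, d_in, 2.0f, n); */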

HB

FrankHagen
Joined: 13 Feb 08
Posts: 102
Credit: 272200
RAC: 0

RE: Yup, I know this

Quote:
Yup, I know this presentation; it's very interesting, as it states that some of the performance recommendations in the NVIDIA documentation are not the full truth and sometimes should be ignored to get better performance.

and the bottom line is: it's hard to code for optimal performance on all those different architectures out there.

the app needs to check what's available and adapt itself.
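something along these lines, say (a hypothetical sketch of mine, not how any of the real apps actually do it): query the device at startup and derive the launch parameters from what you find.

/* Hypothetical auto-adaptation sketch: pick a block size from the
   device properties instead of hard-coding one value for all GPUs. */
#include <cuda_runtime.h>
#include <stdio.h>

int pick_threads_per_block(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return 128;  /* conservative fallback */
    printf("GPU: %s, compute capability %d.%d, %d multiprocessors\n",
           prop.name, prop.major, prop.minor, prop.multiProcessorCount);
    /* toy heuristic only: newer architectures tolerate bigger blocks */
    return (prop.major >= 2) ? 256 : 128;
}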

Quote:

As for the BRP3 app, a considerable part of the work is done in NVIDIA's own FFT library, cuFFT, which is also discussed in that paper.

I've always wanted to play around with some of the ideas in this paper but haven't had the time yet. Given that the cuFFT part takes a considerable share of the runtime, though, the potential for optimization without rewriting the FFT part (noooooooooooo!) is somewhat limited.

right - no go!

but giving cuda 4.0 a try (which is available with the 270.xx drivers) should be worthwhile.

at least nvidia says there are many performance improvements..

FrankHagen
Joined: 13 Feb 08
Posts: 102
Credit: 272200
RAC: 0

to bump this up.. i

to bump this up..

i remember dnetc (in fact distributed.net) had implemented an option for manual override like this:

"Core selection:

This option determines core selection. Auto-select is usually best since
it allows the client to pick other cores as they become available. Please
let distributed.net know if you find the client auto-selecting a core that
manual benchmarking shows to be less than optimal.
Cores marked as 'n/a' are not applicable to your particular cpu/os.

RC5-72:-

0) CUDA 1-pipe 64-thd
1) CUDA 1-pipe 128-thd
2) CUDA 1-pipe 256-thd
3) CUDA 2-pipe 64-thd
4) CUDA 2-pipe 128-thd
5) CUDA 2-pipe 256-thd
6) CUDA 4-pipe 64-thd
7) CUDA 4-pipe 128-thd
8) CUDA 4-pipe 256-thd
9) CUDA 1-pipe 64-thd busy wait
10) CUDA 1-pipe 64-thd sleep 100us
11) CUDA 1-pipe 64-thd sleep dynamic"

on my GTS250, automatic selection went for option 0, but option 10 worked a lot better:

~140 Mkeys/sec at 10% CPU load (option 0) compared to ~200 Mkeys/sec at 1% CPU load (option 10).

option 9 was even faster still, but took a complete CPU core for that.
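in CUDA terms, the difference between options 9 and 10 is just how the host waits for the GPU to finish. a rough sketch of both (my own illustration, not dnetc code):

/* Record an event after the kernel launch, then wait for it.
   Spinning reacts fastest but burns a full CPU core; sleeping
   between polls trades a little latency for almost no CPU load.
   (usleep is POSIX; on Windows you'd use Sleep() instead.) */
#include <cuda_runtime.h>
#include <unistd.h>

void wait_for_gpu(cudaEvent_t done, int busy_wait)
{
    if (busy_wait) {
        while (cudaEventQuery(done) == cudaErrorNotReady)
            ;                    /* like option 9: ~100% of one core */
    } else {
        while (cudaEventQuery(done) == cudaErrorNotReady)
            usleep(100);         /* like option 10: ~1% CPU load */
    }
}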
