Hi all on this Easter Monday! :)
I installed a new 24/7 BOINC-only machine with an Opteron 2.6 GHz (AMD, single core) CPU and an NVIDIA 9800 GTX+ (1 GB RAM, Win XP SP3, BOINC 6.10.58, newest Detonator drivers). So I'm running my first Binary Radio Pulsar searches on a GPU, but it uses only 55% GPU load with at most 52% CPU load. Do I need to change any settings to get 70+% on the GPU? The CPU has no other projects running.
Thanks for any help :)
DSKAG Austria Research Team: [LINK]http://www.research.dskag.at[/LINK]
Low GPU Load
Hi!
Volunteers have experimented with customized app_info.xml files to run more than one task in parallel on a single GPU, leading to longer runtimes per task but higher GPU utilization and increased overall throughput. See this thread:
http://einsteinathome.org/node/195553
Make sure to read to the end of the thread (or read it in reverse order :-) because some of the app_info.xml files posted at the beginning of the thread are now outdated due to newer app versions.
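Roughly, the relevant bit of such an app_info.xml looks like the sketch below (the app name, file name and version number are just placeholders here, not the real BRP3 entries; take working files from the linked thread). The <count>0.5</count> line is what tells BOINC that a task needs half a GPU, i.e. that two tasks may share one card:
[code]
<app_info>
  <app>
    <name>einsteinbinary_BRP3</name>               <!-- placeholder app name -->
  </app>
  <file_info>
    <name>einsteinbinary_BRP3_cuda.exe</name>      <!-- placeholder file name -->
    <executable/>
  </file_info>
  <app_version>
    <app_name>einsteinbinary_BRP3</app_name>
    <version_num>100</version_num>                 <!-- placeholder version -->
    <coproc>
      <type>CUDA</type>
      <count>0.5</count>   <!-- 0.5 GPUs per task = 2 tasks per GPU -->
    </coproc>
    <file_ref>
      <file_name>einsteinbinary_BRP3_cuda.exe</file_name>
      <main_program/>
    </file_ref>
  </app_version>
</app_info>
[/code]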
HB
Ok thx, then it does not work
OK thanks, then it does not work for me because this graphics card has only 512 MB of RAM, so it can't run 2 WUs at the same time :/
DSKAG Austria Research Team: [LINK]http://www.research.dskag.at[/LINK]
RE: Volunteers have
Of course this is MacGyvering things once again..
In fact I'm pretty sure it's mainly because the number of threads running on the GPU is simply too low.
I remember seeing this with every project moving into GPU development, and it took quite some effort to get the apps to run efficiently.
Do you know that paper?
http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf
Yup, I Know this
Yup, I know this presentation; it's very interesting, as it states that some of the performance recommendations written in the NVIDIA documentation are not the full truth and should sometimes be ignored to get better performance.
As for the BRP3 app, a considerable part of the work is done in NVIDIA's own FFT library, cuFFT, which is also discussed in that paper.
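For illustration, a toy sketch (not the actual BRP3 code) of what driving cuFFT looks like from the app's side: the library is a black box behind a plan, and about the only knobs the app holds are the transform length and how many transforms it batches per call - the kernel launch configuration inside is cuFFT's business:
[code]
#include <cuda_runtime.h>
#include <cufft.h>

// Toy sketch, not BRP3 code: batch several transforms into one plan so the
// library has enough parallel work; error checking omitted for brevity.
void run_batched_ffts(cufftComplex *d_data, int fft_size, int batch)
{
    cufftHandle plan;
    cufftPlan1d(&plan, fft_size, CUFFT_C2C, batch);      // one plan covers 'batch' FFTs
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);   // in-place forward transforms
    cudaDeviceSynchronize();                             // wait for the GPU to finish
    cufftDestroy(plan);
}
[/code]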
I've always wanted to play around with some of the ideas given in this paper but haven't had the time yet. But given the fact that the cuFFT part takes a considerable share, the potential for optimization without re-writing the FFT part (noooooooooooo!) is somewhat limited.
Having said that, there sure remains some room for optimization, and just like the GW app, there will be improvements with each iteration.
Actually, the paper that you reference states the exact opposite: sometimes you get better performance by using FEWER threads!
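To make the "fewer threads" point concrete, here is a toy sketch (nothing to do with the actual BRP3 kernels): the second kernel launches half as many threads but gives each one two independent elements, so latency can be hidden by instruction-level parallelism instead of sheer thread count - one of the tricks the paper describes:
[code]
// Toy illustration of Volkov's point, not BRP3 code.

// One element per thread: latency hiding relies on thread count alone.
__global__ void scale_one(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Two independent elements per thread: half the threads, but each thread
// has more independent work in flight (instruction-level parallelism).
__global__ void scale_two(float *x, float a, int n)
{
    int i = 2 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (i < n)     x[i]     *= a;
    if (i + 1 < n) x[i + 1] *= a;
}
[/code]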
HB
RE: Yup, I Know this
And the bottom line is: it's hard to code for optimal performance on all those different architectures out there.
The app needs to check what's available and adapt itself.
Right - no go!
But giving CUDA 4.0 a try (which is available with the 270.xx drivers) should be worthwhile.
At least NVIDIA says there are many performance improvements..
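On the "check what's available and adapt" idea, a minimal sketch of what that could look like (hypothetical, not what the app actually does) - query the device properties and derive a launch configuration from them:
[code]
#include <cuda_runtime.h>
#include <stdio.h>

// Hypothetical sketch: pick a threads-per-block value from the device's
// reported properties. A real app would rather benchmark a few candidates.
int pick_block_size(int device)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    printf("GPU: %s, compute capability %d.%d, %d multiprocessors\n",
           prop.name, prop.major, prop.minor, prop.multiProcessorCount);

    // Crude heuristic (an assumption, not a recommendation):
    // smaller blocks for pre-Fermi cards with smaller register files.
    return (prop.major < 2) ? 128 : 256;
}
[/code]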
to bump this up.. i
To bump this up..
I remember that dnetc (in fact the distributed.net client) had implemented an option for a manual override like this:
"Core selection:
This option determines core selection. Auto-select is usually best since
it allows the client to pick other cores as they become available. Please
let distributed.net know if you find the client auto-selecting a core that
manual benchmarking shows to be less than optimal.
Cores marked as 'n/a' are not applicable to your particular cpu/os.
RC5-72:
0) CUDA 1-pipe 64-thd
1) CUDA 1-pipe 128-thd
2) CUDA 1-pipe 256-thd
3) CUDA 2-pipe 64-thd
4) CUDA 2-pipe 128-thd
5) CUDA 2-pipe 256-thd
6) CUDA 4-pipe 64-thd
7) CUDA 4-pipe 128-thd
8) CUDA 4-pipe 256-thd
9) CUDA 1-pipe 64-thd busy wait
10) CUDA 1-pipe 64-thd sleep 100us
11) CUDA 1-pipe 64-thd sleep dynamic"
On my GTS 250 the automatic selection went for option 0, but option 10 worked a lot better:
~140 Mkeys/sec at 10% CPU load with option 0, compared to ~200 Mkeys/sec at 1% CPU load with option 10.
Option 9 was even faster, but took a complete CPU core for that.
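Something like that would be easy enough to bolt onto a CUDA app in principle; a hypothetical sketch (the BRP3 app offers nothing of the sort, and the variable name is made up) of letting the user override the threads-per-block the way dnetc lets you pick a core:
[code]
#include <cuda_runtime.h>
#include <stdlib.h>

// Stand-in for whatever the app really computes.
__global__ void work_kernel(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

// Hypothetical manual override in the spirit of dnetc's core selection:
// take the threads-per-block from an environment variable, else use a default.
void launch_work(float *d_x, int n)
{
    int threads = 256;                             // default
    const char *env = getenv("BRP_CUDA_THREADS");  // made-up variable name
    if (env && atoi(env) > 0)
        threads = atoi(env);

    int blocks = (n + threads - 1) / threads;
    work_kernel<<<blocks, threads>>>(d_x, n);
    cudaDeviceSynchronize();
}
[/code]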