Hello ladies and gentlemen,
Since I updated my Linux install, and with it my BOINC version, I have discovered GPU crunching for Einstein. I was doing it on a 9800 GT with 112 CUDA cores. Each GPU WU seemed to generate 8 output files.
Recently, I swapped my GPU for a GTX 560, which is clocked higher and has 336 CUDA cores. I was naively expecting to get more than three times the credit. My WUs are crunching a bit faster, but that's all. I still send the same 8 files per GPU WU, and my credits per day have not skyrocketed.
I have read somewhere else (I think it was GPUGRID's forum) that their WUs used 256 of the 384 cores of a 560 Ti, so I am a bit worried that my GPU is mostly sleeping. I have a feeling it goes a little faster due to the higher clock speed but doesn't use more cores than a 9800 GT, especially since it is stated here that you need a fixed amount of VRAM (300 MB) to crunch.
Anyway, my question to you is: what should I do to use the full potential of my GPU for science? I heard somewhere that there is a trick to run several GPU WUs on a single GPU. Maybe I should crunch for a more optimised project, such as Milkyway or GPUGRID.
I am curious about how all this works, so I will take any technical information you can give. Thanks in advance to whoever responds.
Copyright © 2024 Einstein@Home. All rights reserved.
Does E@H use all my 336 CUDA cores?
The credits per day are not closely or directly related to your current speed, because you have to wait for your wingmen to finish their tasks. So there will be a delay between the hardware change and the daily credits settling at their new level. (The RAC will take even longer, because it is an average.) It will be harder still to see the difference if you are attached to several projects.
But you should see an interesting difference in the crunching times of the BRP WUs. As their crunching times are very consistent, they give you a good hint about the improvement in your performance.
Crunching more than one WU at a time is easy in Einstein: you just need to edit the project preferences and change the "GPU utilization factor of BRP apps" to the inverse of the number of WUs you want to run concurrently, i.e. to run 2 WUs set it to 0.5, to run 3 set it to 0.33, and so on (for other projects it's usually a bit more complex; in any case, ask in their forums).
Now, if your GPU has 1 GB of RAM it will be able to crunch 2 at a time for sure, and maybe 3 (but that depends on the amount of free RAM left on the GPU by the OS and drivers).
Also, running several BRP WUs will need more CPU power, so it could be a good idea to set BOINC to leave at least one CPU core (or more) free to attend to the GPU, but you will need to test this yourself, as it varies from system to system.
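As a rough sketch of the arithmetic above: the function names are made up for illustration, and the default of 300 MB per task is simply the VRAM figure mentioned earlier in this thread, not something measured. Actual memory use per task and the free VRAM on a given card will differ.

```python
def utilization_factor(n_tasks):
    """Value to enter as 'GPU utilization factor of BRP apps'
    in order to run n_tasks WUs concurrently (1 / n_tasks)."""
    if n_tasks < 1:
        raise ValueError("n_tasks must be at least 1")
    return round(1.0 / n_tasks, 2)


def max_tasks_by_memory(free_vram_mb, per_task_mb=300):
    """Upper bound on concurrent tasks from GPU memory alone.
    per_task_mb=300 is an assumption taken from the thread."""
    return int(free_vram_mb // per_task_mb)


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, "task(s) -> factor", utilization_factor(n))
    # Hypothetical example: a 1 GB card with about 850 MB left free
    # by the OS and drivers.
    print("tasks that fit:", max_tasks_by_memory(850))
```

Memory is only an upper bound here; whether 2 or 3 concurrent tasks actually give more total output still has to be tested on the machine itself.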
Hi!
GPU performance is a complex matter. It doesn't simply scale with the number of cores; there are many other factors that have to be taken into account.
Even though it's by now a 'legacy' graphics card, the 9800 GT is actually quite a powerful card for crunching. Yes, it has only 112 cores, but those run at a relatively high clock rate: 1.5 GHz. This Wikipedia article gives its theoretical peak performance (single precision) as ca. 500 GFlops, with ca. 57 GB/s memory bandwidth.
http://en.wikipedia.org/wiki/GeForce_9_Series#GeForce_9800_GT.
The GTX 560 is clocked only slightly higher (1.6 GHz). It has three times the number of cores (from a different GPU architecture), but less than three times the memory bandwidth. Memory bandwidth is crucial for parts of the computation, so that's one reason you cannot expect to get 3 times the performance. For comparison, Wikipedia cites a theoretical peak performance of ca. 1100 GFlops (http://en.wikipedia.org/wiki/GeForce_500_Series). I'm not sure both articles use the same formula, but you get the idea.
Another factor is the PCIe bus: both cards have the same PCIe 2.0 x16 interface, which also has some effect on performance.
So just how much faster does your GTX 560 actually process BRP tasks?
From your results it appears that the faster BRP tasks take about 2800 seconds. I couldn't find any 9800 GT tasks in your records, but I used one myself in a similar Linux box until recently, and that one ran at about 4000 seconds per job.
So after the RAC reaches a new equilibrium (it's averaged over time), you should see a credit gain for BRP4 tasks that is in proportion to that (note that the CPU tasks also contribute credits).
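To put numbers on that, here is a small sketch comparing the observed speedup with the ratio of the quoted theoretical peaks. All four inputs are the approximate figures from this post (ca. 4000 s and 2800 s per task, ca. 500 and 1100 GFlops); the variable names are just for illustration.

```python
# Approximate BRP task times quoted in this post (seconds per task)
t_9800gt = 4000.0
t_gtx560 = 2800.0

# Approximate theoretical single-precision peaks quoted above (GFlops)
peak_9800gt = 500.0
peak_gtx560 = 1100.0

# Speedup actually observed on BRP tasks
observed_speedup = t_9800gt / t_gtx560

# Speedup one would naively expect from the peak figures alone
theoretical_speedup = peak_gtx560 / peak_9800gt

print(f"observed speedup:    {observed_speedup:.2f}x")
print(f"theoretical speedup: {theoretical_speedup:.2f}x")
```

So the observed gain is roughly 1.4x with one task at a time, against roughly 2.2x on paper, and neither is anywhere near the 3x suggested by the core count.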
The advice you got about running 2 or 3 units in parallel via the project preferences is definitely good and worth trying.
Cheers
HB
Oh, these are nice, precise answers !
Thank you very much, i will change my project preferences and see what happens.
So, now that it computes two WUs at the same time, it takes 75 minutes for 2 WUs where it took 55 for one. That is an output increase of roughly 47%.
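The figure comes from comparing tasks per minute before and after. A minimal sketch of that calculation, using the timings above (the function name is made up for illustration):

```python
def throughput_gain(tasks_before, minutes_before, tasks_after, minutes_after):
    """Relative change in tasks per minute between two configurations."""
    rate_before = tasks_before / minutes_before
    rate_after = tasks_after / minutes_after
    return rate_after / rate_before - 1.0


# 1 WU in 55 minutes before, 2 WUs in 75 minutes now
gain = throughput_gain(1, 55, 2, 75)
print(f"output increase: {gain:.1%}")
```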
Excellent!
Happy crunching
HBE
Nice upgrade. If you are in a mood to experiment further, you might try using the computing preferences to reduce the number of pure CPU BOINC tasks you allow on your 9550 simultaneously. (On multiprocessors, use at most nn% of processors)
The CPU "support application" for your GPU work keeps the GPU waiting to some degree both by latency (when it waits for an executing task to vacate a core so it can have a turn) and by execution time. You can't help the execution time short of measures such as overclock or RAM parameter tweaking, but you can help the latency both by raising the priority of the CPU support application compared to competing tasks, and by lowering the number of always-executing BOINC CPU tasks.
While the priority for a single CPU support task can be raised by manual intervention in Process Explorer or Task Manager for a single-job experiment, doing so on a continuing basis can be done using Process Lasso, or by efmer's Priority program.
The detailed tradeoff for total system productivity vs. number of simultaneous GPU tasks and reductions below maximum of BOINC CPU tasks varies considerably with GPU, CPU, system peripheral bus characteristics, and the specific GPU/CPU task. So I can't promise you'll find improvement on this route, but many people think they have, including me.
You can't simply increase the clock by itself, or simply add more cores, and expect a proportionate increase in computer performance. This can be true for CPUs as well. Part of the reason is shared resources, like RAM, hard drive, etc., and part of it is programming. Theoretically, with 2 CPUs (or today a dual core, as it's essentially 2 CPUs on 1 die) one might expect the computer to do twice as much as a single-processor computer, but in practice that hasn't held, and it's been known not to hold for a long time now. The CPUs lose time whenever something is not in their cache and they have to compete for memory bandwidth to get at it; whenever something is in the cache of one CPU but the other needs to use it (now one is dealing with a bridge or whatever to transfer the data from one CPU to another for which that thread doesn't currently have an affinity, affinity meaning its stuff is in that core's registers and whatnot); or whenever they have to compete for the same system bus, etc.
Nothing new. And for proof that clock alone won't give all the answers, just look at what happened when Intel first came out with the Pentium 4 (arguably crippled, as some of the stuff they originally intended didn't get put in place until Northwood), kind of early, due to competition from the Athlons. A first-generation 1.5 GHz Pentium 4 was getting rofl-stomped by a 1 GHz Pentium III on more than a handful of benchmarks. This, more than anything AMD could ever have said, made their case that clock isn't everything, and looking at the 1.5 GHz P4 getting outperformed by a 1 GHz P-III, various review sites, from Tom's Hardware to AnandTech to others, were anything but kind in their analysis.
Personally, and aside from the matter of drivers (Kepler, for instance, is new, and the driver support for it is arguably still not fully developed), I think devs could be running into a bit of the same sort of challenge that affected Intel's IA-64 or EPIC platform (think Merced, for instance): exactly how superscalar can you make a program, to exploit a high degree of parallelism, before the returns start to dwindle? That said, some projects' code does use more of the GPU than others; e.g. POEM uses a higher % of my GPU here than Einstein or Milkyway, based on GPU-ID's reporting. But as they're different projects, how much of the extra CUDA cores can be exploited? Not sure. On the CPU front, however, some (including DEC) pointed out some of the limitations in just how superscalar one can make a program. They also came up with hyperthreading, which would have been slated for the EV-8 if it had ever come out. But then again, GPUs like nVidia's are designed for gamers, and in a game one isn't necessarily running just one CUDA program on the GPU...