ABP2 CUDA applications

Michael Goetz
Joined: 11 Feb 05
Posts: 21
Credit: 3,067,502
RAC: 2

RE: I enabled GPU tasks on

Message 96273 in response to message 96271

Quote:

I enabled GPU tasks on Einstein to test ABP2, but, alas, I wasn't given any. I'll leave GPU enabled and CPU disabled for now hoping I can observe the operating temperature while it's running, which should provide a good indication of how heavily it's using the GPU. I don't care if the RAC is low, but if ABP2 is tying up my GPU for half an hour, it needs to be using that GPU efficiently.

Me

Well, that was fast. Before I finished writing that post, Einstein sent me one of the ABP2 WUs.

The results do not look good.

Oh, wait, this is actually an ABP1 task that got re-issued. I'm not going to run that -- ABP1 essentially shuts off my GPU for six hours. Oh well. Maybe I'll get an ABP2 task soon.

Want to find one of the largest known primes? Try PrimeGrid. Or help cure disease at WCG.

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3,522
Credit: 702,980,721
RAC: 443,660

One user has already reported

Message 96274 in response to message 96273

One user has already reported a comparison of GPU utilization (ABP2 vs. ABP1), showing almost a factor-of-3 improvement, from 5% to 14%.

The next step of improving GPU utilization is already under development ... one step at a time.

CU
Bikeman

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2,143
Credit: 2,944,764,257
RAC: 694,499

RE: That being said, RAC is

Message 96275 in response to message 96271

Quote:
That being said, RAC is far from the most accurate way to compare work between projects. For a GPU, I find that measuring the GPU temperature is a very good metric for how efficiently an application is utilizing the GPU.


I agree - RAC is much too easily influenced by factors outside the control of the individual host - project uptime and validation issues, to name but two. You need to work from the individual task times to get a true handle on these issues.

Having said that, I see nothing like the two- or three-to-one variation between SETI and GPUGrid that Michael G. is reporting. That brings into play some second-order effects: I was trying to avoid them in my original technical exposition, but they need to be considered, if only to conclude that a true 'fair' credit award covering all cases is nigh on impossible.

1) SETI. The original back-of-a-SETI-envelope FLOP equivalence calculation done by David Anderson way back when would have used very early NVidia drivers and - crucially - CUDA FFT DLLs. I'm now using CUDA v2.3 DLLs, and they have at least doubled my SETI scientific throughput. (Unfortunately, they didn't have a similar effect at the other BOINC CUDA projects I've tried, including Einstein).

2) GPUGrid. Michael is using a G2xx-series card with hardware double-precision capability: I'm using 9800GT-series cards with single precision. I'm less sure of the facts here, but I think that may bump up his throughput more than the shader clock*count speed rating would account for. Think of it as analogous to the SSE2 benefits we're used to for CPU applications.

Michael Goetz
Joined: 11 Feb 05
Posts: 21
Credit: 3,067,502
RAC: 2

RE: 2) GPUGrid. Michael is

Message 96276 in response to message 96275

Quote:
2) GPUGrid. Michael is using a G2xx-series card with hardware double-precision capability: I'm using 9800GT-series cards with single precision. I'm less sure of the facts here, but I think that may bump up his throughput more than the shader clock*count speed rating would account for. Think of it as analogous to the SSE2 benefits we're used to for CPU applications.

Correct.

On GPUGRID, at least, the actual performance of the GTX2xx class GPUs (the real ones, GTX 260 and above, not the low-end 2xx cards based on the older chips from the 9000-class NVIDIA cards) is substantially better than what you would expect from comparing GFLOP numbers.

There was a person asking how long his GPU would take to crunch a GPUGRID WU. Comparing the benchmark numbers of his GPU to mine (his was close to the slowest recommended for that project), I estimated his WU would take anywhere from 18 to 30 hours, and I actually expected it to come in around 20. In reality, it took 40 hours.

It's not so much the double precision capability that comes into play here. It's that the benchmarks just don't represent enough of a real world comparison. The GTX2xx class GPUs, at least when running GPUGRID, run about twice as fast as you would expect based on the benchmark numbers.

Want to find one of the largest known primes? Try PrimeGrid. Or help cure disease at WCG.

tolafoph
Joined: 14 Sep 07
Posts: 122
Credit: 74,659,937
RAC: 0

More numbers: ABP1 CUDA:

More numbers:

ABP1 CUDA: 100% CPU load; 5% GPU load; 43°C; 16500-18000 sec per WU
ABP2 CUDA: 100% CPU load; 14% GPU load; 44°C; 1850-2060 sec per WU

GPUGrid CUDA: ~10% CPU load; 77% GPU load; 56°C; ??? WU not yet finished

I had to abort the WU because it would've taken 10 hours to finish and the deadline was 14.01.2009. Because I (probably) won't have an internet connection next week, I'm not sure I could report back to GPUGRID in time.

But overall GPUGrid uses the GPU way more than Einstein => more science done (?)
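(Aside: for anyone who wants to watch these load and temperature figures themselves, below is a minimal sketch of a polling loop using NVIDIA's NVML monitoring library. It's only an illustration of one way to sample the GPU - not the tool used for the numbers above - and it assumes an NVML-capable driver with your card as device 0.)

// Minimal sketch, assuming the NVML library is installed (not the tool used
// for the numbers above). Prints GPU load and temperature once per second.
// Build with something like: gcc gpumon.c -o gpumon -lnvidia-ml
#include <stdio.h>
#include <unistd.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    nvmlUtilization_t util;
    unsigned int temp;
    int i;

    if (nvmlInit() != NVML_SUCCESS) return 1;
    if (nvmlDeviceGetHandleByIndex(0, &dev) != NVML_SUCCESS) return 1;  /* first GPU assumed */

    for (i = 0; i < 60; ++i) {                        /* sample for one minute */
        nvmlDeviceGetUtilizationRates(dev, &util);    /* util.gpu is a percentage */
        nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp);
        printf("GPU load: %u%%  temperature: %u C\n", util.gpu, temp);
        sleep(1);
    }

    nvmlShutdown();
    return 0;
}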

CPU only vs. CPU+GPU
Two of the CUDA WUs validated against a Q6600, which is similar to my E6750.

here
here

So with the GPU in use, the WU gets crunched ~40-50% faster.

Sascha

Michael Goetz
Joined: 11 Feb 05
Posts: 21
Credit: 3,067,502
RAC: 2

RE: More numbers: ABP1

Message 96278 in response to message 96277

Quote:

More numbers:

ABP1 CUDA: 100% CPU load; 5% GPU load; 43°C; 16500-18000 sec per WU
ABP2 CUDA: 100% CPU load; 14% GPU load; 44°C; 1850-2060 sec per WU

GPUGrid CUDA: ~10% CPU load; 77% GPU load; 56°C; ??? WU not yet finished

Thanks for providing those numbers. The GPU load really shows the difference between the projects.

Want to find one of the largest known primes? Try PrimeGrid. Or help cure disease at WCG.

Ver Greeneyes
Joined: 26 Mar 09
Posts: 140
Credit: 9,562,235
RAC: 0

Is this the load on the GPU

Message 96279 in response to message 96278

Is this the load on the GPU while it is crunching or the percentage of calculations done on the GPU for the whole WU? If the latter, does the WU claim the GPU even while the GPU is idle, or can other WUs access the GPU during this time?

Michael Goetz
Joined: 11 Feb 05
Posts: 21
Credit: 3,067,502
RAC: 2

RE: Is this the load on the

Message 96280 in response to message 96279

Quote:
Is this the load on the GPU while it is crunching or the percentage of calculations done on the GPU for the whole WU? If the latter, does the WU claim the GPU even while the GPU is idle, or can other WUs access the GPU during this time?

That's the load on the GPU.

At the moment, only one CUDA application can access the GPU at a time. Most applications (Einstein being the only exception I'm aware of) use a good chunk of the GPU -- and the GPUGRID example is not even the highest; Milkyway is more efficient than GPUGRID, for example.

Also, understand that "CPU utilization" and "GPU utilization" are VERY different concepts.

The CPU utilization number is the percentage of time the CPU is working on that task. 50% CPU utilization means that 50% of the time the CPU is working on that task and 50% of the time it's either doing something else, or sitting idle.

The GPU is different. It's a massively parallel vector supercomputer on a chip. When it's crunching a BOINC task, it's running 100% of the time. The GPU utilization percentage actually represents how many of the parallel processors are working on your task.

For example, a GTX280 has 30 cores and 240 shaders. Without going into details of how vector processors work (which explains the difference between cores and shaders), each shader -- all 240 of them -- is able to do a calculation simultaneously. At 100% utilization, all 240 shaders would be actively crunching a number all of the time. At 75%, only 180 of them would be active, on average, at any given moment.

So, on average, GPUGRID keeps 180 of the shaders busy, ABP2 keeps only 36 of them busy, and ABP1 keeps only 12 of the 240 shaders working at any moment.

Why don't you see 100% utilization on the GPU like you do on a CPU? The reason is that it's extremely difficult to write parallel programs. Some problems lend themselves to parallel processing, some don't. Some problems lend themselves to vector processing, some don't. To use a GPU effectively, the problem has to be both well suited to parallel computing and well suited to vector processing (they're related but somewhat different).
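To make that concrete, here's a toy illustration (my own example, nothing to do with Einstein's actual code): an embarrassingly data-parallel CUDA kernel in which every array element can be computed independently of the others. That's the kind of problem that can keep all of a card's shaders busy at once.

// Toy example only - not Einstein code. Each thread computes one array
// element independently, so launching enough threads can keep every shader
// on the card busy at the same time. Build with: nvcc saxpy.cu -o saxpy
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *hx = new float[n], *hy = new float[n];
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    float *dx, *dy;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, dx, dy);   // thousands of threads in flight
    cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);

    printf("y[0] = %f (expect 4.0)\n", hy[0]);
    cudaFree(dx); cudaFree(dy);
    delete[] hx; delete[] hy;
    return 0;
}

A problem that doesn't decompose into thousands of independent pieces like this simply can't fill the card, no matter how carefully it is written.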

It may very well be that the Einstein computations are not well suited to parallel processing or not well suited to being solved on a vector processor. Either would cause the poor efficiency that we're currently seeing on the Einstein GPU application.

It could also be that it's simply poorly programmed -- and that's not a reflection on the programmers involved. Programming for a GPU can be very challenging, but more importantly it's very different from normal programming. Very few people have experience with it, and there is a learning curve associated with learning how to do it right.

Want to find one of the largest known primes? Try PrimeGrid. Or help cure disease at WCG.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2,143
Credit: 2,944,764,257
RAC: 694,499

RE: It could also be that

Message 96281 in response to message 96280

Quote:
It could also be that it's simply poorly programmed -- and that's not a reflection on the programmers involved. Programming for a GPU can be very challenging, but more importantly it's very different from normal programming. Very few people have experience with it, and there is a learning curve associated with learning how to do it right.


And that is exactly what the Einstein programmers are doing - gaining experience, and taking baby steps up that learning curve as they go along.

I think they are converting different parts of the overall ABP algorithm in stages, and I'm not sure exactly how the parts are distributed through the running time of each task. I suspect you may see 'bursts' of GPU activity when an already-converted function is encountered, followed by periods of low or even zero utilisation when other parts of the search are being worked on by the CPU. The overall percentage of GPU usage reported possibly depends on how well your monitoring tool averages over time, and what time interval it is asked to average over.
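Purely as a schematic of what I mean (an assumed structure, not the actual ABP source), per-chunk processing along these lines would behave exactly that way - the GPU is only busy during the middle step, so a monitor that averages over the whole run reports low utilisation:

// Schematic only - an assumed structure, not the real ABP2 code. One stage of
// the per-chunk pipeline has been ported to CUDA; the GPU sees a short burst
// of work between long CPU-only phases, so the averaged utilisation stays low.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void ported_stage(float *buf, int n)   // the one stage that runs on the GPU
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = buf[i] * buf[i];
}

int main()
{
    const int n = 1 << 20;
    float *h = new float[n];
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    for (int chunk = 0; chunk < 10; ++chunk) {
        for (int i = 0; i < n; ++i) h[i] = (float)i / n;      // CPU-only stage: GPU idle

        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
        ported_stage<<<(n + 255) / 256, 256>>>(d, n);         // brief burst of GPU activity
        cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

        double sum = 0.0;
        for (int i = 0; i < n; ++i) sum += h[i];              // CPU-only stage again: GPU idle
        printf("chunk %d: sum = %f\n", chunk, sum);
    }

    cudaFree(d);
    delete[] h;
    return 0;
}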

Olaf
Joined: 16 Sep 06
Posts: 26
Credit: 190,763,630
RAC: 0

Concerning the problem with

Message 96282 in response to message 96267

Concerning the work buffer, I see the same problem on both of my computers with GPUs. Both have a work buffer of two days.

The i7 received work for many more days than that, but since I've stopped it from fetching new work, it runs 8 jobs at once, and it typically gets less work than the buffer setting asks for anyway, it will manage to finish in time.

On the notebook, however, I currently have about 137 jobs for two days!? At best the notebook does 8 jobs a day, so it should have no more than about 20 jobs in the buffer - of course I've stopped BOINC from fetching more work there as well...

The work buffer setting seems to produce nonsense if there isn't enough GPU work being sent out to fill it...
