CUDA Application under-performance

Paul D. Buck
Joined: 17 Jan 05
Posts: 754
Credit: 5385205
RAC: 0
Topic 194638

I will run a few tasks with this application, but if the current indications hold, I will have to suspend EaH until you either let me run only on the CPU or the performance of the application is dramatically improved.

As far as I can tell so far, what you have done is create a situation where, for the same time you used to use only my CPU at 100%, you now occupy both the CPU and the GPUs on the system. What this means is that for a minuscule improvement in speed you have drastically decreased my total contribution to all my other projects ...

Not cool ...

I object less to the need to spend time to get it right than to the fact that there are no indications that an improvement in total throughput is even close to being in the works. Sorry guys, a 10-20% improvement in speed gained by the total domination of my CUDA card and a core is not what I consider a fair investment on my part when I can get more done on GPU Grid, MW, and Collatz during that same time interval ... and with the old applications I could also do EaH work on a core (though supposedly slower) ...

So, aside from my Mac Pro I guess I will be NNT till this is fixed, one way or the other ... give me opt out, or a better application ...

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5779100
RAC: 0

CUDA Application under-performance

Quote:
give me opt out ...


http://einstein.phys.uwm.edu/prefs.php?subset=project

Use NVIDIA GPU if present
(enforced by 6.10+ clients) No.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2139
Credit: 2752699717
RAC: 1466726

RE: RE: give me opt out

Message 95698 in response to message 95697

Quote:
Quote:
give me opt out ...

http://einstein.phys.uwm.edu/prefs.php?subset=project

Use NVIDIA GPU if present
(enforced by 6.10+ clients) No.


I set that last night, for much the same reasons. It has successfully inhibited a CUDA work request, but I would like to see whether the host will request work for the CPU (I still have both S5R6 and ABP1 CPU app_versions in client_state).

Unfortunately, at the precise time my host picked up the 'no_cuda' directive from Einstein, BOINC v6.10.19 stopped recalculating long-term debt for the project: so I'm stuck on "(overworked)" and no work fetch with no way down.

Both of you have access to the logs I've just posted on boinc_alpha: any ideas?

Paul D. Buck
Joined: 17 Jan 05
Posts: 754
Credit: 5385205
RAC: 0

RE: RE: give me opt out

Message 95699 in response to message 95697

Quote:
Quote:
give me opt out ...

http://einstein.phys.uwm.edu/prefs.php?subset=project

Use NVIDIA GPU if present
(enforced by 6.10+ clients) No.


When I looked, that setting was not there ... it is now ... better they should have made the setting, warned us, and then made the change ... not good still ...

Paul D. Buck
Joined: 17 Jan 05
Posts: 754
Credit: 5385205
RAC: 0

RE: Both of you have access

Message 95700 in response to message 95698

Quote:
Both of you have access to the logs I've just posted on boinc_alpha: any ideas?


Since the beginning of the year I have been posting ideas ... to little to no avail ... one of the several reasons I stopped posting, my sense of humor watching UCB and others squirm trying to defend the indefensible and deny the undeniable is not presently up to the task of overruling my health; I mean John mentioned a problem I had fully documented in the first quarter and DA agreed until I mentioned that this was an issue I had previously noted ... now it seems to be off the screen again ... if it was not so pathetic it would be funny ...

The bottom line is the same as before, the system is fundamentally a design for single CPU computers onto which they have heaped tons of new features without fully considering the impact of those changes. Worse, no attention is paid to those with contrary advice ...

The bottom line is that the design of the Resource Scheduler has not been reconsidered in years ... and JM VII sadly is more interested in telling us how it is supposed to work than in investigating if it is working the way his theories predict. Worse, the design intent has slowly been compromised in the interests of expediency ...

And some changes have been made, like strict FIFO, to handle a problem that was the result of bad decisions (the triggers for scheduling, which can fire as often as once every few seconds, or faster, as I also demonstrated, causing other issues) coupled with the actual bug ... the actual bug has been removed but strict FIFO, no longer needed, still remains ...

So, no, I have no answers ...

Truth?
Joined: 3 May 08
Posts: 2
Credit: 159443
RAC: 0

I'm confused I just

I'm confused

I just completed a GPU Einstein task. My understanding is that GPUs decrease the overall time for a task to be completed. However, the results point in a different direction. Here are the results:
18,920.59 sec
With the following
CPU type GenuineIntel
Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz [Intel64 Family 6 Model 15 Stepping 11]
Number of CPUs 4
GPU details BOINC 6.10.18: CUDA GeForce 9800 GTX/9800 GTX+ 1 512MB 19562
Operating System Microsoft Windows 7
Ultimate x64 Edition, (06.01.7600.00)
Memory 2047.3 MB

Whereas the double checker took only
14,016.25 sec
With the following
CPU type GenuineIntel
Intel(R) Core(TM)2 Duo CPU P7350 @ 2.00GHz [x86 Family 6 Model 23 Stepping 10]
Number of CPUs 2
GPU details BOINC 6.10.17
Operating System Darwin
10.2.0
Memory 4096 MB

If someone can explain this to me that would be great.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109387596798
RAC: 35918322

RE: I'm confused I just

Message 95702 in response to message 95701

Quote:

I'm confused

I just completed a GPU Einstein task. My understanding is that GPUs decrease the overall time for a task to be completed....


The E@H GPU app is very new and, as yet, not that much of the actual processing has been ported to the GPU. At this stage there is a speedup but probably not as much as people hope for and certainly not as much as there will be when the app is further developed and makes greater use of the GPU.

Quote:
If someone can explain this to me that would be great.


You have to be a bit careful when comparing your crunch times to those of someone else, particularly when you don't know how the other person has his machine configured. Yours is a quad core Q6600 and your wingman's is a dual core supposedly at only 2.0GHz. However, your benchmarks (FP/Int) are listed as 2284/7228 whilst your wingman's are 2457/8874. It is highly likely that your wingman's machine is significantly overclocked.

Another possible factor is the performance of the app under different OSes (different compilers and switches), which can make a significant difference. Your wingman is using Darwin. It's interesting to note that you also have a dual core running Darwin, and in the task list of that machine there is an ABP1 task that also took only about 14,000 secs. I don't know, but it looks like the Darwin ABP1 app may be pretty fast.

A third possible factor is that your wingman's task was completed in a single run without stopping. Yours had quite a few stop/starts, particularly towards the end. Click on the taskID for your task and scroll down and you can see each time the task was restarted from a checkpoint. If the app isn't permanently in memory, you will always lose a little bit each time you restart from a saved checkpoint. It isn't very much (checkpoints are just over a minute apart on your machine) so it should only make a relatively small difference but it can mount up if there is a lot of stopping and restarting.

Cheers,
Gary.
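
To put rough numbers on the comparison above, here is a minimal Python sketch that uses only the run times and FP/Int benchmarks quoted in this thread. The restart count and checkpoint interval are placeholder assumptions, not values read from the task log, and the sketch assumes each restart repeats at most one checkpoint interval of work (app startup overhead is ignored).

```python
# Figures quoted in the thread.
gpu_host_runtime = 18920.59    # sec, Q6600 host running the CUDA app
wingman_runtime = 14016.25     # sec, Core 2 Duo P7350 host under Darwin

gpu_host_bench = (2284, 7228)  # (FP, Int) benchmarks of the Q6600 host
wingman_bench = (2457, 8874)   # (FP, Int) benchmarks of the wingman

runtime_ratio = gpu_host_runtime / wingman_runtime
fp_ratio = wingman_bench[0] / gpu_host_bench[0]
int_ratio = wingman_bench[1] / gpu_host_bench[1]
gap = gpu_host_runtime - wingman_runtime

# ASSUMPTIONS (hypothetical): replace with the values from the task page.
assumed_restarts = 20               # number of restarts from a checkpoint
assumed_checkpoint_interval = 70.0  # sec, "just over a minute apart"

# Upper bound on work repeated because of restarts.
max_restart_loss = assumed_restarts * assumed_checkpoint_interval

print(f"run time ratio (GPU host / wingman): {runtime_ratio:.2f}")
print(f"wingman benchmark advantage: FP {fp_ratio:.2f}x, Int {int_ratio:.2f}x")
print(f"run time gap: {gap:.0f} sec, restart loss at most {max_restart_loss:.0f} sec")
```

Under these assumed restart figures, restarts alone would account for only part of the gap, which is consistent with them being a relatively small factor next to the benchmark and OS differences.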

Michael Karlinsky
Joined: 22 Jan 05
Posts: 888
Credit: 23502182
RAC: 0

Hi, I am somewhat pleased

Hi,

I am somewhat pleased with the speedup. My last CPU ABP1 job took 27252.27s. Using a GT9800GT (green) completion times are down to approx. 20000s.

At least I am happy it works at all...

Don't be discouraged by all the negative comments.

Michael
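
For reference, the speedup implied by the two times above can be worked out with a short Python sketch. The GPU time is the approximate 20000 s figure from the post, so the result is only as precise as that estimate.

```python
cpu_time = 27252.27  # sec, last CPU ABP1 job
gpu_time = 20000.0   # sec, approximate completion time with the CUDA app

speedup = cpu_time / gpu_time
time_saved_fraction = 1 - gpu_time / cpu_time

print(f"speedup: {speedup:.2f}x")
print(f"run time reduced by: {time_saved_fraction:.1%}")
```

That works out to roughly a 1.36x speedup, or a bit over a quarter off the run time.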

Paul D. Buck
Joined: 17 Jan 05
Posts: 754
Credit: 5385205
RAC: 0

RE: Hi, I am somewhat

Message 95704 in response to message 95703

Quote:

Hi,

I am somewhat pleased with the speedup. My last CPU ABP1 job took 27252.27s. Using a GT9800GT (green) completion times are down to approx. 20000s.

At least I am happy it works at all...

Don't be discouraged by all the negative comments.


I am happy it works at all also.

But, there has to be an awareness that this current configuration comes at a potentially very high cost to the participant if their goal is to support as much science as possible for the projects to which they have attached.

Assume even two projects, Collatz and EaH ... with the old application I would run Collatz 100% on the GPU and EaH 100% on the CPU. Shift to the new application: now, whenever I run an EaH task, I will not be able to run a Collatz task at all, and the EaH run time is not that significantly affected. In this scenario the computer will be grossly underutilized because when Collatz is running you cannot run EaH and vice versa ... yet the performance improvement is nominal to minimal.

We saw similar effects at GPU Grid for a time, though on my HT machines the HT made the effect less of an issue. It still cost me a core to run 4 GPU Grid tasks, and I was none too happy about that because it meant that the other 40 projects I support were cheated of that opportunity. And all they were doing was polling the GPU to see if it was done yet ... a month or so later a newer version came out, the CPU load became negligible, and all was right with the world.

When EaH gets closer to that, well, I will be more than happy to add EaH to the list of projects that use the GPU, but, till then, I think it is only fair to explain that the use of this application has its downsides as well as its positives ...
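
The trade-off described above can be sketched in Python using the run times quoted earlier in the thread (27252.27 s on one CPU core, roughly 20000 s with the CUDA app). This is a simplified model that assumes the GPU is fully tied up for the whole EaH run and does no other project's work in that time, and that one CPU core is used in both cases; the function name is purely illustrative.

```python
SECONDS_PER_DAY = 86400


def eah_tasks_per_day(seconds_per_task):
    """EaH tasks finished per day by one slot running back to back."""
    return SECONDS_PER_DAY / seconds_per_task


# Old setup: EaH on one CPU core, GPU left free for other projects.
cpu_only = eah_tasks_per_day(27252.27)

# New setup (assumed): EaH holds one CPU core AND the GPU for the whole run.
with_gpu = eah_tasks_per_day(20000.0)

# Extra EaH tasks per day gained by handing over the GPU.
extra = with_gpu - cpu_only

# GPU time given up by other projects for each extra EaH task.
gpu_seconds_per_extra_task = SECONDS_PER_DAY / extra

print(f"EaH tasks/day, CPU only: {cpu_only:.2f}")
print(f"EaH tasks/day, CPU + GPU: {with_gpu:.2f}")
print(f"GPU seconds lost to other projects per extra EaH task: "
      f"{gpu_seconds_per_extra_task:.0f}")
```

Under these assumptions the GPU buys a little over one extra EaH task per day at the cost of a full day of GPU time for other projects, which is the imbalance being pointed out.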

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 686043163
RAC: 586162

Hi! RE: Assume even

Message 95705 in response to message 95704

Hi!

Quote:

Assume even two projects, Collatz and EaH ... with the old application I would run Collatz 100% on the GPU and EaH 100% on the CPU. Shift to the new application: now, whenever I run an EaH task, I will not be able to run a Collatz task at all, and the EaH run time is not that significantly affected. In this scenario the computer will be grossly underutilized because when Collatz is running you cannot run EaH and vice versa ... yet the performance improvement is nominal to minimal.

Remember that not all E@H workunits are GPU now. The "main" search S5R6 in the LIGO detector data (for the direct detection of gravitational waves) is still CPU only, and this will run alongside Collatz (unless that app uses all the available cores). If you would rather not have E@H run on your GPU, there's an option to disable these WUs in the preferences.

CU
Bikeman

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 686043163
RAC: 586162

RE: I don't know but it

Message 95706 in response to message 95702

Quote:
I don't know but it looks like the Darwin ABP1 app may be pretty fast.

It definitely is. For (Intel) Macs, the compiler knows that it will have at least SSE and SSE2 support (the earliest Intel Macs came with the 'Yonah' Core Duo or Core Solo chips). This gives the Darwin app quite an edge over the Linux and Windows apps, which are optimized only for SSE.

The speedup is also in the part of the program that is NOT yet ported to CUDA; that's why the disadvantage of SSE-only optimization is hurting both the CUDA apps and the CPU apps on Windows and Linux. So CUDA-OSX apps for the Macs would really run quite nicely, but it seems Apple (or NVIDIA) is a bit behind in making this work on the Mac, so there is no ABP1 CUDA app for Macs yet ... you can't have it all, I guess.

Another issue is memory bandwidth and cache size. When comparing run times of ABP1 tasks executed on different (say) Core 2 CPUs under the same OS, you'll see that hosts that are nominally faster in terms of clock rate are sometimes beaten by slower ones, if those have fewer cores and/or larger L2 caches.

Parts of the ABP1 app are rather memory intensive and every byte of cache does help. That is probably why, again, ABP1 is quite a bit slower on AMD consumer CPUs (with their typical 512 KB L2 cache per core) than you would expect from benchmarks.

CU
Bikeman
