Times (Elapsed / CPU) for BRP5/6/6-Beta on various CPU/GPU combos - DISCUSSION Thread

Mumak
Joined: 26 Feb 13
Posts: 325
Credit: 3,441,229,582
RAC: 1,971,621

Is there any chance to reduce

Is there any chance to reduce CPU usage even more for the OpenCL app ?

-----

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3,522
Credit: 695,111,198
RAC: 123,051

RE: Is there any chance to

Quote:
Is there any chance to reduce CPU usage even more for the OpenCL app ?

Yes, the final step of re-sorting the toplists of candidates (if new candidates were found) after each template iteration could be done on the GPU as well.

The benefit from this would be small, though, on most systems, and it would require quite some time for coding and testing. Originally we never even considered this, but that's the "curse" of optimizing your code: you optimize one thing, and then something else that was insignificant in run time (in relative terms) suddenly becomes much more relevant...
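For illustration only, here is a toy C sketch of the kind of per-iteration toplist maintenance being described. This is not the actual Einstein@Home source; the Candidate struct, TOPLIST_SIZE and all names are invented. It just shows the CPU-side work (merge a new candidate into a fixed-size toplist, then re-sort) that the proposed optimization would move onto the GPU:

[pre]/* Hedged sketch, not the project's real code: after each "template
 * iteration" any new candidate is merged into a fixed-size toplist of
 * the strongest candidates and the list is re-sorted on the CPU. */
#include <stdio.h>
#include <stdlib.h>

#define TOPLIST_SIZE 8               /* invented size for the example */

typedef struct {
    double detection_stat;           /* higher = stronger candidate   */
    long   template_index;
} Candidate;

/* qsort comparator: strongest candidates first (descending) */
static int cmp_desc(const void *a, const void *b)
{
    double da = ((const Candidate *)a)->detection_stat;
    double db = ((const Candidate *)b)->detection_stat;
    return (da < db) - (da > db);
}

/* Replace the weakest entry if the new candidate beats it, then re-sort.
 * This is the step that currently runs on the CPU once per template
 * iteration whenever a new candidate makes it into the list. */
static void toplist_insert(Candidate *list, const Candidate *c)
{
    if (c->detection_stat > list[TOPLIST_SIZE - 1].detection_stat) {
        list[TOPLIST_SIZE - 1] = *c;
        qsort(list, TOPLIST_SIZE, sizeof(Candidate), cmp_desc);
    }
}

int main(void)
{
    Candidate toplist[TOPLIST_SIZE] = {{0}};

    /* fake "template iterations" producing pseudo-random candidates */
    for (long t = 0; t < 10000; t++) {
        Candidate c = { (double)rand() / RAND_MAX, t };
        toplist_insert(toplist, &c);
    }

    for (int i = 0; i < TOPLIST_SIZE; i++)
        printf("%2d: stat=%.6f template=%ld\n",
               i, toplist[i].detection_stat, toplist[i].template_index);
    return 0;
}[/pre]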

HB

Mumak
Joined: 26 Feb 13
Posts: 325
Credit: 3,441,229,582
RAC: 1,971,621

RE: Yes, the final step of

Quote:
Yes, the final step of re-sorting the toplists of candidates (if new candidates were found) after each template iteration could be done on the GPU as well.

The benefit from this would be small, though, on most systems, and it would require quite some time for coding and testing. Originally we never even considered this, but that's the "curse" of optimizing your code: you optimize one thing, and then something else that was insignificant in run time (in relative terms) suddenly becomes much more relevant...

HB

Is this step already performed on the GPU in the CUDA apps, so the proposed optimization for OpenCL is in this step?
With OpenCL v1.52 I see a rather constant, higher CPU usage (~55%), while on CUDA it's much lower (<10%). So I'm not sure whether this is a given 'feature' of the OpenCL app, or whether there is a way to reduce it.

-----

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2,143
Credit: 2,927,881,888
RAC: 764,494

RE: Is this step performed

Quote:
Is this step already performed on the GPU in the CUDA apps, so the proposed optimization for OpenCL is in this step?
With OpenCL v1.52 I see a rather constant, higher CPU usage (~55%), while on CUDA it's much lower (<10%). So I'm not sure whether this is a given 'feature' of the OpenCL app, or whether there is a way to reduce it.


In general, across multiple BOINC projects, it appears to be a 'feature' of the OpenCL development environment and runtime support, which includes an intermediate compilation step to allow running on the specific hardware target. But I'd be interested in hearing the developer viewpoint on this too, and any news - as opposed to speculation - on changes to the CPU overhead as the OpenCL development/runtime environment matures.
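To make the compile-step point concrete, here is a minimal, self-contained C sketch (not Einstein@Home code; the kernel and all names are invented) of the host-side start-up that every OpenCL app goes through. clBuildProgram() invokes the driver's compiler on the CPU at run time, something CUDA apps that ship pre-built binaries largely avoid:

[pre]/* Hedged sketch of generic OpenCL host-side start-up, for illustration only. */
#include <stdio.h>
#include <CL/cl.h>

static const char *kernel_src =
    "__kernel void scale(__global float *buf, float f) {"
    "    size_t i = get_global_id(0);"
    "    buf[i] *= f;"
    "}";

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_int err;

    err = clGetPlatformIDs(1, &platform, NULL);
    if (err != CL_SUCCESS) { fprintf(stderr, "no OpenCL platform\n"); return 1; }

    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    if (err != CL_SUCCESS) { fprintf(stderr, "no GPU device\n"); return 1; }

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);

    /* Run-time compilation: the driver's compiler runs here, on the CPU. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kernel_src, NULL, &err);
    err = clBuildProgram(prog, 1, &device, "", NULL, NULL);
    if (err != CL_SUCCESS) {
        char log[4096];
        clGetProgramBuildInfo(prog, device, CL_PROGRAM_BUILD_LOG,
                              sizeof(log), log, NULL);
        fprintf(stderr, "build failed:\n%s\n", log);
        return 1;
    }
    printf("kernel compiled at run time for the installed GPU\n");

    clReleaseProgram(prog);
    clReleaseContext(ctx);
    return 0;
}[/pre]

(Build with something like gcc example.c -lOpenCL.) The constant CPU load seen while crunching is more likely the runtime busy-waiting on kernel completion than this one-off compile, but both come from the same driver/runtime layer rather than from the science code itself.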

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3,522
Credit: 695,111,198
RAC: 123,051

Right, what I was referring

Right, what I was referring to was the CPU load caused by the app itself, and that should indeed be identical for OpenCL and CUDA apps of the same version (it is the exact same code executing on the CPU). The CPU overhead of the OpenCL runtime & driver is a different thing. I have only a very limited number of hosts with AMD GPUs for first-hand experience, but I agree that this kind of overhead seems to be a bit higher for OpenCL apps.

Cheers
HB

Mumak
Joined: 26 Feb 13
Posts: 325
Credit: 3,441,229,582
RAC: 1,971,621

Thanks for confirming. So

Thanks for confirming. So it's an OpenCL issue and there's probably nothing that can be done...

-----

archae86
Joined: 6 Dec 05
Posts: 3,157
Credit: 7,183,434,931
RAC: 779,157

I did much of the data

I did much of the data crunching to generate posts for four more GPUs in the results thread, but felt the moment of maximum interest in detail might have passed. I did, however, do one small additional computation that might be of some interest here: the percentage improvement in indicated GPU productivity by host, going from Parkes v1.39 to v1.52:

[pre]Host    GPU        multiplicity  paired?  1.52prod/1.39prod
Stoll8  GTX 970    3X            No       1.59
Stoll7  GTX 660    2X            Yes      1.41
Stoll7  GTX 750Ti  2X            Yes      1.59
Stoll6  GTX 660    2X            Yes      1.26
Stoll6  GTX 750    2X            Yes      1.38[/pre]
Comments:
1. While these are all Nvidia GPUs on Windows 7 hosts, the improvement ratio going from Parkes pre-beta to second beta varied rather substantially.
2. The three Maxwell GPUs (970, 750) improved by substantially more than did the two Keplers (660).
3. The cards running on a host more nearly able to keep them busy (Stoll6, a Westmere Xeon with three memory channels) gained less than similar cards running on a host apparently less able to provide low-latency support (Stoll7, a Sandy Bridge with two memory channels).
4. The improvement ratios shown here are derived from reported elapsed time ratios on samples believed large enough to give good accuracy (the sketch below spells out the arithmetic). RAC has responded mightily, but has many days to go before stabilizing.
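For anyone wondering exactly what the last column of the table means, here is a tiny C sketch of the arithmetic. The elapsed times in it are placeholders, not the measured values behind the table: with the multiplicity unchanged between versions, productivity per GPU scales as multiplicity / elapsed time, so the ratio reduces to elapsed(1.39) / elapsed(1.52).

[pre]/* Hedged sketch of the "1.52prod/1.39prod" arithmetic; the numbers are
 * placeholders, not measured data. */
#include <stdio.h>

int main(void)
{
    double multiplicity  = 3.0;       /* concurrent tasks per GPU         */
    double elapsed_139_s = 10000.0;   /* hypothetical mean elapsed, v1.39 */
    double elapsed_152_s =  6300.0;   /* hypothetical mean elapsed, v1.52 */

    double prod_139 = multiplicity * 86400.0 / elapsed_139_s;  /* tasks/day */
    double prod_152 = multiplicity * 86400.0 / elapsed_152_s;

    printf("v1.39: %.2f tasks/day  v1.52: %.2f tasks/day  ratio: %.2f\n",
           prod_139, prod_152, prod_152 / prod_139);
    return 0;
}[/pre]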

In a more speculative light, if very recent CUDA variants might be expected to be friendlier to Maxwell than older ones, the improvement advantage of the 970/750 GPUs over the 660s might go even higher if Parkes code on a sufficiently recent CUDA variant makes it out to users. Of course, it is also possible that the Maxwell disadvantage (in BOINC relative productivity compared to game performance) may be due to architectural unsuitability to this task, not to a lack of suitable code.

Details aside, I'll say once again that the improvement from this batch of changes is quite remarkable, being over 25% on the least favorably affected of my five GPUs, and much more than that on average over my flotilla, which has gone from about 230,000 credit/day to 340,000 purely on this single improvement. As the season is warming, I may soon throw away some of this performance by throttling to reduce room heating in the sun-afflicted hours, but it is available to me at will.

MAGIC Quantum Mechanic
Joined: 18 Jan 05
Posts: 1,860
Credit: 1,358,432,480
RAC: 1,552,527

My OC'd 660Ti and 560Ti are

My OC'd 660Ti and 560Ti are running the BRP6 v1.52 quite a bit faster and a bit warmer, so I have to keep an eye on that since I caught the 560Ti getting into the upper 80s C... so I cool the room down so it will stay in the lower 70s.

The 660Ti is running in the mid to low 60s C. The 560Ti just needs a better fan since the stock fans died; I just laid a small 4in AC fan on the card's heatsink since I had one lying around and never want to shut them down if I can keep them running (it would run in the high 60s/low 70s C before, with BRP6 v1.39).

Always running tasks X3

This laptop with the 610M is pretty much the same running X2

I have a couple OC'd 650Ti's and one is still finishing up the BRP6 v1.39

The other one is a little faster with the BRP6 v1.52 and temp stayed the same (mid 50's C)

Of course, different CPUs can make things run differently (also depending on whether you have other tasks running at the same time; all of mine also run vLHC X2, and I run Atlas X2 on the host with the 660Ti).

The 560Ti is paired with a 3-core Phenom running BRP6 v1.52 X3 and vLHC X2, and is actually still running faster than my other ones by quite a bit.

Bill592
Joined: 25 Feb 05
Posts: 786
Credit: 70,825,065
RAC: 0

Not Bad Samson ! You will see

Not bad, Samson!
You will see a large jump in RAC soon!

Bill

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,870
Credit: 116,096,518,730
RAC: 35,964,306

RE: I did much of the data

Quote:
I did much of the data crunching to generate posts for four more GPUs to the results thread, but felt the moment of maximum interest in detail might have passed ...


Maximum interest from the point of view of ... "Wow, this optimised app is really great!!" ... but I don't think so from the point of view of the Devs. I'm sure HB is pretty pleased that his ideas have been so spectacularly successful, but I also tend to think that he is interested in ongoing reports from a wider range of hardware types - perhaps those combinations that aren't working quite so well, as opposed to those that are working brilliantly.

Quote:
... but did one small additional computation that might be of some interest here: the percentage improvement in indicated GPU productivity by host, going from Parkes v1.39 to v1.52: ....


This is exactly the sort of thing I'm talking about -- in fact two things.

1. Why is the improvement for the 750Ti so much better than the improvement for the 660? Is it just due to Maxwell, or is it something else?
2. Why did the Sandy Bridge host improve significantly more than the Westmere? My impression is that the CPU is less of a factor, so what is hampering the Xeon?

I've had other commitments for several days now but I'm hoping to find time to publish more results soon. All my NVIDIA GPUs are Kepler (650, 650Ti - or earlier) and my impression is that they (like your 660s) haven't done quite as well as some of my AMD 7850s. That's only a guess at this stage, I might get a surprise when I actually get the numbers :-).

Cheers,
Gary.
