Times (Elapsed / CPU) for BRP5/6/6-Beta on various CPU/GPU combos - DISCUSSION Thread

Mumak
Joined: 26 Feb 13
Posts: 325
Credit: 3,441,229,582
RAC: 1,971,621

Is there any chance to reduce

Is there any chance to reduce CPU usage even more for the OpenCL app ?

-----

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3,522
Credit: 695,111,198
RAC: 123,051

RE: Is there any chance to

Quote:
Is there any chance to reduce CPU usage even more for the OpenCL app ?

Yes, the final step of re-sorting the toplists of candidates (if new candidates were found) after each template iteration could be done on the GPU as well.

The benefit from this would be small, though, on most systems, and it would require quite some time for coding and testing. Originally we never even considered this, but that's the "curse" of optimizing your code: you optimize one thing, and then something else that was insignificant in run time (in relative terms) suddenly becomes much more relevant...
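For illustration only, here is a toy C sketch of the kind of per-iteration toplist maintenance being described. This is not the actual Einstein@Home source; the Candidate struct, TOPLIST_SIZE and all names are invented. It just shows the CPU-side work (merge a new candidate into a fixed-size toplist, then re-sort) that the proposed optimization would move onto the GPU:

[pre]/* Hedged sketch, not the project's real code: after each "template
 * iteration" any new candidate is merged into a fixed-size toplist of
 * the strongest candidates and the list is re-sorted on the CPU. */
#include <stdio.h>
#include <stdlib.h>

#define TOPLIST_SIZE 8               /* invented size for the example */

typedef struct {
    double detection_stat;           /* higher = stronger candidate   */
    long   template_index;
} Candidate;

/* qsort comparator: strongest candidates first (descending) */
static int cmp_desc(const void *a, const void *b)
{
    double da = ((const Candidate *)a)->detection_stat;
    double db = ((const Candidate *)b)->detection_stat;
    return (da < db) - (da > db);
}

/* Replace the weakest entry if the new candidate beats it, then re-sort.
 * This is the step that currently runs on the CPU once per template
 * iteration whenever a new candidate makes it into the list. */
static void toplist_insert(Candidate *list, const Candidate *c)
{
    if (c->detection_stat > list[TOPLIST_SIZE - 1].detection_stat) {
        list[TOPLIST_SIZE - 1] = *c;
        qsort(list, TOPLIST_SIZE, sizeof(Candidate), cmp_desc);
    }
}

int main(void)
{
    Candidate toplist[TOPLIST_SIZE] = {{0}};

    /* fake "template iterations" producing pseudo-random candidates */
    for (long t = 0; t < 10000; t++) {
        Candidate c = { (double)rand() / RAND_MAX, t };
        toplist_insert(toplist, &c);
    }

    for (int i = 0; i < TOPLIST_SIZE; i++)
        printf("%2d: stat=%.6f template=%ld\n",
               i, toplist[i].detection_stat, toplist[i].template_index);
    return 0;
}[/pre]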

HB

Mumak
Joined: 26 Feb 13
Posts: 325
Credit: 3,441,229,582
RAC: 1,971,621

RE: Yes, the final step of

Quote:
Yes, the final step of re-sorting the toplists of candidates (if new candidates were found) after each template iteration could be done on the GPU as well.

The benefit from this would be small, though, on most systems, and it would require quite some time for coding and testing. Originally we never even considered this, but that's the "curse" of optimizing your code: you optimize one thing, and then something else that was insignificant in run time (in relative terms) suddenly becomes much more relevant...

HB

Is this step already performed on the GPU in the CUDA apps, so the proposed optimization for OpenCL is in this step?
With OpenCL v1.52 I see a rather constant, higher CPU usage (~55%), while on CUDA it's much lower (<10%). So I'm not sure whether this is a given 'feature' of the OpenCL app, or whether there is a way to reduce it.

-----

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2,143
Credit: 2,927,881,888
RAC: 764,494

RE: Is this step performed

Quote:
Is this step already performed on the GPU in the CUDA apps, so the proposed optimization for OpenCL is in this step?
With OpenCL v1.52 I see a rather constant, higher CPU usage (~55%), while on CUDA it's much lower (<10%). So I'm not sure whether this is a given 'feature' of the OpenCL app, or whether there is a way to reduce it.


In general, across multiple BOINC projects, it appears to be a 'feature' of the OpenCL development environment and runtime support, which includes an intermediate compilation step to allow running on the specific hardware target. But I'd be interested in hearing the developer viewpoint on this too, and any news - as opposed to speculation - on changes to the CPU overhead as the OpenCL development/runtime environment matures.
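To make the compile-step point concrete, here is a minimal, self-contained C sketch (not Einstein@Home code; the kernel and all names are invented) of the host-side start-up that every OpenCL app goes through. clBuildProgram() invokes the driver's compiler on the CPU at run time, something CUDA apps that ship pre-built binaries largely avoid:

[pre]/* Hedged sketch of generic OpenCL host-side start-up, for illustration only. */
#include <stdio.h>
#include <CL/cl.h>

static const char *kernel_src =
    "__kernel void scale(__global float *buf, float f) {"
    "    size_t i = get_global_id(0);"
    "    buf[i] *= f;"
    "}";

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_int err;

    err = clGetPlatformIDs(1, &platform, NULL);
    if (err != CL_SUCCESS) { fprintf(stderr, "no OpenCL platform\n"); return 1; }

    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    if (err != CL_SUCCESS) { fprintf(stderr, "no GPU device\n"); return 1; }

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);

    /* Run-time compilation: the driver's compiler runs here, on the CPU. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kernel_src, NULL, &err);
    err = clBuildProgram(prog, 1, &device, "", NULL, NULL);
    if (err != CL_SUCCESS) {
        char log[4096];
        clGetProgramBuildInfo(prog, device, CL_PROGRAM_BUILD_LOG,
                              sizeof(log), log, NULL);
        fprintf(stderr, "build failed:\n%s\n", log);
        return 1;
    }
    printf("kernel compiled at run time for the installed GPU\n");

    clReleaseProgram(prog);
    clReleaseContext(ctx);
    return 0;
}[/pre]

(Build with something like gcc example.c -lOpenCL.) The constant CPU load seen while crunching is more likely the runtime busy-waiting on kernel completion than this one-off compile, but both come from the same driver/runtime layer rather than from the science code itself.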

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3,522
Credit: 695,111,198
RAC: 123,051

Right, what I was referring

Right, what I was referring to was the CPU load caused by the app itself, and that should indeed be identical for OpenCL and CUDA apps of the same version (it is the exact same code executing on the CPU). The CPU overhead of the OpenCL runtime & driver is a different thing. I have only a very limited number of hosts with AMD GPUs for first-hand experience, but I agree that this kind of overhead seems to be a bit higher for OpenCL apps.

Cheers
HB

Mumak
Joined: 26 Feb 13
Posts: 325
Credit: 3,441,229,582
RAC: 1,971,621

Thanks for confirming. So

Thanks for confirming. So it's an OpenCL issue and there's probably nothing that can be done...

-----

archae86
Joined: 6 Dec 05
Posts: 3,157
Credit: 7,183,434,931
RAC: 779,157

I did much of the data

I did much of the data crunching to generate posts for four more GPUs in the results thread, but felt the moment of maximum interest in detail might have passed. I did, however, do one small additional computation that might be of some interest here: the percentage improvement in indicated GPU productivity by host, going from Parkes v1.39 to v1.52:

[pre]Host    GPU        multiplicity  paired?  1.52prod/1.39prod
Stoll8  GTX 970    3X            No       1.59
Stoll7  GTX 660    2X            Yes      1.41
Stoll7  GTX 750Ti  2X            Yes      1.59
Stoll6  GTX 660    2X            Yes      1.26
Stoll6  GTX 750    2X            Yes      1.38[/pre]
Comments:
1. While these are all Nvidia GPUs on Windows 7 hosts, the improvement ratio going from Parkes pre-beta to second beta varied rather substantially.
2. The three Maxwell GPUs (970, 750) improved by substantially more than did the two Keplers (660).
3. The cards running on a host more nearly able to keep them busy (Stoll6, a Westmere Xeon with three memory channels) gained less than similar cards running on a host apparently less able to provide low-latency support (Stoll7, a Sandy Bridge with two memory channels).
4. The improvement ratios shown here are derived from reported elapsed time ratios on samples believed large enough to give good accuracy (the sketch below spells out the arithmetic). RAC has responded mightily, but has many days to go before stabilizing.
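For anyone wondering exactly what the last column of the table means, here is a tiny C sketch of the arithmetic. The elapsed times in it are placeholders, not the measured values behind the table: with the multiplicity unchanged between versions, productivity per GPU scales as multiplicity / elapsed time, so the ratio reduces to elapsed(1.39) / elapsed(1.52).

[pre]/* Hedged sketch of the "1.52prod/1.39prod" arithmetic; the numbers are
 * placeholders, not measured data. */
#include <stdio.h>

int main(void)
{
    double multiplicity  = 3.0;       /* concurrent tasks per GPU         */
    double elapsed_139_s = 10000.0;   /* hypothetical mean elapsed, v1.39 */
    double elapsed_152_s =  6300.0;   /* hypothetical mean elapsed, v1.52 */

    double prod_139 = multiplicity * 86400.0 / elapsed_139_s;  /* tasks/day */
    double prod_152 = multiplicity * 86400.0 / elapsed_152_s;

    printf("v1.39: %.2f tasks/day  v1.52: %.2f tasks/day  ratio: %.2f\n",
           prod_139, prod_152, prod_152 / prod_139);
    return 0;
}[/pre]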

In a more speculative light, if very recent CUDA variants might be expected to be friendlier to Maxwell than older ones, the improvement advantage of the 970/750 GPUs over the 660s might go even higher if Parkes code on a sufficiently recent CUDA variant makes it out to users. Of course, it is also possible that the Maxwell disadvantage (in BOINC relative productivity compared to game performance) may be due to architectural unsuitability to this task, not to a lack of suitable code.

Details aside, I'll say once again that the improvement from this batch of changes is quite remarkable, being over 25% on the least favorably affected of my five GPUs, and much more than that on average over my flotilla, which has gone from about 230,000 credit/day to 340,000 purely on this single improvement. As the season is warming, I may soon throw away some of this performance by throttling to reduce room heating in the sun-afflicted hours, but it is available to me at will.

MAGIC Quantum Mechanic
Joined: 18 Jan 05
Posts: 1,860
Credit: 1,358,432,480
RAC: 1,552,527

My OC'd 660Ti and 560Ti are

My OC'd 660Ti and 560Ti are running the BRP6 v1.52 quite a bit faster and a bit warmer, so I have to keep an eye on that since I caught the 560Ti getting into the upper 80s C... so I cool the room down so it will stay in the lower 70s.

The 660Ti is running in the mid to low 60s C. The 560Ti just needs a better fan since the stock fans died; I just laid a small 4in AC fan on the card's heatsink since I had one lying around and never want to shut them down if I can keep them running (it would run in the high 60s/low 70s C before, with BRP6 v1.39).

Always running tasks X3

This laptop with the 610M is pretty much the same running X2

I have a couple OC'd 650Ti's and one is still finishing up the BRP6 v1.39

The other one is a little faster with the BRP6 v1.52 and temp stayed the same (mid 50's C)

Of course, different CPUs can make things run differently (also depending on whether you have other tasks running at the same time; all of mine also run vLHC X2, and I run Atlas X2 on the host with the 660Ti).

The 560Ti is paired with a 3-core Phenom running BRP6 v1.52 X3 and vLHC X2, and is actually still running faster than my other ones by quite a bit.

Bill592
Joined: 25 Feb 05
Posts: 786
Credit: 70,825,065
RAC: 0

Not Bad Samson ! You will see

Not bad, Samson!
You will see a large jump in RAC soon!

Bill

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,870
Credit: 116,096,518,730
RAC: 35,964,306

RE: I did much of the data

Quote:
I did much of the data crunching to generate posts for four more GPUs to the results thread, but felt the moment of maximum interest in detail might have passed ...


Maximum interest from the point of view of ... "Wow, this optimised app is really great!!" ... but I don't think so from the point of view of the Devs. I'm sure HB is pretty pleased that his ideas have been so spectacularly successful, but I also tend to think that he is interested in ongoing reports from a wider range of hardware types - perhaps those combinations that aren't working quite so well, as opposed to those that are working brilliantly.

Quote:
... but did one small additional computation that might be of some interest here: the percentage improvement in indicated GPU productivity by host, going from Parkes v1.39 to v1.52: ....


This is exactly the sort of thing I'm talking about -- in fact two things.

1. Why is the improvement for the 750Ti so much better than the improvement for the 660? Is it just due to Maxwell, or is it something else?
2. Why did the Sandy Bridge host improve significantly more than the Westmere? My impression is that the CPU is less of a factor, so what is hampering the Xeon?

I've had other commitments for several days now but I'm hoping to find time to publish more results soon. All my NVIDIA GPUs are Kepler (650, 650Ti - or earlier) and my impression is that they (like your 660s) haven't done quite as well as some of my AMD 7850s. That's only a guess at this stage, I might get a surprise when I actually get the numbers :-).

Cheers,
Gary.
