Times (Elapsed/CPU) for BRP6-Beta-cuda55 compared to BRP6-cuda32 - Results and Discussion Thread

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109396393365
RAC: 35787398
Topic 198140

[EDIT] I have moved the posts below from the BRP6 announcement thread in "Technical News" to this new thread. The reason for doing so is that we can freely post and discuss results for the new BRP6-Beta-cuda55 app without cluttering up any of Bernd's future announcements. No doubt these will come as the app is released on more platforms. I picked this post of mine as a convenient starting point. [/EDIT]

As a test, I've just rebranded a couple of 1.52s to 1.54 and changed the plan class to BRP6-Beta-cuda55. The client seems to have no issue running these with the beta app. I'll be quite interested to see what happens at the server end when the first 1.52->1.54 result is returned.

The GPU is a 550Ti in a Pentium dual core host running 2xFGRP4 and 3xBRP6. The current GPU task mix is 2x1.52 and one 1.54 (a rebranded 1.52). By watching the %completed figure over a long enough period, the 1.52 tasks are incrementing at about 0.00347% per second whilst the 1.54 task is visibly faster at 0.00432% per second. That's very much in the 20-25% range. Because they are running 3x, individual tasks were taking more than 7 hours, so I expect the new app will drop at least an hour or so off that.
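
For what it's worth, the 20-25% figure is just the ratio of those two eyeballed rates. A trivial sketch of the arithmetic (the rates are my rough observations, nothing more rigorous):

[pre]
# Rough arithmetic only - the rates are eyeballed %/s figures, not a benchmark.
rate_cuda32 = 0.00347   # % completed per second, 1.52 (BRP6-cuda32) tasks
rate_cuda55 = 0.00432   # % completed per second, rebranded 1.54 (BRP6-Beta-cuda55) task
speedup = rate_cuda55 / rate_cuda32
print(f"apparent speedup: {speedup:.3f}x (~{(speedup - 1) * 100:.0f}% faster)")
# roughly 1.245x, i.e. about 24-25% faster, hence the 20-25% range quoted above
[/pre]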

Thanks very much for releasing this very welcome improvement.

Cheers,
Gary.

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7023714931
RAC: 1805481

Times (Elapsed/CPU) for BRP6-Beta-cuda55 compared to BRP6-cuda32

Quote:
The current GPU task mix is 2x1.52 and one 1.54 (a rebranded 1.52). By watching the %completed figure over a long enough period, the 1.52 tasks are incrementing at about 0.00347% per second whilst the 1.54 task is visibly faster at 0.00432% per second. That's very much in the 20-25% range.


Watching a mixed load is a flawed way to estimate performance improvement. If part of the change results in less frequent demand for CPU services, the average active time slice for the new code instance will be longer, so it gets more than a 1/3 share of the available GPU resource.

I'm eager to see your observations on a non-mixed load, and I'm shortening my prefetch queues in the hope that a Windows version might come out soon.

I also hope that someone with a mixed pre-Maxwell/Maxwell1 (750)/Maxwell2 fleet will report on the relative improvements of Maxwell types vs. classic flavors. While true Maxwell support is apparently a CUDA7 feature, there have been some hints that Maxwells may respond more to the CUDA updates than some of the other types.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109396393365
RAC: 35787398

RE: Watching a mixed load

Quote:
Watching a mixed load is a flawed way to estimate performance improvement.


Sure, but if you're entering an unknown pond, it's far safer just to dip your toe in first! :-).

Having dipped my toe in and found the experience pleasurable, I thought I'd grab some basic 'first impressions' of the improvement and report them, even if they turn out to be a bit 'rubbery' :-). I'll let just one 1.52->1.54 task crunch and get reported to make sure the server will accept it. I don't particularly want to crunch a whole bunch and have them all rejected. It's 12:30 AM here, so in the morning I'll switch to all 1.54 rebrands if the first one is successful.

Cheers,
Gary.

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0

After 18 hours of steady beta

After 18 hours of steady beta crunching, some results are in....

So a copy and paste from the BRP6 thread - deja vu...

Compared against the same 18-hour time interval:

a) 72 hours earlier for 1.52, and
b) 48 hours earlier for 1.52.

HOST NN - 4918234

[pre]
CPU: Intel(R) Core(TM) i3 CPU 530 @ 2.93GHz [Family 6 Model 37 Stepping ***1
Cores/Threads: 2/4
Motherboard: Gigabyte GA-H55M-UD2H
PCIe slot PCIEX16 v2.0m PCIX4 ***2
1st GPU: nVidia GTX-460 768MB (MSI) PCIe v2.0 x16 - system monitor
2nd GPU: nVidia GTX-460 768MB (MSI) PCIe v1.0 x4 - no monitor
3rd GPU: -
RAM: 2 x 2GB DIMM 1333 MHz (0.8 ns)
Concurrency: 2 tasks per GPU (ie share 0.5 GPU)
CPU Tasks: 1 xS6BucketFU2-1.01-X64

Free CPU cores: 4
OS: Ubuntu 2.6.32 ***1 ***7
Driver: 349.12 ***3
BOINC Version: 6.10.17 ***1 ***7

              Elapsed Time Statistics (s)         CPU Time Statistics (s)
             ------------------------------   ------------------------------  Sample
Search         Min    Mean     Max  Std Dev     Min    Mean     Max  Std Dev    Size  Notes / Comments
===========  =====  ======  ======  =======   =====  ======  ======  =======  ======  ================
GPUS-1.52a)  14033   15375   16623      581     592     656     787       57      17
GPUS-1.52b)  14053   15633   19157     1383     591     687     900       96      16  ***6 large max value
GPUS-1.54    15154   15233   15316       65     552     588     659       32      17

[/pre]

COMMENTS

  * ***1 Information is already available from host information, or by checking a task for that host.
  * ***2 PCIe slot config needed PER GPU CARD.
  * ***3 Can be obtained from a task.
  * ***6 Outlier in CPU / elapsed time.
  * ***7 Don't laugh - while you've been updating I've been crunching; may be time to upgrade...
  * GPU temp unchanged.
  * No overclocking.
  * Figures are the 2 GPUs consolidated; both are completing within 5% of each other.
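
In case anyone wants to pull the same kind of summary out of their own task lists, a minimal sketch of the idea in Python (the values below are made-up placeholders, not the data in the table above):

[pre]
# Minimal sketch: summarise per-task times the same way as the table above.
# The sample lists are hypothetical placeholders, not my actual task data.
from statistics import mean, stdev

def summarise(label, times_s):
    print(f"{label:<14} min={min(times_s):6d}  mean={mean(times_s):7.0f}  "
          f"max={max(times_s):6d}  stddev={stdev(times_s):7.1f}  n={len(times_s)}")

elapsed_154 = [15154, 15180, 15233, 15290, 15316]   # elapsed seconds (placeholders)
cpu_154     = [552, 570, 588, 610, 659]             # CPU seconds (placeholders)
summarise("1.54 elapsed", elapsed_154)
summarise("1.54 CPU", cpu_154)
[/pre]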

So I'm seeing CPU times significantly down and elapsed times showing only a minor improvement.

I'm wondering if upping to 0.33 will make for an improvement.

[Edit]
And the good news is tasks are validating.

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7023714931
RAC: 1805481

AgentB wrote:1st GPU:

AgentB wrote:

1st GPU: nVidia GTX-460 768MB (MSI) PCIe v2.0 x16 - system monitor
2nd GPU: nVidia GTX-460 768MB (MSI) PCIe v1.0 x4 - no monitor

So I'm seeing CPU times significantly down and elapsed times showing only a minor improvement.


The throughput improvement from 1.54 that you are reporting here is very, very modest.

Possibly the GTX-460 GPUs will turn out to benefit much less from this particular change than some others, especially if the later CUDA versions were more about accommodating newer architectures and less about general efficiency improvements.

By the way, I have a couple of out-of-service GTX-460s here, pulled from my hosts when I changed to 660s for better power efficiency. If someone would like to crunch here with them, I'll respond to a PM with favorable terms for USA destinations.

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0

RE: Possibly the GTX-460

Quote:
Possibly the GTX-460 GPUs will turn out to benefit much less from this particular change than some others, especially if the later CUDA versions were more about accommodating newer architectures and less about general efficiency improvements.

Maybe, although if you look over the CUDA 5.5 manual (large download) here, it mentions that the OS is right on the borderline of what is supported, but it does not mention many Kepler/Fermi performance differences from 5.0 to 5.5.

I'm not sure why the 5.5 performance difference is expected. My guess is cuFFT, but that is just a guess; maybe one of the devs can explain.

I think I will wait for some other results to come in on other OS / driver / hardware combinations before fettling. This host reached 50M this month, so it is due for retirement.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109396393365
RAC: 35787398

RE: So I'm seeing CPU

Quote:
So I'm seeing CPU times significantly down and elapsed times showing only a minor improvement.


Thanks for taking the trouble to produce the stats. I think archae86 may well be correct when he mentions that the degree of improvement could be quite architecture dependent. Only time will tell.

Quote:
I'm wondering if upping to 0.33 will make for an improvement.


Since your GPUs show as having 768MB RAM, you might not get 3 tasks to run without issues. I can get 3 to run in 1GB, but the improvement over 2x is small.

Quote:
[Edit]
And the good news is tasks are validating.


My 1.52->1.54 finished overnight and is "waiting for validation". It took 22,512s to complete. It's still being shown as 1.52 as Bernd predicted. The previous 50 tasks on that machine had completion times ranging from 23,500s to 31,000s with the bulk of tasks around the 27,000s mark.

I've just now converted all the remaining 1.52s (30 tasks) to 1.54s and restarted BOINC. With a nice search-and-replace function it only takes a minute or two.
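
For anyone curious, the rebrand boils down to a couple of string substitutions. A rough sketch of the idea only (it assumes the edits go into client_state.xml at a typical Linux location, and the exact tag strings shown are assumptions rather than my exact procedure; stop the BOINC client first, keep a backup, and check that a global replace won't hit some other app at the same version number):

[pre]
# Rough sketch only: rebrand 1.52 (BRP6-cuda32) entries to 1.54 (BRP6-Beta-cuda55)
# by simple string substitution. Path and tag strings are assumptions; stop the
# BOINC client and keep a backup before touching client_state.xml.
from pathlib import Path

state = Path("/var/lib/boinc-client/client_state.xml")   # assumed default location
text = state.read_text()
state.with_name(state.name + ".bak").write_text(text)    # backup copy first

text = text.replace("<version_num>152</version_num>",
                    "<version_num>154</version_num>")
text = text.replace("<plan_class>BRP6-cuda32</plan_class>",
                    "<plan_class>BRP6-Beta-cuda55</plan_class>")
state.write_text(text)
[/pre]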

Cheers,
Gary.

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7023714931
RAC: 1805481

My hope to see some more

My hope to see some more modern results seems to be coming true on a GTX 760 host belonging to Phil.

With just four 1.54 completions it is early days yet (and I am mindful of my own admonition about avoiding mixed-load transition work), but the four completions reported so far average 11,878 seconds of reported run time, vs. an average of 14,050 seconds for twenty recent consecutive 1.52 units. Taken at face value (14,050 / 11,878 = 1.18), this is an 18% productivity improvement. Maybe the GTX 760 sees more improvement than the GTX 460. This card is listed as using the GK104 chip, also used by the GTX 660, 670, 680, 690 and GTX 770, so perhaps, when more results are available for this host, it may be predictive for those cards, and perhaps for other Kepler cards as well.

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7023714931
RAC: 1805481

Phil's set of four hosts give

Phil's set of four hosts gives some welcome data on this comparison. Happily, he uses a pretty short queue, so data for an appreciable amount of 1.54 work is already available for all four hosts.

In my last jobs before retirement I spent a great deal of time comparing data sets, and in cases like this I am extremely fond of the probplot as a data representation that lets one spot many of the things one needs to see in order to avoid being misled by a simplistic comparison. I've prepared one probplot for Phil's 760.

While this plot shows that something about this host's operating conditions creates a high tail of times, the effect has a somewhat similar impact on both the 1.52 and 1.54 data, so it is not unreasonable to take the means ratio, which shows a 1.235 productivity improvement, as a decent first-day estimate.

and another which combines data from Phil's three 750s.

There is a hint of modes, perhaps suggesting my combination of the three hosts is a bit rough and ready, but it still appears likely that the productivity improvement of 1.27 indicated by the means ratio for these 750s is not far wrong.

Leaping into the dark a bit, it seems pretty likely to me that many Linux hosts running Kepler and Maxwell1 (750 or 750 Ti) cards will see around a 25% productivity improvement from this change, which so far as I know is simply an upgrade from using CUDA32 to using CUDA55. As Maxwell-specific support comes later in the CUDA sequence, this positive initial result might, in the case of the Maxwell cards, reasonably be hoped to improve further were it possible to move on to CUDA7 (gross speculation on my part).

For those (maybe all of you) unfamiliar with probplots: the data shown are the individual completion times in seconds, sorted in ascending order, with the actual time on the X axis and the Z-score on the Y axis. A change of 1 in Z corresponds to one standard deviation, so Z = 0 marks the median. A normally distributed population will plot as a simple diagonal line from lower left to upper right.
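
If anyone wants to roll their own, the construction is simple. A minimal sketch (not my actual plotting tool; the (i - 0.5)/n plotting position is just one common convention, and the times are hypothetical):

[pre]
# Minimal sketch of the probplot construction described above: completion times
# sorted ascending on X, the normal quantile (Z-score) of each rank's plotting
# position on Y. Not my actual tooling; the sample times are hypothetical.
from statistics import NormalDist

def probplot_points(times_s):
    xs = sorted(times_s)
    n = len(xs)
    nd = NormalDist()
    return [(x, nd.inv_cdf((i - 0.5) / n)) for i, x in enumerate(xs, start=1)]

for t, z in probplot_points([11500, 11700, 11800, 11900, 12100, 13800]):
    print(f"{t:6d}  z = {z:+.2f}")
# A roughly normal population traces a straight diagonal from lower left to upper right.
[/pre]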

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7023714931
RAC: 1805481

For comparison, the same

For comparison, the same presentation of AgentB's previously reported 460, using the 1.54 returns available as of a few minutes ago compared with an equal number of immediately preceding 1.52 returns, shows a rather different sort of picture.

While these two populations, compared by mean, show essentially equivalent productivity, as previously reported (nominally about a 2% improvement for 1.54), the distribution shapes differ wildly, with far less sample-to-sample variation in the 1.54 population than in the 1.52 population. AgentB had reported this by showing a much lower std dev, but from his text summary I had not really envisioned how very different the distribution shapes actually were.

Since, for Parkes PMPS work, I believe the actual work content varies negligibly from WU to WU, I generally take elapsed time variability as telling me something about operating conditions on the host. It seems curious that the 1.54 difference apparently leads to much more closely matched elapsed times on AgentB's 460 host, while not having a remotely similar effect on Phil's hosts. I, personally, suspect this may have more to do with host configuration differences than GTX architecture version differences.

Lastly, the kink near the 50th percentile of AgentB's 1.54 data plot probably arises from mismatch of the two cards on his host, which on his account differ in the grade of PCIe service employed.

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0

Thanks archae86 they are

Thanks archae86, they are really insightful. I have the logfiles for the GTX 460 host going back a long way, so I may have a look to see if the BRP6 v1.52 times started becoming variable at some point.

My transition from 1.52 to 1.54 only involved aborting some old 1.52 tasks and switching to non-beta for a short time to avoid 1.53. There may be better stable 1.52 results going back a few days further. There was no restart (17 days uptime) and no other config changes.

Quote:
There is a hint of modes, perhaps suggesting my combination of the three [Phil's 750] hosts is a bit rough and ready

This chart also seems to show three distinct 1.54 clusters; are they one for each host, or just a coincidence?
