Stimulated by Stef's examples of posting 1.52 (second beta) results for his 750 Ti, I've processed data from the last two weeks on my GPUs and prepared standardized graphs. As with Stef's, I chose an Elapsed Time vs. CPU Time representation, as it helps illuminate the change from the original application to the first and second betas, and helps illustrate some of the distribution-shape issues.
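For anyone who wants to reproduce this style of chart, here is a minimal sketch of the approach, assuming the task history has been exported from BoincTasks to a CSV file. The file name and column names ("Application", "Elapsed time", "CPU time") are my assumptions about such an export, not guaranteed, so adjust them to match what your export actually contains.

```python
# Minimal sketch: scatter of elapsed time vs. CPU time, one colour per app version.
# Assumes a CSV exported from BoincTasks history; the file name and column names
# below are assumptions and may need renaming to match your actual export.
# Times may also need converting from hh:mm:ss strings to seconds first.
import pandas as pd
import matplotlib.pyplot as plt

history = pd.read_csv("boinctasks_history.csv")  # assumed file name

fig, ax = plt.subplots()
for version, group in history.groupby("Application"):  # e.g. 1.39, 1.50, 1.52
    ax.scatter(group["CPU time"], group["Elapsed time"], s=10, label=str(version))

ax.set_xlabel("CPU time (s)")
ax.set_ylabel("Elapsed time (s)")
ax.legend(title="Application version")
plt.show()
```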
Caveats: two of my three hosts each carry two dissimilar GPUs. These compete with one another for host services and are thus less productive than each would be on a comparable single-GPU host. This penalty may have been reduced markedly with the base-population beta WUs, in which case the degree of improvement in throughput for the four of my five GPUs hosted this way quite likely exceeds what people with single-GPU hosting will see. You can get a hint of this by comparing Stef's 750 Ti data to mine, which were markedly inferior to his on 1.39 but are inferior by a considerably reduced ratio on 1.52.
Rather than post all five images in one enormous post, I'll just post the single GPU host result in this message, then post twice more, one per host.
The first host is a modern dual-core Haswell, with a modern GTX 970 running with a considerable memory clock overclock, and a much more modest GPU clock overclock.
This is my only overclocked GPU, and while it had run 1.39 stably (first Perseus, then Parkes PMPS) since early January, with the 1.50 and 1.52 beta applications it generated validation errors about once a day. I've recently backed down the GPU clock just a little and have yet to see my next validation error.
Comments:
This GPU produced a very tight timing distribution on 1.39. The base-population 1.50 was much improved, but there was an appreciable high-CPU-time, high-elapsed-time tail. There were also "fast outliers" with abnormally short elapsed times, which did well because they got more than 1/3 of the GPU resource when paired with a slow result.
On 1.52 the distribution is pretty tight. There is a modest high tail in elapsed time, but with only a slight elevation in CPU time.
The 1.52 base population CPU time is clearly improved from 1.50, which was already hugely improved from 1.39.
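To quantify that kind of version-to-version comparison rather than eyeball it from the scatter, the same assumed CSV can be summarized per application version. This is again only a sketch, carrying over the hypothetical file and column names from the plotting sketch above.

```python
# Sketch: median elapsed and CPU time per application version, using the same
# assumed BoincTasks CSV and column names as the plotting sketch above.
import pandas as pd

history = pd.read_csv("boinctasks_history.csv")  # assumed file name
summary = (history
           .groupby("Application")[["Elapsed time", "CPU time"]]
           .median()
           .sort_index())
print(summary)  # one row per version, e.g. 1.39 / 1.50 / 1.52
```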
The second host is a quad-core, non-hyperthreaded Ivy Bridge with a modern GTX 750 Ti and an older GTX 660, both running at the stock clocks the cards shipped with.
My first image is for the GTX 750 Ti work on this host.
My second image is for the GTX 660 work on this host.
Comments:
The percentage improvement for the GTX 750 Ti from 1.39 to 1.52 is stunning. It appears that the penalty for running two GPUs on the same host is considerably reduced by the lower CPU usage and PCIe traffic of the beta applications.
While the GTX 660's improvement is proportionately much smaller than the 750 Ti's, it is still highly welcome.
Unlike my 970 case, I have seen no need to adjust clock rates, and have not seen any invalid or computation error problems in the beta work performed by either GPU on this host.
While the capture of data for these graphs by BoincTasks unambiguously labels each result as having been processed by the 660 or the 750, it does not distinguish the case where a result was partially processed by each. This does not happen in normal operation, but it definitely does happen in some cases where work is suspended and resumed. Quite possibly a small number of points on this graph are contaminated by this effect, but not enough to affect the basic character.
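If one wanted to screen out such mixed-GPU results, a rough filter is to scan each task's stderr output for the GPU names it reports. This is only a sketch under assumptions: it supposes each task's stderr has been saved to a text file in one directory, and that the stderr text contains the GPU model name; the directory name and the regular expression are hypothetical and would need adjusting to the real output.

```python
# Rough sketch: flag tasks whose saved stderr mentions more than one GPU model.
# Assumes one stderr dump per task in ./stderr_dumps/ and that the GPU name
# (e.g. "GeForce GTX 660") appears somewhere in the text; both are assumptions.
import re
from pathlib import Path

GPU_NAME = re.compile(r"GeForce GTX \d+(?: Ti)?")

for path in Path("stderr_dumps").glob("*.txt"):
    models = set(GPU_NAME.findall(path.read_text(errors="ignore")))
    if len(models) > 1:
        print(f"{path.name}: processed on more than one GPU: {sorted(models)}")
```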
The third host is a quad-core, hyperthreaded Westmere with a modern GTX 750 (plain vanilla base model, not a Ti, 1 GB memory, not overclocked) and an older GTX 660, both running at the stock clocks the cards shipped with.
My first image is for the GTX 750 work on this host.
My second image is for the GTX 660 work on this host.
Comments:
The percentage improvement for the GTX 750 from 1.39 to 1.52 is again stunning. But it is noticeable that with the 1.52 beta application this base-model 750 is less productive than the moderately overclocked 750 Ti on Stoll7, which was not apparent under 1.39. Actually, on 1.39 the Stoll7 750 Ti logged longer elapsed times than did this 750 on Stoll6. I imagine this is attributable to less drag from waiting for host services on 1.52, and that, to my surprise, the old Westmere host was better able to provide the required host services than the newer Ivy Bridge; I don't think the application suddenly learned to make use of the extra computational resources of the Ti model.
As with Stoll7, on this host I have had no beta-associated invalid results or computation errors, and I have made no changes to clock rates. I did raise the TThrottle maximum permitted GPU temperature on the 660, as I'm not intending to throttle in winter here.
Besides the HD2500-bearing box that I've discussed a bit in this thread, I have two boxes with Nvidia cards. My results are only consistent from 07 March 2015 forward because, on that date, the GTX 650 SC was moved from X2 to X3 and the GTX 960 SSC was installed in X2.
Nvidia machine details:
Name / CPU / RAM / GPU
X2 / Phenom II X4 965 BE oc to 3.8 GHz / 8GB DDR2 800 / GTX 960 SSC factory oc
X3 / Phenom II X4 945 stock 3.0 GHz / 8GB DDR3 1333 / GTX 650 SC + slight further oc
Long story about why the faster CPU has slower RAM but a faster GPU.
Both run Win7-64, host their GPUs in PCIe 2.0 x16 slots, and have 10k RPM primary drives. Both boxes often see non-BOINC activity evenings and weekends, but, so far, BOINC is mostly left running during such use, albeit with TThrottle stepping in for brief periods on X3 during the few warm afternoons. (Spring is arriving here in North Carolina!)
X2 runs 4xWCG and 2xEinstein; X3 runs 4xWCG and 1xEinstein, both sometimes including a demanding 2xCEP2 at the same time. I do not have enough 1.39 left in the results pages to provide truly meaningful figures. However, I can provide the following for 1.52 during times when a user was not present at the keyboard:
X3: run time / CPU time, in seconds (running 1xEinstein)
Average: 11,994.99 / 845.25
Minimum: 11,781.69 / 738.56
Maximum: 12,885.22 / 947.85
Median: 11,918.97 / 843.61
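For what it's worth, figures like these are easy to reproduce from a list of per-task times; here is a minimal sketch, using invented placeholder numbers rather than my actual data.

```python
# Minimal sketch: summary statistics for run time and CPU time (seconds).
# The lists below are invented placeholder values, not the actual X3 results.
from statistics import mean, median

run_times = [11800.0, 11900.0, 12000.0, 12900.0]   # placeholder seconds
cpu_times = [740.0, 840.0, 850.0, 950.0]           # placeholder seconds

for label, times in (("run", run_times), ("cpu", cpu_times)):
    print(f"{label}: avg {mean(times):,.2f}  min {min(times):,.2f}  "
          f"max {max(times):,.2f}  median {median(times):,.2f}")
```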
Simultaneous active non-BOINC, mostly-CPU use seems to have only a minor effect on run time or CPU time. The most time I saw added to the CPU time during periods when I know a user was at the keyboard was consistently around 1,000 seconds, and that was on X2, which gets much heavier non-BOINC use. Run times are stretched much more if the non-BOINC use in question demands much of the GPU (e.g. a "casual" game at 1920x1200 resolution). Of course, I begrudgingly suspend BOINC if the user wants to run a high-resolution, GPU-intensive game (e.g. a 3D shooter).
In addition, my WCG tasks run slightly faster/more efficiently alongside 1.52 tasks.
I consider all this a telling compliment to the optimization, even if informal.
And I don't mind the increase in points I'm getting, both from the better-optimized units and from adding the 960 SSC on 07 March 2015.
Thanks again for the flow of hard facts and really useful visualizations.
It will be interesting to see what kind of improvement CUDA 5.5 will make on top of this once we begin to use it.
We are currently busy preparing additional work for the GW run, so I'm not sure when exactly we can roll out the CUDA 5.5 beta test version.
HB
Thanks Holmis, archae86, AgentB, Jeroen for the data and charts in the results thread!
Bill
Any estimate of if/when you will promote the current Beta app to the standard runs?
Not sure yet. As I said, the GW search needs most of our attention at the moment. For those participating in the beta test this should make no difference (except you will drop in the stats because everyone else will be catching up ;-) )
HB
I have opted in for the Beta, and it seems these tasks have a much higher priority than standard units. That's no problem as long as the Beta tasks, besides serving as a test, crunch real data (not dummies). Is that so?
I have also noticed that when running a single WU on a Tesla K20, the GPU usage seems to fluctuate from 0% to >90% every few seconds. Not sure if that is normal behavior...
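If you want to put numbers on that fluctuation, one option (just a sketch, not anything the project recommends) is to poll nvidia-smi's utilization query once a second and log the result. The query fields used here (utilization.gpu, utilization.memory) are standard nvidia-smi options; the polling interval, duration, and output handling are arbitrary illustrative choices.

```python
# Sketch: log GPU core and memory-controller utilization once per second
# using nvidia-smi's CSV query mode. Interval and duration are arbitrary choices.
import subprocess
import time

CMD = ["nvidia-smi",
       "--query-gpu=timestamp,name,utilization.gpu,utilization.memory",
       "--format=csv,noheader"]

for _ in range(60):                      # sample for about one minute
    print(subprocess.check_output(CMD, text=True).strip())
    time.sleep(1)
```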
I'm not providing the level of detail required for the "Results" thread so I'll just post up my observations here ...
2015.03.14: Win7x64, GTX480 pcie2x16, 4670K @3.9 GHz, 16 GB RAM @1600.
I started doing a little testing yesterday morning, and it seems like the latest beta (1.52) has evened out some of the variable behavior of how the 1.39 beta handled the input/output data. I've only completed a few WUs running 1x concurrent: ~90% GPU utilization, ~60% GPU memory bandwidth, about 2-3% CPU utilization (so about 8-10% of one core); looking pretty good so far. Unfortunately Nvidia Inspector does not report PCIe bandwidth for the GTX480, but by all accounts from other crunchers here at Einstein this is no longer the primary bottleneck. Next up I'll see if putting an OC on the card scales linearly or if GPU memory becomes the next gating factor.
2015.03.15: I saw little improvement (not linear) when the GPU was OC'd, but I am seeing decent overall efficiency when running 2x concurrent WUs.
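A quick way to judge whether 2x concurrent is actually a win is to compare tasks-per-day throughput rather than raw elapsed time. The sketch below just encodes that arithmetic; the example elapsed times are invented, not measurements from my cards.

```python
# Sketch: compare daily throughput of 1x vs. Nx concurrent GPU tasks.
# The elapsed times below are invented examples, not measured values.
def tasks_per_day(concurrent_tasks: int, elapsed_seconds: float) -> float:
    """Tasks finished per day if `concurrent_tasks` run together,
    each taking `elapsed_seconds` of wall-clock time."""
    return concurrent_tasks * 86400.0 / elapsed_seconds

single = tasks_per_day(1, 4500.0)   # hypothetical 1x elapsed time
double = tasks_per_day(2, 8000.0)   # hypothetical 2x elapsed time
print(f"1x: {single:.1f}/day, 2x: {double:.1f}/day, gain {double / single - 1:.1%}")
```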
2015.03.15: Win7x64, GTX660Ti, GTX670 both at pcie2x8, 980X @3.2 GHz, 16 GB RAM @1600.
I launched 1x concurrent on each card (both cards in the 980X box) at stock clocks.
As expected, the 660Ti struggled with GPU memory at around 74%, but things looked OK; I was busy, so went off to do *other* things. I checked in a little over an hour later and discovered that the GPU usage was very sporadic, at times even causing the driver to step down the power state. At this point (50% estimated runtime at 1:20) I decided to free another CPU core to see if I could even things out a bit (I'm running multiple ATLAS, LHC, and vLHC tasks). Usage looks more consistent now; I'll check back in another hour to see whether it truly helped or not.
--------------------------
- Crunch, Crunch, Crunch -
--------------------------