Improvement on GeForce GTX 570:
HOST: AMD Phenom II X6 1090T + 2 x GTX 570
PCIe 2.0 2 x 16, 2 x 2 GB DDR3-1600, Windows XP Professional x86
running 0.2 CPUs + 0.5 NVIDIA GPUs
running GW Follow-up and Climate on CPU
BRP6 v1.39: 3 CPU cores free (CPU usage 78%)
2 x BRP6 v1.39: 15,050 sec elapsed + 5,700 sec CPU time
GPU temp 80C
BRP6 v1.52: 2 CPU cores free (CPU usage 71%)
2 x BRP6 v1.52: 8,250 sec elapsed + 500 sec CPU time
GPU temp 97C
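The "running 0.2 CPUs + 0.5 NVIDIA GPUs" line is the 2x-per-GPU setting done through BOINC's app_config.xml. As a minimal sketch, assuming the BRP6 application is named einsteinbinary_BRP6 on this host (the name is an assumption; check client_state.xml or the event log for the one your client actually reports), a file like this in the Einstein@Home project directory would produce that allocation:

<app_config>
  <app>
    <name>einsteinbinary_BRP6</name>
    <gpu_versions>
      <!-- 0.5 GPUs per task: two BRP6 tasks share each GPU (2x) -->
      <gpu_usage>0.5</gpu_usage>
      <!-- 0.2 CPUs per task: the CPU fraction BOINC budgets for the support thread -->
      <cpu_usage>0.2</cpu_usage>
    </gpu_versions>
  </app>
</app_config>

Re-reading the config files from the BOINC Manager (or restarting the client) applies the change; the 2x vs. 3x comparison Gary mentions later amounts to setting gpu_usage to 0.5 instead of 0.33.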
RE: GPU temp 97C
This is certainly something you should avoid.
Yes I did.
2 ULTRA KAZE fans (3000 rpm, 133 CFM) push air onto the GPUs and temps drop to 82C.
I just want to show the temperature difference (under the same conditions) between the two app versions:
1.52 runs 17C warmer on the GTX 570.
RE: 2. Why is the Sandy Bridge host significantly better than the Westmere? My impression is that the CPU is less of a factor, so what is hampering the Xeon?

He's reporting improvements, not absolute performance. So either the Westmere is bad, or it was so good to begin with that there was less room for improvement. From his older posts it seems to be the latter. The most probable reason is more PCIe bandwidth (dedicated 16x PCIe 2 lanes on the Westmere vs. 2 x 8x PCIe 2 on the Sandy Bridge), but a larger L3 cache and more main memory bandwidth don't hurt either.
archae86 wrote:
As the season is warming, I may soon throw away some of this performance by throttling to reduce room heating in the sun-afflicted hours, but it is available to me at will.

Side note: I recall you're using TThrottle. Do you know how it throttles? Does it pause the CUDA tasks, or does it reduce the GPU power and/or temperature target? If it's the former, it's inefficient: the GPU runs at full steam for a brief period, at an inefficiently high voltage, because it doesn't know that it should temper itself a bit, and then gets paused to cool down. It would be more efficient to continuously run at a moderate voltage and slightly reduced clocks.
MrS
Scanning for our furry friends since Jan 2002
On the Sandy Bridge vs. Westmere matter, very tentatively I think that as I have them configured, the Westmere may well have superior memory/IO performance, with the Sandy Bridge having superior CPU throughput, or possibly task switch latency.
As the 1.39 application version put a lot of stress on memory/IO, the Westmere won; but with the change to 1.52 that demand was wonderfully reduced, so the latency, or the actual CPU computational capacity of the host when executing the service task, assumed greater relative importance.
(A long time ago, we joked that certain people made hand-waving arguments with such vigor as nearly to achieve lift-off. I'll confess that in this post I'm in that domain.)
Regarding TThrottle: I'm pretty sure it does not use GPU power or temperature targeting. I agree that voltage reduction, where attainable, rather than task interruption would be an energetically much more efficient means. I don't know whether the requisite control interface is readily available to the developer, nor whether he could readily sort out precisely which GPU installations would respond appropriately to that form of direction as he could supply it. Maybe on reflection I'll post an inquiry on those lines to his user board.
Meanwhile, I may want to re-think my goals of shifting thermal load to desired times of day vs. simple overall power efficiency. If I decide my household power consumption is higher than I'd like, but discard interest in heating impact, I should probably pursue interventions to lower GPU operating voltage, by whatever name they are styled. If I want to shift heat away from the most inconvenient parts of the day, an alternative to TThrottle is BOINC's operating-hours-of-the-day facility, though I recall not liking it much the time I tried it. Even here in New Mexico the sun does not shine every day, and, as the year rolls by, the overheated time period shifts anyway.
For those looking at my hosts, currently I have TThrottle running (the overhead is small), but the limits set high enough that it is not intervening. The room housing two of the hosts is, however, rapidly heading into the part of the year when it is too hot in the morning sun, while the third is in the room where it is slowly getting too hot in the afternoon. So once my RAC gets close to catching up to my 1.52 raised production, I'll probably resume some form of power conservation. I have little doubt that the otherwise wonderful beta applications have raised my household monthly energy consumption above my informal target.
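As a side note on the operating-hours facility: besides the website preferences, it can be set per host with a global_prefs_override.xml in the BOINC data directory. A minimal sketch follows; the 22:00 to 06:00 window is only an illustrative assumption, so pick whatever avoids the sun-afflicted hours:

<!-- global_prefs_override.xml: restrict computation to local night hours.
     Only the preferences listed here are overridden; everything else keeps its value. -->
<global_preferences>
  <start_hour>22</start_hour>
  <end_hour>6</end_hour>
</global_preferences>

The client picks this up after re-reading local preferences from the Manager or after a restart; fractional hours (e.g. 6.5 for 06:30) are accepted.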
RE: Regarding TThrottle...
That would be nice :)
I thought of doing so myself, but if you find the time I'd be happy to have delegated the task into competent hands. And to extend my previous post: he would not necessarily have to interface with the video driver himself (I think adjusting those settings is done via NVAPI), but he might be able to cooperate with the author of one of the regular tweak utilities: bundle them with TThrottle and let them take care of interfacing with the driver properly.
Compared to simply modifying the temperature target in those tools, TThrottle could offer further options, like only throttling during specific times of day (as seems useful for you). On the other hand, if you configure your GPUs primarily via a temperature target, things should adjust themselves nicely whenever it gets warm; no further input would be needed.
I'm running my nVidia card at around 1.1 V to stay in a "fairly efficient" operating point. I can't set this directly on Kepler or Maxwell, so I adjust the power target for a given software load so that the result is roughly what I want.
Actually, I had thought about setting it directly. It should be possible if I modify the GPU BIOS to only support boost states up to the desired voltage and then remove the power target limitation. This way I'd lose some performance under light loads, but would gain overall efficiency.
MrS
Scanning for our furry friends since Jan 2002
@Gary regarding your notes on GPU time variability in the Results Only thread here http://einsteinathome.org/node/198004&nowrap=true#139604
I do not see much variability in recent tasks.
I also saw some variability in early times.
Instead of using the last 100 tasks - how do the numbers look using the last 50?
I think (from what HB has been trying to drum into us) that the variability is a function of data 'favourableness' (if that's even a word), in that lots of high-scoring 'toplist' candidates found very early save a lot of time later on by preventing many expensive GPU-CPU memory transfers.
This seems to be showing in the following graphic where you can see early 'unfavourable data' tasks, followed by a string of 'good' data before relapsing back to more unfavourable data. Notice (at around task 25) that the two 'worst' tasks are immediately bracketed by the two 'best' tasks - this hints at some interaction going on there. You can see evidence of this later on too.
Perhaps this is being exacerbated by running more concurrent tasks than the hardware combination in use can comfortably handle. I intend to test this next with the same model GPU in a very similar host (G645 rather than G640), but running 2x rather than 3x. I should be able to find enough data points to give a decent comparison.
Cheers,
Gary.
Gary,
It is worth remembering that we just report Elapsed time, and not what fraction of that time a particular WU (assuming one is running multiplicity greater than 1X) actually is enjoying the services of the GPU.
Imagine two simultaneously resident WUs, one of which demands CPU service substantially more frequently than the other, but for which said CPU service in practice is accomplished in an infinitesimal amount of time (some combination of negligible amount of computation or transfer actually requested, low latency, high throughput of the CPU...). If task management for the GPU tends to leave it servicing the current task until set aside awaiting service, then in the hypothesized case the more frequently demanding task will be reported with a much longer elapsed time on the GPU than the other, even though in the case I've made up, the two consumed a negligibly different amount of resource.
Something at least partially akin to this oversimplified picture seems to be going on: multiple careful observers have noted a sort of base population in ET/CPU time, but also that WUs running on the GPU at the same time as one of the less favored WUs actually report materially shorter ETs than the base population--presumably through no special virtue of their own.
Aside from that, I've noticed in my many adventures of fiddling with process priorities, core affinities, and number of allowed CPU jobs, that for some applications on some hosts, some values of these adjustments would render greatly more equal the ETs for a stream of work which with other settings were far more unequal.
Lots of words, not much of an answer buried in them--but maybe more food for thought on the ET variation issue. Now on 1.47/1.50, I'm convinced that a lot of the variability was driven by a fundamental WU interaction with the code, and the considerable correlation of CPU and ET variation helped give this credibility. I still think there are real "somewhat worse" units for 1.52, but that the strong variation from host to host in 1.52 variability suggests there is some simple "bag squeezing" of the type I sketched above going on also.
RE: It is worth remembering...
Hi Peter,
Thanks for taking the time to contribute. I've seen all the points you refer to, but I don't claim to have any real understanding of the issues involved. I've seen all your previous posts about process priorities, core affinities and the use of Process Lasso, and I certainly respect your expertise with tweaking things the way you have previously documented.
I regard myself as very much a complete novice when it comes to understanding even the most basic things about the inner workings of the kernel and how it handles process scheduling in the complex CPU/GPU hardware/software environment we are using. You use Windows, I use Linux. I imagine just that difference alone could be having quite an effect on the numbers we see when we start studying the variations in some detail.
I see my role as one of presenting data. I hope there will be others with the computer science background I don't have, who can step up and comment on what it all means. I'm sure you will have noticed that I've started rolling out some data on my NVIDIA GPUs. I intend to keep doing that for a while yet. It seems that something new pops up each time I process the next host. Take a look at Host 08, the data for which I've just posted.
Cheers,
Gary.