Hyperthreading and Task number Impact Observations

ForumsCruncher's Corner

archae86
archae86
Joined: 6 Dec 05
Posts: 1800
Credit: 374893612
RAC: 678146
Topic 195511

Over the next couple of days I plan to generate some observations on the execution time and aggregate throughput impact of switching my Westmere E5620 between hyperthreaded and not, and of varying the number of simultaneously executing Einstein HF 3.06 tasks.

Westmere is mostly a 32nm Xeon flavor of the classic Nehalem 4-core design, but with a 12 M L3 cache. For these tests I'll leave it running as it has been lately, with a moderate overclock of 3.42 GHz, with 4 Gbyte of RAM running at default settings (and one more Gbyte plugged in not currently actually recognized by the BIOS--oops).

My current intentions--subject to revision if I have a better thought or get better advice here:

1. One "measurement task" for each condition, will be started only after the remaining tasks for the condition are up and running, and will all be chosen from the same HF frequency 1373.90.

2. I'll turn off most overhead tasks of the more frivolous sort, and not use the system for personal work during the timed runs, but leave running my Kaspersky AV (which does hurt a bit during that nasty slow startup phase, but I consider essential).

The test conditions I definitely plan to log are:
HT 8
HT 1
nHT 4
nHT 1

I might fill in some of the intermediate task count points, say perhaps HT 4 and HT 6, but probably not all of them.

For each condition I think I'll show CPU time for the task, CPU time for the task relative to the HT_8 case, and implied system throughput relative to the HT_8 case. Also system input power.

Why bother? On the negative side, since these results are rather strongly influenced by the particular CPU design, by the application being run, by some other system design and configuration details, and by the other code executing on the system they won't generalize very far.

But I think some of the results may surprise people--for example those who may expect a single Einstein task running HT to run about as fast as the single nHT case, or people who expect a full doubling of aggregate throughput with HT application.

My target is not mostly the regular posters here, for whom few of my results will be surprising, but mostly others who drop by who may be less well informed.

I don't think thread starters own threads, but I'd be perfectly pleased if others with useful observation data saw fit to add to this thread.

While my timing in starting this thread and project was influenced by this other thread I don't regard this as an answer to or continuation of that one.

archae86
archae86
Joined: 6 Dec 05
Posts: 1800
Credit: 374893612
RAC: 678146

Hyperthreading and Task number Impact Observations

I'm planning to post results by doing a screen capture of a bit of Excel spreadsheet, posting to my Photobucket account, and linking the image here.

Translation--the image below should change over time as I measure new observations or correct old ones. The divide by zero errors will mostly disappear once I've observed the primary reference of hyperthreaded eight parallel Einstein tasks.

In general I'll make any further comments in later posts, but for this one I'll observe that the non-hyperthreaded single task case at 13483 seconds is rather a lot below the recent typical values (at hyperthreaded 8 tasks) for this host running near 22,500 CPU seconds at the same clock rate, RAM parameters, and other operating parameters.

archae86
archae86
Joined: 6 Dec 05
Posts: 1800
Credit: 374893612
RAC: 678146

archae86 wrote:But I think

archae86 wrote:

But I think some of the results may surprise people--for example those who may expect a single Einstein task running HT to run about as fast as the single nHT case

My target is not mostly the regular posters here, for whom few of my results will be surprising, but mostly others who drop by who may be less well informed


While I can't speak for other regular posters, a result here surprised me.

The single task execution time running hyperthreaded was so close to that running nHT that I cannot confidently assert the difference was not just WU to WU natural variation. I had expected a large disadvantage for the HT case.

The observed result is of course what one would like and naively expect--with nothing to do on the "other half" of a core running HT you would of course want no context switching or other overhead to occur. But on my previous main machine with HT some years ago, I formed the strong impression that single task execution was considerably slower with HT enabled than not. I assumed that was still true for Nehalem and have a dedicated BIOS setting group aimed to support my audio processing which is nHT because I thought my (largely single-threaded) audio processing would go faster (I don't run BOINC when I'm doing audio).

Possibly I was mistake then, or possibly the Intel HT implementation in the Nehalem architecture is dramatically superior to that in the Gallatin (Northwood-derived with big cache) machine I used to own. Considering how unfortunate some other aspects of the whole Willamette-descended set of designs were, this would not surprise me.

So it is even more crucial to assure that performance data are taken with appropriate simultaneous workload than I thought. To the problems of conflict for RAM and cache resources, one must add the fact that an underloaded HT host can actually provide dramatically more computation per charged CPU second than a loaded one.

Richard Haselgrove
Richard Haselgrove
Joined: 10 Dec 05
Posts: 1721
Credit: 66611304
RAC: 58167

Could not the difference be

Could not the difference be that your Gallatin was a single-core processor, so everything else running on the computer (including the OS itself and OS background tasks) would require a context switch.

But with four physical cores available in your Westmere, and with it being only lightly loaded, a clever operating system could keep one 100% utilisation task running on one core without context switches, and distribute the housekeeping tasks around the other three cores as necessary: it could even be running as effectively a seven-core computer, with 3xHT handling the non-computationally-intensive tasks, and 1xnHT for the busy one?

archae86
archae86
Joined: 6 Dec 05
Posts: 1800
Credit: 374893612
RAC: 678146

Richard Haselgrove

Richard Haselgrove wrote:
Could not the difference be that your Gallatin was a single-core processor, so everything else running on the computer (including the OS itself and OS background tasks) would require a context switch.


But the same system running nHT still needed to process interrupts and make context switches for those same things. And a big part of the claim for HT is that it supports a sort of context switch between the two threads sharing a core at any given moment that is immensely faster than a standard context switch. So at least some should have gone faster, and I don't see why there would be a penalty for all the others unless the scheme somehow forced frequent thread to thread switches even though there was not work for the other thread. Something like that is what I've assumed, but not from inside information, but the behavior I thought I'd seen.

Quote:
a clever operating system could keep one 100% utilisation task running on one core without context switches

I've noticed that Windows 7 on my application load seems far more inclined to leave a process on a core for a while than Windows XP Pro. Not sure Windows 7 qualifies as clever in this respect.

Mike Hewson
Mike Hewson
Joined: 1 Dec 05
Posts: 5100
Credit: 42002535
RAC: 7459

RE: The single task

Quote:
The single task execution time running hyperthreaded was so close to that running nHT that I cannot confidently assert the difference was not just WU to WU natural variation. I had expected a large disadvantage for the HT case.


Based on the work we did a couple a years ago ( Ready Reckoner et al ), if still valid, then one could easily get variation of the order of a third ( average variation, measured extremes were from as low as ~15% to as high as ~45% ) of the run time due to stepping through phase space at a given frequency ( sinusoids etc ). That's clearly of the order of +/- HT effect we expect anyway .....

[ specifically, if true, this implies nHT processing getting lucky with a 'short' WU with correspondingly disadvantaged HT processing working on a 'long' WU, to explain your finding ]

Cheers, Mike.

"I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal

tear
tear
Joined: 12 Sep 10
Posts: 9
Credit: 9914974
RAC: 0

Running (up to) N/2 tasks* on

Running (up to) N/2 tasks* on machine with N HT CPUs yields pretty much same
performance as running same number of tasks with HT disabled**. What's surprising
about it?

*) task, as in "non-MPI CPU/memory intensive task"

**) on condition that no two tasks share sibling HT CPUs

Mike Hewson
Mike Hewson
Joined: 1 Dec 05
Posts: 5100
Credit: 42002535
RAC: 7459

RE: What's surprising about

Quote:
What's surprising about it?


For many, not alot. For others, a credulity issue .... :-)

Cheers, Mike.

"I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal

archae86
archae86
Joined: 6 Dec 05
Posts: 1800
Credit: 374893612
RAC: 678146

Mike Hewson wrote:if still

Mike Hewson wrote:
if still valid, then one could easily get variation of the order of a third of the run time due to stepping through phase space at a given frequency ( sinusoids etc )

My impression is that the proportionate execution time variability of the current work load is much less than it was in that era.

Here is a histogram of recent 144 results of 3.04 ap work done on that same host running HT 8 tasks, 3.42 GHz, but suffering excess variation from handling my daily personal computing workload.

I agree that the "difference" I saw between nHT 1 task and HT 1 task is likely within random variation, but I don't think the current random variation is anything at all close to the third of the run time you mention.

Mike Hewson
Mike Hewson
Joined: 1 Dec 05
Posts: 5100
Credit: 42002535
RAC: 7459

RE: Mike Hewson wrote:if

Quote:
Mike Hewson wrote:
if still valid, then one could easily get variation of the order of a third of the run time due to stepping through phase space at a given frequency ( sinusoids etc )

My impression is that the proportionate execution time variability of the current work load is much less than it was in that era.

Here is a histogram of recent 144 results of 3.04 ap work done on that same host running HT 8 tasks, 3.42 GHz, but suffering excess variation from handling my daily personal computing workload.

I agree that the "difference" I saw between nHT 1 task and HT 1 task is likely within random variation, but I don't think the current random variation is anything at all close to the third of the run time you mention.


Fair enough. I thought that might well be so, as the variation 'back then' related to sinusoidal function 'look-ups' and like issues, which has undergone optimisation ( or become less relevant ) since. The phase space is right ascension and declination, effectively considered as orthogonal co-ordinates, but to un-Doppler a signal you still need to resolve components to detector/Earth frame ie. trigonometry. At least that's how I remember it. :-)

Cheers, Mike.

( edit ) Nice curve too. To a first attempt you'd model that as normally distributed :

[pre]normalising_const * exp[-((ordinate - mean_measure)/spread_measure)^2][/pre]
If so then you have an underlying random variable with no especial 'preference' related to the task at hand*. Asynchronous ( with respect to WU processing ) interruptions would explain that nicely .....

( edit ) Mean is ~ 22460, standard deviation is ~ 148. Average absolute residual ( of actual WU's per run-time bracket ) from Gaussian prediction is ~ 3.1 or around 10% of the peak. Close enough ... certainly believable for that sample size.

( edit ) This means that I am saying that the WU 'interruptions' account for around +/- 2% of their run-times ( 3 standard deviations/average ). So this is way less than the 'sequence number' effect studied in days of old. But you could have guessed that. I couldn't remember any more Excel-Fu ..... :-)

( edit ) Actually yet another reason for intrinsic WU variation not affecting your group of 144, is that at ~ 1500 Hz : each frequency is going to have well over 1000 WU's to plow through. [ We found earlier that the number of work-units per sequence-number-cycle at a specific frequency went quadratically with frequency, as you have to plod through the phase space more finely ]. With E@H's use of locality scheduling there is a mighty tendency for a given host ( especially a fast one ) to be given near consecutive sequence numbers, thus your example of 144 WU's may not sample much of any ( if existing ) sinusoidal variation in run-times from that cause. In fact looking at your Xeon's first 20 tasks in the current 'In progress' list ( a page full ) that's exactly what's happening. So I'll shut up now, having demonstrated that this is definitely not relevant to the HT exploration here. :-) :-)

( edit ) * Sorry, quiet night-shift! I didn't explain that if it was predominantly sequence-number related you'd get a low-skewed ( to the left ) concave-up curve, and not a convex-down symmetric bell shape, as most WU's would cluster at the sinusoid trough ( shorter run-times ). If there is any skew asymmetry in your data it is definitely to the right.

"I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal

archae86
archae86
Joined: 6 Dec 05
Posts: 1800
Credit: 374893612
RAC: 678146

I've been surprised yet

I've been surprised yet again. The most recent addition to the result table in the second post of this thread covers the case of 4 tasks running with hyperthreading turned on.

Recalling that the single task HT case surprised me by having a tiny (and possibly noise rather than truth) deficit to nHT single task, one might have guessed that this case would very closely approximate the 4 task nHT result.

It does not--while the HT_4 case gives considerably lower CPU times than does the HT_8, by far more than RAM or cache conflict might be expected to produce, the shortfall to nHT_4 is quite large. Importantly it is large compared to the plausible random noise stemming from task-to-task execution difficulty and system loading variation.

So a BOINC person seeking to "keep threads free" from BOINC by setting a maximum number of CPUs below the (virtual) number available) but keeping HT turned on hoping that in quite times within nothing going on the system it will at perform about as well on BOINC as the same system running the same number of tasks nHT seems on Nehalem architecture to lose very little for one task, but quite a lot for 4. With my poor prediction record so far, I probably should not guess how this would be for six, but my guess is that the loss from perfection will continue to grow on the larger core count Nehalems unless they have a specific logic upgrade aimed at this problem.

One other point, and a new picture embedded here of data: For the HT_4 case, I was able to get two other results to run in the nearly identical conditions as the intended test subject. They enjoyed the same reduction of normal load from background tasks and foreground usage, the same environment of HT with 3 companion 3.06 tasks, and in fact ran in parallel with primary measured task for all save about three minutes of offset. I deliberately chose tasks of differing frequency and sequence, hoping to raise the chance of catching systematic execution variation. What I actually got was very, very close matching.

Combining the sort of Big Picture variation from the histogram I posted a few posts back, with the bottom up better controlled (but much smaller data set) evidence here, I think the case is pretty well made that task to task systematic CPU time variation may be usually quite low for 3.06 HF work in the near neighborhood of 1373 frequency. If Bikeman or anyone else can add some understanding or data on current Einstein result execution time systematic variation I'd be pleased.

(edit: after I wrote this paragraph I noticed that Mike Hewson had added considerable updates to his original comments on my histogram. An appropriate modification to my claim here is to say that I think that the current overall systematic variation may be far less than the old days, but that in any case the restricted set of results actually being compared here, being all from frequency 1373.90, and spanning a sequence number range only from 1000 to 1022 probably contributed very little measurement noise stemming from systematic execution work variation to the reported comparisons)

But for this little study I think the available evidence supports the following for a 4-core current generation Nehalem type CPU running near my system's operating point and running Einstein Global Correlations S5 HF search #1 v3.06 (S5GCESSE2):

1. With the system allowed to run all the parallel tasks it can, enabling HT raises total throughput appreciably, with the nHT system giving only a little over 75% as much BOINC throughput.

2. For the extreme case of restricting BOINC to a single task, it appears that for a system otherwise very, very lightly loaded there is little disadvantage to running HT--mostly likely between 1 and 2% loss of BOINC throughpu.

3. Probably the loss incurred by running a restricted number of tasks HT instead of nHT grows with number of tasks. For the 4 task case this loss is substantial with the HT system giving only about 87% as much BOINC throughput as the nHT case.

While power consumption increases with number of tasks running, the overall "greenness" for the fixed clock rate fixed voltage case considered here consistently improves with more tasks run and with higher BOINC throughput. If one limits tasks to no more than the number of physical cores present, then both throughput and power efficiency will be best running with hyperthreading disabled.

While I'd expect the general trends to hold across a fair range of clock rates and for both the two channel and three-channel RAM variants, it is important to note that while my host is equipped with three channels, the BIOS reported only two to be recognized and active during these tests. I expect that any three-channel variant fully populated and working properly will suffer less degradation in execution times with increasing number of tasks than seen here.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.