Hyperthreading and Task Number Impact Observations

archae86
Joined: 6 Dec 05
Posts: 3160
Credit: 7261301899
RAC: 1545908

RE: Since your machine is

Quote:
Since your machine is rigged to be on the rather light side of task load ( compared to 'typical' use ) - what about the number you see on the 'Processes' tab of Task Manager?


I construe your inquiry to apply to the conditions for my tests, for which I noted that I shut down some things but not all (I think I mentioned leaving my antivirus running, for example).

Just now I exited BOINC and shut down the same things I was shutting down for the tests, then looked at Task Manager's process count and saw 41. Then I started boincmgr, waited a little while for it to spawn things, and saw 61 as the Task Manager process count. That would be boincmgr, boinc, 8 Einstein executables, plus 8 instances of Console Window Host and two more I failed to spot.

The non-BOINC stuff showing in general has very low CPU use and pretty low context switch delta counts, but it is not nothing.

Robert
Robert
Joined: 5 Nov 05
Posts: 47
Credit: 324009351
RAC: 23885

I've been collecting data on

I've been collecting data on the new Intel i7-2600K for quite a while and thought these particular S5 measurements fit well here in this thread on hyper-threading. Yes, I realize that the S5 run just finished, but these results are applicable to the S6 run also. In fact, I was just finishing verifying a few data points as I ran out of my final S5 work units. Only S5 gravity wave jobs were run during this test collection, no BRP jobs.

A couple of notes on hardware details: I used an i7-2600K processor clocked at 5 different frequencies, paired with 2 x 4 GB DDR3-1600 memory modules. Hyper-threading was enabled at all times; SpeedStep and Turbo modes were disabled to ensure a consistent frequency for each core. Ubuntu was the operating system. For each point along the curve, I collected data and averaged the time required to complete a single S5 work unit. Any end cases with vastly different times were thrown out. Daily RAC for each point on the curve [0..8] was estimated for S5 jobs with the formula below.

N = Number of simultaneous Threads = [0..8]
RAC = 251 credits * N * (seconds in day / average single work unit time for N)

Interesting observations: you can see the clear transition point where hyper-threading kicks in at 5 threads, and work has a direct relationship to clock speed.
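For anyone wanting to reproduce the estimate, here is a minimal Python sketch of that formula. The 251 credits per S5 task comes from the post above; the example thread count and timing are placeholders, not the measured data.

```python
# Minimal sketch of the RAC estimate above (251 credits per S5 task, per the post).
# The example thread count and timing are placeholders, not the measured data.
SECONDS_PER_DAY = 86400
CREDITS_PER_TASK = 251

def estimated_daily_rac(n_threads, avg_task_seconds):
    """Daily RAC for n_threads concurrent tasks, each taking avg_task_seconds of wall-clock time."""
    return CREDITS_PER_TASK * n_threads * (SECONDS_PER_DAY / avg_task_seconds)

# Hypothetical example: 4 threads averaging 20,000 s per work unit.
print(estimated_daily_rac(4, 20000))  # ~4337 credits/day
```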

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6591
Credit: 323648764
RAC: 174685

RE: I've been collecting

Quote:
I've been collecting data .... work has a direct relationship to clock speed.


What a brilliant set of observations! Thank you very kindly for collecting, analysing and presenting that here. :-) :-)

Yes, the trends are clear. Let it be our benchmark for HT thinking.

By eye I can see that the relation to clock speed ( all else held the same ) could be modelled as linear to a close fit.

The 'knee' at 4 HT cores is vivid. Indeed the RAC benefit per extra/added core ( the slope of the curves, ~ 2000 initially ) halves thereafter to ~ 1000. Which is near as stuff all to 2:1 ..... so there's the swapping overhead arising.
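To make the 'slope halves at the knee' comparison concrete, here is a small sketch. The rac_by_threads values are purely illustrative numbers matching the ~2000 and ~1000 per-core gains quoted above, not Robert's actual measurements.

```python
# Sketch of the pre- vs post-knee slope comparison. The rac_by_threads values are
# illustrative (matching the ~2000 and ~1000 per-core gains noted above), not real data.
rac_by_threads = {1: 2000, 2: 4000, 3: 6000, 4: 8000, 5: 9000, 6: 10000, 7: 11000, 8: 12000}

def mean_gain(points):
    """Average RAC gained per added thread over consecutive points."""
    gains = [b - a for a, b in zip(points, points[1:])]
    return sum(gains) / len(gains)

pre_knee = [rac_by_threads[n] for n in range(1, 5)]   # 1..4 jobs: one per physical core
post_knee = [rac_by_threads[n] for n in range(4, 9)]  # 4..8 jobs: hyper-threading territory

print("gain per added job before the knee:", mean_gain(pre_knee))   # 2000
print("gain per added job after the knee:", mean_gain(post_knee))   # 1000
print("ratio:", mean_gain(pre_knee) / mean_gain(post_knee))         # 2.0
```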

Again, thanks for the work on that! :-)

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

ML1
Joined: 20 Feb 05
Posts: 347
Credit: 86563414
RAC: 51

RE: RE: I've been

Quote:
Quote:
I've been collecting data ....

... Yes, the trends are clear. Let it be our benchmark for HT thinking.


Indeed, very nice clear results. Thanks for sharing.

Quote:
By eye I can see the relation to clock speed ( all else held same ) could be modelled as linear to close fit.


That suggests a nicely balanced system, or one where the memory bandwidth comfortably exceeds what the CPU needs for these tasks. That is, CPU processing is the limiting factor: the memory is fast enough to let the CPU run at 100% utilisation on the CPU-critical paths.

Quote:
The 'knee' at 4 HT cores is vivid. Indeed the RAC benefit per extra/added core ( the slope of the curves, ~ 2000 initially ) halves thereafter to ~ 1000. Which is near as stuff all to 2:1 ..... so there's the swapping overhead arising. ...


I don't interpret that as swapping, unless you really mean 'process thread interleaving'. Intel's "Hyper-threading" uses additional state registers/logic to allow two process threads to share the same pool of processing units within a (physical) CPU core.

You are certainly getting a useful increase in throughput with the HT.

Happy fast crunchin',
Martin

See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6591
Credit: 323648764
RAC: 174685

RE: ..... unless you really

Quote:
..... unless you really mean 'process thread interleaving' ....


No, I don't especially. Unless/until we have any information as to how ( or indeed if ) his Linux machine's process scheduler handles affinity, I'll leave it as a generic 'swap' concept. See earlier discussions in this thread.
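A minimal Linux-only sketch of how one could inspect ( or pin ) the affinity of the running Einstein processes, so the "is affinity maintained?" question can be answered by observation. The process-name filter and the commented-out pinning call are assumptions, to be adapted to the actual binary name and setup.

```python
# Minimal Linux-only sketch: list the CPUs each Einstein process is currently allowed to
# run on. The process-name fragment is an assumption; adjust it to the actual binary name.
import os
import subprocess

def einstein_pids(name_fragment="einstein"):
    """Return PIDs whose command line contains name_fragment (uses pgrep -f)."""
    out = subprocess.run(["pgrep", "-f", name_fragment], capture_output=True, text=True)
    return [int(p) for p in out.stdout.split()]

for pid in einstein_pids():
    print(pid, "allowed on CPUs:", sorted(os.sched_getaffinity(pid)))
    # To pin a task to one logical CPU (needs suitable privileges), one could do e.g.:
    # os.sched_setaffinity(pid, {0})
```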

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 581801485
RAC: 138259

RE: so there's the swapping

Quote:
so there's the swapping overhead arising.

You probably already know everything I'm going to say now, but this wording leads to misunderstanding, even if you meant the right thing.

If HT is used, 2 tasks are run on one core at the same time. That is not only at the same time for the observing user (which is how running 2 threads with a 50% time share each would look), but also at the same time for the CPU, clock for clock if you will. This is totally independent of OS scheduling and everything people normally associate with "swapping".

Speed per task does drop when HT is used because, although both threads have their own registers and such (the "core components" of the core), they have to share the caches and, most importantly, the execution units. That's the whole point of HT: making better use of the execution units for relatively little extra die space.

What you seem to be talking about is OS scheduling, where the scheduler reassigns tasks to specific cores at typically ~1 ms intervals (on Windows), which is an eternity compared to the CPU clock ;)

MrS

Scanning for our furry friends since Jan 2002

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6591
Credit: 323648764
RAC: 174685

RE: RE: so there's the

Quote:
Quote:
so there's the swapping overhead arising.

You probably already know everything I'm going to say now, but this wording leads to misunderstanding, even if you meant the right thing.

If HT is used, 2 tasks are run on one core at the same time. That is not only at the same time for the observing user (which is how running 2 threads with a 50% time share each would look), but also at the same time for the CPU, clock for clock if you will. This is totally independent of OS scheduling and everything people normally associate with "swapping".

Speed per task does drop when HT is used because, although both threads have their own registers and such (the "core components" of the core), they have to share the caches and, most importantly, the execution units. That's the whole point of HT: making better use of the execution units for relatively little extra die space.

What you seem to be talking about is OS scheduling, where the scheduler reassigns tasks to specific cores at typically ~1 ms intervals (on Windows), which is an eternity compared to the CPU clock ;)

MrS


Absolutely correct, but moot. Since we don't know what his machine's actual scheduling behaviour is, it's an assumption that it's the same across the graph, i.e. is affinity maintained? Recall that E@H workunits are given low priority by default and hence will readily be displaced by system calls etc., especially/inevitably if all physical cores are busy. So a substantial part of the right-hand side of the graph ( 4 and more virtual cores busy ) will involve actual OS task swaps in addition to HT behaviours. Again, see earlier discussion ...

Cheers, Mike.

( edit ) Sorry, the other thing I've not mentioned here is the probable 'pure' HT overhead. I've seen various estimates but nowhere near enough penalty to give that 2:1 ratio ( change at the knee ) in the benefit per added virtual core. I think the HT throughput ( opinions differ ) for 2 units on a core was given at about 1.7 at worst? Mind you my 2:1 estimate was by eye .....

( edit ) Printing out the graphic and drawing lines, I find that the ratio of the slope of the lines for fewer than 4 jobs to the slope of the lines for more than 4 jobs ( i.e. before and after the knee, for each given processor speed ) is about 2.7:1, plus or minus ~ 0.05. So I guess the question is: what are the best and worst estimates for HT throughput - the 'pure' HT part - with the remainder being OS task swaps?
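One simple way to turn that slope ratio into an implied 'pure HT' figure is the toy model below, which assumes that below the knee each added job gets a whole physical core while above it each added job only contributes a core's second thread; it deliberately ignores OS task swaps and everything else.

```python
# Back-of-envelope model: below the knee each added job gets a whole physical core
# (gain 1.0); above it each added job only adds a core's second thread (gain f - 1,
# where f is the two-thread throughput of one core relative to one thread).
# Under that model the pre/post-knee slope ratio R satisfies R = 1 / (f - 1).
def implied_ht_factor(slope_ratio):
    return 1.0 + 1.0 / slope_ratio

def expected_slope_ratio(ht_factor):
    return 1.0 / (ht_factor - 1.0)

print(implied_ht_factor(2.7))     # ~1.37: what the measured 2.7:1 ratio implies for f
print(expected_slope_ratio(1.7))  # ~1.43: the ratio a 'pure HT' throughput of 1.7 would give
```

On that toy model the measured 2.7:1 ratio corresponds to a two-thread throughput of roughly 1.37 per core, well short of the 1.7 figure mentioned above, which would leave room for the OS-side effects being discussed.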

( edit ) Moreover, if one draws a line from the knee to the 8-job point, the graph dips slightly below it at around 6 jobs and then returns .... a short, mild upward concavity. This is repeated for all curves. So there's something ( my guess is non-HT related, i.e. scheduling behaviour ) kicking in there. I'll post a modified graphic explaining what I mean by all this when I get a chance ..... :-)

( edit ) I hope this is sufficient :


NB : I did the calculations for the other intermediate clock speeds and got very close to the above pre/post-knee ratio, ~ 2.69

Plus, I've assumed that since the measure is daily RAC, wall-clock time ( as opposed to CPU time ) - "averaged the time required to complete a single S5 work unit" - is the relevant scale.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6591
Credit: 323648764
RAC: 174685

Further thoughts ( based on

Further thoughts ( based on prior statements/assumptions ) : why the dip below linear at around 6 jobs? The slope of the curve is the rate of change of RAC with the change in virtual core number. That means there is a slight "penalty" for going from 4 to 6 which is "recovered" by going from 6 to 8 ( within the expected overall pattern of somewhat less than 2:1 throughput from HT above 4 virtual cores ). Shouldn't task swaps per se, forced by higher thread occupancy of the CPUs, give a concave-down aspect to the 'thigh' part of the curve? Meaning that at, say, 5 GW jobs there will probably be more physical cores ( 3 ) occupied by only a single GW job than at 7 GW jobs, where fewer physical cores ( 1 ) are occupied by a single GW job, i.e. non-GW work ( system stuff, say ) is more likely to bump a GW job off a given physical core with 7 jobs running than with 5 .... unless I'm viewing an artifact of the data presentation.
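A toy count of that occupancy argument, assuming an idealised scheduler that spreads N jobs across the 4 physical cores before doubling any of them up:

```python
# Toy count for the occupancy argument: an idealised scheduler spreads N jobs across
# 4 physical cores (8 logical) before doubling any core up.
def cores_with_single_job(n_jobs, physical_cores=4):
    doubled = max(0, n_jobs - physical_cores)        # cores carrying two jobs
    return min(n_jobs, physical_cores) - doubled     # cores carrying exactly one job

for n in range(4, 9):
    print(n, "jobs ->", cores_with_single_job(n), "core(s) running a single GW job")
```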

Of course, to decide matters firmly, what we need is a repeat of all the prior work with process affinities nailed down [ plus recording elapsed ( run ) time vs core ( CPU ) time ]. Volunteers? :-) :-) :-)
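A minimal sketch of how one might record both quantities for a single task on Linux; the command below is a hypothetical stand-in for an actual Einstein executable.

```python
# Minimal sketch of recording elapsed (run) time vs core (CPU) time for one task on Linux.
# The command is a hypothetical stand-in for an actual Einstein executable.
import resource
import subprocess
import time

cmd = ["./einstein_task", "--example-args"]   # hypothetical; substitute the real command

t0 = time.monotonic()
subprocess.run(cmd, check=True)
elapsed = time.monotonic() - t0

usage = resource.getrusage(resource.RUSAGE_CHILDREN)
cpu_seconds = usage.ru_utime + usage.ru_stime  # user + system CPU time of the finished child

print(f"elapsed (run) time: {elapsed:.1f} s, core (CPU) time: {cpu_seconds:.1f} s")
```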

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

FrankHagen
Joined: 13 Feb 08
Posts: 102
Credit: 272200
RAC: 0

one thing to add.. since

one thing to add..

since even on 64-bit Linux we are running a 32-bit app (correct me if i'm wrong), only 8 of the 16 SSE2 registers of a core are usable.
on the other hand, exactly this may lead to better HT performance.

but as long as there is no real 64-bit app which makes full use of the SSE2 register set, we won't know.
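A quick way to check the 32-bit vs 64-bit question directly is to read the ELF header of the installed app. The path below is a hypothetical example of where the BOINC client on Ubuntu keeps Einstein binaries; substitute the real one.

```python
# Quick check of 32-bit vs 64-bit: byte 5 of an ELF file (EI_CLASS) is 1 for 32-bit
# and 2 for 64-bit. The path below is a hypothetical example; point it at the actual
# Einstein app inside the BOINC projects directory.
def elf_class(path):
    with open(path, "rb") as f:
        header = f.read(5)
    if header[:4] != b"\x7fELF":
        return "not an ELF binary"
    return {1: "32-bit", 2: "64-bit"}.get(header[4], "unknown")

print(elf_class("/var/lib/boinc-client/projects/einstein.phys.uwm.edu/einstein_app"))  # hypothetical path
```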

archae86
Joined: 6 Dec 05
Posts: 3160
Credit: 7261301899
RAC: 1545908

RE: since even on a 64-bit

Quote:
since even on 64-bit Linux we are running a 32-bit app (correct me if i'm wrong), only 8 of the 16 SSE2 registers of a core are usable.
on the other hand, exactly this may lead to better HT performance.


I was not even aware of this particular distinction. Speaking hypothetically, I suppose that an application variant which used more registers might be expected to generate less data memory traffic. As one of the opportunities for HT benefit is clearly finding something useful to do while waiting for a memory read, that would seem to suggest possibly less HT benefit on the hypothetical variant.

But memory references able to be supplanted by registers are, I should think, highly likely to be filled from cache, not RAM, and usually a fast level of the cache.

In practice I doubt the speculated effect is either substantial or consistent.

Over at SETI, it appears that the Lunatics tuned applications include distinct x64 and x32 Windows variants for both Astropulse and Multibeam. Do you know whether anyone has done work to compare the actual execution performance, to see what benefit their x64 version provides compared to x32 when both are running on a 64-bit OS? I think there has actually been less recent careful HT assessment there than here, and I certainly don't recall spotting any x64 vs. x32 HT detail.

But such an answer would necessarily be highly application-specific. I don't think either of the current SETI analyses much resembles any of the Einstein analyses computationally (if Bikeman or others know I'm wrong here, please correct me), and the considerable history and infrastructure of the Lunatics effort may mean that available tuning benefits have been more thoroughly explored there.
