Hyperthreading and Task Number Impact Observations

archae86
Joined: 6 Dec 05
Posts: 3,145
Credit: 7,050,054,931
RAC: 1,643,016

I am surprised, happy, and very puzzled. This morning I pulled the motherboard out of the case and removed the HSF and CPU. I figured if the stone-cold-dead symptom with excess power consumption continued, I had a bad motherboard, and if not, a fried CPU. What I saw was a well-behaved 2W peak descending to 1W (not the steady 10W I had seen on this supply), and the smart-button LEDs lit up, so my diagnosis was CPU; but before spending almost $400 US on a replacement I put the "bad" one back in the socket. All was still well !?!?

So I very, very slowly reconnected things, taking the time to connect power after making almost every connection (including each cable to the case). As I added things, the peak initial power consumption fitfully rose, eventually to 5W, and at some point the smart-button LEDs began to give just a momentary flash instead of staying on, but aside from these behavior changes all went well. I now have all internals reconnected, the case buttoned up, and the host is on the internet and processing Einstein and a little SETI at the previous 3.4 GHz. I've not plugged in any USB devices, but it seems fully functional.

I'm very puzzled as to what was wrong, and how it got fixed. My two lead candidates:

1. I fat-fingered some connection into an unacceptable state, and only when I finally pulled off all the case cables could things return to normal.

2. My stupid error of doing a RAM change with power on the box put some piece of internal state into an unacceptable condition, which was not remedied by repeated reboots, a CMOS clear, or genuine power disconnections, but decayed away over a night disconnected from power.

Sorry to divert this thread from performance content. As I have a new set of three RAM sticks on order for delivery next week, I think I shall do a few trials and document the execution impact of going from 1 to 2 to 3 channels of RAM. Or maybe I'll see the wisdom of leaving well enough alone.

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6,537
Credit: 286,150,706
RAC: 106,228

Well, this is all good. Luck is a fortune! Some element has held onto charge and biased something inappropriately; it has now discharged.

Now what do the latest Westmere figures say to us?

(A) Comparing the first two rows and dividing :

0.5692/0.7622 ~ 0.7468

This means a WU that was alone on a physical core loses ~25% of speed if it has to share with another WU. But total WU throughput per physical core is still ahead: a more or less expected level of benefit from hyper-threading. Fair enough, but 'alone on a physical core' is not quite the right description, as a lone WU might still have had some other non-WU task coming or going to share with, whereas in the shared case we knew it was definitely always sharing a core with another WU. Having said that, let's assume such sharing only happens when the other virtual slot on a physical CPU isn't servicing a WU task, and say no more about it (expecting a random spread of non-WU tasks over whatever free virtual cores are about, even on a system where non-WU tasks have been well weeded).
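The throughput arithmetic can be checked in a couple of lines (a quick sketch using the two speed figures quoted above):

```python
# Relative per-WU speeds from the table above.
alone = 0.7622   # WU with a physical core to itself
shared = 0.5692  # WU sharing a physical core with another WU

# Per-WU slowdown when sharing: ~25% loss of speed.
print(f"{shared / alone:.4f}")  # → 0.7468

# But per physical core there are now two WUs, so total
# throughput per core is still well ahead of the unshared case.
print(f"{2 * shared / alone:.4f}")  # → 1.4936
```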

(B) Now divide row three by row two :

0.4975/0.5692 ~ 0.874

Thus we see that each of four WUs sharing two physical cores loses ~12% of speed if we allow them to shift chairs rather than telling them to sit still in allocated seats. If you have four people on four chairs, are swaps done in pairs? Do we swap tasks on the same physical core in pairs (if such a statement has meaning!), or do we swap tasks across physical cores? However it's done, the average over all such unseen mechanisms is 12%. Hence I agree, and have simply restated your earlier comment on the benefit of hard assignment.

(C) Move to the fourth line. Now we have two WUs sharing a physical core, with two other WUs definitely on different physical cores. What ought we expect? There's a mild surprise. To get this case we take the situation of the first line (a physical core to each of four WUs) and force two WUs to share a physical core (leaving one remaining physical core quite unoccupied by WUs). Yet the time per physical core isn't much different; in fact it is slightly faster (14373 vs 14606)! Let's not get too excited, and just call it the same. An obvious idea is that each WU is bound/limited by something other than the presence of another WU on the same physical core. I like this answer because it is consistent with the CPU being the fastest chip on the machine, having to wait on occasion for slower devices (generally orders of magnitude slower, with possible contention for the device plus gate/buffer delays and longer signal distances too). So if a WU has to wait for a hard disk, it really makes no odds whether it waits alone or in company. Indeed, on the face of it there's at least a one-in-two chance that such system tasks get assigned to the totally WU-free physical core. And unless I'm mistaken, all the WUs are contending for the same disk! Yup, I like WU contention for the same disk as the rate-limiting step, because that will definitely be independent of any WU's core context.

(D) Now I'm not entirely sure what configuration line five describes ('2+1+1 configuration--these two tasks confined to one core') and thus how it differs from (C) above (even assuming I captured that scenario correctly). Please advise. My punt is that the two WUs having a physical core each to themselves (the '1+1') means they can roam, having three physical cores to choose from, so the WU-free physical core is also effectively being shifted (much like a semiconductor lattice hole). If that is indeed the correct view of the case, then we repeat the lesson of hard assignment; specifically the penalty is:

0.2723/0.3873 ~ 0.7031

or about 30% for allowing them to stuff about amongst the available chairs. It's even worse than (B) above: you lose more time when there are more chairs to choose from!!
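Putting the two soft-assignment penalties side by side (a quick check of the ratios above):

```python
# Speed figures from the table (higher = faster).
pinned_two_cores = 0.5692   # (B) four WUs pinned, two per physical core
roaming_two_cores = 0.4975  # (B) four WUs free to roam over the same cores
pinned_case_d = 0.3873      # (D) the hard-assigned variant
roaming_case_d = 0.2723     # (D) the roaming variant, three cores to pick from

penalty_b = 1 - roaming_two_cores / pinned_two_cores
penalty_d = 1 - roaming_case_d / pinned_case_d
print(f"(B) {penalty_b:.1%}  (D) {penalty_d:.1%}")  # → (B) 12.6%  (D) 29.7%
# More chairs to choose from, bigger penalty for not hard-assigning.
```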

None of which connects especially with my earlier 4-over-4 analysis, as that referred to the likelihood of overall random assignment of tasks to cores by the OS, which we have avoided here by design.

As usual please point out if I've missed some aspect. And a safe New Year to one and all. :-) :-)

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6,537
Credit: 286,150,706
RAC: 106,228

I'll try a guess at longer-term performance with the Westmere, taking account of the settings/scenario :

- 4 WUs and 4 physical cores.

- Windows randomly assigning virtual CPU cores to WUs ( NB these have equal priority )

- averaged over a suitably long period or number of WUs ( say ~ 50 WUs )

- using times as per Pete's latest Westmere table.

Good is allocated at 22.86% of instances at 14606 seconds

Mediocre is allocated at 68.57% of instances at 20444 seconds

Worst is allocated at 8.57% of instances at 22380 seconds

That is, a weighted mean :

= 0.2286 * 14606 + 0.6857 * 20444 + 0.0857 * 22380

~ 19275

meaning the number of seconds per work unit ( elapsed time ) given those assumptions.
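The weighted mean is easy to verify (a quick sketch; the probabilities and times are those tabulated above):

```python
# (probability of the scenario, elapsed seconds per WU in that scenario)
cases = [
    (0.2286, 14606),  # good
    (0.6857, 20444),  # mediocre
    (0.0857, 22380),  # worst
]

expected = sum(p * t for p, t in cases)
print(f"{expected:.0f} s per WU")  # → 19275 s per WU
```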

Cheers, Mike.


mikey
Joined: 22 Jan 05
Posts: 11,927
Credit: 1,831,600,615
RAC: 212,751

Quote:

I'm very puzzled as to what was wrong, and how it got fixed. ... My stupid error of doing RAM change with power on the box put some piece of internal state into an unacceptable state, which did not remedy with repeated reboots, a CMOS clear, or genuine power disconnections, but decayed away in spending a night disconnected from power.

I have seen this before in PCs. I have always thought it was some capacitor holding its charge, and only after sitting, thus losing its charge, do things go back to normal. After telling people to try rebooting, this is one of my secret fixes when providing over-the-phone PC tech help: I tell them to wait about 15 to 30 minutes before restarting the PC, and it really does seem to work sometimes. It has saved me many a trip, only to find things 'just working' when I do make the trip to their homes. I always tell them 'the PC is scared and knows I am there and will fix it', so it just works before I have to whip it into shape; we all laugh and I walk away wondering!

ps I have enjoyed this thread and your testing of the different ways to crunch and which is best, please don't stop!

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6,537
Credit: 286,150,706
RAC: 106,228

Quote:
It has saved me many a trip only to find things 'just working' when I do make the trip to their homes. I always tell them 'the pc is scared and knows I am there and will fix it' so it just works before I have to whip it into shape, we all laugh and I walk away wondering!


Cherish this effect! In medicine we say that 'a good doctor times his treatment to coincide with recovery!' :-) :-)

My other favorite is 'neither kill nor cure, if you seek repeat business!' ;0 :-)

Cheers, Mike.


ML1
Joined: 20 Feb 05
Posts: 347
Credit: 86,314,215
RAC: 512

Quote:
Cherish this effect! In medicine we say that 'a good doctor times his treatment to coincide with recovery!' :-) :-)


A-ha... Isn't that subverting the natural immune response to give a Pavlovian reinforcement to have you called out to merely administer a placebo?...

Quote:
My other favorite is 'neither kill nor cure, if you seek repeat business!' ;0 :-)


Ouch! That also sounds like certain dubious business practices foisted on IT/computers to maintain a never-ending upgrade cycle...

How to distinguish the good from the game?

Cheers,
Martin

See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)

ML1
Joined: 20 Feb 05
Posts: 347
Credit: 86,314,215
RAC: 512

Interesting analysis.

I'm surprised at the 12% penalty for having WUs roam around the cores... Is Windows scheduling really that bad? That 12% adds up to an awful lot of poisoned cache. Or is it more a case of the low-priority tasks for the roaming WUs being interrupted more frequently by other tasks even when other cores are idle? (Again, a quirk of poor scheduling?)

The 'other rate-limiting feature' outside of the CPU may well be system RAM bandwidth limits, which become more significant when the CPU cache cannot be used as effectively as in the best cases.

Some good sleuthing there.

What would be interesting for comparison is to do an identical set of tests but using the latest Linux and then Apple Mac on the same hardware (all 64-bit).

Happy crunchin',
Martin


archae86
Joined: 6 Dec 05
Posts: 3,145
Credit: 7,050,054,931
RAC: 1,643,016

Quote:
I'm surprised at the 12% penalty for having WUs roam around the cores... Is Windows scheduling really that bad? That 12% adds up to an awful lot of poisoned cache. Or is it more a case of the low priority tasks for the roaming WUs being interrupted more frequently by other tasks even when other cores are idle? (Again, a quirk of poor scheduling?)


No, No, and No.

The issue is not thrashing of any kind, but rather that for significant periods of time a task is not active. I thought I made this point clear in my notes, but obviously I failed, as both you and Mike seem to have a different notion.

The situation is quite artificial, in that affinity constraints to pools of a subset of all CPUs are placed on a task. So this behavior has no obvious relevance to typical working system behaviors.

No such effect is seen where no affinity constraint is supplied and sufficient Einstein work is allowed to execute to populate all cores (i.e. 8 active Einstein tasks on my system). I hope that puts to rest the mistaken references to poisoned cache, excess context switches, RAM bandwidth, and so on. Clearly the 8/8 task situation has worse inherent constraint from each of these than does the case where 4 tasks are constrained to 4 CPUs on two physical cores.
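For anyone wanting to reproduce this kind of hard assignment: on Windows it can be done from Task Manager ('Set Affinity') or via the SetProcessAffinityMask API; on Linux the same idea is exposed through Python's standard library. A minimal sketch (the CPU number is illustrative; which logical CPUs are HT siblings of one physical core depends on your machine's topology):

```python
import os

# Pin the calling process (pid 0 means "self") to a pool of logical
# CPUs. On an HT machine a pool like {0, 1} may be the two hardware
# threads of one physical core -- check lscpu for the real topology.
os.sched_setaffinity(0, {0})

# Read the mask back to confirm the hard assignment took effect.
print(sorted(os.sched_getaffinity(0)))  # → [0]
```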

ML1
Joined: 20 Feb 05
Posts: 347
Credit: 86,314,215
RAC: 512

Quote:
Quote:

I'm surprised at the 12% penalty for having WUs roam around the cores...

... Or is it more a case of the low priority tasks for the roaming WUs being interrupted more frequently by other tasks even when other cores are idle?...


No, No, and No.

The issue is not thrashing of any kind, but rather that for significant periods of time a task is not active. ...


In what case does the task become not active?

Quote:
The situation is quite artificial, in that affinity constraints to pools of a subset of all CPUs are placed on a task. So this behavior has no obvious relevance to typical working system behaviors.


Are you suggesting that the affinity restrictions will push multiple tasks onto just one CPU?...

Quote:
No such effect is seen where no affinity constraint is supplied, and sufficient Einstein work is allowed to execute to populate all cores (i.e. 8 active Einstein tasks on my system).


Which is what we expect to be the optimum usage and that does indeed appear to be the case from the numbers.

The interesting bits are the numbers from the artificial cases to try to work out what the effects are, and their significance.

Quote:
I hope that would put to rest the mistaken references to poisoned cache, excess context switches, RAM bandwidth, and so on. Clearly the 8/8 task situation has worse inherent constraint from each of these than does the case where 4 tasks are constrained 4 CPUs on two physical cores.


There are examples of systems where, due to RAM bandwidth constraints and CPU cache usage, you may well get higher throughput by running tasks on only 6 or 7 of 8 virtual cores... This came up in previous S@H and E@H threads.

It comes back to an old argument that certain mixes of BOINC tasks can be beneficial for maximum throughput, and some combinations can be detrimental... It's all a question of which system bottlenecks get hit. We usually tune the system to keep the most expensive resource (the CPU) fully busy.

(However, on my recent systems, the CPU, GPU, and RAM have all been about equally priced...)

Happy fast crunchin',
Martin


Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6,537
Credit: 286,150,706
RAC: 106,228

Quote:
The issue is not thrashing of any kind, but rather that for significant periods of time a task is not active. I thought I made this point clear in my notes, but obviously I failed, as both you and Mike seem to have a different notion.


Oh, I see now. Quite right. My bad, and even after you went to the trouble of color highlighting! :O]

I was thinking about HT too much. Potential execution time is 'lost' by a task when it is not allocated a slice at all, even if it seems it reasonably could have been. My 'unseen mechanisms' are imaginary. This is an OS issue, so the scheduling algorithm is the proper focus. Ooooh.

[ See my earlier conclusions with (B) vs (D) - 'you lose more time if there is more choices of chairs' when not hard assigning ]

Since your machine is rigged to be on the rather light side of task load (compared to 'typical' use), what about the number you see on the 'Processes' tab of Task Manager?

Cheers, Mike.

