Hyperthreading and Task number Impact Observations

telegd
Joined: 17 Apr 07
Posts: 91
Credit: 10212522
RAC: 0

Quote:
From some previous work I formed an impression that (0,1), (2,3), (4,5), and (6,7) were core-sharing pairs on my E5620, though I'm not highly confident.


Interesting. On my i7-860 Linux box, it is (0,4) (1,5) (2,6) (3,7).
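For anyone wanting to check the pairing on their own Linux box, here is a minimal sketch. It assumes the kernel exposes the usual sysfs topology files (`/sys/devices/system/cpu/cpuN/topology/thread_siblings_list`); the helper names `parse_cpu_list`, `sibling_groups` and `read_sibling_lists` are made up for illustration and are not part of any tool mentioned in this thread.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Parse a kernel CPU list such as "0,4" or "0-1" into a sorted tuple of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return tuple(sorted(cpus))


def sibling_groups(sibling_lists):
    """Collapse per-CPU sibling lists into the distinct core-sharing groups."""
    return sorted({parse_cpu_list(text) for text in sibling_lists})


def read_sibling_lists(base="/sys/devices/system/cpu"):
    """Read thread_siblings_list for every logical CPU (Linux only, assumed sysfs layout)."""
    paths = Path(base).glob("cpu[0-9]*/topology/thread_siblings_list")
    return [p.read_text() for p in paths]


if __name__ == "__main__":
    # Each printed tuple is one set of logical CPUs sharing a physical core.
    for group in sibling_groups(read_sibling_lists()):
        print(group)
```

On a machine numbered like the i7-860 above, one would expect the groups to come out as (0, 4), (1, 5), (2, 6), (3, 7), but that depends on the kernel and BIOS enumeration.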

This thread is very interesting - thanks for doing all these tests. I would be very interested to see a similar comparison done under Linux but, sadly, I don't have the time to do it myself...

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7059404931
RAC: 1281547

As a probe, I tried a new case: HT enabled but only 4 tasks (of a possible 8), with one task restricted (via the Process Explorer affinity setting) to what Process Explorer running under Windows 7 construed to be CPUs (2,3), while the other three tasks were allowed to roam among CPUs (0,1,4,5,6,7).

On the "unlucky assignment" hypothesis, one would expect the restricted task, confined to a single sibling pair from which the three other Einstein tasks were excluded, to have spent nearly all of its time with sole use of a full core. It would therefore be expected to finish a good deal sooner than the three other tasks, which would by bad luck spend part of their time sharing a real physical core with another Einstein task while a real physical core sat idle.

To first order this prophecy seems borne out by the observed result.

The task restricted to what Process Explorer called CPUs (2,3) required 14,449.05 seconds of CPU time to complete.

The other three required 16,245.74, 16,267.50, and 16,258.21 seconds. They are quite tightly matched compared to the large difference from the (2,3) restricted task.
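To put a number on that gap, here is a small worked calculation using only the four CPU times quoted above. The variable names are mine, and treating the figures as seconds is an assumption based on how BOINC normally reports CPU time.

```python
from statistics import mean

# CPU times quoted in the post (assumed to be seconds).
restricted = 14449.05                       # task pinned to CPUs (2,3)
roaming = [16245.74, 16267.50, 16258.21]    # tasks free to roam over the other CPUs

roaming_mean = mean(roaming)
# Extra CPU time of the roaming tasks relative to the pinned task.
penalty = roaming_mean / restricted - 1.0
# How tightly the three roaming tasks agree with each other.
spread = (max(roaming) - min(roaming)) / roaming_mean

print(f"mean roaming CPU time: {roaming_mean:.2f}")
print(f"roaming tasks needed {penalty:.1%} more CPU time")
print(f"spread among roaming tasks: {spread:.2%}")
```

The roaming tasks come out at about 12.5% more CPU time than the pinned one, while differing among themselves by well under 1%, which is why the gap looks like a real effect rather than task-to-task noise.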

For those keeping track of frequency and sequence as possible indicia of varying inherent computation work, those were:

for the favored single-core WU:
freq 1373.05 seq 1015
for the disfavored three condemned to suffer temporary sharing of an actual physical core with each other:
freq 1373.10 seq 1009
freq 1373.90 seq 999
freq 1373.05 seq 1014

It seems clear to me that on my system, with Windows 7 and these other conditions, the OS fairly often assigns a "durable" task to the "other half" of a real physical core that is already busy. When that happens in an underworked situation, a real physical core is left idle when a better task assignment could have gotten more throughput.

All of this is what my reading of tear's post suggested. Attention Windows haters: as tear got an observation on Linux, it is at least hinted that some Linux distributions suffer to at least some degree the same form of sub-optimization.

[political observation]I don't much like Windows or Bill Gates myself, save for his second life in the Foundation, where at least the vaccine work, and quite a bit else, seems well founded.[/political observation]

BilBg
Joined: 27 May 07
Posts: 56
Credit: 23998
RAC: 0

I'm not sure, but can Process Lasso assign CPU affinities automatically so as to prefer the real cores in the HT case? (I don't have such a CPU to test):
http://www.bitsum.com/prolasso.php

" ProBalance
Balance process priorities (or CPU affinities) ...

Automated Process Control
Set default priorities and CPU affinities ...

Multi-Core Optimization
Through default CPU affinities and ProBalance affinity adjustments, you can optimize your multi-core processor to make the most efficient use of your CPUs (cores)
"

http://www.bitsum.com/docs/pl/how_does_lasso_work.htm

- ALF - "Find out what you don't do well ..... then don't do it!" :)

ML1
Joined: 20 Feb 05
Posts: 347
Credit: 86314215
RAC: 235

Some interesting observations and good discussion.

Still... Beware the aspect of memory bandwidth contention confusing the issue of CPU thread allocation and overall performance...

Quote:
... as tear got an observation on Linux, it is at least hinted that some Linux distributions suffer to at least some degree the same form of sub-optimization.

For all you might ever have wanted to know about the introduction of the Intel version of Hyper-Threading... (Note that this is a rather old idea, harking back to the days of the Cyber supercomputers and possibly before...)

Linux: HyperThreading-Aware Scheduler

... August 28, 2002 - 12:59pm

* Linux news

Ingo Molnar, author of the O(1) scheduler [earlier story] and the original preemptive kernel patch, has provided a patch to make the O(1) scheduler fully aware of HyperThreading. Ingo explains: ...

Linux: NUMA Awareness Added To Scheduler

... January 22, 2003 - 3:22am

* Linux news

After several earlier attempts [story], NUMA awareness has been merged into the 2.5 development kernel's scheduler. Martin Bligh submitted the patches, explaining: ...

Hyper-Threading support in Linux kernel 2.5.x

Linux kernel 2.4.x was made aware of HT since the release of 2.4.17. The kernel 2.4.17 knows about the logical processor, and it treats a Hyper-Threaded processor as two physical processors. However, the scheduler used in the stock kernel 2.4.x is still considered naive for not being able to distinguish the resource contention problem between two logical processors versus two separate physical processors.

Ingo Molnar has pointed out scenarios in which the current scheduler gets things wrong...

The solution is to change the way the run queues work. The 2.5 scheduler maintains one run queue per processor and attempts to avoid moving tasks between queues. The change is to have one run queue per physical processor that is able to feed tasks into all of the virtual processors. Throw in a smarter sense...

Rather interesting for the various scenarios...

As mentioned, a complication observed elsewhere is when the multiple CPU cores become resource limited for memory access (or even cache access).

Happy fast crunchin',
Martin

See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7059404931
RAC: 1281547

Quote:

I'm not sure, but can Process Lasso assign CPU affinities automatically so as to prefer the real cores in the HT case? (I don't have such a CPU to test):

Suppose you wished to:
1. run with HT enabled,
2. limit the number of BOINC tasks to the number of physical cores (thus losing appreciable Einstein throughput compared to allowing use of all of the apparent CPUs), and
3. get more Einstein output than one gets when allowing unlucky task assignment to sibling CPUs.

If my cursory understanding of Process Lasso from looking at your reference is correct, if my current belief about sibling-pair numbering in Process Explorer also applies to Process Lasso, and if a new thought I had a couple of minutes ago is sound, then one could, I think, do this:

Use Process Lasso to assign a CPU affinity of 0,2,4,6 (or any other list of four that includes only one of each sibling pair) to the "worker" exe for all BOINC applications that you run (it needs to be the same list for every app in this class). One would of course also wish to use BOINC to restrict the number of running BOINC processes to four, or to 50%. On mixed fleets with both HT and nHT multi-core hosts, doing this from account preferences would use up some venue dimension; if that is unacceptable, one might use a host preference override.

This should ensure that no BOINC execution task ever shares a physical core with another. On a lightly loaded system of Nehalem-generation architecture, I'd expect, based on what we have observed so far, such a system to get very close to the nHT BOINC throughput.
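The "one of each sibling pair" list and the matching affinity bitmask can be sketched as below. This is only an illustration: the (0,1), (2,3), (4,5), (6,7) pairing is the unconfirmed impression reported earlier for the E5620, the function names are hypothetical, and nothing here talks to Process Lasso or Process Explorer.

```python
def pick_one_per_pair(sibling_pairs):
    """Choose the lowest-numbered logical CPU from each core-sharing pair."""
    return sorted(min(pair) for pair in sibling_pairs)


def affinity_mask(cpus):
    """Build the bitmask in which bit n is set when logical CPU n is allowed."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


# Assumed pairing, taken from the impression reported earlier in the thread.
pairs = [(0, 1), (2, 3), (4, 5), (6, 7)]
cpus = pick_one_per_pair(pairs)
print(cpus, hex(affinity_mask(cpus)))

# On Linux the same restriction could be applied to a running process with
# os.sched_setaffinity(pid, cpus); that call does not exist on Windows.
```

With the assumed pairing this gives CPUs 0,2,4,6 and mask 0x55. On a box numbered like the i7-860 mentioned earlier, with pairs (0,4) (1,5) (2,6) (3,7), the same functions would give 0,1,2,3 and 0xf, so it is worth confirming the pairing before picking a list.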

A possible reason to consider this is that such a system might prove more responsive to non-BOINC tasks than the 4-task nHT alternative, and quite likely more responsive than the (admittedly higher throughput) 8-task variant.

Current Einstein GC work has a Working Set size reported by Process Explorer as about 260,000 kbytes. Folks with modest-memory systems, or needing to run memory-hungry non-Einstein apps (say Photoshop...), or running BOINC projects that are yet more memory hungry, might find this configuration attractive.

I'm not sure we have quite met the usefulness objection to this line of inquiry we lightheartedly entertained early in the thread, but I do think we are getting closer.

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 540137476
RAC: 135007

I'm wondering: does MS know that the "unlucky assignment" apparently happens this often? They should be scratching their heads already...

Edit: now that I think about it.. this should really upset Intel. Any regular (i.e. non-BOINC) software which uses more than 1 core is likely to use less than 8 cores. And that means it will be unnecessarily slowed down by "unlucky assignment".

MrS

Scanning for our furry friends since Jan 2002

DanNeely
Joined: 4 Sep 05
Posts: 1364
Credit: 3562358667
RAC: 133

archae86: have you checked your Windows power management settings? I ask because any setting below max performance could have the scheduler intentionally pairing WUs up at times in order to idle cores and drop power levels.

ExtraTerrestrial Apes: The problem appears to be in the Windows scheduler: either it is not able to detect that the BOINC tasks are long-running 100% load items and keep them separate, or it bumps them deliberately, because they're low-priority tasks, in favor of giving exclusive core use to a higher-priority item that requested CPU time. In either case this is a Microsoft problem, and not something Intel could do anything about themselves.

It might be possible to differentiate between the two scheduler failures by changing the priority of the science apps from low to high. If the problem is my second guess, this should discourage the scheduler from cramming 2 tasks onto a single core to give something else exclusive access.

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7059404931
RAC: 1281547

Quote:
CPU affinity of 0,2,4,6 (or any other list of four that includes only one of each sibling pair)

While I mentioned this thought in conjunction with another poster's mention of Process Lasso, I just tried this recipe solo. The result was successful, and has been added to the image displayed in the second post in this thread. Possibly this is what tear meant in referring to affinity settings avoiding sibling conflict. The resulting (low) CPU time--actually I've entered the average of four results sharing this condition--seems to endorse this particular setting for this purpose.

DanNeely: my power management setting currently says "Turn off the display": 10 minutes, "Put the computer to sleep": never.

not sure how that comports with your concerns.

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6540
Credit: 286776968
RAC: 88765

Quote:
Quote:
CPU affinity of 0,2,4,6 (or any other list of four that includes only one of each sibling pair)
While I mentioned this thought in conjunction with another poster's mention of Process Lasso, I just tried this recipe solo. The result was successful, and has been added to the image displayed in the second post in this thread. Possibly this is what tear meant in referring to affinity settings avoiding sibling conflict. The resulting (low) CPU time--actually I've entered the average of four results sharing this condition--seems to endorse this particular setting for this purpose.


To be clear on this, we'd be looking at the figures for nHT vs HT, both @ 4 tasks?

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

DanNeely
Joined: 4 Sep 05
Posts: 1364
Credit: 3562358667
RAC: 133

Quote:


DanNeely: my power management setting currently says "Turn off the display": 10 minutes, "Put the computer to sleep": never.

not sure how that comports with your concerns.

From that dialog, click "Change advanced settings", scroll down to Processor Power Management, and take a look at the values for Minimum and Maximum processor speed (unless you locked your multiplier in the BIOS and disabled the power management features that let it throttle down). If your power plan is based on Balanced or Power Saver instead of maximum, the minimum value will be 5%, leaving Windows free to throttle your CPU as it sees fit. I thought there were also settings relating to standing down cores, but unless they're subsumed in the CPU speed setting I can't find them.
