Hyperthreading and Task number Impact Observations

FrankHagen
Joined: 13 Feb 08
Posts: 102
Credit: 272200
RAC: 0

Quote:
Quote:
since even on a 64-bit linux we are running a 32-bit app (correct me if i'm wrong), this leads to only 8 of the 16 SSE2 registers of a core being usable.
on the other hand exactly this may lead to a better performance of HT.

I was not even aware of this particular distinction. Speaking hypothetically, I suppose that an application variant which used more registers might be expected to generate less data memory traffic.

that's only part one of the story - part 2 is that, in theory, the core can process twice the number of calculations in a single operation, if the code can be and is vectorized to use all 16 registers.

Quote:
Over at SETI, it appears that the Lunatics tuned applications include distinct x64 and x32 Windows variants for both Astropulse and Multibeam. Do you know whether anyone has done work to compare the actual execution performance to see what benefit their x64 version provides compared to x32 when both are running in a 64-bit OS?

nope - i do not care about YETI.. ;)
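
On the 32-bit question raised above: on Linux, one quick way to check whether an app binary really is 32-bit is to read its ELF header. The following is a minimal sketch, assuming the app is an x86 ELF executable. The name `elf_bits` and the lookup table are made up for illustration and are not part of BOINC or any project app. The register counts are the architectural XMM register counts visible to SSE2 code (8 in 32-bit mode, 16 in 64-bit mode).

```python
ELF_MAGIC = b"\x7fELF"

# XMM registers visible to SSE2 code on x86, keyed by ELF class width.
# Assumption: the binary targets x86; other architectures are not handled.
XMM_REGISTERS = {32: 8, 64: 16}


def elf_bits(path):
    """Return 32 or 64 according to the EI_CLASS byte of an ELF file."""
    with open(path, "rb") as f:
        header = f.read(5)
    if len(header) < 5 or header[:4] != ELF_MAGIC:
        raise ValueError("not an ELF file: %r" % path)
    ei_class = header[4]  # 1 = ELFCLASS32, 2 = ELFCLASS64
    if ei_class == 1:
        return 32
    if ei_class == 2:
        return 64
    raise ValueError("unknown ELF class byte: %d" % ei_class)


if __name__ == "__main__":
    import sys

    # Usage: python elf_bits.py /path/to/app_binary
    bits = elf_bits(sys.argv[1])
    print(bits, "bit binary,", XMM_REGISTERS[bits], "XMM registers usable")
```

Pointing this at the app executable in the BOINC project directory (the exact path depends on the install) would settle the "correct me if i'm wrong" part.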

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7059414931
RAC: 1276088

Quote:
Quote:
I was not even aware of this particular distinction. Speaking hypothetically, I suppose that an application variant which used more registers might be expected to generate less data memory traffic.

that's only part one of the story - part 2 is that, in theory, the core can process twice the number of calculations in a single operation, if the code can be and is vectorized to use all 16 registers.


That would seem to push back in the opposite direction if it were true. Not denying the performance benefit possibly available to the portion of the code doing that, but the opportunity for HT benefit would seem to go up with more closely spaced data reads from memory, which this would do.

But I doubt it is true. Do you seriously think that even Nehalem comes equipped with enough distinct SSE floating point units ever to keep 16 registers in use in real-world code?

Looking at this Nehalem block diagram I see two available ADD SSE units and one available MUL/DIV SSE unit. So thinking in terms of two-operand instructions that could support six, once in a while, but how on earth do you imagine getting to sixteen?

FrankHagen
Joined: 13 Feb 08
Posts: 102
Credit: 272200
RAC: 0

Quote:
Looking at this Nehalem block diagram I see two available ADD SSE units and one available MUL/DIV SSE unit. So thinking in terms of two-operand instructions that could support six, once in a while, but how on earth do you imagine getting to sixteen?

this is MAYBE another thing on nehalems and bulldozers. but i'm talking about any core capable of SSE2 in 64-bit mode, and that goes all the way back to ( cough ) P4 and ATHLON64!

the difference between "native mode" and 64-bit mode started back then..

http://en.wikipedia.org/wiki/SSE2

that's just another reason why you'll find "AMD64" everywhere.

getting back to a certain app - it really depends:

heavy use of SSEx instructions?
code which can be vectorized?
processor architecture?

bottom line is: unless you really give it a try, you'll never know, but if you do not do it, you can as well still believe in earth being a flat thing.

in boincworld it's very rare that a real 64-bit app is not faster - you might want to check the numbers on http://wuprop.boinc-af.org/results/delai.py

Claggy
Joined: 29 Dec 06
Posts: 560
Credit: 2694028
RAC: 0

Quote:
Over at SETI, it appears that the Lunatics tuned applications include distinct x64 and x32 Windows variants for both Astropulse and Multibeam. Do you know whether anyone has done work to compare the actual execution performance to see what benefit their x64 version provides compared to x32 when both are running in a 64-bit OS? I think there has actually been less recent careful HT assessment there than here, and certainly don't recall spotting any x64 vs. x32 HT detail.


There are only 64-bit Windows apps for CPU Multibeam; there are no Windows 64-bit Astropulse apps and no Windows 64-bit CUDA apps.

Claggy

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7059414931
RAC: 1276088

Quote:

There are only 64-bit Windows apps for CPU Multibeam; there are no Windows 64-bit Astropulse apps and no Windows 64-bit CUDA apps.

Claggy

Thanks for the correction, I carelessly relied on a heading in their download area reading

Quote:
AstroPulse for Windows - x64 & x32 Bit Windows AstroPulse apps for SSE & SSE3.

which is similar language to that used for the Multibeam entry. Maybe the intended meaning is that the applications will run in those environments.

Claggy
Joined: 29 Dec 06
Posts: 560
Credit: 2694028
RAC: 0

Quote:
Quote:

There are only 64-bit Windows apps for CPU Multibeam; there are no Windows 64-bit Astropulse apps and no Windows 64-bit CUDA apps.

Claggy

Thanks for the correction, I carelessly relied on a heading in their download area reading
Quote:
AstroPulse for Windows - x64 & x32 Bit Windows AstroPulse apps for SSE & SSE3.
which is similar language to that used for the Multibeam entry. Maybe the intended meaning is that the applications will run in those environments.


There are just different installers aimed at 32-bit or 64-bit BOINC installations. The 64-bit installer makes more entries to try to make sure no one loses any work when installing the Lunatics apps,
but the only 64-bit app in it is the AK_V8 MB app.

Claggy

Robert
Joined: 5 Nov 05
Posts: 47
Credit: 318728021
RAC: 19396

I've been tied up the last week and I see a few questions have come up on the HT results I collected.

- No special techniques were used for setting processor affinity
- I used a standard 64-bit Ubuntu load with 32-bit compatibility libs, no other OS tuning
- The 32-bit S5 SSE2 application was utilized, version 1.07
- This machine is dedicated to running E@H, so no other side loads
- It took many weeks to collect measurements for all the data points, so different frequencies were used
- I assumed slight variations in measurement points were from the different frequencies and data sets, as discussed in the beginning of this thread
- Thanks Mike for the suggestion to compare slope ratios as a way to answer the question how much benefit from HT
- My Calculations are in reasonable agreement with Mike's 2.69 calculated slope ratio

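
For anyone wanting to redo the slope-ratio comparison numerically rather than by eye, here is a rough sketch. It assumes the slope ratio means the slope of the fitted line above the physical core count divided by the slope of the fitted line below it, with whatever time quantity was plotted on the vertical axis and the number of concurrent tasks on the horizontal axis. That definition, the function names, and the numbers in the example are all assumptions for illustration. They are not the measured data from this thread.

```python
def fit_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def slope_ratio(task_counts, run_times, physical_cores):
    """Slope above the physical core count divided by the slope below it.

    The point at exactly `physical_cores` tasks is used in both fits.
    """
    points = list(zip(task_counts, run_times))
    low = [(n, t) for n, t in points if n <= physical_cores]
    high = [(n, t) for n, t in points if n >= physical_cores]
    low_slope = fit_slope([n for n, _ in low], [t for _, t in low])
    high_slope = fit_slope([n for n, _ in high], [t for _, t in high])
    return high_slope / low_slope


if __name__ == "__main__":
    # Made-up, exactly linear numbers purely to exercise the code.
    tasks = [1, 2, 3, 4, 5, 6, 7, 8]
    times = [102, 104, 106, 108, 113, 118, 123, 128]
    print(slope_ratio(tasks, times, physical_cores=4))
```

A least-squares fit at least makes the second decimal place reproducible, though it says nothing about whether the underlying points justify it.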
Robert
Joined: 5 Nov 05
Posts: 47
Credit: 318728021
RAC: 19396

There seemed to be interest in this data, so I'll post the hyper-threading S5 data I collected on the i7-980 for comparison.

The same collection conditions apply.

Only two frequency curves were collected for the 980. I'm not willing to say whether the 980's slope ratios are the same or different; they are close. At 3.0 GHz the 980's slope ratio is close to the 2600K's slope ratio as shown by Mike's earlier calculations.

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6540
Credit: 286787467
RAC: 88006

Quote:
My Calculations are in reasonable agreement with Mike's 2.69 calculated slope ratio


Being lines fitted by ( my ) eye on a piece of paper : I definitely think I'm over exact at quoting 2 decimal places. Probably better to quote to only one, say call it 2.7 .... :-)

Anyway they're all around 2.5 to 3.0 hence my impression is that OS swap overhead is at least comparable to HT effects at high core loads. OK. If so, then to test : did someone mention 'process lasso' or somesuch as an appropriate Windoze based utility to achieve affinity control? And return detailed timings for that matter, or does some other utility do that better? Suggestions?
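
For the Linux runs, affinity can be set without any extra utility. Below is a minimal sketch using Python's standard `os.sched_setaffinity`, which is available on Linux but not on Windows, so it does not answer the Windows question. The helper name `pin_to_cpu` is hypothetical, and pinning a BOINC task this way would need that task's process ID.

```python
import os


def pin_to_cpu(cpu, pid=0):
    """Restrict process `pid` (0 = the calling process) to one logical CPU.

    Returns the resulting affinity set, or None where the call is unsupported.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # e.g. Windows: no stdlib equivalent
    os.sched_setaffinity(pid, {cpu})
    return os.sched_getaffinity(pid)


if __name__ == "__main__":
    if hasattr(os, "sched_getaffinity"):
        allowed = os.sched_getaffinity(0)
        # Pin this process to the lowest-numbered CPU it may already use.
        print(pin_to_cpu(min(allowed)))
    else:
        print("affinity control not available through os on this platform")
```

Which logical CPUs share a physical core is a separate question that this sketch does not address.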

This machine of mine, if divested of BRP work, could be an ideal test rig methinks. I could measure actual core times vs wall clock times on a per virtual core basis, and thus get the time per core not devoted to the GW tasks of interest, derive fractional overheads etc.
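
The core time vs wall clock comparison can be sketched roughly as follows. This assumes "fractional overhead" means the share of wall-clock time during which the task was not actually executing. The function names are hypothetical, and the measurement here is per process rather than per virtual core, which would need OS-level accounting not shown.

```python
import time


def fractional_overhead(cpu_seconds, wall_seconds):
    """Share of wall-clock time in which the task was not on a core."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return 1.0 - cpu_seconds / wall_seconds


def measure(work):
    """Run work() and return (cpu_seconds, wall_seconds) for this process."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work()
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return cpu_seconds, wall_seconds


if __name__ == "__main__":
    # Stand-in workload; a real test would time an actual task instead.
    cpu, wall = measure(lambda: sum(i * i for i in range(2_000_000)))
    print("cpu %.3f s, wall %.3f s, overhead %.1f%%"
          % (cpu, wall, 100.0 * fractional_overhead(cpu, wall)))
```

For finished BOINC tasks the same ratio can be formed from the CPU time and run time the client already reports, without any extra instrumentation.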

Thanks again for collecting and presenting that! We're always looking to study such behaviours and perhaps get a hint or two on optimising. :-)

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

FrankHagen
Joined: 13 Feb 08
Posts: 102
Credit: 272200
RAC: 0
