since even on a 64-bit linux we are running a 32-bit app (correct me if i'm wrong), only 8 of the 16 SSE2 registers of a core are usable.
on the other hand, exactly this may lead to better HT performance.
I was not even aware of this particular distinction. Speaking hypothetically, I suppose that an application variant which used more registers might be expected to generate less data memory traffic.
that's only part one of the story - part 2 is that, in theory, the core can process twice the number of calculations in a single operation, if the code can be, and is, vectorized to use all 16 registers.
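To put a rough illustration on the vectorization point (a loose sketch only: numpy stands in here for hand-vectorized SSE2 code, and the two-doubles-per-register figure is simply the 128-bit SSE2 register width):

```python
import numpy as np

# SSE2 width: one 128-bit XMM register holds 2 float64 (or 4 float32) lanes.
LANES_F64 = 128 // 64

a = np.arange(8, dtype=np.float64)
b = np.arange(8, dtype=np.float64)

# Scalar view: one multiply-add per element.
scalar = [a[i] * b[i] + 1.0 for i in range(len(a))]

# Vector view: numpy's inner loops are SIMD-vectorized where the
# hardware allows it, processing LANES_F64 elements per instruction.
vector = a * b + 1.0

assert np.allclose(scalar, vector)
# With 16 architectural XMM registers (x86-64 mode) instead of 8
# (32-bit mode), a compiler can keep twice as much of the working
# set in registers instead of spilling it to memory.
```

Whether that translates into real speed-up is of course exactly the "it depends" question discussed in the rest of the thread.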
That would seem to push back in the opposite direction if it were true. Not denying the performance benefit possibly available to the portion of the code doing that, but the opportunity for HT benefit would seem to go up with more closely spaced data reads from memory, which this would do.
But I doubt it is true. Do you seriously think that even Nehalem comes equipped with enough distinct SSE floating point units ever to keep 16 registers in use in real-world code?
Looking at this Nehalem block diagram I see two available ADD SSE units and one available MUL/DIV SSE unit. So, thinking in terms of two-operand instructions, that could keep six registers busy once in a while - but how on earth do you imagine getting to sixteen?
this is MAYBE another thing on nehalems and bulldozers. but we're talking about cores capable of SSE2 in 64-bit mode, and that means ( cough ) P4 and ATHLON64!
the difference between "native mode" and 64-bit mode started back then..
that's just another reason why you'll find "AMD64" everywhere.
getting back to a certain app - it really depends:
heavy use of SSEx instructions?
code which can be vectorized?
processor architecture?
bottom line is: unless you really give it a try, you'll never know - but if you don't, you might as well keep believing the earth is flat.
Over at SETI, it appears that the Lunatics tuned applications include distinct x64 and x32 Windows variants for both Astropulse and Multibeam. Do you know whether anyone has done work to compare the actual execution performance to see what benefit their x64 version provides compared to x32 when both are running in a 64-bit OS? I think there has actually been less recent careful HT assessment there than here, and certainly don't recall spotting any x64 vs. x32 HT detail.
There are only 64bit Windows apps for CPU Multibeam; there are no Windows 64bit Astropulse apps, and no Windows 64bit Cuda apps.
Claggy
Thanks for the correction - I carelessly relied on a heading in their download area reading
Quote:
AstroPulse for Windows - x64 & x32 Bit Windows AstroPulse apps for SSE & SSE3.
which is similar language to that used for the Multibeam entry. Maybe the intended meaning is that the applications will run in those environments.
There are just different installers aimed at 32bit or 64bit Boinc's; the 64bit installer makes more entries to try and make sure no one loses any work when installing the Lunatics apps.
But the only 64bit app in it is the AK_V8 MB app.
Claggy
I've been tied up the last week and I see a few questions have come up on the HT results I collected.
- No special techniques were used for setting processor affinity
- I used a standard 64 bit ubuntu load with 32 bit compatibility libs, no other OS tuning
- The 32 bit S5 SSE2 application was utilized, version 1.07
- This machine is dedicated to running E@H, so no other side loads
- It took many weeks to collect measurements for all the data points, so different frequencies were used
- I assumed slight variations in measurement points were from the different frequencies and data sets, as discussed in the beginning of this thread
- Thanks Mike for the suggestion to compare slope ratios as a way to answer the question how much benefit from HT
- My calculations are in reasonable agreement with Mike's calculated slope ratio of 2.69
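For anyone wanting to redo that slope-ratio comparison numerically rather than by eye, here's a minimal sketch; the (tasks, seconds) numbers below are invented purely to show the arithmetic and are not taken from the measured data:

```python
# Least-squares line fit and slope ratio, with made-up data standing in
# for measured (active tasks, elapsed seconds) points.

def slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Hypothetical: below the physical core count, elapsed time grows slowly
# with load; in the hyper-threaded region it grows ~2.7x faster.
tasks_lo = [1, 2, 3, 4, 5, 6]
time_lo  = [100 + 10 * t for t in tasks_lo]        # slope 10
tasks_ht = [7, 8, 9, 10, 11, 12]
time_ht  = [160 + 27 * (t - 6) for t in tasks_ht]  # slope 27

ratio = slope(tasks_ht, time_ht) / slope(tasks_lo, time_lo)
print(round(ratio, 2))  # → 2.7
```

The same fit applied per frequency curve would give one ratio per frequency, which is what makes the curves comparable across clock speeds.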
There seemed to be interest in this data, so I'll post the hyper-threading S5 data I collected on the i7-980 for comparison.
The same collection conditions apply.
Only two frequency curves were collected for the 980, so I'm not willing to conclude whether the 980's slope ratios are the same or different; they are close. At 3.0 GHz the 980's slope ratio is close to the 2600K's, as shown by Mike's earlier calculations.
Quote:
My Calculations are in reasonable agreement with Mike's 2.69 calculated slope ratio
Being lines fitted by ( my ) eye on a piece of paper, I definitely think I'm over-exact in quoting 2 decimal places. Probably better to quote to only one - say, call it 2.7 .... :-)
Anyway they're all around 2.5 to 3.0, hence my impression is that OS swap overhead is at least comparable to HT effects at high core loads. OK. If so, then to test: did someone mention 'Process Lasso' or somesuch as an appropriate Windoze-based utility to achieve affinity control? And to return detailed timings, for that matter - or does some other utility do that better? Suggestions?
This machine of mine, if divested of BRP work, could be an ideal test rig, methinks. I could measure actual core times vs wall clock times on a per-virtual-core basis - thus get times per core not devoted to the GW tasks of interest, derive fractional overheads, etc.
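A sketch of that fractional-overhead measurement, assuming "overhead" here means the fraction of wall-clock time a task was not actually executing on a CPU (the busy() workload is just a stand-in for a real task):

```python
import time

def fractional_overhead(work):
    """Run `work` and return the fraction of wall-clock time the
    process was not executing on a CPU (scheduler waits, competing
    loads, etc.). Assumes a single-threaded workload."""
    wall0 = time.perf_counter()
    cpu0 = time.process_time()
    work()
    wall = time.perf_counter() - wall0
    cpu = time.process_time() - cpu0
    return max(0.0, 1.0 - cpu / wall)

def busy():
    # Placeholder compute load, not a real E@H task.
    s = 0
    for i in range(200_000):
        s += i * i
    return s

overhead = fractional_overhead(busy)
assert 0.0 <= overhead < 1.0
```

On a loaded machine the same measurement per task, one task per virtual core, would give exactly the per-core fractional overheads described above.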
Thanks again for collecting and presenting that! We're always looking to study such behaviours and perhaps get a hint or two on optimising. :-)
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
Quote:
that's only part one of the story - part 2 is, that in theory the core can process twice the number of calculations in a single operation. if the code can be and is vectorized to use all 16 registers.
nope - i do not care about YETI.. ;)
http://en.wikipedia.org/wiki/SSE2
in boincworld it's very rare that a real 64-bit app is not faster - you might want to check the numbers on http://wuprop.boinc-af.org/results/delai.py
just stumbled over that one:
http://software.intel.com/en-us/articles/improved-linux-smp-scaling-user-directed-processor-affinity/
interesting..
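For what it's worth, the user-directed pinning that article discusses needs no extra utility on Linux: the sched_setaffinity syscall is exposed in Python as os.sched_setaffinity. A minimal sketch (Linux-only; it simply returns None where the call doesn't exist):

```python
import os

def pin_to_cores(cores=None, pid=0):
    """Pin a process (pid 0 = the caller) to the given CPU cores.
    Returns the resulting affinity set, or None where unsupported
    (os.sched_setaffinity exists only on Linux)."""
    if not hasattr(os, "sched_setaffinity"):
        return None
    if cores is None:
        cores = os.sched_getaffinity(pid)  # keep the current mask
    os.sched_setaffinity(pid, cores)
    return os.sched_getaffinity(pid)

# e.g. pin_to_cores({0, 2}) would restrict the process to two logical
# CPUs - but note that which logical CPU numbers share a physical core
# is machine-specific, so check the topology first.
mask = pin_to_cores()
```

Pinning each BOINC task to its own logical CPU this way would remove the OS-migration variable from the HT measurements discussed earlier.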