since even on a 64-bit linux we are running a 32-bit app (correct me if i'm wrong), only 8 of the 16 SSE2 registers of a core are usable.
on the other hand, exactly this may lead to better HT performance.
I was not even aware of this particular distinction. Speaking hypothetically, I suppose that an application variant which used more registers might be expected to generate less data memory traffic.
that's only part one of the story - part 2 is that, in theory, the core can process twice the number of calculations in a single operation, if the code can be, and is, vectorized to use all 16 registers.
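To put a rough illustration on the vectorization point (a loose sketch only: numpy stands in here for hand-vectorized SSE2 code, and the two-doubles-per-register figure is simply the 128-bit SSE2 register width):

```python
import numpy as np

# SSE2 width: one 128-bit XMM register holds 2 float64 (or 4 float32) lanes.
LANES_F64 = 128 // 64

a = np.arange(8, dtype=np.float64)
b = np.arange(8, dtype=np.float64)

# Scalar view: one multiply-add per element.
scalar = [a[i] * b[i] + 1.0 for i in range(len(a))]

# Vector view: numpy's inner loops are SIMD-vectorized where the
# hardware allows it, processing LANES_F64 elements per instruction.
vector = a * b + 1.0

assert np.allclose(scalar, vector)
# With 16 architectural XMM registers (x86-64 mode) instead of 8
# (32-bit mode), a compiler can keep twice as much of the working
# set in registers instead of spilling it to memory.
```

Whether that translates into real speed-up is of course exactly the "it depends" question discussed in the rest of the thread.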
That would seem to push back in the opposite direction if it were true. Not denying the performance benefit possibly available to the portion of the code doing that, but the opportunity for HT benefit would seem to go up with more closely spaced data reads from memory, which this would do.
But I doubt it is true. Do you seriously think that even Nehalem comes equipped with enough distinct SSE floating point units ever to keep 16 registers in use in real-world code?
Looking at this Nehalem block diagram I see two available ADD SSE units and one available MUL/DIV SSE unit. So, thinking in terms of two-operand instructions, that could keep six registers busy once in a while - but how on earth do you imagine getting to sixteen?
this is MAYBE another thing on nehalems and bulldozers. but we're talking about cores capable of SSE2 in 64-bit mode, and that means ( cough ) P4 and ATHLON64!
the difference between "native mode" and 64-bit mode started back then..
that's just another reason why you'll find "AMD64" everywhere.
getting back to a certain app - it really depends:
heavy use of SSEx instructions?
code which can be vectorized?
processor architecture?
bottom line is: unless you really give it a try, you'll never know - but if you don't, you might as well keep believing the earth is flat.
Over at SETI, it appears that the Lunatics tuned applications include distinct x64 and x32 Windows variants for both Astropulse and Multibeam. Do you know whether anyone has done work to compare the actual execution performance to see what benefit their x64 version provides compared to x32 when both are running in a 64-bit OS? I think there has actually been less recent careful HT assessment there than here, and certainly don't recall spotting any x64 vs. x32 HT detail.
There are only 64bit Windows apps for CPU Multibeam; there are no Windows 64bit Astropulse apps, and no Windows 64bit Cuda apps.
Claggy
Thanks for the correction - I carelessly relied on a heading in their download area reading
Quote:
AstroPulse for Windows - x64 & x32 Bit Windows AstroPulse apps for SSE & SSE3.
which is similar language to that used for the Multibeam entry. Maybe the intended meaning is that the applications will run in those environments.
There are just different installers aimed at 32bit or 64bit Boinc's; the 64bit installer makes more entries to try and make sure no one loses any work when installing the Lunatics apps.
But the only 64bit app in it is the AK_V8 MB app.
Claggy
I've been tied up the last week and I see a few questions have come up on the HT results I collected.
- No special techniques were used for setting processor affinity
- I used a standard 64 bit ubuntu load with 32 bit compatibility libs, no other OS tuning
- The 32 bit S5 SSE2 application was utilized, version 1.07
- This machine is dedicated to running E@H, so no other side loads
- It took many weeks to collect measurements for all the data points, so different frequencies were used
- I assumed slight variations in measurement points were from the different frequencies and data sets, as discussed in the beginning of this thread
- Thanks Mike for the suggestion to compare slope ratios as a way to answer the question how much benefit from HT
- My calculations are in reasonable agreement with Mike's calculated slope ratio of 2.69
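For anyone wanting to redo that slope-ratio comparison numerically rather than by eye, here's a minimal sketch; the (tasks, seconds) numbers below are invented purely to show the arithmetic and are not taken from the measured data:

```python
# Least-squares line fit and slope ratio, with made-up data standing in
# for measured (active tasks, elapsed seconds) points.

def slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Hypothetical: below the physical core count, elapsed time grows slowly
# with load; in the hyper-threaded region it grows ~2.7x faster.
tasks_lo = [1, 2, 3, 4, 5, 6]
time_lo  = [100 + 10 * t for t in tasks_lo]        # slope 10
tasks_ht = [7, 8, 9, 10, 11, 12]
time_ht  = [160 + 27 * (t - 6) for t in tasks_ht]  # slope 27

ratio = slope(tasks_ht, time_ht) / slope(tasks_lo, time_lo)
print(round(ratio, 2))  # → 2.7
```

The same fit applied per frequency curve would give one ratio per frequency, which is what makes the curves comparable across clock speeds.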
There seemed to be interest in this data, so I'll post the hyper-threading S5 data I collected on the i7-980 for comparison.
The same collection conditions apply.
Only two frequency curves were collected for the 980, so I'm not willing to conclude whether the 980's slope ratios are the same or different; they are close. At 3.0 GHz the 980's slope ratio is close to the 2600K's, as shown by Mike's earlier calculations.
Quote:
My Calculations are in reasonable agreement with Mike's 2.69 calculated slope ratio
Being lines fitted by ( my ) eye on a piece of paper, I definitely think I'm over-exact in quoting 2 decimal places. Probably better to quote to only one - say, call it 2.7 .... :-)
Anyway they're all around 2.5 to 3.0, hence my impression is that OS swap overhead is at least comparable to HT effects at high core loads. OK. If so, then to test: did someone mention 'Process Lasso' or somesuch as an appropriate Windoze-based utility to achieve affinity control? And to return detailed timings, for that matter - or does some other utility do that better? Suggestions?
This machine of mine, if divested of BRP work, could be an ideal test rig, methinks. I could measure actual core times vs wall clock times on a per-virtual-core basis - thus get times per core not devoted to the GW tasks of interest, derive fractional overheads, etc.
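A sketch of that fractional-overhead measurement, assuming "overhead" here means the fraction of wall-clock time a task was not actually executing on a CPU (the busy() workload is just a stand-in for a real task):

```python
import time

def fractional_overhead(work):
    """Run `work` and return the fraction of wall-clock time the
    process was not executing on a CPU (scheduler waits, competing
    loads, etc.). Assumes a single-threaded workload."""
    wall0 = time.perf_counter()
    cpu0 = time.process_time()
    work()
    wall = time.perf_counter() - wall0
    cpu = time.process_time() - cpu0
    return max(0.0, 1.0 - cpu / wall)

def busy():
    # Placeholder compute load, not a real E@H task.
    s = 0
    for i in range(200_000):
        s += i * i
    return s

overhead = fractional_overhead(busy)
assert 0.0 <= overhead < 1.0
```

On a loaded machine the same measurement per task, one task per virtual core, would give exactly the per-core fractional overheads described above.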
Thanks again for collecting and presenting that! We're always looking to study such behaviours and perhaps get a hint or two on optimising. :-)
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
Quote:
that's only part one of the story - part 2 is, that in theory the core can process twice the number of calculations in a single operation. if the code can be and is vectorized to use all 16 registers.
nope - i do not care about YETI.. ;)
http://en.wikipedia.org/wiki/SSE2
in boincworld it's very rare that a real 64-bit app is not faster - you might want to check the numbers on http://wuprop.boinc-af.org/results/delai.py
just stumbled over that one:
http://software.intel.com/en-us/articles/improved-linux-smp-scaling-user-directed-processor-affinity/
interesting..
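For what it's worth, the user-directed pinning that article discusses needs no extra utility on Linux: the sched_setaffinity syscall is exposed in Python as os.sched_setaffinity. A minimal sketch (Linux-only; it simply returns None where the call doesn't exist):

```python
import os

def pin_to_cores(cores=None, pid=0):
    """Pin a process (pid 0 = the caller) to the given CPU cores.
    Returns the resulting affinity set, or None where unsupported
    (os.sched_setaffinity exists only on Linux)."""
    if not hasattr(os, "sched_setaffinity"):
        return None
    if cores is None:
        cores = os.sched_getaffinity(pid)  # keep the current mask
    os.sched_setaffinity(pid, cores)
    return os.sched_getaffinity(pid)

# e.g. pin_to_cores({0, 2}) would restrict the process to two logical
# CPUs - but note that which logical CPU numbers share a physical core
# is machine-specific, so check the topology first.
mask = pin_to_cores()
```

Pinning each BOINC task to its own logical CPU this way would remove the OS-migration variable from the HT measurements discussed earlier.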