amd or intel?

Cannibal Corpse
Joined: 21 Feb 05
Posts: 18
Credit: 1555535
RAC: 0

Hello..Pure cruncher with GPU

Message 80942 in response to message 80936

Hello. Pure cruncher with GPU crunching in mind... My current AMD X4 is OCed to 2.7.

DO WHAT THOU WILT SHALL BE THE WHOLE OF THE LAW.
PROUD MEMBER OF THE CARL SAGAN TEAM.

archae86
Joined: 6 Dec 05
Posts: 3163
Credit: 7347681687
RAC: 2122859

RE: Yeah. I'd always

Message 80943 in response to message 80939

Quote:
Yeah. I'd always believed that hyperthreading was effectively a hardware mechanism to recover time otherwise lost with context swaps - task state segments, various flushes et al. So your HT gain is only in that area and not with regard to uninterrupted aspects of the thread.


Not a bad model, but you seriously need to upgrade it to think of the processor as containing more than one partially independent resource. To keep it simple, just think of fixed point execution, floating point execution, and memory operations (though in fact there are more). Any time any of these is idle there is at least the possibility that an infinitely fast task switch (which is something that HT roughly approximates) may be able to get use out of a resource which would otherwise be idle.

So your task switch focus is appropriate, but your notion that the only form of dead time is caused by explicit context swaps is dead wrong--HT would not work nearly so well as it does (when it works well...) were that true, as the interrupt rate is not high enough.

But there is no good general percentage improvement--it varies wildly both with application and with specific CPU implementation, and even memory implementation.

I have a reasonably modern CPU (a Xeon E5620, the 32 nm Westmere variant of the Nehalem architecture), and just for fun will have a try at running it for enough hours HT and not HT to get reasonably comparable performance figures for current Einstein aps. I thought I had already done so and concluded HT so clear a win as to leave my rig running that way, but find my records deficient--though I remain sure that is the answer. So as of the time of this posting I'm switching my most capable rig from HT to nHT, and suspending the tasks in execution on reboot to get a reasonably clean distinction. I hope to post some indications within a day or two.

For the record, on my own hardware I did see one clear case where an ap had an appreciable HT penalty (instead of the more common benefit). This started at a relatively late stage in the incredible sequence of akosf optimizations of Einstein code a few years ago. On my own hardware I've not seen it other than then.

I think most people with a BOINC focus will benefit in throughput by leaving HT turned on, and most people without it will do better with it turned off. Sadly not very much of the workload which determines responsiveness as perceived by the typical PC user is multithreaded enough to give HT benefit if the processor is already multicore.

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6592
Credit: 332069838
RAC: 298524

RE: RE: Yeah. I'd always

Message 80944 in response to message 80943

Quote:
Quote:
Yeah. I'd always believed that hyperthreading was effectively a hardware mechanism to recover time otherwise lost with context swaps - task state segments, various flushes et al. So your HT gain is only in that area and not with regard to uninterrupted aspects of the thread.

Not a bad model, but you seriously need to upgrade it to think of the processor as containing more than one partially independent resource. To keep it simple, just think of fixed point execution, floating point execution, and memory operations (though in fact there are more). Any time any of these is idle there is at least the possibility that an infinitely fast task switch (which is something that HT roughly approximates) may be able to get use out of a resource which would otherwise be idle.

So your task switch focus is appropriate, but your notion that the only form of dead time is caused by explicit context swaps is dead wrong--HT would not work nearly so well as it does (when it works well...) were that true, as the interrupt rate is not high enough.

But there is no good general percentage improvement--it varies wildly both with application and with specific CPU implementation, and even memory implementation.

I have a reasonably modern CPU (a Xeon E5620, the 32 nm Westmere variant of the Nehalem architecture), and just for fun will have a try at running it for enough hours HT and not HT to get reasonably comparable performance figures for current Einstein aps. I thought I had already done so and concluded HT so clear a win as to leave my rig running that way, but find my records deficient--though I remain sure that is the answer. So as of the time of this posting I'm switching my most capable rig from HT to nHT, and suspending the tasks in execution on reboot to get a reasonably clean distinction. I hope to post some indications within a day or two.

For the record, on my own hardware I did see one clear case where an ap had an appreciable HT penalty (instead of the more common benefit). This started at a relatively late stage in the incredible sequence of akosf optimizations of Einstein code a few years ago. On my own hardware I've not seen it other than then.

I think most people with a BOINC focus will benefit in throughput by leaving HT turned on, and most people without it will do better with it turned off. Sadly not very much of the workload which determines responsiveness as perceived by the typical PC user is multithreaded enough to give HT benefit if the processor is already multicore.


Well, there you go! Hence my "I'd always believed .. " preface. I'd sort of known of the non-strictly-task-switch features as mentioned, and now I realise they are encompassed by the HT moniker. I take the point about these ( sub- ) CPU resources being 'partially independent', and hence contributing to the variation in improvement. Thanks Pete for the correction/upgrade. We await your data ...

Cheers, Mike.

( edit ) Hence not just the fact of the faster task switch, but 'why would you want to?' ie. resource left idle in the absence of a switch. There'd long been task switching related ( machine instruction ) codes to flip context quicker anyway, before HT?

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

mikey
Joined: 22 Jan 05
Posts: 12868
Credit: 1884362640
RAC: 216189

RE: RE: I totally agree

Message 80945 in response to message 80931

Quote:
Quote:


I totally agree but since I believe it is an increase of 10% per core, not 10% per machine, that can make for a 40% increase over a quad core.

Hmm... this piece of math needs some serious reconsideration :-)

Let's do an example:

Let's assume this 10% performance gain (on the i5/i7 I'm sure it's much higher actually for most apps, but anyway...)

So without HT, let's say a WU needs 3600 sec "per core" ==> 24 WUs per day

A 10% performance increase "per core" means 26.4 WU a day, so 6545 seconds per WU if 2 can be done in parallel on a single core.

So....how many WU per day for a 4 core w/o HT: 4 x 24 WU = 96 WU

How many WU per day for the 4 core w/ HT: well... 4 x 26.4 WU = 105.6

Overall performance gain: 10 % ....not 40%

CU
HB

Sorry, I was adding 10% per core to come up with the 40% overall. It is fuzzy math and not correct in this discussion! 8-((
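For anyone who wants to check the quoted arithmetic, here is a minimal Python sketch of it. It assumes the same illustrative figures as the example above (3600 seconds per WU per core, a 10% per-core gain, four cores); the function and variable names are made up for this illustration.

```python
SECONDS_PER_DAY = 86400


def wu_per_day(cores, seconds_per_wu, ht_gain=0.0):
    """Work units per day for the whole machine.

    ht_gain is the fractional throughput gain per core (0.10 means 10%).
    """
    per_core = SECONDS_PER_DAY / seconds_per_wu
    return cores * per_core * (1.0 + ht_gain)


# Illustrative figures taken from the example in the quoted post.
without_ht = wu_per_day(cores=4, seconds_per_wu=3600)
with_ht = wu_per_day(cores=4, seconds_per_wu=3600, ht_gain=0.10)

# A per-core gain multiplies every core by the same factor,
# so the machine-wide gain is the same percentage, not the sum.
overall_gain = with_ht / without_ht - 1.0

# Run time of one WU when two run side by side on a single core.
seconds_per_wu_ht = 2 * SECONDS_PER_DAY / (with_ht / 4)

print(f"without HT: {without_ht:.1f} WU/day")
print(f"with HT:    {with_ht:.1f} WU/day")
print(f"overall gain: {overall_gain:.0%}")
print(f"seconds per WU with HT: {seconds_per_wu_ht:.0f}")
```

The point is simply that the per-core percentage scales every core alike, so it carries over unchanged to the whole machine.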

archae86
Joined: 6 Dec 05
Posts: 3163
Credit: 7347681687
RAC: 2122859

RE: We await your data

Message 80946 in response to message 80944

Quote:
We await your data ...


The variation from WU to WU seems rather small, so an overnight sample may be enough to give a reasonably accurate notion.

For my processor, at the current operating condition and memory setup, I see

GC mean CPU time dropped from 20822 CPU seconds running HT to 14899 nHT
ABP2 dropped from 14890 to 8695.

As both drops are well under the factor of two required to get equal output, this means both aps run more productively on my host in HT.

On these numbers I see 43% more output per unit time for ABP2, and 17% for GC.

As I claim these results will vary with processor, host configuration, application, and data, let me give some more description.

The processor is an Intel Xeon E5620, which is a 32nm Westmere family chip with four cores. CPU-Z describes the cache complement this way:
L1 D-cache 32 kByte 8-way, one per core
L1 I-cache 32 kByte 4-way, one per core
L2 cache 256 kByte 8-way one per core
L3 cache 12 Mbyte 16-way, shared by all cores

The processor is nominally a 2.4 GHz part, but the E5620 is a low-end spec for a Westmere, so many, probably most, of the units shipped have quite a lot of overclock headroom. I am operating at 3.42 GHz with a 19x multiplier. This is a socket 1366 part, so three memory channels are available. I have one DDR3 2G module plugged into each channel, and am operating conservatively with a 540 MHz DRAM clock, and 8-8-8-21-63 RAM timings at a 1T command rate.

As current Intel chips go, this one is on the generous side for cache per core. This would help it get more HT benefit in the case of applications which have a relatively steep slope of performance vs. cache size in the region across which the parts vary. (I think current Intel offerings are mostly in the range of 1 to 3 Mbytes of top-level cache per core).

As a 3-channel design (like other full Nehalems, but unlike the more consumer-oriented parts which are 2-channel) my part has more RAM bandwidth available than the consumer parts, which may help it get more HT benefit if the application is close to being RAM-starved. However my failure to buy high-end DRAM parts and to attempt a strenuous overclock of them means my system is actually lower in RAM capability vs. CPU performance than would be the case either for a completely non-overclocked system, or for a system with a more maximalist approach to overclocking. (This is my daily driver; I want it to be absolutely reliable.)

I'm currently not running SETI on this system except for Astropulse running on the ATI graphics card (a lunatics release from Raistmer within the last month).

I think the SETI work varies enough from unit to unit that I'd need to do something much more careful than a simple average to get a useful HT improvement number from a small sample. The real right way to do it would be something I did during the Akos improvement--create a sort of "walled garden" so I actually was running the exact same WUs on both legs of the comparison. But I doubt the extra information warrants the effort. Possibly if I controlled for Angle Range on Multibeam, and watched out for blanking and early termination effects in Astropulse, I could get a useful answer without going to that extreme. If there is interest here, I'll think about having a try at it during next week's server run.

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 801858231
RAC: 1203018

RE: GC mean CPU time

Message 80947 in response to message 80946

Quote:


GC mean CPU time dropped from 20822 CPU seconds running HT to 14899 nHT
ABP2 dropped from 14890 to 8695.

As both drops are well under the factor of two required to get equal output, this means both aps run more productively on my host in HT.

On these numbers I see 43% more output per unit time for ABP2, and 17% for GC.

I think you got ABP2 and GC swapped in the last sentence, right?

Anyway, thanks for doing this experiment. I think it shows two things:

a) forget everything you know about HT performance increase from the Pentium 4 days. This is different.

b) The GC app has a part in it that is under-utilizing the CPU. This will be addressed in the next round of optimization.

CU
HBE.

M. Schmitt
Joined: 27 Jun 05
Posts: 478
Credit: 15872262
RAC: 0

Let me add three

Message 80948 in response to message 80947

Let me add three things.

1) The results will be different if you run 8 tasks of only one type.
2) You get the best throughput if you run a mixture of ABP2 and GC.
3) The time for GC tasks is not affected in a mixture with ABP2, but the ABP2 tasks run faster.

This is all valid on my i7 920 (Linux) with HT. I can't prove it at the moment, because the host is running 7 game servers too. In the past I measured this through RPC calls once a minute and scripts that calculated the Cr/h pretty precisely. I could also see that the GC tasks run faster (Cr/h) as they come to the end.

cu,
Michael

archae86
Joined: 6 Dec 05
Posts: 3163
Credit: 7347681687
RAC: 2122859

RE: I think you got ABP2

Message 80949 in response to message 80947

Quote:
I think you got ABP2 and GC swapped in the last sentence, right?


No.

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 801858231
RAC: 1203018

RE: RE: I think you got

Message 80950 in response to message 80949

Quote:
Quote:
I think you got ABP2 and GC swapped in the last sentence, right?

No.


???

But your numbers indicate the following throughputs (results/day):

GC:
33.2 HT
23.2 nHT
(43% gain)

ABP2:
46.42 HT
39.75 nHT
(17% gain).

???

HB
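The throughput figures above can be reproduced with a short Python sketch. It assumes that CPU time is close to elapsed time, that the four-core host runs 8 tasks of a single type at once with HT and 4 without, and that the mean CPU times are the ones posted earlier in the thread; the names used are made up for this illustration.

```python
SECONDS_PER_DAY = 86400
PHYSICAL_CORES = 4  # Xeon E5620, as described earlier in the thread

# Mean CPU seconds per task, copied from the posted measurements.
mean_cpu_seconds = {
    "GC": {"HT": 20822, "nHT": 14899},
    "ABP2": {"HT": 14890, "nHT": 8695},
}


def results_per_day(cpu_seconds_per_task, concurrent_tasks):
    """Results per day, assuming CPU time is roughly equal to elapsed time."""
    return concurrent_tasks * SECONDS_PER_DAY / cpu_seconds_per_task


def ht_comparison(times):
    """Return (HT results/day, nHT results/day, fractional HT gain)."""
    ht = results_per_day(times["HT"], 2 * PHYSICAL_CORES)
    nht = results_per_day(times["nHT"], PHYSICAL_CORES)
    return ht, nht, ht / nht - 1.0


gc_ht, gc_nht, gc_gain = ht_comparison(mean_cpu_seconds["GC"])
abp_ht, abp_nht, abp_gain = ht_comparison(mean_cpu_seconds["ABP2"])

for name, (ht, nht, gain) in (
    ("GC", (gc_ht, gc_nht, gc_gain)),
    ("ABP2", (abp_ht, abp_nht, abp_gain)),
):
    print(f"{name}: {ht:.2f} HT, {nht:.2f} nHT, gain {gain:.0%}")
```

The gain reduces to 2 x (nHT time) / (HT time) - 1, which is why HT only breaks even when the per-task time exactly doubles.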

archae86
Joined: 6 Dec 05
Posts: 3163
Credit: 7347681687
RAC: 2122859

RE: But your numbers

Message 80951 in response to message 80950

Quote:
But your numbers indicate the following throughputs (results/day):


I agree that I got them swapped. It is embarrassing that I made the initial mistake--and preposterous that I persisted in the face of your pointing out my error.

Oops.

Peter
