amd or intel?

Cannibal Corpse

Joined: 21 Feb 05

Posts: 18

Credit: 1555535

RAC: 0

Hello..Pure cruncher with GPU

11 Oct 2010 2:18:36 UTC

Message 80942 in response to message 80936

Hello. Pure cruncher with GPU crunching in mind. My current AMD X4 is OCed to 2.7.

DO WHAT THO WILL SHALL BE THE WHOLE OF THE LAW.
PROUD MEMBER OF THE CARL SAGAN TEAM.

archae86

Joined: 6 Dec 05

Posts: 3145

Credit: 7053624931

RAC: 1631251

RE: Yeah. I'd always

11 Oct 2010 3:16:01 UTC

Message 80943 in response to message 80939

Quote:

Yeah. I'd always believed that hyperthreading was effectively a hardware mechanism to recover time otherwise lost with context swaps - task state segments, various flushes et al. So your HT gain is only in that area and not with regard to uninterrupted aspects of the thread.

Not a bad model, but you seriously need to upgrade it to think of the processor as containing more than one partially independent resource. To keep it simple, just think of fixed point execution, floating point execution, and memory operations (though in fact there are more). Any time any of these are idle there is at least the possibility that an infinitely fast task switch (which is something that HT roughly approximates) may be able to get use out of a resource which would otherwise be idle.

So your task switch focus is appropriate, but your notion that the only form of dead time is caused by explicit context swaps is dead wrong--HT would not work nearly so well as it does (when it works well...) were that true, as the interrupt rate is not high enough.

But there is no good general percentage improvement--it varies wildly both with application and with specific CPU implementation, and even memory implementation.
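To make the "partially independent resources" picture concrete, here is a toy Python sketch. It is not a model of any real CPU: the three units, the one-issue-per-thread-per-cycle rule, the fixed thread priority and the `mem_latency` value are all assumptions invented for illustration. All it tries to show is that the gain from a second hardware thread depends heavily on the instruction mix.

```python
def run(threads, mem_latency=3):
    """Cycles needed on a toy core with one 'int', one 'fp' and one 'mem' unit.

    threads: list of instruction lists, each instruction being 'int', 'fp' or 'mem'.
    Every cycle each hardware thread may issue its next instruction, provided the
    unit it needs is not already taken this cycle and the thread is not still
    waiting on an earlier instruction. A 'mem' instruction stalls its own thread
    for mem_latency cycles (hypothetical value); anything else takes one cycle.
    """
    pc = [0] * len(threads)        # next instruction index per thread
    ready_at = [0] * len(threads)  # first cycle at which each thread may issue again
    cycle = 0
    while any(pc[i] < len(t) for i, t in enumerate(threads)):
        busy = set()               # units claimed during this cycle
        for i, t in enumerate(threads):
            if pc[i] < len(t) and ready_at[i] <= cycle:
                op = t[pc[i]]
                if op not in busy:
                    busy.add(op)
                    pc[i] += 1
                    ready_at[i] = cycle + (mem_latency if op == "mem" else 1)
        cycle += 1
    return max([cycle] + ready_at)


def ht_gain(thread_a, thread_b):
    """Fractional throughput gain from running both threads on one core at once."""
    serial = run([thread_a]) + run([thread_b])   # one after the other, no HT
    shared = run([thread_a, thread_b])           # both hardware threads active
    return serial / shared - 1.0


int_only = ["int"] * 4
mem_heavy = ["mem", "int"] * 2

print("two int-only threads:  ", ht_gain(int_only, int_only))
print("two mem-heavy threads: ", ht_gain(mem_heavy, mem_heavy))
print("int thread + fp thread:", ht_gain(int_only, ["fp"] * 4))
```

In this toy, two threads that fight over the same unit gain nothing, while threads that leave units idle (memory stalls, or a complementary mix) gain a lot, which is the sense in which there is no single "HT percentage".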

I have a reasonably modern CPU (a Xeon E5620, the 32 nm Westmere variant of the Nehalem architecture), and just for fun will try running it for enough hours with HT and without HT to get reasonably comparable performance figures for the current Einstein apps. I thought I had already done so and concluded HT so clear a win as to leave my rig running that way, but find my records deficient--though I remain sure that is the answer. So as of the time of this posting I'm switching my most capable rig from HT to nHT, and suspending the tasks in execution on reboot to get a reasonably clean distinction. I hope to post some indications within a day or two.

For the record, on my own hardware I did see one clear case where an app had an appreciable HT penalty (instead of the more common benefit). This started at a relatively late stage in the incredible sequence of akosf optimizations of Einstein code a few years ago. On my own hardware I've not seen it other than then.

I think most people with a BOINC focus will benefit in throughput by leaving HT turned on, and most people without such a focus will do better with it turned off. Sadly, not very much of the workload which determines responsiveness as perceived by the typical PC user is multithreaded enough to give HT benefit if the processor is already multicore.

Mike Hewson

Moderator

Joined: 1 Dec 05

Posts: 6537

Credit: 286339880

RAC: 101217

RE: RE: Yeah. I'd always

11 Oct 2010 3:39:22 UTC

Message 80944 in response to message 80943

Quote:

Quote:
Yeah. I'd always believed that hyperthreading was effectively a hardware mechanism to recover time otherwise lost with context swaps - task state segments, various flushes et al. So your HT gain is only in that area and not with regard to uninterrupted aspects of the thread.

Not a bad model, but you seriously need to upgrade it to think of the processor as containing more than one partially independent resource. To keep it simple, just think of fixed point execution, floating point execution, and memory operations (though in fact there are more). Any time any of these are idle there is at least the possibility that an infinitely fast task switch (which is something that HT roughly approximates) may be able to get use out of a resource which would otherwise be idle.

So your task switch focus is appropriate, but your notion that the only form of dead time is caused by explicit context swaps is dead wrong--HT would not work nearly so well as it does (when it works well...) were that true, as the interrupt rate is not high enough.

But there is no good general percentage improvement--it varies wildly both with application and with specific CPU implementation, and even memory implementation.

I have a reasonably modern CPU (a Xeon E5620, the 32 nm Westmere variant of the Nehalem architecture), and just for fun will try running it for enough hours with HT and without HT to get reasonably comparable performance figures for the current Einstein apps. I thought I had already done so and concluded HT so clear a win as to leave my rig running that way, but find my records deficient--though I remain sure that is the answer. So as of the time of this posting I'm switching my most capable rig from HT to nHT, and suspending the tasks in execution on reboot to get a reasonably clean distinction. I hope to post some indications within a day or two.

For the record, on my own hardware I did see one clear case where an app had an appreciable HT penalty (instead of the more common benefit). This started at a relatively late stage in the incredible sequence of akosf optimizations of Einstein code a few years ago. On my own hardware I've not seen it other than then.

I think most people with a BOINC focus will benefit in throughput by leaving HT turned on, and most people without such a focus will do better with it turned off. Sadly, not very much of the workload which determines responsiveness as perceived by the typical PC user is multithreaded enough to give HT benefit if the processor is already multicore.

Well, there you go! Hence my "I'd always believed .. " preface. I'd sort of known of the non-strictly-task-switch features as mentioned, and now I realise they are encompassed by the HT moniker. I take the point about these ( sub- ) CPU resources being 'partially independent', and hence contributing to the variation in improvement. Thanks Pete for the correction/upgrade. We await your data ...

Cheers, Mike.

( edit ) Hence not just the fact of the faster task switch, but 'why would you want to?' ie. resource left idle in the absence of a switch. There'd long been task switching related ( machine instruction ) codes to flip context quicker anyway, before HT?

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

mikey

Joined: 22 Jan 05

Posts: 11936

Credit: 1832087794

RAC: 213178

RE: RE: I totally agree

11 Oct 2010 11:21:06 UTC

Message 80945 in response to message 80931

Quote:

Quote:

I totally agree but since I believe it is an increase of 10% per core, not 10% per machine, that can make for a 40% increase over a quad core.

Hmm... this piece of math needs some serious reconsideration :-)

Let's do an example:

Let's assume this 10% performance gain (on the i5/i7 I'm sure it's much higher actually for most apps, but anyway...)

So without HT, let's say a WU needs 3600 sec "per core" ==> 24 WUs per day

A 10% performance increase "per core" means 26.4 WU a day, so 6545 seconds per WU if 2 can be done in parallel on a single core.

So....how many WU per day for a 4 core w/o HT: 4 x 24 WU = 96 WU

How many WU per day for the 4 core w/ HT: well... 4 x 26.4 WU = 105.6

Overall performance gain: 10 % ....not 40%

CU
HB

Sorry, I was adding 10% per core to come up with the 40% overall. It is fuzzy math and not correct in this discussion! 8-((
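For anyone who wants to replay the quoted example, here is a small Python sketch of the same arithmetic. The 3600 seconds per WU and the 10% gain are the illustrative figures from the quote, not measurements, and the function name is made up for this sketch.

```python
SECONDS_PER_DAY = 24 * 3600


def wus_per_day(cores, seconds_per_wu, tasks_per_core=1):
    """Work units finished per day when each core runs tasks_per_core tasks at once."""
    return cores * tasks_per_core * SECONDS_PER_DAY / seconds_per_wu


# Without HT: one task per core, 3600 s each (figure from the quoted example).
base = wus_per_day(cores=4, seconds_per_wu=3600)

# With HT and an assumed 10% gain per core: two tasks share a core, so each
# task takes twice as long, divided by the 1.10 throughput factor.
ht_seconds_per_wu = 2 * 3600 / 1.10
with_ht = wus_per_day(cores=4, seconds_per_wu=ht_seconds_per_wu, tasks_per_core=2)

print(f"seconds per WU with HT: {ht_seconds_per_wu:.0f}")
print(f"WU/day without HT: {base:.1f}, with HT: {with_ht:.1f}")
print(f"overall gain: {with_ht / base - 1:.0%}")
```

The per-core gain applies to every core equally, so the ratio for the whole machine is the same as the ratio for one core.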

archae86

Joined: 6 Dec 05

Posts: 3145

Credit: 7053624931

RAC: 1631251

RE: We await your data

11 Oct 2010 15:10:01 UTC

Message 80946 in response to message 80944

Quote:

We await your data ...

The variation from WU to WU seems rather small, so an overnight sample may be enough to give a reasonably accurate notion.

For my processor, at the current operating condition and memory setup, I see

GC mean CPU time dropped from 20822 CPU seconds running HT to 14899 nHT
ABP2 dropped from 14890 to 8695.

As both drops are well under the factor of two required to get equal output, this means both apps run more productively on my host in HT.
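The factor-of-two rule can be checked with a few lines of Python. This sketch assumes HT runs exactly two tasks per core and that CPU time is a fair stand-in for elapsed time; the function name is made up for illustration.

```python
def ht_wins(ht_cpu_seconds, nht_cpu_seconds):
    """True if running two tasks per core (HT) yields more output per unit time.

    With HT each core works on two tasks at once, so HT breaks even when a task
    takes exactly twice as long as it does without HT.
    """
    return ht_cpu_seconds / nht_cpu_seconds < 2.0


for app, ht, nht in [("GC", 20822, 14899), ("ABP2", 14890, 8695)]:
    print(f"{app}: HT/nHT time ratio {ht / nht:.2f}, HT wins: {ht_wins(ht, nht)}")
```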

On these numbers I see 43% more output per unit time for ABP2, and 17% for GC.

As I claim these results will vary with processor, host configuration, application, and data, let me give some more description.

The processor is an Intel Xeon E5620, which is a 32nm Westmere family chip with four cores. CPU-Z describes the cache complement this way:
L1 D-cache 32 kByte 8-way, one per core
L1 I-cache 32 kByte 4-way, one per core
L2 cache 256 kByte 8-way one per core
L3 cache 12 Mbyte 16-way--shared by all cores

The processor is nominally a 2.4 GHz part, but the E5620 is a low-end spec for a Westmere, so many, probably most, of the units shipped have quite a lot of overclock headroom. I am operating at 3.42 GHz with a 19x multiplier. This is a socket 1366 part, so three memory channels are available. I have one 2 GB DDR3 module plugged into each channel, and am operating conservatively with a 540 MHz DRAM clock and 8-8-8-21-63 RAM timings at a 1T command rate.
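As a rough sanity check on those settings, the derived figures can be worked out as below. This is only a back-of-envelope sketch: it assumes the 540 MHz figure is the DRAM I/O clock, the usual DDR convention of two transfers per clock, and a 64-bit (8-byte) channel, and the bandwidth it prints is a theoretical peak, not something measured on this host.

```python
cpu_mhz = 3420.0          # 3.42 GHz core clock from the post
multiplier = 19           # CPU multiplier from the post
dram_clock_mhz = 540.0    # DRAM clock from the post
channels = 3              # socket 1366, one module per channel
BYTES_PER_TRANSFER = 8    # assumption: 64-bit wide memory channel

base_clock_mhz = cpu_mhz / multiplier          # base clock implied by the multiplier
dram_ratio = dram_clock_mhz / base_clock_mhz   # DRAM clock relative to base clock
transfers_mt_s = 2 * dram_clock_mhz            # DDR: two transfers per clock
peak_gb_s = channels * BYTES_PER_TRANSFER * transfers_mt_s * 1e6 / 1e9

print(f"base clock: {base_clock_mhz:.0f} MHz, DRAM ratio: {dram_ratio:.0f}x")
print(f"data rate: {transfers_mt_s:.0f} MT/s, theoretical peak: {peak_gb_s:.2f} GB/s")
```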

As current Intel chips go, this one is on the generous side for cache per core. This would help it get more HT benefit in the case of applications which have a relatively steep slope of performance vs. cache size in the region across which the parts vary. (I think current Intel offerings are mostly in the range of 1 to 3 Mbytes of top-level cache per core).

As a 3-channel design (like other full Nehalems, but unlike the more consumer-oriented parts which are 2-channel) my part has more RAM bandwidth available than the consumer parts, which may help it get more HT benefit if the application is close to being RAM-starved. However, my failure to buy high-end DRAM parts and to attempt a strenuous overclock of them means my system is actually lower in RAM capability vs. CPU performance than would be the case either for a completely non-overclocked system, or for a system with a more maximalist approach to overclocking. (This is my daily driver; I want it to be absolutely reliable.)

I'm currently not running SETI on this system except for Astropulse running on the ATI graphics card (a Lunatics release from Raistmer within the last month).

I think the SETI work varies enough from unit to unit that I'd need to do something much more careful than a simple average to get a useful HT improvement number from a small sample. The real right way to do it would be something I did during the Akos improvement--create a sort of "walled garden" so I actually was running the exact same WUs on both legs of the comparison. But I don't think the extra information warrants the effort. Possibly if I controlled for Angle Range on Multibeam, and watched out for blanking and early termination effects in Astropulse, I could get a useful answer without going to that extreme. If there is interest here, I'll think about having a try at it during next week's server run.

Bikeman (Heinz-...

Moderator

Joined: 28 Aug 06

Posts: 3522

Credit: 688546383

RAC: 201723

RE: GC mean CPU time

11 Oct 2010 17:32:26 UTC

Message 80947 in response to message 80946

Quote:

GC mean CPU time dropped from 20822 CPU seconds running HT to 14899 nHT
ABP2 dropped from 14890 to 8695.

As both drops are well under the factor of two required to get equal output, this means both apps run more productively on my host in HT.

On these numbers I see 43% more output per unit time for ABP2, and 17% for GC.

I think you got ABP2 and GC swapped in the last sentence, right?

Anyway, thanks for doing this experiment. I think it shows two things:

a) Forget everything you know about HT performance increase from the Pentium 4 days. This is different.

b) The GC app has a part in it that is under-utilizing the CPU. This will be addressed in the next round of optimization.

CU
HBE.

M. Schmitt

Joined: 27 Jun 05

Posts: 478

Credit: 15872262

RAC: 0

Let me add three

11 Oct 2010 17:56:49 UTC

Message 80948 in response to message 80947

Let me add three things.

1) The results will be different if you run 8 tasks of only one type.
2) You get the best throughput if you run a mixture of ABP2 and GC.
3) The time for GC tasks is not affected in a mixture with ABP2, but the ABP2 tasks run faster.

This is all valid on my i7 920 (Linux) with HT. I can't prove it at the moment, because the host is running 7 game servers too. In the past I measured this through RPC calls once a minute and scripts that calculated the Cr/h pretty precisely. I could also see that the GC tasks run faster (in Cr/h) when they come to the end.
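The scripts themselves are not shown here, so the following is only a guess at the kind of calculation involved, not the original code. It assumes each once-a-minute RPC poll yields a timestamp and the task's fraction done, and that the credit granted per task is known in advance; the function name, the sample layout and the 250-credit figure are all hypothetical.

```python
def credit_per_hour(samples, credit_per_task):
    """Estimate Cr/h for one task from (unix_seconds, fraction_done) samples.

    samples must be ordered oldest first. Only the first and last sample are
    used, so the result is the average rate over that window.
    """
    (t0, f0), (t1, f1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need at least two samples with increasing timestamps")
    return credit_per_task * (f1 - f0) / (t1 - t0) * 3600.0


# Two polls one minute apart, with the task advancing from 10% to 11% done.
polls = [(0, 0.10), (60, 0.11)]
print(f"{credit_per_hour(polls, credit_per_task=250.0):.1f} Cr/h")
```

Applying it to a sliding window of recent polls instead of the whole run would show rate changes within a task, such as a speed-up near the end.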

cu,
Michael

archae86

Joined: 6 Dec 05

Posts: 3145

Credit: 7053624931

RAC: 1631251

RE: I think you got ABP2

11 Oct 2010 17:59:45 UTC

Message 80949 in response to message 80947

Quote:

I think you got ABP2 and GC swapped in the last sentence, right?

No.

Bikeman (Heinz-...

Moderator

Joined: 28 Aug 06

Posts: 3522

Credit: 688546383

RAC: 201723

RE: RE: I think you got

11 Oct 2010 18:25:33 UTC

Message 80950 in response to message 80949

Quote:

Quote:
I think you got ABP2 and GC swapped in the last sentence, right?

No.

???

But your numbers indicate the following throughputs (results/day):

GC:
33.2 HT
23.2 nHT
(43% gain)

ABP2:
46.42 HT
39.75 nHT
(17% gain)

???
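These throughputs can be reproduced from the posted CPU times with a short Python sketch. It assumes 4 cores, 8 tasks in parallel with HT and 4 without, and that CPU time approximates elapsed time; the function name is made up for illustration.

```python
SECONDS_PER_DAY = 86400
CORES = 4  # the E5620 described earlier in the thread


def results_per_day(cpu_seconds, tasks_in_parallel):
    """Results finished per day when tasks_in_parallel tasks run side by side."""
    return tasks_in_parallel * SECONDS_PER_DAY / cpu_seconds


gains = {}
for app, ht_secs, nht_secs in [("GC", 20822, 14899), ("ABP2", 14890, 8695)]:
    with_ht = results_per_day(ht_secs, 2 * CORES)   # two tasks per core
    without_ht = results_per_day(nht_secs, CORES)   # one task per core
    gains[app] = with_ht / without_ht - 1
    print(f"{app}: {with_ht:.2f} HT, {without_ht:.2f} nHT, gain {gains[app]:.0%}")
```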

archae86

Joined: 6 Dec 05

Posts: 3145

Credit: 7053624931

RAC: 1631251

RE: But your numbers

11 Oct 2010 18:57:51 UTC

Message 80951 in response to message 80950

Quote:

But your numbers indicate the following throughputs (results/day):

I agree that I got them swapped. It is embarrassing that I made the initial mistake--and preposterous that I persisted in the face of your pointing out my error.

Oops.

Peter
