Memory channels provisioned vs. Einstein performance E5620


archae86
Joined: 6 Dec 05
Posts: 1929
Credit: 440531898
RAC: 580366
Topic 195562

I've often wondered whether the 3-channel Nehalem-family parts are heavily over-supplied with RAM bandwidth for Einstein application performance purposes. The fact that Intel provides many variants with just two channels in its more consumer-oriented socket suggests as much.

I recently had to invade my Westmere (E5620) host because one RAM stick had failed, and took the time to run comparison tests of 1, 2, and 3 populated channels, hyperthreaded and non-hyperthreaded.

The results will not be broadly applicable, as RAM requirements vary widely by application, Nehalem-family cache sizes vary, and the component stock specs and user overclocking practices move around the relative performance of CPU and RAM channels quite a bit. Still, some may find some interest here.

For all these tests, the CPU was running at a moderate overclock of 3.42 GHz with the multiplier at 19. For all these tests the BIOS was left to set up the RAM as it wished, presumably based on SPD information and the clock rate implied by the 100/180 clock settings.

For those who speak CPU-Z, here is the CPU state:

and here the RAM state (3-channel populated case shown):

The memory sticks were all Corsair Platinum series XMS3 DDR3 parts, shipping as 3-packs under part number TR3X6G1333C9, for which CPU-Z reads the SPD information as:

And here are the mean execution times in CPU seconds for a full set (4 for nHT, 8 for HT) of current Einstein Global Correlations S5 HF search #1 v3.06 (S5GCESSE2) tasks.

The RelProd line indicates system productivity in each configuration as a fraction of the highest performing (hyperthreaded 3-channel) configuration.

It must be kept in mind that I've used an appreciable overclock on the CPU, and none at all on the RAM. Those who buy premium RAM and slave away to find clock counts to save may get faster RAM relative to CPU than this, and those who run everything dead stock will have slower RAM relative to CPU than this.

1. But for this case and this app, in hyperthreaded mode the one-channel case is severely starved, and even the nHT case is significantly impaired.

2. Going from two to three channels in this configuration helps the nHT case very little, and the HT case only moderately.

3. HT always helps, but it helps a lot more when the configuration is not memory starved (the HTBen line is a direct productivity comparison of HT vs. nHT at the same channel count).
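For anyone who wants to reproduce the RelProd and HTBen arithmetic, here is a minimal sketch. The mean task times below are hypothetical placeholders, not my measured values; "productivity" is simply concurrent tasks divided by mean task time.

```python
# Sketch of how RelProd and HTBen are derived. The mean task times are
# HYPOTHETICAL placeholders, not the measured values from this post.
mean_secs = {
    # (channels, hyperthreaded): mean CPU seconds per task
    (1, False): 30000.0, (2, False): 26000.0, (3, False): 25800.0,
    (1, True):  52000.0, (2, True):  38000.0, (3, True):  36000.0,
}

def productivity(channels, ht):
    """Tasks completed per second: concurrent tasks / mean task time."""
    concurrent = 8 if ht else 4   # 8 tasks with HT, 4 without (quad core)
    return concurrent / mean_secs[(channels, ht)]

# RelProd: each configuration's productivity as a fraction of the best
best = max(productivity(c, ht) for c, ht in mean_secs)
rel_prod = {k: productivity(*k) / best for k in mean_secs}

# HTBen: productivity of HT vs. nHT at the same channel count
ht_ben = {c: productivity(c, True) / productivity(c, False) for c in (1, 2, 3)}
```

With these placeholder numbers, the HT 3-channel case is the best performer, and the HT benefit grows as channels are added, mirroring observation 3 above.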

Conscious of our old problem with high result-to-result variation in execution requirements, I made a reasonably serious effort to match the freq/seq characteristics of the six results sets here compared. I've shown the Stdev for each set, mostly to document that by and large the differences between test cases are large compared to the possibly random timing variations present, and also that generally the timing variations observed within my samples were very small. True, the Frequency range was very small (1264.00 to 1264.25), but the seq range was wider (3 to 462) and I just did not see evidence of major systematic variation. I think the high stdev for the hyperthreaded one channel case is another symptom of the severe memory famine of that configuration, not evidence that my matching efforts failed and somehow stocked that case with massively more inherent result effort variation.
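The within-set variation check described above can be done with the Python standard library; the per-task times below are made-up illustrative values, not the actual result sets.

```python
import statistics

# HYPOTHETICAL per-task CPU times (seconds) for one configuration's
# result set; not the measured values from the tests above.
times = [25700.0, 25850.0, 25900.0, 25750.0]

mean = statistics.mean(times)
stdev = statistics.stdev(times)   # sample standard deviation
cv = stdev / mean                 # coefficient of variation

# If cv is small compared to the between-configuration differences,
# the configuration effects dominate random task-to-task variation.
```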

I have no doubt that effort expended on getting lower RAM latency through tighter memory timings would benefit _all_ of these cases (even when you are not waiting for your brother tasks because of bandwidth constraints, you must wait out the latency time for every jump for which the target is not in some form of cache, and for every similarly challenged data memory access). But I have a low appetite for and little experience in twisting the tail on DDR3 RAM clocks, so don't plan to try. I would, however, be happy to watch another contributor taking a look at that.

DanNeely
Joined: 4 Sep 05
Posts: 1125
Credit: 222041786
RAC: 257662

Memory channels provisioned vs. Einstein performance E5620

I'm mildly surprised that you saw any difference from the 3rd channel. When LGA1156 came out, the Intel engineer who gave Anand the tech dump said that the 3rd channel in LGA1366 was for hex-core support, and that outside of synthetic benchmarks two channels would be sufficient to keep a quad core from bottlenecking. The Einstein apps must really be hammering the memory controllers in order to see that effect.

archae86
Joined: 6 Dec 05
Posts: 1929
Credit: 440531898
RAC: 580366

RE: The Einstein apps must

Quote:
The Einstein apps must really be hammering the memory controllers in order to see that effect.


I imagine the Intel engineer was presuming that neither the CPU clock nor the RAM timings would be overclocked. He also might quite likely decline to label what I saw as bottlenecking, reserving that term for something more severe. In pushing up the CPU clock and not the RAM, I've definitely pushed further into the RAM-congestion side of the envelope.

That said, I'll wager there exist apps far more RAM-intensive than Einstein, though they may not be ones likely to make up much of most plausible workloads.

Separately, I failed to mention an important RAM configuration detail. Those who know that Einstein work of this type has about a 250 Mbyte working set may figure that my single-channel case, run hyperthreaded, would have gone into serious swapping as 2G of Einstein and something like 1G of Windows 7 tried to fit into 2G of physical RAM. However, I actually placed two 2G modules on the single channel in service, so the 1-channel and 2-channel cases had the same RAM capacity. True, the 3-channel case had two more gig, but I doubt it found any use for it that had an appreciable effect on execution times.

ML1
Joined: 20 Feb 05
Posts: 331
Credit: 30659892
RAC: 30835

RE: Very good test

Quote:

Very good test there, thanks.

To put some e@h performance percentages on there, you get:
[pre]
Single channel -> double channel: +38% (HT), +17% (nHT)
Double channel -> triple channel: +05% (HT), +01% (nHT)
[/pre]
vs. how much does the extra channel cost?...
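As a sanity check on how such percentages fall out of mean task times, here is a small sketch. Throughput is inversely proportional to mean task time, so the gain from adding a channel is time_before / time_after - 1. The HT mean times below are hypothetical, chosen only to land near the quoted gains.

```python
# HYPOTHETICAL mean task times (seconds) per channel count, HT case.
mean_secs_ht = {1: 52000.0, 2: 37700.0, 3: 35900.0}

def gain(before, after):
    """Fractional throughput gain going from `before` to `after` channels."""
    return mean_secs_ht[before] / mean_secs_ht[after] - 1.0

one_to_two = gain(1, 2)     # large gain: single channel was starved
two_to_three = gain(2, 3)   # small gain: two channels nearly suffice
```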

Happy fast crunchin',
Martin

Powered by: Mageia5
See & try out your OS Freedom! Linux Voice
The Future is what We all make IT [url=http://www.gnu.org/copyleft/gpl.html](GPLv3)[/url]

DanNeely
Joined: 4 Sep 05
Posts: 1125
Credit: 222041786
RAC: 257662

LGA 1156 includes quad cores

LGA 1156 includes quad cores that will turbo to 3.33 GHz on DDR3-1333, so the fact that you didn't clock your RAM to 1600 MHz probably isn't a significant factor.

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 744
Credit: 41375144
RAC: 352

Thanks for the tests! And

Thanks for the tests!

And I agree: you've probably got a higher-than-average CPU clock (3.4 GHz with all cores loaded), and your DDR3-1333 memory is actually running at 1080 MHz effective (2 x 540 MHz as reported by CPU-Z). Most stock configurations aimed at high performance will run 1333 or 1600 instead, so they'll be less bandwidth starved.

MrS

Scanning for our furry friends since Jan 2002

Robert_56
Joined: 5 Nov 05
Posts: 37
Credit: 275303122
RAC: 4189

For reference about the

For reference on the effects of fully populating all six memory slots, I ran a test like this a year ago against the applications that were active then.

The system was an i7-980 (hexacore), overclocked to 3.6 GHz with HT on (so 12 jobs running at the same time), using an X58 board. Memory configurations were 3 x 2GB DDR3-1866 @ 1.5 V (call it HT_3) and 6 x 2GB DDR3-1866 @ 1.5 V (call it HT_6).

Result for gravity wave jobs: HT_3 = HT_6

I saw no difference in the speed of the jobs, but power increased by 17 watts (250 W vs. 233 W) for the HT_6 case. That is quite a power penalty in my mind.

DanNeely
Joined: 4 Sep 05
Posts: 1125
Credit: 222041786
RAC: 257662

RE: For reference about the

Quote:

For reference about the effects of fully populating all 6 memory slots, I ran a test like this a year ago against the applications that were active then.

The system was an i7-980 (hexacore), overclocked to 3.6 GHz with HT on (so 12 jobs running at the same time), using an X58 board. Memory configurations were 3 x 2GB DDR3-1866 @ 1.5 V (call it HT_3) and 6 x 2GB DDR3-1866 @ 1.5 V (call it HT_6).

Result for gravity wave jobs: HT_3 = HT_6

I saw no difference in the speed of the jobs, but power increased by 17 watts (250 W vs. 233 W) for the HT_6 case. That is quite a power penalty in my mind.

The second set of memory slots is just there to connect a second DIMM to each channel. Your power usage went up because you were powering more chips.

archae86
Joined: 6 Dec 05
Posts: 1929
Credit: 440531898
RAC: 580366

Robert wrote: Result for

Robert wrote:

Result for gravity wave jobs: HT_3 = HT_6

I saw no difference in the speed of the jobs, but power increased by 17 watts (250 w - 233 w) for the HT_6 case. That is quite a power penalty in my mind.


Your case is a useful illustration of a basic relationship: When you don't have enough memory and are swapping, just about nothing beats the cost/performance value of adding memory. When you do have enough memory, adding memory does nothing but add cost, failure rate, and power dissipation.

tullio
Joined: 22 Jan 05
Posts: 1920
Credit: 5225098
RAC: 4628

I have a 8 GB RAM on my

I have 8 GB of RAM on my Linux PAE system. I am running 6 BOINC projects, including 2 virtual machines via VirtualBox. Application data uses 26% of RAM, disk caching 58%; 10% is free, plus some disk buffers.
Tullio

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 744
Credit: 41375144
RAC: 352

RE: Your case is a useful

Quote:
Your case is a useful illustration of a basic relationship: When you don't have enough memory and are swapping, just about nothing beats the cost/performance value of adding memory. When you do have enough memory, adding memory does nothing but add cost, failure rate, and power dissipation.

Totally agreed. When people say "more RAM makes your computer faster", I like to reply "more RAM doesn't make it faster, it keeps it from getting slower". That changed a bit with SuperFetch, i.e. not only caching recent files but also predicting which stuff I'll usually need next, but generally I still stand by this.

MrS

Scanning for our furry friends since Jan 2002
