Memory channels provisioned vs. Einstein performance E5620


archae86
Joined: 6 Dec 05
Posts: 1929
Credit: 440531898
RAC: 580366
Topic 195562

I've often wondered whether the 3-channel Nehalem-family parts are heavily over-supplied with RAM bandwidth for Einstein application performance purposes. The fact that Intel provides many variants with just two channels in its more consumer-oriented socket suggests as much.

I recently had to invade my Westmere (E5620) host because one RAM stick had failed, and took the time to run comparison tests of 1, 2, and 3 populated channels, hyperthreaded and non-hyperthreaded.

The results will not be broadly applicable, as RAM requirements vary widely by application, Nehalem-family cache sizes vary, and the component stock specs and user overclocking practices move around the relative performance of CPU and RAM channels quite a bit. Still, some may find some interest here.

For all these tests, the CPU was running at a moderate overclock of 3.42 GHz with the multiplier at 19. For all these tests the BIOS was left to set up the RAM as it wished, presumably based on SPD information and the clock rate implied by the 100/180 clock settings.

For those who speak CPU-Z, here is the CPU state:

and here the RAM state (3-channel populated case shown):

The memory sticks were all Corsair Platinum series XMS3 DDR3 parts, shipping as 3-packs under part number TR3X6G1333C9, for which CPU-Z reads the SPD information as:

And here are the mean execution times in CPU seconds for a full set (4 for nHT, 8 for HT) of current Einstein Global Correlations S5 HF search #1 v3.06 (S5GCESSE2) tasks.

The RelProd line indicates system productivity in each configuration as a fraction of the highest performing (hyperthreaded 3-channel) configuration.

It must be kept in mind that I've used an appreciable overclock on the CPU, and none at all on the RAM. Those who buy premium RAM and slave away to find clock counts to save may get faster RAM relative to CPU than this, and those who run everything dead stock will have slower RAM relative to CPU than this.

1. But for this case and this app, in hyperthreaded mode the one-channel case is severely starved, and even the nHT case is significantly impaired.

2. Going from two to three channels in this configuration helps the nHT case very little, and the HT case only moderately.

3. HT always helps, but it helps a lot more when the configuration is not memory starved (the HTBen line is a direct productivity comparison of HT vs. nHT at the same channel count).
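For anyone who wants to reproduce the RelProd and HTBen arithmetic, here is a minimal sketch. The mean task times below are hypothetical placeholders, not my measured values; "productivity" is simply concurrent tasks divided by mean task time.

```python
# Sketch of how RelProd and HTBen are derived. The mean task times are
# HYPOTHETICAL placeholders, not the measured values from this post.
mean_secs = {
    # (channels, hyperthreaded): mean CPU seconds per task
    (1, False): 30000.0, (2, False): 26000.0, (3, False): 25800.0,
    (1, True):  52000.0, (2, True):  38000.0, (3, True):  36000.0,
}

def productivity(channels, ht):
    """Tasks completed per second: concurrent tasks / mean task time."""
    concurrent = 8 if ht else 4   # 8 tasks with HT, 4 without (quad core)
    return concurrent / mean_secs[(channels, ht)]

# RelProd: each configuration's productivity as a fraction of the best
best = max(productivity(c, ht) for c, ht in mean_secs)
rel_prod = {k: productivity(*k) / best for k in mean_secs}

# HTBen: productivity of HT vs. nHT at the same channel count
ht_ben = {c: productivity(c, True) / productivity(c, False) for c in (1, 2, 3)}
```

With these placeholder numbers, the HT 3-channel case is the best performer, and the HT benefit grows as channels are added, mirroring observation 3 above.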

Conscious of our old problem with high result-to-result variation in execution requirements, I made a reasonably serious effort to match the freq/seq characteristics of the six results sets here compared. I've shown the Stdev for each set, mostly to document that by and large the differences between test cases are large compared to the possibly random timing variations present, and also that generally the timing variations observed within my samples were very small. True, the Frequency range was very small (1264.00 to 1264.25), but the seq range was wider (3 to 462) and I just did not see evidence of major systematic variation. I think the high stdev for the hyperthreaded one channel case is another symptom of the severe memory famine of that configuration, not evidence that my matching efforts failed and somehow stocked that case with massively more inherent result effort variation.
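The within-set variation check described above can be done with the Python standard library; the per-task times below are made-up illustrative values, not the actual result sets.

```python
import statistics

# HYPOTHETICAL per-task CPU times (seconds) for one configuration's
# result set; not the measured values from the tests above.
times = [25700.0, 25850.0, 25900.0, 25750.0]

mean = statistics.mean(times)
stdev = statistics.stdev(times)   # sample standard deviation
cv = stdev / mean                 # coefficient of variation

# If cv is small compared to the between-configuration differences,
# the configuration effects dominate random task-to-task variation.
```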

I have no doubt that effort expended on getting lower RAM latency through tighter memory timings would benefit _all_ of these cases (even when you are not waiting for your brother tasks because of bandwidth constraints, you must wait out the latency time for every jump for which the target is not in some form of cache, and for every similarly challenged data memory access). But I have a low appetite for and little experience in twisting the tail on DDR3 RAM clocks, so don't plan to try. I would, however, be happy to watch another contributor taking a look at that.

DanNeely
Joined: 4 Sep 05
Posts: 1125
Credit: 222041786
RAC: 257662

Memory channels provisioned vs. Einstein performance E5620

I'm mildly surprised that you saw any difference from the 3rd channel. When LGA1156 came out, the Intel engineer who gave Anand the tech dump said that the 3rd channel in LGA1366 was for hex-core support, and that outside of synthetic benchmarks two channels would be sufficient to keep a quad core from bottlenecking. The Einstein apps must really be hammering the memory controllers in order to see that effect.

archae86
Joined: 6 Dec 05
Posts: 1929
Credit: 440531898
RAC: 580366

RE: The Einstein apps must

Quote:
The Einstein apps must really be hammering the memory controllers in order to see that effect.


I imagine the Intel engineer was presuming that neither the CPU clock nor the RAM timings would be overclocked. He also might quite likely decline to label what I saw as bottlenecking, reserving that term for something more severe. In pushing up the CPU clock and not the RAM, I've definitely pushed further into the RAM-congestion side of the envelope.

That said, I'll wager there exist apps far more RAM-intensive than Einstein, though they may not be ones likely to make up much of most plausible workloads.

Separately, I failed to mention an important RAM configuration detail. Those who know that Einstein work of this type has about a 250 Mbyte working set may figure that my single-channel case, run hyperthreaded, would have gone into serious swapping as 2G of Einstein and something like 1G of Windows 7 tried to fit into 2G of physical RAM. However, I actually placed two 2G modules on the single channel in service, so the 1-channel and 2-channel cases had the same RAM capacity. True, the 3-channel case had two more gig, but I doubt it found any use for it that had an appreciable effect on execution times.

ML1
Joined: 20 Feb 05
Posts: 331
Credit: 30659892
RAC: 30835

RE: Very good test

Quote:

Very good test there, thanks.

To put some e@h performance percentages on there, you get:
[pre]
Single channel -> double channel: +38% (HT), +17% (nHT)
Double channel -> triple channel: +05% (HT), +01% (nHT)
[/pre]
vs. how much does the extra channel cost?...
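As a sanity check on how such percentages fall out of mean task times, here is a small sketch. Throughput is inversely proportional to mean task time, so the gain from adding a channel is time_before / time_after - 1. The HT mean times below are hypothetical, chosen only to land near the quoted gains.

```python
# HYPOTHETICAL mean task times (seconds) per channel count, HT case.
mean_secs_ht = {1: 52000.0, 2: 37700.0, 3: 35900.0}

def gain(before, after):
    """Fractional throughput gain going from `before` to `after` channels."""
    return mean_secs_ht[before] / mean_secs_ht[after] - 1.0

one_to_two = gain(1, 2)     # large gain: single channel was starved
two_to_three = gain(2, 3)   # small gain: two channels nearly suffice
```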

Happy fast crunchin',
Martin

Powered by: Mageia5
See & try out your OS Freedom! Linux Voice
The Future is what We all make IT [url=http://www.gnu.org/copyleft/gpl.html](GPLv3)[/url]

DanNeely
Joined: 4 Sep 05
Posts: 1125
Credit: 222041786
RAC: 257662

LGA 1156 includes quad cores

LGA 1156 includes quad cores that will turbo to 3.33 GHz on DDR3-1333, so the fact that you didn't clock your RAM to 1600 MHz probably isn't a significant factor.

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 744
Credit: 41375144
RAC: 352

Thanks for the tests! And

Thanks for the tests!

And I agree: you've probably got a higher-than-average CPU clock (3.4 GHz with all cores loaded), and your DDR3-1333 memory is actually running at 1080 MHz effective (2 x 540 MHz as reported by CPU-Z). Most stock configurations aimed at high performance will run 1333 or 1600 instead, so they'll be less bandwidth starved.

MrS

Scanning for our furry friends since Jan 2002

Robert_56
Joined: 5 Nov 05
Posts: 37
Credit: 275303122
RAC: 4189

For reference about the

For reference on the effects of fully populating all six memory slots, I ran a test like this a year ago against the applications that were active then.

The system was an i7-980 (hexacore), overclocked to 3.6 GHz with HT on (so 12 jobs running at the same time), using an X58 board. Memory configurations were 3 x 2GB DDR3-1866 @ 1.5 V (call it HT_3) and 6 x 2GB DDR3-1866 @ 1.5 V (call it HT_6).

Result for gravity wave jobs: HT_3 = HT_6

I saw no difference in the speed of the jobs, but power increased by 17 watts (250 W vs. 233 W) for the HT_6 case. That is quite a power penalty in my mind.

DanNeely
Joined: 4 Sep 05
Posts: 1125
Credit: 222041786
RAC: 257662

RE: For reference about the

Quote:

For reference about the effects of fully populating all 6 memory slots, I ran a test like this a year ago against the applications that were active then.

The system was an i7-980 (hexacore), overclocked to 3.6 GHz with HT on (so 12 jobs running at the same time), using an X58 board. Memory configurations were 3 x 2GB DDR3-1866 @ 1.5 V (call it HT_3) and 6 x 2GB DDR3-1866 @ 1.5 V (call it HT_6).

Result for gravity wave jobs: HT_3 = HT_6

I saw no difference in the speed of the jobs, but power increased by 17 watts (250 W vs. 233 W) for the HT_6 case. That is quite a power penalty in my mind.

The second set of memory slots is just there to connect a second DIMM to each channel. Your power usage went up because you were powering more chips.

archae86
Joined: 6 Dec 05
Posts: 1929
Credit: 440531898
RAC: 580366

Robert wrote: Result for

Robert wrote:

Result for gravity wave jobs: HT_3 = HT_6

I saw no difference in the speed of the jobs, but power increased by 17 watts (250 w - 233 w) for the HT_6 case. That is quite a power penalty in my mind.


Your case is a useful illustration of a basic relationship: When you don't have enough memory and are swapping, just about nothing beats the cost/performance value of adding memory. When you do have enough memory, adding memory does nothing but add cost, failure rate, and power dissipation.

tullio
Joined: 22 Jan 05
Posts: 1920
Credit: 5225098
RAC: 4628

I have a 8 GB RAM on my

I have 8 GB of RAM on my Linux PAE system. I am running 6 BOINC projects, including 2 virtual machines via VirtualBox. Application data uses 26% of RAM, disk caching 58%; 10% is free, plus some disk buffers.
Tullio

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 744
Credit: 41375144
RAC: 352

RE: Your case is a useful

Quote:
Your case is a useful illustration of a basic relationship: When you don't have enough memory and are swapping, just about nothing beats the cost/performance value of adding memory. When you do have enough memory, adding memory does nothing but add cost, failure rate, and power dissipation.

Totally agreed. When people say "more RAM makes your computer faster", I like to reply "more RAM doesn't make it faster, it keeps it from getting slower". That changed a bit with SuperFetch, i.e. not only caching recent files but also predicting which stuff I'll usually need next, but generally I still stand by this.

MrS

Scanning for our furry friends since Jan 2002
