Memory channels provisioned vs. Einstein performance E5620
I've often wondered whether the 3-channel Nehalem-family parts are heavily over-supplied with RAM bandwidth for Einstein application performance purposes. The fact that Intel provides many variants with just two channels in its more consumer-oriented socket suggests as much.
I recently had to invade my Westmere (E5620) host because one RAM stick had failed, and took the time to run comparison tests of 1, 2, and 3 populated channels, both hyperthreaded and non-hyperthreaded.
The results will not be broadly applicable, as RAM requirements vary widely by application, Nehalem-family cache sizes vary, and component stock specs and user overclocking practices move the relative performance of CPU and RAM channels around quite a bit. Still, some may find this of interest.
For all these tests, the CPU was running at a moderate overclock of 3.42 GHz with the multiplier at 19, and the BIOS was left to set up the RAM as it wished, presumably based on SPD information and the clock rate implied by the 100/180 clock settings.
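As a quick sanity check on those numbers (assuming the 180 in the 100/180 settings is the base clock in MHz, which is my reading rather than anything CPU-Z states):

```python
# Core clock = multiplier x base clock.
# The 180 MHz BCLK is an assumption read off the "100/180" settings;
# 19 is the multiplier quoted above.
bclk_mhz = 180
multiplier = 19
core_ghz = multiplier * bclk_mhz / 1000
print(core_ghz)  # 3.42, matching the quoted overclock
```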
For those who speak CPU-Z, here is the CPU state:
and here the RAM state (3-channel populated case shown):
The memory sticks were all Corsair Platinum series XMS3 DDR3 parts, shipping as 3-packs under part number TR3X6G1333C9, for which CPU-Z reads the SPD information as:
And here are the mean execution times in CPU seconds for a full set (4 for nHT, 8 for HT) of current Einstein Global Correlations S5 HF search #1 v3.06 (S5GCESSE2) tasks.
The RelProd line indicates system productivity in each configuration as a fraction of the highest performing (hyperthreaded 3-channel) configuration.
It must be kept in mind that I've used an appreciable overclock on the CPU, and none at all on the RAM. Those who buy premium RAM and slave away to find clock counts to save may get faster RAM relative to CPU than this, and those who run everything dead stock will have slower RAM relative to CPU than this.
1. For this case and this app, in hyperthreaded mode the one-channel configuration is severely starved, and even the nHT case is significantly impaired.
2. Going from two to three channels in this configuration helps the nHT case very little, and the HT case only moderately.
3. HT always helps, but it helps a lot more when the configuration is not memory-starved (the HTBen line is a direct productivity comparison of HT vs. nHT at the same channel count).
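To make the two derived lines concrete, here is a quick Python sketch of how RelProd and HTBen fall out of the mean times. The task times below are placeholders for illustration only, not my measured values.

```python
# Sketch of the RelProd and HTBen derivations.
# Times here are hypothetical placeholders, NOT the measured results.

def throughput(n_concurrent_tasks, mean_secs_per_task):
    """Tasks completed per second across the whole box."""
    return n_concurrent_tasks / mean_secs_per_task

# (channels, hyperthreaded) -> (concurrent tasks, mean CPU secs per task)
configs = {
    (1, False): (4, 30000.0),
    (1, True):  (8, 55000.0),
    (2, False): (4, 24000.0),
    (2, True):  (8, 40000.0),
    (3, False): (4, 23500.0),
    (3, True):  (8, 38000.0),
}

tput = {k: throughput(n, t) for k, (n, t) in configs.items()}
best = max(tput.values())

# RelProd: each configuration's productivity as a fraction of the best
# (here the hyperthreaded 3-channel case)
relprod = {k: v / best for k, v in tput.items()}

# HTBen: HT vs. nHT productivity at the same channel count
htben = {ch: tput[(ch, True)] / tput[(ch, False)] for ch in (1, 2, 3)}
```

The only subtlety is that HT mode runs twice as many concurrent tasks, so a per-task time nearly double the nHT time can still mean a net productivity gain.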
Conscious of our old problem with high result-to-result variation in execution requirements, I made a reasonably serious effort to match the freq/seq characteristics of the six result sets compared here. I've shown the Stdev for each set, mostly to document that by and large the differences between test cases are large compared to any random timing variation present, and also that the timing variation observed within my samples was generally very small. True, the frequency range was very small (1264.00 to 1264.25), but the seq range was wider (3 to 462), and I just did not see evidence of major systematic variation. I think the high Stdev for the hyperthreaded one-channel case is another symptom of that configuration's severe memory famine, not evidence that my matching efforts failed and somehow stocked that case with massively more inherent result-effort variation.
I have no doubt that effort expended on getting lower RAM latency through tighter memory timings would benefit _all_ of these cases (even when you are not waiting on your brother tasks because of bandwidth constraints, you must wait out the latency for every jump whose target is not in some form of cache, and for every similarly challenged data memory access). But I have a low appetite for, and little experience in, twisting the tail on DDR3 RAM clocks, so I don't plan to try. I would, however, be happy to watch another contributor take a look at that.