Benefits of Dual-Channel RAM Config for crunching??
Ever since I can remember, running two matched sticks in dual-channel motherboards has been recommended for extracting the best performance from the installed memory. When I first started building crunching boxes (Pentium III, Athlon XP era), I would test both single and dual configs and couldn't really measure much of a difference. However, I would always try to follow the recommendation.
When DDR2 first became mainstream, I can remember setting up hosts with 2GB RAM total. Once again, I tested 2x1GB dual-channel against a single 2GB stick and couldn't really notice much of a difference. I must admit that the testing was very limited - I didn't expect to see much and when I didn't, I stopped 'wasting my time' and got on with other things :-). However, through force of habit, I've always arranged for machines to have dual-channel configs, even if the benefits seemed slight.
With the initial rather less than stellar gains, I haven't really paid any attention to further testing since that time. With DDR3, I just automatically used two matched sticks - for a long time it was 2x2GB. More recently it has been 2x4GB.
About a year or two ago when I started running multiple concurrent GPU tasks and CPU tasks on all available cores, I was concerned about memory use and when I checked on a number of machines with 4GB total, I found that swap wasn't being used at all and that everything was fitting in the available physical RAM. At that time RAM was very cheap so I just used 2x4GB on new builds and didn't even consider making a small cost saving by going back to single-channel.
I've just built a couple of new boxes to try out the available 'non-K' overclocking potential of the 'Anniversary edition' Pentium G3258 CPU. I'm using an el-cheapo Asrock H81M-DGS board with the P1.30 BIOS which makes it very easy to overclock this CPU. I had a 4GB stick lying around so I just used that to get started. Stock frequency is 3.2GHz and I'm running this particular host at 4.2GHz at the moment. I'm using the beta test version (1.05) of the FGRP4 app to test the crunching performance. There have been no crashes and tasks are validating so the overclock looks good.
I did pick up a second 4GB stick and installed it yesterday and this made a very significant difference. At 4.2GHz on idle, the machine draws 36W at the wall. At full crunching load (1x4GB) the draw is 72W. With 2x4GB it goes to 76W. The big difference is the crunch time. At 1x4GB a task takes around 21ksecs - if you use the above link to look at the tasks list, just ignore all the 'short ends' (half or quarter size) tasks that are mixed in with the full size tasks. If you then find the very latest tasks (done with 2x4GB) the crunch time has dropped to just under 17ksecs, over a full hour faster per task.
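To put the figures above side by side, here is a rough Python sketch. It assumes the rounded values quoted in the post (21ksecs and 17ksecs per task, 72W and 76W at the wall), so the results are only as good as those approximations. The energy figure is the wall energy drawn over the duration of one task; it is shared between however many tasks run at once, which does not affect the ratio.

```python
# Approximate figures quoted in the post (G3258 at 4.2GHz, FGRP4 beta 1.05).
single = {"watts": 72, "task_seconds": 21_000}  # 1x4GB, single channel
dual = {"watts": 76, "task_seconds": 17_000}    # 2x4GB, dual channel


def energy_kwh(cfg):
    """Wall energy drawn over the duration of one task, in kWh."""
    return cfg["watts"] * cfg["task_seconds"] / 3.6e6


time_saved_s = single["task_seconds"] - dual["task_seconds"]
speedup = single["task_seconds"] / dual["task_seconds"]
energy_ratio = energy_kwh(dual) / energy_kwh(single)

print(f"time saved per task: {time_saved_s} s ({time_saved_s / 3600:.2f} h)")
print(f"speedup factor:      {speedup:.3f}")
print(f"energy, dual/single: {energy_ratio:.3f}")
```

On these numbers the extra 4W is more than paid for by the shorter run time, so the dual-channel config also comes out ahead on energy per task.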
I've confirmed this behaviour on other recent builds so it's not something funny with this particular host. I've no idea though as to when it became this important to be using dual-channel configs. I've always done it by habit and haven't noticed when it started making a real difference like this.
Cheers,
Gary.
Gary, interesting stuff.
Sometimes Intel has built processors with an enormous amount of spare RAM bandwidth, at least on base models running slow. But sometimes not.
So I suspect the answer to "how much does it matter?" depends on both the CPU model and the application.
Speaking of applications, I was recently quite surprised to find that my older Westmere host was a noticeably better performing GTX750 host than two more modern hosts, and speculated that the 3-channel RAM and perhaps other memory differences on the Westmere might give better memory performance in whatever respect the Perseus support application cares about than my more modern hosts.
Which is a long-winded leadup to a suggestion that you just might find that better memory access is even more important to your GPU hosts than to the CPU-only tasks you mention in this note.
Lastly, while I think populating both (all available, actually) channels is important, and that "matching" the RAM sticks as to primary parameters is important, I seriously doubt the "matched set" stuff most of us buy works any better than picking any two (or three, or four) sticks out of the bin so long as they carry the same main parameter designations. That is a guess on my part, not well informed by relevant data. Selling them as matched sets probably does help, as it keeps people from getting confused as to which parameters actually matter to match.
Hi Peter,
Thanks for the response.
All my GPU endowed hosts have dual-channel memory configurations with 2x2GB as a minimum. Newer, higher performing hosts with HD7850 GPUs running 4x are using 2x4GB. I can't remember ever trying to run a GPU cruncher with only a single stick of RAM.
If the CPU is Ivy Bridge, Haswell or Haswell refresh, the GPU crunch time seems to be reasonably independent of the precise CPU model or the precise CPU frequency (within reason).
As an example, I've got an i3-4130 powered machine running at 3.4GHz and the BRP5 tasks run 4x in about 4h:17m. It happened to be in an Asrock H81M-DGS board. There were 2 CPU tasks and 2 free virtual cores. I upgraded the BIOS and changed the CPU to a G3258 running at 4.0GHz. Through app_config.xml, I adjusted the parameters to 0.45 CPUs and 0.25 GPUs so that running 4x only ties up 1 CPU core. The other core is running an FGRP4 task. It's hard to tell but it looks like the GPU tasks have actually sped up by a minute or two. With the i3, the CPU tasks used to take around 21ksecs. With the G3258, they take around 17ksecs. I was quite pleasantly surprised with this result. There is virtually no change to the machine's output overall. I was anticipating that just one free CPU core would struggle to keep the GPU fed. Apparently not for this setup.
I agree entirely. When I used the word "matched" it was just shorthand for "same make and model and latencies". In fact, one of the G3258 rigs I'm testing at the moment actually has a 4GB Crucial 1600 stick alongside a 4GB Samsung 1600 stick and the good performance doesn't seem to have suffered at all. When I buy memory, I buy single sticks if it's cheaper that way or a kit of 2 if that way is cheaper.
Cheers,
Gary.
Thanks for reporting, Gary.
Einstein is among the most memory bandwidth hungry real-world applications that I know of, so your findings fit my expectation well. I am surprised, though, to see such a large difference even for a dual core CPU! And your DDR3-1600 even has decent speed; it's not "crippled" like DDR3-1066. The little Pentium has a relatively small L3 cache, though. This may make the memory bandwidth dependence more pronounced than on other CPUs.
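As a rough illustration of the bandwidth at stake, the theoretical peak for DDR3 can be estimated from the transfer rate and the number of populated channels. The sketch below assumes the standard 64-bit (8-byte) channel width and uses the nominal speed grades; real sustained bandwidth is lower and depends on timings and access pattern.

```python
BYTES_PER_TRANSFER = 8  # one DDR3 channel is 64 bits wide


def peak_gb_per_s(mega_transfers_per_s, channels):
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return mega_transfers_per_s * 1e6 * BYTES_PER_TRANSFER * channels / 1e9


for rate in (1066, 1600, 2400):
    for channels in (1, 2):
        label = f"DDR3-{rate} x{channels} channel(s)"
        print(f"{label:28s} {peak_gb_per_s(rate, channels):5.1f} GB/s")
```

By this estimate a single DDR3-1600 stick tops out at about 12.8 GB/s and a dual-channel pair at about 25.6 GB/s.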
The need for bandwidth is even higher when you use the iGPU to crunch Einstein. That limitation is so severe that I hardly see better performance if I run my HD4000 at 1.35 GHz instead of 1.25 GHz. And this is already supported by 2 channels of DDR3-2400 with tight latencies and ~4 GHz CPU clock! This model has just 16 shader cores. It will be interesting to see how the rumored top Broadwells with 48 shaders and Skylakes with 72 shaders fare over here. They'll certainly need the CrystalWell cache but may also love fast DDR4 on Skylake.
BTW: I almost cry whenever I see laptops with expensive 4-core i7s using the iGPU and just a single 4 GB module.
MrS
Scanning for our furry friends since Jan 2002
On a slightly different note, I spent some extra cash to try some 2400MHz DDR3 in a new box and the time taken per WU was noticeably lower than with "ordinary" 1600MHz memory. The gain was not as large as the dual-channel one reported above, but still a worthwhile improvement....
dunx
In general application benchmarks 1866 seems to be the sweet spot today for a reasonably fast Intel quad core at ~3.5 GHz, with 2133 providing a benefit if the timings don't get out of hand. 2400 still provides a small benefit with good timings, whereas everything above that hardly matters.
The faster your CPU is, the more memory bandwidth it needs, though. This is also not yet factoring in that Einstein wants more bandwidth than most programs, i.e. it reacts more favorably to bandwidth increases. This also neglects the iGPU, which loves bandwidth if used for Einstein.
So for Einstein I expect 1866 to easily beat 1600, with 2133 possibly still providing a nice boost (in contrast to the average described initially) and 2400 will depend on the timings (at relaxed timings it's going to be slower, at tight timings a bit faster).
As a rule of thumb regarding the latencies: if you step up one frequency step (1600 -> 1866) it's OK for the timings to be 1 clock slower, i.e. I'd prefer DDR3-1866 10-11-10 over DDR3-1600 9-10-9. Latency will be comparable, whereas bandwidth will be higher. If the latency suffers more than that the higher clocked memory is usually slower. In my example I would not want DDR3-1866 11-11-11 compared to that DDR3-1600.
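The rule of thumb can be checked with a little arithmetic. This sketch converts the CAS latency (the first timing number) from clock cycles to nanoseconds for the three kits mentioned, using the nominal data rates. It only looks at CAS latency and ignores the other timings, so treat it as a first approximation.

```python
def cas_latency_ns(data_rate_mts, cas_cycles):
    """CAS latency in ns. The I/O clock runs at half the data rate,
    so one clock cycle lasts 2000 / data_rate nanoseconds."""
    return cas_cycles * 2000.0 / data_rate_mts


kits = [
    ("DDR3-1600 9-10-9", 1600, 9),
    ("DDR3-1866 10-11-10", 1866, 10),
    ("DDR3-1866 11-11-11", 1866, 11),
]

for name, rate, cl in kits:
    print(f"{name:20s} CAS = {cas_latency_ns(rate, cl):5.2f} ns")

print(f"bandwidth ratio 1866/1600: {1866 / 1600:.3f}")
```

On this measure the 1866 CL10 kit has slightly lower latency than the 1600 CL9 kit while offering more bandwidth, whereas the 1866 CL11 kit gives some of that back in latency.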
MrS
I'm reviving this thread because I've accidentally stumbled onto something even more curious (well to me anyway) than what I posted about in my original message. In summary, the original message reported on the rather dramatic increase in CPU performance of a Haswell-refresh CPU (specifically a Pentium dual core G3258 anniversary edition) by going from a single 4GB stick to 2x4GB dual channel RAM configuration. As everything running on the machine fitted well within the 4GB (no hint of any swapping) I attributed the gain simply to the change to dual channel.
It may well still be just that but in the interim, I've built a few more with the same CPU and 2x4GB RAM setup and have added HD 7850 GPUs (2GB GDDR5) to them all. They run BRP5 GPU tasks 4x and a single FGRP4 CPU task. They use app_config.xml to set GPU usage to 0.25 and CPU usage to 0.45 for the GPU tasks. Since 4x0.45 is less than 2, BOINC only reserves 1 core for GPU support.
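For anyone wanting to try the same arrangement, here is a small Python sketch of the arithmetic described above, together with the general shape of an app_config.xml that sets those two values. The core-reservation rule (rounding the summed CPU usage down to whole cores) is a simplification that follows the reasoning in this post, not BOINC's actual scheduler code, and APP_NAME_HERE is a placeholder, not a real application name.

```python
import math

GPU_USAGE = 0.25  # fraction of a GPU per task, i.e. 4 tasks per GPU
CPU_USAGE = 0.45  # fraction of a CPU core budgeted per GPU task
CPU_CORES = 2     # the G3258 is a dual core

tasks_per_gpu = round(1 / GPU_USAGE)
cpu_budget = tasks_per_gpu * CPU_USAGE    # 4 x 0.45 = 1.8
reserved_cores = math.floor(cpu_budget)   # simplified rule: whole cores only
free_cores = CPU_CORES - reserved_cores   # cores left for CPU tasks

# Shape of the matching app_config.xml (app name is a placeholder).
APP_CONFIG_XML = f"""<app_config>
  <app>
    <name>APP_NAME_HERE</name>
    <gpu_versions>
      <gpu_usage>{GPU_USAGE}</gpu_usage>
      <cpu_usage>{CPU_USAGE}</cpu_usage>
    </gpu_versions>
  </app>
</app_config>"""

print(f"{tasks_per_gpu} GPU tasks budget {cpu_budget:.2f} cores, "
      f"{reserved_cores} reserved, {free_cores} free for CPU tasks")
print(APP_CONFIG_XML)
```

With 0.5 CPUs per GPU task instead, the same sum would reach 2.0 and, on this simplified rule, leave no core free for a CPU task.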
In looking at memory usage with the above crunching pattern, the Linux utility kinfocenter shows application data consuming 15% of RAM, disk buffers and cache, etc, consuming 7% with a whopping 78% of physical RAM as 'free'. So I decided to try the very latest build with 2x2GB instead of 2x4GB. With that single change, there was still close to half of the physical RAM listed as 'free' and no use of swap space but the GPU crunch times took a real dive. With 2x4GB, 4 GPU tasks were completing in around 4hrs 25mins. With 2x2GB, that time blew out to around 5hrs 13mins. Needless to say, after making quite sure there was nothing else I could attribute the change to, I've simply changed back to the 2x4GB setup and I can see the performance increase already. Both RAM kits were G.Skill NT 1333MHz running with BIOS settings on auto. I checked all timings listed in the BIOS and they were the same for the 2 kits.
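To express the two batch times as daily throughput, a quick Python sketch follows. It assumes the approximate times quoted above (4hrs 25mins and 5hrs 13mins for a batch of 4 concurrent GPU tasks) and that the host runs such batches back to back all day.

```python
def tasks_per_day(concurrent, hours, minutes):
    """Completed tasks per 24 hours for back-to-back batches."""
    batch_seconds = hours * 3600 + minutes * 60
    return concurrent * 86400 / batch_seconds


with_2x4gb = tasks_per_day(4, 4, 25)
with_2x2gb = tasks_per_day(4, 5, 13)
slowdown_pct = (with_2x4gb / with_2x2gb - 1) * 100

print(f"2x4GB: {with_2x4gb:.1f} GPU tasks/day")
print(f"2x2GB: {with_2x2gb:.1f} GPU tasks/day")
print(f"2x2GB batches take {slowdown_pct:.0f}% longer")
```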
It's hard to tell if there was much change in the CPU crunch times as these are somewhat more variable compared to the GPU crunch times. A couple of CPU tasks did complete under the 2x2GB regime and they were actually a bit faster. Certainly not enough results to make any sort of claim though. Probably just the normal task variability.
I'd be very interested if anyone can suggest how and why the 2x2GB RAM config is having such a negative impact on GPU crunch times. There's still ~45% of the physical RAM listed as 'free'. Is there stuff happening in RAM that isn't reported, or reportable? I've got plenty of older architectures with 2x2GB and GTX 650 GPUs and I haven't noticed any behaviour like this before. Maybe I'd better take a closer look :-).
One final bit of information. I've been playing with the AMD supplied Linux utility (aticonfig) to set and/or check various parameters on the HD 7850 GPU. When it was running with 2x2GB RAM config, the core clock was 900MHz and the mem clock was 1200MHz - stock values for this particular card. The GPU load was 94% and the temperature was 60C. I increased the core to 1GHz and the mem to 1300MHz. The GPU load went up to 95% and the temp to 62C. The crunch time seemed to decline by a few mins to around 5hrs 8mins approx. Now that it's back on the 2x4GB config, and the GPU is again on stock values, I've just noticed that the GPU load is showing as 98% and the temp is 63C. The ambient is much the same - slightly cooler if anything. These figures show the GPU working harder, as do the crunch times. I'll leave it like this for the moment but I'll probably try upping the clocks in a day or two. I usually leave GPUs at stock settings but I may as well OC this one a bit seeing as the previous tweaks did make a nice difference and the OC seemed to be stable.
Cheers,
Gary.
I will take an uneducated guess that the difference is related to the GPU memory size, assuming all other things are equal.
I suspect there is something in here http://www.codeproject.com/Articles/122405/Part-OpenCL-Memory-Spaces
or here http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/opencl-optimization-guide/#50401315_pgfId-446494
about memory pinning and OpenCL performance. I'm guessing the larger RAM leads to less pinning.
Perhaps you could compare 1x4GB, 2x4GB and 2x2GB on a GTX650 ?
I have a new system to build around a 7990 on Linux so this issue is relevant in deciding what memory to get.
Good Luck!
Thanks very much for the response. I'll go through the links you kindly provided shortly. For now, I wanted to update what I've been doing, which is exactly this :-).
Until I started using the HD 7850s, all with 2GB GPU RAM, I had previously invested in quite a number of GTX650s, all 1GB GPU RAM. These are all still running, quite reliably, so I don't have cause to pay any real attention to them. I looked through my 'inventory' and found a machine with a GTX650 that was performing rather below the norm for that type of host. It had a single 4GB stick of DDR3 1333. Most of my GTX650s are running on Ivy Bridge, a few on Sandy Bridge and a couple (like this particular 'slow' one) on Wolfdale. So a day or two ago, I replaced the single stick with a 2x2GB kit and there are already enough results to see a very nice improvement. Shortly, I'll replace the 2x2GB with 2x4GB and perhaps see a further difference. I also have a 16GB kit lying around so I'll try that in one of the HD 7850 systems to see if there's any change from what I'm getting with the 8GB kits. I'm not expecting anything more but you never know ... :-).
The CPU in the GTX650 based system is a rather old Pentium dual core E6300 with a stock frequency of 2.8GHz, but OC'd to 3.3GHz. It's running in an Asrock G41M-VS3 board. I built this machine around 6 years ago (no GPU at that stage). It's running FGRP4 tasks on both CPU cores and BRP5 2x on the GTX650. The change to 2x2GB has caused the elapsed time of GPU tasks to reduce from 26.7ksecs to 24.0ksecs. The CPU component of the elapsed time has fallen from around 6.6ksecs to 5.0ksecs. Because of the small number of tasks completed so far, it's a bit early to be too precise about this. I imagine I'll update this thread once there are more results and I've had time to try 2x4GB RAM.
Once again, thanks for the links.
Cheers,
Gary.
I forgot to respond earlier to your question about "free" memory being 45%. I guess you are looking at one of the process monitors where the free figure excludes buffers and file system caches, which tend to fill a good percentage of most remaining memory. So adding extra RAM should improve (at least) file system performance; whether that makes a difference here will be interesting to see.
# free -m -s 2 should show the cache and buffer size (and the real free memory) repeated every 2 seconds. (I usually see less than 10% free, and the cache is at least 3 times the free space)
Hope that helps.
I have been following this thread with interest but am not that familiar with ram architecture. I am running one machine with a NVIDIA GTX 770 and 16GB memory.
Free shows:
free -m -s 2
             total       used       free     shared    buffers     cached
Mem:         16003       6028       9974         69        272       1387
-/+ buffers/cache:       4368      11634
Swap:        16335          0      16335

             total       used       free     shared    buffers     cached
Mem:         16003       6028       9974         69        272       1387
-/+ buffers/cache:       4368      11634
Swap:        16335          0      16335
My apologies if the "free" output is jumbled.
The ratio of cached to free memory differs from what you describe. Is there a way to determine whether I am utilizing this machine to its full potential?
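For reference, the figures in the output above can be broken down with a few lines of Python. This is only a sketch using the MB values exactly as posted; `free` rounds each field separately, so its own "-/+ buffers/cache" line (4368 / 11634) differs from the sums here by 1 MB.

```python
# Values in MB, copied from the "Mem:" line of the posted output.
total, used, free_mb, buffers, cached = 16003, 6028, 9974, 272, 1387

app_used = used - buffers - cached      # memory held by applications
reclaimable = buffers + cached          # buffers and cache, given back on demand
available = free_mb + buffers + cached  # what applications could still claim


def pct(value):
    """Share of total physical RAM, in percent."""
    return 100.0 * value / total


print(f"applications:       {app_used:6d} MB ({pct(app_used):4.1f}%)")
print(f"buffers + cache:    {reclaimable:6d} MB ({pct(reclaimable):4.1f}%)")
print(f"completely unused:  {free_mb:6d} MB ({pct(free_mb):4.1f}%)")
print(f"available to apps:  {available:6d} MB ({pct(available):4.1f}%)")
```

On these numbers well over half of the 16GB is completely unused and swap is untouched, so memory capacity at least does not look like the limiting factor on this host.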