Underperforming Xeon System

Grogley
Grogley
Joined: 20 Feb 05
Posts: 11
Credit: 24520270
RAC: 0
Topic 194784

I recently built a dual Xeon E5506 system (8 cores total, not overclocked). Based on benchmarks and the performance of other systems, the new Xeon system seems to be delivering only about 80% of the Einstein credits I expected. For example, my son's Q6600 quad-core system will crank out about 2200 credits a day. Since my new system's benchmarks are slightly better than the Q6600's, I would expect at least twice that rate, or about 4400 credits per day. But the Xeon system seems to max out at 3500 credits per day.

Note that the BOINC application runs 100% of the time on the new Xeon system. I have logged the activity on the system and I know it is crunching 24x7. I might add that I have 4 GB of RAM in two sticks, and there seem to be no memory-related issues, since the benchmarks run quickly.

Anyone have any idea what is happening here?
Thanks,
Rod

Richard Haselgrove
Richard Haselgrove
Joined: 10 Dec 05
Posts: 2139
Credit: 2752648717
RAC: 1486471

Underperforming Xeon System

I have a dual Xeon E5320 system which has been crunching Einstein for three years now. Don't look at RAC or total credit, because Einstein only has a low resource share, and most of the credit goes to other projects: but the timings of individual tasks may be helpful in your quest.

Mine is the sort of motherboard which uses FB-DIMM memory: I was disappointed too when I first set it up and found that the memory was only running in dual-channel mode. Once I got four matched FB-DIMMs running in quad-channel mode, the speed was noticeably better (I now have 4 x 1GB, although Vista 32 is only reporting 3GB).

Also, I'm not up-to-date on how much cache RAM the E5506 CPUs have: Einstein tasks are pretty memory-intensive, and you need as much/as fast as you can get to keep 8 cores busy.

Elphidieus
Elphidieus
Joined: 20 Feb 05
Posts: 245
Credit: 20603702
RAC: 0

I did notice one thing on

I did notice one thing on your system, though it may have little effect on your RAC: your graphics card is under-performing on ABP2 CUDA tasks, around 17,600 sec on the GPU vs 15,300 sec on the CPU. Take note that a GPU task also requires a full CPU core (in your case 7 CPU cores on CPU tasks plus 1 GPU + 1 CPU core on CUDA tasks), so you're dedicating one of your CPU cores to a GPU which in turn runs slower. You might want to set BOINC not to use the GPU and dedicate all 8 CPU cores instead.

Grogley
Grogley
Joined: 20 Feb 05
Posts: 11
Credit: 24520270
RAC: 0

Thanks, I think I now have

Message 97132 in response to message 97131

Thanks, I think I now have that turned off in my preferences, at least I think that is where it gets turned off. I never really liked my GPU being used anyway.

However, I am not sure that explains my 20% under-performance estimate. It's as if nearly two CPUs are not being used. Anyway, I will monitor what happens with the GPU turned off to see if that helps.

Thanks again,
Rod

Quote:
I did notice one thing on your system, though it may have little effect on your RAC: your graphics card is under-performing on ABP2 CUDA tasks, around 17,600 sec on the GPU vs 15,300 sec on the CPU. Take note that a GPU task also requires a full CPU core (in your case 7 CPU cores on CPU tasks plus 1 GPU + 1 CPU core on CUDA tasks), so you're dedicating one of your CPU cores to a GPU which in turn runs slower. You might want to set BOINC not to use the GPU and dedicate all 8 CPU cores instead.


archae86
archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7023144931
RAC: 1831412

RE: However, I am not sure

Message 97133 in response to message 97132

Quote:
However, I am not sure that explains my 20% under-performance estimate. It's as if nearly two CPUs are not being used.

One issue that affects multi-core hosts, strongly in some cases, is memory contention.

The most extreme case I can recall was a large server built using lots of Dunnington chips (the early 6-core processor single-die design which was the last of Intel's Penryn-line products). I forget just how many cores were on the hosts, but think it may have been 48 (eight chips). I don't remember the numbers either, but they woefully underperformed ordinary Penryn hosts of same clock rate on a per-core basis, turning in result CPU times several times higher. Using your terminology, one might have said most of their cores were "not used". Actually they just spent most of their charged CPU time waiting for RAM.

Your case is obviously nowhere near that extreme. But you may still suffer meaningful memory contention running Einstein, and the benchmarks SETI uses would not capture that well at all, I think. If you want to check this possibility, you could try dropping the number of cores in service. Depending on which version of BOINC you are running, you may find a preference setting which will do this for you. If that does not work out, you could set No New Tasks, then suspend enough tasks to leave only the desired investigative number of freshly started tasks running (for a single trial, I suggest four). If the reported rate of advance during computation, and the reported CPU time on completion, show a considerable improvement, this would suggest pretty directly that you have a substantial memory contention effect.

And if not, it would suggest I have started a false hare.
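
For what it's worth, a stripped-down version of that experiment, run outside BOINC, might look like the sketch below (plain C with pthreads on Linux assumed; the 100 MB working set and all names are just illustrative, not anything from the Einstein app):

```c
/* contention_probe.c - run the same memory-heavy loop on N concurrent
 * threads and report each thread's wall time, to see how much they
 * slow each other down.  Illustrative sketch only.
 * Build (assumed): gcc -O2 -pthread contention_probe.c -o contention_probe
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NDOUBLES   (100u * 1024u * 1024u / sizeof(double))  /* ~100 MB per thread */
#define PASSES     10
#define MAXTHREADS 64

static volatile double sink;   /* keeps the compiler from deleting the work loop */

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static void *worker(void *arg)
{
    double *buf = malloc(NDOUBLES * sizeof(double));
    double sum = 0.0;
    size_t i;
    int p;

    for (i = 0; i < NDOUBLES; i++)      /* touch every element once */
        buf[i] = (double)i;

    double t0 = now_sec();
    for (p = 0; p < PASSES; p++)        /* sweep the whole buffer repeatedly */
        for (i = 0; i < NDOUBLES; i++)
            sum += buf[i] * 1.0000001;
    *(double *)arg = now_sec() - t0;    /* report elapsed time back to main() */

    sink = sum;
    free(buf);
    return NULL;
}

int main(int argc, char **argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 1;   /* number of concurrent workers */
    pthread_t tid[MAXTHREADS];
    double secs[MAXTHREADS];
    int i;

    if (n > MAXTHREADS) n = MAXTHREADS;
    for (i = 0; i < n; i++)
        pthread_create(&tid[i], NULL, worker, &secs[i]);
    for (i = 0; i < n; i++) {
        pthread_join(tid[i], NULL);
        printf("worker %d: %.2f s\n", i, secs[i]);
    }
    return 0;
}
```

Running it with 1, then 4, then 8 workers and comparing the printed per-worker times is essentially the experiment described above, just without BOINC in the loop.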

Grogley
Grogley
Joined: 20 Feb 05
Posts: 11
Credit: 24520270
RAC: 0

I think the memory contention

Message 97134 in response to message 97133

I think the memory contention thing is the problem. I have my own way of testing this with my gravitational simulation. The simulation is a single-threaded task, and I can open as many instances of it as I want. I had tested this hypothesis earlier, but what I didn't do was check how much memory the tasks use. Checking again, I discovered that I needed to make the task memory sizes comparable to the Einstein memory usage. Since my current Einstein tasks use about 100 MB of RAM, I set up my simulation to use that much per task.

Running a single simulation task, 100 iterations take about 43 seconds and, not surprisingly, about 12% of the total CPU time. Opening more of these simulation tasks, up to 8, all running identical data sets, the times for 100 iterations increased dramatically, now running anywhere between 48 and 68 seconds for any of the running tasks.

I am surprised at this, because many times in the past on multiprocessor machines I had never seen any dramatic differences in processing time for individual tasks. I really didn't expect to see that much of a difference now. I suspect the earlier tests didn't use enough memory for memory-access contention to matter.

So I think I have the answer to my original question, though not a particularly satisfying one. I don't suppose there is anything I can do to mitigate this issue. More memory or maybe a 64-bit OS would help, but I am not sure whether this is the fault of the memory hardware or of the OS's use of memory. I am not that well versed in the whole memory pipelining thing.

Thanks for your help everyone,

Rod

Quote:
Quote:
However, I am not sure that explains my 20% under-performance estimate. It's as if nearly two CPUs are not being used.

One issue that affects multi-core hosts, strongly in some cases, is memory contention.


Bikeman (Heinz-Bernd Eggenstein)
Bikeman (Heinz-...
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 686042413
RAC: 597943

RE: So I think I have the

Message 97135 in response to message 97134

Quote:

So I think I have the answer to my original question, though not a particularly satisfying one. I don't suppose there is anything I can do to mitigate this issue. More memory or maybe a 64-bit OS would help, but I am not sure whether this is the fault of the memory hardware or of the OS's use of memory. I am not that well versed in the whole memory pipelining thing.

A 64-bit OS will most likely not help much here, nor will additional memory. Using Linux instead of Win XP might help, though: AFAIK the apps built for Linux are still a bit faster than those compiled for Windows (because of compiler differences, not because of the OS itself). I'm not sure, but maybe Linux also handles CPU locality a bit better than Win XP, which might add a few percent of performance on a system with as many cores as yours.

HBE

Richard Haselgrove
Richard Haselgrove
Joined: 10 Dec 05
Posts: 2139
Credit: 2752648717
RAC: 1486471

I fear it may be even more

I fear it may be even more difficult than that. Running multiple simulated tasks will, quite correctly, stress the pagefile and virtual memory subsystems. But I think the problem with Einstein and similar processes is memory throughput, which stresses bandwidth rather than capacity. You need to consider speed as well as size.
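
To make that distinction concrete, here is a rough sketch in plain C (everything in it is invented for illustration): simply allocating and holding ~100 MB mostly exercises capacity, whereas the STREAM-style triad loop below sweeps the same memory continuously and exercises bandwidth. Only the second kind of load, run on several cores at once, reproduces what the Einstein tasks do to the memory bus.

```c
/* bandwidth_demo.c - the difference between *holding* 100 MB and
 * *streaming through* 100 MB.  Illustrative sketch only.
 * Build (assumed): gcc -O2 bandwidth_demo.c -o bandwidth_demo
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (100u * 1024u * 1024u / sizeof(double))   /* ~100 MB worth of doubles */

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    size_t i;
    int pass;

    /* "Capacity" load: the arrays exist, but after this initialisation
     * almost nothing moves between RAM and the CPU. */
    for (i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    /* "Bandwidth" load: a STREAM-style triad that sweeps all three
     * arrays on every pass - this is what saturates the memory bus
     * when several cores run it simultaneously. */
    double t0 = now_sec();
    for (pass = 0; pass < 10; pass++)
        for (i = 0; i < N; i++)
            c[i] = a[i] + 3.0 * b[i];
    double t1 = now_sec();

    /* roughly 2 reads + 1 write of 8 bytes per element, per pass */
    printf("triad: %.2f s, approx %.1f GB/s (check %g)\n",
           t1 - t0, 24.0 * N * 10 / (t1 - t0) / 1e9, c[N / 2]);

    free(a); free(b); free(c);
    return 0;
}
```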

Ver Greeneyes
Ver Greeneyes
Joined: 26 Mar 09
Posts: 140
Credit: 9562235
RAC: 0

Does the application make an

Message 97137 in response to message 97136

Does the application make an effort to read from and write to memory linearly (insofar as this is possible)? I remember when I was working on some graphical filtering, writing linearly was key to performance: even though the other write patterns (which touched several rows and columns of pixels at a time) were equally predictable, writing to each pixel serially (breaking the logical access up into switch-cases) gave a massive speed advantage.
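
As a toy illustration of that effect (plain C, sizes made up): the two loops below write exactly the same pixels, but the first walks memory linearly while the second jumps a full row stride between consecutive writes; on a large image the strided version is typically several times slower.

```c
/* access_order.c - linear (row-major) vs strided (column-major) writes
 * over the same buffer.  Illustrative sketch only.
 * Build (assumed): gcc -O2 access_order.c -o access_order
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define W 8192          /* image width  in pixels */
#define H 8192          /* image height in pixels */

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    unsigned char *img = malloc((size_t)W * H);
    double t0, t1;
    int x, y;

    /* Row-major: consecutive writes are adjacent in memory. */
    t0 = now_sec();
    for (y = 0; y < H; y++)
        for (x = 0; x < W; x++)
            img[(size_t)y * W + x] = (unsigned char)(x ^ y);
    t1 = now_sec();
    printf("row-major:    %.3f s\n", t1 - t0);

    /* Column-major: consecutive writes are W bytes apart, so nearly
     * every write lands in a different cache line. */
    t0 = now_sec();
    for (x = 0; x < W; x++)
        for (y = 0; y < H; y++)
            img[(size_t)y * W + x] = (unsigned char)(x ^ y);
    t1 = now_sec();
    printf("column-major: %.3f s\n", t1 - t0);

    printf("check: %d\n", img[12345]);   /* keep the writes from being optimised away */
    free(img);
    return 0;
}
```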

Bikeman (Heinz-Bernd Eggenstein)
Bikeman (Heinz-...
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 686042413
RAC: 597943

RE: Does the application

Message 97138 in response to message 97137

Quote:
Does the application make an effort to read from and write to memory linearly (insofar as this is possible)? I remember when I was working on some graphical filtering, writing linearly was key to performance: even though the other write patterns (which touched several rows and columns of pixels at a time) were equally predictable, writing to each pixel serially (breaking the logical access up into switch-cases) gave a massive speed advantage.

Indeed, all modern CPUs since the Pentium-III have logic that tries to predict the next memory accesses even before they are requested, and this works best when memory is accessed in a linear fashion, as in most multi-media "streaming" applications. Complex scientific calculations often do not fit into this pattern easily.

If you can't arrange memory access in a linear fashion (e.g. when using indirect access as with look-up tables etc.), you can try to at least give the CPU a hint about where the next memory access will occur. It goes roughly like this (see the sketch below):

* compute the address of the memory access,
* tell the CPU it should "prefetch" memory around this spot,
* do something else,
* execute the code that actually loads the data in question.
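
In C with GCC or Clang, that pattern can be expressed with the __builtin_prefetch intrinsic. The sketch below is purely illustrative (the table, the index array and the prefetch distance of 8 iterations are made up), but it follows the four steps above:

```c
/* Indirect (look-up-table) access with an explicit prefetch hint.
 * GCC/Clang assumed for __builtin_prefetch; the data layout is invented
 * for illustration only.
 */
#include <stddef.h>

#define AHEAD 8   /* how many iterations ahead we ask for the data */

double sum_indirect(const double *table, const size_t *index, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + AHEAD < n) {
            /* steps 1+2: compute a future address and hint the CPU to fetch it
             * (2nd arg 0 = read access, 3rd arg 1 = low temporal locality) */
            __builtin_prefetch(&table[index[i + AHEAD]], 0, 1);
        }
        /* step 3: "do something else" - here, the work of the current iteration */
        /* step 4: the actual load through the look-up table */
        sum += table[index[i]];
    }
    return sum;
}
```

Whether this actually helps has to be measured: prefetching too early, too late, or for data the hardware prefetcher would have found anyway can just as easily slow things down.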

CU
HBE

Mad_Max
Mad_Max
Joined: 2 Jan 10
Posts: 153
Credit: 2134772970
RAC: 461437

RE: I think the memory

Message 97139 in response to message 97134

Quote:

I think the memory contention thing is the problem. I have my own way of testing this with my gravitational simulation. The simulation is a single-threaded task, and I can open as many instances of it as I want. I had tested this hypothesis earlier, but what I didn't do was check how much memory the tasks use. Checking again, I discovered that I needed to make the task memory sizes comparable to the Einstein memory usage. Since my current Einstein tasks use about 100 MB of RAM, I set up my simulation to use that much per task.

Running a single simulation task, 100 iterations take about 43 seconds and, not surprisingly, about 12% of the total CPU time. Opening more of these simulation tasks, up to 8, all running identical data sets, the times for 100 iterations increased dramatically, now running anywhere between 48 and 68 seconds for any of the running tasks.

I am surprised at this, because many times in the past on multiprocessor machines I had never seen any dramatic differences in processing time for individual tasks. I really didn't expect to see that much of a difference now. I suspect the earlier tests didn't use enough memory for memory-access contention to matter.

So I think I have the answer to my original question, though not a particularly satisfying one. I don't suppose there is anything I can do to mitigate this issue. More memory or maybe a 64-bit OS would help, but I am not sure whether this is the fault of the memory hardware or of the OS's use of memory. I am not that well versed in the whole memory pipelining thing.

Thanks for your help everyone,


It's not only a memory (RAM) related thing, but a processor cache thing too.
If you run tasks on a multi-processor machine (not multi-core: different physical chips!), every task gets the same resources, for example 1 core at 2 GHz + 1 MB of L2/L3 cache. If you run only one task, it still uses 1 core + 1 MB of L2 cache, and the other CPU is totally idle. Two tasks use 2 CPUs (2 cores + 2 x 1 MB cache), and you get up to a 100% speed-up (if there is no other bottleneck).
But if you run one task on a multi-core CPU (one physical chip with 2 cores inside), it gets 1 core + 2 MB of cache (because on a multi-core CPU one highly loaded core can use almost all of the available cache while the other core(s) don't need it). When you start a second task, each one gets 1 core + only 1 MB of cache. As a result, two tasks running on a 2-core chip ALWAYS execute slower than a single task on the same processor (the only question is whether the performance hit is a couple of percent or a few tens of percent).
On quad cores the situation is the same.
PLUS, on the latest Intel CPUs like the Core i5/Core i7 (not sure about the Xeon E5506) there is a feature called Turbo Boost. If it detects that basically only 1 (or 2) cores are loaded, it over-clocks them automatically. But if you load all the cores, the over-clock is removed (to avoid overheating).

That was the general theory. Now for the specific processors mentioned.
Q6600: it is not a 4-core processor, as you might think at first. It is two dual-core processors (Core 2 Duo) in one package. Each die has 2 cores and 4 MB of shared level-2 cache running at the full processor frequency. You can write the formula as 2 x (2 cores + 4 MB L2) @ 2.4 GHz.
The Xeon E5506 is a native quad-core processor. Each of the 4 cores has its own 256 KB of level-2 cache, plus there is 4 MB of shared level-3 cache working at a lower (uncore) frequency. You can write the formula as 4 x (1 core + 256 KB L2) @ 2.13 GHz + 4 MB L3 @ ?? GHz.

So it all adds up: for a program that is well optimized for parallel computation (N completely independent processes executing at the same time is the best case for it) and that makes high demands on memory-system speed, the Q6600 is indeed the more powerful processor, and a 20% difference looks quite plausible.

The E5506 can be faster only for programs that are poorly optimized for multithreaded processing and make low demands on memory-subsystem speed (or when the main data set is so huge that the cache does not help much).
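
A small experiment along these lines (plain C, purely illustrative) is to perform the same number of scattered reads over working sets of growing size: the cost per read jumps once the working set no longer fits in the 256 KB L2, and again once it no longer fits in the shared last-level cache. That jump is exactly the per-task penalty described above when several tasks have to share the cache.

```c
/* cache_steps.c - time the same number of scattered reads over working
 * sets of increasing size; the cost per read jumps once the working set
 * no longer fits in L2 / shared L3 cache.  Illustrative sketch only.
 * Build (assumed): gcc -O2 cache_steps.c -o cache_steps
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    const size_t sizes_kb[] = { 128, 256, 1024, 4096, 16384, 65536 };
    const long reads = 20 * 1000 * 1000;   /* same number of accesses for every size */

    for (size_t s = 0; s < sizeof(sizes_kb) / sizeof(sizes_kb[0]); s++) {
        size_t nelem = sizes_kb[s] * 1024 / sizeof(long);
        long *buf = malloc(nelem * sizeof(long));
        long sum = 0;
        unsigned idx = 1;

        for (size_t i = 0; i < nelem; i++)
            buf[i] = (long)i;

        double t0 = now_sec();
        for (long r = 0; r < reads; r++) {
            idx = idx * 1664525u + 1013904223u;   /* cheap pseudo-random walk, defeats the prefetcher */
            sum += buf[idx % nelem];
        }
        double t1 = now_sec();

        printf("%6zu KB working set: %.1f ns/read (check %ld)\n",
               sizes_kb[s], (t1 - t0) / reads * 1e9, sum);
        free(buf);
    }
    return 0;
}
```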
