Nehalem-class hosts on Einstein

th3
Joined: 24 Aug 06
Posts: 208
Credit: 2208434
RAC: 0

Yes, 4 vs 8 WUs, so HT

Message 88540 in response to message 88539

Yes, 4 vs 8 WUs, so HT enabled will be more productive for sure.

archae86
Joined: 6 Dec 05
Posts: 3146
Credit: 7060594931
RAC: 1156811

RE: To Hyperthread or not

Message 88541 in response to message 88538

Quote:
To Hyperthread or not to Hyperthread:
I just turned off HT in the BIOS and started testing. Preliminary estimate is somewhere around 35% shorter runtimes without HT.

If that is borne out, then the hyperthreading productivity gain would be 30%, which is a nice answer to the SMT-doubters for this case.
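For anyone who wants to check the arithmetic, here is a minimal Python sketch of where the 30% figure comes from, assuming "35% shorter without HT" means the non-HT per-WU runtime is 0.65 of the HT runtime:

# Normalised per-WU runtimes; the 0.65 factor is the assumption stated above.
t_ht = 1.0            # per-WU runtime with HT on (8 WUs at once)
t_noht = 0.65 * t_ht  # per-WU runtime with HT off (4 WUs at once)

throughput_ht = 8 / t_ht      # WUs per unit time, whole chip, HT on
throughput_noht = 4 / t_noht  # WUs per unit time, whole chip, HT off

gain = throughput_ht / throughput_noht - 1
print(f"HT productivity gain: {gain:.0%}")  # prints 30%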

In case it helps, let me contribute that the estimated natural cycle length for your work at frequencies of 1125.nn is 372. So the third cyclic peak is narrow and predicted at sequence number 1116, very close to the work you ran at 1119 through 1124, while the 2.5-cycle valley is broad and centered at 930, pretty close to the work you have reported at 908 through 911.
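The peak and valley positions follow directly from that cycle-length estimate; a tiny sketch, assuming the 372 figure is right:

# Cycle arithmetic for the 1125.nn frequency band; the cycle length is the
# estimate quoted above, not a measured value.
cycle = 372
third_peak = 3 * cycle    # 1116, near the 1119-1124 sequence numbers
valley_2_5 = 2.5 * cycle  # 930.0, near the 908-911 sequence numbers
print(third_peak, valley_2_5)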

On what date did work reported as host 1584570 switch from something else to a Core i7 920?

th3
Joined: 24 Aug 06
Posts: 208
Credit: 2208434
RAC: 0

RE: On what date did work

Message 88542 in response to message 88541

Quote:
On what date did work reported as host 1584570 switch from something else to a Core i7 920?

WUs reported on or after the 21st were run on the i7, except 45002803 (mixed). Then there are various overclocks to consider, so basically it narrows down to just a few WUs which were done at 3.2 GHz core / 1600 MHz RAM (dual channel only for now).

Some that can be used for comparing:

HT enabled:
h1_1125.40_S5R4__1122_S5R4a ...... 34,898.89
h1_1125.40_S5R4__1121_S5R4a ...... 35,247.06
h1_1125.40_S5R4__1120_S5R4a ...... 35,212.80
h1_1125.40_S5R4__1119_S5R4a ...... 35,371.91
avg: 35,182.665

HT Disabled:
h1_1125.40_S5R4__1113_S5R4a ...... 22,300.97
h1_1125.40_S5R4__1112_S5R4a ...... 22,225.88
avg: 22,263.425

HT adds only 58% to the WU runtime while doing twice as many WUs at once, so no one would want to disable HT on a Core i7 crunching E@H.
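To spell the comparison out, here is a quick Python sketch using only the averages listed above:

# Runtimes in seconds, copied from the lists above.
ht_times = [34898.89, 35247.06, 35212.80, 35371.91]  # 8 WUs at once
noht_times = [22300.97, 22225.88]                    # 4 WUs at once

avg_ht = sum(ht_times) / len(ht_times)        # ~35,182.7 s
avg_noht = sum(noht_times) / len(noht_times)  # ~22,263.4 s

runtime_penalty = avg_ht / avg_noht - 1      # ~0.58: HT adds ~58% per WU
throughput_gain = 2 * avg_noht / avg_ht - 1  # ~0.27: twice the WUs, so ~27% more work per day
print(f"{runtime_penalty:.0%} longer per WU, {throughput_gain:.0%} more throughput")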

Sorry I don't have more data for comparing; it was hard enough to wait for those 2 pure non-HT WUs to finish, as I had a serious itch to change some BIOS settings at the time. Nehalem brings a lot of new stuff to play with for the overclocker, and I feel like a born-again noob when messing around in the BIOS setup.

archae86
Joined: 6 Dec 05
Posts: 3146
Credit: 7060594931
RAC: 1156811

RE: Some that can be used

Message 88543 in response to message 88542

Quote:

Some that can be used for comparing:

HT enabled:
h1_1125.40_S5R4__1122_S5R4a ...... 34,898.89
h1_1125.40_S5R4__1121_S5R4a ...... 35,247.06
h1_1125.40_S5R4__1120_S5R4a ...... 35,212.80
h1_1125.40_S5R4__1119_S5R4a ...... 35,371.91
avg: 35,182.665

HT Disabled:
h1_1125.40_S5R4__1113_S5R4a ...... 22,300.97
h1_1125.40_S5R4__1112_S5R4a ...... 22,225.88
avg: 22,263.425

HT adds only 58% to the WU runtime while doing twice as many WUs at once, so no one would want to disable HT on a Core i7 crunching E@H.


On the assumption that the cycle length Bikeman and I estimate here to be 372 is correct, you have pretty good symmetry around the third cycle peak at 1116 if one uses the 1120 and 1119 results for HT and both nHT results.

That gives the nHT result taking 0.6308 x 2, or 1.262, times as much "core time" to get an increment of work done as in HT mode. So in a hypothetical 24 hours running at 0.99 CPU efficiency, the HT box would complete 19.39 results of this difficulty, while the nHT option could complete 15.37. That is less gain from HT than I saw long ago on SETI with a Gallatin host (a Northwood descendant with the extra-big cache), but more gain from HT than I think was typical back then on the then-current Einstein apps. In fact, there was one stage in the famous series of amazing akosf improvements at which what had been a modest HT improvement switched to being a modest but definite HT degradation.
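For reference, a short Python sketch of that arithmetic, using only the runtimes already posted (the 0.99 CPU efficiency and the choice of the 1119/1120 HT results are as stated above):

# HT results closest to the cycle peak at 1116, plus both nHT results.
ht_avg = (35212.80 + 35371.91) / 2   # ~35,292.4 s per WU; two WUs share each core
nht_avg = (22300.97 + 22225.88) / 2  # ~22,263.4 s per WU; one WU per core

core_time_ratio = 2 * nht_avg / ht_avg      # ~1.262: nHT core time per unit of work
day_core_seconds = 24 * 3600 * 0.99 * 4     # four cores at 0.99 CPU efficiency
results_ht = day_core_seconds * 2 / ht_avg  # ~19.39 results per day
results_nht = day_core_seconds / nht_avg    # ~15.37 results per day
print(core_time_ratio, results_ht, results_nht)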

This sample should not be referenced for absolute times when comparing to other hosts, which will likely have run WUs much farther from the cycle peak on average. For example, in three weeks my newish Q9550 box has not run a single WU anywhere near as close to the cycle peak as any of these, so it looks much more competitive with th3's i7 than it actually is; in fact it would look closely matched to the nHT i7 on its current settings if one ignored this inconvenient truth.

As the computational mix alters with distance from cycle peak, it is likely that the HT gain will be slightly different as well, but I won't hazard a guess as to whether the (much more common) cycle valley results will see more or less HT gain.

As the current Nehalems running with HT have only 1/3 as much cache per thread as does, for example, a Q9550 Penryn-class quad, it was not quite a given that HT would be a win. I'm glad to see it winning this handsomely.
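The "1/3 as much cache per thread" remark is easy to reconstruct, assuming the usual published cache sizes (8 MB of shared L3 on the i7 920, 2 x 6 MB of L2 on the Q9550):

# Rough cache-per-thread arithmetic; sizes in MB are the assumed specs above.
i7_l3_per_thread = 8 / 8    # 1 MB per thread with HT on (8 threads)
q9550_l2_per_core = 12 / 4  # 3 MB per core (4 cores)
print(i7_l3_per_thread / q9550_l2_per_core)  # ~0.33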

John Clark
Joined: 4 May 07
Posts: 1087
Credit: 3143193
RAC: 0

Am I being a bit silly? I

Am I being a bit silly?

I was looking at the times my 65 nm Quad sorts out E@H WUs, and it runs them off in between 22,000 and 25,000 seconds, 4 at a time, while my 45 nm Quad turns W@H WUs out (4 at a time) in between 15,800 and 18,500 seconds.

I was under the impression that the 4-core i7, with HT on each core, was much faster than the quad Penryns?

Shih-Tzu are clever, cuddly, playful and rule!! Jack Russell are feisty!

jedirock
Joined: 11 Jun 06
Posts: 23
Credit: 1517411
RAC: 0

RE: Am I being a bit

Message 88545 in response to message 88544

Quote:

Am I being a bit silly?

I was looking at the times my 65 nm Quad sorts out E@H WUs, and it runs them off in between 22,000 and 25,000 seconds, 4 at a time, while my 45 nm Quad turns W@H WUs out (4 at a time) in between 15,800 and 18,500 seconds.

I was under the impression that the 4-core i7, with HT on each core, was much faster than the quad Penryns?


Not much faster, but they should be in around the same time frame. Maybe that's the penalty for the much smaller L1 and L2 caches?

archae86
Joined: 6 Dec 05
Posts: 3146
Credit: 7060594931
RAC: 1156811

RE: Am I being a bit

Message 88546 in response to message 88544

Quote:
Am I being a bit silly?

Let's see: you are comparing a bottom-spec, first-release Nehalem, running work extremely close to the cycle peak and with its overclock tuning possibly not at all finished, to a higher-spec, mature-release Penryn-class quad at a significant overclock. The quad is running work most of which is far from the cycle peak, and every single one of its WUs is farther from the cycle peak than any of the six results we have discussed in this comparison.

So if you are suggesting that is a comparison, yes it is silly.

th3
Joined: 24 Aug 06
Posts: 208
Credit: 2208434
RAC: 0

RE: I was looking at the

Message 88547 in response to message 88544

Quote:
I was looking at the times my 65 nm Quad sorts out E@H WUs, and it runs them off in between 22,000 and 25,000 seconds, 4 at a time, while my 45 nm Quad turns W@H WUs out (4 at a time) in between 15,800 and 18,500 seconds.
I was under the impression that the 4-core i7, with HT on each core, was much faster than the quad Penryns?

What is the clock of your 65 nm Quad? The times I posted above are at the cycle peak; the fastest WUs I had on the Nehalem were around 27,300 seconds running 8 at a time, which I think is quite good.

But you are right, Nehalems are not much faster than Penryns when HT is not being utilized, just a few percent, and in some games they show a few percent lower performance than Penryns, most likely due to the small L2 cache of only 256 KB per core. The L3 doesn't always make up for the small L2, as the L3 has higher latency.

Give the Nehalem some memory-intensive tasks that can utilize lots of threads and the Penryn becomes a small dot in the rear-view mirror. When Nehalem shines it really does so, and when it loses it does so by only small margins. All in all a solid step forward, but not what everyone had hoped for.

Novasen169
Joined: 14 May 06
Posts: 43
Credit: 2767204
RAC: 0

By the way, if you're running

By the way, if you're running 8 applications, like you would be with HT on, wouldn't you easily get punished for taking up too much RAM? As in, longer runtimes? I think there are applications (not sure about Einstein) that require as much as 800 MB of RAM. 800 MB * 8 = 6.25 GB of RAM required, and that's if you're not running ANYTHING else, so also no Windows applications. So basically to run that you'd need around 8 GB of RAM... which seems a bit much, even nowadays. Or doesn't this problem occur?
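A quick check of that arithmetic, as a Python sketch (the 800 MB per task is the hypothetical figure above, not an Einstein number):

per_task_mb = 800                      # hypothetical worst-case task footprint
tasks = 8                              # 8 simultaneous tasks with HT on
total_gb = per_task_mb * tasks / 1024  # 6.25 GB for the science tasks alone
print(total_gb)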

archae86
Joined: 6 Dec 05
Posts: 3146
Credit: 7060594931
RAC: 1156811

RE: By the way, if you're

Message 88549 in response to message 88548

Quote:
By the way, if you're running 8 applications, like you would be with HT on, wouldn't you easily get punished for taking up too much RAM? As in, longer runtimes? I think there are applications (not sure about Einstein) that require as much as 800 MB of RAM. 800 MB * 8 = 6.25 GB of RAM required, and that's if you're not running ANYTHING else, so also no Windows applications. So basically to run that you'd need around 8 GB of RAM... which seems a bit much, even nowadays. Or doesn't this problem occur?

It must depend on the application, to be sure.

For Einstein, my copy of Process Explorer shows a working set for each copy of the app of 59 to 64 Mbytes, with a peak working set of under 85 Mbytes. So with WinXP, which I think of as using about 250 Mbytes, you'd have plenty even with a single gigabyte.

The SETI working set size for the one I'm currently running (not Astropulse, but Multibeam, in the AK SSE4.1 variant) shows both current and peak working sets under 40 Mbytes.

So it's not a problem in 1 GB for Einstein or SETI on WinXP. I'd want at least 2 GB for Vista (but then I'd want that just to run MS Word under Vista).
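A rough sanity check of that headroom claim, using the working-set figures quoted above and treating each task as if it sat at its peak:

einstein_peak_mb = 85  # peak working set per Einstein task, from Process Explorer
tasks = 8              # 8 simultaneous tasks with HT on
winxp_mb = 250         # rough WinXP baseline from the post
total_mb = einstein_peak_mb * tasks + winxp_mb  # 930 MB, still under 1 GB
print(total_mb)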

I imagine some of the other projects may distribute apps with much larger footprints. However, you don't need to reach absolutely zero swap; Nehalem has increased memory bandwidth and reduced latency compared to the Conroe-descended parts by more than it has increased computation, so I question whether there would be a devastating impact even with the really huge apps. I suspect it is difficult to maintain really high usage of a multi-gigabyte working set, even in eighths.

I have to agree with your root point, that more simultaneous threads want more memory, but I don't think it has practical significance in most of the BOINC world, nor even in most of the real world.
