S41.xx Observation Thread

B52
Joined: 19 Feb 05
Posts: 45
Credit: 273,899
RAC: 0

RE: Continuing my practice

Message 29948 in response to message 29924

Quote:

Continuing my practice of retesting the same archived WU on my Gallatin (Northwood-descended P4 EE running 3.2 GHz HT):

S40.12 39:05
S40.04 32:13
C41.00 41:26
C41.01 55:55
C41.00 41:11 (retry)
S41.06 34:27

So I see S41.06 as much faster than the unfortunate S40.12 on this machine, but still not so fast as the S40.04 it currently runs. I'll try S41.06 on my other machines, and if it looks more promising on them, retry on this machine with another work unit and more carefully controlled conditions.
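For anyone who wants to see these timings as ratios, here is a minimal Python sketch. It assumes the figures are minutes:seconds for the same work unit. The helper name `to_seconds` and the choice of S40.04 as the baseline are illustrative, not from the post.

```python
# Timings copied from the post, assumed to be in minutes:seconds.
timings = {
    "S40.12": "39:05",
    "S40.04": "32:13",
    "C41.00": "41:26",
    "C41.01": "55:55",
    "C41.00 (retry)": "41:11",
    "S41.06": "34:27",
}


def to_seconds(mmss: str) -> int:
    """Convert a 'minutes:seconds' string to whole seconds."""
    minutes, secs = mmss.split(":")
    return int(minutes) * 60 + int(secs)


seconds = {app: to_seconds(t) for app, t in timings.items()}

# Run time relative to S40.04 (hypothetical baseline choice); > 1.0 means slower.
baseline = seconds["S40.04"]
relative = {app: secs / baseline for app, secs in seconds.items()}

for app, secs in seconds.items():
    print(f"{app:15s} {secs:5d} s  {relative[app]:.2f} x S40.04")
```

On these numbers S41.06 takes roughly 7% longer than S40.04 and about 12% less time than S40.12, which matches the ordering described above. A single run per app says nothing about run-to-run variation, so small differences should be read with care.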

Same observation here. S40.04 is still the fastest cruncher on my Prescott 3.0 GHz running HT'ed (so far ;-) )

As said earlier in this thread, I wonder where the penalty lies on these P4s with large caches running HT'ed??

Mr.Pernod
Joined: 9 Jul 05
Posts: 83
Credit: 3,250,626
RAC: 0

so far 1 case of 0

So far 1 case of 0 credit:
Intel Xeon/S41.06 versus Apple/4.37 and AthlonXP/4.37.
The others (a lot) on the Xeons and AthlonMP/XPs are either pending or validated without a hitch.
I'll post some times later, when I get home.

B52,
Hyperthreading penalty on P4/Xeon seems related to the small L1 data- and instruction-caches on these chips, as DanNeely posted earlier.

B52
Joined: 19 Feb 05
Posts: 45
Credit: 273,899
RAC: 0

RE: B52, Hyperthreading

Message 29950 in response to message 29949

Quote:
B52,
Hyperthreading penalty on P4/Xeon seems related to the small L1 data- and instruction-caches on these chips, as DanNeely posted earlier.

Thx m8, that post must have slipped thru my reading

Aglarond
Joined: 3 Feb 06
Posts: 2
Credit: 216,170
RAC: 0

RE: Doesn't setting the

Message 29951 in response to message 29947

Quote:
Doesn't setting the project to only use 1 processor on a multiprocessor machine do the trick?

No, this will set BOINC to run only one project at a time.

Honza
Joined: 10 Nov 04
Posts: 136
Credit: 3,332,354
RAC: 0

RE: B52, Hyperthreading

Message 29952 in response to message 29949

Quote:
B52,
Hyperthreading penalty on P4/Xeon seems related to the small L1 data- and instruction-caches on these chips, as DanNeely posted earlier.

And high latency of L2 cache, slow FSB that has to feed cores and read/write memory content.

zagadka
Joined: 29 Apr 06
Posts: 12
Credit: 17,088
RAC: 0

RE: RE: Doesn't setting

Message 29953 in response to message 29951

Quote:
Quote:
Doesn't setting the project to only use 1 processor on a multiprocessor machine do the trick?

No, this will set BOINC to run only one project at a time.

My mistake, it's under general preferences so it should be obvious that it has nothing to do with how projects are being run.

Edited for typos.

Zap
Joined: 12 Feb 06
Posts: 15
Credit: 3,900,434
RAC: 0

RE: Posted 42 days ago by

Quote:

Posted 42 days ago by Zap
-------------------------------------------------------------------------------
AMD64 xp 3000 Newcastle core 10% overclock.

Went from 14k-plus secs with the original app, through an average of somewhat less than 6000 with A36, to now my first result with S38 in 4235 secs.
Quite impressive, Akosf.

Now with S41.06 about 2510 secs (average of 3 z1 results); that's 5.6 times faster!!!!
This is so very impressive. No one here is ever gonna forget Akosf, I guess.

Validation will take some time because I'm teamed up with 30k-secs-plus crunchers.
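The "5.6 times faster" figure can be checked with a few lines of Python. This is only a sketch: it treats "14k plus" as exactly 14,000 seconds and "less than 6000" as 6,000 seconds, both of which are assumptions, so the real ratios for the original app and A36 are approximate.

```python
# Run times in seconds as quoted in the post.
# 14_000 and 6_000 are assumed round values for "14k plus" and "less than 6000".
run_times = {
    "original": 14_000,
    "A36": 6_000,
    "S38": 4_235,
    "S41.06": 2_510,
}

# Speedup of each app relative to the original app.
speedup = {app: run_times["original"] / secs for app, secs in run_times.items()}

for app, factor in speedup.items():
    print(f"{app:9s} {run_times[app]:6d} s  {factor:.1f}x")
```

With those assumed round values, S41.06 comes out at about 5.6x the original app and S38 at about 3.3x.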

B52
Joined: 19 Feb 05
Posts: 45
Credit: 273,899
RAC: 0

RE: RE: B52, Hyperthreadi

Message 29955 in response to message 29952

Quote:
Quote:
B52,
Hyperthreading penalty on P4/Xeon seems related to the small L1 data- and instruction-caches on these chips, as DanNeely posted earlier.
And high latency of L2 cache, slow FSB that has to feed cores and read/write memory content.

Thx for the answers guys.

Plz correct me if I'm wrong on this one.

The only thing then, that will give a HT'ed P4 Prescott a boost is SSE2 or SSE3 optimized code ?? Or will even that not increase the speed ??

Cheers

Mr.Pernod
Joined: 9 Jul 05
Posts: 83
Credit: 3,250,626
RAC: 0

RE: Thx for the answers

Message 29956 in response to message 29955

Quote:

Thx for the answers guys.

Plz correct me if I'm wrong on this one.

The only thing then, that will give a HT'ed P4 Prescott a boost is SSE2 or SSE3 optimized code ?? Or will even that not increase the speed ??

Cheers


Only to a very limited extent, in my opinion. To the best of my knowledge, the only thing that would make a big impact is a decrease in the "important" dataset, as has been done for S39L. But even that dataset (~11 KB) doesn't fit in the 8 KB L1 data cache of my Prestonias (Northwood-based Xeons), so when running with HT enabled, there are 2 threads, both "wanting" 11 KB of L1 at the same time, while the CPU only has 8 KB to offer.
This means there are cache misses, flushes, reloads and fetches from the L2 cache, or even from main system RAM (worst case), all of which add latency to the memory handling alone.
Another issue with HT is the fact that those two Einstein threads are basically doing the same type of work, both claiming resources of a similar nature from the CPU, which only has so many ALU- and FPU-execution units available.
Under ideal circumstances for HyperThreading, you would be running two different threads, one claiming ALU, the other claiming FPU execution units and their combined datasets fit together in L1 and/or L2 cache.
From my own "experiments" with HyperThreading I have found combinations like running SETI + SIMAP or SETI + Distributed.net RC5 at the same time to make the most optimal use of my Xeons, resulting in crunchtimes for both projects that were very, very close to the times I got with HT disabled on those systems.
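The cache argument above can be put into a back-of-the-envelope calculation. The sketch below only compares total working-set size against L1 size, using the ~11 KB and 8 KB figures from the post. The function name `l1_shortfall_kb` is made up for illustration, and the model ignores associativity, line size and replacement policy, so it shows the direction of the effect, not real miss rates.

```python
L1_DATA_KB = 8       # L1 data cache of the Northwood-based Xeon, per the post
WORKING_SET_KB = 11  # approximate "important" dataset per thread, per the post


def l1_shortfall_kb(threads: int,
                    working_set_kb: int = WORKING_SET_KB,
                    l1_kb: int = L1_DATA_KB) -> int:
    """Return how many KB of hot data cannot sit in L1 at the same time.

    Crude model: every thread wants its whole working set resident in the
    shared L1; anything beyond the cache size has to come from L2 or RAM.
    """
    return max(0, threads * working_set_kb - l1_kb)


for threads in (1, 2):
    print(f"{threads} thread(s): {l1_shortfall_kb(threads)} KB does not fit in L1")
```

Even a single thread overflows L1 by about 3 KB in this model, and two HT threads overflow it by about 14 KB, which is consistent with the extra misses and reloads described above.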

(hope that made sense)

TauCeti
Joined: 1 Apr 05
Posts: 16
Credit: 454,993
RAC: 0

RE: Another issue with HT

Message 29957 in response to message 29956

Quote:

Another issue with HT is the fact that those two Einstein threads are basically doing the same type of work, both claiming resources of a similar nature from the CPU, which only has so many ALU- and FPU-execution units available.
Under ideal circumstances for HyperThreading, you would be running two different threads, one claiming ALU, the other claiming FPU execution units and their combined datasets fit together in L1 and/or L2 cache.
From my own "experiments" with HyperThreading I have found combinations like running SETI + SIMAP or SETI + Distributed.net RC5 at the same time to make the most optimal use of my Xeons, resulting in crunchtimes for both projects that were very, very close to the times I got with HT disabled on those systems.
(hope that made sense)

That makes a lot of sense. Choosing the right mix of DC applications on an HT system can improve the overall performance a _lot_.

Example: take my Northwood P4 at 2.6 GHz.

Calibrate S41.06 without HT as "100%E@H" performance.
Calibrate the GIMPS (prime95) Trial-Factoring workload (up to 63 bit) without HT as "100%G" performance (enabling HT does not improve performance).

Running two HT instances of S41.06 decreases performance to 75%E@H combined, so disabling HT for E@H is the usual choice for maximum throughput.

Running one GIMPS process and S41.06 hyperthreaded together yields 75%E@H _and_ 64%G throughput.

So if you have two identical (for the sake of this argument) machines and use both clients on both machines, you get 150%E@H throughput and 128%G throughput. That's a combined throughput of 278% for both projects, compared to the 200%E@H or 200%G you get running only one project.
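The arithmetic above is easy to reproduce. Here is a minimal Python sketch using the percentages from this post. The variable names are illustrative, and adding E@H percent to GIMPS percent treats the two units as comparable, which is the same simplification the post makes.

```python
MACHINES = 2

# Per-machine throughput in percent of the single-threaded (HT off) baseline,
# as measured in the post on a Northwood P4 2.6 GHz.
EH_ALONE_HT_OFF = 100  # one S41.06 instance, HT disabled
EH_TWO_HT = 75         # two S41.06 instances under HT, combined
EH_MIXED = 75          # S41.06 running alongside GIMPS TF under HT
G_MIXED = 64           # GIMPS TF running alongside S41.06 under HT

# Scenario A: both machines run only E@H with HT disabled.
eh_only = MACHINES * EH_ALONE_HT_OFF

# Scenario B: both machines run two E@H instances under HT.
eh_two_ht = MACHINES * EH_TWO_HT

# Scenario C: both machines run one E@H and one GIMPS instance under HT.
mixed_eh = MACHINES * EH_MIXED
mixed_g = MACHINES * G_MIXED
combined = mixed_eh + mixed_g

print(f"E@H only, HT off:   {eh_only}%E@H")
print(f"E@H x2 under HT:    {eh_two_ht}%E@H")
print(f"E@H + GIMPS mixed:  {mixed_eh}%E@H + {mixed_g}%G = {combined}% combined")
```

The mixed setup gives up 50 points of E@H throughput across the two machines but gains 128 points of GIMPS throughput, which is where the 278% versus 200% comparison comes from.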

If you are only interested in doing E@H, you need to find another user who is only interested in GIMPS-TF work. Now you can team up (each user running both workloads) and both parties benefit ;)

Tau
