S38 Observation thread
I replaced akosf's C-37 with S-38 on all four of my machines just a couple of hours ago.
wcpuid reports that all four support SSE (though the two oldest don't support SSE2).
All four have completed one mixed result (started on C-37, finished on S-38) without any observed abnormal behavior.
Initial indications are of an appreciable further speedup compared to C-37. I'll report numbers and validation results here when I have them.
The machines which appear to be working and further sped up include:
P4 EE WinXPPro
Pentium M (of the initial Banias generation) WinXPPro
Pentium 3 Win98SE
Pentium II Win98SE
Extremely preliminary observations suggest that the Banias part may be getting the biggest benefit and the P4 the least, but all look well worth having, assuming validation and stability prove to be OK.
AMD XP 2600+ C37->S38
AMD XP 2600+ C37->S38 (approx. 25% increase) - no probs - validating fine
AMD 64 X2 3800+ S37a->S38 (approx. 26% increase) - no probs - validating fine
:-) You did it, akosf! The XP 2600+ on S38 (SSE) beat the X2 3800+ on S37a (SSE2).
Looking forward to your next SSE2 opt. Hope it won't need more than 512K L2.
Well done again! Keep having fun!
P3T 1.26GHz (512kB)
P3T 1.26GHz (512kB) C37: 13250s; S38: 8510s -> -36%
P3mobile 1.0GHz (256kB) C37: 16230s; S38: 10720s -> -34%
Both are averages over WUs of the same size.
Well done akosf! :-)
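For anyone who wants to turn their own before/after CPU times into the same kind of percentage, here is a minimal sketch. The function name and the dictionary layout are made up for illustration; the times are the averages quoted in the post above.

```python
def relative_change(old_seconds, new_seconds):
    """Fractional change in CPU time going from the old app to the new one.

    Negative means the new app needs less CPU time per WU.
    """
    return (new_seconds - old_seconds) / old_seconds


# Average CPU seconds per WU as quoted above: (C37, S38)
runs = {
    "P3T 1.26GHz (512kB)": (13250, 8510),
    "P3mobile 1.0GHz (256kB)": (16230, 10720),
}

for host, (c37, s38) in runs.items():
    change = relative_change(c37, s38)
    ratio = s38 / c37  # same quantity others report as S-38/C-37
    print(f"{host}: {change:+.0%}  (S38/C37 = {ratio:.3f})")
```

The S38/C37 ratio printed alongside is the same figure other posters in this thread report, so results given in either form can be compared directly.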
Pentium M Banias crunching
Pentium M Banias crunching major datafile 843.0
S-38 takes 61.6% of the average CPU time of my 5 most recent C-37 results.
The overall implied improvement compared to the officially distributed Albert 4.37 is thus .438*.616 = .270 of the previous CPU time, or a science output improvement on this machine by a factor of 3.71!
Initial indications are that this may show the best speedup of my four machines of varying Intel architecture.
akosf's contribution to Einstein, if his work somehow ends up running on a noticeable fraction of the user base, is stunning.
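The arithmetic above is just a chain of CPU-time ratios multiplied together, with the output factor being the reciprocal of the result. A small sketch, using the two ratios from this post (the function and variable names are only illustrative):

```python
def combined_ratio(*ratios):
    """Multiply successive CPU-time ratios into one overall ratio."""
    product = 1.0
    for r in ratios:
        product *= r
    return product


c37_vs_dist = 0.438  # C-37 CPU time / distributed Albert 4.37 CPU time
s38_vs_c37 = 0.616   # S-38 CPU time / C-37 CPU time

s38_vs_dist = combined_ratio(c37_vs_dist, s38_vs_c37)
output_factor = 1.0 / s38_vs_dist  # results per unit of CPU time, relative to Dist

print(f"S-38/Dist = {s38_vs_dist:.3f}, output factor = {output_factor:.2f}")
# prints: S-38/Dist = 0.270, output factor = 3.71
```

This assumes the two ratios were measured on comparable WUs; if the WU mix changed between the C-37 and S-38 samples, the product is only a rough estimate.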
RE: (approx 26%)increase -
Where can I get it?
RE: Where can I get it
http://einsteinathome.org/node/190906
RE: Initial indications are
[pre]
CPU          S-38/C-37  C-37/Dist  S-38/Dist
Pentium M    0.616      0.448      0.276
Pentium III  0.676      0.383      0.259
[/pre]
So far my S-38/C-37 reports are based on a single "pure" S-38 result per CPU.
By "Dist" I mean performance on the unmodified Albert 4.37 science application as distributed by the project.
First result on my sempron
First result on my Sempron 3000+: S-38/C-37 = 0.735
Edit: Interesting, so far Intel-based hosts are ~33% faster and AMD ~25% (S-38/C-37).
When you're really interested in a subject, there is no way to avoid it. You have to read the Manual.
First Results on a Barton
First Results on a Barton 3000+ :
"before" with A36 : ~ 2100 - 2200 sec
with S38 : 1,520.57 sec ; 1,627.00 sec ; 1,661.00 sec
First Results on a Barton 2500+ :
"before" with C37 : 6734 sec -> 7027 sec
with S38 : 5,161.00 sec ; 5,132.00 sec
Pretty fast :)
RE: Looking foward to your
Bruce suggested Chebyshev polynomials to me instead of Taylor series a month ago. I did a quick test to compare the two methods, but found that the Chebyshev approximation produced a worse average for the same number of coefficients. I'm working on a program that will generate more precise values. I believe that it has to be better. So, if it works, that means we don't need the 512kB look-up table (a very, very wild idea from me).
I prefer polynomials to a look-up table, because my Durons have just 256kB of cache altogether. :-)
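For readers who want to try this kind of comparison themselves, here is a rough sketch. It is not akosf's test program, which isn't shown in this thread. The target function (sin), the interval [-pi/2, pi/2], the count of 6 coefficients, and the use of interpolation at Chebyshev nodes are all assumptions chosen for illustration. It reports both the maximum and the average absolute error, since the two methods can rank differently depending on which one you look at.

```python
import math


def taylor_sin(x, n_coeffs):
    """Taylor polynomial of sin about 0 with n_coeffs coefficients
    (degree n_coeffs - 1; the even-power coefficients are zero)."""
    total = 0.0
    for k in range(1, n_coeffs, 2):  # odd powers only
        sign = -1.0 if (k // 2) % 2 else 1.0
        total += sign * x**k / math.factorial(k)
    return total


def cheb_fit(f, a, b, n_coeffs):
    """Chebyshev coefficients of the polynomial that interpolates f
    at the n_coeffs Chebyshev nodes of [a, b]."""
    n = n_coeffs
    half, mid = 0.5 * (b - a), 0.5 * (b + a)
    samples = [f(mid + half * math.cos(math.pi * (j + 0.5) / n)) for j in range(n)]
    return [
        2.0 / n * sum(samples[j] * math.cos(math.pi * k * (j + 0.5) / n) for j in range(n))
        for k in range(n)
    ]


def cheb_eval(coeffs, a, b, x):
    """Evaluate c0/2 + sum(c_k * T_k) at x using the Clenshaw recurrence."""
    t = (2.0 * x - a - b) / (b - a)  # map [a, b] onto [-1, 1]
    d = dd = 0.0
    for c in reversed(coeffs[1:]):
        d, dd = 2.0 * t * d - dd + c, d
    return t * d - dd + 0.5 * coeffs[0]


def error_stats(approx, a, b, points=2001):
    """Maximum and mean absolute error against math.sin on an even grid."""
    xs = [a + (b - a) * i / (points - 1) for i in range(points)]
    errs = [abs(approx(x) - math.sin(x)) for x in xs]
    return max(errs), sum(errs) / len(errs)


A, B, N = -math.pi / 2, math.pi / 2, 6  # assumed interval and coefficient count
coeffs = cheb_fit(math.sin, A, B, N)

taylor_max, taylor_mean = error_stats(lambda x: taylor_sin(x, N), A, B)
cheb_max, cheb_mean = error_stats(lambda x: cheb_eval(coeffs, A, B, x), A, B)

print(f"Taylor:    max {taylor_max:.3e}  mean {taylor_mean:.3e}")
print(f"Chebyshev: max {cheb_max:.3e}  mean {cheb_mean:.3e}")
```

Changing A, B, and N shows how the picture shifts: a Taylor polynomial is most accurate near its expansion point, while a Chebyshev fit spreads the error across the whole interval, so the outcome depends a lot on the interval width and on which error measure matters for the application.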