Edit: perhaps I will do a test with SSE2; those registers probably don't need these corrections because each double-precision value in them is only 64 bits wide.
So, I tried out SSE2. It is no good for us. The results of multiplications and divisions in SSE differ from those of the FPU. The last-bit difference comes from the architecture of the mul/div units (cheaper, faster solutions... eh...).
Edit: The additions and subtractions are good.
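One way to check whether two results really differ only in the last bit is to compare the raw bit patterns of the doubles. This is a minimal Python sketch, not taken from any Einstein@Home code: `ulp_distance` is a hypothetical helper name, and it assumes both values are finite and have the same sign.

```python
import struct

def ulp_distance(a: float, b: float) -> int:
    """Count the representable doubles between a and b.

    Assumption: a and b are finite and have the same sign.
    """
    # Reinterpret each IEEE 754 double as a signed 64-bit integer.
    ia = struct.unpack("<q", struct.pack("<d", a))[0]
    ib = struct.unpack("<q", struct.pack("<d", b))[0]
    return abs(ia - ib)

x = 1.0
y = 1.0 + 2.0 ** -52   # the next double above 1.0
print(ulp_distance(x, y))   # 1: the values differ only in the last bit
```

A distance of 0 means the two results are bit-identical, and a distance of 1 means they differ only in the last bit.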
The second test WU:
WU: l1_0229.0_S5R1__2610_S5R1a_0 - T: 3411.7 sec
On P D 830 3.21 GHz
standard: rough average ~41,700
S5T0000: maybe a couple hundred seconds off
S5T0301 and S5T0304: ~39,700
= drop of ~2000
On P4 HT (on) 3.4 GHz, not a large sample to compare, but:
standard: 61,604 (on a single lone WU)
S5T0001: 58,938 (two WUs together)
S5T0301: down to 58,436 (two WUs together)
= drop of ~3000
The 3rd test WU:
WU: h1_0081.5_S5R1__242_S5R1a_1 - T: 4120.9 sec
That only confirms that S5T0307 is about 14% faster than the original app (on my A64 2800+).
Now I'm testing S5T0308.
First test WU crunched by S5T0308:
WU: l1_0229.0_S5R1__2610_S5R1a_1 - T: 3117.4 sec.
Because S5T0308 seems to produce invalid results, I stopped testing other WUs with this app. I don't have an SSE3 CPU, so I cannot test the new S5T07XX app.
Here is a little summary:
WU - Official app - S5T0003 - S5T0307 - S5T0308
-----
h1_0318.0_S5R1__23088_S5R1a_1 - 4012.6 sec - not tested - not tested - not tested
h1_0081.5_S5R1__242_S5R1a_1 - 4718.7 sec - 4773.2 sec - 4120.9 sec - not tested
l1_0229.0_S5R1__2610_S5R1a_0 - 3908.6 sec - 3916.4 sec - 3411.7 sec - not tested
l1_0229.0_S5R1__2610_S5R1a_1 - 3949.9 sec - 3916.1 sec - 3404.0 sec - 3117.4 sec
-----
Time required - 100% - approx. 100% - approx. 87% - approx. 79%
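The percentages in the last row can be rechecked from the table itself. Here is a small Python sketch, assuming each percentage is simply an app's run time divided by the official app's time on the same WU; the dictionary layout and the name `relative_times` are illustrative only.

```python
# Run times in seconds, copied from the summary table.
# Untested combinations are simply left out.
times = {
    "h1_0081.5_S5R1__242_S5R1a_1": {
        "official": 4718.7, "S5T0003": 4773.2, "S5T0307": 4120.9,
    },
    "l1_0229.0_S5R1__2610_S5R1a_0": {
        "official": 3908.6, "S5T0003": 3916.4, "S5T0307": 3411.7,
    },
    "l1_0229.0_S5R1__2610_S5R1a_1": {
        "official": 3949.9, "S5T0003": 3916.1, "S5T0307": 3404.0,
        "S5T0308": 3117.4,
    },
}

def relative_times(row: dict) -> dict:
    """Express every app's time as a percentage of the official app's time."""
    base = row["official"]
    return {app: round(100.0 * t / base, 1) for app, t in row.items()}

for wu, row in times.items():
    print(wu, relative_times(row))
```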
I just did an exact test with the same WU h1_0167.0_S5R1__2997_S5R1a_1 on my Athlon XP 1700+ (SSE and SSE2 capable but not SSE3!)
official App.:
2006-06-24 12:47:58.6976 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/einstein_S5R1_4.02_windows_intelx86.exe'.
2006-06-24 12:47:58.7077 [normal]: Started search at lalDebugLevel = 0
2006-06-24 12:47:59.6891 [normal]: Checkpoint-file 'Fstat.out.ckp' not found.
2006-06-24 12:47:59.6891 [normal]: No usable checkpoint found, starting from beginning.
Detected CPU type 1
2006-06-24 14:19:44.4145 [normal]: Search finished successfully.
-> 01:29:17 = 5357 sec. (but not uploaded!)
Akos S5T0307:
2006-06-24 11:21:55.6536 [normal]: E@H S5R1 4.02 SSE-0307 TEST
2006-06-24 11:21:55.6836 [normal]: Started search at lalDebugLevel = 0
2006-06-24 11:21:57.0456 [normal]: Checkpoint-file 'Fstat.out.ckp' not found.
2006-06-24 11:21:57.0456 [normal]: No usable checkpoint found, starting from beginning.
Detected CPU type 1
2006-06-24 12:43:06.6277 [normal]: Search finished successfully.
-> 01:16:13 = 4573 sec. (uploaded but not yet validated!)
Thanks a lot, Akos!
Udo
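For anyone who wants to recheck the two run times quoted in this post, here is a short Python sketch that converts the `hh:mm:ss` figures to seconds and compares them. The helper name `hms_to_seconds` is made up for illustration; the two durations are the ones reported above.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an 'hh:mm:ss' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

official = hms_to_seconds("01:29:17")   # 5357 sec, official app
s5t0307 = hms_to_seconds("01:16:13")    # 4573 sec, S5T0307

saved = official - s5t0307
percent_less_time = 100.0 * saved / official
print(f"{saved} sec saved, {percent_less_time:.1f}% less time")
```

This is in line with the roughly 14% improvement reported for S5T0307 earlier in the thread.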
I love the S5T0307 version:
Standard app: 28100 sec
S5T0003: 27500 sec
Hybrid S5T0003/S5T0307: 24100 sec
Crunched on a Dual Opteron 250 @ 2.6 GHz / 2 GB RAM / SSE2 capable
I'm waiting for the 1st complete run with S5T0307 now ;)
Great work, Akos, both here and at SIMAP and SZTAKI!
Hope the timings on these similar workunits help.
Hybrid WU (~90% stock, ~10% S0003), h1_0078.5_S5R1__153_S5R1a_0
CPU time = 4,824.41
Granted credit = 18.43
S0003 WU, h1_0078.5_S5R1__151_S5R1a_1
CPU time = 4,708.22
Granted credit = 18.43
116.19 seconds less, 2.408% speed-up
ThinkPad R52, set to maximum speed
Windows XP Pro, SP2
Pentium M 740 1.73 GHz (Dothan) SSE2
768MB DDR2
P4 HT (on) 3.4 GHz
Standard: 61,604
S5T0301: 57,828 (valid result)
= drop of 3,776
I think that's 6.1%. (Can anyone tell me whether I'm getting the percentage right, please? I can't remember.)
P D 830 3.21 GHz
Standard: ~41,800
S5T0709: 35,163 (still waiting for the other result)
= drop of 6,637
I think that's about 15.9%.
Utmost thanks again, Akosf.
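On the question of whether the percentages are right: the usual formula is the drop divided by the standard time, times 100. A quick Python check, using only the figures from this post (`percent_drop` is just an illustrative name):

```python
def percent_drop(baseline: float, improved: float) -> float:
    """Percentage by which 'improved' is lower than 'baseline'."""
    return 100.0 * (baseline - improved) / baseline

p4 = percent_drop(61604, 57828)      # P4 HT: standard vs S5T0301
pd830 = percent_drop(41800, 35163)   # P D 830: standard (approx.) vs S5T0709

print(f"P4 HT:   {p4:.1f}%")
print(f"P D 830: {pd830:.1f}%")
```

The first figure rounds to 6.1%. The second comes out at roughly 15.9% when rounded, bearing in mind that the ~41,800 baseline is itself only approximate.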
I got some hybrid results on my A64 3500+ at standard speed.
These are approximate times since they come from different WUs, but all of them are from the same data file.
100% with stock app: about 31800 s
40% stock + 60% S5S0003: about 31600 s
20% S5S0003 + 80% S5S0007: about 28100 s
So a large WU run only with S5S0007 should come in somewhere between 27000 and 27500 s; we'll have to see. Over 1 hour faster than stock, thanks akosf.
Only the stock WU has validated so far; the other two are waiting for a second computer, but that should be no problem since these are the two stable versions I have used.
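The 27000-27500 s estimate can be reproduced with a simple linear model. The sketch below assumes, purely for illustration, that a hybrid run's total time is the weighted sum of each app's full-WU time, weighted by the fraction of the WU it crunched; `solve_pure_time` is a hypothetical helper, and the inputs are the approximate times quoted in this post.

```python
def solve_pure_time(mixed_time: float, known_fraction: float,
                    known_time: float) -> float:
    """Solve mixed = f * known + (1 - f) * unknown for 'unknown'.

    Assumption: run time scales linearly with the share of the WU
    processed by each app.
    """
    unknown_fraction = 1.0 - known_fraction
    return (mixed_time - known_fraction * known_time) / unknown_fraction

stock = 31800.0
# 40% stock + 60% S5S0003 took about 31600 s.
s0003 = solve_pure_time(31600.0, 0.40, stock)
# 20% S5S0003 + 80% S5S0007 took about 28100 s.
s0007 = solve_pure_time(28100.0, 0.20, s0003)

print(f"S5S0003 alone: about {s0003:.0f} s")
print(f"S5S0007 alone: about {s0007:.0f} s")
```

Under that assumption the S5S0007-only figure lands inside the 27000-27500 s range estimated here, but since the inputs come from different WUs it is only a rough guide.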