It would be really interesting to know whether the SSE2-capable version (first release) you have running on your X2 shows a bigger speedup than the "fixed" SSE version you now have running on your XP. In other words, is the speedup due mainly to the SSE2 instructions, or is there a speedup for SSE-only machines as well?
Don't lose that SSE2 version - it might be quite valuable :-).
I'm using the SSE2 version on my E8400, and there is for sure a speedup (over 4.38).
One WU almost broke the 9000 sec barrier, and that's when clocked to only 3.6GHz. Amazing work Bernd, you rule. Now someone please find out if the SSE2 version is indeed faster than the SSE version =)
HOST: http://einsteinathome.org/host/1280362/tasks
I added Donald A. Tevault's X2 6000 to the hosts that I feed into my DB, since I have an X2 6000 too, but mine is running the SSE2 version. In a few days there will be results to compare.
A comparison probably cannot be based on a single result or just a few, because the speedup, if there is any, might only show up for some WUs, depending on how close they are to a trough or to the peak.
cu,
Michael
1. Is it possible that the first version (with some SSE2 instructions) might be faster than the new one?
Yes, it is possible, but I simply don't know.
Hey Bernd,
How about you make that SSE2 version available as a "power user" app so that those of us who weren't quick enough off the mark can at least test it a little?
That'll save us having to bribe Michael or th3, the two who have so far admitted to having it :-).
Cheers,
Gary.
I would be surprised to see a significant (if at all measurable) speedup for the initial app version that contains SSE2 instructions.
The app code contains parts that are really important to performance, and those have now been converted to hand-optimized assembly code (SSE).
The rest of the code is in C but not that crucial for performance. Only in those parts of the code will there be a difference between the two app versions, mostly because scalar double precision code is compiled to x87 or SSE2 instructions, respectively. To make optimal use of SSE2, one would have to generate SSE2 versions of the hand-coded sections.
So, I would not hold my breath wrt. the SSE2 app variant.
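To illustrate the point about scalar double precision code, here is a minimal sketch. The function `weighted_sum` is hypothetical and not taken from the app; it only stands for the kind of plain C that is left to the compiler. With GCC on 32-bit x86, building it with `-O2` alone normally yields x87 instructions, while adding `-msse2 -mfpmath=sse` yields scalar SSE2 instructions instead. In both cases the loop handles one double at a time, which is why no large speedup should be expected from the switch alone.

```c
#include <stddef.h>

/*
 * Hypothetical example of ordinary scalar double-precision C code.
 * The source is identical for both builds; only the compiler flags decide
 * whether the multiply/add below become x87 or scalar SSE2 instructions:
 *
 *   gcc -m32 -O2 -c sum.c                        (x87 floating point)
 *   gcc -m32 -O2 -msse2 -mfpmath=sse -c sum.c    (scalar SSE2)
 */
double weighted_sum(const double *x, const double *w, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i) {
        acc += x[i] * w[i];   /* one element per iteration, not vectorized by hand */
    }
    return acc;
}
```

Comparing the output of `gcc -S` for the two builds shows the difference in the generated instructions.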
Switched from 4.35 to 4.49 on my AMD Opteron 1210 cpu running SuSE Linux 10.3 and BOINC 5.10.45. Looks definitely faster but graphics is not working. It used to work in 4.35.
Tullio
Graphics did not work when I switched from 4.35 to 4.49 during a WU run. Now that I started one with 4.49 it works.
I would be surprised to see a significant (if at all measurable) speedup for the initial app version that contains SSE2 instructions.
The app code contains parts that are really important to performance, and those have now been converted to hand-optimized assembly code (SSE).
The rest of the code is in C but not that crucial for performance. Only in those parts of the code will there be a difference between the two app versions, mostly because scalar double precision code is compiled to x87 or SSE2 instructions, respectively. To make optimal use of SSE2, one would have to generate SSE2 versions of the hand-coded sections.
So, I would not hold my breath wrt. the SSE2 app variant.
CU
Bikeman
In Bernd's initial posting I do not read anything about hand-coded SSE instructions, but about using a compiler switch.
I do not doubt what you are writing, but this app clearly is at least 10% faster, so there might be a chance that the SSE2 version is even a little faster.
Anyway, it will be fun to prove you are right. ;-)
Or in other words: Let's see if practice can prove theory. :-)
In Bernd's initial posting I do not read anything about hand-coded SSE instructions, but about using a compiler switch.
Yes, exactly this I wanted to stress: any SSE2 instructions will be in the "compiled from C", scalar parts of the app. They are not in the "vectorized" hand coded parts. So don't expect wonders :-).
Quote:
I do not doubt what you are writing, but this app clearly is at least 10% faster, so there might be a chance that the SSE2 version is even a little faster.
Anyway, it will be fun to prove you are right. ;-)
Or in other words: Let's see if practice can prove theory. :-)
cu,
Michael
Yes, that should be very interesting. Any speed improvement (SSE or SSE2 variant) compared with 4.38 has several causes:
- better "hardware prefetching" through the use of SSE prefetching instructions (more noticeable for the "slow" WUs)
- I think the new app uses the same "interleaved loop" variant as the latest MacOS Intel app (stuff transplanted from Akos' magic app :-) ) (more noticeable for the "fast" WUs)
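The two ideas in the list above can be sketched in C, with heavy caveats. This is not the app's code: `dot_interleaved`, the `PREFETCH` macro and the `PREFETCH_AHEAD` distance are all made up for illustration, and "interleaved" is read here as working on two independent element streams per loop iteration, which may differ from what Akos' code actually does. The only real API used is `_mm_prefetch` from `<xmmintrin.h>`.

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */
#define PREFETCH(p) _mm_prefetch((const char *)(p), _MM_HINT_T0)
#else
#define PREFETCH(p) ((void)0)   /* no-op on targets without SSE */
#endif

/* Assumed prefetch distance in elements; a real value would need tuning. */
#define PREFETCH_AHEAD 16

/* Dot product with two interleaved accumulators and explicit prefetch hints. */
float dot_interleaved(const float *a, const float *b, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f;
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        /* Ask the CPU to start loading data needed a few iterations later.
           The bounds check keeps the pointer arithmetic inside the arrays. */
        if (i + PREFETCH_AHEAD < n) {
            PREFETCH(a + i + PREFETCH_AHEAD);
            PREFETCH(b + i + PREFETCH_AHEAD);
        }
        /* Two independent chains of additions, interleaved in one loop. */
        s0 += a[i] * b[i];
        s1 += a[i + 1] * b[i + 1];
    }
    if (i < n) {
        s0 += a[i] * b[i];   /* leftover element when n is odd */
    }
    return s0 + s1;
}
```

Prefetch hints mostly help when the data does not already sit in cache, while independent accumulators mostly help when it does, which would fit the "slow" versus "fast" WU remark, though that link is a guess.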
Switched from 4.35 to 4.49 on my AMD Opteron 1210 cpu running SuSE Linux 10.3 and BOINC 5.10.45. Looks definitely faster but graphics is not working. It used to work in 4.35.
Tullio
Graphics did not work when I switched from 4.35 to 4.49 during a WU run. Now that I started one with 4.49 it works.
Yep. Switching App versions in the middle of a Task is not supported in BOINC. In the case of the "separate graphics" Apps this means that the "graphics_app" link in the slot directory is not updated and points to a file that no longer exists after a new App version is installed. The link is only set up anew when a new Task is started.
BM
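For anyone who wants to check for this situation by hand, here is a rough sketch. It assumes the "graphics_app" entry is an ordinary POSIX symbolic link, which the post above does not confirm; BOINC may use its own link-file format, in which case the target path would have to be read from that file first. The function name `link_is_dangling` is made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/stat.h>

/*
 * Returns  1 if `path` is a symbolic link whose target does not exist,
 *          0 if `path` exists and is either a regular entry or a link
 *            with a reachable target,
 *         -1 if `path` itself cannot be examined (e.g. it does not exist).
 */
int link_is_dangling(const char *path)
{
    struct stat st;

    if (lstat(path, &st) != 0) {
        return -1;              /* nothing at this path at all */
    }
    if (!S_ISLNK(st.st_mode)) {
        return 0;               /* not a symbolic link, nothing to resolve */
    }
    /* stat() follows the link, so it fails when the target is gone. */
    return stat(path, &st) != 0 ? 1 : 0;
}
```

Calling it with the path of the link inside the slot directory would return 1 in the case described, where the old App file was removed while the Task kept running.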
@Bernd. . .
Would you like for some of us to test the SSE2 app as well? I have three SSE2-capable machines that run Linux, and I would be glad to help out.
You read my thoughts. :-))