Beta App including SSE2/3 optimisations

bloed_brot
Joined: 5 Apr 05
Posts: 70
Credit: 91,124,558
RAC: 0
Topic 191719

Dear Bernd,

First of all, great job on the speed-up thus far. I recall that during the S4 run Akos also made use of SSE2/3 instructions for a further speed-up. So I am wondering about the following:

1st: Do you see room for further speed-up in general, and if so, how much?
2nd: Will there be optimisations for the SSE2/3 unit of x86 CPUs, if they are not already integrated?

Cheers

:
your thoughts - the ways :: the knowledge - your space
:

DanNeely
Joined: 4 Sep 05
Posts: 1,359
Credit: 2,926,801,011
RAC: 2,926,963

There is an SSE2 app in development, but Akos was never able to get a speedup from SSE2 over SSE1, so I'm not holding my breath for any gain.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4,054
Credit: 225,467,704
RAC: 27,005

Akos' Apps got their speedup from many different things, not all of them being related to SSE2 or SSE3 instructions, even if the Apps ran faster on CPUs capable of these.

Most of these things have been incorporated in the current SSE code.

I have written code that makes use of SSE2 instructions (double-precision vectors), but it actually runs slightly slower (on CPUs that are also capable of SSE2) than the FPU code in the current App, as these CPUs seem to handle their "virtually two" FPUs faster. Akos thinks it might give a small advantage on the new Core architecture ("Woodcrest", I think, as I haven't seen any speedup on Core Duo CPUs), but that's all.

There are two places in the code that might benefit from SSE3 instructions, but the overall speedup should be only a few percent. I am currently looking into a way to avoid a conditional jump based on that, which in combination might end up giving a really noticeable speedup, but I can't promise that.

The average variation in the reported CPU times with identical Workunits is around 5%, so any speedup of the code would need to break this barrier to be noticeable at all. I don't think there are many possibilities left in the current code for that (and neither does Akos).
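
To make the noise argument concrete, here is a minimal Python sketch. It assumes that run-to-run noise is summarized as the relative spread (longest minus shortest, divided by shortest) of the CPU times for identical Workunits, and that a speedup only counts as noticeable if it exceeds that spread. The function names and the sample times are hypothetical, not taken from the project.

```python
def relative_spread(cpu_times):
    """Relative difference between the longest and the shortest run."""
    shortest = min(cpu_times)
    longest = max(cpu_times)
    return (longest - shortest) / shortest


def is_noticeable(speedup, noise):
    """A speedup (as a fraction, e.g. 0.03 for 3%) only shows up in the
    reported times if it is larger than the run-to-run noise."""
    return speedup > noise


# Hypothetical CPU times (seconds) for identical Workunits on one host.
times = [20000.0, 20400.0, 20900.0, 21000.0]
noise = relative_spread(times)  # (21000 - 20000) / 20000 = 0.05

print(f"noise: {noise:.1%}")
print("3% speedup noticeable:", is_noticeable(0.03, noise))
print("10% speedup noticeable:", is_noticeable(0.10, noise))
```

With these made-up numbers a 3% gain would disappear in the noise, while a 10% gain would stand out.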

Also, the effort necessary for optimization grows continuously - I had to reorder some data structures, including rewriting the operations on them, to make the SSE2 code work, and still got next to nothing out of it.

We (Akos and I) ran out of big, striking ideas some time ago, and the small ideas don't gain much either. For the current code (and the current run) I think we have almost reached the top end. There might be a 10% speedup we can get from assembler coding for specific CPUs, and maybe another 10% from playing with compilers on the code around our "kernel", but that's about it.

BM

Rockhount
Joined: 12 Dec 05
Posts: 12
Credit: 54,520,353
RAC: 3,697

I could offer my Core 2 Duo (Merom) for the beta test, if there is interest. For now only under Win32; I could make Win64 available in a few weeks.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4,054
Credit: 225,467,704
RAC: 27,005

Short update on SSE3 (FISTTP instruction): using it at the first of the two places in the program actually makes things slower than what the compiler generates. At the second place the overall speedup is below what can be measured reliably (maybe 2.5% or so). Definitely not worth another CPU type distinction.
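
For readers unfamiliar with FISTTP: it stores an x87 floating-point value as an integer with truncation, whereas the older FIST/FISTP instructions round according to the current FPU rounding mode (round-to-nearest-even by default). The Python sketch below only illustrates this difference in results. It is not the project's code, and the function names are made up. Python's `math.trunc` and built-in `round` happen to follow the same two rules.

```python
import math


def fisttp_like(x):
    """Truncate toward zero, as FISTTP (and a C float-to-int cast) does."""
    return math.trunc(x)


def fistp_default_like(x):
    """Round to nearest, ties to even, as FISTP does in the default
    x87 rounding mode."""
    return round(x)


# Print the value, its truncated form and its rounded form side by side.
for value in (2.5, 2.7, 3.5, -2.7):
    print(value, fisttp_like(value), fistp_default_like(value))
```

Because a C cast must truncate, a compiler targeting plain x87 typically has to switch the rounding mode around the store; that switching is the overhead FISTTP avoids.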

BM

DanNeely
Joined: 4 Sep 05
Posts: 1,359
Credit: 2,926,801,011
RAC: 2,926,963

That's odd. Akos's s4 sse3 app was significantly (~20% IIRC) faster than the sse1/2 variants.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4,054
Credit: 225,467,704
RAC: 27,005

Message 44602 in response to message 44601

Quote:
That's odd. Akos's s4 sse3 app was significantly (~20% IIRC) faster than the sse1/2 variants.


1. On the same CPU?

2. In at least two places where Akos previously used a FISTTP instruction (requiring SSE3 capability) we are now using different code that doesn't require SSE at all and is equally fast.

I'm not surprised.

BM

ErichZann
Joined: 11 Feb 05
Posts: 120
Credit: 81,582
RAC: 0

Message 44603 in response to message 44598

Quote:

The average variation in the reported CPU times with identical Workunits is around 5%, so any speedup of the code would need to break this barrier to be noticeable at all.

Hm? If you look here, for example, the times don't differ that much (from the 20,231.70s WU on; the older ones were run at a different CPU speed):

http://einsteinathome.org/host/693089/tasks

The difference between the longest and the shortest is only about 1.8%.

Also here, about 2%:
http://einsteinathome.org/host/712215/tasks

DanNeely
Joined: 4 Sep 05
Posts: 1,359
Credit: 2,926,801,011
RAC: 2,926,963

Message 44604 in response to message 44602

Quote:
Quote:
That's odd. Akos's s4 sse3 app was significantly (~20% IIRC) faster than the sse1/2 variants.

1. On the same CPU?

Yes. Actually, now that I think about it, it was closer to 40%. The 3DNow!/SSE1/2 apps were ~5.X times faster than stock on my A64 X2 at 2.6 GHz; the SSE3 app was 7.X times faster.
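
As a rough check of that figure: the digits after the decimal point are not given, so the Python sketch below (with a hypothetical helper name) only brackets the relative gain of a 7.X-times speedup over a 5.X-times speedup, taking X anywhere from 0 to 9.

```python
def relative_gain(old_factor, new_factor):
    """Extra speedup of new_factor over old_factor, as a fraction."""
    return new_factor / old_factor - 1.0


# "5.X" and "7.X" only fix the leading digit, so bracket the result.
lowest = relative_gain(5.9, 7.0)   # smallest possible gain
highest = relative_gain(5.0, 7.9)  # largest possible gain
middle = relative_gain(5.5, 7.5)   # both factors mid-range

print(f"{lowest:.0%} to {highest:.0%}, about {middle:.0%} mid-range")
```

The bracket runs from just under 20% to just under 60%, so both the earlier "~20%" and the revised "closer to 40%" fit the quoted factors; the mid-range case lands nearer 40%.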

Quote:

2. In at least two places where Akos previously used a FISTTP instruction (requiring SSE3 capability) we are now using different code that doesn't require SSE at all and is equally fast.

Might've been those places that made the difference, because his SSE3 app smoked the others.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4,054
Credit: 225,467,704
RAC: 27,005

@DanNeely: I think with the combined competence of Akos and me we can just make more efficient use of SSE in the current Apps.

@MetalWarrior: I have seen machines where the variation was about 10%. I didn't do a statistical analysis of that; 5% on average was just my educated guess. Maybe the machines with larger variations have a problem with CPU time measurement, maybe it depends on how often the App needs to be restarted, and maybe it has gotten better with the BOINC code in the latest Apps or the Core Clients in use by now. My impression still dates from the beginning of S5R1. Anyway, thanks for the report, and sorry for my sloppiness.

BM

ErichZann
Joined: 11 Feb 05
Posts: 120
Credit: 81,582
RAC: 0

OK, on average it could be correct. It should differ more when someone does many things while crunching on one day and only lets the PC crunch on another day...
But I think if, for testing, you let it crunch only Einstein and nothing else, you could already see differences of about 2%...

And you don't have to apologize for anything ;)
