Beta App including SSE2/3 optimisations

bloed_brot

Joined: 5 Apr 05

Posts: 70

Credit: 91124558

RAC: 0

22 Aug 2006 8:58:04 UTC

Topic 191719

Dear Bernd,

First of all, great job on the speed-up so far. I recall that during the S4 run Akos also made use of SSE2/3 instructions for a further speed-up. So I am wondering about the following:

1st: Do you see room for further speed-up in general, and if so, how much?
2nd: Will there be optimisations for the SSE2/3 units of x86 CPUs, if they are not already integrated?

Cheers

:
your thoughts - the ways :: the knowledge - your space
:

DanNeely

Joined: 4 Sep 05

Posts: 1364

Credit: 3592347384

RAC: 647693

Beta App including SSE2/3 optimisations

22 Aug 2006 10:41:52 UTC

Message 44597

There is an SSE2 app in development, but Akos was never able to get a speedup from SSE2 over SSE1, so I'm not holding my breath for any gain.

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4349

Credit: 253222640

RAC: 39781

Akos' Apps got their speedup

22 Aug 2006 10:54:41 UTC

Message 44598

Akos' Apps got their speedup from many different things, not all of them being related to SSE2 or SSE3 instructions, even if the Apps ran faster on CPUs capable of these.

Most of these things have been incorporated in the current SSE code.

I have written code that makes use of SSE2 instructions (double-precision vectors), but it actually runs slightly slower (on CPUs that are also capable of SSE2) than the FPU code we're using in the current App, as these CPUs seem to handle their "virtually two" FPUs faster. Akos thinks it might give a small advantage on the new Core architecture ("Woodcrest", I think, as I haven't seen any speedup on Core Duo CPUs), but that's all.
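For readers who have not seen what "double precision vectors" look like in code, here is a minimal sketch in C. It is not the Einstein@Home kernel; a plain dot product stands in for it, and the function names `dot_scalar` and `dot_sse2` are made up for illustration. The SSE2 version processes two doubles per 128-bit register, which only works when the pairs sit next to each other in memory. On compilers that do not define `__SSE2__` the sketch simply falls back to the scalar loop.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics (__m128d, _mm_mul_pd, ...) */
#endif

/* Scalar reference: one multiply and one add per element. */
static double dot_scalar(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

/* SSE2 variant: two doubles per 128-bit register (hypothetical stand-in
 * for a real kernel). Falls back to the scalar loop without SSE2. */
static double dot_sse2(const double *a, const double *b, size_t n)
{
#if defined(__SSE2__)
    __m128d acc = _mm_setzero_pd();
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);   /* unaligned load of a[i], a[i+1] */
        __m128d vb = _mm_loadu_pd(b + i);
        acc = _mm_add_pd(acc, _mm_mul_pd(va, vb));
    }
    double lanes[2];
    _mm_storeu_pd(lanes, acc);              /* write both partial sums out */
    double sum = lanes[0] + lanes[1];
    for (; i < n; ++i)                      /* leftover element if n is odd */
        sum += a[i] * b[i];
    return sum;
#else
    return dot_scalar(a, b, n);
#endif
}
```

Whether the vector version is actually faster depends on the CPU, so timing both on the target machine is the only way to know.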

There are two places in the code that might benefit from SSE3 instructions, but the overall speedup should be only a few percent. I am currently looking into a possibility to avoid a conditional jump based on that, which in conjunction may end up giving a speedup that would really be noticeable, but I can't promise that.

The average variation in the reported CPU times for identical Workunits is around 5%, so any speedup of the code would need to break this barrier to be noticeable at all. I don't think that there are many possibilities left in the current code for that (and neither does Akos).
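The argument about the noise barrier can be put into a few lines of C. This is only a sketch under my own assumptions: the spread is taken as (longest - shortest) / shortest over a set of runs of identical Workunits, and the helper names `relative_spread` and `speedup_is_measurable` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Relative spread of CPU times for identical Workunits:
 * (longest - shortest) / shortest. Requires n > 0. */
static double relative_spread(const double *t, size_t n)
{
    assert(n > 0);
    double lo = t[0], hi = t[0];
    for (size_t i = 1; i < n; ++i) {
        if (t[i] < lo) lo = t[i];
        if (t[i] > hi) hi = t[i];
    }
    return (hi - lo) / lo;
}

/* A speedup only stands out if the relative gain exceeds the spread
 * that identical Workunits already show on their own. */
static int speedup_is_measurable(double old_time, double new_time,
                                 double spread)
{
    double gain = (old_time - new_time) / old_time;
    return gain > spread;
}
```

With made-up times of 20000 s, 20500 s and 21000 s the spread is 5%, so a run that drops from 20000 s to 19500 s (a 2.5% gain) would not stand out, while a drop to 18000 s (10%) would.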

Also, the effort necessary for optimization grows continuously: I had to reorder some data structures, including rewriting the operations on them, to make the SSE2 code work, and still got next to nothing out of it.

We (Akos and I) ran out of big, striking ideas some time ago, and the small ideas don't gain much either. For the current code (and the current run) I think we have almost reached the top end. There might be a 10% speedup we can get from assembler coding for specific CPUs, and maybe another 10% from playing with compilers on the code around our "kernel", but that's about it.

Rockhount

Joined: 12 Dec 05

Posts: 12

Credit: 57699921

RAC: 6934

I could do mine core 2 duo

22 Aug 2006 12:01:51 UTC

Message 44599

I could offer my Core 2 Duo (Merom) for the beta test if there is interest. For now only under Win32; I could make Win64 available in a few weeks.

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4349

Credit: 253222640

RAC: 39781

Short update on SSE3 (FISTTP

22 Aug 2006 20:55:46 UTC

Message 44600

Short update on SSE3 (FISTTP instruction): using it in the first place in the program actually makes things slower than what the compiler generates. In the second place the overall speedup is below what you can measure reliably (maybe 2.5% or so). Definitely not worth another CPU type distinction.
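Some background for readers: FISTTP converts a floating-point value to an integer with truncation, regardless of the rounding mode set in the FPU control word. The sketch below shows one well-known way to get a truncated conversion in plain C without any SSE instruction. It is an illustration only and not necessarily what the Einstein@Home code does. It assumes IEEE 754 doubles evaluated in double precision, the default round-to-nearest mode, and |x| < 2^31; the function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Round to nearest integer by adding 1.5 * 2^52: after the addition the
 * low 32 bits of the double's bit pattern hold the rounded value in
 * two's complement. Assumes round-to-nearest mode and |x| < 2^31. */
static int32_t round_nearest_magic(double x)
{
    const double magic = 6755399441055744.0;  /* 1.5 * 2^52 */
    double shifted = x + magic;
    uint64_t bits;
    memcpy(&bits, &shifted, sizeof bits);     /* reinterpret the bits */
    return (int32_t)(uint32_t)bits;           /* keep the low 32 bits */
}

/* Truncate toward zero: round to nearest first, then step back by one
 * if rounding moved the value away from zero. No branches needed. */
static int32_t trunc_to_int(double x)
{
    int32_t r = round_nearest_magic(x);
    r -= (x >= 0.0) & ((double)r > x);   /* rounded up past a positive x */
    r += (x <  0.0) & ((double)r < x);   /* rounded down past a negative x */
    return r;
}
```

For example, `trunc_to_int(2.7)` gives 2 and `trunc_to_int(-2.7)` gives -2, the same results as a C cast to int.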

DanNeely

Joined: 4 Sep 05

Posts: 1364

Credit: 3592347384

RAC: 647693

That's odd. Akos's s4 sse3

22 Aug 2006 22:37:48 UTC

Message 44601

That's odd. Akos's s4 sse3 app was significantly (~20% IIRC) faster than the sse1/2 variants.

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4349

Credit: 253222640

RAC: 39781

RE: That's odd. Akos's s4

22 Aug 2006 22:48:43 UTC

Message 44602 in response to message 44601

Quote:

That's odd. Akos's s4 sse3 app was significantly (~20% IIRC) faster than the sse1/2 variants.

1. On the same CPU?

2. At least in two places where Akos previously used a FISTTP instruction (requiring SSE3 capability) we are now using a different code that doesn't require SSE at all and is equally fast.

I'm not surprised.

ErichZann

Joined: 11 Feb 05

Posts: 120

Credit: 81582

RAC: 0

RE: The average variation

22 Aug 2006 23:51:02 UTC

Message 44603 in response to message 44598

Quote:

The average variation in the reported CPU times with identical Workunits is around 5%, so any speedup of the code would need to break this barrier to be noticable at all.

Hm? If you look here, for example, the times don't differ that much (from the 20,231.70 s WU onwards; the older ones were run at a different CPU speed):

http://einsteinathome.org/host/693089/tasks

The difference between the longest and the shortest is only about 1.8%.

The same here, about 2%:
http://einsteinathome.org/host/712215/tasks

DanNeely

Joined: 4 Sep 05

Posts: 1364

Credit: 3592347384

RAC: 647693

RE: RE: That's odd.

23 Aug 2006 0:41:29 UTC

Message 44604 in response to message 44602

Quote:

Quote:
That's odd. Akos's s4 sse3 app was significantly (~20% IIRC) faster than the sse1/2 variants.

1. On the same CPU?

Yes. Actually, now that I think about it, it was closer to 40%. The 3DNow!/SSE1/2 apps were ~5.X times faster than stock on my A64 X2 at 2.6 GHz; the SSE3 app was 7.X times faster.

Quote:

2. At least in two places where Akos previously used a FISTTP instruction (requiring SSE3 capability) we are now using a different code that doesn't require SSE at all and is equally fast.

Might've been those places that made the difference, because his SSE3 app smoked the others.

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4349

Credit: 253222640

RAC: 39781

@DanNeely: I think with the

23 Aug 2006 7:46:42 UTC

Message 44605

@DanNeely: I think with the combined competence of Akos and me we can just make more efficient use of SSE in the current Apps.

@MetalWarrior: I have seen machines where the variation was about 10%. I didn't make a statistical analysis of that; 5% on average was just my educated guess. Maybe the machines with larger variations have a problem with CPU time measurement, maybe it depends on how often the App needs to be restarted, and maybe it has gotten better with the BOINC code in the latest Apps or the Core Clients that are in use by now. My impression still dates from the beginning of S5R1. Anyway, thanks for the report, and sorry for my sloppiness.

ErichZann

Joined: 11 Feb 05

Posts: 120

Credit: 81582

RAC: 0

ok, on average it could be

23 Aug 2006 9:47:39 UTC

Message 44606

OK, on average it could be correct. It should differ more when someone does many other things while crunching on one day and just lets the PC crunch on another day.
But I think if, for testing, you let it crunch only Einstein and nothing else, you could already see differences of about 2%.

And you don't have to apologize for anything ;)
