S5R2

F. Prefect
F. Prefect
Joined: 7 Nov 05
Posts: 135
Credit: 1016868
RAC: 0

RE: RE: Thanks to all

Message 62492 in response to message 62491

Quote:
Quote:


Thanks to all you guys. That will help me decide whether to only run E@H, Rosetta, or a combination of both.

But the way a combination of both is working, that alternative has just about seen its days. For some reason, on my Athlon 64XP 3200 I can only receive 1 (!) E@H job at a time, and only after the previous job is completed. It's set up just like a couple of other machines running both projects, and on those I get plenty of work for both. All I'm using are the default settings (no school, work, etc.), but to get a new E@H job when the previous one is finished I have to suspend Rosetta and wait a few minutes before anything happens. Crazy thing.

thanks,
F. Prefect

Check your Active Fraction value for that machine. I had one system where it went down near zero and wouldn't download anything. I edited it back up to near .9999 and it started downloading again. It's in the file client_state.xml, under the active fraction parameter.

Always be sure to shut down BOINC and back up your files before making changes.

The following is what I currently show in the file you indicated.

-
0
0

I was able to download about 12MB of data from Rosetta, but it appears the download only represents one job(?). Just for the heck of it I then installed version 5.8.16 to see if that might make a difference, but only one job is shown, and it is, of course, currently running. It may load a new job when this one is completed, but that wasn't the case in the past with the older software. When I was running the S5R1 E@H jobs, I might have over a page of jobs showing, ready to be crunched. Now 12MB gets me ONE!

thanks,
Gary

In the beginning the Universe was created. This has made a lot of people very angry and been widely regarded as a bad move.....Douglas Adams

ExtraTerrestrial Apes
ExtraTerrestria...
Joined: 10 Nov 04
Posts: 770
Credit: 578393538
RAC: 192657

RE: RE: Akos, what about

Message 62493 in response to message 62321

Quote:
Quote:
Akos, what about the S5R2 hot loop and SSE2? I know SSE1 and 3DNow brought a huge speed increase to the former applications, so the code seems to be fine for vectorization. These two work on 64 bits, whereas you can do 128 bits with SSE2. There's just not much benefit in doing so with P3/4/M and AXP/64. However, Core 2 D/Q and K10 would benefit greatly from such a code path, as they can process 128 bit vectors in one cycle (OK, issue and retire one such instruction per clock per unit :) instead of two cycles for the former CPUs. I see great potential for optimization here, which is hardly exploited, if at all, by any BOINC app today (or did I miss some fancy new project?)

The 'main' hot loop is designed for single precision (unlike S5R1), and the double precision parts are similar to the S5R1 code, where SSE2 code was tried out but wasn't faster (AXP, A64, P4, PM). So I don't think that the code will be optimised for SSE2. I know the C2D processors would be faster but not significantly. There is no time for this at the moment.

The CPUs with a 64 bit wide SSE engine can execute an arithmetic instruction (ADD, MUL, SUB) in 5 cycles. The CPUs with a 128 bit wide SSE engine (C2D, K10?) execute them in 4 cycles. So the ratio is 5:4, not 2:1. Of course, that is still a 1 cycle difference.

Hi Akos,

sorry, I lost track of this thread for some time, just when there's more I want to say. I can understand that currently there's no time for special performance optimizations of the code (plus they only make sense with final code). But to get back to this C2D / K10 128 bit fp issue:

"I know the C2D processors would be faster but not significantly"

This is where I disagree. The conventional 64 bit fp CPUs can issue and retire one 64 bit instruction each cycle in one fp pipeline (only considering a single pipe here). I assume no dependencies between the instructions, so the pipeline can work ideally and I don't need to care about the latency, that is, the number of cycles the instruction needs to travel through the pipeline before it's finished. I suppose it's this latency / execution time which you're speaking of when you say "5" and "4" cycles?

If you feed these CPUs with 128 bit SSE2 code, using vectors with either 2 doubles or 4 singles, they can only issue 64 bits each cycle. So in the first cycle they send one double or two singles into the pipeline; the rest is sent / issued in the second cycle. So you're not much better off compared with using 2 separate 64 bit instructions - that's why you don't see much of an improvement with SSE2 on these CPUs (although you save some decoding and possibly cache space by using 1 instruction instead of 2).

So what should happen on C2D and possibly K10? They can send the entire 128 bit vector into the pipeline in one cycle (and process everything in parallel). That's where I'm taking the 2:1 ratio from. This only works if the instructions are independent enough so that the pipeline can be fully utilized. I don't know how friendly the E@H code is for that. I assume it's OK, because the vectorization helped to speed up S5R1 a lot (and you can only vectorize if you have independent instructions).

So what do you think? Too bad I'm not an assembler programmer, otherwise a small demonstration program would be very nice. Like, for example, creating an array of 1024 random 32 bit fp values between 0 and 1 and adding them to themselves 1000 times. In version A this would be done using 512 64 bit vectors, whereas version B would use 256 128 bit vectors. There shouldn't be much of a difference on P4 / A64 CPUs, but a C2D might show its strength here. Maybe a larger array or more iterations would be needed to measure the execution time properly.
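
In C with SSE intrinsics a sketch might look like the following. It's only an illustration, not the real thing: version A is done with plain scalar adds and version B with 128 bit addps (the intrinsics don't expose a 64 bit float vector directly), and the compiler, flags and iteration count are just my guesses.

/* A rough sketch of the demo described above -- NOT E@H code.
   Version A adds the array to itself one single at a time (scalar),
   version B does four singles per 128 bit addps via SSE intrinsics.
   Assumed build: gcc -O2 -msse -fno-tree-vectorize, so version A
   really stays scalar.  ITERS is larger than the proposed 1000 so
   that clock() can resolve the run time. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <xmmintrin.h>          /* SSE: _mm_add_ps, _mm_loadu_ps, ... */

#define N     1024
#define ITERS 1000000

int main(void)
{
    static float a[N];
    for (int i = 0; i < N; i++)                 /* random singles in [0,1) */
        a[i] = (float)rand() / (float)RAND_MAX;

    /* Version A: scalar adds, one single per instruction */
    clock_t t0 = clock();
    for (int it = 0; it < ITERS; it++)
        for (int i = 0; i < N; i++)
            a[i] = a[i] + a[i];
    clock_t t1 = clock();

    /* Version B: packed adds, four singles per addps */
    clock_t t2 = clock();
    for (int it = 0; it < ITERS; it++)
        for (int i = 0; i < N; i += 4) {
            __m128 v = _mm_loadu_ps(&a[i]);
            _mm_storeu_ps(&a[i], _mm_add_ps(v, v));
        }
    clock_t t3 = clock();

    printf("scalar: %.3f s   packed: %.3f s   (checksum %g)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t3 - t2) / CLOCKS_PER_SEC,
           (double)a[0]);      /* keep the result alive */
    return 0;
}

On a P4 or A64 I'd expect both loops to take roughly the same time, while a C2D should pull clearly ahead in version B.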

Regards, MrS

Scanning for our furry friends since Jan 2002

Akos Fekete
Akos Fekete
Joined: 13 Nov 05
Posts: 561
Credit: 4527270
RAC: 0

Hi ETA! This routine

Message 62494 in response to message 62493

Hi ETA!

This routine should be two times faster on a Core2 than on a K8:

mov eax,250000000
cycle:
addpd xmm1,xmm0
addpd xmm2,xmm0
addpd xmm3,xmm0
addpd xmm4,xmm0
dec eax
jnz cycle ;1 billion 2x64 bit independent addition

K8 needed exactly 2 billion clock cycles.
(I don't have a Core2 at the moment.)

ExtraTerrestrial Apes
ExtraTerrestria...
Joined: 10 Nov 04
Posts: 770
Credit: 578393538
RAC: 192657

Hi Akos, let me see if I

Hi Akos,

let me see if I got this right:

mov eax,250000000
-> write 250 million into register EAX

cycle:
-> create label "cycle" for assembler

addpd xmm1,xmm0
-> addpd is described nicely here.
Each of the 128 bit registers xmm0 to 4 contains two 64 bit floating point values, one from bit 0 to 63 and the other from bit 64 to 127. Storing several values as different components of a vector is called packing, hence the "p" in addpd. The results are written into bits 0 to 63 and 64 to 127 of xmm1/2/3/4 respectively. The 64 bit FPUs need 2 clock cycles to do this.

dec eax
-> decrement eax by 1

jnz cycle
-> jump back to label cycle if the result of the preceding dec (i.e. eax) is not zero

So, yes, this is a nice demonstration program for what I had in mind! Too bad I don't have access to a Core2 either.
How do you count the number of clock cycles needed for the program? Can anyone else here do it easily?

Regards, MrS

Scanning for our furry friends since Jan 2002

F. Prefect
F. Prefect
Joined: 7 Nov 05
Posts: 135
Credit: 1016868
RAC: 0

RE: Hi Akos, let me see if

Message 62496 in response to message 62495

Quote:

Hi Akos,

let me see if I got this right:

mov eax,250000000
-> write 250 million into register EAX

cycle:
-> create label "cycle" for assembler

addpd xmm1,xmm0
-> addpd is described nicely here.
Each of the 128 bit registers xmm0 to 4 contains two 64 bit floating point values, one from bit 0 to 63 and the other from bit 64 to 127. Storing several values as different components of a vector is called packing, hence the "p" in addpd. The results are written into bits 0 to 63 and 64 to 127 of xmm1/2/3/4 respectively. The 64 bit FPUs need 2 clock cycles to do this.

dec eax
-> decrement eax by 1

jnz cycle
-> jump back to label cycle if the result of the preceding dec (i.e. eax) is not zero

So, yes, this is a nice demonstration program for what I had in mind! Too bad I don't have access to a Core2 either.
How do you count the number of clock cycles needed for the program? Can anyone else here do it easily?

Regards, MrS


Now let me get this straight. No, I think I'll have something to eat first.

F. Prefect

In the beginning the Universe was created. This has made a lot of people very angry and been widely regarded as a bad move.....Douglas Adams

Ananas
Ananas
Joined: 22 Jan 05
Posts: 272
Credit: 2500681
RAC: 0

RE: ...How do you count the

Message 62497 in response to message 62495

Quote:

...How do you count the number of clock cycles needed for the program? Can anyone else here do it easily?

Regards, MrS

I guess Rodney Zaks doesn't help there ;-)

But basically that's where you get that kind of information: the CPU documentation (the instruction latency and throughput tables).
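
Or you simply measure it: read the CPU's time stamp counter (rdtsc) before and after the loop and take the difference. Roughly like this in C - the __rdtsc() intrinsic and the inline asm are GCC-specific assumptions on my part, and the tick count only equals core clock cycles as long as the CPU runs at its nominal speed:

/* Cycle-counting sketch for Akos' addpd test, GCC on x86 with SSE2.
   The registers are zeroed first so no denormals slow things down. */
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc() */

int main(void)
{
    unsigned long long start = __rdtsc();

    /* 250 million iterations with four independent addpd each,
       i.e. 1 billion packed 2x64 bit additions */
    __asm__ volatile (
        "xorpd %%xmm0, %%xmm0   \n\t"
        "xorpd %%xmm1, %%xmm1   \n\t"
        "xorpd %%xmm2, %%xmm2   \n\t"
        "xorpd %%xmm3, %%xmm3   \n\t"
        "xorpd %%xmm4, %%xmm4   \n\t"
        "mov   $250000000, %%eax\n\t"
        "1:                     \n\t"
        "addpd %%xmm0, %%xmm1   \n\t"
        "addpd %%xmm0, %%xmm2   \n\t"
        "addpd %%xmm0, %%xmm3   \n\t"
        "addpd %%xmm0, %%xmm4   \n\t"
        "dec   %%eax            \n\t"
        "jnz   1b               \n\t"
        : : : "eax", "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "cc");

    unsigned long long stop = __rdtsc();
    printf("%llu TSC ticks\n", stop - start);
    return 0;
}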

Akos Fekete
Akos Fekete
Joined: 13 Nov 05
Posts: 561
Credit: 4527270
RAC: 0

So, it's tested. RE: mov

Message 62498 in response to message 62494

So, it's tested.

Quote:

mov eax,250000000
cycle:
addpd xmm1,xmm0
addpd xmm2,xmm0
addpd xmm3,xmm0
addpd xmm4,xmm0
dec eax
jnz cycle ;1 billion 2x64 bit independent addition

K8 needed exactly 2 billion clock cycles.


Core2 needs only 1 billion clock cycles!

ExtraTerrestrial Apes
ExtraTerrestria...
Joined: 10 Nov 04
Posts: 770
Credit: 578393538
RAC: 192657

Yeah baby, yeah :) So.. do

Yeah baby, yeah :)

So.. do you see a possible application of this in E@H? I'm thinking about an additional SSE2 code path for the hot loop. It should not hurt P4/M/A64 and may benefit C2/K10 greatly.

MrS

Scanning for our furry friends since Jan 2002

Akos Fekete
Akos Fekete
Joined: 13 Nov 05
Posts: 561
Credit: 4527270
RAC: 0

RE: Yeah baby, yeah

Message 62500 in response to message 62499

Quote:
Yeah baby, yeah :)


:-)

Quote:
So.. do you see a possible application of this in E@H? I'm thinking about an additional SSE2 code path for the hot loop. It should not hurt P4/M/A64 and may benefit C2/K10 greatly.


The hot loop is more difficult. All of the SSE registers were used.
There was room for only two independent data paths.
(Allocation: 1 constant, 1 variable, and 2-2-2 for the reciprocals, collectors and multipliers of the parallel execution.)

The "double speed" code would need at least 13 SSE registers for better performance.
(And "double speed" means only about a x1.5 speed improvement in this case.)

The AMD64/EM64T extension has 16 SSE registers... ;-)
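
A toy picture of the idea in C intrinsics - it has nothing to do with the real hot loop: with one accumulator every addition must wait for the previous one, with several independent accumulators the latencies can overlap, but every extra data path ties up more registers.

/* Toy illustration of independent data paths, NOT the E@H hot loop. */
#include <stdio.h>
#include <emmintrin.h>   /* SSE2: _mm_add_pd etc. */

/* one accumulator: every addpd waits for the previous one (serial chain) */
static double sum1(const double *x, int n)        /* n: multiple of 2 */
{
    __m128d acc = _mm_setzero_pd();
    for (int i = 0; i < n; i += 2)
        acc = _mm_add_pd(acc, _mm_loadu_pd(&x[i]));
    double t[2];
    _mm_storeu_pd(t, acc);
    return t[0] + t[1];
}

/* four independent accumulators: four additions can be in flight at
   once, but the chains occupy four xmm registers instead of one */
static double sum4(const double *x, int n)        /* n: multiple of 8 */
{
    __m128d a0 = _mm_setzero_pd(), a1 = _mm_setzero_pd();
    __m128d a2 = _mm_setzero_pd(), a3 = _mm_setzero_pd();
    for (int i = 0; i < n; i += 8) {
        a0 = _mm_add_pd(a0, _mm_loadu_pd(&x[i]));
        a1 = _mm_add_pd(a1, _mm_loadu_pd(&x[i + 2]));
        a2 = _mm_add_pd(a2, _mm_loadu_pd(&x[i + 4]));
        a3 = _mm_add_pd(a3, _mm_loadu_pd(&x[i + 6]));
    }
    __m128d s = _mm_add_pd(_mm_add_pd(a0, a1), _mm_add_pd(a2, a3));
    double t[2];
    _mm_storeu_pd(t, s);
    return t[0] + t[1];
}

int main(void)
{
    static double x[1024];
    for (int i = 0; i < 1024; i++)
        x[i] = i * 0.001;
    printf("%f %f\n", sum1(x, 1024), sum4(x, 1024));
    return 0;
}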

Daran
Daran
Joined: 3 Oct 06
Posts: 13
Credit: 3007350
RAC: 0

> Core2 needs only 1 billion

> Core2 needs only 1 billion clock cycles!

Yup, that's the way to go ;-)

Nitpicking: for the addpd you'd only need to unroll threefold, whereas a mulpd would need fivefold. Also, the L1 cache is so fast that you can have the addpd read its source operand from memory and still run at the same speed :-)

> There was room for only two independent data paths.
> (Allocation: 1 constant, 1 variable, and 2-2-2 for the reciprocals, collectors and multipliers of the parallel execution.)
> The AMD64/EM64T extension has 16 SSE registers... ;-)

For the current S5R2 hot loop, you would not need them :-P

Due to register renaming one can reuse registers within a loop. As long as there is sufficient processing between memory accesses, there is no need to keep everything in registers, either. And if that's still not enough to hide all the latencies, one can merge multiple iterations of the entire function.

Playing around with the S5R1 SSE code on an otherwise boring day, I got a speedup factor of 1.5 on my Core2. Would have been more fun had I had the real source, though ;-)
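
Roughly what I mean, as a GCC inline asm toy (not the real loop, and the loop count is arbitrary): a threefold unrolled addpd chain whose source operand is read straight from memory, so it occupies no extra xmm register.

/* Toy: addpd with a memory source operand, GCC on x86 with SSE2.
   buf is tiny, so it stays in L1 and the adds keep full speed. */
#include <stdio.h>

int main(void)
{
    static double buf[2] = { 1.0, 1.0 };
    unsigned int i = 250000000u;

    __asm__ volatile (
        "xorpd %%xmm1, %%xmm1   \n\t"   /* clear the accumulators */
        "xorpd %%xmm2, %%xmm2   \n\t"
        "xorpd %%xmm3, %%xmm3   \n\t"
        "1:                     \n\t"
        "addpd (%[p]), %%xmm1   \n\t"   /* source comes from memory */
        "addpd (%[p]), %%xmm2   \n\t"
        "addpd (%[p]), %%xmm3   \n\t"
        "dec   %[i]             \n\t"
        "jnz   1b               \n\t"
        : [i] "+r" (i)
        : [p] "r" (buf)
        : "xmm1", "xmm2", "xmm3", "cc", "memory");

    printf("done, i = %u\n", i);
    return 0;
}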
