MacOS X Intel S5R3 App 4.10 available for Beta Test

peanut
Joined: 4 May 07
Posts: 162
Credit: 9644812
RAC: 0

My mini completed a couple

My mini completed a couple more (not validated yet), and the new low is even lower, at ~35,000 sec.

My Mac Pro has finished quite a few. The drop doesn't seem quite as large as on my mini, but it's still a nice drop. None of my Pro's tasks have validated yet, but judging from the posts so far, that doesn't seem likely to be a problem.

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 756843486
RAC: 1164674

Hi! Wow, your Pro's

Hi!

Wow, your Pro's fastest result is now well under 5.5 hours.

CU
Bikeman

peanut
Joined: 4 May 07
Posts: 162
Credit: 9644812
RAC: 0

1 task has validated from my

1 task has validated from my Pro.

I noticed that the scheduler is now queueing 2 tasks per CPU on my Pro; it used to keep just 1 queued up. I have DSL, so I keep only a minimum of tasks in my waiting-to-run queue. I'm not sure whether that change is because of the 4.10 app's speed or some other reason.

Also, the progress display changes almost as fast as it did in S5R2. Wow, 4.10 is one speedy app.

anders n
Joined: 29 Aug 05
Posts: 123
Credit: 1656300
RAC: 0

RE: Looking good so

Message 74927 in response to message 74917

Quote:

Looking good so far.

Times seem to have dropped about 2500 sec., and 1 just validated OK.

Anders n

20 WUs done, no errors, and 13 validated so far.

Anders n

Pete Burgess
Joined: 7 Dec 05
Posts: 21
Credit: 318570870
RAC: 0

7 tasks completed so far on

7 tasks completed so far on my 2.16 GHz Core 2 Duo; 6 validated successfully (cross-platform), the 7th is pending. Run times have dropped from the range 39846 - 40625 seconds to 24482 - 27830 seconds for the h1_0419.40 frequency (quite an improvement!).

Later this week I hope to switch my second machine (running Leopard), currently on 4.04 with no apparent problems, over to 4.10.

Daran
Joined: 3 Oct 06
Posts: 13
Credit: 3007350
RAC: 0

RE: ... As we haven't seen

Quote:
... As we haven't seen these nasty floating-point errors ("Input domain errors") on MacOS (*),
I decided to put in a vectorized SSE version of the "hot loop", which should give additional speedup.
This App is as fast an App as I can build today (fasten seat belts!).

On a C2D (Woodcrest) the beta app seems to spend only about 19% of the time in packed SSE code. Surely you can do better? :-D

Quote:
(*) I'm still not sure why this is. It might be that Apple controls not only the OS but also the hardware it runs on, so hardware failures are less likely to occur than on cheaply built PCs. Another possibility is that, with no CPU older than the Intel Core 2 to support, the compiler can rely on SSE/SSE2 and hardly makes use of the x87 FPU instructions / execution environment at all.

gcc on the Mac uses the x87 FPU only for floating-point functions unavailable in SSE2 (since it implicitly compiles with -msse2). That affects rounding, as SSE2 does not use the extended-precision temporaries. But if that's what is causing the domain errors on other platforms, you need to locate them anyway. Likely candidates would be asin() (and similar functions), where a small error can push the argument outside the defined domain.
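For illustration, a minimal, hypothetical sketch (not the actual Einstein@Home code) of how such a domain error can arise and the usual defensive fix:

    /* Hypothetical sketch: a quantity that is mathematically bounded by 1.0
     * picks up a one-ULP rounding error, and asin() then reports a domain
     * error (NaN, errno = EDOM). Clamping before the call is the usual
     * defensive fix. */
    #include <float.h>
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* pretend this came out of a longer computation that should be <= 1.0 */
        double c = 1.0 + DBL_EPSILON;

        printf("asin(%.17g) = %g\n", c, asin(c));        /* NaN: domain error */

        /* clamp into [-1, 1] before the inverse trig call */
        double safe = (c > 1.0) ? 1.0 : (c < -1.0 ? -1.0 : c);
        printf("asin(clamped) = %.17g\n", asin(safe));   /* pi/2 */
        return 0;
    }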

Greetz,
Daran

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 756843486
RAC: 1164674

RE: On a C2D (Woodcrest)

Message 74930 in response to message 74929

Quote:

On a C2D (Woodcrest) the beta app seems to spend only about 19% of the time in packed SSE code. Surely you can do better? :-D

Did you use Shark? Anyway, the share of time spent in different parts of the app will differ from workunit to workunit, which explains why some results vary in runtime even though you get the same credit. E.g. see the discussion here.

I'd guess that the 19% share you measured is below average, but I haven't done that many measurements with Shark so far.

CU

H-B

Daran
Joined: 3 Oct 06
Posts: 13
Credit: 3007350
RAC: 0

RE: Did you use Shark?

Message 74931 in response to message 74930

Quote:
Did you use Shark?

Yup. Simplest profiling tool there is. When started as root, it can sample the BOINC processes without further ado.

Quote:
Anyway, the share of time spent in different parts of the app will differ from workunit to workunit

Obviously. This was on h1_0507.55_S5R2__65_S5R3a_0. That's a rather quick one, I think.

Quote:
I'd guess that the 19 % share you measured is below average,

Maybe. But from how I understand the assembly code, I'd be surprised if any WU spends more than a third of its time in packed SSE. And given the heavy use of this code, nearly any optimization would be worthwhile. Need help? :-D

Greetz,
Daran

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4330
Credit: 251488234
RAC: 36275

RE: On a C2D (Woodcrest)

Message 74932 in response to message 74929

Quote:
On a C2D (Woodcrest) the beta app seems to spend only about 19% of the time in packed SSE code. Surely you can do better?


It's not a matter of me or effort, but of the algorithm.

The part of the program that is similar to what we were using previously (you'll find the word "Fstat" in the function names in Shark) performs identical operations over a number of similar input items that are arranged in a way that allows parallel loading. The <50% that it's using now is already the result of aggressive optimization; in S5R2 it took about 80% of the total CPU time.
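For readers following along in Shark, a toy sketch of the kind of loop shape that makes this part SSE-friendly (hypothetical code, not the actual Fstat kernel):

    /* Hypothetical sketch of an SSE-friendly loop: the same multiply-add is
     * applied to every element, and the data is laid out contiguously, so two
     * doubles can be processed per SSE2 instruction. (Remainder handling for
     * odd n omitted for brevity.) */
    #include <emmintrin.h>   /* SSE2 intrinsics */

    void fstat_like(double *out, const double *a, const double *b, int n)
    {
        for (int i = 0; i + 1 < n; i += 2) {
            __m128d va  = _mm_loadu_pd(&a[i]);
            __m128d vb  = _mm_loadu_pd(&b[i]);
            __m128d acc = _mm_loadu_pd(&out[i]);
            _mm_storeu_pd(&out[i], _mm_add_pd(acc, _mm_mul_pd(va, vb)));
        }
    }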

The other part, which is rather new (look for "Hough"), has some data-dependent branches; the operations performed there aren't completely identical and thus aren't easy to parallelize. In addition, the data structures used there don't support parallel access.
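By contrast, a loop with a data-dependent branch and data-dependent store addresses (again a hypothetical sketch, not the actual Hough code) cannot simply be packed two elements at a time:

    /* Hypothetical sketch of a Hough-like update: which map entry is touched,
     * and whether it is touched at all, depends on the data itself, so the
     * iterations cannot simply be paired up into packed SSE operations. */
    void hough_like(double *map, const int *bin, const double *weight, int n)
    {
        for (int i = 0; i < n; i++) {
            if (bin[i] >= 0)               /* data-dependent branch      */
                map[bin[i]] += weight[i];  /* data-dependent store index */
        }
    }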

At some point I'll take a closer look at this part of the code, and maybe even rewrite it completely using data structures that allow for more efficient computing. But this will take a while, and currently I don't even have the time to start with it.

BM

Daran
Joined: 3 Oct 06
Posts: 13
Credit: 3007350
RAC: 0

RE: The <50% that it's

Message 74933 in response to message 74932

Quote:
The <50% that it's using now is already the result of aggressive optimization; in S5R2 it took about 80% of the total CPU time.

That percentage will depend on your architecture. E.g., Core 2 is about twice as fast as Core 1 for packed SSE instructions, so "<50%" could roughly translate to the 19% I observed. Either way, it looks like you optimized that part quite well, which means it is no longer the lone hot section.

On my Woodcrest the distribution I measured was:
19%: LocalXLALComputeFaFb, packed SSE
32%: LocalXLALComputeFaFb, rest
34%: LALHOUGHAddPHMD2HD_W
15%: rest

Quote:
The other part that is rather new (look for "Hough") has a some data-dependent branches; ...

Hard to optimize, indeed. The biggest slowdown in LALHOUGHAddPHMD2HD_W should be the store-to-load dependency. How big is the array you are writing to?
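To make the point concrete, a hypothetical sketch of such a dependency (not the actual LALHOUGHAddPHMD2HD_W code): when successive updates hit the same array entry, each load has to wait for the previous iteration's store, so the additions form a serial chain instead of pipelining.

    /* Hypothetical sketch of a store-to-load dependency in a scatter-add:
     * if bin[i] == bin[i-1], the load of map[bin[i]] must wait for the
     * previous iteration's store to be forwarded, serialising the adds. */
    void scatter_add(double *map, const int *bin, const double *w, int n)
    {
        for (int i = 0; i < n; i++)
            map[bin[i]] += w[i];   /* load, add, store the same entry again */
    }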

Could be more rewarding to optimize the remainder of LocalXLALComputeFaFb, though.

Quote:
At some point I'll take a closer look at this part of the code, and maybe even rewrite it completely using data structures that allow for more efficient computing.

That sounds promising. Loads more science bang per watt :-P

Quote:
But this will take a while, and currently I don't even have the time to start with it.

Want me to have a closer look at it?
