MacOS X Intel S5R3 App 4.10 available for Beta Test

peanut
Joined: 4 May 07
Posts: 162
Credit: 9644812
RAC: 0

My mini completed a couple

My mini completed a couple more (not validated yet), and the new low is even lower, at ~35,000 sec.

My Mac Pro has finished quite a few. The drop doesn't seem quite as large as on my mini, but it's still a nice drop. None of my Pro's tasks have validated yet, but judging from the posts so far, that doesn't seem likely to be a problem.

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 756843486
RAC: 1164674

Hi! Wow, your Pro's

Hi!

Wow, your Pro's fastest result is now well under 5.5 hours.

CU
Bikeman

peanut
Joined: 4 May 07
Posts: 162
Credit: 9644812
RAC: 0

1 task has validated from my

1 task has validated from my Pro.

I noticed that the scheduler is now queueing 2 tasks per CPU on my Pro; it used to keep just 1 queued up. I have DSL, so I keep only a minimum of tasks in my waiting-to-run queue. I'm not sure whether that change is because of the 4.10 app's speed or some other reason.

Also, the progress display changes almost as fast as it did in S5R2. Wow, 4.10 is one speedy app.

anders n
Joined: 29 Aug 05
Posts: 123
Credit: 1656300
RAC: 0

RE: Looking good so

Message 74927 in response to message 74917

Quote:

Looking good so far.

Times seem to have dropped about 2500 sec., and 1 just validated OK.

Anders n

20 WUs done, no errors, and 13 validated so far.

Anders n

Pete Burgess
Joined: 7 Dec 05
Posts: 21
Credit: 318570870
RAC: 0

7 tasks completed so far on

7 tasks completed so far on my 2.16 GHz Core 2 Duo; 6 validated successfully (cross-platform), the 7th is pending. Run times have dropped from the range 39846 - 40625 seconds to 24482 - 27830 seconds for the h1_0419.40 frequency (quite an improvement!).

Later this week I hope to switch my second machine (running Leopard), currently on 4.04 with no apparent problems, over to 4.10.

Daran
Joined: 3 Oct 06
Posts: 13
Credit: 3007350
RAC: 0

RE: ... As we haven't seen

Quote:
... As we haven't seen these nasty floating-point errors ("Input domain errors") on MacOS (*),
I decided to put in a vectorized SSE version of the "hot loop", which should give additional speedup.
This App is as fast an App as I can build today (fasten seat belts!).

On a C2D (Woodcrest) the beta app seems to spend only about 19% of the time in packed SSE code. Surely you can do better? :-D

Quote:
(*) I'm still not sure why this is. It might be that Apple controls not only the OS but also the hardware it runs on, so hardware failures are less likely to occur than on cheaply built PCs. Another possibility is that, with no CPU older than the Intel Core 2 to support, the compiler can rely on SSE/SSE2 and hardly makes use of the x87 FPU instructions / execution environment at all.

gcc on the Mac uses the x87 FPU only for floating-point functions unavailable in SSE2 (since it implicitly compiles with -msse2). That affects rounding, as SSE2 does not use the extended-precision temporaries. But if that's what is causing the domain errors on other platforms, you need to locate them anyway. Likely candidates would be asin() (and similar functions), where a small error can push the argument outside the defined domain.
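For illustration, a minimal, hypothetical sketch (not the actual Einstein@Home code) of how such a domain error can arise and the usual defensive fix:

    /* Hypothetical sketch: a quantity that is mathematically bounded by 1.0
     * picks up a one-ULP rounding error, and asin() then reports a domain
     * error (NaN, errno = EDOM). Clamping before the call is the usual
     * defensive fix. */
    #include <float.h>
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* pretend this came out of a longer computation that should be <= 1.0 */
        double c = 1.0 + DBL_EPSILON;

        printf("asin(%.17g) = %g\n", c, asin(c));        /* NaN: domain error */

        /* clamp into [-1, 1] before the inverse trig call */
        double safe = (c > 1.0) ? 1.0 : (c < -1.0 ? -1.0 : c);
        printf("asin(clamped) = %.17g\n", asin(safe));   /* pi/2 */
        return 0;
    }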

Greetz,
Daran

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 756843486
RAC: 1164674

RE: On a C2D (Woodcrest)

Message 74930 in response to message 74929

Quote:

On a C2D (Woodcrest) the beta app seems to spend only about 19% of the time in packed SSE code. Surely you can do better? :-D

Did you use Shark? Anyway, the share of time spent in different parts of the app will differ from workunit to workunit, which explains why some results vary in runtime even though you get the same credit. E.g. see the discussion here.

I'd guess that the 19% share you measured is below average, but I haven't done that many measurements with Shark so far.

CU

H-B

Daran
Joined: 3 Oct 06
Posts: 13
Credit: 3007350
RAC: 0

RE: Did you use Shark?

Message 74931 in response to message 74930

Quote:
Did you use Shark?

Yup. Simplest profiling tool there is. When started as root, it can sample the BOINC processes without further ado.

Quote:
Anyway, the share of time spent in different parts of the app will differ from workunit to workunit

Obviously. This was on h1_0507.55_S5R2__65_S5R3a_0. That's a rather quick one, I think.

Quote:
I'd guess that the 19 % share you measured is below average,

Maybe. But from how I understand the assembly code, I'd be surprised if any WU spends more than a third of its time in packed SSE. And given the heavy use of this code, nearly any optimization would be worthwhile. Need help? :-D

Greetz,
Daran

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4330
Credit: 251488234
RAC: 36275

RE: On a C2D (Woodcrest)

Message 74932 in response to message 74929

Quote:
On a C2D (Woodcrest) the beta app seems to spend only about 19% of the time in packed SSE code. Surely you can do better?


It's not a matter of me or effort, but of the algorithm.

The part of the program that is similar to what we were using previously (you'll find the word "Fstat" in the function names in Shark) performs identical operations over a number of similar input items that are arranged in a way that allows parallel loading. The <50% that it's using now is already the result of aggressive optimization; in S5R2 it took about 80% of the total CPU time.
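For readers following along in Shark, a toy sketch of the kind of loop shape that makes this part SSE-friendly (hypothetical code, not the actual Fstat kernel):

    /* Hypothetical sketch of an SSE-friendly loop: the same multiply-add is
     * applied to every element, and the data is laid out contiguously, so two
     * doubles can be processed per SSE2 instruction. (Remainder handling for
     * odd n omitted for brevity.) */
    #include <emmintrin.h>   /* SSE2 intrinsics */

    void fstat_like(double *out, const double *a, const double *b, int n)
    {
        for (int i = 0; i + 1 < n; i += 2) {
            __m128d va  = _mm_loadu_pd(&a[i]);
            __m128d vb  = _mm_loadu_pd(&b[i]);
            __m128d acc = _mm_loadu_pd(&out[i]);
            _mm_storeu_pd(&out[i], _mm_add_pd(acc, _mm_mul_pd(va, vb)));
        }
    }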

The other part, which is rather new (look for "Hough"), has some data-dependent branches; the operations performed there aren't completely identical and thus aren't easy to parallelize. In addition, the data structures used there don't support parallel access.
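By contrast, a loop with a data-dependent branch and data-dependent store addresses (again a hypothetical sketch, not the actual Hough code) cannot simply be packed two elements at a time:

    /* Hypothetical sketch of a Hough-like update: which map entry is touched,
     * and whether it is touched at all, depends on the data itself, so the
     * iterations cannot simply be paired up into packed SSE operations. */
    void hough_like(double *map, const int *bin, const double *weight, int n)
    {
        for (int i = 0; i < n; i++) {
            if (bin[i] >= 0)               /* data-dependent branch      */
                map[bin[i]] += weight[i];  /* data-dependent store index */
        }
    }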

At some point I'll take a closer look at this part of the code, and maybe even rewrite it completely using data structures that allow for more efficient computing. But this will take a while, and currently I don't even have the time to start with it.

BM

Daran
Joined: 3 Oct 06
Posts: 13
Credit: 3007350
RAC: 0

RE: The <50% that it's

Message 74933 in response to message 74932

Quote:
The <50% that it's using now is already the result of aggressive optimization; in S5R2 it took about 80% of the total CPU time.

That percentage will depend on your architecture. E.g., Core 2 is about twice as fast as Core 1 for packed SSE instructions, so "<50%" could roughly translate to the 19% I observed. Either way, it looks like you optimized that part quite well, which means it is no longer the lone hot section.

On my Woodcrest the distribution I measured was:
19%: LocalXLALComputeFaFb, packed SSE
32%: LocalXLALComputeFaFb, rest
34%: LALHOUGHAddPHMD2HD_W
15%: rest

Quote:
The other part that is rather new (look for "Hough") has a some data-dependent branches; ...

Hard to optimize, indeed. The biggest slowdown in LALHOUGHAddPHMD2HD_W should be the store-to-load dependency. How big is the array you are writing to?
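To make the point concrete, a hypothetical sketch of such a dependency (not the actual LALHOUGHAddPHMD2HD_W code): when successive updates hit the same array entry, each load has to wait for the previous iteration's store, so the additions form a serial chain instead of pipelining.

    /* Hypothetical sketch of a store-to-load dependency in a scatter-add:
     * if bin[i] == bin[i-1], the load of map[bin[i]] must wait for the
     * previous iteration's store to be forwarded, serialising the adds. */
    void scatter_add(double *map, const int *bin, const double *w, int n)
    {
        for (int i = 0; i < n; i++)
            map[bin[i]] += w[i];   /* load, add, store the same entry again */
    }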

Could be more rewarding to optimize the remainder of LocalXLALComputeFaFb, though.

Quote:
At some point I'll take a closer look at this part of the code, and maybe even rewrite it completely using data structures that allow for more efficient computing.

That sounds promising. Loads more science bang per watt :-P

Quote:
But this will take a while, and currently I don't even have the time to start with it.

Want me to have a closer look at it?
