My mini has completed a couple more (not validated yet), and the new low is even lower, at ~35,000 sec.
My Mac Pro has finished quite a few. The drop doesn't seem quite as large as on my mini, but it's still a nice one. None of my Pro's tasks have validated yet, but judging from the posts so far that doesn't look likely to be a problem.
Hi!
Wow, your Pro's fastest result is now well under 5.5 hours.
CU
Bikeman
One task has validated from my Pro.
I noticed that the scheduler is now queueing two tasks per CPU on my Pro; it used to keep just one queued up. I'm on DSL, so I keep only a minimum of tasks in my waiting-to-run queue. I'm not sure whether that is because of the 4.10 app's speed or for some other reason.
Also, the progress display changes almost as fast as it did in S5R2. Wow, 4.10 is one speedy app.
20 WUs done, no errors, and 13 validated so far.
Anders n
Seven tasks completed so far on my 2.16 GHz Core 2 Duo: six validated successfully (cross-platform), the seventh pending. Run times have dropped from the range 39846-40625 seconds to 24482-27830 seconds for the h1_0419.40 frequency (quite an improvement!).
Later this week I hope to switch my second machine (running Leopard), currently on 4.04 with no apparent problems, over to 4.10.
Quote:
... As we haven't seen these nasty floating-point errors ("Input domain errors") on MacOS (*), I decided to put in a vectorized SSE version of the "hot loop", which should give additional speedup. This app is as fast an app as I can build today (fasten seat belts!).

On a C2D (Woodcrest) the beta app seems to spend only about 19% of the time in packed SSE code. Surely you can do better? :-D

Quote:
(*) I'm still not sure why this is. It might be that Apple not only controls the OS but also the hardware it runs on, and thus hardware failures are less likely to occur than on cheaply built PCs. Another possibility is that, with no CPU older than the Intel Core 2, the compiler can rely on SSE/SSE2 and hardly makes use of the x87 FPU instructions / execution environment at all.

gcc on the Mac uses the x87 FPU only for floating-point functions unavailable in SSE2 (since it implicitly compiles with -msse2). That affects rounding, as it does not use the extended-precision temporaries. But if that is what causes the domain errors on other platforms, you need to locate them anyway. Likely candidates would be asin (and similar functions), where a small error can push the argument outside the defined domain.
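To make the asin remark concrete, here is a minimal, hypothetical C sketch (not taken from the Einstein@Home sources; the helper name is made up) of the kind of guard that keeps a rounding-induced excursion outside [-1, 1] from turning into a domain error:

```c
#include <math.h>
#include <stdio.h>

/* Hypothetical helper: clamp the argument into asin()'s legal domain
 * before calling it.  A value like 1.0000000000000002, produced by
 * accumulated rounding in double (SSE2) arithmetic, would otherwise
 * make asin() return NaN and report a domain error. */
static double safe_asin(double x)
{
    if (x > 1.0)
        x = 1.0;
    else if (x < -1.0)
        x = -1.0;
    return asin(x);
}

int main(void)
{
    /* Slightly outside the legal domain due to rounding. */
    double x = 1.0 + 2.2e-16;

    printf("asin(%.17g)      = %g\n", x, asin(x));      /* NaN */
    printf("safe_asin(%.17g) = %g\n", x, safe_asin(x)); /* pi/2 */
    return 0;
}
```

Clamping like this only papers over the symptom; as noted above, if the argument drifts outside the domain, the upstream computation still needs to be located and fixed.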
Greetz,
Daran
Did you use Shark? Anyway, the share of time spent in different parts of the app will differ from workunit to workunit, which explains why some results vary in runtime even though you get the same credit. See, e.g., the discussion here.
I'd guess that the 19% share you measured is below average, but I haven't done that many measurements with Shark so far.
CU
H-B
Yup. Simplest profiling tool there is. When started as a root process it can sample the BOINC processes without further ado.

Quote:
Anyway, the share of time spent in different parts of the app will differ from workunit to workunit

Obviously. This was on h1_0507.55_S5R2__65_S5R3a_0. That's a rather quick one, I think.

Quote:
I'd guess that the 19% share you measured is below average,

Maybe. But from how I understand the assembly code, I'd be surprised if any WU spends more than a third of its time in packed SSE. And given the massive usage of this code, nearly any optimization would be worthwhile. Need help? :-D
Greetz,
Daran
Quote:
On a C2D (Woodcrest) the beta app seems to spend only about 19% of the time in packed SSE code. Surely you can do better?

It's not a matter of me or effort, but of the algorithm.
The part of the program that is similar to what we were using previously (you'll find the word "Fstat" in the function names in Shark) performs identical operations over a number of similar input items that are arranged in a way that allows parallel loading. The <50% of the run time it now uses is already the result of aggressive optimization; in S5R2 this part took about 80% of the total CPU time.
The other part, which is rather new (look for "Hough"), has some data-dependent branches; the operations performed there aren't completely identical and thus aren't easy to parallelize. In addition, the data structures used there don't support parallel access.
At some point I'll take a closer look at this part of the code, and maybe even rewrite it completely using data structures that allow for more efficient computing. But this will take a while, and currently I don't even have the time to start with it.
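To illustrate the structural difference described above, here is a contrived C sketch (hypothetical data and function names, not the actual Fstat or Hough code): the first loop performs identical arithmetic on contiguous elements and maps directly onto packed SSE, while the second branches on the data and scatters its updates, which defeats straightforward vectorization.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Fstat-like pattern (hypothetical): identical arithmetic on every
 * element, contiguous data -> four floats per packed SSE instruction. */
void scale_and_add(float *out, const float *a, const float *b, size_t n)
{
    const __m128 k = _mm_set1_ps(0.5f);
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(_mm_mul_ps(va, k), vb));
    }
    for (; i < n; ++i)              /* scalar tail */
        out[i] = 0.5f * a[i] + b[i];
}

/* Hough-like pattern (hypothetical): whether an element is updated, and
 * which element it is, depends on the data itself -> no fixed lanes to
 * pack, so compiler and programmer are stuck with scalar code and
 * data-dependent branches. */
void accumulate_selected(float *map, const int *idx, const float *w, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        if (w[i] > 0.0f)            /* data-dependent branch      */
            map[idx[i]] += w[i];    /* data-dependent destination */
    }
}
```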
BM
Quote:
The <50% of the run time it now uses is already the result of aggressive optimization; in S5R2 this part took about 80% of the total CPU time.

That percentage will depend on your architecture. E.g., Core 2 is about twice as fast as Core 1 for packed SSE instructions, so "<50%" could roughly translate to the 19% I observed. Either way, it looks like you optimized that part quite well, which means it is no longer the lone hot section.
On my Woodcrest the distribution I measured was:
19%: LocalXLALComputeFaFb, packed SSE
32%: LocalXLALComputeFaFb, rest
34%: LALHOUGHAddPHMD2HD_W
15%: rest
Quote:
The other part, which is rather new (look for "Hough"), has some data-dependent branches; ...

Hard to optimize, indeed. The biggest slowdown in LALHOUGHAddPHMD2HD_W should be the store-to-load dependency. How big is the array you are writing to?
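For readers unfamiliar with the term, here is a contrived C fragment (hypothetical names, not the actual LALHOUGHAddPHMD2HD_W code) showing the kind of store-to-load dependency meant here: each iteration may load an element that the previous iteration has just stored, so the loop serializes on the store-to-load forwarding latency instead of streaming through memory.

```c
#include <stddef.h>

/* Hypothetical accumulation into a partial Hough map.  When consecutive
 * entries of `col` hit the same element of `hd`, the load of hd[col[i]]
 * must wait for the store from the previous iteration: a loop-carried
 * dependency through memory rather than through registers. */
void add_phmd_to_hd(double *hd, const int *col, double weight, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        double tmp = hd[col[i]];   /* load: may depend on the last store */
        tmp += weight;
        hd[col[i]] = tmp;          /* store: feeds the next load         */
    }
}
```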
Could be more rewarding to optimize the remainder of LocalXLALComputeFaFb, though.
Quote:
At some point I'll take a closer look at this part of the code, and maybe even rewrite it completely using data structures that allow for more efficient computing.

That sounds promising. Loads more science bang per watt :-P
Quote:
But this will take a while, and currently I don't even have the time to start with it.

Want me to have a closer look at it?