Gravitational Wave search O1 all-sky tuning (O1AS20-100T)

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 3,946
Credit: 201,651,333
RAC: 47,695


Quote:
Quote:
I'm not sure I understand. There are three 64-bit Linux apps (a typo, maybe)?
Quote:

[pre]Linux running on an AMD x86_64 or Intel EM64T CPU   1.04         19 Feb 2016, 16:30:35 UTC
Linux running on an AMD x86_64 or Intel EM64T CPU   1.04 (AVX)   19 Feb 2016, 16:30:35 UTC
Linux running on an AMD x86_64 or Intel EM64T CPU   1.04 (X64)   19 Feb 2016, 16:53:47 UTC[/pre]

I am referring to the first and third apps.


We can't browse the download directory anymore, but for other applications in the past, such versions featured identical binaries with possibly different configurations.

The first app version listed is identical to the last one (X64). The current scheduler code and project configuration prefer app versions with plan classes over those without, so clients that reportedly could run both 32Bit and 64Bit app versions (i.e. reported both platforms) got the 32Bit SSE2 app version. Therefore I later set up the same application version with an X64 plan class.

The AVX app version was built on a newer machine and thus requires a newer libc version than the X64 app. The hope is that newer CPUs (with AVX) will run a newer Linux version, but there is no way on the server side to ensure this. This should be rectified in the future when the "tuning" run is finished.

The only part of the application that makes use of AVX for now is the FFT. The FFTW library claims that even when compiled with SIMD (AVX or SSE) support, it should automatically fall back to some generic code if the machine it is running on doesn't support that specific SIMD set. I now published the Windows 64Bit AVX app version with X64 plan class, which means that hosts that don't support AVX should now get a 64Bit version if they can run it, and it should use AVX if the CPU supports it, even if the BOINC Client doesn't detect and report it.

BM


Maximilian Mieth
Joined: 4 Oct 12
Posts: 128
Credit: 7,293,421
RAC: 8,088


Quote:
I completed 4 version 1.02 tasks (239717102, 239717368, 239717298, 239717062). They all are marked as "validation inconclusive" and the third task for the workunit remains unsent. Any chance it is being sent at all? Will I get any credits for them?


Nothing has changed. The tasks are still not sent out. Did you stop sending tasks for that version of the app? If so, will I ever get credits for them?

edit: Just found the answer myself, I think. So I will just wait and see...

Jim1348
Joined: 19 Jan 06
Posts: 380
Credit: 201,949,179
RAC: 5,942


Quote:
The only part of the application that makes use of AVX for now is the FFT.


It works for me. My first one was the SSE2 version, which ran for 10:17 on an i7-4771 (four cores running) on Win7 64-bit. The last 24 work units have been the AVX version, now averaging 8:45. There have been no problems. Keep up the good work.

cliff
Joined: 15 Feb 12
Posts: 176
Credit: 283,452,444
RAC: 0


Hi,
Sorry, but I had to abort 8 of these WUs; I got far too many to complete at 8h 58min per WU. I have a cache of WUs for another project that need to be completed before expiry, and I usually run 2x of those. Reducing to 1x to run the AVX WUs puts their completion before expiry in doubt.

I did notice, however, that when one of these WUs reaches 99% it seems to hang, then all of a sudden shows as complete. I guess it's to do with BOINC's handling of the app.

I think I'll wait until a GPU app is available before trying anymore though.

Regards,

Cliff,

Been there, Done that, Still no damm T Shirt.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,236
Credit: 44,768,920,046
RAC: 37,490,014


Quote:

ID: 12087505: extremely long processing time

Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz [Family 6 Model 26 Stepping 5]
(8 Prozessoren)


Now that Bernd has added the (X64) plan class, the above host has downloaded a bunch of (X64) tasks which will soon take over from the previous (SSE2) ones. Hopefully there will be a nice improvement in crunch times.

Thanks Bernd for the explanation and for adding the (X64) plan class to the list.

Cheers,
Gary.

archae86
Joined: 6 Dec 05
Posts: 2,843
Credit: 3,382,087,918
RAC: 2,896,955


Quote:

the (X64) plan class

which will soon take over from the previous (SSE2) ones. Hopefully there will be a nice improvement in crunch times.


Thanks for pointing out this comparison opportunity, Gary. I had actually discontinued CPU work on my GPU hosts, but realized now that the SSE2/x86 vs. X64 comparison on my Westmere might be interesting, so I enabled CPU jobs just long enough to get six of the fresh X64 ones, which should be enough to compare with twelve valid SSE2/x86 jobs among v1.02/1.03/1.04 which have a pretty tight scatter of completion times.

As my Westmere is very lightly loaded on the CPU side, and the Daniels_Parents machine is far more heavily loaded, this will give us samples from both ends of that spectrum on what I think are very similar CPUs.

Christian Beer
Moderator
Joined: 9 Feb 05
Posts: 595
Credit: 96,904,763
RAC: 0


Another short update on the runtime issue.

With some more results and a more widespread host base I did some more data digging. This is still a bit early, since we only have the results of the "fast" hosts in, but you can see a tendency here.

The designed runtime per WU was 8h on a modern CPU. Looking at all the hosts that completed at least one 1.04 result, I get the following distribution of average cpu_time.
The first column is the range of average cpu_time, the second column is the number of hosts in that range, and the third column is the fraction of all hosts (2715).

0-6h      14   0.52%
6-8h     266   9.80%
8-10h   1052  38.75%
10-12h   435  16.02%
12-14h   303  11.16%
14-16h   208   7.66%
16-18h   171   6.30%
18-20h   115   4.24%
20-22h    59   2.17%
22-24h    29   1.07%
24-26h    16   0.59%
26-28h    10   0.37%
28-30h    10   0.37%
30-32h     2   0.07%
32-34h    10   0.37%
>34h      15   0.55%


This means that 49.06% of all hosts finish a result in at most 10 hours on average, and 83.9% finish within 16h.
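The two cumulative percentages can be reproduced from the table. A minimal sketch in Python, with the bin counts copied from the post; the names `BINS`, `TOTAL_HOSTS` and `fraction_within` are illustrative and not part of any project tooling:

```python
# Host counts per average cpu_time bin, copied from the table above.
# Each entry is (upper edge of the bin in hours, number of hosts);
# None marks the open-ended ">34h" bin.
BINS = [
    (6, 14), (8, 266), (10, 1052), (12, 435), (14, 303), (16, 208),
    (18, 171), (20, 115), (22, 59), (24, 29), (26, 16), (28, 10),
    (30, 10), (32, 2), (34, 10), (None, 15),
]

TOTAL_HOSTS = sum(count for _, count in BINS)


def fraction_within(hours):
    """Percentage of hosts whose bin ends at or below `hours`."""
    done = sum(c for edge, c in BINS if edge is not None and edge <= hours)
    return 100.0 * done / TOTAL_HOSTS


print(TOTAL_HOSTS)
print(f"{fraction_within(10):.2f}%")
print(f"{fraction_within(16):.2f}%")
```

The bin counts add up to the 2715 hosts quoted in the post, so the table is internally consistent.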

Another way to look at the data is the standard deviation of the successful results per host. There one can see that for 76% of all hosts (that have at least two successful results) the cpu_times do not differ by more than 12 minutes.

This means that the runtime on a given host is stable, which it should be (because we designed the app so that it is not data dependent), but the runtime between hosts varies a lot (which was a little unexpected). I currently attribute the latter to L2/L3 cache sizes and Hyperthreading. I believe that, looking closer at the CPUs involved, one will find that the ones with higher average cpu_time have less L2/L3 cache (built in or due to HT) and/or do not support AVX and/or have lower memory bandwidth/throughput.

I didn't distinguish between different app_versions (AVX, SSE) and platforms but may do so when more results are turned in and I have more data available. Fixing the AVX detection bug on the server helped a lot with runtimes too.

There is little we can do about the variety in runtime for the upcoming search. I will look into increasing the flops estimate for all WUs so the DCF will not go through the roof and hosts do not gather more tasks than they can manage. More on this when more data is in ;)
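To illustrate why a higher flops estimate keeps the DCF (duration correction factor) down, here is a deliberately simplified model. It assumes the client's runtime estimate is the workunit's flops estimate divided by the host's speed, scaled by the DCF, and that the DCF drifts towards whatever value makes the estimate match observed runtimes. This is not BOINC's actual code; the function names and all numbers are hypothetical.

```python
def estimated_runtime_s(fpops_est, host_flops, dcf=1.0):
    """Simplified runtime estimate: flops estimate / host speed, scaled by DCF."""
    return dcf * fpops_est / host_flops


def dcf_matching(actual_runtime_s, fpops_est, host_flops):
    """DCF that would make the estimate equal the observed runtime."""
    return actual_runtime_s / (fpops_est / host_flops)


# Hypothetical host and workunit: the estimate says 8 h, the task takes 16 h.
host_flops = 3e9
fpops_est = 8 * 3600 * host_flops
actual_runtime_s = 16 * 3600

dcf_before = dcf_matching(actual_runtime_s, fpops_est, host_flops)

# Doubling the flops estimate on the server brings the matching DCF back to 1.
dcf_after = dcf_matching(actual_runtime_s, 2 * fpops_est, host_flops)

print(dcf_before, dcf_after)
```

In this model, an underestimated workunit forces the DCF well above 1 on slow hosts. Until it has adapted, the client underestimates runtimes and can fetch more tasks than it can finish before the deadline.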

archae86
Joined: 6 Dec 05
Posts: 2,843
Credit: 3,382,087,918
RAC: 2,896,955


Quote:
Quote:

the (X64) plan class

which will soon take over from the previous (SSE2) ones. Hopefully there will be a nice improvement in crunch times.


Thanks for pointing out this comparison opportunity, Gary.

As my Westmere is very lightly loaded on the CPU side


The first X64 V1.04 task on my Westmere took 44,544.48 ET and 42,204.43 CPU time (both in seconds). While this is a single sample, it is an unambiguous improvement over the mix of 1.02, 1.03, and 1.04 SSE2 executable tasks returned from the same host while running in the same configuration, for which ET ran a little under 52,000 and CPU time somewhat over 50,000. I'll do statistics when more of the returns are in. But these are several sigma out of the previous distribution, so "better" is quite clear.
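As a rough check of the size of that improvement, the sketch below compares the single X64 sample with the approximate SSE2 figures from the post. The SSE2 values are the rounded "a little under 52,000" and "somewhat over 50,000", so the percentages are only indicative; `reduction_percent` is an illustrative helper name.

```python
def reduction_percent(old_seconds, new_seconds):
    """Relative runtime reduction of new versus old, in percent."""
    return 100.0 * (old_seconds - new_seconds) / old_seconds


# Approximate SSE2 baseline (rounded figures from the post) vs. the X64 sample.
SSE2_ELAPSED_S, SSE2_CPU_S = 52_000, 50_000
X64_ELAPSED_S, X64_CPU_S = 44_544.48, 42_204.43

elapsed_gain = reduction_percent(SSE2_ELAPSED_S, X64_ELAPSED_S)
cpu_gain = reduction_percent(SSE2_CPU_S, X64_CPU_S)

print(f"elapsed time reduced by about {elapsed_gain:.0f}%")
print(f"CPU time reduced by about {cpu_gain:.0f}%")
```

With these rounded baselines, both reductions come out at roughly 15%.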

I'm a bit surprised, and suspect there is more to the difference between these two applications than mere X64-ness. 64-bit addressing unambiguously gives more efficient options for handling truly large working sets, and at the hardware level allows installation of larger memory, but I'd not expect this sort of considerable execution time benefit. Then, I'm more a hardware guy than a software guy, and know nothing about the Einstein code.

Nice to see the improvement, anyway, and our gratitude to Bernd for toiling away at the plan classes to make more improvement available to more hosts.

I'm not sure the DP machine is actually returning new results at the moment. Maybe the Hewson machine of similar architecture will give us some insight on the heavily loaded end of the scale.

Other Nehalems should not expect times as short as mine unless they have far higher clock rates (possible, as mine is a low clock spec, running stock) or are lightly loaded. Mine is very, very lightly loaded, both to give good support to the primary task of running Parkes PMPS on two GPUs, and to avoid higher power consumption.

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3,516
Credit: 458,151,501
RAC: 71,323


Quote:

I'm a bit surprised, and suspect there is more to the difference between these two applications than mere X64-ness. 64-bit addressing unambiguously gives more efficient options for handling truly large working sets, and at the hardware level allows installation of larger memory, but I'd not expect this sort of considerable execution time benefit.

One thing not to forget is that code compiled to also run on 32 bit machines / OSes has access to fewer CPU registers. The previous GW searches used mostly hand coded SSE/SSE2 code in critical places that didn't make much (or any) use of the additional registers in 64 bit mode, but the current GW app uses e.g. FFTW, and I suspect this module could get a boost from the extra registers in 64 bit mode.

I agree that cache size is a very significant factor, probably also because of the FFT part. IIRC, the FFTs done in the current GW search are of length ca 2^17, single precision, complex to complex, in-place. So order of magnitude wise, the data per Fourier transform is about 2^17 x 2 x 4 bytes, or 2^20 bytes or 1 MB. That is just for the FFT.
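The order-of-magnitude estimate above works out exactly. A small sketch of the arithmetic, taking the FFT parameters as recalled in the post (length about 2^17, single precision, complex to complex, in-place); the constant names are illustrative:

```python
FFT_LENGTH = 2 ** 17        # number of complex samples (approximate, per the post)
FLOATS_PER_SAMPLE = 2       # real and imaginary part
BYTES_PER_FLOAT = 4         # single precision

# In-place transform: one buffer holds both input and output.
footprint_bytes = FFT_LENGTH * FLOATS_PER_SAMPLE * BYTES_PER_FLOAT

print(footprint_bytes)
print(footprint_bytes / 2 ** 20, "MiB")
```

This counts only the FFT buffer itself. FFTW's twiddle factors and the rest of the application's working set come on top of it.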

Other Einstein@Home searches like the Fermi searches use much longer Fourier transforms (ca a factor of 16 and more, IIRC), so this might explain why this search is more sensitive to cache size than other searches here.

I have three very old but rather similar (ca 2008) Core2 era hosts that a) all lack AVX support, of course, b) have BOINC benchmark ratings that are not very different, but c) differ a lot in task runtime:

Core 2 Quad Q8200 @ 2.33 GHz , 2 x 2MB L2 cache, so only 1 MB cache per core, running 3 CPU tasks ==> ca 107k sec per task (!!)

Xeon E5405 @ 2.00GHz (quadcore), 2x 6MB L2 cache , so 3 MB per core , dual CPU, running 7 CPU tasks. ==> ca 70k sec per task

Core2 Duo E8400 @ 3.00GHz , 6 MB L2 cache, so 3 MB per core, and running just one CPU task (other core for GPU task) ==> 49k sec (and that with 32 bit app so far). Not bad for a machine that old.

So yes, cache size (per running task) seems to matter a lot.
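The three hosts can be lined up against the roughly 1 MB FFT buffer estimated earlier in this post. A minimal sketch using only the figures quoted above; the tuple layout and the helper `cache_per_core_mb` are illustrative:

```python
FFT_FOOTPRINT_MB = 1.0  # estimate from earlier in the post

# (host, L2 cache per CPU package in MB, cores per package, approx. seconds per task)
HOSTS = [
    ("Core 2 Quad Q8200", 2 * 2, 4, 107_000),
    ("Xeon E5405", 2 * 6, 4, 70_000),
    ("Core2 Duo E8400", 6, 2, 49_000),
]


def cache_per_core_mb(l2_total_mb, cores):
    """L2 cache available per core if all cores are busy."""
    return l2_total_mb / cores


for name, l2_mb, cores, seconds in HOSTS:
    per_core = cache_per_core_mb(l2_mb, cores)
    print(f"{name:18s} {per_core:.0f} MB/core "
          f"({per_core / FFT_FOOTPRINT_MB:.0f}x FFT buffer), ~{seconds:,} s/task")
```

On the Q8200, the per-core cache is about the same size as the FFT buffer alone, which fits the much longer runtime. Clock speed and the number of concurrently running tasks also differ between the hosts, so this is suggestive rather than a controlled comparison.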

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,236
Credit: 44,768,920,046
RAC: 37,490,014


Quote:
Another short update on the runtime issue.
....


Thanks very much for making the effort to publish this information.
It really is appreciated by many, I am sure.

I wonder how representative the group of 'beta test' hosts is of the overall population. I suspect the more 'committed' and/or 'nerdy' volunteers who participated may tend to have a 'better' or more 'tweaked' class of hardware :-). It will be very interesting to see how the distribution changes after a few days of running 'full blast'. I hope you will have the time to do that.

Quote:
I will look into increasing the flops estimate for all WUs so the DCF will not go through the roof and hosts do not gather more tasks than they can manage. More on this when more data is in ;)


This would be very much appreciated. Thank you for your consideration.

Cheers,
Gary.
