The O2-All Sky Gravitational Wave Search on GPUs - discussion thread.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117671142658
RAC: 35159972

Zalster wrote:
I believe Gary is researching different architectures of CPUs early vs modern as a cause of issues?

Yes, I've been doing that for a range of different Intel CPUs.  Apart from a small number of Phenom II processors, the bulk of my machines are Intel based and most of those are older generations.  Virtually all my machines have an AMD GPU and these fall into two groups.  About half are the older GCN 1st gen (Southern Islands - SI) such as HD 7850, R7 370, HD 7950.  The others are newer GCN 4th gen (Polaris) such as RX 460, RX 560, RX 570 and RX 580.

The very first tests I did used RX 570 GPUs, one in a Q6600 (2008) quad core host and another in a G4560 Kaby Lake machine (2C/4T).  Essentially irrespective of task concurrency, the G4560 gave a very high proportion of valid results (just a single invalid) whilst the Q6600 gave zero valid results.  These weren't compute errors, just failures to validate.  At a later stage I followed up with around 40 tasks on the Q6600, all crunched at x1.  Zero valid results.  This behaviour was what made me wonder if CPU/motherboard architecture was somehow involved, since the GPUs were identical.

A second set of tests was to see if SI cards could give valid results.  I ran a host under the old fglrx/proprietary OpenCL driver and got 100% invalid.  I also tried the same host under the new amdgpu/OpenCL from AMDGPU-PRO and this also gave 100% invalid.  The CPU was a Q8400 quad core (2009) and since I'd already seen 100% invalid using a modern GPU on a Q6600, this just seemed to confirm that there wasn't much use persisting with older CPU architectures.  Bear in mind that all the above hosts have no trouble handling FGRPB1G tasks.

To further test the effect of CPU architecture, I ran GW tasks at x1 on hosts with the following combinations.  I had started testing with V1.07 and then the unfortunate V1.08 came along so the comments below only refer to the observed behaviour under V1.07.  There were no examples of 'real' computation errors, but see (*).

  • Q6600 (Core 2 Quad - 2008 - 2.4GHz) / RX 460  --> All that completed without MD5 issue were valid(*).
  • E6300 (Wolfdale dual core - 2009 - 2.8GHz) / RX 460  --> The majority were invalid (just 2 valid).
  • G640 (Sandy Bridge dual core - 2012 - 2.8GHz) / RX 570   --> All were invalid.
  • G3260 (Haswell dual core - 2015 - 3.3GHz) / RX 460   --> All were valid.

(*) With this test, the aim was to see if older CPU architecture would work better with 'slower' GPUs.  I had a batch of around 15 tasks initially and 3 completed without incident.  The 4th also completed but then registered at the very end as a comp error due to one of the large data files failing an MD5 checksum check.  Of course this resulted in all unstarted tasks (which depended on the same data file) immediately being declared as comp errors with no attempt to start any of them.  The machine went into a multi-hour backoff with the failed tasks just sitting there.
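
For anyone wanting to check by hand whether a large data file really is corrupt, here is a minimal Python sketch.  It assumes you already know the expected MD5 hash from somewhere; the function names and the file path in the usage comment are made up for illustration, and this is not how the BOINC client itself performs the check.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read chunk by chunk so a large data file is never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_md5):
    """Compare a file's MD5 with an expected hex string (case and whitespace ignored)."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Hypothetical usage (the path and hash are placeholders, not real values):
# matches_expected("/path/to/large_data_file", "<expected md5 hex string>")
```

A mismatch only says the file on disk differs from what was expected; it does not say whether the cause was a bad download, a disk problem or memory corruption.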

Einstein uses a 'resend lost tasks' feature.  I reasoned that I should be able to make the failed tasks become 'lost' and get fresh copies, along with a fresh download of the 'supposedly suddenly corrupt' large data file.  I have done this sort of thing before and I know that the technique works.  So I stopped BOINC and edited the state file to make all the failed tasks 'disappear'.  It worked a treat except for one small detail.  The supposedly corrupt file was replaced, the lost tasks were all replaced but there was an extra download going on - a copy of the CPU app.  Yes, that's right, the lost GPU tasks were being replaced with the very same tasks but as CPU versions rather than fresh copies of GPU versions, even though my preferences were set not to allow CPU tasks.  It seems like the rule to have 1 CPU and 1 GPU task in the quorum makeup doesn't take into account the fact that there was already a CPU task if the GPU task goes missing :-).

So, since the aim was to help test validation of GPU tasks, I aborted this batch of CPU tasks and got a brand new batch of 10 which indeed were all GPU tasks this time.  Crunching resumed normally and I moved on to other things.  When I came back later to see how things were going, the very first task had crunched to the end but a different large data file had failed the MD5 check this time and the whole batch had been declared as comp errors.  At that point I decided to give up with that machine (3 valid, 0 invalid, 22 MD5sum errors) and put it back on FGRPB1G where it continues to have no problems.

When Bernd announced that V1.08 was giving invalid results and was to be withdrawn, it wasn't clear if something new would take its place or if the previous V1.07 would be resurrected.  I decided to wait and see so all the above hosts were simply put back on FGRPB1G until there was a further announcement.  My reasoning was that V1.07 probably already has sufficient examples of the validation issues to be studied and that it would be more useful to help test the next iteration whenever that might be.  I didn't think it would take this long.

Zalster wrote:
So in short, the GW on GPU require more than 1 CPU thread.

Only for nvidia GPUs.  On all my tests with AMD GPUs, the CPU time component has been well short of the elapsed time - quite often around half the elapsed time.  I believe this may be a consequence of the "spin wait" behaviour of nvidia's OpenCL implementation.  A significant fraction of the computation of GW GPU tasks is done on the CPU so I imagine the combination of the two things is why more than 100% of a CPU core is showing up in the crunch times for nvidia GPUs.

As far as setting cpu_usage to 1.5 in app_config.xml is concerned, that might be necessary if using nvidia GPUs and also using all spare threads for CPU crunching.  For people with nvidia GPUs who don't want to go down the app_config.xml path, it would probably be easier to lower the % of cores that BOINC is allowed to use, so that there is a spare thread or two when crunching GW tasks on nvidia.  That is simpler for the average user to set up and to change when necessary.
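
To make the two options concrete, here is a minimal Python sketch.  The first function writes out the text of an app_config.xml that budgets 1.5 CPU threads per GPU task; the second works out the % of cores to set so that a chosen number of threads stays free.  The app name is a placeholder (the real short name has to be looked up, for example in client_state.xml), and the function names and the 1.0 gpu_usage default are assumptions of mine.

```python
# Layout of a BOINC app_config.xml with a <gpu_versions> section.
APP_CONFIG_TEMPLATE = """<app_config>
  <app>
    <name>{app_name}</name>
    <gpu_versions>
      <gpu_usage>{gpu_usage}</gpu_usage>
      <cpu_usage>{cpu_usage}</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
"""


def build_app_config(app_name, gpu_usage=1.0, cpu_usage=1.5):
    """Return app_config.xml text budgeting cpu_usage threads per GPU task."""
    return APP_CONFIG_TEMPLATE.format(
        app_name=app_name, gpu_usage=gpu_usage, cpu_usage=cpu_usage
    )


def percent_of_cpus(total_threads, spare_threads):
    """Return the % of cores BOINC may use so that spare_threads stay free."""
    if not 0 <= spare_threads < total_threads:
        raise ValueError("spare_threads must be >= 0 and less than total_threads")
    return 100.0 * (total_threads - spare_threads) / total_threads


if __name__ == "__main__":
    # "YOUR_GW_GPU_APP_NAME" is a placeholder, not the real application name.
    print(build_app_config("YOUR_GW_GPU_APP_NAME"))
    # A 4-thread host keeping 1 thread free.
    print(percent_of_cpus(4, 1))
```

The generated file would go in the project's directory under the BOINC data directory, and the client needs to re-read its config files (or be restarted) before the change takes effect.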

Cheers,
Gary.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7225174931
RAC: 1042149

Gary Roberts wrote:
Zalster wrote:
So in short, the GW on GPU require more than 1 CPU thread.
Only for nvidia GPUs.  On all my tests with AMD GPUs, the CPU time component has been well short of the elapsed time - quite often around half the elapsed time.

Actually, I've seen reported CPU time somewhat higher than 100% of reported elapsed time on GW 1.07 work done on AMD RX 570/Windows 10/Intel CPU hosts.   Also, I've observed the reported CPU time per task to go quite a bit higher when the multiplicity is raised.

I have the impression, not carefully checked, that the current Linux implementation of the V1.07 GW GPU application uses substantially less CPU time than does the current Windows implementation.

In my case, I've found no need to make special accommodation, but I am running zero BOINC CPU tasks, and few enough BOINC GPU tasks that there is more than one physical CPU core available per GPU task.  People who work the CPU side of their systems harder may find this more important to mitigate.

cecht
Joined: 7 Mar 18
Posts: 1535
Credit: 2908262095
RAC: 2133676

Richie wrote:
Both of my hosts have stopped getting new tasks. Anyone else?

Same here. No new GW GPU work for the past 7 hr on either of my hosts. 

Ideas are not fixed, nor should they be; we live in model-dependent reality.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117671142658
RAC: 35159972

archae86 wrote:
I have the impression, not carefully checked, that the current Linux implementation of the V1.07 GW GPU application uses substantially less CPU time than does the current Windows implementation.

Yes, now that you mention it, I'm basing my comment pretty much entirely on my own Linux experience where even at 3x, I don't recall ever seeing a reported CPU time exceeding the elapsed time.  I haven't really looked at any AMD results on Windows machines so I might easily have overlooked a difference in behaviour for Windows.

It seems that on each occasion I've looked at somebody's nvidia result, I've seen a higher CPU time than the elapsed time.   I wasn't looking at the OS on those occasions so don't recall if there was any difference between Windows and Linux.  They were probably all Windows observations anyway :-).

Cheers,
Gary.

Keith Myers
Joined: 11 Feb 11
Posts: 4964
Credit: 18745923927
RAC: 7032458

Quote:
Only for nvidia GPUs.  On all my tests with AMD GPUs, the CPU time component has been well short of the elapsed time - quite often around half the elapsed time.  I believe this may be a consequence of the "spin wait" behaviour of nvidia's OpenCL implementation.  A significant fraction of the computation of GW GPU tasks is done on the CPU so I imagine the combination of the two things is why more than 100% of a CPU core is showing up in the crunch times for nvidia GPUs.

Except that I have swan_sync=1 set as an environment variable (for GPUGrid tasks), which puts Nvidia cards into "no blocking sync" mode to avoid the default "spin wait".

I still use a whole CPU thread to support each CGW task, since my cpu_time matches my run_time.

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

I looked at run times on my Windows + Nvidia GTX 960 hosts. I noticed that cpu time has been less than run time... but only by a small amount. Run times were roughly 6600 and 6200 seconds while running 1x... and cpu time has consistently been about 20-30 seconds less than that per task.

Well, these 960s are not high end gpus anymore. If these hosts run a newer series 'big block' Nvidia gpu then I wouldn't be surprised if it required more cpu support.
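
To put a number on that, here is a minimal Python sketch of the cpu time to run time ratio.  The function name is made up, and the 25 second gap is just the midpoint of the 20-30 second range mentioned above, not a measured value.

```python
def cpu_fraction(cpu_time_s, run_time_s):
    """Return CPU time as a fraction of elapsed run time for one task."""
    if run_time_s <= 0:
        raise ValueError("run_time_s must be positive")
    return cpu_time_s / run_time_s


if __name__ == "__main__":
    run_time = 6600.0            # seconds, roughly what one host showed at 1x
    cpu_time = run_time - 25.0   # illustrative: midpoint of "20-30 seconds less"
    print(f"{cpu_fraction(cpu_time, run_time):.1%}")  # prints 99.6%
```

A value close to 100% means the task is keeping one CPU thread busy for essentially its whole run, and anything above 100% means it needs more than one thread.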

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

My machine, like Keith's, is running Linux and also has Swan_Sync=1 set. Unfortunately we had a severe thunderstorm last night and lost power. Once power came back, it seems to have shorted out my fiber-optic cable box, so I'm out of luck for Internet until late Thursday.

Edit..

132 pending, 97 valid, no invalids yet.

Jim1348
Joined: 19 Jan 06
Posts: 463
Credit: 257957147
RAC: 0

I don't think swan_sync does

Keith Myers
Joined: 11 Feb 11
Posts: 4964
Credit: 18745923927
RAC: 7032458

I had long forgotten those posts.  OK, so Swan_sync is not applicable to OpenCL tasks.  That means the CPU usage depends entirely on how the app developer wrote the app for CPU support.

For my tasks, it looks to be 1:1 cpu_time:run_time, except for one task where I was playing with ncpus counts.

Matt White
Joined: 9 Jul 19
Posts: 120
Credit: 280798376
RAC: 0

cecht wrote:
Richie wrote:
Both of my hosts have stopped getting new tasks. Anyone else?
Same here. No new GW GPU work for the past 7 hr on either of my hosts. 

I stopped getting new GPU GW work sometime yesterday. I suspect there may be a backlog of gamma ray tasks piling up. Even my cache of CPU tasks is switching to gamma ray work.

My 2 machines are still crunching the remaining work in the local cache, but the Linux box has only 4 GW GPU tasks left.

Clear skies,
Matt
