consistent failures

Dp
Joined: 27 Aug 05
Posts: 14
Credit: 67,399,125
RAC: 4
Topic 223812

I have no clue as to the problem.


Application: Gravitational Wave search O2 Multi-Directional GPU 2.07 (GW-opencl-ati)
Name: h1_0423.90_O2C02Cl4In0__O2MDFS2_Spotlight_424.45Hz_2280
State: Running
Received: 10/26/2020 11:54:57 AM
Report deadline: 11/2/2020 10:54:54 AM
Resources: 0.9 CPUs + 1 AMD/ATI GPU (device 1)
Estimated computation size: 144,000 GFLOPs
CPU time: 00:03:11
CPU time since checkpoint: 00:00:05
Elapsed time: 03:11:44
Estimated time remaining: 1d 00:35:32
Fraction done: 11.500%
Virtual memory size: 319.93 MB
Working set size: 317.27 MB
Directory: slots/21
Process ID: 1708
Progress rate: 3.960% per hour
Executable: einstein_O2MDF_2.07_windows_x86_64__GW-opencl-ati.exe

My system is

CPU type: GenuineIntel Intel(R) Core(TM) i7 CPU X 980 @ 3.33GHz [Family 6 Model 44 Stepping 2]

Number of processors: 12

Coprocessors: [2] AMD AMD Radeon HD 7900 Series (3072MB)

Operating system: Microsoft Windows 7 Home Premium x64 Edition, Service Pack 1, (06.01.7601.00)

BOINC client version: 7.14.3

Memory: 12278.12 MiB

Cache: 256 KiB

Swap space: 24554.38 MiB

Total disk space: 447.03 GiB

Free disk space: 344.39 GiB

Measured floating point speed: 3339.33 million ops/sec

Measured integer speed: 7910.37 million ops/sec

Average upload rate: 49.23 KiB/sec

Average download rate: 696 KiB/sec

Average turnaround time: 2.6 days

Tasks: 71

 

archae86
Joined: 6 Dec 05
Posts: 2,847
Credit: 3,399,093,213
RAC: 3,053,887

This message in stderr (always a good place to look) probably gives a bit of a clue:

"EXIT_TIME_LIMIT_EXCEEDED"

Your errored out units all show "run time" (really means elapsed wall clock time) within a second or so of 14,375 seconds.  However the CPU time consumed during that time varies wildly, and on the low end is remarkably low--as a healthy GW GPU task needs quite a lot of CPU support.  On the high end these tasks report using nearly as much CPU time as elapsed time.  These high CPU time reports are also quite abnormal.

Your single pending (thus got past the particular error that killed the others) unit shows a run time of 10,609 seconds, so did not apparently reach the time limit, and a CPU time of just under 1000 seconds, which is not crazy wrong in either direction, though much lower as a fraction of elapsed than I'd expect for a healthy system.
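The same kind of check can be run on the task properties posted at the top of the thread. The following is only a rough sketch: it assumes progress is linear, and the function names are made up for illustration. It takes the posted CPU time, elapsed time and fraction done, and compares the projected total run time against the roughly 14,375 seconds at which the failed tasks stopped.

```python
def parse_hms(text):
    """Convert an 'HH:MM:SS' string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def task_health(cpu_time, elapsed_time, fraction_done, time_limit_s):
    """Summarise CPU share and projected run time (assumes linear progress)."""
    cpu_s = parse_hms(cpu_time)
    elapsed_s = parse_hms(elapsed_time)
    projected_total_s = elapsed_s / fraction_done
    return {
        "cpu_share": cpu_s / elapsed_s,
        "projected_total_s": projected_total_s,
        "will_exceed_limit": projected_total_s > time_limit_s,
    }


# Values copied from the task properties in the first post;
# 14375 is the elapsed time at which the failed tasks were stopped.
report = task_health("00:03:11", "03:11:44", 0.115, 14375)
print(f"CPU share of elapsed time: {report['cpu_share']:.1%}")
print(f"Projected total run time:  {report['projected_total_s'] / 3600:.1f} h")
print(f"Would exceed the limit:    {report['will_exceed_limit']}")
```

With those numbers the CPU share is under 2% of elapsed time and the projected total is far beyond the limit, which fits the picture of a task that is making very little progress.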

An important question is why these tasks are running so slowly on your system.  I speculate that the GW GPU task is in a resource fight with other things running on the system, and is not getting much of a share.

What applications are competing for resources as you run these?  Do you run BOINC tasks from other projects?  Do you have other resource consumers running?

 

Richie
Joined: 7 Mar 14
Posts: 597
Credit: 1,687,214,169
RAC: 109,105

"AMD Radeon HD 7900 Series (3072MB)" cards are first-generation GCN chips. In practice they have not been compatible with the GW GPU app so far. It would be strange if even that one task managed to complete fully.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,245
Credit: 44,946,028,001
RAC: 36,130,199

Dp wrote:

I have no clue as to the problem.
...

My system is

CPU type: GenuineIntel Intel(R) Core(TM) i7 CPU X 980 @ 3.33GHz [Family 6 Model 44 Stepping 2]

Number of processors: 12

Coprocessors: [2] AMD AMD Radeon HD 7900 Series (3072MB)

....

Unfortunately, BOINC is not able to give a fully accurate description of your GPUs. As shown above, you have two, but they are not both the old 7900 series (i.e. Tahiti).  One certainly is, but the other seems to be from a much newer series, i.e. Baffin, which is a Polaris series GPU.

That explains the real mystery for me - how could one task succeed whilst all others failed??  The answer is that the successful task was completed on the newer Polaris GPU whilst at least some of the failures (I didn't check them all) were attempted on the Tahiti GPU.  You can see this for yourself if you click on the Task ID link for any task and browse through the stderr output that archae86 mentioned.  You will see a line where it identifies the GPU being used.  The output for the successful task contains (truncated to fit):-

OpenCL Device ... 'Baffin (Platform: AMD Accelerated Parallel Processing, global mem: 4096 MiB)'

whilst for a failed task, the same line shows:-

OpenCL Device ... 'Tahiti (Platform: AMD Accelerated Parallel Processing, global mem: 3072 MiB)'
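If you have saved the stderr text of several tasks, a short script can pick out that line for you. This is just a sketch: the pattern is based only on the two truncated lines quoted above, so the real stderr may need a looser match, and the function and set names are made up for illustration.

```python
import re

# Pattern modelled on the two (truncated) lines quoted in this thread.
DEVICE_RE = re.compile(
    r"OpenCL Device.*'(?P<name>[^' ]+) \(Platform: (?P<platform>[^,]+), "
    r"global mem: (?P<mem_mib>\d+) MiB\)'"
)

# Only the device named in this thread as unable to run GW GPU tasks.
GW_INCOMPATIBLE = {"Tahiti"}


def find_device(stderr_text):
    """Return (device name, global memory in MiB), or None if no match."""
    match = DEVICE_RE.search(stderr_text)
    if match is None:
        return None
    return match.group("name"), int(match.group("mem_mib"))


ok_line = ("OpenCL Device ... 'Baffin (Platform: AMD Accelerated "
           "Parallel Processing, global mem: 4096 MiB)'")
bad_line = ("OpenCL Device ... 'Tahiti (Platform: AMD Accelerated "
            "Parallel Processing, global mem: 3072 MiB)'")

for line in (ok_line, bad_line):
    name, mem_mib = find_device(line)
    verdict = "not suitable for GW" if name in GW_INCOMPATIBLE else "fine for GW"
    print(f"{name} ({mem_mib} MiB): {verdict}")
```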

Tahiti series GPUs belong to the 1st generation of the GCN (Graphics Core Next) architecture, first released around 2011 if I remember correctly.  They are not able to crunch the Einstein gravitational wave (GW) GPU tasks but they are fine for the gamma-ray pulsar (GRP) tasks.  On these GPUs, GW tasks seem to make essentially no progress and end up with the "EXIT_TIME_LIMIT_EXCEEDED" failure message which you can see in the stderr output.  The newer Polaris series are 4th gen GCN and work just fine.

Whilst your newer GPU could continue to crunch GW tasks, the easiest way to use both GPUs without having failures would be to deselect GW GPU tasks in your preferences and select the GRP GPU tasks only.  You will need to abort any remaining GW GPU tasks that you have.

I'm not sure there is any easy way to have the Tahiti GPU do GRP and the Baffin GPU do GW GPU tasks.  You might be able to set up two separate BOINC clients with different preferences and different work directories but I've never tried anything like that.  There are ways for excluding particular GPU devices for each setup, so in theory it might be possible to do something like that.
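One exclusion mechanism that might apply here is the `exclude_gpu` option in the BOINC client's cc_config.xml, which can keep one device away from one application of one project. The sketch below is untested on a mixed Tahiti/Baffin host and only generates the XML fragment. Every value passed in is a placeholder that you would need to check against your own client: the project URL, the device number of the Tahiti card, and the application short name (here guessed from the executable file name).

```python
import xml.etree.ElementTree as ET


def build_exclude_gpu_config(project_url, device_num, app_name, gpu_type="ATI"):
    """Build a cc_config.xml document containing one <exclude_gpu> entry."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    exclude = ET.SubElement(options, "exclude_gpu")
    ET.SubElement(exclude, "url").text = project_url
    ET.SubElement(exclude, "device_num").text = str(device_num)
    ET.SubElement(exclude, "type").text = gpu_type
    ET.SubElement(exclude, "app").text = app_name
    if hasattr(ET, "indent"):  # pretty-printing needs Python 3.9 or later
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# All three values are assumptions for illustration only:
# confirm the URL, the Tahiti device number and the app name locally.
xml_text = build_exclude_gpu_config(
    project_url="https://einsteinathome.org/",
    device_num=1,
    app_name="einstein_O2MDF",
)
print(xml_text)
```

If that works as hoped, the Tahiti card would be left with GRP work only while the Baffin card could take either type, but treat it as something to experiment with rather than a known-good recipe.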

Cheers,
Gary.

Dp
Joined: 27 Aug 05
Posts: 14
Credit: 67,399,125
RAC: 4

Thanks for the input and recommendations.

I will try to breathe some remaining life into my Voyager 1 series rig.

rewp
