Gamma-ray pulsar binary search #1 on GPU errors

Dark Angel
Joined: 3 Jan 12
Posts: 5
Credit: 54902377
RAC: 93604
Topic 228077

I just started back with the project after a lay-off and am getting errors on the Gamma-ray pulsar binary search #1 on GPU sub-project.

Nine units all have the exact same output:

 

LATeah3012L12220828_892.0_0_0.0_23821281_1

Workunit ID: 669392854

Created: 4 Sep 2022 20:37:53 UTC

Sent: 4 Sep 2022 22:08:05 UTC

Report deadline: 18 Sep 2022 22:08:05 UTC

Received: 4 Sep 2022 22:10:40 UTC

Server state: Over

Outcome: Computation error

Client state: Compute error

Exit status: 11 (0x0000000B) Unknown error code

Computer: 12968750

Run time (sec): 2.11

CPU time (sec): 0.03

Peak working set size (MB): 40.16

Peak swap size (MB): 4829.48

Peak disk usage (MB): 0.02

Validation state: Invalid

Granted credit: 0

Application: Gamma-ray pulsar binary search #1 on GPUs v1.28 (FGRPopencl2Pup-nvidia)
x86_64-pc-linux-gnu


Stderr output

<core_client_version>7.17.0</core_client_version>
<![CDATA[
<message>
process exited with code 11 (0xb, -245)</message>
<stderr_txt>
08:08:36 (3964884): [normal]: This Einstein@home App was built at: Aug 17 2021 16:19:40

08:08:36 (3964884): [normal]: Start of BOINC application '../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.28_x86_64-pc-linux-gnu__FGRPopencl2Pup-nvidia'.
08:08:36 (3964884): [debug]: 1e+16 fp, 7.7e+09 fp/s, 1366574 s, 379h36m13s51
08:08:36 (3964884): [normal]: % CPU usage: 1.000000, GPU usage: 1.000000
command line: ../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.28_x86_64-pc-linux-gnu__FGRPopencl2Pup-nvidia --inputfile ../../projects/einstein.phys.uwm.edu/LATeah3012L12220828.dat --alpha 2.59819959601 --delta -0.694603692878 --skyRadius 1.890770e-06 --ldiBins 15 --f0start 884.0 --f0Band 8.0 --firstSkyPoint 0 --numSkyPoints 1 --f1dot -1e-13 --f1dotBand 1e-13 --df1dot 1.69860773e-15 --ephemdir ../../projects/einstein.phys.uwm.edu/JPLEPH --Tcoh 2097152.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.1 --reftime 56100 --model 0 --f0orbit 0.005 --mismatch 0.1 --demodbinary 1 --BinaryPointFile ../../projects/einstein.phys.uwm.edu/templates_LATeah3012L12220828_0892_23821281.dat --debug 0 -o LATeah3012L12220828_892.0_0_0.0_23821281_1_0.out
output files: 'LATeah3012L12220828_892.0_0_0.0_23821281_1_0.out' '../../projects/einstein.phys.uwm.edu/LATeah3012L12220828_892.0_0_0.0_23821281_1_0' 'LATeah3012L12220828_892.0_0_0.0_23821281_1_0.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah3012L12220828_892.0_0_0.0_23821281_1_1'
08:08:36 (3964884): [debug]: Flags: X64 SSE SSE2 GNUC X86 GNUX86
08:08:36 (3964884): [debug]: glibc version/release: 2.31/stable
08:08:36 (3964884): [debug]: Set up communication with graphics process.

-- signal handler called: signal 1
2 stack frames obtained for this thread:
Frame 2:
Binary file: ../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.28_x86_64-pc-linux-gnu__FGRPopencl2Pup-nvidia (0x48e401)
Source file: hs_boinc_extras.c (Function: sighandler / Line: 290)
Frame 1:
Binary file: /lib/x86_64-linux-gnu/libpthread.so.0 (0x7f9128239420)
Offset info: +0x14420

End of stcaktrace
08:08:36 (3964884): called boinc_finish(11)

</stderr_txt>
]]>

 

The same machine is running MeerKAT units without issue.

I checked my wingman on these and discovered a Windows machine with an AMD GPU that has returned over 700 units on this project, all with consistent errors (output from other computer, not mine):

LATeah3012L12220828_892.0_0_0.0_23821281_2

Workunit ID: 669392854

Created: 4 Sep 2022 22:10:41 UTC

Sent: 4 Sep 2022 22:11:08 UTC

Report deadline: 18 Sep 2022 22:11:08 UTC

Received: 4 Sep 2022 23:17:23 UTC

Server state: Over

Outcome: Computation error

Client state: Compute error

Exit status: 53 (0x00000035) Unknown error code

Computer: 12564497

Run time (sec): 22.33

CPU time (sec): 8.89

Peak working set size (MB): 55.13

Peak swap size (MB): 120.34

Peak disk usage (MB): 0.01

Validation state: Invalid

Granted credit: 0

Application: Gamma-ray pulsar binary search #1 on GPUs v1.22 (FGRPopencl1K-ati)
windows_x86_64


Stderr output

<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message>
The network path was not found.
 (0x35) - exit code 53 (0x35)</message>
<stderr_txt>
23:15:43 (13744): [normal]: This Einstein@home App was built at: May  8 2019 13:29:27

23:15:43 (13744): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.22_windows_x86_64__FGRPopencl1K-ati.exe'.
23:15:43 (13744): [debug]: 1e+016 fp, 3.3e+009 fp/s, 3224371 s, 895h39m30s80
23:15:43 (13744): [normal]: % CPU usage: 1.000000, GPU usage: 1.000000
command line: projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.22_windows_x86_64__FGRPopencl1K-ati.exe --inputfile ../../projects/einstein.phys.uwm.edu/LATeah3012L12220828.dat --alpha 2.59819959601 --delta -0.694603692878 --skyRadius 1.890770e-06 --ldiBins 15 --f0start 884.0 --f0Band 8.0 --firstSkyPoint 0 --numSkyPoints 1 --f1dot -1e-13 --f1dotBand 1e-13 --df1dot 1.69860773e-15 --ephemdir ..\..\projects\einstein.phys.uwm.edu\JPLEPH --Tcoh 2097152.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.1 --reftime 56100 --model 0 --f0orbit 0.005 --mismatch 0.1 --demodbinary 1 --BinaryPointFile ../../projects/einstein.phys.uwm.edu/templates_LATeah3012L12220828_0892_23821281.dat --debug 0 --device 0 -o LATeah3012L12220828_892.0_0_0.0_23821281_2_0.out
output files: 'LATeah3012L12220828_892.0_0_0.0_23821281_2_0.out' '../../projects/einstein.phys.uwm.edu/LATeah3012L12220828_892.0_0_0.0_23821281_2_0' 'LATeah3012L12220828_892.0_0_0.0_23821281_2_0.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah3012L12220828_892.0_0_0.0_23821281_2_1'
23:15:43 (13744): [debug]: Flags: X64 SSE SSE2 GNUC X86 GNUX86
23:15:43 (13744): [debug]: Set up communication with graphics process.
boinc_get_opencl_ids returned [0000000003bbc140 , 00007ff811faf180]
Using OpenCL platform provided by: Advanced Micro Devices, Inc.
Using OpenCL device "Cedar" by: Advanced Micro Devices, Inc.
Max allocation limit: 536870912
Global mem size: 1073741824
OpenCL compiling FAILED! : -11 . Error message: "C:\Users\arsem\AppData\Local\Temp\OCLCCD5.tmp.cl", line 10: error: identifier
"double2" is undefined
__kernel void test( __global double2 *vec) {
^

1 error detected in the compilation of "C:\Users\arsem\AppData\Local\Temp\OCLCCD5.tmp.cl".
Frontend phase failed compilation.

OpenCL device has no FP64 support
read_checkpoint(): Couldn't open file 'LATeah3012L12220828_892.0_0_0.0_23821281_2_0.out.cpt': No such file or directory (2)
% fft length: 16777216 (0x1000000)

BUILD LOG
************************************************
"C:\Users\arsem\AppData\Local\Temp\OCLEC08.tmp.cl", line 24: error: work group
size exceeds the maximum default value for the selected device
__attribute__(( reqd_work_group_size( 16, 16, 1 ) ))
^

1 error detected in the compilation of "C:\Users\arsem\AppData\Local\Temp\OCLEC08.tmp.cl".
Frontend phase failed compilation.

************************************************
FFTGeneratedTransposeGCNAction::compileKernels failed
ERROR: plan generation ("baking") failed: -11
23:15:52 (13744): [CRITICAL]: ERROR: MAIN() returned with error '-11'
FPU status flags: COND_1
23:16:04 (13744): [normal]: done. calling boinc_finish(53).
23:16:04 (13744): called boinc_finish

</stderr_txt>
]]>

 

So different OS, different arch, different GPU, and still has errors on the unit.  This makes me suspect it's the work units not the machines.

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117786428596
RAC: 34682647


Dark Angel wrote:

I just started back with the project after a lay-off and am getting errors on the Gamma-ray pulsar binary search #1 on GPU sub-project.

....

So different OS, different arch, different GPU, and still has errors on the unit.  This makes me suspect it's the work units not the machines.

It's quite unlikely that your problem is anything to do with the tasks themselves.

The other computer also giving compute errors has quite a different issue.  The error there was:-

OpenCL compiling FAILED!

and this comes from an initial check performed by the app itself to see if the GPU to be used has the hardware capability to crunch the work.  That other person's GPU is listed as "Cedar", an old architecture that lacks the necessary OpenCL capabilities (the log itself goes on to report "OpenCL device has no FP64 support").  I've already sent a PM to the owner about this.
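
As an aside, on OpenCL 1.x devices double-precision support is normally advertised through the `cl_khr_fp64` extension (older AMD drivers sometimes exposed `cl_amd_fp64` instead). The sketch below is only an illustration of checking for that in a device's extension string; it is not how the Einstein@Home app does it (per the log, the app test-compiles a small kernel that uses `double2`). The function name `has_fp64` and the sample strings are made up for this example.

```python
def has_fp64(extensions: str) -> bool:
    """Return True if a space-separated OpenCL extension string
    advertises double-precision floating point support."""
    names = set(extensions.split())
    return "cl_khr_fp64" in names or "cl_amd_fp64" in names


# Hypothetical extension strings, for illustration only;
# real ones come from clinfo or the OpenCL device query.
without_fp64 = "cl_khr_global_int32_base_atomics cl_khr_byte_addressable_store"
with_fp64 = "cl_khr_fp64 cl_khr_global_int32_base_atomics"

print(has_fp64(without_fp64))
print(has_fp64(with_fp64))
```

The extension list for a real card can be read from a tool such as `clinfo`; a device whose list has neither entry would be expected to fail the app's initial check in the way shown in the log above.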

Your GPU certainly does have the necessary capabilities, so it's not the same issue.  The clue from your log is:-

signal handler called: signal 1

I have no idea why you are getting a SIGHUP (hangup signal) - that's something you will need to investigate at your end.  You can also see the reference to a system library:-

Binary file: /lib/x86_64-linux-gnu/libpthread.so.0

so maybe there's some problem with that.  I'm not a programmer so I have no clue about what that all implies.

 

Cheers,
Gary.
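
For anyone wanting to check the signal number in a stderr dump like the one in the first post, a minimal Python sketch follows. It assumes the log contains a line of the form "signal handler called: signal N", as the Linux log above does. The helper name `signal_from_stderr` is made up for this example, and the number-to-name mapping comes from the machine the script runs on, so it should be run on Linux to match a Linux log.

```python
import re
import signal

# Matches the line the app prints when its signal handler fires.
SIGNAL_LINE = re.compile(r"signal handler called: signal (\d+)")


def signal_from_stderr(stderr_text):
    """Return (number, name) for the first signal reported in the
    stderr text, or None if no such line is present."""
    match = SIGNAL_LINE.search(stderr_text)
    if match is None:
        return None
    number = int(match.group(1))
    try:
        name = signal.Signals(number).name
    except ValueError:
        # The number is not a known signal on this platform.
        name = "unknown"
    return number, name


sample = (
    "-- signal handler called: signal 1\n"
    "2 stack frames obtained for this thread:\n"
)
print(signal_from_stderr(sample))
```

On a Linux host this reports signal 1 as SIGHUP, which matches the reading given above. It does not say which process sent the signal; that still has to be tracked down on the host.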

Dark Angel
Joined: 3 Jan 12
Posts: 5
Credit: 54902377
RAC: 93604


Hi Gary,

I've confirmed the listed binary file does exist at that location and running ldd on the Einstein executables shows all dependencies are satisfied.

The units in question have been running successfully, if much more slowly, for me on another machine with the same drivers but an older Quadro K2000 card.  They just won't run on this GTX 1660.

fastbunny
Joined: 20 Apr 06
Posts: 22
Credit: 91424422
RAC: 0


Hello, I'm going to revive this thread by saying I am facing the same problem: all 'Gamma-ray pulsar binary search #1 on GPUs v1.28 (FGRPopencl2-ati) windows_x86_64' tasks exit with a compute error after a few seconds. This is the machine: https://einsteinathome.org/host/12979996

I have restarted crunching after a period of absence as well, and I am using the same GPU (AMD Vega56) as I have done for a long time, which previously had no trouble running these tasks.

When I look at the same workunits sent to other PCs, I see that the same version (1.28 opencl2) runs well on NVidia systems, but I see AMD systems are given 1.17 or 1.18 (opencl1K) depending on their platform. Am I given a beta version 1.28 for AMD perhaps? I have enabled 'Run test applications?' and I don't mind it if that's the reason, I'm just wondering if there's anything wrong on my end.

As said, I am running a Vega56, with driver version 22.5.1 on Windows 10. According to GPU-Z, this card should support OpenCL 2.0. Any help is much appreciated.

 

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3965
Credit: 47219242642
RAC: 65371339


fastbunny wrote:

Hello, I'm going to revive this thread by saying I am facing the same problem: all 'Gamma-ray pulsar binary search #1 on GPUs v1.28 (FGRPopencl2-ati) windows_x86_64' tasks exit with a compute error after a few seconds. This is the machine: https://einsteinathome.org/host/12979996

I have restarted crunching after a period of absence as well, and I am using the same GPU (AMD Vega56) as I have done for a long time, which previously had no trouble running these tasks.

When I look at the same workunits sent to other PCs, I see that the same version (1.28 opencl2) runs well on NVidia systems, but I see AMD systems are given 1.17 or 1.18 (opencl1K) depending on their platform. Am I given a beta version 1.28 for AMD perhaps? I have enabled 'Run test applications?' and I don't mind it if that's the reason, I'm just wondering if there's anything wrong on my end.

As said, I am running a Vega56, with driver version 22.5.1 on Windows 10. According to GPU-Z, this card should support OpenCL 2.0. Any help is much appreciated.

 

Your problem is not the same problem. You're having errors, but the root cause is completely different.

Your GPU does support OpenCL 2.0, but for some reason the drivers don't provide proper support for the features used by the application (driver/software issues are ubiquitous with AMD). I've not seen anyone with older AMD GPUs have success with this version of the application on Windows, only a handful of successes with the Big Navi cards, and some success with older GPUs on Linux depending on which driver you install.

However, this application is listed as beta. If you go into your project preferences and disable processing of beta tasks, you will get the older application that works fine for your GPU.

_________________________________________________________________________

fastbunny
Joined: 20 Apr 06
Posts: 22
Credit: 91424422
RAC: 0


Thank you for clearing that up.
