Computation error on all Nvidia GPU tasks

benoit
Joined: 30 Nov 17
Posts: 3
Credit: 172440
RAC: 0
Topic 211611

Hello,

My Nvidia GPU is correctly detected, but all the Einstein GPU tasks fail with a computation error. (Milkyway GPU tasks work fine.)

For example LATeah0046L_468.0_0_0.0_140560_1 (TASK 703746622)

I don't understand where the problem is.

First output lines:

Stderr output

<core_client_version>7.8.4</core_client_version>
<![CDATA[
<message>
process exited with code 6 (0x6, -250)</message>
<stderr_txt>
18:25:34 (11661): [normal]: This Einstein@home App was built at: Feb 15 2017 10:50:14

18:25:34 (11661): [normal]: Start of BOINC application '../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_x86_64-pc-linux-gnu__FGRPopencl1K-nvidia'.
18:25:34 (11661): [debug]: 1e+16 fp, 4.4e+09 fp/s, 2372293 s, 658h58m12s99
18:25:34 (11661): [normal]: % CPU usage: 1.000000, GPU usage: 1.000000
command line: ../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_x86_64-pc-linux-gnu__FGRPopencl1K-nvidia --inputfile ../../projects/einstein.phys.uwm.edu/LATeah0046L.dat --alpha 4.42281478648 --delta -0.0345027837249 --skyRadius 2.152570e-06 --ldiBins 15 --f0start 460.0 --f0Band 8.0 --firstSkyPoint 0 --numSkyPoints 1 --f1dot -1e-13 --f1dotBand 1e-13 --df1dot 3.344368011e-15 --ephemdir ../../projects/einstein.phys.uwm.edu/JPLEPH --Tcoh 2097152.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.1 --reftime 56100 --model 0 --f0orbit 0.005 --mismatch 0.1 --demodbinary 1 --BinaryPointFile ../../projects/einstein.phys.uwm.edu/templates_LATeah0046L_0468_140560.dat --debug 1 --device 0 -o LATeah0046L_468.0_0_0.0_140560_1_0.out
output files: 'LATeah0046L_468.0_0_0.0_140560_1_0.out' '../../projects/einstein.phys.uwm.edu/LATeah0046L_468.0_0_0.0_140560_1_0' 'LATeah0046L_468.0_0_0.0_140560_1_0.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah0046L_468.0_0_0.0_140560_1_1'
18:25:34 (11661): [debug]: Flags: X64 SSE SSE2 GNUC X86 GNUX86
18:25:34 (11661): [debug]: glibc version/release: 2.26/stable
18:25:34 (11661): [debug]: Set up communication with graphics process.

-- signal handler called: signal 6

2 stack frames obtained for this thread:
Frame 32:
	Binary file: ../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_x86_64-pc-linux-gnu__FGRPopencl1K-nvidia (0x48b261)
	Source file: hs_boinc_extras.c (Function: sighandler / Line: 290)
Frame 31:
	Binary file: /lib64/libc.so.6 (0x7f5293a3869b)
	Offset info: gsignal+0xcb
Frame 30:
	Binary file: /lib64/libc.so.6 (0x7f5293a3869b)
	Offset info: gsignal+0xcb
Frame 29:
	Binary file: /lib64/libc.so.6 (0x7f5293a3a3b1)
	Offset info: abort+0x141
Frame 28:
	Binary file: /lib64/libc.so.6 (0x7f5293a82a87)
	Offset info: +0x81a87
Frame 27:
	Binary file: /lib64/libc.so.6 (0x7f5293a89e8e)
	Offset info: +0x88e8e
Frame 26:
	Binary file: /lib64/libc.so.6 (0x7f5293a8b989)
	Offset info: +0x8a989
Frame 25:
	Binary file: /lib64/libc.so.6 (0x7f5293a942ee)
	Offset info: cfree+0x6e
Frame 24:
	Binary file: ../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_x86_64-pc-linux-gnu__FGRPopencl1K-nvidia (0x6a7598)
	Offset info: _ZNSt13runtime_errorD2Ev+0x58
	Source file: basic_string.h (Function: y / Line: 249)
	Source file: basic_string.h (Function: ~basic_string / Line: 539)
	Source file: stdexcept.cc (Function: y / Line: 68)
Frame 23:
	Binary file: /lib64/libMesaOpenCL.so.1 (0x7f528abc8d9e)
	Offset info: +0x20d9e
Frame 22:
	Binary file: ../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_x86_64-pc-linux-gnu__FGRPopencl1K-nvidia (0x69992f)
	Source file: eh_throw.cc (Function:  / Line: 52)
Frame 21:
	Binary file: /lib64/libMesaOpenCL.so.1 (0x7f528ac4ae1f)
	Offset info: +0xa2e1f
Frame 20:
	Binary file: /lib64/libMesaOpenCL.so.1 (0x7f528abf4cc4)
	Offset info: +0x4ccc4
Frame 19:
	Binary file: /lib64/libMesaOpenCL.so.1 (0x7f528abf4cf4)
	Offset info: +0x4ccf4
Frame 18:
	Binary file: /lib64/ld-linux-x86-64.so.2 (0x7f529478ee83)
	Offset info: +0x10e83
Frame 17:
	Binary file: /lib64/ld-linux-x86-64.so.2 (0x7f5294793dda)
	Offset info: +0x15dda
Frame 16:
	Binary file: /lib64/libc.so.6 (0x7f5293b5f4df)
	Offset info: _dl_catch_error+0x8f
Frame 15:
	Binary file: /lib64/ld-linux-x86-64.so.2 (0x7f52947932e9)
	Offset info: +0x152e9
Frame 14:
	Binary file: /lib64/libdl.so.2 (0x7f529413bf96)
	Offset info: +0xf96
Frame 13:
	Binary file: /lib64/libc.so.6 (0x7f5293b5f4df)
	Offset info: _dl_catch_error+0x8f
Frame 12:
	Binary file: /lib64/libdl.so.2 (0x7f529413c715)
	Offset info: +0x1715
Frame 11:
	Binary file: /lib64/libdl.so.2 (0x7f529413c021)
	Offset info: dlopen+0x41
Frame 10:
	Binary file: /lib64/libOpenCL.so.1 (0x7f5294563a82)
	Offset info: +0x5a82
Frame 9:
	Binary file: /lib64/libOpenCL.so.1 (0x7f5294565a74)
	Offset info: clGetPlatformIDs+0x114
Frame 8:
	Binary file: ../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_x86_64-pc-linux-gnu__FGRPopencl1K-nvidia (0x5baf44)
	Offset info: _Z24boinc_get_opencl_ids_auxPciiPP13_cl_device_idPP15_cl_platform_id+0x74
	Source file: unknown (Function:  / Line: 0)
Frame 7:
	Binary file: ../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_x86_64-pc-linux-gnu__FGRPopencl1K-nvidia (0x5bb46a)
	Offset info: _Z20boinc_get_opencl_idsPP13_cl_device_idPP15_cl_platform_id+0xe6
	Source file: unknown (Function:  I / Line: 0)
Frame 6:
	Binary file: ../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_x86_64-pc-linux-gnu__FGRPopencl1K-nvidia (0x48bc66)
	Offset info: eah_boinc_get_opencl_ids+0x26
	Source file: hs_boinc_options.cpp (Function: eah_boinc_get_opencl_ids / Line: 136)
Frame 5:
	Binary file: ../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_x86_64-pc-linux-gnu__FGRPopencl1K-nvidia (0x48dcf4)
	Offset info: gen_fft_get_ctx+0x44
	Source file: unknown (Function: gen_fft_get_ctx / Line: 0)
Frame 4:
	Binary file: ../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_x86_64-pc-linux-gnu__FGRPopencl1K-nvidia (0x47975c)
	Offset info: MAIN+0x15c
	Source file: HSgammaPulsar.c (Function: MAIN / Line: 4251)
Frame 3:
	Binary file: ../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_x86_64-pc-linux-gnu__FGRPopencl1K-nvidia (0x46c0ff)
	Offset info: main+0x5ff
	Source file: hs_boinc_extras.c (Function: worker / Line: 832)
	Source file: hs_boinc_extras.c (Function: main / Line: 1038)
Frame 2:
	Binary file: /lib64/libc.so.6 (0x7f5293a2203a)
	Offset info: __libc_start_main+0xea
Frame 1:
	Binary file: ../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_x86_64-pc-linux-gnu__FGRPopencl1K-nvidia (0x46e5f9)
	Source file: unknown (Function: _start / Line: 0)

End of stcaktrace
18:25:34 (11661): called boinc_finish

</stderr_txt>
]]>

Holmis
Joined: 4 Jan 05
Posts: 1118
Credit: 1055935564
RAC: 0

Hi and welcome to Einstein@home!

When checking your tasks for hostID 12596871 I found that your computer managed to finish a few tasks before the errors started and that Task 702186181 seems to be the first one to fail.

The error message given in stderr is:

Warning:  Program terminating, but clFFT resources not freed.
Please consider explicitly calling clfftTeardown( ).

And then the task restarted and failed with "-- signal handler called: signal 6".

Something seems to have gone wrong while processing this task, and that might have left the graphics card or the driver in an unstable state, so that every task after that one fails with the same "signal 6" error.

Have you tried to reboot your computer?

benoit
Joined: 30 Nov 17
Posts: 3
Credit: 172440
RAC: 0

Hi and thank you for your answer,

I rebooted my computer a few minutes ago and did a project update, and the problem is still there :(

Errors on the four tasks (after the reboot):

LATeah0046L_612.0_0_0.0_1165895_0  703911012
LATeah0046L_612.0_0_0.0_1149580_0  703910985
LATeah0046L_612.0_0_0.0_1148325_0  703910983
LATeah0046L_612.0_0_0.0_1126990_1  703910949

Right after these 4 computation errors on the Einstein project, my GPU ran a Milkyway task with no error (the Milkyway GPU tasks never fail).

 

mikey
Joined: 22 Jan 05
Posts: 12705
Credit: 1839110349
RAC: 3608

benoit_7 wrote:

Hi and thank you for your answer,

I rebooted my computer a few minutes ago and did a project update, and the problem is still there :(

Errors on the four tasks (after the reboot):

LATeah0046L_612.0_0_0.0_1165895_0  703911012
LATeah0046L_612.0_0_0.0_1149580_0  703910985
LATeah0046L_612.0_0_0.0_1148325_0  703910983
LATeah0046L_612.0_0_0.0_1126990_1  703910949

Right after these 4 computation errors on the Einstein project, my GPU ran a Milkyway task with no error (the Milkyway GPU tasks never fail).

I looked at a few of your tasks that have had problems, and other users seem to be having problems with the same workunits too. Not everyone, but the problem may not be on your end alone.

benoit
Joined: 30 Nov 17
Posts: 3
Credit: 172440
RAC: 0

Thank you for your help, I feel less lonely :)

 

Holmis
Joined: 4 Jan 05
Posts: 1118
Credit: 1055935564
RAC: 0

benoit_7 wrote:

Hi and thank you for your answer,

I rebooted my computer a few minutes ago and did a project update, and the problem is still there :(

Errors on the four tasks (after the reboot):

LATeah0046L_612.0_0_0.0_1165895_0  703911012
LATeah0046L_612.0_0_0.0_1149580_0  703910985
LATeah0046L_612.0_0_0.0_1148325_0  703910983
LATeah0046L_612.0_0_0.0_1126990_1  703910949

Right after these 4 computation errors on the Einstein project, my GPU ran a Milkyway task with no error (the Milkyway GPU tasks never fail).


Sorry to hear that :(

Being a Windows user, I'm afraid I can't help you with this problem, but I hope that one of our Linux users will stop by and offer some advice.

mmonnin
Joined: 29 May 16
Posts: 291
Credit: 3431156540
RAC: 4010805

Beats me. Have you tried resetting the project?

alanb1951
Joined: 28 Nov 16
Posts: 23
Credit: 733079727
RAC: 392213

Benoit,

One thing I noticed in your stack trace is that it seems to be using the Mesa OpenCL library (libMesaOpenCL.so.1) here, whereas if I look at one of your successful jobs over at Milkyway it seems to be using the NVIDIA OpenCL stuff.
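
For anyone who wants to check a trace like the one in the first post without reading it frame by frame, here is a rough Python sketch. It only assumes the "Binary file: <path> (0x...)" line format seen in that stderr output; the function names are made up for illustration and are not part of BOINC or the Einstein@home app.

```python
import os
import re

# Matches lines such as:
#   Binary file: /lib64/libMesaOpenCL.so.1 (0x7f528abc8d9e)
BINARY_LINE = re.compile(r"Binary file:\s+(\S+)\s+\(0x[0-9a-fA-F]+\)")


def libraries_in_trace(stderr_text):
    """Return the distinct binary names in the trace, in first-seen order."""
    seen = []
    for match in BINARY_LINE.finditer(stderr_text):
        name = os.path.basename(match.group(1))
        if name not in seen:
            seen.append(name)
    return seen


def opencl_libraries(stderr_text):
    """Keep only names containing 'opencl' (a simple, hypothetical filter)."""
    return [n for n in libraries_in_trace(stderr_text) if "opencl" in n.lower()]
```

If the result contains libMesaOpenCL.so.1, the crash happened while the Mesa library was being loaded, which is at least consistent with the observation above.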

I've got my one NVIDIA+Ubuntu set-up organized so that there's no trace of mesa-opencl-icd.  I don't know whether you are able to try that or whether that would interfere with something else you have running...
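
To see which OpenCL implementations are registered on a machine, something like the sketch below might help. It assumes the usual Linux convention where the OpenCL ICD loader reads *.icd files from /etc/OpenCL/vendors, each containing the name of a library to load; your distribution may use a different location, and the function name is just an illustration.

```python
from pathlib import Path


def list_opencl_icds(vendors_dir="/etc/OpenCL/vendors"):
    """Map each .icd file name to the library name written inside it.

    The default directory is the common location for the ICD loader on
    Linux; it is an assumption and may differ on some distributions.
    """
    directory = Path(vendors_dir)
    if not directory.is_dir():
        return {}
    icds = {}
    for icd_file in sorted(directory.glob("*.icd")):
        icds[icd_file.name] = icd_file.read_text().strip()
    return icds
```

Calling list_opencl_icds() and printing the result shows every registered implementation. If a Mesa entry is listed next to the NVIDIA one, the loader will try to open both libraries when an application asks for the available platforms.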

Just a thought...

Good luck - Al.
