GPU failures signal 11

Dad
Joined: 25 Oct 18
Posts: 7
Credit: 17,928,784
RAC: 2,019
Topic 223806

I get GPU task failures on all Einstein GPU tasks marked as 1 GPU, while the ones marked 0.9 GPUs work fine. There isn't much information about this, and I don't have a good handle on debugging it.

https://einsteinathome.org/task/1022302418

Stderr output

<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message>
process got signal 11</message>
<stderr_txt>
15:28:10 (156498): [normal]: This Einstein@home App was built at: Jan 16 2017 08:09:16

15:28:10 (156498): [normal]: Start of BOINC application '../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.18_x86_64-pc-linux-gnu__FGRPopencl1K-ati'.
15:28:10 (156498): [debug]: 1e+16 fp, 5e+09 fp/s, 2088056 s, 580h00m55s88
command line: ../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.18_x86_64-pc-linux-gnu__FGRPopencl1K-ati --inputfile ../../projects/einstein.phys.uwm.edu/LATeah1065L65.dat --alpha 1.41058464281 --delta -0.444366280137 --skyRadius 5.526880e-07 --ldiBins 30 --f0start 468.0 --f0Band 8.0 --firstSkyPoint 0 --numSkyPoints 1 --f1dot -1e-13 --f1dotBand 1e-13 --df1dot 2.512676418e-15 --ephemdir ../../projects/einstein.phys.uwm.edu/JPLEPH --Tcoh 2097152.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.1 --reftime 56100 --model 0 --f0orbit 0.005 --mismatch 0.1 --demodbinary 1 --BinaryPointFile ../../projects/einstein.phys.uwm.edu/templates_LATeah1065L65_0476_35629195.dat --debug 0 --device 0 -o LATeah1065L65_476.0_0_0.0_35629195_3_0.out
output files: 'LATeah1065L65_476.0_0_0.0_35629195_3_0.out' '../../projects/einstein.phys.uwm.edu/LATeah1065L65_476.0_0_0.0_35629195_3_0' 'LATeah1065L65_476.0_0_0.0_35629195_3_0.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah1065L65_476.0_0_0.0_35629195_3_1'
15:28:10 (156498): [debug]: Flags: X64 SSE SSE2 GNUC X86 GNUX86
15:28:10 (156498): [debug]: glibc version/release: 2.31/stable
15:28:10 (156498): [debug]: Set up communication with graphics process.

-- signal handler called: signal 1

</stderr_txt>
]]>



mikey
Joined: 22 Jan 05
Posts: 6,406
Credit: 558,304,474
RAC: 222,392

Try rebooting 

"Signal 11 is a segfault, which means your program is trying to read or write memory it doesn't have permission to."
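For anyone who wants to check the number-to-name mapping themselves, here is a minimal Python sketch. It assumes a POSIX system such as Linux, and the helper names `signal_name` and `describe_exit` are made up for this illustration, not part of BOINC or the Einstein app. It looks up signal 11 and decodes the negative return code that Python's `subprocess` module reports when a child process is killed by a signal.

```python
import signal


def signal_name(number):
    """Return the symbolic name for a signal number, e.g. 11 -> 'SIGSEGV'."""
    return signal.Signals(number).name


def describe_exit(returncode):
    """Describe a subprocess-style return code.

    On POSIX, subprocess reports a child killed by signal N as -N.
    """
    if returncode < 0:
        return "killed by " + signal_name(-returncode)
    return "exited with status " + str(returncode)


print(signal_name(11))
print(describe_exit(-11))
print(describe_exit(0))
```

This only names the signal. It says nothing about why the application touched memory it should not have.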

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,245
Credit: 44,946,104,231
RAC: 36,116,753

Dad wrote:
I get GPU task failure with all Einstein GPU tasks marked as 1 GPU with the ones with 0.9 GPU's work fine. Not much info about and I don't have a good handle on debugging it.

Just to get a better picture of the problem, there are no GPU tasks marked as "0.9 GPUs".  All GPU tasks are marked with an estimate of what fraction of a CPU core might be required to support GPU crunching, so I imagine that might be where you got the 0.9 figure from.

All your successful tasks are CPU tasks, and you have both types: gravitational wave (GW) and gamma-ray pulsar (FGRP5).  You have 37 failed GPU tasks, all of the gamma-ray pulsar type for GPUs (FGRPB1G), which is quite different from the FGRP5 CPU search.  There are no successful GPU tasks, which suggests that you may not have a working set of OpenCL compute libs installed.  Here is a link to your full tasks list on the website.

If you examine the startup messages from BOINC's event log when you first start BOINC, it should give you information about the OpenCL libs.  If you post a copy of those startup messages, someone may be able to see if the problem is related to that.

Cheers,
Gary.

Dad
Joined: 25 Oct 18
Posts: 7
Credit: 17,928,784
RAC: 2,019

As far as GPU percentages go, there were tasks marked 0.9 GPU that did work, but there haven't been any for some time. Milkyway works OK.

From my current log

Sun 25 Oct 2020 12:05:02 |  | Starting BOINC client version 7.16.6 for x86_64-pc-linux-gnu
Sun 25 Oct 2020 12:05:02 |  | log flags: file_xfer, sched_ops, task
Sun 25 Oct 2020 12:05:02 |  | Libraries: libcurl/7.69.1 OpenSSL/1.1.1g-fips zlib/1.2.11 brotli/1.0.9 libidn2/2.3.0 libpsl/0.21.0 (+libidn2/2.3.0) libssh/0.9.5/openssl/zlib nghttp2/1.41.0
Sun 25 Oct 2020 12:05:02 |  | Data directory: /mnt/store/boinc
Sun 25 Oct 2020 12:05:03 |  | OpenCL: AMD/ATI GPU 0: Radeon RX550/550 Series (driver version 3110.6, device version OpenCL 1.2 AMD-APP (3110.6), 1501MB, 1501MB available, 1211 GFLOPS peak)
Sun 25 Oct 2020 12:05:03 |  | OpenCL: AMD/ATI GPU 1 (ignored by config): Radeon RX550/550 Series (POLARIS12, DRM 3.38.0, 5.8.15-201.fc32. (driver version 20.1.10, device version OpenCL 1.1 Mesa 20.1.10, 3072MB, 3072MB available, 757 GFLOPS peak)
Sun 25 Oct 2020 12:05:03 |  | app version refers to missing GPU type NVIDIA
Sun 25 Oct 2020 12:05:03 | SETI@home | Application uses missing NVIDIA GPU
Sun 25 Oct 2020 12:05:03 |  | libc: GNU libc version 2.31
Sun 25 Oct 2020 12:05:03 |  | Host name: frank.farmdomain
Sun 25 Oct 2020 12:05:03 |  | Processor: 12 AuthenticAMD AMD Ryzen 5 1600 Six-Core Processor [Family 23 Model 1 Stepping 1]
Sun 25 Oct 2020 12:05:03 |  | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd sev ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca
Sun 25 Oct 2020 12:05:03 |  | OS: Linux Fedora: Fedora release 32 (Thirty Two) [5.8.15-201.fc32.x86_64|libc 2.31 (GNU libc)]
Sun 25 Oct 2020 12:05:03 |  | Memory: 15.63 GB physical, 15.83 GB virtual
Sun 25 Oct 2020 12:05:03 |  | Disk: 228.23 GB total, 141.53 GB free
Sun 25 Oct 2020 12:05:03 |  | Local time is UTC +11 hours
Sun 25 Oct 2020 12:05:03 | Milkyway@Home | General prefs: from Milkyway@Home (last modified 17-Jun-2020 15:26:10)
Sun 25 Oct 2020 12:05:03 | Milkyway@Home | Computer location: home
Sun 25 Oct 2020 12:05:03 | Milkyway@Home | General prefs: no separate prefs for home; using your defaults
Sun 25 Oct 2020 12:05:03 |  | Reading preferences override file
Sun 25 Oct 2020 12:05:03 |  | Preferences:
Sun 25 Oct 2020 12:05:03 |  | max memory usage when active: 8000.15 MB
Sun 25 Oct 2020 12:05:03 |  | max memory usage when idle: 14400.26 MB
Sun 25 Oct 2020 12:05:03 |  | max disk usage: 15.00 GB
Sun 25 Oct 2020 12:05:03 |  | max CPUs used: 11
Sun 25 Oct 2020 12:05:03 |  | suspend work if non-BOINC CPU load exceeds 25%
Sun 25 Oct 2020 12:05:03 |  | (to change preferences, visit a project web site or select Preferences in the Manager)
Sun 25 Oct 2020 12:05:03 |  | Setting up project and slot directories
Sun 25 Oct 2020 12:05:03 |  | Checking active tasks
Sun 25 Oct 2020 12:05:03 | Einstein@Home | URL http://einstein.phys.uwm.edu/; Computer ID 12725752; resource share 35
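
The OpenCL lines are easy to miss in a long startup log like the one above, so here is a rough Python sketch that pulls them out. It assumes each event-log line has the form "timestamp | project | message", as in the log above. The function name `opencl_messages` is invented for this example, and the sample text is copied from the log rather than read from a real BOINC file.

```python
def opencl_messages(log_text):
    """Return the message part of every log line that reports an OpenCL device."""
    found = []
    for line in log_text.splitlines():
        # Split into at most three fields so a "|" inside the message survives.
        parts = line.split("|", 2)
        if len(parts) != 3:
            continue
        message = parts[2].strip()
        if message.startswith("OpenCL:"):
            found.append(message)
    return found


sample = """\
Sun 25 Oct 2020 12:05:02 |  | Data directory: /mnt/store/boinc
Sun 25 Oct 2020 12:05:03 |  | OpenCL: AMD/ATI GPU 0: Radeon RX550/550 Series (driver version 3110.6, device version OpenCL 1.2 AMD-APP (3110.6), 1501MB, 1501MB available, 1211 GFLOPS peak)
Sun 25 Oct 2020 12:05:03 |  | OpenCL: AMD/ATI GPU 1 (ignored by config): Radeon RX550/550 Series (POLARIS12, DRM 3.38.0, 5.8.15-201.fc32. (driver version 20.1.10, device version OpenCL 1.1 Mesa 20.1.10, 3072MB, 3072MB available, 757 GFLOPS peak)
Sun 25 Oct 2020 12:05:03 |  | libc: GNU libc version 2.31
"""

for message in opencl_messages(sample):
    print(message)
```

On this sample it prints the two OpenCL device lines, one reported through AMD-APP and one through Mesa (the latter marked "ignored by config"). If no such lines turned up at all, that would point towards missing OpenCL libs.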
