I upgraded my Windows Intel 7920x computer from an NVIDIA GeForce GTX 1080 Ti to an NVIDIA GeForce RTX 2080 Ti Founders Edition, purchased directly from Nvidia in late December. The 1080 Ti had been crunching Einstein successfully for a couple of years without incident, and an Einstein WU took about 430 seconds to execute on it. The new 2080 Ti board runs all other BOINC programs, but not the Einstein GPU WU. I did not notice the failures and foolishly MERGED the machine around the first of January.
The Windows failure mode is the same for each Einstein GPU WU.
The BOINC WU starts up, but after 20 to 30 seconds the Windows screen goes black for a second. Windows then continues to run with the restarted driver, and BOINC reports the Einstein WU with a "bridge_fft_clfft.c:948: clFinish failed. status=-36" ERROR.
GPUZ indicates the GPU LOAD is near zero during the timeout. GPU Temperature is in the 60 degree range. GPU and CPU appear to be idling.
1. Is anyone with Turing successfully crunching the GPU WU? Especially Windows 2080ti?
2. OpenCL error -36 is CL_INVALID_COMMAND_QUEUE ("command_queue is not a valid command-queue"), which the page linked below lists under "Compile-time Error (driver-independent)". An error classed as driver-independent looks to me like an Einstein app error.
https://streamhpc.com/blog/2013-04-28/opencl-error-codes/
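For anyone reading the raw status numbers in these logs, here is a minimal sketch of a lookup from OpenCL status code to its symbolic name. The table is deliberately partial; the values are the ones defined in the Khronos CL/cl.h header, and the function name describe_status is just an illustrative choice, not anything from the Einstein app.

```python
# Partial table of OpenCL status codes (values as defined in CL/cl.h).
# Only a handful of common codes are listed; extend as needed.
OPENCL_STATUS_NAMES = {
    0: "CL_SUCCESS",
    -1: "CL_DEVICE_NOT_FOUND",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -30: "CL_INVALID_VALUE",
    -34: "CL_INVALID_CONTEXT",
    -36: "CL_INVALID_COMMAND_QUEUE",
    -38: "CL_INVALID_MEM_OBJECT",
    -54: "CL_INVALID_WORK_GROUP_SIZE",
}


def describe_status(status: int) -> str:
    """Return the symbolic name for an OpenCL status code, if known."""
    name = OPENCL_STATUS_NAMES.get(status)
    return name if name is not None else f"unknown OpenCL status {status}"


if __name__ == "__main__":
    # The status reported by clFinish in the failing Einstein task.
    print(describe_status(-36))
```

The name only says what the runtime reported at the clFinish call; it does not by itself show whether the app or the driver caused the queue to become invalid.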
Name: LATeah2006L_1108.0_0_0.0_1288054_1
Workunit ID: 385621767
Created: 2 Jan 2019 18:12:37 GMT
Sent: 2 Jan 2019 19:24:14 GMT
Report deadline: 16 Jan 2019 19:24:14 GMT
Received: 2 Jan 2019 19:34:20 GMT
Server state: Over
Outcome: Computation error
Client state: Compute error
Exit status: 28 (0x0000001C) Unknown error code
Computer: 11997589
Run time (sec): 33.17
CPU time (sec): 16.81
Peak working set size (MB): 146.81
Peak swap size (MB): 865.56
Peak disk usage (MB): 0.01
Validation state: Invalid
Granted credit: 0
Application: Gamma-ray pulsar binary search #1 on GPUs v1.20 (FGRPopencl1K-nvidia) windows_x86_64
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
The printer is out of paper. (0x1c) - exit code 28 (0x1c)</message>
<stderr_txt>
18:25:37 (9392): [normal]: This Einstein@home App was built at: Feb 15 2017 09:23:49
18:25:37 (9392): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_windows_x86_64__FGRPopencl1K-nvidia.exe'.
18:25:37 (9392): [debug]: 1.1e+016 fp, 4.2e+009 fp/s, 2497958 s, 693h52m38s35
18:25:37 (9392): [normal]: % CPU usage: 0.200000, GPU usage: 1.000000
command line: projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_windows_x86_64__FGRPopencl1K-nvidia.exe --inputfile ../../projects/einstein.phys.uwm.edu/LATeah2006L.dat --alpha 5.40870031252 --delta -0.982374415307 --skyRadius 2.472550e-06 --ldiBins 30 --f0start 1100.0 --f0Band 8.0 --firstSkyPoint 0 --numSkyPoints 1 --f1dot -1e-13 --f1dotBand 1e-13 --df1dot 2.51785325e-15 --ephemdir ..\..\projects\einstein.phys.uwm.edu\JPLEPH --Tcoh 2097152.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.1 --reftime 56100 --model 0 --f0orbit 0.005 --mismatch 0.1 --demodbinary 1 --BinaryPointFile ../../projects/einstein.phys.uwm.edu/templates_LATeah2006L_1108_1288054.dat --debug 1 --device 0 -o LATeah2006L_1108.0_0_0.0_1288054_1_0.out
output files: 'LATeah2006L_1108.0_0_0.0_1288054_1_0.out' '../../projects/einstein.phys.uwm.edu/LATeah2006L_1108.0_0_0.0_1288054_1_0' 'LATeah2006L_1108.0_0_0.0_1288054_1_0.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah2006L_1108.0_0_0.0_1288054_1_1'
18:25:37 (9392): [debug]: Flags: X64 SSE SSE2 GNUC X86 GNUX86
18:25:37 (9392): [debug]: Set up communication with graphics process.
boinc_get_opencl_ids returned [00000000013CB760 , 00000000013CB210]
Using OpenCL platform provided by: NVIDIA Corporation
Using OpenCL device "GeForce RTX 2080 Ti" by: NVIDIA Corporation
Max allocation limit: 2952790016
Global mem size: 3221225472
OpenCL device has FP64 support
% Opening inputfile: ../../projects/einstein.phys.uwm.edu/LATeah2006L.dat
% Total amount of photon times: 20991
% Preparing toplist of length: 10
% Read 1061 binary points
read_checkpoint(): Couldn't open file 'LATeah2006L_1108.0_0_0.0_1288054_1_0.out.cpt': No such file or directory (2)
% fft_size: 16777216 (0x1000000); alloc: 67108872
% Sky point 1/1
% Binary point 1/1061
% Creating FFT plan.
% fft length: 16777216 (0x1000000)
% Scratch buffer size: 136314880
% Starting semicoherent search over f0 and f1.
% nf1dots: 41 df1dot: 2.51785325e-015 f1dot_start: -1e-013 f1dot_band: 1e-013
% Filling array of photon pairs
ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error -1532208864
18:25:57 (9392): [CRITICAL]: ERROR: MAIN() returned with error '-36'
FPU status flags: PRECISION
18:26:09 (9392): [normal]: done. calling boinc_finish(28).
18:26:09 (9392): called boinc_finish
</stderr_txt>
]]>
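If you have many failed tasks to go through, a short script can pull the failing OpenCL call and status out of each stderr dump. This is only a sketch: the regular expression is written against the one ERROR line format visible in the log above, and the names CL_ERROR_RE and find_cl_errors are my own, so other app versions may need a different pattern.

```python
import re

# Written against lines of the form seen in the log above, e.g.
# ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36
CL_ERROR_RE = re.compile(
    r"ERROR: (?P<file>\S+):(?P<line>\d+): (?P<call>\w+) failed\. "
    r"status=(?P<status>-?\d+)"
)


def find_cl_errors(stderr_text):
    """Return one dict per 'X failed. status=N' error line in the text."""
    errors = []
    for match in CL_ERROR_RE.finditer(stderr_text):
        errors.append(
            {
                "file": match.group("file"),
                "line": int(match.group("line")),
                "call": match.group("call"),
                "status": int(match.group("status")),
            }
        )
    return errors


# Excerpt copied from the stderr output of the failing task.
SAMPLE = """% Filling array of photon pairs
ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error -1532208864
18:25:57 (9392): [CRITICAL]: ERROR: MAIN() returned with error '-36'
"""

if __name__ == "__main__":
    for err in find_cl_errors(SAMPLE):
        print(err["call"], err["status"], err["file"], err["line"])
```

Only the first ERROR line in the excerpt matches; the follow-on lines report the same failure propagating up through the app and are ignored by the pattern.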
Please search for multiple other threads about observed incompatibilities between 'Turing' range cards (including your new RTX 2080 Ti) and the Gamma-ray pulsar binary search #1 application.
The problem is specific to certain data files, but unfortunately these include 2006L (as shown in your task's name).
No-one has found a solution yet (or even identified the exact cause), but several have reported the same error '-36' to NVidia as a bug in the software support (either compiler or runtime) in the drivers for the new cards.
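Since the failures track the data file, it can help to sort your task list by the data file named at the front of each task name. A minimal sketch, assuming the first underscore-separated field is the data file (this matches the task above, whose command line reads LATeah2006L.dat). The helper names and the AFFECTED set are illustrative, and the set holds only the one file named in this thread.

```python
# Data files reported in this thread as failing on Turing cards.
# Assumption: only the file named in the posts above is listed here.
AFFECTED = {"LATeah2006L"}


def data_file_of(task_name: str) -> str:
    """Return the leading field of a task name, taken to be the data file."""
    return task_name.split("_", 1)[0]


def uses_affected_data(task_name: str, affected=AFFECTED) -> bool:
    """True if the task's data file is in the set of reported problem files."""
    return data_file_of(task_name) in affected


if __name__ == "__main__":
    name = "LATeah2006L_1108.0_0_0.0_1288054_1"  # task name from the report above
    print(data_file_of(name), uses_affected_data(name))
```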
I have done the searches and read the speculations. If you use Google to search for the string "bridge_fft_clfft.c:948: clFinish failed. status=-36", the Einstein failures go back a year, to Jan 2018 or earlier.
I looked around a little more and found a reference to an Nvidia Dev forum thread where someone was getting an invalid command queue error when calling clFinish. Those are the same conditions as the Einstein error.
CL_INVALID_COMMAND_QUEUE when clFinish
https://devtalk.nvidia.com/default/topic/911395/cl_invalid_command_queue-when-clfinish/
One of the Nvidia forum moderators referenced a thread saying that CL_INVALID_COMMAND_QUEUE could be returned if dynamic CL sizing was used in the app. In that case, the app had messed up the dynamic sizing and the handshake timing.
CL_INVALID_COMMAND_QUEUE error on clFinish command - a lot of operations in each kernel driver crash
https://devtalk.nvidia.com/default/topic/501409/cl_invalid_command_queue-error-on-clfinish-command-a-lot-of-operations-in-each-kernel-driver-crash/?offset=2
I would be very surprised if the problem seen with this generic Einstein GPU app were a compiler or driver problem. I also doubt that this problem will get much attention from Nvidia, since they view their primary market as games. I am pretty sure this problem can be avoided by modifying the Einstein code.
I have joined the Albert@Home project. Albert is where the new Einstein code is developed (I think) and where they will be most interested in solving Einstein code problems.
That's the most in-depth bug report I've seen. I'm sure it's a code issue. Thanks for that feedback. I hope someone can put it to use.
The only other projects I've been able to get a 2080 working on are Primegrid, Enigma, and Seti. Einstein here with the L data set fails, Asteroids fails, GPUGrid fails. I haven't tried Collatz or much else. I was able to fold successfully with the 2080 for Folding@home, but that's not a BOINC project anymore; it's independent. I'm hoping Einstein fixes it, since the card was tearing through work before the data set change.
If you look around for the origin of the -36 errors, this particular error started appearing around the introduction of "Gamma-ray pulsar binary search #1 on GPUs v1.17 (FGRPopencl-nvidia)" in late 2016. IMO, an Einstein app logic bug was introduced into v1.17 and has been floating around since then. The developers always have a problem going back and fixing an old logic bug. They would rather create new stuff.
I too just updated to the 2080 Ti, and I've had to suspend Einstein (Computation error). There seem to be other BOINC projects that may be affected. I get warnings that the 2080 Ti is reaching 99% GPU usage, and I'm getting lockups for a few seconds. When I get a chance I will see if I can isolate the problem.
Thank you.
I found that Seti is also pushing the GPU over the limit, just far enough to trigger a warning message, with very occasional lockups.
I have five projects working. At the moment I have Seti & Einstein suspended and all's well. I will be turning on Seti as they complete the tasks...the warnings I will put up with.
I had a problem with the new way the Turing Founders Edition board exhausted heat into the case instead of out the back of the machine. My ASUS motherboard was set to all the manufacturer default options. I explicitly set the BIOS CPU maximum temperature setting lower and I now only have a problem with Einstein. Einstein still fails even though it is the only app running and GPUZ shows no GPU temperature rise or heavy load.
No problems with Seti for me. The 2080 tears it up, but I think my CPU is a slight bottleneck. I've had SETI go for weeks at a clip without any warning messages. Maybe your card is overheating? Odd with SETI, but it could happen.
Here at Einstein it seems to be the data set specifically that causes the problem, so it needs to be compared with a working data set to see just what is different. When I first got the 2080 card it tore stuff up; then the data set changed and there were immediate crashes. I posted here, others had the same issues, and it was narrowed to something with the data set itself. Perhaps something too large is not being adjusted correctly for the new cards... really I have no idea exactly what. Fingers crossed it will be solved at some point down the line. It's not limited to Einstein though; a few other projects also fail to work with the RTX Turing architecture.
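One cheap way to start that comparison is to diff the command lines that BOINC logs for a failing task and a working one. The sketch below does that for simple "--flag value" pairs. The failing line is an abbreviated copy of the one in the task report earlier in the thread; the working line is entirely made up (EXAMPLE_WORKING.dat is a placeholder, not a real data file), and the function names are my own. It only shows which parameters differ, not why one set crashes.

```python
def parse_args(command_line):
    """Return a dict of flag -> value for a whitespace-separated command line.

    Flags without a value are skipped; that is enough for a rough comparison.
    """

    def is_flag(token):
        # "--name" or a single-letter flag like "-o". Negative numbers such
        # as "-1e-13" are treated as values, not flags.
        return token.startswith("--") or (
            len(token) == 2 and token[0] == "-" and token[1].isalpha()
        )

    tokens = command_line.split()
    args = {}
    i = 0
    while i < len(tokens):
        has_value = i + 1 < len(tokens) and not is_flag(tokens[i + 1])
        if is_flag(tokens[i]) and has_value:
            args[tokens[i]] = tokens[i + 1]
            i += 2
        else:
            i += 1
    return args


def diff_args(failing, working):
    """Return {flag: (failing_value, working_value)} for flags that differ."""
    a, b = parse_args(failing), parse_args(working)
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Abbreviated from the failing task's logged command line.
FAILING = (
    "--inputfile ../../projects/einstein.phys.uwm.edu/LATeah2006L.dat "
    "--f0start 1100.0 --f0Band 8.0 --f1dot -1e-13 --device 0"
)
# Hypothetical working task: the file name is a placeholder.
WORKING = (
    "--inputfile EXAMPLE_WORKING.dat "
    "--f0start 1100.0 --f0Band 8.0 --f1dot -1e-13 --device 0"
)

if __name__ == "__main__":
    for flag, (bad, good) in diff_args(FAILING, WORKING).items():
        print(flag, bad, good)
```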