Need Help getting Einstein to work on Turing - 2080ti

rjs5

Joined: 3 Jul 05

Posts: 32

Credit: 609371826

RAC: 666032

12 Jan 2019 19:48:46 UTC

Topic 217896

I upgraded my Windows Intel 7920x computer from a NVIDIA GeForce GTX 1080 Ti to a NVIDIA GeForce RTX 2080 Ti Founders Edition purchased directly from Nvidia in late December. The 1080ti had been successfully crunching Einstein for a couple of years without incident. The Einstein WU took about 430 seconds to execute on the 1080ti. The new 2080ti board runs all other BOINC programs except the Einstein GPU WU. I did not notice the failures and foolishly MERGED the machine around the first of January.

The Windows failure mode is the same for each Einstein GPU WU.

The Windows BOINC WU starts up, but after 20 to 30 seconds the Windows screen goes blank/black for a second. Windows continues to run with the restarted driver, and BOINC reports the Einstein WU with a "bridge_fft_clfft.c:948: clFinish failed. status=-36" ERROR.

GPUZ indicates the GPU LOAD is near zero during the timeout. GPU Temperature is in the 60 degree range. GPU and CPU appear to be idling.

1. Is anyone with Turing successfully crunching the GPU WU? Especially Windows 2080ti?

2. The OpenCL error -36 is described as a "Compile-time Error (driver-independent)". It seems like an error described as a driver-independent error "command_queue is not a valid command-queue" would be an Einstein app error.

https://streamhpc.com/blog/2013-04-28/opencl-error-codes/
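For reference, the numeric status in the log can be decoded with a small lookup. This is only a rough sketch and not part of the Einstein app: the table holds a handful of status constants as defined in the Khronos `cl.h` header (it is deliberately partial), and `decode_cl_status` is a made-up helper name.

```python
import re

# A few OpenCL status codes as defined in the Khronos cl.h header.
# The table is intentionally partial; extend it as needed.
CL_STATUS_NAMES = {
    0: "CL_SUCCESS",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -30: "CL_INVALID_VALUE",
    -34: "CL_INVALID_CONTEXT",
    -36: "CL_INVALID_COMMAND_QUEUE",
    -38: "CL_INVALID_MEM_OBJECT",
}

# Matches the "<call> failed. status=<N>" shape seen in the stderr output.
STATUS_RE = re.compile(r"(\w+) failed\. status=(-?\d+)")


def decode_cl_status(line):
    """Return (call, code, name) for a '<call> failed. status=N' line, else None."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    code = int(match.group(2))
    return match.group(1), code, CL_STATUS_NAMES.get(code, "unknown")


line = "ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36"
print(decode_cl_status(line))
```

For the line above this reports `clFinish` together with `CL_INVALID_COMMAND_QUEUE`, which matches the "command_queue is not a valid command-queue" description quoted in the post.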

% Starting semicoherent search over f0 and f1.
% nf1dots: 41  df1dot: 2.51785325e-015  f1dot_start: -1e-013  f1dot_band: 1e-013
% Filling array of photon pairs
ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error -1532208864
18:25:57 (9392): [CRITICAL]: ERROR: MAIN() returned with error '-36'
FPU status flags:  PRECISION
18:26:09 (9392): [normal]: done. calling boinc_finish(28).
18:26:09 (9392): called boinc_finish

Task 817828663

Name: LATeah2006L_1108.0_0_0.0_1288054_1

Workunit ID: 385621767

Created: 2 Jan 2019 18:12:37 GMT

Sent: 2 Jan 2019 19:24:14 GMT

Report deadline: 16 Jan 2019 19:24:14 GMT

Received: 2 Jan 2019 19:34:20 GMT

Server state: Over

Outcome: Computation error

Client state: Compute error

Exit status: 28 (0x0000001C) Unknown error code

Computer: 11997589

Run time (sec): 33.17

CPU time (sec): 16.81

Peak working set size (MB): 146.81

Peak swap size (MB): 865.56

Peak disk usage (MB): 0.01

Validation state: Invalid

Granted credit: 0

Application: Gamma-ray pulsar binary search #1 on GPUs v1.20 (FGRPopencl1K-nvidia) windows_x86_64

Stderr output

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
The printer is out of paper.
 (0x1c) - exit code 28 (0x1c)</message>
<stderr_txt>
18:25:37 (9392): [normal]: This Einstein@home App was built at: Feb 15 2017 09:23:49

18:25:37 (9392): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_windows_x86_64__FGRPopencl1K-nvidia.exe'.
18:25:37 (9392): [debug]: 1.1e+016 fp, 4.2e+009 fp/s, 2497958 s, 693h52m38s35
18:25:37 (9392): [normal]: % CPU usage: 0.200000, GPU usage: 1.000000
command line: projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.20_windows_x86_64__FGRPopencl1K-nvidia.exe --inputfile ../../projects/einstein.phys.uwm.edu/LATeah2006L.dat --alpha 5.40870031252 --delta -0.982374415307 --skyRadius 2.472550e-06 --ldiBins 30 --f0start 1100.0 --f0Band 8.0 --firstSkyPoint 0 --numSkyPoints 1 --f1dot -1e-13 --f1dotBand 1e-13 --df1dot 2.51785325e-15 --ephemdir ..\..\projects\einstein.phys.uwm.edu\JPLEPH --Tcoh 2097152.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.1 --reftime 56100 --model 0 --f0orbit 0.005 --mismatch 0.1 --demodbinary 1 --BinaryPointFile ../../projects/einstein.phys.uwm.edu/templates_LATeah2006L_1108_1288054.dat --debug 1 --device 0 -o LATeah2006L_1108.0_0_0.0_1288054_1_0.out
output files: 'LATeah2006L_1108.0_0_0.0_1288054_1_0.out' '../../projects/einstein.phys.uwm.edu/LATeah2006L_1108.0_0_0.0_1288054_1_0' 'LATeah2006L_1108.0_0_0.0_1288054_1_0.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah2006L_1108.0_0_0.0_1288054_1_1'
18:25:37 (9392): [debug]: Flags: X64 SSE SSE2 GNUC X86 GNUX86
18:25:37 (9392): [debug]: Set up communication with graphics process.
boinc_get_opencl_ids returned [00000000013CB760 , 00000000013CB210]
Using OpenCL platform provided by: NVIDIA Corporation
Using OpenCL device "GeForce RTX 2080 Ti" by: NVIDIA Corporation
Max allocation limit: 2952790016
Global mem size: 3221225472
OpenCL device has FP64 support
% Opening inputfile: ../../projects/einstein.phys.uwm.edu/LATeah2006L.dat
% Total amount of photon times: 20991
% Preparing toplist of length: 10
% Read 1061 binary points
read_checkpoint(): Couldn't open file 'LATeah2006L_1108.0_0_0.0_1288054_1_0.out.cpt': No such file or directory (2)
% fft_size: 16777216 (0x1000000); alloc: 67108872
% Sky point 1/1
% Binary point 1/1061
% Creating FFT plan.
% fft length: 16777216 (0x1000000)
% Scratch buffer size: 136314880
% Starting semicoherent search over f0 and f1.
% nf1dots: 41 df1dot: 2.51785325e-015 f1dot_start: -1e-013 f1dot_band: 1e-013
% Filling array of photon pairs
ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error -1532208864
18:25:57 (9392): [CRITICAL]: ERROR: MAIN() returned with error '-36'
FPU status flags: PRECISION
18:26:09 (9392): [normal]: done. calling boinc_finish(28).
18:26:09 (9392): called boinc_finish

</stderr_txt>
]]>
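The timestamps in the stderr output can be used to check how long the task ran before failing. The following is a minimal sketch, assuming the `HH:MM:SS (pid): [level]: message` line shape seen above; `seconds_until_critical` is a hypothetical helper, and it does not handle a run that crosses midnight.

```python
import re
from datetime import datetime

# Assumed line shape, taken from the log above: "18:25:37 (9392): [normal]: text"
TIMESTAMP_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}) \(\d+\): \[(\w+)\]: (.*)$")


def seconds_until_critical(stderr_text):
    """Seconds from the first timestamped line to the first [CRITICAL] line."""
    first = None
    for raw in stderr_text.splitlines():
        match = TIMESTAMP_RE.match(raw.strip())
        if match is None:
            continue  # untimestamped lines such as "% Sky point 1/1"
        stamp = datetime.strptime(match.group(1), "%H:%M:%S")
        if first is None:
            first = stamp
        if match.group(2) == "CRITICAL":
            return (stamp - first).total_seconds()
    return None  # no [CRITICAL] line found


sample = """\
18:25:37 (9392): [normal]: This Einstein@home App was built at: Feb 15 2017 09:23:49
18:25:37 (9392): [debug]: Set up communication with graphics process.
ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36
18:25:57 (9392): [CRITICAL]: ERROR: MAIN() returned with error '-36'
18:26:09 (9392): [normal]: done. calling boinc_finish(28).
"""
print(seconds_until_critical(sample))
```

On the lines copied from this task the gap is 20 seconds, consistent with the "20 to 30 seconds" before the screen goes black.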

Richard Haselgrove

Joined: 10 Dec 05

Posts: 2143

Credit: 2956986383

RAC: 718831

12 Jan 2019 19:56:29 UTC

Message 168812

Please search for multiple other threads about observed incompatibilities between 'Turing' range cards (including your new RTX 2080 Ti) and the Gamma-ray pulsar binary search #1 application.

The problem is specific to certain data files, but unfortunately these include 2006L (as shown in your task's name).

No-one has found a solution yet (or even identified the exact cause), but several have reported the same error '-36' to NVidia as a bug in the software support (either compiler or runtime) in the drivers for the new cards.
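Since the failures track the data file, it can help to group a host's results by the data file in the task name. A rough sketch follows, assuming the first underscore-separated field of the task name is the data file (this agrees with the `--inputfile .../LATeah2006L.dat` argument in the log above, but the naming scheme is otherwise an assumption). Both helper names are made up, and only the first sample entry comes from this thread.

```python
from collections import Counter


def data_file_of(task_name):
    """Return the first underscore-separated field of a task name."""
    return task_name.split("_", 1)[0]


def failures_by_data_file(results):
    """Count failed tasks per data file from (task_name, succeeded) pairs."""
    counts = Counter()
    for task_name, succeeded in results:
        if not succeeded:
            counts[data_file_of(task_name)] += 1
    return counts


# Only the first entry is from this thread; the others are made-up placeholders.
results = [
    ("LATeah2006L_1108.0_0_0.0_1288054_1", False),
    ("LATeah0000X_100.0_0_0.0_1_0", True),
    ("LATeah2006L_1116.0_0_0.0_2_0", False),
]
print(failures_by_data_file(results))
```

If every failure lands under one data file while other files validate, that supports the data-file explanation over a general hardware fault.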

rjs5

Joined: 3 Jul 05

Posts: 32

Credit: 609371826

RAC: 666032

12 Jan 2019 22:51:51 UTC

Message 168813 in response to message 168812

I have done the searches and read the speculations. If you use Google to search for the string "bridge_fft_clfft.c:948: clFinish failed. status=-36" the Einstein failures go back a year to Jan 2018 or farther.

I looked around a little more and found a reference to an Nvidia Dev forum thread where someone was getting an invalid command queue error when calling clFinish. Those are the same conditions as the Einstein error.

CL_INVALID_COMMAND_QUEUE when clFinish

https://devtalk.nvidia.com/default/topic/911395/cl_invalid_command_queue-when-clfinish/

One of the Nvidia forum moderators referenced a thread explaining that CL_INVALID_COMMAND_QUEUE could be returned if dynamic CL sizing was used in the app. The app messed up the dynamic sizing and handshake timing.

CL_INVALID_COMMAND_QUEUE error on clFinish command - a lot of operations in each kernel driver crash

https://devtalk.nvidia.com/default/topic/501409/cl_invalid_command_queue-error-on-clfinish-command-a-lot-of-operations-in-each-kernel-driver-crash/?offset=2

I would be very surprised if the problem seen with this generic Einstein GPU app were a compiler or driver problem. I also doubt that this problem will get much attention from Nvidia since they view their primary market as games. I am pretty sure this problem can be avoided by modifying the Einstein code.

I have joined the Albert@Home project. Albert is where the new Einstein code is developed (I think) and where they will be most interested in solving Einstein code problems.

Penguin

Joined: 8 Oct 12

Posts: 14

Credit: 392159017

RAC: 21668

13 Jan 2019 0:42:01 UTC

Message 168815 in response to message 168813

That's the most in-depth bug report I've seen. I'm sure it's a code issue. Thanks for that feedback. I hope someone can put it to use.

The only other projects I've been able to get a 2080 working on are Primegrid, Enigma, and Seti. Einstein here with the L data set fails, Asteroids fails, GPUGrid fails. Haven't tried Collatz or much else. I was able to fold successfully with the 2080 for Folding@home, but that's not a BOINC project anymore, it's independent. Hoping Einstein fixes it, since it was tearing through work before the data set change.

rjs5

Joined: 3 Jul 05

Posts: 32

Credit: 609371826

RAC: 666032

13 Jan 2019 4:20:52 UTC

Message 168817 in response to message 168815

If you look around for the origin of the -36 errors, this particular error started appearing around the introduction of "Gamma-ray pulsar binary search #1 on GPUs v1.17 (FGRPopencl-nvidia)" in late 2016. IMO, an Einstein app logic bug was introduced into v1.17 and has been floating around since then. The developers always have a problem going back and fixing an old logic bug. They would rather create new stuff.

Lagbolt

Joined: 8 Mar 05

Posts: 1

Credit: 283841076

RAC: 0

23 Jan 2019 7:39:56 UTC

Message 169025

I too just updated to the 2080 Ti, and I've had to suspend Einstein (Computation error). There seem to be other BOINC projects that may be affected. I get warnings that the 2080 Ti is reaching 99% GPU usage, and I am getting lockups for a few seconds. When I get a chance I will see if I can isolate the problem.

Thank you.

Found that Seti is also sending the GPU over the limit, just up to the point of sending an error message. Very occasional lockups.

I have five projects working. At the moment I have Seti & Einstein suspended and all's well. I will be turning on Seti as they complete the tasks...the warnings I will put up with.

rjs5

Joined: 3 Jul 05

Posts: 32

Credit: 609371826

RAC: 666032

23 Jan 2019 16:57:57 UTC

Message 169032 in response to message 169025

I had a problem with the new way the Turing Founders Edition board exhausted heat into the case instead of out the back of the machine. My ASUS motherboard was set to all the manufacturer default options. I explicitly set the BIOS CPU maximum temperature setting lower and I now only have a problem with Einstein. Einstein still fails even though it is the only app running and GPUZ shows no GPU temperature rise or heavy load.

Penguin

Joined: 8 Oct 12

Posts: 14

Credit: 392159017

RAC: 21668

5 Feb 2019 18:34:49 UTC

Message 169338 in response to message 169025

No problems with Seti for me. The 2080 tears it up, but I think my CPU is a slight bottleneck. I've had SETI go for weeks at a clip without any warning messages. Maybe your card is overheating? Odd with SETI, but it could happen.

Here at Einstein it seems to be the data set specifically that causes the problem. So it needs to be compared with a working data set to see just what is different. When I first got the 2080 card it tore stuff up; then the data set changed, crashes started immediately, I posted here, others had the same issues, and it was narrowed down to something with the data set itself. Something too large, perhaps, not being adjusted right for the new cards... really I have no idea exactly what. Fingers crossed it will be solved at some point down the line. It's not just limited to Einstein though; a few other projects also fail to work with the RTX Turing architecture.
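One cheap way to start that comparison is to diff the `command line:` entries from the stderr of a failing task and a working one. This is only a sketch, assuming every flag is followed by exactly one value (as in the command line posted earlier in the thread). The helper names, `app.exe`, and the whole "working" command line are placeholders, not real task data.

```python
def parse_flags(command_line):
    """Map '--flag value' pairs to a dict; assumes each flag has one value."""
    tokens = command_line.split()
    flags = {}
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("-") and i + 1 < len(tokens):
            # The next token is always taken as the value, so negative
            # numbers such as "-1e-13" are not mistaken for flags.
            flags[tok.lstrip("-")] = tokens[i + 1]
            i += 2
        else:
            i += 1  # executable path or stray token
    return flags


def diff_flags(failing, working):
    """Return {flag: (failing_value, working_value)} for flags that differ."""
    a, b = parse_flags(failing), parse_flags(working)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }


# Failing values are abbreviated from the task in this thread.
failing = "app.exe --inputfile LATeah2006L.dat --f0start 1100.0 --f0Band 8.0 --f1dot -1e-13"
# Hypothetical working task: these values are invented for illustration.
working = "app.exe --inputfile LATeah0000X.dat --f0start 500.0 --f0Band 8.0 --f1dot -1e-13"
print(diff_flags(failing, working))
```

Here only `inputfile` and `f0start` would be reported. A real comparison would also need to look inside the data files themselves, which this sketch does not attempt.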
