Need Help getting Einstein to work on Turing - 2080ti

rjs5
Joined: 3 Jul 05
Posts: 32
Credit: 609171845
RAC: 653906
Topic 217896

I upgraded my Windows Intel 7920X computer from an NVIDIA GeForce GTX 1080 Ti to an NVIDIA GeForce RTX 2080 Ti Founders Edition, purchased directly from NVIDIA in late December. The 1080 Ti had been crunching Einstein successfully for a couple of years without incident; an Einstein WU took about 430 seconds to execute on it. The new 2080 Ti runs every other BOINC application except the Einstein GPU WUs. I did not notice the failures and foolishly merged the machine around the first of January.

The Windows failure mode is the same for each Einstein GPU WU.

The Windows BOINC WU starts up, but after 20 to 30 seconds the screen goes blank for a second. Windows keeps running with a restarted display driver, and BOINC reports the Einstein WU as failed with a "bridge_fft_clfft.c:948: clFinish failed. status=-36" error.

GPU-Z shows the GPU load near zero during the timeout, and the GPU temperature stays in the 60 °C range; both the GPU and CPU appear to be idling.
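
The blank screen followed by a recovered driver looks like the Windows display-driver watchdog resetting the GPU, which would also invalidate the app's OpenCL context. As a side note, here is a minimal host-side sketch for checking whether the driver watchdog applies to a device. It assumes the cl_nv_device_attribute_query extension is available; the constant comes from that extension, not from core OpenCL, and this is not the Einstein code, just a diagnostic.

/* Hedged sketch: ask the NVIDIA driver whether a kernel-execution
 * timeout (watchdog) is active on the GPU. The enum value is the one
 * assumed from the cl_nv_device_attribute_query extension. */
#include <stdio.h>
#include <CL/cl.h>

#ifndef CL_DEVICE_KERNEL_EXEC_TIMEOUT_NV
#define CL_DEVICE_KERNEL_EXEC_TIMEOUT_NV 0x4005
#endif

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_bool timeout_enabled = CL_FALSE;

    if (clGetPlatformIDs(1, &platform, NULL) != CL_SUCCESS ||
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL) != CL_SUCCESS) {
        fprintf(stderr, "no OpenCL GPU found\n");
        return 1;
    }

    /* clGetDeviceInfo returns CL_INVALID_VALUE if the extension (and
     * therefore this attribute) is not supported by the driver. */
    cl_int err = clGetDeviceInfo(device, CL_DEVICE_KERNEL_EXEC_TIMEOUT_NV,
                                 sizeof(timeout_enabled), &timeout_enabled, NULL);
    if (err == CL_SUCCESS)
        printf("driver watchdog on this GPU: %s\n",
               timeout_enabled ? "enabled" : "disabled");
    else
        printf("could not query the watchdog attribute (err %d)\n", err);
    return 0;
}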

1. Is anyone with a Turing card successfully crunching these GPU WUs, especially a 2080 Ti under Windows?

2. The OpenCL error -36 is described as a "Compile-time Error (driver-independent)" with the meaning "command_queue is not a valid command-queue". An error labelled driver-independent sounds to me like an Einstein app error rather than a driver problem.

https://streamhpc.com/blog/2013-04-28/opencl-error-codes/
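
For reference, -36 is the numeric value of CL_INVALID_COMMAND_QUEUE in the standard CL/cl.h header. The following is just an assumed sketch of the kind of check that produces a log line like the one below; it is not the actual Einstein/FGRP source.

/* Hedged sketch: how a host program typically surfaces a clFinish()
 * failure. CL_INVALID_COMMAND_QUEUE is defined as -36 in CL/cl.h. */
#include <stdio.h>
#include <CL/cl.h>

/* 'queue' is assumed to be a command queue whose context was lost,
 * for example after the display driver reset mid-kernel. */
static int wait_for_queue(cl_command_queue queue)
{
    cl_int status = clFinish(queue);
    if (status != CL_SUCCESS) {
        fprintf(stderr, "clFinish failed. status=%d\n", status);
        return -1;
    }
    return 0;
}

The error code alone does not say whether the queue was mismanaged by the app or invalidated from underneath it by a driver reset; both paths end in the same -36.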

 

% Starting semicoherent search over f0 and f1.
% nf1dots: 41  df1dot: 2.51785325e-015  f1dot_start: -1e-013  f1dot_band: 1e-013
% Filling array of photon pairs
ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error -1532208864
18:25:57 (9392): [CRITICAL]: ERROR: MAIN() returned with error '-36'
FPU status flags:  PRECISION
18:26:09 (9392): [normal]: done. calling boinc_finish(28).
18:26:09 (9392): called boinc_finish

Task 817828663

Name: LATeah2006L_1108.0_0_0.0_1288054_1
Workunit ID: 385621767
Created: 2 Jan 2019 18:12:37 GMT
Sent: 2 Jan 2019 19:24:14 GMT
Report deadline: 16 Jan 2019 19:24:14 GMT
Received: 2 Jan 2019 19:34:20 GMT
Server state: Over
Outcome: Computation error
Client state: Compute error
Exit status: 28 (0x0000001C) Unknown error code
Computer: 11997589
Run time (sec): 33.17
CPU time (sec): 16.81
Peak working set size (MB): 146.81
Peak swap size (MB): 865.56
Peak disk usage (MB): 0.01
Validation state: Invalid
Granted credit: 0
Application: Gamma-ray pulsar binary search #1 on GPUs v1.20 (FGRPopencl1K-nvidia) windows_x86_64

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2956909724
RAC: 719591

Please search for multiple other threads about observed incompatibilities between 'Turing' range cards (including your new RTX 2080 Ti) and the Gamma-ray pulsar binary search #1 application.

The problem is specific to certain data files, but unfortunately these include 2006L (as shown in your task's name).

No-one has found a solution yet (or even identified the exact cause), but several have reported the same error '-36' to NVidia as a bug in the software support (either compiler or runtime) in the drivers for the new cards.

rjs5
Joined: 3 Jul 05
Posts: 32
Credit: 609171845
RAC: 653906

I have done the searches and read the speculations. If you use Google to search for the string "bridge_fft_clfft.c:948: clFinish failed. status=-36", the Einstein failures go back at least a year, to January 2018 or earlier.

I looked around a little more and found an NVIDIA developer forum thread where someone was getting an invalid command queue error when calling clFinish. Those are the same conditions as the Einstein error.

CL_INVALID_COMMAND_QUEUE when clFinish

https://devtalk.nvidia.com/default/topic/911395/cl_invalid_command_queue-when-clfinish/

 

One of the NVIDIA forum moderators referenced a thread suggesting that CL_INVALID_COMMAND_QUEUE can be returned if the app uses dynamic CL sizing and gets the sizing and handshake timing wrong.

CL_INVALID_COMMAND_QUEUE error on clFinish command - a lot of operations in each kernel driver crash

https://devtalk.nvidia.com/default/topic/501409/cl_invalid_command_queue-error-on-clfinish-command-a-lot-of-operations-in-each-kernel-driver-crash/?offset=2
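
If "a lot of operations in each kernel" is really what crashes the driver, the usual host-side workaround is to split one huge launch into smaller slices so that no single kernel run outlasts the watchdog. A rough sketch of that pattern is below; the kernel, sizes, and chunking are illustrative assumptions, not anything taken from the FGRP app.

/* Hedged sketch: enqueue a large 1-D workload in slices so no single
 * launch runs long enough to trip the driver watchdog. TOTAL_ITEMS and
 * CHUNK are made-up sizes for illustration only. */
#include <CL/cl.h>

#define TOTAL_ITEMS ((size_t)1 << 24)
#define CHUNK       ((size_t)1 << 18)

static cl_int run_in_chunks(cl_command_queue queue, cl_kernel kernel)
{
    for (size_t offset = 0; offset < TOTAL_ITEMS; offset += CHUNK) {
        size_t remaining = TOTAL_ITEMS - offset;
        size_t global    = remaining < CHUNK ? remaining : CHUNK;

        cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1,
                                            &offset, &global, NULL,
                                            0, NULL, NULL);
        if (err != CL_SUCCESS)
            return err;

        /* Wait for each slice before queuing the next, so the driver
         * never sees one multi-second kernel. */
        err = clFinish(queue);
        if (err != CL_SUCCESS)   /* a -36 here would mean the queue died */
            return err;
    }
    return CL_SUCCESS;
}

Whether anything like this is feasible inside the FGRP app is for the Einstein developers to judge; it only illustrates the shape of workaround those threads discuss.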

 

I would be very surprised if the problem seen with this generic Einstein GPU app were a compiler or driver problem. I also doubt it will get much attention from NVIDIA, since they view games as their primary market. I am fairly sure the problem can be avoided by modifying the Einstein code.

 

I have joined the Albert@Home project. Albert is where the new Einstein code is developed (I think) and where they will be most interested in solving Einstein code problems.

 

Richard Haselgrove wrote:

Please search for multiple other threads about observed incompatibilities between 'Turing' range cards (including your new RTX 2080 Ti) and the Gamma-ray pulsar binary search #1 application.

The problem is specific to certain data files, but unfortunately these include 2006L (as shown in your task's name).

No-one has found a solution yet (or even identified the exact cause), but several have reported the same error '-36' to NVidia as a bug in the software support (either compiler or runtime) in the drivers for the new cards.

Penguin
Joined: 8 Oct 12
Posts: 14
Credit: 392159017
RAC: 21668

That's the most in-depth bug report I've seen. I'm sure it's a code issue. Thanks for that feedback; I hope someone can put it to use.

 

The only other projects I've been able to get a 2080 working on are PrimeGrid, Enigma, and SETI. Einstein here with the L data set fails, Asteroids fails, and GPUGrid fails; I haven't tried Collatz or much else. I was able to fold successfully with the 2080 for Folding@home, but that isn't a BOINC project any more; it's independent. I'm hoping Einstein fixes it, since it was tearing through work before the data set change.

rjs5
Joined: 3 Jul 05
Posts: 32
Credit: 609171845
RAC: 653906

If you look for the origin of the -36 errors, this particular error started appearing around the introduction of "Gamma-ray pulsar binary search #1 on GPUs v1.17 (FGRPopencl-nvidia)" in late 2016. IMO, an Einstein app logic bug was introduced in v1.17 and has been floating around ever since. Developers always have a hard time going back to fix an old logic bug; they would rather create new stuff.

 

Penguin wrote:

That's the most in-depth bug report I've seen. I'm sure it's a code issue. Thanks for that feedback; I hope someone can put it to use.

 

The only other projects I've been able to get a 2080 working on are PrimeGrid, Enigma, and SETI. Einstein here with the L data set fails, Asteroids fails, and GPUGrid fails; I haven't tried Collatz or much else. I was able to fold successfully with the 2080 for Folding@home, but that isn't a BOINC project any more; it's independent. I'm hoping Einstein fixes it, since it was tearing through work before the data set change.

Lagbolt
Joined: 8 Mar 05
Posts: 1
Credit: 283841076
RAC: 0

I too just updated to the 2080 Ti and have had to suspend Einstein (computation error). There seem to be other BOINC projects that may be affected. I get warnings that the 2080 Ti is reaching 99% GPU usage, and it locks up for a few seconds. When I get a chance I will see if I can isolate the problem.

Thank you. 

 

I found that SETI also pushes the GPU over its limit, right up to the point of sending an error message. Very occasional lockups.

I have five projects working. At the moment I have SETI and Einstein suspended and all is well. I will turn SETI back on as the other tasks complete; the warnings I will put up with.

rjs5
Joined: 3 Jul 05
Posts: 32
Credit: 609171845
RAC: 653906

I had a problem with the way the new Turing Founders Edition board exhausts heat into the case instead of out the back of the machine. My ASUS motherboard was set to all of the manufacturer default options; I explicitly lowered the BIOS CPU maximum temperature setting, and now my only remaining problem is Einstein. Einstein still fails even when it is the only app running and GPU-Z shows no GPU temperature rise or heavy load.

Lagbolt wrote:

I too just updated to the 2080 Ti and have had to suspend Einstein (computation error). There seem to be other BOINC projects that may be affected. I get warnings that the 2080 Ti is reaching 99% GPU usage, and it locks up for a few seconds. When I get a chance I will see if I can isolate the problem.

Thank you. 

 

I found that SETI also pushes the GPU over its limit, right up to the point of sending an error message. Very occasional lockups.

I have five projects working. At the moment I have SETI and Einstein suspended and all is well. I will turn SETI back on as the other tasks complete; the warnings I will put up with.

Penguin
Joined: 8 Oct 12
Posts: 14
Credit: 392159017
RAC: 21668

Lagbolt wrote:

I too just updated to the 2080 Ti and have had to suspend Einstein (computation error). There seem to be other BOINC projects that may be affected. I get warnings that the 2080 Ti is reaching 99% GPU usage, and it locks up for a few seconds. When I get a chance I will see if I can isolate the problem.

Thank you. 

 

I found that SETI also pushes the GPU over its limit, right up to the point of sending an error message. Very occasional lockups.

I have five projects working. At the moment I have SETI and Einstein suspended and all is well. I will turn SETI back on as the other tasks complete; the warnings I will put up with.

 

No problems with SETI for me. The 2080 tears it up, though I think my CPU is a slight bottleneck. I've had SETI run for weeks at a clip without any warning messages. Maybe your card is overheating? That would be odd with SETI, but it could happen.

 

Here at Einstein it seems to be the data set specifically that causes the problem, so it needs to be compared with a working data set to see just what is different. When I first got the 2080 it tore through the work; then the data set changed, the tasks immediately crashed, I posted here, others had the same issues, and it was narrowed down to something in the data set itself. Perhaps something too large that isn't being handled right on the new cards... really, I have no idea exactly what. Fingers crossed it will be solved at some point down the line. It's not limited to Einstein, though; a few other projects also fail to work with the RTX Turing architecture.
