Starting today - Computation error with RTX 2080 GPU

Penguin
Joined: 8 Oct 12
Posts: 14
Credit: 371,836,414
RAC: 159,737
Topic 217173

As the title says, starting today I began getting computation errors on my RTX 2080 GPU just as the tasks begin. PrimeGrid and SETI are working fine. This project was also working fine until I restarted it later today. It had been running for the past week on both a 1070 and a 2080; today it will only work on the 1070, not the 2080.

I saw something similar happen with Asteroids@home and GPUGrid when I first got the 2080, so I set BOINC not to use the 2080 on those projects. As of today I have had to do the same for Einstein@Home. But before today there was never a problem here; things ran smoothly without error.
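The post does not say how the 2080 was excluded. One common way is the BOINC client's `exclude_gpu` option in `cc_config.xml`, so here is a minimal sketch that generates such a file. The helper name `build_exclude_gpu_config`, the device number `0`, and the project URL are assumptions for illustration: the real device number is the one the client reports for the 2080 in its startup messages, and the URL has to match the one the project is attached with.

```python
import xml.etree.ElementTree as ET


def build_exclude_gpu_config(project_url, device_num, gpu_type="NVIDIA"):
    """Return cc_config.xml text that excludes one GPU from one project.

    Hypothetical helper: it only builds the XML text, it does not write
    anything into the BOINC data directory.
    """
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    exclude = ET.SubElement(options, "exclude_gpu")
    ET.SubElement(exclude, "url").text = project_url
    ET.SubElement(exclude, "device_num").text = str(device_num)
    ET.SubElement(exclude, "type").text = gpu_type
    # Pretty-print where available (ET.indent was added in Python 3.9).
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Assumed values: device 0 stands in for the RTX 2080 on this host.
config_text = build_exclude_gpu_config("https://einsteinathome.org/", 0)
print(config_text)
```

The resulting text would go into `cc_config.xml` in the BOINC data directory, and the client needs to re-read its config files or be restarted before the exclusion takes effect.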

So, what changed today?

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1,702,989,778
RAC: 0

The type of data set for those GPU tasks changed. That happens here now and then; it's normal. But RTX 2080s haven't been able to run this type of task successfully so far, only the other type.

There are plenty of messages pointing out that problem:

https://einsteinathome.org/content/pascal-again-available-turing-may-be-coming-soon

https://einsteinathome.org/content/latest-data-file-fgrpb1g-gpu-tasks

Penguin
Joined: 8 Oct 12
Posts: 14
Credit: 371,836,414
RAC: 159,737

Richie wrote:

Type of data set for those GPU tasks changed. That happens here now and then, it's normal. But RTX 2080's haven't been able to run these type of tasks successfully so far, only the other type.

Here's plenty of messages pointing out that problem:

https://einsteinathome.org/content/pascal-again-available-turing-may-be-coming-soon

https://einsteinathome.org/content/latest-data-file-fgrpb1g-gpu-tasks

Yes, I see this in the data file thread:

For any volunteers that have one of the new Turing GPUs (eg RTX 2080, etc.) running under Windows, these new tasks are likely to fail immediately after crunching starts - based on past experience.  If this affects you and you haven't been aware of it previously, you may like to check out the first report at Einstein about the problem.  There is a lot of extra information in all the subsequent posts in that thread.  The cause of the problem has not been identified.

We hope to find out shortly whether the same problem also occurs under Linux.  It would be nice if it doesn't since that might tend to suggest that it's specific to a particular Windows driver which hopefully might be able to be rectified at some stage by nVidia.

Unfortunately it affects this whole new slew of tasks, but fortunately it's a known issue and is limited to tasks from the new data set. There are similar issues on a few other sites regarding this with the Turing cards. It has to do with the way the programs are compiled on those projects; I'm not sure what it could be here.

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6,567
Credit: 292,682,504
RAC: 96,305

Penguin wrote:
There are similar issues on a few other sites regarding this with the Turing cards.  It has to do with the way the programs are compiled on those projects...

Do these projects perchance use OpenCL, to your knowledge?

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Zalster
Joined: 26 Nov 13
Posts: 3,117
Credit: 4,050,672,230
RAC: 0

Keith was testing his new card on Seti, GPUGrid, and Einstein under Linux to see if that made a difference. Both Seti and GPUGrid use CUDA. Seti works and GPUGrid errored out; he hasn't said anything yet about Einstein under Linux.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,867
Credit: 112,187,791,794
RAC: 35,789,749

I bumped into an anonymous host with an RTX 2080 Ti. It seems the work cache size might have been increased quite a bit recently. I noticed a whole lot of new GPU tasks a couple of days ago, all received at much the same time and all for the previous data file. There are no tasks for the new file yet. I hope the owner has a look at the boards and notices the potential problem before it hits.

The CPU is an AMD Threadripper 2950X, so 32 threads. There are also *lots* of CPU tasks, so maybe the client has already gone into panic mode. That might explain why there are no recent GPU tasks. The average turnaround time is listed as 11.7 days, so lots of future grief for that host, even without the new data file problem.

Cheers,
Gary.

Penguin
Joined: 8 Oct 12
Posts: 14
Credit: 371,836,414
RAC: 159,737

Mike Hewson wrote:
Penguin wrote:
There are similar issues on a few other sites regarding this with the Turing cards.  It has to do with the way the programs are compiled on those projects...

Do these projects perchance use OpenCL, to your knowledge ?

Cheers, Mike.

Asteroids@home's app says (cuda55) after its name; it fails on the 2080 and is fine on the 1070.

At PrimeGrid all subprojects work on both the 1070 and the 2080, both CUDA and OpenCL.

SETI@home seems to use OpenCL; the app is called opencl_nvidia_SoG. All tasks work on both the 1070 and the 2080.

GPUGrid only works on the 1070. Their GPU apps say (cuda80), so I guess CUDA.

Milkyway@home works on both the 1070 and the 2080, I think. I can't get any new tasks right now to double-check that, and I'm not sure whether they use CUDA or OpenCL.

So perhaps OpenCL apps are ok?  CUDA apps giving problems with the RTX series.

I don't remember where, possibly at SETI, but there was a post saying the apps needed to be compiled using the latest CUDA versions... I really don't know what that means, where I saw it, or whether I'm repeating it correctly. I just remember seeing something about it on one of the BOINC project sites.

archae86
Joined: 6 Dec 05
Posts: 3,150
Credit: 7,116,464,931
RAC: 573,513

Penguin wrote:
So perhaps OpenCL apps are ok?  CUDA apps giving problems with the RTX series.

I understand the current Einstein Windows application for Gamma-ray Pulsar search to be OpenCL and not CUDA.

One clue is that the executable has the string opencl in the file name:

hsgamma_FGRPB1G_1.20_windows_x86_64__FGRPopencl1K-nvidia.exe

Other clues are that stderr does not include the string CUDA, while it does contain these lines:


boinc_get_opencl_ids returned [0000000001353320 , 0000000001352830]
Using OpenCL platform provided by: NVIDIA Corporation
Using OpenCL device "GeForce RTX 2080" by: NVIDIA Corporation
Max allocation limit: 2147483648
Global mem size: 0
OpenCL device has FP64 support
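
The two clues above (the string in the executable name and the strings in stderr) can be turned into a rough check. This is only a sketch of that heuristic: the function name `guess_gpu_api` is made up for illustration, and a name or log that mentions neither string is reported as unknown rather than guessed.

```python
def guess_gpu_api(app_name="", stderr_text=""):
    """Guess 'OpenCL', 'CUDA', or 'unknown' from an app name and stderr text.

    Heuristic only: it looks for the substrings 'opencl' and 'cuda',
    case-insensitively, first in the name and then in the log text.
    """
    name = app_name.lower()
    log = stderr_text.lower()
    if "opencl" in name or "opencl" in log:
        return "OpenCL"
    if "cuda" in name or "cuda" in log:
        return "CUDA"
    return "unknown"


# Names and log lines taken from this thread.
einstein_exe = "hsgamma_FGRPB1G_1.20_windows_x86_64__FGRPopencl1K-nvidia.exe"
einstein_log = "Using OpenCL platform provided by: NVIDIA Corporation"

print(guess_gpu_api(einstein_exe, einstein_log))  # OpenCL
print(guess_gpu_api("cuda80"))                    # CUDA
print(guess_gpu_api("opencl_nvidia_SoG"))         # OpenCL
```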
Penguin
Joined: 8 Oct 12
Posts: 14
Credit: 371,836,414
RAC: 159,737

archae86 wrote:
Penguin wrote:
So perhaps OpenCL apps are ok?  CUDA apps giving problems with the RTX series.

I understand the current Einstein Windows application for Gamma-ray Pulsar search to be OpenCL and not CUDA.

One clue is that the executable has the string opencl in the file name:

hsgamma_FGRPB1G_1.20_windows_x86_64__FGRPopencl1K-nvidia.exe

Other clues are that stderr does not include the string CUDA, while it does contain these lines:


boinc_get_opencl_ids returned [0000000001353320 , 0000000001352830]
Using OpenCL platform provided by: NVIDIA Corporation
Using OpenCL device "GeForce RTX 2080" by: NVIDIA Corporation
Max allocation limit: 2147483648

Global mem size: 0
OpenCL device has FP64 support

OK, so that rules out CUDA apps alone being the issue, then...
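
To make that reasoning concrete, here is a small tally of what has been reported in this thread for the RTX 2080 under Windows. It is only a sketch: Milkyway@home is left out because neither its API nor its result was confirmed, and the names `observations` and `outcomes_by_api` are just for illustration.

```python
from collections import defaultdict

# (project, API, works on the RTX 2080?) as reported in this thread.
observations = [
    ("Asteroids@home", "CUDA", False),
    ("GPUGrid", "CUDA", False),
    ("PrimeGrid", "CUDA", True),
    ("PrimeGrid", "OpenCL", True),
    ("SETI@home", "OpenCL", True),
    ("Einstein@Home (new data set)", "OpenCL", False),
]


def outcomes_by_api(rows):
    """Collect the set of outcomes (True/False) seen for each API."""
    result = defaultdict(set)
    for _project, api, works in rows:
        result[api].add(works)
    return dict(result)


summary = outcomes_by_api(observations)

# If the API alone decided success, each API would show a single outcome.
api_alone_explains = all(len(outcomes) == 1 for outcomes in summary.values())
print(api_alone_explains)  # False
```

Both CUDA and OpenCL show working and failing apps, so the API by itself does not separate the two groups.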

CElliott
Joined: 9 Feb 05
Posts: 28
Credit: 986,836,693
RAC: 1,346

I have an RTX 2070 and am experiencing this problem.  Every time a WU is aborted this message appears in the Event Log: "Display driver nvlddmkm stopped responding and has successfully recovered."  A similar message pops up in the system tray.

When a similar problem occurred in early November, I contacted NVidia tech support, which seemed only too eager to help solve the problem. However, how is NVidia supposed to get hold of a faulty WU with which to reproduce the problem?

So, my question is, is anyone at Einstein@Home working with NVidia to get them to fix whatever is ailing the display driver WRT the RTX 2070 and 2080?

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2,142
Credit: 2,824,932,900
RAC: 1,017,441

CElliott wrote:
So, my question is, is anyone at Einstein@Home working with NVidia to get them to fix whatever is ailing the display driver WRT the RTX 2070 and 2080?

archae86 has a portable test case and a bug number from NVidia driver feedback; see https://einsteinathome.org/content/pascal-again-available-turing-may-be-coming-soon?page=6#comment-167615. You might consider joining forces with him.
