Gamma ray GPU tasks hanging?

Keith Myers
Joined: 11 Feb 11
Posts: 4704
Credit: 17547585520
RAC: 6423592

ServicEnginIC wrote:

Unfortunately, early this morning I aborted several hanging tasks that were reporting 100% but not finishing on my Linux hosts.
I thought then that it was a defective batch.
You'll find these tasks as Error - Aborted tasks for each host.
Some of them were running for more than 30000 seconds...

Same here. Aborting all the bad tasks.

 

San-Fernando-Valley
Joined: 16 Mar 16
Posts: 260
Credit: 6914861637
RAC: 20673728

GR tasks:

Win7 and Win10

Newest drivers.

All GR tasks get a computation error.

The Nvidia driver recovers several times, then the PCs get a BSOD with driver error 116.

Have aborted ALL tasks.

Hope tech guys are up and running!

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2139
Credit: 2752853842
RAC: 1405366

One of my Mint machines is still at Mint 19.1 (Ubuntu 18.04), but I've just started a pair of 8-day CPDN tasks on it. I could try a driver roll-back if it's really important, but I'd prefer not to.

I can confirm that the faulting tasks run OK on a Windows GeForce GTX 1050 Ti with driver: 442.74, and on the HD 530 intel_gpu on the same machine (host 12496320).

solling2
Joined: 20 Nov 14
Posts: 219
Credit: 1563276629
RAC: 47870

FWIW, the following setting is doing fine:


GPU:NVIDIA GeForce GTX 1060 3GB (3016MB) driver: 450.10
OS:Linux LinuxMint Linux Mint 19.3 Tricia [5.4.0-62-generic|libc 2.27 (Ubuntu GLIBC 2.27-3ubuntu1.4)]
BOINC client version:7.9.3

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3681
Credit: 33839232870
RAC: 37218939

Richard Haselgrove wrote:

One of my Mint machines is still at Mint 19.1 (Ubuntu 18.04), but I've just started a pair of 8-day CPDN tasks on it. I could try a driver roll-back if it's really important, but I'd prefer not to.

I can confirm that the faulting tasks run OK on a Windows GeForce GTX 1050 Ti with driver: 442.74, and on the HD 530 intel_gpu on the same machine (host 12496320).

Yeah, no worries if you don't want to. I was just reaching for some extra data points. This situation feels similar to the issue seen on SETI last year with the SoG tasks on certain Nvidia drivers. So far, this particular issue seems to be driver-agnostic, but you never know. If someone is able to give a conclusive answer, it would just be another variable to rule out.


Ian&Steve C.
Joined: 19 Jan 20
Posts: 3681
Credit: 33839232870
RAC: 37218939

solling2 wrote:

FWIW, the following setting is doing fine:


GPU:NVIDIA GeForce GTX 1060 3GB (3016MB) driver: 450.10
OS:Linux LinuxMint Linux Mint 19.3 Tricia [5.4.0-62-generic|libc 2.27 (Ubuntu GLIBC 2.27-3ubuntu1.4)]
BOINC client version:7.9.3

The issue seems to affect only Volta/Turing/Ampere cards. Anything Pascal and earlier seems to be unaffected. Your GTX 1060 is Pascal.

  • RTX 30-series: Ampere
  • RTX 20-series / GTX 16-series: Turing
  • Titan V: Volta
  • GTX 10-series: Pascal
  • GTX 9-series (and GTX 750 Ti*): Maxwell
  • GTX 7-series (most*): Kepler
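
For anyone who wants to check a host list against the table above, here is a minimal Python sketch. The helper names (`guess_architecture`, `is_affected`) and the regex rules are hypothetical and only cover the GeForce/Titan names listed above; the "affected" set simply reflects what has been reported in this thread so far, not a confirmed diagnosis.

```python
import re

# Architectures reported in this thread as failing the tasks (assumption, not confirmed).
AFFECTED = {"Volta", "Turing", "Ampere"}


def guess_architecture(name):
    """Guess the NVIDIA architecture from a card's marketing name.

    Hypothetical helper: it only knows the families in the list above
    and returns None for anything else.
    """
    n = name.upper().replace(" ", "")
    if "TITANV" in n:
        return "Volta"
    if re.search(r"RTX30\d\d", n):
        return "Ampere"
    if re.search(r"RTX20\d\d", n) or re.search(r"GTX16\d\d", n):
        return "Turing"
    if re.search(r"GTX10\d\d", n):
        return "Pascal"
    # The GTX 750 / 750 Ti are 7-series cards built on Maxwell, so check them first.
    if re.search(r"GTX9\d\d", n) or "GTX750" in n:
        return "Maxwell"
    if re.search(r"GTX7\d\d", n):
        return "Kepler"
    return None


def is_affected(name):
    """True if the guessed architecture is in the reported-affected set."""
    return guess_architecture(name) in AFFECTED


if __name__ == "__main__":
    for card in ["GTX 1060 3GB", "GTX 1660 Ti", "GTX 1050 Ti", "RTX 3070"]:
        print(card, guess_architecture(card), is_affected(card))
```

Unknown names come back as `None` and are treated as not affected, so treat that result as "no information" rather than a clean bill of health.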

 


robl
Joined: 2 Jan 13
Posts: 1709
Credit: 1454483283
RAC: 8517

Ian&Steve C. wrote:

can anyone with a Linux host and older kernel (5.4 or earlier) try the nvidia driver from the 440 generation or older?

I have kernel 5.8+ and it seems I can't install the older driver on this kernel.

Could this be similar to the driver issue with Ubuntu 20 and AMD drivers here:

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3681
Credit: 33839232870
RAC: 37218939

No. Totally separate issue. 


alanb1951
Joined: 28 Nov 16
Posts: 18
Credit: 642013783
RAC: 423416

Ian&Steve C. wrote:

can anyone with a Linux host and older kernel (5.4 or earlier) try the nvidia driver from the 440 generation or older?

I have kernel 5.8+ and it seems I can't install the older driver on this kernel.

I've got a GTX 1660 Ti on a Ryzen 3700X running Ubuntu 18.04 (kernel 5.3) with version 440.10 drivers, and it was failing these - hope that's old enough for a useful data point!

That machine is, of course, on NNT and having an E@H holiday at present (running GW doesn't seem to go well with CPDN and some WCG CPU stuff, so I don't switch over...)

Cheers - Al.

 

 

Keith Myers
Joined: 11 Feb 11
Posts: 4704
Credit: 17547585520
RAC: 6423592

Thanks for the new data point. That is a Turing card, which is one of the problem architectures.

But you have an older kernel and an older driver, a combination we were hoping would not be affected.

So it is beginning to look like the OS, kernel and drivers are not the problem. It looks like the hardware architecture is the root cause.

I wonder if one of the task parameters is not able to handle the CC (compute capability) level or the commands available on the latest architectures, Volta/Turing/Ampere.
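
To put the compute capability (CC) idea in concrete terms, here is a rough Python sketch. The CC values are NVIDIA's published figures for typical GeForce/Titan chips of each architecture; the "CC 7.0 and up" cut-off and the `is_suspect` helper are assumptions drawn only from the reports in this thread, not something the project has confirmed.

```python
# Published CUDA compute capability (major, minor) for typical
# GeForce/Titan chips of each architecture.
COMPUTE_CAPABILITY = {
    "Kepler": (3, 0),
    "Maxwell": (5, 0),
    "Pascal": (6, 1),
    "Volta": (7, 0),
    "Turing": (7, 5),
    "Ampere": (8, 6),
}

# Assumed cut-off: every failing card reported so far is CC 7.0 or higher.
SUSPECT_THRESHOLD = (7, 0)


def is_suspect(architecture, threshold=SUSPECT_THRESHOLD):
    """True if the architecture's CC is at or above the assumed cut-off."""
    # Tuples compare element by element, so (7, 5) >= (7, 0) is True.
    return COMPUTE_CAPABILITY[architecture] >= threshold


if __name__ == "__main__":
    for arch in sorted(COMPUTE_CAPABILITY, key=COMPUTE_CAPABILITY.get):
        cc = COMPUTE_CAPABILITY[arch]
        print(f"{arch}: CC {cc[0]}.{cc[1]} suspect={is_suspect(arch)}")
```

If a Pascal (CC 6.1) host ever fails the same way, or a Turing host runs clean, that would be a data point against a simple CC cut-off.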

 
