Gamma ray GPU tasks hanging?

Keith Myers
Joined: 11 Feb 11
Posts: 4964
Credit: 18717093762
RAC: 6390842

ServicEnginIC wrote:

Unfortunately, early this morning I aborted several hanging tasks that were reporting 100% but not finishing on my Linux hosts.
I thought then that it was a defective batch.
You'll find these tasks as Error - Aborted tasks for each host.
Some of them were running for more than 30000 seconds...

Same here. Aborting all the bad tasks.

 

San-Fernando-Valley
Joined: 16 Mar 16
Posts: 401
Credit: 10156373455
RAC: 25758922

GR tasks:

Win7 and Win10

Newest drivers.

All GR tasks get a computation error.

The Nvidia driver recovers several times, then the PCs get a BSOD with driver error 116.

Have aborted ALL tasks.

Hope tech guys are up and running!

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2956469768
RAC: 715134

One of my Mint machines is still at Mint 19.1 (Ubuntu 18.04), but I've just started a pair of 8-day CPDN tasks on it. I could try a driver roll-back if it's really important, but I'd prefer not to.

I can confirm that the faulting tasks run OK on a Windows GeForce GTX 1050 Ti with driver: 442.74, and on the HD 530 intel_gpu on the same machine (host 12496320).

solling2
Joined: 20 Nov 14
Posts: 219
Credit: 1577531314
RAC: 20331

FWIW, the following setting is doing fine:


GPU:NVIDIA GeForce GTX 1060 3GB (3016MB) driver: 450.10
OS:Linux LinuxMint Linux Mint 19.3 Tricia [5.4.0-62-generic|libc 2.27 (Ubuntu GLIBC 2.27-3ubuntu1.4)]
BOINC client version:7.9.3

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3945
Credit: 46769662642
RAC: 64079292

Richard Haselgrove wrote:

One of my Mint machines is still at Mint 19.1 (Ubuntu 18.04), but I've just started a pair of 8-day CPDN tasks on it. I could try a driver roll-back if it's really important, but I'd prefer not to.

I can confirm that the faulting tasks run OK on a Windows GeForce GTX 1050 Ti with driver: 442.74, and on the HD 530 intel_gpu on the same machine (host 12496320).

yeah, no worries if you don't want to. I was just reaching for some extra data points. this situation feels similar to the issue seen on SETI last year with the SoG tasks on certain nvidia drivers. so far, this particular issue seems to be driver-agnostic, but you never know. If someone is able to give a conclusive answer it would just be another variable to rule out.

_________________________________________________________________________

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3945
Credit: 46769662642
RAC: 64079292

solling2 wrote:

FWIW, the following setting is doing fine:


GPU:NVIDIA GeForce GTX 1060 3GB (3016MB) driver: 450.10
OS:Linux LinuxMint Linux Mint 19.3 Tricia [5.4.0-62-generic|libc 2.27 (Ubuntu GLIBC 2.27-3ubuntu1.4)]
BOINC client version:7.9.3

the issue seems to only affect Volta/Turing/Ampere cards. anything Pascal and earlier seems to be unaffected. your GTX 1060 is Pascal.

  • RTX 30-series : Ampere
  • RTX 20-series/GTX 16-series : Turing
  • TitanV : Volta
  • GTX 10-series : Pascal
  • GTX 9-series (& GTX 750ti*): Maxwell
  • GTX 7-series (most*) : Kepler
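
If you want to check a host quickly, the list above can be turned into a small lookup. This is only a rough sketch in Python: the function names `architecture` and `is_affected` are made up for illustration, the model-number ranges simply mirror the list, and the "affected" set reflects what has been observed in this thread so far, not a confirmed diagnosis.

```python
import re

# Architectures reported as failing in this thread (an observation, not a confirmed cause).
AFFECTED = {"Volta", "Turing", "Ampere"}


def architecture(gpu_name):
    """Return the architecture for a GeForce model string, or None if unknown.

    The ranges below only mirror the list in this post; they are not a
    complete database of NVIDIA products.
    """
    name = gpu_name.upper()
    if "TITANV" in name.replace(" ", ""):
        return "Volta"
    match = re.search(r"(RTX|GTX)\s*(\d{3,4})", name)
    if match is None:
        return None
    family, model = match.group(1), int(match.group(2))
    if family == "RTX":
        if 3000 <= model < 4000:
            return "Ampere"
        if 2000 <= model < 3000:
            return "Turing"
        return None
    # GTX cards
    if 1600 <= model < 1700:
        return "Turing"
    if 1000 <= model < 1100:
        return "Pascal"
    if 900 <= model < 1000:
        return "Maxwell"
    if model == 750:
        return "Maxwell"  # the 7-series exception noted in the list
    if 700 <= model < 800:
        return "Kepler"  # "most" 7-series cards
    return None


def is_affected(gpu_name):
    """True if the card belongs to an architecture reported as failing here."""
    return architecture(gpu_name) in AFFECTED


if __name__ == "__main__":
    for card in ("NVIDIA GeForce GTX 1060 3GB", "GeForce GTX 1660 Ti"):
        print(card, architecture(card), is_affected(card))
```

The GPU string can be copied straight from the BOINC event log or the host details page; anything the lookup does not recognise comes back as `None` rather than a guess.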

 

_________________________________________________________________________

Anonymous

Ian&Steve C. wrote:

can anyone with a Linux host and older kernel (5.4 or earlier) try the nvidia driver from the 440 generation or older?

I have kernel 5.8+ and it seems I can't install the older driver on this kernel.

could this be similar to the driver issue on ubuntu 20 and amd drivers here:

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3945
Credit: 46769662642
RAC: 64079292

No. Totally separate issue. 

_________________________________________________________________________

alanb1951
Joined: 28 Nov 16
Posts: 23
Credit: 730103669
RAC: 380413

Ian&Steve C. wrote:

can anyone with a Linux host and older kernel (5.4 or earlier) try the nvidia driver from the 440 generation or older?

I have kernel 5.8+ and it seems I can't install the older driver on this kernel.

I've got a GTX 1660 Ti on a Ryzen 3700X running Ubuntu 18.04 (kernel 5.3) with version 440.10 drivers, and it was failing these tasks. Hope that's old enough for a useful data point!

That machine is, of course, on NNT and having an E@H holiday at present (running GW doesn't seem to go well with CPDN and some WCG CPU stuff, so I don't switch over...)

Cheers - Al.

 

 

Keith Myers
Joined: 11 Feb 11
Posts: 4964
Credit: 18717093762
RAC: 6390842

Thanks for the new data point. That is a Turing card, which is one of the problem architectures.

But you have an older kernel and an older driver, a combination we were hoping would not be affected.

So it is beginning to look as though the OS, kernel and drivers are not the problem. It looks like the hardware architecture is the root cause.

I wonder if one of the task parameters is not able to handle the CC level or the commands available to the latest architectures of Volta/Turing/Ampere.

 
