Gamma ray GPU tasks hanging?

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7221394931
RAC: 977068


Ian&Steve C. wrote:

Looks like San-Fernando-Valley's Titan V (Volta architecture, precursor to Turing) on Windows is having issues with these new tasks too, so it's not just me.

https://einsteinathome.org/host/12284149/tasks/6/0

It's looking more likely to be an issue with these tasks on certain architectures.

Oddly enough, I was just about to post about trouble on a different one of that same user's machines:

fast errors on host 12675571

The failure syndrome is utterly different from the one you reported on your four machines. In this case the tasks error out quickly, after about 22 seconds of elapsed time on average.

Possibly interesting stderr entries for this machine include:

ERROR: /home/bema/source/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error 7261931
21:36:33 (5124): [CRITICAL]: ERROR: MAIN() returned with error '-36'
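
For reference, OpenCL status -36 is CL_INVALID_COMMAND_QUEUE in cl.h; it usually means the command queue was invalidated underneath the call, for example by a driver reset or a faulting kernel. A minimal sketch of the kind of check that prints a line like the one above (my own illustration with a hypothetical function name, not the actual Einstein@Home source):

#include <CL/cl.h>
#include <stdio.h>

/* Block until all queued work on "queue" has finished and report failures.
 * A return value of -36 from clFinish() is CL_INVALID_COMMAND_QUEUE. */
int finish_or_report(cl_command_queue queue)
{
    cl_int status = clFinish(queue);
    if (status != CL_SUCCESS) {
        fprintf(stderr, "ERROR: clFinish failed. status=%d\n", (int)status);
        return -1;
    }
    return 0;
}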

This failing machine is a Windows 7 machine with this GPU reported:

GeForce GTX 1650 (4095MB)

So, yes, Ian&Steve, all of us seem to see these tasks as different, and it is not only you who get failures on them.  But the failing behavior so far falls in at least two very different buckets.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3946
Credit: 46777172642
RAC: 64125649


archae86 wrote:

 

This failing machine is a Windows 7 machine with this GPU reported:

GeForce GTX 1650 (4095MB)

So, yes, Ian&Steve, all of us seem to see these tasks as different, and it is not only you who get failures on them.  But the failing behavior so far falls in at least two very different buckets.

Read the stderr.txt: it is the 'TITAN V' failing in that manner. This different failure mode is likely because it is a Volta card rather than Turing. All of my cards are Turing, as are Dotious's, who sees the same "failure" mode of tasks that never seem to finish.

https://einsteinathome.org/task/1061641589

Using OpenCL device "TITAN V" by: NVIDIA Corporation

 

In mixed-GPU systems, BOINC only reports what it deems to be the "best" GPU, and in the case of Nvidia the first level of the hierarchy is the CC (compute capability) value. A GTX 1650 has a CC of 7.5, whereas a Titan V (undoubtedly the 'better' card) has a CC of 7.0. Blame it on poor BOINC sorting logic, but you should always check the stderr.txt to see what device ACTUALLY ran the job.
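
As a rough illustration (my own sketch, not BOINC's actual selection code), ranking Nvidia GPUs purely by compute capability puts the GTX 1650 (CC 7.5) ahead of the Titan V (CC 7.0), even though the Titan V is by far the stronger card:

#include <stdio.h>

/* Toy model of "pick the best GPU by compute capability alone" --
 * illustration only, not BOINC's real code. */
struct gpu { const char *name; int cc_major; int cc_minor; };

static int outranks_by_cc(const struct gpu *a, const struct gpu *b)
{
    if (a->cc_major != b->cc_major) return a->cc_major > b->cc_major;
    return a->cc_minor > b->cc_minor;
}

int main(void)
{
    struct gpu titan_v  = { "TITAN V",          7, 0 };
    struct gpu gtx_1650 = { "GeForce GTX 1650", 7, 5 };

    const struct gpu *best = outranks_by_cc(&gtx_1650, &titan_v) ? &gtx_1650 : &titan_v;
    printf("\"best\" GPU by CC alone: %s\n", best->name);  /* prints the GTX 1650 */
    return 0;
}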

 

--edit-- To amend: I see a few fast failures from that system that do appear to have tried to run on the 1650, and the difference in failure mode can probably also come down to OS or drivers.

 

One thing is clear: it all started with these new tasks.


Keith Myers
Joined: 11 Feb 11
Posts: 4964
Credit: 18717786571
RAC: 6387963


I have a slew of these tasks in the queue. Two of my hosts are mixed generations, with a single Pascal and dual or triple Turing cards. I expect all the tasks that run on the Turing cards to fail and the tasks on the Pascal to run correctly.

The other host is all Turing, so they will all fail. I think I will let a few go through to verify the bad batch of work, and then abort all these LATeah3001L tasks. Then I will set NNT and crunch nothing but MW, which is running fine.

I wish there were a way to get out of the mixed DCF (duration correction factor) problem when switching between GR and GW.

 

Dotious
Joined: 5 Dec 20
Posts: 4
Credit: 94096447
RAC: 137648


I suspended my problematic task. It reverted to 0.000% when I resumed and continued to exhibit the stall behavior. I suspended it again, transferred it to the GTX 970, and the task completed in typical time. I'll switch over to GW only as well.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7221394931
RAC: 977068


As of a few minutes ago, my three hosts had 23 validations of tasks from this "different" set.

Looking at my quorum partners, and omitting those reporting more than one GPU, I found these flavors which ran at least one of these tasks successfully:

Win10 RX Vega
Linux Radeon VII
Win10 GTX 1050 Ti
Win10 GTX 1070 Ti
Win10 RX 580
Win10 GTX 1080
Win10 GTX 1060 6GB
Win10 RX 580 Series
Win10 GTX 1060 3GB
Win10 GTX 1050 Ti
Darwin Radeon Pro 580 Compute Engine
Win10 GeForce GTX 950
Win10 GTX 1070
Linux  GTX 970
Win7 Radeon HD 7700/R7 250X/R9 255 series (Capeverde)
Darwin Radeon RX Vega 56 Compute Engine
Win10 RX 5700

I did not check any further to see whether they all experience the dramatic speedup in completion times that I've seen on all three of my own machines and on the first few successful machines. I suspect they all do.

I did not check in any of these cases save my own to see whether the machines were uniformly successful on this type of task, or just happened to succeed on the one for which they joined a quorum with me.

If this type of task dominates the work sent from now through the weekend, the usual weekend upload congestion may be worse: more resends because of failed tasks, and more tasks requested by the suddenly more productive successful machines.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3946
Credit: 46777172642
RAC: 64125649


All of the Nvidia GPUs in that list are older GPUs (Pascal or earlier), so it's no surprise they were OK. We have enough evidence so far to say conclusively that the issue seems to lie with these tasks when run on Volta/Turing/Ampere cards (compute capability 7.0 and above), on either Windows or Linux.
 

Now, if we can find an example of one of these tasks that processed successfully on Volta/Turing/Ampere, that would be some new info to digest.


Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2956546427
RAC: 714967


I run machines under both Windows and Linux (Win 7 and Mint) - mainly with 1660 and 1650 cards.

I'm seeing the Windows tasks fail after 20-odd seconds with a driver crash, while the Linux tasks get into an endless loop and report only pseudo-progress.

Will investigate further during the day.

Edit - further details. Both reports come from "GeForce GTX 1660 SUPER" cards.

Windows

ERROR: /home/bema/source/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error 7261931
00:42:36 (82296): [CRITICAL]: ERROR: MAIN() returned with error '-36'

Linux

Exit status:197 (0x000000C5) EXIT_TIME_LIMIT_EXCEEDED

Additional error:

Warning:  Program terminating, but clFFT resources not freed.
Please consider explicitly calling clfftTeardown( ).
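
That warning comes from the clFFT library itself: it is printed when the process exits without calling clfftTeardown() after clfftSetup(), which is exactly what you would expect when BOINC kills the task for exceeding its time limit. A minimal sketch of the normal pairing (my own sketch, not the Einstein@Home application's actual code):

#include <clFFT.h>

/* Set up clFFT, do some (omitted) transform work, and tear down again.
 * Exiting without clfftTeardown() is what triggers the warning above. */
int run_fft_work(void)
{
    clfftSetupData setup;
    clfftInitSetupData(&setup);
    if (clfftSetup(&setup) != CLFFT_SUCCESS)
        return -1;

    /* ... create plans, enqueue transforms, wait for results ... */

    clfftTeardown();
    return 0;
}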

 

Freewill
Joined: 10 Oct 11
Posts: 387
Credit: 6436448348
RAC: 3977238


I am seeing these tasks run in normal time and validate on an Nvidia Pascal series GPU (1080Ti) running on Ubuntu 20.04:

https://einsteinathome.org/task/1061652336

But they are running forever, at less than half the usual power consumption, on Turing GPUs (2070 Super) on the same PC.

ServicEnginIC
Joined: 27 Apr 12
Posts: 1
Credit: 278643590
RAC: 14366


Unfortunately, early this morning I aborted several hanging tasks that were reporting 100% but not finishing on my Linux hosts.
I thought at the time that it was a defective batch.
You'll find these tasks listed as Error - Aborted tasks for each host.
Some of them had been running for more than 30,000 seconds...

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3946
Credit: 46777172642
RAC: 64125649


Can anyone with a Linux host and an older kernel (5.4 or earlier) try an Nvidia driver from the 440 generation or older?

I have kernel 5.8+ and it seems I can't install the older driver on this kernel.

