Latest data file for FGRPB1G GPU tasks

Keith Myers
Joined: 11 Feb 11
Posts: 4704
Credit: 17549768708
RAC: 6432710

Understand, Bernd.  And thanks for the clear answer.  I have removed my gpu_exclude for my RTX 2080 for now and will watch out for any of the offending task types and hopefully abort them in time.

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7024864931
RAC: 1812031

Keith Myers wrote:
I have removed my gpu_exclude for my RTX 2080 for now and will watch out for any of the offending task types and hopefully abort them in time.

My RTX 2080 has come up off the floor and has been back in the box for some days now.  In my case, it currently shares the box with a GTX 1070.  My (relatively) low-labor procedure for processing the non-Turing tasks is:

1. I carry an appreciable queue depth, so I have well over a day's time in which to detect and act on the troublesome tasks.

2. Once a day I go to the Tasks tab in BoincTasks for this machine, double-click the ready to start list to expand it, and sort it by task name.  That currently puts any offending tasks of the 21nnL flavor at one end of the list, and any of the 0104? flavor at the other end.

3. If it were a Turing-only machine, I'd just abort any offending tasks at that point, but as I have a Pascal card in the machine, I instead suspend all tasks older than the one I wish to process, then suspend the currently executing task on the Pascal card.  As the offending task is then promptly started on the Pascal card, I can undo all suspensions and go on about my other interests until the next time I choose to reduce my offending task inventory.

They have come in irregular clusters.  The scheduler tends to punish owners of machines with dissimilar cards by requesting no work for a while, then getting a big gulp.  When some other host that also gulps work but misses its deadlines has just generated a bunch of "time's up" returns, I've gotten half a dozen or more at a time, and then gone days with none.

I'm just happy that my Turing is working for Einstein again, and that Bernd sees some daylight for possible ways it might continue to do so with less operator intervention.

Keith Myers
Joined: 11 Feb 11
Posts: 4704
Credit: 17549768708
RAC: 6432710

I set NNT all the time for Einstein, then when I am getting low, I request work.  I can then review the tasks received for the offending types and abort them then and there.  Depending on the work mix, I can then set NNT again if I got enough work, or I can let the scheduler get me more work and hope the next download gains me more of the good type.

It requires manual intervention, as your method does, but my resource share is low enough that a slug of Einstein work will last for several days . . . .  as long as Seti doesn't have any extended upsets.  (wishful thinking)

DanNeely
Joined: 4 Sep 05
Posts: 1364
Credit: 3562358667
RAC: 1580

Bernd Machenschalk wrote:
Keith Myers wrote:
So is this the new plan of attack for Turing/Volta?  Prevent sending incompatible tasks vice fixing the application itself?

Actually we will do both. The limiting factor is manpower, and producing and sending compatible workunits is the most efficient thing we can do right now.

The reason for this problem appears to be a new feature of Turing/Volta ("independent thread scheduling"), which we have very limited control of in OpenCL. We might again intensify our efforts to develop a CUDA version that will likely give us more performance on NVidia cards and solve this problem as well. But we need more time for this than what we currently have.

Have you been in contact with NVidia about the issue?  I know someone on the board opened a trouble ticket with them over it crashing on Turing and IIRC got it escalated to engineering support.  That suggests that they care about the failures at some level and may be able to provide assistance, either by helping with a workaround or by getting sufficient detail of the code from you to understand what needs to be changed in their OpenCL libraries if it needs to be fixed on their end.

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6534
Credit: 284730859
RAC: 105773

My quick read of "independent thread scheduling" suggests that it may leave open the possibility of deadlock occurring with locks/mutexes, thus leaving some other threads perpetually waiting for access they will never get. As each thread in a warp now has its own program counter and call stack, opportunities for such 'stalls' abound. This requires someone ( developer if CUDA ) or something ( scheduler if OpenCL ) to be smart enough to avoid this problem. As NVidia implements OpenCL using CUDA, it is a case of CUDA advances outpacing their current OpenCL library development. Sucks.

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4267
Credit: 244933518
RAC: 16300

Bernd Machenschalk wrote:
We'll try to figure out a way to prevent the scheduler from sending older tasks to such cards.

This should be in effect now. Such cards should get app versions with plan class "FGRPopenclTV-nvidia" and tasks from "old" WUs for these app versions should be rejected.

BM

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4267
Credit: 244933518
RAC: 16300

There is still the possibility that the problem in the app has nothing to do with the "independent thread scheduling", but I increasingly doubt it.

If you tell NVidia's OpenCL compiler not to optimize at all (by passing "-cl-opt-disable"), a previously problematic task runs through (although noticeably slower). [FWIW even with optimization level 1 (instead of the default 3) the task locks up.]

BM
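
For readers who have not passed build options to an OpenCL compiler before, here is a minimal C++ sketch of how such a string might be assembled. It is not taken from the Einstein@Home application: the helper name build_options and its parameters are made up for illustration. The only thing it relies on from above is the standard OpenCL build option "-cl-opt-disable".

```cpp
#include <string>

// Hypothetical helper (not from the Einstein@Home app): assemble the
// options string that is handed to the OpenCL program build.
// "-cl-opt-disable" is the standard OpenCL option that turns off all
// compiler optimizations.
std::string build_options(bool disable_optimizations,
                          const std::string& extra = "") {
    std::string opts = extra;
    if (disable_optimizations) {
        if (!opts.empty()) opts += ' ';   // separate from any earlier options
        opts += "-cl-opt-disable";
    }
    return opts;
}

// With the OpenCL headers available, the result would be passed as the
// fourth argument of clBuildProgram, for example:
//   clBuildProgram(program, 1, &device, opts.c_str(), nullptr, nullptr);
```

Calling build_options(true) yields just the disable flag, while build_options(false) leaves the compiler at its default optimization behavior.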

Keith Myers
Joined: 11 Feb 11
Posts: 4704
Credit: 17549768708
RAC: 6432710

How does the scheduler handle mixed Pascal and Turing cards in the same host?  Will it see a Turing card and just send "FGRPopenclTV-nvidia" only type from then on?

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7024864931
RAC: 1812031

Keith Myers wrote:
How does the scheduler handle mixed Pascal and Turing cards in the same host?  Will it see a Turing card

For some purposes, the BOINC software seems to report only one type of card.  If the cards are not at the same Compute capability level, it is the higher one that gets reported.  If so, in a mixed system a Turing card (level 7.n) would get mentioned rather than Pascal (level 6.n), and only Turing-suitable tasks should be downloaded.

Of course, I'm half guessing and would be happy to be corrected.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2139
Credit: 2753008030
RAC: 1372919

Good guess - you're right, plus there are extra factors considered lower down the priority order.

// return 1/-1/0 if device 1 is more/less/same capable than device 2.
// factors (decreasing priority):
// - compute capability
// - software version
// - memory
// - speed

from https://github.com/BOINC/boinc/blob/master/client/gpu_nvidia.cpp#L134
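
To see how that priority order plays out for a mixed Pascal/Turing host, here is a minimal C++ sketch of a comparison that follows the four factors listed in the comment. It is not the BOINC source: the GpuInfo struct, its field names, and the functions three_way, compare_gpus and most_capable are all assumptions made for illustration, and the real client may represent and weigh these values differently.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical device record; the field names are illustrative, not BOINC's.
struct GpuInfo {
    int cc_major;        // compute capability, major part (7 for Turing)
    int cc_minor;        // compute capability, minor part
    int driver_version;  // software version as one comparable integer
    double mem_bytes;    // total device memory
    double peak_flops;   // estimated peak speed
};

// Three-way compare of a single factor: 1 if a > b, -1 if a < b, else 0.
template <typename T>
int three_way(const T& a, const T& b) {
    return (a > b) - (a < b);
}

// Returns 1/-1/0 if d1 is more/less/equally capable than d2, checking the
// factors in decreasing priority and stopping at the first difference.
int compare_gpus(const GpuInfo& d1, const GpuInfo& d2) {
    if (int c = three_way(d1.cc_major, d2.cc_major)) return c;
    if (int c = three_way(d1.cc_minor, d2.cc_minor)) return c;
    if (int c = three_way(d1.driver_version, d2.driver_version)) return c;
    if (int c = three_way(d1.mem_bytes, d2.mem_bytes)) return c;
    return three_way(d1.peak_flops, d2.peak_flops);
}

// Index of the most capable device; the caller must pass a non-empty list.
std::size_t most_capable(const std::vector<GpuInfo>& gpus) {
    auto best = std::max_element(
        gpus.begin(), gpus.end(),
        [](const GpuInfo& a, const GpuInfo& b) {
            return compare_gpus(a, b) < 0;
        });
    return static_cast<std::size_t>(std::distance(gpus.begin(), best));
}
```

With a GTX 1070 (compute capability 6.1) and an RTX 2080 (7.5) in the list, the first factor already decides the outcome, so the Turing card is the one picked regardless of memory or speed.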
