Latest data file for FGRPB1G GPU tasks

Keith Myers

Joined: 11 Feb 11

Posts: 5063

Credit: 19358755487

RAC: 8163759

12 Feb 2019 1:05:04 UTC

Message 169451

Understood, Bernd, and thanks for the clear answer. I have removed my gpu_exclude for my RTX 2080 for now and will watch for any of the offending task types and, hopefully, abort them in time.

archae86

Joined: 6 Dec 05

Posts: 3165

Credit: 7412901687

RAC: 1927897

12 Feb 2019 1:56:48 UTC

Message 169452 in response to message 169451

Keith Myers wrote:

I have removed my gpu_exclude for my RTX 2080 for now and will watch out for any of the offending task types and hopefully abort them in time.

My RTX 2080 has come up off the floor and has been back in the box for some days now. In my case, it currently shares the box with a GTX 1070. My (relatively) low-labor procedure for processing the non-Turing tasks is:

1. I carry an appreciable queue depth, so I have well over a day's time in which to detect and act on the troublesome tasks.

2. Once a day I go to the Tasks tab in BoincTasks for this machine, double-click the ready-to-start list to expand it, and sort it by task name. That currently puts any offending tasks of the 21nnL flavor at one end of the list and any of the 0104? flavor at the other end.

3. If it were a Turing-only machine, I'd just abort any offending tasks at that point, but as I have a Pascal card in the machine, I instead suspend all tasks older than one I wish to process, then suspend the currently executing task on the Pascal card. As the offending task is then promptly started on the Pascal card, I can then undo all suspensions, and go on about my other interests until the next time I choose to reduce my offending task inventory.
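
A rough sketch of the name check in step 2, in Python, for anyone who would rather script it than eyeball a sorted list. Everything specific here is an assumption made for illustration: reading "21nnL" as 21 plus two digits plus L, reading "0104?" as 0104 plus one more character, treating the first underscore-separated field of the task name as the data file token, and the sample task names themselves.

```python
import re

# Assumed readings of the two "flavors" named in the post:
#   21nnL -> "21", two digits, then "L"
#   0104? -> "0104" followed by exactly one more character
# The pattern is anchored to the end of the data file token.
OFFENDING = re.compile(r"(21\d\dL|0104\w)$")


def data_file_token(task_name):
    """Return the part of the task name before the first underscore.

    Assumption: that leading field names the data file.
    """
    return task_name.split("_", 1)[0]


def offending_tasks(task_names):
    """Return the task names whose data file token matches, sorted by name."""
    return sorted(n for n in task_names if OFFENDING.search(data_file_token(n)))


if __name__ == "__main__":
    # Made-up task names, for illustration only.
    queue = [
        "LATeah2103L_100.0_0_0.0_1_0",
        "LATeah1049L_200.0_0_0.0_2_1",
        "LATeah0104X_300.0_0_0.0_3_0",
    ]
    for name in offending_tasks(queue):
        print(name)
```

This only flags candidates; suspending or aborting them is still done by hand in BoincTasks as described above.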

They have come in irregular clusters. The scheduler tends to punish owners of machines with dissimilar cards by requesting no work for a while, then getting a big gulp. If some host that also gulps, but does not get its work done by deadline, has just generated a bunch of "time's up" returns, then I get half a dozen or more at a time, and then go days with none.

I'm just happy that my Turing is working for Einstein again, and that Bernd sees some daylight for possible ways it might continue to do so with less operator intervention.

Keith Myers

Joined: 11 Feb 11

Posts: 5063

Credit: 19358755487

RAC: 8163759

12 Feb 2019 4:07:19 UTC

Message 169454

I keep NNT (No New Tasks) set all the time for Einstein; then, when I am getting low, I request work. I can then review the tasks received for the offending types and abort them then and there. Depending on the work mix, I can then set NNT again if I got enough work, or let the scheduler get me more work and hope the next download brings more of the good type.

Like your method, it requires manual intervention, but my resource share is low enough that a slug of Einstein work will last for several days ... as long as Seti doesn't have any extended upsets. (Wishful thinking.)

DanNeely

Joined: 4 Sep 05

Posts: 1364

Credit: 3608000340

RAC: 568283

12 Feb 2019 6:23:51 UTC

Message 169456 in response to message 169449

Bernd Machenschalk wrote:

Keith Myers wrote:
So is this the new plan of attack for Turing/Volta? Prevent sending incompatible tasks vice fixing the application itself?

Actually we will do both. The limiting factor is manpower, and producing and sending compatible workunits is the most efficient thing we can do right now.

The reason for this problem appears to be a new feature of Turing/Volta ("independent thread scheduling"), which we have very limited control of in OpenCL. We might again intensify our efforts to develop a CUDA version that will likely give us more performance on NVidia cards and solve this problem as well. But we need more time for this than what we currently have.

Have you been in contact with NVidia about the issue? I know someone on the board opened a trouble ticket with them over it crashing on Turing and, IIRC, got it escalated to engineering support. That suggests that they care about the failures at some level and may be able to provide assistance, either by helping with a workaround or by getting sufficient detail of the code from you to understand what needs to be changed in their OpenCL libraries, if it needs to be fixed on their end.

Mike Hewson

Moderator

Joined: 1 Dec 05

Posts: 6596

Credit: 340283230

RAC: 108501

12 Feb 2019 8:47:00 UTC

Message 169458 in response to message 169456

My quick read of "independent thread scheduling" suggests that it may leave open the possibility of deadlock occurring with locks/mutexes, thus leaving some other threads perpetually waiting for access they will never get. As each thread in a warp now has its own program counter and call stack, opportunities for such 'stalls' abound. This requires someone (the developer, if CUDA) or something (the scheduler, if OpenCL) to be smart enough to avoid this problem. As NVidia implements OpenCL using CUDA, it is a case of CUDA advances outpacing their current OpenCL library development. Sucks.

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4354

Credit: 254009582

RAC: 35002

12 Feb 2019 11:01:51 UTC

Message 169460 in response to message 169443

Bernd Machenschalk wrote:

We'll try to figure out a way to prevent the scheduler from sending older tasks to such cards.

This should be in effect now. Such cards should get app versions with plan class "FGRPopenclTV-nvidia" and tasks from "old" WUs for these app versions should be rejected.

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4354

Credit: 254009582

RAC: 35002

12 Feb 2019 11:11:01 UTC

Message 169461

There is still the possibility that the problem in the app has nothing to do with the "independent thread scheduling", but I increasingly doubt it.

If you tell NVidia's OpenCL compiler not to optimize at all (by passing "-cl-opt-disable"), a previously problematic task runs to completion (although noticeably slower). [FWIW, even with optimization level 1 (instead of the default 3) the task locks up.]
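
For readers unfamiliar with how such a flag reaches the compiler, here is a minimal Python sketch. It is not the Einstein@Home application's build code. It assumes the third-party pyopencl package and an available OpenCL device, and the helper names build_options and build_program are invented for this example. Only "-cl-opt-disable" itself comes from the post above; it is a standard OpenCL build option.

```python
def build_options(disable_optimization):
    """Return the list of OpenCL build options to pass to the compiler.

    "-cl-opt-disable" is the standard OpenCL option that turns off all
    compiler optimizations.
    """
    return ["-cl-opt-disable"] if disable_optimization else []


def build_program(context, source, disable_optimization=True):
    """Compile OpenCL C source for the given context.

    Assumption: pyopencl is installed and `context` is a pyopencl.Context.
    The import is done lazily so build_options() works without pyopencl.
    """
    import pyopencl as cl

    program = cl.Program(context, source)
    return program.build(options=build_options(disable_optimization))
```

With pyopencl installed, something like `build_program(cl.create_some_context(), kernel_source)` would compile `kernel_source` unoptimized, which is the configuration in which the problematic task reportedly ran to completion.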

Keith Myers

Joined: 11 Feb 11

Posts: 5063

Credit: 19358755487

RAC: 8163759

12 Feb 2019 16:08:07 UTC

Message 169464 in response to message 169460

How does the scheduler handle mixed Pascal and Turing cards in the same host? Will it see a Turing card and send only the "FGRPopenclTV-nvidia" type from then on?

archae86

Joined: 6 Dec 05

Posts: 3165

Credit: 7412901687

RAC: 1927897

12 Feb 2019 17:11:07 UTC

Message 169466 in response to message 169464

Keith Myers wrote:

How does the scheduler handle mixed Pascal and Turing cards in the same host? Will it see a Turing card

For some purposes, the BOINC software seems to report only one type of card. If the cards are not at the same compute capability level, it is the higher one that gets reported. If so, in a mixed system a Turing card (level 7.n) would be reported rather than a Pascal card (level 6.n), and only Turing-suitable tasks should be downloaded.

Of course, I'm half guessing and would be happy to be corrected.

Richard Haselgrove

Joined: 10 Dec 05

Posts: 2143

Credit: 3064458968

RAC: 1986010

12 Feb 2019 17:32:35 UTC

Message 169467 in response to message 169466

Good guess - you're right, plus there are extra factors considered lower down the priority order.

// return 1/-1/0 if device 1 is more/less/same capable than device 2.
// factors (decreasing priority):
// - compute capability
// - software version
// - memory
// - speed

from https://github.com/BOINC/boinc/blob/master/client/gpu_nvidia.cpp#L134
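
For illustration only, the priority order in that comment can be sketched in Python as a lexicographic comparison. This is not BOINC's code: the Gpu class, its field names, and the sample numbers are invented here, and the real function in gpu_nvidia.cpp may handle ties and tolerances differently.

```python
from dataclasses import dataclass
from functools import cmp_to_key


@dataclass(frozen=True)
class Gpu:
    # Hypothetical fields for illustration; not BOINC's actual structure.
    name: str
    compute_capability: tuple  # (major, minor), e.g. (7, 5) for Turing
    software_version: int      # driver/runtime version as a plain integer
    memory_mb: int
    peak_gflops: float


def compare_gpus(a, b):
    """Return 1, -1 or 0 if a is more, less or equally capable than b."""
    key_a = (a.compute_capability, a.software_version, a.memory_mb, a.peak_gflops)
    key_b = (b.compute_capability, b.software_version, b.memory_mb, b.peak_gflops)
    # Tuples compare lexicographically, so earlier factors take priority
    # and later ones only break ties.
    return (key_a > key_b) - (key_a < key_b)


def most_capable(gpus):
    """Pick the device that would be reported as the most capable one."""
    return max(gpus, key=cmp_to_key(compare_gpus))


if __name__ == "__main__":
    # Deliberately unrealistic numbers: the Pascal card is given more memory
    # and speed to show that compute capability still decides first.
    pascal = Gpu("GTX 1070", (6, 1), 41000, 16384, 9000.0)
    turing = Gpu("RTX 2080", (7, 5), 41000, 8192, 8000.0)
    print(most_capable([pascal, turing]).name)
```

In this sketch the Turing card wins on compute capability alone, which matches the behavior described for a mixed Pascal/Turing host.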
