Pascal again available, Turing may be coming soon

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7057704931
RAC: 1599809

Gary Roberts wrote:
As I looked down the list of those 2001Ls, there were a *lot* of resends as well.  There must be a few people having some sort of problem with those at the moment for resends to be in such numbers at the very beginning.  I don't recall seeing lots of resends so soon after the start of a new file.

I looked over those 2001L resend units that arrived at my box and noticed that many were from just a few hosts which had generated quick failures.  None of them had Turing cards.  I think there were many more AMD cards than Nvidia among them, and of more than one type.

The fact that they were returned as failures so very quickly suggests that the host was either running a very short queue or that the host had such a high failure rate on previous units that the cache was run down that way.  So there is a pretty good chance that one way or another these are not an unbiased sample of boxes.

As a check against comprehensive failure, I did promote a 2001L unit out of sequence to run on the 1060 that shares a host with my 2080.  It ran to completion seemingly normally and was certainly not a fast fail.

In summary, I don't know whether to be further alarmed that there may be a problem affecting a new class of non-Turing hosts or not.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5845
Credit: 109972852999
RAC: 29830096

On reflection, I'm sure it's nothing to be concerned about.  It just seemed unusual at the time that there were so many, one after the other - all resends.  But if you think about it, it's just the luck of the draw.  A host under stress trashes a bunch of tasks and crashes.  It gets rebooted and reports them all at once.  Another host comes along and asks for a bunch of work.  I'm fairly sure the scheduler gets rid of the resends before allocating any new primary tasks.  For that type of situation, if you were asking for a significant amount of work at just the right time, you would expect to see them 'all in a column' like they were.
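
As a toy illustration of that guess (this is not BOINC scheduler code, and the function and task names are made up for the example), a queue that drains resends before touching new primary tasks would hand a large request a solid block of resends first:

```python
from collections import deque

def allocate(resends, primaries, requested):
    """Hand out up to `requested` tasks, draining resends before primaries.

    Hypothetical model of the behaviour guessed at above, nothing more.
    """
    resends, primaries = deque(resends), deque(primaries)
    assigned = []
    while len(assigned) < requested and (resends or primaries):
        # Prefer a resend whenever one is still waiting.
        source = resends if resends else primaries
        assigned.append(source.popleft())
    return assigned

# A host asking for four tasks while three resends are pending
print(allocate(["r1", "r2", "r3"], ["p1", "p2"], 4))  # ['r1', 'r2', 'r3', 'p1']
```

Under that assumption, a host that asks for a lot of work just after another host has dumped a batch of failures would see the resends 'all in a column'.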

It just struck me as odd.  I didn't mean to alarm you :-).

Cheers,
Gary.

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

I had nothing to do so I too promoted a few 2001L's.

Windows + GTX 960 (2 tasks),  Linux + GTX 960 (3 tasks) and Windows + R9 270X (6 tasks) ...

All ran fine and are waiting for validation.

I saw there was a Windows host with a GTX 650 that failed several 2001L's in a row, but it had also failed all its other GPU tasks recently. Must have some other problem... ( host/9662990 )

Failed to get OpenCL platform/device info from BOINC (error: -1)!
initialize_ocl(): Got no suitable OpenCL device information from BOINC - boincPlatformId is NULL - boincDeviceId is NULL
initialize_ocl returned error [2004]
OCL context null
OCL queue null

Keith Myers
Joined: 11 Feb 11
Posts: 4753
Credit: 17680908422
RAC: 5745554

That's an easy one.  They let Microsoft update their video drivers with ones that have no OpenCL support.

th3tricky
Joined: 15 Mar 15
Posts: 18
Credit: 944439068
RAC: 0

Refreshing to know I'm not the only one with RTX issues after skimming over the last 20 pages of this thread! RTX 2070 on driver 417.22, crushed GRPB #1 units for about 4 days, then it looks like the driver crashed, and when I reset my computer all the work units were trashed, 80-something of them. What I gather from the conversation here is that it is mainly a driver issue, not so much lack of support from the project?

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7057704931
RAC: 1599809

th3tricky wrote:
What I gather from the conversation here is that it is mainly a driver issue, not so much lack of support from the project?

I don't think there is any confident way we can isolate the fault.  I have filed a "feedback" with Nvidia, which got a bug number assigned.  So if there is a way to patch things in the driver perhaps that will happen.   I'm unaware of any project activity on this matter.

I've taken a look at the summary of your failed tasks, and the stderr for one of them.  While the time to failure looks very similar to what we have seen on other Turing card machines, the stderr details differ.

exit status: 59 (0x0000003B) Unknown error code

and the tail end of stderr looks a bit different:


% Sky point 1/1
% Binary point 1/1018
% Creating FFT plan.
% fft length: 16777216 (0x1000000)
FFTGeneratedTransposeGCNAction::compileKernels failed
ERROR: plan generation ("baking") failed: -5
22:32:22 (4756): [CRITICAL]: ERROR: MAIN() returned with error '-5'
FPU status flags:  PRECISION
22:32:34 (4756): [normal]: done. calling boinc_finish(59).

I don't recall seeing "compileKernels failed", nor the reference to failure during plan generation, in the stderr from the other Turing card failures.

Still, I think it likely you are suffering from the same general class of problem as we are.  If so you can expect any additional tasks of the 104X or 2001L flavor you may run on the 2070 to fail unless someone fixes something.
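
For anyone wanting to sort failed tasks by what the stderr actually says, here is a minimal sketch that looks for the substrings quoted in this thread. The labels and the function name are my own invention, not official project diagnostics, and the list of signatures is only as complete as the excerpts posted here.

```python
# Substrings copied from the stderr excerpts quoted in this thread.
# The labels are informal descriptions (assumptions), not project terminology.
SIGNATURES = [
    ("boincPlatformId is NULL", "no OpenCL device reported by BOINC"),
    ("compileKernels failed", "FFT kernel compilation failed"),
    ('plan generation ("baking") failed', "FFT plan generation failed"),
]

def classify_stderr(text):
    """Return the label of every known signature found in a stderr dump."""
    return [label for needle, label in SIGNATURES if needle in text]

sample = """% Creating FFT plan.
FFTGeneratedTransposeGCNAction::compileKernels failed
ERROR: plan generation ("baking") failed: -5
"""
print(classify_stderr(sample))
# ['FFT kernel compilation failed', 'FFT plan generation failed']
```

A dump that matches none of the signatures returns an empty list, which would at least flag it as a different failure worth reading by hand.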

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6537
Credit: 286524051
RAC: 93547

archae86 wrote:
I don't recall seeing compileKernels failed, nor the reference to failure during plan generation on these.

FWIW : the simplest reason for compiling kernels/programs to fail on one machine but not another is the host code expecting an OpenCL version that the driver doesn't satisfy. ;-(
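
A minimal sketch of that check, assuming only the standard layout of the device version string ("OpenCL", a space, then major.minor, then vendor text). The required version below is a made-up number for illustration; I don't know what the E@H application actually asks for, and the helper names are hypothetical.

```python
import re

def parse_opencl_version(version_string):
    """Return (major, minor) from a CL_DEVICE_VERSION-style string."""
    match = re.match(r"OpenCL (\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognised OpenCL version string: {version_string!r}")
    return int(match.group(1)), int(match.group(2))

def driver_satisfies(device_version, required):
    """True if the reported version is at least the required (major, minor)."""
    return parse_opencl_version(device_version) >= required

# Hypothetical case: device reports 1.2, host code wants 2.0
print(driver_satisfies("OpenCL 1.2 CUDA", (2, 0)))  # False
print(driver_satisfies("OpenCL 1.2 CUDA", (1, 1)))  # True
```

Comparing the versions as integer tuples avoids the trap of comparing "1.10" and "1.2" as strings or floats.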

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

th3tricky
Joined: 15 Mar 15
Posts: 18
Credit: 944439068
RAC: 0

So it does sound like a driver issue then. I have stopped crunching on that card for now, since it defeats the purpose of helping if I'm just trashing WU's. I'll just watch the driver notes and this thread and hope something comes of it, as I really don't understand the info in the stderr file well enough to troubleshoot on my own.

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6537
Credit: 286524051
RAC: 93547

To be exact, a driver can't compile kernels for an OpenCL version higher than the one it implements. If that is what is happening here, then either this project is generating compile requests that need a later version of the standard or, flipping that around, the driver providers are only writing to an earlier one. For that matter, what is the OpenCL version of the kernels written by E@H ?

Another possibility is a 'lying driver' that declares support to version y.x but doesn't fully implement the standard for y.x .....

There's a similar issue with OpenGL : a host asks for a context version that the driver does not support, and/or support is claimed but not actually realised for all features. Intel is epic for rubbish OpenGL driver support.

Cheers, Mike.


th3tricky
Joined: 15 Mar 15
Posts: 18
Credit: 944439068
RAC: 0

So would rolling back one driver version have any effect? 
