Pascal again available, Turing may be coming soon

archae86

Joined: 6 Dec 05

Posts: 3157

Credit: 7224004931

RAC: 1006900

8 Dec 2018 4:42:58 UTC

Message 168127 in response to message 168125

Gary Roberts wrote:

As I looked down the list of those 2001Ls, there were a *lot* of resends as well. There must be a few people having some sort of problem with those at the moment for resends to be in such numbers at the very beginning. I don't recall seeing lots of resends so soon after the start of a new file.

I looked over those 2001L resend units that arrived at my box and noticed that many were from just a few hosts which had generated quick failures. None of them had Turing cards. I think there were many more AMD cards than Nvidia but of more than one type.

The fact that they were returned as failures so very quickly suggests that the host was either running a very short queue or that the host had such a high failure rate on previous units that the cache was run down that way. So there is a pretty good chance that one way or another these are not an unbiased sample of boxes.

As a check against comprehensive failure, I promoted a 2001L unit out of sequence to run on the 1060 that shares a host with my 2080. It ran to completion, seemingly normally, and was certainly not a fast fail.

In summary, I don't know whether to be further alarmed that there may be a problem affecting a new class of non-Turing hosts or not.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117643036149

RAC: 35165296

8 Dec 2018 5:22:35 UTC

Message 168128

On reflection, I'm sure it's nothing to be concerned about. It just seemed unusual at the time that there were so many, one after the other - all resends. But if you think about it, it's just the luck of the draw. A host under stress trashes a bunch of tasks and crashes. It gets rebooted and reports them all at once. Another host comes along and asks for a bunch of work. I'm fairly sure the scheduler gets rid of the resends before allocating any new primary tasks. For that type of situation, if you were asking for a significant amount of work at just the right time, you would expect to see them 'all in a column' like they were.
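The "all in a column" effect described above can be illustrated with a toy model. This is only a sketch under the assumption stated in the post (resends are handed out before new primary tasks); it is not BOINC's scheduler code, and `hand_out_work` and the task names are hypothetical.

```python
from collections import deque

def hand_out_work(resends, primaries, requested):
    """Toy model: serve queued resends first, then new primary tasks."""
    sent = []
    while len(sent) < requested and (resends or primaries):
        # Assumption from the post: resends are cleared before primaries.
        queue = resends if resends else primaries
        sent.append(queue.popleft())
    return sent

# A crashed host has just reported a batch of failed tasks, creating resends.
resends = deque(f"resend_{i}" for i in range(4))
primaries = deque(f"primary_{i}" for i in range(100))

# The next host to ask for six tasks receives all four resends in a row.
print(hand_out_work(resends, primaries, 6))
```

Under that assumption, whoever asks for a significant amount of work right after the failed batch is reported sees the resends consecutively, which matches what was observed.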

It just struck me as odd. I didn't mean to alarm you :-).

Cheers,
Gary.

Richie

Joined: 7 Mar 14

Posts: 656

Credit: 1702989778

RAC: 0

8 Dec 2018 6:49:09 UTC

Message 168129

I had nothing to do so I too promoted a few 2001L's.

Windows + GTX 960 (2 tasks), Linux + GTX 960 (3 tasks) and Windows + R9 270X (6 tasks) ...

All ran fine and are waiting for validation.

I saw there was a Windows host with a GTX 650 that failed several 2001L's in a row, but it had also failed all its other recent GPU tasks. It must have some other problem... ( host/9662990 )

Failed to get OpenCL platform/device info from BOINC (error: -1)!
initialize_ocl(): Got no suitable OpenCL device information from BOINC - boincPlatformId is NULL - boincDeviceId is NULL
initialize_ocl returned error [2004]
OCL context null
OCL queue null

Keith Myers

Joined: 11 Feb 11

Posts: 4964

Credit: 18741013795

RAC: 7102229

8 Dec 2018 8:06:52 UTC

Message 168130 in response to message 168129

That's an easy one. They let Microsoft update their video drivers with ones that have no OpenCL support.

th3tricky

Joined: 15 Mar 15

Posts: 18

Credit: 944439068

RAC: 0

9 Dec 2018 20:13:43 UTC

Message 168152

After skimming the last 20 pages of this thread, it's refreshing to know I'm not the only one with RTX issues! My RTX 2070 on driver 417.22 crushed GRPB #1 units for about 4 days; then it looks like the driver crashed, and when I reset my computer all the work units were trashed, 80-something of them. From what I gather from the conversation here, it is mainly a driver issue, not so much lack of support from the project?

archae86

Joined: 6 Dec 05

Posts: 3157

Credit: 7224004931

RAC: 1006900

9 Dec 2018 20:57:08 UTC

Message 168153 in response to message 168152

th3tricky wrote:

From what I gather from the conversation here, it is mainly a driver issue, not so much lack of support from the project?

I don't think there is any confident way we can isolate the fault. I have filed a "feedback" with Nvidia, which got a bug number assigned. So if there is a way to patch things in the driver perhaps that will happen. I'm unaware of any project activity on this matter.

I've taken a look at the summary of your failed tasks, and the stderr for one of them. While the time to failure looks very similar to what we have seen on other Turing card machines, the stderr details differ.

exit status: 59 (0x0000003B) Unknown error code

and the tail end of stderr looks a bit different:


% Sky point 1/1
% Binary point 1/1018
% Creating FFT plan.
% fft length: 16777216 (0x1000000)
FFTGeneratedTransposeGCNAction::compileKernels failed
ERROR: plan generation ("baking") failed: -5
22:32:22 (4756): [CRITICAL]: ERROR: MAIN() returned with error '-5'
FPU status flags:  PRECISION
22:32:34 (4756): [normal]: done. calling boinc_finish(59).

I don't recall seeing compileKernels failed, nor the reference to failure during plan generation on these.

Still, I think it likely you are suffering from the same general class of problem as we are. If so you can expect any additional tasks of the 104X or 2001L flavor you may run on the 2070 to fail unless someone fixes something.
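For anyone comparing stderr tails by eye, a small script can sort them into the two failure signatures quoted in this thread. This is a minimal sketch, not part of BOINC or the Einstein@Home application: `classify_stderr` and the labels are hypothetical, and the marker strings are copied from the two stderr excerpts posted above.

```python
# Hypothetical helper; the marker strings come from stderr excerpts in this thread.
FAILURE_MARKERS = [
    ("Got no suitable OpenCL device information from BOINC",
     "BOINC reported no usable OpenCL device"),
    ("compileKernels failed",
     "OpenCL kernel compilation failed"),
    ("plan generation",
     "FFT plan generation failed"),
]

def classify_stderr(stderr_text):
    """Return the labels of all known failure markers found in stderr_text."""
    return [label for marker, label in FAILURE_MARKERS if marker in stderr_text]

sample = """% Creating FFT plan.
FFTGeneratedTransposeGCNAction::compileKernels failed
ERROR: plan generation ("baking") failed: -5
"""
print(classify_stderr(sample))
```

An empty list simply means none of the known markers were present; it says nothing about whether the task succeeded.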

Mike Hewson

Moderator

Joined: 1 Dec 05

Posts: 6588

Credit: 317437463

RAC: 371175

10 Dec 2018 0:00:00 UTC

Message 168154 in response to message 168153

archae86 wrote:

I don't recall seeing compileKernels failed, nor the reference to failure during plan generation on these.

FWIW: the simplest reason for kernel/program compilation to fail on one machine but not another is host code expecting an OpenCL version that the driver doesn't satisfy. ;-(

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

th3tricky

Joined: 15 Mar 15

Posts: 18

Credit: 944439068

RAC: 0

10 Dec 2018 0:15:48 UTC

Message 168155

So it does sound like a driver issue then. I have stopped crunching on that card for now, since it defeats the purpose of helping if I'm just trashing WUs. I'll just watch the driver notes and this thread and hope something comes of it, as I really don't understand the info in the stderr file well enough to troubleshoot on my own.

Mike Hewson

Moderator

Joined: 1 Dec 05

Posts: 6588

Credit: 317437463

RAC: 371175

10 Dec 2018 0:31:00 UTC

Message 168156 in response to message 168155

To be exact, the driver can't compile for an OpenCL version higher than the one it supports. If that is what's happening, then this project is generating compile requests for a later version of the standard. Or, flipping that, the driver providers are only writing to an earlier version. For that matter, what is the OpenCL version of the kernels written by E@H?

Another possibility is a 'lying driver' that declares support to version y.x but doesn't fully implement the standard for y.x .....

There's a similar issue with OpenGL: a host asks for a context version that the driver does not support, and/or support is claimed but not actually realised for all features. Intel is epic for rubbish OpenGL driver support.
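The version mismatch described above can be sketched as a simple comparison. The OpenCL specification defines the device version string as "OpenCL", a space, major.minor, a space, then vendor-specific text. Everything else here is an assumption for illustration: `parse_opencl_version` and `driver_satisfies` are hypothetical helpers, and the version numbers are examples only, since the thread does not say which version the E@H kernels target.

```python
import re

def parse_opencl_version(version_string):
    """Extract (major, minor) from a string such as 'OpenCL 1.2 CUDA'."""
    match = re.match(r"OpenCL (\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"not an OpenCL version string: {version_string!r}")
    return int(match.group(1)), int(match.group(2))

def driver_satisfies(required, reported_version_string):
    """True if the driver's reported version is at least the required one."""
    return parse_opencl_version(reported_version_string) >= required

# Illustrative values only, not taken from the project or any specific driver.
print(driver_satisfies((1, 2), "OpenCL 1.2 CUDA"))
print(driver_satisfies((2, 0), "OpenCL 1.2 CUDA"))
```

Note that a check like this only compares what the driver claims. It cannot catch the 'lying driver' case, where the reported version is high enough but the implementation is incomplete.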

Cheers, Mike.

th3tricky

Joined: 15 Mar 15

Posts: 18

Credit: 944439068

RAC: 0

10 Dec 2018 2:05:12 UTC

Message 168158

So would rolling back one driver version have any effect?
