Pascal again available, Turing may be coming soon

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110040737295
RAC: 22419561

Richard Haselgrove wrote:
.... Einstein will have to get the debugger and the compiler out sooner or later, or suffer the error rate.

Unfortunately, it appears we'll just be suffering the error rate - or, more likely, a growing rate of withdrawn support from disillusioned volunteers.

I had hoped that by this time, with all the information Peter has provided in his search for a solution, someone from the project might at least have commented.  It would take only a small amount of time to make a simple comment to the effect that the problem has been noted, and perhaps to offer some advice from a programmer's perspective on what the problem might be down to - a driver problem, an app problem, or perhaps a combination of both.

Volunteers tend to spend quite a lot of time, effort and money in providing free resources to projects.  It's a really bad look when major issues like this don't seem to evoke any sort of response from the staff.  If things are happening 'behind the scenes', then at least, as a simple courtesy, give the volunteers a small update about it.

Cheers,
Gary.

archae86
Joined: 6 Dec 05
Posts: 3146
Credit: 7060894931
RAC: 1160880

archae86 wrote:
Moments ago, I finally submitted a trouble report to Nvidia

Within the last hour I received an email notification from Nvidia driver feedback that a bug had been filed for my issue.  Even more encouraging, the log on the web server which I used to post the zip file for my portable test case showed that someone had downloaded the zip file a couple of hours earlier.

So arguably we got past the first two of the long series of obstacles I mentioned on this path to resolution.

My assigned bug number is 2434391.  If you like, you can check the release notes for new driver releases (typically near page 14) to see whether my number is listed as either fixed or as an open issue.  416.34 listed five bugs as fixed, with the lowest number being 2041443 and the highest 2414749, and only six were listed as open.  So getting a number does not appear to confirm that they have verified it as a problem or assigned resources to fix it; it may simply indicate that someone regarded my submission as not frivolous.

Of course, if this really is an application bug and not a driver flaw, this is unlikely to help.

Sybie
Joined: 28 Mar 06
Posts: 5
Credit: 6347019
RAC: 0

OMG, it's finally running a 1 CPU + 1 GPU task on my 2080!

"Gamma-ray pulsar binary search #1 on GPUs 1.20 (FGRPopencl1K-nvidia)"

Currently at 60% and 5:40 min into it ... previously it would crash at about 20 sec with 0% progress.

edit: done, 100% in 8:29 min ... let's wait for the validation

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

Sybie wrote:
OMG, it's finally running a 1 CPU + 1 GPU task on my 2080!

That's because the new task was from the 1031L series, which is slow / low-pay.  The earlier tasks that errored out were from the 0104 series (fast / high-pay).  These slow ones will probably run and validate fine.

archae86
Joined: 6 Dec 05
Posts: 3146
Credit: 7060894931
RAC: 1160880

I had removed the 2080 from my machine during the recent run of high-pay work in the O104 series.  As reported in several places, 1025L and subsequently 1031L work has been on issue for several days now, and from the existing patterns it seemed likely to be in the low-pay class and to run on the Turing machines.  So after burning down my cache of O104 work, I swapped the 2080 back in a couple of hours ago and, as expected, saw the previous low-pay behavior: several validations and no errors on 1025L work.

Not much news here, save that the previous patterns continue to hold.  Sadly, this may mean that any new Turing cards turned on about now will shock their owners with 100% failures when (if) the work flow goes back to a type that does not work for the current application/driver/data/hardware combination.

Inspired by observations and comments from Vyper and Richard Haselgrove, I have started a project to look at command-line parameter sensitivity of the Turing high-pay failure issue.  My first step was to compare the complete command line parameter strings for a number of high-pay and low-pay WUs.  That has suggested five suspect parameters which all differ in a systematic way between high and low-pay WUs separated by a month.  With the 2080 back in the box, I can use a Juha-method test environment to try altering each of the parameters from the high-pay to the low-pay value.  The reverse case will require me to build a Juha-spec low-pay test environment and alter the parameters in the opposite direction.
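As a rough illustration of the kind of comparison involved - not the actual tooling used here, and with made-up flag names and values - a short script can split two task command lines into flag/value pairs and print the ones that differ:

# Rough sketch only: compare the command-line parameter strings of a
# high-pay and a low-pay task and report which flags differ.
# The two example strings are placeholders, not real Einstein@Home command lines.

def parse_flags(cmdline):
    """Turn '--Freq 1344.0 --f1dot -1e-9 ...' into {'--Freq': '1344.0', ...}."""
    tokens = cmdline.split()
    flags = {}
    i = 0
    while i < len(tokens):
        if tokens[i].startswith("--"):
            if i + 1 < len(tokens) and not tokens[i + 1].startswith("--"):
                flags[tokens[i]] = tokens[i + 1]
                i += 2
            else:
                flags[tokens[i]] = ""
                i += 1
        else:
            i += 1
    return flags

high_pay = "--inputfile LATeah0104S.dat --Freq 1344.0 --f1dot -1e-9"   # placeholder
low_pay  = "--inputfile LATeah1031L.dat --Freq 202.5 --f1dot -2e-10"   # placeholder

hp, lp = parse_flags(high_pay), parse_flags(low_pay)
for flag in sorted(set(hp) | set(lp)):
    if hp.get(flag) != lp.get(flag):
        print(f"{flag}: high-pay={hp.get(flag)!r}  low-pay={lp.get(flag)!r}")

Working one suspect flag at a time through a Juha-style test environment is then just a matter of editing the corresponding value and re-running.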

I'll wait to see what I see before speculating on what it might mean, and what, if any, use it might be to someone who might try to fix this problem.

Ouiche
Joined: 13 May 17
Posts: 7
Credit: 22584610
RAC: 0

I have the same problem with my 2080, and it causes the video driver to crash (latest, 416.64).  Normal tasks work fine and some got validated; the short ones crash after 20 seconds or so.

I'm not sure if it's a reading error, but the card telemetry recorded a couple of vcore spikes at 1.068 V during the crashes (among other weird data), and it makes little sense.  I never saw this card reach 1.068 V (I capped the voltage at 0.950 V) and I'm not even sure it's supposed to go that high.

It's the second driver crash I've had with that card: the first one happened in a game (War Thunder) and could be replicated.  All other games were working fine, and half of War Thunder was working fine, but a specific part (driving a tank) would cause the driver to crash after 10 seconds.  As far as I know, the devs filed a report with Nvidia and had to wait for a driver update to get the issue fixed; it came ~2 weeks later.

archae86
Joined: 6 Dec 05
Posts: 3146
Credit: 7060894931
RAC: 1160880

Ouiche wrote:
I have the same problem with my 2080

Thank you for the report.  I've reviewed the tasks for your host, and it appears to share the characteristics we have previously seen on two other 2080s, one 2070, and two 2080 Ti cards serving here at Einstein, plus one additional 2080 which ran a portable test case.

Quote:
the video driver to crash (latest, 416.64).

That driver is not offered at the standard Nvidia download site, which still shows 416.34 as the latest (even if one selects the option to be shown Beta drivers).  It appears that 416.64 is a hotfix driver in limited distribution.  As you credibly report that it did not fix our problem here at Einstein, I shall not try it out.

Quote:
I'm not sure if it's a reading error, but the card telemetry recorded a couple of vcore spikes at 1.068 V during the crashes

I've not yet observed that; I shall look for it the next time I run tests.  What metrology did you use?  I'm currently using GPU-Z and MSI Afterburner, and neither may be ideal for detecting short voltage spikes.
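For what it's worth, one way to improve the odds of catching a brief excursion is to log telemetry at a higher rate than the usual sensor polling while a suspect task runs.  Here is a minimal sketch, assuming nvidia-smi is on the PATH; note that consumer drivers do not expose core voltage through nvidia-smi, so this only captures power draw, SM clock and temperature, and GPU-Z's own sensor log would still be needed for vcore itself:

# Minimal sketch: poll nvidia-smi roughly 10 times per second and append the
# samples to a CSV so a short spike around a crash is less likely to be missed.
# Assumes nvidia-smi is on the PATH; stop it with Ctrl-C.
import subprocess
import time

LOGFILE = "gpu_telemetry.csv"   # hypothetical output file name

with open(LOGFILE, "w") as log:
    log.write("timestamp,power_w,sm_clock_mhz,temp_c\n")
    while True:
        sample = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=power.draw,clocks.sm,temperature.gpu",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True).stdout.strip()
        log.write(f"{time.time():.2f},{sample.replace(', ', ',')}\n")
        log.flush()
        time.sleep(0.1)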

Ouiche
Joined: 13 May 17
Posts: 7
Credit: 22584610
RAC: 0

Hi,

I'm using GPU-Z.  I'll try to replicate the issue today and provide screencaps and some information (but the "short" tasks are random; I had 10 hours without one, then 8 in a row yesterday).

edit: no errors yet.  I'm sure I'll get one as soon as I leave the computer!

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110040737295
RAC: 22419561

Ouiche wrote:
... the "short" tasks are random; I had 10 hours without one, then 8 in a row yesterday.

They will be less frequent as time goes on since the only ones that will be available will be a diminishing number of 'resends' - tasks that have failed in some way after being issued to someone else.  They will come if your computer happens to ask for new work at just the 'right' time - when someone else has returned one or a deadline has been exceeded :-).

Quote:
edit : no errors yet. I'm sure i'll get one as soon as i leave the computer!

You have no 'resends' currently in your 'in progress' work, so there is no immediate risk.  Each time your host gets new work, just look for any task that comes from the 'high-pay' data files, e.g. LATeah0104[STUVW].dat (one of those 5 letters), with an extension of _2 or above.  These are 'resend' tasks.  If you get a bunch (like the 8 in a row you mentioned) and don't want to be bothered with them all, just abort them and return them by clicking 'update'.
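For anyone who would rather not eyeball the task list every time, something like the following could flag those tasks automatically.  It is only a sketch: it assumes boinccmd is installed and that the output of 'boinccmd --get_tasks' contains 'name: ...' lines, and the exact output format can vary between client versions.

# Sketch only: list in-progress tasks whose names look like high-pay 'resends'
# (LATeah0104[STUVW]... with a trailing _2 or higher).
# Assumes 'boinccmd --get_tasks' works on this host and prints 'name: ...' lines.
import re
import subprocess

output = subprocess.run(["boinccmd", "--get_tasks"],
                        capture_output=True, text=True).stdout

pattern = re.compile(r"LATeah0104[STUVW].*_(\d+)$")
for line in output.splitlines():
    line = line.strip()
    if line.startswith("name:"):
        name = line.split("name:", 1)[1].strip()
        match = pattern.search(name)
        if match and int(match.group(1)) >= 2:
            print("high-pay resend:", name)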

Cheers,
Gary.

Ouiche
Joined: 13 May 17
Posts: 7
Credit: 22584610
RAC: 0

That seems to be right; I've had no driver crash since yesterday.  And even if a couple were to happen, the driver kicks back in, so it's not much of a problem (for me, at least).  I would bet on half-baked drivers :o

With some undervolting the card pulls ~125 W on average according to GPU-Z, and most tasks are done in 500-560 seconds running 1x.  I'm not 100% sure it's truly stable since I had a few invalid tasks.  I've changed the settings a bit; I'll see.

Thx for the help.
