Pascal again available, Turing may be coming soon

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110040737295
RAC: 22419561

Richard Haselgrove wrote:
.... Einstein will have to get the debugger and the compiler out sooner or later, or suffer the error rate.

Unfortunately, it appears we'll just be suffering the error rate - or, more likely, a growing rate of withdrawn support from disillusioned volunteers.

I had hoped that by this time, with all the information Peter has provided in his search for a solution, someone from the project might at least have commented.  It would take only a small amount of time to make a simple comment to the effect that the problem has been noted, and perhaps to offer some advice from a programmer's perspective on what the problem might be down to - a driver problem, an app problem, or perhaps a combination of both.

Volunteers tend to spend quite a lot of time, effort and money in providing free resources to projects.  It's a really bad look when major issues like this don't seem to evoke any sort of response from the staff.  If things are happening 'behind the scenes', then at least, as a simple courtesy, give the volunteers a small update about it.

Cheers,
Gary.

archae86
Joined: 6 Dec 05
Posts: 3146
Credit: 7060894931
RAC: 1160880

archae86 wrote:
Moments ago, I finally submitted a trouble report to Nvidia

Within the last hour I received an email notification from Nvidia driver feedback that a bug had been filed for my issue.  Even more encouraging, the log on the web server which I used to post the zip file for my portable test case showed that someone had downloaded the zip file a couple of hours earlier.

So arguably we got past the first two of the long series of obstacles I mentioned on this path to resolution.

My assigned bug number is 2434391.  If you like, you can check the release notes for new driver releases (typically near page 14) to see whether my number is listed as either fixed or as an open issue.  416.34 listed five bugs as fixed, with the lowest number being 2041443 and the highest 2414749, and only six were listed as open.  So getting a number does not appear to confirm that they have verified it as a problem or assigned resources to fix it; it may simply indicate that someone regarded my submission as not frivolous.

Of course, if this really is an application bug and not a driver flaw, this is unlikely to help.

Sybie
Joined: 28 Mar 06
Posts: 5
Credit: 6347019
RAC: 0

OMG, it's finally running a 1 CPU + 1 GPU task on my 2080!

"Gamma-ray pulsar binary search #1 on GPUs 1.20 (FGRPopencl1K-nvidia)"

Currently at 60% and 5:40 min into it ... previously it would crash at about 20 sec with 0% progress.

edit: done, 100% in 8:29 min ... let's wait for the validation

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

Sybie wrote:
OMG, it's finally running a 1 CPU + 1 GPU task on my 2080!

That's because the new task was from the 1031L series, which is slow / low-pay.  The earlier tasks that errored out were from the 0104 series (fast / high-pay).  These slow ones will probably run and validate fine.

archae86
Joined: 6 Dec 05
Posts: 3146
Credit: 7060894931
RAC: 1160880

I had removed the 2080 from my machine during the recent run of high-pay work in the O104 series.  As reported in several places, 1025L and subsequently 1031L work has been on issue for several days now, and from the existing patterns it seemed likely to be in the low-pay class and to run on the Turing machines.  So after burning down my cache of O104 work, I swapped the 2080 back in a couple of hours ago and, as expected, saw the previous low-pay behavior: several validations and no errors on 1025L work.

Not much news here, save that the previous patterns continue to hold.  Sadly, this may mean that any new Turing cards turned on about now will shock their owners with 100% failures when (if) the work flow goes back to a type that does not work for the current application/driver/data/hardware combination.

Inspired by observations and comments from Vyper and Richard Haselgrove, I have started a project to look at command-line parameter sensitivity of the Turing high-pay failure issue.  My first step was to compare the complete command line parameter strings for a number of high-pay and low-pay WUs.  That has suggested five suspect parameters which all differ in a systematic way between high and low-pay WUs separated by a month.  With the 2080 back in the box, I can use a Juha-method test environment to try altering each of the parameters from the high-pay to the low-pay value.  The reverse case will require me to build a Juha-spec low-pay test environment and alter the parameters in the opposite direction.
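As a rough illustration of the kind of comparison involved - not the actual tooling used here, and with made-up flag names and values - a short script can split two task command lines into flag/value pairs and print the ones that differ:

# Rough sketch only: compare the command-line parameter strings of a
# high-pay and a low-pay task and report which flags differ.
# The two example strings are placeholders, not real Einstein@Home command lines.

def parse_flags(cmdline):
    """Turn '--Freq 1344.0 --f1dot -1e-9 ...' into {'--Freq': '1344.0', ...}."""
    tokens = cmdline.split()
    flags = {}
    i = 0
    while i < len(tokens):
        if tokens[i].startswith("--"):
            if i + 1 < len(tokens) and not tokens[i + 1].startswith("--"):
                flags[tokens[i]] = tokens[i + 1]
                i += 2
            else:
                flags[tokens[i]] = ""
                i += 1
        else:
            i += 1
    return flags

high_pay = "--inputfile LATeah0104S.dat --Freq 1344.0 --f1dot -1e-9"   # placeholder
low_pay  = "--inputfile LATeah1031L.dat --Freq 202.5 --f1dot -2e-10"   # placeholder

hp, lp = parse_flags(high_pay), parse_flags(low_pay)
for flag in sorted(set(hp) | set(lp)):
    if hp.get(flag) != lp.get(flag):
        print(f"{flag}: high-pay={hp.get(flag)!r}  low-pay={lp.get(flag)!r}")

Working one suspect flag at a time through a Juha-style test environment is then just a matter of editing the corresponding value and re-running.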

I'll wait to see what I see before speculating on what it might mean, and what, if any, use it might be to someone who might try to fix this problem.

Ouiche
Joined: 13 May 17
Posts: 7
Credit: 22584610
RAC: 0

I have the same problem with my 2080, and it causes the video driver to crash (latest, 416.64).  Normal tasks work fine and some got validated; the short ones crash after 20 seconds or so.

I'm not sure if it's a reading error, but the card telemetry recorded a couple of vcore spikes at 1.068 V during the crashes (among other weird data), and it makes little sense.  I never saw this card reach 1.068 V (I capped the voltage at 0.950 V) and I'm not even sure it's supposed to go that high.

It's the second driver crash I've had with that card: the first one happened in a game (War Thunder) and could be replicated.  All other games were working fine, and half of War Thunder was working fine, but a specific part (driving a tank) would cause the driver to crash after 10 seconds.  As far as I know, the devs filed a report with Nvidia and had to wait for a driver update to get the issue fixed; it came ~2 weeks later.

archae86
Joined: 6 Dec 05
Posts: 3146
Credit: 7060894931
RAC: 1160880

Ouiche wrote:
I have the same problem with my 2080

Thank you for the report.  I've reviewed the tasks for your host, and it appears to share the characteristics we have previously seen on two other 2080s, one 2070, and two 2080 Ti cards serving here at Einstein, plus one additional 2080 which ran a portable test case.

Quote:
the video driver to crash (latest, 416.64).

That driver is not offered at the standard Nvidia download site, which still shows 416.34 as the latest (even if one selects the option to be shown Beta drivers).  It appears that 416.64 is a hotfix driver in limited distribution.  As you credibly report that it did not fix our problem here at Einstein, I shall not try it out.

Quote:
I'm not sure if it's a reading error, but the card telemetry recorded a couple of vcore spikes at 1.068 V during the crashes

I've not yet observed that; I shall look for it the next time I run tests.  What metrology did you use?  I'm currently using GPU-Z and MSI Afterburner, and neither may be ideal for detecting short voltage spikes.
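For what it's worth, one way to improve the odds of catching a brief excursion is to log telemetry at a higher rate than the usual sensor polling while a suspect task runs.  Here is a minimal sketch, assuming nvidia-smi is on the PATH; note that consumer drivers do not expose core voltage through nvidia-smi, so this only captures power draw, SM clock and temperature, and GPU-Z's own sensor log would still be needed for vcore itself:

# Minimal sketch: poll nvidia-smi roughly 10 times per second and append the
# samples to a CSV so a short spike around a crash is less likely to be missed.
# Assumes nvidia-smi is on the PATH; stop it with Ctrl-C.
import subprocess
import time

LOGFILE = "gpu_telemetry.csv"   # hypothetical output file name

with open(LOGFILE, "w") as log:
    log.write("timestamp,power_w,sm_clock_mhz,temp_c\n")
    while True:
        sample = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=power.draw,clocks.sm,temperature.gpu",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True).stdout.strip()
        log.write(f"{time.time():.2f},{sample.replace(', ', ',')}\n")
        log.flush()
        time.sleep(0.1)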

Ouiche
Joined: 13 May 17
Posts: 7
Credit: 22584610
RAC: 0

Hi,

I'm using GPU-Z.  I'll try to replicate the issue today and provide screencaps and some information (but the "short" tasks are random; I had 10 hours without one, then 8 in a row yesterday).

edit: no errors yet.  I'm sure I'll get one as soon as I leave the computer!

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110040737295
RAC: 22419561

Ouiche wrote:
... the "short" tasks are random; I had 10 hours without one, then 8 in a row yesterday.

They will be less frequent as time goes on since the only ones that will be available will be a diminishing number of 'resends' - tasks that have failed in some way after being issued to someone else.  They will come if your computer happens to ask for new work at just the 'right' time - when someone else has returned one or a deadline has been exceeded :-).

Quote:
edit : no errors yet. I'm sure i'll get one as soon as i leave the computer!

You have no 'resends' currently in your 'in progress' work, so there is no immediate risk.  Each time your host gets new work, just look for any task that comes from the 'high-pay' data files, e.g. LATeah0104[STUVW].dat (one of those 5 letters), with an extension of _2 or above.  These are 'resend' tasks.  If you get a bunch (like the 8 in a row you mentioned) and don't want to be bothered with them all, just abort them and return them by clicking 'update'.
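For anyone who would rather not eyeball the task list every time, something like the following could flag those tasks automatically.  It is only a sketch: it assumes boinccmd is installed and that the output of 'boinccmd --get_tasks' contains 'name: ...' lines, and the exact output format can vary between client versions.

# Sketch only: list in-progress tasks whose names look like high-pay 'resends'
# (LATeah0104[STUVW]... with a trailing _2 or higher).
# Assumes 'boinccmd --get_tasks' works on this host and prints 'name: ...' lines.
import re
import subprocess

output = subprocess.run(["boinccmd", "--get_tasks"],
                        capture_output=True, text=True).stdout

pattern = re.compile(r"LATeah0104[STUVW].*_(\d+)$")
for line in output.splitlines():
    line = line.strip()
    if line.startswith("name:"):
        name = line.split("name:", 1)[1].strip()
        match = pattern.search(name)
        if match and int(match.group(1)) >= 2:
            print("high-pay resend:", name)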

Cheers,
Gary.

Ouiche
Joined: 13 May 17
Posts: 7
Credit: 22584610
RAC: 0

That seems to be right; I've had no driver crash since yesterday.  And even if a couple were to happen, the driver kicks back in, so it's not much of a problem (for me, at least).  I would bet on half-baked drivers :o

With some undervolting the card pulls ~125 W on average according to GPU-Z, and most tasks are done in 500-560 seconds running 1x.  I'm not 100% sure it's truly stable since I had a few invalid tasks.  I've changed the settings a bit; I'll see.

Thx for the help.
