Pascal again available, Turing may be coming soon

DanNeely
Joined: 4 Sep 05
Posts: 1,359
Credit: 2,923,080,198
RAC: 2,927,727

archae86 wrote:
ExtraTerrestrial Apes wrote:

 

The reviews say there's now a decent auto-OC tool made available for the usual tuners.

I'm troubled that they don't deal with memory clock (though that seems far less helpful for current Einstein on Turing than it did for then-current Einstein on Maxwell).  I might give it a try at some point, if things settle down and I keep the card.

 

I'd be mildly surprised if it doesn't get added eventually, but time to market is king, so if a feature wasn't ready when Pascal was, it'd be cut.  From a pragmatic standpoint, they've been dynamically boosting clock rates based on workload as an automatic form of overclock-light for a few years now, so that's where their strongest institutional knowledge lies.  And except at the very bottom of the market, gaming is almost never memory-bandwidth limited, so the RAM OC is mostly about bragging rights for their main market.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,551
Credit: 78,927,994,084
RAC: 64,884,222

archae86 wrote:
I seem to have stumbled on a very serious problem with my new Turing 2080 card.  It has failed every time it has tried to run a unit of the immediately previous "high-pay" flavor of Gamma-ray pulsar work.

Hi Peter,
I'm sorry to read that these sorts of issues are haunting you.  How do you tell if it's the hardware, something to do with the nature of the application or a change in data, or a driver that is not yet mature?  I'm glad I'm not struggling with that lot.

The only real reason for replying is to offer anecdotal evidence that the mix of "high-pay" and "low-pay" work might have something to do with it, rather than a straight-out GPU fault.  Back in May, there was a similar change in the nature of the data which caused a number of my hosts to suffer from groups of computation errors.  I documented it at the time in this message and made further comments in later messages.  It took me a while to find it now, as I'd forgotten the comments were not in a thread I'd started.

When I first read the comments on your recent failures, it reminded me a bit of my own experiences, so I went back and checked whether you had been running 1x or 2x at the time.  You seem to have been running 1x, so it's not the same as the problem I had.  My problem was always triggered by 2x running where there was one task of each of the 'types' involved.  It didn't happen every time (fairly infrequently considering the number of candidate hosts) and the precise cause wasn't really clear.  I'm just wondering (for your situation) if it's possible that one task type somehow 'pre-conditions' the GPU/driver combination to give problems when switching to the other task type.  Maybe it's not really the hardware, in which case a different GPU might behave just the same as the current one.

I'm actually experiencing the same issues with this current transition as I experienced back in May.  It's not happening as frequently this time and I'm much better prepared to deal with it than I was last time :-).

 

Cheers,
Gary.

archae86
Joined: 6 Dec 05
Posts: 3,066
Credit: 5,905,234,942
RAC: 3,316,208

At the moment I am running 1X with exclusively low-pay work units.  All has been well since I rebooted after installing the current driver.  I rapidly built back up to my previous maximum observed successful clock rates and have been running there for the majority of my planned 24-hour qualification check.

If my 24-hour safety check works, I'll switch over to 2X at default clocks, then quickly raise the clocks back toward the maximum 1X level.

Only then will I have another try at the high-pay work.  Considering Gary's pre-conditioning thought, the first time I'll try it only after a full power-down reboot, after first switching back to 1X and default clocks.

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,551
Credit: 78,927,994,084
RAC: 64,884,222

archae86 wrote:
... the first time I'll try it only after a full power-down reboot, after first switching back to 1X and default clocks.

That's exactly the sort of thing I had in mind, but didn't put into words :-).  It was late and I was supposed to be somewhere else :-).

When some of my GPUs play up, a warm reboot brings the machine up, but sometimes there is still no display and the GPU hasn't been properly detected and initialised.  In those cases, to get things back to normal, it has to be a full power-off restart.  I disconnect power and hit the on button on the computer to discharge the storage caps in the PSU, just to be sure there is no residual charge.  That always seems to fix things.

It will be interesting to see if something like that works in your case as well.

Cheers,
Gary.

archae86
Joined: 6 Dec 05
Posts: 3,066
Credit: 5,905,234,942
RAC: 3,316,208

I completed my intended 2X checkout work on low-pay WUs.  I did once again see intermittent downclocking at the very highest clocks I had believed to be workable.  While it took over 24 hours for that behavior to show up at 1X (and thus it technically passed my 24-hour success criterion), it showed up much sooner at 2X, which may just be luck or a mismatch of some detailed conditions.  Dropping back to the revised two steps below maximum successful clock rates, I ran a few hours this morning at +95 core, +810 memory clock, which at 2X gave elapsed times on each unit of 15 minutes, 0.5 seconds, for an indicated Einstein low-pay WU productivity of 664,911 cobblestones/day at an average system power consumption of 257 watts.  This is only slightly higher credit production than at 1X, and power productivity is inferior to 1X.
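
The productivity figure quoted above follows from simple arithmetic.  The sketch below is a minimal reconstruction, assuming each FGRPB1G task is credited 3,465 cobblestones (a per-task value not stated in this post) and that both concurrent tasks take the same elapsed time.  The function and constant names are hypothetical.

```python
SECONDS_PER_DAY = 86_400

# Assumption: credit granted per FGRPB1G task (not stated in the post).
CREDIT_PER_TASK = 3465


def daily_credit(elapsed_s, concurrency, credit_per_task):
    """Cobblestones per day when `concurrency` tasks run side by side,
    each taking `elapsed_s` seconds of wall-clock time."""
    tasks_per_day = concurrency * SECONDS_PER_DAY / elapsed_s
    return tasks_per_day * credit_per_task


def credit_per_watt(daily, watts):
    """Daily cobblestones per watt of average system power."""
    return daily / watts


# Figures from the post: 2X running, 15 min 0.5 s per unit, 257 W.
elapsed = 15 * 60 + 0.5
daily = daily_credit(elapsed, concurrency=2, credit_per_task=CREDIT_PER_TASK)
per_watt = credit_per_watt(daily, watts=257)

print(f"{daily:,.0f} cobblestones/day")
print(f"{per_watt:,.0f} cobblestones/day per watt")
```

Under those assumptions the result rounds to the 664,911 cobblestones/day quoted above, or roughly 2,587 cobblestones per day per watt at 257 W.  The same two functions can be fed the 1X elapsed time and power draw to compare power productivity between the two modes.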

I had intended my first new try at high-pay work to be cleanly after going down to 1X at default clocks and doing a cold-iron reboot.  But my own fumble-fingered typing coupled with the current lack of FGRPB1G work available conspired to have me accidentally start high-pay work while running 2X.   That got me three additional rapid-failure high-pay units, plus a mid-computation failure on the low-pay unit that was running while the trouble started.  I saw a brief black screen on the PC.

The most interesting lines from stderr for the failing high-pay work look like this:

% nf1dots: 38  df1dot: 2.71528666e-015  f1dot_start: -1e-013  f1dot_band: 1e-013
% Filling array of photon pairs
ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:934: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error -1863952096
12:22:09 (16728): [CRITICAL]: ERROR: MAIN() returned with error '-36'
FPU status flags:  PRECISION
12:22:21 (16728): [normal]: done. calling boinc_finish(28).
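
The status value in the clFinish line is a standard OpenCL error code; in the Khronos cl.h header, -36 is CL_INVALID_COMMAND_QUEUE.  The sketch below is one possible way to pull such failures out of a stderr dump and put a name to the status.  The regular expression is inferred only from the lines shown here, the function name is hypothetical, and the lookup table is a deliberately small subset of the OpenCL codes.

```python
import re

# Small subset of OpenCL status codes as defined in the Khronos cl.h header.
CL_STATUS_NAMES = {
    0: "CL_SUCCESS",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -36: "CL_INVALID_COMMAND_QUEUE",
}

# Pattern inferred from the stderr excerpt above (an assumption, not a spec).
CLFINISH_RE = re.compile(
    r"ERROR: (?P<file>\S+):(?P<line>\d+): clFinish failed\. "
    r"status=(?P<status>-?\d+)"
)


def find_clfinish_failures(stderr_text):
    """Return (source_file, source_line, status, status_name) tuples
    for every clFinish failure reported in the given stderr text."""
    failures = []
    for match in CLFINISH_RE.finditer(stderr_text):
        status = int(match.group("status"))
        name = CL_STATUS_NAMES.get(status, "unknown")
        failures.append(
            (match.group("file"), int(match.group("line")), status, name)
        )
    return failures


sample = (
    "% Filling array of photon pairs\n"
    "ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:934: "
    "clFinish failed. status=-36\n"
)
for source_file, source_line, status, name in find_clfinish_failures(sample):
    print(f"{source_file}:{source_line} -> {status} ({name})")
```

A command queue becoming invalid in mid-run is at least consistent with the driver resetting the GPU, which would fit the brief black screen, but the status name alone does not say whether the card, the driver, or the application is at fault.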

 

 

The failure lines from the low-pay unit that was apparently killed by startup of a high-pay unit look like this:

% Filling array of photon pairs
. (I deleted fourteen more of these)
.
ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:728: clFinish failed. status=-36
ERROR: gen_fft_execute() returned with error -1863952096
12:21:46 (3192): [CRITICAL]: ERROR: MAIN() returned with error '5'
FPU status flags:  PRECISION
12:21:58 (3192): [normal]: done. calling boinc_finish(69).

 

So the current driver (released September 27) did not completely fix the high-pay unit catastrophe, and may have had zero effect on my situation.  When FGRPB1G units are again available for download, I still intend to switch down to 1X and do a cold-iron reboot trial.  (I also intend not to have any monitoring and control tools such as GPU-Z and MSI Afterburner running.)  But I have not got much hope.

I think nearly all possibilities remain open for the high-pay unit rapid failure problem.  It could be that I have a faulty sample of 2080 card.  It could be that the Turing design has an inherent flaw exposed by this type of datafile running the Einstein code.  It could be that both Turing-capable drivers that I tried have a bug.  It could be that the Einstein code has a bug.  About the only thing I personally can do beyond posting here is to RMA the card.  The only available RMA option from NewEgg is another sample of the same model of card.  If that one does the same thing, then it is unlikely that an uncaught individual card defect is the issue.

 

archae86
Joined: 6 Dec 05
Posts: 3,066
Credit: 5,905,234,942
RAC: 3,316,208

After a little while, new downloads were again available, so I changed to 1X.  I suspended all WUs.  Then I shut down to full power off for over a minute.  On reboot I disabled launch of MSI Afterburner and some other less related monitoring and control applications.

When things had settled down from startup transients, I unsuspended a single high-pay WU.  Watching BOINCMgr at the time, I saw that the elapsed time column on the task list page incremented up to four seconds, then paused.  A couple of seconds later there was a brief black screen.  Elapsed time incrementing resumed, with the task reported as failing after 27 seconds.

The key stderr lines look like this:

% Filling array of photon pairs
ERROR: /home/bema/fermilat/src/bridge_fft_clfft.c:948: clFinish failed. status=-36
ERROR: opencl_ts_2_phase_diff_sorted() returned with error 1289181472
13:35:44 (9412): [CRITICAL]: ERROR: MAIN() returned with error '-36'
FPU status flags:  PRECISION
13:35:56 (9412): [normal]: done. calling boinc_finish(28).

I'm out of bright ideas for the moment on the deadly reaction of my Turing 2080 card to high-pay Einstein WUs.  For the short term I'll try to manage my queue to keep some in stock in case further testing opportunities come up (such as a new driver released by Nvidia).  Within about a week, I plan to RMA.  But I strongly suspect the next sample of this card will behave the same way.  Somehow the Einstein code, when processing certain data files, is not compatible with this driver controlling this card.

 

 

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 427,431,026
RAC: 3,716

archae86 wrote:
I'm troubled that they don't deal with memory clock (though that seems far less helpful for current Einstein on Turing than it did for then-current Einstein on Maxwell).  I might give it a try at some point, if things settle down and I keep the card.

I suspect that's because they know there's no good quick-and-dirty test for VRAM / DRAM stability, like they apparently have for the chip.

MrS

Scanning for our furry friends since Jan 2002

Richie
Joined: 7 Mar 14
Posts: 651
Credit: 1,702,976,395
RAC: 485

Does your card have MICRON

edit: Sorry, wrong info... I looked at a wrong Gigabyte 2080 model.

archae86
Joined: 6 Dec 05
Posts: 3,066
Credit: 5,905,234,942
RAC: 3,316,208

Richie wrote:
edit: Sorry, wrong info... I looked at a wrong Gigabyte 2080 model.

I failed to notice that, so thanks for your correction.  Now that I'm looking at the support page for the model I have, there is no listing of an updated vBIOS.

As to your questions, GPU-Z reports that I have Micron RAM.  It reports the BIOS version as 90.04.0B.40.53.  I don't know how that maps to F1...F5.  

Anyway, a blind alley for the moment, but perhaps I should check back anon.

Richie
Richie
Joined: 7 Mar 14
Posts: 651
Credit: 1,702,976,395
RAC: 485

Nvidia driver 416.16 is
