All things Radeon VII / Vega 20

Chooka
Joined: 11 Feb 13
Posts: 134
Credit: 3689865759
RAC: 2207346

Quick update.

I've updated to driver 19.5.2. With all settings back to stock, run times for 3 x WUs are 506 seconds.

With the undervolting mentioned previously applied, run times are about 528 seconds. Increasing the memory to 1050 brings run times down to around 515 seconds.
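
For a rough sense of scale, the three run times above can be turned into tasks per day and a percentage slowdown against stock. This is only a sketch: it assumes each figure is the elapsed time for a batch of three tasks running concurrently, and the function names are made up for illustration.

```python
SECONDS_PER_DAY = 86_400


def tasks_per_day(elapsed_s, concurrent=3):
    """Tasks finished per day if `concurrent` tasks complete every `elapsed_s` seconds."""
    return concurrent * SECONDS_PER_DAY / elapsed_s


def slowdown_pct(elapsed_s, baseline_s):
    """Percentage increase in run time relative to the baseline."""
    return (elapsed_s / baseline_s - 1.0) * 100.0


# Run times quoted in the post (seconds for 3 concurrent WUs).
runs = {
    "stock": 506,
    "undervolted": 528,
    "undervolted, memory 1050": 515,
}

for label, elapsed in runs.items():
    print(
        f"{label}: {tasks_per_day(elapsed):.0f} tasks/day, "
        f"{slowdown_pct(elapsed, runs['stock']):+.1f}% vs stock"
    )
```

On these assumptions the undervolt costs a little over 4% of throughput, and the memory bump recovers more than half of that.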

That's all I've tested so far.

I'm happy for longer run times given the decrease in wattage from undervolting. It's also much quieter. 


Chooka
Joined: 11 Feb 13
Posts: 134
Credit: 3689865759
RAC: 2207346

Keith Myers wrote:

Radeon VII has been discontinued.

https://pcper.com/2019/07/report-amd-radeon-vii-has-been-discontinued/

WOW already!

I've seen the reviews that make the Radeon VII and its hefty price tag look bad compared to Navi, and there's been talk that it should get a price cut now that Navi is out, but I guess it's the price of the HBM2 that makes it expensive.

I doubt Navi will be anywhere near as good as the Radeon VII for this kind of crunching, but it will be interesting to see.


archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7205854931
RAC: 928988

Until a few days ago my Radeon VII machine had run nicely, turning in daily production of about 1,600,000 credits of Gamma-Ray pulsar work, and serving as my personal main machine.

It is sick now.

Approximately in sequence, the following events have occurred (causal links not implied here).

I did a routine Windows cumulative update on September 18.

By the end of September 19 I had had about three system crashes.  The symptoms varied.  Once it stopped doing anything useful (Einstein, weather reporting) in the middle of the night, and failed to give me access in the morning, so I did the long power button shutdown.  Once it blue screened while I was sitting at it, with the "reason" message on the blue screen saying something along the lines of a hang in a driver (which driver was not mentioned, that I noticed).

On September 19 I installed the latest AMD driver.

Since then I've not had more blue screens, nor total system stalls. 

But Einstein computing performance has been very ragged, including:

1. Intermittent periods in which the reported GPU temperature drops drastically from the normally steady 83C (not the hot spot number).   Sometimes it goes down by perhaps 20 degrees and just stays there for tens of minutes, with resulting elapsed times 10 to 100% longer than normal.  Sometimes it drops abruptly to near box ambient, and snaps up and down for a while.

2. A machine which would not generate a single Einstein GRP computation error in months has 48 in the past couple of days.  Many of these have come in clusters.  While one cluster had uniform 19 second elapsed times, the rest of these have had a wide scattering of elapsed times.  The tail end of stderr has generally looked much like this:

% Binary point 806/1631
% Starting semicoherent search over f0 and f1.
% nf1dots: 41 df1dot: 2.512676418e-015 f1dot_start: -1e-013 f1dot_band: 1e-013
% Filling array of photon pairs
Error in computing index of fft input array, i:-1019762235 pair:281377
ERROR: prepare_ts_2_phase_diff_sorted() returned with error 20115240
03:06:41 (6268): [CRITICAL]: ERROR: MAIN() returned with error '1'
FPU status flags: COND_1 PRECISION
03:06:55 (6268): [normal]: done. calling boinc_finish(65).

3. A machine which had a steady-state condition of showing about 1% as many invalid as valid on the task page has started generating far, far more invalid and inconclusive tasks.  As I type it shows 105 invalid and 2045 valid, but that gravely understates how bad the current condition is, as the most recent returns show many inconclusives, and the summary numbers include many tasks returned when the results were better than currently.
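
To put the summary counts in item 3 next to the old steady state, the invalid-to-valid ratio can be worked out directly. This is a small illustrative sketch using only the two numbers quoted above; the function name is made up, and as noted the summary counts understate the current condition because they include older, healthier returns.

```python
def invalid_to_valid_pct(invalid, valid):
    """Invalid tasks expressed as a percentage of valid tasks."""
    if valid <= 0:
        raise ValueError("need at least one valid task to form a ratio")
    return 100.0 * invalid / valid


# Counts quoted from the task page at the time of writing.
current_pct = invalid_to_valid_pct(105, 2045)
steady_state_pct = 1.0  # "about 1% as many invalid as valid"

print(f"current: {current_pct:.1f}% invalid relative to valid")
print(f"that is about {current_pct / steady_state_pct:.0f}x the steady-state figure")
```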

I'm puzzled both as to what might be wrong and as to what I ought to do.  Certainly it is possible that some system issue unrelated both to the Radeon VII and Einstein is the main problem (Gary would remind me to consider power supply trouble).  It is possible that Windows and driver updates put the system to a state in which healthy hardware gives bad results.  

But I'm afraid I'm inclined to think my Radeon VII has changed in some way which makes it unreliable in operating conditions it formerly tolerated nicely.  Those conditions do include operation under the supervision of MSI Afterburner, which I use to regulate the GPU fans (mostly to get steadier fan sounds than the surfboarding horror of my early days with the Radeon VII) and to set a -16% power limitation (which in turn means the card runs at somewhat lower GPU voltage and clock rate than otherwise).

Possibly my first moves should be to drop use of Afterburner and run at pure stock voltage, clock rates, and fan control.  Then reduce multiplicity from 2X to 1X.  Then back down the AMD driver version from 19.9.2 to 19.8.1.  Then pull the Radeon VII out of the box and put in the most capable and modern remaining unused GPU around here, which after my big eBay sale is probably an Nvidia GTX 750.

This is pretty traumatic.  I don't like my options, and am not confident of fixing things.  Happily I don't play games, so reverting to the GTX 750, and if that works, perhaps buying another RX 570 (then two if that is OK) are viable paths so far as my daily life is concerned.  But it would be a pity to lose the fabulous output of the Radeon VII if the card is in fact healthy.  While my RAC is in a swan dive, as of now this machine shows as Number 10 at Einstein by RAC and number 19 by total credit.


Anonymous

I offer the following link for your review.  Not sure how this would/could impact your Windows performance on E@H:

 

https://www.zdnet.com/article/microsoft-releases-out-of-band-security-update-to-fix-ie-zero-day-defender-bug/

 

I do not run Windows on any PC for crunching.

cecht
Joined: 7 Mar 18
Posts: 1525
Credit: 2867762685
RAC: 2058660

archae86 wrote:
I did a routine Windows cumulative update on September 18

What about winding back the system update on one of your hosts (or the sickly host) by reinstalling an earlier version, say from the May 2019 Win10 ISO, installing the previous AMD drivers, and then seeing whether the Radeon VII is still acting up?

Ideas are not fixed, nor should they be; we live in model-dependent reality.

Joseph Stateson
Joined: 7 May 07
Posts: 174
Credit: 3061321363
RAC: 412093

archae86 wrote:

1. Intermittent periods in which the reported GPU temperature drops drastically from the normally steady 83C (not the hot spot number).   Sometimes it goes down by perhaps 20 degrees and just stays there for tens of minutes, with resulting elapsed times 10 to 100% longer than normal.  Sometimes drops abruptly to near box ambient, and snaps up and down for a while.

 

I ran across something similar.  One of my (obsolete) S9000s has a problem: the blower on it (3rd party) is NFG.  It is a DIY add-on and runs too slow even after raising the voltage from 12 to 24.  I am waiting for the correct blower to arrive from China.  When starting BOINC, I noticed the temp slowly rises to 97 (the throttling cutoff for the S9000).  The clock immediately drops from 900 MHz to 500 MHz.  The temp then drops to 75, which is OK.  It will stay at 500 MHz forever; it does not attempt to go back up.  If I forget to turn on the power to the blower (been there, done that), the frequency drops to exactly 300, nothing gets done, and temps are in the 60s or 70s depending on whether the garage door is open or not.

 

I am guessing your temp jumps to the throttling point, and that value may or may not be shown, and then drops to that minimum, sort of like mine.  If you are running GPU-Z you might check the "maximum" temp, as that app does record min and max as I recall.
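
If the sensor readings are logged to a file, the min and max can also be pulled out after the fact. The following is a rough sketch only: it assumes a comma-separated log with a header row and space-padded fields, and the column name "GPU Temperature [C]" and the sample rows are made up for illustration, so check the header of a real log before relying on it.

```python
import csv
import io


def column_min_max(log_text, column):
    """Return (min, max) of a numeric column in a comma-separated sensor log."""
    reader = csv.reader(io.StringIO(log_text))
    header = [name.strip() for name in next(reader)]
    idx = header.index(column)  # raises ValueError if the column is absent
    values = []
    for row in reader:
        if len(row) <= idx:
            continue  # skip blank or truncated lines
        try:
            values.append(float(row[idx].strip()))
        except ValueError:
            continue  # skip cells that are not numbers
    if not values:
        raise ValueError(f"no numeric data in column {column!r}")
    return min(values), max(values)


# Hypothetical sample data, not taken from any real log.
sample = "\n".join([
    "Date , GPU Clock [MHz] , GPU Temperature [C] ,",
    "t0 , 900.0 , 83.0 ,",
    "t1 , 900.0 , 97.0 ,",
    "t2 , 500.0 , 62.0 ,",
])

lowest, highest = column_min_max(sample, "GPU Temperature [C]")
print(f"min {lowest} C, max {highest} C")
```

A brief spike to the throttle point would show up in the max even if it never appears on screen.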

 

Incredible that the VII was discontinued so soon. Supposedly there are problems with OpenCL on the replacement 5700 series.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117010041007
RAC: 36478305

archae86 wrote:
I'm puzzled both as to what might be wrong and as to what I ought to do.  Certainly it is possible that some system issue unrelated both to the Radeon VII and Einstein is the main problem (Gary would remind me to consider power supply trouble).  It is possible that Windows and driver updates put the system to a state in which healthy hardware gives bad results.

I'm in the fortunate (?) position of having a large sample size over which to gain experience about what can cause problems.  I happened to buy around 60 300W 80+ efficient PSUs that could do 270W on the 12V rail and whose OEM was SeaSonic (back in 2007) and I've been quite keen to keep them going, even right up to today.   I do have a lot of experience with PSU problems and how to fix them :-).  However, that's not the broken record that I'll be playing this time :-).

As it turns out, I have a couple of recent examples of what can happen to GPUs as they 'age'.  I'm not saying this is what is happening to you but it might be worth considering.  In my cases, the computer might still keep running but the GPU tasks crash from time to time or are declared to be invalid (or even validate errors) if they actually run to completion.  The error output given from a crashed task looks very much the same as what you show.  Because I have so many older PSUs, my first reaction is to replace that with a known good spare.  On inspection of the old PSU, I look for signs of swollen caps or inadequate fan lubrication and these two things do end up being the cause of quite a few of the problems.

Today's story is about a recent example where a new PSU seemed to solve the problem for a while, but then it came back, about 4 weeks ago.  The new PSU had been working fine for about the last 4 months (autumn/winter here), and at the very end of August we had several days of 33-35C, which is not normal for the end of winter.  It was pretty easy to make a connection to the suddenly unusually hot days.

So I grabbed a spare server fan and strapped it to the GPU (a single fan HD7850 from early 2014) and despite having even warmer weather since the initial hot days, there hasn't yet been a single crash since the extra fan cooling was put in place.

Since the existing stock GPU fan didn't seem to be defective, my assessment is that perhaps the TIM has 'dried out' or that components have 'aged' and become more sensitive to heat and that the considerably stronger air flow over everything is now solving whatever it was that was causing the issue.  I'm quite happy to leave this setup as it is whilst it continues to have no issues.  My next move if problems resurface would be to replace the TIM with a fresh application.  At the moment I'm quite content to be a 'if it ain't broke, don't fix it' kinda guy :-).

If I had your problem, I'd be increasing the fan speed to see if that would work on its own first.  That should tell you if it's just heat, hopefully.  If the problem persists, perhaps see if returning to stock volts/frequencies has any beneficial effect.

I imagine your card is under warranty.  At one point I had problems with an RX 460 that would work fine during the standard calculations and then always crash during the followup stage of each task (in the days where there was a significant followup stage).  I could duplicate the behaviour on 3 separate machines so I returned it under warranty.  I had some trouble getting the supplier to acknowledge there was a problem so I pushed the point, "are you testing the double precision capability of this card, as this is where the problem is?"  Just by being persistent, they did agree to refund the purchase price of the card.  I don't know how they test returned cards but it might be worth investigating if the sort of failures you are seeing can be tested for in some way.

 

Cheers,
Gary.

shuhui1990
Joined: 16 Sep 06
Posts: 27
Credit: 3631456971
RAC: 0

I would use bluescreenview-x64 to check the cause of the blue screen. Then try to restore the system to a point before the update. Just to eliminate possibilities on the software side.

If a system restore doesn't help, I would test CPU, DRAM, GPU one by one.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7205854931
RAC: 928988

archae86 wrote:

Until a few days ago my Radeon VII machine has run nicely <snip>

It is sick now.

<snip>

Possibly my first moves should be to drop use of Afterburner and run at pure stock voltage, clock rates, and fan control.  Then reduce multiplicity from 2X to 1X.  Then back down the AMD driver version from 19.9.2 to 19.8.1.  Then pull the Radeon VII out of the box and put in the most capable and modern remaining unused GPU around here, which after my big eBay sale is probably an Nvidia GTX 750.

I got as far down my gradual abandonment list as the multiplicity reduction.  I took that step over four days ago, and have had seemingly good stable behavior ever since.

I'm not prepared to declare this sequence to be cause and effect, but I plan to continue at the current settings (No use of Afterburner, Radeon controlled default fan settings, 0% power limitation, 1X Einstein running) for at least a week.  If it keeps working, I can tinker around trying to see if I can get back some of what I've lost.

Losses:

- unpleasant 3.5-minute fan cycle sounds, as the 1X GRP behavior has an appreciably low usage stretch once per task.
- excess power consumption, as I'm not running any power limitation, which was accepted before.
- loss of output, as this card gets materially less GRP work done at 1X than at 2X.

But compared to pulling the Radeon VII out altogether, which I thought moderately likely, this operating point looks attractive.

I should explain that I'm not interested in reverting to a previous Windows version.  I lack experience in that particular operation, and this is my primary personal use machine--not something sitting on a shelf with no use save Einstein.  Also, I want to run with all current patches and protections.

 

shuhui1990
Joined: 16 Sep 06
Posts: 27
Credit: 3631456971
RAC: 0

The stability may be related to the SoC voltage. Since I applied the following PowerPlay table registry setting, I have been running 3x, which was not possible before.

EvenMorePowerVII_1293+: 500W(250W) TDP, +99%(+20%) max power limit, 600A TDC Core (330A), 100A TDC SoC (50A) 1293mV Vcore (1218mV), 1218mV SoC (1168mV), max Core 2400MHz (2200MHz), max HBM 1400MHz (1200MHz)

https://www.overclock.net/forum/67-amd/1633446-preliminary-view-amd-vega-bios-131.html#/topics/1633446?page=131
