All things Radeon VII / Vega 20

cecht
Joined: 7 Mar 18
Posts: 1534
Credit: 2905235488
RAC: 2179727

archae86 wrote:

...But compared to pulling the Radeon VII out altogether, which I thought moderately likely, this operating point looks attractive.

I should explain that I'm not interested in reverting to a previous Windows version.  I lack experience in that particular operation, and this is my primary personal use machine--not something sitting on a shelf with no use save Einstein.  Also, I want to run with all current patches and protections.

Lol, with all that extra graphics power, you could take up video gaming, like Forza Motorsport 7, Wolfenstein II, Just Cause 4, or Assassin's Creed Origins.  Be sure to let us know when you level up! ;)

Ideas are not fixed, nor should they be; we live in model-dependent reality.

EDU Enthusiasts of Digital Universe
Joined: 7 May 10
Posts: 3
Credit: 36242144
RAC: 330

This may be interesting to some:

I have water cooling on my Radeon VII, and the card is undervolted and overclocked.

I'm getting a stable 227-228 sec per WU.

This is the typical load on the GPU (note: the max values are from when I was playing BDO).

Here are my OC adventures before I applied the waterblock:

 https://docs.google.com/spreadsheets/d/1qw5fAwdyBGone-D3dX9RC9Htv-8gymA2pkN-g1RvT9c/edit?usp=sharing

shuhui1990
Joined: 16 Sep 06
Posts: 27
Credit: 3631456971
RAC: 0

Just want to share that the XFX Radeon VII is on sale on Newegg for $569.99, which I believe is the lowest price so far.

https://www.newegg.com/xfx-radeon-vii-rx-vegma3fd6/p/N82E16814150820

I surmise the Radeon VII will still be the most efficient card for Einstein@home a year from now.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7222784931
RAC: 966305

archae86 wrote:

I got as far down my gradual abandonment list as the multiplicity reduction.  I took that step over four days ago, and have had seemingly good stable behavior ever since.

I'm not prepared to declare this sequence to be cause and effect, but I plan to continue at the current settings (No use of Afterburner, Radeon controlled default fan settings, 0% power limitation, 1X Einstein running) for at least a week.  If it keeps working, I can tinker around trying to see if I can get back some of what I've lost.

I had some pretty bad problems running Einstein GRP work on my Radeon VII starting in late September.  As of early October it was working properly at an impaired operating point, and after my last posting here I gradually climbed back up the capability ladder, until it had run for a couple of months at a multiplicity of 2, 15% power limitation, around a 1% invalid rate, and generating about 1,600,000 daily credits on Einstein GRP.  The worst recent behavior had been brief drops in clock rate and power consumption, with a correspondingly large drop in production, usually lasting 15 minutes or so and occurring a few times per week.

But in the last couple of days things have taken a grave turn.  As of today, I am in an even worse state than in late September.

While the initial symptom I detected was some stretches of lower GPU temperature, this was not simply because of reduced clock rate, but rather cases of early failure of tasks, reported as:

Outcome: computation error:
Exit status: 65 (0x00000041) Unknown error code

Examination of the stderr generally shows something very like this:

Error in computing index of fft input array, i:1099458324 pair:281375
ERROR: prepare_ts_2_phase_diff_sorted() returned with error 20115240
10:27:44 (9404): [CRITICAL]: ERROR: MAIN() returned with error '1'
FPU status flags: COND_1 PRECISION

With the problem in full force, each task consumes a total elapsed time of 19 seconds, of which only a very small portion has GPU power level and clock rates consistent with even attempting to do actual work.  However, the penumbra tasks generate elapsed times up to several times longer, with stderr logging apparently successful completion of many intermediate points before falling into the same pit as the immediate failures.

So far I have:

- rebooted the PC
- done a full power-off cold reboot
- changed power limitation from -15% to 0%
- lowered the effective multiplicity from 2X to 1X 
- installed the latest available AMD driver, using the "factory reset" install option supposed to clear away old cobwebs

My current history of error status tasks from this PC runs over 300 of this type of failure.  My most recent success was over three hours ago, and the next thing I might try is re-installing an older driver.

My motivation for installing an older driver is not that I think it likely suddenly to return me to Einstein GRP success, but because I don't know where the overclock/power limitation section is on this new one.

I have ordered today yet another XFX RX 570 card (adding to the three I currently run successfully on two other machines).  My general plan is to install that one as soon as it arrives, and if it runs Einstein well (somewhat exonerating the host system), to order and install a second one (the box has plenty of space, cooling, and power supply capacity for two RX 570 cards, though the CPU being only quad core and non hyperthreaded is a bit suboptimal).

If I get that far down the road, I'd likely be happy to provide my Radeon VII to someone here interested in having a try with it.  When it works, it has absolutely fabulous Einstein GRP productivity, and pretty decent power efficiency.  But my interest in trying to find ways to keep it running well is rapidly diminishing.

I don't know whether I have a marginally defective sample of the card, or am mishandling it in some way.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117614832992
RAC: 35218418

archae86 wrote:
... my interest in trying to find ways to keep it running well is rapidly diminishing.

That's quite understandable so let's see if we can bounce some thoughts around and come up with something.

I had a look through quite a few of the most recent errors and saw the same type of message that you posted, e.g.:

% Binary point 2/1631
% Starting semicoherent search over f0 and f1.
% nf1dots: 41 df1dot: 2.512676418e-015 f1dot_start: -1e-013 f1dot_band: 1e-013
% Filling array of photon pairs
Error in computing index of fft input array, i:1083564720 pair:281360
ERROR: prepare_ts_2_phase_diff_sorted() returned with error 20115240
15:31:17 (7708): [CRITICAL]: ERROR: MAIN() returned with error '1'


I probably looked at around 15 or more.  Some failed very early, like the one above, which was on just the 2nd 'binary point' out of 1631 in total, and others were much further advanced, e.g. more than halfway through the binary points.  One particular thing caught my eye.  On every one I looked at, the failure occurred at an almost identical pair number.

I have no real understanding of how things are supposed to work but the description seems to imply that a very large array is being filled with photon pairs and an index into this array is being calculated.  The range of values I saw for pair numbers was from a 'low' of 281350 to a 'high' of 281373 - seemingly an unusually tiny spread for any sort of random condition triggering a compute error.  To me it seems to imply that the problem is being triggered when the array reaches a certain degree of 'fullness' (for want of a better description).  It might be worthwhile sending this sort of information to Bernd and asking him to pass it on to a relevant scientist/app author for comment.  It would also be worthwhile to ask if the error code 20115240 provides any further enlightenment.
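For anyone who wants to check the spread of pair numbers across many failed tasks, here is a minimal Python sketch.  It assumes the stderr text of the failed tasks has been copied into a string or file; the function names are made up for illustration, and the regular expression is based only on the error lines quoted in this thread, so it may need adjusting if other app versions word the message differently.

```python
import re

# Matches the failure line as it appears in the stderr excerpts quoted in this thread.
FFT_INDEX_ERROR = re.compile(
    r"Error in computing index of fft input array, i:(\d+) pair:(\d+)"
)


def pair_numbers(stderr_text):
    """Return the pair number from every matching error line, in order."""
    return [int(m.group(2)) for m in FFT_INDEX_ERROR.finditer(stderr_text)]


def summarize(pairs):
    """Return (lowest, highest, spread) of the pair numbers, or None if empty."""
    if not pairs:
        return None
    low, high = min(pairs), max(pairs)
    return low, high, high - low


# The two error excerpts posted in this thread, concatenated as sample input.
sample = """\
Error in computing index of fft input array, i:1099458324 pair:281375
ERROR: prepare_ts_2_phase_diff_sorted() returned with error 20115240
Error in computing index of fft input array, i:1083564720 pair:281360
ERROR: prepare_ts_2_phase_diff_sorted() returned with error 20115240
"""

print(summarize(pair_numbers(sample)))  # (281360, 281375, 15)
```

A tight spread over a few hundred failed tasks would support the idea that the failure is tied to a particular point in filling the array rather than to a random event.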

I have seen this same error condition in the past.  I have no record of pair numbers but the description was the same, with the same function being mentioned.  This was occurring with an HD7850 GPU running a mid-2016 version of Linux - the last one before the old fglrx proprietary driver was deprecated.  Those GPUs (GCN 1st gen) don't work with current drivers so I have a substantial batch of machines still running that old driver.

I found that I could pretty much completely get rid of error tasks if I rebooted the machine every 12 hours.  Restarting BOINC wasn't enough.  I probably ran this way (rebooting morning and night) for more than a month with virtually no errors - unless I got too complacent and neglected to do the mandatory reboot.  The machine was near the bottom of a stack of boxes and I was too lazy to disturb the stack in order to further diagnose by swapping hardware in the problem machine.

Eventually, I got sick of rebooting and repositioned that machine so I could change hardware.  The problem got 'fixed' by changing the power supply - which surprised me as the original PSU seemed just fine.  Since changing the PSU, there has been no further problem.  That was around 6 months ago.  Maybe there was some connection that wasn't quite right - I really don't know.

However, this leads to the next suggestion for what to do while you await a response from an approach to Bernd.  You talk about getting two further RX 570s to replace the VII.  You already have a machine with two existing 570s.  I presume that machine would take the VII, so why not swap GPU hardware and see if the problem stays with the host or transfers with the GPU?  In the past, I've had good success in really pinning down the source of the problem by transferring the obvious hardware.  Sometimes the results can be surprising :-).

Good luck with whatever you decide to do.

Cheers,
Gary.

Stefan Ledwina
Joined: 23 Oct 05
Posts: 17
Credit: 2517051894
RAC: 1418093

I also had a lot of problems with my Radeon VII over the past few weeks.

Not really with computation errors, but with system lock-ups, spontaneous reboots, and blue screens, sometimes 3-4 times a day. My number of invalid tasks was always about 10% of the number of valid tasks...

Over the past few days I tried a few things:

Watercooling the card is really great. The temps came down to about 43°C, and 65°C for the hot spot (with air cooling it was always in the high 80s, and 111°C for the hot spot). Since then the number of invalids has been decreasing. But I still had the same lock-ups and reboots.

I even did a complete fresh install of Windows, a BIOS update, and installed the newest chipset drivers, but still had those errors.

Then I did a Google search and found the AMD community forum, where some people are having the same problems with their 5700 XT cards and also with "older" Radeon Vega cards. There I found two recommendations. The first: in the energy settings there's a setting for PCI Express energy saving, and turning that off could help.

But it didn't help with my computer...

What really helped was the second suggestion I found: going back to the Adrenalin 19.7.5 driver https://www.amd.com/en/support/kb/release-notes/rn-rad-win-19-7-5 .

My computer has now been crunching for about 24 hours without system lock-ups or reboots. I don't know if it could help with other problems, but for people with unstable computers it may be worth trying older drivers.

I hope this helps someone.

Wish you all a merry Christmas!

Stefan

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7222784931
RAC: 966305

Stefan Ledwina wrote:
What really helped was the second suggestion I found - going back to the Adrenalin 19.7.5 driver

That was pretty easy to try, and my most recent brief periods of success came after I reverted from 19.12.3 to the version I had previously been running, 19.9.2.

Sadly, going to 19.7.5 does not seem to have helped.  Most of the tasks I have started have bailed out at the fastest point, 19 seconds elapsed, with just a few getting a little further (and managing to spool up the fan); the longest elapsed time logged before failure was 42 seconds.  As that seems not to have helped, I think I'll go back to 19.9.2.

Thanks for your observations and suggestion.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7222784931
RAC: 966305

Gary Roberts wrote:
You talk about getting two further RX 570s to replace the VII.  You already have a machine with two existing 570s.  I presume that machine would take the VII, so why not swap GPU hardware and see if the problem stays with the host or transfers with the GPU? 

Maybe I should.  As the first "further" RX 570 is already on the way, I think I'll wait for that one, and just take out the VII and put the newest RX 570 into my main machine.   If that goes very poorly, I'll have to suspect some type of host problem on my main machine.  If it goes well, I'll ponder whether my curiosity is sufficient to take the time, trouble, and risk of swapping the VII into the current dual 570 machine.  

One obstacle is inconsistency.  The swapping method works better with a nice, reproducible failure syndrome than it does with the episodic mess I have tangled with throughout my VII experience.

For the moment, I'm saving a few hundred watts of power consumption, and my study is a bit uncomfortably cool.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7222784931
RAC: 966305

archae86 wrote:
As the first "further" RX 570 is already on the way, I think I'll wait for that one, and just take out the VII and put the newest RX 570 into my main machine.

The first additional RX 570 arrived today.  I've got a dozen validations and no observed trouble so far.  The VII is out and waiting for disposition.  Amazon wants just $137.45 plus tax for the particular XFX RX 570 model I've been buying, so I've ordered the "second further" just now.

Within a week or two, if nothing changes in my thinking and experience, I'll be happy to provide my VII to an interested Einstein member.  The risk to you is that mine may indeed have been a little marginal at the start, and have taken some turn for the worse recently, so any effort you put into it could be wasted.  The opportunity to you could be the acquisition of a very highly productive (1.6M credits/day) Einstein GRP card for nothing more than the trouble of figuring out how to get it to run in your system.

Stef
Joined: 8 Mar 05
Posts: 206
Credit: 110568193
RAC: 0

I don't know what the situation is in the States, but here there are plenty of used RX 580s available relatively cheaply; the market price is about $110-$120. I got one to help with the heating this winter, and I already need noticeably less firewood. I wouldn't mind adding a VII :-D
