All things Radeon VII / Vega 20

Peter van Kalleveen
Joined: 15 Jan 19
Posts: 45
Credit: 250,329,645
RAC: 0

archae86 wrote:

The good news: a clear productivity improvement if it worked consistently, as the elapsed times per task at 2X run about 6:10, while the 3X run about 8:45 (breakeven would be 9:15)

The bad news: about 10% of the tasks I have run at 3X have terminated early, reporting Computation error (65,).  The early termination has varied from Binary Point 83 up to Binary point 1543.
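
The breakeven figure quoted above follows from per-task throughput: with N tasks running at once, the effective time per finished task is the elapsed time divided by N. A minimal Python sketch of that arithmetic, using only the times quoted in this post (the function names are made up for illustration):

```python
def to_seconds(mmss: str) -> int:
    """Convert an 'M:SS' elapsed time string to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def seconds_per_task(elapsed_s: float, concurrency: int) -> float:
    """Effective wall-clock seconds per finished task at a given multiplicity."""
    return elapsed_s / concurrency


t2x = to_seconds("6:10")  # elapsed time per task when running 2X
t3x = to_seconds("8:45")  # elapsed time per task when running 3X

per_task_2x = seconds_per_task(t2x, 2)
per_task_3x = seconds_per_task(t3x, 3)

# 3X elapsed time at which throughput would only equal the 2X rate
breakeven_3x = per_task_2x * 3
# Fractional throughput gain of 3X over 2X, ignoring failed tasks
gain = per_task_2x / per_task_3x - 1

print(f"2X: {per_task_2x:.0f} s/task, 3X: {per_task_3x:.0f} s/task")
print(f"3X breakeven: {int(breakeven_3x) // 60}:{int(breakeven_3x) % 60:02d}")
print(f"3X throughput gain: {gain:.1%}")
```

This reproduces the 9:15 breakeven and puts the 3X gain at a little under 6%, before accounting for any tasks lost to computation errors.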

Wow, that 6:10 per task is really fast compared to my 7:55

Maybe I got the dials turned down a bit too far.

Somehow running more than 2 tasks spells trouble. The unfortunate thing is that with two tasks the card is not yet at full utilization.

My previous card, the Maxwell Titan X, could crunch 5 at the same time without any issue.

I have a feeling a lot needs to be fixed in the driver stack; the card had a somewhat rushed and unprepared launch. Probably if Nvidia had not priced their Turing cards insanely high, the VII would never have come to consumers and would have remained data center exclusive.

In the driver release notes AMD acknowledges several bugs and problems with this card that are not yet fixed.

Hopefully stability will get better over time.

Also, since Nvidia released creator drivers for their cards in response to AMD's inclusion of some Radeon cards in the pro driver, AMD will hopefully integrate the VII soon and give it some compute optimizations.

Chooka
Joined: 11 Feb 13
Posts: 134
Credit: 3,574,985,759
RAC: 585,547

Can anyone shed some light on the stats my Vega VII shows?

Measured floating point speed: 1000 million ops/sec
Measured integer speed: 1000 million ops/sec

Doesn't look right.

I'm running 3 x WUs. Average run time is 544 sec/WU.

I have reserved 1 thread for each WU (28 threads crunching CPU work).

My average credit is 200K less than yours, archae86, although that could be a run time thing?

Thx

Edit - I guess one difference could also be that I pause crunching on this machine from time to time. I forgot about that. Can't play games while crunching lol.


archae86
Joined: 6 Dec 05
Posts: 3,156
Credit: 7,174,634,931
RAC: 735,894

Chooka wrote:
I'm running 3 x WU's. Average run time is 544sec/wu.
My average credit is 200K less than yours although that could be a run time thing?

I don't think either of our machines has been running at current condition long enough for the reported RAC to stabilize--but mine is closer.

My try at running 3X had frequent failures, but would have been a productivity boost if it worked. You are definitely cranking out work at a higher credit rate than I am (when you are running) as I am running at 2X and getting elapsed times averaging near 368 seconds. As it seems to work for you I should try again with an updated driver. Radeon reports I am at 19.3.1. What driver version are you running?

archae86

archae86 wrote:
My try at running 3X had frequent failures but would have been a productivity boost if it worked. You are definitely cranking out work at a higher credit rate than I am (when you are running) as I am running at 2X and getting elapsed times averaging near 368 seconds. As it seems to work for you I should try again with an updated driver. Radeon reports I am at 19.3.1. What driver version are you running?

I updated to driver 19.4.3. That seems to have given me a slight elapsed time reduction at 2X--but maybe that is some other side effect of rebooting. Then I tried switching to 3X. So far I have zero compute errors in over five hours of running. Last try I had accumulated ten Computation error 65 instances by then. Something is different. Maybe it is the driver, maybe something else. So far the productivity gain is very small at 3X versus recent 2X--but for the first five hours, I was running with the CPU affinity set to just two cores. I'll probably get a slight further boost from having just now loosened it to three cores.

Chooka

Hi Archae86,

I was getting quite a few errors just a few days ago. The only thing I changed was resetting my Power Limit back to 0%. The errors seemed to stop and I now have it back at -20%. I'm not sure if it was the power limit or just bad WUs?

I'm using 19.4.1.

About 1 month ago, I updated my driver and my PC saw multiple GPUs instead of 1 x Radeon VII. I had all sorts of issues after that and had to do a full wipe/reinstall to correct the issue.

That REALLY sucked! I'm not game to change my driver now unless I have to :) I think I posted about it on this thread a few pages back.

3 x WUs did give me a boost over 2 x.

My 2 x Vega 56s are outperforming my 1 x RVII though.


archae86

archae86 wrote:
I updated to driver 19.4.3.  That seems to have given me a slight elapsed time reduction at 2X--but maybe that is some other side effect of rebooting.  Then I tried switching to 3X.  So far I have zero compute errors in over five hours of running.  Last try I had accumulated ten Computation error 65 instances by then.  Something is different.

My 3X adventures are over, and I have resumed running 2X. I have reached these conclusions:

1. Driver 19.4.3 gave me a slight 2X productivity improvement over 19.3.1 on Einstein GRP work.
2. My system as currently configured gives an unacceptable rate of "error while computing" type 65 when running 3X.
3. The 3X error 65 rate is substantially raised by changing from two to three allowed cores for CPU support (as constrained by affinity)
4. The 3X error rate seems least when I am not myself doing things on the system, and raised when I use browsers, editors, and such.
5. Relaxing the power limit from the starting point of -16% as far up as -9% in 1% increments did not find a level at which the error while computing problem disappeared even with CPU support restricted by affinity to two cores.

While it might be that going to zero power limit would finally find acceptable 3X error rate (for this type of error I currently consider the acceptable rate to be zero), I would not run there as the extra power consumption troubles me more than the modest productivity improvement delights me.

I doubt there is anything fundamental about my results.  Other samples of the Radeon VII card running on other systems (let alone running other applications) may well do just fine at 3X.

 

Peter van Kalleveen

Some interesting specs for us. This is from https://wccftech.com/amd-radeon-vii-mining-hashrate-beats-titan-v-radeon-pro-duo/ and obviously refers to coin mining, but it is applicable to us as well for getting the best efficiency.

I already found out that memory overclocking does not work on my card: if I touch it, it drops significantly lower. But the core clock seems very stable at that low voltage, and power consumption has dropped a lot.

Below is a short part of the article:

After tweaking the Radeon VII, it is possible to achieve a hash rate between 90MH/s and 100MH/s. According to VoskCoin over at BitcoinTalk, the following configuration brings in 91MH/s at 251 watts. This configuration provides an efficiency improvement of 21% over the stock 319-watt power consumption.

Configuration  Core Voltage  Core Clock  Memory Clock  Power Limit  Power Consumption
Stock          1136 mV       1801 MHz    1000 MHz      +0%          319 W
Optimized      950 mV        1750 MHz    1100 MHz      +0%          251 W

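
The 21% figure in the excerpt can be checked against the table. A small Python sketch, using only the wattages and hash rate quoted above; the stock hash rate is not given in the excerpt, so hash-per-watt for the stock configuration is deliberately not computed:

```python
stock_power_w = 319.0          # stock power consumption from the table
optimized_power_w = 251.0      # optimized power consumption from the table
optimized_hashrate_mhs = 91.0  # quoted hash rate for the optimized settings

# Fraction of power saved relative to stock
power_reduction = (stock_power_w - optimized_power_w) / stock_power_w

# Efficiency of the optimized configuration in MH/s per watt
optimized_mhs_per_watt = optimized_hashrate_mhs / optimized_power_w

print(f"Power reduction: {power_reduction:.1%}")
print(f"Optimized efficiency: {optimized_mhs_per_watt:.3f} MH/s per watt")
```

The power reduction comes out at about 21%, so the article's "efficiency improvement" appears to describe the drop in power draw rather than a measured hash-per-watt gain.
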
archae86

Peter van Kalleveen wrote:

This configuration provides an efficiency improvement of 21% over the stock 319-watt power consumption.

Configuration  Core Voltage  Core Clock  Memory Clock  Power Limit  Power Consumption
Stock          1136 mV       1801 MHz    1000 MHz      +0%          319 W
Optimized      950 mV        1750 MHz    1100 MHz      +0%          251 W

Interesting. This source advocates power reduction by turning the core voltage knob, not the power limit knob. My personal observations on my card in my system on Einstein work have been bad behavior in response to imposition of reduced core voltage using that knob, and much better behavior in response to the power limit knob.

The real mechanism of power reduction must primarily depend on the core voltage and core clock rate, regardless of which knob one turns to get the result.  But the card internal controls continue to make very frequent adjustments of both voltage and frequency.  The key question is what user input configuration causes one's card to make the best choices.
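
As a rough illustration of why voltage and clock dominate, dynamic power in CMOS logic scales approximately with voltage squared times frequency. The Python sketch below applies that textbook approximation to the stock and optimized settings from the quoted table. It is an estimate under loud assumptions: it ignores leakage, memory power, and the fact that the card varies voltage and clock continuously, so it is not a model of how this card actually behaves. The function name is made up for illustration.

```python
def relative_dynamic_power(v_mv: float, f_mhz: float,
                           v_ref_mv: float, f_ref_mhz: float) -> float:
    """Dynamic power relative to a reference point, assuming P ~ V^2 * f."""
    return (v_mv / v_ref_mv) ** 2 * (f_mhz / f_ref_mhz)


# Settings from the quoted table (stock is the reference point)
estimate = relative_dynamic_power(950, 1750, v_ref_mv=1136, f_ref_mhz=1801)

# Ratio of the power figures reported in the same table
measured = 251 / 319

print(f"Estimated relative dynamic power: {estimate:.2f}")
print(f"Reported relative power:          {measured:.2f}")
```

The estimate (about 0.68) is lower than the reported ratio (about 0.79), which would be consistent with part of the card's power not scaling with core voltage.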

My personal Radeon VII current configuration involves control using Afterburner (I like the smooth fan speed I could get out of it much better than the motorboating I frequently got under Wattman control) with the following user settings:

Core voltage -- default (reads max 1123)
Power Limit -- -16%
Core clock -- default (reads max 1801)
Memory clock -- default (reads max 1000)
Fan speed is on a user map

As to the average actual operating condition, GPU-Z reports these averages:

GPU clock: 1605 MHz
Memory clock: 1000 MHz
GPU voltage: 1.07 V
GPU only Power draw: 207 W
GPU Temperature: 79C
GPU Temperature (hot spot): 98C

I've seen multiple reports of people seeing quite severe errors when adjusting down the voltage limit by surprisingly small amounts on the Radeon VII.  I speculate that the current cards are shipping with controls which are not very smart about setting the other parameters to workable levels when user voltage limitation is dialed down.

Quite likely these matters are very workload dependent.

Peter van Kalleveen

I have noticed that reducing voltage drops the temperatures really fast.

If I take the standard undervolt in Wattman, sooner or later it will crash.

But running at 955 mV and setting the clock speed to 1750 MHz gives a much cooler picture.

Normally the core clock would dip, spike, dip and continue to fluctuate between max and around 1400 MHz. With these settings it runs almost continuously at around 1723 MHz.

I have a pretty aggressive fan ramp, but still, with 24 degrees Celsius ambient the card only gets up to 57 degrees and the hotspot to 78 degrees Celsius, with the fans blowing at around 3000 rpm, just before they start to get loud.

Fortunately, no errors yet.

Traditionally undervolting has worked very well for me on mobile chips in laptops, so I understand the idea of putting a little brake on the maximum, giving the voltage much more room to drop while clocks remain constantly high.

That said, the power limit slider is a lot more user friendly, but I have no idea where it pulls the brake. If it's just clock speed, it will hurt performance a lot more; or maybe it is a combination of clocks and voltages in the background.

Still, I am frustrated that half the world seems to be able to overclock their memory while mine seems to fail as soon as I touch it.

BTW, has anyone experimented with the memory timings on the GPU?

Chooka

archae86 wrote:

4. The 3X error rate seems least when I am not myself doing things on the system, and raised when I use browsers, editors, and such.

 

I can attest to this. When I have the power limit set to -20% and I start flicking through web pages, the fans ramp up quite a bit. A couple of times the screen just went blue and the card ceased output.

Upon restart, Wattman resets to 0 power limit.

I see I have quite a few errors again. Maybe I should just stick to 2X.

