All things Radeon VII / Vega 20

Chooka
Joined: 11 Feb 13
Posts: 117
Credit: 3230260814
RAC: 17

Thank you!


Chooka
Joined: 11 Feb 13
Posts: 117
Credit: 3230260814
RAC: 17

DF1DX wrote:

WattMan (19.3.2, Win 7) on host 6742381:

Power limit: -20 %

Max GPU: 1750 MHz@1025 mV

Max Mem: 1000 MHz

Fan:

50° - 8 %

72° - 32 %

94° - 44 %

104° - 50 %

One Universe BHspin CPU task plus two FGRP GPU WUs running concurrently, t ~ 375-380 s.

One or two driver resets per day. Each needs a BOINC restart.

Can you please explain these fan settings?

Does that mean to keep the junction temp @ 104 degrees, the fans will run @ 50%?

Anyone else setting this manually or just using auto?
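
A note on reading a curve like the one quoted above: each row maps a temperature to a fan duty, so the 104° row says the fan reaches 50 % once the sensor reads 104°, not that the fan holds the card at 104°. A minimal sketch follows, assuming the driver interpolates linearly between points and clamps outside them. The thread confirms neither of those, nor which sensor WattMan feeds into the curve. FAN_CURVE and fan_percent are illustrative names only.

```python
# Fan curve points quoted from DF1DX's WattMan profile: (temperature, fan %).
FAN_CURVE = [(50, 8), (72, 32), (94, 44), (104, 50)]


def fan_percent(temp_c, curve=FAN_CURVE):
    """Return the fan duty for a temperature.

    Assumption: linear interpolation between points, clamped at both ends.
    """
    if temp_c <= curve[0][0]:
        return float(curve[0][1])
    if temp_c >= curve[-1][0]:
        return float(curve[-1][1])
    # Find the segment containing temp_c and interpolate within it.
    for (t0, f0), (t1, f1) in zip(curve, curve[1:]):
        if t0 <= temp_c <= t1:
            return f0 + (f1 - f0) * (temp_c - t0) / (t1 - t0)
```

Under these assumptions, a reading halfway between two points gives a fan speed halfway between their percentages, and anything above the last point stays at 50 %.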


archae86
Joined: 6 Dec 05
Posts: 3144
Credit: 7005794931
RAC: 1853713

Chooka wrote:

Does that mean to keep the junction temp @ 104 degrees, the fans will run @ 50%?

Anyone else setting this manually or just using auto?

Bear in mind that "junction temperature" is something new on these units.  If you have the latest version of GPU-Z this parameter is called GPU temperature (hot spot), to differentiate it from plain GPU temperature.

While I have seen the two values equal or nearly so when the GPU is idle, when running Einstein work the hot spot runs about 25C above the traditional GPU temperature.  I have seen reports that it is believed to be the maximum value observed across dozens of reporting locations on the GPU die.

To give a live example, as I type in a cool room in the morning GPU-Z reports that my Radeon VII has a hot spot temperature of 104C, a GPU temperature of 81C, and a fan speed of 55% (2837 rpm).  Mine is running under Afterburner fan control.  For the Radeon VII, the Afterburner fan control user curve is with respect to the more traditional GPU temperature.  While this may be part of why fan control under Afterburner has given me smooth variation and good temperature control without fan speed surging, I think the primary reason is algorithmic.

Until yesterday I had a stable case fan/GPU fan configuration that was giving me a reported junction temperature of about 99C.  I think a case fan control change I made during a reboot for other reasons may have had undesired effects.  Thanks for nudging me to take a look.

DF1DX
Joined: 14 Aug 10
Posts: 102
Credit: 2908431438
RAC: 1471074

I am currently using a manual profile for all settings in WattMan (19.3.2) and have set the fan control for a steady speed, without big jumps and with little noise.

Currently there are two FGRP GPU WUs running, plus one Universe task on the CPU.

The fan speed is between 1950 and 2050 rpm, GPU temp @ 72 °C and junction @ 94 °C.

Side panel open, room temp @ 22° C.

Because of the driver resets, I wrote a small tool that checks the file job_log_einstein.phys.uwm.edu.txt every minute for changes. If 10 minutes pass without a change, the computer is restarted. That way I lose only 10 minutes if BOINC is blocked, rather than many hours if it happens at night. Not elegant, but it works for me.
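
For anyone wanting to try the same idea, here is a rough sketch of such a watchdog. It is not DF1DX's actual tool. The log path assumes the default BOINC data directory on Windows, the restart uses the standard Windows `shutdown /r /t 0` command, and the function names are made up for illustration.

```python
import os
import subprocess
import time

# Assumption: default BOINC data directory on Windows; adjust for your install.
JOB_LOG = r"C:\ProgramData\BOINC\job_log_einstein.phys.uwm.edu.txt"
CHECK_INTERVAL_S = 60      # look at the file once a minute
STALL_LIMIT_S = 10 * 60    # restart after 10 minutes without a change


def is_stalled(last_modified, now, limit=STALL_LIMIT_S):
    """True if the log has not been modified for at least `limit` seconds."""
    return (now - last_modified) >= limit


def reboot_windows():
    """Restart the machine immediately (Windows only)."""
    subprocess.run(["shutdown", "/r", "/t", "0"], check=False)


def watch(path=JOB_LOG, reboot=reboot_windows, sleep=time.sleep,
          clock=time.time, getmtime=os.path.getmtime):
    """Poll the job log's modification time and reboot once it goes stale."""
    while True:
        if is_stalled(getmtime(path), clock()):
            reboot()
            return
        sleep(CHECK_INTERVAL_S)
```

Calling `watch()` starts the loop. The job log only changes when a task finishes, so the 10-minute limit only makes sense while tasks complete faster than that, as the roughly 375-380 s FGRP tasks reported earlier in the thread do.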

Chooka
Joined: 11 Feb 13
Posts: 117
Credit: 3230260814
RAC: 17

Thanks for the feedback guys.

I've had mine running since Friday now. It was pretty noisy until I set the power limit to -20%. Now it's much quieter than my old dual Vega 56 blower cards.

The fan usually stays around 1680 rpm and I've had no issues whatsoever running 2 x concurrent WUs.

Current temp is 84°C and 106°C junction temp. Ambient temp in my room is around 30°C (even with the air con on, 6 PCs crunching keeps the temp up :( )

84° is about normal for ATI cards so I'm happy with that.


Chooka
Joined: 11 Feb 13
Posts: 117
Credit: 3230260814
RAC: 17

Anyone else getting a number of computational errors lately?

I've noticed quite a few today but can't see anything odd with regards to the running of the card?


archae86
Joined: 6 Dec 05
Posts: 3144
Credit: 7005794931
RAC: 1853713

Chooka wrote:

Anyone else getting a number of computational errors lately?

I've noticed quite a few today but can't see anything odd with regards to the running of the card?

What exact symptom are you getting?

I backed down my power limit to first -15 and more recently -16% after my second involuntary reboot and have seen no run-time anomalies at my Radeon VII system since.  My task list at my Einstein account for my Radeon VII currently shows 14 invalid and 2357 valid results.  I tend to regard the ratio of those two numbers as indicative of health, and 14/2357 is actually far better than my rate on this particular series of Einstein Gamma-Ray Pulsar work on my Nvidia Pascal cards.  My sole RX 570 machine currently has 2/657, which is even better, but I suspect the difference is noise.
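
To put rough numbers on that comparison: 14/2357 is about 0.59% and 2/657 is about 0.30%. The sketch below also checks the "noise" suspicion, under the assumption (not from the post) that invalid counts behave roughly like Poisson counts, so the uncertainty on a count n is about sqrt(n). The function names are illustrative.

```python
import math


def invalid_ratio(invalid, valid):
    """Ratio of invalid to valid results, as used in the post."""
    return invalid / valid


def ratio_std_error(invalid, valid):
    """Rough standard error of the ratio, assuming a Poisson-like invalid count."""
    return math.sqrt(invalid) / valid


def z_score(a, b):
    """Difference between two (invalid, valid) ratios in units of combined error."""
    diff = invalid_ratio(*a) - invalid_ratio(*b)
    combined = math.hypot(ratio_std_error(*a), ratio_std_error(*b))
    return diff / combined


radeon_vii = (14, 2357)
rx_570 = (2, 657)
print(f"Radeon VII: {invalid_ratio(*radeon_vii):.2%}")
print(f"RX 570:     {invalid_ratio(*rx_570):.2%}")
print(f"z-score:    {z_score(radeon_vii, rx_570):.2f}")
```

With only 2 invalid results on the RX 570, the gap works out to about one combined standard error, which fits the suspicion that the difference is noise.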

Looking at the task list for your Radeon VII, I see dozens of "Error while Computing".  I have exactly zero of those.  I suggest that you consider releasing the power limit all the way back to 0%, and if that heals the problem, try inching your way back down.

It seems odd that AMD restricted the allowed power-limit range to only -20%, unlike other cards; it seems likely they know that at least some cards get into trouble not far from there.  However, I have no hard evidence.  If you are willing to twist that knob and report, maybe we can acquire some.

Chooka
Joined: 11 Feb 13
Posts: 117
Credit: 3230260814
RAC: 17

Hi Archae86,

I've pretty much had the power limit set to -20% since day one and haven't made any changes to my system. The only thing I've changed in the short time I've been getting errors is my CPU project, from Cosmology to Rosetta.

I have my 16C/32T Threadripper set to 90% usage, so there should be ample CPU capacity for running 1 x Radeon VII with 3 x concurrent WUs.

This morning I changed the power limit to -10%, but I don't think that's the issue. I could also try dropping back to 2 x WUs, but again, things have been relatively fine until the last few days.


Chooka
Joined: 11 Feb 13
Posts: 117
Credit: 3230260814
RAC: 17

Uh oh... help please.

So I updated to the latest driver, and when I restarted my PC, it now thinks I have 2 cards (Device 0 and Device 1).

I reinstalled the old driver but there was no change. Does anyone know why this has happened?


mmonnin
Joined: 29 May 16
Posts: 290
Credit: 3212249019
RAC: 9371

See this thread on the BOINC forums. Try removing the drivers, unplugging from the network, and installing the AMD drivers while offline, so Windows 10 doesn't fetch drivers on its own.

https://boinc.berkeley.edu/forum_thread.php?id=12807
