A walk to the AMD side

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117784975270
RAC: 34706794

archae86 wrote:
Perhaps others might mention here whether they succeed or fail in running 3X for Einstein GRP work on Polaris (e.g. RX 570) or Radeon VII cards.

I've been intending to respond, but needed to finish some 'script extensions' (that turned into a rather major set of re-writes) for my management scripts.  That's all done and tested now, so I'll apologise for the delay and share some of my experiences from when I was running 3x.

I bought my first Polaris GPU (RX 460) well over 2 years ago.  It had 2GB VRAM so I was never tempted to try anything more than 2x.  When I saw how reasonable they were on power requirements, I accumulated quite a few more, but always 2GB models.  Then, perhaps 9 months or so later, I started buying some RX 570/580 types - whatever was the best 'special' at the time.  I found very little difference in performance between these two higher models, so I didn't bother trying to get 580s if a 570 was a better price (which it invariably was).  I did buy some 580s when they were the same price as a 570.

When I had the first group of these, I ran them, seemingly without too much problem, at 3x because of the extra VRAM.  At the time, there was an unrelated issue which I had first noticed with RX 460s and reported here as possible resource starvation.  When I had some 570s/580s running, they showed the same issue, the only difference being that the time before a reboot was needed was a lot shorter - around 12 days.  The only reason for mentioning this is that this problem disappeared for all my Polaris GPUs with a subsequent kernel/amdgpu driver update.  Then along came a different problem (after yet another update) which seems to me to be similar to what you are seeing.

At that time, I was running very nicely using 3x for all the 4GB 570s/580s.  I had got into the habit of updating the kernel from time to time (probably about once every 2 months).  My reasoning was that the amdgpu driver, being part of the kernel, would be updated as well, so it would be a good idea to take advantage of progressive bug fixes being made.  I have a full local copy of the PCLOS repo so it was a fairly quick and painless exercise.  Whilst I don't know for sure, my impression was that one of these updates introduced my particular version of the 3x problem.  The main effect seemed to be something akin to your "sloth mode" for some of the concurrent tasks (but maybe not all), which often ended with compute errors if not detected and dealt with promptly.

For me, the gain from 3x was relatively small, so when I discovered that the problem disappeared completely on returning to 2x, that's what I chose to do.  One of these days I might try 3x again, but that's not very high on the priorities list.  With the advent of the 5.0.x kernels and, just recently, the 5.1.x kernels, it's about time for all the Polaris hosts to get a full upgrade.  I have a few with RX 570s that I put into service recently with 5.0.9 kernels.  I should put one of those on 3x and see what happens :-).

It's actually very hard to get motivated to make changes that might upset what seems to be currently a nice equilibrium for my fleet :-).   However, it would be useful to know if there is any change in the behaviour I saw previously when the 3x problem first showed up for me - probably a year ago, so lots of driver changes by now.  I've selected a host that's been running for around 12 days with no problems at 2x.  It's now running 3x so I'll monitor it closely to see what happens.

Cheers,
Gary.

shuhui1990
Joined: 16 Sep 06
Posts: 27
Credit: 3631456971
RAC: 0

The same behavior was mentioned in the Radeon VII thread, i.e. same boost clock, same power consumption, long elapsed time. At the time I thought the problem came from undervolting or a bad driver. I even had AMD replace a card.  On driver 19.4.2 the card didn't even work for an hour. Then I discovered that having WattMan open and monitoring caused the issue. Still, the system crashed after a day or two. But at 2x the system seems to last, even with two Radeon VIIs.

mmonnin
Joined: 29 May 16
Posts: 291
Credit: 3427286540
RAC: 3891660

3x on my RX 580 in Win7 is slow as well, so it's not just an OS thing if it's slow in Linux too. While bunkering for the Pentathlon I had accidentally started a 3rd task on another client (before reducing to 1x on the main client) and that had reset the driver.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7230031515
RAC: 1153837

Gary Roberts wrote:
For me, the gain from 3x was relatively small so when I discovered that the problem disappeared completely by returning to 2x

On the most recently built RX 570 machine, the one with the best record-keeping, I appear to have lost almost 10% in Einstein productivity by dropping back from 3X to 2X.  So I am motivated to look further into the matter.

Separately, just a few hours after I wrote my post on my 3X troubles (and dropped my last 3X machine back to 2X), my first RX 570 dropped to a state much more impaired than the "sloth mode".  While it was reporting a normal core clock rate, the reported memory clock rate had dropped drastically to 300 MHz, reported power consumption and GPU and CPU temperatures were all way down, and in the seven hours that elapsed before I noticed and intervened, not a single checkpoint was written (so the rate of progress was apparently somewhere between zero and extremely slow, not the 1/3 of normal I reported for one instance of sloth mode).  Call this "catatonic mode".

I am not sure, but believe I have seen catatonic mode at least once previously in my short experience with the RX 570 cards.  I suspect in both cases the uptime since last reboot was many days, very roughly on the order of a couple of weeks of 2X operation.

It seems to me that my RX 570 may possibly "want" to be rebooted regularly in order to contribute reliably to Einstein at 2X, and might just possibly become reliable at 3X with a sufficient reboot rate. 

In the short term, I intend to start actually logging reboots, multiplicity (my name for 3X vs. 2X), sloth mode onset, and catatonic mode onset. 
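
Something as simple as appending rows to a small CSV file should serve for the logging. A minimal sketch of the sort of logger I have in mind (Python; the file name and event labels are purely illustrative, not a fixed scheme):

    # log_event.py - append one timestamped GPU-state event to a CSV log.
    import csv
    import sys
    from datetime import datetime

    LOGFILE = "gpu_events.csv"   # illustrative name, not an actual path on my machines

    def log_event(host, multiplicity, event):
        """Record reboots and sloth/catatonic mode onsets with a timestamp."""
        with open(LOGFILE, "a", newline="") as f:
            csv.writer(f).writerow(
                [datetime.now().isoformat(timespec="seconds"), host, multiplicity, event])

    if __name__ == "__main__":
        # e.g.  python log_event.py den-570 3X sloth_onset
        log_event(*sys.argv[1:4])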

I think I'll put my most recent 570, which had no daily driver role in our household, back to 3X and start with a reboot interval of 1 day.  If it goes a week without sloth or catatonic failure, I'll try enlarging the reboot interval.

For my wife's 570 machine, I think I'll leave it at 2X and start an intentional reboot regime at the interval of once per week.  If the other 570 results at 3X are promising, I may move hers to 3X also.

For any 3X running, I need to monitor results for errors, "validate error", and "Completed, marked as invalid".  Maybe more frequent rebooting will mean these stay reasonable.  I see no daylight for 3X running on my Radeon VII unless some future change in the Einstein application, AMD driver, or AMD firmware makes a critical improvement.

While the purchase price and power efficiency of the RX 570 cards make them a highly attractive alternative to any Nvidia card I know of for Einstein Gamma-Ray pulsar work, these issues are pretty troublesome.  I'd be very reluctant to suggest the cards to others if a reboot regime is really required for satisfactory operating behavior under Windows with the current AMD driver and current Einstein application.

solling2
Joined: 20 Nov 14
Posts: 219
Credit: 1577671300
RAC: 21403

archae86 wrote:

...

Separately, just a few hours after I wrote my post on my 3X troubles (and dropped my last 3X machine back to 2X), my first RX 570 dropped to a state much more impaired than the "sloth mode".  While it was reporting a normal core clock rate, the reported memory clock rate had dropped drastically to 300 MHz, reported power consumption and GPU and CPU temperatures were all way down, and in the seven hours that elapsed before I noticed and intervened, not a single checkpoint was written (so the rate of progress was apparently somewhere between zero and extremely slow, not the 1/3 of normal I reported for one instance of sloth mode).  Call this "catatonic mode".

...

A similar behavior occurred recently on my similar system. I run it undervolted. Too much undervolting ended up in lots of errors. A little less undervolting resulted in an error rate of almost zero, BUT after a few hours the system ended up with one task in a loop, or even at a standstill with no progress other than time elapsing, and the other task extremely slow. This trouble disappeared with just a few more mV added. Hopefully, given shuhui1990's hint, that wasn't just accidental.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117784975270
RAC: 34706794

The host I set up to run at 3x has been doing so without any computation issues showing up so far.  I've taken the average crunch time for around 30 tasks that were crunched immediately prior to the change and also for a similar number of tasks completed long enough after the change to avoid any 'transition' effect.

The average crunch time for tasks at 2x was 1250 secs - i.e. 625 secs per task.
The average crunch time for tasks at 3x was 1869 secs - i.e. 623 secs per task.

So there is essentially no change in the crunch time.  I was a bit surprised by this; I expected maybe a 3% or so gain, based on what I saw perhaps 15 months ago.  So far, there has been no sign of any problem.  All crunch times, both before and after the change, have been remarkably consistent.  The maximum variation I saw was less than about +/- 10 secs from the average, and often quite a bit less than that.
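
For anyone wanting to repeat the arithmetic, the per-task figure is just the average elapsed time divided by the number of concurrent tasks - a throwaway Python sketch using the numbers above:

    # Per-task crunch time = average elapsed time / number of concurrent tasks.
    def per_task(avg_elapsed_secs, multiplicity):
        return avg_elapsed_secs / multiplicity

    print(per_task(1250, 2))   # -> 625.0 secs per task at 2x
    print(per_task(1869, 3))   # -> 623.0 secs per task at 3x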

One thing that I did notice was that the rate of invalids was rather higher than what I'm used to seeing.  Every time I've had a random look at a machine using a Polaris GPU, I've been accustomed to seeing that rate at less than about 1% of the number that validate.  I don't check this consistently - I don't really have the time to, with the number of hosts that I run.  I've just looked through 6 hosts that have been set up as a batch over the last couple of weeks.  The current invalid to valid ratios for this group are shown below.

I bought these hosts from a small business that was closing down.  They were around 4 years old and were perfect for a GPU upgrade.  The first 4 in the list now have RX 570s, the fifth has dual RX 560s (two existing hosts that had the 560s now have 570s) and the last has a spare RX 460. 

   Invalid / Valid   Ratio
   ===============   =====
          19 / 760    2.5%
          10 / 761    1.3%
           9 / 728    1.2%
          14 / 834    1.7%
          13 / 865    1.5%
           4 / 359    1.1%


Normally I would expect to see values a bit lower than this.  Of course the machine I decided to bump to 3x turned out to be the first one on the list with the highest number of invalids :-).

The 19 invalids for that machine all come from tasks crunched at 2x.  It's been on 3x for around 40 hours now and has had no compute errors or invalid results since the transition.  All six machines have perfect records so far for compute or validate errors.  I'm thinking that the current validation procedure may be a bit stricter than it used to be, so having the less common OS may be leading to a somewhat higher level of invalids.  I could imagine that two Windows machines might more consistently get 'closer' answers, which could lead to a Linux machine missing out a bit more often in a 3-way contest.

Cheers,
Gary.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7230031515
RAC: 1153837

Gary Roberts wrote:
The average crunch time for tasks at 2x was 1250 secs - i.e. 625 secs per task.
The average crunch time for tasks at 3x was 1869 secs - i.e. 623 secs per task.

Interesting.  The corresponding numbers for my second RX 570 machine, which I put back on 3X half a day ago, are:

2X 1230 secs -- equiv 615/task
3X 1685 secs -- equiv 562/task

I think you'll need to wait a while longer to gain confidence in any impact on invalid rate.  Even the "Validate error" cases don't score until your quorum partner returns his task, and the much more common "Completed, marked as invalid" don't get posted until your first quorum partner returns, the results miscompare, a second task is sent out, and that partner returns.  I haven't had any trouble (and have had lots of validations) in my new half day at 3X, but I won't draw much comfort from that for a few days yet.

I also look at the ratio of invalid to valid results on my host task summary lines.  At the moment my three machines (two RX 570, one Radeon VII) show ratios in the 1.5 to 2% range.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7230031515
RAC: 1153837

My project to test the more frequent reboot cure for my RX 570 sloth mode and catatonic mode problems is not starting well.  My second machine held 3X at full speed for less than 20 hours before dropping to sloth mode.  In sloth mode, the 3X completion times which had been 1685 seconds became 4348 seconds.  Reported GPU temperature and power consumption dropped markedly, and reported CPU consumption dropped astonishingly low.  One of the fully affected tasks has validated!

There was a minor Windows update during the night.  Perhaps that somehow triggered the change, so after a pass through the uninstall/DDU/install latest driver loop, I've resumed the 3X with daily reboot experiment.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117784975270
RAC: 34706794

archae86 wrote:
... Reported GPU temperature and power consumption dropped markedly, and reported CPU consumption dropped astonishingly low.  One of the fully affected tasks has validated!


I'm wondering if your two described 'conditions' (sloth and catatonic) have a common cause - a GPU lockup/crash that can't be recovered from properly.  In the sloth case, the GPU part recovers but gets stuck at some low frequency mode - perhaps idle frequency - and can't budge from that.  In the catatonic state, the GPU remains locked up and doesn't even start to recover.  If the GPU partly recovers and the calculations are able to continue, you would think that the driver should be smart enough to allow the speed to ramp back up to normal rather than staying at idle.  Maybe this is something that will be fixed in a future driver update.

Under Linux, I'm pretty sure I don't ever see sloth mode these days (if I ever did), only the catatonic state.  On thinking about it, the problem I used to have that I reported previously as some sort of "resource starvation" is pretty much your catatonic mode.  This still occurs for me (fairly infrequently for 2x - we'll see for 3x after enough time) and it's certainly not tied to any reasonably standard amount of uptime like the previous condition was.

It's very easy for me to detect these current events remotely and they get flagged and dealt with by a reboot.  The reboot is needed to get the GPU out of its catatonic state.  On average, an event happens about 3-4 times a week over about 50 hosts - I've seen 3 in the last six days and the previous uptimes for those hosts were 58, 31 and 61 days respectively.   Sometimes lockups occur at much shorter times than these and sometimes even longer.  It's pretty clear that it's not just a regular short term event.
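
The detection itself doesn't need to be fancy.  Since a catatonic GPU stops writing checkpoints, watching file modification times under the BOINC slot directories is enough to flag a stalled host.  A stripped-down Python sketch of the idea (the slots path and the 30 minute threshold are assumptions - my real scripts do rather more than this):

    # stall_check.py - flag GPU tasks that appear to have stopped checkpointing.
    # Assumes BOINC slot directories under /var/lib/boinc/slots (the path varies by distro).
    import time
    from pathlib import Path

    SLOTS = Path("/var/lib/boinc/slots")   # assumption - adjust to your install
    STALL_SECS = 30 * 60                   # nothing written in 30 min => suspect

    def newest_mtime(slot):
        files = [f for f in slot.iterdir() if f.is_file()]
        return max((f.stat().st_mtime for f in files), default=0.0)

    stalled = [s.name for s in SLOTS.iterdir()
               if s.is_dir() and time.time() - newest_mtime(s) > STALL_SECS]
    if stalled:
        print("Possible catatonic GPU - stalled slots:", ", ".join(stalled))
        # a real script would flag the host for a reboot at this point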

For all my hosts with Polaris GPUs, many of the uptimes they have are tied to some sort of 'power disruption event' such as frequent summer storms.  I keep daily records of uptimes and at the moment I can see 4 distinct groups of uptimes around 198, 145, 97 and  62 days respectively.  These times correlate with summer storms which hopefully, as winter approaches, are finished for the next 6 months :-). I have 3-phase power and sometimes only one phase gets affected.  This is why there are these different groupings.

I have a suspicion that these GPU lockup events, when they randomly occur, are tied to task starting/finishing states.  I haven't kept proper statistics but I regularly see (immediately after a lockup instigated reboot) that at least one task was very close to finishing or right at the start point when the lockup occurred.  This is one reason why I make an effort to properly 'space' tasks so that they aren't close to lockstep with each other.  Of course, 'drift' over time means they might get into lockstep at some future point and I wonder if that might be the trigger for some of these events.
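
For the curious, the re-spacing itself can be done with boinccmd: read each task's fraction done and, if two are too close together, briefly suspend the trailing one so the gap re-opens.  A rough Python sketch along those lines (a real script would also filter to tasks that are actually running, and the field names in boinccmd's output can vary between client versions):

    # space_tasks.py - rough sketch: re-space tasks that have drifted into lockstep.
    import subprocess
    import time

    out = subprocess.run(["boinccmd", "--get_tasks"],
                         capture_output=True, text=True, check=True).stdout

    tasks = []   # (name, project_url, fraction_done)
    name = url = None
    for line in out.splitlines():
        line = line.strip()
        if line.startswith("name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("project URL:"):
            url = line.split(":", 1)[1].strip()
        elif line.startswith("fraction done:"):
            tasks.append((name, url, float(line.split(":", 1)[1])))

    # NOTE: for brevity this doesn't exclude queued (not yet running) tasks.
    tasks.sort(key=lambda t: t[2])
    for (n1, u1, f1), (_, _, f2) in zip(tasks, tasks[1:]):
        if f2 - f1 < 0.05:   # two tasks nearly in lockstep
            subprocess.run(["boinccmd", "--task", u1, n1, "suspend"], check=True)
            time.sleep(120)  # let the leading task pull ahead a little
            subprocess.run(["boinccmd", "--task", u1, n1, "resume"], check=True)
            break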

With the first host I changed to 3x continuing to run OK, I've made the change on two more machines.  The original one was an Acer Veriton with an i5-3470 CPU and some sort of proprietary motherboard and UEFI, and I just left everything configured "as it was".  I'm interested to see if two of my own builds behave any differently.  I chose the two most recent complete builds I had done before acquiring the business machines.  Both of these have G4560 Pentium CPUs (Kaby Lake - 2/4 cores/threads) and, like the first one, only run GPU tasks.  Neither of the two has been rebooted since being put 'into production' - Feb 11 and Mar 30 respectively - with the uptimes listed in the table below.

As an interesting aside, this is the first time I've actually looked at these two machines since they were put into production.  The only time I tend to look closely at a machine is when a control script flags it for attention.  I was pleased to see that in both cases, the two running tasks were still widely separated in their start times.  When transitioning them to 3x, I made sure the 3 running tasks continued to be equally spaced from each other.

I did the transition on these two several hours ago and by now there are more than 10 valid results - sufficient to get a passable idea of crunch time.  I've averaged the 2x times over 50 results and once again the times are extremely stable.  The 3x times so far also seem to be tightly grouped around the average - certainly good enough for a ball park figure.  Here are the results in tabular form.  I'm also including the invalid to valid ratio so that, over time, we'll be able to see if that changes by the time the database accumulates more and more 3x results.

   CPU       2x Time - Per Task   3x Time - Per Task   Invalid   Valid   Ratio   Uptime
   =======   ==================   ==================   =======   =====   =====   ======
   i5-3470     1250  -  625         1869  -  623            20     680    2.9%    8 d.
   G4560       1180  -  590         1822  -  607            15     784    1.9%   95 d.
   G4560       1202  -  601         1853  -  618             2     774    0.3%   48 d.


Given that none of the above show an improvement at 3x, I'll be returning to 2x after allowing enough time to see if anything untoward happens - perhaps a week or two would give some indication.

Cheers,
Gary.

cecht
Joined: 7 Mar 18
Posts: 1537
Credit: 2915271969
RAC: 2120990

On my Linux host running two RX 570s I see a slight improvement with 3x, which I've been running for the past ~24 hr:

At 2x, avg. time = 1203 sec, for 602 sec/task
At 3x, avg. time = 1782 sec, for 594 sec/task

I have both cards capped at an SCLK P-state of 6, which corresponds to a stable clock speed of 1071 MHz (with the mining BIOS), and an average GPU power of ~75 W for one card and ~80 W for the other.  I'm not really sure why the two cards differ in their power usage.
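
For anyone wondering how the cap is applied: the amdgpu driver exposes its DPM clock levels through sysfs.  A minimal Python sketch of the idea, run as root (the card index and state number are examples from my setup - check your own paths and levels first):

    # cap_sclk.py - pin an amdgpu card's shader clock to a single DPM state.
    from pathlib import Path

    card = Path("/sys/class/drm/card0/device")   # example - card0 may not be your RX 570

    # Switch the driver to manual performance control, then allow only SCLK state 6.
    (card / "power_dpm_force_performance_level").write_text("manual")
    (card / "pp_dpm_sclk").write_text("6")

    # Read back; the currently active state is marked with '*'.
    print((card / "pp_dpm_sclk").read_text())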

The invalid rate has been about 2 per day at each setting, but it has historically varied from 0 to 4 per day at 2x. At 3x each card runs about 5W extra, which I don't think is worth it, so I'll drop back to 2x eventually.  I've not noticed any issues with performance "drift" over time, but then I check on (play with) the host regularly and keep task % completion on each card well-spaced. For one reason or another I haven't had the host go for more than a few weeks without a reboot so can't say how stable things are over the long haul.

Ideas are not fixed, nor should they be; we live in model-dependent reality.
