Possible resource starvation (?) with FGRPB1G tasks on RX 460 GPUs under Linux

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,196
Credit: 41,769,445,468
RAC: 44,850,632
Topic 210953

SUMMARY

This is a report on a possible resource-exhaustion problem that I seem to be seeing when running FGRP5/FGRPB1G tasks on a range of hosts of different generations, all fitted with AMD RX 460 GPUs.  The OS is Linux, but not one of those supported by the specific AMDGPU-PRO driver/lib packages supplied directly by AMD.  By looking at what gets installed when using the --compute option in the AMD-supplied install script for the Red Hat package, I have been able to get GPU tasks to run correctly and produce valid results on my Linux distro of choice.  The hosts run for significant lengths of time (up to ~30 days) before exhibiting symptoms of resource exhaustion.  I want to document the details as I have observed them, to see whether others are seeing similar behaviour or have alternative theories about what might be happening.

FULL REPORT

I have quite a few older CPU crunchers that have been upgraded with RX 460 GPUs from different manufacturers - Asus, Gigabyte, Sapphire, XFX, in the main.  I also have some new builds using the same GPU.  The CPU/motherboard combinations cover a very wide range, from 2008 to recent, and include Q6600 quads, Q8400 quads, E6300/6500 Pentium dual cores, various Ivy Bridge and Haswell CPUs right up to recent Kaby Lake G4560 Pentium dual cores.  There are DDR2,3,4 RAM variants in the mix.  There are various PSUs, from old and faithful to brand new, high efficiency.  The only common factor is the RX 460 GPU.  The machines all run PCLinuxOS - not all precisely the same version - but the OpenCL compute libs all come from the same source.

I have no desire to run Ubuntu or a derivative, or even Red Hat or a derivative, so there isn't a simple procedure to install a pre-packaged driver/OpenCL compute libs combination.  On installation, PCLOS installs the free amdgpu driver.  To get the GPUs to crunch, I looked at what gets installed using the --compute option in the Red Hat version of the AMDGPU-PRO package installer (version 16.60).  I did this back in February this year.  It's a long story, but the GPUs seem to be able to crunch fine by installing just the compute libs this way.  In truth, I'd stumbled on a procedure that works without really understanding why.

The performance was good and the hosts did run for quite a while with no apparent issues.  I tended to reboot from time to time and I did make hardware changes progressively over time so it wasn't initially apparent that there was any real problem.  Now there are signs that there may be some sort of resource exhaustion issue (memory leak?) with a GPU software component.  I'm not a programmer so that's just a guess to explain what I'm seeing recently.

With a workable arrangement for RX 460s, I've been steadily bringing retired CPU only machines back on line.  Over a period, these have been stopped and restarted multiple times for various reasons so that none of them had long uptimes that were in sync with each other.  The odd machine would have a particular type of issue from time to time.  A lot of the hardware is quite old so I expected problems.  I have replaced capacitors on motherboards and in PSUs and even installed brand new PSUs and this has (until very recently) seemed to cut down the incidence of machines having regular stoppages/crashes/lockups - whatever.

Just over a month ago, a car demolished a power pole in a neighbouring suburb and it took more than 6 hours for power to be restored.  So every machine in the fleet got a restart at much the same time - it took 2-3 days to get my hardware damage sorted and things back to normal.  So for the first time, all the upgraded RX 460 machines had pretty much the same uptime.  Whilst a couple have been restarted in the period since, all those that hadn't (around 20 machines) have had an almost identical problem.  A common factor is that these 20 machines all had uptimes in the range 27-30 days.  To me, this was the key bit of evidence for some sort of resource exhaustion.

To make it very clear, the problem concerns a group of around 20 machines where the only remotely common hardware component is an RX 460 GPU.  The problem is restricted to the GPU because in all cases, the machines continue to crunch CPU tasks and can be remotely accessed with ssh or even BOINC Manager running on a server machine.  The machines don't normally have a screen, keyboard, mouse attached.  I only hook these up as required.  When I hookup to a 'good' machine, everything just works normally.  If I hookup to a machine having a problem, the screen is black and the keyboard and mouse give no reaction.  The numlock light is usually on but not always and sometimes cannot be toggled.  I used to just do a hard reset but some time ago I noticed that a hard reset is not necessary as a machine would almost invariably respond to the Linux 'REISUB' trick.  Just google it if you want to know about that :-).
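For anyone unfamiliar with it: REISUB uses the Linux "magic SysRq" keys.  Whether it works depends on the kernel's sysrq setting, which can be checked (and, as a sketch assuming a standard sysctl layout, enabled) like this:

```shell
# Show the current magic-SysRq bitmask (1 means all functions enabled).
cat /proc/sys/kernel/sysrq

# To enable all SysRq functions until the next reboot (needs root):
#   echo 1 > /proc/sys/kernel/sysrq
# To make it permanent, add "kernel.sysrq = 1" to /etc/sysctl.conf.

# The sequence itself, typed while holding Alt+SysRq, with a pause between keys:
#   R (unRaw keyboard), E (tErminate processes), I (kIll processes),
#   S (Sync discs), U (remount read-only), B (reBoot)
```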

So even though a machine is seemingly quite unresponsive, it is still running.  I've recently started logging in remotely over ssh to a problem machine and checking with 'ps' for the status of running tasks.  BOINC is still running and clocking up time.  FGRP5 tasks are still running and clocking up time.  FGRPB1G tasks are listed as running but don't clock up time.  If I try to use boinccmd to get the client to exit, the machine will lock up.
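That 'ps' check can be scripted; a rough sketch (the process-name patterns are assumptions based on typical Einstein@Home app binary names - adjust them to whatever 'ps' actually shows on your hosts):

```shell
# Show elapsed time vs accumulated CPU time for science-app processes.
# A GPU task whose ELAPSED keeps climbing while TIME stays frozen
# matches the stuck-task symptom described above.
ps -eo pid,etime,time,args | grep -E 'hsgamma|einstein' | grep -v grep || true
```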

I've also started connecting using BOINC Manager on another LAN host and examining the tasks list of the problem machine.  CPU tasks are running normally and results get reported and new work gets downloaded.  GPU tasks appear to be running because elapsed time is increasing - way beyond what it normally should take.  I've seen as much as 20+ hours.  However % done is not moving and there is no movement in CPU time as disclosed by 'ps'.  Apart from typing Alt+Sysreq+R E I S U B on the connected keyboard, I can also trigger a remote reboot over ssh to restore normal operation.  But it has to be a full reboot because I can't stop and restart the client only without locking up the machine in the attempt.  My guess is that the graphics card is in an inconsistent state and needs a full reset.  Sometimes, a reboot is not enough.  Everything is apparently working but GPU crunching is much slower than normal.  This happens in maybe 10-20% of cases.  A shutdown, remove all power, cold restart sequence always resolves this.

When a problem machine is restarted, there is almost invariably no particular issue visible.  Crunching always correctly restarts from saved checkpoints and there are no compute errors.  Very regularly, one or more tasks in progress (either CPU or GPU or both) will finish quickly and upload - perhaps within seconds to a minute or so after restarting.  This happens so regularly that it can't be just coincidence.  Such tasks are in the followup stage after normal crunching has finished.  My guess is that something is happening in that followup stage that contributes significantly to resource depletion - enough to tip it over the edge whilst in that stage.  Whilst both types of tasks are involved, it happens surprisingly often with CPU tasks.  If it were a random event, you would expect it to be quite rare for CPU tasks to just happen to be in the followup stage.

I'm documenting this now because the chance alignment of so many RX 460 hosts with ~30 day uptimes has seemingly clarified the issue.  It's hard to believe that this problem is not some sort of resource exhaustion.  The other main GPU types I use are the older Pitcairn series HD7850 and the newer (but still Pitcairn series) R7 370.   All these use the old fglrx driver and do tend to have much larger uptimes without an apparent resource exhaustion problem.  At the moment they have ~35 day uptimes but I've looked back before the power outage and in August, quite a few had uptimes of ~235 days which just happens to coincide with the previous full power outage (a storm) on 6th December 2016.  Many others of them had uptimes around 210 days and 180 days and there were storms in Jan - March which caused power glitches that crashed some machines on particular power circuits.  I have a 3-phase power connection with circuits on different phases so at storm time (December to March) only a subset of machines might be affected by a particular event.

I'm hoping one of the Devs or perhaps the program author(s) may be inclined to think about the above description and give a better informed opinion on whether or not there may be resource exhaustion happening.  If so, is it the driver/compute libs or the app itself or a combination of these?   Is there anything to be done to better isolate/identify the true cause?

If anyone has any thoughts, please feel free to share.  My next step will be to download the 17.40 version of the Red Hat AMDGPU-PRO package and take a look at what the --compute option installs for that.  If things look feasible, I might try to get things working with that version to see if there is any difference in behaviour.

Cheers,
Gary.

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513,211,304
RAC: 0

Gary Roberts wrote:
If anyone has any thoughts, please feel free to share.  My next step will be to download the 17.40 version of the Red Hat AMDGPU-PRO package and take a look at what the --compute option installs for that.  If things look feasible, I might try to get things working with that version to see if there is any difference in behaviour.

If you can run clinfo on a problem machine, that should show OpenCL memory available.  Logging it daily may be revealing.
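Logging it daily could be as simple as a small script run from cron - a sketch only; the log path is a placeholder and the lib path is the one mentioned later in this thread:

```shell
# Sketch of a daily clinfo memory logger.
# Crontab entry (hypothetical):  5 0 * * *  /root/bin/log-clinfo.sh
LOG="${LOG:-$HOME/clinfo-mem.log}"
export LD_LIBRARY_PATH=/opt/amdgpu-pro/lib64
{
  date '+%F %T'
  clinfo 2>/dev/null | grep -iE 'global memory size|max memory allocation'
} >> "$LOG" || true
```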

You probably need to at least try running a problem machine with a supported OS/Driver combination to rule out any vague PCLinuxOS issues.

Finally if they are stable for 20 days, maybe just reboot  them once every 20 days.


Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,196
Credit: 41,769,445,468
RAC: 44,850,632

Thank you very much for the marathon read and your response.  I do very much appreciate it.

AgentB wrote:
If you can run clinfo on a problem machine, that should show OpenCL memory available.  Logging it daily may be revealing.

Yes indeed and I intended to do just that.  I remember that I did run clinfo at your suggestion (I think) way back earlier in the year when I first started using RX 460s (and talking about them) but I hadn't run it since and didn't really remember much about it.   I did try to run it again when I saw a bunch of consecutive rapid fire failures but it wouldn't launch and would just give an obscure failure message that I wasn't smart enough to interpret.  I'm getting so old and decrepit that I couldn't immediately figure out why, so it went on the back burner.

Quote:
You probably need to at least try running a problem machine with a supported OS/Driver combination to rule out any vague PCLinuxOS issues.

Also yes indeed but I'm old and set in my ways so that's not an enticing option :-).  I'm sufficiently stubborn to exhaust other options before considering that.

Quote:
Finally if they are stable for 20 days, maybe just reboot  them once every 20 days.

Yep, it's in the back of my mind but it goes against the grain to hide something if it's possible to fix it :-).

So, your response caused me to be sufficiently embarrassed about having to admit I put something on the back burner, to actually put my mind to solving it.  All it took was to properly set up LD_LIBRARY_PATH so that clinfo could use the correct libraries.  It now runs fine on the first machine I tried and produces some pages of output.  I'll have to do some research to see if I can understand what it all means.
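For anyone hitting the same obscure clinfo failure, the fix was along these lines (the lib path is the one used later in this thread; yours may differ):

```shell
# Point the dynamic linker at the AMDGPU-PRO OpenCL libraries before running clinfo.
export LD_LIBRARY_PATH=/opt/amdgpu-pro/lib64
clinfo 2>/dev/null | grep -i memory || true
```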

It sees and reports on both the GPU and the CPU.  For the GPU it reports

Max Memory Allocation:   169,844,736
Global Memory Size:      382,058,496
Local Memory Size:            32,768

and none of these have changed in the last half hour.  There are three similar parameters for the CPU (obviously different values) but I guess the first thing is to go home and get some sleep since I'm buggered and not thinking very clearly.  Maybe tomorrow I will be able to see something changing.

 Once again, thanks very much for responding.


Cheers,
Gary.

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513,211,304
RAC: 0

Gary Roberts wrote:
Once again, thanks very much for responding.

np.  Those numbers look low to start with.  You might want to try running boinc with 0, 1, 2 etc GPU tasks running and see how the memory / reliability plays out. 

You certainly don't have enough memory for another task. 


Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,196
Credit: 41,769,445,468
RAC: 44,850,632

A new day now and I've just had a look at what's going on.  Two of the three numbers have changed slightly -

Max mem:  from 169,844,736   to   169,943,040  - an increase of 98,304
Glob mem:  from 382,058,496   to   377,970,688  - a decrease of 4,087,808

The machine is my latest previously retired box (2009 vintage), now rejuvenated with an RX 460.  It has a current uptime of 4 days.  It's a more recent addition so wasn't one of the big group of machines described in the opening post.  It hasn't been rebooted since it was upgraded and put into service.  It's a dual core E6300, 4GB RAM, running 1xCPU + 2xGPU tasks.

After recording this morning's numbers, I opened the Manager to observe the status of crunching of the 2 GPU tasks.  I always try to arrange task starts so that when 1 task is entering the followup stage, the other is around the 45-50% done mark.  I noted this spread hadn't changed much from when I set it up 4 days ago - one task was at 40% and the other at 87%.  So I waited until the 87% task reached the followup stage and 89.997% had been showing for around 20-30 seconds.  Clinfo gave the following values -

Max mem:  from 169,844,736   to   665,573,376  - an increase of 495,728,640
Glob mem:  from 377,970,688   to   1,113,759,754  - an increase of 735,789,056

So, looks like something big is happening in the followup stage.  I let the task complete and as soon as the new task launched and was running (about 7 secs) I ran clinfo again and got -

Max mem:   169,648,128
Glob mem:   338,219,008

About 5 mins later with both tasks well clear of the start/finish stages the numbers were -

Max mem:   169,746,432
Glob mem:   338,350,080

I don't know the meanings of Max Memory Allocation and Global Memory Size but it is interesting to see various differences.  I'm not meaning the big jump in the followup stage, but rather the GM size going from 382M to 377M to 338M over time.  Could that be some sort of indication of progressive free memory depletion?

Obviously, I'll need to keep monitoring this for quite a while :-).

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,196
Credit: 41,769,445,468
RAC: 44,850,632

I've noticed a different machine with an RX 460 GPU that has stopped crunching GPU tasks.  It's a Kaby Lake G4560, 2 cores, 4 threads with 8GB DDR4 RAM.  It runs 2xCPU and 2xGPU tasks.  It was one of the machines that hadn't recently had this problem and so wasn't in the ~20 group discussed earlier.

I'm in the habit now of regularly (several times a day) refreshing the last page of my active hosts list.  If there's a machine with an RX 460 near the bottom (oldest) of the list whose last contact time is more than an hour ago, I tend to check it out.  That's how I noticed this machine.

I hooked up keyboard, mouse, screen and moved the mouse.  A partial image appeared, basically just the bottom panel and a lonely mouse pointer that wouldn't move - no screen background image.  The clock (showing seconds) had stopped at precisely the current time.  The action of attempting to move the mouse caused the clock to freeze.  I went to a server machine and logged into the problem host using ssh with no issue.  I navigated to /opt/amdgpu-pro/bin/ freely.  The machine was fully responsive.  I 'exported' LD_LIBRARY_PATH=/opt/amdgpu-pro/lib64 so that clinfo could be started and attempted to launch clinfo.  At that point the machine completely froze and not even going back to the directly attached keyboard and using REISUB could get it to reboot.

So I did a hard reset.  I restarted BOINC and opened the tasks tab as quickly as possible.  The machine was doing benchmarks so I knew it was a while since BOINC had last been restarted.  That gave me the idea to use stdoutdae.txt (actually .old) to find precisely when BOINC was previously restarted.  I watched the benchmarks finish and crunching start.  Within about 15 seconds, a CPU task showing quite low time and % done finished and started uploading.

I noted that both CPU tasks had restarted with quite low values.  The GPU tasks were at ~30% and ~77% and had correct looking times for those numbers.  The second CPU task finished around 4 mins from restart and just before the first of the GPU tasks.  It would appear that both of the CPU tasks had restarted from checkpoints within the followup stage even though you couldn't have guessed that from the low %done figures.

On checking stdoutdae.txt, this file had been rotated at the time of the current restart so I searched backwards in .old and found that BOINC had previously been restarted at 20:30:27 on 20/10/17 - around 26 days earlier.  Seems to be another example that fits the same pattern as previously described.
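The backwards search can be done with grep; recent BOINC clients log a "Starting BOINC client version" line at each startup, so a sketch (path relative to the BOINC data directory) looks like this:

```shell
# Last time the client was started, according to the rotated log.
LOG=stdoutdae.txt.old
grep 'Starting BOINC client' "$LOG" 2>/dev/null | tail -1
```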

After things settled down, I exported LD_LIBRARY_PATH again and ran clinfo just to be sure it would work.  I got

Max mem:   217,110,528
Glob mem:   488,046,592

I had a quick look through the full output rather than just grepping on 'memory' and noticed a parameter called Constant buffer size which was identical in value to the Max figure above.  I'd better start doing some research on clinfo if I'm going to get a clue about this :-).


Cheers,
Gary.

solling2
Joined: 20 Nov 14
Posts: 159
Credit: 471,010,584
RAC: 351

Just out of curiosity: in your BOINC Manager options, under the memory settings, the next-to-last line lets you choose whether paused CPU tasks are kept in memory.  It's unclear to me what's normally the better choice, and whether it's relevant in your case.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,196
Credit: 41,769,448,933
RAC: 44,850,449

solling2 wrote:
... It appears unclear to me what's better normally ...

It depends on your objectives and a few other things :-).

When running CPU tasks, if you must minimise memory consumption, then say no.  The downside of not using it is that a paused task will have to be reloaded from a saved checkpoint on disk when it needs to resume.  If the interval between checkpoints being saved is large, you will lose some already completed crunching, as well as the time it takes to reload the full saved image.  For that reason I have my pref set to yes.

For GPU tasks that get paused, I don't think the setting makes any difference.  A paused GPU task always gets reloaded from a checkpoint when resumed (so it seems).  GPU tasks checkpoint approximately every minute.  I sometimes pause a GPU task if there is not enough 'gap' between the two of them when running 2x.  I tend to time the pause to be issued immediately after a checkpoint is saved.  You can see how close you were in estimating that when the paused task is resumed.  There's a few seconds 'load time' which can't be avoided.  The % done will then drop by an amount related to how far back the checkpoint was when the task was paused.  This is a drop you can minimise by pausing just after a checkpoint.

Quote:
... and whether it's relevant in your case.

I don't imagine so.  If it was, there should be a lot of other hosts with GPUs, having the same issue :-).


Cheers,
Gary.

solling2
Joined: 20 Nov 14
Posts: 159
Credit: 471,010,584
RAC: 351

Thanks for commenting on those memory settings; that matches roughly what I had assumed.  What made me hesitate for a moment is that there is only that single setting about keeping tasks in memory, whereas there are several different situations in which tasks get paused.

For GPU tasks, there seems to be no setting regarding GDDR5 memory at all.  For CPU tasks, the setting mentioned apparently applies to three pausing cases (1 - BOINC Manager menu, suspend all tasks; 2 - BOINC Manager, suspend a single task; 3 - all tasks suspended automatically when CPU usage is above the threshold).  Also, even if the setting is 'yes, keep in memory', it is overruled by any software update that requires a system restart, or by any shutdown of non-24/7 crunchers.  Alright, sorry for getting a bit off topic.  Hopefully that clinfo approach will highlight what's crucial in your case. :-)

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,196
Credit: 41,769,448,933
RAC: 44,850,449

Long overdue update for the above saga.

I didn't find any way to narrow down the causes for the problem so I decided to adopt AgentB's third suggestion of rebooting each affected machine around every three weeks and have been doing basically that ever since.  There had been little change in behaviour until about 5-6 months ago when a change in the LATeah data file seemed to cause quite a change in the uptime before the problem showed up.

What had been 25+ days of trouble free running suddenly dropped to around 17-19 days.  So I started rebooting machines after 15 days and apart from the annoyance of having to reboot more regularly, the problem was again able to be worked around.  Over the entire period, I've written several scripts to manage various aspects of monitoring/controlling the fleet.  Two of those have been to do with automating upgrades/installs of the OS itself and with automating the deploying of the OpenCL libs for all the released versions of AMDGPU-PRO.
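The scheduled-reboot workaround can itself be automated with a small uptime watchdog run daily from cron.  A sketch only - the threshold is an example and the actual reboot command is left commented out:

```shell
# Reboot once uptime exceeds MAX_DAYS (intended to be run daily from root's crontab).
MAX_DAYS=15
up_days=$(awk '{ print int($1 / 86400) }' /proc/uptime)
if [ "$up_days" -ge "$MAX_DAYS" ]; then
    echo "uptime ${up_days}d >= ${MAX_DAYS}d, rebooting"
    # /sbin/shutdown -r now    # uncomment to actually reboot
fi
```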

As mentioned previously, I had worked out exactly what bits of the full 16.60 version of the AMDGPU-PRO package were needed to support OpenCL for my distro.  As AMD released new versions, I would download the new Red Hat package and spend quite a bit of time working out what had changed - not just the file content, but the new bits and where they fitted into the overall structure.

Once I had worked those things out, I decided to automate the process of installing/uninstalling different versions to facilitate the process of seeing if newer versions would have any impact on the problem.  It was a bit painful to get working correctly but I'm very glad I persisted with it.  It's now a trivial single command to install any of the versions from the initial 16.60 up to the current 18.20, no matter what version (if any) was previously installed.  It automatically determines if there is a previous version installed and, if so, uninstalls the old version completely before installing the new.
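A rough sketch of what such a version-switching wrapper might look like - the directory layout and names here are purely hypothetical, not Gary's actual script:

```shell
# Hypothetical version switcher: each extracted AMDGPU-PRO compute-lib set is
# staged under $versions_dir/<version>, and $prefix is a symlink to the active one.
switch_compute_libs() {
    versions_dir=$1; prefix=$2; target=$3
    if [ ! -d "$versions_dir/$target" ]; then
        echo "version $target not staged in $versions_dir" >&2
        return 1
    fi
    # ln -sfn replaces the symlink in one step: old version out, new version in.
    ln -sfn "$versions_dir/$target" "$prefix"
}

# Typical use (paths hypothetical):
#   switch_compute_libs /opt/amdgpu-pro-versions /opt/amdgpu-pro 18.20
```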

This has allowed me to test different machines with different versions very easily.  Unfortunately, none of these different versions seemed to have any impact on the problem.

I've also been following the progressive development of the open source amdgpu driver that AMD has contributed to the Linux kernel.  When I first started playing with Polaris GPUs in early 2017, the kernel was at something like 4.9.x or 4.10.x, if I remember correctly.  As the OpenCL libs sit on top of the amdgpu driver, I've been testing each new kernel series (4.11.x, 4.12.x, etc) as they came out.  4.18 has just recently come out but I seem to have hit paydirt with the 4.17 series which I started testing about 7 weeks ago.  I upgraded a single machine initially.  It has a kernel version of 4.17.3 and it has been running continuously now for 46 days - no sign of problems!!

There was quite a bit of stuff for amdgpu contributed to the kernel in time for the 4.17 release so I was fairly keen to test this and hopeful of some sort of solution.  As of now, all my hosts with Polaris GPUs are running a 4.17.x kernel, apart from one which is a new test machine running 4.18.3.  The versions range from 4.17.3 to 4.17.17 (the latest in the PCLOS repo).  Even though quite a few of the group have had  upgrades within the 4.17 range (and have thus been rebooted for that) there are still around 6 or so that have uptimes over 30 days with no sign of the previous problem.  So it really looks like something got fixed within amdgpu that has resolved this issue for me.  Fingers crossed :-).

As a final note, with each machine upgraded to a 4.17 kernel, there has been a small performance improvement - around 5% with some a bit more.  This too seems to be attributable to amdgpu development as I really didn't see much change when testing the OpenCL libs from 16.60 to 18.20 versions of AMDGPU-PRO.


Cheers,
Gary.
