How do I monitor task elapsed time in real time?

cecht
Joined: 7 Mar 18
Posts: 1421
Credit: 2444902309
RAC: 1503868
Topic 225237

BOINC Manager reports the current elapsed time for running tasks. Does anyone know how to access that time from the command line or from a file? I've looked at boinccmd --get_tasks, --get_state, and --get_simple_gui_info, but those commands only list checkpoint times. The same goes for reading the client_state.xml file for active tasks.

The issue is that on my RX 5600 XT card (but not my RX 570 cards), instead of the usual completion time of a few minutes, I occasionally get tasks that run on for hours with little or no progress. While the elapsed time ticks away, a stalled task doesn't advance its checkpoint times. Once I abort the stalled task(s), things get back to normal. Final elapsed time (run time) is reported to the boinc-client files only after a task completes, but I'd like to monitor real-time elapsed time in a script to help manage stalled tasks. I don't see any pattern among the various checkpoint times that distinguishes normally running tasks from stalled tasks. So where does BOINC Manager get its elapsed time data?

Ideas are not fixed, nor should they be; we live in model-dependent reality.

Mike.Gibson
Joined: 17 Dec 07
Posts: 21
Credit: 3747901
RAC: 1314

When in "Tasks", click on a

When in "Tasks", click on a unit and then click on "Properties" in the Commands section (Top Left).

Tom M
Joined: 2 Feb 06
Posts: 5586
Credit: 7673719503
RAC: 1771791


cecht wrote:

The issue is that on my RX 5600 XT card (but not my RX 570 cards), instead of the usual completion time of a few minutes, I occasionally get tasks that run on for hours with little or no progress.

I tried two fixes for that kind of behavior: various versions of the AMD GPU drivers, and re-flashing the (used) card back to the stock BIOS. The second one worked.

Tom M

A Proud member of the O.F.A.  (Old Farts Association).  Be well, do good work, and keep in touch.® (Garrison Keillor)

cecht
Joined: 7 Mar 18
Posts: 1421
Credit: 2444902309
RAC: 1503868


Mike.Gibson wrote:

When in "Tasks", click on a unit and then click on "Properties" in the Commands section (Top Left).

Yes, thanks, that's good for working through the GUI, but I need to read that value into a variable for use in a script.

Ideas are not fixed, nor should they be; we live in model-dependent reality.

cecht
Joined: 7 Mar 18
Posts: 1421
Credit: 2444902309
RAC: 1503868


Tom M wrote:

cecht wrote:

The issue is that on my RX 5600 XT card (but not my RX 570 cards), instead of the usual completion time of a few minutes, I occasionally get tasks that run on for hours with little or no progress.

I tried two fixes for that kind of behavior: various versions of the AMD GPU drivers, and re-flashing the (used) card back to the stock BIOS. The second one worked.

Yikes! The Sapphire card does have a dual-BIOS switch, so I guess I could brave it. Did you use amdvbflash? I'm hoping to just manage the problem instead of curing it, in case the cure turns out to be worse than the disease.

Ideas are not fixed, nor should they be; we live in model-dependent reality.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109390260102
RAC: 35889457


cecht wrote:
The issue is that on my RX 5600 XT card (but not my RX 570 cards), instead of the usual completion time of a few minutes, I occasionally get tasks that run on for hours with little or no progress. .... So where does BOINC Manager get its elapsed time data?

I don't think BOINC's notion of elapsed time will help you.  It's a driver problem and BOINC just continues to accumulate time after the GPU locks up.

The history goes back to 2017 for me. At that time, with RX 460s, once the uptime of the host reached ~25 days, the GPU would get itself into this state. When I first started with an RX 570, the time was only 12 days. The only way to resolve it was a cold reboot, at which point the stalled tasks would restart and complete (without further issue) from the last saved checkpoint, which may have been made many hours previously. The initial control measure for me was to do precautionary reboots just short of those two times.

I kept closely following amdgpu driver development and eventually (sometime in 2018 I think) the problem largely disappeared.  Even today it happens occasionally.  I did a lot of googling to try to find methods of resetting the driver without a reboot.  Once the driver gets into this state, the only reliable method I found was a cold restart.

These days, I see maybe one or two instances of this per week, usually on a couple of older hosts where there may be other factors triggering it.  The most likely GPU brand where this seems to happen is Asus.  Many newer machines with other brand GPUs seem to be largely immune.

I have found that the easiest and most reliable way to deal with the issue is to detect the condition through the behaviour of the CPU support process.  With AMD GPUs and the FGRPB1G app, the CPU support, although quite light, seems very regular - apart from the very start and very end.  So I have a script that runs continuously on a 'server' machine that uses ssh to talk to all the others in the fleet.

It measures the clock ticks used by the CPU support process as recorded in the kernel's virtual file system (/proc). The kernel uses a clock rate of 100 Hz and keeps track of the 'ticks' used by every running process. I use a 2-second interval and usually see a small number of ticks over that interval (perhaps averaging around 10 or so - it does vary). If I get a very high value (e.g. >100) or a very low value (e.g. 0), the script pauses for a few seconds and then repeats the test.

If the pattern repeats and the value is still 0, the problem is flagged.  If the value is still high, it will make a third attempt (after a slightly longer pause) before flagging it as a potential problem.  This is due to the extreme use of ticks at the start and end of each task.  I added this last bit to get rid of occasional false positives.  At the low end, two successive zero values seem to always indicate a GPU freeze problem.  It gets logged and (when I notice the message :-) ) I just do a cold reboot to fix it.
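Roughly, the core of that check looks something like the sketch below. This is a from-memory outline rather than the actual script; the PID argument, thresholds and messages are just placeholders, and it assumes the support process name contains no spaces so the /proc stat fields line up.

```bash
#!/bin/bash
# Rough sketch of the tick check described above (placeholder values throughout).
# Usage: ./tickcheck.sh <pid-of-cpu-support-process>

PID=$1                 # PID of the CPU support process for one GPU task
INTERVAL=2             # sampling interval in seconds
HZ=$(getconf CLK_TCK)  # kernel clock ticks per second, usually 100

ticks() {
    # utime + stime (fields 14 and 15 of /proc/PID/stat), in clock ticks.
    # Assumes the process name (field 2) contains no spaces.
    awk '{print $14 + $15}' "/proc/$1/stat" 2>/dev/null
}

t1=$(ticks "$PID"); sleep "$INTERVAL"; t2=$(ticks "$PID")
if [ -z "$t1" ] || [ -z "$t2" ]; then
    echo "process $PID has gone away"; exit 1
fi

used=$(( t2 - t1 ))
if   [ "$used" -eq 0 ];   then echo "possible GPU freeze: 0 ticks in ${INTERVAL}s"
elif [ "$used" -gt 100 ]; then echo "very busy: $used ticks (task start/end?)"
else echo "looks normal: $used ticks in ${INTERVAL}s (HZ=$HZ)"
fi
```

The real script wraps a check like this in the pause-and-retry logic described above before anything gets flagged.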

If you're interested, I could dig out a code snippet to show what I do.  It's been quite reliable and I haven't looked at that code for quite a while.  I'll need to refresh my memory of how it works :-).

Cheers,
Gary.

cecht
cecht
Joined: 7 Mar 18
Posts: 1421
Credit: 2444902309
RAC: 1503868

Thanks Gary. I think I should

Thanks Gary. I think I should be able to work up a script based on the description you gave, but I'm first going to try side-stepping direct monitoring of task time or CPU time and instead track task persistence through multiple script cycles. My script runs continuously at regular timed intervals, so I can just count cycles as a time proxy.  I remember when the RX 570s had that zombie issue, but now a simple task abort (which I can automate) seems to clear things up on the RX 5600 without a reboot. If this alternate approach doesn't pan out, I'll be back seeking help.
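The outline I have in mind is something like this (placeholder names, paths and thresholds; not the finished script). On each pass it bumps a counter for every task the client reports as executing, and aborts any task that has hung around for too many passes:

```bash
#!/bin/bash
# Sketch of the cycle-counting idea: count how many script passes each running
# task has been seen in, and abort it once it exceeds MAX_CYCLES passes.
# Placeholder values; boinccmd output labels may vary between client versions.

PROJECT_URL="https://einsteinathome.org/"   # project URL used by boinccmd --task
MAX_CYCLES=30                               # e.g. 30 passes at 60 s/pass = 30 min cap
COUNTS=/tmp/task_cycle_counts               # "taskname count" pairs kept between passes

touch "$COUNTS"

# Names of tasks currently executing, taken from boinccmd --get_tasks output.
running=$(boinccmd --get_tasks | awk '/^ *name: /{n=$2} /active_task_state: EXECUTING/{print n}')

: > "$COUNTS.new"
for name in $running; do
    count=$(awk -v n="$name" '$1 == n {print $2}' "$COUNTS")
    count=$(( ${count:-0} + 1 ))
    if [ "$count" -gt "$MAX_CYCLES" ]; then
        echo "$(date): aborting stalled task $name after $count cycles"
        boinccmd --task "$PROJECT_URL" "$name" abort
    else
        echo "$name $count" >> "$COUNTS.new"
    fi
done
mv "$COUNTS.new" "$COUNTS"
```

It would run from cron (or a sleep loop) at the same interval as the rest of the script.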

Ideas are not fixed, nor should they be; we live in model-dependent reality.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109390260102
RAC: 35889457


cecht wrote:
... a simple task abort (which I can automate) seems to clear things up ...

Thank you very much for the response. Those simple words triggered a little bit of lateral thinking which may just have allowed me to understand something that's been troubling me over the last couple of weeks. Maybe this is something new and different from the previous problem.

I'm monitoring a very large number of hosts from the one machine.  The monitored hosts (in the main) don't have any peripherals - I hook them up when needed and always get local access by tapping the ctrl key.  I like to know that local access is readily available if ever it is needed :-).

When I used to get regular GPU freezes, hooking up peripherals didn't allow normal operation any more. The OS was still running, but pounding on a keyboard while looking at a black screen was rather frustrating :-). I have always tended to reboot to regain that level of control. With Linux, I can just type Alt+SysRq+R E I S U B (to reboot) or Alt+SysRq+R E I S U O (to shut down) to get everything back to normal, even though there's no visible reaction on the screen to those keystrokes until after the reboot starts.

For this old (and largely solved) variant of the issue, I can remember trying to use a remote manager to manipulate the stuck tasks (always both when running x2 and always both showing increasing elapsed time with no change in % progress) without any joy at all, so the reboot seemed the easiest option.

Over the recent summer here (down under), I had a large number of machines shut down to ease the heat burden. Over the last few weeks they've all been refurbished and the OSes (plus BOINC and OpenCL libs) brought up to the very latest. With autumn having now arrived, they're all back in production and generally running quite well. However, I've started noticing some extra occurrences of what I originally thought was the same GPU freeze problem.

One machine was in that state this morning.  The log message appeared on my screen as I was reading and thinking about your reply.  Rather than unthinkingly doing a simple reboot, I thought I might play with that machine in its current state.  I fired up a local manager and attached to it remotely.  There were the two in-progress tasks, one running at almost double the normal speed and the other stuck at around 85%.  This immediately struck me as 'unusual' - normally both are stuck.  In your observations, is it always just one task that is stuck - I presume you run multiples?

On a whim, I tried 'suspending' the stuck task. To my surprise, a new task sprang to life and there were two tasks, both progressing at normal speed. I 're-enabled' the stuck task and, when one of the other two finished, the previously stuck task started up (the % value dropped a little as the last checkpoint was reloaded) and went on to complete as normal. To my mind, this can't be the same problem that was around a few years ago.

I started thinking about the few examples of stuck GPUs I've had recently and a few oddities started to register. Most of them have been on recent refurbs where the GPU was not an RX 570. A couple were old Pitcairn series cards (HD 7850 - as was this morning's one) and a couple were early RX 460s. I can distinctly remember also noticing (when typing the REISUB sequence) seeing some action on the screen; at the time I was too dumb to think much about that oddity. I have also noticed on a different example (where I just happened to look before rebooting) that only one task was stuck. I just rebooted that one as well without further investigation.

So, I'm now wondering if the latest stuff in the amdgpu driver (a lot of which is probably to do with handling Navi and Big Navi cards) might have introduced some sort of regression for older architectures. At least, when a message flashes up on my screen from now on, I'll fire up a manager and see if there's just a single stuck task and whether suspending and resuming it gets things back to normal.

Without being forced to think about it, I probably would have just continued blindly rebooting.  Thanks very much for the prompt - I owe you for that one :-).

As I get more observations, I'll post again with the findings.

Cheers,
Gary.

cecht
Joined: 7 Mar 18
Posts: 1421
Credit: 2444902309
RAC: 1503868


Gary Roberts wrote:

This immediately struck me as 'unusual' - normally both are stuck.  In your observations, is it always just one task that is stuck - I presume you run multiples?

Yes, and either one, some, or all tasks in a multiple run can get stuck. I've switched that RX 5600 XT host over to GRP tasks for the moment to see whether it's an issue with the card and drivers or with the type of E@H work being done.

Good thought on suspending stuck tasks. When I switch back to running GW work, I will try unsticking tasks by suspend & resume instead of abort.
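If that works, the suspend & resume should also be scriptable with boinccmd, something like this (project URL and task name are just placeholders):

```bash
# Hypothetical example: kick a single stuck task with suspend then resume.
URL="https://einsteinathome.org/"                    # placeholder project URL
TASK="h1_1234.56_O3aC01Cl1In0__example_task_name"    # placeholder task name
boinccmd --task "$URL" "$TASK" suspend
sleep 10    # give the client a moment to start another task / free the GPU
boinccmd --task "$URL" "$TASK" resume
```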

Glad to hear that you got your REISUB brain unstuck. ;)

 

EDIT: I just checked and see that overnight 8 tasks were aborted, while previously it was maybe one or two per day (out of ~400 tasks/day). The only difference I'm aware of is that yesterday evening I upgraded the AMDGPU driver package from 20.50 to 21.10. Hmmm.

Ideas are not fixed, nor should they be; we live in model-dependent reality.

Tom M
Joined: 2 Feb 06
Posts: 5586
Credit: 7673719503
RAC: 1771791


cecht wrote:

EDIT: I just checked and see that overnight 8 tasks were aborted, while previously it was maybe one or two per day (out of ~400 tasks/day). The only difference I'm aware of is that yesterday evening I upgraded the AMDGPU driver package from 20.50 to 21.10. Hmmm.

Oops.  I have been running as old as 2.35 without a problem... but that is on some test systems.   I also run the 5.4 Kernel... because of the install issues we were having with the 5.8 Kernel.

Tom M

A Proud member of the O.F.A.  (Old Farts Association).  Be well, do good work, and keep in touch.® (Garrison Keillor)

cecht
Joined: 7 Mar 18
Posts: 1421
Credit: 2444902309
RAC: 1503868


Tom M wrote:

cecht wrote:

EDIT: I just checked and see that overnight 8 tasks were aborted, while previously it was maybe one or two per day (out of ~400 tasks/day). The only difference I'm aware of is that yesterday evening I upgraded the AMDGPU driver package from 20.50 to 21.10. Hmmm.

Oops.  I have been running as old as 2.35 without a problem... but that is on some test systems.   I also run the 5.4 Kernel... because of the install issues we were having with the 5.8 Kernel.

Tom M

After no internet connection for my hosts for the past day or so, I'm back and now running O3 Engineering tasks; I'm not yet sure whether the task-stalling problem will persist. I wrote a bash script to monitor and correct the problem (in theory) in case it does.

However, I recently did a fresh install of Ubuntu 20.04.2 (kernel 5.8), installed the AMDGPU 21.10 OpenCL drivers, and had no problem getting either an RX 570 or an RX 5600 XT to crunch right off the bat. No idea what is different between my hosts and others that need to run kernel 5.4 with AMD GPUs.

Ideas are not fixed, nor should they be; we live in model-dependent reality.
