WUs regularly hang at or near 100% without ever finishing

captain_curly
Joined: 13 May 10
Posts: 4
Credit: 53022270
RAC: 0
Topic 210899

I've been having problems running Einstein on my spare computer recently. After successfully completing WUs for about 12 hours, it suddenly stops finishing them. Tasks can get to 99.999%, or even 100%, but then just sit there for hours or even days without completing. If I reset the project in the BOINC manager they start completing again, for about 12 hours, before the problem reappears. My computer runs 24/7, so this means I'm contributing less than half of its potential: I have to manually check whether it has stalled and then reset, which I can't do while at work or asleep, leading to periods of inactivity of up to 20 hours.

This is the computer in question https://einsteinathome.org/host/12532695

Other projects (SETI, Rosetta) run just fine and encounter no such problems. Has anyone had a similar problem, or a suggestion for how to solve it?

Rolf
Joined: 7 Aug 17
Posts: 27
Credit: 135377187
RAC: 0

I had similar problems and think it was because of memory conflicts on the GPU. Running fewer tasks in parallel was the only solution at the time; I could run a maximum of 2 tasks at a time with 4 GB. You might want to try running only 1 task.

You can try to catch error messages in the stderr output: click the link in the leftmost column of the task list. I couldn't find anything in your current list of tasks; most of them just say "aborted by user" with nothing in the error log.

There are initiatives to improve OpenCL drivers and memory management on AMD GPUs, but these are still experimental beta releases.

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0

Resetting the project should not have to be done.

What AMD driver are you using?

I'm not 100% convinced it's GPU related though; it seems the tasks may be completing but the scheduling process (boinc) is not picking them up.

The GPU tasks should be visible as processes named

hsgamma_FGRPB1G_1.18_x86_64-pc-linux-gnu__FGRPopencl1K-ati

Are they still running and consuming resources? Use top or ps -F -u boinc to check.
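
Something along these lines should show whether the app is alive and still accumulating CPU time (assuming the client runs as the boinc user, which is the usual default for the Linux packages; adjust if yours differs):

# show all processes owned by the boinc user
ps -F -u boinc

# or focus on the FGRP GPU app; TIME (CPU time) should keep climbing if it is really crunching
ps -o pid,stat,etime,time,pcpu,args -u boinc | grep hsgamma_FGRPB1G

If the elapsed time (etime) keeps increasing but the CPU time stays frozen, the app is stuck rather than finishing up.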

The next place to look is the BOINC event log. Can you post the first 20-40 lines and any errors? You might have to toggle some event log options to see them.
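
If nothing useful shows up there, extra log flags can be turned on via cc_config.xml in the BOINC data directory (/var/lib/boinc-client on Debian-style installs). A rough sketch only; back up any existing cc_config.xml first:

cat > /var/lib/boinc-client/cc_config.xml <<'EOF'
<cc_config>
  <log_flags>
    <task_debug>1</task_debug>
    <sched_op_debug>1</sched_op_debug>
  </log_flags>
</cc_config>
EOF

# reload without restarting the client
# (may need to be run from the data directory, or given --passwd, so it can authenticate to the client)
boinccmd --read_cc_config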

 

captain_curly
Joined: 13 May 10
Posts: 4
Credit: 53022270
RAC: 0

Thanks for the input, but this might have resolved itself unexpectedly. I rebooted the computer yesterday to apply some updates, and since then it's been running smoothly for at least 20 hours or so. This might suggest there was something wrong with the BOINC process itself, as AgentB mentioned, but previous reboots had no such effect.

I'll keep an eye on it for the next few days and see if the problem occurs again and if so try the troubleshooting steps you guys suggested.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109381679499
RAC: 35971514

captain_curly wrote:
... this might have resolved itself unexpectedly.

I hope so but it may be something like what I'm seeing, so I suspect it may recur.  I have a whole bunch of RX 460 GPUs running under Linux and I've documented what I see in this thread, if you're interested.

I had intended to reply earlier but I ran out of time after I saw your opening post. I'm interested because we both use Linux and the same GPU. As well as directing you towards what I see, I thought I'd explain a bit about the crunching characteristics of FGRP-style tasks so you know what to expect.

For both CPU and GPU versions, there are two stages to crunching.  The first stage takes the bulk of the time and ends with the % done figure showing 89.997% for the GPU version and slightly lower (I think 89.979%) for the CPU version.  The exact figure is not important.  What is important is that it remains static (possibly giving the appearance of crunching having stopped) whilst a second followup stage is underway.  When the followup stage is completed, the % done suddenly jumps to 100% and the completed result is uploaded.  There shouldn't be any figure between 90 and 100% visible at any stage if crunching is proceeding normally.  I believe checkpoints are saved during the followup stage so that if the client is stopped during that time it can be restarted without having to repeat the full stage.

The followup stage is used to re-evaluate the potential signals found during the first stage and create a 'toplist' of the ten most likely candidates.  Double precision is used during this time.  The followup stage can take from around 1-3 minutes on a decent GPU and perhaps 10-30 minutes on a CPU task.  These are rough figures but it's important to let things run if the % done is just below 90% and nothing seems to be happening.

With that out of the way, let's look at your particular symptoms.  You say you've seen a task sitting at 99.999 or even 100% for a long time - many hours.  I've never seen that.  When a GPU of mine locks up, I can't get anything on the attached screen but I can see all the tasks on a Manager running on a different host and connecting to the client on the problem host as described in the other thread.  When you see a task stuck like that, is your screen fully functional?  Is the task still clocking up elapsed time?  Can you stop and restart BOINC without the machine locking up?  You really don't need a project reset to solve this.  A simple stop and restart of the client might work.  If not a reboot seems to.  Have you tried those steps?  If so, what happens?  Details, details :-).
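
For the record, "stop and restart the client" on a packaged install would be something like the sketch below; the service name is a guess on my part since I don't run Debian myself, so adjust to suit:

# stop the client, give it a moment, then start it again (systemd-based installs)
sudo systemctl stop boinc-client
sleep 15
sudo systemctl start boinc-client

# confirm it came back
sudo systemctl status boinc-client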

I see your kernel is listed as 4.9.0-4-amd64.  Is that a Ubuntu designation?  I've never used anything but PCLOS so I'm not familiar with the naming systems of other distros.  The kernels I'm using are more recent in the 4.11.x and 4.12.x series, depending on what updates were available at install time.  I keep a fully updated local copy of my distro's repos so when I install from a live USB, I just apply all the updates that existed at that time.

What version of graphics driver and OpenCL libs are you using?  If you're running Ubuntu, I presume you will be using AMDGPU-PRO?   If so, are you able to use the recent 17.40 version?

I'm sorry for all the questions - feel free to ignore them if you haven't got time to respond.  I decided to post them all in case your problem recurs.  If it does, you know what I'll be asking :-).

 

Cheers,
Gary.

captain_curly
Joined: 13 May 10
Posts: 4
Credit: 53022270
RAC: 0

Thanks for a very detailed and informative response! I considered this case closed on my end, so I didn't visit the forum for some days. Then I saw your long response, and your other thread, and started writing a reply a few days ago, but other things got in the way and it languished in a seldom-visited browser tab. You're not the only one guilty of putting things on the back burner :-)

Gary Roberts wrote:
There shouldn't be any figure between 90 and 100% visible at any stage if crunching is proceeding normally.

You say stage 1 of a GPU task is done when showing 89.XXX% and stage 2 reports no visible progress before finishing and making the jump to 100%. The 89.XXX% looks very familiar and I'm fairly confident this is where every single task that ended up hanging first stopped. However, if I let them continue running, the % done would slowly but surely increase all the way up to 99.999% sometimes also reaching 100%. Here we're talking about it taking an hour or more for every extra % done.

Quote:
Have you tried those steps?  If so, what happens?  Details, details :-).

I didn't have a screen attached, as it runs headless in my living room, but when tasks were hanging SSH and VNC worked just fine. I did attempt to connect it to my TV a couple of times when tasks were hanging, but the TV reported receiving no signal. Hanging tasks were still clocking up elapsed time, and I could stop BOINC without the machine locking up, but the BOINC processes themselves would become unkillable zombies, requiring a reboot to get BOINC running again. As mentioned in my last post, I had done several reboots, ranging from regular reboots to REISUBs to a full power-down with the power cable removed for several minutes, none of which worked until the problem suddenly went away after the last reboot.
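
Next time it happens I'll try to record the process state before rebooting; something like this should show whether they are true zombies (STAT Z) or stuck in the kernel (STAT D), presumably waiting on the GPU driver:

# STAT: Z = zombie, D = uninterruptible sleep; WCHAN shows what the kernel is waiting on
ps -o pid,ppid,stat,wchan:30,etime,time,args -u boinc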

Quote:
What version of graphics driver and OpenCL libs are you using?  If you're running Ubuntu, I presume you will be using AMDGPU-PRO?   If so, are you able to use the recent 17.40 version?

I'm running Debian 9, which is the current stable release of Debian, so packages are a bit outdated. The AMDGPU-PRO driver is 17.10-429170. BOINC version is 7.6.33. The CPU is an AMD FX 8350 (8 cores) running at stock speed, with 7 cores dedicated to BOINC, leaving 1 core to feed the GPU. OpenCL is 1.2 2348.3.

From your other thread, I see that this happens to you after the hosts have been crunching for about 20 days. After my last post the host was up and running without problems for 11 days before I needed to shut it down to extract some hard drives. However, when I first encountered this problem, the host had not been running for 20 days, IIRC. It might have been, but the way I discovered I had a problem was by noticing that Einstein tasks were hanging. I logged in to Boincstats to see when that might have started and found I had barely earned any credit in the past 40 days, so it seemed to have occurred almost immediately after I brought the host out of storage and booted it up for the first time in months. I now have it running with screen and keyboard attached, with htop and boinctui running in a split terminal window. Checking that Einstein is still running is the first thing I do in the morning and the last thing I do before going to bed, so I'll report back after reaching 20 days of uptime in about two weeks :-)
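
In case it's useful to anyone, I'm also thinking of automating the check with a small watchdog, roughly like the untested sketch below (the app name and the boinc user come from earlier in the thread; the 10-minute window is just a guess on my part):

#!/bin/sh
# Crude watchdog sketch: warn if the GPU app's accumulated CPU time stops advancing.
APP=hsgamma_FGRPB1G
snap() { ps -o time= -o args= -u boinc | grep "$APP" | awk '{print $1}'; }
t1=$(snap)
sleep 600          # wait 10 minutes
t2=$(snap)
if [ -n "$t1" ] && [ "$t1" = "$t2" ]; then
    echo "WARNING: $APP CPU time frozen at $t1, GPU may have hung"
fi

If it fires, the GPU task has most likely stalled even though its elapsed time keeps ticking.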

 

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109381679499
RAC: 35971514

captain_curly wrote:
... However, if I let them continue running, the % done would slowly but surely increase all the way up to 99.999% sometimes also reaching 100%. Here we're talking about it taking an hour or more for every extra % done.

This is why I qualified my statement by saying, "... if crunching is proceeding normally". You only see the gradual increase of %done beyond 89.997% if the GPU has crashed but BOINC hasn't. BOINC seems to have some sort of auto-incrementing of progress in these cases. The GPU tasks are not actually making any progress despite what BOINC is reporting.

My machines are on racks and run headless most of the time.  I can immediately hookup to any particular machine if I need to.  Because GPU tasks crunch quickly, hosts communicate regularly with the project.  It's unusual for the last contact to be more than an hour ago.  I use that to get a warning about any host that may be misbehaving.

For RX 460 GPUs, when I suspect a problem, I just hookup the peripherals.  If the machine is actually running correctly, the desktop springs to life and I can open BOINC Manager to see what's going on.  If there is no response to keyboard/mouse movement, there is a problem and I've been in the habit of using REISUB, which almost invariably works to get things quickly back to normal.  Occasionally, the machine really has crashed so I do a hard reset.

From when I first started talking about this, I've seen a further 5 cases of my RX 460 machines reaching around 26 day uptime and then having the problem.  When this happens, I can ssh into the machine, see the uptime and see the client and the CPU/GPU apps running.  CPU time continues to increment for everything except the GPU tasks.  The CPU time for those is frozen although ps doesn't show them as zombies.  They seem to be unkillable if I try to do that.  I don't often do that because it usually causes the whole machine to lock up.  On one occasion I tried to run clinfo over ssh and the machine immediately crashed.

Most of the time, I can remotely connect to such a host with BOINC Manager running on a server machine.  The elapsed times are ticking over normally and the GPU tasks do show some very slow increments to the %done, along the lines of what you see.  The elapsed times can show several hours but there is no increment to CPU time.  When such a machine is restarted, the GPU tasks get restarted from saved checkpoints and then complete normally with all that excess elapsed time washed away.

Quite often, one or more tasks will complete almost immediately, indicating that the checkpoint they restarted from was deep in the followup stage of crunching.  The %done they restarted with is usually wrong (quite low) and is not initially moving, but then jumps to 100%.  I've seen enough examples to spot this in advance because they seem to restart with an elapsed time quite close to what a normal task takes to complete, despite the low %done.  So it's no real surprise when they suddenly jump to completion.

Quote:
I now have it running with screen and keyboard attached, with htop and boinctui running in a split terminal window. Checking that Einstein is still running is the first thing I do in the morning and the last thing I do before going to bed, so I'll report back after reaching 20 days of uptime in about two weeks :-)

I'm now keeping a record of the uptimes of my hosts.  At the moment there are 10 RX 460 hosts with uptimes between 21 and 24 days.  There are several more close to 20.  In the last couple of days, two machines had stopped GPU crunching but could readily be contacted through ssh or a remote BOINC manager.  The uptimes were 24 and 26 days respectively.  The GPU tasks were not accumulating CPU time and the %done figures were quite low.  The elapsed times were incrementing way beyond the normal finish times.  REISUB immediately got things working again.

Today, I'm planning to start an experiment to see if the problem can be removed by restarting certain components without rebooting.  As each host reaches 24 days uptime, I'll put it into one of four categories:-

  1. Stop BOINC, wait say 15 secs, restart BOINC.  Uptime is unchanged.
  2. Logout, wait say 15 secs, login again.  BOINC runs as a daemon so just continues.  Uptime is unchanged.
  3. Stop BOINC, logout, wait 15 secs, login again and then restart BOINC.
  4. Control group. I want to know if any can go beyond 26-28 days or if all will fail around that time.

The idea is to attempt to determine if any particular software component is the critical item.  I hope to see if it's BOINC related, X windows related or perhaps a combination of both, or perhaps neither.  I figure that something has got to make some sort of difference that will give some sort of clue :-).

Over the next 4 days there are 16 hosts that will enter one of these categories so by early next week, surely something will show up :-).

 

Cheers,
Gary.

captain_curly
Joined: 13 May 10
Posts: 4
Credit: 53022270
RAC: 0

Just happened again, after 29 days uptime, task hanging at 88.31%. Mouse pointer could be moved around but otherwise the screen was frozen and didn't react to keyboard input. I SSH'd from another machine and the CPU seemed to still be happily crunching LHC tasks, but boinctui stated that the boinc client was offline. Stopping BOINC had no effect, so I did a REISUB and it seems to be crunching nicely once more, but I will keep a close eye on it over Christmas.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109381689499
RAC: 35972214

captain_curly wrote:
Just happened again, after 29 days uptime, task hanging at 88.31%. Mouse pointer could be moved around but otherwise the screen was frozen and didn't react to keyboard input.

I'm pretty sure this is the same behaviour I see on my hosts with Polaris GPUs. I have a mixture of RX 460, 560, 570 and 580 GPUs, which I feel will all show a similar problem, although the 570 and 580 variants haven't been running long enough to trigger it (yet) :-).

As mentioned previously, I've been carrying out experiments to see if there is any way (short of rebooting) to prevent the problem. In a nutshell: no, not that I can find. I've tried everything I can think of, including stopping and restarting BOINC, stopping and restarting X, and logging out and then back in, but no machine yet has been able to go past ~25-27 days of uptime without the problem showing up anyway. I thought that stopping and restarting BOINC at least might work, but it doesn't seem to have any beneficial effect.

I have about 6 examples where BOINC was stopped and restarted at 25 days uptime.  None of them got any further than 27 days before the problem showed up.  Stopping BOINC was just a waste of time.   I now reboot all machines with Polaris GPUs once the uptime reaches 25 days.  I have examples of machines that were rebooted at 25 days that are due for their next reboot and have not had the problem show up.  It's annoying to have to do this but at least it is workable.
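
For anyone wanting to automate the 25-day reboot, a root cron job along the lines of the sketch below is roughly what I mean. Treat it as a sketch only: the script path is made up, the service name is from the Debian/Ubuntu package (adjust for your distro), and 25 days is just the figure that seems safe for my Polaris hosts.

#!/bin/sh
# hypothetical /usr/local/bin/reboot-if-old.sh: reboot once uptime exceeds 25 days
LIMIT=2160000                       # 25 days * 86400 seconds
up=$(cut -d. -f1 /proc/uptime)      # whole seconds of uptime
if [ "$up" -ge "$LIMIT" ]; then
    systemctl stop boinc-client     # stop the client cleanly first
    sleep 30
    /sbin/reboot
fi

Run it once a day from root's crontab; after the reboot, the tasks restart from their last saved checkpoints as described above.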

Quote:
I SSH'd from another machine and the CPU seemed to still be happily crunching LHC tasks, but boinctui stated that the boinc client was offline. Stopping BOINC had no effect, so I did a REISUB and it seems to be crunching nicely once more, but I will keep a close eye on it over Christmas.

On several of my machines being used as 'controls' (not rebooted after 25 days) I've been able to observe what happens when the problem hits. Yes, the machine is still running, but I can't get a display when I hook up a monitor. I can ssh into the machine or contact it with a BOINC Manager running on a different machine. CPU tasks are still crunching fine and will complete, upload, and be reported, and replacement tasks will be downloaded. GPU tasks are not crunching (no increment to CPU time and no further checkpoints) although elapsed time continues to accumulate. When the machine is ultimately rebooted (using REISUB or the reboot command from an ssh session), all this extra elapsed time disappears when crunching restarts from the last saved checkpoint. As far as I'm aware, there are usually no computation errors, just the time lost until the problem is noticed. When a machine is at the critical 25-27 day uptime, the problem is quite often triggered when one (or more) running tasks (CPU or GPU) have entered the 'followup' stage (>89.9xx% done) where double precision is being used.

 

Cheers,
Gary.

Jim Martin
Joined: 24 Jun 05
Posts: 7
Credit: 6631303
RAC: 20743

For the first time, two WUs have hung at 98.979%. Gamma-ray pulsar search...

I have a Dell Latitude E7240 (4 CPUs) and Windows 7. I have suspended all other programs except one of the two from Einstein. CERN (ATLAS, for example) at one time had to be given extra care when running, but their engineers seem to have gotten a handle on their GPU-based problems.

Will wait a day's worth, then abort them; they can't tie up my machine needed for other projects. Hopefully, Einstein personnel can solve this.

Jim Martin
Joined: 24 Jun 05
Posts: 7
Credit: 6631303
RAC: 20743

Good news. Each WU uploaded after approx. 20 minutes. Results were "completed and validated".

If this was an assist from project mgr./engineer, thanks.
