Stalled task at 89.989% and another at 89.980%

rjs5
Joined: 3 Jul 05
Posts: 32
Credit: 616341512
RAC: 841932

I am still trying to figure out what is causing my Windows 10 Nvidia 2080ti GPU failures. I moved the 2080ti board from my 7920x Windows system to my 8700 Linux system.  The 1080ti Linux system was taking about 400 seconds per WU.  I watched the 2080ti progress during the GPU WU.

What I noticed was that the 2080ti PERCENTAGE progress increased slowly. The TIME REMAINING dropped very slowly.

IMO, these WUs are getting stalled very early in the computing, and the percentage progress increases slowly only as the ELAPSED run time increases. It seems like there are problems with the app.

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117793095229
RAC: 34681967

rjs5 wrote:
I am still trying to figure out what is causing my Windows 10 Nvidia 2080ti GPU failures.

I don't understand why you are using this particular thread, which discusses why there is a pause in crunching progress just short of the 90% mark for all FGRP-type tasks, both CPU and GPU.  This topic has nothing to do with Turing GPU failures.  There are other existing threads that would be much more appropriate for that.

If you are saying that you can get a task to run (albeit very slowly) under Linux without crashing, that is a new observation that would probably be of great interest to other Turing owners and so quite worthy of a new thread.

rjs5 wrote:

I moved the 2080ti board from my 7920x Windows system to my 8700 Linux system.  The 1080ti Linux system was taking about 400 seconds per WU.  I watched the 2080ti progress during the GPU WU.

What I noticed was that the 2080ti PERCENTAGE progress increased slowly. The TIME REMAINING dropped very slowly.

Are you sure a GPU task was actually running on the Turing GPU?  Sounds like 'simulated progress' that BOINC invents before the initial checkpoint is written.  How long did you let it run for?

I looked at your hosts list on the website.  The Windows machine shows a single 2080Ti.  The Linux machine shows a single 1080Ti.  There is no sign that you switched the GPUs.  That could be because you immediately reverted the change after a short test, or because the host hasn't contacted the project yet to report the hardware change.  When you restarted the Linux machine after the change, did you examine the event log to see if the 2080Ti was properly detected?  Did you check the task properties of the 'in progress' task to see if checkpoints were being written?

 

Cheers,
Gary.

rjs5
Joined: 3 Jul 05
Posts: 32
Credit: 616341512
RAC: 841932

I gave some background on what I was doing and why. I am suggesting that the problem might come far earlier in the WU processing than 89%. I think it is only getting to 89% because of the long run time.  If this comment is a problem, feel free to delete it. I won't be offended.

I turned both machines off and moved the 2080ti board to the Linux machine. The Linux machine detected the 2080ti, but it never completed or reported any results or failures. I let the GPU WU run on the 2080ti for an hour.  I started an Einstein GPU WU from its beginning and progress percentage increased slowly. The 1080ti board takes 7 minutes to complete and uniformly crunches the job.  I was expecting the 2080ti to take half that long, if it worked.

Sorry, I neglected to check the checkpoints.

 

Gary Roberts wrote:
rjs5 wrote:
I am still trying to figure out what is causing my Windows 10 Nvidia 2080ti GPU failures.

I don't understand why you are using this particular thread, which discusses why there is a pause in crunching progress just short of the 90% mark for all FGRP-type tasks, both CPU and GPU.  This topic has nothing to do with Turing GPU failures.  There are other existing threads that would be much more appropriate for that.

If you are saying that you can get a task to run (albeit very slowly) under Linux without crashing, that is a new observation that would probably be of great interest to other Turing owners and so quite worthy of a new thread.

rjs5 wrote:

I moved the 2080ti board from my 7920x Windows system to my 8700 Linux system.  The 1080ti Linux system was taking about 400 seconds per WU.  I watched the 2080ti progress during the GPU WU.

What I noticed was that the 2080ti PERCENTAGE progress increased slowly. The TIME REMAINING dropped very slowly.

Are you sure a GPU task was actually running on the Turing GPU?  Sounds like 'simulated progress' that BOINC invents before the initial checkpoint is written.  How long did you let it run for?

I looked at your hosts list on the website.  The Windows machine shows a single 2080Ti.  The Linux machine shows a single 1080Ti.  There is no sign that you switched the GPUs.  That could be because you immediately reverted the change after a short test, or because the host hasn't contacted the project yet to report the hardware change.  When you restarted the Linux machine after the change, did you examine the event log to see if the 2080Ti was properly detected?  Did you check the task properties of the 'in progress' task to see if checkpoints were being written?

 

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7230394822
RAC: 1162999

rjs5 wrote:

I let the GPU WU run on the 2080ti for an hour.  I started an Einstein GPU WU from its beginning and progress percentage increased slowly. The 1080ti board takes 7 minutes to complete and uniformly crunches the job.  I was expecting the 2080ti to take half that long, if it worked.

Sorry, I neglected to check the checkpoints.

So far as I know, we only have one previous report of an Einstein user attempting to run Gamma-Ray pulsar jobs on a Turing GPU. That user was Keith Myers, and you can review several reports he made during that attempt on December 7, 2018 in our Turing thread.
It appears likely to me that the behavior you experienced was the same as what he saw: a superficial indication that the card was running, but with no checkpoints written and no actual progress. This differs from the behavior under Windows, where these tasks commonly terminate in a clear error condition in about 25 seconds.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117793095229
RAC: 34681967

rjs5 wrote:
I gave some background on what I was doing and why. I am suggesting that the problem might come far earlier in the WU processing than 89%. I think it is only getting to 89% because of the long run time.  If this comment is a problem, feel free to delete it. I won't be offended.

Thanks for starting the new thread.  I see Richard has responded there and has suggested that you bring yourself up to date with the comments of quite a few people who have already been affected by this problem.  Archae86 has also responded to your report here and linked to the big Turing thread where he first reported the problem with a particular type of GPU task using a new Turing series GPU.  That was back on 30th Sept last year and would be a suitable starting point in becoming familiar with what is known about the problem.

There is a lot of reading if you want to know the full story so far.  The one sentence summary is that everything conceivable has been tried and there is currently no resolution, or even any real indication of where the true cause of the problem lies.  There have been multiple reports to nVidia about it.  There has been no comment about it from the Einstein Devs.  This suggests that they are just as baffled about the cause as everyone else.  If it was an easily fixable app flaw, the fix would have been made very promptly. As the problem has continued for several months, the indications point to something much harder to solve, and possibly only solvable by nVidia.

Just like you did, Keith Myers also tried running the card under Linux.  This was back on Dec 6th.  He let a task run for a total of about 9 hours with no real progress being made.  He also observed the BOINC 'simulated progress' that you see when there is no checkpoint created from which 'true' progress could be calculated.  The progress you saw was obviously this same effect.  Richard calls this 'pseudo-progress', I believe.  Apparently some projects don't checkpoint very frequently, so it's supposedly of benefit in stopping inexperienced people from thinking a task has stalled prior to the first checkpoint.  If real progress is being made, Einstein GPU tasks will checkpoint after the first minute.  The easiest way to confirm this is to check the task properties a little after that.  If there is still no checkpoint, then the task is apparently not running, no matter what the pseudo-progress indicates.
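If you'd rather not keep opening the properties dialog, a throwaway script along these lines can list which slots have actually written a checkpoint.  It's only a sketch: it assumes a default Linux BOINC data directory and the usual boinc_task_state.xml layout, so adjust DATA_DIR for your install and run it with enough permission to read the slot directories.  A slot with no state file at all also hasn't checkpointed yet.

import glob
import os
import xml.etree.ElementTree as ET

# Assumed default data directory on a Debian/Ubuntu style install; change as needed.
DATA_DIR = "/var/lib/boinc-client"

for state_file in sorted(glob.glob(os.path.join(DATA_DIR, "slots", "*", "boinc_task_state.xml"))):
    slot = os.path.basename(os.path.dirname(state_file))
    try:
        root = ET.parse(state_file).getroot()
    except ET.ParseError:
        print(f"slot {slot}: state file is mid-write, try again")
        continue
    name = root.findtext("result_name", default="unknown task")
    checkpoint = float(root.findtext("checkpoint_cpu_time", default="0") or 0)
    fraction = float(root.findtext("fraction_done", default="0") or 0)
    if checkpoint > 0:
        # A non-zero value means the app really has saved a checkpoint.
        print(f"slot {slot}: {name} checkpointed at {checkpoint:.0f} s CPU time, {fraction:.1%} done")
    else:
        print(f"slot {slot}: {name} has not written a checkpoint yet")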

There have been other reports as well - from Windows users.  This thread contains several.  As far as I'm aware, people have been reporting the problem to nVidia for quite some time now.  I don't know of any meaningful response.

 

Cheers,
Gary.

samos
Joined: 1 Jan 16
Posts: 5
Credit: 649480
RAC: 763

Hi, I'm running BOINC 7.16.6 on a Mac under Catalina, and I can see the same "89.978%" (or 89.979%) seemingly stuck progress issue as well. Here are some properties of the three tasks that are like that at the moment. This machine was admittedly only recently added to E@H, but it has crunched successfully before (and for other projects, such as Rosetta). Any hints at what I could try? (Or does this seem normal and I should just wait?) It's been like this for about two days (and about 16 hrs of crunching over those two days...)

 


Application: Gamma-ray pulsar search #5 1.08 (FGRPSSE)
Name: LATeah1022F_88.0_228_-9e-11
State: Running
Received: Monday, 18 May 2020 at 07:54:36 pm
Report deadline: Monday, 01 June 2020 at 07:54:36 pm
Estimated computation size: 105,000 GFLOPs
CPU time: 12:36:54
CPU time since checkpoint: 00:04:34
Elapsed time: 16:02:39
Estimated time remaining: 01:52:20
Fraction done: 89.978%
Virtual memory size: 4.72 GB
Working set size: 617.60 MB
Directory: slots/0
Process ID: 5663
Progress rate: 3.960% per hour
Executable: hsgamma_FGRP5_1.08_x86_64-apple-darwin__FGRPSSE

 


Application: Gamma-ray pulsar search #5 1.08 (FGRPSSE)
Name: LATeah1025F_1240.0_18780_0.0
State: Running
Received: Friday, 22 May 2020 at 08:22:24 pm
Report deadline: Friday, 05 June 2020 at 08:22:24 pm
Estimated computation size: 105,000 GFLOPs
CPU time: 02:42:06
CPU time since checkpoint: 00:00:00
Elapsed time: 03:28:16
Estimated time remaining: 02:11:58
Fraction done: 89.980%
Virtual memory size: 4.09 GB
Working set size: 1.35 MB
Directory: slots/3
Process ID: 6224
Progress rate: 25.920% per hour
Executable: hsgamma_FGRP5_1.08_x86_64-apple-darwin__FGRPSSE

 


Application: Gamma-ray pulsar search #5 1.08 (FGRPSSE)
Name: LATeah1019F_1256.0_164395_0.0
State: Waiting to run
Received: Saturday, 23 May 2020 at 05:44:02 pm
Report deadline: Saturday, 06 June 2020 at 05:44:01 pm
Estimated computation size: 105,000 GFLOPs
CPU time: 12:54:25
CPU time since checkpoint: 00:03:44
Elapsed time: 16:21:05
Estimated time remaining: 02:13:20
Fraction done: 89.980%
Virtual memory size: 4.72 GB
Working set size: 568.46 MB
Directory: slots/4
Process ID: 663
Progress rate: 2.880% per hour
Executable: hsgamma_FGRP5_1.08_x86_64-apple-darwin__FGRPSSE

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117793095229
RAC: 34681967

samos wrote:
Hi, I'm running BOINC 7.16.6 on a Mac under Catalina, and I can see the same "89.978%" (or 89.979%) seemingly stuck progress issue as well.

Hi, welcome to the forums.

Do you have the preference to suspend BOINC stuff when you use your computer set to 'Yes'?  If so, do you have the setting to keep tasks in memory when suspended also set to 'Yes'?

If you don't keep tasks in memory when suspended, each time one tries to restart after being suspended it will need to go back to the last saved checkpoint on disk, so everything computed since that checkpoint is lost.  If checkpoints are widely spaced (it does vary with different tasks), you can lose a lot of compute progress as you do your ordinary work on your computer.

Before suggesting anything else, please rule out this setting as the possible cause of the issue.

Cheers,
Gary.

samos
Joined: 1 Jan 16
Posts: 5
Credit: 649480
RAC: 763

Hi Gary, both of those settings are set to 'off' (i.e. unticked). Also, the checkpoints on my Mac are set to 120 seconds, and task switching to 60 mins.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117793095229
RAC: 34681967

samos wrote:
.. both of those settings are set to 'off' (i.e. unticked).

OK. I didn't check the precise format of the prefs - some are y/n and some are tick boxes.  I figured you'd work it out :-).

So, if you don't suspend BOINC while you're using the machine, does it put itself into some sort of low-power state when you're not using it?  There's got to be some reason for long periods of no progress.  In any case, I'd recommend allowing tasks to stay in memory if they do happen to become suspended.  This will cut out the potential for significant loss of progress that can otherwise occur.

samos wrote:
also the checkpoints on my Mac are set to 120 seconds

This will have no effect if the checkpoint interval specified by the app itself is longer than that, which it most certainly will be - perhaps a lot longer with some tasks.  You can check this for a running task by looking at its properties in BOINC Manager, noting the CPU time since the last checkpoint, and then watching for when that value gets reset immediately after a new checkpoint has been written.  Unfortunately, it's less convenient with current BOINC versions than it used to be.
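If you'd rather not sit watching the properties dialog, a rough script like the one below can log the wall-clock time of each new checkpoint for one slot, which gives you the real checkpoint spacing.  It's just a sketch: the path is an assumption (on a Mac the data directory is typically /Library/Application Support/BOINC Data), so point STATE_FILE at the slot shown in the task's properties, and you may need suitable permissions to read the BOINC Data folder.

import time
import xml.etree.ElementTree as ET

# Assumed path for a Mac install; use the 'Directory' value from the task properties.
STATE_FILE = "/Library/Application Support/BOINC Data/slots/0/boinc_task_state.xml"

last = None
while True:
    try:
        root = ET.parse(STATE_FILE).getroot()
        checkpoint = float(root.findtext("checkpoint_cpu_time", default="0") or 0)
    except (FileNotFoundError, ET.ParseError):
        checkpoint = None                 # no task in the slot yet, or file mid-write
    if checkpoint is not None and checkpoint != last:
        print(f"{time.strftime('%H:%M:%S')}  checkpoint recorded at {checkpoint:.0f} s of CPU time")
        last = checkpoint
    time.sleep(10)                        # poll every 10 seconds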

You haven't mentioned what type of Mac you are using but from what is listed on the website, it would appear to be a laptop of some type.  That seems to suggest that it's saving power by not allowing BOINC stuff to run for a lot of the time.

You have two returned tasks showing on the website.  If you click the Task ID link for each task you can get a complete crunching record that was returned to the project.  Scroll down below the "Stderr Output" heading.  The "Sky points" value shown for each tells you how many checkpoints there were in the task.  The "nf1dots" parameter tells you how many 'calculation loops' there are for each sky point.

For one task there were 6 sky points and 689 nf1dots.  For the other the values were 60 and 74 respectively.  With only 6 sky points, the checkpoint interval will be very long and therefore lots of opportunity to 'lose' significant progress if calculations need to keep restarting from the previous checkpoint.

The total number of 'calculation loops' is the product of the two numbers, so the task with the 6 sky points and 689 nf1dots actually had a slightly lower total (6 × 689 = 4,134 versus 60 × 74 = 4,440).  In theory, it could have finished a little faster than the other one if there hadn't been all the lost crunching.

Each time an individual 'calculation loop' is completed, a 'dot' is written to the output - hence the rows of 'dots' that you can see.  When a sky point is completed, a checkpoint is written.  The short line that starts with a "% C" shows when that happens.  The digit following the "C" is the checkpoint number.  If you see partial groups of dots without the short checkpoint indicator following them, that represents lost crunching.
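If you'd rather not count dots by eye, a small throwaway script along these lines can tally them from a saved copy of the Stderr Output.  It's only a sketch and it assumes the layout described above - rows made up of dots, and checkpoint lines starting with "% C" - so adjust the matching if your output differs.

import re
import sys

# Read the saved Stderr Output text, e.g.  python3 count_dots.py < stderr.txt
text = sys.stdin.read()

dots_since_checkpoint = 0
for raw in text.splitlines():
    line = raw.strip()
    if line.startswith("% C"):
        # Assumed checkpoint marker, e.g. "% C 3"; the digit is the checkpoint number.
        match = re.search(r"C\s*(\d+)", line)
        label = match.group(1) if match else "?"
        print(f"checkpoint {label}: {dots_since_checkpoint} loops since the previous checkpoint")
        dots_since_checkpoint = 0
    elif re.fullmatch(r"\.+", line):
        # Only count lines that are purely dots, so ordinary log lines aren't included.
        dots_since_checkpoint += len(line)

if dots_since_checkpoint:
    print(f"{dots_since_checkpoint} loops after the last checkpoint (still in progress, or lost)")

If the count between two checkpoints comes out noticeably larger than the nf1dots value, that interval contains re-done (lost) work.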

It's worthwhile looking through the stuff returned to the project like this.  It can help diagnose the behaviour you are seeing.

Cheers,
Gary.

samos
Joined: 1 Jan 16
Posts: 5
Credit: 649480
RAC: 763

Gary Roberts wrote:

In any case, I'd recommend allowing tasks to stay in memory if they do happen to become suspended.  This will cut out the potential for significant loss of progress that can otherwise occur.  

Ok, done.

Gary Roberts wrote:

You haven't mentioned what type of Mac you are using but from what is listed on the website, it would appear to be a laptop of some type.  That seems to suggest that it's saving power by not allowing BOINC stuff to run for a lot of the time.

MacBook Pro 2018, quad-core i5 2.3 GHz, 8 GB RAM, 512 GB SSD (and yes, I think it is sleeping at some point)

Gary Roberts wrote:

 If you see partial groups of dots without the short checkpoint indicator following them, that represents lost crunching.

It's worthwhile looking through the stuff returned to the project like this.  It can help diagnose the behaviour you are seeing.

OK - I'll try to interpret it. With the one setting change above (regarding the keeping of non-GPU tasks in memory), I'll see how it goes for a few work units and report back.

Thanks for the pointers!

Sam.
