Tasks list.

adrianxw
Joined: 21 Feb 05
Posts: 242
Credit: 322654862
RAC: 0
Topic 225284

The website for Einstein is different from the normal BOINC projects sites. I want to look at a result, and hopefully diagnose why a task that normally complete in about 15 minutes, ran for 12 hours and then crashed. Can someone point me to the tasks list?

Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

San-Fernando-Valley
San-Fernando-Valley
Joined: 16 Mar 16
Posts: 260
Credit: 6911301637
RAC: 21663194


HOME

    VIEW ACCOUNT

         TASKS

(the "tasks" is below the list of computers ...)

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2139
Credit: 2752659092
RAC: 1474000


Or, if you click on the name of the affected computer, the tasks for that particular machine are accessible via links more similar to the normal BOINC template.

adrianxw
Joined: 21 Feb 05
Posts: 242
Credit: 322654862
RAC: 0


Thanks, found it, but realistically it doesn't tell me anything other than that the exit status (0x000000C5) is a timeout. The processing of the unit must have entered an endless loop somewhere. 12 hours instead of 15 minutes....

https://einsteinathome.org/task/1109792077

I've not changed anything, I'll leave it running as is.

Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

Tom M
Joined: 2 Feb 06
Posts: 5586
Credit: 7673166220
RAC: 1746653


adrianxw wrote:

Thanks, found it, but realistically, it doesn't tell me anything other than the exit status, (0x000000C5), is a timeout. The processing of the unit must have entered an endless loop somewhere. 12 hours instead of 15 minutes....

https://einsteinathome.org/task/1109792077

I've not changed anything, I'll leave it running as is.

I have had that problem when my Windows 10 GPU driver wasn't quite working right. So you might play with older/newer drivers to see if you can get rid of that problem.

I assume you understand you can still upgrade Win 8.1 to Win 10 for free? That might change the range of GPU drivers you can use.

Tom M

A Proud member of the O.F.A.  (Old Farts Association).  Be well, do good work, and keep in touch.® (Garrison Keillor)

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109384669467
RAC: 35937509


adrianxw wrote:
... it doesn't tell me anything other than the exit status, (0x000000C5), is a timeout.

It also tells you (lower down in the stderr output) that it was progressing for a while (11 checkpoints completed) and then nothing for a very long time until BOINC decided to pull the plug when the time limit was reached.

I know from a lot of experience seeing this exact sort of thing with AMD GPUs that it is very likely a driver lockup issue and nothing to do with the task itself. The issue declined massively with driver improvements a couple of years ago, but it still occurs today (fairly rarely) with older GPUs (like your Pitcairn) and with older systems, perhaps where the PSU isn't delivering power as cleanly and stably as it used to. I draw that inference because I have a lot of different hardware to see this issue occurring on, and those seem to be the common factors. It tends to occur more with older GPUs and PSUs.

The issue can be detected as soon as it occurs with some sort of automatic monitoring of the behaviour of the CPU support process that is looking after the GPU. All my GPUs are AMD (and could be affected), so every hour a script on a central machine checks each host in the fleet to see whether the CPU support process is accumulating CPU clock 'ticks' over a 2 sec interval. In that interval there will always be quite a few, provided things are running normally.

A zero answer gets checked by a repeat test and if still zero, I get a report that the task has stalled.  Before I developed a monitoring script, the detection depended on looking at the tasks in BOINC Manager and noticing that time was accumulating but no change in the % progress figure.
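A minimal sketch of that kind of check, assuming a Linux host where the PID of the science app's CPU support process is already known. The script described above runs from a central machine across a whole fleet; the remote-access part and how the PID is found are not shown, and all function names here are hypothetical. The sketch sums the utime and stime fields of /proc/<pid>/stat (both counted in clock ticks) and reports a stall only if the first 2 sec interval and one repeat interval both show zero ticks accumulated.

```python
import time
from typing import Callable


def parse_cpu_ticks(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # Field 2 (comm) is wrapped in parentheses and may itself contain spaces,
    # so only split the text that follows the last ')'.
    rest = stat_line[stat_line.rindex(")") + 1:].split()
    # rest[0] is field 3 (state), so field 14 (utime) is rest[11]
    # and field 15 (stime) is rest[12].
    utime, stime = int(rest[11]), int(rest[12])
    return utime + stime


def read_cpu_ticks(pid: int) -> int:
    """Read the current CPU tick total for a process (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_cpu_ticks(f.read())


def looks_stalled(sample: Callable[[], int], interval: float = 2.0,
                  sleep: Callable[[float], None] = time.sleep) -> bool:
    """True only if an initial test and one repeat test both see zero new ticks."""
    for _ in range(2):  # initial test, then the repeat test
        before = sample()
        sleep(interval)
        if sample() - before > 0:
            return False  # ticks are accumulating, so the process is working
    return True
```

For example, `looks_stalled(lambda: read_cpu_ticks(pid))` blocks for roughly 2 to 4 seconds. The repeat test means a single momentary pause does not trigger a report, and `sleep` is a parameter only so the logic can be exercised without waiting.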

Once the GPU has locked up this way, the only reliable 'cure' I have found is a full cold restart.  Suspending and resuming the task doesn't work.  Stopping and restarting BOINC doesn't work.  A full reboot always works.

After restarting, the last checkpoint is used (so the elapsed time gets reset back to the value saved there) and there is normal progress to completion.  Tasks always validate so I'm sure it's not a problem with the tasks themselves.

I realise this doesn't give you a 'workable' solution, but I thought it might be useful for you to know what the real situation probably is. My script detects one or two instances every week, but I do have a lot of older GPUs that really do need checking :-).

Cheers,
Gary.

adrianxw
Joined: 21 Feb 05
Posts: 242
Credit: 322654862
RAC: 0


I've just had another of these, this one after a little over two hours. I've set No New Tasks and I'll have a fuller look at this when I get a moment. Odd that the machine churns out dozens of work units per day without issue, then the same thing appears twice in quick succession.

Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

San-Fernando-Valley
Joined: 16 Mar 16
Posts: 260
Credit: 6911301637
RAC: 21663194


... would be great if you would point out which app you mean, i.e. GR or GW or what not ...

Just a little reminder:  GW tasks usually need more than a 2GB GPU -- not always, but mostly.

...

adrianxw
Joined: 21 Feb 05
Posts: 242
Credit: 322654862
RAC: 0


I've been away for a few days. When I came back, I found 21 errors. Most, but not all, are GW; there is a GR there as well. The GW tasks have similar runtimes of 2-3 hours. The GR one fell over after 8 hours.

Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

adrianxw
Joined: 21 Feb 05
Posts: 242
Credit: 322654862
RAC: 0


This is getting ridiculous. I've now got one, a GW, which normally runs to completion in ~15 minutes, that has run for 4:47:40, with the remaining time at suspension showing 6d 17:05:54 and increasing. It is showing 2.910% done. It is obviously trapped in a loop of some kind. Another, also a GW, has run for 1:44:35 with 16d 01:36:55 remaining, also increasing, and shows 0.449% done; again, a loop exit problem, I guess.
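The 'remaining' figure climbing while the percentage sits still is also what a naive estimate would do. As a rough sketch, assuming remaining time is extrapolated linearly from elapsed time and fraction done (BOINC's own estimator blends in other information, so this only approximates it; the helper names are hypothetical):

```python
def parse_hms(s: str) -> int:
    """Convert an 'H:MM:SS' elapsed-time string to seconds."""
    h, m, sec = (int(x) for x in s.split(":"))
    return h * 3600 + m * 60 + sec


def naive_remaining(elapsed_s: float, fraction_done: float) -> float:
    """Linear extrapolation: total = elapsed / fraction, remaining = total - elapsed."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_s * (1 - fraction_done) / fraction_done


def fmt_days(seconds: float) -> str:
    """Format seconds in the 'Nd HH:MM:SS' style BOINC Manager uses."""
    s = int(round(seconds))
    d, s = divmod(s, 86400)
    h, s = divmod(s, 3600)
    m, s = divmod(s, 60)
    return f"{d}d {h:02d}:{m:02d}:{s:02d}"


print(fmt_days(naive_remaining(parse_hms("4:47:40"), 0.02910)))
print(fmt_days(naive_remaining(parse_hms("1:44:35"), 0.00449)))
```

Under this assumption the first task comes out a little under 6d 16h and the second a little over 16d, in the same ballpark as the figures above. With the fraction frozen at 2.910%, every extra hour of run time adds about (1 - 0.0291) / 0.0291 ≈ 33 hours to the estimate, which is why it keeps growing.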

If not a loop exit issue, it could be that something that should be set somewhere isn't, so it goes back into the same loop, function or subroutine again and again.

I have an issue with the project at the moment on Windows 8.1 x64. I've deleted the work units not yet started and set No New Tasks.

Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3681
Credit: 33820930739
RAC: 37791888


adrianxw wrote:

This is getting ridiculous. I've now got one, a GW, which normally run to completion in ~15 minutes, which has run for 4:47:40, with a remaining at suspension of 6d 17:05:54 increasing. It is showing 2.910% done. It is obviously trapped in a loop of some kind. Another, also a GW, has run for 1:44:35 with the remaining 16d 01:36:55, also increasing, showing 0.449% done, again, a loop exit problem I guess.

If not a loop exit issue, it could be something is not being set somewhere, that should be set, and it goes back into the same loop or function, or subroutine again and again.

I have an issue with the project at the moment, Windows 8.1 x64, I've deleted work units not started yet and no new tasks set.

All of your successful completions are on the Gamma-Ray application.

All of your failures are on the Gravitational Wave application.

Your task list doesn't show that you've ever completed a GW task successfully.

I believe your GPU is too old to process Gravitational Wave tasks, or maybe these new tasks are consistently using more than the 2GB you have and are causing failures. Or are you trying to run more than 1 task at a time? You should disable work from this application to avoid further issues.
