Another scheduler instance is running for this host - - - ? ? ?

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7024914931
RAC: 1808801

Gary Roberts wrote:

Once the number of trashed tasks is such that the reduced daily limit is below the number of tasks already sent, the client is prevented from receiving new work until a new 'day' arrives.

<snip>

without having to wait for a new 'day' to start (midnight UTC).

Quibbling detail to add:

There is a long-standing flaw in this picture, in that different components of the overall system have different definitions of when the "new day" boundary happens.  In particular, the bit that decides how long to put off your host's next request does not agree on that boundary with the bit that decides whether to honor a request once it is made.  This conflict has some secondary consequences that are both confusing and displeasing.
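
To make that concrete, here is a minimal Python sketch of the mismatch, assuming (purely for illustration; the function names and the flat 24-hour figure are mine, not from the BOINC source) that one component treats the "new day" as the next midnight UTC while another applies a flat 24-hour deferral:

# Illustration only: these names and the flat 24-hour figure are assumptions,
# not taken from the BOINC source.  One component treats the "new day" as the
# next midnight UTC, another as a rolling 24-hour window, so the two
# boundaries can disagree by up to almost a full day.
from datetime import datetime, timedelta, timezone

def next_midnight_utc(now: datetime) -> datetime:
    """'New day' as a calendar boundary: the next 00:00 UTC."""
    tomorrow = (now + timedelta(days=1)).date()
    return datetime.combine(tomorrow, datetime.min.time(), tzinfo=timezone.utc)

def flat_24h_deferral(now: datetime) -> datetime:
    """'New day' as a rolling window: exactly 24 hours from the request."""
    return now + timedelta(hours=24)

now = datetime(2021, 6, 1, 23, 30, tzinfo=timezone.utc)    # 23:30 UTC
print(next_midnight_utc(now))     # 2021-06-02 00:00 UTC -- half an hour away
print(flat_24h_deferral(now))     # 2021-06-02 23:30 UTC -- a full day away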

Keith Myers
Joined: 11 Feb 11
Posts: 4704
Credit: 17550092080
RAC: 6433607

Yes, I ran into this issue with my Nano when I was trying to get my custom app running.

The one-day backoff for turning in errored tasks turned into a three-day backoff before my fixed configuration was finally sent work and I got my first non-error return.

So the definition of the "new day" seems to be variable.

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109410941167
RAC: 34965813

archae86 wrote:
There is a long-standing flaw in this picture, in that different components of the overall system have a different definition of when the "new day" boundary happens.

I think the full story might be that there are two different situations that are being handled, one by the server and one by the client.

The problem I was talking about is one where, as successive tasks fail computationally, the scheduler eventually sees that it has already distributed more tasks than the now significantly reduced daily limit allows.  Under those circumstances, it always chooses to set a backoff until just after the next midnight UTC.
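
As a rough sketch of that behaviour, assuming the scheduler halves the host's daily quota for each computation error and defers the host until just after the next midnight UTC once the tasks already sent exceed that reduced quota (the starting quota and the function names below are placeholders, not values from the BOINC server code):

from datetime import datetime, timedelta, timezone

BASE_DAILY_QUOTA = 960          # assumed starting per-host daily limit

def reduced_quota(errors_today: int) -> int:
    # Halve the quota for each computation error, never dropping below 1.
    quota = BASE_DAILY_QUOTA
    for _ in range(errors_today):
        quota = max(1, quota // 2)
    return quota

def backoff_seconds(now: datetime, sent_today: int, errors_today: int) -> int:
    # No deferral while the host is still under its (reduced) quota.
    if sent_today < reduced_quota(errors_today):
        return 0
    # Otherwise defer until just after the next midnight UTC.
    next_day = datetime.combine((now + timedelta(days=1)).date(),
                                datetime.min.time(), tzinfo=timezone.utc)
    return int((next_day - now).total_seconds())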

The separate problem is when some event trashes the complete list of work held locally for a particular search.  An example would be where a vital file (the app itself or a critical data file) suddenly fails an MD5 check.  I've had that happen quite a few times over the years and the client's response (not the server's) has been to immediately impose a full 24-hour backoff, irrespective of the proximity to midnight UTC.
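
For context, the check involved is just an MD5 digest comparison; a minimal sketch, assuming the client compares a file's digest against the checksum the project supplied (the path and expected digest are placeholders):

import hashlib

def md5_matches(path: str, expected_md5: str) -> bool:
    # Hash the file in chunks and compare against the project-supplied digest.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest() == expected_md5

When a vital file fails this kind of check, every task that depends on it is errored out, and (in the experience described above) the client imposes the flat 24-hour backoff on its own.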

I know it's the client's decision here because the full cache of trashed tasks is still sitting there when I notice the problem.  There has been no server contact to return them.  I actually quite appreciate this behaviour.  Once whatever was the original cause of the problem is sorted, it's fairly straightforward to edit the state file and remove all the <result> blocks that have been trashed, being careful to not disturb any unreported 'good' results.  There are also some status values that need to be adjusted but it only takes a few minutes with a good editor to sort that out.  On restarting BOINC, any good results can be reported and I then have the chore of multiple updates to restore the full cache of trashed work, twelve resends at a time.
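
A scripted version of that state-file surgery might look like the sketch below, assuming BOINC has been stopped first and that the trashed tasks can be identified by name.  The path and the task name are placeholders, and the sketch only removes the <result> blocks; it doesn't touch the status values mentioned above.  Keep a backup, since a malformed state file will stop the client from starting.

import re, shutil

STATE_FILE = "client_state.xml"                  # assumed path
TRASHED = {"h1_1234.56__example_task_0"}         # hypothetical task names

shutil.copy(STATE_FILE, STATE_FILE + ".bak")     # keep the original safe
text = open(STATE_FILE, encoding="utf-8").read()

def keep(match: re.Match) -> str:
    # Drop a <result> block only if its <name> is in the trashed set.
    block = match.group(0)
    name = re.search(r"<name>(.*?)</name>", block).group(1)
    return "" if name in TRASHED else block

text = re.sub(r"<result>.*?</result>\n?", keep, text, flags=re.S)
open(STATE_FILE, "w", encoding="utf-8").write(text)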

I must have used this technique at least 20 times over the years to recover from this type of failure.  It has always worked and is the main reason why I don't want to see the 'resend lost tasks' feature ever turned off :-).  I like to crunch what is sent and hate to see hundreds of trashed tasks having to be sent out to some other host.  Yep, I know I'm weird :-).

Keith Myers wrote:
Yes, I ran into this issue with my Nano when I was trying to get my custom app running.

Was there some problem with the BOINC client handling the custom app?  If so, the client would probably just slap on a 24-hour backoff, irrespective of the time of day.

In your case, you may not have had a choice if the client immediately reported the failures to the server.  The same situation would exist if you were to force an update to clear the debris without 'hiding' the fact that tasks were trashed.  In my case, since no failed tasks were ever reported, the server was unaware of the problem and so had no issue with dishing out the resends, and no additional backoff was imposed.  With the server sending back 12 resends at a time, the client was happy as well :-).

Cheers,
Gary.

Keith Myers
Joined: 11 Feb 11
Posts: 4704
Credit: 17550092080
RAC: 6433607

Yes, it was just successive 24-hour penalty-box delays for each connection.

why-do-i-get-deferred-for24-hours-when-i-try-sign-boinc-manager

 
