Another scheduler instance is running for this host - - - ? ? ?

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7230018183
RAC: 1156068


Gary Roberts wrote:

Once the number of trashed tasks is such that the reduced daily limit is below the number of tasks already sent, the client is prevented from receiving new work until a new 'day' arrives.

<snip>

without having to wait for a new 'day' to start (midnight UTC).

Quibbling detail to add:

There is a long-standing flaw in this picture: different components of the overall system use different definitions of where the "new day" boundary falls.  In particular, the bit that decides how long to put off the next request sent by your host does not agree on that boundary with the bit that decides whether to honor a request once it is made.  This conflict has some secondary consequences that are both confusing and displeasing.
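
As a rough illustration of how two such definitions can disagree, here is a minimal Python sketch, assuming one component uses a rolling 24-hour window anchored at the request time while the other resets at the next 00:00 UTC.  The function names are invented for the sketch and do not correspond to actual BOINC code.

```python
from datetime import datetime, timedelta, timezone

def boundary_rolling_24h(request_time: datetime) -> datetime:
    # One component's view: the "new day" starts 24 hours after this request.
    return request_time + timedelta(hours=24)

def boundary_next_utc_midnight(request_time: datetime) -> datetime:
    # The other component's view: the "new day" starts at the next 00:00 UTC.
    next_day = (request_time + timedelta(days=1)).date()
    return datetime.combine(next_day, datetime.min.time(), tzinfo=timezone.utc)

now = datetime.now(timezone.utc)
print("rolling 24-hour boundary:", boundary_rolling_24h(now))
print("next UTC midnight       :", boundary_next_utc_midnight(now))
# Unless the request lands exactly on 00:00 UTC, the two boundaries differ,
# so one component can still refuse work after the other says it is time to ask.
```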

Keith Myers
Joined: 11 Feb 11
Posts: 4969
Credit: 18771915793
RAC: 7227767


Yes, I ran into this issue with my Nano when I was trying to get my custom app running.

The one-day backoff for turning in errored tasks turned into a three-day backoff before the scheduler finally sent work to my fixed configuration and I got my first non-error return.

So the definition of the "new day" seems to be variable.

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117783871940
RAC: 34685181


archae86 wrote:
There is a long-standing flaw in this picture, in that different components of the overall system have a different definition of when the "new day" boundary happens.

I think the full story might be that there are two different situations that are being handled, one by the server and one by the client.

The problem I was talking about is one where, through successive tasks failing computationally, the scheduler reaches the point where it sees that it has already distributed more than the now significantly reduced daily limit.  Under those circumstances, it always sets a backoff until just after the next midnight UTC.
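
A minimal sketch of that decision, assuming the scheduler simply compares what it has already sent against the reduced limit and, if over, defers until just after the next 00:00 UTC.  The variable and function names are illustrative, not the actual scheduler fields.

```python
from datetime import datetime, timedelta, timezone

def scheduler_backoff(tasks_sent_today: int, reduced_daily_limit: int,
                      now: datetime) -> timedelta:
    # Still under the (reduced) limit: no backoff, work can be sent now.
    if tasks_sent_today < reduced_daily_limit:
        return timedelta(0)
    # Over the limit: defer until just after the next midnight UTC,
    # when the daily quota resets.
    next_day = (now + timedelta(days=1)).date()
    next_midnight = datetime.combine(next_day, datetime.min.time(),
                                     tzinfo=timezone.utc)
    return (next_midnight - now) + timedelta(minutes=1)
```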

The separate problem is when some event trashes the complete list of local work for a particular search.  An example would be where a vital file (the app itself or a critical data file) suddenly fails an MD5 test.  I've had that happen quite a few times over the years, and the client's response (not the server's) has been to immediately impose a full 24-hour backoff, irrespective of the proximity to midnight UTC.
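
A sketch of that client-side reaction, under the assumptions just described: a vital file fails its MD5 check and a flat 24-hour backoff is applied on the spot, regardless of the time of day.  hashlib is real; the function names and the backoff bookkeeping are invented for illustration, and the real client does considerably more (marking the affected tasks as errors, logging, and so on).

```python
import hashlib
from datetime import datetime, timedelta, timezone

def md5_matches(path: str, expected_md5: str) -> bool:
    # Recompute the file's MD5 and compare it with the expected value.
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest() == expected_md5

def check_vital_file(path: str, expected_md5: str):
    if md5_matches(path, expected_md5):
        return None                      # file is fine, nothing to do
    # Checksum failure: the local work for this search is trashed and a
    # flat 24-hour project backoff starts immediately.  The trashed tasks
    # are NOT reported yet; they stay in the state file until the next
    # scheduler contact.
    return datetime.now(timezone.utc) + timedelta(hours=24)
```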

I know it's the client's decision here because the full cache of trashed tasks is still sitting there when I notice the problem.  There has been no server contact to return them.  I actually quite appreciate this behaviour.  Once whatever originally caused the problem is sorted, it's fairly straightforward to edit the state file and remove all the <result> blocks that have been trashed, being careful not to disturb any unreported 'good' results.  There are also some status values that need to be adjusted, but it only takes a few minutes with a good editor to sort that out.  On restarting BOINC, any good results can be reported, and I then have the chore of multiple updates to restore the full cache of trashed work, twelve resends at a time.
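
For the record, here is a hedged sketch of that state-file surgery done in Python rather than by hand.  It assumes BOINC is stopped, that client_state.xml parses as well-formed XML, and that the trashed entries are <result> blocks whose <state> value marks a compute error (3 in the client, from memory; check against your own file).  It does not attempt the other status adjustments mentioned above, and it writes to a separate file for inspection rather than overwriting anything.

```python
import shutil
import xml.etree.ElementTree as ET

STATE_FILE = "client_state.xml"       # only touch this while BOINC is stopped
shutil.copyfile(STATE_FILE, STATE_FILE + ".bak")       # always keep a backup

tree = ET.parse(STATE_FILE)
root = tree.getroot()

for result in list(root.iter("result")):
    state = (result.findtext("state") or "").strip()
    if state == "3":                  # assumed compute-error state: trashed
        # ElementTree removes elements via their parent, so find the parent.
        parent = next(p for p in root.iter() if result in list(p))
        parent.remove(result)

# Write the cleaned tree to a new file; inspect it, then rename it over
# client_state.xml (still with BOINC stopped) if it looks right.
with open(STATE_FILE + ".cleaned", "w", encoding="utf-8") as out:
    out.write(ET.tostring(root, encoding="unicode"))
```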

I would have used this technique at least 20 times over the years to recover from this type of failure.  It has always worked and is the main reason why I don't want to see the 'resend lost tasks' feature ever turned off :-).  I like to crunch what is sent and hate to see hundreds of trashed tasks having to be sent out to some other host.  Yep, I know I'm weird :-).

Keith Myers wrote:
Yes, I ran into this issue with my Nano when I was trying to get my custom app running.

Was there some problem with the BOINC client handling the custom app?  If so, the client would probably just slap on a 24-hour backoff, irrespective of the time of day.

In your case, you may not have had a choice if the client immediately reported to the server.  The same situation would exist if you were to force an update to clear the debris without 'hiding' the fact that tasks were trashed.  In my case, since no failed tasks were ever being reported, the server was unaware and so had no problem dishing out the resends, and no additional backoff was imposed.  With the server sending back 12 resends, the client was happy as well :-).
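
For anyone wanting to script the "multiple updates" chore mentioned earlier, here is a minimal sketch using boinccmd, the command-line tool that ships with the client.  The project URL, the number of passes and the pause between them are placeholders; adjust them to suit your own cache.

```python
import subprocess
import time

PROJECT_URL = "https://einsteinathome.org/"  # placeholder: use the URL your client knows

for _ in range(10):     # enough passes to refill a typical cache, 12 resends at a time
    subprocess.run(["boinccmd", "--project", PROJECT_URL, "update"], check=True)
    time.sleep(600)     # give the scheduler a breather between requests
```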

Cheers,
Gary.

Keith Myers
Joined: 11 Feb 11
Posts: 4969
Credit: 18771915793
RAC: 7227767


Yes, it was just successive 24-hour penalty-box delays for each connection.

why-do-i-get-deferred-for24-hours-when-i-try-sign-boinc-manager

 
