High number of task timeouts

n12365
Joined: 4 Mar 16
Posts: 26
Credit: 6491436572
RAC: 7828
Topic 220060

I am having trouble understanding the high number of task timeouts one of my hosts is experiencing.  This is the host: https://einsteinathome.org/host/12568730

The tasks in question have a status of “Timed out - no response” and it looks like there are approximately 1,000 tasks in this state.  They all show a very short period of time between when they were sent and “TIME REPORTED OR DEADLINE”.  Here are a few examples with a time difference of 5 minutes.

https://einsteinathome.org/task/900900798

https://einsteinathome.org/task/900900792

https://einsteinathome.org/task/900900741

Can someone help me understand what is going on with these tasks?  A five minute timeframe to complete the task doesn’t make sense, so I assume I am misunderstanding something.  Or maybe there is something wrong with my host.

Ryan

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5845
Credit: 109973196322
RAC: 29809412

n12365 wrote:
Can someone help me understand what is going on with these tasks?

Yes, I believe I know with a fair degree of confidence the backstory for what you've experienced.

I started a response soon after you posted and did a fair bit of research into some precise details from your complete list of errors.  My notes remind me that your full list (at the time I looked this morning) had some 3402 errors, of which only 2 that I saw were actual compute errors.  I didn't examine every last one but I did glance through lots of the 170 pages.  There is a reason for these "Timed out - no response" errors.

I had spent several hours researching and had almost finished the entire message when a short power fluctuation took out the machine I was using and a bunch of others.  I've spent the last 7 hours getting the fleet settled again - there was some collateral damage so I had a few running repairs to make.  I've just fired up Firefox to see what might remain of what I was working on and what I'd already typed in.  As expected, absolutely nothing.

Fortunately, I'd written extensive notes of what I'd managed to work out so it shouldn't take all that long to type it all in again - it's quite a story.  It's long past my dinner time and I'm grumpy about power distribution networks that allow these sorts of very short glitches to occur, so I'm going to go eat, cool down and catch a good night's sleep.

I'll start retyping the response first thing in the morning - after I check that the farm are still all happy :-).

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5845
Credit: 109973196322
RAC: 29809412

First, some background information needed to fully understand things.

The BOINC framework needs to provide reliable communications between client and server over an inherently unreliable medium - the internet.  On each request from a client, the server checks information from the client against what it has in its own database.  It confirms that the two sides are in agreement and then has to source suitable tasks to fill the client's request.  If the server detects a discrepancy, it will try to do something about it.

Just one of the possible safeguards is the ability to do something about previously allocated work that appears, for whatever reason, to have become 'lost' at the client end.  The server can attempt to 'resend lost tasks'.  One of probably many ways that tasks can become lost is unreliable comms: the server sent a bunch of work but the client didn't receive the details.  Another, obviously, is if the client's own database (the state file, client_state.xml) had the information but then got damaged in some way.
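To make the 'lost task' idea a bit more concrete, here is a minimal sketch of the comparison involved.  This is not the actual BOINC server code and the task names are made up - it's just the general idea of the server checking what it thinks a host has against what the host actually reports in its scheduler request.

```python
# Minimal sketch of 'lost task' detection - NOT the real BOINC scheduler code.
# The server compares the results its database says are in progress on a host
# against the results the client actually lists in its scheduler request.

def find_lost_tasks(server_in_progress, client_reported):
    """Tasks the server believes the host has but the host didn't mention."""
    return sorted(set(server_in_progress) - set(client_reported))

# Hypothetical task names, purely for illustration.
server_side = ["h1_0123.45_O2MDF_example_task_1", "LATeah_example_task_2"]
client_side = ["LATeah_example_task_2"]

print(find_lost_tasks(server_side, client_side))
# ['h1_0123.45_O2MDF_example_task_1']  <-- a candidate for 'resend lost tasks'
```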

Another bit of important background in your case is that Einstein uses a 'home grown' (and older) version of the BOINC server code in order to handle locality scheduling, which is vital for the efficient deployment of the very large numbers of huge data files required for GW type searches.  That code uses the duration correction factor (DCF) to 'correct' the time it will take to crunch a task.  Each task contains an 'estimate' of its work content.  Even for a standard work content, different hardware, configured and operated differently by each individual owner and running different operating systems, is liable to take a different amount of time.  DCF was designed to adjust the estimate progressively (for each volunteer computer) so that, irrespective of its state of configuration, the crunch time estimate could be made to match reality.

The problem with DCF is that there is only one per project, not one per individual search being run at that project.  For DCF to have a chance of working properly here, there would need to be one per search unless, somehow, the work content could be more accurately defined in the first place.  As an example of what I'm talking about, for the FGRPB1G GPU search, all my hosts need a DCF of somewhere around 0.25 to 0.35.  In other words, for the conditions I'm using, the initial estimate is around 3 times larger than what the task will actually take.

For the GW search under the conditions I'm testing (RX 570 GPU running tasks 4x) the DCF is close to 5.  Admittedly, because I'm running 4x (which pretty much doubles the daily output), the crunch time is around twice what a single task would take.  So for single tasks, the DCF would be close to 2.5, which is still close to an order of magnitude larger than what I need for FGRPB1G.  Estimated crunch times are going to act in a very crazy fashion if I try to run both searches simultaneously.
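To see what that does to the estimates, here is the rough arithmetic as a little sketch.  The base estimate is a made-up number and the DCF values are just the approximate ones I quoted above - the point is only that a single shared DCF ends up wrong by close to a factor of ten for whichever search it hasn't settled on.

```python
# Rough arithmetic only - base_estimate_hours is a made-up number and the
# DCF values are the approximate ones quoted above.  The client shows
# base_estimate * current_DCF for every task, regardless of which search
# the task belongs to.

base_estimate_hours = 2.0      # hypothetical initial estimate per task

dcf_fgrpb1g = 0.3              # roughly what FGRPB1G needs on my hosts
dcf_gw_single = 2.5            # roughly what a single GW task needs

# If the shared DCF has settled at the GW value, FGRPB1G tasks are shown
# about 8x longer than they really are ...
print(base_estimate_hours * dcf_gw_single, "h shown vs",
      base_estimate_hours * dcf_fgrpb1g, "h real for FGRPB1G")

# ... and if it has settled at the FGRPB1G value, GW tasks are shown
# about 8x shorter than they really are.
print(base_estimate_hours * dcf_fgrpb1g, "h shown vs",
      base_estimate_hours * dcf_gw_single, "h real for GW")
```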

There are several ways to cope with this.  The easiest is to choose one of the GPU searches and run only that one on a given machine.  A second way would be to alternate between searches over a suitable time span - eg. a week or a month.  You could run one search for a month unattended, set NNT at the end and return all the completed work, and then reconfigure for the other search and set it going.  The changeover could be fairly quick to do once a month, say.

A third way would be to allow both searches but set a very low work cache size, bearing in mind that the deadlines currently differ - 7 days for GW and 14 days for FGRPB1G last time I noticed.  If you set no more than, say, 0.2 days total cache size, then even with wildly wrong estimates driven by a DCF an order of magnitude lower than it needs to be, the maximum actual work on board would only be 2 days.  At the other end of the scale, with a DCF 10 times too high, a full cache of the affected tasks would only be 0.02 days worth of true crunch time.  Neither of these two extremes should cause BOINC to panic, so it should be relatively safe for (close to) unattended operation.  You just need to convince yourself that wildly fluctuating estimates are not a concern.  The real potential concern is the project being unable to supply work, which would let such a small cache run dry quite quickly.
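For anyone who wants to plug in their own numbers, the arithmetic behind those two extremes is nothing more than this (the factor of 10 is just the rough DCF mismatch discussed above):

```python
# The two extremes for a small cache setting when the shared DCF is off
# by roughly a factor of 10 in one direction or the other.

cache_setting_days = 0.2
dcf_mismatch = 10          # rough order-of-magnitude error discussed above

# DCF far too low: estimates are too short, so BOINC fetches ~10x more
# real work than the setting suggests.
worst_case_real_work = cache_setting_days * dcf_mismatch    # 2.0 days

# DCF far too high: estimates are too long, so a "full" cache is only a
# tenth of the setting in true crunch time.
smallest_real_work = cache_setting_days / dcf_mismatch      # 0.02 days

print(worst_case_real_work, "days vs", smallest_real_work, "days")
```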

Now, the analysis of your situation.  All dates and times are UTC, as listed in the errors list.  Your machine has been running both FGRPB1G tasks and O2MDF (GW) tasks over a period.  Your GW errors list shows the "Timed out - no response" issue for close to two weeks.  Undoubtedly many older tasks have been completed by other hosts and removed from the online database over that period.  There is clear evidence of large numbers of these failures from Nov 20 onwards.

It's still happening now.  I looked yesterday, which would have been before midnight on Nov 25.  The latest entry I have in my notes is a task that was timed out at 08:01:11 on Nov 25.  It had been allocated at 07:56:09.  The earliest task in that complete series that timed out at much the same time (08:01:10) had been allocated at 07:50:46 - 11 pages of errors (about 220 individual items) earlier and nearly 6 minutes earlier in time.  In other words, the scheduler took nearly six minutes to assemble a list of over 200 tasks to send as a batch in response to a work request.  The client had probably given up waiting - there's bound to be some sort of timeout on waiting for scheduler responses.

If you really want to know what exactly happened, go back through your event log (stdoutdae.txt) or its predecessor (stdoutdae.old) if necessary and look for event log entries just earlier than 07:50 UTC for Nov 25.  Your times will be local times, probably.  Just convert if necessary.  You should be able to see the client request and anything that came back from the server.
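If you'd rather not scroll through the log by eye, a few lines of Python can pull out just the entries around that time.  This assumes the usual event log line format of "dd-Mon-yyyy hh:mm:ss [project] message" and that you point it at your copy of stdoutdae.txt in the BOINC data directory - adjust the path and the time window to suit, and remember the log uses local time.

```python
# Print event log lines from a chosen time window.
# Assumes lines start with a "dd-Mon-yyyy hh:mm:ss" timestamp (local time);
# adjust LOG, start and end to suit your own setup.
from datetime import datetime

LOG = "stdoutdae.txt"                     # or "stdoutdae.old" for the older log
start = datetime(2019, 11, 25, 7, 40)     # local-time window around the event
end   = datetime(2019, 11, 25, 8, 10)

with open(LOG, errors="replace") as f:
    for line in f:
        try:
            stamp = datetime.strptime(line[:20].strip(), "%d-%b-%Y %H:%M:%S")
        except ValueError:
            continue                      # skip lines without a timestamp
        if start <= stamp <= end:
            print(line.rstrip())
```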

I have no idea what the scheduler does these days when the response is finally ready to send but the client is perhaps no longer paying attention.  In the past, the scheduler would hold the tasks as 'lost tasks' and would send them in batches of 12 at a time in response to further requests from the client.  Because the list is already assembled, I imagine that could happen relatively quickly without keeping the client waiting for such a long time.  So the 'resend lost tasks' feature was probably quite useful in that context.

These days it would appear that the scheduler drops the whole bundle and marks the tasks as errors with a status of "timed out - no response".  This undoubtedly 'scares' the average volunteer into thinking it's entirely their problem.  As a complete side note, if a project is trying to 'encourage' the volunteers to keep contributing, it's hard to understand how 'claiming' (through the offensive terminology used) that the 'errors' are the sole responsibility of the client is the proper thing to do.  Why not a status of "Unable to allocate" or something like that?  And don't include them in the client error list if they are not client errors.

When I looked again today, there were nearly 200 more fresh "Timed out - no response" errors.  The situation with these tasks is very similar: assemble the full list over quite a long period, then drop the lot soon after trying to send them.  Here are the actual details of two tasks, one at the start and one at the end of a full batch.

Current error list page 152  -  Nov 26 - allocated at 00:20:19 and dropped at 00:27:14
Current error list page 161  -  Nov 26 - allocated at 00:26:02 and dropped at 00:27:15

Once again, around 6 minutes to assemble the full list and, around a minute later, a decision to drop the lot.  Notice (as was common on many previous days) that this all happened just after midnight UTC, which is exactly when a "new day" for daily allocation limits kicks in.  It looks very much like your client had been 'backed off' (ie. had used up its daily limit) for some time, and once a new day kicked in and it made a huge request, the scheduler wasn't able to handle the request in a timely manner.  Unfortunately, you need to work out how to break this cycle if you want to get back to normal operations.  Setting a low work cache size is the obvious first step.

I haven't looked at anything other than the GW errors list.  I have no idea how and when you get other tasks to crunch.  I have no idea of the reasonableness or otherwise of your work cache settings.  Hopefully, with some of the background I've gone through, you should be able to work out suitable settings for your situation.  If you have specific questions, please ask.  Sorry it took so long to get a response back to you.

For any of the staff happening to read, if this assessment is even roughly correct, the real problem would seem to be how long it takes for the scheduler to assemble enough tasks to meet a big request.  There seems to be a reasonable solution.  Do what is done for resending lost tasks.  Put a limit on the number that can be sent as one batch, eg. 12.  Send the 12 quickly and force the client to ask again if it really does need more.
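Just to make that suggestion concrete, here is a sketch of the idea only - it is nothing like the actual scheduler code and the numbers are illustrative:

```python
# Sketch of the suggestion above - NOT the actual BOINC scheduler code.
# Cap the number of tasks in any single scheduler reply and let the client
# come back and ask again if it genuinely needs more.

MAX_TASKS_PER_REPLY = 12   # same sort of limit as used when resending lost tasks

def build_reply(candidate_tasks, requested_seconds, est_seconds_per_task):
    """Return at most MAX_TASKS_PER_REPLY tasks, even for a huge request."""
    wanted = max(1, int(requested_seconds // est_seconds_per_task))
    return candidate_tasks[:min(wanted, MAX_TASKS_PER_REPLY)]

# A client asking for days of work still gets only a small batch that can be
# assembled and sent quickly, instead of a 200+ task list that takes minutes.
batch = build_reply([f"task_{i}" for i in range(1000)],
                    requested_seconds=5 * 86400,     # eg. a 5 day cache request
                    est_seconds_per_task=1800)
print(len(batch))   # 12
```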

Cheers,
Gary.

n12365
Joined: 4 Mar 16
Posts: 26
Credit: 6491436572
RAC: 7828

Gary,

Thanks for taking the time to look into this and thanks for the detailed description of what is happening.  I originally had the work cache set to 5 days.  I just changed it to 0.5 days.  I also changed my project preference to only include GW applications.  Hopefully these two changes will cause this issue to settle down.

Ryan

gBaker
Joined: 2 Apr 17
Posts: 3
Credit: 2510551458
RAC: 168066

I also recently had this problem around midnight UTC 28-29 November after upgrading my graphics card to a Radeon VII.

https://einsteinathome.org/host/12796547/tasks/6/0

At that time, I think I was in the middle of dialing in an undervolt for the card.  At some point during the day I set the maximum job cache to something very low to avoid having to abort so many tasks if I had to reinstall the OS after messing up a driver installation, but I don't recall whether that change was before or after these errors.

At the time I was also running all possible tasks.  The following day I disabled O2MD tasks since they didn't seem to be utilizing the Radeon VII well (running them about as fast as my old 750ti).

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5845
Credit: 109973196322
RAC: 29809412

gBaker wrote:
I also recently had this problem around midnight UTC 28--29 November after upgrading my graphics card to a Radeon VII.

Yes, it certainly looks like it.  The link you provided shows 1579 errors, of which 8 belong to the gamma-ray pulsar search (FGRPB1G) and the remaining 1571 are GW tasks - and that is where the real issue lies.  A good way to look at these is to sort them in ascending order of date sent by clicking on the 'date sent' column header.  Here is the sorted list as it currently exists.

I've had a good look through the entire list and it shows quite similar scheduler behaviour to what I saw when I looked through the OP's list.  There was also something extra in yours which I don't really understand.  It concerns the set of tasks covering the first 10 pages, which started being allocated at 21:00:00 on Nov 28 and seemed to end at item 13 on page 10, where the allocation time was listed as 21:12:49.  So it took nearly 13 minutes to gather together this list of tasks.  The bit I don't understand is that later on, at page 45 item 4, with the time showing as 22:32:06 (ie. 80 mins later), more tasks seem to be added to this same particular batch.  I say 'same batch' because there is a common time at which the tasks from the first 10 pages plus the 'extras' starting at page 45 all get dropped as a group, and that happens to be 01:49:54 on Nov 29.

Between the initial allocation time of 21:00:00 and when all those tasks were ultimately dropped (nearly five hours later), there were several batches of task allocation, each very quickly followed by the whole batch being dropped - just as I saw for the earlier case reported by the OP in this thread.  I'll describe just one of these batches.  Starting on page 26 at item 13, a new batch began with an allocation time of 22:03:13.  The last task in this series is item 3 on page 45, with an allocation time of 22:13:19.  The scheduler took more than 10 minutes to construct a list of around 360 tasks.  Ten seconds later, at 22:13:29, the scheduler started dropping the full batch, and the operation completed one second later at 22:13:30.  My guess is that the very long delay before the scheduler had a response ready to send meant the client was no longer in a position to receive it.  The scheduler probably got immediately rejected at the client end, so it just dropped the whole batch of 360+ tasks.

I'm not a programmer and I have no real understanding of the low level details of how client/server communications are supposed to work but, as with the previous case, it seems to me highly likely that the long delay in responding to the initial client request must be the real issue here.  Hopefully, the staff can take a good look at all this and work out how to solve the problem.

In the meantime, make sure work requests are relatively small so that the scheduler can respond in a timely manner.  I've had no problems getting batches of around 10-20 tasks at a time.  Even then you can notice a definite delay - it can take up to a minute to get the response back.

Due to the wildly divergent DCF values - the two GPU searches are both far from the standard 1.0 value, and on opposite 'sides' of it - it's very difficult to get sensible work fetch happening if you're trying to support both searches simultaneously.  For that reason, decide on one or the other and don't mix them, unless you want to spend all your time micromanaging things :-).

Cheers,
Gary.
