Some Observations about 'Resends'

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117255231941
RAC: 36209914
Topic 225741

Resends are extra copies of a task.  They get produced and distributed whenever one of the two copies needed to form a quorum fails (for whatever reason).

I run a lot of hosts so I get to see a lot of resends from other people's failures.  That, in itself, is usually of no concern.  However, with the variable elapsed time behaviour of GRP tasks (LATeah3... and LATeah4... style tasks) I've noticed a nasty surprise that seems to be linked with resends under certain conditions.

Most of my hosts are relatively old and have a single GPU.  None of them run CPU tasks.  Five of them have dual GPUs (always matched pairs) and it's just those hosts (so far) that have had the problem.  It's happened enough times for me to attempt to better characterise it and thereby do something about it.  Those 5 hosts have older AMD 2GB GPUs - HD7850s, R7 370s, RX 460s, RX 560s.  They crunch two tasks at a time so there are always 4 running tasks.

I try very hard to avoid 'micro-managing' so I rarely hook up peripherals and run BOINC Manager on any host unless there's a problem.  Sometimes I'll look at a host remotely from a central 'server' machine, but certainly not routinely.

I have a number of scripts to monitor/control the fleet, which avoids the need for that.  The two main things I want the scripts to do are: 1) regularly check the status of each host and report if anything seems amiss, and 2) control work fetch so that every host has a completely up-to-date supply of data files before any work fetch event occurs.  Checking host status occurs hourly.  Work fetch events occur immediately after a check, usually at a 3 or 4 hour gap from the previous one.  The purpose of controlling work fetch is to minimise the bandwidth that would be consumed if each host were responsible for maintaining its own cache of files.  I use a 'download once and deploy to all' policy, since the connection is shared with a business that uses the internet extensively.
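
For the curious, here's a much-simplified sketch in Python of the general shape (not my actual scripts - the hostnames and RPC password are placeholders).  Everything goes through boinccmd's remote RPC interface, and keeping each host on 'no new tasks' between fetch events is what stops it fetching on its own:

```python
#!/usr/bin/env python3
# Simplified sketch of fleet monitoring / controlled work fetch.
# Hostnames, password and the exact checks are placeholders.
import subprocess

HOSTS = ["host01", "host02"]             # the fleet (placeholders)
PASSWD = "rpc-password"                  # GUI RPC password (placeholder)
PROJECT = "https://einsteinathome.org/"  # project master URL

def boinccmd(host, *args):
    """Run one boinccmd operation against a remote BOINC client."""
    cmd = ["boinccmd", "--host", host, "--passwd", PASSWD, *args]
    return subprocess.run(cmd, capture_output=True, text=True,
                          check=True).stdout

def check_status(host):
    """Hourly check: warn if any task on the host shows a compute error."""
    if "compute error" in boinccmd(host, "--get_tasks"):
        print(f"WARNING: {host} has errored tasks")

def fetch_work(host):
    """Controlled fetch: open the tap, ask for work, close the tap,
    so the host can't fetch on its own between events."""
    boinccmd(host, "--project", PROJECT, "allowmorework")
    boinccmd(host, "--project", PROJECT, "update")
    boinccmd(host, "--project", PROJECT, "nomorework")

for h in HOSTS:
    check_status(h)
```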

The problem I've seen quite a few times on several different dual GPU hosts is this.  All of a sudden, and for no obvious reason, many (if not all) tasks on board get trashed with a compute error, more or less 'out of the blue'.  In some cases they get reported and the host ends up in a 24hr backoff.  In other cases, my script warns me before they get reported and I recover by editing the state file to make all the compute errors become lost tasks and then having the server resend them.  It's pretty tedious having to keep 'updating' to get each additional batch of 12 'lost tasks' through that process :-).  After restoring the cache of work, I could never find any obvious 'smoking gun' for why it happened - until the last time it happened :-).
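
In case anyone wants to try that recovery themselves, here's a minimal sketch of the idea.  It assumes BOINC is stopped first, that tasks appear in client_state.xml as <result>...</result> blocks, and that <state>3</state> marks a compute error; deleting a block makes the server treat that task as 'lost'.  This is an illustration, not my exact script, so keep a backup:

```python
#!/usr/bin/env python3
# Sketch of the "compute errors -> lost tasks" repair. Assumes BOINC
# is stopped, and that in client_state.xml each task is a
# <result>...</result> block where <state>3</state> marks a compute
# error. Removing the block makes the server resend the task as 'lost'.
import re
import shutil

STATE = "/var/lib/boinc-client/client_state.xml"  # path varies by distro

shutil.copy2(STATE, STATE + ".bak")  # never edit the state file without a backup
with open(STATE) as f:
    text = f.read()

result_block = re.compile(r"<result>.*?</result>\s*", re.DOTALL)

def drop_errored(match):
    # Delete blocks that record a compute error; keep everything else.
    return "" if "<state>3</state>" in match.group(0) else match.group(0)

with open(STATE, "w") as f:
    f.write(result_block.sub(drop_errored, text))
```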

I didn't have a clue what was causing this until, purely by chance, I was looking (over the LAN) at the tasks list on one of these hosts when it just started happening before my eyes.  I got quite a shock to see how quickly things were heading south :-).

I quickly suspended all remaining tasks, stopped BOINC, and counted about 6 completed tasks and around a further 10 that had been trashed before I had time to suspend.  I noticed that the first two in the trashed list were LATeah3 resends.  This event happened not long ago when new tasks were LATeah4 and resends included lots of LATeah3.

I immediately wondered if crunching resends of the previous type together with primary tasks of the current type had anything to do with the cause, but I decided to get rid of the compute errors and recover the work cache before investigating further.  I carefully deleted the trashed tasks in the state file, restarted BOINC, returned the completed work and received the failed tasks back via the "resending lost tasks" mechanism.  As all the remaining tasks were still suspended, BOINC launched the first 4 of the resends immediately, so I promptly suspended all but one of these.

My objective was to create spacing between the 4 running tasks to see if that allowed them to run without failure.  By spacing them out, I ended up with 2 tasks on GPU0 at ~70% and ~25% and 2 tasks on GPU1 at ~45% and 0% (just starting).  At least for a while, no tasks on either GPU would be trying to start or finish simultaneously with any other.  Of course, with the different run times for resends, that situation doesn't last for very long.
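
That manual spacing step could be scripted.  A rough sketch (the task names and the gap are placeholders, and it assumes everything else is already suspended):

```python
#!/usr/bin/env python3
# Sketch of the spacing step: with all tasks suspended, resume them
# one at a time with a gap so start/finish times don't coincide.
import subprocess
import time

PROJECT = "https://einsteinathome.org/"
TASKS = ["task_name_1", "task_name_2"]  # real task names go here
GAP = 20 * 60                           # seconds between starts; a guess

for name in TASKS:
    subprocess.run(["boinccmd", "--task", PROJECT, name, "resume"],
                   check=True)
    time.sleep(GAP)
```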

There were no further issues and, after some further observation, I decided that things were 'back to normal'.  My strong guess, formed after checking the event log later, was that the problem was memory related (only 2GB to handle 2 tasks) and was triggered either during startup or during finishing if the two tasks on a GPU were almost fully 'aligned'.

Since that time, I have monitored all 5 dual GPU hosts (usually once per day) to keep tasks well spaced.  For a while there were lots of LATeah3 resends so I paid close attention; because of the elapsed time differences between the two types, things can quite quickly get out of whack.  Now that there are very, very few '3' resends (but lots of '4'), I stopped doing that about a week ago.  There have been no more issues since the one described above, so I'm willing to assume that my guess is likely to be at least part of the explanation.

At some point, there may be a transition from '4' back to '3' or perhaps to something else.  Since I don't want to go back to close personal monitoring, I've written a new script to monitor resends and to give a 'heads-up' when there might be a potential problem.  I've only seen the problem with the dual GPU hosts but the new script monitors resends on all hosts, just in case it turns up on single GPU hosts as well.

Over the last several days, I've been logging the details for all hosts.  As part of the process, I get information about resends in the different parts (stages of processing) in the work cache at the time the script runs.  I've named these stages as 'unreported' (crunched & uploaded but not reported), 'running' (actively crunching) and 'waiting' (the bulk of the cache, still yet to start).  I get these values for every host and a grand total for all.  As a result of looking at the data, I've noticed something interesting.  I've only been logging for a couple of days but so far the number of resends seems to skyrocket around midnight UTC or soon after.  My local time is UTC+10 so midnight UTC is 10:00AM the next morning :-).
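
For the record, the counting itself is simple.  Here's a cut-down sketch of what the script does (not the script itself): it assumes the usual convention that an issue number of _2 or higher at the end of a task name marks a resend, and it bins each resend using fields that boinccmd --get_tasks prints:

```python
#!/usr/bin/env python3
# Sketch of the resend census for one host. Assumes task names end in
# an issue number (_0/_1 = primary copies, _2+ = resends) and bins
# each resend as unreported, running or waiting.
import re
import subprocess
from collections import Counter

def resend_census(host, passwd, series="LATeah4"):
    out = subprocess.run(
        ["boinccmd", "--host", host, "--passwd", passwd, "--get_tasks"],
        capture_output=True, text=True, check=True).stdout
    counts = Counter()
    # Each task stanza in the output begins with a "name:" line.
    for stanza in re.split(r"\n(?=\s*name: )", out):
        m = re.search(r"name: (\S+)", stanza)
        if not m or series not in m.group(1):
            continue
        issue = re.search(r"_(\d+)$", m.group(1))
        if not issue or int(issue.group(1)) < 2:
            continue  # _0 and _1 are the primary copies, not resends
        if "ready to report: yes" in stanza:
            counts["unreported"] += 1      # crunched & uploaded, not reported
        elif "active_task_state: EXECUTING" in stanza:
            counts["running"] += 1         # actively crunching
        else:
            counts["waiting"] += 1         # still yet to start
    return counts

print(resend_census("host01", "rpc-password"))  # placeholders
```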

Below are today's resend stats for around 23:00 UTC on Jul 27 and 02:00 UTC on Jul 28.  I currently maintain a 0.6 day cache of work and all hosts had been 'topped-up' immediately prior to those two times.  At each time there were two separate searches for resends, one for LATeah3 resends and one for the LATeah4 type.  For a couple of days (including today) there have been zero resends for '3' so I'll just quote the '4' results.  The first one finished at Wed Jul 28 08:55:05 EST 2021 local time (22:55:05 UTC) and the second one at Wed Jul 28 11:45:08 EST 2021 local time (01:45:08 UTC).  The run time is around 2.5 mins to process all the hosts.  I was quite interested to see the big jump in resend numbers as a result of the 2nd work fetch, just 3 hrs after the first.

Cumulative eah4 type resend totals:   Total=559   Running=  7   Waiting=544   Unreported=  8

Cumulative eah4 type resend totals:   Total=890   Running= 21   Waiting=855   Unreported= 14

Before I had written this script and started getting hard numbers, I had the impression that surprisingly large numbers of resends turn up in work fetches around my mid-morning (11:00am local).  The above agrees with that observation and suggests that maybe resends are not put in the ready-to-send queue immediately when the failure occurs, but rather lumped into batches which get inserted at particular times in the 24 hour cycle.  Perhaps around midnight UTC is one of those times.  I've also started noticing that I don't get many resends in my late afternoon/early evening - around 06:00 to 09:00 UTC.

I thought I'd document the problem of dual low-memory GPUs crunching GRP tasks x2 in case anyone else has observed something similar.  My next step is to think about how to detect when tasks of different series become 'aligned' in their start and finish times, since that seems to be the 'trigger'.  At the moment there are lots of '4' resends but no problems, since the primary tasks are '4' as well.  I'm paying close attention to what we get next after the current LATeah4011L02 data file is finished, which will probably be quite soon now.
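
As a first rough cut at detecting that alignment, something like this might do.  It assumes the 'estimated CPU time remaining' figure from boinccmd is good enough, and since boinccmd doesn't say which GPU a task is on, it just flags any pair of running tasks due to finish within an arbitrary window:

```python
#!/usr/bin/env python3
# Sketch of an 'alignment' detector: flag any two running tasks whose
# estimated finish times fall within WINDOW seconds of each other.
# boinccmd doesn't report GPU assignment, so all pairs are checked.
import itertools
import re
import subprocess

WINDOW = 300  # seconds; an arbitrary "too close" threshold

def close_finishers(host, passwd):
    out = subprocess.run(
        ["boinccmd", "--host", host, "--passwd", passwd, "--get_tasks"],
        capture_output=True, text=True, check=True).stdout
    running = []
    for stanza in re.split(r"\n(?=\s*name: )", out):
        if "active_task_state: EXECUTING" not in stanza:
            continue
        name = re.search(r"name: (\S+)", stanza).group(1)
        eta = float(re.search(r"estimated CPU time remaining: ([\d.]+)",
                              stanza).group(1))
        running.append((name, eta))
    # Report every pair due to finish (and start a follow-on) together.
    return [(a, b) for (a, ea), (b, eb) in
            itertools.combinations(running, 2) if abs(ea - eb) < WINDOW]

print(close_finishers("host01", "rpc-password"))  # placeholders
```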

Since I've been running these 5 dual GPU hosts for years now and not seen the issue until LATeah4 tasks turned up relatively recently, maybe there is something with that type of task that uses extra memory at start or finish.  I always try to run tasks 'staggered' and it's the run time difference between '3' and '4' that destroys that.  I'm not particularly looking forward to the next change if it is back to '3' style tasks with lots of '4' resends :-(.

Cheers,
Gary.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3931
Credit: 46209462642
RAC: 63840746

Gary Roberts wrote:

In some cases they get reported and the host ends up in a 24hr backoff. 

It seems obvious to check the nature of these errors to pin down why they failed, rather than speculate that it's due to mixed 3&4 tasks. If you think it might be a memory issue, the stderr file would likely indicate as much. Your hosts are hidden and not open for inspection, however.

 

Gary Roberts wrote:

Wed Jul 28 08:55:05 EST 2021 local time (22:55:05 UTC) and the second one at Wed Jul 28 11:45:08 EST 2021 local time (01:45:08 UTC).

It might be prudent to be a bit more precise: your timezone is AEST. EST is the designation for the Eastern US timezone (EST = standard, EDT = daylight).

https://www.timeanddate.com/time/zones/

 

Gary Roberts wrote:

Before I had written this script and started getting hard numbers, I had the impression that surprisingly large numbers of resends turn up in work fetches around my mid-morning (11:00am local).  The above agrees with that observation and suggests that maybe resends are not put in the ready-to-send queue immediately when the failure occurs, but rather lumped into batches which get inserted at particular times in the 24 hour cycle.  Perhaps around midnight UTC is one of those times.  I've also started noticing that I don't get many resends in my late afternoon/early evening - around 06:00 to 09:00 UTC.

Or perhaps it correlates with large numbers of hosts in some parts of the world reporting their work: either folks turning their systems on around this time, or lifting BOINC communications restrictions for off-peak data usage reasons.


Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117255231941
RAC: 36209914

Ian&Steve C. wrote:
It seems obvious to check the nature of these errors to pin down why they failed ...

Sure, and I always do.  In this particular case, the failed tasks were removed and replaced with new copies, so there was no stderr.txt information on the website.  In all cases, prior to deletion, I examined what was recorded in the state file - what would have been returned, if allowed.  There was never anything specific enough to be confident of the precise cause.  The actual message was something to the effect of "Error 69 - unspecified memory error."  I didn't record the precise words, but it was the same message as on each previous occasion this had happened.

Since I had just started watching this particular host, the first thing I had noticed (before any compute errors) was that there were tasks pretty much in lockstep and just about ready to finish.  I was debating whether or not to intervene to 'space out' the tasks.  That decision was made for me when tasks started failing rapidly as I was thinking about it.

I don't know the exact cause.  For the first time, I had evidence that it was probably associated with mixtures of primary and resend tasks that were finishing/starting together.  Seeing it start to happen under those conditions is a bit stronger than mere "speculation", particularly when, without even rebooting, the host was able to resume processing without any further issue since that time.  It has a current uptime of 95 days and I'm confident it doesn't have a hardware issue that could have caused this.

Ian&Steve C. wrote:
Or perhaps it correlates with large numbers of hosts in some parts of the world reporting their work: either folks turning their systems on around this time, or lifting BOINC communications restrictions for off-peak data usage reasons.

Thanks for the suggestion - I hadn't tried to analyse if there might be some sort of user driven 'time of day' bias.  There certainly could be.

I haven't been watching for long enough to see how regular it might be.  Probably it's just a 'not enough data' artifact.  I'm not sure that post-midnight UTC would be the time when the things you mention might be happening though.

Cheers,
Gary.
