Upload trouble 12/29/18

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2142
Credit: 2776081327
RAC: 806384

Gary Roberts wrote:
Why does a failure like this have to continue for so long if there is an 'automated restart' mechanism in place?

Probably because server time operates at a different speed to client time ;-)

Bernd said that around 11:00 UTC, after uploads had been borked for nearly 12 hours. He said it would restart "automatically in about an hour" - which it did. I appreciated that he replied at all on a holiday Sunday.

But it does suggest that if the core problem hasn't been solved (with a 9 day repeat period this time, after 30 days last time), perhaps we need to up the restart cycle beyond the (I guess) once every 24 hours. For me, every 6 hours would be nice, but I guess you need every hour, needed or not, until we find the cause.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110111752793
RAC: 24750461

Richard Haselgrove wrote:
... perhaps we need to up the restart cycle beyond the (I guess) once every 24 hours.

That was my thinking exactly.  It would seem to be quite straightforward to have a cron (or similar) job that wakes up, say, once every hour or three to check whether a particular process is active or has died.  I, too, think that the check is being done once per day although, looking closely at my logs, it's possible it could be every 12 hours.
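To illustrate the sort of check I mean, here is a minimal sketch in Python.  I have no knowledge of how the project's servers are actually set up, so the process name, the restart command and the script path below are all made-up placeholders, and the real 'automated restart' may work quite differently.

```python
import subprocess
from typing import Callable, Sequence


def is_running(process_name: str) -> bool:
    """Return True if pgrep finds at least one process with exactly this name."""
    result = subprocess.run(
        ["pgrep", "-x", process_name],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0  # pgrep exits with 0 when something matched


def check_and_restart(
    process_name: str,
    restart_cmd: Sequence[str],
    is_running_fn: Callable[[str], bool] = is_running,
    run_fn: Callable[..., object] = subprocess.run,
) -> bool:
    """Run restart_cmd if the process is not running.

    Returns True if a restart was attempted, False if the process was alive.
    """
    if is_running_fn(process_name):
        return False
    run_fn(list(restart_cmd), check=False)
    return True


if __name__ == "__main__":
    # Hypothetical names: the real daemon and restart command are unknown to me.
    restarted = check_and_restart(
        "upload_handler",
        ["systemctl", "restart", "upload_handler"],
    )
    print("restart attempted" if restarted else "process alive, nothing to do")
```

An hourly schedule would then be a single crontab line along the lines of `0 * * * * /usr/bin/python3 /path/to/watchdog.py` (the path is again a placeholder).  When the process is alive, each run costs one `pgrep` call and nothing more.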

My first entry that complained about "too long since last RPC" was at 10:33 am local time (UTC+10).  At that time the RPC interval for the complaining host was 6218 secs.  As these all have GPUs, they do tend to report around the hour mark when they have several completed tasks.  It's unusual to see intervals longer than about 4500 secs.  My guess is that uploads stopped around midnight UTC (10:00AM local).  If the uploads were still working at midnight UTC and didn't fail until 00:01 UTC, it's possible that checks being done on a UTC based 12 hour cycle did miss the problem at midnight, but then succeeded at the following midday.
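The arithmetic behind that guess can be sketched as follows.  This assumes (and it is only my assumption) that the checks run on a fixed cycle aligned to midnight UTC, and that a failure is only noticed at the first check at or after it happens.  The date is arbitrary, since only the time of day matters.

```python
from datetime import datetime, timedelta, timezone


def next_check_after(failure: datetime, interval_hours: int) -> datetime:
    """First check at or after `failure`, for checks aligned to midnight UTC."""
    midnight = failure.replace(hour=0, minute=0, second=0, microsecond=0)
    interval = timedelta(hours=interval_hours)
    elapsed = failure - midnight
    periods = -(-elapsed // interval)  # ceiling division on timedeltas
    return midnight + periods * interval


# Assumed scenario: uploads fail one minute after a check at midnight UTC.
failure = datetime(2018, 12, 30, 0, 1, tzinfo=timezone.utc)

for hours in (12, 6, 1):
    outage = next_check_after(failure, hours) - failure
    print(f"check every {hours:>2} h -> outage lasts {outage}")
# check every 12 h -> outage lasts 11:59:00
# check every  6 h -> outage lasts 5:59:00
# check every  1 h -> outage lasts 0:59:00
```

On those assumptions, a 12 hour cycle fits an outage of nearly 12 hours, and an hourly check would cap the worst case at just under an hour.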

I also eventually worked out why quite a few of my hosts had unusually low work caches.  Work fetch is done under script control so as to make sure all hosts have copies (from a data file cache) of all the recent data files before a work fetch happens.  This is done to avoid unnecessary downloads.  I usually run this script about 4 times a day to keep the work on hand always about the 1 day mark.  I usually run the first loop for the day pretty early because the work cache gets rather depleted from the previous overnight activity.

Yesterday, I was late with the first run, so by the time the hosts towards the end of the series were being attended to, the uploads problem (unknown at that time) had kicked in and a number of hosts in the latter half of the series were prevented from topping up at all.

Quote:
For me, every 6 hours would be nice, but I guess you need every hour, needed or not, until we find the cause.

I can't imagine this check would use any significant resources, so why not every hour anyway?  That would be preferable to having frustrated volunteers clicking 'retry' or 'update' or whatever else they could find to click.

And whilst I'm in the 'frustrated volunteer' mood, wouldn't it have been very nice to have had a very short technical news item to say that the 'auto-restart' of any future failed uploads was now implemented so no need to attempt to report this problem should it occur again?  And, of course, it would have been even more thoughtful to mention the time schedule on which such checks would be made.  This would have saved a heck of a lot of volunteer angst, whilst wondering about whether or not the problem would get noticed and/or dealt with in a timely manner.

 

Cheers,
Gary.
