Upload trouble 12/29/18

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7022934931
RAC: 1835474
Topic 217742

All three of my hosts have started to build up upload failures.

I see a mix of transient HTTP errors and failures to communicate with the project.

For example:

21546 Einstein@Home 12/29/2018 5:16:42 PM Started upload of LATeah2005L_412.0_0_0.0_21220_1_1
21547 Einstein@Home 12/29/2018 5:17:43 PM Temporarily failed upload of LATeah2005L_412.0_0_0.0_21220_1_0: transient HTTP error
21548 Einstein@Home 12/29/2018 5:17:43 PM Backing off 00:02:04 on upload of LATeah2005L_412.0_0_0.0_21220_1_0
21549 Einstein@Home 12/29/2018 5:17:43 PM Temporarily failed upload of LATeah2005L_412.0_0_0.0_21220_1_1: transient HTTP error
21550 Einstein@Home 12/29/2018 5:17:43 PM Backing off 00:03:48 on upload of LATeah2005L_412.0_0_0.0_21220_1_1
21551 Einstein@Home 12/29/2018 5:20:51 PM Started upload of LATeah2005L_420.0_0_0.0_321483_1_0
21552 Einstein@Home 12/29/2018 5:20:51 PM Started upload of LATeah2005L_420.0_0_0.0_321483_1_1
21553 12/29/2018 5:20:52 PM Project communication failed: attempting access to reference site

Betreger
Joined: 25 Feb 05
Posts: 987
Credit: 1421386412
RAC: 813837

Yep, uploads are failing mightily 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109376856219
RAC: 35988088

Please try sending an email to eah_admin(at)einsteinathome.org.  I don't have email access right now.

The scheduler is responding normally so it must just be the upload server.  Once there are enough uploads backed up, the client will stop requesting work (I believe).  I think I remember seeing something like 8 uploads in progress as being the magic number to stop work requests.

 

Cheers,
Gary.

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7022934931
RAC: 1835474

Gary Roberts wrote:
Please try sending an email to eah_admin(at)einsteinathome.org.

Done.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109376856219
RAC: 35988088

Out of necessity, I just learned something new (for me).

A monitoring script I run regularly reported a host with a crashed GPU (in amongst all the reports of last RPC contacts being too long ago) :-).  No big deal, it happens occasionally.  So I restarted the machine and noticed just 5 tasks stuck in uploads from the current problem.  However, there were unusually few 'ready to start' tasks left in the cache (maybe 3 hours' worth), so I decided to do a quick top-up whilst the number of uploads was relatively small.  The response I got was that there were already too many stuck uploads, so no new work.

I still haven't worked out why there was so little work on board (there should have been a day's worth), but I decided to read the documentation to see whether the number of uploads that prevents new work is configurable.  I found two possible tags - <max_file_xfers> (default=8) and <max_file_xfers_per_project> (default=2).

As time was quite short, I whipped up a cc_config.xml file with those two tags added with the values of 32 and 12 respectively.  After 're-reading config files' in BOINC Manager, I was immediately able to download a bunch of work to bring the cache up to over a day.

I don't know which particular tag did the trick.  I suspect the first has to be bigger than the second but the second is probably the important one when just one project has upload problems.  I might take the opportunity to experiment with the numbers a bit more while the problem exists.
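
For anyone wanting to reproduce the file described above, here is a minimal Python sketch that generates it. It assumes both tags sit inside the <options> section of cc_config.xml, which is how the BOINC client configuration documentation lays the file out. The helper name build_cc_config is made up for illustration, and 32 and 12 are simply the values from this post.

```python
def build_cc_config(options):
    """Return cc_config.xml text with one element per entry in `options`.

    Assumption: every tag passed in belongs in the <options> section.
    """
    lines = ["<cc_config>", "  <options>"]
    for tag, value in options.items():
        lines.append(f"    <{tag}>{value}</{tag}>")
    lines += ["  </options>", "</cc_config>"]
    return "\n".join(lines) + "\n"


# Values from the post above; the documented defaults are 8 and 2.
config_text = build_cc_config({
    "max_file_xfers": 32,
    "max_file_xfers_per_project": 12,
})
print(config_text)
```

The printed text would be saved as cc_config.xml in the BOINC data directory and picked up by re-reading the config files in BOINC Manager.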

 

Cheers,
Gary.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2139
Credit: 2752617780
RAC: 1503803

When emailing Bernd, it's helpful to do a little bit more diagnosis to help him find exactly which part of the system has broken, and in what way.

I'm getting

30/12/2018 09:05:29 | Einstein@Home | [http] [ID#811] Sent header to server: POST /EinsteinAtHome/cgi-bin/file_upload_handler_medium HTTP/1.1
30/12/2018 09:06:29 | Einstein@Home | [http] [ID#811] Received header from server: HTTP/1.1 504 Gateway Time-out

'file_upload_handler_medium' and '504 Gateway Time-out' are both exactly the same symptoms as the failures on 21 November and 21 December, when the upload server lost communications with what Bernd described as the web server. He was going to try and script an automated recovery, but I guess the holidays got in the way.
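
To collect the same two facts (which handler was called and what status came back) from a longer log, something like the following Python sketch might do. It assumes the client is logging [http] debug lines in exactly the format quoted above; the regular expressions and the function name summarize are illustrative, not part of BOINC.

```python
import re

# The two debug lines quoted above, verbatim.
LOG = """\
30/12/2018 09:05:29 | Einstein@Home | [http] [ID#811] Sent header to server: POST /EinsteinAtHome/cgi-bin/file_upload_handler_medium HTTP/1.1
30/12/2018 09:06:29 | Einstein@Home | [http] [ID#811] Received header from server: HTTP/1.1 504 Gateway Time-out
"""

# Assumed line formats, inferred only from the two sample lines.
REQUEST = re.compile(r"\[ID#(\d+)\] Sent header to server: (\w+) (\S+) HTTP/")
STATUS = re.compile(r"\[ID#(\d+)\] Received header from server: HTTP/\S+ (\d{3}) (.*)")


def summarize(log_text):
    """Pair each request line with the status line sharing its transfer ID."""
    requests = {}
    results = []
    for line in log_text.splitlines():
        sent = REQUEST.search(line)
        if sent:
            requests[sent.group(1)] = (sent.group(2), sent.group(3))
            continue
        received = STATUS.search(line)
        if received:
            method, path = requests.get(received.group(1), ("?", "?"))
            results.append(
                (method, path, int(received.group(2)), received.group(3).strip())
            )
    return results


for method, path, code, reason in summarize(LOG):
    print(f"{method} {path} -> {code} {reason}")
```

The sample also shows the reply arriving exactly 60 seconds after the request, which is consistent with a timeout at the gateway rather than an immediate refusal.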

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2139
Credit: 2752617780
RAC: 1503803

Before anybody else tries Gary's workaround (I just did), the actual file upload limit is defined as:

- client: define "too many uploads" (for work fetch) as 2 * max(ncpus, ngpus);

show this in the state displayed by <work_fetch_debug>

(from https://github.com/BOINC/boinc/commit/26114920fea508d44a8a0561afd71766799b4bf4)

With 45 uploads waiting on this machine, I haven't yet tried falsifying ncpus, but I might later....

Unfortunately, changing the file transfer limit doesn't hack it.
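
The quoted rule can be sketched in Python as follows. Only the expression 2 * max(ncpus, ngpus) comes from the commit message; the function names, the >= comparison, and the 4-CPU, 1-GPU example host are assumptions for illustration.

```python
def upload_limit(ncpus, ngpus):
    """Threshold from the quoted commit message: 2 * max(ncpus, ngpus)."""
    return 2 * max(ncpus, ngpus)


def fetch_blocked(uploads_waiting, ncpus, ngpus):
    # Assumption: work fetch is refused once the backlog reaches the limit.
    # The commit message does not say whether the client uses > or >=.
    return uploads_waiting >= upload_limit(ncpus, ngpus)


def smallest_ncpus_override(uploads_waiting):
    """Smallest ncpus whose limit strictly exceeds the given backlog."""
    return uploads_waiting // 2 + 1


# Hypothetical host with 4 CPUs and 1 GPU, and the 45-upload backlog above.
print(upload_limit(4, 1))
print(fetch_blocked(45, 4, 1))
print(smallest_ncpus_override(45))
```

On that reading, a 4-CPU host has a limit of 8, which matches the figure recalled earlier in the thread, and a backlog of 45 uploads would need ncpus overridden to at least 23 before work fetch resumes.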

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2139
Credit: 2752617780
RAC: 1503803

On the other hand, setting <ncpus>24</ncpus> in cc_config.xml did work. I think this might be one where you have to do a full client restart, rather than just 'Read config files' - but report back if you find different.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2139
Credit: 2752617780
RAC: 1503803

Uploads are working again - you may need to retry one to kick-start the process.

Bernd says that this was an automated restart, put into place after the previous problems, so it's less important to report it manually if it happens again.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109376856219
RAC: 35988088

Richard Haselgrove wrote:

... the actual file upload limit is defined as

- client: define "too many uploads" (for work fetch) as 2 * max(ncpus, ngpus);

Thanks for that.  When I installed a cc_config.xml on that machine that had little work left, I grabbed an existing file I'd used to simulate 8 CPUs the last time we had issues with very fast running tasks eating up the daily quota.  I just added the extra two tags without removing the <ncpus> line.

It worked immediately with just 're-read config files' - no client restart needed.  It didn't occur to me that the success had anything to do with the <ncpus> line rather than either of the extra tags I'd added.  I was pretty desperate to try something quickly before any more upload failures occurred on that machine, so I wasn't thinking very clearly :-).

I'm not suggesting I would have made the connection, even with a lot more time to ponder it :-).

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109376856219
RAC: 35988088

Richard Haselgrove wrote:
Uploads are working again - you may need to retry one to kick-start the process.

By the time uploads started working, I was already in bed.  With the multi-hour back-offs, and with no further work to finish which might trigger a new upload attempt, hosts that were 'out' stayed out a lot longer than necessary.

Quote:
Bernd says that this was an automated restart, put into place after the previous problems, so it's less important to report it manually if it happens again.

Why does a failure like this have to continue for so long if there is an 'automated restart' mechanism in place?

 

Cheers,
Gary.
