Please use the debug tools to tell the staff exactly what they need to kick
Richard, I turned on some of the event log debug options. Then, with 15 tasks pending waiting out their penalty time in the transfer queue, I selected one and specified a retry. Superficially, this initially seemed to work, including a prompt "Info: We are completely uploaded and fine" message. However the transfer tab did not show that upload as finishing, and 60 seconds after the initial request the message log show a gateway timeout.
I normally would not post so long a log fragment, but I truly don't know what might be useful here, so here is the complete log for two minutes, starting with my retry request, and ending with the seven-minute backoff.
453 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#28] Sent header to server: Host: einstein4.aei.uni-hannover.de
454 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#28] Sent header to server: User-Agent: BOINC client (windows_x86_64 7.14.0)
455 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#28] Sent header to server: Accept: */*
456 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#28] Sent header to server: Accept-Encoding: deflate, gzip
457 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#28] Sent header to server: Content-Type: application/x-www-form-urlencoded
458 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#28] Sent header to server: Accept-Language: en_US
459 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#28] Sent header to server: Content-Length: 830
460 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#28] Sent header to server:
461 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#28] Info: We are completely uploaded and fine
462 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#27] Info: Connected to einstein4.aei.uni-hannover.de (130.75.116.34) port 80 (#13)
463 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#27] Sent header to server: POST /EinsteinAtHome/cgi-bin/file_upload_handler_medium HTTP/1.1
464 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#27] Sent header to server: Host: einstein4.aei.uni-hannover.de
465 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#27] Sent header to server: User-Agent: BOINC client (windows_x86_64 7.14.0)
466 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#27] Sent header to server: Accept: */*
467 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#27] Sent header to server: Accept-Encoding: deflate, gzip
468 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#27] Sent header to server: Content-Type: application/x-www-form-urlencoded
469 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#27] Sent header to server: Accept-Language: en_US
470 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#27] Sent header to server: Content-Length: 919
471 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#27] Sent header to server:
472 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#27] Info: We are completely uploaded and fine
473 1/17/2019 5:34:16 PM NOTICES::write: seqno 5, refresh false, 4 notices
474 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: HTTP/1.1 504 Gateway Time-out
475 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: Server: nginx
476 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: Date: Fri, 18 Jan 2019 00:35:15 GMT
477 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: Content-Type: text/html
478 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: Content-Length: 433
479 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: Connection: keep-alive
480 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: ETag: "593ad040-1b1"
481 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server:
482 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: <!DOCTYPE html>
483 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: <html>
484 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: <head>
485 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: <meta charset="UTF-8">
486 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: <title>Technical Problems</title>
487 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: <style>
488 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: body {
489 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: width: 35em;
490 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: margin: 0 auto;
491 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: font-family: Tahoma, Verdana, Arial, sans-serif;
492 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: }
493 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: </style>
494 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: </head>
495 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: <body>
496 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: <h1>This server is experiencing technical problems</h1>
497 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: <p>If you see this page, something is broken on the server. Administrators are informed and will fix the problem soon.</p>
498 Einstein@Home 1/17/2019 5:35:13 PM
499 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: </html>
500 1/17/2019 5:35:13 PM [http_xfer] [ID#28] HTTP: wrote 433 bytes
501 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Info: Connection #14 to host einstein4.aei.uni-hannover.de left intact
502 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#27] Received header from server: HTTP/1.1 504 Gateway Time-out
503 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#27] Received header from server: Server: nginx
504 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#27] Received header from server: Date: Fri, 18 Jan 2019 00:35:15 GMT
505 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#27] Received header from server: Content-Type: text/html
506 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#27] Received header from server: Content-Length: 433
507 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#27] Received header from server: Connection: keep-alive
508 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#27] Received header from server: ETag: "593ad040-1b1"
509 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#27] Received header from server:
510 1/17/2019 5:35:13 PM [http_xfer] [ID#27] HTTP: wrote 433 bytes
511 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#27] Info: Connection #13 to host einstein4.aei.uni-hannover.de left intact
512 Einstein@Home 1/17/2019 5:35:14 PM [file_xfer] http op done; retval -184 (transient HTTP error)
513 Einstein@Home 1/17/2019 5:35:14 PM [file_xfer] http op done; retval -184 (transient HTTP error)
514 Einstein@Home 1/17/2019 5:35:14 PM [file_xfer] file transfer status -184 (transient HTTP error)
515 Einstein@Home 1/17/2019 5:35:14 PM Temporarily failed upload of LATeah2008L_1044.0_0_0.0_1712454_0_0: transient HTTP error
516 Einstein@Home 1/17/2019 5:35:14 PM Backing off 00:21:05 on upload of LATeah2008L_1044.0_0_0.0_1712454_0_0
517 Einstein@Home 1/17/2019 5:35:14 PM [file_xfer] file transfer status -184 (transient HTTP error)
518 Einstein@Home 1/17/2019 5:35:14 PM Temporarily failed upload of LATeah2008L_1036.0_0_0.0_7187214_1_0: transient HTTP error
519 Einstein@Home 1/17/2019 5:35:14 PM Backing off 00:33:49 on upload of LATeah2008L_1036.0_0_0.0_7187214_1_0
520 Einstein@Home 1/17/2019 5:35:14 PM [fxd] starting upload, upload_offset 0
521 Einstein@Home 1/17/2019 5:35:14 PM [http] HTTP_OP::libcurl_exec(): ca-bundle 'C:\Program Files\BOINC\ca-bundle.crt'
522 Einstein@Home 1/17/2019 5:35:14 PM [http] HTTP_OP::libcurl_exec(): ca-bundle set
523 Einstein@Home 1/17/2019 5:35:14 PM Started upload of LATeah2008L_1044.0_0_0.0_501853_1_0
524 Einstein@Home 1/17/2019 5:35:14 PM [file_xfer] URL: http://einstein4.aei.uni-hannover.de/EinsteinAtHome/cgi-bin/file_upload_handler_medium
525 Einstein@Home 1/17/2019 5:35:14 PM [fxd] starting upload, upload_offset 0
526 Einstein@Home 1/17/2019 5:35:14 PM [http] HTTP_OP::libcurl_exec(): ca-bundle 'C:\Program Files\BOINC\ca-bundle.crt'
527 Einstein@Home 1/17/2019 5:35:14 PM [http] HTTP_OP::libcurl_exec(): ca-bundle set
528 Einstein@Home 1/17/2019 5:35:14 PM Started upload of LATeah2008L_1044.0_0_0.0_501853_1_1
529 Einstein@Home 1/17/2019 5:35:14 PM [file_xfer] URL: http://einstein4.aei.uni-hannover.de/EinsteinAtHome/cgi-bin/file_upload_handler_medium
530 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#29] Info: Found bundle for host einstein4.aei.uni-hannover.de: 0x5209160 [can pipeline]
531 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#29] Info: Re-using existing connection! (#13) with host einstein4.aei.uni-hannover.de
532 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#29] Info: Connected to einstein4.aei.uni-hannover.de (130.75.116.34) port 80 (#13)
533 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#29] Sent header to server: POST /EinsteinAtHome/cgi-bin/file_upload_handler_medium HTTP/1.1
534 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#29] Sent header to server: Host: einstein4.aei.uni-hannover.de
535 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#29] Sent header to server: User-Agent: BOINC client (windows_x86_64 7.14.0)
536 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#29] Sent header to server: Accept: */*
537 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#29] Sent header to server: Accept-Encoding: deflate, gzip
538 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#29] Sent header to server: Content-Type: application/x-www-form-urlencoded
539 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#29] Sent header to server: Accept-Language: en_US
540 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#29] Sent header to server: Content-Length: 906
541 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#29] Sent header to server:
542 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#30] Info: Found bundle for host einstein4.aei.uni-hannover.de: 0x5209160 [can pipeline]
543 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#30] Info: Re-using existing connection! (#14) with host einstein4.aei.uni-hannover.de
544 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#30] Info: Connected to einstein4.aei.uni-hannover.de (130.75.116.34) port 80 (#14)
545 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#30] Sent header to server: POST /EinsteinAtHome/cgi-bin/file_upload_handler_medium HTTP/1.1
546 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#30] Sent header to server: Host: einstein4.aei.uni-hannover.de
547 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#30] Sent header to server: User-Agent: BOINC client (windows_x86_64 7.14.0)
548 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#30] Sent header to server: Accept: */*
549 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#30] Sent header to server: Accept-Encoding: deflate, gzip
550 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#30] Sent header to server: Content-Type: application/x-www-form-urlencoded
551 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#30] Sent header to server: Accept-Language: en_US
552 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#30] Sent header to server: Content-Length: 892
553 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#30] Sent header to server:
554 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#29] Info: We are completely uploaded and fine
555 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#30] Info: We are completely uploaded and fine
556 1/17/2019 5:35:16 PM NOTICES::write: seqno 5, refresh false, 4 notices
557 Einstein@Home 1/17/2019 5:35:18 PM Computation for task LATeah2008L_1044.0_0_0.0_2950641_1 finished
558 Einstein@Home 1/17/2019 5:35:18 PM Starting task LATeah2008L_1044.0_0_0.0_2949580_0
559 Einstein@Home 1/17/2019 5:36:14 PM [http] [ID#30] Received header from server: HTTP/1.1 504 Gateway Time-out
560 Einstein@Home 1/17/2019 5:36:14 PM [http] [ID#30] Received header from server: Server: nginx
561 Einstein@Home 1/17/2019 5:36:14 PM [http] [ID#30] Received header from server: Date: Fri, 18 Jan 2019 00:36:16 GMT
562 Einstein@Home 1/17/2019 5:36:14 PM [http] [ID#30] Received header from server: Content-Type: text/html
563 Einstein@Home 1/17/2019 5:36:14 PM [http] [ID#30] Received header from server: Content-Length: 433
564 Einstein@Home 1/17/2019 5:36:14 PM [http] [ID#30] Received header from server: Connection: keep-alive
565 Einstein@Home 1/17/2019 5:36:14 PM [http] [ID#30] Received header from server: ETag: "593ad040-1b1"
566 Einstein@Home 1/17/2019 5:36:14 PM [http] [ID#30] Received header from server:
567 1/17/2019 5:36:14 PM [http_xfer] [ID#30] HTTP: wrote 433 bytes
568 Einstein@Home 1/17/2019 5:36:14 PM [http] [ID#30] Info: Connection #14 to host einstein4.aei.uni-hannover.de left intact
569 Einstein@Home 1/17/2019 5:36:14 PM [file_xfer] http op done; retval -184 (transient HTTP error)
570 Einstein@Home 1/17/2019 5:36:14 PM [file_xfer] file transfer status -184 (transient HTTP error)
571 Einstein@Home 1/17/2019 5:36:14 PM Temporarily failed upload of LATeah2008L_1044.0_0_0.0_501853_1_1: transient HTTP error
572 Einstein@Home 1/17/2019 5:36:14 PM [file_xfer] project-wide xfer delay for 677.012848 sec
573 Einstein@Home 1/17/2019 5:36:14 PM Backing off 00:07:54 on upload of LATeah2008L_1044.0_0_0.0_501853_1_1
574 Einstein@Home 1/17/2019 5:36:14 PM [fxd] starting upload, upload_offset 0
575 Einstein@Home 1/17/2019 5:36:14 PM [http] HTTP_OP::libcurl_exec(): ca-bundle 'C:\Program Files\BOINC\ca-bundle.crt'
576 Einstein@Home 1/17/2019 5:36:14 PM [http] HTTP_OP::libcurl_exec(): ca-bundle set
....
"Upload handler medium" can't talk to the next server in the chain. We've had several of those recently, and Bernd put an automatic restart script into place. But I don't know how long we'll have to wait until it next runs.
Maybe we need to have a new 'restart the restart' script to give the restart script a kick in the pants for being overly obtuse about the matter :-).
Here is when one of my scripts first detected the problem (the 2nd item in the logs which happened on loop 8) :-
Loop Item Time Hostname Octet Uptime KDE RPC Status TPI Status
==== ==== ======== ======== ===== ====== === === ====== === ======================
1. 1. 20:17:14 q6600-06 ( .50) 26.1d v5 1181s OK 37 Info: Low ticks but then OK 1st=1 2nd=37.
8. 2. 03:15:04 styx ( .9) 13.5d v5 6256s HIGH 10 Ticks OK
So at 3:15AM local time (5:15PM UTC yesterday) a host reported that it had been 6256s since it had last been able to make successful contact with the project. This was immediately followed by every other host in the fleet reporting much the same claims - so much so that by the time I arrived on site at 6:00AM, there were many pages of such messages in the logs. Hosts usually make contact about every hour or so. I'm guessing the uploads probably failed about 30mins earlier, probably before 5PM UTC.
It is now around 9 hours since the problem started and uploads are still stuck. My hosts have far too many stuck uploads to be able to request new work so I will run out shortly. When the problem is eventually cleared, I'll be left with multi-hour backoffs so I'll need to run another script to override that. Not a big deal for me but just a waste for many volunteers who use a set and forget strategy. When the uploads clear and caches can be replenished, that will need to be carefully managed (particularly for anyone with a similar 'unusual' situation like mine) because of the excessive load it will place on a shared internet connection. Once again, I have scripts to deal with this - and unfortunately, plenty of recent experience on how best to manage this - so I'm not really worried for myself.
The thing that is really starting to annoy me is why can't this be solved more efficiently? If it's just a matter of killing and restarting some stuck process, surely it doesn't need to take this long? And I don't think it's fair to expect a volunteer to have to know about and set the appropriate debug options in order to produce an extensive log of low level (and perhaps indecipherable to many) stuff to 'assist' in the diagnosis of this problem. Particularly when such a log won't make any difference to a supposedly "automatic restart script" that is handling the problem anyway!! It might be a different matter if such a report could be acted upon immediately and thereby truncate the overly long hiatus between problem commencement and ultimate rectification.
The thing that is really starting to annoy me is why can't this be solved more efficiently? If it's just a matter of killing and restarting some stuck process, surely it doesn't need to take this long?
It's been about 11 hrs now and no one in control seems to be doing anything. Bernd's script hasn't solved the problem! There needs to be a more rigorous solution...
With uploading now able to complete successfully, you may wish to check hosts to see if any are still in an extended backoff. Whilst this should clear as new work is completed, you may wish to force it by using the 'retry now' function on the transfers tab of BOINC Manager.
Looking back through my logs, the first sign of trouble was 17-Jan-2019 15:47:35 - at this time of year, my logs are in UTC, which makes things easier.
When things are running smoothly, I complete a task every 10 minutes or so - and new-ish versions of BOINC try every new upload once even when there are backoffs, so I'd have been back in business immediately if the server had come back online while I still had work cached.
But instead, three machine logs show
18/01/2019 01:20:54 | Einstein@Home | Temporarily failed upload of p2030.20170410.G39.44-01.25.N.b4s0g0.00000_291_0_0: transient HTTP error
18/01/2019 05:11:32 | Einstein@Home | Started upload of p2030.20170410.G39.44-01.25.N.b4s0g0.00000_291_0_0
17/01/2019 23:12:09 | Einstein@Home | Temporarily failed upload of p2030.20170410.G39.44-01.25.N.b4s0g0.00000_2683_0_0: transient HTTP error
18/01/2019 05:07:25 | Einstein@Home | Started upload of p2030.20170410.G39.44-01.25.N.b3s0g0.00000_2520_1_0
18/01/2019 02:05:31 | Einstein@Home | Temporarily failed upload of p2030.20170410.G39.44-01.25.N.b4s0g0.00000_1307_0_0: transient HTTP error
18/01/2019 05:40:58 | Einstein@Home | Started upload of p2030.20170410.G39.44-01.25.N.b4s0g0.00000_1298_1_0
- so I can't tell when it came back, except to say between 02:05 and 05:07. But that's a long time for the outage to last, when it's a known problem with an automatic scripted recovery in place. I think I'm going to suggest to Bernd that he sets the script to run more often.
2/2/2019 11:17:27 AM | Einstein@Home | Scheduler request completed
2/2/2019 11:18:09 AM | | Project communication failed: attempting access to reference site
2/2/2019 11:18:11 AM | | Internet access OK - project servers may be temporarily down.
Two questions. 1) Are your
)
Two questions.
1) Are your uploads going through?
2) Did you look to see if the 'timeout' message appeared 60 seconds later?
Richard Haselgrove wrote:Two
)
I guess I left out the good part. Yes, it does timeout. But no, they are not going through.
I have about a hundred stuck in upload.
241 Einstein@Home 1/17/2019 6:36:35 PM [http] [ID#49] Received header from server: HTTP/1.1 504 Gateway Time-out
242 Einstein@Home 1/17/2019 6:36:35 PM [http] [ID#49] Received header from server: Server: nginx
243 Einstein@Home 1/17/2019 6:36:35 PM [http] [ID#49] Received header from server: Date: Thu, 17 Jan 2019 23:37:08 GMT
244 Einstein@Home 1/17/2019 6:36:35 PM [http] [ID#49] Received header from server: Content-Type: text/html
245 Einstein@Home 1/17/2019 6:36:35 PM [http] [ID#49] Received header from server: Content-Length: 433
246 Einstein@Home 1/17/2019 6:36:35 PM [http] [ID#49] Received header from server: Connection: keep-alive
247 Einstein@Home 1/17/2019 6:36:35 PM [http] [ID#49] Received header from server: ETag: "593ad040-1b1"
248 Einstein@Home 1/17/2019 6:36:35 PM [http] [ID#49] Received header from server:
249 Einstein@Home 1/17/2019 6:36:35 PM [http] [ID#48] Received header from server: HTTP/1.1 504 Gateway Time-out
250 Einstein@Home 1/17/2019 6:36:35 PM [http] [ID#48] Received header from server: Server: nginx
251 Einstein@Home 1/17/2019 6:36:35 PM [http] [ID#48] Received header from server: Date: Thu, 17 Jan 2019 23:37:08 GMT
252 Einstein@Home 1/17/2019 6:36:35 PM [http] [ID#48] Received header from server: Content-Type: text/html
253 Einstein@Home 1/17/2019 6:36:35 PM [http] [ID#48] Received header from server: Content-Length: 433
254 Einstein@Home 1/17/2019 6:36:35 PM [http] [ID#48] Received header from server: Connection: keep-alive
255 Einstein@Home 1/17/2019 6:36:35 PM [http] [ID#48] Received header from server: ETag: "593ad040-1b1"
256 Einstein@Home 1/17/2019 6:36:35 PM [http] [ID#48] Received header from server:
257 Einstein@Home 1/17/2019 6:36:35 PM [http] [ID#49] Info: Connection #38 to host einstein4.aei.uni-hannover.de left intact
258 Einstein@Home 1/17/2019 6:36:35 PM [http] [ID#48] Info: Connection #37 to host einstein4.aei.uni-hannover.de left intact
259 Einstein@Home 1/17/2019 6:36:36 PM Temporarily failed upload of LATeah2009L_1052.0_0_0.0_2414836_0_0: transient HTTP error
260 Einstein@Home 1/17/2019 6:36:36 PM Backing off 00:03:07 on upload of LATeah2009L_1052.0_0_0.0_2414836_0_0
261 Einstein@Home 1/17/2019 6:36:36 PM Temporarily failed upload of LATeah2009L_1052.0_0_0.0_2414836_0_1: transient HTTP error
262 Einstein@Home 1/17/2019 6:36:36 PM Backing off 00:02:57 on upload of LATeah2009L_1052.0_0_0.0_2414836_0_1
263 Einstein@Home 1/17/2019 6:46:43 PM Computation for task LATeah2009L_1052.0_0_0.0_2357542_0 finished
Richard Haselgrove
)
Richard, I turned on some of the event log debug options. Then, with 15 tasks pending waiting out their penalty time in the transfer queue, I selected one and specified a retry. Superficially, this initially seemed to work, including a prompt "Info: We are completely uploaded and fine" message. However the transfer tab did not show that upload as finishing, and 60 seconds after the initial request the message log show a gateway timeout.
I normally would not post so long a log fragment, but I truly don't know what might be useful here, so here is the complete log for two minutes, starting with my retry request, and ending with the seven-minute backoff.
Stoll9
453 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#28] Sent header to server: Host: einstein4.aei.uni-hannover.de
454 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#28] Sent header to server: User-Agent: BOINC client (windows_x86_64 7.14.0)
455 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#28] Sent header to server: Accept: */*
456 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#28] Sent header to server: Accept-Encoding: deflate, gzip
457 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#28] Sent header to server: Content-Type: application/x-www-form-urlencoded
458 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#28] Sent header to server: Accept-Language: en_US
459 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#28] Sent header to server: Content-Length: 830
460 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#28] Sent header to server:
461 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#28] Info: We are completely uploaded and fine
462 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#27] Info: Connected to einstein4.aei.uni-hannover.de (130.75.116.34) port 80 (#13)
463 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#27] Sent header to server: POST /EinsteinAtHome/cgi-bin/file_upload_handler_medium HTTP/1.1
464 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#27] Sent header to server: Host: einstein4.aei.uni-hannover.de
465 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#27] Sent header to server: User-Agent: BOINC client (windows_x86_64 7.14.0)
466 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#27] Sent header to server: Accept: */*
467 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#27] Sent header to server: Accept-Encoding: deflate, gzip
468 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#27] Sent header to server: Content-Type: application/x-www-form-urlencoded
469 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#27] Sent header to server: Accept-Language: en_US
470 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#27] Sent header to server: Content-Length: 919
471 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#27] Sent header to server:
472 Einstein@Home 1/17/2019 5:34:13 PM [http] [ID#27] Info: We are completely uploaded and fine
473 1/17/2019 5:34:16 PM NOTICES::write: seqno 5, refresh false, 4 notices
474 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: HTTP/1.1 504 Gateway Time-out
475 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: Server: nginx
476 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: Date: Fri, 18 Jan 2019 00:35:15 GMT
477 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: Content-Type: text/html
478 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: Content-Length: 433
479 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: Connection: keep-alive
480 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: ETag: "593ad040-1b1"
481 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server:
482 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: <!DOCTYPE html>
483 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: <html>
484 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: <head>
485 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: <meta charset="UTF-8">
486 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: <title>Technical Problems</title>
487 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: <style>
488 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: body {
489 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: width: 35em;
490 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: margin: 0 auto;
491 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: font-family: Tahoma, Verdana, Arial, sans-serif;
492 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: }
493 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: </style>
494 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: </head>
495 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: <body>
496 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: <h1>This server is experiencing technical problems</h1>
497 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: <p>If you see this page, something is broken on the server. Administrators are informed and will fix the problem soon.</p>
498 Einstein@Home 1/17/2019 5:35:13 PM
499 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Received header from server: </html>
500 1/17/2019 5:35:13 PM [http_xfer] [ID#28] HTTP: wrote 433 bytes
501 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#28] Info: Connection #14 to host einstein4.aei.uni-hannover.de left intact
502 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#27] Received header from server: HTTP/1.1 504 Gateway Time-out
503 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#27] Received header from server: Server: nginx
504 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#27] Received header from server: Date: Fri, 18 Jan 2019 00:35:15 GMT
505 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#27] Received header from server: Content-Type: text/html
506 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#27] Received header from server: Content-Length: 433
507 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#27] Received header from server: Connection: keep-alive
508 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#27] Received header from server: ETag: "593ad040-1b1"
509 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#27] Received header from server:
510 1/17/2019 5:35:13 PM [http_xfer] [ID#27] HTTP: wrote 433 bytes
511 Einstein@Home 1/17/2019 5:35:13 PM [http] [ID#27] Info: Connection #13 to host einstein4.aei.uni-hannover.de left intact
512 Einstein@Home 1/17/2019 5:35:14 PM [file_xfer] http op done; retval -184 (transient HTTP error)
513 Einstein@Home 1/17/2019 5:35:14 PM [file_xfer] http op done; retval -184 (transient HTTP error)
514 Einstein@Home 1/17/2019 5:35:14 PM [file_xfer] file transfer status -184 (transient HTTP error)
515 Einstein@Home 1/17/2019 5:35:14 PM Temporarily failed upload of LATeah2008L_1044.0_0_0.0_1712454_0_0: transient HTTP error
516 Einstein@Home 1/17/2019 5:35:14 PM Backing off 00:21:05 on upload of LATeah2008L_1044.0_0_0.0_1712454_0_0
517 Einstein@Home 1/17/2019 5:35:14 PM [file_xfer] file transfer status -184 (transient HTTP error)
518 Einstein@Home 1/17/2019 5:35:14 PM Temporarily failed upload of LATeah2008L_1036.0_0_0.0_7187214_1_0: transient HTTP error
519 Einstein@Home 1/17/2019 5:35:14 PM Backing off 00:33:49 on upload of LATeah2008L_1036.0_0_0.0_7187214_1_0
520 Einstein@Home 1/17/2019 5:35:14 PM [fxd] starting upload, upload_offset 0
521 Einstein@Home 1/17/2019 5:35:14 PM [http] HTTP_OP::libcurl_exec(): ca-bundle 'C:\Program Files\BOINC\ca-bundle.crt'
522 Einstein@Home 1/17/2019 5:35:14 PM [http] HTTP_OP::libcurl_exec(): ca-bundle set
523 Einstein@Home 1/17/2019 5:35:14 PM Started upload of LATeah2008L_1044.0_0_0.0_501853_1_0
524 Einstein@Home 1/17/2019 5:35:14 PM [file_xfer] URL: http://einstein4.aei.uni-hannover.de/EinsteinAtHome/cgi-bin/file_upload_handler_medium
525 Einstein@Home 1/17/2019 5:35:14 PM [fxd] starting upload, upload_offset 0
526 Einstein@Home 1/17/2019 5:35:14 PM [http] HTTP_OP::libcurl_exec(): ca-bundle 'C:\Program Files\BOINC\ca-bundle.crt'
527 Einstein@Home 1/17/2019 5:35:14 PM [http] HTTP_OP::libcurl_exec(): ca-bundle set
528 Einstein@Home 1/17/2019 5:35:14 PM Started upload of LATeah2008L_1044.0_0_0.0_501853_1_1
529 Einstein@Home 1/17/2019 5:35:14 PM [file_xfer] URL: http://einstein4.aei.uni-hannover.de/EinsteinAtHome/cgi-bin/file_upload_handler_medium
530 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#29] Info: Found bundle for host einstein4.aei.uni-hannover.de: 0x5209160 [can pipeline]
531 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#29] Info: Re-using existing connection! (#13) with host einstein4.aei.uni-hannover.de
532 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#29] Info: Connected to einstein4.aei.uni-hannover.de (130.75.116.34) port 80 (#13)
533 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#29] Sent header to server: POST /EinsteinAtHome/cgi-bin/file_upload_handler_medium HTTP/1.1
534 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#29] Sent header to server: Host: einstein4.aei.uni-hannover.de
535 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#29] Sent header to server: User-Agent: BOINC client (windows_x86_64 7.14.0)
536 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#29] Sent header to server: Accept: */*
537 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#29] Sent header to server: Accept-Encoding: deflate, gzip
538 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#29] Sent header to server: Content-Type: application/x-www-form-urlencoded
539 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#29] Sent header to server: Accept-Language: en_US
540 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#29] Sent header to server: Content-Length: 906
541 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#29] Sent header to server:
542 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#30] Info: Found bundle for host einstein4.aei.uni-hannover.de: 0x5209160 [can pipeline]
543 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#30] Info: Re-using existing connection! (#14) with host einstein4.aei.uni-hannover.de
544 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#30] Info: Connected to einstein4.aei.uni-hannover.de (130.75.116.34) port 80 (#14)
545 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#30] Sent header to server: POST /EinsteinAtHome/cgi-bin/file_upload_handler_medium HTTP/1.1
546 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#30] Sent header to server: Host: einstein4.aei.uni-hannover.de
547 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#30] Sent header to server: User-Agent: BOINC client (windows_x86_64 7.14.0)
548 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#30] Sent header to server: Accept: */*
549 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#30] Sent header to server: Accept-Encoding: deflate, gzip
550 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#30] Sent header to server: Content-Type: application/x-www-form-urlencoded
551 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#30] Sent header to server: Accept-Language: en_US
552 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#30] Sent header to server: Content-Length: 892
553 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#30] Sent header to server:
554 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#29] Info: We are completely uploaded and fine
555 Einstein@Home 1/17/2019 5:35:14 PM [http] [ID#30] Info: We are completely uploaded and fine
556 1/17/2019 5:35:16 PM NOTICES::write: seqno 5, refresh false, 4 notices
557 Einstein@Home 1/17/2019 5:35:18 PM Computation for task LATeah2008L_1044.0_0_0.0_2950641_1 finished
558 Einstein@Home 1/17/2019 5:35:18 PM Starting task LATeah2008L_1044.0_0_0.0_2949580_0
559 Einstein@Home 1/17/2019 5:36:14 PM [http] [ID#30] Received header from server: HTTP/1.1 504 Gateway Time-out
560 Einstein@Home 1/17/2019 5:36:14 PM [http] [ID#30] Received header from server: Server: nginx
561 Einstein@Home 1/17/2019 5:36:14 PM [http] [ID#30] Received header from server: Date: Fri, 18 Jan 2019 00:36:16 GMT
562 Einstein@Home 1/17/2019 5:36:14 PM [http] [ID#30] Received header from server: Content-Type: text/html
563 Einstein@Home 1/17/2019 5:36:14 PM [http] [ID#30] Received header from server: Content-Length: 433
564 Einstein@Home 1/17/2019 5:36:14 PM [http] [ID#30] Received header from server: Connection: keep-alive
565 Einstein@Home 1/17/2019 5:36:14 PM [http] [ID#30] Received header from server: ETag: "593ad040-1b1"
566 Einstein@Home 1/17/2019 5:36:14 PM [http] [ID#30] Received header from server:
567 1/17/2019 5:36:14 PM [http_xfer] [ID#30] HTTP: wrote 433 bytes
568 Einstein@Home 1/17/2019 5:36:14 PM [http] [ID#30] Info: Connection #14 to host einstein4.aei.uni-hannover.de left intact
569 Einstein@Home 1/17/2019 5:36:14 PM [file_xfer] http op done; retval -184 (transient HTTP error)
570 Einstein@Home 1/17/2019 5:36:14 PM [file_xfer] file transfer status -184 (transient HTTP error)
571 Einstein@Home 1/17/2019 5:36:14 PM Temporarily failed upload of LATeah2008L_1044.0_0_0.0_501853_1_1: transient HTTP error
572 Einstein@Home 1/17/2019 5:36:14 PM [file_xfer] project-wide xfer delay for 677.012848 sec
573 Einstein@Home 1/17/2019 5:36:14 PM Backing off 00:07:54 on upload of LATeah2008L_1044.0_0_0.0_501853_1_1
574 Einstein@Home 1/17/2019 5:36:14 PM [fxd] starting upload, upload_offset 0
575 Einstein@Home 1/17/2019 5:36:14 PM [http] HTTP_OP::libcurl_exec(): ca-bundle 'C:\Program Files\BOINC\ca-bundle.crt'
576 Einstein@Home 1/17/2019 5:36:14 PM [http] HTTP_OP::libcurl_exec(): ca-bundle set
Richard Haselgrove
)
Maybe we need to have a new 'restart the restart' script to give the restart script a kick in the pants for being overly obtuse about the matter :-).
Here is when one of my scripts first detected the problem (the 2nd item in the logs which happened on loop 8) :-
Loop Item Time Hostname Octet Uptime KDE RPC Status TPI Status ==== ==== ======== ======== ===== ====== === === ====== === ====================== 1. 1. 20:17:14 q6600-06 ( .50) 26.1d v5 1181s OK 37 Info: Low ticks but then OK 1st=1 2nd=37. 8. 2. 03:15:04 styx ( .9) 13.5d v5 6256s HIGH 10 Ticks OK
So at 3:15AM local time (5:15PM UTC yesterday) a host reported that it had been 6256s since it had last been able to make successful contact with the project. This was immediately followed by every other host in the fleet reporting much the same claims - so much so that by the time I arrived on site at 6:00AM, there were many pages of such messages in the logs. Hosts usually make contact about every hour or so. I'm guessing the uploads probably failed about 30mins earlier, probably before 5PM UTC.
It is now around 9 hours since the problem started and uploads are still stuck. My hosts have far too many stuck uploads to be able to request new work so I will run out shortly. When the problem is eventually cleared, I'll be left with multi-hour backoffs so I'll need to run another script to override that. Not a big deal for me but just a waste for many volunteers who use a set and forget strategy. When the uploads clear and caches can be replenished, that will need to be carefully managed (particularly for anyone with a similar 'unusual' situation like mine) because of the excessive load it will place on a shared internet connection. Once again, I have scripts to deal with this - and unfortunately, plenty of recent experience on how best to manage this - so I'm not really worried for myself.
The thing that is really starting to annoy me is why can't this be solved more efficiently? If it's just a matter of killing and restarting some stuck process, surely it doesn't need to take this long? And I don't think it's fair to expect a volunteer to have to know about and set the appropriate debug options in order to produce an extensive log of low level (and perhaps indecipherable to many) stuff to 'assist' in the diagnosis of this problem. Particularly when such a log won't make any difference to a supposedly "automatic restart script" that is handling the problem anyway!! It might be a different matter if such a report could be acted upon immediately and thereby truncate the overly long hiatus between problem commencement and ultimate rectification.
It always feels better to have a good whinge :-).
Cheers,
Gary.
Quote: The thing that is
)
It's been about 11 hrs now and no one in control seems to be doing anything. Bernd's script hasn't solved the problem! There needs to be a more rigorous solution...
Mine just got uploaded and
)
Mine just got uploaded and reported, so everything seems fine now,
THX for the fast reaction.
Grüße vom Sänger
With uploading now able to
)
With uploading now able to complete successfully, you may wish to check hosts to see if any are still in an extended backoff. Whilst this should clear as new work is completed, you may wish to force it by using the 'retry now' function on the transfers tab of BOINC Manager.
Cheers,
Gary.
Good to see that we're back
)
Good to see that we're back up and running.
Looking back through my logs, the first sign of trouble was 17-Jan-2019 15:47:35 - at this time of year, my logs are in UTC, which makes things easier.
When things are running smoothly, I complete a task every 10 minutes or so - and new-ish versions of BOINC try every new upload once even when there are backoffs, so I'd have been back in business immediately if the server had come back online while I still had work cached.
But instead, three machine logs show
18/01/2019 01:20:54 | Einstein@Home | Temporarily failed upload of p2030.20170410.G39.44-01.25.N.b4s0g0.00000_291_0_0: transient HTTP error
18/01/2019 05:11:32 | Einstein@Home | Started upload of p2030.20170410.G39.44-01.25.N.b4s0g0.00000_291_0_0
17/01/2019 23:12:09 | Einstein@Home | Temporarily failed upload of p2030.20170410.G39.44-01.25.N.b4s0g0.00000_2683_0_0: transient HTTP error
18/01/2019 05:07:25 | Einstein@Home | Started upload of p2030.20170410.G39.44-01.25.N.b3s0g0.00000_2520_1_0
18/01/2019 02:05:31 | Einstein@Home | Temporarily failed upload of p2030.20170410.G39.44-01.25.N.b4s0g0.00000_1307_0_0: transient HTTP error
18/01/2019 05:40:58 | Einstein@Home | Started upload of p2030.20170410.G39.44-01.25.N.b4s0g0.00000_1298_1_0
- so I can't tell when it came back, except to say between 02:05 and 05:07. But that's a long time for the outage to last, when it's a known problem with an automatic scripted recovery in place. I think I'm going to suggest to Bernd that he sets the script to run more often.
My reporting seems to not be
)
My reporting seems to not be working :
2/2/2019 11:17:27 AM | Einstein@Home | Scheduler request completed
2/2/2019 11:18:09 AM | | Project communication failed: attempting access to reference site
2/2/2019 11:18:11 AM | | Internet access OK - project servers may be temporarily down.
Free your MIND, and the rest will follow!
Uploads stuck here too
)
Uploads stuck here too