some rare Spotlight result files won't upload

Eugene Stemple
Joined: 9 Feb 11
Posts: 67
Credit: 374071171
RAC: 541535
Topic 223367

Probably hundreds of Spotlight result files are uploading without difficulty. However, I have (so far) three files that just won't complete the upload. The odd thing is that each one stalls at exactly the same point on every attempt, and all retries fail. For example, the result file h1_0209.35_O2C02ClrIn0__O2MDFS2_Spotlight_0209.50Hz_56_0_0 has been retried 27 times and always stalls at 6.86% progress. (Another similar file has been retried 39 times and stalls at 20.69%, but assuming it is the same underlying issue, let's just focus on the first case.)

"Knowing" that the upload is going to fail, I used a packet capture utility (tcpdump on Linux) to save a log of the entire session. That log file, roughly 500K bytes, is available if needed. Everything appears to proceed normally in establishing the TCP connection and getting the file data flowing. The host sends data packets with advancing sequence numbers; the server begins "ack"ing the packets but does not get past (ack 4304  win 37648). From that point onward the host continues the data stream but DOES NOT exceed the server's receive window. The server's acks are always "ack 4304", with the "win" parameter gradually increasing to 65160. Eventually the server begins replying with the "selective ack" option showing 5752:23128 (for example), with the upper bound increasing with each ack packet.

Within a reasonable time, 0.15 second, the host resends the "lost" data packet, seq 4304:5752, and then resumes the data stream where it left off. The server continues to reply with ack 4304 and the sack block upper bound increasing. This pattern repeats: the host resends the 4304 packet and resumes the data stream, and the server responds with ack 4304 and an enlarged sack block boundary. The process ends when the host approaches the receive window size and keeps resending the 4304:5752 packet at increasing timeout intervals, until both ends eventually give up and close/reset the TCP link.
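For anyone wanting to spot the same pattern in their own capture, here is a minimal Python sketch that scans tcpdump text output for an ack number that keeps repeating while the SACK block edge grows. The function name and the sample lines are hypothetical: the lines are modeled on typical tcpdump output with relative sequence numbers, not copied from the actual capture, and the SACK edges beyond 23128 are made up for illustration.

```python
import re
from collections import defaultdict

ACK_RE = re.compile(r"\back (\d+)")             # cumulative ack number
SACK_BLOCK_RE = re.compile(r"\{(\d+):(\d+)\}")  # SACK blocks such as {5752:23128}

def find_stuck_ack(lines, min_repeats=3):
    """Return (ack, repeats, highest_sack_edge) for the ack number that
    appears most often together with SACK blocks, or None if no ack
    repeats at least min_repeats times."""
    stats = defaultdict(lambda: [0, 0])  # ack -> [repeat count, highest SACK right edge]
    for line in lines:
        ack = ACK_RE.search(line)
        blocks = SACK_BLOCK_RE.findall(line)
        if not ack or not blocks:
            continue  # only packets carrying SACK blocks are of interest
        entry = stats[int(ack.group(1))]
        entry[0] += 1
        entry[1] = max(entry[1], max(int(right) for _, right in blocks))
    if not stats:
        return None
    ack, (count, edge) = max(stats.items(), key=lambda kv: kv[1][0])
    return (ack, count, edge) if count >= min_repeats else None

# Illustrative lines only (assumed format, invented timestamps and ports).
sample = [
    "12:00:01.100000 IP host.54321 > server.80: Flags [.], seq 4304:5752, ack 1, win 502, length 1448",
    "12:00:01.200000 IP server.80 > host.54321: Flags [.], ack 4304, win 40544, options [nop,nop,sack 1 {5752:23128}], length 0",
    "12:00:01.300000 IP server.80 > host.54321: Flags [.], ack 4304, win 43440, options [nop,nop,sack 1 {5752:24576}], length 0",
    "12:00:01.400000 IP server.80 > host.54321: Flags [.], ack 4304, win 46336, options [nop,nop,sack 1 {5752:26024}], length 0",
]

print(find_stuck_ack(sample))
```

A repeated ack with a growing SACK edge means the server is receiving the later segments but never the one starting at the stuck sequence number.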

I have seen other uploads stall on the first attempt (reasons unknown) but always run to completion on the second (or at most third) retry.  The consistency of this upload always stalling at byte/block 4304:5752 certainly looks strange.  There are, of course, lots of potential software failure points:  the BOINC file upload process; the host TCP flow manager; the server file receiver.  One's imagination goes wild!  A byte sequence in the data stream that is interpreted as an "escape" sequence (at any node in the transfer path)?  In a gzip compressed file any byte pattern can eventually occur.
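One way to look into the "escape sequence" idea would be to hex-dump the bytes of the result file around the segment that never arrives. The sketch below is only an illustration: the helper names are hypothetical, and the TCP sequence numbers also count the HTTP request headers, so seq 4304:5752 corresponds only approximately to the same offsets in the file.

```python
def read_range(path, start, end):
    """Read bytes [start, end) from a file."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start)

def hex_dump(data, base_offset=0, width=16):
    """Format bytes as hex dump lines: offset, hex bytes, printable ASCII."""
    pad = width * 3 - 1  # width of the hex column for a full row
    lines = []
    for i in range(0, len(data), width):
        chunk = data[i:i + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{base_offset + i:08x}  {hex_part:<{pad}}  {text}")
    return lines

# Demo on in-memory bytes; with a real file one might instead call
# read_range(path_to_result_file, 4304, 5752), where the offsets are a guess.
demo = b"\x1b[0mAB"
for line in hex_dump(demo, base_offset=4304):
    print(line)
```

Comparing the dump of a stalled file with the same range from a file that uploaded cleanly might show whether anything unusual sits in that region.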

One of the stalled uploads, the one already retried 39 times, has a deadline of August 28 (Friday) so I suppose it will be cleared from the upload queue at that time.  The one I've discussed here has a deadline of August 30, so it will be around for a few more days if there is anything useful to try to get it uploaded, or to further diagnose the problem.

 

Betreger
Joined: 25 Feb 05
Posts: 992
Credit: 1588445735
RAC: 762298

Try exiting BOINC and stopping processing, then restart BOINC and try again. That has worked for me.

Eugene Stemple
Joined: 9 Feb 11
Posts: 67
Credit: 374071171
RAC: 541535

@betreger

I have just tried your suggestion.  Set NNT; suspended all non-running tasks; allowed the running GPU task to finish and upload normally; shut down the BOINC client; shut down the BOINC manager.  ...counted to 100...

Restarted BOINC client & manager normally.  Ran one GPU task and let it upload (just as a confidence test to ensure that the upload path is working); did a "Retry Now" on each of the stalled uploads, one at a time, waiting for each transfer to time out.  All retries failed.  I did enable the "file_xfer_debug" and "http_xfer_debug" options for the event log.  All the failed retries return status -184 (transient HTTP error).  Nothing else obvious in the event log, as compared to a successful upload.

I will pick one of the stuck uploads and do an "Abort Transfer."  I'm not sure what implications that has for the task being reported - marked as computation error?  marked as user aborted task?  marked as invalid?  will BOINC start the upload from scratch instead of as a retry?  I will find out, I guess.

For me, at least, these are rare failures - about 1 in 100 or something on that order.

San-Fernando-Valley
Joined: 16 Mar 16
Posts: 401
Credit: 10143233455
RAC: 25876548

If you haven't already, try setting <http_debug>.
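For reference, a rough Python sketch that builds a cc_config.xml enabling the three log flags mentioned in this thread (file_xfer_debug, http_xfer_debug, http_debug). It assumes the usual layout of a `<log_flags>` section inside `<cc_config>`; the function name is hypothetical, and where the file lives and how the client re-reads it depend on the installation, so treat this as a starting point only.

```python
import xml.etree.ElementTree as ET

def build_cc_config(flags):
    """Return cc_config XML text with each named log flag set to 1."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        ET.SubElement(log_flags, name).text = "1"
    return ET.tostring(root, encoding="unicode")

xml_text = build_cc_config(["file_xfer_debug", "http_xfer_debug", "http_debug"])
print(xml_text)
```

If a cc_config.xml already exists, the flags would need to be merged into its existing `<log_flags>` section rather than overwriting the file.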

Eugene Stemple
Joined: 9 Feb 11
Posts: 67
Credit: 374071171
RAC: 541535

It's either another clue or another missing piece in this puzzle...  but yesterday a local electrical storm caused an "unplanned" power cycle/reset of the router and upstream cell modem (the computer itself is on a UPS and was not affected).  After everything recovered, the result file upload retries went through without error.  That certainly suggests some issue between my host and the associated router.  I have @S-F-V's suggestion regarding http_debug in mind, and I'll use it the next time I have a stalled upload.  I'll add to this thread if I discover anything relevant.  If anybody knows more about the -184 error status, it would be great to hear from you.  The generic "transient HTTP error" event log message doesn't help very much.

 
