Unpleasant Surprise

Alinator
Alinator
Joined: 8 May 05
Posts: 927
Credit: 9,352,143
RAC: 0
Topic 192688

Well, my Core Duo T2400 suffered a malfunction which appears to be very similar to the scenario described in the SIGABRT workaround for Linux thread.

This is a host I'm allowed to crunch on quid pro quo, so I haven't had a chance to do any forensics on it yet.

From what I can tell from the web site reports:

1.) It had two running and two in the bullpen from the same datapack.

2.) It finished one of the ones running, uploaded it successfully, and started one of the 'spares'.

3.) At this point it's hard to tell for sure without looking at the logfiles from the host itself, but at 1:38:38 UTC on 6/6/07, a malfuction occurred which caused both the running tasks at that point to abort with an exit code 99.

4.) This has set up a 'cascade' failure where every result since then has failed with a 99 (with a couple of exceptions) and burned up the daily quota.

The similarity to the Linux scenario is that the event looks like it was triggered when the host was trying to report the completed result at the same time it was trying to get new work from the project for the same datapack.

I did some research into the code 99, and although I didn't find anything here in the EAH fora about it reports on other projects indicate it was due to corrupted input files from the project (ie, a bad WU).

I not sure why DL'ing another result for the datapack would cause this, and as I said this what I can glean from looking at the host summary page. I'm planning to go take at look at the host afternoon and see if I figure anything else out from looking at the actual logfiles.

One other detail, this was the first time it had tried to transition thorough a report/request cycle with 2 EAH running at the time. It normally splits time with SAH, but was completely out of SAH due to their MSD failure. It had made a successful cycle previously, but it looks like only 1 EAH (possibly none) was running at that time.

Alinator

Alinator
Alinator
Joined: 8 May 05
Posts: 927
Credit: 9,352,143
RAC: 0

Unpleasant Surprise

For reference, here's the stderr from the last failure listed, typical of almost all of them.

5.8.15

- exit code 99 (0x63)

2007-05-05 22:23:10.5781 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/einstein_S5R2_4.17_windows_intelx86.exe'.
2007-05-05 22:23:11.1093 [debug]: Reading SFTs and setting up stacks ... done
2007-05-05 22:23:26.2812 [normal]: ERROR: Couldn't open existing checkpointing toplist file h1_0299.65_S5R2__71_S5R2c_0_0
2007-05-05 22:23:26.2812 [debug]: Couldn't open checkpoint - starting from beginning
2007-05-05 22:23:26.2812 [debug]: Total skypoints = 22080. Progress: 0, 1, 2, 3, 4, XLAL Error - XLALComputeFaFb (ComputeFstat.c:478): Input domain error
Level 0: $Id: HierarchicalSearch.c,v 1.164 2007/04/25 14:31:47 bema Exp $
Function call `COMPUTEFSTATFREQBAND ( &status, fstatVector.data + k, &thisPoint, stackMultiSFT.data[k], stackMultiNoiseWeights.data[k], stackMultiDetStates.data[k], &CFparams)' failed.
file HierarchicalSearch.c, line 1010
2007-05-05 22:23:55.3125 [normal]:
Level 1: $Id: ComputeFstat.c,v 1.67 2007/02/26 14:15:43 reinhard Exp $
2007-05-05 22:23:55.3125 [normal]: Status code -1: Recursive error
2007-05-05 22:23:55.3125 [normal]: function ComputeFStatFreqBand, file ComputeFstat.c, line 151
2007-05-05 22:23:55.3125 [normal]:
Level 2: $Id: ComputeFstat.c,v 1.67 2007/02/26 14:15:43 reinhard Exp $
2007-05-05 22:23:55.3125 [normal]: Status code 5: XLAL function call failed
2007-05-05 22:23:55.3125 [normal]: function ComputeFStat, file ComputeFstat.c, line 286
2007-05-05 22:23:55.3125 [CRITICAL]: BOINC_LAL_ErrHand(): now calling boinc_finish()

]]>

Alinator

Alinator
Alinator
Joined: 8 May 05
Posts: 927
Credit: 9,352,143
RAC: 0

I just got back from taking a

I just got back from taking a look at this 'wanderer'. Here's some relevent log entry snippets from the stdoutdae file:

First is the routine log message for the last sucessful completed result and upload:

2007-05-05 13:17:15 [SETI@home] Sending scheduler request: To fetch work
2007-05-05 13:17:15 [SETI@home] Requesting 147182 seconds of new work
2007-05-05 13:17:37 [---] Project communication failed: attempting access to reference site
2007-05-05 13:17:39 [---] Access to reference site succeeded - project servers may be temporarily down.
2007-05-05 13:17:40 [SETI@home] Scheduler request failed: couldn't connect to server
2007-05-05 13:17:40 [SETI@home] Deferring communication for 1 hr 22 min 12 sec
2007-05-05 13:17:40 [SETI@home] Reason: scheduler request failed
2007-05-05 13:32:57 [Einstein@Home] Computation for task h1_0299.65_S5R2__127_S5R2c_0 finished
2007-05-05 13:32:57 [Einstein@Home] Starting h1_0299.65_S5R2__124_S5R2c_1
2007-05-05 13:32:58 [Einstein@Home] Starting task h1_0299.65_S5R2__124_S5R2c_1 using einstein_S5R2 version 417
2007-05-05 13:33:00 [Einstein@Home] [file_xfer] Started upload of file h1_0299.65_S5R2__127_S5R2c_0_0
2007-05-05 13:33:04 [Einstein@Home] [file_xfer] Finished upload of file h1_0299.65_S5R2__127_S5R2c_0_0
2007-05-05 13:33:04 [Einstein@Home] [file_xfer] Throughput 115662 bytes/sec
2007-05-05 14:39:54 [SETI@home] Sending scheduler request: To fetch work
2007-05-05 14:39:54 [SETI@home] Requesting 147373 seconds of new work
2007-05-05 14:40:17 [---] Project communication failed: attempting access to reference site
2007-05-05 14:40:19 [---] Access to reference site succeeded - project servers may be temporarily down.
2007-05-05 14:40:20 [SETI@home] Scheduler request failed: couldn't connect to server
2007-05-05 14:40:20 [SETI@home] Deferring communication for 1 min 0 sec
2007-05-05 14:40:20 [SETI@home] Reason: scheduler request failed
2007-05-05 14:41:20 [SETI@home] Fetching scheduler list
2007-05-05 14:41:25 [SETI@home] Master file download succeeded
2007-05-05 14:41:30 [SETI@home] Sending scheduler request: To fetch work
2007-05-05 14:41:30 [SETI@home] Requesting 147377 seconds of new work
2007-05-05 14:41:52 [---] Project communication failed: attempting access to reference site
2007-05-05 14:41:54 [---] Access to reference site succeeded - project servers may be temporarily down.

As you can see, nothing unusual here. The log entry is the normal 'pestering' of SAH for work which isn't there, followed by the successful completion of the EAH result, starting one of the 'bullpen results, and the upload of the output file to the project. Six minutes after that it goes back to normal pattern of pestering SAH.

Here's where the the interesting stuff starts, a little over 5 hours later:

2007-05-05 20:04:36 [SETI@home] Requesting 145511 seconds of new work
2007-05-05 20:04:59 [---] Project communication failed: attempting access to reference site
2007-05-05 20:05:01 [---] Access to reference site succeeded - project servers may be temporarily down.
2007-05-05 20:05:02 [SETI@home] Scheduler request failed: couldn't connect to server
2007-05-05 20:05:02 [SETI@home] Deferring communication for 2 hr 22 min 51 sec
2007-05-05 20:05:02 [SETI@home] Reason: scheduler request failed
2007-05-05 21:36:40 [Einstein@Home] Deferring communication for 1 min 0 sec
2007-05-05 21:36:40 [Einstein@Home] Reason: Unrecoverable error for result h1_0299.65_S5R2__139_S5R2c_2 ( - exit code 99 (0x63))
2007-05-05 21:36:41 [Einstein@Home] Computation for task h1_0299.65_S5R2__139_S5R2c_2 finished
2007-05-05 21:36:41 [Einstein@Home] Output file h1_0299.65_S5R2__139_S5R2c_2_0 for task h1_0299.65_S5R2__139_S5R2c_2 absent
2007-05-05 21:36:41 [Einstein@Home] Starting h1_0299.65_S5R2__122_S5R2c_0
2007-05-05 21:36:44 [Einstein@Home] Starting task h1_0299.65_S5R2__122_S5R2c_0 using einstein_S5R2 version 417
2007-05-05 21:37:21 [Einstein@Home] Deferring communication for 1 min 0 sec
2007-05-05 21:37:21 [Einstein@Home] Reason: Unrecoverable error for result h1_0299.65_S5R2__122_S5R2c_0 ( - exit code 99 (0x63))
2007-05-05 21:37:21 [Einstein@Home] Computation for task h1_0299.65_S5R2__122_S5R2c_0 finished
2007-05-05 21:37:21 [Einstein@Home] Output file h1_0299.65_S5R2__122_S5R2c_0_0 for task h1_0299.65_S5R2__122_S5R2c_0 absent
2007-05-05 21:37:22 [Einstein@Home] Deferring communication for 1 min 0 sec
2007-05-05 21:37:22 [Einstein@Home] Reason: Unrecoverable error for result h1_0299.65_S5R2__124_S5R2c_1 ( - exit code 99 (0x63))
2007-05-05 21:37:22 [Einstein@Home] Computation for task h1_0299.65_S5R2__124_S5R2c_1 finished
2007-05-05 21:37:22 [Einstein@Home] Output file h1_0299.65_S5R2__124_S5R2c_1_0 for task h1_0299.65_S5R2__124_S5R2c_1 absent
2007-05-05 21:38:25 [Einstein@Home] Sending scheduler request: To fetch work
2007-05-05 21:38:25 [Einstein@Home] Requesting 172800 seconds of new work, and reporting 4 completed tasks

Note that after the routine failed SAH nag at 20:05, the next log entry is a deferral request from EAH of one minute, however there is no entry for what CC activity generated the deferral request. Normally, this is due to a successful scheduler contact event, however there is no evidence logged of CC event which would initiate a contact session. Typically this would be a work request, work report, and/or a manual update request.

Looking further down the snippet, we can see three more unrecoverable errors (the code 99's) which lead up to a next work request for 2 days worth of work (86400 X 2 for dual core @ 1 day CI) and a report of 4 results.

I'm hypothesizing what happen here is the missing scheduler event was the a cache refilling request for work of 172800 seconds and the report of the recently completed one, which apparently didn't happen at first since the project didn't reject one on the report at the end of the snippet and no DL occured until that last session either.

The problem is something went horribly wrong while it was trying to do all that in the missing session. This lead to the cascade failure (with code 99's) of the 2 results which were running fine up to that point, and every other result DL'ed after that until the quota got burned up.

At this point I've reset the project, manually deleted all the datapack files, and am now waiting for the quota day to reset in a few hours approx (00:45 UTC) and see what happens.

Alinator

Richard Haselgrove
Richard Haselgrove
Joined: 10 Dec 05
Posts: 2,143
Credit: 2,947,533,986
RAC: 705,191

RE: ....Here's where the

Message 63854 in response to message 63853

Quote:

....Here's where the the interesting stuff starts, a little over 5 hours later:

2007-05-05 21:37:21 [Einstein@Home] Deferring communication for 1 min 0 sec
2007-05-05 21:37:21 [Einstein@Home] Reason: Unrecoverable error for result h1_0299.65_S5R2__122_S5R2c_0 ( - exit code 99 (0x63))
2007-05-05 21:37:21 [Einstein@Home] Computation for task h1_0299.65_S5R2__122_S5R2c_0 finished
2007-05-05 21:37:21 [Einstein@Home] Output file h1_0299.65_S5R2__122_S5R2c_0_0 for task h1_0299.65_S5R2__122_S5R2c_0 absent

Note that after the routine failed SAH nag at 20:05, the next log entry is a deferral request from EAH of one minute, however there is no entry for what CC activity generated the deferral request. Normally, this is due to a successful scheduler contact event, however there is no evidence logged of CC event which would initiate a contact session. Typically this would be a work request, work report, and/or a manual update request.

Looking further down the snippet, we can see three more unrecoverable errors (the code 99's) which lead up to a next work request for 2 days worth of work (86400 X 2 for dual core @ 1 day CI) and a report of 4 results.

I'm hypothesizing what happen here is the missing scheduler event was the a cache refilling request for work of 172800 seconds and the report of the recently completed one, which apparently didn't happen at first since the project didn't reject one on the report at the end of the snippet and no DL occured until that last session either.


I'm not sure your hypothesis is correct, Alinator.

Here's another log snippet:

2007-04-20 02:32:02 [SETI@home Beta Test] [file_xfer] Started download of file ap_15fe07aa_B0_P0_00008_20070419_30805.wu
2007-04-20 02:32:13 [SETI@home Beta Test] [file_xfer] Finished download of file ap_15fe07aa_B0_P0_00007_20070419_30805.wu
2007-04-20 02:32:13 [SETI@home Beta Test] [file_xfer] Throughput 95855 bytes/sec
2007-04-20 02:32:13 [SETI@home Beta Test] [file_xfer] Started download of file ap_15fe07aa_B0_P0_00015_20070419_30805.wu
2007-04-20 02:32:23 [SETI@home Beta Test] Starting ap_15fe07aa_B0_P0_00007_20070419_30805.wu_3
2007-04-20 02:32:23 [SETI@home Beta Test] Starting task ap_15fe07aa_B0_P0_00007_20070419_30805.wu_3 using astropulse version 414
2007-04-20 02:32:26 [SETI@home Beta Test] Deferring communication for 1 min 0 sec
2007-04-20 02:32:26 [SETI@home Beta Test] Reason: Unrecoverable error for result ap_15fe07aa_B0_P0_00007_20070419_30805.wu_3 ( - exit code -5 (0xfffffffb))
2007-04-20 02:32:26 [SETI@home Beta Test] Computation for task ap_15fe07aa_B0_P0_00007_20070419_30805.wu_3 finished
2007-04-20 02:32:26 [SETI@home Beta Test] Starting ap_15fe07aa_B0_P0_00011_20070419_30805.wu_2
2007-04-20 02:32:26 [SETI@home Beta Test] Starting task ap_15fe07aa_B0_P0_00011_20070419_30805.wu_2 using astropulse version 414
2007-04-20 02:32:28 [SETI@home Beta Test] [file_xfer] Started upload of file ap_15fe07aa_B0_P0_00007_20070419_30805.wu_3_0
2007-04-20 02:32:29 [SETI@home Beta Test] Deferring communication for 1 min 0 sec
2007-04-20 02:32:29 [SETI@home Beta Test] Reason: Unrecoverable error for result ap_15fe07aa_B0_P0_00011_20070419_30805.wu_2 ( - exit code -5 (0xfffffffb))
2007-04-20 02:32:29 [SETI@home Beta Test] [file_xfer] Finished upload of file ap_15fe07aa_B0_P0_00007_20070419_30805.wu_3_0
2007-04-20 02:32:29 [SETI@home Beta Test] [file_xfer] Throughput 398 bytes/sec

This was from a time when SETI Beta/Astropulse had good communications, but was issuing faulty WUs which all errored out immediately.

Notice how I got 'Deferring communication' in the same second as, and immediately preceding, each 'Unrecoverable error'. I think the two are tied together - at least in the BOINC v5.8.16 I was using - and the deferral is not the result of some (invisible) scheduler comms failure.

Alinator
Alinator
Joined: 8 May 05
Posts: 927
Credit: 9,352,143
RAC: 0

Hmmmm.... Except you were

Hmmmm....

Except you were in a flurry of comm traffic with Beta at the time, so are you sure you went back far enough in the log to see the initiating event for the comm deferral? I just went over the logfile from mine (which went back to the beginning of April) and the one I posted was the only one where I could not track the deferral back to it's initiating scheduler contact event.

UL's don't generate a deferral request because they don't need scheduler intervention to happen, and DL's always have a scheduler contact event to go along with them because the CC needs to know which files to get from the downloader.

Alinator

FOLLOWUP:

The quota day expired and the host has successfully made contact, reloaded the app, config files, and data files. It also picked up 2 new results at the same template frequency it was running before the event, and appears to be running them normally.

Since I did do a reset I can't rule out the app exe or config files got corrupted, although I think that's unlikely. So at this point I'm leaning to a bad datapack file or some other client/project interaction snafu was the trigger for the event in this case.

Alinator

Richard Haselgrove
Richard Haselgrove
Joined: 10 Dec 05
Posts: 2,143
Credit: 2,947,533,986
RAC: 705,191

To set you mind at

To set you mind at rest....

2007-04-20 01:17:55 [Einstein@Home] Computation for task h1_0713.5_S5R1__4737_S5RIa_1 finished
2007-04-20 01:17:55 [Einstein@Home] Starting h1_0713.5_S5R1__4614_S5RIa_1
2007-04-20 01:17:55 [Einstein@Home] Starting task h1_0713.5_S5R1__4614_S5RIa_1 using einstein_S5RI version 424
2007-04-20 01:17:57 [Einstein@Home] [file_xfer] Started upload of file h1_0713.5_S5R1__4737_S5RIa_1_0
2007-04-20 01:18:09 [Einstein@Home] [file_xfer] Finished upload of file h1_0713.5_S5R1__4737_S5RIa_1_0
2007-04-20 01:18:09 [Einstein@Home] [file_xfer] Throughput 16145 bytes/sec
2007-04-20 01:21:38 [climateprediction.net] Sending scheduler request: To send trickle-up message
2007-04-20 01:21:38 [climateprediction.net] (not requesting new work or reporting completed tasks)
2007-04-20 01:21:43 [climateprediction.net] Scheduler RPC succeeded [server version 509]
2007-04-20 02:27:03 [SETI@home Beta Test] Sending scheduler request: To fetch work
2007-04-20 02:27:03 [SETI@home Beta Test] Requesting 171138 seconds of new work
2007-04-20 02:27:08 [SETI@home Beta Test] Scheduler RPC succeeded [server version 509]
2007-04-20 02:27:08 [SETI@home Beta Test] Deferring communication for 7 sec
2007-04-20 02:27:08 [SETI@home Beta Test] Reason: requested by project

2007-04-20 02:27:08 [SETI@home Beta Test] Deferring communication for 1 min 0 sec
2007-04-20 02:27:08 [SETI@home Beta Test] Reason: no work from project

2007-04-20 02:28:09 [SETI@home Beta Test] Fetching scheduler list
2007-04-20 02:28:19 [SETI@home Beta Test] Master file download succeeded
2007-04-20 02:28:25 [SETI@home Beta Test] Sending scheduler request: To fetch work
2007-04-20 02:28:25 [SETI@home Beta Test] Requesting 171132 seconds of new work
2007-04-20 02:28:30 [SETI@home Beta Test] Scheduler RPC succeeded [server version 509]
2007-04-20 02:28:30 [SETI@home Beta Test] Deferring communication for 7 sec
2007-04-20 02:28:30 [SETI@home Beta Test] Reason: requested by project

2007-04-20 02:28:30 [SETI@home Beta Test] Deferring communication for 1 min 0 sec
2007-04-20 02:28:30 [SETI@home Beta Test] Reason: no work from project

2007-04-20 02:29:31 [SETI@home Beta Test] Sending scheduler request: To fetch work
2007-04-20 02:29:31 [SETI@home Beta Test] Requesting 171141 seconds of new work
2007-04-20 02:29:36 [SETI@home Beta Test] Scheduler RPC succeeded [server version 509]
2007-04-20 02:29:36 [SETI@home Beta Test] Deferring communication for 7 sec
2007-04-20 02:29:36 [SETI@home Beta Test] Reason: requested by project

2007-04-20 02:29:36 [SETI@home Beta Test] Deferring communication for 1 min 0 sec
2007-04-20 02:29:36 [SETI@home Beta Test] Reason: no work from project

2007-04-20 02:30:37 [SETI@home Beta Test] Sending scheduler request: To fetch work
2007-04-20 02:30:37 [SETI@home Beta Test] Requesting 171141 seconds of new work
2007-04-20 02:30:42 [SETI@home Beta Test] Scheduler RPC succeeded [server version 509]
2007-04-20 02:30:42 [SETI@home Beta Test] Deferring communication for 7 sec
2007-04-20 02:30:42 [SETI@home Beta Test] Reason: requested by project

2007-04-20 02:30:44 [SETI@home Beta Test] [file_xfer] Started download of file ap_15fe07aa_B0_P0_00007_20070419_30805.wu
2007-04-20 02:30:44 [SETI@home Beta Test] [file_xfer] Started download of file ap_15fe07aa_B0_P0_00011_20070419_30805.wu
2007-04-20 02:30:53 [SETI@home Beta Test] Sending scheduler request: To fetch work
2007-04-20 02:30:53 [SETI@home Beta Test] Requesting 21363 seconds of new work
2007-04-20 02:30:58 [SETI@home Beta Test] Scheduler RPC succeeded [server version 509]
2007-04-20 02:30:58 [SETI@home Beta Test] Deferring communication for 7 sec
2007-04-20 02:30:58 [SETI@home Beta Test] Reason: requested by project

2007-04-20 02:31:08 [SETI@home Beta Test] Sending scheduler request: To fetch work
2007-04-20 02:31:08 [SETI@home Beta Test] Requesting 4371 seconds of new work
2007-04-20 02:31:13 [SETI@home Beta Test] Scheduler RPC succeeded [server version 509]
2007-04-20 02:31:13 [SETI@home Beta Test] Deferring communication for 7 sec
2007-04-20 02:31:13 [SETI@home Beta Test] Reason: requested by project

2007-04-20 02:32:02 [SETI@home Beta Test] [file_xfer] Finished download of file ap_15fe07aa_B0_P0_00011_20070419_30805.wu
2007-04-20 02:32:02 [SETI@home Beta Test] [file_xfer] Throughput 109270 bytes/sec
2007-04-20 02:32:02 [SETI@home Beta Test] [file_xfer] Started download of file ap_15fe07aa_B0_P0_00008_20070419_30805.wu
2007-04-20 02:32:13 [SETI@home Beta Test] [file_xfer] Finished download of file ap_15fe07aa_B0_P0_00007_20070419_30805.wu
2007-04-20 02:32:13 [SETI@home Beta Test] [file_xfer] Throughput 95855 bytes/sec

That's the log immediately prior to the snippet I posted yesterday, including a bit of overlap for continuity. Notice how each deferral is followed by its 'reason' statement. These ones match up: the ones from yesterday only match with the application error.

I think the idea may be that when an error happens, BOINC deliberately inhibits reporting for a while to allow the result file to be uploaded - then the next scheduler contact can include this result as well. I suggested something similar last Autumn/Fall, while researching the impact of Return Results Immediately - I think it would be a good idea to do it for successful results too, because it would reduce the number of uploaded but unreported results lying around tempting people to update manually (I saw a lot of cases where scheduler requests went in just after an application exited, before the upload had even started).

I think your explanation for the repeated S5R2 errors makes sense: that the first error leaves the datapak in a damaged state, which subsequent runs can't cope with. But that doesn't do anything to explain the first error in the sequence.

Alinator
Alinator
Joined: 8 May 05
Posts: 927
Credit: 9,352,143
RAC: 0

Ah yes, I see your point now.

Ah yes, I see your point now. I agree the deferral appears to be locally generated by BOINC itself and triggered by the result abort. Looking at it now that pattern agrees with your observations on Beta.

Don't quite understand why I didn't put 2 and 2 together from the reason given that followed. Too focussed on one part of the equation, I guess.

Give self slap on head. ;-)

I'll have to go back and retrace the timeline, since my original hypothesis depended on some event external to the processing of the two results running caused them to abort and set off the cascade failure.

Alinator

Well I went back and looked over the logfiles with this viewpoint correction and there isn't any external events other than the SAH nags between when the last successful EAH uploaded and the start of the cascade failure.

The bad news is that since I reset the project and got fresh copies of everything which was onboard at the time of the failure and resumed working on the same template as before, it implies that BOINC and/or the app itself caused the failure by corrupting something.

In light of this, the question I have is just what does exit code 99 mean with regards to EAH? I don't think 'Killer Trickles' applies to EAH (or at least they shouldn't AFAIK). ;-)

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.