Validate errors

Gundolf Jahn
Joined: 1 Mar 05
Posts: 1079
Credit: 341280
RAC: 0

RE: Perhaps I could have,

Message 91004 in response to message 91003

Quote:
Perhaps I could have, but time has marched on and the point at which the available messages now begin is too late...


What about the stdoutdae.txt file (in the BOINC data directory)? If that overflows, it usually is renamed (stdoutdae_old.txt?).

And allow for searching in hidden objects!

Regards,
Gundolf

Computers aren't everything in life. (Just kidding)

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2980657349
RAC: 759876

RE: Peter has a 5.10.20

Message 91005 in response to message 91001

Quote:
Peter has a 5.10.20 which didn't have the problem so it looks like the bug was introduced somewhere between 20 and 28.


There was indeed a likely-looking change at version 5.10.21, towards the bottom:

Old functions:
handle_xfer_failure(), try_next_url(), xfer_failed(), retry_or_backoff()

New functions:
permanent_failure(), transient_failure()

archae86
Joined: 6 Dec 05
Posts: 3161
Credit: 7272811730
RAC: 1826442

RE: RE: Perhaps I could

Message 91006 in response to message 91004

Quote:
Quote:
Perhaps I could have, but time has marched on and the point at which the available messages now begin is too late...

What about the stdoutdae.txt file (in the BOINC data directory)? If that overflows, it usually is renamed (stdoutdae_old.txt?).

And allow for searching in hidden objects!

Regards,
Gundolf


stdoutdae_old.txt it is, then. Thanks.

Sorry for the very long post, but possibly something in here will give someone a clue. I've searched stdoutdae_old.txt for references to this specific WU, deleted other things, and even deleted most of the failures. All deleted lines are marked by blank lines, and the added comments should be obvious:

2009-03-27 02:59:36 [Einstein@Home] [file_xfer] Finished upload of file h1_0635.85_S5R4__638_S5R5a_1_0
2009-03-27 02:59:36 [Einstein@Home] [file_xfer] Throughput 80089 bytes/sec
2009-03-27 03:29:54 [Einstein@Home] Computation for task p2030_53613_08999_0009_G67.52-00.06.N_2.dm_230_0 finished
2009-03-27 03:29:54 [Einstein@Home] Starting h1_0635.85_S5R4__636_S5R5a_1
2009-03-27 03:29:54 [Einstein@Home] Starting task h1_0635.85_S5R4__636_S5R5a_1 using einstein_S5R5 version 301

2009-03-27 05:30:10 [Einstein@Home] Resuming task h1_0635.85_S5R4__636_S5R5a_1 using einstein_S5R5 version 301

2009-03-27 09:52:55 [Einstein@Home] Computation for task h1_0635.85_S5R4__636_S5R5a_1 finished
2009-03-27 09:52:55 [Einstein@Home] Starting h1_0635.85_S5R4__634_S5R5a_1
2009-03-27 09:52:55 [Einstein@Home] Starting task h1_0635.85_S5R4__634_S5R5a_1 using einstein_S5R5 version 301
2009-03-27 09:52:57 [Einstein@Home] [file_xfer] Started upload of file h1_0635.85_S5R4__636_S5R5a_1_0
2009-03-27 09:53:19 [---] Project communication failed: attempting access to reference site
2009-03-27 09:53:19 [Einstein@Home] [file_xfer] Temporarily failed upload of h1_0635.85_S5R4__636_S5R5a_1_0: http error
2009-03-27 09:53:19 [Einstein@Home] Backing off 1 min 0 sec on upload of file h1_0635.85_S5R4__636_S5R5a_1_0
2009-03-27 09:53:21 [---] Access to reference site succeeded - project servers may be temporarily down.
2009-03-27 09:54:20 [Einstein@Home] [file_xfer] Started upload of file h1_0635.85_S5R4__636_S5R5a_1_0
2009-03-27 09:58:19 [---] Project communication failed: attempting access to reference site
2009-03-27 09:58:19 [Einstein@Home] [file_xfer] Temporarily failed upload of h1_0635.85_S5R4__636_S5R5a_1_0: http error
2009-03-27 09:58:19 [Einstein@Home] Backing off 1 min 0 sec on upload of file h1_0635.85_S5R4__636_S5R5a_1_0
2009-03-27 09:58:21 [---] Access to reference site succeeded - project servers may be temporarily down.
2009-03-27 09:59:20 [Einstein@Home] [file_xfer] Started upload of file h1_0635.85_S5R4__636_S5R5a_1_0
2009-03-27 09:59:42 [---] Project communication failed: attempting access to reference site
2009-03-27 09:59:42 [Einstein@Home] [file_xfer] Temporarily failed upload of h1_0635.85_S5R4__636_S5R5a_1_0: system connect
2009-03-27 09:59:42 [Einstein@Home] Backing off 1 min 0 sec on upload of file h1_0635.85_S5R4__636_S5R5a_1_0
2009-03-27 09:59:44 [---] Access to reference site succeeded - project servers may be temporarily down.
2009-03-27 10:00:43 [Einstein@Home] [file_xfer] Started upload of file h1_0635.85_S5R4__636_S5R5a_1_0
2009-03-27 10:01:05 [---] Project communication failed: attempting access to reference site
2009-03-27 10:01:05 [Einstein@Home] [file_xfer] Temporarily failed upload of h1_0635.85_S5R4__636_S5R5a_1_0: system connect
2009-03-27 10:01:05 [Einstein@Home] Backing off 1 min 0 sec on upload of file h1_0635.85_S5R4__636_S5R5a_1_0
2009-03-27 10:01:06 [---] Access to reference site succeeded - project servers may be temporarily down.

Some more of these followed, some tagged as http error, some as system connect, with an increasing back-off on the retry time. Then, for the first time:

2009-03-27 11:52:37 [Einstein@Home] [file_xfer] Started upload of file h1_0635.85_S5R4__636_S5R5a_1_0
2009-03-27 11:53:23 [Einstein@Home] [file_xfer] Temporarily failed upload of h1_0635.85_S5R4__636_S5R5a_1_0: file not found
2009-03-27 11:53:23 [Einstein@Home] Backing off 29 min 38 sec on upload of file h1_0635.85_S5R4__636_S5R5a_1_0
2009-03-27 12:23:02 [Einstein@Home] [file_xfer] Started upload of file h1_0635.85_S5R4__636_S5R5a_1_0
2009-03-27 12:23:16 [Einstein@Home] [file_xfer] Temporarily failed upload of h1_0635.85_S5R4__636_S5R5a_1_0: file not found
2009-03-27 12:23:16 [Einstein@Home] Backing off 1 hr 56 min 20 sec on upload of file h1_0635.85_S5R4__636_S5R5a_1_0

still later, it is back to complaining about system connect, not about file not found

2009-03-27 14:19:38 [Einstein@Home] [file_xfer] Started upload of file h1_0635.85_S5R4__636_S5R5a_1_0
2009-03-27 14:20:00 [---] Project communication failed: attempting access to reference site
2009-03-27 14:20:00 [Einstein@Home] [file_xfer] Temporarily failed upload of h1_0635.85_S5R4__636_S5R5a_1_0: system connect
2009-03-27 14:20:00 [Einstein@Home] Backing off 2 hr 55 min 45 sec on upload of file h1_0635.85_S5R4__636_S5R5a_1_0

2009-03-27 17:15:45 [Einstein@Home] [file_xfer] Started upload of file h1_0635.85_S5R4__636_S5R5a_1_0
2009-03-27 17:16:07 [---] Project communication failed: attempting access to reference site
2009-03-27 17:16:07 [Einstein@Home] [file_xfer] Temporarily failed upload of h1_0635.85_S5R4__636_S5R5a_1_0: system connect
2009-03-27 17:16:07 [Einstein@Home] Backing off 1 hr 10 min 25 sec on upload of file h1_0635.85_S5R4__636_S5R5a_1_0
2009-03-27 17:16:09 [---] Access to reference site succeeded - project servers may be temporarily down.

the very next time it says file not found

2009-03-27 18:26:33 [Einstein@Home] [file_xfer] Started upload of file h1_0635.85_S5R4__636_S5R5a_1_0
2009-03-27 18:26:56 [Einstein@Home] [file_xfer] Temporarily failed upload of h1_0635.85_S5R4__636_S5R5a_1_0: file not found
2009-03-27 18:26:56 [Einstein@Home] Backing off 1 hr 52 min 58 sec on upload of file h1_0635.85_S5R4__636_S5R5a_1_0

I've deleted perhaps a dozen attempts in this range, all of which got "file not found"

2009-03-29 03:31:02 [Einstein@Home] [file_xfer] Started upload of file h1_0635.85_S5R4__636_S5R5a_1_0
2009-03-29 03:31:22 [Einstein@Home] [file_xfer] Temporarily failed upload of h1_0635.85_S5R4__635_S5R5a_1_0: file not found
2009-03-29 03:31:22 [Einstein@Home] Backing off 8 min 32 sec on upload of file h1_0635.85_S5R4__635_S5R5a_1_0
2009-03-29 03:31:22 [Einstein@Home] [file_xfer] Started upload of file h1_0635.85_S5R4__631_S5R5a_0_0
2009-03-29 03:31:23 [---] Project communication failed: attempting access to reference site
2009-03-29 03:31:23 [Einstein@Home] [file_xfer] Temporarily failed upload of h1_0635.85_S5R4__636_S5R5a_1_0: system connect
2009-03-29 03:31:23 [Einstein@Home] Backing off 2 hr 22 min 55 sec on upload of file h1_0635.85_S5R4__636_S5R5a_1_0

2009-03-29 05:54:20 [Einstein@Home] [file_xfer] Started upload of file h1_0635.85_S5R4__636_S5R5a_1_0
2009-03-29 05:54:39 [Einstein@Home] [file_xfer] Temporarily failed upload of h1_0635.85_S5R4__636_S5R5a_1_0: file not found
2009-03-29 05:54:39 [Einstein@Home] Backing off 3 hr 8 min 28 sec on upload of file h1_0635.85_S5R4__636_S5R5a_1_0

2009-03-29 09:03:08 [Einstein@Home] [file_xfer] Started upload of file h1_0635.85_S5R4__636_S5R5a_1_0
2009-03-29 09:03:30 [---] Project communication failed: attempting access to reference site
2009-03-29 09:03:30 [Einstein@Home] [file_xfer] Temporarily failed upload of h1_0635.85_S5R4__636_S5R5a_1_0: system connect
2009-03-29 09:03:30 [Einstein@Home] Backing off 1 hr 39 min 18 sec on upload of file h1_0635.85_S5R4__636_S5R5a_1_0

2009-03-29 10:42:49 [Einstein@Home] [file_xfer] Started upload of file h1_0635.85_S5R4__636_S5R5a_1_0
2009-03-29 10:43:45 [Einstein@Home] [file_xfer] Temporarily failed upload of h1_0635.85_S5R4__636_S5R5a_1_0: file not found
2009-03-29 10:43:45 [Einstein@Home] Backing off 3 hr 21 min 26 sec on upload of file h1_0635.85_S5R4__636_S5R5a_1_0

I've deleted about 10 trials in this range, all of which got "file not found"

2009-03-30 07:18:00 [Einstein@Home] [file_xfer] Started upload of file h1_0635.85_S5R4__636_S5R5a_1_0
2009-03-30 07:18:11 [Einstein@Home] [file_xfer] Temporarily failed upload of h1_0635.85_S5R4__636_S5R5a_1_0: file not found
2009-03-30 07:18:11 [Einstein@Home] Backing off 1 hr 48 min 25 sec on upload of file h1_0635.85_S5R4__636_S5R5a_1_0

2009-03-30 09:06:37 [Einstein@Home] [file_xfer] Started upload of file h1_0635.85_S5R4__636_S5R5a_1_0
2009-03-30 09:06:59 [---] Project communication failed: attempting access to reference site
2009-03-30 09:06:59 [Einstein@Home] [file_xfer] Temporarily failed upload of h1_0635.85_S5R4__636_S5R5a_1_0: system connect
2009-03-30 09:06:59 [Einstein@Home] Backing off 2 hr 46 min 18 sec on upload of file h1_0635.85_S5R4__636_S5R5a_1_0

2009-03-30 10:18:06 [Einstein@Home] [file_xfer] Started upload of file h1_0635.85_S5R4__636_S5R5a_1_0
2009-03-30 10:18:07 [Einstein@Home] [file_xfer] Started upload of file h1_0635.85_S5R4__635_S5R5a_1_0
2009-03-30 10:18:29 [---] Project communication failed: attempting access to reference site
2009-03-30 10:18:29 [Einstein@Home] [file_xfer] Temporarily failed upload of h1_0635.85_S5R4__635_S5R5a_1_0: system connect
2009-03-30 10:18:29 [Einstein@Home] Backing off 3 hr 17 min 43 sec on upload of file h1_0635.85_S5R4__635_S5R5a_1_0
2009-03-30 10:18:31 [---] Access to reference site succeeded - project servers may be temporarily down.
2009-03-30 10:20:44 [Einstein@Home] [file_xfer] Finished upload of file h1_0635.85_S5R4__636_S5R5a_1_0
2009-03-30 10:20:44 [Einstein@Home] [file_xfer] Throughput 1592 bytes/sec

I'll hazard a guess that the file was never actually missing at all, so that it should never have been given up on. And this version did not give up, but tried, tried, again.
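
For anyone who would rather not trim such a log by hand, here is a minimal sketch of how the failure reasons for one upload could be tallied. It assumes only the line layout visible in the excerpt above (timestamp, project in brackets, a [file_xfer] tag, then "Temporarily failed upload of <file>: <reason>"). The function name tally_upload_failures is made up for this post and is not part of BOINC.

```python
import re
from collections import Counter

# Matches lines shaped like the ones quoted in this thread, e.g.
# 2009-03-27 09:53:19 [Einstein@Home] [file_xfer] Temporarily failed upload of <file>: http error
# The layout is inferred from the excerpt only; other client versions may log differently.
FAILED_UPLOAD = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[[^\]]*\] \[file_xfer\] "
    r"Temporarily failed upload of (?P<name>\S+): (?P<reason>.+)$"
)


def tally_upload_failures(lines, file_name):
    """Count 'Temporarily failed upload' reasons for a single result file.

    lines     -- any iterable of log lines (an open file works)
    file_name -- the exact upload file name to look for
    """
    counts = Counter()
    for line in lines:
        match = FAILED_UPLOAD.match(line.strip())
        if match and match.group("name") == file_name:
            counts[match.group("reason")] += 1
    return counts
```

To use it, open stdoutdae_old.txt (and stdoutdae.txt, if the period of interest spans both), pass the open file as lines, and print the returned counter. For the log above that would show how often "http error", "system connect" and "file not found" each turned up for the one result.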

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118374448822
RAC: 25550031

RE: I'll hazard a guess

Message 91007 in response to message 91006

Quote:
I'll hazard a guess that the file was never actually missing at all, so that it should never have been given up on. And this version did not give up, but tried, tried, again.

Yes, for your version 5.10.20, I'm sure that the result file was never missing. However on the 5.10.45 systems I experimented with during the outage, I looked for the file and could not find it. I believe that it really had been deleted.

As Richard has identified, the problem seems likely to have been introduced in 5.10.21 so you were very wise to stay with 5.10.20 :-).

It's also interesting to ponder the "file not found" messages which only occurred intermittently in your logs. I had always thought that the comment referred to the result file that was being uploaded. I'm now wondering if it actually refers to a file on the server whose purpose might be to cause a "temporary upload failure" for those times when a "system connect" can be achieved but the server really can't handle the upload. Your logs show instances of upload failures due to "system connect" as well as "file not found". Perhaps the latter message indicates that some sort of connection could be made but the "trigger file" mentioned in the Trac Wiki error message explanation couldn't be found. It then seems to follow that this missing trigger file causes a "permanent failure" rather than a "temporary" one (in BOINC versions between 21 and 45) which results in the instant deletion of the result file. That's how I'm now interpreting the somewhat cryptic comments about this error message.

As I mentioned before, we need someone familiar with the code to clarify all that.

PS: This now explains another observation I made towards the end of the outage. I noticed some result uploads which were instantly greeted with the dreaded "file not found". The response was very quick, as if there was an upload server immediately responding. I guess the "trigger file" was still absent and this meant instant death to the result :-(.

Cheers,
Gary.

Conan
Joined: 19 Jun 05
Posts: 172
Credit: 8518159
RAC: 8072

RE: RE: In future if it

Message 91008 in response to message 90993

Quote:
Quote:
In future if it seems that a project will be off air for 3 days or more then I may as well abort all remaining work for that project, as I won't get any credit anyway, so why waste CPU time and power.

You don't really have to abort if you maintain a cache sufficient to ride out the outage. You could simply suspend comms for the duration and allow the tasks to be crunched but not uploaded. After the outage you would just re-enable comms and allow the work to be uploaded and reported.

In my case, I actually had enough work to (mostly) outlast the outage. I just didn't realise early enough what was happening and that I needed to suspend comms in order to save the completed work. When I first saw the reason for the problem I started upgrading BOINC but quickly gave that away when I realised it was easier to suspend comms and save the results that way.

G'Day Gary,
In response to your suggestion, I would be unable to suspend communication to the net as I run other projects as well, and they have shorter turnaround times than Einstein does. So in order not to go past deadline on those projects I have to leave communications open.

I have just upgraded to 6.6.20 but then, after running benchmarks, went straight back to 5.10.21, which is the version I have been using (my AMD Opteron 275 gives around 3450 FP / 1760 Int on 5.10.21 and 2250 FP / 1120 Int on 6.x.x).

The benchmarks are so poor in all versions I have tried above 5.10.21 that I will not be upgrading until I have all fixed credit projects.

So it looks like I may have to use the abort option for long server outages.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118374448822
RAC: 25550031

RE: The benchmarks are so

Message 91009 in response to message 91008

Quote:

The benchmarks are so poor in all versions I have tried above 5.10.21 that I will not be upgrading until I have all fixed credit projects.

So it looks like I may have to use the abort option for long server outages.


I fully understand your inability to suspend comms when you have other short deadline projects to consider.

If you really need to stay with 5.10.21, please go and get 5.10.20 instead (as a precaution) because it seems likely that the bug that caused the problem during the recent outage was first introduced in 5.10.21. If you go to the BOINC downloads page and select the "All versions" option you can see a limited subset of older versions. If you hover your mouse over any of the old links you will be able to see the actual URL where all the old versions are kept (at least you can in Firefox). You can then manually enter the URL, browse the archive and pick out the exact version you need.

At least by using 5.10.20 you shouldn't need to abort anything.

Cheers,
Gary.

Dagorath
Joined: 22 Apr 06
Posts: 146
Credit: 226423
RAC: 0

RE: If you go to the BOINC

Message 91010 in response to message 91009

Quote:
If you go to the BOINC downloads page and select the "All versions" option you can see a limited subset of older versions. If you hover your mouse over any of the old links you will be able to see the actual URL where all the old versions are kept (at least you can in Firefox). You can then manually enter the URL, browse the archive and pick out the exact version you need.

They say http://boinc.berkeley.edu/dl/?C=M;O=D lists every version there ever was.

Bill & Patsy
Joined: 8 Sep 07
Posts: 17
Credit: 5242914
RAC: 0

I too lost credits for work

I too lost credits for work successfully completed during the latest outage, on stock machines (both Windows & Mac) that always validate. Since it was not due to user error and could have been prevented by proper operation on the server side, I've been expecting E@H to do the right thing and issue the credits that were earned. So far nothing has happened.
I'm still waiting for those credits.

--Bill

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118374448822
RAC: 25550031

RE: ... I've been expecting

Message 91012 in response to message 91011

Quote:
... I've been expecting E@H to do the right thing and issue the credits that were earned. So far nothing has happened.
I'm still waiting for those credits.

It would be rather hard for the project to "do the right thing" because your BOINC client (version 5.10.21 to 5.10.45 - I'm assuming this because your computers are hidden) is the culprit and not the project. If you read through this entire thread right from the beginning you will find - in a number of posts - enough information to properly understand this nasty BOINC bug. In short, the specified client versions caused your result file(s) to be deleted when it/they failed to upload during the outage and so a later 'report' will fail with a validate error since the required result file(s) is/are not on the server. They were deleted (permanent upload failure) by your client during the outage. The result(s) will have to be crunched again (by someone) before credit can be awarded.

This problem was identified and reported by me two outages ago and so the clear message is that if you are running one of the affected versions, either downgrade to 5.10.20 or upgrade to 6.2.x or later.
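
If you look after a number of hosts, a quick check along these lines may help. It is only a sketch of the version range named in this thread (5.10.21 to 5.10.45 inclusive), and the names parse_version and in_reported_range are made up for illustration. A False result just means the version is outside the range reported here, not that it has been verified as safe.

```python
# Range of client versions reported in this thread as deleting result
# files after a failed upload. Taken from the posts above, not from the
# BOINC source, so treat the bounds as an assumption.
FIRST_REPORTED = (5, 10, 21)
LAST_REPORTED = (5, 10, 45)


def parse_version(text):
    """Turn a dotted version string such as '5.10.45' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def in_reported_range(version_text):
    """Return True if the version lies within 5.10.21 .. 5.10.45 inclusive."""
    return FIRST_REPORTED <= parse_version(version_text) <= LAST_REPORTED
```

For example, in_reported_range("5.10.28") gives True, while "5.10.20" and "6.6.20" both give False. Tuples of integers are compared rather than raw strings so that 5.10.9 sorts below 5.10.21 as it should.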

My first post about this gave a link to information in the trac wiki and in subsequent posts in this same thread the affected versions were identified. If you want to know the gory details, read carefully through this entire thread again. Then change your BOINC version.

It may be a small comfort for you to know that I got badly bitten - probably close to 1000 results in total over three outages now. I lost relatively little this time but next time I'll lose absolutely zero :-).

Cheers,
Gary.

archae86
Joined: 6 Dec 05
Posts: 3161
Credit: 7272811730
RAC: 1826442

I've previously tried 6.n.n

I've previously tried 6.n.n clients, and found them unacceptable to me. So I mostly had 5.10.45 with a couple of 5.10.20's in my small flotilla. Following advice in this thread, for people in my circumstance, I backed the 5.10.45 hosts down to 5.10.20 about a week ago after losing a good bit of work in the recent project downtime.

However, for one of my hosts which used to have a nasty habit of killing all four executing tasks on the first project communication after client restart, and some other bad behaviors, that pattern returned with my reversion to 5.10.20. So for that single host I've decided to prefer the risk of occasionally losing a lot of work on a project site outage to the daily drain of poorer behavior.

I post this just in case someone else may have such a host, and find this shared experience useful. By the way, I have no idea what the special problem is on that one host. All my machines run WinXP Pro, and generally carry a pretty similar load of other software. One even shares the same model of motherboard with this one. So quite likely the root problem is a configuration issue on my discrepant host, to which 5.10.20 reacts less gracefully than 5.10.45, rather than an outright 5.10.20 bug later fixed.
