Validate Error - Why?

M. Schmitt
Joined: 27 Jun 05
Posts: 478
Credit: 15872262
RAC: 0
Topic 193913

2 different hosts got a validate error on the same WU. The workunit gets sent out again. I have no idea what happened; it just shows again why I really don't like these long workunits.

cu,
Michael

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 715089290
RAC: 938691

Quote:

2 different hosts got a validate error on the same WU. The workunit gets sent out again. I have no idea what happened; it just shows again why I really don't like these long workunits.

cu,
Michael

Thanks for reporting, this looks a bit suspicious. Usually when two hosts fail to validate, the results are in state "no consensus yet" and the third one will match one of the submitted results. But in this case both results are invalid, which must mean that both failed individually. Possible, but somewhat unlikely. I forwarded this to the team so they can take a close look at this one.

The units are near the far end of the frequency range, so I would not rule out an as-yet-undiscovered bug in the validator.

CU
Bikeman

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2951336948
RAC: 689776

"Validate error" != "Checked, but no consensus yet"

"Validate error" usually means that no result data file is found on the server for the validator to do its work on - we see it at SETI, often when the BOINC 'report' stage follows too quickly after the result 'upload' stage. But to lose two result files this way, on two different days, is indeed unusual and unfortunate.
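To make the distinction concrete, here is a toy sketch of how the two states differ. It is not the project's or BOINC's actual validator code: the function name `classify_results`, the state strings, and the exact-match comparison are all assumptions made for illustration (a real validator would compare results with some tolerance). A result whose file is missing on the server is modelled as `None`.

```python
from typing import Dict, List, Optional, Sequence


def classify_results(outputs: Sequence[Optional[str]], quorum: int = 2) -> List[str]:
    """Toy model of result states for one workunit (hypothetical, not BOINC code).

    outputs[i] is the content of host i's result file, or None if the
    server has no file for the validator to work on.
    """
    # Count how many of the files that are present agree with each other.
    counts: Dict[str, int] = {}
    for out in outputs:
        if out is not None:
            counts[out] = counts.get(out, 0) + 1

    # A canonical result exists once enough matching files have arrived.
    canonical = next((o for o, n in counts.items() if n >= quorum), None)

    states = []
    for out in outputs:
        if out is None:
            # Nothing to check at all: this is the "validate error" case.
            states.append("validate error")
        elif canonical is None:
            # Files were compared, but not enough of them agree yet.
            states.append("checked, but no consensus yet")
        elif out == canonical:
            states.append("valid")
        else:
            states.append("invalid")
    return states


print(classify_results([None, None]))    # both files missing
print(classify_results(["a", "b"]))      # two files that disagree
print(classify_results(["a", "b", "a"])) # third result settles it
```

In this model, two missing files give two validate errors straight away, while two files that merely disagree stay pending until a third result arrives.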

Alinator
Joined: 8 May 05
Posts: 927
Credit: 9352143
RAC: 0

Hmmm...

Agreed, this looks like the backend lost the output files from both hosts. If it was a CBNC, they should still be in the 'Pending' state.

Alinator

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 715089290
RAC: 938691

I suspect some problem with the unzipping code. I brought this to Bernd's attention. Stay tuned, and again, thanks for the report!!

CU
Bikeman

Alinator
Joined: 8 May 05
Posts: 927
Credit: 9352143
RAC: 0

Yep, that makes sense. I noticed there was some backend 'gagging' happening off and on with the project a few days ago or so, but fortunately for me none of my hosts had to upload or report any work then. ;-)

Obviously, other folks weren't so lucky. :-(

Just an observation about backend failures and other issues like this wrt EAH: my records show that EAH's overall reliability and task failure rate for all reasons is at least an order of magnitude better than any other project I run, regardless of task runtime length.

Alinator

Arion
Joined: 20 Mar 05
Posts: 147
Credit: 1626747
RAC: 0

Message 85055 in response to message 85051

Quote:

"Validate error" != "Checked, but no consensus yet"

"Validate error" usually means that no result data file is found on the server for the validator to do its work on - we see it at SETI, often when the BOINC 'report' stage follows too quickly after the result 'upload' stage. But to lose two result files this way, on two different days, is indeed unusual and unfortunate.

I had 4 or 5 of these right after the changeover last month. I think the difference was that it validated on 2 other computers (sent out to a 3rd one after mine came back as invalid). Looking at the WU details, it exited as it was supposed to. I mentioned something about this last month when others were reporting the same thing (there was a whole thread asking everyone to report any of these). I don't know what ever happened with them, as they seem to have disappeared during the last week. I don't know if I was ever given credit for them or not, but that's the way it is sometimes.

Just remembered there were a couple that were invalid with all 3 hosts as well.

M. Schmitt
Joined: 27 Jun 05
Posts: 478
Credit: 15872262
RAC: 0

Thanks for your explanations. My wingman's host got 3 more errors at the same time.
The time between uploading and reporting must have been more than long enough.
My host is a root server which only reports when requesting new work.
So I also guess there was some kind of server trouble.

Too bad,
Michael

[edit] upload«»report

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117317728173
RAC: 35906891

The word from Bernd is that the problem was caused by a new validator which was installed and run for about a minute. During that time most results that it handled were marked as invalid. Needless to say, the old validator was put back into service immediately until the problem with the new one can be diagnosed.

Bernd says that he will cause the trashed results to be fixed (re-validated) as soon as he gets a chance.

Cheers,
Gary.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250213197
RAC: 35627

The 61 workunits (and their results) affected have been marked for validation again. Re-checking should happen in a few minutes in most cases; in a few it will have to wait until the extra tasks already sent out have come back.

BM

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2951336948
RAC: 689776

Quote:
2 different hosts got a validate error .....


I see you've assigned ATLAS to re-validate that original WU, Bernd! Now that's what I call service.
