Unlucky validation error :(

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,128
Credit: 36,940,096,230
RAC: 37,786,157

RE: I don't know if

Message 64129 in response to message 64124

Quote:


I don't know if someone here is still opposed to my idea of providing better feedback for invalid results, but one of the volunteer developers over at SETI saw some merit to what I was trying to get across, although they didn't think it was wise to add the burden on the system (specifically SETI's system) while it was still having so many other issues...

Brian

I'm sure absolutely nobody is opposed to this suggestion of yours :). The problem is how to arrive at this better feedback. The validator would have to be a far more intelligent beast than it currently is. At the moment it is smart enough to recognise that there is a difference between some result pairings but I'm sure it doesn't have a clue as to what is causing those differences. Therefore it can't really make some sort of useful addition to the "checked but no consensus yet" outcome that it currently reports. Even if the validator started saying that parameter X was Y% different in the two results, would that really constitute better feedback? You would probably only get better feedback if the validator could make some authoritative statement about the possible reasons for the difference and that is where the extra intelligence would be needed.

In the distant past when Bruce was able to post more regularly, he referred to "tweaking" or "relaxing" the parameters that the validator was using to make the judgement. He also said that it wasn't easy to arrive at a good compromise and so he would err on the side of allowing a small fraction of "good" results to be marked invalid rather than missing "bad" results.
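To make the trade-off concrete, here is a minimal sketch of a tolerance-based comparison that also reports the "parameter X was Y% different" style of feedback discussed above. It is not the project's validator; the function names, the parameter names and the single `tolerance` threshold are all assumptions made for illustration.

```python
def relative_difference(a, b):
    """Relative difference between two numbers, 0.0 when both are zero."""
    scale = max(abs(a), abs(b))
    return 0.0 if scale == 0 else abs(a - b) / scale


def compare_results(result_a, result_b, tolerance):
    """Compare two results given as {parameter_name: value} dicts.

    Hypothetical helper: returns (agree, report), where report maps each
    parameter to its relative difference, or None if one result lacks it.
    """
    report = {}
    for name in sorted(set(result_a) | set(result_b)):
        if name not in result_a or name not in result_b:
            report[name] = None  # parameter missing from one result
            continue
        report[name] = relative_difference(result_a[name], result_b[name])
    agree = all(d is not None and d <= tolerance for d in report.values())
    return agree, report


# Made-up values: the same pair passes or fails depending on the tolerance.
first = {"frequency": 100.0, "power": 2.0}
second = {"frequency": 100.0, "power": 2.1}
strict_ok, strict_report = compare_results(first, second, tolerance=0.01)
relaxed_ok, _ = compare_results(first, second, tolerance=0.05)
```

Relaxing `tolerance` accepts more "good" results but also lets more "bad" ones through, and the report only says how far apart the numbers are, not why.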

Cheers,
Gary.

Gary Roberts

RE: The problem will exist

Message 64130 in response to message 64125

Quote:

The problem will exist if HR is turned on or not. So we might as well turn HR on and get the credit due.

I agree fully with this sentiment. Having said that, the reason why HR is not being used is probably just that it is too difficult to achieve with the way the project currently distributes work.

Difficult or not, I still feel like having a good whinge about the current waste of what are probably quite correct results that are being rejected simply because they were done on a (more accurate) Linux box rather than Windows :).

Cheers,
Gary.

Gary Roberts

RE: RE: The problem will

Message 64131 in response to message 64126

Quote:
Quote:

The problem will exist if HR is turned on or not. So we might as well turn HR on and get the credit due.

But there are still quite a number of similar-platform validation disagreements.

http://einsteinathome.org/workunit/33925402

There really aren't very many of these! I would estimate that it is less than 5% of invalidations.

Also, in this particular case, you are quite right to point out that the 3 results in question were done under the same OS - WinXP - 2 on Intel and 1 on AMD. If you look at the full results list of the AMD box you will see that it is a laptop with three "client error (compute error)" results in its current list along with that single invalidation. Can we perhaps imagine "overheating laptop" or at least some hardware issue as the reason behind this particular case?

Cheers,
Gary.

zombie67 [MM]
Joined: 10 Oct 06
Posts: 90
Credit: 248,884,198
RAC: 1,101,850

RE: RE: The problem will

Message 64132 in response to message 64126

Quote:
Quote:

The problem will exist if HR is turned on or not. So we might as well turn HR on and get the credit due.

But there are still quite a number of similar-platform validation disagreements.


Even if it fixed only a portion of the problem, it should be turned on. Something is better than nothing.

Reno, NV
Team: SETI.USA

Brian Silvers
Joined: 26 Aug 05
Posts: 772
Credit: 282,700
RAC: 0

RE: If you look at the

Message 64133 in response to message 64131

Quote:
If you look at the full results list of the AMD box you will see that it is a laptop with three "client error (compute error)" results in its current list along with that single invalidation. Can we perhaps imagine "overheating laptop" or at least some hardware issue as the reason behind this particular case?

Nope. Error 10, which is what that host is getting for the "client error" results, is a problem with the science application checkpointing. Checkpointing improvements are in the 4.23 beta release, so it would be interesting if this particular person picked up the 4.23 release to see if it helped.

Brian

Gary Roberts

RE: Nope. I admire your

Message 64134 in response to message 64133

Quote:

Nope.

I admire your bravery :).

Quote:
Error 10, which is what that host is getting for the "client error" results, is a problem with the science application checkpointing....

Sure, you can read in the stderr.out output that 2 of the three client errors were due to the app being unable to make sense of a saved checkpoint when BOINC was being restarted. If this was being caused by a software bug associated with checkpointing, wouldn't many more people be seeing exactly the same thing? Isn't it more likely (on balance) that some sort of hardware flakiness is occasionally causing a bad checkpoint to be written or reread?

I fully admit that I have no idea of the precise cause of the client errors but I'd tend to think that the major reason is likely to be hardware related. It's also interesting to see the bit of garbage "@his program cannot be run in DOS mode.
$" which is inserted at the end of the output of the third client error. How do you explain that bit as being a software checkpointing bug?

Cheers,
Gary.

Brian Silvers

RE: RE: Nope. I admire

Message 64135 in response to message 64134

Quote:
Quote:

Nope.

I admire your bravery :).

Quote:
Error 10, which is what that host is getting for the "client error" results, is a problem with the science application checkpointing....

Sure, you can read in the stderr.out output that 2 of the three client errors were due to the app being unable to make sense of a saved checkpoint when BOINC was being restarted. If this was being caused by a software bug associated with checkpointing, wouldn't many more people be seeing exactly the same thing? Isn't it more likely (on balance) that some sort of hardware flakiness is occasionally causing a bad checkpoint to be written or reread?

I fully admit that I have no idea of the precise cause of the client errors but I'd tend to think that the major reason is likely to be hardware related. It's also interesting to see the bit of garbage "@his program cannot be run in DOS mode.
$" which is inserted at the end of the output of the third client error. How do you explain that bit as being a software checkpointing bug?

Read this thread over in Problems and Bug Reports

Oh, and it could be because the application spit out the junk about DOS mode while in the midst of crashing, considering it is the final entry in the output.

Yes, there could be hardware problems...but checkpointing is an acknowledged problem at this point, which means all of us are vulnerable to it. Why it happens to some and not others, who knows...
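For what it's worth, here is a minimal sketch of one common way to make checkpoints harder to corrupt and easier to reject when they are bad, whatever the cause: write to a temporary file, swap it into place atomically, and store a checksum that is verified on restart. This is not the science application's actual checkpoint code or format; the file layout and the names `save_checkpoint` and `load_checkpoint` are assumptions made for illustration.

```python
import hashlib
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state as 'sha256-hex newline JSON', replacing path atomically."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest().encode("ascii")
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(digest + b"\n" + payload)
            handle.flush()
            os.fsync(handle.fileno())  # push the data to disk before the swap
        os.replace(tmp_path, path)  # the old checkpoint survives until here
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the saved state, or None if the file is missing or corrupt."""
    try:
        with open(path, "rb") as handle:
            digest, _, payload = handle.read().partition(b"\n")
    except FileNotFoundError:
        return None
    if hashlib.sha256(payload).hexdigest().encode("ascii") != digest:
        return None  # checksum mismatch: caller should restart from scratch
    return json.loads(payload.decode("utf-8"))
```

With something like this, a damaged checkpoint costs a restart of the task rather than a client error, although it still would not tell you whether the damage came from software or hardware.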
