Unlucky validation error :(

ohiomike
Joined: 4 Nov 06
Posts: 80
Credit: 6,453,639
RAC: 0

RE: RE: Hi! This one's

Message 64119 in response to message 64117

Quote:
Quote:

Hi!

This one's funny (the first one to be replicated to 4 PCs that I noticed):

http://einsteinathome.org/workunit/33712959

First computer to receive is running Win98, wingman is on XP, and they didn't agree.

Third one was a Mac, which seemed not to agree to either result.

Then the workunit got sent to "RiversideCityCampus", which is best described as a supermassive black hole for workunits, because workunits are sucked into this site and disappear forever :-)

Now my veteran Pentium III got the workunit, and it's running on Linux, the third OS involved. Hopefully it will validate against the Darwin Mac, or else the WU will reach "initial replication 5" and be sent to yet another host.

Come on little Workunit, hang in there!! You'll finally make it to the science database :-) :-)

CU

BRM

YESSS! The WU finally validated (Darwin + Linux). I'm sorry for the Windows users.... ;-)

The upside to this is that I noticed "RiversideCityCampus" has no new WUs assigned to it!
PS- You do have to feel sorry for the Windows boxes that used 9 days to get 0 credit. My guess is they are not real happy. I've gotten upset enough over 18-22 hours to get 0 credit that I pulled 7 of 8 cores off of Einstein.


Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3,515
Credit: 450,913,931
RAC: 101,733

RE: At this point I'm

Message 64120 in response to message 64118

Quote:


At this point I'm starting to worry about the environmental impact of these validation problems, screw the credit... I hope Al Gore doesn't peruse this forum ;-)


There's an item in the ClimatePrediction.Net FAQ about the environmental impact of DC (as an answer to the obvious question whether the climate models used at CPDN take CO2 emissions by BOINCers into account). They computed the energy spent for CPDN in equivalents of cups of tea prepared (what else), and found it to be negligible.

CU

BRM

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,128
Credit: 36,942,483,615
RAC: 37,827,829

RE: RE: YESSS! The WU

Message 64121 in response to message 64119

Quote:
Quote:

YESSS! The WU finally validated (Darwin + Linux). I'm sorry for the Windows users.... ;-)

PS- You do have to feel sorry for the Windows boxes that used 9 days to get 0 credit. My guess is they are not real happy....

Your guess is extremely accurate :).

I happen to be the owner of the original XP box that was teamed up with the Win98 box - these two having the original disagreement. I noticed it a couple of weeks ago quite by accident when I was just looking at some outstanding results that hadn't validated. At that stage the WU had been sent to the third box (Darwin) but had not been returned. I was puzzled as to why two Windows boxes should be having a disagreement and was sufficiently interested to keep following the progress to see what would happen when the Darwin box reported in. I was quite astonished to see the three-way disagreement that ensued. Soon afterwards Bikeman received this WU on his Linux box and reported the "interesting" nature of this "growing impasse" :).

As a bit of background, that XP box of mine is a 1Gig PIII HP Vectra VL400 that has been crunching 24/7 for about the last 15 months and has not had any previous validation issues that I'm aware of. Its accumulated credit total is 86K so it has certainly done a lot of work. It is not overclocked and, being winter here, ambient temps are a lot lower than they were a few months ago. There is no obvious reason for the validation problem, apart from the validator itself of course :).

When Bikeman first reported this "unusual" validation issue, I had meant to find time to add more to his original comments but I've been rather busy with more pressing things. I still don't have the time now but I'm angry enough about the sheer waste of it all to say something.

As some of you reading this would be aware, I reported some time ago, here and here, the problem of PIII boxes running Windows being about 30% slower than the same machines running Linux. Since those reports I have continued switching my PIII and AMD machines to Linux and the current count of my Linux boxes is now more than 40. Only about another 50 to go :). The gain in productivity has been averaging 30% or more over all PIII architectures (Katmai, Coppermine and Tualatin), and around 40-45% for AMD Athlon XP and AMD64. When I started doing the conversions my RAC was around 14K and it is now around 20K, but I have added some new boxes as well so it's not just the Linux conversion :).

I have now had time to do a survey of a number of my Linux boxes to get an idea of how bad the validation problem really is (for me anyway). I've been through the results lists of about 15 Linux boxes picked at random. I've examined 103 results in total from those boxes and found 15 marked as "invalid" or "checked but no consensus yet" (which almost invariably end up invalid).

So it seems that I've gone to all this trouble to gain 30% in efficiency only to see around 15% of it immediately disappear into the validator black hole. To say I'm annoyed is an understatement as this issue has been going on for far too long and nothing seems to be happening to fix it.
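To put rough numbers on that, here is a minimal worked sketch using only the figures quoted above (103 results, 15 invalid, roughly 30% speed-up). It assumes, purely for illustration, that the Windows baseline loses nothing to invalid results, which is a simplification and not something the survey measured:

```python
# Figures taken from the survey above; the baseline assumption is illustrative only.
total_results = 103      # results examined
invalid_results = 15     # "invalid" or "checked but no consensus yet"
linux_speedup = 0.30     # reported productivity gain from switching to Linux

# Fraction of returned work that earns no credit.
invalid_rate = invalid_results / total_results

# Credited throughput relative to a Windows baseline of 1.0
# (assumed here to have no invalid results at all).
net_throughput = (1 + linux_speedup) * (1 - invalid_rate)
net_gain = net_throughput - 1

print(f"invalid rate: {invalid_rate:.1%}")
print(f"net gain over baseline: {net_gain:.1%}")
```

Under that assumption the invalid rate works out to about 14.6%, and the 30% speed-up shrinks to a net gain of about 11%.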

If you take a look at this results list, which comes from one of Bruce's boxes that I picked at random, you will find that out of 9 results listed, two were awarded no credit and a third has no consensus yet. Seems like a terrible waste to me. Even if it is impossible to get the validator to behave better, surely it must be possible to change the scheduler to segregate the sending of work. For instance, once a new data file is sent to a Windows box, only send additional copies to further Windows boxes (i.e. check the platform before sending). Linux (and everything else) could be lumped together as a second category. Surely this would have a dramatic effect on the number of invalid results being produced.
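The segregation idea could be sketched roughly as follows. This is only an illustration of the proposal, not BOINC's or the project's actual scheduler code; `platform_class`, `WorkUnit` and `try_assign` are hypothetical names made up for the example:

```python
from dataclasses import dataclass
from typing import Optional


def platform_class(os_name: str) -> str:
    """Map a host OS name to one of two classes: 'windows' or 'other'.

    Uses startswith so that 'Darwin' (which contains 'win') is not misclassified.
    """
    return "windows" if os_name.lower().startswith("win") else "other"


@dataclass
class WorkUnit:
    wu_id: int
    # Class of the first host this workunit was sent to; None until first send.
    assigned_class: Optional[str] = None


def try_assign(wu: WorkUnit, host_os: str) -> bool:
    """Return True if the workunit may be sent to a host running host_os.

    The first send fixes the workunit's class; later copies only go to
    hosts in the same class.
    """
    cls = platform_class(host_os)
    if wu.assigned_class is None:
        wu.assigned_class = cls
        return True
    return wu.assigned_class == cls


wu = WorkUnit(wu_id=1)
print(try_assign(wu, "Windows XP"))  # first send fixes the class
print(try_assign(wu, "Win98"))       # same class, allowed
print(try_assign(wu, "Linux"))       # different class, refused
```

With this rule, a workunit first sent to a Windows box would never end up paired with a Linux or Darwin host, at the cost of a smaller pool of eligible wingmen for each workunit.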

I really am quite pissed off about the sheer waste that is currently happening.

EDIT: At least Bikeman did say he was sorry so I don't plan to assassinate him just yet!! :) ;).

Cheers,
Gary.

DanNeely
Joined: 4 Sep 05
Posts: 1,305
Credit: 1,587,250,012
RAC: 992,961

RE: So it seems that I've

Message 64122 in response to message 64121

Quote:
So it seems that I've gone to all this trouble to gain 30% in efficiency only to see around 15% of it immediately disappear into the validator black hole. To say I'm annoyed is an understatement as this issue has been going on for far too long and nothing seems to be happening to fix it.

As per the win4.23 beta thread, the remaining intermittent client error problem is apparently being treated as the highest priority for the dev team.

http://einsteinathome.org/node/192867

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3,515
Credit: 450,913,931
RAC: 101,733

RE: EDIT: At least

Message 64123 in response to message 64121

Quote:


EDIT: At least Bikeman did say he was sorry so I don't plan to assassinate him just yet!! :) ;).

THX for not killing me :-).

BTW I know how you feel, I've lost many WUs myself to this issue. Bernd said that Reinhard Prix is working on it.

As to "homogeneous redundancy" (pairing only hosts with the same OS): If that had been done from the start, we would never have discovered this problem. I think if the software doesn't validate on one platform against the software on another platform, the client software or the validator has to be fixed. Pairing only hosts of the same OS will do away with the symptom, but the underlying problem still remains.

BRM

Brian Silvers
Joined: 26 Aug 05
Posts: 772
Credit: 282,700
RAC: 0

RE: I think if the software

Message 64124 in response to message 64123

Quote:
I think if the software doesn't validate on one platform against the software on another platform, the client software or the validator has to be fixed. Pairing only hosts of the same OS will do away with the symptom, but the underlying problem still remains.

Mostly agreed, but bear in mind that 100% validation should not be the goal. There will certainly still be some portion of results, no matter how small that portion may be, that should not validate against other results. These are the "true" invalid results, issues caused by overclocking too far, overheating, etc... As it stands right now, there is no way to know exactly why a result was marked invalid.

I don't know if someone here is still opposed to my idea of providing better feedback for invalid results, but one of the volunteer developers over at SETI saw some merit to what I was trying to get across, although they didn't think it was wise to add the burden on the system (specifically SETI's system) while it was still having so many other issues...

Brian

zombie67 [MM]
Joined: 10 Oct 06
Posts: 90
Credit: 248,901,523
RAC: 1,097,830

RE: As to "homogenous

Message 64125 in response to message 64123

Quote:
As to "homogeneous redundancy" (pairing only hosts with the same OS): If that had been done from the start, we would never have discovered this problem. I think if the software doesn't validate on one platform against the software on another platform, the client software or the validator has to be fixed. Pairing only hosts of the same OS will do away with the symptom, but the underlying problem still remains.

The problem will exist whether HR is turned on or not. So we might as well turn HR on and get the credit due.

Reno, NV
Team: SETI.USA

Dave Burbank
Joined: 30 Jan 06
Posts: 275
Credit: 1,548,376
RAC: 0

RE: The problem will exist

Message 64126 in response to message 64125

Quote:

The problem will exist whether HR is turned on or not. So we might as well turn HR on and get the credit due.

But there are still quite a number of similar-platform validation disagreements.

http://einsteinathome.org/workunit/33925402

There are 10^11 stars in the galaxy. That used to be a huge number. But it's only a hundred billion. It's less than the national deficit! We used to call them astronomical numbers. Now we should call them economical numbers. - Richard Feynman

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,128
Credit: 36,942,483,615
RAC: 37,827,829

RE: As per the win4.23

Message 64127 in response to message 64122

Quote:


As per the win4.23 beta thread, the remaining intermittent client error problem is apparently being treated as the highest priority for the dev team.

http://einsteinathome.org/node/192867

Yes, in that thread Bernd actually said:

Quote:

My highest priority is still stability issues, i.e. client errors. The "validation problems" that may be fixed by this very App are only a few (i.e. those that occur in the sky position fields of the output - in case someone cares). Reinhard Prix is currently looking into the differences in the "significance field", which are probably the largest part.

This seems to imply that validation problems are being looked at but are sort of "on the back burner" for the moment. BTW, validation problems have been around since the year dot - as an example check out this thread from more than two years ago. It's actually very worthwhile to read the thread fully as it contains useful commentary as to why this is a compiler problem and not an OS problem. In particular, take a look at this and this post for relevant information. There's also a post by Bruce near the end of the thread indicating that a team member was looking into validation issues and would get back to us in due course. I don't think that ever happened so it's obviously a difficult issue to make progress on, to say the least. My recollection is that there are always validation problems with each new run that is commenced but that they soon get reduced to a point where the incidence is quite low. This time however there doesn't yet seem to be any reduction in the rate at which these problems are occurring.

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,128
Credit: 36,942,483,615
RAC: 37,827,829

RE: As to "homogenous

Message 64128 in response to message 64123

Quote:

As to "homogeneous redundancy" (pairing only hosts with the same OS): If that had been done from the start, we would never have discovered this problem.

That's a bit of a long bow to draw :).

The problem has been well known right from the time that the project first opened its doors and I imagine that HR was put into BOINC in the first place to allow projects to take care of these annoying discrepancies between different platforms, if they wanted to.

Quote:
I think if the software doesn't validate on one platform against the software on another platform, the client software or the validator has to be fixed.

In theory, yes! But it seems just about impossible in practice.

Quote:
Pairing only hosts of the same OS will do away with the symptom, but the underlying problem still remains.

The underlying "problem" is not really a problem that needs to be solved. Consider this scenario:-

1. A WU gets sent to two Windows boxes and they both send back results that validate - and are thus entered into the database.

2. The exact same WU is sent to two Linux boxes and they both send back results that validate - and are thus entered into the database.

3. However the results of 1. and 2. are "different" enough to cause the validator to reject them if the pairing had been W/L instead of W/W or L/L.

Is this a real problem?? Apparently not, as (because of the preponderance of Windows boxes) the database is probably full of W/W pairings, a significant fraction of which would have failed if they had been W/L pairings. Nobody is suggesting that there is any problem with what is in the database. So using HR to get rid of the annoying consequences of this "non-problem" would seem to me to be the "right" thing to do.
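The three-step scenario above can be sketched with made-up numbers. This is not the project's validator; the `agree` helper, the tolerance and the result values are all assumptions chosen only to show how a small platform-dependent offset can pass W/W and L/L comparisons yet fail W/L:

```python
import math

# Hypothetical tolerance; the real validator's criteria are not known here.
REL_TOL = 1e-3


def agree(a: float, b: float, rel_tol: float = REL_TOL) -> bool:
    """Treat two results as matching if they are within a relative tolerance."""
    return math.isclose(a, b, rel_tol=rel_tol)


# Invented results for the same WU: each platform is self-consistent,
# but the two platforms differ by a small systematic offset.
windows_results = [1.0000, 1.0001]
linux_results = [1.0100, 1.0101]

print("W/W:", agree(windows_results[0], windows_results[1]))
print("L/L:", agree(linux_results[0], linux_results[1]))
print("W/L:", agree(windows_results[1], linux_results[0]))
```

With these values the W/W and L/L pairs agree while the W/L pair does not, even though no individual host did anything wrong.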

Cheers,
Gary.
