Validate errors (after scheduler outage)

M. Schmitt
Joined: 27 Jun 05
Posts: 478
Credit: 15872262
RAC: 0
Topic 194698

I see a bunch of validate errors on 2 of my hosts and more on partner hosts. WUs were reported back in a timeframe between ca. 30 Dec 2009 12:33:23 UTC and 31 Dec 2009 11:49:51 UTC. Don't know when they were uploaded.

Not all WUs reported back at that time are invalid.

Some Upload errors on server side?

Happy new year,
Michael

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117363351228
RAC: 35715206

Quote:
I see a bunch of validate errors on 2 of my hosts and more on partner hosts. WUs were reported back in a timeframe between ca. 30 Dec 2009 12:33:23 UTC and 31 Dec 2009 11:49:51 UTC. Don't know when they were uploaded.


I have a host that also had a number of validate errors in the same timeframe. Seems very likely that all these were caused by the server problems around that time.

Quote:
Not all WUs reported back at that time are invalid.


Same with my host. The difference is that your hosts don't seem to have had any further examples of these whilst my host continues to have one or two a day.

Quote:
Some Upload errors on server side?


I would have thought so. Because of the more recent examples, I've sent a note to Bernd about it. A possible cause is "uploaded files that have become 'lost' on the server" but also I believe it is possible for the error to be caused by files being reported too soon after being uploaded so that the validator gets called before the files have been properly received.

Quote:
Happy new year,
Michael


Same to you and all others reading these boards.

Cheers,
Gary.

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6588
Credit: 313429260
RAC: 223184

I've had a couple of validate errors ( here and here ) on my i7 box ( thanks for pointing that out Gary! ) in that timeframe. I have no explanation other than they are bracketed by some client/compute errors - which I suspected/assumed were related to local power outages. I had two vigorous thunder/lightning storms, each of some 4 to 6 hours duration, which took out our house supply ( beginning in the mid evening, ~ 9pm, of 30/12 and 31/12, local time @ UTC + 11 ).

[ The DSL line to the house normally goes off air in electrical storms, even if the house doesn't go out. We're just on the 4km limit from the local exchange - and riding on ~ 60 year old copper with the odd inductive choke or two still in situ. ]

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

M. Schmitt
Joined: 27 Jun 05
Posts: 478
Credit: 15872262
RAC: 0

Message 96210 in response to message 96208

Quote:
Quote:
I see a bunch of validate errors on 2 of my hosts and more on partner hosts. WUs were reported back in a timeframe between ca. 30 Dec 2009 12:33:23 UTC and 31 Dec 2009 11:49:51 UTC. Don't know when they were uploaded.

I have a host that also had a number of validate errors in the same timeframe. Seems very likely that all these were caused by the server problems around that time.

Quote:
Not all WUs reported back at that time are invalid.

Same with my host. The difference is that your hosts don't seem to have had any further examples of these whilst my host continues to have one or two a day.


Usually I have no invalid WUs at all.

Quote:
Quote:
Some Upload errors on server side?

I would have thought so. Because of the more recent examples, I've sent a note to Bernd about it. A possible cause is "uploaded files that have become 'lost' on the server" but also I believe it is possible for the error to be caused by files being reported too soon after being uploaded so that the validator gets called before the files have been properly received.


I use a cc_config.xml file to report results immediately, but this has never caused any problems. Example from the log of my root server (UTC+1):

31-Dec-2009 13:29:38 [Einstein@Home] Started upload of p2030_53652_86203_0114_G67.66+00.70.N_4.dm_254_0_0
31-Dec-2009 13:29:39 [Einstein@Home] Finished upload of p2030_53652_86203_0114_G67.66+00.70.N_4.dm_254_0_0
31-Dec-2009 13:30:15 [Einstein@Home] Sending scheduler request: To report completed tasks.
31-Dec-2009 13:30:15 [Einstein@Home] Reporting 1 completed tasks, not requesting new tasks
31-Dec-2009 13:30:20 [Einstein@Home] Scheduler request completed

As one can see, there is enough time (36 seconds) between the upload finishing and the report, but the WU is invalid.

This one, reported only 2 seconds after its upload finished, is valid:
31-Dec-2009 12:28:09 [Einstein@Home] Started upload of p2030_53652_86203_0114_G67.66+00.70.N_4.dm_261_0_0
31-Dec-2009 12:28:10 [Einstein@Home] Finished upload of p2030_53652_86203_0114_G67.66+00.70.N_4.dm_261_0_0
31-Dec-2009 12:28:12 [Einstein@Home] Sending scheduler request: To report completed tasks.
31-Dec-2009 12:28:12 [Einstein@Home] Reporting 1 completed tasks, not requesting new tasks
31-Dec-2009 12:28:17 [Einstein@Home] Scheduler request completed
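
For anyone who wants to check the same gap in their own logs, here is a minimal sketch that measures the time between "Finished upload" and the next "Sending scheduler request" line. It assumes the log format shown above and English month abbreviations; the function name `upload_to_report_gap` is made up for illustration and is not part of BOINC.

```python
from datetime import datetime

# Timestamp layout as it appears in the log excerpts above (assumed English locale).
LOG_FORMAT = "%d-%b-%Y %H:%M:%S"

INVALID_LOG = [
    "31-Dec-2009 13:29:38 [Einstein@Home] Started upload of p2030_53652_86203_0114_G67.66+00.70.N_4.dm_254_0_0",
    "31-Dec-2009 13:29:39 [Einstein@Home] Finished upload of p2030_53652_86203_0114_G67.66+00.70.N_4.dm_254_0_0",
    "31-Dec-2009 13:30:15 [Einstein@Home] Sending scheduler request: To report completed tasks.",
]

VALID_LOG = [
    "31-Dec-2009 12:28:09 [Einstein@Home] Started upload of p2030_53652_86203_0114_G67.66+00.70.N_4.dm_261_0_0",
    "31-Dec-2009 12:28:10 [Einstein@Home] Finished upload of p2030_53652_86203_0114_G67.66+00.70.N_4.dm_261_0_0",
    "31-Dec-2009 12:28:12 [Einstein@Home] Sending scheduler request: To report completed tasks.",
]


def upload_to_report_gap(log_lines):
    """Seconds between the last finished upload and the next scheduler request."""
    finished = None
    for line in log_lines:
        # Everything before the first " [" is the timestamp.
        stamp_text, _, rest = line.partition(" [")
        stamp = datetime.strptime(stamp_text.strip(), LOG_FORMAT)
        if "Finished upload of" in rest:
            finished = stamp
        elif "Sending scheduler request" in rest and finished is not None:
            return (stamp - finished).total_seconds()
    return None  # no upload/report pair found


if __name__ == "__main__":
    print(upload_to_report_gap(INVALID_LOG))  # 36.0
    print(upload_to_report_gap(VALID_LOG))    # 2.0
```

The invalid task had the longer gap of the two, which argues against "reported too soon after upload" as the cause for these particular results.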

Also, some of my BOINC clients have been behaving strangely since that outage. The ones that did WCG jobs without contacting the E@H server at that time are alright. One client, doing a mix of WCG and E@H jobs, did not download any more work for E@H until I did a project reset. Another host also crunches a mix of WCG and E@H jobs, downloading a WCG WU once in a while as it's supposed to. Its E@H WUs are processed, uploaded and reported without any request for new jobs until the very last WUs are in progress. Then this host gets e.g. 96 WUs at once (5-day cache). Very strange! This started with the server outage and only affects clients that had server contact at the problematic time.

cu,
Michael

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 717028521
RAC: 1002173

Hi all!

Thank you all for reporting the problem and for your patience. The problem is now understood and I'll try to summarize what was found out by the administrators and developers:

A rather complex chain of events, involving leftover "file debris" from an earlier server crash (around August) and the recent problems with the transitioner (see news entry on the front page), led to a situation where some results would be processed more than once. That's not fatal, but unfortunately the uploaded "extra results" would then be appended by the BOINC server rather than overwriting the redundant stuff, resulting in expanded siamese-twin results (so to speak) that would not pass validation.
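
To illustrate the append-versus-overwrite effect in general terms, here is a toy sketch. It is not the BOINC server's file handling: `store_upload` and `looks_valid` are hypothetical helpers, and the size check merely stands in for whatever the real validator tests.

```python
import os
import tempfile


def store_upload(path, payload, append):
    """Write an uploaded result, either appending to or replacing any earlier copy."""
    mode = "ab" if append else "wb"
    with open(path, mode) as handle:
        handle.write(payload)


def looks_valid(path, expected_size):
    """Stand-in validity check (an assumption): the file must have the expected size."""
    return os.path.getsize(path) == expected_size


def demo():
    payload = b"result-data\n" * 4
    with tempfile.TemporaryDirectory() as workdir:
        appended = os.path.join(workdir, "appended.out")
        overwritten = os.path.join(workdir, "overwritten.out")
        # The same result arrives twice, as when a task is processed more than once.
        for _ in range(2):
            store_upload(appended, payload, append=True)
            store_upload(overwritten, payload, append=False)
        return (
            looks_valid(appended, len(payload)),
            looks_valid(overwritten, len(payload)),
        )


if __name__ == "__main__":
    print(demo())  # (False, True)
```

The appended file ends up holding two copies back to back, so it no longer matches what a single result should look like, while the overwritten file is unaffected by the duplicate upload.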

The repair work is ongoing and there will be an attempt to grant credit to those hosts that have lost results because of this bug.

Sorry for the inconvenience. ABP1 validation should be back to normal soon.

Happy crunching
Bikeman

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6588
Credit: 313429260
RAC: 223184

For your interest, here is the known list of ABP1 data set name initialisers for which validation error rates were high ( over 20 per data set ). The work unit or task names are based on these:

p2030_54162_52662_0100_G68.30-02.38.C_6
p2030_54162_52661_0100_G68.30-02.38.C_5
p2030_54162_52661_0100_G68.30-02.38.C_4
p2030_54162_52661_0100_G68.30-02.38.C_3
p2030_54162_52661_0100_G68.30-02.38.C_2
p2030_54162_52661_0100_G68.30-02.38.C_1
p2030_54162_52661_0100_G68.30-02.38.C_0
p2030_54162_49902_0073_G61.37-01.65.C_6
p2030_54006_02456_0024_G71.07+02.52.C_5
p2030_54006_02456_0024_G71.07+02.52.C_4
p2030_54006_02455_0024_G71.07+02.52.C_3
p2030_54006_02455_0024_G71.07+02.52.C_2
p2030_54006_02455_0024_G71.07+02.52.C_1
p2030_54006_02455_0024_G71.07+02.52.C_0
p2030_54006_02151_0021_G69.90+02.62.C_6
p2030_54006_02151_0021_G69.90+02.62.C_5
p2030_54006_02151_0021_G69.90+02.62.C_3
p2030_54006_02151_0021_G69.90+02.62.C_2
p2030_54006_02151_0021_G69.90+02.62.C_1
p2030_54006_02151_0021_G69.90+02.62.C_0
p2030_54006_01849_0018_G68.95+02.54.C_6
p2030_54006_01547_0015_G67.80+02.63.C_6
p2030_54006_01547_0015_G67.80+02.63.C_1
p2030_54006_01547_0015_G67.80+02.63.C_0
p2030_53653_00692_0006_G68.57+00.53.N_5
p2030_53653_00692_0006_G68.57+00.53.N_4
p2030_53653_00692_0006_G68.57+00.53.N_3
p2030_53653_00692_0006_G68.57+00.53.N_2
p2030_53653_00692_0006_G68.57+00.53.N_1
p2030_53653_00692_0006_G68.57+00.53.N_0
p2030_53653_00397_0003_G68.32+00.45.N_5
p2030_53653_00397_0003_G68.32+00.45.N_4
p2030_53653_00397_0003_G68.32+00.45.N_3
p2030_53653_00397_0003_G68.32+00.45.N_2
p2030_53653_00397_0003_G68.32+00.45.N_1
p2030_53653_00397_0003_G68.32+00.45.N_0
p2030_53653_00098_0000_G67.87+00.53.N_3
p2030_53653_00098_0000_G67.87+00.53.N_2
p2030_53653_00098_0000_G67.87+00.53.N_1
p2030_53653_00098_0000_G67.87+00.53.N_0
p2030_53652_86203_0114_G67.66+00.70.N_5
p2030_53652_86203_0114_G67.66+00.70.N_4
p2030_53652_86203_0114_G67.66+00.70.N_3
p2030_53652_86203_0114_G67.66+00.70.N_2
p2030_53652_86203_0114_G67.66+00.70.N_1
p2030_53652_86203_0114_G67.66+00.70.N_0
p2030_53652_85904_0111_G67.21+00.78.N_5
p2030_53652_85904_0111_G67.21+00.78.N_4
p2030_53652_85904_0111_G67.21+00.78.N_1
p2030_53652_85904_0111_G67.21+00.78.N_0
p2030_53652_85609_0108_G67.01+00.95.N_6
p2030_53652_85609_0108_G67.01+00.95.N_5
p2030_53652_85609_0108_G67.01+00.95.N_4
p2030_53652_85608_0108_G67.01+00.95.N_3
p2030_53652_85608_0108_G67.01+00.95.N_2
p2030_53652_85608_0108_G67.01+00.95.N_1
p2030_53652_85608_0108_G67.01+00.95.N_0
p2030_53652_85309_0105_G67.16+00.53.N_6
p2030_53652_85309_0105_G67.16+00.53.N_5
p2030_53652_85309_0105_G67.16+00.53.N_4
p2030_53652_85309_0105_G67.16+00.53.N_3
p2030_53652_85309_0105_G67.16+00.53.N_2
p2030_53652_85309_0105_G67.16+00.53.N_1
p2030_53652_85309_0105_G67.16+00.53.N_0
p2030_53652_85013_0102_G66.96+00.70.N_1
p2030_53652_85013_0102_G66.96+00.70.N_0
p2030_53652_85012_0102_G66.96+00.70.N_6
p2030_53652_85012_0102_G66.96+00.70.N_5
p2030_53652_85012_0102_G66.96+00.70.N_4
p2030_53652_85012_0102_G66.96+00.70.N_2
p2030_53651_01043_0009_G72.85-00.87.N_6
p2030_53651_01043_0009_G72.85-00.87.N_5
p2030_53651_01043_0009_G72.85-00.87.N_4
p2030_53651_01043_0009_G72.85-00.87.N_1
p2030_53651_01043_0009_G72.85-00.87.N_0
p2030_53651_00544_0003_G70.28-00.55.N_6
p2030_53651_00544_0003_G70.28-00.55.N_5
p2030_53651_00544_0003_G70.28-00.55.N_4
p2030_53651_00544_0003_G70.28-00.55.N_3
p2030_53651_00544_0003_G70.28-00.55.N_2
p2030_53651_00544_0003_G70.28-00.55.N_1
p2030_53651_00544_0003_G70.28-00.55.N_0
p2030_53651_00207_0000_G68.73+00.11.N_6
p2030_53651_00207_0000_G68.73+00.11.N_5
p2030_53651_00207_0000_G68.73+00.11.N_4
p2030_53651_00207_0000_G68.73+00.11.N_2
p2030_53651_00206_0000_G68.73+00.11.N_1
p2030_53651_00206_0000_G68.73+00.11.N_0
p2030_53648_01405_0012_G70.95-00.79.N_6
p2030_53648_01405_0012_G70.95-00.79.N_5
p2030_53648_01405_0012_G70.95-00.79.N_4
p2030_53648_01405_0012_G70.95-00.79.N_3
p2030_53648_01405_0012_G70.95-00.79.N_2
p2030_53648_01405_0012_G70.95-00.79.N_1
p2030_53648_01405_0012_G70.95-00.79.N_0
p2030_53618_08624_0076_G71.16-00.96.N_6
p2030_53618_08324_0073_G70.49-00.71.N_6
p2030_53618_08026_0070_G69.98-00.88.N_6
p2030_53614_08687_0078_G74.64-00.28.N_6
p2030_53614_08383_0075_G73.48+00.04.N_6
p2030_53614_07774_0069_G71.58+00.11.N_6
p2030_53614_07774_0069_G71.58+00.11.N_1
p2030_53614_07774_0069_G71.58+00.11.N_0
p2030_53614_07469_0066_G70.82-00.13.N_6
p2030_53614_07469_0066_G70.82-00.13.N_5
p2030_53614_07469_0066_G70.82-00.13.N_4
p2030_53614_07469_0066_G70.82-00.13.N_1
p2030_53614_07469_0066_G70.82-00.13.N_0
p2030_53614_07166_0063_G69.70+00.20.N_6
p2030_53614_07166_0063_G69.70+00.20.N_5
p2030_53614_07166_0063_G69.70+00.20.N_4
p2030_53614_07166_0063_G69.70+00.20.N_1
p2030_53614_07166_0063_G69.70+00.20.N_0
p2030_53614_06858_0060_G68.94-00.06.N_6
p2030_53614_06858_0060_G68.94-00.06.N_5
p2030_53614_06858_0060_G68.94-00.06.N_4
p2030_53614_06858_0060_G68.94-00.06.N_1
p2030_53614_06858_0060_G68.94-00.06.N_0
p2030_53614_06553_0057_G67.82+00.28.N_6
p2030_53614_06553_0057_G67.82+00.28.N_5
p2030_53614_06553_0057_G67.82+00.28.N_4
p2030_53614_06553_0057_G67.82+00.28.N_1
p2030_53614_06553_0057_G67.82+00.28.N_0
p2030_53614_06245_0054_G67.07+00.03.N_6
p2030_53614_06245_0054_G67.07+00.03.N_5
p2030_53614_06245_0054_G67.07+00.03.N_4
p2030_53614_06245_0054_G67.07+00.03.N_1
p2030_53614_06245_0054_G67.07+00.03.N_0
p2030_53613_09599_0015_G69.03+00.44.N_6
p2030_53613_09599_0015_G69.03+00.44.N_5
p2030_53613_09599_0015_G69.03+00.44.N_4
p2030_53613_09299_0012_G68.28+00.19.N_6
p2030_53613_09299_0012_G68.28+00.19.N_5
p2030_53613_09299_0012_G68.28+00.19.N_4
p2030_53613_09299_0012_G68.28+00.19.N_1
p2030_53613_09299_0012_G68.28+00.19.N_0
p2030_53613_08999_0009_G67.52-00.06.N_6
p2030_53613_08999_0009_G67.52-00.06.N_5
p2030_53613_08999_0009_G67.52-00.06.N_4
p2030_53613_08999_0009_G67.52-00.06.N_1
p2030_53613_08999_0009_G67.52-00.06.N_0

So if you look at this task you will note the full task name p2030_53613_09299_0012_G68.28+00.19.N_1.dm_591_1 having the leading text p2030_53613_09299_0012_G68.28+00.19.N_1, which appears in the listing above.

Here's a further listing of name initialisers with a prominent validation error rate, but which remain suspect pro tem, being logged as having been sent out only once ( unlike the first list above ):

p2030_53651_00207_0000_G68.73+00.11.N_3
p2030_53651_01043_0009_G72.85-00.87.N_2
p2030_53651_01043_0009_G72.85-00.87.N_3
p2030_53652_85012_0102_G66.96+00.70.N_3
p2030_53652_85904_0111_G67.21+00.78.N_2
p2030_53652_85904_0111_G67.21+00.78.N_3
p2030_53653_00099_0000_G67.87+00.53.N_4
p2030_53653_00099_0000_G67.87+00.53.N_5
p2030_54006_01245_0012_G66.85+02.55.C_0
p2030_54006_01849_0018_G68.95+02.54.C_0
p2030_54006_01849_0018_G68.95+02.54.C_1
p2030_54006_02151_0021_G69.90+02.62.C_4

Similarly, if you look at this task you will note the full task name p2030_53653_00099_0000_G67.87+00.53.N_4.dm_211_1 having the leading text p2030_53653_00099_0000_G67.87+00.53.N_4, which again appears in the listing.

If you have a task/WU that has been marked invalid, for the period around New Year or since to date, that does NOT appear to be in either list above then please post details here in this thread, for our examination ( preferably including a link to the relevant task/WU detail page, or at least the full name of the task/WU and your computer's ID ).

[ I think the simplest way is to copy the task/WU base name, as per the above patterning, from the task/WU detail page of an invalid result and just paste into your browser's text search facility to try and find it within the listings given above ]
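
If you have many invalid tasks to check, the lookup can be scripted. The following is a rough sketch only: it assumes the base name is everything before ".dm_" (inferred from the two examples above), and `FIRST_LIST` and `SECOND_LIST` hold just a few sample entries that you would replace with the full listings. The helper names are made up for illustration.

```python
# A few sample entries only; paste in the complete listings from this post.
FIRST_LIST = {
    "p2030_53613_09299_0012_G68.28+00.19.N_1",
    "p2030_53652_86203_0114_G67.66+00.70.N_4",
}
SECOND_LIST = {
    "p2030_53653_00099_0000_G67.87+00.53.N_4",
}


def base_name(task_name):
    """Strip the '.dm_...' suffix, leaving the data set name initialiser."""
    base, separator, _ = task_name.partition(".dm_")
    return base if separator else task_name


def which_list(task_name):
    """Say which listing, if any, a task's base name appears in."""
    base = base_name(task_name)
    if base in FIRST_LIST:
        return "first list"
    if base in SECOND_LIST:
        return "second list"
    return "not listed"


if __name__ == "__main__":
    for name in (
        "p2030_53613_09299_0012_G68.28+00.19.N_1.dm_591_1",
        "p2030_53653_00099_0000_G67.87+00.53.N_4.dm_211_1",
        "p2030_00000_00000_0000_G00.00+00.00.N_0.dm_1_0",  # made-up name, not a real task
    ):
        print(name, "->", which_list(name))
```

Anything that comes back as "not listed" is the kind of task worth posting in this thread.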

Our valiant technical staff ( yeah Bernd, Oliver and Ben!! ) will do their best, as time permits, to manually award due credit to those affected by this validation ker-fuffle. :-) :-)

Cheers, Mike.

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6588
Credit: 313429260
RAC: 223184

Note this message from Bernd.

Cheers, Mike.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250257386
RAC: 35279

Message 96214 in response to message 96212

Quote:
For your interest here is the known list of ABP1 data set name initialisers for which validation error rates were high ( over 20 per data set ).

This lists the workunits that we have confirmed to have been processed twice. Tasks from these workunits that resulted in validate errors will be granted credit manually (hopefully tomorrow).

Quote:
Here's a further listing of those name initialisers with a prominent validation error rate, but remain as suspect pro-tem being logged as only having been sent out once ( unlike the first list above ):

The tasks from these workunits need further investigation; there might be something else wrong.

BM

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250257386
RAC: 35279

Message 96215 in response to message 96214

Credit has been granted for all validate errors from the tasks named above. This might be a bit too generous, but I couldn't look through all of them. Take it as a late xmas present.

BM

M. Schmitt
Joined: 27 Jun 05
Posts: 478
Credit: 15872262
RAC: 0

Message 96216 in response to message 96215

Quote:

Credit has been granted for all validate errors from the tasks named above. This might be a bit too generous, but I couldn't look through all of them. Take it as a late xmas present.

BM

Thanks, all but one WU got credits. :-)

cu,
Michael
