Arecibo Binary Pulsar Search (STSP) min quorum 21

{CurlY BracketS}
Joined: 9 Feb 05
Posts: 4
Credit: 899111
RAC: 0
Topic 195378

Can anybody explain to me why there are WUs with a minimum quorum of 21 that have been sent to only 2 clients, e.g. p2030_53925_58521_0173_G201.47+00.42.C_6.dm_270?

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250225978
RAC: 36066

To inhibit validation until the results have arrived on the correct server. See this news and this thread.

BM

{CurlY BracketS}
Joined: 9 Feb 05
Posts: 4
Credit: 899111
RAC: 0

Well, I don't understand all of this, but thanks anyway...

mikey
Joined: 22 Jan 05
Posts: 12663
Credit: 1839062724
RAC: 4264

Message 100000 in response to message 99999

Quote:
Well, I don't understand all of this, but thanks anyway...

I have to agree with you. Bernd, I don't understand that either. In the News thread you said the problem was "tasks that were issued with the wrong upload URL"; now you seem to be saying that you did it on purpose. I think I am misunderstanding what you are saying. The basic question was: if you need 21 results to verify a unit, why are you initially sending it out to only 2 people? "Tasks that were issued with the wrong upload URL" doesn't seem to answer that question.

Also doesn't this cause those units to be 'in the system' FOREVER? And we users then have to wait forever for a unit to be validated and credits granted? I mean, if it takes 21 valid returns and you are only sending 2 units out once every 2 weeks, we are talking FOREVER before the credits are granted. Isn't this going to force your database to hold these units essentially in limbo until all 21 VALID results are returned to you? If so, I hope you have a lot of hard drives and a very fast server!

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7214814931
RAC: 978164

Quote:
Also doesn't this cause those units to be 'in the system' FOREVER?


No.

Let me try to summarize--though I'm just a participant.

Point one: In the conversion from 4-fold to 10-fold Work Units for ABP2 an error was made (the "wrong URL").

Point two: a consequence of that error was that this work failed validation (falsely) on return. A consequence of the validation failure was that new issue was made--so if unfixed we not only had people not getting credit for good work (possibly fixable later), but had people repeating work already done to no useful purpose (pure waste). This was a large-scale episode, involving many tens of thousands of results at least.

Point three: The 21/2 condition of some WUs is an interim control measure--and is neither the original condition of the WUs in question nor their intended final state. While active, it "freezes" returned work--so it is not found falsely invalid and inappropriate reissue does not happen. The 21 part is an intentionally unmeetable but temporary condition--so validation is not attempted.

Point four: (this part I understand least). In some sense appropriate information sent to the right place after results are already returned allows correct validation to occur. This is currently a batch process (I believe an initial attempt to cover the full incident proved too large for the infrastructure to tolerate). You as a user may see this as a sudden conversion of many 21/2 results returned but sitting in Pending limbo to Valid, credited results over a few hour period. I think this happened most recently a day or so ago, and that Bernd plans one final round in about a week, after which all the original problem work will have gone past deadline expiry.
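
As an illustration of point three, here is a minimal Python sketch of why a 21/2 workunit is neither validated nor reissued. It is a simplified model, not BOINC's actual server code: the `WorkUnit` class, its field names and both helper functions are made up for this example, and the "normal" quorum of 2 is an assumption.

```python
from dataclasses import dataclass


@dataclass
class WorkUnit:
    # Hypothetical, simplified stand-in for a workunit record.
    min_quorum: int          # successful results needed before validation is attempted
    target_nresults: int     # how many tasks the server tries to keep in play
    n_success: int = 0       # results already returned successfully
    n_in_progress: int = 0   # tasks sent out but not yet returned


def needs_validation(wu: WorkUnit) -> bool:
    """Validation is only attempted once the quorum has been reached."""
    return wu.n_success >= wu.min_quorum


def tasks_to_issue(wu: WorkUnit) -> int:
    """New tasks are only created while fewer than the target are in play."""
    return max(0, wu.target_nresults - (wu.n_success + wu.n_in_progress))


# Assumed normal settings: two results returned, quorum of 2 reached.
normal = WorkUnit(min_quorum=2, target_nresults=2, n_success=2)

# Same workunit while "frozen": the quorum of 21 cannot be met by 2 results,
# and the target of 2 is already satisfied, so nothing new is sent out.
held = WorkUnit(min_quorum=21, target_nresults=2, n_success=2)

print(needs_validation(normal), needs_validation(held), tasks_to_issue(held))
```

In this model the held workunit simply sits there: no validation is attempted, so nothing can be marked falsely invalid, and no replacement tasks are generated.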

To mods and officials: I was just trying to summarize for those happening to read this thread. I'd welcome any correction, or simple deletion of my post if you see a better way of helping understanding.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250225978
RAC: 36066

archae86 is correct.

The main point is that I intentionally set the minimum quorum (and, for that matter, initial replication) to values outside any reasonable range in order to put certain workunits on hold, i.e. to inhibit validation and further replication.

It is true that while in this state the tasks will not be credited and will stay in the system. But I'll restore the original values for both parameters at some point, and let BOINC continue finishing these workunits the normal way.

The reason for putting these workunits on hold is that their first tasks were sent out with the wrong "upload URL", so after crunching the clients would upload the files to a wrong server. We put the affected workunits on hold until we transferred the affected results to the correct server. After that, we released the workunits again by restoring the original settings.

Originally ~70,000 workunits were put on hold that way, and by now ~45,000 have been released again, and their tasks correctly validated and credited.
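
The hold-and-release procedure described above could be sketched roughly as follows. This is only an illustrative Python model under stated assumptions, not the scripts actually used on the project: `WorkUnit`, `put_on_hold`, `release`, `release_batch`, the `saved` field and the batch size are all hypothetical, and for simplicity only the minimum quorum is changed while a workunit is on hold.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

HOLD_QUORUM = 21  # intentionally unmeetable value, as in this thread


@dataclass
class WorkUnit:
    # Hypothetical, simplified stand-in for a workunit record.
    name: str
    min_quorum: int
    target_nresults: int
    saved: Optional[Tuple[int, int]] = None  # original settings while on hold


def put_on_hold(wu: WorkUnit) -> None:
    """Remember the original settings, then make the quorum unreachable."""
    if wu.saved is None:
        wu.saved = (wu.min_quorum, wu.target_nresults)
        wu.min_quorum = HOLD_QUORUM


def release(wu: WorkUnit) -> None:
    """Restore the original settings so normal processing can resume."""
    if wu.saved is not None:
        wu.min_quorum, wu.target_nresults = wu.saved
        wu.saved = None


def release_batch(held: List[WorkUnit], batch_size: int) -> List[WorkUnit]:
    """Release the first batch_size workunits; return those still on hold."""
    for wu in held[:batch_size]:
        release(wu)
    return held[batch_size:]


# Assumed original settings (quorum 2, replication 2) for five example workunits.
wus = [WorkUnit(f"wu_{i}", min_quorum=2, target_nresults=2) for i in range(5)]
for wu in wus:
    put_on_hold(wu)

still_held = release_batch(wus, batch_size=2)
print(len(still_held), wus[0].min_quorum, wus[4].min_quorum)
```

Releasing in batches rather than all at once keeps the number of workunits that become eligible for validation at the same time bounded.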

BM

{CurlY BracketS}
Joined: 9 Feb 05
Posts: 4
Credit: 899111
RAC: 0

Now it starts to make sense, thanx archae86 for clarifying things.

mikey
Joined: 22 Jan 05
Posts: 12663
Credit: 1839062724
RAC: 4264

Quote:
Now it starts to make sense, thanx archae86 for clarifying things.

I totally agree, thanks from me too!!
