{CurlY BracketS}
Joined: 9 Feb 05
Posts: 4
Credit: 899,111
RAC: 0
12 Oct 2010 9:39:01 UTC
Topic 195378
Can anybody explain to me why there are WUs with a minimum quorum of 21 that have been sent to only 2 clients, e.g. p2030_53925_58521_0173_G201.47+00.42.C_6.dm_270?
Arecibo Binary Pulsar Search (STSP) min quorum 21
To inhibit validation until the results have arrived on the correct server. See this news and this thread.
BM
Well, I don't understand all of this, but thanks anyway...
I have to agree with you. Bernd, I don't understand that either. In the News Thread you said "tasks that were issued with the wrong upload URL" and that this was the problem; now you seem to be saying that you did it on purpose. I think I am misunderstanding what you are saying. The basic question was: if you need 21 results to verify a unit, why are you only sending it out to 2 people initially? You said "tasks that were issued with the wrong upload URL", and that answer doesn't seem to match the question.
Also doesn't this cause those units to be 'in the system' FOREVER? And for us users to then wait forever for a unit to be validated and credits granted? I mean, if it takes 21 valid returns and you are only sending 2 units out once every 2 weeks, we are talking FOREVER before the credits are granted. Isn't this going to cause your database to have to hold these units essentially in limbo until all 21 VALID results are returned to you? If so, I hope you have a lot of hard drives and a very fast server!
"Also doesn't this cause those units to be 'in the system' FOREVER?"
No.
Let me try to summarize--though I'm just a participant.
Point one: In the conversion from 4-fold to 10-fold Work Units for ABP2 an error was made (the "wrong URL").
Point two: a consequence of that error was that this work (falsely) failed validation on return. A consequence of the validation failure was that the work was reissued--so if left unfixed, we not only had people not getting credit for good work (possibly fixable later), but also had people repeating work already done to no useful purpose (pure waste). This was a large-scale episode, involving many tens of thousands of results at least.
Point three: The 21/2 condition of some WUs is an interim control measure--and is neither the original condition of the WUs in question nor their intended final state. While active, it "freezes" returned work--so it is not found falsely invalid and inappropriate reissue does not happen. The 21 part is an intentionally unmeetable but temporary condition--so validation is not attempted.
Point four (this part I understand least): In some sense, sending the appropriate information to the right place after results have already been returned allows correct validation to occur. This is currently a batch process (I believe an initial attempt to cover the full incident proved too large for the infrastructure to tolerate). You as a user may see this as a sudden conversion of many returned 21/2 results, sitting in Pending limbo, to Valid, credited results over a period of a few hours. I think this happened most recently a day or so ago, and that Bernd plans one final round in about a week, after which all the original problem work will have gone past deadline expiry.
To mods and officials: I was just trying to summarize for those happening to read this thread. I'd welcome any correction, or simple deletion of my post if you see a better way of helping understanding.
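To make the "freeze" in point three concrete, here is a minimal sketch of the two decisions involved. It is a simplified model, not BOINC's actual server code: the class, the function names and the example values are assumptions made for illustration, and only the ideas of a minimum quorum and a replication target come from the thread.

```python
from dataclasses import dataclass


@dataclass
class WorkUnit:
    """Simplified, hypothetical model of a workunit (not the real BOINC schema)."""
    name: str
    min_quorum: int          # successful results needed before validation is tried
    target_nresults: int     # number of replicas the server tries to have issued
    n_success: int = 0       # results already returned successfully
    n_in_progress: int = 0   # results sent out but not yet returned


def ready_for_validation(wu: WorkUnit) -> bool:
    """Validation is only attempted once the quorum has been reached."""
    return wu.n_success >= wu.min_quorum


def replicas_to_issue(wu: WorkUnit) -> int:
    """New tasks are only created while fewer than the target exist."""
    return max(0, wu.target_nresults - (wu.n_success + wu.n_in_progress))


# A workunit in the "21/2" state with both of its tasks already returned:
held = WorkUnit("example_wu", min_quorum=21, target_nresults=2, n_success=2)

# The same workunit with an assumed normal quorum of 2:
normal = WorkUnit("example_wu", min_quorum=2, target_nresults=2, n_success=2)

print(ready_for_validation(held), replicas_to_issue(held))
print(ready_for_validation(normal), replicas_to_issue(normal))
```

In this model the held workunit is neither validated (2 results are fewer than 21) nor replicated further (2 tasks already exist), which is the "on hold" behaviour described above.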
archae86 is correct.
The main point is that I intentionally modified the minimum quorum (and, for that matter, initial replication) outside a reasonable range to put certain workunits on hold, i.e. to inhibit validation and further replication.
It is true that while in this state the tasks will not be credited and will stay in the system. But I'll restore the original values for both parameters at some point, and let BOINC continue finishing these workunits the normal way.
The reason for putting these workunits on hold is that their first tasks were sent out with the wrong "upload URL", so after crunching the clients would upload the files to the wrong server. We put the affected workunits on hold until we had transferred the affected results to the correct server. After that, we released the workunits again by restoring the original settings.
Originally ~70,000 workunits were put on hold that way, and by now ~45,000 have been released again, and their tasks correctly validated and credited.
BM
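The hold-and-release cycle described above can be sketched as follows. This is only an illustration under stated assumptions, not the script actually used on the project: `HoldRegistry`, its methods and the original values of 2 and 2 are hypothetical, and the hold values of 21 and 2 simply mirror the "21/2" state mentioned in the thread.

```python
from dataclasses import dataclass


@dataclass
class WorkUnit:
    """Hypothetical, simplified workunit record."""
    name: str
    min_quorum: int
    target_nresults: int


class HoldRegistry:
    """Remembers the original parameters of workunits that were put on hold."""

    def __init__(self) -> None:
        self._saved: dict[str, tuple[int, int]] = {}

    def put_on_hold(self, wu: WorkUnit, hold_quorum: int, hold_replication: int) -> None:
        # Save the originals only once, so a repeated call cannot overwrite them.
        if wu.name not in self._saved:
            self._saved[wu.name] = (wu.min_quorum, wu.target_nresults)
        wu.min_quorum = hold_quorum
        wu.target_nresults = hold_replication

    def release(self, wu: WorkUnit) -> None:
        # Restore both parameters so processing can continue the normal way.
        wu.min_quorum, wu.target_nresults = self._saved.pop(wu.name)

    def on_hold_count(self) -> int:
        return len(self._saved)


registry = HoldRegistry()
wu = WorkUnit("example_wu", min_quorum=2, target_nresults=2)  # assumed originals

registry.put_on_hold(wu, hold_quorum=21, hold_replication=2)
# ... results are transferred to the correct server while the workunit is held ...
registry.release(wu)
```

Keeping the saved values separate from the workunit itself is a design choice of this sketch: it makes the release step a plain restore that does not need to guess what the parameters were before the hold.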
Now it starts to make sense, thanx archae86 for clarifying things.
I totally agree, thanks from me too!!