> Another question is whether on a screening process like E@H or SETI you really
> need total redundancy at all. On the ZetaGrid project (www.zetagrid.net),
> looking for counter-examples to an unproven mathematical conjecture, the
> project went for limited redundancy. Newbies' work was double-checked, but
> as you returned more and more WUs that checked out OK, your machines were
> awarded a trust factor, and your work was only double-calculated on a random
> basis. About 10% of all WUs were double-crunched overall.
> ...
> So I'd say the minimum plausible redundancy is to calculate everything just 1.1
> times, on average; that is enough to spot rogue machines (whether through
> malice or malfunction). A more comfortable value is to calculate everything
> twice, and only call for a third result when needed to arbitrate. Based on
> either of these minima there is a redundant amount of redundancy in most BOINC
> projects, built into the basic BOINC infrastructure. While E@H works within
> BOINC, it is not realistic to ask for anything less than 3x.
>
> The change from 3x to 4x is, however, something that E@H could choose to do
> something about. And if you accept that 3x is already overly redundant, the
> initial comment looks very plausible.
Two points in reply:
1. Your 1.1 figure clearly assumes a relatively low hardware failure rate. Given that many machines crunching on BOINC projects are on, and at high CPU load, 24/7, I think the real rate is probably higher (especially since many of these machines are laptops that were never intended to run in such a manner); see the sketch below.
2. You have assumed that all projects are equally tolerant of error. In truth, there are wide-ranging differences across the projects. While we are not privy to the exact nature of these precision differences, it is possible that they affect the level of redundancy required.
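To put toy numbers on both points (the double-check fraction and the failure rate below are illustrative assumptions, not measured project figures): if a fraction d of results is double-checked and each result independently fails with probability f, every needed result costs about 1/(1-f) attempts, so the effective redundancy is roughly (1+d)/(1-f). A rough Python sketch:

    def effective_redundancy(double_check_fraction, failure_rate):
        """Average computations per WU under a ZetaGrid-style trust
        scheme: every WU is crunched once, a fraction of them twice,
        and each needed result takes ~1/(1 - failure_rate) attempts
        because failed or lost results must be reissued."""
        return (1.0 + double_check_fraction) / (1.0 - failure_rate)

    # ZetaGrid's quoted 10% double-checking with perfect hardware:
    print(effective_redundancy(0.10, 0.00))  # 1.10
    # The same scheme if 5% of results fail (24/7 laptops etc.):
    print(effective_redundancy(0.10, 0.05))  # ~1.16

So even modest failure rates push the "1.1 times" floor upward.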
If the reason to send the WU to 4 clients is faster validation, why not change the strategy and make it a more complex process:
Send the WU to 3 clients.
If it is not validated after (1+x) * (slowest turnaround avg) days, send it out to a machine known for a short turnaround time.
(x in the range of 0.2 to 0.5 as a first guess)
This would reduce the amount of redundancy, since a 4th machine is started on the same work for only a fraction of all WUs, not for all of them.
This requires the scheduler to be more complex, I admit.
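A rough Python sketch of what that scheduler logic might look like (all names here are made up for illustration; it assumes the scheduler records when a WU's initial copies went out and knows the slowest average turnaround):

    import time
    from dataclasses import dataclass

    QUORUM = 3      # matching results needed for validation
    X = 0.3         # first guess, in the suggested 0.2-0.5 range
    DAY = 86400.0   # seconds per day

    @dataclass
    class WorkUnit:            # illustrative, not an actual BOINC class
        sent_at: float         # when the initial 3 copies went out
        num_issued: int = QUORUM
        validated: bool = False

    def needs_fourth_copy(wu, slowest_turnaround_avg_days, now=None):
        """True once a WU has gone (1 + X) * (slowest turnaround avg)
        days without validating; the scheduler would then send one
        extra copy to a host with a known short turnaround."""
        now = time.time() if now is None else now
        cutoff = wu.sent_at + (1 + X) * slowest_turnaround_avg_days * DAY
        return not wu.validated and wu.num_issued == QUORUM and now > cutoff

    # A WU sent 14 days ago, against a 10-day slowest-turnaround
    # average, is past the 13-day cutoff and gets a 4th copy:
    wu = WorkUnit(sent_at=time.time() - 14 * DAY)
    print(needs_fourth_copy(wu, slowest_turnaround_avg_days=10))  # True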
> If the reason to send the WU to 4 clients is faster validation, why not change
> the strategy and make it a more complex process:
>
> Send the WU to 3 clients.
> If it is not validated after (1+x) * (slowest turnaround avg) days, send it
> out to a machine known for a short turnaround time.
> (x in the range of 0.2 to 0.5 as a first guess)
>
> This would reduce the amount of redundancy, since a 4th machine is started on
> the same work for only a fraction of all WUs, not for all of them.
>
> This requires the scheduler to be more complex, I admit.
And this is one of the projected changes ... but we still have to finish building the cross-platform BOINC first ....
I asked basically this same question in my "Optimal distribution of work" thread last week, but it didn't seem to generate the same interest.
My point in that thread, which no one here has really stated completely explicitly, is that there are two kinds of redundancy to consider:
1) how many matching results are required for validation
2) how many MORE copies of a work unit than "whatever the answer to (1) is" will be sent out to machines
I do not presume to know the answer to (1). Maybe it's three, maybe it's two, maybe it could be 1.1 or 1.5 in a well-designed system.
I'm puzzled, though, that people would urge that we not second-guess the designers regarding their decision on (2). On its face, there can be only three reasons for distributing one more copy of a work unit than is required for validation:
1) It's to placate childish people who can't stand to have credit pending for long.
2) There isn't enough raw data, so they may as well.
3) The hardware that creates work units and/or stores work units in progress is being pushed to its limits.
(1) is just stupid. It would be a sad comment on the project if it stood to lose more crunching power from people leaving in irritation over how long they sometimes have to wait for their credit than it loses by intentionally assigning work units to more machines than necessary.
As far as I can tell, (2) is not even close to being the case.
If (3) is the case, hopefully this can be improved in the near future, so that only three copies of a work unit (i.e. the exact number required for validation) are sent out at a time, and an extra is sent only when one of those fails or times out, or according to some formula along the lines of what lgkf proposes above.
If I understand right, every participant performs a small experiment. In real fundamental science, every experiment needs to be repeated many times. Of course, a computation and a physical experiment are different things, but Einstein@Home is an important scientific experiment, and I think that four participants on one work unit is not redundancy.
Excuse me if my English is not right... :-(
> If (3) is the case, hopefully this can be improved in the near future, so that
> only three copies of a work unit (i.e. the exact number required for
> validation) are sent out at a time, and an extra is sent only when one of
> those fails or times out, or according to some formula along the lines of
> what lgkf proposes above.
You are still missing one of the points we have been trying to make about this. If you look at the situation from the perspective of a single work unit, you have a complete description of the possible problems.
But if you look at the situation from the perspective of the system as a whole, your description is incomplete. The issuance of one extra Result stems from the fact that we are working in a system that must be tolerant of error. Errors in the production of Results happen with fairly high likelihood, so redundantly issuing one or more extra Results to cover that probability is reasonable.
If we issue 4 when we need 3 and one Participant has a problem with download, processing, or upload, we have already anticipated it and have our "spare" in production. The 4th Result lets us skip over the hiccup without extending the "window" in which the Results are in limbo while we issue a new Result and wait for it to be returned.
If there are further errors in processing, the Work Unit can be stuck waiting for the last Result needed to complete a Quorum for upwards of months ... Many errors are only detected by waiting for the deadline to pass before we issue a new Result. So the "padding" of one extra issue is like the spare tire in a car: we hope we don't have to use it, but it is there if we need it.
Most of the argument is that this extra issue is not needed, so we are squandering the computing resource. But we have that in relative abundance. The main weak point in almost all DC projects is the project servers' capacity, which is constrained by available funding. Any reasonable design would attempt to minimize the impact of the server-side constraints (and BOINC's design in fact does this) ...
Anyway, I don't know how to say it more clearly than this ...
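Maybe some toy arithmetic helps (the failure figures below are assumptions for illustration, not measured E@H numbers). If each issued Result independently fails or misses its deadline with probability f, a 3-of-3 Quorum stalls whenever at least one goes wrong, i.e. with probability 1 - (1-f)^3; with a 4th copy in flight it stalls only when two or more go wrong:

    from math import comb

    QUORUM = 3  # matching results needed for validation

    def p_stall(f, n_issued):
        """P(fewer than QUORUM of the n_issued results come back good),
        assuming each fails independently with probability f."""
        return sum(comb(n_issued, k) * (1 - f) ** k * f ** (n_issued - k)
                   for k in range(QUORUM))

    for f in (0.05, 0.10, 0.20):
        print(f"f={f:.2f}: 3 issued -> {p_stall(f, 3):.1%} stall, "
              f"4 issued -> {p_stall(f, 4):.1%} stall")

At f = 0.10 that is roughly a 27% stall chance with 3 copies against about 5% with 4, so the spare tire gets used far more often than its name suggests.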
Paul's message explains it about as clearly as possible, but I thought I would add an example to illustrate. As Paul noted, the redundancy issue must be looked at from a system-wide perspective and not as a workunit issue in isolation. LHC provides a nice example of why the extra workunit is often a good thing to have. In LHC, current work is generated based upon previously analyzed models, so receiving validated results quickly is essential both for maintaining workunit supplies and for advancing the scientific goals of LHC in a timely manner. While projects such as Einstein and SETI do not generate work based on previous work (at least not yet...), the same system-wide principle, which Paul has articulately laid out, still applies.
Well, OK. Paul says that the probability of errors is "fairly high," and Scott says the extra work unit is "often" needed (much more than a spare tire, evidently!). Fine. Obviously, as the percentage of cases in which at least one copy of a work unit times out or produces an error approaches 100%, it becomes more and more reasonable just to send the work unit to one (or more) extra machines in the first place and write off whatever redundancy results. This will reduce server strain and reduce the chances that, through sheer bad luck, a work unit's progress will be drawn out to ridiculous or scientifically inconvenient lengths of time. If we do have a "relative abundance" of computing power (relative to raw data, or to created work units?), and unavoidable financial constraints on server power, then the point at which this becomes reasonable will be lower.
I don't think I missed any of this. I didn't announce that the present system is wrong. I simply questioned whether it is optimal. It isn't, strictly speaking, but other factors mean that the project has neither the need nor the luxury of reducing redundancy to zero.
> Well, OK. Paul says that the probability of errors is "fairly high," and
> Scott says the extra work unit is "often" needed (much more than a spare tire,
> evidently!). Fine. Obviously, as the percentage of cases in which at least
> one copy of a work unit times out or produces an error approaches 100%, it
> becomes more and more reasonable just to send the work unit to one (or more)
> extra machines in the first place and write off whatever redundancy results.
> This will reduce server strain and reduce the chances that, through sheer bad
> luck, a work unit's progress will be drawn out to ridiculous or scientifically
> inconvenient lengths of time. If we do have a "relative abundance" of
> computing power (relative to raw data, or to created work units?), and
> unavoidable financial constraints on server power, then the point at which
> this becomes reasonable will be lower.
>
> I don't think I missed any of this. I didn't announce that the present system
> is wrong. I simply questioned whether it is optimal. It isn't, strictly
> speaking, but other factors mean that the project has neither the need nor the
> luxury of reducing redundancy to zero.
>
If they had to pay for the crunching, they would reduce unnecessary redundancy in short order. I think they have a very blasé attitude to using free resources efficiently.
> ... If we do have a "relative abundance" of
> computing power (relative to raw data, or to created work units?), and
> unavoidable financial constraints on server power, then the point at which
> this becomes reasonable will be lower.
We have a lot more computing power than almost all of the projects need. LHC@Home, for example, thought they had 1 to 2 years' worth of work when we started ... it lasted about 3-4 months and we ran them dry. This was even with a very low cap on Participants.
I don't know if Matt is keeping track of the tape count in SETI@Home, but he has said that we are doing as much if not more work on the BOINC side of the project than on the Classic side. And we have a lot fewer participants.
> I don't think I missed any of this. I didn't announce that the present system
> is wrong. I simply questioned whether it is optimal. It isn't, strictly
> speaking, but other factors mean that the project has neither the need nor the
> luxury of reducing redundancy to zero.
Um, I was not trying to put you down in any way... I was trying to answer the question you asked and address the point you were trying to make.
You are correct: from the perspective of the work unit/result, this is a sub-optimal strategy. From a project perspective, it is optimal.
> If they had to pay for the crunching, they would reduce unnecessary
> redundancy in short order. I think they have a very blasé attitude
> to using free resources efficiently.
I don't think this is the case at all. They, the project developers, took a pragmatic look at the system and adjusted it for the "best" operation as a system. If they were paying for computing resources, such as a supercomputer, they would not have the same problem set to solve in the design of the system. However, they would have a new one, in that they would still need to process the results redundantly to be sure that no signal is buried. But that is a different problem.