Quorum of one?

svincent
Joined: 24 Oct 05
Posts: 5
Credit: 167049
RAC: 0
Topic 196051

The Clean Energy Project, a subproject of the World Community Grid (they're looking for next-generation organic photovoltaics), is moving essentially from a quorum of 2 to a quorum of 1: results from hosts deemed reliable are accepted without redundancy checking, with occasional workunits selected at random for double-checking. The thread is here: https://secure.worldcommunitygrid.org/forums/wcg/viewthread_thread,31985 (unfortunately you need a WCG account to read it).

Is there a reason why a similar approach wouldn't work for Einstein@home or one of the pulsar-searching spinoff projects? It must be a very rare event for two reliable hosts to return results for the same workunit that both validate yet actually differ.
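For illustration, the scheme described above could be sketched roughly like this (Python; the threshold and spot-check rate are made-up values, and this is not the actual WCG or BOINC server code):

```python
import random

RELIABLE_THRESHOLD = 10   # consecutive valid results before a host counts as "reliable" (assumed value)
SPOT_CHECK_RATE = 0.05    # fraction of reliable-host workunits still double-checked (assumed value)

def initial_quorum(consecutive_valid_results: int) -> int:
    """Decide how many replicas a workunit sent to this host needs."""
    if consecutive_valid_results >= RELIABLE_THRESHOLD:
        # Reliable host: usually accept a single result, but randomly
        # re-replicate a small sample to catch hosts that have gone bad.
        return 2 if random.random() < SPOT_CHECK_RATE else 1
    return 2  # unproven hosts always get full redundancy checking
```

The point of the random spot checks is that a host's reliability is only ever an estimate from past behaviour, so some ongoing sampling is needed to detect hardware that has started producing bad results.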

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250357424
RAC: 35896

Quorum of one?

AFAIK by now most projects use what is known as "adaptive replication", which means accepting results from "reliable" hosts without further validation.

For the GW search this was discussed in the LVC (the LIGO-Virgo Collaboration, the scientific community behind Einstein@home) at least twice that I remember, and each time it was strongly voted against.

The BRP search does a lot of its computation on GPUs, which are numerically less reliable than e.g. CPUs (note that its invalid-result rate is 20x that of the other searches). As the results from Einstein@home are used directly for targeting re-observations, the requirements on the correctness of the results are somewhat higher.

Finally, our youngest application, for the FGRP search, hasn't yet reached the reliability at which we would dare to take the results without comparison.
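The numerical-reliability point matters because results from different hardware rarely match bit-for-bit, so redundant results have to be compared with a tolerance rather than by equality. A minimal sketch of such a fuzzy comparison (Python; the tolerance value is hypothetical, and real validators compare application-specific candidate lists, not raw floats):

```python
def results_match(a, b, rel_tol=1e-5):
    """Compare two candidate result lists element-wise with a relative tolerance.

    A looser tolerance lets legitimate CPU/GPU rounding differences
    validate, while still rejecting genuinely wrong computations.
    """
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if abs(x - y) > rel_tol * max(abs(x), abs(y), 1.0):
            return False
    return True
```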

BM


Filipe
Joined: 10 Mar 05
Posts: 186
Credit: 405038403
RAC: 418151

RE: AFAIK by now most

Quote:

AFAIK by now most projects use what is known as "adaptive replication", which means accepting results from "reliable" hosts without further validation.

For the GW search this was discussed in the LVC (the LIGO-Virgo Collaboration, the scientific community behind Einstein@home) at least twice that I remember, and each time it was strongly voted against.

The BRP search does a lot of its computation on GPUs, which are numerically less reliable than e.g. CPUs (note that its invalid-result rate is 20x that of the other searches). As the results from Einstein@home are used directly for targeting re-observations, the requirements on the correctness of the results are somewhat higher.

Finally, our youngest application, for the FGRP search, hasn't yet reached the reliability at which we would dare to take the results without comparison.

BM

Is this still not viable?
Are we still seeing a high rate of invalids from GPUs?

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0

RE: Are we still

Quote:

Are we still seeing a high rate of invalids from GPUs?

I would say yes, very high at times.

See http://einstein6.aei.uni-hannover.de/EinsteinAtHome/download/BRP6-progress/ for one example. BRP4 is similar.

Filipe
Joined: 10 Mar 05
Posts: 186
Credit: 405038403
RAC: 418151

And for CPU dedicated

And for CPU-dedicated searches, as is now the case for S6GW and FGRP?

Would it be possible?

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0

RE: And for CPU dedicated

Quote:
And for CPU-dedicated searches, as is now the case for S6GW and FGRP?

http://einstein.phys.uwm.edu/server_status.html

It certainly shows that even for CPU searches the invalid rates are high; S6BucketFU1UB is around 20%, which was higher than I expected.

Jim1348
Joined: 19 Jan 06
Posts: 463
Credit: 257957147
RAC: 0

Insofar as I can remember, I

Insofar as I can remember, I have never had an S6Bucket Follow-up #2 or a Parkes PMPS XT v1.52 (BRP6-cuda32-nv301) task fail.
http://einsteinathome.org/host/11368189/tasks
http://einsteinathome.org/host/11671653/tasks

The possible exceptions might be if the machine crashes for other reasons, but that is quite rare now. So it would seem to me that some machines are more susceptible to failures than others.

In the case of GPUs, this is easy to understand: gamers frequently overclock their cards and assume that because their games don't crash, the cards are good for Einstein. It happens on every project; they don't realize that scientific calculations are a different story, and it takes some education to get the majority up to speed.

In the case of CPUs, overclocking problems are also possible, but I suspect chip or OS incompatibilities are more likely; some projects just seem to run better on one type than another.

But isn't the real question whether the machines that DO complete get it right often enough? The outright failures are easy to spot and are eliminated anyway. If the successful runs are always "good" at the scientific level, then the scheme svincent mentions above should work here too. It might be worth a study comparing the machines that return successful results to see whether the quorum is really necessary.

Therefore, some machines may be more reliable than others and can be "trusted", if their results are good often enough however you define that. I believe that on CEP2 there are also periodic re-evaluations, to ensure that machines are still providing good results.
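The trust-plus-re-evaluation idea could be sketched like this (Python; the threshold is made up, and this is only an illustration of the concept, not how CEP2 or BOINC actually implements it):

```python
class HostTrust:
    """Track whether a host has earned single-result ('trusted') status."""

    TRUST_THRESHOLD = 10  # consecutive valid results needed (assumed value)

    def __init__(self):
        self.consecutive_valid = 0
        self.trusted = False

    def record_result(self, valid: bool) -> None:
        if valid:
            self.consecutive_valid += 1
            if self.consecutive_valid >= self.TRUST_THRESHOLD:
                self.trusted = True
        else:
            # A single bad result, e.g. from a failed periodic spot
            # check, drops the host back to full redundancy checking.
            self.consecutive_valid = 0
            self.trusted = False
```

The key design choice is that trust is easy to lose and slow to regain, so a machine that silently degrades (overheating, failing memory) gets demoted after its first bad spot check.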
