How are repeated failures handled?

Ben Scott
Joined: 30 Mar 20
Posts: 53
Credit: 1,281,779,987
RAC: 4,001,226
Topic 229935

I have been looking at the system's attempts to validate my All Sky GW results and it is truly a tale of woe and misery. Is there a point where computers that simply can't run them get blacklisted for that app? Thank you.

Keith Myers
Joined: 11 Feb 11
Posts: 4,864
Credit: 18,276,398,930
RAC: 6,337,255

Yes, that should happen eventually by the standard BOINC server software.

BUT, Einstein does NOT use standard BOINC server software. They roll their own, and I don't think I've ever seen a host get excluded for perpetual errors on an application. They just get put into a 24-hour timeout to try again and again . . . ad nauseam.

Scrooge McDuck
Joined: 2 May 07
Posts: 977
Credit: 17,011,107
RAC: 8,345

It's a waste of energy (on the faulty host's side). But failed tasks are reissued to other hosts by the Einstein server. A limit of 20 is set for most tasks, i.e. a maximum of 20 tasks are assigned for a work unit. If there is still no valid result after this, the work unit is marked as faulty. This happens very rarely, e.g. when workunit generation is incorrectly configured at the start of a new science run (different raw data, new app versions, etc.). The power crunchers here quickly find out and report something like that.
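
For anyone who wants the reissue rule spelled out, here is a minimal Python sketch of it. It is only an illustration, not the Einstein server's actual code: the names `WorkUnit`, `record_result`, `MAX_TOTAL_TASKS` and `QUORUM` are made up for this example, the cap of 20 is the figure quoted above, and the assumption that two good results validate a work unit comes from tasks going out in pairs. Real validation compares results against each other, which is skipped here.

```python
from dataclasses import dataclass

MAX_TOTAL_TASKS = 20  # cap quoted in the post above; the name is hypothetical
QUORUM = 2            # assumption: two agreeing results validate a work unit


@dataclass
class WorkUnit:
    returned: int = 0          # tasks issued and returned so far
    good_results: int = 0      # returned tasks that did not fail
    status: str = "in_progress"


def record_result(wu: WorkUnit, ok: bool) -> str:
    """Record one returned task and update the work unit's status."""
    if wu.status != "in_progress":
        return wu.status       # already decided, nothing more to do
    wu.returned += 1
    if ok:
        wu.good_results += 1
    if wu.good_results >= QUORUM:
        wu.status = "validated"
    elif wu.returned >= MAX_TOTAL_TASKS:
        wu.status = "faulty"   # cap reached without a valid pair
    return wu.status
```

With these assumptions, a host that errors out on every task costs the work unit one of its 20 attempts each time, and the task is simply reissued to another host until a valid pair exists or the cap is hit.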

Ben Scott
Joined: 30 Mar 20
Posts: 53
Credit: 1,281,779,987
RAC: 4,001,226

I've got a ton of tasks going back a few days now with the wingman "unsent" status. Shouldn't the tasks be going out in pairs?

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,869
Credit: 114,871,034,164
RAC: 30,260,936

Ben Scott wrote:
... Shouldn't the tasks be going out in pairs?

They always do, as long as there is a suitable host to which the second copy can be sent.

GW tasks use 'Locality Scheduling'.  There is a huge number of large data files to be sent out to cover the complete parameter space, so a host requesting work is assigned a relatively small subset of the full set of those large data files.  There are a large number of tasks that can use each small subset of data files, so once a set is allocated, hosts won't need a different set until the first one is exhausted.

The second member of any given pair of tasks can't be sent out until there is a second host with the same subset of data files that your host has.  It's totally impractical to send all the data to all the hosts.  We are talking about huge quantities of data which is why Locality scheduling was 'invented' when the project first started.  It's pretty much unique to Einstein.

Normally, that doesn't create too much of a problem.  This time, the number of hosts that have a GPU with enough VRAM to handle the tasks is severely limited.  So you see two separate unfortunate consequences.

Firstly you may have a quorum partner that doesn't have enough VRAM - something that you mentioned in a previous message - where the tasks just fail.

Secondly you may have no viable partner at all until someone with enough VRAM just happens to 'use up' their current set of large data files and then be allocated the same set that you have.  Then the scheduler will be able to issue a second task that can be validated against yours.

You just have to be patient until that second situation arises.

Cheers,
Gary.
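
The two situations described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how the Einstein scheduler is actually implemented: `Host`, `classify_wingman`, the `data_set` labels and the `min_vram_gb` threshold are all hypothetical names and values chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str
    data_set: str    # label for the subset of large data files the host holds
    vram_gb: float   # GPU memory available on the host


def classify_wingman(mine: Host, others: list[Host], min_vram_gb: float) -> str:
    """Describe what can happen to the second copy of one of my tasks."""
    # Only hosts holding the same subset of data files can receive the copy.
    same_set = [h for h in others
                if h.name != mine.name and h.data_set == mine.data_set]
    if not same_set:
        return "unsent"          # no partner yet: the wingman stays unsent
    if any(h.vram_gb >= min_vram_gb for h in same_set):
        return "can_validate"    # a capable partner shares my data files
    return "partner_fails"       # partner exists but lacks the VRAM
```

In this model the "unsent" status clears only when another host is allocated the same data set, which matches the advice to be patient.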
