How are repeated failures handled?

Ben Scott

Joined: 30 Mar 20

Posts: 54

Credit: 1836297238

RAC: 2854689

8 Aug 2023 20:35:38 UTC

Topic 229935

I have been looking at the system's attempts to validate my All Sky GW results, and it is truly a tale of woe and misery. Is there a point where computers that simply can't run them get blacklisted for that app? Thank you.

Keith Myers

Joined: 11 Feb 11

Posts: 5055

Credit: 19201095811

RAC: 5890341

8 Aug 2023 21:27:02 UTC

Message 215729

Yes, that should happen eventually with the standard BOINC server software.

BUT, Einstein does NOT use standard BOINC server software. They roll their own, and I don't think I've ever seen a host get excluded for perpetual errors on an application. They just get put into a 24-hour timeout to try again and again . . . ad nauseam.

Scrooge McDuck

Joined: 2 May 07

Posts: 1126

Credit: 18794682

RAC: 11741

9 Aug 2023 10:49:41 UTC

Message 215751

It's a waste of energy (on the faulty host side). But failed tasks are reissued to other hosts by the Einstein server. A quorum of 20 is set for most tasks, i.e. a maximum of 20 tasks are assigned for a work unit. If there is still no valid result after this, the work unit is marked as faulty. This happens very rarely, e.g. when workunit generation is incorrectly configured at the start of a new science run (different raw data, new app versions, etc.). The power crunchers here quickly find out and report something like that.
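
The reissue rule described above can be sketched roughly as follows. This is only a toy model, not the Einstein server code: the cap of 20 comes from the post, while `MIN_QUORUM = 2`, the `WorkUnit` class, the `report` function and the state names are assumptions made up for illustration.

```python
from dataclasses import dataclass, field

MIN_QUORUM = 2         # assumption: two valid results are enough to validate
MAX_TOTAL_TASKS = 20   # the cap of 20 tasks per work unit mentioned above


@dataclass
class WorkUnit:
    """Hypothetical record of the task outcomes reported for one work unit."""
    results: list = field(default_factory=list)
    state: str = "in progress"


def report(wu, outcome):
    """Record one task outcome ("valid" or "error") and return the new state."""
    if wu.state != "in progress":
        return wu.state  # already decided, further reports change nothing
    wu.results.append(outcome)
    if wu.results.count("valid") >= MIN_QUORUM:
        wu.state = "validated"
    elif len(wu.results) >= MAX_TOTAL_TASKS:
        wu.state = "faulty"  # cap reached without enough valid results
    # otherwise the work unit stays "in progress" and another task is issued
    return wu.state
```

In this model a failing host only uses up one of the 20 slots; the work unit itself is marked faulty only if all 20 attempts go by without enough valid results.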

Ben Scott

Joined: 30 Mar 20

Posts: 54

Credit: 1836297238

RAC: 2854689

9 Aug 2023 20:00:57 UTC

Message 215774

I've got a ton of tasks going back a few days now with the wingman "unsent" status. Shouldn't the tasks be going out in pairs?

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5887

Credit: 119443008811

RAC: 25933217

9 Aug 2023 23:10:42 UTC

Message 215783 in response to message 215774

Ben Scott wrote:

... Shouldn't the tasks be going out in pairs?

They always do, as long as there is a suitable host to which the second copy can be sent.

GW tasks use 'Locality Scheduling'. There is a huge number of large data files to be sent out to cover the complete parameter space, so a host requesting work is assigned a relatively small subset of the full set of those large data files. There are a large number of tasks that can use each small subset of data files, so once a set is allocated, hosts won't need a different set until the first one is exhausted.

The second member of any given pair of tasks can't be sent out until there is a second host with the same subset of data files that your host has. It's totally impractical to send all the data to all the hosts. We are talking about huge quantities of data which is why Locality scheduling was 'invented' when the project first started. It's pretty much unique to Einstein.

Normally, that doesn't create too much of a problem. This time, the number of hosts that have a GPU with enough VRAM to handle the tasks is severely limited. So you see two separate unfortunate consequences.

Firstly, you may have a quorum partner that doesn't have enough VRAM - something that you mentioned in a previous message - where the tasks just fail.

Secondly, you may have no viable partner at all until someone with enough VRAM just happens to 'use up' their current set of large data files and then be allocated the same set that you have. Then the scheduler will be able to issue a second task that can be validated against yours.

You just have to be patient until that second situation arises.

Cheers,
Gary.
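
The two situations described in the post above can be illustrated with a small toy model. It is only a sketch under stated assumptions, not the actual scheduler: the function `second_copy_status`, the host dictionaries, the data set labels and the VRAM figures are all hypothetical, and the real scheduler hands work to whichever host happens to ask for it.

```python
def second_copy_status(dataset, needed_vram_gb, first_host, hosts):
    """Classify what can happen to the second task of a pair.

    hosts maps a host name to {"dataset": label, "vram_gb": number}.
    All names and numbers are made up for illustration.
    """
    # Only hosts already holding the same subset of data files are candidates.
    partners = [
        h for name, h in hosts.items()
        if name != first_host and h["dataset"] == dataset
    ]
    if not partners:
        return "unsent"            # no partner holds this data set yet
    if any(h["vram_gb"] >= needed_vram_gb for h in partners):
        return "can validate"      # a partner exists with enough VRAM
    return "sent but fails"        # partners exist, but all lack VRAM


hosts = {
    "ben": {"dataset": "A", "vram_gb": 8},
    "other": {"dataset": "B", "vram_gb": 8},
}
status = second_copy_status("A", 4, "ben", hosts)  # "unsent"
```

Once a host with enough VRAM finishes its current set and is allocated set "A", the same call would return "can validate".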
