Computers that hoard work units and return none

DanNeely
Joined: 4 Sep 05
Posts: 1364
Credit: 3562358667
RAC: 0

I've had several full weekend

I've had several full-weekend (Friday evening till mid-Monday) outages, and normally try to keep a 3-day base queue. Normally, when I'm going to be away for a few days, I temporarily boost it and download enough extra work to keep my PC running until I'm back, in case something goes wrong. I forgot to do it this last weekend, and ran out. DOH!

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494410
RAC: 0

Well, my ISP was a bit

Well, my ISP was a bit unreliable last summer, but that meant lots of disconnects and smaller outages (1 to 6 hours), never a longer stretch without Internet, so I'm not overly concerned about that kind of problem.
As for project outages, I just take my chances there... I've found Einstein to be really reliable (big thumbs up to the project staff), and in the rare case that something does happen, I'm at least sure my faster box won't be idle, as I've got a climate model which is good for more than 1000 hours of crunching. So I'm confident it stays occupied, and I let BOINC do its job of working out the debts.

Nothing But Idle Time
Joined: 24 Aug 05
Posts: 158
Credit: 289204
RAC: 0

RE: I've had several full

Message 54517 in response to message 54515

Quote:
I've had several full-weekend (Friday evening till mid-Monday) outages, and normally try to keep a 3-day base queue. Normally, when I'm going to be away for a few days, I temporarily boost it and download enough extra work to keep my PC running until I'm back, in case something goes wrong. I forgot to do it this last weekend, and ran out. DOH!


Do you run Rosetta? Some of us have experienced "outages" that have nothing to do with the ISP. Boinc.exe (the client) terminates, and if you open the manager all the tabs are blank. One evening just before bedtime I looked at BOINC Manager and all was good. Twenty minutes later -- after climbing into bed -- the client crashed. The next morning I checked again and nothing had run for 7 hours. So you can have all the queue/cache you like, and your ISP can be rock solid, but that doesn't mean BOINC will be busy while you're away for the weekend.

DanNeely
Joined: 4 Sep 05
Posts: 1364
Credit: 3562358667
RAC: 0

No. I'm 100% Einstein at

No. I'm 100% Einstein at present. Until last weekend, all of my outages have been ISP-related. The situation's really improved over the last year or so, but the summer before that I had three full-weekend ISP outages. Dunno if they didn't have any weekend techs, or if they weren't stocking hardware spares.

Wurgl (speak^Wcrunching for Special: Off-Topic)
Joined: 11 Feb 05
Posts: 321
Credit: 140550008
RAC: 0

RE: I've had several full

Message 54519 in response to message 54515

Quote:
I've had several full-weekend (Friday evening till mid-Monday) outages, and normally try to keep a 3-day base queue. Normally, when I'm going to be away for a few days, I temporarily boost it and download enough extra work to keep my PC running until I'm back, in case something goes wrong. I forgot to do it this last weekend, and ran out. DOH!

And sometimes even a buffer of three days is not enough. Do you remember the outage on Nov. 16th? After that outage, a huge bulk of the fast machines did not connect for ~160 hours, which is about a week.

So, to avoid the computers running out of work during a longer project downtime, one may consider setting a cache of ~8 days. Please think of the people who cannot nurse their machines manually, either because of the huge number of computers (Bruce Allen, Erik A. Espinoza) or because not all computers are accessible (e.g. you are out of the office, on vacation, etc.).

As I remember, this 7-day delay is a gift from the SETI project.

Alinator
Joined: 8 May 05
Posts: 927
Credit: 9352143
RAC: 0

RE: And sometimes even a

Message 54520 in response to message 54519

Quote:

And sometimes even a buffer of three days is not enough. Do you remember the outage on Nov. 16th? After that outage, a huge bulk of the fast machines did not connect for ~160 hours, which is about a week.

So, to avoid the computers running out of work during a longer project downtime, one may consider setting a cache of ~8 days. Please think of the people who cannot nurse their machines manually, either because of the huge number of computers (Bruce Allen, Erik A. Espinoza) or because not all computers are accessible (e.g. you are out of the office, on vacation, etc.).

As I remember, this 7-day delay is a gift from the SETI project.

It isn't really the speed of the machine, the cache size, or even whether you want to nursemaid the machines. The issue is that if a machine cannot make contact with the scheduler, it doesn't take very long (around 12 to 18 hours, IIRC) for the backoff to ramp up to its maximum value. Fast machines are more likely to have this happen merely because they contact the project more frequently, but any host which tries to hit the project within that downtime window will end up with a maxed-out deferral. The only way I know of to prevent it is to intervene manually.

The problem is the maximum delay: it should never be longer than a host's connect-to-network interval, or longer than the estimated work remaining onboard, or more than a random value <= 36 hours if the host has run out completely. I think this would eliminate the majority of the problems for most short-term outages, without having to go to an 8-day cache just to cover the possibility of a rare 1-day outage "forcing" a 168-hour deferral on a project with the reliability of EAH.
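Roughly the sort of thing I have in mind, as a quick Python sketch (the function and parameter names here are just made up for illustration; the real client is C++ and structured quite differently):

import random

HOUR = 3600.0

def capped_deferral(current_backoff_s, connect_interval_s, est_work_remaining_s):
    # Keep the usual exponential backoff, but never defer longer than the
    # host's connect-to-network interval or the estimated work still on
    # board; a host that is completely dry just retries at a random point
    # within 36 hours.
    if est_work_remaining_s <= 0:
        # Out of work: retry at a random time within 36 hours.
        return random.uniform(0, 36 * HOUR)
    next_backoff = max(current_backoff_s * 2, 60.0)   # exponential growth, at least a minute
    cap = min(connect_interval_s, est_work_remaining_s)
    return min(next_backoff, cap)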

In any event, the only way to "guarantee" a host won't run out of work under most circumstances is to run more than one project.

Alinator

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2980697345
RAC: 759353

IIRC, the latest development

IIRC, the latest development versions of BOINC have the maximum delay set to 1 day instead of 7 days.

Unfortunately, the main development focus then switched to the simple GUI, which means this change won't be seen in a 'recommended' version for some time yet - and even then, we'll have a long delay (sic!) before the majority of current users have been persuaded to upgrade.

Jim Milks
Joined: 19 Jun 06
Posts: 116
Credit: 529852
RAC: 0

RE: In any event, the only

Message 54522 in response to message 54520

Quote:


In any event, the only way to "guarantee" a host won't run out of work under most circumstances is to run more than one project.

Alinator

I agree with running more than one project. I primarily run Einstein on my MacBook, with Rosetta suspended unless Einstein goes down. That setup works well for me.

Jim

Alinator
Joined: 8 May 05
Posts: 927
Credit: 9352143
RAC: 0

@ Richard: Yes, I believe

@ Richard:

Yes, I believe that's correct. However, in my case the two machines which are in the "penalty box" aren't running 5.7.x, and the one which is didn't try to hit the project during the outage.

OTOH, while simply dropping the maximum deferral to a day may take care of the "forum flapdoodles" which arise when relatively short outages cause excessive deferrals, I wonder if it won't aggravate "shaky" project restarts due to more concentrated traffic early in the recovery sequence, which was the main reason for the deferral mechanism in the first place. That's why I was thinking of a "smarter" algorithm which still spreads "the crunch" out as much as possible, but tries to get everyone back in the game as quickly as possible without manual intervention.
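Just to make the spreading part concrete: it could be as simple as having each host pick a random point inside whatever deferral window it is given, instead of everybody retrying right at the end of it. A toy Python sketch (names invented, purely illustrative):

import random

def schedule_retry(deferral_window_s):
    # Each host retries at a random point inside its window, so the
    # returning traffic after an outage is smeared out rather than
    # arriving in one lump the moment the project comes back up.
    return random.uniform(0.0, deferral_window_s)

# Toy illustration: 10,000 hosts, all given the same 24-hour window,
# come back spread across the whole day instead of in one spike.
WINDOW = 24 * 3600.0
retries = sorted(schedule_retry(WINDOW) for _ in range(10000))
print("first retry after %.1f h, last after %.1f h" % (retries[0] / 3600, retries[-1] / 3600))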

I've been monitoring EAH overall on BOINCstats this week, and the credit-per-day figures still haven't recovered to what they were before the failure, although tomorrow should be a whopping big day in that regard. ;-)

Regarding your final comment: while development of the simple GUI, the Account Manager System, and new features for specific projects are worthy efforts, IMHO they become "fluff" when they lead to skipping a maintenance update for longstanding issues in the current production version, issues which have a major negative impact on BOINC overall every time a project (especially SAH) goes down for 24 hours.

On the plus side, at least here at EAH, users seem to have taken the fallout with pretty good grace, and hopefully the long-term impact (both PR and work production) will be minimal. :-)

Alinator

Alinator
Joined: 8 May 05
Posts: 927
Credit: 9352143
RAC: 0

RE: RE: In any event,

Message 54525 in response to message 54522

Quote:
Quote:


In any event, the only way to "guarantee" a host won't run out of work under most circumstances is to run more than one project.

Alinator

I agree with running more than one project. I primarily run Einstein on my MacBook, with Rosetta suspended unless Einstein goes down. That setup works well for me.

Jim

One tip which can make a backup project virtually automatic is to manually edit the client_state file to give your primary project a massive positive LTD and the backup the negative of that value (10 Msec, for example).

Now you will only run the primary until it runs out of work, and then download backup work one result at a time. Once the primary comes back online, you stop downloading new backup work, finish off the last backup result, and the backup goes idle again.

If you try it, I suggest keeping your resource split equal, or even tilted toward the backup, as that will help get back to working on the primary faster.
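If you'd rather not hand-edit the file, a Python sketch along these lines does the same thing. The tag names (<project>, <master_url>, <long_term_debt>) are as I remember them from the 5.x-era client_state.xml, and the file path and project URLs are just placeholders, so check everything against your own setup; stop BOINC first and keep a backup of the file.

import re
import shutil

STATE_FILE = "client_state.xml"                      # adjust for your BOINC data directory
PRIMARY_URL = "http://einstein.phys.uwm.edu/"        # your primary project's master URL
BACKUP_URL = "http://boinc.bakerlab.org/rosetta/"    # your backup project's master URL
BIG_DEBT = 10000000.0                                # ~10 Msec, as suggested above

def set_ltd(block, value):
    # Replace the <long_term_debt> value inside one <project> block.
    return re.sub(r"<long_term_debt>[^<]*</long_term_debt>",
                  "<long_term_debt>%f</long_term_debt>" % value, block)

def patch_project(text, url, value):
    # Split on <project> markers and patch the block that mentions this master_url.
    parts = text.split("<project>")
    marker = "<master_url>%s</master_url>" % url
    for i, part in enumerate(parts):
        if marker in part:
            parts[i] = set_ltd(part, value)
    return "<project>".join(parts)

shutil.copy(STATE_FILE, STATE_FILE + ".bak")          # always keep a backup
with open(STATE_FILE) as f:
    text = f.read()
text = patch_project(text, PRIMARY_URL, +BIG_DEBT)    # primary: huge positive debt
text = patch_project(text, BACKUP_URL, -BIG_DEBT)     # backup: the negative of that
with open(STATE_FILE, "w") as f:
    f.write(text)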

Alinator

I find this "Baked" Client State method preferable to the "Infintessimal" Resource Share method, or the manual method for that matter. ;-)
