Computers that hoard work units and return none

DanNeely
Joined: 4 Sep 05
Posts: 1,364
Credit: 3,562,358,667
RAC: 8

I've had several full weekend (Fri evening till mid Monday) outages, and normally try to keep a 3 day base queue. Normally when I'm going to be away for a few days I temporarily boost it and DL enough extra work to keep my PC running until I'm back in case something goes wrong. I forgot to do it this last weekend, and ran out. DOH!

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494,410
RAC: 0

Well, my ISP was a bit unreliable last summer, but that meant lots of disconnects and smaller outages (1 to 6 hours), never a longer stretch without Internet, so I'm not overly concerned about that kind of problem.
As for project outages, I just take my chances there... I've found Einstein to be really reliable (big thumbs up to the project staff), and in the rare case that something does happen I'm at least sure my faster box won't be idle, as I've got a climate model which is good for more than 1000 hours of crunching. So I'm confident it's occupied and let BOINC do its job working out the debts.

Nothing But Idle Time
Joined: 24 Aug 05
Posts: 158
Credit: 289,204
RAC: 0

Message 54517 in response to message 54515

Quote:
I've had several full weekend (Fri evening till mid Monday) outages, and normally try to keep a 3 day base queue. Normally when I'm going to be away for a few days I temporarily boost it and DL enough extra work to keep my PC running until I'm back in case something goes wrong. I forgot to do it this last weekend, and ran out. DOH!


Do you run Rosetta? Some of us have experienced "outages" that have nothing to do with ISP. Boinc.exe (client) terminates and if you open the manager all the tabs are blank. One evening just before bedtime I looked at boinc mgr and all was good. Twenty minutes later -- after climbing into bed -- the client crashed. Next morning I checked again and nothing had run for 7 hours. So you can have all the queue/cache you like, and your ISP can be rock solid, but that doesn't mean boinc will be busy while you're away on the weekend.

DanNeely
Joined: 4 Sep 05
Posts: 1,364
Credit: 3,562,358,667
RAC: 8

No. I'm 100% Einstein at present. Until last weekend all of my outages had been ISP related. The situation's really improved over the last year or so, but the summer before I had three full weekend ISP outages. Dunno if they didn't have any weekend techs, or if they weren't stocking hardware spares.

Wurgl (speak^Wcrunching for Special: Off-Topic)
Joined: 11 Feb 05
Posts: 321
Credit: 140,550,008
RAC: 0

Message 54519 in response to message 54515

Quote:
I've had several full weekend (Fri evening till mid Monday) outages, and normally try to keep a 3 day base queue. Normally when I'm going to be away for a few days I temporarily boost it and DL enough extra work to keep my PC running until I'm back in case something goes wrong. I forgot to do it this last weekend, and ran out. DOH!

And sometimes even a buffer of three days is not enough. You remember the outage on Nov. 16th? After that outage a huge bulk of the fast machines did not connect for ~160 hours, which is about a week.

So to keep computers from running out of work during a longer project downtime, one may consider setting a cache of ~8 days. Please think of the people who cannot nurse their machines manually, either because of the huge number of computers (Bruce Allen, Erik A. Espinoza) or because not all computers are accessible (e.g. you are out of the office, on vacation, etc.).

As I remember, this 7 day delay is a gift from the SETI project.

Alinator
Joined: 8 May 05
Posts: 927
Credit: 9,352,143
RAC: 0

Message 54520 in response to message 54519

Quote:

And sometimes even a buffer of three days is not enough. You remember the outage on Nov. 16th? After that outage a huge bulk of the fast machines did not connect for ~160 hours, which is about a week.

So to keep computers from running out of work during a longer project downtime, one may consider setting a cache of ~8 days. Please think of the people who cannot nurse their machines manually, either because of the huge number of computers (Bruce Allen, Erik A. Espinoza) or because not all computers are accessible (e.g. you are out of the office, on vacation, etc.).

As I remember, this 7 day delay is a gift from the SETI project.

It isn't really the speed of the machine, the cache size, or even whether you want to nursemaid the machines. The issue is that if any machine cannot make contact with the scheduler, it doesn't take very long (around 12 to 18 hours, IIRC) for the backoff to ramp up to the maximum value. Fast machines are more likely to have this happen merely because they contact the project more frequently, but any host which tries to hit the project within that downtime window will end up with a maxed out deferral. The only way I know to prevent it is to manually intervene.

The problem is the maximum delay should never be longer than the connect to network interval for a host initially, or longer than remaining estimated work onboard, or a random value <= 36 hours if the host runs out completely. I think this would eliminate the majority of the problems for most short term outages without having to go to an 8 day cache to cover the possibility of a rare 1 day outage "forcing" a deferral of 168 hours for a project that has the reliability of EAH.
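To make the proposed cap concrete, here is a minimal Python sketch of one way to read it. It is not BOINC code: the function names, the use of seconds, and the exact interpretation (smaller of connect interval and remaining work while work is on board, a random value of at most 36 hours once the host is empty) are all assumptions made for illustration.

```python
import random

HOUR = 3600.0  # seconds


def max_deferral_cap(connect_interval_s, remaining_work_s, rng=random):
    """Upper bound on the scheduler deferral under the proposal above.

    Hypothetical helper, not part of BOINC. Assumed interpretation:
    - host still has work: cap at the smaller of the connect-to-network
      interval and the estimated work left on board;
    - host has run out of work: cap at a random value of at most 36 hours.
    """
    if remaining_work_s <= 0:
        return rng.uniform(0.0, 36 * HOUR)
    return min(connect_interval_s, remaining_work_s)


def next_deferral(current_backoff_s, connect_interval_s, remaining_work_s, rng=random):
    """Clamp whatever the normal backoff computed to the proposed cap."""
    cap = max_deferral_cap(connect_interval_s, remaining_work_s, rng)
    return min(current_backoff_s, cap)


if __name__ == "__main__":
    week = 168 * HOUR
    # Backoff has ramped to a week; host has a 3 day connect interval
    # and 2 days of work left, so the deferral is held to 2 days.
    print(next_deferral(week, 72 * HOUR, 48 * HOUR) / HOUR)  # 48.0
```

With a cap like this, a host that hit the project during a one day outage would retry no later than the point where its remaining work runs out, rather than sitting out the full 168 hours.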

In any event, the only way to "guarantee" a host won't run out of work under most circumstances is to run more than one project.

Alinator

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2,142
Credit: 2,793,084,977
RAC: 694,262

IIRC, the latest development versions of BOINC have the maximum delay set to 1 day instead of 7 days.

Unfortunately, the main development focus then switched to the simple GUI, which means this change won't be seen in a 'recommended' version for some time yet - and even then, we'll have a long delay (sic!) before the majority of current users have been persuaded to upgrade.

Jim Milks
Joined: 19 Jun 06
Posts: 116
Credit: 529,852
RAC: 0

Message 54522 in response to message 54520

Quote:


In any event, the only way to "guarantee" a host won't run out of work under most circumstances is to run more than one project.

Alinator

I agree with running more than one project. I primarily run Einstein on my MacBook, with Rosetta suspended unless Einstein goes down. That setup works well for me.

Jim

Alinator
Joined: 8 May 05
Posts: 927
Credit: 9,352,143
RAC: 0

@ Richard:

Yes, I believe that's correct. However in my case the two which are in the "penalty box" aren't running 5.7.x and the one which is didn't try to hit the project during the outage.

OTOH, while simply dropping the maximum deferral to a day may take care of the "forum flapdoodles" which arise when relatively short outages cause excessive deferral, I wonder if it won't aggravate "shaky" project restarts due to more concentrated traffic early in the sequence, which was the main reason for the deferral mechanism in the first place. That's why I was thinking of a "smarter" algorithm which still spreads "the crunch" out as much as possible, but tries to get everyone back in the game as quickly as possible without manual intervention.
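The traffic concern can be put in rough numbers. The sketch below assumes every deferred host retries exactly once, at a uniformly random time inside the deferral window; that model and the host count are made up for illustration and are not taken from BOINC or from EAH statistics.

```python
def mean_requests_per_hour(hosts, max_deferral_hours):
    """Average scheduler requests per hour after a restart.

    Assumption: each host retries once, at a uniformly random time
    within the maximum deferral window.
    """
    return hosts / max_deferral_hours


if __name__ == "__main__":
    hosts = 50_000  # hypothetical number of deferred hosts
    for window_hours in (168, 24):
        rate = mean_requests_per_hour(hosts, window_hours)
        print(f"{window_hours:>3} h window: about {rate:.0f} requests/hour")
```

Under that assumption, cutting the window from 168 hours to 24 hours packs the same reconnects into one seventh of the time, so the average load on a restarting server is about seven times higher.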

I've been monitoring EAH overall on Boincstats this week and the credit per day figures still haven't recovered to what they were before the failure, although tomorrow should be a whopping big day in that regard. ;-)

Regarding your final comment, while the development on the simple GUI, the Account Manager System, and new features for specific projects are worthy efforts, IMHO they are "fluff" when they lead to skipping a maintenance update for longstanding issues on the current production version which have a major negative impact on BOINC overall every time a project (especially SAH) goes down for 24 hours.

On the plus side, at least here at EAH, users seem to have taken the fallout with pretty good grace, and hopefully the long term impact (both PR and work production) will be minimal. :-)

Alinator

Alinator
Joined: 8 May 05
Posts: 927
Credit: 9,352,143
RAC: 0

Message 54525 in response to message 54522

Quote:
Quote:


In any event, the only way to "guarantee" a host won't run out of work under most circumstances is to run more than one project.

Alinator

I agree with running more than one project. I primarily run Einstein on my MacBook, with Rosetta suspended unless Einstein goes down. That setup works well for me.

Jim

One tip which can make a backup project virtually automatic is to manually edit the client_state file to give your primary a massive positive LTD and the backup the negative of that value (like 10 MSecs., for example).

Now you would only run the primary until it runs out of work, and then DL backup work one result at a time. Once the primary comes back online you stop DL'ing new backup work, work off the last result and go idle for the backup again.

If you try it, I suggest keeping your resource split equal or even tilted to the backup as that will help get back to working the primary faster.
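The effect of the lopsided LTD values can be illustrated with a toy model. This is a simplified sketch and not the actual BOINC work-fetch logic: the `Project` class, the `has_work` flag, and the rule "fetch from the reachable project with the highest long-term debt" are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Project:
    name: str
    long_term_debt: float  # seconds; larger means "owed more CPU time"
    has_work: bool         # whether the project is currently handing out work


def pick_fetch_target(projects):
    """Return the reachable project with the highest debt, or None.

    Simplified model (assumption): new work is requested from the project
    with the largest long-term debt that can currently supply it.
    """
    candidates = [p for p in projects if p.has_work]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.long_term_debt)


if __name__ == "__main__":
    # 10 MSecs = 10,000,000 seconds, mirrored for the backup project.
    primary = Project("primary", 10_000_000.0, has_work=True)
    backup = Project("backup", -10_000_000.0, has_work=True)

    print(pick_fetch_target([primary, backup]).name)  # primary
    primary.has_work = False                          # primary goes down
    print(pick_fetch_target([primary, backup]).name)  # backup
```

In this model the backup is only ever chosen while the primary cannot supply work, which matches the behaviour described above.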

Alinator

I find this "Baked" Client State method preferable to the "Infinitesimal" Resource Share method, or the manual method for that matter. ;-)
