E@H Over commitment proof

venox7
venox7
Joined: 22 Jan 05
Posts: 16
Credit: 10072175
RAC: 0
Topic 189616

Hi,

After several posts and arguments about E@H overcommitting BOINC (4.72), I decided to do a little experiment.

I set all my projects to not fetch new work. I let all of them (except CPDN, of course) run dry.

Then I manually set the short- and long-term debt of every project to 0 in the client_state.xml file.
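For reference, that zeroing step can be scripted; here is a minimal sketch in Python, assuming the debts appear as `<short_term_debt>`/`<long_term_debt>` elements inside each `<project>` block of client_state.xml (check your own file first, and only edit it with the client stopped):

```python
import re

def zero_debts(xml_text):
    """Reset every short/long-term debt entry to 0.
    Tag names are assumed from the client_state.xml layout of
    that era -- verify against your own file before running."""
    return re.sub(
        r"<(short_term_debt|long_term_debt)>[^<]*</\1>",
        r"<\1>0.000000</\1>",
        xml_text,
    )

sample = ("<project><short_term_debt>-1234.5</short_term_debt>"
          "<long_term_debt>9876.1</long_term_debt></project>")
print(zero_debts(sample))
```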

After running BOINC for a couple of minutes so that CPDN could build up some debt, I allowed the other projects to fetch work one by one in the following order (resource share in brackets):

1. Seti (2.2%)
2. Predictor (21.7%)
3. BURP (8.7%)
4. LHC (43.5%)
5. CPDN (8.7%)
6. E@H (2.2%)

My AMD64 3000 running W2K requested 86400 s of work from each project (i.e. a reconnect time of 1 day).

The following work was downloaded by the different projects, with the days to deadline in brackets:

1. Seti - 7h50 (14days)
2. Predictor - 13h50 (7days)
3. Burp - no work from project
4. LHC - 12h40 (14 days)
5. CPDN - 0 secs requested

and then

6. E@H - 25h20 with a deadline in 7 days.

Immediately the BOINC scheduler gave me the following:

2005/07/27 23:31:04|Einstein@Home|Sending scheduler request to http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi
2005/07/27 23:31:04|Einstein@Home|Reason: To fetch work
2005/07/27 23:31:04|Einstein@Home|Requesting 86400 seconds of work, returning 0 results
2005/07/27 23:31:22|Einstein@Home|Scheduler request to http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi succeeded
2005/07/27 23:31:22|Einstein@Home|Got server request to delete file w1_0526.0
2005/07/27 23:31:23||request_reschedule_cpus: files downloaded
2005/07/27 23:31:23||request_reschedule_cpus: files downloaded
2005/07/27 23:31:23||request_reschedule_cpus: files downloaded
2005/07/27 23:31:23||request_reschedule_cpus: files downloaded
2005/07/27 23:31:23||Suspending work fetch because computer is overcommitted.
2005/07/27 23:31:23||Using earliest-deadline-first scheduling because computer is overcommitted.
2005/07/27 23:31:23|ProteinPredictorAtHome|Restarting result h0010A_1_184116_3 using mfoldB125 version 4.28
2005/07/27 23:31:23|LHC@home|Pausing result wjun1D_v6s4hvnom_mqx__5__64.312_59.322__8_10__6__65_1_sixvf_boinc5578_1 (removed from memory)

Thus, due to the 25 hours of work downloaded by E@H, despite it having only 2.2% of the resources (2.2% of 7 days is about 3h40), BOINC went into panic mode: earliest deadline first, no work fetch allowed.
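As a quick sanity check on that arithmetic, using only the numbers from the experiment above:

```python
# Numbers from the experiment: E@H share 2.2%, 7-day deadline, 25h20 fetched.
share = 0.022
deadline_days = 7
fair_hours = share * deadline_days * 24        # E@H's fair slice of 7 days
downloaded_hours = 25 + 20 / 60                # 25h20 actually downloaded

print(f"fair share until deadline: {fair_hours:.2f} h")
print(f"downloaded: {downloaded_hours:.2f} h, "
      f"{downloaded_hours / fair_hours:.1f}x the fair slice")
```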

From this it seems clear that there is a problem with the way E@H (whether on the server or in the client software) calculates the number of WUs to send to the client.

To confirm these results I will redo the experiment once all the WUs have been processed, but this time reactivating E@H first and then the other projects.

Any comments?

Regards,

V7

eberndl
eberndl
Joined: 18 Jan 05
Posts: 43
Credit: 98691
RAC: 0

RE: My AMD64 3000 running

Quote:

My AMD64 3000 running W2K requested 86400 s of work from each project (i.e. a reconnect time of 1 day).

The following work was downloaded by the different projects, with the days to deadline in brackets:

1. Seti - 7h50 (14days)
2. Predictor - 13h50 (7days)
3. Burp - no work from project
4. LHC - 12h40 (14 days)
5. CPDN - 0 secs requested
6. E@H - 25h20 with a deadline in 7 days.

I hate to say it, and I know you hate to hear it, but from what I can see, Einstein did exactly what it was supposed to.

Please hear me out. The "reconnect" time is very, very misleading. What it REALLY means is that BOINC will download that many days' worth of work from each project, not a total of (in your case) one day's worth of work.

So Einstein downloaded 25 hours' worth of work against a 24-hour request, which is pretty accurate, to my mind.

Now, if you had downloaded from Einstein first and Predictor last, and those Predictor units had thrown Einstein into EDF mode, would that be Einstein's fault, or Predictor's?

With CP only getting 8.7% of your share, it's also likely that you won't be able to finish the CP unit before it's due without going into EDF. EDF is not your enemy; in fact it is your best friend... well, maybe just a good friend.

The very first download is ALWAYS weird. What will happen is: the Einstein units will get done within their 7 days, then you won't download any for a couple of months, then you'll download 25 hours' worth and it'll go into EDF mode again.

Just like when you finish your CP unit: it's very likely you won't immediately d/l another one, because of your long-term debt.

Ok, this ramble has gone on long enough, but I say just leave it alone for.... 8 days (7 to clear out the Einsteins completely and 1 to get the rest of the units into their groove). Then we can see what's really happening.

Happy crunching =-)

(edited for a stupidity on my part =-) )

venox7
venox7
Joined: 22 Jan 05
Posts: 16
Credit: 10072175
RAC: 0

Thanks for the explanation,

Message 14663 in response to message 14662

Thanks for the explanation; however, being a programmer myself (read: always trying to catch bugs, the programming kind anyway), I'm still not 100% satisfied/convinced.

Quote:

Please hear me out. The "reconnect" time is very, very misleading. What it REALLY means is that BOINC will download that many days' worth of work from each project, not a total of (in your case) one day's worth of work.

So Einstein downloaded 25 hours' worth of work against a 24-hour request, which is pretty accurate, to my mind.

Then what is the use of resource sharing? Only to stop BOINC from downloading work units for a month or two, and then send your scheduler into panic again?

Shouldn't BOINC download enough WUs to fill the resource share based on the time to deadline? I.e. my E@H is at 2.2% with a 7-day deadline = roughly 4 hours. After EDF-finishing the E@H WUs it is in debt by -150000 s! At 2.2% it's going to take forever to get the next WU. Again, this doesn't seem to happen with other projects.
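As a back-of-envelope illustration of "forever" (assuming the long-term debt is repaid at roughly resource_share × wall-clock time while the project gets no CPU; this is an illustration of the idea, not the exact client formula):

```python
debt_s = 150_000     # long-term debt reported after the EDF burst (from the post)
share = 0.022        # E@H resource share
recovery_s = debt_s / share          # wall time for LTD to climb back to zero
print(f"about {recovery_s / 86_400:.0f} days before E@H is owed work again")
```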

Also why then did the other projects only receive a couple of hours' worth of work, more in line with their resource share?

Quote:

Now, if you had down loaded from Einstein first, and Predictor last, and those predictor units had thrown Einstein into EDF mode, is that Einstein's fault, or predictor's?

That's exactly why I want to repeat the experiment starting with E@H next time.

Quote:

Ok, this ramble has gone on long enough, but I say just leave it alone for.... 8 days (7 to clear out the Einsteins completely and 1 to get the rest of the units into their groove). Then we can see what's really happening.

I aborted the E@H units, will repeat in a day or two. Will keep you posted.

PS: Are you part of the BOINC/E@H/other projects' dev teams?

eberndl
eberndl
Joined: 18 Jan 05
Posts: 43
Credit: 98691
RAC: 0

That is ONE of the effects,

That is ONE of the effects, but not the only one.

As you can see, I have 5 projects in my sig, plus 2 alpha projects that I'm attached to. I want Predictor to get half of my time and the others an equal share of the remaining time, so my shares are

Predictor: 60
Each of the others: 10

So, in a perfect world, if I had units for all my projects, I would be doing (more or less):

Predictor, A, Predictor, B, Predictor, C, Predictor, D, Predictor, E, Predictor, F, repeat from the beginning (6 hours Predictor, 6 hours of everything else).

But, since E and F aren't sending work, they are accumulating LTD, and my schedule is more like:

Predictor, A, Predictor, B, Predictor, Predictor, C, Predictor, D, Predictor, repeat (note: 6 Predictor hours to every 4 hours of the other four combined).
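The share-weighted round robin behind those patterns can be sketched as a tiny simulation; this is an illustration of the idea, not the actual BOINC scheduler, and the project names are stand-ins. With Predictor at 60 and four active projects at 10 each, it reproduces the 6-of-10 Predictor ratio described above:

```python
def schedule(shares, slots):
    """Each slot, every project accrues 'debt' in proportion to its
    resource share; the most-owed project runs and pays one slot back.
    Illustrative only -- not the real BOINC scheduler."""
    debt = {p: 0.0 for p in shares}
    total = sum(shares.values())
    order = []
    for _ in range(slots):
        for p in shares:
            debt[p] += shares[p] / total   # everyone earns their share
        run = max(debt, key=debt.get)      # most-owed project runs
        debt[run] -= 1.0                   # ...and pays one slot back
        order.append(run)
    return order

shares = {"Predictor": 60, "A": 10, "B": 10, "C": 10, "D": 10}
print(schedule(shares, 10))
```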

BUT when I start getting work from E and F again, this "groove" may be thrown out of whack, especially since I KNOW that Orbit@Home is going to have short deadlines. I might download 2 Orbit units and immediately crunch them, then ignore it for a day, then d/l 2 more and immediately crunch them....

What you're seeing is part of a much longer pattern.

BOINC will try to keep the hour-by-hour resource share as much as possible, but if it can't keep that share without going over a deadline, the deadline gets priority over the hour-by-hour time share. Then, to make up for the "stolen" time and to keep the resource share correct in the long run, it holds off downloading from that project for a while.
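That "stolen time" bookkeeping can be illustrated with a one-line debt update (a sketch of the idea with made-up numbers; not the exact client formula):

```python
def update_ltd(ltd, share_fraction, cpu_used, total_cpu):
    """One accounting interval: the project is 'owed' its share of all
    CPU time spent, minus what it actually got. Negative LTD means the
    project got more than its share and must wait. Illustrative only."""
    return ltd + share_fraction * total_cpu - cpu_used

# After an EDF burst where a 2.2%-share project ran 3600 s solo:
ltd = update_ltd(0.0, 0.022, cpu_used=3600, total_cpu=3600)
print(round(ltd, 1))   # deeply negative: it now owes time to the others
```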

Personally, I'm in debt to Einstein: my computer was off for a couple of days, so it went into EDF because the deadline was looming. So I'm now 25K seconds in debt. I'm working it off, and I'll d/l from it again.

Boinc is trying to follow many (possibly conflicting) demands:
It is supposed to follow resource share in the short term.
It is supposed to follow resource share in the long term.
It is supposed to download your connect to amount from each project.
It is supposed to return all units before their respective deadlines.
It doesn't want to run dry.

It's actually a lot to balance, especially if the time to complete each unit is off (as is unfortunately the case for many projects).

Now, the Einstein/EDF issue has to do with a long unit and a short deadline: an Einstein unit is 3-4 times the size of a Predictor unit but has an equal deadline... this can cause some scheduling problems, which is exactly what we're discussing. The problem is further stressed by the resource ratios you have chosen.

There is nothing wrong with giving Einstein a small resource share; however, this IS one of the side effects. You can choose to accept this round robin of sometimes having Einstein and sometimes not, or you can increase the resource share. Without changing the code, those are the options I can see.

As I said before, you may also notice that CP will go into EDF when it gets close to being done... I know mine will, it being less than 50% done, with 23 days of processing left and a deadline in January. I just don't have the 230 days that would be required if BOINC only dealt with the short-term debt.

And no, I'm not on any dev team, but I am a (newly minted) member of the Wiki writers.

(again, sorry for the long ramble.)

Blank Reg
Blank Reg
Joined: 18 Jan 05
Posts: 228
Credit: 40599
RAC: 0
eberndl
eberndl
Joined: 18 Jan 05
Posts: 43
Credit: 98691
RAC: 0

Pascal, clearly I don't mind

Pascal, clearly I don't mind having a discussion with V7, so why are you putting in that link??

Sometimes you need to bounce ideas off of each other...

Blank Reg
Blank Reg
Joined: 18 Jan 05
Posts: 228
Credit: 40599
RAC: 0

RE: Pascal, clearly I don't

Message 14667 in response to message 14666

Quote:

Pascal, clearly I don't mind having a discussion with V7, so why are you putting in that link??

Sometimes you need to bounce ideas off of each other...

Seems like some people just do not like my style. Sorry if I upset your fragile mind, but no matter, I will continue to post as I see fit until the boss tells me to cease and desist, ok..... Boss = Dr. D

John McLeod VII
John McLeod VII
Moderator
Joined: 10 Nov 04
Posts: 547
Credit: 632255
RAC: 0

Goals of the CPU

Goals of the CPU scheduler:

1) Download enough work to get a modem user through their disconnected period (as specified by "connect every X days"). Note that this sometimes means not honoring deadlines (unless the projects self-limit based on connect every X).
2) Make all deadlines that are possible to make. (Note that this means not honoring the resource shares in the short term; also, for modem users to make a deadline, EDF must be entered when the deadline is less than 2x the connect every X days.)
3) Honor the resource shares in the long term. (This means sometimes not honoring the mix of work.)
4) Keep a mix of work on the host.
5) Honor the resource shares in the short term.

I think that is the list (from memory); I may have missed one or two. The goals are mutually exclusive, so they had to be prioritized. The order above is what was decided on.
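Goal 2's modem-user condition can be written down directly; a sketch using the 2x factor stated in the list above (the real client weighs more factors than this):

```python
def needs_edf(deadline_s, connect_every_x_days):
    """Enter earliest-deadline-first when the deadline is closer than
    twice the 'connect every X days' window, so a disconnected host
    can still report in time. Illustrative rule of thumb only."""
    return deadline_s < 2 * connect_every_x_days * 86_400

print(needs_edf(7 * 86_400, 1.0))    # 7-day deadline, 1-day window: False
print(needs_edf(1.5 * 86_400, 1.0))  # 1.5-day deadline, 1-day window: True
```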

I am a volunteer developer on the BOINC project (this is the one piece that I have had a hand in so far).

Ziran
Ziran
Joined: 26 Nov 04
Posts: 194
Credit: 386400
RAC: 1241

If i understand things

If I understand things correctly, a request for 86400 s of work was sent out to each of these projects. The scheduler then calculates how many WUs that equals. Since it is very unlikely that the requested time and the calculated time will match exactly, you should end up with slightly more work than requested, by less than one WU. Looking at the numbers, a very interesting question arises: why is Einstein the only project that managed to do this? SETI didn't even manage to send 1/3 of the requested work, and Predictor and LHC sent about half of what was expected of them. Are those projects so bad at estimating how long their WUs take, or is something else influencing this? If these requests are representative, the projects aren't playing in the same ballpark and are more or less not even playing the same game.
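That "less than a WU more than requested" behaviour is what a simple server-side fill loop would produce; a sketch with hypothetical WU estimates (assumed behaviour matching the description above, not the actual scheduler code):

```python
def fill_request(requested_s, est_wu_s):
    """Keep attaching whole WUs until the requested seconds are met;
    the overshoot is then at most one WU's estimate. Sketch of the
    assumed server behaviour, not the real scheduler."""
    sent = 0
    n = 0
    while sent < requested_s:
        sent += est_wu_s
        n += 1
    return n, sent

# 24 h requested, WUs estimated at ~9 h each (hypothetical numbers):
n, sent = fill_request(86_400, 9 * 3600)
print(n, sent)   # overshoot is less than one WU's estimate
```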

When you're really interested in a subject, there is no way to avoid it: you have to read the manual.

eberndl
eberndl
Joined: 18 Jan 05
Posts: 43
Credit: 98691
RAC: 0

I was wondering the same, but

I was wondering the same, but I figured it might have to do with the way a project gives out WUs, with a max of 4 or 5 per request or something... I know I've never gotten more than 4 or 5 Predictors at a time.

It's hard to know, but it's likely that if he hadn't been put into EDF mode, he would have made another request to SETI, Predictor and LHC... though looking at my own queue right now, I only have 7.5 h of Predictor units with a 0.5-day (12-hour) connection level.

Odd....
