E@H Over commitment proof

venox7
venox7
Joined: 22 Jan 05
Posts: 16
Credit: 10072175
RAC: 0
Topic 189616

Hi,

After several posts and arguments about E@H overcommitting BOINC (4.72), I decided to do a little experiment.

I set all my projects to not fetch new work. I let all of them (except CPDN, of course) run dry.

Then I manually set the short- and long-term debt of every project to 0 in the client_state.xml file.
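For reference, that zeroing step can be scripted; here is a minimal sketch in Python, assuming the debts appear as `<short_term_debt>`/`<long_term_debt>` elements inside each `<project>` block of client_state.xml (check your own file first, and only edit it with the client stopped):

```python
import re

def zero_debts(xml_text):
    """Reset every short/long-term debt entry to 0.
    Tag names are assumed from the client_state.xml layout of
    that era -- verify against your own file before running."""
    return re.sub(
        r"<(short_term_debt|long_term_debt)>[^<]*</\1>",
        r"<\1>0.000000</\1>",
        xml_text,
    )

sample = ("<project><short_term_debt>-1234.5</short_term_debt>"
          "<long_term_debt>9876.1</long_term_debt></project>")
print(zero_debts(sample))
```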

After running BOINC for a couple of minutes so that CPDN could build up some debt, I allowed the other projects to fetch work one by one in the following order (resource share in brackets):

1. Seti (2.2%)
2. Predictor (21.7%)
3. BURP (8.7%)
4. LHC (43.5%)
5. CPDN (8.7%)
6. E@H (2.2%)

My AMD64 3000 running W2K requested 86400 s of work from each project (i.e. a reconnect time of 1 day).

The following work was downloaded by the different projects, with the days to deadline in brackets:

1. Seti - 7h50 (14days)
2. Predictor - 13h50 (7days)
3. Burp - no work from project
4. LHC - 12h40 (14 days)
5. CPDN - 0 secs requested

and then

6. E@H - 25h20 with a deadline in 7 days.

Immediately the BOINC scheduler gave me the following:

2005/07/27 23:31:04|Einstein@Home|Sending scheduler request to http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi
2005/07/27 23:31:04|Einstein@Home|Reason: To fetch work
2005/07/27 23:31:04|Einstein@Home|Requesting 86400 seconds of work, returning 0 results
2005/07/27 23:31:22|Einstein@Home|Scheduler request to http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi succeeded
2005/07/27 23:31:22|Einstein@Home|Got server request to delete file w1_0526.0
2005/07/27 23:31:23||request_reschedule_cpus: files downloaded
2005/07/27 23:31:23||request_reschedule_cpus: files downloaded
2005/07/27 23:31:23||request_reschedule_cpus: files downloaded
2005/07/27 23:31:23||request_reschedule_cpus: files downloaded
2005/07/27 23:31:23||Suspending work fetch because computer is overcommitted.
2005/07/27 23:31:23||Using earliest-deadline-first scheduling because computer is overcommitted.
2005/07/27 23:31:23|ProteinPredictorAtHome|Restarting result h0010A_1_184116_3 using mfoldB125 version 4.28
2005/07/27 23:31:23|LHC@home|Pausing result wjun1D_v6s4hvnom_mqx__5__64.312_59.322__8_10__6__65_1_sixvf_boinc5578_1 (removed from memory)

Thus, due to the 25 hours of work downloaded by E@H, despite it having only 2.2% of the resources (2.2% of 7 days is about 3h40), BOINC went into panic mode: earliest deadline first, no work fetch allowed.
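As a quick sanity check on that arithmetic, using only the numbers from the experiment above:

```python
# Numbers from the experiment: E@H share 2.2%, 7-day deadline, 25h20 fetched.
share = 0.022
deadline_days = 7
fair_hours = share * deadline_days * 24        # E@H's fair slice of 7 days
downloaded_hours = 25 + 20 / 60                # 25h20 actually downloaded

print(f"fair share until deadline: {fair_hours:.2f} h")
print(f"downloaded: {downloaded_hours:.2f} h, "
      f"{downloaded_hours / fair_hours:.1f}x the fair slice")
```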

From this it seems clear that there is a problem with the way E@H (whether on the server or in the client software) calculates the number of WUs to send to the client.

To confirm these results I will redo the experiment once all the WUs have been processed, but this time reactivating E@H first and then the other projects.

Any comments?

Regards,

V7

eberndl
eberndl
Joined: 18 Jan 05
Posts: 43
Credit: 98691
RAC: 0

RE: My AMD64 3000 running

Quote:

My AMD64 3000 running W2K requested 86400 s of work from each project (i.e. a reconnect time of 1 day).

The following work was downloaded by the different projects, with the days to deadline in brackets:

1. Seti - 7h50 (14days)
2. Predictor - 13h50 (7days)
3. Burp - no work from project
4. LHC - 12h40 (14 days)
5. CPDN - 0 secs requested
6. E@H - 25h20 with a deadline in 7 days.

I hate to say it, and I know you hate to hear it, but from what I can see, Einstein did exactly what it was supposed to.

Please hear me out. The "reconnect" time is very, very misleading. What it REALLY means is that BOINC will download that many days' worth of work from each project, not a total of (in your case) one day's worth of work.

So Einstein downloaded 25 hours' worth of work against a 24-hour request, which is pretty accurate, to my mind.

Now, if you had downloaded from Einstein first and Predictor last, and those Predictor units had thrown Einstein into EDF mode, would that be Einstein's fault, or Predictor's?

With CP only getting 8.7% of your share, it's also likely that you won't be able to finish the CP unit before it's due without going into EDF. EDF is not your enemy; in fact it is your best friend... well, maybe just a good friend.

The very first download is ALWAYS weird. What will happen is: the Einstein units will get done within their 7 days, then you won't download any for a couple of months, then you'll download 25 hours' worth and it'll go into EDF mode again.

Just like when you finish your CP unit: it's very likely you won't immediately d/l another one, because of your long-term debt.

Ok, this ramble has gone on long enough, but I say just leave it alone for.... 8 days (7 to clear out the Einsteins completely and 1 to get the rest of the units into their groove). Then we can see what's really happening.

Happy crunching =-)

(edited for a stupidity on my part =-) )

venox7
venox7
Joined: 22 Jan 05
Posts: 16
Credit: 10072175
RAC: 0

Thanks for the explanation,

Message 14663 in response to message 14662

Thanks for the explanation; however, being a programmer myself (read: always trying to catch bugs, the programming kind anyway), I'm still not 100% satisfied/convinced.

Quote:

Please hear me out. The "reconnect" time is very, very misleading. What it REALLY means is that BOINC will download that many days' worth of work from each project, not a total of (in your case) one day's worth of work.

So Einstein downloaded 25 hours' worth of work against a 24-hour request, which is pretty accurate, to my mind.

Then what is the use of resource sharing? Only to stop BOINC from downloading work units for a month or two, and then send your scheduler into panic again?

Shouldn't BOINC download enough WUs to fill the resource share based on the time to deadline? I.e. my E@H is at 2.2% with a 7-day deadline = roughly 4 hours. After EDF-finishing the E@H WUs it is in debt by -150000 s! At 2.2% it's going to take forever to get the next WU. Again, this doesn't seem to happen with other projects.
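As a back-of-envelope illustration of "forever" (assuming the long-term debt is repaid at roughly resource_share × wall-clock time while the project gets no CPU; this is an illustration of the idea, not the exact client formula):

```python
debt_s = 150_000     # long-term debt reported after the EDF burst (from the post)
share = 0.022        # E@H resource share
recovery_s = debt_s / share          # wall time for LTD to climb back to zero
print(f"about {recovery_s / 86_400:.0f} days before E@H is owed work again")
```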

Also why then did the other projects only receive a couple of hours' worth of work, more in line with their resource share?

Quote:

Now, if you had down loaded from Einstein first, and Predictor last, and those predictor units had thrown Einstein into EDF mode, is that Einstein's fault, or predictor's?

That's exactly why I want to repeat the experiment starting with E@H next time.

Quote:

Ok, this ramble has gone on long enough, but I say just leave it alone for.... 8 days (7 to clear out the Einsteins completely and 1 to get the rest of the units into their groove). Then we can see what's really happening.

I aborted the E@H units, will repeat in a day or two. Will keep you posted.

PS: Are you part of the BOINC/E@H/other projects' dev teams?

eberndl
eberndl
Joined: 18 Jan 05
Posts: 43
Credit: 98691
RAC: 0

That is ONE of the effects,

That is ONE of the effects, but not the only one.

As you can see, I have 5 projects in my sig, plus 2 alpha projects that I'm attached to. I want Predictor to get half of my time and the others an equal share of the remaining time, so my shares are

Predictor: 60
Each of the others: 10

So, in a perfect world, if I had units for all my projects, I would be doing (more or less):

Predictor, A, Predictor, B, Predictor, C, Predictor, D, Predictor, E, Predictor, F, repeat from the beginning (6 hours Predictor, 6 hours of everything else).

But, since E and F aren't sending work, they are accumulating LTD, and my schedule is more like:

Predictor, A, Predictor, B, Predictor, Predictor, C, Predictor, D, Predictor, repeat (note: 6 Predictor hours to every 4 hours of the other four combined).
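The share-weighted round robin behind those patterns can be sketched as a tiny simulation; this is an illustration of the idea, not the actual BOINC scheduler, and the project names are stand-ins. With Predictor at 60 and four active projects at 10 each, it reproduces the 6-of-10 Predictor ratio described above:

```python
def schedule(shares, slots):
    """Each slot, every project accrues 'debt' in proportion to its
    resource share; the most-owed project runs and pays one slot back.
    Illustrative only -- not the real BOINC scheduler."""
    debt = {p: 0.0 for p in shares}
    total = sum(shares.values())
    order = []
    for _ in range(slots):
        for p in shares:
            debt[p] += shares[p] / total   # everyone earns their share
        run = max(debt, key=debt.get)      # most-owed project runs
        debt[run] -= 1.0                   # ...and pays one slot back
        order.append(run)
    return order

shares = {"Predictor": 60, "A": 10, "B": 10, "C": 10, "D": 10}
print(schedule(shares, 10))
```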

BUT when I start getting work from E and F again, this "groove" may be thrown out of whack, especially since I KNOW that Orbit@Home is going to have short deadlines. I might download 2 Orbit units and immediately crunch them, then ignore it for a day, then d/l 2 more and immediately crunch them....

What you're seeing is part of a much longer pattern.

BOINC will try to keep the hour-by-hour resource share as much as possible, but if it can't keep that share without going over a deadline, the deadline gets priority over the hour-by-hour time share. Then, to make up for the "stolen" time and to keep the resource share correct in the long run, it holds off downloading from that project for a while.
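That "stolen time" bookkeeping can be illustrated with a one-line debt update (a sketch of the idea with made-up numbers; not the exact client formula):

```python
def update_ltd(ltd, share_fraction, cpu_used, total_cpu):
    """One accounting interval: the project is 'owed' its share of all
    CPU time spent, minus what it actually got. Negative LTD means the
    project got more than its share and must wait. Illustrative only."""
    return ltd + share_fraction * total_cpu - cpu_used

# After an EDF burst where a 2.2%-share project ran 3600 s solo:
ltd = update_ltd(0.0, 0.022, cpu_used=3600, total_cpu=3600)
print(round(ltd, 1))   # deeply negative: it now owes time to the others
```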

Personally, I'm in debt to Einstein: my computer was off for a couple of days, so it went into EDF because the deadline was looming. So I'm now 25K seconds in debt. I'm working it off, and I'll d/l from it again.

Boinc is trying to follow many (possibly conflicting) demands:
It is supposed to follow resource share in the short term.
It is supposed to follow resource share in the long term.
It is supposed to download your connect to amount from each project.
It is supposed to return all units before their respective deadlines.
It doesn't want to run dry.

It's actually a lot to balance, especially if the time to complete each unit is off (as is unfortunately the case for many projects).

Now, the Einstein/EDF issue has to do with a long unit and a short deadline: an Einstein unit is 3-4 times the size of a Predictor unit but has an equal deadline... this can cause some scheduling problems, which is exactly what we're discussing. The problem is further stressed by the resource ratios you have chosen.

There is nothing wrong with giving Einstein a small resource share; however, this IS one of the side effects. You can choose to accept this round robin of sometimes having Einstein and sometimes not, or you can increase the resource share. Without changing the code, those are the options I can see.

As I said before, you may also notice that CP will go into EDF when it gets close to being done... I know mine will, it being less than 50% done, with 23 days of processing left and a deadline in January. I just don't have the 230 days that would be required if BOINC only dealt with the short-term debt.

And no, I'm not on any dev team, but I am a (newly minted) member of the Wiki writers.

(again, sorry for the long ramble.)

Blank Reg
Blank Reg
Joined: 18 Jan 05
Posts: 228
Credit: 40599
RAC: 0
eberndl
eberndl
Joined: 18 Jan 05
Posts: 43
Credit: 98691
RAC: 0

Pascal, clearly I don't mind

Pascal, clearly I don't mind having a discussion with V7, so why are you putting in that link??

Sometimes you need to bounce ideas off of each other...

Blank Reg
Blank Reg
Joined: 18 Jan 05
Posts: 228
Credit: 40599
RAC: 0

RE: Pascal, clearly I don't

Message 14667 in response to message 14666

Quote:

Pascal, clearly I don't mind having a discussion with V7, so why are you putting in that link??

Sometimes you need to bounce ideas off of each other...

Seems like some people just do not like my style. Sorry if I upset your fragile mind, but no matter, I will continue to post as I see fit until the boss tells me to cease and desist, ok..... Boss = Dr. D

John McLeod VII
John McLeod VII
Moderator
Joined: 10 Nov 04
Posts: 547
Credit: 632255
RAC: 0

Goals of the CPU

Goals of the CPU scheduler:

1) Download enough work to get a modem user through their disconnected period (as specified by "connect every X days"). Note that this sometimes means not honoring deadlines (unless the projects self-limit based on connect every X).
2) Make all deadlines that are possible to make. (Note that this means not honoring the resource shares in the short term; also, for modem users to make a deadline, EDF must be entered when the deadline is less than 2x the connect every X days.)
3) Honor the resource shares in the long term. (This means sometimes not honoring the mix of work.)
4) Keep a mix of work on the host.
5) Honor the resource shares in the short term.

I think that is the list (from memory); I may have missed one or two. The goals are mutually exclusive, so they had to be prioritized. The order above is what was decided on.
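Goal 2's modem-user condition can be written down directly; a sketch using the 2x factor stated in the list above (the real client weighs more factors than this):

```python
def needs_edf(deadline_s, connect_every_x_days):
    """Enter earliest-deadline-first when the deadline is closer than
    twice the 'connect every X days' window, so a disconnected host
    can still report in time. Illustrative rule of thumb only."""
    return deadline_s < 2 * connect_every_x_days * 86_400

print(needs_edf(7 * 86_400, 1.0))    # 7-day deadline, 1-day window: False
print(needs_edf(1.5 * 86_400, 1.0))  # 1.5-day deadline, 1-day window: True
```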

I am a volunteer developer on the BOINC project (this is the one piece that I have had a hand in so far).

Ziran
Ziran
Joined: 26 Nov 04
Posts: 194
Credit: 386400
RAC: 1241

If i understand things

If I understand things correctly, a request for 86400 s of work was sent out to each of these projects. The scheduler then calculates how many WUs that equals. Since it is very unlikely that the requested time and the calculated time will match exactly, you should end up with slightly more work than requested, by less than one WU. Looking at the numbers, a very interesting question arises: why is Einstein the only project that managed to do this? SETI didn't even manage to send 1/3 of the requested work, and Predictor and LHC sent about half of what was expected of them. Are those projects so bad at estimating how long their WUs take, or is something else influencing this? If these requests are representative, the projects aren't playing in the same ballpark and are more or less not even playing the same game.
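That "less than a WU more than requested" behaviour is what a simple server-side fill loop would produce; a sketch with hypothetical WU estimates (assumed behaviour matching the description above, not the actual scheduler code):

```python
def fill_request(requested_s, est_wu_s):
    """Keep attaching whole WUs until the requested seconds are met;
    the overshoot is then at most one WU's estimate. Sketch of the
    assumed server behaviour, not the real scheduler."""
    sent = 0
    n = 0
    while sent < requested_s:
        sent += est_wu_s
        n += 1
    return n, sent

# 24 h requested, WUs estimated at ~9 h each (hypothetical numbers):
n, sent = fill_request(86_400, 9 * 3600)
print(n, sent)   # overshoot is less than one WU's estimate
```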

When you're really interested in a subject, there is no way to avoid it: you have to read the manual.

eberndl
eberndl
Joined: 18 Jan 05
Posts: 43
Credit: 98691
RAC: 0

I was wondering the same, but

I was wondering the same, but I figured it might have to do with the way a project gives out WUs, with a max of 4 or 5 per request or something... I know I've never gotten more than 4 or 5 Predictors at a time.

It's hard to know, but it's likely that if he hadn't been put into EDF mode, he would have made another request to SETI, Predictor and LHC... though looking at my own queue right now, I only have 7.5 h of Predictor units with a 0.5-day (12-hour) connection level.

Odd....
