backup project: resource share problem

gravywavy
gravywavy
Joined: 22 Jan 05
Posts: 392
Credit: 68962
RAC: 0
Topic 189617

The advice given to those wanting to set up backup projects is to give them derisory resource shares, the claimed effect being that the backup project will run a few WUs to begin with, but then not run unless the major project runs dry.

So if my major project is Orbit@home, I give that project a large resource share (9900); then I have Einstein and LHC as backup projects with a smaller resource share (99 each), and then my last-chance backup project with a very small share (Protein Predictor, share=2).

This should mean that when Orbit has no work, either Einstein or LHC will run; and when all three of these projects have no work, then Predictor will run.
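For anyone who wants to check the arithmetic, here is a minimal sketch (in Python, using just the example numbers from this post) of how those shares translate into CPU fractions:

    # Resource shares from the example above (hypothetical values)
    shares = {"Orbit": 9900, "Einstein": 99, "LHC": 99, "Predictor": 2}

    total = sum(shares.values())
    for project, share in shares.items():
        # A project's long-term CPU fraction is its share divided by the total
        print(f"{project}: {share / total:.2%}")

    # Orbit: 98.02%, Einstein: 0.98%, LHC: 0.98%, Predictor: 0.02%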

Sounds nice, but the advice does not work.

It is an ideal time to test this advice, for anyone with an Orbit a/c, because of course Orbit has no work for anybody just now.

Where it goes wrong is that the client starts asking for work when it gets below 0.7 days of work (my connect interval for this test). The schedulers spot that there is some work held locally, and refuse to issue any work on the grounds that it will not finish in time. This makes sense, as they are assuming that Orbit is taking about 98% of the CPU. Each scheduler then defers future connections for 24 hours (Einstein, Orbit, Predictor) or 48 hours (LHC). During this time the machine runs dry and there is no work being done (I have witnessed this on my single-CPU machine) or work is being done on only one of two CPUs (also witnessed).
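To make the failure concrete, here is a rough sketch of the kind of feasibility check the scheduler appears to be applying. This is my own Python model of the behaviour I see in the logs, not real BOINC code, and the 6-hour workunit and 14-day deadline are made-up illustration values:

    def scheduler_decision(est_cpu_hours, deadline_days, share_fraction,
                           defer_hours=24):
        # Crude model of the scheduler's "will it finish in time?" test.
        # share_fraction is the slice of the CPU this project expects,
        # e.g. ~0.01 for a backup project while Orbit holds ~0.98.
        est_wall_days = est_cpu_hours / 24.0 / share_fraction
        if est_wall_days > deadline_days:
            # No work sent, and the client is told not to come back for a day or two
            return ("refuse work", f"defer contact {defer_hours}h")
        return ("send work", "no deferral")

    # A 6-hour workunit with a 14-day deadline, seen from a 1%-share backup project:
    print(scheduler_decision(est_cpu_hours=6, deadline_days=14, share_fraction=0.01))
    # -> ('refuse work', 'defer contact 24h')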

Then it is as likely to be Predictor that gets to run work as either of the other projects, despite it being set with a very very small resource share.

I have previously suggested that it would be better to have a mechanism for selecting backup projects, and second tier backups, etc. (see this wish list thread).

I'd like to reinforce that request now. Instead of trying to fudge a priority scheme out of the resource shares - a geek trick that is opaque to naive users and which will always be subject to the protective programming of future schedulers - please can we have a separately designed priority scheme: I want to say that this is my frontline project, these are my priority 2 projects, and here is my final backstop project.
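In case it helps to see what I mean, here is a little sketch (Python, purely illustrative; none of these names or settings exist in BOINC today) of the selection rule I am asking for - work through the tiers in order and only fall back to a lower tier when every project above it is dry:

    # Hypothetical per-project preference: tier 1 = frontline, 2 = backups, 3 = backstop
    tiers = {"Orbit": 1, "Einstein": 2, "LHC": 2, "Predictor": 3}

    def pick_projects(has_work):
        # has_work maps project name -> whether its server currently has work
        for tier in sorted(set(tiers.values())):
            candidates = [p for p, t in tiers.items() if t == tier and has_work[p]]
            if candidates:
                return candidates   # share the machine among this tier only
        return []                   # everything is dry

    # Orbit is down, so the two tier-2 backups should share the machine:
    print(pick_projects({"Orbit": False, "Einstein": True, "LHC": True, "Predictor": True}))
    # -> ['Einstein', 'LHC']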

As I have already indicated in that previous thread, this would not be a major change to the current handling in the scheduler, though there would be a need for a new value in the database for project-specific preferences.

~~gravywavy

Blank Reg
Blank Reg
Joined: 18 Jan 05
Posts: 228
Credit: 40599
RAC: 0

backup project: resource share problem

Sounds reasonable, cause that is not the way Boinc should be working......but you are trying to force it to your will.........

Resource_Share

Jim Baize
Jim Baize
Joined: 22 Jan 05
Posts: 116
Credit: 582144
RAC: 0

In the example that you

In the example that you cited, you are only looking at initial conditions. Each project will not stay EXACTLY at the assigned resource share, but it will stay within a range of the resource share.

I do believe that the client will request work once the WU gets close to the end. Even if it doesn't attempt near the end of a WU, assuming that you have one Einstein WU, one LHC WU, and one Predictor WU, you should be pretty close to 24 hours' worth of work (depending upon computer speed, yada yada yada). According to your scenario, in a 48-hour period, Einstein will attempt to DL a WU twice, as will Predictor, and LHC will try once. That is 5 attempts within 48 hours, which averages out to close to one every 10 hours. Surely, within that time one of the three will successfully request and be granted work.

You also stated that the server would defer communications for 24 or 48 hours. How do you know this? Do you have a log entry that shows this? My understanding is that the "Connect to network about every X days" setting would prevent that from happening.

Quote:
Where it goes wrong is that the client starts asking for work when it gets below 0.7 days of work (my connect interval for this test). The schedulers spot that there is some work held locally, and refuse to issue any work on the grounds that it will not finish in time. This makes sense, as they are assuming that Orbit is taking about 98% of the CPU. Each scheduler then defers future connections for 24 hours (Einstein, Orbit, Predictor) or 48 hours (LHC). During this time the machine runs dry and there is no work being done (I have witnessed this on my single-CPU machine) or work is being done on only one of two CPUs (also witnessed).


Jim

gravywavy
gravywavy
Joined: 22 Jan 05
Posts: 392
Credit: 68962
RAC: 0

RE: In the example that you

Message 14684 in response to message 14682

Quote:
In the example that you cited, you are only looking at initial conditions. Each project will not stay EXACTLY at the assigned resource share, but it will stay within a range of the resource share.

sorry, you have missed my point. The complaint is not that the project shares are slightly out of line; there are two separate complaints:

1) that the machine is left for huge parts of a day with no work to do, which is exactly what a backup project is supposed to guard against

2) that, furthermore, there is no way of controlling an order of preference for backup projects: once the major project is down, any of the backups is likely to be given the lion's share of the work. Not a question of EXACTLY, but of HUGELY different from expectations.

Clearly, (1) is the more important of these two.

Quote:

I do believe that the client will request work once the WU gets close to the end.

correct, as reported in my previous posting.

and the result, as reported in my previous posting, is that the scheduler does two things: (a) it refuses to give any work, as there is still work running that looks like it won't complete in time, and (b) it requests (and the recent client honours) no further contact for 24+ hours. Given that, as you rightly observe, we were 'close to the end', the effect is to lock the machine out from further work for nearly a day. Once all the work in hand is complete the machine sits for the rest of the day saying once an hour that it is requesting further work but that comms are deferred.

Quote:

Even if it doesn't attempt near the end of a WU, assuming that you have one Einstein WU, one LHC WU, and one Predictor WU, you should be pretty close to 24 hours worth of work (depending upon computer speed, yada yada yada). According to your scenerio, in a 48 hour period, Einstein will attempt to DL a WU 2, as will Predictor and LHC will try once. That is 5 attempts within 48 hours. Averages out to close to one every 10 hours. Surely, within that time one of the three will successfully request and be granted work.

except that they don't connect till 24 hours after the first attempt, and all the first attempts were within a few minutes of each other, when, as you correctly say, the previous work was 'close to the end'.

Quote:

You also stated that the server would defer communications for 24 or 48 hours. How do you know this? Do you have a log entry that shows this?

many - I am sorry, I obviously did not make it clear that I am reporting actual observations of two clients in action: no guesses, no assumptions, just what actually happens when I test the received wisdom on this method of arranging for backup projects.

I will post extracts from a couple of logs in later postings.

Quote:

My understanding is that the "Connect to network about every X days" would prevent that from happening.

in practice, it appears that the scheduler's request for contact to be deferred is more binding on the client than the connect interval. During normal running this makes perfect sense: if client X cannot possibly complete any work sent, or if the server has no work, it does not want client X mithering for work every few minutes. The side effect of this sensible prioritisation is that the frequently advised cludge to set up a backup project does not work.
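As a rough picture of what the logs show, the client's 'may I ask this project for work?' decision seems to behave something like this sketch (Python, my own naming, not the real client code):

    def may_request_work(cached_work_days, connect_interval_days, deferral_remaining_hours):
        # The scheduler-imposed deferral wins over the user's connect interval
        wants_work = cached_work_days < connect_interval_days   # cache has run low
        if deferral_remaining_hours > 0:
            return False   # honour the server's "defer communication" request first
        return wants_work

    # Cache is empty and the connect interval is 0.7 days, but the server
    # said "come back in 24 hours" a few hours ago:
    print(may_request_work(cached_work_days=0.0,
                           connect_interval_days=0.7,
                           deferral_remaining_hours=20))
    # -> False: the machine sits dry until the deferral expires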

~~gravywavy

gravywavy
gravywavy
Joined: 22 Jan 05
Posts: 392
Credit: 68962
RAC: 0

log entries here

log entries here

~~gravywavy

gravywavy
gravywavy
Joined: 22 Jan 05
Posts: 392
Credit: 68962
RAC: 0

RE: Sounds reasonable,

Message 14686 in response to message 14681

Quote:
Sounds reasonable, cause that is not the way Boinc should be working......but you are trying to force it to your will.........

absolutely so.

The danger with any cludge, that is with a neat trick to force software to do something that wasn't originally written into it, is that later updates will not follow the intent of the cludge.

To provide capability for the idea of a backup project, it is safer to write in specific code to offer that service, and not try to trick the schedulers.

A separately coded, separately documented, set of priorities for backup projects would avoid that: those writing upgrades to schedulers would know that backup projects had to be catered for and would write their own code around that documented functionality.

With hindsight it is obvious that this cludge relied on the scheduler not to check the viability of requests for work, and now that that capability has been added to the schedulers (which must be a good thing) it breaks the cludge.

It points up the danger of passing on untested advice, and the danger of passing on not-recently-tested advice.

~~gravywavy

Jim Baize
Jim Baize
Joined: 22 Jan 05
Posts: 116
Credit: 582144
RAC: 0

Ok, after looking through

Message 14687 in response to message 14685

Ok, after looking through your logs I feel that I have a better understanding of what you are saying.

Now, just a question for further clarification. Are you saying that this is a BOINC issue or a project issue? What if the projects were not allowed to delay for an extended period of time in this circumstance?

Or, maybe, what if BOINC were able to take into account the other projects when requesting work?

Quote:
log entries here


Jim

Ziran
Ziran
Joined: 26 Nov 04
Posts: 194
Credit: 459726
RAC: 1582

To me it seems that we need

To me it seems that we need two changes to the BOINC client:

The resource share numbers reported to the server should only be for the requesting project plus all other projects with unfinished work on the client. The resource share numbers for all other projects are irrelevant, since they don't have work on the client.

When all work for a particular project is done, any 'Deferring communication' back-off for that project should be reset and a retry done immediately. When the client finishes its last WU, the 'Deferring communication' back-offs for all projects should be reset and a retry done immediately.
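Something like this sketch of the two changes (Python, purely illustrative; these are not real BOINC structures or names):

    def shares_to_report(requesting_project, projects):
        # Change 1: only report shares for the requesting project plus
        # projects that still hold unfinished work on this client
        return {name: p["share"] for name, p in projects.items()
                if name == requesting_project or p["unfinished_work"] > 0}

    def on_work_finished(projects):
        # Change 2: when a project runs dry, clear its deferral and retry
        # immediately instead of waiting out the back-off
        for p in projects.values():
            if p["unfinished_work"] == 0:
                p["deferred_hours"] = 0

    projects = {
        "Orbit":     {"share": 9900, "unfinished_work": 0, "deferred_hours": 24},
        "Einstein":  {"share": 99,   "unfinished_work": 1, "deferred_hours": 0},
        "Predictor": {"share": 2,    "unfinished_work": 0, "deferred_hours": 24},
    }
    print(shares_to_report("Einstein", projects))   # -> {'Einstein': 99}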

When you're really interested in a subject, there is no way to avoid it. You have to read the Manual.

Jim Baize
Jim Baize
Joined: 22 Jan 05
Posts: 116
Credit: 582144
RAC: 0

River, I just read

River,

I just read someplace a post that said if the cache runs dry BOINC will accept work from the first project that it can connect to. I believe it was JM7 who made the statement.

Now, it's been a while since I looked at your logs, but did your cache ever actually run dry? If so, what were the results? If not, would you be willing to let it run dry to test this statement?

Jim

gravywavy
gravywavy
Joined: 22 Jan 05
Posts: 392
Credit: 68962
RAC: 0

RE: ... a post that said if

Message 14690 in response to message 14689

Quote:
... a post that said if the cache runs dry BOINC will accept work from the first project that it can connect to ...

Correct, but only so long as you understand that it interprets 'first project it can connect to' as meaning ... can connect to after honouring the outstanding deferral periods.

This turns out to be doubly bad. First, it leaves the machine dry for much more than 50% of the time, and second, when it does connect, the work allocation depends on the randomness of the deferral algorithm and bears no relation to the resource shares within the backup projects. This means there is no way of controlling which of your backup projects runs most, or at all.

Quote:

... did your cache ever actually run dry? ...

For more than 50% of the time the machines (both machines, a single-CPU machine and a two-CPU machine) sit there dry.

Anyone who adopts the derisory resource share strategy in the hope of having set up a backup project will be very disappointed when a real outage occurs.

Quote:
would you be willing to let it run dry to test this statement?

yes, Jim, I tried it before making the original posting.

I said in my first posting 'the advice does not work'; I meant I have tried it and found it does not work in practice. It leaves the machines dry for more than half the time, and for more than 24 hours continuously at a time.

If I hadn't tried it I would have posted very differently, like 'I am concerned that this might not work' or somesuch.

Maybe it is time for others to try it also? The best way of seeing what goes on is not by reading other people's logs but by trying for oneself. And, as with any code of more than a few dozen lines, it is important not to believe what the coder says the code should do until you see it doing it in practice (and even then stay sceptical). I mean no disrespect to our programmers; I've been a programmer myself and I know how easy it is to miss the implications of a combination of two pieces of well-meant code that meet unexpectedly.

How to test: Method A

People with an Orbit a/c: set Orbit to a high resource share (~99%) and one or more other projects to a low share, then sit back and watch what actually happens.

People with a Pirates a/c that has not been detached from all clients: likewise.

Method B

I will try out a different way of simulating a project failure and get back to you if it proves an effective test. Having said what I just did about testing, I need to test my proposed test before advising you how to do it! Look out for a later posting.

~~gravywavy

gravywavy
gravywavy
Joined: 22 Jan 05
Posts: 392
Credit: 68962
RAC: 0

RE: Are you saying that

Message 14691 in response to message 14687

Quote:
Are you saying that this is a boinc issue or a project issue?

To my mind this is clearly a BOINC issue, as it is not about the workings of any one project but about the interworking of several projects, and the "communal" feeling amongst projects that if my favourite project can't give me work today, I will help someone else.

Quote:

What if the projects were not allowed to delay for an extended period of time in this circumstance?

Or, maybe, what if BOINC was able to take into account the other projects when requesting work.

In my opinion, and a firmly held opinion, the route you are suggesting is tempting but dangerous. I hope I don't offend by putting this so strongly, but the only right way to code for something is to code for it, not to attempt to trick existing code into doing something outside its design. To attempt to interfere with the design to make a cludge work is the route to software chaos.

We had a cludge, i.e. we thought we could achieve the desired effect without coding for it, simply by putting daft numbers in our configs. Now that this has stopped working, due to better input validation routines in the schedulers, you are saying let's cludge the cludge to make it work after all. This is the crucial point: this advice stopped working due to entirely reasonable actions by programmers to stop the scheduler doing something silly.

The professional way to deliver a feature, if we agree the feature is worth having, is to design code and config settings that explicitly deliver that functionality. Add the functionality to the documentation for future programmers, together with some testing routines for it, and hopefully future changes will honour it.

Any cludge to a cludge will eventually die when effort is put into even better input validation on the schedulers. I say this with no knowledge at all of the workings of BOINC, but with experience coding in several dozen languages, in everything from binary machine code up to 4th-generation and object-oriented languages. Design something to do the job properly - don't cludge, and if you have cludged then certainly don't defend the cludge once it comes on top.

First decision: is there a demand for backup projects? I think the answer is 'yes', judging by the number of times I've seen many different people give out the erroneous advice.

Second decision: is there willingness to code for it? This decision can only be taken by the people who do the coding, and will be made in the light of countervailing calls on their time. I respect that.

Third decision: if there is willingness to code for it, is Gravy's solution the best? If not, what other solutions are there? If we have separate user config settings for normal resource shares and for backup priorities, how are these input, how are they held, and how are they treated by the client? I do not insist that my solution is the only one, nor that it is the best: what I do say is that any solution that tries to use one setting to do two different things is inherently unstable.

So my answer is yes: if we made either of the changes you suggest, the cludge would work again for a while. But in a few months we'd be sure to find either that our fix had broken something else, or that yet another change somewhere else had broken our cludge again.

~~gravywavy
