Excessive Work Cache Size - How to screw your new Wingman!!

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2988999848
RAC: 705169

RE: RE: RE: BOINC is

Quote:
Quote:
Quote:
BOINC is giving precedence to workunits whose due dates start with a "0".

Are you sure it's the start with 0 that's significant? Only looking at the day, but not the month part of the timestamp would be a more plausible bug. If you can get any May 10 wu's before finishing the last April task the two scenarios would be easy to differentiate.

You could be right. I haven't run any extensive tests on that, but your theory sounds plausible.


It's a fascinating theory, but I think you may have a difficult job making it stick.

As far as I can tell, deadlines are transmitted, stored and acted upon in Unix timestamp format:

    h1_1467.10_S5R4__551_S5GC1HFa
    1304486824.000000


That's from client_state.xml, which of course is only a human-readable version of the internal binary representation, but it gives an idea. If you put that number into a tool like http://www.onlineconversion.com/unix_time.htm, you get "Wed, 4 May 2011 05:27:04 UTC". I'm seeing "04/05/2011 06:27:04", because I'm a European in a UTC+1 timezone. I suspect a New Yorker might see "05/04/2011 01:27:04 AM" - note the reversed day-month order because of national preferences. And in San Francisco it would be "05/03/2011 10:27:04 PM" - another three time zones west takes it into a different day.
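For anyone who wants to check the conversion without a web tool, here is a minimal Python sketch. The helper name `render` and the fixed offsets (UTC+1, UTC-4, UTC-7, i.e. the summer-time offsets assumed for the three locations) are illustrative choices, not anything taken from BOINC itself.

```python
from datetime import datetime, timedelta, timezone

# Deadline value quoted above, in seconds since the Unix epoch.
DEADLINE = 1304486824

def render(ts, offset_hours, fmt):
    """Format one absolute timestamp for a fixed UTC offset (hypothetical helper)."""
    tz = timezone(timedelta(hours=offset_hours))
    return datetime.fromtimestamp(ts, tz).strftime(fmt)

# The same instant, rendered with different offsets and national date conventions.
print(render(DEADLINE, 0, "%Y-%m-%d %H:%M:%S UTC"))   # UTC
print(render(DEADLINE, 1, "%d/%m/%Y %H:%M:%S"))       # UTC+1, day first
print(render(DEADLINE, -4, "%m/%d/%Y %I:%M:%S %p"))   # UTC-4, month first
print(render(DEADLINE, -7, "%m/%d/%Y %I:%M:%S %p"))   # UTC-7, previous calendar day
```

The stored number never changes; only the string produced for display does.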

Scheduling decisions, I'm pretty certain, are taken by the Core Client calculating absolute (UTC) times in Unix format. The deadlines are rendered into a calendar date by a combination of BOINC Manager and Operating System format conventions. I doubt the numerology of the resulting string representation has any real effect on scheduling.
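To make the distinction concrete, here is a small sketch comparing the two orderings. It is not BOINC code: the task names and two of the three deadlines are made up for illustration, and only the first timestamp is the one quoted above.

```python
from datetime import datetime, timezone

# Hypothetical tasks; only task_a's deadline is the value from client_state.xml.
tasks = {
    "task_a": 1304486824,  # 4 May 2011
    "task_b": 1304000000,  # 28 April 2011
    "task_c": 1305000000,  # 10 May 2011
}

def day_first(ts):
    """Render a timestamp as a DD/MM/YYYY string in UTC."""
    return datetime.fromtimestamp(ts, timezone.utc).strftime("%d/%m/%Y")

# Ordering by the numeric timestamp: earliest deadline first.
by_timestamp = sorted(tasks, key=tasks.get)

# Ordering by the displayed string: what a "look at the day digits" bug would do.
by_string = sorted(tasks, key=lambda name: day_first(tasks[name]))

print(by_timestamp)
print(by_string)
```

Comparing numbers puts the April task first. Comparing the rendered strings would push both May tasks ahead of it, which is the symptom being reported, so the two hypotheses are at least easy to tell apart in a test.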

Donald A. Tevault
Joined: 17 Feb 06
Posts: 439
Credit: 73516529
RAC: 0

RE: RE: RE: RE: BOINC

Quote:
Quote:
Quote:
Quote:
BOINC is giving precedence to workunits whose due dates start with a "0".

Are you sure it's the start with 0 that's significant? Only looking at the day, but not the month part of the timestamp would be a more plausible bug. If you can get any May 10 wu's before finishing the last April task the two scenarios would be easy to differentiate.

You could be right. I haven't run any extensive tests on that, but your theory sounds plausible.


It's a fascinating theory, but I think you may have a difficult job making it stick.

As far as I can tell, deadlines are transmitted, stored and acted upon in Unix timestamp format:

Well, whatever the reason, I have two machines that really are behaving exactly as described. Whenever they download workunits for the next month, they quit processing the ones for the current month. What's more, they give "high priority" status to the next month's workunits, while completely ignoring the ones for the current month.

mikey
Joined: 22 Jan 05
Posts: 12815
Credit: 1880716510
RAC: 1156999

RE: RE: RE: RE: Quote

Quote:
Quote:
Quote:
Quote:
Quote:
BOINC is giving precedence to workunits whose due dates start with a "0".

Are you sure it's the start with 0 that's significant? Only looking at the day, but not the month part of the timestamp would be a more plausible bug. If you can get any May 10 wu's before finishing the last April task the two scenarios would be easy to differentiate.

You could be right. I haven't run any extensive tests on that, but your theory sounds plausible.


It's a fascinating theory, but I think you may have a difficult job making it stick.

As far as I can tell, deadlines are transmitted, stored and acted upon in Unix timestamp format:

Well, whatever the reason, I have two machines that really are behaving exactly as described. Whenever they download workunits for the next month, they quit processing the ones for the current month. What's more, they give "high priority" status to the next month's workunits, while completely ignoring the ones for the current month.

Check the memory usage on those units. If your PC only has 2 GB of RAM, it could be that BOINC Manager doesn't have enough resources to finish a unit, so it moves on to another unit, which would look like what you are seeing. To test this, open BOINC Manager (down by the clock), then go to Advanced, Preferences, then the memory usage tab, and where it gives the percentages for when the computer is idle and in use, change the in-use one from the default of 50% to, say, 85%. This gives BOINC more memory for the units and should stop it from suspending some units and keeping them in memory while starting new ones.

Donald A. Tevault
Joined: 17 Feb 06
Posts: 439
Credit: 73516529
RAC: 0

RE: RE: RE: RE: Quote

Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
BOINC is giving precedence to workunits whose due dates start with a "0".

Are you sure it's the start with 0 that's significant? Only looking at the day, but not the month part of the timestamp would be a more plausible bug. If you can get any May 10 wu's before finishing the last April task the two scenarios would be easy to differentiate.

You could be right. I haven't run any extensive tests on that, but your theory sounds plausible.


It's a fascinating theory, but I think you may have a difficult job making it stick.

As far as I can tell, deadlines are transmitted, stored and acted upon in Unix timestamp format:

Well, whatever the reason, I have two machines that really are behaving exactly as described. Whenever they download workunits for the next month, they quit processing the ones for the current month. What's more, they give "high priority" status to the next month's workunits, while completely ignoring the ones for the current month.

Check the memory usage on those units. If your PC only has 2 GB of RAM, it could be that BOINC Manager doesn't have enough resources to finish a unit, so it moves on to another unit, which would look like what you are seeing. To test this, open BOINC Manager (down by the clock), then go to Advanced, Preferences, then the memory usage tab, and where it gives the percentages for when the computer is idle and in use, change the in-use one from the default of 50% to, say, 85%. This gives BOINC more memory for the units and should stop it from suspending some units and keeping them in memory while starting new ones.

No, that's not it. Both machines have 4 GB of RAM and run OpenSuSE Linux. Neither one ever gives me a "waiting for memory" message. And, when they start on the next month's workunits, four of them get processed at a time just fine. The current month's workunits get processed four at a time as well, until the next month's workunits show up. Then, processing totally shifts to the new month's units, and the current month's units are totally ignored until I put a hold on the next month's workunits.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5877
Credit: 118586504238
RAC: 17788802

RE: Well, whatever the

Quote:
Well, whatever the reason, I have two machines that really are behaving exactly as described. Whenever they download workunits for the next month, they quit processing the ones for the current month. What's more, they give "high priority" status to the next month's workunits, while completely ignoring the ones for the current month.


Hi Donnie,

I know exactly what you are talking about as I see my own version of this particular behaviour on a subset of my machines. My observations are a bit different as I'll explain later, but two key elements to initiate the behaviour seem to be, (a) any event that triggers high priority (HP) mode and (b) the particular version(s) of BOINC.

Most of my hosts run v6.2.15 of BOINC on Linux. I can't really recall ever seeing the problem there even though there would have been a few instances of HP mode from time to time. I do recall seeing the behaviour many times on Windows machines running 6.10.x. I run 12 of these with ATI graphics cards for Milkyway, with E@H on the CPUs. I have another group of Phenom IIs running Windows as well because I intended (but haven't done so yet) to put in nVidia cards of some description for BRP3. I also have Windows machines that belong to relatives so I don't have a choice about the OS. I don't think I'm running any old versions of BOINC on any of these.

My observation is that to crunch tasks way out of proper date order, there has to be something that triggers HP first. I run with a fairly large cache size so here are some things that will always trigger HP mode.

* Whilst task completion times are usually fairly constant on a given host, on particularly warm days a single task may take as much as double the normal time (I assume thermal throttling). When this task completes, the sudden large increase in time estimate for the entire cache triggers HP. I have actually watched this happen on quads. The other 3 tasks in progress will be suspended and 4 new tasks (well out of sequence) will start and the time to deadline of the 4 new tasks will be way longer than that of the ones that were suspended. There is no apparent rhyme or reason to the choice. When I actually see it, I just suspend enough tasks in the cache to remove HP and then things go back to normal, with future tasks being started in proper sequence. A day or so later, with enough tasks completed to restore the DCF to a more sane value, the suspended tasks can be unsuspended and all is fine. If the number of tasks needing to be suspended to get rid of HP is too large for comfort, I'll stop BOINC and edit the DCF directly.

* Leaving memory hungry processes open and idle for too long (Win XP). For example, if large numbers of Firefox windows, each with multiple tabs, get left open on an office machine over a long weekend or longer, eventually resources are consumed to the point that E@H tasks start taking maybe 5 times longer than normal. This leads to the same sort of problem as before.

* Disconnecting keyboard and mouse on WinXP machines. I run a lot of machines with just a power and ethernet cable attached - except if it runs WinXP. These machines always now have to have a keyboard & mouse attached or else crunching will suddenly slow to a crawl after a gestation period of a week or three. It's not sufficient (although the problem is less frequent) just to have the extras attached. For true freedom from dramatic slowdown, there needs to be some activity once in a while. Toggling the numlock key once a week solves the problem. If this is not done, HP mode is activated by the first task completed after the dramatic slowdown.

* Anything that reduces the fraction of time a machine is recorded as being up and running BOINC, such as being left off, lockup or crash when unattended, client shut down inadvertently, etc. On restarting, BOINC may think (because of the lower fraction) that the cache cannot be completed by the deadline, and so initiates HP.

* A project that suddenly comes up with new work after a long period of having none. I saw this happen a few hours ago. I have a few machines that have the LHC project attached. On one of these, I woke up this morning to see 2 new LHC tasks in the cache ready to run. LHC has a large resource share and with the lack of work there, E@H always fills the cache completely on its own. So now BOINC is assuming there will be lots more LHC tasks to come and the "excess" of E@H tasks will therefore cause a deadline problem. It's a dual core machine and the two E@H tasks closest to deadline had been stopped and two with quite a bit longer to deadline had been started to replace them. I just suspended all E@H tasks to allow the LHC tasks to start immediately. I unsuspended the 4 E@H tasks that had already started and went and had breakfast. By the time I'd finished, the LHC tasks had been crunched and the original two E@H tasks had been restarted. So I reported the LHC tasks and unsuspended the entire cache of E@H tasks and all was back to normal.
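A rough way to see why one slow task can tip a full cache into HP is to model the estimate arithmetic. The sketch below is a deliberately simplified toy, not BOINC's actual scheduler simulation: it assumes every estimate is just a base estimate multiplied by the DCF, that tasks are run earliest-deadline-first on identical cores, and it uses made-up numbers (a quad core, 160 tasks of 6 hours each, all due in 336 hours). The function name `tasks_missing_deadline` is hypothetical.

```python
import heapq

def tasks_missing_deadline(tasks, dcf, ncpus):
    """Count tasks projected to finish late in a toy earliest-deadline-first model.

    tasks: list of (base_estimate_hours, hours_until_deadline) pairs.
    dcf:   duration correction factor applied to every base estimate.
    """
    # Each heap entry is the time (in hours) at which one core becomes free.
    cores = [0.0] * ncpus
    heapq.heapify(cores)
    missed = 0
    for estimate, deadline in sorted(tasks, key=lambda t: t[1]):
        start = heapq.heappop(cores)      # next core to become free
        finish = start + estimate * dcf   # projected completion time
        heapq.heappush(cores, finish)
        if finish > deadline:
            missed += 1
    return missed

# Illustrative cache: 160 tasks x 6 h = 10 days of work for 4 cores, due in 14 days.
cache = [(6.0, 336.0)] * 160

print(tasks_missing_deadline(cache, dcf=1.0, ncpus=4))  # normal estimates
print(tasks_missing_deadline(cache, dcf=2.0, ncpus=4))  # after one task doubles the DCF
```

With these numbers nothing is projected late at a DCF of 1.0, but doubling the DCF makes the tail of the cache look undeliverable, which is the kind of projection that would push the client into HP. Suspending tasks or letting the DCF settle shrinks the projected workload again.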

I don't know why tasks further away from deadline always get chosen to replace tasks much more likely to be under deadline stress in all of these cases. As I said, there was nothing obvious in the choice of which tasks to start after HP was invoked. I must admit that I've seen it so many times, and recovery is so easy, that it doesn't bother me much any more. I just assume it's a bug in particular versions of BOINC and perhaps will be fixed when 6.12.x goes live.

Cheers,
Gary.

Donald A. Tevault
Joined: 17 Feb 06
Posts: 439
Credit: 73516529
RAC: 0

RE: RE: Well, whatever

Quote:
Quote:
Well, whatever the reason, I have two machines that really are behaving exactly as described. Whenever they download workunits for the next month, they quit processing the ones for the current month. What's more, they give "high priority" status to the next month's workunits, while completely ignoring the ones for the current month.

Hi Donnie,

I know exactly what you are talking about as I see my own version of this particular behaviour on a subset of my machines. My observations are a bit different as I'll explain later, but two key elements to initiate the behaviour seem to be, (a) any event that triggers high priority (HP) mode and (b) the particular version(s) of BOINC.

Hi Gary!

I want to thank you for confirming my observations, since everyone else seems to think that I'm crazy.

Both of my problem machines are running the 64-bit Linux version of BOINC 6.10.58. So I'm guessing that what we're seeing is due to a BOINC bug.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2988999848
RAC: 705169

RE: RE: RE: Well,

Quote:
Quote:
Quote:
Well, whatever the reason, I have two machines that really are behaving exactly as described. Whenever they download workunits for the next month, they quit processing the ones for the current month. What's more, they give "high priority" status to the next month's workunits, while completely ignoring the ones for the current month.

Hi Donnie,

I know exactly what you are talking about as I see my own version of this particular behaviour on a subset of my machines. My observations are a bit different as I'll explain later, but two key elements to initiate the behaviour seem to be, (a) any event that triggers high priority (HP) mode and (b) the particular version(s) of BOINC.


Hi Gary!

I want to thank you for confirming my observations, since everyone else seems to think that I'm crazy.

Both of my problem machines are running the 64-bit Linux version of BOINC 6.10.58. So I'm guessing that what we're seeing is due to a BOINC bug.


I think it's a very real effect, and what's more, I think it's going to go on happening with BOINC v6.12 and beyond.

What we're seeing is somewhat counter-intuitive behaviour from BOINC when it comes under deadline stress, for any of the reasons cited by Gary. There's nothing wrong with the general principle of 'High Priority' running, where the intention is to ensure as high a proportion as possible of deadline-stressed tasks are returned in good time. If the algorithm employed ends up with a higher proportion of tasks missing deadline than alternative strategies would, then I would call 'bug' with everyone else: but I don't think we've demonstrated that yet. Anyone who thinks that the current strategy can be improved is welcome to read Emulating Volunteer Computing Scheduling Policies (paper by David Anderson, pdf format).

The only point where I disagree with you is the attribution of 'high priority' work to calendar dates in the first third of the month. If you persist with high cache settings into the beginning of May, I think you'll find that it's the 10/11/12 deadlines which are brought forward: and in the middle of the month, it'll be 20/21/22.

But, please go back and (re-)read Gary's opening post in this thread. The basic underlying reason for high priority running is your choice to keep a 10-day cache on a 14-day turnround project. That's tight, and as Gary demonstrated, risky and (arguably) antisocial - from the perspective of both the project as a whole, and of your individual wingmates.
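The margin involved is easy to put numbers on. This is back-of-envelope arithmetic only, assuming a task joins the back of a full cache and that the cache drains at the estimated rate; the function name `cache_headroom` is made up for the example.

```python
def cache_headroom(cache_days, deadline_days):
    """Return (slack in days, largest slowdown factor that still meets the deadline)."""
    slack_days = deadline_days - cache_days
    max_slowdown = deadline_days / cache_days
    return slack_days, max_slowdown

slack, slowdown = cache_headroom(cache_days=10, deadline_days=14)
print(f"slack: {slack} days, tolerable slowdown: {slowdown:.2f}x")

# For comparison, a more modest (hypothetical) 3-day cache on the same project.
print(cache_headroom(cache_days=3, deadline_days=14))
```

With a 10-day cache there are only 4 days of slack, so anything that makes tasks run (or appear to run) more than about 1.4 times slower than estimated, or keeps the host offline for more than 4 days, puts the newest tasks past their deadline.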

Donald A. Tevault
Joined: 17 Feb 06
Posts: 439
Credit: 73516529
RAC: 0

RE: But, please go back

Quote:

But, please go back and (re-)read Gary's opening post in this thread. The basic underlying reason for high priority running is your choice to keep a 10-day cache on a 14-day turnround project. That's tight, and as Gary demonstrated, risky and (arguably) antisocial - from the perspective of both the project as a whole, and of your individual wingmates.

I disagree with your assertion that I'm "antisocial". Until BOINC started screwing up in this manner, my 10-day caches were working out just fine.

FrankHagen
Joined: 13 Feb 08
Posts: 102
Credit: 272200
RAC: 0

just let me ad my 5 cents to

just let me add my 5 cents to this..

i recently attached a new host and i'm running CUDA-WU's only.
as i know things happen, it was set for 0,01 days work buffer in advance, but what happened then?

it got 10 jobs immediately and a minute later another 10 showed up. most likely estimated flops is way off on project side - at least for those running GPU only.
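to show how a bad runtime estimate could produce that, here is a toy calculation. it is only a sketch under assumptions: it supposes the number of tasks sent is roughly the requested buffer time divided by the estimated runtime per task, and both runtime figures (90 seconds and one hour) are invented for illustration, not values from the project.

```python
import math

SECONDS_PER_DAY = 86400

def tasks_to_fill_buffer(buffer_days, est_runtime_s, instances=1):
    """Toy model: tasks needed to cover the work buffer at the estimated runtime."""
    buffer_seconds = buffer_days * SECONDS_PER_DAY * instances
    return max(1, math.ceil(buffer_seconds / est_runtime_s))

# 0.01 days is 864 seconds of requested work.
print(tasks_to_fill_buffer(0.01, est_runtime_s=90))    # badly underestimated runtime
print(tasks_to_fill_buffer(0.01, est_runtime_s=3600))  # a more realistic hour per task
```

under this model an estimate that is far too low turns even a tiny buffer into a batch of ten tasks, while a realistic estimate yields a single task.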

FrankHagen
Joined: 13 Feb 08
Posts: 102
Credit: 272200
RAC: 0

RE: RE: But, please go

Quote:
Quote:

But, please go back and (re-)read Gary's opening post in this thread. The basic underlying reason for high priority running is your choice to keep a 10-day cache on a 14-day turnround project. That's tight, and as Gary demonstrated, risky and (arguably) antisocial - from the perspective of both the project as a whole, and of your individual wingmates.

I disagree with your assertion that I'm "antisocial". Until BOINC started screwing up in this manner, my 10-day caches were working out just fine.

since you say you know it's not working well anymore (and we all know DA's brickhead scheduler gets more and more crazy), go blame DA!

it's pretty crazy to insist on a setup which you are calling "screwed up" yourself..
