Excessive Work Cache Size - How to screw your new Wingman!!

Claggy
Joined: 29 Dec 06
Posts: 560
Credit: 2738849
RAC: 1342

RE: RE: RE: As those

Quote:
Quote:
Quote:
As those tasks will be crunched quickly, each one completed will result in a drop in the DCF as BOINC tries to correct the estimate. This will affect the CPU work cache (each CPU task will end up with a reduced estimate) causing more CPU work to be downloaded to fill the cache. If a CPU task then completes, the actual crunch time will be longer than the reduced estimate and there will be a single upward step in DCF to make the correction to all tasks in the cache. As a result of this, BOINC may now suddenly think that you have too many hours of work to complete safely within the deadline so HP mode is entered immediately.

Exactly, that's what happens. The bigger the difference between CPU and GPU runtimes, the worse it gets.

And if the BOINC developers could come up with a way to separate that in the software, then more of us would run both the CPU and GPU on the same project. Right now a lot of us must crunch for project A on one machine's CPU but use a different machine's GPU to solve the problem and still have a decent-sized cache. Then we reverse it on a second PC.

There are two ways to fix that. Either the project updates its server software, in which case it will have per-app scaling that eventually achieves a DCF of 1.
The problem with that is that it takes 10 validations before it kicks in. That's OK with GPU tasks, but tasks like CPU Astropulse or BRP3SSE may take weeks or months to achieve 10 validations,
and there's likely to be a big change in DCF every time one of those WUs completes, which affects all the predicted runtimes for other apps from that project.

The other way to fix this is to convince DA to do separate DCFs per app. I've been running a third-party version of 6.10.58 with separate DCFs since last July.
It also has a better convergence algorithm: DCFs/runtimes no longer shoot up if a task takes longer than initially predicted; the DCF for that app just increases gradually.
It works very well on my hosts, especially on my E8500/GTX460/HD5770. Faster or slower CPU, CUDA or ATI tasks don't affect each other at all,
enabling one resource to still get work when the other resources' DCFs have gone up enough to block their work fetch.
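
To illustrate the idea of separate DCFs with gradual convergence, here is a minimal sketch. It is not the code from the third-party client; the class name, the app names, the smoothing gain of 0.1 and the task times are all assumptions chosen for illustration.

```python
from collections import defaultdict


class PerAppDCF:
    """Hypothetical tracker keeping one duration correction factor per app."""

    def __init__(self, gain=0.1):
        self.gain = gain                     # assumed smoothing gain
        self.dcf = defaultdict(lambda: 1.0)  # every app starts at DCF = 1.0

    def task_completed(self, app, estimated_s, actual_s):
        """Nudge this app's DCF toward actual/estimated, up or down."""
        ratio = actual_s / estimated_s       # estimated_s is the uncorrected estimate
        self.dcf[app] += self.gain * (ratio - self.dcf[app])
        return self.dcf[app]

    def estimate(self, app, raw_estimate_s):
        """Corrected runtime estimate for a task of the given app."""
        return raw_estimate_s * self.dcf[app]


tracker = PerAppDCF()

# Twenty overestimated GPU tasks pull only the GPU app's DCF down.
for _ in range(20):
    tracker.task_completed("gpu_app", estimated_s=18000, actual_s=3600)

# One underestimated CPU task raises only the CPU app's DCF, and only slightly.
tracker.task_completed("cpu_app", estimated_s=36000, actual_s=54000)

print(round(tracker.dcf["gpu_app"], 3), round(tracker.dcf["cpu_app"], 3))
```

Because each app has its own factor, the GPU completions never shrink the CPU estimates, and a slow CPU task moves its DCF by a small step instead of a single large jump.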

Claggy

mikey
Joined: 22 Jan 05
Posts: 12803
Credit: 1879221686
RAC: 1401026

RE: RE: RE: RE: As

Quote:
Quote:
Quote:
Quote:
As those tasks will be crunched quickly, each one completed will result in a drop in the DCF as BOINC tries to correct the estimate. This will affect the CPU work cache (each CPU task will end up with a reduced estimate) causing more CPU work to be downloaded to fill the cache. If a CPU task then completes, the actual crunch time will be longer than the reduced estimate and there will be a single upward step in DCF to make the correction to all tasks in the cache. As a result of this, BOINC may now suddenly think that you have too many hours of work to complete safely within the deadline so HP mode is entered immediately.

Exactly, that's what happens. The bigger the difference between CPU and GPU runtimes, the worse it gets.

And if the BOINC developers could come up with a way to separate that in the software, then more of us would run both the CPU and GPU on the same project. Right now a lot of us must crunch for project A on one machine's CPU but use a different machine's GPU to solve the problem and still have a decent-sized cache. Then we reverse it on a second PC.

There are two ways to fix that. Either the project updates its server software, in which case it will have per-app scaling that eventually achieves a DCF of 1.
The problem with that is that it takes 10 validations before it kicks in. That's OK with GPU tasks, but tasks like CPU Astropulse or BRP3SSE may take weeks or months to achieve 10 validations,
and there's likely to be a big change in DCF every time one of those WUs completes, which affects all the predicted runtimes for other apps from that project.

The other way to fix this is to convince DA to do separate DCFs per app. I've been running a third-party version of 6.10.58 with separate DCFs since last July.
It also has a better convergence algorithm: DCFs/runtimes no longer shoot up if a task takes longer than initially predicted; the DCF for that app just increases gradually.
It works very well on my hosts, especially on my E8500/GTX460/HD5770. Faster or slower CPU, CUDA or ATI tasks don't affect each other at all,
enabling one resource to still get work when the other resources' DCFs have gone up enough to block their work fetch.

Claggy

Sounds interesting, is it available for others to use too?

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2986913390
RAC: 716565

RE: Sounds interesting, is

Quote:
Sounds interesting, is it available for others to use too?


Unfortunately not. It was designated as an experimental project, for testing and "proof of concept" purposes only. It isn't being maintained alongside all the (useful) improvements - like proper ATI support - being added to the V6.12.xx range of BOINC clients.

Having said that, I'm sure the source code would be made available if there was pressure from projects for BOINC to adopt the client-managed approach to multiple DCF support, rather than the server-side approach currently adopted by the central BOINC developers.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2986913390
RAC: 716565

RE: RE: ... For my

Quote:
Quote:
... For my higher-performance machines, I generally like to keep a ten-day cache.
....
.... on the two machines that I've upgraded to CUDA-capable ....

Whilst there isn't usually a problem with a 10 day cache for "always on" machines that are being regularly monitored and observed, I must apologise for overlooking the second part of the quote. At the time, I was catching up with a number of posts and I was skimming through quite quickly until I got to Richard's post and the 'unix time format' explanation of why the 'leading zero' theory couldn't be the cause of the problem. That post grabbed my attention and I ended up responding to your very next message without going back and re-reading your original post.

For some reason, I was quite sure in my mind that you were talking about CPU only hosts, so I only bothered to list some ways that I knew from personal experience that HP mode could be triggered on such machines. I don't have any hosts with GPUs so I have no relevant experience to call on anyway.

However, from what others have said about running CPU and GPU tasks from the same project simultaneously, I think I understand why you can get away with 10 day caches on CPU machines but could easily have trouble on a host with a fast GPU. Someone will correct me if I'm wrong, but I believe the time taken to crunch a BRP task on a fast GPU is considerably overestimated. I might be wrong about this and it might just be badly overestimated in those cases where people are running under AP but without an estimate in app_info.xml. People have said they have to increase the cache size just to get a supply of GPU tasks.


It isn't quite as simple as that. Sure, a "fast" GPU is overestimated (although not by as much on this project as on some others), but what matters most is the ratio of CPU/GPU processing speeds of a particular host, and whether it matches the ratio assumed by the project. If the hardware is 'balanced' reasonably well, things aren't too bad: but a fast GPU in a host with a slow CPU, or a slow GPU in an otherwise fast host, will both throw up problems.

Quote:
As those tasks will be crunched quickly, each one completed will result in a drop in the DCF as BOINC tries to correct the estimate. This will affect the CPU work cache (each CPU task will end up with a reduced estimate) causing more CPU work to be downloaded to fill the cache. If a CPU task then completes, the actual crunch time will be longer than the reduced estimate and there will be a single upward step in DCF to make the correction to all tasks in the cache. As a result of this, BOINC may now suddenly think that you have too many hours of work to complete safely within the deadline so HP mode is entered immediately.


Or, worse, it may even have already downloaded so much work that it really can't be completed in time. With CUDA tasks completing in around an hour on fast cards, it is possible - especially on a host with multiple cards - for DCF to be driven far enough down the sawtooth by multiple completions to use up the entire 40% margin between a 10 day cache and the 14 day deadline. When the next CPU completion pulls DCF sharply back up, it may already be too late.
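
To put rough numbers on that, here is a small simulation of a single project-wide DCF. The update rule is a simplified model (jump straight up when a task overruns, drift slowly down otherwise), not the actual client code, and the 0.3 ratio for GPU tasks and the 0.9/0.1 weights are assumptions.

```python
def update_dcf(dcf, ratio):
    """Simplified single-DCF rule; the constants are assumptions."""
    if ratio > dcf:
        return ratio                 # one sharp upward step
    return 0.9 * dcf + 0.1 * ratio   # slow downward drift


GPU_RATIO = 0.3       # assumed actual/estimated runtime for GPU tasks
CPU_RATIO = 1.0       # assumed actual/estimated runtime for CPU tasks
CACHE_DAYS = 10.0     # requested cache size
DEADLINE_DAYS = 14.0  # task deadline

dcf = CPU_RATIO
history = []
for _ in range(6):    # six GPU completions in a row
    dcf = update_dcf(dcf, GPU_RATIO)
    history.append(round(dcf, 3))

# Worst case: the CPU cache is filled to "10 days" while DCF is depressed,
# so the real amount of CPU work held is larger by the factor CPU_RATIO / dcf.
real_days = CACHE_DAYS * CPU_RATIO / dcf
over_deadline = real_days > DEADLINE_DAYS

# The next CPU completion snaps DCF back up in one step.
dcf_after_cpu = update_dcf(dcf, CPU_RATIO)

print(history, round(real_days, 2), over_deadline, dcf_after_cpu)
```

The 40% margin is gone once DCF falls below 10/14, about 0.71 of its correct value. In this model that takes only a handful of GPU completions, after which a nominal 10 day cache already holds more than 14 days of CPU work.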

Quote:
So, on your two CUDA enabled hosts, what is the estimate for a GPU task and how long does it actually take to crunch one? If this really is the cause of your problem, I believe you could fix it by using an app_info.xml file and playing with parameters there. In your state file there should be a parameter. My understanding is that the value of this parameter is supplied by the project, so you can't tweak it in the state file without it getting reset every time you get new tasks. If you were to set up an app_info.xml and run under AP, you could include a tweakable entry and fix the crunch time estimate for your setup. If you prevent DCF from oscillating you should become stable again and avoid HP mode.


Installing and fine-tuning an app_info.xml is no easy matter. The values needed have to be derived, by calculation or experimentation, to suit each individual host: the relative processing speeds of S5GCE and BRP3 vary with memory speed as well as CPU speed, and there is a huge range of CUDA-capable GPUs available. Ideally, nirvana is attained with values which drive both application types to the same DCF value. When you achieve that, estimated runtimes for tasks ready to start remain unchanged when tasks complete and DCF is re-calculated, but it's harder to achieve than you might think.
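
As a rough sketch of the calculation involved: if the uncorrected estimate is taken to be the task's estimated operation count divided by the app's flops value, then each app on its own would pull DCF toward actual/estimated. The function names and all the numbers below are hypothetical, and real hosts will need experimentation rather than a single formula.

```python
def implied_dcf(actual_s, fpops_est, flops):
    """DCF that this app alone would converge to (assumed model)."""
    estimate_s = fpops_est / flops   # uncorrected runtime estimate
    return actual_s / estimate_s


def matched_flops(flops, dcf_this_app, dcf_other_app):
    """Flops value that moves this app's implied DCF onto the other app's."""
    return flops * dcf_other_app / dcf_this_app


# Hypothetical host: the GPU app finishes in 20,000 s but is estimated at 100,000 s.
gpu_fpops_est = 1e15
gpu_flops = 1e10
gpu_dcf = implied_dcf(20000, gpu_fpops_est, gpu_flops)

# Assume the CPU app is already estimated correctly, i.e. its implied DCF is 1.0.
cpu_dcf = 1.0

new_gpu_flops = matched_flops(gpu_flops, gpu_dcf, cpu_dcf)
new_gpu_dcf = implied_dcf(20000, gpu_fpops_est, new_gpu_flops)

print(gpu_dcf, new_gpu_flops, new_gpu_dcf)
```

With both apps converging to the same DCF, a completion of either type leaves the estimates for the other type unchanged, which is the stable state described above.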

Far better if the BOINC system - taking client and server together - maintains the necessary adjustments automatically. The central BOINC developers have chosen to follow the server-managed route: this works well, once the averaging process has fully started, but there are some rough edges and transition points in the current code. Bernd has indicated (BOINC server components (scheduler) upgrade) that he intends to deploy it eventually, but he wants to ensure that it works properly before making the change.

As noted elsewhere in this thread, the alternative is to maintain multiple DCFs separately in the client, and experimental code already exists to do this. But adopting a client-led approach at this late stage would require a major U-turn by the central BOINC developers, and there may be some resistance to this.

Testing is underway and code is being developed. But in the meantime, can I join Gary in urging users to avoid excessive cache sizes, and the consequences explained in the thread title and the opening post?

KittenKaboodle
Joined: 9 Feb 11
Posts: 13
Credit: 10765731
RAC: 0

RE: However, from what

Quote:
However, from what others have said about running CPU and GPU tasks from the same project simultaneously, I think I understand why you can get away with 10 day caches on CPU machines but could easily have trouble on a host with a fast GPU. Someone will correct me if I'm wrong, but I believe the time taken to crunch a BRP task on a fast GPU is considerably overestimated. I might be wrong about this and it might just be badly overestimated in those cases where people are running under AP but without an estimate in app_info.xml. People have said they have to increase the cache size just to get a supply of GPU tasks.

That's exactly the problem when running more than one GPU WU simultaneously.
However, inserting the flops tag into app_info.xml solves this problem. After increasing the given value for the tag by a factor of 5, BOINC's estimate of the time it takes to crunch one GPU WU is down to a reasonable value.
I have now set my cache to 3 days and I am still getting enough GPU WUs.
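
For anyone wondering why the flops value changes how much GPU work you get, here is a back-of-envelope sketch. It assumes the estimate is the task's operation count divided by flops, times DCF; the function names and the operation count are made up for illustration.

```python
def estimated_runtime_s(fpops_est, flops, dcf=1.0):
    """Runtime estimate under the assumed model fpops_est / flops * DCF."""
    return fpops_est / flops * dcf


def tasks_to_fill_cache(cache_days, estimate_s):
    """How many tasks appear to fill a cache of the given size."""
    return int(cache_days * 86400 // estimate_s)


FPOPS_EST = 1.8e14   # hypothetical operation count per GPU task
OLD_FLOPS = 1e10     # hypothetical original flops value
NEW_FLOPS = OLD_FLOPS * 5

old_estimate = estimated_runtime_s(FPOPS_EST, OLD_FLOPS)   # 5 hours
new_estimate = estimated_runtime_s(FPOPS_EST, NEW_FLOPS)   # 1 hour

old_tasks = tasks_to_fill_cache(3, old_estimate)
new_tasks = tasks_to_fill_cache(3, new_estimate)

print(old_estimate, new_estimate, old_tasks, new_tasks)
```

With the inflated estimate, a 3 day cache looks full with about a fifth as many tasks as the GPU can really get through, which is why people end up raising the cache size just to keep the GPU fed.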

Kenny-Tower
Joined: 20 Oct 10
Posts: 1
Credit: 1666771
RAC: 0

Hi I have noticed since

Hi

I have noticed since reading your post that I am currently working on tasks due 16/5 (16 May) while tasks with deadlines of 9/5 (9 May) are not being done.

I am running Win 7 64-bit with a quad-core CPU. From what I have read there may be an issue with the 64-bit BOINC. I would put my money on BOINC 6.10.60 ver2.8.10 not realizing that we are not in America and that the day of the month comes first, then the month.

I normally only cache 0.5 days, but I recently used up my broadband allocation (3 gigs) and was automatically allocated another 3 gig unit with 4 days until the end of the billing period, when the unused gigs would be lost. I increased my cache size (to 5 days) until it had used up 2 gigs of broadband traffic. I normally limit my network usage to 2 gigs per month due to the expense. I did not want to waste over 2 gigs I will have to pay for.

Now I am wondering how this will affect next month's workload. Are the results as large as the downloaded tasks? I have set my cache to 0.5 days and my limit back to 2 gigs in 30 days. Does the BOINC manager spread the load evenly over the month or just stop when it has reached its quota? Does it leave network data (gigs) so that it can report completed tasks? Do you have any recommendations on how to manage my network traffic?

DanNeely
Joined: 4 Sep 05
Posts: 1364
Credit: 3562358667
RAC: 0

RE: Now I am wondering how

Quote:
Now I am wondering how this will affect next month's workload. Are the results as large as the downloaded tasks? I have set my cache to 0.5 days and my limit back to 2 gigs in 30 days. Does the BOINC manager spread the load evenly over the month or just stop when it has reached its quota? Does it leave network data (gigs) so that it can report completed tasks? Do you have any recommendations on how to manage my network traffic?

Uploads are small. S5GC1 is approaching the end of its dataset, which means there are going to be a lot more resends being sent. Each work unit uses a large number of files that are shared with similarly numbered tasks, so during normal operation you generally only have a single large download when starting out, with subsequent tasks requiring only a small amount of additional data (or none) to be downloaded. With resends, however, you end up getting a bunch of tasks scattered about randomly, and need a lot more bandwidth to download them. If you have decent geek skills you can fiddle around with things to minimize the amount of extra bandwidth you use during the end of S5GC1. If you don't, or don't want to fiddle with stuff, go to the project options and turn off S5GC1 so you only get BRP1 tasks instead.

http://einsteinathome.org/node/195735

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5877
Credit: 118566884258
RAC: 21709595

RE: I have noticed since

Quote:
I have noticed since reading your post that I am currently working on tasks due 16/5 (16 May) while tasks with deadlines of 9/5 (9 May) are not being done.


You have several computers attached to the project. It's always helpful to tell us which one(s) are having the problem you are reporting. I found something along the lines you describe in this host's task list. Is this the correct one? If so, the tasks aren't really being done out of order: you have three different applications running, namely a BRP CUDA app for your GPU, a BRP CPU app and a GC1HF CPU app. The CUDA app is crunching tasks with a deadline of 16 May. The BRP CPU app is crunching tasks downloaded on 21 April and due by May 5. I don't see any tasks with a May 9 deadline.

There are 3 tasks due by May 5. That host has 4 cores so as long as the three tasks are already running, they should be done by the deadline. You will notice that BRP tasks take more than 24 hours on a CPU but around an order of magnitude less when done on the GPU. You should consider setting your preferences to direct all BRP tasks to the GPU and crunch only GC1HF tasks on your CPU cores.

Quote:
...I would put my money on BOINC 6.10.60 ver2.8.10 not realizing that we are not in America and that the day of the month comes first, then the month.


How much money would you like to wager? :-). I can give you the account details where you can deposit the funds if you like :-). Seriously, there are two things that make your wager 100% guaranteed to lose. Firstly, it's the OS and not BOINC that controls how dates are displayed. Just set your localisation and time zone properly in Windows if you want day-month-year instead of month-day-year. Secondly, irrespective of your OS settings, BOINC does all time calculations internally in Unix time format, i.e. the number of seconds since the 'epoch' - right at the beginning of 1970. So it's impossible for BOINC to be confused by the way your OS chooses to display dates. BTW, I understand BOINC 6.10.60 but what is ver2.8.10?
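
If it helps to see the point about Unix time concretely, here is a small illustration in Python. It is not BOINC code, and the year 2011 is an assumption made only so that the dates are complete.

```python
from datetime import datetime, timezone

# Two deadlines stored the way a Unix-time system stores them: as plain
# seconds since 1970-01-01 00:00 UTC. The year is assumed for illustration.
later = datetime(2011, 5, 16, tzinfo=timezone.utc)
earlier = datetime(2011, 5, 5, tzinfo=timezone.utc)

later_s = int(later.timestamp())
earlier_s = int(earlier.timestamp())

# The same instant can be displayed day-first or month-first...
day_first = later.strftime("%d/%m/%Y")
month_first = later.strftime("%m/%d/%Y")

# ...but ordering is decided by comparing the integers, not the strings.
earlier_is_due_first = earlier_s < later_s

# Converting the integer back gives exactly the same instant.
round_trip = datetime.fromtimestamp(later_s, tz=timezone.utc)

print(day_first, month_first, earlier_is_due_first, round_trip == later)
```

The display strings differ, but the scheduling decision only ever sees the two integers, so day/month ordering in the display has nothing to act on.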

Quote:
I normally only cache 0.5 days, but I recently used up my broadband allocation (3 gigs) and was automatically allocated another 3 gig unit with 4 days until the end of the billing period, when the unused gigs would be lost. I increased my cache size (to 5 days) until it had used up 2 gigs of broadband traffic. I normally limit my network usage to 2 gigs per month due to the expense. I did not want to waste over 2 gigs I will have to pay for.


The current GC1HF run is around 95% complete. There are only a few days of primary tasks left to distribute. There are likely to be lots of resends (resulting in lots of downloads) especially after all the primary tasks are gone, and also during the period when the new run is being phased in. I've been adding information to this thread for anyone masochistic enough to try to understand what is really going on and how possibly to interpret it and take advantage of it. It's certainly not required reading for normal participants. However, please be aware that any hosts participating at this particular time will be experiencing a spike in downloads that is unavoidable without manual intervention.

Like you, I have monthly limits on downloads if I want to avoid excess usage charges. It is possible to make big reductions in bandwidth usage by making sure that you get as many tasks as possible before allowing BOINC to throw away partly used data.

Quote:
Now I am wondering how this will affect next month's workload. Are the results as large as the downloaded tasks?


The new run will kick in very shortly now - unless there is some problem. From what has been said in various places, the data downloads will be smaller and the scheduler (being re-written) is likely to make much more efficient use of downloaded data before discarding it. For the current run, both the tasks themselves and the results that are uploaded are quite small. The really large files that are downloaded are not tasks. They are LIGO data files and you need (at worst) close to 200MB of these (48-52 files) just to support one task. Currently, hundreds of tasks can be based on the one set of LIGO data, but you won't achieve this without manual intervention. For the new run, the improved scheduler (when it's in service) should allow more efficient use of downloaded data.
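
To show what that means for a capped connection, here is a rough budget calculation. The 200MB per data set is the worst-case figure above and the 2 gig monthly limit is the one you mentioned; the figure of 100 tasks per data set is only an assumed stand-in for "hundreds", and the sizes of the task and result files are ignored because they are small.

```python
MONTHLY_LIMIT_MB = 2000       # 2 gigs per month
DATA_SET_MB = 200             # worst-case LIGO data needed to support one task
TASKS_PER_SET_REUSED = 100    # assumed; stands in for "hundreds of tasks"

# Worst case: every task needs a fresh set of LIGO data files.
tasks_without_reuse = MONTHLY_LIMIT_MB // DATA_SET_MB

# Best case: each downloaded set is reused for many tasks.
mb_per_task_with_reuse = DATA_SET_MB / TASKS_PER_SET_REUSED
tasks_with_reuse = int(MONTHLY_LIMIT_MB / mb_per_task_with_reuse)

print(tasks_without_reuse, mb_per_task_with_reuse, tasks_with_reuse)
```

The gap between the two cases is the reason it pays to get as many tasks as possible out of each set of data files before BOINC discards them.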

Quote:
I have set my cache to 0.5 days and my limit back to 2 gigs in 30 days. Does the BOINC manager spread the load evenly over the month or just stop when it has reached its quota? Does it leave network data (gigs) so that it can report completed tasks? Do you have any recommendations on how to manage my network traffic?


DanNeely has given you one option to consider (turn off GC1HF tasks completely until the new run is underway). Another alternative is to load up with enough tasks and then set NNT (No New Tasks) to last until the download frenzy subsides.

As an example, for this host of yours, you are fortunate to have tasks at a frequency that is quite favourable for getting plenty of work without a big risk of lots of large downloads. You have quite a few tasks for frequencies between 1497.25Hz and 1497.60Hz. You would already have LIGO data files between 1497.25Hz and 1498.20Hz, although quite a number at the lower end would already have tags applied in the state file. If you were to stop BOINC and edit out all those tags, you could download more tasks at those frequencies without having to download a whole lot more LIGO data.

Even without editing your state file, you could get lots of 1497.xx and 1498.xx tasks by building on the existing data with some extra LIGO files above 1498.20Hz. If you were to build your cache by, say, 0.5 days at a time, you should be able to limit how many extra files above 1498.20Hz you have to download. If you build your cache steadily 0.5 days at a time until you get to, say, 10 days (and keep it there while the current frequency tasks are still available), you should have enough work to last until well after the new run commences. It's a bit of fiddling around but you will need less data in the long run.

When 1497.xx tasks run out, just set your cache back to 0.5 days and you would have 9.5 days before any further downloads would be needed. By that time the new run should be well and truly established and stabilised.

Cheers,
Gary.
