client not caching as many WU's as it should?

Sunny129
Joined: 5 Dec 05
Posts: 162
Credit: 160342159
RAC: 0
Topic 195645

Hello everyone.

i used to have a Phenom II X6 1090T based system that would be crunching 6 WU's at any given time, and would typically have 8 to 9 WU's in the queue "ready to start." i always had the "maintain enough work for XX days" set to 0.25 days. just recently i purchased some parts to build a 2nd rig, but instead of leaving my existing system intact and building the new system with all the new parts i ordered, i decided to dismantle the existing system so i could mix and match parts. so now that both systems are up and running, they are essentially both a mix of new and old parts...but that's neither here nor there. the point is that i had to start from scratch with my 1090T system, this time with a new mobo and memory...and so i had to go through the OCing routine yet again.

in the process of testing the stability of my OC, several of my E@H WU's errored out. currently, my account shows 28 WU's that either encountered an error while computing or an error while downloading (most were computation errors). there may have been more than that (it couldn't have been more than ~40 errored WU's), but my account page only shows 28 WU's for now. after ironing out the instabilities in my OC, i noticed that instead of having the typical 8 to 9 WU's in the queue ready to start, i only have 2 to 3...and only 1 new WU gets downloaded and added to the queue at a time now.

at first it made zero sense to me, but i found the following info in the Q&A section, and now it makes more - but very little - sense:

Quote:

Why is my Daily Result Quota so small?

There are some host machines which 'error out' all the work that is sent to them. Often these machines are misconfigured or have some other Operating System or BOINC installation problem which needs to be fixed. To help reduce the impact of these machines on the project, we use a 'Daily Result Quota' to prevent these host machines from trashing hundreds or thousands of workunits per day.

The 'Daily Result Quota' is normally 8 workunits (per CPU, with a 4 CPU maximum). A host can request, and will receive, up to this many workunits per day and per CPU. Each time that a host returns a failed result, or 'times out' on a result (fails to return a result by the deadline) its Daily Result Quota is reduced by one. Each time that a host returns a successful result, its Daily Result Quota is DOUBLED. Note: the Daily Result Quota is NEVER allowed to be less than one, and NEVER allowed to be larger than 8 (per CPU).

Provided that a host machine returns at least some successful results, its Daily Result Quota should remain near 8. Host machines that have a Daily Result Quota of 1 should be examined: there is probably something wrong with them, or with how BOINC is installed or running on them.


i thought i was onto something when i came across this info b/c it was the first correlation i'd seen between the number of errored or timed-out WU's and the number of new WU's one receives. but i'm still not sure it explains my situation. after all, even if i did return 28 or more errored or timed-out WU's, the last errored or timed-out WU i returned was back on 2/2/11, and i have since returned what appears to be 38 successful results. according to the above info, my DRQ should be back up at 8...unless i did my math wrong.
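
A quick way to check that math is to replay the quoted rule in a few lines of Python. This is only a sketch of the rule as the Q&A entry states it (minus one per failed result, doubled per successful result, clamped between 1 and 8 per CPU); the function names are illustrative assumptions, not the project's actual scheduler code.

```python
def update_quota(quota, success, max_quota=8, min_quota=1):
    """Apply one returned result to the quota, per the quoted Q&A rule."""
    if success:
        quota *= 2   # a successful result doubles the quota
    else:
        quota -= 1   # a failed or timed-out result costs one
    # The quota is never below the minimum and never above the maximum.
    return max(min_quota, min(max_quota, quota))


def simulate(results, start=8, max_quota=8):
    """Replay a list of results (True = success, False = failure)."""
    quota = start
    for success in results:
        quota = update_quota(quota, success, max_quota)
    return quota


# 28 errored results followed by 38 successful ones, as described above.
history = [False] * 28 + [True] * 38
print(simulate(history))
```

Under this rule the quota bottoms out at 1 after the errors and is back at the maximum after only three successful results, so a string of 38 successes should leave it fully restored.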

...or perhaps my problem has nothing to do with this and i'm completely confused lol. but if that's the case, i'm at a total loss as to why my client used to queue 8-9 WU's and now only queues 2-3 WU's. just to confirm that i should have more WU's in the queue, my other Phenom II X4 965 BE based system is crunching 4 WU's at a time, and queues up to 4 BRPS WU's at any given time, or ~8 GC WU's at any given time.

can anyone make more sense out of my situation?

TIA,
Eric

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118380695452
RAC: 25581580

client not caching as many WU's as it should?

Quote:
can anyone make more sense out of my situation?


Hi Eric,

The information you found is correct in principle but out-of-date for the daily quota. These days the maximum is 32 per core per day - so 192 tasks in total for each of your 6 core hosts.

If you simply go to your account page and click the 'view' link for your computers, you can 'drill down' and see all the gory details for yourself (follow the 'details' link). You will see that both hosts have the full quota setting of 32, so that is not the issue. BTW, one host has not made recent contact so maybe you only have one of the two running at the moment? I had assumed initially that the two listed 6 core hosts were your current two new hosts but perhaps that is wrong and the two entries actually refer to the new and old identities of the one physical machine? If this latter scenario is really the case then you should 'merge' the duplicate identities using the link you can find at the bottom of one of the host's 'details' pages. Ask before doing anything if you are not sure what's going on.

When you have a machine off for a period, BOINC notes that and reduces the fraction of time for which your computer is recorded as being 'available' to do work. BOINC will reduce accordingly the quantum of work it will allow to be downloaded, assuming that your host may continue to be off from time to time. If your machine runs consistently, the fraction will increase over time and eventually things will get back to the way they were.

You could edit configuration files to 'fix' this immediately but it's quite easy to work around the problem. 0.25 days is too low for a cache setting if you want to be able to crunch through periods where you might not be able to contact the project (for whatever reason), so just try setting the 'extra days' preference on the website to something larger like 1.0. Then select the E@H project in BOINC Manager and click 'update' to force the client to communicate and so receive the new setting immediately. This should result in quite a number of new tasks being sent. Over time, as your machine continues to run, the 'available' fraction will be adjusted by BOINC and so the number of tasks on board will gradually increase. For a machine like yours, it wouldn't be a concern to have lots of waiting tasks.

When you make changes to cache settings, it's best to make a number of modest changes and see what happens rather than one big 'hit'. The change 0.25 -> 1.0 is OK but if you were intending to keep a 3.0 day cache (which would be quite OK for a host like yours) you might consider going up in a number of 0.5 to 1.0 day steps rather than one big hit. The reasons are complicated but I've actually witnessed hosts being assigned several GBs of (quite unnecessary) new large LIGO data files as a result of an (accidental) overambitious cache setting change. It's to do with the way locality scheduling works at E@H.
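
The stepwise approach can be sketched as a small Python helper that lists the intermediate settings to try, one at a time, on the way to the target. The function name and the default step of 0.75 days are assumptions chosen to sit inside the 0.5 to 1.0 day range mentioned above; nothing here talks to BOINC itself.

```python
def cache_steps(current, target, max_step=0.75):
    """List the 'extra days' values to set, in order, from current to target."""
    if max_step <= 0:
        raise ValueError("max_step must be positive")
    steps = []
    value = current
    while value < target:
        # Never overshoot the target on the final step.
        value = min(target, value + max_step)
        steps.append(value)
    return steps


# Going from the 0.25 day cache to a 3.0 day cache.
print(cache_steps(0.25, 3.0))
```

Each value would be set on the website, followed by an 'update' in BOINC Manager and a pause to see what gets downloaded before moving to the next one.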

Let us know how you get on.

Cheers,
Gary.

Sunny129
Joined: 5 Dec 05
Posts: 162
Credit: 160342159
RAC: 0

RE: Hi Eric, The

Quote:

Hi Eric,

The information you found is correct in principle but out-of-date for the daily quota. These days the maximum is 32 per core per day - so 192 tasks in total for each of your 6 core hosts.

If you simply go to your account page and click the 'view' link for your computers, you can 'drill down' and see all the gory details for yourself (follow the 'details' link). You will see that both hosts have the full quota setting of 32, so that is not the issue. BTW, one host has not made recent contact so maybe you only have one of the two running at the moment? I had assumed initially that the two listed 6 core hosts were your current two new hosts but perhaps that is wrong and the two entries actually refer to the new and old identities of the one physical machine? If this latter scenario is really the case then you should 'merge' the duplicate identities using the link you can find at the bottom of one of the host's 'details' pages. Ask before doing anything if you are not sure what's going on.

sorry about the confusion regarding the hosts. one is in fact a reincarnation of the other. originally i had the 1090T running on an ASUS M4A89GTD PRO/USB3 mobo in my master bedroom. after i bought parts to build my home office PC, i instead ended up shutting down the 1090T rig and swapping the 1090T to the new ASUS M478T-E i got for the home office build. i then put the 965 BE (that was originally intended for the home office rig) on the M4A89GTD PRO/USB3 mobo in the master bedroom. so that's why there are multiple instances of the same computer listed. regarding the merge function, i'm well aware of that and have already tried it. but regardless of whether i'm on my E@H, S@H, or MW@H online account, they won't merge. i have a feeling the difference in hardware configuration before and after the rebuild is too pronounced for it to recognize them as the same computer. no worries - in a few weeks we won't have to look at it b/c it'll have been inactive for more than 30 days :).

Quote:
When you have a machine off for a period, BOINC notes that and reduces the fraction of time for which your computer is recorded as being 'available' to do work. BOINC will reduce accordingly the quantum of work it will allow to be downloaded, assuming that your host may continue to be off from time to time. If your machine runs consistently, the fraction will increase over time and eventually things will get back to the way they were.

do you think short thinning or complete outages of my wireless connection signal might cause BOINC to think my computer's down more often than it is?...and by short i mean minutes at a time? today i believe was the first time my wireless signal went out for a few hours, but the lack of tasks in the queue has been going on for days now. it just doesn't seem to make sense that anywhere from 0 to 3 tasks are in the queue at any given time for my 6-core 1090T, and yet anywhere from 6 to 9 tasks are in the queue at any given time for my 4-core 965 BE. they're crunching the exact same projects & applications, their resources are allocated equally, and both have the same low 0.25-day cache settings. it just makes sense that there should be more tasks in the queue for the 6-core CPU than the 4-core CPU...unless i'm completely overlooking something silly. and i know i can simply fix the problem by upping my cache (gradually as you said), but there was a time where i didn't have to worry about any particular core running out of work with only 0.25 days of work in the queue. i must be getting affected by all the internet outages, short or long.

paul milton
Joined: 16 Sep 05
Posts: 329
Credit: 35825044
RAC: 0

when you look at the details

when you look at the details for that computer what does it show for..

[pre]
% of time BOINC client is running 97.1907 %
While BOINC running, % of time host has an Internet connection 99.9945 %
While BOINC running, % of time work is allowed 99.7905 %
Task duration correction factor 1.778989[/pre]

seeing without seeing is something the blind learn to do, and seeing beyond vision can be a gift.

Sunny129
Joined: 5 Dec 05
Posts: 162
Credit: 160342159
RAC: 0

RE: when you look at the

Quote:

when you look at the details for that computer what does it show for..

[pre]
% of time BOINC client is running 97.1907 %
While BOINC running, % of time host has an Internet connection 99.9945 %
While BOINC running, % of time work is allowed 99.7905 %
Task duration correction factor 1.778989[/pre]


thanks for pointing that out! it didn't even occur to me that viewing those stats would give me a real idea of how often BOINC sees this client as connected. check this out:

the 965 BE (master bedroom) rig:
% of time BOINC client is running 99.9296 %
While BOINC running, % of time host has an Internet connection 100 %
While BOINC running, % of time work is allowed 99.9944 %
Task duration correction factor 1.477562

...and now for the computer in question - the 1090T (home office) rig:
% of time BOINC client is running 61.0768 %
While BOINC running, % of time host has an Internet connection 59.7753 %
While BOINC running, % of time work is allowed 61.1937 %
Task duration correction factor 1.505477

i guess now it's apparent that i need to do something about my wireless signal strength, b/c it's clearly affecting my internet connection. i'd hate to go back to running that damn 50-ft cat5e across the condo...but nothing beats a hard line. perhaps it's time to run one through the walls so it at least stays hidden...

paul milton
Joined: 16 Sep 05
Posts: 329
Credit: 35825044
RAC: 0

how low is the signal

how low is the signal strength to that rig? if you don't want to run a cable you could try a wifi repeater. if your network is G instead of N you could use an old wrt54gl (note the L) and load up 3rd party firmware on it and run it in WDS somewhere midway between the router and the rig. it would slow down traffic a tad as it would be running as a wifi/wired repeater. but it may be cheaper than a rewire (not money wise mind you. but a lot less labor)

on a side

Quote:
% of time BOINC client is running 61.0768 %

is this a dedicated rig? if so, something's apparently causing boinc to not run ~half the time the system is on.

hypothesis.. the software running the wifi could be "over powering" boinc (system not idle while wifi looking for connection)

seeing without seeing is something the blind learn to do, and seeing beyond vision can be a gift.

Sunny129
Joined: 5 Dec 05
Posts: 162
Credit: 160342159
RAC: 0

RE: how low is the signal

Quote:

how low is the signal strength to that rig? if you don't want to run a cable you could try a wifi repeater. if your network is G instead of N you could use an old wrt54gl (note the L) and load up 3rd party firmware on it and run it in WDS somewhere midway between the router and the rig. it would slow down traffic a tad as it would be running as a wifi/wired repeater. but it may be cheaper than a rewire (not money wise mind you. but a lot less labor)

on a side

Quote:
% of time BOINC client is running 61.0768 %

is this a dedicated rig? if so, something's apparently causing boinc to not run ~half the time the system is on.

hypothesis.. the software running the wifi could be "over powering" boinc (system not idle while wifi looking for connection)


i have a wireless N router. but the signal has to go through a few stud walls that contain plumbing. at any rate, i'm probably just going to hard-wire the rig to the router because wireless is so damn unreliable. besides, it'll cost me less b/c i already have a long enough cat5e cable and an easy crawl space above the ceilings.

yes this is a dedicated system. and yes, i already thought of that possibility. but that can't be the reason b/c i have BOINC set to run 24/7 regardless of non-BOINC induced CPU loads...that is, i have BOINC set to not suspend tasks regardless of how many other programs are using up resources. not only that, but even if something were overriding that setting, i'd be able to see its manifestation. for instance, no tasks are showing up as suspended-and-restarted as i scroll back and look through the messages tab in BOINC. also, all 6 cores are at 100% load all the time. if they weren't, i'd notice it in the evenings when i'm not at work...so i can't imagine how the server is thinking that BOINC is only running ~61% of the time the system is powered on for this particular client.

Sunny129
Joined: 5 Dec 05
Posts: 162
Credit: 160342159
RAC: 0

Gary, actually i think the

Gary,

actually i think the reason i can't merge duplicate computers is b/c their names are in fact different. EMA-001 vs ema-001...i guess it distinguishes between capital and lower case letters.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118380695452
RAC: 25581580

I'm pretty sure the name

I'm pretty sure the name wouldn't even be considered in deciding if a merge is allowed, when using the standard merge function. Are you using 'merge by name'? If so try the standard merge function instead.

Information about merging can be found here, if you want to understand things a bit more. You could also google "BOINC merge computers" - should give you lots of information about others' experiences.

As for your original query, the two ~60% values for BOINC run time and work allowed time are the culprits. Please be aware that it takes quite a while to recover from a period of being off and that period could have been weeks ago. Please also realise that if you changed the date significantly, this could confuse BOINC in its deliberations about how long it has been running compared to how long the machine has been running or how long work has been allowed.
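
To see why those two fractions matter more than the core count, here is a rough back-of-the-envelope sketch in Python. It assumes, purely for illustration, that the amount of work BOINC is willing to hold scales with cores x cache days x (fraction of time BOINC runs) x (fraction of time work is allowed). That multiplicative model and the function name are simplifying assumptions, not BOINC's actual work-fetch code.

```python
def effective_core_days(ncpus, cache_days, running_frac, allowed_frac):
    """Rough estimate of the work buffer, in core-days, under the assumed model."""
    return ncpus * cache_days * running_frac * allowed_frac


# Figures taken from the host details posted earlier in the thread.
rig_965 = effective_core_days(4, 0.25, 0.999296, 0.999944)
rig_1090t = effective_core_days(6, 0.25, 0.610768, 0.611937)

print("965 BE: %.2f core-days" % rig_965)
print("1090T:  %.2f core-days" % rig_1090t)
```

Under that assumption the 6-core host ends up with only a little over half the buffer of the 4-core host, which is at least consistent with the smaller queue being reported.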

If it's really bugging you, you can always stop BOINC and carefully edit the values in your state file. I have quite a few hosts running unattended. Occasionally, one will crash and it won't be noticed for a while. When I discover and reboot, I usually 'correct' things like this in the state file, rather than wait for BOINC to fix it slowly over time. Only takes a few extra seconds.

Alternatively, just up your cache setting as I suggested while you wait for BOINC to fix the numbers.

Cheers,
Gary.

Sunny129
Joined: 5 Dec 05
Posts: 162
Credit: 160342159
RAC: 0

ok i got it figured out...i

ok i got it figured out...i didn't realize there was a "merge" function in each client's details, and that it gives you choices unlike the "merge computers by name" function. so you'll see i was able to consolidate the list of computers i've dedicated to crunching over the years.

and thanks for letting me know my options. i think i'm just gonna up my cache slightly and ride it out as BOINC fixes itself over time. after coming to the realization yesterday that my wireless signal very well may be the culprit, i went ahead and disabled my wireless card and re-ran the 50-ft cat5e cable from the router to this client - and already (as in not quite 24 hrs later) my stats have changed considerably:

% of time BOINC client is running 63.9287 %
While BOINC running, % of time host has an Internet connection 62.7297 %
While BOINC running, % of time work is allowed 64.044 %
Task duration correction factor 1.403014

...if this continues at a linear pace - or even tapers off as it progresses - it appears as though it'll only be a few weeks before the numbers are back up near 100%.
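
As a rough check on that estimate, a straight-line extrapolation from the two "% of time BOINC client is running" readings can be sketched in Python. The one-day gap between readings and the 99% target are assumptions, and the function name is made up for the example. If the gain tapers off rather than staying linear, the real recovery would take longer than this.

```python
def days_to_reach(target_pct, earlier_pct, later_pct, days_between=1.0):
    """Linear extrapolation: days needed after the later reading to hit target."""
    rate = (later_pct - earlier_pct) / days_between  # percentage points per day
    if rate <= 0:
        raise ValueError("the figure is not increasing")
    return (target_pct - later_pct) / rate


# Readings from before and after switching to the cable, assumed ~1 day apart.
estimate = days_to_reach(99.0, 61.0768, 63.9287)
print("about %.0f days" % estimate)
```

With these inputs the straight-line figure comes out at a bit under two weeks.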

...btw, if i chose to fix it now, exactly what changes would i need to make to the client state file?

TIA,
Eric

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118380695452
RAC: 25581580

RE: exactly what changes

Quote:
exactly what changes would i need to make to the client state file?


First of all, a simple warning. Don't do this unless you are comfortable with editing configuration files manually. Typos in the values themselves probably are pretty harmless but I'm talking more about mangling tags or deleting lines elsewhere or damaging the overall structure of the file. If you do any of these, bad things are likely to happen. You don't get a second chance by making a backup of the file in question because by the time you restart BOINC and go "Ooops!!", it's probably too late for just a file backup to do any good. You can adopt more extensive safeguards but the best safeguard is just to notice typos before you save the file. After all, the changes are so simple that there is no reason not to see a typo. If something more major seems to go wrong while editing, just reload a fresh copy of the file and start again.

So just completely stop BOINC and open client_state.xml with a text editor like notepad in Windows or kwrite in Linux. Search for the following block of three lines (relatively close to the top of the file). The numbers in my example are the numbers you posted. The duration correction factor value is a little later on in the file and can also be tweaked if BOINC has the estimated completion time for an unstarted task significantly wrong. With a multi-core host, BOINC will fix this rather quickly so I wouldn't bother unless I was really in a hurry.

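    <on_frac>0.639287</on_frac>
    <connected_frac>0.627297</connected_frac>
    <active_frac>0.640440</active_frac>

change the values to something like

    <on_frac>0.999999</on_frac>
    <connected_frac>0.999999</connected_frac>
    <active_frac>0.999999</active_frac>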


Save, exit and then restart BOINC. BOINC will suddenly want to download more work :-).
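
For anyone who would rather script the edit than type it, a minimal Python sketch follows. It assumes the three fractions sit in tags named on_frac, connected_frac and active_frac; those names are an assumption here, so check them against your own client_state.xml first. The helper only rewrites text passed to it and raises an error, leaving the input unchanged, if any tag is missing. Reading and writing the real file, with BOINC completely stopped, is left to you.

```python
import re

# Assumed tag names for the three fractions; verify against your own file.
FRACTION_TAGS = ("on_frac", "connected_frac", "active_frac")


def reset_fractions(xml_text, value="0.999999", tags=FRACTION_TAGS):
    """Return xml_text with the first occurrence of each tag set to value."""
    result = xml_text
    for tag in tags:
        pattern = re.compile(r"(<{0}>)[^<]*(</{0}>)".format(re.escape(tag)))
        # Keep the opening and closing tags, swap only the number in between.
        result, count = pattern.subn(
            lambda m: m.group(1) + value + m.group(2), result, count=1
        )
        if count != 1:
            raise ValueError("tag <%s> not found, nothing changed" % tag)
    return result


# Demonstration on a small sample string rather than a real state file.
sample = (
    "    <on_frac>0.639287</on_frac>\n"
    "    <connected_frac>0.627297</connected_frac>\n"
    "    <active_frac>0.640440</active_frac>\n"
)
print(reset_fractions(sample))
```

Working on the text as a string, rather than parsing and re-serialising the XML, keeps the rest of the file byte-for-byte identical, which matters given the warning above about damaging the file's structure.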

Cheers,
Gary.
