BIG TIME OVERFETCH ! ! ! Perceus.

tbret
tbret
Joined: 12 Mar 05
Posts: 2,115
Credit: 4,861,253,940
RAC: 48,988

I'm guessing this will be the

I'm guessing this will be the last one of these I post.

5/28/2013 1:35:29 AM | Einstein@Home | [work_fetch] request: CPU (0.00 sec, 0.00 inst) NVIDIA (0.00 sec, 0.00 inst)
5/28/2013 1:35:29 AM | Einstein@Home | Sending scheduler request: To report completed tasks.
5/28/2013 1:35:29 AM | Einstein@Home | Reporting 636 completed tasks
5/28/2013 1:35:29 AM | Einstein@Home | Not requesting tasks: don't need
5/28/2013 1:35:40 AM | Einstein@Home | Scheduler request completed
5/28/2013 1:35:40 AM | | [work_fetch] Request work fetch: RPC complete
5/28/2013 1:35:45 AM | | [work_fetch] work fetch start
5/28/2013 1:35:45 AM | | [work_fetch] choose_project() for NVIDIA: buffer_low: no; sim_excluded_instances 0
5/28/2013 1:35:45 AM | | [work_fetch] choose_project() for CPU: buffer_low: yes; sim_excluded_instances 0
5/28/2013 1:35:45 AM | | [work_fetch] no eligible project for CPU
5/28/2013 1:35:45 AM | | [work_fetch] ------- start work fetch state -------
5/28/2013 1:35:45 AM | | [work_fetch] target work buffer: 43200.00 + 864.00 sec
5/28/2013 1:35:45 AM | | [work_fetch] --- project states ---
5/28/2013 1:35:45 AM | Einstein@Home | [work_fetch] REC 84786.533 prio -0.359908 can't req work: scheduler RPC backoff (backoff: 54.89 sec)
5/28/2013 1:35:45 AM | | [work_fetch] --- state for CPU ---
5/28/2013 1:35:45 AM | | [work_fetch] shortfall 132192.00 nidle 3.00 saturated 0.00 busy 0.00
5/28/2013 1:35:45 AM | Einstein@Home | [work_fetch] fetch share 0.000 (blocked by prefs)
5/28/2013 1:35:45 AM | | [work_fetch] --- state for NVIDIA ---
5/28/2013 1:35:45 AM | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 74306.23 busy 0.00
5/28/2013 1:35:45 AM | Einstein@Home | [work_fetch] fetch share 0.000
5/28/2013 1:35:45 AM | | [work_fetch] ------- end work fetch state -------
5/28/2013 1:35:45 AM | | [work_fetch] No project chosen for work fetch
5/28/2013 1:35:45 AM | Einstein@Home | update requested by user
5/28/2013 1:35:45 AM | | [work_fetch] Request work fetch: project updated by user
5/28/2013 1:35:46 AM | | [work_fetch] Request work fetch: Backoff ended for Einstein@Home
5/28/2013 1:35:50 AM | | [work_fetch] work fetch start
5/28/2013 1:35:50 AM | | [work_fetch] choose_project() for NVIDIA: buffer_low: no; sim_excluded_instances 0
5/28/2013 1:35:50 AM | | [work_fetch] choose_project() for CPU: buffer_low: yes; sim_excluded_instances 0
5/28/2013 1:35:50 AM | | [work_fetch] no eligible project for CPU
5/28/2013 1:35:50 AM | | [work_fetch] ------- start work fetch state -------
5/28/2013 1:35:50 AM | | [work_fetch] target work buffer: 43200.00 + 864.00 sec
5/28/2013 1:35:50 AM | | [work_fetch] --- project states ---
5/28/2013 1:35:50 AM | Einstein@Home | [work_fetch] REC 84786.533 prio -1.359885 can req work
5/28/2013 1:35:50 AM | | [work_fetch] --- state for CPU ---
5/28/2013 1:35:50 AM | | [work_fetch] shortfall 132192.00 nidle 3.00 saturated 0.00 busy 0.00
5/28/2013 1:35:50 AM | Einstein@Home | [work_fetch] fetch share 0.000 (blocked by prefs)
5/28/2013 1:35:50 AM | | [work_fetch] --- state for NVIDIA ---
5/28/2013 1:35:50 AM | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 74301.24 busy 0.00
5/28/2013 1:35:50 AM | Einstein@Home | [work_fetch] fetch share 1.000
5/28/2013 1:35:50 AM | | [work_fetch] ------- end work fetch state -------
5/28/2013 1:35:50 AM | | [work_fetch] No project chosen for work fetch
5/28/2013 1:35:50 AM | Einstein@Home | [work_fetch] request: CPU (0.00 sec, 0.00 inst) NVIDIA (0.00 sec, 0.00 inst)
5/28/2013 1:35:50 AM | Einstein@Home | Sending scheduler request: Requested by user.
5/28/2013 1:35:50 AM | Einstein@Home | Not requesting tasks: don't need
5/28/2013 1:35:53 AM | Einstein@Home | Scheduler request completed
5/28/2013 1:35:53 AM | | [work_fetch] Request work fetch: RPC complete
5/28/2013 1:35:58 AM | | [work_fetch] work fetch start
5/28/2013 1:35:58 AM | | [work_fetch] choose_project() for NVIDIA: buffer_low: no; sim_excluded_instances 0
5/28/2013 1:35:58 AM | | [work_fetch] choose_project() for CPU: buffer_low: yes; sim_excluded_instances 0
5/28/2013 1:35:58 AM | | [work_fetch] no eligible project for CPU
5/28/2013 1:35:58 AM | | [work_fetch] ------- start work fetch state -------
5/28/2013 1:35:58 AM | | [work_fetch] target work buffer: 43200.00 + 864.00 sec
5/28/2013 1:35:58 AM | | [work_fetch] --- project states ---
5/28/2013 1:35:58 AM | Einstein@Home | [work_fetch] REC 84786.533 prio -0.359848 can't req work: scheduler RPC backoff (backoff: 54.90 sec)
5/28/2013 1:35:58 AM | | [work_fetch] --- state for CPU ---
5/28/2013 1:35:58 AM | | [work_fetch] shortfall 132192.00 nidle 3.00 saturated 0.00 busy 0.00
5/28/2013 1:35:58 AM | Einstein@Home | [work_fetch] fetch share 0.000 (blocked by prefs)
5/28/2013 1:35:58 AM | | [work_fetch] --- state for NVIDIA ---
5/28/2013 1:35:58 AM | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 74293.25 busy 0.00
5/28/2013 1:35:58 AM | Einstein@Home | [work_fetch] fetch share 0.000
5/28/2013 1:35:58 AM | | [work_fetch] ------- end work fetch state -------
5/28/2013 1:35:58 AM | | [work_fetch] No project chosen for work fetch
5/28/2013 1:36:53 AM | | [work_fetch] Request work fetch: Backoff ended for Einstein@Home
5/28/2013 1:36:53 AM | | [work_fetch] work fetch start
5/28/2013 1:36:53 AM | | [work_fetch] choose_project() for NVIDIA: buffer_low: no; sim_excluded_instances 0
5/28/2013 1:36:53 AM | | [work_fetch] choose_project() for CPU: buffer_low: yes; sim_excluded_instances 0
5/28/2013 1:36:53 AM | | [work_fetch] no eligible project for CPU
5/28/2013 1:36:53 AM | | [work_fetch] ------- start work fetch state -------
5/28/2013 1:36:53 AM | | [work_fetch] target work buffer: 43200.00 + 864.00 sec
5/28/2013 1:36:53 AM | | [work_fetch] --- project states ---
5/28/2013 1:36:53 AM | Einstein@Home | [work_fetch] REC 84792.289 prio -1.359594 can req work
5/28/2013 1:36:53 AM | | [work_fetch] --- state for CPU ---
5/28/2013 1:36:53 AM | | [work_fetch] shortfall 132192.00 nidle 3.00 saturated 0.00 busy 0.00
5/28/2013 1:36:53 AM | Einstein@Home | [work_fetch] fetch share 0.000 (blocked by prefs)
5/28/2013 1:36:53 AM | | [work_fetch] --- state for NVIDIA ---
5/28/2013 1:36:53 AM | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 74238.49 busy 0.00
5/28/2013 1:36:53 AM | Einstein@Home | [work_fetch] fetch share 1.000
5/28/2013 1:36:53 AM | | [work_fetch] ------- end work fetch state -------
5/28/2013 1:36:53 AM | | [work_fetch] No project chosen for work fetch

Richard Haselgrove
Richard Haselgrove
Joined: 10 Dec 05
Posts: 2,143
Credit: 2,947,577,318
RAC: 706,455

Thanks for that. I'll tell

Thanks for that. I'll tell David he has some more reading material here.

Neil Newell
Neil Newell
Joined: 20 Nov 12
Posts: 176
Credit: 169,699,457
RAC: 0

Similar situation here - 4

Similar situation here - 4 hosts have overfetched work.

Host 6618250, 479 PAS in progress, 7.0.62
Host 6290763, 361 PAS in progress, 7.0.62
Host 6119246, 104 PAS in progress, 7.0.65
Host 6564477, 412 PAS in progress, 7.0.62

Looking at the tasks, it appears they may have been sent in distinct time periods - the first period is roughly 20:30 31-May-2013 to 02:00 01-Jun-2013, and the second is from 00:30 to 01:30 02-Jun-2013

Not all hosts are affected in each period, e.g. Host 6119246 only appears to have overfetched in the first period.

As listed above, 3 hosts are 7.0.62 and one is 7.0.65, and are using 2 different venues.

Of possible relevance is that a few days ago (date not recorded) I updated my "Computing preferences", increasing my cache from 0.25 to 1 day (I did this because another of my hosts #6119565 keeps polling for work every 60 seconds, for no good reason that I can see).

Neil Newell
Neil Newell
Joined: 20 Nov 12
Posts: 176
Credit: 169,699,457
RAC: 0

...also Host 6123309: 81

...also Host 6123309: 81 pending PAS (~55 days work), all sent 18:00 to 18:45 on 1-Jun-2013.

I checked this host because I noticed that all the other hosts that have over-fetched have 2 GPU's (so does this one, albeit only 2x9600GTs which may explain why the overfetch isn't so severe).

Whether this is a significant observation or not, it seems that out of 10 of my active GPU hosts, the 5 that have 2 GPU's have over-fetched and the 5 that have a single GPU haven't. Additionally, it appears the hosts with single GPUs are polling for new work once a minute, but not getting any:

...
03-Jun-2013 11:01:09 [Einstein@Home] Sending scheduler request: To fetch work.
03-Jun-2013 11:01:09 [Einstein@Home] Not requesting tasks
03-Jun-2013 11:01:14 [Einstein@Home] Scheduler request completed
03-Jun-2013 11:02:15 [Einstein@Home] Sending scheduler request: To fetch work.
03-Jun-2013 11:02:15 [Einstein@Home] Not requesting tasks
03-Jun-2013 11:02:19 [Einstein@Home] Scheduler request completed
...
Bikeman (Heinz-Bernd Eggenstein)
Bikeman (Heinz-...
Moderator
Joined: 28 Aug 06
Posts: 3,522
Credit: 708,171,118
RAC: 731,725

RE: Additionally, it

Quote:

Additionally, it appears the hosts with single GPUs are polling for new work once a minute, but not getting any:

...
03-Jun-2013 11:01:09 [Einstein@Home] Sending scheduler request: To fetch work.
03-Jun-2013 11:01:09 [Einstein@Home] Not requesting tasks
03-Jun-2013 11:01:14 [Einstein@Home] Scheduler request completed
03-Jun-2013 11:02:15 [Einstein@Home] Sending scheduler request: To fetch work.
03-Jun-2013 11:02:15 [Einstein@Home] Not requesting tasks
03-Jun-2013 11:02:19 [Einstein@Home] Scheduler request completed
...

I'm not even sure they really request work, see the text in above quote that was highlighted by me. The text is at least misleading.

Cheers
HB

Neil Newell
Neil Newell
Joined: 20 Nov 12
Posts: 176
Credit: 169,699,457
RAC: 0

I did wonder if it was a

I did wonder if it was a local-only check; still, the multi-GPU hosts aren't doing it.

Also I note that the same pattern of over-fetching fits tbret's hosts - the single GPU hosts have reasonable work, but most multi-GPU hosts have too many PAS tasks.

Poking about the [url=]http://einstein.phys.uwm.edu/top_hosts.php]top hosts[/url] shows other multi-GPU hosts that appear to have over-fetched - e.g. this host has 179 'in progress' (~40 days work with a deadline of 12th June - even with 2 GPU's it surely won't make it).

Neil Newell
Neil Newell
Joined: 20 Nov 12
Posts: 176
Credit: 169,699,457
RAC: 0

RE: RE: Additionally, it

Quote:
Quote:
Additionally, it appears the hosts with single GPUs are polling for new work once a minute

I'm not even sure they really request work, see the text in above quote that was highlighted by me. The text is at least misleading.


I checked with 'tcpdump' - the single-GPU hosts really are hitting the server once a minute; this shows up on the "Last contact" link as well.

2013-06-03 10:46:02.4045 [PID=30856]   Request: [USER#xxxxx] [HOST#6119565] [IP xxx.xxx.xxx.38] client 7.0.62
2013-06-03 10:46:02.4187 [PID=30856]    [send] effective_ncpus 1 max_jobs_on_host_cpu 999999 max_jobs_on_host 999999
2013-06-03 10:46:02.4187 [PID=30856]    [send] effective_ngpus 1 max_jobs_on_host_gpu 999999
2013-06-03 10:46:02.4187 [PID=30856]    [send] Not using matchmaker scheduling; Not using EDF sim
2013-06-03 10:46:02.4187 [PID=30856]    [send] CPU: req 0.00 sec, 0.00 instances; est delay 0.00
2013-06-03 10:46:02.4187 [PID=30856]    [send] CUDA: req 0.00 sec, 0.00 instances; est delay 0.00
2013-06-03 10:46:02.4187 [PID=30856]    [send] work_req_seconds: 0.00 secs
2013-06-03 10:46:02.4187 [PID=30856]    [send] available disk 23.89 GB, work_buf_min 0
2013-06-03 10:46:02.4187 [PID=30856]    [send] active_frac 0.999999 on_frac 0.999944 DCF 1.226839
2013-06-03 10:46:02.4222 [PID=30856]    Sending reply to [HOST#6119565]: 0 results, delay req 60.00
2013-06-03 10:46:02.4225 [PID=30856]    Scheduler ran 0.024 seconds

Only mentioning it in case it relates to the main issue of work over-fetch; it may well be a sidetrack but it appears to only affect single-GPU hosts.

Bikeman (Heinz-Bernd Eggenstein)
Bikeman (Heinz-...
Moderator
Joined: 28 Aug 06
Posts: 3,522
Credit: 708,171,118
RAC: 731,725

RE: RE: RE:

Quote:
Quote:
Quote:
Additionally, it appears the hosts with single GPUs are polling for new work once a minute

I'm not even sure they really request work, see the text in above quote that was highlighted by me. The text is at least misleading.


I checked with 'tcpdump' - the single-GPU hosts really are hitting the server once a minute; this shows up on the "Last contact" link as well.

2013-06-03 10:46:02.4045 [PID=30856]   Request: [USER#xxxxx] [HOST#6119565] [IP xxx.xxx.xxx.38] client 7.0.62
2013-06-03 10:46:02.4187 [PID=30856]    [send] effective_ncpus 1 max_jobs_on_host_cpu 999999 max_jobs_on_host 999999
2013-06-03 10:46:02.4187 [PID=30856]    [send] effective_ngpus 1 max_jobs_on_host_gpu 999999
2013-06-03 10:46:02.4187 [PID=30856]    [send] Not using matchmaker scheduling; Not using EDF sim
2013-06-03 10:46:02.4187 [PID=30856]    [send] CPU: req 0.00 sec, 0.00 instances; est delay 0.00
2013-06-03 10:46:02.4187 [PID=30856]    [send] CUDA: req 0.00 sec, 0.00 instances; est delay 0.00
2013-06-03 10:46:02.4187 [PID=30856]    [send] work_req_seconds: 0.00 secs
2013-06-03 10:46:02.4187 [PID=30856]    [send] available disk 23.89 GB, work_buf_min 0
2013-06-03 10:46:02.4187 [PID=30856]    [send] active_frac 0.999999 on_frac 0.999944 DCF 1.226839
2013-06-03 10:46:02.4222 [PID=30856]    Sending reply to [HOST#6119565]: 0 results, delay req 60.00
2013-06-03 10:46:02.4225 [PID=30856]    Scheduler ran 0.024 seconds

Only mentioning it in case it relates to the main issue of work over-fetch; it may well be a sidetrack but it appears to only affect single-GPU hosts.

I don't think this is related to the over-fetch issue, but still noteworthy as it strains our servers without an obvious reason. as the log indicates, the client really doesn't request new work at all.

Cheers
HB

Neil Newell
Neil Newell
Joined: 20 Nov 12
Posts: 176
Credit: 169,699,457
RAC: 0

I was worried it was just my

I was worried it was just my hosts, but it's not - for example this host is doing the same according to its log. Same BOINC version, different OS (windows vs. linux).

Also I have found that one of my single-GPU hosts isn't doing it - the only difference I'm aware of is that it's on 7.0.44 (all the others are 7.0.64 or .65).

Beyond
Beyond
Joined: 28 Feb 05
Posts: 121
Credit: 2,275,766,212
RAC: 4,873,941

RE: I was worried it was

Quote:

I was worried it was just my hosts, but it's not - for example this host is doing the same according to its log. Same BOINC version, different OS (windows vs. linux).

Also I have found that one of my single-GPU hosts isn't doing it - the only difference I'm aware of is that it's on 7.0.44 (all the others are 7.0.64 or .65).


I had it happen on 2 machines too. The solution here was to switch to BOINC 7.1.3. There are many work fetch fixes between 7.0.64 and 7.1.3:

http://boinc.berkeley.edu/dev/forum_thread.php?id=8378#49151

Time will tell.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.