BIG TIME OVERFETCH!!! Perseus.

tbret
Joined: 12 Mar 05
Posts: 2,115
Credit: 4,859,538,348
RAC: 41,382

I'm guessing this will be the last one of these I post.

5/28/2013 1:35:29 AM | Einstein@Home | [work_fetch] request: CPU (0.00 sec, 0.00 inst) NVIDIA (0.00 sec, 0.00 inst)
5/28/2013 1:35:29 AM | Einstein@Home | Sending scheduler request: To report completed tasks.
5/28/2013 1:35:29 AM | Einstein@Home | Reporting 636 completed tasks
5/28/2013 1:35:29 AM | Einstein@Home | Not requesting tasks: don't need
5/28/2013 1:35:40 AM | Einstein@Home | Scheduler request completed
5/28/2013 1:35:40 AM | | [work_fetch] Request work fetch: RPC complete
5/28/2013 1:35:45 AM | | [work_fetch] work fetch start
5/28/2013 1:35:45 AM | | [work_fetch] choose_project() for NVIDIA: buffer_low: no; sim_excluded_instances 0
5/28/2013 1:35:45 AM | | [work_fetch] choose_project() for CPU: buffer_low: yes; sim_excluded_instances 0
5/28/2013 1:35:45 AM | | [work_fetch] no eligible project for CPU
5/28/2013 1:35:45 AM | | [work_fetch] ------- start work fetch state -------
5/28/2013 1:35:45 AM | | [work_fetch] target work buffer: 43200.00 + 864.00 sec
5/28/2013 1:35:45 AM | | [work_fetch] --- project states ---
5/28/2013 1:35:45 AM | Einstein@Home | [work_fetch] REC 84786.533 prio -0.359908 can't req work: scheduler RPC backoff (backoff: 54.89 sec)
5/28/2013 1:35:45 AM | | [work_fetch] --- state for CPU ---
5/28/2013 1:35:45 AM | | [work_fetch] shortfall 132192.00 nidle 3.00 saturated 0.00 busy 0.00
5/28/2013 1:35:45 AM | Einstein@Home | [work_fetch] fetch share 0.000 (blocked by prefs)
5/28/2013 1:35:45 AM | | [work_fetch] --- state for NVIDIA ---
5/28/2013 1:35:45 AM | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 74306.23 busy 0.00
5/28/2013 1:35:45 AM | Einstein@Home | [work_fetch] fetch share 0.000
5/28/2013 1:35:45 AM | | [work_fetch] ------- end work fetch state -------
5/28/2013 1:35:45 AM | | [work_fetch] No project chosen for work fetch
5/28/2013 1:35:45 AM | Einstein@Home | update requested by user
5/28/2013 1:35:45 AM | | [work_fetch] Request work fetch: project updated by user
5/28/2013 1:35:46 AM | | [work_fetch] Request work fetch: Backoff ended for Einstein@Home
5/28/2013 1:35:50 AM | | [work_fetch] work fetch start
5/28/2013 1:35:50 AM | | [work_fetch] choose_project() for NVIDIA: buffer_low: no; sim_excluded_instances 0
5/28/2013 1:35:50 AM | | [work_fetch] choose_project() for CPU: buffer_low: yes; sim_excluded_instances 0
5/28/2013 1:35:50 AM | | [work_fetch] no eligible project for CPU
5/28/2013 1:35:50 AM | | [work_fetch] ------- start work fetch state -------
5/28/2013 1:35:50 AM | | [work_fetch] target work buffer: 43200.00 + 864.00 sec
5/28/2013 1:35:50 AM | | [work_fetch] --- project states ---
5/28/2013 1:35:50 AM | Einstein@Home | [work_fetch] REC 84786.533 prio -1.359885 can req work
5/28/2013 1:35:50 AM | | [work_fetch] --- state for CPU ---
5/28/2013 1:35:50 AM | | [work_fetch] shortfall 132192.00 nidle 3.00 saturated 0.00 busy 0.00
5/28/2013 1:35:50 AM | Einstein@Home | [work_fetch] fetch share 0.000 (blocked by prefs)
5/28/2013 1:35:50 AM | | [work_fetch] --- state for NVIDIA ---
5/28/2013 1:35:50 AM | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 74301.24 busy 0.00
5/28/2013 1:35:50 AM | Einstein@Home | [work_fetch] fetch share 1.000
5/28/2013 1:35:50 AM | | [work_fetch] ------- end work fetch state -------
5/28/2013 1:35:50 AM | | [work_fetch] No project chosen for work fetch
5/28/2013 1:35:50 AM | Einstein@Home | [work_fetch] request: CPU (0.00 sec, 0.00 inst) NVIDIA (0.00 sec, 0.00 inst)
5/28/2013 1:35:50 AM | Einstein@Home | Sending scheduler request: Requested by user.
5/28/2013 1:35:50 AM | Einstein@Home | Not requesting tasks: don't need
5/28/2013 1:35:53 AM | Einstein@Home | Scheduler request completed
5/28/2013 1:35:53 AM | | [work_fetch] Request work fetch: RPC complete
5/28/2013 1:35:58 AM | | [work_fetch] work fetch start
5/28/2013 1:35:58 AM | | [work_fetch] choose_project() for NVIDIA: buffer_low: no; sim_excluded_instances 0
5/28/2013 1:35:58 AM | | [work_fetch] choose_project() for CPU: buffer_low: yes; sim_excluded_instances 0
5/28/2013 1:35:58 AM | | [work_fetch] no eligible project for CPU
5/28/2013 1:35:58 AM | | [work_fetch] ------- start work fetch state -------
5/28/2013 1:35:58 AM | | [work_fetch] target work buffer: 43200.00 + 864.00 sec
5/28/2013 1:35:58 AM | | [work_fetch] --- project states ---
5/28/2013 1:35:58 AM | Einstein@Home | [work_fetch] REC 84786.533 prio -0.359848 can't req work: scheduler RPC backoff (backoff: 54.90 sec)
5/28/2013 1:35:58 AM | | [work_fetch] --- state for CPU ---
5/28/2013 1:35:58 AM | | [work_fetch] shortfall 132192.00 nidle 3.00 saturated 0.00 busy 0.00
5/28/2013 1:35:58 AM | Einstein@Home | [work_fetch] fetch share 0.000 (blocked by prefs)
5/28/2013 1:35:58 AM | | [work_fetch] --- state for NVIDIA ---
5/28/2013 1:35:58 AM | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 74293.25 busy 0.00
5/28/2013 1:35:58 AM | Einstein@Home | [work_fetch] fetch share 0.000
5/28/2013 1:35:58 AM | | [work_fetch] ------- end work fetch state -------
5/28/2013 1:35:58 AM | | [work_fetch] No project chosen for work fetch
5/28/2013 1:36:53 AM | | [work_fetch] Request work fetch: Backoff ended for Einstein@Home
5/28/2013 1:36:53 AM | | [work_fetch] work fetch start
5/28/2013 1:36:53 AM | | [work_fetch] choose_project() for NVIDIA: buffer_low: no; sim_excluded_instances 0
5/28/2013 1:36:53 AM | | [work_fetch] choose_project() for CPU: buffer_low: yes; sim_excluded_instances 0
5/28/2013 1:36:53 AM | | [work_fetch] no eligible project for CPU
5/28/2013 1:36:53 AM | | [work_fetch] ------- start work fetch state -------
5/28/2013 1:36:53 AM | | [work_fetch] target work buffer: 43200.00 + 864.00 sec
5/28/2013 1:36:53 AM | | [work_fetch] --- project states ---
5/28/2013 1:36:53 AM | Einstein@Home | [work_fetch] REC 84792.289 prio -1.359594 can req work
5/28/2013 1:36:53 AM | | [work_fetch] --- state for CPU ---
5/28/2013 1:36:53 AM | | [work_fetch] shortfall 132192.00 nidle 3.00 saturated 0.00 busy 0.00
5/28/2013 1:36:53 AM | Einstein@Home | [work_fetch] fetch share 0.000 (blocked by prefs)
5/28/2013 1:36:53 AM | | [work_fetch] --- state for NVIDIA ---
5/28/2013 1:36:53 AM | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 74238.49 busy 0.00
5/28/2013 1:36:53 AM | Einstein@Home | [work_fetch] fetch share 1.000
5/28/2013 1:36:53 AM | | [work_fetch] ------- end work fetch state -------
5/28/2013 1:36:53 AM | | [work_fetch] No project chosen for work fetch
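
Since these [work_fetch] dumps are hard to read, here is how I read the numbers. The little sketch below is my own simplification (a hypothetical model, not the actual BOINC client code), but it reproduces the figures in the log:

[code]
# Rough sketch of the work-fetch bookkeeping above - my own simplification,
# NOT the real BOINC client code. It just reproduces the logged numbers.

TARGET_BUFFER = 43200.0 + 864.0      # "target work buffer" in seconds (= 44064)

def shortfall(n_idle, n_busy, saturated):
    """Idle instances are short by the whole buffer; busy instances only by
    whatever the buffer still lacks beyond the work already queued."""
    missing = max(0.0, TARGET_BUFFER - saturated)
    return n_idle * TARGET_BUFFER + n_busy * missing

# CPU: nidle 3.00, saturated 0.00 -> 3 * 44064 = 132192.00, exactly as logged.
# The buffer is low, but Einstein's CPU fetch share is 0.000 ("blocked by
# prefs"), so there is "no eligible project for CPU" and nothing is requested.
cpu_shortfall = shortfall(n_idle=3, n_busy=0, saturated=0.0)

# NVIDIA: nidle 0.00, saturated 74306.23 -> the queue already exceeds the
# buffer, so the shortfall is 0.00 and buffer_low stays "no".
# (The actual GPU count doesn't change the result, since nothing is missing.)
gpu_shortfall = shortfall(n_idle=0, n_busy=1, saturated=74306.23)

print(cpu_shortfall, gpu_shortfall)  # 132192.0 0.0 -> request: 0.00 sec, 0.00 inst
[/code]

In other words, at the time of this log the client already sees roughly 20.6 hours of queued GPU work against a ~12.2-hour target buffer, so it correctly declines to request more.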

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2,143
Credit: 2,927,918,552
RAC: 764,250

Thanks for that. I'll tell David he has some more reading material here.

Neil Newell
Joined: 20 Nov 12
Posts: 176
Credit: 169,699,457
RAC: 0

Similar situation here - 4 hosts have overfetched work.

Host 6618250, 479 PAS in progress, 7.0.62
Host 6290763, 361 PAS in progress, 7.0.62
Host 6119246, 104 PAS in progress, 7.0.65
Host 6564477, 412 PAS in progress, 7.0.62

Looking at the tasks, it appears they may have been sent in two distinct time periods - the first runs roughly from 20:30 on 31-May-2013 to 02:00 on 01-Jun-2013, and the second from 00:30 to 01:30 on 02-Jun-2013.

Not all hosts are affected in each period, e.g. Host 6119246 only appears to have overfetched in the first period.

As listed above, 3 hosts are on 7.0.62 and one is on 7.0.65, and they are spread across 2 different venues.

Of possible relevance is that a few days ago (date not recorded) I updated my "Computing preferences", increasing my cache from 0.25 to 1 day (I did this because another of my hosts #6119565 keeps polling for work every 60 seconds, for no good reason that I can see).
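
For reference, the "target work buffer: X + Y sec" lines in these logs should just be the cache preferences converted to seconds (my assumption is that BOINC multiplies both the minimum and the additional days by 86400), so a cache change like mine shows up there straight away:

[code]
# Quick sanity check (my assumption about how BOINC derives the figure):
# cache preferences in days  ->  "target work buffer: X + Y sec"
SECONDS_PER_DAY = 86400

def target_buffer(min_days, additional_days):
    return min_days * SECONDS_PER_DAY, additional_days * SECONDS_PER_DAY

print(target_buffer(0.5, 0.01))  # (43200.0, 864.0) - the values in tbret's log
print(target_buffer(1.0, 0.01))  # (86400.0, 864.0) - a 1-day cache, as on my hosts now
[/code]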

Neil Newell
Joined: 20 Nov 12
Posts: 176
Credit: 169,699,457
RAC: 0

...also Host 6123309: 81 pending PAS tasks (~55 days of work), all sent between 18:00 and 18:45 on 1-Jun-2013.

I checked this host because I noticed that all the other hosts that have over-fetched have 2 GPUs (so does this one, albeit only 2x 9600GTs, which may explain why the over-fetch isn't so severe).

Whether or not this is a significant observation, out of my 10 active GPU hosts the 5 with 2 GPUs have over-fetched and the 5 with a single GPU haven't. Additionally, it appears the single-GPU hosts are polling for new work once a minute, but not getting any:

...
03-Jun-2013 11:01:09 [Einstein@Home] Sending scheduler request: To fetch work.
03-Jun-2013 11:01:09 [Einstein@Home] Not requesting tasks
03-Jun-2013 11:01:14 [Einstein@Home] Scheduler request completed
03-Jun-2013 11:02:15 [Einstein@Home] Sending scheduler request: To fetch work.
03-Jun-2013 11:02:15 [Einstein@Home] Not requesting tasks
03-Jun-2013 11:02:19 [Einstein@Home] Scheduler request completed
...
Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3,522
Credit: 695,118,619
RAC: 123,043

Quote:

Additionally, it appears the hosts with single GPUs are polling for new work once a minute, but not getting any:

...
03-Jun-2013 11:01:09 [Einstein@Home] Sending scheduler request: To fetch work.
03-Jun-2013 11:01:09 [Einstein@Home] Not requesting tasks
03-Jun-2013 11:01:14 [Einstein@Home] Scheduler request completed
03-Jun-2013 11:02:15 [Einstein@Home] Sending scheduler request: To fetch work.
03-Jun-2013 11:02:15 [Einstein@Home] Not requesting tasks
03-Jun-2013 11:02:19 [Einstein@Home] Scheduler request completed
...

I'm not even sure they are really requesting work - see the text in the quote above that I highlighted. At the very least, the text is misleading.

Cheers
HB

Neil Newell
Joined: 20 Nov 12
Posts: 176
Credit: 169,699,457
RAC: 0

I did wonder if it was a local-only check; still, the multi-GPU hosts aren't doing it.

I also note that the same pattern of over-fetching fits tbret's hosts - the single-GPU hosts have a reasonable amount of work, but most multi-GPU hosts have far too many PAS tasks.

Poking about the [url=http://einstein.phys.uwm.edu/top_hosts.php]top hosts[/url] page shows other multi-GPU hosts that appear to have over-fetched - e.g. this host has 179 tasks 'in progress' (~40 days of work against a deadline of 12th June - even with 2 GPUs it surely won't make it).

Neil Newell
Joined: 20 Nov 12
Posts: 176
Credit: 169,699,457
RAC: 0

Quote:
Quote:
Additionally, it appears the hosts with single GPUs are polling for new work once a minute

I'm not even sure they really request work, see the text in above quote that was highlighted by me. The text is at least misleading.


I checked with 'tcpdump' - the single-GPU hosts really are hitting the server once a minute; this shows up on the "Last contact" link as well.

2013-06-03 10:46:02.4045 [PID=30856]   Request: [USER#xxxxx] [HOST#6119565] [IP xxx.xxx.xxx.38] client 7.0.62
2013-06-03 10:46:02.4187 [PID=30856]    [send] effective_ncpus 1 max_jobs_on_host_cpu 999999 max_jobs_on_host 999999
2013-06-03 10:46:02.4187 [PID=30856]    [send] effective_ngpus 1 max_jobs_on_host_gpu 999999
2013-06-03 10:46:02.4187 [PID=30856]    [send] Not using matchmaker scheduling; Not using EDF sim
2013-06-03 10:46:02.4187 [PID=30856]    [send] CPU: req 0.00 sec, 0.00 instances; est delay 0.00
2013-06-03 10:46:02.4187 [PID=30856]    [send] CUDA: req 0.00 sec, 0.00 instances; est delay 0.00
2013-06-03 10:46:02.4187 [PID=30856]    [send] work_req_seconds: 0.00 secs
2013-06-03 10:46:02.4187 [PID=30856]    [send] available disk 23.89 GB, work_buf_min 0
2013-06-03 10:46:02.4187 [PID=30856]    [send] active_frac 0.999999 on_frac 0.999944 DCF 1.226839
2013-06-03 10:46:02.4222 [PID=30856]    Sending reply to [HOST#6119565]: 0 results, delay req 60.00
2013-06-03 10:46:02.4225 [PID=30856]    Scheduler ran 0.024 seconds

I'm only mentioning it in case it relates to the main issue of work over-fetch; it may well be a sidetrack, but it appears to affect only single-GPU hosts.

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3,522
Credit: 695,118,619
RAC: 123,043

Quote:
Quote:
Quote:
Additionally, it appears the hosts with single GPUs are polling for new work once a minute

I'm not even sure they really request work, see the text in above quote that was highlighted by me. The text is at least misleading.


I checked with 'tcpdump' - the single-GPU hosts really are hitting the server once a minute; this shows up on the "Last contact" link as well.

2013-06-03 10:46:02.4045 [PID=30856]   Request: [USER#xxxxx] [HOST#6119565] [IP xxx.xxx.xxx.38] client 7.0.62
2013-06-03 10:46:02.4187 [PID=30856]    [send] effective_ncpus 1 max_jobs_on_host_cpu 999999 max_jobs_on_host 999999
2013-06-03 10:46:02.4187 [PID=30856]    [send] effective_ngpus 1 max_jobs_on_host_gpu 999999
2013-06-03 10:46:02.4187 [PID=30856]    [send] Not using matchmaker scheduling; Not using EDF sim
2013-06-03 10:46:02.4187 [PID=30856]    [send] CPU: req 0.00 sec, 0.00 instances; est delay 0.00
2013-06-03 10:46:02.4187 [PID=30856]    [send] CUDA: req 0.00 sec, 0.00 instances; est delay 0.00
2013-06-03 10:46:02.4187 [PID=30856]    [send] work_req_seconds: 0.00 secs
2013-06-03 10:46:02.4187 [PID=30856]    [send] available disk 23.89 GB, work_buf_min 0
2013-06-03 10:46:02.4187 [PID=30856]    [send] active_frac 0.999999 on_frac 0.999944 DCF 1.226839
2013-06-03 10:46:02.4222 [PID=30856]    Sending reply to [HOST#6119565]: 0 results, delay req 60.00
2013-06-03 10:46:02.4225 [PID=30856]    Scheduler ran 0.024 seconds

Only mentioning it in case it relates to the main issue of work over-fetch; it may well be a sidetrack but it appears to only affect single-GPU hosts.

I don't think this is related to the over-fetch issue, but it's still noteworthy, as it strains our servers for no obvious reason. As the log indicates, the client really isn't requesting new work at all.

Cheers
HB

Neil Newell
Joined: 20 Nov 12
Posts: 176
Credit: 169,699,457
RAC: 0

I was worried it was just my hosts, but it's not - for example, this host is doing the same according to its log. Same BOINC version, different OS (Windows vs. Linux).

I've also found that one of my single-GPU hosts isn't doing it - the only difference I'm aware of is that it's on 7.0.44 (all the others are on 7.0.64 or .65).

Beyond
Joined: 28 Feb 05
Posts: 121
Credit: 2,143,986,212
RAC: 2,725,489

Quote:

I was worried it was just my hosts, but it's not - for example this host is doing the same according to its log. Same BOINC version, different OS (windows vs. linux).

Also I have found that one of my single-GPU hosts isn't doing it - the only difference I'm aware of is that it's on 7.0.44 (all the others are 7.0.64 or .65).


I had it happen on 2 machines too. The solution here was to switch to BOINC 7.1.3. There are many work fetch fixes between 7.0.64 and 7.1.3:

http://boinc.berkeley.edu/dev/forum_thread.php?id=8378#49151

Time will tell.
