BIG TIME OVERFETCH!!! Perseus.

Neil Newell
Joined: 20 Nov 12
Posts: 176
Credit: 169,699,457
RAC: 0

RE: Instead, it's with the

Quote:

Instead, it's with the client asking for a reasonable amount again, and again, and again - failing to take account of the fact that the amount of work it has cached is growing with every request.

I've now set "work_fetch_debug" on all my affected hosts, and while doing so I took copies of all the work logs. Referring back to my message 124924 and given Richard's comments, I examined the periods where my hosts downloaded the excess work.

As surmised, the hosts do indeed appear to have gone into a state whereby they (about once a minute) ask for more work, get some, and repeat the process for several minutes.
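To make the suspected failure mode concrete, here's a toy model - the numbers are invented and this is emphatically not BOINC's real code - contrasting a client that accounts for its growing cache with one that requests the same amount every minute:

```python
# Toy model (numbers invented, not BOINC's actual logic) of the suspected
# failure mode: a client that re-requests work every minute without
# accounting for the cache it has already built up.

def simulate(accounts_for_cache, minutes=15, task_secs=8000.0, grant=4):
    """Return how many tasks get fetched over `minutes` one-minute cycles."""
    target = 95040.0          # 1.0 day cache + 0.1 day extra, as in the logs
    cached = 0.0              # seconds of work already on hand
    fetched = 0
    for _ in range(minutes):
        if accounts_for_cache:
            need = max(0.0, target - cached)   # correct: cached work counts
        else:
            need = target                      # buggy: cached work ignored
        if need > 0:
            cached += grant * task_secs        # scheduler grants ~4 tasks
            fetched += grant
    return fetched

# A correct client stops after a few fetches; the buggy one never stops
# (until something external, like the daily quota, cuts it off).
```

In the buggy case the only thing that ends the loop is an outside limit, which matches the "(reached daily quota of 384 tasks)" line below.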

Specifically, here's what I've found in the log files for my hosts (PM for copies):

Host 6564477 downloaded groups of ~4 tasks once per minute from at least (log start) 01 Jun 00:53 to 01:09, then the same again the next day, 02 Jun 01:15 to 01:30. Here's a snippet from the log at the end of the first period:

...
Sat 01 Jun 2013 00:54:28 BST | Einstein@Home | Sending scheduler request: To fetch work.
Sat 01 Jun 2013 00:54:28 BST | Einstein@Home | Requesting new tasks for NVIDIA
Sat 01 Jun 2013 00:54:34 BST | Einstein@Home | Scheduler request completed: got 4 new tasks
Sat 01 Jun 2013 00:54:36 BST | Einstein@Home | Started download of PA0057_015B1_144.bin4
.....repeat several times, once per minute....
Sat 01 Jun 2013 01:09:39 BST | Einstein@Home | Sending scheduler request: To fetch work.
Sat 01 Jun 2013 01:09:39 BST | Einstein@Home | Requesting new tasks for NVIDIA
Sat 01 Jun 2013 01:09:44 BST | Einstein@Home | Scheduler request completed: got 0 new tasks
Sat 01 Jun 2013 01:09:44 BST | Einstein@Home | No work sent
Sat 01 Jun 2013 01:09:44 BST | Einstein@Home | (reached daily quota of 384 tasks)
Sat 01 Jun 2013 01:09:44 BST | Einstein@Home | Project has no jobs available
Sat 01 Jun 2013 01:27:55 BST | Einstein@Home | Computation for task PA0057_01221_159_2 finished

Clearly, at this point the host already had far too much work to do - yet the next day it fetched a bunch more. Note that in this second period the host seems to have stopped asking for work by itself (unlike the first period above, which ended on the daily quota).

....routine uploads of completed work....
Sun 02 Jun 2013 00:54:40 BST | Einstein@Home | Finished upload of PA0058_00221_141_1_2
Sun 02 Jun 2013 01:15:45 BST | Einstein@Home | Sending scheduler request: To fetch work.
Sun 02 Jun 2013 01:15:45 BST | Einstein@Home | Reporting 13 completed tasks
Sun 02 Jun 2013 01:15:45 BST | Einstein@Home | Requesting new tasks for NVIDIA
Sun 02 Jun 2013 01:16:02 BST | Einstein@Home | Scheduler request completed: got 4 new tasks
....downloads ~4 tasks once a minute for several minutes....
Sun 02 Jun 2013 01:29:07 BST | Einstein@Home | Finished download of PA0058_007B1_92.bin4
Sun 02 Jun 2013 01:29:41 BST | Einstein@Home | Sending scheduler request: To fetch work.
Sun 02 Jun 2013 01:29:41 BST | Einstein@Home | Requesting new tasks for NVIDIA
Sun 02 Jun 2013 01:29:47 BST | Einstein@Home | Scheduler request completed: got 5 new tasks
Sun 02 Jun 2013 01:29:49 BST | Einstein@Home | Started download of PA0058_03961_195.bin4
.....downloads....
Sun 02 Jun 2013 01:30:48 BST | Einstein@Home | Finished download of PA0058_00691_239.bin4
Sun 02 Jun 2013 05:27:58 BST | Einstein@Home | Computation for task PA0058_00171_315_0 finished
Sun 02 Jun 2013 05:27:58 BST | Einstein@Home | Starting task PA0058_001D1_81_0 using einsteinbinary_BRP5 version 133 (BRP4cuda32nv270) in slot 2

Here's another interesting host - because it's slow, the log messages go back much further.

Host 6123309 - downloaded 1 or 2 tasks per minute (slow host) from 12:54 to 13:04 on 28 May, and again from 19:02 to 19:46 on 01 Jun. Nothing interesting in the log file for the first period, but the second period ended like this:

.....40+ minutes of once per minute downloads....
Sat 01 Jun 2013 19:48:34 BST | Einstein@Home | Finished download of PA0058_00451.zap
Sat 01 Jun 2013 19:49:25 BST | Einstein@Home | Sending scheduler request: To fetch work.
Sat 01 Jun 2013 19:49:25 BST | Einstein@Home | Requesting new tasks for NVIDIA
Sat 01 Jun 2013 19:49:29 BST | Einstein@Home | Scheduler request completed: got 0 new tasks
Sat 01 Jun 2013 19:49:29 BST | Einstein@Home | No work sent
Sat 01 Jun 2013 19:49:29 BST | Einstein@Home | see scheduler log messages on http://einstein.phys.uwm.edu//host_sched_logs/6123/6123309
Sat 01 Jun 2013 19:49:29 BST | Einstein@Home | (won't finish in time) BOINC runs 100.0% of time, computation enabled 100.0% of that

Host 6618250 - downloaded groups of ~4 tasks once per minute from 02:10 (at least - oldest entry in log) to 02:29 on 02 Jun. There is no indication in the log why it stopped at that point (the next log entry is a "Computation finished" at 04:51).

Host 6290763 - tasks being downloaded from at least (log start) 01:27 to 01:30 on 02 Jun. Nothing significant in log at end of period.

Surmising: could it be that there has always been some subtle issue with work fetch, which has now been exposed by the "once a minute" issue? E.g. on the normal fetch schedule the occasional overfetch might never have been noticed, but combined with the "once a minute" issue it results in far too much work being downloaded. Additionally, while it seems that all hosts suffer from the "once per minute" issue, as far as I've seen only systems with more than 1 GPU have overfetched (has anyone seen a single-GPU host that has overfetched?).

n.b. the only host I have that hasn't done the "once a minute" thing is on boinc-7.0.44; 'grep Schedule $LOGFILE' doesn't show any instances of it (all the other hosts are on 7.0.62 or .65, and show the "once per minute" requests).
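For anyone who wants to scan their own logs for this, here's a small script - a sketch only: it assumes the stdout log format quoted above, and the 70-second threshold is just my arbitrary cutoff for "once per minute":

```python
from datetime import datetime

def request_times(lines):
    """Timestamps of 'Sending scheduler request' lines in a BOINC stdout log."""
    times = []
    for line in lines:
        if 'Sending scheduler request' not in line:
            continue
        # "Sat 01 Jun 2013 00:54:28 BST | ..." -> keep "01 Jun 2013 00:54:28"
        fields = line.split(' | ')[0].split()
        times.append(datetime.strptime(' '.join(fields[1:5]),
                                       '%d %b %Y %H:%M:%S'))
    return times

def minute_bursts(times, max_gap=70, min_len=3):
    """Runs of >= min_len requests, each spaced <= max_gap seconds apart."""
    bursts, run = [], times[:1]
    for prev, cur in zip(times, times[1:]):
        if (cur - prev).total_seconds() <= max_gap:
            run.append(cur)
        else:
            if len(run) >= min_len:
                bursts.append((run[0], run[-1]))
            run = [cur]
    if len(run) >= min_len:
        bursts.append((run[0], run[-1]))
    return bursts
```

On the affected hosts, a burst spanning 15+ minutes should line up with the overfetch periods described above.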

Ref. the DNS question, some technical notes about the set-up here: there are 10 Linux hosts w/GPU online at present (5 single, 5 dual), running local DNS and NTP. The last network outage was 41 days ago. These hosts are pretty much exclusively BOINC - either just E@H or (2 hosts) A@H as well. No automatic updating is enabled, and most of them are headless, not even running X. So it's a pretty 'quiet' environment, and I certainly can't see anything that could have triggered the periods of "once per minute" requests (as previously mentioned, the only recent change on these hosts was increasing the cache from 0.25 to 1.00 days, to see if that stopped the "once per minute" requests).

Addendum: having examined the log files for my single-GPU (unaffected) hosts, the "once per minute" issue was absolutely consistent on them: the requests continued once per minute (mostly getting no work, occasionally getting a new task as one would expect) right up to the moment I used the "load config file" trick to stop the problem. In other words, single-GPU hosts didn't stop requesting work (and didn't overfetch) until I intervened - unlike the dual-GPU hosts, which both over-fetched and sooner or later stopped the "once per minute" requests.

Jacob Klein
Joined: 22 Jun 11
Posts: 45
Credit: 114,028,485
RAC: 0

Does anyone here in this

Does anyone here in this thread still have the "BOINC requests too much work" problem happening?

For people who did have it: can you think of any reason why BOINC might consider a GPU to be fully idle?

I would LOVE to see a log section that demonstrates the issue.

Jacob Klein
Joined: 22 Jun 11
Posts: 45
Credit: 114,028,485
RAC: 0

So, I did the "driver

So, I did the "driver installation" test: I upgraded the nVidia drivers from the 311.06 drivers (from Windows Update) to the latest 320.18 WHQL drivers while BOINC was actively doing work on the 3 GPUs.

It did fail some tasks with computation errors, but eventually BOINC was able to get 6 tasks (2 per GPU) working again, all on its own. Then a 45-minute backoff was created. After the backoff timed out, a request for work returned 0 tasks (the daily quota of 8 tasks had been reached), and Einstein imposed a much longer backoff. (Log is below.)

Although that's not entirely conclusive, I think I'm going to abandon this method of testing. I'll let my computer run Einstein for a few days in a row, but I'm admittedly nowhere closer to discovering the issue.

For tbret's original problem, something made BOINC think that a GPU was fully idle. I don't know what that might be.
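For anyone decoding the [work_fetch] lines below: "target work buffer" comes straight from the cache preferences, and "shortfall" is measured against it. A minimal sketch of the arithmetic, assuming preferences of 1.0 days of work plus 0.1 days of additional buffer (which matches the 86400 + 8640 in the log):

```python
# Sketch of the "target work buffer" arithmetic in BOINC's [work_fetch]
# debug output, assuming cache preferences of 1.0 days + 0.1 days extra.
DAY = 86400.0
min_work_buf_days = 1.0    # "Store at least X days of work"
extra_buf_days = 0.1       # "Store up to an additional X days of work"

min_buf = min_work_buf_days * DAY     # the 86400.00 in the log
extra_buf = extra_buf_days * DAY      # the 8640.00 in the log
target_buffer = min_buf + extra_buf   # what "shortfall" is measured against
```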

6/9/2013 11:13:45 AM | | [work_fetch] work fetch start
6/9/2013 11:13:45 AM | | [work_fetch] choose_project() for NVIDIA: buffer_low: no; sim_excluded_instances 0
6/9/2013 11:13:45 AM | | [work_fetch] choose_project() for CPU: buffer_low: yes; sim_excluded_instances 0
6/9/2013 11:13:45 AM | | [work_fetch] no eligible project for CPU
6/9/2013 11:13:45 AM | | [work_fetch] ------- start work fetch state -------
6/9/2013 11:13:45 AM | | [work_fetch] target work buffer: 86400.00 + 8640.00 sec
6/9/2013 11:13:45 AM | | [work_fetch] --- project states ---
6/9/2013 11:13:45 AM | Einstein@Home | [work_fetch] REC 19351.244 prio -1.058872 can req work
6/9/2013 11:13:45 AM | | [work_fetch] --- state for CPU ---
6/9/2013 11:13:45 AM | | [work_fetch] shortfall 190080.00 nidle 2.00 saturated 0.00 busy 0.00
6/9/2013 11:13:45 AM | Einstein@Home | [work_fetch] fetch share 0.000 (blocked by prefs)
6/9/2013 11:13:45 AM | | [work_fetch] --- state for NVIDIA ---
6/9/2013 11:13:45 AM | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 102239.25 busy 0.00
6/9/2013 11:13:45 AM | Einstein@Home | [work_fetch] fetch share 1.000
6/9/2013 11:13:45 AM | | [work_fetch] ------- end work fetch state -------
6/9/2013 11:13:45 AM | | [work_fetch] No project chosen for work fetch
6/9/2013 11:13:47 AM | Einstein@Home | [checkpoint] result PA0070_00241_27_0 checkpointed
6/9/2013 11:13:52 AM | Einstein@Home | [checkpoint] result PA0053_00241_252_6 checkpointed
6/9/2013 11:14:15 AM | Einstein@Home | [checkpoint] result PA0052_005A1_201_4 checkpointed
6/9/2013 11:14:15 AM | Einstein@Home | [checkpoint] result PA0069_01151_387_1 checkpointed
6/9/2013 11:14:16 AM | Einstein@Home | [checkpoint] result PA0069_01171_129_1 checkpointed
6/9/2013 11:14:17 AM | Einstein@Home | [checkpoint] result PA0069_011A1_360_0 checkpointed
6/9/2013 11:14:23 AM | | [work_fetch] Request work fetch: application exited
6/9/2013 11:14:23 AM | Einstein@Home | Starting task PA0070_00241_81_0 using einsteinbinary_BRP5 version 133 (BRP4cuda32nv301) in slot 6
6/9/2013 11:14:24 AM | | [work_fetch] Request work fetch: application exited
6/9/2013 11:14:24 AM | Einstein@Home | Starting task PA0069_01291_309_1 using einsteinbinary_BRP5 version 133 (BRP4cuda32nv301) in slot 7
6/9/2013 11:14:25 AM | | [work_fetch] work fetch start
6/9/2013 11:14:25 AM | | [work_fetch] choose_project() for NVIDIA: buffer_low: yes; sim_excluded_instances 0
6/9/2013 11:14:25 AM | | [work_fetch] ------- start work fetch state -------
6/9/2013 11:14:25 AM | | [work_fetch] target work buffer: 86400.00 + 8640.00 sec
6/9/2013 11:14:25 AM | | [work_fetch] --- project states ---
6/9/2013 11:14:25 AM | Einstein@Home | [work_fetch] REC 19405.494 prio -1.053855 can req work
6/9/2013 11:14:25 AM | | [work_fetch] --- state for CPU ---
6/9/2013 11:14:25 AM | | [work_fetch] shortfall 190080.00 nidle 2.00 saturated 0.00 busy 0.00
6/9/2013 11:14:25 AM | Einstein@Home | [work_fetch] fetch share 0.000 (blocked by prefs)
6/9/2013 11:14:25 AM | | [work_fetch] --- state for NVIDIA ---
6/9/2013 11:14:25 AM | | [work_fetch] shortfall 23645.20 nidle 0.00 saturated 79276.53 busy 0.00
6/9/2013 11:14:25 AM | Einstein@Home | [work_fetch] fetch share 0.000
6/9/2013 11:14:25 AM | | [work_fetch] ------- end work fetch state -------
6/9/2013 11:14:25 AM | Einstein@Home | [work_fetch] request: CPU (0.00 sec, 0.00 inst) NVIDIA (0.00 sec, 0.00 inst)
6/9/2013 11:14:25 AM | Einstein@Home | Sending scheduler request: To fetch work.
6/9/2013 11:14:25 AM | Einstein@Home | Not requesting tasks
6/9/2013 11:14:26 AM | | [work_fetch] Request work fetch: application exited
6/9/2013 11:14:26 AM | | [work_fetch] Request work fetch: application exited
6/9/2013 11:14:26 AM | | [work_fetch] Request work fetch: application exited
6/9/2013 11:14:26 AM | | [work_fetch] Request work fetch: application exited
6/9/2013 11:14:26 AM | Einstein@Home | Computation for task PA0069_01151_387_1 finished
6/9/2013 11:14:26 AM | Einstein@Home | Output file PA0069_01151_387_1_2 for task PA0069_01151_387_1 absent
6/9/2013 11:14:26 AM | Einstein@Home | Computation for task PA0069_01171_129_1 finished
6/9/2013 11:14:26 AM | Einstein@Home | Output file PA0069_01171_129_1_0 for task PA0069_01171_129_1 absent
6/9/2013 11:14:26 AM | Einstein@Home | Output file PA0069_01171_129_1_1 for task PA0069_01171_129_1 absent
6/9/2013 11:14:26 AM | Einstein@Home | Output file PA0069_01171_129_1_2 for task PA0069_01171_129_1 absent
6/9/2013 11:14:26 AM | Einstein@Home | Computation for task PA0069_011A1_360_0 finished
6/9/2013 11:14:26 AM | Einstein@Home | Output file PA0069_011A1_360_0_0 for task PA0069_011A1_360_0 absent
6/9/2013 11:14:26 AM | Einstein@Home | Output file PA0069_011A1_360_0_1 for task PA0069_011A1_360_0 absent
6/9/2013 11:14:26 AM | Einstein@Home | Output file PA0069_011A1_360_0_2 for task PA0069_011A1_360_0 absent
6/9/2013 11:14:26 AM | Einstein@Home | Computation for task PA0053_00241_252_6 finished
6/9/2013 11:14:26 AM | Einstein@Home | Output file PA0053_00241_252_6_0 for task PA0053_00241_252_6 absent
6/9/2013 11:14:26 AM | Einstein@Home | Output file PA0053_00241_252_6_1 for task PA0053_00241_252_6 absent
6/9/2013 11:14:26 AM | Einstein@Home | Output file PA0053_00241_252_6_2 for task PA0053_00241_252_6 absent
6/9/2013 11:14:26 AM | Einstein@Home | Restarting task PA0052_005A1_201_4 using einsteinbinary_BRP5 version 133 (BRP4cuda32nv301) in slot 5
6/9/2013 11:14:26 AM | Einstein@Home | Restarting task PA0070_00241_27_0 using einsteinbinary_BRP5 version 133 (BRP4cuda32nv301) in slot 2
6/9/2013 11:14:26 AM | Einstein@Home | Starting task PA0070_001A1_54_0 using einsteinbinary_BRP5 version 133 (BRP4cuda32nv301) in slot 0
6/9/2013 11:14:26 AM | Einstein@Home | Starting task PA0069_01111_312_2 using einsteinbinary_BRP5 version 133 (BRP4cuda32nv301) in slot 1
6/9/2013 11:14:26 AM | Einstein@Home | Scheduler request completed
6/9/2013 11:14:26 AM | | [work_fetch] Request work fetch: RPC complete
6/9/2013 11:14:28 AM | | [work_fetch] Request work fetch: application exited
6/9/2013 11:14:28 AM | Einstein@Home | Started upload of PA0069_01151_387_1_0
6/9/2013 11:14:28 AM | Einstein@Home | Started upload of PA0069_01151_387_1_1
6/9/2013 11:14:28 AM | Einstein@Home | Computation for task PA0070_00241_81_0 finished
6/9/2013 11:14:28 AM | Einstein@Home | Output file PA0070_00241_81_0_0 for task PA0070_00241_81_0 absent
6/9/2013 11:14:28 AM | Einstein@Home | Output file PA0070_00241_81_0_1 for task PA0070_00241_81_0 absent
6/9/2013 11:14:28 AM | Einstein@Home | Output file PA0070_00241_81_0_2 for task PA0070_00241_81_0 absent
6/9/2013 11:14:28 AM | Einstein@Home | Starting task PA0070_00231_195_0 using einsteinbinary_BRP5 version 133 (BRP4cuda32nv301) in slot 3
6/9/2013 11:14:30 AM | | [work_fetch] Request work fetch: application exited
6/9/2013 11:14:30 AM | | [work_fetch] Request work fetch: application exited
6/9/2013 11:14:30 AM | | [work_fetch] Request work fetch: application exited
6/9/2013 11:14:30 AM | Einstein@Home | Finished upload of PA0069_01151_387_1_0
6/9/2013 11:14:30 AM | Einstein@Home | Finished upload of PA0069_01151_387_1_1
6/9/2013 11:14:30 AM | | [work_fetch] Request work fetch: project finished uploading
6/9/2013 11:14:30 AM | Einstein@Home | Starting task PA0070_001A1_0_1 using einsteinbinary_BRP5 version 133 (BRP4cuda32nv301) in slot 4
6/9/2013 11:14:30 AM | Einstein@Home | Starting task PA0070_00231_177_1 using einsteinbinary_BRP5 version 133 (BRP4cuda32nv301) in slot 6
6/9/2013 11:14:30 AM | Einstein@Home | Starting task PA0070_001C1_207_0 using einsteinbinary_BRP5 version 133 (BRP4cuda32nv301) in slot 8
6/9/2013 11:14:32 AM | | [work_fetch] Request work fetch: application exited
6/9/2013 11:14:32 AM | Einstein@Home | Computation for task PA0070_001A1_54_0 finished
6/9/2013 11:14:32 AM | Einstein@Home | Output file PA0070_001A1_54_0_0 for task PA0070_001A1_54_0 absent
6/9/2013 11:14:32 AM | Einstein@Home | Output file PA0070_001A1_54_0_1 for task PA0070_001A1_54_0 absent
6/9/2013 11:14:32 AM | Einstein@Home | Output file PA0070_001A1_54_0_2 for task PA0070_001A1_54_0 absent
6/9/2013 11:14:32 AM | Einstein@Home | Restarting task PA0052_005A1_201_4 using einsteinbinary_BRP5 version 133 (BRP4cuda32nv301) in slot 5
6/9/2013 11:14:32 AM | | [work_fetch] work fetch start
6/9/2013 11:14:32 AM | | [work_fetch] choose_project() for NVIDIA: buffer_low: yes; sim_excluded_instances 0
6/9/2013 11:14:32 AM | | [work_fetch] no eligible project for NVIDIA
6/9/2013 11:14:32 AM | | [work_fetch] choose_project() for CPU: buffer_low: yes; sim_excluded_instances 0
6/9/2013 11:14:32 AM | | [work_fetch] no eligible project for CPU
6/9/2013 11:14:32 AM | | [work_fetch] ------- start work fetch state -------
6/9/2013 11:14:32 AM | | [work_fetch] target work buffer: 86400.00 + 8640.00 sec
6/9/2013 11:14:32 AM | | [work_fetch] --- project states ---
6/9/2013 11:14:32 AM | Einstein@Home | [work_fetch] REC 19412.707 prio -0.041892 can't req work: scheduler RPC backoff (backoff: 121.53 sec)
6/9/2013 11:14:32 AM | | [work_fetch] --- state for CPU ---
6/9/2013 11:14:32 AM | | [work_fetch] shortfall 190080.00 nidle 2.00 saturated 0.00 busy 0.00
6/9/2013 11:14:32 AM | Einstein@Home | [work_fetch] fetch share 0.000 (blocked by prefs)
6/9/2013 11:14:32 AM | | [work_fetch] --- state for NVIDIA ---
6/9/2013 11:14:32 AM | | [work_fetch] shortfall 58381.06 nidle 0.00 saturated 57807.44 busy 0.00
6/9/2013 11:14:32 AM | Einstein@Home | [work_fetch] fetch share 0.000
6/9/2013 11:14:32 AM | | [work_fetch] ------- end work fetch state -------
6/9/2013 11:14:32 AM | | [work_fetch] No project chosen for work fetch
6/9/2013 11:14:33 AM | | [work_fetch] Request work fetch: application exited
6/9/2013 11:14:33 AM | | [work_fetch] Request work fetch: application exited
6/9/2013 11:14:33 AM | | [work_fetch] Request work fetch: application exited
6/9/2013 11:14:33 AM | Einstein@Home | Computation for task PA0069_01111_312_2 finished
6/9/2013 11:14:33 AM | Einstein@Home | Output file PA0069_01111_312_2_0 for task PA0069_01111_312_2 absent
6/9/2013 11:14:33 AM | Einstein@Home | Output file PA0069_01111_312_2_1 for task PA0069_01111_312_2 absent
6/9/2013 11:14:33 AM | Einstein@Home | Output file PA0069_01111_312_2_2 for task PA0069_01111_312_2 absent
6/9/2013 11:14:33 AM | Einstein@Home | Computation for task PA0070_00231_195_0 finished
6/9/2013 11:14:33 AM | Einstein@Home | Output file PA0070_00231_195_0_0 for task PA0070_00231_195_0 absent
6/9/2013 11:14:33 AM | Einstein@Home | Output file PA0070_00231_195_0_1 for task PA0070_00231_195_0 absent
6/9/2013 11:14:33 AM | Einstein@Home | Output file PA0070_00231_195_0_2 for task PA0070_00231_195_0 absent
6/9/2013 11:14:33 AM | Einstein@Home | Restarting task PA0070_00241_27_0 using einsteinbinary_BRP5 version 133 (BRP4cuda32nv301) in slot 2
6/9/2013 11:14:33 AM | Einstein@Home | Restarting task PA0069_01291_309_1 using einsteinbinary_BRP5 version 133 (BRP4cuda32nv301) in slot 7
6/9/2013 11:14:33 AM | Einstein@Home | Starting task PA0070_00251_378_0 using einsteinbinary_BRP5 version 133 (BRP4cuda32nv301) in slot 0
6/9/2013 11:14:35 AM | | [work_fetch] Request work fetch: application exited
6/9/2013 11:14:35 AM | | [work_fetch] Request work fetch: application exited
6/9/2013 11:14:35 AM | | [work_fetch] Request work fetch: application exited
6/9/2013 11:14:35 AM | Einstein@Home | Starting task PA0069_01081_336_0 using einsteinbinary_BRP5 version 133 (BRP4cuda32nv301) in slot 1
6/9/2013 11:14:35 AM | Einstein@Home | Starting task PA0070_00221_285_1 using einsteinbinary_BRP5 version 133 (BRP4cuda32nv301) in slot 3
6/9/2013 11:14:35 AM | Einstein@Home | Starting task PA0069_013A1_66_1 using einsteinbinary_BRP5 version 133 (BRP4cuda32nv301) in slot 9
6/9/2013 11:14:37 AM | | [work_fetch] Request work fetch: application exited
6/9/2013 11:14:37 AM | | [work_fetch] Request work fetch: application exited
6/9/2013 11:14:37 AM | Einstein@Home | Computation for task PA0070_00241_27_0 finished
6/9/2013 11:14:37 AM | Einstein@Home | Output file PA0070_00241_27_0_0 for task PA0070_00241_27_0 absent
6/9/2013 11:14:37 AM | Einstein@Home | Output file PA0070_00241_27_0_1 for task PA0070_00241_27_0 absent
6/9/2013 11:14:37 AM | Einstein@Home | Output file PA0070_00241_27_0_2 for task PA0070_00241_27_0 absent
6/9/2013 11:14:37 AM | Einstein@Home | Computation for task PA0069_01291_309_1 finished
6/9/2013 11:14:37 AM | Einstein@Home | Output file PA0069_01291_309_1_0 for task PA0069_01291_309_1 absent
6/9/2013 11:14:37 AM | Einstein@Home | Output file PA0069_01291_309_1_1 for task PA0069_01291_309_1 absent
6/9/2013 11:14:37 AM | Einstein@Home | Output file PA0069_01291_309_1_2 for task PA0069_01291_309_1 absent
6/9/2013 11:14:37 AM | Einstein@Home | Restarting task PA0052_005A1_201_4 using einsteinbinary_BRP5 version 133 (BRP4cuda32nv301) in slot 5
6/9/2013 11:14:37 AM | Einstein@Home | Restarting task PA0070_001A1_0_1 using einsteinbinary_BRP5 version 133 (BRP4cuda32nv301) in slot 4
6/9/2013 11:14:37 AM | | [work_fetch] work fetch start
6/9/2013 11:14:37 AM | | [work_fetch] choose_project() for NVIDIA: buffer_low: yes; sim_excluded_instances 0
6/9/2013 11:14:37 AM | | [work_fetch] no eligible project for NVIDIA
6/9/2013 11:14:37 AM | | [work_fetch] choose_project() for CPU: buffer_low: yes; sim_excluded_instances 0
6/9/2013 11:14:37 AM | | [work_fetch] no eligible project for CPU
6/9/2013 11:14:37 AM | | [work_fetch] ------- start work fetch state -------
6/9/2013 11:14:37 AM | | [work_fetch] target work buffer: 86400.00 + 8640.00 sec
6/9/2013 11:14:37 AM | | [work_fetch] --- project states ---
6/9/2013 11:14:37 AM | Einstein@Home | [work_fetch] REC 19418.013 prio -0.037074 can't req work: scheduler RPC backoff (backoff: 2728.03 sec)
6/9/2013 11:14:37 AM | | [work_fetch] --- state for CPU ---
6/9/2013 11:14:37 AM | | [work_fetch] shortfall 190080.00 nidle 2.00 saturated 0.00 busy 0.00
6/9/2013 11:14:37 AM | Einstein@Home | [work_fetch] fetch share 0.000 (blocked by prefs)
6/9/2013 11:14:37 AM | | [work_fetch] --- state for NVIDIA ---
6/9/2013 11:14:37 AM | | [work_fetch] shortfall 84459.43 nidle 0.00 saturated 52849.04 busy 0.00
6/9/2013 11:14:37 AM | Einstein@Home | [work_fetch] fetch share 0.000
6/9/2013 11:14:37 AM | | [work_fetch] ------- end work fetch state -------
6/9/2013 11:14:37 AM | | [work_fetch] No project chosen for work fetch
6/9/2013 11:15:38 AM | Einstein@Home | [checkpoint] result PA0070_00251_378_0 checkpointed
6/9/2013 11:15:38 AM | | [work_fetch] work fetch start
6/9/2013 11:15:38 AM | | [work_fetch] choose_project() for NVIDIA: buffer_low: yes; sim_excluded_instances 0
6/9/2013 11:15:38 AM | | [work_fetch] no eligible project for NVIDIA
6/9/2013 11:15:38 AM | | [work_fetch] choose_project() for CPU: buffer_low: yes; sim_excluded_instances 0
6/9/2013 11:15:38 AM | | [work_fetch] no eligible project for CPU
6/9/2013 11:15:38 AM | | [work_fetch] ------- start work fetch state -------
6/9/2013 11:15:38 AM | | [work_fetch] target work buffer: 86400.00 + 8640.00 sec
6/9/2013 11:15:38 AM | | [work_fetch] --- project states ---
6/9/2013 11:15:38 AM | Einstein@Home | [work_fetch] REC 19434.098 prio -0.037042 can't req work: scheduler RPC backoff (backoff: 2667.22 sec)
6/9/2013 11:15:38 AM | | [work_fetch] --- state for CPU ---
6/9/2013 11:15:38 AM | | [work_fetch] shortfall 190080.00 nidle 2.00 saturated 0.00 busy 0.00
6/9/2013 11:15:38 AM | Einstein@Home | [work_fetch] fetch share 0.000 (blocked by prefs)
6/9/2013 11:15:38 AM | | [work_fetch] --- state for NVIDIA ---
6/9/2013 11:15:38 AM | | [work_fetch] shortfall 84633.56 nidle 0.00 saturated 52846.87 busy 0.00
6/9/2013 11:15:38 AM | Einstein@Home | [work_fetch] fetch share 0.000
6/9/2013 11:15:38 AM | | [work_fetch] ------- end work fetch state -------
6/9/2013 11:15:38 AM | | [work_fetch] No project chosen for work fetch
...
...
6/9/2013 11:59:39 AM | | [work_fetch] work fetch start
6/9/2013 11:59:39 AM | | [work_fetch] choose_project() for NVIDIA: buffer_low: yes; sim_excluded_instances 0
6/9/2013 11:59:39 AM | | [work_fetch] no eligible project for NVIDIA
6/9/2013 11:59:39 AM | | [work_fetch] choose_project() for CPU: buffer_low: yes; sim_excluded_instances 0
6/9/2013 11:59:39 AM | | [work_fetch] no eligible project for CPU
6/9/2013 11:59:39 AM | | [work_fetch] ------- start work fetch state -------
6/9/2013 11:59:39 AM | | [work_fetch] target work buffer: 86400.00 + 8640.00 sec
6/9/2013 11:59:39 AM | | [work_fetch] --- project states ---
6/9/2013 11:59:39 AM | Einstein@Home | [work_fetch] REC 21720.530 prio -0.035701 can't req work: scheduler RPC backoff (backoff: 26.46 sec)
6/9/2013 11:59:39 AM | | [work_fetch] --- state for CPU ---
6/9/2013 11:59:39 AM | | [work_fetch] shortfall 190080.00 nidle 2.00 saturated 0.00 busy 0.00
6/9/2013 11:59:39 AM | Einstein@Home | [work_fetch] fetch share 0.000 (blocked by prefs)
6/9/2013 11:59:39 AM | | [work_fetch] --- state for NVIDIA ---
6/9/2013 11:59:39 AM | | [work_fetch] shortfall 91926.14 nidle 0.00 saturated 52837.83 busy 0.00
6/9/2013 11:59:39 AM | Einstein@Home | [work_fetch] fetch share 0.000
6/9/2013 11:59:39 AM | | [work_fetch] ------- end work fetch state -------
6/9/2013 11:59:39 AM | | [work_fetch] No project chosen for work fetch
6/9/2013 11:59:47 AM | Einstein@Home | [checkpoint] result PA0070_00251_378_0 checkpointed
6/9/2013 11:59:50 AM | Einstein@Home | [checkpoint] result PA0052_005A1_201_4 checkpointed
6/9/2013 11:59:50 AM | Einstein@Home | [checkpoint] result PA0070_001A1_0_1 checkpointed
6/9/2013 11:59:50 AM | Einstein@Home | [checkpoint] result PA0069_01081_336_0 checkpointed
6/9/2013 11:59:50 AM | Einstein@Home | [checkpoint] result PA0070_00221_285_1 checkpointed
6/9/2013 11:59:50 AM | Einstein@Home | [checkpoint] result PA0069_013A1_66_1 checkpointed
6/9/2013 12:00:06 PM | | [work_fetch] Request work fetch: Backoff ended for Einstein@Home
6/9/2013 12:00:09 PM | | [work_fetch] work fetch start
6/9/2013 12:00:09 PM | | [work_fetch] choose_project() for NVIDIA: buffer_low: yes; sim_excluded_instances 0
6/9/2013 12:00:09 PM | Einstein@Home | [work_fetch] set_request() for NVIDIA: ninst 3 nused_total 8.000000 nidle_now 0.000000 fetch share 1.000000 req_inst 0.000000 req_secs 92006.173071
6/9/2013 12:00:09 PM | | [work_fetch] ------- start work fetch state -------
6/9/2013 12:00:09 PM | | [work_fetch] target work buffer: 86400.00 + 8640.00 sec
6/9/2013 12:00:09 PM | | [work_fetch] --- project states ---
6/9/2013 12:00:09 PM | Einstein@Home | [work_fetch] REC 21720.530 prio -1.035686 can req work
6/9/2013 12:00:09 PM | | [work_fetch] --- state for CPU ---
6/9/2013 12:00:09 PM | | [work_fetch] shortfall 190080.00 nidle 2.00 saturated 0.00 busy 0.00
6/9/2013 12:00:09 PM | Einstein@Home | [work_fetch] fetch share 0.000 (blocked by prefs)
6/9/2013 12:00:09 PM | | [work_fetch] --- state for NVIDIA ---
6/9/2013 12:00:09 PM | | [work_fetch] shortfall 92006.17 nidle 0.00 saturated 52837.73 busy 0.00
6/9/2013 12:00:09 PM | Einstein@Home | [work_fetch] fetch share 1.000
6/9/2013 12:00:09 PM | | [work_fetch] ------- end work fetch state -------
6/9/2013 12:00:09 PM | Einstein@Home | [work_fetch] request: CPU (0.00 sec, 0.00 inst) NVIDIA (92006.17 sec, 0.00 inst)
6/9/2013 12:00:09 PM | Einstein@Home | Sending scheduler request: To report completed tasks.
6/9/2013 12:00:09 PM | Einstein@Home | Reporting 10 completed tasks
6/9/2013 12:00:09 PM | Einstein@Home | Requesting new tasks for NVIDIA
6/9/2013 12:00:11 PM | Einstein@Home | Scheduler request completed: got 0 new tasks
6/9/2013 12:00:11 PM | Einstein@Home | No work sent
6/9/2013 12:00:11 PM | Einstein@Home | (reached daily quota of 8 tasks)
6/9/2013 12:00:11 PM | Einstein@Home | Project has no jobs available
6/9/2013 12:00:11 PM | Einstein@Home | [work_fetch] backing off NVIDIA 768 sec
6/9/2013 12:00:11 PM | | [work_fetch] Request work fetch: RPC complete
6/9/2013 12:00:16 PM | | [work_fetch] work fetch start
6/9/2013 12:00:16 PM | | [work_fetch] choose_project() for NVIDIA: buffer_low: yes; sim_excluded_instances 0
6/9/2013 12:00:16 PM | | [work_fetch] no eligible project for NVIDIA
6/9/2013 12:00:16 PM | | [work_fetch] choose_project() for CPU: buffer_low: yes; sim_excluded_instances 0
6/9/2013 12:00:16 PM | | [work_fetch] no eligible project for CPU
6/9/2013 12:00:16 PM | | [work_fetch] ------- start work fetch state -------
6/9/2013 12:00:16 PM | | [work_fetch] target work buffer: 86400.00 + 8640.00 sec
6/9/2013 12:00:16 PM | | [work_fetch] --- project states ---
6/9/2013 12:00:16 PM | Einstein@Home | [work_fetch] REC 21720.530 prio -0.035682 can't req work: scheduler RPC backoff (backoff: 30775.91 sec)
6/9/2013 12:00:16 PM | | [work_fetch] --- state for CPU ---
6/9/2013 12:00:16 PM | | [work_fetch] shortfall 190080.00 nidle 2.00 saturated 0.00 busy 0.00
6/9/2013 12:00:16 PM | Einstein@Home | [work_fetch] fetch share 0.000 (blocked by prefs)
6/9/2013 12:00:16 PM | | [work_fetch] --- state for NVIDIA ---
6/9/2013 12:00:16 PM | | [work_fetch] shortfall 92025.08 nidle 0.00 saturated 52837.70 busy 0.00
6/9/2013 12:00:16 PM | Einstein@Home | [work_fetch] fetch share 0.000 (resource backoff: 762.49, inc 600.00)
6/9/2013 12:00:16 PM | | [work_fetch] ------- end work fetch state -------
6/9/2013 12:00:16 PM | | [work_fetch] No project chosen for work fetch

Claggy
Joined: 29 Dec 06
Posts: 560
Credit: 2,694,028
RAC: 1

RE: 2) clients contacting

Quote:
2) clients contacting the server more often than is reasonable with requests to fetch NO work at all. There have been reports at Seti@Home about this, and some rather strange workaround, quoting Eric Korpela on the boinc-dev list:
Quote:

[...] people have reported that it goes away when they select
"read config file", even if they don't have a config file.


I've now had this on two different hosts, both times after removing the setiathome app_info.xml and restarting BOINC. Both times they got stock work from SETI, then continued to ask for no work every 5 minutes; "read config file" stopped that.

Claggy

Jacob Klein
Joined: 22 Jun 11
Posts: 45
Credit: 114,028,485
RAC: 0

Regarding the "BOINC fetches

Regarding the "BOINC fetches infinite work" problem...

I've done 2 other tests: pulling the ethernet cord for a bit, and also pulling the ethernet cord for an extended period of time (long enough to leave 2 of the GPUs idle). In both cases, work fetch filled the buffers back up appropriately.

I can't think of a reason that BOINC work fetch would think the GPUs are idle (nidle=1), but it appears to me that that's what's happening (based on reading tbret's logs in this thread).

For those of you with the problem: are you using app_info.xml files, or app_config.xml files? Could you include the contents of those files in this thread?
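For reference - and purely as an illustrative sketch (the usage values are invented; the app name is taken from the task logs earlier in this thread) - an app_config.xml for the BRP5 app would look roughly like:

```xml
<!-- Illustrative sketch only; the values are examples, not recommendations. -->
<app_config>
  <app>
    <name>einsteinbinary_BRP5</name>
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>  <!-- run two tasks per GPU -->
      <cpu_usage>0.2</cpu_usage>  <!-- fraction of a CPU core reserved per task -->
    </gpu_versions>
  </app>
</app_config>
```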

Neil Newell
Joined: 20 Nov 12
Posts: 176
Credit: 169,699,457
RAC: 0

RE: For the guys with the

Quote:
For the guys with the problems, are you guys using app_info.xml files, or are you using app_config.xml files? Could you include the contents of those files within this thread?

Not using app_config.xml or anything unusual. I do use a cc_config.xml file, which varies between hosts but only to allow multiple GPUs and control CPU cores. Affected hosts run in different venues, with GPU utilisation of 0.5 or 1.0; some hosts are GPU only, others run both CPU and GPU work.



[cc_config.xml snippet lost - the forum software stripped the XML tags, leaving only the values 1, 2 and 1]
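The stripped snippet can't be recovered exactly, but as a hedged sketch - the tag names below are my guess; only the bare values 1, 2 and 1 survived in the post - a cc_config.xml that allows multiple GPUs and controls CPU cores would normally look something like:

```xml
<!-- Hypothetical reconstruction: the original tags were eaten by the
     forum software, so only the values are known to be right. -->
<cc_config>
  <options>
    <use_all_gpus>1</use_all_gpus>  <!-- let BOINC use every GPU, not just the best -->
    <ncpus>2</ncpus>                <!-- override the number of CPU cores BOINC uses -->
    <!-- a third option with value 1 was also present, but is unrecoverable -->
  </options>
</cc_config>
```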

The problem seems to have started around the time BRP5 was introduced. Could it be that the initial estimates of BRP5 runtime were low, leading the client to believe it needed far more tasks than it actually did? Then, as the estimates improved, the over-fetch problem would stop of its own accord - does that sound like a possible cause?
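A quick back-of-envelope supports this; the per-task numbers below are purely hypothetical, chosen only to show the shape of the effect:

```python
# If the server's initial BRP5 runtime estimate is 10x too low, the client
# fills its 1.1-day buffer with ~10x too many tasks. Numbers hypothetical.
target_s = 95040        # 1.0 + 0.1 day work buffer, per the logs
est_s = 2_000           # hypothetical initial per-task estimate (seconds)
real_s = 20_000         # hypothetical actual runtime (seconds)

tasks = target_s // est_s                  # tasks fetched to "fill" the buffer
backlog_days = tasks * real_s / 86400      # real work actually queued: ~11 days
```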

The "once per minute" issue appears to be something separate (it may have made the over-fetch happen faster than it would have, but seems otherwise unrelated).
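The repeated-request pattern in the logs can be simulated in a few lines. This is an assumed failure mode, not traced BOINC behaviour: if each request is sized against a buffered-work figure that never reflects the tasks granted in earlier cycles, the cache grows every minute until an external limit (here, the 384-task daily quota from the log snippet above) stops it:

```python
# Illustrative simulation (assumed behaviour, not traced BOINC code) of a
# once-per-minute fetch loop that ignores its own cache growth.

def simulate_fetch_loop(buffer_target, stale_buffered, tasks_per_grant,
                        daily_quota):
    cached_tasks = 0
    requests = 0
    while cached_tasks < daily_quota:
        # The hypothesised bug: shortfall is computed from a stale buffered
        # figure, which never reflects work granted in earlier cycles.
        shortfall = buffer_target - stale_buffered
        if shortfall <= 0:
            break
        cached_tasks += tasks_per_grant
        requests += 1
    return requests, cached_tasks

requests, cached = simulate_fetch_loop(
    buffer_target=24.0, stale_buffered=0.0,
    tasks_per_grant=4, daily_quota=384)
print(requests, cached)  # 96 384 -> stopped only by the daily quota
```

Requesting ~4 tasks per cycle until the 384-task quota trips matches the log excerpt at the top of this thread, which ends with "(reached daily quota of 384 tasks)".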

tbret
tbret
Joined: 12 Mar 05
Posts: 2,115
Credit: 4,811,565,356
RAC: 184,946

RE: I'd really like to get

Quote:

I'd really like to get to whatever the underlying cause of this problem is.

Edit - it would be helpful if you would tell us the timings of your network blackout, either in local time and timezone, or in UTC, so that we can try and compare them with the server log evidence.

Oh, crud Richard, I just, tonight, changed it before I read your post. I'd like to help you, too.

Before I give you the details, let me go ahead and tell you that at the end of this note there is evidence that the power-cut has nothing to do with it.

It was something like 2:00am until 3:30am CDT which is what? UTC -5 hours?

The machines are in a home, and connected to a wired/wireless router connected to a cablemodem via the WAN port. The router is a cheap Netgear. Two of the affected computers are wireless, the rest are on CAT6 or 5e, two are on a switch connected to the router.

The reason for the power-down is that the cable service would flake out every few days, and with the 100 WU limit at SETI@Home I can't build a cache big enough for the computers to crunch on until I can get to that location to manually reset the cablemodem. So I decided to stick it on a timer and force it to reset once a day.

Well, the router didn't like it when the cablemodem would come back up; it would have to be reset as well. So I put it on a timer too. It goes off and comes back on after the cablemodem.

I admit that's primitive, but it works.

So, no, there is no separate DNS server and I have not specified one. I would imagine that the gateway address is the DNS address.

I think I understand where you are going with the "missing GPU" thing, but as much as that seems like a reasonable avenue, I have not noticed drivers crashing, except once, yesterday, and I don't think that machine is currently crunching Einstein.

But I may be able to help more than that.

One moment please!

Yes, I surely can help...

The power interruption is not it.

Computer 5891285 is in my office, some 14 miles away from the others. It is connected to a DSL router via CAT 5 with no additional switch or router, alongside computer #6109868, also connected via CAT 5.

5891285 went into a fetch loop, #6109868 did not.

The other four computers that I did report over-fetching are at the house on the cablemodem with the timers, HOWEVER there were three additional computers at that location which did not go into this loop.

If one counter-example falsifies a theory, I'd say this one's busted. Not that three machines didn't loop (that doesn't prove anything), but that one machine did loop under completely different circumstances, where the router never lost power.

Also, even though the nightly power interruption to the cablemodem and router continue, I have not had a duplicate of the problem.

...damn, of course not, I set Einstein to NNT to clear as much of the cache as would clear before it started timing-out... damn, this can get convoluted.

I will attempt to recreate the problem, but it has to wait until the wee hours of Wednesday morning; at least I can set the log flags before the planned power outage at about 1:00am Wednesday.

Lord help me remember that little item on my to-do list.

If it works, it will very likely happen while I sleep and we'll burn another big batch of work.

tbret
tbret
Joined: 12 Mar 05
Posts: 2,115
Credit: 4,811,565,356
RAC: 184,946

RE: For the guys with the

Quote:

For the guys with the problems, are you guys using app_info.xml files, or are you using app_config.xml files? Could you include the contents of those files within this thread?

Guys, I just didn't know you'd want any information from my computers and I did not preserve these files on all of them. SETI@Home changed versions and most of these files got blown-up or I blew them up.

I'll see what I can do tomorrow. I'm sorry I've not kept current. I didn't realize...

Beyond
Beyond
Joined: 28 Feb 05
Posts: 117
Credit: 1,349,434,236
RAC: 5,675,414

RE: can you PLEASE check to

Quote:
can you PLEASE check to see if you have any Windows Update installs around the same time as the issues started?)


None of my machines have Windows Update auto-installs enabled. As stated above, I have not seen this issue since upgrading to 7.1.3 (nor in 7.1.10).

tbret
tbret
Joined: 12 Mar 05
Posts: 2,115
Credit: 4,811,565,356
RAC: 184,946

It's just an *amazing* thing,

It's just an *amazing* thing, isn't it?

I have to go get a crown stuck on a tooth in an hour. The round-trip to the remote location is more than an hour.

The remote location is currently *down*. I have no idea why. Is the power off? Did the cablemodem service do what it does that made me put a timer on it in the first place? Has all of the equipment been stolen, the house emptied, and / or has it burned to the ground?

I have a meeting tonight after being crowned. It is highly unlikely that I will be able to get out to the house to fix this situation until tomorrow; but since I was out there yesterday and my vehicle burns fuel like it's going-out-of-style (a single trip out there is a $10.00 proposition) I really don't need to go out there for several more days.

...but it will probably bug me if I don't...

I just wanted to report that I'm not ignoring the situation and I do want to help. Life just keeps erecting obstacles. If I can work the round-trip into my schedule, I'll try.
