Scheduler went nuts

Hartmut Geissbauer
Joined: 5 Jan 06
Posts: 31
Credit: 152,941,307
RAC: 0
Topic 195489

Hi
Starting yesterday, the scheduler on only one of my machines has been requesting a lot of WUs. When I checked yesterday evening, there were approximately 30 tasks, which is a stable level.
Today I checked it remotely at noon and there were 370 tasks. I stopped that horror by setting the computer to "no new work" remotely via BOINC Manager.
My first thought was that earlier WUs had failed and newer ones were being requested to replace them. But no, all of these WUs are still in the queue.
My network settings say: "Maintain enough work for an additional 0.25 days", so that can't be the reason.
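A rough sanity check shows how far off 370 tasks is from what a 0.25-day buffer should hold. The numbers below are assumptions pieced together from later posts in this thread (the 16 cores Richard mentions and the ~21,310 s estimated task duration in the scheduler log Jord posts), not settings confirmed for this host:

```python
import math

# Hypothetical figures: 16 CPU cores, 0.25-day work buffer, and the
# ~21,310 s estimated task duration seen in the scheduler log.
cores = 16
buffer_days = 0.25
est_task_seconds = 21310

# Tasks needed to keep all cores busy for the buffer period.
expected_tasks = math.ceil(cores * buffer_days * 86400 / est_task_seconds)
print(expected_tasks)  # about 17 -- nowhere near the 370 tasks observed
```

Even under generous assumptions, the buffer setting alone can't explain a 370-task queue.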
The hostid of the computer in question is 2500292.

Any ideas?

Gundolf Jahn
Joined: 1 Mar 05
Posts: 1,079
Credit: 341,280
RAC: 0

Scheduler went nuts

Perhaps it's the current lack of ABP tasks that causes this, as Gary Roberts has explained in this thread.

Regards,
Gundolf

Computers aren't everything in life. (Just a little joke)

Hartmut Geissbauer
Joined: 5 Jan 06
Posts: 31
Credit: 152,941,307
RAC: 0

Maybe, I'll check this.

Maybe, I'll check this.

SciManStev
Joined: 27 Aug 05
Posts: 148
Credit: 15,031,642
RAC: 0

This same thing happened to

This same thing happened to me, and as the other thread says, I was out of GPU units. It stopped many units in mid-crunch to do others. The deadlines are December 13, which is easy for this rig, but a slew of them are running in high priority at the moment. About 20 or so have crunched midway, only to stop so it could crunch the high-priority ones. No problem on my end, as I will just not crunch with my GPUs.

Steve

Crunching as member of The GPU Users Group team.

Rechenkuenstler
Joined: 22 Aug 10
Posts: 138
Credit: 101,411,803
RAC: 0

RE: Perhaps it's the

Quote:

Perhaps it's the current lack of ABP tasks that causes this, as Gary Roberts has explained in this thread.

Regards,
Gundolf

I have the same problem. I checked my computer remotely and today I found more than 400 WUs. Even with my i7 running at full capacity, some will run out of time.

Regarding Gary Roberts' thread: he suggests setting the cache for new tasks to a maximum of 1 day. Well, that's my setting, and despite that I have this huge number of WUs. So this is not the solution.

Regards

Bernhard

Gundolf Jahn
Joined: 1 Mar 05
Posts: 1,079
Credit: 341,280
RAC: 0

RE: So, this is not the

Quote:
So, this is not the solution.


That was only one of the things Gary suggested. Did you read further in the same thread to Richard Haselgrove's post?

And beyond that, this is a problem of the BOINC client and/or server code and not of the Einstein@home project.

Regards,
Gundolf

Computers aren't everything in life. (Just a little joke)

Richard Haselgrove
Joined: 10 Dec 05
Posts: 1,935
Credit: 270,610,700
RAC: 258,192

From the various posts in

From the various posts in this thread, the only example I've been able to track down is Hartmut Geissbauer's host 2500292 - a 16-core Apple with a comparatively low-powered CUDA card.

This host has gone through extended periods of being allocated single Global Correlations S5 HF tasks at 1-minute intervals, and occasionally being allocated a sporadic ABP2 (ABP2cuda23) task - which seems to interrupt the S5GC1HF sequence. I can only interpret that pattern as Hartmut's host sending repeated work requests for CUDA work, but being allocated CPU work instead.

Hartmut is running BOINC v6.10.58, so he should be requesting CUDA/CPU work separately, like this:

SETI@home	01/12/2010 10:06:37	Sending scheduler request: Requested by user.
SETI@home	01/12/2010 10:06:37	Requesting new tasks for NVIDIA GPU
SETI@home	01/12/2010 10:06:37	[sched_op] CPU work request: 0.00 seconds; 0.00 CPUs
SETI@home	01/12/2010 10:06:37	[sched_op] NVIDIA GPU work request: 19180.25 seconds; 1.00 GPUs
SETI@home	01/12/2010 10:06:43	Scheduler request completed: got 5 new tasks


So, either the client is wrongly requesting CPU work in spite of having so much already cached - I find that unlikely with v6.10.58, but he would have to enable either sched_op or work_fetch debug logging to be certain - or the Einstein server is wrongly configured to issue CPU work when CUDA work is requested but not available.
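For reference, both of those debug flags are enabled in cc_config.xml in the BOINC data directory. This is a minimal sketch, showing only the two flags Richard names:

```xml
<cc_config>
  <log_flags>
    <!-- log details of each scheduler request and reply -->
    <sched_op_debug>1</sched_op_debug>
    <!-- log the client's work-fetch decisions per resource -->
    <work_fetch_debug>1</work_fetch_debug>
  </log_flags>
</cc_config>
```

The client picks the file up on restart, or via the manager's "read config file" option, and the extra detail then appears in the event log.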

Does anyone know a way to access the server logs for one of those S5GC1HF allocation events?

Jord
Joined: 26 Jan 05
Posts: 2,952
Credit: 5,684,975
RAC: 430

RE: Does anyone know a way

Quote:
Does anyone know a way to access the server logs for one of those S5GC1HF allocation events?


http://einstein.phys.uwm.edu/host_sched_logs/1112/1112358

Quote:
2010-12-01 11:35:20.2743 [PID=29862] Request: [USER#xxxxx] [HOST#1112358] [IP xxx.xxx.xxx.111] client 6.10.58
2010-12-01 11:35:20.8425 [PID=29862] [send] effective_ncpus 8 max_jobs_on_host_cpu 999999 max_jobs_on_host 999999
2010-12-01 11:35:20.8425 [PID=29862] [send] effective_ngpus 1 max_jobs_on_host_gpu 999999
2010-12-01 11:35:20.8425 [PID=29862] [send] Not using matchmaker scheduling; Not using EDF sim
2010-12-01 11:35:20.8425 [PID=29862] [send] CPU: req 41.65 sec, 0.00 instances; est delay 0.00
2010-12-01 11:35:20.8426 [PID=29862] [send] CUDA: req 0.00 sec, 0.00 instances; est delay 0.00
2010-12-01 11:35:20.8426 [PID=29862] [send] work_req_seconds: 41.65 secs
2010-12-01 11:35:20.8426 [PID=29862] [send] available disk 98.74 GB, work_buf_min 0
2010-12-01 11:35:20.8426 [PID=29862] [send] active_frac 0.999923 on_frac 0.997590 DCF 1.030007
2010-12-01 11:35:21.4812 [PID=29862] [send] [HOST#1112358] is reliable
2010-12-01 11:35:21.4812 [PID=29862] [send] set_trust: random choice for error rate 0.002104: yes
2010-12-01 11:35:21.4988 [PID=29862] [version] Best version of app einstein_S5GC1HF is ID 233 (3.14 GFLOPS)
2010-12-01 11:35:21.5277 [PID=29862] [debug] Sorted list of URLs follows [host timezone: UTC+3600]
2010-12-01 11:35:21.5277 [PID=29862] [debug] zone=+03600 url=http://einstein.aei.mpg.de
2010-12-01 11:35:21.5277 [PID=29862] [debug] zone=+03600 url=http://einstein-mirror.aei.uni-hannover.de/EatH
2010-12-01 11:35:21.5277 [PID=29862] [debug] zone=-21600 url=http://einstein-dl2.phys.uwm.edu
2010-12-01 11:35:21.5277 [PID=29862] [debug] zone=-21600 url=http://einstein-dl3.phys.uwm.edu
2010-12-01 11:35:21.5277 [PID=29862] [debug] zone=-21600 url=http://einstein-dl4.phys.uwm.edu
2010-12-01 11:35:21.5277 [PID=29862] [debug] zone=-28800 url=http://einstein.ligo.caltech.edu
2010-12-01 11:35:21.5281 [PID=29862] [send] [HOST#1112358] Sending app_version einstein_S5GC1HF 6 504 ; 3.14 GFLOPS
2010-12-01 11:35:21.5294 [PID=29862] [send] est. duration for WU 88730296: unscaled 20637.32 scaled 21309.58
2010-12-01 11:35:21.5294 [PID=29862] [HOST#1112358] Sending [RESULT#209282644 h1_1357.20_S5R4__1058_S5GC1HFa_0] (est. dur. 21309.58 seconds)
2010-12-01 11:35:21.7165 [PID=29862] [send] don't need more work
2010-12-01 11:35:21.7165 [PID=29862] [send] don't need more work
2010-12-01 11:35:21.7165 [PID=29862] [send] don't need more work
2010-12-01 11:35:21.7165 [PID=29862] [send] don't need more work
2010-12-01 11:35:21.8556 [PID=29862] Sending reply to [HOST#1112358]: 1 results, delay req 60.00
2010-12-01 11:35:21.8560 [PID=29862] Scheduler ran 1.588 seconds


Looks to be a CPU work request.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 1,935
Credit: 270,610,700
RAC: 258,192

RE: RE: Does anyone know

Quote:
Quote:
Does anyone know a way to access the server logs for one of those S5GC1HF allocation events?

http://einstein.phys.uwm.edu/host_sched_logs/1112/1112358

Looks to be a CPU work request.


Yes, but host 1112358 doesn't seem to have suffered a 'scheduler went nuts' event - requesting new work roughly once an hour for an 8-core seems entirely normal and reasonable.

What I was interested in catching was one of those 'new task every minute' events as visible in 2500292's task list - those are the ones which I suspect to be CUDA requests.

SciManStev
Joined: 27 Aug 05
Posts: 148
Credit: 15,031,642
RAC: 0

Although I understand what

Although I understand what happened with running out of GPU units, it set off a strange chain of events. Normally, even though BOINC is set to switch WUs every 60 minutes, it never did so until it ran out of GPU units and ordered up everything available. It has now started 5 or 6 batches of work units, only to jump to something else in high priority. What draws my attention is that it is crunching the WUs due December 13 before doing the ones with earlier deadlines. It has left a trail of unfinished WUs. I tried to set BOINC to switch WUs every 0 minutes, which reverted to 60. Then I set it to 3600 minutes, which is more than enough time to complete one once it's started.

Steve

Crunching as member of The GPU Users Group team.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,210
Credit: 43,483,120,633
RAC: 44,258,837

RE: ... It has now started

Quote:
... It has now started 5 or 6 batches or work units, only to jump to something else in high priority. The thing that draws my attention is that it is crunching the wu's for December 13 before doing the ones with earlier dates. It has left a trail of unfinished wu's...


I've seen this sort of behaviour reported previously and I've even seen a version of this madness (abandoning partly finished tasks which are closer to deadline in favour of ones that are further away from deadline) on a couple of my hosts under unusual circumstances. The behaviour isn't to do with the task switch interval so I think you should leave that at default.

The problem seems to be caused by buggy BOINC behaviour when 'high priority' (HP) mode is invoked. In my cases that mode gets invoked on Windows machines (yes I have some of those even though I mainly run Linux) running XP with no keyboard or mouse attached. If run undisturbed, Windows will eventually end up spending so much time looking for the mouse (so I presume) that a task which might take 10 hours to crunch suddenly ends up taking something like 40-80 hours. This doesn't happen immediately - it usually takes an incubation period of a couple of weeks of quite normal behaviour before suddenly for no apparent reason, a task will slow down to a crawl. As soon as such a task finishes, the duration correction factor (DCF) gets wrecked and the whole cache of work suddenly gets thrown into high priority mode. Tasks get started and then later get preempted by other tasks exactly along the lines you are observing.

I've solved my problem by attaching a mouse and keyboard permanently to each Windows host. The problem will still happen (not as frequently) unless I go around occasionally and actually move the mouse and toggle the numlock key. If I do that at least once every couple of days, I never get the drastic slowdown. Also it never happens on identical hardware running Linux. I have many Linux hosts that have never had a keyboard/mouse for years and no drastic task slowdowns.

To cure your problem and get your tasks crunched in a reasonable and orderly fashion, you need to get out of HP mode. If you have excess tasks due to the 'running out of GPU work' issue, you may need to abort enough tasks until BOINC is satisfied that it could cope with what is left. BOINC will be a bit conservative so that might involve aborting more than you really need to. If you want to avoid that, you can get rid of HP mode by selectively suspending tasks, starting with the most recently received (longest time to deadline). Eventually you will have suspended sufficient so that HP mode is dropped. BOINC should then be able to process the partly started tasks in a more rational manner. You can further encourage this by suspending partly crunched tasks that shouldn't have been started in the first place. Each day you could enable tasks in 'shortest time to deadline' order so that you keep up the supply without triggering HP mode again. I've done this myself and it works fine. In your case, if you really do have more tasks than can be done in the available time, you could start aborting 'close to deadline but unstarted' tasks to ease the pressure.

EDIT: I've just looked through your entire cache of tasks and you seem to have about 1000 'in progress' ones. My 'back of the envelope' calculation indicates that you might be able to complete around 700 or so before hitting the deadline. It seems inevitable that you will need to abort some 200 to 300 tasks at some stage. You could probably save a few of these by making sure all 12 cores are on GW tasks as soon as possible. You don't have many ABP2 tasks left, and you should avoid getting any more by setting NNT ('no new tasks') for the next week or two.
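The kind of calculation described above can be sketched as follows. The per-task time and days-to-deadline are illustrative assumptions, not figures from the thread:

```python
# Back-of-the-envelope capacity estimate, as a sketch.
# Hypothetical inputs: 12 cores crunching flat out, roughly 12 days
# until the December 13 deadline, and ~5 hours per GW task.
cores = 12
days_to_deadline = 12
hours_per_task = 5

# Total available core-hours divided by the cost of one task.
completable = int(cores * days_to_deadline * 24 / hours_per_task)
print(completable)  # 691 -- close to the 'around 700' estimate
```

With ~1000 tasks in progress, the shortfall of 200-300 tasks follows directly from this arithmetic.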

Cheers,
Gary.
