Hi
Starting yesterday, the scheduler on just one of my machines has been requesting a lot of WUs. I checked it yesterday evening and there were approx. 30 tasks, and everything looked stable.
Today I checked it remotely at noon and there were 370 tasks. I stopped that horror by setting the computer to "no new work" remotely via BOINC Manager.
My first guess was that earlier WUs had failed and new ones were being requested for that reason. But no, all of these WUs are still in the queue.
My network settings say: Maintain enough work for an additional 0.25 days. So this is probably not the reason either.
The hostid of the computer in question is 2500292.
Any ideas?
Scheduler went nuts
Perhaps it's the current lack of ABP tasks that causes this, as Gary Roberts has explained in this thread.
Regards,
Gundolf
Computers aren't everything in life. (Just kidding)
Maybe, I'll check this.
This same thing happened to me and, as the other thread says, I was out of GPU units. It stopped many units in mid-crunch to do others. The deadlines are December 13, which is easy for this rig, but a slew of them are running in high priority at the moment. About 20 or so were crunched midway, only to be stopped so that the high-priority ones could run. No problem on my end, as I will just not crunch with my GPUs.
Steve
Crunching as member of The GPU Users Group team.
RE: Perhaps it's the
I have the same problem. I check my computer remotely, and today I found more than 400 WUs. Even with my i7 running at full capacity, some of them will run out of time.
Regarding Gary Roberts' thread: he suggests setting the cache for new tasks to at most 1 day. Well, that's my setting, and despite that I have this huge number of WUs. So this is not the solution.
Regards
Bernhard
RE: So, this is not the
That was only one of the things Gary suggested. Did you read further in the same thread to Richard Haselgrove's post?
And beyond that, this is a problem of the BOINC client and/or server code and not of the Einstein@home project.
Regards,
Gundolf
Computers aren't everything in life. (Just kidding)
From the various posts in this thread, the only example I've been able to track down is Hartmut Geissbauer's host 2500292 - a 16-core Apple with a comparatively low-powered CUDA card.
This host has gone through extended periods of being allocated single Global Correlations S5 HF tasks at 1-minute intervals, and occasionally being allocated a sporadic ABP2 (ABP2cuda23) task - which seems to interrupt the S5GC1HF sequence. I can only interpret that pattern as Hartmut's host sending repeated work requests for CUDA work, but being allocated CPU work instead.
Hartmut is running BOINC v6.10.58, so he should be requesting CUDA/CPU work separately, like this:
So, either the client is wrongly requesting CPU work in spite of having so much already cached - I find that unlikely with v6.10.58, but he would have to enable either sched_op or work_fetch debug logging to be certain - or the Einstein server is wrongly configured to issue CPU work when CUDA work is requested but not available.
Does anyone know a way to access the server logs for one of those S5GC1HF allocation events?
RE: Does anyone know a way
http://einstein.phys.uwm.edu/host_sched_logs/1112/1112358
Looks to be a CPU work request.
RE: RE: Does anyone know
Yes, but host 1112358 doesn't seem to have suffered a 'scheduler went nuts' event - requesting new work roughly once an hour for an 8-core seems entirely normal and reasonable.
What I was interested in catching was one of those 'new task every minute' events as visible in 2500292's task list - those are the ones which I suspect to be CUDA requests.
Although I understand what happened with running out of GPU units, it set off a strange chain of events. Normally, even though BOINC is set to switch WUs every 60 minutes, it never did so until it ran out of GPU units and ordered up everything available. It has now started 5 or 6 batches of work units, only to jump to something else in high priority. What draws my attention is that it is crunching the WUs due December 13 before the ones with earlier deadlines. It has left a trail of unfinished WUs. I tried to set BOINC to switch WUs every 0 minutes, but that reverted to 60. Then I set it to 3600 minutes, which is more than enough time to complete one once it's started.
Steve
Crunching as member of The GPU Users Group team.
RE: ... It has now started
I've seen this sort of behaviour reported previously and I've even seen a version of this madness (abandoning partly finished tasks which are closer to deadline in favour of ones that are further away from deadline) on a couple of my hosts under unusual circumstances. The behaviour isn't to do with the task switch interval so I think you should leave that at default.
The problem seems to be caused by buggy BOINC behaviour when 'high priority' (HP) mode is invoked. In my cases that mode gets invoked on Windows machines (yes I have some of those even though I mainly run Linux) running XP with no keyboard or mouse attached. If run undisturbed, Windows will eventually end up spending so much time looking for the mouse (so I presume) that a task which might take 10 hours to crunch suddenly ends up taking something like 40-80 hours. This doesn't happen immediately - it usually takes an incubation period of a couple of weeks of quite normal behaviour before suddenly for no apparent reason, a task will slow down to a crawl. As soon as such a task finishes, the duration correction factor (DCF) gets wrecked and the whole cache of work suddenly gets thrown into high priority mode. Tasks get started and then later get preempted by other tasks exactly along the lines you are observing.
I've solved my problem by attaching a mouse and keyboard permanently to each Windows host. The problem will still happen (not as frequently) unless I go around occasionally and actually move the mouse and toggle the numlock key. If I do that at least once every couple of days, I never get the drastic slowdown. Also it never happens on identical hardware running Linux. I have many Linux hosts that have never had a keyboard/mouse for years and no drastic task slowdowns.
To cure your problem and get your tasks crunched in a reasonable and orderly fashion, you need to get out of HP mode. If you have excess tasks due to the 'running out of GPU work' issue, you may need to abort enough tasks until BOINC is satisfied that it could cope with what is left. BOINC will be a bit conservative so that might involve aborting more than you really need to. If you want to avoid that, you can get rid of HP mode by selectively suspending tasks, starting with the most recently received (longest time to deadline). Eventually you will have suspended sufficient so that HP mode is dropped. BOINC should then be able to process the partly started tasks in a more rational manner. You can further encourage this by suspending partly crunched tasks that shouldn't have been started in the first place. Each day you could enable tasks in 'shortest time to deadline' order so that you keep up the supply without triggering HP mode again. I've done this myself and it works fine. In your case, if you really do have more tasks than can be done in the available time, you could start aborting 'close to deadline but unstarted' tasks to ease the pressure.
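For anyone who wants to plan the suspensions before clicking through the task list, the 'suspend from the longest deadline backwards' idea can be sketched roughly in Python. This is only an illustration under simplifying assumptions: the `Task` fields, the `fits` check and the example numbers are all made up here, and the feasibility test is a crude stand-in, not the logic BOINC itself uses to decide on HP mode.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str              # hypothetical label, not a real task name
    hours_left: float      # estimated remaining crunch time
    deadline_hours: float  # hours from now until the deadline


def fits(tasks, cores):
    """Rough feasibility test (an assumption, not BOINC's own check):
    walking the tasks in deadline order, the cumulative work shared
    across all cores must finish before each task's deadline."""
    total = 0.0
    for task in sorted(tasks, key=lambda t: t.deadline_hours):
        total += task.hours_left
        if total / cores > task.deadline_hours:
            return False
    return True


def tasks_to_suspend(tasks, cores):
    """Repeatedly pick the task with the longest time to deadline for
    suspension until the remaining set passes the rough test."""
    active = sorted(tasks, key=lambda t: t.deadline_hours)
    suspended = []
    while active and not fits(active, cores):
        suspended.append(active.pop())  # last item has the latest deadline
    return suspended


# Made-up example: four 8-hour tasks on a 2-core host.
cache = [
    Task("t1", 8, 10),
    Task("t2", 8, 10),
    Task("t3", 8, 15),
    Task("t4", 8, 14),
]
suspend = [t.name for t in tasks_to_suspend(cache, cores=2)]
```

Re-running the same function each day on the still-suspended tasks plus whatever is left would give the 'shortest time to deadline' order in which to resume them.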
EDIT: I've just looked through your entire cache of tasks and you seem to have about 1000 'in progress' ones. My 'back of the envelope' calculations indicate that you might be able to complete around 700 or so before hitting the deadline. It seems inevitable that you will need to abort more than 200 to 300 tasks at some stage. You could probably save a few of these by making sure all 12 cores are on GW tasks as soon as possible. You don't have many ABP2 tasks left and you should avoid getting any more by setting NNT for the next week or two.
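A 'back of the envelope' estimate of this kind can be sketched as below. Only the 12 cores and the roughly 1000 'in progress' tasks come from the post above; the days remaining and the hours per task are placeholder assumptions, so the result is not meant to reproduce the figures quoted. The sketch also ignores GPU tasks and assumes every core crunches continuously.

```python
def completable_tasks(cores, hours_until_deadline, hours_per_task):
    """Upper bound on the tasks finished before the deadline if every core
    crunches non-stop at a constant per-task runtime."""
    return int(cores * hours_until_deadline // hours_per_task)


def tasks_to_abort(in_progress, completable):
    """Tasks that cannot be finished in time under the estimate above."""
    return max(0, in_progress - completable)


CORES = 12            # from the post
IN_PROGRESS = 1000    # from the post (approximate)
DAYS_LEFT = 14        # assumption, for illustration only
HOURS_PER_TASK = 6.0  # assumption, for illustration only

doable = completable_tasks(CORES, DAYS_LEFT * 24, HOURS_PER_TASK)
excess = tasks_to_abort(IN_PROGRESS, doable)
```

Plugging in the real per-task runtime from the host's task list gives a better idea of how many tasks to abort up front rather than letting them miss the deadline.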
Cheers,
Gary.