Scheduler went nuts

SciManStev
Joined: 27 Aug 05
Posts: 154
Credit: 15562799
RAC: 0

Thank you for the excellent reply. I haven't checked it this morning, as I'm at work. Previously I had set my cache for one day, and of course all of a sudden it filled up to whatever the Einstein limit is. I set it to No New Tasks, and by my reckoning it should complete everything in a couple of days. The screenshot I posted was just a small block of what is there, so many of my work units are more than half crunched and will clear quickly. I think the earliest deadline I saw was December 9th, which is easy. If I have to abort a few then I will. When I enabled hyperthreading, it did add about an hour and 15 minutes to each crunch time, but it more than made up for it with the number of WUs crunched.

Steve

Crunching as member of The GPU Users Group team.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2142
Credit: 2774331459
RAC: 851925

Quote:

Thank you for the excellent reply. I haven't checked it this morning, as I'm at work. Previously I had set my cache for one day, and of course all of a sudden it filled up to whatever the Einstein limit is. I set it to No New Tasks, and by my reckoning it should complete everything in a couple of days. The screenshot I posted was just a small block of what is there, so many of my work units are more than half crunched and will clear quickly. I think the earliest deadline I saw was December 9th, which is easy. If I have to abort a few then I will. When I enabled hyperthreading, it did add about an hour and 15 minutes to each crunch time, but it more than made up for it with the number of WUs crunched.

Steve


When you have a moment in front of the screen, would you mind giving it another try with work fetch enabled, and one or other level of debug logging on the client? If the cycle of 'request NVidia, receive S5GC1HF' starts again, try and grab a matching server log (link in right-most column of your host summary list) for comparison.

This cross-allocation shouldn't be happening, but we need evidence to nip it in the bud.

SciManStev
Joined: 27 Aug 05
Posts: 154
Credit: 15562799
RAC: 0

Quote:
Quote:

Thank you for the excellent reply. I haven't checked it this morning, as I'm at work. Previously I had set my cache for one day, and of course all of a sudden it filled up to whatever the Einstein limit is. I set it to No New Tasks, and by my reckoning it should complete everything in a couple of days. The screenshot I posted was just a small block of what is there, so many of my work units are more than half crunched and will clear quickly. I think the earliest deadline I saw was December 9th, which is easy. If I have to abort a few then I will. When I enabled hyperthreading, it did add about an hour and 15 minutes to each crunch time, but it more than made up for it with the number of WUs crunched.

Steve


When you have a moment in front of the screen, would you mind giving it another try with work fetch enabled, and one or other level of debug logging on the client? If the cycle of 'request NVidia, receive S5GC1HF' starts again, try and grab a matching server log (link in right-most column of your host summary list) for comparison.

This cross-allocation shouldn't be happening, but we need evidence to nip it in the bud.

Will do Richard, once I get home this evening.

Steve

Crunching as member of The GPU Users Group team.

Mayor of Bree
Joined: 23 Feb 09
Posts: 1
Credit: 581148
RAC: 0

Hi,

I don't know whether this is a project problem or a BOINC one, or perhaps a combination of both - either way it's a problem and I'd welcome a fix or suggestions about how to prevent it. Einstein@home over the past three days has begun to send me great rafts of WUs, each with an estimated time of 11+ hours, and most due on a short completion schedule. My settings (both local and on the project) are to keep two extra days of work, but the project has decided I need up to 120 extra days of work.

Of course this has led to "high priority" situations, and has led to all my other projects being unable to receive a share of my CPU resources. I've taken to aborting all but the few Einstein projects I know I'll be able to complete, and now I've set the project to accept no new tasks.

Any suggestions?

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5779100
RAC: 0

Quote:
What I was interested in catching was one of those 'new task every minute' events as visible in 2500292's task list - those are the ones which I suspect to be CUDA requests.


His latest log (at the time I caught it) was a work request from that host for its CUDA device:

2010-12-02 19:42:37.3358 [PID=6883]   Request: [USER#xxxxx] [HOST#2500292] [IP xxx.xxx.xxx.111] client 6.10.58
2010-12-02 19:42:37.3391 [PID=6883 ]    [send] effective_ncpus 16 max_jobs_on_host_cpu 999999 max_jobs_on_host 999999
2010-12-02 19:42:37.3391 [PID=6883 ]    [send] effective_ngpus 1 max_jobs_on_host_gpu 999999
2010-12-02 19:42:37.3391 [PID=6883 ]    [send] Not using matchmaker scheduling; Not using EDF sim
2010-12-02 19:42:37.3391 [PID=6883 ]    [send] CPU: req 0.00 sec, 0.00 instances; est delay 0.00
2010-12-02 19:42:37.3391 [PID=6883 ]    [send] CUDA: req 21600.86 sec, 1.00 instances; est delay 0.00
2010-12-02 19:42:37.3392 [PID=6883 ]    [send] work_req_seconds: 0.00 secs
2010-12-02 19:42:37.3392 [PID=6883 ]    [send] available disk 98.80 GB, work_buf_min 0
2010-12-02 19:42:37.3392 [PID=6883 ]    [send] active_frac 0.999941 on_frac 0.999675 DCF 1.266993
2010-12-02 19:42:37.3710 [PID=6883 ]    [send] [HOST#2500292] is reliable
2010-12-02 19:42:37.3712 [PID=6883 ]    [send] set_trust: random choice for error rate 0.000010: yes
2010-12-02 19:42:37.5376 [PID=6883 ]    [version] Don't need CPU jobs, skipping version 504 for einstein_S5GC1HF ()
2010-12-02 19:42:37.5377 [PID=6883 ]    [version] Don't need CPU jobs, skipping version 704 for einstein_S5GC1HF ()
2010-12-02 19:42:37.5377 [PID=6883 ]    [version] no app version available: APP#14 (einstein_S5GC1HF) PLATFORM#6 (i686-apple-darwin) min_version 0
2010-12-02 19:42:37.5377 [PID=6883 ]    [version] no app version available: APP#14 (einstein_S5GC1HF) PLATFORM#3 (powerpc-apple-darwin) min_version 0
2010-12-02 19:42:37.5911 [PID=6883 ] [debug]   [HOST#2500292] MSG(high) No work sent
2010-12-02 19:42:37.5911 [PID=6883 ]    Sending reply to [HOST#2500292]: 0 results, delay req 60.00
2010-12-02 19:42:37.5914 [PID=6883 ]    Scheduler ran 0.447 seconds


Looks normal and correct.

Edit: Although the interesting part is why we have the following sequence:

Quote:
2010-12-02 19:42:37.5376 [PID=6883 ] [version] Don't need CPU jobs, skipping version 504 for einstein_S5GC1HF ()
2010-12-02 19:42:37.5377 [PID=6883 ] [version] Don't need CPU jobs, skipping version 704 for einstein_S5GC1HF ()


That's 5.04 for Mac OS X on Intel and 7.04 for Mac OS X on PPC.

Why the need to check for both? An Apple with an Intel CPU can't run the PPC application, and an Apple with a PPC CPU can't run the Intel application; otherwise the project wouldn't need to build an app for each processor. Or am I seeing that wrong?

soft spirit
Joined: 27 Oct 10
Posts: 113
Credit: 5880079
RAC: 0

Hit me as well. All CPU work, 2 pages, 5 hrs each. I had a 1-day cache set, and it filled until I hit a project limit. With 3 cores crunching, I expect to get about half done before the deadline on 12/13.

SciManStev
Joined: 27 Aug 05
Posts: 154
Credit: 15562799
RAC: 0

Richard, I tried adding several debug flags, and ended up with a mess of messages. I did capture the server log, which I'll post. If you can tell me which debug flags to use, I will delete the rest of the garbage, and post the results I have saved.

2010-12-02 23:18:46.0145 [PID=4073] Request: [USER#xxxxx] [HOST#2924241] [IP xxx.xxx.xxx.34] client 6.10.58
2010-12-02 23:18:46.0185 [PID=4073 ] [send] effective_ncpus 12 max_jobs_on_host_cpu 999999 max_jobs_on_host 999999
2010-12-02 23:18:46.0185 [PID=4073 ] [send] effective_ngpus 2 max_jobs_on_host_gpu 999999
2010-12-02 23:18:46.0185 [PID=4073 ] [send] Not using matchmaker scheduling; Not using EDF sim
2010-12-02 23:18:46.0185 [PID=4073 ] [send] CPU: req 2725.46 sec, 0.00 instances; est delay 540788.19
2010-12-02 23:18:46.0186 [PID=4073 ] [send] CUDA: req 475544.20 sec, 0.00 instances; est delay 0.00
2010-12-02 23:18:46.0186 [PID=4073 ] [send] work_req_seconds: 2725.46 secs
2010-12-02 23:18:46.0186 [PID=4073 ] [send] available disk 97.30 GB, work_buf_min 0
2010-12-02 23:18:46.0186 [PID=4073 ] [send] active_frac 0.998591 on_frac 0.972763 DCF 1.418769
2010-12-02 23:18:46.1456 [PID=4073 ] [send] [HOST#2924241] is reliable
2010-12-02 23:18:46.1457 [PID=4073 ] [send] set_trust: random choice for error rate 0.007381: yes
2010-12-02 23:18:46.3463 [PID=4073 ] [version] Best version of app einstein_S5GC1HF is ID 231 (6.00 GFLOPS)
2010-12-02 23:18:46.3463 [PID=4073 ] [send] est. duration for WU 88715847: unscaled 10789.98 scaled 15759.32
2010-12-02 23:18:46.3464 [PID=4073 ] [send] [WU#88715847] meets deadline: 540788.19 + 15759.32 < 1209600
2010-12-02 23:18:46.3471 [PID=4073 ] [debug] Sorted list of URLs follows [host timezone: UTC-18000]
2010-12-02 23:18:46.3471 [PID=4073 ] [debug] zone=-21600 url=http://einstein-dl4.phys.uwm.edu
2010-12-02 23:18:46.3472 [PID=4073 ] [debug] zone=-21600 url=http://einstein-dl2.phys.uwm.edu
2010-12-02 23:18:46.3472 [PID=4073 ] [debug] zone=-21600 url=http://einstein-dl3.phys.uwm.edu
2010-12-02 23:18:46.3472 [PID=4073 ] [debug] zone=-28800 url=http://einstein.ligo.caltech.edu
2010-12-02 23:18:46.3472 [PID=4073 ] [debug] zone=+03600 url=http://einstein.aei.mpg.de
2010-12-02 23:18:46.3472 [PID=4073 ] [debug] zone=+03600 url=http://einstein-mirror.aei.uni-hannover.de/EatH
2010-12-02 23:18:46.3474 [PID=4073 ] [send] [HOST#2924241] Sending app_version einstein_S5GC1HF 2 304 S5GCESSE2; 6.00 GFLOPS
2010-12-02 23:18:46.3485 [PID=4073 ] [send] est. duration for WU 88715847: unscaled 10789.98 scaled 15759.32
2010-12-02 23:18:46.3485 [PID=4073 ] [HOST#2924241] Sending [RESULT#209494783 h1_1339.00_S5R4__1085_S5GC1HFa_2] (est. dur. 15759.32 seconds)
2010-12-02 23:18:50.5032 [PID=4073 ] [version] have CPU version but no more CPU work needed
2010-12-02 23:18:50.5032 [PID=4073 ] [version] Don't need CPU jobs, skipping version 304 for einstein_S5GC1HF ()
2010-12-02 23:18:50.5032 [PID=4073 ] [version] Don't need CPU jobs, skipping version 304 for einstein_S5GC1HF (S5GCESSE)
2010-12-02 23:18:50.5032 [PID=4073 ] [version] Don't need CPU jobs, skipping version 304 for einstein_S5GC1HF (S5GCESSE2)
2010-12-02 23:18:50.5033 [PID=4073 ] [version] no app version available: APP#14 (einstein_S5GC1HF) PLATFORM#2 (windows_intelx86) min_version 0
2010-12-02 23:18:50.5911 [PID=4073 ] Sending reply to [HOST#2924241]: 1 results, delay req 60.00
2010-12-02 23:18:50.5915 [PID=4073 ] Scheduler ran 4.584 seconds

Steve

Crunching as member of The GPU Users Group team.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110026839479
RAC: 22489705

Quote:
Any suggestions?


I don't have an NVIDIA GPU so I have no direct experience and can't experiment. It seems to me that there's not much point in asking for ABP2 work when there is no new work available so I'd change my preferences, either to deselect ABP2 crunching or to stop asking for work for the GPU. If your client isn't continually trying to get the non-existent GPU tasks, it shouldn't continue to get the excess GW tasks (hopefully) :-). Once there are new binary pulsar tasks (I'm sure there will be a big announcement) you could easily reinstate your desired preference settings.

The other way you could handle it would be to disable the use of your GPU using BOINC's client configuration options in a self-constructed cc_config.xml file. Since that would disable the GPU for all projects, you wouldn't want to do that if you wished to use your GPU for any other project.
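As a rough illustration of that second option, here is a minimal Python sketch that prints a cc_config.xml using the `<no_gpus>` option. The helper name `cc_config_no_gpus` is made up for this example, and whether `<no_gpus>` is honoured depends on the client version, so treat it as an assumption to check against the BOINC client configuration documentation rather than a tested recipe.

```python
def cc_config_no_gpus() -> str:
    """Return the text of a cc_config.xml that asks the client to ignore all GPUs.

    Assumption: the client supports the <no_gpus> option under <options>.
    Tags are kept one per line, which older line-oriented parsers expect.
    """
    lines = [
        "<cc_config>",
        "  <options>",
        "    <no_gpus>1</no_gpus>",
        "  </options>",
        "</cc_config>",
    ]
    return "\n".join(lines) + "\n"


# Print the file contents; save them as cc_config.xml in the BOINC data directory.
print(cc_config_no_gpus(), end="")
```

The file only takes effect once the client re-reads its configuration, and as noted above it switches the GPU off for every attached project, not just Einstein@Home.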

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110026839479
RAC: 22489705

Hi Steve, thanks very much for posting the server log snippet. Richard and Jord will be much more competent than I am in the interpretation but I thought I'd have a go and make a few comments anyway.

Your client seems to be requesting both CPU work (a small amount) and GPU work (quite a lot) so that seems to answer Richard's point about whether or not the server was wrongly cross-allocating CPU work when it couldn't supply GPU work. It would seem the server is just responding to a request coming from the client that the client really shouldn't be making.

If the client is going to continue making such a CPU request every time it also makes an unsatisfied GPU request, it's not surprising that people are getting way too many CPU tasks.
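For anyone wanting to check their own server logs for the same pattern, a small Python sketch along these lines could pull the requested seconds out of the `[send] CPU:` and `[send] CUDA:` lines. The function name `requested_seconds` and the regular expression are my own, written against the two log lines Steve posted, so other scheduler versions may format these lines differently.

```python
import re

# Matches lines such as "[send] CPU: req 2725.46 sec, ..." and the CUDA equivalent.
REQ_RE = re.compile(r"\[send\]\s+(CPU|CUDA): req ([\d.]+) sec")


def requested_seconds(log_text: str) -> dict:
    """Map each resource named in the log ("CPU", "CUDA") to the seconds requested."""
    return {m.group(1): float(m.group(2)) for m in REQ_RE.finditer(log_text)}


# The two request lines from the server log for host 2924241.
sample = """\
2010-12-02 23:18:46.0185 [PID=4073 ] [send] CPU: req 2725.46 sec, 0.00 instances; est delay 540788.19
2010-12-02 23:18:46.0186 [PID=4073 ] [send] CUDA: req 475544.20 sec, 0.00 instances; est delay 0.00
"""

req = requested_seconds(sample)
asked_for_both = req.get("CPU", 0.0) > 0 and req.get("CUDA", 0.0) > 0
print(req, "client asked for both:", asked_for_both)
```

A request where both values are non-zero is the case described above: the client itself asked for CPU work alongside the GPU work.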

The log is quite informative about the decision making process for selecting a CPU task to fill that request but (apart from acknowledging that a substantial GPU request was made) there seems to be no comment about the GPU - not even a 'no work available' type comment. That seems to be a bit surprising. Maybe there are extra flags that need to be set to make the scheduler more 'chatty' about its decision making processes for GPU tasks or maybe it's just that the complexity of the 'locality scheduling' used for selecting GW tasks needs more output about how and why certain decisions about these CPU tasks were made.
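The numbers in that part of the log can be checked by hand. The sketch below assumes the scaled duration is the unscaled estimate multiplied by DCF and divided by on_frac and active_frac; that formula is inferred from the figures in Steve's log, not taken from the scheduler source, and the function names are hypothetical.

```python
def scaled_duration(unscaled: float, dcf: float, on_frac: float, active_frac: float) -> float:
    """Estimated wall-clock duration (assumed formula, inferred from the log values)."""
    return unscaled * dcf / (on_frac * active_frac)


def meets_deadline(est_delay: float, duration: float, delay_bound: float) -> bool:
    """Same comparison as the log line: est_delay + duration < delay_bound."""
    return est_delay + duration < delay_bound


# Values copied from the server log for host 2924241.
duration = scaled_duration(10789.98, dcf=1.418769, on_frac=0.972763, active_frac=0.998591)
ok = meets_deadline(540788.19, duration, delay_bound=1209600)  # 1209600 s = 14 days

print(round(duration, 2), ok)
```

With roughly 540788 seconds (a little over six days) of CPU work already queued, one more task of about 15759 seconds still fits inside the 14-day bound, which is why the server is willing to keep handing out CPU tasks on every such request.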

The upshot of all this is that someone with a GPU needs to set client debugging flags to see if we can work out why these extra requests for CPU work are being made. If you set flags relating to work scheduling in the client, you may get an insight as to why the BOINC client is prepared to keep asking for CPU work. To preserve your own sanity, you might like to start with just work_fetch_debug and see what that gives you.
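As a starting point, the flag goes in the `<log_flags>` section of cc_config.xml. Here is a minimal Python sketch that prints such a file; the helper name `cc_config_log_flags` is invented for this example, and only `work_fetch_debug` comes from this thread.

```python
def cc_config_log_flags(flags) -> str:
    """Return cc_config.xml text that turns on each named client log flag."""
    lines = ["<cc_config>", "  <log_flags>"]
    for flag in flags:
        # Each flag is written as its own element set to 1, one tag per line.
        lines.append(f"    <{flag}>1</{flag}>")
    lines += ["  </log_flags>", "</cc_config>"]
    return "\n".join(lines) + "\n"


# Start with a single flag to keep the message log readable.
print(cc_config_log_flags(["work_fetch_debug"]), end="")
```

Save the output as cc_config.xml in the BOINC data directory and have the client re-read its configuration. Further flags can be added to the list later if this one doesn't show enough.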

Cheers,
Gary.

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5779100
RAC: 0

Quote:
To preserve your own sanity, you might like to start with just work_fetch_debug and see what that gives you.


No, better use and as these only run at the time of making contact and doing the transfers.

That's giving less fodder than which runs and outputs every 10 seconds to every minute.
