Another scheduler instance is running for this host - - - ? ? ?

Keith Myers
Joined: 11 Feb 11
Posts: 4754
Credit: 17707217567
RAC: 5260411
Topic 221721

Mon 06 Apr 2020 09:52:26 AM PDT | Einstein@Home | [sched_op] Starting scheduler request
Mon 06 Apr 2020 09:52:26 AM PDT | Einstein@Home | Sending scheduler request: To report completed tasks.
Mon 06 Apr 2020 09:52:26 AM PDT | Einstein@Home | Reporting 1 completed tasks
Mon 06 Apr 2020 09:52:26 AM PDT | Einstein@Home | Requesting new tasks for NVIDIA GPU
Mon 06 Apr 2020 09:52:26 AM PDT | Einstein@Home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
Mon 06 Apr 2020 09:52:26 AM PDT | Einstein@Home | [sched_op] NVIDIA GPU work request: 1.00 seconds; 1.00 devices
Mon 06 Apr 2020 09:52:31 AM PDT | Einstein@Home | Scheduler request completed: got 0 new tasks
Mon 06 Apr 2020 09:52:31 AM PDT | Einstein@Home | [sched_op] Server version 611
Mon 06 Apr 2020 09:52:31 AM PDT | Einstein@Home | Another scheduler instance is running for this host
Mon 06 Apr 2020 09:52:31 AM PDT | Einstein@Home | Project requested delay of 60 seconds
Mon 06 Apr 2020 09:52:31 AM PDT | Einstein@Home | [sched_op] Deferring communication for 00:01:00
Mon 06 Apr 2020 09:52:31 AM PDT | Einstein@Home | [sched_op] Reason: requested by project


Can anybody explain what this is and why I am getting this message?


Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

I was wondering the same thing this morning. For me it also shows that a task gets 'lost', but very soon afterwards it is resent successfully. This is happening at the moment too. Here the cycle looks like this:

6.4.2020 19:36:56 | Einstein@Home | Sending scheduler request: To fetch work.
6.4.2020 19:36:56 | Einstein@Home | Requesting new tasks for NVIDIA GPU
6.4.2020 19:42:03 | Einstein@Home | Scheduler request failed: Timeout was reached
6.4.2020 19:42:04 | | Project communication failed: attempting access to reference site
6.4.2020 19:42:06 | | Internet access OK - project servers may be temporarily down.
6.4.2020 19:43:28 | Einstein@Home | Sending scheduler request: To fetch work.
6.4.2020 19:43:28 | Einstein@Home | Requesting new tasks for NVIDIA GPU
6.4.2020 19:43:30 | Einstein@Home | Scheduler request completed: got 0 new tasks
6.4.2020 19:43:30 | Einstein@Home | Another scheduler instance is running for this host
6.4.2020 19:43:30 | Einstein@Home | Project requested delay of 60 seconds
6.4.2020 19:47:26 | Einstein@Home | Computation for task h1_1720.95_O2C02Cl4In0__O2MDFG3a_G34731_1721.40Hz_24_0 finished
6.4.2020 19:47:26 | Einstein@Home | Starting task h1_1720.95_O2C02Cl4In0__O2MDFG3a_G34731_1721.40Hz_23_0
6.4.2020 19:47:37 | Einstein@Home | Sending scheduler request: To fetch work.
6.4.2020 19:47:37 | Einstein@Home | Reporting 1 completed tasks
6.4.2020 19:47:37 | Einstein@Home | Requesting new tasks for NVIDIA GPU
6.4.2020 19:47:39 | Einstein@Home | Scheduler request completed: got 1 new tasks
6.4.2020 19:47:39 | Einstein@Home | Resent lost task h1_1720.95_O2C02Cl4In0__O2MDFG3a_G34731_1721.40Hz_22_0
6.4.2020 19:47:39 | Einstein@Home | Project requested delay of 60 seconds
6.4.2020 20:00:44 | Einstein@Home | Sending scheduler request: To fetch work.
6.4.2020 20:00:44 | Einstein@Home | Requesting new tasks for NVIDIA GPU
6.4.2020 20:05:51 | Einstein@Home | Scheduler request failed: Timeout was reached
6.4.2020 20:05:52 | | Project communication failed: attempting access to reference site
6.4.2020 20:05:54 | | Internet access OK - project servers may be temporarily down.
6.4.2020 20:07:02 | Einstein@Home | Sending scheduler request: To fetch work.
6.4.2020 20:07:02 | Einstein@Home | Requesting new tasks for NVIDIA GPU
6.4.2020 20:07:04 | Einstein@Home | Scheduler request completed: got 0 new tasks
6.4.2020 20:07:04 | Einstein@Home | Another scheduler instance is running for this host
6.4.2020 20:07:04 | Einstein@Home | Project requested delay of 60 seconds
6.4.2020 20:12:13 | Einstein@Home | Computation for task h1_1720.95_O2C02Cl4In0__O2MDFG3a_G34731_1721.40Hz_23_0 finished
6.4.2020 20:12:14 | Einstein@Home | Starting task h1_1721.00_O2C02Cl4In0__O2MDFG3a_G34731_1721.50Hz_30_1
6.4.2020 20:12:22 | Einstein@Home | Sending scheduler request: To fetch work.
6.4.2020 20:12:22 | Einstein@Home | Reporting 1 completed tasks
6.4.2020 20:12:22 | Einstein@Home | Requesting new tasks for NVIDIA GPU
6.4.2020 20:12:24 | Einstein@Home | Scheduler request completed: got 1 new tasks
6.4.2020 20:12:24 | Einstein@Home | Resent lost task h1_1652.45_O2C02Cl4In0__O2MDFG3a_G34731_1652.75Hz_10_2

I came to the conclusion that this happens because the scheduler is again too slow. It takes more than a minute to process the scheduling request. When the scheduling doesn't complete in time, a task then gets 'lost'. I remember Gary explaining this phenomenon some time ago.

Keith Myers
Joined: 11 Feb 11
Posts: 4754
Credit: 17707217567
RAC: 5260411

Yes, I have been getting resent lost tasks also.  They all show that they have already been reported.  They seem to be tied to the 100 errored GPU tasks that "insta-expired" back on 2 April.  Those tasks show a received date of 1 Jan 1970.  The servers glitched, the timestamp was reset to the Unix epoch (Day One), and the tasks were marked as received before they were sent.


Name: h1_0799.95_O2C02Cl5In0__O2MD1C2_CasA_800.55Hz_74_0
Workunit ID: 447326077
Created: 2 Apr 2020 8:17:20 UTC
Sent: 2 Apr 2020 8:17:21 UTC
Report deadline: 2 Apr 2020 8:31:32 UTC
Received: 1 Jan 1970 0:00:00 UTC


Keith Myers
Joined: 11 Feb 11
Posts: 4754
Credit: 17707217567
RAC: 5260411

Quote:
I came to the conclusion that this happens because the scheduler is again too slow. It takes more than a minute to process the scheduling request. When the scheduling doesn't complete in time, a task then gets 'lost'. I remember Gary explaining this phenomenon some time ago.

I suspected the same.  With the scheduler request interval only 60 seconds here, the scheduler spins its wheels in locality scheduling for far longer than that.  The linked scheduler printout runs to multiple pages as it tries to decide what to send.


Keith Myers
Joined: 11 Feb 11
Posts: 4754
Credit: 17707217567
RAC: 5260411

Now the Einstein scheduler has completely lost its mind.  The Task Duration Correction Factor has jumped from ~0.85 to 7.40 and now says my 1 day cache will take 25 days to complete on the CPU and 13 days on the GPU.

This has forced my other projects' GPU tasks into High Priority EDF mode for their deadlines 5 days from now, even though they will actually complete in a few hours. Also, my other projects are now prevented from replenishing their 1 day cache because BOINC says the GPU cache is full . . . . . with 13 days of Einstein GPU tasks.

Gawd, I wish the Einstein server software would be updated to something modern.


Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110040647295
RAC: 22414847

Keith Myers wrote:
Now the Einstein scheduler has completely lost its mind.  The Task Duration Correction Factor has jumped from ~0.85 to 7.40 and now says my 1 day cache will take 25 days to complete on the CPU and 13 days on the GPU.

It's not the scheduler that's "lost its mind" - it's your BOINC client :-).  The scheduler may have many other quirks but it doesn't have any impact on what your client thinks its cache of tasks will need time-wise.

Let me hazard a guess.  The DCF jump from 0.85 to 7.40 is a factor of 8.7 (7.40 ÷ 0.85 ≈ 8.7).  It might have been caused by a single task taking 8.7 times as long as its estimate suggested it should.  There are many possible reasons for that and I don't know which host has the problem, but I'd be looking for a task on that host that completed with an unusually long crunch time.

One possibility is that the 0.85 DCF was created by some fast finishing GRP tasks.  GRP tasks tend to create a DCF even significantly lower than that.  This would have lowered the estimate for GW tasks to well below what they actually take.  The 7.40 value could have been created by a single GW task (estimated far too low) that took rather longer than normal.  I see DCFs of around 4 or so for my GW tasks.  There are the occasional slow tasks that could take it to above 7.

There are two ways to handle the inaccurate task estimates for GW as compared to GRP.  Either don't run both GPU searches on the one machine, or keep the cache size to less than 0.5 days.  You would think that a 1 day cache would be safe, but according to your figures you had rather more than that at the time the DCF jumped by 8.7 times.  To have 25 days of CPU work after the jump, you must have had close to 3 days' worth before it (25 ÷ 8.7 ≈ 2.9).  For the GPU tasks, the corresponding figure would have been around 1.5 days (13 ÷ 8.7 ≈ 1.5).  Those sorts of figures are quite believable - the actual work on hand does often stray a bit above what the cache setting says.

All the above is pure conjecture on my part.  If something other than a very slow finishing task turns out to be the true cause of the sudden jump in DCF, it would be very interesting to know the details.

Keith Myers wrote:
Gawd, I wish the Einstein server software would be updated to something modern.

That's an impossible wish as there isn't "something modern" to update to :-).  In any case, that wouldn't solve the problem.  As I've said, the real issue is that the two GPU searches have very inaccurate estimates in opposite directions.  If they were both inaccurate in the same direction, the single DCF could probably cope.

Because Locality Scheduling is vital in allowing both volunteers and project servers to cope with the volume of data downloads, the highly customised server code here is probably destined to stay that way.  When the GW app matures, it's possible the estimate for those tasks might get a lot better.  One can always hope.

Cheers,
Gary.

Keith Myers
Joined: 11 Feb 11
Posts: 4754
Credit: 17707217567
RAC: 5260411

Nope, not more than a 1 day cache setting.  In fact, tasks are hard limited to 20 CPU tasks and 120 GPU tasks at any time.

This is the host turnaround time.

Average turnaround time: 0.37 days

Task duration correction factor was abandoned, or actually nullified by being set to 1.0, in most projects' configurations long ago.  Of all my projects, only Einstein is still using it.  So it has nothing to do with the client, but with what the server scheduler tells the client.  GPUGrid, Seti and Milkyway are all set to 1.0 DCF in the client.

Quote:
That's an impossible wish as there isn't "something modern" to update to :-).  In any case, that wouldn't solve the problem.

Uhh,  the current BOINC server software is version 715.

This is what Einstein is running.

Mon 06 Apr 2020 09:22:11 PM PDT | Einstein@Home | [sched_op] Server version 611


Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110040647295
RAC: 22414847

Keith Myers wrote:
Nope, not more than 1 day cache setting.  In fact tasks are hard limited to 20 cpu tasks and 120 gpu tasks at any time.

Go back and read what was actually written.  I'm quite sure you had a 1 day setting.  I said that, quite often, that setting can turn into more than that amount of actual work.  The normal DCF swings can easily do that.

Keith Myers wrote:
Task duration correction factor was abandoned, or actually nullified by being set to 1.0, in most projects' configurations long ago.  Of all my projects, only Einstein is still using it.  So it has nothing to do with the client, but with what the server scheduler tells the client.  GPUGrid, Seti and Milkyway are all set to 1.0 DCF in the client.

None of that is relevant to the discussion here.  What other projects do or don't do has no bearing on how the Einstein scheduler handles its work requests.  The scheduler doesn't "tell the client" - it's the other way around.  The client tells the server 'seconds of work' (for both CPU and GPU if necessary) and the value of DCF when it makes a request.  The server uses that information to work out what to send to meet the request.  The server doesn't know if the DCF value given is good, bad, or indifferent.  It just uses what it is given.

Keith Myers wrote:
Uhh,  the current BOINC server software is version 715.

How is that relevant?  Because of the need for Locality Scheduling, Einstein has chosen to use its own highly customised version of the BOINC server code.  That code was originally based on some much earlier BOINC standard version but has been significantly (and continually) modified over the years to meet Einstein requirements.  I can remember a very clear statement many years ago that it was a much more onerous task to try to keep up with BOINC versions as they rolled out than it was to stay with their own customised version.  The reason given was that each new BOINC version would require an horrendous number of patches to adapt it to what they needed.  Essentially, they chose to have their own unique version - which included DCF.  We, the volunteers, either work out how to live with that or we choose to move on.  That's a bit brutal but it's just the way it is.

The fact that you got to start this thread at all is probably a testament to the latest update to the Einstein server code that (at least temporarily) had some sort of a bug in it.  How much more "modern" than that can you get :-).


Cheers,
Gary.

Keith Myers
Joined: 11 Feb 11
Posts: 4754
Credit: 17707217567
RAC: 5260411

So where does the value of DCF come from??  Do you set it anywhere in the client? NO

Do you set the DCF value anywhere in the Project Preferences? NO

You have no control over the value of DCF.  That is determined by the project servers and stored in the host configuration profile on the server.

Pull up the properties page of any project in the Manager and you will not find Duration Correction Factor listed in the Scheduling section, other than for the Einstein project.

The duration_correction_factor is handled by the client ONLY if the project uses it. These snippets are from the client code modules.

// If the project's DCF is > 90 (and we're not ignoring it)
// <<< KEY FACTOR: it is ignored everywhere but at Einstein.

if (project->dont_use_dcf) {
    project->duration_correction_factor = 1;
}

The individual project servers HAVE to tell the client to use DCF.  You don't get a choice.
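
As a minimal sketch of my reading of that mechanism (my own paraphrase with a cut-down struct for illustration, NOT verbatim BOINC source), the flag arrives from the project side and the client just obeys it:

#include <cstdio>

// Cut-down stand-in for the client's PROJECT structure.
struct Project {
    bool dont_use_dcf = false;
    double duration_correction_factor = 1.0;
};

// Applied when a scheduler reply carries a dont_use_dcf flag:
// the client pins that project's DCF to 1.0 from then on.
void apply_reply_flag(Project& p, bool reply_dont_use_dcf) {
    if (reply_dont_use_dcf) {
        p.dont_use_dcf = true;
        p.duration_correction_factor = 1;  // same reset as the snippet above
    }
}

int main() {
    Project proj;                  // a hypothetical non-Einstein project
    apply_reply_flag(proj, true);  // the server says: don't use DCF
    printf("dcf = %.1f\n", proj.duration_correction_factor);  // prints 1.0
}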

Quote:
The scheduler doesn't "tell the client"

This statement is FALSE in the matter of DCF.  Yes, the scheduler DOES tell the client.


Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110040647295
RAC: 22414847

Keith, I need to be gone so I'll have to fully answer tomorrow if you wish.

In the meantime, how do you explain the following?  I just went to the BOINC directory on the closest host of mine and issued the following commands.  What was returned is shown immediately below each command.

[gary@ryzen-01 BOINC]$ grep duration_correction sched_request_einstein.phys.uwm.edu.xml
    <duration_correction_factor>3.108055</duration_correction_factor>
[gary@ryzen-01 BOINC]$


[gary@ryzen-01 BOINC]$ grep duration_correction sched_reply_einstein.phys.uwm.edu.xml
[gary@ryzen-01 BOINC]$

So, for Einstein, who is telling whom exactly what the duration correction factor actually is?  It's sent in the request and not received in the reply.

When you first attach to the Einstein project, your state file (in the brand new Einstein Project section) will contain a default value of 1.000000 for DCF.  If the very first Einstein task completed takes twice as long as its estimate, you can watch the client adjust all the other tasks on the tasks tab of BOINC Manager to have (immediately and without server consultation) a corrected estimate that will be exactly (for this artificial example) double the previous estimate.  If you looked in the state file, you would see the change to 2.000000 there as well.

If your very first task took exactly half as long as the estimate, the DCF would be reduced below 1.000000 by 10% of the error in the estimate.  If this pattern of half the original estimate was repeated over time, the DCF would converge progressively on 0.500000.  It's done that way to prevent over-fetching if a single task only had a 'once-off' fast finishing time.
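
To put that rule in one place, here's a minimal sketch of the asymmetric update I've just described.  This is only my paraphrase of the behaviour - the function name and the exact client source will differ - but both worked examples above fall straight out of it:

#include <cstdio>

// Sketch of the client-side DCF update after a task completes.
double update_dcf(double dcf, double elapsed, double estimated) {
    double ratio = elapsed / estimated;   // correction implied by this task
    if (ratio > dcf) {
        return ratio;                     // slow task: jump up immediately
    }
    return dcf - 0.1 * (dcf - ratio);     // fast task: creep down by 10% of the error
}

int main() {
    double dcf = 1.0;
    dcf = update_dcf(dcf, 7200, 3600);    // first task takes double its estimate
    printf("after one slow task: %f\n", dcf);     // 2.000000

    dcf = 1.0;
    for (int i = 0; i < 50; i++) {
        dcf = update_dcf(dcf, 1800, 3600);        // tasks take half their estimate
    }
    printf("after many fast tasks: %f\n", dcf);   // converging on 0.500000
}

The jump-up / creep-down asymmetry is the whole point: a single slow task must never leave the estimates too low, but a single fast one must not trigger over-fetching.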

For either example, when your client makes its next scheduler request, it will always report to the scheduler the current value in the state file that it has control over.  You also have control over this.  If you stop BOINC and edit the value in your state file, the client (without question) will immediately read the new value on restart and believe it.  I have done this many times to correct damaged values.  I'm certainly not recommending that as a standard course of action.

It doesn't matter what other projects do.  We are discussing Einstein behaviour.

Cheers,
Gary.

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6542
Credit: 287110889
RAC: 99376

I'm sick of thinking of COVID19 all day long and so ....... 

FWIW : I've had a peek at the latest BOINC git repo ( https://github.com/BOINC/boinc ) and searched on "duration_correction_factor" and "dont_use_dcf", finding that they are members of a ( massive ) structure/type called PROJECT ( defined in boinc/client/project.h ) as follows :

double duration_correction_factor;
        // Multiply by this when estimating the CPU time of a result
        // (based on FLOPs estimated and benchmarks).
        // This is dynamically updated in a way that maintains an upper bound.
        // it goes down slowly but if a new estimate X is larger,
        // the factor is set to X.
        //
        // Deprecated - current server logic handles this,
        // and this should go to 1.
        // But we need to keep it around for older projects

bool dont_use_dcf;

There will be, at runtime, one of these PROJECT structures for each project that the BOINC client is managing. NB This implies that E@H is one of these 'older projects' and that the 'current server logic' doesn't apply for E@H because, as mentioned, it is on a vastly different server code base which split off the main branch some years ago. Anyhow, these two variables are initialised, per PROJECT instance ( in boinc/client/project.cpp ), as follows :

void PROJECT::init() {
    ...
    dont_use_dcf = false;
    ...
    duration_correction_factor = 1;
    ...
}

Of importance is where these variables are subsequently altered ( ie. assigned to some expression ). Within boinc/client/work_fetch.cpp is the crucial piece :

// called when benchmarks change
//
void CLIENT_STATE::scale_duration_correction_factors(double factor) {
    if (factor <= 0) return;
    for (unsigned int i=0; i<projects.size(); i++) {
        PROJECT* p = projects[i];
        if (p->dont_use_dcf) continue;
        p->duration_correction_factor *= factor;
    }
    if (log_flags.dcf_debug) {
        msg_printf(NULL, MSG_INFO,
            "[dcf] scaling all duration correction factors by %f",
            factor
        );
    }
}

That says : for each current project, and only for those that don't ignore the dcf, the dcf is scaled. So the question regresses to what code calls this function, and with what value of the parameter "factor". For that there is only one reference ( in boinc/client/cs_benchmark.cpp, function "cpu_benchmarks_poll" ) :

// scale duration correction factor according to change in benchmarks.
//
if (host_info.p_calculated && old_p_fpops) {
    scale_duration_correction_factors(host_info.p_fpops/old_p_fpops);
}

I hope this clarifies the matter. The factor that changes the dcf is determined on the host machine by the BOINC client software according to benchmarking activity. Not surprisingly that dcf adjustment depends on the ratio of the new benchmark to the old. 

{ Another interesting question then becomes when is benchmarking called for ? }

Anyway, this makes perfect sense as it is only the client that can measure the host capability ( which could vary from time to time  ) and advise the project accordingly.
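
( As a toy numerical check - my own sketch, and note that the estimate formula est_seconds = wu_fpops / p_fpops * dcf is my assumption of the usual client arithmetic rather than anything quoted above - scaling the dcf by the benchmark ratio leaves the wall-clock estimate of already-learned work unchanged : )

#include <cstdio>

int main() {
    double wu_fpops    = 1.0e13;  // a work unit's estimated FLOPs
    double old_p_fpops = 5.0e9;   // old benchmark result ( FLOPS )
    double dcf         = 1.2;     // learned correction factor

    double est_before = wu_fpops / old_p_fpops * dcf;   // 2400 s

    double p_fpops = 2.0 * old_p_fpops;  // re-benchmark: host looks 2x faster
    dcf *= p_fpops / old_p_fpops;        // what scale_duration_correction_factors() does

    double est_after = wu_fpops / p_fpops * dcf;        // still 2400 s
    printf("%.0f s before, %.0f s after\n", est_before, est_after);
}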

Cheers, Mike.

( edit ) I've just seen your reply Gary. Yes, you can spoof the BOINC client by fiddling with its state file.

( edit ) I haven't searched the code for where it saves and recalls values from the client's state file, local to the host, as that is not relevant to the question at hand.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
