Another scheduler instance is running for this host - - - ? ? ?

Keith Myers
Joined: 11 Feb 11
Posts: 4753
Credit: 17702461463
RAC: 5389402

Thanks for your reply, Mike, and for digging deeper into the code.  I am not disputing that DCF is sent from the client to the scheduler in its work request.  That is how the number of seconds is calculated for Einstein.

What my original comment that "the scheduler has lost its mind" was meant to imply is that the Einstein server software lets the client use the old PROJECTS code path, which applies DCF to the work request in the first place.

It is ignored everywhere else.  You won't see these outlandish shifts in calculated work requests on any project other than Einstein.  Every other project simply requests the amount specified in the cache settings, without further perturbation by DCF in the work request calculation.
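To illustrate the effect being described: with a fixed cache target in seconds, the amount of work the client asks for varies inversely with DCF, because each task's estimate is multiplied by DCF before the shortfall is computed.  This is a hedged illustration only; the function and parameter names below are hypothetical, not the BOINC API.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical illustration (not BOINC code): how an inflated DCF
// shrinks the number of tasks fetched for a fixed cache target.
int tasks_to_request(double cache_seconds, double est_runtime_uncorrected, double dcf) {
    double per_task = est_runtime_uncorrected * dcf;  // DCF-corrected per-task estimate
    return (int)std::ceil(cache_seconds / per_task);  // tasks needed to fill the cache
}
```

With a one-day cache and one-hour tasks, a DCF of 1.0 yields a request for 24 tasks, while a DCF of 12.0 yields only 2 - the kind of swing that is invisible on projects that ignore DCF.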

This is acceptable if you only run Einstein by itself on the host.  But if you run other projects concurrently on the same host, this leads to unacceptable client performance which I documented in my post.

This also reinforced my lament that I simply wished Einstein would use modern server software so it would play nicer with other projects.

I think I have made clear that I was simply venting my exasperation with Einstein.  So the blame lies not with the individual host and client.  The blame lies with the Einstein server software and how it directs the client to run through the DCF mechanism.

 

Keith Myers
Joined: 11 Feb 11
Posts: 4753
Credit: 17702461463
RAC: 5389402

Quote:
I hope this clarifies the matter. The factor that changes the dcf is determined on the host machine by the BOINC client software according to benchmarking activity. Not surprisingly that dcf adjustment depends on the ratio of the new benchmark to the old.

I reread your post, Mike, and I think that the benchmark change is what precipitated the drastic change in DCF.  But I have had the host's benchmark change every time it gets one of the lost tasks that the server generated with an "insta-expire" on 2 April.  Every time the server resends a lost task, I also get all the project's startup jpg files resent.  I think that makes the benchmark jump.  I would have to monitor and record what the benchmark was before and after a lost task is resent to be certain.

If I could clear the 100 insta-expire error tasks from the host I think the client would settle back down to normalcy.

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110019374240
RAC: 22845683

Hi Mike, Good to hear from you.

As you are one of the front line workers in these terrible COVID19 times, I sincerely hope you are able to take care of yourself and your family and that you aren't encountering too many that are flouting the rules and putting the health workers at increased risk.  I wish you all the best with this new and challenging reality.

Mike Hewson wrote:

.....
I hope this clarifies the matter. The factor that changes the dcf is determined on the host machine by the BOINC client software according to benchmarking activity. Not surprisingly that dcf adjustment depends on the ratio of the new benchmark to the old. 

{ Another interesting question then becomes when is benchmarking called for ? }

Anyway, this makes perfect sense as it is only the client that can measure the host capability ...

Thanks for trawling through some of the code.  I'm not competent to do that :-)

Yes, of course a change in benchmarks will automatically lead to a change in DCF.  Benchmarks are re-done after some sort of standard amount of time having elapsed from a previous set, when there has been no other hardware or software change that would have immediately triggered a new set to be measured.  From my very casual observations over hosts soldiering on under unchanging conditions, a new set of benchmarks will automatically be taken after something like a week or two.  I really haven't tried to work out the precise time.

There are things other than benchmark changes that cause DCF to be adjusted.  In particular, every time a task completes, there must be a routine which applies the effect of the difference between estimated crunch time and the actual time that the task took.

It was this very frequent and often (but not always) minuscule change that was under discussion - not a change caused by a new benchmark measurement.
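The shape of that per-task adjustment can be sketched as follows.  This is a hedged sketch in the spirit of the BOINC client's DCF update, not the verbatim routine; the function name and exact thresholds here are illustrative.  The asymmetry is deliberate: underestimates cause deadline trouble, so DCF jumps up quickly, but it only drifts down slowly.

```cpp
#include <cassert>
#include <cmath>

// Sketch (assumed behaviour, not verbatim BOINC code): after each task
// finishes, blend the observed runtime ratio into the project's DCF.
double update_dcf(double dcf, double elapsed, double estimated_uncorrected) {
    double raw_ratio = elapsed / estimated_uncorrected;  // actual vs. uncorrected estimate
    double adj_ratio = raw_ratio / dcf;                  // actual vs. DCF-corrected estimate
    if (adj_ratio > 1.1) {
        return raw_ratio;                       // task ran long: raise DCF immediately
    }
    if (adj_ratio < 0.1) {
        return dcf * 0.99 + 0.01 * raw_ratio;   // wildly fast outlier: tiny nudge down
    }
    return dcf * 0.9 + 0.1 * raw_ratio;         // normal finish: gentle 10% blend
}
```

This is why most task completions move DCF only minutely, while a single badly underestimated task can shift it drastically in one step.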

Mike Hewson wrote:
I've just seen your reply Gary. Yes, you can spoof the BOINC client by fiddling with its state file.

It's not a matter of "spoofing the client".  It's a matter of assisting the client to recover from something that caused a drastic (and wrong) upward movement in all task estimates and for which the client doesn't have the tools for an immediate recovery.  The alternative is to allow the client to remain in utter chaos (panic mode) and have to baby-sit the machine by suspending lots of tasks so that it can slowly struggle back to normal.  A very quick simple edit wins every time :-).

Mike Hewson wrote:
I haven't searched the code for where it saves and recalls values from the client's state file, local to the host, as that is not relevant to the question at hand.

The "question at hand" that started all this was the assertion that the scheduler imposes some sort of fictitious value for DCF on the client.  So, some sort of code snippet that shows how the client recalculates DCF for each task finish event and then uses that new value to send to the scheduler during a scheduler request would be quite relevant.

I'm certainly not asking you to do that.  The changes in DCF for every task finish are easy to observe and the sched_request/sched_reply files clearly show how the information is sent to the scheduler.

Cheers,
Gary.

Keith Myers
Joined: 11 Feb 11
Posts: 4753
Credit: 17702461463
RAC: 5389402

I think it likely this bit of code in result.cpp is how the estimated time is calculated with DCF factored in.

// estimate how long a result will take on this host
//
double RESULT::estimated_runtime() {
    double x = estimated_runtime_uncorrected();
    if (!project->dont_use_dcf) {
        x *= project->duration_correction_factor;
    }
    return x;
}

double RESULT::estimated_runtime_remaining() {
    if (computing_done()) return 0;
    ACTIVE_TASK* atp = gstate.lookup_active_task_by_result(this);
    if (app->non_cpu_intensive) {
        if (atp && atp->fraction_done > 0) {
            double est_dur = atp->fraction_done_elapsed_time / atp->fraction_done;
            double x = est_dur - atp->elapsed_time;
            if (x <= 0) x = 1;
            return x;
        }
        return 0;
    }

    if (atp) {
#ifdef SIM
        return sim_flops_left/avp->flops;
#else
        return atp->est_dur() - atp->elapsed_time;
#endif
    }
    return estimated_runtime();
}

And this snippet in cs_benchmark.cpp is how it is tied into benchmarking.

        // scale duration correction factor according to change in benchmarks.
        //
        if (host_info.p_calculated && old_p_fpops) {
            scale_duration_correction_factors(host_info.p_fpops/old_p_fpops);
        }
        host_info.p_calculated = now;
        benchmarks_running = false;
        set_client_state_dirty("CPU benchmarks");
    }
    return false;
}

 

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6542
Credit: 286953049
RAC: 94932

Thank you for your kind concerns Gary. 

Now when are benchmarks called ? Read on ....

In boinc/client/cs_benchmark.cpp is the following constant :

#define BENCHMARK_PERIOD        (SECONDS_PER_DAY*30)
    // rerun CPU benchmarks this often (hardware may have been upgraded)

Implying a default period of thirty days. But that's not the only possible time :

// called at startup to decide if we need to do benchmarks;
// set run_cpu_benchmarks if so.
//
void CLIENT_STATE::check_if_need_benchmarks() {
    if (run_cpu_benchmarks) return;
    // if user has changed p_calculated into the future
    // (as part of cheating, presumably) always run benchmarks
    //
    double diff = now - host_info.p_calculated;
    if (diff < 0) {
        run_cpu_benchmarks = true;
    } else if (diff > BENCHMARK_PERIOD) {
        if (host_info.p_calculated) {
            msg_printf(NULL, MSG_INFO,
                "Last CPU benchmark was %s ago", timediff_format(diff).c_str()
            );
        } else {
            msg_printf(NULL, MSG_INFO, "No CPU benchmark yet");
        }
        run_cpu_benchmarks = true;
    }
}

Now "p_calculated" is a member of class/type HOST_INFO storing the time of the last benchmark. It is initialised elsewhere to zero, otherwise taking its value from stored state. With the interesting implication being that if set to a future time, then something funny is going on like user interference !! It is not immediately obvious to me how that could be part of some cheating strategy.

But is there any other code touching the value "run_cpu_benchmarks" ? In boinc/client/cs_cmdline.cpp parse_cmdline member function :

    } else if (ARG(run_cpu_benchmarks)) {
        run_cpu_benchmarks = true;

So it can be requested on the command line ( plus we know that when using the GUI Boinc Manager you can request it via the Tools menu option ). Plus in boinc/client/client_state.cpp it is initialised to false, then  :

// if new version of client,
    // - run CPU benchmarks
    // - get new project list
    // - contact reference site (or some project) to trigger firewall alert
    //
    if (new_client) {
        run_cpu_benchmarks = true;
        all_projects_list_check_time = 0;
        if (cc_config.dont_contact_ref_site) {
            if (projects.size() > 0) {
                projects[0]->master_url_fetch_pending = true;
            }
        } else {
            net_status.need_to_contact_reference_site = true;
        }
    }

That is, with the installation of a new client version. The actual calling of the benchmark code is done in a polling loop :

bool CLIENT_STATE::poll_slow_events() {

    if (run_cpu_benchmarks && can_run_cpu_benchmarks()) {

        run_cpu_benchmarks = false;
        start_cpu_benchmarks();
    }

Thus the flag is checked continually to see whether a benchmark run needs to be done.

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Keith Myers
Joined: 11 Feb 11
Posts: 4753
Credit: 17702461463
RAC: 5389402

I've always seen a new benchmark run whenever I change the client version.  Now I know where the code does it.

I'm still wondering why these lost task resends always trigger the resend of the project jpg files in the download also.

I just wish they would stop.  I am just wasting electrons every time I get sent one because I get no credit for it.

07-Apr-2020 14:59:28 [Einstein@Home] [sched_op] Starting scheduler request
07-Apr-2020 14:59:28 [Einstein@Home] Sending scheduler request: To report completed tasks.
07-Apr-2020 14:59:28 [Einstein@Home] Reporting 4 completed tasks
07-Apr-2020 14:59:28 [Einstein@Home] Requesting new tasks for CPU and NVIDIA GPU
07-Apr-2020 14:59:28 [Einstein@Home] [sched_op] CPU work request: 1.00 seconds; 4.00 devices
07-Apr-2020 14:59:28 [Einstein@Home] [sched_op] NVIDIA GPU work request: 1.00 seconds; 5.00 devices
07-Apr-2020 14:59:29 [Milkyway@Home] Computation for task de_modfit_14_bundle5_testing_4s3f_1_1580162702_44880849_0 finished
07-Apr-2020 14:59:29 [Milkyway@Home] Starting task de_modfit_86_bundle4_4s_south4s_bgset_2_1580162702_44882797_0
07-Apr-2020 14:59:30 [Milkyway@Home] Computation for task de_modfit_86_bundle4_4s_south4s_bgset_2_1580162702_44880567_0 finished
07-Apr-2020 14:59:30 [Milkyway@Home] Starting task de_modfit_14_bundle5_testing_4s3f_3_1580162702_43236238_3
07-Apr-2020 14:59:34 [Einstein@Home] Scheduler request completed: got 3 new tasks
07-Apr-2020 14:59:34 [Einstein@Home] [sched_op] Server version 611
07-Apr-2020 14:59:34 [Einstein@Home] Completed result h1_1654.90_O2C02Cl4In0__O2MDFG3a_G34731_1655.30Hz_17_3 refused: result already reported as success
07-Apr-2020 14:59:34 [Einstein@Home] Completed result h1_1525.45_O2C02Cl4In0__O2MDFG3_G34731_1525.95Hz_34_2 refused: result already reported as success
07-Apr-2020 14:59:34 [Einstein@Home] Resent lost task h1_1809.00_O2C02Cl4In0__O2MDFG3a_G34731_1809.40Hz_20_0
07-Apr-2020 14:59:34 [Einstein@Home] Resent lost task h1_1809.00_O2C02Cl4In0__O2MDFG3a_G34731_1809.40Hz_19_0
07-Apr-2020 14:59:34 [Einstein@Home] Resent lost task h1_1809.00_O2C02Cl4In0__O2MDFG3a_G34731_1809.40Hz_18_0
07-Apr-2020 14:59:34 [Einstein@Home] Project requested delay of 60 seconds
07-Apr-2020 14:59:34 [Einstein@Home] [sched_op] estimated total CPU task duration: 0 seconds
07-Apr-2020 14:59:34 [Einstein@Home] [sched_op] estimated total NVIDIA GPU task duration: 2096 seconds
07-Apr-2020 14:59:34 [Einstein@Home] [sched_op] handle_scheduler_reply(): got ack for task h1_1654.90_O2C02Cl4In0__O2MDFG3a_G34731_1655.30Hz_17_3
07-Apr-2020 14:59:34 [Einstein@Home] [sched_op] handle_scheduler_reply(): got ack for task h1_1525.45_O2C02Cl4In0__O2MDFG3_G34731_1525.95Hz_34_2
07-Apr-2020 14:59:34 [Einstein@Home] [sched_op] handle_scheduler_reply(): got ack for task h1_1654.90_O2C02Cl4In0__O2MDFG3a_G34731_1655.20Hz_5_3
07-Apr-2020 14:59:34 [Einstein@Home] [sched_op] handle_scheduler_reply(): got ack for task h1_1654.90_O2C02Cl4In0__O2MDFG3a_G34731_1655.15Hz_2_3
07-Apr-2020 14:59:34 [Einstein@Home] [sched_op] Deferring communication for 00:01:00
07-Apr-2020 14:59:34 [Einstein@Home] [sched_op] Reason: requested by project
07-Apr-2020 14:59:37 [Einstein@Home] Started download of einstein_icon.png
07-Apr-2020 14:59:37 [Einstein@Home] Started download of Android.jpg
07-Apr-2020 14:59:37 [Einstein@Home] Started download of Arecibo_full.jpg
07-Apr-2020 14:59:37 [Einstein@Home] Started download of Arecibo_platform.jpg
07-Apr-2020 14:59:37 [Einstein@Home] Started download of Fermi_grsky.jpg
07-Apr-2020 14:59:37 [Einstein@Home] Started download of Fermi_satellite.jpg

 

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

Keith Myers wrote:
I just wish they would stop.  I am just wasting electrons every time

Are they building bridges??

<------ducks to avoid flying book

Keith Myers
Joined: 11 Feb 11
Posts: 4753
Credit: 17702461463
RAC: 5389402

I'll throw something more damaging than a book . . . . . . ha ha.

I'm down to 59 now from the original 100 insta-expired tasks the scheduler inflicted on me.

So slowly whittling them down.

 

Keith Myers
Joined: 11 Feb 11
Posts: 4753
Credit: 17702461463
RAC: 5389402

Back to this issue.  Sometime in the middle of the night, unattended-upgrades pulled the Nvidia drivers out from under the running tasks and all my gpu work got errored out.

unattended-upgrades had never before updated the video drivers without my intervention.  I had always run apt list --upgradable first, to see whether the video drivers were due for an update, so I could stop BOINC and then safely update the drivers without dumping all my gpu work.

Now I am in the penalty box for Einstein because of locality scheduling while all my other gpu projects gracefully got back to work with new tasks.

The frailty of locality scheduling strikes again.

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110019374240
RAC: 22845683

Keith Myers wrote:

... in the penalty box for Einstein because of locality scheduling ...

... Again the frailty of locality scheduling strikes again.

It would be interesting to understand how you can declare that locality scheduling should get the blame for this.  You seem determined to bag the project for something that is completely under your control.

In case any casual reader might think that project scheduling policies (locality or otherwise) might be at fault, the real story is that projects do protect themselves against rogue hosts that trash large numbers of tasks by setting daily limits and by reducing those limits progressively as trashed tasks get returned.  Once the number of trashed tasks is such that the reduced daily limit is below the number of tasks already sent, the client is prevented from receiving new work until a new 'day' arrives.

If the problem on the host gets fixed and it starts returning successfully completed tasks, the limits are restored very promptly.  If you notice the problem and fix it immediately, it is quite possible to get back to work straight away in many cases without having to wait for a new 'day' to start (midnight UTC).
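That protection mechanism can be sketched roughly as follows.  This is an assumed sketch of a BOINC-style adaptive daily quota, not verbatim scheduler code; the struct and method names are hypothetical, and the exact update rule (halving on error, recovery on success) may differ in detail from what the Einstein server actually runs.

```cpp
#include <algorithm>
#include <cassert>

// Sketch (assumed behaviour): a host's daily task limit shrinks as it
// returns errored tasks and recovers as it returns valid ones.
struct DailyQuota {
    int max_per_day;  // configured ceiling
    int current;      // limit currently in effect for this host

    void on_error()   { current = std::max(1, current / 2); }            // halve, floor of 1
    void on_success() { current = std::min(max_per_day, current * 2); }  // recover toward ceiling
};
```

A host that trashes a run of tasks quickly drives its limit below the number already sent, which is exactly the "penalty box" behaviour described above, and a few good results restore the limit just as quickly.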

All locality scheduling does is try to ensure that when you make a work request, you get sent work (if possible) that is related to large data files that are already on your machine.  The idea is to minimise network loads for both you and for the project.  It has nothing to do with daily limits or the reasons why tasks get trashed.

Cheers,
Gary.
