highFreq or lowFreq

MarkHNC
Joined: 31 Aug 12
Posts: 37
Credit: 170965842
RAC: 0

I have a Xeon whose CPU in the profile on the Einstein website reads "GenuineIntel Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz [Family 6 Model 45 Stepping 7]", which I know as an E5-2670v1. CPU-Z reports it running a hair shy of 3GHz. It has 32GB of 1600MHz DDR3 ECC CL11 RAM. Oddly, although the box runs Windows 10 Pro, it is listed as Windows 8.1 Pro on Einstein. Hyperthreading is on. The challenge for me is that it runs both WCG and Einstein, including 2x FGRP on a GTX 960 SSC GPU (except when it is in panic mode, as it has been for the past couple of days). To get Einstein a decent share of CPU work, I had to change the resource share to Einstein=400 (80%) and WCG=100 (20%).

BOINC only started getting its runtime estimates for these tuning units right this evening. This is the only one of my machines that is getting the "Hi" units. The units generally take between 65,000 and 67,500 seconds to complete. However, I ended up aborting a bunch of work units that it could not finish in time, because it had downloaded too many while using the bad estimate (as recently as this morning, it was still estimating ~13 hours when the units were consistently taking ~18-19 hours).

Is the client version in any way responsible for the estimate taking so long to "settle?"  I'm using the WCG BOINC client, which is 7.2.47.  I assume that the hyperthreading is why it is taking about double the expected range that Christian referenced?

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117574909903
RAC: 35238716

MarkHNC wrote:
Is the client version in any way responsible for the estimate taking so long to "settle?"

It's not the client version. It's because you have Einstein CPU tasks that take longer than their estimate and GPU tasks that take less than their estimate. At Einstein, the client uses a duration correction factor (DCF) to refine estimates. Because GPU tasks tend to finish more quickly than their estimates, the client will reduce the DCF in an attempt to compensate, every time a GPU task finishes. The reverse happens every time a CPU task finishes. The estimates for either task type can never 'settle' because they are continually being 'pulled' in different directions. This is because there is only one DCF which all searches have to use.
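As a rough illustration of the tug-of-war described above, here is a toy model of one correction factor shared by two task types. The smoothing rule, the `weight` parameter, and all runtimes are assumptions made up for the sketch; the real client's update rule differs in its details.

```python
# Toy model only: one correction factor shared by two task types.
# The smoothing rule and every number here are illustrative assumptions,
# not the actual BOINC client update rule.

def update_dcf(dcf, base_estimate_s, actual_runtime_s, weight=0.5):
    """Move the shared factor part of the way toward actual/base."""
    ratio = actual_runtime_s / base_estimate_s
    return dcf + weight * (ratio - dcf)


def simulate(task_stream, dcf=1.0):
    """Return the shared factor after each finished task."""
    history = []
    for base_estimate_s, actual_runtime_s in task_stream:
        dcf = update_dcf(dcf, base_estimate_s, actual_runtime_s)
        history.append(round(dcf, 3))
    return history


# Hypothetical (base estimate, actual runtime) pairs in seconds:
# the CPU task runs 1.4x its base estimate, the GPU task 0.5x.
CPU_TASK = (47_000, 65_800)
GPU_TASK = (6_000, 3_000)

# Alternate CPU and GPU completions and watch the factor bounce.
history = simulate([CPU_TASK, GPU_TASK] * 4)
print(history)
```

In this sketch the factor rises after every CPU completion and falls after every GPU completion, so it never reaches the value that would suit either task type on its own.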

At the moment you have a bunch more CPU tasks that have already expired and even more that are quite close to expiry. The easiest way to help your BOINC client deal with the situation is to reduce your work cache setting to a very low value (I'd be using 0.1 days) until the backlog is cleared. You are going to have to abort more CPU tasks: those already past deadline and more that are close to deadline. If you don't drastically reduce your cache setting, BOINC will just keep requesting replacements for aborted tasks and the problem will continue. Part of the problem is the current 5 day deadline for GW tasks. When the 'non-tuning' full run gets underway the deadline will be 14 days. However, you will still need to keep a sensible cache size for the mixture of CPU/GPU tasks.

 

 

Cheers,
Gary.

Sid
Joined: 17 Oct 10
Posts: 164
Credit: 970506747
RAC: 427063

archae86 wrote:
Nick_43 wrote:
I would think my 8MB of cache would be considerable, maybe it isn't these days...

That's a Nehalem EP which was quite a fearsome chip when new.  It is no longer new, nor even middle-aged.  I've just retired my system which was running a Xeon E5620 Westmere, which is a quick redraft of Nehalem on the next generation manufacturing process (32 nm down from 45 nm for yours).

 

The L5640 that I am using might be a rather ancient processor, but with 12 virtual cores it can win by weight of numbers rather than by skill. 24 virtual cores on a two-socket motherboard can be a very cheap and relatively efficient solution, so I'm a bit reluctant to retire it.

 

 

MarkHNC
Joined: 31 Aug 12
Posts: 37
Credit: 170965842
RAC: 0

Gary Roberts wrote:
MarkHNC wrote:
Is the client version in any way responsible for the estimate taking so long to "settle?"

The easiest way to help your BOINC client deal with the situation is to reduce your work cache setting to a very low value (I'd be using 0.1 days) until the backlog is cleared.  

 

I have minimum work buffer of 0.50 days and max additional work buffer of 1.0 days.  I assumed I would have to do so in order to receive larger/longer running work units (both here and at WCG).  Wouldn't a tiny buffer prevent me from getting longer running units?  It doesn't appear to have downloaded any new work units since it went into panic mode.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7221554931
RAC: 967170

MarkHNC wrote:
Wouldn't a tiny buffer prevent me from getting longer running units? 

No, it would not prevent you from getting work.

With current deadlines, task types, and the duration-estimating characteristics of the BOINC client, people who want to mix this particular new CPU work with GPU work on Einstein need either to set a very short queue length, or to be prepared to spend a lot of time manually adjusting back and forth, in effect using a different queue length setting for GPU tasks than for CPU tasks.

Somewhat to my unpleasant surprise, the current BOINC client seems to trip into high priority mode well over a day before the task deadline, even with a CPU task load the machine could easily finish on time. This "feature", combined with the already known DCF mismatch problem between work types, really drives down the acceptable queue length.

At the moment I follow a once-per-day CPU task fetch procedure: I first lower the queue length to 0.2 days on one machine and 0.3 days on another, and only then enable CPU task fetch. When new work has come on board, I disable CPU task fetch again, then raise my queue length back to my desired two days. The specific numbers are particular to my specific machines, and I agree with Gary that for a first shot 0.1 days is a really good idea, especially if you want a set-and-forget situation.

This will all get somewhat better when the project raises the deadline for these CPU tasks from the current short 5 days, but it is not going to get really good unless a brave new BOINC really handles a mix of different performing work types much better regarding run time estimation.  Don't hold your breath for that one.

MarkJ
Joined: 28 Feb 08
Posts: 437
Credit: 139002861
RAC: 0

Ryzen 1700 at stock clocks getting Lo only. Current estimate 5+ hours. Will see how they go. Hopefully they can get some stats out of the tuning run.

It's this host.

Nick
Joined: 12 Oct 13
Posts: 27
Credit: 8949649
RAC: 0

AGENTB

 

The X5570 also does not support AVX, so that may also be a factor.

I am running tasks that are labeled AVX.

 

 

Nick
Joined: 12 Oct 13
Posts: 27
Credit: 8949649
RAC: 0

Regarding DCF

Is this why, after all these years, BOINC still can't figure out that after spending 4 hours completing 75% of a task it thinks it still needs 12 more hours to complete the remaining 25%? This has annoyed me for years, wondering why it's never been fixed.

 

 

 

 

 

Jim1348
Joined: 19 Jan 06
Posts: 463
Credit: 257957147
RAC: 0

Nick_43 wrote:

Regarding DCF

Is this why, after all these years, BOINC still can't figure out that after spending 4 hours completing 75% of a task it thinks it still needs 12 more hours to complete the remaining 25%? This has annoyed me for years, wondering why it's never been fixed.

You can use the <fraction_done_exact/> option in an app_config.xml file to fix this, as Retvari Zoltan explains on GPUGrid.

https://einsteinathome.org/comment/reply/207795/158291?quote=1#comment-form

I don't know why this is not the default.  The BOINC scheduler is beyond understanding.
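For anyone who would rather generate the file than type it by hand, here is a minimal sketch that builds an app_config.xml with `<fraction_done_exact/>` set per application. The helper `build_app_config` and the app name `example_app_name` are placeholders invented for this example; the real name has to match what your client reports for the application, and the file is expected in that project's folder under the BOINC data directory.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_names):
    """Return app_config.xml text with fraction_done_exact for each app."""
    root = ET.Element("app_config")
    for name in app_names:
        app = ET.SubElement(root, "app")
        ET.SubElement(app, "name").text = name
        # Empty element; serialized as <fraction_done_exact />.
        ET.SubElement(app, "fraction_done_exact")
    return ET.tostring(root, encoding="unicode")


# "example_app_name" is a hypothetical placeholder, not a real Einstein
# application name. Substitute the name your client shows for the app.
xml_text = build_app_config(["example_app_name"])
print(xml_text)
```

The client has to re-read its config files (or be restarted) before a new app_config.xml takes effect.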

 

 

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7221554931
RAC: 967170

Jim1348 wrote:

You can use the <fraction_done_exact/> option in an app_config.xml file

<snip>

I don't know why this is not the default.  The BOINC scheduler is beyond understanding.

I'm relying on Retvari Zoltan's description.  If true, with this option selected the time-to-go estimate displayed by BOINC completely ignores work content input provided by the project with a work unit, and also ignores any locally acquired history of behavior information, instead relying purely on the current fraction complete as provided by the application and time spent so far, as logged by BOINC.  So the name of the option is deeply misleading.
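Taken at face value, that description boils down to a simple proportion. The sketch below is only my reading of it, not the client's actual code, and the function name `remaining_from_fraction` is made up for the example. It uses Nick's figures from earlier in the thread (4 hours elapsed, 75% done).

```python
def remaining_from_fraction(elapsed_s, fraction_done):
    """Estimate time to go from elapsed time and fraction complete only."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    # Projected total runtime is elapsed / fraction_done; subtract elapsed.
    return elapsed_s * (1.0 - fraction_done) / fraction_done


elapsed = 4 * 3600  # 4 hours spent so far
remaining = remaining_from_fraction(elapsed, 0.75)
print(f"{remaining / 3600:.2f} hours to go")
```

The result is only as good as the fraction complete the application reports: if progress is not linear in time, this estimate will be off by the same amount.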

The degree to which this is better or worse than the default will be strongly application and configuration dependent.  A given user may like it better or worse, depending on their personal weighting of accuracy in different situations.

If you are working with applications which give excellent fraction complete estimates in all situations of interest to you, this option may very well be greatly superior for you.

I, personally, run non-identical GPU units on a host. This option may well considerably aid time to completion accuracy for me, which is badly compromised both by DCF wander depending on which GPU has recently reported work, and by the inaccurate per-GPU relative speed estimate employed, and I think I'll give it a try.

But I doubt this influences the calculation done to estimate work in queue, so I doubt it will help the CPU/GPU mismatch fetching problems.
