Scheduler is an idiot

keputnam

Joined: 18 Jan 05

Posts: 47

Credit: 87530059

RAC: 83454

9 Jul 2013 21:44:21 UTC

Topic 197049

Einstein just downloaded a GRP Search #2 1.10 WU with a deadline of 2013-07-13 09:40 and an estimated runtime of 255 hours

Based on my knowledge of my own machine, there is not enough real-world time to complete it, even if that WU uses all of one HyperThreaded "CPU" for the entire time (and we all know that that situation will never happen, right?)

If it gets re-assigned to someone else with a much faster machine, and they get it done and back before I do, will I still receive any credit, or should I just go ahead and abort the WU?

Alinator

Joined: 8 May 05

Posts: 927

Credit: 9352143

RAC: 0

9 Jul 2013 21:52:45 UTC

Message 117108

If you overrun the deadline you have to get your task back before the resend host does to get credit.

That being said, I'd give it some time before aborting to see if the RTE comes more into line. None of my P4's has ever missed a deadline unless something extraordinary occurred.

I'd expect something more like 50 hours or so RT for yours, based on my 2.8GHz P4's last GRPS2 task.

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4350

Credit: 253898194

RAC: 35434

10 Jul 2013 2:34:26 UTC

Message 117109

The only E@H task your computer got has a deadline of July 20 and was already sent on July 6.

keputnam

Joined: 18 Jan 05

Posts: 47

Credit: 87530059

RAC: 83454

10 Jul 2013 5:15:32 UTC

Message 117110 in response to message 117109

Bernd

Sorry, you are correct about the deadline of 20 July, but the fact remains that this computer, in all likelihood, will NOT finish before then

After 18.75 hours, the total estimated runtime is now up to 265.5 hours

Since there is no way that the WU will ever get 100% of one "CPU", that still puts me over the deadline by at least half a day

keputnam

Joined: 18 Jan 05

Posts: 47

Credit: 87530059

RAC: 83454

10 Jul 2013 5:28:53 UTC

Message 117111 in response to message 117108

Alinator

This machine has a single 3.0 GHz Hyper-Threaded processor, so each "CPU" is really only about 1.35 to 1.4 GHz

And as stated in my reply to Bernd, after 18.75 hours, the total estimated run-time is now *up* to 265.5 hours

This gross underestimation of run-time seems to be endemic on this machine. Most WUs seem to have an original estimate anywhere from 10-25% below the final total run-time

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5888

Credit: 119846686029

RAC: 25904129

10 Jul 2013 9:55:03 UTC

Message 117112 in response to message 117111

Quote:

This machine has a single 3.0 GHz Hyper-Threaded processor, so each "CPU" is really only about 1.35 to 1.4 GHz

I'm not sure how you work that out but if the two virtual CPUs combined give 'less' than the single real CPU, why are you running hyperthreaded? P4 style hyperthreading was pretty lousy but my experience way back then was that you could get about 10-20%+/- extra output when HT was enabled.

Quote:

And as stated in my reply to Bernd, after 18.75 hours, the total estimated run-time is now *up* to 265.5 hours

You seem to be quite angry about this as if somebody could just flick a magic switch and, presto, all your new tasks would have perfect estimates. Have you considered for a moment, how big a problem it is to correctly frame the estimated crunch time when there is such a wide diversity of platforms out there? If the task parameters were tweaked to make the estimate 'correct' for your P4, all the hosts using rather more modern architecture, would get tasks with way overblown estimates.

So it's a juggling act - pick a suitable compromise so that the bulk of 'average' hosts are about right, while newer and more efficient hosts get estimates that are too long and older and less efficient hosts like yours get estimates that are too short.

And apart from all that, it's the BOINC client's responsibility to fine tune the estimates (through the Duration Correction Factor, DCF) so that you end up with progressively better and better estimates over time. There's no feasible way for internal parameters of individual tasks to be 'fine tuned' on the server prior to sending, in order to 'match' the expected performance characteristics of the requesting host.

Quote:

This gross underestimation of run-time seems to be endemic on this machine. Most WUs seem to have an original estimate anywhere from 10-25% below the final total run-time

You call 10-25% error in the estimate "gross"? I'd call it "pretty close" and "fair and reasonable", bearing in mind the difficulties in framing that estimate. If you run enough tasks of the same type, BOINC will automatically adjust the DCF so that eventually new tasks will arrive with much closer estimates.
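To illustrate the idea of a client-side correction factor, here is a minimal sketch in Python. It is not the BOINC client's actual code: the function names, the smoothing value and the update rule (raise the factor at once when a task runs long, lower it gradually when tasks run short) are assumptions chosen only to show how repeated tasks of the same type pull the estimate toward reality.

```python
def corrected_estimate(raw_estimate_hours: float, dcf: float) -> float:
    """Estimate shown to the user: the server-side estimate scaled by the DCF."""
    return raw_estimate_hours * dcf


def update_dcf(dcf: float, raw_estimate_hours: float, actual_hours: float,
               smoothing: float = 0.1) -> float:
    """Move the DCF toward the observed actual/raw ratio.

    Assumed rule, for illustration only: jump up immediately when a task
    runs longer than predicted, drift down slowly when it runs shorter.
    """
    ratio = actual_hours / raw_estimate_hours
    if ratio > dcf:
        return ratio
    return dcf + smoothing * (ratio - dcf)


dcf = 1.0
raw = 100.0  # hypothetical server-side estimate, in hours
for actual in (120.0, 118.0, 121.0):  # hypothetical measured run times
    dcf = update_dcf(dcf, raw, actual)
    print(f"DCF = {dcf:.3f}, next estimate = {corrected_estimate(raw, dcf):.1f} h")
```

With a rule like this, a host that consistently runs 10-25% over the raw estimate would see its displayed estimates catch up after the first completed task of that type.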

I was quite surprised that your host is going to take around 300 hours to crunch a current FGRP2 task. There seems to be something a bit odd about that. I have an old laptop with a 2.16GHz Celeron 585 single CPU which is taking around 25 hours per task. I can't see that your machine should be more than an order of magnitude worse.

Be that as it may, you have some options to improve your situation. It would be worthwhile to investigate why your machine is taking so long. If there's nothing you can do to improve its performance, you should set your preferences to opt out of the FGRP2 run. Instead, you could opt in to the Arecibo BRP4 run which has much shorter tasks for CPUs. That would seem to be a much more suitable match for you.

EDIT: The more I thought about it, the more convinced I was that a P4 shouldn't be taking the time that yours is. So here is a 2.8GHz P4 (admittedly not hyperthreaded) that is doing both BRP4 and FGRP2 tasks in around 15ksecs and 150ksecs respectively. That's just 4.2 hours and 42 hours respectively. On those figures, there's got to be something wrong with your setup.

Cheers,
Gary.

Alinator

Joined: 8 May 05

Posts: 927

Credit: 9352143

RAC: 0

10 Jul 2013 15:11:00 UTC

Message 117113 in response to message 117112

Agreed. Unless it's got a serious problem, any P4 should be able to make an FGRP deadline.

I mean if this old timer of mine can make it while running three other projects at equal resource share at the same time, I rest my case on that.

That being said, there are a lot of reasons why you can get messed-up ERTs: a bad benchmark run, something screwing up a project's DCF, etc. This might well apply in this case, since the problem with the FLOP estimate a couple of weeks ago did exactly that to the DCF on mine. Needless to say, all the ETRs ballooned into the upper atmosphere and created other issues until I went in and edited it back to reality.

In any event, you can do a from-the-hip manual calculation of the ERT this way:

Current ET * (1/Current % complete)

If that exceeds 14 days, then you really do have a problem.
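As a quick sketch of that calculation in Python: the function name and the sample numbers are hypothetical, and the progress value is taken as a percentage (0 to 100), as BOINC Manager displays it, so the formula becomes elapsed time multiplied by 100 / percent complete.

```python
def estimated_total_runtime(elapsed_hours: float, percent_complete: float) -> float:
    """Extrapolate total run time from elapsed time and reported progress.

    Assumes the science app reports progress roughly linearly.
    """
    if not 0 < percent_complete <= 100:
        raise ValueError("percent_complete must be in (0, 100]")
    return elapsed_hours * (100.0 / percent_complete)


# Hypothetical reading: 28 hours elapsed, 10.5% complete.
total_hours = estimated_total_runtime(28.0, 10.5)
print(f"Estimated total: {total_hours:.1f} h ({total_hours / 24:.1f} days)")
```

The result comes out in whatever unit the elapsed time was entered in, so divide hours by 24 before comparing against the 14-day limit.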

keputnam

Joined: 18 Jan 05

Posts: 47

Credit: 87530059

RAC: 83454

10 Jul 2013 16:29:57 UTC

Message 117114 in response to message 117112

"You seem to be quite angry about this as if somebody could just flick a magic switch and, presto, all your new tasks would have perfect estimates."

Silly me. I thought that was the whole purpose of running periodic benchmarks on each machine? And I never said anything about perfect estimates, but if the estimated run-time is THAT close to the deadline, it should never have been sent to me. None of the other projects seem to have that problem.

And I am fairly angry. If the estimate is as bad as it seems (after approx 28 hours, the total estimate seems to have settled on 266 hours), that means that over 10 days of processing will have been wasted when the third cruncher gets done before me. I know that I will never be anywhere near the top of the leader-board, but I *DO* want to get credit for my work.

"you should set your preferences to opt out of the FGRP2 run. Instead, you could opt in to the Arecibo BRP4 run which has much shorter tasks for CPUs"

I'll set that option, thanks. I had thought that I had deselected the long running tasks based on another thread. Guess I read it the wrong way

As far as tweaking the machine, I can't think of anything I can do. If you have any pointers they would be gladly accepted

Alinator

Joined: 8 May 05

Posts: 927

Credit: 9352143

RAC: 0

10 Jul 2013 16:48:40 UTC

Message 117115 in response to message 117114

Apologies if it seemed to come across that way, I'm not angry and I'm sure Gary isn't either.

It's just we are highly skeptical about your P4 not being able to make an FGRP task within the deadline. The corollary to that is if it can't make it, then it has problems which would make it difficult to get any EAH task in under the deadline.

First question to get to the bottom of this is, did you do the manual calculation of Estimated RunTime (ERT) I suggested?

keputnam

Joined: 18 Jan 05

Posts: 47

Credit: 87530059

RAC: 83454

10 Jul 2013 17:46:18 UTC

Message 117116 in response to message 117115

Alinator

Your equation gives a result of 19.5

So I should probably just go ahead and abort the WU, right??

I just ran Benchmark again and got this result

FWIW, when I connect to Malaria Control, I get the following

7/6/2013 9:40:26 AM | malariacontrol.net | Tasks won't finish in time: BOINC runs 94.3% of the time; computation is enabled 97.6% of that

Alinator

Joined: 8 May 05

Posts: 927

Credit: 9352143

RAC: 0

10 Jul 2013 18:35:32 UTC

Message 117117 in response to message 117116

Sorry, I had to go out for an appointment.

No, your manual calculation is based on what the science app is reporting as its progress, and is in the ballpark for what I would expect for your host. BTW, that manual calculation works for any project science app that updates its progress in a reasonable fashion.

However, your benchmarks are really low for a P4 and the message from Malaria Control is telling me the time statistics the BOINC Core Client (CC) keeps track of have gotten screwed up in a big way.
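For a rough feel of how those two time statistics affect a deadline check, here is a small Python sketch. It is not the core client's actual scheduling logic; the function names are made up for illustration, and the only numbers taken from this thread are the 94.3% and 97.6% figures in the Malaria Control message and the 266-hour estimate.

```python
def wall_clock_hours_needed(cpu_hours: float, on_fraction: float,
                            active_fraction: float) -> float:
    """Scale the CPU time by the share of wall-clock time BOINC can compute."""
    availability = on_fraction * active_fraction
    if not 0 < availability <= 1:
        raise ValueError("availability must be in (0, 1]")
    return cpu_hours / availability


def fits_deadline(cpu_hours: float, hours_until_deadline: float,
                  on_fraction: float, active_fraction: float) -> bool:
    """True if the task can finish before the deadline at this availability."""
    needed = wall_clock_hours_needed(cpu_hours, on_fraction, active_fraction)
    return needed <= hours_until_deadline


# Fractions from the log message: BOINC runs 94.3% of the time,
# computation is enabled 97.6% of that.
needed = wall_clock_hours_needed(266.0, 0.943, 0.976)
print(f"Wall-clock time needed: {needed:.1f} h ({needed / 24:.1f} days)")
```

If the recorded fractions are wrong, a check like this will be wrong in the same direction, which is why the time statistics are worth fixing along with the benchmarks.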

One way to take care of the bad benchmarks is to suspend all project work and shut down BOINC, then reboot the machine. Once you get back to the desktop, temporarily suspend any other background tasks you may have running. Then start BOINC and run the benchmarks again. Make sure you don't touch anything after starting them until they're done. That should give the best numbers possible.

If they still come out low, then we need to look into the particulars of your system to figure out what's going on. At that point, either way you can resume all your projects.
