BOINC weirdness again

David S
Joined: 6 Dec 05
Posts: 2473
Credit: 22936222
RAC: 0
Topic 197617

http://einsteinathome.org/workunit/191985985

For several days, I saw this task at the top of my task list in Boinc Manager, saying 100% complete, but still waiting to run. The deadline was 25 June, about 5:00 local (10:00 UTC). I decided I wouldn't worry about it, Boinc would take care of it before the deadline.

Well, this morning, I happened to take a look and that task was still sitting there, waiting to run at 100% complete, six hours past deadline.

I immediately set No New Tasks (NNT) for both Einstein and Seti, then suspended every task except that one (which oddly caused a bunch of Setis to start up and run for 1 second before calling themselves suspended, but I think that's a separate issue). The task ran for five or six minutes and then uploaded and reported. (At that point I resumed everything else and allowed new tasks.) As you can see, I now have credit for the task, but it had already gone out to a third host, which will waste time crunching it needlessly.

So I guess my question is, is there some minor glitch in the Boinc code that, first of all, allows tasks to sit there waiting to run at 100% complete but not actually finished for days at a time, and second, doesn't kick such a task into high priority to get done by deadline?

David

Miserable old git
Patiently waiting for the asteroid with my name on it.

tbret
Joined: 12 Mar 05
Posts: 2115
Credit: 4861254633
RAC: 36453

I don't know if that is an error or a "feature." I had a whole string of tasks unstarted by deadline while SETI tasks with deadlines a week later were running.
I aborted the tardy tasks, but I thought it was weird that June 20 tasks had not started and July 1 tasks were running.

I have not had the "stuck" at 100% tasks phenomenon, so it may have been something completely different and unrelated.

tbret
Joined: 12 Mar 05
Posts: 2115
Credit: 4861254633
RAC: 36453

I just read another thread elsewhere in which someone who knows says that the tasks are run First In - First Out regardless of deadline. So, they run by the time you received them, not the time they are due.

It's a feature.

That doesn't explain your hung task, however.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2950050410
RAC: 691198

That may have been me. They are *started* in FIFO order.

There is an override, in which tasks are marked "high priority" in BOINC Manager: we tend to call it 'EDF' for Earliest Deadline First on project message boards. That should have kicked in here, but clearly didn't.
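To make the FIFO-versus-EDF distinction concrete, here is a minimal Python sketch. It is not BOINC's actual scheduler code: the `Task` fields, the `pick_next` function and the simple "would it miss the deadline if it waited any longer" test are all assumptions made up for illustration, and the real client decides this in more detail.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    received: float   # hours since a reference point when the task arrived (hypothetical field)
    deadline: float   # hours since the same reference point
    remaining: float  # estimated hours of work still to do


def pick_next(tasks, now):
    """Pick the task to run next (illustrative only, not BOINC's algorithm).

    Normally the oldest-received task is started first (FIFO). If any task
    can no longer afford to wait, the at-risk task with the earliest
    deadline is chosen instead (EDF, shown as "high priority").
    """
    at_risk = [t for t in tasks if now + t.remaining >= t.deadline]
    if at_risk:
        return min(at_risk, key=lambda t: t.deadline)
    return min(tasks, key=lambda t: t.received)


tasks = [
    Task("seti_a", received=0.0, deadline=200.0, remaining=5.0),
    Task("einstein_b", received=10.0, deadline=20.0, remaining=6.0),
]
print(pick_next(tasks, now=13.0).name)  # seti_a: nothing at risk yet, FIFO applies
print(pick_next(tasks, now=14.0).name)  # einstein_b: it would miss its deadline, EDF applies
```

In the case reported above, the second branch is the one that should have fired for the Einstein task and apparently never did.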

tbret
Joined: 12 Mar 05
Posts: 2115
Credit: 4861254633
RAC: 36453

Quote:
That should have kicked in here, but clearly didn't.

I didn't take notes, Richard, but that fail-safe mechanism does seem to be failing here.

One of the things that would be interesting to know is if it has anything to do with my suspending and resuming a project.

As you know, I recently allowed SETI AP-tasks to accumulate to the 100 per device limit. Just to tweak a teammate's nose a little I wanted the SETI tasks to run first.

I kept an eye on the Einstein tasks as best I could and meant to re-enable those to finish in time to meet deadline. Sometimes I made it, sometimes I failed.

But I noticed that even when I failed to get that done in time and my expectation was for my machine to contact the server, discover there were tasks past deadline, and cancel those tasks with an "Error" message, it didn't happen.

In a few cases I had newly running tasks which were days late to meet their deadline. The time and electricity were going to be wasted. I aborted them.

Fortunately, I doubt the phenomenon is so wide-spread as to make investigating it and spending time pondering a fix worthwhile.

You have much bigger, much more important fish you are frying. If you can get those issues under control, then this behavior may be self-healing.

David S
Joined: 6 Dec 05
Posts: 2473
Credit: 22936222
RAC: 0

Quote:
Quote:
That should have kicked in here, but clearly didn't.

I didn't take notes, Richard, but that fail-safe mechanism does seem to be failing here.

One of the things that would be interesting to know is if it has anything to do with my suspending and resuming a project.

As you know, I recently allowed SETI AP-tasks to accumulate to the 100 per device limit. Just to tweak a teammate's nose a little I wanted the SETI tasks to run first.

I kept an eye on the Einstein tasks as best I could and meant to re-enable those to finish in time to meet deadline. Sometimes I made it, sometimes I failed.

But I noticed that even when I failed to get that done in time and my expectation was for my machine to contact the server, discover there were tasks past deadline, and cancel those tasks with an "Error" message, it didn't happen.

In a few cases I had newly running tasks which were days late to meet their deadline. The time and electricity were going to be wasted. I aborted them.

Fortunately, I doubt the phenomenon is so wide-spread as to make investigating it and spending time pondering a fix worthwhile.

You have much bigger, much more important fish you are frying. If you can get those issues under control, then this behavior may be self-healing.


I've made that boo-boo too, and I just kick myself for interfering when I should leave it alone. That's part of the reason I did leave it alone in this case, and where did it get me? Or rather, where did it get Mad_Doc, owner of the box doing the now-useless work? (Maybe this kind of thing is what drove him mad? :))

That same experience with interfering, multiple times, is how I knew to first put both projects on NNT before I suspended all the other tasks, so at least I don't now have a bunch of extra work that won't get done in time.

David

Miserable old git
Patiently waiting for the asteroid with my name on it.

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5893653
RAC: 204

Whatever the reason BOINC did this, we can't forward it to the developers and ask unless you manage to reproduce the problem with a BOINC 7.2 or 7.4 version.

6.10.60 had a whole different way of scheduling tasks, with all its own idiosyncrasies, and a scheduler with so many problems and errors included that developing it further would prove difficult.

Which is why BOINC 7 had all of its decision-making logic reprogrammed from the ground up, following a new scheme.

mikey
Joined: 22 Jan 05
Posts: 12658
Credit: 1839055224
RAC: 4425

Quote:
In a few cases I had newly running tasks which were days late to meet their deadline. The time and electricity were going to be wasted. I aborted them.

Seems to me the server should have canceled these for you because they were late. There is a policy in place that even if you send them in late, if you are still the first person to return them you get credit for them. So it might make sense, if you only have a few, to check the website and see whether a replacement unit has in fact been sent out and, if not, finish it quickly so you can still get credit for it. If, however, you try to crunch it and aren't the first person to send it back after it has been reissued, you are right: no credits for you and nothing but wasted time.
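That rule of thumb can be written down as a tiny Python sketch. This only restates the policy as described in this post; it is not server code, and the function name `late_result_gets_credit` and its parameters are hypothetical.

```python
def late_result_gets_credit(returned_at, deadline, replacement_returned_at=None):
    """Restate the policy described above (illustrative, not server code).

    All times are in the same arbitrary units. `replacement_returned_at`
    is None while the reissued copy has not been reported yet.
    """
    if returned_at <= deadline:
        # On time: the late-return question never arises.
        return True
    # Late: credit only if we still beat the reissued copy back.
    return replacement_returned_at is None or returned_at < replacement_returned_at


# Six hours late, replacement sent out but not yet returned: still credited.
print(late_result_gets_credit(returned_at=16, deadline=10))
# Late, and the replacement host already reported: no credit.
print(late_result_gets_credit(returned_at=16, deadline=10, replacement_returned_at=12))
```

The first call matches what happened at the top of this thread: the task came back late, but before the third host reported, so it was still credited.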

Another thing, as Jord said: why haven't you upgraded to the latest version yet? They're moving forward and the old ways are gone; we have to deal with the new Boinc, since we aren't making our own. It IS much better. I have had far fewer problems with the new versions than I had with the older versions.

David S
Joined: 6 Dec 05
Posts: 2473
Credit: 22936222
RAC: 0

Quote:
Another thing, as Jord said: why haven't you upgraded to the latest version yet? They're moving forward and the old ways are gone; we have to deal with the new Boinc, since we aren't making our own. It IS much better. I have had far fewer problems with the new versions than I had with the older versions.


Just what is so much better about it? Okay, this problem may or may not have been fixed. The only issue I commonly have is that after a restart, Boinc doesn't find my GPU and I have to stop Boinc Manager and restart it.

In other words, if the version I have is working fine for me 99.9% of the time, is there a good reason for me to upgrade it?

David

Miserable old git
Patiently waiting for the asteroid with my name on it.

Gavin
Joined: 21 Sep 10
Posts: 191
Credit: 40644337738
RAC: 3

Quote:

Just what is so much better about it? Okay, this problem may or may not have been fixed. The only issue I commonly have is that after a restart, Boinc doesn't find my GPU and I have to stop Boinc Manager and restart it.

In other words, if the version I have is working fine for me 99.9% of the time, is there a good reason for me to upgrade it?

Boinc 7 is obviously (one would hope!) an advancement of the platform and appears to be more system-aware (for want of a better word) than previous versions, particularly when it comes to GPU detection and scheduling. I am not the man to go into the deep technicalities, but from my experience the newer versions do in fact work better ;)

The new layout takes a little getting used to but soon becomes second nature and if it solves all your niggling issues, what is there to lose?

Try it :) Reverting to version 6 is not quite as problematic as many would say.

mikey
Joined: 22 Jan 05
Posts: 12658
Credit: 1839055224
RAC: 4425

Quote:
Quote:
Another thing, as Jord said: why haven't you upgraded to the latest version yet? They're moving forward and the old ways are gone; we have to deal with the new Boinc, since we aren't making our own. It IS much better. I have had far fewer problems with the new versions than I had with the older versions.

Just what is so much better about it? Okay, this problem may or may not have been fixed. The only issue I commonly have is that after a restart, Boinc doesn't find my GPU and I have to stop Boinc Manager and restart it.

In other words, if the version I have is working fine for me 99.9% of the time, is there a good reason for me to upgrade it?

The Change-logs are here:
http://boinc.berkeley.edu/dev/forum_forum.php?id=2

and while they can be long and tedious to read, it may help you in the end to see what they have been changing. They have changed A LOT of stuff between your version and the latest beta version; some changes helped, some made it worse. Those that made Boinc worse were tweaked until they too became more helpful. Is there more to do? Of course there is; we users ALWAYS want more, more and then still more!!
