David Anderson and I made modifications to the BOINC scheduler which are designed to resend WU to hosts which have lost them. This only works if you are running a recent client (>=4.45 I think).
Currently any WU which are supposed to be on your machine and which are NOT reported as being there are resent. This is accompanied by a message of the form:
Resent lost result w1_0399.5__0399.6_0.1_T09_S4hA_0
Currently any 'missing' results are sent, even if they are close to deadline.
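The check described above amounts to a set difference between what the server believes is on the host and what the client reports. Here is a minimal sketch of that idea, not the actual scheduler code: the function name `find_lost_results` and the second result name are made up for illustration.

```python
def find_lost_results(server_in_progress, client_reported):
    """Return results the server thinks are on the host but the client did not report."""
    reported = set(client_reported)
    # Keep the server's ordering so the resend list is deterministic.
    return [name for name in server_in_progress if name not in reported]


# Hypothetical data: the server lists two results, the client reports only one.
server_side = [
    "w1_0399.5__0399.6_0.1_T09_S4hA_0",
    "w1_0400.0__0400.1_0.1_T09_S4hA_1",  # invented name, for illustration only
]
client_side = ["w1_0400.0__0400.1_0.1_T09_S4hA_1"]

for name in find_lost_results(server_side, client_side):
    print(f"Resent lost result {name}")
```

Note that this sketch, like the current behaviour, does not look at the deadline at all.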
Please report good and/or bad experiences with this feature in this thread.
Bruce
Director, Einstein@Home
Ghost WU and resending lost results
In the case where a resent WU is close to its deadline, will the client recognize this and go into EDF mode? Or do you have any data on this possibility yet?
Jim
RE: David Anderson and I
Bruce, it works real well (thanks!) but for one thing. Maybe it's an unintentional side effect, based on what you wrote - when it sends the WU again, it assigns a new completion date. So when the result is close to deadline, you can "reset" the project and get the same work back but with another week to go.
Here's a line from my results list:
Before WU was resent:
6906875 1643286 28 Jul 2005 20:14:59 UTC 4 Aug 2005 20:14:59 UTC In Progress Unknown New
After WU was resent:
6906875 1643286 28 Jul 2005 20:22:29 UTC 4 Aug 2005 20:22:29 UTC In Progress Unknown New
[EDIT]
The original deadline was like 28 JUL 2005 14:42:32. I noticed after testing that the deadline showed one week to the second after the re-download. Tried again, but first saved off the "old" deadline as reported by the server.
Walt
RE: RE: David Anderson
Walt,
Good catch -- I'm going to have to change your status to 'Developer'!!
Now that you point this out it's obvious that this is how our code works. But it wasn't what I intended. I'll have to fix it, else results will never time out for misconfigured hosts that never get the work.
Any reason that I shouldn't fix this?
[EDIT 10 minutes later]
Walt, I've fixed this. Now when results are resent the 'sent_time' and 'report_deadline' in the database are left unchanged.
[EDIT 5 minutes later]
I wonder if I should update 'sent_time' but NOT 'report_deadline'. This way the result will still time out OK but it'll be obvious from the database that it has been re-sent one or more times. Thoughts??
Director, Einstein@Home
RE: Walt, Good catch --
That's OK with me. Would I get access to the source? It would make it easier to get a handle on those pesky 0xC0000005 bugs :)
Tried it with WUs that "expire" on Sunday, works fine now.
That's a good idea and would be very useful. For instance:
It lets other people looking at the WU know it was resent, and when, for example for WUs that miss the deadline. For the user, it's an indication that it was resent if they missed the message (it's only there until BOINC ends). For people answering questions/problems on the forums, it alerts them that the WU was resent, perhaps too late. For project admin/dev people, it can be used to track "missed deadlines".
I just got a huge batch of these lost results resent to my host.
Now I'm in a kind of dilemma about whether I like the deadline being reset (I got them before Bruce changed the behaviour). Originally they were due 2 to 7 hours after they were resent, so they would have timed out anyway (as some 50 did). Now I have the opportunity to crunch them down. Some of them will time out anyhow, as I have about 12 days' worth of them ...
Perhaps I'm better off just to abort them?
As to why I have so many: BOINC started to misbehave about a week ago. Eventually I detached the project and re-attached. The host got a new ID ... so far so good. Then I obviously made the mistake of merging the two records together.
Metod ...
RE: Perhaps I'm better off
My heart was bleeding when I saw the pile of WUs ... I aborted all that already have quorum and validated.
The pile is a bit shorter now. I just have to wait for BOINC to report them back to the server as aborted.
Metod ...
RE: RE: Perhaps I'm
Thank you for this post and the previous one as well. I hadn't realized that when merging hosts, the new 'child' host would get any work that had been sent to the 'parent' hosts, and which was not on the child host.
I intend to watch this thread and 'tweak' the behavior of this re-send mechanism over the coming days. [For example, if the result which would be re-sent is already close to the deadline, I could mark it as an error and generate a new result instead (which would go to some other host).] But I would like to keep this mechanism as simple as possible for the moment, so for now I just plan to 'watch and wait'.
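The possible refinement mentioned in brackets could look something like the following. This is only a sketch under stated assumptions: the 24-hour cutoff, the function name `resend_decision`, and the return labels are hypothetical, and nothing like this is in the scheduler at the moment.

```python
from datetime import datetime, timedelta

# Hypothetical cutoff: the thread does not specify what "close to the deadline" means.
MIN_REMAINING = timedelta(hours=24)


def resend_decision(now, report_deadline, min_remaining=MIN_REMAINING):
    """Return 'resend' to send the lost result again, or 'replace' to mark it
    as an error and generate a new result for some other host."""
    if report_deadline - now < min_remaining:
        return "replace"
    return "resend"


# A result due in 5 hours, similar to the 2-to-7-hour cases reported above.
now = datetime(2005, 7, 29, 12, 0, 0)
print(resend_decision(now, now + timedelta(hours=5)))
print(resend_decision(now, now + timedelta(days=4)))
```

With the assumed cutoff, the first call chooses to replace the result and the second chooses to resend it.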
If you have suggestions about changes or refinements to this mechanism, please post them here.
Bruce
Director, Einstein@Home
I've made an additional change as Walt and I discussed.
For results that are re-sent, the REPORT DEADLINE is left unchanged. However, I update the SENT TIME when the result is resent. Thus if
(REPORT_DEADLINE-SENT_TIME) is less than 7 days
it means that the work was resent one or more times.
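As a rough illustration of that rule (a sketch only, using plain dictionaries rather than the real database records; the helper names `resend` and `was_resent` are hypothetical):

```python
from datetime import datetime, timedelta

NORMAL_WINDOW = timedelta(days=7)  # the 7-day window quoted above


def resend(result, now):
    """Record a resend: refresh sent_time, leave report_deadline untouched."""
    updated = dict(result)
    updated["sent_time"] = now
    return updated


def was_resent(result, window=NORMAL_WINDOW):
    """True if the gap between deadline and sent time is shorter than the window."""
    return result["report_deadline"] - result["sent_time"] < window


original = {
    "sent_time": datetime(2005, 7, 28, 20, 14, 59),
    "report_deadline": datetime(2005, 8, 4, 20, 14, 59),
}
after = resend(original, datetime(2005, 8, 2, 9, 0, 0))
print(was_resent(original), was_resent(after))
```

The first record spans exactly 7 days, so it is not flagged; the resent copy keeps the same deadline with a later sent time, so it is.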
Director, Einstein@Home
I haven't had any resent to me (I think), but from looking at the posts here and the change notes I think there is a check missing. If a project reset is what caused the lost workunits, they should not be resent. Probably this should apply to merged hosts as well.
I think this needs to be added because in either of those cases there was a problem that may even have been caused by the workunit that is being resent. If so, that workunit will most likely cause the same problem again and we get into a cycle of resetting and resending.
BOINC WIKI
BOINCing since 2002/12/8
RE: I haven't had any
I see the point. But I'm not sure about this. After all a user can always ABORT a workunit that is problematic, to get rid of it.
Director, Einstein@Home