Ghost WU and resending lost results

Bruce Allen
Moderator
Joined: 15 Oct 04
Posts: 1119
Credit: 172127663
RAC: 0
Topic 189626

David Anderson and I made modifications to the BOINC scheduler which are designed to resend WU to hosts which have lost them. This only works if you are running a recent client (>=4.45 I think).

Currently any WU which are supposed to be on your machine and which are NOT reported as being there are resent. This is accompanied by a message of the form:

Resent lost result w1_0399.5__0399.6_0.1_T09_S4hA_0

Currently any 'missing' results are sent, even if they are close to deadline.
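
As a rough illustration only (this is not the actual BOINC scheduler code, and the function and variable names are made up for the example), the check described above amounts to a set difference between what the server thinks is on the host and what the client reports:

```python
def find_lost_results(server_in_progress, client_reported):
    """Return the results the server believes are on the host
    but which the client did not report as present.

    Both arguments are iterables of result names (hypothetical
    inputs; the real scheduler works from its database and the
    client's scheduler request).
    """
    reported = set(client_reported)
    # Keep the server's ordering so messages come out predictably.
    return [name for name in server_in_progress if name not in reported]


if __name__ == "__main__":
    on_server = ["w1_0399.5__0399.6_0.1_T09_S4hA_0", "some_other_result_1"]
    on_client = ["some_other_result_1"]
    for name in find_lost_results(on_server, on_client):
        # Same message form as quoted above.
        print(f"Resent lost result {name}")
```

Note that nothing in this sketch looks at the deadline, which matches the current behaviour: every missing result is resent.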

Please report good and/or bad experiences with this feature in this thread.

Bruce

Director, Einstein@Home

Jim Baize
Joined: 22 Jan 05
Posts: 116
Credit: 582144
RAC: 0

Ghost WU and resending lost results

In the case where a resent WU is close to its deadline, will the client recognize this and go into EDF mode? Or do you have any data on this possibility yet?

Jim

Quote:

David Anderson and I made modifications to the BOINC scheduler which are designed to resend WU to hosts which have lost them. This only works if you are running a recent client (>=4.45 I think).

Currently any WU which are supposed to be on your machine and which are NOT reported as being there are resent. This is accompanied by a message of the form:

Resent lost result w1_0399.5__0399.6_0.1_T09_S4hA_0

Currently any 'missing' results are sent, even if they are close to deadline.

Please report good and/or bad experiences with this feature in this thread.

Bruce


Jim

Walt Gribben
Joined: 20 Feb 05
Posts: 219
Credit: 1645393
RAC: 0

RE: David Anderson and I

Quote:

David Anderson and I made modifications to the BOINC scheduler which are designed to resend WU to hosts which have lost them. This only works if you are running a recent client (>=4.45 I think).

Currently any WU which are supposed to be on your machine and which are NOT reported as being there are resent. This is accompanied by a message of the form:

Resent lost result w1_0399.5__0399.6_0.1_T09_S4hA_0

Currently any 'missing' results are sent, even if they are close to deadline.

Please report good and/or bad experiences with this feature in this thread.

Bruce

Bruce, it works really well (thanks!) but for one thing, maybe an unintentional side effect based on what you wrote: when it sends the WU again, it assigns a new completion date. So when the result is close to deadline, you can "reset" the project and get the same work back but with another week to go.

Here's a line from my results list:

Before WU was resent:
6906875 1643286 28 Jul 2005 20:14:59 UTC 4 Aug 2005 20:14:59 UTC In Progress Unknown New

After WU was resent:
6906875 1643286 28 Jul 2005 20:22:29 UTC 4 Aug 2005 20:22:29 UTC In Progress Unknown New

[EDIT]
The original deadline was like 28 JUL 2005 14:42:32. I noticed after testing that the deadline showed one week to the second after the re-download. Tried again, but first saved off the "old" deadline as reported by the server.

Walt

Bruce Allen
Moderator
Joined: 15 Oct 04
Posts: 1119
Credit: 172127663
RAC: 0

RE: RE: David Anderson

Message 14743 in response to message 14742

Quote:
Quote:

David Anderson and I made modifications to the BOINC scheduler which are designed to resend WU to hosts which have lost them. This only works if you are running a recent client (>=4.45 I think).

Currently any WU which are supposed to be on your machine and which are NOT reported as being there are resent. This is accompanied by a message of the form:

Resent lost result w1_0399.5__0399.6_0.1_T09_S4hA_0

Currently any 'missing' results are sent, even if they are close to deadline.

Please report good and/or bad experiences with this feature in this thread.

Bruce

Bruce, it works really well (thanks!) but for one thing, maybe an unintentional side effect based on what you wrote: when it sends the WU again, it assigns a new completion date. So when the result is close to deadline, you can "reset" the project and get the same work back but with another week to go.

Here's a line from my results list:

Before WU was resent:
6906875 1643286 28 Jul 2005 20:14:59 UTC 4 Aug 2005 20:14:59 UTC In Progress Unknown New

After WU was resent:
6906875 1643286 28 Jul 2005 20:22:29 UTC 4 Aug 2005 20:22:29 UTC In Progress Unknown New

[EDIT]
The original deadline was like 28 JUL 2005 14:42:32. I noticed after testing that the deadline showed one week to the second after the re-download. Tried again, but first saved off the "old" deadline as reported by the server.

Walt

Walt,

Good catch -- I'm going to have to change your status to 'Developer'!!

Now that you point this out it's obvious that this is how our code works. But it wasn't what I intended. I'll have to fix it, else results will never time out for misconfigured hosts that never get the work.

Any reason that I shouldn't fix this?

[EDIT 10 minutes later]
Walt, I've fixed this. Now when results are resent the 'sent_time' and 'report_deadline' in the database are left unchanged.

[EDIT 5 minutes later]
I wonder if I should update 'sent_time' but NOT 'report_deadline'. This way the result will still time out OK but it'll be obvious from the database that it has been re-sent one or more times. Thoughts??

Director, Einstein@Home

Walt Gribben
Joined: 20 Feb 05
Posts: 219
Credit: 1645393
RAC: 0

RE: Walt, Good catch --

Message 14744 in response to message 14743

Quote:

Walt,

Good catch -- I'm going to have to change your status to 'Developer'!!

That's OK with me. Would I get access to the source? It would make it easier to get a handle on those pesky 0xC0000005 bugs :)

Quote:

Now that you point this out it's obvious that this is how our code works. But it wasn't what I intended. I'll have to fix it, else results will never time out for misconfigured hosts that never get the work.

Any reason that I shouldn't fix this?

[EDIT 10 minutes later]
Walt, I've fixed this. Now when results are resent the 'sent_time' and 'report_deadline' in the database are left unchanged.

Tried it with WUs that "expire" on Sunday, works fine now.

Quote:
[EDIT 5 minutes later]
I wonder if I should update 'sent_time' but NOT 'report_deadline'. This way the result will still time out OK but it'll be obvious from the database that it has been re-sent one or more times. Thoughts??

That's a good idea and would be very useful. For instance:

It lets other people looking at the WU know that it was resent, and when, for example for WUs that miss the deadline. For the user, it is an indication that the WU was resent if they missed the message (it's only there until BOINC exits). For people answering questions or problems on the forums, it alerts them that the WU was resent, perhaps too late. For project admins and developers, it can be used to track "missed deadlines".

Metod, S56RKO
Joined: 11 Feb 05
Posts: 135
Credit: 809629934
RAC: 64317

I just got a huge batch of these lost results resent to my host.

Now I'm in something of a dilemma as to whether I like the deadline being reset (I got them before Bruce changed the behaviour). Originally they were due 2 to 7 hours after they were resent, so they would have timed out anyway (as some 50 did). Now I have the opportunity to crunch them. Some of them will time out anyhow, as I have about 12 days' worth of them ...
Perhaps I'm better off just to abort them?

As to why I have so many: BOINC started to misbehave about a week ago. Eventually I detached from the project and re-attached. The host got a new ID ... so far so good. Then I evidently made the mistake of merging the two host records together.

Metod ...

Metod, S56RKO
Joined: 11 Feb 05
Posts: 135
Credit: 809629934
RAC: 64317

RE: Perhaps I'm better off

Message 14746 in response to message 14745

Quote:
Perhaps I'm better off just to abort them?

My heart was bleeding when I saw the pile of WUs ... I aborted all that already have quorum and validated.

The pile is a bit shorter now. I just have to wait for BOINC to report them back to the server as aborted.

Metod ...

Bruce Allen
Moderator
Joined: 15 Oct 04
Posts: 1119
Credit: 172127663
RAC: 0

RE: RE: Perhaps I'm

Message 14747 in response to message 14746

Quote:
Quote:
Perhaps I'm better off just to abort them?

My heart was bleeding when I saw the pile of WUs ... I aborted all that already have quorum and validated.

The pile is a bit shorter now. I just have to wait for BOINC to report them back to the server as aborted.

Thank you for this post and the previous one as well. I hadn't realized that when merging hosts, the new 'child' host would get any work that had been sent to the 'parent' hosts, and which was not on the child host.

I intend to watch this thread and 'tweak' the behavior of this re-send mechanism over the coming days. [For example, if the result which would be re-sent is already close to the deadline, I could mark it as an error and generate a new result instead (which would go to some other host).] But I would like to keep this mechanism as simple as possible for the moment, so for now I just plan to 'watch and wait'.
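
Purely as a sketch of the idea in brackets above (it is not something that has been implemented, and the threshold and names here are hypothetical), the decision could look like this:

```python
from datetime import datetime, timedelta

# Hypothetical cutoff: how much time must remain before the deadline
# for a resend to be considered worthwhile. Not a value from the project.
MIN_TIME_LEFT = timedelta(hours=12)


def resend_decision(now, report_deadline, min_time_left=MIN_TIME_LEFT):
    """Decide what to do with a lost result.

    Returns "resend" if there is still enough time before the deadline,
    otherwise "replace", meaning: mark the lost result as an error and
    generate a new result that can go to some other host.
    """
    if report_deadline - now < min_time_left:
        return "replace"
    return "resend"


if __name__ == "__main__":
    deadline = datetime(2005, 8, 4, 20, 14, 59)
    print(resend_decision(datetime(2005, 8, 1, 9, 0, 0), deadline))
    print(resend_decision(datetime(2005, 8, 4, 18, 0, 0), deadline))
```

The first call leaves several days before the deadline, so the result would simply be resent; the second leaves only a couple of hours, so the result would be replaced.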

If you have suggestions about changes or refinements to this mechanism, please post them here.

Bruce

Director, Einstein@Home

Bruce Allen
Moderator
Joined: 15 Oct 04
Posts: 1119
Credit: 172127663
RAC: 0

I've made an additional change as Walt and I discussed.

For results that are re-sent, the REPORT DEADLINE is left unchanged. However, I update the SENT TIME when the result is resent. Thus if

(REPORT_DEADLINE-SENT_TIME) is less than 7 days

it means that the work was resent one or more times.
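
A minimal sketch of that rule, assuming the usual one-week deadline mentioned in this thread (the dictionary layout and function names are invented for the example; only 'sent_time' and 'report_deadline' correspond to the database fields discussed above):

```python
from datetime import datetime, timedelta

# Assumption: results normally get a 7-day deadline, as in this thread.
NORMAL_DEADLINE = timedelta(days=7)


def resend(result, now):
    """Resend a lost result: update sent_time, leave report_deadline alone."""
    result["sent_time"] = now
    return result


def was_resent(result, normal=NORMAL_DEADLINE):
    """True if the gap between deadline and sent time is shorter than normal."""
    return (result["report_deadline"] - result["sent_time"]) < normal


if __name__ == "__main__":
    result = {
        "sent_time": datetime(2005, 7, 28, 20, 14, 59),
        "report_deadline": datetime(2005, 8, 4, 20, 14, 59),
    }
    print(was_resent(result))  # exactly 7 days apart, so not flagged
    resend(result, datetime(2005, 8, 3, 12, 0, 0))  # hypothetical resend time
    print(was_resent(result))  # less than 7 days now, so flagged as resent
```

The check cannot say how many times a result was resent, only that it happened at least once, since each resend overwrites the previous sent time.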

Director, Einstein@Home

Keck_Komputers
Joined: 18 Jan 05
Posts: 376
Credit: 5744955
RAC: 0

I haven't had any resent to me (I think), but from looking at the posts here and the change notes I think there is a check missing. If a project reset is what caused the workunits to be lost, they should not be resent. Probably this should apply to merged hosts as well.

I think this needs to be added because in either of those cases there was a problem that may have even been caused by the workunit that is being resent. If so that workunit will most likely cause the same problem again and we get into a cycle of resetting and resending.

BOINC WIKI

BOINCing since 2002/12/8

Bruce Allen
Moderator
Joined: 15 Oct 04
Posts: 1119
Credit: 172127663
RAC: 0

RE: I haven't had any

Message 14750 in response to message 14749

Quote:

I haven't had any resent to me (I think), but from looking at the posts here and the change notes I think there is a check missing. If a project reset is what caused the workunits to be lost, they should not be resent. Probably this should apply to merged hosts as well.

I think this needs to be added because in either of those cases there was a problem that may have even been caused by the workunit that is being resent. If so that workunit will most likely cause the same problem again and we get into a cycle of resetting and resending.

I see the point. But I'm not sure about this. After all a user can always ABORT a workunit that is problematic, to get rid of it.

Director, Einstein@Home
