Ghost WU and resending lost results

Metod, S56RKO

Joined: 11 Feb 05

Posts: 135

Credit: 826491156

RAC: 86243

RE: RE: I haven't had any

29 Jul 2005 11:04:21 UTC

Message 14751 in response to message 14750


Quote:

Quote:
I haven't had any resent to me (I think), but from looking at the posts here and the change notes I think there is a check missing. If the lost workunits were caused by a project reset, they should not be resent. Probably this should apply to merged hosts as well.

I think this needs to be added because in either of those cases there was a problem that may have even been caused by the workunit that is being resent. If so that workunit will most likely cause the same problem again and we get into a cycle of resetting and resending.

I see the point. But I'm not sure about this. After all a user can always ABORT a workunit that is problematic, to get rid of it.

Even if (s)he doesn't abort it (not everyone babysits their BOINC installations), it will eventually pass the deadline if it is not reset after the WU is resent.

Metod ...

Bruce Allen

Moderator

Joined: 15 Oct 04

Posts: 1119

Credit: 172127663

RAC: 0

RE: RE: RE: I haven't

29 Jul 2005 12:07:04 UTC

Message 14752 in response to message 14751


Quote:

Quote:
Quote:
I haven't had any resent to me (I think), but from looking at the posts here and the change notes I think there is a check missing. If the lost workunits were caused by a project reset, they should not be resent. Probably this should apply to merged hosts as well.

I think this needs to be added because in either of those cases there was a problem that may have even been caused by the workunit that is being resent. If so that workunit will most likely cause the same problem again and we get into a cycle of resetting and resending.

I see the point. But I'm not sure about this. After all a user can always ABORT a workunit that is problematic, to get rid of it.

Even if (s)he doesn't abort it (not everyone babysits their BOINC installations), it will eventually pass the deadline if it is not reset after the WU is resent.

Agreed.

Director, Einstein@Home

Grenadier

Joined: 9 Feb 05

Posts: 14

Credit: 2823344

RAC: 0

I just got a pile of these on

29 Jul 2005 12:59:43 UTC

Message 14753


I just got a pile of these on one of my hosts. However, the deadline is set to tomorrow. I'm not sure how 48-70 hours of work is supposed to get done in 36 hours or so. Shouldn't the deadlines be reset on any of these resent units such that the host has a chance of catching up?

Bruce Allen

Moderator

Joined: 15 Oct 04

Posts: 1119

Credit: 172127663

RAC: 0

RE: I just got a pile of

29 Jul 2005 13:58:06 UTC

Message 14754 in response to message 14753


Quote:

I just got a pile of these on one of my hosts. However, the deadline is set to tomorrow. I'm not sure how 48-70 hours of work is supposed to get done in 36 hours or so. Shouldn't the deadlines be reset on any of these resent units such that the host has a chance of catching up?

I suggest that you abort the workunits which can't be finished in time. Then do 'update project' to report the aborted WU to the server. This way, new WU can be issued and your computer won't spend a long time doing work that's overdue.

Any idea how this work got lost??

Cheers,
Bruce

Director, Einstein@Home

Grenadier

Joined: 9 Feb 05

Posts: 14

Credit: 2823344

RAC: 0

RE: I suggest that you

29 Jul 2005 14:18:29 UTC

Message 14755 in response to message 14754


Quote:

I suggest that you abort the workunits which can't be finished in time. Then do 'update project' to report the aborted WU to the server. This way, new WU can be issued and your computer won't spend a long time doing work that's overdue.

Any idea how this work got lost??

I went through as another poster had suggested and aborted the ones that already had been granted credit, figuring the remaining ones would be useful, at least.

I now have this on 2 of my 20 hosts, with those 2 having 8-10 WUs each. All expirations are less than 48 hours away.

As for how they got lost, I was going to ask about that. One of the affected hosts is a new PC I got a week ago. It's only been attached to the project for a week, and I don't see how it could have this many ghosts associated with it. Is it possible that the new code is seeing WUs from another host?

Alternatively, is it possibly marking WUs as ghosts that are really in the machine's Work Unit Data File, but just hadn't been assigned to the machine as actual WUs yet?

Grenadier

Joined: 9 Feb 05

Posts: 14

Credit: 2823344

RAC: 0

By the way, this new machine

29 Jul 2005 14:20:33 UTC

Message 14756


By the way, this new machine has CC 4.45, and that's the only client it's ever had.

Peter

Joined: 6 Jul 05

Posts: 7

Credit: 6412725

RAC: 0

RE: Any idea how this work

29 Jul 2005 14:25:30 UTC

Message 14757 in response to message 14754


Quote:

Any idea how this work got lost??

I also had a bunch of work get lost and re-sent to one of my hosts. What probably happened to me was that my ADSL account had exceeded its quota for the month, at which point international bandwidth drops below 1 KB/s. The client probably managed to contact the server and request new work, but was unable to transfer the WUs (2 days' worth). That's what I suspect, anyway.

I was glad to see them re-sent though, I hate failing anything ;)

PS If that is what happened, wouldn't it be good to have the client return an acknowledgement of receipt before the work is marked as 'In Progress'?
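To make the idea concrete, here is a minimal sketch of such a handshake. It is purely illustrative: the `Scheduler` class, its states and its method names are made up for this post and are not how the BOINC server actually tracks results.

```python
from enum import Enum


class State(Enum):
    UNSENT = "unsent"
    AWAITING_ACK = "awaiting ack"
    IN_PROGRESS = "in progress"


class Scheduler:
    """Hypothetical server-side bookkeeping, not BOINC's real data model."""

    def __init__(self, result_names):
        self.state = {name: State.UNSENT for name in result_names}

    def assign(self, name):
        # The result has been handed to a host, but the host has not
        # yet confirmed that it received the files.
        self.state[name] = State.AWAITING_ACK

    def acknowledge(self, name):
        # Only an acknowledged result counts as 'In Progress'.
        if self.state[name] is State.AWAITING_ACK:
            self.state[name] = State.IN_PROGRESS

    def expire_unacknowledged(self):
        # Results that were never acknowledged go back to the pool
        # instead of lingering as ghosts until their deadline.
        lost = [n for n, s in self.state.items() if s is State.AWAITING_ACK]
        for n in lost:
            self.state[n] = State.UNSENT
        return lost
```

With this scheme, work that never reached the host is freed after a short acknowledgement timeout rather than after the full deadline, at the cost of one extra round trip per request.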

Metod, S56RKO

Joined: 11 Feb 05

Posts: 135

Credit: 826491156

RAC: 86243

On a slightly different

29 Jul 2005 15:18:09 UTC

Message 14758


On a slightly different topic: is it possible that the DL server has slight problems from time to time?

Just today I installed BOINC on another cruncher and attached it to the E@H project. It downloaded all the needed files fine except for the science app (exe and pdb). Because of that it trashed two WUs. On the next try two more WUs were assigned and the exe downloaded fine, but downloading the pdb file failed, which trashed another two WUs. The pdb file transferred fine just a moment later, but by then the host had used up its daily quota (4, as it is a new host), leaving it without E@H work until tomorrow.

Metod ...

Grenadier

Joined: 9 Feb 05

Posts: 14

Credit: 2823344

RAC: 0

I just looked over my second

29 Jul 2005 16:01:44 UTC

Message 14759


I just looked over my second host that got these units, and realized it had 8 WUs all due in 7 hours. I ended up aborting all but the currently executing unit, since none of the others will finish on time.

If you're going to resend these ghost units to their original hosts, there either needs to be a much longer deadline, or a throttle on how many get sent. Otherwise, you're just causing more missed deadlines and aborted units.

How about just marking the ghost units as aborted/comp error/not returned/etc, and then throwing them back in the queue for the next user to pick up in the normal course of business? I know other projects resubmit units that for whatever reason never got a quorum. Shouldn't this be handled the same way?
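Roughly what I have in mind, as a sketch only. The function name, the per-WU run time estimates and the single shared deadline are all assumptions for illustration; I have no idea what the real scheduler code looks like.

```python
def plan_resends(ghosts, hours_to_deadline):
    """Split ghost units into those worth resending and those to requeue.

    ghosts: list of (name, estimated_hours) pairs for one host.
    hours_to_deadline: time left before the (shared) deadline.
    Returns (resend, requeue): resend only as many units as the host can
    crunch back to back before the deadline; everything else goes back
    into the general queue for another host.
    """
    resend, requeue = [], []
    budget = hours_to_deadline
    for name, estimated_hours in ghosts:
        if estimated_hours <= budget:
            resend.append(name)
            budget -= estimated_hours
        else:
            requeue.append(name)
    return resend, requeue
```

For example, with 8 ghosts of about 6 hours each (a made-up estimate) and 7 hours to the deadline, only the first unit would be resent and the other 7 would be requeued.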

Ziran

Joined: 26 Nov 04

Posts: 194

Credit: 615123

RAC: 1329

Then reading through this

29 Jul 2005 17:33:09 UTC

Message 14760


When reading through this thread, one suggestion comes to my mind. If a workunit already has a quorum and has been validated, is there any point in resending that result? Isn't it better to automatically mark those results with an error, so they can be removed from the database faster?
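As a sketch of that check (the function and its arguments are hypothetical, just to show the idea, and are not taken from the BOINC server code):

```python
def filter_resends(lost_results, validated_workunits):
    """Decide which lost results are still worth resending.

    lost_results: list of (result_name, workunit_name) pairs.
    validated_workunits: set of workunit names that already have a
        quorum and have been validated.
    Returns (resend, mark_error).
    """
    resend, mark_error = [], []
    for result_name, workunit_name in lost_results:
        if workunit_name in validated_workunits:
            # Nothing left to gain from this result; flag it so it can
            # be cleaned out of the database.
            mark_error.append(result_name)
        else:
            resend.append(result_name)
    return resend, mark_error
```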

When you're really interested in a subject, there is no way to avoid it. You have to read the Manual.
