Ghost WU and resending lost results

Metod, S56RKO
Joined: 11 Feb 05
Posts: 135
Credit: 826479507
RAC: 85847

RE: RE: I haven't had any

Message 14751 in response to message 14750

Quote:
Quote:

I haven't had any resent to me (I think), but from looking at the posts here and the change notes, I think there is a check missing. If the project is reset, causing the lost workunits, they should not be resent. Probably this should apply to merged hosts as well.

I think this needs to be added because in either of those cases there was a problem that may have even been caused by the workunit that is being resent. If so that workunit will most likely cause the same problem again and we get into a cycle of resetting and resending.

I see the point. But I'm not sure about this. After all a user can always ABORT a workunit that is problematic, to get rid of it.

Even if (s)he doesn't abort it (not everyone babysits their BOINC installations), it will eventually pass the deadline iff it is not reset after the WU is resent.

Metod ...

Bruce Allen
Moderator
Joined: 15 Oct 04
Posts: 1119
Credit: 172127663
RAC: 0

RE: RE: RE: I haven't

Message 14752 in response to message 14751

Quote:
Quote:
Quote:

I haven't had any resent to me (I think), but from looking at the posts here and the change notes, I think there is a check missing. If the project is reset, causing the lost workunits, they should not be resent. Probably this should apply to merged hosts as well.

I think this needs to be added because in either of those cases there was a problem that may have even been caused by the workunit that is being resent. If so that workunit will most likely cause the same problem again and we get into a cycle of resetting and resending.

I see the point. But I'm not sure about this. After all a user can always ABORT a workunit that is problematic, to get rid of it.

Even if (s)he doesn't abort it (not everyone babysits their BOINC installations), it will eventually pass the deadline iff it is not reset after the WU is resent.

Agreed.

Director, Einstein@Home

Grenadier
Joined: 9 Feb 05
Posts: 14
Credit: 2823344
RAC: 0

I just got a pile of these on

I just got a pile of these on one of my hosts. However, the deadline is set to tomorrow. I'm not sure how 48-70 hours of work is supposed to get done in 36 hours or so. Shouldn't the deadlines be reset on any of these resent units such that the host has a chance of catching up?

Bruce Allen
Moderator
Joined: 15 Oct 04
Posts: 1119
Credit: 172127663
RAC: 0

RE: I just got a pile of

Message 14754 in response to message 14753

Quote:
I just got a pile of these on one of my hosts. However, the deadline is set to tomorrow. I'm not sure how 48-70 hours of work is supposed to get done in 36 hours or so. Shouldn't the deadlines be reset on any of these resent units such that the host has a chance of catching up?

I suggest that you abort the workunits which can't be finished in time. Then do 'update project' to report the aborted WU to the server. This way, new WU can be issued and your computer won't spend a long time doing work that's overdue.

Any idea how this work got lost??

Cheers,
Bruce

Director, Einstein@Home

Grenadier
Joined: 9 Feb 05
Posts: 14
Credit: 2823344
RAC: 0

RE: I suggest that you

Message 14755 in response to message 14754

Quote:

I suggest that you abort the workunits which can't be finished in time. Then do 'update project' to report the aborted WU to the server. This way, new WU can be issued and your computer won't spend a long time doing work that's overdue.

Any idea how this work got lost??

I went through as another poster had suggested and aborted the ones that already had been granted credit, figuring the remaining ones would be useful, at least.

I now have this on 2 of my 20 hosts, with those 2 having 8-10 WU's each. All expirations are less than 48 hours.

As for how they got lost, I was going to ask about that. One of the affected hosts is a new PC I got a week ago. It's only been attached to the project for a week, and I don't see how it could have this many ghosts associated with it. Is it possible that the new code is seeing WU's from another host?

Alternatively, is it possibly marking WU's as Ghosts that are really in the machine's Work Unit Data File, but just hadn't been assigned to the machine as actual WU's yet?

Grenadier
Joined: 9 Feb 05
Posts: 14
Credit: 2823344
RAC: 0

By the way, this new machine

By the way, this new machine has CC 4.45, and that's the only client it's ever had.

Peter
Joined: 6 Jul 05
Posts: 7
Credit: 6412725
RAC: 0

RE: Any idea how this work

Message 14757 in response to message 14754

Quote:

Any idea how this work got lost??

I also had a bunch of work get lost and re-sent to one of my hosts. What probably happened to me was that my ADSL account had exceeded its quota for the month, at which point international bandwidth drops to sub-1 KB/s levels. The client probably managed to contact the server and request new work, but was unable to transfer the WUs (2 days' worth). That's what I suspect, anyway.

I was glad to see them re-sent though, I hate failing anything ;)

PS If that is what happened, wouldn't it be good to have the client return an acknowledgement of receipt before the work is marked as 'In Progress'?
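
The acknowledgement idea above can be sketched as a small state machine. This is only an illustration in Python, not the BOINC scheduler's actual code: the `Result` class, the `SENT` state, and the `acknowledge` step are all hypothetical names introduced here.

```python
from enum import Enum


class ResultState(Enum):
    UNSENT = "unsent"
    SENT = "sent"                # server replied, receipt not yet confirmed
    IN_PROGRESS = "in progress"  # client confirmed it holds the work


class Result:
    """Hypothetical server-side record for one result (not a BOINC type)."""

    def __init__(self, name):
        self.name = name
        self.state = ResultState.UNSENT

    def send(self):
        # The scheduler reply has gone out, but the download may still fail.
        self.state = ResultState.SENT

    def acknowledge(self):
        # Only a result that was actually sent can be confirmed.
        if self.state is not ResultState.SENT:
            raise ValueError(f"{self.name}: cannot acknowledge from {self.state.name}")
        self.state = ResultState.IN_PROGRESS


def unacknowledged(results):
    """Names of results that were sent but never confirmed by the client."""
    return [r.name for r in results if r.state is ResultState.SENT]
```

Under this assumed scheme, anything still in `SENT` after a failed transfer could be handed out again without ever counting as 'In Progress' on the original host.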

Metod, S56RKO
Joined: 11 Feb 05
Posts: 135
Credit: 826479507
RAC: 85847

On a slightly different

On a slightly different topic: is it possible that the DL server has slight problems from time to time?

Just today I installed BOINC on another cruncher and attached it to the E@H project. It downloaded all the needed files fine except for the science app (exe and pdb). Because of that it trashed two WUs. The next try resulted in two more WUs being assigned and the exe downloading fine, but downloading the pdb file failed, trashing another two WUs. The pdb file transferred fine just a moment later, but by that time the host had used up its daily quota (4, as it is a new host), leaving it without E@H work until tomorrow.

Metod ...

Grenadier
Joined: 9 Feb 05
Posts: 14
Credit: 2823344
RAC: 0

I just looked over my second

I just looked over my second host that got these units, and realized it had 8 WU's all due in 7 hours. I ended up aborting all but the currently executing unit, since none of them will finish on time.

If you're going to resend these ghost units to their original hosts, there either needs to be a much longer deadline, or a throttle on how many get sent. Otherwise, you're just causing more missed deadlines and aborted units.

How about just marking the ghost units as aborted/comp error/not returned/etc, and then throwing them back in the queue for the next user to pick up in the normal course of business? I know other projects resubmit units that for whatever reason never got a quorum. Shouldn't this be handled the same way?
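
The throttle suggested above could look roughly like the following. It is a minimal sketch under stated assumptions, not the project's resend logic: the function name `select_resendable`, the single shared deadline, and the `busy_fraction` parameter (the share of wall-clock time the host actually crunches) are all made up for illustration.

```python
def select_resendable(estimated_hours, hours_until_deadline, busy_fraction=1.0):
    """Return the runtimes that fit, run back to back, before the deadline.

    estimated_hours: estimated runtime of each lost unit, in hours.
    hours_until_deadline: time left before the (shared) deadline, in hours.
    busy_fraction: assumed share of that time the host spends computing.
    """
    budget = hours_until_deadline * busy_fraction
    chosen = []
    used = 0.0
    # Shortest units first, so that as many as possible are recovered.
    for hours in sorted(estimated_hours):
        if used + hours > budget:
            break
        chosen.append(hours)
        used += hours
    return chosen
```

With eight hypothetical 7-hour units and 36 hours to go, `select_resendable([7] * 8, 36)` keeps five of them. The remaining three would go back into the normal queue instead of being resent to a host that cannot finish them.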

Ziran
Joined: 26 Nov 04
Posts: 194
Credit: 615123
RAC: 1329

When reading through this

When reading through this thread, one suggestion comes to mind. If a result's workunit already has a quorum and has been validated, is there any point in resending that result? Isn't it better to automatically mark those results with an error so they can be removed from the database faster?
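
A rough sketch of that check, assuming the server can look up which workunits have already been validated. The function name and the data layout are hypothetical, not BOINC's database schema.

```python
def triage_lost_results(lost_results, validated_workunits):
    """Split lost results into those worth resending and those to mark as errors.

    lost_results: iterable of (result_name, workunit_name) pairs.
    validated_workunits: set of workunit names that already reached quorum
    and passed validation.
    """
    resend = []
    mark_error = []
    for result_name, workunit_name in lost_results:
        if workunit_name in validated_workunits:
            # Nothing left to gain: the workunit is already done.
            mark_error.append(result_name)
        else:
            resend.append(result_name)
    return resend, mark_error
```

For example, `triage_lost_results([("r1", "wu1"), ("r2", "wu2")], {"wu1"})` returns `(["r2"], ["r1"])`: only `r2` is resent, while `r1` is marked as an error.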

When you're really interested in a subject, there is no way to avoid it. You have to read the Manual.
