Serious BUG: Phantom WUs NOT on user client machines but on result pages

Ulrich Metzner
Joined: 22 Jan 05
Posts: 113
Credit: 963,370
RAC: 0
Topic 188165

Because everyone seems to overlook the big BUG mentioned in this thread, I am starting this new one with a better thread title to draw the attention of people and moderators to this concern.

This thread could also be related.

Aloha, Uli

Wolverine
Joined: 24 Feb 05
Posts: 15
Credit: 217,693
RAC: 0

> Cause everyone seems to overlook the big BUG mentioned in this thread,
> i start this new one with a better thread title to attract people and
> moderators to this concern.

Greetings,

To elaborate -- in the first thread Ulrich posted, these were my first 2 WUs unaccounted for (to reiterate, back on Feb 26th when there was the crash & scheduler problems):

http://einsteinathome.org/task/1315339
http://einsteinathome.org/task/1314874

I've now noticed this...:
http://einsteinathome.org/task/1523591

...which seems to coincide with a non-response in my log:
_________
2005-03-01 02:15:12 - Sending request to scheduler: http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi
2005-03-01 02:15:21 - Scheduler RPC to http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi failed
2005-03-01 02:15:21 - No schedulers responded
2005-03-01 02:15:21 - Deferring communication with project for 1 minutes and 0 seconds
_________

So... ResultID 1523591 lists as "sent" 1 Mar 2005 8:15:20 UTC. I'm on Central time in the US (GMT -6), so the "failed" indication in my log at 02:15:21 corresponds to 8:15:21 UTC, i.e. the result was marked as sent 1 second before I received the failure.
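For anyone who wants to double-check the time zone arithmetic, here is a minimal sketch in Python. It assumes the client log is in US Central Standard Time (a fixed UTC-6 offset, as stated above); the variable names are made up for this illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the client log uses US Central Standard Time (UTC-6).
CST = timezone(timedelta(hours=-6))

# "Scheduler RPC ... failed" line from the client log (local time).
failed_local = datetime(2005, 3, 1, 2, 15, 21, tzinfo=CST)
# "sent" time shown on the result page (already in UTC).
sent_utc = datetime(2005, 3, 1, 8, 15, 20, tzinfo=timezone.utc)

failed_utc = failed_local.astimezone(timezone.utc)
gap_seconds = (failed_utc - sent_utc).total_seconds()

print(failed_utc.isoformat())  # 2005-03-01T08:15:21+00:00
print(gap_seconds)             # 1.0
```

So the server marked the result as sent one second before the client gave up on the request.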

Seems like there's some common ground between all of the above?

Ulrich Metzner
Joined: 22 Jan 05
Posts: 113
Credit: 963,370
RAC: 0

Another thought:

Could that be related to the fact that in reality we have 2 (two) servers, one in the states and one in Germany? Maybe one server says: "Ok, take this xyz-wu" then after some time you do a manual update and reach (by accident) the other server and it says: "Hey, you never got that xyz-wu from me, so forget it!".

A response from the developers or admins would be greatly appreciated.

Aloha, Uli

Ulrich Metzner
Joined: 22 Jan 05
Posts: 113
Credit: 963,370
RAC: 0

Message 6384 in response to message 6382

> I've now noticed this...:
> http://einsteinathome.org/task/1523591
>
> ...which seems to coincide with a non-response in my log:
> _________
> 2005-03-01 02:15:12 - Sending request to scheduler:
> http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi
> 2005-03-01 02:15:21 - Scheduler RPC to
> http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi failed
> 2005-03-01 02:15:21 - No schedulers responded
> 2005-03-01 02:15:21 - Deferring communication with project for 1 minutes and 0
> seconds
> _________
>
> So... ResultID 1523591 lists as "sent" 1 Mar 2005 8:15:20 UTC; 1 second before
> I received the "failed" indication (and I'm on central time in the US, GMT
> -6).
>
> Seems like there's some common ground between all of the above?
>

You're right. In the last few days I had a lot of "Scheduler...failed" messages, and right after that (after a delay of 1 minute or so on the client) it got through. Smells a bit fishy ;)

Aloha, Uli

hoarfrost
Joined: 9 Feb 05
Posts: 207
Credit: 103,053,349
RAC: 11

And, if I think right, this thread.

Wolverine
Joined: 24 Feb 05
Posts: 15
Credit: 217,693
RAC: 0

Message 6386 in response to message 6384

> You're right. In the last days i had a lot "Scheduler...failed" and right
> after that (1 minute or so delay from client) it got through. Smells a bit
> fishy ;)

Yes -- and (oops) I forgot to note that 1523591 seems to have been skipped; when communication was re-established, my box began crunching the subsequent WU, sent just over a minute later:

http://einsteinathome.org/task/1523619

Doris and Jens
Joined: 30 Oct 04
Posts: 30
Credit: 2,688,588
RAC: 0

Message 6387 in response to message 6383

> Could that be related to the fact that in reality we have 2 (two) servers, one
> in the states and one in Germany? Maybe one server says: "Ok, take this
> xyz-wu" then after some time you do a manual update and reach (by accident)
> the other server and it says: "Hey, you never got that xyz-wu from me, so
> forget it!".

I don't believe that, because there are not two 'scheduler' servers, only two 'data' servers.

This means it may be possible to download the WUs from several servers, but only one server decides which WU is yours.

I don't know whether we can upload to several servers, but I doubt it, because that would require synchronization over the web. And if that had been implemented, I am sure we would have heard about it.

Greetings from Bremen/Germany

Jens Seidler (TheBigJens)


Iron Sun 254
Joined: 25 Feb 05
Posts: 38
Credit: 55,455
RAC: 0

I have two on my machine and both were sent on Feb 26, the same day as the server crash. I suspect a lot of us have these phantom WUs from the same date as well. My big concern is just that they won't get completed if it looks like they've been sent out already.

------------------------------------
There's a thin line between Genius and Insanity. That's where I live, baby!

THE SPACEPORT - The Other Side of Space

Doris and Jens
Joined: 30 Oct 04
Posts: 30
Credit: 2,688,588
RAC: 0

Message 6389 in response to message 6388

> I have two on my machine and both were sent on Feb 26, the same day as the
> server crash. I suspect a lot of us these phantom WUs from the same date as
> well. My big concern is just that they won't get completed if it looks like
> they've been sent out already.

If not enough valid results are returned, the scheduler will send out new ones until a canonical result for the WU is found.
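A rough sketch of that idea in Python, just to illustrate the logic. This is not the actual BOINC server code: the function name, the quorum value of 3, and the assumption that a phantom result is eventually written off (for example when its deadline passes) are all mine.

```python
def results_to_issue(valid_returned: int, outstanding: int, quorum: int) -> int:
    """Extra copies needed so that `quorum` valid results can still come in."""
    return max(0, quorum - valid_returned - outstanding)


QUORUM = 3  # hypothetical value, chosen only for this example

# Two hosts returned valid results; the third copy is a phantom that the
# server still counts as "sent", so nothing new goes out yet.
print(results_to_issue(valid_returned=2, outstanding=1, quorum=QUORUM))  # 0

# Once the phantom copy is written off, it no longer counts as outstanding
# and one replacement copy is issued.
print(results_to_issue(valid_returned=2, outstanding=0, quorum=QUORUM))  # 1
```

Under these assumptions a phantom WU delays the canonical result but does not prevent it.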

Greetings from Bremen/Germany

Jens Seidler (TheBigJens)


Bruce Allen
Moderator
Joined: 15 Oct 04
Posts: 1,119
Credit: 172,127,663
RAC: 0

I'm going to add something more to the FAQ about this.

The basic problem is that some hosts are not getting the scheduler replies. When this happens, two things go wrong:
[A] The work sent to the host in the scheduler reply is lost.
[B] If the host was registering for the first time, it never gets its hostid back, and then registers AGAIN (and sometimes AGAIN and AGAIN!). These hosts also have problem [A] above!
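Problem [A] can be pictured with a toy simulation in Python. This is only an illustrative sketch, not the real scheduler or client code: the `Server` and `Client` classes, the host id 42, and the `reply_delivered` flag are hypothetical. The one assumption it encodes is that the server records a result as "sent" before the reply reaches the host.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    next_result_id: int = 1
    sent: dict = field(default_factory=dict)  # result_id -> host_id

    def handle_request(self, host_id: int) -> int:
        result_id = self.next_result_id
        self.next_result_id += 1
        # Recorded as "sent" before the reply travels back to the host.
        self.sent[result_id] = host_id
        return result_id


@dataclass
class Client:
    host_id: int
    tasks: list = field(default_factory=list)

    def request_work(self, server: Server, reply_delivered: bool) -> None:
        result_id = server.handle_request(self.host_id)
        if reply_delivered:
            self.tasks.append(result_id)
        # Otherwise the client only sees a failed RPC and retries later.


server = Server()
client = Client(host_id=42)
client.request_work(server, reply_delivered=False)  # reply lost on the way back
client.request_work(server, reply_delivered=True)   # retry succeeds

phantoms = [r for r, h in server.sent.items()
            if h == client.host_id and r not in client.tasks]
print(phantoms)      # [1]
print(client.tasks)  # [2]
```

The first result shows up on the server's result page but never on the host, which is the pattern reported earlier in the thread (1523591 skipped, 1523619 received).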

There appear to be at least two separate bugs. One is 'understood and solved' and the other is not.

[1] The 4.19 (and some earlier) clients did not handle some http proxy servers correctly. So hosts behind a proxy, or networked with some variants of Windows networking options, did not receive scheduler replies correctly (or did not receive all of them).

[2] Even when not using an http proxy server, some scheduler replies do not make it back to the host. I haven't been able to isolate (yet) when/how this happens. I'd be grateful for assistance!

Topic [B] above is *already* addressed in the FAQ.

If you have a particular host machine which continues to exhibit problem [A] (work sent to it is repeatedly lost) and it is NOT behind a proxy server, please write something about the host in this thread so that we can try to understand what is wrong.

I've modified the server to try and detect problem [B] above and to send an error message to the BOINC client. This may reduce the impact of the bug, but is only a workaround, not a solution.

Cheers,
Bruce

Director, Einstein@Home

Iron Sun 254
Joined: 25 Feb 05
Posts: 38
Credit: 55,455
RAC: 0

Should we let you know about lost work even if it isn't being lost continually, or will those WUs be taken care of some other way?

------------------------------------
There's a thin line between Genius and Insanity. That's where I live, baby!

THE SPACEPORT - The Other Side of Space
