Serious BUG: Phantom WUs NOT on user client machines but on result pages

Bruce Allen
Bruce Allen
Moderator
Joined: 15 Oct 04
Posts: 1119
Credit: 172127663
RAC: 0

> ^TOP^ > > Bruce, > > The

Message 6402 in response to message 6401

> ^TOP^
>
> Bruce,
>
> The issue is still repeating itself for the hosts listed below. None are not
> behind a firewall and they're networked at the same location. What networking
> details do you need?
>
>
> 7672 GenuineIntel Intel(R) Pentium(R) 4 CPU 3.40GHz Microsoft Windows XP
> Professional Edition, Service Pack 2, (05.01.2600.00)
>
> 1397642 395805 25 Feb 2005 11:23:06 UTC 4 Mar 2005 11:23:06 UTC In Progress
> Unknown New --- --- ---
> 1393899 394912 25 Feb 2005 11:23:00 UTC 4 Mar 2005 11:23:00 UTC In Progress
> Unknown New --- --- ---
> 1203915 351169 18 Feb 2005 7:28:45 UTC 25 Feb 2005 7:28:45 UTC Over No reply
> New 0.00 --- ---
>
>
> 7660 GenuineIntel Intel(R) Pentium(R) 4 CPU 2.80GHz Microsoft Windows XP Home
> Edition, Service Pack 2, (05.01.2600.00)
>
> 1440862 405896 27 Feb 2005 14:19:17 UTC 6 Mar 2005 14:19:17 UTC In Progress
> Unknown New --- --- ---
> 1440850 405893 27 Feb 2005 14:19:17 UTC 6 Mar 2005 14:19:17 UTC In Progress
> Unknown New --- --- ---
> 1178852 347092 15 Feb 2005 21:57:28 UTC 22 Feb 2005 21:57:28 UTC Over No reply
> New 0.00 --- ---
> 1177546 346808 16 Feb 2005 4:09:06 UTC 23 Feb 2005 4:09:06 UTC Over No reply
> New 0.00 --- ---
>
>
> 7668 AuthenticAMD AMD Athlon(tm) Processor Microsoft Windows XP Professional
> Edition, Service Pack 2, (05.01.2600.00)
>
> 1254561 362688 20 Feb 2005 21:52:55 UTC 27 Feb 2005 21:52:55 UTC In Progress
> Unknown New --- --- ---
>
>
> This is not a complete list as some WU's have already been deleted.

Michael, this can happen if your host machine makes contact with the server, but then the connection is broken after the scheduler has taken work out of the database but hasn't sent it to you yet. At that point it't can't easily put the work 'back in the database' so it's lost. If the problem continues to replicate itself on these machines, I'll get in touch with you next week to follow up.

Bruce

Director, Einstein@Home

chelliot
chelliot
Joined: 9 Feb 05
Posts: 9
Credit: 135202
RAC: 0

Bruce, I've seen this

Bruce,

I've seen this problem where WU's are sent from the server but not seen by one of my computers twice recently, on two different computers. Both computers are running Win2K SP4. So far I have not seen this problem with Linux or XP, but I have a very small sample size... I do have a firewall, but no proxy.

Possibly relevant is that Time Warner's road runner service has been having some connectivity issues in my area (Durham, North Carolina) recently.

However, it seems to me that the server and the boinc client should have some sort of handshake agreeing on the receipt of the WU. That does not appear to be the case.

If a packet trace from my side would be useful please let me know and I'll setup Ethereal to try to capture an instance.

Here's the most recent WU that shows this issue for computer 47378:

ResultID WorkID Sent
1632725 442582 4 Mar 2005 13:23:17 UTC

Here's the relevant log entries from the computer, showing the failed request and the followup that succeeded:

--- - 2005-03-04 08:26:05 - May run out of work in 0.01 days; requesting more
Einstein@Home - 2005-03-04 08:26:05 - Requesting 1725 seconds of work
Einstein@Home - 2005-03-04 08:26:05 - Sending request to scheduler: http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi
Einstein@Home - 2005-03-04 08:26:08 - Scheduler RPC to http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi failed
Einstein@Home - 2005-03-04 08:26:08 - No schedulers responded
Einstein@Home - 2005-03-04 08:26:08 - Deferring communication with project for 1 minutes and 0 seconds
--- - 2005-03-04 08:27:09 - May run out of work in 0.01 days; requesting more
Einstein@Home - 2005-03-04 08:27:09 - Requesting 1808 seconds of work
Einstein@Home - 2005-03-04 08:27:09 - Sending request to scheduler: http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi
Einstein@Home - 2005-03-04 08:27:19 - Scheduler RPC to http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi succeeded

Chris.
chelliot [at] pobox.com

Bruce Allen
Bruce Allen
Moderator
Joined: 15 Oct 04
Posts: 1119
Credit: 172127663
RAC: 0

> Bruce, > > I've seen this

Message 6404 in response to message 6403

> Bruce,
>
> I've seen this problem where WU's are sent from the server but not seen by one
> of my computers twice recently, on two different computers. Both computers are
> running Win2K SP4. So far I have not seen this problem with Linux or XP, but I
> have a very small sample size... I do have a firewall, but no proxy.
>
> Possibly relevant is that Time Warner's road runner service has been having
> some connectivity issues in my area (Durham, North Carolina) recently.
>
> However, it seems to me that the server and the boinc client should have some
> sort of handshake agreeing on the receipt of the WU. That does not appear to
> be the case.
>
> If a packet trace from my side would be useful please let me know and I'll
> setup Ethereal to try to capture an instance.
>
> Here's the most recent WU that shows this issue for computer 47378:
>
> ResultID WorkID Sent
> 1632725 442582 4 Mar 2005 13:23:17 UTC
>
> Here's the relevant log entries from the computer, showing the failed request
> and the followup that succeeded:
>
> --- - 2005-03-04 08:26:05 - May run out of work in 0.01 days; requesting
> more
> Einstein@Home - 2005-03-04 08:26:05 - Requesting 1725 seconds of work
> Einstein@Home - 2005-03-04 08:26:05 - Sending request to scheduler:
> http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi
> Einstein@Home - 2005-03-04 08:26:08 - Scheduler RPC to
> http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi failed
> Einstein@Home - 2005-03-04 08:26:08 - No schedulers responded
> Einstein@Home - 2005-03-04 08:26:08 - Deferring communication with project
> for 1 minutes and 0 seconds
> --- - 2005-03-04 08:27:09 - May run out of work in 0.01 days; requesting
> more
> Einstein@Home - 2005-03-04 08:27:09 - Requesting 1808 seconds of work
> Einstein@Home - 2005-03-04 08:27:09 - Sending request to scheduler:
> http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi
> Einstein@Home - 2005-03-04 08:27:19 - Scheduler RPC to
> http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi succeeded

Chris, here are the relevant bits of the scheduler logs:
2005-03-04 13:23:17 [normal ] OS version Microsoft Windows 2000 Professional Edition, Service Pack 4, (05.00.2195.00)
2005-03-04 13:23:17 [normal ] Request [HOST#47378] Database [HOST#47378] Request [RPC#51] Database [RPC#50]
2005-03-04 13:23:17 [normal ] Processing request from [USER#12153] [HOST#47378] [IP XX.XX.215.43] [RPC#51] core client version 4.19
2005-03-04 13:23:17 [normal ] [HOST#47378] got request for 1725.305507 seconds of work; available disk 50.000000 GB
2005-03-04 13:23:17 [debug ] [HOST#47378]: has file H1_0976.4
2005-03-04 13:23:17 [debug ] in_send_results_for_file(H1_0976.4, 0) prev_result.id=1632345
2005-03-04 13:23:17 [debug ] est cpu dur 34061.844227; running_frac 0.998496; rsf 1.000000; est 34113.150406
2005-03-04 13:23:17 [debug ] Sorted list of URLs follows [host timezone: UTC-18000]
2005-03-04 13:23:17 [debug ] zone=-21600 url=http://einstein.phys.uwm.edu
2005-03-04 13:23:17 [debug ] zone=+3600 url=http://einstein.aei.mpg.de
2005-03-04 13:23:17 [debug ] [HOST#47378] Sending app_version einstein windows_intelx86 479
2005-03-04 13:23:17 [debug ] [HOST#47378] Already has file H1_0976.4
2005-03-04 13:23:17 [debug ] [HOST#47378] reducing disk needed for WU by 14736000 bytes (length of H1_0976.4)
2005-03-04 13:23:17 [debug ] est cpu dur 34061.844227; running_frac 0.998496; rsf 1.000000; est 34113.150406
2005-03-04 13:23:17 [normal ] [HOST#47378] Sending [RESULT#1632725 H1_0976.4__0976.5_0.1_T24_Test02_4] (fills 34113.15 seconds)
2005-03-04 13:23:17 [normal ] [HOST#47378] Sent 1 results

Then:

2005-03-04 13:24:21 [normal ] OS version Microsoft Windows 2000 Professional Edition, Service Pack 4, (05.00.2195.00)
2005-03-04 13:24:21 [normal ] Request [HOST#47378] Database [HOST#47378] Request [RPC#52] Database [RPC#51]
2005-03-04 13:24:21 [normal ] Processing request from [USER#12153] [HOST#47378] [IP XX.XX.215.43] [RPC#52] core client version 4.19
2005-03-04 13:24:21 [normal ] [HOST#47378] got request for 1807.746221 seconds of work; available disk 50.000000 GB
2005-03-04 13:24:21 [debug ] [HOST#47378]: has file H1_0976.4
2005-03-04 13:24:21 [debug ] in_send_results_for_file(H1_0976.4, 0) prev_result.id=1632725
2005-03-04 13:24:21 [debug ] touched ../locality_scheduling/need_work/H1_0976.4: need work for file H1_0976.4
2005-03-04 13:24:21 [debug ] make_more_work_for_file(H1_0976.4, 0)=0
2005-03-04 13:24:27 [debug ] in_send_results_for_file(H1_0976.4, 1) prev_result.id=1632725
2005-03-04 13:24:27 [debug ] est cpu dur 34061.844227; running_frac 0.998496; rsf 1.000000; est 34113.150406
2005-03-04 13:24:27 [debug ] Sorted list of URLs follows [host timezone: UTC-18000]
2005-03-04 13:24:27 [debug ] zone=-21600 url=http://einstein.phys.uwm.edu
2005-03-04 13:24:27 [debug ] zone=+3600 url=http://einstein.aei.mpg.de
2005-03-04 13:24:27 [debug ] [HOST#47378] Sending app_version einstein windows_intelx86 479
2005-03-04 13:24:27 [debug ] [HOST#47378] Already has file H1_0976.4
2005-03-04 13:24:27 [debug ] [HOST#47378] reducing disk needed for WU by 14736000 bytes (length of H1_0976.4)
2005-03-04 13:24:27 [debug ] est cpu dur 34061.844227; running_frac 0.998496; rsf 1.000000; est 34113.150406
2005-03-04 13:24:27 [normal ] [HOST#47378] Sending [RESULT#1635259 H1_0976.4__0976.9_0.1_T26_Test02_0] (fills 34113.15 seconds)
2005-03-04 13:24:28 [normal ] [HOST#47378] Sent 1 results

Is this machine part of a Windows network, or some other network? Could you describe your network infrastructure, please?

By the way, it would help in doing this if you could set your computer's clock to the right time. The best way is to set it up to automatically sync with an NTP (Network Time Protocol) server. This would make it easier to compare the logs.

And no, BOINC does not have a handshake which verifies that WU are not lost when they are sent. I've thought about trying to add one into the scheduler but it's too complex to do in the near future.

Cheers,
Bruce

Director, Einstein@Home

chelliot
chelliot
Joined: 9 Feb 05
Posts: 9
Credit: 135202
RAC: 0

>Is this machine part of a

>Is this machine part of a Windows network, or some other network? Could you
>describe your network infrastructure, please?

It's my home Ethernet network. Running just TCP/IP, for the most part--one printer shared using a Windows workgroup share/samba. One of the Win2K systems (15933) is setup to be part of a NT domain, but the domain is non-existent at home. The other (47378) is setup to be part of a Windows workgroup, but no other computer on my network is setup in that workgroup. I can change or disable 47378's Windows networking config with no issues if that would help.

The Internet connection is through Time Warner's Road Runner cable modem service using a SMC router/firewall (SMC7004AWBR), as well as a couple of daisy-chained no-name 8-port 10/100 Ethernet switches. Wireless is available, but not used for any BOINC clients.

Topology (bad ascii art warning!):

Internet-->CableModem-->SMC Firewall-->Switch 1-->Nodes 15933, 46764, 15970, and Switch 2-->Nodes 47378 and 47511

I don't know of any issues with the network itself--internal transfer rates between nodes are very good. I'll do a bit of sniffing to verify.

>By the way, it would help in doing this if you could set your computer's clock
>to the right time. The best way is to set it up to automatically sync with an
>NTP (Network Time Protocol) server. This would make it easier to compare the
>logs.

Sorry about that! Most of my computers already are setup to use NTP--I'll get that one fixed and verify the rest. It's about 4 minutes and 4 seconds fast right now, but you've probably already figured that out. :-)

Enjoy!
Chris.

chelliot
chelliot
Joined: 9 Feb 05
Posts: 9
Credit: 135202
RAC: 0

OK--so I've setup Ethereal to

OK--so I've setup Ethereal to watch the packets go by on both clients where I've seen this issue. I haven't seen the issue since getting it setup as neither one has completed any work.

However, I have had failures when running a manual update of the project--and I am speculating that the reason that I am getting failures then is the same as why my computers are occasionally missing WU's. The messages ("Scheduler RPC...failed", "No schedulers responded") appear to be the same in both cases.

Here's the log (time should be right now!) from computer 15933:

Einstein@Home - 2005-03-04 21:19:49 - Sending request to scheduler: http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi
Einstein@Home - 2005-03-04 21:20:04 - Scheduler RPC to http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi failed
Einstein@Home - 2005-03-04 21:20:04 - No schedulers responded
Einstein@Home - 2005-03-04 21:20:04 - Deferring communication with project for 1 minutes and 0 seconds
Einstein@Home - 2005-03-04 21:20:22 - Sending request to scheduler: http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi
Einstein@Home - 2005-03-04 21:20:25 - Scheduler RPC to http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi succeeded

Looking at the packet traces I see one difference. The client sends a HTTP POST message to the server and the server responds with a HTTP reply containing the . When things succeeds this packet is 977 bytes long. When things fail the exact same packet is sent, except it is 998 bytes long and the HTTP header includes this additional field:

Content-Length: 775

This is exactly 21 bytes long (exactly accounts for the difference in length of the two packets) and is the only difference between them. After an exchange of FIN/ACKs, the exchange halts in the case of the failing connection. When it succeeds the client continues with a HTTP GET and things continue from there.

Now I have no idea why this Content-Length field is really the issue or why the client should reject the connection after receiving this packet, but I have lots of examples of updates that succeed and none of them include this field while the two failures I've captured both do.

I also don't know why your server should sometimes send this field and other times not. A guess: Is the server a virtual device where one of the real servers is behaving slightly differently?

Hope you have a clue. I will attempt to capture a WU failure and see if I see the same thing.

Fun, huh?
Chris.

AnRM
AnRM
Joined: 9 Feb 05
Posts: 213
Credit: 4346941
RAC: 0

Computer 12208 is missing WUs

Computer 12208 is missing WUs again. It appears that the scheduler has doubled up the assigned WUs on computer 12208 for some reason. Three other machines on our nwk are NOT having this problem so I conclude it's not at this end?? It takes approx. 18 hours for this machine to process a WU and it requests approx. 85000 sec. work and receives 2 WUs per request when all is ok. When the error occurs (this is the third time) the scheduler appears to assign double this work (4 WUs), only half of which appear on this machine's work page. Still using the 4.19/4.79 combo. Hope this helps.

Bruce Allen
Bruce Allen
Moderator
Joined: 15 Oct 04
Posts: 1119
Credit: 172127663
RAC: 0

> OK--so I've setup Ethereal

Message 6408 in response to message 6406

> OK--so I've setup Ethereal to watch the packets go by on both clients where
> I've seen this issue. I haven't seen the issue since getting it setup as
> neither one has completed any work.
>
> However, I have had failures when running a manual update of the project--and
> I am speculating that the reason that I am getting failures then is the same
> as why my computers are occasionally missing WU's. The messages ("Scheduler
> RPC...failed", "No schedulers responded") appear to be the same in both
> cases.
>
> Here's the log (time should be right now!) from computer 15933:
>
> Einstein@Home - 2005-03-04 21:19:49 - Sending request to scheduler:
> http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi
> Einstein@Home - 2005-03-04 21:20:04 - Scheduler RPC to
> http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi failed
> Einstein@Home - 2005-03-04 21:20:04 - No schedulers responded
> Einstein@Home - 2005-03-04 21:20:04 - Deferring communication with project for
> 1 minutes and 0 seconds
> Einstein@Home - 2005-03-04 21:20:22 - Sending request to scheduler:
> http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi
> Einstein@Home - 2005-03-04 21:20:25 - Scheduler RPC to
> http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi succeeded
>
> Looking at the packet traces I see one difference. The client sends a HTTP
> POST message to the server and the server responds with a HTTP reply
> containing the . When things succeeds this packet is 977 bytes long. When
> things fail the exact same packet is sent, except it is 998 bytes long and the
> HTTP header includes this additional field:
>
> Content-Length: 775
>
> This is exactly 21 bytes long (exactly accounts for the difference in length
> of the two packets) and is the only difference between them. After an exchange
> of FIN/ACKs, the exchange halts in the case of the failing connection. When it
> succeeds the client continues with a HTTP GET and things continue from there.
>
> Now I have no idea why this Content-Length field is really the issue or why
> the client should reject the connection after receiving this packet, but I
> have lots of examples of updates that succeed and none of them include this
> field while the two failures I've captured both do.
>
> I also don't know why your server should sometimes send this field and other
> times not. A guess: Is the server a virtual device where one of the real
> servers is behaving slightly differently?
>
> Hope you have a clue. I will attempt to capture a WU failure and see if I see
> the same thing.

Thanks for looking carefully at this. I have no idea why the scheduler/apache is putting Content-Length onto some replies and not onto others. Note that proxies *often* do this sort of thing, but you're not using a proxy or packet filtering software that might do this.

I just got off the phone with Rom Walton and he confirmed that the 4.19 client has a bug in which if the Content-Length is specified (correctly) it will sometimes misbehave because a variable that stores this internally is not properly initialized/reset.

Advice for now is to try the 4.25 core client.

No where in the scheduler code is Content-Length written, so if you are getting this it must either be from apache or your network is somehow adding it in. I'll look at the apache config file and see if there's some way to disable adding content-length into the reply.

Cheers,
Bruce

Director, Einstein@Home

chelliot
chelliot
Joined: 9 Feb 05
Posts: 9
Credit: 135202
RAC: 0

Bruce, Sounds like this is

Bruce,

Sounds like this is the issue. I'll try the 4.25 client and let you know if it reoccurs.

By the way, I've now had this happen on two XP clients and a windows client running in wine under linux--missing the Win2K clients that I'd setup Ethereal on, of course! So I went ahead and started Ethereal on all of my clients and just captured the linux/wine client exhibiting this behavior. I'll confirm the content-length issue as soon as I can read through the packet capture.

I don't have anything in my network that would modify a HTTP header, so I'm pretty sure that the content-length is being inserted from your side.

Enjoy!
Chris.

chelliot
chelliot
Joined: 9 Feb 05
Posts: 9
Credit: 135202
RAC: 0

OK, I've verified tha the

OK, I've verified tha the difference in the two requests/responses for more work is just the content-length field.

If you are convinced that your side is not inserting this field then I could trace the packets before they hit my firewall. Takes a bit more work and finding an Ethernet hub--I blew up my power supply for mine in Australia (thought it was dual power...whoops!) a bit more work. Let me know.

How stable is the 4.25 client with Einstein? I've seen references to bugs with Einstein and earlier beta clients.

Chris.

Bruce Allen
Bruce Allen
Moderator
Joined: 15 Oct 04
Posts: 1119
Credit: 172127663
RAC: 0

> OK, I've verified tha the

Message 6411 in response to message 6410

> OK, I've verified tha the difference in the two requests/responses for more
> work is just the content-length field.
>
> If you are convinced that your side is not inserting this field then I could
> trace the packets before they hit my firewall. Takes a bit more work and
> finding an Ethernet hub--I blew up my power supply for mine in Australia
> (thought it was dual power...whoops!) a bit more work. Let me know.

This would be really nice. I've grepped the scheduler code pretty thoroughly and it's NOT putting Content-Length in the reply. So it must be coming from apache, or elsewhere. If you could verify that sometime's it's there, and sometimes not, when coming into your firewall, this would be helpful in pinning down where it's coming from.

> How stable is the 4.25 client with Einstein? I've seen references to bugs with
> Einstein and earlier beta clients.

I don't know. I'm running 4.24 on an XP test machine without obvious problems, but I simply don't know if 4.25 is reliable or not. In the worst case you can always put 4.19 back!

Cheers,
Bruce

Director, Einstein@Home

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.