@Boinc developers: Strange effect on 2-CPU machine / Linux

Wurgl (speak^Wcrunching for Special: Off-Topic)
Wurgl (speak^Wc...
Joined: 11 Feb 05
Posts: 321
Credit: 140550008
RAC: 0
Topic 189160

This machine reached me yesterday and the first thing I did was installing e@h.

http://einsteinathome.org/host/212379/tasks

Now yesterday night I was a little bit astonished how the client requested more work.

First it requested two workunits: 1003401 and 1003421. This is what I expect, because two CPUs and both have the same speed, so I expect that the application always requests pairs.

But 3 hours later it requested another one: 1003925. And it requested only one? The second one of the new pair was requested in the morning.

Yeah, all those WU's arrived safely, so it is not a bug in the scheduler, it seems to be a strange logic in the client?

Another thing which looks strange is the nice value. One of the two crunching processes runs with a nice value of 19, the other one with 5? Shouldn't both run with 19?

$ ps -lft tty1
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
000 S me 8792 1 0 85 5 - 635 wait4 May10 tty1 00:00:00 /bin/sh ./run.sh
000 S me 8795 8792 0 80 5 - 684 schedu May10 tty1 00:00:09 ./boinc_4.19_i68
000 S me 15264 8795 0 94 19 - 3709 schedu May10 tty1 00:00:00 einstein_4.80_i6
002 S me 15265 15264 0 94 19 - 3709 schedu May10 tty1 00:00:00 einstein_4.80_i6
042 R me 15266 15265 99 99 19 - 3709 - May10 tty1 08:13:41 einstein_4.80_i6
000 S me 15285 8795 0 94 19 - 3708 schedu May10 tty1 00:00:00 einstein_4.80_i6
002 S me 15286 15285 0 80 5 - 3708 schedu May10 tty1 00:00:00 einstein_4.80_i6
002 R me 15287 15286 99 90 5 - 3708 - May10 tty1 08:04:24 einstein_4.80_i6

Wurgl (speak^Wcrunching for Special: Off-Topic)
Wurgl (speak^Wc...
Joined: 11 Feb 05
Posts: 321
Credit: 140550008
RAC: 0

@Boinc developers: Strange effect on 2-CPU machine / Linux

This machine reached me yesterday and the first thing I did was installing e@h.

http://einsteinathome.org/host/212379/tasks

Now yesterday night I was a little bit astonished how the client requested more work.

First it requested two workunits: 1003401 and 1003421. This is what I expect, because two CPUs and both have the same speed, so I expect that the application always requests pairs.

But 3 hours later it requested another one: 1003925. And it requested only one? The second one of the new pair was requested in the morning.

Yeah, all those WU's arrived safely, so it is not a bug in the scheduler, it seems to be a strange logic in the client?

I checked a little bit more and found out that the workunit 1003401 did not reach my machine. In this thread Bruce taked about a timeout in the proxy, which I have. But the timeout is:

#
# Timeout: The number of seconds before receives and sends time out.
#
Timeout 300

So this one fits to the formular give by Bruce (N*100)?

I attach also the log from the proxy (apache), maybe there is another reason for loosing WU's? Or shall I simply set the timeout to more than 5 minutes?

$ grep 192.168.45.243 access_log
192.168.45.243 - - [10/May/2005:13:18:38 +0200] "GET http://einstein.phys.uwm.edu:80/ HTTP/1.0" 200 6992
192.168.45.243 - - [10/May/2005:13:18:38 +0200] "POST http://einstein.phys.uwm.edu:80/EinsteinAtHome_cgi/cgi HTTP/1.0" 200 10659
192.168.45.243 - - [10/May/2005:13:18:39 +0200] "GET http://einstein.phys.uwm.edu:80/ HTTP/1.0" 200 6992
192.168.45.243 - - [10/May/2005:13:18:40 +0200] "POST http://einstein.phys.uwm.edu:80/EinsteinAtHome_cgi/cgi HTTP/1.0" 200 972
192.168.45.243 - - [10/May/2005:13:18:49 +0200] "GET http://einstein.phys.uwm.edu:80/download/einstein_4.80_i686-pc-linux-gnu.so HTTP/1.0" 200 923679
192.168.45.243 - - [10/May/2005:13:18:50 +0200] "GET http://einstein.aei.mpg.de:80/download/einstein_4.80_i686-pc-linux-gnu HTTP/1.0" 200 1697218
192.168.45.243 - - [10/May/2005:13:18:51 +0200] "GET http://einstein.aei.mpg.de:80/download/2e6/sun HTTP/1.0" 200 274843
192.168.45.243 - - [10/May/2005:13:18:52 +0200] "GET http://morel.mit.edu:80/download/3c5/Config_Fin1 HTTP/1.0" 200 239
192.168.45.243 - - [10/May/2005:13:19:20 +0200] "GET http://einstein.phys.uwm.edu:80/download/dd/earth HTTP/1.0" 200 2745667
192.168.45.243 - - [10/May/2005:13:19:42 +0200] "POST http://einstein.phys.uwm.edu:80/EinsteinAtHome_cgi/cgi HTTP/1.0" 200 8804
192.168.45.243 - - [10/May/2005:13:20:05 +0200] "GET http://einstein.aei.mpg.de:80/download/3a8/H1_0688.5 HTTP/1.0" 200 14736000
192.168.45.243 - - [10/May/2005:20:53:33 +0200] "POST http://einstein.phys.uwm.edu:80/EinsteinAtHome_cgi/cgi HTTP/1.0" 200 8804
192.168.45.243 - - [10/May/2005:20:54:34 +0200] "POST http://einstein.phys.uwm.edu:80/EinsteinAtHome_cgi/cgi HTTP/1.0" 200 8804
192.168.45.243 - - [10/May/2005:23:24:12 +0200] "POST http://einstein.phys.uwm.edu:80/EinsteinAtHome_cgi/cgi HTTP/1.0" 200 8804
192.168.45.243 - - [11/May/2005:01:45:34 +0200] "POST http://einstein.phys.uwm.edu:80/EinsteinAtHome_cgi/file_upload_handler HTTP/1.0" 200 93
192.168.45.243 - - [11/May/2005:01:45:35 +0200] "POST http://einstein.phys.uwm.edu:80/EinsteinAtHome_cgi/file_upload_handler HTTP/1.0" 200 64
192.168.45.243 - - [11/May/2005:01:45:35 +0200] "POST http://einstein.phys.uwm.edu:80/EinsteinAtHome_cgi/cgi HTTP/1.0" 200 927
192.168.45.243 - - [11/May/2005:01:50:23 +0200] "POST http://einstein.phys.uwm.edu:80/EinsteinAtHome_cgi/file_upload_handler HTTP/1.0" 200 93
192.168.45.243 - - [11/May/2005:01:50:24 +0200] "POST http://einstein.phys.uwm.edu:80/EinsteinAtHome_cgi/file_upload_handler HTTP/1.0" 200 64
192.168.45.243 - - [11/May/2005:01:50:25 +0200] "POST http://einstein.phys.uwm.edu:80/EinsteinAtHome_cgi/cgi HTTP/1.0" 200 927
192.168.45.243 - - [11/May/2005:09:16:12 +0200] "POST http://einstein.phys.uwm.edu:80/EinsteinAtHome_cgi/cgi HTTP/1.0" 200 8804
192.168.45.243 - - [11/May/2005:11:52:37 +0200] "POST http://einstein.phys.uwm.edu:80/EinsteinAtHome_cgi/cgi HTTP/1.0" 200 8804
192.168.45.243 - - [11/May/2005:14:10:11 +0200] "POST http://einstein.phys.uwm.edu:80/EinsteinAtHome_cgi/file_upload_handler HTTP/1.0" 200 93
...

For all the kiddie hackers: Yep! Hack my address :-))))

Bruce Allen
Bruce Allen
Moderator
Joined: 15 Oct 04
Posts: 1119
Credit: 172127663
RAC: 0

EXECUTIVE SUMMARY: The 4.19

Message 11581 in response to message 11580

EXECUTIVE SUMMARY: The 4.19 client has known bugs when working through a proxy server. Please consider trying a more recent BOINC client.

http://einsteinathome.org/host/212379/tasks

But 3 hours later it requested another one: 1003925. And it requested only one? The second one of the new pair was requested in the morning.

You mean 7 hours later, right?

I checked a little bit more and found out that the workunit 1003401 did not reach my machine.

I am studying the scheduler contact for this lost WU. It's very helpful that your machine has NTP accurate timestamps!

I think your proxy timeout is fine. Here's the relevant part of the log:

192.168.45.243 - - [10/May/2005:20:53:33 +0200] "POST http://einstein.phys.uwm.edu:80/EinsteinAtHome_cgi/cgi HTTP/1.0" 200 8804
192.168.45.243 - - [10/May/2005:20:54:34 +0200] "POST http://einstein.phys.uwm.edu:80/EinsteinAtHome_cgi/cgi HTTP/1.0" 200 8804

The first of these is the lost WU. What do the 200 and 8804 refer to?

Here is the relevant bit of the scheduler log:

2005-05-10 18:53:33 [PID=22527] [debug ] REQUEST_METHOD=POST CONTENT_TYPE=application/octet-stream HTTP_ACCEPT= HTTP_USER_AGENT=
2005-05-10 18:53:33 [PID=22527] [debug ] CONTENT_LENGTH=3645 from X.X.X.X
2005-05-10 18:53:33 [PID=22527] [normal ] Handling request: IP X.X.X.X auth XXXX, host 212379, platform i686-pc-linux-gnu, version 4.19
2005-05-10 18:53:33 [PID=22527] [normal ] OS version Linux 2.4.19-64GB-SMP
2005-05-10 18:53:33 [PID=22527] [debug ] Request [HOST#212379] Database [HOST#212379] Request [RPC#2] Database [RPC#0]
2005-05-10 18:53:33 [PID=22527] [normal ] Processing request from [USER#12758] [HOST#212379] [IP 84.245.146.10] [RPC#2] core client version 4.19
2005-05-10 18:53:33 [PID=22527] [normal ] [HOST#212379] got request for 17280.369637 seconds of work; available disk 11.155923 GB
2005-05-10 18:53:33 [PID=22527] [debug ] [HOST#212379]: has file H1_0688.5
2005-05-10 18:53:33 [PID=22527] [debug ] send_old_work() no feasible result older than 168.0 hours
2005-05-10 18:53:33 [PID=22527] [debug ] in_send_results_for_file(H1_0688.5, 0) prev_result.id=4092185
2005-05-10 18:53:33 [PID=22527] [debug ] est cpu dur 43116.118341; running_frac 1.000000; rsf 1.000000; est 43116.118341
2005-05-10 18:53:33 [PID=22527] [debug ] Sorted list of URLs follows [host timezone: UTC-25200]
2005-05-10 18:53:33 [PID=22527] [debug ] zone=-21600 url=http://einstein.phys.uwm.edu
2005-05-10 18:53:33 [PID=22527] [debug ] zone=-18000 url=http://einstein.aset.psu.edu
2005-05-10 18:53:33 [PID=22527] [debug ] zone=-18000 url=http://morel.mit.edu
2005-05-10 18:53:33 [PID=22527] [debug ] zone=+00000 url=http://einstein.astro.gla.ac.uk
2005-05-10 18:53:33 [PID=22527] [debug ] zone=+03600 url=http://einstein.aei.mpg.de
2005-05-10 18:53:33 [PID=22527] [debug ] [HOST#212379] Sending app_version einstein i686-pc-linux-gnu 480
2005-05-10 18:53:33 [PID=22527] [debug ] [HOST#212379] Already has file H1_0688.5
2005-05-10 18:53:33 [PID=22527] [debug ] [HOST#212379] reducing disk needed for WU by 14736000 bytes (length of H1_0688.5)
2005-05-10 18:53:33 [PID=22527] [debug ] est cpu dur 43116.118341; running_frac 1.000000; rsf 1.000000; est 43116.118341
2005-05-10 18:53:33 [PID=22527] [normal ] [HOST#212379] Sending [RESULT#4098898 H1_0688.5__0688.9_0.1_T04_Fin1_2] (fills 43116.12 seconds)
2005-05-10 18:53:33 [PID=22527] [normal ] [HOST#212379] Sent 1 results [scheduler ran 0 seconds]

The problem may be a simple one to fix. The 4.19 client has some proxy-related bugs. You might try replacing it with something more recent.

Note that looking at this I noticed another bug (RPC # remaining zero in the database). I think I know where this comes from -- it's unrelated to the lost WU -- I'm going to try and track this down with David Anderson later today.

Cheers,
Bruce

Director, Einstein@Home

Wurgl (speak^Wcrunching for Special: Off-Topic)
Wurgl (speak^Wc...
Joined: 11 Feb 05
Posts: 321
Credit: 140550008
RAC: 0

EXECUTIVE SUMMARY: The 4.19

Message 11582 in response to message 11581

EXECUTIVE SUMMARY: The 4.19 client has known bugs when working through a proxy server. Please consider trying a more recent BOINC client.

http://einsteinathome.org/host/212379/tasks

But 3 hours later it requested another one: 1003925. And it requested only one? The second one of the new pair was requested in the morning.

You mean 7 hours later, right?

Nope! I ment the time difference between 18:53:33/18:54:34 and 21:24:11 which is 3 hours (ignoring the minutes ... )

I checked a little bit more and found out that the workunit 1003401 did not reach my machine.

I am studying the scheduler contact for this lost WU. It's very helpful that your machine has NTP accurate timestamps!

I think your proxy timeout is fine. Here's the relevant part of the log:

192.168.45.243 - - [10/May/2005:20:53:33 +0200] "POST http://einstein.phys.uwm.edu:80/EinsteinAtHome_cgi/cgi HTTP/1.0" 200 8804
192.168.45.243 - - [10/May/2005:20:54:34 +0200] "POST http://einstein.phys.uwm.edu:80/EinsteinAtHome_cgi/cgi HTTP/1.0" 200 8804

The first of these is the lost WU. What do the 200 and 8804 refer to?

200 is the return code of your Server, which means okay (404 means no file ...)
and 8804 seems to be the number of bytes transferred. The number 14736000 can be found in your log as well as in mine and it is sure not some number of seconds (that would be more that 170 days).

Here is the relevant bit of the scheduler log:
...
2005-05-10 18:53:33 [PID=22527] [debug ] [HOST#212379] Already has file H1_0688.5
2005-05-10 18:53:33 [PID=22527] [debug ] [HOST#212379] reducing disk needed for WU by 14736000 bytes (length of H1_0688.5)
2005-05-10 18:53:33 [PID=22527] [debug ] est cpu dur 43116.118341; running_frac 1.000000; rsf 1.000000; est 43116.118341
2005-05-10 18:53:33 [PID=22527] [normal ] [HOST#212379] Sending [RESULT#4098898 H1_0688.5__0688.9_0.1_T04_Fin1_2] (fills 43116.12 seconds)
2005-05-10 18:53:33 [PID=22527] [normal ] [HOST#212379] Sent 1 results [scheduler ran 0 seconds]

Is this normal, zero seconds for sending 14MB? I mean my end is a 2MBit line, so I would expect ~59 seconds? Which matches the timestamps in my apache log, or did you snip away the next request?

The problem may be a simple one to fix. The 4.19 client has some proxy-related bugs. You might try replacing it with something more recent.

Well, look the Linux part of that page.

Recommended version: 4.19 which I use
Development version: 4.32

Shall I upgrade to a development version? I mean I am running 4.19 on all my machines without troubles since February. Well there is no other machine with multiple CPUs, I usually act like 'Never touch a running system' :-)

But if you ask me, I will do. On the other hand starting with monday I will be away from my machines for about a week, hmm ...?

Currently the two CPU's are out of sync, so maybe the client will not request 2 WU's on block, but maybe rather one by one?

Note that looking at this I noticed another bug (RPC # remaining zero in the database). I think I know where this comes from -- it's unrelated to the lost WU -- I'm going to try and track this down with David Anderson later today.

Cheers,
Bruce

Bruce, don't tell me about bugs, doing QA I find more bugs than I can report. When I reach my daily quota of bugs, my boss calls me and gets really angry ... waiting for his next vacation ;^)))

A software without bugs is made by aliens from whatever universe, sure not from this one!

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.