GNU/Linux S5R3 App 4.20 available for Beta test

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4330

Credit: 251458775

RAC: 36430

RE: it just seems to me

2 Jan 2008 11:04:05 UTC

Message 75667 in response to message 75664

Quote:

it just seems to me that with the current 4.20 application, no one seems too concerned about the signal 11 issue at the moment.

This is not true. Actually it's the problem that causes the highest failure rate of all, and consequently it's at the very top of my list of things to fix.

Annika

Joined: 8 Aug 06

Posts: 720

Credit: 494410

RAC: 0

Wedge: Afaik the packaged

2 Jan 2008 19:24:15 UTC

Message 75668

Wedge: Afaik the packaged Kubuntu version runs as a daemon, meaning with superuser privileges. I've used that myself for a while and am actually quite certain. Nevertheless, running the latest stable core client and BOINC with user rights (not an own user but a screen in my "normal" account, which should technically make no difference) has not been able to fix the signal 11 issue on my laptop. Since I've returned from vacation today I'll try to reproduce the error and get debugger output on it.

Jos van Wolput

Joined: 11 Feb 05

Posts: 47

Credit: 800840

RAC: 0

No problems running BOINC

3 Jan 2008 3:16:32 UTC

Message 75669

No problems running BOINC 5.10.28 with app 4.20 as a daemon with my own user privileges in /usr/local/Boinc, using Debian Sid.
It runs automatically, controlled by a runlevel script.

Brian Silvers

Joined: 26 Aug 05

Posts: 772

Credit: 282700

RAC: 0

RE: Wedge: Afaik the

3 Jan 2008 20:48:04 UTC

Message 75670 in response to message 75668

Quote:

Wedge: Afaik the packaged Kubuntu version runs as a daemon, meaning with superuser privileges. I've used that myself for a while and am actually quite certain. Nevertheless, running the latest stable core client and BOINC with user rights (not an own user but a screen in my "normal" account, which should technically make no difference) has not been able to fix the signal 11 issue on my laptop. Since I've returned from vacation today I'll try to reproduce the error and get debugger output on it.

I have no idea how Ubuntu sets up the first / only user on the machine, but all of the results I ran (4 results with 4.20 and 2 results with 4.21) completed without error... I would guess NOT as root, as I had to use sudo a few times to get things installed... (???)

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4330

Credit: 251458775

RAC: 36430

It looks like the "signal 11"

14 Jan 2008 5:19:56 UTC

Message 75671

It looks like the "signal 11" problem has finally been found and fixed in the new Beta Test App. Many many thanks to Bikeman and Kathryn, and everyone who helped with reports!

Wedge009

Joined: 5 Mar 05

Posts: 128

Credit: 17563183874

RAC: 6991292

A curious

19 Jan 2008 13:36:40 UTC

Message 75672

A curious observation...

My laptop (with the worst history of signal 11 failures of Einstein WUs out of all my hosts) was displaying an unusual clock time relative to my other computers. So I manually forced a resynchronisation with a time server, causing the clock to 'jump back' around two minutes.

Now, I know changing the system time has an effect on how BOINC calculates CPU and "To completion" times of its listed WUs, but since I was running SETI at the time, I thought it wouldn't matter (and I had updated the system time previously without problems as well, albeit without such a large correction).

No, I was wrong.

I lost yet another 29 hours of Einstein work (on a WU ~90% complete!), presumably for reasons similar to those behind the failure on network disconnection.

I was really hoping to get that one done before moving to 4.24. Oh well, I suppose I have no reason not to now... Still really frustrating, though. :(

Soli Deo Gloria

Annika

Joined: 8 Aug 06

Posts: 720

Credit: 494410

RAC: 0

Time changes have a history

19 Jan 2008 13:49:31 UTC

Message 75673

Time changes have a history of causing "exited with zero status" and "no heartbeat from core client" error messages. Most of the time BOINC just carries on crunching, but with an already unstable core client/science app combination it might turn out destructive...
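
To illustrate why a stepped system clock can upset a heartbeat check, here is a minimal Python sketch. It is not BOINC's code: `HeartbeatMonitor`, `SteppableClock`, and the 30-second timeout are hypothetical names and values, assumed purely for illustration. It only shows that a duration computed from wall-clock readings becomes meaningless when the clock is corrected, while a monotonic clock is unaffected.

```python
import time


class SteppableClock:
    """Fake wall clock (hypothetical helper) that can be stepped like a resync."""

    def __init__(self, start=1000.0):
        self.now = start

    def __call__(self):
        return self.now

    def advance(self, seconds):
        """Real time passing."""
        self.now += seconds

    def step(self, seconds):
        """Administrative correction, e.g. a forced time-server sync."""
        self.now += seconds


class HeartbeatMonitor:
    """Hypothetical heartbeat check; the 30 s timeout is an assumed value."""

    def __init__(self, clock, timeout=30.0):
        self.clock = clock
        self.timeout = timeout
        self.last_beat = clock()

    def beat(self):
        self.last_beat = self.clock()

    def elapsed(self):
        return self.clock() - self.last_beat

    def timed_out(self):
        return self.elapsed() > self.timeout


# Wall clock stepped back two minutes, five real seconds after the last beat.
wall = SteppableClock()
back = HeartbeatMonitor(wall)
wall.advance(5)
wall.step(-120)
elapsed_after_step_back = back.elapsed()  # negative, so useless as a duration

# Wall clock stepped forward two minutes instead: a false timeout.
wall2 = SteppableClock()
fwd = HeartbeatMonitor(wall2)
wall2.advance(5)
wall2.step(120)
false_timeout = fwd.timed_out()

# time.monotonic() is not affected by system clock updates.
steady = HeartbeatMonitor(time.monotonic)
steady_timed_out = steady.timed_out()
```

Whether a given program misbehaves after a backward or a forward step depends on how it compares the timestamps; the sketch makes no claim about which comparison the core client actually uses.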

Wedge009

Joined: 5 Mar 05

Posts: 128

Credit: 17563183874

RAC: 6991292

Does that mean the 4.24

19 Jan 2008 13:56:34 UTC

Message 75674

Does that mean the 4.24 application ought not to exhibit that same destructive behaviour?

Soli Deo Gloria

M. Schmitt

Joined: 27 Jun 05

Posts: 478

Credit: 15872262

RAC: 0

I don't think date or time

19 Jan 2008 14:35:33 UTC

Message 75675 in response to message 75672

I don't think date or time changes have any influence on BOINC or the science apps. The time is probably requested from the kernel, as you can figure out with 'ps'.

Example:

micha@luemmel:~> pidof boinc
3839
micha@luemmel:~> ps -f --ppid 3839
UID        PID  PPID  C STIME TTY          TIME CMD
boinc     7538  3839 98 13:25 ?        01:48:51 einstein_S5R3_4.21_i686-pc-linux-gnu --method=0 --Freq=763.12
boinc     7616  3839 98 13:30 ?        01:43:56 einstein_S5R3_4.21_i686-pc-linux-gnu --method=0 --Freq=763.12
micha@luemmel:~>

Several things can cause a signal 11. Yesterday a CGI script I am working on went wild because of a faulty entry in the session table of the database. Within seconds it occupied the whole 4 GB of memory and the complete swap space. I was still able to kill the process, but looking up the pid with top showed the einstein apps as 'defunct' - not always, but sometimes. 4 WUs got a 'signal 11' this way.
In the past I also got signal 11 errors when transferring huge amounts of data from whole partitions and piping them through gzip. It probably took too long until boinc was able to write to disk.

cu,
Michael

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4330

Credit: 251458775

RAC: 36430

RE: Does that mean the 4.24

19 Jan 2008 17:40:33 UTC

Message 75676 in response to message 75674

Quote:

Does that mean the 4.24 application ought not to exhibit that same destructive behaviour?

Due to a bug in BOINC on Linux, the "no heartbeat from core client" condition led to a segfault ("signal 11") and a Client Error of the task. Last week we found and fixed the bug in BOINC, and the fix went into 4.24. Instead of giving a client error, the app should now just be restarted by the Core Client, issuing just a "no finished file" message.
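
For readers unsure what "signal 11" means in practice: on Linux, signal number 11 is SIGSEGV, the segmentation fault signal. The sketch below is a generic Python illustration, not BOINC code, of how a parent process can tell that a child died from that signal. The child here simply sends SIGSEGV to itself as a stand-in for a crashing application; that stand-in is an assumption made for the example.

```python
import signal
import subprocess
import sys

# Child process that sends SIGSEGV to itself, standing in for a crashing app.
crash_code = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"
child = subprocess.run([sys.executable, "-c", crash_code])

# On POSIX, subprocess reports death by signal N as returncode -N.
killed_by = signal.Signals(-child.returncode) if child.returncode < 0 else None
print("child return code:", child.returncode)
print("killed by:", killed_by.name if killed_by else "no signal")
```

A supervising process that sees this kind of exit can distinguish a crash from a clean exit and decide whether to report an error or restart the task.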
