GNU/Linux S5R3 App 4.20 available for Beta test

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4,074
Credit: 226,994,954
RAC: 28,919

Message 75667 in response to message 75664

Quote:
it just seems to me that with the current 4.20 application, no one seems too concerned about the signal 11 issue at the moment.


This is not true. Actually it's the problem that causes the highest failure rate of all, and consequently it's at the very top of my list of things to fix.

BM

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494,410
RAC: 0

Wedge: Afaik the packaged Kubuntu version runs as a daemon, meaning with superuser privileges. I've used that myself for a while and am actually quite certain. Nevertheless, running the latest stable core client and BOINC with user rights (not an own user but a screen in my "normal" account, which should technically make no difference) has not been able to fix the signal 11 issue on my laptop. Since I've returned from vacation today I'll try to reproduce the error and get debugger output on it.

Jos van Wolput
Joined: 11 Feb 05
Posts: 47
Credit: 800,840
RAC: 0

No problems running BOINC 5.10.28 with app 4.20 as a daemon under my own user privileges in /usr/local/Boinc, on Debian Sid.
It runs automatically, controlled by a runlevel script.

Brian Silvers
Joined: 26 Aug 05
Posts: 772
Credit: 282,700
RAC: 0

Message 75670 in response to message 75668

Quote:
Wedge: Afaik the packaged Kubuntu version runs as a daemon, meaning with superuser privileges. I've used that myself for a while and am actually quite certain. Nevertheless, running the latest stable core client and BOINC with user rights (not an own user but a screen in my "normal" account, which should technically make no difference) has not been able to fix the signal 11 issue on my laptop. Since I've returned from vacation today I'll try to reproduce the error and get debugger output on it.

I have no idea how Ubuntu sets up the first / only user on the machine, but all of the results I ran (4 results with 4.20 and 2 results with 4.21) completed without error... I would guess NOT as root, as I had to use sudo a few times to get things installed... (???)

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4,074
Credit: 226,994,954
RAC: 28,919

It looks like the "signal 11" problem has finally been found and fixed in the new Beta Test App. Many many thanks to Bikeman and Kathryn, and everyone who helped with reports!

BM

Wedge009
Joined: 5 Mar 05
Posts: 83
Credit: 8,797,248,141
RAC: 3,604,922

A curious observation...

My laptop (with the worst history of signal 11 failures of Einstein WUs out of all my hosts) was displaying an unusual clock time relative to my other computers. So I manually forced a resynchronisation with a time server, causing the clock to 'jump back' around two minutes.

Now, I know changing the system time has an effect on how BOINC calculates CPU and "To completion" times of its listed WUs, but since I was running SETI at the time, I thought it wouldn't matter (and I had updated the system time previously without problems as well, albeit without such a large correction).

No, I was wrong.

I lost yet another 29 hours of Einstein work (on a WU ~90% complete!), presumably for reasons similar to those behind the failure on network disconnection.

I was really hoping to get that one done before moving to 4.24. Oh well, I suppose I have no reason not to now... Still really frustrating, though. :(

Soli Deo Gloria

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494,410
RAC: 0

Time changes have a history of causing "exited with zero status" and "no heartbeat from core client" error messages. Most of the time BOINC just carries on crunching, but with an already unstable core client/science app combination it might turn out destructive...
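
Purely as an illustration of why a clock step can look like a missed heartbeat (this is not BOINC's actual code; the timeout value and the function name `heartbeat_lost` are assumptions made up for the sketch), compare a timeout check based on wall-clock time with one based on a monotonic clock:

```python
import time

HEARTBEAT_TIMEOUT = 30.0  # hypothetical timeout in seconds, not BOINC's real value


def heartbeat_lost(last_beat: float, now: float, timeout: float = HEARTBEAT_TIMEOUT) -> bool:
    """Return True if more than `timeout` seconds passed since the last heartbeat."""
    return now - last_beat > timeout


# Simulated wall clock: a heartbeat arrives at t=1000 s, then 5 real seconds pass.
last_beat_wall = 1000.0
real_elapsed = 5.0

# Case 1: the wall clock is stepped forward by 120 s during those 5 seconds.
now_wall_forward = last_beat_wall + real_elapsed + 120.0
false_alarm = heartbeat_lost(last_beat_wall, now_wall_forward)  # timeout reported after only 5 real seconds

# Case 2: the wall clock is stepped back by 120 s, so the elapsed time goes negative.
now_wall_back = last_beat_wall + real_elapsed - 120.0
negative_elapsed = now_wall_back - last_beat_wall

# A monotonic clock is not affected by such steps, so the same check stays sane.
last_beat_mono = time.monotonic()
still_alive = not heartbeat_lost(last_beat_mono, time.monotonic())

print(false_alarm, negative_elapsed, still_alive)
```

Whether the real client and science app compare timestamps this way is not something this thread establishes; the sketch only shows that any check built on wall-clock differences misbehaves when the clock is stepped in either direction.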

Wedge009
Joined: 5 Mar 05
Posts: 83
Credit: 8,797,248,141
RAC: 3,604,922

Does that mean the 4.24 application ought not to exhibit that same destructive behaviour?

Soli Deo Gloria

M. Schmitt
Joined: 27 Jun 05
Posts: 478
Credit: 15,872,262
RAC: 0

Message 75675 in response to message 75672

I don't think date or time changes have any influence on BOINC or the science apps. The time is probably requested from the kernel, as you can check with 'ps'.

Example:

micha@luemmel:~> pidof boinc
3839
micha@luemmel:~> ps -f --ppid 3839
UID        PID  PPID  C STIME TTY          TIME CMD
boinc     7538  3839 98 13:25 ?        01:48:51 einstein_S5R3_4.21_i686-pc-linux-gnu --method=0 --Freq=763.12
boinc     7616  3839 98 13:30 ?        01:43:56 einstein_S5R3_4.21_i686-pc-linux-gnu --method=0 --Freq=763.12
micha@luemmel:~>

There are several things that can cause a signal 11. Yesterday a CGI script I am working on went wild because of a faulty entry in the session table of the database. Within seconds it occupied the whole 4 GB of memory and the complete swap space. I was still able to kill the process, but looking up the PID with top showed the einstein apps as 'defunct' (not always, but sometimes). 4 WUs got a 'signal 11' this way.
In the past I also got signal 11 errors when transferring huge amounts of data from whole partitions and piping them through gzip. It probably took too long until BOINC was able to write to disk.

cu,
Michael

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4,074
Credit: 226,994,954
RAC: 28,919

Message 75676 in response to message 75674

Quote:
Does that mean the 4.24 application ought not to exhibit that same destructive behaviour?


Due to a bug in BOINC on Linux, the "no heartbeat from core client" condition led to a segfault ("signal 11") and a Client Error for the task. Last week we found and fixed the bug in BOINC, and the fix went into 4.24. Instead of producing a client error, the app should now just be restarted by the Core Client, issuing only a "no finished file" message.
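
To make the "signal 11" wording concrete, here is a minimal sketch of how a parent process on Linux can tell that a child died from a segfault rather than exiting normally. It is not how the BOINC Core Client does it; the child command and the helper `describe_exit` are stand-ins invented for the illustration. It relies only on the documented behaviour that Python's `subprocess` reports a negative return code `-N` when a child is terminated by signal `N` on POSIX systems.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Turn a subprocess return code into a readable description (hypothetical helper)."""
    if returncode < 0:
        sig = signal.Signals(-returncode)
        return f"killed by signal {sig.value} ({sig.name})"
    return f"exited with status {returncode}"


# Stand-in for a crashing science app: a child that sends SIGSEGV to itself.
child_code = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"
result = subprocess.run([sys.executable, "-c", child_code])

message = describe_exit(result.returncode)
print(message)
```

On Linux, where SIGSEGV is signal number 11, the message should name signal 11, which is the same condition the task logs report.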

BM
