it just seems to me that with the current 4.20 application, no one seems too concerned about the signal 11 issue at the moment.
This is not true. Actually it's the problem that causes the highest failure rate of all, and consequently it's at the very top of my list of things to fix.
BM
Wedge: AFAIK the packaged Kubuntu version runs as a daemon, meaning with superuser privileges. I've used that myself for a while and am actually quite certain. Nevertheless, running the latest stable core client and BOINC with user rights (not under a dedicated user, but in a screen session in my "normal" account, which should technically make no difference) has not fixed the signal 11 issue on my laptop. Since I returned from vacation today, I'll try to reproduce the error and get debugger output on it.
No problems running BOINC 5.10.28 with app 4.20 as a daemon under my own user privileges in /usr/local/Boinc, using Debian Sid.
It runs automatically, controlled by a runlevel script.
I have no idea how Ubuntu sets up the first / only user on the machine, but all of the results I ran (4 results with 4.20 and 2 results with 4.21) completed without error... I would guess NOT as root, as I had to use sudo a few times to get things installed... (???)
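Rather than guessing, the owner of the running client can be read from the process table. The following is only a minimal sketch: the function name boinc_owners is made up for illustration, and matching on "boinc" and "einstein" is an assumption based on the process names that appear in this thread.

```shell
# Print the distinct users that own BOINC-related processes.
# Reads "USER PID COMMAND" lines on stdin, for example from:
#   ps -e -o user= -o pid= -o comm=
# The name patterns below are an assumption taken from this thread.
boinc_owners() {
    awk '$3 ~ /boinc|einstein/ { print $1 }' | sort -u
}
```

Typical use would be `ps -e -o user= -o pid= -o comm= | boinc_owners`. If that prints root, the client and its science apps run with superuser privileges; any other name means they run under that account.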
It looks like the "signal 11" problem has finally been found and fixed in the new Beta Test App. Many many thanks to Bikeman and Kathryn, and everyone who helped with reports!
BM
A curious observation...
My laptop (with the worst history of signal 11 failures of Einstein WUs out of all my hosts) was displaying an unusual clock time relative to my other computers. So I manually forced a resynchronisation with a time server, causing the clock to 'jump back' around two minutes.
Now, I know changing the system time has an effect on how BOINC calculates CPU and "To completion" times of its listed WUs, but since I was running SETI at the time, I thought it wouldn't matter (and I had updated the system time previously without problems as well, albeit without such a large correction).
No, I was wrong.
I lost yet another 29 hours of Einstein work (on a WU ~90% complete!), presumably for reasons similar to those behind the failure on network disconnection.
I was really hoping to get that one done before moving to 4.24. Oh well, I suppose I have no reason not to now... Still really frustrating, though. :(
Soli Deo Gloria
Time changes have a history of causing "exited with zero status" and "no heartbeat from core client" error messages. Most of the time BOINC just carries on crunching, but with an already unstable core client/science app combination it might turn out destructive...
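To illustrate why a clock step can look like a missed heartbeat, here is a purely hypothetical sketch of a timeout check based on wall-clock timestamps. It is not BOINC's actual code: the function name heartbeat_lost and the 30-second timeout used in the example calls are assumptions for illustration only.

```shell
# Hypothetical heartbeat check, NOT taken from BOINC.
# Arguments: last_beat and now are timestamps in whole seconds,
# timeout is the longest allowed gap in seconds.
# Returns 0 (true) if the gap between the two readings exceeds the timeout.
heartbeat_lost() {
    last_beat=$1 now=$2 timeout=$3
    elapsed=$((now - last_beat))
    [ "$elapsed" -gt "$timeout" ]
}
```

With a 30-second timeout, `heartbeat_lost 1000 1005 30` is false (a 5-second gap). If the clock is stepped forward by two minutes between the readings, the same 5 seconds of real time becomes `heartbeat_lost 1000 1125 30`, which is true. A backward step gives a negative gap, which this naive check does not flag. Which of these cases, if any, matches what the real client does is not established in this thread.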
Does that mean the 4.24 application ought not to exhibit that same destructive behaviour?
Soli Deo Gloria
I don't think date or time changes have any influence on BOINC or the science apps. The time is probably requested from the kernel, as you can check with 'ps'.
Example:
micha@luemmel:~> pidof boinc
3839
micha@luemmel:~> ps -f --ppid 3839
UID PID PPID C STIME TTY TIME CMD
boinc 7538 3839 98 13:25 ? 01:48:51 einstein_S5R3_4.21_i686-pc-linux-gnu --method=0 --Freq=763.12
boinc 7616 3839 98 13:30 ? 01:43:56 einstein_S5R3_4.21_i686-pc-linux-gnu --method=0 --Freq=763.12
micha@luemmel:~>
Several things can cause a signal 11. Yesterday a CGI script I am working on went wild because of a faulty entry in the session table of the database. Within seconds it occupied the whole 4 GB of memory and the complete swap space. I was still able to kill the process, but looking up the PID with top showed the Einstein apps as 'defunct' (not always, but sometimes). 4 WUs got a 'signal 11' this way.
In the past I also got signal 11 errors when transferring huge amounts of data from whole partitions and piping them through gzip. It probably took too long until boinc was able to write to disk.
cu,
Michael
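For anyone who wants to check for 'defunct' science apps without watching top, the process state can be filtered from ps output. This is a minimal sketch; the function name list_defunct is made up for illustration, and it relies only on the fact that a defunct (zombie) process has a state starting with Z in the STAT column.

```shell
# Print PID and command name of every defunct (zombie) process.
# Reads "PID STAT COMMAND" lines on stdin, for example from:
#   ps -e -o pid= -o stat= -o comm=
# A state beginning with "Z" marks a defunct process.
list_defunct() {
    awk '$2 ~ /^Z/ { print $1, $3 }'
}
```

Typical use would be `ps -e -o pid= -o stat= -o comm= | list_defunct`. Empty output means no process is currently defunct.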
Due to a bug in BOINC on Linux, the "no heartbeat from core client" condition led to a segfault ("signal 11") and a Client Error for the task. Last week we found and fixed the bug in BOINC, and the fix went into 4.24. Instead of giving a client error, the app should now just be restarted by the Core Client, issuing just a "no finished file" message.
BM