Workaround for Linux SIGABRT errors

Desti
Joined: 20 Aug 05
Posts: 117
Credit: 23762214
RAC: 0

-

-

Desti
Joined: 20 Aug 05
Posts: 117
Credit: 23762214
RAC: 0
M. Schmitt
Joined: 27 Jun 05
Posts: 478
Credit: 15872262
RAC: 0

RE: Strange... running my

Message 63137 in response to message 63134

Quote:
Strange... running my notebook under Linux, I just killed the 2 WUs (with a "signal 11/SIGABRT") it was crunching by doing a manual update although it was neither reporting finished tasks nor getting new work (which I had disallowed)... I just wanted it to get my new preferences right. As it was only about 20 minutes into the WUs, there was not much damage done, but still I think it's surprising. My two computation errors on my desktop fit the scenario described in this thread exactly.
On another note: I noticed that while I mostly get really big WUs, all those that crashed seem to have been smaller ones (under 300 credits). Only coincidence, or are those WUs really more likely to die on you?

On 2 (3) hosts I manually report (update project) finished work all the time, because they still use new BOINC clients, and there have been no problems *knockonwood* so far with this method. I think there is no way to totally fix the problems with the Linux app through user "behavior" as described in this thread. But maybe we can reduce the chances of getting SIGABRTs.
I never use the BOINC manager to update something, I always do it with BS. Maybe that makes a difference too. Longer and shorter WUs: well, we might be confronted with more than just one error (Murphy's Law), so debugging is hard. There are still many errors on Windows systems too, so there may be bugs in the win-app and/or in the WUs. Who knows?

This is one of my crashed WUs. It crashed on a Windows system too(Unhandled Exception Detected...).

It would be interesting to crunch such a SIGABRTed WU on the same system again, to see if the error is showing up again. But there is/was no beta test where we could have done so.

This is also a WU that crashed on one of my systems, but the other host, that has successfully finished the WU is running ........ Linux.

Anyway, if a WU crashes exactly while the BOINC client is talking with the server, this is a sign that there might be a communication problem between app and core client.

If someone likes to play around, he/she can suspend/resume a newly started WU a couple of times to see what happens.

cu,
Michael

M. Schmitt
Joined: 27 Jun 05
Posts: 478
Credit: 15872262
RAC: 0

RE: This always happens

Message 63138 in response to message 63136

You get additional 'SIGSEGV: segmentation violation' errors after the SIGABRT message. I have seen that before. It might depend on the distribution/kernel, in your case Gentoo. And you don't get hundreds or thousands of lines full of 'SIGABRT: abort called'. I have seen an error log of 2 MB on one of my systems.

cu,
Michael

M. Schmitt
Joined: 27 Jun 05
Posts: 478
Credit: 15872262
RAC: 0

RE: This always happens

Message 63139 in response to message 63136

Quote:
This always happens when I restart BOINC.

You can try to quit BOINC with the command 'boinc_cmd --quit' to see if this error depends on the manager. If you start the core client through the manager, you can try to start it with 'sudo -u xxx run_client' instead. So far I haven't had any problems with shutting down and restarting my VMware crunchers.

cu,
Michael

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494410
RAC: 0

You run BOINC as root? ;-)

You run BOINC as root? ;-)
Well, I agree that updating manually via BOINC Manager might not have been a really smart move. Won't do it again ;-) except maybe to test a few theories. I guess I'll do that later today, after having fulfilled a few more social duties (yeah, they can really interfere with one's nerdy VL...) The notebook I would test on is an Intel C1D running Ubuntu Feisty Fawn with a 2.6.20-15 kernel and BOINC 5.8.16, in case the CPU, distro, kernel and BOINC version matter... we'll see what turns up.

M. Schmitt
Joined: 27 Jun 05
Posts: 478
Credit: 15872262
RAC: 0

RE: You run BOINC as root?

Message 63141 in response to message 63140

Quote:
You run BOINC as root? ;-)

You are crazy? ;-)

sudo --help

;-)

In my case 'xxx' stands for 'boinc', nothing rated in any way. *g*

Happy (rest)weekend
Michael

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 686043288
RAC: 582127

RE: This always happens

Message 63142 in response to message 63136

Hi!

Your particular crashes seem to be libGL related (I had a similar log trace; whether this is related to the other "signal 11" errors is another story).

What you might want to try is to keep the Einstein app from loading the libGL shared library.

The following applies to Linux only!

There are probably more elegant methods, this is just quick and dirty:

0) Exit from BOINC manager. Stop the BOINC core client if it's still running

# ps -Al | grep boinc | grep -v grep

should show nothing ;-), else send a SIGTERM to the boinc process

1) Create a deliberately defunct libGL in a separate directory
E.g. if __BOINC_PATH__ is the absolute path of your BOINC installation, do the following

# cd __BOINC_PATH__/BOINC/projects/einstein.phys.uwm.edu

# mkdir sigabrt_fix

# cd sigabrt_fix

# echo "" > libGL.so.1

2) Edit run_client script

# cd __BOINC_PATH__/BOINC

# cp run_client run_client.bak

There is a script file run_client. Use your favourite editor to insert this line at the top:

export LD_LIBRARY_PATH=__BOINC_PATH__/BOINC/projects/einstein.phys.uwm.edu/sigabrt_fix:${LD_LIBRARY_PATH}

(all in one line, replace __BOINC_PATH__ with the absolute path under which BOINC is installed)

3) restart the core client

e.g.
# cd __BOINC_PATH__/BOINC

# nohup ./run_client &

4) start boinc manager to test the fix

# cd __BOINC_PATH__/BOINC

# ./run_manager &

When you select a running einstein (or any other!!) task, the button to show the graphics must now be disabled. If it's still enabled, something went wrong in the steps above.
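A second way to check, besides the graphics button, is to look at what the running science app has actually mapped into memory. The following is only a sketch under assumptions: it relies on Linux's /proc/<pid>/maps, on pgrep from procps, and on the science app's process name starting with "einstein" (as in the app names in the logs). The function names are made up for this example. If BOINC runs under a different account, reading its /proc entries may need sudo.

```shell
#!/bin/sh
# has_libgl_mapped MAPS_FILE
# Succeeds if the given /proc/<pid>/maps listing mentions a libGL shared object.
has_libgl_mapped() {
    grep -q 'libGL\.so' "$1"
}

# check_einstein_tasks
# Reports, for every process whose name matches "einstein", whether a libGL
# mapping is present. With the workaround in place none is expected.
check_einstein_tasks() {
    for pid in $(pgrep einstein); do
        if has_libgl_mapped "/proc/$pid/maps"; then
            echo "PID $pid: libGL is mapped, graphics still seem to be active"
        else
            echo "PID $pid: no libGL mapping found"
        fi
    done
}
```

Calling check_einstein_tasks while a task is crunching prints one line per science app process; no output at all just means no matching process was found.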

While an einstein task is running, press the "update" button with the einstein project selected as often as you like. Hopefully you are now immune to the SIGABRT problem.

Disclaimer: worked for me, might not work for you!! Because this hack disables graphics for **all** projects and not just E@H, I strongly advise trying it only with all other projects temporarily suspended.
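For anyone who prefers not to type steps 1 and 2 by hand, they can be sketched as two small shell functions. This is an illustration, not a tested installer: the function names are invented here, and the shebang handling is an assumption (the steps above only say "at the top"; the sketch puts the line after a leading "#!" line if run_client has one, otherwise at the very top). Stop the core client first, as in step 0.

```shell
#!/bin/sh
# make_stub_libgl PROJECT_DIR
# Step 1: create PROJECT_DIR/sigabrt_fix/libGL.so.1 as a deliberately
# invalid library (a file holding a single newline, like: echo "" > libGL.so.1).
make_stub_libgl() {
    mkdir -p "$1/sigabrt_fix" || return 1
    echo "" > "$1/sigabrt_fix/libGL.so.1"
}

# patch_run_client SCRIPT FIX_DIR
# Step 2: keep a backup as SCRIPT.bak and add the LD_LIBRARY_PATH export.
# Does nothing if SCRIPT already mentions FIX_DIR.
patch_run_client() {
    script=$1
    fix_dir=$2
    line="export LD_LIBRARY_PATH=$fix_dir:\${LD_LIBRARY_PATH}"
    grep -qF -- "$fix_dir" "$script" && return 0
    [ -e "$script.bak" ] || cp -p "$script" "$script.bak" || return 1
    tmp="$script.tmp.$$"
    first=$(head -n 1 "$script")
    case $first in
        '#!'*) { printf '%s\n' "$first" "$line"; tail -n +2 "$script"; } > "$tmp" ;;
        *)     { printf '%s\n' "$line"; cat "$script"; } > "$tmp" ;;
    esac || { rm -f "$tmp"; return 1; }
    # Write back through the existing file so its permissions are kept.
    cat "$tmp" > "$script" && rm -f "$tmp"
}
```

With B set to __BOINC_PATH__/BOINC, the calls would be: make_stub_libgl "$B/projects/einstein.phys.uwm.edu" followed by patch_run_client "$B/run_client" "$B/projects/einstein.phys.uwm.edu/sigabrt_fix". To undo the change, copy run_client.bak back over run_client.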

Good luck. If anybody feels like trying this, pls let us know if this helps.

CU

BRM

M. Schmitt
Joined: 27 Jun 05
Posts: 478
Credit: 15872262
RAC: 0

I don't believe that we have a

Message 63143 in response to message 63142

I don't believe that we have a libGL-related error (any more). All crashed results on your side were done with the old app. On 28 April we got the current, bug-fixed(!?) version. So the fact that you don't get any errors any more may just be caused by the new app.

On my host running a recent version of Knoppix, I played around with the 3 WUs in the cache. I suspended/resumed all of them a couple of times and nothing happened. All 3 errorlog.txt files are OK. I also opened the graphics window, no problems.
Furthermore, I doubt that there were any graphics-related changes between the S5R1 and the S5R2 app. In the past, AFAIR, most problems with graphics were related to bad ATI drivers.

Good luck, and may the errors stay away. ;)

cu,
Michael

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 686043288
RAC: 582127

RE: I don't believe that we

Message 63144 in response to message 63143

Quote:

I don't believe that we have a libGL-related error (any more). All crashed results on your side were done with the old app. On 28 April we got the current, bug-fixed(!?) version. So the fact that you don't get any errors any more may just be caused by the new app.

On my host running a recent version of Knoppix, I played around with the 3 WUs in the cache. I suspended/resumed all of them a couple of times and nothing happened. All 3 errorlog.txt files are OK. I also opened the graphics window, no problems.
Furthermore, I doubt that there were any graphics-related changes between the S5R1 and the S5R2 app. In the past, AFAIR, most problems with graphics were related to bad ATI drivers.

Good luck, and may the errors stay away. ;)

cu,
Michael

Hi!

The new app still seems to have graphics-related problems. Here's Desti's first result from the list he posted:

================
2007-05-06 11:50:21.5552 [normal]: Start of BOINC application 'einstein_S5R2_4.18_i686-pc-linux-gnu'.
2007-05-06 11:50:29.6341 [debug]: Reading SFTs and setting up stacks ... einstein_S5R2_4.18_i686-pc-linux-gnu: freeglut_callbacks.c:100: glutTimerFunc: Assertion `fgState.Initialised' failed.
SIGABRT: abort called
SIGSEGV: segmentation violation

Validate state Invalid
Claimed credit 104.291203763088
Granted credit 0
application version 4.18
================

GLUT is the OpenGL Utility Toolkit, a helper library for OpenGL applications.

I'm not saying all "signal 11" errors are related to OpenGL, but this one clearly is.

It looks a lot like my last error result with the old app, so nothing has changed in this department, I'm afraid:

===============
http://einsteinathome.org/task/83518453

7356, 7357, 7358, 7359, 7360, 7361, 7362, 7363, 2007-04-26 17:40:51.6694 [debug]: set_checkpt(): bytes: 585772, file: 585772
7364, 7365, einstein_S5R2_4.14_i686-pc-linux-gnu: freeglut_callbacks.c:100: glutTimerFunc: Assertion `fgState.Initialised' failed.
SIGABRT: abort called
SIGABRT: abort called
SIGABRT: abort called
SIGABRT: abort called
SIGABRT: abort called
.....

===============

It's a pity that most result traces that are generated by "signal 11" are filled up with lines
SIGABRT: abort called
so we don't see the interesting output just before the first such line :-(
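If you still have the complete stderr output on your own host, the interesting part can be pulled out with grep. A small sketch, assuming GNU grep (for the -m and -B options) and a log file that contains the literal line shown above; the function names are made up for this example:

```shell
#!/bin/sh
# first_abort_context FILE [N]
# Prints up to N lines (default 5) that precede the first
# "SIGABRT: abort called" line, plus that line itself, then stops reading.
first_abort_context() {
    grep -m 1 -B "${2:-5}" -F 'SIGABRT: abort called' "$1"
}

# count_abort_lines FILE
# Prints how many "SIGABRT: abort called" lines the file contains.
count_abort_lines() {
    grep -c -F 'SIGABRT: abort called' "$1"
}
```

For example, first_abort_context stderr.txt 10 shows the ten lines before the flood starts, which is where an assertion message like the freeglut one would appear. This only helps with a local, untruncated copy; it cannot recover what was already cut off from the result shown on the web page.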

CU

BRM
