Workaround for Linux SIGABRT errors

M. Schmitt
Joined: 27 Jun 05
Posts: 478
Credit: 15872262
RAC: 0
Topic 192672

I run several Linux systems with different BOINC clients, and I found that, at least in my case, all SIGABRT errors occur when two things coincide:

The client asks the scheduler for more work AND, in the same request, reports a result that was uploaded earlier.

This FIRST attempt causes the SIGABRT error, and a new unit is started. One minute (or more) later the client asks for new work again and also wants to report the (broken?) work. This attempt always succeeds, and the currently running WU is NOT aborted! Really strange.
On hosts where I use the BOINC client from BoincStudio, there is an option to start it with -return_results_immediately. It does just what it says and prevents (>99%) the situation in which the client asks for new work and wants to report finished work at the same time. These hosts have not caused any errors so far.
Other hosts with newer BOINC clients no longer support this option. When one of these has finished and uploaded a WU, I usually do a manual project update through BoincStudio to report the finished WUs. No errors so far. Well, even I must sleep sometimes, or I'm out of the house, and then these ugly SIGABRT errors occur.

Because these new large WUs IMHO don't stress the servers too much, I recommend using one of the BOINC clients that supports the '-return_results_immediately' option, at least until the errors are fixed.

cu,
Michael

p.s. error_log example:

04/05/2007 06:59|Einstein@Home|Computation for task h1_0245.60_S5R2__21_S5R2c_1 finished
04/05/2007 06:59|Einstein@Home|Starting h1_0245.60_S5R2__19_S5R2c_1
04/05/2007 06:59|Einstein@Home|Starting task h1_0245.60_S5R2__19_S5R2c_1 using einstein_S5R2 version 418
04/05/2007 06:59|Einstein@Home|[file_xfer] Started upload of file h1_0245.60_S5R2__21_S5R2c_1_0
04/05/2007 06:59|Einstein@Home|[file_xfer] Finished upload of file h1_0245.60_S5R2__21_S5R2c_1_0
04/05/2007 06:59|Einstein@Home|[file_xfer] Throughput 14712 bytes/sec
04/05/2007 07:40|Einstein@Home|Sending scheduler request: To fetch work
04/05/2007 07:40|Einstein@Home|Requesting 72 seconds of new work, and reporting 1 completed tasks
04/05/2007 07:41|Einstein@Home|Scheduler RPC succeeded [server version 509]
04/05/2007 07:41|Einstein@Home|Deferring communication for 1 min 0 sec
04/05/2007 07:41|Einstein@Home|Reason: requested by project
04/05/2007 07:41|Einstein@Home|Deferring communication for 1 min 0 sec
04/05/2007 07:41|Einstein@Home|Reason: Unrecoverable error for result h1_0245.60_S5R2__19_S5R2c_1 (process got signal 11)
04/05/2007 07:41|Einstein@Home|Computation for task h1_0245.60_S5R2__19_S5R2c_1 finished
04/05/2007 07:41|Einstein@Home|Output file h1_0245.60_S5R2__19_S5R2c_1_0 for task h1_0245.60_S5R2__19_S5R2c_1 absent
04/05/2007 07:41|Einstein@Home|Starting h1_0245.60_S5R2__8_S5R2c_0
04/05/2007 07:41|Einstein@Home|Starting task h1_0245.60_S5R2__8_S5R2c_0 using einstein_S5R2 version 418
04/05/2007 08:18|Einstein@Home|Sending scheduler request: To fetch work
04/05/2007 08:18|Einstein@Home|Requesting 92 seconds of new work, and reporting 1 completed tasks
04/05/2007 08:18|Einstein@Home|Scheduler RPC succeeded [server version 509]
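
To check whether your own log shows the same pattern, something like the following could be used. It is only a minimal sketch: it assumes the log lines have the 'date time|project|message' layout shown above, and the function names and the 10-line window are made up for illustration. It flags scheduler requests that both ask for work and report tasks, and that are followed by an "Unrecoverable error" before the next request.

```python
import re

# Patterns taken from the message wording in the log excerpt above.
REQUEST_RE = re.compile(
    r"Requesting (\d+) seconds of new work, and reporting (\d+) completed tasks"
)
ERROR_RE = re.compile(r"Unrecoverable error for result (\S+)")


def parse_line(line):
    """Split 'date time|project|message' into a 3-tuple, or None if malformed."""
    parts = line.rstrip("\n").split("|", 2)
    return tuple(parts) if len(parts) == 3 else None


def find_suspect_requests(lines, window=10):
    """Return (request_timestamp, failed_result) pairs.

    A request is suspect when it asks for work and reports at least one task,
    and an unrecoverable error appears within `window` lines, before the next
    scheduler request. `window` is an arbitrary illustrative choice.
    """
    parsed = [p for p in (parse_line(l) for l in lines) if p is not None]
    hits = []
    for i, (stamp, _project, message) in enumerate(parsed):
        m = REQUEST_RE.search(message)
        if not m or int(m.group(1)) <= 0 or int(m.group(2)) <= 0:
            continue
        for _stamp, _proj, later in parsed[i + 1 : i + 1 + window]:
            if REQUEST_RE.search(later):
                break  # next request reached without an error in between
            err = ERROR_RE.search(later)
            if err:
                hits.append((stamp, err.group(1)))
                break
    return hits


# A few lines copied from the log above.
SAMPLE_LOG = [
    "04/05/2007 07:40|Einstein@Home|Sending scheduler request: To fetch work",
    "04/05/2007 07:40|Einstein@Home|Requesting 72 seconds of new work, and reporting 1 completed tasks",
    "04/05/2007 07:41|Einstein@Home|Scheduler RPC succeeded [server version 509]",
    "04/05/2007 07:41|Einstein@Home|Deferring communication for 1 min 0 sec",
    "04/05/2007 07:41|Einstein@Home|Reason: requested by project",
    "04/05/2007 07:41|Einstein@Home|Deferring communication for 1 min 0 sec",
    "04/05/2007 07:41|Einstein@Home|Reason: Unrecoverable error for result h1_0245.60_S5R2__19_S5R2c_1 (process got signal 11)",
    "04/05/2007 08:18|Einstein@Home|Sending scheduler request: To fetch work",
    "04/05/2007 08:18|Einstein@Home|Requesting 92 seconds of new work, and reporting 1 completed tasks",
    "04/05/2007 08:18|Einstein@Home|Scheduler RPC succeeded [server version 509]",
]

if __name__ == "__main__":
    for stamp, result in find_suspect_requests(SAMPLE_LOG):
        print(stamp, result)
```

On the sample lines, only the 07:40 request is flagged; the 08:18 request also reports a task but is not followed by an error, which matches the "second attempt succeeds" behaviour described in the post.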

Dave Burbank
Joined: 30 Jan 06
Posts: 275
Credit: 1548376
RAC: 0

Thanks for the info, looks like I'll be switching back to BoincStudio. What version of the BOINC core client does it use now?

There are 10^11 stars in the galaxy. That used to be a huge number. But it's only a hundred billion. It's less than the national deficit! We used to call them astronomical numbers. Now we should call them economical numbers. - Richard Feynman

M. Schmitt
Joined: 27 Jun 05
Posts: 478
Credit: 15872262
RAC: 0

RE: Thanks for the info,

Message 63126 in response to message 63125

Quote:
Thanks for the info, looks like I'll be switching back to BoincStudio. What version of the BOINC core client does it use now?

28/04/2007 13:11||Starting BOINC client version 5.4.9 for i686-pc-linux-gnu
28/04/2007 13:11||BoincStudio mod 0.5.6
28/04/2007 13:11||libcurl/7.13.2 OpenSSL/0.9.7e zlib/1.2.2 libidn/0.5.13

But keep in mind, this is a modified version which you can get here.
Afaik, there is currently no further development, though they are looking for developers.

cu,
Michael

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494410
RAC: 0

Well, the situation you describe matches my own experiences very well. But I have no idea about BOINC studio, is it very hard to use?

M. Schmitt
Joined: 27 Jun 05
Posts: 478
Credit: 15872262
RAC: 0

RE: Well, the situation you

Message 63128 in response to message 63127

Quote:
Well, the situation you describe matches my own experiences very well. But I have no idea about BOINC studio, is it very hard to use?

No, you don't need to use BS itself anyway, just the client. It replaces the actual client in the BOINC directory. You might have to change your start script, depending on whether you start via 'run_client' or some init script.

But in the meantime I found out, that the download page of 'Black Hole Sun' is dead. :(
But the tutorial is still online.
Hey you wanted to register at heise.de. ;-)
If you did, you can send me a forum mail, so I get your mail address and I can send you the BS stuff.
It would be good to find a place where one could host the files.

Ah, I found a place with a new download. :-)

For your personal assistance: heise.de ff

cu,
Micha

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494410
RAC: 0

I signed up ages ago, but had to use my uni mail address (no idea why heise doesn't like GMail), which I don't use for anything else... of course I forgot the password, lost the sheet with the data on it (I'm hopeless with anything written on paper) and the only mail client which has that account set up is under Windoze on the AMD box... I'll take care of it tomorrow, promise ;-)

Trog Dog
Joined: 25 Nov 05
Posts: 191
Credit: 541562
RAC: 0

It seems it would also work to set Einstein to No New Work while you have a task running. Once the task has completed and been reported, set Einstein back to requesting work. Once it receives its new task, set it back to No New Work.

Given the mammoth size of WUs at the moment, this manual intervention isn't needed very often, and you just need a backup project set to a token resource share to cover the case where an Einstein WU completes when you're not present.
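
For anyone who prefers to script this toggle rather than click through a manager, here is a rough sketch. It assumes your client ships the BOINC command-line tool (boinccmd; older clients call it boinc_cmd) and that it accepts the per-project operations nomorework, allowmorework and update, so check its help output for your version first. The helper names are made up for illustration, and the project URL is the one mentioned elsewhere in this thread.

```python
import subprocess

# Project URL as given in this thread; adjust if your client lists another one.
PROJECT_URL = "http://einstein.phys.uwm.edu/"

# Per-project operations this sketch assumes the command-line tool understands.
ALLOWED_OPERATIONS = {"nomorework", "allowmorework", "update"}


def project_command(operation, url=PROJECT_URL, tool="boinccmd"):
    """Build the argument list for one per-project operation (hypothetical helper)."""
    if operation not in ALLOWED_OPERATIONS:
        raise ValueError(f"unsupported operation: {operation!r}")
    return [tool, "--project", url, operation]


def run_project_command(operation, **kwargs):
    """Run the command; raises CalledProcessError if the tool reports failure."""
    return subprocess.run(project_command(operation, **kwargs), check=True)


if __name__ == "__main__":
    # Print the commands for one cycle instead of running them, so the sketch
    # is safe to try on a machine without a BOINC client installed.
    for op in ("allowmorework", "update", "nomorework"):
        print(" ".join(project_command(op)))
```

The commands are built as argument lists rather than shell strings so the URL never needs quoting. To actually execute one, call run_project_command("update") and so on.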

M. Schmitt
Joined: 27 Jun 05
Posts: 478
Credit: 15872262
RAC: 0

RE: Seems that it would

Message 63131 in response to message 63130

Quote:
It seems it would also work to set Einstein to No New Work while you have a task running. Once the task has completed and been reported, set Einstein back to requesting work. Once it receives its new task, set it back to No New Work.

That should work too, but I like to have a little cache of one or two days, especially with these long WUs. And besides, BoincStudio has some nice features like backup projects, CPU throttling and manual cache filling without changing anything in the general preferences.

cu,
Michael

Trog Dog
Joined: 25 Nov 05
Posts: 191
Credit: 541562
RAC: 0

RE: That should work too,

Message 63132 in response to message 63131

Quote:

That should work too, but I like to have a little cache of one or two days, especially with these long WUs. And besides, BoincStudio has some nice features like backup projects, CPU throttling and manual cache filling without changing anything in the general preferences.

cu,
Michael

Does it still alter claimed credits? That doesn't have an effect on Einstein but it does with other projects that still use the benchmarks to grant credit.

M. Schmitt
Joined: 27 Jun 05
Posts: 478
Credit: 15872262
RAC: 0

RE: Does it still alter

Message 63133 in response to message 63132

Quote:
Does it still alter claimed credits? That doesn't have an effect on Einstein but it does with other projects that still use the benchmarks to grant credit.

It can, even for specific projects, but it's not a must. You can use a bs_opts.xml to control an individual client, or use BS (Windows) to control all clients.

Example bs_opts.xml:


-1.00
0

Einstein@Home
http://einstein.phys.uwm.edu/
0
0
0
-1.000000

cu,
Michael

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494410
RAC: 0

Strange... running my notebook under Linux, I just killed the 2 WUs (with a "signal 11/SIGABRT") it was crunching by doing a manual update although it was neither reporting finished tasks nor getting new work (which I had disallowed)... I just wanted it to get my new preferences right. As it was only about 20 minutes into the WUs, there was not much damage done, but still I think it's surprising. My two computation errors on my desktop fit the scenario described in this thread exactly.
On another note: I noticed that while I mostly get really big WUs, all those that crashed seem to have been smaller ones (under 300 credits). Only coincidence, or are those WUs really more likely to die on you?
