5th Computing error for S5R2 on one host

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 689121881
RAC: 213260

Message 62993 in response to message 62992

Quote:
That couldn't have been the reason for my WU to crash. I'm completely sure I didn't pause/resume that. Maybe this can trigger SIGABRT errors, but it can't be the only thing that causes them...
...

Hi!
SIGABRT seems to happen sometimes on *any* interruption/suspension of the science client:

- manual update
- manual suspend
- automatic interruption to perform benchmark
- automatic switch to other project if E@H isn't your only one
- power-down
- disconnecting the power supply and running on batteries on notebooks (if configured this way)
- ....

CU

BRM

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494410
RAC: 0

If anything, it could have been the benchmarks, because this is a desktop PC and I left it completely alone. Interesting theory, I'll have a look at the logs to see if they show a benchmark at that time.

Bikeman (Heinz-Bernd Eggenstein)

Message 62995 in response to message 62994

Quote:
If anything, it could have been the benchmarks, because this is a desktop PC and I left it completely alone. Interesting theory, I'll have a look at the logs to see if they show a benchmark at that time.

...or maybe an automatic interruption to report a result or fetch new work... There are so many different reasons for an interruption, but most of them should show up in the logs. Did you try disabling graphics as I suggested in another thread? I haven't had a "signal 11" client error since.

CU

BRM

Annika

I wasn't even running the manager at that moment. When the box is only crunching, I put the core client in a screen and run command line only. Somehow, I have the feeling this is safer and more efficient ;-) maybe because I grew up to be a Linuxer on server command lines ;-)

Trog Dog
Joined: 25 Nov 05
Posts: 191
Credit: 541562
RAC: 0

I have one host that is erroring out on sig11 aborts - it's an AthlonXP running FC6. My other two AthlonXPs running Debian Etch have had no problems yet, and neither has my Athlon64, which also runs FC6 (64bit). In fact, out of all my boxes this is the only one that's affected.

Does anyone else have a similar experience?

Annika

If I should venture a guess, I'd say that host got a problematic datafile and keeps getting WUs which are more likely to get an error.

Trog Dog

Message 62999 in response to message 62998

Quote:
If I should venture a guess, I'd say that host got a problematic datafile and keeps getting WUs which are more likely to get an error.

Aren't these R2 wu's individual - whereas the R1/I's were derived from one datafile for a series of wu's - or are you referring to something else?

Annika

No, they're not individual, data files are still in use (in a slightly different form), as stated by Bernd Machenschalk at the beginning of the run:

Quote:

With the amount of data to be analyzed, however, also grows the amount of data to be downloaded to your machines. To save bandwidth both on your and on our side, we developed a new scheme for data files. They are split into small files of about 3MB in size (that are somewhat re-combined in the App). When there is no more work available for the data files you already have on your machine, the scheduler will try to assign work to your host that minimizes the number of files you have to download, so you should in most cases get away with a new download of only two files (~6MB).

However, a complete set of files (that will be used for many tasks you get) can consist of up to 10 files, so the initial download can be more than 40MB (including the Application and other data files that need to be downloaded only once per run).

Bikeman (Heinz-Bernd Eggenstein)

Message 63002 in response to (parent removed)

Annika wrote:

Quote:
I wasn't even running the manager at that moment. When the box is only crunching, I put the core client in a screen and run command line only. Somehow, I have the feeling this is safer and more efficient ;-) maybe because I grew up to be a Linuxer on server command lines ;-)

I wasn't talking about actually using the graphics. I also never used the visualization. What cured the "signal 11" problem for my hosts was that I prevented the einstein client from even loading the libGL OpenGL library (which happens when the client starts).

I doubt that bad batches of data files cause this problem, because then it would probably not be limited to Linux, and there only to some installations. It would be more deterministic, I guess.

Quote:


I have one host that is erroring out on sig11 aborts - it's an AthlonXP running FC6. My other two AthlonXPs running Debian Etch have had no problems yet, and neither has my Athlon64, which also runs FC6 (64bit). In fact, out of all my boxes this is the only one that's affected.

Does anyone else have a similar experience?

This would be consistent with the hypothesis that somehow the new e@h client has interoperability issues with at least some versions of libGL (or any other lib, but you might want to investigate libGL for a start).

@Trog Dog, maybe you want to check the libGL versions on your different hosts and see whether they correlate with how susceptible each host is to the signal 11 issue. E.g. you could do
# ldd einstein*.so
in the projects/einstein* subdir of the BOINC installation to see whether libGL.so is loaded, and which one.
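
A minimal sketch of that check as a small shell helper, assuming a POSIX shell with ldd and grep available. The function names filter_libgl and report_libgl are made up for illustration; they are not part of BOINC or the Einstein@Home client.

```shell
# Keep only the libGL lines from ldd-style output read on stdin.
# "|| true" keeps the exit status at 0 when nothing matches.
filter_libgl() {
    grep 'libGL\.so' || true
}

# Run ldd on each given file and print the libGL it resolves to, if any.
# File names are passed in by the caller; nothing is hard-coded here.
report_libgl() {
    for f in "$@"; do
        [ -e "$f" ] || continue          # skip globs that matched nothing
        printf '%s:\n' "$f"
        ldd "$f" 2>/dev/null | filter_libgl
    done
}
```

Called as `report_libgl einstein*.so` from the projects/einstein* subdir, this prints each file name followed by its libGL line, or just the file name if no libGL is linked. Comparing that output between an affected and an unaffected host is one way to look for the correlation mentioned above.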

CU
BRM

Trog Dog
Trog Dog
Joined: 25 Nov 05
Posts: 191
Credit: 541562
RAC: 0

G'day Bikeman

After a quick survey of the distros on my boxes, one thing stood out: the problem box, an AthlonXP 2400 running FC6 (32bit), has the executable bit set on libGL.so.1

Running ./libGL.so.1 produces a Segmentation Fault message.

On the other distros across my various boxes, libGL.so.1 is either non-existent or not executable.

I'm going to remove the execute permission for this library and see what breaks/works :)
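
The experiment can be sketched as follows, assuming a POSIX shell. The helper names is_executable and drop_exec_bit are hypothetical, and the path to libGL.so.1 has to be supplied by hand because it differs between distros. Changing a system library usually needs root.

```shell
# Succeeds if the current user has execute permission on the file.
is_executable() {
    [ -x "$1" ]
}

# Show the permissions, remove the execute bit for everyone, show them again.
# ls -L and chmod both follow a symlink to the real library file.
drop_exec_bit() {
    lib=$1
    [ -e "$lib" ] || { echo "not found: $lib" >&2; return 1; }
    ls -lL "$lib"
    chmod a-x "$lib"
    ls -lL "$lib"
}
```

Noting the original mode from the first ls line makes it easy to restore later with chmod if something breaks. Whether this change has any effect on the signal 11 errors is exactly what the test is meant to find out.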
