High Failure rate on Line Veto

Donald A. Tevault
Joined: 17 Feb 06
Posts: 439
Credit: 73516529
RAC: 0
Topic 196219

Here's what I have so far on one of my Fedora 16 machines:

Fedora 16 machine

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7057034931
RAC: 1618437

Quote:

Here's what I have so far on one of my Fedora 16 machines:

Fedora 16 machine


That same host also seems to have a higher-than-usual recent error rate on Gravitational Wave S6 GC search v1.01:

recent error tasks for that host

For reference, my five hosts have only one "error while computing" displayed among them currently, and all but one of them run Einstein 24/7 at greater than 90%. I'd suggest you review the usual suspects: overclocking, overheating, software conflicts, marginal RAM, not quite enough CPU voltage even though not overclocked...

mickydl*
Joined: 7 Oct 08
Posts: 39
Credit: 200374822
RAC: 0

So far all of the Line Veto WUs have failed on my Linux machine. The stderr.txt looks as follows:

7.0.18

process exited with code 22 (0x16, -234)

execv: No such file or directory

The machine in question is: Nostromo

All other Einstein apps work flawlessly.

mickydl*

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4273
Credit: 245211351
RAC: 12943

The "execv: No such file or directory" message usually points to a missing shared library required by the application.

However the libraries required are identical between the S6Bucket and the S6LV1 (Linux) App:

EinsteinAtHome/einstein_S6Bucket_1.01_i686-pc-linux-gnu__SSE2:
	libpthread.so.0 => /lib/libpthread.so.0 (0x40019000)
	libm.so.6 => /lib/libm.so.6 (0x4002d000)
	libc.so.6 => /lib/libc.so.6 (0x4004f000)
	/lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)
EinsteinAtHome/einstein_S6LV1_1.10_i686-pc-linux-gnu__SSE2:
	libpthread.so.0 => /lib/libpthread.so.0 (0x40019000)
	libm.so.6 => /lib/libm.so.6 (0x4002d000)
	libc.so.6 => /lib/libc.so.6 (0x4004f000)
	/lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)

These are, however, 32Bit Apps; there is no 64Bit Linux App for S6LV1 yet. The BOINC Client used to check whether the compatibility libs are present on a system before offering to the server to run 32Bit Apps. From the errors we have got so far, I suspect that this check isn't working properly on (some) 7.0.x Clients.

Are you running on a 64Bit system, and do you have the 32Bit compatibility libs installed?
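
For anyone who wants to check this on their own host, here is a minimal sketch, assuming Python 3 is available. It reads the ELF header of an app binary to see whether it is 32-bit or 64-bit, and checks whether the 32-bit loader from the ldd listing above (/lib/ld-linux.so.2) exists. When that loader is missing, execv typically reports "No such file or directory" even though the app file itself is present. The function names are made up for illustration, and the loader path may differ on your distribution.

```python
import os

ELF_MAGIC = b"\x7fELF"
# EI_CLASS byte (offset 4 of the ELF header): 1 = ELFCLASS32, 2 = ELFCLASS64
ELF_CLASSES = {1: "32-bit", 2: "64-bit"}


def elf_class(header: bytes) -> str:
    """Classify an ELF file from its first bytes; raise ValueError if not ELF."""
    if len(header) < 5 or header[:4] != ELF_MAGIC:
        raise ValueError("not an ELF file")
    return ELF_CLASSES.get(header[4], "unknown")


def check_app(app_path: str, loader: str = "/lib/ld-linux.so.2") -> dict:
    """Report the app's ELF class and whether the assumed 32-bit loader exists.

    The default loader path is taken from the ldd output quoted in this
    thread; it is an assumption and may not match every distribution.
    """
    with open(app_path, "rb") as f:
        header = f.read(5)
    return {
        "app_class": elf_class(header),
        "loader_present": os.path.exists(loader),
    }
```

Calling check_app() with the path to the S6LV1 binary in your BOINC project directory should report "32-bit"; if loader_present is False on a 64-bit system, the 32-bit compatibility libs are probably not installed.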

BM

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4273
Credit: 245211351
RAC: 12943

As for the original question: You seem to get "signal 8" (Floating-Point Exception, FPE) sporadically for both the old (S6Bucket) and the new (S6LV1) Apps, so I'd conclude that the timing of this error and the release of S6LV1 is more of a coincidence.

We get a couple of these FPEs from systems with fairly recent kernels. We are currently investigating one such issue that occurs in the BOINC API on recent Debian testing systems. It is still not clear what exactly happens there and how to avoid it. Is the occurrence of these errors in any way related to a kernel update / change?
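
As a small aid for reading such task logs, here is a sketch, assuming Python 3 on a POSIX system, that turns a signal number into its symbolic name. On Linux, signal 8 is SIGFPE, which is also raised for integer division by zero, not only for floating-point faults. The helper name is made up for illustration.

```python
import signal


def describe_signal(signum: int) -> str:
    """Return a readable label such as 'signal 8 (SIGFPE)'."""
    try:
        sig = signal.Signals(signum)
    except ValueError:
        # The number does not correspond to a signal known on this platform.
        return f"signal {signum} (unknown)"
    return f"signal {signum} ({sig.name})"
```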

BM

Donald A. Tevault
Joined: 17 Feb 06
Posts: 439
Credit: 73516529
RAC: 0

Quote:

As for the original question: You seem to get "signal 8" (Floating-Point Exception, FPE) sporadically for both the old (S6Bucket) and the new (S6LV1) Apps, so I'd conclude that the timing of this error and the release of S6LV1 is more of a coincidence.

We get a couple of these FPEs from systems with fairly recent kernels. We are currently investigating one such issue that occurs in the BOINC API on recent Debian testing systems. It is still not clear what exactly happens there and how to avoid it. Is the occurrence of these errors in any way related to a kernel update / change?

BM

Hi Bernd!

The machines with the highest error rates have only been in operation for a few months, and are both running 64-bit Fedora 16 with 32-bit libraries installed. The one with the highest error rate, the one I've already referenced, is running Linux kernel 3.2. The one with the second-highest error rate hasn't been updated in a while, and is still running Linux kernel 3.1. So, the problem doesn't seem to have anything to do with having the most recent kernel updates.

Here's a 64-bit Debian 6 machine whose error rate isn't quite as bad. It's running Debian stable, with Linux kernel 2.6.32.

I also have two 64-bit OpenSuSE 12.1 machines with Linux kernel 3.1, but the error rates for the bucket workunits are quite low. (I haven't received any Line Veto workunits on them, so I can't yet say about them.) Also, the error rates for all of my 64-bit Scientific Linux and CentOS machines are quite low.

So, that makes me wonder, is there something strange with the way Debian and Fedora are compiling their kernels?

Edit--By the time you see this, the second Fedora machine will have been updated to Linux kernel 3.2. (Update is currently in progress.)

mickydl*
Joined: 7 Oct 08
Posts: 39
Credit: 200374822
RAC: 0

Quote:

The "execv: No such file or directory" message usually points to a missing shared library required by the application.

However the libraries required are identical between the S6Bucket and the S6LV1 (Linux) App:

EinsteinAtHome/einstein_S6Bucket_1.01_i686-pc-linux-gnu__SSE2:
	libpthread.so.0 => /lib/libpthread.so.0 (0x40019000)
	libm.so.6 => /lib/libm.so.6 (0x4002d000)
	libc.so.6 => /lib/libc.so.6 (0x4004f000)
	/lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)
EinsteinAtHome/einstein_S6LV1_1.10_i686-pc-linux-gnu__SSE2:
	libpthread.so.0 => /lib/libpthread.so.0 (0x40019000)
	libm.so.6 => /lib/libm.so.6 (0x4002d000)
	libc.so.6 => /lib/libc.so.6 (0x4004f000)
	/lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)

These are, however, 32Bit Apps; there is no 64Bit Linux App for S6LV1 yet. The BOINC Client used to check whether the compatibility libs are present on a system before offering to the server to run 32Bit Apps. From the errors we have got so far, I suspect that this check isn't working properly on (some) 7.0.x Clients.

Are you running on a 64Bit system, and do you have the 32Bit compatibility libs installed?

BM

Hi Bernd,

Thanks for your response.

The machine has been running all Einstein applications for several months now, and I don't think I have ever had any errors except when I trashed the WUs by doing something stupid here.

However, I may have found the problem. When checking the app I noticed that the executable flag was not set. That might explain the execv failure. I have no idea why this happened. So far BOINC has always downloaded the application and made it executable w/o any intervention by me. Maybe a problem with the 7.0.18 client?
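
For others who want to check the same thing, here is a minimal sketch, assuming Python 3 and write access to the BOINC project directory. It tests whether any execute bit is set on a file and, if not, adds execute permission for every class that already has read permission. The function names are made up for illustration, and this is not how the BOINC client itself sets the flag.

```python
import os
import stat


def is_executable(path: str) -> bool:
    """True if the owner, group, or other execute bit is set on the file."""
    mode = os.stat(path).st_mode
    return bool(mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH))


def make_executable(path: str) -> None:
    """Add execute permission for each class that already has read permission."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # Shift the read bits (0o444) right by two to land on the execute bits.
    mode |= (mode & 0o444) >> 2
    os.chmod(path, mode)
```

Running is_executable() on the app binary first shows whether the flag is really missing before anything is changed.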

I am now waiting for new work to see if it works.

@Donald: Just a thought. There is an old kernel bug that crashes the application from time to time. From Bernd's description it sounds like it could be this problem. To avoid this bug, make sure that you are using a NON-preemptive kernel.

regards,
mickydl*

Donald A. Tevault
Joined: 17 Feb 06
Posts: 439
Credit: 73516529
RAC: 0

Quote:


@Donald: Just a thought. There is an old kernel bug that crashes the application from time to time. From Bernd's description it sounds like it could be this problem. To avoid this bug, make sure that you are using a NON-preemptive kernel.

regards,
mickydl*

Hi Micky!

Yeah, you're right, and that was my problem a couple of years ago. However, that doesn't seem to be the problem now, since I have the stock pre-emptive kernel running on all my machines, but only a few are giving me problems.

But then, who knows? When I get time, I might compile my own kernel for one of the problem-children, just to see what happens.

Donald A. Tevault
Joined: 17 Feb 06
Posts: 439
Credit: 73516529
RAC: 0

Okay, non-preemptive kernel compilation is in progress on one Fedora 16 machine. We'll see if that fixes the problem.

(As you might know, Fedora doesn't offer any pre-built non-preemptive kernels in its repository.)
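
To confirm which preemption model a kernel was built with, here is a minimal sketch, assuming Python 3 and that the distribution installs the build configuration as /boot/config-<kernel release>, which is common on Fedora and Debian but not guaranteed. It looks for the CONFIG_PREEMPT, CONFIG_PREEMPT_VOLUNTARY and CONFIG_PREEMPT_NONE options. Newer kernels have additional preemption options that this sketch does not handle, and the function names are made up for illustration.

```python
import platform


def preemption_model(config_text: str) -> str:
    """Classify a kernel config as 'preemptible', 'voluntary', 'none' or 'unknown'."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as comments ("# CONFIG_X is not set"),
        # so only lines of the form CONFIG_PREEMPT...=y are collected.
        if line.startswith("CONFIG_PREEMPT") and line.endswith("=y"):
            enabled.add(line[:-2])
    if "CONFIG_PREEMPT" in enabled:
        return "preemptible"
    if "CONFIG_PREEMPT_VOLUNTARY" in enabled:
        return "voluntary"
    if "CONFIG_PREEMPT_NONE" in enabled:
        return "none"
    return "unknown"


def running_kernel_model() -> str:
    """Read the config of the running kernel from the assumed /boot location."""
    path = f"/boot/config-{platform.release()}"  # assumed location, see above
    with open(path, encoding="utf-8", errors="replace") as f:
        return preemption_model(f.read())
```

After rebooting into a home-built kernel, running_kernel_model() should return "none" if the non-preemptive option was selected and the config file was installed alongside the kernel.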

Donald A. Tevault
Joined: 17 Feb 06
Posts: 439
Credit: 73516529
RAC: 0

I've just rebooted on my new home-brew kernel. I'll watch it over the next few days to see if things improve. If so, I'll compile a kernel for the other machines, as well.

Donald A. Tevault
Joined: 17 Feb 06
Posts: 439
Credit: 73516529
RAC: 0

Quote:
I've just rebooted on my new home-brew kernel. I'll watch it over the next few days to see if things improve. If so, I'll compile a kernel for the other machines, as well.

So far, so good with the Bucket work-units. But, I'm still getting validate errors with the Gamma-Ray ones, so there's obviously another problem with them.
