Global Correlations S5 - Signal 11

Mr. Hankey
Joined: 30 Apr 10
Posts: 9
Credit: 103466998
RAC: 0

Hit a sig 11 on a WU with this host; here are the details.

http://einsteinathome.org/task/177507852

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 689006104
RAC: 212003

Message 97943 in response to message 97942

Quote:

Hit a sig 11 on a WU with this host; here are the details.

http://einsteinathome.org/task/177507852

Excellent! Thanks for sharing this info. I forwarded it to the devs, and it seems there might be a smoking gun in the log output.

Thanks again
HB

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4273
Credit: 245208851
RAC: 13214

The cause of this segfault has been fixed in BOINC, and App 1.05 was built with the new BOINC version. It looks like this wasn't the only problem causing a signal 11, though, as we still get these.

BM

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5779100
RAC: 0

Message 97945 in response to message 97944

Computers with (intermittent) memory problems, or problems which only show up when the memory is filled to capacity, may make up part of what you see. I doubt you can build an application that takes everything into account. :-)

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4273
Credit: 245208851
RAC: 13214

Message 97946 in response to message 97945

Quote:
Computers with (intermittent) memory problems, or problems which only show up when the memory is filled to capacity, may make up part of what you see.

We have always had 'signal 11' errors on Linux that were supposedly caused by the 'optimistic memory allocation' of the OS, but those were less than 1% of all returned tasks. Currently we get ~7% 'signal 11' errors, about 10 times as many as all other errors combined. So far I can't see that the 1.05 App behaves significantly better in that respect than 1.04.
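For anyone who wants to check how their own Linux host handles 'optimistic memory allocation' (memory overcommit), here is a minimal Python sketch. It only reads the standard `vm.overcommit_memory` setting from `/proc`; the function name `describe_overcommit` is made up for this illustration, and nothing here is part of the Einstein@Home App or BOINC.

```python
from pathlib import Path

# Meanings of the documented values of vm.overcommit_memory on Linux.
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (kernel default)",
    1: "always overcommit",
    2: "strict accounting, no overcommit",
}


def describe_overcommit(path="/proc/sys/vm/overcommit_memory"):
    """Return a description of the overcommit mode, or None if unavailable.

    `path` is a parameter only so the function can be tried on non-Linux
    systems or against a test file (hypothetical helper, not a BOINC API).
    """
    try:
        mode = int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # File missing (not Linux), unreadable, or not a number.
        return None
    return OVERCOMMIT_MODES.get(mode, f"unknown mode {mode}")


if __name__ == "__main__":
    print(describe_overcommit())
```

With modes 0 or 1 an allocation can succeed even when the memory is not really available, and the process may only be killed later when it touches the pages, which is one way such errors could arise.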

BM

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5779100
RAC: 0

Message 97947 in response to message 97946

I can see how that's a problem. So what changed then (besides the applications and BOINC)? Additional compiler flags? Something in the tasks that's interfering? Is it only on Linux? If so, only on a specific distro or all over the place? 32-bit or 64-bit CPUs? Is ABP2 also seeing some of this effect?

CJOrtega
Joined: 19 Feb 05
Posts: 39
Credit: 1742781
RAC: 0

Message 97948 in response to message 97947


Data point:

Ubuntu Linux on older PCs (under 1 GHz CPU speed).

If the GCS5 WU completes while the Update Manager is running, or while I am doing system updates, I get a signal 11.

Maybe the WU API isn't waiting long enough for I/O to complete?

Claude

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4273
Credit: 245208851
RAC: 13214

Message 97949 in response to message 97948

Thanks!

Quote:
Ubuntu Linux on older PCs (under 1 GHz CPU speed).

Could you be more specific? Which version of Ubuntu, and what is your PC (CPU type, clock speed if possible, single CPU or Hyper-Threading)?

Quote:

If the GCS5 WU completes while the Update Manager is running, or while I am doing system updates, I get a signal 11.

Maybe the WU API isn't waiting long enough for I/O to complete?

Well, what I can say is that the error happens in kernel mode (or else we would get a stack dump by the signal handler of the App).
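To illustrate that reasoning, here is a small Python analogy (an assumption-laden sketch, not the App's actual handler, which is not shown in this thread). A child process enables Python's standard `faulthandler` module, which installs a SIGSEGV handler, and then sends itself signal 11. Because the handler gets to run, a stack dump appears on stderr before the process dies. The names `CHILD` and `run_crashing_child` are invented for this example, and it assumes a POSIX system.

```python
import signal
import subprocess
import sys

# Child program: install the fault handler, then deliver SIGSEGV to itself.
CHILD = """
import faulthandler, os, signal
faulthandler.enable()                  # installs handlers for SIGSEGV etc.
os.kill(os.getpid(), signal.SIGSEGV)   # simulate a 'signal 11'
"""


def run_crashing_child():
    """Run the child and return (returncode, stderr text)."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stderr


if __name__ == "__main__":
    code, err = run_crashing_child()
    # On POSIX a negative return code means "killed by that signal".
    print("killed by SIGSEGV:", code == -signal.SIGSEGV)
    print("stack dump from the handler:")
    print(err)
```

If a task ends with signal 11 but no such dump shows up in its log, the handler presumably never ran, which is the basis for the kernel-mode suspicion above.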

Could you install gdb and try the EAH_GDB_DEBUG file?

I'm pretty sure this is something in the BOINC library. The system-specific part of the App code is the same as we used for the HierarchicalSearch (S5R2-S5R6).

BM

CJOrtega
Joined: 19 Feb 05
Posts: 39
Credit: 1742781
RAC: 0

Message 97950 in response to message 97949


Ubuntu 9.10
Celeron (Coppermine) 650 MHz, 3/4 GB memory
Pentium III (Coppermine) 850 MHz, 3/4 GB memory
Pentium III (Coppermine) 850 MHz, 3/4 GB memory

I'll put the debug file in the BOINC directory(s) and let you know if I get anything.

Claude

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4273
Credit: 245208851
RAC: 13214

Thanks.

With 1.05 the Linux segfault rate went down from ~7% (1.04) to ~5%. Judging from this I'd say two more stackdumps and we're done!

BM
