I definitely haven't "upgraded my system" in terms of hardware, since it's a laptop. But "shared memory" rang a bell... do you think it might be the graphics card driver? I mean, this is a shared memory (mobile) card... maybe if that has a buggy driver and messes up memory allocation, it could have this kind of effect? I'm using the "restricted" (proprietary) ATI driver since I wanted to test Linux gaming, and that driver isn't exactly known as a wonder of stability...
Opinions, anyone? Please ;-) This is starting to seriously annoy me. Despite starting from scratch, all my WUs crash, usually right after rebooting, and so quickly that I can't even get the debugger running before they're dead.
Quote:
I definitely haven't "upgraded my system" in terms of hardware, since it's a laptop. But "shared memory" rang a bell... do you think it might be the graphics card driver? I mean, this is a shared memory (mobile) card...
Probably not that kind of shared memory, but rather the kernel parameters for shared memory (as in, memory shared between different processes). Try running the command ipcs -l to see what your settings are. There are three pseudo-files at /proc/sys/kernel/shm* ... look up their descriptions and try increasing some of the values to see if that's it.
Rings a bell, as the guys on Intel Mac OS X had some issues with shared memory...
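For anyone following along, here's a minimal way to inspect those settings - ipcs and the three /proc pseudo-files are the standard interfaces, the comments just describe what each value means:

  ipcs -l                        # show current IPC limits, including shared memory
  cat /proc/sys/kernel/shmmax    # max size of a single shared memory segment, in bytes
  cat /proc/sys/kernel/shmall    # system-wide shared memory limit, in pages
  cat /proc/sys/kernel/shmmni    # max number of shared memory segments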
Thanks for the advice, I'll keep that in mind. At the moment I'm testing the vanilla graphics driver, though, to see if it makes any difference; if it doesn't, it won't have done any harm. And I seem to remember someone over at CPDN warning me about similar problems when running the project on a laptop, saying it was indeed a graphics card issue... it's more than a year ago, but I think there was something like that.
Okay, now at least I can rule something out... as you suspected, Metod, it had nothing to do with my graphics card. I used the vanilla drivers, let the laptop crunch overnight and got roughly a million (well, more like 20) signal 11 errors, as usual in huge "series" which eventually ended upon switching projects. The laptop seems to be doing okay now, but I'm more or less unable to do projects with big WUs like Einstein since it won't stay stable long enough.
Quote:
Probably not that kind of shared memory, but rather the kernel parameters for shared memory (as in, memory shared between different processes). Try running the command ipcs -l to see what your settings are. There are three pseudo-files at /proc/sys/kernel/shm* ... look up their descriptions and try increasing some of the values to see if that's it.
Honestly, I'm not quite sure if I should. I had a look at those parameters, and I could swear they are relevant to the OS. The way things are now, the only program that doesn't seem to work okay is BOINC. If I change anything without knowing exactly what I'm doing (which I obviously don't), I might well end up with my system fubar'd for all I know, and I just can't afford that at the moment... loads of stuff to do for uni and some other projects as well; I simply need this laptop. So, unless any of you knows a reasonably safe way to do this, I'd consider temporarily detaching this box.
Quote:
Probably not that kind of shared memory, but rather the kernel parameters for shared memory (as in, memory shared between different processes). Try running the command ipcs -l to see what your settings are. There are three pseudo-files at /proc/sys/kernel/shm* ... look up their descriptions and try increasing some of the values to see if that's it.
Quote:
Honestly, I'm not quite sure if I should. I had a look at those parameters, and I could swear they are relevant to the OS. The way things are now, the only program that doesn't seem to work okay is BOINC. If I change anything without knowing exactly what I'm doing (which I obviously don't), I might well end up with my system fubar'd for all I know, and I just can't afford that at the moment... loads of stuff to do for uni and some other projects as well; I simply need this laptop. So, unless any of you knows a reasonably safe way to do this, I'd consider temporarily detaching this box.
One cannot be sure about anything in CS, of course. Still, increasing those 3 values shouldn't break anything if not overdone. If you change values by hand (e.g., echo 67108864 > /proc/sys/kernel/shmmax will increase the max shared memory segment size to 64MB), you can always return to the vanilla state by rebooting the machine. Personally, I'd try doubling the value of shmall, which (in principle) should double the number of available SHM pages, thus allowing a couple more apps to attach SHM.
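A minimal sketch of that suggestion, run as root (the doubling is just Metod's rule of thumb, not an official recommendation):

  cur=$(cat /proc/sys/kernel/shmall)            # read the current page limit
  echo $((cur * 2)) > /proc/sys/kernel/shmall   # double it; lasts only until reboot

Since /proc values aren't persistent, a reboot restores the defaults - which is exactly the safety net mentioned above.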
Sometimes one simply must change system settings. Almost a decade ago we had to tweak the stack size of our AXP Linux box by changing some define and recompiling the kernel, as we desperately needed more than 8MB of stack - we were running a weather prediction model written in Fortran, and that beast passed a lot of data between functions over the stack. That size (8MB) was quite a bit lower than that of the Digital Unix (DU) we were running on the same box - I think we had to increase it there as well. OTOH, Linux had a really nice feature: RAM overcommitment - you could run an app that (statically) demanded more RAM than was available in the system, because not all of that RAM was ever actually used. Or, alternatively, you could run two such apps in parallel when the amount of system RAM would only allow running one of them. DU behaved differently by default (being on the safe side), but could be configured the same way - though it was a royal pain in the arse to do. As running DU binaries through the binary compatibility layer was faster under Linux than natively under DU, we switched over to Linux entirely. Which means I was using 64-bit Linux on my desktop machine 9 (nine) years ago and only came back to 64-bit recently :-/
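As an aside - assuming a reasonably modern kernel - both things mentioned here are plain runtime knobs nowadays, no kernel rebuild required:

  ulimit -s                            # per-process stack size limit, in kB
  cat /proc/sys/vm/overcommit_memory   # 0 = heuristic overcommit, 1 = always, 2 = never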
1. with "upgrade" I was rather thinking of updating software packages than installing new hardware
2. the "shared memory" e.g. the error messages in stderr_out refer to is not a memory physically shared between hardware devices (such as the CPU and your graphics adapter), but a piece of memory shared between processes, so rather a software than a hardware thing.
3. Your machine reports a number of very different errors:
- exit status -185: This means that the client couldn't start the App at all. The reason is given in stderr_out; in your case it usually refers to shared memory.
- exit status 139: This is the "signal 11" (SIGSEGV / segmentation fault) error.
- exit status 134: "process got signal 6" (SIGABRT / Abort). I never noticed that one before. Could it be from shutting down the system?
4. Putting the file "EAH_DEBUG_DDD" into the BOINC directory should start the debugger automatically at the very beginning of the task and tell it to attach to the process automatically. If successful, it will interrupt the task on its own; there's no need to manually attach a debugger to the running process. Instead, you'll have to type "cont" at the gdb prompt or press the "Cont" button on the command toolbar to get the task going again. (BTW, in case of a -185 error the App isn't started at all, so you won't see anything of a debugger.)
I still don't know why our signal handler doesn't catch the signal on certain machines/systems; currently the only way to get some information about the cause of the signals is to run the App with a debugger attached.
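A side note on the exit codes in point 3: on Unix-like systems a process killed by signal N conventionally exits with status 128+N, which is where 139 (128+11, SIGSEGV) and 134 (128+6, SIGABRT) come from. You can see it in any shell, and the trigger file from point 4 is just an empty file (the BOINC path below is an example):

  sh -c 'kill -SEGV $$'; echo $?   # prints 139 = 128 + 11
  sh -c 'kill -ABRT $$'; echo $?   # prints 134 = 128 + 6
  touch ~/BOINC/EAH_DEBUG_DDD      # create the debug trigger file in your BOINC directory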
1. with "upgrade" I was rather thinking of updating software packages than installing new hardware
Oh, all right. Well, I can't really remember any major upgrades... It would have to be something that is running all the time, I think... even at night when I'm not doing anything with the laptop, since I've had crashes then, too. Oh... wait. Do you think it might be the Opera upgrade? I have noticed that 9.24 seems to behave a bit less stably than its predecessor, and also a bit more slowly I think, and I don't usually bother closing the browser before I go to sleep. It usually starts up right with KDE, since the KDE session is restored to its previous state upon logon... Nothing else I can think of, unless it's a daemon or something built right into KDE. The only other apps that are running all the time are Kopete and XChat, but I haven't changed versions there, and besides, they hardly need any memory. Do you reckon I should use Firefox or so for a while and see if it helps?
Quote:
2. the "shared memory" e.g. the error messages in stderr_out refer to is not a memory physically shared between hardware devices (such as the CPU and your graphics adapter), but a piece of memory shared between processes, so rather a software than a hardware thing.
Yeah, got that now. Sorry.
Quote:
3. Your machine reports a number of very different errors:
- exit status -185: This means that the client couldn't start the App at all. The reason is given in stderr_out; in your case it usually refers to shared memory.
- exit status 139: This is the "signal 11" (SIGSEGV / segmentation fault) error.
- exit status 134: "process got signal 6" (SIGABRT / Abort). I never noticed that one before. Could it be from shutting down the system?
I think the signal 6 was my own fault. Between all the messing around with the debugger, and being really exhausted, I somehow managed to accidentally change the permissions of the "slots" directory (or maybe that was triggered by not starting BOINC from "init.d", I wouldn't know) so it didn't have write permission - which of course killed the WUs.
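For anyone who hits the same thing: checking and restoring write permission on the slots directory is quick (the path is illustrative - use wherever your BOINC data directory lives):

  ls -ld ~/BOINC/slots     # check the owner and permission bits
  chmod u+w ~/BOINC/slots  # give the owning user write permission back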
Quote:
4. Putting the file "EAH_DEBUG_DDD" into the BOINC directory should start the debugger automatically at the very beginning of the task and tell it to attach to the process automatically. If successful, it will interrupt the task on its own; there's no need to manually attach a debugger to the running process. Instead, you'll have to type "cont" at the gdb prompt or press the "Cont" button on the command toolbar to get the task going again. (BTW, in case of a -185 error the App isn't started at all, so you won't see anything of a debugger.)
Yep, it usually worked like that. Well, I'll have to throw out the BOINC daemon for the time being (a shame, really) and start BOINC manually, because then the debugger seems to run okay. Pity it doesn't help with the -185.
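A sketch of that switch, assuming the Ubuntu boinc-client init script and a stock BOINC directory - the script and binary names vary between versions and packages:

  sudo /etc/init.d/boinc-client stop   # stop the daemon for now
  cd ~/BOINC && ./run_client           # start the client by hand (older installs name the binary boinc_client)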
Quote:
I still don't know why our signal handler doesn't catch the signal on certain machines/systems; currently the only way to get some information about the cause of the signals is to run the App with a debugger attached.
BM
Yep, I promise I'll do my best to help as soon as I get this mess sorted out.
Oops... this box of mine had a whole series of signal 11's, but that chip had a bad overclock, so I think you can discard them.
Sorry :blush:
My other boxen are doing just fine with 4.16.
Quote:
Oops... this box of mine had a whole series of signal 11's, but that chip had a bad overclock, so I think you can discard them.
Not sure. If it's always the same location, it might still be useful to know; there might be a program bug as well. Do you want to give it a try with ddd?
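For reference, attaching to an already-running task goes roughly like this - ddd does the same through its GUI, and <PID> stands for the task's process ID:

  gdb -p <PID>   # attach to the running process; it stops immediately
  (gdb) cont     # let it run until the signal hits
  (gdb) bt       # after the crash, print a backtrace of where it died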
Good news about my laptop problem... I don't want to rejoice too early, but at the moment it looks like the host-based (shared memory) errors have stopped. I would put it down to the latest apt-get upgrade, which included quite a few patches for system tools. Maybe there really was a little bug in Ubuntu which manifested itself due to my hardware/drivers/software packages/whatever. I'll see if it stays like this; if it does, I'll hopefully be able to get some debug output to help with the signal 11.