GNU/Linux S5R3 App 4.16 available for Beta test

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494,410
RAC: 0

I definitely haven't "upgraded my system" in terms of hardware since it's a laptop. But "shared memory" rang a bell... do you think it might be the graphics card driver? I mean, this is a shared memory (mobile) card... maybe if that has a buggy driver and messes up memory allocation it could have this kind of effect? I'm using the "restricted" (proprietary) ATI driver since I wanted to test Linux gaming and that isn't exactly known to be a stability wonder...
Opinions, anyone? Please ;-) this is starting to seriously annoy me. Despite starting from scratch all my WUs crash, normally directly after rebooting, so quickly that I can't even get the debugger to run before they are dead.

Metod, S56RKO
Joined: 11 Feb 05
Posts: 135
Credit: 790,538,977
RAC: 15,941

Message 75242 in response to message 75241

Quote:
I definitely haven't "upgraded my system" in terms of hardware since it's a laptop. But "shared memory" rang a bell... do you think it might be the graphics card driver? I mean, this is a shared memory (mobile) card...

Probably not that kind of shared memory, but rather the kernel parameters for shared memory (shared as in shared between different processes). Try running the command ipcs -l to see what your settings are. There are three pseudo-files at /proc/sys/kernel/shm* ... Google their descriptions and try increasing some of the values to see if that's it.

It rings a bell, as people on Intel Mac OS X had some issues with shared memory...
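
For anyone who would rather look before touching anything, here is a minimal read-only sketch in Python. It assumes the three pseudo-files are shmmax (bytes), shmall (pages) and shmmni (number of segments), which are the standard System V limits on Linux; the function name read_shm_limits and its base parameter are made up for illustration and are not part of any tool mentioned here.

```python
from pathlib import Path

# The three System V shared memory limits exposed under /proc/sys/kernel:
# shmmax = largest single segment in bytes, shmall = total pages allowed,
# shmmni = maximum number of segments.
SHM_FILES = ("shmmax", "shmall", "shmmni")


def read_shm_limits(base="/proc/sys/kernel"):
    """Return the current limits as a dict of name -> integer value.

    Only reads, never writes, so the system is left untouched.
    Files that do not exist (e.g. on a non-Linux system) are skipped.
    """
    limits = {}
    for name in SHM_FILES:
        path = Path(base) / name
        if path.is_file():
            limits[name] = int(path.read_text().strip())
    return limits


if __name__ == "__main__":
    for name, value in read_shm_limits().items():
        print(f"{name} = {value}")
```

This should show the same kind of numbers as ipcs -l, just without the formatting.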

Metod ...

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494,410
RAC: 0

Thanks for the advice, I'll keep that in mind. Atm I'm testing the vanilla graphics driver, though, to see if it makes any difference; if it doesn't, it won't do any harm. And I seem to remember someone over at CPDN warning me about similar problems when running the project on a laptop, saying it was indeed a graphics card issue... it's more than a year ago but I think there was something.

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494,410
RAC: 0

Message 75244 in response to message 75242

Okay, now at least I can rule something out... as you suspected, Metod, it had nothing to do with my graphics card. I used the vanilla drivers, let the laptop crunch overnight and got roughly a million (well, more like 20) signal 11 errors, as usual in huge "series" which eventually ended upon switching projects. The laptop seems to be doing okay now, but I'm more or less unable to do projects with big WUs like Einstein since it won't stay stable long enough.

Quote:
Probably not that kind of shared memory, but rather the kernel parameters for shared memory (shared as in shared between different processes). Try running the command ipcs -l to see what your settings are. There are three pseudo-files at /proc/sys/kernel/shm* ... Google their descriptions and try increasing some of the values to see if that's it.

Honestly, I'm not quite sure if I should. I had a look at those parameters and I could just swear they are relevant for the OS. The way it is now, the only program that doesn't seem to work okay is BOINC. If I change anything without knowing exactly what I'm doing (which I obviously don't) I might well end up with my system fubar'd for all I know, and I just can't afford that atm... loads of stuff to do for Uni and some other projects as well, I simply need that laptop. So, unless any of you knows a reasonably safe way to do this, I'd consider temporarily detaching this box.

Metod, S56RKO
Joined: 11 Feb 05
Posts: 135
Credit: 790,538,977
RAC: 15,941

Message 75245 in response to message 75244

Quote:
Quote:
Probably not that kind of shared memory, but rather the kernel parameters for shared memory (shared as in shared between different processes). Try running the command ipcs -l to see what your settings are. There are three pseudo-files at /proc/sys/kernel/shm* ... Google their descriptions and try increasing some of the values to see if that's it.

Honestly, I'm not quite sure if I should. I had a look at those parameters and I could just swear they are relevant for the OS. The way it is now, the only program that doesn't seem to work okay is BOINC. If I change anything without knowing exactly what I'm doing (which I obviously don't) I might well end up with my system fubar'd for all I know, and I just can't afford that atm... loads of stuff to do for Uni and some other projects as well, I simply need that laptop. So, unless any of you knows a reasonably safe way to do this, I'd consider temporarily detaching this box.

One can not be sure about anything in CS of course. Still, increasing those 3 values shouldn't break anything if not overdone. If you change the values by hand (e.g. echo 67108864 > /proc/sys/kernel/shmmax will increase the max shared memory segment size to 64MB), you can always return to the vanilla state by rebooting the machine. Personally, I'd try doubling the value of shmall, which (in principle) should double the amount of available SHM pages, thus allowing a couple more apps to attach SHM.
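
To make the arithmetic concrete, here is a small Python sketch that only builds the echo commands and prints them; it does not write to /proc itself, so running it changes nothing. The helper names mib_to_bytes and echo_command are invented for this example, and the current shmall value used below is a hypothetical placeholder: read your real value first (e.g. with ipcs -l) and double that instead.

```python
def mib_to_bytes(mib):
    """Convert mebibytes to bytes (shmmax is specified in bytes)."""
    return mib * 1024 * 1024


def echo_command(name, value, base="/proc/sys/kernel"):
    """Build the shell command as a string; this sketch never executes it."""
    return f"echo {value} > {base}/{name}"


if __name__ == "__main__":
    # 64 MB segment size, as in the example above.
    print(echo_command("shmmax", mib_to_bytes(64)))

    # Hypothetical current value of shmall (counted in pages, not bytes).
    current_shmall = 2097152
    print(echo_command("shmall", 2 * current_shmall))
```

The printed commands have to be run as root to take effect, and as said above, a reboot brings back the old values.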

Sometimes one simply must change system settings. Almost a decade ago we had to tweak the stack size of our AXP Linux box by changing a define and recompiling the kernel, as we desperately needed more than 8MB of stack - we were running a weather prediction model written in Fortran, and that beast passed a lot of data between functions over the stack. That size (8MB) was quite a bit lower than that of Digital Unix (DU), which we were running on the same box - I think we had to increase it there as well. OTOH, Linux had a really nice feature: RAM overcommitment - you could run an app that (statically) demanded more RAM than was available in the system, because not all of that RAM was ever actually used. Or, alternatively, you could run two such apps in parallel while the amount of system RAM would only allow running one of them. DU behaved differently by default (being on the safe side), but could be configured in the same way - though it was a royal pain in the arse to do it. As running DU binaries through the binary compatibility layer was faster under Linux than under DU natively, we switched over to Linux entirely. Which means I was using 64-bit Linux on my desktop machine 9 (nine) years ago and came back to 64-bit just recently :-/

Metod ...

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4,023
Credit: 214,118,402
RAC: 42,766

Annika,

1. with "upgrade" I was thinking of updating software packages rather than installing new hardware

2. the "shared memory" that e.g. the error messages in stderr_out refer to is not memory physically shared between hardware devices (such as the CPU and your graphics adapter), but a piece of memory shared between processes, so a software rather than a hardware thing.

3. Your machine reports a number of very different errors:
- exit status -185: This means that the client couldn't start the App at all. The reason for this is given in stderr_out, in your case it's usually referring to shared memory.
- exit status 139: This is the "signal 11" error
- exit status 134: "process got signal 6" (SIGABRT / Abort). I never noticed that before. Could this be from shutting down the system?

4. Putting the file "EAH_DEBUG_DDD" into the BOINC directory should start the debugger automatically at the very beginning of the task and tell it to attach to the process automatically. If successful, it will interrupt the task on its own; there's no need to manually attach a debugger to the running process. Instead, you'll have to type "cont" at the gdb prompt or press the "Cont" button on the command toolbar to get the task going again. (BTW, in case of a -185 error the App isn't started at all, so you won't see anything of a debugger.)

I still don't know why our signal handler doesn't catch the signal on certain machines / systems; currently the only way to get some information about the cause of the signals is running the App with a debugger attached.
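
As a side note to the exit statuses in point 3: 139 and 134 follow the usual Unix convention of 128 plus the signal number (128 + 11 and 128 + 6). A rough Python sketch of that decoding is below. The function name describe_exit_status is made up for illustration, the signal table only covers the two signals seen in this thread using Linux numbering, and treating every negative status as a client-side error is an assumption based on the -185 case, not a full list of BOINC error codes.

```python
# Signal numbers as used on Linux; only the two from this thread are listed.
SIGNAL_NAMES = {6: "SIGABRT", 11: "SIGSEGV"}


def describe_exit_status(status):
    """Give a rough, human-readable reading of a task's exit status."""
    if status < 0:
        # Assumption: negative values come from the BOINC client, not the App.
        return "negative status: reported by the client, not a signal"
    if status > 128:
        # Unix convention: 128 + N means the process was killed by signal N.
        signum = status - 128
        name = SIGNAL_NAMES.get(signum, "unknown signal")
        return f"signal {signum} ({name})"
    return "ordinary exit code"


if __name__ == "__main__":
    for status in (-185, 139, 134):
        print(status, "->", describe_exit_status(status))
```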

BM


Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494,410
RAC: 0

Message 75247 in response to message 75246

Hi Bernd,

thanks a lot for the information!

Quote:

Annika,

1. with "upgrade" I was thinking of updating software packages rather than installing new hardware

Oh, all right. Well, I can't really remember any major upgrades... It would have to be something that is running all the time, I think... even at night when I'm not doing anything with the laptop, since I also had crashes then. Oh... wait. Do you think it might be the Opera upgrade? I have noticed that 9.24 seems to be a bit less stable than its predecessor and also a bit slower, I think, and I don't usually bother closing the browser before I go to sleep. And it usually starts up right with KDE, since the KDE session is restored to its previous state upon logon... Nothing else I think, unless it's a daemon or something that is built right into KDE. The only other apps that are running all the time are Kopete and XChat, but I haven't changed the versions there, and besides, they hardly need any memory. Do you reckon I should use Firefox or so for a while and see if it helps?

Quote:

2. the "shared memory" that e.g. the error messages in stderr_out refer to is not memory physically shared between hardware devices (such as the CPU and your graphics adapter), but a piece of memory shared between processes, so a software rather than a hardware thing.

Yeah, got that now. Sorry.

Quote:

3. Your machine reports a number of very different errors:
- exit status -185: This means that the client couldn't start the App at all. The reason for this is given in stderr_out, in your case it's usually referring to shared memory.
- exit status 139: This is the "signal 11" error
- exit status 134: "process got signal 6" (SIGABRT / Abort). I never noticed that before. Could this be from shutting down the system?

I think the signal 6 was my own fault. Between all the messing around with the debugger and being really exhausted, I somehow managed to accidentally change the permissions of the "slots" directory (or maybe that got triggered by not starting BOINC from "init.d", I wouldn't know), so it didn't have write permission, which of course killed the WUs.

Quote:

4. Putting the file "EAH_DEBUG_DDD" into the BOINC directory should start the debugger automatically at the very beginning of the task and tell it to attach to the process automatically. If successful, it will interrupt the task on its own; there's no need to manually attach a debugger to the running process. Instead, you'll have to type "cont" at the gdb prompt or press the "Cont" button on the command toolbar to get the task going again. (BTW, in case of a -185 error the App isn't started at all, so you won't see anything of a debugger.)

Yep, it usually worked like that. Well, I'll have to throw out the BOINC daemon for the time being (shame really) and start BOINC manually, because then the debugger seems to run okay. Pity it doesn't help with the -185.

Quote:


I still don't know why our signal handler doesn't catch the signal on certain machines / systems; currently the only way to get some information about the cause of the signals is running the App with a debugger attached.

BM

Yep, I promise I'll do my best to help as soon as I get this mess sorted out.

Melvyn Bobo Slacke
Joined: 22 Jan 05
Posts: 32
Credit: 1,692,164
RAC: 0

Oops.. this box of mine had a whole series of signal 11's, but that chip had a bad overclock so I think you can discard them.
Sorry :blush:
My other boxen are doing just fine with 4.16.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4,023
Credit: 214,118,402
RAC: 42,766

Message 75249 in response to message 75248

Quote:
Oops.. this box of mine had a whole series of signal 11's, but that chip had a bad overclock so I think you can discard them.


Not sure. If it's always the same location it might still be useful to know; there might be a program bug as well. Do you want to give it a try with ddd?

BM


Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494,410
RAC: 0

Good news about my laptop problem... I don't want to rejoice too early, but atm it looks like the host-based (shared memory) errors have stopped. I would put it down to the latest apt-get upgrade, which included quite a few patches for system tools. Maybe there really was a little bug in Ubuntu which manifested itself due to my hardware/drivers/software packages/whatever. I'll see if it stays like this; in case it does, I'll hopefully be able to get some debug output to help with the signal 11.
