I definitely haven't "upgraded my system" in terms of hardware, since it's a laptop. But "shared memory" rang a bell... do you think it might be the graphics card driver? I mean, this is a shared memory (mobile) card... maybe if that has a buggy driver and messes up memory allocation, it could have this kind of effect? I'm using the "restricted" (proprietary) ATI driver since I wanted to test Linux gaming, and that driver isn't exactly known as a wonder of stability...
Opinions, anyone? Please ;-) This is starting to seriously annoy me. Despite starting from scratch, all my WUs crash, usually right after rebooting, and so quickly that I can't even get the debugger running before they're dead.
Quote:
I definitely haven't "upgraded my system" in terms of hardware, since it's a laptop. But "shared memory" rang a bell... do you think it might be the graphics card driver? I mean, this is a shared memory (mobile) card...
Probably not that kind of shared memory, but rather the kernel parameters for shared memory (as in, memory shared between different processes). Try running the command ipcs -l to see what your settings are. There are three pseudo-files at /proc/sys/kernel/shm* ... look up their descriptions and try increasing some of the values to see if that's it.
Rings a bell, as the guys on Intel Mac OS X had some issues with shared memory...
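For anyone following along, here's a minimal way to inspect those settings - ipcs and the three /proc pseudo-files are the standard interfaces, the comments just describe what each value means:

  ipcs -l                        # show current IPC limits, including shared memory
  cat /proc/sys/kernel/shmmax    # max size of a single shared memory segment, in bytes
  cat /proc/sys/kernel/shmall    # system-wide shared memory limit, in pages
  cat /proc/sys/kernel/shmmni    # max number of shared memory segments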
Thanks for the advice, I'll keep that in mind. At the moment I'm testing the vanilla graphics driver, though, to see if it makes any difference; if it doesn't, it won't have done any harm. And I seem to remember someone over at CPDN warning me about similar problems when running the project on a laptop, saying it was indeed a graphics card issue... it's more than a year ago, but I think there was something like that.
Okay, now at least I can rule something out... as you suspected, Metod, it had nothing to do with my graphics card. I used the vanilla drivers, let the laptop crunch overnight and got roughly a million (well, more like 20) signal 11 errors, as usual in huge "series" which eventually ended upon switching projects. The laptop seems to be doing okay now, but I'm more or less unable to do projects with big WUs like Einstein since it won't stay stable long enough.
Quote:
Probably not that kind of shared memory, but rather the kernel parameters for shared memory (as in, memory shared between different processes). Try running the command ipcs -l to see what your settings are. There are three pseudo-files at /proc/sys/kernel/shm* ... look up their descriptions and try increasing some of the values to see if that's it.
Honestly, I'm not quite sure if I should. I had a look at those parameters, and I could swear they are relevant to the OS. The way things are now, the only program that doesn't seem to work okay is BOINC. If I change anything without knowing exactly what I'm doing (which I obviously don't), I might well end up with my system fubar'd for all I know, and I just can't afford that at the moment... loads of stuff to do for uni and some other projects as well; I simply need this laptop. So, unless any of you knows a reasonably safe way to do this, I'd consider temporarily detaching this box.
Quote:
Probably not that kind of shared memory, but rather the kernel parameters for shared memory (as in, memory shared between different processes). Try running the command ipcs -l to see what your settings are. There are three pseudo-files at /proc/sys/kernel/shm* ... look up their descriptions and try increasing some of the values to see if that's it.
Quote:
Honestly, I'm not quite sure if I should. I had a look at those parameters, and I could swear they are relevant to the OS. The way things are now, the only program that doesn't seem to work okay is BOINC. If I change anything without knowing exactly what I'm doing (which I obviously don't), I might well end up with my system fubar'd for all I know, and I just can't afford that at the moment... loads of stuff to do for uni and some other projects as well; I simply need this laptop. So, unless any of you knows a reasonably safe way to do this, I'd consider temporarily detaching this box.
One cannot be sure about anything in CS, of course. Still, increasing those 3 values shouldn't break anything if not overdone. If you change values by hand (e.g., echo 67108864 > /proc/sys/kernel/shmmax will increase the max shared memory segment size to 64MB), you can always return to the vanilla state by rebooting the machine. Personally, I'd try doubling the value of shmall, which (in principle) should double the number of available SHM pages, thus allowing a couple more apps to attach SHM.
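A minimal sketch of that suggestion, run as root (the doubling is just Metod's rule of thumb, not an official recommendation):

  cur=$(cat /proc/sys/kernel/shmall)            # read the current page limit
  echo $((cur * 2)) > /proc/sys/kernel/shmall   # double it; lasts only until reboot

Since /proc values aren't persistent, a reboot restores the defaults - which is exactly the safety net mentioned above.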
Sometimes one simply must change system settings. Almost a decade ago we had to tweak the stack size of our AXP Linux box by changing some define and recompiling the kernel, as we desperately needed more than 8MB of stack - we were running a weather prediction model written in Fortran, and that beast passed a lot of data between functions over the stack. That size (8MB) was quite a bit lower than that of the Digital Unix (DU) we were running on the same box - I think we had to increase it there as well. OTOH, Linux had a really nice feature: RAM overcommitment - you could run an app that (statically) demanded more RAM than was available in the system, because not all of that RAM was ever actually used. Or, alternatively, you could run two such apps in parallel when the amount of system RAM would only allow running one of them. DU behaved differently by default (being on the safe side), but could be configured the same way - though it was a royal pain in the arse to do. As running DU binaries through the binary compatibility layer was faster under Linux than natively under DU, we switched over to Linux entirely. Which means I was using 64-bit Linux on my desktop machine 9 (nine) years ago and only came back to 64-bit recently :-/
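As an aside - assuming a reasonably modern kernel - both things mentioned here are plain runtime knobs nowadays, no kernel rebuild required:

  ulimit -s                            # per-process stack size limit, in kB
  cat /proc/sys/vm/overcommit_memory   # 0 = heuristic overcommit, 1 = always, 2 = never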
1. with "upgrade" I was rather thinking of updating software packages than installing new hardware
2. the "shared memory" e.g. the error messages in stderr_out refer to is not a memory physically shared between hardware devices (such as the CPU and your graphics adapter), but a piece of memory shared between processes, so rather a software than a hardware thing.
3. Your machine reports a number of very different errors:
- exit status -185: This means that the client couldn't start the App at all. The reason is given in stderr_out; in your case it usually refers to shared memory.
- exit status 139: This is the "signal 11" (SIGSEGV / segmentation fault) error.
- exit status 134: "process got signal 6" (SIGABRT / Abort). I never noticed that one before. Could it be from shutting down the system?
4. Putting the file "EAH_DEBUG_DDD" into the BOINC directory should start the debugger automatically at the very beginning of the task and tell it to attach to the process automatically. If successful, it will interrupt the task on its own; there's no need to manually attach a debugger to the running process. Instead, you'll have to type "cont" at the gdb prompt or press the "Cont" button on the command toolbar to get the task going again. (BTW, in case of a -185 error the App isn't started at all, so you won't see anything of a debugger.)
I still don't know why our signal handler doesn't catch the signal on certain machines/systems; currently the only way to get some information about the cause of the signals is to run the App with a debugger attached.
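A side note on the exit codes in point 3: on Unix-like systems a process killed by signal N conventionally exits with status 128+N, which is where 139 (128+11, SIGSEGV) and 134 (128+6, SIGABRT) come from. You can see it in any shell, and the trigger file from point 4 is just an empty file (the BOINC path below is an example):

  sh -c 'kill -SEGV $$'; echo $?   # prints 139 = 128 + 11
  sh -c 'kill -ABRT $$'; echo $?   # prints 134 = 128 + 6
  touch ~/BOINC/EAH_DEBUG_DDD      # create the debug trigger file in your BOINC directory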
1. with "upgrade" I was rather thinking of updating software packages than installing new hardware
Oh, all right. Well, I can't really remember any major upgrades... It would have to be something that is running all the time, I think... even at night when I'm not doing anything with the laptop, since I've had crashes then, too. Oh... wait. Do you think it might be the Opera upgrade? I have noticed that 9.24 seems to behave a bit less stably than its predecessor, and also a bit more slowly I think, and I don't usually bother closing the browser before I go to sleep. It usually starts up right with KDE, since the KDE session is restored to its previous state upon logon... Nothing else I can think of, unless it's a daemon or something built right into KDE. The only other apps that are running all the time are Kopete and XChat, but I haven't changed versions there, and besides, they hardly need any memory. Do you reckon I should use Firefox or so for a while and see if it helps?
Quote:
2. the "shared memory" e.g. the error messages in stderr_out refer to is not a memory physically shared between hardware devices (such as the CPU and your graphics adapter), but a piece of memory shared between processes, so rather a software than a hardware thing.
Yeah, got that now. Sorry.
Quote:
3. Your machine reports a number of very different errors:
- exit status -185: This means that the client couldn't start the App at all. The reason is given in stderr_out; in your case it usually refers to shared memory.
- exit status 139: This is the "signal 11" (SIGSEGV / segmentation fault) error.
- exit status 134: "process got signal 6" (SIGABRT / Abort). I never noticed that one before. Could it be from shutting down the system?
I think the signal 6 was my own fault. Between all the messing around with the debugger, and being really exhausted, I somehow managed to accidentally change the permissions of the "slots" directory (or maybe that was triggered by not starting BOINC from "init.d", I wouldn't know) so it didn't have write permission - which of course killed the WUs.
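For anyone who hits the same thing: checking and restoring write permission on the slots directory is quick (the path is illustrative - use wherever your BOINC data directory lives):

  ls -ld ~/BOINC/slots     # check the owner and permission bits
  chmod u+w ~/BOINC/slots  # give the owning user write permission back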
Quote:
4. Putting the file "EAH_DEBUG_DDD" into the BOINC directory should start the debugger automatically at the very beginning of the task and tell it to attach to the process automatically. If successful, it will interrupt the task on its own; there's no need to manually attach a debugger to the running process. Instead, you'll have to type "cont" at the gdb prompt or press the "Cont" button on the command toolbar to get the task going again. (BTW, in case of a -185 error the App isn't started at all, so you won't see anything of a debugger.)
Yep, it usually worked like that. Well, I'll have to throw out the BOINC daemon for the time being (a shame, really) and start BOINC manually, because then the debugger seems to run okay. Pity it doesn't help with the -185.
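A sketch of that switch, assuming the Ubuntu boinc-client init script and a stock BOINC directory - the script and binary names vary between versions and packages:

  sudo /etc/init.d/boinc-client stop   # stop the daemon for now
  cd ~/BOINC && ./run_client           # start the client by hand (older installs name the binary boinc_client)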
Quote:
I still don't know why our signal handler doesn't catch the signal on certain machines/systems; currently the only way to get some information about the cause of the signals is to run the App with a debugger attached.
BM
Yep, I promise I'll do my best to help as soon as I get this mess sorted out.
Oops... this box of mine had a whole series of signal 11's, but that chip had a bad overclock, so I think you can discard them.
Sorry :blush:
My other boxen are doing just fine with 4.16.
Quote:
Oops... this box of mine had a whole series of signal 11's, but that chip had a bad overclock, so I think you can discard them.
Not sure. If it's always the same location, it might still be useful to know; there might be a program bug as well. Do you want to give it a try with ddd?
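For reference, attaching to an already-running task goes roughly like this - ddd does the same through its GUI, and <PID> stands for the task's process ID:

  gdb -p <PID>   # attach to the running process; it stops immediately
  (gdb) cont     # let it run until the signal hits
  (gdb) bt       # after the crash, print a backtrace of where it died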
Good news about my laptop problem... I don't want to rejoice too early, but at the moment it looks like the host-based (shared memory) errors have stopped. I would put it down to the latest apt-get upgrade, which included quite a few patches for system tools. Maybe there really was a little bug in Ubuntu which manifested itself due to my hardware/drivers/software packages/whatever. I'll see if it stays like this; if it does, I'll hopefully be able to get some debug output to help with the signal 11.