They are running different versions of the core client. Windows is 5.10.30 (I think), installed as a service, and Linux is 5.10.21, installed via rpm as a system daemon.
Does this run BOINC and the App as root? (Try "ps -ef | grep einstein" or similar.)
BM
As far as I know (and Eric Myers can confirm for you) when installing as a rpm, it runs as its own user. The rpm creates a user boinc and the home directory is /var/lib/boinc.
Here's the output of ps, I hope it makes sense to you because it doesn't make sense to me.
I also had a unit finish overnight without issue and another one that has about an hour to go on it. The finished one is waiting validation as are the first couple I finished.
As far as I know (and Eric Myers can confirm for you) when installing as a rpm, it runs as its own user. The rpm creates a user boinc and the home directory is /var/lib/boinc.
Well, what the rpm does depends on the distributor. Might be that it is more common by now to create an own user (which is good), but I've seen installations where the client (and thus the App) ran as root.
Quote:
Here's the output of ps, I hope it makes sense to you because it doesn't make sense to me.
It does. The App is running as user "boinc".
It also reveals that this is a dual-CPU machine, which might be the reason why I couldn't reproduce the problem. I'll try with a dual-core VM.
Are others seeing this problem (only) on multi-CPU/core machines?
Thanks,
BM
Come to think of it, all of my signal 11 problems have been on dual processor machines.
Computer 1042068--This problem machine is an SGI 1200 server, with dual Pentium III 700 processors. It has 256 Meg of registered, ECC memory, if it helps to know that. It's the only machine that still has crashed results showing up in its workunit list. There are four, but only two are from the signal-11 problem. The other two crashed when I removed the app_info file so that it could finally upgrade to the 4.20 app.
Computer 1059057--This one is a home-brew box that I built from a dual Pentium III 866 motherboard that I got from eBay. It has 512 Meg of non-registered, non-ECC PC-133 memory. If I remember correctly, all of its problem workunits were running with the 4.14 app. I've also since upgraded it to the 4.20 app.
Computer 1060000--The third one is an old IBM IntelliStation with dual 2.8 GHz Xeons and 512 Meg of memory. I neglected to note what kind of memory it has when I had the box open, but I believe it's probably DDR. If I remember correctly, the problems on this one occurred with the 4.20 app. I've since upgraded to the 4.21 power-users' app, and have had no problems with it.
Most, and perhaps all, of the signal 11 problems occurred when I had network problems. Also, the problem machines are all running the newer 5.10.x versions of BOINC.
****
Having said all of this, I also have to note that I have several other dual-processor machines, and one dual-core machine, that have never had signal-11 problems. But, all of the problems I have had have been on the dual-processor machines listed above; the single-processor machines haven't had any problems at all.
Quote:
Well, what the rpm does depends on the distributor. Might be that it is more common by now to create an own user (which is good), but I've seen installations where the client (and thus the App) ran as root.
Sorry. I should have been more clear. Eric did package it up.
Would it help in testing a) to limit BOINC to 1 CPU and/or b) try an older core client? I'm comfortable enough with installing BOINC to uninstall the rpm and install an older version (5.4.x or 5.8.x?)
The problems may really be related to interrupted network access. But I don't think so, because I often have trouble reaching the SETI project or the less frequently used projects LHC & Chess960, and BOINC logs this every time, yet the latest Einstein units ran like hell with power-app_4.21.
Now I'm running Augustine's BOINC 5.10.30 in a Suse10.2_x86-64_DualCore-VM as root in the folder /root/BOINC. I think all the problems I posted earlier in this thread came down to faulty libraries or the linking to and between them. All the dependencies had been resolved earlier, but a fuse comes to the end... I had run the mentioned fsck without errors, but my problems persisted and resetting all projects was useless. Since I reinstalled all the gnome-libs plus a few more, everything has been fine. No more errors are reported from this host. Knocking on wood.
[edit]
I rather believe that BOINC, under some circumstances, has problems handling its working slots, especially with 2 tasks from the same project. But shouldn't that be solved with 5.10.x?
What's strange to me is that only Einstein and Spinhenge have had these problems; all other projects run fine on that host.
[/edit]
Funny thing happened here when switching from 3 to 4 cores:
Quote:
2008-01-11 11:51:57 [Einstein@Home] Started upload of h1_0734.45_S5R2__204_S5R3a_0_0
2008-01-11 11:51:58 [Einstein@Home] Scheduler request to http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi succeeded
2008-01-11 11:51:58 [Einstein@Home] General preferences have been updated
2008-01-11 11:51:58 [---] General prefs: from Einstein@Home (last modified 2008-01-11 11:40:31)
2008-01-11 11:51:58 [---] General prefs: no separate prefs for home; using your defaults
2008-01-11 11:51:58 [---] Number of usable CPUs has changed. Running benchmarks.
2008-01-11 11:51:58 [---] Suspending computation and network activity - running CPU benchmarks
2008-01-11 11:51:58 [Einstein@Home] Pausing result h1_0734.45_S5R2__175_S5R3a_0 (left in memory)
2008-01-11 11:51:58 [Einstein@Home] Pausing result h1_0734.45_S5R2__174_S5R3a_0 (left in memory)
2008-01-11 11:51:58 [Einstein@Home] Pausing result h1_0734.45_S5R2__173_S5R3a_0 (left in memory)
SIGSEGV: segmentation violation
Stack trace (8 frames):
./boinc[0x80845b2]
[0xffffe420]
./boinc[0x8058d77]
./boinc[0x8057e71]
./boinc[0x8078819]
./boinc[0x807895f]
/lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xdc)[0xb7e4aebc]
./boinc(shmat+0x59)[0x804bf21]
Exiting...
Ubuntu 7.04 Boinc 5.2.13
Wow!
But this is a segmentation fault inside the BOINC software itself, so someone should make a bug report to BOINC's TRAC system, I guess. This issue should not have any relation to Einstein@Home.
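For anyone filing that bug report, here is a minimal sketch (not anything from the BOINC developers) of how the raw addresses in the stack trace above could be pulled out and handed to binutils' `addr2line`. The helper names `parse_frames` and `addr2line_command` are made up for illustration. It assumes the `./boinc` binary is the exact build that crashed and still carries symbol information; on a stripped binary `addr2line` will only print `??`.

```python
import re

# One backtrace frame looks like "module(symbol+offset)[0xADDR]";
# the module and the "(symbol)" part may both be missing.
FRAME = re.compile(
    r"^(?P<module>[^\[\(]*)(?:\((?P<symbol>[^)]*)\))?\[(?P<addr>0x[0-9a-fA-F]+)\]$"
)

def parse_frames(trace):
    """Return (module, symbol, address) tuples for every frame line in the text."""
    frames = []
    for line in trace.splitlines():
        m = FRAME.match(line.strip())
        if m:
            frames.append((m.group("module"), m.group("symbol"), m.group("addr")))
    return frames

def addr2line_command(frames, binary="./boinc"):
    """Build an addr2line call for the frames that belong to the given binary."""
    addrs = [addr for module, _symbol, addr in frames if module == binary]
    # -f prints function names, -C demangles C++ names, -e selects the executable
    return ["addr2line", "-f", "-C", "-e", binary] + addrs

trace = """\
SIGSEGV: segmentation violation
Stack trace (8 frames):
./boinc[0x80845b2]
[0xffffe420]
./boinc[0x8058d77]
./boinc[0x8057e71]
./boinc[0x8078819]
./boinc[0x807895f]
/lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xdc)[0xb7e4aebc]
./boinc(shmat+0x59)[0x804bf21]
"""

frames = parse_frames(trace)
command = addr2line_command(frames)
print(" ".join(command))
```

Running the printed command in the BOINC directory should, if symbols are present, give function names and source lines that can be attached to the report.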
Here's the output of ps, I hope it makes sense to you because it doesn't make sense to me.
[Kathryn@Galaxy ~]$ ps -ef | grep einstein
boinc 5585 5581 96 04:15 ? 04:36:02 einstein_S5R3_4.21_i686-pc-linux-gnu --method=0 --Freq=724.755964345 --FreqBand=0.0161398745876 --dFreq=6.71056161393e-06 --f1dot=-1.58548959919e-09 --f1dotBand=1.74403855911e-09 --df1dot=3.88447721545e-10 --skyGridFile=skygrid_0730Hz_S5R3.dat --numSkyPartitions=109 --partitionIndex=50 --DataFiles1=h1_0724.60_S5R2 l1_0724.60_S5R2 h1_0724.65_S5R2 l1_0724.65_S5R2 h1_0724.70_S5R2 l1_0724.70_S5R2 h1_0724.75_S5R2 l1_0724.75_S5R2 h1_0724.80_S5R2 l1_0724.80_S5R2 h1_0724.85_S5R2 l1_0724.85_S5R2 h1_0724.90_S5R2 l1_0724.90_S5R2 --tStack=90000 --nStacksMax=84 --pixelFactor=0.500 --nf1dotRes=1 --ephemE=earth --ephemS=sun --nCand1=10000 -o Hough.out --gridType=3 --useWeights=0 --printCand1 --semiCohToplist -d1 --WUfpops=2.0406e+14
boinc 6174 5581 41 06:32 ? 01:02:37 einstein_S5R3_4.21_i686-pc-linux-gnu --method=0 --Freq=724.755964345 --FreqBand=0.0161398745876 --dFreq=6.71056161393e-06 --f1dot=-1.58548959919e-09 --f1dotBand=1.74403855911e-09 --df1dot=3.88447721545e-10 --skyGridFile=skygrid_0730Hz_S5R3.dat --numSkyPartitions=109 --partitionIndex=40 --DataFiles1=h1_0724.60_S5R2 l1_0724.60_S5R2 h1_0724.65_S5R2 l1_0724.65_S5R2 h1_0724.70_S5R2 l1_0724.70_S5R2 h1_0724.75_S5R2 l1_0724.75_S5R2 h1_0724.80_S5R2 l1_0724.80_S5R2 h1_0724.85_S5R2 l1_0724.85_S5R2 h1_0724.90_S5R2 l1_0724.90_S5R2 --tStack=90000 --nStacksMax=84 --pixelFactor=0.500 --nf1dotRes=1 --ephemE=earth --ephemS=sun --nCand1=10000 -o Hough.out --gridType=3 --useWeights=0 --printCand1 --semiCohToplist -d1 --WUfpops=2.0406e+14
Kathryn 6659 6576 0 09:01 pts/1 00:00:00 grep einstein
Kathryn :o)
Einstein@Home Moderator
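If the ps output above is hard to read by eye, a small script can pull out just the owner of each matching process. This is only a sketch: the function name `owners_of` is made up, the sample lines are shortened copies of the output above, and it assumes the usual `ps -ef` column order (UID, PID, PPID, C, STIME, TTY, TIME, CMD).

```python
def owners_of(ps_output, needle):
    """Map PID -> owning user for `ps -ef` lines whose command contains needle."""
    owners = {}
    for line in ps_output.splitlines():
        # ps -ef columns: UID PID PPID C STIME TTY TIME CMD (CMD may contain spaces)
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue
        user, pid, cmd = fields[0], fields[1], fields[7]
        # skip the header line and the grep process itself
        if not pid.isdigit() or cmd.startswith("grep "):
            continue
        if needle in cmd:
            owners[int(pid)] = user
    return owners

# Shortened copies of the lines posted above (arguments trimmed for readability)
sample = """\
boinc 5585 5581 96 04:15 ? 04:36:02 einstein_S5R3_4.21_i686-pc-linux-gnu --method=0
boinc 6174 5581 41 06:32 ? 01:02:37 einstein_S5R3_4.21_i686-pc-linux-gnu --method=0
Kathryn 6659 6576 0 09:01 pts/1 00:00:00 grep einstein
"""

result = owners_of(sample, "einstein")
print(result)
print("runs as root:", "root" in result.values())
```

Two entries owned by the same non-root user would mean two science apps running side by side under the dedicated account, which is how the output above reads.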
Damn :(
Lost another 35 units when network was cut off, hostid=1027762
Boinc 5.10.21
I erased everything and restarted with Boinc 5.2.13. Guess I will do that with the other boxen too.
OK, guess I should have aborted the ones "In Progress" first, sorry..
What is this with "boxen"? LOL. I asked someone else over on LHC what the plural of "moose" was, and they replied back with the "correct answer", which is "mooxen" (or "moosen"), according to an old Brian Regan comedy skit "Stupid In School"... You can watch this Youtube video that is a stick-figure sketch done by someone with his skit as the audio... A decent understanding of English concepts is required to "get" some of the humor... English is, honestly, a very odd language as it has different pronunciations for words that you'd think are pronounced the same, like "comb" and "tomb"...
Also, try this more "on topic" skit snippet...
Well... mine is a Core Duo, so it would kinda fit. But I don't have any single-core Linux boxes to test if they don't get problems.