old: Mac OSX test Application for Einstein@Home

Snake Doctor
Joined: 21 Jul 05
Posts: 71
Credit: 552724
RAC: 0

Message 15468 in response to message 15466

Quote:

Seems that the 008 runs into a problem, probably in the FreeBSD layer. It will take some time to get this fixed. I can confirm that it _might_ hang, and if the screensaver is active, it's hard to get access to it without a reboot. Sorry for that, this is a Beta test. I suggest for now you turn off the screensaver function if you want to use the 008.

BM

Not a problem. I will turn off the screensaver. It has just been working on that WU for another 3 hours and still no change. It looks like it is still hung. The WU is number w1_0979.6_0.1_T02_S4hA_3. The report deadline is Aug 7, 2005. My computer number is 365364.

Do you think I should dump the hung WU or just suspend it?

I also just noticed that on restart of BOINC, it says "Found app_info.xml using anonymous platform"

Phil


We Must look for intelligent life on other planets as,
it is becoming increasingly apparent we will not find any on our own.

C
Joined: 9 Feb 05
Posts: 94
Credit: 189446
RAC: 0

Phil:

While BOINC is running, start up Activity Monitor (Applications/Utilities folder) and tell it you want the Dock icon to display CPU History, then close it using the yellow button. You should see the icon in the Dock with the history marching across it. The blue is BOINC running, usually the largest portion of the CPU. When BOINC has been hanging, you'll see the icon showing roughly 10% green, 80% red, and 10% blue. I'm running the BOINC stuff as CLI on my PowerBook, and in MenuBar on my iBook and on my old iMac. All three machines show the same hung state about once or twice per day. To stop it on the MenuBar machines, I have to quit MenuBar, then go to Activity Monitor and force quit the Einstein app. Then I restart MenuBar, and everything runs fine again for a while. On the CLI machine, sending Control-C to the Terminal, then using Activity Monitor to force quit the E@H app, then restarting in the Terminal gets everything back to normal.
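
The color pattern described here can be turned into a rough check. The sketch below is only an illustration: it assumes the Dock icon colors correspond to user (green), system (red), and nice (blue) CPU time, and the thresholds are guesses based on the 10/80/10 split mentioned in this post, not values taken from BOINC or Activity Monitor. The names `CpuSample`, `looks_hung`, and `hung_streak` are hypothetical.

```python
from collections import namedtuple

# One CPU reading in percent. The color mapping in the comments is an assumption.
CpuSample = namedtuple("CpuSample", ["user", "system", "nice"])
# user   -> assumed green portion
# system -> assumed red portion
# nice   -> assumed blue portion (low-priority work such as BOINC)


def looks_hung(sample, system_min=60.0, nice_max=20.0):
    """Return True when system time dominates and low-priority work is starved.

    The default thresholds are illustrative assumptions only.
    """
    return sample.system >= system_min and sample.nice <= nice_max


def hung_streak(samples, needed=3):
    """Return True if `needed` consecutive samples look hung."""
    streak = 0
    for sample in samples:
        streak = streak + 1 if looks_hung(sample) else 0
        if streak >= needed:
            return True
    return False
```

Requiring several consecutive samples avoids flagging a brief spike in system time as a hang.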

I've been trying to figure out what happens right when the BOINC system seems to hang, but the closest I've come is that I "think" it's usually right after SETI pauses and E@H restarts. I've now begun leaving Activity Monitor open in my other monitor screen so I can keep an eye on it while I'm working...

C

Snake Doctor
Joined: 21 Jul 05
Posts: 71
Credit: 552724
RAC: 0

Message 15470 in response to message 15469

OK, I have done that. Your suspicions about the SETI switch point bear out my own experience. It seems there is still some issue on the swap-out, because mine usually stops there as well. I have noticed that SETI now seems to be running faster since I reinstalled the BOINC client and the 008 beta. I do not know what has caused this, but SETI WUs now run about an hour faster than before. The speed increase in E@H is dramatic too. I still have one E@H WU stuck in the queue. It won't process (it just keeps checking), and I know it had reached 99% in about 7:30 hours. I suspended it and I am waiting for the Beta team to advise as to whether I should dump that WU. In about 2:30 hours, 008 will complete the WU it downloaded during the night. I am interested to see what happens with it.

Something strange also: PP@H has not been sending WUs all weekend. Suddenly my BOINC client aborted itself, and when I restarted it, PP@H downloaded a new app package and a new WU and started crunching. I do not know if this is in any way related to the Beta. In any event, I am content to work with the E@H guys to get this ready for release, even in the face of a few system crashes.

Thanks for your info on Activity Monitor. I have been using X Resource Graph. It uses a little CPU, but I have found it useful. I will switch to Activity Monitor.

Regards and thanks
Phil


We Must look for intelligent life on other planets as,
it is becoming increasingly apparent we will not find any on our own.

Snake Doctor
Joined: 21 Jul 05
Posts: 71
Credit: 552724
RAC: 0

Message 15471 in response to message 15470

E@H seems to be giving up the CPU to other apps OK in the middle of the process. It switched to P@H and S@H OK, but I have noticed that when it returns to E@H there is a very long delay before it starts to crunch again. According to BOINC Stat viewer it is "Checking" the WU. The more processing that has been completed on the WU, the longer it takes to get started again. On one WU that is 5:15:35 into crunching, it took long enough to restart that BOINC finally switched to a S@H module instead. I have turned off all the projects except C@H and E@H and set E@H up so it is forced to process the already started WU, to see how long it will take to begin crunching it again. The system monitors do not show any hangs.

Regards
Phil


We Must look for intelligent life on other planets as,
it is becoming increasingly apparent we will not find any on our own.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250536279
RAC: 34232

Hm. What is your "Switch between applications every" preference set to?

I'll take a look at that Workunit. Hm, hard to find the Result you were talking about - You mentioned a hostid 365364, I assume you mean 356364 (the latter one belongs to your account). This machine got sent a Result named "w1_0979.5__0979.6_0.1_T02_S4hA_3", which I suspect is what you mean.

Can you please take a look into your slots directories ("/Library/Application Support/BOINC Data/slots/*")? There should be two files, "Fstat.Ha" and "Fstat.Hb". How large are they? Is your disk space getting tight?
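
For anyone who prefers to script this check, here is a minimal sketch. It assumes the directory layout quoted in this post (one numbered folder per slot) and uses a loose "Fstat*" pattern so that checkpoint files are listed too; the function names `report_slot_files` and `free_gigabytes` are hypothetical, not part of BOINC.

```python
import glob
import os
import shutil


def report_slot_files(slots_dir, pattern="Fstat*"):
    """Return a sorted list of (path, size_in_bytes) for matching files in each slot folder."""
    # glob.escape protects against special characters in the directory name itself.
    search = os.path.join(glob.escape(slots_dir), "*", pattern)
    found = []
    for path in sorted(glob.glob(search)):
        if os.path.isfile(path):
            found.append((path, os.path.getsize(path)))
    return found


def free_gigabytes(path):
    """Return free space on the volume holding `path`, in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9
```

On a machine laid out as described here, it could be called as `report_slot_files("/Library/Application Support/BOINC Data/slots")`, followed by `free_gigabytes("/")` for the disk space question.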

If you want, you can abort the Result; I'll keep an eye on this Workunit. Maybe there's something wrong with the data. We had some WUs in S3 which caused those files to get very large, leading to long processing times at the end.

BM

Snake Doctor
Joined: 21 Jul 05
Posts: 71
Credit: 552724
RAC: 0

Message 15473 in response to message 15472

I just checked the slots files; none of the files you mentioned were over about 1.5 megs. I have 88 GB free on the drive. You are probably right on the machine number; I must have crossed the numbers. E@H is now trying to chew on WU w1_0979.7_0.1_T02_s4ha_2. This WU had about 5 hours on it when BOINC switched to SETI. The system has tried to continue the WU but can't seem to get going. BOINC monitor says it is "checking" the WU. It has been doing that for about an hour. So I might have another one stuck. Should I just reset E@H and blow off these WUs? I still have one that the system has only looked at long enough to notice it is there. Almost forgot: my switch is set to the default 1 hour.

PS - It has been about 15 minutes since I wrote the first part of this message. In that time I aborted the first 008 WU, shut down BOINC, and rebooted the system just to clear anything out. I restarted BOINC and it would not connect at first. I forced it to quit and restarted it. When it came up, it went back to the WU numbered above. After a few minutes I tried to turn on the graphics just to see if it would show me anything. It of course did not. So I forced the graphics window to close, and the module started crunching where it had originally left off at 5 hours. It now shows about 2 hours left. I will set the system so it cannot work on anything else till that WU uploads.

Thanks
Phil


We Must look for intelligent life on other planets as,
it is becoming increasingly apparent we will not find any on our own.

Bruce Allen
Moderator
Joined: 15 Oct 04
Posts: 1119
Credit: 172127663
RAC: 0

OK, I think David Anderson has figured out what's going on. The signal handler used by the BOINC application library to manage timers and suspend/resume operation is calling functions that are not async-signal-safe. This explains the random nature of these failures.
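
The usual remedy for this class of bug is to keep the signal handler trivial and do the real work in the main loop. The sketch below illustrates that pattern in Python as an analogy only; it is not the BOINC library's code, and the names `SuspendFlag`, `run_work_loop`, and `install` are hypothetical.

```python
import signal


class SuspendFlag:
    """Remembers that a suspend request arrived; nothing more."""

    def __init__(self):
        self.pending = False

    def handler(self, signum, frame):
        # Only record the event. No I/O, locking, or other work happens here.
        self.pending = True


def run_work_loop(flag, steps, on_step=None):
    """Run up to `steps` units of work, stopping once a suspend is pending.

    Returns the number of units completed.
    """
    done = 0
    for _ in range(steps):
        if flag.pending:
            # Checkpointing or logging belongs after this loop, in ordinary code,
            # not in the handler.
            break
        if on_step is not None:
            on_step(done)
        done += 1
    return done


def install(flag, signum):
    """Register the flag's handler for `signum` and return the previous handler."""
    return signal.signal(signum, flag.handler)
```

In C the same idea is typically written with a flag of type `volatile sig_atomic_t` that the handler sets and the main loop polls.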

It may take a little while to figure out how to fix this.

Bruce

Director, Einstein@Home

Snake Doctor
Joined: 21 Jul 05
Posts: 71
Credit: 552724
RAC: 0

Message 15475 in response to message 15474

From what I have seen, you may be on to something. I have a WU that should finish in about 1 hour. I am forcing the BOINC client to work on just that WU until completion. If it finishes and uploads OK, that would indicate that the problem is actually in the scheduler/switching function. It was interesting to see that attempting to run the E@H graphics and aborting them cleared the processing lockup on my system for this WU. This tells me that that process must also reset something in the system that clears a jam.

Be back in an hour.

OK, so I didn't wait an hour. For some reason the BOINC client accepted a download of P@H WUs, despite the project being suspended. E@H gave ground to one of these WUs and it started to process. When I tried to force it back to the E@H WU, it would not run until I turned on the E@H graphics window for that WU. It is now running, but only if I keep the window open. At the same time the suspended P@H WU is also running and it won't stop. I am going to let it go until the E@H WU is done and see what happens, but there is definitely something wrong in the scheduler.

Regards
Phil


We Must look for intelligent life on other planets as,
it is becoming increasingly apparent we will not find any on our own.

C
Joined: 9 Feb 05
Posts: 94
Credit: 189446
RAC: 0

Bernd:
I just checked my slots folder on my 1.33 GHz iBook... In slots, I have a folder "0" and another "1". "1" is empty. "0" has several files, including an Fstats.Ha right at 1,000 KB and an Fstats.Ha.ckp at 28 bytes. There is no Fstats.Hb. MenuBar says it's 26% done, which seems about right based on the log. This machine ran out of SETI work a couple of days ago and won't download more. Both clients have identical LTD, and both have 00000 for STD in the client_state.xml file.

[edit]
Just checked the old 333 MHz iMac: it has both .Ha at 1,796 KB with its .ckp at 28 bytes, and .Hb at 848 KB with its .ckp at 27 bytes.

C

Snake Doctor
Joined: 21 Jul 05
Posts: 71
Credit: 552724
RAC: 0

Message 15477 in response to message 15475

OK, I think I know what is happening. After the WU completes, for some reason the 4.29 version of E@H is grabbing the WU and continuing the processing. This is obvious in Activity Monitor; I actually watched as 008 quit and the mfold app took over. I have no idea what it was doing. The WU in question is w1_0979.7_0.1_T02_s4ha_2. I will try pulling the old app out of the E@H folder and see if that fixes things. My recollection is that we did not have to remove the old app to run the Beta. Was I wrong about this?

I am going to remove the old app and turn on other processes and see if they can work and play well together. I'll let you know.


We Must look for intelligent life on other planets as,
it is becoming increasingly apparent we will not find any on our own.
