OK, I think David Anderson has figured out what's going on. The signal handler used by the BOINC application library to manage timers and suspend/resume operation is calling functions that are not async-signal-safe. This explains the random nature of these failures.
It may take a little while to figure out how to fix this.
Bruce
From what I have seen you may be on to something. I have a WU that should finish in about 1 hour. I am forcing the BOINC client to work on just that WU until completion. If it finishes and uploads OK, that would indicate that the problem is actually in the scheduler/switching function. It was interesting to see that attempting to run the EH graphic and aborting it cleared the processing lockup on my system for this WU. This tells me that the graphics process must also reset something in the system that clears a jam.
Be back in an hour.
OK, so I didn't wait an hour. For some reason the BOINC client accepted a download of a P@H WU, despite my having the project suspended. E@H gave ground to one of these WUs and it started to process. When I tried to force it back to the E@H WU, it would not run until I turned on the E@H graphic window for that WU. It is now running, but only if I keep the window open. At the same time the suspended P@H WU is also running and it won't stop. I am going to let it go until the E@H WU is done and see what happens, but there is definitely something wrong in the scheduler.
Regards
Phil
OK, I think I know what is happening. After the WU completes, for some reason the 4.29 version of E@H is grabbing the WU and continuing the processing. This is obvious in the Activity Monitor; I actually watched as 008 quit and the mfold app took over. I have no idea what it was doing. The WU in question is w1_0979.7_0.1_T02_s4ha_2. I will try pulling the old app out of the E@H folder and see if that fixes things. My recollection is that we did not have to remove the old app to run the Beta. Was I wrong about this?
I am going to remove the old app and turn on other processes and see if they can work and play well together. I'll let you know.
OK, it ran for about 2:55 hours, and at that point it hung. Other work seemed to progress but EH never went any further. The Activity Monitor flagged it red. The graphic would not start, so I could not get it going with that as before. I stopped BOINC and restarted it, and it came up and started working on the WU again. Based on the way work had proceeded, it seems that if the system has at least 2 WUs for each app, it will run fine. The problem seems to occur when it uses the two-processor system to work on more than one thing at a time. This happens any time I have only one WU for a particular project in the queue. If EH is one of those things, for some reason it will not release the WU properly and move on. As long as it can work on two WUs from the same project it seems happy. I looked in the slots folder while it was hung, and the two files mentioned earlier were 1.5 MB and 4 KB. There was a "Lock" file in the slot as well. Also, for some reason the CPU use had dropped to only 70% for EH, while it would normally run at more like 95%+.
Regards
Phil
We Must look for intelligent life on other planets as,
it is becoming increasingly apparent we will not find any on our own.
I know I seem to be talking to myself, but I want to keep you posted on events as I find them. I thought I would test whether some of the EH switching problems might be related to my dual-processor system working on more than one project at a time. I reloaded all the software (the recommended BOINC, freshly downloaded, and 008) on a Mac PowerBook G4 (computer ID 382316) and installed the 008 beta. The first time it tried to switch, EH stayed in memory after the switch. It was only using about 30% of the CPU, but it was there. The clock on the WU was not incrementing. I stopped BOINC and restarted it. It promptly downloaded new WUs from both SETI and P@H. It has been running for about 5 hours now and seems to have successfully switched from other apps to EH and back again. The app unloaded from memory when this occurred.
It seems to me that for some reason the EH app is not correctly recognizing the BOINC switch commands and yielding the processor by dumping completely from memory. This could only be worse on a 2-CPU system, where it seems each processor is handled individually and the two could be running separate projects at the same time. It also seems that the problem is more frequent if the handoff is occurring between EH and SETI. So far P@H has been relatively mild-mannered, having only become tangled once.
That said, once the system runs for a while it seems to stabilize and behave better.
Good Luck.
Regards
Phil
We Must look for intelligent life on other planets as,
it is becoming increasingly apparent we will not find any on our own.
Is there any fraking reason to quote every time you type something? Quote only when needed, which to be honest is not very often. I think most of us can scroll if needed; we don't need a recap of a recap of a recap.
MacStef
Sorry, different system from a different Beta group.
Regards
Phil
We Must look for intelligent life on other planets as,
it is becoming increasingly apparent we will not find any on our own.
Well, I have been watching 008 work for 24 hours now with other apps running. After a few fits and starts last night, it started working better. It now yields the CPU to other apps cleanly and restarts well. I have watched it run on one CPU while other apps run on the other. It has switched off while other apps continued on the other CPU, and other apps have switched and left it running on the other. It seems to be very memory efficient, usually using less than 3 MB during most of the process with occasional spikes to 5 MB. Left alone it will use up to 99% of the CPU.
One thing I have noticed is that BOINC may not be doing the best housecleaning on the slot directories. I am seeing WUs listed in the work display in BOINC that have already been uploaded. I assume this is because the stats for the WU are still in the slot folder. Is it possible that some of the problems when first loading 008 are caused by remnant files in the slot folders? This would explain why some of the problems seem to clear up after the system runs for a while: as the system runs, the slots eventually get overwritten and clean themselves. Just wondering.
The only other thing I have seen is about a 15-20% increase in speed for the SETI WUs. It is probably not related to 008, but it amounts to over an hour per SETI WU on my system. There was some speed increase when I first loaded 008, but I then upgraded BOINC to the latest build and the performance really went up. E@H WUs take almost exactly 7:20 each. You can almost set your watch by it. The public system takes over 9 hours per WU on my system. Nice work, guys.
Regards
Phil
We Must look for intelligent life on other planets as,
it is becoming increasingly apparent we will not find any on our own.
I will apologize up front for the long message, but I have no place from which to serve files.
008 seemed to run fine for most of the night until it neared the end of WU w1_0979.5__0979.8_0.1_T02)S4hA_2. It restarted the unit at about 70% complete and also started processing a second E@H WU. When it reached 100% it cleared the progress column. The CPU time remained at 07:20:23, and E@H would not release the CPU. The second WU seemed to complete its cycle and moved on to other projects at about 23% complete. Eventually the Activity Monitor recognized the process as not responding and flagged it red, even though it showed 55% CPU usage. I have suspended the WU for now. In the slot folder there are a number of files, in particular four "Fstats" files:
Fstats.Ha - 1.4 MB
Fstats.Ha.ckp - 4 KB
Fstats.Hb.ckp - 1.4 MB
Fstats.Hb - 4 KB
The stderr.txt file reads:
Detected CPU type 1
Detected CPU type 1
Resuming computation at 1584/59035/59035
Detected CPU type 1
Resuming computation at 1708/66325/66325
Detected CPU type 1
Resuming computation at 2412/102864/102864
Detected CPU type 1
Resuming computation at 2412/102864/102864
Detected CPU type 1
Resuming computation at 2412/102864/102864
Detected CPU type 1
Resuming computation at 10070/664632/664632
Resuming computation at 10070/664632/664632
Detected CPU type 1
Resuming computation at 18999/1151821/1151821
Detected CPU type 1
APP DEBUG: Application caught signal 15
Resuming computation at 25287/1541897/1541897
Detected CPU type 1
Resuming computation at 28537/1716779/1716779
Detected CPU type 1
detected finished Fstat file - skipping Fstat run 1
Resuming computation at 4737/266480/266480
Detected CPU type 1
detected finished Fstat file - skipping Fstat run 1
Resuming computation at 11328/697111/697111
Detected CPU type 1
detected finished Fstat file - skipping Fstat run 1
Resuming computation at 15478/939477/939477
Detected CPU type 1
detected finished Fstat file - skipping Fstat run 1
Resuming computation at 22980/1351225/1351225
Detected CPU type 1
detected finished Fstat file - skipping Fstat run 1
APP DEBUG: Application caught signal 15
detected finished Fstat file - skipping Fstat run 1
APP DEBUG: Application caught signal 15
detected finished Fstat file - skipping Fstat run 1
Regards
Phil
We Must look for intelligent life on other planets as,
it is becoming increasingly apparent we will not find any on our own.
David Anderson has "fixed" the BOINC API library bug that was causing unpredictable hangs in test app version 0.08. Bernd has just finished building and testing a new version 0.11 of the einstein application. Please download it from the app test page. Please continue to report success or problems with the Mac OS X test app to this discussion group thread.
Director, Einstein@Home
One thing I have noticed is that BOINC may not be doing the best housecleaning on the slot directories....
It does. After the result file has been uploaded to the upload server, the slot directory stays there until the result (and the upload) has been reported to the scheduler, e.g. to keep the stderr output, which goes into the database. Between these two actions the result is shown in the Manager as "uploaded"; it vanishes from the work list along with the slot directory once it has been reported.
BM
Bernd-
Thanks. That explains why these files are still there. My WU never seems to get to the point of uploading. Since there seem to be two sets of result files, I was wondering: do you suppose that after it detects an error it is trying to process the WU again? The BOINC Monitor app shows the WU status as "Checking" while all this is going on.
Bruce -
The link off the Beta page only goes to the standard BOINC packages. Where can I find the latest version of BOINC with the API library fix?
Thank you in advance.
Phil
We Must look for intelligent life on other planets as,
it is becoming increasingly apparent we will not find any on our own.