Stuck in last gear

Anonymous
Topic 219286

Been having small problems with tasks getting stuck and errors.  Errors are of the "Error-1" and "Unexpected XML tag or syntax".  I have been stuck for more than 2 days on 99% complete so I suspended the task to allow memory for the other tasks to run.  I am including the pertinent data from the task below:
Application: Continuous Gravitational Wave search O2 All-Sky 1.01
Name: h1_0524.05_O2C02Cl1In0__O2AS20-500_524.20Hz_764
State: Task suspended by user
Received: 7/23/2019 05:01:48 AM
Report deadline: 8/6/2019 05:01:45 AM
Estimated computation size: 144,000 GFLOPs
CPU time: 16:37:59
CPU time since checkpoint: 00:05:11
Elapsed time: 1d 18:25:59
Estimated time remaining: ---
Fraction done: 99.000%
Virtual memory size: 309.20 MB
Working set size: 1.40 MB
Directory: slots/0
Process ID: 12732
Progress rate: 2.160% per hour
Executable: einstein_O2AS20-500_1.01_windows_x86_64.exe

In addition, I have allowed my computer to be viewed on the web.  Any help will be appreciated and thanks to Gary Roberts for the following help.



Hi Bobby,

Welcome to the Einstein@Home project!

Projects that use the BOINC ecosystem are often complex so it all can be a bit daunting when first getting started.  There are lots of volunteers willing to help but none are looking over your shoulder seeing what you see.  If your computers are 'hidden' (the default these days) and if you don't give lots of details about your setup, how it's configured, what searches you are running and exact error messages you are seeing, it's just about impossible to diagnose any problems you have.

The thread you have chosen to post in is not a good choice.  It's an 'announcement' thread about a FAQ service that was current more than a decade ago.  The golden rule is, "Always start a new thread for your particular problem", unless it's the exact problem that someone else is also having at approximately the same point in time.

If you aren't sure what details to supply, a good starting point is to go to your account page on the website and click preferences -> privacy and set 'yes' to the setting that allows your computers to be 'shown' on the website.  That allows non-sensitive information about the hardware you have, the searches you run, and the results you are getting to be inspected by other volunteers willing to help.

In addition to that, it's a good idea to mention any of the myriad of other settings you may have changed from their default values.  Some of these may well be mentioned in the 'start-up messages' you can see in the event log which is accessible through BOINC Manager -> Advanced view.  It's very helpful sometimes to see about the first 40 lines of messages that can be copied and pasted into a forum message for others to inspect.

I would like to move your original message and my response into a new thread so that this thread can return to its former state.  So, if you start a new thread showing the startup messages you get after restarting BOINC, I'll transfer these two messages and then try to work out what is causing the behaviour you see.  Please feel free to 'un-hide' your computers if you are agreeable to that.

 

Cheers,
Gary

mikey
Joined: 22 Jan 05
Posts: 12780
Credit: 1868711686
RAC: 1862765

Bobby Conger, the easy answer is that you are running the new tasks, which are much longer than the tasks you were running before. Stop running these and go back to the old ones, which you were running just fine.

Anonymous

Thanks, I'll do that.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118383158773
RAC: 25591224

Thanks for creating your own thread and thanks for allowing others to see the details of your computer.

I've had a look through your list of tasks and it looks pretty normal apart from one particular aspect which I'll talk about later on.

Firstly, you have completed tasks for the gamma-ray pulsar search and more recently you've been doing the O2AS gravitational wave search.  Was that a deliberate change of preferences on your part or do you have both searches enabled in your preferences and the scheduler may have just decided to give you some tasks for the alternate choice?

Before you just "go back to the old ones", please understand that you do have a choice and that perhaps understanding what those two searches are about might influence that choice.

The 'holy grail' for this project is the first detection of continuous GW emissions.  It's what the whole project is ultimately all about.  During the many years of upgrading the sensitivity of the LIGO detectors, the project developed the ability to detect previously unknown pulsars (of the radio- as well as gamma-ray variety) and continues to do that successfully.  More than 50 previously unknown objects have been discovered by these interim searches.  This was nothing to do with GW emissions but it is expected that these massive bodies will also be emitting continuous GW.

Now we have much more sensitive LIGO instruments and the data from the 2nd observational run (O2) gives the best chance that there could be a discovery of the continuous GW emissions of spinning massive objects like neutron stars.

The tasks for the O2AS GW search are more computationally intensive and do take longer than the gamma-ray pulsar tasks - perhaps twice as long.  But that's not a problem in itself.  Your tasks list shows the times involved and those times are well short of the task deadline.  The real question to ask yourself is what aspect of this scientific endeavour is of most interest to you.  The first ever detection of continuous GW is going to be a really big deal when it happens.  You have the chance to participate in that.  The choice about that is entirely up to you.

Bobby Conger wrote:
Been having small problems with tasks getting stuck and errors.  Errors are of the "Error-1" and "Unexpected XML tag or syntax".

There are no errors in the tasks shown in your tasks list.  I've never heard of an "Error-1".  Is it a Windows error message of some sort?  Where exactly is it produced?  Can you copy and paste some 'context' so that we can at least see exactly what bit of software gives this message and what other details there are that go with it?

The 2nd example you give is quite possibly something to do with BOINC since there are a lot of .xml files used for configuration and control - eg client_state.xml, cc_config.xml, global_prefs.xml, app_config.xml to name but a few.  Have you been editing any files with a .xml extension?  Are you running any 3rd party software that might be interacting with BOINC stuff?  It looks like something may have introduced a syntax error when modifying one of these files.  Once again, exactly what software shows the error message and is there a filename mentioned for the file where the error is detected?  Are there other parts of the message that give some context?  Copy and paste further details that might give that information, please.
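If a hand-edited file such as cc_config.xml or app_config.xml is the suspect, a quick well-formedness check can narrow things down. The following is only a rough sketch: it uses Python's standard XML parser, which is not the parser BOINC itself uses, so a file that passes here could in principle still be rejected by BOINC (and the reverse). The function name `xml_syntax_error` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def xml_syntax_error(text):
    """Return a description of the first XML syntax error in text, or None if it parses."""
    try:
        ET.fromstring(text)
    except ET.ParseError as exc:
        # The message includes the line and column where parsing failed.
        return str(exc)
    return None


# A well-formed fragment and one with a missing closing tag.
good = "<cc_config><options></options></cc_config>"
bad = "<cc_config><options></cc_config>"

print(xml_syntax_error(good))
print(xml_syntax_error(bad))
```

To check a real file, read its contents first (for example with `open(path, encoding="utf-8").read()`) and pass the text to the function.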

Bobby Conger wrote:
I have been stuck for more than 2 days on 99% complete so I suspended the task to allow memory for the other tasks to run.

Did you try stopping BOINC completely and then restarting it?  Sometimes a task may be apparently stuck.  Suspending it doesn't really do anything useful but stopping BOINC completely causes the task (when restarted) to be reloaded from a saved checkpoint on disk.  If that doesn't work, rebooting the computer is a way of totally clearing what might be in memory that's interfering with the task's progress.

From the data you provided, I've extracted what I think might be relevant:-

Bobby Conger wrote:

CPU time:                     16:37:59
CPU time since checkpoint:    00:05:11
Elapsed time:                 1d 18:25:59
Fraction done:                99.000%

If you look at the tasks in your tasks list (link earlier in the reply) you can see a large difference between CPU time and Run Time for all returned tasks. It's quite normal for there to be a relatively modest difference because occasionally, the task that is crunching may lose some CPU cycles when other higher priority work has to be attended to.  The size of that difference in your case seems abnormally large.  So is the variation between run times of similar tasks that would be expected to take pretty much the same amount of time.

Here is a good example.  Task ID LATeah0057F_1352.0_710289_0.0_2 took 33,408 secs and used 27,259 secs CPU time.  Further down, Task ID LATeah0057F_1512.0_214090_0.0_1 took 44,998 secs and used 31,040 secs CPU time.  I would expect those two tasks (using the same data file LATeah0057F.dat) to take pretty much the same time.

So the question I would ask is, "Are you running any other compute intensive work when BOINC is crunching?".  I'm not talking about ordinary day to day stuff like 'office work', web-browsing, email, document creation, printing, etc.  Those use very little CPU.  I'm talking about something seriously heavy in CPU use.

If you are, then that would explain the results in your tasks list.  BOINC is just getting out of the road when you have more important things to do.  If you're not, perhaps you need to find out what is causing the big differences and variations.  It might be something to do with malware that you don't realise is running without your consent.

These differences and variations have reached a much higher level in the task you suspended (data snip above).  The task is listed as 99% complete so rebooting your computer and allowing it to run might finish it rather quickly.  The concern is that the current CPU time for it is 59,879 secs whilst the Run Time is 152,759 secs.  So there is something preventing it from having the extra CPU seconds that would allow it to complete - while the elapsed time keeps ticking over.  It could be a flaw in the task itself but I think it's far more likely to be something else like malware or mis-configuration elsewhere in your system.
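For anyone wanting to reproduce the arithmetic, here is a small sketch that converts the time strings shown in the task properties into seconds and works out what share of the elapsed time was actually spent on the CPU. The helper names `to_seconds` and `cpu_fraction` are made up for this illustration, and the time format is assumed to be exactly the 'HH:MM:SS' or 'Nd HH:MM:SS' form quoted above.

```python
import re


def to_seconds(text):
    """Convert 'HH:MM:SS' or 'Nd HH:MM:SS' into a number of seconds."""
    match = re.fullmatch(r"(?:(\d+)d\s+)?(\d+):(\d{2}):(\d{2})", text.strip())
    if match is None:
        raise ValueError(f"unrecognised time format: {text!r}")
    days = int(match.group(1) or 0)
    hours, minutes, seconds = (int(match.group(i)) for i in (2, 3, 4))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def cpu_fraction(cpu_secs, elapsed_secs):
    """Share of the elapsed (run) time that was spent on the CPU."""
    return cpu_secs / elapsed_secs


cpu = to_seconds("16:37:59")         # CPU time of the suspended task
elapsed = to_seconds("1d 18:25:59")  # elapsed time of the same task
print(cpu, elapsed, round(cpu_fraction(cpu, elapsed), 2))

# The two gamma-ray pulsar tasks compared earlier, in seconds.
print(round(cpu_fraction(27259, 33408), 2))
print(round(cpu_fraction(31040, 44998), 2))
```

The suspended task comes out at roughly 39% of its elapsed time on the CPU, against roughly 82% and 69% for the two returned gamma-ray pulsar tasks.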

Notice that the 2 GW tasks you have returned both used about 60k secs of CPU time.  This one you suspended seems to be very close to that same amount of CPU time - a good reason for rebooting and giving it a bit more time to complete.

Cheers,
Gary.

Anonymous

Gary:

Thanks for taking a look.  I just recently had one that lingered on for >5 days and by rebooting the computer, they resolved themselves.  I have a bad habit of letting the computer run all week until Saturday or Sunday when I will clean things up and reboot.  I suppose I should do it more often.  Thanks for your input on this matter and for the info about the Gravitational Wave search.  I enjoy being part of it.

 

Bobby

Anonymous

Gary:

I noticed that the two tasks that were stuck last week with >5 days and one with > 2 days started over when I rebooted the computer.  Also, I have a task from July 12 that is still pending verification.  Should I be worried?  Am I doing something wrong?

 

Bobby

Anonymous

Gary:

Sorry for being so scattered, but to answer your question about preferences, I have all of them checked so I can get a taste of all of the different possibilities.  If I keep having trouble, I will cut those down but I'm not participating to get the most credits nor to compete with others about credits.  I want to contribute any way I can to the research.  I appreciate your time and thank you.

Bobby

Anonymous

h1_0574.25_O2C02Cl1In0__O2AS20-500_574.40Hz_893
Application: Continuous Gravitational Wave search O2 All-Sky
Created: 12 Sep 2019 8:06:01 UTC
 
This task has been computing for almost 4 days now and it is hovering at 99.977%.  I have been able to solve the other tasks that seem to get stuck by restarting my computer, however, I notice that when I do this, some of the tasks get set back to zero.  I'm going to let this one go, simply because I don't want to lose > 3.5 days of computing time.  Any suggestions?
 
Bobby
Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118383158773
RAC: 25591224

Bobby Conger wrote:
This task has been computing for almost 4 days now and it is hovering at 99.977%.

All Einstein tasks are designed to create 'checkpoints' at fairly regular intervals.  A checkpoint is simply the state of calculations at a point in time (eg. at the end of a loop of calculations) from which the whole calculation sequence could be restarted without having to go back to the very beginning.  This means that if a computer is shut down somewhere in the middle, the calculations can be restarted from the last saved checkpoint, so only the work done since that checkpoint is lost.

I believe the GW CPU tasks stop creating checkpoints at 99% and the last 1% involves 'different' calculations which only take a few minutes and after which the progress jumps straight to 100%.  The fact that you see a figure of 99.977% leads me to believe that you are seeing BOINC's 'simulated' progress.  BOINC simulates progress until the very first checkpoint is written.  The implication of this is that a 'very first' checkpoint (that should have been created in the first few minutes - 4 days ago) has never been created and written to disk for this task.

Bobby Conger wrote:
I have been able to solve the other tasks that seem to get stuck by restarting my computer, however, I notice that when I do this, some of the tasks get set back to zero.

This is a pretty clear sign that the tasks that go back to zero have never written the very first checkpoint.  I very strongly suspect that if you stop and restart the task that's currently at 99.977% it will go back to zero.  If you highlight that task on the tasks tab of BOINC Manager and then click the properties button, it will give you information about the last checkpoint.  If there is no information, there is no checkpoint so it will restart from zero when you stop and restart BOINC.  There is no use continuing on if there is no checkpoint after all this time.  Stopping and restarting may in fact get the task to start creating checkpoints after a few minutes from restarting.

I have no idea why you have tasks where checkpoints are not being created at regular intervals.  It's probably something to do with why the CPU time and the run (elapsed) time are so significantly different for the tasks that do eventually get crunched to completion.  It's most likely some sort of problem with your computer that you will have to solve.

Cheers,
Gary.

Anonymous

Thanks for the information.  So how do I write the checkpoint that should have been written, or should I just restart the computer?

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2980863995
RAC: 756242

Bobby Conger wrote:
Thanks for the information.  So how do I write the checkpoint that should have been written, or should I just restart the computer?

My suspicion is that there's no specific problem with checkpointing: the Einstein science app simply gets stuck somewhere really close to the beginning, long before the first checkpoint is ready to write.

The best way to investigate this is to examine the temporary working files that will exist on your computer. First, have a look at the 'Properties' of the stuck task - they will look something like this:

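[Screenshot: task 'Properties' dialog in BOINC Manager, with the checkpoint lines outlined in red and the 'Directory' line outlined in green]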

This one is checkpointing just fine, as the lines outlined in red show. But if the 'CPU time since last checkpoint' is large, you have the type of problem Gary is describing.

Then, make a note of the 'Directory' (outlined in green) - especially the number on the end.

By default, the temporary working files are kept at

C:/ProgramData/BOINC/

That's a hidden folder, so simply open the File Explorer and paste the line above into the address bar. If your BOINC installation is using default settings, you should see a number of folders, including 'slots'. Open that, and then the numbered folder matching what you noted from your own system.

The most helpful file to start with is 'stderr.txt'. Copy the contents of that, and post them here. If my suspicions are right, it will be quite short - no more than 40 lines. We can then compare the file with a complete equivalent generated from a machine which is working properly.
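If opening the hidden folder by hand is awkward, the same file can be pulled out with a few lines of Python. This is just a sketch under the assumptions already stated above: BOINC's data directory is at the default location, and the slot number is the one shown in the task's 'Directory' property. The function name `read_slot_stderr` and the 40-line cut-off are illustrative choices, not anything provided by BOINC.

```python
from pathlib import Path

# Default data directory mentioned above; adjust if BOINC was installed elsewhere.
BOINC_DATA_DIR = Path("C:/ProgramData/BOINC")


def read_slot_stderr(slot, data_dir=BOINC_DATA_DIR, max_lines=40):
    """Return up to max_lines lines from slots/<slot>/stderr.txt under data_dir."""
    path = Path(data_dir) / "slots" / str(slot) / "stderr.txt"
    # errors="replace" avoids a crash if the file contains odd bytes.
    with path.open(encoding="utf-8", errors="replace") as handle:
        lines = handle.read().splitlines()
    return lines[:max_lines]


# Example for a task whose 'Directory' is slots/0:
# print("\n".join(read_slot_stderr(0)))
```

The printed lines can then be copied straight into a forum post.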

 
