Crunching time becomes longer and longer

vdquang
Joined: 2 Mar 06
Posts: 10
Credit: 2800855
RAC: 766
Topic 219639

I am crunching a Continuous Gravitational Wave Search task named h1_0584.10_O2C02Cl1In0_O2AS20-500_584.25Hz_1953_0. The run time was forecast to be approx. eleven and a half hours. (By the way, other Continuous Gravitational Wave Search tasks took 11 to 12 hours to complete.) But now, for this task, you can see:

Progress: 86.160% | Elapsed time: 19:43:47 | Remaining time: 04:08:33

Progress: 90.030% | Elapsed time: 23:00:02 | Remaining time: 03:25:49

Progress: 92.664% | Elapsed time: 26:03:40 | Remaining time: 02:51:38

Progress: 95.057% | Elapsed time: 30:00:18 | Remaining time: 02:14:54

The crunching time of this task seems to last forever. What do I need to do now?
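
For reference, the four readings above can be checked with a little arithmetic. The sketch below (Python, with illustrative helper names that are not part of BOINC) adds elapsed and remaining time for each reading to get the projected total run time. If the task were progressing normally, that total should stay roughly constant.

```python
# Readings copied from the post: (progress %, elapsed, remaining).
samples = [
    (86.160, "19:43:47", "04:08:33"),
    (90.030, "23:00:02", "03:25:49"),
    (92.664, "26:03:40", "02:51:38"),
    (95.057, "30:00:18", "02:14:54"),
]


def to_seconds(hms: str) -> int:
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Convert a number of seconds back to 'HH:MM:SS'."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def projected_totals(readings):
    """Projected total run time (elapsed + remaining) for each reading."""
    return [to_hms(to_seconds(e) + to_seconds(r)) for _, e, r in readings]


for (percent, elapsed, _), total in zip(samples, projected_totals(samples)):
    print(f"{percent:7.3f}%  elapsed {elapsed}  projected total {total}")
```

The projected total grows from 23:52:20 to 32:15:12 over the four readings, so the finish line keeps moving away as fast as the clock advances.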

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7226054929
RAC: 1065832

What else is running on the system?

One way Einstein tasks can slow way down is by sharing with another application which is more successful in getting allocated resources.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117691489184
RAC: 35068900

vdquang wrote:
The crunching time of this task seems to last forever. What do I need to do now?

I suspect that the task is not making any progress at all and what you are seeing as 'progress' is what BOINC tends to show (simulated progress) until the very first checkpoint is written.  To confirm that, you need to look at the task properties using BOINC Manager.  The easiest way to see those properties is described in this message.  In fact, you may have the very same sort of issue that was described in the thread that contains the linked message.  You perhaps should look at other messages as well to get the full picture.

If it turns out that there is no saved checkpoint, you should at least stop and restart BOINC.  The task will restart from the beginning and after a couple of minutes, details for a checkpoint should be able to be viewed using the properties page.  If, after a while, there is still no checkpoint, you could try rebooting your machine to see if that will fix the issue.  I have no idea what might be doing this but would be interested to know if you can get crunching working again.

Cheers,
Gary.

vdquang
Joined: 2 Mar 06
Posts: 10
Credit: 2800855
RAC: 766

On this computer I run 3 applications in total (E@H included). Two tasks can run at the same time, because the computer has two processors. I don't think this would slow down the crunching of E@H or any other tasks very much. It may prolong the total time of crunching (= elapsed time + suspended time), but not the elapsed time alone.

vdquang
Joined: 2 Mar 06
Posts: 10
Credit: 2800855
RAC: 766

I checked the properties of the task. Both the 'CPU time' and 'CPU time since checkpoint' were the same (00:00:20), and these values remained unchanged while the task continued running and the elapsed time continued to increase. I exited BOINC and restarted the computer. Now the task shows: Progress 0% | Elapsed time --- | Remaining time 11:09:29. It means all the crunching progress has been lost!

This task is now waiting to run. Should I start running the task again?

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117691489184
RAC: 35068900

vdquang wrote:
... the task shows: Progress 0% | Elapsed time --- | Remaining time 11:09:29. It means all the crunching progress has been lost!

No.  It means that the task had not made any progress in the first place (so there really was nothing to lose) and by continuing on, it probably would have eventually failed with a "time limit exceeded" error.

vdquang wrote:
This task is now waiting to run. Should I start running the task again?

Yes.  After it has run for a couple of minutes, you should keep checking the properties to confirm if checkpoints are being saved.  Each time there is a new checkpoint, the 'CPU time since checkpoint' should reset to a low value (zero if you check at just the right time) and then start increasing again.  If you don't see that reset and subsequent rise, the task is not making proper progress and probably never will.
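
The reset-and-rise pattern described above can be written down as a small check. This is only a sketch in Python: the function name is made up for illustration, and the readings are values you would note by hand from the properties page (in seconds), not something read from BOINC automatically.

```python
def saw_reset_then_rise(readings):
    """Return True if 'CPU time since checkpoint' dropped at least once
    (a new checkpoint was written) and then started increasing again.

    `readings` is a list of values in seconds, noted by hand at intervals.
    """
    reset_seen = False
    for previous, current in zip(readings, readings[1:]):
        if current < previous:
            # The counter went down: a checkpoint was written in between.
            reset_seen = True
        elif reset_seen and current > previous:
            # The counter is climbing again after the reset.
            return True
    return False


# Hypothetical readings from a task that writes checkpoints normally.
healthy = [30, 90, 150, 12, 72, 132]
# Readings like those reported earlier in this thread: stuck at 00:00:20.
stuck = [20, 20, 20, 20]

print(saw_reset_then_rise(healthy))
print(saw_reset_then_rise(stuck))
```

A counter that only ever rises, or never moves at all, gives False. Keep in mind that a task that has simply not yet reached its first checkpoint also gives False, so allow enough time between readings before drawing a conclusion.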

The other thing to do is to check in the correct slot directory to see the log contents as was discussed in the other thread I linked to.  If you see something like the 'good' log in the other thread, the task is making progress.

Cheers,
Gary.

vdquang
Joined: 2 Mar 06
Posts: 10
Credit: 2800855
RAC: 766

At present I have 4 E@H tasks to run. All of them are of Continuous Gravitational Wave Search. I decided to run all of them to check their properties.

Task 1 (Deadline 3 Oct.)  h1_0584.10_O2C02Cl1In0_O2AS20-500_584.25Hz_1953_0 (This is the old task I referred to at the beginning of this thread. I am re-running it now.)

Task 2 (Deadline 8 Oct.)  h1_0584.65_O2C02Cl1In0_O2AS20-500_584.80Hz_1267_0

Task 3 (Deadline 8 Oct.)  h1_0584.65_O2C02Cl1In0_O2AS20-500_584.80Hz_1268_0

Task 4 (Deadline 8 Oct.)  h1_0584.65_O2C02Cl1In0_O2AS20-500_584.80Hz_1269_0

I checked the properties of each task at intervals of 2-3 min. and made final checks approx. 10 min. after the start of running.

In the main view window everything seemed OK for all 4 tasks (the progress and the elapsed time increased...). However, the properties of the 4 tasks showed different situations. The CPU time and the CPU time since checkpoint of tasks 2 and 3 were OK, while those of tasks 1 and 4 were strange: always at zero, from the beginning to the last check!

I took pictures of the 4 tasks' properties at the final checks (10 min. after the start of running), but I don't know how to insert the pictures into the post.

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7226054929
RAC: 1065832

vdquang wrote:
I don't know how to insert the pictures into the post.

We don't insert pictures in posts here.  We insert links in posts which point to wherever the pictures are actually hosted.  One form of such linking uses the IMG tag and lets the forum users actually see your image in your post.  Another form just looks like a link, which saves bandwidth for people who don't actually need to see your image each time they review the thread messages.

The hosting location could be your own web site, if you have one.  It could also be an image hosting site.  One currently popular one which provides basic service for free is Imgur.  I used Photobucket for this purpose for years, but it has gone to a bad place--don't try them.

For details on images in your posts (and other useful tricks), review the BBCODE HELP link right below the text entry box you use when posting here.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117691489184
RAC: 35068900

vdquang wrote:
... I decided to run all of them to check their properties.

When you are trying to understand what is happening, it's not a good idea to make an overly complicated test procedure.  Your host has a dual core CPU.  The maximum you should be trying to test is just two, but preferably just the one that seems to have a problem.

vdquang wrote:
I checked the properties of each task at intervals of 2-3 min. and made final checks approx. 10 min. after the start of running.

You don't say exactly how you achieved this but, presumably, you ran two for a while, then suspended them so that the other two could start.  Every time you suspend and restart tasks, you are adding a complication whose effect on the overall behaviour is simply an unnecessary distraction you really don't need.

vdquang wrote:
In the main view window everything seemed OK for all 4 tasks (the progress and the elapsed time increased...).

You cannot know what is going on that way.  The elapsed time will keep increasing as will the estimated % completed but you won't know for sure whether the progress is 'real' or just BOINC's 'simulated' progress.  You really need to know if new checkpoints are being written from time to time.  You really need to see this by watching the properties page of a single task but even that might confuse you if you keep changing between different tasks.

If you now have 4 partially completed tasks, go to the slot directory for each one and check the contents of the stderr.txt file that you find in each separate directory.  Be careful to browse with a plain text editor but do NOT make any changes to the files.  What you might see has been explained in the other thread I linked to in a previous message.  By comparing what you see with what was shown for the 'good' log in the other thread, you will know for sure whether or not checkpoints are being written.
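
If opening each slot directory by hand is tedious, something like the following Python sketch can collect the last few lines of every stderr.txt in one go. It only reads the files and never writes to them. The function name is made up for illustration, and the location of the BOINC data directory is an assumption you must supply yourself, since it differs between operating systems and installations.

```python
from pathlib import Path


def tail_slot_logs(data_dir, n=10):
    """Return {slot name: last n lines of stderr.txt} for every slot
    directory found under <data_dir>/slots. Files are opened read-only."""
    logs = {}
    slots = Path(data_dir) / "slots"
    for log_file in sorted(slots.glob("*/stderr.txt")):
        # errors="replace" avoids a crash on any odd bytes in the log.
        text = log_file.read_text(errors="replace")
        logs[log_file.parent.name] = text.splitlines()[-n:]
    return logs


def print_slot_logs(data_dir, n=10):
    """Print the tail of each slot's stderr.txt under a short header."""
    for slot, lines in tail_slot_logs(data_dir, n).items():
        print(f"--- slot {slot} ---")
        for line in lines:
            print(line)
```

Call print_slot_logs() with the path to your own BOINC data directory, then compare what is printed for each slot with the 'good' log in the other thread.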

Cheers,
Gary.

Rolf
Joined: 7 Aug 17
Posts: 27
Credit: 135377187
RAC: 0

I am not sure how to interpret exactly what you tried, but I assume that:

You tried running the four tasks, one at a time, by selecting "suspend" and "resume". Two of them are stuck and not making any progress after ten minutes. Two tasks are making some progress.

Ten minutes might be too short to see progress on these long-running tasks, but since you also tried running one of them for eleven hours, I think we can assume they are stuck.

To solve the problem, you can abort the tasks, which means they will be skipped and new tasks will be downloaded instead. In the worst case, there may be a common Einstein or GW file that is preventing progress. In that case you can "reset" the project in the project tab. It is less dramatic than it sounds: it will delete your local files for Einstein and download fresh copies. You will lose all progress of the currently running tasks, but you only have two tasks with ten minutes of progress each. Not much to lose.

vdquang
Joined: 2 Mar 06
Posts: 10
Credit: 2800855
RAC: 766

I indeed ran 2 tasks for approx. 10 min. and checked their properties at intervals of 2-3 min. Then I suspended them and let the 2 other tasks run so I could check their properties with the same procedure. I took pictures of the properties and also copied the stderr.txt of all 4 tasks (at the point when each had run approx. 10 min.). I will try to send them to you later.

As I wrote in the previous post, tasks 1 and 4 behaved strangely: nothing was written in either 'CPU time' or 'CPU time since checkpoint' during the 10 min. or later on.

I am busy now and will be back later.
