Help with elapsed time recording, please

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2961149300
RAC: 693340
Topic 220522

I'm setting up a new computer for testing, and saw this:

It's host 12803928, running Gamma-ray pulsar search #5. As you can see, it downloaded four tasks this morning, and all started at the same time. Doing some more set-up work this afternoon, I saw that the last task had paused - waiting for memory. So I configured that as well - default memory constraints for 'in use' were a bit tight - and the task restarted.

What happened to the 'elapsed time' column? I'm also testing BOINC v7.16.4, released overnight. So, is this common for these tasks at Einstein, or is it a bug I need to report to BOINC?

Richard Haselgrove

And now it looks like this - time has passed, but no progress made.

Or, possibly, it started again at zero and is counting up again. Has moved on to 5.487% while I thought about it. Clues:

1) stderr

11:24:20 (1772): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FGRP5_1.08_windows_intelx86__FGRPSSE.exe'.
........................................................................................................................................
14:05:51 (5100): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FGRP5_1.08_windows_intelx86__FGRPSSE.exe'.
read_checkpoint(): Couldn't open file 'LATeah1003F_88.0_252_-8.3e-11_0_0.out.cpt': No such file or directory (2)

2) files

It's true - there's no .cpt file in the slot directory occupied by that task (there is one in each of the others). There's no boinc_task_state.xml file either.
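For anyone wanting to repeat that check across every slot at once, here is a minimal sketch. It assumes the BOINC data directory contains a slots folder with one subdirectory per task, and that checkpoint files end in .cpt as in the stderr line above. The function name is purely illustrative, and the location of the slots folder depends on the installation.

```python
from pathlib import Path


def slots_without_checkpoint(slots_root):
    """Return the names of slot subdirectories that hold no *.cpt file."""
    missing = []
    for slot in sorted(Path(slots_root).iterdir()):
        if not slot.is_dir():
            continue
        # any() stops at the first checkpoint file found in this slot
        if not any(slot.glob("*.cpt")):
            missing.append(slot.name)
    return missing
```

Call it with the path to the slots folder. Note that an empty slot directory with no task in it would be listed as well, so the result still needs a quick look by eye.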

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117792318568
RAC: 34677885

Here is an excerpt of what the final task of the first four to finish returned.  All 4 are now validated - so all OK.

In the excerpt, I've truncated the extremely long line of 'dots' - replaced most with a few spaces - there were 738 in total :-).

% Sky point 1/6
% Creating FFT (3.3.4 22109fa) plan.
% Starting semicoherent search over f0 and f1.
% nf1dots: 738 df1dot: 1.35777895e-015 f1dot_start: -7.6e-011 f1dot_band: 1e-012
% Filling array of photon pairs
........................................................   ................................
INFO: Major Windows version: 6
% C 1 0

Sky point 1/6 means the data is for the 1st sky point out of a total of only 6 for the entire task.  Since the task took over a day, it would have taken more than 4 hours for the first checkpoint to be written, so there is a lot of time for 'simulated' progress before any hard data is available.  At that point there would be an 'adjustment' to the % done, based on the data.

The resetting of the elapsed time to zero was probably due to the task being kicked out of memory when there wasn't enough available for all 4.  At the time of your first image, none of the 4 tasks had created checkpoints.  Checkpoint creation would occur after each 90%/6 = 15% of progress.  So, at the time of your second picture, 3 tasks would have created checkpoints but not the one that had been suspended earlier.
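The arithmetic above can be written out as a rough sketch. It assumes one checkpoint per completed sky point, sky points of equal length, and that the search accounts for 90% of the progress bar. The time estimate is simply runtime divided by sky points, which ignores any time spent outside the search. The function name and the 24-hour runtime are illustrative ("over a day" gives a lower bound).

```python
def checkpoint_interval(n_sky_points, runtime_hours, search_fraction=0.90):
    """Return (progress step in percent, approximate hours) between checkpoints.

    Assumes one checkpoint per completed sky point and sky points of
    equal length. search_fraction is the share of the progress bar
    taken up by the search itself (90% in the figures above).
    """
    step_percent = 100.0 * search_fraction / n_sky_points
    step_hours = runtime_hours / n_sky_points
    return step_percent, step_hours


# Compare the 6 sky points seen here with a more typical count of 80
for points in (6, 80):
    pct, hours = checkpoint_interval(points, runtime_hours=24.0)
    print(f"{points:2d} sky points: checkpoint every {pct:.3f}% (about {hours:.1f} h)")
```

With 80 sky points, a task interrupted before a checkpoint loses well under an hour of work on this estimate; with 6, it can lose several hours.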

The nf1dots parameter (value 738) represents individual calculation loops that go into making up the sky point.  A new 'dot' is written to the full string of dots as each loop finishes.  Unfortunately, no checkpoints until the full 738 loops are done.  The 'C' in the last line signifies a checkpoint being written.
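To see how far a task has got through the current sky point, the dots and checkpoint lines in a stderr excerpt can be counted. This is only a sketch based on the layout of the excerpt above, and the function name is made up. It treats any line made up solely of dots as progress and any line of the form '% C ...' as a checkpoint. The numbers after the 'C' are not interpreted, and the count is only meaningful for an excerpt covering a single sky point.

```python
import re


def summarize_stderr(text):
    """Count progress dots and checkpoint lines in a stderr excerpt."""
    nf1dots = None
    dots = 0
    checkpoints = 0
    for line in text.splitlines():
        stripped = line.strip()
        match = re.search(r"nf1dots:\s*(\d+)", stripped)
        if match:
            # Number of loops expected for the sky point
            nf1dots = int(match.group(1))
        elif stripped and set(stripped) <= {".", " "}:
            # A progress line: nothing but dots (and spaces, if truncated)
            dots += stripped.count(".")
        elif re.fullmatch(r"%\s*C(\s+\d+)*", stripped):
            # A line such as '% C 1 0' marks a checkpoint being written
            checkpoints += 1
    return {"nf1dots": nf1dots, "dots": dots, "checkpoints": checkpoints}
```

If the dot count is still below nf1dots and no checkpoint line has appeared, a task that gets stopped at that point has nothing to resume from.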

So, all in all, the very low number of sky points might cause some discomfort for people who stop and start crunching or who suspend BOINC when the user is active and who don't keep tasks in memory when suspended.  These tasks do have quite a large memory footprint.  I wasn't aware of the low sky points until I looked just now.  I'm used to seeing rather larger values - like 70 to 80 or more.  There has been a recent change in the sequence of data file names.  It would have been nice to have had a warning about the very low number of checkpoints if that came along with the data change.  I don't know that for sure but it seems likely.

Cheers,
Gary.

Richard Haselgrove

Thanks Gary - that's exactly the sort of confirmation I was looking for. I got the machine specifically to explore a BOINC problem - the 64-bit client crashes on these low-power Celeron CPUs, but it's fixed in the new v7.16.4 - and I'd have preferred to put on a known application. But SETI is having major server problems at the moment. You've probably noticed a spike in refugees...

Looks like there is still a problem with the pseudo-progress report: elapsed time has reverted to zero, but progress hasn't. I had 'leave in memory when suspended' checked, but since the task was closed down for lack of memory, I think it's reasonable that the setting was overridden. I hope it's thrown up some useful information about this particular batch of tasks.
