Strange WUs names and checkpoint issues in the latest FGRP5 batch

Mad_Max

Joined: 2 Jan 10

Posts: 165

Credit: 2268702088

RAC: 672667

29 Dec 2024 3:31:40 UTC

Topic 231888

In one of the latest WU batches for the FGRP5 project, there are strange problems with the naming of tasks and data files:

What normal task names (and the corresponding file names) look like:

LATeah2112F_1384.0_207988_0.0_0
LATeah2112F_1384.0_242638_0.0_0
LATeah2112F_1384.0_608542_0.0_0
LATeah2112F_1400.0_41162_0.0_0
LATeah2112F_1400.0_41184_0.0_0
LATeah2112F_1400.0_41206_0.0_0

What the problematic WU names look like:

LATeah2113F_56.0_1208_-1.6e-11_0
LATeah2113F_72.0_598_-8.499999999999999e-11_0
LATeah2113F_72.0_598_-9.299999999999996e-11_0
LATeah2113F_72.0_56_-8.4e-11_2
LATeah2113F_72.0_762_-2.2999999999999998e-11_0
LATeah2113F_72.0_782_-4.000000000000003e-11_1
LATeah2113F_72.0_782_-4.400000000000004e-11_1
LATeah2113F_72.0_1078_-8.999999999999997e-11_0

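Looks like a rounding issue: some of the zeros are replaced by extremely small "almost zero" values in a floating-point variable, like -8.999999999999997e-11 (that is about -0.00000000009). With this many significant digits it is more likely a double-precision variable than FP32.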

The same applies to the names of some input and output files for such WUs. I see file names like 'LATeah2113F_88.0_2970_-9.099999999999997e-11_0_1' quite often.
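
To illustrate how names like these can arise, here is a minimal Python sketch. It assumes (this is a guess, not something confirmed by the project) that the value is built by repeatedly adding a small step to a double-precision number and then printed without a fixed precision. The start value, step and function names are hypothetical.

```python
# Sketch only: the start value and step are illustrative assumptions,
# not the project's real workunit generator.

def accumulate(start, step, count):
    """Add `step` to `start` `count` times, the way a naive loop would."""
    value = start
    for _ in range(count):
        value += step  # every addition is rounded to the nearest double
    return value


def short_label(value, digits=3):
    """Format a float with a fixed number of significant digits."""
    return format(value, f".{digits}g")


start, step = -1.0e-10, 1.0e-12
summed = accumulate(start, step, 10)   # ten small steps, each one rounded
direct = start + 10 * step             # a single multiply and add

# repr() prints the shortest string that round-trips the double, which can
# take up to 17 significant digits; the two results may differ at the end.
print(repr(summed), repr(direct))

# Formatting with a fixed precision hides the rounding noise in a label.
print(short_label(-8.999999999999997e-11))
```

If the names are produced this way, the long digit strings would be purely a formatting effect: the underlying value differs from the "clean" one only in the last bits.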

It could just be a cosmetic display defect. But I found that many such tasks (probably all, but I didn't check every one because there are too many of them) also have other, more significant problems:

1 - Checkpointing in such WU batches is partially broken: checkpoints are made at only two points, 45% and ~89.95% (though many times in a row around 89%). Depending on the CPU speed, there can be many hours between checkpoints, and a significant amount of useful computation is lost if the computer, BOINC or the task is restarted, or if BOINC switches between projects frequently (unless the user has enabled the option to leave suspended tasks in memory).

2 - Progress reporting to the BOINC client is disrupted in the same way (visible if the user has disabled BOINC task progress interpolation via the <fraction_done_exact/> flag, or by checking boinc_task_state.xml in the working slot directory). WU progress jumps 0% - 45% - 89.95% - 100% without any intermediate values.

Examples of the stderr.txt and boinc_task_state.xml files for a task that has been running ~6 hours on a Ryzen 2700 (about 75-80% done, as one FGRP5 WU usually takes about 7.5-8 hours to finish on this machine):

01:20:19 (1580): [normal]: This Einstein@home App was built at: Jul 26 2017 09:32:43
01:20:19 (1580): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FGRP5_1.08_windows_intelx86__FGRPSSE.exe'.
01:20:19 (1580): [debug]: 2.1e+015 fp, 4.2e+009 fp/s, 500823 s, 139h07m02s87
command line: projects/einstein.phys.uwm.edu/hsgamma_FGRP5_1.08_windows_intelx86__FGRPSSE.exe --inputfile ../../projects/einstein.phys.uwm.edu/LATeah2113F.dat --alpha 3.5340718238 --delta -1.0671047766 --skyRadius 0.0008901179185 --ldiBins 15 --f0start 72.0 --f0Band 16 --firstSkyPoint 3014 --numSkyPoints 2 --f1dot -5.700000000000008e-11 --f1dotBand 1e-12 --df1dot 1.004320633e-15 --ephemdir ..\..\projects\einstein.phys.uwm.edu\JPLEPH --Tcoh 4194304.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.15 --reftime 57569.0 --f0orbit 0.005 --freeRadiusFactor 2 --mismatch 0.15 --debug 0 -o LATeah2113F_88.0_3014_-5.600000000000008e-11_0_0.out
output files: 'LATeah2113F_88.0_3014_-5.600000000000008e-11_0_0.out' '../../projects/einstein.phys.uwm.edu/LATeah2113F_88.0_3014_-5.600000000000008e-11_0_0' 'LATeah2113F_88.0_3014_-5.600000000000008e-11_0_0.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah2113F_88.0_3014_-5.600000000000008e-11_0_1'
01:20:19 (1580): [debug]: Flags: i386 SSE GNUC X86 GNUX86
01:20:19 (1580): [debug]: Set up communication with graphics process.
read_checkpoint(): Couldn't open file 'LATeah2113F_88.0_3014_-5.600000000000008e-11_0_0.out.cpt': No such file or directory (2)
INFO: Major Windows version: 6
% C 1 0

====================

<active_task>
    <project_master_url>https://einstein.phys.uwm.edu/</project_master_url>
    <result_name>LATeah2113F_88.0_3014_-5.600000000000008e-11_0</result_name>
    <checkpoint_cpu_time>11602.730000</checkpoint_cpu_time>
    <checkpoint_elapsed_time>11837.349976</checkpoint_elapsed_time>
    <fraction_done>0.450000</fraction_done>
    <peak_working_set_size>685195264</peak_working_set_size>
    <peak_swap_size>681181184</peak_swap_size>
    <peak_disk_usage>13367</peak_disk_usage>
</active_task>

As you can see, only 1 checkpoint was saved, at 45%, about 3.3 hours after the task started, and none in the ~2.5 hours since. But near 89% it can write up to 10-15 checkpoints in a row.
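
For anyone who wants to check this on their own machine, here is a small Python sketch that reads the same fields out of a boinc_task_state.xml snippet. The XML string is copied from the example above; the function name and the returned dictionary keys are my own choices, not part of BOINC.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the boinc_task_state.xml content shown above.
STATE_XML = """<active_task>
    <result_name>LATeah2113F_88.0_3014_-5.600000000000008e-11_0</result_name>
    <checkpoint_cpu_time>11602.730000</checkpoint_cpu_time>
    <checkpoint_elapsed_time>11837.349976</checkpoint_elapsed_time>
    <fraction_done>0.450000</fraction_done>
</active_task>"""


def summarize_task_state(xml_text):
    """Return progress and last-checkpoint times from a task state snippet."""
    root = ET.fromstring(xml_text)
    elapsed = float(root.findtext("checkpoint_elapsed_time"))
    cpu = float(root.findtext("checkpoint_cpu_time"))
    return {
        "result_name": root.findtext("result_name"),
        "fraction_done": float(root.findtext("fraction_done")),
        "checkpoint_elapsed_hours": elapsed / 3600.0,
        "checkpoint_cpu_hours": cpu / 3600.0,
    }


summary = summarize_task_state(STATE_XML)
print(f"{summary['result_name']}: "
      f"{summary['fraction_done']:.2%} done, last checkpoint after "
      f"{summary['checkpoint_elapsed_hours']:.2f} h elapsed")
```

To use it on a live task, read the file from the task's slot directory into a string and pass that to `summarize_task_state` instead of `STATE_XML`.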

Link to example of such "bad" WUs in DB: https://einsteinathome.org/task/1704029984

And normal WU for comparison: https://einsteinathome.org/task/1703391437

Note the difference in checkpoints: 12 cpt in 2 blocks (1 @ 45% and 11 @ ~89%) vs 31 cpt in 22 blocks written at approximately regular intervals, apart from a pack of probably excessive checkpoints also recorded in a row at ~89%.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5888

Credit: 119757200069

RAC: 25688844

29 Dec 2024 23:33:52 UTC

Message 231471

Mad_Max wrote:

... there are strange problems with naming tasks and data files:

Values listed in the task name are not used in the calculations being performed. They are probably just there to remind the researchers of the parameter space being probed. So there is nothing "problematic" with an unusual name.

Mad_Max wrote:

.... such tasks also have other more significant problems:

Unfortunately, what you are listing are things that are not problems but simply unavoidable characteristics.

A checkpoint can only be written when a particular set of calculations has been completed. In the parameters you listed in the stderr.txt snip, there were two key values: --numSkyPoints 2 and --toplist 10. This tells you that for the main calculation stage (the first 90%), there will only be 2 'skypoints' and so only two opportunities to write a checkpoint. If you examine previous tasks, you might find values in the 50 - 100 range, hence many more checkpointing opportunities with those. So checkpoints only at 45% and 90% are quite normal for the current tasks.

The 'toplist' is a list of the top candidates (in this case 10) found in the main analysis. Each of these is 'recalculated' and a checkpoint is written after each one. That's why you are seeing 10 checkpoints during the 90-100% stage.
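
The explanation above can be sketched as a toy model in Python. This is only an illustration under stated assumptions: one checkpoint per skypoint, evenly spaced across a main stage that takes 90% of the progress bar, plus one checkpoint per toplist candidate. The function names and the even spacing are hypothetical; the real application may behave differently in detail.

```python
def main_stage_checkpoints(num_sky_points, main_share=0.9):
    """Progress fractions at which a main-stage checkpoint could be written.

    Assumes one checkpoint per finished skypoint and equal work per skypoint.
    """
    return [main_share * (i + 1) / num_sky_points
            for i in range(num_sky_points)]


def total_checkpoints(num_sky_points, toplist):
    """Main-stage checkpoints plus one per re-examined toplist candidate."""
    return num_sky_points + toplist


def longest_gap_hours(num_sky_points, runtime_hours, main_share=0.9):
    """Approximate time between two main-stage checkpoints."""
    return runtime_hours * main_share / num_sky_points


# Values taken from the command line quoted in this thread.
print(main_stage_checkpoints(2))        # two opportunities only
print(total_checkpoints(2, 10))
print(longest_gap_hours(2, 7.5))        # for a task of about 7.5 hours
print(longest_gap_hours(50, 7.5))       # same runtime with 50 skypoints
```

With --numSkyPoints 2 and --toplist 10 the model gives 12 checkpoints in total and a gap of well over 3 hours for a 7.5-hour task, which is in line with the 12 checkpoints and the roughly 3.3-hour first checkpoint reported in the opening post.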

Cheers,
Gary.

San-Fernando-Valley

Joined: 16 Mar 16

Posts: 565

Credit: 10952312812

RAC: 14202631

+1

6 Jan 2025 7:23:45 UTC

Message 231749 in response to message 231471

San-Fernando-Valley

Joined: 16 Mar 16

Posts: 565

Credit: 10952312812

RAC: 14202631

-1

6 Jan 2025 7:24:00 UTC

Message 231750

-1

Scrooge McDuck

Joined: 2 May 07

Posts: 1142

Credit: 18939348

RAC: 12680

8 Jan 2025 17:07:57 UTC

Message 231894

Mad_Max wrote:

What the ~~problematic~~ WU names look like:

LATeah2113F_56.0_1208_-1.6e-11_0
LATeah2113F_72.0_598_-8.499999999999999e-11_0
LATeah2113F_72.0_598_-9.299999999999996e-11_0
LATeah2113F_72.0_56_-8.4e-11_2
LATeah2113F_72.0_762_-2.2999999999999998e-11_0
LATeah2113F_72.0_782_-4.000000000000003e-11_1
LATeah2113F_72.0_782_-4.400000000000004e-11_1
LATeah2113F_72.0_1078_-8.999999999999997e-11_0

Workunits with few skypoints (e.g. just two), which therefore checkpoint only a few times, typically feature a number < 100 immediately after the workunit's name prefix "LATeahNNNNF" (the raw data file name):

LATeah2113F_NN.*

...just as the suspected "anomalous" examples above.
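
As a rough sketch of that rule of thumb in Python: the helper below splits a task or file name on underscores and looks only at the number right after the "LATeahNNNNF" prefix. The function names are made up for this example, the threshold of 100 is just the value mentioned above, and the meaning of the remaining fields is deliberately left uninterpreted.

```python
def leading_number(name):
    """Return the number that follows the 'LATeahNNNNF' prefix in a name."""
    parts = name.split("_")
    if len(parts) < 4 or not parts[0].startswith("LATeah"):
        raise ValueError(f"unexpected task name: {name!r}")
    return float(parts[1])


def likely_few_checkpoints(name, threshold=100.0):
    """Heuristic from this thread: a small leading number hints at few skypoints."""
    return leading_number(name) < threshold


names = [
    "LATeah2112F_1384.0_207988_0.0_0",
    "LATeah2113F_72.0_598_-8.499999999999999e-11_0",
    "LATeah2113F_88.0_2970_-9.099999999999997e-11_0_1",
]
for name in names:
    flag = "few checkpoints likely" if likely_few_checkpoints(name) else "regular"
    print(f"{name}: {flag}")
```

This is only a naming heuristic; the reliable check is still the --numSkyPoints value on the command line in stderr.txt.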

San-Fernando-Valley

Joined: 16 Mar 16

Posts: 565

Credit: 10952312812

RAC: 14202631

8 Jan 2025 18:35:46 UTC

Message 231898 in response to message 231894

Scrooge McDuck wrote:

Mad_Max wrote:
What the ~~problematic~~ WU names look like:
LATeah2113F_56.0_1208_-1.6e-11_0
LATeah2113F_72.0_598_-8.499999999999999e-11_0
LATeah2113F_72.0_598_-9.299999999999996e-11_0
LATeah2113F_72.0_56_-8.4e-11_2
LATeah2113F_72.0_762_-2.2999999999999998e-11_0
LATeah2113F_72.0_782_-4.000000000000003e-11_1
LATeah2113F_72.0_782_-4.400000000000004e-11_1
LATeah2113F_72.0_1078_-8.999999999999997e-11_0
Workunits with few skypoints (e.g. just two), which therefore checkpoint only a few times, typically feature a number < 100 immediately after the workunit's name prefix "LATeahNNNNF" (the raw data file name):
LATeah2113F_NN.*
...just as the suspected "anomalous" examples above.

I got the impression that mad_max doesn't mean the NNs, but the part after them ...

cheers

sfv

Scrooge McDuck

Joined: 2 May 07

Posts: 1142

Credit: 18939348

RAC: 12680

10 Jan 2025 11:54:27 UTC

Message 231952 in response to message 231898

San-Fernando-Valley wrote:

I got the impression that mad_max doesn't mean the NNs, but the part after them ...

Yes, sure. But mad_max also suspected the rounding anomalies in filenames have something to do with the observed behaviour of these WUs.

I don't know whether these 'anomalies' can only be observed in WUs which seldom checkpoint, or in others too. It's not relevant... the numbers are not used for calculations, as Gary wrote.

My point was: these WUs can be easily identified by the small numbers (NN) in their WU name, without the need to check logfiles or the WU's XML configuration. Gary already explained the true reason for the few checkpoints, using the logfile and the numSkyPoints command-line parameter.

Mad_Max

Joined: 2 Jan 10

Posts: 165

Credit: 2268702088

RAC: 672667

15 Jan 2025 17:34:26 UTC

Message 232110 in response to message 231471

Gary Roberts wrote:

Values listed in the task name are not used in the calculations being performed. They are probably just there to remind the researchers of the parameter space being probed. So there is nothing "problematic" with an unusual name.

Are you sure about this?

Yes, of course, the calculation data is NOT taken directly from the file names. But as you correctly noted, these numbers in the names of tasks and files reflect the search parameters that are USED in the actual calculations. Simply put, as far as I understand it, the same variables serve as a data source both for the real scientific calculations and for naming tasks/files. And although this can't cause any problems in the names, it could lead to errors or inaccuracies in the calculations.

It may well cause no problems at all and everything may be fine with such tasks; I just do not know enough about the scientific side to draw any conclusions. With this topic I only wanted to draw the attention of one of the project's scientists or programmers, so that they could check, when there is time, whether such "anomalous" values cause problems in the useful scientific calculations. The file names themselves do not require correction; they are just a visual indication by which you can distinguish batches of similar WUs.

Gary Roberts wrote:

Unfortunately, what you are listing are things that are not problems but simply unavoidable characteristics.

A checkpoint can only be written when a particular set of calculations has been completed. In the parameters you listed in the stderr.txt snip, there were two key values: --numSkyPoints 2 and --toplist 10. This tells you that for the main calculation stage (the first 90%), there will only be 2 'skypoints' and so only two opportunities to write a checkpoint. If you examine previous tasks, you might find values in the 50 - 100 range, hence many more checkpointing opportunities with those. So checkpoints only at 45% and 90% are quite normal for the current tasks.

Hmm, it looks like you're right about checkpoints. I know how checkpoints work in BOINC, but I didn't notice the fact that there are only 2 suitable points for saving them in these particular WUs. Thanks for bringing this fact to my attention.

But it's strange if all these WUs contain only two "skypoints" per WU, while the rest of the "regular" ones contain at least a few dozen such skypoints. Then shouldn't they be much (many times) shorter/faster to calculate? Just because there is much less data in them that needs to be processed?
But I don't see any significant difference in the duration of the calculation of such WUs on my computers: they take about the same time (+/- a few %) as tasks containing dozens of "skypoints".

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5888

Credit: 119757200069

RAC: 25688844

16 Jan 2025 0:37:41 UTC

Message 232126 in response to message 232110

Mad_Max wrote:

Are you sure about this?

I'm just an ordinary volunteer like yourself. I'm not in communication with the scientists who design the task parameters but I'm quite sure that if there were any mistakes, the tasks would have been withdrawn a long time ago. The scientists would be post-processing what is returned and I'm pretty sure they would have noticed by now :-).

Mad_Max wrote:

But it's strange if all these WUs contain only two "skypoints" per WU, while the rest of the "regular" ones contain at least a few dozen such skypoints. Then shouldn't they be much (many times) shorter/faster to calculate?

It's not strange because it has happened before. I'm not running CPU tasks at the moment (summer heat issues) but I've run them over the years and seen this same situation before. Sometimes (but not always) there has been a change in runtime. Often, the change is reasonably small.

Mad_Max wrote:

Just because there is much less data in them that needs to be processed?

How do you know that? Why couldn't it just be a more intensive (and time consuming) analysis of two particular 'points' in the sky? The amount of 'data' in the LATeahnnnn large data files probably gets adjusted to keep the overall run times relatively constant.

Cheers,
Gary.
