Strange WU names and checkpoint issues in the latest FGRP5 batch

Mad_Max
Joined: 2 Jan 10
Posts: 161
Credit: 2236599481
RAC: 647437
Topic 231888

In one of the latest WU batches for the FGRP5 project, there are strange problems with the naming of tasks and data files.

Normal task names (and the corresponding file names) look like this:

LATeah2112F_1384.0_207988_0.0_0
LATeah2112F_1384.0_242638_0.0_0
LATeah2112F_1384.0_608542_0.0_0
LATeah2112F_1400.0_41162_0.0_0
LATeah2112F_1400.0_41184_0.0_0
LATeah2112F_1400.0_41206_0.0_0

Problematic WU names look like this:

LATeah2113F_56.0_1208_-1.6e-11_0
LATeah2113F_72.0_598_-8.499999999999999e-11_0
LATeah2113F_72.0_598_-9.299999999999996e-11_0
LATeah2113F_72.0_56_-8.4e-11_2
LATeah2113F_72.0_762_-2.2999999999999998e-11_0
LATeah2113F_72.0_782_-4.000000000000003e-11_1
LATeah2113F_72.0_782_-4.400000000000004e-11_1
LATeah2113F_72.0_1078_-8.999999999999997e-11_0

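Looks like a rounding issue: some of the zeros are replaced by extremely small "almost zero" values such as -8.999999999999997e-11 (that is about -0.00000000009). The 16 significant digits suggest a 64-bit double printed at full precision rather than an FP32 variable, which holds only about 7 digits.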
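The rounding hypothesis can be checked against the names listed above. This is only a minimal sketch, not project code: `parse_task_name` and `grid_steps` are hypothetical helpers, and both the meaning of the five fields and the 1e-12 spacing between values are assumptions inferred from the names in this post.

```python
def parse_task_name(name):
    """Split a five-field task name into typed parts (field meanings are assumed)."""
    dataset, freq, sky_point, value, index = name.split("_")
    return dataset, float(freq), int(sky_point), float(value), int(index)


def grid_steps(value, step=1e-12):
    """Return the nearest whole number of `step` units (step size is an assumption)."""
    return round(value / step)


name = "LATeah2113F_72.0_1078_-8.999999999999997e-11_0"
dataset, freq, sky_point, value, index = parse_task_name(name)

# The parsed double is close to, but not exactly, the round value -9e-11.
print(value == -9e-11)      # False
print(grid_steps(value))    # -90
# Formatting with a few significant digits gives a tidy label.
print(f"{value:.3g}")       # -9e-11
```

If the names are built from a value that accumulates tiny floating-point errors, formatting it with a fixed number of significant digits before building the name would hide the noise. Whether the project's generator works that way is not known from this thread.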

The same applies to the names of some input and output files for such WUs. I see file names like 'LATeah2113F_88.0_2970_-9.099999999999997e-11_0_1' quite often.

This could be just a cosmetic display defect. But I found that many such tasks (probably all, though I did not check every one because there are too many of them) also have other, more significant problems:

1 - the checkpoints in such WU batches are partially broken: they are written at only two points, 45% and ~89.95% (though multiple times near 89%). Depending on CPU speed, this can mean many hours between checkpoints and a significant loss of completed work if the computer, BOINC, or the task is restarted, or if BOINC switches between projects frequently (when the user has not enabled the option to leave suspended tasks in memory).

2 - the reporting of calculation progress to the BOINC client is similarly disrupted (visible if BOINC's task progress interpolation is disabled by the user via the <fraction_done_exact/> flag, or if we check boinc_task_state.xml in the working slot directory). WU progress jumps 0% - 45% - 89.95% - 100% without any intermediate values.

Examples of stderr.txt and boinc_task_state.xml files for a task that has been running ~6 hours on a Ryzen 2700 (about 75-80% done, as one FGRP5 WU usually takes about 7.5-8 hours to finish on this machine):

01:20:19 (1580): [normal]: This Einstein@home App was built at: Jul 26 2017 09:32:43
01:20:19 (1580): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FGRP5_1.08_windows_intelx86__FGRPSSE.exe'.
01:20:19 (1580): [debug]: 2.1e+015 fp, 4.2e+009 fp/s, 500823 s, 139h07m02s87
command line: projects/einstein.phys.uwm.edu/hsgamma_FGRP5_1.08_windows_intelx86__FGRPSSE.exe --inputfile ../../projects/einstein.phys.uwm.edu/LATeah2113F.dat --alpha 3.5340718238 --delta -1.0671047766 --skyRadius 0.0008901179185 --ldiBins 15 --f0start 72.0 --f0Band 16 --firstSkyPoint 3014 --numSkyPoints 2 --f1dot -5.700000000000008e-11 --f1dotBand 1e-12 --df1dot 1.004320633e-15 --ephemdir ..\..\projects\einstein.phys.uwm.edu\JPLEPH --Tcoh 4194304.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.15 --reftime 57569.0 --f0orbit 0.005 --freeRadiusFactor 2 --mismatch 0.15 --debug 0 -o LATeah2113F_88.0_3014_-5.600000000000008e-11_0_0.out
output files: 'LATeah2113F_88.0_3014_-5.600000000000008e-11_0_0.out' '../../projects/einstein.phys.uwm.edu/LATeah2113F_88.0_3014_-5.600000000000008e-11_0_0' 'LATeah2113F_88.0_3014_-5.600000000000008e-11_0_0.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah2113F_88.0_3014_-5.600000000000008e-11_0_1'
01:20:19 (1580): [debug]: Flags: i386 SSE GNUC X86 GNUX86
01:20:19 (1580): [debug]: Set up communication with graphics process.
read_checkpoint(): Couldn't open file 'LATeah2113F_88.0_3014_-5.600000000000008e-11_0_0.out.cpt': No such file or directory (2)
INFO: Major Windows version: 6
% C 1 0

====================

<active_task>
    <project_master_url>https://einstein.phys.uwm.edu/</project_master_url>
    <result_name>LATeah2113F_88.0_3014_-5.600000000000008e-11_0</result_name>
    <checkpoint_cpu_time>11602.730000</checkpoint_cpu_time>
    <checkpoint_elapsed_time>11837.349976</checkpoint_elapsed_time>
    <fraction_done>0.450000</fraction_done>
    <peak_working_set_size>685195264</peak_working_set_size>
    <peak_swap_size>681181184</peak_swap_size>
    <peak_disk_usage>13367</peak_disk_usage>
</active_task>

As you can see, only one checkpoint was saved, at 45%, about 3.2 hours of CPU time after the task started, and none in the ~2.5 hours since. But near 89% a task can write up to 10-15 checkpoints in a row.
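For anyone who wants to check their own slots, here is a minimal sketch that pulls the same numbers out of a boinc_task_state.xml. It uses only the fields shown in the sample above; `last_checkpoint` is a hypothetical helper name, and the embedded XML is a shortened copy of that sample.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the boinc_task_state.xml sample from this post.
TASK_STATE = """<active_task>
    <project_master_url>https://einstein.phys.uwm.edu/</project_master_url>
    <result_name>LATeah2113F_88.0_3014_-5.600000000000008e-11_0</result_name>
    <checkpoint_cpu_time>11602.730000</checkpoint_cpu_time>
    <checkpoint_elapsed_time>11837.349976</checkpoint_elapsed_time>
    <fraction_done>0.450000</fraction_done>
</active_task>"""


def last_checkpoint(xml_text):
    """Return the task name, progress, and elapsed hours at the last checkpoint."""
    root = ET.fromstring(xml_text)
    return {
        "result_name": root.findtext("result_name"),
        "fraction_done": float(root.findtext("fraction_done")),
        "checkpoint_elapsed_hours": float(root.findtext("checkpoint_elapsed_time")) / 3600,
    }


info = last_checkpoint(TASK_STATE)
print(f"{info['result_name']}: {info['fraction_done']:.0%} done at last checkpoint, "
      f"{info['checkpoint_elapsed_hours']:.2f} h after start")
```

To read a real file instead of the embedded string, `ET.parse(path).getroot()` can replace `ET.fromstring(xml_text)`; the path to the slot directory depends on the local BOINC installation.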

Link to an example of such a "bad" WU in the database: https://einsteinathome.org/task/1704029984

And a normal WU for comparison: https://einsteinathome.org/task/1703391437

Note the difference in checkpoints: 12 checkpoints in 2 blocks (1 at 45% and 11 at ~89%) versus 31 checkpoints in 22 blocks written at approximately regular intervals, apart from a run of probably excessive checkpoints written in a row at ~89% as well.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5877
Credit: 118566730928
RAC: 21921389

Mad_Max wrote:
... there are strange problems with naming tasks and data files:

Values listed in the task name are not used in the calculations being performed.  They are probably just there to remind the researchers of the parameter space being probed.  So there is nothing "problematic" with an unusual name.

Mad_Max wrote:
.... such tasks also have other more significant problems:

Unfortunately, what you are listing are things that are not problems but simply unavoidable characteristics.

A checkpoint can only be written when a particular set of calculations has been completed.  In the parameters you listed in the stderr.txt snippet, there are two key values: --numSkyPoints 2 and --toplist 10.  The first tells you that for the main calculation stage, which covers the first 90% of progress, there will only be 2 'skypoints' and so only two opportunities to write a checkpoint.  If you examine previous tasks, you might find values in the 50 - 100 range, hence many more checkpointing opportunities with those.  So checkpoints only at 45% and 90% are quite normal for the current tasks.

The 'toplist' is a list of the top candidates (in this case 10) found in the main analysis.  Each of these is 'recalculated' and a checkpoint is written after each one.  That's why you are seeing 10 checkpoints during the 90-100% stage.
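The arithmetic behind this explanation can be sketched as follows. This is an idealized model of the description above, not the application's actual logic: `option_value` and `checkpoint_model` are hypothetical helpers, the 90% share of the main stage is taken from this post, and `CMD_EXCERPT` is a shortened excerpt of the command line in the quoted stderr.txt.

```python
# Shortened excerpt of the command line from the stderr.txt quoted in this thread.
CMD_EXCERPT = "--firstSkyPoint 3014 --numSkyPoints 2 --Tcoh 4194304.0 --toplist 10 --cohFollow 10"


def option_value(command_line, flag):
    """Return the token that follows `flag` in a whitespace-separated command line."""
    tokens = command_line.split()
    return tokens[tokens.index(flag) + 1]


def checkpoint_model(num_sky_points, toplist, main_share=0.90):
    """Model: one checkpoint per sky point in the main stage, one per toplist candidate."""
    main_fractions = [main_share * (i + 1) / num_sky_points
                      for i in range(num_sky_points)]
    total_checkpoints = num_sky_points + toplist
    return main_fractions, total_checkpoints


num_sky_points = int(option_value(CMD_EXCERPT, "--numSkyPoints"))
toplist = int(option_value(CMD_EXCERPT, "--toplist"))
main_fractions, total = checkpoint_model(num_sky_points, toplist)

print([f"{f:.0%}" for f in main_fractions])   # ['45%', '90%']
print(total)                                  # 12
```

With 2 sky points and a toplist of 10 the model gives 12 checkpoints, which matches the 12 counted for the linked task. Under the same model, a task with 50 sky points would have a checkpoint opportunity every 1.8% of progress.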

Cheers,
Gary.
