A strange behaviour: after reaching ~25% within 6 min or so, the progress bar makes a major step backwards to 2.4%.
If you are using the recommended BOINC v7.2.42, or a later Beta build, BOINC will estimate and display a 'pseudo-progress' percentage while waiting for the first actual checkpoint and progress report from the science application.
If the first checkpoint is made quickly, or if the overall runtime estimate for the task is reasonably accurate, then the transition from pseudo-progress to real progress is almost invisible.
But if the estimate for the whole task is seriously wrong, pseudo- and real progress have time to diverge before the real figure is available, and a large correction becomes necessary.
Having a pseudo-progress display avoids the progress bar displaying 0.000% for extended periods, which tends to make users nervous.
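The numbers in the report above can be checked with a little arithmetic. The sketch below is only an illustration: it assumes pseudo-progress is simply elapsed time divided by the estimated runtime, which is a simplification and not necessarily the formula the BOINC client uses. The function names are made up for this example.

```python
def pseudo_progress(elapsed_s: float, estimated_total_s: float) -> float:
    """Assumed linear model: fraction of the runtime estimate already used up."""
    return min(elapsed_s / estimated_total_s, 1.0)


def implied_total_s(elapsed_s: float, fraction_done: float) -> float:
    """Total runtime implied by a progress fraction seen after elapsed_s seconds."""
    return elapsed_s / fraction_done


elapsed = 6 * 60  # "6 min or so"

# Runtime estimate that would produce ~25% pseudo-progress after 6 minutes.
estimate_s = implied_total_s(elapsed, 0.25)

# Runtime implied by the first real figure, 2.4% after the same 6 minutes.
real_total_s = implied_total_s(elapsed, 0.024)

print(f"estimate behind the 25% figure: {estimate_s / 60:.0f} min")
print(f"runtime implied by 2.4%: {real_total_s / 3600:.1f} h")
```

Under that assumption, the 25% figure corresponds to an estimate of about 24 minutes, while 2.4% of real progress after 6 minutes points to a task of roughly 4 hours. The size of the jump is just the ratio between the two.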
A great explanation!
But can a big change in the estimated runtime also be handled by the BM (every version)? The initial estimated runtime was ~20 to 25 min, then jumped to > 4 hrs. Maybe this is why they time out on some machines.
My first 2 WUs are at 73% now with a runtime of ~3 hrs, reporting a remaining time of ~1 hr.
I'm using BM 7.4.12
By BM, I assume you mean BOINC Manager. As the term 'Manager' implies, that's the command-and-control module for BOINC, and doesn't do any actual work - your question would be better directed to the BOINC client.
And yes, the BOINC ecosystem as a whole - client and server - can handle a big change like this.
If both the server and the client are to a recent (2010 or later) specification, the adjustment is handled on the server, using tools like CreditNew and RuntimeEstimation.
If either (or both) of the server and client pre-date 2010 - as the server here at Einstein does - then both components drop back to the older 'Duration Correction Factor' mechanism (no longer documented, since the demise of the Unofficial BOINC Wiki).
Unfortunately, catch-22 applies in both cases. Neither CN/RE nor DCF updates its estimates until a task has successfully completed: in the case of CN/RE, 11 tasks have to complete and validate; in the case of DCF, a single completed task is sufficient. But if BOINC aborts the tasks for 'Maximum elapsed time exceeded' before successful completion...
Hence the references in this thread to 'inoculation' - modifying to bypass the infinite-loop safety-valve, and allowing completion so that estimate-modification can proceed. These are the sort of issues we were grappling with at Albert before attention switched to the new web design.
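The catch-22 can be made concrete with a toy model. Everything in this sketch is hypothetical: the function, the `abort_multiplier` parameter and the values passed to it are illustrations, not BOINC's actual code or settings. It assumes the elapsed-time limit is fixed relative to the original estimate when the workunit is generated, and that the correction only changes when a task finishes.

```python
def run_batch(estimate_s, true_runtime_s, abort_multiplier, n_tasks):
    """Toy model: return (completed, aborted, correction) for identical tasks."""
    correction = 1.0
    completed = aborted = 0
    for _ in range(n_tasks):
        # Assumption: the limit is set with the workunit and never rescaled.
        limit_s = estimate_s * abort_multiplier
        if true_runtime_s > limit_s:
            aborted += 1  # 'Maximum elapsed time exceeded': nothing is learned
            continue
        completed += 1
        # Only a finished task feeds back into the estimate.
        correction = true_runtime_s / estimate_s
    return completed, aborted, correction


# Estimate of 24 min, real runtime of 250 min, hypothetical limit of 10x the estimate.
print(run_batch(24 * 60, 250 * 60, abort_multiplier=10, n_tasks=5))

# Same tasks with the limit relaxed (the 'inoculation'): tasks finish, estimates adapt.
print(run_batch(24 * 60, 250 * 60, abort_multiplier=20, n_tasks=5))
```

In the first call every task is aborted and the correction never moves from 1.0; in the second the tasks complete and the correction settles at the true ratio.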
... workunits generated today shouldn't exhibit this error anymore.
I guess it will depend on how long it takes to clear out previously generated workunits.
I've recently got a couple more tasks on a host just set up for beta work and they turned out to be 'short ends' estimated at a couple of mins that ended up taking around 52 mins. Looks like we're still working on 'old' tasks :-).
The host had previously been doing FGRP3 (run time around 12-13 hours) and those left in the cache are now estimated at 270 hours and the machine is in panic mode. I'll edit the state file and try again in a few more hours to see if the old tasks are gone.
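As a rough check on those figures, here is the scaling they imply, assuming the inflated FGRP3 estimates come from a single multiplicative correction (as DCF would apply). The 12.5 hour midpoint and the reading of 'a couple of mins' as 2 minutes are assumptions made for this example.

```python
def implied_factor(old_estimate_h: float, new_estimate_h: float) -> float:
    """Multiplier that turns the old estimate into the new one."""
    return new_estimate_h / old_estimate_h


# FGRP3 tasks: normally 12-13 h (12.5 h assumed), now estimated at 270 h.
fgrp3_factor = implied_factor(12.5, 270.0)

# FGRP4 'short end': estimated at ~2 min (assumed), actually took ~52 min.
short_end_factor = implied_factor(2.0, 52.0)

print(f"factor applied to FGRP3 estimates: {fgrp3_factor:.1f}")
print(f"overrun of the FGRP4 short end: {short_end_factor:.1f}")
```

The two factors are of the same order, which fits the idea that one badly underestimated task inflated the estimates for everything else in the cache.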
My fgrp4_v1.02 tasks all fail after ~8 sec with
7.4.12
(unknown error) - exit code -1073741680 (0xc0000090)
19:34:50 (7116): [normal]: This Einstein@home App was built at: Aug 21 2014 14:21:42
19:34:50 (7116): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FGRP4_1.02_windows_intelx86__FGRP4-Beta.exe'.
19:34:50 (7116): [debug]: 0 fp, 0 fp/s, -1 s
command line: projects/einstein.phys.uwm.edu/hsgamma_FGRP4_1.02_windows_intelx86__FGRP4-Beta.exe --inputfile ../../projects/einstein.phys.uwm.edu/fgrp4_test.dat --outputfile results.cand.out --alpha 0.961677206 --delta 0.724528894 --skyRadius 4.363323e-04 --ldiBins 30 --f0start 0 --f0Band 16 --firstSkyPoint 0 --numSkyPoints 3 --f1dot -2.49e-10 --f1dotBand 1e-12 --df1dot 5.62738096e-15 --ephemdir ..\..\projects\einstein.phys.uwm.edu\JPLEPH --Tcoh 2097152.0 --toplist 5 --cohFollow 5 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.15 --reftime 55716 --debug 1
output files: 'results.cand.out' '../../projects/einstein.phys.uwm.edu/fgrp4_test_16.0_0_-2.48e-10_2_0' 'results.cand.out.cohfu' '../../projects/einstein.phys.uwm.edu/fgrp4_test_16.0_0_-2.48e-10_2_1'
19:34:50 (7116): [debug]: Flags: i386 SSE GNUC X86 GNUX86
-- signal handler called: signal 8
Win7-64, Intel i7, 8 GB RAM, BM 7.4.12
Edit: they still seem to be enabled
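For anyone wondering how the negative exit code in the log above relates to the hex value next to it: the two are the same 32-bit pattern, read once as a signed and once as an unsigned number. A minimal sketch (the helper name is made up for this example):

```python
def to_ntstatus(exit_code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return exit_code & 0xFFFFFFFF


code = to_ntstatus(-1073741680)
print(hex(code))  # 0xc0000090
```

As far as I know, Microsoft's NTSTATUS table lists 0xC0000090 as STATUS_FLOAT_INVALID_OPERATION, a floating-point exception, which would fit the 'signal 8' line at the end of the log.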
Looks like the app version 1.02 caused a division-by-zero error pretty early.
Deprecated for now, re-issuing 1.01, I'll fix this problem tomorrow.
BM
The 1.03 version passed the 8 sec marker ...
A strange behaviour: after reaching ~25% within 6 min or so, the progress bar makes a major step backwards to 2.4%.
Quite understandable if you think of the behaviour of FGRP3. You also need to remember that the time estimates for FGRP4 are way too short.
Cheers,
Gary.
Not sure that really helped; I have 3 tasks that ended in an error after 10 hours:
7.2.42
Maximum elapsed time exceeded
23:10:54 (15808): [normal]: This Einstein@home App was built at: Aug 20 2014 06:16:19
23:10:54 (15808): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FGRP4_1.01_windows_intelx86__FGRP4-Beta.exe'.
command line: projects/einstein.phys.uwm.edu/hsgamma_FGRP4_1.01_windows_intelx86__FGRP4-Beta.exe --inputfile ../../projects/einstein.phys.uwm.edu/fgrp4_test.dat --outputfile results.cand.out --alpha 0.961677206 --delta 0.724528894 --skyRadius 4.363323e-04 --ldiBins 30 --f0start 16 --f0Band 32 --firstSkyPoint 0 --numSkyPoints 29 --f1dot -4.15e-10 --f1dotBand 1e-12 --df1dot 5.62738096e-15 --ephemdir ..\..\projects\einstein.phys.uwm.edu\JPLEPH --Tcoh 2097152.0 --toplist 5 --cohFollow 5 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.15 --reftime 55716 --debug 1
output files: 'results.cand.out' '../../projects/einstein.phys.uwm.edu/fgrp4_test_48.0_0_-4.14e-10_2_0' 'results.cand.out.cohfu' '../../projects/einstein.phys.uwm.edu/fgrp4_test_48.0_0_-4.14e-10_2_1'
23:10:54 (15808): [debug]: Flags: i386 SSE GNUC X86 GNUX86
23:10:55 (15808): [debug]: Set up communication with graphics process.
% Opening inputfile: ../../projects/einstein.phys.uwm.edu/fgrp4_test.dat
% Total amount of photon times: 12104
% Preparing toplist of length: 5
read_checkpoint(): Couldn't open file 'results.cand.out.cpt': No such file or directory (2)
% fft_size: 67108864 (0x4000000)
% Sky point 1/29
% Creating FFT plan.
% Starting semicoherent search over f0 and f1.
% nf1dots: 179 df1dot: 5.62738096e-015 f1dot_start: -4.15e-010 f1dot_band: 1e-012
.
.
% checkpoint 27
% Sky point 28/29
% Starting semicoherent search over f0 and f1.
% nf1dots: 179 df1dot: 5.62738096e-015 f1dot_start: -4.15e-010 f1dot_band: 1e-012
.
.
.
.
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x75633226
Engaging BOINC Windows Runtime Debugger...
********************
BOINC Windows Runtime Debugger Version 7.3.0
Dump Timestamp : 08/22/14 13:00:07
Install Directory :
Data Directory : C:\ProgramData\BOINC
Project Symstore :
LoadLibraryA( C:\ProgramData\BOINC\dbghelp.dll ): GetLastError = 126
Loaded Library : dbghelp.dll
LoadLibraryA( C:\ProgramData\BOINC\symsrv.dll ): GetLastError = 126
LoadLibraryA( symsrv.dll ): GetLastError = 126
LoadLibraryA( C:\ProgramData\BOINC\srcsrv.dll ): GetLastError = 126
LoadLibraryA( srcsrv.dll ): GetLastError = 126
LoadLibraryA( C:\ProgramData\BOINC\version.dll ): GetLastError = 126
Loaded Library : version.dll
Debugger Engine : 4.0.5.0
Symbol Search Path: C:\ProgramData\BOINC\slots\3;C:\ProgramData\BOINC\projects\einstein.phys.uwm.edu
ModLoad: 0000000000400000 0000000000cb2000 C:\ProgramData\BOINC\projects\einstein.phys.uwm.edu\hsgamma_FGRP4_1.01_windows_intelx86__FGRP4-Beta.exe (-nosymbols- Symbols Loaded)
This is actually an error in the generated workunits; it is completely independent of the application version.
We are aware of it, though, and workunits generated today shouldn't exhibit this error anymore.
BM