Fermi LAT Gamma-ray pulsar search #4 "FGRP4"

Alex
Joined: 1 Mar 05
Posts: 451
Credit: 500394558
RAC: 38624

My fgrp4_v1.02 all fail after ~ 8 sec with

7.4.12

(unknown error) - exit code -1073741680 (0xc0000090)

19:34:50 (7116): [normal]: This Einstein@home App was built at: Aug 21 2014 14:21:42

19:34:50 (7116): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FGRP4_1.02_windows_intelx86__FGRP4-Beta.exe'.
19:34:50 (7116): [debug]: 0 fp, 0 fp/s, -1 scommand line: projects/einstein.phys.uwm.edu/hsgamma_FGRP4_1.02_windows_intelx86__FGRP4-Beta.exe --inputfile ../../projects/einstein.phys.uwm.edu/fgrp4_test.dat --outputfile results.cand.out --alpha 0.961677206 --delta 0.724528894 --skyRadius 4.363323e-04 --ldiBins 30 --f0start 0 --f0Band 16 --firstSkyPoint 0 --numSkyPoints 3 --f1dot -2.49e-10 --f1dotBand 1e-12 --df1dot 5.62738096e-15 --ephemdir ..\..\projects\einstein.phys.uwm.edu\JPLEPH --Tcoh 2097152.0 --toplist 5 --cohFollow 5 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.15 --reftime 55716 --debug 1
output files: 'results.cand.out' '../../projects/einstein.phys.uwm.edu/fgrp4_test_16.0_0_-2.48e-10_2_0' 'results.cand.out.cohfu' '../../projects/einstein.phys.uwm.edu/fgrp4_test_16.0_0_-2.48e-10_2_1'
19:34:50 (7116): [debug]: Flags: i386 SSE GNUC X86 GNUX86

-- signal handler called: signal 8

Win7-64, intel i7 8GB ram, BM 7.4.12

Edit: they still seem to be enabled.
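
The decimal exit code and the hex value in the error line above are the same 32-bit number, read once as signed and once as unsigned. A minimal Python sketch of the conversion follows; the function name and the small lookup table of Windows NTSTATUS values are illustrative additions, not part of the BOINC output:

```python
# Windows reports NTSTATUS values as unsigned 32-bit numbers, while the
# BOINC log prints the same bits as a signed decimal.
def to_unsigned32(exit_code):
    """Reinterpret a signed 32-bit exit code as unsigned."""
    return exit_code & 0xFFFFFFFF

# A few arithmetic-related NTSTATUS values (illustrative subset only).
NTSTATUS_NAMES = {
    0xC000008E: "STATUS_FLOAT_DIVIDE_BY_ZERO",
    0xC0000090: "STATUS_FLOAT_INVALID_OPERATION",
    0xC0000094: "STATUS_INTEGER_DIVIDE_BY_ZERO",
}

code = to_unsigned32(-1073741680)
print(hex(code), NTSTATUS_NAMES.get(code, "unknown"))
# prints: 0xc0000090 STATUS_FLOAT_INVALID_OPERATION
```

Read this way, the exit code is a floating-point exception status, which fits the "signal 8" (floating-point exception) reported by the app's signal handler.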

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4273
Credit: 245182851
RAC: 13930

Looks like the app version 1.02 caused a division-by-zero error pretty early.

Deprecated for now, re-issuing 1.01, I'll fix this problem tomorrow.

BM

Alex
Joined: 1 Mar 05
Posts: 451
Credit: 500394558
RAC: 38624

The 1.03 version passed the 8 sec marker ...

A strange behaviour: after reaching ~25% within 6 min or so, the progress bar makes a major step backwards to 2.4%.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5845
Credit: 109874772673
RAC: 30513841

Quite understandable if you think of the behaviour of FGRP3. You also need to remember that the time estimates for FGRP4 are way too short.

Cheers,
Gary.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2140
Credit: 2768122104
RAC: 990846

Quote:

The 1.03 version passed the 8 sec marker ...

A strange behaviour: after reaching ~25% within 6 min or so, the progress bar makes a major step backwards to 2.4%.


If you are using the recommended BOINC v7.2.42, or a later Beta build, BOINC will estimate and display a 'pseudo-progress' percentage while waiting for the first actual checkpoint and progress report from the science application.

If the first checkpoint is made quickly, or if the overall runtime estimate for the task is reasonably accurate, then the transition from pseudo-progress to real progress is almost invisible.

But if the estimate for the whole task is seriously wrong, pseudo- and real progress have time to diverge before the real figure is available, and a large correction becomes necessary.

Having a pseudo-progress display avoids the progress bar displaying 0.000% for extended periods, which tends to make users nervous.
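
The size of the correction can be illustrated with the numbers reported in this thread. This is a minimal sketch that assumes pseudo-progress is simply elapsed time divided by the estimated total runtime; the formula the BOINC client actually uses is not given here and may differ. The function name and the 24-minute and 250-minute figures are illustrative choices, picked to be consistent with the "~20 to 25 min" initial estimate and the "> 4hrs" runtime mentioned elsewhere in the thread:

```python
def pseudo_progress(elapsed_min, estimated_total_min):
    """Fraction shown before real progress arrives (assumed linear model)."""
    return min(elapsed_min / estimated_total_min, 1.0)

# After 6 minutes, with a far too short 24-minute estimate:
shown_before = pseudo_progress(6, 24)

# The same 6 minutes measured against a 250-minute (just over 4 h) runtime:
shown_after = pseudo_progress(6, 250)

print(f"{shown_before:.1%} -> {shown_after:.1%}")
# prints: 25.0% -> 2.4%
```

Under these assumptions, the step back from ~25% to 2.4% is what a roughly tenfold underestimate of the runtime would produce.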

tgoti
Joined: 12 Aug 10
Posts: 1
Credit: 5540257
RAC: 0

Quote:

Looks like the app version 1.02 caused a division-by-zero error pretty early.

Deprecated for now, re-issuing 1.01, I'll fix this problem tomorrow.

BM

Not sure that really helped: I have 3 tasks which ended in an error after 10 hours:

7.2.42

Maximum elapsed time exceeded

23:10:54 (15808): [normal]: This Einstein@home App was built at: Aug 20 2014 06:16:19

23:10:54 (15808): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FGRP4_1.01_windows_intelx86__FGRP4-Beta.exe'.
command line: projects/einstein.phys.uwm.edu/hsgamma_FGRP4_1.01_windows_intelx86__FGRP4-Beta.exe --inputfile ../../projects/einstein.phys.uwm.edu/fgrp4_test.dat --outputfile results.cand.out --alpha 0.961677206 --delta 0.724528894 --skyRadius 4.363323e-04 --ldiBins 30 --f0start 16 --f0Band 32 --firstSkyPoint 0 --numSkyPoints 29 --f1dot -4.15e-10 --f1dotBand 1e-12 --df1dot 5.62738096e-15 --ephemdir ..\..\projects\einstein.phys.uwm.edu\JPLEPH --Tcoh 2097152.0 --toplist 5 --cohFollow 5 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.15 --reftime 55716 --debug 1
output files: 'results.cand.out' '../../projects/einstein.phys.uwm.edu/fgrp4_test_48.0_0_-4.14e-10_2_0' 'results.cand.out.cohfu' '../../projects/einstein.phys.uwm.edu/fgrp4_test_48.0_0_-4.14e-10_2_1'
23:10:54 (15808): [debug]: Flags: i386 SSE GNUC X86 GNUX86
23:10:55 (15808): [debug]: Set up communication with graphics process.
% Opening inputfile: ../../projects/einstein.phys.uwm.edu/fgrp4_test.dat
% Total amount of photon times: 12104
% Preparing toplist of length: 5
read_checkpoint(): Couldn't open file 'results.cand.out.cpt': No such file or directory (2)
% fft_size: 67108864 (0x4000000)
% Sky point 1/29
% Creating FFT plan.
% Starting semicoherent search over f0 and f1.
% nf1dots: 179 df1dot: 5.62738096e-015 f1dot_start: -4.15e-010 f1dot_band: 1e-012
.

.
% checkpoint 27
% Sky point 28/29
% Starting semicoherent search over f0 and f1.
% nf1dots: 179 df1dot: 5.62738096e-015 f1dot_start: -4.15e-010 f1dot_band: 1e-012
.

.
.
.

Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x75633226

Engaging BOINC Windows Runtime Debugger...

********************

BOINC Windows Runtime Debugger Version 7.3.0

Dump Timestamp : 08/22/14 13:00:07
Install Directory :
Data Directory : C:\ProgramData\BOINC
Project Symstore :
LoadLibraryA( C:\ProgramData\BOINC\dbghelp.dll ): GetLastError = 126
Loaded Library : dbghelp.dll
LoadLibraryA( C:\ProgramData\BOINC\symsrv.dll ): GetLastError = 126
LoadLibraryA( symsrv.dll ): GetLastError = 126
LoadLibraryA( C:\ProgramData\BOINC\srcsrv.dll ): GetLastError = 126
LoadLibraryA( srcsrv.dll ): GetLastError = 126
LoadLibraryA( C:\ProgramData\BOINC\version.dll ): GetLastError = 126
Loaded Library : version.dll
Debugger Engine : 4.0.5.0
Symbol Search Path: C:\ProgramData\BOINC\slots\3;C:\ProgramData\BOINC\projects\einstein.phys.uwm.edu

ModLoad: 0000000000400000 0000000000cb2000 C:\ProgramData\BOINC\projects\einstein.phys.uwm.edu\hsgamma_FGRP4_1.01_windows_intelx86__FGRP4-Beta.exe (-nosymbols- Symbols Loaded)

Alex
Joined: 1 Mar 05
Posts: 451
Credit: 500394558
RAC: 38624

Quote:
Quote:

The 1.03 version passed the 8 sec marker ...

A strange behaviour: after reaching ~25% within 6 min or so, the progress bar makes a major step backwards to 2.4%.


If you are using the recommended BOINC v7.2.42, or a later Beta build, BOINC will estimate and display a 'pseudo-progress' percentage while waiting for the first actual checkpoint and progress report from the science application.

If the first checkpoint is made quickly, or if the overall runtime estimate for the task is reasonably accurate, then the transition from pseudo-progress to real progress is almost invisible.

But if the estimate for the whole task is seriously wrong, pseudo- and real progress have time to diverge before the real figure is available, and a large correction becomes necessary.

Having a pseudo-progress display avoids the progress bar displaying 0.000% for extended periods, which tends to make users nervous.

A great explanation!
But can a big change in the estimated runtime also be handled by the BM (every version)? The initial estimated runtime was ~20 to 25 min, then jumped to > 4 hrs. Maybe this is the reason why they time out on some machines.
My first 2 WUs are now at 73% with a runtime of ~3 hrs, reporting a remaining time of ~1 hr.
I'm using BM 7.4.12

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4273
Credit: 245182851
RAC: 13930

Quote:
Maximum elapsed time exceeded

This is actually an error in the generated workunits; it is completely independent of the application version.

We are aware of it, though; workunits generated today shouldn't exhibit this error anymore.

BM

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2140
Credit: 2768122104
RAC: 990846

Quote:
Quote:
Quote:

The 1.03 version passed the 8 sec marker ...

A strange behaviour: after reaching ~25% within 6 min or so, the progress bar makes a major step backwards to 2.4%.


If you are using the recommended BOINC v7.2.42, or a later Beta build, BOINC will estimate and display a 'pseudo-progress' percentage while waiting for the first actual checkpoint and progress report from the science application.

If the first checkpoint is made quickly, or if the overall runtime estimate for the task is reasonably accurate, then the transition from pseudo-progress to real progress is almost invisible.

But if the estimate for the whole task is seriously wrong, pseudo- and real progress have time to diverge before the real figure is available, and a large correction becomes necessary.

Having a pseudo-progress display avoids the progress bar displaying 0.000% for extended periods, which tends to make users nervous.


A great explanation!
But can a big change in the estimated runtime also be handled by the BM (every version)? The initial estimated runtime was ~20 to 25 min, then jumped to > 4 hrs. Maybe this is the reason why they time out on some machines.
My first 2 WUs are now at 73% with a runtime of ~3 hrs, reporting a remaining time of ~1 hr.
I'm using BM 7.4.12


By BM, I assume you mean BOINC Manager. As the term 'Manager' implies, that's the command-and-control module for BOINC, and doesn't do any actual work - your question would be better directed to the BOINC client.

And yes, the BOINC ecosystem as a whole - client and server - can handle a big change like this.

If both the server and the client are built to a recent (2010 or later) specification, the adjustment is handled on the server, using tools like CreditNew and RuntimeEstimation.

If either (or both) of the server and client pre-date 2010 - as the server here at Einstein does - then both components drop back to the older 'Duration Correction Factor' mechanism (no longer documented, since the demise of the Unofficial BOINC Wiki).

Unfortunately, catch-22 applies in both cases. Neither CN/RE nor DCF updates its estimates until a task has successfully completed: in the case of CN/RE, 11 tasks have to complete and validate; in the case of DCF, a single completed task is sufficient. But if BOINC aborts the tasks for 'Maximum elapsed time exceeded' before successful completion...

Hence the references in this thread to 'inoculation' - modifying to bypass the infinite-loop safety-valve, and allowing completion so that estimate-modification can proceed. These are the sort of issues we were grappling with at Albert before attention switched to the new web design.
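
The catch-22 can be sketched as a toy model. Everything in it is an assumption made for illustration: the function and parameter names are hypothetical, the correction factor is simply set to actual/estimated runtime after one completed task (real DCF handling is more gradual), and the 600-minute limit just mirrors the "error after 10 hours" reported earlier in the thread.

```python
def simulate(estimate_min, actual_min, limit_min, n_tasks=3):
    """Toy model: the correction factor changes only when a task completes."""
    correction = 1.0
    outcomes = []
    for _ in range(n_tasks):
        if actual_min > limit_min:
            # Aborted for exceeding the elapsed-time limit: nothing completes,
            # so the estimate is never corrected.
            outcomes.append("aborted")
            continue
        outcomes.append("completed")
        # One completed task is enough in this DCF-style toy model.
        correction = actual_min / estimate_min
    return correction, outcomes

# Tasks that finish inside the limit let the estimate catch up:
print(simulate(estimate_min=25, actual_min=250, limit_min=600))
# prints: (10.0, ['completed', 'completed', 'completed'])

# Tasks that hit the limit never complete, so the estimate stays wrong:
print(simulate(estimate_min=25, actual_min=700, limit_min=600))
# prints: (1.0, ['aborted', 'aborted', 'aborted'])
```

In this model the second case can only be escaped by raising the limit so that at least one task completes.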

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5845
Credit: 109874772673
RAC: 30513841

Quote:
... workunits generated today shouldn't exhibit this error anymore.


I guess it will depend on how long it takes to clear out previously generated workunits.

I've recently got a couple more tasks on a host just set up for beta work and they turned out to be 'short ends' estimated at a couple of mins that ended up taking around 52 mins. Looks like we're still working on 'old' tasks :-).

The host had previously been doing FGRP3 (run time around 12-13 hours); the FGRP3 tasks left in the cache are now estimated at 270 hours, and the machine is in panic mode. I'll edit the state file and try again in a few more hours to see if the old tasks are gone.

Cheers,
Gary.
