Gamma-Ray pulsar #5 stops

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5888
Credit: 119866579221
RAC: 25964446

Edward Johansson wrote:
I'm glad that you guys find the time to discuss this problem with me!

I retired a long time ago so this stuff keeps me off the streets :-).  I run a lot of computers and I try to understand how everything works.  I get to play with things and, hopefully, understand enough to pass on what I think I've learned.  People do need to double check since I can easily be wrong.  I don't regard myself as any sort of 'expert'.

I never use one word when ten will do since I'm conscious that lots of people read but don't feel confident enough either to contribute or ask about the more technical stuff.  I'm always conscious of the potentially wider audience so I'd rather over-explain than not give enough detail for all who might be reading.  Unfortunately, some people think I'm 'talking down' to them when I'm actually just trying to make sure I give the complete picture for the benefit of the 'lurkers' :-).

Edward Johansson wrote:
... Looking at the error reports you linked to, most, if not all, of the restarts are probably me suspending manually/restarting BOINC (I tend to close it while gaming/software development).

OK, that's fine.  I suspected this but just wanted to be sure.  The logs I linked to are not "error reports" as such, but rather just "logs" - detailed records of what was happening.  They can contain error messages and if they do, the details will be clearly identified with some sort of *ERROR* flag that makes it pretty obvious.  So, for your 2nd example where the task ultimately completed and validated, there are no documented errors.

However, looking at that 2nd example, below is an excerpt from the full stderr.txt output.  I have omitted some lines which are not relevant to the understanding of what was happening.  I'm starting with the successful recording of sky point 9 and the start of sky point 10 (out of 79).  When you see "<== blah blah blah", these are notes added by me.

As an aside, with 79 sky points in total and a task completion time of ~47,000s (which we know is inflated for some reason), each sky point, and so each checkpoint, should take no longer than 47000/79 = ~595s, or about 10 mins.  We know it would be a lot less than that for 'normal' operations.  The very start of crunching had a timestamp of 14:50:30.
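For anyone who wants to check that arithmetic, here is a minimal sketch in Python.  The numbers come straight from this task (47,000s, 79 sky points, start at 14:50:30); the variable names are just mine, and rounding up to 10 mins per sky point is the same generous assumption as in the estimate above.

```python
from datetime import datetime, timedelta

TOTAL_RUNTIME_S = 47_000   # reported completion time (known to be inflated)
SKY_POINTS = 79            # sky points in this task
START = datetime.strptime("14:50:30", "%H:%M:%S")  # first timestamp in the log

# Upper bound on the time per sky point (one checkpoint per sky point).
per_point_s = TOTAL_RUNTIME_S / SKY_POINTS

# Assumption: round up to a generous 10 minutes per sky point.
per_point_rounded = timedelta(minutes=10)

# Latest time by which checkpoint 9 should have been written.
latest_checkpoint_9 = START + 9 * per_point_rounded

print(f"{per_point_s:.0f} s per sky point (~{per_point_s / 60:.1f} min)")
print("checkpoint 9 no later than", latest_checkpoint_9.strftime("%H:%M:%S"))
```

Any timestamp in the log that is well past that bound while sky point 10 is still unfinished points to time lost to a stop, a suspension, or a failed restart.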

% C 9 0               <== Timestamp here would be less than 14:50:30 + 9x10mins = 16:20:30
% Sky point 10/79
% Starting semicoherent search over f0 and f1.
% nf1dots: 56 df1dot: 1.84974169e-015 f1dot_start: -1e-013 f1dot_band: 1e-013
% Filling array of photon pairs
..........................................18:10:59 (7408): [normal]: This Einstein@home App was built at: Jul 26 2017 09:32:43
                                         <== Partial row of dots (no line end, so the new timestamp is not on a new
                                              line as it would normally be).  You probably stopped or suspended here,
                                              which would account for a gap of nearly 2 hours (probably more than that).

18:10:59 (7408): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FG......exe'.
....
....
% checkpoint read: skypoint 9 binarypoint 0            <== We are restarting using checkpoint 9
% fft_size: 67108864 (0x4000000, 2^26); alloc: 268435464
% Sky point 10/79
% Creating FFT (3.3.4 22109fa) plan.
18:27:44 (17808): [normal]: This Einstein@home App was built at: Jul 26 2017 09:32:43
                                         <== This is the bit that troubles me.  Up to this point, things are
                                              understandable.  With the previous checkpoint read and sky point 10
                                              started, why didn't things just continue?  Instead there is a 17 min
                                              hiatus until a further attempt to restart the app.

18:27:44 (17808): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FG......exe'.
.....
.....
% checkpoint read: skypoint 9 binarypoint 0
% fft_size: 67108864 (0x4000000, 2^26); alloc: 268435464
% Sky point 10/79
% Creating FFT (3.3.4 22109fa) plan.
19:40:14 (18104): [normal]: This Einstein@home App was built at: Jul 26 2017 09:32:43

19:40:14 (18104): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FG......exe'.
                                         <== I've deleted many more examples of the above failure of the
                                              science app to get started properly.  Finally, on a new day, it
                                              does get going, as the next cycle clearly shows.

16:54:16 (13704): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FG......exe'.
.....
.....
% Opening inputfile: ../../projects/einstein.phys.uwm.edu/LATeah0055F.dat
% Total amount of photon times: 9989
% Preparing toplist of length: 10
% checkpoint read: skypoint 9 binarypoint 0
% fft_size: 67108864 (0x4000000, 2^26); alloc: 268435464
% Sky point 10/79
% Creating FFT (3.3.4 22109fa) plan.
% Starting semicoherent search over f0 and f1.
% nf1dots: 56 df1dot: 1.84974169e-015 f1dot_start: -1e-013 f1dot_band: 1e-013
% Filling array of photon pairs
........................................................
INFO: Major Windows version: 6
% C 10 0                                             <== Yippee!! Sky point 10 finally gets checkpointed !!!!
% Sky point 11/79
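If you want to look for the same pattern in your own stderr.txt, here is a minimal sketch in Python that lists how many times the app was started between one checkpoint and the next.  The two regular expressions are assumptions based only on the line formats visible in the excerpt above.  The function name `restarts_between_checkpoints` and the sample text are mine for illustration and are not part of BOINC or the science app.

```python
import re

# Assumed formats, taken only from the excerpt above:
#   18:10:59 (7408): [normal]: Start of BOINC application '...'
#   % C 10 0
START_RE = re.compile(
    r"(\d{2}:\d{2}:\d{2}) \((\d+)\): \[normal\]: Start of BOINC application"
)
CHECKPOINT_RE = re.compile(r"% C (\d+) (\d+)")


def restarts_between_checkpoints(log_text):
    """Return [(sky_point, [(time, pid), ...]), ...].

    Each entry holds the app starts seen since the previous checkpoint.
    Starts after the last checkpoint are reported with sky_point None.
    """
    results = []
    pending = []
    for line in log_text.splitlines():
        line = line.strip()
        start = START_RE.search(line)
        if start:
            pending.append((start.group(1), int(start.group(2))))
            continue
        checkpoint = CHECKPOINT_RE.match(line)
        if checkpoint:
            results.append((int(checkpoint.group(1)), pending))
            pending = []
    if pending:
        results.append((None, pending))
    return results


# Hypothetical sample, cut down from the excerpt above.
sample = """\
% C 9 0
% Sky point 10/79
18:10:59 (7408): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FG......exe'.
% checkpoint read: skypoint 9 binarypoint 0
18:27:44 (17808): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FG......exe'.
% checkpoint read: skypoint 9 binarypoint 0
19:40:14 (18104): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FG......exe'.
% C 10 0
"""

report = restarts_between_checkpoints(sample)
for sky_point, starts in report:
    print(sky_point, len(starts), starts)
```

One start before a checkpoint is what a normal suspend and resume would look like.  Several starts in a row with no checkpoint in between is the troubling pattern shown above.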

Edward Johansson wrote:
I've been running 16 CPU-tasks simultaneously since I started using BOINC so no change there. Before about that time though, I mostly set my cores to 3.8 GHz (if you are not familiar with the Ryzen 1800X, stock settings are at 3.7 GHz and it boosts to 4.1 GHz, so there shouldn't be a large mismatch of performance, probably even negative) but I don't think it actually has anything to do with that.

OK, so not a cores/threads change and small frequency changes are not going to make any real difference.

Edward Johansson wrote:
Regarding malware, I haven't installed any third-party virus protection. I do have Windows' virus protection on though. I'll try to exclude the jobs from it but it feels like this would be a more widespread problem if that was what was causing this behaviour. I installed Malwarebytes today just to check for malware but found nothing on the system.

I don't use Windows, period.  All I know is that people often seem to need to exclude the BOINC stuff in order not to get false positives or other sorts of unwanted interference.  What the behaviour documented above seems to indicate is that something is preventing the science app from starting up again after a suspension.  The fact that it eventually succeeds in getting started suggests that whatever is doing it isn't an 'always on' continuous thing, so you can eventually get lucky and find a window of opportunity.

It seems that once restarted, it is usually allowed to continue until you next suspend it.  This is why it would be useful to keep things in memory because, apart from preventing a partial row of dots from being wasted, the image already in memory might just be allowed to resume.  My first suggestion would be to change your preferences so that the state of play is kept in memory when a task is suspended and see if that makes any difference whatsoever.  Then you should investigate the Windows protection and see if it can be configured to ignore the BOINC stuff.

 

Cheers,
Gary.

daghtus
Joined: 1 Mar 19
Posts: 11
Credit: 31994422
RAC: 0

From my experience, Gamma #5 jobs do indeed hang at random times and never resume. I let some of these run for over 14 hours with no luck. The only option is to abort, start over and pray. I experienced identical behavior on two different PCs.

EDIT: These jobs do not even checkpoint properly. If I happen to exit the app, all progress is usually lost.
