Gamma-Ray pulsar #5 stops

Edward Johansson
Joined: 2 Feb 19
Posts: 7
Credit: 439773
RAC: 0
Topic 218540

Since I started getting these jobs a couple of days ago, they run fine until I suspend computation, at which point they stop at the same % and don't progress even though it says that they are running. I had them going for 3h+ with no change. Am I the only one having this problem?

Edward Johansson
Joined: 2 Feb 19
Posts: 7
Credit: 439773
RAC: 0

I just realized i placed this in the wrong part of the forum, sorry guys.

Holmis
Joined: 4 Jan 05
Posts: 1118
Credit: 1055935564
RAC: 0

Well, if you suspend computation then they should stop and they should say they're stopped.
If you then resume computation, they should get going again. If not, try to ascertain whether they consume any resources (CPU/GPU). If you're running Windows, open "Task Manager" (if you get a window with minimal info, click "More details") and check for a process with FGRP5 or something similar in its name. Does it show any CPU usage?

To get the tasks going again try to restart Boinc or reboot the computer.

Edward Johansson
Joined: 2 Feb 19
Posts: 7
Credit: 439773
RAC: 0

According to Task Manager, the jobs are using the CPU. They don't show any progress though; it stopped at the point where I last suspended computation/closed BOINC (all around 10-12%).

As they are obviously running, maybe if I just let it chug for a long period of time they will eventually finish but that won´t work for me in the long run.

Thanks for your help!

Edward Johansson
Joined: 2 Feb 19
Posts: 7
Credit: 439773
RAC: 0

Program Gamma-ray pulsar search #5 1.08 (FGRPSSE)
Name LATeah0055F_440.0_35629_0.0
State Running
Received 2019-03-29 12:09:18
Deadline 2019-04-12 13:09:18
Estimated  105 000 GFLOPs
CPU-time 03:18:59
CPU-time since last checkpoint 01:02:47
Elapsed time 03:31:09
Estimated remaining time 06:48:48
Fraction done 13,671%
Size of virtual memory 282,38 MB
Size of job 271,49 MB
Directory slots/7
Process-ID 7724
Executable hsgamma_FGRP5_1.08_windows_intelx86__FGRPSSE.exe

Those are the properties from one of the jobs. One thing that seems weird is that the executable says intel while I have an AMD Ryzen CPU in my system. I don't know whether that part of the executable name indicates what kind of processor it is supposed to run on, though.

Time since checkpoint is the same as the time since i started the boinc client.

Since I started the client I've gotten two more of the same job, which are running fine for now (at about 19%), but I suspect they will seize up as soon as I close the client, especially since their properties show that they haven't gotten a new checkpoint since they started.

Edward Johansson
Joined: 2 Feb 19
Posts: 7
Credit: 439773
RAC: 0

Now, the two that got started have restarted from 0% and stopped at only a few %, but most of the rest have started showing progress again and seem to be creating checkpoints at a reasonable rate… I'm confused.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110026016162
RAC: 22504007

Edward Johansson wrote:
Now, the two that got started have restarted from 0% and stopped at only a few %, but most of the rest have started showing progress again and seem to be creating checkpoints at a reasonable rate… I'm confused.

I think most of us who are willing to help are confused too! :-).

You need to remember that we aren't looking over your shoulder, seeing everything you do and watching the changing information on your screen as a result of changes you make.

Your computers are 'hidden' (standard default setting these days due to the European GDPR privacy regulations) so it is very helpful if you at least supply a computer ID for the machine showing the problems, even if you want to keep your full list private.  If you don't do that, people won't feel competent to fully understand the problem and diagnose it.

Without being able to see the details, you need to supply much more information about your system.  You can easily find your computer ID by going to your account page on the website and following the link to your particular computer.  By supplying that ID, you save having to supply all the hardware/software/completed results details that make diagnosis easier to do.

Here are some answers to other points you raise.

Edward Johansson wrote:
They don't show any progress though; it stopped at the point where I last suspended computation/closed BOINC (all around 10-12%).

What exactly do you mean by this?  Did you suspend computation of one or more tasks and if so, why?  I suspect you may actually mean that you stopped BOINC itself (which is different from suspending computation) and that when you (at some later stage) restarted BOINC, the progress of tasks changed to lower values or perhaps even went right back to zero.

When BOINC is running and computation is not suspended, you should see all tasks showing incremental progress at reasonably regular intervals, usually every few seconds but certainly no longer than a minute or two, even if the computer is quite slow.  If progress reaches some point and then stops completely for an extended period, there is something else interfering.  What do you use for the computing preference question, "Suspend when computer is in use"?  What about the question, "Leave non-GPU tasks in memory while suspended"?   Sometimes, behaviour that looks like a problem may just be due to inappropriate preference settings.

Apart from the progress % regularly increasing, you can have problems with losing progress if a particular task just happens not to create checkpoints very regularly.  There are some data series where the checkpoints could be more than an hour apart.  You could lose significant progress when BOINC shuts down or if a task is suspended without being kept in memory.  The 'properties' page for a task will list the CPU time since the last checkpoint was written.  This is how much you could lose on that task if BOINC shuts down at this point without being kept in memory.

Edward Johansson wrote:
As they are obviously running, maybe if I just let it chug for a long period of time they will eventually finish but that won´t work for me in the long run.

A task is not "obviously running" if it's not making regular incremental progress.  A task will run more efficiently if you can allow it not to be interrupted unnecessarily.  If running tasks are interfering with your normal use of the computer, the most efficient thing to do is limit the number of cores BOINC is allowed to use.  If your computer is relatively modern, and if your normal activities are mainly office/browsing/email type use, you should hardly notice BOINC running in the background.

If you want to experiment to see if BOINC is having much impact, try reducing the number of cores incrementally until you find a setting that works for you.  As an example, if your machine had 8 cores, you could reduce the % of cores that BOINC uses from 100% to 87.5% to 75%, etc, until you are happy with the result.
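As a rough illustration of that stepping, here is a minimal Python sketch that lists the percentage to enter for each core count. The function name core_percent_steps and the choice to stop at half the cores are assumptions made for the example; nothing here is provided by BOINC itself.

```python
def core_percent_steps(total_cores, min_cores=1):
    """Return (cores, percent) pairs from all cores down to min_cores."""
    steps = []
    for cores in range(total_cores, min_cores - 1, -1):
        steps.append((cores, 100.0 * cores / total_cores))
    return steps


# Example: an 8-core machine, stepping down as far as 4 cores.
for cores, percent in core_percent_steps(8, min_cores=4):
    print(f"{cores} cores -> {percent:.1f}%")
```

For a 16-thread machine you would call core_percent_steps(16) and try the values from the top down until normal use of the computer feels unaffected.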

Edward Johansson wrote:

CPU-time since last checkpoint 01:02:47
...
...
Executable hsgamma_FGRP5_1.08_windows_intelx86__FGRPSSE.exe

Those are the properties from one of the jobs. One thing that seems weird is that the executable says intel while I have an AMD Ryzen CPU in my system. I don't know whether that part of the executable name indicates what kind of processor it is supposed to run on, though.

The first line shows how much CPU time will be lost if you stop the task at that point (over an hour).  Ideally, if you want to stop crunching without losing progress, you might try to choose a stage where that number is reasonably low.
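To put a number on that, the times from the properties listing can be converted to seconds. This is only a sketch: hms_to_seconds is a made-up helper, and the two times are copied from the properties Edward posted.

```python
def hms_to_seconds(text):
    """Convert an HH:MM:SS string, as shown in the task properties, to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


cpu_time = hms_to_seconds("03:18:59")          # "CPU-time" from the listing
since_checkpoint = hms_to_seconds("01:02:47")  # "CPU-time since last checkpoint"

print(f"CPU time at risk if stopped now: {since_checkpoint} s")
print(f"Share of the CPU time spent so far: {since_checkpoint / cpu_time:.0%}")
```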

Usually checkpoints are created fairly regularly.  As an example, the properties info you provided shows that the data series was LATeah0055F.dat.  Because your info is not available, I found a task from one of my hosts for that same data series.  By clicking on the task ID link for that task on the website (you could do exactly the same for any of your tasks) I scrolled down in the page of info to find the "STDERR.TXT" heading and from what was there, I extracted the following snip which shows the first two checkpoints being written.

% fft_size: 67108864 (0x4000000, 2^26); alloc: 268435464
% Sky point 1/79
% Creating FFT (3.3.6-pl2 22109fa) plan.
% Starting semicoherent search over f0 and f1.
% nf1dots: 56 df1dot: 1.84974169e-15 f1dot_start: -1e-13 f1dot_band: 1e-13
% Filling array of photon pairs
........................................................
% C 1 0
% Sky point 2/79
% Starting semicoherent search over f0 and f1.
% nf1dots: 56 df1dot: 1.84974169e-15 f1dot_start: -1e-13 f1dot_band: 1e-13
% Filling array of photon pairs
........................................................
% C 2 0

There are 79 'Sky points' being analysed (line 2).
The value of the parameter nf1dots is 56 (line 5).
This means that each sky point will have 56 calculation loops to be performed.
As each of these is completed, a 'dot' is added to the row of dots in line 7. (Count them if you like - there are 56.)
When the sky point is finished,  a checkpoint is written - % C 1 indicates the first checkpoint (line 8).
Then the calculation loops move on to the 2nd sky point - 2/79 and the process repeats.
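If you want to check a longer log without counting by hand, something like the following Python sketch would do it. It assumes the log lines look exactly like the snip above (the patterns are inferred from that snip only, not from any documentation of the app), and summarize_log is a hypothetical helper name.

```python
import re

# The snip from above, rebuilt as a string; each row of dots is 56 long.
SAMPLE = "\n".join([
    "% fft_size: 67108864 (0x4000000, 2^26); alloc: 268435464",
    "% Sky point 1/79",
    "% Creating FFT (3.3.6-pl2 22109fa) plan.",
    "% Starting semicoherent search over f0 and f1.",
    "% nf1dots: 56 df1dot: 1.84974169e-15 f1dot_start: -1e-13 f1dot_band: 1e-13",
    "% Filling array of photon pairs",
    "." * 56,
    "% C 1 0",
    "% Sky point 2/79",
    "% Starting semicoherent search over f0 and f1.",
    "% nf1dots: 56 df1dot: 1.84974169e-15 f1dot_start: -1e-13 f1dot_band: 1e-13",
    "% Filling array of photon pairs",
    "." * 56,
    "% C 2 0",
])


def summarize_log(text):
    """Collect sky point total, nf1dots, dot-row lengths and checkpoint numbers."""
    summary = {"total_sky_points": None, "nf1dots": None,
               "dot_rows": [], "checkpoints": []}
    for raw in text.splitlines():
        line = raw.strip()
        sky = re.match(r"% Sky point (\d+)/(\d+)", line)
        nf1 = re.match(r"% nf1dots: (\d+)", line)
        chk = re.match(r"% C (\d+)", line)
        if sky:
            summary["total_sky_points"] = int(sky.group(2))
        elif nf1:
            summary["nf1dots"] = int(nf1.group(1))
        elif chk:
            summary["checkpoints"].append(int(chk.group(1)))
        elif line and set(line) == {"."}:
            # A row made only of dots: one dot per completed loop.
            summary["dot_rows"].append(len(line))
    return summary


print(summarize_log(SAMPLE))
```

A dot row shorter than nf1dots, or a sky point with no checkpoint line after it, would point to a place where the task was interrupted.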

Since there are 79 sky points there will be 79 checkpoints and this will occur regularly for every 90/79=1.139% of progress.  This means that when the progress % shows 1.139%, 2.278%, 3.418%, 4.557%, .... etc, you can know a checkpoint has been written without even checking the properties page.

With other data files (and perhaps even within the one data file - I don't know) I've seen examples where the sky points are very low (eg 8) and the nf1dots is very high (about 553 from memory).  The amount of computation in a task can probably be roughly estimated from sky points X nf1dots - 79x56=4424 and 8x553=4424 - which I guess is why the crunch time seems reasonably constant even if the checkpoint time is not.  With 8 sky points, a checkpoint would be written every 11.25% of progress.
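The arithmetic in the last two paragraphs can be checked with a short Python sketch. It takes the 90% figure and the sky point / nf1dots values exactly as quoted above; the function names are invented for the example, and the loop count is only the rough work estimate described here, not something the app reports.

```python
def checkpoint_marks(sky_points, search_share=90.0):
    """Progress values (in %) at which a checkpoint would be written,
    assuming one checkpoint per sky point spread evenly over search_share."""
    step = search_share / sky_points
    return [round(step * n, 3) for n in range(1, sky_points + 1)]


def loop_count(sky_points, nf1dots):
    """Rough work estimate: sky points times nf1dots."""
    return sky_points * nf1dots


print(checkpoint_marks(79)[:4])                  # [1.139, 2.278, 3.418, 4.557]
print(checkpoint_marks(8)[0])                    # 11.25
print(loop_count(79, 56), loop_count(8, 553))    # 4424 4424
```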

In talking about when checkpoints are created, I'm reminded about a computing preference setting, "Request tasks to checkpoint at most every: ".  You haven't changed that from the default value of 60 secs have you by any chance?

A lot of the above is an educated guess on my part but I'm sharing it in the hope it might help you understand why it's useful to allow crunching to proceed without constant interruption if possible.

With regard to your concern about the name of the executable, the intelx86 part refers to the architecture that the executable is designed to run on and not the processor brand.  Both Intel and AMD make processors that comply with the basic x86 (or x86_64) architecture.  The same application runs on both brands.

If these sorts of comments don't tally with what you are actually seeing, please take some time to explain exactly what you do see.

 

Cheers,
Gary.

Edward Johansson
Joined: 2 Feb 19
Posts: 7
Credit: 439773
RAC: 0

Thanks for the lengthy answer! And i beg your pardon for not explaining myself in a good manner!

So, by now I have finished a couple of these jobs and, looking at the tasks in my account on this website, I see that they took a disproportionate amount of time to finish considering the estimated operations (105 000 GFLOPs) and the credit I got for them. I removed the privacy for my computers so you should be able to look it up if you wish; 12765356 is the computer ID.

The behavior of these jobs:

When they start, they chug along nicely, although they seem to be creating checkpoints at a slow pace, if at all. When i restart boinc, all of the jobs that actually got to a checkpoint are stuck there for extended periods of time. I haven't really gotten a feel for when they start up again, but seeing as i actually finished some jobs yesterday, i know that they do. When i look in task manager, i can see that the jobs that don't show progress in boinc are using CPU-time though.

For example, since i started boinc about 80 minutes ago, i have 4 of these jobs at varying progress (79%,58%,39%,36%) which haven't progressed at all since startup. At the same time, i have 3 other jobs that started from 0% as i started boinc, which now have progressed more than 19% without creating any checkpoints.

<<<<<< During writing, one of these jobs restarted to 0%:

LATeah0056F_56.0_3736_-4.2e-11

i did not turn off boinc. One of the other jobs:

LATeah0056F_56.0_3600_-2.4e-11

seems to have lost only a couple % progress while the job

LATeah0056F_56.0_3728_-9e-12

is running at 20% progress, no checkpoint yet though. Just checked again and this job restarted to 0% as well.

And now the jobs that hadn't made progress in all this time started making progress!!!!

Maybe you understand why i'm so confused now (and confusing)>>>>>

I guess it isn't really that much of a problem; I just have to keep my system running for extended periods of time, and I've looked into letting BOINC use fewer of the cores to make this feasible, as you mentioned. It just seems weird to me that these jobs show such erratic behavior, not at all like the ones I have run before. I've only been a contributor for a couple of months though, so excuse me if I'm just ignorant in some way.

Regarding your questions about my settings:

"Suspend when computer is in use": Not checked

"Leave non-GPU tasks in memory while suspended": Not checked

"Request tasks to checkpoint at most every": I had changed this to 120 but changed it back to 60 and it doesn't seem to make a difference.

Thank you for taking your time on this problem! 

Holmis
Joined: 4 Jan 05
Posts: 1118
Credit: 1055935564
RAC: 0

I'll chime in and I'm sure Gary will be back in a while to fill in some more details. ;-)

Edward Johansson wrote:
When i look in task manager, i can see that the jobs that don't show progress in boinc are using CPU-time though.

That's a good indication that the jobs are actually running and processing data. (Unless they've gotten stuck in a loop and that rarely happens)

Quote:

For example, since i started boinc about 80 minutes ago, i have 4 of these jobs at varying progress (79%,58%,39%,36%) which haven't progressed at all since startup. At the same time, i have 3 other jobs that started from 0% as i started boinc, which now have progressed more than 19% without creating any checkpoints.

<<<<<< During writing, one of these jobs restarted to 0%:

LATeah0056F_56.0_3736_-4.2e-11

i did not turn off boinc. One of the other jobs:

LATeah0056F_56.0_3600_-2.4e-11

seems to have lost only a couple % progress while the job

LATeah0056F_56.0_3728_-9e-12

is running at 20% progress, no checkpoint yet though. Just checked again and this job restarted to 0% as well.

This one is easy to explain if one knows a bit about how Boinc works.
A few years ago Boinc was changed to show "pseudo progress" if a task doesn't report progress to Boinc in a timely manner. The change was made because many volunteers thought that tasks weren't working if the progress percentage didn't increase often enough. So now Boinc simulates progress until the task reports progress. The simulated value increases towards, but never reaches, 100% complete. Then, when the task finally reports progress, Boinc resets the display to the reported progress!
This can be really confusing to new volunteers.
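A toy model may make the jump easier to picture. The Python sketch below is NOT Boinc's actual formula: the exponential curve, the run time estimate and the function name displayed_progress are all assumptions, chosen only because the curve creeps towards 100% without reaching it for any realistic run time.

```python
import math


def displayed_progress(elapsed, estimated_runtime, reported=None):
    """Toy model of the progress shown for a task (fraction between 0 and 1).

    If the task has reported real progress, show that value.
    Otherwise show a simulated value that grows with elapsed time.
    """
    if reported is not None:
        return reported
    return 1.0 - math.exp(-elapsed / estimated_runtime)


estimated = 6000.0  # hypothetical run time estimate in seconds

# No report yet: the display keeps climbing on its own.
for elapsed in (600, 1200, 2400):
    print(elapsed, f"{displayed_progress(elapsed, estimated):.1%}")

# The first real report arrives and the display drops to the reported value.
print(2500, f"{displayed_progress(2500, estimated, reported=0.0114):.1%}")
```

In this model a task can show a healthy-looking percentage for a long time and then fall back to a small value the moment its first checkpoint is reported, which matches the kind of "restart" described in the quote.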

Quote:
And now the jobs that hadn't made progress in all this time started making progress!!!!

As the tasks here at Einstein@home only report progress when a checkpoint is written, the progress update frequency will vary between tasks, and for some tasks the updates can be quite far apart: more than 1 hour on a slower task/host.

Quote:
Maybe you understand why i'm so confused now (and confusing)>>>>>


I do feel your pain! ;-)

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110026016162
RAC: 22504007

Edward Johansson wrote:
... 12765356 is the computer ID.

Thanks for that.  A Ryzen 7 1800x with 16GB RAM is quite a decent machine.  BOINC sees 16 processors - 16 threads for the 8 true cores.  Keep in mind that CPU tasks will take rather longer when two share a core and you may compromise overall performance if you allow BOINC to load up all available threads.  It's worth experimenting to quantify the difference.

Your only completed FGRP5 CPU tasks were returned after April 2 and show a crunch time just below 50,000s.  I have an Athlon 200GE (a much less capable processor with 2 cores 4 threads) running just 2 (mainly the GW engineering run) CPU tasks at a time.  My times for some FGRP5 tasks that were also received were a bit over 18,000s.  Even if using threads rather than cores caused crunch times to double (it shouldn't) that would mean your machine takes around 25,000s for something my machine is doing in 18,000s.  This indicates some sort of problem with the way your machine is handling the FGRP5 search.

When looking at the O1OD1E times, my machine takes less than 30,000 secs.  On your machine there are two groups for these tasks.  Those completed on or before 29 March show around 23,000s.  After 01 April, the time has become 40,000s.  This could indicate a change from using cores (23,000s) to using threads (40,000s) since that looks about the sort of benefit (2 tasks in 40,000s rather than 2 consecutive tasks taking a total of 46,000s) that might be expected.  Did you make such a change around 01 April??
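That estimate can be written out as a small Python sketch. The run times are the approximate values quoted above; tasks_per_day is a hypothetical helper, and treating "one task per core" versus "two tasks per core" as the only difference between the two groups is an assumption.

```python
SECONDS_PER_DAY = 86_400


def tasks_per_day(task_seconds, concurrent_per_core):
    """Tasks finished per core per day for a given run time and concurrency."""
    return SECONDS_PER_DAY * concurrent_per_core / task_seconds


one_per_core = tasks_per_day(23_000, 1)  # one task per core, about 23,000 s each
two_per_core = tasks_per_day(40_000, 2)  # two threads per core, about 40,000 s each

gain = two_per_core / one_per_core - 1
print(f"Throughput gain from using both threads: {gain:.0%}")
```

With these numbers the gain comes out at roughly 15%, the same thing as 2 tasks in 40,000s instead of 46,000s.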

However, the real problem is why your FGRP5 tasks are taking so much time compared to the O1OD1E tasks. 

Edward Johansson wrote:
When they start, they chug ... creating checkpoints at a slow pace, if at all.

As Holmis mentions, this sounds very much like something to do with BOINC's 'simulated' progress, but it's more than just that, I think.  To investigate, I had a look at two returned tasks.  This one is one you aborted, probably because it wasn't making progress.  This one immediately follows the previous one in your list of FGRP5 tasks and it completed and validated, admittedly after a very long time.

In my previous message, I spelt out how to check out the details from the stderr.txt that is returned with a task.  Please look at those two tasks and see for yourself what is going on.  To understand my next comments, you need to have the log for each in turn open in front of you and be following along.

In the first example, the starting timestamp was 14:50:30 and 10 checkpoints were written, apparently quite normally.  During the processing of sky point 11 (out of 79), something drastic happened and BOINC was restarted with a fresh timestamp of 18:10:59 - about 3hrs 20mins after the initial startup.  There were about a third of the 'dots' completed.  There is no way of knowing the precise time of the event or how long it took for BOINC to be restarted.  We can't tell if just BOINC quit and was restarted manually or if the whole machine rebooted and BOINC restarted automatically.  Maybe you might remember.

From that point onwards, the machine never gets to finish sky point 11.  There are even no partial lines of dots to indicate that any processing at all was going on.  Eventually there was just a long series of later timestamps and further restart cycles.  It almost looks like the science app is being prevented from making progress and after some time BOINC decides to force another restart.  I have no idea why this is happening but I'd be very doubtful of a 'problem with the app or the data' explanation.  A wild guess would suggest it might be some sort of interference by a virus/malware checker.  Are you running one and do you allow it to scan the BOINC tree?

For the second example, where the task eventually completed, there were 9 sky points completed and something happened on the 10th.  There were a whole bunch of attempts to get going on sky point 10, with no sign of progress, with each one followed by a restart cycle.  All of a sudden, sky point 10 got going and things were back to normal until sky point 27 failed to make any progress (no dots).  Some more timestamps and restart cycles and then 27 got going and was completed.

There was a further shutdown and restart at sky point 41 but this looks quite normal.  A partial row of dots and then a normal looking restart with sky point 41 with its full complement of dots.  A further normal shutdown and restart at sky point 56 and then again at 70.  These all look like regular shutdown or suspending of processing followed by a normal restart.

Originally you talked about suspending computation at times.  Possibly these seem to be examples of that.  Can you confirm that you do suspend and then restart tasks?  If you do and if tasks aren't being kept in memory when suspended, you can lose computation each time.  Look at the dots for sky point 70.  That one was within a dot or two of being completed when the plug was pulled.  That whole sky point had to be repeated when crunching resumed.

This is all way longer than I intended it to be so I'm sorry about that.  People trying to help can't ever see all that you see and it's very difficult (for both parties) to get a proper picture across (about the problem or the diagnostic information) just with words.  I think the best course of action is for you to create a hand written log (with timestamps) of what you do and what you see, and then compare that with what eventually shows up on the website when the result is returned.

Your O1OD1E tasks don't seem to have this problem which is another reason why I'm wondering if something (like anti-malware) is interfering with the proper running of the FGRP5 app.  I'm not aware of others having this issue so it seems rather specific to your setup.

 

Cheers,
Gary.

Edward Johansson
Joined: 2 Feb 19
Posts: 7
Credit: 439773
RAC: 0

I'm glad that you guys find the time to discuss this problem with me!

Yes, I do suspend computation at times, and if I do lose progress (now I know I do), it is so little that I haven't noticed it in any jobs except the gamma #5. Looking at the error reports you linked to, most, if not all, of the restarts are probably me suspending manually/restarting BOINC (I tend to close it while gaming/software development). I can only remember one time in the last week that the computer froze up during gaming and I had to reset the system, but I'm not sure that I even had BOINC on at that time, definitely not running tasks at least.

… This could indicate a change from using cores (23,000s) to using threads (40,000s) …. Did you make such a change around 01 April??

I've been running 16 CPU-tasks simultaneously since I started using BOINC, so no change there. Before about that time though, I mostly set my cores to 3.8 GHz (if you are not familiar with the Ryzen 1800x, stock settings are 3.7 GHz with boost to 4.1 GHz, so there shouldn't be a large mismatch in performance, probably even negative) but I don't think it actually has anything to do with that. The CPU is at stock settings now, to clarify.

Regarding malware, i haven't installed any third-party virus-protection. I do have windows´ virus-protection on though, i'll try to exclude the jobs from it but it feels like this would be a more widespread problem if that was what was causing this behaviour. I installed Malwarebytes today just to check for malware but found nothing on the system.

I've learned a lot about how this works from reading your answers so i'll try to get some kind of log going and to try to identify the behaviour more closely.
