Albert: Validate Errors

mmonnin
Joined: 29 May 16
Posts: 291
Credit: 3229183992
RAC: 1146195
Topic 213787

Not sure anyone is going to pick it up over there, as it's been several days where nothing has been able to validate. Here is an example task where many users have validate errors:

 

https://albertathome.org/workunit/988777

 

Can anyone take a look?

TB
Joined: 23 Jul 17
Posts: 1
Credit: 29268292
RAC: 2891

Far from informed about these things, but I know that sometimes I have to suspend a task.

I also know that sometimes the count-down timer stops counting down. It might say 15 minutes and be ready to report 5 minutes later, it might say 5 minutes and then start inching back up. There must be some mismatch between calculated time to completion and what the actual calculations require. Let's call that state count-down limbo.

I think that every "failure to validate" error I've gotten in the last couple of weeks was for a file where I suspended the task while it was in count-down limbo. Perhaps some internal index gets mis-set during that time or parts of memory are overwritten. Like I said, dunno. I'm a linguist, not a programmer.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109378862864
RAC: 35982292

TB_7 wrote:
Far from informed about these things, but I know that sometimes I have to suspend a task.

I'm sorry, but you don't have to suspend a task at all.  You may choose to do that, but it certainly isn't necessary.  If you do so, you may well lengthen the time it takes to complete a task because, if the task isn't kept in memory when suspended, it would need to start from a somewhat earlier point in time, when you remove the suspension.

You have posted in a thread referring to the test site, albertathome.org.  Your tasks list here at einsteinathome.org shows that you are crunching the normal FGRP5 gamma-ray pulsar tasks and the current tasks for the Gravity wave tuning run.  Are you crunching test tasks at albertathome as well?  I'm guessing that perhaps you are not and that perhaps you have inadvertently chosen this topic without understanding the difference between the two sites.

TB_7 wrote:
I also know that sometimes the count-down timer stops counting down. It might say 15 minutes and be ready to report 5 minutes later, it might say 5 minutes and then start inching back up. There must be some mismatch between calculated time to completion and what the actual calculations require. Let's call that state count-down limbo.

This sounds like you are referring to the 'two crunching stages' behaviour (the main stage and the follow-up stage) for the processing of FGRP5 tasks.  If so, have a look at this thread to understand what is happening.  In addition, you need to be careful about assigning importance to estimates of the remaining crunch time.  Because there is no ongoing progress of the % completed during the follow-up stage, BOINC will be fooled into increasing the estimate of remaining time because it can't know that there is progress until the very end when the % completed jumps straight to 100%.
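To illustrate why a frozen % completed makes the estimate climb, here is a minimal sketch. It assumes a naive estimator that simply extrapolates from elapsed time and fraction done; this is not BOINC's actual formula, and the function name and numbers are made up for illustration.

```python
def naive_remaining(elapsed_s: float, fraction_done: float) -> float:
    """Extrapolate remaining seconds from progress so far (hypothetical estimator)."""
    if fraction_done <= 0.0:
        raise ValueError("no progress reported yet")
    return elapsed_s * (1.0 - fraction_done) / fraction_done


# Hypothetical follow-up stage: reported progress stays frozen at 90%
# while elapsed time keeps growing, so the extrapolated estimate grows too.
frozen_fraction = 0.90
for elapsed in (3600, 3900, 4200):
    print(elapsed, round(naive_remaining(elapsed, frozen_fraction)))
```

With the fraction stuck, every extra second of elapsed time pushes the estimate up, even though the task is actually getting closer to finishing.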

TB_7 wrote:
I think that every "failure to validate" error I've gotten in the last couple of weeks was for a file where I suspended the task while it was in count-down limbo. Perhaps some internal index gets mis-set during that time or parts of memory are overwritten. Like I said, dunno. I'm a linguist, not a programmer.

When you suspend a task, it may remain in memory ready to proceed immediately when allowed, or it may be removed completely, depending on your BOINC settings.  At regular intervals while crunching, the current status of a task is saved on disk as a checkpoint file. If you suspend a task and your settings say it should be removed from memory, it will be.  In that case, when the task is resumed, it will be reloaded into memory from a checkpoint file, if such a file exists.  Usually, the interval between checkpoints is of the order of several minutes but it can be a lot longer than that.  So it is possible to see a resumed task go back to an earlier stage when it's reloaded from a checkpoint file.  If you have sufficient memory (these tasks can be large) you can save the waste of some crunching by keeping tasks in memory when suspended.  Better still, stop suspending tasks.  It's not needed in normal circumstances.
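The effect of resuming from a checkpoint can be sketched roughly as follows. This is a toy model, assuming checkpoints are written at exact fixed intervals (real intervals vary by application and settings); the function names are hypothetical and not part of BOINC.

```python
def resume_point(suspend_at_s: int, checkpoint_interval_s: int, kept_in_memory: bool) -> int:
    """Return the amount of work (in seconds) the task still has when it resumes."""
    if kept_in_memory:
        # Task stayed in memory: it carries on exactly where it stopped.
        return suspend_at_s
    # Task was removed from memory: fall back to the most recent checkpoint
    # (0 if no checkpoint had been written yet).
    return (suspend_at_s // checkpoint_interval_s) * checkpoint_interval_s


def lost_work(suspend_at_s: int, checkpoint_interval_s: int, kept_in_memory: bool) -> int:
    """Seconds of crunching that must be repeated after resuming."""
    return suspend_at_s - resume_point(suspend_at_s, checkpoint_interval_s, kept_in_memory)


# Suspended 1000 s into the task, with a checkpoint every 300 s (assumed values).
print(lost_work(1000, 300, kept_in_memory=False))
print(lost_work(1000, 300, kept_in_memory=True))
```

In this model the cost of suspending is at most one checkpoint interval of repeated work, and nothing at all if the task stays in memory.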

I had a quick look at your current list of invalid tasks.  There is just one marked as invalid and 131 that are valid.  Different hardware, different operating systems, different compute libraries can all have an impact on the precise final results returned to the project.  It's generally accepted that these small differences cause approximately 0.5% to 1.0% of tasks to fail validation.  Since you have 1 out of 132, that seems pretty normal.  It is unlikely that an invalid result would have anything to do with a task having been suspended at some point.
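For anyone wanting to check their own numbers the same way, a quick sketch of the arithmetic, using the counts above and the 0.5% to 1.0% range quoted as the generally accepted figure (the function name is just for illustration):

```python
def invalid_rate(invalid: int, valid: int) -> float:
    """Fraction of completed tasks that failed validation."""
    return invalid / (invalid + valid)


rate = invalid_rate(invalid=1, valid=131)
print(f"{rate:.2%}")  # 1 out of 132

# Compare against the roughly 0.5% to 1.0% range mentioned above.
print(0.005 <= rate <= 0.010)
```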

This thread was about a different type of error - a validate error, and over at albertathome.org, the test site.  A validate error is one where what was returned to the project is so scrambled that it can immediately be declared as rubbish without having to go through the formal validation process.  These are usually caused by hardware problems or equipment being operated well outside the normal limits (inappropriate clock speed or voltage adjustments).  You have no examples of validate errors.

 

Cheers,
Gary.
