Albert: Validate Errors

mmonnin

Joined: 29 May 16

Posts: 292

Credit: 3444726540

RAC: 2099

4 Mar 2018 23:24:39 UTC

Topic 213787

Not sure anyone is going to pick it up over there, as it's been several days now where nothing has been able to validate. Here is an example task where many users have validate errors:

https://albertathome.org/workunit/988777

Can anyone take a look?

TB_7

Joined: 23 Jul 17

Posts: 1

Credit: 29315416

RAC: 0

12 Mar 2018 0:57:45 UTC

Message 164611

Far from informed about these things, but I know that sometimes I have to suspend a task.

I also know that sometimes the count-down timer stops counting down. It might say 15 minutes and be ready to report 5 minutes later, it might say 5 minutes and then start inching back up. There must be some mismatch between calculated time to completion and what the actual calculations require. Let's call that state count-down limbo.

I think that every "failure to validate" error I've gotten in the last couple of weeks was for a file where I suspended the task while it was in count-down limbo. Perhaps some internal index gets mis-set during that time or parts of memory are overwritten. Like I said, dunno. I'm a linguist, not a programmer.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5888

Credit: 119693560517

RAC: 25373521

12 Mar 2018 4:43:39 UTC

Message 164613 in response to message 164611

TB_7 wrote:

Far from informed about these things, but I know that sometimes I have to suspend a task.

I'm sorry, but you don't have to suspend a task at all. You may choose to do that, but it certainly isn't necessary. If you do so, you may well lengthen the time it takes to complete a task because, if the task isn't kept in memory when suspended, it would need to start from a somewhat earlier point in time, when you remove the suspension.

You have posted in a thread referring to the test site, albertathome.org. Your tasks list here at einsteinathome.org shows that you are crunching the normal FGRP5 gamma-ray pulsar tasks and the current tasks for the gravitational wave tuning run. Are you crunching test tasks at albertathome as well? I'm guessing that you are not and that you have inadvertently chosen this topic without understanding the difference between the two sites.

TB_7 wrote:

I also know that sometimes the count-down timer stops counting down. It might say 15 minutes and be ready to report 5 minutes later, it might say 5 minutes and then start inching back up. There must be some mismatch between calculated time to completion and what the actual calculations require. Let's call that state count-down limbo.

This sounds like you are referring to the 'two crunching stages' behaviour (the main stage and the follow-up stage) for the processing of FGRP5 tasks. If so, have a look at this thread to understand what is happening. In addition, you need to be careful about assigning importance to estimates of the remaining crunch time. Because there is no ongoing progress of the % completed during the follow-up stage, BOINC will be fooled into increasing the estimate of remaining time because it can't know that there is progress until the very end when the % completed jumps straight to 100%.
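To see why the estimate creeps upward, here is a minimal Python sketch. It assumes a plain linear extrapolation from elapsed time and % completed, which is a simplification and not BOINC's actual estimator. The function name and the 90% stall point are made up for illustration.

```python
def naive_remaining(elapsed_s: float, fraction_done: float) -> float:
    """Linear extrapolation of remaining time (hypothetical, not BOINC's code).

    Assumes the whole task will take elapsed / fraction_done seconds.
    """
    if fraction_done <= 0.0:
        raise ValueError("fraction_done must be positive")
    return elapsed_s * (1.0 - fraction_done) / fraction_done


# Assumption: % completed freezes at 90% once the follow-up stage starts.
STALLED_FRACTION = 0.90

# Elapsed time keeps growing while the reported fraction stays put.
elapsed_samples = [5400, 5700, 6000]
estimates = [naive_remaining(t, STALLED_FRACTION) for t in elapsed_samples]

for t, est in zip(elapsed_samples, estimates):
    print(f"elapsed {t:5d} s, 90% done -> about {est:.0f} s remaining")
```

With the fraction frozen, every extra second of elapsed time pushes the extrapolated total up, so the "remaining" figure rises even though the task is getting closer to finishing.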

TB_7 wrote:

I think that every "failure to validate" error I've gotten in the last couple of weeks was for a file where I suspended the task while it was in count-down limbo. Perhaps some internal index gets mis-set during that time or parts of memory are overwritten. Like I said, dunno. I'm a linguist, not a programmer.

When you suspend a task, it may remain in memory ready to proceed immediately when allowed, or it may be removed completely, depending on your BOINC settings. At regular intervals while crunching, the current status of a task is saved on disk as a checkpoint file. If you suspend a task and your settings say it should be removed from memory, it will be. In that case, when the task is resumed, it will be reloaded into memory from a checkpoint file, if such a file exists. Usually, the interval between checkpoints is of the order of several minutes but it can be a lot longer than that. So it is possible to see a resumed task go back to an earlier stage when it's reloaded from a checkpoint file. If you have sufficient memory (these tasks can be large) you can save the waste of some crunching by keeping tasks in memory when suspended. Better still, stop suspending tasks. It's not needed in normal circumstances.
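The checkpoint behaviour can be sketched in a few lines of Python. This is only a toy model under my own assumptions: the JSON file, the function names and the step counts are hypothetical and have nothing to do with the real format used by the science apps or by BOINC.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the last saved step, or 0 if no checkpoint file exists."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["step"]


def crunch(path, total_steps, checkpoint_every, suspend_at=None):
    """Work forward from the last checkpoint; return the step reached in memory."""
    step = load_checkpoint(path)
    while step < total_steps:
        if suspend_at is not None and step == suspend_at:
            break  # suspended and removed from memory: in-memory state is lost
        step += 1  # one unit of crunching
        if step % checkpoint_every == 0:
            with open(path, "w") as f:
                json.dump({"step": step}, f)  # save current status to disk
    return step


with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, "checkpoint.json")
    # Suspend at step 37; checkpoints were written at steps 10, 20 and 30.
    reached = crunch(ckpt, total_steps=100, checkpoint_every=10, suspend_at=37)
    resumed_from = load_checkpoint(ckpt)
    lost = reached - resumed_from
    # Resuming reloads from the checkpoint and runs to the end.
    final = crunch(ckpt, total_steps=100, checkpoint_every=10)

print(f"reached {reached}, resumed from {resumed_from}, redid {lost} steps, finished at {final}")
```

The work done since the last checkpoint has to be repeated, which is the "going back to an earlier stage" you can see after a resume. Nothing in the saved state is damaged by this; it just costs time.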

I had a quick look at your current list of invalid tasks. There is just one marked as invalid and 131 that are valid. Different hardware, different operating systems, different compute libraries can all have an impact on the precise final results returned to the project. It's generally accepted that these small differences cause approximately 0.5% to 1.0% of tasks to fail validation. Since you have 1 out of 132, that seems pretty normal. It is unlikely that an invalid result would have anything to do with a task having been suspended at some point.
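The arithmetic is easy to check. A short Python sketch, using the counts from your task list and the rough 0.5% to 1.0% range mentioned above (the variable names are just for illustration):

```python
invalid, valid = 1, 131
total = invalid + valid

# Percentage of returned tasks that failed validation.
rate_pct = 100.0 * invalid / total

# Rough range generally accepted as normal, as quoted above.
within_normal_range = 0.5 <= rate_pct <= 1.0

print(f"{invalid}/{total} = {rate_pct:.2f}% invalid, within 0.5-1.0%: {within_normal_range}")
```

With only one invalid result the percentage is a very coarse figure, so treat it as a sanity check and not as a measurement.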

This thread was about a different type of error - a validate error, and over at albertathome.org, the test site. A validate error is one where what was returned to the project is so scrambled that it can immediately be declared as rubbish without having to go through the formal validation process. These are usually caused by hardware problems or equipment being operated well outside the normal limits (inappropriate clock speed or voltage adjustments). You have no examples of validate errors.

Cheers,
Gary.
