Never Had This Happen Before

MAGIC Quantum Mechanic
Joined: 18 Jan 05
Posts: 1855
Credit: 1343102740
RAC: 1513429
Topic 198075

http://einsteinathome.org/task/496820098

http://einsteinathome.org/host/11723466/tasks&offset=0&show_names=1&state=5&appid=0

Something told me to get up and go upstairs to check my crunchers and sure enough this is the first thing I see.

But I expected any problem would be with one of my CERN tasks.

I was lucky to abort 272 of them and actually get new ones without having to wait up to 24 hours.

I did turn off the Precision X since it has been having problems and it is running ok again now.

I will check again in about 4 hours and see if it completed the new ones.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5870
Credit: 115738929571
RAC: 34910953

Never Had This Happen Before

Quote:
... I was lucky to abort 272 of them and actually get new ones without having to wait up to 24 hours.

When you say, "abort 272", do you really mean that you had 272 tasks showing 'computation error' and that you just reported them to get rid of them? When I've seen this previously, the errored tasks are just sitting there with a 24 hour project backoff just ticking down. Is that what you had? And then did you just 'update' the project to report them and get some replacements?

This is the actual failure message:-


app_version download error: couldn't get input files:

einsteinbinary_BRP4_1.00_graphics_windows_intelx86.exe
-120 (RSA key check failed for file)
signature verification failed

I've seen this sort of thing many times over the years and I've always wondered why BOINC can't be just a bit smarter. OK, so there's a problem with a required file - the graphics app in this case - not actually required at all for pure crunching. You would think that BOINC could at least try to download a fresh copy of such a file and test it again before taking the rather drastic action of immediately trashing every single task in the entire cache that supposedly depends on the 'faulty' file. Even if it can't download an acceptable replacement, you would hope that it could just suspend that particular science run for a period and alert the user that intervention is required. If it didn't get a response in a reasonable time, it could then trash the offending tasks as a last resort.
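
Just to make that "download a fresh copy and test it again" idea concrete, here is a minimal sketch of a generic fetch-and-reverify routine. This is plain Python for illustration only, not anything in the BOINC client; the function name, URL, paths and checksum are all placeholders.

# Sketch only: fetch a fresh copy of a suspect file and re-check its MD5
# a couple of times before giving up. Nothing here is real BOINC code.
import hashlib
import urllib.request

def fetch_and_verify(url, dest_path, expected_md5, retries=2):
    """Download url to dest_path and return True only if its MD5 matches."""
    for attempt in range(retries):
        urllib.request.urlretrieve(url, dest_path)      # fetch a fresh copy
        h = hashlib.md5()
        with open(dest_path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        if h.hexdigest() == expected_md5.lower():
            return True     # good copy arrived - nothing needs to be trashed
    return False            # still bad after retries - suspend/alert instead

Only if that still failed would you suspend the affected tasks and alert the user, rather than immediately erroring out every task in the cache.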

I can remember seeing a few similar examples a couple of years ago before I had proper racking and ventilation of sets of hosts. Something seems to go wrong with the actual signature check itself rather than a file which was 'good' suddenly going 'bad'. I haven't seen it at all in recent times.

I'm only guessing but here is my assessment of what happens. Each time a new task is to be launched, BOINC will check the 'signatures' of needed files that are listed in the state file as needing a signature check. Sometimes, for whatever reason, something seems to go wrong with BOINC's calculation of the checksum. A lot of the time these seem to be random events which don't recur. On one occasion I had this problem continue on the one machine over several weeks, with gaps of a couple of days between recurrences.

This particular recurring example was with GW tasks and the problem was a supposedly bad signature for an h1_... or l1_... data file, or occasionally with the 'sun' or 'earth' ephemeris files. I would actually run a manual MD5 check on the supposedly bad file and it would get the correct MD5 checksum. I could edit the state file to remove the error condition on the particular data file and then restart crunching, either with the same supposedly 'bad' file, or with a freshly downloaded copy, and in both cases there was no immediate further problem. Eventually, maybe days later, a different data file would be reported as bad and all tasks in the work cache that depended on that file would be immediately trashed and BOINC would go into a 24 hour backoff.
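
For anyone who would rather script that manual MD5 check than do it by hand, here is a minimal sketch in Python. It assumes the usual client_state.xml layout where each <file_info> entry carries <name> and <md5_cksum> children; the paths, including the Einstein@Home project directory name, are only examples - adjust them for your own install.

# Sketch: recompute a suspect file's MD5 and compare it with the checksum
# recorded in client_state.xml. All paths below are examples only.
import hashlib
import os
import xml.etree.ElementTree as ET

DATA_DIR = r"C:\ProgramData\BOINC"                          # example location
PROJECT_DIR = os.path.join(DATA_DIR, "projects", "einstein.phys.uwm.edu")
SUSPECT = "einsteinbinary_BRP4_1.00_graphics_windows_intelx86.exe"

def md5_of(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

root = ET.parse(os.path.join(DATA_DIR, "client_state.xml")).getroot()
for fi in root.iter("file_info"):
    if fi.findtext("name") == SUSPECT:
        recorded = (fi.findtext("md5_cksum") or "").strip()
        actual = md5_of(os.path.join(PROJECT_DIR, SUSPECT))
        print("recorded in state file:", recorded)
        print("freshly computed:      ", actual)
        print("MATCH" if recorded == actual else "MISMATCH")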

This was happening from time to time on a particular machine and I developed a 'quick recovery' technique. Because of the 24 hour backoff, there was plenty of time to notice the failed tasks before BOINC actually got around to reporting them.

In a nutshell: firstly, stop BOINC and edit the state file to 'fix' the entry for the supposedly corrupt file - very simple to do. Secondly, delete the <result> ... </result> blocks for all the trashed tasks. This is particularly easy to do if the entire cache of work is trashed; it's a bit more tedious if you have to pick out individual trashed results from in between good ones, as listed in the state file. Thirdly, check the entire state file for negative status values - -161 springs to mind as a common value - it needs to be 0, if I remember correctly - check the state file of a 'good' machine if in doubt.

When you restart BOINC all the errored tasks will be gone and an 'update' will allow the scheduler to start sending them back to you as "lost tasks". Works a treat. The big bonus is that you get to do your own tasks without them having to be sent out as 'resends' to somebody else.

You can even save a copy of all the downloaded data files and replace them after restarting BOINC but before requesting the lost tasks. BOINC tends to delete the original data files if the state file no longer contains entries referring to them, so after BOINC starts and does its deletions, just put the copies back. Can save a lot of downloading when the lost tasks are resent.
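
If you want a quick way to spot those negative status values and the errored results before hand-editing anything, something like this works. It's only a sketch - run it against a copy, with BOINC stopped, and check the element names against your own state file rather than taking them as gospel; the path is just an example.

# Sketch: list <file_info> entries with a negative (error) status and
# <result> entries carrying a non-zero exit status in client_state.xml.
import xml.etree.ElementTree as ET

STATE_FILE = r"C:\ProgramData\BOINC\client_state.xml"       # example location

root = ET.parse(STATE_FILE).getroot()

# Files that BOINC has flagged with a negative (error) status
for fi in root.iter("file_info"):
    status = int(fi.findtext("status", "0"))
    if status < 0:
        print("file_info", fi.findtext("name"), "-> status", status)

# Results carrying a non-zero exit status (errored/trashed tasks)
for res in root.iter("result"):
    exit_status = int(res.findtext("exit_status", "0"))
    if exit_status != 0:
        print("result", res.findtext("name"), "-> exit_status", exit_status)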

I eventually 'solved' the problem in the example above by replacing the RAM. The memory tested OK but something must have been getting corrupted very occasionally during the MD5 check. After replacing the RAM, there were no further issues with that particular host.

Cheers,
Gary.

MAGIC Quantum Mechanic
Joined: 18 Jan 05
Posts: 1855
Credit: 1343102740
RAC: 1513429

RE: When you say, "abort

Quote:
When you say, "abort 272", do you really mean that you had 272 tasks showing 'computation error' and that you just reported them to get rid of them? When I've seen this previously, the errored tasks are just sitting there with a 24 hour project backoff just ticking down. Is that what you had? And then did you just 'update' the project to report them and get some replacements?

Yes that is basically what happened.

I always check my hosts to see if they are running, and any problems always used to be from those VB tasks (CERN). It just happened again, with a couple of hundred tasks all changed to "Computation Error" at the exact same time.

After that first time I decided to update to the newest BOINC and the newest VB, and it ran normally for a day (April 1st). Then, after completing 13 tasks, it did this again; the only difference is that it would NOT give me any new tasks and has that 24 hour delay running.

In the past when that happened I would try to update several times over that 24 hours and could get a day's worth.

It has 12GB of RAM that tests OK, but I have some extra I could try.

And I will check the files like you mentioned... and see if I can get new tasks. Right now I have it running vLHC x2, Atlas x2 and a CMS-dev.

MAGIC Quantum Mechanic
Joined: 18 Jan 05
Posts: 1855
Credit: 1343102740
RAC: 1513429

Well by checking the *24 hour

Well, by checking the *24 hour project backoff*: when it was down to a bit over 17 hours remaining I gave it an update try and got 8 tasks, so it is back up and running on the 660Ti SC. I set it so it won't ask for more until I take a look tomorrow.
