O2AS20-500T ... questions and problems

Richie
Joined: 7 Mar 14
Posts: 403
Credit: 1,516,920,584
RAC: 2,632
Topic 213286

What exactly will happen if I set 'YES' for this option, but my system doesn't support it:


Run Linux app versions built with LIBC 2.15:

YES / NO
 
This ensures compatibility with new Linux systems that have virtual syscalls disabled, but breaks compatibility with older systems with (G)LIBC prior to 2.15


 

 
Would my computer receive wrong tasks? Would they start to run but then crash?
 
I'm asking because I had set YES on that option. Then my host received those LIBC215 tasks. One of them ran for some time, but then it suddenly crashed, and all the rest of those queued tasks crashed too (about 40 of them). One v0.01 task also crashed along with them (it was 99% finished at that point).
 
A problem with hardware (or software other than BOINC or this app) might have occurred, but I'm not sure.
 
Those tasks got this kind of "MD5 check failed" error:

WU download error: couldn't get input files:
<file_xfer_error>
  <file_name>l1_0210.15_O2C01Cl1In1.I7tn</file_name>
  <error_code>-119 (md5 checksum failed for file)</error_code>
  <error_message>MD5 check failed</error_message>
</file_xfer_error>
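One quick way to rule out real corruption after an error like this is to recompute the file's MD5 independently and compare it with the checksum the client expects (which lives in the file's `<file_info>` entry in client_state.xml). A minimal sketch; the helper name is made up and the comparison digest is a placeholder:

```python
import hashlib

def md5_of(path, chunk_size=1 << 20):
    """Compute a file's MD5 in chunks, so large data files are fine."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare against the digest recorded in the file's <file_info> entry
# in client_state.xml (placeholder shown here):
# md5_of("l1_0210.15_O2C01Cl1In1.I7tn") == "<digest from client_state.xml>"
```

If the freshly computed digest matches the recorded one, the file on disk is fine and the error happened somewhere else.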

Here's the one that had run for almost an hour before crashing with that same error:

https://einsteinathome.org/task/728479473

Is it strange that a task can run that long and then get an MD5 checksum error?

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 4,835
Credit: 28,399,145,391
RAC: 35,964,845

Richie_9 wrote:
What will exactly happen if I set 'YES' on this option, but my system doesn't support it:

You'll have to wait for Bernd to answer that, probably.  I have no idea.

I'm responding because of these two bits:-

Quote:
... (it was 99% finished at that point).

and

Quote:
WU download error: couldn't get input files: <file_xfer_error> <file_name>l1_0210.15_O2C01Cl1In1.I7tn</file_name> <error_code>-119 (md5 checksum failed for file)


I have seen this sort of thing happen, fairly infrequently, but over many years, so the total number of times does add up.  When all tasks depending on a certain file get trashed like this, BOINC will quite often go into a long project backoff (15-20 hours or more) with a timer counting down.  This effectively leaves all the failed tasks just sitting there waiting for the backoff to end.  When I notice this sort of event, I've always been able to retrieve the failed tasks by using the 'resend lost tasks' BOINC feature that Einstein has enabled.  Essentially all I do is delete all the computation error <result> entries in the state file and, when ready to restart, force the client to update with the scheduler and thereby receive fresh copies of all those tasks.
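The state-file surgery described above (stop BOINC, strip the errored `<result>` entries from client_state.xml, then update the project so 'resend lost tasks' kicks in) could be sketched roughly like this. This is a simplified illustration only: the sample XML is a stand-in, not the real client_state.xml schema, and the function name is made up.

```python
import xml.etree.ElementTree as ET

def strip_failed_results(xml_text, failed_names):
    """Drop <result> entries whose <name> is in failed_names."""
    root = ET.fromstring(xml_text)
    # Assumes <result> elements sit directly under the root element,
    # as they do in a BOINC state file.
    for result in list(root.findall("result")):
        if result.findtext("name", "") in failed_names:
            root.remove(result)
    return ET.tostring(root, encoding="unicode")

# Tiny stand-in for a real state file (schema greatly simplified):
sample = """<client_state>
  <result><name>task_ok</name></result>
  <result><name>task_err</name></result>
</client_state>"""

cleaned = strip_failed_results(sample, {"task_err"})
```

With the real file you would parse client_state.xml while the client is stopped, write the cleaned tree back, restart BOINC, and force a project update so the scheduler resends the lost tasks.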

Each time I have done this, I have found that the supposedly corrupt file is not actually corrupt at all.  I have two theories (and that's all they are) about the causes of this.  In two cases that I know of, one cause seems likely to be a RAM error.  I usually run memtest when these failures occur and a couple of errors were found in those two cases.  Replacing the faulty RAM module caused the problem to go away.

However, in other cases, no RAM errors were found.  A fairly common factor in this second larger group is that the problem occurred right near the completion of a task.  My theory (and I could be quite wrong) is that somewhere around the start of a new task, BOINC performs some sort of MD5 check of the files that will be needed for the next task to start.  For some reason, this check can give a bogus result and so the whole cache of work depending on that bogus result gets trashed.

It appears that if the machine is under stress (heat stress springs to mind because these events seem to occur at times of heatwaves) the checksum can be miscalculated.  This is all I can come up with since, when I run an independent check of the file, it is never found to be corrupt.  I probably shouldn't say 'never' because I used to just replace the files (without testing) with known good copies.  Then I started testing and found no errors.  For the last couple of years, I've just fixed the state file and put the machine back to work without even testing.  In no case that I can think of has crunching failed to restart because a file really was corrupt.
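The 'miscalculated checksum' theory suggests a crude sanity check: hash the same file several times in a row. On healthy hardware every pass agrees, so disagreement points at flaky RAM or heat rather than a genuinely corrupt file. A hypothetical helper, not part of BOINC:

```python
import hashlib

def stable_md5(path, repeats=3):
    """Hash the same file several times; on healthy hardware every pass
    agrees, so disagreement hints at flaky RAM or heat, not a bad file."""
    digests = set()
    for _ in range(repeats):
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        digests.add(h.hexdigest())
    # One unique digest means the passes agreed; None flags a mismatch.
    return digests.pop() if len(digests) == 1 else None
```

A transient fault is of course not guaranteed to show up in a few repeats, so a clean result here doesn't replace a proper memtest run.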

Quote:
Is it strange that a task is able to run so long and then gets MD5 checksum error?

That's what drove me to start investigating all those years ago.  My very first theory was a bad block suddenly appearing with a data file spanning it.  I tried renaming the file to file.bad and installing a fresh copy.  The problem kept coming back, maybe days later, and with a different data file.  In the end, after running disk checks and not finding bad blocks, I had to come up with different theories.  I really remember that initial case quite clearly because the tasks would get trashed every few days or so - until I worked out it was a RAM problem :-).

 

Cheers,
Gary.

Richie
Joined: 7 Mar 14
Posts: 403
Credit: 1,516,920,584
RAC: 2,632

Thanks for the ideas and the interesting information from your long-term observations, Gary!

Gary Roberts wrote:

... I have two theories (and that's all they are) about the causes of this.  In two cases that I know of, one cause seems likely to be a RAM error.

...

It appears that if the machine is under stress (heat stress springs to mind because these events seem to occur at times of heatwaves) the checksum can be miscalculated.

I believe both of those scenarios are actually quite possible and may have occurred. I've been fiddling with a few different RAM modules and configurations lately (timings, voltages). I must admit I've been too lazy to run any RAM tests. I tested stability 'by eye'... by watching over a few days that the system was 'stable enough' when running 1 GPU task + a couple of CPU SSE2/3 tasks at the same time. I had mainly aimed to keep voltages low... and wasn't planning on loading the system much at all. It has been running under very light load most of the time.

I don't know what kind of CPU instructions these tuning tasks use when running on Linux (this problematic system runs Linux). Someone mentioned they might be using AVX instructions on Windows. My observations so far support a feeling that these tasks stress the CPU a bit more than SSEx tasks (from Asteroids@Home, for example). I have a real-time ammeter, and when I tried running 3 of these tasks it drew slightly more amps than running 3 SSE tasks.

So it might be that my 'tuned-by-lazy-eye' system was 'stable enough' for a given load of SSE tasks, but not stable enough for a similar or slightly higher load of these tuning tasks. Then the system hit a moment where it had to work harder (finishing one task and starting the next) and it was too much.

Quote:
When all tasks depending on a certain file get trashed like this, BOINC will quite often go into a long project backoff (15-20 hours or more) with a timer counting down.

Luckily (and to my surprise) BOINC didn't go into that backoff. I rebooted the host and lowered the total number/load of tasks in advance, before starting to crunch anything. I also changed the LIBC215 option to 'NO'. Then I updated Einstein... and BOINC delivered one task without delay... and seemed ready to deliver more, one at a time. Today I see there haven't been any more problems with crashes or invalids. I'll keep monitoring.

Richie
Joined: 7 Mar 14
Posts: 403
Credit: 1,516,920,584
RAC: 2,632

Short update to this... the problem came back. Five more tasks errored out with 0 sec run time.

There were three different versions of Stderr output among them:

https://einsteinathome.org/task/728752766

https://einsteinathome.org/task/728752770

https://einsteinathome.org/task/728752776

 
