Ubuntu Linux 11.10
BOINC 6.12.33
(2) GTX 460, OC @ 800 MHz
The system runs fine for days, but then I'll get an error. Either the client just quits or ALL the Einstein tasks error out in the same minute.
Here is what starts it:
[error] Signature verification failed for einsteinbinary_BRP4_1.31_x86_64-pc-linux-gnu__BRP4cuda32nv270
After this, every task immediately errors out.
I don't want to keep giving errors to the project.
Anyone know what could be causing this?
Copyright © 2024 Einstein@Home. All rights reserved.
Linux Computation Error: Once one goes, they all go.
I think there's a file descriptor leak bug in BOINC versions before 6.12.34 (?) and that might be it. I had the same problem (BOINC crashed after N days).
Try upgrading to 6.12.34 or, better yet, 6.12.43 if you can compile it from source.
7.0.x is also an option, but make sure you read the readme because there are some changes to the workunit cache settings.
It's not just a Linux thing. One of my Windows boxes went kersplat recently, and Gary Roberts says he sees it semi-regularly on his boxes.
http://einsteinathome.org/node/196669
RE: ... Either the client
I have a machine that does this too. It's a 3570K based host, overclocked to 4.2GHz, with a GTX650Ti. It's running GW tasks on all cores and x2 BRP4 tasks on the GPU. About every week or so, the client will just quit. No tasks have errors - it's just like the host gets sick of the stress and tells the client to get lost for a while :-). It has all new components and an after-market CPU cooler and it's in a relatively cool environment. I just restart the client and it continues happily from where it left off. It's running PCLinuxOS, fully updated. The machine actually hosts a local repository for PCLinuxOS packages which I update using rsync once a week from one of the official PCLinuxOS mirrors. I keep the local repo on a USB external hard drive and use it to update all my PCLinuxOS hosts. Updating is lightning fast from a local repo like this and I like the 'download once, deploy to many' philosophy which saves heaps of bandwidth :-). Apart from the BOINC client deciding to quit every once in a while, the machine is running perfectly smoothly.
Which is completely to be expected if BOINC thinks your copy of the app is corrupt.
I've seen this very behaviour (or slight variants of it) many times over the years, so welcome to the club :-). The thing that occurs to me is that BOINC (having decided that a vital component is corrupt) perhaps should try to get a fresh copy of that component rather than just trashing all the tasks. You would think things could be put on hold for a bit, at least while an attempt was made. However I've never seen BOINC do anything other than trash the entire cache.
Most of the examples I've seen have been with GW tasks with one of the large data files suddenly being declared corrupt. Because of locality scheduling, it's quite likely that most if not all tasks in the cache could depend on a single large data file. I've also seen examples of either the 'sun' or 'earth' or 'skygrid' files suddenly failing the MD5 checksum check. A long time ago, when I saw my first examples of this, I got suspicious as to why files in continuous use should suddenly be corrupt. I decided to check and to my surprise, on many occasions, the files in question were not actually corrupt at all. I could use a separate program to generate a checksum and it agreed perfectly with what was stored in the state file.
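For anyone wanting to repeat that kind of check, something along these lines should do it. It's only a sketch: the file path and the recorded checksum are things you'd supply yourself (copy the checksum stored for that file out of client_state.xml), and the function names are made up for illustration.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading in chunks so large data files are fine."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_checksum(path, recorded_md5):
    """Compare a freshly computed digest with the value copied from the state file."""
    return md5_of_file(path) == recorded_md5.strip().lower()
```

If the two values agree, the file on disk is fine and the 'corruption' was most likely a transient glitch at the time BOINC did its own check.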
I'm sure that some of these failures were due to transient hardware issues under the high temperature, high stress conditions of crunching. I'm sure that flaky RAM has caused quite a few. Also some were due to old disks developing bad sectors, etc. In these cases, the files were actually bad. I don't delete any files that are bad. I simply rename them with a .BAD extension and then get a fresh copy.
When one of these problems occurs, BOINC seems to trash the affected tasks and go into a 24hr backoff, with all the errored tasks just sitting there in the tasks tab and the clock counting down. You could just 'update' to force BOINC to report them all but, with a bit of experimenting, I think there is a perfectly viable recovery procedure. You need to be familiar with the contents of the state file (client_state.xml) and you need to be prepared to edit it after stopping BOINC.
In summary, what I do is
* Find and fix all negative <status> values in <file_info> blocks.
* Find and remove any <error_msg> blocks that have been added by BOINC to <file_info> blocks.
* Fix (or completely remove) all the <result> blocks that have been trashed by BOINC.
* Restart BOINC.
This is just a summary of what I do - there are quite a few pesky little details I've glossed over :-) I would experience at least one of these failures per month and, so far, I've always been able to recover the full cache of trashed tasks as long as I notice the problem before the 24 hr backoff expires. Obviously it's too late if BOINC has already reported the damage.
The trickiest bit is handling the <result> blocks appropriately. If there are successfully completed results, you don't (usually) need to touch those but you certainly want to keep them so they can be reported. It appears that BOINC often decides that something is corrupt just as one task is finishing and a new one is starting. This seems (occasionally) to cause the just finished task to be marked as an error. So I usually check and fix that if necessary. The <exit_status> needs to be zero and the <state> needs to be 5.
Results that are not started or not completed should have an <exit_status> of zero and a <state> of 2. Trashed results will have three distinguishing characteristics. The <exit_status> will be -185, the <state> will be 3, and there will be a <stderr_out> block of several lines followed by a <ready_to_report/> tag and a <completed_time> value. All these lines need to be removed because (with the 'corrupt' file checked or replaced) there is no error condition, the result is not ready to report and a completed time has not yet been determined. If you have lots of results in your cache, it can be a bit tedious to fix every last one of them like this.
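Before editing anything by hand, it can help to get a quick tally of which results are in which condition. Here's a rough, read-only sketch. It assumes each result block carries `<name>`, `<exit_status>` and `<state>` tags and that the state file parses as ordinary XML; the sample text and the function name are invented for illustration, and it doesn't modify anything.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for part of client_state.xml.
SAMPLE = """
<client_state>
  <result><name>task_a</name><exit_status>0</exit_status><state>5</state></result>
  <result><name>task_b</name><exit_status>0</exit_status><state>2</state></result>
  <result><name>task_c</name><exit_status>-185</exit_status><state>3</state></result>
</client_state>
"""


def classify_results(xml_text):
    """Group result names by the exit status / state combinations described above."""
    groups = {"finished": [], "waiting": [], "trashed": [], "other": []}
    root = ET.fromstring(xml_text)
    for result in root.iter("result"):
        name = result.findtext("name", default="?")
        exit_status = int(result.findtext("exit_status", default="0"))
        state = int(result.findtext("state", default="0"))
        if exit_status == 0 and state == 5:
            groups["finished"].append(name)   # completed OK, keep for reporting
        elif exit_status == 0 and state == 2:
            groups["waiting"].append(name)    # not started or not completed
        elif exit_status == -185 and state == 3:
            groups["trashed"].append(name)    # marked as an error by the client
        else:
            groups["other"].append(name)      # anything else needs a manual look
    return groups
```

Run it against a copy of the state file (with BOINC stopped) rather than the live one, and treat anything landing in 'other' as something to inspect by eye.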
So, with ~100 tasks to recover, I usually fix a few (a couple per core) and then completely delete all the remaining blocks from the state file. That way, there is work to commence when BOINC is restarted and when I allow the client to contact the server, the server will notice what is missing and send them all back to the client (in batches of 12 per contact) as 'resend lost results'.
I'm not suggesting that this will be viable for the average volunteer. You really do need to be completely comfortable with editing the state file and you need a good understanding of what the various blocks in it should look like. And you need a lot of patience :-). These days I'm quite comfortable with all this and it usually only takes me around 15 minutes to recover from one of these 'events', once I've noticed it :-).
I used to think that too but, if you think about it, the project knows quickly and can resend the tasks so it's no biggie. For me, the biggie is the inconvenience of the 1 task per day per core limit until good results are returned and the extra management to get things back on track. It's worthwhile for me to spend 15 mins and then be immediately back to full speed with a full cache of work and an unsullied tasks per day limit :-).
Cheers,
Gary.
RE: BOINC client deciding
Are you using 6.12.x or 6.10.x? That is definitely the file descriptor leak I mentioned earlier. I don't know exactly when it was fixed, but in 6.12.43, which I'm using now, the problem with BOINC crashing is gone - and I didn't change anything else.
On a faster machine (greater WU turnaround; more files open) it used to crash twice a week; on a slower one it lasted for month(s). Now with 6.12.43 (and 7.0.40 on some) it no longer crashes.
RE: Are you using 6.12.x or
I'm using 6.12.34.
Thanks for this information. I don't have time to follow BOINC development so I wasn't aware of this particular bug in BOINC. The Linux distro I use is very slow with upgrading BOINC packages (they're still on 6.10.x) so I normally grab a version from the BOINC website and install the components manually. I check out the library dependencies with ldd and then find any needed extra library packages and install them. I wanted to go to V7 BOINC on this machine but when I did the ldd check, I found the client depends on version 2.15 of glibc and the latest version in the repo is 2.13 so at that point I put the transition to V7 BOINC on hold. If the FD leak was fixed in a V6 BOINC, I'll give 6.12.43 a spin and hope that the glibc 2.15 requirement didn't come in until V7 :-).
I have a number of machines with GTX650 GPUs that are on 6.12.34 but the fastest one is the only one so far where I've noticed this problem. Looks like I'd better attend to this rather quickly :-). Thanks again for the info.
Cheers,
Gary.
RE: RE: Are you using
I thought I'd add a followup to this story for the benefit of anybody running affected BOINC versions on Linux.
Firstly, many thanks to Khangollo for drawing my attention to the file descriptor leak problem. As reported above, I first noticed the problem on a fast machine running lots of GPU tasks. I have since noticed it on long running hosts doing CPU only. The only version I know for sure is a problem is 6.12.34. A version that is not affected is 6.12.43.
I converted all hosts with GPUs to 6.12.43 some time ago and the problem was completely resolved by that. I had overlooked a couple of CPU only hosts that were also on 6.12.34 and recently I noticed such a machine where BOINC had quit. This was the first of these to show and it should be the last as I've located and upgraded all remaining ones now.
I took the trouble to examine the event log of this machine just prior to BOINC quitting. To do that you have to browse the file 'stdoutdae.txt' which contains the information. Here is the relevant bit
As you can see, a task had just finished and was being uploaded and BOINC was trying to update client_state.xml. It does this by attempting to create a new copy with a different name in order to protect the old copy until the new one is safely written. The new file couldn't be opened (fopen() failed) presumably as a result of file descriptors being exhausted. So BOINC immediately quit to prevent any damage being done. You can see that I restarted with 6.12.34 when I noticed the problem about a day later. Soon after that, I stopped BOINC again and upgraded.
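If you want to see whether a client is heading that way before it quits, a rough check on Linux is to watch how many descriptors the process is holding compared with its limit. Sketch only: it assumes /proc is mounted, you have to find the client's PID yourself, and you need permission to read that process's /proc entries. The function names are just for illustration.

```python
import os


def count_open_fds(pid):
    """Count entries in /proc/<pid>/fd, one per open file descriptor (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def fd_soft_limit(pid):
    """Read the soft 'Max open files' limit from /proc/<pid>/limits, or None if unavailable."""
    with open(f"/proc/{pid}/limits") as handle:
        for line in handle:
            if line.startswith("Max open files"):
                value = line.split()[3]  # fields: Max open files <soft> <hard> <units>
                return None if value == "unlimited" else int(value)
    return None
```

A count that keeps creeping up over days on an otherwise steady workload, and gets close to the soft limit, would be consistent with the leak described here.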
So for anybody running Linux with a BOINC version of 6.12.x or possibly even 6.10.x, you might consider upgrading to 6.12.43 if you can't (or don't want to) go to 7.0.x. This could mean that you might need to go outside your distro's repository system if it doesn't contain a recent enough version. I had to do that even to go to 6.12.34 in the first place as my distro's BOINC version was 6.10.x if I remember correctly.
Cheers,
Gary.
Hmmmm...
Interesting. I've been plowing through all my project message boards to get back up to speed with what is and has been going on since I took my sabbatical.
This is the first one I've come across discussing 'cascade' failures with BOINC (other than ones where I mentioned it).
I noted you talked about it in terms of the effect on (and remedies for) EAH.
So the question I have for you is when they happen are they just killing EAH or does it get any other projects you have running as well. The reason I ask is when I have one on Windows I can't remember a single time it didn't take out everything the host had in the cache at the time the cascade started.
Of course, that could just be because I never seem to be in front of a host when the cascade starts. Even if I had been though, when I examine the post mortem wreckage and see how fast it progressed through the cache, I'm not sure I could have done anything to stop it anyway. ;-)
Just as a side note to your remarks about BOINC versions. At least on Windows, I never had one occur until I was running a CC in the late 6x series. I want to say that it wasn't until they moved to 7 as the official and I upgraded to 6.12.34 since that was the last 'recommended' version (IIRC), but I just don't remember for sure. I haven't had one on the 7x host I have (yet). :-D
RE: So the question I have
Only E@H is affected in the (many) examples I've seen over the years. Most of my hosts run E@H only but I have some running LHC as well and I've seen an example where E@H is on 24 hour backoff with all tasks trashed and LHC still crunching. The OS hasn't crashed. BOINC just decided that a required file was corrupt and so immediately marked all tasks that depended on the corrupt file as 'computation errors'. If the corrupt file is the app itself, all tasks are trashed. If the corrupt file is one of the large data files, there may be some tasks that don't depend on that particular file and they may still be crunching. I've seen that happen too.
My guess is that the MD5 calculation may give an erroneous answer if the machine is too highly stressed at the time because quite often the file is not actually corrupt. I put the frequency of the problem down to the fact that I run a lot of hosts in non-air-conditioned, sub-tropical space, many of which are (moderately) overclocked and the frequency of the problem always goes up considerably in summer.
You can't. I've seen one happen and it's all over in a flash :-).
A lot of mine have occurred with 6.2.15 but I've seen them in later 6.x versions as well. I have only a few hosts on V7 and haven't seen any examples on those so far.
Cheers,
Gary.
Something similar seems to have just happened to me; a whole bunch of BRP4 tasks all just errored out like this within a few seconds of starting.
Though, poking about the results, quite a few of the tasks have also failed for my wingmen (who all seem to be running Windows). Could it be that some (or even one?) of the iffy workunits has somehow caused a cascade failure? Or is there a bad batch of jobs? Something else?
Failures on the affected host here
A few sample tasks that also errored for others 1 2 3 4
I did find a few jobs that a wingman managed to complete 1 2 3
As yet, none of the failed jobs have validated ok for anyone else as far as I can see.
RE: RE: A lot of mine
I had a bitrot crash with a 7.x client a few months ago.