High frequency of client errors

Walt Gribben
Joined: 20 Feb 05
Posts: 219
Credit: 1645393
RAC: 0

Message 6528 in response to message 6527

Thanks for the information. The auto update tool should have details on what was installed when, or you could check the event log. I don't think the updates have much to do with the problem, but if you see some graphics driver updates on the 24th, they might.

>..... The only example I've got where there's
> an unrecoverable error logged for Einstein@Home follows on from a message that
> the result is being paused and then one that protein predictor is starting.
> There is also a log message of protein predictor requesting work within the
> same second that the failure occurs. I suppose this may be significant or just
> a coincidence.

That's what I'm looking for, one WU starting and another failing. Do you have the "leave in memory" preference set to "yes"? You'll see messages saying whether it leaves suspended apps in memory or removes them.

This page has some logging and tracing options for BOINC; it might be worth turning a couple of them on. Specifically, the one to "log the start, restart and completion of computational tasks."

Create a file in the BOINC directory named log_flags.xml, edit it with notepad to add the trace options, save it, then stop and restart BOINC. The lines to add are:

You won't see anything in the message pane, but they'll be written to the stdout.txt file. When you see the error, copy stdout.txt and stderr.txt so the messages don't get lost. If you restart BOINC, they get renamed to stdout.old and stderr.old so just one restart won't toss the messages.
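The copy step described above can be scripted so the logs are saved before a second restart overwrites the .old files. This is only a minimal sketch: the function name, the timestamped file naming and both directory paths are assumptions for illustration, not part of BOINC.

```python
import shutil
import time
from pathlib import Path


def backup_boinc_logs(boinc_dir, backup_dir, names=("stdout.txt", "stderr.txt")):
    """Copy the named log files into backup_dir with a timestamp in the name.

    Returns the list of copies that were made. Files that do not exist
    in boinc_dir are skipped silently.
    """
    boinc_dir = Path(boinc_dir)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    # One timestamp per run so stdout and stderr copies can be paired up later.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    copied = []
    for name in names:
        src = boinc_dir / name
        if src.is_file():
            dst = backup_dir / f"{src.stem}-{stamp}{src.suffix}"
            shutil.copy2(src, dst)  # copy2 also preserves the modification time
            copied.append(dst)
    return copied


# Example call (both paths are hypothetical; use your own BOINC directory):
# backup_boinc_logs(r"C:\Program Files\BOINC", r"C:\boinc-log-backups")
```

Run it right after an error shows up, before restarting BOINC again.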

David Worton
Joined: 22 Feb 05
Posts: 20
Credit: 45824
RAC: 0

My "leave in memory" preferences are set to "no". I'll restart BOINC with the extra logging xml control file and see what turns up.

Walt Gribben
Joined: 20 Feb 05
Posts: 219
Credit: 1645393
RAC: 0

Message 6530 in response to message 6529

> My "leave in memory" preferences are set to "no". I'll restart BOINC with the
> extra logging xml control file and see what turns up.
>

Thanks, that narrows it down a bit. It means that when BOINC "paused" Einstein@Home, it was really telling it to stop processing and exit.

So, the question is, did BOINC wait until the application was actually finished before it started the next one? Somehow I don't think so, and maybe this is another problem with starting work on a new WU.

Could be that if you set your "leave in memory" preference to "yes" you'd get an "exited with zero status but no finished file" message instead. If you continue to get the errors with Einstein@Home, and they always coincide with starting a WU for another project (or restarting one), see if setting the preference to "yes" changes anything.

David Worton
Joined: 22 Feb 05
Posts: 20
Credit: 45824
RAC: 0

The latest situation is that, after turning the logging on, I now seem to have processed a work unit successfully... I'm not sure whether to be pleased about this or vexed that it didn't raise an error when I was watching it!

I've checked my XP event logs and can confirm that apart from the BOINC upgrade from 4.19 to 4.22, which I took midway through the processing of the first successful unit, there was nothing installed between the start of the processing of the 1st successful unit and the end of the processing of the 1st failure.

As I'm now processing with a blank screen saver this may be what's cured things. Alternatively, though, I've noticed that there was an unusual pattern of behaviour in the other projects whilst this unit was being processed. SETI & Protein Predictor were unable to supply work units and had run out of work, so Einstein was only being paged in and out against CPDN and LHC. If, as Walt suggests, the error is caused by another project starting up before Einstein has terminated properly it could be that the offending project didn't swap in against Einstein this time and hence I was successful. My main suspect would be Protein Predictor 'cos of the one log entry I've seen where that started immediately after the error.

I'll leave the logging on, continue with no screen saver, and watch with interest to see what happens with future units.

Bruce Allen
Moderator
Joined: 15 Oct 04
Posts: 1119
Credit: 172127663
RAC: 0

Message 6532 in response to message 6531

> I'll leave the logging on, continue with no screen saver, and watch with
> interest to see what happens with future units.

David, thank you. If you are willing to sacrifice your credits 'for the general good of the project' I would be grateful if you continue to work with Walt to track down the source of these errors. In other words, Walt might ask you to try turning on graphics again or something else, to try to reproduce the problem. Understanding what causes the problem and how to fix it might increase the effective number of trouble-free host machines by hundreds or thousands, so from the project perspective this would be beneficial.

In any case, please just carry on. Walt will let you know if he wants you to try something.

Bruce

Director, Einstein@Home

David Worton
Joined: 22 Feb 05
Posts: 20
Credit: 45824
RAC: 0

Yes, that's fine. I'm more interested in making a contribution than racking up arbitrary credits, so if any of this helps elsewhere, then great. I just didn't want to be returning lots of computation errors for no purpose! I'll keep an eye on this thread, follow Walt's suggestions and post the results of any future trials to see if the problem can be diagnosed.

Bruce Allen
Moderator
Joined: 15 Oct 04
Posts: 1119
Credit: 172127663
RAC: 0

Message 6534 in response to message 6533

> Yes, that's fine. I'm more interested in making a contribution than racking up
> arbitrary credits, so if any of this helps elsewhere, then great. I just
> didn't want to be returning lots of computation errors for no purpose! I'll
> keep an eye on this thread, follow Walt's suggestions and post the results of
> any future trials to see if the problem can be diagnosed.

Thank you!

Director, Einstein@Home

David Worton
Joined: 22 Feb 05
Posts: 20
Credit: 45824
RAC: 0

Einstein@Home seems to have stabilised on my machine. I've now returned six successive successful results. Based on the one unit where I saw an error I have one other theory for the run of bad results. These occurred after I upgraded BOINC from 4.19 to 4.22 and, in the one case I observed closely, at the moment when Predictor was paging in and Einstein was paging out of memory. I noticed on the Predictor site that they were claiming not to fully support BOINC 4.2x until 3/02/05. This is the time when my Einstein units started working. I'm speculating that although Einstein was OK, because Predictor was paging in immediately after it and wasn't fully supported there was some sort of problem which crashed Einstein's page out. A bit of a guess but if it is true the problem could have gone away permanently.

I'm now going to cautiously turn on the BOINC screen saver and see what happens (but still not use any other graphics - the BOINC screen saver on my machine only ever shows a status monitor - nothing fancy). It'll be interesting to see if this brings the problem back or if the screen saver is exonerated by the test. I think it will be...

clarksn
Joined: 28 Feb 05
Posts: 1
Credit: 0
RAC: 0

All of my results have ended in client errors except the first one :(

I'm also running Protein and Climate.

Any ideas?

Steve

David Worton
Joined: 22 Feb 05
Posts: 20
Credit: 45824
RAC: 0

Steve,

Just had a quick look at your results. The first two errors you got were computation errors like mine. But they don't report quite as much detail as my logs which had an address in memory where Einstein@Home had crashed. It looked something like this and was the same every time:-

***UNHANDLED EXCEPTION****
Reason: Access Violation (0xc0000005) at address 0x77F69ECD write attempt to address 0x00000010

So I suspect that your problem and mine aren't the same. Your recent errors seem to be different (the download ones). I haven't had download errors with Einstein@Home at all, though I have had this with the LHC project.

I'm not an expert on any of this, just a humble participant, but I think it might be worth trying the logging options mentioned in this thread to see if there is a pattern of failure. Alternatively, you could try changing your memory paging options to see what happens (again as suggested in this thread). Other than that, maybe someone with a bit more knowledge will come up with a better suggestion.

I see you've got a 2nd machine which hasn't returned any results yet. You might have better luck with that one if the problem is specific to one machine.

Sorry I can't be more helpful!
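To check whether two crashes are "the same every time", as with the unhandled exception quoted above, the report line can be pulled apart and the fields compared. The following is only a rough sketch: the pattern is written against the single example line in this thread, so the field layout is an assumption, and the names EXCEPTION_RE and parse_exception are made up for illustration.

```python
import re

# Pattern modelled on the one example line quoted in this thread (an assumption;
# other client versions may format the report differently).
EXCEPTION_RE = re.compile(
    r"Reason:\s*(?P<reason>.+?)\s*\((?P<code>0x[0-9a-fA-F]+)\)\s*"
    r"at address\s*(?P<fault_addr>0x[0-9a-fA-F]+)"
    r"(?:\s*(?P<access>read|write) attempt to address\s*"
    r"(?P<target_addr>0x[0-9a-fA-F]+))?"
)


def parse_exception(line):
    """Return the fields of an exception report line as a dict, or None."""
    m = EXCEPTION_RE.search(line)
    if m is None:
        return None
    target = m.group("target_addr")
    return {
        "reason": m.group("reason"),
        "code": int(m.group("code"), 16),
        "fault_addr": int(m.group("fault_addr"), 16),
        "access": m.group("access"),  # "read", "write", or None if absent
        "target_addr": int(target, 16) if target else None,
    }


line = ("Reason: Access Violation (0xc0000005) at address 0x77F69ECD "
        "write attempt to address 0x00000010")
info = parse_exception(line)
print(info["reason"], hex(info["code"]), hex(info["fault_addr"]))
```

Two reports whose dicts compare equal crashed at the same address with the same kind of access, which is a reasonable hint (though not proof) that they share a cause.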
