Computation Error on most of my work units

hamesy
Joined: 20 Sep 11
Posts: 3
Credit: 8779218
RAC: 0
Topic 196881

I'm getting a lot of computation errors. It started just after Christmas, and I managed to get it all working for a while by completely removing BOINC and reinstalling.

Not sure if this helps any, but this is the error log from one of my work units:

6.12.34

The system cannot find the file specified. (0x2) - exit code 2 (0x2)

Activated exception handling...
[11:23:33][7132][INFO ] Starting data processing...
[11:23:33][7132][INFO ] CUDA global memory status (initial GPU state, including context):
------> Used in total: 153 MB (872 MB free / 1025 MB total) -> Used by this application (assuming a single GPU task): 0 MB
[11:23:33][7132][INFO ] Using CUDA device #0 "GeForce GT 420" (48 CUDA cores / 134.40 GFLOPS)
[11:23:33][7132][INFO ] Version of installed CUDA driver: 5000
[11:23:33][7132][INFO ] Version of CUDA driver API used: 3020
[11:23:33][7132][INFO ] Continuing work on ../../projects/einstein.phys.uwm.edu/p2030.20110128.G175.45-03.87.S.b1s0g0.00000_966.bin4 at template no. 3700
[11:23:33][7132][ERROR] Input file on command line ../../projects/einstein.phys.uwm.edu/p2030.20121229.G203.13+00.56.S.b1s0g0.00000_952.bin4 doesn't agree with input file ../../projects/einstein.phys.uwm.edu/p2030.20110128.G175.45-03.87.S.b1s0g0.00000_966.bin4 from checkpoint header.
[11:23:33][7132][ERROR] Demodulation failed (error: 2)!
11:23:33 (7132): called boinc_finish


Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5893653
RAC: 4

What have you tried so far to remedy the problems?

- Checked hard drives for consistency?
- Tried other drivers for just about anything in the system?
- Tried different (power) cables on the hardware in the computer?
- Tried other BOINC versions?
- Checked the system for dust bunnies and cleaned those out?
- Tried a different GPU?

hamesy
Joined: 20 Sep 11
Posts: 3
Credit: 8779218
RAC: 0

- Checked hard drives for consistency? Hard drive is fine. Passed every test I've tried.

- Tried other drivers for just about anything in the system? All drivers up to date.

- Tried different (power) cables on the hardware in the computer? Not yet but I'll try that

- Tried other BOINC versions? I was running version 7 and downgraded to 6.12.34. That seemed to work for a week or two, but then the problem came back.

- Checked the system for dust bunnies and cleaned those out? System is pretty much clear.

- Tried a different GPU? Haven't got one to hand I'm afraid.

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 760729515
RAC: 1114250

Hi

Very strange. What seems to be the problem is that a checkpoint file that records the app's state periodically somehow survived the end of a task and messes up the next task starting in the same "slot" directory. At least that's what the error message is telling us: the checkpoint file's information doesn't match the current task's. Removing BOINC fixed the problem the last time because it removed the checkpoint file (status.cpt) along with all other files :-).
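
To make the mismatch concrete: the ERROR line in the log above names two different input files, one given on the command line and one recorded in the checkpoint header. A minimal Python sketch that pulls both names out of a saved stderr log might look like this. The function name and the regular expression are purely illustrative and are based only on the wording of that one log line; they are not part of the Einstein@Home application.

```python
import re

# Pattern modelled on the single ERROR line quoted in this thread (an assumption:
# other app versions may word the message differently).
MISMATCH = re.compile(
    r"Input file on command line (?P<task>\S+) "
    r"doesn't agree with input file (?P<checkpoint>\S+) from checkpoint header"
)


def find_checkpoint_mismatch(log_text):
    """Return (task_input, checkpoint_input) if the log reports a mismatch, else None."""
    match = MISMATCH.search(log_text)
    if match is None:
        return None
    return match.group("task"), match.group("checkpoint")
```

If the two returned paths differ, the checkpoint in that slot belongs to some earlier task, which is the situation described above.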

So you might want to look for those status.cpt files in the slots/x subdirectories of the BOINC data directory that are suspiciously old (older than the other files in those subdirectories) and delete them, then restart BOINC.
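
A rough sketch of that search in Python, in case it helps. It only lists candidates and deletes nothing, so you can review the result yourself first. The function name, the one-hour threshold, and passing the data directory on the command line are my assumptions; "older than the other files" is interpreted here as "older than every other file in the same slot by at least that threshold".

```python
import sys
from pathlib import Path


def find_stale_checkpoints(data_dir, name="status.cpt", min_gap_seconds=3600):
    """List checkpoint files that are older than every other file in their slot.

    data_dir is the BOINC data directory (the one that contains "slots").
    min_gap_seconds is an arbitrary safety margin, not a BOINC setting.
    """
    stale = []
    slots = Path(data_dir) / "slots"
    if not slots.is_dir():
        return stale
    for slot in sorted(slots.iterdir()):
        checkpoint = slot / name
        if not slot.is_dir() or not checkpoint.is_file():
            continue
        # Modification times of all other regular files in this slot.
        others = [p.stat().st_mtime for p in slot.iterdir()
                  if p.is_file() and p.name != name]
        if not others:
            continue
        # Stale if even the oldest neighbour is clearly newer than the checkpoint.
        if min(others) - checkpoint.stat().st_mtime >= min_gap_seconds:
            stale.append(checkpoint)
    return stale


if __name__ == "__main__" and len(sys.argv) > 1:
    for path in find_stale_checkpoints(sys.argv[1]):
        print(path)
```

Run it with the path of your BOINC data directory as the only argument. Anything it prints is a candidate for manual deletion, not a verdict.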

Of course this doesn't explain what kept BOINC from deleting the old checkpoint file in the first place. Maybe a virus scanner was inspecting the file at the same time that BOINC wanted to delete it? Is any other unusual software running on your system that might interfere with BOINC's file I/O?

Cheers
HBE

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5893653
RAC: 4

Which is why I thought of hard drive problems. It looked like a corruption of the data directory, but could well be a lock by anti-virus/anti-malware programs as well.

hamesy
Joined: 20 Sep 11
Posts: 3
Credit: 8779218
RAC: 0

Thanks for your response. At the moment BOINC is running fine, as I reinstalled it yesterday morning, so if the issue comes up again I'll have a look. The antivirus might be causing it, as I've had issues with it before, so I will keep an eye on it.

mikey
Joined: 22 Jan 05
Posts: 12778
Credit: 1864327061
RAC: 1653817

What you can do is exempt the BOINC directories from the A/V scanner. Any real virus will be caught as soon as it moves out into the rest of the file system, and who cares what it does as long as it stays confined to the BOINC directory? I care about my PCs; I can't protect the whole world's PCs.

Andreas
Joined: 19 Oct 05
Posts: 754
Credit: 41507440
RAC: 13108

I am experiencing the same error now too. One day everything was fine, the next just a bunch of compute errors. I participate in a lot of projects, but it's only Einstein that's affected.

Andreas
Joined: 19 Oct 05
Posts: 754
Credit: 41507440
RAC: 13108

Update: I stopped work fetch and then reset the project, but to no avail.

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5893653
RAC: 4

As far as we can figure, this is neither a BOINC problem nor a project application error, so resetting the project won't help. Have you tried setting up your anti-virus software so that the BOINC data directory is excluded from the real-time scanner?

Andreas
Joined: 19 Oct 05
Posts: 754
Credit: 41507440
RAC: 13108

I just did that, and the first wu I got ended up like this:

http://einsteinathome.org/task/362797484

Edit: This one seems to be running fine though (six minutes and counting):

http://einsteinathome.org/task/362782488

Edit2: Why would the anti-virus software cause trouble for E@H all of a sudden? For what it's worth, I use Norton 360.
