Work unit stuck at 100% progress 0:00 time running

leg
Joined: 25 Feb 05
Posts: 7
Credit: 905521032
RAC: 354470
Topic 188753

I have a work unit stuck at 100% progress with 0:00 time to completion; it has been running for about 11 hours without reporting to the server. A subsequent WU in the queue has completed and reported. All previous WUs reported as soon as they finished. Is this a problem? Is there any way to force the WU to report?

Thank you.

Miko
Joined: 22 Jan 05
Posts: 5
Credit: 739574
RAC: 0


Maybe you can check the results on Einstein's website (on your user page?).
During BOINC's beta phase (in SETI@home) there was an option to see either "Result Okay" or the errors.
If you speak German, go to
http://www.crunching-family.wins.info/
There are "family members" there with great knowledge of this stuff!
They can help you understand or fix your results.
(I saw that you are not in a team.)
If you like, you're welcome to become a member; Crunching Family is looking for competent members.
But of course joining is NOT required if you need help!
Regards
Miko

http://www.crunching-family.at - my team

I'm a beekeeper in the Austrian Alps.
If you want to know or learn something about bees, bee life, and beekeeping, just ask me.

Sharky T
Joined: 19 Feb 05
Posts: 159
Credit: 1187722
RAC: 0


Quote:
Is there any way to force the WU to report?


Well, have you tried the update button? :)
But that WU sounds like it's dead somehow. I don't see any holes (errors) on your user page either, so check whether that result is on your results page, as Miko suggested. If it is there and looks OK, then the report was successful and the WU is just stuck in your BOINC Manager.


MarkF
Joined: 12 Apr 05
Posts: 393
Credit: 1516715
RAC: 0


Have you tried the transfers pane? I have had to restart a few uploads from there.

psionix
Joined: 6 Jul 05
Posts: 3
Credit: 395382
RAC: 0


I am experiencing this same issue. BOINC will load up E@H, but E@H will just sit there in a "Running" state using 0% CPU for the duration of its allotted timeslice. It shows 100% complete in the BOINC Manager, yet it does not attempt to upload and shows "In Progress" on the results page. If anyone cares, the WU ID is 1482621 (w1_0657.5__0657.8_0.1_T00_S4hA) and expires (for me) on Jul 14.

Athlon XP 1800
Linux (kernel 2.6)
BOINC v4.45a (compiled)
E@H 4.80

Incidentally, on another, nearly identical machine, E@H will sometimes fail to die when BOINC tries to kill it, and will run simultaneously with whatever project gets scheduled by BOINC.

E@H is not returning very much work on my Linux boxen, which both have low credit. My one Windows machine is racking up the credit. The Linux boxen have been working perfectly fine with SETI and CPDN for some time now. Does anyone else have trouble with E@H on Linux?

Walt Gribben
Joined: 20 Feb 05
Posts: 219
Credit: 1645393
RAC: 0


Since it's still running, take a look at the stderr.txt file. It'll be in one of the slots/n directories, the one with the Einstein application.

I looked at one of the other results already returned and saw these messages:

Quote:

Failed to close the application-lockfile boinc_lockfile: Bad file descriptor
Resuming computation at 12199/684613/693883
Failed to close the application-lockfile boinc_lockfile: Bad file descriptor
Resuming computation at 24923/1169901/1182450
detected finished Fstat file - skipping Fstat run 1
Resuming computation at 6594/380421/399681
Failed to close the application-lockfile boinc_lockfile: Bad file descriptor
detected finished Fstat file - skipping Fstat run 1
Resuming computation at 21229/1384592/1400611
Failed to close the application-lockfile boinc_lockfile: Bad file descriptor
detected finished Fstat file - skipping Fstat run 1
Resuming computation at 27859/1851833/1852189
Fstats.Ha: bytecount 1630915 checksum 78388467
Fstats.Hb: bytecount 2079489 checksum 99935875
detected finished Fstat file - skipping Fstat run 1
detected finished Fstat file - skipping Fstat run 2
Fstats.Ha: bytecount 1630915 checksum 78388467
Fstats.Hb: bytecount 2079489 checksum 99935875

Actually, the results from both your Linux machines show those messages, specifically the ones reading "Failed to close the application-lockfile boinc_lockfile: Bad file descriptor".
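
For anyone who wants to check all slots at once rather than opening each stderr.txt by hand, here is a minimal Python sketch. It assumes the slot directories sit under a BOINC data directory as slots/n/stderr.txt, as described above; the function name `count_lockfile_errors` is just an illustrative helper, not part of BOINC.

```python
from pathlib import Path

# The message text quoted from the returned results above.
LOCKFILE_ERROR = "Failed to close the application-lockfile"


def count_lockfile_errors(boinc_dir):
    """Return {stderr_path: count} for each slot whose stderr.txt shows the error.

    Assumes the layout <boinc_dir>/slots/<n>/stderr.txt (hypothetical helper).
    """
    hits = {}
    for stderr_path in sorted(Path(boinc_dir).glob("slots/*/stderr.txt")):
        # errors="replace" keeps the scan going if the log has odd bytes in it.
        text = stderr_path.read_text(errors="replace")
        count = sum(1 for line in text.splitlines() if LOCKFILE_ERROR in line)
        if count:
            hits[stderr_path] = count
    return hits
```

Calling `count_lockfile_errors("/path/to/your/boinc/dir")` (path is a placeholder) gives a dictionary of the affected stderr.txt files and how many times each one logged the error; an empty dictionary means no slot shows it.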

I see that error sometimes when two applications use the same file (which shouldn't happen for boinc_lockfile), and also when the system crashes and the filesystem is mounted "dirty" on subsequent boots. That is, after a crash the system is supposed to "clean and repair" the filesystems before mounting them; it's part of the startup. But sometimes that doesn't happen, because of a configuration problem or because the user answered "no" to the fixups. Most often it's because everything is on one filesystem and that's mounted read-write at boot time.

Another term for "crash" is turning off the system without going thru the normal shutdown. So the filesystems don't get unmounted properly and that tends to mess them up.

Your boot log should have something about that, if it's a problem.

psionix
Joined: 6 Jul 05
Posts: 3
Credit: 395382
RAC: 0


Thanks Walt for the reply. There hasn't been an unclean shutdown, and my last fsck's have all been good.

Is there a way that I can manually fix this? Should I stop boinc and delete the boinc_lockfiles, or would that be a bad thing?

Thanks again.

Walt Gribben
Joined: 20 Feb 05
Posts: 219
Credit: 1645393
RAC: 0


Quote:

Thanks Walt for the reply. There hasn't been an unclean shutdown, and my last fsck's have all been good.

Is there a way that I can manually fix this? Should I stop boinc and delete the boinc_lockfiles, or would that be a bad thing?

Thanks again.

Did you check the "stderr.txt" files in the slots/n directories?

Stopping BOINC and deleting the lock files should work.

But after boinc stops wait a minute for the science apps to stop and then check that they really did stop. Use ps -ax or something like that, in case one is still running.

After the science apps stopped, delete the lock files and start boinc again.
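
The steps above could be sketched in Python roughly as follows. This is only an illustration under some assumptions: the lock files are named boinc_lockfile and live in slots/n under the BOINC directory, and the process-name fragments "boinc" and "einstein" are guesses at what the client and science app look like in the ps listing on your system. All function names are hypothetical helpers.

```python
import subprocess
from pathlib import Path


def current_ps_output():
    """Return the process listing from `ps -ax` as text."""
    return subprocess.run(
        ["ps", "-ax"], capture_output=True, text=True, check=True
    ).stdout


def find_lockfiles(boinc_dir):
    """List slots/<n>/boinc_lockfile under boinc_dir (assumed layout)."""
    return sorted(Path(boinc_dir).glob("slots/*/boinc_lockfile"))


def running_matches(ps_output, name_fragments):
    """Return the ps lines that mention any of the given name fragments."""
    return [
        line
        for line in ps_output.splitlines()
        if any(fragment in line for fragment in name_fragments)
    ]


def remove_lockfiles(boinc_dir, ps_output, name_fragments=("boinc", "einstein")):
    """Delete the lock files, but only if nothing matching is still running.

    The default name_fragments are an assumption; adjust them to whatever
    your client and science apps are actually called.
    """
    still_running = running_matches(ps_output, name_fragments)
    if still_running:
        raise RuntimeError(
            "Still running, not deleting anything:\n" + "\n".join(still_running)
        )
    removed = []
    for lockfile in find_lockfiles(boinc_dir):
        lockfile.unlink()
        removed.append(lockfile)
    return removed
```

After stopping BOINC and waiting a minute, something like `remove_lockfiles("/path/to/your/boinc/dir", current_ps_output())` (path is a placeholder) would do the cleanup. Note that a simple substring match can trip over unrelated processes, including this script itself if its command line contains "boinc", so read the error message before assuming a science app is still alive.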

psionix
Joined: 6 Jul 05
Posts: 3
Credit: 395382
RAC: 0


Quote:

Did you check the "stderr.txt" files in the slots/n directories?

Stopping BOINC and deleting the lock files should work.

But after boinc stops wait a minute for the science apps to stop and then check that they really did stop. Use ps -ax or something like that, in case one is still running.

After the science apps stopped, delete the lock files and start boinc again.

Worked like a charm! BOINC finished the WU and uploaded it successfully. Removing the boinc_lockfiles in the E@H slots fixed the problem.

For the record, yes, I did check the stderr.txt files and they did have the "boinc_lockfile: Bad file descriptor" errors exactly as you described.

Thanks again for your help!
