Need help with compute errors

Nothing But Idle Time
Joined: 24 Aug 05
Posts: 158
Credit: 289204
RAC: 0
Topic 194135

Been with Einstein since 2005 without incident, but in the last month I have been getting compute errors continuously, in different flavors, with v6.05 and now v6.10. Reboots sometimes get things going again, but the errors soon recur. I don't use the graphics at all and have plenty of memory. BOINC v5.10.13 is a little old but isn't the problem. Memory is relatively new and stress tested.

Need someone to help look through some of these tasks and give guidance on what might be going wrong. Some are access violations or missing input file or parsing error on skygrid, etc. Many have this line in them "WARNING: Fixing yLower (-1181 -> 0) [HoughMap.c 771]". Frankly I think something is wrong with the files that in turn is killing the associated tasks.

Spent many days of fruitless crunching so I've shut down my computer. Since the only two projects I run are Rosetta and Einstein and both apps are problematic lately I'm ready to jettison BOINC forever. Can do without the headaches.

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 689002771
RAC: 211815

Quote:

Spent many days of fruitless crunching so I've shut down my computer. Since the only two projects I run are Rosetta and Einstein and both apps are problematic lately I'm ready to jettison BOINC forever. Can do without the headaches.

If you are experiencing problems with both projects lately, and given the nature of the error messages, I'd suspect that the problem is more likely related to your computer. Is it overclocked? Could the hard disk be fading?

CU
Bikeman

Gundolf Jahn
Joined: 1 Mar 05
Posts: 1079
Credit: 341280
RAC: 0

Quote:
...Need someone to help look through some of these tasks and give guidance on what might be going wrong.


We can't see your tasks by user ID (only you can); what we can see is your tasks by host ID.

Quote:
Some are access violations or missing input file or parsing error on skygrid, etc. Many have this line in them "WARNING: Fixing yLower (-1181 -> 0) [HoughMap.c 771]". Frankly I think something is wrong with the files that in turn is killing the associated tasks.


You think right, but your assumption that BOINC is the cause is false. I strongly recommend a file-system check, just as Bikeman proposed. Obviously the values stored in the checkpoint file aren't read back correctly, and they either have to be fixed or cause access violations. "Missing input file" and parsing errors also hint at a faulty disk.

Quote:
Spent many days of fruitless crunching so I've shut down my computer. Since the only two projects I run are Rosetta and Einstein and both apps are problematic lately I'm ready to jettison BOINC forever. Can do without the headaches.


Sorry for your headaches, but they are not caused by BOINC. You could have saved a lot of fruitless crunching, if you had asked for help earlier. I think your headaches will get worse if your disk turns belly-up, especially if you don't back up regularly.

Regards,
Gundolf

Computers aren't everything in life. (Just kidding)

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5779100
RAC: 0

Start->Run, type cmd and click OK.
In the command line window, use the DOS commands to go to the root of the drive where BOINC is installed (e.g. cd\ puts you in the root of the current drive, such as C:\).
Now type chkdsk /r /f and press Enter.
If it asks whether it should dismount the volume, press N.
Then, when it asks whether it should run the check at the next restart, press Y.

Having done that, log off and reboot the computer. Keep your hands off the keyboard, do not press any keys when you see check disk starting up. Let it run through the checks. This can take quite a while though.

/f fixes errors on the volume.
/r locates bad sectors and recovers readable information.
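
For reference, the command from the steps above can be assembled as in the following Python sketch. The helper name `build_chkdsk_command` is made up for illustration, and it only builds the command line, it does not run anything. The `/f` and `/r` switches are the ones described above; note that chkdsk treats `/r` as implying `/f`, so giving both is harmless.

```python
def build_chkdsk_command(drive="C:", fix=True, recover=True):
    """Return the chkdsk command line as a list of arguments.

    Hypothetical helper for illustration: it only assembles the
    command, it does not execute it.
    """
    args = ["chkdsk", drive]
    if fix:
        args.append("/f")  # fix errors on the volume
    if recover:
        args.append("/r")  # locate bad sectors, recover readable information
    return args


if __name__ == "__main__":
    # Show the command that would be typed at the prompt.
    print(" ".join(build_chkdsk_command()))
```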

Dagorath
Joined: 22 Apr 06
Posts: 146
Credit: 226423
RAC: 0

Message 89852 in response to message 89851

Have you blown the dust out of the cooling system lately? Do that first and then run chkdsk. The reasoning is that chkdsk may not run properly if your CPU is overheating, so clean the system first.

Nothing But Idle Time
Joined: 24 Aug 05
Posts: 158
Credit: 289204
RAC: 0

@Bikeman - My problems at Rosetta are unlike and unrelated to those here at E@H. Too complicated to explain here but they are not compute errors per se. My Dell computer is locked and cannot be overclocked. Hard disk is a few years old and could be fading, though I don't know how I would recognize that short of a disk crash.

@Gundolf - Not blaming BOINC for anything, simply stating that if I can't successfully run E@H and Rosetta without grief then I will jettison BOINC, which is the substrate on which E@H and Rosetta are run.

@Jord - I had compute errors back in December and rebooting seemed to get things back on track. On Jan 7th Windows reported a "dirty" disk and automatically ran chkdsk on restart. There was a hung index entry in the file structure. Ran chkdsk a second time and no additional errors were reported. But I've encountered E@H task compute errors since Jan 7th; yet another chkdsk did not indicate anything wrong with the disk, no bad sectors etc.

@Dagorath - computer not overclocked, resides in unheated room, gets cleaned twice yearly. Never find dirt when I do clean it.

I've been using this computer on E@H tasks for 3 years now, so either my computer is crapping out or something is wrong with the more recent apps and data files. Explain why I find this error "WARNING: Fixing yLower (-1181 -> 0) [HoughMap.c 771]" in many of the faulty tasks, or why the app couldn't parse the skygrid, or why the access violations -- some of this after detaching and re-attaching with all new files. The last errors encountered were not on the same data pack I had previously; I encountered similar errors with both packs. Not all tasks on a given data pack were faulty. I'm not capable of debugging this myself, so I have no recourse except to abandon ship. I'm not one to settle for a huge error rate. Maybe I'll try again someday when I get a newer computer. This is disheartening; when the appeal fades it tends to fade forever.

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5779100
RAC: 0

Message 89854 in response to message 89853

Look at this old post by Bernd. In case you're dropped somewhere in the middle of the thread, scroll up or down; it's the first post in the thread.

Read up on exit code -1073741819 (0xC0000005) and Error 99. It would seem that in your case they are making a comeback.
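
The two numbers in that exit code are the same 32-bit value written two ways: the task log shows it as a signed decimal, while Windows status codes are usually written as unsigned hex (0xC0000005 is the access violation status, which matches the access violations reported earlier in the thread). A small Python sketch of the conversion; the function names are mine, chosen for illustration:

```python
def exit_code_to_hex(code):
    """Reinterpret a signed 32-bit exit code as unsigned and format it as hex."""
    return "0x{:08X}".format(code & 0xFFFFFFFF)


def hex_to_exit_code(text):
    """Inverse: parse the hex form and reinterpret it as a signed 32-bit value."""
    value = int(text, 16)
    return value - (1 << 32) if value >= (1 << 31) else value


if __name__ == "__main__":
    print(exit_code_to_hex(-1073741819))
    print(hex_to_exit_code("0xC0000005"))
```

This makes it easier to search for an error when a log shows only one of the two forms.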

Dagorath
Joined: 22 Apr 06
Posts: 146
Credit: 226423
RAC: 0

Message 89855 in response to message 89853

Quote:
@Dagorath - computer not overclocked, resides in unheated room, gets cleaned twice yearly. Never find dirt when I do clean it.

Every Dell I have seen has a duct that directs the hot air off the fan/heatsink outside the case. It's a good design idea because directing the hot air to the exterior keeps the case temp lower. When cleaning, you need to remove the duct and the fan so you can get a good look at the fins on the heatsink. I suspect you figured that out long ago but maybe not.

Also, on the subject of cooling... fans do wear out and spin slower. They're inexpensive and easy to replace. Maybe try a new fan. And while you're at it, remove the heatsink, clean the old thermal paste off the CPU and heatsink and apply new thermal paste. I once found, on one of my computers, that the heatsink had tipped slightly due to a slightly misplaced hold-down bracket and was not in proper contact with the CPU (it was not flat against the CPU).

Your disk probably supports S.M.A.R.T. There are programs around that read the S.M.A.R.T. data from the drive and present a report to you. The report gives you an idea of how healthy your disk is. I have had disks that didn't give chkdsk errors but did get a lousy S.M.A.R.T. report, and when I replaced them my problems went away.
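
As a rough illustration of what such a report contains: the drive's self-monitoring data (S.M.A.R.T.) is a list of attributes, each with a normalized current value and a vendor-set threshold, and an attribute counts as failed when its value drops to or below the threshold. The sketch below applies that rule to made-up readings; the sample numbers and the function name are hypothetical, and a real tool would read the values from the drive.

```python
def failing_attributes(attributes):
    """Return the names of attributes whose normalized value is at or below threshold."""
    return [a["name"] for a in attributes if a["value"] <= a["threshold"]]


# Hypothetical readings, invented for this example only.
sample = [
    {"name": "Reallocated_Sector_Ct", "value": 100, "threshold": 36},
    {"name": "Spin_Up_Time", "value": 20, "threshold": 21},
    {"name": "Seek_Error_Rate", "value": 60, "threshold": 30},
]

if __name__ == "__main__":
    print(failing_attributes(sample))
```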

Finally, if you think the problem is due to Rosetta or Einstein apps/files then test that hypothesis by trying a few tasks from a project with a nice stable app, like ABC@home. You'll get the occasional error off ABC but if you get lots then it's definitely your hardware or OS.
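
That test boils down to comparing the error rate per project on the same host. A minimal sketch, with invented task outcomes and a hypothetical `error_rates` helper (nothing here reads real BOINC data):

```python
from collections import defaultdict


def error_rates(tasks):
    """Map each project to the fraction of its tasks that ended in error.

    `tasks` is an iterable of (project, succeeded) pairs.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for project, succeeded in tasks:
        totals[project] += 1
        if not succeeded:
            errors[project] += 1
    return {project: errors[project] / totals[project] for project in totals}


# Invented outcomes, for illustration only.
sample_tasks = [
    ("Einstein@Home", False), ("Einstein@Home", True), ("Einstein@Home", False),
    ("ABC@home", True), ("ABC@home", True), ("ABC@home", True), ("ABC@home", False),
]

if __name__ == "__main__":
    for project, rate in error_rates(sample_tasks).items():
        print(project, round(rate, 2))
```

A high rate on every project points at the hardware or OS; a high rate on one project only points at that project's app or files.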

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5779100
RAC: 0

Message 89856 in response to message 89855

Quote:
Your disk probably supports S.M.A.R.T. There are programs around that read the S.M.A.R.T. data from the drive and present a report to you. The report gives you an idea of how healthy your disk is. I have had disks that didn't give chkdsk errors but did get a lousy S.M.A.R.T. report, and when I replaced them my problems went away.


Not to hijack this thread, but you have to be careful which program you use to check this. I had two programs around here (not a clue which ones they were), one of which predicted that my brand new hard drive would fail one week after I installed it.... The other predicted it would go sometime in 2011.

It didn't fail after a week, so let's wait until 2011. :-)

Paul D. Buck
Joined: 17 Jan 05
Posts: 754
Credit: 5385205
RAC: 0

For Windows systems, the tool that I have found (historically) to be best for testing the system is Norton Utilities.

Other things can cause intermittent errors, including a memory stick that has gone bad and is now showing errors.

The suggestion to try other projects is a good one to see if it is your system or not.

Rosetta's application has a number of issues on Windows systems, which is why I stopped running RaH on Windows some time ago ... there may be a fix in the works ... but I digress

What you *MAY* be seeing is a confluence of unrelated events on what I guess are the two projects you want to support the most. A short period where you try another project may be helpful in figuring this one out (without cost) ... I have found that PrimeGrid tasks, particularly the longer ones, can fail a lot, so that may not be the best test: the failing tasks you see could be due to problems in the application.

Astro
Joined: 18 Jan 05
Posts: 257
Credit: 1000560
RAC: 0

I stopped running my AMDs at Einstein, but let my Q6600 keep on chuggin. I just checked back in on these boards and found we're running a new S5 app (64-bit Linux). I checked on my results and found that roughly 1 out of 3 of the S5 results returned by this machine have computation errors, but all the following pages of S4 WUs have no errors.

I know, the new S5 must be more efficient, creating more heat and thereby causing the errors, and turning back the OC might fix it, and that's what I'm going to try, but I found it interesting anyway.

tony
