Need help with compute errors

Paul D. Buck
Joined: 17 Jan 05
Posts: 754
Credit: 5,385,205
RAC: 0

RE: I ran prime95 for 24

Message 89889 in response to message 89885

Quote:
I ran prime95 for 24 hours testing memory and chips and found nothing wrong. However, I get access violations with some Rosetta tasks also. I understand that Einstein tasks are floating point intensive while Rosetta tasks are more integer intensive which is partly why I run these tasks opposite each other on my hyper-threaded machine.

The Rosetta application does have some issues where the application itself can throw access violation errors. Many of them have been cleaned up and the currently issued tasks should be free of most of them. Though I did get one in Ralph the other day ... no word yet if it is a new error or something old. The good news is that the staff there have been adding code that has been helpful in finding these problems.

Harri Liljeroos
Joined: 10 Dec 05
Posts: 624
Credit: 867,615,821
RAC: 1,115,891

Hi, I had a bit similar

Hi,
I had a somewhat similar experience with one of my computers (Pentium 930D, Win XP). It ran Einstein, Seti and LHC. The access violation error happened on about one in five Einstein work units, on both stock and beta versions of the S5 applications. The same errors also happened during S4, but not on all versions of the science application. Some versions worked flawlessly (I don't remember which ones).

Finally I decided to retire that computer from Einstein and run only Seti and LHC, as they had no problems. That computer is now completely retired from Boinc, as it has been passed on to a new owner.

Tom95134
Joined: 21 Aug 07
Posts: 18
Credit: 2,498,832
RAC: 0

I am running a Pentium 4

I am running a Pentium 4 3.0GHz, HT with 4GB of RAM. It is running in a standard mode (not overclocked). BOINC is running as a Service.

I am occasionally seeing the kind of failure shown in the following messages. The failure starts at the 3rd message for no apparent reason. As you can see, another project requested work, but during this time the Einstein@Home task was running. I run three projects: Einstein@Home, SETI@Home, and lhcathome. lhcathome has only occasional work. SETI@Home always has a task running and occasionally a task waiting to run. The same is true of Einstein@Home. I NEVER get failures on SETI@Home. Since work is rather rare on lhcathome, it might not be a valid comparison.

Are other people seeing failures of Einstein@Home?

2/2/2009 7:49:30 AM|lhcathome|Sending scheduler request: To fetch work. Requesting 20160 seconds of work, reporting 0 completed tasks
2/2/2009 7:49:35 AM|lhcathome|Scheduler request completed: got 0 new tasks
2/2/2009 7:52:30 AM|Einstein@Home|Task h1_0912.95_S5R4__1661_S5R5a_1 exited with zero status but no 'finished' file
2/2/2009 7:52:30 AM|Einstein@Home|If this happens repeatedly you may need to reset the project.
2/2/2009 7:52:30 AM|Einstein@Home|Restarting task h1_0912.95_S5R4__1661_S5R5a_1 using einstein_S5R5 version 301
2/2/2009 7:53:12 AM|Einstein@Home|Task h1_0912.95_S5R4__1661_S5R5a_1 exited with zero status but no 'finished' file
2/2/2009 7:53:12 AM|Einstein@Home|If this happens repeatedly you may need to reset the project.
2/2/2009 7:53:12 AM|Einstein@Home|Restarting task h1_0912.95_S5R4__1661_S5R5a_1 using einstein_S5R5 version 301
2/2/2009 7:53:53 AM|Einstein@Home|Task h1_0912.95_S5R4__1661_S5R5a_1 exited with zero status but no 'finished' file
2/2/2009 7:53:53 AM|Einstein@Home|If this happens repeatedly you may need to reset the project.
2/2/2009 7:53:54 AM|Einstein@Home|Restarting task h1_0912.95_S5R4__1661_S5R5a_1 using einstein_S5R5 version 301
2/2/2009 7:54:35 AM|Einstein@Home|Task h1_0912.95_S5R4__1661_S5R5a_1 exited with zero status but no 'finished' file
2/2/2009 7:54:35 AM|Einstein@Home|If this happens repeatedly you may need to reset the project.
2/2/2009 7:54:36 AM|Einstein@Home|Restarting task h1_0912.95_S5R4__1661_S5R5a_1 using einstein_S5R5 version 301
2/2/2009 7:55:17 AM|Einstein@Home|Task h1_0912.95_S5R4__1661_S5R5a_1 exited with zero status but no 'finished' file
2/2/2009 7:55:17 AM|Einstein@Home|If this happens repeatedly you may need to reset the project.
2/2/2009 7:55:18 AM|Einstein@Home|Restarting task h1_0912.95_S5R4__1661_S5R5a_1 using einstein_S5R5 version 301
2/2/2009 7:55:59 AM|Einstein@Home|Task h1_0912.95_S5R4__1661_S5R5a_1 exited with zero status but no 'finished' file
2/2/2009 7:55:59 AM|Einstein@Home|If this happens repeatedly you may need to reset the project.
2/2/2009 7:56:00 AM|Einstein@Home|Restarting task h1_0912.95_S5R4__1661_S5R5a_1 using einstein_S5R5 version 301
2/2/2009 7:56:41 AM|Einstein@Home|Task h1_0912.95_S5R4__1661_S5R5a_1 exited with zero status but no 'finished' file
2/2/2009 7:56:41 AM|Einstein@Home|If this happens repeatedly you may need to reset the project.
2/2/2009 7:56:42 AM|Einstein@Home|Restarting task h1_0912.95_S5R4__1661_S5R5a_1 using einstein_S5R5 version 301
2/2/2009 7:57:23 AM|Einstein@Home|Task h1_0912.95_S5R4__1661_S5R5a_1 exited with zero status but no 'finished' file
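For anyone who wants to tally how often a task gets stuck in this kind of restart loop, here is a minimal sketch in Python. It assumes the message log uses the `timestamp|project|message` layout shown above; the function name `count_failed_exits` and the shortened `sample` log are illustrative only and not part of BOINC.

```python
import re
from collections import Counter

# Matches the "exited with zero status" message and captures the task name.
FAILED_EXIT = re.compile(
    r"^Task (\S+) exited with zero status but no 'finished' file$"
)

def count_failed_exits(log_text):
    """Count 'no finished file' exits per (project, task) in a message log."""
    counts = Counter()
    for line in log_text.splitlines():
        # Assumed line layout: timestamp|project|message
        parts = line.strip().split("|", 2)
        if len(parts) != 3:
            continue
        _timestamp, project, message = parts
        match = FAILED_EXIT.match(message)
        if match:
            counts[(project, match.group(1))] += 1
    return counts

# Shortened excerpt of the log above, for illustration.
sample = """\
2/2/2009 7:49:35 AM|lhcathome|Scheduler request completed: got 0 new tasks
2/2/2009 7:52:30 AM|Einstein@Home|Task h1_0912.95_S5R4__1661_S5R5a_1 exited with zero status but no 'finished' file
2/2/2009 7:52:30 AM|Einstein@Home|If this happens repeatedly you may need to reset the project.
2/2/2009 7:53:12 AM|Einstein@Home|Task h1_0912.95_S5R4__1661_S5R5a_1 exited with zero status but no 'finished' file
"""

for (project, task), n in count_failed_exits(sample).items():
    print(f"{project}: {task} exited without a 'finished' file {n} time(s)")
```

Run against the full log above, the same task shows up eight times between 7:52:30 and 7:57:23, roughly one restart every 42 seconds.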

Gundolf Jahn
Joined: 1 Mar 05
Posts: 1,079
Credit: 341,280
RAC: 0

RE: I am running a Pentium

Message 89892 in response to message 89891

Quote:

I am running a Pentium 4 3.0GHz, HT with 4GB of RAM. It is running in a standard mode (not overclocked). BOINC is running as a Service.

I am occasionally seeing the kind of failure shown in the following messages...


Sometimes simply stopping BOINC (completely) and restarting resolves the problem. If not, did you already see this wiki entry? I did an advanced search for "acquire lockfile" to find it mentioned, as your last reported task showed this error message.

If the setting in your preferences for

Use at most XX percent of CPU time
(Can be used to reduce CPU heat)
Enforced by version 5.6+

is not 100%, then you can also get problems with restarting tasks like yours. Check also your local preferences!
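The local value of this setting can also be checked with a script. The sketch below is only an illustration under assumptions: it assumes the preference is stored in an XML element named `cpu_usage_limit` (as in the client's `global_prefs_override.xml`, as far as I know; please verify against your BOINC version), and the `example` XML is made up for the demonstration.

```python
import xml.etree.ElementTree as ET

def cpu_usage_limit(prefs_xml):
    """Return the cpu_usage_limit percentage from preferences XML, or None if absent."""
    root = ET.fromstring(prefs_xml)
    # Assumed element name; check your own preferences file.
    node = root.find(".//cpu_usage_limit")
    if node is None or node.text is None:
        return None
    return float(node.text)

# Hypothetical preferences content, for illustration only.
example = """
<global_preferences>
    <cpu_usage_limit>60.000000</cpu_usage_limit>
</global_preferences>
"""

limit = cpu_usage_limit(example)
if limit is not None and limit < 100.0:
    print(f"CPU time is throttled to {limit:.0f}%; try setting it to 100%")
```

In practice you would pass in the text of your own preferences file instead of `example`.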

Regards,
Gundolf

[edit] You should have started a new thread, because the symptoms of your error are totally different from those of the OP (Original Poster). [/edit]

Computers aren't everything in life. (Just kidding)

Nothing But Idle Time
Joined: 24 Aug 05
Posts: 158
Credit: 289,204
RAC: 0

I'm baaaack. I cleaned out

I'm baaaack. I cleaned out the dust bunnies and -- for no particular reason -- I reinstalled some older 400MHz memory instead of my current 533 MHz. I'm on day 3 of additional prime95 stress testing and no errors yet.

I think Bernd mentioned in some ancient thread(s) that he has to be careful how he compiles the apps because some of the libraries and load modules are not always compatible with this or that. There are a wide variety of architectures in use. I don't know the particulars.

Is there a software-knowledgeable person, someone who knows about compiling Windows apps like Einstein, who can shed light on this thought: My computer is SSE capable and I run the SSE version of the app. Could the compilation for SSE use libraries and routines that for some reason simply run a little flaky on my FPU and lead to the message "WARNING: Fixing yLower (-389 -> 0) [HoughMap.c 771]"?

After running this last and final stress test I intend to restart EaH and see if I encounter any more compute errors. If so, I then might run the non-SSE version if anyone thinks that is worthy of pursuit.

Alinator
Joined: 8 May 05
Posts: 927
Credit: 9,352,143
RAC: 0

Hmmmm... How dirty was it?

Hmmmm...

How dirty was it? 3.0 GHz Prescott P4's were pushing the thermal envelope for the design pretty hard, especially if you have an OEM-style heatsink/fan.

Generally speaking, I doubt the problem is with the SSEx compiled apps themselves, or one would expect to see the issue more widespread in the Wintel population. I'm not having any issues with them on mine.

One thing you could try is to disable HT initially when you go back to crunching EAH. If it runs without errors for a few days, then reenable HT and see what happens. If the errors return then that's a pretty good indicator you are right at the edge thermally for the FPU, even if you are seeing 'nominal' CPU temps when you check it.

I had a similar problem with my T2400 when I first started running EAH on it. The errors went away when I gave the notebook a little extra cooling help by leaving the lid cracked open, rather than closing it all the way when not in use, and getting a small baking rack to sit it on.

HTH,

Alinator

Nothing But Idle Time
Joined: 24 Aug 05
Posts: 158
Credit: 289,204
RAC: 0

RE: Hmmmm...How dirty was

Message 89895 in response to message 89894

Quote:
Hmmmm...How dirty was it? 3.0 GHz Prescott P4's were pushing the thermal envelope for the design pretty hard, especially if you have an OEM-style heatsink/fan.

Not to be facetious, but how dirty is too dirty? I have no reference standard to say whether it was too dirty or not. I don't personally think it was all that dirty, but I'm not a heatsink.

Quote:
One thing you could try is to disable HT initially when you go back to crunching EAH. If it runs without errors for a few days, then reenable HT and see what happens. If the errors return then that's a pretty good indicator you are right at the edge thermally for the FPU, even if you are seeing 'nominal' CPU temps when you check it.

For some reason my Dell computer won't let me see any temps, nominal or otherwise. However, I've been running the prime95 stress test with maximum emphasis on heat and power consumption. I have a power meter attached to my computer's power line. Normally 2 Boinc threads consume 150 watts, but 2 threads of prime95 have been using 165 watts for 3 days now. Ambient room temp right now is 75F. That ought to push the temp even higher than Boinc does, yet no error is showing up.

Ambient room temp varies from 67-77F and I have a vortex fan located next to the exhaust port of the computer case so as to move the heat away from the local area. I cannot monitor the temps inside the case because my Dell apparently doesn't allow for it. I've tried Speedfan software and it cannot locate any sensors to report from. I have read the SMART data from the disk drive and it claims to be running an average of 37C or 1 degree C lower than the average for disks of this type. Maybe this can be extrapolated to say that the computer case on average is not overly hot, though peak temps are possible.
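To put the figures above on a common scale (the disk temperature is quoted in Celsius, the room temperature in Fahrenheit), here is a small Python sketch of the arithmetic. The helper names are my own, and the only inputs are the numbers quoted in this post.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0

def percent_increase(baseline, value):
    """Relative increase of value over baseline, in percent."""
    return (value - baseline) / baseline * 100.0

boinc_watts = 150.0    # two Boinc threads, as measured at the power meter
prime95_watts = 165.0  # two prime95 threads

print(f"prime95 draws {percent_increase(boinc_watts, prime95_watts):.0f}% more power than Boinc")
for temp_f in (67.0, 75.0, 77.0):
    print(f"{temp_f:.0f}F is about {fahrenheit_to_celsius(temp_f):.1f}C")
```

So the stress test draws about 10% more power than Boinc, and the room ranges from roughly 19.4C to 25.0C, well below the 37C reported by the disk.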

What I find noteworthy is that I don't just encounter a random here-and-there compute error or access violation. It seems that once I encounter the first error, every task running concurrently and subsequently will also have access violations. And that may include non-Einstein tasks as well. Reboots sometimes make it better for a while and then it starts again. Monthly Windows updates also seem to trigger this problem until after a few more reboots. At least that's the way it seems to me: flaky.

paul milton
Joined: 16 Sep 05
Posts: 329
Credit: 35,825,044
RAC: 0

RE: For some reason my

Message 89896 in response to message 89895

Quote:

For some reason my Dell computer won't let me see any temps, nominal or otherwise. However, I've been running the prime95 stress test with maximum emphasis on heat and power consumption. I have a power meter attached to my computer's power line. Normally 2 Boinc threads consume 150 watts, but 2 threads of prime95 have been using 165 watts for 3 days now. Ambient room temp right now is 75F. That ought to push the temp even higher than Boinc does, yet no error is showing up.

Have you tried "speedfan"? There is a setting for "dell support"; it's disabled by default, but it works fine on my Dell laptop after being enabled.

seeing without seeing is something the blind learn to do, and seeing beyond vision can be a gift.

Nothing But Idle Time
Joined: 24 Aug 05
Posts: 158
Credit: 289,204
RAC: 0

RE: ...have you tryd

Message 89897 in response to message 89896

Quote:
...have you tried "speedfan"? There is a setting for "dell support"; it's disabled by default, but it works fine on my Dell laptop after being enabled.


I stated in my last post that I tried speedfan and, in fact, that is what I used to get the SMART data from the disk drive. However, the version I originally downloaded was 4.33 and it didn't seem to support the Dell Dimension 8400 motherboard that I have. I don't think Dell wanted anybody to have any insight into or control of its product, for warranty reasons. I can't find anything regarding fan speeds in the BIOS setup either, and I kept the BIOS up to date until they stopped updating it.

Where is this "dell support" switch that is disabled by default? IIRC only very specific Dell MBs were capable of being supported.

paul milton
Joined: 16 Sep 05
Posts: 329
Credit: 35,825,044
RAC: 0

RE: RE: ...have you tryd

Message 89898 in response to message 89897

Quote:
Quote:
...have you tried "speedfan"? There is a setting for "dell support"; it's disabled by default, but it works fine on my Dell laptop after being enabled.

I stated in my last post that I tried speedfan and, in fact, that is what I used to get the SMART data from the disk drive. However, the version I originally downloaded was 4.33 and it didn't seem to support the Dell Dimension 8400 motherboard that I have. I don't think Dell wanted anybody to have any insight into or control of its product, for warranty reasons. I can't find anything regarding fan speeds in the BIOS setup either, and I kept the BIOS up to date until they stopped updating it.

Where is this "dell support" switch that is disabled by default? IIRC only very specific Dell MBs were capable of being supported.

My bad. It's located in "configure", then "options", just above debug mode. I see it states it is for Dell laptops, so I have no idea if it will work for a Dell desktop.

seeing without seeing is something the blind learn to do, and seeing beyond vision can be a gift.
