Computation error 0.9 CPU + 1 AMD/ATI

MN_Firefighter
Joined: 23 Aug 12
Posts: 9
Credit: 154936279
RAC: 3443
Topic 226005

I've been crunching on this workstation for a long time. Maybe 6-9 months ago I installed a new graphics card, and it was working great. Starting today, I see all my Einstein tasks finish 100% and then say computation error. This is happening with Gravitational Wave search O3 All-Sky #1. No problems with other projects. I am not running any Beta tasks. Thoughts?

Keith Myers
Joined: 11 Feb 11
Posts: 4754
Credit: 17706415916
RAC: 5286280

Bad work units

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110035760704
RAC: 22382874

MN_Firefighter wrote:
... Starting today I see all my Einstein tasks finish 100% then say computation error.

You don't mention exactly which host but you seem to have only 1 with a GPU that could crunch GW GPU tasks - this host with an RX 580.  The others are listed as having Intel GPUs only.

For that host, I did a very quick check of tasks returned for Sept 14 UTC.  The approximate numbers are 40 valid, 7 pending, 1 invalid, 2 inconclusive and just 2 compute errors.  None of this seems to fit the above description of your problem so I'm wondering if I'm looking at the wrong machine??  Maybe you'd like to confirm that.
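To put those approximate numbers in proportion, here is a minimal sketch that tallies the quoted counts and works out what share were compute errors. The outcome labels and the `outcome_share` helper are illustrative names only, not anything provided by BOINC or the project website.

```python
from collections import Counter

# Approximate task counts quoted above for Sept 14 UTC (labels are illustrative).
tally = Counter(valid=40, pending=7, invalid=1, inconclusive=2, compute_error=2)


def outcome_share(counts, outcome):
    """Return the fraction of all tasks that ended with the given outcome."""
    total = sum(counts.values())
    return counts[outcome] / total if total else 0.0


share = outcome_share(tally, "compute_error")
print(f"{sum(tally.values())} tasks returned, compute errors: {share:.1%}")
```

On these figures only a small fraction of returned tasks errored out, which is why the numbers do not match a report of every task failing.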

In any case, it's quite unlikely to be "bad work units" as this search has been going for a while now and there have been no reports recently of anyone having issues with the tasks themselves.  The above host has a total of 20 invalid tasks over the whole period for which tasks are still in the online database (and lots of valid ones as well) so that would tend to indicate that there is some sort of hardware issue developing, particularly now that the first two actual compute errors have just occurred.

Your computer is a Phenom II X6 which is quite old these days.  I run a lot of old hosts as well and a very common age-related issue is capacitor failure on the mainboard, or in the PSU if it's the same age as the machine.  I would have done hundreds of capacitor replacements over the years, and the fact that this cures the vast majority of problems I see with older hardware suggests that this may well be your problem too.  The second most common issue is overheating due to blocked fins on heat sinks or actual fan failure.  It's worthwhile checking that as well.

Cheers,
Gary.

MN_Firefighter
Joined: 23 Aug 12
Posts: 9
Credit: 154936279
RAC: 3443

Gary, thank you for the quick response. My main workstation, the one with the problem described here, is the one running the Phenom II X6. I have several old workstations crunching numbers. No money for anything new. The one item I can verify again is that the Einstein project is the only one having issues. I run tasks for MilkyWay, MLC, LHC, Universe, Climateprediction, World Community Grid, etc. and all the others are working.

Regarding the possible causes you laid out: I actually cleaned out my entire workstation just last week. I took it all apart and cleaned the CPU fans (double-fan push/pull system), all the fins, the graphics card, and all the case fans, so overheating should not be an issue.

I guess I did not post enough information about the host. I guess I do not completely follow this. Would I need to provide more information about the application name or something else? In BOINC Manager I go to the advanced view so I can see all the individual tasks my computer has ready to start. When I scroll all the way to the right, there is the application column, which is where I copied part of the name from (Gravitational Wave search O3.....), and the last column is the task name. With a better understanding I might be able to post more information here.

When I woke up this morning I came down to my office, looked at my workstation, and saw a new error that had stopped BOINC from running anything overnight. It said "bad memory ..... buffer problem ..... application stopped.." I wasn't able to write the actual error message down, so that is just the little I can remember. It's the first time I have seen that.

What I was able to do was restart the workstation and launch BOINC. I did a "reset project" for Einstein. The workstation is currently downloading new tasks for this project. I have to leave for work and will report back when I am able to.

Thank you. I have always had great communication through the Einstein forums.

MN_Firefighter
Joined: 23 Aug 12
Posts: 9
Credit: 154936279
RAC: 3443

After resetting the project and getting new tasks assigned, everything is finishing fine and reporting. Thank you.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110035760704
RAC: 22382874

MN_Firefighter wrote:
I guess I did not post enough information about the host.

Because your computers aren't hidden, there's no need to specify a long list of hardware details.  When you have a number of different machines, either a link to the one in question or something that clearly identifies it (e.g. RX 580 GPU) is all you need to provide.  I was pretty sure I was looking at the right machine.  It was just that you mentioned that all tasks were finishing 100% and then turning into compute errors.  I could only see two compute errors, which happened immediately after startup, along with hundreds of successfully completed tasks, and that made me wonder if I had the right machine.

MN_Firefighter wrote:
I guess I do not completely follow this. Would I need to provide more information about the application name ...

No, not at all.  Anyone responding can find what is needed once they have the correct host ID on the website.  What is useful though is any unusual messages from the OS or from BOINC Manager's event log.  For this particular case the following report probably tells us what the problem is.

MN_Firefighter wrote:
When I woke up this morning I came down to my office, looked at my workstation, and saw a new error that had stopped BOINC from running anything overnight. It said "bad memory ..... buffer problem ..... application stopped.."

It looks like you may have an issue with one of your RAM sticks.  This is not as common as the other things I mentioned but I have previously seen the odd memory issue - fairly infrequently though.

The best action to take is to run some sort of memory testing utility.  In Linux, I have used a utility called memtest86.  I haven't used a Windows machine for about 15 years so I have no idea of the best way to test memory under Windows.
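As a rough illustration of what a memory tester does, here is a minimal sketch of a pattern write-and-verify pass. It is only a toy: it runs inside the operating system and touches just the small block of RAM the OS hands to the process, so a clean result proves very little and it is no substitute for a dedicated boot-time tester. The function name `pattern_check` and its parameters are made up for this example.

```python
def pattern_check(size_mb=16, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Fill a buffer with each byte pattern, read it back, count mismatches.

    Toy user-space check only: the OS decides which physical RAM backs
    the buffer, so most of the installed memory is never touched.
    """
    size = size_mb * 1024 * 1024
    buf = bytearray(size)
    mismatches = 0
    for pattern in patterns:
        # Write the pattern across the whole buffer.
        buf[:] = bytes([pattern]) * size
        # Read back: every byte should still equal the pattern.
        mismatches += size - buf.count(bytes([pattern]))
    return mismatches


if __name__ == "__main__":
    print("mismatched bytes:", pattern_check())
```

Real testers repeat passes like this over nearly all physical memory with many more patterns, which is why they are run outside the normal OS and left going for hours.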

In the past, I've had some success with curing intermittent memory problems by cleaning the gold-plated contacts on memory sticks.  I use an old typist's eraser - the type that will abrade the surface without damaging the gold plating or leaving any sort of sticky residue.  Quite often the gold appears dull with a brownish tarnish and after a gentle cleaning becomes very bright and shiny.  Sometimes that fixes the problem.  After blowing out the slots with air, I also remove and re-seat the sticks a couple of times so that the contacts get a bit of extra abrasion.

MN_Firefighter wrote:
What I was able to do was restart the workstation and launch BOINC. I did a "reset project" for Einstein. The workstation is currently downloading new tasks for this project. I have to leave for work and will report back when I am able to.

Resetting the project will probably make no difference to this issue.  A 'reset' just throws away all apps and data files and downloads new copies to replace them.  As well, the scheduler will send you the same set of tasks that you previously had and not different ones.  It's unlikely that any of that will cure a memory issue.

Hopefully someone else with Windows experience will chime in with a suggestion for testing memory on a Windows machine.  If you get any further OS messages about memory errors you should immediately run some extended memory tests with a proper testing utility and perhaps try to refurbish your existing RAM sticks if errors are revealed.  Ultimately you may need to replace any affected stick if the errors persist.

 

Cheers,
Gary.

GWGeorge007
Joined: 8 Jan 18
Posts: 2823
Credit: 4636722195
RAC: 3643856

Gary Roberts wrote:

Resetting the project will probably make no difference to this issue.  A 'reset' just throws away all apps and data files and downloads new copies to replace them.  As well, the scheduler will send you the same set of tasks that you previously had and not different ones.  It's unlikely that any of that will cure a memory issue.

Hopefully someone else with Windows experience will chime in with a suggestion for testing memory on a Windows machine.  If you get any further OS messages about memory errors you should immediately run some extended memory tests with a proper testing utility and perhaps try to refurbish your existing RAM sticks if errors are revealed.  Ultimately you may need to replace any affected stick if the errors persist.

Your Windows 10 Pro already has a memory tester built in (Windows Memory Diagnostic).

https://www.windowscentral.com/how-check-your-pc-memory-problems-windows-10

 

George

Proud member of the Old Farts Association
