5th Computing error for S5R2 on one host

Dave Burbank
Joined: 30 Jan 06
Posts: 275
Credit: 1548376
RAC: 0
Topic 192658

My primary host has now returned 5 failed WUs as 'Compute Error' with exit status 139 (0x8b).

These are the failed WUs:

33482430
33450969
33440121
33364148
33360594
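
For anyone reading the exit status: 139 (0x8b) looks like the usual Unix convention of 128 plus a signal number, which would make it signal 11 (SIGSEGV on Linux). A minimal sketch of that arithmetic follows. It assumes the status reported for the task follows the 128 + N convention, and `decode_exit_status` is just an illustrative helper name, not part of BOINC or the science app.

```python
import signal


def decode_exit_status(status: int) -> str:
    """Interpret a shell-style exit status.

    Assumption: a value above 128 means "terminated by signal (status - 128)".
    """
    if status > 128:
        signum = status - 128
        try:
            # Look up the symbolic name of the signal on this platform.
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by signal {signum} ({name})"
    return f"exited with code {status}"


if __name__ == "__main__":
    print(hex(139))                 # the same status in hexadecimal
    print(decode_exit_status(139))  # 139 - 128 = 11
```

Under that assumption, the five failures above would all be segmentation faults in the app rather than ordinary error exits.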

This host is an A64 3700 overclocked to 2.86 GHz. It has never had stability issues (running 24/7 for over a year) with previous Science Runs. To rule out the overclock causing the WUs to fail, I set the host back to its stock speed of 2.2 GHz. No luck. The first WU I listed above failed at stock settings.

There is no obvious pattern to these failures: some WUs finish successfully, others don't. I don't care about the credit, but the wasted crunching hours (38+) are starting to bother me. Anyone have any ideas? Is this a problem with my host, or are they still ironing out some kinks in the new app?

There are 10^11 stars in the galaxy. That used to be a huge number. But it's only a hundred billion. It's less than the national deficit! We used to call them astronomical numbers. Now we should call them economical numbers. - Richard Feynman

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494410
RAC: 0

Yep, unfortunately the new app is still not entirely reliable. I had a WU crash on me with the dreaded "SIGABRT" error with the new app, on an AMD, of course. For the sake of the project and all of us crunchers, I really hope the developers manage to put a stop to that soon. While I'm not complaining, it is, as you said, Dave, a pity about all those CPU hours.

Dave Burbank
Joined: 30 Jan 06
Posts: 275
Credit: 1548376
RAC: 0

Message 62984 in response to message 62983

Quote:
Yep, unfortunately the new app is still not entirely reliable. I had a WU crash on me with the dreaded "SIGABRT" error with the new app, on an AMD, of course. For the sake of the project and all of us crunchers, I really hope the developers manage to put a stop to that soon. While I'm not complaining, it is, as you said, Dave, a pity about all those CPU hours.

Thanks for the quick reply. I hope the devs can get this worked out soon without too much headache (for them). Guess it's time to crank my overclock back up and hope for some "lucky" WUs!

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494410
RAC: 0

Yes, the only sure way to avoid this would be running Windows, I think... I've only seen Linux hosts get that error. But Windows is just TOO slow on an AMD host, apart from not being many people's OS of choice ;-) so one has to take one's chances... I wish you happy crunching and more luck with your next WUs.

Dave Burbank
Joined: 30 Jan 06
Posts: 275
Credit: 1548376
RAC: 0

I didn't realize this was limited to the Linux app. I dual-boot with Windows for games and such, but couldn't imagine "going back" and losing Beryl and other features of Linux. I'll just wait patiently for a new app.

On a side note, wouldn't this bug be wreaking havoc with Bruce's cluster?

Cheers

DanNeely
Joined: 4 Sep 05
Posts: 1364
Credit: 3562358667
RAC: 0

IIRC I've seen win machines erroring out too.

Dave Burbank
Joined: 30 Jan 06
Posts: 275
Credit: 1548376
RAC: 0

Then I'm definitely sticking with Linux!

Brian Silvers
Joined: 26 Aug 05
Posts: 772
Credit: 282700
RAC: 0

Message 62989 in response to message 62987

Quote:
IIRC I've seen win machines erroring out too.

Not with SIGABRT... That seems to be Linux-specific...

wijata.com
Joined: 11 Feb 05
Posts: 113
Credit: 25495895
RAC: 0

It seems that every WU that was interrupted/resumed gets a compute error with signal 11/SIGABRT on a Linux machine.
Example: http://einsteinathome.org/task/83757575 and this host has more like it.
It's a pity, as I have to restart them quite often...

Mats Nilsson
Joined: 10 Dec 05
Posts: 94
Credit: 15011147
RAC: 0

Message 62991 in response to message 62990

Quote:
It seems that every WU that was interrupted/resumed gets a compute error with signal 11/SIGABRT on a Linux machine.
Example: http://einsteinathome.org/task/83757575 and this host has more like it.
It's a pity, as I have to restart them quite often...

That host is using an old version of BOINC (4.43). Could that have something to do with it?

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494410
RAC: 0

That couldn't have been the reason my WU crashed. I'm completely sure I didn't pause/resume it. Maybe this can trigger SIGABRT errors, but it can't be the only thing that causes them...
Oh, and as for Bruce's cluster... afaik some of his machines were indeed affected quite badly...
