S5R2c Client Error

Jim Howe
Joined: 25 Mar 05
Posts: 18
Credit: 11707416
RAC: 0

On my machines I am getting SIGABRT in every case where a machine is using core_client_5.4.9 and a WU '...S5R2c_1', but not in other cases. I have three machines running core_client_5.8.15 (two are AMD, one is Intel), and so far these machines are completing these WUs successfully. All of my machines running 5.4.9 are getting the SIGABRT errors described.

Jim Howe
Zhuhai, China and Portland, Oregon

Ananas
Joined: 22 Jan 05
Posts: 272
Credit: 2500681
RAC: 0

Here's one more (not mine), he had no problems with S5RIa but every S5R2c gives him

5.4.9

Couldn't start or resume: -108

This one gets a totally different error with S5R2c :

5.2.13
process exited with code 2 (0x2)

2007-04-20 20:28:51 [Einstein@Home] execv(../../projects/einstein.phys.uwm.edu/einstein_S5R2_4.14_i686-pc-linux-gnu) failed: error -1
execv: No such file or directory

Those totally different errors look very much like a pointer or array size problem. Too many open files could cause something like that too.

Sir Barsteward of Pubs
Joined: 8 Apr 06
Posts: 5
Credit: 84336737
RAC: 0

Message 62706 in response to message 62697

Quote:
All my WU - client error. :(

Since switching over I've lost 730000 CPU secs (280 credits) and have yet to see one run to a successful conclusion :o((

The Barsteward

Beer is proof that God loves us and wants us to be happy.
--Benjamin Franklin

Sir Barsteward of Pubs
Joined: 8 Apr 06
Posts: 5
Credit: 84336737
RAC: 0

Message 62707 in response to message 62706

Quote:
Quote:
All my WU - client error. :(

Since switching over I've lost 730000 CPU secs (280 credits) and have yet to see one run to a successful conclusion :o((

Another 60000 secs lost for no apparent reason, so I will not transfer WUs until this seems to be resolved.

The Barsteward

Beer is proof that God loves us and wants us to be happy.
--Benjamin Franklin

Ananas
Joined: 22 Jan 05
Posts: 272
Credit: 2500681
RAC: 0

Got a new one @ CPU time = 22089 :

- Unhandled Exception Record -
Reason: Privileged Instruction (0xc0000096) at address 0x0044FC48

http://einsteinathome.org/workunit/33428928

The other result in that WU is nice too :

5.2.13
Maximum disk usage exceeded

Nice to see that Bruce Allen still uses 5.2.13 too :-)

Signal 11 on Bruce's box as well, btw: resultid=83588082

jjwhalen
Joined: 21 Jun 06
Posts: 7
Credit: 645238
RAC: 0

One of my S5R2 workunits: http://einsteinathome.org/workunit/33355866 looks to be stuck in database limbo.

Its status shows 2 successful results (a quorum) pending validation, plus a 3rd with client error (hence posting in this thread), plus a 4th 'unsent'. The WU appears in my Results list as 'pending' but doesn't show up in my Pending Credit list at all (hence the limbo).

No change in validation status for over 24 hours. No indication what's keeping it from validating with a quorum, though I notice that Server Status doesn't yet show the S5R2 validators/assimilators either up OR down.

It's not about the credit; I just hate to see 30-plus hours of CPU time go down the toilet, since I have other project mouths to feed. With the larger workunits, time invested in a single result that won't validate becomes a concern.

B/W

Best wishes :)

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7023174931
RAC: 1830249

Message 62710 in response to message 62709

Quote:

One of my S5R2 workunits: http://einsteinathome.org/workunit/33355866 looks to be stuck in database limbo.

...

No indication what's keeping it from validating with a quorum, though I notice that Server Status doesn't yet show the S5R2 validators/assimilators either up OR down.

The validator has looked, and does not like what it sees enough to accept the result. If you look at the end of the result detail, this shows as:

Checked, but no consensus yet

So another result is already prepared to send out to another host. When that goes out and comes back, if it is sufficiently similar to one of the two already returned, it will decide the issue of who is right.

I'll speculate that one of the hosts suffered a computational error, but not of the sort which generates an illegal memory access or anything else caught by run-time checking in the application.
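
To make the "no consensus yet" state concrete, here is a minimal sketch of a quorum check. It is not the project's actual validator: the function names, the tolerance, and the idea of comparing results as plain lists of numbers are all assumptions made for illustration.

```python
from itertools import combinations


def results_agree(a, b, rel_tol=1e-5):
    """Hypothetical comparison: two results agree if every value matches within rel_tol."""
    if len(a) != len(b):
        return False
    return all(abs(x - y) <= rel_tol * max(abs(x), abs(y), 1.0) for x, y in zip(a, b))


def find_consensus(results, quorum=2):
    """Return the indices of the first `quorum` mutually agreeing results, or None."""
    for group in combinations(range(len(results)), quorum):
        if all(results_agree(results[i], results[j]) for i, j in combinations(group, 2)):
            return group
    return None


# Two results came back but disagree: "Checked, but no consensus yet".
returned = [[1.000, 2.000], [1.000, 2.500]]
first_check = find_consensus(returned)

# A third result comes back and matches the first one, which settles the issue.
returned.append([1.000, 2.000])
second_check = find_consensus(returned)
```

In this sketch the host whose result is left out of the agreeing group would be the one marked invalid, which matches the situation described above where neither result crashed but one of them is presumably wrong.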

PaperDragon
Joined: 31 Mar 05
Posts: 6
Credit: 72813968
RAC: 0

So far every one has errored out on one of my machines. The status says client error, and when I look into the work units they have 'invalid function' logged in them.

Machine with errors

You like Myst? Uru Live returns! www.urulive.com

Adi
Joined: 1 Jan 06
Posts: 11
Credit: 43581749
RAC: 0

Message 62712 in response to message 62711

I have about 35 hosts, maybe more.
Almost none of them have received credit in the last week.
NONE of them are overclocked; almost all are Linux servers at different companies.

For example, my dual Xeon at home CANNOT be overclocked (HP xw6000).
(BIOSes for dual-processor boards don't have such options; servers are made for stability.)
But I haven't received credit since 22 Apr.

results

more than 800 credit lost, and that's only on 1 host!

ALL other projects (seti, climate, predictor) are OK on ALL hosts

for me the decision is simple:

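#include <stdlib.h>

int ResourceShare = 1;
int veryFewClientErrors = 0;
int bugMessages = 1;
long date = 1;  /* in days; long, not int, because 365*100 > 32767 */
const long theEndOfTheUniverse = 365L * 100;  /* placeholder: let's hope it won't take more than 100 years */

/* placeholders: the answers come from reading the forum each week */
static int veryFewErrors(void) { return 0; }
static int fewBugMessages(void) { return 0; }

/* placeholder: by how much to raise the share is up to me */
static void increaseBack(int *share) { *share += 1; }

static void checkNextWeek(void) {
    if (veryFewErrors()) veryFewClientErrors = 1;
    if (fewBugMessages()) bugMessages = 0;
    date += 7;
} /* end check */

int main(void) {
    while (date < theEndOfTheUniverse) {
        checkNextWeek();
        if (veryFewClientErrors && !bugMessages) {
            increaseBack(&ResourceShare);
            return EXIT_SUCCESS;
        } /* end if */
    } /* end while */
    return EXIT_FAILURE;
} /* end main */
/* EOF */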


Have a nice (and successful bug-hunting) week.

wijata.com
Joined: 11 Feb 05
Posts: 113
Credit: 25495895
RAC: 0

It seems that every WU that was interrupted/resumed gets a compute error with signal 11/SIGABRT on a Linux machine.
Example: http://einsteinathome.org/task/83757575, and this host has more like it.
It's a pity, as I have to restart them quite often...
