Client Errors of S5R2/S5R3 Apps

Daedalus
Joined: 18 Oct 07
Posts: 22
Credit: 71710469
RAC: 1236

Hmmmm, my computer isn't overclocked and has always been very stable. Also, E@H seemed to crash after having started a brand new WU. Along with the "failing" WU, all the other incomplete ones were trashed too. I solved the problem by deactivating the network. I'll wait for my two WUs to finish before I allow BOINC to contact the servers again. It seems to work. I am still using 4.02.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4273
Credit: 245275384
RAC: 12137

Message 71167 in response to message 71166

Quote:
Also, E@H seemed to crash after having started a brand new WU. Along with the "failing" WU, all the other incomplete ones were trashed too. I solved the problem by deactivating the network.


I won't know precisely what kind of crash that was until the tasks have been reported, but a bad network driver could mess up the FPU stack, which might explain e.g. "Input domain errors".

BM

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4273
Credit: 245275384
RAC: 12137

Message 71168 in response to message 71165

Quote:
I'm quite sure to have no HW problems, since I can calculate Seti and Climate WU's without errors.


AFAIK SETI isn't that picky about FP operations, and CPDN only validates the results after completed tasks (if at all), which takes months.

Quote:
My CPU is not overclocked - quite the opposite, I clock down from 3200 to 2400, to make it less noisy ;-)


That's interesting indeed.

Quote:
Are there other possible errors, or should I try again with the 4.16


I'd say try the 4.16.

It might be that your machine is already using it, and that the tasks run with 4.02 had already been assigned to that App version before you installed the 4.16 App.

What do the client messages say about the last einstein task that has been started? Is there a "found app_info.xml - using anonymous platform" message after starting the Client?

BM

barney-s
Joined: 23 Aug 06
Posts: 4
Credit: 42023
RAC: 0

Message 71169 in response to message 71168

Quote:


I'd say try the 4.16.

It might be that your machine is already using it, and that the tasks run with 4.02 had already been assigned to that App version before you installed the 4.16 App.


I started 1 new WU about 14 hours ago - still running :-)

Quote:

What do the client messages say about the last einstein task that has been started? Is there a "found app_info.xml - using anonymous platform" message after starting the Client?

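Well, maybe I found the error. In the app_info.xml, version 402 was listed with the 4.02 libraries:

    <app_version>
        <app_name>einstein_S5R3</app_name>
        <version_num>402</version_num>
        <file_ref>
            <file_name>einstein_S5R3_4.02_i686-pc-linux-gnu</file_name>
            <main_program/>
        </file_ref>
        <file_ref>
            <file_name>einstein_S5R3_4.02_i686-pc-linux-gnu.so</file_name>
        </file_ref>
    </app_version>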


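I have now changed it to the following:

    <app_version>
        <app_name>einstein_S5R3</app_name>
        <version_num>402</version_num>
        <file_ref>
            <file_name>einstein_S5R3_4.16_i686-pc-linux-gnu</file_name>
            <main_program/>
        </file_ref>
        <file_ref>
            <file_name>einstein_S5R3_4.16_i686-pc-linux-gnu.so</file_name>
        </file_ref>
    </app_version>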


Now it shows 4.16 in the messages as well.
I'll keep you updated when the WU is finished.

Regards Barney

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110028769439
RAC: 22434484

Message 71170 in response to message 71169

Quote:

Well, maybe I found the error. In the app_info.xml, version 402 was listed with the 4.02 libraries

That's not an error - the 4.02 versions of the app and .pdb files are supposed to be listed in the app_info.xml file. If you read further in that file you should also find the 4.16 versions separately listed.

The reason for this is that the format of checkpoint files, regularly saved during crunching, changed for versions later than 4.02. You should not be attempting to resume crunching with 4.16 something that was commenced with 4.02 since 4.16 will not be able to read the 4.02 saved checkpoint. The correct behaviour is to allow version 4.02 to "finish off" any work "branded" 4.02 and then to allow 4.16 to do any work "branded" 4.09 or above. If you carefully read the full app_info file as distributed you should be able to see how this works.

As Bernd tried to point out in his previous message, the reason that 4.16 was not running may be the size of your tasks cache. If you have work on hand, downloaded before you installed 4.16, that work will be "branded" 4.02 and the app_info mechanism will use 4.02 to crunch it until it is all gone. Any new tasks downloaded will be branded 4.16 and will be crunched with 4.16 once the 4.02 work is exhausted. The process can be fast tracked, but that involves other editing which the average user would not be familiar with, so it is not recommended unless you really understand exactly what you are doing. It is much safer to let the app_info mechanism, as distributed, do its thing.

The result of the edit you posted would be to cause any 4.02 work in progress to fail. I'm not sure exactly what would happen to a partly crunched result. When the checkpoint couldn't be read, either the result would crash or it might be completely restarted from the beginning. Either way you lose something. However, 4.16 will be able to crunch 4.02 "branded" tasks if they haven't already been started (so that no incompatible checkpoint exists).
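To check what an app_info.xml actually maps each version number to, something like the following might help. It is only a sketch: the embedded XML is a hypothetical, shortened file built from the names quoted in this thread, not the file as distributed, and the "file name contains _4.02_ for version 402" rule is an assumption taken from those names.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened app_info.xml. The real distributed file is
# longer; only the entries quoted in this thread are mimicked here.
APP_INFO = """
<app_info>
    <app>
        <name>einstein_S5R3</name>
    </app>
    <app_version>
        <app_name>einstein_S5R3</app_name>
        <version_num>402</version_num>
        <file_ref>
            <file_name>einstein_S5R3_4.02_i686-pc-linux-gnu</file_name>
            <main_program/>
        </file_ref>
    </app_version>
    <app_version>
        <app_name>einstein_S5R3</app_name>
        <version_num>416</version_num>
        <file_ref>
            <file_name>einstein_S5R3_4.16_i686-pc-linux-gnu</file_name>
            <main_program/>
        </file_ref>
    </app_version>
</app_info>
"""


def versions_to_files(xml_text):
    """Map each version_num to the file names listed under it."""
    root = ET.fromstring(xml_text)
    mapping = {}
    for app_version in root.iter("app_version"):
        version = int(app_version.findtext("version_num"))
        files = [ref.findtext("file_name") for ref in app_version.iter("file_ref")]
        mapping.setdefault(version, []).extend(files)
    return mapping


def mismatched_versions(mapping):
    """Return versions whose files lack the matching '_X.YY_' tag.

    Assumption: version 402 should point at files named '..._4.02_...'.
    """
    bad = []
    for version, files in mapping.items():
        tag = f"_{version // 100}.{version % 100:02d}_"
        if any(tag not in name for name in files):
            bad.append(version)
    return bad


if __name__ == "__main__":
    print(mismatched_versions(versions_to_files(APP_INFO)))
    # Simulate the edit posted above: 4.16 files listed under version 402.
    edited = APP_INFO.replace("4.02", "4.16")
    print(mismatched_versions(versions_to_files(edited)))
```

With the unedited text no version is flagged; after the simulated edit, version 402 is flagged because it now points at 4.16 binaries, which is the situation that would leave a 4.02 checkpoint unreadable.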

Cheers,
Gary.

barney-s
Joined: 23 Aug 06
Posts: 4
Credit: 42023
RAC: 0

Message 71171 in response to message 71170

Hello Gary and Bernd,

now with the new WU and version 4.16 I was able to finish without error messages:
http://einsteinathome.org/task/88887402

As for the app_info.xml:
Maybe you were right that I already had WUs when I tried 4.16.
But I think I will leave the app_info.xml as it is, since it is working for me now :-)

Thx for your patience
Barney

tapir
Joined: 19 Mar 05
Posts: 23
Credit: 462935446
RAC: 0

23/11/2007 18:01:56|Einstein@Home|Reason: Unrecoverable error for result h1_0438.65_S5R2__95_S5R3a_2 ( - exit code -1073741679 (0xc0000091))

resultid=88969298

???
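
For anyone puzzled by the negative number: on Windows the exit code is a 32-bit NTSTATUS value printed as a signed integer, and the hex form in the message is the same value unsigned. A minimal sketch of the conversion follows; the small name table is copied from the standard Windows status codes as far as I know them and is deliberately incomplete, and the function names are made up for illustration.

```python
# A few NTSTATUS values from the Windows status code list (not complete).
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC0000090: "STATUS_FLOAT_INVALID_OPERATION",
    0xC0000091: "STATUS_FLOAT_OVERFLOW",
    0xC0000092: "STATUS_FLOAT_STACK_CHECK",
}


def to_unsigned32(exit_code):
    """Reinterpret a signed 32-bit exit code as an unsigned value."""
    return exit_code & 0xFFFFFFFF


def describe(exit_code):
    """Return the hex form plus a name, if it is in the table above."""
    value = to_unsigned32(exit_code)
    name = NTSTATUS_NAMES.get(value, "unknown")
    return f"0x{value:08x} {name}"


if __name__ == "__main__":
    print(describe(-1073741679))  # the code from the log line above
    print(describe(-1073741819))  # another code reported in this thread
```

So 0xc0000091 is a floating point exception raised by the app rather than a BOINC client error, which fits the earlier remarks about FP-related problems.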

Mike Francis
Joined: 18 Mar 06
Posts: 4
Credit: 6564723
RAC: 0

Just had this compute error. Not sure if it's the workunit or my system.

11/24/2007 4:34:49 PM|Einstein@Home|Resuming task h1_0578.15_S5R2__179_S5R3a_1 using einstein_S5R3 version 415
11/24/2007 6:07:43 PM|Einstein@Home|Computation for task h1_0578.15_S5R2__179_S5R3a_1 finished
11/24/2007 6:07:43 PM|Einstein@Home|Output file h1_0578.15_S5R2__179_S5R3a_1_0 for task h1_0578.15_S5R2__179_S5R3a_1 absent

The runtime was 5:54; the usual runtime is 9:30 to 10 hrs.

KWSN Sir Clark
Joined: 26 Jun 05
Posts: 42
Credit: 1200171
RAC: 0

Just had the same error:

Quote:

11/12/2007 13:53:20|Einstein@Home|Task h1_0588.85_S5R2__151_S5R3a_3 exited with zero status but no 'finished' file
11/12/2007 13:53:20|Einstein@Home|If this happens repeatedly you may need to reset the project.
11/12/2007 13:53:20|Einstein@Home|Restarting task h1_0588.85_S5R2__151_S5R3a_3 using einstein_S5R3 version 415
11/12/2007 15:23:56|Einstein@Home|Computation for task h1_0588.85_S5R2__151_S5R3a_3 finished
11/12/2007 15:23:56|Einstein@Home|Output file h1_0588.85_S5R2__151_S5R3a_3_0 for task h1_0588.85_S5R2__151_S5R3a_3 absent

Here's the WU

The end of stderr.txt is

Quote:

ERROR: HoughMap.c 388: map index out of bounds: 3008 [0..2970] j:52 xp[j]:148
Level 0: $Id: HierarchicalSearch.c,v 1.184 2007/09/28 21:26:58 reinhard Exp $
Function call `ComputeFstatHoughMap ( &status, &semiCohCandList, &pgV, &semiCohPar)' failed.
file HierarchicalSearch.c, line 1105
2007-12-11 15:23:52.7812 [normal]:
Level 1: $Id: HierarchicalSearch.c,v 1.184 2007/09/28 21:26:58 reinhard Exp $
2007-12-11 15:23:52.7812 [normal]: Status code -1: Recursive error
2007-12-11 15:23:52.7968 [normal]: function ComputeFstatHoughMap, file HierarchicalSearch.c, line 1916
2007-12-11 15:23:52.7968 [normal]:
Level 2: $Id: DriveHough.c,v 1.17 2007/01/08 17:30:04 reinhard Exp $
2007-12-11 15:23:52.7968 [normal]: Status code -1: Recursive error
2007-12-11 15:23:52.7968 [normal]: function LALHOUGHConstructHMT_W, file DriveHough.c, line 630
2007-12-11 15:23:52.7968 [normal]:
Level 3: $Id: HoughMap.c,v 1.11 2007/07/23 14:48:20 bema Exp $
2007-12-11 15:23:52.7968 [normal]: Status code 2: Invalid input size
2007-12-11 15:23:52.7968 [normal]: function LALHOUGHAddPHMD2HD_W, file HoughMap.c, line 389
2007-12-11 15:23:52.7968 [CRITICAL]: BOINC_LAL_ErrHand(): now calling boinc_finish()

I'm using BOINC 5.10.28 and Win XP. It was the sole crunching task when this happened and the only program running apart from AVG and Comodo Firewall.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110028769439
RAC: 22434484

Message 71175 in response to message 71174

Quote:
Just had the same error: ....

Yes, both yours and the previous poster's errors seem quite similar - map index out of bounds at line 388 in HoughMap.c. I'm sure Bernd will look at these when he is able to.

Quote:

ERROR: HoughMap.c 388: map index out of bounds: 3008 [0..2970] j:52 xp[j]:148
Level 0: $Id: HierarchicalSearch.c,v 1.184 2007/09/28 21:26:58 reinhard Exp $
Function call `ComputeFstatHoughMap ( &status, &semiCohCandList, &pgV, &semiCohPar)' failed.
file HierarchicalSearch.c, line 1105
2007-12-11 15:23:52.7812 [normal]:
Level 1: $Id: HierarchicalSearch.c,v 1.184 2007/09/28 21:26:58 reinhard Exp $
2007-12-11 15:23:52.7812 [normal]: Status code -1: Recursive error
2007-12-11 15:23:52.7968 [normal]: function ComputeFstatHoughMap, file HierarchicalSearch.c, line 1916
2007-12-11 15:23:52.7968 [normal]:
Level 2: $Id: DriveHough.c,v 1.17 2007/01/08 17:30:04 reinhard Exp $
2007-12-11 15:23:52.7968 [normal]: Status code -1: Recursive error
2007-12-11 15:23:52.7968 [normal]: function LALHOUGHConstructHMT_W, file DriveHough.c, line 630
2007-12-11 15:23:52.7968 [normal]:
Level 3: $Id: HoughMap.c,v 1.11 2007/07/23 14:48:20 bema Exp $
2007-12-11 15:23:52.7968 [normal]: Status code 2: Invalid input size
2007-12-11 15:23:52.7968 [normal]: function LALHOUGHAddPHMD2HD_W, file HoughMap.c, line 389
2007-12-11 15:23:52.7968 [CRITICAL]: BOINC_LAL_ErrHand(): now calling boinc_finish()

If you look at this list of results, you will see a set of similar errors followed by some successful results. This is one of my machines and here is a snippet from the oldest of the error outputs. It's also listed as error code 99 like yours:

Quote:

260, c
261,
ERROR: HoughMap.c 388: map index out of bounds: 16540 [0..2970] j:0 xp[j]:16540
Level 0: $Id: HierarchicalSearch.c,v 1.184 2007/09/28 21:26:58 reinhard Exp $
Function call `ComputeFstatHoughMap ( &status, &semiCohCandList, &pgV, &semiCohPar)' failed.
file HierarchicalSearch.c, line 1105
2007-12-07 02:12:59.9531 [normal]:
Level 1: $Id: HierarchicalSearch.c,v 1.184 2007/09/28 21:26:58 reinhard Exp $
2007-12-07 02:12:59.9531 [normal]: Status code -1: Recursive error
2007-12-07 02:12:59.9687 [normal]: function ComputeFstatHoughMap, file HierarchicalSearch.c, line 1916
2007-12-07 02:12:59.9687 [normal]:
Level 2: $Id: DriveHough.c,v 1.17 2007/01/08 17:30:04 reinhard Exp $
2007-12-07 02:12:59.9687 [normal]: Status code -1: Recursive error
2007-12-07 02:12:59.9687 [normal]: function LALHOUGHConstructHMT_W, file DriveHough.c, line 630
2007-12-07 02:12:59.9687 [normal]:
Level 3: $Id: HoughMap.c,v 1.11 2007/07/23 14:48:20 bema Exp $
2007-12-07 02:12:59.9687 [normal]: Status code 2: Invalid input size
2007-12-07 02:12:59.9687 [normal]: function LALHOUGHAddPHMD2HD_W, file HoughMap.c, line 389
2007-12-07 02:12:59.9687 [CRITICAL]: BOINC_LAL_ErrHand(): now calling boinc_finish()

and here is a further snippet from the last (most recent) of the set of errors:

Quote:

15, c
16,
ERROR: HoughMap.c 423: map index out of bounds: 26801 [0..2970] j:30 xp[j]:25151
Level 0: $Id: HierarchicalSearch.c,v 1.184 2007/09/28 21:26:58 reinhard Exp $
Function call `ComputeFstatHoughMap ( &status, &semiCohCandList, &pgV, &semiCohPar)' failed.
file HierarchicalSearch.c, line 1105
2007-12-07 04:09:26.0937 [normal]:
Level 1: $Id: HierarchicalSearch.c,v 1.184 2007/09/28 21:26:58 reinhard Exp $
2007-12-07 04:09:26.0937 [normal]: Status code -1: Recursive error
2007-12-07 04:09:26.0937 [normal]: function ComputeFstatHoughMap, file HierarchicalSearch.c, line 1916
2007-12-07 04:09:26.0937 [normal]:
Level 2: $Id: DriveHough.c,v 1.17 2007/01/08 17:30:04 reinhard Exp $
2007-12-07 04:09:26.0937 [normal]: Status code -1: Recursive error
2007-12-07 04:09:26.0937 [normal]: function LALHOUGHConstructHMT_W, file DriveHough.c, line 630
2007-12-07 04:09:26.0937 [normal]:
Level 3: $Id: HoughMap.c,v 1.11 2007/07/23 14:48:20 bema Exp $
2007-12-07 04:09:26.0937 [normal]: Status code 2: Invalid input size
2007-12-07 04:09:26.0937 [normal]: function LALHOUGHAddPHMD2HD_W, file HoughMap.c, line 424
2007-12-07 04:09:26.0937 [CRITICAL]: BOINC_LAL_ErrHand(): now calling boinc_finish()

Once again, a very similar error but a different line number - 423 - in HoughMap.c
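
To compare these reports side by side, the first line of each stderr snippet can be pulled apart with a regular expression. This is just a sketch that assumes the message always has the shape seen in the three snippets above; the function name and the "overshoot" field are my own additions.

```python
import re

# Matches lines such as:
# ERROR: HoughMap.c 388: map index out of bounds: 3008 [0..2970] j:52 xp[j]:148
PATTERN = re.compile(
    r"ERROR: (?P<file>\S+) (?P<line>\d+): map index out of bounds: "
    r"(?P<index>\d+) \[(?P<low>\d+)\.\.(?P<high>\d+)\] "
    r"j:(?P<j>\d+) xp\[j\]:(?P<xp>\d+)"
)


def parse_bounds_error(text):
    """Extract the fields of a 'map index out of bounds' line, or None."""
    match = PATTERN.search(text)
    if match is None:
        return None
    fields = {
        key: (value if key == "file" else int(value))
        for key, value in match.groupdict().items()
    }
    # How far past the upper bound the index landed.
    fields["overshoot"] = fields["index"] - fields["high"]
    return fields


if __name__ == "__main__":
    samples = [
        "ERROR: HoughMap.c 388: map index out of bounds: 3008 [0..2970] j:52 xp[j]:148",
        "ERROR: HoughMap.c 388: map index out of bounds: 16540 [0..2970] j:0 xp[j]:16540",
        "ERROR: HoughMap.c 423: map index out of bounds: 26801 [0..2970] j:30 xp[j]:25151",
    ]
    for sample in samples:
        info = parse_bounds_error(sample)
        print(info["line"], info["index"], info["overshoot"])
```

The allowed range is [0..2970] in all three cases, but the offending index differs each time (3008, 16540, 26801).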

So, in my case, why did these errors then stop?

Well, the machine in question has a Tualatin Celeron processor (1300MHz) with an FSB of 100MHz. These machines are quite overclockable and I have around 30 of them, all overclocked, some as high as 124MHz FSB which gives a CPU speed of over 1600MHz. Most run at around 118 - 120MHz FSB but the above machine was only running at 112MHz and producing errors.

I decided to change the processor as I felt that the PC133 RAM should be quite OK at 112MHz. The machine still seemed to be unstable with the different CPU chip so in desperation I changed the RAM as well. Suddenly stability returned and I've now been running it at 116MHz with no further errors. I can't say for sure that it was the RAM as I really didn't give the changed CPU enough of a test before changing the RAM as well. However, on balance, the RAM seems the most likely culprit. I'll probably bump the FSB to 118MHz shortly to see if all is still well, particularly if several more tasks go through without incident.
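
For reference, the clock arithmetic behind those figures can be sketched as below. It assumes a fixed multiplier of 13 (1300MHz at 100MHz FSB) and, as a further assumption, that the RAM runs at the same clock as the FSB, which is why PC133 modules ought to have headroom at these settings.

```python
STOCK_FSB_MHZ = 100
STOCK_CPU_MHZ = 1300
PC133_RATED_MHZ = 133  # nominal rating of the RAM modules

# Assumption: the multiplier is locked, so CPU speed scales with the FSB.
MULTIPLIER = STOCK_CPU_MHZ // STOCK_FSB_MHZ


def cpu_mhz(fsb_mhz):
    """CPU clock for a given FSB with the fixed multiplier."""
    return fsb_mhz * MULTIPLIER


def overclock_percent(fsb_mhz):
    """Overclock relative to the stock 100MHz FSB, in percent."""
    return 100 * (fsb_mhz - STOCK_FSB_MHZ) / STOCK_FSB_MHZ


if __name__ == "__main__":
    for fsb in (112, 116, 118, 120, 124):
        # Assumption: memory clock equals FSB clock (synchronous operation).
        headroom = PC133_RATED_MHZ - fsb
        print(f"FSB {fsb}MHz -> CPU {cpu_mhz(fsb)}MHz "
              f"(+{overclock_percent(fsb):.0f}%), RAM headroom {headroom}MHz")
```

At 124MHz FSB this gives 1612MHz, i.e. the "over 1600MHz" mentioned above, and even there the RAM would nominally still be inside its PC133 rating, so a module failing at 112MHz points to a marginal part rather than an excessive setting.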

So, in light of my experiences, I would ask the two previous posters who have reported similar errors to check if there are any hardware issues with CPU or RAM that might be responsible for their errors.

EDIT: Not all of the 4 errors on my machine were code 99s. One of them was a code -1073741819 access violation (Unhandled Exception) but I would guess that it's of little consequence as all the errors would seem to be attributable to faulty hardware.

Cheers,
Gary.
