Hmmmm, my computer isn't overclocked and has always been very stable. Also, E@H seemed to crash after having started a brand new WU. Along with the "failing" WU, all the other incomplete ones were trashed too. I solved the problem by deactivating the network. I'll wait for my two WUs to finish before I allow BOINC to contact the servers again. It seems to work. I am still using 4.02.
Quote:
Also, E@H seemed to crash after having started a brand new WU. Along with the "failing" WU, all the other incomplete ones were trashed too. I solved the problem by deactivating the network.
I won't know precisely what kind of crash that was until the tasks have been reported, but a bad network driver could mess up the FPU stack, which might explain e.g. "Input domain errors".
BM
Quote:
I'm quite sure I have no HW problems, since I can crunch SETI and Climate WUs without errors.
AFAIK SETI isn't that picky about FP operations, and CPDN only validates results after the tasks have completed (if at all), which takes months.
Quote:
My CPU is not overclocked - quite the opposite, I clock it down from 3200 to 2400 to make it less noisy ;-)
That's interesting indeed.
Quote:
Are there other possible errors, or should I try again with the 4.16?
I'd say try the 4.16.
It might be that your machine is already using it, and that the tasks run with 4.02 had already been assigned to that App version before you installed the 4.16 App.
What do the client messages say about the last Einstein task that was started? Is there a "found app_info.xml - using anonymous platform" message after starting the Client?
BM
Quote:
Well, maybe I found the error: in the app_info.xml, version 402 was listed with the 402 libraries.
That's not an error - the 4.02 versions of the app and .pdb files are supposed to be listed in the app_info.xml file. If you read further in that file you should also find the 4.16 versions separately listed.
The reason for this is that the format of checkpoint files, regularly saved during crunching, changed for versions later than 4.02. You should not be attempting to resume crunching with 4.16 something that was commenced with 4.02 since 4.16 will not be able to read the 4.02 saved checkpoint. The correct behaviour is to allow version 4.02 to "finish off" any work "branded" 4.02 and then to allow 4.16 to do any work "branded" 4.09 or above. If you carefully read the full app_info file as distributed you should be able to see how this works.
As Bernd tried to point out in his previous message, it may be that the reason that 4.16 was not running could be due to the size of your tasks cache. If you have work on hand, downloaded before you installed 4.16, that work will be "branded" 4.02 and the app_info mechanism will use 4.02 to crunch it until it is all gone. Any new tasks downloaded will be branded 4.16 and will be crunched with 4.16 once the 4.02 work is exhausted. The process can be fast tracked but involves other editing which the average user would not be familiar with and is therefore not recommended unless you really understand exactly what you are doing. It is much safer to let the app_info mechanism, as distributed, do its thing.
The result of the edit you posted would be to cause any 4.02 work in progress to fail. I'm not sure exactly what would happen to a partly crunched result. When the checkpoint couldn't be read, either the result would crash or it might be completely restarted from the beginning. Either way you lose something. However, 4.16 will be able to crunch 4.02 "branded" tasks if they haven't already been started (and therefore no incompatible checkpoint exists).
Cheers,
Gary.
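The version matching described above can be sketched roughly as follows. This is only an illustration of the rule as stated in this post, not the actual BOINC client logic; the `Task` class and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str              # hypothetical task label
    branded_version: int   # app version the task was assigned to, e.g. 402 or 416
    has_checkpoint: bool   # True once crunching has started and a checkpoint exists


def app_for_task(task: Task) -> int:
    """Choose the app version the way the distributed app_info.xml is described to."""
    if task.branded_version == 402:
        return 402  # let 4.02 finish off its own work
    if task.branded_version >= 409:
        return 416  # work branded 4.09 or above is handled by 4.16
    raise ValueError(f"no app listed for version {task.branded_version}")


def safe_to_switch_to_416(task: Task) -> bool:
    """4.16 cannot read a 4.02 checkpoint, so only unstarted 4.02 work may switch."""
    return task.branded_version >= 409 or not task.has_checkpoint


tasks = [
    Task("old_started", 402, True),
    Task("old_unstarted", 402, False),
    Task("new", 416, False),
]
for t in tasks:
    print(t.name, app_for_task(t), safe_to_switch_to_416(t))
```

The second function captures why forcing 4.16 onto work in progress is risky: only a task without a saved 4.02 checkpoint can change versions without losing anything.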
11/12/2007 13:53:20|Einstein@Home|Task h1_0588.85_S5R2__151_S5R3a_3 exited with zero status but no 'finished' file
11/12/2007 13:53:20|Einstein@Home|If this happens repeatedly you may need to reset the project.
11/12/2007 13:53:20|Einstein@Home|Restarting task h1_0588.85_S5R2__151_S5R3a_3 using einstein_S5R3 version 415
11/12/2007 15:23:56|Einstein@Home|Computation for task h1_0588.85_S5R2__151_S5R3a_3 finished
11/12/2007 15:23:56|Einstein@Home|Output file h1_0588.85_S5R2__151_S5R3a_3_0 for task h1_0588.85_S5R2__151_S5R3a_3 absent
Yes, both yours and the previous poster's errors seem quite similar - map index out of bounds at line 388 in HoughMap.c. I'm sure Bernd will look at these when he is able to.
Quote:
ERROR: HoughMap.c 388: map index out of bounds: 3008 [0..2970] j:52 xp[j]:148
Level 0: $Id: HierarchicalSearch.c,v 1.184 2007/09/28 21:26:58 reinhard Exp $
Function call `ComputeFstatHoughMap ( &status, &semiCohCandList, &pgV, &semiCohPar)' failed.
file HierarchicalSearch.c, line 1105
2007-12-11 15:23:52.7812 [normal]:
Level 1: $Id: HierarchicalSearch.c,v 1.184 2007/09/28 21:26:58 reinhard Exp $
2007-12-11 15:23:52.7812 [normal]: Status code -1: Recursive error
2007-12-11 15:23:52.7968 [normal]: function ComputeFstatHoughMap, file HierarchicalSearch.c, line 1916
2007-12-11 15:23:52.7968 [normal]:
Level 2: $Id: DriveHough.c,v 1.17 2007/01/08 17:30:04 reinhard Exp $
2007-12-11 15:23:52.7968 [normal]: Status code -1: Recursive error
2007-12-11 15:23:52.7968 [normal]: function LALHOUGHConstructHMT_W, file DriveHough.c, line 630
2007-12-11 15:23:52.7968 [normal]:
Level 3: $Id: HoughMap.c,v 1.11 2007/07/23 14:48:20 bema Exp $
2007-12-11 15:23:52.7968 [normal]: Status code 2: Invalid input size
2007-12-11 15:23:52.7968 [normal]: function LALHOUGHAddPHMD2HD_W, file HoughMap.c, line 389
2007-12-11 15:23:52.7968 [CRITICAL]: BOINC_LAL_ErrHand(): now calling boinc_finish()
If you look at this list of results, you will see a set of similar errors followed by some successful results. This is one of my machines and here is a snippet from the oldest of the error outputs. It's also listed as error code 99 like yours:
Quote:
260, c
261,
ERROR: HoughMap.c 388: map index out of bounds: 16540 [0..2970] j:0 xp[j]:16540
Level 0: $Id: HierarchicalSearch.c,v 1.184 2007/09/28 21:26:58 reinhard Exp $
Function call `ComputeFstatHoughMap ( &status, &semiCohCandList, &pgV, &semiCohPar)' failed.
file HierarchicalSearch.c, line 1105
2007-12-07 02:12:59.9531 [normal]:
Level 1: $Id: HierarchicalSearch.c,v 1.184 2007/09/28 21:26:58 reinhard Exp $
2007-12-07 02:12:59.9531 [normal]: Status code -1: Recursive error
2007-12-07 02:12:59.9687 [normal]: function ComputeFstatHoughMap, file HierarchicalSearch.c, line 1916
2007-12-07 02:12:59.9687 [normal]:
Level 2: $Id: DriveHough.c,v 1.17 2007/01/08 17:30:04 reinhard Exp $
2007-12-07 02:12:59.9687 [normal]: Status code -1: Recursive error
2007-12-07 02:12:59.9687 [normal]: function LALHOUGHConstructHMT_W, file DriveHough.c, line 630
2007-12-07 02:12:59.9687 [normal]:
Level 3: $Id: HoughMap.c,v 1.11 2007/07/23 14:48:20 bema Exp $
2007-12-07 02:12:59.9687 [normal]: Status code 2: Invalid input size
2007-12-07 02:12:59.9687 [normal]: function LALHOUGHAddPHMD2HD_W, file HoughMap.c, line 389
2007-12-07 02:12:59.9687 [CRITICAL]: BOINC_LAL_ErrHand(): now calling boinc_finish()
and here is a further snippet from the last (most recent) of the set of errors:
Quote:
15, c
16,
ERROR: HoughMap.c 423: map index out of bounds: 26801 [0..2970] j:30 xp[j]:25151
Level 0: $Id: HierarchicalSearch.c,v 1.184 2007/09/28 21:26:58 reinhard Exp $
Function call `ComputeFstatHoughMap ( &status, &semiCohCandList, &pgV, &semiCohPar)' failed.
file HierarchicalSearch.c, line 1105
2007-12-07 04:09:26.0937 [normal]:
Level 1: $Id: HierarchicalSearch.c,v 1.184 2007/09/28 21:26:58 reinhard Exp $
2007-12-07 04:09:26.0937 [normal]: Status code -1: Recursive error
2007-12-07 04:09:26.0937 [normal]: function ComputeFstatHoughMap, file HierarchicalSearch.c, line 1916
2007-12-07 04:09:26.0937 [normal]:
Level 2: $Id: DriveHough.c,v 1.17 2007/01/08 17:30:04 reinhard Exp $
2007-12-07 04:09:26.0937 [normal]: Status code -1: Recursive error
2007-12-07 04:09:26.0937 [normal]: function LALHOUGHConstructHMT_W, file DriveHough.c, line 630
2007-12-07 04:09:26.0937 [normal]:
Level 3: $Id: HoughMap.c,v 1.11 2007/07/23 14:48:20 bema Exp $
2007-12-07 04:09:26.0937 [normal]: Status code 2: Invalid input size
2007-12-07 04:09:26.0937 [normal]: function LALHOUGHAddPHMD2HD_W, file HoughMap.c, line 424
2007-12-07 04:09:26.0937 [CRITICAL]: BOINC_LAL_ErrHand(): now calling boinc_finish()
Once again, a very similar error but a different line number - 423 - in HoughMap.c
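For anyone comparing several of these stderr outputs, a small script can pull the numbers out of the "map index out of bounds" lines. This is a minimal sketch that assumes the line format is exactly as in the three snippets quoted in this thread; the function name and field names are made up for illustration.

```python
import re

# Matches lines such as:
# ERROR: HoughMap.c 388: map index out of bounds: 3008 [0..2970] j:52 xp[j]:148
PATTERN = re.compile(
    r"ERROR: (?P<file>\S+) (?P<line>\d+): map index out of bounds: "
    r"(?P<index>-?\d+) \[(?P<lo>-?\d+)\.\.(?P<hi>-?\d+)\] "
    r"j:(?P<j>-?\d+) xp\[j\]:(?P<xp>-?\d+)"
)


def parse_bounds_error(text):
    """Return the fields of the first matching error line, or None if absent."""
    match = PATTERN.search(text)
    if match is None:
        return None
    fields = match.groupdict()
    return {k: (v if k == "file" else int(v)) for k, v in fields.items()}


LINES = [
    "ERROR: HoughMap.c 388: map index out of bounds: 3008 [0..2970] j:52 xp[j]:148",
    "ERROR: HoughMap.c 388: map index out of bounds: 16540 [0..2970] j:0 xp[j]:16540",
    "ERROR: HoughMap.c 423: map index out of bounds: 26801 [0..2970] j:30 xp[j]:25151",
]

for line in LINES:
    err = parse_bounds_error(line)
    # Show the source line and how far the index overshoots the upper bound.
    print(err["file"], err["line"], "overshoot:", err["index"] - err["hi"])
```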
So, in my case, why did these errors then stop?
Well, the machine in question has a Tualatin Celeron processor (1300MHz) with a FSB of 100MHz. These machines are quite overclockable and I have around 30 of them, all overclocked, some as high as 124MHz FSB which gives a CPU speed of over 1600MHz. Most run at around 118 - 120MHz FSB but the above machine was only running at 112MHz and producing errors.
I decided to change the processor as I felt that the PC133 RAM should be quite OK at 112MHz. The machine still seemed to be unstable with the different CPU chip so in desperation I changed the RAM as well. Suddenly stability returned and I've now been running it at 116MHz with no further errors. I can't say for sure that it was the RAM as I really didn't give the changed CPU enough of a test before changing the RAM as well. However, on balance, the RAM seems the most likely culprit. I'll probably bump the FSB to 118MHz shortly to see if all is still well, particularly if several more tasks go through without incident.
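For reference, the clock figures mentioned above follow from simple arithmetic. The sketch below assumes a fixed multiplier of 13, which is inferred from 1300MHz at a 100MHz FSB and is not stated in the post; the names are illustrative only.

```python
STOCK_CPU_MHZ = 1300
STOCK_FSB_MHZ = 100
# Assumption: the multiplier is locked, so it is just stock CPU speed / stock FSB.
MULTIPLIER = STOCK_CPU_MHZ // STOCK_FSB_MHZ


def cpu_mhz(fsb_mhz: int) -> int:
    """CPU clock for a given FSB with the fixed multiplier."""
    return fsb_mhz * MULTIPLIER


for fsb in (100, 112, 116, 118, 120, 124):
    print(f"FSB {fsb}MHz -> CPU {cpu_mhz(fsb)}MHz")
```

With this assumption a 124MHz FSB works out to 1612MHz, which agrees with the "over 1600MHz" figure.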
So, in light of my experiences, I would ask the two previous posters who have reported similar errors to check if there are any hardware issues with CPU or RAM that might be responsible for their errors.
EDIT: Not all of the 4 errors on my machine were code 99s. One of them was a code -1073741819 access violation (Unhandled Exception), but I would guess that it's of little consequence, as all the errors would seem to be attributable to faulty hardware.
Cheers,
Gary.
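The large negative exit codes are Windows status codes printed as signed 32-bit integers. A minimal sketch for converting them to the more familiar hex form is below; the helper name and the two-entry lookup table are illustrative, and the table covers only the two codes reported in this thread.

```python
def to_status_hex(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as an unsigned hex status value."""
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


# Only the two codes seen in this thread; meanings as defined by Windows.
KNOWN_CODES = {
    "0xC0000005": "access violation",
    "0xC0000091": "floating-point overflow",
}

for code in (-1073741819, -1073741679):
    hex_code = to_status_hex(code)
    print(code, hex_code, KNOWN_CODES.get(hex_code, "unknown"))
```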
I started 1 new WU about 14 hours ago - still running :-)
Well, maybe I found the error: in the app_info.xml, version 402 was listed with the 402 libraries.
I have now changed it to the following:
Now it shows 4.16 in the messages as well.
I'll keep you updated when the WU is finished.
Regards Barney
Hello Gary and Bernd,
now, with the new WU and version 4.16, I was able to finish without error messages:
http://einsteinathome.org/task/88887402
As for the app_info.xml:
Maybe you were right that I already had WUs when I tried the 4.16.
But I think I will leave the app_info.xml as it is, since it is working for me now :-)
Thx for your patience
Barney
23/11/2007 18:01:56|Einstein@Home|Reason: Unrecoverable error for result h1_0438.65_S5R2__95_S5R3a_2 ( - exit code -1073741679 (0xc0000091))
resultid=88969298
???
Just had this compute error. Not sure if it's the workunit or my system.
11/24/2007 4:34:49 PM|Einstein@Home|Resuming task h1_0578.15_S5R2__179_S5R3a_1 using einstein_S5R3 version 415
11/24/2007 6:07:43 PM|Einstein@Home|Computation for task h1_0578.15_S5R2__179_S5R3a_1 finished
11/24/2007 6:07:43 PM|Einstein@Home|Output file h1_0578.15_S5R2__179_S5R3a_1_0 for task h1_0578.15_S5R2__179_S5R3a_1 absent
The runtime was 5:54; usual runtime 9:30 to 10: hrs.
Just had the same error:
Here's the WU
The end of stderr.txt is
I'm using BOINC 5.10.28 and Win XP. It was the sole crunching task when this happened and the only program running apart from AVG and Comodo Firewall.