I exited BOINC, dragged the v4.33 files to my Einstein folder and then restarted BOINC. So, I guess the message got "scrolled" out. However, "the last 20%" rule of thumb you refer to can't be very accurate. The result I cited in this message showed the switchover and it occurred fairly early in its processing (roughly 15% from the start).
Not that it matters much, but the stderr of the result you cited starts with a debug message from computing skypos # 36477 of a total of 45570. That means the last 20 % of the run is covered. What you see in the middle of the output is the last re-start of the (new) client before completion, which isn't necessarily the switch-over to the new app (it did run before).
CU
H-B
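For anyone who wants to check the "last 20%" figure, here is a small sketch of the arithmetic using the two skypos numbers from the stderr. It assumes that sky positions are processed in order and take roughly the same time each, which is not stated in the post; the function name is only illustrative.

```python
def fraction_done(current_skypos: int, total_skypos: int) -> float:
    """Fraction of sky positions already processed at the first logged message."""
    return current_skypos / total_skypos


# Numbers taken from the stderr discussed above.
start = fraction_done(36477, 45570)
print(f"stderr begins at about {start:.1%} of the run")
print(f"so it covers roughly the final {1 - start:.1%}")
```

Under that assumption the log begins about 80% of the way through, so anything it shows, including the restart, falls within the last fifth of the run.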
Thank you for the explanation! As you probably noticed, I didn't get a response to that first message and didn't understand the stderr_txt process very well. I guess I had also "forgotten about" the second BOINC restart. (I also run uFluids and SZTAKI; and, both have problems checkpointing. Therefore, I try to manage my BOINC restarts carefully and keep them to a minimum. But, obviously, I still can't remember when I do them. ;-))
My Host has completed one WU fully with the new beta and four others partly with the new beta and partly with the older beta. Everything looks good so far, still waiting on validation. I'm trying the new Linux beta now to see if it is 'back to speed'.
There are 10^11 stars in the galaxy. That used to be a huge number. But it's only a hundred billion. It's less than the national deficit! We used to call them astronomical numbers. Now we should call them economical numbers. - Richard Feynman
Here is the results list of a machine that was crashing on previous apps but then had success when switched to 4.33. The success was short-lived, as there have now been two crashes since the "success" result whilst still on 4.33. Hopefully Bernd might be able to make some sense from this.
Cheers,
Gary.
Thank you!
This is indeed very helpful, though it was not what I was hoping for. It means that the code we inserted to fix the access violation addresses the right problem, but doesn't do everything that is needed to fix it completely.
Any clue why you didn't get the symbols from the PDB ("einstein_S5R2_4.33_windows_intelx86.exe (-nosymbols- Symbols Loaded)")? Was the Einstein@Home main server not accessible at that time?
Please try to put a file named "EAH_MSC_BREAKPOINT" (w/o extension) into the BOINC directory of that machine (maybe suspend running tasks first and start new ones) and restart BOINC. Each task should fail with a client error, but it should download the PDB from the symbol store (which may take some seconds), and the stderr should list a "breakpoint encountered". Then stop the BOINC Client, remove the file, and start it again for (hopefully) normal operation.
EDIT: I put up the PDB here for manual download. Putting it into the project directory beside the App file should also lead to a useful stack dump.
BM
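If you would rather script the marker-file step than do it by hand, something like the following might work. It is only a sketch: the file name comes from the post above, but the helper names are made up, and the location of your BOINC directory is something you have to supply yourself. Stopping and restarting the BOINC client is not handled here.

```python
from pathlib import Path
import tempfile

MARKER_NAME = "EAH_MSC_BREAKPOINT"  # name from the post above; note: no extension


def set_breakpoint_marker(boinc_dir: Path) -> Path:
    """Create the empty marker file in the BOINC directory and return its path."""
    marker = boinc_dir / MARKER_NAME
    marker.touch(exist_ok=True)
    return marker


def clear_breakpoint_marker(boinc_dir: Path) -> None:
    """Remove the marker again so tasks run normally after the next restart."""
    (boinc_dir / MARKER_NAME).unlink(missing_ok=True)


if __name__ == "__main__":
    # Demo against a throwaway directory. On a real host, pass your own
    # BOINC directory instead (its location is an assumption you must supply).
    with tempfile.TemporaryDirectory() as tmp:
        boinc_dir = Path(tmp)
        marker = set_breakpoint_marker(boinc_dir)
        print(marker.name, "exists:", marker.exists())
        clear_breakpoint_marker(boinc_dir)
        print(marker.name, "exists:", marker.exists())
```

The order matters: set the marker, restart BOINC, let the tasks hit the breakpoint, then stop BOINC and clear the marker before starting it again.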
Quote:
Any clue why you didn't get the symbols from the PDB ("einstein_S5R2_4.33_windows_intelx86.exe (-nosymbols- Symbols Loaded)")? Was the Einstein@Home main server not accessible at that time?
I've no idea. I have an "always on" broadband connection that doesn't appear to be having any issues at the moment. I'm not a programmer so I don't really know what to expect with debugging. I've now downloaded the .pdb and deployed it and stopped and restarted BOINC on that machine. The current result has clocked up over 20 hours so far without incident. About another 35 hours to go to completion.
The following snippet comes from the messages tab of Boinc Manager at the time of one of the crashes. There doesn't seem to be any attempt to download the .pdb from the server.
Quote:
2007-07-30 10:54:37 [Einstein@Home] Deferring communication for 1 min 0 sec
2007-07-30 10:54:37 [Einstein@Home] Reason: Unrecoverable error for result h1_0491.15_S5R2__155_S5R2c_0 ( - exit code -1073741819 (0xc0000005))
2007-07-30 10:54:38 [Einstein@Home] Computation for task h1_0491.15_S5R2__155_S5R2c_0 finished
2007-07-30 10:54:38 [Einstein@Home] Output file h1_0491.15_S5R2__155_S5R2c_0_0 for task h1_0491.15_S5R2__155_S5R2c_0 absent
2007-07-30 10:54:38 [Einstein@Home] Starting h1_0491.15_S5R2__147_S5R2c_1
2007-07-30 10:54:39 [Einstein@Home] Starting task h1_0491.15_S5R2__147_S5R2c_1 using einstein_S5R2 version 433
Please let me know if there is anything else you want me to do.
Cheers,
Gary.
Quote:
I've now downloaded the .pdb and deployed it and stopped and restarted BOINC on that machine. The current result has clocked up over 20 hours so far without incident. About another 35 hours to go to completion.
Thanks. You'll probably understand that I'm hoping that the error occurs again :-)
Quote:
The following snippet comes from the messages tab of Boinc Manager at the time of one of the crashes. There doesn't seem to be any attempt to download the .pdb from the server.
The PDB is not downloaded by the BOINC Client, but by the debugger embedded in the App.
http://einsteinathome.org/task/85948016
The above result errored out at about the 65% mark (about 40 hours) with an Exit status -185 (0xffffff47). The result started out on v4.32 but was switched to v4.33 when it was about 15% complete. The result was a "resend" and I wonder if there may be a problem with the WU. One other result ended with Exit status 99 (0x63). A couple of others were just "No Replies". (Earlier I complained about the result's 2 week deadline in this message).
However, there may be another explanation for the failure on my host. Earlier today, this computer downloaded a new update to "McAfee Security Center". There was a problem installing the update which I could only resolve by reverting Windows to an earlier "Restore Point" and then redownloading/reinstalling the update. I noticed that this result "crashed" sometime after going through all that. On the other hand, none of my other BOINC projects had any problems.
EDIT: The following messages (from my BOINC Messages tab) were repeated several times. (And, when I checked my Einstein folder, the v4.33 app was gone.)
7/30/2007 9:38:45 PM|Einstein@Home|Couldn't start download of einstein_S5R2_4.33_windows_intelx86.exe
7/30/2007 9:38:45 PM|Einstein@Home|URL (null): invalid URL
7/30/2007 9:38:45 PM|Einstein@Home|Backing off 1 min 0 sec on download of file einstein_S5R2_4.33_windows_intelx86.exe
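A side note on the exit codes quoted in this thread: the negative decimal number and the hex value in parentheses are the same 32-bit pattern read as signed and unsigned. Here is a minimal sketch that reproduces the conversion. The helper names are made up, and the only symbolic name filled in is the standard Windows one for 0xC0000005; what -185 and 99 mean to BOINC or the app is not something this sketch claims to know.

```python
def to_unsigned32(exit_code: int) -> int:
    """Reinterpret a signed 32-bit exit code as its unsigned value."""
    return exit_code & 0xFFFFFFFF


# Only this one name is filled in; it is the standard Windows NTSTATUS
# value for an access violation.
KNOWN_STATUS = {0xC0000005: "STATUS_ACCESS_VIOLATION"}


def describe(exit_code: int) -> str:
    """Format an exit code as decimal, 8-digit hex and a name if known."""
    value = to_unsigned32(exit_code)
    name = KNOWN_STATUS.get(value, "not a code this sketch knows")
    return f"{exit_code} -> 0x{value:08x} ({name})"


# The three exit codes that appear in this thread.
for code in (-1073741819, -185, 99):
    print(describe(code))
```

The hex is padded to eight digits here, so 99 shows up as 0x00000063 rather than the 0x63 that the results page prints.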
Quote:
Thanks. You'll probably understand that I'm hoping that the error occurs again :-)
Of course, and I'm willing it to fail too, which is exactly why it probably won't :).
It's now 60% done, having clocked up another 10+ hours during the day here. The machine is a P4 HT and I'm running both Seti and EAH at 50/50 on the 2 virtual cores. Both projects give a greater throughput by always having one of each running. Is there any possibility that the likelihood of a crash is influenced by running the two virtual cores? The machine had been running OK this way for about a month before the crashes suddenly started.
Quote:
The PDB is not downloaded by the BOINC Client, but by the debugger embedded in the App.
OK, I guess I just assumed that BOINC would be used for all transfers like this. You can tell that I wasn't really paying attention earlier when you talked about the app "phoning home" to download the debugging symbols when needed :). So why didn't it do what it was supposed to do? Is it some misconfiguration with my LAN? It would be really galling if there are no more crashes to analyse :).
Cheers,
Gary.
Here the 4.33 seems to work fine. Some WUs finished successfully and validated, but some are still pending.
http://einsteinathome.org/host/833861/tasks