Windows S5R2 App 4.33 available for Beta Test

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 689006104
RAC: 212003

RE: RE: You did restart

Message 70621 in response to message 70620

Quote:
Quote:
You did restart BOINC, though?? :-)

I exited BOINC, dragged the v4.33 files to my Einstein folder and then restarted BOINC. So, I guess the message got "scrolled" out. However, "the last 20%" rule of thumb you refer to can't be very accurate. The result I cited in this message showed the switchover and it occurred fairly early in its processing (roughly 15% from the start).

Not that it matters much, but the stderr of the result you cited starts with a debug message from computing skypos # 36477 of a total of 45570. That means the last 20 % of the run is covered. What you see in the middle of the output is the last re-start of the (new) client before completion, which isn't necessarily the switch-over to the new app (it did run before).
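For anyone wanting to check the arithmetic, here is a minimal sketch in Python. The two numbers are the skypos values quoted above; the function name is purely illustrative and is not part of the Einstein@Home app.

```python
def fraction_done(current_skypos: int, total_skypos: int) -> float:
    """Return the share of sky positions already processed (0.0 to 1.0)."""
    return current_skypos / total_skypos


# Values taken from the first debug message in the stderr output.
done = fraction_done(36477, 45570)
remaining = 1.0 - done

# The first logged sky position sits about 80% into the run,
# so the stderr only covers roughly the last 20%.
print(f"{done:.1%} done, {remaining:.1%} remaining")
```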

CU

H-B

Stick
Joined: 24 Feb 05
Posts: 790
Credit: 31200434
RAC: 262

RE: Not that it matters

Message 70622 in response to message 70621

Quote:

Not that it matters much, but the stderr of the result you cited starts with a debug message from computing skypos # 36477 of a total of 45570. That means the last 20 % of the run is covered. What you see in the middle of the output is the last re-start of the (new) client before completion, which isn't necessarily the switch-over to the new app (it did run before).

CU

H-B

Thank you for the explanation! As you probably noticed, I didn't get a response to that first message and didn't understand the stderr_txt process very well. I guess I had also "forgotten about" the second BOINC restart. (I also run uFluids and SZTAKI, and both have problems checkpointing. Therefore, I try to manage my BOINC restarts carefully and keep them to a minimum. But, obviously, I still can't remember when I do them. ;-))

Dave Burbank
Joined: 30 Jan 06
Posts: 275
Credit: 1548376
RAC: 0

My Host has completed one WU

My Host has completed one WU fully with the new beta and four others partly with the new beta and partly with the older beta. Everything looks good so far, still waiting on validation. I'm trying the new Linux beta now to see if it is 'back to speed'.

There are 10^11 stars in the galaxy. That used to be a huge number. But it's only a hundred billion. It's less than the national deficit! We used to call them astronomical numbers. Now we should call them economical numbers. - Richard Feynman

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5845
Credit: 109956980611
RAC: 31263256

Here is the results list of a

Here is the results list of a machine that was crashing on previous apps but then had success when switched to 4.33. The success was short-lived as there have now been two crashes since the "success" result whilst still on 4.33. Hopefully Bernd might be able to make some sense from this.

Cheers,
Gary.

Svenie25
Joined: 21 Mar 05
Posts: 139
Credit: 2436862
RAC: 0

Here the 4.33 seems to work

Here the 4.33 seems to work fine. Some WUs finished successfully and validated, but some are still pending.

http://einsteinathome.org/host/833861/tasks

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4273
Credit: 245208476
RAC: 13236

RE: Here is the results

Message 70626 in response to message 70624

Quote:
Here is the results list of a machine that was crashing on previous apps but then had success when switched to 4.33. The success was short-lived as there have now been two crashes since the "success" result whilst still on 4.33. Hopefully Bernd might be able to make some sense from this.


Thank you!

This is indeed very helpful, though it was not what I was hoping for. It means that the code we inserted to fix the access violation addresses the right problem, but doesn't do everything that is needed to fix it completely.

Any clue why you didn't get the symbols from the PDB ("einstein_S5R2_4.33_windows_intelx86.exe (-nosymbols- Symbols Loaded)")? Was the Einstein@Home main server not accessible at that time?

Please try to put a file named "EAH_MSC_BREAKPOINT" (w/o extension) into the BOINC directory of that machine (maybe suspend running tasks first and start new ones) and restart BOINC. Each task should fail with a client error, but it should download the PDB from the symbol store (which may take some seconds), and the stderr should list a "breakpoint encountered". Afterwards, stop the BOINC Client, remove the file, and start it again for (hopefully) normal operation.
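The file handling part of these steps could be scripted roughly as follows. This is only a sketch: the BOINC directory differs per installation and has to be supplied by the user, the function names are hypothetical, and stopping and restarting the BOINC Client still has to be done by hand.

```python
from pathlib import Path

# File name as given in the post; note that it has no extension.
MARKER_NAME = "EAH_MSC_BREAKPOINT"


def set_breakpoint_marker(boinc_dir: Path) -> Path:
    """Create the empty marker file in the given BOINC directory."""
    marker = boinc_dir / MARKER_NAME
    marker.touch(exist_ok=True)
    return marker


def clear_breakpoint_marker(boinc_dir: Path) -> bool:
    """Remove the marker file again; return True if a file was deleted."""
    marker = boinc_dir / MARKER_NAME
    if marker.exists():
        marker.unlink()
        return True
    return False
```

Usage would be something like `set_breakpoint_marker(Path(r"C:\Program Files\BOINC"))` before restarting BOINC and `clear_breakpoint_marker(...)` afterwards. That path is an assumed example, not necessarily where BOINC lives on a given machine.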

EDIT: I put up the PDB here for manual download. Putting it into the project directory beside the App file should also lead to a useful stack dump.

BM


Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5845
Credit: 109956980611
RAC: 31263256

RE: Any clue why you

Message 70627 in response to message 70626

Quote:

Any clue why you didn't get the symbols from the PDB ("einstein_S5R2_4.33_windows_intelx86.exe (-nosymbols- Symbols Loaded)")? Was the Einstein@Home main server not accessible at that time?

I've no idea. I have an "always on" broadband connection that doesn't appear to be having any issues at the moment. I'm not a programmer so I don't really know what to expect with debugging. I've now downloaded the .pdb and deployed it and stopped and restarted BOINC on that machine. The current result has clocked up over 20 hours so far without incident. About another 35 hours to go to completion.

The following snippet comes from the messages tab of BOINC Manager at the time of one of the crashes. There doesn't seem to be any attempt to download the .pdb from the server.

Quote:

2007-07-30 10:54:37 [Einstein@Home] Deferring communication for 1 min 0 sec
2007-07-30 10:54:37 [Einstein@Home] Reason: Unrecoverable error for result h1_0491.15_S5R2__155_S5R2c_0 ( - exit code -1073741819 (0xc0000005))
2007-07-30 10:54:38 [Einstein@Home] Computation for task h1_0491.15_S5R2__155_S5R2c_0 finished
2007-07-30 10:54:38 [Einstein@Home] Output file h1_0491.15_S5R2__155_S5R2c_0_0 for task h1_0491.15_S5R2__155_S5R2c_0 absent
2007-07-30 10:54:38 [Einstein@Home] Starting h1_0491.15_S5R2__147_S5R2c_1
2007-07-30 10:54:39 [Einstein@Home] Starting task h1_0491.15_S5R2__147_S5R2c_1 using einstein_S5R2 version 433
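As a side note on reading that exit code: the negative decimal and the hex value in the log are the same 32-bit number, and 0xc0000005 is the Windows status code for an access violation. A minimal Python sketch of the conversion follows; the helper names and the one-entry lookup table are illustrative only and are not something BOINC provides.

```python
def to_unsigned32(code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned value."""
    return code & 0xFFFFFFFF


# Small illustrative table; only the code seen in the log above is listed.
KNOWN_STATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
}


def describe_exit_code(code: int) -> str:
    """Format an exit code as decimal, hex and (if known) a status name."""
    unsigned = to_unsigned32(code)
    name = KNOWN_STATUS.get(unsigned, "unknown")
    return f"{code} = 0x{unsigned:08x} ({name})"


print(describe_exit_code(-1073741819))
```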

Please let me know if there is anything else you want me to do.

Cheers,
Gary.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4273
Credit: 245208476
RAC: 13236

RE: I've now downloaded the

Message 70628 in response to message 70627

Quote:
I've now downloaded the .pdb and deployed it and stopped and restarted BOINC on that machine. The current result has clocked up over 20 hours so far without incident. About another 35 hours to go to completion.


Thanks. You'll probably understand that I'm hoping that the error occurs again :-)

Quote:
The following snippet comes from the messages tab of BOINC Manager at the time of one of the crashes. There doesn't seem to be any attempt to download the .pdb from the server.


The PDB is not downloaded by the BOINC Client, but by the debugger embedded in the App.

BM


Stick
Joined: 24 Feb 05
Posts: 790
Credit: 31200434
RAC: 262

http://einstein.phys.uwm.edu/

http://einsteinathome.org/task/85948016

The above result errored out at about the 65% mark (about 40 hours) with an Exit status -185 (0xffffff47). The result started out on v4.32 but was switched to v4.33 when it was about 15% complete. The result was a "resend" and I wonder if there may be a problem with the WU. One other result ended with Exit status 99 (0x63). A couple of others were just "No Replies". (Earlier I complained about the result's 2 week deadline in this message).

However, there may be another explanation for the failure on my host. Earlier today, this computer downloaded a new update to "McAfee Security Center". There was a problem installing the update which I could only resolve by reverting Windows to an earlier "Restore Point" and then redownloading/reinstalling the update. I noticed that this result "crashed" sometime after going through all that. On the other hand, none of my other BOINC projects had any problems.

EDIT: The following messages (from my BOINC Messages tab) were repeated several times. (And, when I checked my Einstein folder the v4.33 app was gone.)

7/30/2007 9:38:45 PM|Einstein@Home|Couldn't start download of einstein_S5R2_4.33_windows_intelx86.exe
7/30/2007 9:38:45 PM|Einstein@Home|URL (null): invalid URL
7/30/2007 9:38:45 PM|Einstein@Home|Backing off 1 min 0 sec on download of file einstein_S5R2_4.33_windows_intelx86.exe
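For anyone who wants to sift a saved copy of the Messages tab for these failures, a rough Python sketch follows. It assumes the pipe-separated layout shown in the lines above (timestamp|project|text) and a US-style timestamp; the names `Message`, `parse_line` and `failed_downloads` are made up for the example and are not part of BOINC.

```python
from datetime import datetime
from typing import Iterable, List, NamedTuple


class Message(NamedTuple):
    when: datetime
    project: str
    text: str


def parse_line(line: str) -> Message:
    """Split one 'timestamp|project|text' line into its three fields."""
    stamp, project, text = line.strip().split("|", 2)
    when = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")
    return Message(when, project, text)


def failed_downloads(lines: Iterable[str]) -> List[Message]:
    """Keep only the messages that report a download that could not start."""
    prefix = "Couldn't start download of"
    return [m for m in map(parse_line, lines) if m.text.startswith(prefix)]


# The three lines quoted in the post above.
SAMPLE = [
    "7/30/2007 9:38:45 PM|Einstein@Home|Couldn't start download of einstein_S5R2_4.33_windows_intelx86.exe",
    "7/30/2007 9:38:45 PM|Einstein@Home|URL (null): invalid URL",
    "7/30/2007 9:38:45 PM|Einstein@Home|Backing off 1 min 0 sec on download of file einstein_S5R2_4.33_windows_intelx86.exe",
]

for msg in failed_downloads(SAMPLE):
    print(msg.when.isoformat(), msg.text)
```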

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5845
Credit: 109956980611
RAC: 31263256

RE: Thanks. You'll probably

Message 70630 in response to message 70628

Quote:
Thanks. You'll probably understand that I'm hoping that the error occurs again :-)

Of course, and I'm willing it to fail too, which is exactly why it probably won't :).

It's now 60% done having clocked up another 10+ hours during the day here. The machine is a P4 HT and I'm running both SETI and EAH at 50/50 on the 2 virtual cores. Both projects give a greater throughput by always having one of each running. Is there any possibility that the likelihood of a crash is influenced by running the two virtual cores? The machine had been running OK this way for about a month before the crashes suddenly started.

Quote:
The PDB is not downloaded by the BOINC Client, but by the debugger embedded in the App.

OK, I guess I just assumed that BOINC would be used for all transfers like this. You can tell that I wasn't really paying attention earlier when you talked about the app "phoning home" to download the debugging symbols when needed :). So why didn't it do what it was supposed to do? Is it some misconfiguration with my LAN? It would be really galling if there are no more crashes to analyse :).

Cheers,
Gary.
