S41.xx Observation Thread

Raistmer*
Joined: 20 Feb 05
Posts: 198
Credit: 69,584,113
RAC: 164,679

Message 30030 in response to message 30029

Quote:


I didn't get new errors since the FSB was set back to 150MHz from 153MHz.

code snippet:
0x0040AA17 SUBPS XMM0,[0x0040AD30]
0x0040AA1E ADDPS XMM1,[0x0040AD30]

Both instructions access only the 0x0040AD30 memory area directly, not the '0x00000000' (invalid address).

Quote:
***UNHANDLED EXCEPTION****
Reason: Access Violation (0xc0000005) at address 0x0040AA17 read attempt to address 0xFFFFFFFF

It seems to be the same problem.

I think the processor could not prepare the address in time.
It has lots of work with this code...


So it's overclocking. I'll watch closely for this on my other overclocked PC when I test the S version on it. Thank you for the explanation.

Ananas
Joined: 22 Jan 05
Posts: 272
Credit: 2,500,681
RAC: 0

Mine isn't OCed (it's a Tyan board without any OC settings). It was quite warm, though, as one case vent broke: 11°C more than usual on CPU1.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,513
Credit: 68,643,230,060
RAC: 58,879,425

Message 30032 in response to message 30029

Quote:
Quote:
***UNHANDLED EXCEPTION****
Reason: Access Violation (0xc0000005) at address 0x0040AA1E read attempt to address 0x00000000

Three of my computers run S41.06/S41.07.
Dothan 1,86GHz: 108 valid / 0 invalid
ThoroughbredB 1,54GHz: 85 valid / 0 invalid
Applebread 2,07GHz: 27 valid / 23 invalid (each unit with access violation)

I didn't get new errors since the FSB was set back to 150MHz from 153MHz.

code snippet:
0x0040AA17 SUBPS XMM0,[0x0040AD30]
0x0040AA1E ADDPS XMM1,[0x0040AD30]

Both instructions access only the 0x0040AD30 memory area directly, not the '0x00000000' (invalid address).

Quote:
***UNHANDLED EXCEPTION****
Reason: Access Violation (0xc0000005) at address 0x0040AA17 read attempt to address 0xFFFFFFFF

It seems to be the same problem.

I think the processor could not prepare the address in time.
It has lots of work with this code...

I have some information that may be relevant to this issue of unhandled exceptions. I have around 70 or so machines crunching for EAH, many of which are overclocked and most of which are running the S41.06/07 version of Albert. Some are running S40.12 and just a few are still running the stock app. I only noticed your brilliant optimisation work around a week ago and have been converting machines to the S41.07 version as quickly as possible as this is giving the best results for me.

Around 20 of my machines are HP Vectra VL420s which I purchased at a surplus government equipment auction. (It's amazing what the Government is prepared to throw away for a song - but I digress). They have a P4 1.6GHz Willamette CPU and use PC133 SDRAM. They have a suitable PLL chip (ICS 950202) so the FSB can be tweaked with almost limitless precision using CPUFSB while running under WindowsXP-SP2. The stock configuration is 100 FSB and 16x multiplier. With CPUFSB, approximately 70% of these machines are running quite happily at 125 FSB and 16x multiplier = 2.0GHz. They are Prime95 stable (approx 3 hrs runtime) at that level. Prime95 errors seem to creep in around 2050 to 2100MHz. Any machines that show Prime95 rounding errors lower than 2050 get backed off to around 1950 and so on. For some reason one or two machines in the batch need to be around 1750 - 1800 before they will operate stably.

Collectively, these machines have done thousands of results without me noticing any invalids or any abnormal program terminations until now. Be aware, however, that I don't have time to monitor 70+ machines closely, so I probably wouldn't notice the odd error or even the odd batch of errors if the problem went away quickly. I would certainly notice a machine lockup, and that hasn't been happening with these machines.
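
For anyone following the numbers, the clock speeds quoted here follow from the usual relation CPU clock = FSB x multiplier. Below is a minimal Python sketch of that arithmetic, assuming the 16x multiplier stays fixed. The function names are made up for illustration and have nothing to do with CPUFSB itself.

```python
def cpu_clock_mhz(fsb_mhz: float, multiplier: float = 16.0) -> float:
    """CPU core clock in MHz, assuming clock = FSB x multiplier."""
    return fsb_mhz * multiplier


def fsb_for_clock(target_mhz: float, multiplier: float = 16.0) -> float:
    """FSB in MHz needed to reach a target core clock at a fixed multiplier."""
    return target_mhz / multiplier


if __name__ == "__main__":
    # Stock setting and the overclocked setting mentioned in the post.
    for fsb in (100.0, 125.0):
        print(f"FSB {fsb:6.1f} MHz x 16 = {cpu_clock_mhz(fsb):6.0f} MHz")

    # FSB values that correspond to the back-off targets of 1950 and 1800 MHz.
    for target in (1950.0, 1800.0):
        print(f"{target:.0f} MHz needs FSB {fsb_for_clock(target):.3f} MHz")
```

The same relation works in reverse, which is handy when deciding how far to drop the FSB to hit a given core clock.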

Just one machine in the whole batch of VL420s has now within the last 24 hours had 32 of these unhandled exceptions and this was enough for me to notice. Having read your comments implicating overclocking, I have tried to investigate this issue thoroughly. Here is the information I have gleaned:-

Machine: HP Vectra VL420 - P4 Willamette 1600 @ 1960MHz (FSB = 122.5MHz)
Optimised App Running: S41.06 - Boinc Version 5.2.13
EAH CPUID: 536520
Identical Machine CPUID: 610566 - running @ 2000MHz - no errors or invalids
Recent Result Names (536520): r1_1221.5, z1_1086.0, z1_1320.0, z1_1158.5, z1_1269.0

Error Message:

Quote:

5.2.13
- exit code -1073741819 (0xc0000005)

2006-05-08 10:41:09.6250 [normal]: Optimised by akosf S41.06 --> 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-05-08 10:41:09.6250 [normal]: Started search at lalDebugLevel = 0
2006-05-08 10:41:11.3281 [normal]: Checkpoint-file 'Fstat.out.ckp' not found.
2006-05-08 10:41:11.3281 [normal]: No usable checkpoint found, starting from beginning.

***UNHANDLED EXCEPTION****
Reason: Access Violation (0xc0000005) at address 0x0040ABF8 read attempt to address 0x02024000

1: 05/08/06 10:41:11
1: e:\\einsteinathome\\cfs\\windows_build\\albert4.37\\cfslaldemod.c(945) +9 bytes (TestLALDemod)
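
The two numbers in the exit code line are the same 32-bit value: -1073741819 is 0xc0000005 read as a signed integer. A minimal Python sketch of the conversion, with hypothetical helper names:

```python
def to_signed32(value: int) -> int:
    """Interpret the low 32 bits of value as a signed two's-complement integer."""
    value &= 0xFFFFFFFF
    return value - 0x1_0000_0000 if value & 0x8000_0000 else value


def to_unsigned32(value: int) -> int:
    """Interpret value as an unsigned 32-bit integer."""
    return value & 0xFFFFFFFF


if __name__ == "__main__":
    print(to_signed32(0xC0000005))          # signed form of the status code
    print(hex(to_unsigned32(-1073741819)))  # hex form of the exit code
```

So the negative exit code and the access violation code in the exception message describe one and the same failure.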

Of the various recent result files processed, the errors (32 of them in total) have all come from just one file, z1_1269.0, and none of the others. There have been no successfully completed results from z1_1269.0. The addresses mentioned in the message always seemed to be the same (I checked about 10 or so). Interspersed with the errors were perfectly normal and valid results which came from the other mentioned result files. There have been no errors from any other result file that I have noticed.
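
To compare the addresses across many failed results without reading each message by eye, something like the following Python sketch could be used. It is only an illustration: the regular expression is an assumption based on the three messages quoted in this thread (the "write" alternative is a guess, since only "read attempt" appears here), and parse_violation is a hypothetical helper.

```python
import re

# Pattern assumed from the messages quoted in this thread.
PATTERN = re.compile(
    r"Access Violation \((0x[0-9a-fA-F]+)\) at address (0x[0-9a-fA-F]+) "
    r"(read|write) attempt to address (0x[0-9a-fA-F]+)"
)


def parse_violation(line):
    """Return the code, faulting instruction address, access type and
    target address from one message line, or None if it does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    code, instruction, access, target = match.groups()
    return {
        "code": int(code, 16),
        "instruction": int(instruction, 16),
        "access": access,
        "target": int(target, 16),
    }


# The three messages quoted in this thread.
messages = [
    "Reason: Access Violation (0xc0000005) at address 0x0040AA1E read attempt to address 0x00000000",
    "Reason: Access Violation (0xc0000005) at address 0x0040AA17 read attempt to address 0xFFFFFFFF",
    "Reason: Access Violation (0xc0000005) at address 0x0040ABF8 read attempt to address 0x02024000",
]

parsed = [v for v in map(parse_violation, messages) if v is not None]

# Distinct (instruction, target) pairs; a single pair across many failures
# would suggest the crash always happens at the same place.
distinct = {(v["instruction"], v["target"]) for v in parsed}

if __name__ == "__main__":
    for instruction, target in sorted(distinct):
        print(f"instruction 0x{instruction:08X} -> target 0x{target:08X}")
```

Feeding it the messages from all failed results of one machine would show at a glance whether the addresses really are identical every time.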

There are still three results from z1_1269.0 left in the work cache, which will be done in the next few hours or so once the others that are processing correctly have finished. I have backed off the CPU speed to 1900 from 1960 (should I back it off more?) and will report if this allows the last three problem results to complete correctly. Apart from these unhandled exceptions, everything else about the operation of the computer seems perfectly normal. The puzzling thing is why there are errors only from z1_1269.0 and not from any other result file.

I will report further once the three results of interest have been processed. There are still six @ 1.5 hrs each ahead of them in the worklist.

Cheers, (and many thanks for your brilliant work),

Cheers,
Gary.

Michael Roycraft
Joined: 10 Mar 05
Posts: 846
Credit: 157,718
RAC: 0

Message 30033 in response to message 30032

Quote:
Cheers, (and many thanks for your brilliant work),

GARY!!!

It's so good to see you've returned!!! Where the devil have you been, old friend?

Michael Roycraft

microcraft
"The arc of history is long, but it bends toward justice" - MLK

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,513
Credit: 68,643,230,060
RAC: 58,879,425

Message 30034 in response to message 30033

Quote:
Where the devil have you been, old friend?

Hi Michael,

So as not to disrupt this thread, I'll email you when I get a chance.

It's good to see you still contributing here.

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,513
Credit: 68,643,230,060
RAC: 58,879,425

Message 30035 in response to message 30032

Quote:

I will report further once the three results of interest have been processed. There are still six @ 1.5 hrs each ahead of them in the worklist.

The six have now finished successfully and the three from the file z1_1269.0 have failed yet again, even with the lowered FSB. In the meantime, two more from this same problem file have been downloaded so I've further reduced the FSB so that the CPU is at 1800MHz for the next lot. Once again there are a number of others from non-problem files to be processed before the next from z1_1269.0 comes up for processing.

I'll report what happens with the next one when it gets to run. There are two of these and they are separated so if the first fails again I'll have a further opportunity to go even lower with FSB. Maybe even more will download in the interim.

I have at least a dozen identical machines running the FSB at 125 (CPU at 2000MHz) showing no signs of similar problems. Why does z1_1269.0 have to be so picky? :)

Cheers,
Gary.

Gray Handcock
Joined: 11 Mar 05
Posts: 211
Credit: 135,567
RAC: 0

Greetings All

Just to add my bit - I've done probably around 20 WUs so far, both small and big, on this 2.6 Intel with no overclocking but a well-tweaked OS, and no errors so far, though several are still pending. Obviously using S41.07.

To Akosf, many thanks !!

Gray

Akos Fekete
Joined: 13 Nov 05
Posts: 561
Credit: 4,527,270
RAC: 0

Message 30037 in response to message 30029

Quote:

***UNHANDLED EXCEPTION****
Reason: Access Violation (0xc0000005) at address 0x0040AA1E read attempt to address 0x00000000

Applebread 2,07GHz: 27 valid / 23 invalid (each unit with access violation)

I didn't get new errors since the FSB was set back to 150MHz from 153MHz.

Yes. This FSB modification totally eliminates the error.
So, it was a good CPU stability test. :)

EggZZ
Joined: 7 Feb 06
Posts: 2
Credit: 9,259,991
RAC: 0

Message 30038 in response to message 30037

hi akos,

I can give you a full VPN/VNC account on a 3,0 GHz Xeon with HT for testing new Alberts. The machine is running 24/7 only for Einstein@Home :^)

Contact:

EggZZ

Akos Fekete
Joined: 13 Nov 05
Posts: 561
Credit: 4,527,270
RAC: 0

Message 30039 in response to message 30038

Quote:
I can give you a full VPN/VNC account on a 3,0 GHz Xeon with HT for testing new Alberts. The machine is running 24/7 only for Einstein@Home :^)

Thanks, EggZZ.
But I cannot take advantage of the opportunity.
I have a lot of work and a lot of other things to do, and no free time.
