CPU frequency-related S5R2 errors

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7214014931
RAC: 952984
Topic 192631

One of my hosts errored three of its first four S5R2 results in the first two hours of running S5R2. I speculated that the problem might be speed, and have, I think, got very strong corroboration in the last two days.

SUMMARY

Host 882945 is a Conroe E6600 on a Gigabyte 965P-DS3 motherboard, presently overclocked to 3.006 GHz under Windows XP SP2. I've used voltage as the exploratory variable:

While 1.31875 was enough to run S5RI, it gave infrequent SETI errors, which only vanished on raising three increments, to 1.34375. S5R2, however, gives errors in less than an hour until the voltage is raised three more increments, to 1.3625.

DETAILS

all for 3.006 GHz

Vcore (V)   S5RI                          SETI                       S5R2
1.31875     no problem in two full days   -                          -
1.325       OK                            one error after 12 hours   error after 6 minutes
1.33125     -                             error after 16 hours       error after 12 minutes
1.3375      -                             not tried                  error after 7 minutes
1.34375     -                             good for days at least     error after 22 minutes
1.35        -                             -                          error after 26 minutes
1.35625     -                             -                          error after 11 minutes
1.36250     -                             -                          no errors yet (7 hours, two results completed, one credited)
1.375       -                             -                          no errors in two days (seven results complete, several credited)

(- = not reported)
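For anyone who wants to tabulate their own runs the same way, here is a minimal Python sketch that encodes the observations above and reports the lowest voltage at which each application ran clean. The 0.00625 V step size is an assumption inferred from the spacing of the listed voltages, and the function names are purely illustrative.

```python
# Observations transcribed from the post above (all at 3.006 GHz).
# True = ran without error at that voltage, False = at least one error.
STEP_V = 0.00625  # assumed Vcore step, inferred from the spacing of the listed values

s5r2 = {1.325: False, 1.33125: False, 1.3375: False, 1.34375: False,
        1.35: False, 1.35625: False, 1.3625: True, 1.375: True}
seti = {1.325: False, 1.33125: False, 1.34375: True}


def lowest_clean_voltage(observations):
    """Lowest tested voltage such that it and every higher tested voltage ran clean."""
    clean = None
    for v in sorted(observations, reverse=True):
        if not observations[v]:
            break  # first failing voltage from the top: stop here
        clean = v
    return clean


def steps_between(v_low, v_high, step=STEP_V):
    """Number of voltage increments separating two settings."""
    return round((v_high - v_low) / step)


seti_edge = lowest_clean_voltage(seti)
s5r2_edge = lowest_clean_voltage(s5r2)
print(seti_edge, s5r2_edge, steps_between(seti_edge, s5r2_edge))
```

Since the errors show up at irregular intervals (6 to 26 minutes for S5R2), a "clean" entry only means no error was seen during the time tested.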

CONCLUSIONS

For my specific system, there are infrequent events both in running SETI (I use the KWSN C2D code) and in running Einstein S5R2 which require substantially more CPU speed margin than does merely booting Windows, running Einstein S5RI, or routine use of browsers, virus checkers, etc. on the system. The speed requirement for S5R2 is higher than for SETI.

Unless my specific sample of the Conroe CPU has a unique sub-critical fault, it seems likely that other Conroes of the same stepping which have been deliberately overclocked to just inside the point of working on S5RI or SETI will fairly frequently generate speed-related compute errors on S5R2. This can be fixed by backing off on frequency, raising voltage, or both.

I have no basis for guessing what fraction of the S5R2 flood of compute errors fits this syndrome, but I suspect that overclocking is fairly common, and especially so among those posting here.

I think any design has the same risk--if the application is substantially different there is no reason to expect its limiting speed on a particular part to be the same as the previous application. Only testing can tell.

SUGGESTION

If you own a host which behaved well on S5RI, but is generating repeated compute errors on S5R2, try backing off appreciably on frequency, or raising voltage, and report your experience here.

I emphasize the word "appreciably". On my particular system, the S5R2 error behavior was substantially unchanged over a span of six increments of CPU voltage. That is a huge change--when observing Windows boot behavior and other basic measures it would make a night and day difference. If a big shift fixes your problem, you can inch back in to find the edge, but if a small shift makes no difference, nothing is learned.
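The "big shift first, then inch back" approach can be sketched as a coarse-then-fine search. This is only an illustration under stated assumptions: `is_stable` is a hypothetical callback standing in for a long soak test at a given setting (hours of S5R2, not a quick boot check), settings are counted in abstract increments, and the jump size of six simply echoes the span mentioned above.

```python
def find_edge(start_step, is_stable, big_jump=6, max_step=20):
    """Return the lowest stable setting at or above start_step, or None.

    start_step -- current setting, counted in increments (hypothetical units)
    is_stable  -- hypothetical callback: True if a long soak test passes
    big_jump   -- size of the coarse move, in increments
    max_step   -- highest setting we are willing to try
    """
    step = start_step
    # Coarse phase: move in big jumps until the errors go away.
    while not is_stable(step):
        step += big_jump
        if step > max_step:
            return None  # never found a stable setting within the limit
    # Fine phase: inch back one increment at a time to find the edge.
    while step > start_step and is_stable(step - 1):
        step -= 1
    return step


# Usage with a made-up host that is stable from increment 7 upward.
print(find_edge(0, lambda s: s >= 7))
```

With the made-up host, the coarse phase tries 0, 6 and 12, and the fine phase walks back from 12 to 7. Small first moves (say, 1 or 2 increments) would have shown no change at all, which is the point of the advice.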

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7214014931
RAC: 952984

Quote:

SUGGESTION

If you own a host which behaved well on S5RI, but is generating repeated compute errors on S5R2, try backing off appreciably on frequency, or raising voltage, and report your experience here.

No one has responded to my post in this thread, but in another thread tapir has reported that a slight frequency reduction appears to have solved a client error on an overclocked Athlon. As the log for the offending WU is gone, I can't compare the symptoms.

Alinator
Joined: 8 May 05
Posts: 927
Credit: 9352143
RAC: 0

LOL....

Well in my case it's not because I haven't been keeping an eye on this thread, it's just that I don't have a host to test your hypothesis on. ;-)

Alinator

Kimegi Tepeex
Joined: 1 May 05
Posts: 8
Credit: 250148
RAC: 0

Quote:

...
If you own a host which behaved well on S5RI, but is generating repeated compute errors on S5R2, try backing off appreciably on frequency, or raising voltage, and report your experience here.
...

I just saw your post, and I am giving it a try.

The chosen processor is an AMD Duron 1800.
It has been running slightly overclocked (FSB at 140 instead of 133) for months, with no compute errors on E@H or any other project until the last few days, beginning with S5R2 v4.18.
I have now reset everything to stock settings and allowed a new WU to download.
At the moment, the estimated completion time is around 42 hours, but resource sharing with other projects will push completion out to around next Sunday (if it is not killed before then).
Wait and see...

Edit: app v4.14 has had a compute error as well.

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494410
RAC: 0

Worth a try I think, if it helps for some people, that's already quite an achievement. Can't say anything substantial myself, I'm like Alinator there, I don't usually overclock (might give it a try on my desktop one day but didn't want to take the risk so far).

tapir
Joined: 19 Mar 05
Posts: 23
Credit: 462935446
RAC: 0

Message 62846 in response to message 62842

Quote:
Quote:

SUGGESTION

If you own a host which behaved well on S5RI, but is generating repeated compute errors on S5R2, try backing off appreciably on frequency, or raising voltage, and report your experience here.

No one has responded to my post in this thread, but in another thread tapir has reported that a slight frequency reduction appears to have solved client error on an overclocked Athlon. As the offending wu log is gone from the log, I can't compare the symptoms.

Host with errors
...you will find three errors left there to compare

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7214014931
RAC: 952984

Message 62848 in response to message 62846

Quote:
Host with errors
...will find three errors left for compare


Thanks Tapir. Your three remaining errors resemble mine in that the exit code matches exactly (which just means they all had an access violation). However, none of your three matches the others, or mine, in the reported code or data location of the violation, whereas five of my six match each other for code location, but none for data location:

Overclocking Einstein errors

Tapir, host 911468
AuthenticAMD
AMD Athlon(tm) 64 Processor 3000+ [x86 Family 15 Model 31 Stepping 0] [fpu tsc pae nx sse sse2 3dnow mmx]

At first mention (very near the top of stderr out), all are:

exit code -1073741819 (0xc0000005)

Just before the debug section, the three differ in code address and write address.

The three errors currently available read:
Access Violation (0xc0000005) at address 0x0045906B write attempt to address 0x010B0B4E
Access Violation (0xc0000005) at address 0x00458DE4 write attempt to address 0x0126EB2C
Access Violation (0xc0000005) at address 0x00458F49 write attempt to address 0x010E30A8

archae86, host 882945
GenuineIntel
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz [x86 Family 6 Model 15 Stepping 6] [fpu tsc pae nx sse sse2 mmx]

At first mention (very near the top of stderr out), all are:

exit code -1073741819 (0xc0000005)

The errors currently available read:

Access Violation (0xc0000005) at address 0x0044AE0C read attempt to address 0x000009BD
Access Violation (0xc0000005) at address 0x0044AE0D write attempt to address 0x0210A713
Access Violation (0xc0000005) at address 0x0044AE0C read attempt to address 0x00000755
Access Violation (0xc0000005) at address 0x0044AE0C read attempt to address 0x00000057
Access Violation (0xc0000005) at address 0x0044AE0C read attempt to address 0x00000078
Access Violation (0xc0000005) at address 0x0044AE0C read attempt to address 0x00000899

I'm actually quite surprised that your Athlon and my Conroe have even this degree of similarity in the syndrome they display for an S5R2 limiting speed lower than that of other BOINC applications.

Tentatively, I'd suggest that folks seeing access violations are a bit more likely to have a speed problem, and folks displaying Signal 11 rather less likely, but I'm not at all sure.
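For anyone wanting to make the same comparison on their own stderr output, here is a minimal Python sketch. It groups the access-violation lines quoted above by code address and by data address, and shows that the signed exit code and the hex value are the same 32-bit number. The regular expression only assumes the line format shown in this post; the function names are illustrative, not part of any BOINC tool.

```python
import re
from collections import Counter

# Matches lines in the format quoted in this post.
AV_LINE = re.compile(
    r"Access Violation \(0x[0-9A-Fa-f]+\) at address (0x[0-9A-Fa-f]+) "
    r"(read|write) attempt to address (0x[0-9A-Fa-f]+)"
)

TAPIR = [
    "Access Violation (0xc0000005) at address 0x0045906B write attempt to address 0x010B0B4E",
    "Access Violation (0xc0000005) at address 0x00458DE4 write attempt to address 0x0126EB2C",
    "Access Violation (0xc0000005) at address 0x00458F49 write attempt to address 0x010E30A8",
]
ARCHAE86 = [
    "Access Violation (0xc0000005) at address 0x0044AE0C read attempt to address 0x000009BD",
    "Access Violation (0xc0000005) at address 0x0044AE0D write attempt to address 0x0210A713",
    "Access Violation (0xc0000005) at address 0x0044AE0C read attempt to address 0x00000755",
    "Access Violation (0xc0000005) at address 0x0044AE0C read attempt to address 0x00000057",
    "Access Violation (0xc0000005) at address 0x0044AE0C read attempt to address 0x00000078",
    "Access Violation (0xc0000005) at address 0x0044AE0C read attempt to address 0x00000899",
]


def summarize(lines):
    """Count how often each code address and each data address appears."""
    code, data = Counter(), Counter()
    for line in lines:
        m = AV_LINE.search(line)
        if m:
            code[m.group(1)] += 1
            data[m.group(3)] += 1
    return code, data


def to_unsigned_hex(exit_code):
    """Show a signed 32-bit exit code as the unsigned hex value."""
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


for name, lines in (("tapir", TAPIR), ("archae86", ARCHAE86)):
    code, data = summarize(lines)
    print(name, code.most_common(1), "max data repeats:", max(data.values()))
print(to_unsigned_hex(-1073741819))
```

A code address that repeats across several failed results, as on the Conroe, points at one particular instruction sequence being the marginal one, while the scattered data addresses suggest the corrupted value differs each time.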

Kimegi Tepeex
Joined: 1 May 05
Posts: 8
Credit: 250148
RAC: 0

Quote:
I have now reset everything to stock setting, and allowed a new WU to download.
At the moment, the estimated completion time is around 42 hours, but resource share with other projects will increase the time to complete until around next Sunday (if not killed before).
Wait and see...

This WU is now complete and valid:
Server state: Over / Outcome: Success / Client state: Done / CPU time: 141,278.88 sec / Claimed credit: 525.14 / Granted credit: 525.14

However, I am unsure whether lowering the FSB is the reason for the success: for the whole crunching period, E@H was set to "No new task" and I made no manual "Update", which may have been related to the previous compute errors (process got signal 11 / SIGABRT).

I keep crunching on this machine anyway...
