Computation error after resuming from hibernate mode

Bluesilvergreen
Joined: 20 May 06
Posts: 23
Credit: 1206151
RAC: 0
Topic 193666

Hello,

sometimes I get a computation error after I resume from hibernate mode.

System:

P5N-E SLI
Q6600 (@3,00GHz)
2 x 1GB OCZ DDR2 800
WinXP 64-bit
BOINC 5.10.45 (64bit)
Einstein 4.36 power-app

Is it possible that this has something to do with the core voltage or northbridge voltage?

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 691966304
RAC: 160146

Computation error after resuming from hibernate mode

Quote:

Hello,

sometimes I get a computation error after I resume from hibernate mode.

System:

P5N-E SLI
Q6600 (@3,00GHz)
2 x 1GB OCZ DDR2 800
WinXP 64-bit
BOINC 5.10.45 (64bit)
Einstein 4.36 power-app

Is it possible that this has something to do with the core voltage or northbridge voltage?

Hmm...very strange. So the error occurs when the PC "suspends to RAM" or during wake-up from that?

The kind of error that happens would indeed be consistent with a problem from overclocking/undervolting, but I have no idea what the relation to hibernation could be. Maybe there's some extra stress on the northbridge during the hibernation process?

CU
Bikeman

Bluesilvergreen
Joined: 20 May 06
Posts: 23
Credit: 1206151
RAC: 0

I have now set the CPU to the standard 2.40 GHz and standard voltage. The error occurs just as it did at 3.00 GHz.

It happens right after wake-up from hibernation, because the timestamp of the "output file absent" message is right after the wake-up.

It also happens with an E4300 at standard clock and voltage.

The RAM is rated for DDR2-800 and runs within spec: it is clocked at DDR2-667 with normal timings (4-4-4-12).

I think it either has something to do with the 64-bit OS and WOW64 for the 32-bit Einstein app, or it is the Einstein power app itself.

So now I will empty my cache and try the standard Einstein app.

I also have a 32-bit Vista installation, which shows no such problems.

Bluesilvergreen
Joined: 20 May 06
Posts: 23
Credit: 1206151
RAC: 0

And what's also strange is that this error always occurs in just 1 thread; the other 3 are not affected.

And only after resuming from hibernation. A normal boot has no effect...

Bluesilvergreen
Joined: 20 May 06
Posts: 23
Credit: 1206151
RAC: 0

So now I have tried the stock app, but I get the same error: 1 thread (1 WU) produces an error.

I can't find an edit-button here, so I have to write an extra post. Is this the way it should be?

RandyC
Joined: 18 Jan 05
Posts: 6113
Credit: 111139797
RAC: 0

RE: So, now I tried the

Message 81118 in response to message 81117

Quote:

So, now I tried the stock-app, but it's the same error: 1 thread (1 wu) produces an error.

I can't find an edit-button here, so I have to write an extra post. Is this the way it should be?

You have exactly 1 hour from first posting a message to make any edits. After that, the edit button is removed.

Seti Classic Final Total: 11446 WU.

Pepo
Joined: 17 Aug 05
Posts: 15
Credit: 458309
RAC: 0

Message 81119 in response to message 81114

Bikeman wrote:
Bluesilvergreen wrote:
Hello,
sometimes I get a computation error after I resume from hibernate mode.

So the error occurs when the PC "suspends to RAM" or during wake-up from that?


I hope I'm right to correct you - not "Suspend to RAM" (and keep the power on), but "Hibernation" (= Suspend to HD, and shut the power off).

Indeed weird. (I'm also regularly getting problems upon resume from hibernation, but that's just missing heartbeats, not app errors.) You've ruled out the overclocking, what about undervolting?

Could you point to any of these failed results? All of your failed ones contain similar text at the end:

APP DEBUG: Application caught signal 8.
Stack trace of LAL functions in worker thread:
LocalComputeFStatFreqBand at line 187 of file LocalComputeFstat.c
LocalComputeFStat at line 289 of file LocalComputeFstat.c
(null) at line 0 of file (null)
At lowest level status code = 0, description: NO LAL ERROR REGISTERED


or

APP DEBUG: Application caught signal 8.
Stack trace of LAL functions in worker thread:
LocalComputeFstatHoughMap at line 131 of file LocalComputeFstatHoughMap.c
LALHOUGHConstructHMT_W at line 562 of file LocalComputeFstatHoughMap.c
LALHOUGHAddPHMD2HD_W at line 661 of file LocalComputeFstatHoughMap.c
(null) at line 0 of file (null)
At lowest level status code = 0, description: NO LAL ERROR REGISTERED
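
For anyone comparing several failed results, the two traces above can be reduced to their signal number and LAL call stack with a few lines of Python. This is only a sketch: the function name `parse_app_debug` is made up for illustration, and it assumes every stderr block follows exactly the line layout quoted above. Signal 8 conventionally corresponds to SIGFPE (an arithmetic exception) in the C runtime, which is worth keeping in mind when reading the output.

```python
import re

# Matches frames such as
# "LocalComputeFStat at line 289 of file LocalComputeFstat.c"
FRAME_RE = re.compile(r"^(?P<func>\S+) at line (?P<line>\d+) of file (?P<file>\S+)$")
SIGNAL_RE = re.compile(r"Application caught signal (\d+)")


def parse_app_debug(text):
    """Return (signal_number, frames) from an APP DEBUG block.

    frames is a list of (function, line, file) tuples in the order they
    appear; placeholder "(null)" frames are skipped.
    """
    signal_number = None
    frames = []
    for raw in text.splitlines():
        line = raw.strip()
        m = SIGNAL_RE.search(line)
        if m:
            signal_number = int(m.group(1))
            continue
        m = FRAME_RE.match(line)
        if m and m.group("func") != "(null)":
            frames.append((m.group("func"), int(m.group("line")), m.group("file")))
    return signal_number, frames


# Sample copied from the first trace quoted in this post.
SAMPLE = """\
APP DEBUG: Application caught signal 8.
Stack trace of LAL functions in worker thread:
LocalComputeFStatFreqBand at line 187 of file LocalComputeFstat.c
LocalComputeFStat at line 289 of file LocalComputeFstat.c
(null) at line 0 of file (null)
At lowest level status code = 0, description: NO LAL ERROR REGISTERED
"""

if __name__ == "__main__":
    sig, frames = parse_app_debug(SAMPLE)
    print("signal:", sig)
    for func, line_no, filename in frames:
        print(f"  {func} ({filename}:{line_no})")
```

Grouping failed results by the resulting frame list would show quickly whether they all die in the same place or only share the signal.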


Did all of them happen upon resuming from hibernation?

Could you paste here a few lines from the Manager's Messages tab, from the wake-up until any result fails?

Peter

Bluesilvergreen
Joined: 20 May 06
Posts: 23
Credit: 1206151
RAC: 0

The error occurs when I resume from hibernation (RAM -> HDD and then HDD -> RAM), so you're right.

I get this error even when I set the CPU clock and VCore to standard.
It is reproducible: every time I resume from hibernation, 1 WU produces an error.

I think it's because of the 64-bit version of XP and WOW64. SP2 is installed.

This is from the BOINC messages. The resume from hibernation is finished at about 11:02:30, so the error occurs just a few seconds later:

12.05.2008 10:59:43||Starting BOINC client version 5.10.45 for windows_x86_64
12.05.2008 10:59:43||log flags: task, file_xfer, sched_ops
12.05.2008 10:59:43||Libraries: libcurl/7.18.0 OpenSSL/0.9.8e zlib/1.2.3
12.05.2008 10:59:43||Data directory: C:\Program Files\BOINC
12.05.2008 10:59:43|Einstein@Home|Found app_info.xml; using anonymous platform
12.05.2008 10:59:43||Processor: 4 GenuineIntel Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz [EM64T Family 6 Model 15 Stepping 11]
12.05.2008 10:59:43||Processor features: fpu tsc pae nx sse sse2
12.05.2008 10:59:43||OS: Microsoft Windows XP Professional x64 Edition: , Service Pack 2, (05.02.3790.00)
12.05.2008 10:59:43||Memory: 2.00 GB physical, 4.90 GB virtual
12.05.2008 10:59:43||Disk: 232.88 GB total, 34.80 GB free
12.05.2008 10:59:43||Local time is UTC +2 hours
12.05.2008 10:59:43|Einstein@Home|URL: http://einstein.phys.uwm.edu/; Computer ID: 1149444; location: home; project prefs: home
12.05.2008 10:59:43||General prefs: from Einstein@Home (last modified 13-Apr-2008 14:59:38)
12.05.2008 10:59:43||Host location: home
12.05.2008 10:59:43||General prefs: using separate prefs for home
12.05.2008 10:59:43||Preferences limit memory usage when active to 2046.38MB
12.05.2008 10:59:43||Preferences limit memory usage when idle to 2046.38MB
12.05.2008 10:59:43||Preferences limit disk usage to 34.80GB
12.05.2008 10:59:43||Suspending network activity - user request
12.05.2008 10:59:43|Einstein@Home|Restarting task h1_1039.55_S5R3__410_S5R3b_1 using einstein_S5R3 version 436
12.05.2008 10:59:44|Einstein@Home|Restarting task h1_1039.55_S5R3__409_S5R3b_0 using einstein_S5R3 version 436
12.05.2008 10:59:44|Einstein@Home|Restarting task h1_1039.55_S5R3__408_S5R3b_0 using einstein_S5R3 version 436
12.05.2008 10:59:44|Einstein@Home|Restarting task h1_1039.55_S5R3__407_S5R3b_0 using einstein_S5R3 version 436
12.05.2008 11:02:37|Einstein@Home|Computation for task h1_1039.55_S5R3__407_S5R3b_0 finished
12.05.2008 11:02:37|Einstein@Home|Output file h1_1039.55_S5R3__407_S5R3b_0_0 for task h1_1039.55_S5R3__407_S5R3b_0 absent
12.05.2008 11:02:37|Einstein@Home|Starting h1_1039.60_S5R3__329_S5R3b_1
12.05.2008 11:02:40|Einstein@Home|Starting task h1_1039.60_S5R3__329_S5R3b_1 using einstein_S5R3 version 436
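
To pick out which task failed after each resume without reading the whole Messages tab, a log like the one above could be scanned automatically. The following is a minimal sketch, not anything shipped with BOINC: it assumes the `DD.MM.YYYY HH:MM:SS|project|message` layout and the exact "Restarting task" / "Output file ... absent" wording seen in this excerpt, and the function names are hypothetical.

```python
import re
from datetime import datetime

RESTART_RE = re.compile(r"Restarting task (\S+) ")
ABSENT_RE = re.compile(r"Output file \S+ for task (\S+) absent")


def parse_line(line):
    """Split 'DD.MM.YYYY HH:MM:SS|project|message' into its three parts."""
    stamp, project, message = line.rstrip("\n").split("|", 2)
    return datetime.strptime(stamp, "%d.%m.%Y %H:%M:%S"), project, message


def failed_after_restart(lines):
    """Return {task: seconds between its restart and 'output file absent'}."""
    restarted = {}
    failed = {}
    for line in lines:
        if line.count("|") < 2:
            continue  # not a message-log line (e.g. a blank line)
        when, _project, message = parse_line(line)
        m = RESTART_RE.search(message)
        if m:
            restarted[m.group(1)] = when
            continue
        m = ABSENT_RE.search(message)
        if m and m.group(1) in restarted:
            failed[m.group(1)] = (when - restarted[m.group(1)]).total_seconds()
    return failed


# Lines copied from the log excerpt above.
SAMPLE = """\
12.05.2008 10:59:43||Suspending network activity - user request
12.05.2008 10:59:43|Einstein@Home|Restarting task h1_1039.55_S5R3__410_S5R3b_1 using einstein_S5R3 version 436
12.05.2008 10:59:44|Einstein@Home|Restarting task h1_1039.55_S5R3__407_S5R3b_0 using einstein_S5R3 version 436
12.05.2008 11:02:37|Einstein@Home|Computation for task h1_1039.55_S5R3__407_S5R3b_0 finished
12.05.2008 11:02:37|Einstein@Home|Output file h1_1039.55_S5R3__407_S5R3b_0_0 for task h1_1039.55_S5R3__407_S5R3b_0 absent
"""

if __name__ == "__main__":
    for task, seconds in failed_after_restart(SAMPLE.splitlines()).items():
        print(f"{task}: output absent {seconds:.0f} s after restart")
```

Running this over the logs from several hibernate cycles would show whether it is always the same task slot that fails and how long after the restart the failure is reported.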

Pepo
Joined: 17 Aug 05
Posts: 15
Credit: 458309
RAC: 0

RE: 12.05.2008

Message 81121 in response to message 81120

Quote:
12.05.2008 10:59:43|Einstein@Home|Restarting task h1_1039.55_S5R3__410_S5R3b_1 using einstein_S5R3 version 436
12.05.2008 10:59:44|Einstein@Home|Restarting task h1_1039.55_S5R3__409_S5R3b_0 using einstein_S5R3 version 436
12.05.2008 10:59:44|Einstein@Home|Restarting task h1_1039.55_S5R3__408_S5R3b_0 using einstein_S5R3 version 436
12.05.2008 10:59:44|Einstein@Home|Restarting task h1_1039.55_S5R3__407_S5R3b_0 using einstein_S5R3 version 436
12.05.2008 11:02:37|Einstein@Home|Computation for task h1_1039.55_S5R3__407_S5R3b_0 finished
12.05.2008 11:02:37|Einstein@Home|Output file h1_1039.55_S5R3__407_S5R3b_0_0 for task h1_1039.55_S5R3__407_S5R3b_0 absent
12.05.2008 11:02:37|Einstein@Home|Starting h1_1039.60_S5R3__329_S5R3b_1
12.05.2008 11:02:40|Einstein@Home|Starting task h1_1039.60_S5R3__329_S5R3b_1 using einstein_S5R3 version 436


What's interesting - the "failed" task h1_1039.55_S5R3__407_S5R3b_0 finished successfully and validated (like the other three, more successful tasks h1_1039.55_S5R3__408_S5R3b_0, h1_1039.55_S5R3__409_S5R3b_0 and h1_1039.55_S5R3__410_S5R3b_1).

What's weird - the problem task h1_1039.55_S5R3__407_S5R3b_0 was possibly restarted sometime later, but (according to the internal logs and timestamps from the block) was apparently crunched in parallel and finished nearly simultaneously (a few seconds' difference) with h1_1039.55_S5R3__408_S5R3b_0, or two minutes after h1_1039.55_S5R3__409_S5R3b_0. Actually, all four tasks seem to have finished simultaneously around 18:45.

The newer task h1_1039.60_S5R3__329_S5R3b_1, which (according to the BOINC logs) should have replaced h1_1039.55_S5R3__407_S5R3b_0, was (according to its internal log) started only at 18:43:43.7500, after h1_1039.55_S5R3__410_S5R3b_1 had finished at 18:43:40.5312. I do not understand what that means. Maybe the tasks were swapped again later in the logs, after 11:02:40?

Peter

Bluesilvergreen
Joined: 20 May 06
Posts: 23
Credit: 1206151
RAC: 0

Each time I was about to hibernate, I made a backup of the whole BOINC directory. So all results validated correctly, because they ran without problems.

After returning from hibernation, I deleted the whole directory and copied the backup back to its original location.
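
The backup and restore steps can be scripted roughly as below. This is a sketch only: the function names are made up, the paths in the usage comment are placeholders, and it assumes the BOINC client is not writing to the directory while it is being copied.

```python
import shutil
from pathlib import Path


def backup_directory(data_dir, backup_dir):
    """Copy data_dir to backup_dir, replacing any older backup."""
    data_dir, backup_dir = Path(data_dir), Path(backup_dir)
    if backup_dir.exists():
        shutil.rmtree(backup_dir)
    shutil.copytree(data_dir, backup_dir)


def restore_directory(backup_dir, data_dir):
    """Delete data_dir and copy the backup back in its place."""
    backup_dir, data_dir = Path(backup_dir), Path(data_dir)
    if not backup_dir.is_dir():
        # Refuse to delete anything if there is no backup to restore from.
        raise FileNotFoundError(f"no backup found at {backup_dir}")
    if data_dir.exists():
        shutil.rmtree(data_dir)
    shutil.copytree(backup_dir, data_dir)


# Hypothetical usage (placeholder paths, adjust to the real installation):
#   backup_directory(r"C:\Program Files\BOINC", r"D:\BOINC_backup")   # before hibernating
#   restore_directory(r"D:\BOINC_backup", r"C:\Program Files\BOINC")  # after resuming
```

Restoring this way throws away whatever the client did after the resume, including the error result, which is why the tasks later finish and validate normally.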

Pepo
Joined: 17 Aug 05
Posts: 15
Credit: 458309
RAC: 0

RE: Each time, I tried to

Message 81123 in response to message 81122

Quote:

Each time I was about to hibernate, I made a backup of the whole BOINC directory. So all results validated correctly, because they ran without problems.

After returning from hibernation, I deleted the whole directory and copied the backup back to its original location.


Thanks, this explains the behavior I've observed.

I would be inclined to open a BOINC Trac ticket for this. But perhaps the Einstein devs should weigh in first...

Peter
