I got home a couple days ago and discovered that one of my computers had recovered from a blue screen. Of course, Boinc hadn't found the GPU and I had to exit and start it again. Then I discovered that 2 Einstein GPU tasks (the ones running at the time of the blue screen, presumably) returned errors. So I'm wondering if the error was the cause or the effect of the blue screen. This is from one of them, task 423321920. The other, task 423320063, has the same error message in it.
Stderr output
6.10.60
An I/O operation initiated by the registry failed unrecoverably. The registry could not read in, or write out, or flush, one of the files that contain the system's image of the registry. (0x3f8) - exit code 1016 (0x3f8)
Activated exception handling...
[00:41:30][6596][INFO ] Starting data processing...
[00:41:30][6596][INFO ] CUDA global memory status (initial GPU state, including context):
------> Used in total: 220 MB (1317 MB free / 1537 MB total) -> Used by this application (assuming a single GPU task): 0 MB
[00:41:30][6596][INFO ] Using CUDA device #0 "GeForce GT 440" (144 CUDA cores / 342.43 GFLOPS)
[00:41:30][6596][INFO ] Version of installed CUDA driver: 5050
[00:41:30][6596][INFO ] Version of CUDA driver API used: 3020
[00:41:32][6596][INFO ] Checkpoint file unavailable: status.cpt (No such file or directory).
------> Starting from scratch...
[00:41:32][6596][INFO ] Header contents:
------> Original WAPP file: ./PA0084_00281_DM308.00
------> Sample time in microseconds: 1000
------> Observation time in seconds: 2097.152
------> Time stamp (MJD): 54399.857698562795
------> Number of samples/record: 0
------> Center freq in MHz: 1231.5
------> Channel band in MHz: 3
------> Number of channels/record: 96
------> Nifs: 1
------> RA (J2000): 82423.3250008
------> DEC (J2000): -263613.155
------> Galactic l: 0
------> Galactic b: 0
------> Name: G4516531
------> Lagformat: 0
------> Sum: 1
------> Level: 3
------> AZ at start: 0
------> ZA at start: 0
------> AST at start: 0
------> LST at start: 0
------> Project ID: --
------> Observers: --
------> File size (bytes): 0
------> Data size (bytes): 0
------> Number of samples: 2097152
------> Trial dispersion measure: 308 cm^-3 pc
------> Scale factor: 1.62162
[00:41:32][6596][INFO ] Seed for random number generator is 1087967505.
[00:41:33][6596][INFO ] Derived global search parameters:
------> f_A probability = 0.04
------> single bin prob(P_noise > P_thr) = 1.2977e-008
------> thr1 = 18.1601
------> thr2 = 21.263
------> thr4 = 26.2923
------> thr8 = 34.674
------> thr16 = 48.9881
[00:41:33][6596][INFO ] CUDA global memory status (GPU setup complete):
------> Used in total: 341 MB (1196 MB free / 1537 MB total) -> Used by this application (assuming a single GPU task): 121 MB
[00:42:36][6596][INFO ] Checkpoint committed!
[00:43:41][6596][INFO ] Checkpoint committed!
{a whole lot of checkpoints deleted for brevity}
[05:22:11][6596][INFO ] Checkpoint committed!
[05:23:09][6596][INFO ] Statistics: count dirty SumSpec pages 3387 (not checkpointed), Page Size 1024, fundamental_idx_hi-window_2: 1100505
[05:23:09][6596][INFO ] Data processing finished successfully!
[05:23:09][6596][INFO ] Starting data processing...
[05:23:09][6596][INFO ] CUDA global memory status (initial GPU state, including context):
------> Used in total: 220 MB (1317 MB free / 1537 MB total) -> Used by this application (assuming a single GPU task): 0 MB
[05:23:09][6596][INFO ] Using CUDA device #0 "GeForce GT 440" (144 CUDA cores / 342.43 GFLOPS)
[05:23:09][6596][INFO ] Version of installed CUDA driver: 5050
[05:23:09][6596][INFO ] Version of CUDA driver API used: 3020
[05:23:10][6596][INFO ] Checkpoint file unavailable: status.cpt (No such file or directory).
------> Starting from scratch...
[05:23:10][6596][INFO ] Header contents:
------> Original WAPP file: ./PA0084_00281_DM310.00
------> Sample time in microseconds: 1000
------> Observation time in seconds: 2097.152
------> Time stamp (MJD): 54399.857698521024
------> Number of samples/record: 0
------> Center freq in MHz: 1231.5
------> Channel band in MHz: 3
------> Number of channels/record: 96
------> Nifs: 1
------> RA (J2000): 82423.3250008
------> DEC (J2000): -263613.155
------> Galactic l: 0
------> Galactic b: 0
------> Name: G4516531
------> Lagformat: 0
------> Sum: 1
------> Level: 3
------> AZ at start: 0
------> ZA at start: 0
------> AST at start: 0
------> LST at start: 0
------> Project ID: --
------> Observers: --
------> File size (bytes): 0
------> Data size (bytes): 0
------> Number of samples: 2097152
------> Trial dispersion measure: 310 cm^-3 pc
------> Scale factor: 1.62162
[05:23:11][6596][INFO ] Seed for random number generator is 1091183138.
[05:23:12][6596][INFO ] Derived global search parameters:
------> f_A probability = 0.04
------> single bin prob(P_noise > P_thr) = 1.2977e-008
------> thr1 = 18.1601
------> thr2 = 21.263
------> thr4 = 26.2923
------> thr8 = 34.674
------> thr16 = 48.9881
[05:23:12][6596][INFO ] CUDA global memory status (GPU setup complete):
------> Used in total: 341 MB (1196 MB free / 1537 MB total) -> Used by this application (assuming a single GPU task): 121 MB
[05:23:17][6596][INFO ] Checkpoint committed!
[05:24:22][6596][INFO ] Checkpoint committed!
{more checkpoints deleted}
[06:14:36][6596][INFO ] Checkpoint committed!
[06:15:42][6596][INFO ] Checkpoint committed!
[06:16:01][6596][ERROR] Error during CUDA host->device HS thresholds data transfer (error: 999)
[06:16:01][6596][ERROR] Demodulation failed (error: 1007)!
06:16:01 (6596): called boinc_finish
Activated exception handling...
[08:56:00][5828][INFO ] Starting data processing...
[08:56:00][5828][INFO ] CUDA global memory status (initial GPU state, including context):
------> Used in total: 80 MB (1457 MB free / 1537 MB total) -> Used by this application (assuming a single GPU task): 0 MB
[08:56:00][5828][INFO ] Using CUDA device #0 "GeForce GT 440" (144 CUDA cores / 342.43 GFLOPS)
[08:56:00][5828][INFO ] Version of installed CUDA driver: 5050
[08:56:00][5828][INFO ] Version of CUDA driver API used: 3020
[08:56:00][5828][ERROR] Couldn't load main CUDA device module (error: 301)!
[08:56:00][5828][ERROR] Demodulation failed (error: 1016)!
08:56:00 (5828): called boinc_finish
David
Miserable old git
Patiently waiting for the asteroid with my name on it.
2 errors: cause of or by blue screen?
You need to look in the Windows event log and compare the timestamps, or at least dig out what the error in the blue screen was, to be able to make any kind of guess about whether the task errors were the cause or the effect of the blue screen.
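To make that timestamp comparison less tedious, here is a minimal Python sketch that pulls the ERROR lines out of a task's stderr output, using the `[HH:MM:SS][pid][LEVEL]` format visible in the log above. The function names `error_entries` and `within_window` are made up for this illustration, and the crash time has to be read by hand from the event log. The stderr lines carry no date, so the sketch assumes both times fall on the same day and use the same clock.

```python
import re
from datetime import date, datetime, timedelta

# Matches lines such as: [06:16:01][6596][ERROR] Demodulation failed (error: 1007)!
# The level field may be padded with a space, e.g. "[INFO ]".
LOG_LINE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]\[(\d+)\]\[(\w+)\s*\]\s*(.*)$")


def error_entries(stderr_text):
    """Return (time, pid, message) for every ERROR line in the stderr text."""
    entries = []
    for line in stderr_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match and match.group(3) == "ERROR":
            clock = datetime.strptime(match.group(1), "%H:%M:%S").time()
            entries.append((clock, int(match.group(2)), match.group(4)))
    return entries


def within_window(entry_time, crash_time, minutes=5):
    """True if two clock times are at most `minutes` apart.

    Assumption: both times belong to the same calendar day.
    """
    day = date(2000, 1, 1)  # arbitrary placeholder date, only used for subtraction
    gap = datetime.combine(day, entry_time) - datetime.combine(day, crash_time)
    return abs(gap) <= timedelta(minutes=minutes)
```

Feed it the pasted stderr text, then check each returned time against the blue screen time from the event log. An error logged shortly before the crash suggests the task failure and the crash share a cause, while one logged only after the reboot is more likely an effect.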
Well, this happens here also from time to time, about once in 2 weeks.
Last time it happened I saw something like 'interrupt equal or less' on the screen before the system rebooted, and one CUDA WU was destroyed. It has never destroyed an ATI WU; sometimes it destroys a CUDA WU, but it always destroys the WU-prop WUs.
I thought it has to do with my DVB-S Tuner card because it happens only on this system, so I ignored it.
It has happened with different CUDA drivers, on both my previous system and the current one.
Use Blue Screen View to see what the blue screen said, and where it came from.
THX for the link, I used it and found:
https://dl.dropboxusercontent.com/u/50246791/Crashdump1.PNG
Google results point to a memory fault, where 'memory' means more than just RAM; it can also be GPU RAM, the hard disk, and so on.
As far as my system is concerned, it might be something like an 'intellectual overload' for Windows: 3 different types of GPUs, a tuner card using the RAM for timeshift memory, and usually 5-8 programs open.
There was a discussion about the amount of RAM needed for best performance. My system has 8GB, so this might not be enough.
Alexander
Okay, I did. It says this BSOD (and the last one, in November) was caused by dxgkrnl.sys. I Binged that and got lots of discussion of it. I'm still reading, but nothing seems entirely pertinent so far. The gist of the answers is that I have a bad video driver (doesn't seem to matter whether it's ATI or NVidia) and I should either update it or roll it back. I'll do some more reading and probably try updating the driver when I get home today. (Or maybe I'll wait until Boinc finishes all the Einstein work on hand, which is dangerously close to deadline.)
I can tell you that no one was actively using the computer at the time of the crash. It may have been as long as a month since I laid hands on it; I spend an average of about 30-40 minutes a day (almost daily) on it via Teamviewer, but other than that it sits there crunching.
David
Well, if it all turns out to be nothing, you can always try to update DirectX. dxgkrnl.sys would be the DirectX Graphics Kernel, right?
How to update DirectX? See here. It may be that there's just a glitch that a reinstall will fix. The videocard drivers usually don't update the DirectX environment, although new game installations will do that.
Off my own topic...
I have at least two tasks that are not going to make deadline. There are five that are due less than 12 hours 40 minutes from now. Two are crunching, showing 4:53 and 5:39 to completion. One is waiting, showing 17 minutes left. This one and this one have not started. They've all been taking roughly 9.5 hours, so there's no way it can finish what it started and get through what it hasn't started in the time remaining.
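The arithmetic behind that can be sketched in a few lines of Python. This is only a rough model under stated assumptions: that the machine runs two tasks at a time (because two are crunching now), that the waiting task with 17 minutes left resumes before the unstarted ones, and that each unstarted task takes the usual 9.5 hours. The function name `finish_times` is made up for this illustration.

```python
import heapq


def finish_times(running_remaining, queued_durations):
    """Greedy schedule: each queued task starts as soon as a slot frees up.

    running_remaining: hours left for the tasks crunching now (one per slot).
    queued_durations: hours needed by the waiting tasks, in run order.
    Returns the finish time, in hours from now, of every queued task.
    """
    slots = list(running_remaining)
    heapq.heapify(slots)  # the smallest value is the slot that frees up first
    finishes = []
    for duration in queued_durations:
        start = heapq.heappop(slots)
        end = start + duration
        finishes.append(end)
        heapq.heappush(slots, end)
    return finishes


# Figures from the post; the two-slot setup is an assumption.
running = [4 + 53 / 60, 5 + 39 / 60]  # 4:53 and 5:39 to completion
queued = [17 / 60, 9.5, 9.5]          # waiting task, then the two unstarted ones
deadline = 12 + 40 / 60               # 12 hours 40 minutes from now

for end in finish_times(running, queued):
    print(f"{end:5.2f} h", "late" if end > deadline else "ok")
```

Under those assumptions the waiting task finishes at about 5.2 hours, while the two unstarted tasks finish at roughly 14.7 and 15.2 hours, both past the 12 hour 40 minute deadline.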
And there are 12 more due at various times in the ensuing 48 hours.
I'm going to get off Teamviewer and let it run. I'll leave the driver check until these are over with one way or the other.
David
If I were in that situation I would consider aborting the not started tasks and focus on the ones that might actually make it before a resend is sent out or at least before it comes back in again. I would also check my cache setting so I don't end up in the same situation again. =)
I think the server here is configured to send a message to abort not started and unneeded tasks so if the resend hasn't begun processing on your wingman's machine it should be aborted at the next scheduler contact.
Holmis, I considered that, but I let them go.
Of the two noted above that hadn't started, one is now out to a third host uselessly (sorry) and the other has its third task marked as "didn't need."
I just aborted four more that were due in the next two hours and hadn't started yet. There are at least two more that I probably should abort, but I'll hold off and see what happens with the ones currently running. Actually, those two can't possibly make it either...
David
I aborted two more that hadn't started and are due in under four hours from now.
I see that one of the ones I've aborted is within an hour of timing out from the other user as well.
Also, one of the ones still running (even though it already timed out) previously had a timeout and an error.
Anyway, they should all be done by the time I get home from work tomorrow, and then I can get back to my original topic and try updating my video driver. I probably need to blow the dust out of the computer, too.
David