BRP6 Parkes cuda nv301 crashes graphics card

Rechenkuenstler
Joined: 22 Aug 10
Posts: 138
Credit: 102567115
RAC: 0
Topic 198063

I have run into problems with the BRP6 application. My Quadro K4200 crashes with that application: the screen stays black and the fan runs at maximum. I have to switch the computer off to restart it. Here is a link to one of those tasks; the problem is reproducible.

http://einsteinathome.org/task/494448697

Since I suspended BRP6, the problem hasn't occurred any more. At the moment I'm running only BRP4 and everything is fine.

Andrew Dicker
Joined: 6 Apr 13
Posts: 18
Credit: 90041313
RAC: 0

I've been noticing crashes since Nvidia driver 350.12, and so was planning on blaming the driver version (it introduces OpenCL 1.2 as a one-line release note entry). I have no debugging or error logs pinning the black screen (after turning on the screens, while the machine is running) and the subsequent crash on any particular BOINC app, though.

Dr Who Fan
Joined: 25 Feb 05
Posts: 88
Credit: 2809335
RAC: 1138

Rechenkuenstler,

No one can help you unless you unhide your computer(s).

If you are concerned about what information is shown, click HERE to see my account.

mikey
Joined: 22 Jan 05
Posts: 12776
Credit: 1861084874
RAC: 1446843

Quote:
I've been noticing crashes since Nvidia driver 350.12, and so was planning on blaming the driver version (it introduces OpenCL 1.2 as a one-line release note entry). I have no debugging or error logs pinning the black screen (after turning on the screens, while the machine is running) and the subsequent crash on any particular BOINC app, though.

I think Einstein is one of the projects that is kinda slow to integrate the latest drivers into their system. It's often better to wait a bit instead of being on the pointy end of the stick here.

That being said, my own Nvidia 760 is running version 350.12 and is working just fine running 2 units at a time. Are you leaving a CPU core free just to keep the GPU fed? How many units are running at a time? I see you have 2 GPUs in the machine; are they both crunching here? And are they running multiple units each?

Andrew Dicker
Joined: 6 Apr 13
Posts: 18
Credit: 90041313
RAC: 0

Firstly, damn BOINC's GPU reporting. I wish I had 2x780Ti. One is actually a 760.
Both crunch 2 Einstein WUs, or 2 MilkyWay WUs, or 1 Collatz, or 1 PrimeGrid, or 1 GPUGrid... all the projects, hence not wanting to point the finger at Einstein apps explicitly. All I can say is that the GPU not detecting the screen when the screen powers on (the machine in question is connected to my Marantz receiver, and through that to the TV), and either suddenly causing a machine reboot or requiring a manual reboot, started coincidentally with the 350.12 Nvidia install.

So yeah, until we know Rechenkuenstler's driver version, I wouldn't pin stuff on Einstein apps. I note his profile shows a similar list of GPU projects to mine.

Rechenkuenstler
Joined: 22 Aug 10
Posts: 138
Credit: 102567115
RAC: 0

Sorry for the delay. I was away for two weeks.
The computer is unhidden now, and the driver version is 347.88. I had no problems during the last two weeks, while the Parkes application was disabled. I reactivated it yesterday and the problem was there again.

MAGIC Quantum Mechanic
Joined: 18 Jan 05
Posts: 1910
Credit: 1442542811
RAC: 1256541

http://einsteinathome.org/task/498800607

Input file on command line ../../projects/einstein.phys.uwm.edu/PM0023_039C1_217.bin4 doesn't agree with input file from checkpoint header.
[ERROR] Demodulation failed (error: 2)!

I remember seeing that here before but never saw anyone figure out if it was a memory problem or drivers or BOINC version.

But you are completing S6Bucket tasks, so it is a problem with your video card.

Are you using GPU-Z or something to check the temperature and other numbers?
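
If a GUI tool is not convenient, readings can also be logged from a script. The following is only a minimal sketch, assuming `nvidia-smi` is installed and on the PATH; the chosen fields, the file name `gpu_log.csv`, the 5 second interval and the function names are all illustrative assumptions, not something used in this thread. Each line is forced to disk so that the last reading before a hard hang has a better chance of surviving.

```python
import os
import subprocess
import time

# Fields understood by `nvidia-smi --query-gpu`; this selection is an assumption.
FIELDS = ["temperature.gpu", "fan.speed", "utilization.gpu", "memory.used"]


def parse_sample(line):
    """Turn one CSV line from nvidia-smi into a dict; non-numeric values become None."""
    sample = {}
    for name, raw in zip(FIELDS, line.split(",")):
        raw = raw.strip()
        try:
            sample[name] = float(raw)
        except ValueError:
            sample[name] = None  # e.g. "[Not Supported]" on cards without that sensor
    return sample


def read_gpu(index=0):
    """Query one GPU once; raises if nvidia-smi is missing or fails."""
    cmd = [
        "nvidia-smi",
        f"--id={index}",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_sample(out.strip().splitlines()[0])


def log_forever(path="gpu_log.csv", interval_s=5.0):
    """Append a timestamped sample every interval_s seconds until interrupted."""
    with open(path, "a") as log:
        while True:
            sample = read_gpu()
            row = [time.strftime("%Y-%m-%d %H:%M:%S")]
            row += [str(sample[name]) for name in FIELDS]
            log.write(",".join(row) + "\n")
            log.flush()
            os.fsync(log.fileno())  # push the line to disk before the next wait
            time.sleep(interval_s)
```

Calling `log_forever()` in a console window before starting a task leaves a file whose last lines show the temperature, fan speed and load just before the screen went dark.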

Rechenkuenstler
Joined: 22 Aug 10
Posts: 138
Credit: 102567115
RAC: 0

Yes. I'm using the GPU-Observer Sidebar Gadget. The problem is definitely with the video card, and only with the Parkes application. BRP4 Arecibo runs fine.

Next I will check whether it is linked to the screens being switched off by power management when idle. I have only seen the error when the screens are in idle (energy-saving) mode.

Rechenkuenstler
Joined: 22 Aug 10
Posts: 138
Credit: 102567115
RAC: 0

I made a number of more detailed tests to learn more about the error situation.

Error event (hardware):
The graphics card hangs. When the screens are in energy-saving mode, they cannot be reactivated and stay dark. If the error occurs while I am working on the computer, the screens turn white and nothing else is displayed. In both cases the computer must be turned off and on again to reboot.

Error event (software):
When the error event happens, nothing is written to the error.txt file or anywhere else. There is no error log at all. After the computer restarts, the task also restarts from the last checkpoint and runs normally until the next event (see the error frequency section) or until the end.

Hardware details:
CPU type GenuineIntel
Intel(R) Xeon(R) CPU E3-1241 v3 @ 3.50GHz [Family 6 Model 60 Stepping 3]
Number of processors 8
Coprocessor NVIDIA Quadro K4200 (4095MB) driver: 34807
Operating system Microsoft Windows 7
Professional x64 Edition, Service Pack 1, (06.01.7601.00)
BOINC client version 7.4.42

Tested graphics driver versions:
Since certified drivers from software manufacturers exist for Quadro cards, I've tested with different versions of such graphics drivers, as well as with several genuine Nvidia drivers. These were: the last certified driver from Adobe; the last certified driver from Dell; an older driver version from Nvidia (but newer than both certified drivers); and the latest Nvidia driver (currently installed). Result: no difference between the driver versions. All produce the same error event with the same frequency.

Affected applications:
Both applications, Arecibo and Parkes, are affected. The error event is identical. The difference is the frequency at which the error occurs.

Error frequency:
With the Arecibo application the error does not occur in every WU and is totally random, on average about once every 2 days. That means the error occurs in about 1 of 60 WUs.

The picture with the Parkes application is completely different. Here the error occurs at least once per WU. Sometimes there are two or three errors in one WU.
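
To put those two frequencies side by side, here is a rough back-of-the-envelope calculation. It uses only the figures quoted above; the variable names are made up for illustration, and the comparison is per WU, so it does not account for any difference in run time between the two task types.

```python
# Figures quoted in the post above.
arecibo_days_per_crash = 2.0      # "once every 2 days"
arecibo_wu_per_crash = 60.0       # "about 1 of 60 WUs"
parkes_min_crashes_per_wu = 1.0   # "at least once per WU"

# Implied Arecibo throughput in work units per day.
arecibo_wu_per_day = arecibo_wu_per_crash / arecibo_days_per_crash

# Average number of crashes per Arecibo work unit.
arecibo_crashes_per_wu = 1.0 / arecibo_wu_per_crash

# Lower bound on how much more often a Parkes WU crashes than an Arecibo WU.
ratio = parkes_min_crashes_per_wu * arecibo_wu_per_crash

print(f"Arecibo: about {arecibo_wu_per_day:.0f} WU/day, "
      f"{arecibo_crashes_per_wu:.3f} crashes per WU")
print(f"Parkes: at least {ratio:.0f}x the per-WU crash rate")
```

So the quoted numbers imply roughly 30 Arecibo WUs per day, and a Parkes WU hitting the hang at least 60 times as often as an Arecibo WU.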

Logforme
Joined: 13 Aug 10
Posts: 332
Credit: 1714373961
RAC: 0

Quote:

http://einsteinathome.org/task/498800607

Input file on command line ../../projects/einstein.phys.uwm.edu/PM0023_039C1_217.bin4 doesn't agree with input file from checkpoint header.
[ERROR] Demodulation failed (error: 2)!

I remember seeing that here before but never saw anyone figure out if it was a memory problem or drivers or BOINC version.


I had that error on one of my machines. That turned out to be a faulty RAM stick.
It was a brand new machine with 4x8GB RAM in it. The problem RAM stick was mapped to the top of the 32GB memory so the machine ran fine for most everything, but under Linux the machine would sometimes produce these faulty E@H results. Under Windows it just blue screened as I remember it.
Easily found with the Linux memtest86 application once I got it into my head that brand new memory could be faulty (which in hindsight of course is more likely than old memory going bad. Test your new memory sticks!).

Rechenkuenstler
Joined: 22 Aug 10
Posts: 138
Credit: 102567115
RAC: 0

After my last test series, I can say that the error message reported earlier is not connected to the crash situation. With that earlier error message, the WUs ended with a calculation error.
I have worked out that this is a different problem and has nothing to do with the crashing of the graphics card.
Again: there is NO error log, and the WUs restart from the last checkpoint as if nothing had happened. But I will test the memory sticks. Thanks for the hint.
