GPU died?

Stranger7777
Joined: 17 Mar 05
Posts: 436
Credit: 403,133,219
RAC: 7,611
Topic 197977

About a week ago this host, http://einsteinathome.org/host/10191797, suddenly started failing with video driver errors. I've suspended all the tasks to try to drain the WU cache. Yesterday I reset the project, with no luck. Ten minutes ago I rolled the video driver back from 347.xx to 340.52. Nothing helps. BRP WUs fail a few seconds after starting, and the video driver crashes along with them. Intel OpenCL WUs work just fine and I can't stop them, they run even if CPU and GPU computation is suspended manually. FGRP4 works smoothly too. What could this be?

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2,042
Credit: 811,172,272
RAC: 1,186,920

GPU died?

I can't advise on the main video driver problem, but this caught my eye.

Quote:
Intel OpenCL WUs work just fine and I can't stop them, they run even if CPU and GPU computation is suspended manually.


This has been observed at other OpenCL BOINC projects, but I hadn't previously heard of it here.

There was a known bug in the BOINC OpenCL API, which was finally fixed on 31 October 2014 - it might cause this behaviour.

Commit f0c39bdf5117d8f7dd5092033971d7f700bd22dc

(linking the GitHub mirror because BOINC's own Git repo seems to be inaccessible at the time of writing).

The Intel-GPU application is being investigated at the moment because of driver incompatibilities: perhaps this API fix could be incorporated in any re-issue.

Logforme
Joined: 13 Aug 10
Posts: 332
Credit: 1,714,373,961
RAC: 0

Some errors I spotted:
"Failed to enable CUDA thread yielding for device #0 (error: 999)! Sorry, will try to occupy one CPU core.."

"Error during CUDA device->host time series mean transfer (error: 999)"

To me these suggest a problem in communication between the GPU and the CPU. Possibly because the CPU is too busy? Try to disable all CPU tasks and see if that helps.
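
For anyone who wants to check whether the same error code keeps recurring across failed tasks, here is a minimal Python sketch that tallies the numeric codes in stderr text. It assumes the "(error: NNN)" format seen in the two messages quoted above; the function name `count_error_codes` and the regex are illustrative only and are not part of BOINC or the Einstein@Home application.

```python
import re
from collections import Counter

# Assumption: error codes appear as "(error: NNN)", as in the quoted messages.
ERROR_RE = re.compile(r"\(error:\s*(\d+)\)")


def count_error_codes(stderr_text: str) -> Counter:
    """Return how often each numeric error code occurs in the given text."""
    return Counter(int(code) for code in ERROR_RE.findall(stderr_text))


# The two lines quoted in this post, used as sample input.
sample = (
    "Failed to enable CUDA thread yielding for device #0 (error: 999)! "
    "Sorry, will try to occupy one CPU core..\n"
    "Error during CUDA device->host time series mean transfer (error: 999)\n"
)

counts = count_error_codes(sample)
for code, n in counts.most_common():
    print(f"error {code}: seen {n} time(s)")
```

If a single code dominates across many tasks, that points to one underlying fault rather than several unrelated ones.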

Is the problem with the CPU/Motherboard or the GPU? Try to narrow the problem down by moving the GPU around:
Move the GPU to another PCIe slot
Move the GPU to another machine

Disclaimer: I'm just a happy amateur. No real knowledge of BOINC, E@H, GPUs or anything else really.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2,042
Credit: 811,172,272
RAC: 1,186,920

Quote:
Possibly because the CPU is too busy? Try to disable all CPU tasks and see if that helps.


Judging by the runtime of the intel_gpu apps (>> 1 hour), I'd agree with that. Task time would decrease by 5x or 6x if even one core was freed.

Stranger7777
Joined: 17 Mar 05
Posts: 436
Credit: 403,133,219
RAC: 7,611

Quote:
Quote:
Possibly because the CPU is too busy? Try to disable all CPU tasks and see if that helps.

Judging by the runtime of the intel_gpu apps (>> 1 hour), I'd agree with that. Task time would decrease by 5x or 6x if even one core was freed.


Tried, with no luck. A task runs to 0.75%, then either fails or pauses for 15 minutes with the message "not enough free CPU/GPU memory available! Delaying next attempt for at least 15 minutes...". Then, after it restarts, it fails.
This config ran successfully for months. Then, about 2 weeks ago, it suddenly started to fail with no visible cause. Before that I used app_config.xml to crunch 3 GPU WUs and 3 Intel WUs at a time. I have 4 GB of RAM, of which only 2976 MB is available (under both Win7-32 and Win7-64, because of the Intel GPU). But this was enough. Now I've reset the project (no other projects are running), having manually removed app_config.xml before the reset. But it tries to use 0.02CPU+0.5GPU for NVidia and 0.3 CPU+0.3 GPU for Intel. Strange indeed...
The GeForce GTX 660 Ti I use runs games like World of Tanks normally, with no visible glitches (e.g. from faulty video memory). I can play for hours even now, but I can't resume GPU tasks. Right after the last failure I rechecked the power cables for this card (2x 6-pin connectors) and reseated the board. Nothing helps.
Will try:
1) another slot (I have 2 on the mobo)
2) another PSU.
But since games run fine on this card, I suspect the problem lies somewhere else, not in either of these.

Stranger7777
Joined: 17 Mar 05
Posts: 436
Credit: 403,133,219
RAC: 7,611

Quote:
There was a known bug in the BOINC OpenCL API, which was finally fixed on 31 October 2014 - it might cause this behaviour.

Thanks Richard. I've already stopped all the tasks one by one, and found one more strange behaviour: even when stopped, FGRP4 tasks don't leave memory, although the BOINC Manager settings are set to remove tasks from memory. I have to exit BOINC with the "stop running applications" checkbox ticked and start it again to free the memory they hold.

Stranger7777
Joined: 17 Mar 05
Posts: 436
Credit: 403,133,219
RAC: 7,611

Changing slot and PSU didn't help. What else can I try?

Pollux_P3D
Joined: 8 Feb 11
Posts: 30
Credit: 212,418,648
RAC: 0

Stranger7777 wrote:
0.02CPU+0.5GPU for NVidia and 0.3 CPU+0.3 GPU

So you are crunching 9 WUs simultaneously?

2 x Perseus (ca. 150 MB x 2)
3 x Arecibo (ca. 150 MB x 3)
4 x Gamma-ray pulsar search #4 v1.05 (FGRP4-Beta) (ca. 600 MB (?) x 4)

And that previously worked with 2.9 GB of RAM, minus what the Intel(R) HD Graphics 2500 takes (1224 MB)?
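
As a rough check of the numbers above, here is a small Python sketch. It assumes the client runs floor(1 / gpu_usage) tasks per GPU, uses the approximate per-task memory figures from this post (the 600 MB figure is marked as uncertain), and takes the 2976 MB of available RAM reported earlier in the thread. Which search runs on which GPU is my reading of the post, and the helper names are hypothetical.

```python
import math


def tasks_per_gpu(gpu_usage: float) -> int:
    """Assumption: the client packs floor(1 / gpu_usage) tasks onto one GPU."""
    return math.floor(1.0 / gpu_usage + 1e-9)


# (task count, estimated MB per task); memory figures are rough estimates.
workload = {
    "Perseus (0.5 GPU each)": (tasks_per_gpu(0.5), 150),
    "Arecibo (0.3 GPU each)": (tasks_per_gpu(0.3), 150),
    "FGRP4 (CPU)": (4, 600),  # 4 CPU tasks as listed in the post
}

total_tasks = sum(count for count, _ in workload.values())
total_mb = sum(count * mb for count, mb in workload.values())
available_mb = 2976  # reported earlier in the thread

print(f"tasks: {total_tasks}")
print(f"estimated memory: {total_mb} MB")
print(f"over budget by: {total_mb - available_mb} MB")
```

With these estimates the nine tasks would need about 3150 MB, roughly 174 MB more than the 2976 MB available, so the margin is thin at best.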


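<app_config>
  <app>
    <name>einsteinbinary_BRP4</name>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>0.5</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>einsteinbinary_BRP5</name>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>0.5</cpu_usage>
    </gpu_versions>
  </app>
</app_config>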

In BOINC Manager:
Use no more than 99% of the processors.
Use no more than 0% of the paging file.
Set high memory usage.
Do not activate "Leave applications in memory while suspended".

Stranger7777
Joined: 17 Mar 05
Posts: 436
Credit: 403,133,219
RAC: 7,611

Exactly. It worked smoothly enough thanks to the SSD, until this failure.
After that I removed app_config.xml and reset the project.
Again, that didn't solve the problem. I don't know what else it could be.

Holmis
Joined: 4 Jan 05
Posts: 1,118
Credit: 1,055,935,564
RAC: 14,397

Check the utilization factors in your Einstein@home preferences. To run one task per GPU set them to 1.

Stranger7777
Joined: 17 Mar 05
Posts: 436
Credit: 403,133,219
RAC: 7,611

Quote:
Check the utilization factors in your Einstein@home preferences. To run one task per GPU set them to 1.

Done through app_config.xml. It doesn't help: I still get a computation error along with a video driver restart.
Next I'll remove BOINC completely and do a clean install in a slightly different folder.
