About a week ago this host, http://einsteinathome.org/host/10191797, suddenly started failing with video driver errors. I've suspended all the tasks while trying to drain the WU cache. Yesterday I reset the project, with no luck. Ten minutes ago I tried rolling the video driver back from 347.xx to 340.52. Nothing helps. BRP WUs fail within a few seconds of starting, and the video driver fails along with them. Intel OpenCL WUs work just fine, and I can't stop them: they keep running even when CPU and GPU computation is suspended manually. FGRP4 runs smoothly too. What can it be?
GPU died?
I can't advise on the main video driver problem, but this caught my eye.
This has been observed at other OpenCL BOINC projects, but I hadn't previously heard of it here.
There was a known bug in the BOINC OpenCL API, which was finally fixed on 31 October 2014 - it might cause this behaviour.
Commit f0c39bdf5117d8f7dd5092033971d7f700bd22dc
(linking the GitHub mirror because BOINC's own Git repo seems to be inaccessible at the time of writing).
The Intel-GPU application is being investigated at the moment because of driver incompatibilities: perhaps this API fix could be incorporated in any re-issue.
Some errors I spotted:
"Failed to enable CUDA thread yielding for device #0 (error: 999)! Sorry, will try to occupy one CPU core.."
"Error during CUDA device->host time series mean transfer (error: 999)"
To me these suggest a problem in communication between the GPU and the CPU. Possibly because the CPU is too busy? Try to disable all CPU tasks and see if that helps.
Is the problem with the CPU/Motherboard or the GPU? Try to narrow the problem down by moving the GPU around:
Move the GPU to another PCIe slot
Move the GPU to another machine
Disclaimer: I'm just a happy amateur. No real knowledge of BOINC, E@H, GPUs or anything else really.
RE: Possibly because the CPU is too busy?
Judging by the runtime of the intel_gpu apps (>> 1 hour), I'd agree with that. Task time would decrease by 5x or 6x if even one core was freed.
RE: RE: Possibly because the CPU is too busy?
Tried, with no luck. It runs to 0.75%, then either fails or pauses for 15 minutes with the message "not enough free CPU/GPU memory available! Delaying next attempt for at least 15 minutes...". Then, after the restart, it fails.
This configuration ran successfully for months. Then, about 2 weeks ago, it suddenly started to fail with no visible cause. Before that I used app_config.xml to crunch 3 GPU WUs and 3 Intel WUs at a time. I have 4 GB of RAM with only 2976 MB available (on Win7-32 and Win7-64 alike, because of the Intel GPU), but this was enough. Now I've reset the project (no other projects are running), having manually removed app_config.xml before the reset. But it tries to use 0.02 CPU + 0.5 GPU for NVIDIA and 0.3 CPU + 0.3 GPU for Intel. Strange indeed...
The GeForce GTX 660 Ti I use runs games like World of Tanks normally, with no visible glitches (e.g. from video memory). I can play for hours even now, but I can't resume GPU tasks. Right after the last failure I rechecked the power cables for this card (2x 6-pin connectors) and reseated the board. Nothing helps.
Will try:
1) another slot (I have 2 on the motherboard)
2) another PSU.
But since games run fine on this card, I suspect the cause lies somewhere else, not in either of these.
RE: There was a known bug in the BOINC OpenCL API
Thanks Richard. I've already stopped all the tasks one by one...
and found one more strange behaviour:
even when stopped, FGRP4 tasks don't leave memory, although BOINC Manager is set to remove tasks from memory. I have to exit BOINC with the "stop running applications" checkbox ticked and start it again to free the memory they hold.
Changing slot and PSU didn't help. What else can I try?
Stranger7777,
So you are crunching 9 WUs simultaneously?
2 x Perseus (ca. 150 MB x 2)
3 x Arecibo (ca. 150 MB x 3)
4 x Gamma-ray pulsar search #4 v1.05 (FGRP4-Beta) (ca. 600 MB (?) x 4)
And that previously worked with 2.9 GB of RAM, minus what the Intel(R) HD Graphics 2500 takes (1224 MB)?
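Adding up the estimates in the list above gives a rough memory budget. This is only a minimal Python sketch: it assumes the approximate per-task figures quoted here (the 600 MB FGRP4 figure is itself marked as uncertain) and the 2976 MB of usable RAM reported earlier in the thread. The function name is purely illustrative.

```python
# Rough memory budget for the task mix listed above.
# (count, approx. MB per task) are the estimates from this post, not measurements.
TASKS = {
    "Perseus": (2, 150),
    "Arecibo": (3, 150),
    "FGRP4-Beta": (4, 600),  # marked "(?)" in the post, so treat as a guess
}

AVAILABLE_MB = 2976  # usable RAM reported earlier in the thread


def total_demand_mb(tasks):
    """Sum count * per-task memory over all task types."""
    return sum(count * mb for count, mb in tasks.values())


demand = total_demand_mb(TASKS)
shortfall = demand - AVAILABLE_MB

print(f"Estimated demand: {demand} MB")
print(f"Available:        {AVAILABLE_MB} MB")
print(f"Shortfall:        {shortfall} MB")
```

On these figures the nine tasks would want about 3150 MB, roughly 174 MB more than is available, and that is before counting the operating system itself.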
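<app_config>
  <app>
    <name>einsteinbinary_BRP4</name>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>0.5</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>einsteinbinary_BRP5</name>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>0.5</cpu_usage>
    </gpu_versions>
  </app>
</app_config>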
In BOINC Manager:
Use no more than 99% of the processors.
Use no more than 0% of the paging file.
Set memory usage high.
Do not enable "Leave applications in memory when they are paused".
Exactly. It worked smoothly enough thanks to the SSD. Until this failure.
After that I removed app_config.xml and reset the project.
Again, that didn't solve the problem. I don't know what else it can be.
Check the utilization factors in your Einstein@home preferences. To run one task per GPU set them to 1.
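The same thing can also be set locally with an app_config.xml saved in the Einstein@Home project directory. Below is a minimal Python sketch that writes one out. The app names are the ones mentioned in this thread; the cpu_usage value of 0.5 and the helper names are assumptions for illustration only, so adjust them to your own setup.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_names, gpu_usage=1.0, cpu_usage=0.5):
    """Build an <app_config> tree with one <app> entry per application name.

    gpu_usage=1.0 reserves a whole GPU per task (one task per GPU);
    cpu_usage=0.5 is an assumed value, not a recommendation.
    """
    root = ET.Element("app_config")
    for name in app_names:
        app = ET.SubElement(root, "app")
        ET.SubElement(app, "name").text = name
        versions = ET.SubElement(app, "gpu_versions")
        ET.SubElement(versions, "gpu_usage").text = str(gpu_usage)
        ET.SubElement(versions, "cpu_usage").text = str(cpu_usage)
    return root


def tasks_per_gpu(gpu_usage):
    """How many tasks share one GPU for a given usage fraction."""
    return int(1 / gpu_usage)


config = build_app_config(["einsteinbinary_BRP4", "einsteinbinary_BRP5"])
xml_text = ET.tostring(config, encoding="unicode")
print(xml_text)

# A factor of 0.5 lets two tasks share a GPU; 1.0 gives one task per GPU.
print(tasks_per_gpu(0.5), tasks_per_gpu(1.0))
```

Writing xml_text to app_config.xml and having BOINC re-read its config files (or restarting the client) should make it take effect.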
RE: Check the utilization factors
Done through app_config.xml. It doesn't help: computation error with a video driver restart.
I will try clearing BOINC out completely, followed by a clean install in a slightly different folder.