I currently have 148 invalids on Binary Radio Pulsar Search (Parkes PMPS XT) v1.52 (BRP6-cuda32-nv270) on machine 167. This is an NVIDIA GTX 760 card. Nothing has changed on my end that I am aware of. All fans are operational and temps seem to be within an acceptable range. I'm at a loss as to why this machine is acting up, unless the GPU is failing. I don't have another card to swap in.
My other machines are clean.
large spike in BRP6 invalids
robi, do you mean this host?
http://einsteinathome.org/host/11875078
I had a look over some of the tasks and noticed this host is running only BRP6-cuda32-nv270 tasks at E@H. Some BRP6 tasks are still validating correctly, but most are not.
Just from the task times, they seem a bit slow for a 760 on BRP6. I am running a much slower 460 with GPU usage 0.5 (2 tasks per GPU), and they are completing in about 15,000 seconds, RAC about 45K each.
I did see some stability problems running BRP6 at 0.33 a while back.
Maybe that might be worth trying?
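For comparing task times between hosts that run different numbers of tasks per GPU, here is a rough sketch of the arithmetic. The function names are made up for illustration, and the only real figures are the GTX 460 numbers quoted above (2 tasks per GPU, about 15,000 s each).

```python
def effective_seconds_per_task(elapsed_s: float, concurrent_tasks: int) -> float:
    """Average wall-clock seconds per finished task when several run at once."""
    if concurrent_tasks < 1:
        raise ValueError("concurrent_tasks must be at least 1")
    return elapsed_s / concurrent_tasks


def tasks_per_day(elapsed_s: float, concurrent_tasks: int) -> float:
    """Tasks one GPU finishes in 24 hours at that concurrency."""
    return 86_400 / effective_seconds_per_task(elapsed_s, concurrent_tasks)


# GTX 460 figures from the post above: 2 tasks per GPU, about 15,000 s each.
print(effective_seconds_per_task(15_000, 2))  # 7500.0
print(round(tasks_per_day(15_000, 2), 2))     # 11.52
```

Dividing the elapsed time by the number of concurrent tasks gives a figure that can be compared directly with a host running one task at a time.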
Thanks. I will give 0.5 a try on this machine. I have another NVIDIA machine running 4 concurrent BRP6 jobs w/o any issues. Seems odd.
My guess is that you'd be more likely to see improvement by lowering clock rate on the GPU than by lowering concurrency.
Even if you are currently at stock clock, I still suggest lowering clock rate.
This is only a guess, repeat this is only a guess.
Clock rates have only fairly recently become adjustable on Nvidia under Linux, from driver 346.x onwards I think (for 4xx series cards anyway; it might have been earlier for 7xx series cards), so robi, you may need to upgrade your driver.
There are some links at gpugrid to follow if you want to try downclocking.
As a test I upped the clock rates by 10% last night on GPU0 to see what happens on my GTX 460. It seems to get warmer and turn a couple of tasks over faster.
I'll see if they validate later.
Edit: I have been running 349.12 for several weeks without any issues. I notice 349.16 has been released.
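For anyone wanting to try the downclocking mentioned above on Linux, the sketch below only builds an `nvidia-settings` command line and prints it; it does not run anything. The attribute name `GPUGraphicsClockOffset`, the need to enable the Coolbits option in the X configuration first, the performance level index 3 and the -50 MHz offset are all assumptions not taken from this thread, so check them against the README for your driver version before using the command.

```python
import shlex


def clock_offset_command(gpu_index: int, perf_level: int, offset_mhz: int) -> list[str]:
    """Build (but do not run) an nvidia-settings call that applies a clock offset.

    Assumption: the driver exposes GPUGraphicsClockOffset[<perf level>] and
    Coolbits has been enabled; neither is confirmed anywhere in this thread.
    """
    attribute = f"[gpu:{gpu_index}]/GPUGraphicsClockOffset[{perf_level}]={offset_mhz}"
    return ["nvidia-settings", "-a", attribute]


# Hypothetical values: GPU 0, performance level 3, 50 MHz below stock.
cmd = clock_offset_command(gpu_index=0, perf_level=3, offset_mhz=-50)
print(shlex.join(cmd))  # shell-quoted form, ready to paste into a terminal
```

A negative offset lowers the clock and a positive one raises it. Keeping the command as a list makes it easy to pass to `subprocess.run` later without worrying about shell quoting.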
Well, the tasks did validate, but after a few hours the card dropped down from Level3 727MHz (including the 10% I added) to Level1 405MHz.
It required a reboot to restore it to normal (727MHz)... Nvidia won that one.
The problem continues at 2 concurrent jobs. It looks like I will have to try a new driver in order to modify the GPU's clock rate. I will have to revisit this in a few days.
I had similar issues on a GeForce 660 Ti and RADEON R9 280X, both running at stock clocks (on the 280X I even reduced the GPU MemClock). It's really strange, since running the same stuff on my other GPUs like RADEON HD 7950, Quadros or Tesla K20c was OK.
A lot of invalids occurred when running BRP6 + BRP4G, but even running 2xBRP6 produced a lot of invalids. I have set the problematic host to run a single unit only, and this significantly reduced the rate of invalids. I suggest you try the same and see the results...
-----
Your experience seems quite similar to mine. Other PCs running Nvidia 770s, 650s with the same driver as this machine at "stock clock" are behaving well. This would sort of suggest that maybe this card is having an issue. Don't have a spare (unused) PC capable of supporting this card to try a swap.
Indeed, it sounds like a problem with the card. Did you already power off, remove the power cord for 10 minutes and try again?
You could also cross-swap the GTX760 with e.g. your GTX770 to check the system & software. This way you wouldn't need a spare PC. And if the error travels with the card we know it's the card.
Has it gotten hotter recently where you live? I know you said the temperature is fine, but chips age when run under load. If the chip has degraded so far that at stock clock it's borderline unstable, a temperature increase from e.g. 60°C to otherwise perfectly fine 70°C may be enough to push it past the stability boundary. Lowering the GPU clock speed slightly, as suggested by archae, would help in such cases.
And generally the new app pushes the GPUs harder, which might make a card fail that was apparently fine with the old app.
MrS
Scanning for our furry friends since Jan 2002
I observed this--the maximum acceptable clock rate was a bit slower with the revised BRP6 application than previously for one of my cards I knew to be running near the edge. In that case, most WUs still validated, but some failed where none had before. I made a big change downward, to check that the fails went away, then inched my way back up to find the new edge, then backed off when I found it.
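The procedure described above (a big step down, then inching back up until failures reappear, then backing off) can be sketched roughly as follows. `find_stable_clock` and the `is_stable` callback are hypothetical names, and the clock figures in the example are invented. In real life "is it stable at this clock" means running a batch of work units and waiting to see whether they validate, which takes days rather than a function call.

```python
from typing import Callable


def find_stable_clock(start_mhz: int, big_drop_mhz: int, step_mhz: int,
                      is_stable: Callable[[int], bool]) -> int:
    """Return the highest tested clock (never above start_mhz) that was stable."""
    clock = start_mhz - big_drop_mhz
    # First confirm that the big drop actually makes the failures go away.
    if not is_stable(clock):
        raise RuntimeError("still failing after the big drop; clock may not be the cause")
    # Inch back up one step at a time while results keep validating.
    while clock + step_mhz <= start_mhz and is_stable(clock + step_mhz):
        clock += step_mhz
    # The loop stops at the last clock that validated, i.e. backed off from the edge.
    return clock


# Invented example: a card that only validates at or below 1060 MHz.
edge = find_stable_clock(start_mhz=1100, big_drop_mhz=100, step_mhz=10,
                         is_stable=lambda mhz: mhz <= 1060)
print(edge)  # 1060
```

Starting with the big drop matters: if invalids continue even well below the old clock, the clock rate is probably not the cause and there is no point searching for an edge.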