large spike in BRP6 invalids

Anonymous
Topic 198078

I currently have 148 invalids on Binary Radio Pulsar Search (Parkes PMPS XT) v1.52 (BRP6-cuda32-nv270) on machine 167. This is an NVIDIA GTX 760 card. Nothing has changed on my end that I am aware of. All fans are operational and temps seem within acceptable range. At a loss as to why this machine is acting up unless the GPU is failing. Don't have another to swap out.

My other machines are clean.

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0

Quote:

I currently have 148 invalids on Binary Radio Pulsar Search (Parkes PMPS XT) v1.52 (BRP6-cuda32-nv270) on machine 167. This is an NVIDIA GTX 760 card. Nothing has changed on my end that I am aware of. All fans are operational and temps seem within acceptable range. At a loss as to why this machine is acting up unless the GPU is failing. Don't have another to swap out.

My other machines are clean.

robi, do you mean this host?

http://einsteinathome.org/host/11875078

I had a look over some of the tasks and noticed this host running only BRP6-cuda32-nv270 tasks at E@H, and some BRP6 tasks are still validating correctly, most not.

Just from the task times, they seem a bit slow for a 760 on BRP6. I am running a much slower 460 with GPU usage 0.5 (2 tasks per GPU) and they are completing in about 15 ks (15,000 seconds), RAC about 45K each.

I did see some stability problems running BRP6 at 0.33 (3 tasks per GPU) a while back.

Maybe running at 0.5 might be worth trying?
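For anyone who wants to set the GPU usage locally, here is a minimal sketch that writes out a BOINC app_config.xml. The app name einsteinbinary_BRP6 and the cpu_usage value of 0.5 are assumptions on my part, not something confirmed in this thread, so check the app name your client reports before using it.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, tasks_per_gpu: int, cpu_usage: float = 0.5) -> str:
    """Return app_config.xml text that lets `tasks_per_gpu` tasks share one GPU."""
    if tasks_per_gpu < 1:
        raise ValueError("tasks_per_gpu must be at least 1")
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    gpu_versions = ET.SubElement(app, "gpu_versions")
    # gpu_usage is the fraction of a GPU each task claims: 2 tasks -> 0.50, 3 tasks -> 0.33
    ET.SubElement(gpu_versions, "gpu_usage").text = f"{1 / tasks_per_gpu:.2f}"
    # cpu_usage here is an assumed value, adjust to taste
    ET.SubElement(gpu_versions, "cpu_usage").text = f"{cpu_usage:.2f}"
    ET.indent(root)  # pretty-print; needs Python 3.9 or later
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Assumed app name; 2 tasks per GPU corresponds to GPU usage 0.5
    print(build_app_config("einsteinbinary_BRP6", tasks_per_gpu=2))
```

The output would go into the project's directory under the BOINC data folder, and the client has to re-read its config files before the change takes effect.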

Anonymous

Thanks. I will give 0.5 a try on this machine. I have another NVIDIA machine running 4 concurrent BRP6 jobs without any issues. Seems odd.

archae86
Joined: 6 Dec 05
Posts: 3161
Credit: 7264475101
RAC: 1573014

My guess is that you'd be more likely to see improvement by lowering clock rate on the GPU than by lowering concurrency.

Even if you are currently at stock clock, I still suggest lowering clock rate.

This is only a guess. I repeat, this is only a guess.

AgentB

Quote:

Even if you are currently at stock clock, I still suggest lowering clock rate.

It is only fairly recently (346.x onwards, I think) that clock rates can be adjusted on Nvidia / Linux (for 4xx series cards anyway). It might have been earlier for 7xx series cards, so robi, you may need to upgrade your driver.

There are some links at gpugrid to follow if you want to try downclocking.
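As a rough illustration of what downclocking on Linux can look like, the sketch below only builds and prints an nvidia-settings command. It assumes the Coolbits option is already enabled in the X configuration and that the driver exposes the GPUGraphicsClockOffset attribute for the card. The GPU index, performance level and the -50 MHz offset are example values, not recommendations.

```python
import shlex


def clock_offset_command(gpu_index: int, perf_level: int, offset_mhz: int) -> list[str]:
    """Build an nvidia-settings call that applies a graphics clock offset.

    A negative offset_mhz lowers the clock at the given performance level.
    """
    attribute = f"[gpu:{gpu_index}]/GPUGraphicsClockOffset[{perf_level}]={offset_mhz}"
    return ["nvidia-settings", "-a", attribute]


if __name__ == "__main__":
    # Print instead of running, so nothing changes until the command has been reviewed.
    print(shlex.join(clock_offset_command(gpu_index=0, perf_level=3, offset_mhz=-50)))
```

Whether a given performance level accepts an offset depends on the card and driver, so treat the printed command as a starting point to check against the driver's own documentation.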

As a test I upped the clock rates by 10% last night on GPU0 of my GTX 460 to see what happens. It seems to get warmer and turn a couple of tasks over faster.

I'll see if they validate later.

Edit: I have been running 349.12 for several weeks without any issues. I notice 349.16 has been released.

AgentB

Quote:

As a test I upped the clock rates by 10% last night on GPU0 of my GTX 460 to see what happens. It seems to get warmer and turn a couple of tasks over faster.

I'll see if they validate later.

Well, they did, but after a few hours the card dropped down from Level3 727MHz (including the +10% I added) to Level1 405MHz.

It required a reboot to restore it to normal (727MHz)... Nvidia won that one.

Anonymous

The problem continues at 2 concurrent jobs. It looks like I will have to try a new driver in order to modify the GPU's clock rate. I will have to revisit this in a few days.

Mumak
Joined: 26 Feb 13
Posts: 335
Credit: 3555508153
RAC: 1299558

I had similar issues on a GeForce 660 Ti and RADEON R9 280X, both running at stock clocks (on the 280X I even reduced the GPU MemClock). It's really strange, since running the same stuff on my other GPUs like RADEON HD 7950, Quadros or Tesla K20c was OK.
A lot of invalids occurred when running BRP6 + BRP4G, but even running 2xBRP6 produced a lot of invalids. I have switched the problematic host to run a single unit only and this significantly reduced the rate of invalids. I suggest you try the same and see the results...

Anonymous

Quote:
I had similar issues on a GeForce 660 Ti and RADEON R9 280X, both running at stock clocks (on the 280X I even reduced the GPU MemClock). It's really strange, since running the same stuff on my other GPUs like RADEON HD 7950, Quadros or Tesla K20c was OK.
A lot of invalids occurred when running BRP6 + BRP4G, but even running 2xBRP6 produced a lot of invalids. I have switched the problematic host to run a single unit only and this significantly reduced the rate of invalids. I suggest you try the same and see the results...

Your experience seems quite similar to mine. Other PCs running Nvidia 770s, 650s with the same driver as this machine at "stock clock" are behaving well. This would sort of suggest that maybe this card is having an issue. Don't have a spare (unused) PC capable of supporting this card to try a swap.

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 582051482
RAC: 137966

Indeed, it sounds like a problem with the card. Did you already power off, remove the power cord for 10 minutes and try again?

You could also cross-swap the GTX760 with e.g. your GTX770 to check the system & software. This way you wouldn't need a spare PC. And if the error travels with the card we know it's the card.

Has it gotten hotter recently where you live? I know you said the temperature is fine, but chips age when run under load. If the chip has degraded so far that at stock clock it's borderline unstable, a temperature increase from e.g. 60°C to otherwise perfectly fine 70°C may be enough to push it past the stability boundary. Lowering the GPU clock speed slightly, as suggested by archae, would help in such cases.

And generally the new app pushes the GPUs harder, which might make a card fail that was apparently fine with the old app.

MrS

Scanning for our furry friends since Jan 2002

archae86

Quote:
And generally the new app pushes the GPUs harder, which might make a card fail that was apparently fine with the old app.


I observed this--the maximum acceptable clock rate was a bit slower with the revised BRP6 application than previously for one of my cards I knew to be running near the edge. In that case, most WUs still validated, but some failed where none had before. I made a big change downward, to check that the fails went away, then inched my way back up to find the new edge, then backed off when I found it.
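The procedure described above (big drop, inch back up, back off) can be sketched as a small search loop. This is only an illustration: is_stable is a hypothetical callback standing in for "run tasks at this clock for a while and check that none come back invalid", and the step sizes are made-up defaults.

```python
from typing import Callable


def find_stable_clock(
    is_stable: Callable[[int], bool],
    start_mhz: int,
    big_drop_mhz: int = 100,
    step_mhz: int = 10,
    margin_mhz: int = 10,
) -> int:
    """Drop well below the starting clock, climb in small steps, then back off.

    Returns the highest clock that tested stable, minus a safety margin.
    """
    clock = start_mhz - big_drop_mhz
    if not is_stable(clock):
        # If the fails do not go away after a big drop, clock rate is probably not the cause.
        raise RuntimeError("still unstable after the big drop")
    # Inch back up while the next step is stable, never going above the starting clock.
    while clock + step_mhz <= start_mhz and is_stable(clock + step_mhz):
        clock += step_mhz
    # Back off from the edge that was just found.
    return clock - margin_mhz


if __name__ == "__main__":
    # Pretend the card is stable up to 1040 MHz (made-up number for the demo).
    print(find_stable_clock(lambda mhz: mhz <= 1040, start_mhz=1100))
```

In practice each is_stable check takes hours or days of validated work, so the loop is really a plan to follow by hand and not something to automate blindly.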
