Multi NVIDIA GPUs with almost half Invalids

KWSN-SpongeBob SquarePants
Joined: 24 Apr 07
Posts: 3
Credit: 77019125
RAC: 0
Topic 198132

Ni!

I have recently had a real problem with my results being invalid. In the last screen on my account summary I have had 304 Valid results and 208 Invalid. I understand errors, but that is almost 915,000 points lost. They are factory overclocked 980GTX x 2 and a 780TI running together. They had been running a higher daily total of 350-380,000 points per day but have recently dropped. I have not changed any settings. Is there a new project that utilizes resources differently?
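For scale, the figures quoted above can be turned into an invalid share and a rough per-task credit. This is only a back-of-the-envelope sketch in Python; the function name is made up for illustration, and the 915,000 figure is the approximate loss stated in the post.

```python
def invalid_fraction(valid: int, invalid: int) -> float:
    """Share of completed tasks that were marked invalid."""
    total = valid + invalid
    if total == 0:
        raise ValueError("no completed tasks to evaluate")
    return invalid / total


# Numbers taken from the post above.
valid_count = 304
invalid_count = 208
lost_credit = 915_000  # approximate, as stated in the post

share = invalid_fraction(valid_count, invalid_count)
credit_per_invalid = lost_credit / invalid_count

print(f"invalid share: {share:.1%}")
print(f"approx. credit lost per invalid task: {credit_per_invalid:.0f}")
```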

I am running these at 3x per WU and the loads are 88%, 89% and 91%. They had been running higher loads, but I noticed recently they have dropped from 94%.

Any suggestions from the peanut gallery?

The only thing I can think of is the new Nvidia driver for Windows 10. I am running 352.84.

Is there anything to glean from the Stderr file of why they failed?

My computers are viewable from the main page.

http://einsteinathome.org/host/11747462/tasks&offset=0&show_names=1&state=0&appid=0

Thanks in advance for any insight...

KWSN-SpongeBob SquarePants

Brent





Manuel Palacios
Joined: 18 Jan 05
Posts: 40
Credit: 224259334
RAC: 0


Hey, I run 2 GTX970's here and run x3 tasks per GPU. I am almost certain that what you are experiencing is a driver related issue. I run 350.12 and get a fluctuation anywhere from 88% load on my cards to 95% load, depending on the mix of work units being processed.

From my observations, different drivers affect GPU computation loads differently. Why, you may ask? That I do not know. However, back when I ran 344.60 I had higher average GPU utilization (~94-96%), and subsequent driver releases made it fall. It wasn't until 350.12 came around that my work unit completion times averaged out to what I was seeing under 344.60 and my GPU utilization started averaging in the low-to-mid 90% range as well.

Finally, I have found that stopping BOINC and restarting it can mess up the CUDA tasks on my 970's, and maybe a mix of this is what you are seeing on your PC?

Have you checked to see which cards are throwing out the invalids? In the stderr, you can see which device # each WU went to, which in turn corresponds to a specific graphics card in your PC.

That may be helpful as the Maxwell 2 architecture cards (GTX9xx) seem to show strange behavior under BOINC at times.
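One way to do that check is to tally outcomes per device number. The Python sketch below assumes you have copied each task's outcome and stderr text by hand; the "device #N" pattern and the sample lines are assumptions for illustration, not the exact stderr format, so adjust the regular expression to whatever your own stderr shows.

```python
import re
from collections import Counter

# Assumed pattern: somewhere in the stderr text there is "device #<number>".
DEVICE_RE = re.compile(r"device #(\d+)", re.IGNORECASE)


def device_of(stderr_text):
    """Return the first device number found in the text, or None."""
    match = DEVICE_RE.search(stderr_text)
    return int(match.group(1)) if match else None


def tally_by_device(tasks):
    """tasks: iterable of (outcome, stderr_text) pairs.

    Returns a Counter keyed by (device_number, outcome).
    """
    counts = Counter()
    for outcome, stderr_text in tasks:
        counts[(device_of(stderr_text), outcome)] += 1
    return counts


# Made-up sample data, only to show the shape of the input.
sample = [
    ("valid", "made-up line: using device #0"),
    ("valid", "made-up line: using device #1"),
    ("invalid", "made-up line: using device #2"),
    ("invalid", "made-up line: using device #2"),
]

for (device, outcome), n in sorted(tally_by_device(sample).items()):
    print(f"device #{device}: {n} {outcome}")
```

If one device number collects nearly all the invalids, that card is the first one to look at.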

Best of luck!

Logforme
Joined: 13 Aug 10
Posts: 332
Credit: 1714373961
RAC: 0


Here are my peanuts.

Valid tasks seem to take longer (around 10k secs) while invalids are faster (around 6-7k). The invalid tasks seem to be produced by the 780Ti.

Quote:
They are factory overclocked 980GTX x 2 and a 780TI running together.


I have seen factory overclocked cards producing invalids. Suggest you try to underclock (restore to stock) the 780Ti and see if that helps. Pushing a card slightly over the limit can produce errors. Errors causing artifacts in a game is no big deal, errors while computing a task is a big deal.

Quote:
The only thing I can think of is the new Nvidia driver for Windows 10. I am running 352.84.


Backing down to the previous version should be an easy test.

KWSN-SpongeBob SquarePants
Joined: 24 Apr 07
Posts: 3
Credit: 77019125
RAC: 0


Quote:

Here are my peanuts.

Valid tasks seem to take longer (around 10k secs) while invalids are faster (around 6-7k). The invalid tasks seem to be produced by the 780Ti.

I will research which actual cards are throwing out the duds.

Quote:
They are factory overclocked 980GTX x 2 and a 780TI running together.

I have seen factory overclocked cards producing invalids. Suggest you try to underclock (restore to stock) the 780Ti and see if that helps. Pushing a card slightly over the limit can produce errors. Errors causing artifacts in a game is no big deal, errors while computing a task is a big deal.

I will try to drop it down a little. It is running at 1137 currently.

Quote:
The only thing I can think of is the new Nvidia driver for Windows 10. I am running 352.84.

Backing down to the previous version should be an easy test.

I tried that and I lost all GPU knowledge from BOINC. I think it is a combo of Windows 10 and new drivers. I can try an older version than the most recent, i.e. go back two versions.

Thanks for the insight! @MP and @Log

SBSP





Logforme
Joined: 13 Aug 10
Posts: 332
Credit: 1714373961
RAC: 0


Quote:
I tried that and I lost all GPU knowledge from BOINC. I think it is a combo of Windows 10 and new drivers.


Life on the bleeding edge :)

MAGIC Quantum Mechanic
Joined: 18 Jan 05
Posts: 1704
Credit: 1069966093
RAC: 1274492

This is why I save my peanuts and let others test the Win10 update before I switch any of mine. I am anti-error with all my OC'd cards, and they don't get to take a break yet, since on my DSL it will take hours to do that upgrade on each one.

By then it will be peanut butter here

KWSN-SpongeBob SquarePants
Joined: 24 Apr 07
Posts: 3
Credit: 77019125
RAC: 0


Well, thanks all who replied.

I reinstalled the NVIDIA drivers for Windows 10 again and backed off the speed of the 780Ti from 1137 to 1098, and I have not had a failure in 24 hours. Let's hope that was it. My daily rate in E@H has gone from 200,000 to almost 500,000.

I also dialed down CPU usage in BOINC to 80% to give the GPUs some CPU headroom. I am running a ton of ASICs on the machine, so I am sure that was overloading the system even though it is screaming fast.

i7 5820 running at 4.0 GHz, liquid cooled, 2 980s and a 780Ti, and 8 x 125 GH/s ASICs, all running at full blast!

(also a NVIDIA GTX 770 running on another machine.)

Ni!

SpongeBob SquarePants





MAGIC Quantum Mechanic
Joined: 18 Jan 05
Posts: 1704
Credit: 1069966093
RAC: 1274492

Yes, it sounds like you have it fixed. Card drivers are that way sometimes.

I usually wait before I update the drivers even though you get the message every time you look.

And yes they always run faster and better when you leave a free core for them.

I mess around with mine all the time doing different things with my OC'd cards

invader zim
Joined: 16 Oct 10
Posts: 2
Credit: 1936850772
RAC: 314417

I just upgraded to a GTX 970. How do I run more than one task at a time? Thx, Zim

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5845
Credit: 109970609732
RAC: 30166031


Go to your account page and click on the Einstein preferences. You need to set the GPU utilization factor - 0.5 for 2x, 0.33 for 3x, 0.25 for 4x, etc. Multiple tasks will be started after your host does the next work fetch. You can speed this up with a small increase in your work cache settings (which can be reverted after the work fetch occurs).
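The relationship between the factor and the number of concurrent tasks is just a reciprocal. A minimal Python sketch follows; the function names are made up for illustration, and the second function assumes tasks are started for as long as their combined factor does not exceed one whole GPU.

```python
def gpu_utilization_factor(tasks_per_gpu: int) -> float:
    """Factor to enter in the preferences for N concurrent tasks per GPU."""
    if tasks_per_gpu < 1:
        raise ValueError("tasks_per_gpu must be at least 1")
    return 1.0 / tasks_per_gpu


def concurrent_tasks(factor: float) -> int:
    """Number of tasks whose combined factor fits within one GPU.

    Assumption: tasks run as long as the sum of their factors is <= 1,
    so a rounded value such as 0.33 still yields 3 tasks.
    """
    if not 0.0 < factor <= 1.0:
        raise ValueError("factor must be in the range (0, 1]")
    # Small tolerance guards against floating-point error, e.g. 1 / 0.2.
    return int(1.0 / factor + 1e-9)


for n in (1, 2, 3, 4):
    factor = gpu_utilization_factor(n)
    print(f"{n}x -> factor {factor:.2f}")
```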

Be aware that some people report adverse effects when running multiple tasks from different science runs. You are currently running both BRP4G and BRP6. You may find better performance if you select just one of these when running multiple concurrent tasks.

At the moment a BRP6 beta app is being tested which is using cuda55 libs. The old app uses cuda32. For most people using Maxwell cards, this seems to be giving ~20% performance improvement. If it were my card, I'd let the current work finish and select (through preference settings) the BRP6 science run only. I'd then change the preference setting to allow beta test apps to run on the machine. You should try 2x, 3x, 4x to see what the optimum concurrency is for your setup. It will be a case of diminishing returns but seeing as you aren't running CPU tasks, I would think each higher setting will give an improvement. The only way to know for certain is to do experiments.
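To compare settings from such an experiment, what matters is tasks completed per day, not the elapsed time of a single task. The Python sketch below shows the arithmetic; the timings are hypothetical placeholders, not measurements from any real host, and the function names are made up for illustration.

```python
SECONDS_PER_DAY = 86_400


def tasks_per_day(concurrency: int, seconds_per_task: float) -> float:
    """Throughput when `concurrency` tasks run side by side on one GPU,
    each taking `seconds_per_task` of elapsed time."""
    if concurrency < 1 or seconds_per_task <= 0:
        raise ValueError("concurrency must be >= 1 and time must be > 0")
    return concurrency * SECONDS_PER_DAY / seconds_per_task


def best_setting(measurements: dict) -> int:
    """measurements maps concurrency -> average elapsed seconds per task."""
    return max(measurements, key=lambda n: tasks_per_day(n, measurements[n]))


# Hypothetical average elapsed times per task at 1x, 2x, 3x and 4x.
measured = {1: 4000.0, 2: 7000.0, 3: 10000.0, 4: 13600.0}

for n, secs in measured.items():
    print(f"{n}x: {tasks_per_day(n, secs):.1f} tasks/day")
print(f"best setting in this made-up data: {best_setting(measured)}x")
```

With these placeholder numbers the gain shrinks at each step and then reverses, which is the diminishing-returns pattern to look for in your own measurements.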

Just be aware that your GPU will be under higher load so you need to pay attention to proper cooling. That's why there is a warning attached to the settings.

Cheers,
Gary.

invader zim
Joined: 16 Oct 10
Posts: 2
Credit: 1936850772
RAC: 314417


Thank you for the information. I will have to give it a try.
