Sporadic validate errors using Parkes PMPS XT v1.57-BRP6-Beta-cuda55

archae86
Joined: 6 Dec 05
Posts: 3160
Credit: 7248116397
RAC: 1358712

RE: This is going to take

Quote:
This is going to take some detective work unless someone points me at a simple way to determine which card is #0 according to BOINC. Any suggestions??


The dead simple solution would be to flip a coin and pull one card out of the box. Rinse and repeat. Compare.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2969911757
RAC: 694025

In 'tower' orientation, the

In 'tower' orientation, the upper card tends to get hotter. GPUGrid prints both card temperatures into stderr_txt:

# GPU [GeForce GTX 970] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0	:
#	Name		: GeForce GTX 970
#	ECC		: Disabled
#	Global mem	: 4095MB
#	Capability	: 5.2
#	PCI ID		: 0000:01:00.0
#	Device clock	: 1228MHz
#	Memory clock	: 3505MHz
#	Memory width	: 256bit
#	Driver version	: r349_00 : 35012
# GPU 0 : 78C
# GPU 1 : 47C
# GPU 0 : 79C
# GPU 0 : 80C
# Time per step (avg over 17500000 steps): 	2.594 ms
# Approximate elapsed time for entire WU:  	45403.341 s
# PERFORMANCE: 34830 Natoms 2.594 ns/day 0.000 ms/step 0.000 us/step/atom
18:45:03 (1976): called boinc_finish


Edit - since the temperature difference arises from hot air rising (or being blown) from the lower card, temperature - and hence failure rate - may be changed by physically removing a card.
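
For anyone who wants to compare the two cards without reading the log by eye, here is a minimal sketch that pulls the per-GPU temperatures out of a stderr dump like the one above. It assumes the temperature lines always look like "# GPU 0 : 78C", as in this one sample; the function name peak_temps is made up for illustration, and the GPU index is whatever the application prints, which is not necessarily BOINC's device number.

```python
import re
from collections import defaultdict

# Matches temperature lines such as "# GPU 0 : 78C".
# The format is an assumption taken from the single log sample in this thread.
TEMP_LINE = re.compile(r"^#\s*GPU\s+(\d+)\s*:\s*(\d+)C$")


def peak_temps(stderr_text):
    """Return {gpu_index: highest temperature seen} for a stderr dump."""
    peaks = defaultdict(int)
    for line in stderr_text.splitlines():
        match = TEMP_LINE.match(line.strip())
        if match:
            gpu, temp = int(match.group(1)), int(match.group(2))
            peaks[gpu] = max(peaks[gpu], temp)
    return dict(peaks)


# Temperature lines copied from the log above.
sample = """\
# GPU [GeForce GTX 970] Platform [Windows] Rev [3212] VERSION [65]
# GPU 0 : 78C
# GPU 1 : 47C
# GPU 0 : 79C
# GPU 0 : 80C
"""

print(peak_temps(sample))
```

A large gap between the two peaks, as in this sample, suggests which index is the hotter (probably upper) card. If the cards sit within a degree of each other, this will not separate them.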

Ron Kosinski
Joined: 23 Mar 05
Posts: 57
Credit: 1076087218
RAC: 767915

RE: RE: ... they stopped

Quote:
Quote:
... they stopped using V1.57 and have gone back to V 1.52!

No they haven't. V1.57 is the beta test app (BRP6-Beta-cuda55) and V1.52 is the standard non-test app (BRP6-cuda32-nv301). You must have changed your preferences so that BOINC no longer has permission to run test apps. You would be advised to re-enable that setting as the test app uses cuda55 libs which give significantly faster crunch times.


;-(

My bad, I must have unchecked the test app box by mistake. I checked it again and did a project update. Let's see what transpires.

Thanks, Ron

Keith Myers
Joined: 11 Feb 11
Posts: 4993
Credit: 18844234933
RAC: 5909288

RE: RE: Trouble is how do

Quote:
Quote:
Trouble is how do I determine how BOINC enumerates identical cards.

BOINC labels tasks by device number while running:

You could, as here, run different projects on each card, and use the SETI information to deduce, by elimination, the identity of the one which Einstein is running on. Or suspend all tasks except enough to fill one card, and apply the finger test - which card is still running hot?

And don't forget coproc_info.xml:

Quote:


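<coproc_cuda>
   <count>1</count>
   <name>GeForce GTX 970</name>
   ...
   <multiProcessorCount>13</multiProcessorCount>
   <pci_info>
      <bus_id>1</bus_id>
      <device_id>0</device_id>
      <domain_id>0</domain_id>
   </pci_info>
</coproc_cuda>
<coproc_cuda>
   <count>1</count>
   <name>GeForce GTX 750 Ti</name>
   ...
   <multiProcessorCount>5</multiProcessorCount>
   <pci_info>
      <bus_id>6</bus_id>
      <device_id>0</device_id>
      <domain_id>0</domain_id>
   </pci_info>
</coproc_cuda>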


I know I run GPUGrid on my fast card, so BOINC device 0 is in PCIe bus_id 1 (on this motherboard). And so on.
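
As a rough sketch of the same lookup done in code: the snippet below lists each GPU's name and PCIe bus_id from a coproc_info.xml-style document. The element names (coproc_cuda, name, pci_info, bus_id) and the single root element are assumptions about the file layout, the function name gpu_bus_ids is made up for illustration, and the sample is trimmed down to the values quoted above rather than being a real file.

```python
import xml.etree.ElementTree as ET


def gpu_bus_ids(xml_text):
    """Return [(gpu_name, bus_id)] in the order the entries appear in the XML.

    Element names are assumptions about coproc_info.xml; adjust them if
    your file differs.
    """
    root = ET.fromstring(xml_text)
    result = []
    for gpu in root.iter("coproc_cuda"):
        name = gpu.findtext("name", default="?").strip()
        bus = gpu.findtext("pci_info/bus_id")
        result.append((name, int(bus) if bus is not None else None))
    return result


# Trimmed sample using only the values quoted in this post.
sample = """<coprocs>
<coproc_cuda>
   <name>GeForce GTX 970</name>
   <pci_info><bus_id>1</bus_id></pci_info>
</coproc_cuda>
<coproc_cuda>
   <name>GeForce GTX 750 Ti</name>
   <pci_info><bus_id>6</bus_id></pci_info>
</coproc_cuda>
</coprocs>"""

for name, bus in gpu_bus_ids(sample):
    print(f"{name}: PCIe bus_id {bus}")
```

With two different cards the name settles which physical card is which. With identical cards only the bus_id differs, so you still need to know which slot maps to which bus_id on your motherboard.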


Thanks Richard. Unfortunately, that doesn't tell me anything, since the GPUs are identical. I know that the cards are in PCIe bus_id 1 and 2. Both cards run all projects equally with identical configurations. The simplest solution would be to just stop running Einstein on this computer, since it is the only project with errors. I can still try shotgunning the spare cards into the different positions. This will take some time because Einstein only gets a 10% resource allocation and it takes a couple of days to finish a task.

 

Keith Myers
Joined: 11 Feb 11
Posts: 4993
Credit: 18844234933
RAC: 5909288

RE: RE: This is going to

Quote:
Quote:
This is going to take some detective work unless someone points me at a simple way to determine which card is #0 according to BOINC. Any suggestions??

The dead simple solution would be to flip a coin and pull one card out of the box. Rinse and repeat. Compare.


Yes, it will just take some time, since I only complete 1 task every couple of days.

 

Keith Myers
Joined: 11 Feb 11
Posts: 4993
Credit: 18844234933
RAC: 5909288

RE: In 'tower' orientation,

Quote:

In 'tower' orientation, the upper card tends to get hotter. GPUGrid prints both card temperatures into stderr_txt:

# GPU [GeForce GTX 970] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0	:
#	Name		: GeForce GTX 970
#	ECC		: Disabled
#	Global mem	: 4095MB
#	Capability	: 5.2
#	PCI ID		: 0000:01:00.0
#	Device clock	: 1228MHz
#	Memory clock	: 3505MHz
#	Memory width	: 256bit
#	Driver version	: r349_00 : 35012
# GPU 0 : 78C
# GPU 1 : 47C
# GPU 0 : 79C
# GPU 0 : 80C
# Time per step (avg over 17500000 steps): 	2.594 ms
# Approximate elapsed time for entire WU:  	45403.341 s
# PERFORMANCE: 34830 Natoms 2.594 ns/day 0.000 ms/step 0.000 us/step/atom
18:45:03 (1976): called boinc_finish

Edit - since the temperature difference arises from hot air rising (or being blown) from the lower card, temperature - and hence failure rate - may be changed by physically removing a card.


Either card will be hotter or colder by only a degree or so, depending on the mix of projects actively running at any time. So no help there, because the temperature reaches equilibrium throughout the case.

 

archae86
Joined: 6 Dec 05
Posts: 3160
Credit: 7248116397
RAC: 1358712

RE: RE: RE: This is

Quote:
Quote:
Quote:
This is going to take some detective work unless someone points me at a simple way to determine which card is #0 according to BOINC. Any suggestions??

The dead simple solution would be to flip a coin and pull one card out of the box. Rinse and repeat. Compare.

Yes, it will just take some time, since I only complete 1 task every couple of days.


A further dead simple (but possibly not desired) solution for that would be to suspend all other projects from processing on that host during testing.

By the way, such a mass suspension would avoid all cases of tasks migrating from one GPU to the other in the absence of reboots or other interruptions--at least that is my personal experience and observation, though I'm not aware that BOINC guarantees it.

Keith Myers
Joined: 11 Feb 11
Posts: 4993
Credit: 18844234933
RAC: 5909288

RE: By the way, such a

Quote:

By the way, such a mass suspension would avoid all cases of tasks migrating from one GPU to the other in the absence of reboots or other interruptions--at least that is my personal experience and observation, though I'm not aware that BOINC guarantees it.


Yes, that would work, though obviously it's not very desirable. The whole point of this exercise was to build a new cruncher to boost my RAC for my primary project, Seti@Home. Well, I just pulled the card in the PCIe bus_id 1 slot, labelled it "suspicious" for Einstein work, and stuck in another 670. This card doesn't automatically boost as high as the previous card, so maybe it won't run so close to the edge and error out tasks like the suspect card. In a few days, with hopefully a few tasks crunched, we shall see whether I pulled the correct card and whether its replacement is all hunky-dory with Einstein work.

 

Keith Myers
Joined: 11 Feb 11
Posts: 4993
Credit: 18844234933
RAC: 5909288

I will say that my invalid

I will say that my invalid issue is resolved, either because I pulled the correct defective card or because its replacement doesn't GPU Boost as high as the original card. In any case, no more invalids since I swapped cards. Thanks for all the input and suggestions from my commenters.

 

Ron Kosinski
Joined: 23 Mar 05
Posts: 57
Credit: 1076087218
RAC: 767915

RE: You should reduce any

Quote:

You should reduce any overclocking back closer to default values to see if the problem goes away.

Good luck with finding what is causing the errors.

I put all parameters back to stock settings for the GPU core and memory clocks in the problem machine, enabled beta programs again (so embarrassed about that), and ran around 20 WUs on the 1.57 release without a single invalid WU. WOOHOO! I will start to increase the GPU core and GPU memory clock settings until I find the sweet spot for OC settings.
Any advice on which would be the more critical setting: GPU core clock or GPU memory clock?

Thanks to everyone for your help and assistance in this issue! Ron

I see the thread was somewhat hijacked, but we both had our respective issues solved, so, not a problem! Thanks again!
