Pascal again available, Turing may be coming soon

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

archae86 wrote:
Not sure what you mean by "average time". 

When testing multiple work units, we look at "average time", that is, the total time to complete divided by the number of work units running on the GPU:

total time (secs) / # of work units on GPU = time to complete per work unit

For example:

496 seconds / 1 = 496 seconds per work unit

962 seconds / 2 = 481 seconds per work unit

Then you collect data on, say, 20 tests, find the average of those 20 tests, and that is the average for that number of work units running on the GPU.
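As a rough sketch of that arithmetic (the function names are made up for illustration, and the batch totals in the usage note are hypothetical, not measured):

```python
def time_per_work_unit(total_seconds, concurrent_units):
    """Elapsed time for one batch divided by how many units ran together."""
    return total_seconds / concurrent_units


def average_time_per_work_unit(batch_totals, concurrent_units):
    """Average the per-unit time over several test batches at one multiplicity."""
    per_unit = [time_per_work_unit(t, concurrent_units) for t in batch_totals]
    return sum(per_unit) / len(per_unit)


# The two examples from the post.
print(time_per_work_unit(496, 1))  # 496.0
print(time_per_work_unit(962, 2))  # 481.0
```

With 20 batches you would pass all 20 total times as `batch_totals`; for instance `average_time_per_work_unit([960, 964], 2)` averages two hypothetical 2X batches.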

 

By your examples, you see that 2 at a time is 15 seconds faster per work unit than 1 at a time. I don't know what 3 at a time did, but that might also be much faster than 2 at a time. Then you test 4 at a time. At some point the time per work unit will start to go up, not down. When it does, the previous number per GPU is the sweet spot for how many to run at a time on the GPU.
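The search for the sweet spot could be sketched like this. The 1X and 2X figures are the ones above; the 3X and 4X figures are invented placeholders purely to show the logic, and `sweet_spot` is a hypothetical helper name:

```python
def sweet_spot(avg_seconds_by_multiplicity):
    """Return the last multiplicity before time per work unit stops improving."""
    counts = sorted(avg_seconds_by_multiplicity)
    best = counts[0]
    for prev, cur in zip(counts, counts[1:]):
        if avg_seconds_by_multiplicity[cur] >= avg_seconds_by_multiplicity[prev]:
            # Time per work unit went up (or stalled): the previous count wins.
            return prev
        best = cur
    return best


# 1X and 2X come from the thread; 3X and 4X are made-up values.
averages = {1: 496.0, 2: 481.0, 3: 478.0, 4: 485.0}
print(sweet_spot(averages))  # 3
```

If the times keep dropping through the highest multiplicity tested, the function just returns that highest one, which means more testing is needed.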

Look forward to your results.

 

Zalster

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109410157854
RAC: 35079308

He gave his results.  He used the term "productivity".  A higher productivity means more output per unit of time.

It would seem that the days of getting a decent increase in productivity by running 2x compared to 1x are gone.

At the outset, he mentioned he would try for a big swag of hopefully similar running tasks so as to compensate for any variability due to different data files.  After a quick look, he seems to be using tasks based on 1022L.dat and 1023L.dat.  These are the slower running variety when compared to the short burst of tasks that had code letters from M to Q.  The main thing to be careful of is not to allow any 'resend' results from the previous series to be part of the mix.  I have no worries on that score.  Archae86 is very careful in how he conducts these experiments.

As he mentioned, the ~3% improvement for all the higher multiplicities is hardly worthwhile.  It will be interesting to see how the overclocking goes.
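To make "productivity" concrete, here is a small sketch that converts per-task times into output per hour and compares them. It assumes the 496-second (1X) and 481-second (2X) per-task figures quoted earlier in the thread; the function names are illustrative only:

```python
def tasks_per_hour(seconds_per_task):
    """Productivity expressed as completed tasks per hour."""
    return 3600.0 / seconds_per_task


def productivity_gain(baseline_seconds, new_seconds):
    """Fractional increase in output when per-task time changes."""
    return tasks_per_hour(new_seconds) / tasks_per_hour(baseline_seconds) - 1.0


gain = productivity_gain(496, 481)
print(f"{gain:.1%}")  # 3.1%
```

On those two numbers the gain comes out at roughly 3%, in line with the figure mentioned above.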

 

Cheers,
Gary.

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7024844931
RAC: 1812309

I believe I hit the memory clock ceiling for my 8020 card today, at a considerably higher level than I expected.  In MSIAfterburner scale (which is 4X the GPU-Z scale for memory clock) I first saw trouble at +960 requested, having run +930 for a half hour successfully.  The symptoms were both a failed WU (error 69 this time) and downclocking of the memory clock to base level (but not to a far lower "safety" level, as I recall seeing on some earlier Nvidia GPU models).

I've got a fair number of hours of successful operation now at what I suspect will turn out to be my 24-hour safe level for both clocks, +125 requested on core clock, giving 1995, and +900 on memory clock, giving 1925 as reported by GPU-Z, or 7700 as reported by MSI Afterburner and some other applications.

At this level, the card is 1.097 times more productive than at stock clock for the current predominant flavor of Einstein WU.  

All of this post is for 1X running.  The most common elapsed time when the system lacks distractions (me running a browser, a backup running, the Antivirus doing a scan...) is reported at 7 minutes 32 seconds (452 seconds).

This will not be my long-term operating point, as even if it passes a 24-hour stability run, I plan to back off two steps each in memory clock and core clock.  

But first I'll check to see how close to these clocks it works when running 2X.

While I've claimed to be running stock, in fact I left intact the fan curve set in MSIAfterburner that I have used for many months now.  That curve, in conjunction with my PC case and fans, has had the 8020 running at about 69C with fans at 60% for this part of the testing today in a 79F ambient room.  Quite likely, were I to crank the fans to 100%, I might get an extra step or so of speed.  Possibly, if I let the on-card fan ramp take effect, I might lose a step.

By about 20 hours from now I should be home from a concert (I'm singing in a small chorus for a performance of Bach Cantata 140 "Wachet Auf" on a New Mexico Philharmonic subscription concert) and if the stability run looks good at 1X I'll dip my toe back in the 2X waters.

As overclocking experiences go, this one was well-behaved.  No sudden PC reboots, and the downclocking was recoverable on first try.  The modest core clock overclock available was an unpleasant surprise, and suggests that people choosing to stick with stock clock may be giving up less than usual.  (Or I may have just received a slow chip).

 

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7024844931
RAC: 1812309

I seem to have stumbled on a very serious problem with my new Turing 8020 card.  It has failed every time it has tried to run one of the immediate previous flavor of "high-pay" Gamma-ray pulsar units.

Most of these happened during my overnight run last night, so I initially assumed it was a matter of my clock rates, both of which were only one step below the failure level (this was meant to be part of my 24-hour proof run before easing down a little).  The first of them caused the card to downclock both core and memory clocks to default values, where they remained for the night.  So, based on low-pay WU elapsed times and the observed final state of the clocks, only the first failing high-pay unit was started with high-speed clocks.

After seeing several such failures from my log of overnight activity, I did a full power-off shutdown and reboot, with no overclocking.  After running a couple of the current low-pay units successfully, I gave it one high-pay unit. 

It failed with a similar signature as the overnight one: log shows about 25 seconds elapsed time with about 10 CPU seconds, and the history list shows it as a computation error type 28 (this was the type for my core-clock overspeed, while my memory clock overspeed was type 69).

If you like you can review the stderr for the one after my reboot here

It appears that with the current driver, this piece of hardware stumbles across something it can't do properly when running Einstein Gamma-ray pulsar work from data files LATeah0104Q, LATeah0104P, or LATeah0104O.  I'm just generalizing from inadequate data in supposing all the other high-pay WUs from that collection would also fail.

My observation covers a dozen such units attempted in the last half-day, with zero successes, and all seemingly similar failures.

I'm a bit at sixes and sevens as to what to do.  Clearly this is a killer deficiency.  I don't know whether I have a defective card, or there is a driver problem one might hope will go away in a driver revision soon, or I've stumbled on a Turing design flaw, or ...  Possibly it is a problem with the newish driver I'm using that has nothing to do with Turing at all.

In the short term, there seems little point in running more of the high-pay units, so I plan to suspend the ones I have in stock (not many), do another power-off reboot, and attempt to continue my just-below-the-ceiling proof run.

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7024844931
RAC: 1812309

After my previous report of 100% fast failures on running "high-pay" WUs, I thought I'd just suspend those for the moment and press on building hours on the regular kind.  But now the card has entered a pattern of suddenly down-clocking the core clock to 1515, running that way at greatly reduced power dissipation and temperature for roughly ten minutes, then perking up and running at the expected clock for about a minute, then repeating the cycle.

I first observed this when I was a fair way up a new ladder of rapidly cutting in half the difference between my clock rates requested and the previous ceiling, but not having reached it.  I've now dropped the overclock requests to zero.  Possibly the behavior will cease.
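One reading of that "ladder" is that each step closes half of the remaining gap between the current requested offset and the previous ceiling. A minimal sketch under that assumption follows; the function name and the offsets used are hypothetical and are not the values actually requested on the card:

```python
def halving_ladder(start, ceiling, steps):
    """Offsets that each close half the remaining gap to a known ceiling."""
    offsets = []
    current = float(start)
    for _ in range(steps):
        current += (ceiling - current) / 2  # move halfway toward the ceiling
        offsets.append(current)
    return offsets


# Hypothetical example: start at +0 and approach a ceiling of +800.
print(halving_ladder(0, 800, 4))  # [400.0, 600.0, 700.0, 750.0]
```

The ladder approaches the ceiling quickly but never reaches it, which fits "not having reached it".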

At this stage I don't know whether I may in fact have a defective card, which might explain some of the oddities.  As I bought it from NewEgg which has a replacement only policy for this item, my only option is to decide (soon) to request an RMA, send it back for another, and see if that does any better. 

While the clock is ticking, I think I should keep trying it out for somewhat longer, if only so I'll be better able to see whether a replacement seems any different.

Wish I'd kept the order with Amazon, but I was in a hurry.

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

If you are truly concerned, then I would go for the RMA and see if you can get a new one. I'd hate for you to be stuck with a defective product.

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

I see your card is running driver version 411.63. Have you considered trying out 411.70?

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 536670990
RAC: 185309

Peter, thanks for all those tests and information! It's certainly enough for now to judge "don't rush towards Turing".

Apart from that: I'm pretty sure your "8020" is actually a "2080" ;)

And be careful with the memory overclock: ever since GDDR5 it could be that a very high clock may be stable, but require too many resends on the bus, decreasing performance over a lower OC. Did the performance consistently increase as you increased the clock speed?

The reviews say there's now a decent auto-OC tool made available for the usual tuners. Could you try that and compare with your manual OC?

MrS

Scanning for our furry friends since Jan 2002

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7024844931
RAC: 1812309

Richie wrote:
I see your card is running driver version 411.63. Have you considered trying out 411.70?

Thanks.  I thought sure I had checked that and was up-to-date.  I hope that may resolve my more important problems.  Will do right away.

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7024844931
RAC: 1812309

ExtraTerrestrial Apes wrote:
your "8020" is actually a "2080"

True.  Oops.

Quote:
Did the performance consistently increase as you increased the clock speed?

Yes.

Quote:
The reviews say there's now a decent auto-OC tool made available for the usual tuners.

I'm troubled that they don't deal with memory clock (though that seems far less helpful for current Einstein on Turing than it did for then-current Einstein on Maxwell).  I might give it a try at some point, if things settle down and I keep the card.
