RTX 3070 initial impressions

DanNeely
Joined: 4 Sep 05
Posts: 1364
Credit: 3562358667
RAC: 1580
Topic 224482

Last night I updated my gaming system (W10) from a GTX 1080 to an RTX 3070 and installed the latest NVidia drivers.

 

I initially left BOINC settings alone, and let it keep running Fermi tasks 3 at a time.

 

Initial results: it was completing 3 tasks in ~1500 seconds (vs ~2300 for the immediately preceding batch on the 1080).  This was a bit disappointing: only a ~50% speedup (2300/1500 ≈ 1.53x) vs the nominal 2x I expected.  Part of that could be clock related; my 3070 is at stock on air cooling, while my 1080 was water cooled and clocked high for gaming.  So far all the completed tasks have validated (or are waiting on a quorum), which is a promising start.

 

I'm running into bigger problems than just disappointing speed, though.  Both this morning and 12 hours later (after I was done with work; I didn't think to check in between) I found that, one after another, all 3 GPU tasks it was running had stalled out and were doing Zeno's progress bar with several hours on the runtime clock.  In addition, looking at results on the server I'm seeing some tasks that finished after ~4500 seconds.  Comparing what was reported for them against the fast ones, the only obvious difference is that the number in what I'm assuming is a checkpointed progress indicator ("% C 0 21") increments by about 20 per line instead of about 60, in line with taking 3x as long.

 

I've updated my app_config to run GPU tasks only 1 at a time, and will post an update later tonight (or tomorrow) after several have had time to complete.
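
For anyone who wants to do the same, it's just the <gpu_usage> value in app_config.xml in the Einstein project directory.  A minimal sketch, assuming the Gamma-ray pulsar GPU app's <name> is hsgamma_FGRPB1G (check the <name> entries in client_state.xml for the exact string on your host):

    <app_config>
      <app>
        <name>hsgamma_FGRPB1G</name>
        <gpu_versions>
          <!-- fraction of a GPU per task: 1.0 = 1x, 0.5 = 2x, 0.33 = 3x -->
          <gpu_usage>1.0</gpu_usage>
          <!-- CPU budgeted per GPU task -->
          <cpu_usage>1.0</cpu_usage>
        </gpu_versions>
      </app>
    </app_config>

BOINC Manager picks the change up via Options > Read config files, so no restart is needed.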

Keith Myers
Joined: 11 Feb 11
Posts: 4699
Credit: 17541692935
RAC: 6360436

The card should work.  My friend has a couple of 3070s running both GW and GR at resource share zero, as a backup project for when GPUGrid is out of work.  GPUGrid is out of work at the moment, so they're crunching Einstein right now.

https://einsteinathome.org/host/12850601/tasks/0/40?sort=desc&order=Sent

Generally Nvidia cards don't like running multiples on a card. Best to run singles.

Also, the FP64:FP32 ratio halved moving from the 1080 to the 3070.  Pascal consumer cards are 1:32, while Ampere is only 1:64 FP64 now.

 

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3681
Credit: 33814547828
RAC: 37820370

don't run multiples on nvidia cards. one at a time is always faster. running multiples pushes reported GPU utilization to 100%, but power draw drops and you get less overall production. something about the AMD cards, or even the AMD applications, gives favorable results with multiples, but that's not the case with the Nvidia cards/apps.
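
if you want to watch it happen, something like this stock nvidia-smi query (interval in seconds; works on Windows too if nvidia-smi is on your PATH) logs utilization and power draw while you flip between 1x and 2x:

    nvidia-smi --query-gpu=utilization.gpu,power.draw --format=csv -l 5

the tell is utilization pinned at 100% while power draw sinks below what a single task pulls.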

 

I use an RTX 3070 on one of my hosts, and it runs pretty well: about 450s per Gamma-ray task (only ~150W) and about 330s per Gravitational-Wave task (only ~165W), running 1x.

 

you won't see the advertised 2x performance gain in all workloads. it seems the Einstein app isn't as heavy on FP32 as some other projects (like Folding@home), so the gains here are mostly in efficiency. from my tests, it's anywhere from 5-20% more power efficient than Turing. and the 3070 will be better for efficiency than the 3080 or 3090, because their higher-end GDDR6X memory uses more power but doesn't meaningfully contribute to crunch times.
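
putting rough numbers on the efficiency point, straight from the runtimes and power draws above:

    Gamma-ray: 150 W x 450 s = 67.5 kJ ≈ 18.8 Wh per task
    GW:        165 W x 330 s = 54.5 kJ ≈ 15.1 Wh per task

comparing Wh/task against your previous card is the cleanest way to see the generation-over-generation gain.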

_________________________________________________________________________

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7023004931
RAC: 1834087

Ian&Steve C. wrote:
don't run multiples on nvidia cards. one at a time is always faster.

Keith Myers wrote:
Generally Nvidia cards don't like running multiples on a card. Best to run singles.

Care to scope that a little?  I only converted from Nvidia to AMD starting just under two years ago.  All my experience is at Einstein, which two years ago meant I was running only Gamma-Ray Pulsar GPU work.

For generations before I switched to AMD I ran multiplicity testing on new Nvidia cards, and always observed better performance at 2X than 1X.  The advantage shrank a little with the newer generations.

Was there some generation at which you observed the long-standing advantage finally to reverse?

Or are you making a blanket statement here on an Einstein forum which you primarily base on observations elsewhere?

Do you hold this true for both Gamma-Ray Pulsar and Gravity Wave?

I am genuinely curious, not trying to talk back to you.

DanNeely
Joined: 4 Sep 05
Posts: 1364
Credit: 3562358667
RAC: 1580

Running 1x I've had a bit more than 2 dozen successfully complete in ~500 seconds each, although they might've gotten a boost from my not adjusting compute settings to put 2 more CPU tasks on the freed cores.  I'll be running another batch overnight to see if that makes a difference or not, and then probably run a 2x batch the next day to see where they end up standing.

 

I'd need to pull the numbers into a spreadsheet, but if I was getting any speedup at 3x vs 1x it's definitely very small now.  That's a big difference from the last time I did this a few years ago, when I saw a big gain at 2x (30-50%?) and maybe 5% more at 3x.
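
Back-of-envelope from the numbers so far:

    3x: 3 tasks / ~1500 s = ~7.2 tasks/hour
    1x: 1 task / ~500 s = ~7.2 tasks/hour

Effectively identical throughput, which is why any 3x advantage would have to be tiny.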

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3681
Credit: 33814547828
RAC: 37820370

archae86 wrote:

Ian&Steve C. wrote:
don't run multiples on nvidia cards. one at a time is always faster.

Keith Myers wrote:
Generally Nvidia cards don't like running multiples on a card. Best to run singles.

Care to scope that a little?  I only converted from Nvidia to AMD starting just under two years ago.  All my experience is at Einstein, which two years ago meant I was running only Gamma-Ray Pulsar GPU work.

For generations before I switched to AMD I ran multiplicity testing on new Nvidia cards, and always observed better performance at 2X than 1X.  The advantage shrank a little with the newer generations.

Was there some generation at which you observed the long-standing advantage finally to reverse?

Or are you making a blanket statement here on an Einstein forum which you primarily base on observations elsewhere?

Do you hold this true for both Gamma-Ray Pulsar and Gravity Wave?

I am genuinely curious, not trying to talk back to you.

 

based on my own extensive testing with both GR and GW apps across a half dozen different models of nvidia GPUs.

i've tried this on:

RTX 2070, RTX 2080, RTX 2080ti, GTX 1660 Super, GTX 1650, and finally the RTX 3070. I always leave 2-4 CPU threads free, not processing, to ensure no CPU bottlenecks.

every time I've tried (I retest occasionally to see if anything has changed with different datasets) the result has been the same: less overall production (task runtimes more than double at a 2x multiple).

so yeah, it's directly based on my own experience with Einstein apps, not from somewhere else.

_________________________________________________________________________

Keith Myers
Joined: 11 Feb 11
Posts: 4699
Credit: 17541692935
RAC: 6360436

Quote:
Was there some generation at which you observed the long-standing advantage finally to reverse?

I think you have surmised the problem well.  I ran doubles on Einstein on the BRP4G and GR tasks on Maxwell and Pascal with no issues.

The problem with multiplicity started with Turing.  If I were willing to write an explicit, complicated app_info and app_config just for the 1080 Ti cards in my multi-card, multi-generational hosts, I would still see the advantage of running doubles on GR.  Not sure about GW, though.

So, my opinion is that the Turing generation is where the advantage ended with multiplicity.
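
For what it's worth, part of the per-card plumbing doesn't need anonymous platform: the client's cc_config.xml supports <exclude_gpu>, which keeps a project (or a single app) off a given device.  A sketch, with the device number and app name as placeholders for illustration:

    <cc_config>
      <options>
        <exclude_gpu>
          <url>https://einsteinathome.org/</url>
          <!-- device number as reported in the event log at startup -->
          <device_num>1</device_num>
          <!-- optional: limit the exclusion to one app -->
          <app>hsgamma_FGRPB1G</app>
        </exclude_gpu>
      </options>
    </cc_config>

The multiplicity itself would still have to come from app_info/app_config, which is the complicated part.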

 

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 536500999
RAC: 189238

I also used to run 2x BRP on my GTX 1070, which 2 years ago gave a throughput advantage. I recently re-evaluated this for GR tasks and found 1x to be slightly more productive than 2x. For the current GW O2 tasks, 1x throughput is also a bit better than 2x.

So I think it's the app rather than the GPU generation. The old BRP app had 70% - 80% GPU utilization running 1x, which could be improved to close to 100% by running multiple tasks. The current GR app already runs at very high utilization, so there's not much to be gained. And GW O2 seems to be heavily GPU-memory-bandwidth starved (~70% utilization), so adding tasks doesn't help.

MrS

Scanning for our furry friends since Jan 2002

DanNeely
Joined: 4 Sep 05
Posts: 1364
Credit: 3562358667
RAC: 1580

Overnight observations: adding enough CPU tasks to bring my cores back to 100% load bumped the minimum runtime from 8:14 to 8:24.  Tasks run while I was using the system normally this morning (basic web browsing) came in in the low 8:30s.  Two tasks run while I was watching YouTube last night came in around 10 minutes each.

Tasks didn't stall overnight.  I'm going to keep running 1x for another 10-12 hours before trying 2x to see what happens there.  However...

GPU load from a single task sits around 90% for the main computational phase, with low-load startup/shutdown activity taking about 8-10s; that fits well with there no longer being much scope for parallel-task speedup.

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 536500999
RAC: 189238

Good that you got it working, but the performance is underwhelming indeed. For the GR tasks you average around 510s, whereas my GTX 1070 does them in 810 - 830s. If I just scale your number by the ratio of memory bandwidths I get: 510s * 14 GHz / 8.8 GHz = 811s. Almost a perfect match! So it seems GR tasks can't benefit from the enhanced compute capability of Ampere at all.

And your GW-O2 tasks are extremely slow: 40,000 - 80,000s vs. 600 - 700s on my GPU (running 1 concurrent task).

MrS

Scanning for our furry friends since Jan 2002

DanNeely
Joined: 4 Sep 05
Posts: 1364
Credit: 3562358667
RAC: 1580

I tried running at 2x and got times from 16:10 to 16:50, so virtually no net speedup; but after it stalled twice in two hours I quickly cut that experiment short.
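
For the record: two concurrent tasks finishing in 16:10-16:50 works out to an effective 8:05-8:25 per task, vs 8:24-8:30 at 1x, so a couple percent at best.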

 

My GW tasks are all CPU, so of course they're far slower than the corresponding GPU tasks; I might try running a few of the GPU flavor later just to see what I get.
