RTX 3070 initial impressions

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 545,204,030
RAC: 188,379

Well, 60-70% would not limit if the load was continuous. If there are bursts of 100% / high load and then periods of lower load, it can still limit, as the percentage shown is an average of this. From my experience, at 20-30% load the memory speed hardly matters (GPU-Grid), whereas at 40-50% it can already have a significant impact (GR), and for 60-70% (GW) I expect it to matter a lot. Sorry for forgetting about DanNeely's downclock experiment, though!
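
To illustrate why an averaged load figure can hide saturation, here is a minimal sketch with made-up numbers. The sample traces and the helper name `summarize_load` are hypothetical, not measurements from any card in this thread.

```python
def summarize_load(samples, saturation=100):
    """Return (average load in %, fraction of samples at or above saturation)."""
    average = sum(samples) / len(samples)
    saturated = sum(1 for s in samples if s >= saturation) / len(samples)
    return average, saturated


# Hypothetical traces, one sample per time slice; both average 65%.
steady = [65] * 10
bursty = [100] * 5 + [30] * 5

for name, trace in (("steady", steady), ("bursty", bursty)):
    avg, sat = summarize_load(trace)
    print(f"{name}: average {avg:.0f}%, saturated {sat:.0%} of the time")
```

Both traces would show up as the same percentage in a monitoring tool, but only the bursty one spends time pinned at the limit.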

Anyway, I tried the downclock myself just now. The card is currently running GR tasks, which cause ~50% memory controller load. My CPU is a Ryzen 5900X at 4.0 GHz with DDR4-3600 and tight timings; the OS is an updated Win 10 with current drivers - nothing special on the software side, I would say.

The runtimes for the last couple of GR tasks (1 concurrent WU):

GPU core 1.86 GHz, Mem 8.80 GHz: 821 s (8 WU average)

GPU core 1.86 GHz, Mem 8.00 GHz: 865 s (2 WU average)

GPU core 1.52 GHz, Mem 8.80 GHz: 955 s (6 WU average)

So a 10% memory downclock yielded a 5% throughput deficit. That's at 50% average memory controller load, so I would say it's definitely starting to limit. On the other hand, a 22% GPU downclock reduced throughput by 16%, which is still good scaling but not perfect (="starting to limit"). For GW tasks the same GPU core clock speed change made virtually no difference. There it's harder to measure, though, as the runtimes vary more.
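
The arithmetic behind these figures can be sketched as follows. This is only an illustration using the runtimes quoted above. The function name `scaling` and the "sensitivity" ratio are assumptions introduced here, with throughput taken as 1 / runtime.

```python
def scaling(clock_base, clock_new, runtime_base, runtime_new):
    """Compare a clock reduction with the resulting throughput reduction.

    Throughput is taken as 1 / runtime, so the throughput ratio is
    runtime_base / runtime_new. Returns (clock drop, throughput drop,
    throughput drop per unit of clock drop).
    """
    clock_drop = 1 - clock_new / clock_base
    throughput_drop = 1 - runtime_base / runtime_new
    return clock_drop, throughput_drop, throughput_drop / clock_drop


# Runtimes from the post (seconds, averaged over a few WUs each).
mem = scaling(8.80, 8.00, 821, 865)
core = scaling(1.86, 1.52, 821, 955)

for name, (clock, thr, sens) in (("memory", mem), ("core", core)):
    print(f"{name}: clock -{clock:.1%}, throughput -{thr:.1%}, sensitivity {sens:.2f}")
```

A sensitivity of 1.0 would mean throughput follows that clock perfectly, and 0 would mean the clock does not matter. Measured against the faster setting, the drops are roughly 9% / 5% for memory and 18% / 14% for the core. Taking the slower setting as the base instead (1.86/1.52 and 955/821) gives about 22% and 16%, which is how the figures in the post come out.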

also for the most part you do not NEED to recompile the OpenCL app to "use" the extra cores. it's not like half of the cores are sitting idle or anything. that kind of stuff happens in the scheduler internal to the GPU

In the GPU-Grid forums someone posted guidelines from nVidia for using them under CUDA. They were very clear about having to recompile the app, targeting the newest CUDA compute capability, in order to use those new FP32 units. This would not be needed if it was just the GPU internal scheduler.

Do you think that the driver has some specific, project aware implementation?

Of course not. But it certainly knows the hardware specifics. OpenCL devices can differ in such fundamental things as how many "threads" are within one "warp", using nvidia terms. As far as I understand, such details are not something the OpenCL programmer should worry about (1), but "someone" has to take care of it. And it can't be the compiler creating the intermediate code, as it doesn't know yet on which hardware it's going to run.

OK, it was time to look this up instead of guessing around here:

https://en.wikipedia.org/wiki/OpenCL

Programs in the OpenCL language are intended to be compiled at run-time, so that OpenCL-using applications are portable between implementations for various host devices.

In order to open the OpenCL programming model to other languages or to protect the kernel source from inspection, the Standard Portable Intermediate Representation (SPIR)[16] can be used as a target-independent way to ship kernels between a front-end compiler and the OpenCL back-end.

So there is static compilation into the intermediate SPIR format and later on just-in-time compilation for the actual hardware it's being run on. This is the job I thought the driver would do.

(1) I think the CUDA programmer would need to worry about it if nVidia had not kept this number constant over many architectures.

MrS

Scanning for our furry friends since Jan 2002

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3,748
Credit: 35,690,142,761
RAC: 37,694,670

those posts on GPUGRID were/are speculation (mine included). the app needs to be recompiled because right now Ampere cards are locked out. computation doesn't even attempt to run because it sees an invalid gpu architecture and errors immediately.

 

recompiling the app is more to get Ampere cards unlocked than anything else. not really needed to use all the FP cores, that's internal to the GPU scheduler. you'll get better optimization with a CUDA 11.1+ app, but older cuda apps can still run if they are compiled in a way that isn't reliant on the gpu architecture. I think some of the Primegrid apps were like this, and you had users claiming to run the PPS-Sieve tasks, which is claimed to be a CUDA app and was compiled in 2019 before these new cards were even released. but I can't say for sure since it looks like they use 'nvidia' and 'cuda' synonymously in their app names (which is confusing if true). but GPUGRID apps aren't like this, and need to be recompiled with the proper flags for Ampere support. likely no significant changes need to be made to the source code. just recompiled with the CUDA 11.1 (or 11.2 now that it's out) toolkit. but again, that's the case with CUDA apps. OpenCL apps are much more versatile and work on a wider variety of hardware (which is why most projects use OpenCL), but are always less optimized for nvidia hardware.

 

there are lots of programs and benchmarks that can utilize the new cards. but depending on the ratio of FP to INT calculations being done, you might not see the advertised speedups. it really does vary based on workload. workloads heavy on FP see the most benefit, since there's literally 2x FP cores now, but workloads with significant amount of INT will not see nearly the same benefit because those "extra" FP cores are stuck doing INT.  
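
As a rough illustration of that FP/INT dependence, here is a toy throughput model. It assumes 16 FP32 plus 16 INT32 lanes per SM partition for the older design and 16 FP32 plus 16 lanes that can do either FP32 or INT32 for Ampere. It ignores memory, scheduling and everything else, and the function names are made up for this sketch.

```python
def relative_time(int_fraction, fp_lanes, int_lanes, total_lanes):
    """Lower bound on issue time for one unit of work in a toy lane model.

    fp_lanes / int_lanes are the lanes able to execute FP32 / INT32;
    total_lanes caps how many instructions can issue per cycle overall.
    """
    fp_fraction = 1.0 - int_fraction
    return max(1.0 / total_lanes, fp_fraction / fp_lanes, int_fraction / int_lanes)


def turing_like(int_fraction):
    # Assumed layout: 16 FP32-only lanes + 16 INT32-only lanes.
    return relative_time(int_fraction, fp_lanes=16, int_lanes=16, total_lanes=32)


def ampere_like(int_fraction):
    # Assumed layout: 16 FP32-only lanes + 16 lanes doing FP32 or INT32.
    return relative_time(int_fraction, fp_lanes=32, int_lanes=16, total_lanes=32)


def speedup(int_fraction):
    return turing_like(int_fraction) / ampere_like(int_fraction)


for share in (0.0, 0.25, 0.5):
    print(f"INT share {share:.0%}: modelled speedup {speedup(share):.2f}x")
```

In this model the full 2x only appears for pure FP32 work, and the advantage shrinks to nothing once half of the instructions are INT32.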

 

when compiling nvidia CUDA apps, the compiler does know what hardware it's being setup for. those are available flags that can be used. you basically use a flag that's associated with the CC of the cards you want supported. when I was compiling apps for SETI you had to do exactly that, not only point to the right cuda toolkit library, but also define the gencodes for supported hardware in the makefiles or config files. you can compile a CUDA 10 app, then forget to add in Maxwell support, or even a CUDA 11.1 app and forget to add in Ampere support. it all gets defined in compilation.
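
For illustration, the gencode options mentioned here have the form `-gencode arch=compute_XX,code=sm_XX` on the nvcc command line. The small helper below just assembles such options. The helper name `gencode_flags` and the file names `app` and `app.cu` are hypothetical, and a real makefile would carry more options.

```python
def gencode_flags(compute_capabilities):
    """Build nvcc -gencode options for the given compute capabilities.

    Each entry is a string such as "5.3" (Jetson Nano) or "8.6" (RTX 3070).
    """
    flags = []
    for cc in compute_capabilities:
        arch = cc.replace(".", "")
        flags += ["-gencode", f"arch=compute_{arch},code=sm_{arch}"]
    return flags


# Hypothetical build that targets Turing (7.5) and Ampere (8.6) cards.
command = ["nvcc"] + gencode_flags(["7.5", "8.6"]) + ["-o", "app", "app.cu"]
print(" ".join(command))
```

Leaving a generation out of that list is what "forgetting to add support" amounts to: the binary then carries no machine code for those GPUs.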

_________________________________________________________________________

Keith Myers
Joined: 11 Feb 11
Posts: 4,779
Credit: 17,802,927,848
RAC: 3,985,070

Quote:
when compiling nvidia CUDA apps, the compiler does know what hardware it's being setup for. those are available flags that can be used. you basically use a flag that's associated with the CC of the cards you want supported. when I was compiling apps for SETI you had to do exactly that, not only point to the right cuda toolkit library, but also define the gencodes for supported hardware in the makefiles or config files. you can compile a CUDA 10 app, then forget to add in Maxwell support, or even a CUDA 11.1 app and forget to add in Ampere support. it all gets defined in compilation.

Exactly. I had to use a specific gencode=53 for the Maxwell GPU in my Nvidia Jetson Nano SBC to compile the CPU BRP4 application source code for the GPU, and remove all the other gencodes for the other generations and GPU models, to get the application to run without producing errors from looking for normal hardware.

 

 

mmonnin
Joined: 29 May 16
Posts: 291
Credit: 3,232,287,015
RAC: 8,477

If there's one BOINC project that responds well to GPU mem OC, it's E@H.

And GPUGrid has historically been one of the slowest projects to add support for new NV generations.

DanNeely
Joined: 4 Sep 05
Posts: 1,364
Credit: 3,562,358,667
RAC: 8

I ran a batch of GW-GPU tasks overnight.  Run times ranged from 11-23 minutes, but it was a very bi-modal distribution with a large group around 21 minutes and a small one at 12m.

 

GPU loads ranged from 25-55%, so these tasks are definitely still being heavily bottlenecked by my CPU.

 

I haven't tried running 2x. Would this be worth doing? I haven't run GW on my GPUs for a few years (since the initial app came out), and honestly don't remember anything about how multiple tasks performed then, other than that even running 3 or 4 wasn't enough to get the GPU itself fully loaded.

 

All of my tasks appear to be waiting for quorum partners to return, so I can't say anything about whether they ran successfully or not.

Keith Myers
Joined: 11 Feb 11
Posts: 4,779
Credit: 17,802,927,848
RAC: 3,985,070

What you observed in the

[deleted]

 

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3,748
Credit: 35,690,142,761
RAC: 37,694,670

well since this new 3001 gamma ray data set is quite unoptimized for the nvidia app, it now makes sense to run 2x on fast nvidia cards (RTX 2080 or faster). the slow processing is due to unoptimized code/logic, instead of the CPU bottleneck seen on the GW tasks. You should still only run 1x on GW/nvidia.

my RTX 2080tis and my RTX 3070 show about 10% production improvement when running 2x vs just 1x with the LATeah3001L00 tasks. This benefits the new tasks only, and resends of older LATeah2049Lxx tasks will not see any improvement at 2x. Since the work being distributed now seems to be mostly these new 3001 tasks, it seems worthwhile to go to 2x at the moment. You might even squeeze a tiny bit more with 3x, but I personally feel that losing another CPU thread isn't worth the tiny improvement you might get. stick to 2x IMO, unless you're not using the CPU for anything else and have the threads to spare (I run other CPU projects)

slower GPUs see less improvement from my tests. RTX 2070 only saw a 4% improvement at 2x, and even less on the GTX 1660 Super and GTX 1650.
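
The way such a production gain is worked out can be sketched like this. The runtimes in the example are hypothetical placeholders rather than measurements from any of these cards, and the function names are made up for the sketch.

```python
def tasks_per_hour(concurrent_tasks, runtime_minutes):
    """Completed tasks per hour when each task takes runtime_minutes."""
    return concurrent_tasks * 60.0 / runtime_minutes


def gain(single_runtime, concurrent_tasks, concurrent_runtime):
    """Relative throughput change versus running one task at a time."""
    baseline = tasks_per_hour(1, single_runtime)
    return tasks_per_hour(concurrent_tasks, concurrent_runtime) / baseline - 1


# Hypothetical: 10 min per task alone, 18 min per task with two at once.
print(f"2x vs 1x: {gain(10.0, 2, 18.0):+.1%}")
```

Running 2x only pays off if the per-task runtime less than doubles. If it exactly doubles, the gain is zero and a CPU thread has been given up for nothing.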

_________________________________________________________________________

DanNeely
Joined: 4 Sep 05
Posts: 1,364
Credit: 3,562,358,667
RAC: 8

I tried running 3001 FGRP tasks on my 3070. Initial results were promising: 1.5-12m runtimes vs 6.5m for single tasks (despite the beta app supposedly being slower, these tasks are running faster than earlier datasets did for me). But after about an hour of normal running I had a task hang and massively slow its companions down. It had been running >4 hours when I noticed and aborted it, with its companions slowed to 40-45m and the GPU load at 15-25%.

 

I've had problems running multiples with the old app version on this card; and the latest build didn't change anything for it.
