RTX 3070 initial impressions

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 471,578,888
RAC: 348,941

Ops @ GW! Forgot there's more than GPUs here...

MrS

Scanning for our furry friends since Jan 2002

DanNeely
Joined: 4 Sep 05
Posts: 1,361
Credit: 3,182,174,270
RAC: 1,800,880

ExtraTerrestrial Apes wrote:

Good that you got it working, but the performance is underwhelming, indeed. For the GR tasks you average around 510s, whereas my GTX1070 does them in 810 - 830s. If I just scale your number by the difference in memory bandwidth I get: 510s * 14 GHz / 8.8 GHz = 811s. Almost a perfect match! So it seems like GR tasks can't benefit from the enhanced compute capability of Ampere at all.

 

As an experiment I downclocked my VRAM by 1 GHz (7%). Run times increased by 8-10 seconds (~1%), so the near-perfect scaling between memory speed and performance versus your 1070 was just a coincidence.

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 471,578,888
RAC: 348,941

Thanks for checking!

BTW: I read elsewhere that in order to use the doubled FP32 units of Ampere, the app would have to be recompiled with the new shader model as target. The apps here haven't changed in a while, have they?

MrS

Scanning for our furry friends since Jan 2002

Ian&Steve C.
Joined: 19 Jan 20
Posts: 2,046
Credit: 17,114,826,131
RAC: 40,802,610

Nvidia cards would perform better with a CUDA app, but as far as FP use, it likely depends on how the app is coded and what kinds of calculations are being done. If you have a task that requires a lot of integer calcs, then you won’t see much benefit in the new architecture since it can’t use all the FP cores if half of them are stuck doing integer work.  

_________________________________________________________________________

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 471,578,888
RAC: 348,941

You're right, but still I think the data strongly suggests the additional FP32 units are not being used because:

  • My Pascal can't do FP32 and integer concurrently (unlike Turing), so even in this case Ampere (and Turing) should have a significant advantage
  • The Einstein workload should be mainly FP32, as it has not been shown to depend significantly on FP64 performance, and integer doesn't make sense for the main workload.
  • If even CUDA apps have to be recompiled in order to take advantage of the new architecture, I wonder how this is handled in OpenCL. Maybe the driver handles it there? That would make sense. Maybe someone with enough spare time could google it.

MrS

Scanning for our furry friends since Jan 2002

Keith Myers
Joined: 11 Feb 11
Posts: 2,639
Credit: 7,930,888,739
RAC: 27,599,397

The driver can't do anything other than allow the computing calls the science app makes. The science app is the only thing that controls the computations.

As long as the OpenCL API in the driver has the required functions, the science application can call them and make use of them. The function calls to allow use of both FP32 pipelines may be in the OpenCL API, but if the science app isn't written to use them, it is a moot point.

 

Ian&Steve C.
Joined: 19 Jan 20
Posts: 2,046
Credit: 17,114,826,131
RAC: 40,802,610

Turing CAN do FP32+INT concurrently. That was one of the big selling points of the architecture. But they are dedicated pipelines. The INT cores couldn’t do FP32 or vice versa. 
 

What sets Ampere apart is that the "INT" pipeline became "INT or FP32". Meaning its cores can do one or the other, but not both at once. So if you have a computation that's mostly FP32, you get a huge boost, essentially the use of 2x the cores. But if a large portion of the load is INT, then the card won't be much faster than Turing other than some generational efficiency improvements.
 

I've verified as much: the GR app seems primarily FP32 based, and I see a ~20% efficiency boost running GR on the 3070 (vs Turing), but only about a 5% efficiency boost running GW, leading me to believe that the GW app has more integer calcs to do.
 

Integer math is (likely) being used mainly for indexing to read data from buffers, and for other bitwise operations. The scientific calculations might be mainly FP32, but moving data around is mostly INT. You need both.

_________________________________________________________________________

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 471,578,888
RAC: 348,941

Is that so, Keith? OpenCL was created specifically to run on various hardware, without the app programmer having to worry about the specific hardware implementation. But "someone" has to map this to the actual hardware, which as far as I understand would be the specific OpenCL driver.
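For reference on that mapping question: OpenCL kernels are normally shipped as source and JIT-compiled by the vendor's driver at run time (via `clBuildProgram`), so the driver can target whatever GPU is present without the app being rebuilt, unless the app ships or caches precompiled binaries. The typical host-side flow, as pseudocode:

```
program = clCreateProgramWithSource(context, kernel_source_text)
clBuildProgram(program, device)        // driver JIT-compiles for the present GPU
kernel  = clCreateKernel(program, "main_kernel")
clEnqueueNDRangeKernel(queue, kernel, ...)
```

Whether the driver's JIT actually emits code that exploits Ampere's extra FP32 datapath is a separate question from whether the app needs recompiling.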

@Ian&Steve: on my GTX1070 I get no obvious performance difference for current GW WUs, whether the card is clocked at 1.52 GHz or 1.86 GHz. There's probably a small difference, which would need serious averaging to see. The memory controller load is very high running these tasks, 60 - 70%. Thus I assume GW is limited by memory bandwidth on Pascal, Turing and Ampere, which would explain the minor difference you've observed between Turing and Ampere, rather than integer use.

One could still assume that it's the integer calculations for these memory transfers. However, this would not apply to Pascal vs. Ampere. But as said earlier in this thread, the performance between my Pascal and the Ampere mentioned here scales almost perfectly with bandwidth, which contradicts integer instructions playing a major role.

MrS

Scanning for our furry friends since Jan 2002

Ian&Steve C.
Joined: 19 Jan 20
Posts: 2,046
Credit: 17,114,826,131
RAC: 40,802,610

If the mem controller is only at 60-70%, it's not limiting anything; it's not operating at max capacity. DanNeely already debunked your mem bandwidth theory by reducing the mem clocks and seeing only a tiny runtime increase, nowhere near linear with the reduction. The fact that the GW tasks have such a high mem controller load points to more integer load IMO, since moving data around in memory is largely integer operations.

 

Also, for the most part you do not NEED to recompile the OpenCL app to "use" the extra cores; it's not like half of the cores are sitting idle or anything. That kind of scheduling happens internal to the GPU. So while OpenCL won't be the best for Nvidia GPUs just due to optimization, it will have access to all the cores; it's just that a CUDA app would be much better optimized, with less translation necessary to get the work processed. It doesn't seem like the project devs are interested in creating a CUDA app, though, or they don't have anyone capable enough to make one.

 

Your computers are hidden, so I can't see, but what system is running your GTX 1070? What CPU? CPU speed is a bottleneck with the Nvidia cards/apps on GW tasks. A fast CPU will feed the GPU better, allowing higher GPU utilization and faster run times.

Case in point: my low-end 2080 Ti paired with an overclocked 3900X runs tasks faster (with higher utilization) than my higher-end (and higher-clocked) 2080 Tis that are paired with a lower-clocked EPYC CPU.

you get better run times with more GPU utilization, and better GPU utilization scales well with CPU speed on the GW apps.

_________________________________________________________________________

Keith Myers
Joined: 11 Feb 11
Posts: 2,639
Credit: 7,930,888,739
RAC: 27,599,397

Do you think that the driver has some specific, project-aware implementation? No, it does not.

The driver's purpose is to define what hardware it runs on and to enable the function calls to the various platform components in the driver package.

The driver just has the agnostic OpenCL component included in the driver package.

The science app developer then has to write the source code for the app and then compile the application to make use of the function calls within the OpenCL component of the driver.

 
