RTX 3070 initial impressions

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 471,578,888
RAC: 348,941

Ops @ GW! Forgot there's more than GPUs here...

MrS

Scanning for our furry friends since Jan 2002

DanNeely
Joined: 4 Sep 05
Posts: 1,361
Credit: 3,182,174,270
RAC: 1,800,880

ExtraTerrestrial Apes wrote:

Good that you got it working, but the performance is underwhelming, indeed. For the GR tasks you average around 510s, whereas my GTX1070 does them in 810 - 830s. If I just scale your number by the difference in memory bandwidth I get: 510s * 14 GHz / 8.8 GHz = 811s. Almost a perfect match! So it seems like GR tasks can't benefit from the enhanced compute capability of Ampere at all.

 

As an experiment I downclocked my VRAM by 1 GHz (7%). Run times increased by 8-10 seconds (~1%), so the near-perfect scaling between memory speed and performance versus your 1070 was just a coincidence.

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 471,578,888
RAC: 348,941

Thanks for checking!

BTW: I read elsewhere that in order to use the doubled FP32 units of Ampere, the app would have to be recompiled with the new shader model as target. The apps here haven't changed in a while, have they?

MrS

Scanning for our furry friends since Jan 2002

Ian&Steve C.
Joined: 19 Jan 20
Posts: 2,046
Credit: 17,114,826,131
RAC: 40,802,610

Nvidia cards would perform better with a CUDA app, but as far as FP use, it likely depends on how the app is coded and what kinds of calculations are being done. If you have a task that requires a lot of integer calcs, then you won’t see much benefit in the new architecture since it can’t use all the FP cores if half of them are stuck doing integer work.  

_________________________________________________________________________

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 471,578,888
RAC: 348,941

You're right, but still I think the data strongly suggests the additional FP32 units are not being used because:

  • My Pascal can't do FP32 and integer concurrently (unlike Turing), so even in this case Ampere (and Turing) should have a significant advantage
  • The Einstein workload should be mainly FP32, as it has not been shown to depend significantly on FP64 performance, and integer doesn't make sense for the main workload.
  • If even CUDA apps have to be recompiled in order to take advantage of the new architecture, I wonder how this is handled in OpenCL. Maybe the driver handles it there? That would make sense. Maybe someone with enough spare time could google it.

MrS

Scanning for our furry friends since Jan 2002

Keith Myers
Joined: 11 Feb 11
Posts: 2,639
Credit: 7,930,888,739
RAC: 27,599,397

The driver can't do anything other than allow the computing calls the science app makes. The science app is the only thing that controls the computations.

As long as the OpenCL API in the driver has the required functions, the science application can call them and make use of them. The function calls to allow use of both FP32 pipelines may be in the OpenCL API, but if the science app isn't written to use them, it is a moot point.

 

Ian&Steve C.
Joined: 19 Jan 20
Posts: 2,046
Credit: 17,114,826,131
RAC: 40,802,610

Turing CAN do FP32+INT concurrently. That was one of the big selling points of the architecture. But they are dedicated pipelines. The INT cores couldn’t do FP32 or vice versa. 
 

What sets Ampere apart is that the "INT" pipeline became "INT or FP32". Meaning its cores can do one or the other, but not both at once. So if you have a computation that's mostly FP32, you get a huge boost, essentially the use of 2x the cores. But if a large portion of the load is INT, then the card won't be much faster than Turing other than some generational efficiency improvements.
 

I've verified as much: the GR app seems primarily FP32 based, and I see a ~20% efficiency boost running GR on the 3070 (vs Turing), but only about a 5% efficiency boost running GW, leading me to believe that the GW app has more integer calcs to do.
 

Integer math is (likely) being used mainly for indexing to read data from buffers, and for other bitwise operations. The scientific calculations might be mainly FP32, but moving data around is mostly INT. You need both.

_________________________________________________________________________

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 471,578,888
RAC: 348,941

Is that so, Keith? OpenCL was created specifically to run on various hardware, without the app programmer having to worry about the specific hardware implementation. But "someone" has to map this to the actual hardware, which as far as I understand would be the specific OpenCL driver.
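For reference on that mapping question: OpenCL kernels are normally shipped as source and JIT-compiled by the vendor's driver at run time (via `clBuildProgram`), so the driver can target whatever GPU is present without the app being rebuilt, unless the app ships or caches precompiled binaries. The typical host-side flow, as pseudocode:

```
program = clCreateProgramWithSource(context, kernel_source_text)
clBuildProgram(program, device)        // driver JIT-compiles for the present GPU
kernel  = clCreateKernel(program, "main_kernel")
clEnqueueNDRangeKernel(queue, kernel, ...)
```

Whether the driver's JIT actually emits code that exploits Ampere's extra FP32 datapath is a separate question from whether the app needs recompiling.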

@Ian&Steve: on my GTX1070 I get no obvious performance difference for current GW WUs, whether the card is clocked at 1.52 GHz or 1.86 GHz. There's probably a small difference, which would need serious averaging to see. The memory controller load is very high running these tasks, 60 - 70%. Thus I assume GW is limited by memory bandwidth on Pascal, Turing and Ampere, which would explain the minor difference you've observed between Turing and Ampere, rather than integer use.

One could still assume that it's the integer calculations for these memory transfers. However, this would not apply to Pascal vs. Ampere. But as said earlier in this thread, the performance between my Pascal and the Ampere mentioned here scales almost perfectly with bandwidth, which contradicts integer instructions playing a major role.

MrS

Scanning for our furry friends since Jan 2002

Ian&Steve C.
Joined: 19 Jan 20
Posts: 2,046
Credit: 17,114,826,131
RAC: 40,802,610

If the mem controller is only at 60-70%, it's not limiting anything; it's not operating at max capacity. DanNeely already debunked your mem bandwidth theory by reducing the mem clocks and seeing only a tiny runtime increase, nowhere near linear with the reduction. The fact that the GW tasks have such a high mem controller load points to more integer load IMO, since moving data around in memory is largely integer operations.

 

Also, for the most part you do not NEED to recompile the OpenCL app to "use" the extra cores; it's not like half of the cores are sitting idle or anything. That kind of scheduling happens internal to the GPU. So while OpenCL won't be the best for Nvidia GPUs just due to optimization, it will have access to all the cores; it's just that a CUDA app would be much better optimized, with less translation necessary to get the work processed. It doesn't seem like the project devs are interested in creating a CUDA app, though, or they don't have anyone capable enough to make one.

 

Your computers are hidden, so I can't see, but what system is running your GTX 1070? What CPU? CPU speed is a bottleneck with the Nvidia cards/apps on GW tasks. A fast CPU will feed the GPU better, allowing higher GPU utilization and faster run times.

Case in point: my low-end 2080 Ti paired with an overclocked 3900X runs tasks faster (with higher utilization) than my higher-end (and higher-clocked) 2080 Tis that are paired with a lower-clocked EPYC CPU.

you get better run times with more GPU utilization, and better GPU utilization scales well with CPU speed on the GW apps.

_________________________________________________________________________

Keith Myers
Joined: 11 Feb 11
Posts: 2,639
Credit: 7,930,888,739
RAC: 27,599,397

Do you think that the driver has some specific, project-aware implementation? No, it does not.

The driver's purpose is to define what hardware it runs on and to enable the function calls to the various platform components in the driver package.

The driver just has the agnostic OpenCL component included in the driver package.

The science app developer then has to write the source code for the app and then compile the application to make use of the function calls within the OpenCL component of the driver.

 
