BRP on GPUs: loop order

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 539650853
RAC: 147372
Topic 198122

When Maxwell, with its relatively large L2 cache, launched, we had a short but interesting discussion: the current app streams the entire data array for each operation. That's the usual approach for GPUs, as they have massive memory bandwidth, a massive number of execution units with long latencies, and small caches.

For CPUs one would do it the other way around: perform several calculations on a subset of the data small enough to fit into the cache, and only move on to the next block once the previous one has finished, stepping through the entire array block by block.
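To make the contrast concrete, here is a minimal sketch of the two loop orders in C++. I haven't seen the BRP code, so process() is just an invented stand-in for one of the per-sample operations; treat the whole thing as an assumption about the structure, not the real kernels:

#include <algorithm>
#include <cstddef>

// Hypothetical placeholder for one of the per-sample operations.
static float process(int op, float x) { return x * (1.0f + 0.001f * op); }

// "Streaming" order (GPU-style): each operation makes a full pass over the
// array, so the whole array crosses the memory bus once per operation.
void streaming(float* data, std::size_t n, int k_ops) {
    for (int op = 0; op < k_ops; ++op)
        for (std::size_t i = 0; i < n; ++i)
            data[i] = process(op, data[i]);
}

// "Blocked" order (CPU-style): pick a block small enough to stay in cache,
// run all operations on it, then move to the next block. Each element is
// then fetched from DRAM roughly once instead of once per operation.
void blocked(float* data, std::size_t n, int k_ops, std::size_t block) {
    for (std::size_t start = 0; start < n; start += block) {
        std::size_t end = std::min(n, start + block);
        for (int op = 0; op < k_ops; ++op)
            for (std::size_t i = start; i < end; ++i)
                data[i] = process(op, data[i]);
    }
}

A real pipeline will of course contain steps (FFTs, for example) whose data dependencies don't split up this cleanly, so take it only as a picture of the loop-order idea.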

Currently we have a nicely optimized GPU app which uses almost all the GPU memory bandwidth it can get, and which shows strong signs of being limited by that bandwidth. With modern GPUs such as Maxwell moving to larger caches and generally focusing on keeping the execution units busy, the question arose: is the traditional scheme still the best option? We didn't pursue this thought any further, as the PCIe communication optimization had higher priority. I think it would be worth giving this another look. Apart from Maxwell, the AMD and Intel integrated GPUs could benefit especially, since they have limited bandwidth but comparatively huge caches.

What do you guys think? Obviously I haven't seen the code, so I can only speculate. But from this comfortable distance it surely sounds worth a try.

Edit: the "current working set", i.e. the number of data points the chip would have to keep in flight, would not need to fit into the cache entirely. Even if it exceeded the cache size by a factor of 3, 1/3 of all memory operations would still be served from the cache, which should (to a first approximation) reduce the memory bandwidth requirement by 1/3.
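Spelled out with made-up numbers (2 MB of L2, roughly a GM204-class Maxwell, against a hypothetical 6 MB per-step working set), the first-order estimate looks like this:

#include <cstdio>

int main() {
    // Assumed numbers, for illustration only.
    double cache_mb = 2.0;        // L2 size, roughly GM204-class Maxwell
    double working_set_mb = 6.0;  // hypothetical per-step working set
    double hit_fraction = cache_mb / working_set_mb;  // ~1/3 of accesses hit the cache
    double dram_fraction = 1.0 - hit_fraction;        // ~2/3 of the original traffic
    std::printf("DRAM traffic: ~%.0f%% of the streaming case\n", dram_fraction * 100.0);
    return 0;
}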

MrS

Scanning for our furry friends since Jan 2002

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 539650853
RAC: 147372


Any thoughts or work on this topic? Seeing how new GPUs have an ever-decreasing ratio of memory bandwidth to TFLOPS (and make up for it in games by using delta color compression, which is of no help here), the benefit of relieving the BRP app's memory bandwidth requirements keeps growing.

MrS

Scanning for our furry friends since Jan 2002

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4273
Credit: 245208476
RAC: 13236


In short: we don't have the resources to work on the BRP app any further.

As far as we are concerned, the BRP app has been developed to its full extent. Given the very heterogeneous mix of GPUs on E@H, the benefit of further development for any particular GPU type doesn't justify the effort we would need to invest.

BM

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 539650853
RAC: 147372


Fair enough.. thanks!

MrS

Scanning for our furry friends since Jan 2002
