BRP on GPUs: loop order

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 758
Credit: 181,604,641
RAC: 48,918
Topic 198122

When Maxwell launched with its relatively large L2 cache, we had a short but interesting discussion. The current app streams the entire data array for each operation. That is the usual approach on GPUs, which have massive memory bandwidth, a huge number of execution units with long latencies, and small caches.

For CPUs one would do it the other way around: perform several calculations on a subset of the data that fits into the cache, and only move on to the next subset once the previous block has finished.
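To make the two loop orders concrete, here is a minimal sketch in Python. It is not taken from the BRP app: the function names, the toy operations and the load-counting model are all hypothetical, and it assumes every operation is element-wise, so that a block can be processed independently of the rest of the array. Steps with dependencies across the whole array would not split this easily.

```python
# Hypothetical sketch, not BRP code: the same chain of element-wise
# operations applied in two different loop orders.

def streaming_order(data, ops):
    """GPU-style: each operation sweeps the entire array before the next starts."""
    out = list(data)
    for op in ops:                      # outer loop over operations
        for i in range(len(out)):       # inner loop over the whole array
            out[i] = op(out[i])
    return out


def blocked_order(data, ops, block_size):
    """CPU-style: apply all operations to one block, then move to the next."""
    out = list(data)
    for start in range(0, len(out), block_size):   # outer loop over blocks
        end = min(start + block_size, len(out))
        for op in ops:                              # all operations on this block
            for i in range(start, end):
                out[i] = op(out[i])
    return out


def modeled_loads(n, n_ops, block_size, cache_elements):
    """Crude model (assumption): data is either fully cached or not at all.

    Returns (streaming_loads, blocked_loads), counted in elements fetched
    from main memory.
    """
    streaming = n if n <= cache_elements else n * n_ops
    blocked = n if block_size <= cache_elements else n * n_ops
    return streaming, blocked


if __name__ == "__main__":
    ops = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
    data = list(range(10))
    # Both orders give the same result for element-wise operations.
    print(streaming_order(data, ops) == blocked_order(data, ops, block_size=4))
    print(modeled_loads(n=1000, n_ops=3, block_size=100, cache_elements=200))
```

In this toy model the blocked order fetches each element from main memory once instead of once per operation, which is the whole appeal of the idea. Whether the real pipeline can be split this way is exactly the open question.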

Currently we have a nicely optimized GPU app which uses almost all the GPU memory bandwidth it can get, and which shows strong signs of being limited by that bandwidth. With modern GPUs such as Maxwell moving to larger caches and generally focusing on keeping the execution units busy, the question arose: is the traditional scheme still the best option? We didn't pursue this thought any further, as the PCIe communication optimization had higher priority. I think it would be worth taking a further look. Apart from Maxwell, the AMD and Intel integrated GPUs could benefit especially, since they have limited bandwidth but comparatively huge caches.

What do you guys think? Obviously I haven't seen the code, so I can only speculate. But from this comfortable distance it surely sounds worth a try.

Edit: the "current working set", i.e. the number of data points the chip would have to keep in flight, would not need to fit into the cache entirely. Even if it exceeds the cache size by a factor of 3, about 1/3 of all memory operations would still be served from the cache, which should (to a first approximation) reduce the memory bandwidth requirement by 1/3.
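The first-approximation arithmetic can be written down as a small sketch. The function names are hypothetical, and the model assumes accesses are spread evenly over the working set, so the hit fraction is simply cache size divided by working-set size. A strictly sequential sweep with LRU replacement can do much worse than this, so treat it as an optimistic estimate.

```python
# Hypothetical first-order model, not a measurement: estimate how much
# main-memory traffic remains when only part of the working set fits in cache.

def cache_hit_fraction(working_set, cache_size):
    """Fraction of accesses served from cache, assuming uniform access."""
    if working_set <= 0:
        raise ValueError("working_set must be positive")
    return min(1.0, cache_size / working_set)


def remaining_bandwidth_fraction(working_set, cache_size):
    """Fraction of the original memory traffic that still goes to main memory."""
    return 1.0 - cache_hit_fraction(working_set, cache_size)


if __name__ == "__main__":
    # Working set three times the cache size, in arbitrary but equal units.
    hits = cache_hit_fraction(working_set=3.0, cache_size=1.0)
    remaining = remaining_bandwidth_fraction(working_set=3.0, cache_size=1.0)
    print(f"hit fraction: {hits:.3f}, remaining traffic: {remaining:.3f}")
```

With a working set of three cache sizes this gives a hit fraction of 1/3 and leaves 2/3 of the traffic, matching the estimate above.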

MrS

Scanning for our furry friends since Jan 2002

ExtraTerrestrial Apes

Any thoughts or work on this topic? Seeing how the new GPUs have an ever-decreasing ratio of memory bandwidth to TFLOPS (and make up for that in games by using delta color compression, which is of no help here), the benefit of relieving the memory bandwidth requirements of the BRP app is increasing.

MrS

Scanning for our furry friends since Jan 2002

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 3,935
Credit: 198,679,229
RAC: 48,651

In short: We don't have any resources to work on the BRP app any further.

As far as we are concerned, the BRP app has been developed to its full extent. In the very heterogeneous GPU environment on E@H, the benefit of further development for any particular GPU type doesn't justify the effort we would need to invest.

BM

ExtraTerrestrial Apes

Fair enough, thanks!

MrS

Scanning for our furry friends since Jan 2002
