Improvements in the code of the clients

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4267
Credit: 244932206
RAC: 16344

Hm. twiddle_dee is not in our own code; there is possibly a twiddle_dee in the clFFT library. That library is linked, but it should not be used in the GW App, at least not on NVidia.

BM

Bernd Machenschalk

In clFFT there is

 ./src/library/generator.stockham.h:209: inline std::string TwTableLargeName()
 ./src/library/generator.stockham.h-210- {
 ./src/library/generator.stockham.h-211- return "twiddle_dee";
 ./src/library/generator.stockham.h-212- }

...

./src/library/generator.stockham.h-282- ss << "\n __constant ";
./src/library/generator.stockham.h-283- ss << RegBaseType<PR>(2);
./src/library/generator.stockham.h:284: ss << " " << TwTableLargeName();

I'll patch that.
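
As a rough illustration of what such a patch touches, the sketch below mimics the quoted generator lines in standalone C++. It is not the clFFT source: emitTwTableDecl and its parameters are hypothetical, and the element type "float2" is only an assumed example of what RegBaseType<PR>(2) might produce. The point is that the address-space qualifier is plain text written into the generated kernel source, so switching it is a small change in the generator.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Stand-in for the clFFT helper quoted above (same name, same return value).
inline std::string TwTableLargeName() { return "twiddle_dee"; }

// Hypothetical helper, not part of clFFT: emits the start of the twiddle
// table declaration. The quoted source hard-codes "__constant"; here the
// address space is a parameter so that a "__global" variant can be emitted.
// elementType stands in for RegBaseType<PR>(2) and is an assumption.
std::string emitTwTableDecl(const std::string& addressSpace,
                            const std::string& elementType) {
    assert(!addressSpace.empty());
    std::stringstream ss;
    ss << "\n " << addressSpace << " ";
    ss << elementType;
    ss << " " << TwTableLargeName();
    return ss.str();
}
```

Calling emitTwTableDecl("__global", "float2") yields the same declaration prefix as the original, with only the qualifier changed.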

BM

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3681
Credit: 33843529248
RAC: 36814791

Just be aware of the consequences of this change with regard to supported OpenCL features. Defining a table in this way is only supported in OpenCL 2.0 and greater, and I think it's safe to assume that a large portion of the user base is using drivers that only support OpenCL 1.2.

You might need to gatekeep the application from hosts that have incompatible drivers to avoid mass errors. This applies to both AMD and NVIDIA.
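
One way such a gate could look on the host side is sketched below. It assumes the version string has the standard CL_DEVICE_VERSION layout ("OpenCL <major>.<minor> <vendor-specific text>"); the function name supportsOpenCL is made up for this example, and the project's actual plan-class check is not shown in this thread.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: returns true if a CL_DEVICE_VERSION-style string
// reports at least the requested OpenCL version. Strings that do not match
// the "OpenCL <major>.<minor> ..." layout are treated as unsupported.
bool supportsOpenCL(const std::string& deviceVersion,
                    int wantMajor, int wantMinor) {
    assert(wantMajor >= 1);
    int major = 0;
    int minor = 0;
    if (std::sscanf(deviceVersion.c_str(), "OpenCL %d.%d", &major, &minor) != 2) {
        return false;
    }
    return major > wantMajor || (major == wantMajor && minor >= wantMinor);
}
```

With this, supportsOpenCL("OpenCL 1.2 CUDA", 2, 0) is false, so a host reporting only OpenCL 1.2 would not be sent the patched application.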

Bernd Machenschalk

There is now a 1.25 FGRP App (Beta test) that should have the clFFT patched in the suggested way. For now this is restricted to NVidia Pascal & up (compute capability >= 6.0) and OpenCL 2.0 in the respective plan class. The app is available for all three major platforms (Windows, Linux, OSX).

BM

Ian&Steve C.

I tested out the app, but it's not really any different in run speed or behavior. I can see __global in the hex editor now with twiddle_dee. Applying my patch on top of this new 1.25 app brings the runtime back to being fast again.

There may be some other changes that petri has made besides this that are complementary.

Bernd Machenschalk

I would be happy to receive this patch. As long as it's OpenCL, we should be able to incorporate it in the App.

BM

Ian&Steve C.

Bernd Machenschalk wrote:

I would be happy to receive this patch. As long as it's OpenCL, we should be able to incorporate it in the App.

I've PMed you a link to the code and instructions/info.

Bernd Machenschalk

Thanks, got it. Regarding the GPU code, this only changes the type of twiddle_dee and adds some options to the OpenCL compiler, in particular to use OpenCL 2.0 (*). I built app version 1.26 with that, please give it a try. It's the same plan class as 1.25, so the same restrictions apply.

(*) The other stuff in there just puts the CPU to sleep while the GPU is running; there is some other method for that implemented in our app.
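
For readers unfamiliar with how an OpenCL language version is requested: the OpenCL compiler accepts a -cl-std= build option, passed in the options string of clBuildProgram. The exact options used for app version 1.26 are not given in this thread, so the following is only a minimal sketch with a hypothetical helper name.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: composes the build option that selects the OpenCL C
// language version. The returned string would be passed as the `options`
// argument of clBuildProgram; any further options are not shown here
// because the thread does not list them.
std::string openclStdOption(int major, int minor) {
    assert(major >= 1 && minor >= 0);
    return "-cl-std=CL" + std::to_string(major) + "." + std::to_string(minor);
}
```

openclStdOption(2, 0) produces "-cl-std=CL2.0", the form needed to compile kernels that rely on OpenCL 2.0 features.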

BM

Ian&Steve C.

Bernd Machenschalk wrote:

Thanks, got it. Regarding the GPU code, this only changes the type of twiddle_dee and adds some options to the OpenCL compiler, in particular to use OpenCL 2.0 (*). I built app version 1.26 with that, please give it a try. It's the same plan class as 1.25, so the same restrictions apply.

(*) The other stuff in there just puts the CPU to sleep while the GPU is running; there is some other method for that implemented in our app.

Hi Bernd, I tried 1.26 and I see the same performance as 1.25. Did you add the other conditions from my follow-up PM last night?

petri33 wrote:

First difference: a kernel that begins like this:

__attribute__(( reqd_work_group_size( 16, 16, 1 ) ))
kernel void
transpose_gcn_tw_fwd( global float2* restrict pmComplexIn, global float2* restrict pmComplexOut )
{
   const Tile localIndex = { get_local_id( 0 ), get_local_id( 1 ) };
   const Tile localExtent = { get_local_size( 0 ), get_local_size( 1 ) };

...

   local float2 lds[ 64 ][ 64 ];

should be changed to

   local float2 lds[ 64 ][ 65 ];

It uses a bit more memory, but the access alignment becomes better and the number of cache line conflicts is reduced.
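
The effect of the extra column can be illustrated with a small host-side model. This is a sketch under stated assumptions, not a measurement: it assumes local memory is split into 32 banks of 4 bytes each (typical for NVIDIA GPUs), that a float2 occupies two consecutive 4-byte words, and that 16 work-items read the same column of the tile at once, as a transpose usually does on one side. The function name is made up for this example.

```cpp
#include <cassert>
#include <set>

// Assumed hardware model (not taken from the thread).
constexpr int kBanks = 32;        // number of local-memory banks
constexpr int kWordsPerElem = 2;  // float2 = two 4-byte words

// Counts how many distinct banks are hit when `rows` work-items each read
// element [r][col] of a row-major tile with `rowWidth` float2 per row.
// Only the bank of the first word of each element is considered.
// A small result means many work-items collide on the same bank.
int distinctBanksForColumn(int rowWidth, int rows, int col) {
    assert(rowWidth > 0 && rows > 0 && col >= 0);
    std::set<int> banks;
    for (int r = 0; r < rows; ++r) {
        const int word = (r * rowWidth + col) * kWordsPerElem;
        banks.insert(word % kBanks);
    }
    return static_cast<int>(banks.size());
}
```

Under this model, distinctBanksForColumn(64, 16, 0) gives 1 (all 16 reads land in the same bank), while distinctBanksForColumn(65, 16, 0) gives 16 (every read lands in a different bank). That is the usual reasoning behind padding a 64-wide tile to 65.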

petri33 wrote:

The same thing applies to the kernel called transpose_gcn_tw_back

and both of those gcn_tw kernels can have the line:

const Tile localExtent = { get_local_size( 0 ), get_local_size( 1 ) };

changed to

const Tile localExtent = { 16,  16 };

to help the compiler generate better code at compile time. (16, 16) is the required work-group size, so it can be hard-coded.

That should give the final speedup.

Bernd Machenschalk

Thanks, probably missed that. I'll take another look next week.

BM
