use OpenCL 2.0 for shared memory in APUs?

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 536670990
RAC: 185309
Topic 197900

Dear Einstein developers,

The new Intel GPUs in Broadwell are starting to arrive, complete with support for OpenCL 2.0. It will take months for them to reach meaningful numbers, but together with AMD APUs (which should also support it in due time, if they don't already), this may be a good time to look into possible benefits.

I know Einstein needs to stream lots of data to the GPU. I don't know exactly how this is done on current shared-memory GPUs - is the data copied from main memory into a private section of main memory reserved for the GPU? If so, Einstein could benefit tremendously from the shared memory in the new API, since the data is already sitting in main memory.

If this doesn't help Einstein any further, I'm glad, as it means the current app is already nicely optimized :)

MrS

Scanning for our furry friends since Jan 2002

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 686179252
RAC: 550603

use OpenCL 2.0 for shared memory in APUs?

Sorry for the late reply.

This is a valid point: our current GPU apps were indeed written with dedicated GPUs attached over PCIe in mind. Memory from the host's address space is copied to the GPU's memory and vice versa, with no attempt to detect or separately handle the situation where GPU and CPU share the same physical memory.
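Schematically, the current pattern and a shared-memory alternative on an APU look roughly like the two functions below. This is only a sketch with made-up names, not code from our app; context/queue setup and error checking are left out, and whether the second variant is truly zero-copy depends on the driver and on the alignment of the host pointer.

    #include <CL/cl.h>

    /* Current pattern: the input is copied into a buffer owned by the GPU,
       even if CPU and GPU happen to share the same physical RAM. */
    cl_mem upload_with_copy(cl_context ctx, cl_command_queue q,
                            const float *host_data, size_t nbytes)
    {
        cl_int err;
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY, nbytes, NULL, &err);
        clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, nbytes, host_data,
                             0, NULL, NULL);
        return buf;
    }

    /* Shared-memory alternative: the buffer wraps the existing host
       allocation, so an integrated GPU can read it in place and no second
       copy is made in main memory. */
    cl_mem upload_zero_copy(cl_context ctx, float *host_data, size_t nbytes)
    {
        cl_int err;
        return clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                              nbytes, host_data, &err);
    }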

Currently I'm a bit skeptical that an effort to change this is really worth it. I have another optimization idea that would further decrease the amount of memory being copied, and if we find that feasible and worthwhile, the remaining advantage from the special handling of shared memory might be even less significant.

But again, it's a valid point, thanks for bringing it up.

Cheers
HB

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 536670990
RAC: 185309

Thanks for your reply, HB.

Thanks for your reply, HB. Any optimization reducing the bandwidth requirements would be very helpful, as it would apply to all kinds of supported processors. Whether it will be enough to ease the pressure on the APU memory subsystem - who knows? The current AMD Kaveris with 512 shaders are already badly bandwidth-limited in regular games, where they're hardly faster than models with 384 shaders. And Intel's Skylake is supposed to top Broadwell's 48 shaders with 72 in the top model - at least enough of a speed-up to eat up any benefit from the switch to DDR4. Manufacturers also like to outfit notebooks with relatively beefy i5 & i7 CPUs and single-channel memory.

Anyway, regarding the implementation:
Around CUDA 6 they introduced a unified address space ("managed" memory) on the software side. The idea is that the programmer no longer has to care where memory physically lives and benefits from simpler programming. This doesn't help existing CUDA apps with manual memory management, and it cannot make discrete GPUs any faster.
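If I remember the pattern correctly, the host side boils down to something like this (just a sketch with made-up names; kernel launch and error checking left out):

    #include <cuda_runtime.h>
    #include <stddef.h>

    /* Managed memory: one pointer that is valid on both the CPU and the
       GPU; the runtime migrates the data as needed, so the explicit
       cudaMemcpy calls disappear. */
    float *alloc_shared(size_t n)
    {
        float *data = NULL;
        cudaMallocManaged((void **)&data, n * sizeof(float),
                          cudaMemAttachGlobal);
        return data;  /* fill it on the CPU, hand the same pointer to a
                         kernel, call cudaDeviceSynchronize() before
                         reading the results back on the CPU */
    }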

But if something similar is possible under OpenCL, it might be possible to use it to handle discrete GPUs and APUs with the same code path. The "APU detection" would be passed on to the OpenCL driver, so to speak. I admit I don't know much about the current technical possibilities, though, or whether this would still perform as well as the current app on discrete GPUs.

MrS

Scanning for our furry friends since Jan 2002

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 686179252
RAC: 550603

RE: Anyway, regarding the

Quote:


Anyway, regarding the implementation:
Around CUDA 6 they introduced a unified address space ("managed" memory) on the software side. The idea is that the programmer no longer has to care where memory physically lives and benefits from simpler programming. This doesn't help existing CUDA apps with manual memory management, and it cannot make discrete GPUs any faster.

But if something similar is possible under OpenCL, it might be possible to use it to handle discrete GPUs and APUs with the same code path. The "APU detection" would be passed on to the OpenCL driver, so to speak. I admit I don't know much about the current technical possibilities, though, or whether this would still perform as well as the current app on discrete GPUs.

MrS

Yup, something like this is possible in OpenCL.
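With OpenCL 2.0 the idea would be shared virtual memory (SVM): one allocation that host and device address through the same pointer, with the runtime deciding whether anything actually has to be moved - nothing on an APU, a transfer behind the scenes on a discrete card. Roughly like this, as a sketch only, with made-up names, no error handling, and certainly not code from our app:

    #include <CL/cl.h>
    #include <string.h>

    /* One code path for APUs and discrete GPUs via coarse-grained SVM
       (OpenCL 2.0).  Context, queue and kernel setup are omitted. */
    void run_on_svm(cl_context ctx, cl_command_queue q, cl_kernel kernel,
                    const float *input, size_t nbytes)
    {
        float *svm = (float *)clSVMAlloc(ctx, CL_MEM_READ_WRITE, nbytes, 0);

        /* With coarse-grained SVM, host access is bracketed by map/unmap;
           ideally the data would be produced directly into this allocation
           so that even this memcpy goes away. */
        clEnqueueSVMMap(q, CL_TRUE, CL_MAP_WRITE, svm, nbytes, 0, NULL, NULL);
        memcpy(svm, input, nbytes);
        clEnqueueSVMUnmap(q, svm, 0, NULL, NULL);

        /* The kernel receives the raw pointer instead of a cl_mem handle. */
        clSetKernelArgSVMPointer(kernel, 0, svm);
        size_t gsize = nbytes / sizeof(float);
        clEnqueueNDRangeKernel(q, kernel, 1, NULL, &gsize, NULL,
                               0, NULL, NULL);
        clFinish(q);

        clSVMFree(ctx, svm);
    }

Whether that path would be as fast as our explicit buffer handling on discrete cards is exactly what would have to be measured.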

I think I'm now close to getting a test version out (first as a CUDA app version, I guess) that will feature the newest optimizations to conserve PCIe bandwidth. If the results are anywhere near where I hope they will be, this should already make APU owners (and everyone else) happy ;-). Testing should start in parallel with the new BRP6 run, perhaps as early as next week.

HB

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 536670990
RAC: 185309

From the results I see from

From the results I see for the new nVidia and AMD app, this should be just what the doctor ordered for iGPUs as well! The gains on PCIe bandwidth-constrained systems are spectacular, so on an iGPU a significant amount of main-memory bandwidth should be saved.

Could I help in beta-testing this on an HD4000? (I haven't been following Albert at all)

MrS

Scanning for our furry friends since Jan 2002

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 536670990
RAC: 185309

RE: RE: unified address

Quote:
Quote:

unified address space... it might be possible to use it to handle discrete GPUs and APUs with the same code path. The "APU detection" would be passed on to the OpenCL driver, so to speak.

Yup, something like this is possible in OpenCL.


Now that the dust around the great new app has settled, could we give this a try? The benefit wouldn't be as great as with the old app, but I suspect powerful APUs (the bigger AMDs and the upcoming Intels) would still gain something.

MrS

Scanning for our furry friends since Jan 2002
