use OpenCL 2.0 for shared memory in APUs?

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 770
Credit: 583431466
RAC: 144631
Topic 197900

Dear Einstein developers,

the new Intel GPUs in Broadwell are starting to arrive, complete with support for OpenCL 2.0. It will take months for them to reach meaningful numbers, but together with AMD APUs (which should also support this in due time, maybe already now) it may be a good time to look into possible benefits.

I know Einstein needs to stream lots of data into the GPUs. I don't know exactly how this is done on current shared-memory GPUs: is the data copied from main memory into a private section of main memory that is reserved for the GPU? If so, Einstein could benefit tremendously from the shared memory support in the new API, as the data is already in main memory.

If this does not help Einstein any further I'm glad, as it means the current app is already nicely optimized :)

MrS

Scanning for our furry friends since Jan 2002

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 767624266
RAC: 1083472

Sorry for the late reply.

This is a valid point. Our current GPU apps were indeed written with dedicated GPUs attached over PCIe in mind: memory from the host's address space is copied to the GPU's memory and vice versa, with no attempt to detect or separately handle the situation where GPU and CPU share the same physical memory.

Currently I'm a bit skeptical that an effort to change this is really worth it. I have another optimization idea that would further decrease the amount of memory being copied, and if we find that feasible and worthwhile, the remaining advantage from the special handling of shared memory might be even less significant.

But again, it's a valid point, thanks for bringing it up.

Cheers
HB

ExtraTerrestrial Apes

Thanks for your reply, HB. Any optimization reducing the bandwidth requirements would be very helpful, as it would apply to all kinds of supported processors. Whether it's going to be enough to ease the pressure on the APU memory subsystem, who knows? The current AMD Kaveris with 512 shaders are already badly bandwidth limited in regular games, where they're hardly faster than units with 384 shaders. And Intel Skylake is supposed to top Broadwell's 48 shaders with 72 in the top model. That speed-up is going to be at least enough to eat up any benefit from the switch to DDR4. On top of that, manufacturers like to outfit notebooks with relatively beefy i5 & i7 CPUs and single-channel memory.
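
To put rough numbers on the bandwidth-per-shader argument, here is a back-of-the-envelope sketch. The shader counts (48 and 72) are the ones quoted above; the memory configurations (dual-channel DDR3-1600 for Broadwell, dual-channel DDR4-2133 for Skylake, 64-bit channels) are assumptions for illustration only, and the results are theoretical peaks, not measurements.

```python
def peak_bandwidth_gbs(mega_transfers_per_s, channels, bus_bytes=8):
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 10**9 bytes).

    mega_transfers_per_s: e.g. 1600 for DDR3-1600
    channels:             number of populated memory channels
    bus_bytes:            width of one channel in bytes (64 bit = 8 bytes)
    """
    return mega_transfers_per_s * bus_bytes * channels / 1000


# Assumed memory configurations (not taken from the thread).
broadwell_bw = peak_bandwidth_gbs(1600, channels=2)       # DDR3-1600, dual channel
skylake_bw = peak_bandwidth_gbs(2133, channels=2)         # DDR4-2133, dual channel
single_channel_bw = peak_bandwidth_gbs(1600, channels=1)  # typical cheap notebook

# Shader counts as quoted in the post above.
per_shader_broadwell = broadwell_bw / 48
per_shader_skylake = skylake_bw / 72

print(f"Broadwell: {broadwell_bw:.1f} GB/s, {per_shader_broadwell:.3f} GB/s per shader")
print(f"Skylake:   {skylake_bw:.1f} GB/s, {per_shader_skylake:.3f} GB/s per shader")
print(f"Single-channel DDR3-1600: {single_channel_bw:.1f} GB/s")
```

Under these assumptions the peak bandwidth grows by about a third while the shader count grows by half, so the bandwidth available per shader goes down, and a single-channel notebook halves it again.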

Anyway, regarding the implementation:
Around CUDA 5.5 NVIDIA introduced a unified address space, software-wise. The idea is that the programmer doesn't have to care about where memory is and benefits from simpler programming. This doesn't help existing CUDA apps with manual memory management and cannot make discrete GPUs any faster.

But if something similar is possible under OpenCL, it might be possible to use it to handle discrete GPUs and APUs with the same code path. The "APU detection" could be passed on to the OpenCL driver, so to speak. I admit I don't know much about the current technical possibilities, though, or whether this would still perform as well as the current app on discrete GPUs.
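
A host-only toy model of what "same code path" could look like. It is plain Python, not OpenCL: a bytearray copy stands in for the transfer to a discrete GPU, and a memoryview stands in for a zero-copy mapping on an APU. All names (`StagedBuffer`, `SharedBuffer`, `make_device_buffer`, `run_kernel`) are hypothetical and say nothing about how the Einstein apps are actually structured.

```python
class StagedBuffer:
    """Discrete-GPU style: the device works on its own copy of the host data."""

    def __init__(self, host):
        self.data = bytearray(host)    # explicit copy, stands in for a PCIe transfer
        self.bytes_copied = len(host)


class SharedBuffer:
    """APU style: the device sees the host allocation itself, nothing is copied."""

    def __init__(self, host):
        self.data = memoryview(host)   # alias of the very same memory
        self.bytes_copied = 0


def make_device_buffer(host, device_shares_memory):
    """Single entry point: the caller never branches on the device type."""
    if device_shares_memory:
        return SharedBuffer(host)
    return StagedBuffer(host)


def run_kernel(buf):
    """Stand-in for a GPU kernel: add 1 to every byte, in place."""
    for i in range(len(buf.data)):
        buf.data[i] = (buf.data[i] + 1) % 256


# Same calling code for both kinds of device.
host_a = bytearray([1, 2, 3])
shared = make_device_buffer(host_a, device_shares_memory=True)
run_kernel(shared)     # result is visible in host_a directly

host_b = bytearray([1, 2, 3])
staged = make_device_buffer(host_b, device_shares_memory=False)
run_kernel(staged)     # result lives in the copy and would need a copy back

print(bytes(host_a), shared.bytes_copied)
print(bytes(host_b), bytes(staged.data), staged.bytes_copied)
```

Whether the real OpenCL counterparts (host-pointer or mapped buffers, or shared virtual memory in 2.0) perform as well as explicit copies on discrete cards is exactly the open question.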

MrS

Scanning for our furry friends since Jan 2002

Bikeman (Heinz-Bernd Eggenstein)

Quote:


Anyway, regarding the implementation:
Around CUDA 5.5 NVIDIA introduced a unified address space, software-wise. The idea is that the programmer doesn't have to care about where memory is and benefits from simpler programming. This doesn't help existing CUDA apps with manual memory management and cannot make discrete GPUs any faster.

But if something similar is possible under OpenCL, it might be possible to use it to handle discrete GPUs and APUs with the same code path. The "APU detection" could be passed on to the OpenCL driver, so to speak. I admit I don't know much about the current technical possibilities, though, or whether this would still perform as well as the current app on discrete GPUs.

MrS

Yup, something like this is possible in OpenCL.

I think I'm now close to getting a test version out (first as a CUDA app version, I guess) that will feature the newest optimizations to conserve PCIe bandwidth. If the results are anywhere near where I hope they will be, this should already make APU owners (and everyone else) happy ;-). Testing should start in parallel with the new BRP6 run, perhaps as early as next week.

HB

ExtraTerrestrial Apes

From the results I see from the new NVIDIA and AMD apps, this should be just what the doctor ordered for iGPUs as well! The gains on PCIe bandwidth-constrained systems are spectacular, so on an iGPU a significant amount of main memory bandwidth should be saved.

Could I help in beta-testing this on an HD4000? (I haven't been following Albert at all)

MrS

Scanning for our furry friends since Jan 2002

ExtraTerrestrial Apes

Quote:
Quote:

unified address space... it might be possible to use it to handle discrete GPUs and APUs with the same code path. The "APU detection" could be passed on to the OpenCL driver, so to speak.

Yup, something like this is possible in OpenCL.


Now that the dust around the great new app has settled: could we give this a try? The benefit wouldn't be as great as with the old app, but I suspect powerful APUs (the bigger AMDs and the upcoming Intels) would still gain something.

MrS

Scanning for our furry friends since Jan 2002
