A BOINC friend on another thread noted the following:
Btw,
Besides constantly twiddling with your cooling system, you could ask the (official) developers to take a look at the hardware support for the PTX atomic global add for f32, at least on NVIDIA hardware.
That cut away 5 seconds on the first try.
The next advice is to take a look at the access pattern of twiddle_dee: that would shave off some 50 seconds on NVIDIA and 17% on some ATI model I know of.
Keep on crunching!
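For anyone wondering what the atomic add suggestion is about: NVIDIA GPUs have a native PTX instruction for this (atom.global.add.f32), while portable OpenCL 1.x kernels usually have to emulate a float atomic add with a compare-and-swap loop. I have not seen the app source, so the following is only a host-side C11 sketch of that emulation pattern, not the app's actual kernel code. The names atomic_add_f32_cas and load_f32 are made up for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Hypothetical helper: emulated atomic float add using a compare-and-swap
   loop on the 32-bit pattern of the float. Returns the value before the add. */
static float atomic_add_f32_cas(_Atomic uint32_t *addr, float val)
{
    uint32_t old_bits = atomic_load(addr);
    for (;;) {
        float old_f, new_f;
        uint32_t new_bits;
        memcpy(&old_f, &old_bits, sizeof old_f);    /* bits -> float */
        new_f = old_f + val;
        memcpy(&new_bits, &new_f, sizeof new_bits); /* float -> bits */
        /* On failure old_bits is refreshed with the current contents,
           so the loop retries with up-to-date data. */
        if (atomic_compare_exchange_weak(addr, &old_bits, new_bits))
            return old_f;
    }
}

/* Hypothetical helper: read the accumulator back as a float. */
static float load_f32(_Atomic uint32_t *addr)
{
    uint32_t bits = atomic_load(addr);
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Every retry of that loop is wasted work when many threads hit the same address, which is why a single hardware instruction can be noticeably cheaper. To try it, start from an accumulator initialised to 0 (the bit pattern of 0.0f), call atomic_add_f32_cas(&acc, 0.5f) a few times and read the sum with load_f32(&acc).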
A Proud member of the O.F.A. (Old Farts Association).
Copyright © 2024 Einstein@Home. All rights reserved.
twiddle_dee is defined as __constant float2[3][256], and each thread accesses it at a different location, which results in serialized access on NVIDIA hardware. SLOW!
Please replace the word __constant with global.
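To illustrate why the access pattern matters: on NVIDIA hardware the constant cache broadcasts one address to a whole warp, and a read in which the threads of a warp ask for different addresses gets split into one request per distinct address. Below is a rough host-side C model of that rule, assuming a 32-thread warp and the 256-entry inner dimension from the declaration above. The function name constant_read_passes is made up, and real hardware has more going on (caching, the other memory paths), so treat the count as a relative cost only.

```c
#include <assert.h>

#define WARP_SIZE 32   /* NVIDIA warp width */
#define TABLE_LEN 256  /* inner dimension of float2[3][256] */

/* Simplified model (an assumption, not a measurement): a __constant read
   costs one pass per distinct index requested by the threads of a warp. */
static int constant_read_passes(const int idx[WARP_SIZE])
{
    unsigned char seen[TABLE_LEN] = {0};
    int passes = 0;
    for (int lane = 0; lane < WARP_SIZE; ++lane) {
        assert(idx[lane] >= 0 && idx[lane] < TABLE_LEN);
        if (!seen[idx[lane]]) {
            seen[idx[lane]] = 1;  /* first time this index is requested */
            ++passes;
        }
    }
    return passes;
}
```

If all 32 threads read the same entry the model gives 1 pass (the broadcast case that __constant is good at). If every thread reads a different entry it gives 32 passes, which is the situation described above. Program-scope variables in the global address space were added in OpenCL 2.0, which is presumably why the changed code needs OpenCL 2.0 or newer drivers.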
See: https://einsteinathome.org/fi/workunit/565663876
petri33 wrote: Twiddle dee ...
This is the biggest thing holding Nvidia back for sure. It's a kind of artificial limit that keeps Nvidia cards from operating at their full potential, at least the modern ones (Pascal through Ampere). It's the sole reason that Nvidia has long under-performed at Einstein. But that's changing ;)
With the above changes:
Pascal can speed up processing ~40-60%
Turing can speed up processing ~65%
Ampere can speed up processing ~100-110%
It requires OpenCL 2.0+ drivers though. Conveniently enough, Nvidia's drivers have supported OpenCL 3.0 since the 465 driver branch. Getting OpenCL 2.0 working on AMD/Linux with older cards is a bit more of a challenge, but not impossible. Newer cards have better ROCm support than the older cards. Nvidia really has a better handle on their drivers than AMD does.
I wonder if this fix could be applied to the Windows app as well?
_________________________________________________________________________
I passed that on to our GPU App developer.
BM
Thanks Bernd. Please keep us updated if this makes it into a new app.
_________________________________________________________________________
Bernd Machenschalk wrote: I ...
Thank you.
A Proud member of the O.F.A. (Old Farts Association).
I've just downloaded a new version 1.01 (GW-opencl-nvidia) (beta test) for the Gravitational Wave search O3 All-Sky #1 (O3AS) search - deployed about an hour and a half ago. Could that be related?
Possibly. But in my testing the mentioned changes have little effect on the GW app, since it's so heavily CPU bound.
They need to update the Gamma Ray apps primarily.
Also, if they implement the changes, you'll need OpenCL 2.0 compatible drivers to make use of them. You'll get errors otherwise.
_________________________________________________________________________
I noticed the new app first on a Windows machine, which has "device version OpenCL 1.2 CUDA". There's a separate one for Linux, which I've also downloaded.
The first tasks specified to use the new app will reach the head of the cache while I'm out at dinner. I'll see what sort of a mess we have when I get back.
For Nvidia, you need drivers from the 465 or 470 branch. Those include OpenCL 3.0 on both Windows and Linux.
On AMD, I believe the Windows drivers support OpenCL 2.0 for the cards that support it, but on Linux it's a little more complicated. The AMDGPU-Pro drivers only support OpenCL 2.0 on Vega and newer. Older cards like the Polaris-based RX 500 series GPUs will only get OpenCL 1.2 from the AMDGPU-Pro drivers, even though the hardware supports OpenCL 2.0. It's an issue/limitation with the ROCm runtime included with the AMD driver installer. You'll only get OpenCL 2.0 support (at least enough to work with this code change) if you do the full ROCm install.
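If you want to check this programmatically rather than by eye: the device version string that OpenCL reports (CL_DEVICE_VERSION, the "device version OpenCL 1.2 CUDA" kind of text quoted earlier in the thread) always starts with "OpenCL <major>.<minor> " followed by vendor-specific text. Here is a small C sketch that parses it. The function name opencl_version_at_least is made up, and it only looks at the string; actually fetching the string with clGetDeviceInfo is left out.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: parse a CL_DEVICE_VERSION string of the form
   "OpenCL <major>.<minor> <vendor info>".
   Returns 1 if the version is at least want_major.want_minor,
   0 if it is lower, and -1 if the string does not match the format. */
static int opencl_version_at_least(const char *device_version,
                                   int want_major, int want_minor)
{
    int major = 0, minor = 0;
    assert(device_version != NULL);
    if (sscanf(device_version, "OpenCL %d.%d", &major, &minor) != 2)
        return -1;
    return (major > want_major) ||
           (major == want_major && minor >= want_minor);
}
```

With the string "OpenCL 1.2 CUDA" and a required version of 2.0, this returns 0, so a changed app would not be usable on that driver. A string starting with "OpenCL 3.0" returns 1.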
_________________________________________________________________________
Returned from dinner - new app tasks are running well, and seemingly quicker than before on both Windows and Linux.
Windows driver version (easiest to check) is 452.06