EM searches, BRP Radiopulsar and FGRP Gamma-Ray Pulsar

Ian&Steve C.

Joined: 19 Jan 20

Posts: 3956

Credit: 46953272642

RAC: 64604711

9 Aug 2022 15:44:26 UTC

Message 199630

CUDA 11 (470.xx) drivers support everything from current cards all the way back to Kepler (GTX 600/700 series). I think that’s sufficiently old to support. How many connected devices are older than that and would be shut out by the newer CUDA app? There can’t be so many Tesla/Fermi cards still in production on Einstein that losing them would be a big hit, can there?

So I guess the app is built in a way that it works with modern cards? With PTX versions of the kernels?

What’s the reason for wanting a CUDA app for Nvidia? Is the performance significantly better vs the OpenCL app?

_________________________________________________________________________

petri33

Joined: 4 Mar 20

Posts: 123

Credit: 4051725819

RAC: 6963874

10 Aug 2022 8:49:19 UTC

Message 199646 in response to message 199618

Btw,

There is a possible cause for tasks not always validating: in BRP4's demod_binary_resamp_cpu.c, the function run_resampling has a loop that sums ~4 million floats one by one. The sum will be slightly inaccurate because the least significant bits are lost as the running total grows. I suggest you use double for the variable 'mean' when calculating the sum in the CPU code.
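A minimal sketch of the effect described above. It assumes nothing about the real run_resampling beyond what the post says; sum_with_float and sum_with_double are hypothetical names, and the only difference between them is the type of the accumulator.

```c
#include <assert.h>
#include <stddef.h>

/* Sum n floats with a single-precision accumulator.
 * Once the running total is large, each small addend loses its
 * low-order bits when it is rounded into the total. */
float sum_with_float(const float *x, size_t n)
{
    assert(x != NULL || n == 0);
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += x[i];
    return acc;
}

/* Same loop, but the running total is kept in double precision,
 * so far fewer bits are discarded at each step. */
double sum_with_double(const float *x, size_t n)
{
    assert(x != NULL || n == 0);
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += (double)x[i];
    return acc;
}
```

Fed 4,000,000 copies of 0.1f on an IEEE-754 platform, the float version should end up thousands away from 400000 while the double version stays within a tiny fraction of it. Real data with mixed magnitudes will usually drift far less than this deliberately bad case, but the mechanism is the same.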

Petri

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4312

Credit: 250540625

RAC: 34308

10 Aug 2022 9:35:56 UTC

Message 199648

@PETRI33: Thanks, that's an important hint!

@Ian&Steve C.: For some reason the results of the CUDA version agree better with the CPU version, which is still our reference. And yes, performance is better, in particular when running multiple tasks in parallel. NVidia's OpenCL drivers always require a full CPU core, even when doing nothing. I guess they do "busy waiting" for the GPU kernels to finish.

Richard Haselgrove

Joined: 10 Dec 05

Posts: 2143

Credit: 2958419573

RAC: 712413

10 Aug 2022 10:50:34 UTC

Message 199650 in response to message 199648

Bernd Machenschalk wrote:

For some reason the results of the CUDA version agree better with the CPU version, which is still our reference. And yes, performance is better, in particular when running multiple tasks in parallel. NVidia's OpenCL drivers always require a full CPU core, even when doing nothing. I guess they do "busy waiting" for the GPU kernels to finish.

The lower precision of the OpenCL versions is something I've written about before, in the context of the OpenCL versions for Intel iGPU: you need to avoid the fused multiply-add ('MAD') opcode via an OpenCL compiler directive.
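To illustrate why this matters for validation, here is a small sketch in plain C (not OpenCL) showing that a * b + c can give different bits depending on whether the product is rounded before the add. The function names are hypothetical. The second variant keeps the product exact in double, which only approximates what a fused operation does; C99's fmaf from <math.h> is the direct equivalent. How OpenCL's mad behaves on a given device is implementation-defined, so treat this as an analogy rather than a model of any particular driver.

```c
/* a*b + c with the product rounded to float before the add.
 * `volatile` forces that intermediate rounding, so the compiler
 * cannot contract the expression into a fused instruction. */
float mul_add_separate(float a, float b, float c)
{
    volatile float product = a * b;
    return product + c;
}

/* a*b + c with the product held exactly (the product of two floats
 * fits in a double) and rounded to float only at the very end. */
float mul_add_wide(float a, float b, float c)
{
    double exact_product = (double)a * (double)b;
    return (float)(exact_product + (double)c);
}
```

With a = b = 1 + 2^-12 and c = -(1 + 2^-11), the first variant returns exactly 0 and the second returns 2^-24. Neither is "wrong" in isolation, but a validator comparing against a CPU reference built one way will see the other as a mismatch.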

The full-core CPU requirement derives from a process called kernel (or thread) synchronisation. Multiple efficient ways of achieving this are available in CUDA (with significant enhancements from CUDA 9 onwards), but NVidia - notoriously - didn't transfer them to their OpenCL implementation: you're stuck with busy-wait spin loops. OpenCL shouldn't need a full core (especially the specialist bits, like floating point units and SIMD processors), but it grabs them anyway and won't let go.

Cruncher-American

Joined: 24 Mar 05

Posts: 71

Credit: 5492531762

RAC: 4330546

10 Aug 2022 11:11:49 UTC

Message 199651 in response to message 199650

Richard - is this the reason my AMD cards use only a small fraction of a CPU on these WUs while Nvidias use nearly 100 pct? Were the Nvidia coders just lazy when coding their OpenCL drivers?

Ugh!

Richard Haselgrove

Joined: 10 Dec 05

Posts: 2143

Credit: 2958419573

RAC: 712413

10 Aug 2022 11:17:57 UTC

Message 199652 in response to message 199651

Cruncher-American wrote:

Richard - is this the reason my AMD cards use only a small fraction of a CPU on these WUs while Nvidias use nearly 100 pct? Were the Nvidia coders just lazy when coding their OpenCL drivers?

Lazy, or deliberately anti-collaborative to protect their proprietary (and lucrative) CUDA alternative.

Quote:

Ugh!

Ugh indeed.

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4312

Credit: 250540625

RAC: 34308

10 Aug 2022 12:08:00 UTC

Message 199654

@Richard: The only flag I could find to do this is '-cl-opt-disable', which disables all math optimizations in OpenCL kernels. I'll try to get this into the app. The advantage of working on BRP7 is that we revived the process of building BRP Apps again, so there are better chances to get the BRP4 Intel GPU issue fixed.

Richard Haselgrove

Joined: 10 Dec 05

Posts: 2143

Credit: 2958419573

RAC: 712413

10 Aug 2022 12:53:01 UTC

Message 199655 in response to message 199654

I've discussed this problem, on and off, at many projects over the years. I've found references in my own posts to -cl-mad-enable, which suggests the possible existence of a -cl-mad-disable. My memory tells me that I've posted links in the past to Intel documentation explicitly advising against -cl-mad-enable where accuracy is important, but neither I nor Google can find those references again today. I think I need to take a bit of a break now, but I'll keep looking when I get back.

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4312

Credit: 250540625

RAC: 34308

10 Aug 2022 13:52:30 UTC

Message 199659

@Richard: You are totally right in complaining about "-cl-mad-enable", and there doesn't seem to be a "-cl-mad-disable" counterpart. You could set "-cl-opt-disable", and then possibly enable other, vendor-specific optimization flags. However, for E@H I did actually find a "-cl-mad-enable" hardcoded deep down in some library. I'll fix that.
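Since there is no "-cl-mad-disable", the practical fix is to make sure "-cl-mad-enable" never reaches the compiler. Where the flag sits in source it can simply be deleted. Where the option string is assembled elsewhere, it could be filtered just before the build call, roughly as sketched below. This is only an illustration: strip_build_flag is a hypothetical helper, not something from the E@H sources or the OpenCL API.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: remove every whole-token occurrence of `flag`
 * from the space-separated option string `opts`, editing it in place.
 * Returns how many occurrences were removed. */
int strip_build_flag(char *opts, const char *flag)
{
    assert(opts != NULL && flag != NULL);
    const size_t flen = strlen(flag);
    int removed = 0;
    char *p = opts;

    if (flen == 0)
        return 0;

    while ((p = strstr(p, flag)) != NULL) {
        /* Only match complete tokens, so the flag is not removed
         * from the middle of a longer option name. */
        int starts_token = (p == opts) || (p[-1] == ' ');
        int ends_token = (p[flen] == '\0') || (p[flen] == ' ');
        if (starts_token && ends_token) {
            char *rest = p + flen;
            while (*rest == ' ')      /* drop the separator after the flag */
                rest++;
            memmove(p, rest, strlen(rest) + 1);
            removed++;
        } else {
            p += flen;
        }
    }

    /* Trim trailing spaces left behind when the last token was removed. */
    size_t len = strlen(opts);
    while (len > 0 && opts[len - 1] == ' ')
        opts[--len] = '\0';
    return removed;
}
```

The cleaned string would then be what gets passed as the options argument of clBuildProgram.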

Richard Haselgrove

Joined: 10 Dec 05

Posts: 2143

Credit: 2958419573

RAC: 712413

10 Aug 2022 15:21:09 UTC

Message 199663 in response to message 199659

Marvellous what a quiet walk in the countryside can do to clear the mind! Retrieved from my sent emails box:

Addressed to Keith Uplinger, then of WCG:

This is the second point I noticed during my offline tests. In the terminal window, we can see

Kernel compilation flags: -I ./device -I ./common -DN128WI -cl-mad-enable

'mad' in this case stands for 'fused multiply+add opcode', and the developer notes say

"Enables a * b + c to be replaced by mad. Note that mad computes a * b + c with reduced accuracy." (from https://software.intel.com/content/www/us/en/develop/documentation/iocl_rt_ref/top/opencl-build-and-linking-options/optimization-options2.html), and

"mad approximates a * b + c. Whether or how the product of a * b is rounded and how supernormal or subnormal intermediate products are handled is not defined. mad is intended to be used where speed is preferred over accuracy." (from https://www.khronos.org/registry/spir-v/specs/1.0/OpenCL.ExtendedInstructionSet.100.mobile.html)

And, quoting from (I think) a conversation on this Einstein message board:

I spent a long weekend with Raistmer - him trying various code revisions and compiler settings, me testing and reporting 'still inaccurate'. Some notes from the end of that testing session:

Hm... that's interesting...
FFT untouched in that build. So, all harm comes from own kernels only.

This one leaves FP_CONTRACT OFF but enables -cl-mad-enable for oclFFT.

So, 2 hares in one shot - establish minimal changes for fix and locates issues (my code/oclFFT)

Eric Korpela will confirm that the results of that session were accepted as an official SETI app

I hope that gives you enough context to be confident with the change. Looks like both websites have been changed since I wrote the email (April 2021), but they were direct quotes at the time.
