EM searches, BRP Radiopulsar and FGRP Gamma-Ray Pulsar

Ian&Steve C.
Joined: 19 Jan 20
Posts: 2,012
Credit: 16,796,435,502
RAC: 40,557,530

The CUDA 11 (470.xx) drivers support everything from current cards all the way back to Kepler (GTX 600/700 series). I think that’s sufficiently old to support. How many connected devices are older than that and would be ostracized by the newer CUDA app? There can’t be so many Tesla/Fermi cards still in production on Einstein that losing them would be a big hit, can there?

So I guess the app is built in a way that it works with modern cards? With PTX versions of the kernels?

What’s the reason for wanting a CUDA app for Nvidia? Is the performance significantly better than with the OpenCL app?

petri33
Joined: 4 Mar 20
Posts: 111
Credit: 2,873,503,007
RAC: 1,640,082

Btw,

There is a possibility of tasks not always validating: in BRP4, the function run_resampling in demod_binary_resamp_cpu.c has a loop that sums ~4 million floats one by one. The sum will be slightly inaccurate due to precision loss in the least significant bits as the sum grows. I suggest you use double for the variable 'mean' when calculating the sum in the CPU code.
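
A minimal Python sketch of the effect described here. It does not reproduce run_resampling; the helper names (to_f32, mean_f32, mean_f64) and the constant input are assumptions made purely for illustration. Single precision is emulated by rounding every intermediate value through a 4-byte float with struct.

```python
import struct


def to_f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def mean_f32(values):
    """Mean computed with a single-precision accumulator, one element at a time."""
    total = 0.0
    for v in values:
        # Round after every addition, as a float accumulator would.
        total = to_f32(total + to_f32(v))
    return to_f32(total / len(values))


def mean_f64(values):
    """Same loop, but the accumulator is kept in double precision."""
    total = 0.0
    for v in values:
        total += to_f32(v)  # the inputs are still single-precision samples
    return total / len(values)


if __name__ == "__main__":
    n = 1_000_000  # smaller than the ~4 million in the post, to keep the demo quick
    sample = to_f32(0.1)
    data = [sample] * n
    print("input value       :", sample)
    print("float accumulator :", mean_f32(data))
    print("double accumulator:", mean_f64(data))
```

With a float accumulator, each new sample is rounded to the spacing of the running total, so the error grows with the total. With a double accumulator the same inputs are summed with far more headroom.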

Petri

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4,099
Credit: 227,908,862
RAC: 25,056

@PETRI33: Thanks, that's an important hint!

@Ian&Steve C.: For some reason the results of the CUDA version agree better with the CPU version, which is still our reference. And yes, performance is better, in particular when running multiple tasks in parallel. NVidia's OpenCL drivers always require a full CPU core, even when doing nothing. I guess they do "busy waiting" for the GPU kernels to finish.

BM

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2,112
Credit: 1,483,107,362
RAC: 5,374,546

Bernd Machenschalk wrote:
For some reason the results of the CUDA version agree better with the CPU version, which is still our reference. And yes, performance is better, in particular when running multiple tasks in parallel. NVidia's OpenCL drivers always require a full CPU core, even when doing nothing. I guess they do "busy waiting" for the GPU kernels to finish.

The lower precision of the OpenCL versions is something I've written about before, in the context of the OpenCL versions for Intel iGPU: you need to avoid the fused multiply-add ('MAD') opcode via an OpenCL compiler directive.
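
A small Python sketch of why this matters for validation. It is an illustration only: the function names are hypothetical, it works in double precision rather than on a GPU, and it emulates a correctly rounded fused operation with exact rational arithmetic. OpenCL's mad is allowed to be less accurate than either variant shown here. The point is just that changing where rounding happens changes the result bits, so a kernel built this way can disagree with a CPU reference that multiplies and adds separately.

```python
from fractions import Fraction


def mul_add_separate(a, b, c):
    """a * b + c with two roundings: one after the multiply, one after the add."""
    return (a * b) + c


def mul_add_fused(a, b, c):
    """a * b + c evaluated exactly, then rounded once (emulates a fused operation)."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


if __name__ == "__main__":
    a = 1.0 + 2.0 ** -30
    b = 1.0 - 2.0 ** -30
    c = -1.0
    print("separate:", mul_add_separate(a, b, c))
    print("fused   :", mul_add_fused(a, b, c))
```

The inputs are chosen so that the exact product a * b sits just below 1.0 and is lost when rounded on its own. The two variants therefore return different values for the same operands.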

The full-core CPU requirement derives from a process called kernel (or thread) synchronisation. Multiple efficient ways of achieving this are available in CUDA (with significant enhancements from CUDA 9 onwards), but NVidia - notoriously - didn't transfer them to their OpenCL implementation: you're stuck with busy-wait spin loops. OpenCL shouldn't need a full core (especially the specialist bits, like floating point units and SIMD processors), but it grabs them anyway and won't let go.
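
The difference between a busy-wait spin loop and a blocking wait can be sketched on the host side in plain Python. This is an analogy only, not the driver's actual code: the "kernel" is a thread that sleeps, and all names (cpu_cost_of_wait, spin_wait, blocking_wait) are hypothetical.

```python
import threading
import time


def cpu_cost_of_wait(wait, work_seconds=0.2):
    """Return host CPU seconds consumed while waiting for a simulated kernel."""
    done = threading.Event()

    def fake_kernel():
        time.sleep(work_seconds)  # stands in for the GPU being busy
        done.set()

    worker = threading.Thread(target=fake_kernel)
    start = time.process_time()
    worker.start()
    wait(done)
    worker.join()
    return time.process_time() - start


def spin_wait(done):
    """Busy-wait: poll the flag in a tight loop, keeping a core occupied."""
    while not done.is_set():
        pass


def blocking_wait(done):
    """Blocking wait: sleep until the flag is set, leaving the core free."""
    done.wait()


if __name__ == "__main__":
    print("spin wait CPU seconds    :", cpu_cost_of_wait(spin_wait))
    print("blocking wait CPU seconds:", cpu_cost_of_wait(blocking_wait))
```

Both variants wait for the same amount of wall-clock time. Only the spin loop charges that time to the CPU, which is the pattern people see as "a full core per GPU task".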

Cruncher-American
Joined: 24 Mar 05
Posts: 48
Credit: 2,083,043,726
RAC: 5,280,063

Richard - is this the reason my AMD cards use only a small fraction of a CPU on these WUs while Nvidias use nearly 100 pct? Were the Nvidia coders just lazy when coding their OpenCL drivers?

Ugh!

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2,112
Credit: 1,483,107,362
RAC: 5,374,546

Cruncher-American wrote:

Richard - is this the reason my AMD cards use only a small fraction of a CPU on these WUs while Nvidias use nearly 100 pct? Were the Nvidia coders just lazy when coding their OpenCL drivers?

Lazy, or deliberately anti-collaborative to protect their proprietary (and lucrative) CUDA alternative.

Quote:
Ugh!

Ugh indeed.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4,099
Credit: 227,908,862
RAC: 25,056

@Richard: The only flag I could find to do this is '-cl-opt-disable', which disables all optimizations in OpenCL kernels. I'll try to get this into the app. The advantage of working on BRP7 is that we revived the process of building BRP apps, so there are better chances of getting the BRP4 Intel GPU issue fixed.

BM

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2,112
Credit: 1,483,107,362
RAC: 5,374,546

I've discussed this problem, on and off, at many projects over the years. I've found references in my own posts to -cl-mad-enable, which suggests the possible existence of a -cl-mad-disable. My memory tells me that I've posted links in the past to Intel documentation explicitly advising against -cl-mad-enable where accuracy is important, but neither I nor Google can find those references again today. I think I need to take a bit of a break now, but I'll keep looking when I get back.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4,099
Credit: 227,908,862
RAC: 25,056

@Richard: you are totally right in complaining about "-cl-mad-enable", and there doesn't seem to be a "-cl-mad-disable" counterpart. You could set "-cl-opt-disable", and then possibly enable other, vendor-specific optimization flags. However, for E@H I did actually find a "-cl-mad-enable" hardcoded deep down in some library. I'll fix that.
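
As a rough sketch of that kind of fix, the snippet below removes one flag from an OpenCL build-options string before it would be handed to the kernel compiler. The helper name strip_build_option is hypothetical and this is not the E@H library code; the sample options string is the one quoted from the terminal output further down the thread.

```python
import shlex


def strip_build_option(options, unwanted="-cl-mad-enable"):
    """Return an OpenCL build-options string without the unwanted flag."""
    kept = [opt for opt in shlex.split(options) if opt != unwanted]
    return shlex.join(kept)


if __name__ == "__main__":
    flags = "-I ./device -I ./common -DN128WI -cl-mad-enable"
    print(strip_build_option(flags))
```

Filtering whole tokens rather than doing a substring replace avoids accidentally touching other options that merely contain the same text.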

BM

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2,112
Credit: 1,483,107,362
RAC: 5,374,546

Marvellous what a quiet walk in the countryside can do to clear the mind! Retrieved from my sent emails box:

Addressed to Keith Uplinger, then of WCG:

This is the second point I noticed during my offline tests. In the terminal window, we can see

Kernel compilation flags: -I ./device -I ./common -DN128WI -cl-mad-enable

'mad' in this case stands for 'fused multiply+add opcode', and the developer notes say

"Enables a * b + c to be replaced by mad. Note that mad computes a * b + c with reduced accuracy." (from https://software.intel.com/content/www/us/en/develop/documentation/iocl_rt_ref/top/opencl-build-and-linking-options/optimization-options2.html), and

"mad approximates a * b + c. Whether or how the product of a * b is rounded and how supernormal or subnormal intermediate products are handled is not defined. mad is intended to be used where speed is preferred over accuracy." (from https://www.khronos.org/registry/spir-v/specs/1.0/OpenCL.ExtendedInstructionSet.100.mobile.html)

And, quoting from (I think) a conversation on this Einstein message board:

I spent a long weekend with Raistmer - him trying various code revisions and compiler settings, me testing and reporting 'still inaccurate'. Some notes from the end of that testing session:

Hm... that's interesting...
FFT untouched in that build. So, all harm comes from own kernels only.

This one leaves FP_CONTRACT OFF but enables -cl_mad_enable for oclFFT.

So, 2 hares in one shot - establish minimal changes for fix and locates issues (my code/oclFFT)

Eric Korpela will confirm that the results of that session were accepted as an official SETI app.

I hope that gives you enough context to be confident with the change. Looks like both websites have been changed since I wrote the email (April 2021), but they were direct quotes at the time.
