Are there any E@H apps that allow for command line optimization of FFT in either AMD or NVidia GPU apps? More generally, is there specific information from the Linux command 'clinfo' that would be useful for folks wanting to optimize GPU run parameters? I'm asking to help with upgrades to the program amdgpu-utils available from GitHub.
Ideas are not fixed, nor should they be; we live in model-dependent reality.
Hello Cecht,
the tool you mentioned (https://github.com/Ricks-Lab/amdgpu-utils) is also available for Debian and Ubuntu (https://packages.ubuntu.com/focal/ricks-amdgpu-utils). I admit I haven't understood your question, though :o/
I think that's a no to both questions I'm afraid.
[This may not be relevant/helpful]
IIRC the bottleneck in the FFT is the evaluation of the sine and cosine of theta, where theta is the ( radian-measured ) argument, with the precision/granularity of the base transform ie. however many data points we are transforming. Evaluation by power-series approximation converges far too slowly, but I think it was solved using lookup of precalculated values in a table and the double-angle (angle-addition) formulae eg.
sin(A + B) = sin(A) * cos(B) + cos(A) * sin(B) ; cos(A + B) = cos(A) * cos(B) - sin(A) * sin(B)
making this amenable to a fused multiply/add scheme, if available.
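As a quick sanity check, the angle-addition identities above can be evaluated from tabulated values of the two parts; a minimal Python sketch (function and variable names are mine for illustration, not from any E@H app):

```python
import math

def sincos_sum(sin_a, cos_a, sin_b, cos_b):
    # Angle-addition identities, each a multiply-add pair:
    s = sin_a * cos_b + cos_a * sin_b   # sin(A + B)
    c = cos_a * cos_b - sin_a * sin_b   # cos(A + B)
    return s, c

A, B = 0.3, 0.4
s, c = sincos_sum(math.sin(A), math.cos(A), math.sin(B), math.cos(B))
assert abs(s - math.sin(A + B)) < 1e-12
assert abs(c - math.cos(A + B)) < 1e-12
```

Each line is one multiply followed by one add (or subtract) on precomputed operands, which is exactly the shape a fused multiply/add instruction targets.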
[/This may not be relevant/helpful]
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
Mike Hewson wrote: making this amenable to a fused multiply/add scheme, if available.
Isn't there a problem with loss of precision with fused multiply/add on some hardware, such as Intel GPUs?
Richard Haselgrove wrote: Isn't there a problem with loss of precision with fused multiply/add on some hardware, such as Intel GPUs?
Is there ? That's sad. I guess the question is equivalent to "does Intel/whoever comply with IEEE 754-2008 ?" As the prime market for GPUs is not computing, it may be better to be fast than right .....
[dull, dirty and dangerous]
So you are right to worry, as with E@H the FFTs are of order 2^22 data points IIRC. So you have to have a precision equal to or deeper than that in the representation. Of interest is that 32-bit IEEE-754 floats have 23 significant bits for the fractional part of an operand.
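That 23-bit fractional significand can be demonstrated directly: round-tripped through IEEE-754 binary32, 1 + 2^-23 survives but 1 + 2^-24 collapses back to 1. A small Python illustration (the struct round-trip is just one way to force binary32 rounding):

```python
import struct

def to_f32(x):
    # Round a Python float (binary64) to IEEE-754 binary32 and back.
    return struct.unpack('f', struct.pack('f', x))[0]

assert to_f32(1.0 + 2.0 ** -23) > 1.0   # 2^-23 is representable next to 1.0
assert to_f32(1.0 + 2.0 ** -24) == 1.0  # 2^-24 is rounded away: 23 fraction bits
```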
This is a clue to calculating the trig functions via the double-angle formula. Pre-calculate the sine and cosine for each 2^-n from 2^-1 through to 2^-23. Put these into respective tables/vectors indexed suitably, making two tables each of 23 entries of some precision. Now any 23-bit theta argument, from zero up until 1 - 2^-23, is going to be some sum of certain such powers of 2 ie.
a_1 * 2^-1 + a_2 * 2^-2 + ..... + a_k * 2^-k + ...... + a_23 * 2^-23 { each a_k is either 0 or 1 }
Set the value of two accumulators, call them Asin and Acos. The first is set to sin(0.0) = 0.0 and the second is set to cos(0.0) = 1.0.
Now, by for-looping say, shift the operand to the right by one bit and inspect the bit that falls off the right end ( first you will get a_23, then another shift gets a_22, etc ..... all along to a_1 ). At each shift test the bit's value :
- if a_k is 0 then forget it, as 2^-k does not contribute to theta's value, so neither sin(2^-k) nor cos(2^-k) will contribute to our task here. Exit that loop iteration.
- if a_k is 1 then 2^-k does contribute to theta's value. So using sin(2^-k) and cos(2^-k) from your tables, plug it into the double-angle formulae :
Asin_old <---- Asin
Acos_old <---- Acos
Asin <---- Asin_old * cos(2^-k) + Acos_old * sin(2^-k)
Acos <---- Acos_old * cos(2^-k) - Asin_old * sin(2^-k)
... where Asin_old and Acos_old are temporary variables, with <---- indicating assignment. So actually I don't need FMA at all now that I come to think of it in detail ....
When the loop finishes looking at all the bits of theta, you will have Asin = sin(theta) and Acos = cos(theta). To that level of precision, anyway. For that matter the precision of the trigs doesn't have to equal the precision of the argument, though the precision of sine must be the same as that of cosine. But you need a full circle, ie. from 0 to just below 2 * PI : mutatis mutandis to reach that range.
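The whole bit-scanning scheme can be sketched in a few lines; this is an illustrative Python version (the fixed-point convention theta = m * 2^-23 radians, table size, and names are my assumptions, not E@H code):

```python
import math

BITS = 23
# Tables of sin and cos for each power-of-two angle 2^-1 ... 2^-23 (radians).
SIN = [math.sin(2.0 ** -k) for k in range(1, BITS + 1)]
COS = [math.cos(2.0 ** -k) for k in range(1, BITS + 1)]

def sincos_fixed(m):
    """Return (sin, cos) of theta = m * 2^-23 radians, for a 23-bit
    unsigned integer m, using only table lookups and angle addition."""
    asin, acos = 0.0, 1.0            # sin(0) and cos(0)
    for k in range(BITS, 0, -1):     # bit a_23 falls off first, then a_22 ...
        if m & 1:                    # a_k set: the angle 2^-k contributes
            s, c = SIN[k - 1], COS[k - 1]
            # Tuple assignment evaluates the right-hand side first, so the
            # "old" accumulator values are used without explicit temporaries.
            asin, acos = asin * c + acos * s, acos * c - asin * s
        m >>= 1
    return asin, acos
```

At most 23 angle-addition updates per argument, all on precomputed table entries.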
Plus, to be truly general, you want to reduce the argument modulo 2 * PI, bringing it back to some principal range. Why not go from -PI to +PI, a lovely (anti-)symmetric range ? Know that :
sin(-theta) = -sin(theta)
cos(-theta) = cos(theta)
{ Ultimately you're actually after cos(theta) + i * sin(theta) = e^(i*theta) }
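Range reduction plus those symmetries can be sketched like so (a Python illustration; the math.sin/cos calls stand in for whatever table scheme serves the non-negative half):

```python
import math

def reduce_angle(theta):
    # Fold theta into the principal range [-pi, pi] by subtracting
    # the nearest integer multiple of 2*pi.
    two_pi = 2.0 * math.pi
    return theta - two_pi * round(theta / two_pi)

def sincos_via_symmetry(theta):
    t = reduce_angle(theta)
    # sin(-t) = -sin(t) and cos(-t) = cos(t): only |t| in [0, pi] is needed.
    s, c = math.sin(abs(t)), math.cos(abs(t))
    return (-s if t < 0 else s), c
```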
Alternatively : to be utterly brutal you could just pre-calculate every sine and cosine for every argument in your principal range. That unwraps all the above gobbledygook into simple lookups from two arrays, each of 2^23 entries. What a luxury that would be ...
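The brute-force alternative is easy to sketch as well; here with a parameterised table size so it can be tried at far fewer than the full 2^23 entries (the index convention and names are mine):

```python
import math

def build_trig_tables(bits):
    # One entry per tabulated argument theta = i * 2*pi / 2^bits.
    n = 1 << bits
    step = 2.0 * math.pi / n
    sin_t = [math.sin(i * step) for i in range(n)]
    cos_t = [math.cos(i * step) for i in range(n)]
    return sin_t, cos_t

# The post suggests bits = 23 (2^23 entries); a small table suffices to try it.
TBITS = 12
SIN_T, COS_T = build_trig_tables(TBITS)

def sincos_lookup(theta):
    n = 1 << TBITS
    i = round(theta / (2.0 * math.pi) * n) % n   # nearest tabulated argument
    return SIN_T[i], COS_T[i]
```

The error is bounded by half a table step, so at 2^23 entries the angular quantisation is below 2^-22 * PI.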
[/dull, dirty and dangerous]
Cheers, Mike.
steffen_moeller wrote: Hello Cecht, the tool you mentioned (https://github.com/Ricks-Lab/amdgpu-utils) is also available for Debian and Ubuntu.
Right, though amdgpu-utils 3.0.0 has been released, so the deb package should be updated.
My question relates to a rather minor point: what information should be listed by the amdgpu-ls --clinfo command. Rick, the dev, set that option to include parameters that are relevant for setting command line arguments of SETI apps. He wasn't sure whether E@H apps used similar or additional information (related to FFT), so I offered to ask here.
Mike Hewson wrote: I think that's a no to both questions I'm afraid.
Thank you. The rest of what you and Richard posted is fascinating (though above my head!) and hopefully will be helpful to someone down the road.