Parallella, Raspberry Pi, FPGA & All That Stuff

Bikeman (Heinz-...
Moderator
Joined: 28 Aug 06
Posts: 3,516
Credit: 591,827,237
RAC: 5,006

RE: Hi, I am working with

Quote:

Hi,

I am working with VHDL/Verilog for a long time (MSc, PhD) and for the last 2 years hardware development is my daily job. I have some questions for the very interesting discussion about porting EaH clients to FPGAs.

Sure!

Quote:

1. The FFT calculations performed in the EaH is always in a frame that is a power of 2 (e.g 2^20) or you need a non-power-of-2 FFT core? The non-power-of-2 FFT implementations are far more complex in hardware.

The E@H search that is best (maybe the only one) suited for this type of hardware ,because of the relatively small size of tasks, is BRP4 (the one running on Raspberry Pis and Android devices). It computes a 3*2^22 real to complex DFT. We were facing the same challenge when using an OpenCL FFT lib that would do only power-of-two complex-to-complex transforms, so what we do is to perform a batch of 3 transforms, each of length 2^21 [sic], complex-to-complex and do the rest in a custom kernel on the GPU (it's relatively easy to compute a 2^N real to complex transform with just a single, 2^(N-1) C2C transform, with some extra twiddling at the end. Better than doing the naive thing of using a full length C2C transform and setting all the imaginary parts to 0 :-)

I would think that something similar would be cool on FPGA as well. Even having the ability to perform (batches of?) shorter transforms might help in a first step.

Quote:

2. Do you need floating-point calculations in the FFT core or you can use fixed-point / block-floating-point calculations also ? The floating-point calculations will result in more complex and more large (hardware resources) implementations.

Definitely floating point. What gives us confidence in the validity of the different approaches (CUDA, OpenCL, x86, ARM VFP, ARM NEON) is the fact that the results validate against each other.

Quote:

3. The maximum length of the FFT calculation is N=2^20 ?

See above

Quote:

The majority of hardware FFT cores are based on 2 architectures. One (in-place) to minimize the hardware resources and one (pipeline) to maximize the throughput. Both architectures needs at least (in-place) N*complex_word_length memory for storing the input, the intermediate results and the output of the FFT calculation. If N is too large (2^20) then a hardware implementation of the FFT core will not fit in a small/cheap FPGA device (even the in-place implementation needs 64Mbits memory for 64bit complex words - a large xilinx spartan6 device has at most ~5Mbit memory).
You need an external RAM to store the intermediate results which compromise the performance of the FFT core. There are some algorithms for very long FFT calculations (four-step, six-step) which may be more suitable for the case of an external RAM but I am not familiar with theseâ€¦

Yeah I saw the original paper suggesting these methods (four step & 6 step) recently. I think this might be very interesting. Also for trying to use FFT on the Epiphany coprocessor that, in addition to the FPGA, is also on the Parallella and has native floating point capabilities, but very limited RAM and bandwidth.

Quote:

4. If you choose to accelerate some EaH calculations with hardware accelerators then it is more efficient to use a SoC/FPGA like Xilinx Zynq. You can use the ARM processors for running the EaH client and the FPGA resources to implement some hardware accelerators. The problem with this solution is that the Xilinx Zynq device has limited FPGA resources for implementing a large core like 2^20-point FFT. You will need a large Zynq device like Z-7045 / Z-7100 which are not cheap. Furthermore, in a software/hardware partitioning it is more efficient (if possible) to implement a hardware accelerator which contain a path of calculations and not only one calculation (FFT) because the data transfers to and from the accelerator are costly. Is there any profiling information for the EaH client software? If the FFT calculation is the bottleneck and consume more than the 50% of the calculations performed then I think that one FFT hardware accelerator will be the case. If the bottleneck is in a path of calculation (containing the FFT) then I think that this path must be accelerated with a hardware core.

On CPUs, the FFT step is definitely the bottleneck with >> 50% of the share of overall runtime. Just as an example from the ARM world: A Raspberry Pi (overclocked to almost 1GHz) will do a BRP4 task in around 110k sec. A task has 6662 iterations of the outer loop that handles a 'signal template', the step that can be seen as our unit of work (preprocessing is not significant, runtime-wise). So let's say each template takes around 17 seconds (BTW it's about 1000 times faster on good GPUs ... yes....a few 10ms per template, faster than most pulsars spin ;-)!!).

The FFT step on the Raspberry Pi (currently using FFTW 3.3.2 which handles real-to-complex , N=3*2^22 DFT out-of-the-box) takes about 12 seconds (yes, per template!) so ca 70 % of the total runtime.

OK, Raspberry Pi uses an ARMv6 , modern ARMv7 cores found in SoC/FPGA combos should be faster and the proportions might be a bit different especially if they have NEON vector units, but you get the drift on what the potential for optimization is!!!

Cheers
HB

BackGroundMAN
Joined: 25 Feb 05
Posts: 58
Credit: 246,736,656
RAC: 0

So, if I understand

So, if I understand correctly, for the BRP4 client you perform several N-point real floating-point FFTs with N=3*2^22. To do that you calculate 3 complex FFTs of size N1=2^21 and then a radix-3 stage with extra twiddling to get the results.
I suppose that the major problem for a hardware implementation of the FFT core would be the memory and the bus throughput. If you need floating-point with 32bit words (I assume you need more ?) then you will need at least (64*2^21) 128Mbits memory inside the FPGA to implement an efficient in-place 2^21-point FFT core !! The biggest Zynq device (Z-7100) has ~24Mbits block RAMs while a cheap Zynq device (like Z-7020 found at Parallela has ~5Mbits block RAMs).
There are 2 solution for this problem. Either you can use an external RAM (DDR most probably) and reduce the throughput of the accelerator or you can split the FFT calculations in smaller pieces (I think this is more efficient). Also, the 21 (of 2^21) is not a good number if you like to use high-radix implementations.

If you implement an (2^15) 32K-point complex FFT core with a radix-8 module (radix-2^3 to save some complex multipliers) then you will need at least 2Mbits for the FFT module and some memory for the input/output FIFOs of the accelerator, assuming an efficient (conflict-free) in-place architecture. Keep in mind that you will need also at least (N/4)*64 ~= 520Kbits memory for the twiddle factors ROM (complex twiddle number is 32+32=64bits).

With this scheme you need to calculate 64 32K-point complex FFTs in the accelerator and then perform 6 radix-2 stages (or 3 radix-4 stages if possible) in software to get one 2^21-point FFT. So for one template you need 3*64 32K-point FFTs in the accelerator, 3*6 radix-2 stages (or 3*3 radix-4 stages) in software and the final radix-3 with extra twiddling in the software.

The hardware accelerator needs 5 stages of radix-2^3 so if we assume that the radix-2^3 processor has 15 pipeline registers (15 cycles for a radix-8 butterfly - due to more complex floating-point calculations) then it will perform a radix-8 stage (for all the dataset) in 4096*15 cycles and it will need 5*15*4096 = 307200 cycles to perform a 32K FFT calculation. If this FFT core can achieve 100MHz in the FPGA then you can calculate one 32K FFT in about 3 msec. You will need 3*64 32K FFTs for a template so you will need ~576ms only for the calculations performed at the hardware accelerator. All these calculations assume that you are able to provide 8 complex input data to the FFT accelerator and retrieve 8 complex output data at each cycle of 100MHz. So you will need (8*64*100) ~51.2 Gbps bus throughput (!!), assuming that you never send and retrieve data at the same time.

The Zynq devices have AXI buses to interconnect the ARM processor and DDR memory with the programmable logic (FPGA). These buses support up to 64bits words transfers so for this throughput you will need the buses to run at 800MHz (this is impossible) assuming you have 2 ports on the accelerator (I think there are two AXI-slave ports in the programmable logic). So if you have an AXI-bus at 100MHz you will need about 327*(1/bus_util_factor) us to move one 32K FFT frame to accelerator. For a reasonable bus utilization factor of 70% you will need ~560us to move the frame to accelerator and same time to move results back. So an additional 1ms is needed for the transfer of one 32K FFT frame. So you need 4 ms for each 32K FFT or 768ms for all the 32K FFTs (calculations and transfers).

Zynq SoC has a 256KB memory (TCM) for the communication between the ARM processor and the programmable logic and the ARM has an L2-cache of 512KB. To be able to perform the extra radix-2 stages in software you will need more than one 32K FFT frame in memory so you most probably transfer the FFT frames to and from the accelerator to the external DDR memory and may cost you some extra time.

After the calculation of 3*64 32K FFTs in about 1-1.2 sec (assuming DDR) you have to calculate the rest of the butterflies in software by accessing the data from the DDR memory. I assume that you will need another 1-2 sec to do that. So finally you will perform one 3*2^22 FFT (one template) in about 2-3 sec with this scheme or one BRP4 unit in 46-53Ksec (assuming that you will need an extra 5 sec for the non FFT calculations as you said 17-12).

If you need 64bit floating-point numbers for the calculations then you will need a lot of memory inside the FPGA (I think that Z-7020 will do the job), the double time for the transfers and most probably more pipeline registers in the radix-8 core processor.

These are some very draft estimations with several assumptions but you can get the main idea. There are several ways to do this in a Zynq device and most probably there are more efficient architectures. I do not know the calculations performed by other EaH clients and if there is any efficient way to implement hardware accelerators for these clients. By the way are there any documents/URLs to describe the calculations for each of the EaH clients and the bottlenecks?

I think that parallela board is very interesting and may result in a very efficient and low cost EaH accelerator. I already register to get an update when the board will be available again (I hope soon) :).

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6,416
Credit: 213,205,059
RAC: 209,384

Thank you kindly

Thank you kindly BackGroundMAN for your extremely pertinent observations and terrific knowledge. Indeed the nettle to be grasped is resources ! :-) :-)

Since we don't have devices with massive resources we must trick some lesser hardware into the game. Whether it be in hardware/FPGA and/or software/Zynq/Epiphany the cruel part is working memory vs. speed ( ancient adversaries ). I am slowly coming around to the view that to use Parallella as it exists currently for E@H, then that will likely be a blended solution of ARM-code/FPGA/Epiphany-code to make full use of the best features of the gadget. I am optimistic but one necessarily must consider that it won't actually work well enough. One can but try. However as noted there are some helpful aspects within the problem :

- one may assume a purely real vector input.

- the twiddle factors have unity magnitude.

- there is freedom to use whatever mixed-radix decomposition suits.

- Epiphany has amazing floating point speed ( especially fused multiply-add/subtract ) and blazing on-chip writes.

- FPGA versatility to hardware accelerate.

What I have especially pondered about this last few weeks is the generation/storage of the twiddle factors. Regardless of radix/decomposition strategy one still needs access to every power from 0 to N-1 of the Nth root of unity = exp[+/- (i*2*PI/N)]. So that's N complex numbers, basically the sine and the cosine of 2*PI*k/N for every k = 0 ..... N-1. Now :

- pre-calculation is sensible but do you store all N roots, as that would be a memory burden equal to the output vector ?

- if you know the sine you can get the cosine easy ( say via 1 - sin^2 ).

- one can use double angle formulae [ sin(A +/- B) = sin(A) * cos(B) +/- cos(A) * sin(B) and/or cos(A +/- B) = cos(A) * cos(B) -/+ sin(A) * sin(B) ] to form those by considering any angle as compound. Notably the roots are equally spaced around the unit circle in the Argand plane.

- hence if one is using radix-2, then consider k as an index expressed in binary form. That will have m bits where N = 2^m eg. k = 001010 .... 100100 ( ie. m = 22 for E@H )

- viewing such a form of k as a mask, then for each bit ( say at position q numbered starting at zero from the left to m-1 on the right ) :

(a) if = 0 ignore, however

(b) if = 1 trigger the inclusion of the angle theta(q) = [2*PI*(2^q)/N] as one factor in the compounded angle that makes up the given root. That is to say, if one adds up all those theta(q)'s where the bit mask = 1 then you get that angle on the Argand plane which represents that kth power of the base root 2*PI/N.

- if you suppose there are s bits that are of value 1, then one can by s-1 uses of a double angle formula finally arrive at the sine/cosine of any one of the roots*. Then do this :

(a) store pre-calculated values of the sine ( OR the cosine ) for every theta(q), for q = 0 to m-1, only m real values.

(b) Iterate over the bit mask of k

(c) For those s bits that are one, use a double angle formula to get the sine/cosine of the compounded angle from lookups on the pre-calculated values.

HOWEVER : since that might be as many as, for E@H, some 21 double-angle formula applications ( within a loop ) then accumulated rounding error is an important issue. Even though one has used 21 sequential fused multiply-add instructions. What I'm mulling over is storing pre-calculated values to a greater precision than required eg. 40bits if you're after a 32 bit final representation. That implies I'll have to stump up code to cope with that extension as Epiphany only does 32bit floats natively. :-) :-)

This may allow one or more Epiphany cores to take on the role of 'twiddle factor server(s)'. Fortunately the core register set will allow use of R16 through R63 ( 48 32-bit registers ) for my use. With which you could store up to 24 extended representation sine values for instance, and so cope with up to 2^24 point FFT's .... without having to store that many complex roots. Alas apart from PI, those twiddle's are the only firmly known numbers !

Cheers, Mike.

* Like any monetary currency, with a sufficient quantity of varying denominations one can make up any amount. Indeed there will some minimum number of coins/notes for any given total amount, and apart from mere ordering a unique combination too ....

( edit ) One could also cache most recently requested twiddles as it is likely to have phases of execution where similiar/related Nth roots have temporal co-incidence.

( edit ) Naturally an extended representation could indeed be double precision at 64 bits. The key error that I try to avoid here is being unable to disambiguate adjacent powers of the Nth root ie. kth indistinguishable from (k + 1)th . Else one destroys the merit of the N-point FFT to some lower granularity sampling. I'm assuming here that my method ( to be tested etc ) is improved over some power series method of determining the sines/cosines on the fly.

( edit ) FWIW : the commonest case is where s = m/2 ie. half of the bits representing k are non-zero ( and according to the binomial theorem that dwarfs by many orders the next commonest cases at m/2 -1 and m/2 + 1 ). So that will m/2 - 1 double angle formula invocations, that being about 10 for E@H.

( late edit ) to be even more efficient divide the bitfield ( say length 22 ) of k amongst several Epiphany cores, let them each compute the (co)sine of their part of the compounded angle, then aggregate. You could do, say, two bits per core and out of 16 cores still have several left over 16 - 11 = 5 cores to do the final compounding from the other's outputs. Just shootin' the breeze here ... :-)

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Bikeman (Heinz-...
Moderator
Joined: 28 Aug 06
Posts: 3,516
Credit: 591,827,237
RAC: 5,006

Hi! @BackGroundMAN :

Hi!

@BackGroundMAN : Thanks for expanding in so great detail on FPGA implementation. Very interesting indeed to get a feel for what is possible and what not .

We are fine with single precision. We checked that when we introduced the GPU app versions.

Quote:

By the way are there any documents/URLs to describe the calculations for each of the EaH clients and the bottlenecks?

But to get an overall picture of how it works, it's best to read the paper by Bruce Allen, Benjamin Knispel et al, downloadable from arXiv here: http://arxiv.org/abs/1303.0028

@Mike: funny you mention the twiddle factors, we actually use kind of a generalization of your trick in several parts of the E@H software. It's a nice trick so let's look at it in a rather general context:

The problem:

Let's say you need to evaluate a rather computationally expensive function f(x), at the values of the progression x_k = x0 + k*h , for k=0...N-1.

Let's further assume you need those values in a rather arbitrary order (not sequentially), you'll need them more than once per value x_k, and let's further assume you have not enough space to store a precomputed look-up-table of length N.

What can save you:

Let's assume you are lucky and have an "addition-theorem" for f, so that there is a function g such that

f(x+y) = g(f(x),f(y)), *
and g is much easier to compute than f(x) itself. [We have this for sin & cos , obviously!]

The shortcut:

We want to compute f(x_k).

Instead of looking at binary representation of k, look at it in radix M where M=ceil(sqrt(N))

then all our k in 0...N-1 are represented in radix M as a two digit number .

k=j*M+n for some j, n < M .

Well, well......so we need two lookup-tables each of length approx sqrt(N), one holding values f(x_jM) and one holding values f(x_n) .

For each x_k we want to compute,
- we find j,n so that k=j*M+n ,
- we lookup the values f(x_(jM)) and f(x_n) , and
- use our "addition-theorem" function g to get the final result f(x_k).

So in this case, for the price of two lookups and one evaluation of g, we get a reduction in table size from N to 2*sqrt(N)....nice.

Our two lookup-tables still store values of f evaluated at an arithmetic progression, so you could do this recursively, and have 4 lookups and tables of size sqrt(sqrt(N)) [or look at it as using a radix of SQRT(SQRT(N)) for representing the look-up index). Also if M is a power of two , finding the radix M representation of k is just a matter of shifting and masking bits, no real division is needed.

So yeah, the complex roots of unity should not be a concern.

Cheers
HB

* or in an even more general form f(x+y)=g(p(x),q(y)) where g is inexpensive and p and q are not prohibitively expensive either to compute (we'll need Sqrt(N) many values of p(x) and q(x)).

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6,416
Credit: 213,205,059
RAC: 209,384

@HB : Well there is indeed

@HB : Well there is indeed nothing new under the Sun. A wheel re-invented. While nodding off last night I was thinking "why have the Epiphany cores only evaluate for two bits of k each, there's got to be a better partition ?" and so I wake up and you've given me a beautiful path of generalisation to neatly extend that attack ! Thank you very much ! :-) :-)

As you say by going to SQRT(SQRT(SQRT( .... )) one can add levels of depth/recursion. As in theory at least any recursion can be unrolled, then for an array of processors a convenient match in sizes may well occur to parallelise much of that ..... with apologies to Baldrick : I may have a cunning plan. :-)

All this suggests a subtle movement of mental crosshairs. For the interim : instead of thinking 'what Parallella algorithm will best fit E@H' move to 'what is the best Parallella algorithm for large FFTs' and then look later at specific fitness to E@H purposes. Break to smaller steps then milestone it.

Now speaking of array geometry : I had earlier punted out ideas as regards functional/role pairings of adjacent Epiphany cores. After some thought I now think that the 'natural' grouping is in fact four processors but with a nod to Tetris/Chess* :

... so you still get nearest neighbour fast transfers around the 'H' cores ie. North/South and East/West & no 'diagonals'. In fact any 'local' service can be arranged thus using that pattern and permutations. Plus off-chip read/writes do not contend** as they only go East/West :

... thus one could 'land' and 'retrieve' quartered subsets of data, and as noted flicking to a radix-4 would be no sweat for the butterflies ( at a base level -1, +1, -i, +i are all merely swaps at most of real and imaginary components with/without a sign change ). One big problem with the prior 'pairing plan' was load balancing, being not symmetric at all - a real flogging for one core in particular, while the others would well and truly be waiting on that single pipe in/out bottleneck.

Cheers, Mike.

* Reminiscent of the 'queen distribution problem' ie. arrange same without any mutual challenge, giving the solution of a knight's move displacement arrangement. Also this can generalise to larger arrays eg. E64.

** At least within the array.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Bikeman (Heinz-...
Moderator
Joined: 28 Aug 06
Posts: 3,516
Credit: 591,827,237
RAC: 5,006

Hi! Just for the sake of

Hi!

Just for the sake of completeness, here's a reference to the paper mentioned earlier ("4-step" and "6-step" FFT):

David H. Bailey, "FFTs in external or hierarchical memory," Journal of Supercomputing, vol. 4, no. 1 (Mar 1990), pg. 23-35.

http://www.davidhbailey.com/dhbpapers/fftq.pdf

It has timings from FFT runs of various lengths on Cray-2 and Cray Y-MP machines. You can see that actually a modern Android smartphone is in the same league, FFT-performance-wise, as a Cray supercomputer back then, 25 years ago (more or less in line with Moore's "law").

Cheers
HB

BackGroundMAN
Joined: 25 Feb 05
Posts: 58
Credit: 246,736,656
RAC: 0

RE: Hi! @BackGroundMAN :

Quote:

Hi!

@BackGroundMAN : Thanks for expanding in so great detail on FPGA implementation. Very interesting indeed to get a feel for what is possible and what not .

We are fine with single precision. We checked that when we introduced the GPU app versions.

Quote:

By the way are there any documents/URLs to describe the calculations for each of the EaH clients and the bottlenecks?

But to get an overall picture of how it works, it's best to read the paper by Bruce Allen, Benjamin Knispel et al, downloadable from arXiv here: http://arxiv.org/abs/1303.0028

Thank you for the paper, it seems very interesting. I will start the reading when I found some time.

I have downloaded the source code several months ago and I managed to build the BRP4 client for CUDA and AMD64. I search for some sample workunits to profile the clients with no luck. Is there a way for a EaH workunit (downloaded from boinc) to run only some of the iterations ?

Thank you

ExtraTerrestria...
Joined: 10 Nov 04
Posts: 770
Credit: 500,704,254
RAC: 104,962

Reading a bit more about how Einstein works made me think: the FFT seems to be very suitable for number crunchers like GPUs. It's very regular code and we've got algorithms available which can keep the many execution units quite busy, given some caches and enough memory bandwidth. This might be the worst case to try to improve with FPGAs. I think they might do a lot better when some irregular code is involved, which regular CPUs / GPUs don't like. Something that benefits greatly from transitioning to an ASIC.

Sure, if you can create a better FFT on the FPGA yopu struck gold. But still.. I'm not sure there's any shortcut here to the "sea of cores", bandwidth and memory. And raw crunching power is what FPGAs are worst at.

MrS

Scanning for our furry friends since Jan 2002

Bikeman (Heinz-...
Moderator
Joined: 28 Aug 06
Posts: 3,516
Credit: 591,827,237
RAC: 5,006

Quote:

I have downloaded the source code several months ago and I managed to build the BRP4 client for CUDA and AMD64. I search for some sample workunits to profile the clients with no luck. Is there a way for a EaH workunit (downloaded from boinc) to run only some of the iterations ?

Thank you

Yup. For example, this is the command line for one of the Perseus Arm BRP5 units on one of my hosts:

```-i PA0049_00141_72.bin4 -i PA0049_00141_73.bin4 -t rand_PAS.bank.v3 -l PA0049_00141.zap -o results.cand0 -o results.cand1 -c status.cpt -A 0.04 -P 3.0 -f 175.0 -W -z
```

You see that there are two input (-i) and output (-o) arguments, this is a bundle of two sub-units, you'll want to test only on one, so drop the second -i filename and -o filename arguments.

Next you want to create a smaller template bank, the file that is passed via "-t rand_PAS.bank.v3"

Use a text editor or "head" in OSX or Linux or cygwin to create a new file, say rand_PAS.bank.v3.short , that contains only the first few lines of the original template bank, and use this file instead, so in our example you would try

```-i PA0049_00141_72.bin4  -t rand_PAS.bank.v3.short -l PA0049_00141.zap -o results.cand0 -c status.cpt -A 0.04 -P 3.0 -f 175.0 -W -z
```

as command line arguments.

Note that the app tests if the result file is already present (and stops if it is), so for repeated tests you need to rename/move the result file before starting the next test run.

Cheers
HB

Bikeman (Heinz-...
Moderator
Joined: 28 Aug 06
Posts: 3,516
Credit: 591,827,237
RAC: 5,006

RE: Sure, if you can create

Quote:

Sure, if you can create a better FFT on the FPGA yopu struck gold. But still.. I'm not sure there's any shortcut here to the "sea of cores", bandwidth and memory. And raw crunching power is what FPGAs are worst at.

MrS

I wonder whether Enigma@Home could be a project that could benefit from FPGAs, perhaps?

Cheers
HB