Parallella, Raspberry Pi, FPGA & All That Stuff

KF7IJZ
Joined: 27 Feb 15
Posts: 110
Credit: 6108311
RAC: 0

Several things...

First, I am now down 4 Pis in my farm, all due to SD card failure.  3 have been gone since before Thanksgiving, but I haven't had a chance to repair them.  I really, really need to move on to netbooting, at least for the Pi 3s.

I am trying to use this project as an excuse to learn Eagle PCB design software. I have an idea for micro discrete power supplies using a Murata OKI-78SR (http://www.mcmelectronics.com/product/MURATA-POWER-SOLUTIONS-OKI-78SR-5-1-5-W36-C-/124-10255) for 1.5A / board, or one of the larger OKR 6-10A modules for a lot less per board.  These would mount to the power pins similar to the Pimoroni Zero LiPo.  They could be fed from a 12V source rather easily, which is particularly applicable as the 24 port switch I purchased to build the next gen cluster on runs off 12V as well.  I am exploring replacing the stock PSU with a beefier one, as there is plenty of room in the rack switch for a larger supply.  Of course, my desire exceeds my time for hobbies these days.

Finally, there is a new SBC, the Asus Tinker Board: http://arstechnica.com/gadgets/2017/01/asus-tinker-board-price-specs-release-date/ .  Quad-core Rockchip with Cortex-A17 cores (32-bit) clocked at 1.8 GHz, better GPU, and 2 GB of RAM.  Also $70, so we'll see if it's twice as fast as a Pi 3.  The magic would be if we could get GPU crunching on it!

My YouTube Channel: https://www.youtube.com/user/KF7IJZ
Follow me on Twitter: https://twitter.com/KF7IJZ

Phil-Pi
Joined: 7 Jan 17
Posts: 32
Credit: 867513
RAC: 0

I haven't even been able to get one to boot from USB yet, and haven't tried netboot. But I'm just starting to learn this whole Pi thing. We'll get there.

I've got one Pi that refuses to go past 3% without erroring out. I'll be ordering a replacement tomorrow so I can have full Blades for testing.

At first glance, the power supply numbers look good. At full bore with all cores crunching, each Pi appears to be pulling about 700 mA.

Tom Rinehart
Joined: 17 Jun 09
Posts: 9
Credit: 6591748
RAC: 0

I've 3D printed a rack that holds 4 Pis and a power board built on an Adafruit proto board using 2 Murata 78SR DC/DC converters.  I've been using this part:

http://www.mouser.com/ProductDetail/Murata-Power-Solutions/OKI-78SR-5-15-W36-C/?qs=sGAEpiMZZMslBFvnKnOhcsAPP%252bIEe4SP

I've been running 2 Pis off of one converter.
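As a rough check on why two Pis per converter is a sensible fit, here is a small Python sketch. It only uses figures quoted in this thread (1.5 A per OKI-78SR, about 700 mA per Pi at full load); the function names are made up for illustration, and it ignores start-up surges and any attached USB devices.

```python
import math

CONVERTER_AMPS = 1.5   # OKI-78SR output rating quoted earlier in the thread
PI_AMPS = 0.7          # roughly 700 mA per Pi with all cores crunching

def pis_per_converter(converter_amps=CONVERTER_AMPS, pi_amps=PI_AMPS):
    """Whole number of Pis one converter can feed at full load."""
    return math.floor(converter_amps / pi_amps)

def headroom_amps(converter_amps=CONVERTER_AMPS, pi_amps=PI_AMPS):
    """Current left over once that many Pis are drawing full load."""
    return converter_amps - pis_per_converter(converter_amps, pi_amps) * pi_amps

print(pis_per_converter(), round(headroom_amps(), 2))
```

On those numbers a third Pi would not fit, and two leave only a small margin.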

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6540
Credit: 286821850
RAC: 89593

OK. Long time, no post. I've just noticed that the Epiphany V chip is 'under construction'. Notes of interest, especially with regard to FFT usage :

- now 64KB per core ( up from 32KB ).

- full 64 bit operands and addressing.

- 64 bits per cycle for intra-core moves and extra-core to/from bus.

- the Network On Chip part of a core is now 136 bits wide.

The rest is pretty much consistent with III/IV. Of course the key thing to await is in what form will it be issued as a product, especially how many V's per board ? I will find some envelopes to write upon the back of and also have to dig out the LEGO again .... :-)

Cheers, Mike.

( edit ) TSMC is here ( I believe ) and 4-5 months from last October is about now. But we've heard that before from Adapteva. :-)))

( edit ) FWIW : IIRC last time they used GlobalFoundries ie. Silicon Valley. It was initially owned by AMD.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6540
Credit: 286821850
RAC: 89593

Hmmmm. So now I'm somewhat more interested in Epiphany again, if V comes through. You may recall, or wish to forget, an analysis I did of the E16 variant of III for FFTs. The basic disappointment then was insufficient memory per RISC core and not enough cores. This would not really grasp the FFT tasks at E@H ie. ~ 2^22 data points, at least not without considerable off-chip assistance/processing that yields mainly serial behaviour rather than parallel advantages.

Back Of The Envelope :

- 64KB per core in four banks of 16KB.

- you can have two of those banks for pure data. The other two are for code and stack.

- using single precision floating point you have 4 bytes per operand.

- that gives 32KB / 4 = 8K of single precision operands ( SPO ).

- per original time series data point you then need : 2 SPO for the data itself + 2 SPO for a twiddle factor + 2 SPO for the emitted result = 6 SPO.

( Express result as amplitude + phase per frequency value, or a coefficient each for a sine and a cosine. Remembering that in the amalgamation phase we are generating complex numbers from a lower order FFT to create a higher order FFT. Depending on what you are up to you may ignore phase at the very end* of the analysis but you can't discard it en-route. )

- thus, with some headroom, you can manage 1K = 1024 = 2^10 data points per core

- and we have 2^10 cores !

- hence, at least on room-for-data grounds, we can grasp a 2^20 point FFT per Epiphany V chip.
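The arithmetic in the list above can be checked with a few lines of Python. This is only a sketch of the room-for-data estimate: the constant names are invented here, and the split of two banks for data and two for code and stack is taken from the post, not measured.

```python
BANK_BYTES = 16 * 1024   # one 16KB bank
DATA_BANKS = 2           # two banks for data, the other two for code and stack
BYTES_PER_SPO = 4        # single precision operand
SPO_PER_POINT = 6        # 2 data + 2 twiddle + 2 result
CORES = 1024             # 2^10 cores on the Epiphany V

def points_per_core():
    """Largest power-of-two point count that fits in the data banks."""
    spo_available = DATA_BANKS * BANK_BYTES // BYTES_PER_SPO   # 8K operands
    max_points = spo_available // SPO_PER_POINT                # a bit over 1K
    # Round down to a power of two, the natural size for a radix-2 FFT.
    return 1 << (max_points.bit_length() - 1)

per_core = points_per_core()
per_chip = per_core * CORES
print(per_core, per_chip, per_chip == 2 ** 20)
```

The rounding down from about 1365 points to 1024 is where the "some headroom" comes from.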

Also keep in mind that a Parallella E16 board/variant had an ARM and an FPGA, oodles of commonly addressed DRAM ( 1GB ). Plenty to manage at least two vectors, each of 2^22 4 byte operands ( 32MB total ) ie. input and output no sweat. Interesting ...... :-))

Cheers, Mike.

( edit ) Two banks of 16KB each immediately suggests keeping the real part of operands in one and the imaginary part in the other. For loops, per operand, that's the same index but different constant offsets.
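A minimal sketch of that split layout, in Python for readability: the real and imaginary parts live in two separate arrays ( standing in for the two 16KB banks ) and a radix-2 butterfly touches both at the same indices. The function name and argument order are made up for illustration; on the chip the two arrays would be the same index off two different base addresses.

```python
def butterfly_split(re, im, i, j, w_re, w_im):
    """In-place radix-2 butterfly on elements i and j of split re/im arrays.

    Computes x[i], x[j] = x[i] + w*x[j], x[i] - w*x[j] with the complex
    multiply written out on the separate real and imaginary parts.
    """
    t_re = w_re * re[j] - w_im * im[j]
    t_im = w_re * im[j] + w_im * re[j]
    re[j], im[j] = re[i] - t_re, im[i] - t_im
    re[i], im[i] = re[i] + t_re, im[i] + t_im

re_bank = [1.0, 2.0]     # real parts, "bank A"
im_bank = [0.0, 0.0]     # imaginary parts, "bank B"
butterfly_split(re_bank, im_bank, 0, 1, 0.0, -1.0)   # twiddle w = -i
print(re_bank, im_bank)
```

Each line of the butterfly is a multiply followed by an add or subtract, which is the pattern a fused multiply-add instruction is suited to.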

* Or at the very start for that matter.

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6540
Credit: 286821850
RAC: 89593

More thinking out loud, if you will bear it.

So there are several main tasks :

- divide the input vector.

- disperse to 1024 cores.

- produce a 1024 point FFT per core.

- combine the results from 1024 cores.

- return an output vector to host.
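The five tasks above can be tried at toy scale on a host before worrying about the chip. The Python sketch below is an assumption-laden illustration, not the Epiphany implementation: it uses a decimation-in-time split ( core r gets every P-th sample starting at r ), a direct DFT stands in for each core's transform, and the combine step is done as one flat sum rather than in staged butterflies. All function names are hypothetical.

```python
import cmath

def dft(x):
    """Direct O(n^2) DFT, standing in for the per-core transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]

def split_fft(x, cores):
    """Transform x by dividing it across `cores` independent sub-transforms.

    len(x) must be a multiple of `cores`.
    """
    n = len(x)
    m = n // cores                          # points handled by each core
    # Divide and disperse: core r receives samples r, r + cores, r + 2*cores, ...
    parts = [x[r::cores] for r in range(cores)]
    # Each core transforms its own m points independently.
    sub = [dft(p) for p in parts]
    # Combine: X[k] = sum over r of W^(r*k) * F_r[k mod m], W = exp(-2*pi*i/n).
    out = []
    for k in range(n):
        acc = 0j
        for r in range(cores):
            acc += cmath.exp(-2j * cmath.pi * r * k / n) * sub[r][k % m]
        out.append(acc)
    return out

x = [complex(t, -0.5 * t) for t in range(16)]
err = max(abs(a - b) for a, b in zip(dft(x), split_fft(x, 4)))
print(err < 1e-9)
```

The middle step needs no communication between cores; all of the inter-core traffic, and all of the twiddles that span the full transform size, sit in the combine step.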

The primary key to efficiency is the management/placement of the twiddle factors. These are powers of the Nth root of unity ( complex number on the unit circle in the z-plane ). N being the total transform size eg. 2^22. All N of them are needed from the zeroth power right through to the (2^22 - 1)th power. This devolves to finding the sine and cosine of every ( radian ) angle between 0 and 2*PI in N equal steps. However these are not required everywhere and all of the time. Indeed a given core is only ever going to require some subset of those N twiddles. Therein lies some hope ....
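To make the "some subset" point concrete, here is a short Python sketch. It assumes a plain radix-2 decimation-in-time FFT and the usual negative-exponent sign for the forward transform; both are assumptions of the sketch, and the function names are invented. For each stage it lists which entries of the full table of N twiddles are actually touched.

```python
import math

def twiddle(n, k):
    """k-th power of the n-th root of unity as (real, imag), forward sign."""
    angle = 2.0 * math.pi * k / n
    return math.cos(angle), -math.sin(angle)

def stage_twiddle_indices(n, stage):
    """Indices into the full n-entry twiddle table used by one radix-2 stage.

    Stage 1 combines pairs, stage 2 combines groups of 4, and so on.
    """
    span = 1 << stage          # butterfly group size at this stage
    stride = n // span         # step through the full-size table
    return [j * stride for j in range(span // 2)]

for stage in (1, 2, 3):
    print(stage, stage_twiddle_indices(8, stage))
```

Early stages use very few distinct twiddles and only the final stage reaches N/2 of them, so a core working on a 1024 point sub-transform never needs more than 512 distinct values for that part of the job.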

I think I will have to come up with a "process encoding scheme" which a given core can refer to at some point in time and thus deduce what are the twiddles it either does need now or, even better, will need soon. In effect this will situate it within the overall transform algorithm ( that the entire RISC node array is engaged with ) ie. which decimation subset is it handling and what temporal stage it has achieved.

So who generates the twiddles, and when ? Ideally statically ie. at compilation and thus pre-loaded to the cores before triggering the whole chip to process. I'll investigate how many a given core would need and whether that could fit into the available data banks ( maybe ). Otherwise a base set of twiddles can be statically loaded from which others can be generated on the fly ( double angle formulae ). It is a moot point to be studied as to whether a given core entirely generates its own cache of twiddles or whether twiddle sharing may be adopted. Which is quicker ? Which can be spatially afforded ?
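One way the on-the-fly option could look, sketched in Python: start from a single stored base angle and reach any power using only the double angle and angle addition identities, so no sine or cosine calls are needed after the base value is loaded. The function names are hypothetical and this is only one of several possible schemes.

```python
import math

def double_angle(c, s):
    """Given (cos a, sin a), return (cos 2a, sin 2a)."""
    return c * c - s * s, 2.0 * s * c

def add_angle(c1, s1, c2, s2):
    """Given (cos a, sin a) and (cos b, sin b), return (cos(a+b), sin(a+b))."""
    return c1 * c2 - s1 * s2, s1 * c2 + c1 * s2

def power_from_base(n, k):
    """(cos, sin) of the angle 2*pi*k/n, built from the base angle 2*pi/n
    using only multiplies and adds."""
    base = (math.cos(2.0 * math.pi / n), math.sin(2.0 * math.pi / n))
    result = (1.0, 0.0)                    # angle zero
    while k:
        if k & 1:                          # this binary digit of k is set
            result = add_angle(*result, *base)
        base = double_angle(*base)         # base now holds twice the angle
        k >>= 1
    return result

c, s = power_from_base(1024, 300)
print(abs(c - math.cos(2 * math.pi * 300 / 1024)) < 1e-10)
```

Rounding error grows with each doubling, so in single precision it would be worth comparing against a statically computed table before trusting long chains.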

Overall I think it has to be done in assembler for maximum benefit to leverage the known patterns inherent in FFTs. Particularly as the Epiphany has epic fused-multiply-add instruction capability ( FMA3 to be exact ). However that has to mix with the existing SDK elements especially via the Application Binary Interface so that custom code doesn't trip up ( or vice versa ) the extant system library procedures/functions written in C. The ABI is a sort of gentleman's agreement at machine level detail about which registers do what and whether the caller or callee is responsible for saving & restoring values, how to receive and return values with subroutines, what is off limits and what is not. Etc.

I believe I will whip out the E16 and have a play .... :-)

Cheers, Mike.

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6540
Credit: 286821850
RAC: 89593

So here's a thought or nine more :

- the Epiphany V will realise in actual silicon 1/4 of the entire Epiphany architecture ( as patented say ). The total design limit is 64 x 64 = 4096 cores. Think of that entirety as virtual space into which a real device may be inserted/constructed. 

- a given subset will be ( hardware ) mapped to some sub-array of addresses within said architecture to produce a particular SoC.

- software : each core at run time may be given the statically bound subset of the libraries as defined in the provided SDK. That is to say each core may commence operation with its own little copy of an operating system ( Epiphany Run-Time library & stuff ).

- if so, that is of course convenient because someone else has done all that aspect of the work for you. The libraries are written in C, have an ABI for custom assembler code as mentioned, have some awesome & optimised functions, and will compile smoothly with any application one may write in C & include for some purpose.

- you don't have to do that at all. One may choose to crawl over broken glass on hands and knees to designate and execute every assembly level specified operation from go to whoa on the entire Epiphany V. Just use a host side loader to initialise the core array and then trigger the go button. LOL. What A Great Plan. :-)

- now from the point of view of a given core, some memory is addressable which doesn't truly lie within another core in the physical core array. It may map externally to the chip in fact, but logically seems to be in the Epiphany design space ( a large flat byte granular space ).  

- fortunately there is a host side to the SDK too ( Epiphany Hardware Abstraction Layer ). Here we would be more relaxed in constraints : plenty of memory, plenty of time, at least a decent ARM system to play with ( but could even be a Linux PC say ). No reason to optimise that aspect whatsoever.

- now an interesting question becomes : what is the memory footprint per-core to allow for the 'luxury' of those library elements being present ? In detail that depends on the intra-library structure, exact choice of routines referred to in code, some linker flags etc.

[ image : epiphany_sdk.jpg ]

... that's the memory used within the red box in the above image. A task then would be to create a "Hello World" program and then gauge it ..... the provided debugger ( e-gdb, based on gdb ) should do that nicely. By subtraction, and with alignment constraints, one can then estimate more closely the maximum allowable number of operands for an FFT.
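The subtraction itself is simple enough to write down ahead of the measurement. In this Python sketch the footprint is a placeholder argument to be filled in once it has been gauged; nothing here is a measured value, and the 8 byte alignment default is an assumption chosen to suit double-word accesses.

```python
def max_operands(code_and_stack_bytes, core_bytes=64 * 1024,
                 bytes_per_operand=4, align=8):
    """Operands that fit once the code and stack footprint is subtracted."""
    # Round the footprint up to the next alignment boundary first.
    used = -(-code_and_stack_bytes // align) * align
    free = max(core_bytes - used, 0)
    return free // bytes_per_operand

# With exactly two of the four 16KB banks given over to code and stack,
# this reproduces the 8K single precision operands estimated earlier.
print(max_operands(32 * 1024))
```

A smaller measured footprint would push the usable operand count above 8K, a larger one below it.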

Cheers, Mike.

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6540
Credit: 286821850
RAC: 89593

FWIW : the V is being produced in 16nm FinFET.

Correction : "The total design limit is 64 x 64 = 4096 cores". Nope, was. Is now 64-bit addressing, was 32-bit, so with this expansion one can do one billion-ish cores ..... and 1 PetaByte memory. In a special magical country called Potentia that is. :-))
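For what it's worth, the numbers hang together if each core is assumed to own a 1MB window of the flat address space. That is how the 32-bit parts divide up ( 12 bits of core ID above 20 bits of local address ); carrying the same window size over to the V is an assumption of this Python sketch, not something confirmed here.

```python
import math

MIB = 1 << 20                         # assumed address window per core

cores_32bit = (1 << 32) // MIB        # cores addressable with 32-bit addresses
side = math.isqrt(cores_32bit)        # edge length of the square core array
cores_quoted = 1 << 30                # "one billion-ish" cores
bytes_total = cores_quoted * MIB      # address space those cores would span

print(cores_32bit, side, bytes_total == 1 << 50)
```

So 2^30 cores at 1MB each spans 2^50 bytes, which is the 1 PetaByte figure.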

Cheers, Mike.

ML1
Joined: 20 Feb 05
Posts: 347
Credit: 86314215
RAC: 213

Mike Hewson wrote:

FWIW : the V is being produced in 16nm FinFET.

Correction : "The total design limit is 64 x 64 = 4096 cores". Nope, was. Is now 64-bit addressing, was 32-bit, so with this expansion one can do one billion-ish cores ..... and 1 PetaByte memory. In a special magical country called Potentia that is. :-))

Cheers, Mike.

Worth an email and a giggle to get a demo board with one or four of them? Science grant also so that you can enjoy a sabbatical to get the numbers together quickly?...

Keep searchin,

Martin

See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6540
Credit: 286821850
RAC: 89593

ML1 wrote:
Science grant also so that you can enjoy a sabbatical to get the numbers together quickly?...

I'd love a grant. As does Mr Olofsson .... however his role as 'CEO of Adapteva' is being referred to in the past tense. He hasn't posted/tweeted/etc anywhere I can find since the day he got that job. Hence the Potentia comment alas. Oh well. :-((

Now if you can fund a sabbatical for me to slowly drink whiskey on a tropical beach ( deck chair, palm tree, panama hat ), producing metrics for the local horizon, then that's a different matter again. I'd promise to send reports via postcard.  :-)

Cheers, Mike.
