Parallella, Raspberry Pi, FPGA & All That Stuff

KF7IJZ

Joined: 27 Feb 15

Posts: 110

Credit: 6108311

RAC: 0

Several things... First, I

24 Jan 2017 2:10:29 UTC

Message 154547


Several things...

First, I am now down 4 Pis in my farm, all due to SD card failure. 3 have been gone since before Thanksgiving, but I haven't had a chance to repair them. I really really really really really need to move on to netbooting at least the Pi 3s.

I am trying to use this project as an excuse to learn Eagle PCB Design software as I have an idea for micro discrete power supplies using a Murata OKI-78SR (http://www.mcmelectronics.com/product/MURATA-POWER-SOLUTIONS-OKI-78SR-5-1-5-W36-C-/124-10255) for 1.5A / board or one of the larger OKR 6-10A modules for a lot less / board. These would mount to the power pins similar to the Pimoroni Zero LiPo. They could be fed from a 12V source rather easily, and this is particularly applicable as the 24 port switch I purchased to build the next gen cluster on runs off 12V as well. I am exploring replacing the stock PSU with a beefier one as there is plenty of room in the rack switch for a larger supply. Of course, my desire exceeds my time for hobbies these days.

Finally, there is a new SBC - the Asus Tinker Board - http://arstechnica.com/gadgets/2017/01/asus-tinker-board-price-specs-release-date/ . Quad-core Rockchip RK3288 (Cortex-A17, 32-bit) clocked at 1.8 GHz, better GPU, and 2 GB of RAM. Also $70 so we'll see if it's twice as fast as a Pi 3. The magic would be if we could get GPU crunching on it!

My YouTube Channel: https://www.youtube.com/user/KF7IJZ
Follow me on Twitter: https://twitter.com/KF7IJZ

Phil-Pi

Joined: 7 Jan 17

Posts: 32

Credit: 867513

RAC: 0

I haven't even been able to

24 Jan 2017 6:02:29 UTC

Message 154555


I haven't even been able to get one to boot from USB yet, and haven't tried netboot. But I'm just starting to learn this whole Pi thing. We'll get there.

I've got one Pi that refuses to go past 3% without erroring out. I'll be ordering a replacement tomorrow so I can have full Blades for testing.

At first glance, the power supply numbers look good. At full bore with all cores crunching, each Pi appears to be pulling about 700 mA.

Tom Rinehart

Joined: 17 Jun 09

Posts: 9

Credit: 6591748

RAC: 0

I've 3D printed a rack that

30 Jan 2017 21:55:25 UTC

Message 154869 in response to message 154547


I've 3D printed a rack that holds 4 Pis and a power board built on an Adafruit proto board using 2 Murata OKI-78SR DC/DC converters. I've been using

http://www.mouser.com/ProductDetail/Murata-Power-Solutions/OKI-78SR-5-15-W36-C/?qs=sGAEpiMZZMslBFvnKnOhcsAPP%252bIEe4SP

I've been running 2 Pis off of one converter.

Mike Hewson

Moderator

Joined: 1 Dec 05

Posts: 6588

Credit: 318093094

RAC: 399368

OK. Long time, no post. I've

23 Mar 2017 3:18:00 UTC

Message 156555


OK. Long time, no post. I've just noticed that the Epiphany V chip is 'under construction'. Notes of interest, especially with regard to FFT usage :

- now 64KB per core ( up from 32KB ).

- full 64 bit operands and addressing.

- 64 bits per cycle for intra-core moves and extra-core to/from bus.

- the Network On Chip part of a core is now 136 bits wide.

The rest is pretty much consistent with III/IV. Of course the key thing to await is in what form will it be issued as a product, especially how many V's per board ? I will find some envelopes to write upon the back of and also have to dig out the LEGO again .... :-)

Cheers, Mike.

( edit ) TSMC is here ( I believe ) and 4-5 months from last October is about now. But we've heard that before from Adapteva. :-)))

( edit ) FWIW : IIRC last time they used GlobalFoundries ie. Silicon Valley. Was initially owned by AMD.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Mike Hewson

Moderator

Joined: 1 Dec 05

Posts: 6588

Credit: 318093094

RAC: 399368

Hmmmm. So now I'm somewhat

24 Mar 2017 2:43:00 UTC

Message 156574


Hmmmm. So now I'm somewhat more interested in Epiphany again, if V comes through. You may recall, or wish to forget, an analysis I did of the E16 variant of III for FFTs. The basic disappointment then was insufficient memory per RISC core and not enough cores. This would not really grasp the FFT tasks at E@H ie. ~ 2²² data points, at least not without considerable off-chip assistance/processing that yields mainly serial behaviour rather than parallel advantages.

Back Of The Envelope :

- 64KB per core in four banks of 16KB.

- you can have two of those banks for pure data. The other two are for code and stack.

- using single precision floating point you have 4 bytes per operand.

- that gives 32KB / 4 = 8K of single precision operands ( SPO ).

- per original time series data point you then need : 2 SPO for the data itself + 2 SPO for a twiddle factor + 2 SPO for the emitted result = 6 SPO.

( Express result as amplitude + phase per frequency value, or a coefficient each for a sine and a cosine. Remembering that in the amalgamation phase we are generating complex numbers from a lower order FFT to create a higher order FFT. Depending on what you are up to you may ignore phase at the very end* of the analysis but you can't discard it en-route. )

- thus, with some headroom, you can manage 1K = 1024 = 2¹⁰ data points per core

- and we have 2¹⁰ cores !

- hence, at least on room-for-data grounds, we can grasp a 2²⁰ point FFT per Epiphany V chip.
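The envelope arithmetic above can be checked in a few lines. This is only a sketch using the figures quoted in the post; the constant names are made up for illustration, and rounding down to a power of two is an assumption that a radix-2 style transform is wanted.

```python
# Back-of-the-envelope memory budget for one Epiphany V core,
# using only the figures quoted in the post above.
BANK_BYTES = 16 * 1024   # one of the four 16KB banks
DATA_BANKS = 2           # two banks for data, the other two for code and stack
BYTES_PER_SPO = 4        # one single precision operand
SPO_PER_POINT = 6        # 2 data + 2 twiddle + 2 result
CORES = 1024             # 2**10 cores per chip

# Operands that fit in the two data banks.
spo_available = DATA_BANKS * BANK_BYTES // BYTES_PER_SPO

# Data points that fit if each one costs six operands.
max_points = spo_available // SPO_PER_POINT

# Round down to a power of two (assumption: power-of-two FFT sizes).
points_per_core = 1 << (max_points.bit_length() - 1)

# Operands left over as headroom once those points are placed.
headroom_spo = spo_available - points_per_core * SPO_PER_POINT

# Total points held across the whole chip.
chip_points = points_per_core * CORES

print(spo_available, max_points, points_per_core, headroom_spo, chip_points)
```

The leftover operands are the "some headroom" in the estimate; any library code that spills into the data banks would have to come out of that margin.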

Also keep in mind that a Parallella E16 board/variant had an ARM and an FPGA, oodles of commonly addressed DRAM ( 1GB ). Plenty to manage at least two vectors each 2²² of 4 byte operands ( 32MB total ) ie. input and output no sweat. Interesting ...... :-))

Cheers, Mike.

( edit ) Two banks of 16KB each immediately suggests keeping the real part of operands in one and the imaginary part in the other. For loops, per operand, that's the same index but different constant offsets.

* Or at the very start for that matter.
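A rough illustration of the real-bank / imaginary-bank idea: the two Python lists below stand in for the two 16KB banks, and one butterfly touches the same index in each. The function name and the sample values are hypothetical; on the real chip this would be two float arrays at fixed offsets, and nothing here reflects actual Epiphany code.

```python
import cmath

def butterfly_split(re, im, i, j, w_re, w_im):
    """In-place radix-2 butterfly on elements i and j of split re/im arrays."""
    # t = w * x[j], written out as real multiplies and adds.
    t_re = w_re * re[j] - w_im * im[j]
    t_im = w_re * im[j] + w_im * re[j]
    # x[j] = x[i] - t must be computed before x[i] is overwritten.
    re[j], im[j] = re[i] - t_re, im[i] - t_im
    # x[i] = x[i] + t
    re[i], im[i] = re[i] + t_re, im[i] + t_im

# "Bank" holding real parts and "bank" holding imaginary parts.
re_bank = [1.0, 3.0]
im_bank = [2.0, -1.0]

# One twiddle factor, here the first power of the 8th root of unity.
w = cmath.exp(-2j * cmath.pi / 8)
butterfly_split(re_bank, im_bank, 0, 1, w.real, w.imag)
print(re_bank, im_bank)
```

Each of the four products in the twiddle multiply feeds straight into an add or subtract, which is the pattern a fused multiply-add instruction is suited to.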


Mike Hewson

Moderator

Joined: 1 Dec 05

Posts: 6588

Credit: 318093094

RAC: 399368

More thinking out loud, if

29 Mar 2017 1:05:00 UTC

Message 156706 in response to message 156574


More thinking out loud, if you will bear it.

So there are several main tasks :

- divide the input vector.

- disperse to 1024 cores.

- produce a 1024 point FFT per core.

- combine the results from 1024 cores.

- return an output vector to host.
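The task list above can be mimicked on a small scale. The sketch below assumes a decimation-in-time split (core p gets every P-th sample starting at p), which is one possible choice and not something the post fixes. The per-core transform is a plain O(n²) DFT for brevity, and `split_fft`, `dft` and the test signal are illustrative names and values only.

```python
import cmath

def dft(x):
    """Direct O(n^2) DFT; stands in for the per-core FFT and serves as a reference."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def split_fft(x, cores):
    """Transform x by splitting it over `cores` workers and recombining."""
    n = len(x)
    assert n % cores == 0, "length must divide evenly across the cores"
    m = n // cores  # points per core

    # Divide and disperse: core p receives samples p, p + cores, p + 2*cores, ...
    chunks = [x[p::cores] for p in range(cores)]

    # Each core produces an m-point transform of its own chunk.
    partial = [dft(chunk) for chunk in chunks]

    # Combine: output k sums every core's bin (k mod m), each weighted by
    # the twiddle W_N^(p*k) that belongs to core p.
    out = []
    for k in range(n):
        acc = 0j
        for p in range(cores):
            twiddle = cmath.exp(-2j * cmath.pi * p * k / n)
            acc += twiddle * partial[p][k % m]
        out.append(acc)
    return out

# Small stand-in for the real problem: 64 points over 8 "cores".
signal = [complex(i % 5, -(i % 3)) for i in range(64)]
spectrum = split_fft(signal, cores=8)
reference = dft(signal)
max_error = max(abs(a - b) for a, b in zip(spectrum, reference))
print(max_error)
```

Note that core p only ever needs twiddles of the form W_N^(p*k), which is the "subset of those N twiddles" point made just below. The combine loop here is written naively; on real hardware it would itself be organised as butterflies.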

The primary key to efficiency is the management/placement of the twiddle factors. These are powers of the Nth root of unity ( complex number on the unit circle in the z-plane ). N being the total transform size eg. 2²². All N of them are needed from the zeroth power right through to the (2²² - 1)th power. This devolves to finding the sine and cosine of every ( radian ) angle between 0 and 2*PI in N equal steps. However these are not required everywhere and all of the time. Indeed a given core is only ever going to require some subset of those N twiddles. Therein lies some hope ....

I think I will have to come up with a "process encoding scheme" which a given core can refer to at some point in time and thus deduce what are the twiddles it either does need now or, even better, will need soon. In effect this will situate it within the overall transform algorithm ( that the entire RISC node array is engaged with ) ie. which decimation subset is it handling and what temporal stage it has achieved.

So by whom & when are the twiddles generated ? Ideally statically ie. at compilation and thus pre-loaded to the cores before triggering the whole chip to process. I'll investigate how many a given core would need and could that fit into available data banks ( maybe ). Otherwise a base set of twiddles can be statically loaded from which others can be generated on the fly ( double angle formulae ). It is a moot point to be studied as to whether a given core entirely generates its own cache of twiddles or whether twiddle sharing may be adopted. Which is quicker ? Which can be spatially afforded ?
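To make the "generate on the fly" option concrete, here is a sketch of building twiddles from a single stored base value. The double angle step is the one named in the post; the angle addition step is an extra swapped in here so that every power can be reached, not only the doubled ones. All function names are hypothetical.

```python
import math

def base_twiddle(n):
    """cos and sin of the base angle -2*pi/n (first power of the n-th root of unity)."""
    theta = -2.0 * math.pi / n
    return math.cos(theta), math.sin(theta)

def double_angle(c, s):
    """Twiddle for angle 2t from the twiddle for angle t (double angle formulae)."""
    return c * c - s * s, 2.0 * s * c

def add_angle(c1, s1, c2, s2):
    """Twiddle for angle a + b from the twiddles for a and b (angle addition formulae)."""
    return c1 * c2 - s1 * s2, s1 * c2 + c1 * s2

def twiddles_from_base(n, count):
    """First `count` twiddles of an n-point transform, by repeated angle addition."""
    c1, s1 = base_twiddle(n)
    table = [(1.0, 0.0)]  # zeroth power is 1 + 0i
    for _ in range(count - 1):
        c, s = table[-1]
        table.append(add_angle(c, s, c1, s1))
    return table

table = twiddles_from_base(1024, 1024)
# Worst deviation from twiddles computed directly with cos and sin.
worst = max(
    max(abs(c - math.cos(-2.0 * math.pi * k / 1024)),
        abs(s - math.sin(-2.0 * math.pi * k / 1024)))
    for k, (c, s) in enumerate(table)
)
print(worst)
```

Rounding error accumulates along the recurrence. Python floats are double precision, so the drift is tiny here; in single precision on the chip it would be larger, which is one argument for preloading a coarser base set and only generating short runs from each entry.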

Overall I think it has to be done in assembler for maximum benefit to leverage the known patterns inherent in FFTs. Particularly as the Epiphany has epic fused-multiply-add instruction capability ( FMA3 to be exact ). However that has to mix with the existing SDK elements especially via the Application Binary Interface so that custom code doesn't trip up ( or vice versa ) the extant system library procedures/functions written in C. The ABI is a sort of gentleman's agreement at machine level detail about which registers do what and whether the caller or callee is responsible for saving & restoring values, how to receive and return values with subroutines, what is off limits and what is not. Etc.

I believe I will whip out the E16 and have a play .... :-)

Cheers, Mike.


Mike Hewson

Moderator

Joined: 1 Dec 05

Posts: 6588

Credit: 318093094

RAC: 399368

So here's a thought or nine

7 Apr 2017 5:11:00 UTC

Message 156966


So here's a thought or nine more :

- the Epiphany V will realise in actual silicon 1/4 of the entire Epiphany architecture ( as patented say ). The total design limit is 64 x 64 = 4096 cores. Think of that entirety as virtual space into which a real device may be inserted/constructed.

- a given subset will be ( hardware ) mapped to some sub-array of addresses within said architecture to produce a particular SoC.

- software : each core at run time may be given the statically bound subset of the libraries as defined in the provided SDK. That is to say each core may commence operation with its own little copy of an operating system ( Epiphany Run-Time library & stuff ).

- if so, that is of course convenient because someone else has done all that aspect of the work for you. The libraries are written in C, have an ABI for custom assembler code as mentioned, have some awesome & optimised functions, and will compile smoothly with any application one may write in C & include for some purpose.

- you don't have to do that at all. One may choose to crawl over broken glass on hands and knees to designate and execute every assembly level specified operation from go to whoa on the entire Epiphany V. Just use a host side loader to initialise the core array and then trigger the go button. LOL. What A Great Plan. :-)

- now from the point of view of a given core, some memory is addressable which doesn't truly lie within another core in the physical core array. It may map externally to the chip in fact, but logically seems to be in the Epiphany design space ( a large flat byte granular space ).

- fortunately there is a host side to the SDK too ( Epiphany Hardware Abstraction Layer ). Here we would be more relaxed in constraints : plenty of memory, plenty of time, at least a decent ARM system to play with ( but could even be a Linux PC say ). No reason to optimise that aspect what-so-ever.

- now an interesting question becomes : what is the memory footprint per-core to allow for the 'luxury' of those library elements being present ? In detail that depends on the intra-library structure, exact choice of routines referred to in code, some linker flags etc.

... that's the memory used within the above red box. A task then would be to create a “Hello World” program and then gauge it ..... the provided debugger ( e-gdb, based on gdb ) should do that nicely. By subtraction, and with alignment constraints, one can then estimate more closely the maximum allowable number of operands for an FFT.

Cheers, Mike.


Mike Hewson

Moderator

Joined: 1 Dec 05

Posts: 6588

Credit: 318093094

RAC: 399368

FWIW : the V is being

10 Apr 2017 0:10:00 UTC

Message 157010


FWIW : the V is being produced in 16nm FinFET.

Correction : "The total design limit is 64 x 64 = 4096 cores". Nope, was. Is now 64-bit addressing, was 32-bit, so with this expansion one can do one billion-ish cores ..... and 1 PetaByte memory. In a special magical country called Potentia that is. :-))
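The core-count limits fall out of simple address arithmetic, sketched below. Two things are assumptions rather than statements from the thread: that the 32-bit layout is 6 row bits, 6 column bits and a 20-bit local offset (a 1 MiB window per core), and that the 64-bit design keeps the same 1 MiB window with "one billion-ish" read as 2³⁰ cores. The function name and the sample coordinates are illustrative only.

```python
MIB = 1 << 20  # assumed per-core address window: 1 MiB

def core_count(core_id_bits):
    """Number of distinct cores addressable with this many core-ID bits."""
    return 1 << core_id_bits

# 32-bit design (assumed layout): 6 row bits + 6 column bits + 20 offset bits.
old_cores = core_count(6 + 6)   # the 64 x 64 grid
old_space = old_cores * MIB     # total bytes spanned by that grid

# 64-bit design, reading "one billion-ish" as 2**30 cores (assumption).
new_cores = core_count(30)
new_space = new_cores * MIB     # total bytes spanned

def global_address_32(row, col, offset):
    """Pack row, column and local offset into a 32-bit global address."""
    assert 0 <= row < 64 and 0 <= col < 64 and 0 <= offset < MIB
    return (row << 26) | (col << 20) | offset

print(old_cores, hex(old_space), new_cores, hex(new_space))
print(hex(global_address_32(32, 8, 0)))
```

Under those assumptions the 32-bit grid spans exactly 2³² bytes and the 64-bit case spans 2⁵⁰ bytes, which matches the 4096 core and 1 PetaByte figures quoted.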

Cheers, Mike.


ML1

Joined: 20 Feb 05

Posts: 347

Credit: 86563414

RAC: 554

Mike Hewson

10 Apr 2017 23:05:15 UTC

Message 157027 in response to message 157010


Mike Hewson wrote:

FWIW : the V is being produced in 16nm FinFET.

Correction : "The total design limit is 64 x 64 = 4096 cores". Nope, was. Is now 64-bit addressing, was 32-bit, so with this expansion one can do one billion-ish cores ..... and 1 PetaByte memory. In a special magical country called Potentia that is. :-))

Cheers, Mike.

Worth an email and a giggle to get a demo board with one or four of them? Science grant also so that you can enjoy a sabbatical to get the numbers together quickly?...

Keep searchin,

Martin

See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)

Mike Hewson

Moderator

Joined: 1 Dec 05

Posts: 6588

Credit: 318093094

RAC: 399368

ML1 wrote: Science grant also

11 Apr 2017 0:08:00 UTC

Message 157028 in response to message 157027


ML1 wrote:

Science grant also so that you can enjoy a sabbatical to get the numbers together quickly?...

I'd love a grant. As does Mr Olofsson .... however his role as 'CEO of Adapteva' is being referred to in the past tense. He hasn't posted/tweeted/etc anywhere I can find since the day he got that job. Hence the Potentia comment alas. Oh well. :-((

Now if you can fund a sabbatical for me to slowly drink whiskey on a tropical beach ( deck chair, palm tree, panama hat ), producing metrics for the local horizon, then that's a different matter again. I'd promise to send reports via postcard. :-)

Cheers, Mike.

