Parallella, Raspberry Pi, FPGA & All That Stuff

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6,126
Credit: 128,361,770
RAC: 33,370


Quote:

Thought you would have the picture...

... I see they are offering a Parallella Cluster kit for $575 that consists of:
- Four Parallella-16 boards (XC7Z010 version with backside expansion connectors "PEC")
- 4 Pre-loaded 16GB SD Cards (Ubuntu 12.04)
- 4 board-to-board flexible Epiphany link connector cables (up to 10Gb/s total bandwidth per cable)
- 20 metal standoff legs
- 110-240V Power adapter


The other thing I've just noted is the availability of 'spare' FPGA fabric alongside the ARM processor on the die ( Zynq 7020 ). Now one can totally ignore this and still use Parallella boards, for sure, but as I've never done any programming at the hardware logic level, why should that stop me ?? :-)

While I have a number of good books on digital circuit design, I guess I'll have to dig out some resources on VHDL .... this is truly turning into a geek's heaven !

Cheers, Mike.

( edit ) Hmmmm .... their accessory kit has a 5V x 2A power adapter ie. 10 Watts ....

I have made this letter longer than usual because I lack the time to make it shorter. Blaise Pascal

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6,126
Credit: 128,361,770
RAC: 33,370


Musings While Waiting For The Postman

- counting down to delivery, the promise is by the end of this month !

- I'm quite happy with the monitored power supply I've built.

- it's hard to do everything in assembler, so most programming will be at a higher level. C is fine. But I'm defining a subset of ( anticipated ) generally required functionality, mostly to do with inter-core communication/transfer/signalling, that I'll write in assembler and hand optimise. The gag here is that both the ARM CPU and the Epiphany chip have RISC cores ie. a REDUCED Instruction Set. This is a joy compared to the CISC assembly I have done in the past ( x86 ). Epiphany has no fancy memory-based operand scheme, no memory segmentation, no memory paging, no memory protection, no privileged rings/instructions, and only a tad of base + index addressing. It does have a variable-length instruction pipeline that suitably but simply manages dependency hazards ( it stalls until resolution ). There is simple branch prediction with a fixed 3-cycle penalty. Floating point is either 32- or 64-bit IEEE-754 format, but some features of that standard are not supported ( relating to NaNs, denormals, rounding to infinity and inexact flags ).

- special/intriguing features ( to be ruthlessly leveraged in my opinion ) are :

(a) Substantial write based/biased asymmetry in mesh network transfers.

(b) Dual-issue scheduling rules that " .... allows two instructions to be executed in parallel on every clock cycle, if certain parallel-issue rules are followed.... "

(c) Two DMAs per core with a handy suite of configurable behaviours.

(d) Displacement-postmodify stores and loads from/to memory. Or if you like 'automated' stepping through of arrays where you get to specify up to 8 byte strides.

[ (e) A software interrupt ( of low priority admittedly ) is available, however the current recommendation is to use the SDK supplied routines. Higher priority interrupts can actually be masked out during the servicing of a lower priority one. One does stuff about with interrupt priorities at one's peril, so I might leave that alone .... ]
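As a sketch of (d), the same access pattern written in plain C is a pointer that is bumped by a stride after each access ( the function and its names are mine for illustration, not anything from the SDK ):

```c
#include <stddef.h>

/* Hypothetical C analogue of a displacement-postmodify load: use the
 * operand at the current address, THEN step the address by a fixed
 * stride. The Epiphany instruction does both in one go; here it is two
 * C statements that a compiler is free to fuse. */
float strided_sum(const float *p, size_t n, ptrdiff_t stride)
{
    float acc = 0.0f;
    while (n--) {
        acc += *p;       /* load from the current address ...        */
        p += stride;     /* ... then post-modify the address register */
    }
    return acc;
}
```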

Cheers, Mike.

( edit ) Silly me. I didn't mention the general register set of 64 32-bit registers, which is nine-way ported. Only about 16 of those have implicit uses, are reserved, or are subject to convention. Per cycle you can get (a) three 32-bit floating-point operands read and one 32-bit result written by the FPU, (b) two 32-bit integer operands read and one 32-bit result written by the IALU, and (c) a 64-bit doubleword written or read using a load/store instruction. Plus during that one cycle the ( no/low latency ) network mesh hardware can transfer up to 32 bytes to and/or from that core's local memory.

( edit ) I will add though a word or five about fused multiply-add ( FMADD ). This is the jewel in the crown that especially optimises Epiphany cores for signal processing. It is a single-cycle floating point instruction with three input operands and one output, and you can use it to accumulate a sum of products :

A <- A + B*C

or, if you like, to perform an inner product of vectors eg. a matrix row times a column vector. So ( a temporary representing ) B is multiplied by ( a temporary representing ) C, and that result is added to ( a temporary representing ) A and then stored back to A. These temporaries are copies within the FPU of their respective register operands, and have extra least significant bits. Because a result eventually has to be written back to a 32-bit register, these LSBs participate in rounding ( several schemes are available ). The fused aspect is that these extra bits on the intermediate result B*C are NOT rounded before the addition to A. This improves the accuracy of the entire operation. There is a fused multiply-subtract ( FMSUB ) using

A - B*C

which if you like is an alias of

A + (-B)*C

or

A + B*(-C)

and thus can be considered as an ( accumulator leading to an ) inner product of one vector by the negative of the other.

Now this is all well and good if the registers A, B and C represent purely real numbers in your problem space. If you want to perform the inner product of two complex vectors ( Hermitian ... etc ) then that's four real multiplications per complex multiplication, as each complex number is a pair of reals and so yields four real cross-product terms. It's certainly do-able, with care of course, and FMSUB comes well into play when you multiply one complex number by the complex conjugate of the other ! You see, even real data vectors can yield complex Fourier coefficients ( complex conjugates across frequency = zero ie. F[k] = F*[-k] ), and so having these instructions available at assembler level is a real boon !! :-)

( edit ) Also ( having a slow day, eh Mike ? ) branch prediction assumes that a branch is taken - and there is no time penalty if so. The 3-cycle penalty is when you don't take the branch ie. when the next instruction in sequence after the branch point is executed instead. What this means is that if you 'phrase' your logic correctly for the main/default code path then your standard-case handling will be optimised. But this assumes that you have some idea of the most likely branch behaviour for the data at hand .... this is especially important to get right for ( large index counts used for ) loop iterations.
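In C with a GCC-style toolchain ( the Epiphany SDK's compiler is GCC-based, though whether its backend honours this particular hint is my assumption ) you can steer that code layout with __builtin_expect:

```c
/* Branch-layout hint sketch: tell the compiler which outcome is the
 * common one, so the default path can be arranged to match the
 * hardware's static prediction and the fixed mispredict penalty lands
 * on the rare path. LIKELY/UNLIKELY are my own macro names. */
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

int checked_divide(int num, int den)
{
    if (UNLIKELY(den == 0))   /* rare error case takes the slow path */
        return 0;
    return num / den;         /* common case stays on the fast track */
}
```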


ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 769
Credit: 187,624,761
RAC: 182,550


Wow Mike, I didn't know you're this deep into this stuff! I wish you all the best for your experiments, and at least lots of fun ... which seems redundant, since you apparently already had quite a bit of it ;)

Anyway, if you're talking about implementing an FFT on it with hand-optimised assembler code and such: is there some kind of developer community to share the results? I suppose there is, because sharing results is what openness is all about, but I'd just want to make sure the time invested here is worth something in the end.

MrS

Scanning for our furry friends since Jan 2002

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6,126
Credit: 128,361,770
RAC: 33,370


For me the fun is the prime object of the exercise. :-)

If something useful emerges then we have a bonus ! Of course I am aiming at E@H applications. :-)

The trick with this board is that it has an ARM A9 dual core processor running it, with the Epiphany chip as a co-processor. That means you can do all the boring serial/management/context stuff on the ARM ( which is well supported in software eg. it runs a full Ubuntu desktop ) while the Epiphany is passed slabs of data to do the parallel magic invented by oneself .....

FFTs are eminently parallelisable, being essentially matrix based ( well, that's one model you can use ). The challenge for me will be to do so in a manner that takes full advantage of Epiphany features.

The Epiphany reference design is great, but the current implementation lacks sufficient memory-per-core ( it's an evaluation kit ). I would rate that as the most important feature to upgrade, second would be cores-per-chip.

The developer community is over at Parallella, and we kick the footy around over there ! :-)

Cheers, Mike.

( edit ) An added bonus would be to duplicate, say, the CUDA FFT interface ... 'cos that would slot right in to existing builds.

( edit ) Also, one can choose to run the Parallella development toolchain either on the ARM's Ubuntu OR on your own PC ( choice of USB or Ethernet link ), in which case it will be a cross-compile.

( edit ) Yes, I am aware that FFT's may well be performed E@H server side prior to the host WU phase.


Claggy
Joined: 29 Dec 06
Posts: 560
Credit: 2,557,091
RAC: 0


The Preparing for Parallella videos have been uploaded:

http://www.youtube.com/user/embecosm

Claggy

ExtraTerrestrial Apes
Joined: 10 Nov 04
Posts: 769
Credit: 187,624,761
RAC: 182,550


I've not been all that enthusiastic about bringing ARM CPUs into the race. Yet another architecture to support, with little benefit as of now. And if they tack larger vector units onto the small cores, they'll build another Larrabee / Knights Corner, just with a probably better ISA. If they make the cores a bit more dumb in exchange for even more of them, they'll build a GPU. A bit more die-space efficient, because they wouldn't need the graphics units, but nothing earth shattering.

However, if some novel tricks are put into the design, like Parallella solving the locks and dependencies via ultra-fast writes (as far as I understand), some truly great things could evolve. The main benefit would be that they're able to start from scratch.. something Intel, AMD and nVidia might sometimes dream about.

A problem will surely be reaching critical mass. Your solution can be as technically good as it wants; if the software ecosystem is not there (compilers, IDEs, quality software) it won't matter. Lots of innovations have gone this way before.. but nowadays we might be more flexible in creating new ecosystems.

MrS


Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6,126
Credit: 128,361,770
RAC: 33,370


@Claggy : Thanks for those links, I will watch them. What did you think of the event ?

@MrS : Excellent points. In the Parallella context the inclusion of ARM was to have a ready made conduit to the Epiphany core, thus virtualising that product behind a hardware abstraction layer and an API. That choice IIRC was simply cost, existing support, power and size ( you can stuff it onto anything ). For me it wouldn't matter if another 'wrapping' method was used.

[ BTW there is a UK company wanting to create 'cluster' facilities, including breakout of the FPGA to JTAG ]

As for locks & dependencies, as ever they become the programmer's burden ( including whoever does the utility API ). Parallella per se is not going to solve the standard issues of races and deadlocks over contended resources*.

So, as you say, the real benefit ( if any ) is that it is from scratch, and it is also cheap. The compiler, linker, debugger are already sorted as GNU knock-offs ( gotta love open source ). I think Adapteva is well progressed upon their stated initial goal of providing a cheap kit for parallel programming testing/development/fun.

Cheers, Mike.

* which is why I am chancing my arm at a hierarchical method of decentralised task coordination ( my 'Roman Legion Model' ) to eliminate such contentions. So while every legionnaire uses their own sword and shield, only a 'first spear' has a pilum, and only a centurion can give orders to the catapult .... :-) :-)

( edit ) As we speak the hardware specs for gen1 have been finally finalised ! :-)


Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6,126
Credit: 128,361,770
RAC: 33,370


[ I've set my dog to watch for the postman .... ]

Now there are no PUSH or POP instructions, no hardware catching of stack under- or over-flow states, no stack segment register; that is, the hardware has no concept of, or support for, a stack. The RTS and IRET instructions merely reload the program counter. So how do you nest calls?

Well, you roll your own stack using loads and stores, and slap those behaviours into assembly macros. Where do you place the stack? Current Epiphany chips have limited per-core memory, so if you put a core's stack in its own local memory ( really fast ) then you lose some of what ( relatively ) little data space you have. If you site the stack off-chip ( in the ARM's 1GB memory, as mapped through the EAST eLink pins ) then you have tons of stack space, but it's dog slow. So for performance the matter reverts to how nested your particular program's functional structure is, offset against the time spent within said subroutines.

As for stack bounds checking, I'd recognise four options :

(A) - optimism. Ignore the topic until you crash, then debug. Fast, but loose.

(B) - check the stack pointer ( R13 by convention, but not enforced by hardware or instruction ) for a bounds breach with each PUSH ( just prior to ) and POP ( just after ). Emit, say, a software interrupt on a breach, BUT note that one can't assume a return from this pathway, as by definition you are there because the stack is broken. So that would be an immediate program exit, hopefully with information to trace the cause. Slow, but sure.

(C) - Use the memory block read-only flagging system. Clumsy and probably error prone, alas - it won't catch a POP/underflow state, as you don't write to the stack during that anyway, and the limits are too wide for the currently provided memory. You'd have to rely upon a read-only exception for overflow and an explicit bounds check for underflow. No real advantage over B here.

(D) - Use B during development and remove it for release, of course patternising the stack area during testing to map memory usage. Thus you release the code under A, with your optimism backed by tests.
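A hedged C model of option B, with an array standing in for the local-memory stack region ( the array, limits and return-code convention are mine for illustration, not the SDK's ):

```c
#include <stdint.h>

/* Hand-rolled descending stack with an explicit bounds check on every
 * push and pop, standing in for the PUSH/POP assembly macros described
 * above ( with R13 as the stack pointer by convention ). */
#define STACK_WORDS 64

static uint32_t stack_mem[STACK_WORDS];
static uint32_t *sp = stack_mem + STACK_WORDS;   /* grows downward */

/* returns 0 on success, -1 on a breach ( the caller then bails out ) */
int push(uint32_t v)
{
    if (sp == stack_mem)                  /* overflow: no room below */
        return -1;
    *--sp = v;                            /* store with pre-decrement */
    return 0;
}

int pop(uint32_t *v)
{
    if (sp == stack_mem + STACK_WORDS)    /* underflow: stack empty */
        return -1;
    *v = *sp++;                           /* load with post-increment */
    return 0;
}
```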

[ A related issue is running several stacks for a given executable on a given core - and thus reliably swapping between them - but I'll leave that on the back-burner. It generalises to the question of whether you want to do full context switches on the eCores anyway .... but hey, it's only a coprocessor after all. ]

Cheers, Mike.


Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3,516
Credit: 460,264,506
RAC: 18,501


Quote:

[ I've set my dog to watch for the postman .... ]

I guess your dog can relax a bit. The last thing I read was "All other shipments are still a few weeks out." (see http://www.kickstarter.com/projects/adapteva/parallella-a-supercomputer-for-everyone/posts/547314 Aug 3rd entry).

Before boards are sent out there is supposed to be a mass email to backers to verify shipping addresses (some will have moved since making the Kickstarter pledge). That hasn't been done yet and there has been a lack of feedback in the past few days :-( .... so don't hold your breath.

Cheers
HBE

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6,126
Credit: 128,361,770
RAC: 33,370


Thanks for the reminder, HB. I'll stand Rusty down. :-)

Cheers, Mike.

