Parallella, Raspberry Pi, FPGA & All That Stuff

Richard Haselgrove
Richard Haselgrove
Joined: 10 Dec 05
Posts: 1,946
Credit: 289,804,409
RAC: 547,879

It's been suggested to me

It's been suggested to me that you might find it helpful to read a couple of academic sources:

http://www.davidhbailey.com/dhbpapers/
Starting at "Technical Papers" #1 & #2

http://www.cs.berkeley.edu/~volkov/

ExtraTerrestrial Apes
ExtraTerrestria...
Joined: 10 Nov 04
Posts: 766
Credit: 185,986,026
RAC: 111,305

Hi Mike, I think we've

Hi Mike,

I think we've missed each other a bit here. You're talking about how to possibly work with the limited amount of memory on the Epiphany cores. I didn't comment on this bacuse, frankly, I don't have anything to add :D

What I was thinking about was "what other benefit could there be in daisy-chaining many small cores instead of just dumping the work onto a sea of identical workers, like e.g. a GPU". Well, 2 days later I think my idea / analogy wasn't all that helpful. Out of simple energy considerations it should be clear that - if you can hold the instructions locally in a processor / cache - it can't be more efficient to pass data round the chip instead of working on it locally.

MrS

Scanning for our furry friends since Jan 2002

Mike Hewson
Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6,126
Credit: 128,108,102
RAC: 30,246

RE: Hi Mike, I think we've

Quote:

Hi Mike,

I think we've missed each other a bit here. You're talking about how to possibly work with the limited amount of memory on the Epiphany cores. I didn't comment on this bacuse, frankly, I don't have anything to add :D


Well ultimately the solution is to get more memory, but you are right : we're not reading off the same page. Please read on ..... :-)

Quote:
What I was thinking about was "what other benefit could there be in daisy-chaining many small cores instead of just dumping the work onto a sea of identical workers, like e.g. a GPU". Well, 2 days later I think my idea / analogy wasn't all that helpful. Out of simple energy considerations it should be clear that - if you can hold the instructions locally in a processor / cache - it can't be more efficient to pass data round the chip instead of working on it locally.


That's precisely what I'm challenging/wondering. As you may not be aware of the design features I'm perusing and relying on, I'll show you the relevant design statements ( my red highlight ) from this document :

Quote:
The Epiphany architecture defines a multicore, scalable, shared-memory, parallel computing fabric. It consists of a 2D array of mesh nodes connected by a low-latency mesh network-on-chip. Figure 1 shows an implementation of the architecture, highlighting the key components:
- A superscalar, floating-point RISC CPU in each mesh node that can execute two floating point operations and a 64-bit memory load operation on every clock cycle.
- Local memory in each mesh node that supports provides 32 Bytes/cycle of sustained bandwidth and is part of a distributed, shared memory system.
- Multicore communication infrastructure in each node that includes a network interface, a multi-channel DMA engine, multicore address decoder, and network-monitor.
- A 2D mesh network that supports on-chip node-to-node communication latencies in nanoseconds, with zero startup overhead


That 32 bytes for a single cycle is in detail :

Quote:
On every clock cycle, the following operations can occur:
- 64 bits of instructions can be fetched from memory to the program sequencer.
- 64 bits of data can be passed between the local memory and the CPU’s register file.
- 64 bits can be written into the local memory from the network interface.
- 64 bits can be transferred from the local memory to the network using the local DMA.

Quote:
The eMesh Network-on-Chip consists of three separate and orthogonal mesh structures, each serving different types of transaction traffic:
- cMesh: Used for write transactions destined for an on-chip mesh node. The cMesh network connects a mesh node to all four of its neighbors and has a maximum bidirectional throughput of 8 bytes/cycle in each of the four routing directions. At an operating frequency of 1GHz, the cMesh network has a total throughput of more than 0.5 Terabit/sec.
- rMesh: Used for all read requests. The rMesh network connects a mesh node to all four of its neighbors and has a maximum throughput of 1 read transaction every 8 clock cycles in each routing direction. .
- xMesh: Used for write transactions destined for off-chip resources and for passing through transactions destined for another chip in a multi-chip system configuration. The xMesh network allows an array of chips to be connected in a mesh structure without glue logic. The xMesh network is split into a south-north network and an east-west network. The maximum throughput of the mesh depends on the available-off chip I/O bandwidth. Current silicon versions of the Epiphany architecture can sustain a total off-chip bandwidth of 8GB/sec.

Quote:
Key features of the eMesh network include:
- Optimization of Write Transactions over Read Transactions. Writes are approximately 16x more efficient than reads for on-chip transactions. Programs should use the high write-transaction bandwidth and minimize inter-node, on-chip read transactions.
- Separation of On-Chip and Off-Chip Traffic. The separation of the xMesh and cMesh networks decouples off-chip and on-chip communication, making it possible to write on-chip applications that have deterministic execution times regardless of the types of applications running on neighboring nodes.
- Deadlock-Free Operation. The separation of read and write meshes—together with a fixed routing scheme of moving transactions first along rows, then along columns—guarantees that the network is free of deadlocks for all traffic conditions.
- Scalability. The implementation of the eMesh network allows it to scale to very large arrays. The only limitation is the size of the address space. For example, a 32-bit Epiphany architecture allows for building shared memory systems with 4,096 processors and a 64-bit architecture allows for scaling up to 18 billion processing elements in a shared memory system.


Sorry to blitz/flood you with that, but the upshot is that a given operand in a general register on one core can appear into a general register on an adjacent core within 5 cycles, the sequence being :

register (core A) -> local memory (core A) -> mesh interface (core A) -> mesh interface (core B) -> local memory (core B) -> register (core B)

and during any one of these cycles you can also do operations on the register set ( this is part of the nine-way porting stuff ):

Quote:
In every cycle, the register file can simultaneously perform the following operations:
- Three 32-bit floating-point operands can be read and one 32-bit result written by FPU.
- Two 32-bit integer operands can be read and one 32-bit result written by IALU.
- A 64-bit doubleword can be written or read using a load/store instruction.


These cores/nodes really are next door to each other, so it is indeed reasonable to think of algorithms/pipelines along the lines of production cataract surgery! :-)

Now you don't get something for nothing here. The designers have clearly made choices to create this hardware based write asymmetry, thus foregoing other things, but if you transact b/w cores using write instructions one's throughput might be massive. To my knowledge this is a pretty unique direction of hardware development. So data may appear in another core's local memory ( which is globally mapped ) 'un-invited' so to speak and thus a receiving core a priori may not know when a data location is valid ( or which other core wrote to it for that matter ) and hence to be worked upon. Here a key point is to coordinate - say, by regular synchronous writes - transfer of data across the entire core array by nearest neighbour jumps.

Many are looking at the processing of image and audio streams, where the code per core is pretty constant but the data flows through the chip. In that regard the network mesh structure that separates on-chip from off-chip traffic is crucial. For such streams one would break up into chunks that could be held on a core.

It might be helpful to think along the lines of Boolean logic gates where ( once constructed ) the functionality is static, but each time you change an input then after a ( short ) propagation time the output varies in response.

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter. Blaise Pascal

Mike Hewson
Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6,126
Credit: 128,108,102
RAC: 30,246

RE: It's been suggested to

Quote:

It's been suggested to me that you might find it helpful to read a couple of academic sources:

http://www.davidhbailey.com/dhbpapers/
Starting at "Technical Papers" #1 & #2

http://www.cs.berkeley.edu/~volkov/


Thank you indeed Richard, and thanks to whom-so-ever suggested that to you ! :-)

Mr Bailey is certainly profligate in his publications, and I feel as if I have stumbled into Aladdin's cave. That's a rich seam you've pointed me to.

The Cray-2 article has firmed up a suspicion of mine : that many variants of FFT are each targeted with some specific type of hardware in mind. Hence, say, the choices regarding array stride, generation of unity roots and bit-reversal. This is why I haven't been entirely happy with the algorithms I've studied so far. So I'll wind up inventing/detailing a variant that is ( hopefully ) tuned/optimised for Parallella/Epiphany.

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter. Blaise Pascal

ExtraTerrestrial Apes
ExtraTerrestria...
Joined: 10 Nov 04
Posts: 766
Credit: 185,986,026
RAC: 111,305

RE: a given operand in a

Quote:
a given operand in a general register on one core can appear into a general register on an adjacent core within 5 cycles


Where do you get 5 cycles from? In the text from your previous post it reads

Quote:
- A 2D mesh network that supports on-chip node-to-node communication latencies in nanoseconds, with zero startup overhead


which for me could be less than register-register transfers, and "a few ns" could be more than 5 clocks, since at 1 GHz one clock takes exactly 1 ns. Anyway, even if it's more than 5 clocks that's just nitpicking and still pretty fast ;)

Going back to the big picture:

Quote:
Quote:
Out of simple energy considerations it should be clear that - if you can hold the instructions locally in a processor / cache - it can't be more efficient to pass data round the chip instead of working on it locally.

That's precisely what I'm challenging/wondering.


Or put another way: no matter how fast a transfer / write is, it can never be faster / more efficient than not having to transfer! If all instructions easily fit into local caches / memory. If not you surely have a point.. but that's not what we're discussing here, are we? I thought this was pretty clear a few posts before the last ones :)

MrS

Scanning for our furry friends since Jan 2002

Mike Hewson
Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6,126
Credit: 128,108,102
RAC: 30,246

RE: RE: a given operand

Quote:
Quote:
a given operand in a general register on one core can appear into a general register on an adjacent core within 5 cycles

Where do you get 5 cycles from? In the text from your previous post it reads

Quote:
- A 2D mesh network that supports on-chip node-to-node communication latencies in nanoseconds, with zero startup overhead

which for me could be less than register-register transfers, and "a few ns" could be more than 5 clocks, since at 1 GHz one clock takes exactly 1 ns. Anyway, even if it's more than 5 clocks that's just nitpicking and still pretty fast ;)


If you track the transfers from each entity in the sequence I described you get five. In any case it could be four as I'd assumed DMA usage there, as you can actually write from a core's register into another core's memory ( ie. straight to the mesh latches from general register bypassing the initiating core's local memory ). In any case that is still slower than any register-register transfer on the same core ( of course! ), though not a slow as one might first assume. But yeah, it is fast whichever ... :-)

Anyhows there's only 64 register words ( = 4 bytes each ) of those and a core's local memory has only 32KB at present. That's four banks of 8KB, where those memory efficiencies ( ie. if you want nil contention b/w code fetch and data fetch ) assume you're not writing to the same bank you are ( wanting to ) simultaneously reading from. The first 8KB bank has an interupt vector table low down then the rest is given over as code. The second bank is typically allocated to stack ( top down ), which only leaves 2 x 8KB banks for data. Well, you can shift those allocations but the ceiling remains.

Quote:
Going back to the big picture:
Quote:
Quote:
Out of simple energy considerations it should be clear that - if you can hold the instructions locally in a processor / cache - it can't be more efficient to pass data round the chip instead of working on it locally.

That's precisely what I'm challenging/wondering.

Or put another way: no matter how fast a transfer / write is, it can never be faster / more efficient than not having to transfer! If all instructions easily fit into local caches / memory. If not you surely have a point.. but that's not what we're discussing here, are we? I thought this was pretty clear a few posts before the last ones :)


Ah, but these things are not mutually exclusive ! It's not a case of keep all data or transfer all data. For FFT's one would likely want to keep co-efficients ( complex numbers which are constant for a given choice of transform size ) and maybe pass a given input data (sub)set around, and then emit a resultant vector. The plan is for a one-shot setup for given transform choices/styles/extents followed by very many operands into and out of that creation.

As I've analysed thus far ( and my numbers may be wrong .... ) you can't fit a whole transform of meaningful practical size onto a single core, even if I had one megabyte per core to play with. So you have to split the transform in some way ( intersecting mathematical correctness with core restraints ) so the whole chip ( array of cores ) can manage. The deep issue here for FFT's is that every element of the result vector depends upon every element of the input vector. Thus you have to suffer some degree of data passage b/w nodes : I want to keep that to nearest neighbour if possible.

[ASIDE : for instance, a single input vector/array of 2^22 elements ( a typical E@H FFT input operand ) each of which is a ( IEEE single precision float ) real number entity requires 16MB just to be stored.]

This may all be tilting at windmills. Maybe the Epiphany is going to be crap at FFT's compared to benchmarks/competitors. There's no doubt that shaders on video cards have a huge embedded lead here, so my thinking has to be a scalable solution which may then give an edge upon Epiphany development towards 4096 cores with 1MB each.

Call me Don Quixote, and thanks for discussing this with me. It helps ! :-) :-)

Cheers, Mike.

( edit ) I should add : a 'cycle' is the same everywhere on the Epiphany chip. So a cycle for some core instruction is the same time as for a mesh network transaction to an adjacent core.

I have made this letter longer than usual because I lack the time to make it shorter. Blaise Pascal

ExtraTerrestrial Apes
ExtraTerrestria...
Joined: 10 Nov 04
Posts: 766
Credit: 185,986,026
RAC: 111,305

RE: If you track the

Quote:
If you track the transfers from each entity in the sequence I described you get five.


Agreed. But what I'm missing here is the confirmation that each step indeed takes only 1 clock. Sorry in advance if I overlooked this in what you already postet, but I certainly looked for it.

Wow, only ~16 KB of usable memory per core? I think that's less than the register space in a Kepler GPU's SMX! I get the feeling that crunching efficiently on these "bad boys" is going to require some serious low level / assembler work. Which had almost died out in the past due to being just to hard.. except for highly optimized libraries etc., which are called from high level languages. Luckily for Parallela this is just what they're aiming at ;)

Quote:
Quote:
Or put another way: no matter how fast a transfer / write is, it can never be faster / more efficient than not having to transfer! If all instructions easily fit into local caches / memory.

Ah, but these things are not mutually exclusive ! ... you can't fit a whole transform of meaningful practical size onto a single core, even if I had one megabyte per core to play with.


Let's assume the instructions fit into the cache / local memory and there has to be data streaming in any case. Then, when doing everything on a single core I'd get:
- fetch data from memory
- compute (everything)
- write results to memory

whereas by daisy-chaning 2 cores I'd need:
- fetch data from memory
- compute (half)
- transfer to 2nd core
- compute (half)
- write results to memory

so I add "transfer to 2nd core", which - no matter how fast - requires additional energy. If the input vector chunks that a single core could work on are sufficiently large for sufficient precision, then I can not see how splitting the work over cores could increase efficiency. Simple, isn't it?

On the Epiphanies, on the other hand, the local memory is so small that daisy-chaining may be neccessary to reduce the memory needed for instructions, so that the input vector chunk being worked on can be increased. If we're talking about an increase of more than a factor of 2 this could be very beneficial, maybe even neccessary for precision.

Quote:
my thinking has to be a scalable solution which may then give an edge upon Epiphany development towards 4096 cores with 1MB each.


Definitely go for a very scalable solution. After all, that's the whole point of using many small cores :)
And I wound't expect 1 MB per core at any time, though, as this approaches L2/3 cache sizes on regular fat cores! GPUs have L2 caches shared among all shaders measured in the 100's of kB.. to put things into perspective. And for 128 MB eDRAM Intel needs ~90 mm² at 22 nm (Crystalwell), which is already packed quite dense.. yet still huge.

MrS

Scanning for our furry friends since Jan 2002

Mike Hewson
Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6,126
Credit: 128,108,102
RAC: 30,246

RE: RE: If you track the

Quote:
Quote:
If you track the transfers from each entity in the sequence I described you get five.

Agreed. But what I'm missing here is the confirmation that each step indeed takes only 1 clock. Sorry in advance if I overlooked this in what you already postet, but I certainly looked for it.


OK, it's here ( my coloring ) :

Quote:
On every clock cycle, the following operations can occur:
- 64 bits of instructions can be fetched from memory to the program sequencer.
- 64 bits of data can be passed between the local memory and the CPU’s register file.
- 64 bits can be written into the local memory from the network interface.
- 64 bits can be transferred from the local memory to the network using the local DMA


and

Quote:
cMesh: Used for write transactions destined for an on-chip mesh node. The cMesh network connects a mesh node to all four of its neighbors and has a maximum bidirectional throughput of 8 bytes/cycle in each of the four routing directions


As for :

Quote:
Wow, only ~16 KB of usable memory per core? I think that's less than the register space in a Kepler GPU's SMX! I get the feeling that crunching efficiently on these "bad boys" is going to require some serious low level / assembler work. Which had almost died out in the past due to being just to hard.. except for highly optimized libraries etc., which are called from high level languages. Luckily for Parallela this is just what they're aiming at ;)


Yes, there will be some crawling over broken glass to do! :-)

Quote:
Quote:
Quote:
Or put another way: no matter how fast a transfer / write is, it can never be faster / more efficient than not having to transfer! If all instructions easily fit into local caches / memory.

Ah, but these things are not mutually exclusive ! ... you can't fit a whole transform of meaningful practical size onto a single core, even if I had one megabyte per core to play with.

Let's assume the instructions fit into the cache / local memory and there has to be data streaming in any case. Then, when doing everything on a single core I'd get:
- fetch data from memory
- compute (everything)
- write results to memory

whereas by daisy-chaning 2 cores I'd need:
- fetch data from memory
- compute (half)
- transfer to 2nd core
- compute (half)
- write results to memory

so I add "transfer to 2nd core", which - no matter how fast - requires additional energy. If the input vector chunks that a single core could work on are sufficiently large for sufficient precision, then I can not see how splitting the work over cores could increase efficiency. Simple, isn't it?

On the Epiphanies, on the other hand, the local memory is so small that daisy-chaining may be neccessary to reduce the memory needed for instructions, so that the input vector chunk being worked on can be increased. If we're talking about an increase of more than a factor of 2 this could be very beneficial, maybe even neccessary for precision.


That's right, if you can fit all instructions on core, and if you can partition the task without much need for inter-core communication. These are the pivots. What would be ideal is 'simply phrased' algorithms that can either loop or recurse well. In theory at least, one might 'unroll' a loop to perform the same essential calculations on several cores, with each core doing what might have been done for a single loop iteration ie. accounting for different values of whatever loop variable(s) would have otherwise been updated per round of the loop. So you may want to do function blob()* some X number of times in a loop, thus could one split that into blob() done X/16 times per E16 core ?? Is there any significant dependency b/w separate invocations of blob(), that can't be 'hard-coded'/factorised out ?? Etc ....

A subtle bit here is we are using RISC processors which by definition will/may/could have an expanded code memory footprint for a given task(s) c/w their CISC cousins ( but not necessarily ). Even a minimal inclusion of any, say C, runtime library components might be an issue ( this I will test for my E16's ).

Quote:
Quote:
my thinking has to be a scalable solution which may then give an edge upon Epiphany development towards 4096 cores with 1MB each.

Definitely go for a very scalable solution. After all, that's the whole point of using many small cores :)
And I wound't expect 1 MB per core at any time, though, as this approaches L2/3 cache sizes on regular fat cores! GPUs have L2 caches shared among all shaders measured in the 100's of kB.. to put things into perspective. And for 128 MB eDRAM Intel needs ~90 mm² at 22 nm (Crystalwell), which is already packed quite dense.. yet still huge.


I won't be holding my breath for 1MB ! :-)

Put it this way : I am optimistic that I will find an FFT implementation that will work really well for Epiphany designs. After much trawling around the literature ( that from Richard, thank you, and others ) I reckon I have a reasonable match b/w algorithm and architecture. I am currently bundling this up into an article entitled "Investigations Into Parallella Processing Using LEGO : A semi-serious approach to FFT on the Epiphany chip" which if nothing else ought be amusing, perhaps stimulating. During some idle moments I was staring at the pile of LEGO in a corner of my consulting rooms** thinking about this stuff when I started to play with them and some ideas dawned upon me. Anyway I won't spoil ... please await. :-)

Cheers, Mike.

* that's trademarked BTW ... :-)

** ahem ..... which I have there for the kids to fiddle with, right ? :-)

I have made this letter longer than usual because I lack the time to make it shorter. Blaise Pascal

Mike Hewson
Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6,126
Credit: 128,108,102
RAC: 30,246

Last night a new architecture

Last night a new architecture manual was released - largely in response to numerous queries on the Parallella forums - and has some rather interesting material, apart from typo corrections and expanded explanations/clarifications of existing disclosures.

The first I will note is a MULTICAST feature whereby from a given core data can be sent in a radial pattern outwards :

Quote:
The eMesh supports efficient broadcasting of data to multiple cores through a special “multicast†routing mode. To use the multicast routing method, set the CTRLMODE field in the CONFIG register to 0011 in the master core sending eMesh write transactions. In multicast mode, the normal eMesh routing algorithm described in 5.2 is overridden and the transaction is instead routed radially outwards from the transmitting node. The write destination address is compared to the value found in the MULTICAST register at each eMesh node. If the eMesh write transaction destination address matches the MULTICAST register, the transaction enters the node.


which means that the destination core ID row/column section of a write packet, normally indicating a 'Cartesian' location in the array, becomes like a channel setting that any receiver may choose to listen into. Or not. Thus you could have many distinct data 'transmitters', and subsets of the array ( developers choice ) may receive data from different master sources at different times. Just like a radio tuner ....

The second is for the Epiphany-IV, the 64 core chip, which I haven't ordered myself but will. This is called 'detour routing support', an opportunity to not accept the default routing behaviour at nodes but to insert one's own pattern. Each of the xMesh/cMesh/rMesh ( which are independent of each other ) may be separately configured to yield a flow control pattern of one's choice. For example :

Quote:
CMESHROUTE: 0xF0710
[5:3] EAST_CONFIG
0xx: normal routing
100: block northbound transactions
101: send northbound transactions south
110: send northbound transactions west
111: send northbound transactions north


... with similiar entries for the other compass directions. Put simply, I think this is a rather exciting capability to enable developer control over the traffic patterns : on a mesh and/or node and/or directional basis. It has triggered further fevered thoughts .... :-)

Other points :

Quote:
All instructions are available as both 16- and 32-bit instructions, with the instruction width depending on the registers used in the operation. Any command that uses registers 0 through 7 only and does not have a large immediate constant is encoded as a 16-bit instruction.


Also a MESHCONFIG register is described for monitoring routing and even blocking multicore debug & communication signals from propagating, add this to the mesh routing as above and one can effectively partition in hardware a given array into separate ones ! That could vary over the course of some overall ( developer determined ) timeline too ....

Cheers, Mike.

( edit ) For the adjacent node communication a slight correction to timing :

Quote:
Every router in the mesh is connected to the north, east, west, south, and to a mesh node. Write transactions move through the network, with a latency of 1.5 clock cycles per routing hop. A transaction traversing from the left edge to right edge of a 64- core chip would thus take 12 clock cycles.


My guess is that the extra 1/2 cycle is waiting for a downstream latch to propagate it's current contents onwards on a rising clock edge, say, which then allows uptake of new values on the very next falling clock edge. Or vice versa. Something like that.

I have made this letter longer than usual because I lack the time to make it shorter. Blaise Pascal

Bikeman (Heinz-Bernd Eggenstein)
Bikeman (Heinz-...
Moderator
Joined: 28 Aug 06
Posts: 3,516
Credit: 460,251,006
RAC: 34,678

It has been a bit silent

It has been a bit silent lately, tho, wrt status updates e.g. on the Kickstarter page. At one point Adapteva had suggested weekly updates would be provided, but the last one was issued 5 weeks ago.
Cheers
HB

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.