Parallella, Raspberry Pi, FPGA & All That Stuff


Mike Hewson
Joined: 1 Dec 05
Posts: 5111
Credit: 42033358
RAC: 6490
Topic 196560

This came up in a thread over at Cafe Einstein :

Parallella

which I think is well worth looking at. At a glance it would be absolutely red-hot for FFTs and thus have excellent performance in the signal processing area. If it can be done, it could quite revolutionise distributed workflows as practised here at E@H. Notably, software can be developed and compiled for it using C/C++ on a GNU system. One to watch.

Cheers, Mike.

"I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal

Fred J. Verster
Joined: 27 Apr 08
Posts: 118
Credit: 22451438
RAC: 0

Parallella, Raspberry Pi, FPGA & All That Stuff

Quote:

This came up in a thread over at Cafe Einstein :

Parallella

which I think is well worth looking at. At a glance it would be absolutely red-hot for FFTs and thus have excellent performance in the signal processing area. If it can be done, it could quite revolutionise distributed workflows as practised here at E@H. Notably, software can be developed and compiled for it using C/C++ on a GNU system. One to watch.

Cheers, Mike.

Very good article. I've followed this development for a while; increasing
CPU/GPU core frequency has its limits, as does (22nm) process shrinking to molecular level.

Unfortunately I've never learned a computer language, except BASIC :-/
Programs need to be programmed a different way, but C/C++ can be used.
And parallelization has already proved to be very effective (CUDA / OpenCL).

Mike Hewson
Joined: 1 Dec 05
Posts: 5111
Credit: 42033358
RAC: 6490

RE: Very good article, I've

Quote:

Very good article. I've followed this development for a while; increasing
CPU/GPU core frequency has its limits, as does (22nm) process shrinking to molecular level.

Unfortunately I've never learned a computer language, except BASIC :-/
Programs need to be programmed a different way, but C/C++ can be used.
And parallelization has already proved to be very effective (CUDA / OpenCL).


It wouldn't currently compete on performance anywhere near the existing GPU ports of E@H WUs, as their current arrays are too small for that ( Adapteva's primary focus is on lowering price and power consumption ). But I envisage having an Epiphany array in a co-processor role, where it could be handed the stuff that it would be quite spectacular at, e.g. matrix manipulations, thus delivering great performance on algorithms for which that is key ( Fast Fourier Transforms ). Their simplest offering is 4 x 4 for around $100 USD, however the design scales up to 64 x 64 ... I was most intrigued by their matrix multiplications using blocks within the matrix shifted synchronously b/w nodes, roughly speaking a two-dimensional pipeline.
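
For anyone wanting to see the block-shifting idea in miniature, here is a rough sketch. It resembles what is usually called Cannon's algorithm; that name and the details are my assumption, not something taken from their material. To keep it short each "node" holds a single number rather than a sub-matrix, and the function name cannon_matmul is made up for illustration.

```python
# Toy simulation of a block-shifting matrix multiply on an n x n grid of nodes.
# Assumption: one scalar per node stands in for a sub-matrix block.

def cannon_matmul(a, b):
    """Multiply square matrices a and b (lists of lists) by shifting operands."""
    n = len(a)
    # Initial skew: row i of A is rotated left by i, column j of B is rotated up by j.
    a_loc = [[a[i][(j + i) % n] for j in range(n)] for i in range(n)]
    b_loc = [[b[(i + j) % n][j] for j in range(n)] for i in range(n)]
    c = [[0] * n for _ in range(n)]
    for _ in range(n):
        # Every node multiplies the pair of operands it currently holds.
        for i in range(n):
            for j in range(n):
                c[i][j] += a_loc[i][j] * b_loc[i][j]
        # Synchronous shift: A moves one node left, B moves one node up.
        a_loc = [[a_loc[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b_loc = [[b_loc[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return c
```

The point is that every node only ever talks to its immediate neighbours, and all nodes do useful work on every step, which is why it looks like a pipeline in two dimensions.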

Cheers, Mike.

( edit ) The programming skill would largely be a matter of having a "parallel approach" and not language per se. For instance the address space within any given array is unprotected, meaning that any node/processor can read and write to any other's memory within a globally flat space, so the discipline required to prevent any incongruities arising from that would have to come from the program design and compilation. So you'd want to identify the elements in the problem space that could be simultaneously and independently executed, and if we have already written for GPU thread parallelism then that aspect is largely done.
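
A small sketch of that discipline, using ordinary Python threads as stand-ins for nodes and one flat list as the unprotected shared memory. Nothing stops a worker writing outside its slice; correctness comes purely from the agreed partitioning. The names run_partitioned and square_in_place are hypothetical.

```python
from threading import Thread

def square_in_place(shared, start, stop):
    """Worker: by convention touches only shared[start:stop], its own slice."""
    for k in range(start, stop):
        shared[k] = shared[k] * shared[k]

def run_partitioned(data, workers=4):
    """Square every element, splitting the flat buffer into disjoint slices."""
    shared = list(data)  # one flat buffer that every worker can read and write
    n = len(shared)
    # Disjoint, contiguous slices that together cover the whole buffer.
    bounds = [(w * n // workers, (w + 1) * n // workers) for w in range(workers)]
    threads = [Thread(target=square_in_place, args=(shared, lo, hi))
               for lo, hi in bounds]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared
```

Because no two workers share an index, no locking is needed at all, which is the same decomposition one does when writing for GPU threads.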

( edit ) This also highlights an issue/query that arises here from time to time : why can't GPUs be used to speed up algorithm X ? Answer : Algorithm X or the problem space it arises from may not have sufficiently parallel aspects for that to yield a gain over non-massively-parallel solutions.
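
One common way to put a number on that is Amdahl's law ( my addition, not something from the Parallella material ) : if only a fraction p of the run time can be spread over n processors, the best possible speed-up is 1 / ((1 - p) + p / n). A minimal sketch, with a made-up function name:

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speed-up when only part of the work can run in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)
```

So an algorithm that is only half parallelisable can never run more than twice as fast, however many cores you throw at it.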


Fred J. Verster
Joined: 27 Apr 08
Posts: 118
Credit: 22451438
RAC: 0

RE: RE: Very good

Quote:
Quote:

Very good article. I've followed this development for a while; increasing
CPU/GPU core frequency has its limits, as does (22nm) process shrinking to molecular level.

Unfortunately I've never learned a computer language, except BASIC :-/
Programs need to be programmed a different way, but C/C++ can be used.
And parallelization has already proved to be very effective (CUDA / OpenCL).


It wouldn't currently compete on performance anywhere near the existing GPU ports of E@H WUs, as their current arrays are too small for that ( Adapteva's primary focus is on lowering price and power consumption ). But I envisage having an Epiphany array in a co-processor role, where it could be handed the stuff that it would be quite spectacular at, e.g. matrix manipulations, thus delivering great performance on algorithms for which that is key ( Fast Fourier Transforms ). Their simplest offering is 4 x 4 for around $100 USD, however the design scales up to 64 x 64 ... I was most intrigued by their matrix multiplications using blocks within the matrix shifted synchronously b/w nodes, roughly speaking a two-dimensional pipeline.

Cheers, Mike.

( edit ) The programming skill would largely be a matter of having a "parallel approach" and not language per se. For instance the address space within any given array is unprotected, meaning that any node/processor can read and write to any other's memory within a globally flat space, so the discipline required to prevent any incongruities arising from that would have to come from the program design and compilation. So you'd want to identify the elements in the problem space that could be simultaneously and independently executed, and if we have already written for GPU thread parallelism then that aspect is largely done.

( edit ) This also highlights an issue/query that arises here from time to time : why can't GPUs be used to speed up algorithm X ? Answer : Algorithm X or the problem space it arises from may not have sufficiently parallel aspects for that to yield a gain over non-massively-parallel solutions.

It does appear to be quite a change in 'thinking' and programming, as not
many replies or answers have arisen ;-)

It also took some time before GPGPU was being used via CUDA or OpenCL.

Mike Hewson
Joined: 1 Dec 05
Posts: 5111
Credit: 42033358
RAC: 6490

RE: It does appear to be

Quote:
It does appear to be quite a change in 'thinking' and programming, as not
many replies or answers have arisen ;-)


There certainly is a lot to swallow ! :-)

Quote:
It also took some time before GPGPU was being used via CUDA or OpenCL.


Actually they have said they will consider developing an OpenCL facility for it. Now that's a clever move, these guys are forward thinkers for sure.

Also each node ( RISC processor plus its slab of local memory ) is connected to each of three independent data buses that constitute the 'mesh', two for writing and one for reading, with no latency on the channel that does fast on-chip writes b/w nodes. That goes ~ 16 x faster than corresponding reads! That's quite an asymmetry and a processor can never stall using that type of write! That would imply a chunk of buffering by the network-on-chip system. Anyways I think the hint there is : if a node_B needs results from a node_A, it is far more efficient for node_A to execute the above fast write to node_B's local memory THAN node_B executing a much slower read from node_A's memory. ( Plus node_A will know better when it's finished some computation step, as opposed to node_B polling ). Given that said data transfer could also include flags/semaphores/etc to coordinate/validate any data state, then you have a real snappy mechanism in the hardware to satisfy most 'process/thread' cooperation paradigms. [ There is specifically a TESTSET command at hardware level which is an atomic "test-if-not-zero" followed by conditional write. This is the usual mechanism to prevent deadlocks/races and whatnot/troubles with semaphores et al ]
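
To make the 'push, don't pull' idea concrete, here is a toy model in Python. It is only a sketch under assumptions : the Node class, the DATA/READY addresses and the return convention of testset are all made up for illustration ( I take it that a zero location gets written and 0 comes back, otherwise the existing value comes back untouched ), a lock stands in for the hardware's atomicity, and writes from one node are assumed to land in the order issued.

```python
import time
from threading import Lock, Thread

class Node:
    """Toy node: a small local memory that any other node may write into."""

    def __init__(self, words=8):
        self.mem = [0] * words
        self._atomic = Lock()  # stands in for hardware atomicity

    def testset(self, addr, value):
        """Atomically: if mem[addr] is zero, store value there and return 0;
        otherwise leave it alone and return what is already there."""
        with self._atomic:
            current = self.mem[addr]
            if current == 0:
                self.mem[addr] = value
            return current

DATA, READY = 0, 1  # hypothetical word addresses inside node B's local memory

def producer(node_b, result):
    """Node A pushes its result into B's memory, then raises the flag."""
    node_b.mem[DATA] = result
    node_b.mem[READY] = 1

def consumer(node_b):
    """Node B waits on its own local flag; it never reads A's memory."""
    while node_b.mem[READY] == 0:
        time.sleep(0)  # yield so the producer thread can run
    return node_b.mem[DATA]
```

Typical use would be to start producer(node_b, 42) on one thread and call consumer(node_b) on another. For a lock, a node would call testset(addr, my_id) and treat a return of 0 as 'I got it'.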

Cheers, Mike.

( edit ) Think of the Three Stooges trying to go through the same door at the same time : 'after you Larry' ... 'no, after you Curly' ... 'please, you first Moe' ... 'no I couldn't' ... 'I must insist' ... 'no, I couldn't possibly' ... eventually they jam in the doorway and fight.

( edit ) You may be thinking : how can one label a data bus as 'only for reading' or 'only for writing', that typically being a question of perspective or which end you're at ? Answer : the difference is the specification of the addressing, who is controlling the transaction, and buffering. A read request has a 'return to sender' address component that a write doesn't ( similar to a stamped self-addressed envelope ). That propagates through the mesh going left/right along a row and then going up/down along a column until the target node is found, and ditto for the return leg. A write does simply the first phase, in fact a node receiving data from another node's write does not know who sent it ( well, not from the hardware at least ).
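
The row-then-column rule is easy to sketch. Below, nodes are (row, column) pairs and the function names are hypothetical; I have also assumed the return leg of a read simply applies the same rule starting from the target, which is how I read 'ditto for the return leg'.

```python
def xy_route(src, dst):
    """Return the list of (row, col) nodes visited going from src to dst."""
    row, col = src
    dst_row, dst_col = dst
    path = [(row, col)]
    # Phase 1: move left/right along the source row until the column matches.
    step = 1 if dst_col > col else -1
    while col != dst_col:
        col += step
        path.append((row, col))
    # Phase 2: move up/down along that column until the row matches.
    step = 1 if dst_row > row else -1
    while row != dst_row:
        row += step
        path.append((row, col))
    return path

def read_round_trip(src, dst):
    """A read: the request travels out, then the data travels back by the same rule."""
    return xy_route(src, dst), xy_route(dst, src)
```

A write is just the first element of that pair, which is part of why it is so much cheaper than a read.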

( edit ) Of course such an array of processors can be task dedicated without double duty, unlike GPUs which generally perform a system graphics role also. BTW currently either a USB or an Ethernet pathway is how to connect such a co-processor to some 'host' system. There is talk of other modes, say even digital video output. So one can attach it to all manner of devices via suitable I/O ports. The other slower write channel, called xMesh, is for such off-chip connections - which could well be another similar chip, but may be anything with appropriate circuit level compatibility/buffering. As you can tell I am rather enthused ... :-)


Rod_5
Rod
Joined: 3 Jan 06
Posts: 4396
Credit: 811266
RAC: 0

I just increased my pledge

I just increased my pledge again.
So close... The platform has so much potential, and right now that's it : 'Potential to do great things'.

There are some who can live without wild things and some who cannot. - Aldo Leopold

Mike Hewson
Joined: 1 Dec 05
Posts: 5111
Credit: 42033358
RAC: 6490

RE: I just increased my

Quote:
I just increased my pledge again.
So close... The platform has so much potential, and right now that's it : 'Potential to do great things'.


Hey they've bumped up ~ $100K in the last day!! I'd given up hope ... so I've just gone up myself - the $199+ package ( 64 core Epiphany IV on the stretch ).

Cheers, Mike.


Claggy
Joined: 29 Dec 06
Posts: 559
Credit: 2448501
RAC: 83

RE: RE: I just increased

Quote:
Quote:
I just increased my pledge again.
So close... The platform has so much potential, and right now that's it : 'Potential to do great things'.

Hey they've bumped up ~ $100K in the last day!! I'd given up hope ... so I've just gone up myself - the $199+ package ( 64 core Epiphany IV on the stretch ).

Cheers, Mike.


I've just gone for the same package too,

Claggy

dmike
Joined: 11 Oct 12
Posts: 76
Credit: 31369048
RAC: 0

Sure looks interesting. I

Sure looks interesting. I love the open source nature of the product and the horsepower for such a low amount of power consumption.

I personally wouldn't have much use for one but still the price and package are attractive. I was considering one to add for E@H but I think I'd be better off buying a 550ti to add in another box, as they're claiming 90 GFLOPS vs the 550ti's 691 GFLOPS.

I'm a fan of what they're doing, I just wish I had a use for it beyond crunching. Unfortunately I don't have the need, but I look forward to others posting their experiences with the system.

Oh, and the cluster reward for pledging $975 sounds beyond awesome!
In any case, thanks for sharing this with us, Mike. I'd have never known about it otherwise!

Mike Hewson
Joined: 1 Dec 05
Posts: 5111
Credit: 42033358
RAC: 6490

RE: Sure looks interesting.

Quote:
Sure looks interesting. I love the open source nature of the product and the horsepower for such a low amount of power consumption.


Well I don't want to get all poetical about it, but I think these guys are on an historic cusp. What they propose has massive potential and is so accessible. Why they're asking for grass roots money is simply that the biggies can't re-tool their way out of current commitments to existing paradigms ( which is not a criticism, just a fact of entrenched investment ). With development then the price goes down under threshold and you have a card that'll fit a PCIe slot with open source software that will hammer away.

Quote:
I personally wouldn't have much use for one but still the price and package are attractive. I was considering one to add for E@H but I think I'd be better off buying a 550ti to add in another box, as they're claiming 90 GFLOPS vs the 550ti's 691 GFLOPS.


For now yes. But you can scale the very same design by several orders, while barely breaking a sweat ..... :-)

Quote:

I'm a fan of what they're doing, I just wish I had a use for it beyond crunching. Unfortunately I don't have the need, but I look forward to others posting their experiences with the system.

Oh, and the cluster reward for pledging $975 sounds beyond awesome!
In any case, thanks for sharing this with us, Mike. I'd have never known about it otherwise!


Thank Rod for that, he told me! :-)

I've also just noted that they drop Ubuntu 12.04 onto the ARM A9 CPU on the board they supply.

Cheers, Mike.

( edit ) They're up another $100K, and just $110K shy now.


MarkJ
Joined: 28 Feb 08
Posts: 311
Credit: 34899582
RAC: 13678

As of this morning (7:55am

As of this morning (7:55am Sydney time) and with 25 hours left they need another 41,000 to get there. Yes I have made a pledge, let's see if I need to pay.

I already have a Raspberry Pi but find it hard to find any BOINC projects that support it.
