RE: Hi! Sorry, I haven't
Unfortunately they do not ship to Europe.
RE: Unfortunately they do
It will be sold in Europe thru partners, e.g.
http://www.zotac.com/uk/z-zone/nvidia-jetson-tk1
HBE
RE: RE: Unfortunately
In Germany, Conrad, Voelkner and probably many others offer this board for €219, available starting 29th of April 2014.
Cheers
HB
RE: RE: RE: Unfortuna
Great news!
Sad Panda. I haven't worked
Sad Panda. I haven't worked out a way to meaningfully do even vaguely E@H size FFT's, entirely on the Epiphany chip at least. The low memory per core - effectively 16KB available for data at best - is the killer. Once the data is placed on the E16 I can't find the wiggle room to optimise anything; the rest is savagely bandwidth limited. Way too much thrash & transfer per compute .... and it's a long way down from 2^22 to 2^10. My design is more an ARM solution, which we have already.
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
Sorry to hear that! But then
Sorry to hear that! But then it's not entirely unexpected.. and the reason why innovative approaches like the Cell processor often fail if they place too much of a burden on the programmer.
MrS
Scanning for our furry friends since Jan 2002
Hi Parallella Update:
Hi
Parallella Update:
Beginning tomorrow, Nvidia's Tegra K1, an ARM SoC equipped with Kepler GPU cores, will be available in Europe as a dev board. It should perhaps come closer to the "supercomputer for everyone" promise that Parallella probably fails to deliver on. I guess I'll wait until the first reviews from users are in (what bothers me a bit is that it will come with a "Linux for Tegra" ?!? ... what the heck do I need a special Linux for when all the other ARM boards run Debian or Ubuntu or whatever?)
HB
RE: Sorry to hear that! But
Well I think it is reasonable to mark the Parallella design as 'further work needed' ( i.e. more cores and more memory per core ) before it could fulfill what I have in mind. Factorising a larger DFT into smaller ones is good, but the recombination phase requires twiddle factors that depend on the original unfactored transform size. Plus there is shuffling of the operands either 'on the way in' to the base size DFT or 'on the way out'.
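To make that concrete, here is a minimal numpy sketch of the textbook Cooley-Tukey split ( nothing Parallella-specific, just an illustration ). Note that the twiddles applied between the two stages depend on the full transform size N, not merely on the factor sizes:

import numpy as np

def dft_by_factoring(x, N1, N2):
    # Split a length N = N1*N2 DFT into N1 DFTs of length N2, a twiddle
    # multiply, then N2 DFTs of length N1 ( standard Cooley-Tukey ).
    N = N1 * N2
    # 'shuffle on the way in': A[n1, n2] = x[n1 + N1*n2]
    A = x.reshape(N2, N1).T
    # inner DFTs of the base size N2
    Y = np.fft.fft(A, axis=1)
    # recombination twiddles: these depend on the FULL size N
    n1 = np.arange(N1).reshape(N1, 1)
    k2 = np.arange(N2).reshape(1, N2)
    Y = Y * np.exp(-2j * np.pi * n1 * k2 / N)
    # outer DFTs of size N1 stitch the pieces together
    Z = np.fft.fft(Y, axis=0)
    # output ordering: X[N2*k1 + k2] = Z[k1, k2]
    return Z.reshape(N)

x = np.random.randn(4096) + 1j * np.random.randn(4096)
print(np.allclose(dft_by_factoring(x, 4, 1024), np.fft.fft(x)))   # True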
So problem dependencies limit parallelising into independent data streams here ( the 'ideal' ), and what I have envisioned for data flows is effectively a virtual paging system but with a tiny swap file. My method could scale for sure, but we have to wait for that scaling.
I had a ( vaguer ) method whereby one could pass data along core-to-core ( turning the 4x4 grid into a pipeline ), but that rather depended upon each core having considerable persistent data. Also my development focus assumes the Epiphany chip should do the heavy lifting, since if I wanted the host to do that then I would have got an ARM board alone. You could of course use the Parallella for E@H by totally ignoring the E16, as what remains is a dual core ARM with 2GB.
While disappointed with this aspect, I am not sorry that I backed them! Not at all. I look forward to meatier instantiations of the Epiphany model. :-)
Cheers, Mike.
( edit ) 2^22 = 2^12 * 2^10, while 2^10 = 1024 and 2^12 = 4096. I reckon I could just fit a 2^10 point FFT on an E16 core ( inside 16KB ), but if I had 4096 of them ... :-) :-)
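( rough budget check for that claim, assuming 8 bytes per single precision complex value; an assumption, not a measured E16 figure )

N = 2**10
data     = N * 8          # in-place working buffer : 8192 bytes
twiddles = (N // 2) * 8   # one table of N/2 twiddle factors : 4096 bytes
print(data + twiddles, "bytes of", 16 * 1024)   # 12288 of 16384 - it just fits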
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
@HB: if the Jetson is a
@HB: if the Jetson is a "supercomputer for everyone", then anybody owning a GT640 already owns 2 to 3 supercomputers :D
@Mike: you're right, I should have directed my "sorry to hear" more towards "sorry to hear it doesn't work yet, but it was certainly worth trying"! It's too early yet to question the whole concept just because the memory is too small for a certain task.
Still, I remain sceptical: if the memory per core (and maybe external bandwidth?) is significantly increased, will the chip still be as nimble and (cost-)efficient? It could be very interesting to find out :)
My gut leans more towards a bunch of relatively simple yet low-latency cores for scalar tasks (as in the Bulldozer architecture) sharing a pool of massively parallel number crunchers (like GPU shaders) to do the heavy lifting. Latency would be higher and not necessarily predictable, but I see some major advantages:
- maximum throughput is high, as even a single core might be able to use the entire "shader" array
- automatic load balancing could be excellent, if the ISA was flexible enough
- scalability is easy, as the number of small cores could easily be tailored to the application (more for servers, less for HPC, games etc.) as well as the parallel pool (more for HPC, less for servers, mobile, low cost etc.)
I think this approach is actually the opposite of what Parallella is doing: they're trying to make many tiny cores work well together, relying on advanced / new programming techniques and low level access. I'd rather try to build something which can work as one large monolithic massively parallel "vector unit", if it's beneficial to do so, or as a multi-/many-threaded, more conventional big chip. Like a Kaveri where the shared FPU in each module is replaced by access to the GPU shader clusters, or a GPU with more flexible work distribution.
In essence I'd try to hide most of the current software complexity (for using GPUs etc.) via a flexible API, whereas Parallella needs to expose and burden the programmer with more low level details, as far as I understand. In my "vision" (OK, this might exaggerate things a bit) instructions should be like in Matlab: "perform operation x on these single-precision FP vectors or matrices and write the result to y, I don't care about any further details". One could also describe it like an on-chip HPC batch system / job scheduler.
Of course this approach is not without its own challenges. Communication and synchronisation is expensive over large distances, so designing e.g. a chip as large as a high end GPU and making one large pool of execution units available to many small cores for work submission poses a tremendous challenge for the scheduler itself, task dispatch, register files etc. That's why nVidia actually made these units smaller when going from Kepler's SMX to Maxwell, to gain efficiency. So some kind of partitioning scheme / hierarchy would certainly still be needed.
MrS
Scanning for our furry friends since Jan 2002
It's a funny business this
It's a funny business this parallel stuff for sure. A real mind-stretcher .... :-) :-)
With regard to FFT's I have 'discovered' a curious thing. The N*log(N) scaling of FFT's versus N^2 for ordinary DFT's is worthy of a closer look. While it's true that the count of operations scales that way, that is not necessarily so for the execution times. It would be so with a bog standard serial Turing style device. But we are not handling one of those, are we ?
To achieve the N*log(N) operation count [ for our discussion here let one base operation = a floating point multiplication followed by an addition of that product into an accumulator ] absolutely requires :
(A) the pre- or post- ordering of vector components to/from their linear presentation PLUS
(B) the generation of the twiddles.
That is, the A and B requirements don't enter into the N*log(N) statement. So given the restrictions encountered with the Parallella ( as is ), why not just do a quicker but otherwise standard matrix multiplication of a linear un-shuffled vector of length N? Call this a Quicker Fourier Transform ( QFT ) for this discussion. Given a square array of identical processors ( a toy model follows this list ) :
- for a given transform size N the contents of the Fourier array are fixed.
- moreover that N^2 Fourier matrix has only N distinct values within.
- the Fourier matrix is highly structured, so that a given block/sub-matrix will only have a fixed number - much less than N - of values ( corresponding to the FFT twiddles ). That is a once-only generation ( across all presented input vectors of a given length ) and the values may be persistently stored per processor node.
- suppose the array block size is K where N = P * K, P being the number of processors in a row/column ( for E16 this would be 4 ).
- each processor has responsibility for a K * K block/sub-matrix. But again that does not require storage of K^2 twiddles. Each block does have considerable structure just like the whole, with base + offset semantics allowing a single constant twiddle to be applied once after a series of accumulations for a given inner product.
- present the input vector of length N to each entire processor row : just shovel it in from the west and pass operands ( synchronously ) to the east. Each processor in a row will ( by counting the operands flowing past ) only select & store those relevant to its block. The others will be passed on eastwards.
- do the inner products for the blocks with the sub-vector ( also of length K ).
- accumulate off-chip the sub-vector inner products to form the final output operand of length N.
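Here is the toy numpy model promised above ( a host-side simulation only; it says nothing about actual E16 streaming or timing, and the sizes are just for the demo ) :

import numpy as np

P, K = 4, 8                          # P x P grid of nodes, K x K block each; E16 would have P = 4
N = P * K
row_idx, col_idx = np.meshgrid(np.arange(N), np.arange(N), indexing='ij')
F = np.exp(-2j * np.pi * row_idx * col_idx / N)   # full Fourier matrix, fixed for a given N
print(len(np.unique((row_idx * col_idx) % N)))    # exactly N distinct exponents -> N distinct values

x = np.random.randn(N) + 1j * np.random.randn(N)
y = np.zeros(N, dtype=complex)
for row in range(P):                               # processor rows
    for col in range(P):                           # processor columns
        block = F[row*K:(row+1)*K, col*K:(col+1)*K]   # persistent per-node data
        xsub  = x[col*K:(col+1)*K]                    # picked out of the west-to-east stream
        y[row*K:(row+1)*K] += block @ xsub            # 'off-chip' accumulation of the inner products
print(np.allclose(y, np.fft.fft(x)))               # True - same answer as the FFT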
Now I have yet to get down & dirty with the estimates here. But I would guesstimate that the scaling of QFT timing is :
- between FFT and DFT, with
- dependency upon the relative transfer rates on & off chip, plus
- dependency on the relative scale of the processor array to the transform array, or if you like, how much persistent data you can stuff onto a single Epiphany core ?
Of course I am alluding to the fact that many of the N^2 base operations can indeed be done simultaneously in time. Furthermore you could run an FFT on the ARM host down to some lesser factor, which is then in turn handed off to the Epiphany array. The key point being that the enclosing host algorithm doesn't care how any lesser factored matrix/vector is managed.
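As a sketch of that division of labour ( offload_base_dft below is purely hypothetical, a stand-in for whatever the Epiphany side - or anything else - might provide ) :

import numpy as np

def offload_base_dft(rows):
    # Stand-in for the accelerator : the host doesn't care whether this is
    # the Epiphany, a GPU or plain numpy, so long as it returns row-wise DFTs.
    return np.fft.fft(rows, axis=1)

def host_split_fft(x, base=1024):
    # Host-side Cooley-Tukey outer stage : factor N = M * base, hand the
    # length 'base' transforms to the offload routine, recombine on the host.
    N = x.size
    M = N // base
    A = x.reshape(base, M).T                       # A[m, b] = x[m + M*b]
    Y = offload_base_dft(A)                        # M transforms of the base size
    Y = Y * np.exp(-2j * np.pi *
                   np.arange(M).reshape(M, 1) *
                   np.arange(base).reshape(1, base) / N)   # full-size twiddles
    Z = np.fft.fft(Y, axis=0)                      # outer length-M transforms on the host
    return Z.reshape(N)

x = np.random.randn(2**14) + 1j * np.random.randn(2**14)
print(np.allclose(host_split_fft(x), np.fft.fft(x)))   # True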
In general though, there is an ( obvious ) crucial lesson here : the detail matters, especially the architecture of the hardware in relation to the problem to be solved.
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal