A more concrete example of procedural vs. concurrent approaches : 'short circuit' evaluation of a compound boolean expression.
IF(( A AND B ) OR ( C AND D )) THEN ....
If A = B = TRUE then on a sequential machine it makes no difference whether or not I go on to evaluate C and D - assuming I evaluate left to right, that is. If I evaluate right to left then I only need to worry about A and B if the sub-expression ( C AND D ) is false. So some choices of operand values let one skip part of the evaluation, perhaps yielding a time benefit. This gives an effective algorithm ( in left to right order ) of :
[pre]#include <stdbool.h>

/* sequential 'short circuit' form, evaluated left to right */
bool eval(bool A, bool B, bool C, bool D)
{
    if (A && B)
        return true;      /* ( C AND D ) is never examined */
    else
        return (C && D);  /* reached only when ( A AND B ) is false */
}
[/pre]
Now in a concurrent device it would be reasonable to have the outputs of two 2-input AND gates leading into a 2-input OR, the output of which is the final evaluation of the entire expression. There's no reason here to block the evaluation of one sub-expression due to the result of the other as, being concurrent, there is no time advantage for doing so.
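To make that concrete, here is a minimal Verilog sketch of exactly that gate arrangement ( the module and signal names are my own invention ) - both AND sub-expressions are evaluated at once, so there is nothing to short-circuit :
[pre]// Concurrent evaluation of (( A AND B ) OR ( C AND D )).
// Both AND gates work simultaneously; the OR merges their outputs.
module compound_or (
    input  wire A, B, C, D,
    output wire Y
);
    wire ab = A & B;     // first 2-input AND gate
    wire cd = C & D;     // second 2-input AND gate
    assign Y = ab | cd;  // 2-input OR gives the final evaluation
endmodule
[/pre]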
Cheers, Mike.
( edit ) 'short circuit' has a double meaning here. On the one hand you have the 'short-cut' idea of a briefer computation time ( sequential circuit ). But on the other hand, one could actually pull a transistor output either 'up' or 'down' and effectively block the effect of a competing current driver ( concurrent circuit ), like the other AND gate for instance ( though you may have to sink the extra current load ).
( edit ) By all this I mean to pose the question : can an FPGA synthesis tool writer make a 'sequential to concurrent' conversion as obvious and as correct as it is here, for any arbitrary sequential construct ??
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
Thanks Mike,
From the beginning I never had it in my mind to use a conversion tool to port C/C++ to HDL. Intuition told me from the start that that route is likely to be difficult, though in the hands of a master who knows the precise limitations and strengths of such a tool - when to thrust and when to parry, so to speak - it might work a treat. In my experience such masters are not born, they are made, and what makes them is learning to do it the "hard" way first, which in this case means mastering HDL first. Before the dawn of CNC machining, an apprentice machinist in the UK wasn't allowed to touch a lathe, or any machine tool for that matter, until he had mastered the art of cutting with a hacksaw and filing. During that time he acquired an intimacy with the materials he worked with. He knew how they felt in his hands, and that was good, because then he knew, at some vague, intuitive, almost spiritual level, the challenge his first lathe would have to overcome.
Early on I decided that if I was going to pursue this, I was going to become proficient in HDL, or I was not going to pursue it at all. Now I am slowly realizing that doing the conversion "manually" probably isn't the best way either, and I'm comfortable with that; it just means it might take a little longer. I think I'll have to work from first principles, by which I mean understanding the problem and the math first, then determining an algorithm amenable to HDL, and finally coding it without polluting my mind with C++ (sorry Stroustrup, no offense intended). Maybe I can build or find something like libraries or modules for oft-performed tasks, but frankly I don't know if such constructs even exist in HDL. Guess I'll find out.
I'm going to order a XuLA2-LX25. The Parallella looks good too but I don't want to wait long for one.
@Mac : yeah, it's an area full of squirrels, alas. As some Parallella backer has pointed out, the inclusion of an FPGA via that particular Zynq chip is actually a slightly cheaper ( not by much, but lower nonetheless ) route to FPGA ownership than the alternatives ie. you could ignore the Epiphany chip and still have a bargain. Ouch !!! :-0
Here's another serial vs parallel contrast. Gauss, when he was a mere lad, so annoyed his teacher - by finishing his work very quickly and then asking for more - that he was told to go away and add up the integers from 1 to 100 inclusive. Obviously the teacher was expecting some sort of sequential approach that would take the young Gauss a while to complete. But in a few minutes he returned with the correct answer by a clever ruse : you can group the summands in pairs such that each pair totals 101 ie. 1 with 100, 2 with 99, 3 with 98 etc. There are fifty such pairs, hence 50 * 101 = 5050 !!! :-)
Now a key point here is that with a sequential construct the time taken grows linearly with the upper limit ( 100 here ) - the teacher's intent - but Gauss's method stays fairly flat as you increase the upper limit. In general the sum from 1 to N is N * ( N + 1 ) / 2, so his time goes like the logarithm of the upper limit, or if you like the number of decimal digits needed to encode it ( as that determines the number of operations in a long multiplication ).
So here be squirrels. It took knowledge of a pattern specific to the problem domain ( certain pairings of operands give an invariant ) to defeat brute force, aka bland FOR looping. Provided the operands are bounded ( and in what digital circuit aren't they ? ), then for this problem a concurrent approach using a fixed-length operand representation will beat a sequential device for all but perhaps the smallest values.
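To illustrate, here is a minimal Verilog sketch of both devices ( the module names, widths, and the n >= 1 assumption are my own choices ) : a sequential accumulator that needs one clock cycle per addend, versus Gauss's closed form evaluated as fixed-width combinational logic in a single pass :
[pre]// The teacher's device : accumulate 1..n, one addition per clock.
module sum_sequential (
    input  wire        clk, start,
    input  wire [7:0]  n,      // upper limit, assumed >= 1
    output reg  [15:0] sum,
    output reg         done
);
    reg [7:0] i;
    always @(posedge clk) begin
        if (start) begin
            i <= 8'd1; sum <= 16'd0; done <= 1'b0;
        end else if (!done) begin
            sum <= sum + i;             // one addend per cycle
            if (i == n) done <= 1'b1;
            else        i <= i + 8'd1;
        end
    end
endmodule

// Gauss's device : n * ( n + 1 ) / 2 as pure combinational logic.
module sum_gauss (
    input  wire [7:0]  n,
    output wire [15:0] sum
);
    assign sum = (n * (n + 16'd1)) >> 1;  // pairing trick, one pass
endmodule
[/pre]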
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
So for those who are not yet thru with Kickstarter projects:
http://www.kickstarter.com/projects/1575992013/logi-fpga-development-board-for-raspberry-pi-beagl
Cheers
HBE
Sure, doing it "properly" straight in HDL is probably necessary for the most efficient result. However, this takes serious development effort. Just look around at how quickly projects are transitioning to GPUs, OpenCL, SSE, AVX etc. - some manage parts of it, others don't. And by going this route you (at least initially) suddenly have one more code branch to develop and update, one which might work in a fundamentally different way. That's no joy for projects which are already limited by manpower.
I once read a statement from some programmer: "I write all my programs in assembler and do not waste a single bit of memory or a single clock cycle. That's how all programming should be done." - which IMO is complete rubbish. You have to use the right tool for the job. Creating the perfect program is of no use if the first stable version is ready 10 years after it stopped being needed.
Perhaps converting from OpenCL or CUDA is easier than from straight C/C++? The reason being that these languages are fundamentally laid out for parallel problems, and it's surely easier to replicate a small synthesized block as often as possible than to synthesize one complicated block.
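As a hedged illustration of that replication idea ( the 'worker' module, its names and its widths are all my own invention ), HDL makes copying a small block nearly free via a generate loop :
[pre]// One small block, replicated N times side by side - loosely in the
// spirit of an OpenCL/CUDA work-item. All names are illustrative.
module worker (input wire [15:0] in, output wire [15:0] out);
    assign out = in + 16'd1;   // stand-in for one small kernel step
endmodule

module farm #(parameter N = 8) (
    input  wire [16*N-1:0] in_bus,   // N lanes packed side by side
    output wire [16*N-1:0] out_bus
);
    genvar k;
    generate
        for (k = 0; k < N; k = k + 1) begin : lanes
            worker w (.in (in_bus [16*k +: 16]),
                      .out(out_bus[16*k +: 16]));
        end
    endgenerate
endmodule
[/pre]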
Regarding soft cores: I'm not sure that's enough. I seem to remember that FPGAs need about 10 x as many transistors as the native circuit they're emulating would require, and run at about 1/10th the clock speed. If this is true you'd have to make up for a 100-fold performance loss compared to traditional CPUs before gaining any speed. Since our CPU cores are not that bad, I think this is a really tough challenge. It's much easier with "hard"-wired logic adapted to the problem at hand (which is what FPGAs are traditionally used for).
MrS
Scanning for our furry friends since Jan 2002
Well, it seems shipping has started - to the earlier and/or higher-investing backers.
@MrS : My HDL etc knowledge is pretty skimpy. As with all problems, the core issue remains what resources one can throw at the thing, plus the other constraints.
[ BTW : E@H may well have comparatively more manpower than other projects, but I'm pretty sure their allocation ( to E@H ) is maxed out currently and has-been/will-be for some time. By that I mean they don't work 'for' E@H, they work for AEI/MPG, and E@H is one of many irons they have in the fire there. ]
@Mac : There's a pretty good book on FPGAs linked to from the XuLA2-LX25 site. It reads well, to me at least. It could probably be used with other boards as well, by slotting in different constraints files and family/device/package settings in the project wizard.
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
@MrS,
I agree with your views on assembler vs. C/C++, especially these days when compilers do a decent job of optimizing - and an even better one if you understand the fundamentals of the hardware architecture and code accordingly. I extend that belief to HDL and claim that if I'm not proficient in HDL then I probably won't get good results using a translator. I think one needs to know HDL rather well to use a translator tool well, but these are early days. I might think differently 2 months from now, after I get my hands dirty with it.
Regarding the speed of FPGAs, I have heard the same, but there are tradeoffs. I didn't make it clear in my first post, but my plan is to build something for crunching GPU tasks, not CPU tasks. I doubt an FPGA is a cost-effective alternative to CPUs for CPU tasks. My goal is to create a motherboard based on an FPGA that has one or more PCIe slots into which you would plug a GPU. The rationale is that we spend a lot of money on general purpose motherboards with general purpose CPUs that do far more than many of us need them to do, which is just to drive a GPU. I would use ARM but it's not fast enough to drive a PCIe x16 "bus" at gen 3.0 speeds, probably not even gen 2.0 speeds. It won't do email, skype, play music or balance your cheque book, but it will crunch, cost far less and need less real estate. I considered adding a PCIe bridge to a RasPi, but I don't think it has enough I/O lines or speed.
@Mike,
That book is one of the things that got me interested in FPGA. It reads well for me too.
Quote:
My goal is to create a motherboard based on an FPGA that has one or more PCIe slots into which you would plug a GPU. The rationale is that we spend a lot of money on general purpose motherboards with general purpose CPUs that do far more than many of us need them to do, which is just to drive a GPU.
I would say this is very, very ambitious. Note that the NVIDIA and AMD GPGPU cards are rather dumb (unlike the Intel Xeon Phi, which is a complete Linux system by itself). So the FPGA(s) would not only have to implement the PCIe data transfer protocol etc., you would also have to convert much of the functionality that lives in the CUDA or OpenCL drivers into FPGA logic!
Quote:
I would use ARM but it's not fast enough to drive a PCIe x16 "bus" at gen 3.0 speeds, probably not even gen 2.0 speeds.
The good thing about ARM is that NVIDIA supports those CPUs for CUDA. And there are already implementations that do support PCIe x16 (Gen 2, I guess); for example, google "Kayla" (unfortunately those are quite expensive, but they demonstrate that ARM SoCs can drive off-the-shelf GPGPU boards).
Cheers
HB
Thank you for posting this link. I pointlessly go thru Kickstarter every once in a while and never see anything about kicking a computer to get it started. I will go to bed a wiser woman thanks to you.
Parallella Update #46
The first small batch of "production" boards going to backers still shows some electrical and thermal issues. But at least the boards seem to be usable, as shown in a video linked from the update. As the Parallella has no real GPU, a Raspberry Pi is used in this demo to render the OpenGL graphics computed on the Parallella board.
Cheers
HBE