I presume this fftw plan would also work on the Parallella, although it would probably not be as optimal for the Cortex-A9 CPU (the Pi2 has a Cortex-A7).
RE: I presume this fftw
[It's actually just wisdom, not a plan as such. I understand that FFTW still needs to assemble a plan from the performance hints for FFT building blocks given in the wisdom.]
Trying it on a Parallella would be an interesting experiment. Apart from the CPU cores, the cost of memory transfers should also differ between the two platforms (if you run 2 tasks in parallel on the Parallella compared to 4 in parallel on the Raspi2).
HB
out of curiosity, i had
Out of curiosity, I forced myself to do a test run on the Odroid C1 at totally stock frequencies, running 2 tasks at a time:
http://einsteinathome.org/host/11753871/tasks
Around 37,500 seconds per workunit.
Running only the boinc-client (Ubuntu repo version 7.2.42), managed with BoincTasks from another computer.
Max temperature of 59 °C at an ambient temperature of 20 °C, but using the original heatsink in a case.
RE: I've put it onto one of
I put the run times for 24 tasks into a spreadsheet before the change; the average was 82,238 seconds. It's completed 8 runs since the change, and times seem to have increased to 83,451 seconds.
I suspect it's not using the hints. Does user boinc need to be the owner of /etc/fftw/wisdomf? Does the file need some attributes set, or do we have to install the full libfftw3 package? This is what I've got at the moment:
[pre]
pi@xxx /etc/fftw $ ls -lh
total 4.0K
-rw-r--r-- 1 root root 2.1K Feb 19 21:43 wisdomf
pi@xxx /etc/fftw $
[/pre]
BOINC blog
RE: I think that EaH client
Running this on a Parallella at the moment. I had to install the libfftw3-dev package.
Regarding the issue above, the documentation I've read refers to a system-wide FFTW default file at /etc/fftw/wisdom (without the trailing f).
BOINC blog
Correction: Running
Correction: Running fftw-wisdom -v -x -o wisdom rif12582912 on a Parallella at the moment.
BOINC blog
RE: RE: I've put it onto
That is all perfect. No need to install libfftw3; the wisdom file just needs to be readable by user boinc. The /etc/fftw directory itself has to be readable and 'executable' for user boinc as well, though! You might want to check that.
But I'm confused, which one of your ARM hosts is a Pi2? This one here:
http://einsteinathome.org/host/11751336/tasks
seems to show exactly the kind of speed-up that my Pi2 got from the wisdom file. Is that another host?
Cheers
HB
RE: Correction: Running
Great! Prepare to wait for quite some time (days) for this to finish.
Notes (I mentioned the following points before, but just to be sure):
a) The version of fftw-wisdom must match the version of libfftw3 that will use the wisdom file; for the BRP4 stock NEON app from E@H, this is version 3.3.2 (!). Wisdom generated by 3.3.3 or later versions won't help! I just wanted to stress this before you waste days of computing time on that little fella.
b) The stock BRP ARM-Linux NEON version of the app that E@H distributes uses in-place FFTs. If you take the official source code tarball (which doesn't include our Android and ARM-Linux patches) and adapt it to ARM yourself, you will get an app that uses out-of-place FFTs.
c) When experimenting with this, I found it is better to run fftw-wisdom while the CPU is under a load level similar to the one under which the wisdom file will later be used. If you stop BOINC and run just a single instance of fftw-wisdom, the timings it collects might be very different, because it can then use the full bandwidth to the RAM, while under full E@H load libfftw3 has to share that bandwidth with the other BRP instances running.
Good luck getting good wisdom for the Parallella!
HB
RE: Great! Prepare to wait
I compiled fftw-3.3.2 for the board (TBS2910) and ran fftwf-wisdom for rof12M with 3 other E@H clients running for CPU load (quad-core CPU).
Are there any other changes in the source of the official ARM client?
When will the ARM patches be available in the source releases?
Thank you,
RE: Is there any other
ARM changes:
* The backtrace-dumping functionality is turned off because I couldn't get it to work on ARM,
* the build.sh script is heavily modified to also allow for cross-compilation,
* the FFT is in-place,
* support for pre-canned wisdom read from a string compiled into the code itself,
* support for reading the /etc/fftw/wisdomf file. See http://www.fftw.org/doc/Caveats-in-Using-Wisdom.html on how to include this in your code; it's just a single line.
We are currently working on performance fixes and improvements; once that is finished and tested, a new source tarball should be created.
Cheers
HB
Is this the
Is this the future?
http://www.computerworld.com/article/2885164/lenovo-building-its-first-prototype-arm-server.html
I think this is an interesting approach. Unlike the accelerator chip on the Parallella, this solution uses plain ARM cores in a multi-core SoC (up to 48 cores per chip). It should be much easier to write software for, and it is similar to the current Xeon Phis, but without the complication of having a traditional host system plus an accelerator card with x86 multicores connected over PCIe.
I wonder how many years it will take for the first ARM-based supercomputer to appear in the Top500 list. Any bets?
Cheers
HB