I am collecting IT-talented folks who would like to help prepare Debian packages, not only for the BOINC client but also for scientific applications like those of E@H. For instance, there is already a very nicely performing and easily installable boinc-app-seti package in Debian.
I think this is a great idea! To take it one step further, I would love to see platform optimized versions. For instance, the work that N30dG did on the E@H BRP4 app to tune it specifically for the RPi 3 has it returning results in half the time. I made the effort to put his software on my Pi 3 farm this weekend so I have doubled throughput (processing per WU went from 40Ksec to 20Ksec) without spending a dime!
...
At the risk of sounding repetitive/boring: N30dG's app is running on two of my Odroids and they are leaving the Pis behind, at ~1242 - 1642 average credit (13Ksec per WU). If I can be of help from a test perspective, let me know.
For now I still want to fiddle around a bit more. For instance, I just introduced link-time optimisation, for which I first want to see successful cross-validations prior to an upload to the distribution. And while it may not be ready for prime time yet, we would certainly like to hear from you about your experiences - you just need to know how to build your own Debian package for the time being. It would be nice to find volunteers to extend the build instructions to cover the respective accelerated binaries, too.
N30dG has not yet contributed his efforts to use the NEON chip in parallel to the CPU. And he knows a lot more than me on FFT and wisdom files (http://www.fftw.org/fftw-wisdom.1.html) or on how to help with data/instruction prefetches. It all gets my senses reeling. My hunch is that quite some bits are of interest to all platforms, not just to ARM. But we'll get there.
I would love to get a tutorial from N30dG on how to compile the application to optimize it for a platform - for instance the Tinker Board.
I thought the application from E@H was setup to use the NEON processor and had an internal Wisdom file already - are you saying his app isn't taking advantage of NEON in the A53 at all?
I think I can also speak for N30dG that we want
- popular platforms optimally supported without local tinkering
- new platforms optimised with a semi-humane and communicable effort
It all gets a bit more complicated (see below), since quite a part of the computational effort is not in the code that Einstein@Home writes but in the FFT library it uses. We hence want that library optimised for the individual platforms, too.
KF7IJZ wrote:
I thought the application from E@H was setup to use the NEON processor and had an internal Wisdom file already - are you saying his app isn't taking advantage of NEON in the A53 at all?
We have to wait for N30dG or Bikeman for deeper insights, but I have seen a wisdom file in the source tree (https://github.com/VolunteerComputingHelp/boinc-app-eah-brp/tree/master/wisdom) that says it is optimised for ARM6. If I got it right, these wisdom files are formalised experience on which combination of FFT variants to use for a large data set and/or for its parts of particular sizes. Which FFT variants are available and how fast they are relative to each other depends on the hardware and on the compilation options of the FFTW library. The authors of FFTW suggest that power users recompile their library to optimise it individually, and as a consequence they would all also generate their own wisdom files. This certainly applies to your better-than-ARM6 hardware on the Tinker Board. But it is unclear whether that wisdom is possibly overrated, since it is auto-acquired and should affect only the planning phases - I see little effect (if any) on my laptop, for instance, which may indicate that it was other bits that N30dG has done just right, or that I should not have just used the canonical wisdom generation. For the moment we are still gathering experience, and it should somehow be possible to address all that in a more orchestrated fashion. We still need to observe and experiment more, and maybe we will also find someone from the FFTW library or its packaging to direct us a bit on how best to cater for the diverse ARMscape.
So, please rest assured that your pointer to N30dG and his work was very fruitful. We will certainly upload to Debian once we have something half-way close to the performance of N30dG's current packages. N30dG or I will keep you informed about any progress we (or someone else? Get in touch!) make(s).
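For readers who have not used the wisdom tool yet, here is a minimal sketch of what generating a single-precision wisdom file looks like on the command line. It only assembles and prints the fftwf-wisdom call. The options -n (ignore system wisdom) and -o (output file) exist in fftwf-wisdom, but the helper name wisdom_cmd and the transform sizes in the usage line are placeholders made up for illustration; they are not the sizes the BRP4 app actually plans for.

```shell
#!/bin/sh
# wisdom_cmd OUTFILE SPEC...
# Prints (does not run) an fftwf-wisdom call for the given transform specs.
# Assumption: OUTFILE and the specs contain no spaces.
wisdom_cmd() {
    out=$1
    shift
    # -n: start without the system wisdom, -o: where to write the result
    printf 'fftwf-wisdom -n -o %s' "$out"
    for spec in "$@"; do
        printf ' %s' "$spec"
    done
    printf '\n'
}

# Usage (placeholder sizes, adjust to the transforms your app really uses):
#   wisdom_cmd /tmp/wisdomf rof4096 rib4096
#   wisdom_cmd /tmp/wisdomf rof4096 rib4096 | sh    # actually run it
```

The transform specs follow the fftw-wisdom manual page linked earlier in the thread: r or c for real or complex data, i or o for in-place or out-of-place, f or b for forward or backward, then the size.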
Some additional insight on the impact of wisdom - https://einsteinathome.org/content/building-boinc-einsteinhome-and-raspberry-pi-2
steffen_moeller wrote:
So, please rest assured that your pointer to N30dG and his work was very fruitful. We will certainly upload to Debian once we have something half-way close to the performance of N30dG's current packages. N30dG or I will keep you informed about any progress we (or someone else? Get in touch!) make(s).
So, we can't discuss the techniques and processes used to optimize our BRP4 builds?
In our experience, it makes a difference of a couple of minutes per work unit when running with a wisdom file that was created for an fftw3 library built with different compilation options (like with or without the frame pointer omitted). It is worse for different versions of the library (as in 3.3.3 vs 3.3.4), and we have not even tried it across platforms (as in ARM64 vs X86_64). We hence had the idea to encourage everyone to create their very own, perfectly fitting wisdomf file.
That wisdom file should also help with the regular binary distributed from Einstein@Home. Sadly, the official E@H BRP4 app is linked statically against the fftw3 library, and the one installed by Debian may be a different one. But give it a chance. Whether it works, does not work, or is too difficult to do, just send your experiences back and we will see what we can do. I could imagine us eventually coming up with a catalog of wisdomf files for different setups, contributed by us crunchers.
This will run for a long time, i.e. days on ARM, half a day elsewhere. Once you have the file created, you need to move it to /etc/fftw/wisdomf for the Einstein@Home BRP4 app to find it. Create the /etc/fftw folder if it does not exist yet. If there is already a wisdomf file in there, save/rename that one before copying to the destination - you may have another app that wants it. Caveat: It is truly wisdomf with the terminal f, which stands for float, i.e. the typical 32-bit floating-point representation of numbers.
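The install steps just described can be sketched as a small shell function. This is only an illustration under assumptions: the function name install_wisdomf and the timestamped backup suffix are invented here, and the sketch copies rather than moves the file so that the freshly generated /tmp/wisdomf survives. Writing to /etc/fftw needs root.

```shell
#!/bin/sh
# install_wisdomf SRC [DESTDIR]
# Copies SRC to DESTDIR/wisdomf (default DESTDIR: /etc/fftw).
# An existing wisdomf is renamed first, since another app may want it.
install_wisdomf() {
    src=$1
    destdir=${2:-/etc/fftw}
    # Refuse to install a missing or empty wisdom file
    if [ ! -s "$src" ]; then
        echo "E: $src is missing or empty" >&2
        return 1
    fi
    # Create the folder if it does not exist yet
    mkdir -p "$destdir" || return 1
    # Save a previous wisdomf under a timestamped name (hypothetical naming)
    if [ -e "$destdir/wisdomf" ]; then
        mv "$destdir/wisdomf" "$destdir/wisdomf.bak.$(date +%Y%m%d%H%M%S)" || return 1
    fi
    cp "$src" "$destdir/wisdomf"
}

# Usage, as root:
#   install_wisdomf /tmp/wisdomf
```

Passing a second argument lets you try the function on a scratch directory first, without touching /etc.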
KF7IJZ wrote:
steffen_moeller wrote:
So, please rest assured that your pointer to N30dG and his work was very fruitful. We will certainly upload to Debian once we have something half-way close to the performance of N30dG's current packages. N30dG or I will keep you informed about any progress we (or someone else? Get in touch!) make(s).
So, we can't discuss the techniques and processes used to optimize our BRP4 builds?
By all means, this is all completely meaningless without you all. Please chime in. @KF7IJZ in particular, you and your videos are perfect communicators. If you have any idea for some better outreach, then rest assured we will listen and happily help. We ourselves cannot do it much better than with a script like the one referenced and explained above, and we take some pride in that.
N30dG does not expect much of an immediate effect from recompiling for the Tinker Board. It will likely help a bit, but only a bit. And he was quite a fan of the dead-easy compilation with the source tree on GitHub for Debian vs the "works on all platforms" build script provided. So, as for recompilation, you can be happy already since you do not really need to do it, and you will be even happier once we have the package in Debian. Detailed instructions will then follow. The wisdomf file will bring most of the "wow!" and you can run the above now. That is what you want as a start: get up to twice as fast (as I observed for myself) without touching the binary at all. This will also help us tons in investigating what may cause invalid results. A change to the wisdomf file alone should not cause any additional invalid results. But it is likely there will be some, just because of rounding errors. This is then something to be addressed, since there is no real error. There is a series of rearrangements in the code that N30dG sees for vectorising some loops, i.e. introducing SIMD instructions. But this may similarly lead to rounding differences, and we should go through that first with binaries everyone trusts, i.e. Einstein's BRP4, not our self-compiled ones.
There is another little problem when using a wisdom file with the official BRP4 app. At least the 1.47_NEON_Beta and 1.06 versions include their own wisdom, and neither looks for the system wisdom in /etc/fftw/wisdomf.
I think 1.42 doesn't use an included wisdom. If you want to use your own wisdom file on the ARM platform, you should try this version.
BTW: The official BRP4 app uses fftw 3.3.2.
@KF7IJZ:
Try the following on your ASUS-board:
First try to create a wisdom file as steffen described (use the 1.42_NEON).
If you didn't get a speedup:
Try to build your own BRP app from our git repository. You will be surprised how easy it is to build your own BRP app on any platform. I'm still impressed by this (and steffen's Linux & programming knowledge).
I like your idea to downgrade and just checked with http://www.fftw.org/release-notes.html - fftw version 3.3.2 is not ideal but ok from how I read it. If that version of the fftw3 library is no longer available in the Debian distribution you use, it can be retrieved from http://snapshot.debian.org/package/fftw3/3.3.2-3.1/ . But to go through all that may be more error prone than to just go and compile from what we put up on github.
So here is what I previously meant to avoid explaining. Not because it is difficult but because it is easier once the package is uploaded to the distribution:
We could use git-buildpackage again now, but it would complain about the patches that it has already applied. But we do not need that. We just want the package built:
Quote:
quilt push -a # which ensures all patches are applied
fakeroot ./debian/rules binary # debian/rules where you want to eventually add your optimizations
And once the compilation finished, complete the tasks you have running with E@H and then
You will then almost instantly see your Einstein@Home project filled with BRP tasks.
On the plus side, the BRP app now uses the system fftw3 library, and your wisdom generator tool will be of the same version. On the down side, you are no longer using the same version that the other contributors to Einstein use, which may lead to some tasks being flagged as invalid. That should not matter performance-wise, since it will be fewer than every second work unit. Science-wise, it should be investigated whether the validator can be mended or its sensitivity otherwise be reduced when optimisations do not affect scientific accuracy.
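One way to read the "fewer than every second work unit" remark: if the faster build roughly doubles throughput, it still comes out ahead as long as less than half of its results are rejected. The sketch below only does that arithmetic; the function name effective_speedup and the example numbers are made up for illustration, not measurements from this thread.

```shell
#!/bin/sh
# effective_speedup SPEEDUP INVALID_FRACTION
# Valid results per unit time, relative to the stock app:
#   speedup * (1 - invalid fraction)
effective_speedup() {
    LC_ALL=C awk -v s="$1" -v f="$2" 'BEGIN { printf "%.2f\n", s * (1 - f) }'
}

# Usage (illustrative numbers only):
#   effective_speedup 2.0 0.25   # twice as fast, a quarter flagged invalid
#   effective_speedup 2.0 0.50   # break-even point for a 2x speedup
```

Anything above 1.00 means more valid results per day than before, which is the performance side only; the scientific question about the validator remains as described above.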
First - this level of engagement is amazing - thank you both!
N30dG wrote:
@KF7IJZ:
Try the following on your ASUS-board:
First try to create a wisdom file as steffen described (use the 1.42_NEON).
If you didn't get a speedup:
Try to build your own BRP app from our git repository. You will be surprised how easy it is to build your own BRP app on any platform. I'm still impressed by this (and steffen's Linux & programming knowledge).
When I run the wisdom creation script, should I kill the E@H processes? I thought I had read that you are supposed to burden the other cores while calculating wisdom (which didn't make sense to me).
When you say "use the 1.42_NEON", do you mean the non beta version of the BRP4 app? If so, I will need to create another account solely for my Tinker Board as the "use beta" is an account wide setting
I stopped boinc on my Tinker Board. I ran the wisdom creation (libfftw 3.3.4-2). I ran your script. It returned in about 60 seconds which somehow feels wrong :)
OK, so I reran it while E@H was crunching and it took 90 seconds to return.
Both seem far too quick, indeed. You mean that you ran the wisdom generation with my script, right?
N30dG observed a very short run on his Odroid, too. We have not investigated this yet. The script passes the "-n" flag to fftwf-wisdom, so it should not matter, but just to be sure, please have an empty /etc/fftw folder when you start the wisdom-generation script.
Does the script show the "I: Wisdom file was computed successfully" message? Otherwise the script terminated somewhere prematurely with an error.
If a wisdomf file was created then, well, just use it and (fingers crossed) let us see if it works.
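The two checks asked for above (did the success message appear, and is there a usable file) can be sketched as follows. It assumes you saved the script's output to a log file; the function name wisdom_run_ok and the log path in the usage line are hypothetical, while the message text and the /tmp/wisdomf default are taken from this thread.

```shell
#!/bin/sh
# wisdom_run_ok LOGFILE [WISDOMFILE]
# Succeeds only if the log contains the success message and the
# wisdom file (default: /tmp/wisdomf) exists and is not empty.
wisdom_run_ok() {
    log=$1
    wisdom=${2:-/tmp/wisdomf}
    if ! grep -qF "I: Wisdom file was computed successfully" "$log"; then
        echo "E: no success message in $log, the script probably stopped early" >&2
        return 1
    fi
    if [ ! -s "$wisdom" ]; then
        echo "E: $wisdom is missing or empty" >&2
        return 1
    fi
    echo "OK: $wisdom looks usable"
}

# Usage (log path is an example):
#   sh create_wisdomf_eah_brp.sh 2>&1 | tee /tmp/wisdomf.log
#   wisdom_run_ok /tmp/wisdomf.log
```

This does not explain the suspiciously short run time; it only rules out a premature exit of the script.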
Hello again,
Well, this was quite some praise for N30dG's work, I tend to think. I contacted N30dG and together we set up https://github.com/VolunteerComputingHelp/boinc-app-eah-brp . For the moment this is just the source code for the binary radio pulsar search as offered on https://einsteinathome.org/application-source-code-and-license . I have not yet looked at GPU acceleration, admittedly. And while N30dG revisited his above-referenced changes to best accommodate the code for ARM64, I just got a first successful cross-validation https://einsteinathome.org/workunit/291544594 for my laptop.
My YouTube Channel: https://www.youtube.com/user/KF7IJZ
Follow me on Twitter: https://twitter.com/KF7IJZ
To avoid problems, we have created this script (https://github.com/VolunteerComputingHelp/boinc-app-eah-brp/blob/master/debian/extra/create_wisdomf_eah_brp.sh) that creates the perfect-for-BRP4 wisdomf file for you as /tmp/wisdomf. Nothing overly special in there as you can see. We are discussing to name the output file to reflect the hardware configuration, so we can start sharing. Here it goes:
It will now likely fail with build dependency errors like
So you go and install what is missing with apt-get. In my case this looks like
OK, so I'm impatient...