We are currently re-unifying our code branches for the regular x86 CPU & GPU apps with the (up until now) somewhat experimental ARM code line, and after that we'll put a new BRP source-code package online so everyone can do experiments on their own.
RE: We are currently
Once you unify the code base, can the wisdom tuning also be applied to the regular CPU & GPU apps? This could be a really cool feature! The more knowledgeable people could optimize for some typical cases and share the results, so they could be used by many.
I'm not sure what the impact on overall E@H throughput would be; first and foremost that would depend on the actual speedup.
MrS
Scanning for our furry friends since Jan 2002
RE: Once you unify the
Theoretically yes, but
* there won't be any BRP work for CPUs other than for Android and ARM-Linux. The BRP6 tasks are just too big.
* this wisdom tuning is FFTW-only; there is no equivalent mechanism for the GPU app versions.
* The only other E@H app where this applies at the moment is the Fermi search, where we use FFTW for the CPU jobs. As an experimental feature, the app already accepts a wisdom file via the command line parameter "-v wisdomfile", and you can use an app_config.xml file to append this extra option to the app's command line; a minimal example follows below.
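A minimal app_config.xml sketch of what I mean (the app_name below is only a placeholder; use whatever name your client_state.xml lists for the Fermi CPU app, and replace wisdomfile with the path to your wisdom file):

<app_config>
    <app_version>
        <app_name>hsgamma_FGRP4</app_name>  <!-- placeholder, check client_state.xml for the real name -->
        <plan_class></plan_class>           <!-- plain CPU version -->
        <cmdline>-v wisdomfile</cmdline>    <!-- the experimental wisdom option -->
    </app_version>
</app_config>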
Cheers
HB
RE: The only other E@H app
Could we get a command-line switch to use out-of-place FFTW? Have the app default to in-place transforms to reduce memory consumption, but those with extra memory could enable out-of-place via the command line.
BOINC blog
RE: Could we get a command
To be honest, this would be very low on my list of things to do for E@H, as it is really useful only for a very limited number of hosts. But there will be source code available, so... ;-)
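For reference, the difference comes down to whether the FFTW plan is created with one shared buffer or with separate input and output buffers. A minimal single-precision sketch (the transform size and type are chosen only for illustration, not what the BRP/FGRP apps actually use):

#include <fftw3.h>

int main(void)
{
    const int N = 1 << 22;   /* 4M points, illustration only */

    /* In-place: input and output share one buffer, keeping memory use low. */
    fftwf_complex *buf = fftwf_alloc_complex(N);
    fftwf_plan p_in = fftwf_plan_dft_1d(N, buf, buf, FFTW_FORWARD, FFTW_MEASURE);

    /* Out-of-place: a separate output buffer, roughly twice the memory
       for this transform, but sometimes faster. */
    fftwf_complex *in  = fftwf_alloc_complex(N);
    fftwf_complex *out = fftwf_alloc_complex(N);
    fftwf_plan p_out = fftwf_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_MEASURE);

    fftwf_destroy_plan(p_in);
    fftwf_destroy_plan(p_out);
    fftwf_free(buf);
    fftwf_free(in);
    fftwf_free(out);
    return 0;
}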
Cheers
HB
I still have NTSC TVs, and no
I still have NTSC TVs, and no converter boxes, in the US. I currently use the old TVs to play DVDs. No live TV of any kind. Well, YouTube.
My long-term strategy (getting shorter all the time, as two TVs have died, leaving me with my smallest screens) has been to get bigger monitors for my desktops. They'd be on the LAN, and their video cards would do the crunching (an NVIDIA 650 Ti with 768 processors is now $100). In my experience, desktops last a long time and are more repairable and upgradable. The downside is audio noise. An idea that may work in my home is to have the computer physically in another room, below or next to the display room, with a wireless keyboard or maybe just a mouse, running Linux. I'd expect a 650 Ti to easily outperform an ARM DSP, or several. I just discovered that at least one TV can be replaced by a monitor at a reasonable price, and an existing card can drive it. In addition to being *much* sharper (many more dots), the new screen would be physically larger, making it easier on our aging eyes.
For Einstein, supporting the Pi, Parallella, Arduino, and smart phones may make sense as well.
HB wrote: Theoretically yes,
Ah, that makes sense. Thanks for answering!
MrS
Scanning for our furry friends since Jan 2002
I put Debian Wheezy, which
I put Debian Wheezy, which has FFTW 3.3.2, onto an SD card and then generated an FFT plan in patient mode for my only B+. I then copied the wisdom across to the Debian Jessie SD card and now have it running a task. Tasks normally take 31.5 hours, so let's see if it makes any difference. It's got a "medium" overclock. The host can be found here
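(In case anyone else wants to generate their own wisdom: it essentially boils down to planning once in FFTW_PATIENT mode and exporting the result, either with the fftwf-wisdom tool that ships with FFTW or with a few lines of C like the sketch below. The transform size and type here are just an example and would have to match what the app really plans.)

#include <stdio.h>
#include <fftw3.h>

int main(void)
{
    const int N = 1 << 22;                     /* example size only */
    float *in = fftwf_alloc_real(N);
    fftwf_complex *out = fftwf_alloc_complex(N / 2 + 1);

    /* FFTW_PATIENT tries many algorithm variants; this can take hours
       on a Pi, but only has to be done once per machine. */
    fftwf_plan p = fftwf_plan_dft_r2c_1d(N, in, out, FFTW_PATIENT);

    if (!fftwf_export_wisdom_to_filename("wisdom.dat"))
        fprintf(stderr, "could not write wisdom file\n");

    fftwf_destroy_plan(p);
    fftwf_free(in);
    fftwf_free(out);
    return 0;
}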
I also got some more copper heat sinks, so I have put a 3rd Pi2 up. I overclocked it with the "Pi2" setting (1000 MHz) and it seems to be turning in tasks in around 13.83 hours. It's got the wisdom as well as being OC'ed. It's this host
In contrast, the other two Pi2s are running with the wisdom but are not overclocked. They seem to be taking around 16.17 hours.
Lastly, I put the wisdom file on both Parallellas, so they are now slightly quicker than the Pi2s that are not OCed. The OCed Pi2 beats the Parallella now.
BOINC blog
Hi, Andrew (Rpi GPU_FFT
Hi,
Andrew (the Rpi GPU_FFT developer) sent me a new version of the 2M C2C GPU_FFT with increased accuracy (he will release this version in the future).
His tests (and mine) show that the accuracy of the 2M C2C GPU_FFT is ~10^-6, while FFTW 3.3.2 reaches about 10^-7.
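(For anyone wondering where numbers like these come from: one simple way to get them is to run the same input through a single-precision transform and through a double-precision FFTW transform used as the reference, then take the relative RMS error; the GPU_FFT output can be scored against the same double-precision reference. A rough sketch, not the exact test we ran:)

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <fftw3.h>

int main(void)
{
    const int N = 1 << 21;                         /* 2M points */
    fftwf_complex *in_f  = fftwf_alloc_complex(N); /* single precision */
    fftwf_complex *out_f = fftwf_alloc_complex(N);
    fftw_complex  *in_d  = fftw_alloc_complex(N);  /* double = reference */
    fftw_complex  *out_d = fftw_alloc_complex(N);

    fftwf_plan pf = fftwf_plan_dft_1d(N, in_f, out_f, FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_plan  pd = fftw_plan_dft_1d (N, in_d, out_d, FFTW_FORWARD, FFTW_ESTIMATE);

    for (int i = 0; i < N; i++) {                  /* identical random input */
        in_d[i][0] = (double)rand() / RAND_MAX - 0.5;
        in_d[i][1] = (double)rand() / RAND_MAX - 0.5;
        in_f[i][0] = (float)in_d[i][0];
        in_f[i][1] = (float)in_d[i][1];
    }

    fftwf_execute(pf);
    fftw_execute(pd);

    double err = 0.0, ref = 0.0;                   /* relative RMS error */
    for (int i = 0; i < N; i++) {
        double dre = out_f[i][0] - out_d[i][0];
        double dim = out_f[i][1] - out_d[i][1];
        err += dre * dre + dim * dim;
        ref += out_d[i][0] * out_d[i][0] + out_d[i][1] * out_d[i][1];
    }
    printf("relative RMS error: %g\n", sqrt(err / ref));
    return 0;
}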
I ran and validated 3 WUs on my Rpi (B+ @ 1 GHz). You can find the host here.
For the first two WUs there is a lot of debug info (mostly timing output).
The client without the debug info crunches a WU in 65.3 ks.
Is there any Rpi B+ @ 1 GHz running the official client to compare with?
The main problems now are:
1. You must run BOINC as root to be able to access the GPU. There is a custom driver which lets you access the GPU as a normal user (here).
2. If you stop the eah_client there is a problem with cleaning up the GPU resources (memory, etc.).
You must reboot the Rpi to be able to start the client again. In the custom driver there is a mechanism to clean up the resources.
Thank you,
RE: Hi, Andrew (Rpi
Exciting news!!
So here it is: the first GPU-driven, working BRP4 app on the Pi that beats the CPU version (see below). Congratulations!!!!
This is a Pi B of mine @ 1 GHz CPU and 600 MHz RAM overclock (B or B+ shouldn't matter here):
http://einsteinathome.org/host/11456696/tasks&offset=0&show_names=1&state=3&appid=0
Pretty much exactly 24h per task.
As for having to run BOINC as root: alternatively, you could have a wrapper executable that is started by BOINC as the boinc user and which then starts the real app via sudo.
The "hello_fft" GPU example application also required you to create a device file as root; how is that handled in your scenario?
As for the GPU resources not being cleaned up: that's a problem I've also run into when interrupting (or crashing) some Pi camera applications; it would be great if the driver could take care of this. Otherwise, maybe a wrapper executable could help here as well with the cleanup.
This is exciting. Do you think there is potential for even higher performance?
It would be cool to beat the Pi 2 (Model B, at ca. 50 ks per task), which would also beat the Parallella (on CPU only, on a per-core basis).
EDIT:
PS.: Also the OUYA running under Android, with its Tegra 3 CPU, would appear to be within range (on a per-core basis). Perhaps surprisingly, the wisdom file I posted earlier also works with the soft-fp ABI version of FFTW under Android and gives a noticeable speedup: http://einsteinathome.org/host/10214158/tasks&offset=0&show_names=1&state=3&appid=0 . Beating that one (per core) with a Raspi GPU would also be cool.
Cheers
HB
RE: This is a PI B of mine
So we have about a 25% speedup from using the GPU (65.3 ks with the GPU versus roughly 86.4 ks for the 24 h CPU-only run).
The device file (/dev/gpu_dev) is created by the code if it doesn't exist.
Is there any way to force BOINC to run the eah_client as root from the config files?
As for the cleanup: we can eliminate the problem if we run the "tear_down_fft" function when BOINC closes the eah_client. We must first wait for the currently running GPU_FFT iteration to finish and then clean up with tear_down_fft.
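Something along these lines, as a sketch only (the two stubs stand in for the real per-iteration GPU_FFT call and for the real tear_down_fft; that BOINC stops the app with SIGTERM is also just an assumption here):

#include <signal.h>
#include <stdio.h>

static volatile sig_atomic_t quit_requested = 0;

static void on_term(int sig)
{
    (void)sig;
    quit_requested = 1;                   /* only set a flag in the handler */
}

static int run_one_fft_iteration(int i)   /* stub for one GPU_FFT batch */
{
    printf("iteration %d done\n", i);
    return i < 1000;                      /* pretend there is more work */
}

static void tear_down_fft(void)           /* stub for the real cleanup */
{
    printf("GPU resources released\n");
}

int main(void)
{
    signal(SIGTERM, on_term);
    signal(SIGINT, on_term);

    int i = 0;
    while (!quit_requested && run_one_fft_iteration(i++))
        ;                                 /* never interrupt mid-iteration */

    tear_down_fft();                      /* runs on normal exit and on stop */
    return 0;
}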
Andrew is working on a 4M-FFT kernel for GPU_FFT with the same accuracy as the 2M-FFT. This implementation will most probably speed up the eah_client even more.
I must change a lot of code to support the new 4M-FFT kernel, but I suspect that the new implementation will also reduce the memory utilization of the current GPU client (which is very high, ~230 MB):
In the current implementation I run 3 x 2M C2C GPU_FFT kernels and then perform a radix-3 stage and an extra twiddling for the C2C-to-R2C conversion (half the output, with the power-spectrum calculation embedded).
I use ping-pong buffers to save memory, but each buffer is 6M complex floats.
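(For scale, assuming single-precision complex values of 8 bytes each: 6M complex floats are 6 x 2^20 x 8 bytes ≈ 48 MB per buffer, so the ping-pong pair alone accounts for roughly 96 MB of the ~230 MB.)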
Also all the twiddles for the radix-3 and the R2C stages are pre-computed and stored in memory for speed.
An optimization there is that I split each stage into 4 loops, and each loop reuses the same twiddles with different calculations.
So I only keep 1/4 of the needed twiddles per stage in memory.
With the new 4M FFT kernel I will run 3 x 4M GPU_FFTs and an extra radix-3 stage (half the output, i.e. 2 of the 4 loops, with the power-spectrum calculation embedded).
Thank you,