We are currently re-unifying our code branches for the regular x86 CPU & GPU apps with the (up until now) somewhat experimental ARM code line, and after that we'll put a new BRP source-code package online so everyone can do experiments on their own.
RE: We are currently
Once you unify the code base, can the wisdom tuning also be applied to the regular CPU & GPU apps? This could be a really cool feature! The more knowledgeable people could optimize for some typical cases and share the results, so they could be used by many.
I'm not sure what the impact on overall E@H throughput would be; first and foremost that would depend on the actual speedup.
MrS
Scanning for our furry friends since Jan 2002
RE: Once you unify the
Theoretically yes, but
* there won't be any BRP work for CPUs other than for Android and ARM-Linux. The BRP6 tasks are just too big.
* this wisdom tuning is FFTW-only; there is no equivalent mechanism for the GPU app versions.
* The only other E@H app where this applies at the moment is the Fermi search, where we use FFTW for the CPU jobs. As an experimental feature, the app already accepts a wisdom file via the command line parameter "-v wisdomfile", and you can use an app_config.xml file to append this extra option to the app's command line; a minimal example follows below.
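A minimal app_config.xml sketch of what I mean (the app_name below is only a placeholder; use whatever name your client_state.xml lists for the Fermi CPU app, and replace wisdomfile with the path to your wisdom file):

<app_config>
    <app_version>
        <app_name>hsgamma_FGRP4</app_name>  <!-- placeholder, check client_state.xml for the real name -->
        <plan_class></plan_class>           <!-- plain CPU version -->
        <cmdline>-v wisdomfile</cmdline>    <!-- the experimental wisdom option -->
    </app_version>
</app_config>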
Cheers
HB
RE: The only other E@H app
Could we get a command-line switch to use out-of-place FFTW? Have the app default to in-place transforms to reduce memory consumption, but those with extra memory could enable out-of-place via the command line.
BOINC blog
RE: Could we get a command
To be honest, this would be very low on my list of things to do for E@H, as it is really useful only for a very limited number of hosts. But there will be source code available, so... ;-)
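For reference, the difference comes down to whether the FFTW plan is created with one shared buffer or with separate input and output buffers. A minimal single-precision sketch (the transform size and type are chosen only for illustration, not what the BRP/FGRP apps actually use):

#include <fftw3.h>

int main(void)
{
    const int N = 1 << 22;   /* 4M points, illustration only */

    /* In-place: input and output share one buffer, keeping memory use low. */
    fftwf_complex *buf = fftwf_alloc_complex(N);
    fftwf_plan p_in = fftwf_plan_dft_1d(N, buf, buf, FFTW_FORWARD, FFTW_MEASURE);

    /* Out-of-place: a separate output buffer, roughly twice the memory
       for this transform, but sometimes faster. */
    fftwf_complex *in  = fftwf_alloc_complex(N);
    fftwf_complex *out = fftwf_alloc_complex(N);
    fftwf_plan p_out = fftwf_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_MEASURE);

    fftwf_destroy_plan(p_in);
    fftwf_destroy_plan(p_out);
    fftwf_free(buf);
    fftwf_free(in);
    fftwf_free(out);
    return 0;
}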
Cheers
HB
I still have NTSC TVs, and no
I still have NTSC TVs, and no converter boxes, in the US. I currently use the old TVs to play DVDs. No live TV of any kind. Well, YouTube.
My long-term strategy (getting shorter all the time, as two TVs have died, leaving me with my smallest screens) has been to get bigger monitors for my desktops. They'd be on the LAN, and their video cards would do the crunching (an NVIDIA 650 Ti with 768 processors is now $100). In my experience, desktops last a long time and are more repairable and upgradable. The downside is audio noise. An idea that may work in my home is to have the computer physically in another room, below or next to the display room, with a wireless keyboard or maybe just a mouse, running Linux. I'd expect a 650 Ti to easily outperform an ARM DSP, or several. I just discovered that at least one TV can be replaced by a monitor at a reasonable price, and an existing card can drive it. In addition to being *much* sharper (many more dots), the new screen would be physically larger, making it easier on our aging eyes.
For Einstein, supporting the Pi, Parallella, Arduino, and smart phones may make sense as well.
HB wrote: Theoretically yes,
Ah, that makes sense. Thanks for answering!
MrS
Scanning for our furry friends since Jan 2002
I put Debian Wheezy, which
I put Debian Wheezy, which has FFTW 3.3.2, onto an SD card and then generated an FFT plan in patient mode for my only B+. I then copied the wisdom across to the Debian Jessie SD card and now have it running a task. Tasks normally take 31.5 hours, so let's see if it makes any difference. It's got a "medium" overclock. The host can be found here
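(In case anyone else wants to generate their own wisdom: it essentially boils down to planning once in FFTW_PATIENT mode and exporting the result, either with the fftwf-wisdom tool that ships with FFTW or with a few lines of C like the sketch below. The transform size and type here are just an example and would have to match what the app really plans.)

#include <stdio.h>
#include <fftw3.h>

int main(void)
{
    const int N = 1 << 22;                     /* example size only */
    float *in = fftwf_alloc_real(N);
    fftwf_complex *out = fftwf_alloc_complex(N / 2 + 1);

    /* FFTW_PATIENT tries many algorithm variants; this can take hours
       on a Pi, but only has to be done once per machine. */
    fftwf_plan p = fftwf_plan_dft_r2c_1d(N, in, out, FFTW_PATIENT);

    if (!fftwf_export_wisdom_to_filename("wisdom.dat"))
        fprintf(stderr, "could not write wisdom file\n");

    fftwf_destroy_plan(p);
    fftwf_free(in);
    fftwf_free(out);
    return 0;
}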
I also got some more copper heat sinks, so I have put a 3rd Pi2 up. I overclocked it with the "Pi2" setting (1000 MHz) and it seems to be turning in tasks in around 13.83 hours. It's got the wisdom as well as being OC'ed. It's this host
In contrast, the other two Pi2s are running with the wisdom but are not overclocked. They seem to be taking around 16.17 hours.
Lastly, I put the wisdom file on both Parallellas, so they are now slightly quicker than the Pi2s that are not OCed. The OCed Pi2 beats the Parallella now.
BOINC blog
Hi, Andrew (Rpi GPU_FFT
Hi,
Andrew (the Rpi GPU_FFT developer) sent me a new version of the 2M C2C GPU_FFT with increased accuracy (he will release this version in the future).
His tests (and mine) show that the accuracy of the 2M C2C GPU_FFT is ~10^-6, while FFTW 3.3.2 reaches about 10^-7.
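(For anyone wondering where numbers like these come from: one simple way to get them is to run the same input through a single-precision transform and through a double-precision FFTW transform used as the reference, then take the relative RMS error; the GPU_FFT output can be scored against the same double-precision reference. A rough sketch, not the exact test we ran:)

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <fftw3.h>

int main(void)
{
    const int N = 1 << 21;                         /* 2M points */
    fftwf_complex *in_f  = fftwf_alloc_complex(N); /* single precision */
    fftwf_complex *out_f = fftwf_alloc_complex(N);
    fftw_complex  *in_d  = fftw_alloc_complex(N);  /* double = reference */
    fftw_complex  *out_d = fftw_alloc_complex(N);

    fftwf_plan pf = fftwf_plan_dft_1d(N, in_f, out_f, FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_plan  pd = fftw_plan_dft_1d (N, in_d, out_d, FFTW_FORWARD, FFTW_ESTIMATE);

    for (int i = 0; i < N; i++) {                  /* identical random input */
        in_d[i][0] = (double)rand() / RAND_MAX - 0.5;
        in_d[i][1] = (double)rand() / RAND_MAX - 0.5;
        in_f[i][0] = (float)in_d[i][0];
        in_f[i][1] = (float)in_d[i][1];
    }

    fftwf_execute(pf);
    fftw_execute(pd);

    double err = 0.0, ref = 0.0;                   /* relative RMS error */
    for (int i = 0; i < N; i++) {
        double dre = out_f[i][0] - out_d[i][0];
        double dim = out_f[i][1] - out_d[i][1];
        err += dre * dre + dim * dim;
        ref += out_d[i][0] * out_d[i][0] + out_d[i][1] * out_d[i][1];
    }
    printf("relative RMS error: %g\n", sqrt(err / ref));
    return 0;
}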
I ran and validated 3 WUs on my Rpi (B+ @ 1 GHz). You can find the host here.
For the first two WUs there is a lot of debug info (mostly timing output).
The client without the debug info crunches a WU in 65.3 ks.
Is there any Rpi B+ @ 1 GHz running the official client to compare with?
The main problems now are:
1. You must run BOINC as root to be able to access the GPU. There is a custom driver which lets you access the GPU as a normal user (here).
2. If you stop the eah_client there is a problem with cleaning up the GPU resources (memory, etc.).
You must reboot the Rpi to be able to start the client again. In the custom driver there is a mechanism to clean up the resources.
Thank you,
RE: Hi, Andrew (Rpi
Exciting news!!
So here it is: the first GPU-driven, working BRP4 app on the Pi that beats the CPU version (see below). Congratulations!!!!
This is a Pi B of mine @ 1 GHz CPU and 600 MHz RAM overclock (B or B+ shouldn't matter here):
http://einsteinathome.org/host/11456696/tasks&offset=0&show_names=1&state=3&appid=0
Pretty much exactly 24h per task.
As for having to run BOINC as root: alternatively, you could have a wrapper executable that is started by BOINC as the boinc user and which then starts the real app via sudo.
The "hello_fft" GPU example application also required you to create a device file as root; how is that handled in your scenario?
As for the GPU resources not being cleaned up: that's a problem I've also run into when interrupting (or crashing) some Pi camera applications; it would be great if the driver could take care of this. Otherwise, maybe a wrapper executable could help here as well with the cleanup.
This is exciting. Do you think there is potential for even higher performance?
It would be cool to beat the Pi 2 (Model B, at ca. 50 ks per task), which would also beat the Parallella (on CPU only, on a per-core basis).
EDIT:
PS.: Also the OUYA running under Android, with its Tegra 3 CPU, would appear to be within range (on a per-core basis). Perhaps surprisingly, the wisdom file I posted earlier also works with the soft-fp ABI version of FFTW under Android and gives a noticeable speedup: http://einsteinathome.org/host/10214158/tasks&offset=0&show_names=1&state=3&appid=0 . Beating that one (per core) with a Raspi GPU would also be cool.
Cheers
HB
RE: This is a PI B of mine
So we have about a 25% speedup from using the GPU (65.3 ks with the GPU versus roughly 86.4 ks for the 24 h CPU-only run).
The device file (/dev/gpu_dev) is created by the code if it doesn't exist.
Is there any way to force BOINC to run the eah_client as root from the config files?
As for the cleanup: we can eliminate the problem if we run the "tear_down_fft" function when BOINC closes the eah_client. We must first wait for the currently running GPU_FFT iteration to finish and then clean up with tear_down_fft.
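Something along these lines, as a sketch only (the two stubs stand in for the real per-iteration GPU_FFT call and for the real tear_down_fft; that BOINC stops the app with SIGTERM is also just an assumption here):

#include <signal.h>
#include <stdio.h>

static volatile sig_atomic_t quit_requested = 0;

static void on_term(int sig)
{
    (void)sig;
    quit_requested = 1;                   /* only set a flag in the handler */
}

static int run_one_fft_iteration(int i)   /* stub for one GPU_FFT batch */
{
    printf("iteration %d done\n", i);
    return i < 1000;                      /* pretend there is more work */
}

static void tear_down_fft(void)           /* stub for the real cleanup */
{
    printf("GPU resources released\n");
}

int main(void)
{
    signal(SIGTERM, on_term);
    signal(SIGINT, on_term);

    int i = 0;
    while (!quit_requested && run_one_fft_iteration(i++))
        ;                                 /* never interrupt mid-iteration */

    tear_down_fft();                      /* runs on normal exit and on stop */
    return 0;
}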
Andrew is working on a 4M-FFT kernel for GPU_FFT with the same accuracy as the 2M-FFT. This implementation will most probably speed up the eah_client even more.
I must change a lot of code to support the new 4M-FFT kernel, but I suspect that the new implementation will also reduce the memory utilization of the current GPU client (which is very high, ~230 MB):
In the current implementation I run 3 x 2M C2C GPU_FFT kernels and then perform a radix-3 stage and an extra twiddling for the C2C-to-R2C conversion (half the output, with the power-spectrum calculation embedded).
I use ping-pong buffers to save memory, but each buffer is 6M complex floats.
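(For scale, assuming single-precision complex values of 8 bytes each: 6M complex floats are 6 x 2^20 x 8 bytes ≈ 48 MB per buffer, so the ping-pong pair alone accounts for roughly 96 MB of the ~230 MB.)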
Also all the twiddles for the radix-3 and the R2C stages are pre-computed and stored in memory for speed.
An optimization there is that I split each stage into 4 loops, and each loop reuses the same twiddles with different calculations.
So I only keep 1/4 of the needed twiddles per stage in memory.
With the new 4M FFT kernel I will run 3 x 4M GPU_FFTs and an extra radix-3 stage (half the output, i.e. 2 of the 4 loops, with the power-spectrum calculation embedded).
Thank you,