We are discussing with Andrew the possibility of replacing the twiddle generation functions with a LUT-based implementation, to improve the accuracy of GPU_FFT.
This might solve the accuracy problems for the E@H client.
Thank you,
Einstein@Home already has an ARMv7 NEON capable BRP4 app for Linux, and the 1GB RAM should in theory be enough to run 3 or 4 tasks in parallel, so E@H is more than ready for this new board. All this would probably boost productivity in E@H compared to a Raspi "1" by a factor of 4 or more if (!!) the RAM can provide the necessary throughput.
RE: Hmm...that is a bit
)
Mine is at http://einsteinathome.org/account/tasks
Maybe the validate-inconclusives are the problem. Not spotted them before.
That link won't work as your
)
That link won't work as your hosts are hidden. Even then, links to individual hosts will work, tho.
Cheers
HB
RE: Hi, We are discussing
)
Have you heard back from Andrew, and if so, have you had any success creating a client that uses the GPU_FFT? I run a solar-powered RPi model B and it would be great to have a faster client. I'm using the Turbo setting (1 GHz) and added heat sinks to my RPi. It completes a work unit in 25 hours.
RE: That link won't work as
)
Okay, try these:
http://einsteinathome.org/host/10457609/tasks
http://einsteinathome.org/host/11678121/tasks
I see these are both about 62 points per WU.
The numbers I was quoting were from
http://einstein.phys.uwm.edu/hosts_user.php
which I assume is per day.
Hi! I see. The per task
)
Hi!
I see. The per task CPU run time is consistently higher than on my fastest Raspi which needs less than 90k sec per task (the tasks are pretty much all equal, run time wise).
I guess they might be less aggressively overclocked, mine (a model B) has these parameters in /boot/config.txt :
Not all Raspis will overclock to this level, tho, try at your own risk :-).
HB
The twiddles* are a challenge
)
The twiddles* are a challenge indeed, they are the piece of the FFT which doesn't factorise like the rest. Producing them efficiently is governed by the classic memory vs. speed conundrum. If they are not accurate then one effectively gets a lower resolution transform.
What's the memory on a Pi and what's the float/precision operand lengths ?
Cheers, Mike.
* Essentially one needs the sine and cosine of every angle from 0 to 2PI in increments of 2PI/N where N is the transform size.
( edit ) Perversely : on the Parallella platform you could probably just totally ignore the Epiphany chip and use the dual-core ARM, BUT configure the FPGA to do the FFT heavy lifting. Even if the FPGA simply generated the twiddles on demand that would still be one heck of a speed and memory advantage. I'll have another look at that when/if I get a chance ...
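To make the footnote concrete, here is a minimal sketch (mine, not GPU_FFT's actual code) of the brute-force end of that memory vs. speed trade-off: precompute every twiddle once and just look them up.

```python
import math

def make_twiddles(n):
    # one (cos, sin) pair per angle 2*pi*k/n for k = 0..n-1,
    # i.e. the real/imaginary parts of e^(-2*pi*i*k/n)
    step = -2.0 * math.pi / n
    return [(math.cos(step * k), math.sin(step * k)) for k in range(n)]

table = make_twiddles(1024)
```

For a 4096-point single-precision transform that table costs 2 x 4096 x 4 bytes = 32 KB, which is exactly the kind of memory bill the speed buys you.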
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
RE: What's the memory on a
)
The "Raspberry Pi Model B" comes with 512 MB RAM, shared by CPU and GPU, with a configurable memory-split. We are talking about single precision FFTs, real to complex.
The new quad-core "Raspberry Pi 2 model B" has 1 GB RAM. The smaller 256 MB RAM "model A" is probably not worth exploring for this.
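For reference, that memory-split is set with the gpu_mem option in /boot/config.txt (the value below is just an illustrative choice, not a recommendation for E@H):

```
# /boot/config.txt: give the VideoCore GPU 128 MB of the shared RAM
gpu_mem=128
```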
Cheers
HB
Unfortunately, I do not have
)
Unfortunately, I do not have any update yet on the GPU_FFT. I didn't have the chance to change the twiddle generation procedure.
@Mike Hewson: Actually, for an N-point C2C FFT you can reduce the memory needed by storing the cos/sin values for only N/8 angles and taking advantage of the twiddle symmetries (at the cost of more calculations). The GPU of the RPi supports single precision floats (32 bit), but I do not know if there is any extended mode for the intermediate results (e.g. 40 bits). In the current implementation the twiddles are pre-calculated on the ARM in double precision and then stored (cast) in single precision in GPU memory. The accuracy problem is most probably a result of the pre-calculation procedure, which is step-based rather than LUT-based. The step-based procedure calculates the “higher” twiddles (smaller angles) from previously calculated twiddle values, so errors accumulate in the “higher” twiddles.
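A minimal sketch of that N/8 trick (my own illustration; the table layout and names are not GPU_FFT's): store cos/sin only for the first octant, angles in [0, pi/4], and map every other angle onto it with the symmetries of the unit circle.

```python
import math

def octant_tables(n):
    # cos/sin only for angles 2*pi*k/n in [0, pi/4]: n/8 + 1 entries each
    m = n // 8
    return ([math.cos(2 * math.pi * k / n) for k in range(m + 1)],
            [math.sin(2 * math.pi * k / n) for k in range(m + 1)])

def twiddle(k, n, ct, st):
    # reconstruct (cos, sin) of 2*pi*k/n from the first-octant tables
    m = n // 8
    o, r = divmod(k % n, m)          # octant index and offset within it
    if o == 0: return  ct[r],      st[r]
    if o == 1: return  st[m - r],  ct[m - r]
    if o == 2: return -st[r],      ct[r]
    if o == 3: return -ct[m - r],  st[m - r]
    if o == 4: return -ct[r],     -st[r]
    if o == 5: return -st[m - r], -ct[m - r]
    if o == 6: return  st[r],     -ct[r]
    return ct[m - r], -st[m - r]     # o == 7

# check the reconstruction against direct evaluation for a small n
n = 64
ct, st = octant_tables(n)
max_err = max(
    abs(twiddle(k, n, ct, st)[0] - math.cos(2 * math.pi * k / n)) +
    abs(twiddle(k, n, ct, st)[1] - math.sin(2 * math.pi * k / n))
    for k in range(n)
)
```

The storage drops by a factor of about eight, paid for with the index arithmetic and sign flips in the lookup, which is the "more calculations" part.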
Thank you,
That sounds reasonably easy
)
That sounds reasonably easy to fix. Actually, we at E@H had a pretty similar problem some years ago, when we used an OpenCL FFT lib that computed the twiddle factors with faster but reduced-precision trig functions (native_sin, native_cos). We replaced this with a LUT-based method.
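As a sanity check on that diagnosis, here is a small sketch (assumptions mine: N = 4096, and single precision emulated by rounding doubles through a 32-bit float) comparing the step-based recurrence w[k+1] = w[k] * w[1] against a LUT filled directly from double-precision cos/sin:

```python
import math
import struct

def f32(x):
    # round a double to the nearest IEEE-754 single, emulating a float store
    return struct.unpack('<f', struct.pack('<f', x))[0]

N = 4096
step = -2.0 * math.pi / N

# Step-based: derive each twiddle from the previous one, with results
# rounded to single precision, starting from the unit step w[1].
cr, ci = f32(math.cos(step)), f32(math.sin(step))
wr, wi = 1.0, 0.0
step_err = 0.0
for k in range(N):
    ref_r, ref_i = math.cos(step * k), math.sin(step * k)  # double reference
    step_err = max(step_err, abs(wr - ref_r), abs(wi - ref_i))
    wr, wi = f32(wr * cr - wi * ci), f32(wr * ci + wi * cr)

# LUT-based: compute every twiddle directly in double, cast once to single.
lut_err = max(
    max(abs(f32(math.cos(step * k)) - math.cos(step * k)),
        abs(f32(math.sin(step * k)) - math.sin(step * k)))
    for k in range(N)
)
```

The recurrence's worst-case error comes out far larger than the half-ulp you get from the cast LUT values, which matches the accumulation described above.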
HB
RE: Can't wait to get
)
Mine arrived yesterday, the Micro-SD card this morning. Just spent the last couple of hours putting Raspbian Wheezy on it and booting it. Wow, that was quick!
Then getting the Boinc source, building Boinc 7.2.47 and attaching it to Seti, Seti Beta, Einstein and Albert.
It's got a couple of Neon tasks from here, and a couple of non-Neon tasks from Albert, just running two up at present:
Computer 11741356 at Einstein
Computer 12650 at Albert
Claggy