Compiling BRP for AARCH64-Linux

Christian Beer
Joined: 9 Feb 05
Posts: 595
Credit: 197251252
RAC: 55011

Quote:
Quote:
I found the problem I was running into. I forgot to remove a configure option (--enable-maintainer-mode) needed when directly building the master branch. I now can build fftw 3.3.2 by patching the two files in this commit: Double precision Neon SIMD for aarch64 and patching the configure script accordingly.

I don't suggest doing that. Simply add NEON_CFLAGS="-D__ARM_NEON__" as an option to FFTW's configure and you're ready. Nothing more is needed. I tried applying these patches too, and it's not worth the trouble. The result is the same as adding this flag.


By applying the patch and patching the configure script, I can basically use the same configure command as for armv7, which sets the option too. I'm also using the same Makefile as for armv7. Overall it's fewer changes, and they are more easily incorporated into our current build script.
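
The NEON_CFLAGS="-D__ARM_NEON__" suggestion quoted above implies that FFTW's NEON code paths are selected by the __ARM_NEON__ macro. A minimal sketch for checking which spelling a given compiler predefines is shown here. It assumes that an AArch64 toolchain may predefine only __ARM_NEON, which would explain why the extra define is needed. The function and enum names are hypothetical and not part of FFTW or the BRP sources.

```c
#include <assert.h>
#include <stdio.h>

/* Bit flags for the two macro spellings (names are illustrative only). */
enum { HAS_ARM_NEON = 1, HAS_ARM_NEON_LEGACY = 2 };

/* Returns a bitmask of the NEON feature macros visible at compile time. */
int neon_macro_mask(void)
{
    int mask = 0;
#ifdef __ARM_NEON
    mask |= HAS_ARM_NEON;          /* spelling without trailing underscores */
#endif
#ifdef __ARM_NEON__
    mask |= HAS_ARM_NEON_LEGACY;   /* spelling passed via NEON_CFLAGS above */
#endif
    return mask;
}

/* Prints a one-line report to `out` and returns the same bitmask. */
int report_neon_macros(FILE *out)
{
    int mask = neon_macro_mask();

    assert(out != NULL);
    fprintf(out, "__ARM_NEON: %s, __ARM_NEON__: %s\n",
            (mask & HAS_ARM_NEON) ? "defined" : "not defined",
            (mask & HAS_ARM_NEON_LEGACY) ? "defined" : "not defined");
    return mask;
}
```

Calling report_neon_macros(stdout) from a small test program built with the same compiler and flags as FFTW shows whether the define actually reaches the compiler.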

N30dG
Joined: 29 Feb 16
Posts: 89
Credit: 4805610
RAC: 0

Quote:
Quote:
But besides that, here are the minimum changes to get a working copy:
build.sh:
  • BINUTILS: 2.26
  • GSL-Version: 1.16
  • LIBXML-Version: 2.9.3

I had no problem with binutils but had to add the build type to the configure commands for gsl and libxml because config.guess is too old.


Yes, that is also possible. I decided to change the version instead. Both are fine, I guess.

Quote:
Quote:
einsteinbinary-Makefile:
  • demod_binary_resamp_cpu.c: -ffp-contract=off

I don't understand this change. You mean I should add the compiler option to the demod_binary_resamp_cpu.o target?


Yes, without this option you get many invalid results from this line:
del_t[i] = params->tau * sinValue * params->step_inv - params->S0;
I think the problem is that AARCH64 supports fused multiply-subtract, while ARMv7 only has fused multiply-accumulate. So this line sometimes gives you different values, and the resampling is really sensitive to this.
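
The effect described above can be reproduced on any machine with a small sketch. The code below is not the BRP source. It compares a multiply-subtract that rounds the product to float before subtracting (what -ffp-contract=off requests) with a model of a fused operation, where the product is kept exact in double precision and only the final result is rounded. The function names and the input values are assumptions chosen to make the difference visible. C99's fmaf would be the real fused operation; the double-precision model is used here only to avoid depending on the math library.

```c
#include <assert.h>

/* Two separately rounded float operations: multiply, then subtract.
 * The volatile store forces the product to be rounded to float and
 * keeps the compiler from contracting the two steps into one. */
float mul_sub_separate(float a, float b, float c)
{
    volatile float prod = a * b;
    return prod - c;
}

/* Model of a fused multiply-subtract: the product of two floats is
 * exact in double, so no rounding happens before the subtraction. */
float mul_sub_fused_model(float a, float b, float c)
{
    double exact_prod = (double)a * (double)b;
    return (float)(exact_prod - (double)c);
}

/* Returns 1 if the two evaluation strategies disagree for these inputs. */
int contraction_changes_result(float a, float b, float c)
{
    assert(a == a && b == b && c == c);  /* reject NaN inputs */
    return mul_sub_separate(a, b, c) != mul_sub_fused_model(a, b, c);
}
```

With a = b = 1 + 2^-12 and c = 1 + 2^-11, the separately rounded version returns 0 while the fused model returns 2^-24. Differences of this size are the kind that could push a sensitive resampling step towards a different result than the reference platform.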

Quote:
Quote:
FFTW-Version:
  • 3.3.3 seems to be the fastest, but 3.3.2 also works.
I'm currently running a version (via app_info) that was built using 3.3.2 to get a baseline, but will then try one using 3.3.3 next.

You should note here that 3.3.3 only gives you better times if wisdom is provided.

Christian Beer

Thanks for the explanation. I will incorporate those into my next iteration of testing. Do you already have a wisdom file? If you send it to me I can put that into our git too.

N30dG

Yes, I already have a wisdom file, but it's for an out-of-place FFT. We have 2 GB of RAM on the C2 and need ~208 MB per task, so I think that's okay.

const char * EMBEDDED_WISDOM =
"(fftw-3.3.3 fftwf_wisdom #x4a633eef #xb5a95564 #x91014bdd #x9c85ce5f"
"  (fftwf_codelet_n2fv_32_neon 1 #x31bff #x31bff #x0 #x77f8eb56 #x9b3243ea #xcfaa4341 #x0775bcaa)"
"  (fftwf_codelet_t3fv_16_neon 1 #x1040 #x1040 #x0 #xe303c5b2 #xc2ae2214 #xa72ab5f4 #x6995a04c)"
"  (fftwf_codelet_t3fv_32_neon 0 #x31bff #x31bff #x0 #x78aaf7d5 #x0ead6a1d #x9ea9500c #xfe4649ee)"
"  (fftwf_codelet_hc2cbdftv_12_neon 0 #x30bff #x30bff #x0 #xed7d2717 #x182d4499 #xb650d8bf #xc3ac709e)"
"  (fftwf_codelet_n1fv_32_neon 0 #x31bff #x31bff #x0 #x1377ac76 #x486b5979 #x85f7d06e #x99d80ab3)"
"  (fftwf_codelet_r2cb_12 2 #x30bff #x30bff #x0 #x0cd86b8c #xa5bb5bdf #xa6841a6f #x2b51bb34)"
"  (fftwf_dft_vrank_geq1_register 0 #x31bff #x31bff #x0 #x10aa5232 #xccc0b4f9 #x63b0f397 #x4046d871)"
"  (fftwf_codelet_t1fv_12_neon 0 #x1040 #x1040 #x0 #x79756f0e #x66d09426 #xc0f7c2b4 #x261a84b3)"
"  (fftwf_ct_genericbuf_register 0 #x30bff #x30bff #x0 #x09ffadfd #xdb59a068 #x0745df6d #xd58d3904)"
"  (fftwf_codelet_r2cfII_12 2 #x31bff #x31bff #x0 #x3bf1ef07 #x3d06dd3e #x565dfc8a #x2b7c20c9)"
"  (fftwf_dft_vrank_geq1_register 0 #x1040 #x1040 #x0 #x74dd935c #x94ceb996 #x09d11935 #x41c5b235)"
"  (fftwf_codelet_r2cf_12 2 #x31bff #x31bff #x0 #xf0420ce7 #xc918dcf0 #x03aac9b2 #x16107661)"
"  (fftwf_codelet_hc2cfdftv_2_neon 0 #x1040 #x1040 #x0 #xf2e21ede #xa7926244 #x904b58ef #x5516abdd)"
"  (fftwf_dft_vrank_geq1_register 0 #x1040 #x1040 #x0 #x592190c3 #x7ce845cd #xb138a247 #xbaa61ebe)"
"  (fftwf_codelet_n2fv_32_neon 1 #x30bff #x30bff #x0 #x77f8eb56 #x9b3243ea #xcfaa4341 #x0775bcaa)"
"  (fftwf_codelet_t1buv_8_neon 0 #x30bff #x30bff #x0 #x89648b87 #x4428e205 #x9c1eb28f #x0e1b59df)"
"  (fftwf_codelet_r2cfII_2 2 #x1040 #x1040 #x0 #x7d17401b #xbead8c34 #x59d0bca0 #x4ce1dcef)"
"  (fftwf_dft_vrank_geq1_register 0 #x1040 #x1040 #x0 #xfe6deb84 #x4f26ad5c #xb890d5fc #xa90ed671)"
"  (fftwf_codelet_r2cbIII_12 2 #x30bff #x30bff #x0 #xfb67d341 #x537f52c4 #xbaa6c92c #x64c28e12)"
"  (fftwf_dft_vrank_geq1_register 0 #x31bff #x31bff #x0 #x098ff363 #x2e742041 #xf8ba4623 #x3d99eadb)"
"  (fftwf_ct_genericbuf_register 0 #x31bff #x31bff #x0 #xfe3a0fe3 #xb55c134b #x0645bd4a #xf197f7c6)"
"  (fftwf_dft_vrank_geq1_register 0 #x30bff #x30bff #x0 #x792a7736 #x4fc700e1 #xe3e5f7fa #x7534e533)"
"  (fftwf_dft_vrank_geq1_register 0 #x31bff #x31bff #x0 #xb02371f5 #xa5458024 #x6d46a518 #x009c8e76)"
"  (fftwf_codelet_t3fv_32_neon 0 #x31bff #x31bff #x0 #xb8f247fc #xb8fa53ba #x7d5cec88 #x6a2cc555)"
"  (fftwf_dft_vrank_geq1_register 0 #x30bff #x30bff #x0 #x2d4f7b39 #xe89c78f3 #xf04db27c #x71312c69)"
"  (fftwf_codelet_t3fv_16_neon 1 #x1040 #x1040 #x0 #x4c7d44eb #xc8fdb88f #x5c58f633 #x11913e40)"
"  (fftwf_codelet_r2cf_2 2 #x1040 #x1040 #x0 #x7f491169 #x040dd9bd #xd46830ed #x3084e984)"
"  (fftwf_codelet_t3fv_32_neon 0 #x30bff #x30bff #x0 #xb8f247fc #xb8fa53ba #x7d5cec88 #x6a2cc555)"
"  (fftwf_codelet_t3fv_32_neon 1 #x1040 #x1040 #x0 #x6401dea5 #xb86b1548 #x336ceb05 #x5ea75d6c)"
"  (fftwf_codelet_n2fv_64_neon 1 #x1040 #x1040 #x0 #x6174f23c #xfb0fa51d #xd769129d #xfb18817d)"
"  (fftwf_codelet_hc2cfdftv_12_neon 0 #x31bff #x31bff #x0 #xf2e21ede #xa7926244 #x904b58ef #x5516abdd)"
"  (fftwf_dft_vrank_geq1_register 0 #x30bff #x30bff #x0 #xb02371f5 #xa5458024 #x6d46a518 #x009c8e76)"
"  (fftwf_codelet_n1bv_128_neon 0 #x30bff #x30bff #x0 #x5551ea8e #x0745fccd #x49992db0 #x34d9d629)"
")";

Computed over 8 days directly from a "dummy" BRP app.
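
The header of the wisdom string above names fftw-3.3.3 and the single-precision fftwf_wisdom format. As far as I understand, FFTW only accepts wisdom whose header matches the importing library, so a 3.3.2 build would presumably ignore this string. A minimal sketch of a pre-check that needs no FFTW headers follows. The function name and buffer size are hypothetical, and the header layout is inferred only from the string shown above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if `wisdom` starts with the single-precision wisdom header
 * for the given FFTW version string (e.g. "3.3.3"), otherwise 0. */
int wisdom_header_matches(const char *wisdom, const char *fftw_version)
{
    char expected[64];
    int len;

    assert(wisdom != NULL && fftw_version != NULL);

    /* Build the expected prefix, e.g. "(fftw-3.3.3 fftwf_wisdom ". */
    len = snprintf(expected, sizeof expected,
                   "(fftw-%s fftwf_wisdom ", fftw_version);
    if (len < 0 || (size_t)len >= sizeof expected)
        return 0;  /* version string too long for the buffer */

    return strncmp(wisdom, expected, (size_t)len) == 0;
}
```

In the application itself, the return value of fftwf_import_wisdom_from_string is the authoritative check: it is nonzero when the wisdom was accepted.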

Christian Beer

First test run with "stock" app (fftw 3.3.2, no ffp-contract change):
4 tasks of which 1 valid, 1 inconclusive, 2 waiting on wingman - Runtime ~48k sec (13 h) with all 4 running at the same time

The 2 in progress are using the same app but with fftw 3.3.3 and the ffp-contract change.

I'm also not using the -mcpu=cortex-a53+simd compiler flag (yet) because I want to have a generic app. Right now we can't target specific ARM CPUs with BOINC. Let's see what performance gain I get when I activate this in one of the next tests.

N30dG

That seems to be identical to my tests: ~48 ks.
Switching FFTW to out-of-place should give you results of ~25 ks.
Apply the wisdom and you should get ~17-18 ks.
The rest comes from my changes to the resampling.
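
To put the runtimes quoted above side by side, here is a small arithmetic sketch. It assumes 4 tasks running concurrently, as in the earlier test run, and takes 17.5 ks as the midpoint of the ~17-18 ks estimate. Both function names are hypothetical.

```c
#include <assert.h>

/* Speed-up of a new runtime relative to a baseline (both in seconds). */
double speedup(double baseline_sec, double new_sec)
{
    assert(baseline_sec > 0.0 && new_sec > 0.0);
    return baseline_sec / new_sec;
}

/* Tasks finished per day when `concurrent` tasks run side by side
 * and each one takes `task_sec` seconds of wall-clock time. */
double tasks_per_day(int concurrent, double task_sec)
{
    assert(concurrent > 0 && task_sec > 0.0);
    return concurrent * 86400.0 / task_sec;
}
```

With these numbers, going from 48 ks to 25 ks is a speed-up of about 1.9x, and from 48 ks to 17.5 ks about 2.7x. At 4 concurrent tasks that corresponds to roughly 7.2 versus 19.7 tasks per day.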

The -ffp-contract=off option makes it a bit slower. That's why I only applied it to the resampling.
Maybe we can avoid this by making some variables volatile and changing the order a little, to force GCC to emit a MUL and a SUB instead of an MLS. That way we would still get the speed benefit of the MLA instruction, which -ffp-contract=off eliminates too.

Christian Beer

And now I'm back to 48k sec when running 4 tasks concurrently, even with the wisdom file. My next test is to remove the ffp-contract change and add the mcpu change to see whether compiling for a specific CPU gives a speed improvement.

N30dG

Quote:
And now I'm back to 48k sec when running 4 tasks concurrently, even with the wisdom file. My next test is to remove the ffp-contract change and add the mcpu change to see whether compiling for a specific CPU gives a speed improvement.


Have you switched to an out-of-place FFT? My wisdom is for out-of-place. If you use it for in-place, it will just be ignored.

Christian Beer

I have switched to out-of-place FFT now and am also trying to cool the C2 better in case it is downclocking (which I couldn't verify).

N30dG

Looking at your last run times, it doesn't seem to have been successful?
I would guess that your FFTW patch doesn't activate NEON properly.

Maybe you could try my method (NEON_CFLAGS="-D__ARM_NEON__"), just for testing?
If that doesn't help, I can send you my build.sh and Makefile (but they are only quick-and-dirty modifications).

I've also finished my resampling, but I want to write a short explanation and I don't have the time. My wife has the late shift this week, so I have to work, buy food, cook, walk the dog, ... But I'll send it to you at the weekend.
Sorry for the delay.