Compiling BRP for AARCH64-Linux

Christian Beer

Joined: 9 Feb 05

Posts: 595

Credit: 197453608

RAC: 32777

22 Jun 2016 11:20:07 UTC

Message 138533

I'm now testing the -D__ARM_NEON__ option for fftw. A quick and dirty version of build.sh and the makefile would be good so I can compare to what I effectively use in my pretty version.

N30dG

Joined: 29 Feb 16

Posts: 89

Credit: 4805610

RAC: 0

22 Jun 2016 14:22:34 UTC

Message 138534 in response to message 138533

Quote:

I'm now testing the -D__ARM_NEON__ option for fftw. A quick and dirty version of build.sh and the makefile would be good so I can compare to what I effectively use in my pretty version.

Can you send me a PM with your email?

Christian Beer

Joined: 9 Feb 05

Posts: 595

Credit: 197453608

RAC: 32777

24 Jun 2016 6:01:31 UTC

Message 138535

Now I'm also getting 16.2k sec runtime. I made two changes to my build setup that were different from the setup I got from N30dG, and I'm going to verify next week which of the two is responsible. If this goes well we may have a test run on Albert@home. So if you have a 64-bit ARM device running Linux, please attach it to https://albertathome.org

Christian Beer

Joined: 9 Feb 05

Posts: 595

Credit: 197453608

RAC: 32777

28 Jun 2016 11:51:41 UTC

Message 138536

Just to let everyone know that I identified the change that made the app jump from 40k to 16k. It is the "-ftree-vectorize" compiler option. I still need to make another test without the "-mcpu=cortex-a53+simd" option to see if we stay at this level when we release an app version on Albert@home that is not CPU specific.

Regards
Christian

N30dG

Joined: 29 Feb 16

Posts: 89

Credit: 4805610

RAC: 0

1 Jul 2016 15:37:53 UTC

Message 138537 in response to message 138536

Quote:

Just to let everyone know that I identified the change that made the app jump from 40k to 16k. It is the "-ftree-vectorize" compiler option. I still need to make another test without the "-mcpu=cortex-a53+simd" option to see if we stay at this level when we release an app version on Albert@home that is not CPU specific.

Hmm, this can't be.
I only activated the -ftree-vectorize option for testing purposes and simply forgot to remove it. Sorry for the misinformation.

-ftree-vectorize can improve performance, but it can also decrease it by stalling the pipeline.

You should note that the cost of moving data between the core and NEON is high, and very high if you load unaligned data. NEON is also always some cycles behind the core. Vectorizing the tree produces a lot of data movement (often unaligned) between the core and NEON, so the core often has to wait for NEON to finish and move the data back, which results in a pipeline stall.

But it can improve performance if the NEON-computed data isn't needed in the next step. NEON and the core each have their own pipeline, so we can gain by doing things in parallel.
GCC tries to interleave your code according to its scheduling model, so that the NEON work is issued some instructions before its result is needed. But that doesn't work in all cases.

So I wouldn't recommend applying -ftree-vectorize everywhere, only to functions where you know it helps.
Or, the simple way: use GCC's PGO options and let GCC determine (from a test run) where it is useful and where it is not.

Hopefully my explanation is understandable. In German I can explain this much better. And sorry again for my misinformation, I simply forgot it.
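As an illustration of enabling vectorization only for selected functions, here is a minimal sketch in C. It assumes GCC and uses its function-level `optimize` attribute. The function `scaled_add` and the macro `VECTORIZE_HERE` are hypothetical names and are not taken from the BRP sources. The GCC manual describes this attribute as intended for debugging rather than production code, so per-file flags in the Makefile are probably the more conventional route.

```c
#include <stddef.h>

/* Assumption: GCC only. Other compilers get an empty macro and simply
 * apply whatever optimization flags were given on the command line. */
#if defined(__GNUC__) && !defined(__clang__)
#define VECTORIZE_HERE __attribute__((optimize("tree-vectorize")))
#else
#define VECTORIZE_HERE
#endif

/* Hypothetical hot loop: y[i] += a * x[i].
 * `restrict` tells the compiler that x and y do not overlap, which the
 * vectorizer needs to know before it can use SIMD loads and stores freely. */
VECTORIZE_HERE
void scaled_add(float a, const float *restrict x, float *restrict y, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        y[i] += a * x[i];
    }
}
```

For the PGO route, the usual GCC sequence is to build with `-fprofile-generate`, run a representative task, and rebuild with `-fprofile-use`.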

N30dG

Joined: 29 Feb 16

Posts: 89

Credit: 4805610

RAC: 0

1 Jul 2016 16:05:28 UTC

Message 138538

I forgot something: the difference between a generic AArch64 app and an app compiled for the Cortex-A53 shouldn't be too big. The instruction set is the same.
GCC mainly changes the scheduling model (in this case).
Instead of -mcpu you could use -march + -mtune. That gives you a generic app that is only "tuned" for the A53. The performance decrease on other AArch64 CPUs shouldn't be too large. And at the moment there are mainly (or only?) A53 Linux devices on the market, so I think this is a good decision.

Note that in earlier versions of GCC, -march + -mtune was the same as -mcpu, but this behavior was changed at some point (I don't know the version number).

An app compiled with -mcpu may be runnable on other CPU types of the same architecture, but there is no guarantee. I guess that it is for AArch64, though (I don't have a device to try).
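One rough way to compare a generic build with a CPU-specific one is to look at the target macros the compiler predefines. The sketch below relies on `__aarch64__` and `__ARM_NEON`, which AArch64 compilers define when Advanced SIMD is available. The function name `target_description` is made up for illustration. If a build with `-march=armv8-a -mtune=cortex-a53` and a build with `-mcpu=cortex-a53` report the same thing, that would be consistent with the statement above that only the scheduling differs, although it is not a proof.

```c
/* Returns a short description of the target this file was compiled for,
 * based only on macros predefined by the compiler. */
const char *target_description(void)
{
#if defined(__aarch64__) && defined(__ARM_NEON)
    return "AArch64 with Advanced SIMD (NEON)";
#elif defined(__aarch64__)
    return "AArch64 without Advanced SIMD";
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
    return "32-bit ARM with NEON";
#else
    return "not an ARM NEON target";
#endif
}
```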

Christian Beer

Joined: 9 Feb 05

Posts: 595

Credit: 197453608

RAC: 32777

20 Jul 2016 9:28:42 UTC

Message 138539

I'm still getting 50% invalids because the numbers produced by the 64-bit app are on the edge of the allowed error range. My next test will be to remove the "-ftree-vectorize" compiler option again and see what happens with the runtime. If this option has no effect on runtime, I'm not sure what made the app go from 40k to 16k.

N30dG

Joined: 29 Feb 16

Posts: 89

Credit: 4805610

RAC: 0

21 Jul 2016 16:37:06 UTC

Message 138540 in response to message 138539

I've already tested that; it makes a difference of ~200 sec (or less).

The speed improvement comes from FFTW. Switching FFTW to out-of-place transforms made the runtime drop from 40 ksec to 25 ksec. Using wisdom made it drop to 16 ksec.

But I don't know what causes your high invalid rate; my latest version runs without any issues. But let me think about it...

Are you using exactly my Makefile?

BTW: I'm down to <13.5 ksec :) But I think that's the lower limit. There is nothing more I can optimize without getting too big a difference in the output.
https://einsteinathome.org/host/12260646/tasks
https://einsteinathome.org/host/12278439/tasks
(Only look at tasks after 15 July; before that date I was trying something out.)

Christian Beer

Joined: 9 Feb 05

Posts: 595

Credit: 197453608

RAC: 32777

22 Jul 2016 7:00:13 UTC

Message 138541 in response to message 138540

Quote:

The speed improvement comes from FFTW. Switching FFTW to out-of-place transforms made the runtime drop from 40 ksec to 25 ksec. Using wisdom made it drop to 16 ksec.

But I don't know what causes your high invalid rate; my latest version runs without any issues. But let me think about it...

Are you using exactly my Makefile?

I talked with Benjamin about the errors and he thinks they come from using a different FFTW version in my app than the regular apps. I'm currently using version 3.3.4 with some ARM64 specific patches. I would really like to have a version 3.3.5 but this is not released yet.

I used the ARMv7 native Makefile and applied your changes to it. So in the end, yes I'm using your Makefile. I also patched the sin and sqrt functions to see if the validation rate changes (which it does not).

I will take a look at all those changes again and compare them with what you sent me.

N30dG

Joined: 29 Feb 16

Posts: 89

Credit: 4805610

RAC: 0

24 Jul 2016 17:13:17 UTC

Message 138542 in response to message 138541

This morning in the shower I had an idea. I think I now know what causes your problem and how to solve it (if that is the problem). But let me start at the top:

First, Benjamin is right: there is a slight change in the values when using a different version of FFTW, and even the use of wisdom can give you slight changes in the values. You can read about this in the FFTW FAQ:
http://www.fftw.org/faq/section3.html#conventions
http://www.fftw.org/faq/section3.html#nondeterministic
My tests showed larger value changes, in the range of 1e-7 rather than 1e-15 as the FAQ says (the FAQ is written for double precision, I guess).
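The gap between 1e-7 and 1e-15 matches the difference in machine epsilon between `float` (about 1.2e-7) and `double` (about 2.2e-16). That would fit the guess above if the app runs FFTW in single precision, which is an assumption here. The sketch below uses made-up helper names and a deliberately exaggerated input to show how just reordering single-precision additions changes a result.

```c
#include <float.h>
#include <stddef.h>

/* Sum left to right. `volatile` forces the running sum to be rounded
 * to float after every addition, even on targets that would otherwise
 * keep extra precision in registers. */
float sum_forward(const float *v, size_t n)
{
    volatile float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        acc = acc + v[i];
    }
    return acc;
}

/* Same values, summed right to left. */
float sum_reverse(const float *v, size_t n)
{
    volatile float acc = 0.0f;
    for (size_t i = n; i > 0; --i) {
        acc = acc + v[i - 1];
    }
    return acc;
}
```

With the input `{1e8f, -1e8f, 1.0f}`, `sum_forward` returns 1 while `sum_reverse` returns 0, because 1 is smaller than the spacing between neighbouring floats near 1e8. A different FFT plan or a vectorized loop changes the order of operations in a similar way, only with much smaller effects.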

But I guess that's not what causes your high invalid rate. I'm using FFTW 3.3.3, and that also gives me slight value differences compared to FFTW 3.3.2. But all of my results have validated (so far), so that's probably not the core of the problem.

I would guess that the problem comes from the resampling function.
And here I have to say sorry, I probably misled you with my Makefile. I'm using a modified version of the resampling function in which the sin LUT access is already included. In the original source this function is included in erp_utilities.cpp.

So there are two options to solve it:
1. Apply -ffp-contract=off (and maybe -fno-associative-math) to erp_utilities.cpp in the Makefile.
2. Use my function, which is also a little faster but restricted to ARM (I've now sent it to you by mail).
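To show why option 1 can matter, here is a small sketch of what floating-point contraction changes. With contraction enabled, the compiler may turn `a * b + c` into one fused multiply-add that rounds only once. With `-ffp-contract=off`, the product is rounded before the addition. Both function names are hypothetical. The second function only emulates a fused operation by computing in `double`, which matches a real FMA for the inputs discussed here but not necessarily for every input.

```c
/* Product rounded to float first, then added: the behaviour that
 * -ffp-contract=off guarantees. `volatile` forces the intermediate
 * rounding so the compiler cannot fuse the two operations. */
float mul_then_add(float a, float b, float c)
{
    volatile float product = a * b;
    return product + c;
}

/* Emulated fused multiply-add: the product of two floats is exact in
 * double, so only the final conversion back to float rounds here
 * (for inputs where the double sum is itself exact). */
float fused_emulated(float a, float b, float c)
{
    return (float)((double)a * (double)b + (double)c);
}
```

For `a = b = 1.000244140625f` (that is, 1 + 2^-12) and `c = -1.00048828125f` (that is, -(1 + 2^-11)), `mul_then_add` returns 0 while `fused_emulated` returns 2^-24, about 6e-8. Differences of this size are in the same range as the 1e-7 deviations mentioned above.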

I've also included my modified hs_common.c and demod_binary_hs_cpu.c (I only removed one memset, which was already marked as ToDo).
