Add several more optimizations #92

Open
opened 2022-01-14 07:34:33 +01:00 by OpenSourceAnarchist · 22 comments

I currently use ALHP repos (x86-64-v3) on my computer, and everything has been flawless! However, I have previously used, and am also a big fan of, [gentooLTO](https://github.com/InBetweenNames/gentooLTO). I believe some additional optimizations that they suggest could be enabled for this repo's packages as they are updated and built, with minimal side effects (beyond slightly longer build times). If some of these are undesirable, I completely understand, but I thought I'd bring them up just in case.

The first is Graphite. Arch's build of gcc has graphite support, so it should be as simple as adding -fgraphite-identity -floop-nest-optimize.

The second is Polly. AFAIK Polly is not usually used in gentooLTO, but it should provide the same kind of benefits as Graphite for LLVM/Clang. Supporting this would mean requiring polly from [extra] and adding -Xclang -load -Xclang LLVMPolly.so and -mllvm -polly to CFLAGS. I'm not sure how this would affect gcc, since it obviously doesn't support these flags. Perhaps also consider -mllvm -polly-parallel -lgomp for [automatic OpenMP code generation](https://man.archlinux.org/man/extra/polly/polly.1.en#Automatic_OpenMP_code_generation).

The third is simply -fdevirtualize-at-ltrans. This should improve optimizations with LTO.

Moreover, I saw [a comment](https://www.reddit.com/r/archlinux/comments/oflged/comment/hoaj0zs/) by a user on Reddit that recommended adding -mpclmul. I think this should be harmless because even though it isn't implied by -march=x86-64-v3, according to Wikipedia, it has been in every processor since Westmere, including from AMD. Perhaps -maes should also be considered for the same reason!

Lastly, -mfpmath=sse should help speed up performance without any breakage, and should always help so long as the processor supports SSE instructions, which is a requirement of x86-64-v2 and higher.

I think all of these could be implemented, but Graphite/Polly seem like the most disruptive ones in terms of build requirements (more memory). Let me know your thoughts!

Owner

Hi @OpenSourceAnarchist and thanks for your detailed request.

I need to do some more research on some of the points you raise, but let's just dig right in:

> The first is Graphite. Arch's build of gcc has graphite support, so it should be as simple as adding -fgraphite-identity -floop-nest-optimize.

I'm aware of Graphite. Need to do more research on the resource implications of enabling it.

> The second is Polly. AFAIK Polly is not usually used in gentooLTO, but it should produce the same benefits as Graphite, but for LLVM/Clang. Supporting this would mean requiring polly from [extra], and adding -Xclang -load -Xclang LLVMPolly.so and -mllvm -polly to CFLAGS. I'm not sure if this would affect gcc since it obviously doesn't support these flags. Perhaps also consider -mllvm -polly-parallel -lgomp for automatic OpenMP code generation.

That is probably not going to happen soon, since adding another make dependency that depends on the toolchain in use requires more PKGBUILD editing (and checks for which toolchain is used, etc.).
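The toolchain check described here could look roughly like the sketch below. It assumes the build script already knows which compiler a package uses. The function name `extra_cflags_for` is hypothetical and not part of ALHP, and the flag lists are simply the ones proposed in this thread.

```shell
# Hypothetical helper: print the extra optimization flags for a given compiler.
# The flag lists are the ones proposed in this thread, not ALHP's actual config.
extra_cflags_for() {
    # Strip any leading directory so /usr/bin/clang is handled like clang.
    case "${1##*/}" in
        gcc*|g++*)
            # Graphite flags are GCC-only.
            echo "-fgraphite-identity -floop-nest-optimize"
            ;;
        clang*)
            # Polly is loaded as a Clang plugin and needs the polly package.
            echo "-Xclang -load -Xclang LLVMPolly.so -mllvm -polly"
            ;;
        *)
            # Unknown toolchain: add nothing.
            echo ""
            ;;
    esac
}

# Example: extend CFLAGS depending on the compiler in use.
CC="${CC:-gcc}"
CFLAGS="${CFLAGS:--O2} $(extra_cflags_for "$CC")"
echo "$CFLAGS"
```

The hard part in practice is not this lookup but detecting reliably which compiler each PKGBUILD ends up using, which is the extra editing and checking mentioned above.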

> The third is simply -fdevirtualize-at-ltrans. This should improve optimizations with LTO.

If I understood this correctly, this would increase mem-usage significantly. LTO already causes problems with mem-usage since we are still on GCC 11.1, adding even more memory usage would not be helpful at the moment. We can reconsider this once a newer GCC version hits the repos.

> Moreover, I saw a comment by a user on Reddit that recommended adding -mpclmul. I think this should be harmless because even though it isn't implied by -march=x86-64-v3, according to Wikipedia, it has been in every processor since Westmere, including from AMD. Perhaps -maes should also be considered for the same reason!

I'll do some research on pclmul availability on >=v2 CPUs. If Wikipedia is to be trusted, it should be available. Most software which uses AES heavily already does runtime detection of AES afaik (openssl, dm-crypt, etc.), so the performance gains here would probably not be that big.

> Lastly, -mfpmath=sse should help speed up performance without any breakage, and should always help so long as the processor supports SSE instructions, which is a requirement of x86-64-v2 and higher.

man gcc about -mfpmath=sse:

> This is the default choice for the x86-64 compiler, Darwin x86-32 targets, and the default choice for x86-32 targets with the SSE2 instruction set when -ffast-math is enabled.

If I interpret this correctly, -mfpmath=sse should be the default for x86-64 anyway?
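One way to confirm a default like this is to ask GCC itself, using the same `-Q --help=target` listing that appears later in this thread. The sketch below is a small filter for that listing. The helper name `gcc_q_value` is hypothetical, and the sample lines fed to it are copied from the diff output in this thread rather than produced by running GCC here.

```shell
# Hypothetical helper: read `gcc -Q --help=target` output on stdin and print
# the value shown for one option (for example "sse", "[enabled]", "[disabled]").
gcc_q_value() {
    awk -v opt="$1" '
        { name = $1; sub(/=$/, "", name) }   # "-mfpmath=" becomes "-mfpmath"
        name == opt { print $2; exit }
    '
}

# Demo on two sample lines taken from this thread.
printf '%s\n' \
    '  -march=                              x86-64-v2' \
    '  -mpclmul                             [disabled]' \
    | gcc_q_value -mpclmul

# On a machine with GCC installed, the real check would be:
#   gcc -march=x86-64-v3 -Q --help=target | gcc_q_value -mfpmath
```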

anonfunc added the enhancement label 2022-01-15 13:07:45 +01:00

Sounds good for Graphite, -fdevirtualize-at-ltrans, and -mpclmul/-maes.

I think you're right about Polly. Definitely could be a long-term goal, but would require checks and string replacement.

As for -mfpmath=sse, you are completely correct! Somehow I missed that it's the default for anything detected with SSE2 support. Just ignore that suggestion :)

Owner

Memory usage looks good with gcc-11.2 in the repos now. I'll continue to monitor this for a while, then I plan to add -fdevirtualize-at-ltrans -maes.

Contributor

@anonfunc do you use -fdevirtualize-speculatively already?

Contributor

> Perhaps -maes should also be considered for the same reason!

No, it should not. I don't think we should alter how security algorithms behave. Libraries either use AES instructions or they don't. This might force them to use them, which has security implications.

I can't think of a single library which is not capable of using the AES instructions if you tell it to. So I doubt there's any real life benefit of doing this.

Owner

> @anonfunc do you use -fdevirtualize-speculatively already?

Nope, haven't gotten around to adding it yet.

Contributor

@anonfunc well, -fdevirtualize-speculatively wasn't mentioned here before, but it sounds like it could lead to further optimisations down the road :)

So maybe just add it to the ToDo list, if you're fine with it

Contributor

> Moreover, I saw a comment by a user on Reddit that recommended adding -mpclmul. I think this should be harmless because even though it isn't implied by -march=x86-64-v3, according to Wikipedia, it has been in every processor since Westmere, including from AMD.

Well, so this is a Sandy Bridge processor, so I can confirm that it's available there. :)

$ diff -u <(gcc -march=native -Q --help=target) <(gcc -march=x86-64-v2 -Q --help=target) | grep "^[-|+] "
-  -march=                              sandybridge
+  -march=                              x86-64-v2
-  -mavx                                [enabled]
+  -mavx                                [disabled]
-  -mavx256-split-unaligned-load        [enabled]
-  -mavx256-split-unaligned-store       [enabled]
+  -mavx256-split-unaligned-load        [disabled]
+  -mavx256-split-unaligned-store       [disabled]
-  -mpclmul                             [enabled]
+  -mpclmul                             [disabled]
-  -mtune=                              sandybridge
+  -mtune=                              generic
-  -mxsave                              [enabled]
+  -mxsave                              [disabled]
-  -mxsaveopt                           [enabled]
+  -mxsaveopt                           [disabled]

And here's the lowest end second generation Atom (a Silvermont based x5-Z8300)

$ diff -u <(gcc -march=native -Q --help=target) <(gcc -march=x86-64-v2 -Q --help=target) | grep "^[-|+] "
-  -maccumulate-outgoing-args           [enabled]
+  -maccumulate-outgoing-args           [disabled]
-  -maes                                [enabled]
+  -maes                                [disabled]
-  -march=                              silvermont
+  -march=                              x86-64-v2
-  -mmovbe                              [enabled]
+  -mmovbe                              [disabled]
-  -mpclmul                             [enabled]
+  -mpclmul                             [disabled]
-  -mprfchw                             [enabled]
+  -mprfchw                             [disabled]
-  -mrdrnd                              [enabled]
+  -mrdrnd                              [disabled]
-  -mtune=                              silvermont
+  -mtune=                              generic

> [...] I think this should be harmless because even though it isn't implied by -march=x86-64-v3, according to Wikipedia, it has been in every processor since Westmere, including from AMD. Perhaps -maes should also be considered for the same reason!

That's not the case; the Atom J1900, for example, doesn't have AES-NI:

https://ark.intel.com/content/www/de/de/ark/products/78867/intel-celeron-processor-j1900-2m-cache-up-to-2-42-ghz.html
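For anyone wanting to check their own machine, the sketch below tests for the two features discussed here by looking at the `flags` line of `/proc/cpuinfo`, where Linux reports them as `aes` and `pclmulqdq`. The helper name `cpu_has` is hypothetical, and the snippet assumes a Linux system; elsewhere the flags string is simply empty and both features are reported as missing.

```shell
# Hypothetical helper: succeed if a feature name appears as a whole word in a
# space-separated flags string (as on the "flags" line of /proc/cpuinfo).
cpu_has() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# On Linux, take the first "flags" line; the string stays empty elsewhere.
flags=$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null | cut -d: -f2)

for feature in aes pclmulqdq; do
    if cpu_has "$feature" "$flags"; then
        echo "$feature: present"
    else
        echo "$feature: missing"
    fi
done
```

Matching on whole words matters here, since a plain substring search for `aes` would also match longer flag names that merely contain it.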

Owner

> @anonfunc well, -fdevirtualize-speculatively wasn't mentioned here before, but it sounds like it could lead to further optimisations down the road :)
>
> So maybe just add it to the ToDo list, if you're fine with it

> > @anonfunc do you use -fdevirtualize-speculatively already?
>
> Nope, haven't gotten around to add it yet.

Sorry, I thought you were talking about -fdevirtualize-at-ltrans. There is no need for -fdevirtualize-speculatively; it's included in -O2 and above.

> That's not the case the Atom J1900 for example doesn't have AES-NI:

I lean more towards letting the program choose whether it wants to use AES or not. -mpclmul and -fdevirtualize-at-ltrans can probably be enabled safely.

Contributor

> There is no need for -fdevirtualize-speculatively, it's included in -O2 and above.

Ah damn, sorry. I missed that. :)

Contributor

What do you think about '-fipa-pta'? It does take more memory but is commonly used on Gentoo builds.

Do you have some headroom on the build server in terms of memory? :)

Owner

> What do you think about '-fipa-pta'? It does take more memory but is commonly used on Gentoo builds.
>
> Do you have some headroom on the build server in terms of memory? :)

Not really, sadly. That may be an option if we ever get a dedicated machine for ALHP; otherwise this server has to handle some other tasks as well, and even 64 GB of RAM fills up faster than one might think :P

Contributor

I'm wondering how much more it would really use. I mean, sometimes those warnings are just overly cautious. :)

Could we test it? I mean it should fail with OOM for the container if it hits the memory limit and we could restart it without, like with LTO? :)
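The retry idea could be sketched as below. This is only an illustration: the wrapper name `build_with_fallback`, the `EXTRA_CFLAGS` variable and the `fake_build` stand-in are all hypothetical and say nothing about how ALHP's builder actually handles LTO failures.

```shell
# Hypothetical wrapper: try a build command with extra flags first and retry
# once without them if that attempt fails (for example because the container
# was OOM-killed). The first argument is the flag string, the rest is the
# build command; the flags are passed to it via EXTRA_CFLAGS.
build_with_fallback() {
    extra="$1"; shift
    if EXTRA_CFLAGS="$extra" "$@"; then
        echo "built with: $extra"
    elif EXTRA_CFLAGS="" "$@"; then
        echo "built without: $extra"
    else
        echo "build failed" >&2
        return 1
    fi
}

# Stand-in for a real build: fails whenever any extra flags are set,
# imitating a build that runs out of memory with the flag enabled.
fake_build() { [ -z "$EXTRA_CFLAGS" ]; }

build_with_fallback "-fipa-pta" fake_build
```

Note that this retries on any failure, not only on an out-of-memory kill, so a real implementation would probably want to inspect the exit status or the container's OOM state before deciding to drop the flag.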

Owner

We probably could test it, but I would prefer not doing it on the production machine. I can probably make some test builds locally later/tomorrow.

Owner

Just FYI, all builds after Thu, 12 May 2022 01:00 UTC are built with the new flags -mpclmul and -fdevirtualize-at-ltrans. Memory usage looks good.

Contributor

Nice!

I worked through the GCC docs and compared them with the flags GCC actually sets. I found some strange defaults and some more optimisation flags we could discuss.

They all just enable additional checks/analysis/optimization passes and don't force the compiler to do anything specific, so they should be in line with the goals :)

I will push this to my fork later and link it here. So maybe hold off on the test builds? :)

Contributor

@OpenSourceAnarchist maybe you want to take a look over this as well?

It includes your changes as well:

https://git.harting.dev/ALHP/ALHP.GO/pulls/117

Contributor

-floop-nest-optimize / -ftree-loop-linear -floop-strip-mine -floop-block were left out, because:

> This option is experimental.


What about these?
-pthread -fsanitize=bounds,alignment,object-size -fsanitize-undefined-trap-on-error -fvisibility=hidden

Owner

> What about these?
> -pthread -fsanitize=bounds,alignment,object-size -fsanitize-undefined-trap-on-error -fvisibility=hidden

-pthread is not an optimization flag, and neither is -fsanitize=bounds,alignment,object-size -fsanitize-undefined-trap-on-error.

-fvisibility=hidden: I think it's not wise to mess with symbol visibility. And I'm not aware that this could lead to any benefit, LTO already does program-wide optimization if applicable. Default is public anyways.


Hmm...

> -fvisibility=[default|internal|hidden|protected]
> Sets the default ELF image symbol visibility to the specified option---all symbols are marked with this unless overridden within the code.  Using this feature can very substantially improve linking and load times of shared object libraries, produce more optimized code, provide near-perfect API export and prevent symbol clashes.

Also, while the sanitizers aren't optimization flags, I thought it would help but...
¯\\_(ツ)_/¯

Owner

Sure, that section could suggest that. But the rest is where it gets interesting:

Manual page gcc(1) line 9198

> [...] Despite the nomenclature, default always means public; i.e., available to be linked against from outside the shared object. protected and internal are pretty useless in real-world usage so the only other commonly used option is hidden. **The default if -fvisibility isn't specified is default, i.e., make every symbol public**. A good explanation of the benefits offered by ensuring ELF symbols have the correct visibility is given by "How To Write Shared Libraries" by Ulrich Drepper (which can be found at https://www.akkadia.org/drepper/)---however a superior solution made possible by this option to marking things hidden when the default is public is to make the default hidden and mark things public. This is the norm with DLLs on Windows and with -fvisibility=hidden and "__attribute__((visibility("default")))" instead of "__declspec(dllexport)" you get almost identical semantics with identical syntax. This is a great boon to those working with cross-platform projects. Be aware that headers from outside your project, in particular system headers and headers from any other library you use, may not be expecting to be compiled with visibility other than the default. You may need to explicitly say "#pragma GCC visibility push(default)" before including any such headers. "extern" declarations are not affected by -fvisibility, so a lot of code can be recompiled with -fvisibility=hidden with no modifications. However, this means that calls to "extern" functions with no explicit visibility use the PLT, so it is more effective to use "__attribute__ ((visibility))" and/or "#pragma GCC visibility" to tell the compiler which "extern" declarations should be treated as hidden. [...]

Sanitize options add runtime instrumentation that checks for things like out-of-bounds array accesses, misaligned pointers, and invalid object sizes, so they add overhead rather than speed anything up.

EDIT: Btw, I'm always thankful for suggestions, even if they do not work out :)
