Add several more optimizations #92
4 Participants
Reference: ALHP/ALHP.GO#92
I currently use ALHP repos (x86-64-v3) on my computer, and everything has been flawless! However, I have previously used, and am also a big fan of, gentooLTO. I believe some additional optimizations that they suggest could be enabled for this repo's packages as they are updated and built, with minimal side effects (beyond slightly longer build times). If some of these are undesirable, I completely understand, but I thought I'd bring them up just in case.
The first is Graphite. Arch's build of `gcc` has Graphite support, so it should be as simple as adding `-fgraphite-identity -floop-nest-optimize`.

The second is Polly. AFAIK Polly is not usually used in gentooLTO, but it should produce the same benefits as Graphite, but for LLVM/Clang. Supporting this would mean requiring `polly` from `[extra]`, and adding `-Xclang -load -Xclang LLVMPolly.so` and `-mllvm -polly` to `CFLAGS`. I'm not sure if this would affect `gcc`, since it obviously doesn't support these flags. Perhaps also consider `-mllvm -polly-parallel -lgomp` for automatic OpenMP code generation.

The third is simply `-fdevirtualize-at-ltrans`. This should improve optimizations with LTO.

Moreover, I saw a comment by a user on Reddit that recommended adding `-mpclmul`. I think this should be harmless because even though it isn't implied by `-march=x86-64-v3`, according to Wikipedia, it has been in every processor since Westmere, including from AMD. Perhaps `-maes` should also be considered for the same reason!

Lastly, `-mfpmath=sse` should help speed up performance without any breakage, and should always help so long as the processor supports SSE instructions, which is a requirement of x86-64-v2 and higher.

I think all of these could be implemented, but Graphite/Polly seem like the most disruptive ones in terms of build requirements (more memory). Let me know your thoughts!
Hi @OpenSourceAnarchist and thanks for your detailed request.
I need to do some more research on some points you raise, but let's just dig right in:
I'm aware of Graphite. Need to do more research on the resource implications of enabling it.
That is probably not going to happen soon, since requiring another makedep, dependent on the buildchain used, requires more PKGBUILD editing (and checks for the buildchain used, etc.).
If I understood this correctly, this would increase mem-usage significantly. LTO already causes problems with mem-usage since we are still on GCC 11.1, adding even more memory usage would not be helpful at the moment. We can reconsider this once a newer GCC version hits the repos.
I'll do some research on pclmul availability on >=v2 CPUs. If Wikipedia is to be trusted, it should be available. Most software which uses AES heavily already does runtime detection of AES afaik (openssl, dm-crypt, etc.), so the performance gains here would probably not be that big.
`man gcc` about `-mfpmath=sse`:

If I interpret this correctly, `-mfpmath=sse` should be the default for x86-64 anyway?

Sounds good for Graphite, `-fdevirtualize-at-ltrans`, and `-mpclmul`/`-maes`.

I think you're right about Polly. Definitely could be a long-term goal, but would require checks and string replacement.
As for `-mfpmath=sse`, you are completely correct! Somehow I missed that it's the default for anything detected with SSE2 support. Just ignore that suggestion :)

Memory usage looks good with `gcc-11.2` in the repos now. I'll continue to monitor this for a while, then I plan to add `-fdevirtualize-at-ltrans -maes`.

@anonfunc do you use `-fdevirtualize-speculatively` already?

No it should not. I don't think we should alter the functions of security algorithms. Libraries either use AES instructions or they don't. This might force them to use it, which has security implications.
I can't think of a single library which is not capable of using the AES instructions if you tell it to. So I doubt there's any real life benefit of doing this.
Nope, haven't gotten around to adding it yet.

@anonfunc well, `-fdevirtualize-speculatively` wasn't mentioned here before, but it sounds like it could lead to further optimisations down the road :)

So maybe just add it to the ToDo list, if you're fine with it
Well, so this is a Sandy Bridge processor, so I can confirm that it's available there. :)
And here's the lowest end second generation Atom (a Silvermont based x5-Z8300)
That's not the case; the Atom J1900, for example, doesn't have AES-NI:
https://ark.intel.com/content/www/de/de/ark/products/78867/intel-celeron-processor-j1900-2m-cache-up-to-2-42-ghz.html
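For anyone who wants to check their own machine, a small sketch that reads the feature list the Linux kernel exposes in `/proc/cpuinfo`. The function names are hypothetical; `aes` and `pclmulqdq` are the names the kernel uses for AES-NI and PCLMUL on x86.

```python
def cpu_flags(cpuinfo_text):
    """Collect the feature flags from /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def missing_features(cpuinfo_text, wanted=("aes", "pclmulqdq")):
    """Return the wanted features that the CPU does not advertise."""
    have = cpu_flags(cpuinfo_text)
    return [feature for feature in wanted if feature not in have]
```

On a Linux box, `missing_features(open("/proc/cpuinfo").read())` returns an empty list when both instruction sets are present.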
Sorry, I thought you were talking about `-fdevirtualize-at-ltrans`. There is no need for `-fdevirtualize-speculatively`, it's included in `-O2` and above.

I lean more towards letting the program choose if it wants to use AES or not. `-mpclmul` and `-fdevirtualize-at-ltrans` can probably be enabled safely.

Ah damn, sorry. I missed that. :)
What do you think about `-fipa-pta`? It does take more memory but is commonly used on Gentoo builds.
Do you have some headroom on the build server in terms of memory? :)
Not really, sadly. That may be an option if we ever get a dedicated machine for ALHP; otherwise this server has to handle some other tasks as well, and even 64 GB of RAM are filled faster than one might think :P
I'm wondering how much more it would really use. I mean, sometimes those warnings are just overly cautious. :)
Could we test it? I mean it should fail with OOM for the container if it hits the memory limit and we could restart it without, like with LTO? :)
We probably could test it, but I would prefer not doing it on the production machine. I can probably make some test builds locally later/tomorrow.
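The retry idea above can be sketched as follows. This is an illustration under assumptions, not how ALHP handles it: `build` is a hypothetical callable that takes a list of flags and raises `MemoryError` when the build exceeds its memory limit.

```python
def build_with_fallback(build, base_flags, risky_flags):
    """Try all flags first; on MemoryError retry with base_flags only.

    Returns (build result, whether the risky flags were used).
    """
    try:
        return build(base_flags + risky_flags), True
    except MemoryError:
        # Same package, second attempt, without the memory-hungry flags.
        return build(base_flags), False
```

A real build runs in a container, so the out-of-memory condition would have to be detected from the container's exit status rather than from a Python exception.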
Just FYI, all builds after Thu, 12 May 2022 01:00 UTC are built with the new flags `-mpclmul` and `-fdevirtualize-at-ltrans`. Memory usage looks good.

Nice!
I worked through the GCC docs and compared them against the GCC flags. I found some strange defaults GCC sets and some more optimisation flags we could discuss.
They all just enable additional checks/analysis/optimization passes and don't force the compiler to do anything specific - so should fit in line with the goals :)
Will push this later to my fork and link it here. So maybe hold your breath with the test builds? :)
@OpenSourceAnarchist maybe you want to take a look over this as well?
It includes your changes as well:
https://git.harting.dev/ALHP/ALHP.GO/pulls/117
`-floop-nest-optimize` / `-ftree-loop-linear -floop-strip-mine -floop-block` was left out, because:

What about these?
`-pthread -fsanitize=bounds,alignment,object-size -fsanitize-undefined-trap-on-error -fvisibility=hidden`

`-pthread` is no optimization flag, neither is `-fsanitize=bounds,alignment,object-size -fsanitize-undefined-trap-on-error`.
`-fvisibility=hidden`: I think it's not wise to mess with symbol visibility. And I'm not aware that this could lead to any benefit, LTO already does program-wide optimization if applicable. Default is public anyways.

Hmm...
Also, while the sanitizers aren't optimization flags, I thought it would help but...
¯\\\_(ツ)\_/¯
Sure, that section could suggest that. But the rest is where it gets interesting:
Manual page gcc(1) line 9198
Sanitize options usually enable static code analysis for memory leaks, pointer checks, array bounds, etc.
EDIT: Btw, I'm always thankful for suggestions, even if they do not work out :)