feat/additional_optimizations #117

Closed
RubenKelevra wants to merge 18 commits from RubenKelevra/ALHP.GO:feat/additional_optimizations into main
Contributor

Let's discuss :)
RubenKelevra added 13 commits 2022-05-14 20:14:09 +02:00
fed114a335 enable store motion pass run after global common subexpression elimination
docs:
When -fgcse-sm is enabled, a store motion pass is run after global common subexpression elimination. This pass attempts to move stores out of loops. When used in conjunction with -fgcse-lm, loops containing a load/store sequence can be changed to a load before the loop and a store after the loop.
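To make the load/store sequence from the docs concrete, here is a minimal C sketch. The function names are hypothetical. The second function is a hand-written version of the shape that -fgcse-sm together with -fgcse-lm may produce. Whether GCC actually performs the transformation depends on alias analysis and the target.

```c
#include <stddef.h>

/* Hypothetical example: *total is loaded and stored on every iteration. */
void sum_into(int *total, const int *values, size_t n)
{
    for (size_t i = 0; i < n; i++)
        *total += values[i];
}

/* Hand-written equivalent of the transformed loop: one load before the
 * loop and one store after it. This is only valid when total does not
 * alias any element of values. */
void sum_into_hoisted(int *total, const int *values, size_t n)
{
    int acc = *total;               /* load before the loop */
    for (size_t i = 0; i < n; i++)
        acc += values[i];
    *total = acc;                   /* store after the loop */
}
```

Both functions compute the same result for non-aliasing arguments. The second one keeps the running sum in a register for the duration of the loop.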
10bb80049a enable elimination of redundant loads that come after stores to the same memory location
docs:
When -fgcse-las is enabled, the global common subexpression elimination pass eliminates redundant loads that come after stores to the same memory location (both partial and full redundancies).
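A minimal C sketch of a load that follows a store to the same location. The function name is hypothetical, and for a case this simple earlier passes may already remove the reload without -fgcse-las.

```c
/* Hypothetical example: the read of *slot directly after the write is
 * redundant, because the stored value is still known. */
int store_then_reload(int *slot, int value)
{
    *slot = value;      /* store */
    return *slot + 1;   /* load from the same memory location */
}
```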
db7da01f5d enable live range shrinkage to relieve register pressure
docs:
I'd recommend to use this at
least for x86/x86-64.  I think any OOO processor with small or
moderate register file which does not use the 1st insn scheduling
might benefit from this too.

  On SPEC2000 for x86/x86-64 (I use Haswell processor, -O3 with
general tuning), the optimization usage results in smaller code size
in average (for floating point and integer benchmarks in 32- and
64-bit mode).  The improvement better visible for SPECFP2000 (although
I have the same improvement on x86-64 SPECInt2000 but it might be
attributed mostly to mcf benchmark instability).  It is about 0.5% for
32-bit and 64-bit mode.  It is understandable, as the optimization has
more opportunities to improve the code on longer BBs.  Different from
other heuristic optimizations, I don't see any significant worse
performance.  It gives practically the same or better performance (a
few benchmarks improved by 1% or more, up to 3%).

  The single but significant drawback is additional compilation time
(4%-6%) as the 1st insn scheduling pass is quite expensive.

Source of docs: https://gcc.gnu.org/legacy-ml/gcc-patches/2013-11/msg00420.html
63801b7797 enable evaluation of register pressure in loops for decisions to move loop invariants
This is enabled with -O3, but only for some targets. Seems to be off for x86_64.

docs:
Use IRA to evaluate register pressure in loops for decisions to move loop invariants. This option usually results in generation of faster and smaller code on machines with large register files (>= 32 registers), but it can slow the compiler down.
15df5b0bf3 enable register pressure sensitive insn scheduling before register allocation
docs:
Enable register pressure sensitive insn scheduling before register allocation. This only makes sense when scheduling before register allocation is enabled, i.e. with -fschedule-insns or at -O2 or higher. Usage of this option can improve the generated code and decrease its size by preventing register pressure increase above the number of available hard registers and subsequent spills in register allocation.
9689f5e54a enable Swing Modulo Scheduling for basic innermost loop optimizations without blocking further rescheduling
docs:
SMS is intended to schedule instructions of loops rather than the traditional scheduler (in GCC) that does not give a special handling for loops. For more information on the theory behind SMS take a look at the 2004 GCC summit proceedings (page 55). This optimization helps in loops where there is a place to run consecutive iterations concurrently but the traditional instruction scheduling is not able to fully utilize the hardware functional units. This optimization is disabled by default because of compile time consumption; -fmodulo-sched activates it.

Source: https://gcc.gnu.org/news/sms.html
a46a966127 allow dead exception code to be removed
docs:
Consider that instructions that may throw exceptions but don’t otherwise contribute to the execution of the program can be optimized away. This does not affect calls to functions except those with the pure or const attributes. This option is enabled by default for the Ada and C++ compilers, as permitted by the language specifications. Optimization passes that cause dead exceptions to be removed are enabled independently at different optimization levels.
56a6d18779 enable graphite with identity transformation
docs:
Enable the identity transformation for graphite. For every SCoP we generate the polyhedral representation and transform it back to gimple. Using -fgraphite-identity we can check the costs or benefits of the GIMPLE -> GRAPHITE -> GIMPLE transformation. Some minimal optimizations are also performed by the code generator isl, like index splitting and dead code elimination in loops.
c9a0d6fb3b enable auto split of the stack before it overflows
docs:
Generate code to automatically split the stack before it overflows. The resulting program has a discontiguous stack which can only overflow if the program is unable to allocate any more memory. This is most useful when running threaded programs, as it is no longer necessary to calculate a good stack size to use for each thread. This is currently only implemented for the x86 targets running GNU/Linux.

When code compiled with -fsplit-stack calls code compiled without -fsplit-stack, there may not be much stack space available for the latter code to run. If compiling all code, including library code, with -fsplit-stack is not an option, then the linker can fix up these calls so that the code compiled without -fsplit-stack always has a large stack. Support for this is implemented in the gold linker in GNU binutils release 2.21 and later.
56abe8be8d enable interprocedural pointer analysis and interprocedural modification and reference analysis
docs:
Perform interprocedural pointer analysis and interprocedural modification and reference analysis. This option can cause excessive memory and compile-time usage on large compilation units. It is not enabled by default at any optimization level.
a8ae214305 Detect paths that trigger erroneous or undefined behavior
docs:
Detect paths that trigger erroneous or undefined behavior due to a null value being used in a way forbidden by a returns_nonnull or nonnull attribute. Isolate those paths from the main control flow and turn the statement with erroneous or undefined behavior into a trap. This is not currently enabled, but may be enabled by -O2 in the future.
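A minimal C sketch of the kind of path this option targets. The function names are hypothetical. One branch passes NULL to a parameter marked nonnull, and with -fisolate-erroneous-paths-attribute GCC may isolate that branch and replace the offending statement with a trap.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: the attribute promises that name is never NULL. */
__attribute__((nonnull(1)))
size_t name_length(const char *name)
{
    return strlen(name);
}

/* When use_null is non-zero, NULL reaches a nonnull parameter. That path
 * has undefined behavior and is a candidate for isolation. */
size_t length_or_bug(const char *name, int use_null)
{
    const char *p = use_null ? NULL : name;
    return name_length(p);
}
```

Only the valid path (use_null == 0) has defined behavior, so callers must never take the other branch.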
fabc35f4d7 set regions for the integrated register allocator to the default value
The behavior of GCC 12 doesn't follow the docs here. Regardless of -O2/-O3 or -march=native/generic/x86-64-v3, it's always set to 'one'. The docs state it should be set to 'mixed' if optimizations are on.

Docs:
Use specified regions for the integrated register allocator. The region argument should be one of the following:

‘all’

    Use all loops as register allocation regions. This can give the best results for machines with a small and/or irregular register set.
‘mixed’

    Use all loops except for loops with small register pressure as the regions. This value usually gives the best results in most cases and for most architectures, and is enabled by default when compiling with optimization for speed (-O, -O2, …).
‘one’

    Use all functions as a single region. This typically results in the smallest code size, and is enabled by default for -Os or -O0.
66ec997a04 correcting model cost for vectorization of loops marked with the OpenMP simd directive
GCC's behavior again doesn't follow the documentation, which suggests that 'dynamic' is used by default. GCC 12 defaults to 'unlimited' here, which does not estimate the cost against the benefit.

Docs:
Alter the cost model used for vectorization of loops marked with the OpenMP simd directive. The model argument should be one of ‘unlimited’, ‘dynamic’, ‘cheap’. All values of model have the same meaning as described in -fvect-cost-model and by default a cost model defined with -fvect-cost-model is used.
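For context, a minimal C sketch of a loop marked with the OpenMP simd directive, which is the kind of loop this cost model applies to. The function name is hypothetical. The pragma only takes effect when the file is compiled with -fopenmp or -fopenmp-simd, and is ignored otherwise.

```c
#include <stddef.h>

/* Hypothetical kernel: out[i] += factor * in[i] for every element. */
void scale_add(float *restrict out, const float *restrict in,
               float factor, size_t n)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        out[i] += factor * in[i];
}
```

With the 'unlimited' model the vectorizer does not weigh cost against benefit for such a loop. With 'dynamic' it may add a runtime check and keep a scalar fallback.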
RubenKelevra changed title from feat/additional_optimizations to WIP: feat/additional_optimizations 2022-05-14 20:14:21 +02:00
RubenKelevra added 1 commit 2022-05-14 20:59:25 +02:00
d423bb7486 Attempt to avoid false dependencies in scheduled code by making use of registers left over after register allocation
The docs state that it is enabled by default with -funroll-loops, which we don't use.

docs:
Attempt to avoid false dependencies in scheduled code by making use of registers left over after register allocation. This optimization most benefits processors with lots of registers. Depending on the debug information format adopted by the target, however, it can make debugging impossible, since variables no longer stay in a “home register”.

Enabled by default with -funroll-loops.
RubenKelevra added 1 commit 2022-05-14 21:03:12 +02:00
2e126c2285 Constructs webs
docs:
Constructs webs as commonly used for register allocation purposes and assign each web individual pseudo register. This allows the register allocation pass to operate on pseudos directly, but also strengthens several other optimization passes, such as CSE, loop optimizer and trivial dead code remover. It can, however, make debugging impossible, since variables no longer stay in a “home register”.
RubenKelevra added 1 commit 2022-05-14 21:16:14 +02:00
5b04a95afd enable vectorization of trees
docs:
Perform vectorization on trees. This flag enables -ftree-loop-vectorize and -ftree-slp-vectorize if not explicitly specified.
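A minimal C sketch of a loop that -ftree-loop-vectorize can typically handle. The function name is hypothetical, and the restrict qualifiers are an assumption that lets the compiler rule out overlap between the arrays.

```c
#include <stddef.h>

/* Hypothetical example: element-wise addition with no dependency between
 * iterations, so several elements can be processed per instruction. */
void add_arrays(int *restrict dst, const int *restrict a,
                const int *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

Compiling with -fopt-info-vec reports which loops were vectorized, which is one way to check the effect of this flag.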
RubenKelevra added 1 commit 2022-05-14 21:30:09 +02:00
d768ecc420 don't split LTO in multiple segments
The rationale of the default 'dynamic' is to do some multiprocessing in a first local optimisation step. This might lead to failures at link time (rare!) but is also not optimal.

More details:
https://gcc.gnu.org/onlinedocs/gccint/LTO-Overview.html
RubenKelevra force-pushed feat/additional_optimizations from d768ecc420 to ac537119bb 2022-05-14 21:30:34 +02:00 Compare
RubenKelevra added 1 commit 2022-05-14 21:34:51 +02:00
c5f2d5d3cb assume loops are finite
docs:
Assume that a loop with an exit will eventually take the exit and not loop indefinitely. This allows the compiler to remove loops that otherwise have no side-effects, not considering eventual endless looping as such.

This option is enabled by default at -O2 for C++ with -std=c++11 or higher.
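A minimal C sketch of a loop that this option affects. The function name is hypothetical. The loop has an exit and no side effects, but the compiler cannot prove that it terminates for every input. With -ffinite-loops it is allowed to assume the exit is taken and remove the loop.

```c
/* Hypothetical example: iterate the Collatz step until n reaches 1.
 * The loop changes nothing outside of n, and the return value does not
 * depend on it, so a compiler that assumes termination may drop it. */
unsigned collatz_reaches_one(unsigned n)
{
    while (n != 1)
        n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
    return 1;
}
```

Note that calling this with n == 0 never terminates without the option, so the flag changes observable behavior for that input.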
RubenKelevra changed title from WIP: feat/additional_optimizations to feat/additional_optimizations 2022-05-14 21:43:27 +02:00
Author
Contributor

From reading the GCC docs, they should all be neutral to slightly beneficial and a safe choice.
RubenKelevra force-pushed feat/additional_optimizations from c5f2d5d3cb to 31de58baa7 2022-05-18 14:34:16 +02:00 Compare
Author
Contributor

@anonfunc should I split this?
Owner

That may be a good idea, since we can talk about specific options in detail instead of creating a huge messy thread here.
Author
Contributor

There's a blocker for this PR we need to solve first: https://git.harting.dev/ALHP/ALHP.GO/issues/124
Owner

Since there has been no activity recently, I'll close this.
anonfunc closed this pull request 2023-03-14 01:50:06 +01:00
