2022 Cauldron Value Numbering for gcc versions

Value Numbering in GCC
Dr. Richard Biener
SUSE Labs, Sep 15th, 2022

Value Numbering
- Assign value numbers to expressions
- Expressions that produce the same value should have the same
  value number
- Usually achieved by hashing of simplified and canonicalized
  expressions with operands replaced by their value number
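A minimal sketch of the hashing idea in the bullets above, as a self-contained C++ toy. The `ValueTable` type and its `leaf`/`binary` members are hypothetical names made up for illustration and are not GCC data structures. The only canonicalization shown is ordering the operands of commutative operations.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>

// Toy value-number table (illustrative only, not GCC's data structures).
struct ValueTable {
  std::unordered_map<std::string, int> table;  // hashed key -> value number
  int next = 0;                                // next unused value number

  // Return the value number recorded for key, allocating one if needed.
  int intern(const std::string &key) {
    auto it = table.find(key);
    if (it != table.end()) return it->second;
    table.emplace(key, next);
    return next++;
  }

  // Value number of a named input.
  int leaf(const std::string &name) { return intern("leaf:" + name); }

  // Value number of `a op b`, where a and b are already value numbers.
  int binary(char op, int a, int b) {
    assert(a >= 0 && a < next && b >= 0 && b < next);
    // Canonicalize commutative operations so a+b and b+a hash the same.
    if ((op == '+' || op == '*') && a > b) std::swap(a, b);
    return intern(std::string(1, op) + ":" + std::to_string(a) + ":" +
                  std::to_string(b));
  }
};
```

With `x = leaf("x")` and `y = leaf("y")`, both `binary('+', x, y)` and `binary('+', y, x)` return the same value number, while `x - y` and `y - x` stay distinct.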

Value Numbering in GCC
Multiple value numbering implementations and their main users
- RTL CSE (cselib)
- RTL PRE
- GIMPLE SSA DOM (scoped tables)
- GIMPLE SSA FRE/PRE (RPO VN)
- simpler forms of VN in CCP and copy propagation

Common Subexpression Elimination
For each statement
- try to simplify the computed expression using the value numbers
  of the operands
- look up the value number of the simplified expression
- if found and a register with that value is available, replace the
  expression with the register or constant
- if not found, record a new value number for it and make it
  available in the destination receiving the value of the expression
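The per-statement steps above can be sketched on straight-line code as follows. This is a toy under stated assumptions: `Stmt`, `Cse` and `visit` are hypothetical names, the only simplification is `x + 0`, and every value is assumed to stay available (no control flow), so the first name that receives a value serves as its leader.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// One statement `dest = a op b`; operands are names or integer literals.
struct Stmt { std::string dest, a; char op; std::string b; };

struct Cse {
  std::unordered_map<std::string, int> vn_of_name;  // name/constant -> VN
  std::unordered_map<std::string, int> vn_of_expr;  // "op:vn:vn" -> VN
  std::vector<std::string> leader;                  // VN -> available name

  // Inputs and constants get a fresh value and are their own leader.
  int operand(const std::string &s) {
    auto it = vn_of_name.find(s);
    if (it != vn_of_name.end()) return it->second;
    leader.push_back(s);
    int vn = static_cast<int>(leader.size()) - 1;
    vn_of_name[s] = vn;
    return vn;
  }

  // Process one statement and return the text of the statement to emit.
  std::string visit(const Stmt &s) {
    int a = operand(s.a);
    int b = operand(s.b);
    int vn = -1;
    // Simplify using the operands' values (only x + 0 in this toy).
    if (s.op == '+' && leader[b] == "0") vn = a;
    else if (s.op == '+' && leader[a] == "0") vn = b;
    if (vn < 0) {
      // Look up the expression keyed on the operands' value numbers.
      std::string key = std::string(1, s.op) + ":" + std::to_string(a) +
                        ":" + std::to_string(b);
      auto it = vn_of_expr.find(key);
      if (it == vn_of_expr.end()) {
        // Not found: new value, the destination becomes its leader.
        leader.push_back(s.dest);
        vn = static_cast<int>(leader.size()) - 1;
        vn_of_expr[key] = vn;
        vn_of_name[s.dest] = vn;
        return s.dest + " = " + leader[a] + " " + s.op + " " + leader[b];
      }
      vn = it->second;
    }
    // Found: replace the expression with the leader of its value.
    assert(vn >= 0 && vn < static_cast<int>(leader.size()));
    vn_of_name[s.dest] = vn;
    return s.dest + " = " + leader[vn];
  }
};
```

Visiting `t1 = x + y` and then `t2 = x + y` rewrites the second statement to `t2 = t1`. A following `t3 = t2 + 0` becomes `t3 = t1` because the simplification works on values, not names.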

Availability
Different ways to track, update and query availability of a so-called
leader for a value number
- with a DOM walk a value-to-leader map can be kept
  up-to-date with an unwind stack
- the RPO VN walk records a list of leaders for each value that
  can be unwound when iterating and otherwise queried with
  dominator checks
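A minimal sketch of the first variant, a value-to-leader map kept up-to-date with an unwind stack during a dominator-tree walk. The class `ScopedLeaders` and its methods are hypothetical and only illustrate the idea. The toy assumes leader names are never empty, since an empty string marks "no previous leader" in the undo log.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

class ScopedLeaders {
  std::unordered_map<int, std::string> leader_;  // value number -> leader
  // Undo log: (value number, previous leader, "" if there was none).
  std::vector<std::pair<int, std::string>> undo_;
  std::vector<std::size_t> marks_;  // undo log size at each block entry

 public:
  // Call when the dominator walk enters a block.
  void enter_block() { marks_.push_back(undo_.size()); }

  // Record name as the leader for vn, remembering what it replaces.
  void set_leader(int vn, const std::string &name) {
    auto it = leader_.find(vn);
    undo_.emplace_back(vn, it == leader_.end() ? std::string() : it->second);
    leader_[vn] = name;
  }

  // Call when the walk leaves a block: undo everything it recorded.
  void leave_block() {
    assert(!marks_.empty());
    std::size_t mark = marks_.back();
    marks_.pop_back();
    while (undo_.size() > mark) {
      const std::pair<int, std::string> &e = undo_.back();
      if (e.second.empty()) leader_.erase(e.first);
      else leader_[e.first] = e.second;
      undo_.pop_back();
    }
  }

  // Leader currently available for vn, or nullptr if there is none.
  const std::string *lookup(int vn) const {
    auto it = leader_.find(vn);
    return it == leader_.end() ? nullptr : &it->second;
  }
};
```

Because a block is only left after all blocks it dominates have been processed, whatever is in the map at a lookup is available on every path to the current block. The unwind cost is proportional to the number of entries a block recorded.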

Availability and expression simplification
- use match.pd based simplification
- value expression operands get substituted with their leaders
- allows keeping flow-sensitive info like ranges

Memory Expressions
ENTRY
<bb 2>:
  # .MEM_3 = VDEF <.MEM_1(D)>
  p_2(D)->a = 0;
  # .MEM_4 = VDEF <.MEM_3>
  p_2(D)->b = 1;
  # .MEM_5 = VDEF <.MEM_4>
  x = *p_2(D);
  # VUSE <.MEM_5>
  _6 = x.a;
  # .MEM_7 = VDEF <.MEM_5>
  x ={v} {CLOBBER(eol)};
  # VUSE <.MEM_7>
  return _6;

Memory Expressions
- memory state is part of hashing, the current .MEM_n virtual
  definition is used
- at lookup time walk the virtual SSA use->def chains, skip
  clobbers that do not alias and perform lookups with the
  previous memory state
- fancy tricks during walking
  - memory to memory copies
  - pieces from larger entities
  - larger objects formed from smaller entities
- memory handling consumes the majority of compile time
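The walk over the virtual use->def chain can be sketched with a toy model. Everything here is an assumption for illustration: `Ref`, `VDef` and `lookup_load` are hypothetical names, and the alias rule is deliberately crude (same base and offset must alias, same base with a different offset cannot alias, different bases may alias). The real walker is far more elaborate.

```cpp
#include <cassert>
#include <string>
#include <vector>

// A memory reference: base pointer or variable name plus byte offset.
struct Ref { std::string base; int offset; };

// One store. prev is the index of the previous memory state (-1 for the
// state at function entry), value is the value number that was stored.
struct VDef { int prev; Ref ref; int value; };

// Value number seen by a load of `load` whose VUSE is defs[vuse],
// or -1 if the walk cannot determine it.
int lookup_load(const std::vector<VDef> &defs, int vuse, const Ref &load) {
  assert(vuse < static_cast<int>(defs.size()));
  for (int d = vuse; d >= 0; d = defs[d].prev) {
    const Ref &store = defs[d].ref;
    if (store.base != load.base)
      return -1;  // may alias: stop conservatively
    if (store.offset == load.offset)
      return defs[d].value;  // must alias: the load sees this store
    // Same base, different offset: the store does not clobber the load,
    // so continue the lookup with the previous memory state.
  }
  return -1;  // reached the memory state at function entry
}
```

Modelled on the earlier example, a load of `p->a` after the stores `p->a = 0` and `p->b = 1` skips the store to `b` and finds the value stored to `a`.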

Why RPO VN
- SSA SCC VN
  - reduces what to iterate
  - difficult to mate with CFG: not executable parts, predication,
    equivalences, region
- RPO VN
  - iteration more costly
  - maps to the CFG, allows for flow-sensitive optimizations easily
  - allows region-based operation

RPO VN Operation Modes
- can operate with different effort for memory handling
- can do optimistic, iterating VN with elimination done after the
  fact
- can do non-iterating VN with immediate elimination
- can operate on the whole function or on a single-entry,
  multiple-exit region

Iterating vs non-Iterating
ENTRY
<bb 2>:
  n_3 = 1;
  i_4 = 0;
  val_5 = 0;
loop 1
<bb 3>:
  # i_1 = PHI <i_4(2), i_7(3)>
  # val_2 = PHI <val_5(2), val_6(3)>
  val_6 = val_2 + 1;
  i_7 = i_1 + 1;
  if (i_7 < n_3)
    goto <bb 3>; [INV]
  else
    goto <bb 4>; [INV]
<bb 4>:
  _8 = val_6;
  return _8;

Iteration scheme
- SSA SCC based VN iterates SSA SCCs until nothing changes
- RPO VN iterates CFG cycles
  - rev_post_order_and_mark_dfs_back_seme can compute an
    RPO with CFG cycles adjacent and their extent in the RPO
    array recorded
  - handles irreducible regions, loop info would not
  - optimal regions for iteration
- avoid iteration when possible, do not iterate until nothing
  changes
- unwind cost to the iteration point is linear in the amount of
  things to undo (expression hashes, availability)
- iteration itself is O(n * loop-depth), inner cycles are iterated
  fully before iterating outer cycles
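For orientation, the textbook way to obtain a reverse post-order and the DFS back edges of a CFG is sketched below. This is not the GCC routine: `compute_rpo` and `RpoResult` are hypothetical names, and the sketch neither restricts itself to a region nor guarantees that the blocks of a cycle end up adjacent in the order, which is what rev_post_order_and_mark_dfs_back_seme adds on top.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct RpoResult {
  std::vector<int> order;                       // blocks in reverse post-order
  std::vector<std::pair<int, int>> back_edges;  // (src, dest) closing a cycle
};

// succs[b] lists the successor blocks of block b.
RpoResult compute_rpo(const std::vector<std::vector<int>> &succs, int entry) {
  assert(entry >= 0 && entry < static_cast<int>(succs.size()));
  enum { UNSEEN, ON_STACK, DONE };
  std::vector<int> state(succs.size(), UNSEEN);
  RpoResult r;
  // Iterative DFS; each entry is (block, index of next successor to try).
  std::vector<std::pair<int, std::size_t>> stack;
  state[entry] = ON_STACK;
  stack.emplace_back(entry, 0);
  while (!stack.empty()) {
    int bb = stack.back().first;
    std::size_t i = stack.back().second;
    if (i < succs[bb].size()) {
      stack.back().second = i + 1;
      int dest = succs[bb][i];
      if (state[dest] == UNSEEN) {
        state[dest] = ON_STACK;
        stack.emplace_back(dest, 0);
      } else if (state[dest] == ON_STACK) {
        r.back_edges.emplace_back(bb, dest);  // edge to a DFS ancestor
      }
    } else {
      state[bb] = DONE;  // all successors done: emit in post-order
      r.order.push_back(bb);
      stack.pop_back();
    }
  }
  std::reverse(r.order.begin(), r.order.end());
  return r;
}
```

In a reverse post-order every block is visited after all its predecessors except along back edges, which is why only the cycles closed by those edges ever need to be revisited.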

Non-iterative mode
- Greedy walk along edges discovered as executable, but
  enforcing RPO visiting of reachable blocks.
- Predecessors not visited and reachable from blocks later in
  RPO have to be conservatively assumed reachable.
- Handles PHIs with unreachable incoming non-back edges
  optimally
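How edge executability feeds into PHI handling can be sketched with a toy. `PhiArg`, `phi_value` and the `VARYING` marker are hypothetical names, and the rule shown (a PHI takes the common value of its arguments on executable edges) is a simplification of what the pass does.

```cpp
#include <cassert>
#include <vector>

// Hypothetical marker: the PHI result gets a value number of its own.
const int VARYING = -1;

// One PHI argument: its value number and whether the incoming edge is
// treated as executable when the PHI is visited.
struct PhiArg { int value; bool edge_executable; };

// If all arguments on executable edges share one value number, the PHI
// has that value; otherwise (or with no executable edge) it is VARYING.
int phi_value(const std::vector<PhiArg> &args) {
  int result = VARYING;
  bool seen = false;
  for (const PhiArg &a : args) {
    assert(a.value >= 0);
    if (!a.edge_executable) continue;  // ignore non-executable edges
    if (!seen) {
      result = a.value;
      seen = true;
    } else if (a.value != result) {
      return VARYING;
    }
  }
  return result;
}
```

For a loop-header PHI such as `val_2 = PHI <val_5(2), val_6(3)>`, an optimistic iterating walk can start with the back edge treated as not executable and give the PHI the value of `val_5`. The non-iterating walk has not visited the back edge's source yet, so it must treat that edge as executable and the PHI becomes VARYING. An unreachable incoming non-back edge is already known when the PHI is visited, so it can be ignored in both modes.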

RPO VN as Utility
RPO VN was designed to be usable on small regions of a function
with little overhead even when invoked very often, and to be
much cheaper than a pass over the whole function.
- loop unrolling applies CSE on unrolled bodies before trying to
  unroll the containing loop
- loop if-conversion applies CSE to optimize predicates
- unroll-and-jam applies CSE to leverage cross-loop redundancies
- uninit analysis uses RPO VN to compute basic block
  reachability without performing actual CSE

RPO VN Utility API
enum vn_lookup_kind { VN_NOWALK, VN_WALK, VN_WALKREWRITE };
unsigned do_rpo_vn
  (function *fun, edge entry, bitmap exits,
   /* iterate */ bool = false, /* eliminate */ bool = true,
   vn_lookup_kind = VN_WALKREWRITE);
int rev_post_order_and_mark_dfs_back_seme
  (function *fn, edge entry, bitmap exit_bbs,
   bool for_iteration, int *rev_post_order,
   vec<std::pair<int, int> > *scc_ext);
auto_bb_flag, auto_edge_flag

RPO VN Utility Efficiency
Non-iterating region-based VN with or without elimination was
designed to be efficient
- startup cost linear in the size of the region
- performing RPO VN with VN_NOWALK, without iteration
  and elimination, on each basic block individually vs. performing
  a single RPO VN on the whole function is only around 15%
  slower for cc1 files, with insn-attrtab.i being the outlier at 280%
- more elaborate memory handling or doing elimination does not
  allow for an apples-to-apples comparison
- while doing CSE on the whole function might perform more
  optimizations, doing that should never be faster than only doing
  CSE on the regions a pass performed a transformation on

TODO
- experiment with using ranger instead of the ad-hoc predication
  we have
- review equivalence tracking changes
- think of a cheaper way to do "iteration"
- we have simple DCE with an SSA worklist, need region
  DCE/DSE
