Automatic Generation of Peephole Superoptimizers Speaker: Shuai-wei Huang Advisor: Wuu Yang Sorav Bansal and Alex Aiken Computer System Lab, Stanford University ASPLOS'06
Contents 1. Introduction 2. Design of the Optimizer 3. Experimental Results 4. Conclusion
Introduction – Peephole Optimizer Peephole optimizers are pattern matching systems that replace one sequence of instructions by another equivalent, but faster, sequence of instructions. The optimizations are usually expressed as parameterized replacement rules, so that, for example, mov r1, r2; mov r2, r1 => mov r1, r2. Peephole optimizers are typically constructed using human-written pattern matching rules, which requires time and is less systematic.
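The slide's example rule can be illustrated with a toy Python sketch. This is not the paper's implementation: instructions are modeled as simple (opcode, dest, src) tuples, and the rule assumes no intervening side effects (e.g. flags), which a real peephole optimizer would have to check.

```python
def apply_rule(seq):
    """Drop the redundant second mov in 'mov X, Y; mov Y, X' pairs."""
    out, i = [], 0
    while i < len(seq):
        if (i + 1 < len(seq)
                and seq[i][0] == "mov" and seq[i + 1][0] == "mov"
                and seq[i][1:] == (seq[i + 1][2], seq[i + 1][1])):
            out.append(seq[i])  # 'mov Y, X' writes back an unchanged value
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out
```

The rule fires only when the second mov's operands are exactly the first mov's operands swapped, matching the slide's parameterized pattern.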
Introduction  – Superoptimization Automatically discover replacement rules that are optimizations. Optimizations are computed off-line and then presented as an indexed structure for efficient lookup. The optimizations are then organized into a lookup table mapping original sequences to their optimized counterparts.
Introduction  – Superoptimization The goals in this paper are considerably more modest, focusing on showing that an automatically constructed peephole optimizer is possible and,  even with limited resources  (i.e., a single machine) and learning hundreds to thousands of useful optimizations, such an optimizer can find significant speedups that standard optimizers miss.
Introduction – Terminology problem The classical meaning of superoptimization is to find the optimal code sequence for a single, loop-free assembly sequence of instructions, which we call the target sequence. The term superoptimization is used to distinguish this search for optimal sequences from garden-variety optimization as that term is normally used.
Introduction  – Related work Massalin Simply enumerates sequences of instructions of increasing length, testing each for equality with the target sequence; the lowest cost equivalent sequence found is the optimal one. Denali Constrains the  search space  to a set of equality-preserving transformations expressed by the system designer.  For a given target sequence, a structure representing all possible equivalent sequences under the transformation rules is searched for the lowest cost equivalent sequence.
Design of the Optimizer - Term Instruction   : an opcode together with some valid operands. A potential problem arises with opcodes that take immediate operands, as they generate a huge number of instructions. Restrict immediate operands to a small set of constants and symbolic constants. Cost function  : different cost functions for different purposes Running time to optimize speed Instruction byte count to optimize the size of a binary
Design of the Optimizer - Term Equivalence of two instruction sequences depends on the context: live registers, stack locations, and memory locations. (For implementation simplicity, we currently conservatively assume memory and stack locations are always live.) The equivalence test ≡L tests two instruction sequences for equivalence under the context (set of live registers) L. For a target sequence T and a cost function c, we are interested in finding a minimum cost instruction sequence O such that O ≡L T.
Design of the Optimizer - Flowchart
Design of the Optimizer – Structure Harvester Extracts target instruction sequences from the training applications. The target instruction sequences are the ones we seek to optimize. Enumerator Exhaustively enumerates all possible candidate instruction sequences up to a certain length.  Checking if each candidate sequence is an optimal replacement for any of the target instruction sequences. Optimization database An index of all discovered optimizations
Design of the Optimizer – Harvester Harvesting Target Instruction Sequences 1. Obtain target instruction sequences from a representative set of applications. 2. These harvested instruction sequences form the corpus used to train the optimizer. 3. A harvestable instruction sequence I must have a single entry point. 4. Records the set of live registers.
Design of the Optimizer – Canonicalization(1/3) On a machine with 8 registers, an instruction  mov r1, r0 has 8*7 = 56 equivalent versions with different register names. Canonicalization  : eliminate all unnecessary instruction sequences that are mere renamings of others.
Design of the Optimizer – Canonicalization(2/3) An instruction sequence is  canonical  if its registers and constants are named in the order of their appearance in the instruction sequence. The first register used is always r0, the second distinct register used is always r1, and so on. Similarly, the first constant used in a canonical instruction sequence is c0, the second distinct constant c1, and so on.
Design of the Optimizer – Canonicalization(3/3)
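The canonicalization rule from the previous slides can be sketched in Python. This is an illustrative toy, not the paper's code: instructions are strings, registers match r<digit>, and numeric literals stand in for constants.

```python
import re

def canonicalize(seq):
    """Rename registers/constants in order of first appearance: r0, r1, ... and c0, c1, ..."""
    reg_map, const_map = {}, {}

    def rename(tok):
        if re.fullmatch(r"r\d+", tok):
            return reg_map.setdefault(tok, f"r{len(reg_map)}")
        if re.fullmatch(r"\d+", tok):
            return const_map.setdefault(tok, f"c{len(const_map)}")
        return tok

    out = []
    for insn in seq:
        op, _, operands = insn.partition(" ")
        toks = [rename(t.strip()) for t in operands.split(",")] if operands else []
        out.append(op + (" " + ", ".join(toks) if toks else ""))
    return out
```

Any two sequences that differ only by a consistent renaming map to the same canonical form, which is exactly why the enumerator need only consider canonical sequences.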
Design of the Optimizer – Fingerprinting(1/3) We execute I on test machine states and then compute a hash of the result, which we call I's fingerprint. The fingerprints index a hash table; each bucket holds the target instruction sequences that share a fingerprint.
Design of the Optimizer – Fingerprinting(2/3) Testvectors Each bit in the two testvectors is set randomly, but the same testvectors are used for fingerprinting every instruction sequence. The machine is loaded with a testvector and control is transferred to the instruction sequence.
Design of the Optimizer – Fingerprinting(3/3) Minimal collisions: the hash should be asymmetric with respect to different memory locations and registers, and it should not be based on a single operator (like xor). A sequence with r distinct registers and c distinct constants can generate at most r!*c! fingerprints. Typically r≤5 and c≤2, so the blow-up is upper-bounded by 5!*2! = 240.
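A minimal sketch of the fingerprinting idea, under the assumption that an instruction sequence can be modeled as a Python function from a register tuple to a register tuple. The position-dependent weight makes the hash asymmetric across registers rather than a single xor, as the slides recommend; a real implementation would also hash memory state and mask to a fixed width.

```python
TESTVECTORS = [(3, 1, 4, 1), (2, 7, 1, 8)]  # fixed; reused for every sequence

def fingerprint(run):
    """Hash the machine states a sequence produces on the shared testvectors."""
    h = 0
    for vec in TESTVECTORS:
        state = run(vec)
        for i, val in enumerate(state):
            # weight each register position differently so that
            # permuted states do not trivially collide
            h = h * 31 + (i + 1) * val
    return h
```

Because the same testvectors are used everywhere, equivalent sequences always land in the same hash bucket, and the (rare) collisions between inequivalent ones are caught later by the equivalence test.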
Design of the Optimizer – Enumerate(1/6) Instruction sequences are enumerated from a subset of all instructions. To bound the search space, we restrict the maximum number of distinct registers and constants that can appear in an enumerable instruction sequence. Constants: 0, 1, c0, c1. Number of registers: 4.
Design of the Optimizer – Enumerate(2/6)
Design of the Optimizer – Enumerate(3/6)
Design of the Optimizer – Enumerate(4/6) Enumerator’s search space is exponential in the length of the instruction sequence. Two techniques to reduce the size Enumerate only canonical instruction sequences. Prune the search space by identifying and eliminating instructions that are functionally equivalent to other cheaper instructions.
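The first pruning technique (enumerate only canonical sequences) can be sketched as follows. This is a toy model over two registers and two opcodes, with names of my own choosing, just to show how the canonical-only filter cuts the space.

```python
from itertools import product

# toy instruction set over two registers
INSTRUCTIONS = [f"{op} r{a}, r{b}"
                for op in ("mov", "add") for a in range(2) for b in range(2)]

def is_canonical(seq):
    """The n-th distinct register to appear must be named r(n-1)."""
    seen = []
    for insn in seq:
        for tok in insn.replace(",", " ").split():
            if tok.startswith("r") and tok[1:].isdigit() and tok not in seen:
                if tok != f"r{len(seen)}":
                    return False
                seen.append(tok)
    return True

def enumerate_sequences(max_len):
    for n in range(1, max_len + 1):
        for seq in product(INSTRUCTIONS, repeat=n):
            if is_canonical(seq):
                yield seq
```

Even in this tiny setting the filter halves the length-1 candidates (4 of 8 survive); with the paper's 4 registers and larger instruction set the savings compound exponentially with sequence length.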
Design of the Optimizer – Enumerate(5/6) Rather than hand-coding rules to avoid redundant sequences such as { mov r0, r1; mov r0, r1 }, such special cases are weeded out automatically through fingerprinting and equivalence checks.
Design of the Optimizer – Enumerate(6/6)
Design of the Optimizer – Equivalence test The equivalence test proceeds in two steps: a fast but incomplete execution test, and a slower but exact boolean test. Execution test We run the two sequences over a set of testvectors and observe if they yield the same output on each test. Boolean test The boolean verification test represents an instruction sequence by a boolean formula and expresses the equivalence relation as a satisfiability constraint. The satisfiability constraint is tested using a SAT solver.
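The two-step structure can be sketched in Python. Sequences are again modeled as functions on small register tuples, and the exact step here simply tries every 4-bit state exhaustively as a stand-in for the paper's SAT-based boolean test; the shape of the check (cheap filter first, exact check only on survivors) is the point.

```python
from itertools import product

TESTVECTORS = [(3, 1), (2, 7), (0, 15)]  # a few fixed machine states

def execution_test(f, g):
    """Fast but incomplete: compare outputs on a handful of testvectors."""
    return all(f(v) == g(v) for v in TESTVECTORS)

def exact_test(f, g):
    """Exact stand-in for the SAT-based boolean test: try every 4-bit state."""
    return all(f(v) == g(v) for v in product(range(16), repeat=2))

def equivalent(f, g):
    # the cheap test rejects almost all inequivalent pairs before
    # the expensive exact test ever runs
    return execution_test(f, g) and exact_test(f, g)
```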
Design of the Optimizer – Boolean test All memory writes are stored in a table in order of their occurrence. (addr1 = addr2) => (data1 = data2) Each read-access R is checked for address-equivalence with each of the preceding write accesses Wi in decreasing order of i, where Wi is the i-th write access by the instruction sequence.
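Operationally, the read-resolution rule on this slide amounts to the following sketch (in the real system this logic is encoded symbolically in the SAT formula, not executed concretely; this concrete version is only illustrative):

```python
def resolve_read(addr, writes, initial_memory):
    """writes: list of (address, data) pairs in order of occurrence."""
    # scan preceding writes in decreasing order of occurrence: the most
    # recent write to a matching address supplies the read's value
    for w_addr, w_data in reversed(writes):
        if w_addr == addr:
            return w_data
    # no matching write: the value comes from the initial memory state
    return initial_memory.get(addr, 0)
```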
Design of the Optimizer – Optimization database The optimization database records all optimizations discovered by the superoptimizer. The database is indexed by the original instruction sequence (in its canonical form) and the set of live registers. Each entry stores the corresponding optimal sequence, if one exists.
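The indexing scheme can be sketched as a dictionary keyed on the canonical sequence plus the live-register set. The names here are illustrative, not taken from the paper's implementation.

```python
database = {}

def record(original, live_regs, optimal):
    """Store an optimization under (canonical sequence, live-register set)."""
    database[(tuple(original), frozenset(live_regs))] = optimal

def lookup(original, live_regs):
    """Return the optimal replacement, or None if none was discovered."""
    return database.get((tuple(original), frozenset(live_regs)))

# example: the redundant-mov rule from the introduction
record(["mov r0, r1", "mov r1, r0"], {"r0", "r1"}, ["mov r0, r1"])
```

Using frozenset for the live registers makes lookups order-independent, mirroring the fact that liveness is a set, not a sequence.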
Experimental Results GCC 3.2.3 -O2, zChaff SAT solver, Intel Pentium 3.0GHz processor, 100 gigabytes local storage, Linux machine. Peephole size: 3-instruction sequences. Optimization windows: 6-instruction sequences.
Experimental Results Improvements of between 1.7 and 10 times over already-optimized code
Experimental Results Five optimizations were used more than 1,000 times each; in total over 600 distinct optimizations were used at least once each on these benchmarks.
Experimental Results Codesize cost function Simply considers the size of the executable binary code of a sequence as its cost. Runtime cost function Number of memory accesses Branch instructions Approximate cycle costs
Conclusion The target sequences are extracted, or harvested, from a training set of programs. The idea is that the important sequences to optimize are the ones emitted by compilers. Our prototype implementation handles nearly all of the 300+ opcodes of the x86 architecture. We introduce a new technique, canonicalization: we need never consider a sequence that is equal up to consistent renaming of registers and symbolic constants.
Thank  You !
