Automatic Generation of Peephole Superoptimizers Speaker: Shuai-wei Huang Advisor: Wuu Yang Sorav Bansal and Alex Aiken Computer System Lab, Stanford University ASPLOS'06
Contents 1. Introduction 2. Design of the Optimizer 3. Experimental Results 4. Conclusion
Introduction – Peephole Optimizer Peephole optimizers are pattern matching systems that replace one sequence of instructions by another equivalent, but faster, sequence of instructions. The optimizations are usually expressed as parameterized replacement rules, so that, for example, mov r1, r2; mov r2, r1 => mov r1, r2. Peephole optimizers are typically constructed using human-written pattern matching rules, which requires time and is less systematic.
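The slide's example rule can be illustrated with a toy Python sketch. This is not the paper's implementation: instructions are modeled as simple (opcode, dest, src) tuples, and the rule assumes no intervening side effects (e.g. flags), which a real peephole optimizer would have to check.

```python
def apply_rule(seq):
    """Drop the redundant second mov in 'mov X, Y; mov Y, X' pairs."""
    out, i = [], 0
    while i < len(seq):
        if (i + 1 < len(seq)
                and seq[i][0] == "mov" and seq[i + 1][0] == "mov"
                and seq[i][1:] == (seq[i + 1][2], seq[i + 1][1])):
            out.append(seq[i])  # 'mov Y, X' writes back an unchanged value
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out
```

The rule fires only when the second mov's operands are exactly the first mov's operands swapped, matching the slide's parameterized pattern.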
Introduction  – Superoptimization Automatically discover replacement rules that are optimizations. Optimizations are computed off-line and then presented as an indexed structure for efficient lookup. The optimizations are then organized into a lookup table mapping original sequences to their optimized counterparts.
Introduction  – Superoptimization The goals in this paper are considerably more modest, focusing on showing that an automatically constructed peephole optimizer is possible and,  even with limited resources  (i.e., a single machine) and learning hundreds to thousands of useful optimizations, such an optimizer can find significant speedups that standard optimizers miss.
Introduction – Terminology problem The classical meaning of superoptimization is to find the optimal code sequence for a single, loop-free assembly sequence of instructions, which we call the target sequence. The term superoptimization is used to distinguish this search for optimal sequences from garden-variety optimization as that term is normally used.
Introduction  – Related work Massalin Simply enumerates sequences of instructions of increasing length, testing each for equality with the target sequence; the lowest cost equivalent sequence found is the optimal one. Denali Constrains the  search space  to a set of equality-preserving transformations expressed by the system designer.  For a given target sequence, a structure representing all possible equivalent sequences under the transformation rules is searched for the lowest cost equivalent sequence.
Design of the Optimizer - Term Instruction   : an opcode together with some valid operands. A potential problem arises with opcodes that take immediate operands, as they generate a huge number of instructions. Restrict immediate operands to a small set of constants and symbolic constants. Cost function  : different cost functions for different purposes Running time to optimize speed Instruction byte count to optimize the size of a binary
Design of the Optimizer - Term Equivalence of two instruction sequences depends on the context: live registers, stack locations, and memory locations. (For implementation simplicity, we currently conservatively assume memory and stack locations are always live.) The equivalence test ≡L tests two instruction sequences for equivalence under the context (set of live registers) L. For a target sequence T and a cost function c, we are interested in finding a minimum cost instruction sequence O such that O ≡L T.
Design of the Optimizer - Flowchart
Design of the Optimizer – Structure Harvester Extracts target instruction sequences from the training applications. The target instruction sequences are the ones we seek to optimize. Enumerator Exhaustively enumerates all possible candidate instruction sequences up to a certain length.  Checking if each candidate sequence is an optimal replacement for any of the target instruction sequences. Optimization database An index of all discovered optimizations
Design of the Optimizer – Harvester Harvesting Target Instruction Sequences 1. Obtain target instruction sequences from a representative set of applications. 2. These harvested instruction sequences form the corpus used to train the optimizer. 3. A harvestable instruction sequence I must have a single entry point. 4. Records the set of live registers.
Design of the Optimizer – Canonicalization(1/3) On a machine with 8 registers, an instruction  mov r1, r0 has 8*7 = 56 equivalent versions with different register names. Canonicalization  : eliminate all unnecessary instruction sequences that are mere renamings of others.
Design of the Optimizer – Canonicalization(2/3) An instruction sequence is  canonical  if its registers and constants are named in the order of their appearance in the instruction sequence. The first register used is always r0, the second distinct register used is always r1, and so on. Similarly, the first constant used in a canonical instruction sequence is c0, the second distinct constant c1, and so on.
Design of the Optimizer – Canonicalization(3/3)
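The canonicalization rule from the previous slides can be sketched in Python. This is an illustrative toy, not the paper's code: instructions are strings, registers match r<digit>, and numeric literals stand in for constants.

```python
import re

def canonicalize(seq):
    """Rename registers/constants in order of first appearance: r0, r1, ... and c0, c1, ..."""
    reg_map, const_map = {}, {}

    def rename(tok):
        if re.fullmatch(r"r\d+", tok):
            return reg_map.setdefault(tok, f"r{len(reg_map)}")
        if re.fullmatch(r"\d+", tok):
            return const_map.setdefault(tok, f"c{len(const_map)}")
        return tok

    out = []
    for insn in seq:
        op, _, operands = insn.partition(" ")
        toks = [rename(t.strip()) for t in operands.split(",")] if operands else []
        out.append(op + (" " + ", ".join(toks) if toks else ""))
    return out
```

Any two sequences that differ only by a consistent renaming map to the same canonical form, which is exactly why the enumerator need only consider canonical sequences.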
Design of the Optimizer – Fingerprinting(1/3) We execute I on test machine states and then compute a hash of the result, which we call I's fingerprint. The fingerprints index a hash table; each bucket holds the target instruction sequences that share a fingerprint.
Design of the Optimizer – Fingerprinting(2/3) Testvectors Each bit in the two testvectors is set randomly, but the same testvectors are used for fingerprinting every instruction sequence. The machine is loaded with a testvector and control is transferred to the instruction sequence.
Design of the Optimizer – Fingerprinting(3/3) Minimal collisions: the hash should be asymmetric with respect to different memory locations and registers, and it should not be based on a single operator (like xor). A sequence with r distinct registers and c distinct constants can generate at most r!*c! fingerprints. Typically r≤5 and c≤2, so the blow-up is upper-bounded by 5!*2! = 240.
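A minimal sketch of the fingerprinting idea, under the assumption that an instruction sequence can be modeled as a Python function from a register tuple to a register tuple. The position-dependent weight makes the hash asymmetric across registers rather than a single xor, as the slides recommend; a real implementation would also hash memory state and mask to a fixed width.

```python
TESTVECTORS = [(3, 1, 4, 1), (2, 7, 1, 8)]  # fixed; reused for every sequence

def fingerprint(run):
    """Hash the machine states a sequence produces on the shared testvectors."""
    h = 0
    for vec in TESTVECTORS:
        state = run(vec)
        for i, val in enumerate(state):
            # weight each register position differently so that
            # permuted states do not trivially collide
            h = h * 31 + (i + 1) * val
    return h
```

Because the same testvectors are used everywhere, equivalent sequences always land in the same hash bucket, and the (rare) collisions between inequivalent ones are caught later by the equivalence test.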
Design of the Optimizer – Enumerate(1/6) Instruction sequences are enumerated from a subset of all instructions. To bound the search space, we restrict the maximum number of distinct registers and constants that can appear in an enumerable instruction sequence. Constants: 0, 1, c0, c1. Number of registers: 4.
Design of the Optimizer – Enumerate(2/6)
Design of the Optimizer – Enumerate(3/6)
Design of the Optimizer – Enumerate(4/6) Enumerator’s search space is exponential in the length of the instruction sequence. Two techniques to reduce the size Enumerate only canonical instruction sequences. Prune the search space by identifying and eliminating instructions that are functionally equivalent to other cheaper instructions.
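The first pruning technique (enumerate only canonical sequences) can be sketched as follows. This is a toy model over two registers and two opcodes, with names of my own choosing, just to show how the canonical-only filter cuts the space.

```python
from itertools import product

# toy instruction set over two registers
INSTRUCTIONS = [f"{op} r{a}, r{b}"
                for op in ("mov", "add") for a in range(2) for b in range(2)]

def is_canonical(seq):
    """The n-th distinct register to appear must be named r(n-1)."""
    seen = []
    for insn in seq:
        for tok in insn.replace(",", " ").split():
            if tok.startswith("r") and tok[1:].isdigit() and tok not in seen:
                if tok != f"r{len(seen)}":
                    return False
                seen.append(tok)
    return True

def enumerate_sequences(max_len):
    for n in range(1, max_len + 1):
        for seq in product(INSTRUCTIONS, repeat=n):
            if is_canonical(seq):
                yield seq
```

Even in this tiny setting the filter halves the length-1 candidates (4 of 8 survive); with the paper's 4 registers and larger instruction set the savings compound exponentially with sequence length.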
Design of the Optimizer – Enumerate(5/6) Rather than hand-coding rules to avoid redundant sequences such as { mov r0, r1; mov r0, r1 }, such special cases are weeded out automatically through fingerprinting and equivalence checks.
Design of the Optimizer – Enumerate(6/6)
Design of the Optimizer – Equivalence test The equivalence test proceeds in two steps: a fast but incomplete execution test, and a slower but exact boolean test. Execution test We run the two sequences over a set of testvectors and observe if they yield the same output on each test. Boolean test The boolean verification test represents an instruction sequence by a boolean formula and expresses the equivalence relation as a satisfiability constraint. The satisfiability constraint is tested using a SAT solver.
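The two-step structure can be sketched in Python. Sequences are again modeled as functions on small register tuples, and the exact step here simply tries every 4-bit state exhaustively as a stand-in for the paper's SAT-based boolean test; the shape of the check (cheap filter first, exact check only on survivors) is the point.

```python
from itertools import product

TESTVECTORS = [(3, 1), (2, 7), (0, 15)]  # a few fixed machine states

def execution_test(f, g):
    """Fast but incomplete: compare outputs on a handful of testvectors."""
    return all(f(v) == g(v) for v in TESTVECTORS)

def exact_test(f, g):
    """Exact stand-in for the SAT-based boolean test: try every 4-bit state."""
    return all(f(v) == g(v) for v in product(range(16), repeat=2))

def equivalent(f, g):
    # the cheap test rejects almost all inequivalent pairs before
    # the expensive exact test ever runs
    return execution_test(f, g) and exact_test(f, g)
```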
Design of the Optimizer – Boolean test All memory writes are stored in a table in order of their occurrence. (addr1 = addr2) => (data1 = data2) Each read-access R is checked for address-equivalence with each of the preceding write accesses Wi in decreasing order of i, where Wi is the i-th write access by the instruction sequence.
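Operationally, the read-resolution rule on this slide amounts to the following sketch (in the real system this logic is encoded symbolically in the SAT formula, not executed concretely; this concrete version is only illustrative):

```python
def resolve_read(addr, writes, initial_memory):
    """writes: list of (address, data) pairs in order of occurrence."""
    # scan preceding writes in decreasing order of occurrence: the most
    # recent write to a matching address supplies the read's value
    for w_addr, w_data in reversed(writes):
        if w_addr == addr:
            return w_data
    # no matching write: the value comes from the initial memory state
    return initial_memory.get(addr, 0)
```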
Design of the Optimizer – Optimization database The optimization database records all optimizations discovered by the superoptimizer. The database is indexed by the original instruction sequence (in its canonical form) and the set of live registers. Each entry stores the corresponding optimal sequence, if one exists.
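The indexing scheme can be sketched as a dictionary keyed on the canonical sequence plus the live-register set. The names here are illustrative, not taken from the paper's implementation.

```python
database = {}

def record(original, live_regs, optimal):
    """Store an optimization under (canonical sequence, live-register set)."""
    database[(tuple(original), frozenset(live_regs))] = optimal

def lookup(original, live_regs):
    """Return the optimal replacement, or None if none was discovered."""
    return database.get((tuple(original), frozenset(live_regs)))

# example: the redundant-mov rule from the introduction
record(["mov r0, r1", "mov r1, r0"], {"r0", "r1"}, ["mov r0, r1"])
```

Using frozenset for the live registers makes lookups order-independent, mirroring the fact that liveness is a set, not a sequence.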
Experimental Results GCC 3.2.3 -O2, zChaff SAT solver, Intel Pentium 3.0GHz processor, 100 gigabytes local storage, Linux machine. Peephole size: 3-instruction sequences. Optimization windows: 6-instruction sequences.
Experimental Results Improvements of between 1.7 and 10 times over already-optimized code
Experimental Results Five optimizations were used more than 1,000 times each; in total over 600 distinct optimizations were used at least once each on these benchmarks.
Experimental Results Codesize cost function Simply considers the size of the executable binary code of a sequence as its cost. Runtime cost function Number of memory accesses Branch instructions Approximate cycle costs
Conclusion The target sequences are extracted, or harvested, from a training set of programs. The idea is that the important sequences to optimize are the ones emitted by compilers. Our prototype implementation handles nearly all of the 300+ opcodes of the x86 architecture. We introduce a new technique, canonicalization: we need never consider a sequence that is equal up to consistent renaming of registers and symbolic constants.
Thank  You !
