Edgar Barbosa
  H2HC 2011
São Paulo - Brazil
Who am I?
                      
 Edgar Barbosa
 Senior Security Researcher at COSEINC (Singapore)
 One of the developers of Blue Pill, a hardware-based
  virtualization rootkit. Also presented a way to detect this type
  of rootkit.
 Discovered the Windows kernel KdVersionBlock data structure
  used by some forensic tools.
 Focus: RCE, Windows Internals, Virtualization and Program
  Analysis.
 Currently working on the COSEINC SMT Project, which aims
  to automate the bug finding process with the help of SMT
  solvers. The current presentation is part of the research done for
  the SMT project.
Control Flow Analysis
 Control Flow Analysis (CFA)
 Static analysis technique to discover the hierarchical flow of
  control within a procedure (function).
 Analysis of all possible execution paths inside a program or
  procedure.
 Represents the control structure of the procedure using
  Control Flow Graphs.
 Compiler theory - optimization
 The focus of this presentation is to demonstrate CFA for
  Reverse Code Engineering, where the source code isn’t
  available.
RCE and CFA
           
[Figure: Executable (binary format) -> Disassembler -> Extract control flow information -> Control Flow Graph]
What is a CFG?
                
 A Control Flow Graph (CFG) is a directed graph
  G = (V, E) which consists of a set of vertices (nodes) V
  and a set of edges E, which indicate possible flow of
  control between nodes.
 Put differently, it is a directed graph that represents a superset of
  all possible execution paths of a procedure.
 Graph nodes represent objects called Basic Blocks
  (BB).
CFG
Nodes
         
Edges
[Figure: two directed edges, each labelled with its tail node and its head node]
CFG
Edges
         
BinNavi
             
 Views
 Nodes
 Edges
CFG properties
                
 In the CFA literature the algorithms assume the following
  CFG properties:
    Unique Start node (Entry node)
     All the nodes of the graph must be reachable from the START node.
    Unique Exit node
 Real-world:
    Easy to find multiple exit nodes (RETURN) on the
     disassembly of a function
 Create a new exit node, add it to the graph and modify
  the return instructions to jump to the new node.
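The exit-node fix above can be sketched on a CFG stored as a successor dictionary. This is a minimal illustration, not any tool's implementation: the representation, the function name `add_unique_exit` and the node name `EXIT` are assumptions, and return blocks are taken to be the nodes without successors.

```python
def add_unique_exit(succ, exit_name="EXIT"):
    """Return a copy of the CFG in which every return block jumps to one new exit node.

    succ maps each node to the list of its successors (hypothetical representation).
    Nodes without successors are treated as blocks ending in a return.
    """
    if exit_name in succ:
        raise ValueError("exit name already used by a node")
    new_succ = {node: list(targets) for node, targets in succ.items()}
    for node, targets in succ.items():
        if not targets:
            new_succ[node].append(exit_name)  # the return now "jumps" to EXIT
    new_succ[exit_name] = []
    return new_succ


cfg = {"entry": ["a", "b"], "a": ["ret1"], "b": ["ret2"], "ret1": [], "ret2": []}
unified = add_unique_exit(cfg)
print(unified)
```

After the call, `EXIT` is the only node without successors, which is the property the CFA algorithms expect.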
BB identification
                
 In general, the problem of discovering all the
  possible execution paths of a code is undecidable. (cf.
  Halting problem).
 The first step for CFG reconstruction is to identify all the
  basic blocks.
 A basic block is a maximal sequence of instructions
  that can be entered only at the first of them and
  exited only from the last of them
Basic Block (BB)
       
Basic Blocks
                      
 First instruction of a BB (the leader instruction):
   1.   The entry point of the routine
   2.   The target of a branch instruction
   3.   The instruction immediately following a branch
 Although CALL is a branch instruction, the target
  function is assumed to always return and therefore it is
  allowed in the middle of a BB.
 To build the BBs we need to identify all the leader
  instructions. This requires the disassembly of the
  instructions.
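The three leader rules can be sketched as follows. The instruction format (address, mnemonic, branch target) and the mnemonics `jmp`, `jcc`, `ret` and `call` are simplifications assumed for this example; a real disassembler would supply them.

```python
BRANCHES = {"jmp", "jcc", "ret"}  # CALL is deliberately not a block terminator


def find_leaders(instrs):
    """instrs: list of (address, mnemonic, branch_target_or_None), sorted by address."""
    leaders = {instrs[0][0]}                      # rule 1: entry point
    for i, (_addr, mnem, target) in enumerate(instrs):
        if mnem in BRANCHES:
            if target is not None:
                leaders.add(target)               # rule 2: target of a branch
            if i + 1 < len(instrs):
                leaders.add(instrs[i + 1][0])     # rule 3: instruction after a branch
    return leaders


def basic_blocks(instrs):
    """Split the instruction list into basic blocks at every leader."""
    leaders = find_leaders(instrs)
    blocks, current = [], []
    for ins in instrs:
        if ins[0] in leaders and current:
            blocks.append(current)
            current = []
        current.append(ins)
    if current:
        blocks.append(current)
    return blocks


code = [
    (0x00, "mov", None),
    (0x02, "call", 0x100),   # stays in the middle of the first block
    (0x07, "cmp", None),
    (0x09, "jcc", 0x10),
    (0x0B, "inc", None),
    (0x0C, "jmp", 0x07),
    (0x10, "ret", None),
]
for block in basic_blocks(code):
    print([hex(addr) for addr, _, _ in block])
```

The toy routine splits into four blocks starting at 0x0, 0x7, 0xb and 0x10; the `call` does not end its block, matching the assumption that the callee returns.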
 Two disassembly algorithms
1. Linear Sweep
                  
 A linear sweep algorithm starts with the first byte in the
  code section and proceeds by decoding each byte until an
  illegal instruction is encountered[a]




>> 8B FF 55 8B EC 8B 45 08

8B FF         mov    edi, edi
55            push   ebp
8B EC         mov    ebp, esp
8B 45 08      mov    eax, [ebp+8]
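A linear sweep over these bytes can be sketched with a toy decoder. The decoder below is an assumption made for illustration: it understands only the opcodes that appear on the slides (55, 8B with two ModRM forms, EB), and treats everything else as undecodable, which is far less than a real x86 decoder handles.

```python
REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def decode(code, off):
    """Toy decoder: return (length, text), or None if the bytes are not understood."""
    op = code[off]
    if op == 0x55:                                        # push ebp
        return 1, "push ebp"
    if op == 0xEB and off + 1 < len(code):                # jmp short rel8
        rel = int.from_bytes(code[off + 1:off + 2], "little", signed=True)
        return 2, f"jmp short {off + 2 + rel:#x}"
    if op == 0x8B and off + 1 < len(code):                # mov r32, r/m32
        modrm = code[off + 1]
        mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
        if mod == 3:                                      # register operand
            return 2, f"mov {REGS[reg]}, {REGS[rm]}"
        if mod == 1 and rm != 4 and off + 2 < len(code):  # [reg + disp8]
            disp = int.from_bytes(code[off + 2:off + 3], "little", signed=True)
            return 3, f"mov {REGS[reg]}, [{REGS[rm]}{disp:+d}]"
    return None


def linear_sweep(code):
    """Decode instructions back to back until something cannot be decoded."""
    off, listing = 0, []
    while off < len(code):
        decoded = decode(code, off)
        if decoded is None:
            listing.append((off, "???"))
            break
        length, text = decoded
        listing.append((off, text))
        off += length
    return listing


prologue = linear_sweep(bytes.fromhex("8B FF 55 8B EC 8B 45 08"))
for off, text in prologue:
    print(f"{off:#04x}  {text}")
```

Offsets are relative to the start of the buffer, since the slide does not give a load address.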
2. Recursive Traversal
             
 Linear sweep algorithm doesn’t take into account the
  control flow behaviour of some instructions.
>> EB 01 FF 8B 45 FC

 EB 01      jmp short 0x401020
 FF         ???     ;invalid
 Recursive traversal disassemblers interpret branch
  instructions in the program to translate only those
  bytes which can actually be reached by control flow.   [b]
2. Recursive Traversal
           
EB 01 FF 8B 45 FC



EB 01    jmp short 0x401020
FF       ???   (UNREACHABLE)
8B 45 FC mov eax, dword ptr ss:[ebp-4]
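The same byte sequence can be run through a minimal recursive traversal sketch. As before, the decoder is a toy that knows only the opcodes on the slides (an assumption for brevity), and it models only unconditional short jumps; a fuller version would push both the target and the fall-through address for conditional branches.

```python
REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def decode(code, off):
    """Toy decoder: return (length, text, jump_target_or_None), or None if unknown."""
    op = code[off]
    if op == 0x55:                                        # push ebp
        return 1, "push ebp", None
    if op == 0xEB and off + 1 < len(code):                # jmp short rel8
        rel = int.from_bytes(code[off + 1:off + 2], "little", signed=True)
        target = off + 2 + rel
        return 2, f"jmp short {target:#x}", target
    if op == 0x8B and off + 1 < len(code):                # mov r32, r/m32
        modrm = code[off + 1]
        mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
        if mod == 3:
            return 2, f"mov {REGS[reg]}, {REGS[rm]}", None
        if mod == 1 and rm != 4 and off + 2 < len(code):
            disp = int.from_bytes(code[off + 2:off + 3], "little", signed=True)
            return 3, f"mov {REGS[reg]}, [{REGS[rm]}{disp:+d}]", None
    return None


def recursive_traversal(code, entry=0):
    """Decode only the offsets that control flow can reach from entry."""
    listing, worklist = {}, [entry]
    while worklist:
        off = worklist.pop()
        while 0 <= off < len(code) and off not in listing:
            decoded = decode(code, off)
            if decoded is None:
                listing[off] = "???"
                break
            length, text, target = decoded
            listing[off] = text
            if target is not None:        # unconditional jump: follow the target only
                worklist.append(target)
                break
            off += length
    return sorted(listing.items())


reached = recursive_traversal(bytes.fromhex("EB 01 FF 8B 45 FC"))
for off, text in reached:
    print(f"{off:#04x}  {text}")
```

The byte FF at offset 2 is never decoded because no control flow reaches it, so the `mov` behind it is recovered.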
State-of-the-art CFG reconstruction
 Once the basic blocks are identified, the CFG is completed by
  adding the edges between them.
 CFG construction is especially difficult when the
  code includes indirect calls (e.g. call dword ptr [eax]).
 A state-of-the-art CFG reconstruction tool is the open-source
  Jakstab (Java Toolkit for Static Analysis of Binaries)
  from Johannes Kinder.
 Provides better results than IDA Pro.
Jakstab   [d]




   
Self-modifying code
        
     Control Flow Analysis
Self-modifying code
              
 Consider the following example (not real x86 opcodes)
                                   [c]



      Address      Assembly               Binary
      0x0          movb 0xc 0x8           c6 0c 08
      0x3          inc %ebx               40 01
      0x5          movb 0xc 0x5           c6 0c 05
      0x8          inc %edx               40 03
      0xa          push %ecx              ff 02
      0xc          dec %ebx               48 01
 Running a linear sweep or recursive traversal algorithm on
  the above code would result in a single Basic Block (single
  entry/single exit/no branches).
SMC
Address   CFG 1            CFG 2            CFG 3
0x0       movb 0xc 0x8     movb 0xc 0x8     movb 0xc 0x8
0x3       inc %ebx         inc %ebx         inc %ebx
0x5       movb 0xc 0x5     movb 0xc 0x5     jmp 0xc
0x8       inc %edx         jmp 0x3          jmp 0x3
0xa       push %ecx        push %ecx        push %ecx
0xc       dec %ebx         dec %ebx         dec %ebx
SE-CFG
                    
 State-Enhanced Control Flow Graph (SE-CFG)
 CFG augmented with extensions to support SMC.
 Allows the use of control flow analysis algorithms
  for SMC.
 “A Model for Self-Modifying Code”
 Codebyte extensions – Codebyte conditional edges
 Implemented in a link-time binary rewriter: Diablo.
 It can be downloaded from
   http://www.elis.ugent.be/diablo
SMC - CFG
[Figure: SE-CFG of the example. Nodes: movb 0xc 0x8; inc %ebx; movb 0xc 0x5 alongside jmp 0xc; inc %edx alongside jmp 0x3; push %ecx; dec %ebx]
Dominators
   
 Control Flow Analysis
Dominance relation
            
 Relation about the nodes of a control flow graph.
 “Node A dominates Node B if every path from the
  entry node to B includes A”.
 Representation: A dom B
 Properties:
    Antisymmetric (if A dom B and B dom A, then A = B)
   Reflexive (A dom A)
   Transitive (If A dom B and B dom C then A dom C)
 Can be represented by a tree, the Dominator Tree.
Control Flow Graph
                 
[Figure: example control flow graph with its entry node and exit node marked]
Dominator Tree
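A minimal sketch of how the dominance relation and the dominator tree can be computed. It uses the simple iterative set-based algorithm rather than Lengauer-Tarjan, which is faster but considerably longer. The successor-dictionary representation, the function names and the example graph are assumptions, and every node is assumed reachable from the entry.

```python
def dominators(succ, entry):
    """Return {node: set of nodes that dominate it} by fixed-point iteration."""
    nodes = list(succ)
    pred = {n: [] for n in nodes}
    for n, targets in succ.items():
        for t in targets:
            pred[t].append(n)
    dom = {n: set(nodes) for n in nodes}   # start from "everything dominates"
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            if n == entry:
                continue
            incoming = [dom[p] for p in pred[n]]
            new = {n} | (set.intersection(*incoming) if incoming else set())
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def immediate_dominators(dom, entry):
    """The parent of n in the dominator tree is its closest strict dominator."""
    idom = {}
    for n, ds in dom.items():
        if n != entry:
            # strict dominators form a chain; the deepest one has the largest set
            idom[n] = max(ds - {n}, key=lambda d: len(dom[d]))
    return idom


cfg = {
    "entry": ["a"],
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": ["a", "exit"],
    "exit": [],
}
dom = dominators(cfg, "entry")
idom = immediate_dominators(dom, "entry")
print(idom)
```

In this diamond-shaped example neither `b` nor `c` dominates `d`, so the parent of `d` in the dominator tree is `a`.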
     
Implementations
              
 Classic reference:
    Lengauer-Tarjan algorithm
 Boost C++ library
 Immunity Debugger
    libcontrolflow.py
      Class DominatorTree
 BinNavi API
    GraphAlgorithms getDominatorTree()
    DEMO: Gui plugin
Natural loops
                 
 We can use the Dominator Tree to identify loops.
 Locate the back edges
 Back edge:
   An edge whose head dominates its tail.
 A loop consists of all nodes dominated by its entry node
   (the head of the back edge) from which the entry node
   can be reached.
 These loops are named Natural Loops.
[Figure: a natural loop with its loop header and back edge marked]
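Back-edge detection and natural-loop collection can be sketched as below. This is an illustrative implementation under assumptions (successor-dictionary CFG, hypothetical function names, all nodes reachable from the entry), not the code of findloop.py. The loop body is gathered by walking predecessors backwards from the tail of the back edge until the header is reached.

```python
def predecessors(succ):
    pred = {n: [] for n in succ}
    for n, targets in succ.items():
        for t in targets:
            pred[t].append(n)
    return pred


def dominators(succ, entry):
    """Iterative set-based dominator computation."""
    pred = predecessors(succ)
    dom = {n: set(succ) for n in succ}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in succ:
            if n == entry:
                continue
            incoming = [dom[p] for p in pred[n]]
            new = {n} | (set.intersection(*incoming) if incoming else set())
            if new != dom[n]:
                dom[n], changed = new, True
    return dom


def natural_loops(succ, entry):
    """Return {(tail, head): set of loop nodes} for every back edge tail -> head."""
    dom, pred = dominators(succ, entry), predecessors(succ)
    loops = {}
    for tail, heads in succ.items():
        for head in heads:
            if head in dom[tail]:              # head dominates tail: back edge
                body, stack = {head}, [tail]
                while stack:
                    n = stack.pop()
                    if n not in body:
                        body.add(n)
                        stack.extend(pred[n])  # walk backwards, stopping at head
                loops[(tail, head)] = body
    return loops


cfg = {
    "entry": ["h1"],
    "h1": ["h2"],
    "h2": ["x"],
    "x": ["h2", "y"],     # x -> h2 closes the inner loop
    "y": ["h1", "exit"],  # y -> h1 closes the outer loop
    "exit": [],
}
loops = natural_loops(cfg, "entry")
for edge, body in loops.items():
    print(edge, sorted(body))
```

The example contains two nested loops, and each back edge yields its own loop body, with the inner body contained in the outer one.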
ImmunityDbg !findloop
         
         ImmDbg\PyCommands\findloop.py
Strongly connected
    components
        
     Control Flow Analysis
SCC
                        
 SCC  Strongly connected components
 A graph (directed/undirected) is called strongly
  connected if there is a path from each vertex to every
  other vertex
 Any loop is a strongly connected component
SCC
[Figure: directed graph with nodes a, b, c, d, e, f] This graph is not strongly connected.
SCC
[Figure: the same graph with one subgraph marked "SCC"] But it contains a subgraph which is strongly connected.
SCC - algorithms
               
 Tarjan's algorithm
    fast (a single depth-first pass), but more complex
 Kosaraju-Sharir algorithm
    simple, but slower than Tarjan’s algorithm (two passes)
 Implementations are available for many languages:
    C#/Python/Lua/Ruby/Java
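As an illustration of the simpler of the two, here is a minimal Kosaraju-Sharir sketch. The successor-dictionary representation and the example graph are assumptions, and the recursive first pass is only suitable for small graphs because of Python's recursion limit.

```python
def kosaraju_scc(succ):
    """Return the strongly connected components of a directed graph as a list of sets."""
    order, seen = [], set()

    def visit(n):                      # pass 1: record nodes in DFS finishing order
        seen.add(n)
        for t in succ[n]:
            if t not in seen:
                visit(t)
        order.append(n)

    for n in succ:
        if n not in seen:
            visit(n)

    pred = {n: [] for n in succ}       # the reversed graph
    for n, targets in succ.items():
        for t in targets:
            pred[t].append(n)

    components, assigned = [], set()
    for n in reversed(order):          # pass 2: search the reversed graph
        if n in assigned:
            continue
        comp, stack = set(), [n]
        while stack:
            m = stack.pop()
            if m not in assigned:
                assigned.add(m)
                comp.add(m)
                stack.extend(pred[m])
        components.append(comp)
    return components


cfg = {
    "entry": ["h1"],
    "h1": ["h2"],
    "h2": ["x"],
    "x": ["h2", "y"],
    "y": ["h1", "exit"],
    "exit": [],
}
components = kosaraju_scc(cfg)
print([sorted(c) for c in components])
```

The example graph holds an inner loop (h2, x) nested in an outer loop (h1, h2, x, y); both end up in a single component.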
Interval Analysis
       
    Control Flow Analysis
Regions and intervals
            
 Unfortunately, SCC detection alone can’t tell nested loops apart:
  an outer loop and the loops nested in it fall into the same component.
 Interval Analysis
    Divides the CFG into regions and consolidates them into
     new nodes (abstract nodes), resulting in an abstract flowgraph.
 We need to identify regions and pre-intervals
 Region:
    A region in a flow graph is a subgraph H with a unique
     entry node h
 Pre-Interval:
    A pre-interval in a flow graph is a region <H,h> such that
     every cycle (loop) in H includes the header h.
 Similar to a unique entry SCC.
Nested Intervals
T1/T2 transformations
           
 Reduction of graphs
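 We can collapse the nodes of a region into a single
  node using the T1/T2 transformations: T1 removes a self-loop
  edge, and T2 merges a node that has a single predecessor into
  that predecessor. Applied repeatedly to a graph whose loops are
  all natural loops, they yield a cycle-free graph.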
 Cycle-free graphs are easier to analyze.
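The reduction can be sketched as below, under the usual textbook definitions: T1 deletes an edge from a node to itself, and T2 merges a non-entry node with exactly one predecessor into that predecessor. The successor-dictionary representation, the function name and the example graph are assumptions made for this illustration.

```python
def t1_t2_reduce(succ, entry):
    """Apply T1 and T2 until neither applies; return the remaining graph."""
    g = {n: set(targets) for n, targets in succ.items()}
    changed = True
    while changed:
        changed = False
        for n in list(g):
            if n not in g:
                continue                  # already merged away during this pass
            if n in g[n]:                 # T1: remove a self-loop
                g[n].discard(n)
                changed = True
            if n == entry:
                continue
            preds = [p for p in g if n in g[p]]
            if len(preds) == 1:           # T2: merge n into its only predecessor
                p = preds[0]
                g[p].discard(n)
                g[p] |= g.pop(n)          # p inherits the successors of n
                changed = True
    return g


cfg = {
    "entry": ["h1"],
    "h1": ["h2"],
    "h2": ["x"],
    "x": ["h2", "y"],     # inner loop
    "y": ["h1", "exit"],  # outer loop
    "exit": [],
}
reduced = t1_t2_reduce(cfg, "entry")
print(reduced)
```

For this graph with two nested natural loops the reduction ends with the entry node alone and no edges left.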
Interval analysis
            
 DEMO
GOTO considered
               harmful…
                       




http://xkcd.com/292/
Irreducible graphs
               
 All the loops identified by the previous methods
  (dominator tree/interval analysis) are called natural
  loops.
 They are unique entry loops.
 There is another type of loop, with more than one entry:
     these loops make the graph irreducible (improper regions)
Irreducible graph
[Figure: flow graph with entry node e, nodes s, a and b, and return node r]
Loop (a, b): 2 entries! b or a
Irreducible graphs
              
 Who codes like that?
   Anyone who uses GOTO
   It is rare, but it does exist
      notepad.exe
      ntoskrnl.exe (Windows Kernel)
 What’s the problem?
   Most of the algorithms are unable to handle
    irreducible graphs!!! Including Interval analysis.
   Can’t apply T1/T2
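One way to test for this situation is the textbook characterisation of reducibility: a flow graph is reducible when deleting every back edge (an edge whose head dominates its tail) leaves an acyclic graph. The sketch below applies it. The function names and the graph representation are assumptions, and the edges of the example are a guess at the two-entry loop from the figure, not a transcription of it.

```python
def dominators(succ, entry):
    """Iterative set-based dominator computation."""
    pred = {n: [] for n in succ}
    for n, targets in succ.items():
        for t in targets:
            pred[t].append(n)
    dom = {n: set(succ) for n in succ}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in succ:
            if n == entry:
                continue
            incoming = [dom[p] for p in pred[n]]
            new = {n} | (set.intersection(*incoming) if incoming else set())
            if new != dom[n]:
                dom[n], changed = new, True
    return dom


def is_reducible(succ, entry):
    """True if the graph without its back edges is acyclic."""
    dom = dominators(succ, entry)
    forward = {n: [t for t in ts if t not in dom[n]] for n, ts in succ.items()}
    indegree = {n: 0 for n in succ}
    for targets in forward.values():
        for t in targets:
            indegree[t] += 1
    ready = [n for n in succ if indegree[n] == 0]
    removed = 0
    while ready:                          # topological sort (Kahn's algorithm)
        n = ready.pop()
        removed += 1
        for t in forward[n]:
            indegree[t] -= 1
            if indegree[t] == 0:
                ready.append(t)
    return removed == len(succ)           # leftover nodes sit on a cycle


# Assumed edges for a loop (a, b) that can be entered at a or at b.
two_entry_loop = {"e": ["s"], "s": ["a", "b"], "a": ["b"], "b": ["a", "r"], "r": []}
single_entry_loop = {"e": ["h"], "h": ["x"], "x": ["h", "r"], "r": []}
print(is_reducible(two_entry_loop, "e"), is_reducible(single_entry_loop, "e"))
```

In the two-entry loop neither a nor b dominates the other, so no edge of the cycle qualifies as a back edge and the cycle survives the deletion.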
TranslateString
int *__stdcall TranslateString(int a1)
{
    wchar_t v1; // cx@1
    …
    if ( v1 )
    {
        while ( 1 )
        {
            v5 = &v22 + v26;
            …
LABEL_49:                       // jump inside the WHILE statement
            v1 = *(_WORD *)v7;
            …
        }
    }
    goto LABEL_49;
}
Solutions
                    
 There are 2 main solutions to handle irreducible
  graphs:
    Structural Analysis
    DJ-Graphs
Structural Analysis
        
     Control Flow Analysis
Structural Analysis
              
 Structural analysis will identify the main language
  constructs inside a flow graph using region schemas.
 Do you want to build your own decompiler?
    Hex-Rays decompiler internally uses Structural
     Analysis
 Created by Micha Sharir
 Reference paper:
    Structural analysis: a new approach to flow analysis in
     optimizing compilers (1979)
Acyclic schemas
       
Cyclic schemas
       
DJ-Graphs
                    
 Another way to handle irreducible graphs.
 It is also able to identify all types of structures,
  including improper regions and nested structures.
 Combines the dominator tree with the original flowgraph,
  giving a graph with two types of edges:
    D edges (dominator tree edges)
    J edges (join edges: flowgraph edges that are not
     dominator tree edges)
 Paper: Identifying loops using DJ graphs.[e]
DJ-Graphs
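A minimal sketch of DJ-graph construction, under these assumptions: D edges are the dominator tree edges, J edges are the flowgraph edges that are not dominator tree edges, and a J edge x -> y is classified as a back-join (BJ) edge when y dominates x and as a cross-join (CJ) edge otherwise. The function names, the representation and the example edges are hypothetical, and the loop-identification step of the paper is not shown.

```python
def dominators(succ, entry):
    """Iterative set-based dominator computation (all nodes assumed reachable)."""
    pred = {n: [] for n in succ}
    for n, targets in succ.items():
        for t in targets:
            pred[t].append(n)
    dom = {n: set(succ) for n in succ}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in succ:
            if n == entry:
                continue
            incoming = [dom[p] for p in pred[n]]
            new = {n} | (set.intersection(*incoming) if incoming else set())
            if new != dom[n]:
                dom[n], changed = new, True
    return dom


def dj_graph(succ, entry):
    """Return (D edges, back-join edges, cross-join edges) as sets of (from, to)."""
    dom = dominators(succ, entry)
    idom = {n: max(dom[n] - {n}, key=lambda d: len(dom[d]))
            for n in succ if n != entry}
    d_edges = {(parent, n) for n, parent in idom.items()}
    j_edges = {(x, y) for x, ys in succ.items() for y in ys
               if (x, y) not in d_edges}
    bj_edges = {(x, y) for x, y in j_edges if y in dom[x]}
    cj_edges = j_edges - bj_edges
    return d_edges, bj_edges, cj_edges


# Assumed edges: a loop (a, b) with two entries, reached from s.
cfg = {"e": ["s"], "s": ["a", "b"], "a": ["b"], "b": ["a", "r"], "r": []}
d_edges, bj_edges, cj_edges = dj_graph(cfg, "e")
print(sorted(d_edges), sorted(bj_edges), sorted(cj_edges))
```

For the two-entry loop both cycle edges come out as cross-join edges between siblings in the dominator tree, which is the kind of pattern a dominator-only back-edge search misses.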
    
Applications
                  
 Taint analysis
    Control dependency (dominators, post-dominators)
 Diff Slicing
    Execution Indexing (view the CFG as a grammar)
      Execution alignment
    Identification of root causes of software crashes
 Decompilation
 Code coverage
 Bug finding
References
                             
 a - http://www.usenix.org/event/usenix03/tech/full_papers/prasad/prasad_html/node5.html
 b - An Abstract Interpretation-Based Framework for Control Flow Reconstruction
  from Binaries
 c - Bertrand Anckaert, Matias Madou, and Koen De Bosschere. 2006. A model
  for self-modifying code. In Proceedings of the 8th international conference on
  Information hiding (IH'06)
 d - http://www.jakstab.org/
 e - Vugranam C. Sreedhar, Guang R. Gao, and Yong-Fong Lee. 1996. Identifying
  loops using DJ graphs. ACM Trans. Program. Lang. Syst. 18, 6 (November 1996),
  649-658.
 f - Advanced Compiler Design and Implementation - Steven Muchnick
 g - Notes on Graph Algorithms Used in Optimizing Compilers - Carl D. Offner
Questions?
                 
 Contact: edgarmb@gmail.com
            edgar@research.coseinc.com
 twitter: @embarbosa
