Fall 2012




            Thanks to Prof. Kim
• Discuss Lab 3

• Dealing with Branches

• Mid semester survey
• Most branches are biased

• Interference in PHT entries
  – Constructive (T+T, or N+N)
  – Destructive (T+N, or N+T)

• Agree predictor
  – PHT entries record whether a branch agrees with its
    bias direction, not taken/not-taken (most entries will agree)
  – Reduces destructive interference in the PHT
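
The agree idea can be sketched in C as follows. This is a minimal illustration, not the lecture's design: the table size, the PC-only index function, the 2-bit saturating counters, and all function names are assumptions, and the bias bit is simply passed in by the caller.

```c
#include <stdbool.h>
#include <stdint.h>

#define PHT_SIZE 16u  /* assumed table size, for illustration only */

/* 2-bit saturating counters; values 2..3 mean "agrees with its bias". */
static uint8_t pht[PHT_SIZE];

/* Hypothetical index function: low PC bits only (no history, for brevity). */
static unsigned pht_index(uint32_t pc) { return (pc >> 2) % PHT_SIZE; }

/* Start every entry at "weakly agree", since most branches follow their bias. */
void agree_init(void) {
    for (unsigned i = 0; i < PHT_SIZE; i++) pht[i] = 2;
}

/* The counter says agree/disagree; the bias bit supplies the direction. */
bool agree_predict(uint32_t pc, bool bias_taken) {
    bool agree = pht[pht_index(pc)] >= 2;
    return agree ? bias_taken : !bias_taken;
}

/* Train on whether the outcome matched the bias, not on the direction itself. */
void agree_update(uint32_t pc, bool bias_taken, bool taken) {
    uint8_t *c = &pht[pht_index(pc)];
    if (taken == bias_taken) { if (*c < 3) (*c)++; }
    else                     { if (*c > 0) (*c)--; }
}
```

With this indexing, a taken-biased branch at 0x40 and a not-taken-biased branch at 0x80 share entry 0. Both push the counter toward "agree", so what would be T+N interference in a direction-based PHT becomes constructive.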
• Instructions are predicated
  -> Depending on the predicate value the
    instruction is valid or becomes a No-op.

  (p) add R1 = R2 + R3

            p          Effect of (p) add R1 = R2 + R3
          TRUE         R1 <- R2 + R3
          FALSE        No-op
if (a == 0) {         Set p = (a == 0)
  b = 1;
}                     (p)  b = 1
else {
  b = 0;              (!p) b = 0
}
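Source:
  if (cond) {
       b = 0;
  }
  else {
       b = 1;
  }

Control flow:
  (normal branch code)             (predicated code)
  A branches to C if taken (T)     A, B, C, D execute in
  or falls through to B if not     straight-line order,
  taken (N); B and C join at D     with no branch

Code:
       (normal branch code)             (predicated code)
  A    p1 = (cond)                 A    p1 = (cond)
       branch p1, TARGET
  B    mov b, 1                    B    (!p1) mov b, 1
       jmp JOIN
  C  TARGET:                       C    (p1) mov b, 0
       mov b, 0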
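
The if-conversion above can be mimicked in C. This is only a sketch under assumptions: the function names are hypothetical, the predicate is modeled as a 0/1 integer, and a real compiler may or may not turn the selects into branch-free code.

```c
/* Branching version: exactly one of the two paths (block B or C) executes. */
int select_branch(int cond) {
    int b;
    if (cond) b = 0;
    else      b = 1;
    return b;
}

/* Predicated-style version: both guarded moves are always "issued";
 * the predicate decides which one actually changes b. */
int select_predicated(int cond) {
    int p1 = (cond != 0);   /* A: p1 = (cond)            */
    int b = 0;              /* dummy start value, always overwritten */
    b = !p1 ? 1 : b;        /* B: (!p1) mov b, 1         */
    b =  p1 ? 0 : b;        /* C: (p1)  mov b, 0         */
    return b;
}
```

Both functions return the same value for every input; the second one replaces the control dependency on `cond` with a data dependency on `p1`.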
• Eliminate branch mispredictions
  – Convert control dependency to data
    dependency
• Increase compiler’s optimization
  opportunities
  – Trace scheduling, bigger basic blocks,
    instruction re-ordering
  – SIMD (Nvidia G80), vector processing
• Uses more machine resources
  – Fetch more instructions (both paths are fetched)
  – Occupy useful resources (ROB, scheduler, ...)
• ISA should support predicated execution
  – (ISA), predicate registers
  – x86: CMOV (conditional move)
• In OOO, supporting predicated execution is
  harder
  – Three input sources
  – Dependent instructions cannot be executed
    until the predicate is resolved.
• Conditional move
  – The simplest form of predicated execution
  – Destination must be a register; there is no
    conditional store to memory
  – E.g., CMOVA r16, r/m16 (move if above: CF=0 and
    ZF=0)
• Full predication support
  – Only IA-64 (later lecture)
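
The CMOVA condition can be made concrete with a small C model. This is a sketch, not x86 code: `cmova16` and `umax16` are hypothetical names, and the flags are computed by hand to mirror what an unsigned `CMP a, b` would set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Models the effect of CMOVA dst, src for given CF and ZF flags:
 * dst is overwritten only when CF == 0 and ZF == 0 ("above"). */
uint16_t cmova16(uint16_t dst, uint16_t src, bool cf, bool zf) {
    return (!cf && !zf) ? src : dst;
}

/* Unsigned max written as compare + conditional move.
 * CMP a, b computes a - b: CF is the borrow (a < b), ZF means a == b,
 * so CF == 0 and ZF == 0 holds exactly when a > b. */
uint16_t umax16(uint16_t a, uint16_t b) {
    bool cf = a < b;
    bool zf = a == b;
    return cmova16(b, a, cf, zf);  /* result = b; if (a > b) result = a */
}
```

For example, `umax16(3, 7)` and `umax16(7, 3)` both return 7 without any data-dependent branch in the model.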
• When to use predicated execution?
  – Hard to predict?
  – Short branches?
  – Compiler optimization benefit?
• Who should decide it?
• Applicable to all branches?
  – Loop, function calls, indirect branches …
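
One way to reason about the first two questions is a back-of-the-envelope cost model. The model below is an assumption for illustration, not something from the slides: all costs are in cycles, the function names are hypothetical, and it ignores scheduling and resource effects.

```c
/* Expected cost of the branching version: the path actually taken,
 * plus the expected misprediction penalty. */
double branch_cost(double p_taken, double cost_taken, double cost_not_taken,
                   double mispredict_rate, double mispredict_penalty) {
    return p_taken * cost_taken + (1.0 - p_taken) * cost_not_taken
         + mispredict_rate * mispredict_penalty;
}

/* Cost of the predicated version: both paths are always fetched and executed. */
double predicated_cost(double cost_taken, double cost_not_taken) {
    return cost_taken + cost_not_taken;
}

/* Returns 1 when predication is expected to be cheaper under this model. */
int prefer_predication(double p_taken, double cost_taken, double cost_not_taken,
                       double mispredict_rate, double mispredict_penalty) {
    return predicated_cost(cost_taken, cost_not_taken)
         < branch_cost(p_taken, cost_taken, cost_not_taken,
                       mispredict_rate, mispredict_penalty);
}
```

Under these assumptions, a short, hard-to-predict branch (2-cycle paths, 30% mispredictions, 20-cycle penalty) favors predication, while a long, well-predicted one (20-cycle paths, 1% mispredictions) does not.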
• Transforms an M-iteration loop into
  a loop with M/N iterations
    – We say that the loop has been unrolled N
      times
   Original:                  Unrolled 4 times:

   for(i=0;i<100;i++)         for(i=0;i<100;i+=4){
     a[i]*=2;                   a[i]*=2;
                                a[i+1]*=2;
                                a[i+2]*=2;
                                a[i+3]*=2;
                              }

Some compilers can do this (gcc -funroll-loops),
or you can do it manually (as in the unrolled version).
• Less loop overhead
  for(i=0;i<100;i++)         for(i=0;i<100;i+=4){
    a[i] += 2;                 a[i]   += 2;
                               a[i+1] += 2;
                               a[i+2] += 2;
                               a[i+3] += 2;
                             }

  How many branches?
  – Fewer branch predictions
  – Fewer instructions
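
To answer "how many branches?" concretely, the sketch below runs both versions and counts one loop branch per trip around the loop. The counters and function names are assumptions added for illustration; they are not part of the original loops.

```c
#define N_ELEMS 100

/* Original loop: one loop branch per element. */
int add2_rolled(int *a) {
    int branches = 0;
    for (int i = 0; i < N_ELEMS; i++) {
        a[i] += 2;
        branches++;          /* one compare-and-branch per trip */
    }
    return branches;
}

/* Unrolled 4 times: one loop branch per four elements. */
int add2_unrolled4(int *a) {
    int branches = 0;
    for (int i = 0; i < N_ELEMS; i += 4) {
        a[i]     += 2;
        a[i + 1] += 2;
        a[i + 2] += 2;
        a[i + 3] += 2;
        branches++;          /* one compare-and-branch per trip */
    }
    return branches;
}
```

Both functions leave the array in the same state; the rolled version reports 100 loop branches and the unrolled version 25.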
• Allows better scheduling of instructions

  Original (3 iterations shown)      Unrolled 4 times (1 iteration)

  R2 = R3 * #4                       R2 = R3 * #4
  R2 = R2 + #a                       R2 = R2 + #a
  R1 = LOAD 0[R2]                    R1 = LOAD 0[R2]
  R1 = R1 + #2                       R1 = R1 + #2
  STORE R1 -> 0[R2]                  STORE R1 -> 0[R2]
  R3 = R3 + 1                        R1 = LOAD 4[R2]
  BLT R3, 100, #top                  R1 = R1 + #2
                                     STORE R1 -> 4[R2]
  R2 = R3 * #4                       R1 = LOAD 8[R2]
  R2 = R2 + #a                       R1 = R1 + #2
  R1 = LOAD 0[R2]                    STORE R1 -> 8[R2]
  R1 = R1 + #2                       R1 = LOAD 12[R2]
  STORE R1 -> 0[R2]                  R1 = R1 + #2
  R3 = R3 + 1                        STORE R1 -> 12[R2]
  BLT R3, 100, #top                  R3 = R3 + 4
                                     BLT R3, 100, #top
  R2 = R3 * #4
  R2 = R2 + #a
  R1 = LOAD 0[R2]
  R1 = R1 + #2
  STORE R1 -> 0[R2]
  R3 = R3 + 1
  BLT R3, 100, #top
• Get rid of small loops
      for(i=0;i<4;i++)             a[0]*=2;
        a[i]*=2;                   a[1]*=2;
                                   a[2]*=2;
                                   a[3]*=2;

  Loop form: for(0), for(1), for(2), for(3) are separate blocks
    – Difficult to schedule/hoist insts from bottom block to
      top block due to branches
  Fully unrolled form:
    – Easier: no branches in the way
• Instruction size is larger (code bloat)
• What if M is not a multiple of N?
  – Or if M is not known at compile time?
  – Or if it is a while loop?

   for(i=0;i<j;i++)           j1=j-j%4;
     a[i]*=2;                 for(i=0;i<j1;i+=4){
                                a[i]*=2;
                                a[i+1]*=2;
                                a[i+2]*=2;
                                a[i+3]*=2;
                              }
                              for(i=j1;i<j;i++)
                                a[i]*=2;
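
The same remainder handling can be packaged as a function for a trip count known only at run time. This is a minimal sketch; the function name and the use of `size_t` are assumptions made for illustration.

```c
#include <stddef.h>

/* Doubles a[0..n-1], unrolled 4 times, for any run-time n. */
void double_all_unrolled4(int *a, size_t n) {
    size_t n4 = n - n % 4;   /* largest multiple of 4 that is <= n (j1 above) */
    size_t i;

    /* Main unrolled loop: handles elements in groups of four. */
    for (i = 0; i < n4; i += 4) {
        a[i]     *= 2;
        a[i + 1] *= 2;
        a[i + 2] *= 2;
        a[i + 3] *= 2;
    }

    /* Remainder loop: at most 3 leftover iterations. */
    for (; i < n; i++)
        a[i] *= 2;
}
```

The remainder loop keeps the result correct when n is not a multiple of 4, at the cost of extra code and a few non-unrolled iterations.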

Predication
