OPTIMIZATION TECHNIQUES
BY
ENG. JOUD KHATTAB
TABLE OF CONTENTS
• Introduction:
• Scheduling, Instruction Level Parallelism (ILP), Dependency.
• Loop Optimization:
• Definition, Loop Transformation, Superscalar.
• Software Pipelining:
• Definition, Example.
• Out-Of-Order Execution:
• Definition, In-Order VS. Out-Of-Order, Example.
BY JOUD KHATTAB
WHY OPTIMIZATION TECHNIQUES?
PROCESSING SPEED
WHAT IS SCHEDULING?
• Scheduling is the ordering of program execution so as to improve
performance without affecting program correctness.
• Compiler-based scheduling is also known as static scheduling, provided the
hardware does not subsequently reorder the instruction sequence produced by
the compiler.
SOFTWARE-BASED SCHEDULING VS.
HARDWARE-BASED SCHEDULING
• Unlike hardware-based approaches, the compiler can afford to perform a more detailed
analysis of the instruction sequence.
• But: there will be a significant number of cases where not enough information can be
extracted from the instruction sequence statically to perform an optimization:
• Do two pointers point to the same memory location?
• What is the upper bound on the induction variable of a loop?
• Still: we can exploit characteristics of the underlying architecture to increase
performance.
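As a minimal sketch of the aliasing question above (our own example, not from the slides): if two pointers may refer to the same location, the scheduler cannot statically reorder the store and the load.

```c
/* Hypothetical illustration: the compiler may not move the load of *q above
   the store to *p unless it can prove p and q never alias. */
int store_then_load(int *p, int *q) {
    *p = 1;        /* store */
    return *q;     /* load: observes the store whenever p == q */
}
```

Called as store_then_load(&x, &x), the load must see the freshly stored 1; with two distinct objects it sees the old value, so the legal reorderings differ per call site and cannot always be decided statically.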
ARCHITECTURE OF A TYPICAL OPTIMIZING
COMPILER
FE → O(1) → O(2) → … → O(N-1) → O(N) → BE
• Front end (FE): checks syntax and semantics; translates a high-level
language (Pascal, C, Fortran, ...) into an intermediate representation (IR),
e.g. ASTs, three-address code, or DAGs.
• Middle end (O(1) … O(N)): performs optimization, passing intermediate
results I(1) … I(N-1) between passes to produce optimized IR.
• Back end (BE): emits target-architecture machine code (machine language).
WHAT ARE SOME TYPICAL OPTIMIZATIONS?
HIGH LEVEL OPTIMIZATIONS
• Optimizations that are very likely to improve
performance, but do not generally depend
on the target architecture:
• Scalar Replacement of Aggregates.
• Data-Cache Optimizations.
• Procedure Integration.
• Constant Propagation.
• Symbolic Substitution.
LOW LEVEL OPTIMIZATIONS
• Optimizations that are usually
very target-architecture-specific or very
low level:
• Prediction of Data & Control Flow.
• Loop Optimization.
• Software Pipelining.
• Out-Of-Order Execution.
INSTRUCTION LEVEL PARALLELISM (ILP)
• Identify independent instructions within a sequence so that they can be executed in parallel.
• ILP implies parallelism across a sequence of instructions (a block). This could be a loop,
a conditional, or some other valid sequence of statements.
• There is an upper bound as to how much parallelism can be achieved.
• Dependencies within a sequence of instructions determine how much ILP is present:
• To what degree can we rearrange the instructions without compromising correctness?
DATA DEPENDENCY
• Flow dependency (RAW conflicts):
B + C → A
A + D → E
• Anti dependency (WAR conflicts):
A + C → B
E + D → A
• Output dependency (WAW conflicts):
B + C → A
E + D → A
• RAR (read after read) is not really a problem.
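The three conflict types above can be sketched as plain C (our own toy example; the variable names follow the slide):

```c
/* RAW (flow dependency): the second statement reads the A the first wrote,
   so the read must not be hoisted above the write. */
int flow_dependency(void) {
    int A, B = 1, C = 2, D = 3, E;
    A = B + C;      /* write A */
    E = A + D;      /* read A */
    return E;       /* 3 + 3 = 6 */
}

/* WAR (anti dependency): the second statement overwrites the A the first
   reads, so the write must not overtake the read. */
int anti_dependency(void) {
    int A = 3, C = 2, D = 3, E = 6, B;
    B = A + C;      /* read A */
    A = E + D;      /* write A */
    return B;       /* 3 + 2 = 5 */
}

/* WAW (output dependency): both statements write A; program order decides
   which write survives. */
int output_dependency(void) {
    int A, B = 1, C = 2, D = 3, E = 6;
    A = B + C;      /* first write */
    A = E + D;      /* last write wins */
    return A;       /* 6 + 3 = 9 */
}
```

Reordering any of these pairs would change the returned values, which is exactly why the scheduler must respect them.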
LOOP OPTIMIZATION
LOOP OPTIMIZATION
• Loop optimization is concerned with machine-independent code optimization:
• 90-10 rule: execution spends 90% of its time in 10% of the code.
• It is moderately easy to achieve 90% of the optimization; the remaining 10% is very difficult.
• Identifying the hot 10% of the code is not possible for a compiler – it is the job of a profiler.
• In general, loops are the hot spots.
LOOP TRANSFORMATIONS
• Loop optimization can be viewed as the application of a sequence of specific
loop transformations to the source code or intermediate representation.
• Fission (Distribution)
• Interchange (Permutation)
• Loop-invariant code motion
• Un-switching
• Unrolling
• Parallelization
LOOP TRANSFORMATIONS:
FISSION (DISTRIBUTION)
Breaks a loop into multiple loops over the same index range, each taking only
a part of the original loop's body.
This optimization is most effective on multi-core processors, which can run
the resulting loops as separate tasks.
Before:
int i, a[100], b[100];
for (i = 0; i < 100; i++) {
    a[i] = 1;
    b[i] = 2;
}

After:
int i, a[100], b[100];
for (i = 0; i < 100; i++) {
    a[i] = 1;
}
for (i = 0; i < 100; i++) {
    b[i] = 2;
}
LOOP TRANSFORMATIONS:
INTERCHANGE (PERMUTATION)
The process of exchanging the order of two iteration variables used by a nested loop.
It is often done to ensure that the elements of a multi-dimensional array are accessed
in the order in which they are present in memory, improving locality of reference.
Before:
for i from 0 to 10
    for k from 0 to 20
        a[i,k] = i + k

After:
for k from 0 to 20
    for i from 0 to 10
        a[i,k] = i + k
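One caveat worth adding: which nesting is faster depends on the language's memory layout, so the direction of the interchange above is illustrative rather than universally correct. C stores arrays in row-major order (the last index is contiguous), so for a C array the i-outer, k-inner order walks memory sequentially. A small sketch of our own:

```c
#define ROWS 10
#define COLS 20

/* Row-major layout: a[i][k] and a[i][k+1] are adjacent in memory, so the
   inner loop over k touches memory sequentially (good locality in C). */
int fill_row_major(void) {
    static int a[ROWS][COLS];
    for (int i = 0; i < ROWS; i++)
        for (int k = 0; k < COLS; k++)
            a[i][k] = i + k;
    return a[ROWS - 1][COLS - 1];   /* 9 + 19 = 28 */
}
```

In Fortran, which is column-major, the interchanged (k-outer) order would be the cache-friendly one instead.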
LOOP TRANSFORMATIONS:
LOOP-INVARIANT CODE MOTION
Loop-invariant code consists of statements or expressions that can be moved
outside the body of a loop without affecting the semantics of the program.
Loop-invariant code motion is a compiler optimization which performs this
movement automatically.
Before:
for (int i = 0; i < n; i++) {
    x = y + z;
    a[i] = 6 * i + x * x;
}

After:
x = y + z;
t1 = x * x;
for (int i = 0; i < n; i++) {
    a[i] = 6 * i + t1;
}
LOOP TRANSFORMATIONS:
UN-SWITCHING
Un-switching moves a conditional from inside a loop to outside of it by duplicating the
loop's body and placing a version of it inside each of the if and else clauses of the conditional.
While un-switching may double the amount of code written, each of these new
loops may now be separately optimized.
Before:
int i, w, x[1000], y[1000];
for (i = 0; i < 1000; i++) {
    x[i] += y[i];
    if (w)
        y[i] = 0;
}

After:
int i, w, x[1000], y[1000];
if (w) {
    for (i = 0; i < 1000; i++) {
        x[i] += y[i];
        y[i] = 0;
    }
} else {
    for (i = 0; i < 1000; i++) {
        x[i] += y[i];
    }
}
LOOP TRANSFORMATIONS:
UNROLLING
Loop unrolling attempts to optimize a program's execution speed at the expense of its binary
size, which is an approach known as the space-time tradeoff.
The goal of unrolling is to increase a program's speed by reducing or eliminating instructions
that control the loop, such as pointer arithmetic and "end of loop" tests on each iteration;
reducing branch penalties; as well as hiding latencies including the delay in reading data from
memory.
Before:
int x;
for (x = 0; x < 100; x++) {
    delete(x);
}

After:
int x;
for (x = 0; x < 100; x += 5) {
    delete(x);
    delete(x + 1);
    delete(x + 2);
    delete(x + 3);
    delete(x + 4);
}

(The x < 100 test is safe here because 100 is a multiple of the unroll
factor 5; an x != 100 test would loop forever if the bound were not.)
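Since delete(x) on the slide is pseudocode, here is a runnable variant of our own: a summation unrolled 4-way, with a remainder (cleanup) loop so the trip count need not be a multiple of the unroll factor:

```c
/* 4-way unrolled sum with a cleanup loop for leftover elements. */
int sum_unrolled(const int *a, int n) {
    int sum = 0, i = 0;
    for (; i + 4 <= n; i += 4) {    /* one loop test per four elements */
        sum += a[i];
        sum += a[i + 1];
        sum += a[i + 2];
        sum += a[i + 3];
    }
    for (; i < n; i++)              /* remainder when n % 4 != 0 */
        sum += a[i];
    return sum;
}
```

The main body now executes one compare-and-branch per four additions, which is exactly the loop-overhead reduction the slide describes.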
LOOP TRANSFORMATIONS:
PARALLELIZATION
Loop parallelism is concerned with extracting parallel tasks from loops.
The opportunity for it often arises where data is stored in random-access data
structures. Where a sequential program iterates over the data structure and
operates on indices one at a time, a program exploiting loop-level parallelism (LLP)
uses multiple threads or processes which operate on some or all of the indices at the same time.
Parallelizable (independent iterations):
for (int i = 0; i < n; i++) {
S1: L[i] = L[i] + 10;
}
// estimated time on n processors: T

Not parallelizable (loop-carried dependency on L[i - 1]):
for (int i = 1; i < n; i++) {
S1: L[i] = L[i - 1] + 10;
}
// estimated time: n * T
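To make the dependence concrete, here are both loops as plain sequential C (our own sketch; actual threading is omitted). The first loop's iterations commute, so they could be distributed across threads; the second's cannot, because iteration i reads the value iteration i-1 just wrote:

```c
/* Independent iterations: any execution order yields the same result,
   so the iterations could be split across threads. */
void bump_independent(int *L, int n) {
    for (int i = 0; i < n; i++)
        L[i] = L[i] + 10;
}

/* Loop-carried dependency: iteration i needs iteration i-1's result,
   so the iterations must run in program order. */
void bump_carried(int *L, int n) {
    for (int i = 1; i < n; i++)
        L[i] = L[i - 1] + 10;
}
```

In the second function each element ends up depending on L[0] through a chain of additions, which is precisely what forces sequential execution.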
SUPERSCALAR
The loop body sum += a[i--] compiles to:

loop: ld  r2, 10(r1)
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop

A superscalar processor overlaps the fetch and decode stages of the ld, add,
sub, and bne instructions, issuing several of them per cycle.

Can we do better?
SOFTWARE PIPELINING
SOFTWARE PIPELINING
• Software pipelining is an optimization that can improve the loop-execution
performance of any system that allows ILP, including superscalar architectures.
• It derives its performance gain by filling delays within each iteration of a
loop body with instructions from different iterations of that same loop.
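The same idea can be sketched in C (our own illustration; real software pipelining is performed by the compiler at the instruction level). The kernel below overlaps the load for iteration i+1 with the computation and store of iteration i, with an explicit prologue and epilogue:

```c
/* Software-pipelined sketch of "add a scalar s to every element":
   the prologue issues iteration 0's load; the kernel pairs iteration i's
   store with iteration i+1's load (hiding load latency); the epilogue
   finishes the final iteration. */
void add_scalar_pipelined(int *a, int n, int s) {
    if (n <= 0) return;
    int cur = a[0];                  /* prologue: first load */
    for (int i = 0; i + 1 < n; i++) {
        int next = a[i + 1];         /* load from the NEXT iteration ... */
        a[i] = cur + s;              /* ... overlapped with this iteration's store */
        cur = next;
    }
    a[n - 1] = cur + s;              /* epilogue: last store */
}
```

Each kernel iteration mixes work from two loop iterations, which is the source-level analogue of filling the stall slots shown in the next example.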
SOFTWARE PIPELINING EXAMPLE
• Consider the instruction sequence:
Loop: L.D    F0,0(R1)    ; F0 = array elem.
      ADD.D  F4,F0,F2    ; add scalar in F2
      S.D    F4,0(R1)    ; store result
      DADDUI R1,R1,#-8   ; decrement ptr
      BNE    R1,R2,Loop  ; branch if R1 != R2
SOFTWARE PIPELINING EXAMPLE
• It executes in the following sequence on the pipeline (10 cycles per iteration):

Cycle  Instruction
1      Loop: L.D    F0,0(R1)
2            stall
3            ADD.D  F4,F0,F2
4            stall
5            stall
6            S.D    F4,0(R1)
7            DADDUI R1,R1,#-8
8            stall
9            BNE    R1,R2,Loop
10           stall
SOFTWARE PIPELINING EXAMPLE
• Software pipelining eliminates the nop's by inserting instructions from different
iterations of the same loop body:

Loop: L.D    F0,0(R1)
      nop
      ADD.D  F4,F0,F2
      nop
      nop
      S.D    F4,0(R1)
      DADDUI R1,R1,#-8
      nop
      BNE    R1,R2,Loop
      nop

Insert instructions from different iterations to replace the nop's!
OUT-OF-ORDER EXECUTION
IN-ORDER VS. OUT-OF-ORDER
IN-ORDER EXECUTION
• Instructions are fetched, executed &
completed in compiler generated
order.
• If one instruction stalls, they all stall.
OUT-OF-ORDER EXECUTION
• Instructions are fetched in compiler-
generated order.
• Instruction completion may be in-order
or out-of-order; in between, instructions
may be executed in some other order.
• Independent instructions behind a
stalled instruction can pass it.
IN-ORDER VS. OUT-OF-ORDER
• In-Order processors:
• lw $3, 100($4) in execution, cache miss
• add $2, $3, $4 waits until the miss is satisfied
• sub $5, $6, $7 waits for the add
• Out-Of-Order processors:
• lw $3, 100($4) in execution, cache miss
• sub $5, $6, $7 can execute during the cache miss
• add $2, $3, $4 waits until the miss is satisfied
THANK YOU