OPTIMIZATION TECHNIQUES
BY
ENG. JOUD KHATTAB
TABLE OF CONTENTS
• Introduction:
• Scheduling, Instruction Level Parallelism (ILP), Dependency.
• Loop Optimization:
• Definition, Loop Transformation, Superscalar.
• Software Pipelining:
• Definition, Example.
• Out-Of-Order Execution:
• Definition, In-Order VS. Out-Of-Order, Example.
BY JOUD KHATTAB
WHY OPTIMIZATION TECHNIQUES?
PROCESSING SPEED
WHAT IS SCHEDULING?
• Scheduling is the ordering of program execution so as to improve
performance without affecting program correctness.
• Compiler-based scheduling is also known as static scheduling, provided the
hardware does not subsequently reorder the instruction sequence produced by
the compiler.
SOFTWARE-BASED SCHEDULING VS.
HARDWARE-BASED SCHEDULING
• Unlike hardware-based approaches, the compiler can afford to perform a more detailed
analysis of the instruction sequence.
• But: there will be a significant number of cases where not enough information can be
extracted from the instruction sequence statically to perform an optimization:
• Do two pointers point to the same memory location?
• What is the upper bound on the induction variable of a loop?
• Still: we can exploit characteristics of the underlying architecture to increase
performance.
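As a minimal sketch of the aliasing question above (our own example, not from the slides): if two pointers may refer to the same location, the scheduler cannot statically reorder the store and the load.

```c
/* Hypothetical illustration: the compiler may not move the load of *q above
   the store to *p unless it can prove p and q never alias. */
int store_then_load(int *p, int *q) {
    *p = 1;        /* store */
    return *q;     /* load: observes the store whenever p == q */
}
```

Called as store_then_load(&x, &x), the load must see the freshly stored 1; with two distinct objects it sees the old value, so the legal reorderings differ per call site and cannot always be decided statically.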
ARCHITECTURE OF A TYPICAL OPTIMIZING
COMPILER
FE → O(1) → O(2) → … → O(N-1) → O(N) → BE
• Front end (FE): checks syntax and semantics; translates a high-level
language (Pascal, C, Fortran, ...) into an intermediate representation (IR),
e.g. ASTs, three-address code, or DAGs.
• Middle end (O(1) … O(N)): performs optimization, passing intermediate
results I(1) … I(N-1) between passes to produce optimized IR.
• Back end (BE): emits target-architecture machine code (machine language).
WHAT ARE SOME TYPICAL OPTIMIZATIONS?
HIGH LEVEL OPTIMIZATIONS
• Optimizations that are very likely to improve
performance, but do not generally depend
on the target architecture:
• Scalar Replacement of Aggregates.
• Data-Cache Optimizations.
• Procedure Integration.
• Constant Propagation.
• Symbolic Substitution.
LOW LEVEL OPTIMIZATIONS
• Optimizations that are usually
very target-architecture-specific or very
low level:
• Prediction of Data & Control Flow.
• Loop Optimization.
• Software Pipelining.
• Out-Of-Order Execution.
INSTRUCTION LEVEL PARALLELISM (ILP)
• Identify independent instructions within a sequence so that they can be executed in parallel.
• ILP implies parallelism across a sequence of instructions (a block). This could be a loop,
a conditional, or some other valid sequence of statements.
• There is an upper bound as to how much parallelism can be achieved.
• Dependencies within a sequence of instructions determine how much ILP is present:
• To what degree can we rearrange the instructions without compromising correctness?
DATA DEPENDENCY
• Flow dependency (RAW conflicts):
B + C → A
A + D → E
• Anti dependency (WAR conflicts):
A + C → B
E + D → A
• Output dependency (WAW conflicts):
B + C → A
E + D → A
• RAR (read after read) is not really a problem.
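The three conflict types above can be sketched as plain C (our own toy example; the variable names follow the slide):

```c
/* RAW (flow dependency): the second statement reads the A the first wrote,
   so the read must not be hoisted above the write. */
int flow_dependency(void) {
    int A, B = 1, C = 2, D = 3, E;
    A = B + C;      /* write A */
    E = A + D;      /* read A */
    return E;       /* 3 + 3 = 6 */
}

/* WAR (anti dependency): the second statement overwrites the A the first
   reads, so the write must not overtake the read. */
int anti_dependency(void) {
    int A = 3, C = 2, D = 3, E = 6, B;
    B = A + C;      /* read A */
    A = E + D;      /* write A */
    return B;       /* 3 + 2 = 5 */
}

/* WAW (output dependency): both statements write A; program order decides
   which write survives. */
int output_dependency(void) {
    int A, B = 1, C = 2, D = 3, E = 6;
    A = B + C;      /* first write */
    A = E + D;      /* last write wins */
    return A;       /* 6 + 3 = 9 */
}
```

Reordering any of these pairs would change the returned values, which is exactly why the scheduler must respect them.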
LOOP OPTIMIZATION
LOOP OPTIMIZATION
• Loop optimization is concerned with machine-independent code optimization:
• 90-10 rule: execution spends 90% of its time in 10% of the code.
• It is moderately easy to achieve 90% of the optimization; the remaining 10% is very difficult.
• Identifying the hot 10% of the code is not possible for a compiler – it is the job of a profiler.
• In general, loops are the hot spots.
LOOP TRANSFORMATIONS
• Loop optimization can be viewed as the application of a sequence of specific
loop transformations to the source code or intermediate representation.
• Fission (Distribution)
• Interchange (Permutation)
• Loop-invariant code motion
• Un-switching
• Unrolling
• Parallelization
LOOP TRANSFORMATIONS:
FISSION (DISTRIBUTION)
Breaks a loop into multiple loops over the same index range, each taking only
a part of the original loop's body.
This optimization is most effective on multi-core processors, which can run
the resulting loops as separate tasks.
Before:
int i, a[100], b[100];
for (i = 0; i < 100; i++) {
    a[i] = 1;
    b[i] = 2;
}

After:
int i, a[100], b[100];
for (i = 0; i < 100; i++) {
    a[i] = 1;
}
for (i = 0; i < 100; i++) {
    b[i] = 2;
}
LOOP TRANSFORMATIONS:
INTERCHANGE (PERMUTATION)
The process of exchanging the order of two iteration variables used by a nested loop.
It is often done to ensure that the elements of a multi-dimensional array are accessed
in the order in which they are present in memory, improving locality of reference.
Before:
for i from 0 to 10
    for k from 0 to 20
        a[i,k] = i + k

After:
for k from 0 to 20
    for i from 0 to 10
        a[i,k] = i + k
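One caveat worth adding: which nesting is faster depends on the language's memory layout, so the direction of the interchange above is illustrative rather than universally correct. C stores arrays in row-major order (the last index is contiguous), so for a C array the i-outer, k-inner order walks memory sequentially. A small sketch of our own:

```c
#define ROWS 10
#define COLS 20

/* Row-major layout: a[i][k] and a[i][k+1] are adjacent in memory, so the
   inner loop over k touches memory sequentially (good locality in C). */
int fill_row_major(void) {
    static int a[ROWS][COLS];
    for (int i = 0; i < ROWS; i++)
        for (int k = 0; k < COLS; k++)
            a[i][k] = i + k;
    return a[ROWS - 1][COLS - 1];   /* 9 + 19 = 28 */
}
```

In Fortran, which is column-major, the interchanged (k-outer) order would be the cache-friendly one instead.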
LOOP TRANSFORMATIONS:
LOOP-INVARIANT CODE MOTION
Loop-invariant code consists of statements or expressions that can be moved
outside the body of a loop without affecting the semantics of the program.
Loop-invariant code motion is a compiler optimization which performs this
movement automatically.
Before:
for (int i = 0; i < n; i++) {
    x = y + z;
    a[i] = 6 * i + x * x;
}

After:
x = y + z;
t1 = x * x;
for (int i = 0; i < n; i++) {
    a[i] = 6 * i + t1;
}
LOOP TRANSFORMATIONS:
UN-SWITCHING
Un-switching moves a conditional from inside a loop to outside of it by duplicating the
loop's body and placing a version of it inside each of the if and else clauses of the conditional.
While un-switching may double the amount of code written, each of these new
loops may now be separately optimized.
Before:
int i, w, x[1000], y[1000];
for (i = 0; i < 1000; i++) {
    x[i] += y[i];
    if (w)
        y[i] = 0;
}

After:
int i, w, x[1000], y[1000];
if (w) {
    for (i = 0; i < 1000; i++) {
        x[i] += y[i];
        y[i] = 0;
    }
} else {
    for (i = 0; i < 1000; i++) {
        x[i] += y[i];
    }
}
LOOP TRANSFORMATIONS:
UNROLLING
Loop unrolling attempts to optimize a program's execution speed at the expense of its binary
size, which is an approach known as the space-time tradeoff.
The goal of unrolling is to increase a program's speed by reducing or eliminating instructions
that control the loop, such as pointer arithmetic and "end of loop" tests on each iteration;
reducing branch penalties; as well as hiding latencies including the delay in reading data from
memory.
Before:
int x;
for (x = 0; x < 100; x++) {
    delete(x);
}

After:
int x;
for (x = 0; x < 100; x += 5) {
    delete(x);
    delete(x + 1);
    delete(x + 2);
    delete(x + 3);
    delete(x + 4);
}

(The x < 100 test is safe here because 100 is a multiple of the unroll
factor 5; an x != 100 test would loop forever if the bound were not.)
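Since delete(x) on the slide is pseudocode, here is a runnable variant of our own: a summation unrolled 4-way, with a remainder (cleanup) loop so the trip count need not be a multiple of the unroll factor:

```c
/* 4-way unrolled sum with a cleanup loop for leftover elements. */
int sum_unrolled(const int *a, int n) {
    int sum = 0, i = 0;
    for (; i + 4 <= n; i += 4) {    /* one loop test per four elements */
        sum += a[i];
        sum += a[i + 1];
        sum += a[i + 2];
        sum += a[i + 3];
    }
    for (; i < n; i++)              /* remainder when n % 4 != 0 */
        sum += a[i];
    return sum;
}
```

The main body now executes one compare-and-branch per four additions, which is exactly the loop-overhead reduction the slide describes.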
LOOP TRANSFORMATIONS:
PARALLELIZATION
Loop parallelism is concerned with extracting parallel tasks from loops.
The opportunity for it often arises where data is stored in random-access data
structures. Where a sequential program iterates over the data structure and
operates on indices one at a time, a program exploiting loop-level parallelism (LLP)
uses multiple threads or processes which operate on some or all of the indices at the same time.
Parallelizable (independent iterations):
for (int i = 0; i < n; i++) {
S1: L[i] = L[i] + 10;
}
// estimated time on n processors: T

Not parallelizable (loop-carried dependency on L[i - 1]):
for (int i = 1; i < n; i++) {
S1: L[i] = L[i - 1] + 10;
}
// estimated time: n * T
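To make the dependence concrete, here are both loops as plain sequential C (our own sketch; actual threading is omitted). The first loop's iterations commute, so they could be distributed across threads; the second's cannot, because iteration i reads the value iteration i-1 just wrote:

```c
/* Independent iterations: any execution order yields the same result,
   so the iterations could be split across threads. */
void bump_independent(int *L, int n) {
    for (int i = 0; i < n; i++)
        L[i] = L[i] + 10;
}

/* Loop-carried dependency: iteration i needs iteration i-1's result,
   so the iterations must run in program order. */
void bump_carried(int *L, int n) {
    for (int i = 1; i < n; i++)
        L[i] = L[i - 1] + 10;
}
```

In the second function each element ends up depending on L[0] through a chain of additions, which is precisely what forces sequential execution.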
SUPERSCALAR
The loop body sum += a[i--] compiles to:

loop: ld  r2, 10(r1)
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop

A superscalar processor overlaps the fetch and decode stages of the ld, add,
sub, and bne instructions, issuing several of them per cycle.

Can we do better?
SOFTWARE PIPELINING
SOFTWARE PIPELINING
• Software pipelining is an optimization that can improve the loop-execution
performance of any system that allows ILP, including superscalar architectures.
• It derives its performance gain by filling delays within each iteration of a
loop body with instructions from different iterations of that same loop.
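The same idea can be sketched in C (our own illustration; real software pipelining is performed by the compiler at the instruction level). The kernel below overlaps the load for iteration i+1 with the computation and store of iteration i, with an explicit prologue and epilogue:

```c
/* Software-pipelined sketch of "add a scalar s to every element":
   the prologue issues iteration 0's load; the kernel pairs iteration i's
   store with iteration i+1's load (hiding load latency); the epilogue
   finishes the final iteration. */
void add_scalar_pipelined(int *a, int n, int s) {
    if (n <= 0) return;
    int cur = a[0];                  /* prologue: first load */
    for (int i = 0; i + 1 < n; i++) {
        int next = a[i + 1];         /* load from the NEXT iteration ... */
        a[i] = cur + s;              /* ... overlapped with this iteration's store */
        cur = next;
    }
    a[n - 1] = cur + s;              /* epilogue: last store */
}
```

Each kernel iteration mixes work from two loop iterations, which is the source-level analogue of filling the stall slots shown in the next example.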
SOFTWARE PIPELINING EXAMPLE
• Consider the instruction sequence:
Loop: L.D    F0,0(R1)    ; F0 = array elem.
      ADD.D  F4,F0,F2    ; add scalar in F2
      S.D    F4,0(R1)    ; store result
      DADDUI R1,R1,#-8   ; decrement ptr
      BNE    R1,R2,Loop  ; branch if R1 != R2
SOFTWARE PIPELINING EXAMPLE
• It executes in the following sequence on the pipeline (10 cycles per iteration):

Cycle  Instruction
1      Loop: L.D    F0,0(R1)
2            stall
3            ADD.D  F4,F0,F2
4            stall
5            stall
6            S.D    F4,0(R1)
7            DADDUI R1,R1,#-8
8            stall
9            BNE    R1,R2,Loop
10           stall
SOFTWARE PIPELINING EXAMPLE
• Software pipelining eliminates the nop's by inserting instructions from different
iterations of the same loop body:

Loop: L.D    F0,0(R1)
      nop
      ADD.D  F4,F0,F2
      nop
      nop
      S.D    F4,0(R1)
      DADDUI R1,R1,#-8
      nop
      BNE    R1,R2,Loop
      nop

Insert instructions from different iterations to replace the nop's!
OUT-OF-ORDER EXECUTION
IN-ORDER VS. OUT-OF-ORDER
IN-ORDER EXECUTION
• Instructions are fetched, executed &
completed in compiler generated
order.
• If one instruction stalls, they all stall.
OUT-OF-ORDER EXECUTION
• Instructions are fetched in compiler-
generated order.
• Instruction completion may be in-order
or out-of-order; in between, instructions
may be executed in some other order.
• Independent instructions behind a
stalled instruction can pass it.
IN-ORDER VS. OUT-OF-ORDER
• In-Order processors:
• lw $3, 100($4) in execution, cache miss
• add $2, $3, $4 waits until the miss is satisfied
• sub $5, $6, $7 waits for the add
• Out-Of-Order processors:
• lw $3, 100($4) in execution, cache miss
• sub $5, $6, $7 can execute during the cache miss
• add $2, $3, $4 waits until the miss is satisfied
THANK YOU