Superscalar and VLIW
    Architectures
Parallel processing [2]
Processing instructions in parallel requires
   three major tasks:
1. checking dependencies between
   instructions to determine which
   instructions can be grouped together for
   parallel execution;
2. assigning instructions to the functional
   units on the hardware;
3. determining when instructions are initiated
   or, in a VLIW machine, placed together into a single word.
Major categories [2]

   VLIW – Very Long Instruction Word
   EPIC – Explicitly Parallel Instruction Computing
Superscalar Processors [1]

    Superscalar processors are designed to exploit
     more instruction-level parallelism in user
     programs.
    Only independent instructions can be executed
     in parallel without causing a wait state.
    The amount of instruction-level parallelism
     varies widely depending on the type of code
     being executed.
Pipelining in Superscalar
Processors [1]
     In order to fully utilise a superscalar processor
      of degree m, m instructions must be executable
      in parallel. This situation may not be true in all
      clock cycles. In that case, some of the pipelines
      may be stalling in a wait state.
     In a superscalar processor, the simple
      operation latency should require only one cycle,
      as in the base scalar processor.
Superscalar Execution
Superscalar Implementation
   Simultaneously fetch multiple instructions
   Logic to determine true dependencies
    involving register values
   Mechanisms to communicate these values
   Mechanisms to initiate multiple instructions in
    parallel
   Resources for parallel execution of multiple
    instructions
   Mechanisms for committing process state in
    correct order
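Two of these mechanisms, fetching multiple instructions at once and checking true (read-after-write) dependencies before issuing them together, can be illustrated with a toy model. The instruction encoding and register names below are invented for the sketch; real hardware also checks structural and output hazards.

```python
# Hypothetical degree-2 in-order issue: each cycle, try to issue a pair of
# instructions, falling back to single issue when the second instruction
# reads the register the first one writes (a true dependency).

def dual_issue_cycles(program):
    """program: list of (dest, src1, src2) tuples of register names.
    Returns how many issue cycles a degree-2 machine needs."""
    cycles, i = 0, 0
    while i < len(program):
        cycles += 1
        if i + 1 < len(program) and program[i][0] not in program[i + 1][1:]:
            i += 2        # independent pair: issue both this cycle
        else:
            i += 1        # dependency (or last instruction): issue one
    return cycles

# a = b + c as four instructions: the two loads pair up, but the add and
# the store each depend on the result produced just before them.
prog = [("r1", "b", "-"),     # load r1, b
        ("r2", "c", "-"),     # load r2, c
        ("r2", "r2", "r1"),   # add r2, r1
        ("a", "r2", "-")]     # store a, r2
```

The four instructions issue in three cycles instead of four; a wider machine with more ready independent work would do better still.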
Some Architectures
   PowerPC 604
    – six independent execution units:
           Branch execution unit
           Load/Store unit
           3 Integer units
           Floating-point unit
    – in-order issue
    – register renaming
   PowerPC 620
    – provides, in addition to the 604's features, out-of-order issue
   Pentium
    – three independent execution units:
           2 Integer units
           Floating point unit
    – in-order issue
VLIW
   Very Long Instruction Word (VLIW) architectures are used for executing more
    than one basic instruction at a time.

   These processors contain multiple functional units, which fetch from the
    instruction cache a Very-Long Instruction Word containing several basic
    instructions, and dispatch the entire VLIW for parallel execution. These
    capabilities are exploited by compilers which generate code that has grouped
    together independent primitive instructions executable in parallel.

   VLIW has been described as a natural successor to RISC (Reduced Instruction
    Set Computing), because it moves complexity from the hardware to the compiler,
    allowing simpler, faster processors.

   VLIW eliminates the complicated instruction scheduling and parallel dispatch
    that occur in most modern microprocessors.
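As a schematic picture (the slot layout and operation names are assumptions for illustration, not a real encoding), a long instruction word can be modeled as a fixed tuple of slots, one per functional unit, that the compiler fills with independent operations:

```python
# One very long instruction word: four slots dispatched in the same cycle.
# The hardware performs no dependence checking; the compiler guaranteed the
# operations were independent when it packed them into the word.

NOP = ("nop",)

vliw_word = (
    ("add", "r2", "r2", "r1"),  # integer ALU 0
    ("sub", "r3", "r3", "r4"),  # integer ALU 1 (independent of the add)
    ("load", "r5", "a"),        # load/store unit
    NOP,                        # branch unit: idle this cycle
)

# A VLIW program is then just a sequence of such words, padded with NOPs
# wherever the compiler found nothing independent to schedule.
program = [vliw_word]
```

The cost of the simple hardware is visible in the NOP slot: when the compiler cannot find enough independent operations, issue slots go unused.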
WHY VLIW ?
The key to higher performance in microprocessors for a broad range of
applications is the ability to exploit fine-grain, instruction-level
parallelism.

Some methods for exploiting fine-grain parallelism include:

   Pipelining
   Multiple processors
   Superscalar implementation
   Specifying multiple independent operations per instruction
Architecture Comparison:
          CISC, RISC & VLIW

INSTRUCTION SIZE
   CISC: varies
   RISC: one size, usually 32 bits
   VLIW: one size

INSTRUCTION FORMAT
   CISC: field placement varies
   RISC: regular, consistent placement of fields
   VLIW: regular, consistent placement of fields

INSTRUCTION SEMANTICS
   CISC: varies from simple to complex; possibly many dependent operations per instruction
   RISC: almost always one simple operation
   VLIW: many simple, independent operations

REGISTERS
   CISC: few, sometimes special
   RISC: many, general-purpose
   VLIW: many, general-purpose
Architecture Comparison:
          CISC, RISC & VLIW

MEMORY REFERENCES
   CISC: bundled with operations in many different types of instructions
   RISC: not bundled with operations, i.e., load/store architecture
   VLIW: not bundled with operations, i.e., load/store architecture

HARDWARE DESIGN FOCUS
   CISC: exploit microcoded implementations
   RISC: exploit implementations with one pipeline and no microcode
   VLIW: exploit implementations with multiple pipelines, no microcode, and no complex dispatch logic

PICTURES OF FIVE TYPICAL INSTRUCTIONS
Advantages of VLIW
   VLIW processors rely on the compiler that generates the VLIW code to
    explicitly specify parallelism. Relying on the compiler has advantages: the
    compiler can examine the whole program when deciding which operations to group.
   VLIW architecture reduces hardware complexity; it simply moves that
    complexity from hardware into software.
What is ILP ?

   Instruction-level parallelism (ILP) is a measure of how many of the
    operations in a computer program can be performed simultaneously.
   A system is said to embody ILP if multiple instructions can run on it
    at the same time.
   ILP can have a significant effect on performance, which is critical to
    embedded systems.
   ILP also provides a form of power saving: the same work can be completed
    with a slower clock.
What we intend to do
    with ILP ?
We use micro-architectural techniques to exploit ILP. The various techniques
    include:
   Instruction pipelining, which depends on CPU caches.
   Register renaming, a technique used to avoid unnecessary
    serialization of program operations imposed by the reuse of registers by those
    operations.
   Speculative execution, which reduces pipeline stalls due to control dependencies.
   Branch prediction, which is used to keep the pipeline full.
   Superscalar execution, in which multiple execution units are used to execute
    multiple instructions in parallel.
   Out-of-order execution, which reduces pipeline stalls due to operand dependencies.
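Register renaming in particular can be sketched in a few lines. The physical-register names below are invented for illustration; real processors use a hardware map table and a free list rather than an unbounded supply.

```python
import itertools

def rename(program):
    """program: list of (op, dest, src1, ...) using architectural registers.
    Rewrites it so every write targets a fresh physical register, removing
    write-after-write and write-after-read hazards."""
    fresh = (f"p{i}" for i in itertools.count())
    mapping = {}   # architectural register -> latest physical register
    out = []
    for op, dest, *srcs in program:
        # Reads use the most recent physical copy of each architectural name;
        # unmapped names (memory operands here) pass through untouched.
        phys_srcs = [mapping.get(s, s) for s in srcs]
        mapping[dest] = next(fresh)   # every write gets a new physical register
        out.append((op, mapping[dest], *phys_srcs))
    return out

# r1 is reused for two unrelated computations; after renaming the two
# chains share no registers and can be reordered or run in parallel.
prog = [("load", "r1", "a"), ("add", "r2", "r1", "r1"),
        ("load", "r1", "b"), ("add", "r3", "r1", "r1")]
```

After renaming, the second load no longer has to wait for the first add to finish reading r1.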
Algorithms for
scheduling

Few of the Instruction scheduling algorithms used are :

   List scheduling

   Trace scheduling

   Software pipelining (modulo scheduling)
List Scheduling
List scheduling by steps :
1.   Construct a dependence graph of the basic block. (The edges are
     weighted with the latency of the instruction.)
2.   Use the dependence graph to determine instructions that can execute;
     insert them on a list, called the Ready list.
3.   Use the dependence graph and the Ready list to schedule an instruction
     that causes the smallest possible stall; update the Ready list. Repeat
     until every instruction is scheduled.
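The steps above can be sketched compactly. The dependence graph matches the a=b+c / d=e-f example on the next slides; the latencies are assumptions for the sketch (loads take 2 cycles, other operations 1), and real schedulers also model functional units and register pressure.

```python
def list_schedule(n, preds, latency):
    """Schedule instructions 1..n, issuing one per cycle.

    preds[i]   -- set of instructions whose results i needs
    latency[i] -- cycles until i's result is available
    Always picks, from the Ready list, the instruction that can start
    earliest, i.e. the one causing the smallest possible stall."""
    finish = {}   # instr -> cycle its result becomes available
    order = []
    cycle = 0
    while len(order) < n:
        # Step 2: the Ready list -- instructions whose predecessors are done.
        ready = [i for i in range(1, n + 1)
                 if i not in finish and preds.get(i, set()).issubset(finish)]
        def earliest(i):  # first cycle i could issue without stalling
            return max((finish[p] for p in preds.get(i, set())), default=0)
        # Step 3: schedule the ready instruction with the smallest stall.
        pick = min(ready, key=earliest)
        cycle = max(cycle, earliest(pick)) + 1
        finish[pick] = cycle - 1 + latency[pick]
        order.append(pick)
    return order

# The a=b+c / d=e-f block: loads 1,2 feed add 3, which feeds store 4;
# loads 5,6 feed sub 7, which feeds store 8.
preds = {3: {1, 2}, 4: {3}, 7: {5, 6}, 8: {7}}
latency = {1: 2, 2: 2, 3: 1, 4: 1, 5: 2, 6: 2, 7: 1, 8: 1}
```

Interleaving the two independent chains hides the load latency, so the eight instructions complete in eight issue cycles with no stalls.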
Code Representation
for
List Scheduling
      a = b + c
      d = e - f

1.   load R1, b
2.   load R2, c
3.   add R2, R1
4.   store a, R2
5.   load R3, e
6.   load R4, f
7.   sub R3, R4
8.   store d, R3

(Dependence graph: loads 1 and 2 feed add 3, which feeds store 4; loads 5 and 6 feed sub 7, which feeds store 8.)
Code Representation
for
List Scheduling
Original order:       Scheduled order:
1. load R1, b         1. load R1, b
2. load R2, c         5. load R3, e
3. add R2, R1         2. load R2, c
4. store a, R2        6. load R4, f
5. load R3, e         3. add R2, R1
6. load R4, f         7. sub R3, R4
7. sub R3, R4         4. store a, R2
8. store d, R3        8. store d, R3

      a = b + c
      d = e - f

Now we have a schedule that requires no stalls and no NOPs.
Problem and Solution
   Register allocation conflict: use of the same register creates
    anti-dependencies that restrict scheduling.
   Register allocation before scheduling
– prevents good scheduling
   Scheduling before register allocation
– spills destroy the schedule
   Solution: schedule abstract assembly, allocate registers, then schedule again.
Trace scheduling

Steps involved in Trace Scheduling :
    Trace Selection
– Find the most common trace of basic blocks.
    Trace Compaction
– Combine the basic blocks in the trace and schedule them as one block.
– Create clean-up code if the execution goes off-trace.
    Exploits parallelism across IF branches, as opposed to LOOP branches.
    Can provide a speedup if static prediction is accurate.
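Trace selection can be sketched as a greedy walk over branch probabilities. The control-flow graph and probabilities below are invented for the example; production compilers use profile data and stop traces at loop back-edges.

```python
def select_trace(succ, start):
    """succ[block] = list of (successor, probability) pairs; the graph is
    assumed acyclic. Follows the most probable edge until a block has no
    successors, yielding the dominant path to compact as one big block."""
    trace = [start]
    while succ.get(trace[-1]):
        # Greedily take the most likely branch out of the current block.
        trace.append(max(succ[trace[-1]], key=lambda e: e[1])[0])
    return trace

# An if-then-else: B0 takes B1 90% of the time; both sides rejoin at B3.
cfg = {"B0": [("B1", 0.9), ("B2", 0.1)],
       "B1": [("B3", 1.0)],
       "B2": [("B3", 1.0)],
       "B3": []}
```

Here the selected trace is B0, B1, B3, so the rarely taken side B2 is where the off-trace clean-up code would be placed.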
How Trace Scheduling
works
Look for higher priority and trace the blocks as shown below.
How Trace Scheduling
works
After tracing the priority blocks, the trace is scheduled first and the
remaining blocks are scheduled in parallel with it.
How Trace Scheduling
 works
We can see the blocks being
traced depending on their priority.
How Trace Scheduling
works
• Creating large extended basic blocks by duplication
• Schedule the larger blocks




The figure above shows how the extended basic blocks can be
created.
How Trace Scheduling
 works
This block diagram, in its final stage, shows the parallelism across the
branches.
Limitations of Trace
 Scheduling


   Optimization depends on the traces being the dominant paths
    in the program’s control flow.
   Therefore, the following two things should be true:

– Programs should be skewed in the branches taken at run-time,
    for typical mixes of input data.

– We should have access to this information at compile time.

    Neither is easy to guarantee.
Software Pipelining
   In software pipelining, iterations of a loop in the source program are
continuously initiated at constant intervals, before the preceding
iterations complete, thus taking advantage of the parallelism in the data path.
   It can also be described as scheduling the operations within an iteration
such that the iterations can be pipelined to yield optimal throughput.
   The sequence of instructions before the steady state is called the
PROLOG, and the sequence after the steady state is called the EPILOG.
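The overlap structure can be sketched for a loop body with three single-cycle stages; the stage names are illustrative assumptions, and real schedulers must also respect an initiation interval larger than one cycle when dependencies or resources demand it.

```python
STAGES = ["load", "add", "store"]  # one-cycle stages of the loop body

def pipeline(iterations):
    """Start one iteration per cycle; stage s of iteration i runs in
    cycle i + s. Returns, per cycle, the list of (stage, iteration)
    pairs that execute together."""
    total = iterations + len(STAGES) - 1
    return [[(STAGES[c - i], i) for i in range(iterations)
             if 0 <= c - i < len(STAGES)]
            for c in range(total)]

sched = pipeline(5)
# Cycles 0-1 fill the pipeline (PROLOG), cycles 2-4 repeat the full
# pattern with three iterations in flight (KERNEL, the steady state),
# and cycles 5-6 drain it (EPILOG).
```

With 5 iterations the schedule takes 7 cycles; in the kernel cycles, a load, an add, and a store from three different iterations execute side by side.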
Software Pipelining
 Example
• Source code:
for (i = 0; i < n; i++) sum += a[i];

• Loop body in assembly:
r1 = L r0         ; load a[i]
--- ; stall
r2 = Add r2, r1   ; sum += a[i]
r0 = add r0, 4    ; advance the pointer

• Unroll loop & allocate registers:
r1 = L r0
--- ; stall
r2 = Add r2, r1
r0 = Add r0, 12

r4 = L r3
--- ; stall
r2 = Add r2, r4
r3 = add r3, 12

r7 = L r6
--- ; stall
r2 = Add r2, r7
r6 = add r6, 12

r10 = L r9
--- ; stall
r2 = Add r2, r10
r9 = add r9, 12
Software Pipelining
Example
Schedule the unrolled instructions, exploiting VLIW (or not).
(Figure: the unrolled schedule with the PROLOG, the repeating pattern that forms the kernel, and the EPILOG marked.)
Constraints in Software
pipelining

   Recurrence constraints: determined by
    loop-carried data dependencies.
   Resource constraints: determined by the
    total resource requirements.
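In modulo scheduling, these two constraints translate into a lower bound on the initiation interval, MII = max(ResMII, RecMII). The numbers below are invented for illustration.

```python
from math import ceil

def res_mii(op_count, unit_count):
    """Resource bound: cycles per iteration when op_count operations
    compete for unit_count copies of a functional unit."""
    return ceil(op_count / unit_count)

def rec_mii(chain_latency, dependence_distance):
    """Recurrence bound: a loop-carried dependence chain of total latency
    L that spans D iterations forces II >= L / D."""
    return ceil(chain_latency / dependence_distance)

# Example: 6 memory operations per iteration on 2 load/store units, plus
# a dependence chain of latency 4 reaching back one iteration.
mii = max(res_mii(6, 2), rec_mii(4, 1))
```

Here the recurrence bound (4) dominates the resource bound (3), so no schedule can start iterations more often than every 4 cycles.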
Remarks on Software
Pipelining
   Innermost loops, loops with larger trip counts, and loops without
    conditionals can be software pipelined.
   Code size increases due to the prolog and epilog.
   Code size increases due to unrolling for MVE (Modulo Variable
    Expansion).
   Register allocation strategies are needed for software-pipelined loops.
   Loops with conditionals can be software pipelined if predicated execution
    is supported.

– Higher resource requirement, but an efficient schedule.

