Superscalar and VLIW
    Architectures
Parallel processing [2]
Processing instructions in parallel requires
   three major tasks:
1. checking dependencies between
   instructions to determine which
   instructions can be grouped together for
   parallel execution;
2. assigning instructions to the functional
   units on the hardware;
3. determining when instructions are initiated
   or, in a VLIW machine, placed together into a single word.
Major categories [2]

   VLIW – Very Long Instruction Word
   EPIC – Explicitly Parallel Instruction Computing
Superscalar Processors [1]

    Superscalar processors are designed to exploit
     more instruction-level parallelism in user
     programs.
    Only independent instructions can be executed
     in parallel without causing a wait state.
    The amount of instruction-level parallelism
     varies widely depending on the type of code
     being executed.
Pipelining in Superscalar
Processors [1]
     In order to fully utilise a superscalar processor
      of degree m, m instructions must be executable
      in parallel. This situation may not be true in all
      clock cycles. In that case, some of the pipelines
      may be stalling in a wait state.
     In a superscalar processor, the simple
      operation latency should require only one cycle,
      as in the base scalar processor.
Superscalar Execution
Superscalar Implementation
   Simultaneously fetch multiple instructions
   Logic to determine true dependencies
    involving register values
   Mechanisms to communicate these values
   Mechanisms to initiate multiple instructions in
    parallel
   Resources for parallel execution of multiple
    instructions
   Mechanisms for committing process state in
    correct order
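Two of these mechanisms, fetching multiple instructions at once and checking true (read-after-write) dependencies before issuing them together, can be illustrated with a toy model. The instruction encoding and register names below are invented for the sketch; real hardware also checks structural and output hazards.

```python
# Hypothetical degree-2 in-order issue: each cycle, try to issue a pair of
# instructions, falling back to single issue when the second instruction
# reads the register the first one writes (a true dependency).

def dual_issue_cycles(program):
    """program: list of (dest, src1, src2) tuples of register names.
    Returns how many issue cycles a degree-2 machine needs."""
    cycles, i = 0, 0
    while i < len(program):
        cycles += 1
        if i + 1 < len(program) and program[i][0] not in program[i + 1][1:]:
            i += 2        # independent pair: issue both this cycle
        else:
            i += 1        # dependency (or last instruction): issue one
    return cycles

# a = b + c as four instructions: the two loads pair up, but the add and
# the store each depend on the result produced just before them.
prog = [("r1", "b", "-"),     # load r1, b
        ("r2", "c", "-"),     # load r2, c
        ("r2", "r2", "r1"),   # add r2, r1
        ("a", "r2", "-")]     # store a, r2
```

The four instructions issue in three cycles instead of four; a wider machine with more ready independent work would do better still.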
Some Architectures
   PowerPC 604
    – six independent execution units:
           Branch execution unit
           Load/Store unit
           3 Integer units
           Floating-point unit
    – in-order issue
    – register renaming
   PowerPC 620
    – provides, in addition to the 604's features, out-of-order issue
   Pentium
    – three independent execution units:
           2 Integer units
           Floating point unit
    – in-order issue
VLIW
   Very Long Instruction Word (VLIW) architectures are used for executing more
    than one basic instruction at a time.

   These processors contain multiple functional units, which fetch from the
    instruction cache a Very-Long Instruction Word containing several basic
    instructions, and dispatch the entire VLIW for parallel execution. These
    capabilities are exploited by compilers which generate code that has grouped
    together independent primitive instructions executable in parallel.

   VLIW has been described as a natural successor to RISC (Reduced Instruction
    Set Computing), because it moves complexity from the hardware to the compiler,
    allowing simpler, faster processors.

   VLIW eliminates the complicated instruction scheduling and parallel dispatch
    that occur in most modern microprocessors.
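As a schematic picture (the slot layout and operation names are assumptions for illustration, not a real encoding), a long instruction word can be modeled as a fixed tuple of slots, one per functional unit, that the compiler fills with independent operations:

```python
# One very long instruction word: four slots dispatched in the same cycle.
# The hardware performs no dependence checking; the compiler guaranteed the
# operations were independent when it packed them into the word.

NOP = ("nop",)

vliw_word = (
    ("add", "r2", "r2", "r1"),  # integer ALU 0
    ("sub", "r3", "r3", "r4"),  # integer ALU 1 (independent of the add)
    ("load", "r5", "a"),        # load/store unit
    NOP,                        # branch unit: idle this cycle
)

# A VLIW program is then just a sequence of such words, padded with NOPs
# wherever the compiler found nothing independent to schedule.
program = [vliw_word]
```

The cost of the simple hardware is visible in the NOP slot: when the compiler cannot find enough independent operations, issue slots go unused.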
WHY VLIW ?
The key to higher performance in microprocessors for a broad range of
applications is the ability to exploit fine-grain, instruction-level
parallelism.

Some methods for exploiting fine-grain parallelism include:

   Pipelining
   Multiple processors
   Superscalar implementation
   Specifying multiple independent operations per instruction
Architecture Comparison:
          CISC, RISC & VLIW

INSTRUCTION SIZE
   CISC: varies
   RISC: one size, usually 32 bits
   VLIW: one size

INSTRUCTION FORMAT
   CISC: field placement varies
   RISC: regular, consistent placement of fields
   VLIW: regular, consistent placement of fields

INSTRUCTION SEMANTICS
   CISC: varies from simple to complex; possibly many dependent operations per instruction
   RISC: almost always one simple operation
   VLIW: many simple, independent operations

REGISTERS
   CISC: few, sometimes special
   RISC: many, general-purpose
   VLIW: many, general-purpose
Architecture Comparison:
          CISC, RISC & VLIW

MEMORY REFERENCES
   CISC: bundled with operations in many different types of instructions
   RISC: not bundled with operations, i.e., load/store architecture
   VLIW: not bundled with operations, i.e., load/store architecture

HARDWARE DESIGN FOCUS
   CISC: exploit microcoded implementations
   RISC: exploit implementations with one pipeline and no microcode
   VLIW: exploit implementations with multiple pipelines, no microcode, and no complex dispatch logic

PICTURES OF FIVE TYPICAL INSTRUCTIONS
Advantages of VLIW
   VLIW processors rely on the compiler that generates the VLIW code to
    explicitly specify parallelism. Relying on the compiler has advantages: the
    compiler can examine the whole program when deciding which operations to group.
   VLIW architecture reduces hardware complexity; it simply moves that
    complexity from hardware into software.
What is ILP ?

   Instruction-level parallelism (ILP) is a measure of how many of the
    operations in a computer program can be performed simultaneously.
   A system is said to embody ILP if multiple instructions can run on it
    at the same time.
   ILP can have a significant effect on performance, which is critical to
    embedded systems.
   ILP also provides a form of power saving: the same work can be completed
    with a slower clock.
What we intend to do
    with ILP ?
We use micro-architectural techniques to exploit ILP. The various techniques
    include:
   Instruction pipelining, which depends on CPU caches.
   Register renaming, a technique used to avoid unnecessary
    serialization of program operations imposed by the reuse of registers by those
    operations.
   Speculative execution, which reduces pipeline stalls due to control dependencies.
   Branch prediction, which is used to keep the pipeline full.
   Superscalar execution, in which multiple execution units are used to execute
    multiple instructions in parallel.
   Out-of-order execution, which reduces pipeline stalls due to operand dependencies.
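Register renaming in particular can be sketched in a few lines. The physical-register names below are invented for illustration; real processors use a hardware map table and a free list rather than an unbounded supply.

```python
import itertools

def rename(program):
    """program: list of (op, dest, src1, ...) using architectural registers.
    Rewrites it so every write targets a fresh physical register, removing
    write-after-write and write-after-read hazards."""
    fresh = (f"p{i}" for i in itertools.count())
    mapping = {}   # architectural register -> latest physical register
    out = []
    for op, dest, *srcs in program:
        # Reads use the most recent physical copy of each architectural name;
        # unmapped names (memory operands here) pass through untouched.
        phys_srcs = [mapping.get(s, s) for s in srcs]
        mapping[dest] = next(fresh)   # every write gets a new physical register
        out.append((op, mapping[dest], *phys_srcs))
    return out

# r1 is reused for two unrelated computations; after renaming the two
# chains share no registers and can be reordered or run in parallel.
prog = [("load", "r1", "a"), ("add", "r2", "r1", "r1"),
        ("load", "r1", "b"), ("add", "r3", "r1", "r1")]
```

After renaming, the second load no longer has to wait for the first add to finish reading r1.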
Algorithms for
scheduling

Few of the Instruction scheduling algorithms used are :

   List scheduling

   Trace scheduling

   Software pipelining (modulo scheduling)
List Scheduling
List scheduling by steps :
1.   Construct a dependence graph of the basic block. (The edges are
     weighted with the latency of the instruction.)
2.   Use the dependence graph to determine instructions that can execute;
     insert them on a list, called the Ready list.
3.   Use the dependence graph and the Ready list to schedule an instruction
     that causes the smallest possible stall; update the Ready list. Repeat
     until every instruction is scheduled.
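The steps above can be sketched compactly. The dependence graph matches the a=b+c / d=e-f example on the next slides; the latencies are assumptions for the sketch (loads take 2 cycles, other operations 1), and real schedulers also model functional units and register pressure.

```python
def list_schedule(n, preds, latency):
    """Schedule instructions 1..n, issuing one per cycle.

    preds[i]   -- set of instructions whose results i needs
    latency[i] -- cycles until i's result is available
    Always picks, from the Ready list, the instruction that can start
    earliest, i.e. the one causing the smallest possible stall."""
    finish = {}   # instr -> cycle its result becomes available
    order = []
    cycle = 0
    while len(order) < n:
        # Step 2: the Ready list -- instructions whose predecessors are done.
        ready = [i for i in range(1, n + 1)
                 if i not in finish and preds.get(i, set()).issubset(finish)]
        def earliest(i):  # first cycle i could issue without stalling
            return max((finish[p] for p in preds.get(i, set())), default=0)
        # Step 3: schedule the ready instruction with the smallest stall.
        pick = min(ready, key=earliest)
        cycle = max(cycle, earliest(pick)) + 1
        finish[pick] = cycle - 1 + latency[pick]
        order.append(pick)
    return order

# The a=b+c / d=e-f block: loads 1,2 feed add 3, which feeds store 4;
# loads 5,6 feed sub 7, which feeds store 8.
preds = {3: {1, 2}, 4: {3}, 7: {5, 6}, 8: {7}}
latency = {1: 2, 2: 2, 3: 1, 4: 1, 5: 2, 6: 2, 7: 1, 8: 1}
```

Interleaving the two independent chains hides the load latency, so the eight instructions complete in eight issue cycles with no stalls.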
Code Representation
for
List Scheduling
      a = b + c
      d = e - f

1.   load R1, b
2.   load R2, c
3.   add R2, R1
4.   store a, R2
5.   load R3, e
6.   load R4, f
7.   sub R3, R4
8.   store d, R3

(Dependence graph: loads 1 and 2 feed add 3, which feeds store 4; loads 5 and 6 feed sub 7, which feeds store 8.)
Code Representation
for
List Scheduling
Original order:       Scheduled order:
1. load R1, b         1. load R1, b
2. load R2, c         5. load R3, e
3. add R2, R1         2. load R2, c
4. store a, R2        6. load R4, f
5. load R3, e         3. add R2, R1
6. load R4, f         7. sub R3, R4
7. sub R3, R4         4. store a, R2
8. store d, R3        8. store d, R3

      a = b + c
      d = e - f

Now we have a schedule that requires no stalls and no NOPs.
Problem and Solution
   Register allocation conflict: use of the same register creates
    anti-dependencies that restrict scheduling.
   Register allocation before scheduling
– prevents good scheduling
   Scheduling before register allocation
– spills destroy the schedule
   Solution: schedule abstract assembly, allocate registers, then schedule again.
Trace scheduling

Steps involved in Trace Scheduling :
    Trace Selection
– Find the most common trace of basic blocks.
    Trace Compaction
– Combine the basic blocks in the trace and schedule them as one block.
– Create clean-up code if the execution goes off-trace.
    Exploits parallelism across IF branches, as opposed to LOOP branches.
    Can provide a speedup if static prediction is accurate.
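Trace selection can be sketched as a greedy walk over branch probabilities. The control-flow graph and probabilities below are invented for the example; production compilers use profile data and stop traces at loop back-edges.

```python
def select_trace(succ, start):
    """succ[block] = list of (successor, probability) pairs; the graph is
    assumed acyclic. Follows the most probable edge until a block has no
    successors, yielding the dominant path to compact as one big block."""
    trace = [start]
    while succ.get(trace[-1]):
        # Greedily take the most likely branch out of the current block.
        trace.append(max(succ[trace[-1]], key=lambda e: e[1])[0])
    return trace

# An if-then-else: B0 takes B1 90% of the time; both sides rejoin at B3.
cfg = {"B0": [("B1", 0.9), ("B2", 0.1)],
       "B1": [("B3", 1.0)],
       "B2": [("B3", 1.0)],
       "B3": []}
```

Here the selected trace is B0, B1, B3, so the rarely taken side B2 is where the off-trace clean-up code would be placed.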
How Trace Scheduling
works
Look for higher priority and trace the blocks as shown below.
How Trace Scheduling
works
After tracing the priority blocks, the trace is scheduled first and the
remaining blocks are scheduled in parallel with it.
How Trace Scheduling
 works
We can see the blocks being
traced depending on their priority.
How Trace Scheduling
works
• Creating large extended basic blocks by duplication
• Schedule the larger blocks




The figure above shows how the extended basic blocks can be
created.
How Trace Scheduling
 works
This block diagram, in its final stage, shows the parallelism across the
branches.
Limitations of Trace
 Scheduling


   Optimization depends on the traces being the dominant paths
    in the program’s control flow.
   Therefore, the following two things should be true:

– Programs should be skewed in the branches taken at run-time,
    for typical mixes of input data.

– We should have access to this information at compile time.

    Neither is easy to guarantee.
Software Pipelining
   In software pipelining, iterations of a loop in the source program are
continuously initiated at constant intervals, before the preceding
iterations complete, thus taking advantage of the parallelism in the data path.
   It can also be described as scheduling the operations within an iteration
such that the iterations can be pipelined to yield optimal throughput.
   The sequence of instructions before the steady state is called the
PROLOG, and the sequence after the steady state is called the EPILOG.
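The overlap structure can be sketched for a loop body with three single-cycle stages; the stage names are illustrative assumptions, and real schedulers must also respect an initiation interval larger than one cycle when dependencies or resources demand it.

```python
STAGES = ["load", "add", "store"]  # one-cycle stages of the loop body

def pipeline(iterations):
    """Start one iteration per cycle; stage s of iteration i runs in
    cycle i + s. Returns, per cycle, the list of (stage, iteration)
    pairs that execute together."""
    total = iterations + len(STAGES) - 1
    return [[(STAGES[c - i], i) for i in range(iterations)
             if 0 <= c - i < len(STAGES)]
            for c in range(total)]

sched = pipeline(5)
# Cycles 0-1 fill the pipeline (PROLOG), cycles 2-4 repeat the full
# pattern with three iterations in flight (KERNEL, the steady state),
# and cycles 5-6 drain it (EPILOG).
```

With 5 iterations the schedule takes 7 cycles; in the kernel cycles, a load, an add, and a store from three different iterations execute side by side.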
Software Pipelining
 Example
• Source code:
for (i = 0; i < n; i++) sum += a[i];

• Loop body in assembly:
r1 = L r0         ; load a[i]
--- ; stall
r2 = Add r2, r1   ; sum += a[i]
r0 = add r0, 4    ; advance the pointer

• Unroll loop & allocate registers:
r1 = L r0
--- ; stall
r2 = Add r2, r1
r0 = Add r0, 12

r4 = L r3
--- ; stall
r2 = Add r2, r4
r3 = add r3, 12

r7 = L r6
--- ; stall
r2 = Add r2, r7
r6 = add r6, 12

r10 = L r9
--- ; stall
r2 = Add r2, r10
r9 = add r9, 12
Software Pipelining
Example
Schedule the unrolled instructions, exploiting VLIW (or not).
(Figure: the unrolled schedule with the PROLOG, the repeating pattern that forms the kernel, and the EPILOG marked.)
Constraints in Software
pipelining

   Recurrence constraints: determined by
    loop-carried data dependencies.
   Resource constraints: determined by the
    total resource requirements.
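In modulo scheduling, these two constraints translate into a lower bound on the initiation interval, MII = max(ResMII, RecMII). The numbers below are invented for illustration.

```python
from math import ceil

def res_mii(op_count, unit_count):
    """Resource bound: cycles per iteration when op_count operations
    compete for unit_count copies of a functional unit."""
    return ceil(op_count / unit_count)

def rec_mii(chain_latency, dependence_distance):
    """Recurrence bound: a loop-carried dependence chain of total latency
    L that spans D iterations forces II >= L / D."""
    return ceil(chain_latency / dependence_distance)

# Example: 6 memory operations per iteration on 2 load/store units, plus
# a dependence chain of latency 4 reaching back one iteration.
mii = max(res_mii(6, 2), rec_mii(4, 1))
```

Here the recurrence bound (4) dominates the resource bound (3), so no schedule can start iterations more often than every 4 cycles.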
Remarks on Software
Pipelining
   Innermost loops, loops with larger trip counts, and loops without
    conditionals can be software pipelined.
   Code size increases due to the prolog and epilog.
   Code size increases due to unrolling for MVE (Modulo Variable
    Expansion).
   Register allocation strategies are needed for software-pipelined loops.
   Loops with conditionals can be software pipelined if predicated execution
    is supported.

– Higher resource requirement, but an efficient schedule.

