Carc 06.03
alessandro.bogliolo@uniurb.it
06. Performance optimization
06.03. Multiple-issue processors
• CPI < 1
• Superscalar
• VLIW
Computer Architecture

CPI < 1
• Pipelined CPUs may have multiple execution units
  • of different types (to execute different instructions)
  • of the same type (to reduce repetition time)
• The IF, ID, MA and WB stages (and the registers between them) are not replicated
  • they can handle only a single instruction at a time
• The inherent limitation of a microprocessor with a single pipeline is CPI ≥ 1
• To get CPI < 1, all pipeline stages need to be replicated so that more than one instruction can be issued at a time
• Processors with multiple pipelines are called multiple-issue processors

Superscalar processors
• Contain N parallel pipelines
• Read sequential code and issue up to N instructions at the same time
• The instructions issued at the same time must:
  • be independent of each other
  • have sufficient resources available
• The ideal CPI is 1/N
• If an instruction (say, instr_k) cannot be issued together with the previous ones, the previous ones are issued together and instr_k is issued in the subsequent clock cycle, possibly together with some subsequent instructions

Superscalar processors (example)
• N=3
• Variable issuing rate
• CPI > 1/N

Clock cycle  1   2   3   4   5   6   7   8   9
instr1       IF  ID  EX  MA  WB
instr2       IF  ID  EX  MA  WB
instr3       IF  ID  EX  MA  WB
instr4           IF  ID  EX  MA  WB
instr5           IF  ID  EX  MA  WB
instr6               IF  ID  EX  MA  WB
instr7               IF  ID  EX  MA  WB
instr8               IF  ID  EX  MA  WB
instr9                   IF  ID  EX  MA  WB
instr10                      IF  ID  EX  MA  WB
instr11                      IF  ID  EX  MA  WB
instr12                      IF  ID  EX  MA  WB
…

• instr6 depends on instr4 or instr5
• instr10 depends on instr9
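
The issue rule for this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `issue_groups` and the `deps` dictionary are hypothetical, the only constraints modelled are the issue width N and the two dependences noted above (instr6 on instr4/instr5, instr10 on instr9), and the pipeline fill time is ignored.

```python
def issue_groups(n_instr, width, deps):
    """Split instructions 1..n_instr into in-order issue groups.

    deps maps an instruction number to the earlier instructions it depends on.
    An instruction joins the current group only if the group still has a free
    slot and contains none of the instructions it depends on.
    """
    groups = []
    current = []
    for k in range(1, n_instr + 1):
        conflict = any(d in current for d in deps.get(k, ()))
        if current and (len(current) == width or conflict):
            groups.append(current)   # issue the previous instructions together
            current = []             # instr k waits for the next clock cycle
        current.append(k)
    if current:
        groups.append(current)
    return groups


# Dependences taken from the example; everything else is assumed independent.
deps = {6: {4, 5}, 10: {9}}
groups = issue_groups(12, 3, deps)
for cycle, group in enumerate(groups, start=1):
    print(cycle, group)
print("issue cycles per instruction:", len(groups) / 12)
```

Under these assumptions the 12 instructions need 5 issue cycles, about 0.42 cycles per instruction, which is above the ideal 1/N ≈ 0.33.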

Superscalar processors (dedicated pipelines)
• In a superscalar processor, different pipelines may be devoted to different types of instructions
  • e.g., an integer pipeline (for integer/logic operations, memory accesses and branches) and a floating-point pipeline (for floating-point operations)
• All pipelines are stalled together
• Different pipelines may have different latencies, but they need to have the same repetition time
• To fully exploit the parallel pipelines, the instructions destined for each of them should appear in the code at similar rates

Superscalar DLX
• Assumptions:
  • N=2
  • One integer pipeline (Int)
  • One floating-point pipeline (FP) (ADDD has latency 3)
  • FP and Int do not share registers
• Decisions on parallel issuing can be taken based only on the OpCode

Superscalar DLX
Sequential code:
Loop: LD F0, 0(R1)
      LD F4, -8(R1)
      LD F6, -16(R1)
      ADDD F0, F0, F2
      LD F8, -24(R1)
      ADDD F4, F4, F2
      LD F10, -32(R1)
      ADDD F6, F6, F2
      SD 0(R1), F0
      ADDD F8, F8, F2
      SD -8(R1), F4
      ADDD F10, F10, F2
      SD -16(R1), F6
      SD -24(R1), F8
      SD -32(R1), F10
      SUBI R1, R1, #40
      BNEZ R1, Loop

Parallel issue (one row per clock cycle):
      Int                 FP
Loop: LD F0, 0(R1)
      LD F4, -8(R1)
      LD F6, -16(R1)      ADDD F0, F0, F2
      LD F8, -24(R1)      ADDD F4, F4, F2
      LD F10, -32(R1)     ADDD F6, F6, F2
      SD 0(R1), F0        ADDD F8, F8, F2
      SD -8(R1), F4       ADDD F10, F10, F2
      SD -16(R1), F6
      SD -24(R1), F8
      SD -32(R1), F10
      SUBI R1, R1, #40
      BNEZ R1, Loop

Superscalar DLX
Sequential code:
Loop: LD F0, 0(R1)
      LD F4, -8(R1)
      LD F6, -16(R1)
      ADDD F0, F0, F2
      LD F8, -24(R1)
      ADDD F4, F4, F2
      SUBI R1, R1, #32
      ADDD F6, F6, F2
      SD 32(R1), F0
      ADDD F8, F8, F2
      SD 24(R1), F4
      SD 16(R1), F6
      SD 8(R1), F8
      BNEZ R1, Loop

Parallel issue (one row per clock cycle):
      Int                 FP
Loop: LD F0, 0(R1)
      LD F4, -8(R1)
      LD F6, -16(R1)      ADDD F0, F0, F2
      LD F8, -24(R1)      ADDD F4, F4, F2
      SUBI R1, R1, #32    ADDD F6, F6, F2
      SD 32(R1), F0       ADDD F8, F8, F2
      SD 24(R1), F4
      SD 16(R1), F6
      SD 8(R1), F8
      BNEZ R1, Loop
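
In the Int/FP schedule above, SUBI is issued before the stores, so the store offsets become 32, 24, 16, 8 instead of the 0, -8, -16, -24 used by the loads. The following check is a sketch only: the helper names are hypothetical and addresses are modelled as plain integers. It confirms that both forms touch the same memory locations.

```python
def addresses_relative_to_old_r1(r1, n):
    """Addresses reached with offsets 0, -8, ... before R1 is decremented."""
    return [r1 - 8 * i for i in range(n)]


def addresses_after_subi(r1, n):
    """Addresses reached with offsets 8n, 8n-8, ..., 8 after SUBI R1, R1, #8n."""
    new_r1 = r1 - 8 * n
    return [new_r1 + 8 * (n - i) for i in range(n)]


r1 = 1024  # arbitrary starting value of R1 (assumption)
print(addresses_relative_to_old_r1(r1, 4))
print(addresses_after_subi(r1, 4))
```

Both calls list the same four addresses, so moving SUBI ahead of the stores does not change which array elements are written.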

Superscalar processors: performance evaluation
• Assumptions:
  • static scheduling
  • sequential code available
• Parse the code sequentially
• Group together contiguous instructions that do not conflict
• Determine the parallel instruction count (PIC)
• Insert stalls according to worst-case latency and repetition time
• Determine the number of stall cycles (SC)
CPUT = (PIC + SC) · Tclk > (IC / N) · Tclk
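
The procedure can be sketched in Python for the two-pipeline superscalar DLX and its five-element unrolled loop. The names `PIPELINE`, `parallel_instruction_count` and `cpu_time` are hypothetical, grouping is decided on the opcode only (as that machine's assumptions allow), and SC = 0 and Tclk = 1 time unit are assumptions made for illustration.

```python
# Opcode-to-pipeline mapping for the assumed machine (one Int, one FP pipeline).
PIPELINE = {"LD": "Int", "SD": "Int", "SUBI": "Int", "BNEZ": "Int", "ADDD": "FP"}


def parallel_instruction_count(opcodes):
    """Greedily group contiguous instructions that use different pipelines."""
    pic = 0
    busy = set()              # pipelines already used by the current group
    for op in opcodes:
        pipe = PIPELINE[op]
        if pipe in busy:      # resource conflict: close the current group
            pic += 1
            busy = set()
        busy.add(pipe)
    return pic + (1 if busy else 0)


def cpu_time(pic, stall_cycles, t_clk):
    """CPUT = (PIC + SC) * Tclk."""
    return (pic + stall_cycles) * t_clk


# Opcodes of the sequential loop body, in program order.
loop = "LD LD LD ADDD LD ADDD LD ADDD SD ADDD SD ADDD SD SD SD SUBI BNEZ".split()
pic = parallel_instruction_count(loop)
ic, n = len(loop), 2
t_clk = 1.0                   # arbitrary time unit (assumption)
print("PIC =", pic)
print("CPUT =", cpu_time(pic, 0, t_clk), "> IC/N * Tclk =", ic / n * t_clk)
```

With these assumptions the 17 instructions are grouped into 12 issue cycles, against a lower bound of IC/N = 8.5.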

VLIW processors
• N (from 5 to 30) parallel pipelines
• Parallel code
• Very long instruction words (VLIW)
  • Each instruction is obtained by concatenating the instructions for all the pipelines
  • Up to 1000 bits per instruction
• Static issuing, static scheduling
  • Instruction-level parallelism decided at compile time
• VLIW processors have simpler control units than superscalar processors
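
The concatenation of per-pipeline operations into one long word can be illustrated with a small Python sketch. Everything specific here is an assumption rather than part of the slides: each operation is taken to be 32 bits wide, an empty slot is encoded as 0, and the names `pack_vliw` and `unpack_vliw` are hypothetical.

```python
SLOT_BITS = 32   # assumed width of a single operation
NOP = 0          # hypothetical encoding of an empty slot


def pack_vliw(slots, slot_bits=SLOT_BITS):
    """Concatenate one operation per pipeline into a single long word."""
    word = 0
    for op in slots:
        if not 0 <= op < (1 << slot_bits):
            raise ValueError("operation does not fit in one slot")
        word = (word << slot_bits) | op
    return word


def unpack_vliw(word, n_slots, slot_bits=SLOT_BITS):
    """Split a long word back into its per-pipeline operations."""
    mask = (1 << slot_bits) - 1
    return [(word >> (slot_bits * (n_slots - 1 - i))) & mask
            for i in range(n_slots)]


# Five slots with made-up operation encodings; two slots are empty.
slots = [0x11111111, 0x22222222, NOP, NOP, 0x33333333]
word = pack_vliw(slots)
print(hex(word))
print(unpack_vliw(word, len(slots)) == slots)
```

With 32-bit operations, N = 5 gives 160-bit instructions and N = 30 gives 960 bits, which is in line with the "up to 1000 bits" figure.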

VLIW DLX
• Assumptions:
  • N=5
  • 2 floating-point pipelines (FP)
  • 2 memory access pipelines (MEM)
  • 1 pipeline for branches and integer/logic operations (INT/BRANCH)

VLIW DLX
      MEM1              MEM2              FP1                 FP2                 INT/BRANCH
Loop: LD F0, 0(R1)      LD F4, -8(R1)
      LD F6, -16(R1)    LD F8, -24(R1)
      LD F10, -32(R1)   LD F12, -40(R1)   ADDD F0, F0, F2     ADDD F4, F4, F2
      LD F14, -48(R1)                     ADDD F6, F6, F2     ADDD F8, F8, F2
                                          ADDD F10, F10, F2   ADDD F12, F12, F2   SUBI R1, R1, #56
      SD 56(R1), F0     SD 48(R1), F4     ADDD F14, F14, F2
      SD 40(R1), F6     SD 32(R1), F8
      SD 24(R1), F10    SD 16(R1), F12
      SD 8(R1), F14                                                               BNEZ R1, Loop

VLIW processors: performance evaluation
• Evaluating the performance of a VLIW processor starting from sequential code is non-trivial, since the compiler can perform static optimization
  • Assuming the sequential code is optimized, proceed as for a superscalar processor to determine the parallel instruction count (PIC) or VLIW count (VLIWC)
• Evaluating the performance of a VLIW processor starting from VLIW code is much simpler
  • Compute the number of VLIW instructions (VLIWC)
  • Insert stalls according to worst-case latency and repetition time
  • Determine the number of stall cycles (SC)
  • Assuming that all instructions have CPI=1:
CPUT = (VLIWC + SC) · Tclk > (IC / N) · Tclk
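
As a worked instance, the formula can be applied to the nine-instruction VLIW DLX loop. This Python sketch lists only the opcodes in the non-empty slots of each VLIW instruction. The variable names are hypothetical, and SC = 0 and Tclk = 1 time unit are assumptions made for illustration.

```python
# One inner list per VLIW instruction; empty slots are simply omitted.
vliw_loop = [
    ["LD", "LD"],
    ["LD", "LD"],
    ["LD", "LD", "ADDD", "ADDD"],
    ["LD", "ADDD", "ADDD"],
    ["ADDD", "ADDD", "SUBI"],
    ["SD", "SD", "ADDD"],
    ["SD", "SD"],
    ["SD", "SD"],
    ["SD", "BNEZ"],
]

N = 5                                    # pipelines (slots per VLIW instruction)
vliwc = len(vliw_loop)                   # number of VLIW instructions
ic = sum(len(w) for w in vliw_loop)      # elementary operations actually issued
sc = 0                                   # stall cycles (assumed none)
t_clk = 1.0                              # arbitrary time unit (assumption)

cpu_time = (vliwc + sc) * t_clk          # CPUT = (VLIWC + SC) * Tclk
lower_bound = ic / N * t_clk             # IC/N * Tclk
utilization = ic / (vliwc * N)           # fraction of slots doing useful work

print("VLIWC =", vliwc, "IC =", ic)
print("CPUT =", cpu_time, "> bound =", lower_bound)
print("slot utilization = %.2f" % utilization)
```

Under these assumptions the loop issues 23 operations in 9 VLIW instructions, so roughly half of the 45 available slots are used and CPUT stays well above the IC/N bound of 4.6.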

CArcMOOC 06.03 - Multiple-issue processors
