Data-Level Parallelism
CS4342 Advanced Computer Architecture
Dilum Bandara
Dilum.Bandara@uom.lk
Slides adapted from “Computer Architecture, A Quantitative Approach” by
John L. Hennessy and David A. Patterson, 5th Edition, 2012, Morgan
Kaufmann Publishers
Outline
 Vector architectures
 SIMD instruction set for multimedia
 GPUs
2
SIMD Architectures
 Exploit significant data-level parallelism for
 Matrix-oriented scientific computing
 Media-oriented image & sound processors
 SIMD is more energy efficient than MIMD
 Only needs to fetch 1 instruction per data operation
 Makes SIMD attractive for personal mobile devices
 SIMD allows programmer to continue to think
sequentially
3
SIMD Parallelism
1. Vector architectures
2. SIMD extensions
3. Graphics Processor Units (GPUs)
 For x86 processors
 Expect 2 additional cores per chip per year
 SIMD width to double every 4 years
 Potential speedup from SIMD expected to be twice that from MIMD
4
Potential Speedup
5
1. Vector Architectures
 Idea
 Read sets of data elements into “vector registers”
 Operate on those registers
 Disperse results back into memory
 Registers are controlled by compiler
 Used to hide memory latency
 Leverage memory bandwidth
6
Vector Architecture – VMIPS
7
Vector MIPS
VMIPS
 Loosely based on Cray-1
 Vector registers
 Each register holds a 64-element, 64 bits/element vector
 Register file has 16 read ports & 8 write ports
 Vector functional units
 Fully pipelined
 Data & control hazards are detected
 Vector load-store unit
 Fully pipelined
 1 word per clock cycle after initial latency
 Scalar registers
 32 general-purpose registers
 32 floating-point registers
8
VMIPS Instructions
 ADDVV.D – add 2 vectors
 ADDVS.D – add vector to a scalar
 LV/SV – vector load & vector store from a base
address held in a scalar register
9
Example – DAXPY
 Y = aX + Y
L.D F0,a ; load scalar a
LV V1,Rx ; load vector X
MULVS.D V2,V1,F0 ; vector-scalar multiply
LV V3,Ry ; load vector Y
ADDVV.D V4,V2,V3 ; add two vectors
SV Ry,V4 ; store result
 Requires 6 instructions vs. almost 600 for scalar MIPS
 Assuming 64 elements per vector
10
DAXPY – Double precision a×X plus Y
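For reference, a minimal C sketch of the same computation, i.e., the scalar loop that the 6 vector instructions above replace (the function name is illustrative):

/* Scalar DAXPY: Y = a*X + Y */
void daxpy(double a, const double *x, double *y, int n)
{
    /* One multiply-add per element; for n = 64 a scalar MIPS runs
       the whole loop body 64 times, hence the ~600 instructions */
    for (int i = 0; i < n; i = i + 1)
        y[i] = a * x[i] + y[i];
}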
Vector Execution Time
 Execution time depends on 3 factors
 Length of operand vectors
 Structural hazards
 Data dependencies
 VMIPS functional units consume 1 element per
clock cycle
 Execution time ≈ vector length (in clock cycles)
11
Definitions
 Convoy
 Set of vector instructions that could potentially execute
together
 Free of structural hazards
 Sequences with read-after-write hazards can be in
same convoy via chaining
 Chaining
 Allows a vector operation to start as soon as individual
elements of its vector source operand become available
12
Convoy
LV V1,Rx ;load vector X
MULVS.D V2,V1,F0 ;vector-scalar multiply
LV V3,Ry ;load vector Y
ADDVV.D V4,V2,V3 ;add two vectors
SV Ry,V4 ;store the sum
 Convoys
1 LV MULVS.D
2 LV ADDVV.D
3 SV
13
Definitions – Chime
 Unit of time to execute 1 convoy
 m convoys execute in m chimes
 For vector length of n, requires m × n clock cycles
14
Convoy
LV V1,Rx ;load vector X
MULVS.D V2,V1,F0 ;vector-scalar multiply
LV V3,Ry ;load vector Y
ADDVV.D V4,V2,V3 ;add two vectors
SV Ry,V4 ;store the sum
 Convoys
1 LV MULVS.D
2 LV ADDVV.D
3 SV
 3 chimes, 2 FP ops per result, so cycles per FLOP = 3/2 = 1.5
 For 64-element vectors, requires 64 × 3 = 192 clock
cycles
15
Challenges
 Start-up time
 Latency of vector functional unit
 Assume same as Cray-1
 Floating-point add: 6 clock cycles
 Floating-point multiply: 7 clock cycles
 Floating-point divide: 20 clock cycles
 Vector load: 12 clock cycles
16
Improvements
1. > 1 element per clock cycle
 Multiple Lanes
2. Vector lengths other than 64
 Vector Length Register
3. IF statements in vector code
 Vector Mask Registers
4. Memory system optimizations to support vector
processors
 Memory Banks
5. Multi-dimensional matrices
 Stride
6. Sparse matrices
7. Programming a vector computer
17
1. Multiple Lanes
 Multiple hardware lanes
 Element n of vector register A is “hardwired” to
element n of vector register B
18
2. Vector Length Register
 Vector length not known at compile time?
 Vector Length Register (VLR)
 Use strip mining for vectors over the maximum length
(see the sketch below)
19
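A minimal C sketch of strip mining, assuming a maximum hardware vector length (MVL) of 64 as in VMIPS; names are illustrative:

#define MVL 64  /* maximum hardware vector length (assumed) */

/* Process n elements in chunks of at most MVL; each outer iteration
   sets the vector length register to len and runs one vector sequence */
void daxpy_strip_mined(double a, const double *x, double *y, int n)
{
    for (int i = 0; i < n; i += MVL) {
        int len = (n - i < MVL) ? n - i : MVL;  /* value loaded into VLR */
        for (int j = 0; j < len; j = j + 1)     /* one strip = one vector op */
            y[i + j] = a * x[i + j] + y[i + j];
    }
}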
3. Vector Mask Registers
for (i = 0; i < 64; i=i+1)
  if (X[i] != 0)
    X[i] = X[i] - Y[i];
 Use vector mask register to “disable” elements
LV V1,Rx ;load vector X into V1
LV V2,Ry ;load vector Y
L.D F0,#0 ;load FP zero into F0
SNEVS.D V1,F0 ;sets VM(i) to 1 if V1(i)!=F0
SUBVV.D V1,V1,V2 ;subtract under vector mask
SV Rx,V1 ;store the result in X
20
4. Memory Banks
 Designed to support high bandwidth for vector
loads & stores
 Spread accesses across multiple banks
 Control bank addresses independently
 Load or store non-sequential words
 Support multiple vector processors sharing the same
memory
21
Source: www.hardwareanalysis.com/content/image/10220/
5. Stride
for (i = 0; i < 100; i=i+1)
  for (j = 0; j < 100; j=j+1) {
    A[i][j] = 0.0;
    for (k = 0; k < 100; k=k+1)
      A[i][j] = A[i][j] + B[i][k] * D[k][j];
  }
 Row-major vs. column-major
 Vectorize multiplication of rows of B with columns of D
 Use non-unit stride (see the sketch after this list)
 Bank conflict (stall) occurs when the same bank is accessed
again before its bank busy time has elapsed
22
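To make the stride concrete, a small C sketch for the row-major 100×100 matrices above: walking down a column of D advances the address by one full row per element.

#include <stdio.h>

int main(void)
{
    static double D[100][100];

    /* Row-major: &D[k][j] = base + (k*100 + j) * sizeof(double),
       so a column access has a stride of 100 elements = 800 bytes */
    printf("column stride = %td bytes\n",
           (char *)&D[1][0] - (char *)&D[0][0]);
    return 0;
}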
2. SIMD Extensions
 Media applications operate on data types
narrower than native word size
 E.g., four 8-bit operations on a 32-bit system
23
Why SIMD?
 Needs only minor changes to hardware
 Requires little extra state compared to vector
architectures
 So context switches remain cheap
 Needs less memory bandwidth than a vector
processor
 No need to deal with page faults in the middle of a
vector when used with virtual memory
24
SIMD Extensions
 Limitations compared to vector instructions
 Number of data operands is encoded in the opcode
 Leads to many instruction variants
 No sophisticated addressing modes
 Strided, scatter-gather
 No mask registers for conditional execution
25
SIMD Implementations
 Intel MMX (1996)
 Eight 8-bit integer ops or four 16-bit integer ops
 Streaming SIMD Extensions (SSE) (1999)
 Eight 16-bit integer ops
 Four 32-bit integer/FP ops or two 64-bit integer/FP ops
 Advanced Vector Extensions (AVX) (2010)
 Four 64-bit integer/FP ops
 Rely on programmer & libraries rather than the compiler
 Operands must be in consecutive & aligned memory
locations
26
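A minimal sketch of using SSE from C via compiler intrinsics; _mm_add_ps performs four 32-bit FP additions with a single instruction, with the operands packed into 128-bit registers:

#include <stdio.h>
#include <xmmintrin.h>  /* SSE intrinsics */

int main(void)
{
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);    /* 4 packed floats */
    __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);
    __m128 c = _mm_add_ps(a, b);   /* one instruction, four FP adds */

    float out[4];
    _mm_storeu_ps(out, c);         /* store the 4 results to memory */
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
    return 0;
}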
Graphical Processing Units (GPUs)
 Originally designed to accelerate the large number of
computations performed in graphics rendering
 Offloaded numerically intensive computation from
CPU
 GPUs grew with demand for high-performance
graphics
 Eventually GPUs became much more powerful, even
more than CPUs, for many computations
 Cost-power-performance advantage
27
GPUs
 GPU is typically an add-in card, installed into a
PCI Express slot
 Market leaders: NVIDIA, Intel, AMD (ATI)
 Example NVIDIA GPUs at UoM
28
GeForce GTX 480 Tesla 2070
Example Specifications
29
                                GTX 480         Tesla 2070      Tesla K80
Peak double-precision FP perf.  650 Gigaflops   515 Gigaflops   2.91 Teraflops
Peak single-precision FP perf.  1.3 Teraflops   1.03 Teraflops  8.74 Teraflops
CUDA cores                      480             448             4992
Frequency of CUDA cores         1.40 GHz        1.15 GHz        560/875 MHz
Memory size (GDDR5)             1536 MB         6 GB            24 GB
Memory bandwidth                177.4 GB/sec    150 GB/sec      480 GB/sec
ECC memory                      No              Yes             Yes
CPU vs. GPU Architecture
30
GPU devotes more transistors to computation
Basic Idea
 Heterogeneous execution model
 CPU – host
 GPU – device
 Develop a C-like programming language for GPU
 Unify all forms of GPU parallelism as threads
 Handle each data element in a separate thread → no
need for synchronization
 Programming model is “Single Instruction Multiple
Thread (SIMT)”
31
Programming GPUs
 CUDA language for Nvidia GPU products
 Compute Unified Device Architecture
 Based on C
 nvcc compiler
 Lots of tools for analysis, debug, profile, …
 OpenCL - Open Computing Language
 Based on C
 Supports GPU & CPU programming
 Lots of active research
 e.g., automatic code generation for GPUs
32
Threads & Blocks
 A thread is associated with
each data element
 Threads are organized into
blocks
 Blocks are organized into a
grid
 GPU hardware handles
thread management, not
applications or OS
33
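A minimal CUDA sketch (the kernel name is illustrative) of how each thread locates its own data element from its block and thread indices; the hardware, not the program, schedules the blocks:

/* Each thread handles exactly one element of x and y */
__global__ void scale(float *y, const float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global element index */
    if (i < n)          /* guard: the grid may cover more than n elements */
        y[i] = a * x[i];
}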
Grids, Blocks, & Threads
34
 Grid of size 6 (3x2
blocks)
 Each block has 12
threads (4x3)
NVIDIA GPU Architecture
 Similarities to vector machines
 Works well with data-level parallel problems
 Scatter-gather transfers
 Mask registers
 Large register files
 Differences
 No scalar processor
 Uses multithreading to hide memory latency
 Has many functional units, as opposed to a few
deeply pipelined units like a vector processor
35
Example – Multiply 2 Vectors of
Length 8192
 Code that works over all elements is the grid
 Thread blocks break this down into manageable sizes
 512/1024 threads per block
 SIMD instruction executes 32 elements at a time
 Thus, grid size = 16/8 blocks
 Block is analogous to a strip-mined vector loop with
vector length of 32
 Block is assigned to a multithreaded SIMD processor by
thread block scheduler
 Current-generation GPUs (Fermi) have 7-15
multithreaded SIMD processors (a.k.a. streaming
multiprocessors)
36
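A minimal host-side sketch of the launch for this example, reusing the illustrative scale kernel from the Threads & Blocks sketch earlier; 8192 elements at 512 threads per block gives a grid of 16 blocks:

#include <cuda_runtime.h>

int main(void)
{
    int n = 8192;
    float *d_x, *d_y;                      /* device arrays */
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    /* ... copy input data into d_x with cudaMemcpy ... */

    int threadsPerBlock = 512;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  /* = 16 */
    scale<<<blocks, threadsPerBlock>>>(d_y, d_x, 2.0f, n);
    cudaDeviceSynchronize();               /* wait for the kernel */

    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}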
Multithreaded SIMD Processor
37
Threads of SIMD Instructions
 Each has its own PC
 Thread scheduler uses scoreboard to dispatch
 Keeps track of up to 48 threads of SIMD
instructions
 Hides memory latency
 No data dependencies between threads!
 Thread block scheduler schedules blocks to
SIMD processors
 Within each SIMD processor
 32 SIMD lanes
 Wide & shallow compared to vector processors 38
Registers
 Each multithreaded SIMD processor has 32,768 registers
 Divided into lanes
 Each SIMD thread is limited to 64 registers
 SIMD thread has up to
 64 vector registers of 32 32-bit elements
 32 vector registers of 32 64-bit elements
 Fermi has 16 physical SIMD lanes, each containing
2048 registers
39
Conditional Branching
 Like vector architectures, GPU branch hardware uses
internal masks
 Also uses
 Branch synchronization stack
 Entries consist of masks for each SIMD lane
 i.e. which threads commit their results (all threads execute)
 Instruction markers to manage when a branch diverges into
multiple execution paths
 Push on divergent branch
 …and when paths converge
 Act as barriers
 Pops stack
 Per-thread-lane 1-bit predicate register, specified by
programmer (see the sketch below)
40
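A minimal CUDA sketch of a divergent branch, mirroring the vector-mask example earlier; lanes of a warp whose condition is false are masked off while the taken path executes:

/* One thread per element; the if can diverge within a warp, so the
   hardware uses per-lane masks: both sides may be traversed, but only
   lanes with mask = 1 commit results */
__global__ void masked_sub(float *X, const float *Y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && X[i] != 0.0f)
        X[i] = X[i] - Y[i];
}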
NVIDIA GPU Memory Structure
GRID Architecture
41
Grid
• A group of threads all running
the same kernel
• Can run multiple grids at once
Block
• Grids composed of blocks
• Each block is a logical unit
containing a number of
coordinating threads &
some amount of shared
memory
NVIDIA GPU Memory Structures
(Cont.)
42
NVIDIA GPU Memory Structures
(Cont.)
 Each SIMD lane has a private section of off-chip DRAM
 Private memory
 Contains stack frame, spilling registers, & private
variables
 Each multithreaded SIMD processor also has
local memory
 Shared by SIMD lanes / threads within a block
 Memory shared by SIMD processors is GPU
Memory
 Host can read & write GPU memory
43
Fermi Architecture Innovations
 Each SIMD processor has
 2 SIMD thread schedulers, 2 instruction dispatch units
 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16
load-store units, 4 special function units
 Thus, 2 threads of SIMD instructions are scheduled every 2
clock cycles
 Fast double precision
 Caches for GPU memory
 64-bit addressing & unified address space
 Error correcting codes
 Faster context switching
 Faster atomic instructions
44
Multithreaded Dual SIMD Processor
of a Fermi GPU
45
Summary
 Vector architectures
 High performance
 High cost
 SIMD instruction set for multimedia
 Simple extensions
 Low cost
 Low performance
 GPUs
 High performance
 Low cost
 SIMD only
46
Editor's Notes
  • #16 (convoy/chime example): 192 cycles / 128 FLOPs = 1.5 cycles per FLOP
  • #41 (conditional branching): the predicate register can be used to tell whether to execute an instruction given a condition
  • #44 (GPU memory structures): spilling registers – a "spilled variable" is a variable kept in main memory rather than in a CPU register; moving a variable from a register to memory is called spilling
  • #46 (dual SIMD processor): the 2 parts run separately