Velammal Engineering College
Department of Computer Science
and Engineering
Welcome…
Dr.S.Gunasundari
Mr. A. Arockia Abins &
Ms. R. Amirthavalli,
CSE,
Velammal Engineering College
Subject Code / Name:
19IT202T /
Computer Architecture
Syllabus – Unit IV
UNIT-IV PARALLELISM
Introduction to Multicore processors
and other shared memory
multiprocessors - Flynn's classification:
SISD, MIMD, SIMD, SPMD and
Vector - Hardware multithreading:
Fine-grained, Coarse-grained and
Simultaneous Multithreading (SMT) -
GPU architecture: NVIDIA GPU
Architecture, NVIDIA GPU Memory
Structure
Topics:
• Introduction to Multicore processors
• Other shared memory multiprocessors
• Flynn’s classification:
o SISD,
o MIMD,
o SIMD,
o SPMD and Vector
• Hardware multithreading
• GPU architecture
Introduction to Multicore processors
Multicore processors
• What is a Processor?
o A single chip package that fits in a socket
o Cores can have functional units, cache, etc.
associated with them
• The main goal of multicore design is to provide increasing
processing power within a single computing unit.
• A multicore processor is a single computing
component with two or more “independent”
processors (called "cores").
• Also known as a chip multiprocessor (CMP)
EXAMPLES
 dual-core processor with 2 cores
• e.g. AMD Phenom II X2, Intel Core 2 Duo E8500
 quad-core processor with 4 cores
• e.g. AMD Phenom II X4, Intel Core i5 2500T
 hexa-core processor with 6 cores
• e.g. AMD Phenom II X6, Intel Core i7 Extreme Ed. 980X
 octa-core processor with 8 cores
• e.g. AMD FX-8150, Intel Xeon E7-2820
Processor
Single core
Multicore
Number of core types
Homogeneous (symmetric) cores:
• All of the cores in a homogeneous multicore
processor are of the same type; typically the core
processing units are general-purpose central
processing units that run a single multicore
operating system.
• Example: Intel Core 2
Heterogeneous (asymmetric) cores:
• Heterogeneous multicore processors have a mix of
core types that often run different operating
systems and include graphics processing units.
• Example: IBM's Cell processor, used in the Sony
PlayStation 3 video game console
Homogeneous Multicore Processor
Heterogeneous Multicore Processor
Shared memory multiprocessors
Shared Memory Multiprocessors
• A system with multiple CPUs “sharing” the same
main memory is called a multiprocessor.
• In a multiprocessor system, all processes on the
various CPUs share a single logical address
space, which is mapped onto a physical memory
that can be distributed among the processors.
• Each process can read and write a data item
simply using load and store operations, and
process communication is through shared
memory.
Shared Memory Multiprocessors
• Processors communicate through shared
variables in memory, with all processors capable
of accessing any memory location via loads and
stores.
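As a software analogy (not the hardware itself), threads in a single process share one address space and communicate through ordinary loads and stores to shared variables. A minimal sketch using Python's standard `threading` module; the variable names are illustrative:

```python
import threading

counter = 0                # shared variable in the common address space
lock = threading.Lock()    # coordinates access, like hardware synchronization

def worker(increments):
    """Each 'processor' updates the same shared memory location."""
    global counter
    for _ in range(increments):
        with lock:         # make the load-modify-store sequence atomic
            counter += 1

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 4000: all four threads communicated via the shared variable
```

Without the lock, the concurrent load/store sequences could interleave and lose updates, which is exactly why shared-memory multiprocessors need synchronization primitives.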
Questions:
• Multicore processor
• Hexacore processor
• Homogeneous Multicore processor
• Heterogeneous Multicore processor
• Multiprocessor
• Shared memory Multiprocessor
• Single address space multiprocessors come in two
styles.
o Uniform Memory Access (UMA)
o Non-Uniform Memory Access (NUMA)
UMA Architecture:
• In the first style, the latency to a word in memory
does not depend on which processor asks for it. Such
machines are called uniform memory access (UMA)
multiprocessors.
NUMA/DSMA Architecture:
• In the second style, some memory accesses are
much faster than others, depending on which
processor asks for which word, typically because
main memory is divided and attached to different
microprocessors or to different memory controllers on
the same chip.
• Such machines are called nonuniform memory
access (NUMA) multiprocessors.
Types:
• The shared-memory multiprocessors fall into two
classes, depending on the number of processors
involved, which in turn dictates a memory
organization and interconnect strategy.
• They are:
1. Centralized shared memory (Uniform Memory
Access)
2. Distributed shared memory (NonUniform
Memory Access)
1. Centralized shared memory architecture
2. Distributed shared memory architecture
Flynn’s
classification
Flynn's classification:
• In 1966, Michael Flynn proposed a classification
for computer architectures based on the
number of instruction streams and data streams
(Flynn’s Taxonomy).
o SISD (Single Instruction stream, Single Data stream)
o SIMD (Single Instruction stream, Multiple Data
streams)
o MISD (Multiple Instruction streams, Single Data
stream)
o MIMD (Multiple Instruction streams, Multiple Data
streams)
Flynn's classification:
Simple Diagrammatic Representation
SISD
• A SISD machine executes a single instruction stream on
individual data values using a single processor.
• Based on the traditional von Neumann uniprocessor
architecture, instructions are executed sequentially
or serially, one step after the next.
• Until recently, most computers were of the SISD type.
• Conventional uniprocessor
SISD
SIMD
• An SIMD machine executes a single instruction on
multiple data values simultaneously using many
processors.
• Since there is only one instruction stream, the
processors do not each fetch and decode
instructions; instead, a single control unit does the
fetching and decoding for all of them.
• SIMD architectures include array processors.
SIMD
• Data level parallelism:
o Parallelism achieved by performing the same operation on
independent data.
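Data-level parallelism can be pictured in software as one operation applied uniformly to many independent elements. A conceptual sketch in plain Python (real SIMD hardware would perform these element operations in the same cycle, not in a loop):

```python
# One "instruction": multiply by 2. Many data: the list elements.
data = [1, 2, 3, 4, 5, 6, 7, 8]

def simd_like(op, values):
    """Apply the same operation to every element.

    Each element is independent, so the results do not depend on the
    order of evaluation -- exactly the property that lets SIMD hardware
    execute all of the element operations simultaneously.
    """
    return [op(x) for x in values]

result = simd_like(lambda x: x * 2, data)
print(result)  # [2, 4, 6, 8, 10, 12, 14, 16]
```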
MISD
• Each processor executes a different sequence of instructions.
• In MISD computers, multiple processing units operate on one
single data stream.
• No practical machine of this category has been built; it was included in
the taxonomy for the sake of completeness.
MISD
Questions:
• Uniform Memory Access (UMA)
• Non-Uniform Memory Access (NUMA)
• Centralized shared memory
• Distributed shared memory
• Flynn’s classification:
MIMD
• MIMD machines are usually referred to as
multiprocessors or multicomputers.
• They may execute multiple instruction streams
simultaneously, in contrast to SIMD machines.
• Each processor includes its own control unit;
processors may be assigned parts of a single task
or entirely separate tasks.
• MIMD has two subclasses: shared memory and
distributed memory.
MIMD
Analogy of Flynn’s Classifications
• An analogy of Flynn’s classification is the check-in desk at an airport
 SISD: a single desk
 SIMD: many desks and a supervisor with a megaphone giving instructions that every desk obeys
 MIMD: many desks working at their own pace, synchronized through a central database
Hardware categorization
SSE : Streaming SIMD Extensions
Processor Organizations – Computer Architecture Classifications:
• Single Instruction, Single Data Stream (SISD): Uniprocessor
• Single Instruction, Multiple Data Stream (SIMD): Vector Processor, Array Processor
• Multiple Instruction, Single Data Stream (MISD)
• Multiple Instruction, Multiple Data Stream (MIMD): Shared Memory (tightly coupled), Multicomputer (loosely coupled)
Vector
• A more elegant interpretation of SIMD is called a vector architecture.
• Vector architectures pipeline the ALU to get good performance
at lower cost.
• Data elements are collected from memory, placed in order into a large
set of registers, operated on sequentially in the registers using
pipelined execution units, and the results are then written back to
memory.
Structure of a vector unit containing four lanes
Vector lane
• One or more vector functional units and a portion of the vector
register file.
Questions:
• MIMD
• Examples for Flynn’s classification
Hardware
multithreading
Hardware multithreading
• A thread is a lightweight process with its own
instructions and data.
• Each thread has all the state (instructions, data, PC,
register state, etc.) necessary to allow it to execute.
• Multithreading (MT) allows multiple threads to share
the functional units of a single processor.
Hardware multithreading
• Multithreading increases utilization of a processor by
switching to another thread when one thread is stalled.
• Types of Multithreading:
o Fine-grained Multithreading
• Cycle by cycle
o Coarse-grained Multithreading
• Switch on event (e.g., cache miss)
o Simultaneous Multithreading (SMT)
• Instructions from multiple threads executed concurrently in the
same cycle
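The fine-grained "cycle by cycle" policy above can be pictured with a toy scheduling analogy (purely illustrative, not real hardware; the function name and thread labels are mine). Each thread is a list of instructions, and the scheduler issues from a different thread every cycle:

```python
# Toy model: three threads, each a sequence of instruction labels.
threads = {
    "A": ["A1", "A2", "A3"],
    "B": ["B1", "B2", "B3"],
    "C": ["C1", "C2", "C3"],
}

def fine_grained_schedule(threads):
    """Round-robin issue: switch threads every cycle, so no two
    consecutive issue slots hold instructions from the same thread."""
    order = []
    active = [iter(instrs) for instrs in threads.values()]
    while active:
        still_active = []
        for it in active:
            try:
                order.append(next(it))   # one instruction per thread per round
                still_active.append(it)
            except StopIteration:
                pass                     # this thread has finished
        active = still_active
    return order

print(fine_grained_schedule(threads))
# ['A1', 'B1', 'C1', 'A2', 'B2', 'C2', 'A3', 'B3', 'C3']
```

Coarse-grained MT, by contrast, would keep issuing from one thread and switch only when that thread hits a costly stall.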
[Figure: four threads (A, B, C, D) sharing the issue slots of a 4-issue machine]
Fine-grained MT
Idea: Switch to another thread every cycle such
that no two instructions from the same thread are in
the pipeline concurrently
Advantages
+ No need for dependency checking between instructions
(only one instruction in pipeline from a single thread)
+ No need for branch prediction logic
+ Otherwise-bubble cycles used for executing useful instructions
from different threads
+ Improved system throughput, latency tolerance, utilization
Fine-grained MT
Idea: Switch to another thread every cycle such
that no two instructions from the same thread are in
the pipeline concurrently
Disadvantages
- Extra hardware complexity: multiple hardware contexts, thread
selection logic
- Reduced single thread performance (one instruction fetched every
N cycles)
- Resource contention between threads in caches and memory
- Dependency checking logic between threads remains (load/store)
Fine-grained MT
[Figure: fine-grained MT pipeline diagram — time stamps of single-thread execution, with instructions from different threads interleaved cycle by cycle]
Coarse-grained MT switches threads only on
costly stalls, such as L2 misses.
The processor is not slowed down (by thread
switching), since instructions from other threads
will only be issued when a thread encounters a
costly stall.
Since a CPU with coarse-grained MT issues
instructions from a single thread, when a stall
occurs the pipeline must be emptied.
The new thread must fill the pipeline before
instructions will be able to complete.
Coarse-grained MT switches threads only on
costly stalls, such as L2 misses.
Advantages:
– thread switching doesn’t have to be essentially free,
and it is much less likely to slow down the execution of
an individual thread
Disadvantage:
– limited, due to pipeline start-up costs, in its ability
to overcome throughput loss
Pipeline must be flushed and refilled on thread
switches
Coarse-grained MT
Questions
• Define thread.
• What is meant by hardware multithreading?
• Types of multithreading
Simultaneous Multithreading
Simultaneous multithreading (SMT) is a
variation on MT to exploit TLP simultaneously
with ILP.
SMT is motivated by multiple-issue processors
which have more functional unit parallelism than a
single thread can effectively use.
Multiple instructions from different threads can be
issued in the same clock cycle.
[Figure: SMT pipeline diagram — time stamps of single-thread execution, with instructions from multiple threads issued in the same cycle]
Approaches to using the issue slots.
Amdahl’s law
Speedup
• Speedup measures the reduction in running time due
to parallelism. The number of PEs is given by n.
• Based on running times, S(n) = ts/tp , where
o ts is the execution time on a single processor, using the fastest known
sequential algorithm
o tp is the execution time using a parallel processor.
• For theoretical analysis, S(n) = ts/tp where
o ts is the worst-case running time of the fastest known sequential
algorithm for the problem
o tp is the worst-case running time of the parallel algorithm using n PEs.
Speedup in Simplest
Terms
Amdahl’s law:
“It states that the potential speedup gained by the parallel execution of
a program is limited by the portion that cannot be parallelized.”
Amdahl’s law
• Assume the execution time before the improvement is 1, in some unit of time.
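A standard way to write this (not printed on the slide): with the execution time before taken as 1, let F be the fraction of the work that is improved (e.g., parallelized) and S the speedup of that fraction. Then

```latex
\text{Execution time}_{\text{new}} = (1 - F) + \frac{F}{S},
\qquad
\text{Speedup}_{\text{overall}} = \frac{1}{(1 - F) + \dfrac{F}{S}}
```

As F → 1 the overall speedup approaches S, but any sequential remainder (1 − F) caps it regardless of S.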
Question:
• When parallelizing an application, the ideal speedup is speeding up
by the number of processors. What is the speedup with 8 processors if
60% of the application is parallelizable?
Question:
• When parallelizing an application, the ideal speedup is speeding up
by the number of processors. What is the speedup with 8 processors if
80% of the application is parallelizable?
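The two questions above can be checked numerically. A sketch applying Amdahl's law (`amdahl_speedup` is a helper name I've chosen; F is the parallelizable fraction, n the processor count):

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Overall speedup when a fraction of the work is split across n processors."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_processors)

# 60% parallelizable, 8 processors: 1 / (0.4 + 0.6/8) = 1 / 0.475
print(round(amdahl_speedup(0.60, 8), 2))  # 2.11

# 80% parallelizable, 8 processors: 1 / (0.2 + 0.8/8) = 1 / 0.3
print(round(amdahl_speedup(0.80, 8), 2))  # 3.33
```

Both answers fall far short of the ideal 8x, showing how strongly the serial fraction dominates.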
QUESTION:
• Suppose that we are considering an enhancement that runs 10 times
faster than the original machine but is usable only 40% of the time.
What is the overall speedup gained by incorporating the
enhancement?
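This is Amdahl's law again, with enhancement factor 10 applied to 40% of the time (the function name here is illustrative):

```python
def overall_speedup(enhanced_fraction, enhancement_factor):
    """Amdahl's law: speedup of the whole when only part of the work is enhanced."""
    return 1.0 / ((1.0 - enhanced_fraction) + enhanced_fraction / enhancement_factor)

# 1 / (0.6 + 0.4/10) = 1 / 0.64
print(round(overall_speedup(0.4, 10), 4))  # 1.5625
```

Even a 10x enhancement yields only about 1.56x overall, because 60% of the time is untouched.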
Question
• Suppose you want to achieve a speed-up of 90
times faster with 100 processors. What
percentage of the original computation can be
sequential?
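Solving Amdahl's law for the sequential fraction s: 90 = 1 / (s + (1 − s)/100) gives s = (1/90 − 1/100) / (1 − 1/100) ≈ 0.001, i.e. only about 0.1% of the computation can be sequential. A numeric check (variable names are mine):

```python
target_speedup = 90
n = 100

# speedup = 1 / (s + (1 - s)/n)  =>  s = (1/speedup - 1/n) / (1 - 1/n)
s = (1 / target_speedup - 1 / n) / (1 - 1 / n)
print(round(s * 100, 2))  # sequential percentage: 0.11

# verify: this s does give a 90x speedup
print(round(1 / (s + (1 - s) / n), 1))  # 90.0
```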
Question
• Suppose you want to perform two sums: one is a sum of 10
scalar variables, and one is a matrix sum of a pair of two-dimensional
arrays, with dimensions 10 by 10. For now
let’s assume only the matrix sum is parallelizable. What
speed-up do you get with 10 versus 40 processors?
• Next, calculate the speed-ups assuming the matrices grow
to 20 by 20.
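Counting each addition as one time unit (10 scalar additions, sequential; 100 matrix-element additions, parallelizable), the speedups follow directly. A sketch of that accounting (the function name is mine):

```python
def speedup(scalar_adds, matrix_adds, n_processors):
    """Time on 1 processor vs. time when only the matrix sum is parallelized."""
    t_single = scalar_adds + matrix_adds
    t_parallel = scalar_adds + matrix_adds / n_processors
    return t_single / t_parallel

# 10-by-10 matrices: 10 scalar + 100 matrix additions
print(round(speedup(10, 100, 10), 2))  # 5.5  (55% of the ideal 10x)
print(round(speedup(10, 100, 40), 2))  # 8.8  (22% of the ideal 40x)

# 20-by-20 matrices: the parallelizable portion grows to 400 additions
print(round(speedup(10, 400, 10), 2))  # 8.2
print(round(speedup(10, 400, 40), 2))  # 20.5
```

Growing the matrices shrinks the sequential fraction, so the larger problem gets much closer to the ideal speedup.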
Graphics processing
unit (GPU)
Graphics processing unit (GPU)
• It is a processor optimized for 2D/3D graphics, video, visual computing, and display.
• It is a highly parallel, highly multithreaded multiprocessor optimized for visual
computing.
• It provides real-time visual interaction with computed objects via graphics images
and video.
• Heterogeneous systems combine a GPU with a CPU.
GPU Hardware
An Introduction to the NVIDIA GPU Architecture
NVIDIA GPU Memory Structures
Thank you…