Introduction to High Performance Computer Architecture
Introduction to Multiprocessors
Mr. SUBHASIS DASH
SCHOOL OF COMPUTER ENGINEERING,
KIIT UNIVERSITY, BHUBANESWAR
Introduction
● Initial computer performance improvements came from the use of:
  – Innovative manufacturing techniques.
● In later years,
  – Most improvements came from exploitation of ILP.
  – Both software and hardware techniques are being used.
  – Pipelining, dynamic instruction scheduling, out-of-order execution, VLIW, vector processing, etc.
● ILP now appears fully exploited:
  – Further performance improvements from ILP appear limited.
Thread and Process-Level Parallelism
● The way to achieve higher performance:
  – Of late, the focus has been on exploiting thread- and process-level parallelism.
● Exploit parallelism existing across multiple processes or threads:
  – Such parallelism cannot be exploited by any ILP processor.
● Consider a banking application:
  – Individual transactions can be executed in parallel.
Processes versus Threads
● Processes:
  – A process is a program in execution.
  – An application normally consists of multiple processes.
● Threads:
  – A process consists of one or more threads.
  – Threads belonging to the same process share data and code space.
Single and Multithreaded Processes
(Figure: a single-threaded process versus a multithreaded process; threads of the same process share code, data, and files but have their own registers and stacks.)
How can Threads be Created?
● By using any of the popular thread libraries:
  – POSIX Pthreads
  – Win32 threads
  – Java threads, etc.
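As a concrete illustration of library-based thread creation (not part of the original slides), here is a minimal POSIX Pthreads sketch in C. The function and variable names are illustrative; it simply spawns one worker thread and waits for it to finish.

```c
#include <pthread.h>
#include <stdio.h>

/* Work performed by the new thread; Pthreads passes the argument and
 * return value as void pointers. */
static void *worker(void *arg)
{
    long id = (long)arg;
    printf("hello from thread %ld\n", id);
    return NULL;
}

int main(void)
{
    pthread_t tid;

    /* Create one thread running worker(); threads of the same process
     * share code, global data, and open files. */
    if (pthread_create(&tid, NULL, worker, (void *)1L) != 0) {
        perror("pthread_create");
        return 1;
    }

    /* Wait for the thread to terminate before the process exits. */
    pthread_join(tid, NULL);
    return 0;
}
```

Compile with `cc demo.c -lpthread`; the Win32 and Java thread APIs follow the same create-then-join pattern.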
User Threads
● Thread management done in user space.
● User threads are supported and managed without kernel support.
  – Invisible to the kernel.
  – If one thread blocks, the entire process blocks.
  – Limited benefits of threading.
Kernel Threads
● Kernel threads are supported and managed directly by the OS.
  – The kernel creates Light Weight Processes (LWPs).
● Most modern OSs support kernel threads:
  – Windows XP/2000
  – Solaris
  – Linux
  – Mac OS, etc.
Benefits of Threading
● Responsiveness:
  – Threads share code and data.
  – Thread creation and switching are therefore much more efficient than for processes.
● As an example, in Solaris:
  – Creating threads is about 30x less costly than creating processes.
  – Context switching is about 5x faster than for processes.
Benefits of Threading cont…
● Truly concurrent execution:
  – Possible with processors supporting concurrent execution of threads: SMP, multi-core, SMT (hyper-threading), etc.
A Few Thread Examples
● Independent threads occur naturally in several applications:
  – Web server: different HTTP requests are the threads.
  – File server
  – Name server
  – Banking: independent transactions
  – Desktop applications: file loading, display, computations, etc. can be threads.
Reflection on Threading
● To think of it:
  – Threading is inherent to any server application.
● Threads are also easily identifiable in traditional applications:
  – Banking, scientific computations, etc.
Thread-level Parallelism --- Cons cont…
● Threads with severe dependencies:
  – May make multithreading an exercise in futility.
● Also not as “programmer friendly” as ILP.
Thread vs. Process-Level Parallelism
● Threads are lightweight (or fine-grained):
  – Threads share address space, data, files, etc.
  – Even when the extent of data sharing and synchronization is low, exploitation of thread-level parallelism is meaningful only when communication latency is low.
  – Consequently, shared memory (UMA) architectures are a popular way to exploit thread-level parallelism.
A Broad Classification of Computers
● Shared-memory multiprocessors
  – Also called UMA
● Distributed memory computers
  – Also called NUMA:
    ● Distributed Shared-Memory (DSM) architectures
    ● Clusters
    ● Grids, etc.
UMA vs. NUMA Computers
(Figure: (a) UMA model --- processors P1…Pn, each with a cache, share one main memory over a common bus; latency = 100s of ns. (b) NUMA model --- processors P1…Pn, each with a cache and a local main memory, communicate over a network; latency = several milliseconds to seconds.)
Distributed Memory Computers
● Distributed memory computers use:
  – Message Passing Model
● Explicit message send and receive instructions have to be written by the programmer.
  – Send: specifies local buffer + receiving process (id) on remote computer (address).
  – Receive: specifies sending process on remote computer + local buffer to place data.
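As an illustration of this send/receive style (my own sketch, not from the slides), here is a minimal MPI program in C; MPI is assumed only as one common message-passing library, and the tag value 0 is arbitrary.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Send: local buffer (&value) + receiving process (rank 1). */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receive: sending process (rank 0) + local buffer to place data. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

Run with two processes (e.g. `mpirun -np 2 ./a.out`); note that the programmer names the peer process explicitly in both calls.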
Advantages of Message-Passing Communication
● Hardware for communication and synchronization is much simpler:
  – Compared to communication in a shared memory model.
● Explicit communication:
  – Programs are simpler to understand, which helps reduce maintenance and development costs.
● Synchronization is implicit:
  – Naturally associated with sending/receiving messages.
  – Easier to debug.
Disadvantages of Message-Passing Communication
● Programmer has to write explicit message passing constructs.
  – Also, precisely identify the processes (or threads) with which communication is to occur.
● Explicit calls to the operating system:
  – Higher overhead.
DSM
● Physically separate memories are accessed as one logical address space.
● Processors running on a multicomputer system share their memory.
  – Implemented by the operating system.
● DSM multiprocessors are NUMA:
  – Access time depends on the exact location of the data.
Distributed Shared-Memory Architecture (DSM)
● Underlying mechanism is message passing:
  – Shared memory convenience is provided to the programmer by the operating system.
  – Basically, an operating system facility takes care of message passing implicitly.
● Advantage of DSM:
  – Ease of programming
Disadvantage of DSM
● High communication cost:
  – A program not specifically optimized for DSM by the programmer will perform extremely poorly.
  – Data (variables) accessed by specific program segments have to be collocated.
  – Useful only for process-level (coarse-grained) parallelism.
High Performance Computer Architecture
Symmetric Multiprocessors (SMPs)
Mr. SUBHASIS DASH
SCHOOL OF COMPUTER ENGINEERING,
KIIT UNIVERSITY, BHUBANESWAR
Symmetric Multiprocessors (SMPs)
● SMPs are a popular shared memory multiprocessor architecture:
  – Processors share memory and I/O.
  – Bus based: access time for all memory locations is equal --- “Symmetric MP”.
(Figure: four processors P, each with a cache, connected by a shared bus to main memory and the I/O system.)
SMPs: Some Insights
● In any multiprocessor, main memory access is a bottleneck:
  – Multilevel caches reduce the memory demand of a processor.
  – Multilevel caches in fact make it possible for more than one processor to meaningfully share the memory bus.
  – Hence multilevel caches are a must in a multiprocessor!
Different SMP Organizations
● Processor and cache on separate extension boards (1980s):
  – Plugged on to the backplane.
● Integrated on the main board (1990s):
  – 4 or 6 processors placed per board.
● Integrated on the same chip (multi-core) (2000s):
  – Dual core (IBM, Intel, AMD)
  – Quad core
Pros of SMPs
● Ease of programming:
  – Especially when communication patterns are complex or vary dynamically during execution.
Cons of SMPs
● As the number of processors increases, contention for the bus increases.
  – Scalability of the SMP model is restricted.
  – One way out may be to use switches (crossbar, multistage networks, etc.) instead of a bus.
  – Switches set up parallel point-to-point connections.
  – Again, switches are not without disadvantages: they make implementation of cache coherence difficult.
Why Multicores?
● Recollect the constraints on further increase in circuit complexity:
  – Clock skew and temperature.
● Use of more complex techniques to improve single-thread performance is limited.
● Any additional transistors have to be used in a different core.
Why Multicores? Cont…
● Multiple cores on the same physical packaging:
  – Execute different threads.
  – Switched off, if no thread to execute (power saving).
  – Dual core, quad core, etc.
Cache Organizations for Multicores
● L1 caches are always private to a core.
● L2 caches can be private or shared --- which is better?
(Figure: left, four cores P1–P4 each with a private L1 and a private L2; right, four cores P1–P4 each with a private L1 sharing a single L2.)
L2 Organizations
● Advantages of a shared L2 cache:
  – Efficient dynamic use of space by each core.
  – Data shared by multiple cores is not replicated.
  – Every block has a fixed “home” --- hence, easy to find the latest copy.
● Advantages of a private L2 cache:
  – Quick access to private L2.
  – Private bus to private L2, less contention.
An Important Problem with Shared-Memory: Coherence
● When shared data are cached:
  – They are replicated in multiple caches.
  – The data in the caches of different processors may become inconsistent.
● How to enforce cache coherency?
  – How does a processor know of changes in the caches of other processors?
The Cache Coherency Problem
(Figure: three processors P1, P2, P3 share a variable U whose value in memory is 5. P1 and P2 read U and cache the value 5; P3 then writes U = 7 into its own cache, leaving the copies at P1 and P2 stale.)
What value will P1 and P2 read?
Cache Coherence Solutions (Protocols)
● The key to maintaining cache coherence:
  – Track the state of sharing of every data block.
● Based on this idea, the following can be an overall solution:
  – Dynamically recognize any potential inconsistency at run-time and carry out preventive action.
Basic Idea Behind Cache Coherency Protocols
(Figure: four processors P, each with a cache, connected by a shared bus to main memory and the I/O system; the coherence protocol operates over this bus.)
Pros and Cons of the Solution
● Pro:
  – Consistency maintenance becomes transparent to programmers, compilers, as well as to the operating system.
● Con:
  – Increased hardware complexity.
Two Important Cache Coherency Protocols
● Snooping protocol:
  – Each cache “snoops” the bus to find out which data is being used by whom.
● Directory-based protocol:
  – Keep track of the sharing state of each data block using a directory.
  – A directory is a centralized register for all memory blocks.
  – Allows the coherency protocol to avoid broadcasts.
Snoopy and Directory-Based Protocols
(Figure: the same bus-based organization --- processors with caches, a shared bus, main memory, and the I/O system --- over which either protocol operates.)
Snooping vs. Directory-based Protocols
● Snooping protocol reduces memory traffic.
  – More efficient.
● Snooping protocol requires broadcasts:
  – Can meaningfully be implemented only when there is a shared bus.
  – Even when there is a shared bus, scalability is a problem.
  – Some workarounds have been tried: the Sun Enterprise server has up to 4 buses.
Snooping Protocol
● As soon as a request for any data block by a processor is put out on the bus:
  – Other processors “snoop” to check if they have a copy and respond accordingly.
● Works well with bus interconnection:
  – All transmissions on a bus are essentially broadcast:
    ● Snooping is therefore effortless.
  – Dominates almost all small-scale machines.
Categories of Snoopy Protocols
● Essentially two types:
  – Write Invalidate Protocol
  – Write Broadcast Protocol
● Write invalidate protocol:
  – When one processor writes to its cache, all other processors having a copy of that data block invalidate that block.
● Write broadcast:
  – When one processor writes to its cache, all other processors having a copy of that data block update that block with the recently written value.
Write Invalidate vs. Write Update Protocols
(Figure: the same bus-based SMP organization --- processors with caches, a shared bus, main memory, and the I/O system --- on which both protocols operate.)
Write Invalidate Protocol
● Handling a write to shared data:
  – An invalidate command is sent on the bus --- all caches snoop and invalidate any copies they have.
● Handling a read miss:
  – Write-through: memory is always up-to-date.
  – Write-back: snooping finds the most recent copy.
Write Invalidate in Write-Through Caches
● Simple implementation.
● Writes:
  – Write to shared data: broadcast on the bus; processors snoop and update any copies.
  – Read miss: memory is always up-to-date.
● Concurrent writes:
  – Write serialization is automatically achieved since the bus serializes requests.
  – The bus provides the basic arbitration support.
Write Invalidate versus Broadcast cont…
● Invalidate exploits spatial locality:
  – Only one bus transaction for any number of writes to the same block.
  – Obviously, more efficient.
● Broadcast has lower latency for writes and reads:
  – As compared to invalidate.
High Performance Computer Architecture
Cache Coherence Protocols
Mr. SUBHASIS DASH
SCHOOL OF COMPUTER ENGINEERING,
KIIT UNIVERSITY, BHUBANESWAR
An Example Snoopy Protocol
● Assume:
  – Invalidation protocol, write-back cache.
● Each block of memory is in one of the following states:
  – Shared: clean in all caches and up-to-date in memory; the block can be read.
  – Exclusive: this cache has the only copy; it is writeable and dirty.
  – Invalid: data present in the block is obsolete and cannot be used.
Implementation of the Snooping Protocol
● A cache controller at every processor would implement the protocol:
  – It has to perform specific actions:
    ● When the local processor requests certain things.
    ● Also, certain actions are required when certain addresses appear on the bus.
  – The exact actions of the cache controller depend on the state of the cache block.
  – Two FSMs can show the different types of actions to be performed by a controller.
Snoopy-Cache State Machine-I
● State machine considering only CPU requests, for each cache block.
(Figure: transitions between Invalid, Shared (read only), and Exclusive (read/write).)
  – Invalid → Shared: CPU read; place read miss on bus.
  – Invalid → Exclusive: CPU write; place write miss on bus.
  – Shared → Shared: CPU read hit; or CPU read miss, place read miss on bus.
  – Shared → Exclusive: CPU write; place write miss on bus.
  – Exclusive → Exclusive: CPU read hit or CPU write hit; or CPU write miss, write back the cache block and place write miss on bus.
  – Exclusive → Shared: CPU read miss; write back the block, place read miss on bus.
Snoopy-Cache State Machine-II
● State machine considering only bus requests, for each cache block.
(Figure: transitions between Invalid, Shared (read only), and Exclusive (read/write).)
  – Shared → Invalid: write miss for this block appears on the bus.
  – Exclusive → Invalid: write miss for this block; write back the block (abort memory access).
  – Exclusive → Shared: read miss for this block; write back the block (abort memory access).
Combined Snoopy-Cache State Machine
● State machine considering both CPU requests and bus requests, for each cache block.
(Figure: the union of the two preceding diagrams over the states Invalid, Shared (read only), and Exclusive (read/write), combining the CPU-request transitions of Machine-I with the bus-request transitions of Machine-II.)
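To make the two FSMs concrete, here is a simplified C sketch (my own illustration, not from the slides) of how a controller might react to local CPU requests and snooped bus requests for one cache block. Bus actions are reduced to placeholder functions, and details such as tag matching, data transfer, and arbitration are omitted.

```c
#include <stdio.h>

/* Per-block coherence state, as in the example snoopy protocol. */
typedef enum { INVALID, SHARED, EXCLUSIVE } BlockState;

/* Placeholder bus actions; a real controller would drive bus signals. */
static void place_read_miss_on_bus(void)  { puts("bus: read miss"); }
static void place_write_miss_on_bus(void) { puts("bus: write miss"); }
static void write_back_block(void)        { puts("bus: write back block"); }

/* Reaction to requests from the local CPU (state machine I). */
BlockState cpu_request(BlockState s, int is_write, int is_hit)
{
    switch (s) {
    case INVALID:
        if (is_write) { place_write_miss_on_bus(); return EXCLUSIVE; }
        place_read_miss_on_bus();
        return SHARED;
    case SHARED:
        if (is_write) { place_write_miss_on_bus(); return EXCLUSIVE; }
        if (!is_hit) place_read_miss_on_bus();   /* read miss: conflicting address */
        return SHARED;
    case EXCLUSIVE:
        if (is_hit) return EXCLUSIVE;            /* read hit or write hit */
        write_back_block();                      /* a miss replaces a dirty block */
        if (is_write) { place_write_miss_on_bus(); return EXCLUSIVE; }
        place_read_miss_on_bus();
        return SHARED;
    }
    return s;
}

/* Reaction to requests snooped on the bus (state machine II). */
BlockState bus_request(BlockState s, int is_write_miss)
{
    if (s == EXCLUSIVE) {
        write_back_block();                      /* supply data, abort memory access */
        return is_write_miss ? INVALID : SHARED;
    }
    if (s == SHARED && is_write_miss)
        return INVALID;
    return s;
}

int main(void)
{
    BlockState s = INVALID;
    s = cpu_request(s, 0, 0);   /* CPU read miss:       Invalid   -> Shared    */
    s = cpu_request(s, 1, 1);   /* CPU write:           Shared    -> Exclusive */
    s = bus_request(s, 0);      /* snooped read miss:   Exclusive -> Shared    */
    return 0;
}
```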
Directory-based Solution
● In NUMA computers:
  – Messages have long latency.
  – Also, broadcast is inefficient --- all messages have explicit responses.
● The main memory controller keeps track of:
  – Which processors have cached copies of which memory locations.
● On a write,
  – Only the sharers need to be informed, not everyone.
● On a dirty read,
  – Forward the request to the owner.
Directory Protocol
● Three states, as in the snoopy protocol:
  – Shared: 1 or more processors have the data; memory is up-to-date.
  – Uncached: no processor has the block.
  – Exclusive: 1 processor (the owner) has the block.
● In addition to the cache state,
  – Must track which processors have the data when in the shared state.
  – Usually implemented using a bit vector: bit i is 1 if processor i has a copy.
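A minimal sketch (my own illustration, not from the slides) of how a directory entry might be represented, with the sharing set kept as a bit vector over processor IDs; the 32-processor limit and helper names are assumptions for the example.

```c
#include <stdint.h>
#include <stdio.h>

typedef enum { UNCACHED, SHARED, EXCLUSIVE } DirState;

/* One directory entry per memory block. */
typedef struct {
    DirState state;
    uint32_t sharers;           /* bit i set => processor i has a copy */
} DirEntry;

static void add_sharer(DirEntry *e, int proc)    { e->sharers |= (1u << proc); }
static void clear_sharers(DirEntry *e)           { e->sharers = 0; }
static int  is_sharer(const DirEntry *e, int p)  { return (e->sharers >> p) & 1u; }

int main(void)
{
    DirEntry e = { UNCACHED, 0 };

    /* Processors 3 and 7 read the block: record them as sharers. */
    e.state = SHARED;
    add_sharer(&e, 3);
    add_sharer(&e, 7);
    printf("P3 sharer? %d, P5 sharer? %d\n", is_sharer(&e, 3), is_sharer(&e, 5));

    /* Processor 7 writes: invalidate other sharers, make it the sole owner. */
    clear_sharers(&e);
    add_sharer(&e, 7);
    e.state = EXCLUSIVE;
    return 0;
}
```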
Directory Behavior
● On a read:
  – Unused:
    ● give (exclusive) copy to requester
    ● record owner
  – Exclusive or shared:
    ● send share message to current exclusive owner
    ● record owner
    ● return value
  – Exclusive dirty:
    ● forward read request to exclusive owner.
Directory Behavior
● On a write:
  – Send invalidate messages to all hosts caching the value.
● On write-through/write-back:
  – Update the value.
CPU-Cache State Machine
● State machine for CPU requests, for each memory block.
● The block is in the Invalid state if it resides only in memory.
(Figure: transitions between Invalid, Shared (read only), and Exclusive (read/write).)
  – Invalid → Shared: CPU read; send read miss message to the home directory.
  – Invalid → Exclusive: CPU write; send write miss message to the home directory.
  – Shared → Shared: CPU read hit.
  – Shared → Exclusive: CPU write; send write miss message to the home directory.
  – Shared → Invalid: invalidate, or miss due to address conflict.
  – Exclusive → Exclusive: CPU read hit or CPU write hit.
  – Exclusive → Shared: fetch; send data write-back message to the home directory.
  – Exclusive → Invalid: fetch/invalidate, or miss due to address conflict; send data write-back message to the home directory.
State Transition Diagram for the Directory
● Tracks all copies of a memory block.
● Same states as the transition diagram for an individual cache.
● Memory controller actions:
  – Update of the directory state.
  – Send messages to satisfy requests.
  – Also indicates an action that updates the sharing set, Sharers, as well as sending a message.
Directory State Machine
● State machine for directory requests, for each memory block.
● The block is in the Uncached state if it resides only in memory.
(Figure: transitions between Uncached, Shared (read only), and Exclusive (read/write).)
  – Uncached → Shared: read miss; Sharers = {P}; send Data Value Reply.
  – Uncached → Exclusive: write miss; Sharers = {P}; send Data Value Reply msg.
  – Shared → Shared: read miss; Sharers += {P}; send Data Value Reply.
  – Shared → Exclusive: write miss; send Invalidate to Sharers, then Sharers = {P}; send Data Value Reply msg.
  – Exclusive → Uncached: data write back; Sharers = {}; (write back block).
  – Exclusive → Shared: read miss; Sharers += {P}; send Fetch to the owner; send Data Value Reply msg to the remote cache; (write back block).
  – Exclusive → Exclusive: write miss; Sharers = {P}; send Fetch/Invalidate to the old owner; send Data Value Reply msg to the remote cache.
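The directory's transitions can also be written as a request handler. The sketch below (again my own illustration, reusing the bit-vector idea from the earlier directory-entry sketch) handles read and write misses for one block at the home node, with message sends reduced to printouts.

```c
#include <stdint.h>
#include <stdio.h>

typedef enum { UNCACHED, SHARED, EXCLUSIVE } DirState;
typedef struct { DirState state; uint32_t sharers; } DirEntry;

/* Placeholder message sends; a real home node would use the interconnect. */
static void send_data_value_reply(int p)   { printf("Data Value Reply -> P%d\n", p); }
static void send_invalidate(uint32_t set)  { printf("Invalidate -> sharers 0x%x\n", set); }
static void send_fetch(uint32_t owner)     { printf("Fetch -> owner 0x%x\n", owner); }

/* Handle a read miss (is_write == 0) or write miss (is_write == 1) from processor p. */
void directory_handle(DirEntry *e, int p, int is_write)
{
    switch (e->state) {
    case UNCACHED:
        e->sharers = 1u << p;                    /* Sharers = {P} */
        e->state = is_write ? EXCLUSIVE : SHARED;
        break;
    case SHARED:
        if (is_write) {
            send_invalidate(e->sharers);         /* invalidate all current sharers */
            e->sharers = 1u << p;
            e->state = EXCLUSIVE;
        } else {
            e->sharers |= 1u << p;               /* Sharers += {P} */
        }
        break;
    case EXCLUSIVE:
        send_fetch(e->sharers);                  /* fetch (and invalidate, on a write) the old owner */
        if (is_write) {
            e->sharers = 1u << p;                /* new owner, stays Exclusive */
        } else {
            e->sharers |= 1u << p;
            e->state = SHARED;
        }
        break;
    }
    send_data_value_reply(p);                    /* every miss ends with a Data Value Reply */
}

int main(void)
{
    DirEntry e = { UNCACHED, 0 };
    directory_handle(&e, 3, 0);   /* P3 read miss:  Uncached  -> Shared    */
    directory_handle(&e, 7, 1);   /* P7 write miss: Shared    -> Exclusive */
    directory_handle(&e, 3, 0);   /* P3 read miss:  Exclusive -> Shared    */
    return 0;
}
```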