Multiprocessors and Multicomputers
Categories of Parallel Computers
Considering their architecture only, there are two
main categories of parallel computers:
systems with shared common memories, and
systems with unshared distributed memories.
Shared-Memory Multiprocessors
Shared-memory multiprocessor models:
Uniform-memory-access (UMA)
Nonuniform-memory-access (NUMA)
Cache-only memory architecture (COMA)
These systems differ in how the memory and
peripheral resources are shared or distributed.
The UMA Model - 1
Physical memory uniformly shared by all
processors, with equal access time to all words.
Processors may have local cache memories.
Peripherals also shared in some fashion.
Tightly coupled systems use a common bus,
crossbar, or multistage network to connect
processors, peripherals, and memories.
Many manufacturers have multiprocessor (MP)
extensions of uniprocessor (UP) product lines.
The UMA Model - 2
Synchronization and communication among
processors achieved through shared variables in
common memory.
Symmetric MP systems – all processors have
access to all peripherals, and any processor can
run the OS and I/O device drivers.
Asymmetric MP systems – not all peripherals
accessible by all processors; kernel runs only on
selected processors (master); others are called
attached processors (AP).
The UMA Multiprocessor Model
[Figure: processors P1, P2, …, Pn connected through a system interconnect (bus, crossbar, or multistage network) to shared-memory modules SM1 … SMm and shared I/O.]
Example: Performance Calculation
Consider two loops. The first loop adds
corresponding elements of two N-element vectors
to yield a third vector. The second loop sums
elements of the third vector. Assume each
add/assign operation takes 1 cycle, and ignore
time spent on other actions (e.g. loop counter
incrementing/testing, instruction fetch, etc.).
Assume interprocessor communication requires k
cycles.
On a sequential system, each loop will require N
cycles, for a total of 2N cycles of processor time.
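The two loops can be sketched in plain Python, with a counter standing in for the 1-cycle add/assign cost; the small N and the vector contents are invented for illustration (the worked example uses a much larger N):

```python
# A sketch of the two loops in the example, with a counter standing in
# for the 1-cycle add/assign cost. N and the vector contents are
# invented for illustration.
N = 8
a = list(range(N))          # first input vector
b = list(range(N))          # second input vector
c = [0] * N                 # result vector
cycles = 0

# Loop 1: element-wise add, N cycles
for i in range(N):
    c[i] = a[i] + b[i]
    cycles += 1

# Loop 2: sum the elements of the result vector, N cycles
total = 0
for i in range(N):
    total += c[i]
    cycles += 1

print(cycles)   # 16, i.e. 2N cycles on a sequential machine
```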
Example: Performance Calculation
On an M-processor system, we can partition each loop into
M parts, each having L = N / M add/assigns requiring L
cycles. The total time required is thus 2L. This leaves us
with M partial sums that must be totaled.
Computing the final sum from the M partial sums requires
log2(M) additions, each requiring k cycles (to access a
non-local term) plus 1 cycle (for the add/assign), for a
total of (k + 1) log2(M) cycles.
The parallel computation thus requires
2N / M + (k + 1) log2(M) cycles.
Example: Performance Calculation
Assume N = 2²⁰.
Sequential execution requires 2N = 2²¹ cycles.
If processor synchronization requires k = 200 cycles, and
we have M = 256 processors, parallel execution requires
2N / M + (k + 1) log2(M)
= 2²¹ / 2⁸ + 201 × 8
= 2¹³ + 1608 = 9800 cycles
Comparing results, the parallel solution is about 214 times
faster than the sequential one, the best theoretical speedup
being 256 (since there are 256 processors). Thus the
efficiency of the parallel solution is 214 / 256 ≈ 83.6 %.
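As a check, the arithmetic of this example can be reproduced directly, using only the numbers given above:

```python
import math

# Reproducing the worked example: N = 2**20 elements, M = 256
# processors, k = 200 cycles of interprocessor communication.
N, M, k = 2**20, 256, 200

t_seq = 2 * N                                      # 2^21 = 2097152 cycles
t_par = 2 * N // M + (k + 1) * int(math.log2(M))   # 8192 + 1608 cycles
speedup = t_seq / t_par
efficiency = speedup / M

print(t_par)                        # 9800
print(round(speedup))               # 214
print(round(100 * efficiency, 1))   # 83.6
```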
The NUMA Model - 1
Shared memories, but access time depends on the location
of the data item.
The shared memory is distributed among the processors as
local memories, but each of these is still accessible by all
processors (with varying access times).
Memory access is fastest from the locally-connected
processor, with the interconnection network adding delays
for other processor accesses.
Additionally, there may be global memory in a
multiprocessor system, with two separate interconnection
networks, one for clusters of processors and their cluster
memories, and another for the global shared memories.
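A hypothetical cost model makes the NUMA access tiers concrete; the cycle counts and the `access_cost` helper are invented for illustration, not taken from the text:

```python
# Hypothetical NUMA access-cost model; the cycle counts are invented
# for illustration.
LOCAL, REMOTE, GLOBAL = 1, 10, 25    # cycles per access (illustrative)

def access_cost(requester, home, is_global=False):
    """Cost for `requester` to read a word whose home memory is `home`."""
    if is_global:
        return GLOBAL                # via the global interconnect
    return LOCAL if requester == home else REMOTE

print(access_cost(0, 0))                  # 1:  own local memory
print(access_cost(0, 3))                  # 10: another processor's memory
print(access_cost(0, 3, is_global=True))  # 25: global shared memory
```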
Shared Local Memories
[Figure: processors P1 … Pn, each with a local memory LM1 … LMn, all joined by an interconnection network so that every local memory remains accessible to every processor.]
Hierarchical Cluster Model
[Figure: clusters of processors (P), each cluster with its own cluster interconnection network (CIN) and cluster shared memories (CSM); the clusters are connected through a global interconnection network to global shared memories (GSM).]
The COMA Model
In the COMA model, processors only have cache
memories; the caches, taken together, form a
global address space.
Each cache has an associated directory that aids
remote machines in their lookups; hierarchical
directories may exist in machines based on this
model.
Initial data placement is not critical, as cache
blocks will eventually migrate to where they are
needed.
Cache-Only Memory Architecture
[Figure: processors (P), each with a cache (C) and cache directory (D), connected by an interconnection network; there is no separate main memory.]
Other Models
There can be other models used for multiprocessor
systems, based on a combination of the models
just presented. For example:
cache-coherent non-uniform memory access (each
processor has a cache directory, and the system has a
distributed shared memory)
cache-coherent cache-only model (processors have
caches, no shared memory, caches must be kept
coherent).
Distributed-Memory Multicomputer Models
Multicomputers consist of multiple computers, or nodes,
interconnected by a message-passing network.
Each node is autonomous, with its own processor and local
memory, and sometimes local peripherals.
The message-passing network provides point-to-point static
connections among the nodes.
Local memories are not shared, so traditional
multicomputers are sometimes called no-remote-memory-access
(NORMA) machines.
Inter-node communication is achieved by passing
messages through the static connection network.
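A minimal sketch of the NORMA idea, using Python threads and a `queue.Queue` to stand in for a static point-to-point channel (not a model of any particular machine):

```python
import queue
import threading

# Minimal NORMA sketch: two autonomous nodes with private memories,
# connected by one point-to-point channel. A queue.Queue stands in
# for the hardware link.
channel = queue.Queue()

def node_a():
    local_mem = [1, 2, 3]        # private; node B cannot read this directly
    channel.put(sum(local_mem))  # data moves only as an explicit message

def node_b(result):
    result.append(channel.get()) # receiving the message is the only way
                                 # for B to observe A's data

result = []
ta = threading.Thread(target=node_a)
tb = threading.Thread(target=node_b, args=(result,))
ta.start(); tb.start()
ta.join(); tb.join()
print(result[0])   # 6
```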
Generic Message-Passing Multicomputer
[Figure: nodes, each a processor (P) with its own local memory (M), connected by a message-passing interconnection network.]
Multicomputer Generations
Each multicomputer uses routers and channels in its
interconnection network; heterogeneous systems may mix
node types while using uniform data representations and
communication protocols.
First generation: hypercube architecture, software-
controlled message switching, processor boards.
Second generation: mesh-connected architecture,
hardware message switching, software for medium-grain
distributed computing.
Third generation: fine-grained distributed computing, with
each VLSI chip containing the processor and
communication resources.
Multivector and SIMD Computers
Vector computers are often built as a scalar processor
with an optional attached vector processor.
All data and instructions are stored in central memory,
all instructions are decoded by the scalar control
unit, and all scalar instructions are handled by the
scalar processor.
When a vector instruction is decoded, it is sent to
the vector processor’s control unit which
supervises the flow of data and execution of the
instruction.
Vector Processor Models
In register-to-register models, a fixed number of
possibly reconfigurable registers hold all vector
operands, intermediate results, and final vector
results. All registers are accessible to user
instructions.
In a memory-to-memory vector processor, primary
memory holds operands and results; a vector
stream unit accesses memory for fetches and
stores in units of large superwords (e.g. 512 bits).
SIMD Supercomputers
Operational model is a 5-tuple (N, C, I, M, R).
N = number of processing elements (PEs).
C = set of instructions executed directly by the control
unit (including scalar and flow-control instructions).
I = set of instructions broadcast to all PEs for parallel
execution.
M = set of masking schemes used to partition PEs into
enabled/disabled states.
R = set of data-routing functions to enable inter-PE
communication through the interconnection network.
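A toy instance of the (N, C, I, M, R) model can illustrate broadcast, masking, and routing; the names `broadcast` and `route_shift` and the 4-PE data values are invented for illustration:

```python
# Toy instance of the (N, C, I, M, R) operational model with N = 4
# processing elements; names and data values are invented.
N_PE = 4
pe_mem = [1, 2, 3, 4]            # one data word per PE

def broadcast(op, mask):
    """I: one instruction broadcast to all PEs; M: a masking scheme."""
    for i in range(N_PE):
        if mask[i]:              # only enabled PEs execute
            pe_mem[i] = op(pe_mem[i])

def route_shift(offset):
    """R: a data-routing function -- circular shift over the network."""
    global pe_mem
    pe_mem = [pe_mem[(i - offset) % N_PE] for i in range(N_PE)]

broadcast(lambda x: x * 10, mask=[True, False, True, False])
print(pe_mem)    # [10, 2, 30, 4]
route_shift(1)
print(pe_mem)    # [4, 10, 2, 30]
```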
Operational Model of SIMD Computer
[Figure: a control unit driving an array of processing elements, each with its own memory (P/M pairs), connected by an interconnection network.]