Multiprocessors
Mr. A. B. Shinde
Electronics Engineering
Contents…
 Symmetric and distributed shared-memory architectures
 Cache coherence issues
 Performance issues
 Synchronization issues
 Models of memory consistency
 Interconnection networks:
 buses,
 crossbars, and
 multi-stage switches
Taxonomy of Parallel Architectures
 In 1966, Flynn proposed a simple model for categorizing all
computers.
 He used the parallelism in the instruction and data streams and
placed all computers into one of four categories:
 Single Instruction stream, Single Data stream (SISD)
 Single Instruction stream, Multiple Data streams (SIMD)
 Multiple Instruction streams, Single Data stream (MISD)
 Multiple Instruction streams, Multiple Data streams (MIMD)
SISD
 This category is the uniprocessor.
 In computing, SISD refers to a computer architecture in which a single processor (a uniprocessor) executes a single instruction stream, operating on data stored in a single memory.
 This corresponds to the von Neumann architecture.
 Instruction fetching and pipelined execution of instructions are common examples found in most modern SISD computers.
SISD
 This is the oldest style of computer architecture, and still one of the most important: all personal computers fit within this category.
 Single instruction refers to the fact that there is only one instruction stream being acted on by the CPU during any one clock tick;
 Single data means, analogously, that one and only one data stream is employed as input during any one clock tick.
SIMD
 The same instruction is executed by multiple processors using
different data streams.
 SIMD computers exploit data-level parallelism by applying the same
operations to multiple items of data in parallel.
 Each processor has its own data memory (hence multiple data), but
there is a single instruction memory and control processor, which
fetches and dispatches instructions.
SIMD
 SIMD machines are capable of applying
the exact same instruction stream to
multiple streams of data
simultaneously.
 This type of architecture is perfectly
suited to achieving very high
processing rates, as the data can be
split into many different independent
pieces, and the multiple instruction units
can all operate on them at the same time.
SIMD
SIMD Processable Patterns vs. SIMD Unprocessable Patterns
Example: Brightness Computation by SIMD Operations
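The brightness example lends itself to a short sketch. Below is a hedged C version using x86 SSE2 intrinsics from <emmintrin.h> (the function name and pixel layout are illustrative assumptions): a single instruction adds the same brightness offset to 16 pixels at once, saturating at 255.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Add 'delta' to every 8-bit pixel, 16 pixels per instruction:
 * one instruction stream applied to many data items (SIMD). */
void brighten(unsigned char *pixels, int n, unsigned char delta) {
    __m128i d = _mm_set1_epi8((char)delta);            /* 16 copies of delta */
    int i;
    for (i = 0; i + 16 <= n; i += 16) {
        __m128i p = _mm_loadu_si128((__m128i *)(pixels + i));
        p = _mm_adds_epu8(p, d);                       /* saturating add on 16 bytes */
        _mm_storeu_si128((__m128i *)(pixels + i), p);
    }
    for (; i < n; i++) {                               /* scalar tail */
        int v = pixels[i] + delta;
        pixels[i] = (unsigned char)(v > 255 ? 255 : v);
    }
}
```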
MISD
 In computing, MISD is a type of parallel
computing architecture where many
functional units perform different
operations on the same data.
 Pipeline architectures belong to this
type.
 Fault-tolerant computers executing the
same instructions redundantly in order to
detect and mask errors, in a manner
known as task replication, may be
considered to belong to this type.
 Not many instances of this
architecture exist, as MIMD and
SIMD are often more appropriate for
common data parallel techniques.
MISD
 Another example of a MISD process is one carried out routinely at the United Nations.
 When a delegate speaks in a language of his/her choice, the speech is simultaneously translated into a number of other languages for the benefit of the other delegates present. Thus the delegate's speech (a single data stream) is being processed by a number of translators (processors), yielding different results.
No commercial multiprocessor of this type has been built to date.
MIMD
 Each processor fetches its own instructions
and operates on its own data.
 MIMD computers exploit thread-level
parallelism, since multiple threads operate
in parallel.
 In general, thread-level parallelism is
more flexible than data-level parallelism
and thus more generally applicable.
 Machines using MIMD have a number of
processors that function asynchronously
and independently.
 At any time, different processors may be
executing different instructions on different
pieces of data.
MIMD
 Two other factors have also contributed to the rise of the MIMD
multiprocessors:
1. MIMDs offer flexibility. With the correct hardware and software
support, MIMDs can function as single-user multiprocessors.
2. MIMDs can build on the cost-performance advantages of off-the-
shelf processors. Multicore chips leverage the design investment in a
single processor core by replicating it.
Shared-Memory Multiprocessor
Basic structure of a centralized shared-memory multiprocessor
Distributed-Memory Multiprocessor
Basic architecture of a distributed-memory multiprocessor
Symmetric Shared-Memory Architectures
 Symmetric shared-memory machines usually support the caching of
both shared and private data.
 Private data are used by a single processor, while shared data are
used by multiple processors.
 When a private item is cached, its location is migrated to the cache,
reducing the average access time as well as the memory bandwidth
required.
Symmetric Shared-Memory Architectures
 When shared data are cached, the shared value may be replicated in
multiple caches.
 In addition to the reduction in access latency and required memory
bandwidth, this replication also provides a reduction in contention that
may exist for shared data items.
 Caching of shared data, however, introduces a new problem: cache coherence.
Multiprocessor Cache Coherence
 The cache coherence problem arises because the view of memory held by two different processors is through their individual caches; without any additional precautions, the two processors could end up seeing two different values.

The figure illustrates the problem and shows how two different processors can have two different values for the same location.
This difficulty is generally referred to as the cache coherence problem.
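The figure itself does not survive in this text dump; the following commented sketch reconstructs the standard scenario it depicts (write-back caches with no coherence mechanism; the values are illustrative):

```c
/* Shared location X, initially 1 in memory.
 *
 * Time  Event                    Cache A   Cache B   Memory[X]
 *  1    CPU A reads X               1         -          1
 *  2    CPU B reads X               1         1          1
 *  3    CPU A stores 0 into X       0         1          1
 *
 * After time 3, a read of X by CPU B still returns the stale
 * value 1: the two processors see different values for X.     */
```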
Multiprocessor Cache Coherence
 A memory system is coherent if any read of a data item returns the most recently written value of that data item.
 This definition is vague and simplistic; the reality is much more complex.
 This simple definition contains two different aspects of memory system behavior, both of which are critical to writing correct shared-memory programs.
 The first aspect, called coherence, defines what values can be returned by a read.
 The second aspect, called consistency, determines when a written value will be returned by a read.
Multiprocessor Cache Coherence
 A memory system is coherent if
1. A read by a processor P to a location X that follows a write by P to X,
with no writes of X by another processor occurring between the write and
the read by P, always returns the value written by P.
2. A read by a processor to location X that follows a write by another
processor to X returns the written value if the read and write are
sufficiently separated in time and no other writes to X occur between the
two accesses.
3. Writes to the same location are serialized; that is, two writes to the
same location by any two processors are seen in the same order by
all processors.
 For example: If the values 1 and then 2 are written to a location,
processors can never read the value of the location as 2 and then later
read it as 1.
Multiprocessor Cache Coherence
 The question of when a written value will be seen is also important.
 We cannot require that a read of X instantaneously see the value written
for X by some other processor.
 For example: A write of X on one processor precedes a read of X on
another processor by a very small time, it may be impossible to
ensure that the read returns the value of the data written, since the
written data may not even have left the processor at that point.
 The issue of exactly when a written value must be seen by a reader is
defined by a memory consistency model.
Multiprocessor Cache Coherence
 Coherence and consistency are complementary:
 Coherence defines the behavior of reads and writes to the same
memory location, while
 Consistency defines the behavior of reads and writes with respect to
accesses to other memory locations.
Multiprocessor Cache Coherence
 Coherence and Consistency are complementary:
 Make the following two assumptions:
 First: A write does not complete (and allow the next write to occur) until
all processors have seen the effect of that write.
 Second: The processor does not change the order of any write with
respect to any other memory access.
 These two conditions mean that, if a processor writes location A
followed by location B, any processor that sees the new value of B
must also see the new value of A.
 These restrictions allow the processor to reorder reads, but force the processor to finish writes in program order.
Basic Schemes for Enforcing Coherence
 The coherence problems for multiprocessors and for I/O are similar in origin but have different characteristics.
 In I/O, multiple data copies are rare, whereas a program running on multiple processors will normally have copies of the same data in several caches.
 In a coherent multiprocessor, the caches provide both migration and
replication of shared data items.
 Coherent caches provide migration, since a data item can be moved
to a local cache and used there in a transparent fashion.
 This migration reduces both the latency to access a shared data item
and the bandwidth demand on the shared memory.
Basic Schemes for Enforcing Coherence
 Coherent caches also provide replication for shared data, since the
caches make a copy of the data item in the local cache.
 Replication reduces both latency of access and contention for a
read shared data item.
 Supporting this migration and replication is critical to performance in
accessing shared data.
 Small-scale multiprocessors adopt a hardware solution by
introducing a protocol to maintain coherent caches.
 The protocols used to maintain coherence for multiple processors
are called cache coherence protocols.
 Key to implementing a cache coherence protocol is tracking the state of
any sharing of a data block.
Basic Schemes for Enforcing Coherence
 There are two classes of protocols, which use different techniques
to track the sharing status:
 Directory based:
 The sharing status of a block of physical memory is kept in just one
location, called the directory.
 Directory-based coherence has slightly higher implementation
overhead than snooping, but it can scale to larger processor counts.
 The Sun T1 design uses directories.
Basic Schemes for Enforcing Coherence
 There are two classes of protocols, which use different techniques
to track the sharing status:
 Snooping:
 Every cache that has a copy of the data from a block of physical
memory also has a copy of the sharing status of the block.
 The caches are all accessible via some broadcast medium (a bus or
switch), and all cache controllers monitor or snoop on the medium to
determine whether or not they have a copy of a block that is requested
on a bus or switch access.
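The slides do not name a specific snooping protocol; as a hedged illustration, here is a minimal C sketch of the per-block sharing state an MSI-style write-invalidate snooping controller tracks (the state names and handler functions are assumptions, not from the deck):

```c
/* Per-block sharing state in an MSI-style snooping cache. */
typedef enum {
    INVALID,   /* no valid copy in this cache                  */
    SHARED,    /* clean copy; other caches may hold it too     */
    MODIFIED   /* dirty, exclusive copy; memory is out of date */
} BlockState;

/* Snooped a write (invalidate) for a block this cache holds:
 * any local copy becomes stale. */
BlockState snoop_remote_write(BlockState s) {
    (void)s;
    return INVALID;
}

/* Snooped a read for a block this cache holds MODIFIED:
 * supply or write back the dirty data, then downgrade. */
BlockState snoop_remote_read(BlockState s) {
    if (s == MODIFIED)
        return SHARED;   /* after writing the block back */
    return s;
}
```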
Performance of Symmetric Shared-Memory
 In a multiprocessor using a snoopy coherence protocol, several
different phenomena combine to determine performance.
 The overall cache performance is a combination of the behavior of
uniprocessor cache miss traffic and the traffic caused by
communication, which results in invalidations and subsequent cache
misses.
 Changing the processor count, cache size, and block size can affect
these two components of the miss rate.
 The misses arising from interprocessor communication, called coherence misses, can be broken into two separate sources.
Performance of Symmetric Shared-Memory
 Coherence misses have two separate sources:
 First source is the true sharing misses that arise from the
communication of data through the cache coherence mechanism.
 In an invalidation based protocol, the first write by a processor to a
shared cache block causes an invalidation to establish ownership
of that block.
 Additionally, when another processor attempts to read a modified
word in that cache block, a miss occurs and the resultant block is
transferred.
 Both these misses are classified as true sharing misses since they
directly arise from the sharing of data among processors.
Performance of Symmetric Shared-Memory
 Coherence misses have two separate sources:
 The second source, false sharing, arises from the use of an invalidation-based coherence algorithm with a single valid bit per cache block.
 False sharing occurs when a block is invalidated because some word in the block, other than the one being read, is written into.
 If the word being written and the word read are different and the
invalidation does not cause a new value to be communicated, but
only causes an extra cache miss, then it is a false sharing miss.
 In a false sharing miss, the block is shared, but no word in the cache is actually shared.
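To make this concrete, here is a hedged C sketch (the 64-byte block size, iteration counts, and names are assumptions): two threads repeatedly write different words that happen to share one cache block, so each write invalidates the other core's copy even though no word is truly shared; padding each counter onto its own block removes the coherence misses.

```c
#include <pthread.h>

/* Two independent counters in one cache block: every write by one
 * thread invalidates the other core's copy -- false sharing. */
struct { long a; long b; } hot;

/* Fix: pad so each counter occupies its own (assumed 64-byte) block. */
struct { long a; char pad[64 - sizeof(long)]; long b; } cold;

void *bump_a(void *arg) {
    (void)arg;
    for (long i = 0; i < 10000000; i++) hot.a++;
    return 0;
}

void *bump_b(void *arg) {
    (void)arg;
    for (long i = 0; i < 10000000; i++) hot.b++;
    return 0;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, 0, bump_a, 0);
    pthread_create(&t2, 0, bump_b, 0);
    pthread_join(t1, 0);
    pthread_join(t2, 0);
    return 0;
}
```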
Distributed Shared Memory
 A snooping protocol requires communication with all caches on
every cache miss, including writes of potentially shared data.
 The absence of any centralized data structure that tracks the state
of the caches is the fundamental advantage of a snooping-based
scheme.
Distributed Shared Memory
 For example:
 With 16 processors, a block size of 64 bytes, and a 512 KB data
cache, the total bus bandwidth demand (ignoring stall cycles) for the
four programs in the scientific/technical workload, ranges from about 4
GB/sec to about 170 GB/sec.
 In comparison, the memory bandwidth of the highest-performance
centralized shared-memory 16-way multiprocessor in 2006 was 2.4
GB/sec per processor.
 In 2006, multiprocessors with a distributed-memory model were available with over 12 GB/sec per processor to the nearest memory.
Distributed Shared Memory
 We can increase the memory bandwidth and interconnection
bandwidth by distributing the memory as shown in figure;
 This immediately separates local memory traffic from remote
memory traffic, reducing the bandwidth demands on the memory
system and on the interconnection network.
Distributed Shared Memory
 Unless we eliminate the need for the coherence protocol to broadcast on every cache miss, distributing the memory will gain little in performance.
 The alternative to a snoop-based coherence protocol is a directory
protocol.
 A directory keeps the state of every block that may be cached.
 Information in the directory includes which caches have copies of the
block, whether it is dirty, and so on.
 A directory protocol can also be used to reduce the bandwidth demands in a centralized shared-memory machine.
Distributed Shared Memory
 The simplest directory implementations associate an entry in the directory with each memory block.
 In such implementations, the amount of information is proportional
to the product of the number of memory blocks and the number of
processors.
 This overhead is not a problem for multiprocessors with fewer than about 200 processors, because the directory overhead with a reasonable block size will be tolerable.
 For larger multiprocessors, we need methods to allow the directory
structure to be efficiently scaled.
 The methods used either try to keep information for fewer blocks or
try to keep fewer bits per entry by using individual bits to stand for a
small collection of processors.
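As a hedged sketch of the full-bit-vector organization just described (field names and widths are illustrative assumptions), each memory block gets one directory entry holding its sharing state plus a presence bit per processor:

```c
#include <stdint.h>

#define NPROCS 64            /* illustrative processor count */

/* Directory entry kept per memory block (full bit-vector scheme). */
typedef enum { UNCACHED, SHARED_CLEAN, EXCLUSIVE_DIRTY } DirState;

typedef struct {
    DirState state;
    uint64_t sharers;        /* bit i set => processor i holds a copy */
} DirEntry;

/* Storage grows as (#memory blocks) x (#processors) presence bits,
 * which is why larger machines keep information for fewer blocks
 * or fewer bits per entry, as described above. */
```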
Distributed Shared Memory
 To prevent the directory from becoming the bottleneck, the directory
is distributed along with the memory, so that different directory
accesses can go to different directories, just as different memory
requests go to different memories.
 A distributed directory retains the characteristic that the sharing
status of a block is always in a single known location.
 This property allows the coherence protocol to avoid broadcast.
Synchronization
Synchronization: The Basics
 Synchronization mechanisms are typically built with user-level
software routines that rely on hardware-supplied synchronization
instructions.
 For smaller multiprocessors, the key hardware capability is an uninterruptible instruction sequence capable of atomically retrieving and changing a value. Software synchronization mechanisms are then constructed using this capability.
 Lock and Unlock are the synchronization operations.
 Lock and Unlock can be used to create mutual exclusion, as well as
to implement more complex synchronization mechanisms.
Synchronization: The Basics
 Synchronization mechanisms are typically built with user-level
software routines that rely on hardware-supplied synchronization
instructions.
 In larger-scale multiprocessors, synchronization can become a
performance bottleneck because contention introduces additional
delays and because latency is potentially greater in such a
multiprocessor.
Synchronization: The Basics
 Basic Hardware Primitives:
 The key ability required to implement synchronization in a multiprocessor is a set of hardware primitives able to atomically read and modify a memory location.
 Without such a capability, the cost of building basic synchronization primitives will be too high and will increase as the processor count increases.
Synchronization: The Basics
 Basic Hardware Primitives:
 These hardware primitives are the basic building blocks that are
used to build a wide variety of user-level synchronization
operations, including things such as locks and barriers.
 In general, architects do not expect users to employ the basic hardware primitives, but instead expect that the primitives will be used by system programmers to build a synchronization library.
Synchronization: The Basics
 Basic Hardware Primitives:
 One typical operation for building synchronization operations is the
atomic exchange, which interchanges a value in a register for a value in
memory.
 Assume that we want to build a simple lock where the value 0 is used
to indicate that the lock is free and 1 is used to indicate that the lock
is unavailable.
 A processor tries to set the lock by exchanging a 1, which is in a register, with the memory address corresponding to the lock.
 The value returned from the exchange instruction is 1 if some other processor had already claimed access and 0 otherwise.
 In the latter case, the value is also changed to 1, preventing any competing exchange from also retrieving a 0.
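A minimal C sketch of such a lock. The deck describes the atomic exchange abstractly; here the GCC/Clang __atomic_exchange_n builtin stands in for the hardware primitive (that choice is an assumption):

```c
/* Spin lock built on atomic exchange: 0 = free, 1 = held. */
static int lock = 0;

void acquire(void) {
    /* Atomically swap a 1 into the lock; the returned old value
     * says whether some other processor already held it. */
    while (__atomic_exchange_n(&lock, 1, __ATOMIC_ACQUIRE) == 1)
        ;  /* spin: exchange returned 1, so the lock was taken */
}

void release(void) {
    __atomic_store_n(&lock, 0, __ATOMIC_RELEASE);  /* free the lock */
}
```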
Models of Memory Consistency
Models of Memory Consistency
 Cache coherence ensures that multiple processors see a consistent
view of memory.
 Since processors communicate through shared variables, the question arises: In what order must a processor observe the data writes of another processor?
 A processor “observes” the writes of another processor through its reads.
Models of Memory Consistency
 Consider two code segments from processes P1 and P2…
Assume that the processes are running on different processors, and that locations A and B are originally cached by both processors with the initial value of 0.
If writes always take immediate effect and are immediately seen by other processors, it will be impossible for both if statements (labelled L1 and L2) to evaluate their conditions as true.
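The code segments themselves do not survive in this text dump; the following reconstructs the standard example the discussion assumes (the names A, B, L1, and L2 come from the slide text):

```c
int A = 0, B = 0;   /* both locations start at 0, cached by both processors */

void P1(void) {           /* runs on processor 1 */
    A = 1;
    if (B == 0) {         /* L1 */
        /* ... */
    }
}

void P2(void) {           /* runs on processor 2 */
    B = 1;
    if (A == 0) {         /* L2 */
        /* ... */
    }
}
```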
Models of Memory Consistency
 The question is, Should this behavior be allowed, and if so, under
what conditions?
 The most straightforward model for memory consistency is called
sequential consistency.
 Sequential consistency requires that the result of any execution be
the same as if the memory accesses executed by each processor
were kept in order and the accesses among different processors were
arbitrarily interleaved.
 Sequential consistency eliminates the possibility of some
nonobvious execution (previous example) because the assignments
must be completed before the if statements are initiated.
Models of Memory Consistency
 The question is, Should this behavior be allowed, and if so, under
what conditions?
 The simplest way to implement sequential consistency is to require a
processor to delay the completion of any memory access until all the
invalidations caused by that access are completed.
 Memory consistency involves operations among different variables:
The two accesses that must be ordered are actually to different memory
locations.
 In our example, we must delay the read of A or B (A == 0 or B == 0) until
the previous write has completed (B = 1 or A = 1).
Models of Memory Consistency
 Relaxed Consistency Models:
 The key idea in relaxed consistency models is to allow reads and
writes to complete out of order, but to use synchronization
operations to enforce ordering, so that a synchronized program
behaves as if the processor were sequentially consistent.
 There are a variety of relaxed models that are classified according to
what read and write orderings they relax.
 We specify the orderings by a set of rules of the form X→Y, meaning that
operation X must complete before operation Y is done.
 Sequential consistency requires maintaining all four possible orderings:
R→W, R→R, W→R, and W→W.
Models of Memory Consistency
 The relaxed models are defined by which of the four orderings they relax:
1. Relaxing the W→R ordering yields a model known as total store
ordering or processor consistency. Because this ordering retains
ordering among writes, many programs that operate under sequential
consistency operate under this model, without additional synchronization.
2. Relaxing the W→W ordering yields a model known as partial store
order.
3. Relaxing the R→W and R→R orderings yields a variety of models
including weak ordering, the PowerPC consistency model, and
release consistency, depending on the details of the ordering
restrictions and how synchronization operations enforce ordering.
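A hedged C11 sketch of the relaxed-consistency idea (using <stdatomic.h>; the variable names are illustrative): the data accesses themselves are relaxed and may be reordered, while release/acquire operations on a flag enforce ordering only at the synchronization point, so the synchronized program behaves as expected.

```c
#include <stdatomic.h>

atomic_int data  = 0;
atomic_int ready = 0;

/* Producer: the release store on 'ready' orders the earlier data
 * write before it (restores W->W at the synchronization point). */
void producer(void) {
    atomic_store_explicit(&data, 42, memory_order_relaxed);
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Consumer: the acquire load on 'ready' orders the later data
 * read after it (restores R->R at the synchronization point). */
int consumer(void) {
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;  /* spin until the producer publishes */
    return atomic_load_explicit(&data, memory_order_relaxed);  /* 42 */
}
```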
Models of Memory Consistency
 Finally,
 At the present time, many multiprocessors being built support some sort of relaxed consistency model.
 Since synchronization is highly multiprocessor specific, the expectation is that most programmers will use standard synchronization libraries.
 With speculation, much of the performance advantage of relaxed consistency models can be obtained with sequential or processor consistency.
 A remaining question for relaxed consistency concerns the role of the compiler and its ability to optimize memory accesses to potentially shared variables.
Interconnection Networks
Interconnection Networks
 A bus is a communication pathway connecting two or more devices.
 Bus is a shared transmission medium.
 Multiple devices are connected to the bus, and a signal transmitted
by any one device is available for reception by all other devices
attached to the bus.
 If two devices transmit during the same time period, their signals will
overlap and become garbled. Thus, only one device at a time can
successfully transmit the data.
Interconnection Networks
 Typically, a bus consists of multiple communication pathways, or
lines.
 Each line is capable of transmitting signals representing binary 1 and
binary 0.
 For example:
 An 8-bit unit of data can be transmitted over eight bus lines.
 A bus that connects major computer components (processor,
memory, I/O) is called a system bus.
 The most common computer interconnection structures are based on the
use of one or more system buses.
Interconnection Networks
 Bus Structure:
Bus Interconnection Scheme
Interconnection Networks
 Bus Structure: Data Bus
 The data lines provide a path for moving data among system modules; collectively, these lines are called the data bus.
 The data bus may consist of 32, 64, 128, or even more separate lines, the number of lines being referred to as the width of the data bus.
 Because each line can carry only 1 bit at a time, the number of lines determines how many bits can be transferred at a time.
 The width of the data bus is a major factor in determining overall system performance.
Interconnection Networks
 Bus Structure: Address Bus
 The address lines are used to designate the source or destination of
the data on the data bus.
 For example: If the processor wishes to read a word (8, 16, or 32 bits) of
data from memory, it puts the address of the desired word on the address
lines.
 The width of the address bus determines the maximum possible
memory capacity of the system.
Interconnection Networks
 Bus Structure: Address Bus
 The address lines are also used to address I/O ports.
 Typically, the higher-order bits are used to select a particular module
on the bus, and the lower-order bits select a memory location or I/O
port within the module.
 For example: On an 8-bit address bus, address 01111111 and below
might reference locations in a memory module (module 0) with 128
words of memory, and address 10000000 and above refer to devices
attached to an I/O module (module 1).
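A small C sketch of the decoding this example implies (the shift and mask follow directly from the 1-bit module select over an 8-bit address; the function name is illustrative):

```c
#include <stdint.h>

/* 8-bit address bus: the high-order bit selects the module,
 * the low 7 bits select a word or port within it (128 each). */
void decode(uint8_t addr, unsigned *module, unsigned *offset) {
    *module = addr >> 7;        /* 0 = memory module, 1 = I/O module */
    *offset = addr & 0x7F;      /* memory word or I/O port number    */
    /* e.g. 0x5A -> module 0, memory word 0x5A;
     *      0x83 -> module 1, I/O port 3. */
}
```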
Interconnection Networks
 Bus Structure: Control Bus
 The control lines are used to control the access to and the use of
the data and address lines.
 Since, the data and address lines are shared by all components,
controlling their use becomes crucial.
 Control signals transmit both command and timing information
among system modules.
 Timing signals indicate the validity of data and address information, and command signals specify operations to be performed.
Interconnection Networks
 Typical control lines include
 Memory write: Causes data on the bus to be written into the addressed
location
 Memory read: Causes data from the addressed location to be placed on
the bus
 I/O write: Causes data on the bus to be output to the addressed I/O port
 I/O read: Causes data from the addressed I/O port to be placed on the
bus
 Transfer ACK: Indicates that data have been accepted from or placed
on the bus
Interconnection Networks
 Typical control lines include
 Bus request: Indicates that a module needs to gain control of the bus
 Bus grant: Indicates that a requesting module has been granted control
of the bus
 Interrupt request: Indicates that an interrupt is pending
 Interrupt ACK: Acknowledges that the pending interrupt has been
recognized
 Clock: Is used to synchronize operations
 Reset: Initializes all modules
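Purely as a hedged summary of the two lists above, the typical control lines can be collected in a C enum (the identifiers and encodings are illustrative, not a real bus standard):

```c
/* Typical control-bus signals, one identifier per line above. */
typedef enum {
    MEM_WRITE, MEM_READ,    /* memory transfers          */
    IO_WRITE, IO_READ,      /* I/O port transfers        */
    TRANSFER_ACK,           /* data accepted or placed   */
    BUS_REQ, BUS_GRANT,     /* bus arbitration           */
    INT_REQ, INT_ACK,       /* interrupt handshaking     */
    CLOCK, RESET            /* timing and initialization */
} ControlLine;
```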
Interconnection Networks
Physical Realization of a Bus Architecture
Interconnection Networks
 Multiple-Bus Hierarchies:
 If a great number of devices are connected to the bus, performance will
suffer. There are two main causes:
 1. In general, the more devices attached to the bus, the greater the
bus length and hence the greater the propagation delay.
This delay determines the time it takes for devices to coordinate the use
of the bus.
When control of the bus passes from one device to another frequently,
these propagation delays can noticeably affect performance.
Interconnection Networks
 Multiple-Bus Hierarchies:
 If a great number of devices are connected to the bus, performance will
suffer. There are two main causes:
 2. The bus may become a bottleneck as the aggregate data transfer
demand approaches the capacity of the bus.
If the aggregate data rate grows beyond the capacity of the bus, the bus becomes a bottleneck.
The data rates generated by attached devices (graphics and video controllers, network interfaces) are growing rapidly, so this bottleneck problem is increasingly likely to be faced.
Interconnection Networks
Traditional bus architecture
Interconnection Networks
High-performance architecture
This presentation is published only for educational purpose
shindesir.pvp@gmail.com