Parallel Computing

Parallel and Distributed
Computing
Unit 1

Parallel Computing: The
Computational Problem

Limitations of Serial Computing

Parallel Computer Memory
Architectures

Cache-Only Memory Architecture (COMA)
• In these memory architectures, only cache memories are
present; no main memory is employed, either in the form
of a central shared memory as in UMA machines or in
the form of a distributed main memory as in NUMA and
CC-NUMA computers.

Shared Memory: Advantages and
Disadvantages

Distributed Memory: Advantages and
Disadvantages

Hybrid Distributed-Shared Memory

Message Passing Model
Implementations: MPI

Single Program Multiple Data (SPMD)

Multiple Program Multiple Data

Automatic Vs Manual parallelization

Understanding the Problem and the
Program

Example of Non-Parallelizable Problem

How to Handle Data Dependencies?

Options: Reduce overall I/O as much
as possible

Limits and Costs of Parallel
Programming

Performance Analysis and Tuning

• Thus, the result is produced in 2n+1 time steps with a systolic
architecture.

Vector Processor
• A vector processor is a central processing unit that can
operate on an entire vector of input data with a single
instruction. More specifically, it is a complete unit of
hardware resources that processes a sequence of similar
data items in memory using a single instruction.
• The elements of a vector are stored in order, at
successive memory addresses. This is why the processor
is said to process the data sequentially.

• It has a single control unit but multiple execution
units that perform the same operation on different data
elements of the vector.
• Unlike a scalar processor, which operates on only a single
pair of data items at a time, a vector processor operates on
multiple pairs of data items. Scalar code can be converted
into vector code; this conversion process is known as
vectorization. Vector processing thus allows a single
instruction to operate on multiple data elements.
• Such instructions are called single instruction, multiple
data (SIMD) or vector instructions. Recent CPUs make use of
vector processing because, for data-parallel work, it is
faster than scalar processing.
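The difference between scalar and vector code can be sketched in a few lines of Python. This is only an illustrative model, not real hardware behaviour: the function names `scalar_add`, `vector_op`, and `vector_add` are hypothetical, and the list comprehension merely stands in for the parallel execution units of a vector processor.

```python
from typing import Callable, List


def scalar_add(a: List[float], b: List[float]) -> List[float]:
    """Scalar style: one add is issued per pair of elements."""
    result = []
    for i in range(len(a)):
        result.append(a[i] + b[i])  # one operation per loop iteration
    return result


def vector_op(op: Callable[[float, float], float],
              a: List[float], b: List[float]) -> List[float]:
    """Models one vector instruction: the same operation on every element pair."""
    if len(a) != len(b):
        raise ValueError("vector operands must have the same length")
    return [op(x, y) for x, y in zip(a, b)]


def vector_add(a: List[float], b: List[float]) -> List[float]:
    """Vectorized form of scalar_add: a single call covers the whole vector."""
    return vector_op(lambda x, y: x + y, a, b)


a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
print(scalar_add(a, b))
print(vector_add(a, b))
```

Both functions return the same values; vectorization changes how the work is expressed and issued, not the result.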

• The functional units of a vector computer are as
follows:
• IPU or instruction processing unit
• Vector register
• Scalar register
• Scalar processor
• Vector instruction controller
• Vector access controller
• Vector processor

• Because it has several functional pipelines, it can execute instructions over
multiple operands at once. Both data and instructions reside in memory at their
respective locations, so the instruction processing unit (IPU) first fetches
the instruction from memory.
• Once the instruction is fetched, the IPU determines whether it is scalar or
vector in nature. If it is scalar, the instruction is transferred to the scalar
register and scalar processing is then performed.
• If the instruction is a vector instruction, it is fed to the vector
instruction controller. The controller first decodes the instruction and then
determines the address of the vector operand in memory.
• It then signals the vector access controller to request that operand. The
vector access controller fetches the operand from memory and places it in the
vector register so that the vector processor can process it.
• When multiple vector instructions are present, the vector instruction
controller passes them to the task system. If the task system indicates that a
vector task is very long, the processor divides the task into subvectors.

• These subvectors are fed to the vector processor, which uses several
pipelines to execute the instruction over the operands fetched from memory at
the same time.
• The vector instruction controller schedules the various vector instructions.
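Dividing a long vector task into subvectors can be sketched as follows. This is a minimal model under stated assumptions: `MAX_VECTOR_LENGTH` is a hypothetical register capacity chosen for illustration, and the helper names are not taken from any real machine or library.

```python
from typing import Iterator, List

MAX_VECTOR_LENGTH = 4  # hypothetical number of elements one vector register holds


def split_into_subvectors(vector: List[int],
                          max_len: int = MAX_VECTOR_LENGTH) -> Iterator[List[int]]:
    """Yield consecutive subvectors of at most max_len elements."""
    for start in range(0, len(vector), max_len):
        yield vector[start:start + max_len]


def long_vector_add(a: List[int], b: List[int],
                    max_len: int = MAX_VECTOR_LENGTH) -> List[int]:
    """Add two long vectors one subvector at a time."""
    if len(a) != len(b):
        raise ValueError("vector operands must have the same length")
    result: List[int] = []
    for sub_a, sub_b in zip(split_into_subvectors(a, max_len),
                            split_into_subvectors(b, max_len)):
        # Each pass models one vector instruction over one subvector.
        result.extend(x + y for x, y in zip(sub_a, sub_b))
    return result


print(list(split_into_subvectors(list(range(10)))))
print(long_vector_add(list(range(10)), [1] * 10))
```

A ten-element vector with a capacity of four is handled as two full subvectors and one shorter one.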

Very Long Instruction Word (VLIW)
Architecture
• The limitations of the superscalar processor become prominent as instruction
scheduling grows more complex. The limited intrinsic parallelism in the
instruction stream, together with hardware complexity, cost, and the branch
instruction issue, are addressed by a different instruction set architecture
called Very Long Instruction Word (VLIW).
• VLIW exploits instruction-level parallelism (ILP): the program itself, as
arranged by the compiler, controls the parallel execution of the instructions.
• In other architectures, processor performance is improved by one of the
following methods: pipelining (breaking instruction execution into stages),
superscalar execution (executing independent instructions at the same time in
different parts of the processor), and out-of-order execution (executing
instructions in an order different from program order). Each of these methods
adds considerably to the complexity of the hardware.
• The VLIW architecture deals with this by depending on the compiler. The
compiler decides the parallel flow of the instructions and resolves conflicts.
This increases compiler complexity but greatly decreases hardware complexity.

Features
• Processors in this architecture have multiple functional units and fetch
very long instruction words from the instruction cache.
• Multiple independent operations are grouped together in a single VLIW
instruction. They are initiated in the same clock cycle.
• Each operation is assigned to an independent functional unit.
• All the functional units share a common register file.
• Instruction words are typically 64 to 1024 bits long, depending on
the number of execution units and the code length required to control each
unit.
• Instruction scheduling and parallel dispatch of the word are done statically
by the compiler.
• The compiler checks for dependencies before scheduling parallel
execution of the instructions.
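The compiler's role can be illustrated with a small sketch of static scheduling. It is not the algorithm of any particular VLIW compiler: the `Op` type, the register names, and the greedy packing strategy are all assumptions made for illustration. Each operation is placed in the earliest instruction word that comes after every earlier operation it depends on and that still has a free slot.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Op:
    name: str               # hypothetical mnemonic
    dest: str               # register written
    srcs: Tuple[str, ...]   # registers read


def depends_on(later: Op, earlier: Op) -> bool:
    """Conservative check: true for RAW, WAW, or WAR hazards between two ops."""
    raw = earlier.dest in later.srcs
    waw = earlier.dest == later.dest
    war = later.dest in earlier.srcs
    return raw or waw or war


def schedule(ops: List[Op], slots: int) -> List[List[Op]]:
    """Greedily pack ops into instruction words of at most `slots` operations."""
    words: List[List[Op]] = []
    placed_in: Dict[int, int] = {}  # op index -> index of the word holding it
    for i, op in enumerate(ops):
        earliest = 0
        for j in range(i):
            if depends_on(op, ops[j]):
                earliest = max(earliest, placed_in[j] + 1)
        w = earliest
        while w < len(words) and len(words[w]) >= slots:
            w += 1  # word is full, try the next one
        if w == len(words):
            words.append([])
        words[w].append(op)
        placed_in[i] = w
    return words


example_ops = [
    Op("add", "r1", ("r2", "r3")),
    Op("mul", "r4", ("r5", "r6")),
    Op("sub", "r7", ("r1", "r4")),  # needs the results of add and mul
    Op("ld", "r8", ("r9",)),
]
for word in schedule(example_ops, slots=2):
    print([op.name for op in word])
```

With two slots per word, `add` and `mul` share the first word, while `sub` must wait for both and shares the second word with the independent `ld`. All of this is decided before the program runs, which is why the hardware needs no dependency-checking logic of its own.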

Advantages
• Reduces hardware complexity.
• Reduces power consumption because of the reduced
hardware complexity.
• Since the compiler takes care of data dependency checks and
instruction scheduling, the hardware for decoding and issuing
instructions becomes much simpler.

Disadvantages
• Complex compilers are required, which are hard to design.
• Increased program code size.
• Requires larger memory bandwidth and register-file bandwidth.

Superpipelined Architecture
• Superpipelining breaks the stages of a given pipeline into smaller
stages (thus making the pipeline deeper) in an attempt to shorten the
clock period and so increase instruction throughput by keeping
more instructions in flight at a time.
• Superpipelining is an alternative approach to achieving greater
performance. It exploits the fact that many pipeline stages need less
than half a clock cycle to do their work.

Superscalar vs Superpipelined
Structure
• Superscalar machines can issue several instructions per
cycle. Superpipelined machines can issue only one instruction per cycle,
but their cycle time is shorter than the time required for any
operation. Both techniques exploit instruction-level parallelism,
which is limited in many applications.
• Superpipelining seeks to improve the sequential instruction rate, while
superscalar execution seeks to improve the parallel instruction rate by
executing multiple instructions in parallel. Most modern processors are
both superscalar and superpipelined.
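The two approaches can be compared with an idealized timing model. The formulas below are a common textbook-style simplification assumed here, not taken from these slides: they ignore stalls, hazards, and branches, and measure time in cycles of a base pipeline with `k` stages. The function and parameter names are hypothetical.

```python
def base_pipeline_time(n_instr: int, k: int) -> float:
    """Base scalar pipeline: k cycles for the first instruction, 1 for each one after."""
    return k + n_instr - 1


def superscalar_time(n_instr: int, k: int, m: int) -> float:
    """Degree-m superscalar: m instructions issued together in every cycle."""
    return k + (n_instr - m) / m


def superpipelined_time(n_instr: int, k: int, n: int) -> float:
    """Degree-n superpipelined: one instruction issued every 1/n of a base cycle."""
    return k + (n_instr - 1) / n


N, K = 64, 4  # example workload: 64 instructions on a 4-stage base pipeline
print(base_pipeline_time(N, K))         # 67
print(superscalar_time(N, K, m=2))      # 35.0
print(superpipelined_time(N, K, n=2))   # 35.5
```

Under these assumptions, a degree-2 superscalar machine and a degree-2 superpipelined machine reach nearly the same speedup over the base pipeline, which is consistent with the point that both exploit the same instruction-level parallelism.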

More Details on the PRAM Model
