2. AUTOMATING PARALLEL PROGRAMMING
When writing code, we typically do not need to understand the details of the target system, because the
compiler handles them.
Developers usually think in terms of a single CPU and sequential processing during coding and
debugging.
Implementing algorithms on parallel systems in software and in hardware is more closely related than it might seem.
Parallelism in software and hardware shares common challenges and approaches.
4. AUTOMATING PARALLEL PROGRAMMING
Layers of Implementation
Layer 5 - Application Layer:
Defines the application or problem to be implemented on a parallel computing platform.
Specifies the inputs and outputs, including data storage and timing requirements.
Layer 4 - Algorithm Development:
Focuses on defining the tasks and their interdependencies.
Parallelism may not be evident in this layer, as tasks are usually developed with sequential execution in mind.
The result is a dependence graph, directed graph (DG), or adjacency matrix summarizing task
dependencies.
5. AUTOMATING PARALLEL PROGRAMMING
Layer 3 - Parallelization Layer:
Extracts parallelism from the algorithm developed in Layer 4.
It generates thread timing and processor assignments for software or hardware implementations.
This layer is crucial for optimizing the algorithm for parallel execution.
Layer 2 - Coding Layer:
Involves writing the parallel algorithm in a high-level language.
The language depends on the target parallel computing platform. For general-purpose platforms,
languages like Cilk++, OpenMP, or CUDA (Compute Unified Device Architecture) are used (a minimal OpenMP sketch follows below).
For custom platforms, Hardware Description Languages (HDLs) like Verilog or VHDL are used.
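As an illustration of the coding layer for a general-purpose platform, the following is a minimal sketch in C with OpenMP (the array size and variable names are arbitrary; compile with a flag such as gcc -fopenmp):

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

        /* Each iteration is independent, so OpenMP may split the loop across threads. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[42] = %.1f, threads available = %d\n", c[42], omp_get_max_threads());
        return 0;
    }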
6. AUTOMATING PARALLEL PROGRAMMING
Layer 1 - Realization Layer:
The algorithm is realized on a parallel computer platform, using methods like multithreading or custom parallel
processors (e.g., ASICs (application-specific integrated circuits) or FPGAs (field-programmable gate arrays)).
Automatic Programming in Parallel Computing:
Automatic serial programming: The programmer writes code in high-level languages (C, Java, FORTRAN), and the
code is compiled automatically.
Parallel computing: It is more complex as programmers need to manage how tasks are distributed and executed across
multiple processors.
Parallelizing compilers can handle simple loops and embarrassingly parallel algorithms (tasks that can be easily
parallelized).
For more complex tasks, the programmer needs intimate knowledge of processor interactions and task execution timing.
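To make the distinction concrete, here is a small, illustrative C sketch (the variable names are arbitrary): the first loop has independent iterations that a parallelizing compiler or an OpenMP pragma can distribute easily, while the second has a loop-carried dependency that requires the programmer to restructure the computation.

    #include <stdio.h>
    #define N 8

    int main(void) {
        double a = 2.0, x[N], y[N], s[N];
        for (int i = 0; i < N; i++) { x[i] = i; y[i] = 1.0; }

        /* Independent iterations: easy to parallelize automatically. */
        for (int i = 0; i < N; i++)
            y[i] = a * x[i] + y[i];

        /* Loop-carried dependency: iteration i needs s[i-1], so naive
           parallelization would change the result; the programmer must
           restructure it (e.g., as a parallel prefix sum). */
        s[0] = x[0];
        for (int i = 1; i < N; i++)
            s[i] = s[i - 1] + x[i];

        printf("y[7] = %.1f, s[7] = %.1f\n", y[7], s[7]);
        return 0;
    }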
7. Parallel Algorithms and Parallel Architectures
Parallel algorithms and parallel hardware are interconnected; the development of one often depends on
the other.
Parallelism can be implemented at different levels in a computing system through hardware and
software techniques:
Data-Level Parallelism
Operates on multiple bits of a datum or multiple data simultaneously.
Examples: Bit-parallel addition, multiplication, division, vector processor arrays, and systolic arrays.
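As a small illustration of data-level parallelism in software (this assumes an x86 processor with SSE; the values are arbitrary), a single SIMD instruction below adds four 32-bit floats at once:

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        float a[4] = {1, 2, 3, 4};
        float b[4] = {10, 20, 30, 40};
        float c[4];

        /* One SSE add instruction operates on four floats simultaneously. */
        __m128 va = _mm_loadu_ps(a);
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_add_ps(va, vb);
        _mm_storeu_ps(c, vc);

        printf("%.0f %.0f %.0f %.0f\n", c[0], c[1], c[2], c[3]);
        return 0;
    }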
Instruction-Level Parallelism (ILP)
Executes multiple instructions simultaneously within a processor. Example:
Instruction pipelining.
8. Parallel Algorithms and Parallel Architectures
Thread-Level Parallelism (TLP)
Executes multiple threads (lightweight processes sharing processor resources) simultaneously.
Threads can run on one or multiple processors.
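A minimal sketch of thread-level parallelism using POSIX threads in C is shown below (the data size, thread count, and function names are arbitrary; compile with -pthread). Each thread sums its own slice of a shared array, and the main thread combines the partial results.

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4
    #define LEN 1000

    static double data[LEN];
    static double partial[NUM_THREADS];

    /* Each thread computes a partial sum over its own slice of the data. */
    static void *worker(void *arg) {
        long id = (long)arg;
        long chunk = LEN / NUM_THREADS;
        double sum = 0.0;
        for (long i = id * chunk; i < (id + 1) * chunk; i++)
            sum += data[i];
        partial[id] = sum;
        return NULL;
    }

    int main(void) {
        pthread_t tid[NUM_THREADS];
        for (int i = 0; i < LEN; i++) data[i] = 1.0;

        for (long t = 0; t < NUM_THREADS; t++)
            pthread_create(&tid[t], NULL, worker, (void *)t);
        for (int t = 0; t < NUM_THREADS; t++)
            pthread_join(tid[t], NULL);

        double total = 0.0;
        for (int t = 0; t < NUM_THREADS; t++) total += partial[t];
        printf("total = %.1f\n", total);   /* expected: 1000.0 */
        return 0;
    }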
Process-Level Parallelism
Manages multiple independent processes, each with dedicated resources like memory and registers.
Example:
Classic multitasking and time-sharing across single or multiple machines.
9. Measuring benefits of Parallel Computing
Speedup Factor
The benefit of parallel computing is measured by comparing the time taken to complete a task on a
single processor with the time taken on N parallel processors. The speedup, S(N), is defined as:
S(N) = Tp(1) / Tp(N)
where Tp(1) is the algorithm processing time on a single processor and Tp(N) is the processing time on the N parallel processors.
In an ideal situation, for a fully parallelizable algorithm, and when the communication time between processors and memory is neglected, we have Tp(N) = Tp(1)/N, and the above equation gives S(N) = N.
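As a quick numerical illustration (the numbers are chosen arbitrarily): if an algorithm takes Tp(1) = 100 s on one processor and is fully parallelizable over N = 10 processors with negligible communication time, then Tp(10) = 100/10 = 10 s and S(10) = 100/10 = 10 = N.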
10. Communication Overhead
Both single and parallel computing systems require data transfer between processors and memory.
Communication delays occur due to a speed mismatch between the processor and memory.
Parallel systems need processors to exchange data via interconnection networks, adding complexity.
Issues Affecting Communication Efficiency:
Interconnection Network Delay:
Delays arise from factors like:
Bit propagation.
Message transmission.
Queuing within the network.
These delays depend on network topology, data size, and network speed.
11. Communication Overhead
Memory Bandwidth:
Memory access is limited when memory has a single port, restricting data transfer to one word per memory cycle.
Memory Collisions:
Occur when multiple processors try to access the same memory module simultaneously.
Arbitration mechanisms are required to resolve access conflicts.
Memory Wall:
Memory transfer speeds lag behind processor speeds.
This problem is mitigated by using a memory hierarchy (registers → cache → RAM → electronic disk → magnetic disk → optical disk).
13. Estimating Speedup Factor and Communication Overhead
Let us assume we have a parallel algorithm consisting of N independent tasks that can be executed either on a single processor or on N processors.
Under these ideal circumstances, Tp(N) = Tp(1)/N, so the speedup equals the number of processors: S(N) = N. Any communication overhead between the processors reduces this figure.
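To see how communication overhead erodes this ideal figure, here is a small, hedged C sketch. It assumes a deliberately simple model (an assumption for illustration, not necessarily the model used in the source text): each of the N tasks takes tau seconds of computation, and running them on N processors adds a fixed communication time t_comm.

    #include <stdio.h>

    /* Illustrative model (an assumption): Tp(1) = n_tasks * tau,
       Tp(N) = n_tasks * tau / n_procs + t_comm.                   */
    static double speedup(int n_tasks, int n_procs, double tau, double t_comm) {
        double t_serial   = n_tasks * tau;
        double t_parallel = (n_tasks * tau) / n_procs + t_comm;
        return t_serial / t_parallel;
    }

    int main(void) {
        /* With no communication cost the speedup is ideal (= N);
           a nonzero t_comm pulls it below N. */
        printf("ideal:    S = %.2f\n", speedup(8, 8, 1.0, 0.0));   /* 8.00 */
        printf("overhead: S = %.2f\n", speedup(8, 8, 1.0, 0.25));  /* 6.40 */
        return 0;
    }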
14. Amdahl's Law
Amdahl's Law is a fundamental principle used to estimate the potential speedup that can be achieved by
parallelizing a computation.
It describes the maximum expected improvement in the execution time of a program when part of the
computation is parallelized.
15. Amdahl's Law
In the limit where the enhanced (parallelizable) portion is sped up without bound, the speedup saturates at
Overall Speedup (max) = 1 / (1 – Fraction Enhanced)
Conversely, when the enhanced fraction f = 1, this bound disappears and the speedup is limited only by the number of processors.
Amdahl's law is a principle that states that the maximum potential improvement to the performance of a system is limited by the portion of the system that cannot be improved.
In other words, the performance improvement of a system as a whole is limited by its bottlenecks. The
law is often used to predict the potential performance improvement of a system when adding more
processors or improving the speed of individual processors.
It is named after Gene Amdahl, who first proposed it in 1967.
16. Amdahl's Law
The formula for Amdahl’s law is:
S = 1 / (1 – P + (P / N))
where:
S is the speedup of the system
P is the proportion of the system that can be improved
N is the number of processors in the system
For example, if only 20% of a program's total execution time can be parallelized (P = 0.2), and the program
runs on 5 processors (the original processor plus 4 more, so N = 5), the speedup would be:
S = 1 / (1 – 0.2 + (0.2 / 5))
S = 1 / (0.8 + 0.04)
S = 1 / 0.84
S = 1.19
This means that the overall performance of the system would improve by about 19% with the addition of
the 4 processors.
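The calculation above can be reproduced with a few lines of C (the function name is ours, used only for illustration):

    #include <stdio.h>

    /* Amdahl's law: S = 1 / ((1 - P) + P / N). */
    static double amdahl(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    int main(void) {
        /* The worked example above: P = 0.2, N = 5 gives S of about 1.19. */
        printf("S(P=0.2, N=5)    = %.2f\n", amdahl(0.2, 5));
        /* Even with many processors, the 80%% serial part caps S near 1.25. */
        printf("S(P=0.2, N=1000) = %.2f\n", amdahl(0.2, 1000));
        return 0;
    }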
17. APPLICATIONS OF PARALLEL COMPUTING
Scientific Research and Simulation:
Weather Forecasting: Running complex models to predict weather patterns and climate changes.
Astrophysics and Cosmology: Simulating celestial bodies, universe evolution, etc.
Molecular Dynamics: Studying molecular interactions, protein folding, drug discovery, etc.
Big Data Analytics and Data Processing:
Data Mining: Analyzing vast datasets to extract patterns, trends, and insights.
Machine Learning and AI: Training deep neural networks, processing large datasets in real-time.
Web Search Engines: Indexing and retrieving information from enormous web databases.
18. APPLICATIONS OF PARALLEL COMPUTING
High-Performance Computing (HPC):
Financial Modeling: Performing risk analysis, option pricing, and portfolio optimization.
Fluid Dynamics and Computational Chemistry: Simulating fluid flows, chemical reactions, etc.
Finite Element Analysis: Solving complex engineering problems in aerospace, automotive industries, etc.
Parallel Databases and Search Algorithms:
Parallel Database Systems: Handling concurrent queries and transactions in large-scale databases.
Parallel Search Algorithms: Speeding up searches in large datasets, such as in cryptography and pattern
matching.
19. APPLICATIONS OF PARALLEL COMPUTING
Image and Signal Processing:
Medical Imaging: Processing MRI, CT scans for diagnostics and treatment planning.
Video Processing: Real-time video encoding, decoding, and analysis.
Distributed Systems and Networking:
Distributed Computing: Handling distributed tasks efficiently in cloud computing environments.
Network Routing and Traffic Analysis: Optimizing routing algorithms, analyzing network traffic.
Real-Time Systems and Simulation:
Robotics and Automation: Controlling multiple robots simultaneously for complex tasks.
Virtual Reality and Gaming: Rendering complex scenes and simulations in real-time.
20. SHARED-MEMORY MULTIPROCESSORS (UNIFORM MEMORY ACCESS [UMA])
Shared-memory processors are popular due to their simple and general programming model, enabling
easy development of parallel software.
Another term for shared-memory processors is Parallel Random Access Machine (PRAM).
A shared-address space is used for communication between processors, with all processors accessing a
common memory space.
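A minimal sketch of this shared-memory programming model, using OpenMP in C (array size and names are arbitrary), is shown below: all threads read and write the same shared array, and a reduction clause combines their partial sums safely.

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static double a[N];
        for (int i = 0; i < N; i++) a[i] = 1.0;

        /* Every thread sees the same array 'a' in the shared address space;
           the reduction clause merges the per-thread partial sums. */
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %.1f (up to %d threads)\n", sum, omp_get_max_threads());
        return 0;
    }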