INTRODUCTION TO HPC
Alessandro Romeo, PhD
a.romeo@cineca.it
July 1st, 2024
Summary
Why do we bother about parallelization?
Elements of HPC architecture
Parallel programming
Measuring performance
References
• Previous CINECA courses & CINECA website. Special thanks to Andrew Emerson &
Alessandro Marani
• Michael T. Heath and Edgar Solomonik – Parallel numerical algorithms, Lecture
notes
• Stefan Boeriu, Kai-Ping Wang and John C. Bruch Jr. – Parallel computation, Lecture
notes
• https://hpc-wiki.info/hpc/HPC_Wiki
• https://stackoverflow.com/
Parallelization
The need for parallelism
The use of computers to study physical systems allows us to explore phenomena at all scales:
• very large (weather forecasts, climatology, cosmology, data mining, oil reservoirs)
• very small (drug design, silicon chip design, structural biology)
• very complex (fundamental physics, fluid dynamics, turbulence)
• too dangerous or expensive (fault simulation, nuclear tests, crash analysis) (image: Magneticum homepage)
The need for parallelism
Simulation of an earthquake: https://www.etp4hpc.eu/what-is-hpc.html
Exscalate4cov: https://www.exscalate4cov.eu/
Weather – MISTRAL project: https://www.cineca.it/news/mistral-portale
Computational methods allow us to study complex phenomena, giving a powerful impetus to scientific research. This is possible thanks to supercomputers.
Serial vs parallel computing
An algorithm that uses only a single core is referred to as a serial algorithm. These algorithms complete one instruction at a time, in order.
Parallelization means converting a serial algorithm into an algorithm that can perform multiple operations simultaneously.
Parallel computing is the simultaneous use of multiple compute resources to solve a computational problem, broken into discrete parts that can be solved concurrently.
Serial vs parallel computing
Most programs written in common programming languages run on a single thread, on a single processor core. Such processing is synchronous.
https://godbolt.org/z/cWjPcnca3
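A minimal serial sketch (illustrative only; it is not the snippet behind the link above): the loop below processes one element after another, as a single instruction stream on a single core.

```c
#include <stdio.h>

int main(void) {
    double a[1000];
    double sum = 0.0;

    /* Each iteration runs only after the previous one has finished:
       one instruction stream, one core, no concurrency. */
    for (int i = 0; i < 1000; i++) {
        a[i] = i * 0.5;
        sum += a[i];
    }

    printf("sum = %f\n", sum);
    return 0;
}
```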
Von Neumann model
[Diagram: conventional computer — Input and Output attached to Memory, with the CPU (Control Unit + Arithmetic Logic Unit) exchanging data and control signals with Memory]
Memory: stores data and programs.
CPU — Control Unit: executes the program; ALU: performs arithmetic/logic operations.
Instructions are processed sequentially:
1. A single instruction is loaded from memory (fetch) and decoded;
2. Compute the addresses of the operands;
3. Fetch the operands from memory;
4. Execute the instruction;
5. Write the result to memory (store).
The need for parallelism
The instructions of all modern processors need to be synchronized with a timer, or clock.
The clock cycle τ is defined as the time between two adjacent pulses of the oscillator that sets the pace of the processor.
The number of these pulses per second is known as the clock speed or clock frequency, generally measured in GHz (billions of pulses per second).
The clock cycle controls the synchronization of operations in a computer: every operation inside the processor lasts a multiple of τ.
Clock frequency cannot increase indefinitely, because of:
• Power consumption
• Heat dissipation
• Speed of light
• Cost
The need for parallelism
The rapid spread and disruptive emergence of computers in the modern world is due to the exponential growth in their capabilities and the corresponding reduction in relative cost per component.
According to Moore's law, the density of transistors on a chip doubles every two years. For more than 40 years, the feature size of state-of-the-art chips has indeed shrunk at this pace.
Moore's Law still holds, but it will eventually fail for the following reasons:
• Minimum transistor size: transistors cannot be smaller than single atoms (feature sizes of a few nanometres are already common)
• Quantum tunnelling: quantum effects can cause current leakage
• Heat dissipation and power consumption
An increase in transistor count does not necessarily mean more CPU power: software usually struggles to make use of all the available hardware threads.
The need for parallelism
Regardless of the power of the processor, the real limitation in HPC is the performance gap between processors and memory, which has been growing over time. It is very important, in both software and hardware design, to minimize the time it takes to transfer data to and from the CPU.
• Bandwidth: how much data can be transferred over a data channel per unit time
• Latency: the minimum time needed to transfer data
[Figure: reading data]
The need for parallelism
Memory hierarchy
For all systems, CPUs are much faster than the devices
providing the data.
Cache memory is small but very fast memory which sits
between the processor and the main memory.
General strategy:
• check the cache before main memory;
• if the data are not in the cache, a block of neighbouring data is loaded from main memory, in the hope that the next data access will be served from the cache (cache hit) rather than from main memory (cache miss).
The need for parallelism
Cache levels
Cache memory is classified in terms of levels which describe its closeness to the microprocessor.
• Level 1 (L1): extremely fast but small (e.g. 32 KB), usually embedded in the CPU.
• Level 2 (L2): bigger (e.g. 2 MB) but slower, possibly on a separate chip. Each core may have its own dedicated L1 and L2 cache.
• Level 3 (L3): often shared amongst cores.
Programs constantly waiting for data from memory are memory bound.
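Exploiting the cache is crucial for performance; in particular, loops over arrays should access memory contiguously. A minimal illustrative sketch (the array size and layout are arbitrary choices): in C, 2-D arrays are stored row by row, so the row-wise loop walks through memory sequentially and reuses each cache line, while the column-wise loop jumps across rows and tends to miss.

```c
#include <stddef.h>

#define N 2048
static double a[N][N];

/* Row-major traversal: consecutive iterations touch consecutive
   memory addresses, so each loaded cache line is fully reused. */
double sum_row_major(void) {
    double sum = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Column-major traversal: consecutive iterations are N*sizeof(double)
   bytes apart, so almost every access may land on a different cache line. */
double sum_col_major(void) {
    double sum = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}
```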
Elements of HPC architecture
Basic concepts of parallel computer architecture
A parallel computer consists of
a collection of processors and
memory banks together with an
interconnection network to
enable processors to work
collectively in concert.
Major architectural issues in the
design of parallel computer
systems include the following:
• processor coordination
• memory organization
• address space
• memory access
• granularity
• scalability
• interconnection network
Basic concepts of parallel computer architecture
Parallelism can be exploited within each core, across cores in a multicore chip, and on networked systems of chips. Often, several chips are combined in a single processing node, with all of them having direct access to a shared memory within that node.
• Instruction-level parallelism (ILP) refers to the concurrent execution of multiple machine instructions within a single core.
• Shared-memory parallelism refers to execution on a single-node multicore (and often multi-chip) system, where all threads/cores have access to a common memory.
• Distributed-memory parallelism refers to execution on a cluster of nodes connected by a network, typically with a single processor associated with each node.
Parallel taxonomy
Parallelism in computer systems can take various forms, which can be classified by Flynn’s taxonomy in
terms of the numbers of instruction streams and data streams.
• SISD – single instruction stream, single data stream
– used in traditional serial computers (Turing
machines)
• SIMD – single global instruction stream, multiple
data streams – critical in special purpose “data
parallel” (vector) computers, present also in many
modern computers. It is basically parallelism within a
single core
• MISD – multiple instruction streams, single data
stream – not particularly useful, except if interpreted
as a method of “pipelining”
• MIMD – multiple instruction streams, multiple data
streams – corresponds to general purpose parallel
(multicore or multinode) computers.
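As an illustration of SIMD-style parallelism within a single core, the loop below can be vectorized by the compiler; the `omp simd` directive (assuming a compiler with OpenMP 4.0+ support; the function name and arguments are illustrative) only makes that intent explicit, so that one instruction operates on several array elements at once.

```c
#include <stddef.h>

/* The same operation applied element-wise to whole vectors of data:
   a good candidate for SIMD/vector execution within one core. */
void saxpy(size_t n, float alpha, const float *x, float *y) {
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
```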
Memory classification
Performance strongly depends on the underlying memory organization.
• Shared memory: memory is shared among the processors within a node.
• Distributed memory: each node has its own memory, and processors in different nodes exchange data through a network.
The time each processor needs to access memory is not uniform: this access model is called NUMA (Non-Uniform Memory Access). Access is faster when a processor reads local memory than when it has to go through the network.
Network topology
Networks linking the nodes in a distributed system come in different flavours according to price, performance and hardware vendor (Ethernet, Gigabit, Infiniband, Omnipath, …).
Networks are configured in particular topologies (direct or indirect).
If switches are used, the network is called a fabric.
Leonardo topology
All the nodes are interconnected through an Nvidia Mellanox Dragonfly+ network, capable of a maximum bandwidth of 200 Gbit/s between each pair of nodes.
This is a relatively new topology for Infiniband-based networks that allows a very large number of nodes to be interconnected while containing the number of switches and cables, and while keeping the network diameter very small.
In comparison to non-blocking fat-tree topologies, cost can be reduced and scaling out to a larger number of nodes becomes feasible. In comparison to a 2:1 blocking fat tree, close to 100% network throughput can be achieved for arbitrary traffic.
The Leonardo Dragonfly+ topology features a fat-tree intra-group interconnection, with 2 layers of switches, and an all-to-all inter-group interconnection.
Parallel filesystems
The filesystem manages how files are stored on disks and how they can be retrieved or written. In a
parallel architecture, with many simultaneous accesses to the disks, it is important to use a parallel
filesystem technology such as GPFS, LUSTRE, BeeGFS, …
Accelerators: GPUs
[Figure: code acceleration benchmarks — "DPPC on M100 (1 node)": ns/day with 4 GPUs (all-bonds), 4 GPUs (h-bonds), 2 GPUs (h-bonds) and 32 Galileo nodes; "Galileo DPPC (GMX 2018.8)": ns/day vs #nodes]
Originally video cards for personal computers, GPUs have become big business in HPC: low energy per Watt, high performance per dollar.
GPUs are ideal for machine learning:
• high performance for linear algebra
• hardware support for low floating-point precisions (16 and 8 bit)
NVIDIA are the market leaders, but there are also:
• AMD
• Intel GPU
https://xkcd.com/1838/
Parallel programming
SPMD
Typical numerical algorithms and applications leverage MIMD architectures not by explicitly coding different instruction streams, but by running the same program on different portions of data. The instruction stream created by a program is often data-dependent (e.g., it contains branches based on the state of the data), resulting in MIMD execution.
This single-program multiple-data (SPMD) programming model is widely prevalent and will be our principal mode for specifying parallel algorithms.
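A minimal SPMD sketch with MPI (illustrative only; the array size, the work distribution and the harmonic sum are arbitrary choices): every rank runs the same program, but uses its rank to pick which portion of the data it works on.

```c
#include <mpi.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Same code on every process: each rank selects its own slice. */
    long chunk = N / size;
    long start = rank * chunk;
    long end   = (rank == size - 1) ? N : start + chunk;

    double local = 0.0;
    for (long i = start; i < end; i++)
        local += 1.0 / (double)(i + 1);

    /* Combine the partial results on rank 0. */
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("harmonic sum ~ %f\n", global);

    MPI_Finalize();
    return 0;
}
```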
Parallel programming
Thread (or task) parallelism is based on partitioning the operations of the algorithm: if an algorithm is implemented as a series of independent operations, these can be spread across the processors, thus parallelizing the program.
Data parallelism means spreading the data to be computed across the processors. The processors execute essentially the same operations, but on different data sets. This often means distributing array elements across the computing units.
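A minimal data-parallel sketch with OpenMP (illustrative; the array and the operation are arbitrary): the loop iterations are split among the threads, each thread applying the same operation to its own share of the array elements.

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double x[N];

    /* Data parallelism: the same operation is applied to different
       array elements by different threads. */
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        x[i] = 2.0 * i + 1.0;

    printf("max threads: %d, x[N-1] = %f\n",
           omp_get_max_threads(), x[N - 1]);
    return 0;
}
```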
Parallel programming
There are multiple levels at which parallelism can be exploited in an HPC cluster:
• Instruction level (e.g. fma = fused multiply
and add)
• SIMD or vector processing (e.g. data
parallelism)
• Hyperthreading (e.g. 4 hardware
threads/core for Intel KNL, 8 for PowerPC)
• Cores/processor (e.g. 18 for Intel Broadwell)
• Processors (or sockets)/node - often 2 but
can be 1 (KNL) or > 2
• Processors + accelerators (e.g. CPU+GPU)
• Nodes in a system
To reach the maximum (peak) performance of
a parallel computer, all levels of parallelism
need to be exploited.
Parallel programming
We need libraries, tools, language extensions, algorithms and paradigms which allow us to:
- exploit, within a node, the vector units, the caches and the shared memory;
- manage inter-node connections to exchange data with processes on other nodes;
- debug and profile programs to check correctness of results and performance;
- use the disk space appropriately.
• The most used languages have been C/C++ and Fortran for years; implementations now also exist for higher-level languages such as Python and Julia.
• For GPUs: CUDA, OpenACC, …
Thanks to these libraries, numerical problems are mapped onto a parallel algorithm.
On a parallel computer, user applications are executed as processes,
threads or tasks.
Parallel programming
Process
A process is an instance of a computer program that is being executed. It contains the program code and
its current activity. Depending on the operating system, a process may be made up of multiple threads of
execution that execute instructions concurrently.
Process-based multitasking enables you to run, for example, a compiler while you are using a text editor. When multiple processes share a single CPU, the system context-switches between their memory contexts. Each process has a complete set of its own variables.
Parallel programming
Thread
A thread is a basic unit of CPU
utilization, consisting of a program
counter, a stack, and a set of
registers.
A thread of execution results from a
fork of a computer program into
two or more concurrently running
tasks. The implementation of
threads and processes differs from
one operating system to another,
but in most cases, a thread is
contained inside a process.
Moreover, a process can create
threads.
Threads are basically processes that run in the same memory context, each with a unique ID assigned. Threads of a process may share the same data during execution. Thread execution is managed by a scheduler, and context switching is performed for threads as well.
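A minimal sketch of threads inside a process using POSIX threads (the worker function, the thread count and the shared string are illustrative choices): all threads live in the same address space, so they can read the same shared data directly.

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static const char *shared_msg = "hello from the shared address space";

/* Each thread runs this function; the argument carries its ID. */
static void *worker(void *arg) {
    long id = (long)arg;
    printf("thread %ld sees: %s\n", id, shared_msg);
    return NULL;
}

int main(void) {
    pthread_t threads[NTHREADS];

    /* The process creates its threads... */
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);

    /* ...and waits for all of them to finish. */
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);

    return 0;
}
```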
Parallel programming
Task
A task is a set of program instructions that are loaded in memory. Threads can split themselves into two or
more simultaneously running tasks. Often tasks and processes are synonymous.
Multiple threads can exist within the
same process and share resources
such as memory, while different
processes do not share these
resources.
An example of threads within the same process is automatic spell checking and automatic saving of a file while you are writing.
Multi-core processors are capable of running more than one process or thread at the same time.
Programming models
Shared and distributed memory organizations are
associated with different programming models.
The performance of a multi-threaded program often depends primarily on load balance and data movement. As different threads access different parts of a shared memory, the hardware caching protocols implicitly orchestrate data movement between local caches and main memory (as in OpenMP).
Threads typically coordinate their accesses via:
• locks, which restrict access to some data to one thread at a time;
• atomic operations, which allow a thread to perform basic manipulations of data items without interference from other threads during that operation.
By contrast, on distributed-memory systems, data movement is typically done explicitly via passing
messages (i.e. data) between processors of a node or different nodes (such as MPI).
Often simulations combine both in a hybrid approach: OpenMP in a node and MPI among nodes.
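A minimal shared-memory sketch in OpenMP (illustrative; the loop and variable names are arbitrary): without coordination, the concurrent updates of `sum` would be a data race; the `atomic` directive makes each update indivisible, which is exactly the kind of coordination described above.

```c
#include <stdio.h>

#define N 1000000

int main(void) {
    double sum = 0.0;

    #pragma omp parallel for
    for (long i = 0; i < N; i++) {
        double term = 1.0 / (double)(i + 1);
        /* Atomic update: prevents a data race on the shared variable. */
        #pragma omp atomic
        sum += term;
    }

    printf("sum = %f\n", sum);
    return 0;
}
```

In practice a `reduction(+:sum)` clause would be the idiomatic and faster choice here; the atomic form is shown only to illustrate the coordination primitive itself.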
Some example: CPU
Message Passing Interface (MPI)
• Allows parallel processes to communicate via sending “messages”
(i.e. data).
• Most standard way of communication between nodes, but can also
be used within a node.
OpenMP
• Allows parallel threads to communicate via shared memory within a node.
• Cannot be used across nodes (it only works inside a shared-memory node).
Hybrid MPI + OpenMP
• Combines both MPI + OpenMP.
• For example, OpenMP within a shared memory node and MPI
between nodes.
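A minimal hybrid sketch (illustrative): MPI handles the distributed-memory level between processes/nodes, while OpenMP spawns threads inside each process. Launch details (number of ranks, OMP_NUM_THREADS, binding) depend on the system and are not shown.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    /* Request an MPI threading level compatible with OpenMP regions. */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* OpenMP threads inside each MPI process. */
    #pragma omp parallel
    {
        printf("MPI rank %d, OpenMP thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```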
Programming models
In terms of their corresponding programming models, shared and distributed memory organizations have
advantages and disadvantages with respect to one another.
Programmers need to be careful not to spend too much time performing communication instead of computation: most of the time, applications show massive bottlenecks because of communication.
Some example: GPU
OpenACC
• Similar to OpenMP but used to program
devices such as GPUs.
CUDA
• Nvidia extension to C/C++ for GPU
programming.
OpenCL
• Vendor-neutral (non-Nvidia) alternative for programming GPUs.
SYCL and Intel oneAPI
• C++ dialects for programming many kinds of devices.
HIP
• AMD extension to C/C++ for GPU programming.
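A minimal OpenACC sketch in C (illustrative; it assumes a compiler with OpenACC support and an attached accelerator): the directive asks the compiler to offload the loop to the device, in the same directive-based spirit as OpenMP on the CPU.

```c
#include <stdio.h>

#define N 1000000

int main(void) {
    static float x[N], y[N];
    float alpha = 2.0f;

    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* Offload the loop to the accelerator; the copy clauses describe
       the data movement between host and device memory. */
    #pragma acc parallel loop copyin(x[0:N]) copy(y[0:N])
    for (int i = 0; i < N; i++)
        y[i] = alpha * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);
    return 0;
}
```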
Load balance
There are several HPC models which distribute the work to perform among different processors. A common distribution model is the master–worker model (or fork–join). In this model we start with a single control process, called the master, which controls the worker processes. Depending on the programming language, the communication can be one-sided, so the programmer needs to explicitly manage only one side of a two-process operation.
Load balance
How can we distribute the job among workers?
• Static scheduling: the way the work is assigned to the threads/processes is fixed in advance, by evenly dividing the amount of work among all available threads. This is useful when the workload is known a priori, before code execution.
• Dynamic scheduling: work is handed out to threads at run time by the scheduling algorithm implemented at the OS or runtime level, so the execution order of threads depends entirely on that algorithm, unless we put some control on it. It is often faster for irregular workloads, but it requires care with thread safety. This is useful when the exact workload of the subtasks is not known before execution (see the OpenMP sketch below).
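A minimal OpenMP sketch of the two strategies (illustrative; the chunk size of 64 and the uneven work function are arbitrary choices): `schedule(static)` splits the iterations evenly up front, while `schedule(dynamic)` hands out chunks to threads as they become free, which helps when iterations have very different costs.

```c
/* Simulated uneven work: the cost of an iteration depends on i. */
static double work(long i) {
    double s = 0.0;
    for (long k = 0; k < i % 1000; k++)
        s += 1.0 / (double)(k + 1);
    return s;
}

double run_static(long n) {
    double total = 0.0;
    /* Iterations are divided evenly among threads before execution. */
    #pragma omp parallel for schedule(static) reduction(+:total)
    for (long i = 0; i < n; i++)
        total += work(i);
    return total;
}

double run_dynamic(long n) {
    double total = 0.0;
    /* Chunks of 64 iterations are assigned to idle threads at run time. */
    #pragma omp parallel for schedule(dynamic, 64) reduction(+:total)
    for (long i = 0; i < n; i++)
        total += work(i);
    return total;
}
```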
Domain decomposition
Domain decomposition is the subdivision of the geometric domain into subdomains. It allows a problem to be decomposed into fine-grain tasks, maximizing the number of tasks that can execute concurrently.
This strategy is usually combined with divide-and-conquer (subdividing the problem recursively into a tree-like hierarchy of subproblems), array parallelism and pipelining (subdivision of the sequences of tasks performed on each chunk of data).
With a mix of these partitioning strategies, the general goal is to maximize the potential for concurrent execution and to maintain load balance. Ideally one would like an embarrassingly parallel problem, i.e. many similar but independent tasks solved simultaneously (never really achievable!).
Mapping: assign coarse-grain tasks to processors, subject
to tradeoffs between communication costs and
concurrency.
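A minimal 1-D domain decomposition sketch (illustrative; the function name and the MPI-style rank/size convention are assumptions): given a global array of n cells and `size` processes, each rank computes the index range of its own subdomain, with the remainder cells spread over the first ranks to keep the load balanced.

```c
/* Compute the [start, end) range of the subdomain owned by `rank`
   when n cells are split among `size` processes as evenly as possible. */
void decompose_1d(long n, int size, int rank, long *start, long *end) {
    long base = n / size;   /* cells every rank gets              */
    long rem  = n % size;   /* leftover cells, one per lower rank */

    if (rank < rem) {
        *start = rank * (base + 1);
        *end   = *start + base + 1;
    } else {
        *start = rem * (base + 1) + (rank - rem) * base;
        *end   = *start + base;
    }
}
```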
Measuring performance in HPC
Scalability
• For HPC clusters, it is important that they
are scalable: in other words the capacity
of the whole system can be proportionally
increased by adding more hardware.
• For HPC software, scalability is sometimes
referred to as parallelization efficiency —
the ratio between the actual speedup
and the ideal speedup obtained when
using a certain number of processors.
In the most general sense, scalability or
scaling is defined as the ability to handle
more work as the size of the computer or
application grows. Scalability is widely used
to indicate the ability of hardware and
software to deliver greater computational
power when the number of resources is
increased.
A scalability test measures the ability of an application to perform well or better with varying problem sizes and numbers of processors. It does not test the application's general functionality or correctness.
Application scalability tests can generally be divided into strong scaling and weak scaling.
Strong scaling
In strong scaling, the number of processors is increased while the problem size remains constant. This also results in a reduced workload per processor.
Strong scaling is mostly used for long-running CPU-bound applications, to find a setup which results in a reasonable runtime with moderate resource costs. The individual workload must be kept high enough to keep all processors fully occupied.
The typical plotted performance quantity is the speed-up, S(N) = t(1) / t(N), the ratio between the time needed with one processing element and the time needed with N of them. The extra speedup gained by adding further processes usually shrinks continuously.
Strong scaling
Another typical performance parameter is the efficiency, E(N) = S(N) / N = t(1) / (N · t(N)), where t(1) is the time needed to complete the task with one processing element, t(N) is the time needed to complete the same unit of work with N processing elements, and N is the number of processing elements (the baseline is typically one processor, but sometimes one may refer to a given basic number of cores).
Ideally, we would like software to have a linear speedup that is equal to the number of processors, as that would mean that every processor contributes 100% of its computational power. Unfortunately, this is a very challenging goal for real-world applications to attain.
[Figure: parallel efficiency (%) versus number of cores, from 1 to 100 on a logarithmic scale]
Amdahl’s law
Execution times will not decrease proportionally as the number of cores or resources increases: at some point the problem cannot be divided any further, and the time spent on data traffic and I/O can no longer be reduced.
Moreover, the behaviour cannot be ideal because of the nonzero latency and finite bandwidth of real systems, which cannot communicate data infinitely fast.
This is well explained by the speedup trend with the number of processors N, as stated by Amdahl's law:
S(N) = 1 / (s + p/N)
where p is the fraction of the code that can be parallelized, s = 1 − p is the fraction corresponding to the serial part of the job, and p/N is the parallel part divided up among N processors.
For a fixed problem, the upper limit of the speedup is determined by the serial fraction of the code.
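A worked example under Amdahl's law (numbers chosen only for illustration): even a code that is 95% parallel cannot exceed a 20× speedup, and with 100 processors it reaches only about 17×.

```latex
S(N) = \frac{1}{s + p/N}, \qquad \lim_{N\to\infty} S(N) = \frac{1}{s};
\qquad
p = 0.95,\; s = 0.05:\quad
S(\infty) = \frac{1}{0.05} = 20, \qquad
S(100) = \frac{1}{0.05 + 0.95/100} \approx 16.8 .
```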
Weak scaling
In weak scaling both the number of processors and the problem size are increased. This also results in a
constant workload per processor.
Weak scaling is mostly used for large memory-bound applications, where the required memory cannot be satisfied by a single node. Such applications usually scale well to higher core counts, since their memory access patterns often involve mostly the nearest neighbouring nodes, ignoring those further away. The upscaling is usually restricted only by the available resources or by the maximum problem size.
In this case the weak scaling efficiency is used, defined as E(N) = t(1) / t(N), where t(1) is the time to complete one work unit with one processing element and t(N) is the time to complete N of the same work units with N processing elements.
For an application that scales perfectly weakly, the work done by each node remains the same as the scale of the machine increases, which means that we solve progressively larger problems in the same time it takes to solve smaller ones on a smaller machine.
Gustafson’s law
Sizes of problems scale with the number of
available resources. If a problem only
requires a small number of resources, it is
not beneficial to use many resources to
carry out the computation. A more
reasonable choice is to use small amounts
of resources for small problems and larger
quantities of resources for big problems.
According to Gustafson's law, the parallel part scales linearly with the amount of resources, while the serial part does not grow with the size of the problem. The scaled speedup is
S(N) = s + p · N, with s + p = 1,
or, equivalently:
S(N) = N − (N − 1) · s.
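The two expressions above are the same formula; using s + p = 1 and, as an illustrative choice of numbers, a 5% serial part on 100 processors:

```latex
S(N) = s + pN = s + (1 - s)\,N = N - (N - 1)\,s, \qquad
s = 0.05,\; N = 100:\; S = 100 - 99 \times 0.05 = 95.05 .
```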
Gustafson’s law
Amdahl’s point of view is focused on a fixed
computation problem size as it deals with a
code taking a fixed amount of sequential
calculation time.
Gustafson's objection is that massively parallel machines allow computations that were previously unfeasible, since they enable computations on very large data sets in a fixed amount of time. In other words, a parallel platform does more than speed up the execution of a code: it enables dealing with larger problems.
The pitfall of these rather optimistic speedup
and efficiency evaluations is related to the
fact that, as the problem size increases,
communication costs will increase, but
increases in communication costs are not
accounted for by Gustafson’s law.
Scalability limits
• Hardware: memory-CPU bandwidths and
network communications
• Algorithm (and domain decomposition)
• Parallel Overhead: the amount of time
required to coordinate parallel tasks.
Parallel overhead can include factors
such as:
• task start-up time;
• synchronizations;
• data communications;
• software overhead imposed by parallel compilers, libraries, tools, operating system, etc.;
• task termination time.
THANK YOU
Alessandro Romeo, PhD
a.romeo@cineca.it
Editor's Notes
  • #10: Clock cycle: a single increment of the CPU during which the smallest unit of processor activity is carried out. Modern CPUs (microprocessors) can execute multiple instructions per clock cycle. Power consumption varies as the square or cube of the clock frequency. Propagation velocity of a signal in a vacuum: 300 000 km/s = 30 cm/ns.
  • #12: For message passing, latency is the time to send a message of zero length.
  • #14: Caches are searched in order when seeking data (L1 -> L2 -> L3 -> main memory). In HPC exploiting the cache is crucial for performance. In particular, loops accessing arrays must be written with special care.
  • #16: Usually the closest caches are L1 and L2; the cache common to all cores is L3.
  • #17: processor coordination: do processors need to synchronize (halt execution) to exchange data and status information? memory organization: is the memory distributed among processors or do all processors own a local chunk? address space: is the memory referenced globally or locally with respect to each processor? memory access: does accessing a given memory address take equal time for all processors? granularity: how much resources does the hardware provide to each execution stream? scalability: how does the amount of available resources scale with the size of a subdivision of a system (number of processors)? interconnection network: what connectivity pattern defines the network topology and how does the network route messages?
  • #18: “instructions” can refer to low-level machine instructions or statements in a high-level programming language, and “data” can correspondingly refer to individual operands or higher-level data structures such as arrays.
  • #19: SIMD execution is the backbone of vector-based architectures, which operate directly on vectors of floating-point values rather than on individual values. While restrictive in that the same operation must be applied to each value or pairs of values of vectors, they can often be used effectively by data parallel applications. Dense matrix computations are one domain where this type of parallelism is particularly prevalent and easy to exploit, while algorithms that access data in irregular (unstructured) patterns oftentimes cannot.
  • #20: In an SPMD style program, we must generally be able to move data so that different instances of the program can exchange information and coordinate. The semantics of data movement depend on the underlying memory organization of the parallel computer system. It is rare nowadays to find entire clusters where all processors in the system see the same memory (poor performance). Instead, we have clusters where memory is shared only within a node. The network is needed to share memory among processors in different nodes and hence network topology become quite important in designing a supercomputer.
  • #21: • direct – each network link connects some pair of nodes, • indirect (switched) – each network link connects a node to a network switch or connects two switches. Topology-aware algorithms aim to execute effectively on specific network topologies. If mapped ideally to a network topology, applications and algorithms often see significant performance gains, as network congestion can be controlled and often eliminated. In practice, unfortunately, applications are often executed on a subset of nodes of a distributed machine, which may not have the same connectivity structure as the overall machine. In an unstructured context, network congestion is inevitable, but could be alleviated with general measures.
  • #28: In hyper-threading, only some execution units are duplicated.
  • #29: The languages most used for parallel programming have been C/C++ and Fortran for years, though they were not originally designed for parallelism. To achieve an improvement in speed through the use of parallelism, it is necessary to divide the computation into tasks or processes that can be executed simultaneously.
  • #35: For instance, incrementally parallelizing a sequential code is usually easiest with shared-memory systems, but achieving high performance using threads usually requires the data access pattern to be carefully optimized. Distributed-memory algorithms often require the bulk of the computation to be done in a data parallel manner, but they provide a clear performance profile, due to data movement being done explicitly.
  • #41: Factors that contribute to scalability include: hardware: memory-CPU bandwidths and network communications application algorithm parallel overhead
  • #48: Mention data races and deadlocks!