2. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 2
Summary
Why do we bother with parallelization?
Elements of HPC architecture
Parallel programming
Measuring performances
3. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 3
References
• Previous CINECA courses & CINECA website. Special thanks to Andrew Emerson &
Alessandro Marani
• Michael T. Heath and Edgar Solomonik – Parallel numerical algorithms, Lecture
notes
• Stefan Boeriu, Kai-Ping Wang and John C. Bruch Jr. – Parallel computation, Lecture
notes
• https://hpc-wiki.info/hpc/HPC_Wiki
• https://stackoverflow.com/
5. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 5
The need for parallelism
The use of computers to study physical systems allows us to explore phenomena at all scales:
• very large (weather forecasts, climatology, cosmology, data mining, oil reservoir)
• very small (drug design, silicon chip design, structural biology)
• very complex (fundamental physics, fluid dynamics, turbulence)
• too dangerous or expensive (fault simulation, nuclear tests, crash analysis)
[Image: Magneticum homepage]
6. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 6
Simulation of an earthquake:
https://www.etp4hpc.eu/what-is-hpc.html
Exscalate4cov:
https://www.exscalate4cov.eu/
Weather – MISTRAL project:
https://www.cineca.it/news/mistral-portale
Computational methods allow us to study complex phenomena, giving a powerful impetus to scientific
research. This is made possible by supercomputers.
The need for parallelism
7. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 7
An algorithm that uses only a single core is
referred to as a serial algorithm.
These algorithms complete one instruction
at a time, in order.
Parallelization means converting a serial
algorithm into an algorithm that can
perform multiple operations
simultaneously.
Parallel computing is the simultaneous use
of multiple compute resources to solve a
computational problem, broken into
discrete parts that can be solved
concurrently.
Serial vs parallel computing
8. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 8
Serial vs parallel computing
Most programs written in common programming languages run on a single thread, on a single processor core. Such processing is synchronous.
https://godbolt.org/z/cWjPcnca3
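As a minimal illustration (a generic sketch, not the code behind the Compiler Explorer link above), the following C program runs as a single instruction stream on one core, each loop iteration starting only after the previous one has finished:

#include <stdio.h>

int main(void) {
    double sum = 0.0;
    /* One instruction stream: each iteration starts only after the previous one has finished. */
    for (int i = 1; i <= 1000000; i++)
        sum += 1.0 / i;
    printf("sum = %f\n", sum);
    return 0;
}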
9. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 9
Von Neumann model
[Diagram: conventional computer – Input and Output connected to the Memory and to the CPU (Control Unit + Arithmetic Logic Unit), with separate data and control paths.]
Memory: stores data and programs.
CPU:
• Control Unit: executes the program;
• ALU: performs arithmetic/logic operations.
Instructions are processed sequentially:
1. A single instruction is loaded from memory (fetch) and decoded;
2. The addresses of the operands are computed;
3. The operands are fetched from memory;
4. The instruction is executed;
5. The result is written to memory (store).
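A toy sketch of this fetch–decode–execute–store cycle, written as a tiny interpreter in C; the "instruction set" and memory layout are invented purely for illustration:

#include <stdio.h>

enum { OP_ADD, OP_HALT };               /* toy instruction set (illustrative only) */
typedef struct { int op, dst, src1, src2; } Instr;

int main(void) {
    Instr program[] = { {OP_ADD, 2, 0, 1}, {OP_HALT, 0, 0, 0} };
    int memory[4] = { 3, 4, 0, 0 };      /* data memory: operands and result */
    for (int pc = 0; ; pc++) {
        Instr in = program[pc];          /* 1-2: fetch and decode the instruction */
        if (in.op == OP_HALT) break;
        int a = memory[in.src1];         /* 3: fetch the operands from memory */
        int b = memory[in.src2];
        int r = a + b;                   /* 4: execute in the ALU */
        memory[in.dst] = r;              /* 5: store the result back to memory */
    }
    printf("result = %d\n", memory[2]);  /* prints 7 */
    return 0;
}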
10. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 10
The need for parallelism
The instructions of all modern processors are synchronized with a timer or clock.
The clock cycle τ is defined as the time between two adjacent pulses of the oscillator that sets the timing of the processor.
The number of these pulses per second is known as the clock speed or clock frequency, generally measured in GHz (billions of pulses per second).
The clock cycle controls the synchronization of operations in a computer: every operation inside the processor lasts a multiple of τ.
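As a worked example with illustrative numbers: a 2.5 GHz processor produces 2.5 × 10⁹ pulses per second, so its clock cycle is τ = 1 / (2.5 × 10⁹ s⁻¹) = 0.4 ns, and an operation taking 5 cycles lasts 2 ns.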
Clock frequency cannot increase indefinitely:
• Power consumption
• Heat dissipation
• Speed of light
• Cost
11. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 11
The need for parallelism
The rapid spread and disruptive emergence of computers in the modern world is due to the exponential
growth in their capabilities and corresponding reduction in relative cost per component.
According to Moore's law, the density of transistors on a chip doubles roughly every two years. For more than 40 years, the transistor size of state-of-the-art chips has indeed shrunk by a factor of two every two years.
Moore's Law still holds but will inevitably fail for the following reasons:
• Minimum transistor size: transistors cannot be smaller than single atoms (10–14 nm feature sizes are common)
• Quantum tunnelling: quantum effects can cause current leakage
• Heat dissipation and power consumption
An increase in transistor count does not necessarily mean more CPU power: software usually struggles to make full use of the available hardware threads.
12. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 12
The need for parallelism
Regardless of the power of the processor, the real limitation in HPC is the performance gap between processors and memory, which has been growing over time. It is therefore very important, in both software and hardware design, to minimize the time it takes to transfer data to and from the CPU.
• Bandwidth: how much data can be transferred over a data channel per unit time
• Latency: the minimum time needed to transfer
data
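As an illustrative calculation (the numbers are hypothetical, not those of a specific machine): sending 1 MB over a link with 1 µs latency and 10 GB/s bandwidth takes roughly t = latency + size / bandwidth = 1 µs + 10⁶ B / (10¹⁰ B/s) ≈ 101 µs. For small messages the latency term dominates; for large messages the bandwidth term does.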
Reading data
13. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 13
The need for parallelism
Memory hierarchy
For all systems, CPUs are much faster than the devices
providing the data.
Cache memory is small but very fast memory which sits
between the processor and the main memory.
General strategy:
• check the cache before main memory;
• if the data are not in the cache, nearby data are loaded from main memory in the hope that the next access will be served by the cache (cache hit) rather than by main memory (cache miss).
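A minimal C sketch of why this matters (array size and layout are illustrative): both functions below compute the same sum, but the first walks the matrix contiguously, reusing each cache line, while the second strides through memory and causes far more cache misses.

#define N 2048
double a[N][N];                    /* stored row by row (row-major) in C */

double sum_row_major(void) {       /* contiguous accesses: cache friendly */
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

double sum_column_major(void) {    /* strided accesses: frequent cache misses */
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}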
14. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 14
The need for parallelism
Cache levels
Cache memory is classified in terms of levels which describe its closeness to the microprocessor.
• Level 1 (L1): extremely fast but small (e.g. 32 KB), usually embedded in the CPU.
• Level 2 (L2): bigger (e.g. 2 MB) but slower, possibly on a separate chip. Each core may have its own dedicated L1 and L2 cache.
• Level 3 (L3): often shared amongst cores.
Programs constantly waiting for data from memory are memory bound.
16. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 16
Basic concepts of parallel computer architecture
17. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 17
Basic concepts of parallel computer architecture
A parallel computer consists of
a collection of processors and
memory banks together with an
interconnection network to
enable processors to work
collectively in concert.
Major architectural issues in the
design of parallel computer
systems include the following:
• processor coordination
• memory organization
• address space
• memory access
• granularity
• scalability
• interconnection network
18. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 18
Basic concepts of parallel computer architecture
The parallelism can be performed within each core, across cores in a multicore chip, and on networked
systems of chips.
Often, several chips are combined in a single processing node, with all having direct access to a shared
memory within that node.
• instruction-level parallelism (ILP) refers to concurrent execution of multiple machine instructions within a
single core
• shared-memory parallelism refers to execution on a single-node multicore (and often multi-chip) system,
where all threads/cores have access to a common memory,
• distributed-memory parallelism refers to execution on a cluster of nodes connected by a network,
typically with a single processor associated with every node.
19. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 19
Parallel taxonomy
Parallelism in computer systems can take various forms, which can be classified by Flynn’s taxonomy in
terms of the numbers of instruction streams and data streams.
• SISD – single instruction stream, single data stream
– used in traditional serial computers (Turing
machines)
• SIMD – single global instruction stream, multiple data streams – critical in special-purpose "data parallel" (vector) computers and also present in many modern CPUs. It is basically parallelism within a single core (see the sketch after this list)
• MISD – multiple instruction streams, single data
stream – not particularly useful, except if interpreted
as a method of “pipelining”
• MIMD – multiple instruction streams, multiple data
streams – corresponds to general purpose parallel
(multicore or multinode) computers.
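As a minimal sketch of the SIMD idea (the function name and the OpenMP simd directive are illustrative; plain compiler auto-vectorization achieves the same effect), one vector instruction applies the same multiply–add to several array elements per cycle:

/* The same multiply-add is applied to every element: one vector instruction
   can process several values of i per clock cycle. */
void saxpy(int n, float alpha, const float *x, float *y) {
    #pragma omp simd
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}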
20. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 20
Memory classification
Performance strongly depends on the underlying memory organization.
• Shared memory: memory is shared among processors within a node
• Distributed memory: memory is shared among processors in different nodes through a given network
The time each processor needs to access memory is not uniform: this organization is called NUMA (Non-Uniform Memory Access). Memory access is faster when a processor accesses its local memory than when it has to go through the network.
21. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 21
Network topology
Networks linking the nodes of a distributed system differ in price, performance and hardware vendor (Ethernet, Gigabit, InfiniBand, Omni-Path, …).
Networks are configured in particular topologies (direct and indirect).
If switches are used, the network is called a fabric.
22. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 22
Leonardo topology
All the nodes are interconnected through an Nvidia
Mellanox network, with Dragonfly+, capable of a
maximum bandwidth of 200Gbit/s between each pair of
nodes.
This is a relatively new topology for InfiniBand-based networks that allows a very large number of nodes to be interconnected while containing the number of switches and cables and keeping the network diameter very small.
In comparison to non-blocking fat tree topologies, cost
can be reduced and scaling-out to a larger number of
nodes becomes feasible. In comparison to 2:1 blocking fat
tree, close to 100% network throughput can be achieved
for arbitrary traffic.
Leonardo Dragonfly+ topology features a fat-tree intra-
group interconnection, with 2 layers of switches, and an
all-to-all inter-group interconnection.
23. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 23
Parallel filesystems
The filesystem manages how files are stored on disks and how they can be retrieved or written. In a
parallel architecture, with many simultaneous accesses to the disks, it is important to use a parallel
filesystem technology such as GPFS, LUSTRE, BeeGFS, …
24. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 24
Accelerators: GPUs
[Charts: "DPPC on M100 (1 node)" – ns/day for 4 GPUs (all-bonds), 4 GPUs (h-bonds), 2 GPUs (h-bonds) and 32 Galileo nodes; "Galileo DPPC (GMX 2018.8)" – ns/day versus number of nodes.]
Code acceleration
Originally video cards for personal computers, GPUs have become big business in HPC. NVIDIA are the market leaders, but AMD and Intel also make GPUs.
• Low energy per watt, high performance per dollar.
• GPUs are ideal for machine learning: high performance for linear algebra and hardware support for low floating-point precisions (16- and 8-bit).
https://xkcd.com/1838/
26. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 26
Typical numerical algorithms and
applications leverage MIMD architectures
not by explicitly coding different
instruction streams, but by running the
same program on different portions of
data. The instruction stream created by
programs is often data-dependent (e.g.,
contains branches based on the state of the data), resulting in MIMD execution.
This single-program, multiple-data (SPMD) programming model is widely prevalent
and will be our principal mode for
specifying parallel algorithms.
SPMD
27. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 27
Parallel programming
Thread (or task) parallelism is based on partitioning the operations of the algorithm: if an algorithm consists of a series of independent operations, these can be spread across the processors, thus parallelizing the program.
Data parallelism means spreading the data to be processed across the processors. The processors execute the same operations, but on different data sets. This often means distributing array elements across the computing units.
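A minimal sketch of data parallelism with OpenMP on a shared-memory node (the function and directive shown are illustrative): the loop iterations, i.e. the array elements, are distributed among the threads, each executing the same operation on a different chunk.

void scale(int n, double alpha, double *v) {
    /* The array elements are split among the threads of the team;
       every thread runs the same statement on a different subset of v. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        v[i] *= alpha;
}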
28. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 28
Parallel programming
There are multiple ways to perform parallelism in
a HPC cluster:
• Instruction level (e.g. FMA = fused multiply-add)
• SIMD or vector processing (e.g. data
parallelism)
• Hyperthreading (e.g. 4 hardware
threads/core for Intel KNL, 8 for PowerPC)
• Cores/processor (e.g. 18 for Intel Broadwell)
• Processors (or sockets)/node - often 2 but
can be 1 (KNL) or > 2
• Processors + accelerators (e.g. CPU+GPU)
• Nodes in a system
To reach the maximum (peak) performance of
a parallel computer, all levels of parallelism
need to be exploited.
29. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 29
Parallel programming
We need libraries, tools, language extensions, algorithms and
paradigms which allow us to:
- exploit within a node vector and cache units, hardware, shared
memory
- manage inter-node connections to exchange data with processes on
other nodes
- debug and profile programs to check correctness of results and
performance
- use disk space appropriately
• The most used languages for years have been C/C++ and Fortran; implementations now also exist for higher-level languages such as Python or Julia.
• For GPUs: CUDA, OpenACC, …
Thanks to these libraries, numerical problems can be mapped onto parallel algorithms.
On a parallel computer, user applications are executed as processes,
threads or tasks.
30. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 30
Parallel programming
Process
A process is an instance of a computer program that is being executed. It contains the program code and
its current activity. Depending on the operating system, a process may be made up of multiple threads of
execution that execute instructions concurrently.
Process-based multitasking enables you to run, for example, a compiler while you are using a text editor. When employing multiple processes on a single CPU, context switching between the various memory contexts is used. Each process has a complete set of its own variables.
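A minimal POSIX C sketch of process creation (illustrative only): after fork() each process has its own copy of the variables, so a change made in the child is not visible in the parent.

#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    int x = 1;
    pid_t pid = fork();              /* create a second process: a copy of this one */
    if (pid == 0) {                  /* child process: has its own copy of x */
        x = 42;
        printf("child:  x = %d\n", x);
    } else {                         /* parent process: its x is unchanged */
        wait(NULL);                  /* wait for the child to finish */
        printf("parent: x = %d\n", x);
    }
    return 0;
}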
31. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 31
Parallel programming
Thread
A thread is a basic unit of CPU
utilization, consisting of a program
counter, a stack, and a set of
registers.
A thread of execution results from a
fork of a computer program into
two or more concurrently running
tasks. The implementation of
threads and processes differs from
one operating system to another,
but in most cases, a thread is
contained inside a process.
Moreover, a process can create
threads.
Threads are basically execution flows that run in the same memory context, each with a unique ID assigned. Threads of a process may share the same data during execution. Thread execution is managed by a scheduler, and context switching is performed for threads as well.
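A minimal sketch with POSIX threads (illustrative only): the two threads created below live inside the same process and therefore see the same counter variable; a mutex serializes the shared update.

#include <stdio.h>
#include <pthread.h>

long counter = 0;                            /* shared: same memory context for all threads */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *work(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);           /* serialize access to the shared variable */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, work, NULL);   /* the process creates two threads */
    pthread_create(&t2, NULL, work, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);      /* 200000: both threads saw the same memory */
    return 0;
}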
32. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 32
Parallel programming
Task
A task is a set of program instructions that are loaded in memory. Threads can split themselves into two or
more simultaneously running tasks. Often tasks and processes are synonymous.
Multiple threads can exist within the
same process and share resources
such as memory, while different
processes do not share these
resources.
An example of multiple threads in the same process is automatic spell checking and automatic saving of a file while you write.
Multi-core processors are capable of running more than one process or thread at the same time.
33. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 33
Programming models
Shared and distributed memory organizations are
associated with different programming models.
The performance of a multi-threaded program often
depends primarily on load balance and data movement.
As different threads access different parts of a shared
memory, the hardware caching protocols implicitly
orchestrate data movement between local caches and
main memory (as for OpenMP).
These threads typically coordinate their accesses via (a minimal sketch follows this list):
• locks, which restrict access to some data to a specific instruction stream
• atomic operations, which allow some threads to perform
basic manipulations of data items without interference
from other threads during that operation.
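A minimal OpenMP sketch of the two mechanisms (the values are arbitrary and purely illustrative): atomic protects a single update without a full lock, while critical puts a lock around a whole block so that only one thread at a time executes it.

#include <stdio.h>

int main(void) {
    long hits = 0;
    double maxval = 0.0;
    #pragma omp parallel for
    for (int i = 0; i < 1000000; i++) {
        double v = (i % 1000) / 1000.0;   /* arbitrary per-iteration value */
        #pragma omp atomic                /* atomic: protects this single update */
        hits++;
        #pragma omp critical              /* lock: one thread at a time in this block */
        { if (v > maxval) maxval = v; }
    }
    printf("hits = %ld, max = %f\n", hits, maxval);
    return 0;
}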
By contrast, on distributed-memory systems, data movement is typically done explicitly via passing
messages (i.e. data) between processors of a node or different nodes (such as MPI).
Often simulations combine both in a hybrid approach: OpenMP in a node and MPI among nodes.
34. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 34
Some example: CPU
Message Passing Interface (MPI)
• Allows parallel processes to communicate via sending “messages”
(i.e. data).
• Most standard way of communication between nodes, but can also
be used within a node.
OpenMP
• Allows parallel processes to communicate via shared memory in a
node.
• Cannot be used across nodes, since separate nodes do not share memory.
Hybrid MPI + OpenMP
• Combines both MPI + OpenMP.
• For example, OpenMP within a shared memory node and MPI
between nodes.
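A minimal hybrid sketch (illustrative; built with an MPI wrapper compiler such as mpicc plus an OpenMP flag): MPI ranks communicate between nodes, while OpenMP threads share memory inside each rank.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank, nranks;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);  /* MPI across nodes */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    #pragma omp parallel                                            /* OpenMP inside the node */
    {
        printf("rank %d of %d, thread %d of %d\n",
               rank, nranks, omp_get_thread_num(), omp_get_num_threads());
    }
    MPI_Finalize();
    return 0;
}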
35. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 35
Programming models
In terms of their corresponding programming models, shared and distributed memory organizations have
advantages and disadvantages with respect to one another.
Programmers need to be careful not to spend too much time performing communication instead of computation: most of the time, applications show massive bottlenecks because of communication.
36. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 36
Some example: GPU
OpenACC
• Directive-based, similar to OpenMP, but used to program devices such as GPUs (see the sketch after this list).
CUDA
• Nvidia extension to C/C++ for GPU
programming.
OpenCL
• Non-Nvidia alternative to programming GPUs.
SYCL and Intel oneAPI
• C++ dialects for programming any device.
HIP
• AMD extension to C/C++ for GPU programming.
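As a minimal OpenACC sketch in C (the function name and data clauses are illustrative): the directive offloads the loop to an accelerator; without an OpenACC compiler the pragma is simply ignored and the code runs serially.

/* Offload a vector addition to an accelerator; the data clauses move the arrays to/from the GPU. */
void vadd(int n, const float *a, const float *b, float *c) {
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}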
37. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 37
Load balance
There are several HPC models for distributing the work to be performed among different processors.
A common distribution model is the master-worker model (or fork-join). In this model we start with a single control process, called the master, which controls the worker processes. Depending on the programming language, the communication can be one-sided, so that the programmer needs to explicitly manage only one of the two processes involved in an operation.
38. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 38
Load balance
How can we distribute the work among the workers?
• Static scheduling is the mechanism where the way the threads/processes execute the code is fixed in advance, by dividing the amount of work evenly among all available threads. This is useful when the workload is known a priori, before code execution.
• Dynamic scheduling is the mechanism where thread scheduling is done by the operating system (or runtime), based on whatever scheduling algorithm is implemented at that level. The execution order of the threads therefore depends entirely on that algorithm, unless we put some control on it. It is faster but often not thread-safe. This is useful when the exact workload of the subtasks is unknown before execution.
Both policies are shown in the sketch after this list.
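A minimal OpenMP sketch of the two policies (chunk size and function names are illustrative): schedule(static) splits the iterations evenly before execution, while schedule(dynamic) lets each thread grab the next chunk as soon as it is free, which helps when iteration costs vary.

void process_static(int n, double *w) {
    #pragma omp parallel for schedule(static)      /* work split evenly before execution */
    for (int i = 0; i < n; i++)
        w[i] = w[i] * 2.0;
}

void process_dynamic(int n, double *w) {
    #pragma omp parallel for schedule(dynamic, 16) /* threads pull chunks of 16 iterations at run time */
    for (int i = 0; i < n; i++)
        w[i] = w[i] * 2.0;
}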
39. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 39
Domain decomposition
Domain decomposition is the subdivision of the geometric domain into subdomains. It allows a problem to be decomposed into fine-grain tasks, maximizing the number of tasks that can execute concurrently.
This strategy is usually combined with divide-and-conquer
(subdivide problem recursively into tree-like hierarchy of
subproblems), array parallelism and pipelining (subdivision
of sequences of tasks performed on each chunk of data).
With a mix of these partitioning strategies the general goal
is to maximize the potential for concurrent execution and
to maintain load balance. Ideally one would like to have
an embarrassingly parallel problem, i.e. solving many
similar, but independent tasks simultaneously (never really
achievable!)
Mapping: assign coarse-grain tasks to processors, subject
to tradeoffs between communication costs and
concurrency.
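A minimal MPI sketch of a 1-D block decomposition (the problem size is illustrative): each rank computes the bounds of its own subdomain, spreading the remainder over the first ranks to keep the load balanced.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, nranks, n_global = 1000;            /* illustrative global problem size */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Split the global index range [0, n_global) into contiguous subdomains,
       giving the remainder to the first ranks so the load stays balanced. */
    int base = n_global / nranks, rem = n_global % nranks;
    int local_n = base + (rank < rem ? 1 : 0);
    int start = rank * base + (rank < rem ? rank : rem);

    printf("rank %d owns indices [%d, %d)\n", rank, start, start + local_n);
    MPI_Finalize();
    return 0;
}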
41. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 41
Scalability
• For HPC clusters, it is important that they
are scalable: in other words the capacity
of the whole system can be proportionally
increased by adding more hardware.
• For HPC software, scalability is sometimes
referred to as parallelization efficiency —
the ratio between the actual speedup
and the ideal speedup obtained when
using a certain number of processors.
In the most general sense, scalability or
scaling is defined as the ability to handle
more work as the size of the computer or
application grows. Scalability is widely used
to indicate the ability of hardware and
software to deliver greater computational
power when the number of resources is
increased.
A scalability test measures the ability of an application to perform well or better with varying problem sizes and numbers of processors. It does not test the application's general functionality or correctness.
Application scalability tests can generally be divided into strong scaling and weak scaling.
42. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 42
Strong scaling
Application scalability tests can generally be divided into strong scaling and weak scaling.
In case of strong scaling, the number of
processors is increased while the problem size
remains constant. This also results in a reduced
workload per processor.
Strong scaling is mostly used for long-running
CPU-bound applications to find a setup which
results in a reasonable runtime with moderate
resource costs.
The individual workload must be kept high
enough to keep all processors fully occupied.
The typical plotted performance quantity is the speed-up
S(N) = t(1) / t(N)
i.e. the ratio between the runtime of the serial run and the runtime on N processors. The speed-up gained by increasing the number of processes usually decreases continuously.
43. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 43
Strong scaling
Another typical performance parameter is the efficiency
E(N) = S(N) / N = t(1) / (N · t(N))
where t(1) is the time needed to complete the task with the serial (baseline) run, t(N) is the time needed to complete the same amount of work with N processing elements, and the baseline typically uses a single processor (though sometimes one may refer to a given basic number of cores instead).
Ideally, we would like software to have a linear speedup that is equal to the number of processors, as that
would mean that every processor would be contributing 100% of its computational power. Unfortunately,
this is a very challenging goal for real world applications to attain.
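As a worked example with illustrative (made-up) numbers: if the serial run takes t(1) = 100 s and the run on N = 64 cores takes t(64) = 2.5 s, then S(64) = 100 / 2.5 = 40 and E(64) = 40 / 64 ≈ 62.5 %.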
[Plot: parallel efficiency (%) versus number of cores (1–100).]
44. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 44
Amdahl’s law
Execution times do not decrease proportionally as the number of cores or resources is increased. Beyond a certain point the compute time cannot be reduced further, because the problem cannot be subdivided any more finely without the cost of data traffic and I/O outweighing the gain.
Moreover, the behaviour cannot be ideal, because real applications run with nonzero latency and finite bandwidth: data cannot be communicated infinitely fast.
This is well explained by considering the speed-up trend with the number of processors N, as stated by Amdahl's law:
S(N) = 1 / (s + p/N)
where p is the fraction of the code that can be parallelized, s = 1 − p is the fraction corresponding to the serial part of the job, and p/N is the parallel part divided up among N processors.
For a fixed problem, the upper limit of the speed-up is therefore determined by the serial fraction of the code: S(N) ≤ 1/s.
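As a worked example with illustrative numbers: if p = 95 % of the code is parallel (s = 0.05), then on N = 100 processors S(100) = 1 / (0.05 + 0.95/100) ≈ 16.8, and even with infinitely many processors the speed-up can never exceed 1/s = 20.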
45. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 45
Weak scaling
In weak scaling both the number of processors and the problem size are increased. This also results in a
constant workload per processor.
Weak scaling is mostly used for large memory-
bound applications where the required memory
cannot be satisfied by a single node.
They usually scale well to higher core counts, because their memory-access strategies often involve only the nearest neighbouring nodes, ignoring those further away, and therefore themselves scale well.
The upscaling is usually restricted only by the
available resources or the maximum problem size.
In this case the weak-scaling efficiency is used, defined as
E(N) = t(1) / t(N)
i.e. the ratio between the time needed to complete one work unit with one processing element and the time needed to complete N work units with N processing elements.
For an application that scales perfectly weakly, the work done by each node remains the same as the scale of the machine increases, which means that we are solving progressively larger problems in the same time it takes to solve smaller ones on a smaller machine.
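As an illustrative example: if one work unit takes t(1) = 10 s on one node and N work units take t(N) = 11 s on N nodes, the weak-scaling efficiency is E = 10 / 11 ≈ 91 %, independent of N.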
46. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 46
Gustafson’s law
Sizes of problems scale with the number of
available resources. If a problem only
requires a small number of resources, it is
not beneficial to use many resources to
carry out the computation. A more
reasonable choice is to use small amounts
of resources for small problems and larger
quantities of resources for big problems.
According to Gustafson's law, the parallel part scales linearly with the amount of resources, while the serial part does not increase with respect to the size of the problem:
S(N) = s + p · N
or, equivalently:
S(N) = N + (1 − N) · s
where s and p = 1 − s are again the serial and parallel fractions of the workload.
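With the same illustrative numbers as in the Amdahl example (s = 0.05, N = 100), Gustafson's law gives S(100) = 0.05 + 0.95 × 100 = 95.05: letting the problem grow with the machine keeps almost all of the resources usefully busy.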
47. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 47
Gustafson’s law
Amdahl’s point of view is focused on a fixed
computation problem size as it deals with a
code taking a fixed amount of sequential
calculation time.
Gustafson's objection is that massively parallel machines allow previously unfeasible computations, since they enable computations on very large data sets in a fixed amount of time. In other words, a parallel platform does more than speed up the execution of a code: it enables dealing with larger problems.
The pitfall of these rather optimistic speedup
and efficiency evaluations is related to the
fact that, as the problem size increases,
communication costs will increase, but
increases in communication costs are not
accounted for by Gustafson’s law.
48. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 48
Scalability limits
• Hardware: memory-CPU bandwidths and
network communications
• Algorithm (and domain decomposition)
• Parallel Overhead: the amount of time
required to coordinate parallel tasks.
Parallel overhead can include factors
such as:
• task start-up time;
• synchronizations;
• data communications;
• software overhead imposed by
parallel compilers, libraries, tools,
operating system, etc;
• task termination time.
49. 13/08/2025 Introduction to HPC - Alessandro Romeo, PhD - CINECA 49
THANK YOU
Alessandro Romeo, PhD
a.romeo@cineca.it
Editor's Notes
#10: Clock cycle: a single increment of the CPU during which the smallest unit of processor activity is carried out. Modern CPUs (microprocessors) can execute multiple instructions per clock cycle.
Power consumption varies as the square or cube of the clock frequency.
Propagation velocity of a signal in a vacuum: 300,000 km/s = 30 cm/ns.
#12: For message passing, latency is the time to send a message of zero length.
#14:Caches are searched in order when seeking data (L1 -> L2 -> L3 -> main memory).
In HPC exploiting the cache is crucial for performance. In particular, loops accessing arrays must be written with special care.
#16: Usually the closest caches are L1 and L2; the cache common to all cores is L3.
#17: processor coordination: do processors need to synchronize (halt execution) to exchange data and status information?
memory organization: is the memory distributed among processors or do all processors own a local chunk?
address space: is the memory referenced globally or locally with respect to each processor?
memory access: does accessing a given memory address take equal time for all processors?
granularity: how much resources does the hardware provide to each execution stream?
scalability: how does the amount of available resources scale with the size of a subdivision of a system (number of processors)?
interconnection network: what connectivity pattern defines the network topology and how does the network route messages?
#18:“instructions” can refer to low-level machine instructions or statements in a high-level programming language, and “data” can correspondingly refer to individual operands or higher-level data structures such as arrays.
#19:SIMD execution is the backbone of vector-based architectures, which operate directly on vectors of floating-point values rather than on individual values. While restrictive in that the same operation must be applied to each value or pairs of values of vectors, they can often be used effectively by data parallel applications. Dense matrix computations are one domain where this type of parallelism is particularly prevalent and easy to exploit, while algorithms that access data in irregular (unstructured) patterns oftentimes cannot.
#20:In an SPMD style program, we must generally be able to move data so that different instances of the program can exchange information and coordinate. The semantics of data movement depend on the underlying memory organization of the parallel computer system.
It is rare nowadays to find entire clusters where all processors in the system see the same memory (poor performance). Instead, we have clusters where memory is shared only within a node. The network is needed to share memory among processors in different nodes and hence network topology become quite important in designing a supercomputer.
#21:• direct – each network link connects some pair of nodes,
• indirect (switched) – each network link connects a node to a network switch or connects two switches.
Topology-aware algorithms aim to execute effectively on specific network topologies. If mapped ideally to a network topology, applications and algorithms often see significant performance gains, as network congestion can be controlled and often eliminated. In practice, unfortunately, applications are often executed on a subset of nodes of a distributed machine, which may not have the same connectivity structure as the overall machine. In an unstructured context, network congestion is inevitable, but could be alleviated with general measures.
#28: In hyper-threading, only some execution units are duplicated.
#29:The languages most used for parallel programming have been C/C++ and Fortran for years, though they were not originally designed for parallelism.
To achieve an improvement in speed through the use of parallelism, it is necessary to divide the computation into tasks or processes that can be executed simultaneously.
#35:For instance, incrementally parallelizing a sequential code is usually easiest with shared-memory systems, but achieving high performance using threads usually requires the data access pattern to be carefully optimized. Distributed-memory algorithms often require the bulk of the computation to be done in a data parallel manner, but they provide a clear performance profile, due to data movement being done explicitly.
#41:Factors that contribute to scalability include:
hardware: memory-CPU bandwidths and network communications
application algorithm
parallel overhead