Webinar: Getting started with Mahti
Jussi Enkovaara
Contents
• Overview of Mahti
• Running programs in Mahti
• Building programs in Mahti
• Technical details about Mahti
User documentation: docs.csc.fi
Getting access to Mahti
• All users need to apply for new services via the new CSC customer portal my.csc.fi
• The project manager of a CSC project needs to apply for the Mahti service in the CSC customer portal my.csc.fi
o The project manager can add CSC users to the project
o Users need to accept the terms and conditions
• Connect with ssh
o ssh <csc_username>@mahti.csc.fi
Mahti - overview
• 1404 compute nodes with next generation AMD Rome CPUs
• Two 64-core CPUs per node
• Each core can run 2 threads, thus applications can see 256 “cores” per node
• 2.6 GHz base frequency (maximum boost 3.3 GHz)
• 256 GB of memory per node
• About 180 000 cores in total
• Infiniband HDR interconnect between nodes
o 200 Gbit/s bandwidth
• Over 8 petabytes of work disk for data under active use
In customer use since August 2020
Storage in Mahti
• Similar disk system as in Puhti
• SCRATCH directories are of the form: /scratch/<project>
• PROJAPPL: /projappl/<project>
• Project names and other information can be found at my.csc.fi
• The csc-workspaces command can be used for listing the available directories in Mahti (see the example below)
• The disk areas of the different supercomputers are separate: home, projappl and scratch in Puhti cannot be directly accessed from Mahti.
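A quick orientation using only the commands and paths named above (the exact output of csc-workspaces may differ):

# List the disk areas available to your projects
csc-workspaces
# Work in the project's scratch area for data under active use
cd /scratch/<project>
# Project-specific application installations go under projappl
cd /projappl/<project>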
Moving data between Puhti and Mahti
• Data can be moved between supercomputers via Allas (see the sketch below)
o Recommended approach if the data should also be preserved for a longer time
• Data can also be copied directly with the rsync command
• For example, copy the directory my_results from Puhti to Mahti:
rsync -azP my_results <username>@mahti.csc.fi:/scratch/project_xxxxxxx
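For the Allas route, a sketch along these lines should work; the allas module and the a-put/a-get/a-list client tools are assumptions here, so check the Allas documentation at docs.csc.fi for the exact commands and object naming:

# On Puhti: upload the directory to Allas
module load allas
allas-conf <project>
a-put my_results
# On Mahti: list the stored objects and download the one you need
module load allas
allas-conf <project>
a-list
a-get <bucket>/my_results.tar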
Module system
• Similar module system as in Puhti
• The module system is hierarchical: the availability of modules can depend on the currently loaded modules (e.g. the compiler suite), as illustrated below
• List modules compatible with the current set
module avail
• List all available modules
module spider
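A short illustration of the hierarchy with Lmod; the module names below are only examples, and the exact names and versions on Mahti may differ:

module load gcc        # load a compiler suite
module load openmpi    # an MPI implementation compatible with the loaded compiler
module avail           # now lists only modules compatible with gcc + openmpi
module spider fftw     # searches all FFTW modules, regardless of what is loaded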
Running applications
• Scientific software installed by CSC is available via modules
• Due to the many cores within a node, many applications benefit from hybrid MPI/OpenMP parallelization
o Some applications benefit from simultaneous multithreading (SMT), i.e. two threads per core
o Simultaneous multithreading can also slow down applications
o Memory bound applications may benefit from using fewer than 128 cores per node
• The optimum ratio of MPI tasks to OpenMP threads per node depends heavily on the application and the input set, and should be tested before production runs (see the sketch below)
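A minimal sketch of such a test, assuming a batch script job.sh (hypothetical name) that sets OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK as in the examples later in this deck; the sbatch command-line options override the corresponding #SBATCH directives in the script:

# Try a few ways of splitting the 128 cores of a node between MPI tasks and OpenMP threads
for ntasks in 128 64 32 16; do
    sbatch --ntasks-per-node=$ntasks --cpus-per-task=$((128 / ntasks)) job.sh
done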
Batch job partitions in Mahti
Partition   Nodes    Time limit   Access
test        1-2      1 hour       All
medium      1-20     36 hours     All
large       20-200   36 hours     Scalability test
gc          1-700    36 hours     Scalability test
• Only full nodes are allocated in Mahti
• Jobs have access to all the cores and memory in a node, but may choose to run with fewer cores for better performance
• Billing is based on allocated nodes
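The current partition limits and your own jobs can be checked with standard Slurm commands, for example:

sinfo -o "%P %l %D %a"   # partition, time limit, number of nodes, availability
squeue -u $USER          # your queued and running jobs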
Access to large partition
• The project manager can apply for access to the large partition via my.csc.fi
• A 30-day test period is granted automatically
• During the test period the scalability and parallel performance of the code can be demonstrated
• The results are submitted for evaluation, and production access is granted if the performance is sufficient
• Detailed instructions at:
docs.csc.fi/accounts/how-to-access-mahti-large-partition/
Interactive pre/post-processing
• Jobs in the interactive partition can reserve 1-8 cores, and each core reserves 1.875 GB of memory
• Easy-to-use sinteractive -i tool
o By default, two cores and 24 hours
• The interactive partition can also be used via normal batch job scripts (see the sketch below)
• A user can reserve a maximum of 8 cores at a time
o Can be split into multiple small sessions
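A minimal batch script sketch for this use case; the partition name interactive and the program name are assumptions, so check docs.csc.fi for the exact partition name and limits:

#!/bin/bash
#SBATCH --job-name=postproc
#SBATCH --account=<project>
#SBATCH --partition=interactive   # partition name assumed, verify from the documentation
#SBATCH --time=04:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2         # at most 8 cores in total at a time
srun mypostprocessing <options>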
Pure MPI job
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --account=<project>
#SBATCH --partition=medium
#SBATCH --time=02:00:00
#SBATCH --nodes=10
#SBATCH --ntasks-per-node=128
export OMP_NUM_THREADS=1
srun myprog <options>
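The same submission workflow applies to all of the batch scripts in this deck, using standard Slurm commands (the script name is illustrative):

sbatch mpi_job.sh   # submit the job script
squeue -u $USER     # check its state in the queue
sacct -j <jobid>    # accounting information during and after the run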
Hybrid MPI+OpenMP job
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --account=<project>
#SBATCH --partition=medium
#SBATCH --time=02:00:00
#SBATCH --nodes=10
#SBATCH --ntasks-per-node=16
#SBATCH --cpus-per-task=8
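# 16 MPI tasks x 8 OpenMP threads = 128 cores, i.e. one full Mahti node without SMT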
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun myprog <options>
Hybrid MPI+OpenMP job with SMT
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --account=<project>
#SBATCH --partition=medium
#SBATCH --time=02:00:00
#SBATCH --nodes=10
#SBATCH --ntasks-per-node=16
#SBATCH --cpus-per-task=16
#SBATCH --hint=multithread
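# --hint=multithread uses both hardware threads per core: 16 tasks x 16 threads = 256 threads per node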
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun myprog <options>
Affinity in hybrid MPI + OpenMP jobs
• By default, the operating system is allowed to move threads between CPU cores
• In many HPC applications it is beneficial to bind threads to cores by setting
export OMP_PLACES=cores
• The affinity can be printed to the job's stderr by setting
export OMP_AFFINITY_FORMAT="Process %P level %L thread %0.3n affinity %A"
export OMP_DISPLAY_AFFINITY=true
Building applications in Mahti
• Currently, the GNU, AMD and Intel compiler suites are available via modules (see the build sketch below)
• See the documentation for recommended compiler settings
• High performance libraries are available via modules
o Most libraries are provided both as single-threaded and multithreaded versions (with omp in the module version)
o For pure MPI applications, and for applications calling the libraries from multiple threads, use the single-threaded versions
• Note: the MKL library provided with the Intel compiler suite does not fully utilize AMD CPUs!
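A sketch of a typical build with the GNU suite; the module names and compiler flags here are illustrative, so check the recommended settings at docs.csc.fi:

module load gcc openmpi openblas                          # compiler, MPI and a BLAS/LAPACK library
mpicc  -O3 -march=znver2 -fopenmp -o myprog myprog.c      # hybrid MPI+OpenMP C code
mpif90 -O3 -march=znver2 -fopenmp -o myprog myprog.f90    # or Fortran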
Technical details
Mahti node
• Two 64-core AMD EPYC 7H12 (Rome) processors
o 2.6 GHz base frequency (3.3 GHz max boost)
o AVX2 vector instructions
• Cache hierarchy
Cache              L1                 L2                 L3
Size               32 kB              512 kB             16 MB
Private / shared   Private per core   Private per core   Shared among 4 cores
SMT threads share L1 and L2 caches
Hierarchical architecture
• A Mahti node has a highly hierarchical architecture
• Each CCD (Core Complex Die) contains two 4-core CCXs (Core CompleX)
• L3 cache is shared within a CCX
Hierarchical architecture
• A Mahti node has a highly hierarchical architecture
• Even though memory is shared between all cores, latency and bandwidth vary
                   Idle latency (ns)   Bandwidth (GB/s)
Within NUMA node   80                  41
Within socket      100-120             37-39
Between sockets    220                 21-22
Rank/thread placement
• Memory bound applications may benefit from running only a single MPI task / OpenMP thread per NUMA node or per CCX (L3 cache)
• Slurm places MPI tasks --cpus-per-task cores apart
• OMP_PLACES and OMP_PROC_BIND can be used for controlling the placement of OpenMP threads
• The srun option --cpu-bind=verbose can be used for printing out the binding of MPI tasks
• OMP_AFFINITY_FORMAT + OMP_DISPLAY_AFFINITY can be used for checking thread binding (see the combined example below)
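Putting the settings above together, the bindings of both MPI tasks and OpenMP threads can be checked with:

export OMP_PLACES=cores
export OMP_DISPLAY_AFFINITY=true
export OMP_AFFINITY_FORMAT="Process %P level %L thread %0.3n affinity %A"
srun --cpu-bind=verbose myprog <options>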
Example: single MPI task per NUMA node
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --account=<project>
#SBATCH --partition=medium
#SBATCH --time=02:00:00
#SBATCH --nodes=10
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=16
export OMP_NUM_THREADS=1
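# 8 tasks spaced 16 cores apart = one MPI task per NUMA node; with one thread each, only 8 cores per node do work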
export OMP_PLACES=cores
srun myprog <options>
Example: single MPI task per NUMA node, single thread per CCX (L3 cache)
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --account=<project>
#SBATCH --partition=medium
#SBATCH --time=02:00:00
#SBATCH --nodes=10
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=16
export OMP_NUM_THREADS=4
export OMP_PLACES=cores
export OMP_PROC_BIND=spread
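# Spreading 4 threads over the task's 16 cores places one thread per 4-core CCX, i.e. one per L3 cache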
srun myprog <options>
Network topology
• Infiniband HDR network
o Bandwidth of 200 Gbit/s and an MPI latency of ~1.3 us per link
• Dragonfly+ topology: 6 groups, each consisting of a separate fat tree
o Fat trees connected with all-to-all links
Questions?
• Up-to-date information about timetables, relevant changes for users etc.: research.csc.fi/dl2021-utilization
• CSC Customer portal: my.csc.fi
• User documentation: docs.csc.fi