Introduction to SLURM
Ismael Fernández Pavón Cristian Gomollón Escribano
08 / 10 / 2019
What is SLURM?
A cluster manager and job scheduler for large and small Linux clusters. SLURM:
• Allocates access to resources for some duration of time.
• Provides a framework for starting, executing, and monitoring work (normally a parallel job).
• Arbitrates contention for resources by managing a queue of pending work.
What is SLURM?
[Diagram: SLURM compared with other resource managers and schedulers, including LoadLeveler (IBM), LSF, PBS Pro, ALPS (Cray), Torque, Maui and Moab.]
What is SLURM?
✓ Open source
✓ Fault-tolerant
✓ Highly scalable
[Same diagram of resource managers and schedulers as above.]
SLURM: Resource Management
[Diagram: a node contains CPUs, counted either as cores or as hardware threads.]
SLURM: Resource Management
Nodes:
• Baseboards, Sockets,
Cores, Threads
• CPUs (Core or thread)
• Memory size
• Generic resources
• Features
• State
− Idle
− Mix
− Alloc
− Completing
− Drain / Draining
− Down
SLURM: Resource Management
Partitions:
• Associated with a specific set of nodes
• Nodes can be in more
than one partition
• Job size and time limits
• Access control list
• State information
− Up
− Drain
− Down
[Diagram: partitions defined over the cluster's nodes.]
SLURM: Resource Management
[Diagram: a job's allocated cores and allocated memory.]
Jobs:
• ID (a number)
• Name
• Time limit
• Size specification
• Node features required
• Other Jobs Dependency
• Quality Of Service (QoS)
• State (Pending, Running,
Suspended, Canceled,
Failed, etc.)
SLURM: Resource Management
[Diagram: the cores and memory actually used by each job step within a job's allocation.]
Job Steps:
• ID (a number)
• Name
• Time limit (maximum)
• Size specification
• Node features required
in allocation
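To make the job vs. job step distinction concrete, here is a minimal sketch of a batch script that runs several steps inside one allocation (not taken from the slides; preprocess.sh is a hypothetical script, and depending on the SLURM version extra srun options may be needed for steps to run side by side):
#!/bin/bash
#SBATCH -J steps_demo
#SBATCH -n 8
#SBATCH -t 00:10:00
srun -n 8 hostname            # step 0: uses the whole allocation
srun -n 4 ./preprocess.sh &   # step 1: half of the tasks (hypothetical script)
srun -n 4 ./preprocess.sh &   # step 2: runs concurrently on the other half
wait                          # wait for both concurrent steps to finish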
SLURM: Resource Management
FULL CLUSTER!
✓ Job scheduling
SLURM: Job Scheduling
Scheduling: The process of determining the next job to run and on which resources.
[Diagram: FIFO scheduler vs. backfill scheduler, with jobs placed on a resources-vs-time grid.]
SLURM: Job Scheduling
Scheduling: The process of determining the next job to run and on which resources.
Backfill Scheduler:
• Based on the job request, resources available, and
policy limits imposed.
• Starts with job priority.
• Results in a resource allocation over a period.
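One way to check which scheduler a cluster is actually running (a quick sketch; the keys come from slurm.conf and the output varies by site):
scontrol show config | grep -i scheduler
# SchedulerType = sched/backfill  -> backfill scheduling
# SchedulerType = sched/builtin   -> strict priority (FIFO-like) order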
SLURM: Job Scheduling
Backfill Scheduler:
• Starts with job priority.
Job_priority = site_factor +
(PriorityWeightAge) * (age_factor) +
(PriorityWeightAssoc) * (assoc_factor) +
(PriorityWeightFairshare) * (fair-share_factor) +
(PriorityWeightJobSize) * (job_size_factor) +
(PriorityWeightPartition) * (partition_factor) +
(PriorityWeightQOS) * (QOS_factor) +
SUM(TRES_weight_cpu * TRES_factor_cpu,
TRES_weight_<type> * TRES_factor_<type>,
...) - nice_factor
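To see how the weights and factors combine, here is a toy calculation with invented numbers (the real weights live in slurm.conf; sprio shows the actual per-job factors):
# Toy numbers only: each factor is between 0.0 and 1.0, each weight is an integer
PriorityWeightAge=1000;        age_factor=0.5
PriorityWeightFairshare=10000; fairshare_factor=0.2
PriorityWeightQOS=2000;        qos_factor=1.0
job_priority=$(echo "$PriorityWeightAge*$age_factor + $PriorityWeightFairshare*$fairshare_factor + $PriorityWeightQOS*$qos_factor" | bc)
echo "toy job priority: $job_priority"   # 500 + 2000 + 2000 = 4500.0
sprio -j <jobid>                         # on a real cluster: the actual factors for one job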
• sbatch – Submit a batch script to SLURM.
• salloc – Request resources from SLURM for an interactive job.
• srun – Start a new job step.
• scancel – Cancel a job.
SLURM: Commands
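A quick usage sketch of these four commands (job.slm is a hypothetical script name and the resource options are illustrative):
sbatch job.slm               # submit a batch script; prints the assigned job ID
salloc -n 4 -t 01:00:00      # interactive allocation: 4 tasks for one hour
srun -n 4 hostname           # inside the allocation (or a script): launch a 4-task job step
scancel <jobid>              # cancel a pending or running job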
• sinfo – Report system status (nodes, queues, etc.).
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
rest up infinite 3 idle~ pirineusgpu[1-2],pirineusknl1
rest up infinite 1 idle canigo2
std* up infinite 11 idle~ pirineus[14,19-20,23,25-26,29-30,33-34,40]
std* up infinite 18 mix pirineus[13,15-16,18,21-22,27-28,35,38-39,41-45,48-49]
std* up infinite 7 alloc pirineus[17,24,31,36-37,46-47]
gpu up infinite 2 alloc pirineusgpu[3-4]
knl up infinite 3 idle~ pirineusknl[2-4]
mem up infinite 1 mix canigo1
class_a up infinite 8 mix canigo1,pirineus[1-7]
class_a up infinite 1 alloc pirineus8
class_b up infinite 8 mix canigo1,pirineus[1-7]
class_b up infinite 1 alloc pirineus8
class_c up infinite 8 mix canigo1,pirineus[1-7]
class_c up infinite 1 alloc pirineus8
std_curs up infinite 5 idle~ pirineus[9-12,50]
gpu_curs up infinite 2 idle~ pirineusgpu[1-2]
SLURM: Commands
• sinfo – Report system status (nodes, queues, etc.).
sinfo -Np class_a -O "Nodelist,Partition,StateLong,CpusState,Memory,Freemem"
NODELIST PARTITION STATE CPUS(A/I/O/T) MEMORY FREE_MEM
canigo1 class_a mixed 113/79/0/192 3094521 976571
pirineus1 class_a mixed 20/28/0/48 191904 120275
pirineus2 class_a mixed 24/24/0/48 191904 185499
pirineus3 class_a mixed 46/2/0/48 191904 54232
pirineus4 class_a mixed 38/10/0/48 191904 58249
pirineus5 class_a mixed 38/10/0/48 191904 58551
pirineus6 class_a mixed 36/12/0/48 191904 114986
pirineus7 class_a mixed 38/10/0/48 191904 58622
pirineus8 class_a allocated 48/0/0/48 191904 165682
SLURM: Commands
1193936 std g09d1 upceqt04 PD 0:00 1 16 32G (Priority)
1195916 gpu A2B2_APO_n ubator01 PD 0:00 1 24 3900M (Priority)
1195915 gpu A2B2_APO_n ubator01 PD 0:00 1 24 3900M (Priority)
1195920 gpu A2B2_APO_n ubator01 PD 0:00 1 24 3900M (Priority)
1195927 gpu uncleaved_ ubator02 PD 0:00 1 24 3900M (Priority)
1195928 gpu uncleaved_ ubator02 PD 0:00 1 24 3900M (Priority)
1195929 gpu cleaved_wt ubator02 PD 0:00 1 24 3900M (Priority)
1138005 std U98-CuONN1 imoreira PD 0:00 1 12 3998M (Priority)
1195531 std g09d1 upceqt04 PD 0:00 1 16 32G (Priority)
1195532 std g09d1 upceqt04 PD 0:00 1 16 32G (Priority)
1195533 std g09d1 upceqt04 PD 0:00 1 16 32G (Priority)
1195536 std g09d1 upceqt04 PD 0:00 1 16 32G (Priority)
1195597 std sh gomollon R 20:04:04 4 24 6000M pirineus[31,38,44,47]
1195579 class_a rice crag49366 R 6:44:45 1 8 3998M pirineus5
1195576 class_a rice crag49366 R 6:36:48 1 8 3998M pirineus2
1195578 class_a rice crag49366 R 6:37:53 1 8 3998M pirineus4
• squeue – Report job and job step status.
SLURM: Commands
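Some handy squeue variants (a sketch using standard squeue options; the columns roughly match the listing above):
squeue -u $USER                  # only your own jobs
squeue -p std -t PENDING         # pending jobs in the std partition
squeue -o "%.10i %.9P %.10j %.8u %.2t %.10M %.6D %R"   # pick your own columns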
• scontrol – Administrator tool to view and/or update
system, job, step, partition or reservation status.
scontrol hold <jobid>
scontrol release <jobid>
scontrol show job <jobid>
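A few more scontrol queries that are often useful (a sketch; job updates are normally restricted, and users can usually only lower their own limits):
scontrol show partition std                          # limits and node list of one partition
scontrol show node pirineus31                        # detailed state of one node
scontrol update JobId=<jobid> TimeLimit=1-00:00:00   # change a job's time limit (if allowed)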
SLURM: Commands
JobId=1195597 JobName=sh
UserId=gomollon(80128) GroupId=csuc(10000) MCS_label=N/A
Priority=100176 Nice=0 Account=csuc QOS=test
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=20:09:58 TimeLimit=5-00:00:00 TimeMin=N/A
SubmitTime=2019-10-07T12:21:29 EligibleTime=2019-10-07T12:21:29
StartTime=2019-10-07T12:21:29 EndTime=2019-10-12T12:21:30 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=std AllocNode:Sid=login2:20262
ReqNodeList=(null) ExcNodeList=(null)
NodeList=pirineus[31,38,44,47]
BatchHost=pirineus31
NumNodes=4 NumCPUs=24 NumTasks=24 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=24,mem=144000M,node=4
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=6000M MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/home/gomollon
Power=
SLURM: Commands
Jobs: State Information
Enjoy SLURM!
How to launch jobs?
Login to the CSUC infrastructure
• Login
ssh -p 2122 username@hpc.csuc.cat
• Transfer files
scp -P 2122 local_file username@hpc.csuc.cat:[path to your folder]
sftp -oPort=2122 username@hpc.csuc.cat
• Useful paths
Name Variable Availability Quota/project Time limit Backup
/home/$user $HOME global 4 GB unlimited Yes
/scratch/$user $SCRATCH global unlimited 30 days No
/scratch/$user/tmp/jobid $TMPDIR Local to each node job file limit 1 week No
/tmp/$user/jobid $TMPDIR Local to each node job file limit 1 week No
• Get HC consumption
consum -a <year> (group consumption)
consum -a <year> -u <username> (user consumption)
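rsync also works through the same SSH port, which is handy for large or repeated transfers (a sketch; the local and remote paths are illustrative):
rsync -avz -e "ssh -p 2122" ./my_inputs/ username@hpc.csuc.cat:~/my_inputs/
# -a keep permissions/times, -v verbose, -z compress, -e use ssh on port 2122 as transport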
Batch job submission: Default settings
• 4 GB/core (except on the mem partition).
• 24 GB/core on the mem partition.
• 1 core on the std and mem partitions.
• 24 cores on the gpu partition.
• The whole node on the knl partition.
• Non-exclusive, multi-node jobs.
• Scratch and output directories default to the submit directory.
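Any of these defaults can be overridden with scheduler directives in the job script header; a minimal sketch with illustrative values (my_app is a hypothetical binary):
#!/bin/bash
#SBATCH -J defaults_demo
#SBATCH -p std                 # choose the partition explicitly
#SBATCH -n 4                   # 4 tasks instead of the 1-core default
#SBATCH --mem-per-cpu=8000M    # override the 4 GB/core default
#SBATCH -t 02:00:00            # explicit wall-time limit
srun ./my_app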
Batch job submission
• Basic Linux commands:
Description Command Example
List files ls ls /home/user
Make a folder mkdir mkdir /home/prova
Change folder cd cd /home/prova
Copy files cp cp file1 file2
Move a file mv mv /home/prova.txt /cescascratch/prova.txt
Delete a file rm rm filename
Print file content cat cat filename
Search for a string in files grep grep 'word' filename
Print the last lines of a file tail tail filename
• Text editors: vim, nano, emacs, etc.
• More detailed info and options for each command:
command --help
man command
Scheduler directives/Options
• -c, --cpus-per-task=ncpus number of cpus required per task
• --gres=list required generic resources
• -J, --job-name=jobname name of job
• -n, --ntasks=ntasks number of tasks to run
• --ntasks-per-node=n number of tasks to invoke on each node
• -N, --nodes=N number of nodes on which to run (N = min[-max])
• -o, --output=out file for batch script's standard output
• -p, --partition=partition partition requested
• -t, --time=minutes time limit (format: dd-hh:mm)
• -C, --constraint=list specify a list of constraints (mem, vnc, ...)
• --mem=MB minimum amount of total real memory
• --reservation=name allocate resources from named reservation
• -w, --nodelist=hosts... request a specific list of hosts
• --mem-per-cpu=MB amount of real memory per allocated core
Scheduler directives/Options
Batch job submission
#!/bin/bash
# Scheduler directives
#SBATCH -J treball_prova
#SBATCH -o treball_prova.log
#SBATCH -e treball_prova.err
#SBATCH -p std
#SBATCH -n 48
# Setting up the environment
module load mpi/intel/openmpi/3.1.0
# Move the input files to the working directory
cp -r $input $SCRATCH
cd $SCRATCH
# Launch the application (srun is similar to mpirun)
srun $APPLICATION
# Create the output folder and move the outputs
mkdir -p $OUTPUT_DIR
cp -r * $OUTPUT_DIR
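Submitting and following the script above might look like this (a sketch; treball_prova.slm is a hypothetical filename for the script):
sbatch treball_prova.slm       # returns the assigned job ID
squeue -u $USER                # watch the job go from PD (pending) to R (running)
scontrol show job <jobid>      # full details while it is queued or running
less treball_prova.log         # the output file named in #SBATCH -o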
Gaussian 16 Example
#!/bin/bash
#SBATCH -J gau16_test
#SBATCH -o gau_test_%j.log
#SBATCH -e gau_test_%j.err
#SBATCH -p std
#SBATCH -n 1
#SBATCH -c 16
module load gaussian/g16b1
INPUT_DIR=$HOME/gaussian_test/inputs
OUTPUT_DIR=$HOME/gaussian_test/outputs
cd $SCRATCH
cp -r $INPUT_DIR/* .
g16 < input.gau > output.out
mkdir -p $OUTPUT_DIR
cp -r * $OUTPUT_DIR
Vasp 5.4.4 Example
#!/bin/bash
#SBATCH -J vasp_test
#SBATCH -o vasp_test_%j.log
#SBATCH -e vasp_test_%j.err
#SBATCH -p std
#SBATCH -n 24
module load vasp/5.4.4
INPUT_DIR=$HOME/vasp_test/inputs
OUTPUT_DIR=$HOME/vasp_test/outputs
cd $SCRATCH
cp -r $INPUT_DIR/* .
srun `which vasp_std`
mkdir -p $OUTPUT_DIR
cp -r * $OUTPUT_DIR
Gromacs Example
#!/bin/bash
#SBATCH --job-name=gromacs
#SBATCH --output=gromacs_%j.out
#SBATCH --error=gromacs_%j.err
#SBATCH -n 24
#SBATCH --gres=gpu:2
#SBATCH -N 1
#SBATCH -p gpu
#SBATCH -c 2
#SBATCH --time=00:30:00
module load gromacs/2018.4_mpi
cd $SHAREDSCRATCH
cp -r $HOME/SLMs/gromacs/CASE/* .
srun `which gmx_mpi` mdrun -v -deffnm input_system -ntomp $SLURM_CPUS_PER_TASK \
     -nb gpu -npme 12 -dlb yes -pin on -gpu_id 01
cp -r * /scratch/$USER/gromacs/CASE/output/
ANSYS Fluent Example
#!/bin/bash
#SBATCH -J truck.cas
#SBATCH -o truck.log
#SBATCH -e truck.err
#SBATCH -p std
#SBATCH -n 16
module load toolchains/gcc_mkl_ompi
INPUT_DIR=$HOME/FLUENT/inputs
OUTPUT_DIR=$HOME/FLUENT/outputs
cd $SCRATCH
cp -r $INPUT_DIR/* .
/prod/ANSYS16/v162/fluent/bin/fluent 3ddp -t$SLURM_NTASKS -mpi=hp -g -i input1_50.txt
mkdir -p $OUTPUT_DIR
cp -r * $OUTPUT_DIR
Best Practices
• Use $SCRATCH as the working directory.
• Move only the necessary files (not all files in the folder each time).
• Try to keep important files only in $HOME.
• Try to choose the partition and resources that best fit your job.
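A staging pattern that follows these practices (a sketch; the file and directory names are illustrative):
cd $SCRATCH
cp $HOME/project/input.dat $HOME/project/params.cfg .   # copy only what the run needs
srun ./my_app input.dat                                  # do all the work on scratch
mkdir -p $HOME/project/results
cp output.dat run.log $HOME/project/results/             # keep only the important results in $HOME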