Lustre Parallel Filesystem
Best Practices
George Markomanolis
Computational Scientist
KAUST Supercomputing Laboratory
georgios.markomanolis@kaust.edu.sa
7 November 2017
Outline
—  Introduction to Parallel I/O
—  Understanding the I/O performance on Lustre
—  Accelerating the performance
I/O Performance
—  There is no magic solution
—  I/O performance depends on the pattern
—  Of course, a bottleneck can occur in any part of an application
—  Increasing computation and decreasing I/O is a good solution but not
always possible
Lustre best practices I
—  Do not use ls -l
—  The ownership and permission metadata is stored on the
MDTs; however, the file size metadata is only available from the OSTs
—  Use ls instead
—  Avoid having a large number of files in a single directory
—  Split a large number of files into multiple subdirectories to minimize
the contention
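For example, a minimal shell sketch that spreads files round-robin across subdirectories (the *.dat pattern and the count of 16 subdirectories are only illustrative):
for i in $(seq 0 15); do mkdir -p subdir_$i; done    # create 16 subdirectories
n=0
for f in *.dat; do mv "$f" subdir_$((n % 16))/; n=$((n+1)); done    # distribute the files round-robin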
Lustre best practices II
—  Avoid accessing a large number of small files on Lustre (a common workaround is
sketched at the end of this slide)
—  On Lustre, creating a small 64 KB file:
dd if=/dev/urandom of=my_big_file count=128
128+0 records in
128+0 records out
65536 bytes (66 kB) copied, 0.014081 s, 4.7 MB/s
—  On /tmp/ of the node
/tmp> dd if=/dev/urandom of=my_big_file count=128
128+0 records in
128+0 records out
65536 bytes (66 kB) copied, 0.00678089 s, 9.7 MB/s
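A common workaround, sketched here under the assumption that the small inputs fit on the node-local disk, is to pack them into a single archive on Lustre and extract it to /tmp at job start (names are illustrative):
tar -cf inputs.tar small_inputs/    # one large file on Lustre instead of many small ones
tar -xf inputs.tar -C /tmp          # extract on the node-local /tmp at the start of the job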
Various I/O modes
Lustre
—  The Lustre file system is made up of:
—  A set of I/O servers called Object Storage Servers (OSSs)
—  Disks called Object Storage Targets (OSTs), which store the file data (chunks of
files). We have 144 OSTs on Shaheen
—  The file metadata is managed by a Metadata Server (MDS) and
stored on a Metadata Target (MDT)
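You can inspect the targets of the mounted filesystem with the lfs utility, for example:
lfs df -h    # capacity and usage per MDT and OST
lfs osts     # list the OSTs of the mounted Lustre filesystems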
Lustre Operation
Lustre best practices III
—  Use stripe count 1 for directories with many small files (default on
our system)
—  mkdir experiments
—  lfs setstripe -c 1 experiments
—  cd experiments
—  tar -zxvf code.tgz
—  Copy larger files to another directory with higher striping
—  mkdir large_files
—  lfs setstripe -c 16 large_files
—  cp file large_files/
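To verify that the striping was applied as intended:
lfs getstripe -d large_files      # default striping of the directory
lfs getstripe large_files/file    # stripe layout of the copied file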
Lustre best practices IV
—  Increase the stripe count for parallel access, especially on large files.
—  The stripe count should be a factor of the number of processes
performing the parallel I/O.
—  To identify the minimum stripe count, use the square root of the
file size in GB: if the file is 90 GB, the square root is ~9.5, so use at least 9
OSTs (a minimal sketch of this rule follows below)
—  lfs setstripe -c 9 file
—  If you use 64 MPI processes for parallel I/O, the number of OSTs used
should not exceed 64; you could try 8, 16, 32 or 64, depending on the file
size
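A minimal sketch of the square-root rule above, assuming the file size in GB is known in advance (variable names are illustrative):
SIZE_GB=90
STRIPES=$(awk -v s=$SIZE_GB 'BEGIN { printf "%d", sqrt(s) }')    # 90 GB -> 9
lfs setstripe -c $STRIPES file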
Software/techniques to optimize I/O
—  Important factors:
—  Striping
—  Aligned data
—  IOBUF for serial I/O
—  module load iobuf
—  Compile and link your application
—  export IOBUF_PARAMS='*.out:verbose:count=12:size=32M'
—  Hugepage for MPI applications
—  module load craype-hugepages4M
—  compile and link
—  Multiple options for Huge pages size
—  Huge pages increase the maximum size of data and text in a program accessible by the high
speed network
—  Use parallel I/O libraries such as PHDF5, PNetCDF, ADIOS
—  But… how parallel is your I/O?
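One way to answer that question with Cray MPICH (as used on Shaheen) is to let the MPI-IO layer report what it did; a sketch, assuming a SLURM launcher:
export MPICH_MPIIO_STATS=1     # print MPI-IO access pattern statistics at MPI_Finalize
export MPICH_MPIIO_TIMERS=1    # print a timing breakdown of the MPI-IO phases
srun ./my_app                  # application name is illustrative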
Collective Buffering – MPI I/O
aggregators
—  During a collective write, the data is first gathered on the aggregator nodes
through MPI; these aggregator nodes then write the data to the I/O
servers.
—  Example: 8 MPI processes, 2 MPI I/O aggregators
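For illustration, the number of aggregators can be requested explicitly through the standard ROMIO hint cb_nodes (the value 2 matches the example above and is not a recommendation):
export MPICH_MPIIO_HINTS="*:cb_nodes=2"    # ask for 2 collective-buffering aggregators for all files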
How many MPI processes are writing a
shared file?
—  With Cray MPICH, we execute an application with 1216 MPI processes;
it performs parallel I/O through Parallel NetCDF and the restart file size is
361 GB:
—  First case (no striping, i.e. the default stripe count of 1):
—  mkdir execution_folder
—  copy necessary files in the folder
—  cd execution_folder
—  run the application
—  Timing for Writing restart for domain 1: 674.26 elapsed
seconds
—  Answer: 1 MPI process
How many MPI processes are writing a
shared file?
—  With Cray MPICH, we execute the same application with 1216 MPI processes;
it performs parallel I/O through Parallel NetCDF and the restart file size is
361 GB:
—  Second case (stripe count 144):
—  mkdir execution_folder
—  lfs setstripe -c 144 execution_folder
—  copy necessary files in the folder
—  cd execution_folder
—  Run the application
—  Timing for Writing restart for domain 1: 10.35 elapsed seconds
—  Answer: 144 MPI processes
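To confirm how a produced file was actually striped, you can inspect it after the run (the file name below is illustrative):
lfs getstripe -c execution_folder/restart_file    # prints the stripe count of the restart file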
I/O performance on Lustre while
increasing OSTs
[Chart: WRF – 361 GB restart file, write time in seconds vs. number of OSTs (5, 10, 20, 40, 80, 100, 144) on Lustre]
If the file is not big enough, increasing OSTs can cause performance issues
How to declare the number of MPI I/O
aggregators
—  By default, with the current Lustre/Cray MPICH configuration, the number of MPI
I/O aggregators equals the number of OSTs used by the file (its stripe count).
—  There are two ways to declare the striping (number of OSTs).
—  Execute the following command on an empty folder
—  lfs setstripe -c X empty_folder
where X is between 2 and 144, depending on the size of the used files.
—  Use the environment variable MPICH_MPIIO_HINTS to declare the striping per file
export MPICH_MPIIO_HINTS=
"wrfinput*:striping_factor=64,wrfrst*:striping_factor=144,
wrfout*:striping_factor=144"
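To check which hints were actually applied to each file, Cray MPICH can print them when the file is opened:
export MPICH_MPIIO_HINTS_DISPLAY=1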
Using the Darshan tool to visualize I/O
performance
Handling Darshan Data: https://kaust-ksl.github.io/HArshaD/
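A typical workflow, sketched here (the module name and the location of the Darshan log are site-specific):
module load darshan                     # instruments the application transparently
srun ./my_app
darshan-job-summary.pl my_app.darshan   # produce a PDF summary of the I/O behaviour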
Lustre best practices V
—  Experiment with different stripe counts/sizes for MPI collective writes (an example
follows at the end of this slide)
—  Stripe-align your I/O: make sure that individual I/O requests do not span multiple
OSTs
—  Avoid repetitive “stat” operations
Loop:
Check if file exists
Execute a command
Delete a file
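For the first point above, a hypothetical starting point for the experiments (values must be tuned for your files):
lfs setstripe -c 16 -S 4M output_dir    # 16 OSTs, 4 MB stripe size; directory name is illustrative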
Lustre best practices VI
—  Avoid multiple processes opening the same file at the same time.
—  When opening a read-only file in Fortran, use ACTION='read' and not
the default ACTION='readwrite', to reduce contention by not locking
the file
—  Avoid repetitive open/close operations: if you only read a file, use
ACTION='read', open the file once, read all the information, and save
the results
Test case
—  4 MPI processes
n = 1000
do i = 1, n
   open(11, file=ffile, position="append")   ! open and close inside the loop: 1000 open/close pairs per process
   write(11, *) 5
   close(11)
end do
Test case - Darshan
Test case – Darshan II
Test case II
—  4 MPI processes
n = 1000
open(11, file=ffile, position="append")   ! open once, write all values, close once
do i = 1, n
   write(11, *) 5
end do
close(11)
Test case II - Darshan
Test case II – Darshan II
Conclusions
—  We explained how Lustre operates
—  Some parameters need to be investigated to reach the optimum performance
—  Be patient; you will probably not achieve the best performance
immediately
Thank you!
Questions?
georgios.markomanolis@kaust.edu.sa
