HPC I/O for Computational Scientists:
Understanding I/O
Presented to
ATPESC 2017 Participants
Rob Latham and Phil Carns
Mathematics and Computer Science Division
Argonne National Laboratory
Q Center, St. Charles, IL (USA)
8/4/2017
ATPESC 2017, July 30 – August 11, 2017
Motivation for characterizing parallel I/O
• Most scientific domains are
increasingly data intensive:
climate, physics, biology and
much more
• Upcoming platforms include
complex hierarchical
storage systems
How can we
maximize productivity
in this environment?
Times are changing in HPC storage!
Example visualizations from
the Human Connectome
Project, CERN/LHC, and the
Parallel Ocean Program
The NERSC burst buffer roadmap and architecture, including solid
state burst buffers that can be used in a variety of ways
Key challenges
• Instrumentation:
– What do we measure?
– How much overhead is acceptable and when?
• Analysis:
– How do we correlate data and extract actionable information?
– Can we identify the root cause of performance problems?
• Impact:
– Develop best practices and tune applications
– Improve system software
– Design and procure better systems
CHARACTERIZING APPLICATION I/O
WITH DARSHAN
What is Darshan?
Darshan is a scalable HPC I/O characterization tool. It captures an
accurate but concise picture of application I/O behavior with
minimum overhead.
• No code changes, easy to use
– Negligible performance impact: just “leave it on”
– Enabled by default at ALCF, NERSC, NCSA, and KAUST
– Installed and available for case by case use at many other sites
• Produces a summary of I/O activity for each job, including:
– Counters for file access operations
– Time stamps and cumulative timers for key operations
– Histograms of access, stride, datatype, and extent sizes
Project began in 2008, first public software
release and deployment in 2009
Darshan design principles
• The Darshan run time library is inserted at link time (for static
executables) or at run time (for dynamic executables)
• Transparent wrappers for I/O functions collect per-file statistics
• Statistics are stored in bounded memory at each rank
• At shutdown time:
– Collective reduction to merge shared file records
– Parallel compression
– Collective write to a single log file
• No communication or storage operations until shutdown
• Command-line tools are used to post-process log files
JOB analysis example
Example: darshan-job-summary.pl
produces a 3-page PDF file
summarizing various aspects of I/O
performance
Summary highlights: estimated performance, percentage of runtime in I/O, access size histogram, access type histograms, and file usage
SYSTEM analysis example
• With a sufficient archive of
performance statistics, we can
develop heuristics to detect
anomalous behavior
This example highlights large jobs that spent a
disproportionate amount of time managing file
metadata rather than performing raw data transfer
Worst offender spent 99% of I/O time in
open/close/stat/seek
This identification process is not yet automated;
alerts/triggers are needed in future work for greater
impact
Example of heuristics applied to a population of
production jobs on the Hopper system in 2013:
Carns et al., “Production I/O Characterization on the Cray XE6,” In
Proceedings of the Cray User Group meeting 2013 (CUG 2013).
Performance:
function wrapping overhead
What is the cost of interposing Darshan I/O instrumentation wrappers?
• To test, we compare observed I/O time of an IOR configuration
linked against different Darshan versions on Edison
• File-per-process workload, 6,000 processes, over 12 million
instrumented calls
Type of Darshan builds now
deployed on Theta and Cori
Why the box plots? Recall
observation from this morning that
variability is a constant theme in
HPC I/O today.
(note that the Y axis labels start at 40)
Snyder et al., "Modular HPC I/O Characterization with Darshan," in Proceedings of the 5th Workshop on Extreme-Scale Programming Tools (ESPT 2016), 2016.
Performance: shutdown overhead
• Involves aggregating, compressing, and collectively writing I/O
data records
• To test, synthetic workloads are injected into Darshan and resulting
shutdown time is measured on Edison
Figure callouts: with a single shared file, near-constant shutdown time of ~100 ms in all cases; with file-per-process, shutdown time scales linearly with job size (5-6 s extra shutdown time with 12,000 files)
USING DARSHAN IN PRACTICE
Typical deployment and usage
• Darshan usage on Mira, Cetus, Vesta, Theta,
Cori, or Edison, abridged:
– Run your job
– If the job calls MPI_Finalize(), log will be stored in
DARSHAN_LOG_DIR/month/day/
– Theta: /lus/theta-fs0/logs/darshan/theta
– Use tools (next slides) to interpret log
• On Titan: “module load darshan” first
• Links to documentation with details will be
given at the end of this presentation
Generating job summaries
• Run job and find its log file:
• Copy log files to save, generate PDF summaries:
(screenshot callouts: job id; corresponding log file in today's directory; copy out logs; list logs; load "latex" module if needed; generate PDF)
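The workflow the screenshots illustrate can be sketched as shell commands. The job id (1234567), user name, and dated path below are placeholders for your own values; the log directory layout follows the DARSHAN_LOG_DIR/month/day convention described earlier.

```shell
# Find the log for your job by its id (path is the Theta example from above):
ls /lus/theta-fs0/logs/darshan/theta/2017/8/4/ | grep 1234567

# Copy the matching log somewhere safe, then generate the PDF summary:
cp /lus/theta-fs0/logs/darshan/theta/2017/8/4/username_app_id1234567*.darshan ~/
module load latex   # only if pdflatex is not already available
darshan-job-summary.pl ~/username_app_id1234567*.darshan
```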
First page of summary
Common questions:
• Did I spend much time performing I/O?
• What were the access sizes?
• How many files were opened, and how big were they?
Second page of summary (excerpt)
Common questions:
• Where in the timeline of the execution did each rank do I/O?
There are additional graphs in the PDF file with increasingly detailed information.
You can also dump all data from the log in text format using “darshan-parser”.
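A quick sketch of the text dump (the log file name is a placeholder, and the grep pattern is just one example of picking out counters of interest):

```shell
darshan-parser mylog.darshan > mylog.txt   # full text dump of all recorded counters
grep -i bytes_written mylog.txt            # e.g. pull out the write-volume counters
```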
TIPS AND TRICKS: ENABLING ADDITIONAL DATA
CAPTURE
What if you are doing shared-file IO?
Your timeline might look like this
No per-process information available
because the data was aggregated by
Darshan to save space/overhead
Is that important? It depends on what
you need to learn about your
application.
– It may be interesting for applications
that access the same file in distinct
phases over time
What if you are doing shared-file IO?
Set environment variable to disable shared file
reductions
Increases overhead and log file size, but provides
per-rank info even on shared files
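Per the Darshan runtime documentation, the variable in question is DARSHAN_DISABLE_SHARED_REDUCTION; a minimal sketch, with the launcher and application name as placeholders:

```shell
export DARSHAN_DISABLE_SHARED_REDUCTION=1   # keep per-rank records even for shared files
aprun -n 512 ./my_app                       # run as usual; the log is larger but per-rank
```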
Detailed trace data
Set environment variable to enable “DXT” tracing
This causes additional overhead and larger files, but
captures precise access data
Parse trace with “darshan-dxt-parser”
Feature contributed by
Cong Xu and Intel’s High
Performance Data Division
Cong Xu et al., "DXT: Darshan eXtended Tracing," Cray User Group Conference 2017.
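In Darshan versions with DXT support, tracing is toggled with the DXT_ENABLE_IO_TRACE environment variable; a sketch (launcher, application, and log name are placeholders):

```shell
export DXT_ENABLE_IO_TRACE=1        # enable detailed DXT tracing for this job
aprun -n 512 ./my_app
darshan-dxt-parser mylog.darshan    # dump per-operation trace records as text
```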
DARSHAN FUTURE WORK
What’s new?
Modularized instrumentation
• Frequently asked question:
Can I add instrumentation for X?
• Darshan has been re-architected as a
modular framework to help facilitate this,
starting in v3.0
Snyder et al., "Modular HPC I/O Characterization with Darshan," in Proceedings of the 5th Workshop on Extreme-Scale Programming Tools (ESPT 2016), 2016.
Self-describing log format
Darshan Module example
• We are using the modular
framework to integrate more data
sources and simplify the
connections between various
components in the stack
• This is a good way for
collaborators to get involved in
Darshan development
The need for HOLISTIC characterization
• We’ve used Darshan to improving application productivity with case
studies, application tuning, and user education
• ... But challenges remain:
– What other factors influence performance?
– What if the problem is beyond a user’s control?
– The user population evolves over time; how do we stay engaged?
“I observed performance XYZ. Now what?”
• A climate vs. weather analogy: It is snowing in Atlanta, Georgia.
Is that normal?
• You need context to know:
– Does it ever snow there?
– What time of year is it?
– What was the temperature yesterday?
– Do your neighbors see snow too?
– Should you look at it first hand?
• It is similarly difficult to understand a single application performance
measurement without broader context. How do we differentiate
typical I/O climate from extreme I/O weather events?
Characterizing the I/O system
• We need a big picture view
• No lack of instrumentation
methods for system
components…
– but with wildly divergent data
formats, resolutions, and scope
• This is the motivation for the
TOKIO (TOtal Knowledge of
I/O) project:
– Integrate, correlate, and analyze
I/O behavior from the system as a
whole for holistic understanding
Holistic I/O characterization
https://www.nersc.gov/research-and-development/tokio/
TOKIO Strategy
• Integrate existing best-in-class instrumentation tools with help from
vendors
• Index and query data sources in their native format
– Infrastructure to align and link data sets
– Adapters/parsers to produce coherent views on demand
• Develop integration and analysis methods
• Produce tools that share a common interface and data format
– Correlation, data mining, dashboards, etc.
The TOKIO project is a collaboration between LBL and ANL
PI: Nick Wright (LBL), Collaborators: Suren Byna, Glenn Lockwood,
William Yoo, Prabhat, Jialin Liu (LBL) Phil Carns, Shane Snyder, Kevin
Harms, Zach Nault, Matthieu Dorier, Rob Ross (ANL)
UMAMI example
TOKIO Unified Measurements And Metrics Interface
UMAMI is a pluggable dashboard that displays the
I/O performance of an application in context with
system telemetry and historical records
Each metric is shown
in a separate row
Historical samples (for a
given application) are
plotted over time
Box plots relate current
values to overall
variance
(figures courtesy of Glenn Lockwood, NERSC)
UMAMI example
TOKIO Unified Measurements And Metrics Interface
System background
load is typical
Performance for this job
is higher than usual
Server CPU load is low
after a long-term steady
climb
Corresponds to data
purge that freed up disk
blocks
Broader contextual clues simplify interpretation of
unusual performance measurements
Hands on exercises
https://xgitlab.cels.anl.gov/ATPESC-IO/hands-on-2017
• There are hands-on exercises available for you to try out during the
day or in tonight’s session
– Demonstrates running applications and analyzing I/O on Theta
– Try some examples and see if you can find the I/O problem!
• We can also answer questions about your own applications
– Try it on Theta, Mira, Cetus, Vesta, Cori, Edison, or Titan
– (note: the Mira, Vesta, and Cetus Darshan versions are a little
older and will differ slightly in details from this presentation)
Next up!
• This presentation covered how to evaluate I/O and tune your
application.
• The next presentation will walk through the HDF5 data management
library.
