Scalable Parallel Performance Measurement with the Scalasca Toolset

Member of the Helmholtz Association
Bernd Mohr
June 2013

June 2013 JSC 2
Parallel Architectures: State of the Art
[Figure: nodes N0 ... Nk connected by a network or switch, or by a mesh of routers. Each node is an SMP or NUMA system in which processors P0 ... Pn, units A0 ... Am, and memory are joined by an interconnect. Each processor Pi contains cores Core0 ... Corer with L1 and L2 caches and an L3 cache.]

Parallel Performance Challenges
• Current and future systems (will) consist of
  - Complex configurations
  - With a huge number of components
  - Very likely heterogeneous
• Deep software hierarchies of large, complex software components will be required to make use of such systems
→ Sophisticated integrated performance measurement, analysis, and optimization capabilities will be required to efficiently operate such systems
⇒ Tools that provide insight, not just numbers or charts, are needed!

“A picture is worth 1000 words…”
• “Real world” example
• MPI ring program

“What about 1000’s of pictures?”
(with 100’s of menu options)

Example Automatic Analysis: Late Sender

Scalasca: Example MPI Patterns
[Figure: four time-line diagrams (axes: time and process; event legend: ENTER, EXIT, SEND, RECV, COLLEXIT) showing (a) Late Sender, (b) Late Receiver, (c) Late Sender / Wrong Order, (d) Wait at N x N.]
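The waiting times behind three of these patterns can be sketched from event timestamps alone. The following is a minimal illustration, not Scalasca's implementation: the `Message` record and the function names are hypothetical, Late Receiver is assumed to apply to a synchronous send that blocks until the receive is posted, and Wait at N x N is assumed to charge each process the time until the last process enters the collective.

```python
from dataclasses import dataclass


@dataclass
class Message:
    """Hypothetical record of one point-to-point message (not a Scalasca type)."""
    sender: int
    receiver: int
    send_enter: float  # time the sender enters its send call
    recv_enter: float  # time the receiver enters its blocking receive


def late_sender_wait(msg: Message) -> float:
    """Time the receiver is blocked before the sender has started sending."""
    return max(0.0, msg.send_enter - msg.recv_enter)


def late_receiver_wait(msg: Message) -> float:
    """Time a synchronous sender is blocked before the receive is posted."""
    return max(0.0, msg.recv_enter - msg.send_enter)


def wait_at_nxn(enter_times: list[float]) -> list[float]:
    """Per-process wait in an N x N collective: time until the last process enters."""
    last = max(enter_times)
    return [last - t for t in enter_times]


if __name__ == "__main__":
    msg = Message(sender=0, receiver=1, send_enter=5.0, recv_enter=2.0)
    print(late_sender_wait(msg))          # receiver waited for the sender
    print(late_receiver_wait(msg))        # sender did not wait
    print(wait_at_nxn([1.0, 4.0, 2.5]))   # one entry per process
```

Summing these values per call path and per process gives the kind of severity numbers that an automatic analysis can then categorize and rank.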

The Scalasca Project
• Scalable Analysis of Large Scale Applications
• Approach
  - Instrument C, C++, and Fortran parallel applications
  - Based on MPI, OpenMP, SHMEM, or hybrid
  - Option 1: scalable call-path profiling
  - Option 2: scalable event trace analysis
    → Collect event traces
    → Search trace for event patterns representing inefficiencies
    → Categorize and rank inefficiencies found
• Supports MPI 2.2 (P2P, collectives, RMA, IO) and OpenMP 3.0 (excl. nesting)
http://www.scalasca.org/
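To make "call-path profiling" concrete, the sketch below accumulates exclusive time per call path from a stream of ENTER and EXIT events of a single process. It is an illustration under stated assumptions, not Scalasca's measurement code: the event tuples and the function name `call_path_profile` are hypothetical, and a real profiler would aggregate at run time instead of storing events in a list.

```python
from collections import defaultdict


def call_path_profile(events):
    """Return {call path tuple: exclusive time} for one process.

    events: iterable of (time, kind, region) with kind "ENTER" or "EXIT",
    ordered by time and properly nested (hypothetical format).
    """
    profile = defaultdict(float)
    stack = []          # current call path, outermost region first
    last_time = None
    for time, kind, region in events:
        if stack:
            # Time since the previous event belongs to the innermost region.
            profile[tuple(stack)] += time - last_time
        if kind == "ENTER":
            stack.append(region)
        elif kind == "EXIT":
            if not stack or stack[-1] != region:
                raise ValueError(f"unbalanced EXIT for {region!r}")
            stack.pop()
        else:
            raise ValueError(f"unknown event kind {kind!r}")
        last_time = time
    return dict(profile)


if __name__ == "__main__":
    events = [
        (0.0, "ENTER", "main"),
        (1.0, "ENTER", "foo"),
        (3.0, "ENTER", "MPI_Recv"),
        (6.0, "EXIT", "MPI_Recv"),
        (7.0, "EXIT", "foo"),
        (10.0, "EXIT", "main"),
    ]
    profile = call_path_profile(events)
    # Rank call paths by exclusive time, largest first.
    for path, seconds in sorted(profile.items(), key=lambda kv: -kv[1]):
        print(" / ".join(path), seconds)
```

Keying the profile by the full call path rather than by region name keeps `MPI_Recv` called from `foo` separate from `MPI_Recv` called elsewhere, which is what lets an analysis point at a specific call site.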

Scalasca Example: CESM Sea Ice Module
Late Sender Analysis
• Finds waiting at MPI_Waitall() inside ice boundary halo update
• Shows distribution of imbalance across system and ranks

Late Sender Analysis + Application Topology
• Shows distribution of imbalance over topology
• MPI topologies are automatically captured

Scalasca Root Cause Analysis
• Root-cause analysis
  - Wait states typically caused by load or communication imbalances earlier in the program
  - Waiting time can also propagate (e.g., indirect waiting time)
  - Enhanced performance analysis to find the root cause of wait states
• Approach
  - Distinguish between direct and indirect waiting time
  - Identify call path/process combinations delaying other processes and causing first order waiting time
  - Identify original delay
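[Figure: time-lines of processes A, B, and C executing foo and bar with Send and Recv calls. A DELAY in bar on one process causes a direct wait in a Recv on a second process, which in turn causes an indirect wait in a Recv on a third.]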
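The distinction between direct and indirect waiting time can be illustrated with a deliberately simplified toy model. It is not Scalasca's root-cause algorithm: the function `split_wait`, its parameters, and the timestamps are all hypothetical. The assumption is that a late sender would have sent earlier by exactly the time it spent waiting itself, so that share of the receiver's wait is counted as indirect (propagated) and the rest as direct.

```python
def split_wait(recv_enter: float, send_enter: float, sender_prior_wait: float):
    """Split a receiver's late-sender wait into (direct, indirect) parts.

    sender_prior_wait is the waiting time the sender itself accumulated
    before this send. Toy model only: that amount, capped at the total
    wait, is treated as propagated.
    """
    wait = max(0.0, send_enter - recv_enter)
    indirect = min(wait, sender_prior_wait)
    return wait - indirect, indirect


if __name__ == "__main__":
    # Process A is delayed and sends to B late; A never waited itself.
    b_direct, b_indirect = split_wait(recv_enter=3.0, send_enter=5.0,
                                      sender_prior_wait=0.0)
    # B, having waited for A, then sends late to C.
    c_direct, c_indirect = split_wait(recv_enter=3.5, send_enter=6.0,
                                      sender_prior_wait=b_direct + b_indirect)
    print("B:", b_direct, b_indirect)
    print("C:", c_direct, c_indirect)
```

In this model, B's wait is entirely direct because A's lateness stems from its own delay, while most of C's wait is indirect because B was late mainly due to waiting on A. Following such chains backwards is what leads to the original delay.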

Direct Wait Time Analysis
• Direct wait caused by ranks processing areas near the north and south ice borders

Indirect Wait Time Analysis
• Indirect waits occur for ranks processing warmer areas

Delay Costs Analysis
• Delays NOT caused on ranks processing ice!

NEW: Scalasca on Intel MIC
Example:
• TACC Stampede
• NAS BT-MZ code
• MPI/OpenMP
• 8x16 CPU threads (2 MPI/node)
• 60x16 MIC threads (15 MPI/MIC)
Supported modes
• Host-only or MIC-only
• Symmetric
Not yet supported modes
• Offload

Acknowledgements
• Scalasca team (JSC) (GRS): Michael Knobloch, Bernd Mohr, Peter Philippen, Markus Geimer, Daniel Lorenz, Christian Rössel, David Böhme, Marc-André Hermanns, Pavel Saviankou, Marc Schlütter, Ilja Zhukov, Alexandre Strube, Brian Wylie, Felix Wolf, Anke Visser, Monika Lücke, Aamer Shah, Alexandru Calotoiu, Jie Jiang, Sergei Shudler, Guoyong Mao, Philipp Gschwandtner
• Sponsors

Questions?
• Check out
http://www.scalasca.org
• Or contact us at
scalasca@fz-juelich.de
