Member of the Helmholtz Association
Scalable Parallel
Performance Measurement
with the Scalasca Toolset
Bernd Mohr
June 2013
June 2013 JSC 2
Parallel Architectures: State of the Art
[Figure: nodes N0 ... Nk attached to a network or switch, or linked through a grid of routers. Each node (SMP or NUMA) has processors P0 ... Pn, units A0 ... Am, and memory on a shared interconnect. Each processor Pi contains cores Core0 ... Corer with L1, L2, and L3 caches.]
Parallel Performance Challenges
• Current and future systems (will) consist of
  - Complex configurations
  - With a huge number of components
  - Very likely heterogeneous
• Deep software hierarchies of large, complex software components will be required to make use of such systems
  - Sophisticated integrated performance measurement, analysis, and optimization capabilities will be required to operate such systems efficiently
  - Tools are needed that provide insight, not just numbers or charts!
“A picture is worth 1000 words…”
• “Real world” example
• MPI ring program
“What about 1000’s of pictures?”
(with 100’s of menu options)
Example Automatic Analysis: Late Sender
Scalasca: Example MPI Patterns
[Figure: four time-line diagrams (axes: time and process; event types: ENTER, EXIT, SEND, RECV, COLLEXIT) showing (a) Late Sender, (b) Late Receiver, (c) Late Sender / Wrong Order, (d) Wait at N x N]
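As a rough illustration of pattern (a): in a Late Sender situation a blocking receive is entered before the matching send starts, so the receiver idles until the sender arrives. The sketch below computes that waiting time from the ENTER and EXIT timestamps of one matched send/receive pair. The `Region` record and the `late_sender_wait` function are hypothetical names invented for this example and are not part of Scalasca; clamping the result to the duration of the receive is also an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """ENTER and EXIT timestamps (seconds) of one MPI call on one process."""
    enter: float
    exit: float


def late_sender_wait(send: Region, recv: Region) -> float:
    """Waiting time of a blocking receive that was entered before its send.

    Returns 0.0 when the send was entered first (no Late Sender).
    The result is clamped to the duration of the receive call
    (an assumption of this sketch).
    """
    wait = send.enter - recv.enter
    if wait <= 0.0:
        return 0.0
    return min(wait, recv.exit - recv.enter)


# Made-up timestamps: the receiver enters its receive at t=1.0,
# the sender enters its send at t=4.0.
recv = Region(enter=1.0, exit=4.5)
send = Region(enter=4.0, exit=4.4)
print(late_sender_wait(send, recv))
```

With these values the receiver is charged the 3.0 seconds between entering the receive and the start of the send.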
The Scalasca Project
• Scalable Analysis of Large Scale Applications
• Approach
  - Instrument C, C++, and Fortran parallel applications
  - Based on MPI, OpenMP, SHMEM, or hybrid
  - Option 1: scalable call-path profiling
  - Option 2: scalable event trace analysis
    - Collect event traces
    - Search trace for event patterns representing inefficiencies
    - Categorize and rank inefficiencies found
• Supports MPI 2.2 (P2P, collectives, RMA, IO) and OpenMP 3.0 (excl. nesting)
http://www.scalasca.org/
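The final step of the trace-analysis option, categorizing and ranking the inefficiencies found, can be pictured as a simple aggregation. The sketch below assumes the pattern search has already run and produced hypothetical `(pattern, call path, rank, seconds)` tuples; the call-path strings and timings are made up. It illustrates the idea only and does not reproduce Scalasca's implementation or report format.

```python
from collections import defaultdict

# Hypothetical detected pattern instances:
# (pattern, call path, MPI rank, seconds lost).
instances = [
    ("Late Sender", "main/halo_update/MPI_Waitall", 0, 2.0),
    ("Late Sender", "main/halo_update/MPI_Waitall", 1, 0.5),
    ("Late Receiver", "main/exchange/MPI_Send", 1, 0.25),
    ("Wait at N x N", "main/reduce/MPI_Allreduce", 0, 1.0),
]


def rank_inefficiencies(found):
    """Sum lost time per (pattern, call path) and sort, largest first."""
    totals = defaultdict(float)
    for pattern, call_path, _rank, seconds in found:
        totals[(pattern, call_path)] += seconds
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranking = rank_inefficiencies(instances)
for (pattern, call_path), seconds in ranking:
    print(f"{seconds:6.2f} s  {pattern:<14} {call_path}")
```

Grouping by call path as well as by pattern keeps the ranking actionable: it points at the place in the code where the time is lost, not only at the kind of inefficiency.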
Scalasca Example: CESM Sea Ice Module
Late Sender Analysis
• Finds waiting at MPI_Waitall() inside ice boundary halo update
• Shows distribution of imbalance across system and ranks
Scalasca Example: CESM Sea Ice Module
Late Sender Analysis + Application Topology
• Shows distribution of imbalance over topology
• MPI topologies are automatically captured
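To make "distribution over topology" concrete, the sketch below lays per-rank waiting times out on a 2D Cartesian process grid. It assumes row-major rank ordering, as MPI Cartesian topologies use. The grid size and the waiting times are made-up values, not measurements from CESM, and the function names are hypothetical.

```python
def rank_to_coords(rank, dims):
    """Row-major (row, col) coordinates of a rank in a 2D Cartesian grid."""
    _rows, cols = dims
    return divmod(rank, cols)


def wait_grid(wait_by_rank, dims):
    """Arrange per-rank waiting times (indexed by rank) as a rows x cols grid."""
    rows, cols = dims
    if len(wait_by_rank) != rows * cols:
        raise ValueError("number of ranks must match grid size")
    grid = [[0.0] * cols for _ in range(rows)]
    for rank, seconds in enumerate(wait_by_rank):
        row, col = rank_to_coords(rank, dims)
        grid[row][col] = seconds
    return grid


# Made-up waiting times (seconds) for 6 ranks on a 2 x 3 grid.
grid = wait_grid([0.1, 0.2, 0.1, 1.5, 1.75, 1.25], dims=(2, 3))
for row in grid:
    print(" ".join(f"{seconds:5.2f}" for seconds in row))
```

Viewed this way, a cluster of high values in one row or corner of the grid is visible at a glance, which a flat per-rank list would hide.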
Scalasca Root Cause Analysis
• Root-cause analysis
  - Wait states typically caused by load or communication imbalances earlier in the program
  - Waiting time can also propagate (e.g., indirect waiting time)
  - Enhanced performance analysis to find the root cause of wait states
• Approach
  - Distinguish between direct and indirect waiting time
  - Identify call path/process combinations delaying other processes and causing first-order waiting time
  - Identify original delay
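[Figure: time-line of processes A, B, and C running foo and bar with Send and Recv calls. A DELAY is marked as the cause of a direct wait in one Recv, which propagates as an indirect wait to a later Recv.]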
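The difference between direct and indirect waiting time can be sketched with a three-process chain loosely modeled on the figure: A is delayed before sending to B, and B in turn sends late to C. The durations are made up, message transfer time is ignored, and the rule used to split C's wait into direct and indirect parts is a simplifying assumption of this sketch, not Scalasca's actual algorithm.

```python
def recv_wait(recv_enter, send_enter):
    """Time a blocking receive idles until the matching send starts."""
    return max(0.0, send_enter - recv_enter)


# Hypothetical durations in seconds; A's bar carries an extra 2.0 s delay.
foo = 1.0
bar = 1.0
delay = 2.0

# Process A: foo, delayed bar, then Send to B.
a_send = foo + bar + delay

# Process B: foo, Recv from A, bar, then Send to C.
b_recv_enter = foo
b_wait = recv_wait(b_recv_enter, a_send)  # caused directly by A's delay
b_send = b_recv_enter + b_wait + bar

# Process C: foo, bar, then Recv from B.
c_recv_enter = foo + bar
c_wait = recv_wait(c_recv_enter, b_send)

# Simplified split (assumption of this sketch): the share of C's wait that
# B itself spent waiting counts as indirect; any remainder is direct.
c_indirect = min(c_wait, b_wait)
c_direct = c_wait - c_indirect

print(f"B waits {b_wait} s (direct)")
print(f"C waits {c_wait} s ({c_indirect} s indirect, {c_direct} s direct)")
```

In this toy run B did no excess work of its own, so all of C's waiting time traces back through B's wait state to the original delay on A.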
Scalasca Example: CESM Sea Ice Module
Direct Wait Time Analysis
• Direct wait caused by ranks processing areas near the north and south ice borders
Scalasca Example: CESM Sea Ice Module
Indirect Wait Time Analysis
• Indirect waits occur for ranks processing warmer areas
Scalasca Example: CESM Sea Ice Module
Delay Costs Analysis
• Delays NOT caused on ranks processing ice!
NEW: Scalasca on Intel MIC
Example:
• TACC Stampede
• NAS BT-MZ code
• MPI/OpenMP
• 8x16 CPU threads (2 MPI/node)
• 60x16 MIC threads (15 MPI/MIC)
Supported modes
• Host-only or MIC-only
• Symmetric
Not yet supported modes
• Offload
Acknowledgements
• Scalasca team (JSC) (GRS)
  Michael Knobloch, Bernd Mohr, Peter Philippen, Markus Geimer, Daniel Lorenz, Christian Rössel, David Böhme, Marc-André Hermanns, Pavel Saviankou, Marc Schlütter, Ilja Zhukov, Alexandre Strube, Brian Wylie, Felix Wolf, Anke Visser, Monika Lücke, Aamer Shah, Alexandru Calotoiu, Jie Jiang, Sergei Shudler, Guoyong Mao, Philipp Gschwandtner
• Sponsors
Questions?
• Check out http://www.scalasca.org
• Or contact us at scalasca@fz-juelich.de
