ORNL is managed by UT-Battelle, LLC for the US Department of Energy
Performance Analysis with Scalasca
George S. Markomanolis
7 August 2019
Outline
• Introduction to Scalasca
• How to compile (using Score-P)
• Explaining functionalities of Scalasca/CUBE4 on two applications
• Testing a case with a large trace
Scalasca
• Scalasca is a software tool that supports the performance analysis of
parallel applications
• The analysis identifies potential performance bottlenecks – in particular
those concerning communication and synchronization – and offers
guidance in exploring their causes.
• Installed version on Summit: v2.5
• Module: scalasca
• Score-P is used for the instrumentation
• Web site: https://www.scalasca.org
• Email: scalasca@fz-juelich.de
Capability Matrix - Scalasca
Capability            Profiling  Tracing  Notes/Limitations
MPI, MPI-IO           Yes        Yes
OpenMP CPU            Yes        Yes
OpenMP GPU            No         No
OpenACC               No         No       Score-P does instrument, but CUBE does not provide the information
CUDA                  No         No       Score-P does instrument, but CUBE does not provide the information
POSIX I/O             Yes        Yes
POSIX threads         Yes        Yes
Memory – app-level    Yes        Yes
Memory – func-level   Yes        Yes
Hotspot Detection     Yes        Yes
Variance Detection    Yes        Yes
Hardware Counters     Yes        Yes
Techniques
• Profile analysis:
– Summary of aggregated metrics
• Per function/call-path and/or per process/thread
– mpiP, TAU, PerfSuite, Vampir
• Time-line analysis
• Pattern analysis
Automatic trace analysis
• Apply tracing
• Automatic search for patterns of inefficient behavior
• Classification of behavior
• Much faster than manual trace analysis
• Scalability
Workflow
CUBE4
• Parallel program analysis report exploration tools
• Libraries for XML report
• Algebra utilities for report processing
• GUI for interactive analysis exploration
• Three coupled tree browsers
• Performance property
• Call-tree path
• System location
• CUBE4 displays severities
• Value for precise comparison
• Colour for easy identification of hotspots
• Inclusive value when collapsed and exclusive when expanded
Scalasca on Summit
module load scalasca
scalasca
Scalasca 2.5
Toolset for scalable performance analysis of large-scale parallel applications
usage: scalasca [OPTION]... ACTION <argument>...
1. prepare application objects and executable for measurement:
scalasca -instrument <compile-or-link-command> # skin (using scorep)
2. run application under control of measurement system:
scalasca -analyze <application-launch-command> # scan
3. interactively explore measurement analysis report:
scalasca -examine <experiment-archive|report> # square
Options:
-c, --show-config show configuration summary and exit
-h, --help show this help and exit
-n, --dry-run show actions without taking them
--quickref show quick reference guide and exit
--remap-specfile show path to remapper specification file and exit
-v, --verbose enable verbose commentary
-V, --version show version information and exit
Scalasca Workflow
• Compilation: use Score-P
• Execution of the binary for profiling (it will create an output folder):
scalasca -analyze jsrun …
• Examination of the data (the GUI is loaded):
scalasca -examine output_folder
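A minimal sketch of how these three steps fit into a Summit LSF submission script (project name, walltime, node count and module versions are placeholders, adjust them to your own setup):
#!/bin/bash
#BSUB -P ABC123                 # placeholder project
#BSUB -nnodes 8
#BSUB -W 00:30
#BSUB -J scalasca_profile
module load scorep scalasca
# run the Score-P-instrumented binary under Scalasca's measurement control;
# this creates a scorep_<binary>_<config>_sum output folder
scalasca -analyze jsrun -n 64 -r 8 ./miniWeather_mpi
# afterwards, on a login node with X forwarding:
#   scalasca -examine scorep_miniWeather_mpi_8p64_sum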
MiniWeather – MPI – Tools parameters
• Parameters for Scalasca/Score-P
module load scorep/6.0_r14595
module load scalasca/2.5
export SCOREP_METRIC_PAPI=PAPI_TOT_INS,PAPI_TOT_CYC,PAPI_FP_OPS
export SCOREP_MPI_ENABLE_GROUPS=ALL
export SCAN_MPI_LAUNCHER=jsrun
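The PAPI preset counters requested above must exist on the compute nodes; a quick sanity check (assuming a papi module that provides the standard papi_avail utility) is:
module load papi
jsrun -n 1 papi_avail | grep -E "PAPI_TOT_INS|PAPI_TOT_CYC|PAPI_FP_OPS"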
Instrumentation
MiniWeather - MPI
• Edit the Makefile and add $(PREP) before the compiler name (see the Makefile sketch below)
• Compile:
MPI: make PREP="scorep --mpp=mpi --pdt" mpi
• Execution (submission script):
scalasca -analyze jsrun -n 64 -r 8 ./miniWeather_mpi
• Visualize:
scalasca -examine /gpfs/…/scorep_miniWeather_mpi_8p64_sum
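A sketch of the Makefile pattern this refers to (variable, rule and file names are illustrative, not the actual miniWeather Makefile): with PREP empty, a plain "make mpi" builds uninstrumented, while passing PREP on the command line prepends the Score-P wrapper to the compiler. Note that the recipe line must start with a tab.
PREP ?=
FC    = mpif90
mpi: miniWeather_mpi.F90
	$(PREP) $(FC) -O3 -o miniWeather_mpi miniWeather_mpi.F90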
CUBE4 – Central View
3 Windows:
Metric tree
Call tree
System tree
Exploring the menus
How to expand the trees
Computation – System tree
Computation – Box plot
Computation – Sunburst
Computation – Process x Thread
Score-P configuration parameters
CUBE – Flat view
Initialization variation
MPI_Comm_dup variation
MPI_Comm_dup variation II
Getting information about metrics
MPI_Allreduce variation
Collective I/O
MPI_Waitall variation
Information about transferred data
Read-Individual operations
Write-Collective operations
Computational imbalance - Overload
Computational imbalance - Underload
Computational imbalance
Bytes read
Instructions
CUBE4 – Derived metrics – Instructions per Cycle
• Right click on any metric of the metric tree, and
select “Edit metric” -> Create derived metric ->
“as a root”
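For example, assuming the hardware counters recorded earlier appear in the profile with the unique names PAPI_TOT_INS and PAPI_TOT_CYC (the metric info dialog in the GUI shows the exact names), the CubePL expression for the derived IPC metric would be:
metric::PAPI_TOT_INS() / metric::PAPI_TOT_CYC()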
Instructions per cycle (IPC) – Useful computational
workload
There is no specific rule, but codes with an IPC below 1.5 can usually be improved
Floating-point operations (not per second)
Derived metric of floating-point operations to create the FLOPS metric
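A sketch of the corresponding CubePL expression, dividing the floating-point operation count by the time metric to obtain a rate (again assuming the metric unique names PAPI_FP_OPS and time):
metric::PAPI_FP_OPS() / metric::time()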
Difference between two executions
cube_diff scorep_miniWeather_mpi_4p64_sum1/profile.cubex scorep_miniWeather_mpi_8p64_sum2/profile.cubex -c -o result.cubex
cube result.cubex
Tracing with Scalasca
Tracing with Scalasca
• Tracing causes larger overhead during the execution of the application
• More information is recorded, including a timeline of events
• Scalasca will analyze the trace against various inefficiency patterns and identify the bottlenecks
How much memory buffer to use for tracing?
• Examine the profiling data
scalasca -examine -s /gpfs/alpine/.../scorep_miniWeather_mpi_8p64_sum
INFO: Score report written to /gpfs/alpine/…/scorep_miniWeather_mpi_8p64_sum/scorep.score
• head /gpfs/alpine/…/scorep_miniWeather_mpi_8p64_sum/scorep.score
Estimated aggregate size of event trace: 978MB
Estimated requirements for largest trace buffer (max_buf): 16MB
Estimated memory requirements (SCOREP_TOTAL_MEMORY): 18MB
• Add in your submission script (include ~10% extra):
export SCOREP_TOTAL_MEMORY=20MB
Overhead
• You need to set SCOREP_TOTAL_MEMORY large enough to avoid flushing of the performance data during the run.
• For our application, the non-instrumented execution on 1 node takes ~30 seconds, while profiling and tracing take 45 and 53 seconds respectively, i.e. 50% and 76% overhead.
• The overhead always depends on the application and on what you instrument (OpenMP, etc.).
• We have the choice of selective instrumentation or a manual profiling filter (see the sketch below).
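A minimal sketch of a runtime profiling filter (the region name is a placeholder; scorep-score -r on the existing profile shows which small, frequently called regions are candidates for exclusion):
# filter.scorep
SCOREP_REGION_NAMES_BEGIN
  EXCLUDE
    *small_helper_function*
SCOREP_REGION_NAMES_END
The effect can be estimated offline with scorep-score -f filter.scorep profile.cubex, and the filter is applied at run time with export SCOREP_FILTERING_FILE=filter.scorep.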
How to use Scalasca/Score-P with tracing?
• In your submission script, replace:
scalasca -analyze jsrun …
with
scalasca -analyze -q -t jsrun …
• The -q option disables the profiling (summary) measurement, while -t enables trace collection; a submission-script excerpt is sketched below.
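Putting it together, the tracing part of the submission script could look like the following sketch (the buffer size comes from the earlier scorep-score estimate plus ~10%, and the resource counts mirror the profiling run):
export SCOREP_TOTAL_MEMORY=20MB
scalasca -analyze -q -t jsrun -n 64 -r 8 ./miniWeather_mpi
# creates a scorep_..._trace experiment folder containing the automatic trace analysis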
Initial view with tracing
Computation with tracing and expanded trees
Late Sender for Point-to-Point communication
Late Sender - Time
Documentation:
https://apps.fz-juelich.de/scalasca/releases/scalasca/2.5/help/scalasca_patterns.html
Late Sender – Wrong Order Time/Different Sources
Late Sender – Wrong Order Time/Same source
MPI Wait at N x N Time
MPI Wait at N x N Time - Explanation
Number of MPI communications
Short and Long-term delay
Long-term delay sender
Long-term delay sender
Propagating wait states
Tracing start
Terminal wait states
Direct wait states
Wait States
Imbalance in the critical path
Critical Path Profile
Critical Path Imbalance
Intra-partition Imbalance
Non Critical Path Activities
Scalasca with MPI+OpenMP
MiniWeather – MPI+OMP
• Compile:
MPI+OpenMP: make PREP="scorep --mpp=mpi --openmp --pdt" openmp
• Execution (submission script):
scalasca -analyze -q -t jsrun -n 16 -r 2 -a 1 -c 8 "-b packed:8" ./miniWeather_mpi_openmp
• Visualize:
scalasca -examine /gpfs/…/scorep_miniWeather_mpi_openmp_2p16x8_trace
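An excerpt of the corresponding hybrid submission script; the thread count is an assumption matching the -c 8 resource-set size above, and the buffer size should again be taken from the scorep-score estimate:
export OMP_NUM_THREADS=8
export SCOREP_TOTAL_MEMORY=50MB   # placeholder; use the scorep-score estimate
scalasca -analyze -q -t jsrun -n 16 -r 2 -a 1 -c 8 "-b packed:8" ./miniWeather_mpi_openmp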
OpenMP Views
OpenMP Thread Management Time
Duration of Fork
Source code of corresponding OpenMP call
OpenMP Thread Team Fork
Implicit Synchronization
Implicit Synchronization - Explanation
Idle threads
Idle threads - Explanation
Limited Parallelism – Process x Thread
Limited Parallelism - Explanation
Long-term delay costs
Long/Short-term OpenMP Thread Idleness delay costs
Computational Imbalance overload
Computational Imbalance overload – Process x Thread
