ORNL is managed by UT-Battelle, LLC for the US Department of Energy
How to use TAU for Performance Analysis
George S. Markomanolis
7 August 2019
Outline
• Introduction to TAU
• How to compile
• Explaining functionalities of TAU/ParaProf
• Presenting basic steps of PerfExplorer
TAU
• Tuning and Analysis Utilities, developed at University of Oregon
• Scalable and flexible performance analysis toolkit
• Automatic instrumentation through Program Database Toolkit (PDT)
for routines, loops, I/O, memory, phases, etc.
• Installed version on Summit: v2.28.1
• Module: tau
• Web site: https://www.cs.uoregon.edu/research/tau/home.php
• Email: tau-bugs@cs.uoregon.edu
Capability Matrix - TAU
Capability            Profiling  Tracing  Notes/Limitations
MPI, MPI-IO           Yes        Yes
OpenMP CPU            Yes        Yes
OpenMP GPU            Yes        Yes      Some restrictions apply regarding the CUPTI metrics
OpenACC               Yes        Yes      Some functionalities are not ready for production; no metrics available
CUDA                  Yes        Yes      Some functionalities are not ready for production
POSIX I/O             Yes        Yes
POSIX threads         Yes        Yes
Memory – app-level    Yes        Yes
Memory – func-level   Yes        Yes
Hotspot Detection     Yes        Yes
Variance Detection    Yes        Yes
Hardware Counters     Yes        Yes
Compilation
• There are three main ways to use an application with TAU:
– Use the TAU compiler wrappers
• For C: replace the compiler with tau_cc.sh
• For C++: replace the compiler with tau_cxx.sh
• For Fortran: replace the compiler with tau_f90.sh/tau_f77.sh
– Dynamic instrumentation, for example:
• jsrun -n 4 -r 4 -a 1 -c 1 tau_exec -T mpi ./test
– Rewrite the binary (supported on x86_64):
• tau_rewrite -T papi,pdt a.out -o a.inst
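As a rough illustration of the wrapper approach, the sketch below maps a source file to one of the wrapper scripts named above. The helper tau_wrapper_for and the example file names are hypothetical, and the mapping of .f files to tau_f77.sh is an assumption; nothing here is part of TAU itself.

```shell
#!/bin/sh
# Hypothetical helper (not part of TAU): choose a TAU compiler wrapper
# from the source file extension.
tau_wrapper_for() {
    case "$1" in
        *.c)                   echo tau_cc.sh ;;
        *.cc|*.cpp|*.cxx|*.C)  echo tau_cxx.sh ;;
        *.f90|*.F90|*.f95)     echo tau_f90.sh ;;
        *.f|*.F|*.f77)         echo tau_f77.sh ;;   # assumption: fixed-form Fortran
        *) echo "unknown source type: $1" >&2; return 1 ;;
    esac
}

# Example file names are placeholders.
for src in solver.c miniWeather_mpi.cpp model.f90; do
    echo "$src -> $(tau_wrapper_for "$src")"
done
```

In a real build the wrapper name would normally be set once in the Makefile rather than computed per file.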
Compilation (cont.)
Interposition: tau_exec
Compiler: tau_cc.sh -tau_options=-optCompInst (set the TAU_MAKEFILE)
Source: tau_cc.sh (the TAU_MAKEFILE should include PDT)
tau_exec
tau_exec -help
Options:
  -v                                    Verbose mode
  -s                                    Show what will be done but don't actually do anything (dryrun)
  -io                                   Track I/O
  -memory                               Track memory allocation/deallocation
  -memory_debug                         Enable memory debugger
  -cuda                                 Track GPU events via CUDA
  -cupti                                Track GPU events via CUPTI (also see env. variable TAU_CUPTI_API)
  -opencl                               Track GPU events via OpenCL
  -openacc                              Track GPU events via OpenACC (currently PGI only)
  -rocm                                 Track ROCm events via rocprofiler
  -ompt                                 Track OpenMP events via OMPT interface
  -ebs                                  Enable event-based sampling
  -ebs_period=<count>                   Sampling period (default 1000)
  -ebs_source=<counter>                 Counter (default itimer)
  -ebs_resolution=<file|function|line>  Choose sampling granularity
  -um                                   Enable Unified Memory events via CUPTI
  -sass=<level>                         Track GPU events via CUDA with Source Code Locator activity (kernel level or source level)
  -csv                                  Output sass profile in CSV
  -env                                  Track GPU environment activity (power utilization, SM, memory frequency, temperature)
  -T <CUPTI,DISABLE,GNU,GNU_MEM,MPI,OPENMP,PAPI,PDT,PGI,PGI_MEM,PROFILE,SERIAL>  Specify TAU tags
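The options can be combined on one command line. The sketch below is only an illustration: ./test is the placeholder binary from the earlier slide, the particular flag combination is an assumption, and the dry-run flag -s is used so that nothing is executed. When tau_exec is not on the PATH, the script just prints the command.

```shell
#!/bin/sh
# Assumption: this flag combination is illustrative, not a recommendation.
opts="-T mpi -io -memory"

if command -v tau_exec >/dev/null 2>&1; then
    # -s is the dry-run flag: show what would be done, run nothing.
    tau_exec -s $opts ./test
else
    echo "tau_exec not on PATH; would run: tau_exec $opts ./test"
fi
```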
TAU Environment Variables
TAU Compile-Time Environment Variables
To use free format in .f files, use:
% export TAU_OPTIONS='-optPdtF95Opts="-R free"'
How does TAU work?
• Instrumentation:
– Adds probes to perform measurements
– Source code instrumentation
– Wrapping external libraries (I/O, CUDA, OpenACC, OpenCL)
– Rewriting the binary executable
• Measurement:
– Profiling or Tracing
– Direct instrumentation
– Sampling
– Throttling
• Analysis:
– Visualization of profiles and traces
– 3D visualization
– Trace conversion tools
TAU Instrumentation/Measurement
tau_exec
Usage: tau_exec [options] [--] <exe> <exe options>
Options:
  -v                     Verbose mode
  -vv                    Very verbose mode (enables TAU_VERBOSE=1)
  -s                     Show what will be done but don't actually do anything (dryrun)
  -io                    Track I/O
  -memory                Track memory allocation/deallocation
  -memory_debug          Enable memory debugger
  -cuda                  Track GPU events via CUDA
  -cupti                 Track GPU events via CUPTI (also see env. variable TAU_CUPTI_API)
  -opencl                Track GPU events via OpenCL
  -openacc               Track GPU events via OpenACC (currently PGI only)
  -rocm                  Track ROCm events via rocprofiler
  -ompt                  Track OpenMP events via OMPT interface
  -power                 Track power events via PAPI's perf RAPL interface
  -numa                  Track remote DRAM, total DRAM events (needs PAPI with recent perf support for x86_64)
  -ebs                   Enable event-based sampling
  -ebs_period=<count>    Sampling period (default 1000)
  -um                    Enable Unified Memory events via CUPTI
  -sass=<level>          Track GPU events via CUDA with Source Code Locator activity (kernel level or source level)
  -csv                   Output sass profile in CSV
MiniWeather MPI compilation
• module load pgi
• module load tau
• export TAU_MAKEFILE=/sw/summit/tau/2.28.1_patched/ibm64linux/lib/Makefile.tau-pgi-papi-mpi-pdt-pgi
• Replace mpicxx with tau_cxx.sh in the Makefile
• export TAU_OPTIONS='-optLinking=-lpnetcdf -optVerbose'
• make mpi
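The steps above could be collected in a small build script, sketched here under a few assumptions: the guards around module and make are additions so that the script does nothing harmful on a machine without the Summit environment, and replacing mpicxx with tau_cxx.sh in the Makefile remains a manual edit.

```shell
#!/bin/sh
# Load the toolchain only where the module command exists (e.g. on Summit).
if type module >/dev/null 2>&1; then
    module load pgi
    module load tau
fi

# Values taken from the slide.
export TAU_MAKEFILE=/sw/summit/tau/2.28.1_patched/ibm64linux/lib/Makefile.tau-pgi-papi-mpi-pdt-pgi
export TAU_OPTIONS='-optLinking=-lpnetcdf -optVerbose'

# Build only if the TAU makefile is readable; otherwise report and stop.
if [ -r "$TAU_MAKEFILE" ]; then
    make mpi
else
    echo "TAU_MAKEFILE not found: $TAU_MAKEFILE" >&2
fi
```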
MiniWeather MPI – Execution - Profiling
export TAU_METRICS=TIME:PAPI_TOT_INS:PAPI_TOT_CYC:PAPI_FP_OPS
#export TAU_CALLPATH=1
#export TAU_CALLPATH_DEPTH=10
export TAU_PROFILE=1
export TAU_TRACK_MESSAGE=1
export TAU_COMM_MATRIX=1
jsrun -n 64 -r 8 -a 1 -c 1 ./miniWeather_mpi
Or, if the application was compiled with mpicxx instead of the TAU wrapper:
jsrun -n 64 -r 8 -a 1 -c 1 tau_exec ./miniWeather_mpi
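The two launch variants can be wrapped in one script, as in the sketch below. The variable use_tau_exec is a hypothetical switch introduced for illustration, and the check for jsrun is an added guard so that the script only prints the command outside a Summit job.

```shell
#!/bin/sh
# Measurement settings copied from the slide.
export TAU_METRICS=TIME:PAPI_TOT_INS:PAPI_TOT_CYC:PAPI_FP_OPS
export TAU_PROFILE=1
export TAU_TRACK_MESSAGE=1
export TAU_COMM_MATRIX=1

# Hypothetical switch: set use_tau_exec=1 when the binary was built with
# plain mpicxx rather than tau_cxx.sh.
use_tau_exec=${use_tau_exec:-0}
if [ "$use_tau_exec" = 1 ]; then
    launch="jsrun -n 64 -r 8 -a 1 -c 1 tau_exec ./miniWeather_mpi"
else
    launch="jsrun -n 64 -r 8 -a 1 -c 1 ./miniWeather_mpi"
fi

# Launch only where jsrun exists; otherwise print the command.
if command -v jsrun >/dev/null 2>&1; then
    $launch
else
    echo "would run: $launch"
fi
```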
MiniWeather MPI - Execution
• When execution finishes, there is one directory for each metric declared in TAU_METRICS, with a name starting with MULTI__
• If TAU_METRICS is not declared, the default metric TIME is used and the profile files are not placed in a directory. In this case, pack them and open the result with paraprof:
summit> paraprof --pack name.ppk
summit> paraprof name.ppk
• To visualize the results, run paraprof (see also pprof for text mode)
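A quick way to check the output of a run is to derive the expected directory names from TAU_METRICS. The sketch below assumes each directory is named MULTI__ followed by the metric name; the helper expected_dirs is hypothetical.

```shell
#!/bin/sh
# Assumption: one directory per metric, named MULTI__<metric>.
TAU_METRICS=${TAU_METRICS:-TIME:PAPI_TOT_INS:PAPI_TOT_CYC:PAPI_FP_OPS}

# Hypothetical helper: split the colon-separated list and add the prefix.
expected_dirs() {
    echo "$1" | tr ':' '\n' | sed 's/^/MULTI__/'
}

# Report which of the expected directories exist in the current directory.
for d in $(expected_dirs "$TAU_METRICS"); do
    if [ -d "$d" ]; then
        echo "found   $d"
    else
        echo "missing $d"
    fi
done
```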
MiniWeather MPI - Paraprof
• The default metric is TIME
• Each color is a different call
• Each horizontal line is a process or Std.Dev./mean/max/min
Exploring Paraprof
• Options -> Uncheck Stack Bars Together
• It is easier to check the load imbalance
• We will refer to this window as the main window
Exploring Paraprof
• Click on any color to see the values per process: the routine name (with callpath, if activated), the exclusive value in seconds, and the max, min, mean, and standard deviation.
Exploring Paraprof
• Scroll down
Exploring Paraprof
• Click on any label on the left (node 0, mean, etc.). You can see immediately which calls take more time
Paraprof – Thread Statistics Text Window
• Right click on any label of the main window, select “Show Thread Statistics Text Window”
Paraprof – Thread Statistics Table
• Right click on any label of the main window, select “Show Thread Statistics Table”
Paraprof – User Bar Chart
• Right click on any label of the main window, select “Show User Bar Chart”
Paraprof – User Event
• Options -> Select Value Type -> Max. Value
Paraprof – User Event Statistics Window
• Right click on any label of the main window, select “Show User Event Statistics Window”
Paraprof – Context Event Window
• Right click on any label of the main window, select “Show Context Event Window” (with callpath)
Paraprof – Add Thread to Comparison Window
• Right click on node 0 and select “Add Thread to Comparison Window”, then do the same for node 12. You can add as many processes as you like.
Derived Metrics
Options -> Show Derived Metric Panel, select the metrics and then the operator, and click Apply. Then uncheck Show Derived Metric Panel.
Paraprof - IPC
• Click on the new metric, PAPI_TOT_INS/PAPI_TOT_CYC
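The derived metric is instructions per cycle (IPC), the ratio of the two counters. As a small worked example, the sketch below computes it with awk; the counter values are hypothetical and not measured data.

```shell
#!/bin/sh
# Hypothetical counter values for one routine (not measured data).
tot_ins=3000000   # PAPI_TOT_INS
tot_cyc=2000000   # PAPI_TOT_CYC

# IPC = instructions / cycles; awk performs the floating-point division.
ipc=$(awk -v i="$tot_ins" -v c="$tot_cyc" 'BEGIN { printf "%.2f", i / c }')
echo "IPC = $ipc"
```

With these values the script prints IPC = 1.50.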
Paraprof – Mean IPC
• Click on the label mean
Paraprof – IPC for thread 0
• From the main window with the PAPI_TOT_INS/PAPI_TOT_CYC metric, right click on node 0 and select “Show Thread Statistics Table”
Paraprof
• From the main window select Options -> Select Metric… -> Exclusive -> PAPI_FP_OPS
Paraprof – 3D Visualization
• Menu Windows -> 3D Visualization (3D demands OpenGL)
• Exclusive Time and Exclusive Floating operations
Paraprof – 3D Visualization
• Menu Windows -> 3D Visualization (3D demands OpenGL)
• Specific routine and thread
Paraprof – 3D Visualization
• Menu Windows -> 3D Visualization (3D demands OpenGL)
• Exclusive time and total instructions
Paraprof – 3D Visualization
Menu Windows -> 3D Visualization (3D demands OpenGL)
Paraprof – 3D Visualization
• Menu Windows -> 3D Visualization (3D demands OpenGL)
• Exclusive time and instructions per cycle
Paraprof – 3D Visualization
• Menu Windows -> 3D Visualization (3D demands OpenGL)
• Bar Plot
Paraprof – 3D Visualization
• Menu Windows -> 3D Visualization (3D demands OpenGL)
• Scatter Plot
Paraprof – 3D Visualization
• Menu Windows -> 3D Visualization (3D demands OpenGL)
• Topology Plot
Paraprof – 3D Communication Matrix
• Menu Windows -> 3D Visualization (3D demands OpenGL)
• Max message size vs Number of calls
Paraprof
Menu Windows -> Communication Matrix
Which loops require the most time?
• File select.tau:
BEGIN_INSTRUMENT_SECTION
loops routine="#"
END_INSTRUMENT_SECTION
• Declare TAU options:
export TAU_OPTIONS="-optTauSelectFile=select.tau -optLinking=-lpnetcdf -optVerbose"
• Do not forget to unset TAU_OPTIONS when not required
• Execute as before
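The whole sequence might look like the sketch below, which writes select.tau into the current directory, sets the options, and unsets them afterwards. The rebuild and run steps are left as a comment because they depend on the Summit environment.

```shell
#!/bin/sh
# Write the selective instrumentation file: instrument loops in all routines.
cat > select.tau <<'EOF'
BEGIN_INSTRUMENT_SECTION
loops routine="#"
END_INSTRUMENT_SECTION
EOF

export TAU_OPTIONS="-optTauSelectFile=select.tau -optLinking=-lpnetcdf -optVerbose"
echo "building with TAU_OPTIONS=$TAU_OPTIONS"

# ... rebuild with the TAU wrapper and run the application here ...

# Clear the options so later builds are not affected.
unset TAU_OPTIONS
```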
Paraprof - Loops
Paraprof - Loops
Select Options -> Select Metric… -> Exclusive… -> PAPI_TOT_INS
Paraprof - Loops
Select Options -> Select Metric… -> Exclusive… -> PAPI_TOT_INS/PAPI_TOT_CYC
Paraprof - Loops
Select Options -> Select Metric… -> Exclusive… -> PAPI_FP_OPS
Paraprof
From the main window select a node
Click on node 0
Paraprof – Function Histogram
From the main window select a node
Right click on MPI_File_write_at_all() -> Show Function Histogram
Callpath
export TAU_METRICS=TIME:PAPI_TOT_INS:PAPI_TOT_CYC:PAPI_FP_OPS
export TAU_CALLPATH=1
export TAU_CALLPATH_DEPTH=10
export TAU_PROFILE=1
export TAU_TRACK_MESSAGE=1
export TAU_COMM_MATRIX=1
jsrun -n 64 -r 8 -a 1 -c 1 ./miniWeather_mpi
Paraprof - Callpath
From the main window, right click on any label (node 0, mean, etc.) and select “Show Thread Call Graph”
Paraprof - Callpath
From the main window, right click on any label (node 0, mean, etc.) and select “Show Thread Statistics Table”
