KAUST Supercomputing Laboratory
Introduction to Performance Analysis tools on Shaheen II
George Markomanolis
Computational Scientist
April 17th, 2016
Outline	
KAUST King Abdullah University of Science and Technology 2
❖  Introduction
❖  Test cases
❖  Cray tools
•  Perftools
•  Cray Apprentice 2
•  Reveal
❖  Extrae/Paraver (briefly)
Introduction
❖  Why performance analysis?
•  Investigate the bottlenecks of an application
•  Identify potential improvements
•  Better usage of the hardware
❖  Profiling
•  Sampling
§  Lightweight
§  Overhead depends on the sampling frequency
§  Can lack resolution if there are small function calls
•  Event Tracing
§  Detailed information
§  Captures every event
§  Can capture communication events
§  Drawbacks: overhead and large amounts of data
Sampling
•  Statistical inference of program behavior
•  Not very detailed information
•  Mainly for long-running applications
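The resolution trade-off of sampling can be illustrated with a toy simulation. This is only a sketch, not CrayPat code: the helper `sample_profile`, the function names, and the durations are all made up for illustration. A timeline of function executions is probed at a fixed interval and each probe is credited to the function running at that instant.

```python
from collections import Counter

def sample_profile(timeline_us, interval_us):
    """Count samples per function for a timeline of (name, duration_us) pairs."""
    segments = []
    start = 0
    for name, duration in timeline_us:
        segments.append((start + duration, name))  # (end time, function)
        start += duration
    total = start
    counts = Counter()
    idx = 0
    for now in range(0, total, interval_us):
        # Advance to the segment that contains the sample time.
        while now >= segments[idx][0]:
            idx += 1
        counts[segments[idx][1]] += 1
    return counts

# Hypothetical workload: 10 iterations of a long, a tiny, and a medium function.
timeline = [("rhs", 35_000), ("tiny", 200), ("blts", 14_800)] * 10

coarse = sample_profile(timeline, 10_000)  # one sample every 10 ms
fine = sample_profile(timeline, 1_000)     # one sample every 1 ms

print("coarse:", dict(coarse))
print("fine:  ", dict(fine))
```

With the 10 ms interval the tiny function is never observed, while the 1 ms interval sees it at the cost of ten times as many samples.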
Tracing
•  Every event is captured
•  Detailed information
•  Overhead (depends on many factors)
Case study
❖  The NAS Parallel Benchmarks (NPB) consist of five kernels and three pseudo-applications, developed by the NASA Advanced Supercomputing Division
❖  Why NPB/LU?
•  LU stands for Lower-Upper Gauss-Seidel solver
•  Simple application for testing purposes which combines computation and communication
•  Compiles with the Cray, Intel, and GNU compilers, and builds quickly
CrayPat overview
❖  Assists the user with application performance analysis and optimization
•  Provides concrete suggestions instead of just reporting
❖  Basic functionality is available with all the compilers on the system
❖  Requires no source code or Makefile modification (in most cases)
Components of CrayPat
❖  Module perftools-base
•  pat_build – Instruments the program to be analyzed
•  pat_report – Generates text reports from the performance data captured during program execution and exports data for use in other programs
•  Cray Apprentice2 – A graphical analysis tool that can be used to visualize and explore the performance data captured during program execution
•  Reveal – A graphical source code analysis tool that can be used to correlate performance analysis data with annotated source code listings, to identify key opportunities for optimization (it works only with the Cray compiler)
•  grid_order – Generates MPI rank order information that can be used with the MPICH_RANK_REORDER_METHOD environment variable
•  pat_help – Help system which provides extensive usage information
Files generated during regular profiling
❖  a.out+pat+PID-node[s|t].xf: raw data files
•  Depending on the profiling approach and conditions, the execution of an instrumented application can create one or more .xf files, where:
§  a.out is the name of the original program
§  PID is the process ID assigned to the instrumented program at runtime
§  node is the physical node ID upon which the rank zero process executed
§  s|t is a letter code indicating the type of experiment performed, either s for sampling or t for tracing
•  The pat_report tool can dump the .xf files or export them to another file format for use with other applications, e.g. *.ap2 files
❖  *.ap2 files: self-contained compressed performance files
•  Normally about 5 times smaller than the corresponding *.xf files
•  Only one *.ap2 per experiment, in comparison to potentially multiple *.xf files
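The naming scheme can be unpacked mechanically. The sketch below is illustrative only: the regular expression and the helper `describe` are assumptions derived from the description on this slide, not part of CrayPat. The sample name is the data file that appears in the sampling report header later in these slides.

```python
import re

# <program>+<PID>-<node><s|t>.<xf|ap2>, where <program> already carries the
# "+pat" or "+apa" suffix added by pat_build.
DATA_FILE = re.compile(
    r"^(?P<program>.+)\+(?P<pid>\d+)-(?P<node>\d+)(?P<kind>[st])\.(?P<ext>xf|ap2)$"
)

def describe(filename):
    """Split a CrayPat data file name into its documented components."""
    match = DATA_FILE.match(filename)
    if match is None:
        raise ValueError(f"not a recognised data file name: {filename}")
    info = match.groupdict()
    info["pid"] = int(info["pid"])
    info["node"] = int(info["node"])
    info["experiment"] = "sampling" if info["kind"] == "s" else "tracing"
    return info

info = describe("lu.C.64+pat+10974-35s.ap2")
print(info)
```

Here the node ID 35 agrees with the system name nid00035 printed in the same report header.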
Prepare for the tutorial
•  Connect to Shaheen II and copy the material:
•  ssh -X username@shaheen.kaust.edu.sa
•  cp /scratch/tmp/performance_workshop.tgz .
•  tar zxvf performance_workshop.tgz
•  cd performance_workshop/NPB3.3-MPI
•  The slides are located in the folder performance_workshop/
How to use CrayPat
❖  Load Perftools
•  module unload darshan
•  module load perftools-base/6.3.2
•  module load perftools/6.3.2
❖  Compile the code
•  make clean
•  make LU NPROCS=64 CLASS=C
§  “WARNING: PerfTools is saving object files from a temporary directory into directory…”
•  cd bin
❖  The new binary, lu.C.64, is not instrumented yet
Sampling instrumentation I
❖  Execute the application
•  sbatch --reservation=s1001_85 submit.sh
•  Check the output files (lu_C_64_out_...txt)
❖  Build the instrumented binary with sampling instrumentation
•  pat_build -S lu.C.64
❖  The instrumented binary is called lu.C.64+pat
❖  Some results in this presentation were obtained with 128 MPI processes.
Sampling instrumentation II
❖  Edit the submit.sh file, comment line 13 and uncomment line 16
•  sbatch --reservation=s1001_85 submit.sh
•  The reservation of the nodes for this workshop is called s1001_85; you need to use it every time you submit jobs during this presentation.
❖  The performance data are located in a file whose name has the format lu.C.64+pat+PID-XXXs.xf (PID and XXX are numbers)
Create your first report with sampling instrumentation
❖  Execute the pat_report tool
•  pat_report -o sampling_report_lu_C_64.txt lu.C.64+pat+PID-XXXs.xf
❖  Open the file sampling_report_lu_C_64.txt
❖  CrayPat/X: Version 6.3.2 Revision rc1/6.3.2 02/25/16 18:26:21
Number of PEs (MPI ranks): 64
Numbers of PEs per Node: 32 PEs on each of 2 Nodes
Numbers of Threads per PE: 1
Number of Cores per Socket: 16
Execution start time: Wed Apr 13 16:57:06 2016
System name and speed: nid00035 2301 MHz (approx)
Current path to data file:
/scratch/markomg/NPB3.3.1/NPB3.3-MPI/bin/lu.C.64+pat+10974-35s.ap2
(RTS)
Create your first report with sampling instrumentation
❖  Table 1: Profile by Function
Samp% | Samp | Imb. | Imb. |Group
| | Samp | Samp% | Function
| | | | PE=HIDE
100.0% | 1,039.1 | -- | -- |Total
|--------------------------------------------------------------------
| 72.3% | 751.5 | -- | -- |USER
||-------------------------------------------------------------------
|| 33.5% | 347.9 | 45.1 | 11.6% |rhs_
|| 8.0% | 83.4 | 24.6 | 23.0% |blts_
|| 7.9% | 82.1 | 18.9 | 18.9% |buts_
|| 7.8% | 81.2 | 23.8 | 22.9% |jacld_
|| 7.4% | 77.4 | 24.6 | 24.3% |jacu_
|| 4.6% | 47.8 | 26.2 | 35.7% |exchange_3_
|| 2.2% | 23.2 | 15.8 | 40.8% |ssor_
||===================================================================
Create your first report with sampling instrumentation (MPI with sampling is not helpful)
❖  Table 1: Profile by Function
Samp% | Samp | Imb. | Imb. |Group
| | Samp | Samp% | Function
| | | | PE=HIDE
| 18.6% | 192.9 | -- | -- |MPI
||-------------------------------------------------------------------
|| 6.3% | 65.1 | 154.9 | 70.9% |MPIDI_Cray_shared_mem_coll_bcast
|| 4.0% | 42.0 | 59.0 | 58.9% |MPIDI_CH3I_Progress
|| 2.2% | 22.5 | 99.5 | 82.2% |MPIDI_Cray_shared_mem_coll_barrier
|| 1.8% | 19.0 | 51.0 | 73.5% |MPID_nem_gni_poll
|| 1.5% | 15.2 | 39.8 | 72.9% |MPID_nem_gni_check_localCQ
||===================================================================
| 4.5% | 47.0 | -- | -- |GNI
||-------------------------------------------------------------------
|| 4.3% | 44.8 | 118.2 | 73.1% |GNI_CqGetEvent
||===================================================================
| 4.4% | 45.8 | -- | -- |ETC
||-------------------------------------------------------------------
|| 2.1% | 21.9 | 65.1 | 75.4% |GNII_DlaProgress
|| 1.0% | 10.4 | 8.6 | 45.8% |_cray_mpi_memcpy_snb
|====================================================================
Profile by Group, Function, and Line
❖  Table 2: Profile by Group, Function, and Line
Samp% | Samp | Imb. | Imb. |Group
| | Samp | Samp% | Function
| | | | Source
| | | | Line
| | | | PE=HIDE
100.0% | 1,039.1 | -- | -- |Total
|---------------------------------------------------------------------
| 72.3% | 751.5 | -- | -- |USER
||--------------------------------------------------------------------
|| 33.5% | 347.9 | -- | -- |rhs_
3| | | | | NPB3.3.1/NPB3.3-MPI/LU/rhs.f
||||------------------------------------------------------------------
4||| 2.2% |22.5 | 13.5 | 37.8% |line.43
4||| 1.8% |18.2 | 9.8 | 35.2% |line.96
4||| 1.6% |16.4 | 10.6 | 39.6% |line.228
…
File rhs.f, line 43
      do k = 1, nz
         do j = 1, ny
            do i = 1, nx
               do m = 1, 5
                  rsd(m,i,j,k) = - frct(m,i,j,k)
               end do
            end do
         end do
      end do
More information from sampling
❖  Table 3: Wall Clock Time, Memory High Water Mark (limited entries shown)
Process | Process |PE=[mmm]
Time | HiMem |
| (MBytes) |
20.455187 | 39.18 |Total
|------------------------------
| 23.922620 |39.74 |pe.34
| 19.638636 |39.57 |pe.107
| 16.558081 |39.66 |pe.68
|==============================
❖  ======================== Additional details ========================
Experiment: samp_pc_time
Sampling interval: 10000 microsecs
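Sample counts can be turned into rough time estimates. The sketch below assumes that each sample stands for one full sampling interval, which is an interpretation and not something the report states. The sample counts are copied from Table 1 of the sampling report.

```python
# "Sampling interval: 10000 microsecs" from the report details.
SAMPLING_INTERVAL_S = 10000e-6

# Average samples per PE, copied from Table 1 (Profile by Function).
samples = {"Total": 1039.1, "rhs_": 347.9, "blts_": 83.4}

# Assumption: one sample represents one sampling interval of execution.
estimated_seconds = {
    name: count * SAMPLING_INTERVAL_S for name, count in samples.items()
}

for name, seconds in estimated_seconds.items():
    print(f"{name:6s} ~{seconds:.3f} s")
```

The estimate of about 3.48 s for rhs_ is close to the 3.450003 s that the tracing experiment later in these slides reports for the same function. The estimated total is well below the process time in Table 3; the slides do not explain that gap, so treat the figures as approximate.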
Automatic Profiling Analysis (APA)
❖  The previous pat_report run also created two new files, with the extensions .apa and .ap2; the second one is presented later.
❖  Open the file sampling_report_lu_C_64.apa
# Collect the default PERFCTR group.
-Drtenv=PAT_RT_PERFCTR=default
# Alternatively, energy counters may be added to the default
# list by commenting out the line above and enabling the
# line below. Note that this may significantly increase the
# runtime overhead for high trace counts. The parentheses
# in the syntax below denote counters that are not available
# on all platforms.
# -Drtenv=PAT_RT_PERFCTR=default,(PM_ENERGY:NODE),(PM_ENERGY:ACC)
# Libraries to trace.
-g mpi
Automatic Profiling Analysis (APA) II
# Local functions are listed for completeness, but cannot be traced.
-w # Enable tracing of user-defined functions.
# 33.49% 32799 bytes
-T rhs_
# 8.02% 3379 bytes
-T blts_
# 7.90% 3863 bytes
-T buts_
# 7.81% 14983 bytes
-T jacld_
…
-o lu.C.128+apa # New instrumented program.
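The function list in an .apa file is easy to inspect programmatically. The following sketch parses only the four entries shown on this slide; the helper `traced_functions` is hypothetical, and the assumption that each `-T` line is preceded by a comment of the form "# percent size bytes" comes from this excerpt alone.

```python
import re

# The four entries visible on the slide; the elided part of the file is not guessed.
APA_EXCERPT = """\
# 33.49% 32799 bytes
-T rhs_
# 8.02% 3379 bytes
-T blts_
# 7.90% 3863 bytes
-T buts_
# 7.81% 14983 bytes
-T jacld_
"""

def traced_functions(apa_text):
    """Return (function, sample_percent, size_bytes) for every -T entry."""
    entries = []
    pending = (None, None)
    for raw in apa_text.splitlines():
        line = raw.strip()
        note = re.match(r"#\s*([\d.]+)%\s+(\d+)\s+bytes", line)
        if note:
            pending = (float(note.group(1)), int(note.group(2)))
        elif line.startswith("-T "):
            name = line[3:].split()[0]
            entries.append((name, pending[0], pending[1]))
            pending = (None, None)
    return entries

entries = traced_functions(APA_EXCERPT)
# Functions smaller than a chosen text size, e.g. the 1200 bytes mentioned
# in the pat_build warning.
small = [name for name, _, size in entries if size < 1200]

for name, percent, size in entries:
    print(f"{name:8s} {percent:6.2f}% {size:6d} bytes")
print("below 1200 bytes:", small)
```

None of the four listed functions falls below 1200 bytes, so a `trace-text-size` of that value would not exclude any of them.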
Automatic Profiling Analysis (APA) III
❖  To build the new binary from the APA suggestions, execute the following
•  pat_build -O sampling_report_lu_C_64.apa
WARNING: Tracing small, frequently called functions can add excessive overhead.
WARNING: To set a minimum size, say 1200 bytes, for traced functions, use:
-D trace-text-size=1200.
INFO: A total of 7 selected non-group functions were traced.
INFO: A maximum of 105 functions from group 'mpi' will be traced.
❖  The new instrumented binary is called lu.C.64+apa
❖  Edit the submit.sh file, comment line 16 and uncomment line 19
•  sbatch --reservation=s1001_85 submit.sh
❖  The new performance file is called lu.C.64+apa+PID-XXXt.xf
❖  Use the tool pat_report
•  pat_report -o report_apa_lu_C_64.txt lu.C.64+apa+PID-XXXt.xf
❖  Open the file report_apa_lu_C_64.txt
Performance report I
Table 1: Profile by Function Group and Function
Time% | Time | Imb. | Imb. | Calls |Group
| | Time | Time% | | Function
| | | | | PE=HIDE
100.0% | 12.081612 | -- | -- | 455,387.7 |Total
|-------------------------------------------------------------------
| 73.8% | 8.922097 | -- | -- | 161,404.0 |USER
||------------------------------------------------------------------
|| 28.6% | 3.450003 | 0.416838 | 10.9% | 253.0 |rhs_
|| 10.4% | 1.260820 | 0.153597 | 10.9% | 40,160.0 |buts_
|| 10.4% | 1.259256 | 0.144344 | 10.4% | 40,160.0 |blts_
|| 7.5% | 0.909228 | 0.122412 | 12.0% | 40,160.0 |jacld_
|| 7.1% | 0.861425 | 0.130527 | 13.3% | 40,160.0 |jacu_
|| 5.7% | 0.684862 | 0.139784 | 17.1% | 2.0 |ssor_
|| 3.7% | 0.451014 | 0.295409 | 39.9% | 508.0 |exchange_3_
||==================================================================
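The Imb. Time% column can be reproduced from the Time and Imb. Time columns. The sketch below assumes the definition Imb. Time% = 100 × (Imb. Time / maximum time) × N / (N − 1), where Imb. Time is the maximum minus the average over the PEs and N = 128 is assumed to be the number of PEs behind this table. The definition is an assumption that has only been checked against the rows on this slide.

```python
def imbalance_percent(avg_time, imb_time, num_pes):
    """Imbalance percentage from the average time and the max-minus-average time."""
    max_time = avg_time + imb_time
    return 100.0 * imb_time / max_time * num_pes / (num_pes - 1)

NUM_PES = 128  # assumption: this report comes from the 128-process run

# (Time, Imb. Time) copied from Table 1 for three USER functions.
rows = {
    "rhs_": (3.450003, 0.416838),
    "buts_": (1.260820, 0.153597),
    "blts_": (1.259256, 0.144344),
}

imbalance = {
    name: imbalance_percent(avg, imb, NUM_PES) for name, (avg, imb) in rows.items()
}

for name, value in imbalance.items():
    print(f"{name:6s} {value:.1f}%")
```

Under this assumption 0% means perfectly balanced work, and 100% means a single PE did all the work for that function.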
Performance report II
Table 1: Profile by Function Group and Function
Time% | Time | Imb. | Imb. | Calls |Group
| | Time | Time% | | Function
| | | | | PE=HIDE
| 17.8% | 2.148878 | -- | -- | 293,958.7 |MPI
||------------------------------------------------------------------
|| 11.9% | 1.432456 | 3.029769 | 68.4% | 145,580.0 |MPI_RECV
|| 3.8% | 0.465076 | 0.411500 | 47.3% | 146,502.9 |MPI_SEND
|| 2.0% | 0.241474 | 1.003594 | 81.2% | 922.9 |mpi_wait
||==================================================================
| 8.4% | 1.010618 | -- | -- | 24.0 |MPI_SYNC
||------------------------------------------------------------------
|| 8.2% | 0.991427 | 0.991319 | 100.0% | 1.0 |mpi_init_(sync)
|===================================================================
❖  If needed disable MPI Sync with
•  export PAT_RT_MPI_SYNC=0
MPI topology
❖  MPI Grid Detection:
There appears to be point-to-point MPI communication in a 8 X 16
grid pattern. The 17.8% of the total execution time spent in MPI
functions might be reduced with a rank order that maximizes
communication between ranks on the same node. The effect of several
rank orders is estimated below.
A file named MPICH_RANK_ORDER.Grid was generated along with this
report and contains usage instructions and the Hilbert rank order
from the following table.
Rank Order    On-Node      On-Node      MPICH_RANK_REORDER_METHOD
              Bytes/PE     Bytes/PE%
                           of Total
                           Bytes/PE
Hilbert       3.039e+10    87.40%       3
SMP           2.947e+10    84.75%       1
Fold          1.685e+10    48.46%       2
RoundRobin    1.106e+10    31.82%       0
❖  Example for 128 MPI processes
0,1,17,16,32,48...
68,84,85,69,70,71…
How to use the new MPI topology file:
1.  cp MPICH_RANK_ORDER.XXX MPICH_RANK_ORDER
2.  export MPICH_RANK_REORDER_METHOD=3
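Why rank order matters can be seen in a toy model. The sketch below is not what pat_report computes: it assumes a row-major mapping of 128 ranks onto the 8 X 16 grid, nearest-neighbour exchange only, and 32 ranks per node, and it counts neighbour links rather than bytes, so its percentages will differ from the table above. The function `on_node_fraction` and the three placements are illustrative.

```python
ROWS, COLS, PER_NODE = 8, 16, 32
NODES = ROWS * COLS // PER_NODE  # 4 nodes

def on_node_fraction(node_of):
    """Fraction of nearest-neighbour links whose two ranks share a node."""
    on_node = total = 0
    for r in range(ROWS):
        for c in range(COLS):
            rank = r * COLS + c
            for dr, dc in ((0, 1), (1, 0)):  # right and lower neighbour
                rr, cc = r + dr, c + dc
                if rr < ROWS and cc < COLS:
                    total += 1
                    if node_of(rank) == node_of(rr * COLS + cc):
                        on_node += 1
    return on_node / total

# Consecutive ranks fill one node after another (SMP-style placement).
smp = on_node_fraction(lambda rank: rank // PER_NODE)
# Consecutive ranks are dealt out to the nodes in turn (round-robin placement).
round_robin = on_node_fraction(lambda rank: rank % NODES)
# Each node holds a compact 4 x 8 tile of the grid (a custom rank order).
tiled = on_node_fraction(lambda rank: ((rank // COLS) // 4) * 2 + (rank % COLS) // 8)

print(f"SMP         {smp:.1%}")
print(f"round-robin {round_robin:.1%}")
print(f"4x8 tiles   {tiled:.1%}")
```

In this model the compact tiles keep the most links on-node and round-robin the fewest, which matches the qualitative ranking in the report even though the exact figures are not comparable.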
Hardware counters
D1 cache utilization:
All instrumented functions with significant execution time had D1
cache hit ratios above the desirable minimum of 75.0%.
D1 + D2 cache utilization:
All instrumented functions with significant execution time had
combined D1 and D2 cache hit ratios above the desirable minimum of
80.0%.
TLB utilization:
All instrumented functions with significant execution time had more
than the desirable minimum of 200 data references per TLB miss.
Find more about hardware performance counters
❖  Execute:
•  pat_help
•  counters haswell groups
Hardware counters
Total
------------------------------------------------------------------------------
Time% 100.0%
Time 12.081612 secs
Imb. Time -- secs
Imb. Time% --
Calls 0.038M/sec 455,387.7 calls
CPU_CLK_THREAD_UNHALTED:THREAD_P 47,351,574,846
CPU_CLK_THREAD_UNHALTED:REF_XCLK 2,124,810,371
DTLB_LOAD_MISSES:MISS_CAUSES_A_WALK 6,686,929
DTLB_STORE_MISSES:MISS_CAUSES_A_WALK 2,823,391
L1D:REPLACEMENT 1,404,754,113
L2_RQSTS:ALL_DEMAND_DATA_RD 515,418,048
L2_RQSTS:DEMAND_DATA_RD_HIT 197,719,491
MEM_UOPS_RETIRED:ALL_LOADS 20,512,449,601
CPU_CLK 2.23GHz
TLB utilization 2,156.86 refs/miss 4.21 avg uses
D1 cache hit,miss ratios 93.2% hits 6.8% misses
D1 cache utilization (misses) 14.60 refs/miss 1.83 avg hits
D2 cache hit,miss ratio 77.4% hits 22.6% misses
D1+D2 cache hit,miss ratio 98.5% hits 1.5% misses
D1+D2 cache utilization 64.57 refs/miss 8.07 avg hits
D2 to D1 bandwidth 2,603.843MiB/sec 32,986,755,044 bytes
Average Time per Call 0.000027 secs
CrayPat Overhead : Time 8.0%
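The derived metrics in this report can be recomputed from the raw counter values. The formulas below were inferred by matching the numbers on this slide and are not taken from the CrayPat documentation, so treat them as a plausible reading rather than the tool's definition.

```python
# Raw totals copied from the report above.
counters = {
    "CPU_CLK_THREAD_UNHALTED:THREAD_P": 47_351_574_846,
    "CPU_CLK_THREAD_UNHALTED:REF_XCLK": 2_124_810_371,
    "DTLB_LOAD_MISSES:MISS_CAUSES_A_WALK": 6_686_929,
    "DTLB_STORE_MISSES:MISS_CAUSES_A_WALK": 2_823_391,
    "L1D:REPLACEMENT": 1_404_754_113,
    "L2_RQSTS:ALL_DEMAND_DATA_RD": 515_418_048,
    "L2_RQSTS:DEMAND_DATA_RD_HIT": 197_719_491,
    "MEM_UOPS_RETIRED:ALL_LOADS": 20_512_449_601,
}
TIME_S = 12.081612
CALLS = 455_387.7

loads = counters["MEM_UOPS_RETIRED:ALL_LOADS"]
d1_misses = counters["L1D:REPLACEMENT"]
d2_misses = (counters["L2_RQSTS:ALL_DEMAND_DATA_RD"]
             - counters["L2_RQSTS:DEMAND_DATA_RD_HIT"])
tlb_misses = (counters["DTLB_LOAD_MISSES:MISS_CAUSES_A_WALK"]
              + counters["DTLB_STORE_MISSES:MISS_CAUSES_A_WALK"])

derived = {
    # REF_XCLK counts at 100 MHz, i.e. 0.1 GHz.
    "CPU_CLK (GHz)": counters["CPU_CLK_THREAD_UNHALTED:THREAD_P"]
                     / counters["CPU_CLK_THREAD_UNHALTED:REF_XCLK"] * 0.1,
    "TLB refs/miss": loads / tlb_misses,
    "D1 hit %": 100.0 * (1 - d1_misses / loads),
    "D1 refs/miss": loads / d1_misses,
    "D2 hit %": 100.0 * (1 - d2_misses / d1_misses),
    "D1+D2 hit %": 100.0 * (1 - d2_misses / loads),
    "D1+D2 refs/miss": loads / d2_misses,
    "Average time per call (s)": TIME_S / CALLS,
}

for name, value in derived.items():
    print(f"{name:26s} {value:.6g}")
```

The results agree with the report's 2.23 GHz, 2,156.86 refs/miss, 93.2%, 14.60, 77.4%, 98.5%, 64.57 and 0.000027 secs after rounding.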
Hardware Counters - Description
Hardware performance counter events:
CPU_CLK_THREAD_UNHALTED:REF_XCLK  Count core clock cycles whenever the clock signal on the specific core is running (not halted): Cases when the core is unhalted at 100 MHz
CPU_CLK_THREAD_UNHALTED:THREAD_P  Count core clock cycles whenever the clock signal on the specific core is running (not halted): Cycles when thread is not halted
DTLB_LOAD_MISSES:MISS_CAUSES_A_WALK  Data TLB load misses: Misses in all DTLB levels that cause page walks
DTLB_STORE_MISSES:MISS_CAUSES_A_WALK  Data TLB store misses: Misses in all DTLB levels that cause page walks
L1D:REPLACEMENT  L1D cache: L1D Data line replacements
L2_RQSTS:ALL_DEMAND_DATA_RD  L2 requests: Any data read request to L2 cache
L2_RQSTS:DEMAND_DATA_RD_HIT  L2 requests: Demand Data Read requests that hit L2 cache
MEM_UOPS_RETIRED:ALL_LOADS  Memory uops retired (Precise Event): All load uops retired
PM_ENERGY:NODE  Compute node accumulated energy
CYCLES_RTC  User Cycles (approx, from rtc)
Load Balance with MPI Message stats
Table 3: Load Balance with MPI Message Stats (limited entries shown)
Time% | Time | MPI Msg | MPI Msg Bytes | Avg MPI |Group
| | Count | | Msg Size | PE=[mmm]
100.0% | 12.081612 | 146,522.9 | 271,667,585.0 | 1,854.10 |Total
|--------------------------------------------------------------------
| 73.8% | 8.922097 | 0.0 | 0.0 | -- |USER
||-------------------------------------------------------------------
|| 80.6% | 9.739499 | 0.0 | 0.0 | -- |pe.26
|| 75.8% | 9.160217 | 0.0 | 0.0 | -- |pe.61
|| 45.1% | 5.442844 | 0.0 | 0.0 | -- |pe.127
||===================================================================
| 17.8% | 2.148878 | 146,522.9 | 271,667,585.0 | 1,854.10 |MPI
||-------------------------------------------------------------------
|| 48.8% | 5.891394 | 80,852.0 | 143,737,828.0 | 1,777.79 |pe.127
|| 15.5% | 1.874838 | 161,678.0 | 293,895,236.0 | 1,817.78 |pe.43
|| 10.5% | 1.263484 | 161,678.0 | 303,691,732.0 | 1,878.37 |pe.26
||===================================================================
| 8.4% | 1.010618 | 0.0 | 0.0 | -- |MPI_SYNC
||-------------------------------------------------------------------
|| 22.0% | 2.653814 | 0.0 | 0.0 | -- |pe.103
|| 7.4% | 0.895974 | 0.0 | 0.0 | -- |pe.123
|| 0.1% | 0.012597 | 0.0 | 0.0 | -- |pe.0
Load Balance with MPI message stats by caller
Table 4: MPI Message Stats by Caller (limited entries shown)
MPI | MPI Msg Bytes | MPI Msg | MsgSz | 16<= | 256<= | 64KiB<= |Function
Msg | | Count | <16 | MsgSz | MsgSz | MsgSz | Caller
Bytes% | | | Count | <256 | <4KiB | <1MiB |PE=[mmm]
| | | | Count | Count | Count |
100.0% | 271,667,585.0 | 146,522.9 | 14.0 | 6.9 | 145,581.3 | 920.8 |Total
|-----------------------------------------------------------------------------
| 100.0% | 271,667,261.0 | 146,502.9 | 0.0 | 0.9 | 145,581.3 | 920.8 |MPI_SEND
||----------------------------------------------------------------------------
|| 67.5% | 183,314,340.0 | 920.8 | 0.0 | 0.0 | 0.0 | 920.8 |exchange_3_
3| 67.2% | 182,592,630.0 | 917.1 | 0.0 | 0.0 | 0.0 | 917.1 | rhs_
4| | | | | | | | ssor_
5| | | | | | | | applu_
||||||------------------------------------------------------------------------
6||||| 77.2% | 209,848,320.0 | 1,012.0 | 0.0 | 0.0 | 0.0 | 1,012.0 |pe.17
6||||| 72.4% | 196,732,800.0 | 1,012.0 | 0.0 | 0.0 | 0.0 | 1,012.0 |pe.88
6||||| 36.2% | 98,366,400.0 | 506.0 | 0.0 | 0.0 | 0.0 | 506.0 |pe.127
❖  To adjust the MPI eager message size limit (default 8 KB, maximum 128 KB) according to the MPI message stats, set the following in your job script; MPICH_ENV_DISPLAY=1 prints the MPICH settings in effect at startup:
•  export MPICH_GNI_MAX_EAGER_MSG_SIZE=131072
•  export MPICH_ENV_DISPLAY=1
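Before changing the eager limit it helps to compare it with the message sizes in Table 4. The sketch below divides the byte totals by the message counts for three rows of that table; the limits are the default and maximum quoted on this slide. It is plain arithmetic on the slide's numbers, and averages can hide the real size distribution.

```python
EAGER_DEFAULT = 8 * 1024    # default eager limit quoted on the slide
EAGER_MAX = 128 * 1024      # 131072, the value exported above

# (MPI Msg Bytes, MPI Msg Count) copied from Table 4.
rows = {
    "Total": (271_667_585.0, 146_522.9),
    "exchange_3_": (183_314_340.0, 920.8),
    "pe.17": (209_848_320.0, 1_012.0),
}

average_size = {name: nbytes / count for name, (nbytes, count) in rows.items()}

for name, size in average_size.items():
    if size <= EAGER_DEFAULT:
        verdict = "fits the default eager limit"
    elif size <= EAGER_MAX:
        verdict = "fits only with a raised eager limit"
    else:
        verdict = "exceeds even the maximum eager limit"
    print(f"{name:12s} {size:12.2f} bytes  {verdict}")
```

The overall average of about 1,854 bytes matches the Avg MPI Msg Size column of Table 3, while the exchange_3_ messages average roughly 199,000 bytes, consistent with Table 4 placing them in the 64KiB to 1MiB bin.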
Wall clock and memory high water mark
Table 5: Wall Clock Time, Memory High Water Mark (limited entries shown)
Process | Process |PE=[mmm]
Time | HiMem |
| (MBytes) |
20.166938 | 48.25 |Total
|------------------------------
| 23.813279 | 48.70 |pe.98
| 20.039177 | 49.79 |pe.82
| 17.694283 | 49.70 |pe.0
|==============================
❖  To extract the profiling information for all the processes instead of aggregated data, the pat_report tool can be used as follows:
•  pat_report -s pe=ALL -o sampling_results_all.txt lu.C.64+apa+PID-XXXt.xf
•  pat_report -s filter_input='pe<=5' ...
•  pat_report -s filter_input='pe%2==0' ...
Apprentice2
A GUI for the raw data
How to start with Apprentice2
❖  The pat_report tool has created one file with extension ap2
•  ls -ltr *.ap2
❖  In order to visualize the performance data
•  Connect to Shaheen II with “ssh -X …”
•  module load perftools-base/6.3.2
•  app2 lu.C.64+apa+PID-XXt.ap2
❖  The example in this presentation is for lu.C.128
Apprentice2 – Generic view
Apprentice2 – Profile I
Apprentice2 – Profile II
Apprentice2 – Load Balance I
Apprentice2 – Load Balance II
Apprentice2 – Load Balance III
Apprentice2 – Activity
Apprentice2 – Call Tree
Apprentice2 – Mosaic I
Apprentice2 – Mosaic II
Apprentice2 – Mosaic IV
Apprentice2 – Mosaic V
Apprentice2 – Mosaic VI
Apprentice2 – Hardware counters overview
Apprentice2 – Profile comparison (v6.3.2)
Apprentice2 – Profile comparison
Detailed instrumentation
❖  Do not follow these instructions during the hands-on session
❖  Disable the summary of the performance data and create one file per node
•  export PAT_RT_SUMMARY=0
•  export PAT_RT_EXPFILE_MAX=0
•  sbatch --reservation=s1001_85 submit.sh
❖  Expect more overhead; the trace file size can grow from a few MB to several GB
❖  Create the ap2 file
•  pat_report -o detailed_report_lu_C_64.txt lu.C.64+apa+PID-XXt
❖  Use Apprentice2
•  app2 lu.C.64+apa+PID-XXt.ap2
Detailed instrumentation – Example LU.C.16
Detailed instrumentation – Profile
Detailed instrumentation – Activity over time
Detailed instrumentation – Traffic Report
Detailed instrumentation – Traffic Report with links
Detailed instrumentation – Plots
Detailed instrumentation – Counters Plot
Reveal
A tool to port your application to OpenMP
Reveal
❖  Reveal is Cray’s next-generation integrated performance analysis and code optimization tool.
•  Source code navigation using whole program analysis (data provided by the Cray compilation environment only)
•  Coupling with performance data collected during execution by CrayPat, to understand which high-level serial loops could benefit from parallelism
•  Enhanced loop mark listing functionality
•  Dependency information for targeted loops
•  Assists users in optimizing code by providing variable scoping feedback and suggested compiler directives
Prepare for Reveal
❖  Load Perftools
•  module unload darshan
•  module load perftools-base/6.3.2
•  module load perftools/6.3.2
❖  Compile the code
•  cd performance_workshop/NPB3.3-MPI_reveal
•  make clean
•  In the config.make.def file
§  MPIF77 = ftn -h profile_generate -hpl=npb_lu.pl -h noomp -h noacc
§  FMPI_LIB = -h profile_generate -hpl=npb_lu.pl -h noomp -h noacc
•  make LU NPROCS=64 CLASS=C
§  “WARNING: PerfTools is saving object files from a temporary directory into directory…”
•  cd bin
❖  The new binary, lu.C.64, is not instrumented yet
Prepare and load Reveal
❖  Prepare the binary for tracing
•  pat_build -w lu.C.64
❖  Uncomment line 16 in the file submit.sh (the one with lu.C.64+pat)
❖  sbatch --reservation=s1001_85 submit.sh
❖  pat_report -o reveal.txt lu.C.64+pat+PID-XXt.xf
❖  reveal ../LU/npb_lu.pl ./lu.C.64+pat+PID-XXt.ap2
Reveal – Loop Performance
Reveal – Loop performance – Potential Speedup
Reveal – Scoping
Reveal – Scoping results on the Loops
Reveal – Scoping Results
Reveal – OpenMP Directives
Reveal – Compiler messages
Summary
❖  CrayPat seems easy to use
❖  The user should be careful though
❖  Studying the communication in detail with CrayPat is difficult
❖  The Reveal tool could be really helpful
❖  Other tools could probably be used for more detailed analysis
Extrae/Paraver
A profiling tool from Barcelona Supercomputing Center
Extrae/Paraver (briefly)
❖  Instrumentation tool from Barcelona Supercomputing Center
❖  The main details are defined in an XML file
❖  For dynamically linked binaries, a wrapper and LD_PRELOAD are enough
❖  For statically linked binaries, linking with Extrae is necessary
❖  Need to compile with at least the -g option, and with -finstrument-functions for function instrumentation with the Intel and GNU compilers
❖  The trace for LU.C.64 is around 5 GB
❖  Paraver is the tool to visualize and handle the traces from Extrae
Paraver – Useful duration I
Paraver – Useful duration II - zoom
Paraver – Visualize events
Paraver – User functions
Paraver – User Functions Profile
Paraver – Timeline selecting specific MPI processes
Paraver – Instantaneous parallelism profile
KAUST Supercomputing Laboratory
KAUST King Abdullah University of Science and Technology 81

More Related Content

PDF
[EWiLi2016] Towards a performance-aware power capping orchestrator for the Xe...
PDF
Xilinx timing closure
PDF
Hadoop Internals (2.3.0 or later)
PDF
Burst Buffer: From Alpha to Omega
DOC
Diario ana b
PDF
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
PDF
HPC Application Profiling and Analysis
PPTX
HPC Application Profiling & Analysis
[EWiLi2016] Towards a performance-aware power capping orchestrator for the Xe...
Xilinx timing closure
Hadoop Internals (2.3.0 or later)
Burst Buffer: From Alpha to Omega
Diario ana b
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
HPC Application Profiling and Analysis
HPC Application Profiling & Analysis

Similar to Introduction to Performance Analysis tools on Shaheen II (20)

PDF
Introduction to Java Profiling
PDF
Program Performance Analysis Toolkit Adaptor
PDF
GOoDA tutorial
PDF
Callgraph analysis
PDF
Performance tools developments
PDF
Performance Testing Java Applications
PDF
Java Performance & Profiling
PPT
Software Performance
PDF
Assignment 1-mtat
PDF
Building source code level profiler for C++.pdf
PPTX
Icse2013 malik
PDF
May2010 hex-core-opt
PPTX
Embedded Systems -Program-Level-Performance-Analysis.pptx
PPT
Unit 3 part2
PPT
Unit 3 part2
PPTX
JEEConf 2016. Effectiveness and code optimization in Java applications
PDF
Performance Evaluation of Open Source Data Mining Tools
PPTX
Using the big guns: Advanced OS performance tools for troubleshooting databas...
PDF
High Performance Engineering - 01-intro.pdf
PDF
TAU for Accelerating AI Applications at OpenPOWER Summit Europe
Introduction to Java Profiling
Program Performance Analysis Toolkit Adaptor
GOoDA tutorial
Callgraph analysis
Performance tools developments
Performance Testing Java Applications
Java Performance & Profiling
Software Performance
Assignment 1-mtat
Building source code level profiler for C++.pdf
Icse2013 malik
May2010 hex-core-opt
Embedded Systems -Program-Level-Performance-Analysis.pptx
Unit 3 part2
Unit 3 part2
JEEConf 2016. Effectiveness and code optimization in Java applications
Performance Evaluation of Open Source Data Mining Tools
Using the big guns: Advanced OS performance tools for troubleshooting databas...
High Performance Engineering - 01-intro.pdf
TAU for Accelerating AI Applications at OpenPOWER Summit Europe
Ad

More from George Markomanolis (16)

PDF
Evaluating GPU programming Models for the LUMI Supercomputer
PDF
Utilizing AMD GPUs: Tuning, programming models, and roadmap
PDF
Exploring the Programming Models for the LUMI Supercomputer
PDF
Getting started with AMD GPUs
PDF
Analyzing ECP Proxy Apps with the Profiling Tool Score-P
PDF
Introduction to Extrae/Paraver, part I
PDF
Performance Analysis with Scalasca, part II
PDF
Performance Analysis with Scalasca on Summit Supercomputer part I
PDF
Performance Analysis with TAU on Summit Supercomputer, part II
PDF
How to use TAU for Performance Analysis on Summit Supercomputer
PDF
Introducing IO-500 benchmark
PDF
Experience using the IO-500
PDF
Harshad - Handle Darshan Data
PDF
Lustre Best Practices
PDF
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
PDF
markomanolis_phd_defense
Evaluating GPU programming Models for the LUMI Supercomputer
Utilizing AMD GPUs: Tuning, programming models, and roadmap
Exploring the Programming Models for the LUMI Supercomputer
Getting started with AMD GPUs
Analyzing ECP Proxy Apps with the Profiling Tool Score-P
Introduction to Extrae/Paraver, part I
Performance Analysis with Scalasca, part II
Performance Analysis with Scalasca on Summit Supercomputer part I
Performance Analysis with TAU on Summit Supercomputer, part II
How to use TAU for Performance Analysis on Summit Supercomputer
Introducing IO-500 benchmark
Experience using the IO-500
Harshad - Handle Darshan Data
Lustre Best Practices
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
markomanolis_phd_defense
Ad


Introduction to Performance Analysis tools on Shaheen II

  • 2. Outline
      ❖ Introduction
      ❖ Test cases
      ❖ Cray tools
          • Perftools
          • Cray Apprentice 2
          • Reveal
      ❖ Extrae/Paraver (briefly)
  • 3. Introduction
      ❖ Why performance analysis?
          • Investigate the bottlenecks of an application
          • Identify potential improvements
          • Better usage of the hardware
      ❖ Profiling
          • Sampling
              § Lightweight
              § Overhead depends on the sampling frequency
              § Can lack resolution if there are small function calls
          • Event tracing
              § Detailed information
              § Captures every event
              § Can capture communication events
              § Drawbacks: overhead and large amounts of data
  • 4. Sampling
      • Statistical inference of program behavior
      • Not very detailed information
      • Mainly for long-running applications
  • 5. Tracing
      • Every event is captured
      • Detailed information
      • Overhead (depends on many factors)
  • 6. Case study
      ❖ The NAS Parallel Benchmarks (NPB) consist of five kernels and three pseudo-applications, developed by the NASA Advanced Supercomputing Division
      ❖ Why NPB/LU?
          • LU stands for Lower-Upper Gauss-Seidel solver
          • Simple application for testing purposes which combines computation and communication
          • Compiles with the Cray, Intel, and GNU compilers, and does so quickly
  • 7. CrayPat overview
      ❖ Assists the user with application performance analysis and optimization
          • Provides concrete suggestions instead of just reporting
      ❖ Basic functionalities apply to all the compilers on the system
      ❖ Requires no source code or Makefile modification (in most cases)
  • 8. Components of CrayPat
      ❖ Module perftools-base
          • pat_build – Instruments the program to be analyzed
          • pat_report – Generates text reports from the performance data captured during program execution and exports data for use in other programs
          • Cray Apprentice2 – A graphical analysis tool that can be used to visualize and explore the performance data captured during program execution
          • Reveal – A graphical source code analysis tool that can be used to correlate performance analysis data with annotated source code listings, to identify key opportunities for optimization (it works only with the Cray compiler)
          • grid_order – Generates MPI rank order information that can be used with the MPICH_RANK_REORDER
          • pat_help – Help system which provides extensive usage information
  • 9. Files generated during regular profiling
      ❖ a.out+pat+PID-node[s|t].xf: raw data files
          • Depending on the profiling approach and conditions, the execution of an instrumented application can create one or more .xf files, where:
              § a.out is the name of the original program
              § PID is the process ID assigned to the instrumented program at runtime
              § node is the physical node ID upon which the rank zero process executed
              § s|t is a letter code indicating the type of experiment performed, either s for sampling or t for tracing
          • The pat_report tool dumps the .xf file or exports it to another file format for use with other applications, i.e., *.ap2 files
      ❖ *.ap2 files: self-contained compressed performance files
          • Normally about 5 times smaller than the corresponding *.xf files
          • Only one *.ap2 per experiment, in comparison to potentially multiple *.xf files
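The naming scheme on slide 9 can be checked mechanically. The following is a minimal sketch, not part of CrayPat: the function name `parse_craypat_name` is hypothetical, and accepting the `+apa` tag and the `.ap2` extension is an assumption based on file names that appear on later slides.

```python
import re

# Pattern follows the scheme described on the slide:
#   <program>+pat+<PID>-<node><s|t>.xf
# Accepting "+apa" and ".ap2" is an assumption taken from later slides.
CRAYPAT_FILE = re.compile(
    r"^(?P<program>.+?)\+(?P<tag>pat|apa)\+(?P<pid>\d+)-(?P<node>\d+)"
    r"(?P<kind>[st])\.(?P<ext>xf|ap2)$"
)


def parse_craypat_name(filename):
    """Return the parts of a CrayPat data file name, or None if it does not match."""
    match = CRAYPAT_FILE.match(filename)
    if match is None:
        return None
    parts = match.groupdict()
    parts["pid"] = int(parts["pid"])
    parts["node"] = int(parts["node"])
    # "s" marks a sampling experiment, "t" a tracing experiment.
    parts["experiment"] = "sampling" if parts["kind"] == "s" else "tracing"
    return parts


if __name__ == "__main__":
    print(parse_craypat_name("lu.C.64+pat+10974-35s.xf"))
```

A helper like this could be used in a job script wrapper to pick the newest sampling file (`s`) or tracing file (`t`) before calling pat_report.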
  • 10. Prepare for the tutorial
      • Connect to Shaheen II and copy the material:
          • ssh -X username@shaheen.kaust.edu.sa
          • cp /scratch/tmp/performance_workshop.tgz .
          • tar zxvf performance_workshop.tgz
          • cd performance_workshop/NPB3.3-MPI
      • The slides are located in the folder performance_workshop/
  • 11. How to use CrayPat
      ❖ Load Perftools
          • module unload darshan
          • module load perftools-base/6.3.2
          • module load perftools/6.3.2
      ❖ Compile the code
          • make clean
          • make LU NPROCS=64 CLASS=C
              § “WARNING: PerfTools is saving object files from a temporary directory into directory…”
          • cd bin
      ❖ The new binary is called lu.C.64 and is not instrumented yet
  • 12. Sampling instrumentation I
      ❖ Execute the application
          • sbatch --reservation=s1001_85 submit.sh
          • Check the output files (lu_C_64_out_...txt)
      ❖ Build the instrumented binary with sampling instrumentation
          • pat_build -S lu.C.64
      ❖ The instrumented binary is called lu.C.64+pat
      ❖ Some results in this presentation were acquired with 128 MPI processes.
  • 13. Sampling instrumentation II
      ❖ Edit the submit.sh file, comment out line 13 and uncomment line 16
          • sbatch --reservation=s1001_85 submit.sh
          • The reservation of the nodes for this workshop is called s1001_85; you need to use it every time you submit a job during this presentation.
      ❖ The performance data are located in a file named in the format lu.C.64+pat+PID-XXXs.xf (PID and XXX are numbers)
  • 14. Create your first report with sampling instrumentation
      ❖ Execute the pat_report tool
          • pat_report -o sampling_report_lu_C_64.txt lu.C.64+pat+PID-XXXs.xf
      ❖ Open the file sampling_report_lu_C_64.txt
      ❖ Report header:
          CrayPat/X: Version 6.3.2 Revision rc1/6.3.2 02/25/16 18:26:21
          Number of PEs (MPI ranks): 64
          Numbers of PEs per Node: 32 PEs on each of 2 Nodes
          Numbers of Threads per PE: 1
          Number of Cores per Socket: 16
          Execution start time: Wed Apr 13 16:57:06 2016
          System name and speed: nid00035 2301 MHz (approx)
          Current path to data file: /scratch/markomg/NPB3.3.1/NPB3.3-MPI/bin/lu.C.64+pat+10974-35s.ap2 (RTS)
  • 15. Create your first report with sampling instrumentation
      ❖ Table 1: Profile by Function

           Samp% |    Samp |  Imb. |  Imb. |Group
                 |         |  Samp | Samp% | Function
                 |         |       |       |  PE=HIDE

          100.0% | 1,039.1 |    -- |    -- |Total
         |--------------------------------------------------
         |  72.3% |   751.5 |    -- |    -- |USER
         ||-------------------------------------------------
         ||  33.5% |   347.9 |  45.1 | 11.6% |rhs_
         ||   8.0% |    83.4 |  24.6 | 23.0% |blts_
         ||   7.9% |    82.1 |  18.9 | 18.9% |buts_
         ||   7.8% |    81.2 |  23.8 | 22.9% |jacld_
         ||   7.4% |    77.4 |  24.6 | 24.3% |jacu_
         ||   4.6% |    47.8 |  26.2 | 35.7% |exchange_3_
         ||   2.2% |    23.2 |  15.8 | 40.8% |ssor_
         ||=================================================
  • 16. Create your first report with sampling instrumentation (MPI with sampling is not helpful)
      ❖ Table 1: Profile by Function

           Samp% |    Samp |  Imb. |  Imb. |Group
                 |         |  Samp | Samp% | Function
                 |         |       |       |  PE=HIDE

         |  18.6% |   192.9 |    -- |    -- |MPI
         ||-------------------------------------------------------------------
         ||   6.3% |    65.1 | 154.9 | 70.9% |MPIDI_Cray_shared_mem_coll_bcast
         ||   4.0% |    42.0 |  59.0 | 58.9% |MPIDI_CH3I_Progress
         ||   2.2% |    22.5 |  99.5 | 82.2% |MPIDI_Cray_shared_mem_coll_barrier
         ||   1.8% |    19.0 |  51.0 | 73.5% |MPID_nem_gni_poll
         ||   1.5% |    15.2 |  39.8 | 72.9% |MPID_nem_gni_check_localCQ
         ||===================================================================
         |   4.5% |    47.0 |    -- |    -- |GNI
         ||-------------------------------------------------------------------
         ||   4.3% |    44.8 | 118.2 | 73.1% |GNI_CqGetEvent
         ||===================================================================
         |   4.4% |    45.8 |    -- |    -- |ETC
         ||-------------------------------------------------------------------
         ||   2.1% |    21.9 |  65.1 | 75.4% |GNII_DlaProgress
         ||   1.0% |    10.4 |   8.6 | 45.8% |_cray_mpi_memcpy_snb
         |====================================================================
  • 17. Profile by Group, Function, and Line
      ❖ Table 2: Profile by Group, Function, and Line

           Samp% |    Samp |  Imb. |  Imb. |Group
                 |         |  Samp | Samp% | Function
                 |         |       |       |  Source
                 |         |       |       |   Line
                 |         |       |       |    PE=HIDE

          100.0% | 1,039.1 |    -- |    -- |Total
         |---------------------------------------------------------------------
         |  72.3% |   751.5 |    -- |    -- |USER
         ||--------------------------------------------------------------------
         ||  33.5% |   347.9 |    -- |    -- |rhs_
         3|        |         |       |       | NPB3.3.1/NPB3.3-MPI/LU/rhs.f
         ||||------------------------------------------------------------------
         4|||   2.2% |    22.5 |  13.5 | 37.8% |line.43
         4|||   1.8% |    18.2 |   9.8 | 35.2% |line.96
         4|||   1.6% |    16.4 |  10.6 | 39.6% |line.228
         …

      ❖ File rhs.f, line 43

          do k = 1, nz
             do j = 1, ny
                do i = 1, nx
                   do m = 1, 5
                      rsd(m,i,j,k) = - frct(m,i,j,k)
                   end do
                end do
             end do
          end do
  • 18. More information from sampling
      ❖ Table 3: Wall Clock Time, Memory High Water Mark (limited entries shown)

            Process |  Process |PE=[mmm]
               Time |    HiMem |
                    | (MBytes) |

          20.455187 |    39.18 |Total
         |------------------------------
         | 23.922620 |    39.74 |pe.34
         | 19.638636 |    39.57 |pe.107
         | 16.558081 |    39.66 |pe.68
         |==============================

      ❖ Additional details
          Experiment: samp_pc_time
          Sampling interval: 10000 microsecs
  • 19. Automatic Profiling Analysis (APA)
      ❖ The previous execution of pat_report created two new files, with the extensions apa and ap2. The ap2 file is presented later.
      ❖ Open the file sampling_report_lu_C_64.apa

          # Collect the default PERFCTR group.
          -Drtenv=PAT_RT_PERFCTR=default

          # Alternatively, energy counters may be added to the default
          # list by commenting out the line above and enabling the
          # line below. Note that this may significantly increase the
          # runtime overhead for high trace counts. The parentheses
          # in the syntax below denote counters that are not available
          # on all platforms.
          # -Drtenv=PAT_RT_PERFCTR=default,(PM_ENERGY:NODE),(PM_ENERGY:ACC)

          # Libraries to trace.
          -g mpi
  • 20. Automatic Profiling Analysis (APA) II

          # Local functions are listed for completeness, but cannot be traced.
          -w # Enable tracing of user-defined functions.

          # 33.49% 32799 bytes
          -T rhs_
          # 8.02% 3379 bytes
          -T blts_
          # 7.90% 3863 bytes
          -T buts_
          # 7.81% 14983 bytes
          -T jacld_
          …
          -o lu.C.128+apa # New instrumented program.
  • 21. Automatic Profiling Analysis (APA) III
      ❖ To build the new binary from the APA file, execute the following:
          • pat_build -O sampling_report_lu_C_64.apa
              WARNING: Tracing small, frequently called functions can add excessive overhead.
              WARNING: To set a minimum size, say 1200 bytes, for traced functions, use: -D trace-text-size=1200.
              INFO: A total of 7 selected non-group functions were traced.
              INFO: A maximum of 105 functions from group 'mpi' will be traced.
      ❖ The new instrumented binary is called lu.C.64+apa
      ❖ Edit the submit.sh file, comment out line 16 and uncomment line 19
          • sbatch --reservation=s1001_85 submit.sh
      ❖ The new performance file is called lu.C.64+apa+PID-XXXt.xf
      ❖ Use the pat_report tool
          • pat_report -o report_apa_lu_C_64.txt lu.C.64+apa+PID-XXXt.xf
      ❖ Open the file report_apa_lu_C_64.txt
  • 22. Performance report I
      Table 1: Profile by Function Group and Function

           Time% |      Time |     Imb. |  Imb. |     Calls |Group
                 |           |     Time | Time% |           | Function
                 |           |          |       |           |  PE=HIDE

          100.0% | 12.081612 |       -- |    -- | 455,387.7 |Total
         |--------------------------------------------------------------------
         |  73.8% |  8.922097 |       -- |    -- | 161,404.0 |USER
         ||-------------------------------------------------------------------
         ||  28.6% |  3.450003 | 0.416838 | 10.9% |     253.0 |rhs_
         ||  10.4% |  1.260820 | 0.153597 | 10.9% |  40,160.0 |buts_
         ||  10.4% |  1.259256 | 0.144344 | 10.4% |  40,160.0 |blts_
         ||   7.5% |  0.909228 | 0.122412 | 12.0% |  40,160.0 |jacld_
         ||   7.1% |  0.861425 | 0.130527 | 13.3% |  40,160.0 |jacu_
         ||   5.7% |  0.684862 | 0.139784 | 17.1% |       2.0 |ssor_
         ||   3.7% |  0.451014 | 0.295409 | 39.9% |     508.0 |exchange_3_
         ||===================================================================
  • 23. Performance report II
      Table 1: Profile by Function Group and Function

           Time% |      Time |     Imb. |   Imb. |     Calls |Group
                 |           |     Time |  Time% |           | Function
                 |           |          |        |           |  PE=HIDE

         |  17.8% |  2.148878 |       -- |     -- | 293,958.7 |MPI
         ||--------------------------------------------------------------------
         ||  11.9% |  1.432456 | 3.029769 |  68.4% | 145,580.0 |MPI_RECV
         ||   3.8% |  0.465076 | 0.411500 |  47.3% | 146,502.9 |MPI_SEND
         ||   2.0% |  0.241474 | 1.003594 |  81.2% |     922.9 |mpi_wait
         ||====================================================================
         |   8.4% |  1.010618 |       -- |     -- |      24.0 |MPI_SYNC
         ||--------------------------------------------------------------------
         ||   8.2% |  0.991427 | 0.991319 | 100.0% |       1.0 |mpi_init_(sync)
         |=====================================================================

      ❖ If needed, disable MPI Sync with
          • export PAT_RT_MPI_SYNC=0
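The "Imb. Time%" column in the two performance report tables can be related to the "Time" and "Imb. Time" columns. The sketch below assumes the definition 100 × (max − avg) / max × N / (N − 1), with max = avg + imbalance time; this is an assumption, not something stated on the slides, although it reproduces the USER and MPI rows when N = 128 PEs (the run used for these tables). The MPI_SYNC row does not follow it and is left out. The function name `imbalance_percent` is hypothetical.

```python
def imbalance_percent(avg_time, imb_time, num_pes):
    """Imbalance percentage, assuming max = avg + imb and the N/(N-1) scaling."""
    max_time = avg_time + imb_time
    return 100.0 * imb_time / max_time * num_pes / (num_pes - 1)


# (average time, imbalance time) pairs copied from the report tables.
rows = {
    "rhs_": (3.450003, 0.416838),
    "exchange_3_": (0.451014, 0.295409),
    "MPI_RECV": (1.432456, 3.029769),
    "mpi_wait": (0.241474, 1.003594),
}

if __name__ == "__main__":
    for name, (avg, imb) in rows.items():
        print(f"{name:12s} {imbalance_percent(avg, imb, 128):5.1f}%")
```

Under this reading, a value close to 100% means that almost all of the time in that function is spent on a single rank, which is why MPI_RECV and mpi_wait stand out compared with the compute routines.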
  • 24. MPI topology
      ❖ MPI Grid Detection:
          There appears to be point-to-point MPI communication in a 8 X 16 grid pattern. The 17.8% of the total execution time spent in MPI functions might be reduced with a rank order that maximizes communication between ranks on the same node. The effect of several rank orders is estimated below.
          A file named MPICH_RANK_ORDER.Grid was generated along with this report and contains usage instructions and the Hilbert rank order from the following table.

          Rank Order    On-Node     On-Node Bytes/PE%    MPICH_RANK_REORDER_METHOD
                        Bytes/PE    of Total Bytes/PE

          Hilbert       3.039e+10   87.40%               3
          SMP           2.947e+10   84.75%               1
          Fold          1.685e+10   48.46%               2
          RoundRobin    1.106e+10   31.82%               0

      ❖ Example for 128 MPI processes
          0,1,17,16,32,48...
          68,84,85,69,70,71…
      ❖ How to use the new MPI topology file:
          1. cp MPICH_RANK_ORDER.XXX MPICH_RANK_ORDER
          2. export MPICH_RANK_REORDER_METHOD=3
  • 25. Hardware counters
      D1 cache utilization:
          All instrumented functions with significant execution time had D1 cache hit ratios above the desirable minimum of 75.0%.
      D1 + D2 cache utilization:
          All instrumented functions with significant execution time had combined D1 and D2 cache hit ratios above the desirable minimum of 80.0%.
      TLB utilization:
          All instrumented functions with significant execution time had more than the desirable minimum of 200 data references per TLB miss.
      ❖ To find out more about hardware performance counters, execute:
          • pat_help
          • counters haswell groups
  • 26. Hardware counters
      Total
      ------------------------------------------------------------------------------
        Time%                                        100.0%
        Time                                      12.081612 secs
        Imb. Time                                        -- secs
        Imb. Time%                                       --
        Calls                          0.038M/sec 455,387.7 calls
        CPU_CLK_THREAD_UNHALTED:THREAD_P             47,351,574,846
        CPU_CLK_THREAD_UNHALTED:REF_XCLK              2,124,810,371
        DTLB_LOAD_MISSES:MISS_CAUSES_A_WALK               6,686,929
        DTLB_STORE_MISSES:MISS_CAUSES_A_WALK              2,823,391
        L1D:REPLACEMENT                               1,404,754,113
        L2_RQSTS:ALL_DEMAND_DATA_RD                     515,418,048
        L2_RQSTS:DEMAND_DATA_RD_HIT                     197,719,491
        MEM_UOPS_RETIRED:ALL_LOADS                   20,512,449,601
        CPU_CLK                                2.23GHz
        TLB utilization                        2,156.86 refs/miss    4.21 avg uses
        D1 cache hit,miss ratios                  93.2% hits         6.8% misses
        D1 cache utilization (misses)             14.60 refs/miss    1.83 avg hits
        D2 cache hit,miss ratio                   77.4% hits        22.6% misses
        D1+D2 cache hit,miss ratio                98.5% hits         1.5% misses
        D1+D2 cache utilization                   64.57 refs/miss    8.07 avg hits
        D2 to D1 bandwidth             2,603.843MiB/sec 32,986,755,044 bytes
        Average Time per Call                  0.000027 secs
        CrayPat Overhead : Time                    8.0%
  • 27. Hardware Counters - Description
      Hardware performance counter events:
          CPU_CLK_THREAD_UNHALTED:REF_XCLK
              Count core clock cycles whenever the clock signal on the specific core is running (not halted): Cases when the core is unhalted at 100 MHz
          CPU_CLK_THREAD_UNHALTED:THREAD_P
              Count core clock cycles whenever the clock signal on the specific core is running (not halted): Cycles when thread is not halted
          DTLB_LOAD_MISSES:MISS_CAUSES_A_WALK
              Data TLB load misses: Misses in all DTLB levels that cause page walks
          DTLB_STORE_MISSES:MISS_CAUSES_A_WALK
              Data TLB store misses: Misses in all DTLB levels that cause page walks
          L1D:REPLACEMENT
              L1D cache: L1D Data line replacements
          L2_RQSTS:ALL_DEMAND_DATA_RD
              L2 requests: Any data read request to L2 cache
          L2_RQSTS:DEMAND_DATA_RD_HIT
              L2 requests: Demand Data Read requests that hit L2 cache
          MEM_UOPS_RETIRED:ALL_LOADS
              Memory uops retired (Precise Event): All load uops retired
          PM_ENERGY:NODE
              Compute node accumulated energy
          CYCLES_RTC
              User Cycles (approx, from rtc)
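The derived metrics on slide 26 can be recomputed from the raw counters described on slide 27. The sketch below is an assumption about how the ratios relate to the counters: the formulas are inferred from the numbers, not taken from the CrayPat documentation, and are kept because they reproduce the reported values to the printed precision. The function and key names are hypothetical.

```python
# Raw counter totals copied from the "Total" block of the report.
counters = {
    "CPU_CLK_THREAD_UNHALTED:THREAD_P": 47_351_574_846,
    "CPU_CLK_THREAD_UNHALTED:REF_XCLK": 2_124_810_371,
    "DTLB_LOAD_MISSES:MISS_CAUSES_A_WALK": 6_686_929,
    "DTLB_STORE_MISSES:MISS_CAUSES_A_WALK": 2_823_391,
    "L1D:REPLACEMENT": 1_404_754_113,
    "L2_RQSTS:ALL_DEMAND_DATA_RD": 515_418_048,
    "L2_RQSTS:DEMAND_DATA_RD_HIT": 197_719_491,
    "MEM_UOPS_RETIRED:ALL_LOADS": 20_512_449_601,
}


def derived_metrics(c):
    """Recompute the cache, TLB and clock metrics (inferred formulas)."""
    loads = c["MEM_UOPS_RETIRED:ALL_LOADS"]
    l1_misses = c["L1D:REPLACEMENT"]  # L1 line replacements, used as D1 misses
    l2_misses = c["L2_RQSTS:ALL_DEMAND_DATA_RD"] - c["L2_RQSTS:DEMAND_DATA_RD_HIT"]
    tlb_misses = (c["DTLB_LOAD_MISSES:MISS_CAUSES_A_WALK"]
                  + c["DTLB_STORE_MISSES:MISS_CAUSES_A_WALK"])
    cycle_ratio = (c["CPU_CLK_THREAD_UNHALTED:THREAD_P"]
                   / c["CPU_CLK_THREAD_UNHALTED:REF_XCLK"])
    return {
        "d1_hit_pct": 100.0 * (1.0 - l1_misses / loads),
        "d1_refs_per_miss": loads / l1_misses,
        "d2_hit_pct": 100.0 * (1.0 - l2_misses / l1_misses),
        "d1_d2_hit_pct": 100.0 * (1.0 - l2_misses / loads),
        "d1_d2_refs_per_miss": loads / l2_misses,
        "tlb_refs_per_miss": loads / tlb_misses,
        # REF_XCLK ticks at 100 MHz (0.1 GHz) according to the event description.
        "cpu_clk_ghz": cycle_ratio * 0.1,
    }


if __name__ == "__main__":
    for name, value in derived_metrics(counters).items():
        print(f"{name:22s} {value:10.2f}")
```

Recomputing the ratios this way makes it easier to compare runs whose reports were generated with different counter groups, as long as the same raw events are available.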
  • 28. Load Balance with MPI Message stats
      Table 3: Load Balance with MPI Message Stats (limited entries shown)

           Time% |      Time |   MPI Msg | MPI Msg Bytes |  Avg MPI |Group
                 |           |     Count |               | Msg Size | PE=[mmm]

          100.0% | 12.081612 | 146,522.9 | 271,667,585.0 | 1,854.10 |Total
         |---------------------------------------------------------------------
         |  73.8% |  8.922097 |       0.0 |           0.0 |       -- |USER
         ||--------------------------------------------------------------------
         ||  80.6% |  9.739499 |       0.0 |           0.0 |       -- |pe.26
         ||  75.8% |  9.160217 |       0.0 |           0.0 |       -- |pe.61
         ||  45.1% |  5.442844 |       0.0 |           0.0 |       -- |pe.127
         ||====================================================================
         |  17.8% |  2.148878 | 146,522.9 | 271,667,585.0 | 1,854.10 |MPI
         ||--------------------------------------------------------------------
         ||  48.8% |  5.891394 |  80,852.0 | 143,737,828.0 | 1,777.79 |pe.127
         ||  15.5% |  1.874838 | 161,678.0 | 293,895,236.0 | 1,817.78 |pe.43
         ||  10.5% |  1.263484 | 161,678.0 | 303,691,732.0 | 1,878.37 |pe.26
         ||====================================================================
         |   8.4% |  1.010618 |       0.0 |           0.0 |       -- |MPI_SYNC
         ||--------------------------------------------------------------------
         ||  22.0% |  2.653814 |       0.0 |           0.0 |       -- |pe.103
         ||   7.4% |  0.895974 |       0.0 |           0.0 |       -- |pe.123
         ||   0.1% |  0.012597 |       0.0 |           0.0 |       -- |pe.0
  • 29. Load Balance with MPI message stats by caller
      Table 4: MPI Message Stats by Caller (limited entries shown)

             MPI | MPI Msg Bytes |   MPI Msg | MsgSz |  16<= |     256<= | 64KiB<= |Function
             Msg |               |     Count |   <16 | MsgSz |     MsgSz |   MsgSz | Caller
          Bytes% |               |           | Count |  <256 |     <4KiB |   <1MiB |  PE=[mmm]
                 |               |           |       | Count |     Count |   Count |

          100.0% | 271,667,585.0 | 146,522.9 |  14.0 |   6.9 | 145,581.3 |   920.8 |Total
         |-----------------------------------------------------------------------------
         | 100.0% | 271,667,261.0 | 146,502.9 |   0.0 |   0.9 | 145,581.3 |   920.8 |MPI_SEND
         ||----------------------------------------------------------------------------
         ||  67.5% | 183,314,340.0 |     920.8 |   0.0 |   0.0 |       0.0 |   920.8 |exchange_3_
         3|  67.2% | 182,592,630.0 |     917.1 |   0.0 |   0.0 |       0.0 |   917.1 | rhs_
         4|        |               |           |       |       |           |         |  ssor_
         5|        |               |           |       |       |           |         |   applu_
         ||||||------------------------------------------------------------------------
         6|||||  77.2% | 209,848,320.0 |   1,012.0 |   0.0 |   0.0 |       0.0 | 1,012.0 |pe.17
         6|||||  72.4% | 196,732,800.0 |   1,012.0 |   0.0 |   0.0 |       0.0 | 1,012.0 |pe.88
         6|||||  36.2% |  98,366,400.0 |     506.0 |   0.0 |   0.0 |       0.0 |   506.0 |pe.127

      ❖ To adjust the size of the MPI eager mode (default 8KB, max value 128KB) according to the MPI message stats, use the following commands in your job script:
          • export MPICH_GNI_MAX_EAGER_MSG_SIZE=131072
          • export MPICH_ENV_DISPLAY=1
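The message statistics can be turned into average message sizes and compared with the eager limits quoted on slide 29. This is a simplified sketch: the helper names are hypothetical, the byte counts are copied from Tables 3 and 4, and treating the eager limit as a plain size threshold is an assumption made for illustration (how the MPI library actually selects a protocol is more involved).

```python
DEFAULT_EAGER_LIMIT = 8 * 1024    # "default 8KB" from the slide
MAX_EAGER_LIMIT = 128 * 1024      # "max value 128KB", i.e. 131072 bytes


def average_message_size(total_bytes, message_count):
    """Average bytes per message, as in the 'Avg MPI Msg Size' column."""
    return total_bytes / message_count


def fits_eager_limit(message_bytes, eager_limit_bytes):
    """True if a message of this size is at or below the given eager limit."""
    return message_bytes <= eager_limit_bytes


if __name__ == "__main__":
    overall = average_message_size(271_667_585.0, 146_522.9)
    exchange = average_message_size(183_314_340.0, 920.8)
    print(f"overall average     : {overall:12.2f} bytes")
    print(f"exchange_3_ average : {exchange:12.2f} bytes")
    print("overall fits default limit :", fits_eager_limit(overall, DEFAULT_EAGER_LIMIT))
    print("exchange_3_ fits max limit :", fits_eager_limit(exchange, MAX_EAGER_LIMIT))
```

With these numbers the overall average is well below the default limit, while the messages sent from exchange_3_ average more than 128KB, so the per-caller table is the more informative one when choosing a value for MPICH_GNI_MAX_EAGER_MSG_SIZE.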
  • 30. Wall clock and memory high water mark
      Table 5: Wall Clock Time, Memory High Water Mark (limited entries shown)

            Process |  Process |PE=[mmm]
               Time |    HiMem |
                    | (MBytes) |

          20.166938 |    48.25 |Total
         |------------------------------
         | 23.813279 |    48.70 |pe.98
         | 20.039177 |    49.79 |pe.82
         | 17.694283 |    49.70 |pe.0
         |==============================

      ❖ To extract the profiling information for all the processes instead of aggregated data, the pat_report tool can be used as follows:
          • pat_report -s pe=ALL -o sampling_results_all.txt lu.C.64+apa+PID-XXXt.xf
          • pat_report -s filter_input='pe<=5' ...
          • pat_report -s filter_input='pe%2==0' ...
  • 31. Apprentice2: A GUI for the raw data
  • 32. How to start with Apprentice2
      ❖ The pat_report tool has created one file with the extension ap2
          • ls -ltr *.ap2
      ❖ To visualize the performance data
          • Connect to Shaheen II with “ssh -X …”
          • module load perftools-base/6.3.2
          • app2 lu.C.64+apa+PID-XXt.ap2
      ❖ The example in the presentation is for lu.C.128
  • 33. Apprentice2 – Generic view
  • 34. Apprentice2 – Generic view
  • 35. Apprentice2 – Generic view
  • 36. Apprentice2 – Generic view
  • 37. Apprentice2 – Profile I
  • 38. Apprentice2 – Profile II
  • 39. Apprentice2 – Load Balance I
  • 40. Apprentice2 – Load Balance II
  • 41. Apprentice2 – Load Balance III
  • 42. Apprentice2 – Activity
  • 43. Apprentice2 – Call Tree
  • 44. Apprentice2 – Mosaic I
  • 45. Apprentice2 – Mosaic II
  • 46. Apprentice2 – Mosaic IV
  • 47. Apprentice2 – Mosaic V
  • 48. Apprentice2 – Mosaic VI
  • 49. Apprentice2 – Hardware counters overview
  • 50. Apprentice2 – Profile comparison (v6.3.2)
  • 51. Apprentice2 – Profile comparison
  • 52. Detailed instrumentation
      ❖ Do not follow these instructions during the hands-on session
      ❖ Disable the summary of the performance data and create one file per node
          • export PAT_RT_SUMMARY=0
          • export PAT_RT_EXPFILE_MAX=0
          • sbatch --reservation=s1001_85 submit.sh
      ❖ Expect more overhead; the trace file size can increase from a few MB to GB
      ❖ Create the ap2 file
          • pat_report -o detailed_report_lu_C_64.txt lu.C.64+apa+PID-XXt
      ❖ Use Apprentice2
          • app2 lu.C.64+apa+PID-XXt.ap2
  • 53. Detailed instrumentation – Example LU.C.16
  • 54. Detailed instrumentation – Profile
  • 55. Detailed instrumentation – Activity over time
  • 56. Detailed instrumentation – Traffic Report
  • 58. Detailed instrumentation – Plots
  • 59. Detailed instrumentation – Counters Plot
  • 60. Reveal: A tool to port your application to OpenMP
  • 61. Reveal
      ❖ Reveal is Cray’s next-generation integrated performance analysis and code optimization tool.
          • Source code navigation using whole program analysis (data provided by the Cray compilation environment only)
          • Coupling with performance data collected during execution by CrayPat, to understand which high-level serial loops could benefit from parallelism
          • Enhanced loop mark listing functionality
          • Dependency information for targeted loops
          • Assists users in optimizing code by providing variable scoping feedback and suggested compiler directives
  • 62. Prepare for Reveal
      ❖ Load Perftools
          • module unload darshan
          • module load perftools-base/6.3.2
          • module load perftools/6.3.2
      ❖ Compile the code
          • cd performance_workshop/NPB3.3-MPI_reveal
          • make clean
          • In the config.make.def file
              § MPIF77 = ftn -h profile_generate -hpl=npb_lu.pl -h noomp -h noacc
              § FMPI_LIB = -h profile_generate -hpl=npb_lu.pl -h noomp -h noacc
          • make LU NPROCS=64 CLASS=C
              § “WARNING: PerfTools is saving object files from a temporary directory into directory…”
          • cd bin
      ❖ The new binary is called lu.C.64 and is not instrumented yet
  • 63. Prepare and load Reveal
      ❖ Prepare the binary for tracing
          • pat_build -w lu.C.64
      ❖ Uncomment line 16 in the file submit.sh (the one with lu.C.64+pat)
      ❖ sbatch --reservation=s1001_85 submit.sh
      ❖ pat_report -o reveal.txt lu.C.64+pat+PID-XXt.xf
      ❖ reveal ../LU/npb_lu.pl ./lu.C.64+pat+PID-XXt.ap2
  • 64. Reveal – Loop Performance
  • 66. Reveal – Scoping
  • 67. Reveal – Scoping results on the Loops
  • 68. Reveal – Scoping Results
  • 69. Reveal – OpenMP Directives
  • 70. Reveal – Compiler messages
  • 71. Summary
      ❖ CrayPat seems easy to use
      ❖ The user should be careful though
      ❖ Studying the communication in detail with CrayPat is difficult
      ❖ The Reveal tool could be really helpful
      ❖ Other tool(s) could probably be used for more detailed analysis
  • 72. Extrae/Paraver: A profiling tool from Barcelona Supercomputing Center
  • 73. Extrae/Paraver (briefly)
      ❖ Instrumentation tool from Barcelona Supercomputing Center
      ❖ The main details are defined in an XML file
      ❖ For dynamic compilation, a wrapper and LD_PRELOAD are enough
      ❖ For static compilation, linking is necessary
      ❖ Need to compile with at least the -g option, plus -finstrument-functions for function instrumentation with the Intel and GNU compilers
      ❖ The trace for LU.C.64 is around 5 GB
      ❖ Paraver is the tool to visualize and handle the traces from Extrae
  • 74. Paraver – Useful duration I
  • 75. Paraver – Useful duration II - zoom
  • 76. Paraver – Visualize events
  • 77. Paraver – User functions
  • 78. Paraver – User Functions Profile
  • 79. Paraver – Timeline selecting specific MPI processes
  • 80. Paraver – Instantaneous parallelism profile
  • 81. KAUST Supercomputing Laboratory