Parallel Application Performance
Prediction Using Analysis Based
Modeling
Mohammad Abu Obaida, Jason Liu,
Gopinath Chennupati, Nandakishore Santhi and Stephan Eidenbenz
SIGSIM-PADS’18, Rome, Italy, May 23, 2018
Outline
● Motivation and Related Work
● Automatic Performance Modeling
● Experiment Results
● Conclusions
HPC Performance Prediction
● HPC performance prediction provides insight about
○ Applications (e.g., scalability, performance variability)
○ Hardware/software (e.g., better design)
○ Workload behavior (present and future)
● This is useful for
○ Understanding application performance issues
○ Improving application and system scalability
○ Budgeting, designing efficient systems (present and future)
Performance Prediction Challenges
1. New applications
2. New architectures (processors, GPUs, memory, interconnect)
3. New systems/tools
4. Scalability (applications/systems are getting larger)
Modeling techniques can help predict performance of
present and future applications, systems, and workloads
Existing Modeling Techniques
1. Analytical Performance Models (e.g., LogP, LogGP, LogGOPSim)
○ Time expressed as mathematical formulas
○ Pros: simple, fast, flexible, scalable
○ Cons: low accuracy
2. Simulation (e.g., SST, TraceR, CODES, PPT)
○ Models detailed system and application behavior
○ Trace-driven and execution-driven simulation
○ Pros: high accuracy, flexible, future architectures
○ Cons: slow, (typically) small in scale, complexity in building models
3. Hybrid Models (e.g., Aspen, Palm, Compass, Durango)
○ Combination of analytical, simulation and trace replaying
○ Trades accuracy and flexibility against speed
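To make "time expressed as mathematical formulas" concrete, here is a minimal sketch of the LogGP point-to-point cost, which estimates the end-to-end time of one k-byte message as o + (k-1)G + L + o. The class and function names are illustrative, and the parameter values are made-up placeholders, not measurements from this talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogGPParams:
    """LogGP parameters (values supplied by the caller; none come from the talk)."""
    L: float  # network latency, in seconds
    o: float  # CPU overhead per message at the sender and at the receiver, in seconds
    G: float  # gap per byte for long messages, in seconds per byte


def loggp_message_time(k_bytes: int, p: LogGPParams) -> float:
    """End-to-end time of a single k-byte message: o + (k-1)*G + L + o."""
    if k_bytes < 1:
        raise ValueError("message must contain at least one byte")
    return p.o + (k_bytes - 1) * p.G + p.L + p.o


# Placeholder parameters chosen only to exercise the formula.
params = LogGPParams(L=1.0e-6, o=0.5e-6, G=1.0e-9)
t_8kb = loggp_message_time(8 * 1024, params)
```

The formula is cheap to evaluate at any scale, which is the "fast, scalable" side of the trade-off; it has no notion of contention or topology, which is where the accuracy is lost.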
Most Related to Our Work
1. Aspen [Spafford and Vetter, SC’12]
○ Domain-specific language to describe application and machine
○ Features analytical communication model
2. Compass [Lee, Meredith, Vetter, ICS’15]
○ Static-analysis-based automated model construction for Aspen
○ Built on Cetus compiler and OpenARC source transformation framework
○ High-level communication abstraction functionality
3. Durango [Carothers et al., SIGSIM-PADS’17]
○ Combines Aspen and CODES
○ Parameterized application model w/ computation events
○ Simulates Aspen-generated models or traces on CODES
4. PPT-AMM [Chennupati et al., WSC’17]
○ Simulation-based performance prediction
○ Uses static analysis of source code to build data availability models
Our Approach - PyPassT
We use a static-analysis-based HPC simulation framework
1. Automatically builds model from source code
2. CPU plus cache/memory models
3. Mid-level GPU Model
4. Detailed communication model
a. Point-to-point and collective MPI operations
b. Preserves locality and spatial features
c. Abstracts data volume properties
PyPassT Framework
● Goal: preserve accuracy, performance, flexibility, and scalability,
so that large-scale applications can be studied
● PyPassT:
○ An application model in {Py}thon
○ Generated automatically in compiler {Pass}es
○ Executed in PP{T} HPC simulation.
● Steps of an application performance analysis
○ Start with an application program
○ Statically analyze the program to build an abstract model
○ Transform into an executable model (encompassing CPU, GPU, and communication)
○ Run model with HPC Simulation (for performance prediction)
[Figure: PyPassT Framework]
PyPassT Static Analysis
● Static Analysis
○ Built on Compass/OpenARC/Cetus Compilers
○ Derive an abstract model for the application:
○ CPU Computation:
■ Obtain workload (flops) with OpenARC
■ Extra pass through Byfl (built on LLVM) to calculate data availability reuse
profile for memory/cache performance
○ GPU Computation:
■ Identify GPU kernels using OpenARC
■ Obtain workload (flops and memory loads/stores)
○ MPI Communication:
■ Point-to-point ops (sender, receiver, transfer size, domain)
■ Collective ops (root, size, operation, domain)
OpenACC
1. A standardization effort that defines a set of directives (#pragma acc ….)
a. Compiler uses these for parallel kernel transformation
b. Compute-intensive parallel regions (work-sharing loops) offloaded to GPU
2. Generates architecture-dependent parallel code
3. Helps port codes to a wide variety of heterogeneous targets
a. HPC hardware platforms and architectures
b. GPUs/accelerators
We use OpenACC to find parallel regions for CPU/GPU
OpenACC Annotated Program
#pragma acc data copy(A), create(Anew)
#pragma acc parallel num_gangs(16) num_workers(32) … private(j)
{
    ...
    #pragma acc loop gang
    for( j = 1; j < n-1; j++) {
        #pragma acc loop worker ….
        for( i = 1; i < m-1; i++ ) {
            Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
            ….
        }
        ….
    }
}
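The workload that static analysis extracts from a loop nest like this can be counted by hand. The sketch below does so for the visible stencil statement only (the elided parts of the program are ignored). The function name and the 8-byte element size are assumptions for illustration; the slide does not state the element type of A and Anew.

```python
def jacobi_sweep_workload(n: int, m: int, bytes_per_elem: int = 8) -> dict:
    """Count work for one sweep of the loop nest j in [1, n-1), i in [1, m-1).

    bytes_per_elem=8 assumes double precision, which is an assumption of this
    sketch rather than something the annotated program states.
    """
    points = max(n - 2, 0) * max(m - 2, 0)  # interior grid points updated
    loads = 4 * points                      # A[j][i+1], A[j][i-1], A[j-1][i], A[j+1][i]
    stores = points                         # Anew[j][i]
    return {
        "points": points,
        "flops": 4 * points,                # 3 additions + 1 multiplication per point
        "loads": loads,
        "stores": stores,
        "bytes_moved": (loads + stores) * bytes_per_elem,
    }


work = jacobi_sweep_workload(n=4, m=5)
```

Counts of this kind (flops, loads, stores) are the inputs that later become entries of a CPU or GPU tasklist.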
CPU Tasklist
Task          Description
iALU          Integer operations
fALU          Floating-point operations (add/multiply)
fDIV          Floating-point divisions
INTVEC        Integer vector operations
VECTOR        Floating-point vector operations
intranode     Transfers within a node (as memory access)
MEM_ACCESS    Cycles to move data through memory and cache
HITRATES      Direct input of cache-level hit rates
L1            Direct input of L1 accesses
L2            Direct input of L2 accesses
L3, L4, L5,
RAM, mem      Direct input of higher-level cache and memory accesses
CPU ops       CPU operations
Memory Hierarchy: PPT-AMM [Chennupati et al., WSC’17]
● A parameterized model for computation performance prediction
● Uses reuse distance and profiles to estimate data availability
○ Patterns of hardware architecture-independent virtual memory accesses
● The reuse profile models different cache hierarchies
We use PPT-AMM to model computation more accurately
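A reuse (stack) distance is the number of distinct addresses touched between two accesses to the same address. The sketch below computes a reuse profile from an address trace and turns it into a hit rate. It assumes a fully associative LRU cache, which is a simplification chosen for this example; the slides do not spell out the cache model PPT-AMM uses, and all function names here are illustrative.

```python
from collections import Counter


def reuse_distances(trace):
    """Return the LRU stack distance of every access (None for a first touch)."""
    stack = []  # most recently used address is at the end
    distances = []
    for addr in trace:
        if addr in stack:
            idx = stack.index(addr)
            # Distinct addresses accessed since the previous use of addr.
            distances.append(len(stack) - 1 - idx)
            stack.pop(idx)
        else:
            distances.append(None)  # cold (compulsory) access
        stack.append(addr)
    return distances


def reuse_profile(trace):
    """Map each observed distance to its probability over the whole trace."""
    distances = reuse_distances(trace)
    counts = Counter(distances)
    return {d: c / len(distances) for d, c in counts.items()}


def lru_hit_rate(profile, capacity_blocks):
    """Hit rate of a fully associative LRU cache holding capacity_blocks blocks."""
    return sum(p for d, p in profile.items()
               if d is not None and d < capacity_blocks)


profile = reuse_profile(list("abcaba"))
```

Because the profile is computed from addresses alone, the same profile can be evaluated against several cache sizes by changing only `capacity_blocks`, which is what makes it hardware-independent.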
CPU Model Building
1. Produce a memory trace with Byfl
2. Compute the transition probability from one basic block (BB) to another
○ Calculated using LLVM coverage analysis
3. Calculate the probability of executing each BB
4. Calculate the conditional reuse profile of each BB
5. Create and evaluate a PPT-AMM computation tasklist
6. Mimic the computation with a sleep
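One plausible reading of steps 3 to 5 is that the program-level reuse profile is a mixture of the per-basic-block conditional profiles, weighted by how likely each block is to execute: P(D = d) = Σ_i P(BB_i) · P(D = d | BB_i). The sketch below implements that mixture. The weighting scheme is an assumption of this example, not a formula taken from the slides, and the block names and numbers are hypothetical.

```python
def mix_reuse_profiles(bb_prob, bb_profiles):
    """Combine conditional reuse profiles into one program-level profile.

    bb_prob:     {basic_block: probability of executing it}, must sum to 1
    bb_profiles: {basic_block: {reuse_distance: P(distance | basic_block)}}
    """
    if abs(sum(bb_prob.values()) - 1.0) > 1e-9:
        raise ValueError("basic-block probabilities must sum to 1")
    mixed = {}
    for bb, p_bb in bb_prob.items():
        for dist, p_dist in bb_profiles[bb].items():
            mixed[dist] = mixed.get(dist, 0.0) + p_bb * p_dist
    return mixed


# Hypothetical two-block program.
bb_prob = {"bb0": 0.25, "bb1": 0.75}
bb_profiles = {
    "bb0": {3: 1.0},
    "bb1": {3: 0.5, 7: 0.5},
}
program_profile = mix_reuse_profiles(bb_prob, bb_profiles)
```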
CPU Model Example
#stack_dist  probability_sd
#3.0         0.357978254355
#7.0         0.141925649545
#11.0        0.053872280264
#12.0        1.44216666613e-05
#15.0        0.0577797511131
#19.0        0.00610221191651
#23.0        0.00814826082
tasklist_example = [['iALU', 8469], ['fALU', 89800], ['fDIV', 6400],
                    ['MEM_ACCESS', 4, 10, 1, 1, 1, 10, 80, False,
                     stack_dist, probability_sd, block_size, total_bytes, data_bus_width]]
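To show how a tasklist turns into a time estimate, the sketch below sums the arithmetic entries of the example above using a per-operation cycle cost and a clock rate. This is not PPT's evaluation routine: the cycle costs are hypothetical placeholders, the MEM_ACCESS entry is ignored, and only the 2.1 GHz clock and the operation counts come from the slides.

```python
# Hypothetical cycles per operation; real values depend on the hardware model.
CYCLES_PER_OP = {"iALU": 1.0, "fALU": 2.0, "fDIV": 20.0}
CLOCK_HZ = 2.1e9  # 2.1 GHz, as in the experiment configuration


def estimate_alu_seconds(tasklist, cycles_per_op=CYCLES_PER_OP, clock_hz=CLOCK_HZ):
    """Sum cycles for the arithmetic tasks and convert them to seconds.

    Task types missing from cycles_per_op (for example MEM_ACCESS) are skipped
    in this sketch.
    """
    cycles = 0.0
    for task in tasklist:
        name, args = task[0], task[1:]
        if name in cycles_per_op:
            cycles += args[0] * cycles_per_op[name]
    return cycles / clock_hz


alu_tasks = [["iALU", 8469], ["fALU", 89800], ["fDIV", 6400]]
alu_seconds = estimate_alu_seconds(alu_tasks)
```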
GPU Tasklist
Task               Description
alloc              [host] memory allocations (in # of bytes)
unalloc            [host] memory de-allocate
DEVICE_ALLOC       Device allocations
DEVICE_TRANSFER    Device transfers
KERNEL_CALL        Call a GPU kernel with block/grid
iALU               Integer operations
diALU              Double precision integer operations
fALU               Floating point operations (add/multiply)
dfALU              Double precision flop
fDIV               Floating point divisions
SFU                Special function calls
L1                 Direct input of L1 accesses
L2                 Direct input of L2 accesses
GLOB_MEM_ACCESS    Access GPU on-chip global memory
DEVICE_SYNC        Synchronize GPU threads
THREAD_SYNC        Synchronize GPU threads w/CPU
GPU Model Building
● OpenARC provides
○ Host-memory-to-GPU transfers and vice versa
○ Loads
○ Stores
○ Flops
○ GPU block-size, grid-size
● Build GPU-warp tasklist from Compass-generated IR
○ Evaluate with the desired hardware model
○ MPI rank sleeps for the duration of the computation
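Block size and grid size determine how many warps a kernel launch creates. The sketch below derives that geometry for a launch; the warp size of 32 is the standard NVIDIA value, while the function name and the way the result would feed a warp tasklist are assumptions of this example, since the slides do not describe that mapping.

```python
import math

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def launch_geometry(block_size, grid_size, warp_size=WARP_SIZE):
    """Derive thread and warp counts from (x, y, z) block and grid sizes."""
    threads_per_block = math.prod(block_size)
    blocks = math.prod(grid_size)
    warps_per_block = math.ceil(threads_per_block / warp_size)
    return {
        "threads_per_block": threads_per_block,
        "blocks": blocks,
        "total_threads": threads_per_block * blocks,
        "warps_per_block": warps_per_block,
        "total_warps": warps_per_block * blocks,
        # Fraction of warp lanes doing useful work.
        "lane_utilization": threads_per_block / (warps_per_block * warp_size),
    }


# Block 16x1x1 and grid 32x1x1, the configuration used in the GPU experiment.
geometry = launch_geometry((16, 1, 1), (32, 1, 1))
```

With 16 threads per block, each block occupies one half-filled warp, so the per-warp instruction list is evaluated 32 times in total.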
GPU Kernel Analysis
#pragma acc kernels loop gang(16) worker(32) copy(m, n) \
    present(A[0:4096][0:4096], Anew[0:4096][0:4096]) private(i_0, j_0)
#pragma aspen control label(block_main50) loop((-2+n)) parallelism((-2+n))
for (j_0=0; j_0<=(-3+n); j_0 ++ ){
    #pragma acc loop gang(16) worker(32)
    #pragma aspen control label(block_main51) loop((-2+m)) parallelism((-2+m))
    for (i_0=0; i_0<=(-3+m); i_0 ++ ){
        #pragma aspen control execute label(block_main52) \
            loads((1*aspen_param_sizeof_double):from(Anew):traits(stride(1))) \
            stores((1*aspen_param_sizeof_double):to(A):traits(stride(1)))
        A[(1+j_0)][(1+i_0)]=Anew[(1+j_0)][(1+i_0)];
    }
}
Computation (CPU+GPU)
# accelerator warp instructions
GPU_WARP = [['GLOB_MEM_ACCESS'],
            ['GLOB_MEM_ACCESS'], ['L1_ACCESS'],
            ['fALU'], ['GLOB_MEM_ACCESS'],
            ['GLOB_MEM_ACCESS'], ['L1_ACCESS'], ['dfALU']]
# call the warp with block size and grid size
CPU_tasklist = [['KERNEL_CALL', 0, GPU_WARP,
                 blocksize, gridsize, regcount], ['DEVICE_SYNC', 0]]
# evaluate with hardware model and collect statistics
now = mpi_wtime(mpi_comm_world)
(time, stats) = core.time_compute(CPU_tasklist, now, True)
# sleep for the duration
mpi_ext_sleep(time)
Communication
● Convert IR of static analysis to PPT communication calls
● Source
○ MPI_Sendrecv( A[start], M, MPI_FLOAT, top , 0, A[end], M, MPI_FLOAT,
bottom, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE );
○ MPI_Barrier( MPI_COMM_WORLD );
● Target
if (my_rank==0) {
    execute "block_stencil1d59" {
        #mpi send/recv
        messages [MAX_STRING] to (my_rank+1) as send
        messages [MAX_STRING] from (my_rank+1) as recv
    }
}
execute "block_stencil1d66" {
    messages [MPI_COMM_WORLD] as barrier
}
Supported MPI Operations
● MPI_Send, MPI_Recv, MPI_Sendrecv
● MPI_Isend, MPI_Irecv
● MPI_Wait, MPI_Waitall
● MPI_Reduce, MPI_Allreduce
● MPI_Bcast, MPI_Barrier
● MPI_Gather, MPI_Allgather
● MPI_Scatter
● MPI_Alltoall
● MPI_Alltoallv
Once the application model is ready, it is evaluated using HPC simulation.
Application Simulation
● Launch application on the Performance Prediction Toolkit (PPT)
PPT is an HPC system and application simulator that allows rapid prototyping of
parallel applications at a sufficiently detailed level.
● Collaborative project between LANL and FIU
● Built on Simian, a process-oriented parallel simulator
● PPT Features
1. Hardware models (processor, memory, GPU)
2. Full-fledged MPI model
3. Detailed interconnection model
4. Large-scale workload model
PPT Overview
Open-sourced at: https://github.com/lanl/PPT
[ppt-torus] Kishwar Ahmed, Mohammad Obaida, Jason Liu, Stephan Eidenbenz, Nandakishore Santhi, Guillaume Chapuis, An Integrated Interconnection Network
Model for Large-Scale Performance Prediction, ACM SIGSIM Conference on Principles of Advanced Discrete Simulation (PADS 2016), May 2016.
Runtime Predictions
● Application
○ Laplace2D MM
○ 512x512 to 4096x4096
● System / PPT-CAMM config
○ Run with 1 core (taskset)
○ 2x 8-core E5-2450 @2.1GHz
○ 48GB shared memory
○ Optimized (O3) & unoptimized flags
○ L1/L2/L3=32K/256K/12M
● Results
○ Collect runtime, flops, loads, stores
○ 7.08% average error (optimized)
○ 3.12% average error (unoptimized)
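The slides report average error without defining it. A common choice, assumed here, is the mean absolute percentage error of predicted against measured runtime across problem sizes. The runtimes in the example are invented placeholders, not data from the experiments.

```python
def mean_abs_pct_error(measured, predicted):
    """Mean of |predicted - measured| / measured, expressed in percent."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [abs(p - m) / m for m, p in zip(measured, predicted)]
    return 100.0 * sum(errors) / len(errors)


# Placeholder runtimes in seconds for three problem sizes.
measured_s = [1.0, 2.0, 4.0]
predicted_s = [1.1, 1.9, 4.0]
avg_error_pct = mean_abs_pct_error(measured_s, predicted_s)
```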
Resource Usage
● Computation
○ Result: no error
● Memory
○ ~1% error
GPU Performance Prediction
● Application: Laplace2D MM
● Mesh Sizes: 1024x1024 to 8192x8192
● BlockSize: 16x1x1, GridSize: 32x1x1
○ #pragma acc parallel num_gangs(16) num_workers(32)
Machine Config
● Two 8-core Xeon E5-5645 @2.1GHz
● 48GB shared memory
● NVIDIA GeForce GM204
○ 1050 MHz, 4GB GDDR5
Results:
● 0.16% error for 8192x8192
● 13.8% error for 1024x1024
● Lower error for larger GPU kernels
Communication Prediction
● Application
○ Jacobi Iterative Method
○ 2048x2048 matrices
● Experiment cluster
○ Grizzly, LANL, 1.79 PFlops
○ 53K cores, Omni-Path
○ Message size: 8KB
● PyPassT Machine Config
○ Two 8-core Xeon E5-5645 @2.1GHz
○ 48GB shared memory
Results
● Predicted total bytes for each src-dst pair
exactly match the actual run
Communication Prediction (2)
Results
● The predicted number of packets for
every src-dst pair exactly matches
the actual runs
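The per-pair comparison can be reproduced in miniature by building a traffic matrix keyed by (source, destination). The sketch below assumes a 1-D row decomposition in which every rank exchanges one halo row with each neighbour per iteration; that decomposition, the function name, and the rank and iteration counts are assumptions of this example. With 2048 single-precision elements per row, one message is 8192 bytes, consistent with the 8KB message size listed for the experiment.

```python
from collections import defaultdict


def halo_traffic(num_ranks, row_elems, iterations, bytes_per_elem=4):
    """Total bytes sent for each (src, dst) pair in a 1-D halo exchange."""
    msg_bytes = row_elems * bytes_per_elem
    traffic = defaultdict(int)
    for rank in range(num_ranks):
        for neighbour in (rank - 1, rank + 1):
            if 0 <= neighbour < num_ranks:  # edge ranks have a single neighbour
                traffic[(rank, neighbour)] += msg_bytes * iterations
    return dict(traffic)


# Hypothetical run: 4 ranks, 10 iterations, 2048 floats per halo row.
predicted_traffic = halo_traffic(num_ranks=4, row_elems=2048, iterations=10)
```

Comparing two such dictionaries with `==`, one from a measured run and one from the model, is an exact per-pair check of the kind the results describe.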
Summary of Contributions
● PyPassT automatically builds the application model
● Combines it with rapid HPC modeling tools
● Produces fast and accurate predictions
Future Work:
• Study scalability of applications
• Apply dynamic analysis and ML for irregular applications
Thank You
