Parallel Application Performance Prediction of Using Analysis Based Modeling

Mohammad Abu Obaida, Jason Liu,
Gopinath Chennupati, Nandakishore Santhi and Stephan Eidenbenz
SIGSIM-PADS’18, Rome, Italy, May 23, 2018

Outline
● Motivation and Related Work
● Automatic Performance Modeling
● Experiment Results
● Conclusions

HPC Performance Prediction
● HPC performance prediction provides insight about
○ Applications (e.g., scalability, performance variability)
○ Hardware/software (e.g., better design)
○ Workload behavior (present and future)
● This is useful for:
○ Understanding application performance issues
○ Improving application and system scalability
○ Budgeting and designing efficient systems (present and future)

Performance Prediction Challenges
1. New applications
2. New architectures (processors, GPUs, memory, interconnect)
3. New systems/tools
4. Scalability (applications/systems are getting larger)
Modeling techniques can help predict performance of
present and future applications, systems, and workloads

Existing Modeling Techniques
1. Analytical Performance Models (e.g., LogP, LogGP, LogGOPSim)
○ Time expressed as mathematical formulas
○ Pros: simple, fast, flexible, scalable
○ Cons: low accuracy
2. Simulation (e.g., SST, TraceR, CODES, PPT)
○ Models detailed system and application behavior
○ Trace-driven and execution-driven simulation
○ Pros: high accuracy, flexible, can model future architectures
○ Cons: slow, (typically) small in scale, models are complex to build
3. Hybrid Models (e.g., Aspen, Palm, Compass, Durango)
○ Combination of analytical modeling, simulation, and trace replay
○ Trade accuracy and flexibility against speed
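To make "time expressed as mathematical formulas" concrete, here is a minimal sketch of the standard LogGP point-to-point cost, o + (k-1)G + L + o for a k-byte message. The class name and all parameter values are illustrative placeholders, not measurements from any system discussed in this talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogGP:
    """LogGP parameters (all times in seconds); values are supplied by the user."""
    L: float  # network latency
    o: float  # per-message CPU overhead on sender and on receiver
    g: float  # minimum gap between consecutive messages
    G: float  # gap per byte for long messages
    P: int    # number of processes

    def p2p_time(self, k: int) -> float:
        """End-to-end time of one k-byte message: o + (k-1)*G + L + o."""
        if k < 1:
            raise ValueError("message size must be at least 1 byte")
        return self.o + (k - 1) * self.G + self.L + self.o


# Placeholder parameters chosen only to exercise the formula.
model = LogGP(L=1e-6, o=0.5e-6, g=1e-6, G=1e-9, P=16)
t_8k = model.p2p_time(8192)
print(f"estimated time for an 8 KB message: {t_8k:.3e} s")
```

A closed form like this is cheap to evaluate at any scale, which is the "fast, scalable" advantage listed above; it ignores contention and topology, which is where the accuracy is lost.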

Most Related to Our Work
1. Aspen [Spafford and Vetter, SC’12]
○ Domain specific language to describe application and machine
○ Features analytical communication model
2. Compass [Lee, Meredith, Vetter, ICS’15]
○ Static analysis based automated model construction for Aspen
○ Built on Cetus compiler and OpenARC source transformation framework
○ High level communication abstraction functionality
3. Durango [Carothers et al., SIGSIM-PADS’17]
○ Combines Aspen and CODES
○ Parameterized application model w/ computation events
○ Simulates Aspen generated models or traces on CODES
4. PPT-AMM [Chennupati et al., WSC’17]
○ Simulation based performance prediction
○ Uses static analysis of source code to build data availability models

Our Approach - PyPassT
We use a static-analysis-based HPC simulation framework:
1. Automatically builds the model from source code
2. CPU plus cache/memory models
3. Mid-level GPU model
4. Detailed communication model
a. Point-to-point and collective MPI operations
b. Preserves locality or spatial features
c. Abstracts data volume properties


PyPassT Framework
● Purpose: maintain accuracy, performance, flexibility, and scalability
so that large-scale applications can be studied
● PyPassT:
○ An application model in {Py}thon
○ Generated automatically in compiler {Pass}es
○ Executed in PP{T} HPC simulation.
● Steps of an application performance analysis
○ Start with an application program
○ Statically analyze the program to build an abstract model
○ Transform into an executable model (encompassing CPU, GPU, and communication)
○ Run model with HPC Simulation (for performance prediction)

PyPassT Static Analysis
● Static Analysis
○ Built on Compass/OpenARC/Cetus Compilers
○ Derive an abstract model for the application:
○ CPU Computation:
■ Obtain workload (flops) with OpenARC
■ Extra pass through Byfl (built on LLVM) to calculate the reuse profile
(data availability) for memory/cache performance
○ GPU Computation:
■ Identify GPU kernels using OpenARC
■ Obtain workload (flops and memory loads/stores)
○ MPI Communication:
■ Point-to-point ops (sender, receiver, transfer size, domain)
■ Collective ops (root, size, operation, domain)

OpenACC
1. Standardization effort; defines a set of directives (#pragma acc ….)
a. The compiler uses these for parallel kernel transformation
b. Compute-intensive parallel regions (work-sharing loops) are offloaded to the GPU
2. Generates architecture-dependent parallel code
3. Helps port codes to a wide variety of heterogeneous targets
a. HPC hardware platforms and architectures
b. GPUs/accelerators
We use OpenACC to find parallel regions for CPU/GPU

OpenACC Annotated Program
#pragma acc data copy(A), create(Anew)
#pragma acc parallel num_gangs(16) num_workers(32) … private(j)
{
...
#pragma acc loop gang
for( j = 1; j < n-1; j++) {
#pragma acc loop worker ….
for( i = 1; i < m-1; i++ ) {
Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
….
}
….
}
}

CPU Tasklist
Task                    Description
iALU                    Integer operations
fALU                    Floating point operations (add/multiply)
fDIV                    Floating point divisions
INTVEC                  Integer vector operations
VECTOR                  Integer vector operations
intranode               Transfers within node (as memory access)
MEM_ACCESS              Cycles to move data through memory and cache
HITRATES                Direct input of cache level hit rates
L1                      Direct input of L1 accesses
L2                      Direct input of L2 accesses
L3, L4, L5, RAM, mem    Direct input of higher cache and memory accesses
CPU ops                 CPU operations

Memory Hierarchy: PPT-AMM [Chennupati et al., WSC’17]
● A parameterized model for computation performance prediction
● Uses reuse distance and profiles to estimate data availability
○ Patterns of hardware architecture-independent virtual memory accesses
● The reuse profile models different cache hierarchies
We use PPT-AMM to model computation more accurately
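To illustrate how a reuse distance profile can stand in for a cache, the sketch below computes LRU stack distances from a small address trace and derives a hit rate. It assumes a fully associative LRU cache, which is a simplification for illustration and not necessarily the formula PPT-AMM uses; the function names and the toy trace are made up.

```python
from collections import Counter


def reuse_distances(trace):
    """LRU stack distance of every access; None marks a first touch."""
    stack = []  # least recently used first, most recently used last
    distances = []
    for addr in trace:
        if addr in stack:
            idx = stack.index(addr)
            # Number of distinct addresses touched since the previous use.
            distances.append(len(stack) - 1 - idx)
            stack.pop(idx)
        else:
            distances.append(None)
        stack.append(addr)
    return distances


def reuse_profile(distances):
    """Map each distance (or None) to the fraction of accesses that have it."""
    counts = Counter(distances)
    return {d: c / len(distances) for d, c in counts.items()}


def hit_rate(profile, cache_blocks):
    """Assumed model: an access hits iff its distance is below the cache capacity."""
    return sum(p for d, p in profile.items() if d is not None and d < cache_blocks)


trace = ["a", "b", "c", "a", "b", "d", "a"]  # toy trace of block addresses
profile = reuse_profile(reuse_distances(trace))
print(hit_rate(profile, cache_blocks=3))
```

Because the profile depends only on the access pattern, the same profile can be evaluated against several cache sizes without re-running the application.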

CPU Model Building
1. Produce a memory trace with Byfl
2. Compute the transition probability from one basic block (BB) to another
○ Calculated using LLVM coverage analysis
3. Calculate the probability of executing each BB
4. Calculate the conditional reuse profile of each BB
5. Create and evaluate a PPT-AMM computation tasklist
6. Mimic the computation with a sleep
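Steps 2 to 4 can be sketched as follows: propagate transition probabilities through an acyclic control-flow graph to get a per-block execution probability, then mix the per-block conditional reuse profiles using those probabilities as weights. The weighting scheme, the function names, and the four-block graph are assumptions made for illustration; the actual PPT-AMM procedure may differ.

```python
def block_probabilities(order, transitions, entry):
    """Probability of reaching each basic block in an acyclic CFG.

    order: basic blocks in topological order.
    transitions: {(src, dst): probability of taking the edge src -> dst}.
    """
    reach = {bb: 0.0 for bb in order}
    reach[entry] = 1.0
    for src in order:
        for (s, dst), p in transitions.items():
            if s == src:
                reach[dst] += reach[src] * p
    return reach


def combined_profile(reach, conditional):
    """Mix per-block profiles P(D | BB), weighted by normalized block probability."""
    total = sum(reach.values())
    profile = {}
    for bb, cond in conditional.items():
        weight = reach[bb] / total
        for dist, p in cond.items():
            profile[dist] = profile.get(dist, 0.0) + weight * p
    return profile


# Hypothetical diamond-shaped CFG: A branches to B or C, both rejoin at D.
order = ["A", "B", "C", "D"]
transitions = {("A", "B"): 0.7, ("A", "C"): 0.3, ("B", "D"): 1.0, ("C", "D"): 1.0}
conditional = {
    "A": {1: 1.0},
    "B": {1: 0.5, 4: 0.5},
    "C": {8: 1.0},
    "D": {1: 1.0},
}
reach = block_probabilities(order, transitions, entry="A")
profile = combined_profile(reach, conditional)
print(reach, profile)
```

Loops would need extra handling (for example, expected trip counts), which this sketch leaves out.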

CPU Model Example
#stack_dist probability_sd
#3.0 0.357978254355
#7.0 0.141925649545
#11.0 0.053872280264
#12.0 1.44216666613e-05
#15.0 0.0577797511131
#19.0 0.00610221191651
#23.0 0.00814826082
tasklist_example = [['iALU', 8469], ['fALU', 89800], ['fDIV', 6400],
                    ['MEM_ACCESS', 4, 10, 1, 1, 1, 10, 80, False,
                     stack_dist, probability_sd, block_size, total_bytes, data_bus_width]]
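As a rough picture of what evaluating a tasklist involves, the sketch below sums operation counts multiplied by a per-operation cycle cost and divides by the clock rate. This is a toy stand-in, not PPT's hardware model: the cycle costs are invented, the function name is hypothetical, and memory tasks such as MEM_ACCESS are deliberately ignored. Only the operation counts and the 2.1 GHz clock come from the slides.

```python
CLOCK_HZ = 2.1e9  # 2.1 GHz, the host clock quoted in the experiments

# Hypothetical cycles per operation; real values come from the hardware model.
CYCLES_PER_OP = {"iALU": 1.0, "fALU": 1.0, "fDIV": 20.0}


def estimate_seconds(tasklist, cycles_per_op=CYCLES_PER_OP, clock_hz=CLOCK_HZ):
    """Sum count * cycles for known arithmetic tasks and convert to seconds."""
    cycles = 0.0
    for task in tasklist:
        name, args = task[0], task[1:]
        if name in cycles_per_op:
            cycles += cycles_per_op[name] * args[0]
        # Any other task type (e.g. MEM_ACCESS) is skipped in this sketch.
    return cycles / clock_hz


tasklist = [["iALU", 8469], ["fALU", 89800], ["fDIV", 6400]]
seconds = estimate_seconds(tasklist)
print(f"arithmetic-only estimate: {seconds:.3e} s")
```

In the real workflow the memory term is driven by the reuse profile, so an arithmetic-only estimate like this is a lower bound at best.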

GPU Tasklist
Task               Description
alloc              [host] memory allocations (in # of bytes)
unalloc            [host] memory de-allocate
DEVICE_ALLOC       Device allocations
DEVICE_TRANSFER    Device transfers
KERNEL_CALL        Call a GPU kernel with block/grid
iALU               Integer operations
diALU              Double precision integer operations
fALU               Floating point operations (add/multiply)
dfALU              Double precision flop
fDIV               Floating point divisions
SFU                Special function calls
L1                 Direct input of L1 accesses
L2                 Direct input of L2 accesses
GLOB_MEM_ACCESS    Access GPU on-chip global memory
DEVICE_SYNC        Synchronize GPU threads
THREAD_SYNC        Synchronize GPU threads w/CPU

GPU Model Building
● OpenARC provides
○ Memory-to-GPU transfers and vice versa
○ Loads
○ Stores
○ Flops
○ GPU block size and grid size
● Build the GPU warp tasklist from the Compass-generated IR
○ Evaluate with the desired hardware model
○ The MPI rank sleeps for the duration of the computation

GPU Kernel Analysis
#pragma acc kernels loop gang(16) worker(32) copy(m, n)
present(A[0:4096][0:4096], Anew[0:4096][0:4096]) private(i_0, j_0)
#pragma aspen control label(block_main50) loop((-2+n)) parallelism((-2+n))
for (j_0=0; j_0<=(-3+n); j_0 ++ ){
#pragma acc loop gang(16) worker(32)
#pragma aspen control label(block_main51) loop((-2+m)) parallelism((-2+m))
for (i_0=0; i_0<=(-3+m); i_0 ++ ){
#pragma aspen control execute label(block_main52)
loads((1*aspen_param_sizeof_double):from(Anew):traits(stride(1)))
stores((1*aspen_param_sizeof_double):to(A):traits(stride(1)))
A[(1+j_0)][(1+i_0)]=Anew[(1+j_0)][(1+i_0)];
}
}

Computation (CPU+GPU)
# accelerator warp instructions
GPU_WARP = [['GLOB_MEM_ACCESS'],
            ['GLOB_MEM_ACCESS'], ['L1_ACCESS'],
            ['fALU'], ['GLOB_MEM_ACCESS'],
            ['GLOB_MEM_ACCESS'], ['L1_ACCESS'], ['dfALU']]
# call the warp with block size and grid size
CPU_tasklist = [['KERNEL_CALL', 0, GPU_WARP,
                 blocksize, gridsize, regcount], ['DEVICE_SYNC', 0]]
# evaluate with the hardware model and collect statistics
now = mpi_wtime(mpi_comm_world)
(time, stats) = core.time_compute(CPU_tasklist, now, True)
# sleep for the predicted duration
mpi_ext_sleep(time)

Communication
● Convert IR of static analysis to PPT communication calls
● Source
○ MPI_Sendrecv( A[start], M, MPI_FLOAT, top , 0, A[end], M, MPI_FLOAT,
bottom, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE );
○ MPI_Barrier( MPI_COMM_WORLD );
● Target
if (my_rank==0) {
execute "block_stencil1d59" {
#mpi send/recv
messages [MAX_STRING] to (my_rank+1) as send
messages [MAX_STRING] from (my_rank+1) as recv
}
}
execute "block_stencil1d66" {
messages [MPI_COMM_WORLD] as barrier
}
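The abstract point-to-point record (sender, receiver, transfer size, domain) can be pictured with the sketch below, which expands one MPI_Sendrecv call into a send event and a receive event. The class, the helper function, and the datatype size table are hypothetical illustrations rather than PyPassT's internal representation, and the sizes are the typical C sizes, assumed here.

```python
from dataclasses import dataclass

# Typical datatype sizes in bytes (assumed; implementation-defined in general).
MPI_TYPE_BYTES = {"MPI_FLOAT": 4, "MPI_DOUBLE": 8, "MPI_INT": 4}


@dataclass(frozen=True)
class P2POp:
    sender: int
    receiver: int
    nbytes: int
    comm: str  # communication domain


def sendrecv_ops(rank, dest, source, count, dtype, comm="MPI_COMM_WORLD"):
    """Expand one Sendrecv on `rank` into its send and its receive."""
    nbytes = count * MPI_TYPE_BYTES[dtype]
    return [
        P2POp(sender=rank, receiver=dest, nbytes=nbytes, comm=comm),
        P2POp(sender=source, receiver=rank, nbytes=nbytes, comm=comm),
    ]


# Rank 1 sends a row of 2048 floats to its top neighbor (rank 0)
# and receives one from its bottom neighbor (rank 2).
ops = sendrecv_ops(rank=1, dest=0, source=2, count=2048, dtype="MPI_FLOAT")
for op in ops:
    print(op)
```

Keeping only these fields is what lets the model preserve who talks to whom and how much, without carrying the payload itself.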

Supported MPI Operations
● MPI_Send, MPI_Recv, MPI_Sendrecv
● MPI_Isend, MPI_Irecv
● MPI_Wait, MPI_Waitall
● MPI_Reduce, MPI_Allreduce
● MPI_Bcast, MPI_Barrier
● MPI_Gather, MPI_Allgather
● MPI_Scatter
● MPI_Alltoall
● MPI_Alltoallv
Once the application model is ready, it is evaluated using HPC simulation.

Application Simulation
● Launch the application on the Performance Prediction Toolkit (PPT)
PPT is an HPC system and application simulator that allows rapid prototyping of parallel
applications at a sufficiently detailed level.
● Collaborative project between LANL and FIU
● Built on Simian, a process-oriented parallel simulator
● PPT features
1. Hardware models (processor, memory, GPU)
2. Full-fledged MPI model
3. Detailed interconnection model
4. Large-scale workload model

PPT Overview
Open-sourced at: https://github.com/lanl/PPT
[ppt-torus] Kishwar Ahmed, Mohammad Obaida, Jason Liu, Stephan Eidenbenz, Nandakishore Santhi, Guillaume Chapuis, An Integrated Interconnection Network
Model for Large-Scale Performance Prediction, ACM SIGSIM Conference on Principles of Advanced Discrete Simulation (PADS 2016), May 2016.


Runtime Predictions
● Application
○ Laplace2D MM
○ 512x512 to 4096x4096
● System / PPT-CAMM config
○ Run on 1 core (taskset)
○ 2x 8-core E5-2450 @ 2.1 GHz
○ 48 GB shared memory
○ Optimized (O3) and unoptimized compiler flags
○ L1/L2/L3 = 32K/256K/12M
● Results
○ Collected runtime, flops, loads, stores
○ Average error, optimized: 7.08%
○ Average error, unoptimized: 3.12%
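For reference, an average error of this kind is commonly computed as the mean absolute percentage error over all problem sizes. The sketch below shows that calculation under the assumption that this is the metric meant; the slides do not state the exact definition, and the runtime pairs are placeholders, not the measurements behind the figures above.

```python
def percent_error(predicted, measured):
    """Absolute error of a prediction relative to the measurement, in percent."""
    return abs(predicted - measured) / measured * 100.0


def mean_percent_error(pairs):
    """Average percent error over (predicted, measured) pairs."""
    errors = [percent_error(p, m) for p, m in pairs]
    return sum(errors) / len(errors)


# Placeholder (predicted, measured) runtimes in seconds, one per mesh size.
pairs = [(1.05, 1.00), (1.90, 2.00), (4.20, 4.00)]
avg_error = mean_percent_error(pairs)
print(f"average error: {avg_error:.2f}%")
```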

Resource Usage
● Computation
○ Result: no error
● Memory
○ ~1% error

GPU Performance Prediction
● Application: Laplace2D MM
● Mesh sizes: 1024x1024 to 8192x8192
● BlockSize: 16x1x1, GridSize: 32x1x1
○ #pragma acc parallel num_gangs(16) num_workers(32)
Machine Config
● Two 8-core Xeon E5-5645 @ 2.1 GHz
● 48 GB shared memory
● NVIDIA GeForce GM204
○ 1050 MHz, 4 GB GDDR5
Results:
● 0.16% error for 8192x8192
● 13.8% error for 1024x1024
● Lower error for larger GPU kernels

Communication Prediction
● Application
○ Jacobi iterative method
○ 2048x2048 matrices
● Experimental cluster
○ Grizzly, LANL, 1.79 PFlops
○ 53K cores, Omni-Path
○ Message size: 8 KB
● PyPassT machine config
○ Two 8-core Xeon E5-5645 @ 2.1 GHz
○ 48 GB shared memory
Results
● Total bytes for each src-dst pair match exactly
between actual and predicted runs

Communication Prediction (2)
Results
● Perfect matching of the number of packets for every src-dst pair
between actual runs and predicted runs
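A per-pair comparison like the one reported here can be sketched by aggregating each run's messages into a traffic matrix keyed by (src, dst) and listing the pairs that differ. The function names and the message lists are hypothetical, and the sketch counts messages rather than network packets.

```python
from collections import defaultdict


def traffic_matrix(messages):
    """Aggregate (src, dst, nbytes) records into {(src, dst): (count, total_bytes)}."""
    totals = defaultdict(lambda: [0, 0])
    for src, dst, nbytes in messages:
        totals[(src, dst)][0] += 1
        totals[(src, dst)][1] += nbytes
    return {pair: tuple(v) for pair, v in totals.items()}


def mismatches(actual, predicted):
    """Pairs whose message count or byte total differs between the two runs."""
    pairs = set(actual) | set(predicted)
    return {
        p: (actual.get(p), predicted.get(p))
        for p in pairs
        if actual.get(p) != predicted.get(p)
    }


# Hypothetical logs: two ranks exchanging 8 KB messages.
actual_log = [(0, 1, 8192), (1, 0, 8192), (0, 1, 8192)]
predicted_log = [(0, 1, 8192), (0, 1, 8192), (1, 0, 8192)]
diff = mismatches(traffic_matrix(actual_log), traffic_matrix(predicted_log))
print("all pairs match" if not diff else diff)
```

An empty result corresponds to the "perfect matching" outcome; the order of messages within a pair does not matter for this check.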

Summary of Contributions
● PyPassT automatically builds the application model
● Combines it with rapid HPC modeling tools
● Produces fast predictions with good accuracy
Future Work:
● Study scalability of applications
● Apply dynamic analysis and ML for irregular applications
