LLVM-based Communication Optimizations
for PGAS Programs
Akihiro Hayashi (Rice University)
Jisheng Zhao (Rice University)
Michael Ferguson (Cray Inc.)
Vivek Sarkar (Rice University)
2nd Workshop on the LLVM Compiler Infrastructure in HPC @ SC15
1
A Big Picture
2
Photo Credits : http://chapel.cray.com/logo.html, http://llvm.org/Logo.html,
http://upc.lbl.gov/, http://commons.wikimedia.org/, http://cs.lbl.gov/
© Argonne National Lab.
© RIKEN AICS
©Berkeley Lab.
X10,
Habanero-UPC++,…
PGAS Languages
High-productivity features:
 Global-View
 Task parallelism
 Data Distribution
 Synchronization
3
X10, CAF, Habanero-UPC++
Communication is implicit
in some PGAS Programming Models
Global Address Space
 The compiler and runtime are responsible for
performing communication across nodes
4
1: var x = 1; // on Node 0
2: on Locales[1] { // on Node 1
3: … = x; // DATA ACCESS
4: }
Remote Data Access in Chapel
Communication is Implicit
in some PGAS Programming Models (Cont’d)
5
1: var x = 1; // on Node 0
2: on Locales[1] { // on Node 1
3: … = x; // REMOTE DATA ACCESS
Either the runtime handles affinity at every access:
if (x.locale == MYLOCALE) {
… = *(x.addr);
} else {
gasnet_get(…);
}
Or a compiler optimization removes the communication altogether (here, by propagating the constant):
1: var x = 1;
2: on Locales[1] {
3: … = 1;
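The runtime affinity check above can be sketched in plain C. This is a simplified, single-process simulation: `wide_ptr_t`, `my_locale`, and `fake_gasnet_get` are hypothetical stand-ins for the runtime's wide pointers and GASNet calls, used here only to count how often the remote path fires.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical wide pointer: a locale ID plus a raw address, mirroring
 * the check the compiler emits for a possibly-remote access. */
typedef struct {
    int32_t  locale;
    int64_t *addr;
} wide_ptr_t;

static int my_locale   = 0;  /* assumed current locale */
static int remote_gets = 0;  /* counts simulated gasnet_get calls */

/* Stub standing in for gasnet_get(); a real runtime would move the
 * value over the network. Single-process simulation: just read it. */
static int64_t fake_gasnet_get(wide_ptr_t p) {
    remote_gets++;
    return *p.addr;
}

/* The lowering of "… = x" for a possibly-remote x: check affinity at
 * runtime, read directly if local, otherwise communicate. */
int64_t read_possibly_remote(wide_ptr_t p) {
    if (p.locale == my_locale) {
        return *p.addr;            /* definitely-local fast path */
    } else {
        return fake_gasnet_get(p); /* remote path: one GET */
    }
}
```

The point of the later optimizations is to either skip this branch entirely (locality optimization) or issue far fewer GETs (aggregation, coalescing).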
Communication Optimization
is Important
6
A synthetic Chapel program on an Intel Xeon CPU X5660 cluster with QDR InfiniBand
(Chart: latency in ms vs. transferred bytes, log scale from 1 to 10,000,000; annotated gaps of 1,500x and 59x between the unoptimized and the optimized (bulk transfer) versions. Lower is better.)
PGAS Optimizations are
language-specific
7
Chapel Compiler
UPC Compiler
X10 Compiler
Habanero-C Compiler
Our goal
8
(Diagram: one common LLVM-based optimizer shared across the PGAS languages above.)
Why LLVM?
A widely used, language-agnostic compiler infrastructure
9
LLVM Intermediate Representation (LLVM IR)
Frontends: Clang (C/C++), dragonegg (C/C++, Fortran, Ada, Objective-C), Chapel frontend, UPC++ frontend
Analysis & Optimizations run on the common IR
Backends: x86, PowerPC, ARM, PTX → x86 Binary, PPC Binary, ARM Binary, GPU Binary
Summary & Contributions
 Our Observations :
 Many PGAS languages share semantically similar
constructs
 PGAS Optimizations are language-specific
 Contributions:
 Built a compilation framework that can uniformly optimize
PGAS programs (Initial Focus : Communication)
 Enabling existing LLVM passes for communication optimizations
 PGAS-aware communication optimizations
10
Overview of our framework
11
Chapel / UPC++ / X10 / CAF Programs
→ language-specific LLVM frontends (Chapel-LLVM, UPC++-LLVM, X10-LLVM, CAF-LLVM; these need to be implemented when supporting a new language/runtime)
→ LLVM IR (1. vanilla LLVM IR; 2. uses the address space feature to express communications)
→ LLVM-based Communication Optimization passes (generally language-agnostic)
→ Lowering Pass
How optimizations work
12
Chapel: x = 1;
UPC++: shared_var<int> x; x = 1;
Both map to the same IR:
store i64 1, i64 addrspace(100)* %x, … // x is possibly remote
1. Existing LLVM optimizations treat remote access as if it were local access
2. PGAS-aware optimizations are address space-aware
Runtime-Specific Lowering then turns the remaining remote accesses into Communication API Calls
LLVM-based Communication
Optimizations for Chapel
1. Enabling Existing LLVM passes
 Loop invariant code motion (LICM)
 Scalar replacement, …
2. Aggregation
 Combine sequences of loads/stores on
adjacent memory locations into a single
memcpy
13
These are already implemented in the standard Chapel compiler
An optimization example:
LICM for Communication Optimizations
14
LICM = Loop Invariant Code Motion
for i in 1..100 {
%x = load i64 addrspace(100)* %xptr
A(i) = %x;
}
LICM by LLVM
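To make the effect of LICM concrete, here is a C sketch; `remote_load` is a hypothetical stub that counts simulated GETs, so the two versions show the drop from one GET per iteration to a single hoisted GET.

```c
#include <assert.h>
#include <stdint.h>

static int gets_issued = 0;  /* counts simulated remote loads (GETs) */

/* Stub for a possibly-remote load: each call would be one GET. */
static int64_t remote_load(const int64_t *xptr) {
    gets_issued++;
    return *xptr;
}

/* Before LICM: the loop-invariant remote load sits inside the loop,
 * so the runtime issues one GET per iteration. */
void fill_unopt(int64_t *A, int n, const int64_t *xptr) {
    for (int i = 0; i < n; i++)
        A[i] = remote_load(xptr);
}

/* After LICM: the load is hoisted out of the loop, so a single GET
 * serves every iteration. */
void fill_licm(int64_t *A, int n, const int64_t *xptr) {
    int64_t x = remote_load(xptr);  /* hoisted */
    for (int i = 0; i < n; i++)
        A[i] = x;
}
```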
An optimization example:
Aggregation
15
// p is possibly remote
sum = p.x + p.y;
Two loads of adjacent fields, each a separate GET:
load i64 addrspace(100)* %pptr+0 ; GET x
load i64 addrspace(100)* %pptr+4 ; GET y
are combined into a single llvm.memcpy(…), i.e. one GET
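The aggregation above can be illustrated with a small C sketch. `fake_get` is a hypothetical stand-in that counts one transfer per call regardless of size, which is exactly why merging the two adjacent field loads into one memcpy-style transfer halves the GET count.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static int gets_issued = 0;

/* A remote struct with two adjacent fields, as on the slide. */
typedef struct { int64_t x; int64_t y; } point_t;

/* Stub: one GET per call, copying `len` bytes from the "remote" side. */
static void fake_get(void *dst, const void *src, size_t len) {
    gets_issued++;
    memcpy(dst, src, len);
}

/* Unoptimized: two separate GETs for the two adjacent fields. */
int64_t sum_unopt(const point_t *p) {
    int64_t x, y;
    fake_get(&x, &p->x, sizeof x);
    fake_get(&y, &p->y, sizeof y);
    return x + y;
}

/* Aggregated: both adjacent loads are served by a single bulk
 * transfer, i.e. one GET for both fields. */
int64_t sum_aggregated(const point_t *p) {
    point_t local;
    fake_get(&local, p, sizeof local);  /* one bulk GET */
    return local.x + local.y;
}
```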
LLVM-based Communication
Optimizations for Chapel
3. Locality Optimization
 Infer the locality of data and convert
possibly-remote access to definitely-local
access at compile-time if possible
4. Coalescing
 Remote array access vectorization
16
These are implemented, but not in the standard Chapel compiler
An Optimization example:
Locality Optimization
17
1: proc habanero(ref x, ref y, ref z) {
2: var p: int = 0;
3: var A:[1..N] int; // 1. A is definitely local
4: local { p = z; } // 2. p and z are definitely local
5: z = A(0) + z; // 3. definitely-local access! (avoids runtime affinity checking)
6: }
An Optimization example:
Coalescing
18
Before:
1: for i in 1..N {
2: … = A(i); // possibly-remote access
3: }
After:
1: localA = A; // perform bulk transfer
2: for i in 1..N {
3: … = localA(i); // converted to definitely-local access
4: }
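A C sketch of the same coalescing transformation; `fake_get` is again a hypothetical one-call-per-transfer stub, so the bulk copy into `localA` replaces N per-element GETs with one.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static int gets_issued = 0;

/* Stub: one GET per call, however many bytes it moves. */
static void fake_get(void *dst, const void *src, size_t len) {
    gets_issued++;
    memcpy(dst, src, len);
}

/* Before: one possibly-remote element access (one GET) per iteration. */
int64_t sum_before(const int64_t *A, int n) {
    int64_t s = 0;
    for (int i = 0; i < n; i++) {
        int64_t e;
        fake_get(&e, &A[i], sizeof e);
        s += e;
    }
    return s;
}

/* After coalescing: one bulk transfer into a local buffer, then the
 * loop runs entirely on definitely-local data. */
int64_t sum_after(const int64_t *A, int n) {
    int64_t localA[16];  /* sketch assumes n <= 16 */
    fake_get(localA, A, (size_t)n * sizeof *localA);  /* bulk GET */
    int64_t s = 0;
    for (int i = 0; i < n; i++)
        s += localA[i];
    return s;
}
```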
Performance Evaluations:
Benchmarks
19
Application Size
Smith-Waterman 185,600 x 192,000
Cholesky Decomp 10,000 x 10,000
NPB EP CLASS = D
Sobel 48,000 x 48,000
SSCA2 Kernel 4 SCALE = 16
Stream EP 2^30
Performance Evaluations:
Platforms
 Cray XC30™ Supercomputer @ NERSC
 Node
 Intel Xeon E5-2695 @ 2.40GHz x 24 cores
 64GB of RAM
 Interconnect
 Cray Aries interconnect with Dragonfly topology
 Westmere Cluster @ Rice
 Node
 Intel Xeon CPU X5660 @ 2.80GHz x 12 cores
 48 GB of RAM
 Interconnect
 Quad Data Rate (QDR) InfiniBand
20
Performance Evaluations:
Details of Compiler & Runtime
 Compiler
 Chapel Compiler version 1.9.0
 LLVM 3.3
 Runtime :
 GASNet-1.22.0
 Cray XC : aries
 Westmere Cluster : ibv-conduit
 Qthreads-1.10
 Cray XC: 2 shepherds, 24 workers / shepherd
 Westmere Cluster : 2 shepherds, 6 workers / shepherd
21
BRIEF SUMMARY OF
PERFORMANCE EVALUATIONS
Performance Evaluation
22
Results on the Cray XC
(LLVM-unopt vs. LLVM-allopt)
23
(Bar chart: performance improvement over LLVM-unopt (0.0-5.0) for SW, Cholesky, Sobel, StreamEP, EP, and SSCA2, broken down into Coalescing, Locality Opt, Aggregation, and Existing passes; annotated speedups of 2.1x, 19.5x, 1.1x, 2.4x, 1.4x, and 1.3x, left to right. Higher is better.)
 4.6x performance improvement relative to LLVM-unopt on
the same # of locales on average (1, 2, 4, 8, 16, 32, 64 locales)
Results on Westmere Cluster
(LLVM-unopt vs. LLVM-allopt)
24
(Bar chart: same layout as the Cray XC chart; annotated speedups of 2.3x, 16.9x, 1.1x, 2.5x, 1.3x, and 2.3x, left to right. Higher is better.)
 4.4x performance improvement relative to LLVM-unopt on
the same # of locales on average (1, 2, 4, 8, 16, 32, 64 locales)
DETAILED RESULTS & ANALYSIS
OF CHOLESKY DECOMPOSITION
Performance Evaluation
25
Cholesky Decomposition
26
(Diagram: a tiled Cholesky decomposition; tiles are distributed across Node0-Node3, with dependencies between tiles.)
Metrics
1. Performance & Scalability
 Baseline (LLVM-unopt)
 LLVM-based Optimizations (LLVM-allopt)
2. The dynamic number of communication API
calls
3. Analysis of optimized code
4. Performance comparison
 Conventional C-backend vs. LLVM-backend
27
Performance Improvement
by LLVM (Cholesky on the Cray XC)
28
(Bar chart: speedup over LLVM-unopt on 1 locale, for 1, 2, 4, 8, 16, and 32 locales. LLVM-unopt: 1, 0.1, 0.1, 0.2, 0.2, 0.3; LLVM-allopt: 2.6, 2.7, 3.7, 4.1, 4.3, 4.5.)
 LLVM-based communication optimizations show scalability
Communication API calls elimination
by LLVM (Cholesky on the Cray XC)
29
(Bar chart: dynamic number of communication API calls, normalized to LLVM-unopt. LLVM-allopt reduces LOCAL_GET to 12.1% (8.3x improvement), REMOTE_GET to 0.2% (500x improvement), and LOCAL_PUT to 89.2% (1.1x improvement); REMOTE_PUT stays at 100.0%.)
Analysis of optimized code
30
LLVM-unopt:
for jB in zero..tileSize-1 {
  for kB in zero..tileSize-1 {
    // 4 GETs
    for iB in zero..tileSize-1 {
      // 9 GETs + 1 PUT
}}}
LLVM-allopt:
// 1. ALLOCATE LOCAL BUFFER
// 2. PERFORM BULK TRANSFER
for jB in zero..tileSize-1 {
  for kB in zero..tileSize-1 {
    // 1 GET
    for iB in zero..tileSize-1 {
      // 1 GET + 1 PUT
}}}
Performance comparison with
C-backend
31
(Bar chart: speedup over LLVM-unopt on 1 locale, for 1 to 64 locales. C-backend: 2.8, 0.1, 0.1, 0.1, 0.3, 0.7, 0.4; LLVM-unopt: 1, 0.1, 0.1, 0.2, 0.2, 0.3, 0.4; LLVM-allopt: 2.6, 2.7, 3.7, 4.1, 4.3, 4.5, 4.5.)
On a single locale, the C-backend is faster!
Current limitation
 In LLVM 3.3, many optimizations assume that the pointer
size is the same across all address spaces
32
A wide pointer packs a 16-bit locale ID with a 48-bit address.
For C code generation: a 128-bit struct pointer, accessed as ptr.locale and ptr.addr.
For LLVM code generation: a 64-bit packed pointer, decoded with ptr >> 48 (locale) and ptr & 48BITS_MASK (address).
1. Needs more instructions
2. Loses opportunities for alias analysis
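The 64-bit packed-pointer encoding can be written out explicitly. This is a sketch of the scheme described above (16-bit locale in the high bits, 48-bit address below), not the Chapel compiler's actual code.

```c
#include <assert.h>
#include <stdint.h>

/* 16-bit locale in the top bits, 48-bit address in the bottom bits. */
#define ADDR_BITS 48
#define ADDR_MASK ((UINT64_C(1) << ADDR_BITS) - 1)

static uint64_t pack(uint16_t locale, uint64_t addr) {
    return ((uint64_t)locale << ADDR_BITS) | (addr & ADDR_MASK);
}

static uint16_t unpack_locale(uint64_t ptr) {
    return (uint16_t)(ptr >> ADDR_BITS);  /* ptr >> 48 */
}

static uint64_t unpack_addr(uint64_t ptr) {
    return ptr & ADDR_MASK;               /* ptr & 48-bit mask */
}
```

Every remote access pays these extra shift/mask instructions, and because the result is an integer rather than a pointer, standard alias analysis can no longer reason about it.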
Conclusions
LLVM-based Communication optimizations
for PGAS Programs
 Promising way to optimize PGAS programs in a
language-agnostic manner
 Preliminary Evaluation with 6 Chapel applications
Cray XC30 Supercomputer
– 4.6x average performance improvement
Westmere Cluster
– 4.4x average performance improvement
33
Future work
Extend LLVM IR to support parallel programs
with PGAS and explicit task parallelism
 Higher-level IR
34
Parallel Programs (Chapel, X10, CAF, HC, …)
→ LLVM Runtime-Independent Optimizations (1. RI-PIR Gen, 2. Analysis, 3. Transformation; e.g. task parallel constructs)
→ LLVM Runtime-Specific Optimizations (1. RS-PIR Gen, 2. Analysis, 3. Transformation; e.g. GASNet API)
→ Binary
Acknowledgements
Special thanks to
 Brad Chamberlain (Cray)
 Rafael Larrosa Jimenez (UMA)
 Rafael Asenjo Plaza (UMA)
 Habanero Group at Rice
35
Backup slides
36
Compilation Flow
37
Chapel Programs → AST Generation and Optimizations, then one of two paths:
1. C-code Generation → C Programs → Backend Compiler's Optimizations (e.g. gcc -O3) → Binary
2. LLVM IR Generation → LLVM IR → LLVM Optimizations → Binary

Editor's Notes

  • #26: I chose Cholesky decomposition to give you more detailed information about our performance evaluation and analysis. If you are interested in the other applications, please see the paper.
  • #30: The rightmost one; the second one from the left.
  • #33: But using LLVM has a drawback. Chapel uses a wide pointer to associate data with a node. The wide pointer is a C struct, and you can extract the node ID and address with the dot operator.