LLVM-based Communication Optimizations
for PGAS Programs
Akihiro Hayashi (Rice University)
Jisheng Zhao (Rice University)
Michael Ferguson (Cray Inc.)
Vivek Sarkar (Rice University)
2nd Workshop on the LLVM Compiler Infrastructure in HPC @ SC15
1
A Big Picture
2
Photo Credits : http://chapel.cray.com/logo.html, http://llvm.org/Logo.html,
http://upc.lbl.gov/, http://commons.wikimedia.org/, http://cs.lbl.gov/
© Argonne National Lab.
© RIKEN AICS
©Berkeley Lab.
X10,
Habanero-UPC++,…
PGAS Languages
High-productivity features:
 Global-View
 Task parallelism
 Data Distribution
 Synchronization
3
X10, CAF, Habanero-UPC++
Communication is implicit
in some PGAS Programming Models
Global Address Space
 The compiler and runtime are responsible for
performing communication across nodes
4
1: var x = 1; // on Node 0
2: on Locales[1] { // on Node 1
3: … = x; // DATA ACCESS
4: }
Remote Data Access in Chapel
Communication is Implicit
in some PGAS Programming Models (Cont’d)
5
1: var x = 1; // on Node 0
2: on Locales[1] { // on Node 1
3: … = x; // REMOTE DATA ACCESS
Either the runtime handles affinity at every access:
if (x.locale == MYLOCALE) {
… = *(x.addr);
} else {
gasnet_get(…);
}
Or a compiler optimization removes the communication altogether (here, by propagating the constant):
1: var x = 1;
2: on Locales[1] {
3: … = 1;
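The runtime affinity check above can be sketched in plain C. This is a simplified, single-process simulation: `wide_ptr_t`, `my_locale`, and `fake_gasnet_get` are hypothetical stand-ins for the runtime's wide pointers and GASNet calls, used here only to count how often the remote path fires.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical wide pointer: a locale ID plus a raw address, mirroring
 * the check the compiler emits for a possibly-remote access. */
typedef struct {
    int32_t  locale;
    int64_t *addr;
} wide_ptr_t;

static int my_locale   = 0;  /* assumed current locale */
static int remote_gets = 0;  /* counts simulated gasnet_get calls */

/* Stub standing in for gasnet_get(); a real runtime would move the
 * value over the network. Single-process simulation: just read it. */
static int64_t fake_gasnet_get(wide_ptr_t p) {
    remote_gets++;
    return *p.addr;
}

/* The lowering of "… = x" for a possibly-remote x: check affinity at
 * runtime, read directly if local, otherwise communicate. */
int64_t read_possibly_remote(wide_ptr_t p) {
    if (p.locale == my_locale) {
        return *p.addr;            /* definitely-local fast path */
    } else {
        return fake_gasnet_get(p); /* remote path: one GET */
    }
}
```

The point of the later optimizations is to either skip this branch entirely (locality optimization) or issue far fewer GETs (aggregation, coalescing).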
Communication Optimization
is Important
6
A synthetic Chapel program on an Intel Xeon CPU X5660 cluster with QDR InfiniBand
(Chart: latency in ms vs. transferred bytes, log scale from 1 to 10,000,000; annotated gaps of 1,500x and 59x between the unoptimized and the optimized (bulk transfer) versions. Lower is better.)
PGAS Optimizations are
language-specific
7
Chapel Compiler
UPC Compiler
X10 Compiler
Habanero-C Compiler
Our goal
8
(Diagram: one common LLVM-based optimizer shared across the PGAS languages above.)
Why LLVM?
A widely used, language-agnostic compiler infrastructure
9
LLVM Intermediate Representation (LLVM IR)
Frontends: Clang (C/C++), dragonegg (C/C++, Fortran, Ada, Objective-C), Chapel frontend, UPC++ frontend
Analysis & Optimizations run on the common IR
Backends: x86, PowerPC, ARM, PTX → x86 Binary, PPC Binary, ARM Binary, GPU Binary
Summary & Contributions
 Our Observations :
 Many PGAS languages share semantically similar
constructs
 PGAS Optimizations are language-specific
 Contributions:
 Built a compilation framework that can uniformly optimize
PGAS programs (Initial Focus : Communication)
 Enabling existing LLVM passes for communication optimizations
 PGAS-aware communication optimizations
10
Overview of our framework
11
Chapel / UPC++ / X10 / CAF Programs
→ language-specific LLVM frontends (Chapel-LLVM, UPC++-LLVM, X10-LLVM, CAF-LLVM; these need to be implemented when supporting a new language/runtime)
→ LLVM IR (1. vanilla LLVM IR; 2. uses the address space feature to express communications)
→ LLVM-based Communication Optimization passes (generally language-agnostic)
→ Lowering Pass
How optimizations work
12
Chapel: x = 1;
UPC++: shared_var<int> x; x = 1;
Both map to the same IR:
store i64 1, i64 addrspace(100)* %x, … // x is possibly remote
1. Existing LLVM optimizations treat remote access as if it were local access
2. PGAS-aware optimizations are address space-aware
Runtime-Specific Lowering then turns the remaining remote accesses into Communication API Calls
LLVM-based Communication
Optimizations for Chapel
1. Enabling Existing LLVM passes
 Loop invariant code motion (LICM)
 Scalar replacement, …
2. Aggregation
 Combine sequences of loads/stores on
adjacent memory locations into a single
memcpy
13
These are already implemented in the standard Chapel compiler
An optimization example:
LICM for Communication Optimizations
14
LICM = Loop Invariant Code Motion
for i in 1..100 {
%x = load i64 addrspace(100)* %xptr
A(i) = %x;
}
LICM by LLVM
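To make the effect of LICM concrete, here is a C sketch; `remote_load` is a hypothetical stub that counts simulated GETs, so the two versions show the drop from one GET per iteration to a single hoisted GET.

```c
#include <assert.h>
#include <stdint.h>

static int gets_issued = 0;  /* counts simulated remote loads (GETs) */

/* Stub for a possibly-remote load: each call would be one GET. */
static int64_t remote_load(const int64_t *xptr) {
    gets_issued++;
    return *xptr;
}

/* Before LICM: the loop-invariant remote load sits inside the loop,
 * so the runtime issues one GET per iteration. */
void fill_unopt(int64_t *A, int n, const int64_t *xptr) {
    for (int i = 0; i < n; i++)
        A[i] = remote_load(xptr);
}

/* After LICM: the load is hoisted out of the loop, so a single GET
 * serves every iteration. */
void fill_licm(int64_t *A, int n, const int64_t *xptr) {
    int64_t x = remote_load(xptr);  /* hoisted */
    for (int i = 0; i < n; i++)
        A[i] = x;
}
```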
An optimization example:
Aggregation
15
// p is possibly remote
sum = p.x + p.y;
Two loads of adjacent fields, each a separate GET:
load i64 addrspace(100)* %pptr+0 ; GET x
load i64 addrspace(100)* %pptr+4 ; GET y
are combined into a single llvm.memcpy(…), i.e. one GET
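The aggregation above can be illustrated with a small C sketch. `fake_get` is a hypothetical stand-in that counts one transfer per call regardless of size, which is exactly why merging the two adjacent field loads into one memcpy-style transfer halves the GET count.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static int gets_issued = 0;

/* A remote struct with two adjacent fields, as on the slide. */
typedef struct { int64_t x; int64_t y; } point_t;

/* Stub: one GET per call, copying `len` bytes from the "remote" side. */
static void fake_get(void *dst, const void *src, size_t len) {
    gets_issued++;
    memcpy(dst, src, len);
}

/* Unoptimized: two separate GETs for the two adjacent fields. */
int64_t sum_unopt(const point_t *p) {
    int64_t x, y;
    fake_get(&x, &p->x, sizeof x);
    fake_get(&y, &p->y, sizeof y);
    return x + y;
}

/* Aggregated: both adjacent loads are served by a single bulk
 * transfer, i.e. one GET for both fields. */
int64_t sum_aggregated(const point_t *p) {
    point_t local;
    fake_get(&local, p, sizeof local);  /* one bulk GET */
    return local.x + local.y;
}
```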
LLVM-based Communication
Optimizations for Chapel
3. Locality Optimization
 Infer the locality of data and convert
possibly-remote access to definitely-local
access at compile-time if possible
4. Coalescing
 Remote array access vectorization
16
These are implemented, but not in the standard Chapel compiler
An Optimization example:
Locality Optimization
17
1: proc habanero(ref x, ref y, ref z) {
2: var p: int = 0;
3: var A:[1..N] int; // 1. A is definitely local
4: local { p = z; } // 2. p and z are definitely local
5: z = A(0) + z; // 3. definitely-local access! (avoids runtime affinity checking)
6: }
An Optimization example:
Coalescing
18
Before:
1: for i in 1..N {
2: … = A(i); // possibly-remote access
3: }
After:
1: localA = A; // perform bulk transfer
2: for i in 1..N {
3: … = localA(i); // converted to definitely-local access
4: }
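A C sketch of the same coalescing transformation; `fake_get` is again a hypothetical one-call-per-transfer stub, so the bulk copy into `localA` replaces N per-element GETs with one.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static int gets_issued = 0;

/* Stub: one GET per call, however many bytes it moves. */
static void fake_get(void *dst, const void *src, size_t len) {
    gets_issued++;
    memcpy(dst, src, len);
}

/* Before: one possibly-remote element access (one GET) per iteration. */
int64_t sum_before(const int64_t *A, int n) {
    int64_t s = 0;
    for (int i = 0; i < n; i++) {
        int64_t e;
        fake_get(&e, &A[i], sizeof e);
        s += e;
    }
    return s;
}

/* After coalescing: one bulk transfer into a local buffer, then the
 * loop runs entirely on definitely-local data. */
int64_t sum_after(const int64_t *A, int n) {
    int64_t localA[16];  /* sketch assumes n <= 16 */
    fake_get(localA, A, (size_t)n * sizeof *localA);  /* bulk GET */
    int64_t s = 0;
    for (int i = 0; i < n; i++)
        s += localA[i];
    return s;
}
```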
Performance Evaluations:
Benchmarks
19
Application Size
Smith-Waterman 185,600 x 192,000
Cholesky Decomp 10,000 x 10,000
NPB EP CLASS = D
Sobel 48,000 x 48,000
SSCA2 Kernel 4 SCALE = 16
Stream EP 2^30
Performance Evaluations:
Platforms
 Cray XC30™ Supercomputer @ NERSC
 Node
 Intel Xeon E5-2695 @ 2.40GHz x 24 cores
 64GB of RAM
 Interconnect
 Cray Aries interconnect with Dragonfly topology
 Westmere Cluster @ Rice
 Node
 Intel Xeon CPU X5660 @ 2.80GHz x 12 cores
 48 GB of RAM
 Interconnect
 Quad Data Rate (QDR) InfiniBand
20
Performance Evaluations:
Details of Compiler & Runtime
 Compiler
 Chapel Compiler version 1.9.0
 LLVM 3.3
 Runtime :
 GASNet-1.22.0
 Cray XC : aries
 Westmere Cluster : ibv-conduit
 Qthreads-1.10
 Cray XC: 2 shepherds, 24 workers / shepherd
 Westmere Cluster : 2 shepherds, 6 workers / shepherd
21
BRIEF SUMMARY OF
PERFORMANCE EVALUATIONS
Performance Evaluation
22
Results on the Cray XC
(LLVM-unopt vs. LLVM-allopt)
23
(Bar chart: performance improvement over LLVM-unopt (0.0-5.0) for SW, Cholesky, Sobel, StreamEP, EP, and SSCA2, broken down into Coalescing, Locality Opt, Aggregation, and Existing passes; annotated speedups of 2.1x, 19.5x, 1.1x, 2.4x, 1.4x, and 1.3x, left to right. Higher is better.)
 4.6x performance improvement relative to LLVM-unopt on
the same # of locales on average (1, 2, 4, 8, 16, 32, 64 locales)
Results on Westmere Cluster
(LLVM-unopt vs. LLVM-allopt)
24
(Bar chart: same layout as the Cray XC chart; annotated speedups of 2.3x, 16.9x, 1.1x, 2.5x, 1.3x, and 2.3x, left to right. Higher is better.)
 4.4x performance improvement relative to LLVM-unopt on
the same # of locales on average (1, 2, 4, 8, 16, 32, 64 locales)
DETAILED RESULTS & ANALYSIS
OF CHOLESKY DECOMPOSITION
Performance Evaluation
25
Cholesky Decomposition
26
(Diagram: a tiled Cholesky decomposition; tiles are distributed across Node0-Node3, with dependencies between tiles.)
Metrics
1. Performance & Scalability
 Baseline (LLVM-unopt)
 LLVM-based Optimizations (LLVM-allopt)
2. The dynamic number of communication API
calls
3. Analysis of optimized code
4. Performance comparison
 Conventional C-backend vs. LLVM-backend
27
Performance Improvement
by LLVM (Cholesky on the Cray XC)
28
(Bar chart: speedup over LLVM-unopt on 1 locale, for 1, 2, 4, 8, 16, and 32 locales. LLVM-unopt: 1, 0.1, 0.1, 0.2, 0.2, 0.3; LLVM-allopt: 2.6, 2.7, 3.7, 4.1, 4.3, 4.5.)
 LLVM-based communication optimizations show scalability
Communication API calls elimination
by LLVM (Cholesky on the Cray XC)
29
(Bar chart: dynamic number of communication API calls, normalized to LLVM-unopt. LLVM-allopt reduces LOCAL_GET to 12.1% (8.3x improvement), REMOTE_GET to 0.2% (500x improvement), and LOCAL_PUT to 89.2% (1.1x improvement); REMOTE_PUT stays at 100.0%.)
Analysis of optimized code
30
LLVM-unopt:
for jB in zero..tileSize-1 {
  for kB in zero..tileSize-1 {
    // 4 GETs
    for iB in zero..tileSize-1 {
      // 9 GETs + 1 PUT
}}}
LLVM-allopt:
// 1. ALLOCATE LOCAL BUFFER
// 2. PERFORM BULK TRANSFER
for jB in zero..tileSize-1 {
  for kB in zero..tileSize-1 {
    // 1 GET
    for iB in zero..tileSize-1 {
      // 1 GET + 1 PUT
}}}
Performance comparison with
C-backend
31
(Bar chart: speedup over LLVM-unopt on 1 locale, for 1 to 64 locales. C-backend: 2.8, 0.1, 0.1, 0.1, 0.3, 0.7, 0.4; LLVM-unopt: 1, 0.1, 0.1, 0.2, 0.2, 0.3, 0.4; LLVM-allopt: 2.6, 2.7, 3.7, 4.1, 4.3, 4.5, 4.5.)
On a single locale, the C-backend is faster!
Current limitation
 In LLVM 3.3, many optimizations assume that the pointer
size is the same across all address spaces
32
A wide pointer packs a 16-bit locale ID with a 48-bit address.
For C code generation: a 128-bit struct pointer, accessed as ptr.locale and ptr.addr.
For LLVM code generation: a 64-bit packed pointer, decoded with ptr >> 48 (locale) and ptr & 48BITS_MASK (address).
1. Needs more instructions
2. Loses opportunities for alias analysis
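The 64-bit packed-pointer encoding can be written out explicitly. This is a sketch of the scheme described above (16-bit locale in the high bits, 48-bit address below), not the Chapel compiler's actual code.

```c
#include <assert.h>
#include <stdint.h>

/* 16-bit locale in the top bits, 48-bit address in the bottom bits. */
#define ADDR_BITS 48
#define ADDR_MASK ((UINT64_C(1) << ADDR_BITS) - 1)

static uint64_t pack(uint16_t locale, uint64_t addr) {
    return ((uint64_t)locale << ADDR_BITS) | (addr & ADDR_MASK);
}

static uint16_t unpack_locale(uint64_t ptr) {
    return (uint16_t)(ptr >> ADDR_BITS);  /* ptr >> 48 */
}

static uint64_t unpack_addr(uint64_t ptr) {
    return ptr & ADDR_MASK;               /* ptr & 48-bit mask */
}
```

Every remote access pays these extra shift/mask instructions, and because the result is an integer rather than a pointer, standard alias analysis can no longer reason about it.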
Conclusions
LLVM-based Communication optimizations
for PGAS Programs
 Promising way to optimize PGAS programs in a
language-agnostic manner
 Preliminary Evaluation with 6 Chapel applications
Cray XC30 Supercomputer
– 4.6x average performance improvement
Westmere Cluster
– 4.4x average performance improvement
33
Future work
Extend LLVM IR to support parallel programs
with PGAS and explicit task parallelism
 Higher-level IR
34
Parallel Programs (Chapel, X10, CAF, HC, …)
→ LLVM Runtime-Independent Optimizations (1. RI-PIR Gen, 2. Analysis, 3. Transformation; e.g. task parallel constructs)
→ LLVM Runtime-Specific Optimizations (1. RS-PIR Gen, 2. Analysis, 3. Transformation; e.g. GASNet API)
→ Binary
Acknowledgements
Special thanks to
 Brad Chamberlain (Cray)
 Rafael Larrosa Jimenez (UMA)
 Rafael Asenjo Plaza (UMA)
 Habanero Group at Rice
35
Backup slides
36
Compilation Flow
37
Chapel Programs → AST Generation and Optimizations, then one of two paths:
1. C-code Generation → C Programs → Backend Compiler's Optimizations (e.g. gcc -O3) → Binary
2. LLVM IR Generation → LLVM IR → LLVM Optimizations → Binary

Editor's Notes

  • #26: I chose Cholesky decomposition to give you more detailed information about our performance evaluation and analysis. If you are interested in the other applications, please see the paper.
  • #30: The rightmost one; the second one from the left.
  • #33: But using LLVM has a drawback. Chapel uses a wide pointer to associate data with a node. The wide pointer is a C struct, and you can extract the node ID and address with the dot operator.