Characterization of the Emu Chick with
Microbenchmarks
E. Jason Riedy
Center for Research into Novel Computing Hierarchies at Georgia Tech
23 January 2019
Outline
Project Background
Microbenchmarks
STREAM ADD and Pointer Chasing
Sparse Matrix – Vector Product (SpMV)
Breadth-First Search (BFS)
Labeled Subgraph Alignment
Observations
Memory-centric HPDA
• “Big data” platforms fare poorly v. a single thread
plus large SSD. (McSherry, Isard, Murray. “Scalability!
But at what COST?” HotOS XV, 2015.)
• New architecture proposals are difficult to evaluate
via simulation and modeling alone.
Evaluate the FPGA-based prototype Emu Chick...
• But by what criteria?
• Chose memory bandwidth utilization.
• Memory-centric architecture
• BW is equivalent to MFLOP/s in SpMV, TEPS in BFS
Emu: µbenchmarks — 23 Jan 2019 3/27
Emu Technology’s PGAS Architecture
[Figure: block diagram with labels: 1 nodelet, Gossamer Core 1 ... Gossamer Core 4, Memory-Side Processor, Migration Engine, RapidIO, Disk I/O, Stationary Core, 8 nodelets per node, 64 nodelets per Chick.]
• Multithreaded multicore
• Memory-side “processor” for operations in narrow-channel DRAM
• Stationary core for OS
• Threads migrate in hardware on reads!
• Optimize for weak locality
Baseline: Emu STREAM ADD c[i] = a[i] + b[i]
GC Config    Nodelets  Scale  Threads  BW (MB/s)
1            8         30     512       1,599.86
3            4         29     384       1,288.39
1            64        31     4096     12,790.31
3            32        31     6144      7,241.07
Theor. Peak  8                          9,600
Theor. Peak  64                        76,800
STREAM results are used to compare bandwidth utilization for the current prototype. 3 GC is experimental and has (had?) half the memory controllers.[1]
[1] Eric Hein, Young, Srinivas Eswar, Jiajia Li, Patrick Lavin, Riedy, Vuduc, Conte. “A Microbenchmark Characterization of the Emu Chick,” (in submission, https://arxiv.org/abs/1809.07696).
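As a rough cross-check, the fraction of theoretical peak reached by the 1 GC runs can be computed directly from the table. This is a minimal sketch: the name `utilization` is ours, and the 3 GC rows are left out because the slide lists no theoretical peak for 4 or 32 nodelets.

```python
# Measured STREAM ADD bandwidth and theoretical peak (MB/s), copied from the table.
measured = {8: 1599.86, 64: 12790.31}        # 1 GC per nodelet
theoretical_peak = {8: 9600.0, 64: 76800.0}

def utilization(nodelets):
    """Fraction of theoretical peak bandwidth achieved on `nodelets` nodelets."""
    return measured[nodelets] / theoretical_peak[nodelets]

for n in sorted(measured):
    print(f"{n:2d} nodelets: {utilization(n):.1%} of theoretical peak")
```

Both configurations land at roughly one sixth of the theoretical peak, which works out to about 200 MB/s per nodelet.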
Thread Spawning in STREAM ADD
[Figure: memory bandwidth (GB/s, 0 to 12) vs. number of threads (64 to 4096) for serial_spawn, recursive_spawn, serial_remote_spawn, and recursive_remote_spawn. Global: 1 GC / nodelet, 64 nodelets.]
Bandwidth Limited by Computation
STREAM and Memory Bandwidths (BW in MB/s)
Operation                 Nodelets  Scale  Threads  BW
Current - arithmetic ops  1                         200
Ideal - all ld ops        1                         1,400
ADD (Measured)            8         30     512      1,600
ADD (Measured)            64        31     4096     12,790
NCDIMM                    8                         12,800
NCDIMM                    64                        102,400
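Per-GC peak from instruction counts:
$$175\,\text{MHz} \Rightarrow \frac{175\text{M cycles}}{\text{second}} \times \frac{1\ \text{instruction}}{\text{cycle}} \times \frac{3\ \text{mem ops}}{21\ \text{instructions}} \times \frac{8\ \text{Bytes}}{1\ \text{mem op}} = 200\ \text{MB/s}$$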
One GC per nodelet hits this peak. Eight GC/nodelet may hit the ideal peak.
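The per-GC arithmetic can be replayed in a few lines. This is only a sketch of the slide's back-of-the-envelope model; the constant and function names are ours, and treating the "ideal" case as a loop in which every instruction is an 8-byte load is our reading of "all ld ops".

```python
CLOCK_HZ = 175e6        # core clock from the slide: 175 MHz
INSTR_PER_CYCLE = 1     # one instruction per cycle
BYTES_PER_MEM_OP = 8    # each memory operation moves 8 bytes

def peak_mb_per_s(mem_ops, instructions):
    """Peak bandwidth in MB/s when `mem_ops` of every `instructions` touch memory."""
    bytes_per_s = CLOCK_HZ * INSTR_PER_CYCLE * mem_ops / instructions * BYTES_PER_MEM_OP
    return bytes_per_s / 1e6

current = peak_mb_per_s(3, 21)   # STREAM ADD loop: 3 memory ops in 21 instructions
ideal = peak_mb_per_s(1, 1)      # assumed: a loop made only of loads
print(current, ideal, 8 * current, 64 * current)
```

Scaling the 200 MB/s figure to 8 and 64 nodelets gives 1,600 and 12,800 MB/s, which lines up with the measured ADD rows in the table.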
Emu Pointer-Chasing Benchmark
Data-dependent loads, fine-grained access.[2]
[Figure: three 16-element lists (elements 0 to 15) showing the traversal order for each pattern: ordered; intra-block shuffle (weak locality); full block shuffle (weak locality).]
[2] Eric Hein, Young, Srinivas Eswar, Jiajia Li, Patrick Lavin, Vuduc, Riedy. “An Initial Characterization of the Emu Chick,” Workshop on Accelerators and Hybrid Exascale Systems (AsHES) 2018.
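The access patterns can be sketched as permutations of a linked list. The definitions below are our reading of the pattern names (shuffle inside each block, shuffle the order of blocks, or both) and may differ from the actual benchmark; `make_order`, `build_chain`, and `chase` are hypothetical helper names.

```python
import random

def make_order(n, block_size, mode, seed=0):
    """Return the order in which n list elements are visited.

    mode: "ordered", "intra_block_shuffle", "block_shuffle",
    or "full_block_shuffle" (definitions assumed for illustration).
    """
    rng = random.Random(seed)
    blocks = [list(range(start, min(start + block_size, n)))
              for start in range(0, n, block_size)]
    if mode in ("intra_block_shuffle", "full_block_shuffle"):
        for block in blocks:
            rng.shuffle(block)      # scramble elements inside each block
    if mode in ("block_shuffle", "full_block_shuffle"):
        rng.shuffle(blocks)         # scramble the order of the blocks
    return [i for block in blocks for i in block]

def build_chain(order):
    """Link elements so that following next_index visits them in `order`."""
    next_index = [0] * len(order)
    for here, there in zip(order, order[1:] + order[:1]):
        next_index[here] = there
    return next_index

def chase(next_index, start, steps):
    """Follow the chain; each load depends on the result of the previous one."""
    visited, i = [], start
    for _ in range(steps):
        visited.append(i)
        i = next_index[i]
    return visited

order = make_order(16, 4, "intra_block_shuffle")
chain = build_chain(order)
print(chase(chain, order[0], 16))
```

With a block size of 4, the intra-block variant keeps each group of four consecutive elements together, so a traversal stays within one block before moving on; that is the weak locality the slide refers to.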
x86 Pointer-Chasing Benchmark
[Figure: two panels (56 threads, 112 threads). Memory bandwidth (GB/s, 0 to 100) vs. block size (number of 16 B elements, 1 to 4M) for block_shuffle, intra_block_shuffle, and full_block_shuffle, with peak STREAM bandwidth marked.]
Haswell results: every pattern is different.[1]
Emu Pointer-Chasing Benchmark
[Figure: two panels (2048 threads, 4096 threads). Memory bandwidth (GB/s, 0 to 12) vs. block size (number of 16 B elements, 1 to 4M) for block_shuffle, intra_block_shuffle, and full_block_shuffle, with peak STREAM bandwidth marked.]
Mostly flat performance, high utilization.[1]
SpMV Layout, Synthetic (5pt Laplacian)
[Figure: CSR storage (row, col, v arrays) and three data layouts for Y = X x: Local (1 nodelet), 1D (8+ nodelets), 2D (8+ nodelets).]
[Figure: bandwidth (MB/s, 0 to 600) vs. number of rows (10² to 300²) for the Local, 1D, and 2D layouts.]
Single node, integer entries.[1]
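For readers unfamiliar with CSR, here is a minimal sketch of the kernel being measured: it builds the 5-point Laplacian on an n x n grid with integer entries and multiplies it by a vector. The function names are ours, the stencil values (4 on the diagonal, -1 off it) are the textbook choice rather than something the slides state, and the code ignores data layout across nodelets entirely.

```python
def laplacian_5pt_csr(n):
    """CSR arrays (row, col, v) for the 5-point Laplacian on an n x n grid."""
    row, col, v = [0], [], []
    stencil = ((-1, 0, -1), (0, -1, -1), (0, 0, 4), (0, 1, -1), (1, 0, -1))
    for i in range(n):
        for j in range(n):
            # Keep only the stencil points that fall inside the grid.
            for di, dj, val in stencil:
                ii, jj = i + di, j + dj
                if 0 <= ii < n and 0 <= jj < n:
                    col.append(ii * n + jj)
                    v.append(val)
            row.append(len(col))    # end of this row's entries
    return row, col, v

def spmv_csr(row, col, v, x):
    """Compute y = A x for a CSR matrix; x[col[k]] is the irregular access."""
    y = [0] * (len(row) - 1)
    for r in range(len(row) - 1):
        acc = 0
        for k in range(row[r], row[r + 1]):
            acc += v[k] * x[col[k]]
        y[r] = acc
    return y

row, col, v = laplacian_5pt_csr(10)     # 10^2 = 100 rows
y = spmv_csr(row, col, v, [1] * 100)
print(len(v), sum(y))
```

The reads of `row`, `col`, and `v` stream through memory, while `x[col[k]]` jumps around; on a machine where threads migrate on reads, that last access is the one whose placement matters.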
SpMV Synthetic, Replicated – Single node, 1 GC
[Figure: SpMV on the Emu Chick, single node. Bandwidth (MB/s, 0 to 900) vs. matrix size (MB, 0 to 3500) for 64, 128, 256, and 512 threads.]
Good bandwidth utilization with high thread counts and replicated x.
SpMV Synthetic – Single node, 1 GC
[Figure: SpMV on the Emu Chick, single node. Bandwidth (MB/s, 0 to 800) vs. matrix size (MB, 0 to 3500) for 256 and 512 threads.]
The 5pt Laplacian without replicating x bounces between migratory and non-migratory areas.[1]
SpMV Synthetic – Single node, 1 and 3 GC
[Figure: SpMV on the Emu Chick, single node, 512 threads. Bandwidth (MB/s, 0 to 1000) vs. number of rows (10² to 4000²) for 1 GC and 3 GC.]
3 GC version: half the nodelets, half the memory controllers.
SpMV Synthetic – Single node, 1 and 3 GC
[Figure: SpMV on the Emu Chick, single node, 512 threads. Bandwidth utilization (0.0 to 0.6) vs. matrix size (MB, 0 to 1600) for 1 GC and 3 GC.]
3 GC results demonstrate that SpMV is compute-bound from address computation.
SpMV Synthetic, Replicated – Multinode, 1 GC
[Figure: SpMV on the Emu Chick, multi-node. Bandwidth (MB/s, 0 to 7000) vs. matrix size (MB, 0 to 14000) for 64, 128, 256, 512, 1024, 2048, and 4096 threads.]
SpMV scales up to 50% of bandwidth for high thread counts and replicated x.[1]
SpMV Synthetic – Multinode, 1 GC
[Figure: SpMV on the Emu Chick, multi-node. Bandwidth (MB/s, 0 to 6000) vs. matrix size (MB, 0 to 14000) for 1024, 2048, and 4096 threads.]
But migrations for fetching x hurt with eight nodes.
SpMV Real-World Results, Replicated
SpMV multinode bandwidths (in MB/s) for real-world graphs (Tim Davis’s collection), along with matrix dimension, number of non-zeros (NNZ), and the average and maximum row degrees. Run with 4K threads.
Matrix    Rows   NNZ    Avg Deg  Max Deg  BW (MB/s)
mc2depi   526K   2.1M    3.99         4   3870.31
ecology1  1.0M   5.0M    5.00         5   4425.61
amazon03  401K   3.2M    7.99        10   4494.79
Delor295  296K   2.4M    8.12        11   4492.47
roadNet-  1.39M  3.84M   2.76        12   3811.57
mac_econ  206K   1.27M   6.17        44   3735.54
cop20k_A  121K   2.62M  21.65        81   4520.05
watson_2  352K   1.85M   5.25        93   3486.30
ca2010    710K   3.49M   4.91       141   4075.97
poisson3  86K    2.37M  27.74       145   4031.20
gyro_k    17K    1.02M  58.82       360   2446.36
vsp_fina  140K   1.1M    7.90       669   1335.59
Stanford  282K   2.31M   8.20     38606    287.82
ins2      309K   2.75M   8.89    309412     43.91
Breadth-First Search with Remote Writes
1. For each vertex in the frontier, try to set self as
parent of each neighbor vertex
• Done using remote writes, no migrations
• Last writer wins (benign race condition)
2. Double-buffer: Check to see which vertices acquired
a new parent, and add them to the queue
• This step is completely nodelet-local
• Caveat: also scans inactive vertices
BFS Pseudo-code
Listing 1: BFS algorithm using remote writes
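    parent[root] = root
    queue.push(root)
    while len(queue) > 0:
        # Step 1: each frontier vertex offers itself as parent to its neighbors
        for src in queue:
            for dst in out_edges(src):
                # Remote write; last writer wins
                new_parent[dst] = src
        queue.clear()
        # Step 2 (nodelet-local): adopt new parents and build the next frontier
        for v in range(num_vertices):
            if parent[v] == -1 and new_parent[v] != -1:
                parent[v] = new_parent[v]
                queue.push(v)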
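The listing can be turned into a small runnable program. The sketch below is serial Python, so the "remote writes" are ordinary stores and the race is only simulated by loop order; the adjacency-list input and the function name `bfs_remote_writes` are our assumptions, not the Cilk code that was benchmarked.

```python
def bfs_remote_writes(out_edges, root):
    """Level-synchronous BFS; out_edges[v] lists the neighbors of vertex v."""
    num_vertices = len(out_edges)
    parent = [-1] * num_vertices
    new_parent = [-1] * num_vertices
    parent[root] = root
    queue = [root]
    while queue:
        # Step 1: every frontier vertex offers itself as parent.
        # On the Chick these would be remote writes; here they are plain stores.
        for src in queue:
            for dst in out_edges[src]:
                new_parent[dst] = src       # last writer wins
        # Step 2: scan all vertices (including inactive ones) for new parents.
        queue = []
        for v in range(num_vertices):
            if parent[v] == -1 and new_parent[v] != -1:
                parent[v] = new_parent[v]
                queue.append(v)
    return parent

# Small undirected example: edges 0-1, 0-2, 1-3, 2-3, 3-4; vertex 5 is isolated.
graph = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3], []]
print(bfs_remote_writes(graph, 0))
```

Vertex 3 is reachable from both 1 and 2 at the same level, so either is a valid parent; which one ends up stored depends only on write order, which is why the race is benign.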
BFS on a Dynamic Data Structure
[Figure: BFS performance (MTEPS, 0 to 100; edge bandwidth in MB/s, 0 to 1500) vs. scale (15 to 21) for Emu single node - Cilk, Emu multi-node - Cilk, x86 Haswell - STINGER, and x86 Haswell - Cilk.]
Note: Streaming data structure, not statically optimized. But the graphs are Erdős-Rényi. RMAT: load imbalance.[3]
[3] Hein, Eswar, Abdurrahman Yaşar, Prasanth Chatarasi, Li, Young, Conte, Ümit Çatalyürek, Vuduc, Riedy, Bora Uçar. “Programming Strategies for Irregular Algorithms on the Emu Chick,” (in submission).
Labeled Subgraph Alignment
[Figure: speedup (0 to 50) vs. number of threads (1 to 128) for Multi-BLK, Multi-HCB, Single-BLK, and Single-HCB.]
gsaNA, the first parallel algorithm: strong scaling on the DBLP graph (2048 vertices). Block (BLK) vertex layout is slightly worse than Hilbert curve (HCB) layout.[3]
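The slides do not describe how either layout assigns vertices to memory, so the following is only an illustration of the general idea behind a Hilbert-curve ordering: cells of a 2D grid are numbered so that consecutive numbers are always neighboring cells. It uses the standard iterative Hilbert index computation; the function name and the 4 x 4 grid are our choices, and how gsaNA maps vertices to grid cells is not covered here.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along the Hilbert curve of an n x n grid (n a power of 2)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

n = 4
cells = [(x, y) for x in range(n) for y in range(n)]
hilbert_order = sorted(cells, key=lambda c: hilbert_index(n, *c))
row_major_order = sorted(cells, key=lambda c: c[1] * n + c[0])
print(hilbert_order)
print(row_major_order)
```

In the row-major ordering, the step from the end of one row to the start of the next jumps across the grid, whereas every step along the Hilbert ordering moves to an adjacent cell. That locality property is the usual motivation for such layouts.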
Lessons Learned i
• Finding appropriate metrics is difficult:
• Comparing ASICs (e.g. x86) to FPGA-based prototypes
can be unfair either way.
• Fraction of peak bandwidth for the idealized
problem?
• Measured peak is much lower than theoretical peak.
• The Chick is compute bound.
• SpMV: FLOP/s ∝ BW, level 2 sparse BLAS op.
• Graph500 BFS: TEPS ∝ BW
Lessons Learned ii
• Distilling observations on architecture ↔
programming model:
• Program data location for load (BW) balance.
• Remote memory operations v. migration exposes the
architecture.
• Migrations cost more than they appear to. Computation?
• Stack spills/access can cause ping-ponging.
• How does HW support for top-down (Cilk-ish) affect
bottom-up (UPC) PGAS programming?
• Memory allocation similar to UPC, SHMEM
• UPC++ rpc_ff v. Emu thread migration?
Integrating the Chick with Flexible Infrastructure
[Figure: infrastructure diagram with two groups, “Scheduling, Tools, and Admin” and “User Resources”. Labels: login, rg-adm, Slurm Ctl, toolbox (NFS), rg-db, Slurm DBD, fpaa-host, fpaa-dev, power-host, nvidia-tegra-1 .. nvidia-tegra-N, emu-dev, emu-chick, fpga-dev-1 .. N, fpga-hmc, fpga-intel. Key: schedulable resource, physical resource, VM, USB device.]
Powell, Riedy, Young, and Conte. “Wrangling Rogues: Managing Experimental Post-Moore Architectures.” https://arxiv.org/abs/1808.06334
• Available. Plans to integrate with NSF XSEDE.
• Scheduler being deployed.
• Incorporates Singularity and virtual machines for OS/library versioning.
Umbrella Project: CRNCH Rogues Gallery
A physical & virtual space for hosting novel computing
architectures, systems, and accelerators.
Host / manage remote access for novel architectures!
• Emu Chick
• FPGA + HMC: 3D stacked
• FPAA: Analog/Neuromorphic
Amortize effort and cost of trying novel architectures.
Break the “but it’s too much work” barrier.
http://crnch.gatech.edu/rogues-gallery
Acknowledgments
• Srinivas Eswar (GT CSE)
• Dr. Eric Hein (GT ECE ⇒ Emu)
• Patrick Lavin (GT CSE)
• Jiajia Li (GT CSE ⇒ PNNL)
• Abdurrahman Yaşar (GT CSE)
• Dr. Ümit Çatalyürek (GT CSE)
• Dr. Tom Conte (GT CS/ECE)
• Dr. Bora Uçar (ENS Lyon CNRS)
• Dr. Rich Vuduc (GT CSE)
• Dr. Jeffrey S. Young (GT CS)
Code:
• https://gitlab.com/crnch-rg (soon)
• https://github.com/ehein6/emu-microbench
