Characterization of the Emu Chick with Microbenchmarks
E. Jason Riedy
Center for Research into Novel Computing Hierarchies at Georgia Tech
23 January 2019

Outline
Project Background
Microbenchmarks
STREAM ADD and Pointer Chasing
Sparse Matrix – Vector Product (SpMV)
Breadth-First Search (BFS)
Labeled Subgraph Alignment
Observations

Memory-centric HPDA
• “Big data” platforms fare poorly v. a single thread
plus a large SSD. (McSherry, Isard, Murray. “Scalability!
But at what COST?” HotOS XV, 2015.)
• New architecture proposals are difficult to evaluate
via simulation and modeling alone.
Evaluate the FPGA-based prototype Emu Chick...
• But by what criteria?
• Chose memory bandwidth utilization.
• Memory-centric architecture
• BW is equivalent to MFLOP/s in SpMV, TEPS in BFS

Emu Technology’s PGAS Architecture
[Figure: architecture block diagram. One nodelet: Gossamer Core 1 ... Gossamer Core 4 and a memory-side processor. One node: 8 nodelets, a migration engine, a stationary core, disk I/O, and RapidIO links. One Chick: 64 nodelets.]
• Multithreaded multicore
• Memory-side “processor” for operations in narrow-channel DRAM
• Stationary core for OS
• Threads migrate in hardware on reads!
• Optimize for weak locality

Baseline: Emu STREAM ADD c[i] = a[i] + b[i]
GC Config    Nodelets  Scale  Threads  BW (MB/s)
1                   8     30      512   1,599.86
3                   4     29      384   1,288.39
1                  64     31     4096  12,790.31
3                  32     31     6144   7,241.07
Theor. Peak         8                   9,600
Theor. Peak        64                  76,800

STREAM results are used to compare bandwidth utilization for the current prototype. The 3 GC configuration is experimental and has (had?) half the memory controllers. [1]

[1] Eric Hein, Young, Srinivas Eswar, Jiajia Li, Patrick Lavin, Riedy, Vuduc, Conte. “A Microbenchmark Characterization of the Emu Chick,” (in submission, https://arxiv.org/abs/1809.07696 ).
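The utilization these results are meant to compare can be worked out directly from the table. The sketch below is only illustrative: it pairs the two 1 GC measurements with the theoretical peak listed for the same nodelet count, and the names `measured`, `theoretical_peak`, and `utilization` are made up for this example.

```python
# STREAM ADD numbers (MB/s) copied from the table, 1 GC configuration only.
# Keys are nodelet counts; the pairing of rows with peaks is an assumption
# based on matching nodelet counts.
measured = {8: 1599.86, 64: 12790.31}
theoretical_peak = {8: 9600.0, 64: 76800.0}


def utilization(nodelets):
    """Fraction of the theoretical peak that STREAM ADD reached."""
    return measured[nodelets] / theoretical_peak[nodelets]


for n in sorted(measured):
    print(f"{n:2d} nodelets: {utilization(n):.1%} of theoretical peak")
```

Both configurations land at roughly one sixth of the theoretical peak.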

Thread Spawning in STREAM ADD
[Figure: memory bandwidth (GB/s) vs. number of threads (64 to 4096) for four spawn strategies: serial_spawn, recursive_spawn, serial_remote_spawn, and recursive_remote_spawn.]
Global: 1 GC / nodelet, 64 nodelets
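One way to see why the spawn strategy can matter is to count how many spawns must happen one after another before the last worker starts. The sketch below is a toy model, not the benchmark code. It assumes that serial_spawn means a single parent launches every worker in a loop, and that recursive_spawn means each thread halves its range and hands one half to a child. Both readings are inferred from the legend names only, and the function names are hypothetical.

```python
def serial_spawn_depth(num_threads):
    """Sequential spawn steps when one parent launches every worker in a loop."""
    return num_threads


def recursive_spawn_depth(num_threads):
    """Sequential spawn steps when each thread halves its range and spawns a child."""
    depth = 0
    remaining = num_threads
    while remaining > 1:
        remaining = (remaining + 1) // 2  # each level halves the range a thread owns
        depth += 1
    return depth


for threads in (64, 512, 4096):
    print(threads, serial_spawn_depth(threads), recursive_spawn_depth(threads))
```

Under this model the serial chain grows linearly with the thread count while the recursive tree grows logarithmically. The model ignores migration and remote-spawn costs entirely.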

Bandwidth Limited by Computation
STREAM and Memory Bandwidths (BW in MB/s)
Operation                 Nodelets  Scale  Threads        BW
Current - arithmetic ops         1                       200
Ideal - all ld ops               1                     1,400
ADD (Measured)                   8     30      512     1,600
ADD (Measured)                  64     31     4096    12,790
NCDIMM                           8                    12,800
NCDIMM                          64                   102,400

Per-GC peak from instruction counts:
$$175\,\text{MHz} \Rightarrow \frac{175\text{M cycles}}{\text{second}} \times \frac{1\ \text{instruction}}{\text{cycle}} \times \frac{3\ \text{mem ops}}{21\ \text{instructions}} \times \frac{8\ \text{bytes}}{1\ \text{mem op}} = 200\ \text{MB/s}$$
One GC per nodelet hits this peak. Eight GC/nodelet may hit the ideal peak.
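The per-GC peak can be checked numerically. This sketch uses only the figures on the slide (175 MHz, one instruction per cycle, 3 memory operations per 21 instructions, 8 bytes per memory operation). The helper name `peak_mb_per_s` is hypothetical, and reading the "ideal" row as every instruction being an 8-byte load is an assumption.

```python
CLOCK_HZ = 175e6  # Gossamer Core clock rate from the slide


def peak_mb_per_s(mem_ops, instructions, bytes_per_op=8, ipc=1, clock_hz=CLOCK_HZ):
    """Peak bandwidth in MB/s for a loop issuing mem_ops memory operations
    out of every `instructions` instructions."""
    return clock_hz * ipc * (mem_ops / instructions) * bytes_per_op / 1e6


current = peak_mb_per_s(mem_ops=3, instructions=21)  # STREAM ADD loop as compiled
ideal = peak_mb_per_s(mem_ops=1, instructions=1)     # assumption: every instruction is a load

print(f"current: {current:.0f} MB/s, ideal: {ideal:.0f} MB/s")
```

The two values reproduce the 200 MB/s and 1,400 MB/s rows of the table. Eight nodelets at 200 MB/s each also matches the measured 1,600 MB/s.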

Emu Pointer-Chasing Benchmark
Data-dependent loads, fine-grained access. [2]
[Figure: three linked-list orderings over elements 0 to 15: ordered; intra-block shuffle (weak locality); full block shuffle (weak locality).]

[2] Eric Hein, Young, Srinivas Eswar, Jiajia Li, Patrick Lavin, Vuduc, Riedy. “An Initial Characterization of the Emu Chick,” Workshop on Accelerators and Hybrid Exascale Systems (AsHES) 2018.
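A pointer-chasing list with block structure can be generated along the following lines. This is a sketch under assumptions: the mode names follow the plot legends in this deck (block_shuffle, intra_block_shuffle, full_block_shuffle), but their exact definitions in the benchmark are inferred from the names, and every function name here is hypothetical.

```python
import random


def make_order(n, block_size, mode, seed=0):
    """Visit order for n elements split into blocks of block_size.

    mode: "ordered", "block_shuffle" (permute whole blocks),
    "intra_block_shuffle" (permute inside each block), or
    "full_block_shuffle" (both).
    """
    rng = random.Random(seed)
    blocks = [list(range(s, min(s + block_size, n))) for s in range(0, n, block_size)]
    if mode in ("intra_block_shuffle", "full_block_shuffle"):
        for block in blocks:
            rng.shuffle(block)
    if mode in ("block_shuffle", "full_block_shuffle"):
        rng.shuffle(blocks)
    return [i for block in blocks for i in block]


def build_links(order):
    """next_index[i] is the element visited after i; the list is circular."""
    next_index = [0] * len(order)
    for pos, i in enumerate(order):
        next_index[i] = order[(pos + 1) % len(order)]
    return next_index


def chase(next_index, start, steps):
    """Follow the links; each load depends on the result of the previous one."""
    visited = []
    i = start
    for _ in range(steps):
        visited.append(i)
        i = next_index[i]
    return visited


order = make_order(16, 4, "intra_block_shuffle")
print(chase(build_links(order), order[0], 16))
```

With an intra-block shuffle every access stays inside the current block until the block is exhausted, which is the weak locality the figure refers to.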

x86 Pointer-Chasing Benchmark
[Figure: two panels, 56 threads and 112 threads. Memory bandwidth (GB/s) vs. block size (number of 16B elements, 1 to 4M) for block_shuffle, intra_block_shuffle, and full_block_shuffle, with peak STREAM bandwidth marked.]
Haswell results: every pattern behaves differently. [1]

Emu Pointer-Chasing Benchmark
[Figure: two panels, 2048 threads and 4096 threads. Memory bandwidth (GB/s) vs. block size (number of 16B elements, 1 to 4M) for block_shuffle, intra_block_shuffle, and full_block_shuffle.]
Mostly flat performance, high utilization. [1]

SpMV Layout, Synthetic (5pt Laplacian)
[Figure: CSR storage (row, col, and v arrays) and three layouts of Y = X x: local (1 nodelet), 1D (8+ nodelets), and 2D (8+ nodelets).]
[Figure: bandwidth (MB/s) vs. number of rows for the local, 1D, and 2D data layouts.]
Single node, integer entries. [1]
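For reference, the kernel being laid out is a CSR sparse matrix-vector product on a 5-point Laplacian. The sketch below is plain serial Python with no Emu-specific placement. It reuses the array names row, col, v, and x from the diagram. The stencil values (4 on the diagonal, -1 for each neighbor) and the function names are assumptions made for this example.

```python
def laplacian_5pt_csr(n):
    """CSR arrays (row, col, v) for a 5-point Laplacian on an n x n grid.

    Assumed stencil: 4 on the diagonal, -1 for each existing grid neighbor.
    """
    row, col, v = [0], [], []
    for i in range(n):
        for j in range(n):
            # Neighbors visited in ascending column order: up, left, self, right, down.
            for di, dj, val in ((-1, 0, -1), (0, -1, -1), (0, 0, 4), (0, 1, -1), (1, 0, -1)):
                ii, jj = i + di, j + dj
                if 0 <= ii < n and 0 <= jj < n:
                    col.append(ii * n + jj)
                    v.append(val)
            row.append(len(col))  # row[r + 1] marks the end of row r
    return row, col, v


def spmv_csr(row, col, v, x):
    """y = A x for a matrix A stored in CSR form."""
    y = [0] * (len(row) - 1)
    for r in range(len(y)):
        for k in range(row[r], row[r + 1]):
            y[r] += v[k] * x[col[k]]  # x[col[k]] is the irregular, data-dependent access
    return y


row, col, v = laplacian_5pt_csr(3)
print(spmv_csr(row, col, v, [1] * 9))
```

The reads of row, col, and v stream through memory, while the reads of x jump around. That is why the layout and replication of x matter in the slides that follow.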

SpMV Synthetic, Replicated – Single node, 1 GC
[Figure: SpMV (Emu Chick, single node). Bandwidth (MB/s) vs. matrix size (MB) for 64, 128, 256, and 512 threads.]
Good bandwidth utilization with high thread counts and replicated x.

SpMV Synthetic – Single node, 1 GC
[Figure: SpMV (Emu Chick, single node). Bandwidth (MB/s) vs. matrix size (MB) for 256 and 512 threads.]
The 5pt Laplacian without replicating x bounces between migratory and non-migratory areas. [1]

SpMV Synthetic – Single node, 1 and 3 GC
[Figure: SpMV (Emu Chick, single node, 512 threads). Bandwidth (MB/s) vs. number of rows for 1 GC and 3 GC.]
3 GC version: half the nodes, half the memory controllers

SpMV Synthetic – Single node, 1 and 3 GC
[Figure: SpMV (Emu Chick, single node, 512 threads). Bandwidth utilization vs. matrix size (MB) for 1 GC and 3 GC.]
3 GC results demonstrate that SpMV is compute-bound from address computation.

SpMV Synthetic, Replicated – Multinode, 1 GC
[Figure: SpMV (Emu Chick, multi-node). Bandwidth (MB/s) vs. matrix size (MB) for 64, 128, 256, 512, 1024, 2048, and 4096 threads.]
SpMV scales up to 50% of bandwidth for high thread counts and replicated x. [1]

SpMV Synthetic – Multinode, 1 GC
[Figure: SpMV (Emu Chick, multi-node). Bandwidth (MB/s) vs. matrix size (MB) for 1024, 2048, and 4096 threads.]
But migrations for fetching x hurt with eight nodes.

SpMV Real-World Results, Replicated
SpMV multinode bandwidths (in MB/s) for real-world graphs (Tim Davis’s collection) along with matrix dimension, number of non-zeros (NNZ), and the average and maximum row degrees. Run with 4K threads.

Matrix     Rows    NNZ     Avg Deg  Max Deg  BW (MB/s)
mc2depi    526K    2.1M       3.99        4    3870.31
ecology1   1.0M    5.0M       5.00        5    4425.61
amazon03   401K    3.2M       7.99       10    4494.79
Delor295   296K    2.4M       8.12       11    4492.47
roadNet-   1.39M   3.84M      2.76       12    3811.57
mac_econ   206K    1.27M      6.17       44    3735.54
cop20k_A   121K    2.62M     21.65       81    4520.05
watson_2   352K    1.85M      5.25       93    3486.30
ca2010     710K    3.49M      4.91      141    4075.97
poisson3   86K     2.37M     27.74      145    4031.20
gyro_k     17K     1.02M     58.82      360    2446.36
vsp_fina   140K    1.1M       7.90      669    1335.59
Stanford   282K    2.31M      8.20    38606     287.82
ins2       309K    2.75M      8.89   309412      43.91

Breadth-First Search with Remote Writes
1. For each vertex in the frontier, try to set self as
parent of each neighbor vertex
• Done using remote writes, no migrations
• Last writer wins (benign race condition)
2. Double-buffer: Check to see which vertices acquired
a new parent, and add them to the queue
• This step is completely nodelet-local
• Caveat: also scans inactive vertices
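The "last writer wins" race is benign because every writer in a round sits in the current frontier, so whichever write survives still gives the vertex a parent one level closer to the root. The sketch below imitates that on a single machine by shuffling the order of the writes and comparing the resulting tree depths against an ordinary BFS. It is an illustration only: the function names are hypothetical and the adjacency-list graph is a made-up example.

```python
import random
from collections import deque


def bfs_remote_writes(adj, root, rng):
    """Two-phase BFS: racy parent proposals, then a scan that commits them."""
    n = len(adj)
    parent = [-1] * n
    new_parent = [-1] * n
    parent[root] = root
    queue = [root]
    while queue:
        # Shuffle to imitate an arbitrary arrival order of the remote writes.
        writes = [(dst, src) for src in queue for dst in adj[src]]
        rng.shuffle(writes)
        for dst, src in writes:
            new_parent[dst] = src  # last writer wins
        queue = []
        for v in range(n):  # scans every vertex, including inactive ones
            if parent[v] == -1 and new_parent[v] != -1:
                parent[v] = new_parent[v]
                queue.append(v)
    return parent


def depth(parent, v):
    """Hops from v up to the root (assumes v was reached)."""
    d = 0
    while parent[v] != v:
        v = parent[v]
        d += 1
    return d


def reference_levels(adj, root):
    """Ordinary queue-based BFS levels, used only for comparison."""
    level = {root: 0}
    todo = deque([root])
    while todo:
        u = todo.popleft()
        for w in adj[u]:
            if w not in level:
                level[w] = level[u] + 1
                todo.append(w)
    return level


adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]  # small undirected example graph
levels = reference_levels(adj, 0)
for seed in range(5):
    parent = bfs_remote_writes(adj, 0, random.Random(seed))
    print(parent, all(depth(parent, v) == levels[v] for v in range(len(adj))))
```

Vertex 3 may end up with parent 1 or parent 2 depending on the write order, but its depth in the tree is the same either way.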

BFS Pseudo-code
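Listing 1: BFS algorithm using remote writes
    def bfs(out_edges, num_vertices, root):
        parent = [-1] * num_vertices
        new_parent = [-1] * num_vertices
        parent[root] = root  # mark the root as visited
        queue = [root]
        while len(queue) > 0:
            for src in queue:
                for dst in out_edges(src):
                    # Remote write: last writer wins
                    new_parent[dst] = src
            # Double-buffer: rebuild the queue from vertices that gained a parent
            queue = []
            for v in range(num_vertices):
                if parent[v] == -1:
                    if new_parent[v] != -1:
                        parent[v] = new_parent[v]
                        queue.append(v)
        return parent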

BFS on a Dynamic Data Structure
[Figure: MTEPS and edge bandwidth (MB/s) vs. scale (15 to 21) for Emu single node - Cilk, Emu multi-node - Cilk, x86 Haswell - STINGER, and x86 Haswell - Cilk.]
Note: Streaming data structure, not statically optimized. But Erdős-Rényi graphs. RMAT: Load imbalance. [3]

[3] Hein, Eswar, Abdurrahman Yasar, Prasanth Chatarasi, Li, Young, Conte, Ümit Çatalyürek, Vuduc, Riedy, Bora Uçar. “Programming Strategies for Irregular Algorithms on the Emu Chick,” (in submission).

Labeled Subgraph Alignment
[Figure: speedup vs. number of threads (1 to 128) for Multi-BLK, Multi-HCB, Single-BLK, and Single-HCB.]
gsaNA, the first parallel algorithm, strong scaling on DBLP graph (2048 vertices). Block (BLK) vertex layout is slightly worse than Hilbert curve (HCB) layout. [3]
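To make the two layout names concrete, the sketch below orders the cells of a small 2D grid in two ways: along a Hilbert curve and in row-major order. It uses the standard bit-manipulation conversion from (x, y) to Hilbert index. How gsaNA actually places graph vertices on a plane and assigns them to nodelets is not described in these slides, so the row-major stand-in for the block layout and both function names are assumptions.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along the Hilbert curve on an n x n grid
    (n must be a power of two)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def row_major_index(n, x, y):
    """Row-major ordering, used here as a simple stand-in for a block layout."""
    return y * n + x


n = 4
cells = [(x, y) for y in range(n) for x in range(n)]
by_hilbert = sorted(cells, key=lambda c: hilbert_index(n, *c))
by_rows = sorted(cells, key=lambda c: row_major_index(n, *c))
print(by_hilbert)
print(by_rows)
```

Consecutive cells on the Hilbert curve are always grid neighbors, whereas the row-major order jumps across the grid at the end of every row.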

Lessons Learned i
• Finding appropriate metrics is difficult:
• Comparing ASICs (e.g. x86) to FPGA-based prototypes
can be unfair either way.
• Fraction of peak bandwidth for the idealized
problem?
• Measured peak is much lower than theoretical peak.
• The Chick is compute bound.
• SpMV: FLOP/s ∝ BW, level 2 sparse BLAS op.
• Graph500 BFS: TEPS ∝ BW

Lessons Learned ii
• Distilling observations on architecture ↔
programming model:
• Program data location for load (BW) balance.
• Remote memory operations v. migration exposes the
architecture.
• Migrations cost more than they appear to. Computation?
• Stack spills/access can cause ping-ponging.
• How does HW support for top-down (Cilk-ish) affect
bottom-up (UPC) PGAS programming?
• Memory allocation similar to UPC, SHMEM
• UPC++ rpc_ff v. Emu thread migration?

Integrating the Chick with Flexible Infrastructure
[Figure: Rogues Gallery infrastructure diagram. Scheduling, tools, and admin: login, rg-adm, rg-db, toolbox (NFS), Slurm Ctl, Slurm DBD. User resources: emu-dev, emu-chick, fpaa-host, fpaa-dev, power-host, nvidia-tegra-1 to nvidia-tegra-N, fpga-dev-1 ..N, fpga-hmc, fpga-intel. Key: schedulable resource, physical resource, VM, USB device.]
Powell, Riedy, Young, and Conte. “Wrangling Rogues: Managing
Experimental Post-Moore Architectures.”
https://arxiv.org/abs/1808.06334
• Available. Plans to integrate with NSF XSEDE.
• Scheduler being deployed.
• Incorporates Singularity and virtual machines for OS/library versioning.

Umbrella Project: CRNCH Rogues Gallery
A physical & virtual space for hosting novel computing
architectures, systems, and accelerators.
Host / manage remote access for novel architectures!
• Emu Chick
• FPGA + HMC: 3D stacked
• FPAA: Analog/Neuromorphic
Amortize effort and cost of trying novel architectures.
Break the “but it’s too much work” barrier.
http://crnch.gatech.edu/rogues-gallery

Acknowledgments
• Srinivas Eswar (GT CSE)
• Dr. Eric Hein (GT ECE ⇒ Emu)
• Patrick Lavin (GT CSE)
• Jiajia Li (GT CSE ⇒ PNNL)
• Abdurrahman Yaşar (GT CSE)
• Dr. Ümit Çatalyürek (GT CSE)
• Dr. Tom Conte (GT CS/ECE)
• Dr. Bora Uçar (ENS Lyon CNRS)
• Dr. Rich Vuduc (GT CSE)
• Dr. Jeffrey S. Young (GT CS)
Code:
• https://gitlab.com/crnch-rg (soon)
• https://github.com/ehein6/emu-microbench
