EXTRAV: BOOSTING GRAPH PROCESSING NEAR STORAGE WITH A COHERENT ACCELERATOR
Jinho Lee†
Heesu Kim*
Sungjoo Yoo*
Kiyoung Choi*
H. Peter Hofstee†
Gi-Joon Nam†
Damir Jemsek†
Mark Nutter†
†IBM Research
*Seoul National University
GRAPHS
• Can be found in many areas
‒Twitter followers
‒Facebook friends
‒Web pages linking each other
‒Research paper citations
• Essential to Data-Mining and Machine Learning
‒Identify influential people and information
‒Find communities
‒Target ads and products
‒Model complex data dependencies
WHAT CAN WE DO?
• Shortest path
‒How close am I to another person?
• PageRank
‒How much influence does someone have on Twitter?
• Sparse matrix computation
‒Sparse matrices can be represented as graphs
HOW TO PROCESS A LARGE GRAPH?
• The graph does not fit in the memory
• Distributed approach
• Split the graph and put each partition in the main memory of a node in a cluster
• Partitioning is difficult
‒Load imbalance
‒Communication overhead
HOW TO PROCESS A LARGE GRAPH? (2)
• The graph does not fit in the memory
• Out-of-memory (single machine) approach
• Put the graph in storage (e.g., an HDD) and process it piece by piece
‒Everything done within a single machine
• Minimizing read/write is the key
‒Buffer management overhead
‒Fundamentally limited by the storage bandwidth
TARGET GRAPH MODEL – CSR (COMPRESSED SPARSE ROW)
• Compact representation for graphs

  indices:             0 1 2
  R: Row-offset (V):   0 2 3
  indices:             0 1 2 3 4
  C: Column index (E): 1 2 2 0 2
  V: Value (property): a b c

  [Figure: example graph with nodes 0(A), 1(B), 2(C)]

• Focus on fixed structure rather than their properties
  ‒Unlike the property graph model of neo4j
  ‒Add/delete not easy
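To make the CSR layout concrete, here is a small C++ sketch that stores the example above and iterates over a node's neighbors. It follows the slide's convention of one row offset per node, with the last list ending at the end of C; this is only an illustration, not ExtraV's on-storage format.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Minimal CSR sketch matching the example (nodes 0(A), 1(B), 2(C)).
struct CsrGraph {
    std::vector<uint32_t> R;  // row offsets, one per node
    std::vector<uint32_t> C;  // column indices (edge targets)
    std::vector<char>     V;  // per-node value/property

    // Visit node u's neighbors: C[R[u]] .. C[end-1].
    template <typename Fn>
    void for_each_neighbor(uint32_t u, Fn fn) const {
        uint32_t end = (u + 1 < R.size()) ? R[u + 1] : uint32_t(C.size());
        for (uint32_t i = R[u]; i < end; ++i) fn(C[i]);
    }
};

int main() {
    CsrGraph g{{0, 2, 3}, {1, 2, 2, 0, 2}, {'a', 'b', 'c'}};
    for (uint32_t u = 0; u < g.R.size(); ++u) {
        std::printf("%u(%c):", u, g.V[u]);
        g.for_each_neighbor(u, [](uint32_t v) { std::printf(" %u", v); });
        std::printf("\n");
    }
    return 0;
}
```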
PREVIEW
• ExtraV
  ‒ Extract + Traverse
  ‒ Graph processing near storage with a coherent accelerator
• Graph virtualization to minimize buffer management overhead
  ‒ General parts done at the accelerator, app-specific parts at the CPU
  ‒ Various optimizations possible under the abstraction
• Optimization 1: Expand-and-filter
  ‒ Apply graph-specific compression to increase effective bandwidth
  ‒ Filter the decompressed data
• Optimization 2: Multi-versioning
  ‒ Gather graphs from certain time points
CAPI - ACCELERATORS
1. CPU has the input data in memory
2. The data are copied to the accelerator (through a device driver)
3. The accelerator produces the result
4. The output data are copied back into memory
• The accelerator's speedup is diminished by the copying overhead
• The range of target kernels is limited
[Diagram: CPU and memory with a conventional accelerator (e.g., FFT, MP3); parameters, input data, and output data are copied back and forth between host memory and the accelerator]
CAPI - ACCELERATORS WITH COHERENT INTERFACE
1. CPU passes a pointer to the data to the accelerator
2. The accelerator writes to the system memory directly
• Fine-grained communication (saves multiple µs)
• Easy to design
[Diagram: CPU, memory, and a coherent accelerator sharing parameters, input data, and output data in place, with no copies]
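As a rough illustration of the pointer-passing model (not the actual CAPI/libcxl API), the host can build a small descriptor in ordinary memory and hand the accelerator a single pointer to it; because the interface is coherent, the accelerator dereferences host addresses and writes results back with no explicit copies. The struct layout and field names below are assumptions.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical work-element descriptor shared coherently with the accelerator.
// Field names and layout are illustrative; a real CAPI design defines its own
// descriptor and attaches it through the platform's accelerator library.
struct WorkElement {
    const uint64_t*   input;      // host virtual address of the input data
    uint64_t*         output;     // host virtual address of the result buffer
    uint64_t          num_items;  // number of elements to process
    volatile uint64_t done;       // accelerator sets this flag when finished
};

int main() {
    std::vector<uint64_t> in(1024, 1), out(1024, 0);

    // The only thing the CPU hands over is &wed; no buffers are copied.
    WorkElement wed{in.data(), out.data(), in.size(), 0};

    // attach_accelerator(&wed);            // platform-specific call, omitted here
    // while (wed.done == 0) { /* poll */ } // accelerator reads in[], writes out[]

    (void)wed;
    return 0;
}
```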
EXTRAV
• ExtraV : Extract + Traverse
• A coherent accelerator in front of the storage
[Diagram: coherent ACC (FPGA) placed between the storage holding the graph (nodes 0, 1, 2) and the CPU/memory]
GRAPH VIRTUALIZATION
• The ACC interprets the graph data and puts it into main memory in the order the CPU is going to process it
• The format within the storage or the ACC is hidden
[Diagram: the ACC (FPGA) reads the graph (0, 1, 2) from storage and writes the node stream 0 1 2 into main memory]
GRAPH VIRTUALIZATION
• Buffer management overhead reduced
• The ACC can perform extra optimizations under the abstraction layer
[Diagram: same setup as before; the storage format and the ACC's processing are hidden from the CPU]
OPT1: EXPAND-AND-FILTER
• Performance bottlenecks
  ‒ PCIe: 3 GB/s
  ‒ Storage: 100 MB/s – 2 GB/s

[Diagram: traversal engine in the ACC (FPGA) between the storage and the CPU/memory]
OPT1: EXPAND-AND-FILTER
• The graph data are compressed inside the storage

[Diagram: 4 KB compressed blocks are read from storage into the ACC's decompression and traversal pipeline]
OPT1: EXPAND-AND-FILTER
• Decompressed in the ACC

[Diagram: 4 KB of compressed data from storage expands to 16 KB after decompression in the ACC]
OPT1: EXPAND-AND-FILTER
• The traversal engine selects the needed data from the decompressed stream (expand-and-filter)
• The effective bandwidth from storage increases

[Diagram: 4 KB of compressed data expands to 16 KB in the ACC; after filtering, 8 KB is sent on to memory]
COMPRESSION
• Variable length integer
– Small integers take small space
• Interval coding
– Store consecutive numbers as (start, length)
– 3, 4, 5, 6, 7 -> (3, 5)
• Store differences
– Differences are smaller than the original numbers
– 1, 3, 7, 8 -> 1, +2, +4, +1
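A minimal sketch of how these three ideas combine on one sorted neighbor list: runs of consecutive IDs become (start, length) intervals, the leftover IDs are delta-coded, and all numbers are packed as variable-length integers. The run-length threshold and byte layout here are assumptions, not ExtraV's exact format.

```cpp
#include <cstdint>
#include <cstdio>
#include <utility>
#include <vector>

// Variable-length integer: 7 data bits per byte, high bit means "more follows".
static void put_varint(std::vector<uint8_t>& out, uint64_t v) {
    while (v >= 0x80) { out.push_back(uint8_t(v) | 0x80); v >>= 7; }
    out.push_back(uint8_t(v));
}

static std::vector<uint8_t> encode(const std::vector<uint64_t>& nbrs) {
    std::vector<std::pair<uint64_t, uint64_t>> intervals;  // (start, length)
    std::vector<uint64_t> residues;                        // everything else
    for (size_t i = 0; i < nbrs.size();) {
        size_t j = i + 1;
        while (j < nbrs.size() && nbrs[j] == nbrs[j - 1] + 1) ++j;
        if (j - i >= 3) intervals.push_back({nbrs[i], j - i});
        else residues.insert(residues.end(), nbrs.begin() + i, nbrs.begin() + j);
        i = j;
    }
    std::vector<uint8_t> out;
    put_varint(out, intervals.size());
    uint64_t prev = 0;
    for (auto [start, len] : intervals) {   // interval starts stored as deltas
        put_varint(out, start - prev); put_varint(out, len);
        prev = start;
    }
    put_varint(out, residues.size());
    prev = 0;
    for (uint64_t r : residues) { put_varint(out, r - prev); prev = r; }  // deltas
    return out;
}

int main() {
    std::vector<uint64_t> nbrs = {2, 5, 6, 7, 8, 12, 13, 14, 30};  // sorted list
    std::printf("%zu neighbors -> %zu bytes\n", nbrs.size(), encode(nbrs).size());
    return 0;
}
```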
OPT2: MULTI-VERSION
[Diagram: a multi-version stage added between the decompression and traversal stages in the ACC (FPGA), between storage and CPU/memory]
OPT2: MULTI-VERSION
[Diagram: same pipeline; the example graph (nodes 0(A), 1(B), 2(C)) evolves over snapshots Day 0, Day 1, and Day 2]
OPT2: MULTI-VERSION
• Save deltas
• Provide coarse-grained shortcuts

[Figure: Day 0 stored as the base graph; Day 1 and Day 2 stored as deltas (△Day 1, △Day 2), with shortcuts linking the versions]
MULTI-VERSIONED GRAPH
• Naïve approach: store the full CSR arrays for every version

  Version 0
    index:               0 1 2
    R: Row-offset (V):   0 1 x
    index:               0 1 2
    C: Column index (E): 1 0 2

  Version 1
    index:               0 1 2
    R: Row-offset (V):   0 1 x
    index:               0 1 2 3
    C: Column index (E): 1 0 2 1

  [Figure: example graph (nodes 0(A), 1(B), 2(C)) at Version 0 and Version 1]
MULTI-VERSIONED GRAPH
• Re-use column indices: a version's row entries can link back to an older version's unchanged adjacency lists

  [Tables: Version 0 and Version 1 row-offset arrays holding links such as L0,0, L0,1, L1,0 into the column-index lists, so unchanged lists are stored only once]
MULTI-VERSIONED GRAPH
• Re-use row offsets by using coarse-grained indirection

  [Tables: an indirection table maps coarse vertex ranges (0-1, 2-3) to row-offset blocks, so blocks of R that did not change are shared across versions; column-index lists are shared as on the previous slide]
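A minimal sketch of the delta idea: the base graph keeps full adjacency lists, each later version stores only the lists that changed, and a lookup walks back toward the base (which the coarse-grained shortcuts accelerate in the real design). The data layout and names are illustrative assumptions, not ExtraV's on-storage format.

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <vector>

struct VersionedGraph {
    std::vector<std::vector<uint32_t>> base;  // version 0: full adjacency lists
    std::vector<std::unordered_map<uint32_t, std::vector<uint32_t>>> deltas;  // 1..N

    // Newest adjacency list of `node` at `version`: walk back toward the base.
    // A real design adds coarse-grained shortcuts so this walk skips versions.
    const std::vector<uint32_t>& neighbors(uint32_t node, size_t version) const {
        for (size_t v = version; v >= 1; --v) {
            auto it = deltas[v - 1].find(node);
            if (it != deltas[v - 1].end()) return it->second;
        }
        return base[node];
    }
};

int main() {
    // Example graph 0(A), 1(B), 2(C): Day 0 is the base, Day 1 changes node 1's list.
    VersionedGraph g;
    g.base   = {{1}, {0}, {2}};
    g.deltas = {{{1, {0, 2}}}};  // delta for version 1: node 1 now points to 0 and 2

    for (uint32_t n : g.neighbors(1, 1)) std::printf("%u ", n);  // prints: 0 2
    std::printf("\n");
    return 0;
}
```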
PROGRAMMING MODEL
1. CPU requests the number of nodes, and the ACC replies

   CPU → ACC: Start(G, level)
   ACC → CPU: Num_nodes = 100
PROGRAMMING MODEL
1. CPU requests the number of nodes, and the ACC replies
2. CPU requests a sequential traversal of in_edges

   CPU → ACC: Start(G, level)
   ACC → CPU: Num_nodes = 100
   CPU → ACC: Seq, in_edges
PROGRAMMING MODEL
1. CPU requests the number of nodes, and the ACC replies
2. CPU requests a sequential traversal of in_edges
3. ACC streams the nodes (into main memory)

   CPU → ACC: Start(G, level)
   ACC → CPU: Num_nodes = 100
   CPU → ACC: Seq, in_edges
   ACC → memory: (0: 1, 3, 4) (1: 2, 3) (2: 0) …
PROGRAMMING MODEL
1. CPU requests the number of nodes, and the ACC replies
2. CPU requests a sequential traversal of in_edges
3. ACC streams the nodes
4. CPU processes pr_nxt[0] = f(pr[1], pr[3], pr[4])

   CPU → ACC: Start(G, level)
   ACC → CPU: Num_nodes = 100
   CPU → ACC: Seq, in_edges
   ACC → memory: (0: 1, 3, 4) (1: 2, 3) (2: 0) …
PROGRAMMING MODEL
1. CPU requests the number of nodes, and the ACC replies
2. CPU requests a sequential traversal of in_edges
3. ACC streams the nodes
4. CPU processes pr_nxt[0] = f(pr[1], pr[3], pr[4])
5. For the next iteration, the CPU asks again

   CPU → ACC: Start(G, level)
   ACC → CPU: Num_nodes = 100
   CPU → ACC: Seq, in_edges
   ACC → memory: (0: 1, 3, 4) (1: 2, 3) (2: 0) …
   CPU → ACC: Seq, in_edges
   ACC → memory: (0: 1, 3, 4) (1: 2, 3) (2: 0) …
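A host-side sketch of this loop, with an in-memory adjacency list standing in for the ACC-produced stream so it runs as-is; in the real system the (node: in-neighbors) records are written into main memory by the accelerator. Names such as f, pr, and pr_nxt follow the slide; everything else is an assumption.

```cpp
#include <cstdint>
#include <cstdio>
#include <utility>
#include <vector>

using Stream = std::vector<std::pair<uint32_t, std::vector<uint32_t>>>;

int main() {
    // Stand-in for "Seq, in_edges": (0: 1, 3, 4) (1: 2, 3) (2: 0) (3: 2) (4: 0)
    Stream stream = {{0, {1, 3, 4}}, {1, {2, 3}}, {2, {0}}, {3, {2}}, {4, {0}}};
    uint32_t num_nodes = 5;                       // step 1: Num_nodes reply

    std::vector<double> pr(num_nodes, 1.0 / num_nodes), pr_nxt(num_nodes, 0.0);
    auto f = [](double x) { return 0.85 * x; };   // placeholder per-neighbor function

    for (int iter = 0; iter < 3; ++iter) {        // steps 2 and 5: ask each iteration
        for (const auto& [u, in_nbrs] : stream) { // step 3: consume the streamed nodes
            double acc = 0.0;
            for (uint32_t v : in_nbrs) acc += f(pr[v]);
            pr_nxt[u] = acc;                      // step 4: pr_nxt[u] = f(pr[v], ...)
        }
        pr.swap(pr_nxt);
    }
    for (uint32_t u = 0; u < num_nodes; ++u) std::printf("pr[%u] = %.4f\n", u, pr[u]);
    return 0;
}
```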
PROGRAMMING MODEL (FILTERING)
• Some neighbors are not needed by the CPU
• Keep a filter bitmap in memory; through CAPI, the ACC can access it

   CPU → ACC: Start(G, level)
   ACC → CPU: Num_nodes = 100
   CPU → ACC: Seq, in_edges, filter (filter bitmap in memory: 001101)
   ACC → memory: (0: 1, 3, 4) (1: 2, 3) (2: 0) …
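A sketch of how the filter bitmap could be applied (an assumed mechanism, not the actual RTL): one bit per vertex in host memory, tested by the coherent ACC so that filtered-out neighbors are never written to the output stream.

```cpp
#include <cstdint>
#include <initializer_list>
#include <vector>

static bool bit_set(const std::vector<uint64_t>& bitmap, uint32_t v) {
    return (bitmap[v >> 6] >> (v & 63)) & 1ULL;   // one bit per vertex
}

// Keep only the neighbors whose filter bit is set.
static std::vector<uint32_t> filter_neighbors(const std::vector<uint32_t>& nbrs,
                                              const std::vector<uint64_t>& bitmap) {
    std::vector<uint32_t> kept;
    for (uint32_t v : nbrs)
        if (bit_set(bitmap, v)) kept.push_back(v);
    return kept;
}

int main() {
    std::vector<uint64_t> bitmap(1, 0);
    for (uint32_t v : {0u, 2u, 3u}) bitmap[v >> 6] |= 1ULL << (v & 63);  // active set
    return filter_neighbors({1, 3, 4}, bitmap).size() == 1 ? 0 : 1;      // keeps {3}
}
```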
PROGRAMMING MODEL (Q MODE)
• An active vertex list is used by some implementations of PageRank
• Some apps (BFS or Dijkstra) require a frontier queue or priority list

   CPU → ACC: Start(G, level)
   ACC → CPU: Num_nodes = 100
   CPU → ACC: Queue, in_edges (queue in memory: 0, 7, 9)
   ACC → memory: (0: 1, 3, 4) (7: 3, 7) (9: 0) …
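A sketch of queue mode for BFS: the CPU hands the current frontier to the ACC and receives adjacency lists only for those vertices. Here an in-memory graph stands in for the ACC so the example runs; all names are illustrative assumptions.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    std::vector<std::vector<uint32_t>> out_edges = {{1, 3, 4}, {2, 3}, {0}, {2}, {0}};
    const uint32_t src = 0;
    const uint32_t n = static_cast<uint32_t>(out_edges.size());

    std::vector<int> level(n, -1);
    std::vector<uint32_t> frontier = {src};       // the "Queue" handed to the ACC
    level[src] = 0;

    for (int depth = 0; !frontier.empty(); ++depth) {
        std::vector<uint32_t> next;
        // Stand-in for "Queue, out_edges": the ACC would stream (u: neighbors)
        // for exactly the queued vertices.
        for (uint32_t u : frontier)
            for (uint32_t v : out_edges[u])
                if (level[v] < 0) { level[v] = depth + 1; next.push_back(v); }
        frontier.swap(next);
    }
    for (uint32_t u = 0; u < n; ++u) std::printf("level[%u] = %d\n", u, level[u]);
    return 0;
}
```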
IMPLEMENTATION
• 6-stage pipeline
• Designed using Vivado-HLS from C++ models
IMPLEMENTATION
• The FPGA contains the PSL and the AFU
• The AFU has 16 worker modules, an arbiter, and a scheduler
• Stream buffers exploit access locality
• Some stream buffers are connected to the storage, others to the memory
• Prefetching is used to draw full bandwidth
EVALUATION PLATFORM
• POWER8 processor (3.7 GHz, 20 cores)
• Alpha Data CAPI development card with a Xilinx UltraScale FPGA
  – The design runs at 125 MHz
• The card and the system are connected via PCIe Gen3
• SSD as storage (max 500 MB/s)
• Page cache limited to 4 GB
• Four graph algorithms: Teenager Follower, PageRank, BFS, Connected Components
SYNTHESIS RESULT (KU3)
• The PSL consumes about a quarter of the FPGA resources
• About 80% of CLBs and 40% of registers are used
  – Execution pipelines consume most of the CLBs
  – Stream buffers consume the majority of the registers
COMPRESSION RATIO
• Compression ratios of roughly 2x to 9x
• The more intervals, the better the compression
• The closer the neighbors, the better the compression (smaller differences)

  Graph        Type            Uncompressed  Compressed  Ratio
  Gsh-tpd      Subdomain       9.89 GB       2.47 GB     4.00x
  Arabic-2002  Subdomain       10.2 GB       1.09 GB     9.36x
  Twitter      Social network  23.7 GB       9.8 GB      2.41x
  LiveJournal  Social network  1.23 GB       456 MB      2.70x
  Friendster   Social network  30.6 GB       13.8 GB     2.22x
  MS-ref       Citation        9.27 GB       4.40 GB     2.10x
  Road-CA      Road network    145 MB        76 MB       1.90x
PERFORMANCE RESULTS
• Multiple-times speedup compared to state-of-the-art frameworks
  – Up to 2-4x over FlashGraph and Llama
  – More than 10x over X-Stream
• Graphs with higher compression ratios show larger speedups

[Charts: PageRank and BFS execution times across graphs and frameworks]
PERFORMANCE RESULTS
• Stream buffers give about a 10x improvement
• Prefetching gives an additional 3x
PERFORMANCE RESULTS
• Marginal latency increase as the number of levels grows
  – 0.1% additional data per level
  – 0.18% longer latency per level in PageRank
  – 0.15% longer latency per level in BFS
  – Stream buffers suffer from more random accesses
GRAPH ACCELERATOR-SUMMARY
• CAPI accelerator for out-of-memory graph processing
• Optimizations under graph virtualization
  – Expand-and-filter
  – Multi-version
• Prototyped on an FPGA
• On average 2-4x, up to more than 10x speedup
CONCLUSION
• ExtraV system: a graph-processing-near-storage framework with a coherent accelerator
  – Graph virtualization provides an abstraction to the processor
  – Significant speedup over software solutions
THAT'S ALL
• Thank you!
• leejinho@us.ibm.com
SUPPLEMENTAL
https://www.ece.cmu.edu/~calcm/carl/lib/exe/fetch.php?media=carl15-gupta.pdf
HOMOGENEOUS COMPUTING ERA
• Homogeneous architectures are approaching their performance limit
Source: "Heterogeneous System Architecture Overview", Hot Chips Tutorial, August 2013
ACCELERATORS
• Accelerators offer much better performance and power efficiency
Source: Accelerator-rich architectures, http://cadlab.cs.ucla.edu/~cong/slides/islped14_keynote.pdf
OUT-OF-MEMORY GRAPH PROCESSING
• Graph too large to fit into main memory
  – Use secondary storage
  – The I/O becomes the bottleneck
  – Buffer management overhead

[Diagram: CPU and main memory, with the graph (nodes 0, 1, 2) residing in storage]
PROGRAMMING MODEL
1. CPU requests the number of nodes, and the ACC replies
2. CPU requests a sequential traversal of in_edges
3. ACC streams the nodes
4. CPU processes pr_nxt[0] = f(pr[1], pr[3], pr[4])
5. For the next iteration, the CPU asks again

   CPU → ACC: Start(G, level)
   ACC → CPU: Num_nodes = 100
   CPU → ACC: Seq, in_edges
   ACC → memory: (0: 1, 3, 4) (1: 2, 3) (2: 0) …
• Stream as laid out in memory: 0, 1, 3, 4, -1, 1, 2, 3, -1, 2, 0, -1
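A small sketch of how the CPU could walk this flat layout, assuming each record is a node ID followed by its neighbors and terminated by -1 (my reading of the slide; the actual record format may differ).

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    // The buffer the ACC would have written into main memory.
    std::vector<int64_t> stream = {0, 1, 3, 4, -1, 1, 2, 3, -1, 2, 0, -1};

    size_t i = 0;
    while (i < stream.size()) {
        int64_t u = stream[i++];                  // node id
        std::printf("node %lld:", (long long)u);
        while (i < stream.size() && stream[i] != -1)
            std::printf(" %lld", (long long)stream[i++]);  // its in-neighbors
        ++i;                                      // skip the -1 terminator
        std::printf("\n");
    }
    return 0;
}
```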
COMPRESSION
• Variable length integer
– Small integers take small space
• Run length coding
– Store consecutive numbers as (start, length)
– 3, 4, 5, 6, 7 -> (3, 5)
• Store differences
– Differences are smaller than the original numbers
– 1, 3, 7, 8 -> 1, +2, +4, +1
COMPRESSION
Node  Neighbors
1     2 5 6 7 8 12 13 14 30

INTERVAL CODING
Node  Neighbors
1     2 (5 6 7 8) (12 13 14) 30

Node  Intervals        Residues
1     (5, 4) (12, 3)   2, 30

DIFFERENTIAL CODING
Node  Intervals        Residues
1     (5, 4) (12, 3)   2, 30
1     (5, 4) (+4, 3)   2, +28
• 9 large integers -> 6 smaller integers
• Variable-length encoding is then applied
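A sketch of the matching expand step for this example: starting from the differentially coded intervals and residues, recover the neighbor IDs 2, 5-8, 12-14, 30. The interpretation that interval starts are offsets from the previous interval's last element is inferred from the numbers on the slide and may not match the real decoder exactly.

```cpp
#include <cstdint>
#include <cstdio>
#include <utility>
#include <vector>

int main() {
    // Differentially coded form from the slide: intervals (5, 4) (+4, 3), residues 2, +28.
    std::vector<std::pair<int64_t, int64_t>> intervals = {{5, 4}, {4, 3}};
    std::vector<int64_t> residues = {2, 28};

    std::vector<int64_t> nbrs;
    int64_t last = -1;                            // last element of the previous interval
    bool first = true;
    for (auto [start, len] : intervals) {
        int64_t s = first ? start : last + start; // undo the differential coding
        for (int64_t k = 0; k < len; ++k) nbrs.push_back(s + k);
        last = s + len - 1;
        first = false;
    }
    int64_t prev = 0;
    for (int64_t d : residues) { prev += d; nbrs.push_back(prev); }  // 2, then 30

    // Prints 5 6 7 8 12 13 14 2 30: all original IDs, intervals first, residues last.
    for (int64_t v : nbrs) std::printf("%lld ", (long long)v);
    std::printf("\n");
    return 0;
}
```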
EXAMPLE GRAPH PROCESSING
Loop until convergence:
  for u in all nodes:
    α[u] = Σ over v in in_neighbors(u) of f(α[v])

Adjacency list (in-neighbors):
(0: 1, 3, 4)
(1: 2, 3)
(2: 0)
(3: 2)

[Figure: example graph with nodes 0-4; values α1, α3, α4 feed into α0]
MULTI-VERSIONED GRAPH
• Baseline: CSR (Compressed Sparse Row)

  index:               0 1 2
  R: Row-offset (V):   0 1 x
  index:               0 1 2
  C: Column index (E): 2 0 2

  [Figure: example graph with nodes 0(A), 1(B), 2(C)]
PERFORMANCE-CONNECTED COMPONENTS
• 23x speedup over FlashGraph
• 12x speedup over Llama

[Chart: Connected Components execution time (sec.) of ExtraV, FlashGraph, and Llama on uk-2005, Webbase, and Twitter]
BANDWIDTH
• ExtraV draws the maximum bandwidth from the HDD

[Charts: I/O bandwidth (MB/s, 0-100) over time (0-3 min) for FlashGraph, Llama, and ExtraV]
FILTERING
• BFS allows much of the streamed data to be filtered out

[Charts: unfiltered vs. filtered bandwidth (MEdges/s) per iteration for Twitter and Webbase; annotated filtered fractions include 72.6%, 13.9%, 8.8%, and 1.0% for Twitter and 12.6% and 1.0% for Webbase]
PERFORMANCE RESULTS
• Stream buffers give about 4x
• Prefetching gives another 4x