EXTRAV: BOOSTING GRAPH PROCESSING NEAR STORAGE WITH A COHERENT ACCELERATOR
Jinho Lee†
Heesu Kim*
Sungjoo Yoo*
Kiyoung Choi*
H. Peter Hofstee†
Gi-Joon Nam†
Damir Jemsek†
Mark Nutter†
†IBM Research
*Seoul National University
GRAPHS
• Can be found in many areas
‒Twitter followers
‒Facebook friends
‒Web pages linking each other
‒Research paper citations
• Essential to Data-Mining and Machine Learning
‒Identify influential people and information
‒Find communities
‒Target ads and products
‒Model complex data dependencies
WHAT CAN WE DO?
• Shortest path
‒How close am I to another person?
• PageRank
‒How much influence does someone have on Twitter?
• Sparse matrix computation
‒Sparse matrices can be represented as graphs
HOW TO PROCESS A LARGE GRAPH?
• The graph does not fit in the memory
• Distributed approach
• Split the graph and put each partition in the main memory of a node in a cluster
• Partitioning is difficult
‒Load imbalance
‒Communication overhead
HOW TO PROCESS A LARGE GRAPH? (2)
• The graph does not fit in the memory
• Out-of-memory (single machine) approach
• Put the graph in storage (e.g., an HDD) and process it piece by piece
‒Everything done within a single machine
• Minimizing read/write is the key
‒Buffer management overhead
‒Fundamentally limited by the storage bandwidth
TARGET GRAPH MODEL – CSR (COMPRESSED SPARSE ROW)
• Compact representation for graphs

  indices:             0 1 2
  R: Row-offset (V):   0 2 3
  indices:             0 1 2 3 4
  C: Column index (E): 1 2 2 0 2
  V: Value (property): a b c

  [Figure: example graph with nodes 0(A), 1(B), 2(C)]

• Focus on fixed structure rather than their properties
  ‒Unlike the property graph model of neo4j
  ‒Add/delete not easy
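To make the CSR layout concrete, here is a small C++ sketch that stores the example above and iterates over a node's neighbors. It follows the slide's convention of one row offset per node, with the last list ending at the end of C; this is only an illustration, not ExtraV's on-storage format.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Minimal CSR sketch matching the example (nodes 0(A), 1(B), 2(C)).
struct CsrGraph {
    std::vector<uint32_t> R;  // row offsets, one per node
    std::vector<uint32_t> C;  // column indices (edge targets)
    std::vector<char>     V;  // per-node value/property

    // Visit node u's neighbors: C[R[u]] .. C[end-1].
    template <typename Fn>
    void for_each_neighbor(uint32_t u, Fn fn) const {
        uint32_t end = (u + 1 < R.size()) ? R[u + 1] : uint32_t(C.size());
        for (uint32_t i = R[u]; i < end; ++i) fn(C[i]);
    }
};

int main() {
    CsrGraph g{{0, 2, 3}, {1, 2, 2, 0, 2}, {'a', 'b', 'c'}};
    for (uint32_t u = 0; u < g.R.size(); ++u) {
        std::printf("%u(%c):", u, g.V[u]);
        g.for_each_neighbor(u, [](uint32_t v) { std::printf(" %u", v); });
        std::printf("\n");
    }
    return 0;
}
```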
PREVIEW
• ExtraV
  ‒ Extract + Traverse
  ‒ Graph processing near storage with a coherent accelerator
• Graph virtualization to minimize buffer management overhead
  ‒ General parts done at the accelerator, app-specific parts at the CPU
  ‒ Various optimizations possible under the abstraction
• Optimization 1: Expand-and-filter
  ‒ Apply graph-specific compression to increase effective bandwidth
  ‒ Filter the decompressed data
• Optimization 2: Multi-versioning
  ‒ Gather graphs from certain time points
CAPI - ACCELERATORS
1. CPU has the input data in memory
2. The data are copied to the accelerator (through a device driver)
3. The accelerator produces the result
4. The output data are copied back into memory
• The accelerator's speedup is diminished by the copying overhead
• The range of target kernels is limited
[Diagram: CPU and memory with a conventional accelerator (e.g., FFT, MP3); parameters, input data, and output data are copied back and forth between host memory and the accelerator]
CAPI - ACCELERATORS WITH COHERENT INTERFACE
1. CPU passes a pointer to the data to the accelerator
2. The accelerator writes to the system memory directly
• Fine-grained communication (saves multiple µs)
• Easy to design
[Diagram: CPU, memory, and a coherent accelerator sharing parameters, input data, and output data in place, with no copies]
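As a rough illustration of the pointer-passing model (not the actual CAPI/libcxl API), the host can build a small descriptor in ordinary memory and hand the accelerator a single pointer to it; because the interface is coherent, the accelerator dereferences host addresses and writes results back with no explicit copies. The struct layout and field names below are assumptions.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical work-element descriptor shared coherently with the accelerator.
// Field names and layout are illustrative; a real CAPI design defines its own
// descriptor and attaches it through the platform's accelerator library.
struct WorkElement {
    const uint64_t*   input;      // host virtual address of the input data
    uint64_t*         output;     // host virtual address of the result buffer
    uint64_t          num_items;  // number of elements to process
    volatile uint64_t done;       // accelerator sets this flag when finished
};

int main() {
    std::vector<uint64_t> in(1024, 1), out(1024, 0);

    // The only thing the CPU hands over is &wed; no buffers are copied.
    WorkElement wed{in.data(), out.data(), in.size(), 0};

    // attach_accelerator(&wed);            // platform-specific call, omitted here
    // while (wed.done == 0) { /* poll */ } // accelerator reads in[], writes out[]

    (void)wed;
    return 0;
}
```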
EXTRAV
• ExtraV : Extract + Traverse
• A coherent accelerator in front of the storage
[Diagram: coherent ACC (FPGA) placed between the storage holding the graph (nodes 0, 1, 2) and the CPU/memory]
GRAPH VIRTUALIZATION
• The ACC interprets the graph data and puts it into main memory in the order the CPU is going to process it
• The format within the storage or the ACC is hidden
[Diagram: the ACC (FPGA) reads the graph (0, 1, 2) from storage and writes the node stream 0 1 2 into main memory]
GRAPH VIRTUALIZATION
• Buffer management overhead reduced
• The ACC can perform extra optimizations under the abstraction layer
[Diagram: same setup as before; the storage format and the ACC's processing are hidden from the CPU]
OPT1: EXPAND-AND-FILTER
• Performance bottlenecks
  ‒ PCIe: 3 GB/s
  ‒ Storage: 100 MB/s – 2 GB/s

[Diagram: traversal engine in the ACC (FPGA) between the storage and the CPU/memory]
OPT1: EXPAND-AND-FILTER
• The graph data are compressed inside the storage

[Diagram: 4 KB compressed blocks are read from storage into the ACC's decompression and traversal pipeline]
OPT1: EXPAND-AND-FILTER
• Decompressed in the ACC

[Diagram: 4 KB of compressed data from storage expands to 16 KB after decompression in the ACC]
OPT1: EXPAND-AND-FILTER
• The traversal engine selects the needed data from the decompressed stream (expand-and-filter)
• The effective bandwidth from storage increases

[Diagram: 4 KB of compressed data expands to 16 KB in the ACC; after filtering, 8 KB is sent on to memory]
COMPRESSION
• Variable length integer
– Small integers take small space
• Interval coding
– Store consecutive numbers as (start, length)
– 3, 4, 5, 6, 7 -> (3, 5)
• Store differences
– Differences are smaller than the original numbers
– 1, 3, 7, 8 -> 1, +2, +4, +1
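A minimal sketch of how these three ideas combine on one sorted neighbor list: runs of consecutive IDs become (start, length) intervals, the leftover IDs are delta-coded, and all numbers are packed as variable-length integers. The run-length threshold and byte layout here are assumptions, not ExtraV's exact format.

```cpp
#include <cstdint>
#include <cstdio>
#include <utility>
#include <vector>

// Variable-length integer: 7 data bits per byte, high bit means "more follows".
static void put_varint(std::vector<uint8_t>& out, uint64_t v) {
    while (v >= 0x80) { out.push_back(uint8_t(v) | 0x80); v >>= 7; }
    out.push_back(uint8_t(v));
}

static std::vector<uint8_t> encode(const std::vector<uint64_t>& nbrs) {
    std::vector<std::pair<uint64_t, uint64_t>> intervals;  // (start, length)
    std::vector<uint64_t> residues;                        // everything else
    for (size_t i = 0; i < nbrs.size();) {
        size_t j = i + 1;
        while (j < nbrs.size() && nbrs[j] == nbrs[j - 1] + 1) ++j;
        if (j - i >= 3) intervals.push_back({nbrs[i], j - i});
        else residues.insert(residues.end(), nbrs.begin() + i, nbrs.begin() + j);
        i = j;
    }
    std::vector<uint8_t> out;
    put_varint(out, intervals.size());
    uint64_t prev = 0;
    for (auto [start, len] : intervals) {   // interval starts stored as deltas
        put_varint(out, start - prev); put_varint(out, len);
        prev = start;
    }
    put_varint(out, residues.size());
    prev = 0;
    for (uint64_t r : residues) { put_varint(out, r - prev); prev = r; }  // deltas
    return out;
}

int main() {
    std::vector<uint64_t> nbrs = {2, 5, 6, 7, 8, 12, 13, 14, 30};  // sorted list
    std::printf("%zu neighbors -> %zu bytes\n", nbrs.size(), encode(nbrs).size());
    return 0;
}
```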
OPT2: MULTI-VERSION
[Diagram: a multi-version stage added between the decompression and traversal stages in the ACC (FPGA), between storage and CPU/memory]
OPT2: MULTI-VERSION
[Diagram: same pipeline; the example graph (nodes 0(A), 1(B), 2(C)) evolves over snapshots Day 0, Day 1, and Day 2]
OPT2: MULTI-VERSION
• Save deltas
• Provide coarse-grained shortcuts

[Figure: Day 0 stored as the base graph; Day 1 and Day 2 stored as deltas (△Day 1, △Day 2), with shortcuts linking the versions]
MULTI-VERSIONED GRAPH
• Naïve approach: store the full CSR arrays for every version

  Version 0
    index:               0 1 2
    R: Row-offset (V):   0 1 x
    index:               0 1 2
    C: Column index (E): 1 0 2

  Version 1
    index:               0 1 2
    R: Row-offset (V):   0 1 x
    index:               0 1 2 3
    C: Column index (E): 1 0 2 1

  [Figure: example graph (nodes 0(A), 1(B), 2(C)) at Version 0 and Version 1]
MULTI-VERSIONED GRAPH
• Re-use column indices: a version's row entries can link back to an older version's unchanged adjacency lists

  [Tables: Version 0 and Version 1 row-offset arrays holding links such as L0,0, L0,1, L1,0 into the column-index lists, so unchanged lists are stored only once]
MULTI-VERSIONED GRAPH
• Re-use row offsets by using coarse-grained indirection

  [Tables: an indirection table maps coarse vertex ranges (0-1, 2-3) to row-offset blocks, so blocks of R that did not change are shared across versions; column-index lists are shared as on the previous slide]
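A minimal sketch of the delta idea: the base graph keeps full adjacency lists, each later version stores only the lists that changed, and a lookup walks back toward the base (which the coarse-grained shortcuts accelerate in the real design). The data layout and names are illustrative assumptions, not ExtraV's on-storage format.

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <vector>

struct VersionedGraph {
    std::vector<std::vector<uint32_t>> base;  // version 0: full adjacency lists
    std::vector<std::unordered_map<uint32_t, std::vector<uint32_t>>> deltas;  // 1..N

    // Newest adjacency list of `node` at `version`: walk back toward the base.
    // A real design adds coarse-grained shortcuts so this walk skips versions.
    const std::vector<uint32_t>& neighbors(uint32_t node, size_t version) const {
        for (size_t v = version; v >= 1; --v) {
            auto it = deltas[v - 1].find(node);
            if (it != deltas[v - 1].end()) return it->second;
        }
        return base[node];
    }
};

int main() {
    // Example graph 0(A), 1(B), 2(C): Day 0 is the base, Day 1 changes node 1's list.
    VersionedGraph g;
    g.base   = {{1}, {0}, {2}};
    g.deltas = {{{1, {0, 2}}}};  // delta for version 1: node 1 now points to 0 and 2

    for (uint32_t n : g.neighbors(1, 1)) std::printf("%u ", n);  // prints: 0 2
    std::printf("\n");
    return 0;
}
```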
PROGRAMMING MODEL
1. CPU requests the number of nodes, and the ACC replies

   CPU → ACC: Start(G, level)
   ACC → CPU: Num_nodes = 100
PROGRAMMING MODEL
1. CPU requests the number of nodes, and the ACC replies
2. CPU requests a sequential traversal of in_edges

   CPU → ACC: Start(G, level)
   ACC → CPU: Num_nodes = 100
   CPU → ACC: Seq, in_edges
PROGRAMMING MODEL
1. CPU requests the number of nodes, and the ACC replies
2. CPU requests a sequential traversal of in_edges
3. ACC streams the nodes (into main memory)

   CPU → ACC: Start(G, level)
   ACC → CPU: Num_nodes = 100
   CPU → ACC: Seq, in_edges
   ACC → memory: (0: 1, 3, 4) (1: 2, 3) (2: 0) …
PROGRAMMING MODEL
1. CPU requests the number of nodes, and the ACC replies
2. CPU requests a sequential traversal of in_edges
3. ACC streams the nodes
4. CPU processes pr_nxt[0] = f(pr[1], pr[3], pr[4])

   CPU → ACC: Start(G, level)
   ACC → CPU: Num_nodes = 100
   CPU → ACC: Seq, in_edges
   ACC → memory: (0: 1, 3, 4) (1: 2, 3) (2: 0) …
PROGRAMMING MODEL
1. CPU requests the number of nodes, and the ACC replies
2. CPU requests a sequential traversal of in_edges
3. ACC streams the nodes
4. CPU processes pr_nxt[0] = f(pr[1], pr[3], pr[4])
5. For the next iteration, the CPU asks again

   CPU → ACC: Start(G, level)
   ACC → CPU: Num_nodes = 100
   CPU → ACC: Seq, in_edges
   ACC → memory: (0: 1, 3, 4) (1: 2, 3) (2: 0) …
   CPU → ACC: Seq, in_edges
   ACC → memory: (0: 1, 3, 4) (1: 2, 3) (2: 0) …
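A host-side sketch of this loop, with an in-memory adjacency list standing in for the ACC-produced stream so it runs as-is; in the real system the (node: in-neighbors) records are written into main memory by the accelerator. Names such as f, pr, and pr_nxt follow the slide; everything else is an assumption.

```cpp
#include <cstdint>
#include <cstdio>
#include <utility>
#include <vector>

using Stream = std::vector<std::pair<uint32_t, std::vector<uint32_t>>>;

int main() {
    // Stand-in for "Seq, in_edges": (0: 1, 3, 4) (1: 2, 3) (2: 0) (3: 2) (4: 0)
    Stream stream = {{0, {1, 3, 4}}, {1, {2, 3}}, {2, {0}}, {3, {2}}, {4, {0}}};
    uint32_t num_nodes = 5;                       // step 1: Num_nodes reply

    std::vector<double> pr(num_nodes, 1.0 / num_nodes), pr_nxt(num_nodes, 0.0);
    auto f = [](double x) { return 0.85 * x; };   // placeholder per-neighbor function

    for (int iter = 0; iter < 3; ++iter) {        // steps 2 and 5: ask each iteration
        for (const auto& [u, in_nbrs] : stream) { // step 3: consume the streamed nodes
            double acc = 0.0;
            for (uint32_t v : in_nbrs) acc += f(pr[v]);
            pr_nxt[u] = acc;                      // step 4: pr_nxt[u] = f(pr[v], ...)
        }
        pr.swap(pr_nxt);
    }
    for (uint32_t u = 0; u < num_nodes; ++u) std::printf("pr[%u] = %.4f\n", u, pr[u]);
    return 0;
}
```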
PROGRAMMING MODEL (FILTERING)
• Some neighbors are not needed by the CPU
• Keep a filter bitmap in memory; through CAPI, the ACC can access it

   CPU → ACC: Start(G, level)
   ACC → CPU: Num_nodes = 100
   CPU → ACC: Seq, in_edges, filter (filter bitmap in memory: 001101)
   ACC → memory: (0: 1, 3, 4) (1: 2, 3) (2: 0) …
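A sketch of how the filter bitmap could be applied (an assumed mechanism, not the actual RTL): one bit per vertex in host memory, tested by the coherent ACC so that filtered-out neighbors are never written to the output stream.

```cpp
#include <cstdint>
#include <initializer_list>
#include <vector>

static bool bit_set(const std::vector<uint64_t>& bitmap, uint32_t v) {
    return (bitmap[v >> 6] >> (v & 63)) & 1ULL;   // one bit per vertex
}

// Keep only the neighbors whose filter bit is set.
static std::vector<uint32_t> filter_neighbors(const std::vector<uint32_t>& nbrs,
                                              const std::vector<uint64_t>& bitmap) {
    std::vector<uint32_t> kept;
    for (uint32_t v : nbrs)
        if (bit_set(bitmap, v)) kept.push_back(v);
    return kept;
}

int main() {
    std::vector<uint64_t> bitmap(1, 0);
    for (uint32_t v : {0u, 2u, 3u}) bitmap[v >> 6] |= 1ULL << (v & 63);  // active set
    return filter_neighbors({1, 3, 4}, bitmap).size() == 1 ? 0 : 1;      // keeps {3}
}
```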
PROGRAMMING MODEL (Q MODE)
• An active vertex list is used by some implementations of PageRank
• Some apps (BFS or Dijkstra) require a frontier queue or priority list

   CPU → ACC: Start(G, level)
   ACC → CPU: Num_nodes = 100
   CPU → ACC: Queue, in_edges (queue in memory: 0, 7, 9)
   ACC → memory: (0: 1, 3, 4) (7: 3, 7) (9: 0) …
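A sketch of queue mode for BFS: the CPU hands the current frontier to the ACC and receives adjacency lists only for those vertices. Here an in-memory graph stands in for the ACC so the example runs; all names are illustrative assumptions.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    std::vector<std::vector<uint32_t>> out_edges = {{1, 3, 4}, {2, 3}, {0}, {2}, {0}};
    const uint32_t src = 0;
    const uint32_t n = static_cast<uint32_t>(out_edges.size());

    std::vector<int> level(n, -1);
    std::vector<uint32_t> frontier = {src};       // the "Queue" handed to the ACC
    level[src] = 0;

    for (int depth = 0; !frontier.empty(); ++depth) {
        std::vector<uint32_t> next;
        // Stand-in for "Queue, out_edges": the ACC would stream (u: neighbors)
        // for exactly the queued vertices.
        for (uint32_t u : frontier)
            for (uint32_t v : out_edges[u])
                if (level[v] < 0) { level[v] = depth + 1; next.push_back(v); }
        frontier.swap(next);
    }
    for (uint32_t u = 0; u < n; ++u) std::printf("level[%u] = %d\n", u, level[u]);
    return 0;
}
```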
IMPLEMENTATION
• 6-stage pipeline
• Designed using Vivado-HLS from C++ models
IMPLEMENTATION
• The FPGA contains the PSL and the AFU
• The AFU has 16 worker modules, an arbiter, and a scheduler
• Stream buffers exploit access locality
• Some stream buffers are connected to the storage, others to the memory
• Prefetching is used to draw full bandwidth
EVALUATION PLATFORM
• POWER8 processor (3.7 GHz, 20 cores)
• Alpha Data CAPI development card with a Xilinx UltraScale FPGA
  – The design runs at 125 MHz
• The card and the system are connected via PCIe Gen3
• SSD as storage (max 500 MB/s)
• Page cache limited to 4 GB
• Four graph algorithms: Teenager Follower, PageRank, BFS, Connected Components
SYNTHESIS RESULT (KU3)
• The PSL consumes about a quarter of the FPGA resources
• About 80% of CLBs and 40% of registers are used
  – Execution pipelines consume most of the CLBs
  – Stream buffers consume the majority of the registers
COMPRESSION RATIO
• Compression ratios of roughly 2x to 9x
• The more intervals, the better the compression
• The closer the neighbors, the better the compression (smaller differences)

  Graph        Type            Uncompressed  Compressed  Ratio
  Gsh-tpd      Subdomain       9.89 GB       2.47 GB     4.00x
  Arabic-2002  Subdomain       10.2 GB       1.09 GB     9.36x
  Twitter      Social network  23.7 GB       9.8 GB      2.41x
  LiveJournal  Social network  1.23 GB       456 MB      2.70x
  Friendster   Social network  30.6 GB       13.8 GB     2.22x
  MS-ref       Citation        9.27 GB       4.40 GB     2.10x
  Road-CA      Road network    145 MB        76 MB       1.90x
PERFORMANCE RESULTS
• Multiple-times speedup compared to state-of-the-art frameworks
  – Up to 2-4x over FlashGraph and Llama
  – More than 10x over X-Stream
• Graphs with higher compression ratios show larger speedups

[Charts: PageRank and BFS execution times across graphs and frameworks]
PERFORMANCE RESULTS
• Stream buffers give about a 10x improvement
• Prefetching gives an additional 3x
PERFORMANCE RESULTS
• Marginal latency increase as the number of levels grows
  – 0.1% additional data per level
  – 0.18% longer latency per level in PageRank
  – 0.15% longer latency per level in BFS
  – Stream buffers suffer from more random accesses
GRAPH ACCELERATOR-SUMMARY
• CAPI accelerator for out-of-memory graph processing
• Optimizations under graph virtualization
  – Expand-and-filter
  – Multi-version
• Prototyped on an FPGA
• On average 2-4x, up to more than 10x speedup
CONCLUSION
• ExtraV system: a graph-processing-near-storage framework with a coherent accelerator
  – Graph virtualization provides an abstraction to the processor
  – Significant speedup over software solutions
THAT'S ALL
• Thank you!
• leejinho@us.ibm.com
SUPPLEMENTAL
https://www.ece.cmu.edu/~calcm/carl/lib/exe/fetch.php?media=carl15-gupta.pdf
HOMOGENEOUS COMPUTING ERA
• Homogeneous architectures are approaching their performance limit
Source: "Heterogeneous System Architecture Overview", Hot Chips Tutorial, August 2013
ACCELERATORS
• Accelerators offer much better performance and power efficiency
Source: Accelerator-rich architectures, http://cadlab.cs.ucla.edu/~cong/slides/islped14_keynote.pdf
OUT-OF-MEMORY GRAPH PROCESSING
• Graph too large to fit into main memory
  – Use secondary storage
  – The I/O becomes the bottleneck
  – Buffer management overhead

[Diagram: CPU and main memory, with the graph (nodes 0, 1, 2) residing in storage]
PROGRAMMING MODEL
1. CPU requests the number of nodes, and the ACC replies
2. CPU requests a sequential traversal of in_edges
3. ACC streams the nodes
4. CPU processes pr_nxt[0] = f(pr[1], pr[3], pr[4])
5. For the next iteration, the CPU asks again

   CPU → ACC: Start(G, level)
   ACC → CPU: Num_nodes = 100
   CPU → ACC: Seq, in_edges
   ACC → memory: (0: 1, 3, 4) (1: 2, 3) (2: 0) …
• Stream as laid out in memory: 0, 1, 3, 4, -1, 1, 2, 3, -1, 2, 0, -1
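A small sketch of how the CPU could walk this flat layout, assuming each record is a node ID followed by its neighbors and terminated by -1 (my reading of the slide; the actual record format may differ).

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    // The buffer the ACC would have written into main memory.
    std::vector<int64_t> stream = {0, 1, 3, 4, -1, 1, 2, 3, -1, 2, 0, -1};

    size_t i = 0;
    while (i < stream.size()) {
        int64_t u = stream[i++];                  // node id
        std::printf("node %lld:", (long long)u);
        while (i < stream.size() && stream[i] != -1)
            std::printf(" %lld", (long long)stream[i++]);  // its in-neighbors
        ++i;                                      // skip the -1 terminator
        std::printf("\n");
    }
    return 0;
}
```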
COMPRESSION
• Variable length integer
– Small integers take small space
• Run length coding
– Store consecutive numbers as (start, length)
– 3, 4, 5, 6, 7 -> (3, 5)
• Store differences
– Differences are smaller than the original numbers
– 1, 3, 7, 8 -> 1, +2, +4, +1
COMPRESSION
Node  Neighbors
1     2 5 6 7 8 12 13 14 30

INTERVAL CODING
Node  Neighbors
1     2 (5 6 7 8) (12 13 14) 30

Node  Intervals        Residues
1     (5, 4) (12, 3)   2, 30

DIFFERENTIAL CODING
Node  Intervals        Residues
1     (5, 4) (12, 3)   2, 30
1     (5, 4) (+4, 3)   2, +28
• 9 large integers -> 6 smaller integers
• Variable-length encoding is then applied
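A sketch of the matching expand step for this example: starting from the differentially coded intervals and residues, recover the neighbor IDs 2, 5-8, 12-14, 30. The interpretation that interval starts are offsets from the previous interval's last element is inferred from the numbers on the slide and may not match the real decoder exactly.

```cpp
#include <cstdint>
#include <cstdio>
#include <utility>
#include <vector>

int main() {
    // Differentially coded form from the slide: intervals (5, 4) (+4, 3), residues 2, +28.
    std::vector<std::pair<int64_t, int64_t>> intervals = {{5, 4}, {4, 3}};
    std::vector<int64_t> residues = {2, 28};

    std::vector<int64_t> nbrs;
    int64_t last = -1;                            // last element of the previous interval
    bool first = true;
    for (auto [start, len] : intervals) {
        int64_t s = first ? start : last + start; // undo the differential coding
        for (int64_t k = 0; k < len; ++k) nbrs.push_back(s + k);
        last = s + len - 1;
        first = false;
    }
    int64_t prev = 0;
    for (int64_t d : residues) { prev += d; nbrs.push_back(prev); }  // 2, then 30

    // Prints 5 6 7 8 12 13 14 2 30: all original IDs, intervals first, residues last.
    for (int64_t v : nbrs) std::printf("%lld ", (long long)v);
    std::printf("\n");
    return 0;
}
```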
EXAMPLE GRAPH PROCESSING
Loop until convergence:
  for u in all nodes:
    α[u] = Σ over v in in_neighbors(u) of f(α[v])

Adjacency list (in-neighbors):
(0: 1, 3, 4)
(1: 2, 3)
(2: 0)
(3: 2)

[Figure: example graph with nodes 0-4; values α1, α3, α4 feed into α0]
MULTI-VERSIONED GRAPH
• Baseline: CSR (Compressed Sparse Row)

  index:               0 1 2
  R: Row-offset (V):   0 1 x
  index:               0 1 2
  C: Column index (E): 2 0 2

  [Figure: example graph with nodes 0(A), 1(B), 2(C)]
PERFORMANCE-CONNECTED COMPONENTS
• 23x speedup over FlashGraph
• 12x speedup over Llama

[Chart: Connected Components execution time (sec.) of ExtraV, FlashGraph, and Llama on uk-2005, Webbase, and Twitter]
BANDWIDTH
• ExtraV draws the maximum bandwidth from the HDD

[Charts: I/O bandwidth (MB/s, 0-100) over time (0-3 min) for FlashGraph, Llama, and ExtraV]
FILTERING
• BFS allows much of the streamed data to be filtered out

[Charts: unfiltered vs. filtered bandwidth (MEdges/s) per iteration for Twitter and Webbase; annotated filtered fractions include 72.6%, 13.9%, 8.8%, and 1.0% for Twitter and 12.6% and 1.0% for Webbase]
PERFORMANCE RESULTS
• Stream buffers give about 4x
• Prefetching gives another 4x