Demystifying Distributed Graph Processing

Vasia Kalavri
vasia@apache.org
@vkalavri

WHY DISTRIBUTED GRAPH PROCESSING?

MISCONCEPTION #1
"MY GRAPH IS SO BIG, IT DOESN’T FIT IN A SINGLE MACHINE"
Big Data Ninja

YOUR INPUT DATASET SIZE IS _OFTEN_ IRRELEVANT

INTERMEDIATE DATA: THE OFTEN DISREGARDED EVIL
▸ Naive Who(m) to Follow:
  ▸ compute a friends-of-friends list per user
  ▸ exclude existing friends
  ▸ rank by common connections
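
The three steps above can be sketched in a few lines of Python. This is only an illustration on a hypothetical toy follower graph (the `follows` dictionary and the function name are assumptions, not taken from the talk). It also counts the friends-of-friends records produced in step 1, which is the intermediate data the slide title warns about: one record per 2-hop path, regardless of how small the input edge list is.

```python
from collections import Counter

# Hypothetical toy follower graph: user -> set of users they follow.
follows = {
    "a": {"b", "c"},
    "b": {"c", "d"},
    "c": {"d", "e"},
    "d": {"e"},
    "e": set(),
}

def naive_who_to_follow(follows):
    """Return (recommendations, number of intermediate friend-of-friend records)."""
    intermediate = 0
    recommendations = {}
    for user, friends in follows.items():
        # Step 1: friends-of-friends list, one record per 2-hop path.
        fof = [ff for f in friends for ff in follows[f]]
        intermediate += len(fof)
        # Step 2: exclude existing friends (and the user itself).
        candidates = [c for c in fof if c != user and c not in friends]
        # Step 3: rank by number of common connections.
        ranked = Counter(candidates).most_common()
        recommendations[user] = [c for c, _ in ranked]
    return recommendations, intermediate

recs, n_intermediate = naive_who_to_follow(follows)
n_edges = sum(len(f) for f in follows.values())
print(recs["a"], n_edges, n_intermediate)
```

Even on this tiny graph the intermediate list has more records than the input has edges. On graphs with high-degree vertices the gap is typically much larger, since every vertex contributes (in-degree × out-degree) 2-hop paths.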

MISCONCEPTION #2
"DISTRIBUTED PROCESSING IS ALWAYS FASTER THAN SINGLE-NODE"
Data Science Rockstar

GRAPHS DON’T APPEAR OUT OF THIN AIR
Expectation…

Reality!

HOW DO WE EXPRESS A DISTRIBUTED GRAPH ANALYSIS TASK?

GRAPH APPLICATIONS ARE DIVERSE
▸ Iterative value propagation
  ▸ PageRank, Connected Components, Label Propagation
▸ Traversals and path exploration
  ▸ Shortest paths, centrality measures
▸ Ego-network analysis
  ▸ Personalized recommendations
▸ Pattern mining
  ▸ Finding frequent subgraphs

RECENT DISTRIBUTED GRAPH PROCESSING HISTORY
▸ 2004: MapReduce
▸ 2009: Pegasus
▸ 2010: Pregel, Signal-Collect
▸ 2012: PowerGraph

Iterative value propagation

PREGEL: THINK LIKE A VERTEX
[Figure: example graph with vertices 1 to 5, stored as one adjacency list per vertex]
1 → 3, 4
2 → 1, 4
5 → 3
...

PREGEL: SUPERSTEPS
(V_{i+1}, outbox) <- compute(V_i, inbox)
[Figure: the adjacency lists (1 → 3, 4; 2 → 1, 4; 5 → 3; ...) shown in superstep i and again in superstep i+1]
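
The superstep rule above can be sketched as a small bulk-synchronous loop. This is a minimal single-machine illustration, not Pregel's actual API: the names `run_supersteps` and `min_label`, the activation rule, and the toy graph are all assumptions made for the example. Each vertex's `compute` sees only its own value and inbox, and the messages it emits become visible in the next superstep.

```python
def run_supersteps(graph, values, compute, max_supersteps=50):
    """Run a bulk-synchronous vertex-centric computation.

    graph:   dict vertex -> list of neighbors
    values:  dict vertex -> initial value
    compute: function(vertex, value, inbox, neighbors, superstep)
             returning (new_value, outbox), where outbox is a list of
             (target, message) pairs
    """
    inbox = {v: [] for v in graph}
    for superstep in range(max_supersteps):
        next_inbox = {v: [] for v in graph}
        sent = False
        for v in graph:
            # A vertex is active in superstep 0, or when it received messages.
            if superstep > 0 and not inbox[v]:
                continue
            values[v], outbox = compute(v, values[v], inbox[v], graph[v], superstep)
            for target, msg in outbox:
                next_inbox[target].append(msg)
                sent = True
        if not sent:
            break  # no messages in flight: the computation has converged
        inbox = next_inbox  # messages are delivered in the next superstep
    return values

def min_label(vertex, value, inbox, neighbors, superstep):
    """Connected components by propagating the smallest vertex id."""
    new_value = min([value] + inbox)
    # Notify neighbors only in the first superstep or when the label improved.
    if superstep == 0 or new_value < value:
        return new_value, [(n, new_value) for n in neighbors]
    return new_value, []

# Hypothetical undirected toy graph with two components: {1, 2, 3} and {4, 5}.
toy_graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
components = run_supersteps(toy_graph, {v: v for v in toy_graph}, min_label)
print(components)
```

The vertex function never touches another vertex's state directly; all communication goes through the outbox, which is what makes the model easy to distribute.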

PREGEL EXAMPLE: PAGERANK
void compute(messages):
    // sum up received messages
    sum = 0.0
    for (m <- messages) do
        sum = sum + m
    end for
    // update vertex rank
    setValue(0.15/numVertices() + 0.85*sum)
    // distribute rank to neighbors
    for (edge <- getOutEdges()) do
        sendMessageTo(edge.target(), getValue()/numEdges())
    end for
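
For readers who want to run the idea, here is a hedged single-machine Python sketch of the same vertex-centric PageRank. The function name, the fixed number of supersteps, the uniform initial ranks, and the toy graph are assumptions for illustration. Unlike the slide pseudocode, the sketch skips the rank update in the first superstep, when no messages have arrived yet.

```python
def pregel_pagerank(out_edges, num_supersteps=50, damping=0.85):
    """Vertex-centric PageRank; out_edges maps vertex -> list of out-neighbors.

    Assumes every vertex has at least one outgoing edge.
    """
    n = len(out_edges)
    rank = {v: 1.0 / n for v in out_edges}
    inbox = {v: [] for v in out_edges}
    for superstep in range(num_supersteps):
        outbox = {v: [] for v in out_edges}
        for v, targets in out_edges.items():
            if superstep > 0:
                # Sum up received messages and update the vertex rank.
                rank[v] = (1 - damping) / n + damping * sum(inbox[v])
            # Distribute the rank evenly to the out-neighbors.
            for t in targets:
                outbox[t].append(rank[v] / len(targets))
        inbox = outbox  # delivered in the next superstep
    return rank

# Hypothetical toy graph without dangling vertices.
ranks = pregel_pagerank({1: [2, 3], 2: [3], 3: [1]})
print(ranks)
```

Because no vertex is dangling, the ranks keep summing to 1 across supersteps, which is a quick sanity check for this kind of sketch.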

SIGNAL-COLLECT
Signal: outbox <- signal(V_i)
Collect: V_{i+1} <- collect(inbox)
[Figure: the signal and collect phases shown on the example adjacency lists (1 → 3, 4; 2 → 1, 4; 5 → 3; ...) across supersteps i and i+1]

SIGNAL-COLLECT EXAMPLE: PAGERANK
void signal():
    // distribute rank to neighbors
    for (edge <- getOutEdges()) do
        sendMessageTo(edge.target(), getValue()/numEdges())
    end for
void collect(messages):
    // sum up received messages
    sum = 0.0
    for (m <- messages) do
        sum = sum + m
    end for
    // update vertex rank
    setValue(0.15/numVertices() + 0.85*sum)
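
The split into two functions can be tried out with the following Python sketch. It is an illustration under assumptions, not the Signal-Collect framework's API: the function signatures, the driver loop, and the toy graph are hypothetical. The point it shows is that `signal` only reads the vertex state and produces messages, while `collect` only reads messages and produces the new state.

```python
def signal(rank, targets):
    """Signal phase: turn the current vertex value into (target, message) pairs."""
    return [(t, rank / len(targets)) for t in targets]

def collect(messages, num_vertices, damping=0.85):
    """Collect phase: compute the new vertex value from the received messages."""
    return (1 - damping) / num_vertices + damping * sum(messages)

def signal_collect_pagerank(out_edges, num_supersteps=50):
    """out_edges maps vertex -> list of out-neighbors (no dangling vertices)."""
    n = len(out_edges)
    rank = {v: 1.0 / n for v in out_edges}
    for _ in range(num_supersteps):
        inbox = {v: [] for v in out_edges}
        # Signal: every vertex distributes its rank to its neighbors.
        for v, targets in out_edges.items():
            for target, msg in signal(rank[v], targets):
                inbox[target].append(msg)
        # Collect: every vertex updates its rank from its inbox.
        rank = {v: collect(inbox[v], n) for v in out_edges}
    return rank

# Hypothetical toy graph without dangling vertices.
sc_ranks = signal_collect_pagerank({1: [2, 3], 2: [3], 3: [1]})
print(sc_ranks)
```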

GATHER-SUM-APPLY (POWERGRAPH)
[Figure: the Gather, Sum, and Apply phases of superstep i, followed by the Gather phase of superstep i+1]

GSA EXAMPLE: PAGERANK
double gather(source, edge, target):
    // compute partial rank
    return target.value() / target.numEdges()
double sum(rank1, rank2):
    // combine partial ranks
    return rank1 + rank2
double apply(sum, currentRank):
    // update rank (here 0.15 is not divided by numVertices())
    return 0.15 + 0.85*sum
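
A runnable Python sketch of the three functions follows. The driver `gsa_pagerank`, the precomputed in-neighbor lists, the initial rank of 1.0, and the toy graph are assumptions for illustration; this is not PowerGraph's or Gelly's API. The sketch keeps the slide's update formula, so ranks sum to the number of vertices rather than to 1. Since `sum_ranks` is associative and commutative, partial sums could be combined in any order, for example on the sender side before crossing the network.

```python
from functools import reduce

def gather(neighbor_rank, neighbor_out_degree):
    """Gather: compute the partial rank contributed by one in-neighbor."""
    return neighbor_rank / neighbor_out_degree

def sum_ranks(rank1, rank2):
    """Sum: combine two partial ranks."""
    return rank1 + rank2

def apply(total, current_rank):
    """Apply: update the rank from the combined partial ranks."""
    return 0.15 + 0.85 * total

def gsa_pagerank(out_edges, num_supersteps=50):
    """out_edges maps vertex -> list of out-neighbors (no dangling vertices)."""
    in_edges = {v: [] for v in out_edges}
    for u, targets in out_edges.items():
        for t in targets:
            in_edges[t].append(u)
    rank = {v: 1.0 for v in out_edges}
    for _ in range(num_supersteps):
        new_rank = {}
        for v in out_edges:
            partials = [gather(rank[u], len(out_edges[u])) for u in in_edges[v]]
            total = reduce(sum_ranks, partials, 0.0)
            new_rank[v] = apply(total, rank[v])
        rank = new_rank
    return rank

# Hypothetical toy graph without dangling vertices.
gsa_ranks = gsa_pagerank({1: [2, 3], 2: [3], 3: [1]})
print(gsa_ranks)
```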

▸ 2013: Giraph++

Graph Traversals

THINK LIKE A (SUB)GRAPH
[Figure: the example graph with vertices 1 to 5, shown twice]
- compute() on the entire partition
- Information flows freely inside each partition
- Network communication between partitions, not vertices
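
The three points above can be illustrated with a Python sketch of partition-centric connected components. Everything here is an assumption made for the example (function name, message format, toy path graph, the two-way partitioning); it is not Giraph++'s API. Inside a partition, labels are propagated to a local fixpoint within a single superstep; only labels of boundary vertices are sent to other partitions.

```python
def partition_centric_cc(adjacency, partition_of):
    """Min-label connected components with one compute() per partition.

    adjacency:    dict vertex -> list of neighbors (undirected, symmetric)
    partition_of: dict vertex -> partition id
    Returns (labels, number of supersteps).
    """
    label = {v: v for v in adjacency}
    partitions = {}
    for v, p in partition_of.items():
        partitions.setdefault(p, []).append(v)
    last_sent = {}                       # last label each vertex sent out
    inbox = {p: [] for p in partitions}  # (vertex, label) messages per partition
    supersteps = 0
    while True:
        supersteps += 1
        next_inbox = {p: [] for p in partitions}
        for p, members in partitions.items():
            # Apply labels received from other partitions.
            for v, lbl in inbox[p]:
                label[v] = min(label[v], lbl)
            # Information flows freely inside the partition:
            # propagate labels until nothing changes locally.
            changed = True
            while changed:
                changed = False
                for v in members:
                    for n in adjacency[v]:
                        if partition_of[n] == p and label[v] < label[n]:
                            label[n] = label[v]
                            changed = True
            # Network communication only across partition boundaries.
            for v in members:
                if last_sent.get(v) == label[v]:
                    continue
                last_sent[v] = label[v]
                for n in adjacency[v]:
                    if partition_of[n] != p:
                        next_inbox[partition_of[n]].append((n, label[v]))
        if not any(next_inbox.values()):
            break
        inbox = next_inbox
    return label, supersteps

# Hypothetical path graph 1-2-3-4-5-6 split into two partitions.
path = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 6], 6: [5]}
parts = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
labels, n_supersteps = partition_centric_cc(path, parts)
print(labels, n_supersteps)
```

On this path graph the smallest label has to travel five hops, yet only a single edge crosses the partition boundary, so few supersteps are needed. How much this helps in practice depends on how well the partitioning keeps neighbors together.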

▸ 2014: NScale

Ego-network analysis

▸ 2015: Arabesque
▸ TinkerPop

Pattern Matching

CAN WE HAVE IT ALL?
▸ Data pipeline integration: built on top of an efficient distributed processing engine
▸ Graph ETL: high-level API with abstractions and methods to transform graphs
▸ Familiar programming model: support popular programming abstractions

HELLO, GELLY! THE APACHE FLINK GRAPH API
▸ Java and Scala APIs: seamlessly integrate with Flink’s DataSet API
▸ Transformations, library of common algorithms
    val graph = Graph.fromDataSet(edges, env)
    val ranks = graph.run(new PageRank(0.85, 20))
▸ Iteration abstractions
  ▸ Pregel
  ▸ Signal-Collect
  ▸ Gather-Sum-Apply
  ▸ Partition-Centric*

WHY FLINK?
▸ efficient streaming runtime
▸ native iteration operators
▸ well-integrated

FEELING GELLY?
▸ Paper References:
http://www.citeulike.org/user/vasiakalavri/tag/dotscale
▸ Apache Flink:
http://flink.apache.org/
▸ Gelly documentation:
http://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/gelly.html
▸ Gelly-Stream:
https://github.com/vasia/gelly-streaming
