DEMYSTIFYING DISTRIBUTED GRAPH PROCESSING
Vasia Kalavri
vasia@apache.org
@vkalavri
WHY DISTRIBUTED GRAPH PROCESSING?
MISCONCEPTION #1
"MY GRAPH IS SO BIG, IT DOESN’T FIT IN A SINGLE MACHINE" (Big Data Ninja)
A SOCIAL NETWORK
YOUR INPUT DATASET SIZE IS _OFTEN_ IRRELEVANT
INTERMEDIATE DATA: THE OFTEN DISREGARDED EVIL
▸ Naive Who(m) to Follow:
  ▸ compute a friends-of-friends list per user
  ▸ exclude existing friends
  ▸ rank by common connections
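The three steps above can be sketched on a single machine to show where the intermediate data comes from. This is only an illustration under stated assumptions: the function name `who_to_follow`, the adjacency-set input format, and the star-shaped toy graph are all hypothetical and not taken from the talk.

```python
from collections import Counter


def who_to_follow(friends):
    """Naive recommendation: rank friends-of-friends by common connections.

    `friends` maps each user to the set of users they are connected to
    (an undirected graph stored as adjacency sets; assumed input format).
    Returns (recommendations, number_of_intermediate_records).
    """
    recommendations = {}
    intermediate = 0
    for user, direct in friends.items():
        counts = Counter()
        for friend in direct:
            for fof in friends[friend]:
                intermediate += 1          # one (user, friend-of-friend) record
                if fof != user and fof not in direct:
                    counts[fof] += 1       # one more common connection
        # Most common connections first; ties broken by user id.
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        recommendations[user] = [candidate for candidate, _ in ranked]
    return recommendations, intermediate


# Hypothetical toy input: a star graph where user 0 is connected to everyone.
n = 1000
star = {0: set(range(1, n))}
for v in range(1, n):
    star[v] = {0}

recs, intermediate = who_to_follow(star)
print(n - 1, "edges produced", intermediate, "intermediate records")
```

On this toy graph, 999 input edges expand into 999,000 friends-of-friends records, because the number of records grows with the sum of squared vertex degrees rather than with the number of edges.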
MISCONCEPTION #2
"DISTRIBUTED PROCESSING IS ALWAYS FASTER THAN SINGLE-NODE" (Data Science Rockstar)
GRAPHS DON’T APPEAR OUT OF THIN AIR
▸ Expectation…
▸ Reality!
HOW DO WE EXPRESS A DISTRIBUTED GRAPH ANALYSIS TASK?
GRAPH APPLICATIONS ARE DIVERSE
▸ Iterative value propagation
  ▸ PageRank, Connected Components, Label Propagation
▸ Traversals and path exploration
  ▸ Shortest paths, centrality measures
▸ Ego-network analysis
  ▸ Personalized recommendations
▸ Pattern mining
  ▸ Finding frequent subgraphs
RECENT DISTRIBUTED GRAPH PROCESSING HISTORY
▸ 2004: MapReduce
▸ 2009: Pegasus
▸ 2010: Pregel, Signal-Collect
▸ 2012: PowerGraph
▸ Application classes: Iterative value propagation
PREGEL: THINK LIKE A VERTEX
[Figure: example graph with vertices 1 to 5, stored as per-vertex adjacency lists (1: 3, 4; 2: 1, 4; 5: 3; ...)]
PREGEL: SUPERSTEPS
(Vi+1, outbox) ← compute(Vi, inbox)
[Figure: the per-vertex adjacency lists (1: 3, 4; 2: 1, 4; 5: 3; ...) shown at superstep i and at superstep i+1]
PREGEL EXAMPLE: PAGERANK
void compute(messages):
    // sum up received messages
    sum = 0.0
    for (m <- messages) do
        sum = sum + m
    end for
    // update vertex rank
    setValue(0.15/numVertices() + 0.85*sum)
    // distribute rank to neighbors
    for (edge <- getOutEdges()) do
        sendMessageTo(edge.target(), getValue()/numEdges())
    end for
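The vertex-centric PageRank above can be simulated on one machine to make the superstep barrier concrete. The following is a minimal sketch, not how Pregel itself is implemented: the function name `pregel_pagerank`, the dictionary-based graph, the initial rank of 1/N, and the choice to skip the rank update in the first superstep (when no messages have arrived yet) are assumptions. It also assumes every vertex has at least one outgoing edge.

```python
def pregel_pagerank(out_edges, supersteps=30, damping=0.85):
    """Synchronous, vertex-centric PageRank simulated in a single process.

    `out_edges` maps every vertex to the list of its out-neighbors
    (assumed format; each vertex needs at least one out-edge).
    """
    n = len(out_edges)
    value = {v: 1.0 / n for v in out_edges}      # assumed initial rank
    inbox = {v: [] for v in out_edges}
    for step in range(supersteps):
        outbox = {v: [] for v in out_edges}
        for v, targets in out_edges.items():
            # compute(): invoked once per vertex in every superstep
            if step > 0:
                # sum up received messages and update the vertex rank
                value[v] = (1 - damping) / n + damping * sum(inbox[v])
            # distribute the rank evenly to the out-neighbors
            for t in targets:
                outbox[t].append(value[v] / len(targets))
        # barrier: messages sent now are only visible in the next superstep
        inbox = outbox
    return value


# Hypothetical toy graph: 1 -> 2, 1 -> 3, 2 -> 3, 3 -> 1
graph = {1: [2, 3], 2: [3], 3: [1]}
ranks = pregel_pagerank(graph)
print(ranks)
```

Because every vertex in the toy graph has an outgoing edge, the ranks keep summing to 1 across supersteps.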
SIGNAL-COLLECT
▸ Signal: outbox ← signal(Vi)
▸ Collect: Vi+1 ← collect(inbox)
[Figure: the per-vertex adjacency lists (1: 3, 4; 2: 1, 4; 5: 3; ...) at superstep i and superstep i+1, with the phases labeled Signal and Collect]
SIGNAL-COLLECT EXAMPLE: PAGERANK
void signal():
    // distribute rank to neighbors
    for (edge <- getOutEdges()) do
        sendMessageTo(edge.target(), getValue()/numEdges())
    end for
void collect(messages):
    // sum up received messages
    sum = 0.0
    for (m <- messages) do
        sum = sum + m
    end for
    // update vertex rank
    setValue(0.15/numVertices() + 0.85*sum)
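Splitting compute() into the two functions above separates message production from the state update. A small single-machine sketch of that split follows; the driver `signal_collect_pagerank`, the function signatures, and the dictionary-based graph are illustrative assumptions and do not reproduce the API of the Signal/Collect framework. Every vertex is assumed to have at least one outgoing edge.

```python
def signal(v, value, out_edges):
    """Signal phase: build the messages vertex v sends along its out-edges."""
    targets = out_edges[v]
    return [(t, value[v] / len(targets)) for t in targets]


def collect(messages, n, damping=0.85):
    """Collect phase: compute the new vertex value from received messages."""
    return (1 - damping) / n + damping * sum(messages)


def signal_collect_pagerank(out_edges, supersteps=30):
    n = len(out_edges)
    value = {v: 1.0 / n for v in out_edges}      # assumed initial rank
    for _ in range(supersteps):
        inbox = {v: [] for v in out_edges}
        # Signal: every vertex reads only its own current value.
        for v in out_edges:
            for target, message in signal(v, value, out_edges):
                inbox[target].append(message)
        # Collect: every vertex reads only its own inbox.
        value = {v: collect(inbox[v], n) for v in out_edges}
    return value


# Hypothetical toy graph: 1 -> 2, 1 -> 3, 2 -> 3, 3 -> 1
sc_graph = {1: [2, 3], 2: [3], 3: [1]}
sc_ranks = signal_collect_pagerank(sc_graph)
print(sc_ranks)
```

Note that `signal` never touches the inbox and `collect` never sends messages, which is the restriction this model adds on top of a single compute() function.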
GATHER-SUM-APPLY (POWERGRAPH)
[Figure: example vertices passing through the Gather, Sum, and Apply phases in superstep i, followed by Gather in superstep i+1]
GSA EXAMPLE: PAGERANK
// compute partial rank
double gather(source, edge, target):
    return target.value() / target.numEdges()
// combine partial ranks
double sum(rank1, rank2):
    return rank1 + rank2
// update rank
double apply(sum, currentRank):
    return 0.15 + 0.85*sum
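The three GSA functions can be exercised with a small single-machine driver. This is a sketch under assumptions, not PowerGraph's or Gelly's implementation: the driver `gsa_pagerank`, the edge-list input, and the initial rank of 1.0 are hypothetical, `combine` stands in for the slide's `sum` to avoid shadowing Python's built-in, and `gather` takes the rank and out-degree of the neighbor whose contribution is being gathered.

```python
from functools import reduce


def gather(neighbor_rank, neighbor_out_degree):
    """Compute the partial rank contributed over one edge."""
    return neighbor_rank / neighbor_out_degree


def combine(rank1, rank2):
    """Combine two partial ranks (the 'sum' function on the slide)."""
    return rank1 + rank2


def apply(total, current_rank, damping=0.85):
    """Update the rank from the combined partial ranks."""
    return (1 - damping) + damping * total


def gsa_pagerank(vertices, edges, supersteps=30):
    """`edges` is a list of directed (source, target) pairs (assumed format)."""
    out_degree = {v: 0 for v in vertices}
    for src, _ in edges:
        out_degree[src] += 1
    rank = {v: 1.0 for v in vertices}            # assumed initial rank
    for _ in range(supersteps):
        partials = {v: [] for v in vertices}
        # Gather runs once per edge, so it can be parallelized over edges.
        for src, dst in edges:
            partials[dst].append(gather(rank[src], out_degree[src]))
        # Sum is a pairwise reduction; apply runs once per vertex.
        rank = {v: apply(reduce(combine, partials[v], 0.0), rank[v])
                for v in vertices}
    return rank


# Hypothetical toy graph: 1 -> 2, 1 -> 3, 2 -> 3, 3 -> 1
gsa_ranks = gsa_pagerank([1, 2, 3], [(1, 2), (1, 3), (2, 3), (3, 1)])
print(gsa_ranks)
```

Because gather works per edge and the combiner is associative, the work for a high-degree vertex can be split across machines, which is the motivation usually given for this model. With the unnormalized update used here, the ranks sum to the number of vertices rather than to 1.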
RECENT DISTRIBUTED GRAPH PROCESSING HISTORY
▸ 2004: MapReduce
▸ 2009: Pegasus
▸ 2010: Pregel, Signal-Collect
▸ 2012: PowerGraph
▸ 2013: Giraph++
▸ Application classes: Iterative value propagation, Graph Traversals
THINK LIKE A (SUB)GRAPH
[Figure: the example graph with vertices 1 to 5, shown twice]
▸ compute() on the entire partition
▸ Information flows freely inside each partition
▸ Network communication between partitions, not vertices
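The three points above can be illustrated with connected components computed by minimum-label propagation, where each partition runs to a local fixpoint before any message is exchanged. This is a single-machine sketch under assumptions: the function name, the `partition_of` mapping, and the toy path graph are hypothetical and do not reproduce the Giraph++ API.

```python
def partition_centric_components(edges, partition_of):
    """Min-label connected components with one compute() per partition.

    `edges` holds undirected (u, v) pairs; `partition_of` maps each vertex
    to a partition id (any assignment works; assumed input format).
    Returns (labels, number_of_supersteps).
    """
    neighbors = {v: set() for v in partition_of}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    partitions = {}
    for v, p in partition_of.items():
        partitions.setdefault(p, []).append(v)

    label = {v: v for v in partition_of}
    inbox, supersteps, first = {}, 0, True
    while first or inbox:
        supersteps += 1
        outbox = {}
        for p, members in partitions.items():
            # compute() on the entire partition
            updated = set(members) if first else set()
            for v in members:                    # apply remote messages
                if v in inbox and inbox[v] < label[v]:
                    label[v] = inbox[v]
                    updated.add(v)
            changed = True
            while changed:                       # free flow inside the partition
                changed = False
                for v in members:
                    for w in neighbors[v]:
                        if partition_of[w] == p and label[v] < label[w]:
                            label[w] = label[v]
                            updated.add(w)
                            changed = True
            for v in updated:                    # messages cross partitions only
                for w in neighbors[v]:
                    if partition_of[w] != p:
                        outbox[w] = min(outbox.get(w, label[v]), label[v])
        inbox, first = outbox, False
    return label, supersteps


# Hypothetical toy input: the path 1-2-3-4-5-6 split into two partitions.
path = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
halves = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}
labels, steps = partition_centric_components(path, halves)
print(labels, steps)
```

On this path, label 1 crosses the whole graph with a single message over the one edge that connects the two partitions; a vertex-centric version would need a message over every edge of the path and a number of supersteps that grows with the path length.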
RECENT DISTRIBUTED GRAPH PROCESSING HISTORY
▸ 2004: MapReduce
▸ 2009: Pegasus
▸ 2010: Pregel, Signal-Collect
▸ 2012: PowerGraph
▸ 2013: Giraph++
▸ 2014: NScale
▸ 2015: Arabesque
▸ Tinkerpop
▸ Application classes: Iterative value propagation, Graph Traversals, Ego-network analysis, Pattern Matching
CAN WE HAVE IT ALL?
▸ Data pipeline integration: built on top of an efficient distributed processing engine
▸ Graph ETL: high-level API with abstractions and methods to transform graphs
▸ Familiar programming model: support popular programming abstractions
HELLO, GELLY! THE APACHE FLINK GRAPH API
▸ Java and Scala APIs: seamlessly integrate with Flink’s DataSet API
▸ Transformations, library of common algorithms
    val graph = Graph.fromDataSet(edges, env)
    val ranks = graph.run(new PageRank(0.85, 20))
▸ Iteration abstractions
  ▸ Pregel
  ▸ Signal-Collect
  ▸ Gather-Sum-Apply
  ▸ Partition-Centric*
WHY FLINK?
[Figure: diagram with the labels POSIX and Java/Scala Collections]
▸ efficient streaming runtime
▸ native iteration operators
▸ well-integrated
FEELING GELLY?
▸ Paper References
http://www.citeulike.org/user/vasiakalavri/tag/dotscale
▸ Apache Flink:
http://flink.apache.org/
▸ Gelly documentation:
http://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/gelly.html
▸ Gelly-Stream:
https://github.com/vasia/gelly-streaming
