GraphFrames: Graph Queries in Spark SQL
Ankur Dave
UC Berkeley AMPLab
Joint work with Alekh Jindal (Microsoft), Li Erran Li (Uber),
Joseph Gonzalez (UC Berkeley), Matei Zaharia (MIT and Databricks)
Support for Graph Analysis in Spark
2009: Spark (relational queries)
2013: Spark + GraphX (+ graph algorithms)
2016: Spark + GraphFrames (+ graph queries)
Neo4j, etc.: graph updates + anchored traversals
Graph Algorithms vs. Graph Queries
[Figure: graph algorithms (examples: PageRank, Alternating Least Squares) contrasted with graph queries]
Graph Algorithms vs. Graph Queries
Graph Algorithm: PageRank
Graph Query: Wikipedia Collaborators
[Figure: pattern in which Editor 1 and Editor 2 both edit Article 1 and Article 2, each article on the same day]
Graph Algorithms vs. Graph Queries
Graph Algorithm: PageRank
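// Iterate until convergence
wikipedia.pregel(
  // Each edge sends the source's weighted rank to its destination
  sendMsg = { e =>
    e.sendToDst(e.srcRank * e.weight)
  },
  // Sum the messages arriving at each vertex
  mergeMsg = _ + _,
  // Compute the new rank from the summed messages
  vprog = { (id, oldRank, msgSum) =>
    0.15 + 0.85 * msgSum
  })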
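The message-passing loop above can be sketched on plain Scala collections, without Spark. This is a minimal illustration, not the GraphX or GraphFrames implementation: the names `PageRankSketch`, `Edge`, `withUniformWeights`, and `pageRank` are hypothetical, edge weights are assumed to be 1 / outDegree(src), and a fixed number of supersteps stands in for the convergence test.

```scala
object PageRankSketch {
  // Hypothetical edge type; weight is assumed to be 1 / outDegree(src).
  final case class Edge(src: Int, dst: Int, weight: Double)

  // Builds weighted edges from (src, dst) pairs under the assumption above.
  def withUniformWeights(pairs: Seq[(Int, Int)]): Seq[Edge] = {
    val outDegree = pairs.groupBy(_._1).map { case (src, es) => src -> es.size }
    pairs.map { case (s, d) => Edge(s, d, 1.0 / outDegree(s)) }
  }

  // Runs a fixed number of supersteps instead of testing for convergence.
  def pageRank(vertices: Seq[Int], edges: Seq[Edge], iterations: Int): Map[Int, Double] = {
    var ranks: Map[Int, Double] = vertices.map(v => v -> 1.0).toMap
    for (_ <- 1 to iterations) {
      // sendMsg: each edge sends srcRank * weight to its destination.
      val messages = edges.map(e => e.dst -> ranks(e.src) * e.weight)
      // mergeMsg: sum the messages arriving at each vertex.
      val msgSum = messages.groupBy(_._1).map { case (dst, ms) => dst -> ms.map(_._2).sum }
      // vprog: a vertex that received no messages uses a sum of 0.
      ranks = vertices.map(v => v -> (0.15 + 0.85 * msgSum.getOrElse(v, 0.0))).toMap
    }
    ranks
  }
}
```

On a two-vertex cycle every rank stays at 1.0, which is a quick sanity check that the three functions compose as intended.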
Graph Query: Wikipedia Collaborators
// Requires: import org.apache.spark.sql.functions.expr
wikipedia.find(
    "(u1)-[e11]->(article1); " +
    "(u2)-[e21]->(article1); " +
    "(u1)-[e12]->(article2); " +
    "(u2)-[e22]->(article2)")
  .selectExpr(
    "*",
    "e11.date - e21.date AS d1",
    "e12.date - e22.date AS d2")
  .sort(expr("d1 + d2").desc)
  .take(10)
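What the four-edge pattern matches can be sketched as nested joins over a plain list of edit edges. This is an illustration only: `CollaboratorsSketch`, `Edit`, `Match`, and the integer `day` field are hypothetical, and the filter that drops self-pairs and mirrored duplicates is an added assumption that the query above does not contain.

```scala
object CollaboratorsSketch {
  // Hypothetical edit edge: user -> article, with the edit day as an integer.
  final case class Edit(user: String, article: String, day: Int)

  // One match of the pattern, with the two date differences from the query.
  final case class Match(
      u1: String, u2: String, article1: String, article2: String, d1: Int, d2: Int)

  def collaborators(edits: Seq[Edit], limit: Int): Seq[Match] = {
    val matches = for {
      e11 <- edits                                // (u1)-[e11]->(article1)
      e21 <- edits if e21.article == e11.article  // (u2)-[e21]->(article1)
      e12 <- edits if e12.user == e11.user        // (u1)-[e12]->(article2)
      e22 <- edits                                // (u2)-[e22]->(article2)
      if e22.user == e21.user && e22.article == e12.article
      // Assumption, not in the slide: drop self-pairs and mirrored duplicates.
      if e11.user < e21.user && e11.article < e12.article
    } yield Match(e11.user, e21.user, e11.article, e12.article,
                  e11.day - e21.day, e12.day - e22.day)
    // Same ordering as the query: descending by d1 + d2, then keep the top rows.
    matches.sortBy(m => -(m.d1 + m.d2)).take(limit)
  }
}
```

Each generator corresponds to one edge of the pattern, and each guard to a shared variable name, which is the same correspondence a relational engine exploits when it turns the pattern into joins.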
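Separate Systems
[Figure: graph algorithms and graph queries handled by separate systems]
Problem: Mixed Graph Analysis
[Figure: Wikipedia analysis pipeline mixing tables, graph algorithms, and graph queries. Labels: Raw Wikipedia (XML), Text Table, Edit Table, Edit Graph, Hyperlinks, PageRank, Frequent Collaborators, Vandalism Suspects; graphs labeled User Product, User Article, User User]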
Solution: GraphFrames
[Figure: system stack. Labels: Graph Algorithms, Graph Queries, GraphFrames API, Pattern Query Optimizer, Spark SQL]
GraphFrames API
• Unifies graph algorithms, graph queries, and DataFrames
• Available in Scala and Python
class GraphFrame {
  def vertices: DataFrame
  def edges: DataFrame
  def find(pattern: String): DataFrame
  def registerView(pattern: String, df: DataFrame): Unit
  def degrees(): DataFrame
  def pageRank(): GraphFrame
  def connectedComponents(): GraphFrame
  ...
}
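The idea that a graph is a pair of tables can be sketched with a toy stand-in built on plain collections. Everything here is hypothetical (`GraphFrameSketch`, `ToyGraph`, `Vertex`, `Edge`); it only mirrors the shape of `vertices`, `edges`, and `degrees()` from the listing above and is not the GraphFrames class.

```scala
object GraphFrameSketch {
  final case class Vertex(id: Long, attr: String)
  final case class Edge(src: Long, dst: Long)

  // Toy stand-in for a graph held as two tables (hypothetical, not the real API).
  final case class ToyGraph(vertices: Seq[Vertex], edges: Seq[Edge]) {
    // Number of edge endpoints touching each vertex, keyed by vertex id.
    def degrees: Map[Long, Int] =
      edges
        .flatMap(e => Seq(e.src, e.dst))
        .groupBy(identity)
        .map { case (id, ids) => id -> ids.size }
  }
}
```

Because both tables are ordinary collections, a result such as `degrees` is itself just another table, which is the property that lets graph results feed back into relational processing.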
Implementation
[Figure: query pipeline. Query String → Parsed Pattern → Logical Plan → Optimized Logical Plan → DataFrame Result. Stages: Graph–Relational Translation, Join Elimination and Reordering, View Selection (using Materialized Views), Spark SQL. Graph Algorithms run on GraphX]
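Graph–Relational Translation
[Figure: translating a pattern over vertices A, B, C, D. The existing logical plan (output: A, B, C) is joined with the edge table (Src, Dst) on C = Src and with the vertex table (ID, Attr) on D = ID]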
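Graph–relational translation of a single pattern edge can be sketched as two joins against existing partial matches. The names `TranslationSketch`, `Binding`, and `extend` are hypothetical, and vertex attributes are omitted, so the vertex-table join is reduced to a membership check on IDs.

```scala
object TranslationSketch {
  // One partial match: pattern variable name -> vertex id.
  type Binding = Map[String, Long]

  // Extends each binding with the pattern edge (from)-[]->(to):
  // join with the edge table on from = Src, then with the vertex table on to = ID.
  def extend(
      bindings: Seq[Binding],
      edges: Seq[(Long, Long)],
      vertexIds: Set[Long],
      from: String,
      to: String): Seq[Binding] =
    for {
      row <- bindings
      (src, dst) <- edges
      if row(from) == src              // join on from = Src
      if vertexIds.contains(dst)       // join on to = ID
      if row.get(to).forall(_ == dst)  // a variable bound earlier must agree
    } yield row + (to -> dst)
}
```

Applying `extend` once per pattern edge builds up the full match table, which is why the order of those applications (the join order) matters for cost.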
Materialized View Selection
GraphX: Triplet view enables efficient message-passing algorithms
[Figure: vertices A, B, C, D and edges (A, B), (A, C), (B, C), (C, D) are combined into a triplet view of the graph, which is used to compute updated PageRanks]
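A triplet view can be sketched as a three-way join of each edge with its source and destination vertex attributes. This is an illustration on plain collections, not GraphX's implementation; `TripletViewSketch`, `Triplet`, and the `Double` attribute (standing in for a rank) are hypothetical.

```scala
object TripletViewSketch {
  // One row of the view: an edge together with both endpoint attributes.
  final case class Triplet(src: String, srcAttr: Double, dst: String, dstAttr: Double)

  // Joins every edge with its source and destination attributes.
  // Edges whose endpoints are missing from the vertex table are dropped.
  def triplets(vertices: Map[String, Double], edges: Seq[(String, String)]): Seq[Triplet] =
    for {
      (s, d) <- edges
      sa <- vertices.get(s).toSeq
      da <- vertices.get(d).toSeq
    } yield Triplet(s, sa, d, da)
}
```

With the view materialized, a message-passing step reads source attribute, edge, and destination in one row instead of repeating the joins on every iteration.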
Materialized View Selection
GraphFrames: User-defined views enable efficient graph queries
[Figure: the same vertices A, B, C, D and edges (A, B), (A, C), (B, C), (C, D). The triplet view serves graph algorithms such as PageRank and community detection; user-defined views serve graph queries]
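How a user-defined view can shorten a pattern query is sketched below: a two-hop path view is computed once and a triangle query is then answered with a single extra lookup. The names `ViewSketch`, `twoHopView`, and `trianglesFromView` are hypothetical, and the choice of a two-hop view is an assumption for illustration, not a view taken from the talk.

```scala
object ViewSketch {
  type Edge = (Int, Int)

  // Hypothetical materialized view for the pattern (a)-[]->(b); (b)-[]->(c).
  def twoHopView(edges: Seq[Edge]): Seq[(Int, Int, Int)] =
    for {
      (a, b) <- edges
      (b2, c) <- edges
      if b == b2
    } yield (a, b, c)

  // Triangle query (a)->(b)->(c)->(a), answered from the view plus one lookup.
  def trianglesFromView(view: Seq[(Int, Int, Int)], edges: Seq[Edge]): Seq[(Int, Int, Int)] = {
    val edgeSet = edges.toSet
    view.filter { case (a, _, c) => edgeSet.contains((c, a)) }
  }
}
```

The view pays the cost of the first join once; every later query whose pattern contains a two-hop path can start from it.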
Join Elimination

Edges
Src  Dst
1    2
1    3
2    3
2    5

Vertices
ID  Attr
1   A
2   B
3   C
4   D

Standard vertex-edge join:
SELECT src, dst, attr AS src_attr
FROM edges INNER JOIN vertices ON src = id;

Unnecessary join:
SELECT src, dst
FROM edges INNER JOIN vertices ON src = id;

The unnecessary join can be eliminated if the tables satisfy referential
integrity, simplifying graph–relational translation.
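The elimination can be checked on the slide's own tables with plain Scala collections. `JoinEliminationSketch` and its member names are hypothetical. The sketch only verifies that when every edge source appears in the vertex table (and IDs are unique), the join on `src = id` neither drops nor duplicates rows, so projecting `src, dst` from the join equals scanning the edge table.

```scala
object JoinEliminationSketch {
  // Tables from the slide: edges as (src, dst), vertices as id -> attr.
  val edges: Seq[(Int, Int)] = Seq((1, 2), (1, 3), (2, 3), (2, 5))
  val vertices: Map[Int, String] = Map(1 -> "A", 2 -> "B", 3 -> "C", 4 -> "D")

  // SELECT src, dst FROM edges INNER JOIN vertices ON src = id
  def withJoin: Seq[(Int, Int)] =
    for {
      (src, dst) <- edges
      if vertices.contains(src)
    } yield (src, dst)

  // Referential integrity on src: every edge source is a vertex id.
  def srcIntegrityHolds: Boolean =
    edges.forall { case (src, _) => vertices.contains(src) }

  // With integrity on src, the join filters nothing: scanning edges is enough.
  def withoutJoin: Seq[(Int, Int)] = edges
}
```

Only integrity on the joined column matters here: destination 5 has no vertex row, yet the join on `src` is still removable.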
Join Reordering
[Figure: an example pattern query over edges A → B, B → A, B → D, C → B, B → E, C → D, C → E, shown as a left-deep join plan and as a bushy join plan that uses a user-defined view]
Evaluation
Faster than Neo4j for unanchored pattern queries
[Figure: two bar charts of query latency (s), GraphFrames vs. Neo4j. Anchored pattern query: axis 0 to 2.5 s. Unanchored pattern query: axis 0 to 80 s]
Triangle query on 1M edge subgraph of web-Google. Each system configured to use a single core.
Evaluation
Approaches performance of GraphX for graph algorithms using Spark SQL
whole-stage code generation
[Figure: bar chart of PageRank per-iteration runtime (s) for GraphFrames, GraphX, and Naïve Spark; axis 0 to 7 s]
Per-iteration performance on web-Google, single 8-core machine. Naïve Spark uses Scala RDD API.
Evaluation
Registering the right views can greatly improve performance
Future Work
• Suggest views automatically
• Exploit attribute-based partitioning in optimizer
• Code generation for single node
Try It Out!
Preview available for Spark 1.4+ at:
https://github.com/graphframes/graphframes
Thanks to Databricks contributors Joseph Bradley, Xiangrui Meng, and Timothy Hunter.
Watch for the release on Spark Packages in the coming weeks.
ankurd@eecs.berkeley.edu

