THE SHORTEST PATH IS NOT
ALWAYS A STRAIGHT LINE
leveraging semi-metricity in large-scale graph analysis
Vasiliki Kalavri (kalavri@kth.se) KTH Royal Institute of Technology
Tiago Simas (tiago.simas@telefonica.com) Telefonica Research
Dionysios Logothetis (dionysios@fb.com) Facebook
Weighted graphs capture
relationship strength
distance
similarity
social proximity
rating
preference
influential nodes
optimal propagation paths
communities
recommendations
[Figure: weighted graph with users Alice, Bob, and Max; edges labeled "42 likes" and "3 likes"]
Sparsification techniques reduce the
graph size and still give exact or good
approximate results
G → G′
f(G) ≈ f(G′)
THE METRIC BACKBONE
Reduces the graph size while
maintaining relevant structure
The minimum subgraph of a weighted graph that
preserves the shortest paths of the original graph
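[Figure: an example graph with vertices A, B, C, D, E and edge weights 2, 3, 10, 4, 2, 1, next to its metric backbone, which keeps the edges with weights 2, 3, 2, 1]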
WHAT CAN WE USE IT FOR?
• Exact computations
  • any algorithm that depends on the shortest paths
  • reachability, connectivity
  • betweenness centrality, closeness centrality
• Approximation
  • PageRank, random walks
  • eigenvector centrality
  • community detection, clustering
Improves community detection modularity and recommender system accuracy
IMPACT ON LARGE-SCALE SYSTEMS
• Graph Databases
  • fewer edges => smaller path search space
• Batch Graph Processing
  • CPU and memory requirements depend on #messages
  • #messages proportional to #edges
  • fewer edges => improved analysis performance
• Graph Compression
  • fewer edges => storage reduction
BACKGROUND
SEMI-METRICITY
In a weighted graph, an edge is semi-metric if there
exists a shorter indirect path between its endpoints
[Figure: example graph with vertices A, B, C, D, E and edge weights 2, 3, 10, 4, 2, 1]
CE is 1st-order semi-metric: C-D-E is a shorter 2-hop path
AD is 2nd-order semi-metric: A-B-C-D is a shorter 3-hop path
AB, BC, CD, DE are metric
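The definition can be checked directly with one shortest-path search per edge. The sketch below is only an illustration: the slides do not say which weight belongs to which edge, so the assignment in `EDGES` is an assumption chosen to be consistent with the statements above (AD = 10, CE = 4), and the helper names are hypothetical.

```python
import heapq

# Assumed weights for the example graph (not given edge by edge on the slides).
EDGES = {("A", "B"): 3, ("B", "C"): 2, ("C", "D"): 2,
         ("D", "E"): 1, ("C", "E"): 4, ("A", "D"): 10}

def adjacency(edges):
    """Build an undirected adjacency map {u: {v: weight}}."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj

def shortest_distance(adj, source, target):
    """Plain Dijkstra; returns the weighted distance from source to target."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")

def semi_metric_edges(edges):
    """An edge is semi-metric if some path is strictly shorter than the edge."""
    adj = adjacency(edges)
    return {e for e, w in edges.items() if shortest_distance(adj, *e) < w}

print(sorted(semi_metric_edges(EDGES)))
```

With the assumed weights this reports AD and CE as semi-metric and leaves AB, BC, CD, DE as metric.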
BACKBONE ALGORITHM
BACKBONE CALCULATION
• Calculating the backbone:
  • find all semi-metric edges: 1 BFS per edge?
  • compute APSP and store O(N²) paths
Can we calculate or
approximate the backbone
without solving APSP?
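For reference, the APSP baseline that the question wants to avoid can be sketched as follows. This is a minimal single-machine illustration using Floyd-Warshall, not the authors' code; the edge weights are an assumed assignment for the example graph and `backbone_via_apsp` is a hypothetical name. The `dist` table holds one entry per vertex pair, which is the O(N²) storage cost mentioned above.

```python
from itertools import product

# Assumed weights for the example graph (not given edge by edge on the slides).
EDGES = {("A", "B"): 3, ("B", "C"): 2, ("C", "D"): 2,
         ("D", "E"): 1, ("C", "E"): 4, ("A", "D"): 10}

def backbone_via_apsp(edges):
    """Keep exactly the edges whose weight equals the shortest-path distance."""
    nodes = sorted({n for e in edges for n in e})
    inf = float("inf")
    # One distance per ordered vertex pair: O(N^2) memory.
    dist = {(u, v): (0 if u == v else inf)
            for u, v in product(nodes, repeat=2)}
    for (u, v), w in edges.items():
        dist[u, v] = dist[v, u] = min(dist[u, v], w)
    # Floyd-Warshall: product() varies its first element slowest,
    # so k is the outermost loop, as the algorithm requires.
    for k, i, j in product(nodes, repeat=3):
        if dist[i, k] + dist[k, j] < dist[i, j]:
            dist[i, j] = dist[i, k] + dist[k, j]
    return {e: w for e, w in edges.items() if dist[e] == w}

print(backbone_via_apsp(EDGES))
```

The cubic running time and quadratic memory are what make this approach impractical for large graphs.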
ORDER OF SEMI-METRICITY
Most semi-metric edges are
1st-order semi-metric
A 3-PHASE BACKBONE ALGORITHM
1. Find 1st-order semi-metric edges: only look at triangles
Scalable & practical for large graphs
EXAMPLE
[Figure: Phase 1 on the example graph. The 1st-order semi-metric edge CE is removed; the remaining edge weights are 2, 3, 10, 2, 1]
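A minimal single-machine sketch of the triangle check in Phase 1, assuming an undirected graph stored as a dictionary of edges. The weight assignment is an assumption (the slides do not label individual edges) and the function names are hypothetical; the distributed implementation may be organized differently.

```python
# Assumed weights for the example graph (not given edge by edge on the slides).
EDGES = {("A", "B"): 3, ("B", "C"): 2, ("C", "D"): 2,
         ("D", "E"): 1, ("C", "E"): 4, ("A", "D"): 10}

def adjacency(edges):
    """Build an undirected adjacency map {u: {v: weight}}."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj

def first_order_semi_metric(edges):
    """Flag edges (u, v) for which some triangle u-t-v gives a shorter 2-hop path."""
    adj = adjacency(edges)
    flagged = set()
    for (u, v), w in edges.items():
        # Common neighbours of u and v close a triangle with edge (u, v).
        for t in adj[u].keys() & adj[v].keys():
            if adj[u][t] + adj[t][v] < w:
                flagged.add((u, v))
                break
    return flagged

print(first_order_semi_metric(EDGES))
```

With the assumed weights only CE is flagged; AD has no common neighbour between its endpoints, so it is left for the later phases.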
2. Identify metric edges in 2-hop paths
Most semi-metric edges have already been removed in phase 1
EXAMPLE
[Figure: Phase 2 on the example graph. Four edges are labeled metric (M); one edge remains unlabeled (?)]
The lowest-weight edge of every vertex is metric:
any indirect path from u to v would have larger weight
[Figure: vertices u and v with edge weights 2, 4, 2, 1]
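The lowest-weight rule can be sketched as below. This covers only the rule stated on the slide, not necessarily everything Phase 2 does. It assumes strictly positive weights, in which case edges tied for the minimum at a vertex are metric as well. `REMAINING` is an assumed weight assignment for the graph left after Phase 1, and the function name is hypothetical.

```python
# Assumed weights for the example graph after Phase 1 removed CE.
REMAINING = {("A", "B"): 3, ("B", "C"): 2, ("C", "D"): 2,
             ("D", "E"): 1, ("A", "D"): 10}

def lowest_weight_metric_edges(edges):
    """Label as metric every edge that is a minimum-weight edge of an endpoint."""
    lowest = {}
    for (u, v), w in edges.items():
        lowest[u] = min(w, lowest.get(u, w))
        lowest[v] = min(w, lowest.get(v, w))
    # Any indirect path out of u starts with an edge at least as heavy as
    # lowest[u] and then adds more positive weight, so it cannot be shorter.
    return {e for e, w in edges.items()
            if w == lowest[e[0]] or w == lowest[e[1]]}

metric = lowest_weight_metric_edges(REMAINING)
unlabeled = set(REMAINING) - metric
print(sorted(metric), sorted(unlabeled))
```

With these assumed weights, four edges are labeled metric and only AD stays unlabeled for Phase 3.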
3. Run a BFS for the remaining unlabeled edges (1%-9% of the total edges)
EXAMPLE
[Figure: Phase 3 on the example graph. Four edges are labeled metric (M); a BFS is run for the remaining unlabeled edge]
Explore paths with shorter distances only
If the BFS arrives at the target, the edge is semi-metric
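One way to read the two rules above is as a bounded, BFS-style search: starting from one endpoint, extend only those partial paths whose total weight stays below the weight of the edge under test. The sketch below follows that reading on a single machine. It is an assumption about the details, not the authors' vertex-centric code, and the weights and names are hypothetical.

```python
from collections import deque

# Assumed weights for the example graph after Phase 1 removed CE.
REMAINING = {("A", "B"): 3, ("B", "C"): 2, ("C", "D"): 2,
             ("D", "E"): 1, ("A", "D"): 10}

def adjacency(edges):
    """Build an undirected adjacency map {u: {v: weight}}."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj

def is_semi_metric(adj, u, v):
    """Return True if an indirect u-v path is strictly shorter than edge (u, v)."""
    limit = adj[u][v]
    best = {u: 0}          # best known distance from u to each visited vertex
    queue = deque([u])
    while queue:
        x = queue.popleft()
        for y, w in adj[x].items():
            if x == u and y == v:
                continue   # skip the direct edge under test
            d = best[x] + w
            if d >= limit or d >= best.get(y, limit):
                continue   # prune: not shorter than the edge, or no improvement
            if y == v:
                return True  # reached the target with a shorter path
            best[y] = d
            queue.append(y)
    return False

adj = adjacency(REMAINING)
print(is_semi_metric(adj, "A", "D"))
```

The pruning keeps the search local: for AD it stops as soon as the path A-B-C-D (weight 7 under the assumed weights) reaches D.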
EXAMPLE
[Figure: the metric backbone of the example graph, with edge weights 2, 3, 2, 1]
DISTRIBUTED IMPLEMENTATION
code available: http://grafos.ml/okapi.html#analytics
Implementation in the vertex-centric model
EVALUATION
EVALUATION GOALS
• How does our algorithm compare to APSP?
• Are large, real-world graphs semi-metric?
• Can we improve graph analysis performance?
COMPARISON TO APSP
Computing APSP in Giraph
• multiple SSSPs: in the order of months for million-edge graphs
• multiple MSSPs, i.e. SSSPs from several sources in parallel: in the order of days for million-edge graphs
Our algorithm is 120-180x faster than SSSP
and 11-14x faster than MSSP:
order of hours for million-edge graphs
ALGORITHM PHASES
Phase 1: Very fast and scalable. Removes up to 90% of semi-metric edges.
Phase 2: Moderately fast. Labels up to 60% of the unlabeled edges.
Phase 3: Slow. Labels 1-9% of the total edges.
Phase 1 is the fastest and most useful phase
PHASE 1 SCALABILITY
almost linear
scalability
<200s on a
billion-edge graph
SEMI-METRICITY IN REAL GRAPHS
Graph        |V|    |E|    edge-weight metric        semi-metricity
Facebook     190M   49.9B  custom                    26.5%
Twitter      40M    1.5B   jaccard                   39%
Tuenti       12M    685M   jaccard                   59%
Livejournal  4.8M   34M    jaccard                   40%
NotreDame    0.3M   1.5M   jaccard, adamic           45%-29%
DBLP         318K   1M     jaccard, adamic           23%-9%
Twitter-ego  81K    1.7M   jaccard, adamic           57%-39%
Movielens    1.6K   1.9M   jaccard                   88%
Facebook     1K     143K   #messages, message size   78%-77%
US-Airports  0.5K   6K     #passengers               72%
C-Elegans    0.3K   2.3K   #connections              17%
semi-metricity = % 1st-order semi-metric edges => reduction in memory and communication
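As a rough reading of the table, the number of edges left after removing the 1st-order semi-metric edges is |E| × (1 − semi-metricity). The snippet below applies that arithmetic to a few rows; the figures are copied from the table, the function name is hypothetical, and the result is only an estimate because the |E| values are rounded.

```python
# (|E|, fraction of 1st-order semi-metric edges), copied from the table.
GRAPHS = {
    "Twitter": (1.5e9, 0.39),
    "Tuenti": (685e6, 0.59),
    "Livejournal": (34e6, 0.40),
    "Movielens": (1.9e6, 0.88),
}

def edges_after_phase1(num_edges, semi_metric_fraction):
    """Estimated edge count once 1st-order semi-metric edges are removed."""
    return num_edges * (1.0 - semi_metric_fraction)

for name, (num_edges, fraction) in GRAPHS.items():
    remaining = edges_after_phase1(num_edges, fraction)
    print(f"{name}: about {remaining:,.0f} edges remain")
```

Since #messages is proportional to #edges in batch graph processing, the same factor gives a first estimate of the communication saved.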
QUERY SPEEDUP ON NEO4J
6.7x speedup
APACHE GIRAPH SPEEDUP
Including the time to calculate the backbone
4x speedup
APACHE GIRAPH SPEEDUP
6x speedup
COMMUNICATION REDUCTION
Up to 70% for highly semi-metric graphs
BEST PRACTICES
When to use the backbone?
• semi-metric weighting schemes, e.g. neighborhood similarity
• we can amortize the overhead, e.g. many algorithms on the same graph, multiple distance queries
• lossy compression is ok
When not to use the backbone?
• metric weighting schemes
• we need to run a one-off analysis
• we need lossless compression
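The amortization argument can be made concrete with a break-even count: the backbone pays off once its one-time cost is smaller than the accumulated per-run savings. The helper below is a hypothetical illustration; the cost values in the usage line are made-up units, not measurements from the evaluation.

```python
import math

def break_even_runs(backbone_cost, run_cost_full, run_cost_backbone):
    """Smallest number of runs k with backbone_cost + k * run_cost_backbone
    strictly below k * run_cost_full, or None if the backbone never pays off."""
    saving = run_cost_full - run_cost_backbone
    if saving <= 0:
        return None
    return math.floor(backbone_cost / saving) + 1

# Made-up units: computing the backbone costs 100, one analysis costs 30 on
# the full graph and 10 on the backbone.
print(break_even_runs(100, 30, 10))
```

A one-off analysis corresponds to k = 1, where the backbone cost usually dominates.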
RECAP: MAIN CONTRIBUTIONS
• An algorithm for computing the metric backbone without solving APSP
• An open-source distributed implementation
• Graph query and graph analytics speedup on Neo4j and Apache Giraph