XXL Graph Algorithms
                                              Sergei Vassilvitskii
                                                Yahoo! Research

With help from Jake Hofman, Siddharth Suri, Cong Yu and many others
Introduction
  XXL Graphs are everywhere:
   – Web graph
   – Friend graphs
   – Advertising graphs...







  But we have Hadoop!
   – Few algorithms have been ported (no Hadoop Algorithms book)
   – Few general algorithmic approaches
   – Active area of research




Outline
  Today:
   – Act 1: Crawl before you walk
      • Counting connected components
   – Act 2: The curse of the last reducer
      • Finding tight knit friend groups




Act 1: Connected Components
     Given a graph, how many components does it have?


  [Figure: example graph on vertices a through h]

  Edge records:
    (a,b) 1   (a,c) 1   (b,c) 1   (b,d) 1   (b,e) 1   (c,d) 1
    (c,e) 1   (d,e) 1   (d,e) 1   (f,g) 1   (f,h) 1   (g,h) 1

  Data too big to fit on one reducer!

CC Overview
  Outline for Connected Components
  – Partition the input into several chunks (map 1)
  – Summarize the connectivity on each chunk (reduce 1)
  – Combine all of the (small) summaries (map 2)
  – Find the number of connected components (reduce 2)
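
The steps in this outline can be sketched in a single process as follows. This is a minimal illustration, not the talk's actual Hadoop code: the names `DisjointSet`, `spanning_forest` and `count_components` are hypothetical, the random partition stands in for the first map phase, and the summary kept by each reducer is assumed to be a spanning forest of its chunk.

```python
import random
from collections import defaultdict


class DisjointSet:
    """Union-find with path halving; vertices are registered lazily."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        """Merge the sets of a and b; return True if they were separate."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[ra] = rb
        return True


def spanning_forest(edges):
    """Keep only the edges that join two previously separate components."""
    ds = DisjointSet()
    kept = []
    for u, v in edges:
        if ds.union(u, v):
            kept.append((u, v))
    return kept


def count_components(vertices, edges, num_chunks=2, seed=0):
    rng = random.Random(seed)
    # Map 1: send each edge to a randomly chosen chunk.
    chunks = defaultdict(list)
    for edge in edges:
        chunks[rng.randrange(num_chunks)].append(edge)
    # Reduce 1: summarize each chunk by a spanning forest (fewer than n edges).
    summaries = [spanning_forest(chunk) for chunk in chunks.values()]
    # Map 2: bring all of the small summaries together.
    combined = [edge for summary in summaries for edge in summary]
    # Reduce 2: a spanning forest with f edges on n vertices has n - f components.
    return len(vertices) - len(spanning_forest(combined))
```

Called as `count_components("abcdefgh", edges)` on the edge records of the example graph, the sketch reports the same count for any choice of `num_chunks` and `seed`, because dropping an edge that closes a cycle inside a chunk never disconnects anything.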




Connected Components
  1. Partition (randomly):
     [Figure: the edges of the example graph split between two reducers;
      Reduce 1 sees vertices b, c, d, e and Reduce 2 sees vertices a, b, c, f, g, h]
  2. Summarize (retain < n edges):
     [Figure: each reducer drops the redundant edges in its chunk]
  3. Recombine:
     [Figure: Round 2, the summaries merged into one graph on vertices a through h]

     Edge records left after summarizing:
       (a,b) 1   (a,c) 1   (c,d) 1   (d,e) 1   (f,g) 1   (g,h) 1

     Small enough to fit!

Connected Components
  The summarization does not affect connectivity
  – Drops redundant edges
  – Dramatically reduces data size
  – Takes two MapReduce rounds






  Similar approach works in other situations:
  – Consider vertices connected only if there are k edges between them
  – Consider vertices connected if their similarity score is above a threshold
     • E.g. approximate Jaccard similarity, as computed for recommendation
       systems
  – Find minimum spanning trees
     • Summarize by computing an MST on the subset graph
  – Clustering
     • Cluster each partition, then aggregate the clusters
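
As one illustration of the list above, the minimum spanning tree variant can be sketched like this. The functions `kruskal` and `two_round_mst` are hypothetical names, edges are assumed to be `(weight, u, v)` tuples, and Kruskal's algorithm is a choice made here for the per-chunk summary; the slides only say to compute an MST on each subset graph.

```python
import random


def kruskal(edges):
    """Return a minimum spanning forest of edges given as (weight, u, v)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:  # the edge joins two trees, so it is safe to keep
            parent[ru] = rv
            forest.append((w, u, v))
    return forest


def two_round_mst(edges, num_chunks=3, seed=0):
    rng = random.Random(seed)
    # Round 1: partition the edges at random and summarize each chunk by its MST.
    chunks = [[] for _ in range(num_chunks)]
    for edge in edges:
        chunks[rng.randrange(num_chunks)].append(edge)
    summaries = [kruskal(chunk) for chunk in chunks]
    # Round 2: the MST of the combined summaries has the same weight as the
    # MST of the full graph.
    return kruskal([edge for summary in summaries for edge in summary])
```

The summary is safe because an edge rejected inside a chunk is the heaviest edge on some cycle of that chunk, and such an edge is never needed in a minimum spanning tree of the whole graph.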



Act 2: Clustering Coefficient
  Finding tight knit groups of friends




  [Figure: two friend networks compared side by side]
           2/15 ≈ 0.13          vs.          8/15 ≈ 0.53

  CC(v) = Fraction of pairs of v’s friends who know each other
   – Count: number of triangles incident on v
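
A small sketch of this definition, assuming the graph is held in memory as a dictionary of neighbor sets. The helper names `build_adjacency` and `clustering_coefficient` are illustrative only, and returning 0.0 for nodes with fewer than two friends is a convention chosen here.

```python
from collections import defaultdict
from itertools import combinations


def build_adjacency(edges):
    """Turn an undirected edge list into a dict of neighbor sets."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def clustering_coefficient(adj, v):
    """Fraction of pairs of v's friends that are themselves connected."""
    friends = adj[v]
    k = len(friends)
    if k < 2:
        return 0.0  # no pairs of friends to compare (convention assumed here)
    possible = k * (k - 1) // 2
    # Each connected pair of friends closes one triangle incident on v.
    present = sum(1 for a, b in combinations(friends, 2) if b in adj[a])
    return present / possible
```

For a node with 6 friends, 2 of whose 15 possible pairs are connected, this gives 2/15, matching the left-hand example.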


Finding CC For Each Node
  Attempt 1:
  – Look at each node
  – Enumerate all possible triangles (Pivot)
  – Check which of those edges exist:
      [Figure: possible edges ∩ edges in the graph = edges present;
       15 edges possible, 2 edges present]




  Amount of intermediate data
  – Quadratic in the degree of the nodes
  – 6 friends: 15 possible triangles
  – n friends: n(n-1)/2 possible triangles
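
To make the quadratic blow-up concrete, here is a single-process sketch of Attempt 1. The name `pivot_triangles` is hypothetical, and the reduce step looks edges up in an in-memory adjacency map; a real MapReduce job would instead route each actual edge record to the same key as the candidate pair.

```python
from collections import defaultdict
from itertools import combinations


def pivot_triangles(edges):
    """Count triangles per node by pivoting; also report records emitted."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    # Map: every node pivots, emitting each pair of its neighbors as a
    # candidate edge. A node of degree d emits d(d-1)/2 records.
    candidates = defaultdict(list)
    for v, nbrs in adj.items():
        for a, b in combinations(sorted(nbrs), 2):
            candidates[(a, b)].append(v)

    # Reduce: a candidate pair closes a triangle only if the edge exists.
    triangles = defaultdict(int)
    for (a, b), pivots in candidates.items():
        if b in adj[a]:
            for v in pivots:
                triangles[v] += 1

    emitted = sum(len(pivots) for pivots in candidates.values())
    return dict(triangles), emitted
```

A single node with 6 friends already produces 15 intermediate records, and the count grows with the square of the degree.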






  There’s always “that guy”:
  – tens of thousands of friends
  – tens of thousands of movie ratings (really!)
  – millions of followers

  Attempt 1: Does not scale


  Attempt 2:
  – There is a limited number of High degree nodes
  – Count LLL, LLH, LHH, and HHH triangles differently
     – If a triangle has at least one Low node
        – Pivot on Low node to count the triangles
     – If a triangle has all High nodes
        – Pivot but only on other neighboring High nodes (not all nodes)
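
One way to read Attempt 2 as code is sketched below, under stated assumptions: `high_low_triangles` is a hypothetical name, a node counts as High when its degree exceeds a caller-supplied `threshold`, and triangles found from several pivots are deduplicated in a set before being credited to their three vertices. It runs in one process rather than as a MapReduce job.

```python
from collections import defaultdict
from itertools import combinations


def high_low_triangles(edges, threshold):
    """Count triangles per node, pivoting differently on High and Low nodes."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    high = {v for v, nbrs in adj.items() if len(nbrs) > threshold}

    found = set()
    emitted = 0
    for v, nbrs in adj.items():
        # Low pivot: pair up all neighbors.
        # High pivot: pair up only the neighbors that are also High.
        pool = nbrs & high if v in high else nbrs
        for a, b in combinations(sorted(pool), 2):
            emitted += 1
            if b in adj[a]:
                found.add(tuple(sorted((v, a, b))))

    # Credit each distinct triangle once to each of its three vertices.
    per_node = defaultdict(int)
    for triangle in found:
        for v in triangle:
            per_node[v] += 1
    return dict(per_node), emitted
```

Every triangle with at least one Low node is found from that Low pivot, and an all-High triangle is found from its High pivots, so nothing is missed while a hub no longer pairs up all of its neighbors.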


Algorithm in Pictures
  When looking at Low degree nodes
   – Check for all triangles





  When looking at High degree nodes
   – Check for triangles with other High degree nodes




Clustering Coefficient Discussion
  Attempt 2:
   – Main idea: treat High and Low degree nodes differently
      • Limit the amount of data generated (No more than O(n) per node)
   – All triangles accounted for
   – Can set High-Low threshold to balance the two cases
      • Rule of thumb: threshold around square root of number of vertices
   – A bit more complex, but still easy to code
      • Doesn’t suffer from the one high degree node problem




XXL Graphs: Conclusions
  Algorithm Design
   – Prove performance guarantees independent of input data
      • Input skew (e.g. high degree nodes) should not severely affect
        algorithm performance
      • Number of rounds fixed (and hopefully small)







  Rethink graph algorithms:
   – Connected Components: Two-round approach
   – Clustering Coefficient: High-Low node decomposition
   – (Breaking News) Matchings: Two-round sampling technique




Thank You
sergei@yahoo-inc.com


XXL Graph Algorithms__HadoopSummit2010
