XXL Graph Algorithms__HadoopSummit2010

XXL Graph Algorithms
Sergei Vassilvitskii
Yahoo! Research

With help from Jake Hofman, Siddharth Suri, Cong Yu and many others

Introduction
XXL Graphs are everywhere:
– Web graph
– Friend graphs
– Advertising graphs...

But we have Hadoop!
– Few algorithms have been ported (no Hadoop Algorithms book)
– Few general algorithmic approaches
– Active area of research

Outline
Today:
– Act 1: Crawl before you walk
• Counting connected components
– Act 2: The curse of the last reducer
• Finding tight knit friend groups

Act 1: Connected Components
Given a graph, how many components does it have?

[Figure: example graph on vertices a through h. The input is a list of edge records, each with value 1: (a,b), (a,c), (b,c), (b,d), (b,e), (c,d), (c,e), (d,e), (f,g), (f,h), (g,h).]

Data too big to fit on one reducer!

CC Overview
Outline for Connected Components
– Partition the input into several chunks (map 1)
– Summarize the connectivity on each chunk (reduce 1)
– Combine all of the (small) summaries (map 2)
– Find the number of connected components (reduce 2)
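
The four steps above can be sketched in plain Python, with ordinary functions standing in for Hadoop map and reduce tasks. This is a minimal sketch, not the talk's implementation: the names `UnionFind`, `summarize`, and `count_components`, the `num_chunks` and `seed` parameters, and the use of union-find for the summary are all assumptions made for illustration. It also assumes no self-loops, and vertices without edges are not counted.

```python
import random
from collections import defaultdict


class UnionFind:
    """Disjoint-set forest with path halving."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        """Merge the sets of a and b; return True if they were separate."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[ra] = rb
        return True


def summarize(edges):
    """Keep only the edges that join two previously separate pieces."""
    uf = UnionFind()
    return [(u, v) for u, v in edges if uf.union(u, v)]


def count_components(edges, num_chunks=2, seed=0):
    rng = random.Random(seed)
    # Map 1: send each edge to a random chunk.
    chunks = defaultdict(list)
    for edge in edges:
        chunks[rng.randrange(num_chunks)].append(edge)
    # Reduce 1: summarize each chunk independently.
    summaries = [summarize(chunk) for chunk in chunks.values()]
    # Map 2: route every summary to a single reducer.
    combined = [edge for summary in summaries for edge in summary]
    # Final step: a forest with k edges on n vertices has n - k components.
    forest = summarize(combined)
    vertices = {v for edge in combined for v in edge}
    return len(vertices) - len(forest)


edges = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("b", "e"),
         ("c", "d"), ("c", "e"), ("d", "e"), ("f", "g"), ("f", "h"), ("g", "h")]
print(count_components(edges))  # 2 for the example graph
```

Each summary is a spanning forest of its chunk, so it holds fewer than n edges no matter how many edges the chunk received.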

Connected Components
1. Partition (randomly)
2. Summarize (retain < n edges)
3. Recombine

[Figure: the example graph's edges split at random between Reduce 1 and Reduce 2, summarized on each reducer, then recombined in Round 2. The edge records that survive summarization are (a,b), (a,c), (c,d), (d,e), (f,g), (g,h).]

Round 2: small enough to fit!

The summarization does not affect connectivity
– Drops redundant edges
– Dramatically reduces data size
– Takes two MapReduce rounds

Similar approach works in other situations:
– Consider vertices connected only if there are k edges between them
– Consider vertices connected if their similarity score is above a threshold
• E.g. approximate Jaccard similarity computed for recommendation systems
– Find minimum spanning trees
• Summarize by computing an MST on the subset graph
– Clustering
• Cluster each partition, then aggregate the clusters
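
As one illustration of the list above, the minimum spanning tree variant can be sketched as follows. This is a hedged sketch under stated assumptions: edges are `(weight, u, v)` tuples, Kruskal's algorithm is used as the per-chunk summary, and the names `kruskal` and `two_round_mst` and the `num_chunks` and `seed` parameters are hypothetical. An edge dropped inside a chunk is the heaviest edge on some cycle there, so some minimum spanning forest of the full graph does not need it.

```python
import random


def kruskal(edges):
    """Return a minimum spanning forest of weighted edges (w, u, v)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            forest.append((w, u, v))
    return forest


def two_round_mst(edges, num_chunks=3, seed=0):
    rng = random.Random(seed)
    chunks = [[] for _ in range(num_chunks)]
    for edge in edges:
        chunks[rng.randrange(num_chunks)].append(edge)
    # Round 1: each chunk keeps only its own minimum spanning forest.
    survivors = [e for chunk in chunks for e in kruskal(chunk)]
    # Round 2: a spanning forest of the survivors has the same total
    # weight as a minimum spanning forest of the full graph.
    return kruskal(survivors)
```

With tied weights the two-round result may pick different edges than a single-machine run, but the total weight is the same.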

Act 2: Clustering Coefficient
Finding tight knit groups of friends

[Figure: two example nodes, each with 6 friends and 15 possible friend pairs. Left: 2/15 ≈ 0.13. Right: 8/15 ≈ 0.53.]

CC(v) = fraction of pairs of v’s friends who know each other
– Count: number of triangles incident on v
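
The definition can be made concrete with a few lines of Python. This is a minimal single-machine sketch: the helper names `build_adjacency` and `clustering_coefficient`, the example graph, and the convention of returning 0.0 for nodes with fewer than two friends are assumptions, not part of the talk.

```python
from itertools import combinations


def build_adjacency(edges):
    """Turn an undirected edge list into a dict of neighbor sets."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbors that are themselves adjacent."""
    friends = adj[v]
    possible = len(friends) * (len(friends) - 1) // 2
    if possible == 0:
        return 0.0  # assumed convention for nodes with fewer than 2 friends
    # Each adjacent pair of friends closes one triangle incident on v.
    triangles = sum(1 for u, w in combinations(friends, 2) if w in adj[u])
    return triangles / possible


# Hypothetical node "v" with 6 friends, 2 pairs of whom know each other.
edges = [("v", f) for f in "abcdef"] + [("a", "b"), ("c", "d")]
adj = build_adjacency(edges)
print(round(clustering_coefficient(adj, "v"), 2))  # 2 of 15 pairs -> 0.13
```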

Finding CC For Each Node
Attempt 1:
– Look at each node
– Enumerate all possible triangles (Pivot)

Attempt 1:
– Check which of those edges exist

[Figure: the 15 possible edges among the node's 6 friends, intersected with the edges of the graph: 2 edges present.]

Attempt 1:
– Enumerate all possible triangles

Amount of intermediate data
– Quadratic in the degree of the nodes
– 6 friends: 15 possible triangles
– n friends, n(n-1)/2 possible triangles

There’s always “that guy”:
– tens of thousands of friends
– tens of thousands of movie ratings (really!)
– millions of followers
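
To see how quickly the intermediate data grows, the pivot step of Attempt 1 can be sketched as below. This is an illustrative sketch only: the function names are hypothetical, and describing candidate generation as the map side and the edge check as the reduce side is an assumption about how the job would be split.

```python
from itertools import combinations


def pivot_candidates(adj, v):
    """Emit every pair of v's neighbors as a possible triangle through v."""
    return [(v, u, w) for u, w in combinations(sorted(adj[v]), 2)]


def count_triangles_at(adj, v):
    """Keep only the candidates whose closing edge (u, w) exists."""
    return sum(1 for _, u, w in pivot_candidates(adj, v) if w in adj[u])


def star(num_leaves):
    """Hypothetical worst case: one hub whose friends do not know each other."""
    adj = {"hub": set(range(num_leaves))}
    for leaf in range(num_leaves):
        adj[leaf] = {"hub"}
    return adj


for degree in (6, 100, 1000):
    adj = star(degree)
    emitted = len(pivot_candidates(adj, "hub"))
    found = count_triangles_at(adj, "hub")
    print(degree, emitted, found)
```

The hub emits degree × (degree − 1) / 2 candidate records and, in this star graph, every one of them is discarded. A single node with tens of thousands of friends would emit hundreds of millions of candidates on its own.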

Attempt 1:
– Look at each node
– Enumerate all possible triangles
– Doesn't scale

Attempt 2:
– There is a limited number of High degree nodes
– Count LLL, LLH, LHH, and HHH triangles differently
– If a triangle has at least one Low node
• Pivot on a Low node to count the triangle
– If a triangle has all High nodes
• Pivot, but only on other neighboring High nodes (not all nodes)

Algorithm in Pictures
When looking at Low degree nodes
– Check for all triangles

When looking at High degree nodes
– Check for triangles with other High degree nodes

Clustering Coefficient Discussion
Attempt 2:
– Main idea: treat High and Low degree nodes differently
• Limit the amount of data generated (No more than O(n) per node)
– All triangles accounted for
– Can set High-Low threshold to balance the two cases
• Rule of thumb: threshold around square root of number of vertices
– A bit more complex, but still easy to code
• Doesn’t suffer from the one high degree node problem
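
A single-machine sketch of the High-Low idea follows. It is not the talk's code: the function name `triangles_high_low`, the `threshold` parameter, and in particular the fractional weights used to avoid counting a triangle more than once are assumptions added for illustration. A Low pivot pairs up all of its neighbors, a High pivot pairs up only its High neighbors, and each discovery is weighted by how many pivots will find the same triangle.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import combinations


def triangles_high_low(edges, threshold):
    """Return the number of triangles incident on each node."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    is_high = {v: len(adj[v]) > threshold for v in adj}

    tri = defaultdict(Fraction)
    for v in adj:
        if is_high[v]:
            # High pivot: consider only neighboring High nodes.
            candidates = [u for u in adj[v] if is_high[u]]
        else:
            # Low pivot: consider all neighbors.
            candidates = list(adj[v])
        for u, w in combinations(candidates, 2):
            if w not in adj[u]:
                continue
            nodes = (v, u, w)
            lows = sum(not is_high[x] for x in nodes)
            # A triangle with Low nodes is found once per Low node;
            # an all-High triangle is found once per High node (3 times).
            weight = Fraction(1, lows if lows else 3)
            for x in nodes:
                tri[x] += weight
    return {v: int(tri[v]) for v in adj}
```

Changing `threshold` shifts work between the two cases but should not change the counts, which gives a simple way to check an implementation against a brute-force count on a small graph.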

XXL Graphs: Conclusions
Algorithm Design
– Prove performance guarantees independent of input data
• Input skew (e.g. high degree nodes) should not severely affect
algorithm performance
• Number of rounds fixed (and hopefully small)

Rethink graph algorithms:
– Connected Components: Two round approach
– Clustering Coefficient: High-Low node decomposition
– (Breaking News) Matchings: Two round sampling technique

Thank You
sergei@yahoo-inc.com
