Counting Triangles &
The Curse of the Last Reducer
                        Siddharth Suri
                    Sergei Vassilvitskii
                     Yahoo! Research
WWW 2011                 2   Sergei Vassilvitskii
Why Count Triangles?

           Clustering Coefficient:
            Given an undirected graph G = (V, E),
            cc(v) = fraction of v’s neighbors who are themselves neighbors
                  = |{(u, w) ∈ E : u ∈ Γ(v) ∧ w ∈ Γ(v)}| / (dv choose 2)
                  = (# triangles incident on v) / (dv choose 2)

           [Figure: four example nodes with cc = N/A, 1/3, 1, and 1]
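The definition above can be sketched directly in Python (a minimal sketch: `adj` maps each vertex to its neighbor set, and the example graph is one hypothetical graph consistent with the slide's four cc values):

```python
from math import comb

def clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbors that are themselves connected.
    Returns None (the slide's "N/A" case) when v has fewer than 2 neighbors."""
    nbrs = sorted(adj[v])
    d = len(nbrs)
    if d < 2:
        return None
    # Count edges among v's neighbors, i.e. triangles incident on v.
    closed = sum(1 for i in range(d) for j in range(i + 1, d)
                 if nbrs[j] in adj[nbrs[i]])
    return closed / comb(d, 2)

# Triangle 1-2-3 plus a pendant edge 3-4: cc values are 1, 1, 1/3, N/A.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
```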
Why Clustering Coefficient?

           Captures how tight-knit the network is around a node.

           [Figure: two example networks, cc = 0.5 vs. cc = 0.1]

           Network Cohesion:
             - Tightly knit communities foster more trust and social norms. [Coleman ’88, Portes ’88]
           Structural Holes:
             - Individuals benefit from bridging. [Burt ’04, ’07]
Why MapReduce?

           De facto standard for parallel computation on large data
           – Widely used at: Yahoo!, Google, Facebook, ...
           – Also at: New York Times, Amazon.com, Match.com, ...
           – Commodity hardware
           – Reliable infrastructure

           – Data continues to outpace available RAM!
How to Count Triangles

           Sequential Version:
            foreach v in V
                foreach u,w in Adjacency(v)
                   if (u,w) in E
                      Triangles[v]++

           [Figure: stepping through the loop on an example graph; Triangles[v] is incremented once for each pair of v’s neighbors joined by an edge]
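A direct Python transcription of the pseudocode above (a sketch; the graph is a dict of neighbor sets and all names are illustrative):

```python
from itertools import combinations

def count_triangles(adj):
    """Naive sequential count: each triangle increments the counter of
    all three of its vertices, so it is counted 3 times overall."""
    triangles = {v: 0 for v in adj}
    for v in adj:
        # Check each pair of v's neighbors for the closing edge (u, w).
        for u, w in combinations(adj[v], 2):
            if w in adj[u]:
                triangles[v] += 1
    return triangles

# Triangle 1-2-3 plus a pendant edge 3-4.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
```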
           Running time:  Σ_{v∈V} dv²

           Even for sparse graphs this can be quadratic if one vertex has high degree.
Parallel Version

           Parallelize the edge checking phase
           – Map 1: For each v, send ⟨v; Γ(v)⟩ to a single machine.
           – Reduce 1: Input: ⟨v; Γ(v)⟩
             Output: all length-2 paths ⟨(v1, v2); v⟩ where v1, v2 ∈ Γ(v)

           – Map 2: Send ⟨(v1, v2); u⟩, and ⟨(v1, v2); $⟩ for (v1, v2) ∈ E, to the same machine.
           – Reduce 2: Input: ⟨(v, w); u1, u2, . . . , uk, $?⟩
             Output: if $ is part of the input, then for each ui: Triangles[ui] += 1/3

           [Figure: a key whose $ record is present credits each center +1/3; a key with no $ record emits nothing]
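The two rounds can be simulated in plain Python to make the dataflow concrete (a sketch, not a real MapReduce job; the `$` marker becomes a membership test against the edge set, and all names are illustrative):

```python
from collections import defaultdict
from itertools import combinations

def mr_triangles(edges):
    """Simulate the two MapReduce rounds on a list of undirected edges."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    edge_set = {frozenset(e) for e in edges}

    # Round 1: the reducer for v holds (v, Γ(v)) and emits every
    # length-2 path, keyed by its endpoints, valued by its center v.
    paths = defaultdict(list)
    for v, nbrs in adj.items():
        for v1, v2 in combinations(sorted(nbrs), 2):
            paths[(v1, v2)].append(v)

    # Round 2: the '$' record marks keys that are real edges; each
    # center of a closed path gets +1/3.  Every triangle is found from
    # all 3 of its centers, so the counts sum to the triangle total.
    count = defaultdict(float)
    for (v1, v2), centers in paths.items():
        if frozenset((v1, v2)) in edge_set:
            for u in centers:
                count[u] += 1 / 3
    return dict(count)

# Triangle 1-2-3 plus a pendant edge 3-4: counts sum to 1 triangle.
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
```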
Data skew

           How much parallelization can we achieve?
           - Generate all the paths to check in parallel
           - The running time becomes max_{v∈V} dv²

           Naive parallelization does not help with data skew
           – Some nodes will have very high degree
           – Example: a node with 3.2 million followers forces us to generate 10 trillion (10^13) potential edges to check.
           – Even if generating 100M edges per second, that is 100K seconds ~ 27 hours.
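The back-of-the-envelope arithmetic checks out (the follower count and throughput are the slide's figures):

```python
followers = 3_200_000
pairs = followers * followers      # ≈ d² ≈ 1.0e13 candidate edges
rate = 100_000_000                 # 100M edge checks per second
seconds = pairs / rate             # ≈ 1.0e5 seconds
hours = seconds / 3600             # ≈ 28 hours, the slide's "~27"
```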
“Just 5 more minutes”

           Running the naive algorithm on LiveJournal Graph
           – 80% of reducers done after 5 min
           – 99% done after 35 min




Adapting the Algorithm

           Approach 1: Dealing with skew directly
           – Currently every triangle is counted 3 times (once per vertex)
           – Running time is quadratic in the degree of the vertex
           – Idea: count each triangle once, from the perspective of its lowest-degree vertex
           – Does this heuristic work?

           Approach 2: Divide & Conquer
           – Equally divide the graph between machines
           – But any edge partition is bound to miss triangles
           – Divide into overlapping subgraphs and account for the overlap
How to Count Triangles Better

           Sequential Version [Schank ’07]:

           foreach v in V
              foreach u,w in Adjacency(v)
                if deg(u) > deg(v) && deg(w) > deg(v)   // ties broken by vertex id
                    if (u,w) in E
                       Triangles[v]++
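A runnable sketch of this version (the slide's comparison operators were lost in extraction; this assumes the standard degree ordering with ties broken by vertex id, so each triangle is counted exactly once):

```python
from itertools import combinations

def count_triangles_fast(adj):
    """Count each triangle exactly once, from its lowest-degree vertex
    under the total order (degree, id)."""
    rank = {v: (len(adj[v]), v) for v in adj}
    total = 0
    for v in adj:
        # Only neighbors that outrank v can complete a triangle at v.
        higher = [u for u in adj[v] if rank[u] > rank[v]]
        for u, w in combinations(higher, 2):
            if w in adj[u]:
                total += 1
    return total

# Triangle 1-2-3 plus a pendant edge 3-4: exactly one triangle.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
```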
Does it make a difference?




Dealing with Skew

           Why does it help?
           – Partition nodes into two groups:
              • Low: L = {v : dv ≤ √m}
              • High: H = {v : dv > √m}
           – There are at most n low nodes; each produces at most O(m) paths
           – There are at most 2√m high nodes
              • Each produces paths to other high nodes: O(m) paths per node

           – These two bounds are identical!
           – Therefore, no mapper can produce substantially more work than the others.
           – Total work is O(m^(3/2)), which is optimal
Approach 2: Graph Split

           Partitioning the nodes:
           - The previous algorithm shows one way to achieve better parallelization
           - But what if even O(m) per machine is too much? Is it possible to divide the input into smaller chunks?

           Graph Split Algorithm:
           – Partition the vertices into p equal-sized groups V1, V2, . . . , Vp.
           – Consider all possible triples (Vi, Vj, Vk) and the induced subgraph:
                       Gijk = G[Vi ∪ Vj ∪ Vk]
           – Compute the triangles on each Gijk separately.
Approach 2: Graph Split

           Some triangles are present in multiple subgraphs:

           [Figure: a triangle spanning two groups (Vi, Vj) appears in p−2 subgraphs; one spanning three groups (Vi, Vj, Vk) appears in 1 subgraph; one contained in a single group appears in ~p² subgraphs]

           Can count exactly how many subgraphs each triangle will be in
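The split-and-correct scheme can be sketched in Python (a sketch under assumptions: the group assignment and all names are illustrative, and each triangle found in a subgraph is down-weighted by exactly the number of triples that contain it):

```python
from itertools import combinations
from math import comb
from fractions import Fraction

def split_count(adj, group, p):
    """Exact triangle count via the graph-split scheme (needs p >= 3).
    adj: vertex -> neighbor set; group: vertex -> group id in 0..p-1."""
    total = Fraction(0)
    for triple in combinations(range(p), 3):
        keep = {v for v in adj if group[v] in triple}
        for v in keep:
            for u, w in combinations(sorted(adj[v] & keep), 2):
                if w in adj[u]:
                    g = len({group[v], group[u], group[w]})
                    # Subgraphs containing this triangle: 1 if it spans
                    # 3 groups, p-2 if 2 groups, C(p-1, 2) if 1 group.
                    mult = {3: 1, 2: p - 2, 1: comb(p - 1, 2)}[g]
                    # Within a subgraph it is also found once per vertex.
                    total += Fraction(1, 3 * mult)
    return int(total)

# Triangle 1-2-3 plus a pendant edge 3-4, split into p = 3 groups.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
groups = {v: v % 3 for v in adj}
```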
Approach 2: Graph Split

           Analysis:
           – Each subgraph has O(m/p²) edges in expectation.
           – Very balanced running times
           – p controls the memory needed per machine
           – Total work: p³ · O((m/p²)^(3/2)) = O(m^(3/2)), independent of p

           [Figure: running time vs. p — if p is too small the input is too big (paging); if p is too large shuffle time increases with duplication]
Overall

           Naive parallelization doesn’t help with data skew.
Related Work

      • Tsourakakis et al. [’09]:
           – Count the global number of triangles by estimating the trace of the cube of the adjacency matrix
           – Do not specifically deal with skew; obtain high-probability approximations

      • Becchetti et al. [’08]:
           – Approximate the number of triangles per node
           – Use multiple passes to obtain progressively better approximations
Conclusions

           Think about data skew... and avoid the curse
           – Get programs to run faster
           – Publish more papers
           – Get more sleep
           – ...
           – The possibilities are endless!
Thank You
