Counting @ Scale
       Part II

            Sharad Goel
        Columbia University
Computational Social Science: Lecture 4

          February 15, 2013
Descriptive statistics
    (as opposed to inferential statistics)
      is about counting
        contingency tables
    means, variances, quantiles
summaries of conditional distributions
Counting @ scale
  conceptually easy
computationally hard
MapReduce:
Simplified Data Processing on Large Clusters
      Jeffrey Dean and Sanjay Ghemawat
                  OSDI, 2004
Map
assign each input line to one or more groups
        v → [(k1, v1), …, (km, vm)]

                  Shuffle
             aggregate groups

                  Reduce
         operate on grouped data
        (k, [v1, …, vn]) → [w1, …, wp]
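
The map, shuffle, and reduce signatures above can be sketched as a small in-memory simulator. This is only an illustration under stated assumptions: `map_reduce`, `word_mapper`, and `count_reducer` are hypothetical names, the word-count data is made up, and a real cluster would distribute each phase across machines.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Run map, shuffle, and reduce in memory (single-machine sketch)."""
    groups = defaultdict(list)
    for record in records:
        # Map: v -> [(k1, v1), ..., (km, vm)]
        for key, value in mapper(record):
            # Shuffle: collect all values that share a key
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        # Reduce: (k, [v1, ..., vn]) -> [w1, ..., wp]
        output.extend(reducer(key, groups[key]))
    return output

# Hypothetical example: count words across input lines.
def word_mapper(line):
    return [(word, 1) for word in line.split()]

def count_reducer(word, ones):
    return [(word, sum(ones))]

counts = map_reduce(["a b a", "b c"], word_mapper, count_reducer)
print(counts)
```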
Group Average

              Input
    views (user, movie, rating)

             Output
average (mean & median) by movie
Group Average

          Map
identity (key := movie)

       Reduce
movie group → average
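
A minimal sketch of the group average, assuming a handful of made-up (user, movie, rating) rows. The mapper keys each view by movie, and the reducer computes the mean and median of each movie's ratings; the function names are illustrative only.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (user, movie, rating) views.
views = [
    (1, "A", 4), (2, "A", 2), (3, "A", 3),
    (1, "B", 5), (2, "B", 1),
]

def mapper(view):
    user, movie, rating = view
    return [(movie, rating)]          # key := movie

def reducer(movie, ratings):
    return [(movie, mean(ratings), median(ratings))]

# Shuffle: group ratings by movie.
groups = defaultdict(list)
for view in views:
    for movie, rating in mapper(view):
        groups[movie].append(rating)

averages = [row for movie in sorted(groups) for row in reducer(movie, groups[movie])]
print(averages)
```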
The Insight of MapReduce
       One can efficiently group identical items

Many tasks are computationally easier on grouped data
Filter

                 Input
    arbitrary data & filter condition

                 Output
subset of input data satisfying condition
Filter

                   Map
input → input if condition(input) else pass

                 Reduce
                 identity
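
As a rough sketch, a filter needs only a mapper that emits a record when the condition holds and emits nothing otherwise. The data and the `filter_mapper` name below are hypothetical.

```python
def filter_mapper(record, condition):
    """Emit the record unchanged if it satisfies the condition, else nothing."""
    return [record] if condition(record) else []

# Hypothetical (user, movie, rating) views; keep ratings of 4 or more.
views = [(1, "A", 4), (2, "A", 2), (1, "B", 5)]
high = [out for view in views for out in filter_mapper(view, lambda r: r[2] >= 4)]
print(high)
```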
Distinct

        Input
     set of items

        Output
subset of distinct items
Distinct

                 Map
               identity

               Reduce
grouped items → single item from group
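
One way to picture distinct, using made-up items: the identity map makes each item its own key, the shuffle brings duplicates together, and the reducer keeps one representative per group.

```python
from collections import defaultdict

items = ["b", "a", "b", "c", "a"]    # hypothetical input

# Map (identity) and shuffle: each item is its own key.
groups = defaultdict(list)
for item in items:
    groups[item].append(item)

# Reduce: emit a single item from each group.
distinct = [values[0] for key, values in sorted(groups.items())]
print(distinct)
```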
Sample

               Input
set of items & sample probability p

            Output
     random subset of items
Sample

                  Map
input → input if rand(0,1) < p else pass

                Reduce
                identity
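
A small sketch of sampling as a map-only job. The `sample_mapper` name, the seed, and the input range are assumptions made for illustration; each record is kept independently with probability p, so the sample size is random rather than fixed.

```python
import random

def sample_mapper(record, p, rng):
    """Emit the record with probability p, else nothing."""
    return [record] if rng.random() < p else []

rng = random.Random(0)               # seeded only to make the sketch repeatable
items = range(10_000)                # hypothetical input
sample = [out for item in items for out in sample_mapper(item, 0.1, rng)]
print(len(sample))                   # close to 1,000, but not exactly
```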
Sort

          Input
set of items (and a key)

       Output
 ordered set of items
Sort

                       Map
identity, with all data assigned to the same key

                    Reduce
                    identity


     *all the work happens in the shuffle
Sort

                 Map
identity, with key := first letter of line

                Reduce
                identity


*all the work happens in the shuffle
Sort

                       Sample
generate a small sample of the data (with MapReduce)

                Determine breakpoints
      sort the sample and compute percentiles
Sort

                     Map
identity, with key determined by breakpoints

                  Reduce
                  identity


 *most of the work happens in the shuffle
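
The sample-then-partition sort can be sketched as follows, assuming four reducers and made-up integer data. Breakpoints come from percentiles of a small sorted sample, the mapper's key is the index of the range an item falls into, and each reducer's output is already in order relative to the others. Sorting inside each partition stands in for the work the shuffle would do.

```python
import bisect
import random
from collections import defaultdict

rng = random.Random(42)
data = [rng.randint(0, 999) for _ in range(1000)]   # hypothetical input
num_reducers = 4                                    # assumed cluster size

# Sample: take a small random sample and sort it.
sample = sorted(rng.sample(data, 100))
# Determine breakpoints: evenly spaced percentiles of the sample.
breakpoints = [sample[len(sample) * i // num_reducers] for i in range(1, num_reducers)]

# Map: key := index of the breakpoint range that contains the item.
partitions = defaultdict(list)
for item in data:
    partitions[bisect.bisect_right(breakpoints, item)].append(item)

# Reduce: identity on locally sorted data; concatenate reducers in key order.
result = [x for r in range(num_reducers) for x in sorted(partitions[r])]
print(breakpoints, [len(partitions[r]) for r in range(num_reducers)])
```

Because the breakpoints come from a sample, the partitions are only roughly equal in size, which is usually enough to balance the reducers.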
Combining data

                  Example
    for each user, want to compute the
average popularity of the movies they watch

                   Problem
    one file contains views (user, movie);
another file contains popularity (movie, rank)
Joins
Views
User    Movie
23      829
789     24
234     5678
7       24

Popularity
Movie   Rank
5678    4
24      100
829     34

Joined
User    Movie   Rank
23      829     34
789     24      100
234     5678    4
7       24      100
Nested-Loop Joins


for each view in views:
      for each movie in movies:
            if view.movie_id == movie.id:
                    output view.user_id, movie.id, movie.rank
Sort-Merge Joins
Views (sorted by movie)
User    Movie
789     24
7       24
23      829
234     5678

Popularity (sorted by movie)
Movie   Rank
24      100
829     34
5678    4

Joined
User    Movie   Rank
789     24      100
7       24      100
23      829     34
234     5678    4
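
A minimal sort-merge join over the rows shown in the slides. This sketch assumes each movie appears at most once in the popularity table; `sort_merge_join` is an illustrative name. Both tables are sorted on the join key, then a single pass advances through them together.

```python
views = [(23, 829), (789, 24), (234, 5678), (7, 24)]   # (user, movie)
ranks = [(5678, 4), (24, 100), (829, 34)]              # (movie, rank)

def sort_merge_join(views, ranks):
    """Join on movie, assuming movies are unique in ranks."""
    views = sorted(views, key=lambda v: v[1])   # sort views by movie
    ranks = sorted(ranks)                       # sort ranks by movie
    joined, j = [], 0
    for user, movie in views:
        # Advance the ranks pointer until it reaches or passes this movie.
        while j < len(ranks) and ranks[j][0] < movie:
            j += 1
        if j < len(ranks) and ranks[j][0] == movie:
            joined.append((user, movie, ranks[j][1]))
    return joined

joined = sort_merge_join(views, ranks)
print(joined)
```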
Hash Joins
User    Movie

 23      829

 789     24

 234    5678

  7      24


Movie   Rank

5678      4

 24      100

 829     34
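
A hash join over the same two tables might look like the sketch below: build a hash table on the smaller table's join key, then probe it once per row of the larger table. The `hash_join` name is illustrative.

```python
from collections import defaultdict

views = [(23, 829), (789, 24), (234, 5678), (7, 24)]   # (user, movie)
ranks = [(5678, 4), (24, 100), (829, 34)]              # (movie, rank)

def hash_join(views, ranks):
    # Build phase: hash the smaller table on the join key (movie).
    table = defaultdict(list)
    for movie, rank in ranks:
        table[movie].append(rank)
    # Probe phase: look up each view's movie in the hash table.
    joined = []
    for user, movie in views:
        for rank in table.get(movie, []):
            joined.append((user, movie, rank))
    return joined

joined = hash_join(views, ranks)
print(joined)
```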
Distributed Joins

                Map
     reduce key := hash(join key)

                Reduce
        local (sort-merge) join


*also need to keep track of which table
    is the left and which is the right
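
A sketch of a distributed (reduce-side) join, simulated in memory with an assumed three reducers. Each value is tagged "L" or "R" to record which table it came from, and the join key's hash picks the reducer. For brevity the local join here pairs the left and right values for each key directly rather than doing a sort-merge.

```python
from collections import defaultdict

views = [(23, 829), (789, 24), (234, 5678), (7, 24)]   # left table: (user, movie)
ranks = [(5678, 4), (24, 100), (829, 34)]              # right table: (movie, rank)
num_reducers = 3                                       # assumed cluster size

def mapper(table, row):
    """Key each row by movie and tag it with its source table."""
    if table == "views":
        user, movie = row
        return [(movie, ("L", user))]
    movie, rank = row
    return [(movie, ("R", rank))]

# Shuffle: reducer := hash(join key) mod number of reducers.
reducers = defaultdict(lambda: defaultdict(list))
for table, rows in (("views", views), ("ranks", ranks)):
    for row in rows:
        for key, tagged in mapper(table, row):
            reducers[hash(key) % num_reducers][key].append(tagged)

def reducer(movie, tagged_values):
    """Local join: pair every left value with every right value for this key."""
    users = [value for tag, value in tagged_values if tag == "L"]
    rank_values = [value for tag, value in tagged_values if tag == "R"]
    return [(user, movie, rank) for user in users for rank in rank_values]

joined = sorted(
    row
    for groups in reducers.values()
    for movie, values in groups.items()
    for row in reducer(movie, values)
)
print(joined)
```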
Joins
{ inner, left, right, outer }
User    Sex
23      male
789     female
234     female
7       male
26      male
567     female
2       female

User    Movie
23      829
789     24
234     5678
7       24
789     90
23      758
23      39
2       782
User    Sex
23      male
789     female
234     female
7       male
26      male
567     female
2       female

User    Activity
23      3
789     2
234     1
7       1
789     90
2       1

       Left Join
User    Sex       Activity
23      male      3
789     female    2
234     female    1
7       male      1
26      male
567     female
2       female    1
User    Sex
23      male
789     female
234     female
7       male
26      male
567     female
2       female

User    Activity
23      3
789     2
234     1
7       1
789     90
2       1

       Inner Join
User    Sex       Activity
23      male      3
789     female    2
234     female    1
7       male      1
2       female    1
                      Inner join
          returns pairs of rows in tables A & B
              that match the join condition

                      Left (outer) join
         returns all rows from an inner join plus
rows in the left table that have no match in the right table

                      Full (outer) join
         returns all rows from an inner join plus
   rows in either table that have no match in the other
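
The three join types can be compared with a small sketch. It assumes each user appears at most once per table, so the rows below follow the User/Sex and User/Activity tables with one activity row per user, plus a hypothetical user 999 who has activity but no sex record. The `join` function is illustrative, and `None` marks a missing value.

```python
sex = [(23, "male"), (789, "female"), (234, "female"), (7, "male"),
       (26, "male"), (567, "female"), (2, "female")]
# One activity row per user (assumption); user 999 is hypothetical.
activity = [(23, 3), (789, 2), (234, 1), (7, 1), (2, 1), (999, 5)]

def join(left, right, how="inner"):
    """Join two (key, value) tables, assuming keys are unique within each."""
    left_map, right_map = dict(left), dict(right)
    if how == "inner":
        keys = [k for k in left_map if k in right_map]
    elif how == "left":
        keys = list(left_map)
    elif how == "full":
        keys = list(left_map) + [k for k in right_map if k not in left_map]
    else:
        raise ValueError(f"unknown join type: {how}")
    return [(k, left_map.get(k), right_map.get(k)) for k in keys]

inner = join(sex, activity, "inner")
left = join(sex, activity, "left")
full = join(sex, activity, "full")
print(len(inner), len(left), len(full))
```

A right join is the mirror image of a left join: swap the two tables.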
Map-side Joins

                     Map
      load (smaller) table into memory
stream through (larger) table and find matches

                   Reduce
                   identity
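
A map-side join can be sketched with in-memory stand-ins for the two files. The tab-separated layout and the `io.StringIO` objects are assumptions; in practice each mapper would load the small table from a file shipped to every machine. The small table is loaded once into a dictionary, and the mapper streams through the large table line by line.

```python
import io

# Stand-ins for files on disk (hypothetical tab-separated layout).
ranks_file = io.StringIO("5678\t4\n24\t100\n829\t34\n")             # smaller table
views_file = io.StringIO("23\t829\n789\t24\n234\t5678\n7\t24\n")    # larger table

# Load the smaller table into memory: movie -> rank.
rank_by_movie = dict(line.rstrip("\n").split("\t") for line in ranks_file)

def mapper(line):
    """Emit the joined line if the view's movie is in the in-memory table."""
    user, movie = line.rstrip("\n").split("\t")
    if movie in rank_by_movie:
        yield "\t".join((user, movie, rank_by_movie[movie]))

# Stream through the larger table; no shuffle or reduce is needed.
joined = [out for line in views_file for out in mapper(line)]
print(joined)
```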
MapReduce Ops

           Map-only
Filter, sample, map-side joins

      Map & Reduce
 groupby, distinct, sort, join
The long tail

             Input
      (user, movie) views

             Output
for each user, average popularity
      of movies they watch
Step 1. compute movie popularity
  group views by movie & count
Step 2. Rank movies
 sort by popularity
Step 3. merge view and ranking data
join views and movie popularity tables
Step 4. compute eccentricity
group views/ranking by user and
     compute eccentricity
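
The four steps can be strung together in a short single-machine sketch. The views are made up, ties in popularity are broken by movie id, and eccentricity is taken here to be the mean popularity rank of the movies a user watched; that definition is an assumption based on the "average popularity" wording above.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical (user, movie) views.
views = [("u1", "A"), ("u1", "B"), ("u2", "A"),
         ("u2", "C"), ("u3", "A"), ("u3", "B")]

# Step 1: compute movie popularity (group views by movie and count).
popularity = Counter(movie for _, movie in views)

# Step 2: rank movies (rank 1 = most popular; ties broken by movie id).
ordered = sorted(popularity, key=lambda m: (-popularity[m], m))
rank = {movie: i + 1 for i, movie in enumerate(ordered)}

# Step 3: join views with the movie ranking.
views_with_rank = [(user, movie, rank[movie]) for user, movie in views]

# Step 4: group by user and compute eccentricity (assumed: mean rank).
ranks_by_user = defaultdict(list)
for user, movie, movie_rank in views_with_rank:
    ranks_by_user[user].append(movie_rank)
eccentricity = {user: mean(ranks) for user, ranks in ranks_by_user.items()}
print(eccentricity)
```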
Pig Latin:
A Not-So-Foreign Language for Data Processing
   Olston, Reed, Srivastava, Kumar, and Tomkins
                  SIGMOD, 2008

Computational Social Science, Lecture 04: Counting at Scale, Part II
