Introduction to Counting
                                            APAM E4990
                                      Computational Social Science


                                             Jake Hofman

                                            Columbia University


                                            February 1, 2013




Jake Hofman   (Columbia University)            Intro to Counting     February 1, 2013   1 / 21
Why counting?




                                      sta·tis·tic

       noun
       1. A function of a random sample of data
       2. A functional of the distribution over a random variable








                                       Problem:

        Traditionally difficult to estimate distributions due to small sample
                                  sizes or sparsity








                                             Potential solution:

                                      Sacrifice granularity for precision

                            (e.g., bin observations into larger groups)








                                       Potential solution:

              Develop sophisticated methods that generalize well from small
                                        samples

                       (e.g., model distributions with parametric forms)








                                      (Partial) solution:

              When observations are plentiful, simply count and divide to
                   estimate distributions with relative frequencies
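
A minimal sketch of "count and divide", assuming the observations fit in a Python list. The function name `relative_frequencies` and the sample ratings are invented for illustration and are not from the lecture.

```python
from collections import Counter

def relative_frequencies(observations):
    """Estimate a discrete distribution by counting and dividing."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

# Hypothetical ratings, invented for illustration only.
ratings = [5, 4, 4, 3, 5, 4, 1, 4]
distribution = relative_frequencies(ratings)
print(distribution)
```

With plentiful data the same two steps apply unchanged; only the cost of producing the counts grows.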








                                      The good:

        Shift away from sophisticated statistical methods on small samples
                       to simple methods on large samples








                                      The bad:

              Even simple methods (e.g., counting) are computationally
                            challenging at large scales








                                         Claim:

         Solving the counting problem at scale enables you to investigate
                 many interesting questions in the social sciences




Learning to count




                                              This week:

                  Counting at small/medium scales on a single machine


                                           Following weeks:

                                  Counting at large scales in parallel




Counting, the easy way




              • Load dataset into memory
              • Arrange observations into groups of interest
              • Compute distributions and statistics within each group
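
The three steps above might look as follows for a tiny ratings file. This is a sketch only: the `user,movie,rating` column layout and the inline `RAW` data are assumptions made for the example, not the actual Movielens or Netflix file format.

```python
import csv
import io
from collections import Counter, defaultdict
from statistics import median

# Hypothetical stand-in for a ratings file; a real run would open a file instead.
RAW = """user,movie,rating
u1,m1,5
u2,m1,3
u3,m1,4
u1,m2,2
u2,m2,2
"""

# 1. Load the dataset into memory.
rows = list(csv.DictReader(io.StringIO(RAW)))

# 2. Arrange observations into groups of interest (here: by movie).
ratings_by_movie = defaultdict(list)
for row in rows:
    ratings_by_movie[row["movie"]].append(int(row["rating"]))

# 3. Compute a distribution and a statistic within each group.
summary = {
    movie: {"counts": Counter(values), "median": median(values)}
    for movie, values in ratings_by_movie.items()
}
```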




Example: Anatomy of the long tail




              Dataset      Users   Items   Rating levels   Observations
              Movielens    100K    10K     10              10M
              Netflix      500K    20K     5               100M




Example: Movielens

        [Figure: Count (y-axis, 0 to 3,000,000) by Rating (x-axis, 1 to 5)]

        [Figure: Density (y-axis) of Mean Rating by Movie (x-axis, 1 to 5)]

        [Figure: CDF (y-axis, 0% to 100%) by Movie Rank (x-axis, 0 to 9,000)]

        [Figure: Density (y-axis) of User eccentricity (x-axis, 10 to 1,000)]

The group-by operation




        We estimate conditional distributions by grouping observations by
         their group identity, or key, and counting values within groups


        \[ p(\underbrace{y}_{\text{value}} \mid \underbrace{x_1, x_2, x_3, \ldots}_{\text{key}}) \]
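
As a rough illustration of estimating p(y | key) by counting values within groups, the sketch below keeps one counter per key. The function name and the (genre, decade) keys are hypothetical and chosen only to show a multi-part key.

```python
from collections import Counter, defaultdict

def conditional_distribution(observations):
    """Estimate p(y | key) from (key, y) pairs by counting within each key."""
    counts = defaultdict(Counter)
    for key, y in observations:
        counts[key][y] += 1
    return {
        key: {y: n / sum(ys.values()) for y, n in ys.items()}
        for key, ys in counts.items()
    }

# Hypothetical observations: key = (genre, decade), value = rating.
observations = [
    (("drama", 1990), 5),
    (("drama", 1990), 4),
    (("drama", 1990), 5),
    (("comedy", 2000), 3),
]
p = conditional_distribution(observations)
```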




                                       Generic Group-By

       for each observation:
         place observation in bucket for corresponding group

       for each group:
         compute function over observations in bucket
         output group and result


              Useful for computing arbitrary within-group statistics when the
                              required memory is available
                        (e.g., conditional distribution, median, etc.)
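
One possible Python rendering of the generic group-by pseudocode, assuming all observations fit in memory. The helper name `generic_group_by` and its `key` and `func` parameters are illustrative choices, not part of the lecture.

```python
from collections import defaultdict
from statistics import median

def generic_group_by(observations, key, func):
    """Bucket every observation by key, then apply func to each full bucket."""
    buckets = defaultdict(list)
    for obs in observations:
        buckets[key(obs)].append(obs)
    # Memory grows with the total number of observations held in buckets.
    return {group: func(bucket) for group, bucket in buckets.items()}

# Hypothetical (movie, rating) pairs.
ratings = [("m1", 5), ("m1", 3), ("m1", 4), ("m2", 2), ("m2", 1)]
median_by_movie = generic_group_by(
    ratings,
    key=lambda obs: obs[0],
    func=lambda bucket: median(rating for _, rating in bucket),
)
```

Because each bucket holds the raw observations, `func` can be any statistic, including ones such as the median that need all values at once.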



                                       Combinable Group-By

       for each observation:
         update result for corresponding group,
         as function of current result and observation

       for each group:
         output group and result


              Useful for computing a subset of within-group statistics with a
                                  limited memory footprint
                           (e.g., min, mean, max, variance, etc.)
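
A sketch of the combinable variant, assuming numeric values arriving as (group, value) pairs. It uses Welford's online update for the mean and variance, which is one common choice and not necessarily the method used in the course. Only a fixed-size summary is stored per group.

```python
def combinable_group_by(observations):
    """Keep one small running summary per group instead of the raw values."""
    state = {}
    for group, x in observations:
        s = state.setdefault(
            group, {"n": 0, "mean": 0.0, "m2": 0.0, "min": x, "max": x}
        )
        # Welford's online update for mean and sum of squared deviations.
        s["n"] += 1
        delta = x - s["mean"]
        s["mean"] += delta / s["n"]
        s["m2"] += delta * (x - s["mean"])
        s["min"] = min(s["min"], x)
        s["max"] = max(s["max"], x)
    return {
        group: {
            "min": s["min"],
            "mean": s["mean"],
            "max": s["max"],
            # Population variance; divide by (n - 1) for the sample variance.
            "variance": s["m2"] / s["n"],
        }
        for group, s in state.items()
    }

# Hypothetical (movie, rating) pairs.
ratings = [("m1", 5), ("m1", 3), ("m1", 4), ("m2", 2), ("m2", 2)]
stats = combinable_group_by(ratings)
```

A median has no such fixed-size update, which is why it falls under the generic group-by rather than this one.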



Example: Anatomy of the long tail




              Dataset      Users   Items   Rating levels   Observations
              Movielens    100K    10K     10              10M
              Netflix      500K    20K     5               100M


         What do we do when the full dataset exceeds available memory?




                                              Sampling?
                                 Unreliable estimates for rare groups
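
A small simulation on invented data suggests why sampling struggles with rare groups. The skewed dataset, the 1% sampling rate, and the seed are all assumptions made for illustration.

```python
import random
from collections import Counter

# Hypothetical skewed data: one popular movie and many movies rated once.
observations = ["popular"] * 10_000 + [f"rare_{i}" for i in range(1_000)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample = rng.sample(observations, k=len(observations) // 100)  # 1% sample

full_counts = Counter(observations)
sample_counts = Counter(sample)

# Rare movies that vanish from the sample entirely.
missing = [movie for movie in full_counts if movie not in sample_counts]
print(len(full_counts), "groups in full data;", len(missing), "missing from sample")
```

With only 110 sampled observations, at most 110 of the 1,000 rare movies can appear at all, so most of the tail has no estimate.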




                                       Random access from disk?
                                  1000x more storage, but 1000x slower [1]


              [1] Numbers every programmer should know
                                  Streaming?
          Read data one observation at a time, storing only needed state
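
A minimal streaming sketch, assuming comma-separated `user,movie,rating` lines (a hypothetical layout). The only state kept is a counter of rating values, so memory does not grow with the number of lines read.

```python
import io
from collections import Counter

def stream_rating_counts(lines):
    """Count ratings while holding only one line and the counter in memory."""
    counts = Counter()
    for line in lines:  # file objects yield one line at a time
        user, movie, rating = line.rstrip("\n").split(",")
        counts[rating] += 1
    return counts

# Hypothetical stand-in for a large file; any open text file with the same
# user,movie,rating layout could be passed in its place.
data = io.StringIO("u1,m1,5\nu2,m1,3\nu1,m2,5\n")
counts = stream_rating_counts(data)
```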




The group-by operation


                                      For arbitrary input data:

              Memory               Scenario              Distributions    Statistics
                N                Small dataset                Yes          General
               V*G             Small distributions            Yes          General
                G               Small # groups                No         Combinable
                V              Small # outcomes               No             No
                1                Large # both                 No             No


                            N = total number of observations
                             G = number of distinct groups
                    V = largest number of distinct values within group
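
To make the memory column concrete, the sketch below plugs in the approximate Movielens sizes from the dataset table, assuming ratings are grouped by movie. The figures count items held in memory, not bytes, and the dictionary labels are illustrative.

```python
# Approximate Movielens sizes from the dataset table, grouping ratings by movie.
N = 10_000_000  # total observations (ratings)
G = 10_000      # distinct groups (movies)
V = 10          # distinct values per group (rating levels)

# Number of items each strategy must hold in memory at once.
memory_items = {
    "N (store every observation)": N,
    "V*G (one counter per movie and rating level)": V * G,
    "G (one running summary per movie)": G,
}
for strategy, items in memory_items.items():
    print(f"{strategy}: {items:,}")
```

Under these assumptions, keeping per-movie rating counts needs about 100 times fewer items than storing every rating.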



                                      For pre-grouped input data:

              Memory               Scenario               Distributions    Statistics
                N                Small dataset                 Yes          General
               V*G             Small distributions             Yes          General
                G               Small # groups                 No         Combinable
                V              Small # outcomes                Yes          General
                1                Large # both                  No         Combinable


                            N = total number of observations
                             G = number of distinct groups
                    V = largest number of distinct values within group
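
When the input is already grouped, so that all observations for a key arrive consecutively, one group can be processed at a time. The sketch below uses `itertools.groupby` for this and holds at most one group's counter in memory. The function name and sample data are hypothetical.

```python
from collections import Counter
from itertools import groupby

def distributions_from_grouped(observations):
    """Yield (group, distribution), assuming equal keys arrive consecutively."""
    for group, run in groupby(observations, key=lambda obs: obs[0]):
        counts = Counter(value for _, value in run)  # at most V entries
        total = sum(counts.values())
        yield group, {value: n / total for value, n in counts.items()}

# Hypothetical (movie, rating) stream, already sorted by movie.
grouped = [("m1", 5), ("m1", 5), ("m1", 3), ("m2", 2)]
result = dict(distributions_from_grouped(grouped))
```

If the input were not pre-grouped, `groupby` would start a new run each time the key changed, and the same key could be emitted more than once.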



Computational Social Science, Lecture 02: An Introduction to Counting
