Computational Social Science, Lecture 02: An Introduction to Counting

Introduction to Counting
APAM E4990
Computational Social Science

Jake Hofman

Columbia University

February 1, 2013


Why counting?

sta·tis·tic

noun
1. A function of a random sample of data
2. A functional of the distribution over a random variable


Why counting?

Problem:

Traditionally difficult to estimate distributions due to small sample
sizes or sparsity


Why counting?

Potential solution:

Sacrifice granularity for precision

(e.g., bin observations into larger groups)


Why counting?

Potential solution:

Develop sophisticated methods that generalize well from small
samples

(e.g., model distributions with parametric forms)


Why counting?

(Partial) solution:

When observations are plentiful, simply count and divide to
estimate distributions with relative frequencies
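
As a minimal sketch of "count and divide", relative frequencies can be computed with the Python standard library. The function name `relative_frequencies` and the toy ratings are illustrative assumptions, not part of the lecture.

```python
from collections import Counter

def relative_frequencies(observations):
    """Estimate a discrete distribution by counting and dividing."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

# Toy data (made up for illustration): eight star ratings.
ratings = [5, 4, 4, 3, 5, 4, 1, 4]
freqs = relative_frequencies(ratings)
for value in sorted(freqs):
    print(value, freqs[value])
```

With plentiful observations, each relative frequency is a reasonable estimate of the corresponding probability. With few observations, rare values are estimated poorly or missed entirely.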


Why counting?

The good:

Shift away from sophisticated statistical methods on small samples
to simple methods on large samples


Why counting?

The bad:

Even simple methods (e.g., counting) are computationally
challenging at large scales


Why counting?

Claim:

Solving the counting problem at scale enables you to investigate
many interesting questions in the social sciences


Learning to count

This week:

Counting at small/medium scales on a single machine


Following weeks:

Counting at large scales in parallel


Counting, the easy way

• Load dataset into memory
• Arrange observations into groups of interest
• Compute distributions and statistics within each group
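
The three steps above might look like the following in Python. This is only a sketch under stated assumptions: the column names (`user_id`, `movie_id`, `rating`), the toy rows, and the helper names `load`, `group_by`, and `summarize` are all hypothetical.

```python
import csv
import io
import statistics
from collections import defaultdict

# Toy stand-in for a ratings file (made-up rows; column names are assumptions).
RAW = """user_id,movie_id,rating
1,10,4
2,10,5
3,10,3
1,20,2
2,20,1
"""

def load(source):
    """Step 1: load every observation into memory as a list of dicts."""
    return list(csv.DictReader(source))

def group_by(rows, key, value):
    """Step 2: arrange observations into groups keyed by the `key` column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(float(row[value]))
    return groups

def summarize(groups):
    """Step 3: compute statistics within each group."""
    return {
        k: {"n": len(v), "mean": statistics.mean(v), "median": statistics.median(v)}
        for k, v in groups.items()
    }

rows = load(io.StringIO(RAW))
summary = summarize(group_by(rows, key="movie_id", value="rating"))
for movie_id, stats in summary.items():
    print(movie_id, stats)
```

This approach only works while the whole dataset fits in memory, since `load` materializes every row before any grouping happens.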


Example: Anatomy of the long tail

Dataset    Users  Items  Rating levels  Observations
Movielens  100K   10K    10             10M
Netflix    500K   20K    5              100M


Example: Movielens

[Figure: histogram of Count (0 to 3,000,000) by Rating (1 to 5)]

Example: Movielens

[Figure: Density of Mean Rating by Movie (1 to 5)]

Example: Movielens

[Figure: CDF (0% to 100%) by Movie Rank (0 to 9,000)]

Example: Movielens

[Figure: Density of User eccentricity (axis ticks at 10, 100, 1,000)]

The group-by operation

We estimate conditional distributions by grouping observations by
their group identity, or key, and counting values within groups

$$p(\underbrace{y}_{\text{value}} \mid \underbrace{x_1, x_2, x_3, \ldots}_{\text{key}})$$



Generic Group-By

for each observation:
    place observation in bucket for corresponding group

for each group:
    compute function over observations in bucket
    output group and result



Useful for computing arbitrary within-group statistics when we
have the required memory
(e.g., conditional distribution, median)
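
A minimal Python sketch of the generic group-by follows. It assumes observations arrive as `(key, value)` pairs; the names `generic_group_by` and `conditional_distribution` and the toy data are illustrative, not from the lecture.

```python
import statistics
from collections import Counter, defaultdict

def generic_group_by(observations, func):
    """Bucket every (key, value) observation, then apply `func` to each bucket."""
    buckets = defaultdict(list)
    for key, value in observations:
        # Every value is retained, so memory grows with the number of observations.
        buckets[key].append(value)
    return {key: func(values) for key, values in buckets.items()}

def conditional_distribution(values):
    """Relative frequency of each distinct value within one group."""
    counts = Counter(values)
    return {v: c / len(values) for v, c in counts.items()}

# Toy observations (made up for illustration).
observations = [("a", 1), ("a", 3), ("a", 3), ("b", 5), ("b", 2), ("a", 4)]
medians = generic_group_by(observations, statistics.median)
dists = generic_group_by(observations, conditional_distribution)
print(medians)
print(dists)
```

Because `func` sees the full bucket, any statistic can be plugged in, at the cost of keeping all observations in memory.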



Combinable Group-By

for each observation:
    update result for corresponding group,
    as function of current result and observation

for each group:
    output group and result



Useful for computing a subset of within-group statistics with a
limited memory footprint
(e.g., min, mean, max, variance, etc.)
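
One way to sketch a combinable group-by in Python is to keep a constant-size summary per group and update it as each observation arrives. The variance here uses Welford's online update, which is one common choice and not necessarily the one used in the lecture. The names `RunningStats` and `combinable_group_by` are hypothetical.

```python
import math
from collections import defaultdict

class RunningStats:
    """Constant-size summary: count, min, max, mean, and variance."""

    def __init__(self):
        self.n = 0
        self.min = math.inf
        self.max = -math.inf
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one observation into the current result (Welford's update)."""
        self.n += 1
        self.min = min(self.min, x)
        self.max = max(self.max, x)
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; NaN when fewer than two observations were seen."""
        return self._m2 / (self.n - 1) if self.n > 1 else math.nan

def combinable_group_by(observations):
    """Keep one RunningStats per group; memory scales with the number of groups."""
    results = defaultdict(RunningStats)
    for key, value in observations:
        results[key].update(value)
    return dict(results)

# Toy observations (made up for illustration).
observations = [("a", 1), ("a", 3), ("a", 3), ("b", 5), ("b", 2), ("a", 4)]
stats = combinable_group_by(observations)
for key, s in stats.items():
    print(key, s.n, s.min, s.max, s.mean, s.variance)
```

Statistics such as the median do not fit this pattern, because they cannot be updated from a fixed-size summary of the observations seen so far.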




What do we do when the full dataset exceeds available memory?





Sampling?
Unreliable estimates for rare groups





Random access from disk?
1000x more storage, but 1000x slower [1]

[1] Numbers every programmer should know




Streaming?
Read data one observation at a time, storing only needed state
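
A streaming pass might be sketched in Python as follows, assuming a CSV source with a `rating` column (a hypothetical layout). The generator yields one observation at a time, and the only state kept is a small table of counts.

```python
import csv
import io
from collections import Counter

def stream_ratings(source):
    """Yield one rating at a time; the full dataset is never held in memory."""
    for row in csv.DictReader(source):
        yield int(row["rating"])

def count_ratings(source):
    """Keep only the needed state: one counter per distinct rating."""
    counts = Counter()
    for rating in stream_ratings(source):
        counts[rating] += 1
    return counts

# In practice `source` would be an open file; StringIO keeps the sketch self-contained.
toy = io.StringIO("user_id,movie_id,rating\n1,10,4\n2,10,5\n3,10,4\n1,20,2\n")
counts = count_ratings(toy)
print(counts)
```

Memory use here depends on the number of distinct ratings, not on the number of rows read.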



For arbitrary input data:

Memory  Scenario             Distributions  Statistics
N       Small dataset        Yes            General
V*G     Small distributions  Yes            General
G       Small # groups       No             Combinable
V       Small # outcomes     No             No
1       Large # both         No             No

N = total number of observations
G = number of distinct groups
V = largest number of distinct values within group



For pre-grouped input data:

Memory  Scenario             Distributions  Statistics
N       Small dataset        Yes            General
V*G     Small distributions  Yes            General
G       Small # groups       No             Combinable
V       Small # outcomes     Yes            General
1       Large # both         No             Combinable

N = total number of observations
G = number of distinct groups
V = largest number of distinct values within group
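
To illustrate why pre-grouped input helps, here is a sketch that computes each group's distribution while holding only one group's value counts at a time (memory on the order of V). It assumes the input is a sequence of `(key, value)` pairs in which all observations for a key are adjacent. The function name `pregrouped_distributions` and the toy data are illustrative.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter

def pregrouped_distributions(observations):
    """Yield (key, distribution) pairs from input already grouped by key.

    Only the current group's value counts are held in memory.
    """
    for key, group in groupby(observations, key=itemgetter(0)):
        counts = Counter(value for _, value in group)
        total = sum(counts.values())
        yield key, {value: count / total for value, count in counts.items()}

# Toy observations (made up), with all rows for each key adjacent.
pregrouped = [("a", 1), ("a", 3), ("a", 3), ("a", 4), ("b", 2), ("b", 5)]
result = dict(pregrouped_distributions(pregrouped))
print(result)
```

If the input is not actually grouped, `itertools.groupby` emits the same key more than once, so this sketch relies on the pre-grouping assumption holding.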

