Computational Social Science, Lecture 04: Counting at Scale, Part II

Counting @ Scale
Part II

Sharad Goel
Columbia University
Computational Social Science: Lecture 4

February 15, 2013

Descriptive statistics
(as opposed to inferential statistics)
is about counting
contingency tables
means, variances, quantiles
summaries of conditional distributions

Counting @ scale
conceptually easy
computationally hard

MapReduce:
Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
OSDI, 2004

Map
assign each input line to one or more groups

Shuffle
aggregate groups

Reduce
operate on grouped data

Map
assign each input line to one or more groups
v → [(k1, v1), …, (km, vm)]

Shuffle
aggregate groups

Reduce
operate on grouped data
(k, [v1, …, vn]) → [w1, …, wp]
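
The three phases above can be sketched on a single machine as follows. This is only an in-memory illustration of the programming model, not how a cluster executes it; `run_mapreduce`, `view_mapper`, `count_reducer`, and the sample records are hypothetical names and data chosen for the example.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Simulate map -> shuffle -> reduce in memory on one machine."""
    groups = defaultdict(list)
    for record in records:
        # Map: each input record yields zero or more (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle: collect all values that share a key.
            groups[key].append(value)
    output = []
    # Reduce: each (key, [values]) group yields zero or more outputs.
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output

# Hypothetical example: count views per movie.
views = [("alice", "m1"), ("bob", "m1"), ("alice", "m2")]

def view_mapper(view):
    user, movie = view
    yield movie, 1

def count_reducer(movie, ones):
    yield movie, sum(ones)

view_counts = run_mapreduce(views, view_mapper, count_reducer)
print(view_counts)
```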

Group Average

Input
views (user, movie, rating)

Output
average (mean & median) by movie

Group Average

Map
identity (key := movie)

Reduce
movie group → average
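
A minimal single-machine sketch of the group average, assuming a small hypothetical list of (user, movie, rating) records. The dictionary of lists stands in for the shuffle; the names `ratings`, `by_movie`, and `averages` are illustrative only.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (user, movie, rating) records.
ratings = [
    ("u1", "m1", 4), ("u2", "m1", 2), ("u3", "m1", 3),
    ("u1", "m2", 5), ("u2", "m2", 4),
]

# Map: identity, keyed by movie. Shuffle: group ratings by that key.
by_movie = defaultdict(list)
for user, movie, rating in ratings:
    by_movie[movie].append(rating)

# Reduce: each movie's list of ratings -> (mean, median).
averages = {movie: (mean(rs), median(rs)) for movie, rs in by_movie.items()}
print(averages)
```

Note that the median needs every rating for a movie in one place, which is exactly what grouping provides.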

The Insight of MapReduce
One can efficiently group identical items

Many tasks are computationally easier on grouped data

Filter

Input
arbitrary data & filter condition

Output
subset of input data satisfying condition

Filter

Map
input → input if condition(input) else pass

Reduce
identity
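
As a rough sketch, a filter needs only a mapper that emits a record or nothing. The names `filter_mapper` and `is_high`, and the sample ratings, are assumptions made for illustration.

```python
def filter_mapper(record, condition):
    """Emit the record unchanged if it satisfies the condition, else emit nothing."""
    if condition(record):
        yield record

# Hypothetical (user, movie, rating) records and filter condition.
ratings = [("u1", "m1", 4), ("u2", "m1", 2), ("u1", "m2", 5)]

def is_high(record):
    return record[2] >= 4

# No grouping is needed, so the reduce step is the identity.
kept = [out for rec in ratings for out in filter_mapper(rec, is_high)]
print(kept)
```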

Distinct

Input
set of items

Output
subset of distinct items

Distinct

Map
identity

Reduce
grouped items → single item from group
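
One way to picture distinct: use the item itself as the key, so the shuffle brings duplicates together, then keep one item per group. The sketch below assumes a small hypothetical list of strings.

```python
from collections import defaultdict

items = ["b", "a", "b", "c", "a"]  # hypothetical input

# Map: identity, with the item itself as the key.
# Shuffle: identical items end up in the same group.
groups = defaultdict(list)
for item in items:
    groups[item].append(item)

# Reduce: emit a single representative from each group.
distinct = sorted(group[0] for group in groups.values())
print(distinct)
```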

Sample

Input
set of items & sample probability p

Output
random subset of items

Sample

Map
input → input if rand(0,1) < p else pass

Reduce
identity
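
A sketch of the sampling mapper, assuming a seeded `random.Random` so the run is repeatable. Each item is kept independently with probability p, so the sample size is random (about p times the input size) rather than fixed. `sample_mapper` is a hypothetical name.

```python
import random

def sample_mapper(record, p, rng):
    """Keep each record independently with probability p."""
    if rng.random() < p:
        yield record

rng = random.Random(0)   # seeded for repeatability (an assumption of this sketch)
items = range(10_000)    # hypothetical input
sample = [out for item in items for out in sample_mapper(item, 0.1, rng)]
print(len(sample))
```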

Sort

Input
set of items (and a key)

Output
ordered set of items

Sort

Map
identity, with all data assigned to the same key

Reduce
identity

*all the work happens in the shuffle

Sort

Map
identity, with key := first letter of line

Reduce
identity

*all the work happens in the shuffle

Sort

Sample
generate a small sample of the data (with MapReduce)

Determine breakpoints
sort the sample and compute percentiles

Sort

Map
identity, with key determined by breakpoints

Reduce
identity

*most of the work happens in the shuffle
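
The sample-then-partition sort can be sketched on one machine as below. The input data, the 10% sampling rate, and the choice of four reducers are assumptions made for illustration. The local `sorted` call stands in for the ordering that the shuffle would deliver to each reducer.

```python
import bisect
import random
from collections import defaultdict

rng = random.Random(0)
data = [rng.randint(0, 999) for _ in range(1000)]  # hypothetical input
num_reducers = 4

# Sample: take a small random sample of the data.
sample = sorted(x for x in data if rng.random() < 0.1)

# Determine breakpoints: percentiles of the sorted sample.
breakpoints = [sample[len(sample) * i // num_reducers]
               for i in range(1, num_reducers)]

# Map: key each item by the range it falls into.
partitions = defaultdict(list)
for x in data:
    partitions[bisect.bisect_right(breakpoints, x)].append(x)

# Each partition is ordered locally; concatenating partitions
# in key order gives a globally sorted result.
result = [x for key in sorted(partitions) for x in sorted(partitions[key])]
```

Because the breakpoints come from percentiles of the sample, the partitions should be roughly equal in size, which is what the single-key and first-letter versions fail to guarantee.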

Combining data

Example
for each user, want to compute the
average popularity of the movies they watch

Problem
one file contains views (user, movie);
another file contains popularity (movie, rank)

Joins

User   Movie
23     829
789    24
234    5678
7      24

Movie  Rank
5678   4
24     100
829    34

User   Movie  Rank
23     829    34
789    24     100
234    5678   4
7      24     100

Nested-Loop Joins

for each user in users:
    for each movie in movies:
        if user.movie_id == movie.id:
            output user.id, movie.id, movie.rank

Sort-Merge Joins

User   Movie
789    24
7      24
23     829
234    5678

Movie  Rank
24     100
829    34
5678   4

User   Movie  Rank
789    24     100
7      24     100
23     829    34
234    5678   4
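
A minimal sketch of a sort-merge join, using the rows from the tables above. It assumes each movie appears at most once in the rank table; `sort_merge_join` is a hypothetical helper name.

```python
views = [(23, 829), (789, 24), (234, 5678), (7, 24)]  # (user, movie)
ranks = [(5678, 4), (24, 100), (829, 34)]             # (movie, rank)

def sort_merge_join(views, ranks):
    """Inner join on movie; assumes each movie appears at most once in ranks."""
    # Sort both tables on the join key.
    left = sorted(views, key=lambda v: v[1])
    right = sorted(ranks, key=lambda r: r[0])
    joined, j = [], 0
    # Merge: walk both sorted tables with a single forward pass.
    for user, movie in left:
        # Advance the right cursor past movies smaller than the current one.
        while j < len(right) and right[j][0] < movie:
            j += 1
        if j < len(right) and right[j][0] == movie:
            joined.append((user, movie, right[j][1]))
    return joined

joined = sort_merge_join(views, ranks)
print(joined)
```

After the two sorts, the merge touches each row once, in contrast to the nested-loop join, which compares every pair of rows.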

Hash Joins

User   Movie
23     829
789    24
234    5678
7      24

Movie  Rank
5678   4
24     100
829    34

Distributed Joins

Map
reduce key := hash(join key)

Reduce
local (sort-merge) join

*also need to keep track of which table
is the left and which is the right
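
A single-machine sketch of the distributed join: every record is tagged with its source table and routed by a hash of the join key, so matching rows meet at the same reducer. For brevity the local join here uses dictionaries rather than a sort-merge. The tags "L" and "R" and the choice of two reducers are assumptions of this sketch.

```python
from collections import defaultdict

views = [(23, 829), (789, 24), (234, 5678), (7, 24)]  # left table: (user, movie)
ranks = [(5678, 4), (24, 100), (829, 34)]             # right table: (movie, rank)
num_reducers = 2

# Map: tag each record with its table and route it by hash(join key).
shuffled = defaultdict(list)
for user, movie in views:
    shuffled[hash(movie) % num_reducers].append((movie, "L", user))
for movie, rank in ranks:
    shuffled[hash(movie) % num_reducers].append((movie, "R", rank))

# Reduce: each reducer joins only the records it received.
joined = []
for records in shuffled.values():
    left, right = defaultdict(list), defaultdict(list)
    for movie, tag, value in records:
        (left if tag == "L" else right)[movie].append(value)
    for movie, users in left.items():
        for user in users:
            for rank in right.get(movie, []):
                joined.append((user, movie, rank))

print(sorted(joined))
```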

Joins
{ inner, left, right, outer }

User   Movie
23     829
789    24
234    5678
7      24
789    90
23     758
23     39
2      782

User   Sex
23     male
789    female
234    female
7      male
26     male
567    female
2      female

User   Sex
23     male
789    female
234    female
7      male
26     male
567    female
2      female

User   Activity
23     3
789    2
234    1
7      1
2      1

Left Join

User   Sex     Activity
23     male    3
789    female  2
234    female  1
7      male    1
26     male
567    female
2      female  1

Inner Join

User   Sex     Activity
23     male    3
789    female  2
234    female  1
7      male    1
2      female  1

Inner join
returns pairs of rows in tables A & B
that match join condition

Left (outer) join
returns all rows from an inner join plus
rows in the left table that do not match to the right table

Full (outer) join
returns all rows from an inner join plus
rows in either table that do not match to the other
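
The join types differ only in which keys they keep, which the sketch below tries to make explicit. It assumes each user appears at most once per table and uses `None` for a missing side. The `join` helper and user 999 (added so the right table has an unmatched row) are hypothetical.

```python
sex = {23: "male", 26: "male", 2: "female"}  # left table: user -> sex
activity = {23: 3, 2: 1, 999: 5}             # right table: user -> activity

def join(left, right, how="inner"):
    """Join two {key: value} tables; None marks a missing side."""
    if how == "inner":
        keys = [k for k in left if k in right]
    elif how == "left":
        keys = list(left)
    elif how == "right":
        keys = list(right)
    elif how == "outer":
        keys = list(left) + [k for k in right if k not in left]
    else:
        raise ValueError(f"unknown join type: {how}")
    return [(k, left.get(k), right.get(k)) for k in keys]

inner = join(sex, activity, "inner")
left_join = join(sex, activity, "left")
outer = join(sex, activity, "outer")
```

Here user 26 survives only the left and outer joins, and user 999 only the right and outer joins.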

Map-side Joins

Map
load (smaller) table into memory
stream through (larger) table and find matches

Reduce
identity
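
A sketch of a map-side join, assuming the rank table is small enough to fit in memory on every mapper. In effect this is a hash join run inside the map phase, so no shuffle is needed. `map_side_join` is a hypothetical name.

```python
def map_side_join(large_stream, small_table):
    """Inner join (user, movie) records against an in-memory (movie, rank) table."""
    lookup = dict(small_table)        # load the smaller table into memory once
    for user, movie in large_stream:  # stream the larger table record by record
        if movie in lookup:
            yield user, movie, lookup[movie]

views = iter([(23, 829), (789, 24), (234, 5678), (7, 24)])  # larger table, streamed
ranks = [(5678, 4), (24, 100), (829, 34)]                   # smaller table

joined = list(map_side_join(views, ranks))
print(joined)
```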

MapReduce Ops

Map-only
filter, sample, map-side join

Map & Reduce
groupby, distinct, sort, join

The long tail

Input
(user, movie) views

Output
for each user, average popularity
of movies they watch

Step 1. Compute movie popularity
group views by movie & count

Step 2. Rank movies
sort by popularity

Step 3. Merge view and ranking data
join views and movie popularity tables

Step 4. Compute eccentricity
group views/ranking by user and
average the popularity ranks of the movies each user watched
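
The four steps can be strung together on one machine as sketched below, with each step standing in for one MapReduce job. The view data is hypothetical, ties in popularity are broken by movie id (an assumption), and eccentricity is taken to be the mean popularity rank of a user's movies, following the stated output.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical (user, movie) views.
views = [("u1", "a"), ("u2", "a"), ("u3", "a"),
         ("u1", "b"), ("u2", "b"), ("u3", "c")]

# Step 1: group views by movie and count.
popularity = Counter(movie for _, movie in views)

# Step 2: sort by popularity; rank 1 is the most popular movie.
ordered = sorted(popularity, key=lambda m: (-popularity[m], m))
rank = {movie: i for i, movie in enumerate(ordered, start=1)}

# Step 3: join each view with its movie's rank.
ranked_views = [(user, rank[movie]) for user, movie in views]

# Step 4: group by user and average the ranks.
by_user = defaultdict(list)
for user, r in ranked_views:
    by_user[user].append(r)
eccentricity = {user: mean(rs) for user, rs in by_user.items()}
print(eccentricity)
```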

Pig Latin:
A Not-So-Foreign Language for Data Processing
Olston, Reed, Srivastava, Kumar, and Tomkins
SIGMOD, 2008
