Bruno Gonçalves
www.bgoncalves.com
Making Sense of Data Big and Small

Big Data

"The Unreasonable Effectiveness of Data"
Alon Halevy, Peter Norvig, and Fernando Pereira, Google

"Eugene Wigner's article 'The Unreasonable Effectiveness of Mathematics in the Natural Sciences' examines why so much of physics can be […] behavior. So, this corpus could serve as the basis of a complete model for certain tasks, if only we knew how to extract the model from the data."
From Data To Information

Statistics are numbers that summarize raw facts and figures in some meaningful way. They present key ideas that may not be immediately apparent by just looking at the raw data, and by data, we mean facts or figures from which we can draw conclusions. As an example, you don't have to wade through lots of football scores when all you want to know is the league position of your favorite team. You need a statistic to quickly give you the information you need.

The study of statistics covers where statistics come from, how to calculate them, and how you can use them effectively.

• Gather data: at the root of statistics is data. Data can be gathered by looking through existing sources, conducting experiments, or conducting surveys.
• Analyze: once you have data, you can analyze it and generate statistics. You can calculate probabilities to see how likely certain events are, test ideas, and indicate how confident you are about your results.
• Draw conclusions: when you've analyzed your data, you make decisions and predictions.
Data Science

The classic Venn diagram: Data Science sits at the intersection of three circles, Hacking, Statistics, and Domain Knowledge. Hacking plus Statistics gives Machine Learning; Statistics plus Domain Knowledge gives Traditional Research; and Hacking plus Domain Knowledge, without the Statistics, is the Danger Zone!
Data Science

(A tongue-in-cheek variant of the same Venn diagram, with circles for Data Nerds, Art Nerds, and Stats Nerds, and intersections labeled Data Mining, Visualization, GUI Programmers, and High Salaries.)
Data Science

"Data Scientist: The Sexiest Job of the 21st Century"
Meet the people who can coax treasure out of messy, unstructured data.
by Thomas H. Davenport and D.J. Patil, Harvard Business Review, October 2012

When Jonathan Goldman arrived for work in June 2006 at LinkedIn, the business networking site, the place still felt like a start-up. The company had just under 8 million accounts, and the number was growing quickly as existing members invited their friends and colleagues to join. But users weren't seeking out connections with the people who were already on the site at the rate executives had expected. Something was apparently missing in the social experience. As one LinkedIn manager put it, "It was like arriving at a conference reception and realizing you don't know anyone. So you just stand in the corner sipping your drink, and you probably leave early."
"Zero is the most natural number" (E. W. Dijkstra)

Count!
• How many items do we have?
Descriptive Statistics

• Min, Max

• Mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$

• Standard Deviation: $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$
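A minimal sketch of these statistics in NumPy (the sample array is invented for illustration):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

print(x.min(), x.max())   # smallest and largest values
mu = x.mean()             # mean: sum(x) / N
sigma = x.std()           # population standard deviation (ddof=0)
print(mu, sigma)          # 5.0 2.0
```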
Anscombe's Quartet

x1    y1       x2    y2       x3    y3       x4    y4
10.0  8.04     10.0  9.14     10.0  7.46     8.0   6.58
8.0   6.95     8.0   8.14     8.0   6.77     8.0   5.76
13.0  7.58     13.0  8.74     13.0  12.74    8.0   7.71
9.0   8.81     9.0   8.77     9.0   7.11     8.0   8.84
11.0  8.33     11.0  9.26     11.0  7.81     8.0   8.47
14.0  9.96     14.0  8.10     14.0  8.84     8.0   7.04
6.0   7.24     6.0   6.13     6.0   6.08     8.0   5.25
4.0   4.26     4.0   3.10     4.0   5.39     19.0  12.50
12.0  10.84    12.0  9.13     12.0  8.15     8.0   5.56
7.0   4.82     7.0   7.26     7.0   6.42     8.0   7.91
5.0   5.68     5.0   4.74     5.0   5.73     8.0   6.89

All four datasets share the same summary statistics: $\mu_x = 9$, $\sigma_x^2 = 11$, $\mu_y = 7.50$, $\sigma_y^2 \approx 4.125$, $\rho = 0.816$, and linear fit $y = 3 + 0.5x$.

(Figure: the four scatter plots, each drawn on axes $x \in [0, 20]$, $y \in [0, 13]$.)
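The shared statistics of the quartet are easy to verify; a quick check of the first dataset (the values are copied from the table above):

```python
import numpy as np

x1 = np.array([10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0])
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

print(x1.mean())                             # 9.0
print(round(float(y1.mean()), 2))            # 7.5
print(round(float(np.corrcoef(x1, y1)[0, 1]), 3))  # 0.816
slope, intercept = np.polyfit(x1, y1, 1)
print(slope, intercept)                      # close to 0.5 and 3.0
```

The other three datasets give the same numbers, which is exactly why plotting the data first matters.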
Central Limit Theorem

• As $n \to \infty$, the random variables $S_n = \frac{1}{n}\sum_i x_i$
• with mean $\mu$ and finite variance $\sigma^2$
• converge to a normal distribution: $\sqrt{n}\,(S_n - \mu) \to N(0, \sigma^2)$
• after some manipulations, we find: $S_n \sim \mu + \frac{N(0, \sigma^2)}{\sqrt{n}}$

The estimate of the mean converges to the true mean with the square root of the number of samples: $SE = \frac{\sigma}{\sqrt{n}}$
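A quick simulation of the $\sigma/\sqrt{n}$ scaling (uniform draws are an arbitrary choice; any distribution with finite variance behaves the same way):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1 / np.sqrt(12)   # std of a uniform(0, 1) variable

for n in (10, 100, 1000):
    # 5000 independent sample means of n draws each
    means = rng.random((5000, n)).mean(axis=1)
    print(n, means.std(), sigma / np.sqrt(n))   # observed vs predicted SE
```

The observed spread of the sample means tracks the predicted standard error at every n.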
Gaussian Distribution - Maximally Entropic

$P_N(x, \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
Broad-tailed distributions

$P_p(k, \gamma) = \frac{1}{C}\, k^{-\gamma}$
(Almost) Everyone is below average!

(For a broad-tailed distribution, roughly 80% of the values fall below the mean $\mu$ and only 20% above.)
Outliers

• "Bill Gates walks into a bar and on average every patron is a millionaire…"
• …but the median remains the same.

Example: 1 1 1 2 2 2 1000 → Mean = 144.14, Median = 2

Median: "the value that separates the lower 50% of the distribution from the higher 50%"
Quantiles
• Quantiles - Points taken at regular intervals of the cumulative distribution function
• Quartiles - Ranked set of points that divide the range in 4 equal intervals (25%, 50%,
75% quantiles)
Box and whiskers plot

(Figure: number of cases outside the target country vs. time in days, on a logarithmic scale.)

• Shows the variation of the data within each bin.
• More informative than just averages or medians.
• Useful to summarize experimental measurements, simulation results, natural variations, etc. when fluctuations are important.
Reference Range

(Figure: simulated incidence curves for Spain, September through January, showing the median and the 95% reference range between the 2.5% and 97.5% percentiles; panels compare no intervention against ~4 weeks of antiviral treatment.)

• Useful for continuous curves
• Indicates level of certainty: "95% of the cases are in this range"
Tools For Statistical Analysis

Name               Advantages                              Disadvantages                              Open Source
R                  Library support and Visualization       Steep learning curve                       Yes
Matlab             Native matrix support, Visualization    Expensive, incomplete statistics support   No
Scientific Python  Ease and Simplicity                     Heavy development                          Yes
Excel              Easy, Visual, Flexible                  Large datasets                             No
SAS                Large Datasets                          Expensive, outdated programming language   No
Stata              Easy Statistical Analysis                                                          No
SPSS               Like Stata, but more expensive and less flexible
Correlations
Pearson (Linear) Correlation Coefficient

• Does increasing one variable also increase the other?

$\rho = \frac{1}{N}\,\frac{\sum_{i=1}^{N}(x_i - \mu_X)(y_i - \mu_Y)}{\sigma_X\,\sigma_Y}$
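In practice, `np.corrcoef` computes exactly this coefficient (the data here is synthetic, for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = 2 * x + 1                        # exactly linear relation
noisy = 2 * x + rng.normal(size=1000)

print(np.corrcoef(x, y)[0, 1])       # essentially 1 for an exact linear relation
print(np.corrcoef(x, noisy)[0, 1])   # strong but imperfect correlation
```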
$R^2$

• The square of the Pearson correlation between the data and the fit.
• The amount of variance of the data that is explained by the "model".
Spearman Rank Correlation

• Equivalent to the Pearson correlation coefficient of the ranked variables
• $d_i^2$ is the squared difference in ranks of point $i$
• Less sensitive to outliers, as values are limited by rank

$\rho = 1 - \frac{6\sum_{i=1}^{N} d_i^2}{N(N^2 - 1)}$
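The "Pearson of the ranks" equivalence can be seen directly (a small sketch; the `rank` helper ignores ties for simplicity):

```python
import numpy as np

def rank(v):
    # Rank of each element (0 = smallest); ties are ignored in this sketch
    r = np.empty(len(v), dtype=float)
    r[np.argsort(v)] = np.arange(len(v))
    return r

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3   # monotonic but non-linear

print(np.corrcoef(x, y)[0, 1])               # Pearson < 1 (non-linearity)
print(np.corrcoef(rank(x), rank(y))[0, 1])   # Spearman = 1 (perfectly monotonic)
```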
Causation
Probability

The total probability of the whole sample space is $p = 1$. For two events A and B (picture two overlapping regions inside the space):

• $P(A)$ = Area of A
• $P(A \text{ or } B) = P(A) + P(B)$ (when A and B do not overlap)
• $P(A \text{ and } B)$ = overlap of A and B
• $P(B|A) = \frac{P(A \text{ and } B)}{P(A)}$
• $P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$
Bayes Theorem

$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$

with, by total probability:

$P(B) = P(B|A)\,P(A) + P(B|\neg A)\,P(\neg A)$
Medical Tests

Your doctor thinks you might have a rare disease that affects 1 person in 10,000. A test that is 99% accurate comes out positive. What's the probability of you having the disease?

Bayes Theorem:

$P(\text{disease}|\text{positive test}) = \frac{P(\text{positive test}|\text{disease})\,P(\text{disease})}{P(\text{positive test})}$

Total Probability:

$P(\text{positive test}) = P(\text{positive test}|\text{disease})\,P(\text{disease}) + P(\text{positive test}|\text{no disease})\,P(\text{no disease})$

Finally:

$P(\text{disease}|\text{positive test}) = 0.0098$
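The arithmetic, spelled out ("99% accurate" is read here as both 99% sensitivity and a 1% false positive rate, as in the slide):

```python
p_disease = 1 / 10000          # base rate
p_pos_given_disease = 0.99     # sensitivity
p_pos_given_healthy = 0.01     # false positive rate

# Law of total probability
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes theorem
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 4))   # 0.0098
```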
Medical Tests

Base Rate Fallacy: a low base rate value combined with a non-zero false positive rate means that most positive results are false positives.
Medical Tests

Consider a population of 1,000,000 individuals. The numbers we should expect are:

             disease    no disease
positive          99         9,999       10,098
negative           1       989,901      989,902
                 100       999,900    1,000,000
The row and column totals are the marginals. Reading the answer directly off the table:

$P(\text{disease}|\text{positive test}) = \frac{TP}{TP + FP} = \frac{99}{10{,}098} = 0.0098$

$P(\text{no disease}|\text{negative test}) = \frac{TN}{TN + FN} = \frac{989{,}901}{989{,}902} = 0.99999$
(Confusion Matrix)

                    Feature
Test          positive    negative
positive         TP          FP
negative         FN          TN

accuracy = $\frac{TP + TN}{TP + TN + FP + FN}$
precision = $\frac{TP}{TP + FP}$
sensitivity = $\frac{TP}{TP + FN}$
specificity = $\frac{TN}{FP + TN}$
harmonic mean $F_1 = \frac{2TP}{2TP + FP + FN}$
A second Test

Bayes Theorem still looks the same:

$P(\text{disease}|\text{positive test}) = \frac{P(\text{positive test}|\text{disease})\,P(\text{disease})}{P(\text{positive test})}$

but now the probability that we have the disease has been updated:

$P^{\dagger}(\text{disease}) = 0.0098$

So this time we find:

$P^{\dagger}(\text{disease}|\text{positive test}) = 0.4949$

Each test provides new evidence, and Bayes theorem simply tells us how to use it to update our beliefs.
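The sequential updating can be wrapped in a single function (`update` is a hypothetical helper, not part of any library; note that carrying full precision gives 0.495, while the 0.4949 above comes from re-using the rounded prior 0.0098):

```python
def update(prior, sensitivity=0.99, false_pos=0.01):
    # One application of Bayes theorem after a positive test result
    evidence = sensitivity * prior + false_pos * (1 - prior)
    return sensitivity * prior / evidence

p = 1 / 10000
p = update(p)       # after the first positive test
print(round(p, 4))  # 0.0098
p = update(p)       # after a second positive test
print(round(p, 3))  # 0.495
```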
Bayesian Coin Flips

• Biased coin with unknown probability of heads (p)
• Perform N flips and update our belief after each flip using Bayes Theorem:

$P(p|\text{heads}) = \frac{P(\text{heads}|p)\,P(p)}{P(\text{heads})}$

$P(p|\text{tails}) = \frac{P(\text{tails}|p)\,P(p)}{P(\text{tails})}$

import numpy as np

# bins (grid resolution) and flips (the observed coin flips) assumed defined

# Uninformative (uniform) prior over a discrete grid of p values
prior = np.ones(bins, dtype='float') / bins

# Likelihood of heads for each candidate value of p
likelihood_heads = np.arange(bins) / float(bins)
likelihood_tails = 1 - likelihood_heads

for coin in flips:
    if coin:  # Heads
        posterior = prior * likelihood_heads
    else:  # Tails
        posterior = prior * likelihood_tails

    # Normalize
    posterior /= np.sum(posterior)

    # The posterior is now the new prior
    prior = posterior

http://youtu.be/GTx0D8VY0CY
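A self-contained run of the same loop on simulated flips (the true bias 0.7, grid size, and number of flips are arbitrary choices for this demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
bins = 100
flips = rng.random(1000) < 0.7   # simulated biased coin, true p = 0.7

prior = np.ones(bins) / bins                 # uniform prior over p
likelihood_heads = np.arange(bins) / bins
likelihood_tails = 1 - likelihood_heads

for coin in flips:
    posterior = prior * (likelihood_heads if coin else likelihood_tails)
    prior = posterior / posterior.sum()      # normalize; posterior becomes prior

p_map = np.argmax(prior) / bins              # most probable bias on the grid
print(p_map)                                 # close to the true 0.7
```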
Naive Bayes Classifier

• Let's consider spam detection for a second. Suppose you know $P(\text{spam}|\text{word}_i)$ and $P(\text{not spam}|\text{word}_i)$:
• You know the probability that a specific word is used in a spam email. But how can you determine the probability $P(\text{spam}|\text{word}_1, \text{word}_2, \cdots, \text{word}_n)$ that an email (set of words) is spam?
• You can simply assume that all the probabilities are independent:

$P(\text{spam}|\text{word}_1, \text{word}_2, \cdots, \text{word}_n) = \prod_i P(\text{spam}|\text{word}_i)$

• This is known as Naive Bayes and is surprisingly effective in many real world contexts.
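A toy sketch of this product rule (the per-word probabilities are made up; the spam and not-spam products are normalized against each other so the score stays a probability):

```python
import numpy as np

# Hypothetical per-word spam probabilities, for illustration only
p_spam_given_word = {"free": 0.8, "money": 0.7, "meeting": 0.2}

def spam_score(words):
    # Naive assumption: treat the per-word probabilities as independent
    spam = np.prod([p_spam_given_word[w] for w in words])
    ham = np.prod([1 - p_spam_given_word[w] for w in words])
    return spam / (spam + ham)   # normalize the two hypotheses

print(round(float(spam_score(["free", "money"])), 2))   # 0.9
print(round(float(spam_score(["meeting"])), 2))         # 0.2
```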
Maximum Likelihood Estimation

• Given a distribution, $P(x)$, how likely are we to see a given set of data points, $x_i$?
• The probability of each point is, simply: $P(x_i)$
• So the probability of a given realization is: $\prod_i P(x_i)$
• For mathematical convenience, we define the log-likelihood as: $\mathcal{L} = \sum_i \log[P(x_i)]$
• The set of parameters that maximizes $\mathcal{L}$ characterizes the distribution most likely to have generated the data.
MLE Coin Flips

• Biased coin with unknown probability of heads (p)
• In a sequence of N flips, the likelihood of $N_h$ heads and $N_t = N - N_h$ tails is (ignoring the combinatorial factor):

$\mathcal{L} = \log\left[p^{N_h}(1-p)^{N-N_h}\right]$

• or simply:

$\mathcal{L} = N_h \log[p] + (N - N_h)\log[1-p]$

• Taking the derivative:

$\frac{\partial \mathcal{L}}{\partial p} = \frac{N_h}{p} - \frac{N - N_h}{1-p}$

• Setting to zero and solving for p:

$p = \frac{N_h}{N}$
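A numerical sanity check: evaluating the log-likelihood on a grid of p values and taking the argmax recovers $N_h/N$ (7 heads in 10 flips is an arbitrary example):

```python
import numpy as np

N, Nh = 10, 7   # 7 heads in 10 flips
p_grid = np.linspace(0.001, 0.999, 999)

# Log-likelihood from the slide (combinatorial factor omitted)
L = Nh * np.log(p_grid) + (N - Nh) * np.log(1 - p_grid)

p_hat = p_grid[np.argmax(L)]
print(p_hat)   # 0.7 = Nh / N
```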
Binomial Distribution

• The probability of getting k successes in n trials of probability p (k heads in n coin flips):

$P_B(k, n, p) = \frac{n!}{k!\,(n-k)!}\, p^k (1-p)^{n-k}$

• The mean value is: $\mu = np$
• and the variance: $\sigma^2 = np(1-p)$
• and for sufficiently large n: $P_B(k, n, p) \sim P_N(np,\, np(1-p))$
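The mean and variance formulas are easy to confirm by simulation (n = 20 and p = 0.3 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 0.3
k = rng.binomial(n, p, size=200_000)

print(k.mean(), n * p)            # both close to 6.0
print(k.var(), n * p * (1 - p))   # both close to 4.2
```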
(Beta Distribution)

• Related to the Binomial and has a very similar form:

$P(x, \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, x^{\alpha-1}(1-x)^{\beta-1}$

• with $x \in [0, 1]$ and $\alpha, \beta > 0$.
• $\Gamma(a)$ is the continuous extension of the factorial $a!$
• The mean is: $\mu = \frac{\alpha}{\alpha + \beta}$
• And the variance: $\sigma^2 = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}$
A/B Testing

• Divide users into two groups, A and B
• Measure some metric for each group (conversion probability, for example): $p_A$, $p_B$
A/B Testing

• If conversion is a binomial process, then:
• Standard Error: $SE = \sqrt{\frac{p(1-p)}{N}}$
• Z score: $Z = \frac{p_A - p_B}{\sqrt{SE_A^2 + SE_B^2}}$
p-value

• Calculate the probability of an event more extreme than the observation under the "null hypothesis"
• The smaller the p-value the better:
  • p < 0.05: moderate evidence against the null hypothesis
  • p < 0.01: strong evidence against the null hypothesis
  • p < 0.001: very strong evidence against the null hypothesis
Berkeley Discrimination Case Part I

           Candidates    Acceptance Rate    SE
Men             8442          0.44          5.4x10-3
Women           4321          0.35          7.2x10-3

$Z = \frac{p_A - p_B}{\sqrt{SE_A^2 + SE_B^2}} = 9.9 \quad\Rightarrow\quad p \approx 10^{-23}$
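Reproducing the table's numbers from the SE and Z formulas above:

```python
import numpy as np

p_men, n_men = 0.44, 8442
p_women, n_women = 0.35, 4321

se_men = np.sqrt(p_men * (1 - p_men) / n_men)
se_women = np.sqrt(p_women * (1 - p_women) / n_women)

z = (p_men - p_women) / np.sqrt(se_men**2 + se_women**2)
print(round(float(se_men), 4), round(float(se_women), 4))   # 0.0054 0.0073
print(round(float(z), 1))                                   # 9.9
```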
p-value
“Statistical significance does not imply scientific significance”
(Bonferroni Correction)

• You can think of p as the probability of observing a result as extreme by chance. With n comparisons, this probability becomes:

$p_n = 1 - (1 - p)^n$

which quickly goes to 1 as n increases.
• However, by replacing p by p/n for each individual comparison, we obtain:

$p_n = 1 - \left(1 - \frac{p}{n}\right)^n$

• and for sufficiently large n:

$p_n \approx 1 - e^{-p} \approx p$

• allowing us to keep the probability of false results arbitrarily low even with arbitrarily large numbers of comparisons.
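The effect in numbers, for 100 comparisons at the 0.05 level:

```python
n, p = 100, 0.05

naive = 1 - (1 - p) ** n           # chance of at least one false positive
corrected = 1 - (1 - p / n) ** n   # after the Bonferroni correction

print(round(naive, 2))       # 0.99: a false positive is almost certain
print(round(corrected, 3))   # back below the 0.05 threshold
```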
Simpson's Paradox

Berkeley Discrimination Case Part II: The statisticians strike back. (Science 187, 398 (1975))

           Candidates    Acceptance Rate
Men             8442          0.44
Women           4321          0.35

By department:

Dept      Men: Candidates    Acceptance    Women: Candidates    Acceptance
A                825            0.62              108              0.82
B                560            0.63               25              0.68
C                325            0.37              594              0.34
D                417            0.33              375              0.35
E                191            0.28              393              0.24
F                272            0.06              341              0.07
Total           2590            0.46             1835              0.30
"aggregated data can appear to reverse important trends in the numbers being combined" (WSJ, Dec 2, 2009)
MLE - Fitting a theoretical function to experimental data (Least Squares Fitting)

• In an experimental measurement, we expect (CLT) the experimental values to be normally distributed around the theoretical value with a certain variance. Mathematically, this means:

$P(y|f(x)) \approx \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(y - f(x))^2}{2\sigma^2}\right]$

• where $y$ are the experimental values and $f(x)$ the theoretical ones. The likelihood is then:

$\mathcal{L} = -\frac{N}{2}\log\left[2\pi\sigma^2\right] - \sum_i \left[\frac{(y_i - f(x_i))^2}{2\sigma^2}\right]$

• where we see that to maximize the likelihood we must minimize the sum of squares.
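A least-squares fit in one call, on synthetic data drawn around a known line (true slope 0.5 and intercept 3, matching the Anscombe fit, are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 3 + 0.5 * x + rng.normal(0, 0.5, size=100)   # noisy line

slope, intercept = np.polyfit(x, y, 1)   # least-squares fit
print(slope, intercept)                  # close to 0.5 and 3
```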
MLE - Fitting a power-law to experimental data

• We often find what look like power-law distributions in empirical data:

$P(k) = \frac{\gamma - 1}{k_{min}} \left(\frac{k}{k_{min}}\right)^{-\gamma}$

and we would like to find the right parameter values.

• The likelihood of any set of points is:

$\mathcal{L} = \sum_i \log\left[\frac{\gamma - 1}{k_{min}} \left(\frac{k_i}{k_{min}}\right)^{-\gamma}\right]$

• And maximizing, we find:

$\gamma = 1 + n\left[\sum_i \log\left(\frac{k_i}{k_{min}}\right)\right]^{-1}$

• with a standard error of:

$SE = \frac{\gamma - 1}{\sqrt{n}}$

SIAM Rev. 51, 661 (2009)
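The estimator can be checked end-to-end by sampling a continuous power law with inverse-transform sampling and recovering its exponent (γ = 2.5 and k_min = 1 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma_true, k_min, n = 2.5, 1.0, 50_000

# Inverse-transform sampling of a continuous power law
u = rng.random(n)
k = k_min * (1 - u) ** (-1 / (gamma_true - 1))

gamma_hat = 1 + n / np.sum(np.log(k / k_min))
se = (gamma_hat - 1) / np.sqrt(n)
print(gamma_hat, se)   # close to 2.5, with a small standard error
```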
Clustering

K-Means

• Choose k randomly chosen points to be the centroid of each cluster
• Assign each point to the cluster whose centroid is closest
• Recompute the centroid positions (mean cluster position)
• Repeat until convergence
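The four steps above in plain NumPy (a toy sketch: two obvious clusters, and a deterministic starting choice of centroids instead of random initialization, for reproducibility):

```python
import numpy as np

points = np.array([[0, 0], [0, 1], [1, 0],
                   [10, 10], [10, 11], [11, 10]], dtype=float)
k = 2
centroids = points[[0, 3]].copy()   # deterministic start for reproducibility

for _ in range(10):
    # Assign each point to the closest centroid
    d = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
    labels = d.argmin(axis=1)
    # Recompute each centroid as the mean position of its cluster
    centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])

print(centroids)   # one centroid per obvious cluster
```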
K-Means: Structure

The cluster assignments partition the space into a Voronoi tessellation around the centroids.
K-Means: Convergence
• How to quantify the “quality” of the solution found at each iteration, ?
• Measure the “Inertia”, the square intra-cluster distance:







where are the coordinates of the centroid of the cluster to which is assigned.
• Smaller values are better
• Can stop when the relative variation is smaller than some value
µi xi
In+1 In
In
< tol
In =
NX
i=0
kxi µik
2
n
K-Means: sklearn

from sklearn.cluster import KMeans

# data (the points) and nclusters (the choice of k) assumed defined
kmeans = KMeans(n_clusters=nclusters)
kmeans.fit(data)

centroids = kmeans.cluster_centers_  # centroid coordinates
labels = kmeans.labels_              # cluster index of each point
K-Means: Limitations

• No guarantees about finding the "best" solution
• Each run can find a different solution
• No clear way to determine "k"
Silhouettes

• For each point $x_i$ define $a_c(x_i)$ as:

$a_c(x_i) = \frac{1}{N_c}\sum_{j \in c} \|x_i - x_j\|$

the average distance between point $x_i$ and every other point within cluster $c$.
• Let $b(x_i)$ be:

$b(x_i) = \min_{c \neq c_i} a_c(x_i)$

the minimum value of $a_c(x_i)$ excluding the point's own cluster $c_i$.
• The silhouette of $x_i$ is then:

$s(x_i) = \frac{b(x_i) - a_{c_i}(x_i)}{\max\{b(x_i),\, a_{c_i}(x_i)\}}$
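A direct translation for a single point of a tiny two-cluster dataset (a sketch: within the point's own cluster the point itself is excluded from the average, as in the standard definition):

```python
import numpy as np

clusters = {0: np.array([[0.0, 0.0], [0.0, 1.0]]),
            1: np.array([[10.0, 10.0], [10.0, 11.0]])}

def silhouette(x, own):
    # a: mean distance to the other points in x's own cluster
    a = np.mean([np.linalg.norm(x - p)
                 for p in clusters[own] if not np.array_equal(p, x)])
    # b: smallest mean distance to the points of any other cluster
    b = min(np.mean([np.linalg.norm(x - p) for p in pts])
            for c, pts in clusters.items() if c != own)
    return (b - a) / max(a, b)

s = silhouette(np.array([0.0, 0.0]), 0)
print(round(float(s), 2))   # 0.93: well inside its own cluster
```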
Silhouettes

http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
Expectation Maximization
• Iterative algorithm to learn parameter estimates in models with unobserved latent
variables
• Two steps for each iteration
• Expectation: Calculate the likelihood of the data given current parameter estimate
• Maximization: Find the parameter values that maximize the likelihood
• Stop when the relative variation of the parameter estimates is smaller than some value
Expectation Maximization
Nature BioTech 26, 897 (2008)
Expectation Maximization

# tA and tB hold the running estimates of the two coins' heads probabilities;
# experiments holds the observed (heads, tails) counts of each experiment
while improvement > delta:
    expectation_A = np.zeros((5, 2), dtype=float)
    expectation_B = np.zeros((5, 2), dtype=float)

    for i in range(len(experiments)):
        e = experiments[i]  # i'th experiment
        ll_A = get_mn_likelihood(e, np.array([tA[-1], 1 - tA[-1]]))
        ll_B = get_mn_likelihood(e, np.array([tB[-1], 1 - tB[-1]]))

        # E-step: responsibility of each coin for this experiment
        weightA = ll_A / (ll_A + ll_B)
        weightB = ll_B / (ll_A + ll_B)

        expectation_A[i] = np.dot(weightA, e)
        expectation_B[i] = np.dot(weightB, e)

    # M-step: new estimates from the weighted counts
    tA.append(sum(expectation_A)[0] / sum(sum(expectation_A)))
    tB.append(sum(expectation_B)[0] / sum(sum(expectation_B)))

    improvement = max(abs(np.array([tA[-1], tB[-1]]) - np.array([tA[-2], tB[-2]])))

http://stats.stackexchange.com/questions/72774/numerical-example-to-understand-expectation-maximization
Gaussian Mixture Models

• One solution is to characterize each cluster as a Gaussian. In this case we want to find the set of parameters and mixtures that best reproduces the data.
• Given some data points $x_i$ we can calculate the prior:

$p(\theta) = \sum_i \pi_i N(\mu_i, \sigma_i)$

• which we can update using the data likelihood $p(x|\theta)$ to obtain the posterior:

$p(\theta|x) = \frac{p(x|\theta)\,p(\theta)}{p(x)}$

• which we can use to choose a new set of parameters and mixtures.
• Iterate using Expectation Maximization.
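A stripped-down sketch of this EM loop for a 1-D mixture of two components (simplifying assumptions, for illustration only: equal weights and known unit variances, so only the two means are learned):

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.concatenate([rng.normal(0, 1, 500), rng.normal(10, 1, 500)])

mu = np.array([1.0, 9.0])   # initial guesses for the two component means
for _ in range(50):
    # E-step: responsibility of each component for each point
    dens = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate the means from the responsibility-weighted points
    mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)

print(mu)   # close to the true means 0 and 10
```

Libraries such as scikit-learn's `GaussianMixture` implement the full version, learning weights and covariances as well.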
Conclusions

• Don't trust descriptive statistics too much; they can be misleading
• Get a feel for the data using visualizations, etc.
• Know the properties of the distributions you are using
• Know the assumptions (implicit and explicit) that you are making
• Be careful about how you aggregate data
• Use machine learning methods like k-means, Expectation Maximization, etc. to better understand and describe the data, but be sure you understand them as well
If it still doesn't make sense…