Bruno Gonçalves
www.bgoncalves.com
Making Sense of Data Big and Small

Big Data

"The Unreasonable Effectiveness of Data"
Alon Halevy, Peter Norvig, and Fernando Pereira, Google

"Eugene Wigner's article 'The Unreasonable Effectiveness of Mathematics in the Natural Sciences' examines why so much of physics can be […] behavior. So, this corpus could serve as the basis of a complete model for certain tasks, if only we knew how to extract the model from the data."
From Data To Information

Statistics are numbers that summarize raw facts and figures in some meaningful way. They present key ideas that may not be immediately apparent by just looking at the raw data, and by data, we mean facts or figures from which we can draw conclusions. As an example, you don't have to wade through lots of football scores when all you want to know is the league position of your favorite team. You need a statistic to quickly give you the information you need.

The study of statistics covers where statistics come from, how to calculate them, and how you can use them effectively.

• Gather data: at the root of statistics is data. Data can be gathered by looking through existing sources, conducting experiments, or conducting surveys.
• Analyze: once you have data, you can analyze it and generate statistics. You can calculate probabilities to see how likely certain events are, test ideas, and indicate how confident you are about your results.
• Draw conclusions: when you've analyzed your data, you make decisions and predictions.
Data Science

The classic Venn diagram: Data Science sits at the intersection of three circles, Hacking, Statistics, and Domain Knowledge. Hacking plus Statistics gives Machine Learning; Statistics plus Domain Knowledge gives Traditional Research; and Hacking plus Domain Knowledge, without the Statistics, is the Danger Zone!
Data Science

(A tongue-in-cheek variant of the same Venn diagram, with circles for Data Nerds, Art Nerds, and Stats Nerds, and intersections labeled Data Mining, Visualization, GUI Programmers, and High Salaries.)
Data Science

"Data Scientist: The Sexiest Job of the 21st Century"
Meet the people who can coax treasure out of messy, unstructured data.
by Thomas H. Davenport and D.J. Patil, Harvard Business Review, October 2012

When Jonathan Goldman arrived for work in June 2006 at LinkedIn, the business networking site, the place still felt like a start-up. The company had just under 8 million accounts, and the number was growing quickly as existing members invited their friends and colleagues to join. But users weren't seeking out connections with the people who were already on the site at the rate executives had expected. Something was apparently missing in the social experience. As one LinkedIn manager put it, "It was like arriving at a conference reception and realizing you don't know anyone. So you just stand in the corner sipping your drink, and you probably leave early."
"Zero is the most natural number" (E. W. Dijkstra)

Count!
• How many items do we have?
Descriptive Statistics

• Min, Max

• Mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$

• Standard Deviation: $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$
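A minimal sketch of these statistics in NumPy (the sample array is invented for illustration):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

print(x.min(), x.max())   # smallest and largest values
mu = x.mean()             # mean: sum(x) / N
sigma = x.std()           # population standard deviation (ddof=0)
print(mu, sigma)          # 5.0 2.0
```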
Anscombe's Quartet

x1    y1       x2    y2       x3    y3       x4    y4
10.0  8.04     10.0  9.14     10.0  7.46     8.0   6.58
8.0   6.95     8.0   8.14     8.0   6.77     8.0   5.76
13.0  7.58     13.0  8.74     13.0  12.74    8.0   7.71
9.0   8.81     9.0   8.77     9.0   7.11     8.0   8.84
11.0  8.33     11.0  9.26     11.0  7.81     8.0   8.47
14.0  9.96     14.0  8.10     14.0  8.84     8.0   7.04
6.0   7.24     6.0   6.13     6.0   6.08     8.0   5.25
4.0   4.26     4.0   3.10     4.0   5.39     19.0  12.50
12.0  10.84    12.0  9.13     12.0  8.15     8.0   5.56
7.0   4.82     7.0   7.26     7.0   6.42     8.0   7.91
5.0   5.68     5.0   4.74     5.0   5.73     8.0   6.89

All four datasets share the same summary statistics: $\mu_x = 9$, $\sigma_x^2 = 11$, $\mu_y = 7.50$, $\sigma_y^2 \approx 4.125$, $\rho = 0.816$, and linear fit $y = 3 + 0.5x$.

(Figure: the four scatter plots, each drawn on axes $x \in [0, 20]$, $y \in [0, 13]$.)
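The shared statistics of the quartet are easy to verify; a quick check of the first dataset (the values are copied from the table above):

```python
import numpy as np

x1 = np.array([10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0])
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

print(x1.mean())                             # 9.0
print(round(float(y1.mean()), 2))            # 7.5
print(round(float(np.corrcoef(x1, y1)[0, 1]), 3))  # 0.816
slope, intercept = np.polyfit(x1, y1, 1)
print(slope, intercept)                      # close to 0.5 and 3.0
```

The other three datasets give the same numbers, which is exactly why plotting the data first matters.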
Central Limit Theorem

• As $n \to \infty$, the random variables $S_n = \frac{1}{n}\sum_i x_i$
• with mean $\mu$ and finite variance $\sigma^2$
• converge to a normal distribution: $\sqrt{n}\,(S_n - \mu) \to N(0, \sigma^2)$
• after some manipulations, we find: $S_n \sim \mu + \frac{N(0, \sigma^2)}{\sqrt{n}}$

The estimate of the mean converges to the true mean with the square root of the number of samples: $SE = \frac{\sigma}{\sqrt{n}}$
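A quick simulation of the $\sigma/\sqrt{n}$ scaling (uniform draws are an arbitrary choice; any distribution with finite variance behaves the same way):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1 / np.sqrt(12)   # std of a uniform(0, 1) variable

for n in (10, 100, 1000):
    # 5000 independent sample means of n draws each
    means = rng.random((5000, n)).mean(axis=1)
    print(n, means.std(), sigma / np.sqrt(n))   # observed vs predicted SE
```

The observed spread of the sample means tracks the predicted standard error at every n.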
Gaussian Distribution - Maximally Entropic

$P_N(x, \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
Broad-tailed distributions

$P_p(k, \gamma) = \frac{1}{C}\, k^{-\gamma}$
(Almost) Everyone is below average!

(For a broad-tailed distribution, roughly 80% of the values fall below the mean $\mu$ and only 20% above.)
Outliers

• "Bill Gates walks into a bar and on average every patron is a millionaire…"
• …but the median remains the same.

Example: 1 1 1 2 2 2 1000 → Mean = 144.14, Median = 2

Median: "the value that separates the lower 50% of the distribution from the higher 50%"
Quantiles
• Quantiles - Points taken at regular intervals of the cumulative distribution function
• Quartiles - Ranked set of points that divide the range in 4 equal intervals (25%, 50%,
75% quantiles)
Box and whiskers plot

(Figure: number of cases outside the target country vs. time in days, on a logarithmic scale.)

• Shows the variation of the data within each bin.
• More informative than just averages or medians.
• Useful to summarize experimental measurements, simulation results, natural variations, etc. when fluctuations are important.
Reference Range

(Figure: simulated incidence curves for Spain, September through January, showing the median and the 95% reference range between the 2.5% and 97.5% percentiles; panels compare no intervention against ~4 weeks of antiviral treatment.)

• Useful for continuous curves
• Indicates level of certainty: "95% of the cases are in this range"
Tools For Statistical Analysis

Name               Advantages                              Disadvantages                              Open Source
R                  Library support and Visualization       Steep learning curve                       Yes
Matlab             Native matrix support, Visualization    Expensive, incomplete statistics support   No
Scientific Python  Ease and Simplicity                     Heavy development                          Yes
Excel              Easy, Visual, Flexible                  Large datasets                             No
SAS                Large Datasets                          Expensive, outdated programming language   No
Stata              Easy Statistical Analysis                                                          No
SPSS               Like Stata, but more expensive and less flexible
Correlations
Pearson (Linear) Correlation Coefficient

• Does increasing one variable also increase the other?

$\rho = \frac{1}{N}\,\frac{\sum_{i=1}^{N}(x_i - \mu_X)(y_i - \mu_Y)}{\sigma_X\,\sigma_Y}$
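In practice, `np.corrcoef` computes exactly this coefficient (the data here is synthetic, for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = 2 * x + 1                        # exactly linear relation
noisy = 2 * x + rng.normal(size=1000)

print(np.corrcoef(x, y)[0, 1])       # essentially 1 for an exact linear relation
print(np.corrcoef(x, noisy)[0, 1])   # strong but imperfect correlation
```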
$R^2$

• The square of the Pearson correlation between the data and the fit.
• The amount of variance of the data that is explained by the "model".
Spearman Rank Correlation

• Equivalent to the Pearson correlation coefficient of the ranked variables
• $d_i^2$ is the squared difference in ranks of point $i$
• Less sensitive to outliers, as values are limited by rank

$\rho = 1 - \frac{6\sum_{i=1}^{N} d_i^2}{N(N^2 - 1)}$
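The "Pearson of the ranks" equivalence can be seen directly (a small sketch; the `rank` helper ignores ties for simplicity):

```python
import numpy as np

def rank(v):
    # Rank of each element (0 = smallest); ties are ignored in this sketch
    r = np.empty(len(v), dtype=float)
    r[np.argsort(v)] = np.arange(len(v))
    return r

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3   # monotonic but non-linear

print(np.corrcoef(x, y)[0, 1])               # Pearson < 1 (non-linearity)
print(np.corrcoef(rank(x), rank(y))[0, 1])   # Spearman = 1 (perfectly monotonic)
```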
Causation
Probability

The total probability of the whole sample space is $p = 1$. For two events A and B (picture two overlapping regions inside the space):

• $P(A)$ = Area of A
• $P(A \text{ or } B) = P(A) + P(B)$ (when A and B do not overlap)
• $P(A \text{ and } B)$ = overlap of A and B
• $P(B|A) = \frac{P(A \text{ and } B)}{P(A)}$
• $P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$
Bayes Theorem

$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$

with, by total probability:

$P(B) = P(B|A)\,P(A) + P(B|\neg A)\,P(\neg A)$
Medical Tests

Your doctor thinks you might have a rare disease that affects 1 person in 10,000. A test that is 99% accurate comes out positive. What's the probability of you having the disease?

Bayes Theorem:

$P(\text{disease}|\text{positive test}) = \frac{P(\text{positive test}|\text{disease})\,P(\text{disease})}{P(\text{positive test})}$

Total Probability:

$P(\text{positive test}) = P(\text{positive test}|\text{disease})\,P(\text{disease}) + P(\text{positive test}|\text{no disease})\,P(\text{no disease})$

Finally:

$P(\text{disease}|\text{positive test}) = 0.0098$
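The arithmetic, spelled out ("99% accurate" is read here as both 99% sensitivity and a 1% false positive rate, as in the slide):

```python
p_disease = 1 / 10000          # base rate
p_pos_given_disease = 0.99     # sensitivity
p_pos_given_healthy = 0.01     # false positive rate

# Law of total probability
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes theorem
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 4))   # 0.0098
```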
Medical Tests

Base Rate Fallacy: a low base rate value combined with a non-zero false positive rate means that most positive results are false positives.
Medical Tests

Consider a population of 1,000,000 individuals. The numbers we should expect are:

             disease    no disease
positive          99         9,999       10,098
negative           1       989,901      989,902
                 100       999,900    1,000,000
The row and column totals are the marginals. Reading the answer directly off the table:

$P(\text{disease}|\text{positive test}) = \frac{TP}{TP + FP} = \frac{99}{10{,}098} = 0.0098$

$P(\text{no disease}|\text{negative test}) = \frac{TN}{TN + FN} = \frac{989{,}901}{989{,}902} = 0.99999$
(Confusion Matrix)

                    Feature
Test          positive    negative
positive         TP          FP
negative         FN          TN

accuracy = $\frac{TP + TN}{TP + TN + FP + FN}$
precision = $\frac{TP}{TP + FP}$
sensitivity = $\frac{TP}{TP + FN}$
specificity = $\frac{TN}{FP + TN}$
harmonic mean $F_1 = \frac{2TP}{2TP + FP + FN}$
A second Test

Bayes Theorem still looks the same:

$P(\text{disease}|\text{positive test}) = \frac{P(\text{positive test}|\text{disease})\,P(\text{disease})}{P(\text{positive test})}$

but now the probability that we have the disease has been updated:

$P^{\dagger}(\text{disease}) = 0.0098$

So this time we find:

$P^{\dagger}(\text{disease}|\text{positive test}) = 0.4949$

Each test provides new evidence, and Bayes theorem simply tells us how to use it to update our beliefs.
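The sequential updating can be wrapped in a single function (`update` is a hypothetical helper, not part of any library; note that carrying full precision gives 0.495, while the 0.4949 above comes from re-using the rounded prior 0.0098):

```python
def update(prior, sensitivity=0.99, false_pos=0.01):
    # One application of Bayes theorem after a positive test result
    evidence = sensitivity * prior + false_pos * (1 - prior)
    return sensitivity * prior / evidence

p = 1 / 10000
p = update(p)       # after the first positive test
print(round(p, 4))  # 0.0098
p = update(p)       # after a second positive test
print(round(p, 3))  # 0.495
```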
Bayesian Coin Flips

• Biased coin with unknown probability of heads (p)
• Perform N flips and update our belief after each flip using Bayes Theorem:

$P(p|\text{heads}) = \frac{P(\text{heads}|p)\,P(p)}{P(\text{heads})}$

$P(p|\text{tails}) = \frac{P(\text{tails}|p)\,P(p)}{P(\text{tails})}$

import numpy as np

# bins (grid resolution) and flips (the observed coin flips) assumed defined

# Uninformative (uniform) prior over a discrete grid of p values
prior = np.ones(bins, dtype='float') / bins

# Likelihood of heads for each candidate value of p
likelihood_heads = np.arange(bins) / float(bins)
likelihood_tails = 1 - likelihood_heads

for coin in flips:
    if coin:  # Heads
        posterior = prior * likelihood_heads
    else:  # Tails
        posterior = prior * likelihood_tails

    # Normalize
    posterior /= np.sum(posterior)

    # The posterior is now the new prior
    prior = posterior

http://youtu.be/GTx0D8VY0CY
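A self-contained run of the same loop on simulated flips (the true bias 0.7, grid size, and number of flips are arbitrary choices for this demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
bins = 100
flips = rng.random(1000) < 0.7   # simulated biased coin, true p = 0.7

prior = np.ones(bins) / bins                 # uniform prior over p
likelihood_heads = np.arange(bins) / bins
likelihood_tails = 1 - likelihood_heads

for coin in flips:
    posterior = prior * (likelihood_heads if coin else likelihood_tails)
    prior = posterior / posterior.sum()      # normalize; posterior becomes prior

p_map = np.argmax(prior) / bins              # most probable bias on the grid
print(p_map)                                 # close to the true 0.7
```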
Naive Bayes Classifier

• Let's consider spam detection for a second. Suppose you know $P(\text{spam}|\text{word}_i)$ and $P(\text{not spam}|\text{word}_i)$:
• You know the probability that a specific word is used in a spam email. But how can you determine the probability $P(\text{spam}|\text{word}_1, \text{word}_2, \cdots, \text{word}_n)$ that an email (set of words) is spam?
• You can simply assume that all the probabilities are independent:

$P(\text{spam}|\text{word}_1, \text{word}_2, \cdots, \text{word}_n) = \prod_i P(\text{spam}|\text{word}_i)$

• This is known as Naive Bayes and is surprisingly effective in many real world contexts.
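A toy sketch of this product rule (the per-word probabilities are made up; the spam and not-spam products are normalized against each other so the score stays a probability):

```python
import numpy as np

# Hypothetical per-word spam probabilities, for illustration only
p_spam_given_word = {"free": 0.8, "money": 0.7, "meeting": 0.2}

def spam_score(words):
    # Naive assumption: treat the per-word probabilities as independent
    spam = np.prod([p_spam_given_word[w] for w in words])
    ham = np.prod([1 - p_spam_given_word[w] for w in words])
    return spam / (spam + ham)   # normalize the two hypotheses

print(round(float(spam_score(["free", "money"])), 2))   # 0.9
print(round(float(spam_score(["meeting"])), 2))         # 0.2
```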
Maximum Likelihood Estimation

• Given a distribution, $P(x)$, how likely are we to see a given set of data points, $x_i$?
• The probability of each point is, simply: $P(x_i)$
• So the probability of a given realization is: $\prod_i P(x_i)$
• For mathematical convenience, we define the log-likelihood as: $\mathcal{L} = \sum_i \log[P(x_i)]$
• The set of parameters that maximizes $\mathcal{L}$ characterizes the distribution most likely to have generated the data.
MLE Coin Flips

• Biased coin with unknown probability of heads (p)
• In a sequence of N flips, the likelihood of $N_h$ heads and $N_t = N - N_h$ tails is (ignoring the combinatorial factor):

$\mathcal{L} = \log\left[p^{N_h}(1-p)^{N-N_h}\right]$

• or simply:

$\mathcal{L} = N_h \log[p] + (N - N_h)\log[1-p]$

• Taking the derivative:

$\frac{\partial \mathcal{L}}{\partial p} = \frac{N_h}{p} - \frac{N - N_h}{1-p}$

• Setting to zero and solving for p:

$p = \frac{N_h}{N}$
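A numerical sanity check: evaluating the log-likelihood on a grid of p values and taking the argmax recovers $N_h/N$ (7 heads in 10 flips is an arbitrary example):

```python
import numpy as np

N, Nh = 10, 7   # 7 heads in 10 flips
p_grid = np.linspace(0.001, 0.999, 999)

# Log-likelihood from the slide (combinatorial factor omitted)
L = Nh * np.log(p_grid) + (N - Nh) * np.log(1 - p_grid)

p_hat = p_grid[np.argmax(L)]
print(p_hat)   # 0.7 = Nh / N
```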
Binomial Distribution

• The probability of getting k successes in n trials of probability p (k heads in n coin flips):

$P_B(k, n, p) = \frac{n!}{k!\,(n-k)!}\, p^k (1-p)^{n-k}$

• The mean value is: $\mu = np$
• and the variance: $\sigma^2 = np(1-p)$
• and for sufficiently large n: $P_B(k, n, p) \sim P_N(np,\, np(1-p))$
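The mean and variance formulas are easy to confirm by simulation (n = 20 and p = 0.3 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 0.3
k = rng.binomial(n, p, size=200_000)

print(k.mean(), n * p)            # both close to 6.0
print(k.var(), n * p * (1 - p))   # both close to 4.2
```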
(Beta Distribution)

• Related to the Binomial and has a very similar form:

$P(x, \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, x^{\alpha-1}(1-x)^{\beta-1}$

• with $x \in [0, 1]$ and $\alpha, \beta > 0$.
• $\Gamma(a)$ is the continuous extension of the factorial $a!$
• The mean is: $\mu = \frac{\alpha}{\alpha + \beta}$
• And the variance: $\sigma^2 = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}$
A/B Testing

• Divide users into two groups, A and B
• Measure some metric for each group (conversion probability, for example): $p_A$, $p_B$
A/B Testing

• If conversion is a binomial process, then:
• Standard Error: $SE = \sqrt{\frac{p(1-p)}{N}}$
• Z score: $Z = \frac{p_A - p_B}{\sqrt{SE_A^2 + SE_B^2}}$
p-value

• Calculate the probability of an event more extreme than the observation under the "null hypothesis"
• The smaller the p-value the better:
  • p < 0.05: moderate evidence against the null hypothesis
  • p < 0.01: strong evidence against the null hypothesis
  • p < 0.001: very strong evidence against the null hypothesis
Berkeley Discrimination Case Part I

           Candidates    Acceptance Rate    SE
Men             8442          0.44          5.4x10-3
Women           4321          0.35          7.2x10-3

$Z = \frac{p_A - p_B}{\sqrt{SE_A^2 + SE_B^2}} = 9.9 \quad\Rightarrow\quad p \approx 10^{-23}$
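Reproducing the table's numbers from the SE and Z formulas above:

```python
import numpy as np

p_men, n_men = 0.44, 8442
p_women, n_women = 0.35, 4321

se_men = np.sqrt(p_men * (1 - p_men) / n_men)
se_women = np.sqrt(p_women * (1 - p_women) / n_women)

z = (p_men - p_women) / np.sqrt(se_men**2 + se_women**2)
print(round(float(se_men), 4), round(float(se_women), 4))   # 0.0054 0.0073
print(round(float(z), 1))                                   # 9.9
```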
p-value
“Statistical significance does not imply scientific significance”
(Bonferroni Correction)

• You can think of p as the probability of observing a result as extreme by chance. With n comparisons, this probability becomes:

$p_n = 1 - (1 - p)^n$

which quickly goes to 1 as n increases.
• However, by replacing p by p/n for each individual comparison, we obtain:

$p_n = 1 - \left(1 - \frac{p}{n}\right)^n$

• and for sufficiently large n:

$p_n \approx 1 - e^{-p} \approx p$

• allowing us to keep the probability of false results arbitrarily low even with arbitrarily large numbers of comparisons.
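The effect in numbers, for 100 comparisons at the 0.05 level:

```python
n, p = 100, 0.05

naive = 1 - (1 - p) ** n           # chance of at least one false positive
corrected = 1 - (1 - p / n) ** n   # after the Bonferroni correction

print(round(naive, 2))       # 0.99: a false positive is almost certain
print(round(corrected, 3))   # back below the 0.05 threshold
```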
Simpson's Paradox

Berkeley Discrimination Case Part II: The statisticians strike back. (Science 187, 398 (1975))

           Candidates    Acceptance Rate
Men             8442          0.44
Women           4321          0.35

By department:

Dept      Men: Candidates    Acceptance    Women: Candidates    Acceptance
A                825            0.62              108              0.82
B                560            0.63               25              0.68
C                325            0.37              594              0.34
D                417            0.33              375              0.35
E                191            0.28              393              0.24
F                272            0.06              341              0.07
Total           2590            0.46             1835              0.30
"aggregated data can appear to reverse important trends in the numbers being combined" (WSJ, Dec 2, 2009)
MLE - Fitting a theoretical function to experimental data (Least Squares Fitting)

• In an experimental measurement, we expect (CLT) the experimental values to be normally distributed around the theoretical value with a certain variance. Mathematically, this means:

$P(y|f(x)) \approx \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(y - f(x))^2}{2\sigma^2}\right]$

• where $y$ are the experimental values and $f(x)$ the theoretical ones. The likelihood is then:

$\mathcal{L} = -\frac{N}{2}\log\left[2\pi\sigma^2\right] - \sum_i \left[\frac{(y_i - f(x_i))^2}{2\sigma^2}\right]$

• where we see that to maximize the likelihood we must minimize the sum of squares.
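A least-squares fit in one call, on synthetic data drawn around a known line (true slope 0.5 and intercept 3, matching the Anscombe fit, are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 3 + 0.5 * x + rng.normal(0, 0.5, size=100)   # noisy line

slope, intercept = np.polyfit(x, y, 1)   # least-squares fit
print(slope, intercept)                  # close to 0.5 and 3
```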
MLE - Fitting a power-law to experimental data

• We often find what look like power-law distributions in empirical data:

$P(k) = \frac{\gamma - 1}{k_{min}} \left(\frac{k}{k_{min}}\right)^{-\gamma}$

and we would like to find the right parameter values.

• The likelihood of any set of points is:

$\mathcal{L} = \sum_i \log\left[\frac{\gamma - 1}{k_{min}} \left(\frac{k_i}{k_{min}}\right)^{-\gamma}\right]$

• And maximizing, we find:

$\gamma = 1 + n\left[\sum_i \log\left(\frac{k_i}{k_{min}}\right)\right]^{-1}$

• with a standard error of:

$SE = \frac{\gamma - 1}{\sqrt{n}}$

SIAM Rev. 51, 661 (2009)
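The estimator can be checked end-to-end by sampling a continuous power law with inverse-transform sampling and recovering its exponent (γ = 2.5 and k_min = 1 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma_true, k_min, n = 2.5, 1.0, 50_000

# Inverse-transform sampling of a continuous power law
u = rng.random(n)
k = k_min * (1 - u) ** (-1 / (gamma_true - 1))

gamma_hat = 1 + n / np.sum(np.log(k / k_min))
se = (gamma_hat - 1) / np.sqrt(n)
print(gamma_hat, se)   # close to 2.5, with a small standard error
```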
Clustering

K-Means

• Choose k randomly chosen points to be the centroid of each cluster
• Assign each point to the cluster whose centroid is closest
• Recompute the centroid positions (mean cluster position)
• Repeat until convergence
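The four steps above in plain NumPy (a toy sketch: two obvious clusters, and a deterministic starting choice of centroids instead of random initialization, for reproducibility):

```python
import numpy as np

points = np.array([[0, 0], [0, 1], [1, 0],
                   [10, 10], [10, 11], [11, 10]], dtype=float)
k = 2
centroids = points[[0, 3]].copy()   # deterministic start for reproducibility

for _ in range(10):
    # Assign each point to the closest centroid
    d = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
    labels = d.argmin(axis=1)
    # Recompute each centroid as the mean position of its cluster
    centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])

print(centroids)   # one centroid per obvious cluster
```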
K-Means: Structure

The cluster assignments partition the space into a Voronoi tessellation around the centroids.
K-Means: Convergence
• How to quantify the “quality” of the solution found at each iteration, ?
• Measure the “Inertia”, the square intra-cluster distance:







where are the coordinates of the centroid of the cluster to which is assigned.
• Smaller values are better
• Can stop when the relative variation is smaller than some value
µi xi
In+1 In
In
< tol
In =
NX
i=0
kxi µik
2
n
K-Means: sklearn

from sklearn.cluster import KMeans

# data (the points) and nclusters (the choice of k) assumed defined
kmeans = KMeans(n_clusters=nclusters)
kmeans.fit(data)

centroids = kmeans.cluster_centers_  # centroid coordinates
labels = kmeans.labels_              # cluster index of each point
K-Means: Limitations

• No guarantees about finding the "best" solution
• Each run can find a different solution
• No clear way to determine "k"
Silhouettes

• For each point $x_i$ define $a_c(x_i)$ as:

$a_c(x_i) = \frac{1}{N_c}\sum_{j \in c} \|x_i - x_j\|$

the average distance between point $x_i$ and every other point within cluster $c$.
• Let $b(x_i)$ be:

$b(x_i) = \min_{c \neq c_i} a_c(x_i)$

the minimum value of $a_c(x_i)$ excluding the point's own cluster $c_i$.
• The silhouette of $x_i$ is then:

$s(x_i) = \frac{b(x_i) - a_{c_i}(x_i)}{\max\{b(x_i),\, a_{c_i}(x_i)\}}$
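A direct translation for a single point of a tiny two-cluster dataset (a sketch: within the point's own cluster the point itself is excluded from the average, as in the standard definition):

```python
import numpy as np

clusters = {0: np.array([[0.0, 0.0], [0.0, 1.0]]),
            1: np.array([[10.0, 10.0], [10.0, 11.0]])}

def silhouette(x, own):
    # a: mean distance to the other points in x's own cluster
    a = np.mean([np.linalg.norm(x - p)
                 for p in clusters[own] if not np.array_equal(p, x)])
    # b: smallest mean distance to the points of any other cluster
    b = min(np.mean([np.linalg.norm(x - p) for p in pts])
            for c, pts in clusters.items() if c != own)
    return (b - a) / max(a, b)

s = silhouette(np.array([0.0, 0.0]), 0)
print(round(float(s), 2))   # 0.93: well inside its own cluster
```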
Silhouettes

http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
Expectation Maximization
• Iterative algorithm to learn parameter estimates in models with unobserved latent
variables
• Two steps for each iteration
• Expectation: Calculate the likelihood of the data given current parameter estimate
• Maximization: Find the parameter values that maximize the likelihood
• Stop when the relative variation of the parameter estimates is smaller than some value
Expectation Maximization
Nature BioTech 26, 897 (2008)
Expectation Maximization

# tA and tB hold the running estimates of the two coins' heads probabilities;
# experiments holds the observed (heads, tails) counts of each experiment
while improvement > delta:
    expectation_A = np.zeros((5, 2), dtype=float)
    expectation_B = np.zeros((5, 2), dtype=float)

    for i in range(len(experiments)):
        e = experiments[i]  # i'th experiment
        ll_A = get_mn_likelihood(e, np.array([tA[-1], 1 - tA[-1]]))
        ll_B = get_mn_likelihood(e, np.array([tB[-1], 1 - tB[-1]]))

        # E-step: responsibility of each coin for this experiment
        weightA = ll_A / (ll_A + ll_B)
        weightB = ll_B / (ll_A + ll_B)

        expectation_A[i] = np.dot(weightA, e)
        expectation_B[i] = np.dot(weightB, e)

    # M-step: new estimates from the weighted counts
    tA.append(sum(expectation_A)[0] / sum(sum(expectation_A)))
    tB.append(sum(expectation_B)[0] / sum(sum(expectation_B)))

    improvement = max(abs(np.array([tA[-1], tB[-1]]) - np.array([tA[-2], tB[-2]])))

http://stats.stackexchange.com/questions/72774/numerical-example-to-understand-expectation-maximization
Gaussian Mixture Models

• One solution is to characterize each cluster as a Gaussian. In this case we want to find the set of parameters and mixtures that best reproduces the data.
• Given some data points $x_i$ we can calculate the prior:

$p(\theta) = \sum_i \pi_i N(\mu_i, \sigma_i)$

• which we can update using the data likelihood $p(x|\theta)$ to obtain the posterior:

$p(\theta|x) = \frac{p(x|\theta)\,p(\theta)}{p(x)}$

• which we can use to choose a new set of parameters and mixtures.
• Iterate using Expectation Maximization.
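A stripped-down sketch of this EM loop for a 1-D mixture of two components (simplifying assumptions, for illustration only: equal weights and known unit variances, so only the two means are learned):

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.concatenate([rng.normal(0, 1, 500), rng.normal(10, 1, 500)])

mu = np.array([1.0, 9.0])   # initial guesses for the two component means
for _ in range(50):
    # E-step: responsibility of each component for each point
    dens = np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate the means from the responsibility-weighted points
    mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)

print(mu)   # close to the true means 0 and 10
```

Libraries such as scikit-learn's `GaussianMixture` implement the full version, learning weights and covariances as well.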
Conclusions

• Don't trust descriptive statistics too much; they can be misleading
• Get a feel for the data using visualizations, etc.
• Know the properties of the distributions you are using
• Know the assumptions (implicit and explicit) that you are making
• Be careful about how you aggregate data
• Use machine learning methods like k-means, Expectation Maximization, etc. to better understand and describe the data, but be sure you understand them as well
If it still doesn't make sense…