Probability Distributions for ML
Sung-Yub Kim
Dept of IE, Seoul National University
January 29, 2017

Outline
Introduction
Binary Variables
Multinomial Variables
The Gaussian Distribution
The Exponential Family
Nonparametric Methods
References
Bishop, C. M. Pattern Recognition and Machine Learning. Information Science and Statistics, Springer, 2006.
Murphy, K. P. Machine Learning: A Probabilistic Perspective. Adaptive Computation and Machine Learning, MIT Press, 2012.
Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. Computer Science and Intelligent Systems, MIT Press, 2016.
Purpose: Density Estimation
Assumption: data points are independent and identically distributed (i.i.d.).
Parametric and Nonparametric
Parametric estimation is more intuitive but rests on very strong assumptions.
Nonparametric estimation also has parameters, but they control model
complexity rather than the form of the distribution.
Bernoulli Distribution (Ber(θ))
The Bernoulli distribution has a single parameter θ, the success probability of a
trial. Its PMF is
Ber(x|θ) = θ^{I(x=1)} (1 − θ)^{I(x=0)}
Binomial Distribution (Bin(n, θ))
The binomial distribution has two parameters: n, the number of trials, and θ, the
success probability. Its PMF is
Bin(k|n, θ) = \binom{n}{k} θ^k (1 − θ)^{n−k}
Likelihood of Data
By the i.i.d. assumption, we get
p(D|µ) = \prod_{n=1}^{N} p(x_n|µ) = \prod_{n=1}^{N} µ^{x_n} (1 − µ)^{1−x_n}   (1)
Log-likelihood of Data
Taking the logarithm, we get
\ln p(D|µ) = \sum_{n=1}^{N} \ln p(x_n|µ) = \sum_{n=1}^{N} \{x_n \ln µ + (1 − x_n) \ln(1 − µ)\}   (2)
MLE
Since the maximizer is a stationary point, we get
µ_{ML} := \hat{µ} = \frac{1}{N} \sum_{n=1}^{N} x_n   (3)
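As a concrete illustration of (3), here is a minimal NumPy sketch; the function name and toy data are illustrative, not from the slides.

```python
import numpy as np

def bernoulli_mle(x):
    """MLE of the Bernoulli parameter µ: the fraction of ones in the data, eq. (3)."""
    x = np.asarray(x, dtype=float)
    return x.mean()

# Toy data: 10 binary trials with 7 successes.
data = np.array([1, 1, 0, 1, 1, 0, 1, 1, 0, 1])
print(bernoulli_mle(data))  # 0.7
```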
Prior Distribution
The weak point of MLE is that it can overfit the data. To overcome this
deficiency, we place a prior distribution on the parameter.
At the same time, the prior should have a simple interpretation and useful
analytical properties.
Conjugate Prior
A conjugate prior for a likelihood is a prior such that the prior and the
posterior belong to the same family, given that likelihood.
In this case, we need a prior proportional to powers of µ and (1 − µ).
Therefore, we choose the Beta distribution
Beta(µ|a, b) = \frac{Γ(a + b)}{Γ(a)Γ(b)} µ^{a−1} (1 − µ)^{b−1}   (4)
The Beta distribution has two parameters a, b, which count how many times each
class has occurred (the effective number of observations). We can also easily
verify that the posterior is again a Beta distribution.
Posterior Distribution
By some calculation,
p(µ|m, l, a, b) = \frac{Γ(m + l + a + b)}{Γ(m + a)Γ(l + b)} µ^{m+a−1} (1 − µ)^{l+b−1}   (5)
where m and l are the observed counts of the two classes.
Bayesian Inference
Now we can perform Bayesian inference on binary variables. We want to know
p(x = 1|D) = \int_0^1 p(x = 1|µ) p(µ|D) dµ = \int_0^1 µ p(µ|D) dµ = E[µ|D]   (6)
Therefore we get
p(x = 1|D) = \frac{m + a}{m + a + l + b}   (7)
If the observed counts m, l are sufficiently large, this estimate asymptotically
coincides with the MLE, and this behaviour is very general.
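A small sketch of the Beta–Bernoulli update (5) and the predictive probability (7); the function names are illustrative.

```python
import numpy as np

def beta_posterior(data, a=1.0, b=1.0):
    """Update a Beta(a, b) prior with binary observations, eq. (5)."""
    data = np.asarray(data)
    m = int(data.sum())        # number of ones
    l = int(len(data) - m)     # number of zeros
    return a + m, b + l

def predictive_prob(a_post, b_post):
    """Posterior predictive p(x = 1 | D), eq. (7)."""
    return a_post / (a_post + b_post)

a_n, b_n = beta_posterior([1, 0, 1, 1, 0, 1], a=2.0, b=2.0)
print(predictive_prob(a_n, b_n))  # (4 + 2) / (6 + 4) = 0.6
```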
Since
E_θ[θ] = E_D[E_θ[θ|D]]   (8)
we know that the posterior mean of θ, averaged over the distribution generating
the data, is equal to the prior mean of θ.
Also, since
Var_θ[θ] = E_D[Var_θ[θ|D]] + Var_D[E_θ[θ|D]]   (9)
we know that, on average, the posterior variance of θ is smaller than the prior
variance.
Multinomial Distribution (Mu(x|n, θ))
The multinomial distribution differs from the binomial in the dimension of the
output and of θ. In the binomial, k is the number of successes; in the
multinomial, each component of x counts how often the corresponding state
occurred. The binomial is therefore the multinomial with the dimension of x and θ
equal to 2.
Mu(x|n, θ) = \binom{n}{x_0, \ldots, x_{K−1}} \prod_{j=0}^{K−1} θ_j^{x_j}
Multinoulli Distribution (Mu(x|1, θ))
Sometimes we are interested in the special case of the multinomial with n = 1,
called the multinoulli (categorical) distribution:
Mu(x|1, θ) = \prod_{j=0}^{K−1} θ_j^{I(x_j = 1)}
Likelihood of Data
By the i.i.d. assumption, we get
p(D|µ) = \prod_{n=1}^{N} \prod_{k=1}^{K} µ_k^{x_{nk}} = \prod_{k=1}^{K} µ_k^{\sum_n x_{nk}} = \prod_{k=1}^{K} µ_k^{m_k}   (10)
where m_k = \sum_n x_{nk} (the sufficient statistics).
Log-likelihood of Data
Taking the logarithm, we get
\ln p(D|µ) = \sum_{k=1}^{K} m_k \ln µ_k   (11)
MLE
Therefore, we need to solve the following optimization problem for the MLE:
\max \{ \sum_{k=1}^{K} m_k \ln µ_k \mid \sum_{k=1}^{K} µ_k = 1 \}   (12)
MLE (cont.)
We already know that a stationary point of the Lagrangian is a necessary
condition for this constrained optimization problem. Therefore,
∇_µ L(µ; λ) = 0,  ∂L(µ; λ)/∂λ = 0   (13)
where
L(µ; λ) = \sum_{k=1}^{K} m_k \ln µ_k + λ(\sum_{k=1}^{K} µ_k − 1)   (14)
Therefore, we get
µ_k^{ML} = \frac{m_k}{N}   (15)
Dirichlet Distribution
By the same reasoning as for the Beta distribution, we obtain a conjugate prior
for the multinoulli:
Dir(µ|α) = \frac{Γ(α_0)}{Γ(α_1) \cdots Γ(α_K)} \prod_{k=1}^{K} µ_k^{α_k − 1}   (16)
where α_0 = \sum_k α_k.
Bayesian Inference
By the same argument as for the binomial, we obtain the posterior
p(µ|D, α) = Dir(µ|α + m) = \frac{Γ(α_0 + N)}{Γ(α_1 + m_1) \cdots Γ(α_K + m_K)} \prod_{k=1}^{K} µ_k^{α_k + m_k − 1}   (17)
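As a sketch (parameter names are illustrative), the Dirichlet posterior update of (17) is just pseudo-count addition:

```python
import numpy as np

def dirichlet_posterior(counts, alpha):
    """Posterior Dir(µ | α + m) of eq. (17): add observed counts to prior pseudo-counts."""
    return np.asarray(alpha, dtype=float) + np.asarray(counts, dtype=float)

def posterior_mean(alpha_post):
    """Posterior mean E[µ_k | D] = (α_k + m_k) / (α_0 + N)."""
    return alpha_post / alpha_post.sum()

alpha = [1.0, 1.0, 1.0]          # symmetric prior over K = 3 states
counts = [5, 2, 3]               # m_k observed in the data
print(posterior_mean(dirichlet_posterior(counts, alpha)))  # [0.46..., 0.23..., 0.30...]
```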
Univariate Gaussian Distribution (N(x|µ, σ²) = N(x|µ, β^{−1}))
N(x|µ, σ²) = \frac{1}{\sqrt{2πσ²}} \exp(−\frac{1}{2σ²}(x − µ)²)   (18)
N(x|µ, β^{−1}) = (\frac{β}{2π})^{1/2} \exp(−\frac{β}{2}(x − µ)²)   (19)
Multivariate Gaussian Distribution (N(x|µ, Σ) = N(x|µ, Λ^{−1}))
N(x|µ, Σ) = \frac{1}{(2π)^{D/2} \det(Σ)^{1/2}} \exp(−\frac{1}{2}(x − µ)^T Σ^{−1}(x − µ))   (20)
N(x|µ, Λ^{−1}) = \frac{\det(Λ)^{1/2}}{(2π)^{D/2}} \exp(−\frac{1}{2}(x − µ)^T Λ(x − µ))   (21)
where β = 1/σ² is the scalar precision and Λ = Σ^{−1} is the precision matrix.
Mahalanobis Distance
By the eigenvalue decomposition (EVD) of Σ, we get
∆² = (x − µ)^T Σ^{−1}(x − µ) = \sum_{i=1}^{D} \frac{y_i²}{λ_i}   (22)
where y_i = u_i^T(x − µ) and (λ_i, u_i) are the eigenpairs of Σ.
Change of Variable in Gaussian
From the above, we get
p(y) = p(x)|J_{y→x}| = \prod_{j=1}^{D} \frac{1}{(2πλ_j)^{1/2}} \exp\{−\frac{y_j²}{2λ_j}\}   (23)
which is a product of D independent univariate Gaussian distributions.
First and Second Moments of the Gaussian
Using the above, we get
E[x] = µ,  E[xx^T] = µµ^T + Σ   (24)
Limitations of Gaussian and Solutions
There are two main limitations of the Gaussian.
First, we have to infer many covariance parameters.
Second, we cannot represent multi-modal distributions. Therefore, we define some
auxiliary concepts.
Diagonal Covariance
Σ = diag(s²)   (25)
Isotropic Covariance
Σ = σ² I   (26)
Mixture Model
p(x) = \sum_{k=1}^{K} π_k p_k(x)   (27)
where p_k is the k-th component density (e.g. a Gaussian) and the π_k are mixing
coefficients summing to 1.
Partitions of the Mahalanobis Distance
First, partition the covariance matrix and the precision matrix:
Σ = \begin{pmatrix} Σ_{aa} & Σ_{ab} \\ Σ_{ba} & Σ_{bb} \end{pmatrix},  Σ^{−1} = Λ = \begin{pmatrix} Λ_{aa} & Λ_{ab} \\ Λ_{ba} & Λ_{bb} \end{pmatrix}   (28)
where the aa and bb blocks are symmetric and the ab and ba blocks are transposes
of each other.
Now, partition the Mahalanobis distance:
(x − µ)^T Σ^{−1}(x − µ) = (x_a − µ_a)^T Λ_{aa}(x_a − µ_a) + (x_a − µ_a)^T Λ_{ab}(x_b − µ_b)
+ (x_b − µ_b)^T Λ_{ba}(x_a − µ_a) + (x_b − µ_b)^T Λ_{bb}(x_b − µ_b)   (29)
Schur Complement
As in Gaussian elimination, we can perform block matrix inversion via the Schur
complement:
\begin{pmatrix} A & B \\ C & D \end{pmatrix}^{−1} = \begin{pmatrix} M & −MBD^{−1} \\ −D^{−1}CM & D^{−1} + D^{−1}CMBD^{−1} \end{pmatrix}   (30)
where M = (A − BD^{−1}C)^{−1}.
Schur Complement (cont.)
Therefore, we get
Λ_{aa} = (Σ_{aa} − Σ_{ab}Σ_{bb}^{−1}Σ_{ba})^{−1}   (31)
Λ_{ab} = −(Σ_{aa} − Σ_{ab}Σ_{bb}^{−1}Σ_{ba})^{−1} Σ_{ab}Σ_{bb}^{−1}   (32)
Conditional Distribution
Therefore, we get
x_a|x_b ∼ N(x|µ_{a|b}, Σ_{a|b})   (33)
where
µ_{a|b} = µ_a + Σ_{ab}Σ_{bb}^{−1}(x_b − µ_b)   (34)
Σ_{a|b} = Σ_{aa} − Σ_{ab}Σ_{bb}^{−1}Σ_{ba}   (35)
Marginal Distribution
Integrating out x_b, we obtain the marginal distribution of x_a:
\ln p(x_a) = −\frac{1}{2} x_a^T(Λ_{aa} − Λ_{ab}Λ_{bb}^{−1}Λ_{ba})x_a + x_a^T(Λ_{aa} − Λ_{ab}Λ_{bb}^{−1}Λ_{ba})µ_a + const   (36)
Therefore, we get
x_a ∼ N(x|µ_a, Σ_{aa})   (37)
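A small NumPy sketch of the conditional in (33)-(35); the block indices and function name are illustrative.

```python
import numpy as np

def gaussian_conditional(mu, Sigma, idx_a, idx_b, x_b):
    """Mean and covariance of x_a | x_b for a joint Gaussian, eqs. (34)-(35)."""
    mu = np.asarray(mu); Sigma = np.asarray(Sigma)
    mu_a, mu_b = mu[idx_a], mu[idx_b]
    S_aa = Sigma[np.ix_(idx_a, idx_a)]
    S_ab = Sigma[np.ix_(idx_a, idx_b)]
    S_bb = Sigma[np.ix_(idx_b, idx_b)]
    K = S_ab @ np.linalg.inv(S_bb)           # regression coefficient Σ_ab Σ_bb^{-1}
    mu_cond = mu_a + K @ (x_b - mu_b)        # eq. (34)
    Sigma_cond = S_aa - K @ S_ab.T           # eq. (35), using Σ_ba = Σ_ab^T
    return mu_cond, Sigma_cond

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
print(gaussian_conditional(mu, Sigma, [0], [1], np.array([2.0])))  # mean 0.8, cov 1.36
```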
Linear-Gaussian Model
Given a marginal Gaussian for x and a conditional Gaussian for y given x of the
form
x ∼ N(x|µ, Λ^{−1})   (38)
y|x ∼ N(y|Ax + b, L^{−1})   (39)
the marginal distribution of y and the conditional distribution of x given y are
y ∼ N(y|Aµ + b, L^{−1} + AΛ^{−1}A^T)   (40)
x|y ∼ N(x|Σ\{A^T L(y − b) + Λµ\}, Σ)   (41)
where
Σ = (Λ + A^T L A)^{−1}   (42)
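A minimal sketch of the marginal (40) and the posterior (41)-(42), with illustrative variable names.

```python
import numpy as np

def linear_gaussian_marginal_and_posterior(mu, Lambda, A, b, L, y):
    """Marginal of y and posterior of x | y for the linear-Gaussian model, eqs. (40)-(42)."""
    Lambda_inv = np.linalg.inv(Lambda)
    # Marginal of y, eq. (40)
    y_mean = A @ mu + b
    y_cov = np.linalg.inv(L) + A @ Lambda_inv @ A.T
    # Posterior of x given y, eqs. (41)-(42)
    Sigma = np.linalg.inv(Lambda + A.T @ L @ A)
    x_mean = Sigma @ (A.T @ L @ (y - b) + Lambda @ mu)
    return (y_mean, y_cov), (x_mean, Sigma)

mu = np.array([0.0]); Lambda = np.array([[1.0]])
A = np.array([[2.0]]); b = np.array([1.0]); L = np.array([[4.0]])
print(linear_gaussian_marginal_and_posterior(mu, Lambda, A, b, L, y=np.array([3.0])))
```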
Log-likelihood for Data
By the same argument as for categorical data, we get the log-likelihood for the
Gaussian:
\ln p(D|µ, Σ) = −\frac{ND}{2} \ln(2π) − \frac{N}{2} \ln|Σ| − \frac{1}{2} \sum_{n=1}^{N} (x_n − µ)^T Σ^{−1}(x_n − µ)   (43)
This log-likelihood depends on the data only through the quantities
\sum_{n=1}^{N} x_n,   \sum_{n=1}^{N} x_n x_n^T   (44)
which are called the sufficient statistics.
MLE for Gaussian
Since the MLE is a maximizer of the log-likelihood, we get
µ_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n   (45)
Σ_{ML} = \frac{1}{N} \sum_{n=1}^{N} (x_n − µ_{ML})(x_n − µ_{ML})^T   (46)
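A sketch of (45)-(46) in NumPy; the toy data are illustrative.

```python
import numpy as np

def gaussian_mle(X):
    """MLE of mean and covariance for a multivariate Gaussian, eqs. (45)-(46)."""
    X = np.asarray(X, dtype=float)
    N = X.shape[0]
    mu = X.mean(axis=0)                      # eq. (45)
    Xc = X - mu
    Sigma = (Xc.T @ Xc) / N                  # eq. (46); note the biased 1/N factor
    return mu, Sigma

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 2.0], [[1.0, 0.3], [0.3, 0.5]], size=1000)
print(gaussian_mle(X))
```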
Sequential Estimation
Since we have the MLE for the Gaussian in closed form, we can compute it
sequentially:
µ_{ML}^{(N)} = µ_{ML}^{(N−1)} + \frac{1}{N}(x_N − µ_{ML}^{(N−1)})   (47)
Robbins-Monro Algorithm
With the same intuition, we can generalize sequential learning. The Robbins-Monro
algorithm finds the root θ* such that f(θ*) = E[z|θ*] = 0. Its iteration can be
written as
θ^{(N)} = θ^{(N−1)} − a_{N−1} z(θ^{(N−1)})   (48)
where z(θ^{(N−1)}) is the observed value of z when θ takes the value θ^{(N−1)},
and {a_N} is a sequence satisfying
\lim_{N→∞} a_N = 0,   \sum_{N=1}^{∞} a_N = ∞,   \sum_{N=1}^{∞} a_N² < ∞   (49)
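A sketch of the sequential update (47); the streaming interface is illustrative.

```python
import numpy as np

def sequential_mean(stream):
    """Update the Gaussian mean one observation at a time, eq. (47)."""
    mu = 0.0
    for n, x in enumerate(stream, start=1):
        mu = mu + (x - mu) / n     # µ_N = µ_{N-1} + (x_N - µ_{N-1}) / N
    return mu

data = np.array([1.0, 3.0, 2.0, 4.0])
print(sequential_mean(data), data.mean())   # both 2.5
```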
Generalized Sequential Learning
We can apply the RM algorithm to sequential maximum-likelihood learning. In this
case, f(θ) is the gradient of the expected log-likelihood, so we take
z(θ) = −\frac{∂}{∂θ} \ln p(x|θ)   (50)
In the Gaussian case, we put a_N = σ²/N.
Bayesian Inference for the Mean Given the Variance
Since the Gaussian likelihood is the exponential of a quadratic form in µ, we can
choose a Gaussian prior. Therefore, if we choose the prior
µ ∼ N(µ|µ_0, σ_0²)   (51)
we get the posterior
µ|D ∼ N(µ|µ_N, σ_N²)   (52)
where
µ_N = \frac{σ²}{Nσ_0² + σ²} µ_0 + \frac{Nσ_0²}{Nσ_0² + σ²} µ_{ML},   \frac{1}{σ_N²} = \frac{1}{σ_0²} + \frac{N}{σ²}   (53)
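A minimal sketch of the posterior update (53) for the mean with known variance; the parameter names are illustrative.

```python
import numpy as np

def posterior_mean_known_variance(x, mu0, sigma0_sq, sigma_sq):
    """Posterior N(µ | µ_N, σ_N²) for the mean given the variance, eq. (53)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    mu_ml = x.mean()
    mu_n = (sigma_sq * mu0 + N * sigma0_sq * mu_ml) / (N * sigma0_sq + sigma_sq)
    sigma_n_sq = 1.0 / (1.0 / sigma0_sq + N / sigma_sq)
    return mu_n, sigma_n_sq

data = np.array([1.2, 0.8, 1.1, 0.9])
print(posterior_mean_known_variance(data, mu0=0.0, sigma0_sq=1.0, sigma_sq=0.25))
```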
Bayesian Inference for the Mean Given the Variance (cont.)
1. The posterior mean is a compromise between the prior mean and the MLE.
2. The posterior precision is the prior precision plus one contribution of the
data precision for each observed data point.
3. If we take σ_0² → ∞, the posterior mean reduces to the MLE.
Bayesian Inference for the Variance Given the Mean
The Gaussian likelihood, as a function of the precision λ, is proportional to the
product of a power of λ and the exponential of a linear function of λ, so we
choose a Gamma prior, defined by
Gam(λ|a_0, b_0) = \frac{1}{Γ(a_0)} b_0^{a_0} λ^{a_0−1} \exp(−b_0 λ)   (54)
Then we get the posterior
λ|D ∼ Gam(λ|a_N, b_N)   (55)
where
a_N = a_0 + \frac{N}{2},   b_N = b_0 + \frac{N}{2} σ_{ML}²   (56)
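A sketch of the Gamma posterior update (56); the function name and toy data are illustrative.

```python
import numpy as np

def gamma_posterior_precision(x, mu, a0, b0):
    """Posterior Gam(λ | a_N, b_N) for the precision with known mean, eq. (56)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    sigma_ml_sq = np.mean((x - mu) ** 2)     # ML variance around the known mean
    return a0 + N / 2.0, b0 + N / 2.0 * sigma_ml_sq

print(gamma_posterior_precision([1.1, 0.9, 1.3, 0.7], mu=1.0, a0=1.0, b0=1.0))  # (3.0, 1.1)
```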
Bayesian Inference for the Variance Given the Mean (cont.)
1. The parameter 2a_0 can be interpreted as the effective number of prior
observations.
2. The parameter b_0/a_0 can be interpreted as the effective prior value of the
variance.
Bayesian Inference with Unknown Mean and Variance
Applying the same argument to the mean and the precision jointly, we get the
prior
p(µ, λ) = N(µ|µ_0, (βλ)^{−1}) Gam(λ|a, b)   (57)
where
µ_0 = c/β,   a = 1 + β/2,   b = d − c²/2β   (58)
Note that the precision of µ is a linear function of λ.
For the multivariate case, we similarly get the prior
p(µ, Λ|µ_0, β, W, ν) = N(µ|µ_0, (βΛ)^{−1}) W(Λ|W, ν)   (59)
where W denotes the Wishart distribution.
Univariate t-distribution
If we integrate out the precision, given a Gamma prior on the precision, we
obtain the t-distribution:
St(x|µ, λ, ν) = \frac{Γ(ν/2 + 1/2)}{Γ(ν/2)} (\frac{λ}{πν})^{1/2} [1 + \frac{λ(x − µ)²}{ν}]^{−ν/2−1/2}   (60)
where ν = 2a (the degrees of freedom) and λ = a/b.
We can think of the t-distribution as an infinite mixture of Gaussians.
Since the t-distribution has fatter tails than the Gaussian, estimates based on
it are more robust to outliers.
Multivariate t-distribution
The multivariate infinite mixture of Gaussians gives the multivariate
t-distribution:
St(x|µ, Λ, ν) = \frac{Γ(ν/2 + D/2)}{Γ(ν/2)} \frac{|Λ|^{1/2}}{(πν)^{D/2}} [1 + \frac{∆²}{ν}]^{−ν/2−D/2}   (61)
where ∆² = (x − µ)^T Λ (x − µ).
The Exponential Family
The exponential family of distributions over x, given parameters η, is defined
to be the set of distributions of the form
p(x|η) = g(η) h(x) \exp\{η^T u(x)\}   (62)
where η are the natural parameters of the distribution and u(x) is a function
of x.
The function g(η) can be interpreted as the normalization factor.
Logistic Sigmoid
For the Bernoulli distribution, the usual parameter is µ, while the natural
parameter is η. The two are connected by
η = \ln(\frac{µ}{1 − µ}),   µ := σ(η) = \frac{\exp(η)}{1 + \exp(η)}   (63)
We call σ(η) the (logistic) sigmoid function.
Softmax Function
By the same argument for the multinoulli, the parameters and natural parameters
are related by the softmax function:
µ_k = \frac{\exp(η_k)}{\sum_{j=1}^{K} \exp(η_j)}   (64)
Note that in this case u(x) = x, h(x) = 1, and g(η) = (\sum_{j=1}^{K} \exp(η_j))^{−1}.
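A short sketch of (63) and (64); the max-subtraction in the softmax is a standard numerical-stability trick, not something from the slides.

```python
import numpy as np

def sigmoid(eta):
    """Logistic sigmoid σ(η), eq. (63)."""
    return 1.0 / (1.0 + np.exp(-eta))

def softmax(eta):
    """Softmax µ_k = exp(η_k) / Σ_j exp(η_j), eq. (64)."""
    eta = np.asarray(eta, dtype=float)
    z = np.exp(eta - eta.max())       # subtract the max for numerical stability
    return z / z.sum()

print(sigmoid(0.0))                   # 0.5
print(softmax([1.0, 2.0, 3.0]))       # components sum to 1
```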
Gaussian
The Gaussian can also be written in exponential-family form with
u(x) = (x, x²)^T   (65)
η = (µ/σ², −1/(2σ²))^T   (66)
g(η) = (−2η_2)^{1/2} \exp(\frac{η_1²}{4η_2})   (67)
Problem of Estimating the Natural Parameter
We can generalize the MLE argument to the exponential family.
First, consider the log-likelihood of the data:
\ln p(D|η) = \sum_{n=1}^{N} \ln h(x_n) + N \ln g(η) + η^T \sum_{n=1}^{N} u(x_n)   (68)
Next, we find a stationary point of the log-likelihood:
N ∇_η \ln g(η) + \sum_{n=1}^{N} u(x_n) = 0   (69)
Therefore, the MLE satisfies
−∇_η \ln g(η_{ML}) = \frac{1}{N} \sum_{n=1}^{N} u(x_n)   (70)
We see that the MLE depends on the data only through \sum_n u(x_n), which is
therefore called the sufficient statistic of the exponential family.
Conjugate Prior
For any member of the exponential family, there exists a conjugate prior that can
be written in the form
p(η|χ, ν) = f(χ, ν) g(η)^ν \exp\{ν η^T χ\}   (71)
where f(χ, ν) is a normalization factor and g(η) is the same function as in the
exponential family.
Posterior Distribution
If we choose this conjugate prior, we get
p(η|D, χ, ν) ∝ g(η)^{ν+N} \exp\{η^T(\sum_{n=1}^{N} u(x_n) + νχ)\}   (72)
Therefore, the parameter ν can be interpreted as the effective number of
pseudo-observations in the prior, each of which has the value χ for the
sufficient statistic u(x).
Noninformative Priors
We may seek a form of prior distribution, called a noninformative prior, which is
intended to have as little influence on the posterior distribution as possible.
Generalizations of Noninformative Priors
This idea leads to two generalizations, namely the principle of transformation
groups, as in the Jeffreys prior, and the principle of maximum entropy.
Histogram Technique
Standard histograms simply partition x into distinct bins of width ∆_i and then
count the number n_i of observations of x falling in bin i. In order to turn this
count into a normalized probability density, we divide by the total number N of
observations and by the width ∆_i of the bins, obtaining the probability value
for each bin:
p_i = \frac{n_i}{N ∆_i}   (73)
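A minimal sketch of (73) built on NumPy's histogram; the evaluation helper is illustrative.

```python
import numpy as np

def histogram_density(x, data, bins):
    """Histogram density estimate p_i = n_i / (N ∆_i), eq. (73), evaluated at x."""
    data = np.asarray(data, dtype=float)
    counts, edges = np.histogram(data, bins=bins)
    widths = np.diff(edges)
    density = counts / (len(data) * widths)          # eq. (73) for every bin
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, len(density) - 1)
    return density[idx]

rng = np.random.default_rng(0)
sample = rng.normal(size=1000)
print(histogram_density(0.0, sample, bins=20))        # roughly 1/sqrt(2π) ≈ 0.40
```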
Limitations of the Histogram
The estimated density has discontinuities that are due to the bin edges rather
than any property of the underlying distribution that generated the data.
The histogram approach also scales poorly with dimensionality.
Lessons of the Histogram
First, to estimate the probability density at a particular location, we should
consider the data points that lie within some local neighbourhood of that point.
Second, the value of the smoothing parameter should be neither too large nor too
small in order to obtain good results.
Motivation
For large N, the binomial distribution of the number K of data points falling
within a small region R will be sharply peaked around its mean, and so
K ≈ NP   (74)
If, however, we also assume that the region R is sufficiently small that the
probability density p(x) is roughly constant over the region, then we have
P ≈ p(x)V   (75)
where V is the volume of R. Therefore,
p(x) = \frac{K}{NV}   (76)
Note that these assumptions pull in opposite directions: R must be sufficiently
small that the density is approximately constant over the region, and yet
sufficiently large that the number K of points falling inside it is enough for
the binomial distribution to be sharply peaked.
Kernel Density Estimation (KDE)
If we fix V and determine K from the data, we obtain the kernel approach. For
instance, fix V to a unit hypercube and count data points with the function
k(u) = 1 if |u_i| ≤ 1/2 for all i = 1, · · · , D, and 0 otherwise   (77)
which is called a Parzen window. In this case, the number of points inside a cube
of side h centred on x is
K = \sum_{n=1}^{N} k(\frac{x − x_n}{h})   (78)
and this leads to the density estimate
p(x) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{h^D} k(\frac{x − x_n}{h})   (79)
We can also use other kernels, such as the Gaussian kernel, which gives
p(x) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2πh²)^{D/2}} \exp\{−\frac{||x − x_n||²}{2h²}\}   (80)
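A sketch of the Gaussian-kernel estimator (80); the vectorized helper and toy data are illustrative.

```python
import numpy as np

def gaussian_kde(x, data, h):
    """Gaussian kernel density estimate at query points x, eq. (80)."""
    x = np.atleast_2d(x)            # (M, D) query points
    data = np.atleast_2d(data)      # (N, D) training points
    N, D = data.shape
    # Squared distances between every query point and every data point.
    sq_dist = ((x[:, None, :] - data[None, :, :]) ** 2).sum(axis=-1)
    norm = (2.0 * np.pi * h ** 2) ** (D / 2.0)
    return np.exp(-sq_dist / (2.0 * h ** 2)).sum(axis=1) / (N * norm)

rng = np.random.default_rng(0)
sample = rng.normal(size=(500, 1))
print(gaussian_kde(np.array([[0.0]]), sample, h=0.3))   # close to 0.40 for a standard normal
```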
Limitation of KDE
One difficulty with the kernel approach to density estimation is that the
parameter h governing the kernel width is the same for all kernels. In regions of
high data density, a large value of h may lead to over-smoothing, while in
regions of low data density a small value of h may lead to noisy, overfitted
estimates. Thus the optimal choice for h may depend on the location within the
data space.
Nearest-Neighbour (NN) Methods
Therefore we instead fix K and use the data to find an appropriate V; this is the
K-nearest-neighbour (K-NN) method.
In this case, the value of K governs the degree of smoothing, and K must be
optimized as a hyper-parameter.
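A sketch of the K-NN density estimate p(x) = K/(NV) from eq. (76), where V is taken as the volume of the smallest ball around x containing K neighbours; the function name and data are illustrative.

```python
import numpy as np
from math import gamma, pi

def knn_density(x, data, K):
    """K-nearest-neighbour density estimate p(x) = K / (N V), eq. (76)."""
    data = np.atleast_2d(data)
    N, D = data.shape
    # Distance from x to every data point; the K-th smallest sets the ball radius.
    dist = np.linalg.norm(data - np.asarray(x).reshape(1, D), axis=1)
    r = np.sort(dist)[K - 1]
    volume = (pi ** (D / 2.0) / gamma(D / 2.0 + 1.0)) * r ** D   # volume of a D-ball
    return K / (N * volume)

rng = np.random.default_rng(0)
sample = rng.normal(size=(1000, 1))
print(knn_density([0.0], sample, K=30))    # roughly 0.40 for a standard normal
```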
Error of K-NN
Note that, for sufficiently large N, the error rate of the nearest-neighbour
(K = 1) classifier is never more than twice the minimum achievable error rate of
an optimal classifier.