Topic Model (≈ ½ Text Mining)
Yueshen Xu
xyshzjucs@zju.edu.cn
Middleware, CCNT, ZJU
6/11/2014
Text Mining & NLP & ML
Outline
 Basic Concepts
 Application and Background
 Famous Researchers
 Language Model
 Vector Space Model (VSM)
 Term Frequency-Inverse Document Frequency (TF-IDF)
 Latent Semantic Indexing (LSI) / Latent Semantic Analysis (LSA)
 Probabilistic Latent Semantic Indexing (pLSA)
 Expectation-Maximization Algorithm (EM) & Maximum-
Likelihood Estimation (MLE)
Outline
 Latent Dirichlet Allocation (LDA)
 Conjugate Prior
 Poisson Distribution
 Variational Distribution and Variational Inference (VD & VI)
 Markov Chain Monte Carlo (MCMC)
 Metropolis-Hastings Sampling (MH)
 Gibbs Sampling and GS for LDA
 Bayesian Theory vs. Probability Theory
Concepts
 Latent Semantic Analysis
 Topic Model
 Text Mining
 Natural Language Processing
 Computational Linguistics
 Information Retrieval
 Dimension Reduction
 Expectation-Maximization(EM)
[Figure: overlapping fields (Information Retrieval, Computational Linguistics, Natural Language Processing, Text Mining, Data Mining, Machine Learning and EM, Dimension Reduction, Machine Translation), with LSA/Topic Model at their intersection]
Aim: find the topic that a word or a document belongs to (Latent Factor Model)
Application
 LFM has become a fundamental technique in modern search engines, recommender systems, tag extraction, blog clustering, Twitter topic mining, news (text) summarization, etc.
 Search Engine
 PageRank How important….this web page?
 LFM How relevance….this web page?
 LFM How relevance…the user’s query
vs. one document?
 Recommender System
 Opinion Extraction
 Spam Detection
 Tag Extraction
 Text Summarization
 Abstract Generation
 Twitter Topic Mining
Text: Steve Jobs has left us for about two years ... Apple's price will fall ...
Famous Researchers
David Blei, Princeton: LDA
Chengxiang Zhai, UIUC: Presidential Early Career Award
W. Bruce Croft, UMass Amherst: Language Model
Bing Liu, UIC: Opinion Mining
John D. Lafferty, CMU: CRF & IBM
Thomas Hofmann, Brown: pLSA
Andrew McCallum, UMass Amherst: CRF & IBM
Susan Dumais, Microsoft: LSI
Language Model
 Unigram Language Model == Zero-order Markov Chain
 Bigram Language Model == First-order Markov Chain
 N-gram Language Model == (N-1)-order Markov Chain
 Mixture-unigram Language Model
 Unigram: $p(\mathbf{w}|M) = \prod_{w_i \in s} p(w_i|M)$
 Bag of Words (BoW): no order, no grammar, only multiplicity
 Bigram: $p(\mathbf{w}|M) = \prod_{w_i \in s} p(w_i|w_{i-1}, M)$
 Mixture of unigrams: $p(\mathbf{w}) = \sum_{z} p(z)\prod_{n=1}^{N} p(w_n|z)$
[Plate diagrams of the unigram model (w; plates N, M) and the mixture-of-unigrams model (z → w; plates N, M)]
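As a concrete illustration (not part of the original slides), a minimal Python sketch of how the unigram and bigram probabilities above would be estimated from counts; the toy corpus is made up:

```python
from collections import Counter

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]  # toy corpus, two "documents"

# Unigram MLE: p(w|M) = count(w) / total tokens
unigram_counts = Counter(w for doc in corpus for w in doc)
total = sum(unigram_counts.values())
p_unigram = {w: c / total for w, c in unigram_counts.items()}

# Bigram MLE: p(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})
bigram_counts = Counter((doc[i - 1], doc[i]) for doc in corpus for i in range(1, len(doc)))
p_bigram = {(u, v): c / unigram_counts[u] for (u, v), c in bigram_counts.items()}

# Probability of a sentence under the unigram (bag-of-words) model
sentence = ["the", "cat", "sat"]
p_s = 1.0
for w in sentence:
    p_s *= p_unigram[w]
print(p_unigram, p_bigram, p_s)
```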
Vector Space Model
 A document is represented as a vector of identifiers
 Identifier
 Boolean: 0, 1
 Term Count: How many times…
 Term Frequency: How frequent…in this document
 TF-IDF: How important…in the corpus  most used
 Relevance Ranking
 First used in SMART (Gerard Salton, Cornell)
$\vec{d_j} = (w_{1j}, w_{2j}, \dots, w_{tj})$,  $\vec{q} = (w_{1q}, w_{2q}, \dots, w_{tq})$
Relevance: $\cos\theta_j = \dfrac{\vec{d_j}\cdot\vec{q}}{|\vec{d_j}|\,|\vec{q}|}$
(Gerard Salton Award, SIGIR)
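A minimal sketch (made-up vectors) of the cosine relevance score above:

```python
import numpy as np

def cosine(d, q):
    # cos(theta) = (d . q) / (|d| |q|)
    return float(np.dot(d, q) / (np.linalg.norm(d) * np.linalg.norm(q)))

d_j = np.array([1.0, 0.0, 2.0, 3.0])  # document vector over 4 terms
q = np.array([0.0, 1.0, 2.0, 1.0])    # query vector over the same terms
print(cosine(d_j, q))
```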
TF-IDF
 Mixture language model
 Linear combination of certain distributions (e.g., Gaussian)
 Better Performance
 TF: Term Frequency
 IDF: Inverse Document Frequency
 TF-IDF
$tf_{ij} = \dfrac{n_{ij}}{\sum_k n_{kj}}$  (term $i$, document $j$; $n_{ij}$ is the count of $i$ in $j$): how important the term is in this document
$idf_i = \log\dfrac{N}{1 + |\{d \in D : t_i \in d\}|}$  ($N$ documents in the corpus): how important the term is in this corpus
$tfidf(t_i, d_j, D) = tf_{ij} \times idf_i$
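A minimal sketch (toy corpus) of the tf and idf definitions above:

```python
import math
from collections import Counter

docs = [["apple", "banana", "apple"], ["banana", "cherry"], ["apple", "cherry", "cherry"]]
N = len(docs)

def tf(term, doc):
    counts = Counter(doc)
    return counts[term] / sum(counts.values())

def idf(term):
    df = sum(1 for d in docs if term in d)   # document frequency
    return math.log(N / (1 + df))

def tfidf(term, doc):
    return tf(term, doc) * idf(term)

print(tfidf("apple", docs[0]), tfidf("cherry", docs[2]))
```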
Latent Semantic Indexing
 Challenge
 Compare documents in the same concept space
 Compare documents across languages
 Synonymy, e.g., buy vs. purchase, user vs. consumer
 Polysemy, e.g., book, draw (one surface form, several senses)
 Key Idea
 Dimensionality reduction of word-document co-occurrence matrix
 Construction of latent semantic space
Defects of VSM: in VSM a word is tied directly to a document; in LSI a word is related to a document through a concept (also called an aspect, topic, or latent factor).
Singular Value Decomposition
 LSI ~= SVD
 U, V: orthogonal matrices
 Σ: the diagonal matrix of the singular values of N
$N = U\Sigma V^T$
N: the $t \times d$ term-document matrix (entries: count, frequency, or TF-IDF)
U: $t \times m$;  $\Sigma$: $m \times m$;  $V^T$: $m \times d$
Truncated SVD keeps only the $k$ largest singular values, $k < m$ or $k \ll m$: $N \approx U_k\Sigma_k V_k^T$ with $U_k$: $t \times k$, $\Sigma_k$: $k \times k$, $V_k^T$: $k \times d$
word: Exchangeability
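A minimal numpy sketch (random matrix as a stand-in for a term-document matrix) of the rank-k truncation just described:

```python
import numpy as np

t, d, k = 6, 5, 2                      # terms, documents, kept dimensions
N = np.random.rand(t, d)               # stand-in for a count/TF-IDF matrix

U, s, Vt = np.linalg.svd(N, full_matrices=False)
N_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-k approximation

# Documents represented in the k-dimensional latent semantic space
doc_latent = np.diag(s[:k]) @ Vt[:k, :]        # shape (k, d)
print(np.linalg.norm(N - N_k), doc_latent.shape)
```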
Singular Value Decomposition
 The K-largest singular values
 Distinguish the variance between words and documents to the greatest extent
 Discarding the lowest dimensions
 Reduce noise
 Fill the matrix
 Predict & Lower computational complexity
 Enlarge the distinctiveness
 Decomposition
 Concept, semantic, topic (aspect)
(Probabilistic) Matrix Factorization / Factorization Model: analytic solution of SVD
Unsupervised Learning
Probabilistic Latent Semantic Indexing
 pLSI Model
[Graphical view: documents d1 … dM connect to latent topics z1 … zK, which connect to words w1 … wN, with probabilities p(d), p(z|d), p(w|z)]
 Assumption
 Pairs(d,w) are assumed to be
generated independently
 Conditioned on z, w is generated
independently of d
 Words in a document are
exchangeable
 Documents are exchangeable
 Latent topics z are independent
Generative Process / Model:
$p(d, w) = p(d)\,p(w|d) = p(d)\sum_{z \in Z} p(w, z|d) = p(d)\sum_{z \in Z} p(z|d)\,p(w|z)$
$p(z|d)$ and $p(w|z)$ are multinomial distributions; $p(z|d)$ is local (one per document), while $p(w|z)$ is global (shared across the corpus)
pLSA can be viewed as one layer of a 'deep neural network'
Probabilistic Latent Semantic Indexing
[Two plate diagrams: (1) d → z → w with plates N (words) and M (documents); (2) z → d and z → w]
(1) $p(w|d) = \sum_{z \in Z} p(z|d)\,p(w|z)$
(2) $p(w, d) = \sum_{z \in Z} p(w, d, z) = \sum_{z \in Z} p(w|d, z)\,p(d, z) = \sum_{z \in Z} p(w|z)\,p(d|z)\,p(z)$
These are two ways to formulate pLSA; they are equivalent by Bayes' rule but lead to two different inference processes.
Probabilistic graphical model: a directed acyclic graph (DAG); d: exchangeability
Expectation-Maximization
 EM is a general algorithm for maximum-likelihood estimation (MLE) where the data are 'incomplete' or contain latent variables: pLSA, GMM, HMM, ... (cross-domain)
 Deduction Process
 θ: the parameter to be estimated; θ0: initialized randomly; θn: the current value; θn+1: the next value
Objective: $\theta^{(n+1)} = \arg\max_{\theta} L(\theta)$ such that $L(\theta^{(n+1)}) \ge L(\theta^{(n)})$
$L(\theta) = \log p(X|\theta)$;  $L_c(\theta) = \log p(X, H|\theta)$, where $H$ is the latent variable
$L_c(\theta) = \log p(X, H|\theta) = \log p(X|\theta) + \log p(H|X, \theta) = L(\theta) + \log p(H|X, \theta)$
$L(\theta) - L(\theta^{(n)}) = L_c(\theta) - L_c(\theta^{(n)}) + \log\dfrac{p(H|X, \theta^{(n)})}{p(H|X, \theta)}$
Expectation-Maximization
Taking the expectation with respect to $p(H|X, \theta^{(n)})$:
$L(\theta) - L(\theta^{(n)}) = \sum_H p(H|X, \theta^{(n)})L_c(\theta) - \sum_H p(H|X, \theta^{(n)})L_c(\theta^{(n)}) + \sum_H p(H|X, \theta^{(n)})\log\dfrac{p(H|X, \theta^{(n)})}{p(H|X, \theta)}$
The last term is a Kullback-Leibler divergence (relative entropy) and is therefore non-negative, which gives the lower bound
$L(\theta) - L(\theta^{(n)}) \ge \sum_H p(H|X, \theta^{(n)})L_c(\theta) - \sum_H p(H|X, \theta^{(n)})L_c(\theta^{(n)})$
Q-function: $Q(\theta; \theta^{(n)}) = E_{p(H|X, \theta^{(n)})}[L_c(\theta)] = \sum_H p(H|X, \theta^{(n)})L_c(\theta)$
E-step (expectation): compute Q
M-step (maximization): re-estimate θ by maximizing Q
Iterate until convergence. How is EM used in pLSA?
EM in pLSA
$Q(\theta; \theta^{(n)}) = E_{p(H|X, \theta^{(n)})}[L_c(\theta)] = \sum_H p(H|X, \theta^{(n)})L_c(\theta)$
$= \sum_{i=1}^{N}\sum_{j=1}^{M} n(d_i, w_j)\sum_{k=1}^{K} p(z_k|d_i, w_j)\log\big(p(w_j|z_k)\,p(z_k|d_i)\big)$
Here $p(z_k|d_i, w_j)$ is the posterior (initialized with random values), and the log-likelihood is $\sum_{i}\sum_{j} n(d_i, w_j)\log p(w_j|d_i)$.
Constraints:
1. $\sum_{j=1}^{M} p(w_j|z_k) = 1$
2. $\sum_{k=1}^{K} p(z_k|d_i) = 1$
Lagrange multipliers ($\tau_k$, $\rho_i$ for the two constraints):
$H = E[L_c] + \sum_{k=1}^{K}\tau_k\Big(1 - \sum_{j=1}^{M} p(w_j|z_k)\Big) + \sum_{i=1}^{N}\rho_i\Big(1 - \sum_{k=1}^{K} p(z_k|d_i)\Big)$
Setting the partial derivatives with respect to the independent variables $p(w_j|z_k)$ and $p(z_k|d_i)$ to zero gives the
M-step:
$p(w_j|z_k) = \dfrac{\sum_{i=1}^{N} n(d_i, w_j)\,p(z_k|d_i, w_j)}{\sum_{m=1}^{M}\sum_{i=1}^{N} n(d_i, w_m)\,p(z_k|d_i, w_m)}$,   $p(z_k|d_i) = \dfrac{\sum_{j=1}^{M} n(d_i, w_j)\,p(z_k|d_i, w_j)}{n(d_i)}$
E-step (by Bayes' rule, using the associative and distributive laws):
$p(z_k|d_i, w_j) = \dfrac{p(w_j|z_k)\,p(z_k|d_i)\,p(d_i)}{p(d_i)\sum_{l=1}^{K} p(w_j|z_l)\,p(z_l|d_i)} = \dfrac{p(w_j|z_k)\,p(z_k|d_i)}{\sum_{l=1}^{K} p(w_j|z_l)\,p(z_l|d_i)}$
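As an illustration (not from the original slides), a minimal numpy sketch of the E-step and M-step above; the toy count matrix, sizes, and variable names are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
n_dw = rng.integers(0, 5, size=(4, 6)).astype(float)  # n(d, w): 4 documents, 6 words
N, M, K = n_dw.shape[0], n_dw.shape[1], 2             # documents, words, topics

p_w_z = rng.random((K, M)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)  # p(w|z)
p_z_d = rng.random((N, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)  # p(z|d)

for _ in range(50):
    # E-step: p(z|d,w) ∝ p(w|z) p(z|d)
    post = p_z_d[:, :, None] * p_w_z[None, :, :]        # shape (N, K, M)
    post /= post.sum(axis=1, keepdims=True) + 1e-12
    # M-step
    nz = n_dw[:, None, :] * post                        # n(d,w) p(z|d,w), shape (N, K, M)
    p_w_z = nz.sum(axis=0); p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
    p_z_d = nz.sum(axis=2); p_z_d /= n_dw.sum(axis=1, keepdims=True) + 1e-12

print(p_w_z.round(3))
```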
Bayesian Theory vs. Probability Theory
 Bayesian Theory vs. Probability Theory
 Estimate 𝜃 through the posterior vs. estimate 𝜃 through maximization of the likelihood
 Bayesian theory → prior vs. probability theory → statistics
 When the number of samples → ∞, Bayesian theory == probability theory
 Parameter Estimation
 $p(\theta|D) \propto p(D|\theta)\,p(\theta)$ → what is $p(\theta)$? → Conjugate Prior → the likelihood is helpful, but its function is limited → otherwise?
 Non-parametric Bayesian Methods (complicated)
 Kernel methods: I just know a little...
 VSM → CF → MF → pLSA → LDA → Non-parametric Bayesian; Deep Learning
Latent Dirichlet Allocation
 Latent Dirichlet Allocation (LDA)
 David M. Blei, Andrew Y. Ng, Michael I. Jordan
 Journal of Machine Learning Research, 2003; cited > 3000
 Hierarchical Bayesian model; Bayesian pLSI
[Plate diagram: α → θ → z → w ← β; w lies in an inner plate of size N (words), nested in an outer plate of size M (documents)]
Generative process of a document d in a corpus according to LDA:
 Choose N ~ Poisson(ξ)  → Why?
 For each document d = {w_1, w_2, …, w_n}:
   Choose θ ~ Dir(α)  → Why?
   For each of the N words w_n in d:
     a) Choose a topic z_n ~ Multinomial(θ)  → Why?
     b) Choose a word w_n from p(w_n|z_n, β), a multinomial probability conditioned on z_n  → Why?
(David M. Blei: ACM-Infosys Award)
Latent Dirichlet Allocation
 LDA(Cont.)
[Plate diagram: α → θ → z → w, with w also depending on φ (drawn from β) in a plate of size K; w lies in an inner plate of size N, nested in an outer plate of size M]
Generative process of a document d in LDA:
 Choose N ~ Poisson(ξ)  → not important
 For each document d = {w_1, w_2, …, w_n}:
   Choose θ ~ Dir(α); θ = (θ_1, θ_2, …, θ_K), |θ| = K, K is fixed, ∑_{k=1}^{K} θ_k = 1; Dirichlet and Multinomial form a conjugate pair (conjugate prior)
   For each of the N words w_n in d:
     a) Choose a topic z_n ~ Multinomial(θ)
     b) Choose a word w_n from p(w_n|z_n, β), a multinomial probability conditioned on z_n
One word → one topic; one document → multiple topics
θ = (θ_1, θ_2, …, θ_K), z = (z_1, z_2, …, z_K); for each word w_n there is a z_n
In pLSA the number of p(z|d) parameters is linear in the number of documents → overfitting; the Dirichlet prior acts as regularization
M + K Dirichlet-Multinomial pairs
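As an aside (not in the original slides), a minimal numpy sketch of the generative process just described; the hyperparameters, vocabulary size, and corpus size are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, M = 3, 10, 2                              # topics, vocabulary size, documents
alpha = np.full(K, 0.5)                         # Dirichlet hyperparameter for theta
beta = rng.dirichlet(np.full(V, 0.1), size=K)   # topic-word distributions (K x V)

docs = []
for _ in range(M):
    N = rng.poisson(8)                  # document length ~ Poisson(xi)
    theta = rng.dirichlet(alpha)        # per-document topic proportions ~ Dir(alpha)
    words = []
    for _ in range(N):
        z = rng.choice(K, p=theta)      # topic z_n ~ Multinomial(theta)
        w = rng.choice(V, p=beta[z])    # word w_n ~ Multinomial(beta_z)
        words.append(w)
    docs.append(words)
print(docs)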
Latent Dirichlet Allocation
[figure]
Conjugate Prior & Distributions
 Conjugate Prior:
 If the posterior p(θ|x) is in the same family as the prior p(θ), the prior and posterior are called conjugate distributions, and the prior is called a conjugate prior of the likelihood p(x|θ): p(θ|x) ∝ p(x|θ)p(θ)
 Distributions
 Binomial Distribution ←→ Beta Distribution
 Multinomial Distribution ←→ Dirichlet Distribution
 Binomial & Beta Distribution
 Binomial Bin(m|N,θ)=C(m,N)θm(1-θ)N-m :likelihood
 C(m,N)=N!/(N-m)!m!
 Beta(θ|a,b) 
6/11/2014 23 Middleware, CCNT, ZJU
11-
)1(
)()(
)( 


 ba
ba
ba
 



0
1
)( dteta ta
Why do prior and
posterior need to be
conjugate distributions?
, Yueshen Xu
Conjugate Prior & Distributions
$p(\theta|m, l, a, b) \propto \mathrm{Bin}(m|m+l, \theta)\,\mathrm{Beta}(\theta|a, b) = C(m, m+l)\,\theta^m(1-\theta)^l\cdot\dfrac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\,\theta^{a-1}(1-\theta)^{b-1}$
$p(\theta|m, l, a, b) = \dfrac{\Gamma(m+a+l+b)}{\Gamma(m+a)\,\Gamma(l+b)}\,\theta^{m+a-1}(1-\theta)^{l+b-1}$ : a Beta distribution again! (parameter estimation)
 Multinomial & Dirichlet Distribution
 $\vec{x}$ is a multivariate indicator, e.g., $\vec{x} = (0, 0, 1, 0, 0, 0)$: the event $x_3$ happens
 The probability distribution of $\vec{x}$ in only one event: $p(\vec{x}|\theta) = \prod_{k=1}^{K}\theta_k^{x_k}$, $\theta = (\theta_1, \theta_2, \dots, \theta_K)$
Conjugate Prior & Distributions
 Multinomial & Dirichlet Distribution (cont.)
 $\mathrm{Mult}(m_1, m_2, \dots, m_K|\boldsymbol{\theta}, N) = \dfrac{N!}{m_1!\,m_2!\cdots m_K!}\prod_{k=1}^{K}\theta_k^{m_k}$ : the likelihood function of $\theta$
Mult is the exact probability distribution of $p(z_k|d_j)$ and $p(w_j|z_k)$.
In Bayesian theory, we need to find a conjugate prior of $\theta$ for Mult, where $0 < \theta_k < 1$ and $\sum_{k=1}^{K}\theta_k = 1$ → the Dirichlet distribution:
$\mathrm{Dir}(\theta|\boldsymbol{\alpha}) = \dfrac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)}\prod_{k=1}^{K}\theta_k^{\alpha_k - 1}$, where $\boldsymbol{\alpha}$ is a vector and $\alpha_0 = \sum_k\alpha_k$
Hyper-parameter: a parameter in the probability distribution function (pdf)
Conjugate Prior & Distributions
 Multinomial & Dirichlet Distribution (cont.)
 $p(\theta|\boldsymbol{m}, \boldsymbol{\alpha}) \propto p(\boldsymbol{m}|\theta)\,p(\theta|\boldsymbol{\alpha}) \propto \prod_{k=1}^{K}\theta_k^{\alpha_k + m_k - 1}$  → a Dirichlet?
$p(\theta|\boldsymbol{m}, \boldsymbol{\alpha}) = \mathrm{Dir}(\theta|\boldsymbol{m} + \boldsymbol{\alpha}) = \dfrac{\Gamma(\alpha_0 + N)}{\Gamma(\alpha_1 + m_1)\cdots\Gamma(\alpha_K + m_K)}\prod_{k=1}^{K}\theta_k^{\alpha_k + m_k - 1}$  → a Dirichlet!
Why? → the Gamma function Γ is a mysterious function
$p \sim \mathrm{Beta}(t|\alpha, \beta) \Rightarrow E[p] = \int_0^1 t\cdot\dfrac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,t^{\alpha-1}(1-t)^{\beta-1}\,dt = \dfrac{\alpha}{\alpha+\beta}$
$p \sim \mathrm{Dir}(\theta|\boldsymbol{\alpha}) \Rightarrow E[p] = \Big(\dfrac{\alpha_1}{\sum_{i=1}^{K}\alpha_i}, \dfrac{\alpha_2}{\sum_{i=1}^{K}\alpha_i}, \dots, \dfrac{\alpha_K}{\sum_{i=1}^{K}\alpha_i}\Big)$
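A minimal numpy sketch (toy counts) of the Dirichlet-multinomial conjugate update and posterior mean shown above:

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])      # Dirichlet prior hyperparameters
m = np.array([5, 2, 1])                # observed counts for the K=3 outcomes

alpha_post = alpha + m                 # posterior is Dir(theta | m + alpha)
post_mean = alpha_post / alpha_post.sum()   # E[theta_k] = (alpha_k + m_k) / sum
print(alpha_post, post_mean)
```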
Poisson Distribution
 Why Poisson distribution?
 The number of births per hour during a given day; the number of
particles emitted by a radioactive source in a given time; the number
of cases of a disease in different towns
 For Bin(n, p), when n is large and p is small: $p(X=k) \approx \dfrac{\xi^k e^{-\xi}}{k!}$, with $\xi \approx np$
 $\mathrm{Gamma}(x|\alpha) = \dfrac{x^{\alpha-1}e^{-x}}{\Gamma(\alpha)}$; with $\alpha = k+1$: $\mathrm{Gamma}(x|\alpha = k+1) = \dfrac{x^k e^{-x}}{k!}$  (since $\Gamma(k+1) = k!$)
(Poisson → discrete; Gamma → continuous)
 Poisson Distribution: $p(k|\xi) = \dfrac{\xi^k e^{-\xi}}{k!}$
 Many experimental situations occur in which we observe the counts of events within a set unit of time, area, volume, length, etc.
Solution for LDA
 LDA(Cont.)
 𝛼, 𝛽: corpus-level parameters
 𝜃: document-level variable
 z, w: word-level variables
 Conditionally independent hierarchical models
 Parametric Bayes model
[β is a K × V topic-word matrix: row z_k is the word distribution (p_{k1}, p_{k2}, …, p_{kV}) of topic z_k; each word w_n in a document has its own topic assignment z_n]
$p(\theta, \boldsymbol{z}, \boldsymbol{w}|\alpha, \beta) = p(\theta|\alpha)\prod_{n=1}^{N} p(z_n|\theta)\,p(w_n|z_n, \beta)$,  with $p(z_i|\boldsymbol{\theta}) = \theta_i$
Solving process (marginalizing over z and θ, a multiple integral):
$p(\boldsymbol{w}|\alpha, \beta) = \int p(\theta|\alpha)\prod_{n=1}^{N}\sum_{z_n} p(z_n|\theta)\,p(w_n|z_n, \beta)\,d\theta$
$p(\boldsymbol{D}|\alpha, \beta) = \prod_{d=1}^{M}\int p(\theta_d|\alpha)\prod_{n=1}^{N_d}\sum_{z_{dn}} p(z_{dn}|\theta_d)\,p(w_{dn}|z_{dn}, \beta)\,d\theta_d$
Solution for LDA
The most significant generative model in the machine learning community in the recent ten years.
$p(\boldsymbol{w}|\alpha, \beta) = \int p(\theta|\alpha)\prod_{n=1}^{N}\sum_{z_n} p(z_n|\theta)\,p(w_n|z_n, \beta)\,d\theta$
Rewritten in terms of the model parameters:
$p(\boldsymbol{w}|\alpha, \beta) = \dfrac{\Gamma(\sum_i\alpha_i)}{\prod_i\Gamma(\alpha_i)}\int\Big(\prod_{i=1}^{k}\theta_i^{\alpha_i - 1}\Big)\Big(\prod_{n=1}^{N}\sum_{i=1}^{k}\prod_{j=1}^{V}(\theta_i\beta_{ij})^{w_n^j}\Big)d\theta$
$\alpha = (\alpha_1, \alpha_2, \dots, \alpha_K)$; $\beta \in \mathbb{R}^{K\times V}$: what we need to solve for
Variational Inference (deterministic inference)  vs.  Gibbs Sampling (stochastic inference)
Why variational inference? To simplify the dependency structure.
Why sampling? To approximate the statistical properties of the population with those of the samples.
Variational Inference
 Variational Inference (Inference through a variational
distribution), VI
 VI aims to use an approximating distribution that has a simpler
dependency structure than that of the exact posterior distribution
$P(H|D) \approx Q(H)$: approximate the true posterior distribution $P(H|D)$ with a variational distribution $Q(H)$
Dissimilarity between P and Q? → Kullback-Leibler divergence:
$KL(Q\|P) = \int Q(H)\log\dfrac{Q(H)\,P(D)}{P(H, D)}\,dH = \int Q(H)\log\dfrac{Q(H)}{P(H, D)}\,dH + \log P(D)$
$L \overset{\mathrm{def}}{=} \int Q(H)\log P(H, D)\,dH - \int Q(H)\log Q(H)\,dH = \langle\log P(H, D)\rangle_{Q(H)} + \mathbb{H}[Q]$, where $\mathbb{H}[Q]$ is the entropy of Q
Variational Inference
$P(H|D) = p(\theta, z|\boldsymbol{w}, \alpha, \beta)$,  $Q(H) = q(\theta, z|\gamma, \phi) = q(\theta|\gamma)\,q(z|\phi) = q(\theta|\gamma)\prod_{n=1}^{N} q(z_n|\phi_n)$
($\theta$ and z are treated as (approximately) independent to facilitate computation)
$(\gamma^*, \phi^*) = \arg\min D\big(q(\theta, z|\gamma, \phi)\,\|\,p(\theta, z|\boldsymbol{w}, \alpha, \beta)\big)$: but we don't know the exact analytical form of this KL divergence
$\log p(w|\alpha, \beta) = \log\int\sum_z p(\theta, z, w|\alpha, \beta)\,d\theta = \log\int\sum_z p(\theta, z, w|\alpha, \beta)\dfrac{q(\theta, z)}{q(\theta, z)}\,d\theta$
$\ge \int\sum_z q(\theta, z)\log\dfrac{p(\theta, z, w|\alpha, \beta)}{q(\theta, z)}\,d\theta = E_q[\log p(\theta, z, w|\alpha, \beta)] - E_q[\log q(\theta, z)] = L(\gamma, \phi; \alpha, \beta)$
$\log p(w|\alpha, \beta) = L(\gamma, \phi; \alpha, \beta) + KL$  → minimizing KL == maximizing L
Variational Inference
$L(\gamma, \phi; \alpha, \beta) = E_q[\log p(\theta|\alpha)] + E_q[\log p(z|\theta)] + E_q[\log p(w|z, \beta)] - E_q[\log q(\theta)] - E_q[\log q(z)]$
$E_q[\log p(\theta|\alpha)] = \sum_{i=1}^{K}(\alpha_i - 1)E_q[\log\theta_i] + \log\Gamma\Big(\sum_{i=1}^{K}\alpha_i\Big) - \sum_{i=1}^{K}\log\Gamma(\alpha_i)$, where $E_q[\log\theta_i] = \psi(\gamma_i) - \psi\Big(\sum_{j=1}^{K}\gamma_j\Big)$
$E_q[\log p(z|\theta)] = \sum_{n=1}^{N}\sum_{i=1}^{K}E_q[z_{ni}]\,E_q[\log\theta_i] = \sum_{n=1}^{N}\sum_{i=1}^{K}\phi_{ni}\Big(\psi(\gamma_i) - \psi\big(\sum_{j=1}^{K}\gamma_j\big)\Big)$
$E_q[\log p(w|z, \beta)] = \sum_{n=1}^{N}\sum_{i=1}^{K}\sum_{j=1}^{V}E_q[z_{ni}]\,w_n^j\log\beta_{ij} = \sum_{n=1}^{N}\sum_{i=1}^{K}\sum_{j=1}^{V}\phi_{ni}\,w_n^j\log\beta_{ij}$
Variational Inference
$E_q[\log q(\theta|\gamma)]$ has much the same form as $E_q[\log p(\theta|\alpha)]$
$E_q[\log q(z|\phi)] = E_q\Big[\sum_{n=1}^{N}\sum_{i=1}^{K} z_{ni}\log\phi_{ni}\Big]$
Maximize L with respect to $\phi_{ni}$ (Lagrange multiplier for $\sum_i\phi_{ni} = 1$):
$L_{[\phi_{ni}]} = \phi_{ni}\Big(\psi(\gamma_i) - \psi\big(\sum_{j=1}^{K}\gamma_j\big)\Big) + \phi_{ni}\log\beta_{ij} - \phi_{ni}\log\phi_{ni} + \lambda\Big(\sum_{i=1}^{K}\phi_{ni} - 1\Big)$
Taking the derivative with respect to $\phi_{ni}$:
$\dfrac{\partial L}{\partial\phi_{ni}} = \psi(\gamma_i) - \psi\Big(\sum_{j=1}^{K}\gamma_j\Big) + \log\beta_{ij} - \log\phi_{ni} - 1 + \lambda = 0$
$\phi_{ni} \propto \beta_{ij}\exp\Big(\psi(\gamma_i) - \psi\big(\sum_{j=1}^{K}\gamma_j\big)\Big)$, where j is the vocabulary index of the observed word $w_n$
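A minimal numpy/scipy sketch (random γ and β, hypothetical word index j) of the closed-form φ update derived above:

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
K, V = 4, 12
gamma = rng.random(K) * 5 + 1                  # variational Dirichlet parameters
beta = rng.dirichlet(np.full(V, 0.1), size=K)  # topic-word probabilities (K x V)
j = 7                                          # index of the observed word w_n

# phi_ni ∝ beta_ij * exp(psi(gamma_i) - psi(sum_j gamma_j))
phi_n = beta[:, j] * np.exp(digamma(gamma) - digamma(gamma.sum()))
phi_n /= phi_n.sum()                           # normalize so that sum_i phi_ni = 1
print(phi_n)
```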
Variational Inference
 You can refer to the original paper for more details.
 Variational EM Algorithm
 Aim: $(\alpha^*, \beta^*) = \arg\max_{\alpha, \beta}\prod_{d=1}^{M} p(\boldsymbol{w}_d|\alpha, \beta)$
 Initialize $\alpha, \beta$
 E-Step: for each document, compute the variational parameters $(\gamma, \phi)$ by variational inference, which gives an approximation (lower bound) of the likelihood
 M-Step: maximize that lower bound with respect to $\alpha, \beta$
 Repeat until convergence
Markov Chain Monte Carlo
 MCMC Basic: Markov Chain (First-order)  Stationary
Distribution  Fundament of Gibbs Sampling
 General: $P(X_{t+n} = x|X_1, X_2, \dots, X_t) = P(X_{t+n} = x|X_t)$
 First-order: $P(X_{t+1} = x|X_1, X_2, \dots, X_t) = P(X_{t+1} = x|X_t)$
 One-step transition probability matrix
$P = \begin{pmatrix} p(1|1) & p(2|1) & \cdots & p(|S|\,|1) \\ p(1|2) & p(2|2) & \cdots & p(|S|\,|2) \\ \vdots & \vdots & & \vdots \\ p(1|\,|S|) & p(2|\,|S|) & \cdots & p(|S|\,|\,|S|) \end{pmatrix}$
(rows: current state $X_m$; columns: next state $X_{m+1}$)
Markov Chain Monte Carlo
 Markov Chain
 Initialization probability: $\pi_0 = \{\pi_0(1), \pi_0(2), \dots, \pi_0(|S|)\}$
 $\pi_n = \pi_{n-1}P = \pi_{n-2}P^2 = \cdots = \pi_0 P^n$: Chapman-Kolmogorov equation
 Convergence (ergodic) theorem: under the premise of connectivity of P, $\lim_{n\to\infty} P_{ij}^{\,n} = \pi(j)$, and $\pi(j) = \sum_{i=1}^{|S|}\pi(i)P_{ij}$
 $\lim_{n\to\infty}\pi_0 P^n = \pi$ regardless of $\pi_0$; every row of $\lim_{n\to\infty} P^n$ equals $\pi$
 $\pi = \{\pi(1), \pi(2), \dots, \pi(j), \dots, \pi(|S|)\}$: the stationary distribution
$X_0 \sim \pi_0(x) \rightarrow X_1 \sim \pi_1(x) \rightarrow \cdots \rightarrow X_n \sim \pi(x) \rightarrow X_{n+1} \sim \pi(x) \rightarrow X_{n+2} \sim \pi(x) \rightarrow \cdots$
After convergence, the samples are drawn from the stationary distribution.
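A minimal numpy sketch (made-up 3-state transition matrix) showing π₀Pⁿ converging to the stationary distribution:

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])          # row-stochastic transition matrix
pi0 = np.array([1.0, 0.0, 0.0])          # arbitrary initial distribution

pi = pi0.copy()
for _ in range(100):                     # pi_n = pi_{n-1} P
    pi = pi @ P
print(pi)                                # converges to the stationary pi with pi P = pi
print(np.linalg.matrix_power(P, 100)[0]) # every row of P^n converges to the same pi
```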
Markov Chain Monte Carlo
 MCMC Sampling
 We need to relate $\pi(x)$ to the MC transition process → the Detailed Balance Condition
 In a common MC with transition matrix P, if a distribution $\pi(x)$ satisfies $\pi(i)P_{ij} = \pi(j)P_{ji}$ for all $i, j$, then $\pi(x)$ is the stationary distribution of this MC
 Proof: $\sum_{i}\pi(i)P_{ij} = \sum_{i}\pi(j)P_{ji} = \pi(j)$, i.e., $\pi P = \pi$, so $\pi$ solves the stationarity equation. Done.
 For a common MC with proposal $q(i, j) = q(j|i) = q(i \to j)$ and any probability distribution $p(x)$ (the dimension of x is arbitrary), transform it so that detailed balance holds:
$p(i)\,q(i, j)\,\alpha(i, j) = p(j)\,q(j, i)\,\alpha(j, i)$, with $Q'(i, j) = q(i, j)\,\alpha(i, j)$ and $Q'(j, i) = q(j, i)\,\alpha(j, i)$
A sufficient choice is $\alpha(i, j) = p(j)\,q(j, i)$ and $\alpha(j, i) = p(i)\,q(i, j)$
Markov Chain Monte Carlo
 MCMC Sampling (cont.)
Step 1: Initialize $X_0 = x_0$
Step 2: for t = 0, 1, 2, …
    $X_t = x_t$; sample y from $q(x|x_t)$ (y in the domain of definition)
    sample u from Uniform[0, 1]
    if $u < \alpha(x_t, y) = p(y)\,q(x_t|y)$ then accept the move $x_t \to y$, i.e., $X_{t+1} = y$
    else $X_{t+1} = x_t$
 Metropolis-Hastings Sampling
Step 1: Initialize $X_0 = x_0$
Step 2: for t = 0, 1, 2, …, n, n+1, n+2, …
    $X_t = x_t$; sample y from $q(x|x_t)$ (y in the domain of definition)
    sample u from Uniform[0, 1]
    if $u < \alpha(x_t, y) = \min\Big\{\dfrac{p(y)\,q(x_t|y)}{p(x_t)\,q(y|x_t)},\ 1\Big\}$ then $x_t \to y$, i.e., $X_{t+1} = y$
    else $X_{t+1} = x_t$
The iterations up to t = n form the burn-in period; afterwards the chain has converged.
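A minimal sketch of the Metropolis-Hastings loop above; the standard normal target and the Gaussian random-walk proposal are assumptions for illustration, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def p(x):                      # unnormalized target density: standard normal
    return np.exp(-0.5 * x * x)

x = 0.0                        # X_0 = x_0
samples = []
for t in range(5000):
    y = x + rng.normal(0.0, 1.0)               # proposal q(y|x): symmetric random walk
    u = rng.uniform(0.0, 1.0)
    # alpha = min{ p(y) q(x|y) / (p(x) q(y|x)), 1 }; q cancels because it is symmetric
    if u < min(p(y) / p(x), 1.0):
        x = y                                  # accept: X_{t+1} = y
    samples.append(x)                          # else keep X_{t+1} = x_t

burned = samples[1000:]                        # discard the burn-in period
print(np.mean(burned), np.std(burned))         # should be near 0 and 1
```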
Gibbs Sampling
MH is not well suited to high-dimensional variables.
 Gibbs Sampling (two dimensions): consider the points A(x1, y1), B(x1, y2), C(x2, y1) and D
$p(x_1, y_1)\,p(y_2|x_1) = p(x_1)\,p(y_1|x_1)\,p(y_2|x_1)$
$p(x_1, y_2)\,p(y_1|x_1) = p(x_1)\,p(y_2|x_1)\,p(y_1|x_1)$
$\Rightarrow p(x_1, y_1)\,p(y_2|x_1) = p(x_1, y_2)\,p(y_1|x_1)$, i.e., $p(A)\,p(y_2|x_1) = p(B)\,p(y_1|x_1)$
Similarly, $p(A)\,p(x_2|y_1) = p(C)\,p(x_1|y_1)$
Gibbs Sampling
 Gibbs Sampling(Cont.)
 We can construct the transition probability matrix Q accordingly:
$Q(A \to B) = p(y_B|x_1)$, if $x_A = x_B = x_1$
$Q(A \to C) = p(x_C|y_1)$, if $y_A = y_C = y_1$
$Q(A \to D) = 0$, otherwise
Detailed balance condition: $p(X)\,Q(X \to Y) = p(Y)\,Q(Y \to X)$ ✓
 Gibbs Sampling (in two dimensions)
Step 1: Initialize $X_0 = x_0$, $Y_0 = y_0$
Step 2: for t = 0, 1, 2, …
    1. $y_{t+1} \sim p(y|x_t)$
    2. $x_{t+1} \sim p(x|y_{t+1})$
Gibbs Sampling
 Gibbs Sampling (in n dimensions)
Step 1: Initialize $X_0 = x^{(0)} = \{x_i^{(0)}: i = 1, 2, \dots, n\}$
Step 2: for t = 0, 1, 2, …
    1. $x_1^{(t+1)} \sim p(x_1|x_2^{(t)}, x_3^{(t)}, \dots, x_n^{(t)})$
    2. $x_2^{(t+1)} \sim p(x_2|x_1^{(t+1)}, x_3^{(t)}, \dots, x_n^{(t)})$
    3. …
    4. $x_j^{(t+1)} \sim p(x_j|x_1^{(t+1)}, \dots, x_{j-1}^{(t+1)}, x_{j+1}^{(t)}, \dots, x_n^{(t)})$
    5. …
    6. $x_n^{(t+1)} \sim p(x_n|x_1^{(t+1)}, x_2^{(t+1)}, \dots, x_{n-1}^{(t+1)})$
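A minimal sketch of the two-dimensional Gibbs sampler; the bivariate normal target with correlation ρ is an assumption for illustration, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8                      # correlation of the target bivariate normal
x, y = 0.0, 0.0                # X_0 = x_0, Y_0 = y_0
samples = []
for t in range(10000):
    # For a standard bivariate normal, the full conditionals are normal:
    # y | x ~ N(rho * x, 1 - rho^2) and x | y ~ N(rho * y, 1 - rho^2)
    y = rng.normal(rho * x, np.sqrt(1 - rho ** 2))   # y_{t+1} ~ p(y | x_t)
    x = rng.normal(rho * y, np.sqrt(1 - rho ** 2))   # x_{t+1} ~ p(x | y_{t+1})
    samples.append((x, y))

xs, ys = np.array(samples[2000:]).T                  # drop burn-in
print(np.corrcoef(xs, ys)[0, 1])                     # should be close to rho
```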
Gibbs Sampling for LDA
 Gibbs Sampling in LDA
 $\mathrm{Dir}(\vec{p}|\vec{\alpha}) = \dfrac{1}{\Delta(\vec{\alpha})}\prod_{k} p_k^{\alpha_k - 1}$, where $\Delta(\vec{\alpha})$ is the normalization factor: $\Delta(\vec{\alpha}) = \int\prod_{k} p_k^{\alpha_k - 1}\,d\vec{p}$
$p(\vec{z}_m|\vec{\alpha}) = \int p(\vec{z}_m|\vec{\theta})\,p(\vec{\theta}|\vec{\alpha})\,d\vec{\theta} = \int\prod_{k=1}^{K}\theta_k^{n_k}\,\mathrm{Dir}(\vec{\theta}|\vec{\alpha})\,d\vec{\theta}$
$= \int\prod_{k=1}^{K}\theta_k^{n_k}\dfrac{1}{\Delta(\vec{\alpha})}\prod_{k=1}^{K}\theta_k^{\alpha_k - 1}\,d\vec{\theta} = \dfrac{1}{\Delta(\vec{\alpha})}\int\prod_{k=1}^{K}\theta_k^{n_k + \alpha_k - 1}\,d\vec{\theta} = \dfrac{\Delta(\vec{n}_m + \vec{\alpha})}{\Delta(\vec{\alpha})}$
$p(\boldsymbol{z}|\vec{\alpha}) = \prod_{m=1}^{M} p(\vec{z}_m|\vec{\alpha}) = \prod_{m=1}^{M}\dfrac{\Delta(\vec{n}_m + \vec{\alpha})}{\Delta(\vec{\alpha})}$  →
$p(\boldsymbol{w}, \boldsymbol{z}|\vec{\alpha}, \vec{\beta}) = \prod_{k=1}^{K}\dfrac{\Delta(\vec{n}_k + \vec{\beta})}{\Delta(\vec{\beta})}\prod_{m=1}^{M}\dfrac{\Delta(\vec{n}_m + \vec{\alpha})}{\Delta(\vec{\alpha})}$
Gibbs Sampling for LDA
 Gibbs Sampling in LDA
 $p(\vec{\theta}_m|\boldsymbol{z}_{\neg i}, \boldsymbol{w}_{\neg i}) = \mathrm{Dir}(\vec{\theta}_m|\vec{n}_{m,\neg i} + \vec{\alpha})$,  $p(\vec{\varphi}_k|\boldsymbol{z}_{\neg i}, \boldsymbol{w}_{\neg i}) = \mathrm{Dir}(\vec{\varphi}_k|\vec{n}_{k,\neg i} + \vec{\beta})$
$p(z_i = k|\boldsymbol{z}_{\neg i}, \boldsymbol{w}) \propto p(z_i = k, w_i = t, \vec{\theta}_m, \vec{\varphi}_k|\boldsymbol{z}_{\neg i}, \boldsymbol{w}_{\neg i}) = E[\theta_{mk}]\cdot E[\varphi_{kt}] = \hat\theta_{mk}\cdot\hat\varphi_{kt}$
$\hat\theta_{mk} = \dfrac{n_{m,\neg i}^{(k)} + \alpha_k}{\sum_{k=1}^{K}\big(n_{m,\neg i}^{(k)} + \alpha_k\big)}$,  $\hat\varphi_{kt} = \dfrac{n_{k,\neg i}^{(t)} + \beta_t}{\sum_{t=1}^{V}\big(n_{k,\neg i}^{(t)} + \beta_t\big)}$
$p(z_i = k|\boldsymbol{z}_{\neg i}, \boldsymbol{w}) \propto \dfrac{n_{m,\neg i}^{(k)} + \alpha_k}{\sum_{k=1}^{K}\big(n_{m,\neg i}^{(k)} + \alpha_k\big)}\times\dfrac{n_{k,\neg i}^{(t)} + \beta_t}{\sum_{t=1}^{V}\big(n_{k,\neg i}^{(t)} + \beta_t\big)}$
$z_i^{(t+1)} \sim p(z_i = k|\boldsymbol{z}_{\neg i}, \boldsymbol{w})$, for k = 1 … K
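A minimal numpy sketch of collapsed Gibbs sampling for LDA using the conditional above; the toy corpus, symmetric hyperparameters, and variable names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
docs = [[0, 1, 2, 1], [2, 3, 3, 4], [0, 4, 1, 3]]   # word ids per document
K, V, M = 2, 5, len(docs)
alpha, beta = 0.5, 0.1

n_mk = np.zeros((M, K))          # topic counts per document
n_kt = np.zeros((K, V))          # word counts per topic
n_k = np.zeros(K)                # total words per topic
z = [[rng.integers(K) for _ in d] for d in docs]    # random topic initialization
for m, d in enumerate(docs):
    for i, t in enumerate(d):
        k = z[m][i]
        n_mk[m, k] += 1; n_kt[k, t] += 1; n_k[k] += 1

for it in range(200):
    for m, d in enumerate(docs):
        for i, t in enumerate(d):
            k = z[m][i]                              # remove word i from the counts (the "not i" statistics)
            n_mk[m, k] -= 1; n_kt[k, t] -= 1; n_k[k] -= 1
            # p(z_i = k | z_{not i}, w) ∝ (n_mk + alpha) * (n_kt + beta) / (n_k + V*beta)
            p = (n_mk[m] + alpha) * (n_kt[:, t] + beta) / (n_k + V * beta)
            k = rng.choice(K, p=p / p.sum())
            z[m][i] = k
            n_mk[m, k] += 1; n_kt[k, t] += 1; n_k[k] += 1

phi = (n_kt + beta) / (n_kt + beta).sum(axis=1, keepdims=True)   # topic-word estimates
print(phi.round(2))
```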
Q&A