Basic review on topic modeling
Physical and Health Education, University of Tokyo
D1 Hiroyuki Kuromiya
1
● I'm a first-year student in the Ph.D.
course in Physical and Health
Education.
● Despite the name of my course, I
am currently working on learning
analytics of research-based active
learning.
● The data I have to analyze are
often in text format. That's why I
am attending this class.
Hiroyuki Kuromiya
2
Today, I am going to introduce 5 papers about topic modeling.
● Indexing by Latent Semantic Analysis (Deerwester+, 1990)
● Probabilistic Latent Semantic Indexing (Hofmann, 1999)
● Latent Dirichlet Allocation (Blei+, 2003)
● Gaussian LDA for Topic Models with Word Embeddings (Das+, 2015)
● What is Wrong with Topic Modeling? (Agrawal+, 2018)
3
Since I don't have enough time to introduce the whole content of each paper, I want
to focus on the five questions listed below.
● What is their motivation?
● What is the key point of their paper?
● What is their model?
● How do they estimate its parameters?
● What are the deficiencies of their model?
4
5
“a topic model is a type of statistical
model for discovering the abstract
"topics" that occur in a collection of
documents.”
(Wikipedia, “Topic model”, accessed on
May 3, 2018)
(Tomoharu Iwata, トピックモデル [Topic Models], 2015, p. vii)
6
VSM is one of the most popular families of information retrieval techniques.
A VSM is characterised by three ingredients:
1. a transform function (also called a local term weight, such as term frequency)
2. a term weighting scheme (also called a global term weight, such as inverse
document frequency)
3. a similarity measure, such as cosine similarity
We represent semantic distance as spatial distance.
Hofmann (1999), Probabilistic Latent Semantic Indexing, Section 5.1
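A minimal sketch of the three ingredients on a made-up term-by-document matrix, assuming TF as the local weight, IDF as the global weight, and cosine similarity as the measure (all numbers are illustrative):

```python
import numpy as np

# Toy term-by-document counts: rows are terms, columns are documents.
tf = np.array([[2, 0, 1],
               [0, 3, 0],
               [1, 1, 1]], dtype=float)
df = (tf > 0).sum(axis=1)                 # document frequency per term
idf = np.log(tf.shape[1] / df)            # global weight (IDF)
X = tf * idf[:, None]                     # weighted term-by-document matrix

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

query = np.array([1.0, 0.0, 1.0]) * idf   # query weighted the same way
scores = [cosine(query, X[:, j]) for j in range(X.shape[1])]
```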
The vector space model for scoring
7
Deerwester, S., Dumais, S. T., Furnas, G. W.,
Landauer, T. K., & Harshman, R. (1990)
8
● Deerwester belonged to the Graduate
Library School, University of Chicago.
● The aim of the study was to improve
information retrieval systems.
● They thought there was a fundamental
problem in existing retrieval techniques,
which try to match the words of queries
with the words of documents.
Scott Deerwester (1956-)
9
If the query is “IDF in computer-based information look-up”, we think that
documents 1 and 3 are relevant. However, a simple term-matching method would
return documents 2 and 3.
Document 1 would not be returned because of the synonymy of “look-up”, and
document 2 would be returned because of the polysemy of “information”.
        access  document  retrieval  information  theory  database  indexing  computer
Doc1      1        1         1                              1         1
Doc2                                     1          1                             1
Doc3                         1           1                                        1
10
They introduced a “semantic space” wherein terms
and documents that are closely associated are
placed near one another. By using the “semantic space”,
● we can get rid of obscuring noise in the data
● we can capture the conceptual content that users
are really seeking
11
Considering representational richness, explicit representation of both terms and
documents, and computational tractability, they proposed two-mode factor analysis,
i.e., Singular Value Decomposition.
X = T0 S0 D0^T
where X is the t × d term-by-document matrix (t terms, d documents), T0 is t × m,
S0 is the m × m diagonal matrix of singular values, and D0^T is m × d.
12
Suppose the u’s are the eigenvectors of AA^T and the v’s are the eigenvectors of A^T A.
Since those matrices are both symmetric, their eigenvectors can be chosen
orthonormal.
The simple fact that A v_i = σ_i u_i
leads to A V = U Σ.
It tells us that A = U Σ V^{-1}.
Considering V is an orthonormal matrix (V^{-1} = V^T), it becomes A = U Σ V^T.
Strang, Gilbert, et al. Introduction to linear algebra. Vol. 4. Wellesley, MA: Wellesley-Cambridge Press, 2009.
13
● It begins with an arbitrary rectangular matrix (cf. one-mode factor analysis
requires A to be a square matrix).
● It allows us to approximate the original matrix using smaller matrices.
It is important that the derived k-dimensional factor space does not
reconstruct the original term space perfectly, because this is exactly what
removes the noise from the original data (cf. the Python sketch of SVD below).
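A minimal sketch of the rank-k truncation, assuming a made-up term-by-document matrix X and reusing the T0/S0/D0 naming from the slides:

```python
import numpy as np

# Toy term-by-document matrix (terms as rows, documents as columns).
X = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)

# Full SVD: X = T0 @ diag(s) @ D0t, singular values sorted descending.
T0, s, D0t = np.linalg.svd(X, full_matrices=False)

# Keep only the k largest singular values: the best rank-k approximation.
# The imperfect reconstruction is the point -- small components are "noise".
k = 2
Xk = T0[:, :k] @ np.diag(s[:k]) @ D0t[:k, :]
print(np.round(Xk, 2))
```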
14
1. Using T and D, construct the
semantic space
2. Find a representation for the query
as a pseudo-document, q_hat = q^T T S^{-1}
(a sketch follows below)
3. Calculate the cosine similarity between
the query and the documents
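Continuing the SVD sketch above: the query's raw term vector q is folded into the k-dimensional space and compared with the documents by cosine similarity. The query terms chosen here are arbitrary:

```python
q = np.array([1.0, 0.0, 0.0, 1.0])               # query as a raw term vector
q_hat = q @ T0[:, :k] @ np.diag(1.0 / s[:k])     # fold-in: q^T T_k S_k^{-1}
docs = D0t[:k, :].T                              # documents in semantic space
sims = docs @ q_hat / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q_hat))
print(np.argsort(-sims))                         # documents, best match first
```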
15
● Precision of the LSI method lies well
above that obtained with term
matching, SMART, and Voorhees.
● The average difference in precision
between LSI and term matching is
.06, which represents a 13%
improvement over raw term matching.
16
● Every query requires an exhaustive
comparison of the query vector
against all stored document vectors.
● The initial SVD analysis is time-consuming,
and the model is hard to update.
● The latent factors lack a statistical
foundation.
“Roughly speaking, these
factors may be thought
of as artificial concepts”
(Section 4.1)
17
Thomas Hofmann (1999)
18
● He belonged to the International
Computer Science Institute,
Berkeley, CA.
● In order for computers to interact
more naturally with humans, natural-language
queries are needed.
● Although LSA has been applied
with remarkable success in
different domains, it does not have
a satisfactory statistical foundation.
Thomas Hofmann (1968-)
19
He presented a novel approach to LSA that has a solid statistical foundation,
based on the likelihood principle and a proper generative model of the data.
● He used a statistical latent class model called the “aspect model”.
● The model is fitted with an annealed (tempered) EM algorithm and evaluated by word perplexity.
20
The aspect model is a latent variable model for general co-occurrence data which
associates an unobserved class variable z with each observation.
1. Select a document d with probability P(d)
2. Pick a latent class z with probability P(z|d)
3. Generate a word w with probability P(w|z)
We call P(z|d) the “aspects”.
21
The probability of a single (document, word) pair is written below:
P(d, w) = P(d) P(w|d)   (product rule)
P(w|d) = Σ_z P(w|z) P(z|d)   (marginalization)
Hence, the joint probability of the whole data set is the product of P(d, w)
over all observed pairs.
22
There are three parameters in the (symmetric) aspect model: p(z), p(w|z), and p(d|z).
We use the Expectation-Maximization (EM) algorithm to estimate them.
The EM algorithm is the standard procedure for maximum likelihood estimation in
latent variable models.
Before explaining the EM algorithm, let me try ordinary maximum likelihood estimation.
23
Let n(d, w) be the term frequency of w in document d. The likelihood of the model is
L = Π_{d,w} P(d, w)^{n(d,w)}
Hence, the log-likelihood is written as
ℓ = Σ_{d,w} n(d, w) log Σ_z P(z) P(d|z) P(w|z)
I want to maximize this log-likelihood, but the log-sum structure of the equation is
hard to differentiate.
24
By Jensen’s inequality, the log-likelihood is bounded below by
F = Σ_{d,w} n(d, w) Σ_z P(z|d,w) log [ P(z) P(d|z) P(w|z) / P(z|d,w) ]
Maximizing F with the method of Lagrange multipliers leads to the update equations
P(w|z) ∝ Σ_d n(d, w) P(z|d,w),  P(d|z) ∝ Σ_w n(d, w) P(z|d,w),  P(z) ∝ Σ_{d,w} n(d, w) P(z|d,w)
25
1. Initialize the parameters p(z), p(w|z), p(d|z)
2. E-step
Calculate p(z|d,w) from the current parameters p(z), p(w|z), p(d|z)
3. M-step
Update the parameters p(z), p(w|z), p(d|z) using the p(z|d,w) that has just been
calculated
4. Re-calculate the log-likelihood. Repeat the E-step and M-step until
|new log-likelihood - old log-likelihood| < ε.
Python code for a pLSA implementation (a sketch follows)
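A minimal NumPy sketch of this EM loop, assuming a dense document-term count matrix n_dw and the symmetric parameterization above (illustrative, not Hofmann's implementation):

```python
import numpy as np

def plsa(n_dw, K, n_iter=100, tol=1e-6, seed=0):
    """Fit a pLSA aspect model to a (D, W) count matrix with plain EM."""
    rng = np.random.default_rng(seed)
    D, W = n_dw.shape
    p_z = rng.dirichlet(np.ones(K))              # p(z)
    p_d_z = rng.dirichlet(np.ones(D), size=K)    # p(d|z), shape (K, D)
    p_w_z = rng.dirichlet(np.ones(W), size=K)    # p(w|z), shape (K, W)
    old_ll = -np.inf
    for _ in range(n_iter):
        # E-step: p(z|d,w) proportional to p(z) p(d|z) p(w|z)
        joint = p_z[:, None, None] * p_d_z[:, :, None] * p_w_z[:, None, :]
        p_dw = joint.sum(axis=0) + 1e-12         # p(d, w)
        post = joint / p_dw                      # p(z|d,w)
        # M-step: re-estimate all parameters from expected counts
        weighted = post * n_dw                   # n(d,w) * p(z|d,w)
        p_d_z = weighted.sum(axis=2)
        p_d_z /= p_d_z.sum(axis=1, keepdims=True)
        p_w_z = weighted.sum(axis=1)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z = weighted.sum(axis=(1, 2))
        p_z /= p_z.sum()
        # Stop when the log-likelihood change falls below epsilon
        ll = (n_dw * np.log(p_dw)).sum()
        if abs(ll - old_ll) < tol:
            break
        old_ll = ll
    return p_z, p_d_z, p_w_z
```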
26
● It is interesting to see that pLSA
captures two different senses of “flight”
and “love” in its topics: it
distinguishes the polysemy of words.
● The experiments consistently
validate the advantages of pLSI over
LSI.
27
Some points are derived from Blei et al. (2003).
● pLSA is prone to overfitting.
○ They use tempered EM, an improved version of the EM algorithm, to avoid overfitting, but
it is not a fundamental solution.
● There is no statistical foundation at the level of documents.
○ In pLSA, each document is represented as its mixing proportions over topics, and there is no
generative probabilistic model for these numbers.
○ As a result, the number of parameters in the model grows linearly with the size of the
corpus, which leads to overfitting.
○ Plus, it is not clear how to assign probability to a document outside of the training set.
28
David M. Blei, Andrew Y. Ng, Michael I. Jordan (2003)
29
● Blei was in the Computer Science Division,
University of California, Berkeley.
● This paper considers the problem of modeling
text corpora and other collections of
discrete data.
● They thought pLSA was incomplete because it
provides no probabilistic model at the level
of documents.
30
exchangeability of both words and documents
● LSI and pLSI are based on the “bag-of-words” assumption, that the order of
words in a document can be neglected; it is less often stated that documents
are exchangeable as well as words.
● de Finetti (1990) establishes that any collection of infinitely exchangeable random
variables has a representation as a mixture distribution.
This leads to latent Dirichlet allocation, in which words and topics are infinitely
exchangeable within a document.
31
LDA assumes the following generative process for each document in a corpus
(a sketch follows below).
1. Choose N ~ Poisson(ξ)
2. Choose θ ~ Dirichlet(α)
3. For each of the N words w_n:
a. Choose a topic z_n ~ Multinomial(θ)
b. Choose a word w_n from p(w_n|z_n, β), a multinomial conditioned on the topic z_n
Note that θ is a document-level variable, sampled once per document.
The joint distribution is given by
p(θ, z, w | α, β) = p(θ|α) Π_{n=1}^{N} p(z_n|θ) p(w_n|z_n, β)
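A minimal sketch of this generative story, with made-up sizes and a randomly drawn topic-word matrix β standing in for a fitted one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: K topics, V vocabulary words; beta is a (K, V)
# topic-word matrix whose rows sum to one.
K, V = 3, 50
alpha = np.full(K, 0.5)
beta = rng.dirichlet(np.ones(V), size=K)

def generate_document(xi=20):
    """Sample one document following LDA's generative process."""
    N = rng.poisson(xi)                 # 1. length N ~ Poisson(xi)
    theta = rng.dirichlet(alpha)        # 2. proportions theta ~ Dir(alpha)
    words = []
    for _ in range(N):
        z = rng.choice(K, p=theta)      # 3a. topic z_n ~ Multinomial(theta)
        w = rng.choice(V, p=beta[z])    # 3b. word w_n ~ Multinomial(beta_z)
        words.append(w)
    return words
```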
32
The Dirichlet distribution is conjugate to the
multinomial distribution.
The figure on the right shows the Dirichlet
distribution at different values of α. You will see
that it simply encodes natural human inference:
for example, if α = (5, 2, 2), θ1 tends to be high.
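A tiny illustration of what conjugacy buys us, with made-up counts: the posterior is again a Dirichlet, obtained simply by adding the observed counts to α:

```python
import numpy as np

alpha = np.array([5.0, 2.0, 2.0])    # Dirichlet prior pseudo-counts
counts = np.array([10, 30, 5])       # observed topic assignments (made up)
posterior = alpha + counts           # conjugacy: posterior is Dir(alpha + n)
print(posterior / posterior.sum())   # posterior mean of theta
```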
https://cs.stanford.edu
33
The key inferential problem is to compute the posterior distribution of the hidden
variables given a document.
Integrating over θ and summing over z, we obtain the marginal distribution of a
document:
p(w | α, β) = ∫ p(θ|α) ( Π_{n=1}^{N} Σ_{z_n} p(z_n|θ) p(w_n|z_n, β) ) dθ
This is intractable due to the coupling between θ and β in the summation over
latent topics. Thus we apply approximate inference to estimate the parameters.
34
1. E-step: find optimizing values of the variational parameters γ and φ
a. Variational inference for γ and φ:
i. Initialize γ and φ
ii. Repeat until convergence: φ_ni ∝ β_{i,w_n} exp(Ψ(γ_i)), then γ_i = α_i + Σ_n φ_ni
2. M-step: maximize the lower bound on the log-likelihood with respect to the
model parameters α and β.
Source code for Python (a sketch using scikit-learn follows)
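A hedged end-to-end sketch using scikit-learn's LatentDirichletAllocation, which implements variational-Bayes LDA; the toy corpus and parameter choices are mine, not the paper's:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus: two documents about flying, two about film.
docs = [
    "the pilot flew the plane",
    "the plane landed at the airport",
    "the film had a great plot",
    "the actor starred in the film",
]
X = CountVectorizer().fit_transform(docs)        # document-term count matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)                # variational E[theta | d]
print(doc_topics.round(2))                       # one topic-mixture row per doc
```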
35
● LDA consistently performs better than
the other methods: the unigram model,
the mixture of unigrams, and pLSI.
● For the classification task, the
performance improves with LDA
features.
● For the collaborative filtering task
(EachMovie), the best predictive
performance was obtained by the
LDA model.
36
● Order effects (cf. Agrawal et al., 2018)
Different topics are generated if the training data is shuffled, since the internal
weights are updated via a stochastic sampling process. Such effects introduce a
systematic error into any study.
● Topic coherence (cf. Das et al., 2015)
A prior preference for semantic coherence is not encoded in the model, so
some topics can look accidental to human evaluators.
● It cannot handle out-of-vocabulary (OOV) words (cf. Das et al., 2015)
37
38
Let us write the aspect model in matrix notation. Define the matrices
U = (P(d_i|z_k))_{i,k},  V = (P(w_j|z_k))_{j,k},  Σ = diag(P(z_k))_k
The joint probability model P can then be written as the matrix product
P = U Σ V^T
Although there is a fundamental difference between LSA and pLSA, pLSA can thus
also be seen as a dimensionality reduction method.
39
An M-dimensional multinomial distribution
can be represented as a point on the (M-1)-dimensional
simplex of all possible
multinomials.
Since the dimensionality of the
sub-simplex (the probabilistic latent
semantic space) is K-1, as opposed to M-1
for the complete probability simplex, this
can also be thought of as dimensionality
reduction.
40
The topic simplex for three topics is
embedded in the word simplex for three
words.
The pLSI model induces an empirical
distribution on the topic simplex, denoted
by the x marks. LDA places a smooth
distribution on the topic simplex, denoted
by the contour lines.
41
The boxes are “plates” representing
replicates. The outer plate represents
documents, while the inner plate
represents the repeated choice of
topics and words within a document.
You can easily see that LDA assumes a
generative model at the level of
documents.
(d) LDA model
42
Rajarshi Das, Manzil Zaheer, Chris Dyer (2015)
43
● Das was a second-year Ph.D. student in the
School of Computer Science, Carnegie Mellon
University.
● They propose a new technique for
topic modeling that uses word embeddings
(Mikolov et al., 2013).
44
According to the distributional hypothesis,
words occurring in similar contexts tend to have
similar meanings.
This has given rise to data-driven learning of
word vectors that capture lexical and semantic
properties (e.g. word2vec).
They assume that rather than consisting of
sequences of word types, documents consist of
sequences of word embeddings.
45
Since our observations are no longer discrete values but continuous vectors in an
M-dimensional space, we characterize each topic k as a multivariate Gaussian
distribution with mean μ_k and covariance Σ_k.
The generative process can thus be summarized as follows.
46
1. for k = 1 to K
a. Draw topic covariance Σ_k ~ Inverse-Wishart(Ψ, ν)
b. Draw topic mean μ_k ~ N(μ_0, Σ_k / κ)
2. for each document d in a corpus D
a. Draw topic distribution θ_d ~ Dirichlet(α)
b. for each word
i. Draw a topic z ~ Categorical(θ_d)
ii. Draw the embedded vector v ~ N(μ_z, Σ_z)
(A runnable sketch follows below.)
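A minimal sketch of this generative process; the Normal-Inverse-Wishart priors follow the setup above, but all sizes and hyperparameter values here are illustrative, not the paper's:

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(0)

# Illustrative sizes: K topics, M-dimensional embeddings, D documents.
K, M, D = 3, 5, 2
alpha = np.full(K, 0.7)          # Dirichlet prior on topic proportions
kappa, nu = 0.1, M + 2           # NIW concentration and degrees of freedom
mu0, Psi = np.zeros(M), np.eye(M)

# 1. Per-topic parameters: Sigma_k ~ Inv-Wishart, mu_k ~ N(mu0, Sigma_k/kappa)
Sigmas = [invwishart.rvs(df=nu, scale=Psi, random_state=rng) for _ in range(K)]
mus = [rng.multivariate_normal(mu0, S / kappa) for S in Sigmas]

# 2. Per document: theta_d ~ Dir(alpha); per word: topic z, then embedding v
docs = []
for _ in range(D):
    theta = rng.dirichlet(alpha)
    words = []
    for _ in range(10):                                  # 10 tokens per doc
        z = rng.choice(K, p=theta)                       # topic assignment
        v = rng.multivariate_normal(mus[z], Sigmas[z])   # observed embedding
        words.append((z, v))
    docs.append(words)
```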
47
We wish to infer the posterior distribution over the topic parameters, the topic
proportions, and the topic assignments of individual words.
We use a collapsed Gibbs sampler to infer them.
We can make the sampling faster by maintaining a Cholesky decomposition of each
covariance matrix.
Source code for Python (a sketch of the Cholesky trick follows)
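The paper maintains Cholesky factors with rank-one updates as words enter and leave topics; the sketch below shows only the core idea, evaluating a Gaussian log-density from a factor L instead of an explicit inverse and determinant:

```python
import numpy as np
from scipy.linalg import solve_triangular

def log_gaussian_density(x, mu, Sigma):
    """log N(x; mu, Sigma) via Cholesky: with Sigma = L @ L.T, the quadratic
    form and the log-determinant both come from one triangular solve."""
    L = np.linalg.cholesky(Sigma)
    y = solve_triangular(L, x - mu, lower=True)   # solves L y = (x - mu)
    M = x.shape[0]
    # log|Sigma| = 2 * sum(log diag(L)); quadratic form = y @ y
    return -0.5 * (M * np.log(2 * np.pi) + y @ y) - np.log(np.diag(L)).sum()
```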
48
● To measure topic coherence, they
follow prior work and compute the
Pointwise Mutual Information (PMI) of
topic words (see the sketch after this list).
● Gaussian LDA is a clear winner,
achieving a score that is on average
275% higher.
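A minimal sketch of PMI-based coherence, assuming hypothetical unigram and pair probabilities p_word and p_pair estimated from a reference corpus (the names and setup here are mine, not the paper's):

```python
import numpy as np
from itertools import combinations

def pmi(p_ij, p_i, p_j, eps=1e-12):
    """Pointwise mutual information of a word pair."""
    return np.log((p_ij + eps) / (p_i * p_j + eps))

def topic_coherence(top_words, p_word, p_pair):
    """Average PMI over all pairs of a topic's top words."""
    scores = [pmi(p_pair.get((wi, wj), 0.0), p_word[wi], p_word[wj])
              for wi, wj in combinations(top_words, 2)]
    return float(np.mean(scores))
```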
49
● They select a subset of documents and
replace words in those documents with
synonyms that have not occurred in the
corpus before.
● Compared with a recently proposed extension
of LDA that can handle unseen words
(infvoc), Gaussian LDA performs better
here, too.
50
Amritanshu Agrawal, Wei Fu, Tim Menzies (2018)
51
This is just a summary of the paper. I didn't have enough time to read it
because I spent a lot of time trying to understand the parameter inference parts of
pLSA and LDA. Sorry for my poor planning.
● Motivation: the current great challenge in software analytics is understanding
unstructured data.
● Key point: tuning the right parameters fixes “order effects” in LDA.
● Model: they propose LDADE, a search-based software engineering tool which
uses Differential Evolution (DE) to tune LDA's parameters (a sketch follows below).
● Results: LDADE's tunings dramatically reduce cluster instability and lead to
improved performance for supervised as well as unsupervised learning.
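A hedged sketch of the LDADE idea using SciPy's differential_evolution as the search engine; the objective below is a placeholder (LDADE's real objective compares topics across repeated runs on shuffled data), and the bounds are illustrative:

```python
import numpy as np
from scipy.optimize import differential_evolution

def stability_objective(params):
    """Hypothetical objective: train LDA several times on shuffled data with
    these settings and return negated topic stability (DE minimizes)."""
    k, alpha, beta = int(params[0]), params[1], params[2]
    # ... train LDA repeatedly, measure overlap of top topic words ...
    return -np.random.rand()   # placeholder so the sketch runs end to end

bounds = [(10, 100), (0.01, 1.0), (0.01, 1.0)]   # ranges for k, alpha, beta
result = differential_evolution(stability_objective, bounds, maxiter=10, seed=1)
print(result.x)                                   # tuned (k, alpha, beta)
```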
52
● Since 1990, topic modeling has been in constant demand, although the social
background and researchers' motivations have changed.
● Topic models are easy to extend or to combine with other probabilistic models;
the models have become more complex over time.
● The ways of estimating parameters have evolved so that they can deal with
more flexible models.
● The parameter inference parts were very difficult for me to understand. I will need
some mathematical training, especially in optimization, which builds on
linear algebra and probability theory.
53