CIS 890 – Information Retrieval
Project Final Presentation

Topic Modeling with LDA Collocations on the NIPS Collection

Presenter: Svitlana Volkova
Instructor: Doina Caragea
Agenda
I. Introduction
II. Project Stages
III. Topic Modeling
    - LDA Model
    - HMMLDA Model
    - LDA-COL Model
IV. NIPS Collection
V. Experimental Results
VI. Conclusions
I. Project Overview
Generative vs. Discriminative Methods

Generative approaches produce a probability density model over all variables in a system and manipulate it to compute classification and regression functions.

Discriminative approaches attempt to compute the input-to-output mapping directly.
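A standard way to state the contrast in symbols (added for clarity; these are generic formulations, not taken from the slides):

```latex
% Generative: model the joint distribution, then classify via Bayes' rule
p(x, y) = p(y)\, p(x \mid y), \qquad
p(y \mid x) = \frac{p(y)\, p(x \mid y)}{\sum_{y'} p(y')\, p(x \mid y')}
% Discriminative: model the input-to-output mapping p(y | x) directly,
% e.g. logistic regression: p(y = 1 \mid x) = \sigma(w^\top x + b)
```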
From LSI -> to pLSA -> to LDA
polysemy/synonymy -> probability -> exchangeability

- TF-IDF: Salton and McGill (Sal'83)
- Latent Semantic Indexing (LSI): Deerwester et al. (Dee'90)
- Probabilistic Latent Semantic Indexing (pLSA): Hofmann (Hof'99)
- Latent Dirichlet Allocation (LDA): Blei et al. (Ble'03)

[Sal83] Gerard Salton and Michael J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York, NY, USA, 1983.
[Dee90] S. Deerwester, S. Dumais, T. Landauer, G. Furnas, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.
[Hof99] T. Hofmann. Probabilistic latent semantic indexing. Proceedings of the Twenty-Second Annual International SIGIR Conference, 1999.
[Ble03] Blei, D.M., Ng, A.Y., Jordan, M.I. Latent Dirichlet allocation. Journal of Machine Learning Research, 3, pp. 993-1022, 2003.
Topic Models: LDA
Language Models
- probability of the sequence of words

[Figure: example topic word distributions, "Text Mining" and "Healthy Food"]

- each word of both the observed and unseen documents is generated by a randomly chosen topic, which is drawn from a distribution over topics.
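In symbols (standard formulations, added for clarity and using the notation of the later slides):

```latex
% Language model: chain-rule probability of a word sequence
p(w_1, \dots, w_N) = \prod_{n=1}^{N} p(w_n \mid w_1, \dots, w_{n-1})
% Topic model: each word is generated by a randomly chosen topic z
p(w_n \mid d) = \sum_{z=1}^{T} p(w_n \mid z)\, p(z \mid d)
```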
Disadvantages of the "Bag-of-Words" Assumption

• TEXT ≠ a sequence of interchangeable, discrete word tokens
• The actual meaning cannot be captured by word co-occurrences alone
• Word order is not important for syntax, but it is important for lexical meaning
• Word order within nearby context and phrases is critical to capturing the meaning of text
Problem Statement
Collocations = word phrases?

• Noun phrases: "strong tea", "weapon of mass destruction"
• Phrasal verbs: "make up"
• Other phrases: "rich and powerful"
• A collocation is a phrase with a meaning beyond that of the individual words (e.g., "white house")

[Man'99] Manning, C., & Schutze, H. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press, 1999.
Problem Statement

- How can the "Information Retrieval" topic be represented?
  Unigrams -> …, information, search, …, web
- What about "Artificial Intelligence"?
  Unigrams -> agent, …, information, search, …
- Issues with using unigrams for topic modeling:
  • not representative enough for a single topic
  • ambiguous (concepts shared across topics): system, modeling, information, data, structure, …
II. Project Stages
Project Stages

1. NIPS data collection and preprocessing
   http://books.nips.cc/
2. Learning topic models on the NIPS collection with the MATLAB Topic Modeling Toolbox
   http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm
   - Model 1: LDA
   - Model 2: HMMLDA
   - Model 3: LDA-COL
3. Comparison of results for LDA, LDA-COL, HMMLDA, and n-grams
What are the limitations of using wiki concepts?

Wiki concepts mapped per topic:
- NLP: Information Retrieval: 2; Natural Language Processing: 8
- Active Learning (AL): Cognitive Science: 1
- Artificial Intelligence (AI): Cognitive Science: 3; Object Recognition: 1; Information Retrieval: 2; Natural Language Processing: 6
- Computer Vision: Object Recognition: 2; Visual Perception: 1
- Information Retrieval (IR): Information Retrieval: 35
- Machine Learning (ML): Object Recognition: 1; Natural Language Processing: 1

Limitations:
- the wiki concept graph
- following links
- the n-gram distribution over a single document is small
- choosing the right level of concept abstraction
III. Topic Models: LDA
Topic Modeling with Latent Dirichlet Allocation

- a word is represented as a multinomial random variable w
- a topic is represented as a multinomial random variable z
- a document is represented as a Dirichlet random variable θ
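A minimal sketch of this generative process (illustrative Python with numpy; the vocabulary size, topic count, and document length are made-up toy values, while ALPHA and BETA match the experiment settings later in the deck):

```python
import numpy as np

rng = np.random.default_rng(0)

V, T, N_d = 1000, 100, 50      # toy vocabulary size, topics, words per document
alpha, beta = 0.5, 0.01        # Dirichlet hyperparameters (as in the experiments)

# One multinomial over words per topic: phi[t] ~ Dirichlet(beta)
phi = rng.dirichlet(np.full(V, beta), size=T)

def generate_document():
    # theta ~ Dirichlet(alpha): the document's distribution over topics
    theta = rng.dirichlet(np.full(T, alpha))
    words = []
    for _ in range(N_d):
        z = rng.choice(T, p=theta)     # draw a topic for this position
        w = rng.choice(V, p=phi[z])    # draw a word from that topic
        words.append(w)
    return words

doc = generate_document()
```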
Topic Simplex

- each corner of the simplex corresponds to a topic, i.e., a component of the vector θ;
- a document is modeled as a point on the simplex: a multinomial distribution over the topics;
- a corpus is modeled as a Dirichlet distribution on the simplex.

http://www.cs.berkeley.edu/~jordan
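For reference, the Dirichlet density on the topic simplex in standard LDA notation (added here; not on the slide):

```latex
p(\theta \mid \alpha) =
\frac{\Gamma\!\left(\sum_{i=1}^{T}\alpha_i\right)}{\prod_{i=1}^{T}\Gamma(\alpha_i)}
\prod_{i=1}^{T} \theta_i^{\alpha_i - 1},
\qquad \theta_i \ge 0,\;\; \sum_{i=1}^{T}\theta_i = 1
```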
III. Topic Models: HMMLDA
Bigram Topic Models: Wallach's Model

Example bigram: "neural network"

• Wallach's Bigram Topic Model (Wal'05) is based on the hierarchical Dirichlet language model (Pet'94)

[Wal'05] Wallach, H. Topic modeling: beyond bag-of-words. NIPS 2005 Workshop on Bayesian Methods for Natural Language Processing, 2005.
[Pet'94] MacKay, D. J. C., & Peto, L. A hierarchical Dirichlet language model. Natural Language Engineering, 1, 1–19, 1994.
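The key change relative to LDA, as [Wal'05] describes it: each word is drawn from a distribution conditioned on both the topic and the previous word (sketched here in standard notation):

```latex
z_t \sim \mathrm{Multinomial}(\theta_d), \qquad
w_t \sim p\!\left(w_t \mid z_t,\; w_{t-1}\right) = \phi^{(w_{t-1},\, z_t)}_{w_t}
```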
III. Topic Models: LDA-COL
LDA-Collocation Model (Ste'05)

• The model can decide whether to generate a bigram or a unigram at each word position

[Ste'05] Steyvers, M., & Griffiths, T. Matlab topic modeling toolbox 1.3. http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm, 2005.
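In outline (paraphrasing the toolbox description; the binary switch is reported as the C vector in the model's output, described on a later slide):

```latex
w_k \sim
\begin{cases}
p(w_k \mid z_k) & \text{unigram: drawn from topic } z_k, \\[2pt]
p(w_k \mid w_{k-1}) & \text{collocation: drawn conditioned on the previous word.}
\end{cases}
```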
Methods for Collocation Discovery

- Frequency counting (Jus'95)
  Justeson, J. S., & Katz, S. M. Technical terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering, 1, 9–27, 1995.
- Variance-based collocation (Sma'93)
  Smadja, F. Retrieving collocations from text: Xtract. Computational Linguistics, 19, 143–177, 1993.
- Hypothesis testing -> assess whether two words occur together more often than by chance:
  - t-test (Chu'89)
    Church, K., & Hanks, P. Word association norms, mutual information and lexicography. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics (ACL) (pp. 76–83), 1989.
  - χ² test (Chu'91)
    Church, K. W., Gale, W., Hanks, P., & Hindle, D. Using statistics in lexical analysis. In Lexical Acquisition: Using On-line Resources to Build a Lexicon (pp. 115–164). Lawrence Erlbaum, 1991.
  - likelihood ratio test (Dun'93)
    Dunning, T. E. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19, 61–74, 1993.
- Mutual information (Hod'96)
  Hodges, J., Yie, S., Reighart, R., & Boggess, L. An automated system that assists in the generation of document indexes. Natural Language Engineering, 2, 137–160, 1996.
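A small illustration of one of these tests (the t-test, following Manning & Schütze's formulation; the token list below is a made-up toy example):

```python
import math
from collections import Counter

def t_test_collocations(tokens, min_count=5):
    """Score adjacent word pairs with the collocation t-test.

    t = (x_bar - mu) / sqrt(s^2 / N), with x_bar = C(w1,w2)/N,
    mu = P(w1)P(w2) under the independence null, and s^2 ~= x_bar.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    N = len(tokens)
    scores = {}
    for (w1, w2), c12 in bigrams.items():
        if c12 < min_count:
            continue
        x_bar = c12 / N
        mu = (unigrams[w1] / N) * (unigrams[w2] / N)
        # t > ~2.576 rejects independence at the 0.5% level
        scores[(w1, w2)] = (x_bar - mu) / math.sqrt(x_bar / N)
    return sorted(scores.items(), key=lambda kv: -kv[1])

tokens = "the neural network learns the neural network weights".split()
print(t_test_collocations(tokens, min_count=2)[:3])
```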
Topical N-grams

HMMLDA captures word dependencies:
- HMM -> short-range syntactic dependencies
- LDA -> long-range semantic dependencies

[Wan'07] Xuerui Wang, Andrew McCallum and Xing Wei. Topical N-grams: Phrase and Topic Discovery, with an Application to Information Retrieval. Proceedings of the 7th IEEE International Conference on Data Mining (ICDM), 2007. http://www.cs.umass.edu/~mccallum/papers/tng-icdm07.pdf
Topical N-grams (continued)

[Wan'07], cited above.
IV. Data Collection: NIPS Abstracts
NIPS Collection

NIPS Collection Characteristics:
  Number of words       W = 13649
  Number of docs        D = 1740
  Number of topics      T = 100
  Number of iterations  N = 50
  LDA hyperparameter    ALPHA = 0.5
  LDA hyperparameter    BETA = 0.01

[Figure: randomly sampled document titles from the NIPS collection]
LDA Model Input/Output

Inputs:
- WS: a 1 x N vector, where WS(k) contains the vocabulary index of the k-th word token and N is the number of word tokens.
- DS: a 1 x N vector, where DS(k) contains the document index of the k-th word token.

Outputs:
- WP: a sparse W x T matrix; WP(i,j) contains the number of times word i has been assigned to topic j.
- DP: a sparse D x T matrix; DP(d,j) contains the number of times a word token in document d has been assigned to topic j.
- Z: a 1 x N vector of topic assignments; Z(k) contains the topic assignment for token k.
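A sketch of how these structures relate (illustrative Python; the toolbox itself is MATLAB, so the names WS, DS, WP mirror its conventions rather than any Python API, and MATLAB's 1-based indices become 0-based here):

```python
import numpy as np

def build_ws_ds(docs, vocab):
    """Flatten tokenized documents into the toolbox-style WS/DS token streams."""
    word_to_id = {w: i for i, w in enumerate(vocab)}
    WS, DS = [], []
    for d, doc in enumerate(docs):
        for w in doc:
            WS.append(word_to_id[w])   # vocabulary index of this token
            DS.append(d)               # document index of this token
    return np.array(WS), np.array(DS)

def top_words_per_topic(WP, vocab, n=5):
    """Read the n most frequent words per topic from a W x T count matrix."""
    return [[vocab[i] for i in np.argsort(-WP[:, t])[:n]]
            for t in range(WP.shape[1])]

docs = [["neural", "network", "learning"], ["retrieval", "search", "network"]]
vocab = sorted({w for doc in docs for w in doc})
WS, DS = build_ws_ds(docs, vocab)
```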
HMMLDA Model Input/Output

Inputs:
- WS: a 1 x N vector, where WS(k) contains the vocabulary index of the k-th word token and N is the number of word tokens.
- DS: a 1 x N vector, where DS(k) contains the document index of the k-th word token.

Outputs:
- WP: a sparse W x T matrix; WP(i,j) contains the number of times word i has been assigned to topic j.
- DP: a sparse D x T matrix; DP(d,j) contains the number of times a word token in document d has been assigned to topic j.
- MP: a sparse W x S matrix, where S is the number of HMM states; MP(i,j) contains the number of times word i has been assigned to HMM state j.
- Z: a 1 x N vector of topic assignments; Z(k) contains the topic assignment for token k.
- X: a 1 x N vector of HMM state assignments; X(k) contains the assignment of the k-th word token to an HMM state.
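Continuing the sketch above, the extra MP matrix can be inspected the same way to see which words each HMM state absorbed (illustrative; with no stop-word removal, as noted in the conclusions, syntactic states tend to fill with function words):

```python
import numpy as np

def top_words_per_state(MP, vocab, n=5):
    """Most frequent words for each HMM state from a W x S count matrix MP."""
    return [[vocab[i] for i in np.argsort(-MP[:, s])[:n]]
            for s in range(MP.shape[1])]
```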
LDA-COL Model Input/Output

Inputs:
- WS: a 1 x N vector, where WS(k) contains the vocabulary index of the k-th word token and N is the number of word tokens.
- DS: a 1 x N vector, where DS(k) contains the document index of the k-th word token.
- WW: a sparse W x W matrix, where WW(i,j) contains the number of times word i follows word j in the word stream.
- SI: a 1 x N vector, where SI(k)=1 only if the k-th word can form a collocation with the (k-1)-th word, and SI(k)=0 otherwise.

Outputs:
- WP: a sparse W x T matrix; WP(i,j) contains the number of times word i has been assigned to topic j.
- DP: a sparse D x T matrix; DP(d,j) contains the number of times a word token in document d has been assigned to topic j.
- WC: a 1 x W vector, where WC(k) contains the number of times word k led to a collocation with the next word in the stream.
- Z: a 1 x N vector of topic assignments; Z(k) contains the topic assignment for token k.
- C: a 1 x N vector of topic/collocation assignments; C(k)=0 when token k was assigned to the topic model, and C(k)=1 when token k was assigned to a collocation with token k-1.
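A sketch of preparing the two extra inputs (illustrative Python; treating every non-initial token as a collocation candidate is a simplifying assumption, not the preprocessing the project actually used):

```python
import numpy as np
from scipy.sparse import lil_matrix

def build_ww_si(WS, W):
    """WW(i,j): count of word i following word j in the stream;
    SI(k): 1 if token k may form a collocation with token k-1."""
    WW = lil_matrix((W, W), dtype=np.int32)
    SI = np.zeros(len(WS), dtype=np.int8)
    for k in range(1, len(WS)):
        WW[WS[k], WS[k - 1]] += 1
        SI[k] = 1  # simplification: any token after the first is a candidate
    return WW.tocsr(), SI
```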
V. Experimental Results
Experiment Setup

1. 100 topics
2. Gibbs sampling, 50 iterations
3. Optimized parameters:

   LDA:      ALPHA = 0.5, BETA = 0.01
   HMMLDA:   ALPHA = 0.5, BETA = 0.01, GAMMA = 0.1
   LDA-COL:  ALPHA = 0.5, BETA = 0.01, GAMMA0 = 0.1, GAMMA1 = 0.1

[Gri'04] Griffiths, T., & Steyvers, M. Finding Scientific Topics. Proceedings of the National Academy of Sciences, 101 (suppl. 1), 5228–5235, 2004.
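For intuition, a compact sketch of the collapsed Gibbs sampler these runs rely on (following the Griffiths & Steyvers update rule; a toy re-implementation, not the toolbox's code):

```python
import numpy as np

def gibbs_lda(WS, DS, W, D, T=100, alpha=0.5, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA over the WS/DS token streams."""
    rng = np.random.default_rng(seed)
    N = len(WS)
    Z = rng.integers(T, size=N)          # random initial topic assignments
    WP = np.zeros((W, T))                # word-topic counts
    DP = np.zeros((D, T))                # document-topic counts
    NT = np.zeros(T)                     # tokens per topic
    for k in range(N):
        WP[WS[k], Z[k]] += 1; DP[DS[k], Z[k]] += 1; NT[Z[k]] += 1
    for _ in range(iters):
        for k in range(N):
            w, d, z = WS[k], DS[k], Z[k]
            WP[w, z] -= 1; DP[d, z] -= 1; NT[z] -= 1   # remove token k
            # p(z=j | rest) ∝ (WP[w,j]+beta)/(NT[j]+W*beta) * (DP[d,j]+alpha)
            p = (WP[w] + beta) / (NT + W * beta) * (DP[d] + alpha)
            z = rng.choice(T, p=p / p.sum())
            Z[k] = z
            WP[w, z] += 1; DP[d, z] += 1; NT[z] += 1   # reinsert with new topic
    return WP, DP, Z
```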
LDA Model Results
Hidden Markov Model with Latent Dirichlet Allocation (HMMLDA) Model Results

[Hsu'06] Hsu, B. J., & Glass, J. Style and topic language model adaptation using HMM-LDA. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP), 2006.
LDA-COL Model Results
LDA vs. HMMLDA vs. LDA-COL
LDAs vs. Topical N-grams

[Wan'07] Xuerui Wang, Andrew McCallum and Xing Wei. Topical N-grams: Phrase and Topic Discovery, with an Application to Information Retrieval. Proceedings of the 7th IEEE International Conference on Data Mining (ICDM), 2007. http://www.cs.umass.edu/~mccallum/papers/tng-icdm07.pdf
LDAs vs. Topical N-grams
VI. Conclusions
Conclusions

I. HMMLDA showed the worst results because stop-word removal was not performed.
II. LDA-COL performed best compared to LDA and HMMLDA, but worse than topical n-gram models.

Future Work
- Polylingual Topic Models

[Mim'09] D. Mimno, H. M. Wallach, J. Naradowsky, D. A. Smith, and A. McCallum. "Polylingual topic models," in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. Singapore: Association for Computational Linguistics, August 2009, pp. 880–889. http://www.aclweb.org/anthology/D/D09/D09-1092.pdf
Acknowledgments

University of California, Irvine, Department of Cognitive Sciences, for the MATLAB Topic Modeling Toolbox

Dr. Caragea

Questions
