Parsimonious topic models with salient word discovery

Abstract:
We propose a parsimonious topic model for text corpora. In related models
such as Latent Dirichlet Allocation (LDA), all words are modeled topic-
specifically, even though many words occur with similar frequencies across
different topics. Our modeling determines salient words for each topic,
which have topic-specific probabilities, with the rest explained by a
universal shared model. Further, in LDA all topics are in principle present
in every document. By contrast, our model gives sparse topic
representation, determining the (small) subset of relevant topics for each
document. We derive a Bayesian Information Criterion (BIC), balancing
model complexity and goodness of fit. Here, interestingly, we identify an
effective sample size and corresponding penalty specific to each parameter
type in our model. We minimize BIC to jointly determine our entire
model—the topic-specific words, document-specific topics, all model
parameter values, and the total number of topics—in a wholly
unsupervised fashion. Results on three text corpora and an image dataset
show that our model achieves higher test set likelihood and better
agreement with ground-truth class labels, compared to LDA and to a
model designed to incorporate sparsity.
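The split between salient and shared words can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' implementation: the function name topic_word_probs, the boolean salient mask, and the per-topic renormalization are all hypothetical choices made here so that each topic's word distribution sums to one.

```python
import numpy as np

def topic_word_probs(shared, topic_specific, salient):
    """Build per-topic word distributions from a shared model plus salient words.

    shared:         (V,) word probabilities under the universal shared model
    topic_specific: (K, V) candidate topic-specific word probabilities
    salient:        (K, V) boolean mask, True where a word is salient for a topic
    Renormalizing each row is an assumption of this sketch.
    """
    shared = np.asarray(shared, dtype=float)
    topic_specific = np.asarray(topic_specific, dtype=float)
    salient = np.asarray(salient, dtype=bool)
    # Salient words take the topic-specific value; all others fall back to shared.
    mixed = np.where(salient, topic_specific, shared)
    return mixed / mixed.sum(axis=1, keepdims=True)

# Toy vocabulary of 4 words and 2 topics (made-up numbers).
shared = np.array([0.4, 0.3, 0.2, 0.1])
topic_specific = np.array([[0.7, 0.1, 0.1, 0.1],
                           [0.1, 0.1, 0.1, 0.7]])
salient = np.array([[True, False, False, False],
                    [False, False, False, True]])

probs = topic_word_probs(shared, topic_specific, salient)
n_topic_specific_params = int(salient.sum())  # only salient words cost parameters
```

In this toy setup only two of the eight topic-word entries are topic-specific, which is the sense in which the representation is parsimonious.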

Existing System:
In LDA, every topic is in principle present in every document, with a nonzero
proportion. This seems implausible, since each document is expected to
have a main theme covered by a modest subset of related topics. Allowing
all topics to have nonzero proportions in every document also needlessly
complicates the model's representation of the data. By contrast, our
proposed method identifies a sparse set of topics present in each
document.
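The contrast between dense and sparse topic proportions can be sketched as follows. This is a hedged illustration only: the helper sparse_proportions and the hand-picked active mask are hypothetical, and the proposed method chooses the active topics by minimizing BIC rather than by a fixed mask.

```python
import numpy as np

def sparse_proportions(weights, active):
    """Keep only the active topics for a document and renormalize to sum to one."""
    weights = np.asarray(weights, dtype=float)
    active = np.asarray(active, dtype=bool)
    masked = np.where(active, weights, 0.0)
    return masked / masked.sum()

rng = np.random.default_rng(0)
# A Dirichlet draw, as used for topic proportions in LDA: every entry is nonzero.
dense = rng.dirichlet(np.ones(6))
# Hypothetical choice of two relevant topics for this document.
active = np.array([True, False, True, False, False, False])
sparse = sparse_proportions(dense, active)
```

Here dense assigns some mass to all six topics, while sparse keeps only the two active ones.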
Proposed System:
We derive an approximation of the model posterior which improves on the
naive form of BIC in two aspects: 1) Our proposed form of BIC has
differentiated cost terms, based on different effective sample sizes for the
different parameter types in our model. 2) Making use of a shared feature
representation essentially increases the sample size to feature dimension
ratio, thus giving a better approximation of the model posterior.
Our framework also gives, in a wholly unsupervised fashion, a direct
estimate of the number of topics present in the corpus. The number of
topics (i.e., model order) is a hyper-parameter in topic models, usually
determined based on validation set performance for a secondary task such
as classification.
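The idea of differentiated cost terms can be sketched by comparing the standard BIC, k ln(n) - 2 ln L, with a variant that gives each parameter type its own effective sample size. The sketch below is an assumption-laden illustration, not the criterion derived by the authors: the function names, the grouping of parameters, and every number in the candidates table are made up for demonstration and are not results.

```python
import math

def naive_bic(log_lik, n_params, n_samples):
    """Standard BIC: one penalty term with a single sample size."""
    return n_params * math.log(n_samples) - 2.0 * log_lik

def differentiated_bic(log_lik, param_groups):
    """BIC-style cost with one penalty term per parameter type.

    param_groups: iterable of (n_params, effective_sample_size) pairs,
    one pair per parameter type (hypothetical grouping).
    """
    penalty = sum(k * math.log(n_eff) for k, n_eff in param_groups)
    return penalty - 2.0 * log_lik

# Made-up candidates: number of topics -> (log-likelihood, parameter groups).
candidates = {
    5: (-1400.0, [(40, 50.0), (30, 200.0)]),
    10: (-1100.0, [(80, 50.0), (60, 200.0)]),
    20: (-1090.0, [(160, 50.0), (120, 200.0)]),
}

scores = {k: differentiated_bic(ll, groups) for k, (ll, groups) in candidates.items()}
best_k = min(scores, key=scores.get)  # model order with the lowest cost
```

Minimizing the cost over candidate models is what lets the number of topics be chosen without a validation set: extra topics are kept only while the gain in likelihood outweighs the added penalty.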
Hardware Requirements:
• System : Pentium IV 2.4 GHz.
• Hard Disk : 40 GB.
• Floppy Drive : 1.44 MB.
• Monitor : 15" VGA Colour.
• Mouse : Logitech.
• RAM : 256 MB.
Software Requirements:
• Operating system : Windows XP.
• Front End : JSP / .Net
• Back End : SQL Server
