Parsimonious Topic Models with Salient Word Discovery
Abstract:
We propose a parsimonious topic model for text corpora. In related models
such as Latent Dirichlet Allocation (LDA), all words are modeled
topic-specifically, even though many words occur with similar frequencies across
different topics. Our modeling determines salient words for each topic,
which have topic-specific probabilities, with the rest explained by a
universal shared model. Further, in LDA all topics are in principle present
in every document. By contrast, our model gives sparse topic
representation, determining the (small) subset of relevant topics for each
document. We derive a Bayesian Information Criterion (BIC), balancing
model complexity and goodness of fit. Here, interestingly, we identify an
effective sample size and corresponding penalty specific to each parameter
type in our model. We minimize BIC to jointly determine our entire
model—the topic-specific words, document-specific topics, all model
parameter values, and the total number of topics—in a wholly
unsupervised fashion. Results on three text corpora and an image dataset
show that our model achieves higher test set likelihood and better
agreement with ground-truth class labels, compared to LDA and to a
model designed to incorporate sparsity.
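The two ideas in the abstract (salient words with topic-specific probabilities, and a small set of active topics per document) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the dictionary layout, and the fallback rule are assumptions made for illustration, and the sketch ignores how the actual model normalizes probabilities within each topic.

```python
def word_probability(word, active_topics, salient, shared):
    """Mixture probability of `word` in one document (illustrative only).

    active_topics: dict topic -> proportion, listing only topics present
                   in the document (the sparse topic representation)
    salient:       dict topic -> dict word -> topic-specific probability
    shared:        dict word -> probability under the shared model
    All three structures are hypothetical, chosen for this sketch.
    """
    total = 0.0
    for topic, proportion in active_topics.items():
        topic_words = salient.get(topic, {})
        # Use the topic-specific value only when the word is salient
        # for this topic; otherwise fall back to the shared model.
        p = topic_words.get(word, shared.get(word, 0.0))
        total += proportion * p
    return total
```

Topics absent from `active_topics` contribute nothing, and a word that is not salient for a topic costs no topic-specific parameter, which is where the parsimony comes from.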
Existing System:
In LDA, every topic is in principle present in every document, with a non-zero
proportion. This seems implausible since each document is expected to
have a main theme covered by a modest subset of related topics. Allowing
all topics to have non-zero proportions in every document also
complicates the model’s representation of the data. By contrast, our
proposed method identifies a sparse set of topics present in each
document.
Proposed System:
We derive an approximation of the model posterior which improves on the
naive form of BIC in two aspects: 1) Our proposed form of BIC has
differentiated cost terms, based on different effective sample sizes for the
different parameter types in our model. 2) Making use of a shared feature
representation essentially increases the sample size to feature dimension
ratio, thus giving a better approximation of the model posterior.
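To make the contrast with naive BIC concrete, the sketch below compares a single penalty term with a penalty that uses a separate effective sample size for each parameter type. It uses the conventional scaling, -2 log L + k log n. The exact cost terms and effective sample sizes derived by the authors are not given here, so the function names and the `param_groups` layout are assumptions.

```python
import math


def naive_bic(log_likelihood, num_params, sample_size):
    """Conventional BIC: every parameter is charged log(sample_size)."""
    return -2.0 * log_likelihood + num_params * math.log(sample_size)


def differentiated_bic(log_likelihood, param_groups):
    """BIC with one penalty term per parameter type.

    param_groups: iterable of (num_params, effective_sample_size) pairs,
                  one pair per parameter type. Which types exist and how
                  their sample sizes are defined is left to the caller;
                  this sketch does not reproduce the paper's derivation.
    """
    penalty = sum(k * math.log(n) for k, n in param_groups)
    return -2.0 * log_likelihood + penalty
```

With a single group the two functions agree. They differ once parameter types are estimated from different amounts of data, for example parameters fitted per document versus parameters fitted on the whole corpus.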
Our framework also gives, in a wholly unsupervised fashion, a direct
estimate of the number of topics present in the corpus. The number of
topics (i.e. model order) is a hyper-parameter in topic models, usually
determined based on validation set performance for a secondary task such
as classification.
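Choosing the model order by minimizing BIC can be sketched as a plain comparison across candidate topic counts. The text says the number of topics is determined jointly with the rest of the model, so this grid-style selection is a simplified stand-in and not the authors' procedure. The function name is hypothetical.

```python
def select_num_topics(bic_by_num_topics):
    """Return the candidate number of topics with the lowest BIC.

    bic_by_num_topics: dict mapping a candidate topic count to the BIC
                       of the model fitted with that many topics.
    """
    if not bic_by_num_topics:
        raise ValueError("at least one candidate is required")
    return min(bic_by_num_topics, key=bic_by_num_topics.get)
```

Unlike selection on a validation set for a secondary task, this needs no labels: only the fitted likelihood and the parameter counts enter the criterion.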
Hardware Requirements:
• System : Pentium IV 2.4 GHz.
• Hard Disk : 40 GB.
• Floppy Drive : 1.44 MB.
• Monitor : 15 VGA Colour.
• Mouse : Logitech.
• RAM : 256 MB.
Software Requirements:
• Operating system : - Windows XP.
• Front End : - JSP
• Back End : - SQL Server
Software Requirements:
• Operating system : - Windows XP.
• Front End : - .Net
• Back End : - SQL Server