CIS 890 – Information Retrieval
Project Final Presentation

Topic Modeling with LDA Collocations on the NIPS Collection

Presenter: Svitlana Volkova
Instructor: Doina Caragea
Agenda
I. Introduction
II. Project Stages
III. Topic Modeling
    - LDA Model
    - HMMLDA Model
    - LDA-COL Model
IV. NIPS Collection
V. Experimental Results
VI. Conclusions
I. Project Overview
Generative vs. Discriminative Methods

Generative approaches produce a probability density model over all variables in a system and manipulate it to compute classification and regression functions.

Discriminative approaches attempt to compute the input-to-output mapping directly.
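A standard way to state the contrast in symbols (added for clarity; these are generic formulations, not taken from the slides):

```latex
% Generative: model the joint distribution, then classify via Bayes' rule
p(x, y) = p(y)\, p(x \mid y), \qquad
p(y \mid x) = \frac{p(y)\, p(x \mid y)}{\sum_{y'} p(y')\, p(x \mid y')}
% Discriminative: model the input-to-output mapping p(y | x) directly,
% e.g. logistic regression: p(y = 1 \mid x) = \sigma(w^\top x + b)
```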
From LSI -> to pLSA -> to LDA
polysemy/synonymy -> probability -> exchangeability

- TF-IDF: Salton and McGill (Sal'83)
- Latent Semantic Indexing (LSI): Deerwester et al. (Dee'90)
- Probabilistic Latent Semantic Indexing (pLSA): Hofmann (Hof'99)
- Latent Dirichlet Allocation (LDA): Blei et al. (Ble'03)

[Sal83] Gerard Salton and Michael J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York, NY, USA, 1983.
[Dee90] S. Deerwester, S. Dumais, T. Landauer, G. Furnas, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.
[Hof99] T. Hofmann. Probabilistic latent semantic indexing. Proceedings of the Twenty-Second Annual International SIGIR Conference, 1999.
[Ble03] Blei, D.M., Ng, A.Y., Jordan, M.I. Latent Dirichlet allocation. Journal of Machine Learning Research, 3, pp. 993-1022, 2003.
Topic Models: LDA
Language Models
- probability of the sequence of words

[Figure: example topic word distributions, "Text Mining" and "Healthy Food"]

- each word of both the observed and unseen documents is generated by a randomly chosen topic, which is drawn from a distribution over topics.
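In symbols (standard formulations, added for clarity and using the notation of the later slides):

```latex
% Language model: chain-rule probability of a word sequence
p(w_1, \dots, w_N) = \prod_{n=1}^{N} p(w_n \mid w_1, \dots, w_{n-1})
% Topic model: each word is generated by a randomly chosen topic z
p(w_n \mid d) = \sum_{z=1}^{T} p(w_n \mid z)\, p(z \mid d)
```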
Disadvantages of the "Bag-of-Words" Assumption

• TEXT ≠ a sequence of interchangeable, discrete word tokens
• The actual meaning cannot be captured by word co-occurrences alone
• Word order is not important for syntax, but it is important for lexical meaning
• Word order within nearby context and phrases is critical to capturing the meaning of text
Problem Statement
Collocations = word phrases?

• Noun phrases: "strong tea", "weapon of mass destruction"
• Phrasal verbs: "make up"
• Other phrases: "rich and powerful"
• A collocation is a phrase with a meaning beyond that of the individual words (e.g., "white house")

[Man'99] Manning, C., & Schutze, H. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press, 1999.
Problem Statement

- How can the "Information Retrieval" topic be represented?
  Unigrams -> …, information, search, …, web
- What about "Artificial Intelligence"?
  Unigrams -> agent, …, information, search, …
- Issues with using unigrams for topic modeling:
  • not representative enough for a single topic
  • ambiguous (concepts shared across topics): system, modeling, information, data, structure, …
II. Project Stages
Project Stages

1. NIPS data collection and preprocessing
   http://books.nips.cc/
2. Learning topic models on the NIPS collection with the MATLAB Topic Modeling Toolbox
   http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm
   - Model 1: LDA
   - Model 2: HMMLDA
   - Model 3: LDA-COL
3. Comparison of results for LDA, LDA-COL, HMMLDA, and n-grams
What are the limitations of using wiki concepts?

Wiki concepts mapped per topic:
- NLP: Information Retrieval: 2; Natural Language Processing: 8
- Active Learning (AL): Cognitive Science: 1
- Artificial Intelligence (AI): Cognitive Science: 3; Object Recognition: 1; Information Retrieval: 2; Natural Language Processing: 6
- Computer Vision: Object Recognition: 2; Visual Perception: 1
- Information Retrieval (IR): Information Retrieval: 35
- Machine Learning (ML): Object Recognition: 1; Natural Language Processing: 1

Limitations:
- the wiki concept graph
- following links
- the n-gram distribution over a single document is small
- choosing the right level of concept abstraction
III. Topic Models: LDA
Topic Modeling with Latent Dirichlet Allocation

- a word is represented as a multinomial random variable w
- a topic is represented as a multinomial random variable z
- a document is represented as a Dirichlet random variable θ
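A minimal sketch of this generative process (illustrative Python with numpy; the vocabulary size, topic count, and document length are made-up toy values, while ALPHA and BETA match the experiment settings later in the deck):

```python
import numpy as np

rng = np.random.default_rng(0)

V, T, N_d = 1000, 100, 50      # toy vocabulary size, topics, words per document
alpha, beta = 0.5, 0.01        # Dirichlet hyperparameters (as in the experiments)

# One multinomial over words per topic: phi[t] ~ Dirichlet(beta)
phi = rng.dirichlet(np.full(V, beta), size=T)

def generate_document():
    # theta ~ Dirichlet(alpha): the document's distribution over topics
    theta = rng.dirichlet(np.full(T, alpha))
    words = []
    for _ in range(N_d):
        z = rng.choice(T, p=theta)     # draw a topic for this position
        w = rng.choice(V, p=phi[z])    # draw a word from that topic
        words.append(w)
    return words

doc = generate_document()
```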
Topic Simplex

- each corner of the simplex corresponds to a topic, i.e., a component of the vector θ;
- a document is modeled as a point on the simplex: a multinomial distribution over the topics;
- a corpus is modeled as a Dirichlet distribution on the simplex.

http://www.cs.berkeley.edu/~jordan
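For reference, the Dirichlet density on the topic simplex in standard LDA notation (added here; not on the slide):

```latex
p(\theta \mid \alpha) =
\frac{\Gamma\!\left(\sum_{i=1}^{T}\alpha_i\right)}{\prod_{i=1}^{T}\Gamma(\alpha_i)}
\prod_{i=1}^{T} \theta_i^{\alpha_i - 1},
\qquad \theta_i \ge 0,\;\; \sum_{i=1}^{T}\theta_i = 1
```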
III. Topic Models: HMMLDA
Bigram Topic Models: Wallach's Model

Example bigram: "neural network"

• Wallach's Bigram Topic Model (Wal'05) is based on the hierarchical Dirichlet language model (Pet'94)

[Wal'05] Wallach, H. Topic modeling: beyond bag-of-words. NIPS 2005 Workshop on Bayesian Methods for Natural Language Processing, 2005.
[Pet'94] MacKay, D. J. C., & Peto, L. A hierarchical Dirichlet language model. Natural Language Engineering, 1, 1–19, 1994.
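The key change relative to LDA, as [Wal'05] describes it: each word is drawn from a distribution conditioned on both the topic and the previous word (sketched here in standard notation):

```latex
z_t \sim \mathrm{Multinomial}(\theta_d), \qquad
w_t \sim p\!\left(w_t \mid z_t,\; w_{t-1}\right) = \phi^{(w_{t-1},\, z_t)}_{w_t}
```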
III. Topic Models: LDA-COL
LDA-Collocation Model (Ste'05)

• The model can decide whether to generate a bigram or a unigram at each word position

[Ste'05] Steyvers, M., & Griffiths, T. Matlab topic modeling toolbox 1.3. http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm, 2005.
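In outline (paraphrasing the toolbox description; the binary switch is reported as the C vector in the model's output, described on a later slide):

```latex
w_k \sim
\begin{cases}
p(w_k \mid z_k) & \text{unigram: drawn from topic } z_k, \\[2pt]
p(w_k \mid w_{k-1}) & \text{collocation: drawn conditioned on the previous word.}
\end{cases}
```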
Methods for Collocation Discovery

- Frequency counting (Jus'95)
  Justeson, J. S., & Katz, S. M. Technical terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering, 1, 9–27, 1995.
- Variance-based collocation (Sma'93)
  Smadja, F. Retrieving collocations from text: Xtract. Computational Linguistics, 19, 143–177, 1993.
- Hypothesis testing -> assess whether two words occur together more often than by chance:
  - t-test (Chu'89)
    Church, K., & Hanks, P. Word association norms, mutual information and lexicography. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics (ACL) (pp. 76–83), 1989.
  - χ² test (Chu'91)
    Church, K. W., Gale, W., Hanks, P., & Hindle, D. Using statistics in lexical analysis. In Lexical Acquisition: Using On-line Resources to Build a Lexicon (pp. 115–164). Lawrence Erlbaum, 1991.
  - likelihood ratio test (Dun'93)
    Dunning, T. E. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19, 61–74, 1993.
- Mutual information (Hod'96)
  Hodges, J., Yie, S., Reighart, R., & Boggess, L. An automated system that assists in the generation of document indexes. Natural Language Engineering, 2, 137–160, 1996.
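A small illustration of one of these tests (the t-test, following Manning & Schütze's formulation; the token list below is a made-up toy example):

```python
import math
from collections import Counter

def t_test_collocations(tokens, min_count=5):
    """Score adjacent word pairs with the collocation t-test.

    t = (x_bar - mu) / sqrt(s^2 / N), with x_bar = C(w1,w2)/N,
    mu = P(w1)P(w2) under the independence null, and s^2 ~= x_bar.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    N = len(tokens)
    scores = {}
    for (w1, w2), c12 in bigrams.items():
        if c12 < min_count:
            continue
        x_bar = c12 / N
        mu = (unigrams[w1] / N) * (unigrams[w2] / N)
        # t > ~2.576 rejects independence at the 0.5% level
        scores[(w1, w2)] = (x_bar - mu) / math.sqrt(x_bar / N)
    return sorted(scores.items(), key=lambda kv: -kv[1])

tokens = "the neural network learns the neural network weights".split()
print(t_test_collocations(tokens, min_count=2)[:3])
```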
Topical N-grams

HMMLDA captures word dependencies:
- HMM -> short-range syntactic dependencies
- LDA -> long-range semantic dependencies

[Wan'07] Xuerui Wang, Andrew McCallum and Xing Wei. Topical N-grams: Phrase and Topic Discovery, with an Application to Information Retrieval. Proceedings of the 7th IEEE International Conference on Data Mining (ICDM), 2007. http://www.cs.umass.edu/~mccallum/papers/tng-icdm07.pdf
Topical N-grams (continued)

[Wan'07], cited above.
IV. Data Collection: NIPS Abstracts
NIPS Collection

NIPS Collection Characteristics:
  Number of words       W = 13649
  Number of docs        D = 1740
  Number of topics      T = 100
  Number of iterations  N = 50
  LDA hyperparameter    ALPHA = 0.5
  LDA hyperparameter    BETA = 0.01

[Figure: randomly sampled document titles from the NIPS collection]
LDA Model Input/Output

Inputs:
- WS: a 1 x N vector, where WS(k) contains the vocabulary index of the k-th word token and N is the number of word tokens.
- DS: a 1 x N vector, where DS(k) contains the document index of the k-th word token.

Outputs:
- WP: a sparse W x T matrix; WP(i,j) contains the number of times word i has been assigned to topic j.
- DP: a sparse D x T matrix; DP(d,j) contains the number of times a word token in document d has been assigned to topic j.
- Z: a 1 x N vector of topic assignments; Z(k) contains the topic assignment for token k.
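A sketch of how these structures relate (illustrative Python; the toolbox itself is MATLAB, so the names WS, DS, WP mirror its conventions rather than any Python API, and MATLAB's 1-based indices become 0-based here):

```python
import numpy as np

def build_ws_ds(docs, vocab):
    """Flatten tokenized documents into the toolbox-style WS/DS token streams."""
    word_to_id = {w: i for i, w in enumerate(vocab)}
    WS, DS = [], []
    for d, doc in enumerate(docs):
        for w in doc:
            WS.append(word_to_id[w])   # vocabulary index of this token
            DS.append(d)               # document index of this token
    return np.array(WS), np.array(DS)

def top_words_per_topic(WP, vocab, n=5):
    """Read the n most frequent words per topic from a W x T count matrix."""
    return [[vocab[i] for i in np.argsort(-WP[:, t])[:n]]
            for t in range(WP.shape[1])]

docs = [["neural", "network", "learning"], ["retrieval", "search", "network"]]
vocab = sorted({w for doc in docs for w in doc})
WS, DS = build_ws_ds(docs, vocab)
```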
HMMLDA Model Input/Output

Inputs:
- WS: a 1 x N vector, where WS(k) contains the vocabulary index of the k-th word token and N is the number of word tokens.
- DS: a 1 x N vector, where DS(k) contains the document index of the k-th word token.

Outputs:
- WP: a sparse W x T matrix; WP(i,j) contains the number of times word i has been assigned to topic j.
- DP: a sparse D x T matrix; DP(d,j) contains the number of times a word token in document d has been assigned to topic j.
- MP: a sparse W x S matrix, where S is the number of HMM states; MP(i,j) contains the number of times word i has been assigned to HMM state j.
- Z: a 1 x N vector of topic assignments; Z(k) contains the topic assignment for token k.
- X: a 1 x N vector of HMM state assignments; X(k) contains the assignment of the k-th word token to an HMM state.
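Continuing the sketch above, the extra MP matrix can be inspected the same way to see which words each HMM state absorbed (illustrative; with no stop-word removal, as noted in the conclusions, syntactic states tend to fill with function words):

```python
import numpy as np

def top_words_per_state(MP, vocab, n=5):
    """Most frequent words for each HMM state from a W x S count matrix MP."""
    return [[vocab[i] for i in np.argsort(-MP[:, s])[:n]]
            for s in range(MP.shape[1])]
```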
LDA-COL Model Input/Output

Inputs:
- WS: a 1 x N vector, where WS(k) contains the vocabulary index of the k-th word token and N is the number of word tokens.
- DS: a 1 x N vector, where DS(k) contains the document index of the k-th word token.
- WW: a sparse W x W matrix, where WW(i,j) contains the number of times word i follows word j in the word stream.
- SI: a 1 x N vector, where SI(k)=1 only if the k-th word can form a collocation with the (k-1)-th word, and SI(k)=0 otherwise.

Outputs:
- WP: a sparse W x T matrix; WP(i,j) contains the number of times word i has been assigned to topic j.
- DP: a sparse D x T matrix; DP(d,j) contains the number of times a word token in document d has been assigned to topic j.
- WC: a 1 x W vector, where WC(k) contains the number of times word k led to a collocation with the next word in the stream.
- Z: a 1 x N vector of topic assignments; Z(k) contains the topic assignment for token k.
- C: a 1 x N vector of topic/collocation assignments; C(k)=0 when token k was assigned to the topic model, and C(k)=1 when token k was assigned to a collocation with token k-1.
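A sketch of preparing the two extra inputs (illustrative Python; treating every non-initial token as a collocation candidate is a simplifying assumption, not the preprocessing the project actually used):

```python
import numpy as np
from scipy.sparse import lil_matrix

def build_ww_si(WS, W):
    """WW(i,j): count of word i following word j in the stream;
    SI(k): 1 if token k may form a collocation with token k-1."""
    WW = lil_matrix((W, W), dtype=np.int32)
    SI = np.zeros(len(WS), dtype=np.int8)
    for k in range(1, len(WS)):
        WW[WS[k], WS[k - 1]] += 1
        SI[k] = 1  # simplification: any token after the first is a candidate
    return WW.tocsr(), SI
```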
V. Experimental Results
Experiment Setup

1. 100 topics
2. Gibbs sampling, 50 iterations
3. Optimized parameters:

   LDA:      ALPHA = 0.5, BETA = 0.01
   HMMLDA:   ALPHA = 0.5, BETA = 0.01, GAMMA = 0.1
   LDA-COL:  ALPHA = 0.5, BETA = 0.01, GAMMA0 = 0.1, GAMMA1 = 0.1

[Gri'04] Griffiths, T., & Steyvers, M. Finding Scientific Topics. Proceedings of the National Academy of Sciences, 101 (suppl. 1), 5228–5235, 2004.
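For intuition, a compact sketch of the collapsed Gibbs sampler these runs rely on (following the Griffiths & Steyvers update rule; a toy re-implementation, not the toolbox's code):

```python
import numpy as np

def gibbs_lda(WS, DS, W, D, T=100, alpha=0.5, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA over the WS/DS token streams."""
    rng = np.random.default_rng(seed)
    N = len(WS)
    Z = rng.integers(T, size=N)          # random initial topic assignments
    WP = np.zeros((W, T))                # word-topic counts
    DP = np.zeros((D, T))                # document-topic counts
    NT = np.zeros(T)                     # tokens per topic
    for k in range(N):
        WP[WS[k], Z[k]] += 1; DP[DS[k], Z[k]] += 1; NT[Z[k]] += 1
    for _ in range(iters):
        for k in range(N):
            w, d, z = WS[k], DS[k], Z[k]
            WP[w, z] -= 1; DP[d, z] -= 1; NT[z] -= 1   # remove token k
            # p(z=j | rest) ∝ (WP[w,j]+beta)/(NT[j]+W*beta) * (DP[d,j]+alpha)
            p = (WP[w] + beta) / (NT + W * beta) * (DP[d] + alpha)
            z = rng.choice(T, p=p / p.sum())
            Z[k] = z
            WP[w, z] += 1; DP[d, z] += 1; NT[z] += 1   # reinsert with new topic
    return WP, DP, Z
```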
LDA Model Results
Hidden Markov Model with Latent Dirichlet Allocation (HMMLDA) Model Results

[Hsu'06] Hsu, B. J., & Glass, J. Style and topic language model adaptation using HMM-LDA. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP), 2006.
LDA-COL Model Results
LDA vs. HMMLDA vs. LDA-COL
LDAs vs. Topical N-grams

[Wan'07] Xuerui Wang, Andrew McCallum and Xing Wei. Topical N-grams: Phrase and Topic Discovery, with an Application to Information Retrieval. Proceedings of the 7th IEEE International Conference on Data Mining (ICDM), 2007. http://www.cs.umass.edu/~mccallum/papers/tng-icdm07.pdf
LDAs vs. Topical N-grams
VI. Conclusions
Conclusions

I. HMMLDA showed the worst results because stop-word removal was not performed.
II. LDA-COL performed best compared to LDA and HMMLDA, but worse than topical n-gram models.

Future Work
- Polylingual Topic Models

[Mim'09] D. Mimno, H. M. Wallach, J. Naradowsky, D. A. Smith, and A. McCallum. "Polylingual topic models," in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. Singapore: Association for Computational Linguistics, August 2009, pp. 880–889. http://www.aclweb.org/anthology/D/D09/D09-1092.pdf
Acknowledgments

University of California, Irvine, Department of Cognitive Sciences, for the MATLAB Topic Modeling Toolbox

Dr. Caragea

Questions
