Vector Space Model & Latent Semantic
              Indexing

              Ryan Reck


           November 18, 2008
1 Introduction

2 Vector Space Model

3 Latent Semantic Indexing

4 Applications of VSM & LSI

5 Comparison: VSM vs. LSI

6 Conclusion

7 References
Introduction
What are VSM & LSI?




    VSM & LSI are techniques from information retrieval for managing
    documents based on their content.
Vector Space Model




      Models each document as a vector in a multi-dimensional space,
      with one dimension per term.
      Similar documents lie closer together; the angle between two
      vectors can be interpreted as the similarity of the two documents.
      Queries are translated into the same vector space, and the nearest
      documents (by distance between points or by vector angle) are
      returned as the results.
      Originated in the SMART Information Retrieval project at
      Cornell University. The model was first published in 1975 [2].
Vector Space Model
Example




          $doc_1 = \langle tf_1, tf_2, tf_3, \ldots, tf_n \rangle$
          $doc_2 = \langle tf_1, tf_2, tf_3, \ldots, tf_n \rangle$
          $sim(doc_1, doc_2) = \cos(\theta) = \frac{doc_1 \cdot doc_2}{\|doc_1\| \, \|doc_2\|}$
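
The similarity measure above can be sketched in a few lines of Python. This is only an illustration: the three-term vocabulary, the document vectors, and the function names `cosine_similarity` and `rank` are assumptions made for the example, not part of the original slides. The cosine is computed as the dot product divided by the product of the vector lengths, and a query is treated as just another vector.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here: an empty vector matches nothing
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return document names sorted by decreasing similarity to the query."""
    scores = {name: cosine_similarity(query, vec) for name, vec in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical vocabulary order: [computer, laptop, elevator]
docs = {
    "doc1": [2, 1, 0],
    "doc2": [0, 0, 3],
    "doc3": [1, 1, 1],
}
query = [1, 0, 0]  # a one-word query: "computer"
ranking = rank(query, docs)
print(ranking)
```

Because the cosine depends only on direction, a long document and a short one with the same term proportions score the same against a query.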
Vector Space Model
Calculating Term Weights




         VSM introduced the Term Frequency - Inverse Document
         Frequency method of calculating term weights.
         TF-IDF gives greater weight to less common terms, and less
         weight to common ones, since rare terms will better
         distinguish documents than common terms.
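         $W_{t,d} = tf_{t,d} \cdot \log\left(\frac{|D|}{|\{d' \in D : t \in d'\}|}\right)$
         where $|D|$ is the number of documents and the denominator is
         the number of documents that contain term $t$.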
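
A minimal sketch of this weighting, assuming raw term counts for the term frequency and a natural logarithm (the slides do not fix the log base). The function name `tf_idf` and the three-document corpus are hypothetical, chosen only to show that a term appearing in every document gets weight zero while a rare term gets the largest weight.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw count of each term in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical toy corpus, already tokenized by whitespace.
corpus = [
    "the computer is fast".split(),
    "the laptop is a computer".split(),
    "the elevator is slow".split(),
]
w = tf_idf(corpus)
print(w[0]["the"], w[0]["computer"], w[2]["elevator"])
```

Here "the" occurs in all three documents, so its weight is zero everywhere; "computer" occurs in two and "elevator" in one, so "elevator" is weighted highest.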
Latent Semantic Indexing




      Built on top of the Vector Space Model.
      Extracts concepts from the term-document matrix.
          Combines correlated dimensions into a single aggregate
          dimension.
      This allows documents to be indexed by concept instead
      of by individual terms.
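
The idea of merging correlated term dimensions can be illustrated with a deliberately simplified sketch. LSI is normally described in terms of a truncated singular value decomposition that keeps several concept dimensions; the code below swaps in plain power iteration, which recovers only the single strongest concept, so that it runs without any numerical library. The matrix values and the names `top_concept` and `project` are assumptions made for the example.

```python
import math

def top_concept(A, iterations=100):
    """Leading left singular vector of term-document matrix A (rows = terms),
    found by power iteration on A * A^T."""
    n_terms, n_docs = len(A), len(A[0])
    u = [1.0] * n_terms  # positive start; adequate for non-negative count matrices
    for _ in range(iterations):
        # w = A^T u: one value per document
        w = [sum(A[i][j] * u[i] for i in range(n_terms)) for j in range(n_docs)]
        # u = A w: one value per term
        u = [sum(A[i][j] * w[j] for j in range(n_docs)) for i in range(n_terms)]
        norm = math.sqrt(sum(x * x for x in u))
        u = [x / norm for x in u]
    return u

def project(A, u):
    """Coordinate of each document (column of A) along the concept direction u."""
    return [sum(A[i][j] * u[i] for i in range(len(A))) for j in range(len(A[0]))]

# Rows: computer, laptop, elevator. Columns: four hypothetical documents.
A = [
    [2, 1, 2, 0],  # computer
    [1, 2, 1, 0],  # laptop
    [0, 0, 0, 1],  # elevator
]
concept = top_concept(A)
coords = project(A, concept)
print(concept, coords)
```

In this toy matrix "computer" and "laptop" co-occur, so the dominant concept is a weighted mix of those two terms with essentially no "elevator" component, and the elevator-only document projects to roughly zero along it.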
Latent Semantic Indexing
Example




    Good Example
    {computer, laptop} -> {1.2 * computer + 0.9 * laptop}

    Realistic Example
    {computer, elevator} -> {1.2 * computer + 0.9 * elevator}
Applications of VSM & LSI




      VSM, or some variation of it, is almost universal.
      Search Engines
          Apache Lucene
Comparison: VSM vs. LSI


  Advantages of LSI

      Handles synonymy and polysemy directly.
      Can match documents that use differing vocabularies.
      Can even match across different languages, once some
      translated documents have been processed [1].


  Advantages of VSM

      Much simpler, but still performs well.
      Handles new documents more easily; LSI's dimension
      reduction can cause problems with this.
Conclusion




   VSM and LSI are both good ways to index and compare
   documents. VSM is pretty basic but still gets the job done. LSI
   provides a more complex system, but it can do a very good job,
   even under extreme circumstances, like multi-language datasets.
References

      [1] Dumais, S. T., Letsche, T. A., Littman, M. L., and
          Landauer, T. K.
          Automatic cross-language retrieval using latent semantic
          indexing.
          In AAAI Symposium on Cross-Language Text and Speech
          Retrieval. American Association for Artificial Intelligence,
          March 1997.
      [2] Salton, G., Wong, A., and Yang, C. S.
          A vector space model for automatic indexing.
          Commun. ACM 18, 11 (1975), 613–620.
      [3] Latent semantic indexing, 2008.
          http://en.wikipedia.com/wiki/Latent_semantic_indexing
      [4] Vector space model, 2008.
          http://en.wikipedia.com/wiki/Vector_space_model
