How Does LSA Work? 
Andrew Koo - Insight Data Science
Latent Semantic Analysis 
• Separate the text into sentences using a trained model 
• Build a sparse matrix of words and the number of times each appears in each sentence 
• Normalize the counts with tf-idf 
• Use singular value decomposition to project each sentence vector into a multidimensional “conceptual space” 
• Pick the top sentences based on the absolute value (length) of each sentence vector in the “conceptual space”
1. Separate the Text into Sentences 
• Apply the Tokenizer from the Python sumy library 

“Hi world! Hello world! This is Andrew.” → [“Hi world!”, “Hello world!”, “This is Andrew.”]
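The deck applies sumy’s Tokenizer, which wraps trained sentence models. As a minimal stand-in (assuming plain English punctuation only, not sumy’s trained models), the split can be sketched with a regex:

```python
import re

def to_sentences(text):
    # Naive stand-in for sumy's Tokenizer: split after ., !, or ?
    # when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(to_sentences("Hi world! Hello world! This is Andrew."))
# → ['Hi world!', 'Hello world!', 'This is Andrew.']
```

This breaks on abbreviations like “Dr.”, which is exactly why sumy relies on a trained tokenizer instead.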
2. Build a sparse matrix of words and the number of times each appears in each sentence 

[“Hi world!”, “Hello world!”, “This is Andrew.”] 

(Sentence, word)   Count 
(0, 2)             1 
(0, 5)             1 
(1, 5)             1 
(1, 1)             1 
(2, 4)             1 
(2, 3)             1 
(2, 0)             1
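A sketch of building these (sentence, word) → count triples with only the standard library; indexing the sorted vocabulary (andrew=0, hello=1, hi=2, is=3, this=4, world=5) reproduces the word indices in the table above:

```python
import re
from collections import Counter

sentences = ["Hi world!", "Hello world!", "This is Andrew."]

# Lowercased word tokens per sentence.
tokens = [re.findall(r"\w+", s.lower()) for s in sentences]

# Sorted vocabulary: andrew=0, hello=1, hi=2, is=3, this=4, world=5.
vocab = {w: i for i, w in enumerate(sorted({w for t in tokens for w in t}))}

# COO-style sparse triples: (sentence index, word index) -> count.
counts = {(i, vocab[w]): c
          for i, t in enumerate(tokens)
          for w, c in Counter(t).items()}
for key in sorted(counts):
    print(key, counts[key])
```

In practice a library vectorizer would produce the same sparse structure; this just makes the indexing explicit.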
3. Normalize each word with tf-idf 
• tf (term frequency): how often a term occurs in a document 
• idf (inverse document frequency): how important a word is; it weighs down frequent terms (e.g. “is”, “does”, “how”) 

(Sen, word)   Count   tf-idf 
(0, 2)        1       0.796 
(0, 5)        1       0.605 
(1, 5)        1       0.605 
(1, 1)        1       0.796 
(2, 4)        1       0.577 
(2, 3)        1       0.577 
(2, 0)        1       0.577
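The weights in the table match a smoothed-idf, L2-normalized scheme (the scikit-learn default formula); a standard-library sketch, assuming that formula:

```python
import math

# Per-sentence term counts from the previous slide; word indices follow
# the sorted vocabulary (andrew=0, hello=1, hi=2, is=3, this=4, world=5).
rows = [{2: 1, 5: 1}, {1: 1, 5: 1}, {0: 1, 3: 1, 4: 1}]
n = len(rows)

# Document frequency of each word index.
df = {}
for r in rows:
    for w in r:
        df[w] = df.get(w, 0) + 1

# Smoothed idf (sklearn's default: ln((1+n)/(1+df)) + 1), then
# L2-normalize each sentence's weight vector.
tfidf = []
for r in rows:
    w2v = {w: c * (math.log((1 + n) / (1 + df[w])) + 1) for w, c in r.items()}
    norm = math.sqrt(sum(v * v for v in w2v.values()))
    tfidf.append({w: round(v / norm, 3) for w, v in w2v.items()})

print(tfidf)
# → [{2: 0.796, 5: 0.605}, {1: 0.796, 5: 0.605}, {0: 0.577, 3: 0.577, 4: 0.577}]
```

“world” appears in two of the three sentences, so its idf (and final weight) is lower than the sentence-unique words: exactly the down-weighting of frequent terms described above.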
4. Use singular value decomposition to project each sentence vector into a multidimensional “conceptual” space 

SVD factors the normalized word-sentence matrix A into three parts: 

A = U Σ Vᵀ 
(normalized word-sentence matrix = transform matrix × scaling matrix × concept matrix) 

Multiply the normalized word-sentence matrix by Uᵀ to transform each sentence into a vector in the multidimensional conceptual space.
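A sketch of the SVD step with NumPy, using the toy tf-idf matrix from the running example (the row/column arrangement is my assumption, not the slide’s):

```python
import numpy as np

# Toy normalized word-by-sentence matrix A (rows = words, cols = sentences),
# filled with the tf-idf weights from the previous step.
A = np.array([
    [0.0,   0.0,   0.577],  # andrew
    [0.0,   0.796, 0.0  ],  # hello
    [0.796, 0.0,   0.0  ],  # hi
    [0.0,   0.0,   0.577],  # is
    [0.0,   0.0,   0.577],  # this
    [0.605, 0.605, 0.0  ],  # world
])

# Economy SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# U has orthonormal columns, so multiplying by U.T maps each sentence
# column into concept space, which equals diag(s) @ Vt.
concept = U.T @ A
print(np.allclose(concept, np.diag(s) @ Vt))  # → True
```

This makes the slide’s claim concrete: Uᵀ·A and Σ·Vᵀ are the same matrix, one conceptual-space column per sentence.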
5. Pick top sentences based on the absolute value of the sentence vector in the “conceptual space” 

Stacking the weighted concept rows λ1V1ᵀ, λ2V2ᵀ, …, λ7V7ᵀ gives the concept matrix; its columns are the sentence vectors S’0 … S’6. One sentence vector, for example: 

[0.400, 0.213, 0.243, 0.762, 0.145, 0.123, 0.254] 

The absolute value (Euclidean length) of this vector is the importance score of the sentence.
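A NumPy sketch of the scoring step, reusing the toy matrix from the running example. One assumption not stated on the slide: because each L2-normalized sentence column has unit length, a full-rank SVD would leave every score at 1, so the sketch truncates to the k strongest concepts before measuring lengths:

```python
import numpy as np

# Toy tf-idf word-by-sentence matrix (rows = words andrew..world,
# columns = sentences) from the running example.
A = np.array([
    [0.0,   0.0,   0.577],
    [0.0,   0.796, 0.0  ],
    [0.796, 0.0,   0.0  ],
    [0.0,   0.0,   0.577],
    [0.0,   0.0,   0.577],
    [0.605, 0.605, 0.0  ],
])
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                # keep only the k strongest concepts
concept = np.diag(s[:k]) @ Vt[:k]    # column j = sentence j in concept space

# Importance score = Euclidean length of each sentence's concept vector.
scores = np.linalg.norm(concept, axis=0)
ranked = np.argsort(scores)[::-1]    # sentence indices, highest score first
print(scores.round(3), ranked)
```

The summary is then the top-ranked sentences, restored to their original document order.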

