NLP and LSA getting started
"Latent semantic analysis (LSA) is a technique in natural language processing, in particular in vectorial semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms." (Wikipedia)

Latent semantic analysis
Getting started
"Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages." (Wikipedia)

Natural language processing can be divided into four phases:
Grammar analysis
Lexical analysis
Semantic analysis
Syntactic analysis
Apache OpenNLP
A machine-learning-based toolkit for the processing of natural language text.
http://opennlp.apache.org/
LSA can be seen as a part of NLP.
Apache OpenNLP usage examples:
Lexical analysis: tokenization
Grammar analysis: part-of-speech tagging
Syntactic analysis: chunker / parser
NOTE: before the lexical analysis it is possible to use a sentence analysis tool: the sentence detector (Apache OpenNLP).
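The note above describes a pipeline order: sentence detection, then tokenization, then tagging. As a minimal stand-in (this is plain Python, not OpenNLP, and the regex rules are deliberately naive), the first two stages can be sketched as:

```python
import re

def detect_sentences(text):
    # Naive sentence detector: split after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def tokenize(sentence):
    # Naive tokenizer: words and punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "LSA finds hidden concepts. It uses SVD!"
for sentence in detect_sentences(text):
    print(tokenize(sentence))
```

In OpenNLP the same stages are performed by trained models (sentence detector, tokenizer, POS tagger) rather than hand-written rules.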
Supervised machine learning concepts

Training:
• INPUT DATA (e.g. a Wikipedia corpus) and OUTPUT DATA (e.g. that corpus POS-tagged).
• Humans produce a finite set of (INPUT, OUTPUT) couples: the training set. It can be seen as a discrete function.
• A machine learning algorithm (e.g. linear regression, maximum entropy, perceptron) produces a MODEL. The model can be seen as a continuous function.

Prediction:
• INPUT DATA (e.g. just a document), taken from an infinite set.
• Using the model and the input, the machine produces the expected output (that document POS-tagged).
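The training/prediction split above can be sketched with one of the algorithms named on the slide, the perceptron. The toy OR training set and the learning rate below are invented for illustration:

```python
# Toy perceptron: learn the OR function from a finite set of (INPUT, OUTPUT) couples.
training_set = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

weights = [0.0, 0.0]
bias = 0.0
rate = 0.1

def predict(x):
    activation = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if activation >= 0 else 0

# Training: adjust weights and bias whenever the prediction is wrong.
for _ in range(20):
    for x, target in training_set:
        error = target - predict(x)
        if error:
            bias += rate * error
            weights[0] += rate * error * x[0]
            weights[1] += rate * error * x[1]

# Prediction: the "model" is just weights + bias, usable on any input.
print([predict(x) for x, _ in training_set])  # → [0, 1, 1, 1]
```

The trained weights are the MODEL of the slide: a small, continuous summary of the discrete training couples.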
LSA assumes that words that are close in
meaning will occur in similar pieces of text.
LSA is a method for discovering hidden
concepts in document data.
LSA key concepts
[Figure: a set of documents, Doc 1 to Doc 4; each document contains several words.]
The LSA algorithm takes the documents and words and computes vectors in a semantic vector space using:
• a documents/words matrix
• singular value decomposition (SVD)
[Figure: semantic vector space with doc1 to doc4 and word1, word2 plotted; word1 and word2 are close, meaning that their (latent) meanings are related.]
Example:

Words/document matrix:

        Doc1  Doc2  Doc3  Doc4
Word1    1     0     1     0
Word2    1     0     1     1
Word3    0     1     0     1
…

1: the i-th word occurs in the j-th document.
0: the i-th word does not occur in the j-th document.
The matrix is very large (thousands of words, hundreds of documents).
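Building such a binary words/document matrix is straightforward; the four toy documents below are invented for illustration:

```python
docs = ["abraham isaac", "jacob", "abraham sarah isaac", "isaac jacob"]

# Vocabulary: every distinct word, in first-seen order.
vocab = []
for doc in docs:
    for word in doc.split():
        if word not in vocab:
            vocab.append(word)

# Binary words/document matrix: rows are words, columns are documents.
matrix = [[1 if word in doc.split() else 0 for doc in docs] for word in vocab]

for word, row in zip(vocab, matrix):
    print(word, row)
```

Real LSA systems also weight the entries (e.g. with tf-idf) instead of using raw 0/1 occurrences, but the row/column structure is the same.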
Matrix SVD decomposition is used to reduce the matrix dimension. The Semantic Vectors or JLSI libraries:
• compute the SVD decomposition;
• build the semantic vector space.
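The reduction step is the standard truncated SVD (plain linear algebra, not specific to either library): keep only the k largest singular values of the words/documents matrix A,

```latex
A \approx A_k = U_k \Sigma_k V_k^{\top}
```

where the rows of U_k Σ_k give the word vectors and the rows of V_k Σ_k give the document vectors of the reduced semantic space; closeness between vectors is then measured by cosine similarity.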
[Figure: the reduced semantic vector space, with doc1, doc2, doc4 and word1, word2 as vectors.]

UIMA can be used to manage the overall solution.
Online references:
http://opennlp.apache.org/documentation/manual/opennlp.html
https://code.google.com/p/semanticvectors/
http://hlt.fbk.eu/en/technology/jlsi
http://uima.apache.org/
http://en.wikipedia.org/wiki/Singular_value_decomposition
http://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors
Coursera video references:
http://www.coursera.org/course/nlangp
http://www.coursera.org/course/ml
Some snippets and console commands
OpenNLP has a command-line tool which is used to train the models.
[Screenshots: the trained model; the models and the document to manage.]
This snippet takes 4 files as input and produces a new file that is sentence-detected, tokenized, and POS-tagged (sentences, tokens, tags). Such a document could then be indexed, for example, in a search engine like Apache Solr.
Note that lucene-core is a transitive dependency. A .bat file is used to load the classpath.
SemanticVectors has two main functions:
1. Building wordSpace models. To build the wordSpace model, Semantic Vectors needs indexes created by Apache Lucene.
2. Searching through the vectors in such models.
E.g.: a Bible chapter indexed by Lucene.
1. Building wordSpace models using the pitt.search.semanticvectors.LSA class from the index created by Apache Lucene (from a Bible chapter).
In this example the Bible chapter contains 29 documents and, in total, 2460 terms.
Semantic Vectors builds:
1. 29 vectors that represent the documents (docvector.bin)
2. 2460 vectors that represent the terms (termvector.bin)
These two files represent the wordSpace.
Note that it is also possible to use the pitt.search.semanticvectors.BuildIndex class, which uses Random Projection instead of LSA to reduce the dimensional representation.
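Random Projection itself is easy to sketch in plain Python: project high-dimensional term vectors onto a few random ±1 directions. The dimensions and the seed below are arbitrary, chosen only for illustration:

```python
import random

random.seed(0)

DIM_IN, DIM_OUT = 100, 8  # e.g. vocabulary size -> reduced dimension

# One random +/-1 direction per output dimension.
projection = [[random.choice((-1, 1)) for _ in range(DIM_IN)]
              for _ in range(DIM_OUT)]

def project(vector):
    # Reduced vector: dot product of the input with each random direction.
    return [sum(p * v for p, v in zip(row, vector)) for row in projection]

term_vector = [1 if i % 7 == 0 else 0 for i in range(DIM_IN)]  # toy sparse vector
print(project(term_vector))  # 8 numbers instead of 100
```

Unlike SVD, no matrix decomposition is computed, which is why Random Projection scales to much larger matrices at some cost in precision.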
2. Searching through docVector and termVector
2.1 Searching for documents using terms
Search for the document vectors closest to the vector "Abraham":
2.2 Using a document file as a source of queries
Find the terms most closely related to Chapter 1 of Chronicles:
2.3 Searching for a general word
Find the terms most closely related to "Abraham".
2.4 Comparing words
Compare "abraham" with "Isaac".
Compare "abraham" with "massimo".
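All of these comparisons reduce to cosine similarity between vectors in the wordSpace. A pure-Python sketch, over made-up 3-dimensional vectors (not real SemanticVectors output):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical reduced term vectors, invented for illustration.
abraham = [0.9, 0.2, 0.1]
isaac   = [0.8, 0.3, 0.0]
massimo = [0.0, 0.1, 0.9]

print(round(cosine(abraham, isaac), 3))   # high: related terms
print(round(cosine(abraham, massimo), 3)) # low: unrelated terms
```

A high cosine means the two terms point in similar directions of the semantic space, which is exactly what "close in (latent) meaning" means in LSA.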
