Introduction to Text Analytics
and Natural Language
Processing
Nick Grattan
Application Architecture Director, Dassault Systèmes
PhD Student, Insight Centre for Data Analytics, University College Cork
www.3ds.com insight-centre.org/
Cork AI Meetup 15th March 2018
www.meetup.com/Cork-AI/
@NickGrattan
“You shall know a word by the
company it keeps”
J.R. Firth (1957)
Agenda
• Introduction to Text Analytics.
• Overview of common techniques and the types of problems commonly
solved.
• Traditional “Frequentist” text analysis
• Bag-of-Words and Vector Space Models (High Dimensional, Low Density)
• Measuring document / text similarity with distance metrics and clustering
documents.
• Hands-On: Document Clustering with Python, NLTK and Scipy.
• Word Embeddings with word2vec for semantic term analysis
• Unsupervised semantic analysis using a corpus of words
• Hands-On: Creating a semantic space with a Neural Network in TensorFlow
Natural Language Processing and Text
Analytics
• Natural Language Processing (NLP)
• Area of AI concerned with interactions between computers and human natural
language, to process or “understand” natural language
• Common tasks: speech recognition, natural language understanding & generation,
automatic summarization, part-of-speech tagging, disambiguation, named entity
recognition …
• To fully understand and represent the meaning of language is a difficult goal (AI-Complete) [1]
• Text Analytics (Text Mining):
• The process or practice of examining large collections of written resources in order
to generate new information (Oxford English Dictionary)
• Transforms text to data for information discovery, establishing relationships, often
using NLP
Text Preparation
• Extract text from documents
• E.g. use “BeautifulSoup” in Python to process HTML/XML documents
• Process terms (words) from text
• Tokenisation – breaks text into discrete terms
• Stop Words – remove common words (“the”, “and” etc.)
• Stemming – Reduce words to their root or base form ("fishing", "fished", and
"fisher" => "fish")
• E.g. “NLTK” (Natural Language Toolkit) in Python
• All, some, or none of these techniques may be used, depending on the
application
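These preparation steps can be sketched in a few lines of plain Python. The stop list and suffix rules below are illustrative stand-ins, not NLTK's real stopwords corpus or PorterStemmer:

```python
import re

# Illustrative stop-word list and suffix rules only; a real pipeline would
# use NLTK's stopwords corpus and PorterStemmer instead.
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "is", "in"}

def tokenise(text):
    """Break text into lower-case terms (naive regex tokeniser)."""
    return re.findall(r"[a-z']+", text.lower())

def stem(term):
    """Very rough stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "er", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def prepare(text):
    """Tokenise, drop stop words, then stem what remains."""
    return [stem(t) for t in tokenise(text) if t not in STOP_WORDS]

print(prepare("The fisher fished while fishing"))  # ['fish', 'fish', 'while', 'fish']
```

As the slide notes, a real application might apply all, some, or none of these steps.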
Bag-of-Words & Jaccard Similarity
• Bag-of-Words is the set of terms found
in a document, corpus etc.
• Jaccard Similarity between two Bags-of-Words, A & B:
• Ratio of the length of the intersection to the length of the union of the two sets
• ‘1’ – Identical, ‘0’ – Dissimilar
• Simple & quick to calculate
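The measure is a one-liner over Python sets (the example documents are made up for illustration):

```python
def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B| for two bags-of-words represented as sets of terms."""
    if not a and not b:
        return 1.0  # two empty documents: treat as identical
    return len(a & b) / len(a | b)

doc1 = {"fruit", "flies", "like", "bananas"}
doc2 = {"fruit", "flies", "eat", "bananas"}
print(jaccard_similarity(doc1, doc2))  # 3 shared terms / 5 total = 0.6
```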
Term Frequencies (TF) & Vector Space Models
• Term Frequency (TF)
• Count of number of term
occurrences in a document
• Vector Space Model
• Dimension for each Term in Vocabulary
• Map documents into this space
Very high dimensionality, low density:
for many documents, most dimensions
will be zero
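A minimal sketch of mapping documents into this space, one dimension per vocabulary term (the toy documents are made up):

```python
from collections import Counter

def tf_vector(tokens, vocabulary):
    """Term-frequency vector: one count per vocabulary term, in order."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

docs = [["fish", "fish", "chips"], ["chips", "salt"]]
vocabulary = sorted({t for doc in docs for t in doc})  # ['chips', 'fish', 'salt']
vectors = [tf_vector(d, vocabulary) for d in docs]
print(vectors)  # [[1, 2, 0], [1, 0, 1]] — the zeros show the low density
```

With a realistic vocabulary of tens of thousands of terms, almost every entry is zero, which is why sparse representations are used in practice.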
Distance Measures
• Distance between two documents in a vector space model
• Two common measures: Euclidean and Cosine
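Both measures are straightforward over the term-frequency vectors above (cosine here is a similarity; its distance form is 1 minus this value):

```python
import math

def euclidean(u, v):
    """Straight-line distance between two document vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

u, v = [1, 2, 0], [2, 4, 0]
print(euclidean(u, v))          # 2.236… — grows with document length
print(cosine_similarity(u, v))  # 1.0 — same direction, so same term mix
```

The example shows why cosine is often preferred for text: `v` is just `u` doubled (e.g. the same document repeated), so cosine says "identical" while Euclidean does not.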
Term Frequency / Inverse Document
Frequency (TF/IDF)
• Term Frequency / Inverse Document Frequency (TF/IDF)
• Reflects how important a word is to a document in a corpus
• Increases proportionally to the number of times a word appears in the
document
• Offset by the frequency of the word in the corpus
• Adjusts for words that appear more frequently in general
See: https://deeplearning4j.org/bagofwords-tf-idf
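One common TF-IDF variant can be sketched directly from the definition (several weighting schemes exist; this uses a raw count times a log inverse document frequency, with a toy corpus):

```python
import math

def tf_idf(term, doc, corpus):
    """Raw term frequency in the document, damped by the log of how many
    corpus documents contain the term (one common TF-IDF variant)."""
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "cat"]]
print(tf_idf("the", corpus[0], corpus))  # 0.0 — appears in every document
print(tf_idf("cat", corpus[2], corpus))  # positive — frequent here, rarer elsewhere
```

Note how "the" scores zero despite appearing everywhere: the IDF term is exactly the adjustment for words that are frequent in general.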
Distance Matrices & Clustering
• Square, symmetrical matrix with pair-wise distances between
documents in a corpus
• Used for clustering documents, e.g.
• K-Means clustering
• Hierarchical clustering (Ward algorithm commonly used)
See: https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/
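In practice SciPy's `scipy.spatial.distance.pdist` and `scipy.cluster.hierarchy.linkage` (as in the linked tutorial) build and consume these matrices; a dependency-free sketch of the matrix itself, using Euclidean distance and toy vectors:

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def distance_matrix(vectors):
    """Square, symmetric pair-wise distance matrix with a zero diagonal."""
    n = len(vectors)
    return [[euclidean(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]

vectors = [[1, 0, 0], [0, 1, 0], [1, 1, 0]]
m = distance_matrix(vectors)
for row in m:
    print([round(x, 3) for x in row])
```

A clustering algorithm (K-Means, Ward hierarchical clustering) then only needs these pair-wise distances, not the original high-dimensional vectors.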
Edit Distance
• Number of inserts / deletes / substitutions to
transform one document to another
• Weights may be applied to different types of
edit
• E.G. Terms that are semantically related may have
a lower weight
• Levenshtein Edit Distance may be solved using
Dynamic Programming
• Allows document alignments to be produced
• But: Expensive in time and space!
https://wordpress.com/read/feeds/71910664/posts/1718047915
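The dynamic-programming solution keeps one row of the table at a time; it works identically over characters or over term sequences:

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or token lists),
    computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute (free if equal)
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))                        # 3
print(levenshtein(["the", "cat", "sat"], ["the", "dog", "sat"]))  # 1
```

The full table is O(len(a) × len(b)) in time, which is the "expensive" caveat on the slide; keeping the whole table (rather than one row) is what makes alignments recoverable.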
Document Retrieval with MinHash & Locality-Sensitive Hashing (LSH)
• Problem: How to retrieve similar documents from a large corpus
• MinHash:
• “Document Fingerprint” with n-hash values (n ≈ 200)
• Characteristic: Similar documents have similar hash values
• Use Jaccard similarity to measure MinHash similarity, and hence document similarity
• Independent of document size, small storage and retrieval costs
• Locality-Sensitive Hashing (LSH):
• For large numbers of documents
• Organizes documents represented by MinHash into buckets
• Documents within a bucket are likely to be similar
• Reduces retrieval time; good for duplicate / near-duplicate document detection
etc.
https://nickgrattandatascience.wordpress.com/2013/11/12/minhash-implementation-in-c/
https://nickgrattandatascience.wordpress.com/2017/12/31/lsh-for-finding-similar-documents-from-a-large-number-of-documents-in-c/
Natural Language Processing (NLP)
• Techniques described thus far are Text Analytical
• Numerical in nature, take little account of the meaning of
text
• Terms are numerically encoded symbols
• NLP attempts to understand text
• Semantics – The meaning of a word based on how / where
it’s used
• Part of Speech (POS) Tagging – Understanding the
construction of sentences, phrases etc.
• Word Relatedness & Concepts: WordNet –
https://wordnet.princeton.edu/
E.g. Homonym Problem:
Words with same spelling but
different meanings, depending
on how / where used
E.g. Disambiguation: “Like” as a verb (“Fruit
flies like to eat bananas”), “like” as a preposition
(“Fruit flies that look like a banana”)
Word Embeddings – word2vec
• Unsupervised semantic analysis from
corpus of terms
• Define number of dimensions for the
semantic space (e.g. 300)
• Window: Define number of words before
/ after (e.g. 1, 2 or 5) the target word
• Generate Training Samples
• For each word, create parameters that
map the word into the semantic space
• The “Word Vector Lookup Table”
See: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
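Generating the training samples is the easy part; a sketch of the skip-gram pairing (each word paired with its neighbours inside the window):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) training samples for the skip-gram model:
    each token is paired with every neighbour within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "you shall know a word by the company it keeps".split()
print(skipgram_pairs(sentence, window=1)[:4])
# [('you', 'shall'), ('shall', 'you'), ('shall', 'know'), ('know', 'shall')]
```

These pairs are exactly Firth's "company it keeps": the network is trained to predict a word's company, and the semantic space falls out of that.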
Word Embeddings – word2vec
• Neural Network trained on samples
• Output layer discarded
• Keep the Hidden Layer Weight Matrix!
See: https://www.tensorflow.org/tutorials/word2vec
Also look at “gensim” for a word2vec implementation:
https://radimrehurek.com/gensim/models/word2vec.html
Two training architectures:
1. Continuous Bag of Words (CBOW)
2. Skip-gram
Word Embedding - Visualisation
• Use Principal Component Analysis (PCA) to create a 2D representation
of the semantic vector space
RNN and NLP
• Recurrent Neural Networks (RNN) may
be used for generative models
• Once trained, they can generate text
with the same structure, syntax and
semantics as the training set
• For a bit of fun, see “The Unreasonable
Effectiveness of Recurrent Neural
Networks”
• http://karpathy.github.io/2015/05/21/rnn-effectiveness/
C code generated from an RNN trained on the Linux
code base. While it does not execute (!) it is
syntactically correct. The model, for example, has
learnt to match “{” “}” pairs and parentheses
Resources and References
[1] “Natural Language Processing with Deep Learning” – Christopher Manning et
al., Stanford University. https://www.youtube.com/watch?v=OQQ-W_63UgQ
• Lecture series including an excellent description of back propagation, word2vec and GloVe
[2] “Hands-On Machine Learning with Scikit-Learn & TensorFlow”, Aurélien
Géron (O'Reilly Media, 2017)
• Excellent introduction; Jupyter Notebooks available here: https://github.com/ageron
[3] “Speech and Language Processing”, Daniel Jurafsky & James H. Martin (2nd
Edition, Pearson Education 2009)
• In-depth introduction to NLP
[4] “Introduction to Information Retrieval”, Christopher Manning et al
(Cambridge University Press, 2008)
• Probabilistic models for text retrieval, TF/IDF, Vector Space, Support Vector Machines…
