Introduction to Text Analytics
and Natural Language
Processing
Nick Grattan
Application Architecture Director, Dassault Systèmes
PhD Student, Insight Centre for Data Analytics, University College Cork
www.3ds.com insight-centre.org/
Cork AI Meetup 15th March 2018
www.meetup.com/Cork-AI/
@NickGrattan
“You shall know a word by the
company it keeps”
J.R. Firth (1957)
Agenda
• Introduction to Text Analytics.
• Overview of common techniques and the types of problems commonly
solved.
• Traditional “Frequentist” text analysis
• Bag-of-Words and Vector Space Models (High Dimensional, Low Density)
• Measuring document / text similarity with distance metrics and clustering
documents.
• Hands-On: Document Clustering with Python, NLTK and Scipy.
• Word Embeddings with word2vec for semantic term analysis
• Unsupervised semantic analysis using a corpus of words
• Hands-On: Creating a semantic space with a Neural Network in TensorFlow
Natural Language Processing and Text
Analytics
• Natural Language Processing (NLP)
• Area of AI concerned with interactions between computers and human natural
language, to process or “understand” natural language
• Common tasks: speech recognition, natural language understanding & generation,
automatic summarization, part-of-speech tagging, disambiguation, named entity
recognition …
• To fully understand and represent the meaning of language is a difficult goal (AI-Complete) [1]
• Text Analytics (Text Mining):
• The process or practice of examining large collections of written resources in order
to generate new information (Oxford English Dictionary)
• Transforms text to data for information discovery, establishing relationships, often
using NLP
Text Preparation
• Extract text from documents
• E.g. use “BeautifulSoup” in Python to process HTML/XML documents
• Process terms (words) from text
• Tokenisation – breaks text into discrete terms
• Stop Words – remove common words (“the”, “and” etc.)
• Stemming – Reduce words to their root or base form ("fishing", "fished", and
"fisher" => "fish")
• E.g. “NLTK” (Natural Language Toolkit) in Python
• All, some, or none of these techniques may be used, depending on the
application
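These preparation steps can be sketched in a few lines of plain Python. The stop list and suffix rules below are illustrative stand-ins, not NLTK's real stopwords corpus or PorterStemmer:

```python
import re

# Illustrative stop-word list and suffix rules only; a real pipeline would
# use NLTK's stopwords corpus and PorterStemmer instead.
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "is", "in"}

def tokenise(text):
    """Break text into lower-case terms (naive regex tokeniser)."""
    return re.findall(r"[a-z']+", text.lower())

def stem(term):
    """Very rough stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "er", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def prepare(text):
    """Tokenise, drop stop words, then stem what remains."""
    return [stem(t) for t in tokenise(text) if t not in STOP_WORDS]

print(prepare("The fisher fished while fishing"))  # ['fish', 'fish', 'while', 'fish']
```

As the slide notes, a real application might apply all, some, or none of these steps.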
Bag-of-Words & Jaccard Similarity
• Bag-of-Words is the set of terms found
in a document, corpus etc.
• Jaccard Similarity between two Bags-of-Words, A & B:
• Ratio of the length of the intersection to the length of the union of the two sets
• ‘1’ – Identical, ‘0’ – Dissimilar
• Simple & quick to calculate
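The measure is a one-liner over Python sets (the example documents are made up for illustration):

```python
def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B| for two bags-of-words represented as sets of terms."""
    if not a and not b:
        return 1.0  # two empty documents: treat as identical
    return len(a & b) / len(a | b)

doc1 = {"fruit", "flies", "like", "bananas"}
doc2 = {"fruit", "flies", "eat", "bananas"}
print(jaccard_similarity(doc1, doc2))  # 3 shared terms / 5 total = 0.6
```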
Term Frequencies (TF) & Vector Space Models
• Term Frequency (TF)
• Count of number of term
occurrences in a document
• Vector Space Model
• Dimension for each Term in Vocabulary
• Map documents into this space
Very high dimensionality, low density:
for many documents, most dimensions
will be zero
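A minimal sketch of mapping documents into this space, one dimension per vocabulary term (the toy documents are made up):

```python
from collections import Counter

def tf_vector(tokens, vocabulary):
    """Term-frequency vector: one count per vocabulary term, in order."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

docs = [["fish", "fish", "chips"], ["chips", "salt"]]
vocabulary = sorted({t for doc in docs for t in doc})  # ['chips', 'fish', 'salt']
vectors = [tf_vector(d, vocabulary) for d in docs]
print(vectors)  # [[1, 2, 0], [1, 0, 1]] — the zeros show the low density
```

With a realistic vocabulary of tens of thousands of terms, almost every entry is zero, which is why sparse representations are used in practice.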
Distance Measures
• Distance between two documents in a vector space model
• Two common measures: Euclidean and Cosine
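Both measures are straightforward over the term-frequency vectors above (cosine here is a similarity; its distance form is 1 minus this value):

```python
import math

def euclidean(u, v):
    """Straight-line distance between two document vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

u, v = [1, 2, 0], [2, 4, 0]
print(euclidean(u, v))          # 2.236… — grows with document length
print(cosine_similarity(u, v))  # 1.0 — same direction, so same term mix
```

The example shows why cosine is often preferred for text: `v` is just `u` doubled (e.g. the same document repeated), so cosine says "identical" while Euclidean does not.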
Term Frequency / Inverse Document
Frequency (TF/IDF)
• Term Frequency / Inverse Document Frequency (TF/IDF)
• Reflects how important a word is to a document in a corpus
• Increases proportionally to the number of times a word appears in the
document
• Offset by the frequency of the word in the corpus
• Adjusts for words that appear more frequently in general
See: https://deeplearning4j.org/bagofwords-tf-idf
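One common TF-IDF variant can be sketched directly from the definition (several weighting schemes exist; this uses a raw count times a log inverse document frequency, with a toy corpus):

```python
import math

def tf_idf(term, doc, corpus):
    """Raw term frequency in the document, damped by the log of how many
    corpus documents contain the term (one common TF-IDF variant)."""
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "cat"]]
print(tf_idf("the", corpus[0], corpus))  # 0.0 — appears in every document
print(tf_idf("cat", corpus[2], corpus))  # positive — frequent here, rarer elsewhere
```

Note how "the" scores zero despite appearing everywhere: the IDF term is exactly the adjustment for words that are frequent in general.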
Distance Matrices & Clustering
• Square, symmetrical matrix with pair-wise distances between
documents in a corpus
• Used for clustering documents, e.g.
• K-Means clustering
• Hierarchical clustering (Ward algorithm commonly used)
See: https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/
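In practice SciPy's `scipy.spatial.distance.pdist` and `scipy.cluster.hierarchy.linkage` (as in the linked tutorial) build and consume these matrices; a dependency-free sketch of the matrix itself, using Euclidean distance and toy vectors:

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def distance_matrix(vectors):
    """Square, symmetric pair-wise distance matrix with a zero diagonal."""
    n = len(vectors)
    return [[euclidean(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]

vectors = [[1, 0, 0], [0, 1, 0], [1, 1, 0]]
m = distance_matrix(vectors)
for row in m:
    print([round(x, 3) for x in row])
```

A clustering algorithm (K-Means, Ward hierarchical clustering) then only needs these pair-wise distances, not the original high-dimensional vectors.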
Edit Distance
• Number of inserts / deletes / substitutions to
transform one document to another
• Weights may be applied to different types of
edit
• E.G. Terms that are semantically related may have
a lower weight
• Levenshtein Edit Distance may be solved using
Dynamic Programming
• Allows document alignments to be produced
• But: Expensive in time and space!
https://wordpress.com/read/feeds/71910664/posts/1718047915
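The dynamic-programming solution keeps one row of the table at a time; it works identically over characters or over term sequences:

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or token lists),
    computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute (free if equal)
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))                        # 3
print(levenshtein(["the", "cat", "sat"], ["the", "dog", "sat"]))  # 1
```

The full table is O(len(a) × len(b)) in time, which is the "expensive" caveat on the slide; keeping the whole table (rather than one row) is what makes alignments recoverable.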
Document Retrieval with MinHash & Locality-Sensitive Hashing (LSH)
• Problem: How to retrieve similar documents from a large corpus
• MinHash:
• “Document Fingerprint” with n-hash values (n ≈ 200)
• Characteristic: Similar documents have similar hash values
• Use Jaccard similarity to measure MinHash similarity, and hence document similarity
• Independent of document size, small storage and retrieval costs
• Locality-Sensitive Hashing (LSH):
• For large numbers of documents
• Organizes documents represented by MinHash into buckets
• Documents within a bucket are likely to be similar
• Reduces retrieval time; good for duplicate / near-duplicate document detection
etc.
https://nickgrattandatascience.wordpress.com/2013/11/12/minhash-implementation-in-c/
https://nickgrattandatascience.wordpress.com/2017/12/31/lsh-for-finding-similar-documents-from-a-large-number-of-documents-in-c/
Natural Language Processing (NLP)
• Techniques described thus far are Text Analytical
• Numerical in nature, take little account of the meaning of
text
• Terms are numerically encoded symbols
• NLP attempts to understand text
• Semantics – The meaning of a word based on how / where
it’s used
• Part of Speech (POS) Tagging – Understanding the
construction of sentences, phrases etc.
• Word Relatedness & Concepts: WordNet –
https://wordnet.princeton.edu/
E.g. Homonym Problem:
Words with same spelling but
different meanings, depending
on how / where used
E.g. Disambiguation: “Like” as a verb (“Fruit
flies like to eat bananas”), “like” as a preposition
(“Fruit flies that look like a banana”)
Word Embeddings – word2vec
• Unsupervised semantic analysis from
corpus of terms
• Define number of dimensions for the
semantic space (e.g. 300)
• Window: Define number of words before
/ after (e.g. 1, 2 or 5) the target word
• Generate Training Samples
• For each word, create parameters that
map the word into the semantic space
• The “Word Vector Lookup Table”
See: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
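Generating the training samples is the easy part; a sketch of the skip-gram pairing (each word paired with its neighbours inside the window):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) training samples for the skip-gram model:
    each token is paired with every neighbour within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "you shall know a word by the company it keeps".split()
print(skipgram_pairs(sentence, window=1)[:4])
# [('you', 'shall'), ('shall', 'you'), ('shall', 'know'), ('know', 'shall')]
```

These pairs are exactly Firth's "company it keeps": the network is trained to predict a word's company, and the semantic space falls out of that.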
Word Embeddings – word2vec
• Neural Network trained on samples
• Output layer discarded
• Keep the Hidden Layer Weight Matrix!
See: https://www.tensorflow.org/tutorials/word2vec
Also look at “gensim” for a word2vec implementation:
https://radimrehurek.com/gensim/models/word2vec.html
Two training architectures:
1. Continuous Bag of Words (CBOW)
2. Skip-gram
Word Embedding - Visualisation
• Use Principal Component Analysis (PCA) to create a 2D representation
of the semantic vector space
RNN and NLP
• Recurrent Neural Networks (RNN) may
be used for generative models
• Once trained, they can generate text
with the same structure, syntax and
semantics as the training set
• For a bit of fun, see “The Unreasonable
Effectiveness of Recurrent Neural
Networks”
• http://karpathy.github.io/2015/05/21/rnn-effectiveness/
C code generated from an RNN trained on the Linux
code base. While it does not execute (!) it is
syntactically correct. The model, for example, has
learnt to match “{” “}” pairs and parentheses
Resources and References
[1] “Natural Language Processing with Deep Learning” – Christopher Manning et
al., Stanford University. https://www.youtube.com/watch?v=OQQ-W_63UgQ
• Lecture series including an excellent description of back propagation, word2vec and GloVe
[2] “Hands-On Machine Learning with Scikit-Learn & TensorFlow”, Aurélien
Géron (O'Reilly Media, 2017)
• Excellent introduction; Jupyter Notebooks available here: https://github.com/ageron
[3] “Speech and Language Processing”, Daniel Jurafsky & James H. Martin (2nd
Edition, Pearson Education 2009)
• In-depth introduction to NLP
[4] “Introduction to Information Retrieval”, Christopher Manning et al
(Cambridge University Press, 2008)
• Probabilistic models for text retrieval, TF/IDF, Vector Space, Support Vector Machines…
