Industrial strength - Natural Language Processing

Industrial Strength
Natural Language Processing
I am Jeffrey Williams
I am here to provide meaning to unstructured text
I work @ Label Insight
You can find me at @jeffxor
Label Insight is Hiring!
https://www.labelinsight.com/careers/topic/engineering

Caveats
◇ I am not a linguistics specialist
◇ I am not a natural language specialist
◇ I am not a data scientist
◇ I am a software engineer
This talk is aimed at software engineers trying to tackle
text problems by extracting meaning or understanding

Agenda
◇ Natural Language Processing Concepts
◇ spacy.io Introduction
◇ Visualizations
◇ Applying spacy.io
◇ spacy.io Extensions
◇ Lessons Learnt
◇ Alternatives to spaCy.io

Let’s review some
NLP concepts
Sentence Boundary Detection
Sentence boundaries are often
marked by periods or other
punctuation marks, but these
same characters can serve other
purposes
Tokens/Word Segmentation
Separate a chunk of continuous
text into separate words. Text
segmentation is a significant task
requiring knowledge of the
vocabulary and morphology.
Stemming/Lemmatization
reduce inflectional forms of a
word to a common base form
am, are, is -> be
car, cars, car's, cars' -> car
Named Entity Recognition
Given a stream of text, determine
which items in the text map to
proper names, such as people or
places, and what the type of each
such name is (e.g. person,
location, organization).
Parts of Speech Tagging
Given a sentence, determine the
part of speech for each word.
Many words, especially common
ones, can serve as multiple parts
of speech.
Word sense disambiguation
Many words have more than one
meaning; we have to select the
meaning which makes the most
sense in context.
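Two of the concepts above, sentence boundary detection and tokenization, can be sketched in a few lines. This is a minimal sketch, assuming the spaCy v3 API (the talk describes v2.0, where `add_pipe` takes a component object instead of a name). It uses a blank English pipeline with the rule-based `sentencizer`, so no pretrained model is needed; the sample sentence is made up.

```python
import spacy

# Blank English pipeline: tokenizer only, no pretrained model required.
nlp = spacy.blank("en")
# Rule-based sentence splitter that breaks on sentence-final punctuation tokens.
nlp.add_pipe("sentencizer")

doc = nlp("Let's go to N.Y.! It is only a short flight.")

# Tokenization: the periods inside "N.Y." stay part of a single token.
tokens = [token.text for token in doc]
# Sentence boundaries: only the standalone "!" and "." tokens end a sentence.
sentences = [sent.text for sent in doc.sents]

print(tokens)
print(sentences)
```

The periods in "N.Y." do not end a sentence because the tokenizer keeps the abbreviation as one token, which is the ambiguity the Sentence Boundary Detection slide describes.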

spaCy.io Introduction
◇ Open-source library for advanced Natural Language Processing (NLP) in Python
◇ Opinionated NLP library (not an API/Service)
◇ Number of pretrained models for common
languages
◇ Great documentation and example code
◇ Helps build information extraction & natural
language understanding systems
spaCy.io is a very powerful library with many extension
points for training and pipeline configuration

spaCy.io Features
Lemmatization
Assigning the base forms of
words. For example, the lemma of
"was" is "be", and the lemma of
"rats" is "rat".
Rule-based Matching
Finding sequences of tokens
based on their texts and linguistic
annotations, similar to regular
expressions.
Similarity
Comparing words, text spans and
documents and how similar they
are to each other.
(POS) Part-of-speech Tagging
Assigning word types to tokens,
like verb or noun.
(NER) Named Entity Recognition
Labelling named "real-world"
objects, like persons, companies
or locations.
Dependency Parsing
Assigning syntactic dependency
labels, describing the relations
between individual tokens, like
subject or object.
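Of the features above, rule-based matching is the easiest to try without a pretrained model. A minimal sketch, assuming the spaCy v3 `Matcher` API (in v2.0 the call was `matcher.add(name, None, pattern)`); the pattern name and sample text are illustrative.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer only; the pattern uses lexical attributes
matcher = Matcher(nlp.vocab)

# One dict per token: "hello", an optional punctuation token, then "world".
pattern = [
    {"LOWER": "hello"},
    {"IS_PUNCT": True, "OP": "?"},
    {"LOWER": "world"},
]
matcher.add("HELLO_WORLD", [pattern])

doc = nlp("Hello, world! Hello world!")
# Each match is (match_id, start_token, end_token); slice the Doc to get the span.
matches = [doc[start:end].text for _, start, end in matcher(doc)]
print(matches)
```

Patterns describe tokens rather than characters, which is why this reads like a regular expression over words.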

Language Support
spaCy v2.0 features new neural models for
tagging, parsing and entity recognition. The
models have been designed and implemented
from scratch specifically for spaCy, to give you
an unmatched balance of speed, size and
accuracy.
Each model is a combination of language (English),
training data (web, news, etc.) and model size
(sm, md, lg)
https://spacy.io/usage/models

Provided Named Entities
In my experience, the Location entity type is not as
well trained as in Google Cloud Natural Language
https://spacy.io/api/annotation#section-named-entities

Parts-of-Speech Tagging
Maps all language-specific part-of-speech tags
to a small, fixed set of word type tags following
the Universal Dependencies scheme.
https://spacy.io/api/annotation#section-pos-tagging

Visualizations
Super simple and super powerful for development iteration

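import spacy
from spacy import displacy
# Load the small English model (the 'en' shortcut only works if it was linked)
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'This is a sentence.')
# Start a local web server that renders the dependency parse
displacy.serve(doc, style='dep')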
Dependency Visualization

import spacy
from spacy import displacy
text = """But Google is starting from
behind. The company made a late push
into hardware, and Apple’s Siri,
available on iPhones, and Amazon’s Alexa
software, which runs on its Echo and Dot
devices, have clear leads in
consumer adoption."""
nlp = spacy.load('custom_ner_model')
doc = nlp(text)
displacy.serve(doc, style='ent')
Named Entity
Visualization
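`displacy.serve` starts a local web server and blocks until stopped. For saving a visualization or embedding it in a page, `displacy.render` returns the markup as a string instead. The sketch below is a minimal example under assumptions: it feeds displaCy hand-written entity offsets through its `manual=True` input format, so it runs without any model. The single ORG annotation is written by hand for illustration and is not the output of the custom model used on the slide.

```python
from spacy import displacy

text = "But Google is starting from behind."

# displaCy's manual format: character offsets plus a label per entity.
# "Google" occupies characters 4 to 10 (end offset is exclusive).
manual_doc = {
    "text": text,
    "ents": [{"start": 4, "end": 10, "label": "ORG"}],
    "title": None,
}

# page=True wraps the markup in a full HTML page; jupyter=False forces
# a string return value even when running inside a notebook.
html = displacy.render(manual_doc, style="ent", manual=True, page=True, jupyter=False)
print(html[:200])
```

Writing `html` to a file gives a page that can be shared with people who do not have Python installed.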

spaCy.io Code
Examples
Examples of applying spaCy.io’s building blocks to solve a
problem

Navigating Parse Trees
◇ navigate the parse tree including subtrees attached
to a word
◇ Noun chunks (noun plus the words describing the
noun)
◇ the terms head and child describe the words
connected by a single arc
◇ the term dep is used for the arc label (the type of
syntactic relation)
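The head, child and dep terms above can be explored on a small tree. This is a minimal sketch, assuming the spaCy v3 `Doc` constructor, which accepts `heads` (absolute token indices) and `deps`. The parse is written by hand so the example runs without a pretrained model; with a loaded model the same attributes are filled in by the parser.

```python
import spacy
from spacy.tokens import Doc

vocab = spacy.blank("en").vocab
words = ["Autonomous", "cars", "shift", "insurance", "liability", "toward", "manufacturers"]
# Hand-written parse: heads[i] is the index of token i's head (the root points to itself).
heads = [1, 2, 2, 4, 2, 2, 5]
deps = ["amod", "nsubj", "ROOT", "compound", "dobj", "prep", "pobj"]
doc = Doc(vocab, words=words, heads=heads, deps=deps)

# Every arc connects a child to its head and carries a dep label.
for token in doc:
    children = [child.text for child in token.children]
    print(f"{token.text:<14}{token.dep_:<10}{token.head.text:<10}{children}")

# The subtree of a word is the word plus everything attached below it.
subject_subtree = [t.text for t in doc[1].subtree]
root_children = [t.text for t in doc[2].children]
print(subject_subtree, root_children)
```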

Phrase Matcher
◇ efficiently match large terminology lists
◇ match sequences based on lists of token
descriptions
◇ accepts match patterns in the form of Doc objects
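A minimal sketch of matching a terminology list, assuming the spaCy v3 `PhraseMatcher` API (in v2.0 the call was `matcher.add(name, None, *patterns)`). The term list and sentence are illustrative.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # only the tokenizer is needed
matcher = PhraseMatcher(nlp.vocab)

terms = ["Barack Obama", "Angela Merkel", "Washington, D.C."]
# Patterns are Doc objects; make_doc runs only the tokenizer, which keeps
# pattern creation fast for large lists.
patterns = [nlp.make_doc(term) for term in terms]
matcher.add("TERMINOLOGY", patterns)

doc = nlp(
    "German Chancellor Angela Merkel and US President Barack Obama "
    "converse in the Oval Office"
)
found = [doc[start:end].text for _, start, end in matcher(doc)]
print(found)
```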

spaCy.io
Applied to the Real World
Walk through applying to a new problem domain

Training Data
Provide additional data to
either adjust an existing
model or build your own
model.
https://prodi.gy/

spaCy.io Extensions
Functionality
Number of extension points to
add customizations
◇ Adjust pipeline
◇ Add new pipeline features
◇ Add functionality to core
components
◇ Add callback functions into
pipeline processes

spaCy.io Pipeline
Disabling/Modifying
If you don't need a particular
component of the pipeline (for
example, the tagger or the parser),
you can disable loading it.
This can sometimes make a big
difference and improve loading
speed.
Custom Components
Custom components can be
added to the pipeline.
You can place a component before
or after another one, tell spaCy to
add it first or last in the pipeline,
or define a custom name.
E.g. add spell checking (hunspell)
Extension Attributes
Allow you to set any custom
attributes and methods on the
Doc, Span and Token: attach
additional information relevant to
your application, add new
features and functionality to
spaCy, and implement your own
models
E.g. improve spaCy's sentence
boundary detection
https://spacy.io/usage/processing-pipelines
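Custom components and extension attributes usually go together: the component computes something and stores it on an extension attribute. A minimal sketch, assuming the spaCy v3 API (`@Language.component` and string names in `add_pipe`; in v2.0 the function itself is passed to `add_pipe`). The component name `word_counter` and the attribute `word_count` are hypothetical, chosen only for illustration.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Hypothetical custom attribute, available afterwards as doc._.word_count.
Doc.set_extension("word_count", default=0, force=True)


@Language.component("word_counter")
def word_counter(doc):
    """Store the number of non-punctuation tokens on the Doc."""
    doc._.word_count = sum(1 for token in doc if not token.is_punct)
    return doc  # a component must return the Doc it received


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
# Placement options include first=True, last=True, before=... and after=...
nlp.add_pipe("word_counter", first=True)

doc = nlp("Custom components run in pipeline order.")
print(nlp.pipe_names, doc._.word_count)
```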

Processing Pipeline
The Language object coordinates
these components. It takes raw text
and sends it through the pipeline,
returning an annotated document. It
also orchestrates training and
serialization.
https://spacy.io/usage/processing-pipelines
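The pipeline held by the Language object can be inspected and temporarily trimmed. A minimal sketch, assuming the spaCy v3 API (`select_pipes`; v2.0 used `disable_pipes`). It uses a blank pipeline with only a `sentencizer` so it runs without a model download.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = "One sentence. Another sentence."
doc = nlp(text)  # raw text in, annotated Doc out
n_sents = len(list(doc.sents))

# Temporarily switch a component off; it is restored when the block exits.
with nlp.select_pipes(disable=["sentencizer"]):
    names_inside = list(nlp.pipe_names)
names_after = list(nlp.pipe_names)

print(n_sents, names_inside, names_after)

# With a pretrained model, components can also be skipped at load time,
# for example: spacy.load("en_core_web_sm", disable=["parser"])
```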

Named Entity Extension
Adding Additional Entity Types
Need a few hundred labeled sentences
for a good start; mix in examples of other
entity types
Actual training is performed by looping
over the examples, making a prediction for each
and comparing it against the gold-standard annotations
# Each entity is (start_char, end_char, label); end_char is exclusive
train_data = [
    ("Uber blew through $1 million a week", [(0, 4, 'ORG')]),
    ("Android Pay expands to Canada", [(0, 11, 'PRODUCT'), (23, 29, 'GPE')]),
    ("Spotify steps up Asia expansion", [(0, 7, "ORG"), (17, 21, "LOC")]),
    ("Google Maps launches location sharing", [(0, 11, "PRODUCT")]),
    ("Google rebrands its business apps", [(0, 6, "ORG")]),
    ("look what i found on google! 😂", [(21, 27, "PRODUCT")])]
Update a pre-trained Model
Need to provide many examples (a few hundred)
to meaningfully improve the system
https://spacy.io/usage/training#section-ner
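The training loop described above can be sketched as follows. This is a minimal sketch under assumptions, not the talk's actual code: it uses the spaCy v3 training API (`Example`, `nlp.initialize`, `nlp.update`), which differs from the v2.0 API, and it trains a blank model rather than updating a pre-trained one so that it runs without a model download. Four examples are far too few for a usable model; they only show the mechanics.

```python
import random

import spacy
from spacy.training import Example

# (start_char, end_char, label); end offsets are exclusive, so "Spotify" is 0-7.
TRAIN_DATA = [
    ("Uber blew through $1 million a week", [(0, 4, "ORG")]),
    ("Android Pay expands to Canada", [(0, 11, "PRODUCT"), (23, 29, "GPE")]),
    ("Spotify steps up Asia expansion", [(0, 7, "ORG"), (17, 21, "LOC")]),
    ("Google Maps launches location sharing", [(0, 11, "PRODUCT")]),
]

random.seed(0)
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")

# Pair each tokenized text with its gold-standard entity annotations.
examples = [
    Example.from_dict(nlp.make_doc(text), {"entities": entities})
    for text, entities in TRAIN_DATA
]

# initialize() infers the entity labels from the examples and returns an optimizer.
optimizer = nlp.initialize(lambda: examples)

for epoch in range(10):
    random.shuffle(examples)
    losses = {}
    # Predict, compare against the gold annotations, and update the weights.
    nlp.update(examples, sgd=optimizer, drop=0.2, losses=losses)

print(ner.labels, losses)
```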

Custom Semantics
◇ The parser can be trained to
predict any type of tree
structure over your input text
◇ Can be useful for
conversational applications
◇ Train spaCy's parser to label
intents and their targets, like
attributes, quality, time and
locations
https://spacy.io/usage/training#section-tagger-parser
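To make the idea of a custom semantic tree concrete, the sketch below builds one by hand and reads the intent relations off it. It is a minimal sketch, assuming the spaCy v3 `Doc` constructor; the labels PLACE, QUALITY and ATTRIBUTE and the annotation itself are illustrative and written by hand, not produced by a trained parser. A parser trained on such annotations would predict the same kind of structure for new sentences.

```python
import spacy
from spacy.tokens import Doc

vocab = spacy.blank("en").vocab
words = ["find", "a", "cafe", "with", "great", "wifi"]
# Hand-written annotation: heads are absolute token indices,
# "-" marks tokens that carry no semantic relation.
heads = [0, 2, 0, 5, 5, 2]
deps = ["ROOT", "-", "PLACE", "-", "QUALITY", "ATTRIBUTE"]
doc = Doc(vocab, words=words, heads=heads, deps=deps)

# Keep only the meaningful arcs as (head, relation, child) triples.
relations = [
    (token.head.text, token.dep_, token.text)
    for token in doc
    if token.dep_ not in ("-", "ROOT")
]
print(relations)
```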

spaCy.io Lessons
Learnt
Attempt to summarize my learning curve, both from
implementation and from business buy-in

Start Simple!
Define your key outcomes
Visualize the data
Experiment, iteration is key!

Educate
Engage your SMEs
Visualizations always help
Opt for easy/understandable

Measurement
System Metric
Operations Metric
Overall Business Metric

spaCy.io Alternatives
There are many alternatives available; they tend to fall into two
categories: alternative libraries and hosted solutions

Alternate Libraries
◇ NLTK Natural Language Toolkit (Python)
◇ Stanford CoreNLP (Java)
◇ NLP4J (Java)
Libraries allow you to configure, extend and train for your
problem domain

Alternate Hosted
Solutions
◇ Microsoft Azure Text Analytics
◇ Google Cloud Natural Language
Hosted solutions provide a generic solution
◇ Well trained models
◇ Basic/Generic Named Entities
◇ Unable to model/train for your domain (yet!)

Thanks!
Any questions?
You can find me at:
◇ @jeffxor
◇ jwilliams@labelinsight.com
◇ https://speakerrate.com/speakers/181771 (Feedback)
Label Insight is Hiring!
https://www.labelinsight.com/careers/topic/engineering

Useful Information
This presentation used the following resources:
◇ spacy.io
◇ spacy.io github
◇ explosion.ai/demos/
◇ Natural Language Processing Wikipedia
◇ Stanford CoreNLP
◇ Microsoft Azure Text Analytics
◇ Google Cloud Natural Language
