Industrial Strength
Natural Language Processing
I am Jeffrey Williams
I am here to provide meaning to unstructured text
I work @ Label Insight
You can find me at @jeffxor
Label Insight is Hiring!
https://www.labelinsight.com/careers/topic/engineering
Caveats
◇ I am not a linguistics specialist
◇ I am not a natural language specialist
◇ I am not a data scientist
◇ I am a software engineer
This talk is aimed at software engineers trying to tackle
text problems by extracting meaning or understanding
Agenda
◇ Natural Language Processing Concepts
◇ spaCy.io Introduction
◇ Visualizations
◇ Applying spaCy.io
◇ spaCy.io Extensions
◇ Lessons Learnt
◇ Alternatives to spaCy.io
Let’s review some
NLP concepts
Sentence Boundary Detection
Sentence boundaries are often
marked by periods or other
punctuation marks, but these
same characters can serve other
purposes
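To make the ambiguity concrete, here is a minimal standard-library sketch (not how spaCy does it). It compares a naive split on punctuation with a version that re-joins pieces ending in an abbreviation. The abbreviation list and both function names are assumptions made up for this illustration.

```python
import re

# Hypothetical abbreviation list, for illustration only; real systems use
# much larger lists or a trained model.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "inc.", "e.g.", "i.e."}


def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def abbreviation_aware_split(text):
    """Like naive_split, but re-join pieces that end in a known abbreviation."""
    sentences = []
    carry = ""
    for piece in naive_split(text):
        candidate = f"{carry} {piece}".strip()
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            carry = candidate  # not a real boundary, keep accumulating
        else:
            sentences.append(candidate)
            carry = ""
    if carry:
        sentences.append(carry)
    return sentences


text = "Dr. Smith joined Acme Inc. in 2010. She leads the NLP team."
print(naive_split(text))               # four pieces: the periods mislead it
print(abbreviation_aware_split(text))  # two sentences
```

Even the second version only works for abbreviations it already knows, which is why trained or rule-rich sentence segmenters are usually preferred.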
Tokens/Word Segmentation
Separate a chunk of continuous
text into separate words. Text
segmentation is a significant task
requiring knowledge of the
vocabulary and morphology.
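As a rough illustration of why splitting on whitespace is not enough, the sketch below compares it with a single regular expression that separates punctuation from words. The pattern is an assumption chosen for this example; it knows nothing about vocabulary or morphology, so it is far simpler than a real tokenizer.

```python
import re

text = "Don't panic: spaCy's tokenizer isn't whitespace-only!"

# Whitespace splitting leaves punctuation glued to the neighbouring word.
whitespace_tokens = text.split()

# A simple rule: a token is either a run of word characters (optionally with
# internal apostrophes or hyphens) or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]")
regex_tokens = TOKEN_PATTERN.findall(text)

print(whitespace_tokens)
print(regex_tokens)
```

Note that the regex keeps "Don't" as one token, whereas a linguistically informed tokenizer would typically split it into two parts.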
Stemming/Lemmatization
Reduce inflectional forms of a
word to a common base form
am, are, is -> be
car, cars, car's, cars' -> car
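The difference between the two approaches can be sketched in a few lines of plain Python. The lookup table below contains only the examples from the slide, and both helper functions are hypothetical names invented for this illustration, not a real lemmatizer.

```python
# Hypothetical lookup table built from the examples on the slide.
LEMMA_LOOKUP = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "car's": "car", "cars'": "car",
}


def lemmatize(word):
    """Return the base form if known, otherwise the lower-cased word itself."""
    key = word.lower()
    return LEMMA_LOOKUP.get(key, key)


def crude_stem(word):
    """Strip apostrophes and one trailing 's'; no dictionary involved."""
    stem = word.lower().replace("'", "")
    return stem[:-1] if stem.endswith("s") else stem


for word in ["am", "is", "cars", "car's", "cars'"]:
    print(word, "->", "stem:", crude_stem(word), "| lemma:", lemmatize(word))
```

The crude stemmer handles the "car" family but cannot map "am" to "be" and mangles "is", which is the gap that dictionary-backed lemmatization fills.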
Named Entity Recognition
Given a stream of text, determine
which items in the text map to
proper names, such as people or
places, and what the type of each
such name is (e.g. person,
location, organization).
Parts of Speech Tagging
Given a sentence, determine the
part of speech for each word.
Many words, especially common
ones, can serve as multiple parts
of speech.
Word sense disambiguation
Many words have more than one
meaning; we have to select the
meaning which makes the most
sense in context.
spaCy.io Introduction
◇ Open-source library for advanced Natural Language Processing (NLP) in Python
◇ Opinionated NLP library (not an API/Service)
◇ Number of pretrained models for common
languages
◇ Great documentation and example code
◇ Helps build information extraction & natural
language understanding systems
spaCy.io is a very powerful library with many extension
points for training and pipeline configuration
spaCy.io Features
Lemmatization
Assigning the base forms of
words. For example, the lemma of
"was" is "be", and the lemma of
"rats" is "rat".
Rule-based Matching
Finding sequences of tokens
based on their texts and linguistic
annotations, similar to regular
expressions.
Similarity
Comparing words, text spans and
documents and how similar they
are to each other.
(POS) Part-of-speech Tagging
Assigning word types to tokens,
like verb or noun.
(NER) Named Entity Recognition
Labelling named "real-world"
objects, like persons, companies
or locations.
Dependency Parsing
Assigning syntactic dependency
labels, describing the relations
between individual tokens, like
subject or object.
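Of the features above, rule-based matching is the easiest to try without downloading a trained model. The sketch below assumes the spaCy v3 API (the slides describe v2.0, where `Matcher.add` took different arguments) and uses a blank English pipeline, so only the tokenizer and lexical attributes are available. The pattern name `HELLO_WORLD` is made up for the example.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer only, no trained model required
matcher = Matcher(nlp.vocab)

# One dict per token: "hello", then optional punctuation, then "world".
pattern = [
    {"LOWER": "hello"},
    {"IS_PUNCT": True, "OP": "?"},
    {"LOWER": "world"},
]
matcher.add("HELLO_WORLD", [pattern])  # v3 signature: key, list of patterns

doc = nlp("Hello, world! Hello world!")
matched_texts = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched_texts)
```

Patterns that rely on tags, lemmas or entity types work the same way but need a pipeline that actually sets those annotations.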
Language Support
spaCy v2.0 features new neural models for
tagging, parsing and entity recognition. The
models have been designed and implemented
from scratch specifically for spaCy, to give you
an unmatched balance of speed, size and
accuracy.
Each model is a combination of language (English), training
data (web, news, etc.) and model size (sm, md,
lg)
https://spacy.io/usage/models
Provided Named Entities
In my experience, Locations are not as
well trained as in Google Cloud Natural Language
https://spacy.io/api/annotation#section-named-entities
Parts-of-Speech Tagging
Maps all language-specific part-of-speech tags
to a small, fixed set of word type tags following
the Universal Dependencies scheme.
https://spacy.io/api/annotation#section-pos-tagging
Visualizations
Super simple and super powerful for development iteration
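Dependency Visualization
import spacy
from spacy import displacy

# Load the small English model by its full package name
# (the 'en' shortcut only exists in spaCy v2)
nlp = spacy.load('en_core_web_sm')
doc = nlp('This is a sentence.')
# Start a local web server that renders the dependency parse
displacy.serve(doc, style='dep')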
Named Entity Visualization
import spacy
from spacy import displacy

text = """But Google is starting from
behind. The company made a late push
into hardware, and Apple’s Siri,
available on iPhones, and Amazon’s Alexa
software, which runs on its Echo and Dot
devices, have clear leads in
consumer adoption."""
# 'custom_ner_model' is the speaker's own trained model, not a published package
nlp = spacy.load('custom_ner_model')
doc = nlp(text)
# Start a local web server that highlights the recognized entities
displacy.serve(doc, style='ent')
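If no trained model is at hand, displaCy can also render entities that you supply yourself. The sketch below uses `displacy.render` with `manual=True`, which returns the HTML markup as a string instead of starting a server. The single hand-labelled entity and its offsets are assumptions made for this example.

```python
from spacy import displacy

text = "But Google is starting from behind."

# Entity offsets are supplied by hand, so no trained model is needed.
manual_doc = {
    "text": text,
    "ents": [{"start": 4, "end": 10, "label": "ORG"}],
    "title": None,
}

# page=False returns just the markup fragment rather than a full HTML page.
html = displacy.render(manual_doc, style="ent", manual=True, page=False)
print(html)
```

The returned string can be written to a file or embedded in a report, which is handy when sharing results with people who cannot run the code.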
spaCy.io Code Examples
Examples of applying spaCy.io’s building blocks to solve a
problem
Navigating Parse Trees
◇ Navigate the parse tree, including the subtrees attached
to a word
◇ Noun chunks (a noun plus the words describing the
noun)
◇ The terms head and child describe the words
connected by a single arc
◇ The term dep is used for the arc label (the type of syntactic
relation)
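The head, child and dep vocabulary maps directly onto token attributes. Since a real parse needs a trained model, this sketch builds a tiny `Doc` with a hand-written parse instead. It assumes the spaCy v3 `Doc` constructor, where `heads` holds the absolute index of each token's head, and the parse itself is my own annotation of the sentence.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

words = ["She", "ate", "the", "pizza"]
heads = [1, 1, 3, 1]                     # absolute index of each token's head
deps = ["nsubj", "ROOT", "det", "dobj"]  # arc labels (hand-annotated)
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

for token in doc:
    children = [child.text for child in token.children]
    print(f"{token.text:<6} dep={token.dep_:<6} head={token.head.text:<6} children={children}")

# The subtree of a token is the token plus everything attached below it.
pizza_subtree = [t.text for t in doc[3].subtree]
print(pizza_subtree)
```

With a trained pipeline the same attributes are filled in by the parser, and `doc.noun_chunks` becomes available as well.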
Phrase Matcher
◇ efficiently match large terminology lists
◇ match sequences based on lists of token
descriptions
◇ accepts match patterns in the form of Doc objects
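A minimal sketch of the phrase matcher, assuming the spaCy v3 API (in v2 the `add` call took the patterns as positional arguments). The terminology list and the `PRODUCT` key are invented for the example; a blank pipeline is enough because matching here only needs tokenization.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # case-insensitive matching

# Illustrative terminology list; real lists can hold many thousands of entries.
terms = ["Google Maps", "Android Pay", "Echo"]
# make_doc only tokenizes, which keeps pattern creation fast.
matcher.add("PRODUCT", [nlp.make_doc(term) for term in terms])

doc = nlp("Android Pay expands to Canada while google maps adds location sharing.")
found = [doc[start:end].text for _, start, end in matcher(doc)]
print(found)
```

Because the patterns are `Doc` objects, they are tokenized with the same rules as the text being searched, which avoids mismatches around punctuation.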
spaCy.io
Applied to the Real World
Walk through applying to a new problem domain
Training Data
Provide additional data to
either adjust an existing
model or build your own
model.
https://prodi.gy/
spaCy.io Extensions
Functionality
Number of extension points to
add customizations
◇ Adjust pipeline
◇ Add new pipeline features
◇ Add functionality to core
components
◇ Add callback functions into
pipeline processes
spaCy.io Pipeline
Disabling/Modifying
If you don't need a particular
component of the pipeline (for
example, the tagger or the parser),
you can disable loading it.
This can sometimes make a big
difference and improve loading
speed.
Custom Components
Custom components can be
added to the pipeline.
You can add one before or
after another component, tell spaCy to add it first or
last in the pipeline, or define a
custom name.
Eg. add spell checking (hunspell)
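Here is a small sketch of adding, placing, naming and temporarily disabling a component. It assumes the spaCy v3 API (`@Language.component` and string names in `add_pipe`; in v2 the function object was passed directly). The `token_counter` component is a hypothetical stand-in for something more useful such as a spell checker.

```python
import spacy
from spacy.language import Language


@Language.component("token_counter")  # hypothetical component, for illustration
def token_counter(doc):
    """Store the token count in the Doc's free-form user_data dict."""
    doc.user_data["n_tokens"] = len(doc)
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
# Give the component a custom name and place it first in the pipeline.
nlp.add_pipe("token_counter", name="counter", first=True)
print(nlp.pipe_names)

doc = nlp("This is a sentence. This is another one.")
print(doc.user_data["n_tokens"])

# Components can also be switched off temporarily.
with nlp.select_pipes(disable=["counter"]):
    skipped = nlp("Counter is switched off here.")
print("n_tokens" in skipped.user_data)
```

When loading a trained model, `spacy.load` accepts a `disable` list for the same purpose, which skips the named components from the start.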
Extension Attributes
Extension attributes allow you to set any custom
attributes and methods on the
Doc, Span and Token.
Use them to attach
additional information relevant to
your application, add new
features and functionality to
spaCy, and implement your own
models
Eg. improve spaCy's sentence
boundary detection
https://spacy.io/usage/processing-pipelines
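A minimal sketch of extension attributes, using `set_extension` and the `._.` namespace. The attribute names `source` and `is_brand`, and the brand list, are assumptions invented for this illustration.

```python
import spacy
from spacy.tokens import Doc, Token

# Hypothetical brand list, for illustration only.
BRANDS = {"google", "apple", "amazon"}

# force=True lets the snippet be re-run without "already exists" errors.
Doc.set_extension("source", default=None, force=True)
Token.set_extension("is_brand", getter=lambda token: token.lower_ in BRANDS, force=True)

nlp = spacy.blank("en")
doc = nlp("Google is starting from behind Apple and Amazon.")

doc._.source = "news"  # writable attribute with a default value
brands = [token.text for token in doc if token._.is_brand]  # computed by the getter
print(doc._.source, brands)
```

Keeping custom data under `._.` separates it from spaCy's built-in attributes, so it is always clear which values come from your own code.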
Processing Pipeline
The Language object coordinates
these components. It takes raw text
and sends it through the pipeline,
returning an annotated document. It
also orchestrates training and
serialization.
https://spacy.io/usage/processing-pipelines
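What the Language object does when called can be spelled out by hand: tokenize the raw text into a `Doc`, then pass that `Doc` through each component in order. The sketch assumes spaCy v3 and uses a blank pipeline with only the rule-based sentencizer, so no trained model is required.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = "This is a sentence. This is another one."

# Roughly what nlp(text) does: tokenize, then apply each component in order.
doc = nlp.make_doc(text)
for name, component in nlp.pipeline:
    doc = component(doc)

sentences = [sent.text for sent in doc.sents]
print(nlp.pipe_names, sentences)
```

In practice you would simply call `nlp(text)`; stepping through by hand is mainly useful for debugging which component changed what.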
Named Entity Extension
Adding Additional Entity Types
Need a few hundred labeled sentences
for a good start; mix in examples of other
entity types
Actual training is performed by looping
over the examples; the model makes a prediction
for each one and is corrected against the gold-standard annotations
# Each entity is (start, end, label) in character offsets, end exclusive
train_data = [
    ("Uber blew through $1 million a week", [(0, 4, 'ORG')]),
    ("Android Pay expands to Canada", [(0, 11, 'PRODUCT'), (23, 29, 'GPE')]),
    ("Spotify steps up Asia expansion", [(0, 7, "ORG"), (17, 21, "LOC")]),
    ("Google Maps launches location sharing", [(0, 11, "PRODUCT")]),
    ("Google rebrands its business apps", [(0, 6, "ORG")]),
    ("look what i found on google! 😂", [(21, 27, "PRODUCT")])]
Update a pre-trained Model
Need to provide many examples to meaningfully
improve the system — a few hundred
https://spacy.io/usage/training#section-ner
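Character offsets in hand-written training examples are easy to get wrong, so it can be worth checking them before training. The helper below is a plain-Python sketch, not part of spaCy; `check_offsets` and the sample data are assumptions for illustration. It flags spans that run past the end of the text or that start or end with whitespace.

```python
def check_offsets(train_data):
    """Return (text, start, end, label, covered_text) for suspicious spans."""
    problems = []
    for text, entities in train_data:
        for start, end, label in entities:
            covered = text[start:end]
            out_of_range = not (0 <= start < end <= len(text))
            if out_of_range or covered != covered.strip():
                problems.append((text, start, end, label, covered))
    return problems


sample = [
    ("Google rebrands its business apps", [(0, 6, "ORG")]),           # fine
    ("Google Maps launches location sharing", [(0, 12, "PRODUCT")]),  # trailing space
]

for text, start, end, label, covered in check_offsets(sample):
    print(f"{label} ({start}, {end}) covers {covered!r} in {text!r}")
```

A check like this catches off-by-one mistakes early; spans that do not line up with token boundaries are otherwise easy to miss until training results look odd.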
Custom Semantics
◇ spaCy's parser can be trained to
predict any type of tree
structure over your input text
◇ Can be useful for
conversational applications
◇ Train spaCy's parser to label
intents and their targets, like
attributes, quality, time and
locations
https://spacy.io/usage/training#section-tagger-parser
spaCy.io Lessons Learnt
An attempt to summarize my learning curve, both from
implementation and from business buy-in
Start Simple!
Define your key outcomes
Visualize the data
Experiment, iteration is key!
Educate
Engage your SMEs
Visualizations always help
Opt for easy/understandable
Measurement
System Metric
Operations Metric
Overall Business Metric
spaCy.io Alternatives
There are many alternatives available. They tend to fall into two
categories: alternative libraries and hosted solutions
Alternate Libraries
◇ NLTK Natural Language Toolkit (Python)
◇ Stanford CoreNLP (Java)
◇ NLP4J (Java)
Libraries allow you to configure, extend and train for your
problem domain
Alternate Hosted Solutions
◇ Microsoft Azure Text Analytics
◇ Google Cloud Natural Language
Hosted solutions provide a generic solution
◇ Well trained models
◇ Basic/Generic Named Entities
◇ Unable to model/train for your domain (yet!)
Thanks!
Any questions?
You can find me at:
◇ @jeffxor
◇ jwilliams@labelinsight.com
◇ https://speakerrate.com/speakers/181771 (Feedback)
Label Insight is Hiring!
https://www.labelinsight.com/careers/topic/engineering
Useful Information
This presentation used the following resources:
◇ spacy.io
◇ spacy.io github
◇ explosion.ai/demos/
◇ Natural Language Processing Wikipedia
◇ Stanford CoreNLP
◇ Microsoft Azure Text Analytics
◇ Google Cloud Natural Language