Handling Narrative Fields in Datasets
for Classification
Portland Data Science Group
Created by Andrew Ferlitsch
Community Outreach Officer
May, 2017
Typical Dataset
Feature 1 Feature 2 Feature 3 Feature 4 Label
real-value real-value real-value categorical-value category
real-value real-value real-value categorical-value category
real-value real-value real-value categorical-value category
Progression in Dataset Preparation: Dataset Clean → Categorical Value Conversion → Feature Scaling
Feature Reduction
• Filter out Garbage (dirty data)
• Filter out Noise (non-relevant features)
• Goal = Low Bias, Low Variance
Data + Noise + Garbage → Relevant Data Only (Information Gain – reduce entropy)
Dataset with Narrative Fields
Feature 1 Feature 2 Feature 3 Feature 4 Label
real-value real-value narrative categorical-value category
real-value real-value narrative categorical-value category
real-value real-value narrative categorical-value category
Narrative is plain text: a human-written description of the entry, i.e., what happened.
“upon arrival, the individual was initially non-responsive. …”
Category (label) is a classification assigned by a human interpreting the narrative.
012 // Code value for “coarse” category
Problem with Narrative Text Fields
• Examples: 911 calls, Police/Emergency/Medical incidents, inspections, surveys, complaints, reviews
– Human Entered
– Human Interpreted => Categorizing
– Different People Entering and Categorizing
– Non-Uniformity
– Human Errors
Challenge
• Convert Narrative Fields into Features with Categorical (or preferably Real) Values.
Data + Narrative → Data + Categorical / Real Values
Bag of Words
• Treat the narrative field as an unordered list of words.
• Convert each unique word into a categorical variable.
• Set 1 if the word appears in the narrative; otherwise set 0.
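A minimal Python sketch of this bag-of-words encoding (not the deck's Java tool): each unique word in the corpus becomes a 0/1 feature per narrative.

import re

narratives = [
    "The quick brown fox jumped over the lazy dog.",
    "The dog barked while the cat was jumping.",
]

def tokenize(text):
    # lowercase and keep alphabetic tokens only (punctuation dropped)
    return re.findall(r"[a-z]+", text.lower())

# one categorical (0/1) variable per unique word in the corpus
vocab = sorted({w for n in narratives for w in tokenize(n)})
rows = [[1 if w in set(tokenize(n)) else 0 for w in vocab] for n in narratives]

print(vocab)
for row in rows:
    print(row)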
Cleansing and Tokenize (Words)
• Remove Punctuation
• Expand Contractions (e.g., isn’t -> is not)
• Lowercase
The quick brown fox jumped over the lazy dog.
the:2
quick:1
brown:1
fox:1
jumped:1
over:1
lazy:1
dog:1
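A rough sketch of this cleansing pass, assuming a small hand-made contraction table (a real pipeline would use a much fuller list):

import re
from collections import Counter

CONTRACTIONS = {"isn't": "is not", "don't": "do not", "can't": "can not"}

def cleanse(text):
    text = text.lower()                       # lowercase
    for short, full in CONTRACTIONS.items():  # expand contractions
        text = text.replace(short, full)
    text = re.sub(r"[^a-z\s]", " ", text)     # remove punctuation and digits
    return text.split()

print(Counter(cleanse("The quick brown fox jumped over the lazy dog.")))
# Counter({'the': 2, 'quick': 1, 'brown': 1, ...})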
Narrative as Categorical Variables
The quick brown fox jumped over the lazy dog.
The dog barked while the cat was jumping.
Columns (unique words): the, quick, brown, fox, jumped, over, lazy, dog, barked, while, cat, was, jumping
Sentence 1 → 1 for: the, quick, brown, fox, jumped, over, lazy, dog (0 elsewhere)
Sentence 2 → 1 for: the, dog, barked, while, cat, was, jumping (0 elsewhere)
Issue: Explosion of categorical variables. For example, if the dataset
has 80,000 unique words, then you would have 80,000 categorical variables!
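The same encoding is a one-liner with scikit-learn's CountVectorizer (a sketch assuming scikit-learn ≥ 1.0 is installed); binary=True gives the 0/1 values above.

from sklearn.feature_extraction.text import CountVectorizer

narratives = [
    "The quick brown fox jumped over the lazy dog.",
    "The dog barked while the cat was jumping.",
]

vectorizer = CountVectorizer(binary=True)   # 1 if the word appears, else 0
X = vectorizer.fit_transform(narratives)

print(vectorizer.get_feature_names_out())   # one column per unique word
print(X.toarray())                          # the 0/1 feature matrix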
Corpus
• A collection of related documents.
• The Narratives in the Dataset are the Corpus.
• Each Narrative is a Document
(Diagram: the Narrative column across all rows of the dataset is the CORPUS; each individual narrative is a Document.)
Word Distribution
• Make a pass through all the narratives (corpus) building a dictionary.
• Sort by Word Frequency (number of times it occurs).
(Word-frequency plot, sorted from MAX down to 0: words above the upper threshold are useless, very common words with no significance (e.g., "the"); words between the thresholds are commonly used words; words below the lower threshold are rare words or misspellings.)
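A minimal sketch of building and sorting that word-frequency dictionary over the corpus:

import re
from collections import Counter

corpus = [
    "The quick brown fox jumped over the lazy dog.",
    "The dog barked while the cat was jumping.",
]

# one pass through all narratives, counting every word
counts = Counter(w for doc in corpus for w in re.findall(r"[a-z]+", doc.lower()))

for word, freq in counts.most_common():
    print(word, freq)   # 'the' lands at the top; rare words/misspellings at the bottom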
Stop Word Removal
• Remove Highest Frequency Words (above upper threshold), and
• Remove Lowest Frequency Words (below lower threshold) (optional).
The quick brown fox jumped over the lazy dog.
The dog barked while the cat was jumping.
Remaining columns: quick, brown, fox, jumped, lazy, dog, barked, cat, jumping
Sentence 1 → 1 for: quick, brown, fox, jumped, lazy, dog
Sentence 2 → 1 for: dog, barked, cat, jumping
Well-known predefined stop word lists exist – a widely used one is the Porter list.
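A sketch of stop word removal with NLTK's English stop-word list (assumes nltk is installed and nltk.download('stopwords') has been run once):

import re
from nltk.corpus import stopwords

STOP = set(stopwords.words("english"))   # predefined high-frequency words

def remove_stop_words(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP]

print(remove_stop_words("The dog barked while the cat was jumping."))
# ['dog', 'barked', 'cat', 'jumping']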
Stemming
• Stemming – Reduce words to their root stem.
Ex. Jumped, jumping, jumps => jump
• Does not use predefined dictionary. Uses grammar ending rules.
Columns after stemming: quick, brown, fox, jump (jumped/jumping), lazy, dog, bark (barked), cat
Sentence 1 → 1 for: quick, brown, fox, jump, lazy, dog
Sentence 2 → 1 for: dog, bark, cat, jump
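A sketch with NLTK's Porter stemmer, which applies suffix-stripping rules rather than a dictionary lookup:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["jumped", "jumping", "jumps", "barked", "something"]:
    print(word, "->", stemmer.stem(word))
# jumped/jumping/jumps -> jump, barked -> bark,
# and 'something' -> 'someth' (the exception case motivating lemmatization)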
Lemmatization
• Stems are correct when the word is not an exception, BUT incorrect when
the word is an exception.
Ex. something => someth
• Lemmatization means reducing words to their root form, but
correcting the exceptions by using a dictionary of common
exceptions (vs. all words, e.g., 1000 words instead of 100,000).
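A sketch with NLTK's WordNet lemmatizer (assumes nltk.download('wordnet') has been run); exceptions such as "something" are kept intact:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("jumping", pos="v"))   # 'jump'
print(lemmatizer.lemmatize("jumps", pos="v"))     # 'jump'
print(lemmatizer.lemmatize("something"))          # 'something' (not 'someth')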
Term Frequency (TF)
• Issue: All words are weighted the same.
• Term Frequency weights each word by its frequency in the corpus and uses
that frequency as the feature value (instead of 1 or 0).
(no. of occurrences in corpus) / (no. of unique words in corpus)
Columns: quick, brown, fox, jump, lazy, dog, bark, cat
TF weights: 0.001, 0.003, 0.0002, 0.006, 0.0001, 0.007, 0.0001, 0.007
Second narrative → jump 0.006, dog 0.007, bark 0.0001, cat 0.007 (0 elsewhere)
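A sketch implementing the slide's TF formula (occurrences in the corpus divided by the number of unique words); note that many libraries instead compute term frequency per document.

import re
from collections import Counter

corpus = [
    "The quick brown fox jumped over the lazy dog.",
    "The dog barked while the cat was jumping.",
]

counts = Counter(w for doc in corpus for w in re.findall(r"[a-z]+", doc.lower()))
n_unique = len(counts)

tf = {word: freq / n_unique for word, freq in counts.items()}
print(tf["dog"], tf["quick"])   # frequent words get larger feature values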
Inverse Document Frequency (IDF)
• Issue: TF gives the highest weight to the most frequently used words –
this may result in underfitting (too general).
• Inverse Document Frequency weights words by how rarely they appear in
the corpus (the assumption is that a rarer word is more significant
within a document).
log((no. of unique words in corpus) / (no. of occurrences in corpus))
Columns: quick, brown, fox, jump, lazy, dog, bark, cat
IDF weights: 2, 1.5, 2.7, 1.2, 3, 1.15, 3, 1.15
Second narrative → jump 1.2, dog 1.15, bark 3, cat 1.15 (0 elsewhere)
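The matching sketch for the slide's IDF formula; the textbook IDF is usually log(number of documents / documents containing the word), so treat this corpus-level version as the deck's simplified variant.

import math
import re
from collections import Counter

corpus = [
    "The quick brown fox jumped over the lazy dog.",
    "The dog barked while the cat was jumping.",
]

counts = Counter(w for doc in corpus for w in re.findall(r"[a-z]+", doc.lower()))
n_unique = len(counts)

idf = {word: math.log(n_unique / freq) for word, freq in counts.items()}
print(sorted(idf.items(), key=lambda kv: -kv[1])[:3])   # rarest words score highest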
Pruning
• Even with Stemming/Lemmatization, the feature matrix will be massive
(e.g., 30,000 features).
• Reduce it to a smaller number – typically 500 to 1,000.
• Keep the features with the highest TF or IDF values in the Corpus.
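Pruning can be done by keeping only the top-weighted terms; scikit-learn's max_features does this directly (a sketch, with a tiny limit because the toy corpus is small):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "The quick brown fox jumped over the lazy dog.",
    "The dog barked while the cat was jumping.",
]

# In a real dataset max_features would be ~500-1000.
vectorizer = TfidfVectorizer(stop_words="english", max_features=5)
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())   # the surviving feature columns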
Advanced Topic – Word Reduction
• Words that are part of a common grouping are replaced
with a root word for the group.
• Steps:
1. Stemming/Lemmatization
2. Lookup Root Word in Word Group Dictionary
3. If entry exists, replace with common root word for
the group.
Group Example: male: [ man, gentleman, boy, guy, dude ]
Advanced Topic – Word Reduction
male : [ man, gentleman, boy, guy, dude ]
female: [ woman, lady, girl, gal ]
parent : [ father, mother, mom, mommy, dad, daddy ]
Word Root
man male
gentleman male
boy male
guy male
dude male
woman female
lady female
girl female
gal female
The mother played with the girls while the dad
prepared snacks for the ladies in mom’s reading group.
→ parent, play, female, parent, prepare, snack, female, parent, read, group
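A sketch of this word-group reduction: after lemmatization, each lemma is looked up in a hand-built group dictionary and replaced with the group's root word (the groups below are the deck's own examples).

GROUPS = {
    "male":   ["man", "gentleman", "boy", "guy", "dude"],
    "female": ["woman", "lady", "girl", "gal"],
    "parent": ["father", "mother", "mom", "mommy", "dad", "daddy"],
}

# invert to a word -> group-root lookup table
ROOT = {word: root for root, words in GROUPS.items() for word in words}

def reduce_words(lemmas):
    # replace a lemma with its group root if one exists, otherwise keep it
    return [ROOT.get(w, w) for w in lemmas]

# stop-word-filtered and lemmatized tokens from the example sentence
lemmas = ["mother", "play", "girl", "dad", "prepare", "snack", "lady", "mom", "read", "group"]
print(reduce_words(lemmas))
# ['parent', 'play', 'female', 'parent', 'prepare', 'snack', 'female', 'parent', 'read', 'group']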
Advanced Topics – N-grams
• Instead of parsing the sentence into single words, each used as a
feature, we group words into pairs (2-grams), triplets (3-grams), etc.
• Parameters:
1. Choose Window Size (2, 3, …)
2. Choose Stride Length (1, 2, …)
(Diagram: a 2-gram window sliding across word1 word2 word3 … with a stride of 1.)
Advanced Topics – N-grams
The quick brown fox jumped over the lazy dog
quick, brown, fox, jump, lazy, dog
2-grams, stride of 1
quick, brown
brown, fox
fox, jump
jump, lazy
lazy, dog
dog, <null>
As features, each 2-gram becomes a categorical variable set to 1: (quick, brown), (brown, fox), (fox, jump), (jump, lazy), (lazy, dog), (dog)
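A sketch of n-gram generation with a configurable window size and stride, matching the 2-gram / stride-1 example above (the final, shorter window corresponds to the "dog, <null>" entry):

def ngrams(tokens, window=2, stride=1):
    # slide a window of the given size across the token list;
    # the last window may be shorter than `window`
    return [tuple(tokens[i:i + window]) for i in range(0, len(tokens), stride)]

tokens = ["quick", "brown", "fox", "jump", "lazy", "dog"]
for gram in ngrams(tokens, window=2, stride=1):
    print(gram)
# ('quick', 'brown') ('brown', 'fox') ('fox', 'jump')
# ('jump', 'lazy') ('lazy', 'dog') ('dog',)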
More – Not Covered
• Word-Vectors [Word Embedding]
• Correcting Misspellings
• Detecting incorrectly categorized Narratives.
Final – Homegrown Tool
• I built a command-line tool for doing all the steps in this
presentation.
• Java-based, packaged as a JAR file.
https://github.com/andrewferlitsch/Portland-Data-Science-Group/blob/master/README.NLP.md
Final – Homegrown Tool - Examples
• Quora question pairs (training set: 400,000)
java -jar nlp.jar -c3,4 train.csv
• Remove Stop Words
java -jar nlp.jar -c3,4 -e p train.csv
• Lemma and Reduce to Common Root
java -jar nlp.jar -c3,4 -e p -l -r train.csv
• Lemma and Reduce to Common Root
java -jar nlp.jar -c3,4 -e p -l -r -F train.csv