Handling Narrative Fields in Datasets
for Classification
Portland Data Science Group
Created by Andrew Ferlitsch
Community Outreach Officer
May, 2017
Typical Dataset
Feature 1 Feature 2 Feature 3 Feature 4 Label
real-value real-value real-value categorical-value category
real-value real-value real-value categorical-value category
real-value real-value real-value categorical-value category
Progression in Dataset Preparation: Dataset Clean → Categorical Value Conversion → Feature Scaling
Feature Reduction
• Filter out Garbage (dirty data)
• Filter out Noise (non-relevant features)
• Goal = Low Bias, Low Variance
Data + Noise + Garbage → Relevant Data Only (Information Gain – reduce entropy)
Dataset with Narrative Fields
Feature 1 Feature 2 Feature 3 Feature 4 Label
real-value real-value narrative categorical-value category
real-value real-value narrative categorical-value category
real-value real-value narrative categorical-value category
Narrative is plain text: a human-written description of the entry, i.e., what happened.
“upon arrival, the individual was initially non-responsive. …”
Category (label) is a classification assigned by a human interpreting the narrative.
012 // Code value for “coarse” category
Problem with Narrative Text Fields
• Examples: 911 calls, Police/Emergency/Medical incidents, inspections, surveys, complaints, reviews
– Human Entered
– Human Interpreted => Categorizing
– Different People Entering and Categorizing
– Non-Uniformity
– Human Errors
Challenge
• Convert Narrative Fields into Features with Categorical (or preferably Real) Values.
Data + Narrative → Data + Categorical / Real Values
Bag of Words
• Treat the narrative field as an unordered list of words.
• Convert each unique word into a categorical variable.
• Set 1 if the word appears in the narrative; otherwise set 0.
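A minimal Python sketch of this bag-of-words encoding (not the deck's Java tool): each unique word in the corpus becomes a 0/1 feature per narrative.

import re

narratives = [
    "The quick brown fox jumped over the lazy dog.",
    "The dog barked while the cat was jumping.",
]

def tokenize(text):
    # lowercase and keep alphabetic tokens only (punctuation dropped)
    return re.findall(r"[a-z]+", text.lower())

# one categorical (0/1) variable per unique word in the corpus
vocab = sorted({w for n in narratives for w in tokenize(n)})
rows = [[1 if w in set(tokenize(n)) else 0 for w in vocab] for n in narratives]

print(vocab)
for row in rows:
    print(row)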
Cleansing and Tokenize (Words)
• Remove Punctuation
• Expand Contractions (e.g., isn’t -> is not)
• Lowercase
The quick brown fox jumped over the lazy dog.
the:2
quick:1
brown:1
fox:1
jumped:1
over:1
lazy:1
dog:1
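A rough sketch of this cleansing pass, assuming a small hand-made contraction table (a real pipeline would use a much fuller list):

import re
from collections import Counter

CONTRACTIONS = {"isn't": "is not", "don't": "do not", "can't": "can not"}

def cleanse(text):
    text = text.lower()                       # lowercase
    for short, full in CONTRACTIONS.items():  # expand contractions
        text = text.replace(short, full)
    text = re.sub(r"[^a-z\s]", " ", text)     # remove punctuation and digits
    return text.split()

print(Counter(cleanse("The quick brown fox jumped over the lazy dog.")))
# Counter({'the': 2, 'quick': 1, 'brown': 1, ...})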
Narrative as Categorical Variables
The quick brown fox jumped over the lazy dog.
The dog barked while the cat was jumping.
Columns (unique words): the, quick, brown, fox, jumped, over, lazy, dog, barked, while, cat, was, jumping
Sentence 1 → 1 for: the, quick, brown, fox, jumped, over, lazy, dog (0 elsewhere)
Sentence 2 → 1 for: the, dog, barked, while, cat, was, jumping (0 elsewhere)
Issue: Explosion of categorical variables. For example, if the dataset
has 80,000 unique words, then you would have 80,000 categorical variables!
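The same encoding is a one-liner with scikit-learn's CountVectorizer (a sketch assuming scikit-learn ≥ 1.0 is installed); binary=True gives the 0/1 values above.

from sklearn.feature_extraction.text import CountVectorizer

narratives = [
    "The quick brown fox jumped over the lazy dog.",
    "The dog barked while the cat was jumping.",
]

vectorizer = CountVectorizer(binary=True)   # 1 if the word appears, else 0
X = vectorizer.fit_transform(narratives)

print(vectorizer.get_feature_names_out())   # one column per unique word
print(X.toarray())                          # the 0/1 feature matrix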
Corpus
• A collection of related documents.
• The Narratives in the Dataset are the Corpus.
• Each Narrative is a Document
(Diagram: the Narrative column across all rows of the dataset is the CORPUS; each individual narrative is a Document.)
Word Distribution
• Make a pass through all the narratives (corpus) building a dictionary.
• Sort by Word Frequency (number of times it occurs).
(Word-frequency plot, sorted from MAX down to 0: words above the upper threshold are useless, very common words with no significance (e.g., "the"); words between the thresholds are commonly used words; words below the lower threshold are rare words or misspellings.)
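A minimal sketch of building and sorting that word-frequency dictionary over the corpus:

import re
from collections import Counter

corpus = [
    "The quick brown fox jumped over the lazy dog.",
    "The dog barked while the cat was jumping.",
]

# one pass through all narratives, counting every word
counts = Counter(w for doc in corpus for w in re.findall(r"[a-z]+", doc.lower()))

for word, freq in counts.most_common():
    print(word, freq)   # 'the' lands at the top; rare words/misspellings at the bottom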
Stop Word Removal
• Remove Highest Frequency Words (above upper threshold), and
• Remove Lowest Frequency Words (below lower threshold) (optional).
The quick brown fox jumped over the lazy dog.
The dog barked while the cat was jumping.
Remaining columns: quick, brown, fox, jumped, lazy, dog, barked, cat, jumping
Sentence 1 → 1 for: quick, brown, fox, jumped, lazy, dog
Sentence 2 → 1 for: dog, barked, cat, jumping
Well-known predefined stop word lists exist – a widely used one is the Porter list.
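A sketch of stop word removal with NLTK's English stop-word list (assumes nltk is installed and nltk.download('stopwords') has been run once):

import re
from nltk.corpus import stopwords

STOP = set(stopwords.words("english"))   # predefined high-frequency words

def remove_stop_words(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP]

print(remove_stop_words("The dog barked while the cat was jumping."))
# ['dog', 'barked', 'cat', 'jumping']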
Stemming
• Stemming – Reduce words to their root stem.
Ex. Jumped, jumping, jumps => jump
• Does not use predefined dictionary. Uses grammar ending rules.
Columns after stemming: quick, brown, fox, jump (jumped/jumping), lazy, dog, bark (barked), cat
Sentence 1 → 1 for: quick, brown, fox, jump, lazy, dog
Sentence 2 → 1 for: dog, bark, cat, jump
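A sketch with NLTK's Porter stemmer, which applies suffix-stripping rules rather than a dictionary lookup:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["jumped", "jumping", "jumps", "barked", "something"]:
    print(word, "->", stemmer.stem(word))
# jumped/jumping/jumps -> jump, barked -> bark,
# and 'something' -> 'someth' (the exception case motivating lemmatization)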
Lemmatization
• Stems are correct when the word is not an exception, BUT incorrect when
the word is an exception.
Ex. something => someth
• Lemmatization means reducing words to their root form, but
correcting the exceptions by using a dictionary of common
exceptions (vs. all words, e.g., 1000 words instead of 100,000).
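A sketch with NLTK's WordNet lemmatizer (assumes nltk.download('wordnet') has been run); exceptions such as "something" are kept intact:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("jumping", pos="v"))   # 'jump'
print(lemmatizer.lemmatize("jumps", pos="v"))     # 'jump'
print(lemmatizer.lemmatize("something"))          # 'something' (not 'someth')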
Term Frequency (TF)
• Issue: All words are weighted the same.
• Term Frequency weights each word by its frequency in the corpus and uses
that frequency as the feature value (instead of 1 or 0).
(no. of occurrences in corpus) / (no. of unique words in corpus)
Columns: quick, brown, fox, jump, lazy, dog, bark, cat
TF weights: 0.001, 0.003, 0.0002, 0.006, 0.0001, 0.007, 0.0001, 0.007
Second narrative → jump 0.006, dog 0.007, bark 0.0001, cat 0.007 (0 elsewhere)
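A sketch implementing the slide's TF formula (occurrences in the corpus divided by the number of unique words); note that many libraries instead compute term frequency per document.

import re
from collections import Counter

corpus = [
    "The quick brown fox jumped over the lazy dog.",
    "The dog barked while the cat was jumping.",
]

counts = Counter(w for doc in corpus for w in re.findall(r"[a-z]+", doc.lower()))
n_unique = len(counts)

tf = {word: freq / n_unique for word, freq in counts.items()}
print(tf["dog"], tf["quick"])   # frequent words get larger feature values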
Inverse Document Frequency (IDF)
• Issue: TF gives the highest weight to the most frequently used words –
this may result in underfitting (too general).
• Inverse Document Frequency weights words by how rarely they appear in
the corpus (the assumption is that a rarer word is more significant
within a document).
log((no. of unique words in corpus) / (no. of occurrences in corpus))
Columns: quick, brown, fox, jump, lazy, dog, bark, cat
IDF weights: 2, 1.5, 2.7, 1.2, 3, 1.15, 3, 1.15
Second narrative → jump 1.2, dog 1.15, bark 3, cat 1.15 (0 elsewhere)
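The matching sketch for the slide's IDF formula; the textbook IDF is usually log(number of documents / documents containing the word), so treat this corpus-level version as the deck's simplified variant.

import math
import re
from collections import Counter

corpus = [
    "The quick brown fox jumped over the lazy dog.",
    "The dog barked while the cat was jumping.",
]

counts = Counter(w for doc in corpus for w in re.findall(r"[a-z]+", doc.lower()))
n_unique = len(counts)

idf = {word: math.log(n_unique / freq) for word, freq in counts.items()}
print(sorted(idf.items(), key=lambda kv: -kv[1])[:3])   # rarest words score highest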
Pruning
• Even with Stemming/Lemmatization, the feature matrix will be massive
(e.g., 30,000 features).
• Reduce it to a smaller number – typically 500 to 1,000.
• Keep the features with the highest TF or IDF values in the Corpus.
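Pruning can be done by keeping only the top-weighted terms; scikit-learn's max_features does this directly (a sketch, with a tiny limit because the toy corpus is small):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "The quick brown fox jumped over the lazy dog.",
    "The dog barked while the cat was jumping.",
]

# In a real dataset max_features would be ~500-1000.
vectorizer = TfidfVectorizer(stop_words="english", max_features=5)
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())   # the surviving feature columns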
Advanced Topic – Word Reduction
• Words that are part of a common grouping are replaced
with a root word for the group.
• Steps:
1. Stemming/Lemmatization
2. Lookup Root Word in Word Group Dictionary
3. If entry exists, replace with common root word for
the group.
Group Example: male: [ man, gentleman, boy, guy, dude ]
Advanced Topic – Word Reduction
male : [ man, gentleman, boy, guy, dude ]
female: [ woman, lady, girl, gal ]
parent : [ father, mother, mom, mommy, dad, daddy ]
Word Root
man male
gentleman male
boy male
guy male
dude male
woman female
lady female
girl female
gal female
The mother played with the girls while the dad
prepared snacks for the ladies in mom’s reading group.
→ parent, play, female, parent, prepare, snack, female, parent, read, group
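A sketch of this word-group reduction: after lemmatization, each lemma is looked up in a hand-built group dictionary and replaced with the group's root word (the groups below are the deck's own examples).

GROUPS = {
    "male":   ["man", "gentleman", "boy", "guy", "dude"],
    "female": ["woman", "lady", "girl", "gal"],
    "parent": ["father", "mother", "mom", "mommy", "dad", "daddy"],
}

# invert to a word -> group-root lookup table
ROOT = {word: root for root, words in GROUPS.items() for word in words}

def reduce_words(lemmas):
    # replace a lemma with its group root if one exists, otherwise keep it
    return [ROOT.get(w, w) for w in lemmas]

# stop-word-filtered and lemmatized tokens from the example sentence
lemmas = ["mother", "play", "girl", "dad", "prepare", "snack", "lady", "mom", "read", "group"]
print(reduce_words(lemmas))
# ['parent', 'play', 'female', 'parent', 'prepare', 'snack', 'female', 'parent', 'read', 'group']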
Advanced Topics – N-grams
• Instead of parsing the sentence into single words, each used as a
feature, we group words into pairs (2-grams), triplets (3-grams), etc.
• Parameters:
1. Choose Window Size (2, 3, …)
2. Choose Stride Length (1, 2, …)
(Diagram: a 2-gram window sliding across word1 word2 word3 … with a stride of 1.)
Advanced Topics – N-grams
The quick brown fox jumped over the lazy dog
quick, brown, fox, jump, lazy, dog
2-grams, stride of 1
quick, brown
brown, fox
fox, jump
jump, lazy
lazy, dog
dog, <null>
As features, each 2-gram becomes a categorical variable set to 1: (quick, brown), (brown, fox), (fox, jump), (jump, lazy), (lazy, dog), (dog)
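A sketch of n-gram generation with a configurable window size and stride, matching the 2-gram / stride-1 example above (the final, shorter window corresponds to the "dog, <null>" entry):

def ngrams(tokens, window=2, stride=1):
    # slide a window of the given size across the token list;
    # the last window may be shorter than `window`
    return [tuple(tokens[i:i + window]) for i in range(0, len(tokens), stride)]

tokens = ["quick", "brown", "fox", "jump", "lazy", "dog"]
for gram in ngrams(tokens, window=2, stride=1):
    print(gram)
# ('quick', 'brown') ('brown', 'fox') ('fox', 'jump')
# ('jump', 'lazy') ('lazy', 'dog') ('dog',)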
More – Not Covered
• Word-Vectors [Word Embedding]
• Correcting Misspellings
• Detecting incorrectly categorized Narratives.
Final – Homegrown Tool
• I built a command-line tool for doing all the steps in this
presentation.
• Java-based, packaged as a JAR file.
https://github.com/andrewferlitsch/Portland-Data-Science-Group/blob/master/README.NLP.md
Final – Homegrown Tool - Examples
• Quora question pairs (training set: 400,000)
java -jar nlp.jar -c3,4 train.csv
• Remove Stop Words
java -jar nlp.jar -c3,4 -e p train.csv
• Lemma and Reduce to Common Root
java -jar nlp.jar -c3,4 -e p -l -r train.csv
• Lemma and Reduce to Common Root
java -jar nlp.jar -c3,4 -e p -l -r -F train.csv