Corpus Bootstrapping with NLTK
by Jacob Perkins
Jacob Perkins


• http://www.weotta.com
• http://streamhacker.com
• http://text-processing.com
• https://github.com/japerk/nltk-trainer
• @japerk
Problem



• you want to do NLProc
• many proven supervised training algorithms
• but you don’t have a training corpus
Solution




• make a custom training corpus
Problems with Manual Annotation



• takes time
• requires expertise
• expert time costs $$$
Solution: Bootstrap


• less time
• less expertise
• costs less
• requires thinking & creativity
Corpus Bootstrapping at Weotta



• review sentiment
• keyword classification
• phrase extraction & classification
Bootstrapping Examples



• english -> spanish sentiment
• phrase extraction
Translating Sentiment



• start with english sentiment corpus & classifier
• english -> spanish -> spanish
English -> Spanish -> Spanish

1. translate english examples to spanish
2. train classifier
3. classify spanish text into new corpus
4. correct new corpus
5. retrain classifier
6. add to corpus & goto 4 until done
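The loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the nltk-trainer implementation: `train`, `bootstrap`, and the `review` callback are hypothetical names, and the word-count classifier is a toy stand-in for a real trained model. The `review` callback marks the point where a human corrects each guessed label.

```python
from collections import Counter

def train(corpus):
    """Toy stand-in for a real trainer: count words per label."""
    counts = {}
    for text, label in corpus:
        counts.setdefault(label, Counter()).update(text.lower().split())

    def classify(text):
        # Score each label by how often it has seen the words in this text.
        words = text.lower().split()
        scores = {label: sum(seen[w] for w in words) for label, seen in counts.items()}
        # sorted() makes tie-breaking deterministic (alphabetical by label).
        return max(sorted(scores), key=scores.get)

    return classify

def bootstrap(seed_corpus, unlabeled, review, batch_size=2):
    """Classify a batch, let a human correct it, add it to the corpus, retrain."""
    corpus = list(seed_corpus)
    classify = train(corpus)
    for start in range(0, len(unlabeled), batch_size):
        batch = unlabeled[start:start + batch_size]
        guessed = [(text, classify(text)) for text in batch]
        # review(text, label) returns the corrected label (hypothetical callback).
        corrected = [(text, review(text, label)) for text, label in guessed]
        corpus.extend(corrected)
        classify = train(corpus)
    return corpus, classify

# Usage with invented examples; this reviewer accepts every guess unchanged.
seed = [("una película excelente", "pos"), ("una película terrible", "neg")]
new_texts = ["excelente actuación", "terrible guion"]
corpus, classify = bootstrap(seed, new_texts, review=lambda text, label: label)
```

In practice the review step is the expensive part, so small batches keep each correction pass short.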
Translate Corpus


$ translate_corpus.py movie_reviews --source english --target spanish
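The idea behind this step is that a sentiment label survives translation: only the text changes. A rough sketch of that idea follows. It does not show how translate_corpus.py works internally; `translate_corpus`, `toy_translate`, and the tiny dictionary are hypothetical placeholders for a real machine translation call.

```python
def translate_corpus(corpus, translate):
    """Translate every example while keeping its category label."""
    return {label: [translate(text) for text in texts]
            for label, texts in corpus.items()}

# Placeholder translator (assumption): a real run would call a translation service.
TOY_DICTIONARY = {"great": "genial", "boring": "aburrida", "movie": "película"}

def toy_translate(text):
    # Word-by-word lookup, leaving unknown words untouched.
    return " ".join(TOY_DICTIONARY.get(word, word) for word in text.split())

english = {"pos": ["great movie"], "neg": ["boring movie"]}
spanish = translate_corpus(english, toy_translate)
```

Machine translation output is noisy, which is why the later correction steps matter.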
Train Initial Classifier



$ train_classifier.py spanish_movie_reviews
Create New Corpus


$ classify_to_corpus.py spanish_sentiment --input spanish_examples.txt --classifier spanish_movie_reviews_NaiveBayes.pickle
Manual Correction



1. scan each file
2. move incorrect examples to correct file
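Moving an example by hand is easy to get wrong, so a small helper can do it. The sketch below assumes a layout the slides do not spell out: one text file per category (for example `pos.txt` and `neg.txt`) with one example per line. `move_example` is a hypothetical helper, not part of nltk-trainer.

```python
from pathlib import Path

def move_example(corpus_dir, example, wrong_label, right_label):
    """Remove `example` from the wrong category file and append it to the right one."""
    corpus_dir = Path(corpus_dir)
    wrong_file = corpus_dir / f"{wrong_label}.txt"   # assumed file naming
    right_file = corpus_dir / f"{right_label}.txt"

    lines = wrong_file.read_text(encoding="utf-8").splitlines()
    if example not in lines:
        raise ValueError(f"{example!r} not found in {wrong_file.name}")
    lines.remove(example)  # drops the first matching line only
    wrong_file.write_text("".join(line + "\n" for line in lines), encoding="utf-8")

    # Append to the correct category, creating the file if needed.
    with right_file.open("a", encoding="utf-8") as f:
        f.write(example + "\n")
```

Calling `move_example("spanish_sentiment", "muy mala", "pos", "neg")` would move that one line from `pos.txt` to `neg.txt`.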
Train New Classifier



$ train_classifier.py spanish_sentiment
Adding to the Corpus

• start with >90% probability
• retrain
• carefully decrease probability threshold
Add more at a Lower Threshold


$ classify_to_corpus.py categorized_corpus --classifier categorized_corpus_NaiveBayes.pickle --threshold 0.8 --input new_examples.txt
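Thresholding can be pictured as a filter over the classifier's label probabilities. The sketch below is an assumption about the general technique, not the internals of classify_to_corpus.py. `confident_examples` is a hypothetical name, and `prob_classify` here is any callable returning a dict of label probabilities. With an NLTK classifier, such a dict could be built from the distribution returned by its `prob_classify` method.

```python
def confident_examples(texts, prob_classify, threshold=0.9):
    """Split texts into confidently labeled examples and ones left for later."""
    accepted, deferred = [], []
    for text in texts:
        probs = prob_classify(text)          # dict: label -> probability
        label = max(probs, key=probs.get)    # most probable label
        if probs[label] >= threshold:
            accepted.append((text, label))
        else:
            deferred.append(text)
    return accepted, deferred

# Invented probabilities standing in for a trained classifier.
fake_probs = {
    "a": {"pos": 0.95, "neg": 0.05},
    "b": {"pos": 0.85, "neg": 0.15},
    "c": {"pos": 0.40, "neg": 0.60},
}
first_pass, remaining = confident_examples(list(fake_probs), fake_probs.get, 0.9)
second_pass, remaining = confident_examples(remaining, fake_probs.get, 0.8)
```

Each pass with a lower threshold admits more examples but also more mistakes, so the manual check matters more as the threshold drops.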
When are you done?



• what level of accuracy do you need?
• does your corpus reflect real text?
• how much time do you have?
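One way to answer the accuracy question is to score the classifier against a small hand-checked sample that was never used for training. A minimal sketch, where `evaluate`, the sample, and the keyword classifier are all invented for illustration:

```python
from collections import Counter

def evaluate(classify, gold):
    """Return accuracy and a count of (expected, predicted) mistakes."""
    mistakes = Counter()
    correct = 0
    for text, expected in gold:
        predicted = classify(text)
        if predicted == expected:
            correct += 1
        else:
            mistakes[(expected, predicted)] += 1
    return correct / len(gold), mistakes

def keyword_classifier(text):
    # Toy stand-in for a trained classifier.
    negative_words = ("aburrida", "horrible", "mal")
    return "neg" if any(word in text for word in negative_words) else "pos"

gold = [("genial", "pos"), ("aburrida", "neg"), ("no está mal", "pos"), ("horrible", "neg")]
accuracy, mistakes = evaluate(keyword_classifier, gold)
```

The mistake counts show which direction the errors lean, which hints at what to correct next.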
Tips


• garbage in, garbage out
• correct bad data
• clean & scrub text
• experiment with train_classifier.py options
• create custom features
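As an illustration of the cleaning and feature tips, here is a small sketch. The cleaning rules are assumptions about what "scrubbing" might mean for web text, and `clean_text` and `bag_of_words` are hypothetical names. The returned dict follows the featureset shape that NLTK classifiers accept.

```python
import html
import re

def clean_text(text):
    """Drop HTML tags, unescape entities, and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()

def bag_of_words(text):
    """Map each lowercased word to True."""
    words = re.findall(r"\w+", clean_text(text).lower())
    return {word: True for word in words}

features = bag_of_words("<p>Una  pel&iacute;cula\n genial!</p>")
```

Custom features could add bigrams or domain keywords to the same dict.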
Bootstrapping a Phrase Extractor
1. find a pos tagged corpus
2. annotate raw text
3. train pos tagger
4. create pos tagged & chunked corpus
5. tag unknown words
6. train pos tagger & chunker
7. correct errors
8. add to corpus, goto 5 until done
NLTK Tagged Corpora

• English: brown, conll2000, treebank
• Portuguese: mac_morpho, floresta
• Spanish: cess_esp, conll2002
• Catalan: cess_cat
• Dutch: alpino, conll2002
• Indian Languages: indian
• Chinese: sinica_treebank
• see http://text-processing.com/demo/tag/
Train Tagger



$ train_tagger.py treebank --simplify_tags
Phrase Annotation


Hello world, [this is an important phrase].
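A bracket annotation like this is easy to read back programmatically. The sketch below is only an illustration of the format, not part of tag_phrases.py, and `split_annotation` is a hypothetical helper. It assumes brackets are never nested.

```python
import re

def split_annotation(line):
    """Return (plain_text, phrases) for a line with [bracketed] phrases."""
    phrases = re.findall(r"\[([^\[\]]+)\]", line)   # text inside each bracket pair
    plain = re.sub(r"[\[\]]", "", line)             # the same line without brackets
    return plain, phrases

plain, phrases = split_annotation("Hello world, [this is an important phrase].")
```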
Tag Phrases


$ tag_phrases.py my_corpus --tagger treebank_simplify_tags.pickle --input my_phrases.txt
Chunked & Tagged Phrase


Hello/N world/N ,/, [ this/DET is/V an/DET important/ADJ phrase/N ] ./.
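To make the format concrete, the sketch below reads one such line into (word, tag, chunk) triples using only the standard library. It assumes space-separated `word/TAG` tokens and unnested brackets. `parse_chunked` and the `PHRASE` label are hypothetical, since the line itself carries no chunk label.

```python
def parse_chunked(line, chunk_label="PHRASE"):
    """Parse 'word/TAG' tokens with [ ... ] chunks into (word, tag, label) triples."""
    tokens, inside = [], False
    for item in line.split():
        if item == "[":
            inside = True
        elif item == "]":
            inside = False
        else:
            # Split on the last slash so tokens like ',/,' parse correctly.
            word, _, tag = item.rpartition("/")
            tokens.append((word, tag, chunk_label if inside else None))
    return tokens

line = "Hello/N world/N ,/, [ this/DET is/V an/DET important/ADJ phrase/N ] ./."
tokens = parse_chunked(line)
phrase = " ".join(word for word, tag, label in tokens if label == "PHRASE")
```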
Correct Unknown Words



1. find -NONE- tagged words
2. fix tags
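Both steps can be helped along with a short script. The sketch below assumes the corpus is already in memory as lists of (word, tag) pairs. `unknown_words`, `fix_tags`, and the example data are hypothetical.

```python
from collections import Counter

def unknown_words(tagged_sents, unknown_tag="-NONE-"):
    """List words carrying the unknown tag, most frequent first."""
    counts = Counter(word for sent in tagged_sents
                     for word, tag in sent if tag == unknown_tag)
    return counts.most_common()

def fix_tags(tagged_sents, corrections, unknown_tag="-NONE-"):
    """Replace the unknown tag with a hand-chosen tag where one is given."""
    return [[(word, corrections.get(word, tag) if tag == unknown_tag else tag)
             for word, tag in sent]
            for sent in tagged_sents]

sents = [[("Hello", "N"), ("Weotta", "-NONE-")],
         [("Weotta", "-NONE-"), ("rocks", "-NONE-")]]
todo = unknown_words(sents)
fixed = fix_tags(sents, {"Weotta": "NP"})
```

Fixing the most frequent unknown words first gives the largest gain per correction.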
Train New Tagger


$ train_tagger.py my_corpus --reader nltk.corpus.reader.ChunkedCorpusReader
Train Chunker


$ train_chunker.py my_corpus --reader nltk.corpus.reader.ChunkedCorpusReader
Extracting Phrases
import collections
import nltk.data
from nltk import tokenize
from nltk.tag import untag

# Load the tagger and chunker trained in the previous steps.
tagger = nltk.data.load('taggers/my_corpus_tagger.pickle')
chunker = nltk.data.load('chunkers/my_corpus_chunker.pickle')

def extract_phrases(t):
    # Group the words of every chunk subtree by its chunk label.
    d = collections.defaultdict(list)
    # NLTK 3 replaced the Tree.node attribute with Tree.label().
    for sub in t.subtrees(lambda s: s.label() != 'S'):
        d[sub.label()].append(' '.join(untag(sub.leaves())))
    return d

# text is the raw string to extract phrases from.
sents = tokenize.sent_tokenize(text)
words = tokenize.word_tokenize(sents[0])
d = extract_phrases(chunker.parse(tagger.tag(words)))
# defaultdict(<class 'list'>, {'PHRASE_TAG': ['phrase']})
Final Tips


• error correction is faster than manual annotation
• find close enough corpora
• use nltk-trainer to experiment
• iterate -> quality
• no substitute for human judgement
Links



http://www.nltk.org
https://github.com/japerk/nltk-trainer
http://text-processing.com
