Corpus Bootstrapping with NLTK

by Jacob Perkins

Jacob Perkins

http://www.weotta.com
http://streamhacker.com
http://text-processing.com
https://github.com/japerk/nltk-trainer
@japerk

Problem

you want to do NLProc
many proven supervised training algorithms
but you don’t have a training corpus

Solution

make a custom training corpus

Problems with Manual Annotation

takes time
requires expertise
expert time costs $$$

Solution: Bootstrap

less time
less expertise
costs less
requires thinking & creativity

Corpus Bootstrapping at Weotta

review sentiment
keyword classification
phrase extraction & classification

Bootstrapping Examples

english -> spanish sentiment
phrase extraction

Translating Sentiment

start with english sentiment corpus & classifier
english -> spanish -> spanish

English -> Spanish -> Spanish

1. translate english examples to spanish
2. train classifier
3. classify spanish text into new corpus
4. correct new corpus
5. retrain classifier
6. add to corpus & goto 4 until done
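
The six steps above, as a loop. A minimal sketch, not nltk-trainer code: `train` and `correct` are hypothetical callables standing in for the training script and the manual correction pass.

```python
def bootstrap(seed_corpus, unlabeled_batches, train, correct):
    """Grow a labeled corpus by classifying, correcting, and retraining.

    seed_corpus: list of (text, label) pairs, e.g. translated examples
    unlabeled_batches: iterable of lists of raw texts
    train: hypothetical callable, corpus -> function mapping text to label
    correct: hypothetical callable standing in for manual review,
             list of (text, guessed_label) -> list of (text, label)
    """
    corpus = list(seed_corpus)
    classify = train(corpus)                 # step 2: train initial classifier
    for batch in unlabeled_batches:
        # step 3: classify new text into candidate examples
        guessed = [(text, classify(text)) for text in batch]
        # steps 4 and 6: correct the guesses, then add them to the corpus
        corpus.extend(correct(guessed))
        # step 5: retrain on the larger corpus
        classify = train(corpus)
    return corpus, classify
```

Each pass leaves the human with correction work only, which is the point of bootstrapping.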

Translate Corpus

$ translate_corpus.py movie_reviews --source english \
    --target spanish
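
The idea behind the translation step: translate each file, keep its category. A rough sketch under assumptions, not the actual translate_corpus.py: the corpus is assumed to be a `category/file.txt` tree (as in movie_reviews, with `pos/` and `neg/`), and `translate` is a hypothetical callable you supply.

```python
from pathlib import Path

def translate_corpus(source_dir, target_dir, translate):
    """Copy a category/file.txt corpus tree, translating each file's text.

    translate is a hypothetical callable (str -> str); plug in whatever
    machine translation service or library you have access to.
    Returns the number of files written.
    """
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    count = 0
    for src in sorted(source_dir.glob('*/*.txt')):
        # keep the category directory (e.g. pos/ or neg/) so labels carry over
        dest = target_dir / src.parent.name / src.name
        dest.parent.mkdir(parents=True, exist_ok=True)
        text = src.read_text(encoding='utf-8')
        dest.write_text(translate(text), encoding='utf-8')
        count += 1
    return count
```

Machine translation is noisy, so the result is a starting corpus, not a finished one.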

Train Initial Classifier

$ train_classifier.py spanish_movie_reviews
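
What a Naive Bayes classifier over bag-of-words does, in miniature. This is a toy standalone sketch for illustration only; train_classifier.py uses NLTK's classifiers, and the function name and dict-of-probabilities return value here are assumptions.

```python
import collections
import math

def train_naive_bayes(examples):
    """Train a bag-of-words Naive Bayes model from (text, label) pairs.

    Returns prob_classify(text) -> {label: probability}.
    """
    label_counts = collections.Counter()
    word_counts = collections.defaultdict(collections.Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    total = sum(label_counts.values())

    def prob_classify(text):
        log_scores = {}
        for label, n in label_counts.items():
            denom = sum(word_counts[label].values()) + len(vocab)
            score = math.log(n / total)          # log prior for the label
            for word in text.lower().split():
                if word in vocab:                # ignore words never seen in training
                    # add-one smoothing avoids zero probabilities
                    score += math.log((word_counts[label][word] + 1) / denom)
            log_scores[label] = score
        # normalize the log scores into probabilities that sum to 1
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        z = sum(exp.values())
        return {label: v / z for label, v in exp.items()}

    return prob_classify
```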

Create New Corpus

$ classify_to_corpus.py spanish_sentiment \
    --input spanish_examples.txt \
    --classifier spanish_movie_reviews_NaiveBayes.pickle
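
Roughly what this step produces: each input line filed under the label the classifier guessed. A sketch under assumptions, not the script's actual implementation: the one-file-per-label layout and the `classify` callable are hypothetical.

```python
from pathlib import Path

def classify_to_corpus(lines, classify, corpus_dir):
    """Append each text to <corpus_dir>/<label>.txt according to classify().

    classify is a hypothetical callable (text -> label).
    Returns a dict of how many examples went to each label.
    """
    corpus_dir = Path(corpus_dir)
    corpus_dir.mkdir(parents=True, exist_ok=True)
    counts = {}
    for line in lines:
        text = line.strip()
        if not text:                 # skip blank lines
            continue
        label = classify(text)
        with open(corpus_dir / f'{label}.txt', 'a', encoding='utf-8') as f:
            f.write(text + '\n')
        counts[label] = counts.get(label, 0) + 1
    return counts
```

One example per line keeps the later correction pass simple: a wrong guess is a line to move.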

Manual Correction

1. scan each file
2. move incorrect examples to correct file
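
Moving an example can be done in a text editor; a small helper is another option. A sketch assuming one example per line and one file per category; `move_example` is a hypothetical name.

```python
from pathlib import Path

def move_example(text, wrong_file, right_file):
    """Move one example line from wrong_file to right_file.

    Returns True if the line was found and moved, False otherwise.
    """
    wrong_file, right_file = Path(wrong_file), Path(right_file)
    lines = wrong_file.read_text(encoding='utf-8').splitlines()
    if text not in lines:
        return False
    lines.remove(text)               # removes the first matching line only
    wrong_file.write_text(''.join(l + '\n' for l in lines), encoding='utf-8')
    with open(right_file, 'a', encoding='utf-8') as f:
        f.write(text + '\n')
    return True
```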

Train New Classifier

$ train_classifier.py spanish_sentiment

Adding to the Corpus

start with >90% probability
retrain
carefully decrease probability threshold

Add more at a Lower Threshold

$ classify_to_corpus.py categorized_corpus \
    --classifier categorized_corpus_NaiveBayes.pickle \
    --threshold 0.8 --input new_examples.txt
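
The thresholding idea: only accept examples the classifier is confident about, then lower the bar on what is left. A minimal sketch; `prob_classify` is assumed here to be a callable returning a `{label: probability}` dict, which is a simplification of the probability distribution an NLTK classifier returns.

```python
def select_confident(texts, prob_classify, threshold=0.9):
    """Split texts into confident (text, label) pairs and leftovers.

    prob_classify: assumed callable, text -> {label: probability}
    """
    accepted, remaining = [], []
    for text in texts:
        probs = prob_classify(text)
        label = max(probs, key=probs.get)    # most probable label
        if probs[label] > threshold:
            accepted.append((text, label))
        else:
            remaining.append(text)           # revisit at a lower threshold
    return accepted, remaining
```

Run it at 0.9, correct and retrain, then run it again on `remaining` at 0.8. Lower thresholds let in more errors, so the correction pass matters more each round.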

When are you done?

what level of accuracy do you need?
does your corpus reflect real text?
how much time do you have?
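
One way to put a number on the first question: hold out hand-checked examples and measure accuracy on them. A minimal sketch; the helper names and the 20% default are assumptions, and `classify` is any callable from text to label.

```python
import random

def split_holdout(examples, fraction=0.2, seed=0):
    """Shuffle (text, label) pairs and hold out a fraction for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)    # seeded, so the split is repeatable
    cut = len(shuffled) - int(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(classify, gold):
    """Fraction of (text, label) pairs in gold that classify() gets right."""
    if not gold:
        raise ValueError('gold set is empty')
    correct = sum(1 for text, label in gold if classify(text) == label)
    return correct / len(gold)
```

The held-out examples should come from real text, not only from the translated seed corpus, or the score will not answer the second question.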

Tips

garbage in, garbage out
correct bad data
clean & scrub text
experiment with train_classifier.py options
create custom features
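
"Clean & scrub" and "custom features" in miniature. A sketch only: what counts as dirt and which features help depend on your text, and `has_exclamation` is a made-up example feature. The dict-of-features shape matches what NLTK classifiers take as a featureset.

```python
import html
import re

def scrub(text):
    """Drop HTML tags, unescape entities, and collapse whitespace."""
    text = re.sub(r'<[^>]+>', ' ', text)     # strip tags first
    text = html.unescape(text)               # then decode entities like &amp;
    return re.sub(r'\s+', ' ', text).strip()

def features(text):
    """Bag-of-words features plus one custom feature, as a dict."""
    words = re.findall(r'\w+', text.lower())
    feats = {word: True for word in words}
    feats['has_exclamation'] = '!' in text   # hypothetical custom feature
    return feats
```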

Bootstrapping a Phrase Extractor

1. find a pos tagged corpus
2. annotate raw text
3. train pos tagger
4. create pos tagged & chunked corpus
5. tag unknown words
6. train pos tagger & chunker
7. correct errors
8. add to corpus, goto 5 until done

NLTK Tagged Corpora

English: brown, conll2000, treebank
Portuguese: mac_morpho, floresta
Spanish: cess_esp, conll2002
Catalan: cess_cat
Dutch: alpino, conll2002
Indian Languages: indian
Chinese: sinica_treebank
see http://text-processing.com/demo/tag/

Train Tagger

$ train_tagger.py treebank --simplify_tags

Phrase Annotation

Hello world, [this is an important phrase].
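
The bracket convention is easy to read back programmatically. A minimal sketch that splits an annotated line into phrase and non-phrase segments; `parse_annotation` is a hypothetical helper and assumes brackets are not nested.

```python
import re

def parse_annotation(line):
    """Split a bracket-annotated line into (text, is_phrase) segments."""
    segments = []
    # the capture group makes re.split keep the bracketed parts
    for part in re.split(r'(\[[^\[\]]*\])', line):
        if not part.strip():
            continue
        if part.startswith('[') and part.endswith(']'):
            segments.append((part[1:-1].strip(), True))
        else:
            segments.append((part.strip(), False))
    return segments

segments = parse_annotation('Hello world, [this is an important phrase].')
# [('Hello world,', False), ('this is an important phrase', True), ('.', False)]
```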

Tag Phrases

$ tag_phrases.py my_corpus --tagger treebank_simplify_tags.pickle \
    --input my_phrases.txt

Chunked & Tagged Phrase

Hello/N world/N ,/, [ this/DET is/V an/DET
important/ADJ phrase/N ] ./.
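
To make the format concrete, a small stdlib parser for a line like the one above. A sketch for illustration; in practice a chunked corpus reader handles this format, and the `PHRASE` label is an assumed default.

```python
def parse_chunked(line, chunk_label='PHRASE'):
    """Parse 'word/TAG [ word/TAG ... ]' into tokens and chunks.

    Returns a list whose items are either (word, tag) tuples or
    (chunk_label, [(word, tag), ...]) for each bracketed chunk.
    """
    result, chunk = [], None
    for token in line.split():
        if token == '[':
            chunk = []                       # open a new chunk
        elif token == ']':
            result.append((chunk_label, chunk))
            chunk = None                     # back outside any chunk
        else:
            word, _, tag = token.rpartition('/')  # split on the last slash
            (result if chunk is None else chunk).append((word, tag))
    return result
```

Splitting on the last slash keeps tokens like `,/,` and `./.` intact.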

Correct Unknown Words

1. find -NONE- tagged words
2. fix tags
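
Both steps can be partly scripted. A sketch that counts `-NONE-` tagged words in `word/TAG` lines and applies hand-chosen fixes; `find_unknown`, `fix_tags`, and the `fixes` dict are hypothetical, and choosing the right tag is still a human decision.

```python
import collections

def find_unknown(lines, unknown_tag='-NONE-'):
    """Count words tagged unknown_tag across 'word/TAG' lines."""
    counts = collections.Counter()
    for line in lines:
        for token in line.split():
            if token in ('[', ']'):          # chunk brackets carry no tag
                continue
            word, _, tag = token.rpartition('/')
            if tag == unknown_tag:
                counts[word] += 1
    return counts

def fix_tags(line, fixes, unknown_tag='-NONE-'):
    """Replace unknown tags using a {word: tag} dict of manual fixes."""
    out = []
    for token in line.split():
        word, _, tag = token.rpartition('/')
        if tag == unknown_tag and word in fixes:
            token = f'{word}/{fixes[word]}'
        out.append(token)
    return ' '.join(out)
```

Sorting the counts with `most_common()` puts the words worth fixing first.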

Train New Tagger

$ train_tagger.py my_corpus \
    --reader nltk.corpus.reader.ChunkedCorpusReader

Train Chunker

$ train_chunker.py my_corpus \
    --reader nltk.corpus.reader.ChunkedCorpusReader

Extracting Phrases

import collections
import nltk.data
from nltk import tokenize
from nltk.tag import untag

# load the tagger and chunker trained in the previous steps
tagger = nltk.data.load('taggers/my_corpus_tagger.pickle')
chunker = nltk.data.load('chunkers/my_corpus_chunker.pickle')

def extract_phrases(t):
    # map each chunk label to the list of phrases found under it
    d = collections.defaultdict(list)
    # NLTK 3 uses Tree.label(); older releases used the .node attribute
    for sub in t.subtrees(lambda s: s.label() != 'S'):
        d[sub.label()].append(' '.join(untag(sub.leaves())))
    return d

# text is the raw string to extract phrases from
sents = tokenize.sent_tokenize(text)
words = tokenize.word_tokenize(sents[0])
d = extract_phrases(chunker.parse(tagger.tag(words)))
# defaultdict(<class 'list'>, {'PHRASE_TAG': ['phrase']})

Final Tips

error correction is faster than manual annotation
find close enough corpora
use nltk-trainer to experiment
iterate -> quality
no substitute for human judgement

Links

http://www.nltk.org
https://github.com/japerk/nltk-trainer
http://text-processing.com
