Text analysis using python

Text Analysis with
Python
Vijay Ramachandran

Fools Rush In?
”The fact is, that to do anything in the world worth doing, we must not stand
back shivering and thinking of the cold and danger, but jump in and
scramble through as well as we can.” - Robert Cushing

Motivation
 Machine Learning
everywhere
 Users expectations of
”standard experience”
 Many Resources!

Text Mining
 Extract high quality information from
text
 Typically, trends and patterns are
analysed using statistical methods –
Machine Learning
 Common Tasks – entity recognition,
sentiment analysis, categorization,
clustering

Why Python?
 Short, concise text processing
 NLTK
 Scipy, numpy, scikit.learn
 Integration with other languages!
Because when you start your company, YOU get
to decide!

Pre-processing
 Lower casing, stripping extra characters
 ”Realyyyyyyyyyy!!!!!”
 Tokenisation (sentences, words)
>>>pktst = nltk.data.load('tokenizers/punkt/english.pickle')
>>>sentences = pktst.tokenize(tweet)
>>>words = nltk.word_tokenize(sent)
 Handling Entities
>>> re.sub(r'(^| )@[^ ]+','',tweet).strip()
 Removing stopwords
>>> stopwords = set([”a”, ”an”, ”the”, ”by”])
>>> ' '.join([w for w in words if w not in stopwords])

Leveraging a Corpus
 Simple techniques to analyze a domain
 Term Frequency to find important entities
 ”low light photography”, ”travel photography”
 tf/idf to find representative terms across
domains
 ”gaming” for TVs
 GND to find aliases
 e.g., ”e700” and ”Samsung e700” vs ”e310” and
”Samsung e310”
 Yahoo BOSS is great!

Bayes Theorem
 Conditional Probability
 Bayesian classifiers - given features, find
Probability of Class
PC∣Fi ,...,Fn=PC∏i
PFi∣C

Features
 Characteristics of to-be-classified object
 For text, typically unigrams, ”n”-grams, POS,
CHUNK, presence in gazeteer
 Can be numerical or boolean
 Crucial to performance of classifier!
 In NLTK, a dict of feature name to value
>>> {”word1” : 2, ”word2” : 1, ”word1-word2” : 1,
”word2-word3” : 3, ”word1-in-monuments” : False}

Training
 ”Teach” the classifier how to classify, using
training data
>>> from nltk import NaiveBayesClassifier as nbc
>>> trg = generate_features(training_samples)
>>> random.shuffle(trg)
>>> train, test = trg[:int(0.9*len(trg)], trg[int(0.9*len(trg):]
>>> clf = nbc.train(train)
>>> nltk.classify.accuracy(clf, test)

Training, part 2
 Measure, tune, iterate
 Cross Validation
 Aim is to find a balance between Precision and
Recall

A Gender Classifier
def gender_features(word):
return {'last_letter': word[-1]}
names = ([(name, 'male') for name in names.words('male.txt')] +
[(name, 'female') for name in names.words('female.txt')])
featuresets = [(gender_features(n), g) for (n,g) in names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)

Advice on Training
 Its TEDIOUS! Cut the cognitive load
e.g.: ”I love this camera!”
BAD:
0 I PRP B-NP -
1 love VBP B-VP -
2 this DT B-NP -
3 camera NN I-NP 1
Good: (I, love) – NO, (love, this camera) – YES
 Dealing with ambiguity
 Mechanical Turk, jsonwidget

Other Classifiers
 Maximum Entropy
 Support Vector Machines
 Conditional Random Fields
All follow the same workflow!

More Examples
 Find questions in Tweets
 unigrams, bigrams, trigrams, parse distance
 Recognize ”contextual questions” in
discussions
 ”reco” words, ”thanks” words
 ”Use-type” recognizer
 POS, CHUNK, special words, verbs within 3 words
of phrase
 e.g., ”I want to compose the perfect landscape shot”

Eats, shoots and leaves
 Common basic step!
 POS, chunk, parse
tree
 Hairy theory
 FSA, Morphology,
Phonology
 n-Grams, Probabilistic
models, CFGs
Don't Worry, Be Happy!

No License Required!
 What to use? NLTK, or ?
 Docking with the Evil MotherShip
 Jepp, JPype?
 ► use RPC
>>> from .stanford_corenlp import jsonrpc
>>> server = jsonrpc.TransportTcpIp(...)
>>> result = loads(server.parse(paragraph))
 POS, Parse tree, and more (but no chunks?)

Wordnet®
 ”a large lexical database of English”
 synsets, hypernyms, hyponyms, gloss
 synonyms, antonyms
 Sentiwordnet
 Start with candidate ”good” and ”bad” words
 Expand by recursively following edges
 Classify using definition
e.g., ”good” → ”better”, ”good” → ”impressive”

Love or Hate?
”I love the screen, but the battery life is poor”
 Shallow?
 3 class classifier
 Or Deep?
 Relationship classifier
 Extracting candidate subjects
 Lots of unsolved problems – co-reference, multiple
subjects, negation, etc.
 Summary ratings for ”Executive Summaries”

Gender, revisited
 solr - search semi-structured text
 Out of the box text processing utilities – stemming,
tokenising
 Highly configurable relevancy
 fields, weights
 Sorting!
 ”sort(term, field, edit) desc” for Levenshtein edit
distance

Gender, revisited
 Schema
<field name="name" type="string" />
<field name="name_phoneme" type="phonetic" />
 Search: add ”sort=strdist(unknown_name, name, edit) desc)
 Python:
for namerec in results:
If namerec.gender == 'Male':
male_score += namerec.match_score
else:
female_score += namerec.match_score
 Correctly guesses ”Sheena” and ”Ashish”!!
 80/80 precision/recall

Miles to go before I sleep!
 Machine Learning on coursera


Thank You!!
Vijay Ramachandran: vijay750@gmail.com

Text analysis using python

More Related Content

What's hot (20)

Similar to Text analysis using python (20)

Recently uploaded (20)

Text analysis using python