Text Analysis with
Python
Vijay Ramachandran
Fools Rush In?
“The fact is, that to do anything in the world worth doing, we must not stand
back shivering and thinking of the cold and danger, but jump in and
scramble through as well as we can.” - Robert Cushing
Motivation
• Machine Learning everywhere
• Users' expectations of a “standard experience”
• Many resources!
Text Mining
• Extract high-quality information from text
• Typically, trends and patterns are analysed using statistical methods (Machine Learning)
• Common tasks: entity recognition, sentiment analysis, categorization, clustering
Why Python?
• Short, concise text processing
• NLTK
• SciPy, NumPy, scikit-learn
• Integration with other languages!
Because when you start your company, YOU get to decide!
Pre-processing
• Lower-casing, stripping extra characters
  ◦ “Realyyyyyyyyyy!!!!!”
• Tokenisation (sentences, words)
>>> import re, nltk
>>> pktst = nltk.data.load('tokenizers/punkt/english.pickle')
>>> sentences = pktst.tokenize(tweet)
>>> words = [w for sent in sentences for w in nltk.word_tokenize(sent)]
• Handling entities (here, stripping @mentions)
>>> re.sub(r'(^| )@[^ ]+', '', tweet).strip()
• Removing stopwords
>>> stopwords = set(["a", "an", "the", "by"])
>>> ' '.join([w for w in words if w not in stopwords])
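The pre-processing steps above can be chained into one function. The following is a minimal sketch using only the standard library: the regex tokenizer is a crude stand-in for NLTK's tokenizers, and the rule for squeezing repeated characters and the sample tweet are assumptions added for illustration.

```python
import re

STOPWORDS = {"a", "an", "the", "by"}  # the tiny stopword set from the slide

def preprocess(tweet):
    """Lower-case, drop @mentions, squeeze repeated characters, remove stopwords."""
    text = tweet.lower()
    text = re.sub(r'(^| )@[^ ]+', '', text).strip()  # the slide's mention regex
    text = re.sub(r'(.)\1{2,}', r'\1', text)          # assumed rule: "yyyy" -> "y", "!!!!" -> "!"
    words = re.findall(r"[a-z0-9']+", text)           # crude stand-in for a real tokenizer
    return [w for w in words if w not in STOPWORDS]

# Hypothetical tweet, made up for this example.
print(preprocess("@vijay Realyyyyyyyyyy!!!!! love the camera by Samsung"))
# → ['realy', 'love', 'camera', 'samsung']
```

Squeezing runs of three or more identical characters keeps legitimate double letters ("really", "good") intact, at the cost of not restoring the correct spelling.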
Leveraging a Corpus
• Simple techniques to analyze a domain
• Term frequency to find important entities
  ◦ “low light photography”, “travel photography”
• tf-idf to find representative terms across domains
  ◦ “gaming” for TVs
• GND to find aliases
  ◦ e.g., “e700” and “Samsung e700” vs “e310” and “Samsung e310”
• Yahoo BOSS is great!
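To make the tf-idf idea concrete, here is a small sketch in plain Python. It uses one common weighting (relative term frequency times the log of inverse document frequency); the two toy "domains" are invented for illustration and are not from the talk.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per tokenised document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: (c / len(doc)) * math.log(n_docs / df[t])
                       for t, c in tf.items()})
    return scores

# Hypothetical toy corpora: words from TV reviews vs. camera reviews.
docs = [
    "great screen for gaming and gaming consoles".split(),
    "great lens for low light photography".split(),
]
tv, camera = tf_idf(docs)
print(max(tv, key=tv.get))  # → gaming
```

Terms that appear in every domain ("great", "for") score zero, which is why tf-idf surfaces terms that are representative of one domain rather than merely frequent.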
Supervised Classification
Bayes' Theorem
• Conditional probability
• Bayesian classifiers: given features, find the probability of a class
$$P(C \mid F_1, \ldots, F_n) \propto P(C) \prod_{i=1}^{n} P(F_i \mid C)$$
Features
• Characteristics of the to-be-classified object
• For text, typically unigrams, n-grams, POS, CHUNK, presence in a gazetteer
• Can be numerical or boolean
• Crucial to the performance of the classifier!
• In NLTK, a dict of feature name to value
>>> {"word1": 2, "word2": 1, "word1-word2": 1,
...  "word2-word3": 3, "word1-in-monuments": False}
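A feature dict like the one above could be produced by a small extractor. This is a sketch only: the function name `extract_features`, the `gazetteers` argument and the sample tokens are hypothetical, chosen to mirror the key style on the slide.

```python
from collections import Counter

def extract_features(words, gazetteers):
    """Build an NLTK-style feature dict from a list of tokens."""
    features = dict(Counter(words))                     # unigram counts
    bigrams = Counter(f"{a}-{b}" for a, b in zip(words, words[1:]))
    features.update(bigrams)                            # bigram counts
    for name, entries in gazetteers.items():            # boolean membership features
        for word in set(words):
            features[f"{word}-in-{name}"] = word in entries
    return features

# Hypothetical tokens and gazetteer.
feats = extract_features(["taj", "mahal", "taj"], {"monuments": {"taj"}})
print(feats)
```

The resulting dict mixes numerical (count) and boolean features, which NLTK classifiers accept as-is.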
Training
• “Teach” the classifier how to classify, using training data
>>> import random, nltk
>>> from nltk import NaiveBayesClassifier as nbc
>>> trg = generate_features(training_samples)  # list of (feature dict, label) pairs
>>> random.shuffle(trg)
>>> split = int(0.9 * len(trg))  # 90% train, 10% test
>>> train, test = trg[:split], trg[split:]
>>> clf = nbc.train(train)
>>> nltk.classify.accuracy(clf, test)
Training, part 2
• Measure, tune, iterate
• Cross-validation
• Aim is to find a balance between precision and recall
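The measuring step can be sketched without any library. The helpers below compute precision and recall for one class and generate simple k-fold splits. The function names and the interleaved fold layout are assumptions made for this example, not something taken from the talk.

```python
def precision_recall(gold, predicted, positive):
    """Precision and recall for one class, given parallel label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if p == positive and g == positive)
    fp = sum(1 for g, p in pairs if p == positive and g != positive)
    fn = sum(1 for g, p in pairs if p != positive and g == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def k_fold(samples, k):
    """Yield (train, test) pairs; fold i tests on every k-th sample starting at i."""
    for i in range(k):
        test = samples[i::k]
        train = [s for j, s in enumerate(samples) if j % k != i]
        yield train, test

# Hypothetical labels: "q" = question, "n" = not a question.
print(precision_recall(["q", "q", "n", "n"], ["q", "n", "q", "n"], "q"))
# → (0.5, 0.5)
```

Averaging the scores over all folds gives a steadier estimate than a single 90/10 split, which is useful when tuning features.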
A Gender Classifier
import random
import nltk
from nltk.corpus import names  # requires nltk.download('names')

def gender_features(word):
    return {'last_letter': word[-1]}

labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)  # otherwise the 500 test names are all male
featuresets = [(gender_features(n), g) for (n, g) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)
Advice on Training
• It's TEDIOUS! Cut the cognitive load
e.g.: “I love this camera!”
BAD:
0 I PRP B-NP -
1 love VBP B-VP -
2 this DT B-NP -
3 camera NN I-NP 1
Good: (I, love) – NO, (love, this camera) – YES
• Dealing with ambiguity
• Mechanical Turk, jsonwidget
Other Classifiers
• Maximum Entropy
• Support Vector Machines
• Conditional Random Fields
All follow the same workflow!
More Examples
• Find questions in Tweets
  ◦ unigrams, bigrams, trigrams, parse distance
• Recognize “contextual questions” in discussions
  ◦ “reco” words, “thanks” words
• “Use-type” recognizer
  ◦ POS, CHUNK, special words, verbs within 3 words of phrase
  ◦ e.g., “I want to compose the perfect landscape shot”
Eats, shoots and leaves
• Common basic step!
  ◦ POS, chunk, parse tree
• Hairy theory
  ◦ FSA, Morphology, Phonology
  ◦ n-grams, probabilistic models, CFGs
Don't Worry, Be Happy!
No License Required!
• What to use? NLTK, or ?
• Docking with the Evil MotherShip
  ◦ Jepp, JPype?
  ◦ ► use RPC
>>> from json import loads
>>> from .stanford_corenlp import jsonrpc
>>> server = jsonrpc.TransportTcpIp(...)
>>> result = loads(server.parse(paragraph))
• POS, parse tree, and more (but no chunks?)
WordNet®
• “a large lexical database of English”
• synsets, hypernyms, hyponyms, gloss
• synonyms, antonyms
• SentiWordNet
  ◦ Start with candidate “good” and “bad” words
  ◦ Expand by recursively following edges
  ◦ Classify using definition
e.g., “good” → “better”, “good” → “impressive”
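The "expand by following edges" step amounts to a bounded graph traversal. The sketch below runs a breadth-first expansion over a hand-made edge dictionary. The graph, the function name `expand` and the depth limit are all hypothetical; a real run would take its edges from WordNet relations instead.

```python
from collections import deque

def expand(seeds, edges, max_depth=2):
    """Breadth-first expansion of seed words along lexical edges."""
    depth = {word: 0 for word in seeds}
    queue = deque(seeds)
    while queue:
        word = queue.popleft()
        if depth[word] == max_depth:
            continue  # do not follow edges beyond the depth limit
        for neighbour in edges.get(word, ()):
            if neighbour not in depth:
                depth[neighbour] = depth[word] + 1
                queue.append(neighbour)
    return set(depth)

# Hypothetical toy graph standing in for WordNet edges.
edges = {
    "good": ["better", "impressive"],
    "better": ["superior"],
    "superior": ["elite"],
}
print(sorted(expand(["good"], edges)))
# → ['better', 'good', 'impressive', 'superior']
```

A depth limit matters here because word senses drift as edges are followed, so distant neighbours are less reliable sentiment candidates.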
Love or Hate?
“I love the screen, but the battery life is poor”
• Shallow?
  ◦ 3-class classifier
• Or deep?
  ◦ Relationship classifier
  ◦ Extracting candidate subjects
  ◦ Lots of unsolved problems: co-reference, multiple subjects, negation, etc.
• Summary ratings for “Executive Summaries”
Gender, revisited
• Solr: search semi-structured text
• Out-of-the-box text processing utilities: stemming, tokenising
• Highly configurable relevancy
  ◦ fields, weights
• Sorting!
  ◦ “sort=strdist(term, field, edit) desc” for Levenshtein edit distance
Gender, revisited
• Schema
<field name="name" type="string" />
<field name="name_phoneme" type="phonetic" />
• Search: add “sort=strdist(unknown_name, name, edit) desc”
• Python:
male_score = female_score = 0.0
for namerec in results:
    if namerec.gender == 'Male':
        male_score += namerec.match_score
    else:
        female_score += namerec.match_score
• Correctly guesses “Sheena” and “Ashish”!!
• 80/80 precision/recall
Miles to go before I sleep!
• Machine Learning on Coursera
Thank You!!
Vijay Ramachandran: vijay750@gmail.com
