Getting Started with Text Mining
Mathangi Sri R
Let's look at some text
1. I love movies
2. I love icecream
3. I don’t like anything
4. I am not going to tell you anything
5. What are you guys doing
6. Where are you all going with it
7. I love her
8. doggie
When asked a question: what do you love?
the tokens..?
['I', 'love', 'movies', 'I', 'love', 'icecream', 'I', "don't", 'like',
'anything', 'I', 'am', 'not', 'going', 'to', 'tell', 'you', 'anything',
'What', 'are', 'you', 'guys', 'doing', 'Where', 'are', 'you', 'all',
'going', 'with', 'it', 'I', 'love', 'her', 'doggie']
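A minimal sketch of how such a token list can be produced, assuming a plain whitespace split (which matches the output above):

```python
# Whitespace-tokenize the eight example sentences into one flat token list.
sentences = [
    "I love movies", "I love icecream", "I don't like anything",
    "I am not going to tell you anything", "What are you guys doing",
    "Where are you all going with it", "I love her", "doggie",
]
tokens = [tok for s in sentences for tok in s.split()]
print(len(tokens))  # 34 tokens in total
```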
word frequency
[('I', 5), ('love', 3), ('movies', 1), ('I', 5), ('love', 3),
('icecream', 1), ('I', 5), ("don't", 1), ('like', 1),
('anything', 2), ('I', 5), ('am', 1), ('not', 1), ('going',
2), ('to', 1), ('tell', 1), ('you', 3), ('anything', 2),
('What', 1), ('are', 2), ('you', 3), ('guys', 1), ('doing',
1), ('Where', 1), ('are', 2), ('you', 3), ('all', 1),
('going', 2), ('with', 1), ('it', 1), ('I', 5), ('love', 3),
('her', 1), ('doggie', 1)]
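The list above pairs every token with its total count, duplicates included, which is what an expression like `[(w, tokens.count(w)) for w in tokens]` produces. A deduplicated frequency table is easier to read; a sketch with `collections.Counter`:

```python
from collections import Counter

# Same 34 tokens as in the tokenization step.
tokens = ("I love movies I love icecream I don't like anything "
          "I am not going to tell you anything What are you guys doing "
          "Where are you all going with it I love her doggie").split()
counts = Counter(tokens)
print(counts['I'])  # 5
```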
Term Frequency
[('I', 0.15), ('love', 0.09), ('movies', 0.03), ('I', 0.15), ('love', 0.09), ('icecream',
0.03), ('I', 0.15), ("don't", 0.03), ('like', 0.03), ('anything', 0.06), ('I', 0.15),
('am', 0.03), ('not', 0.03), ('going', 0.06), ('to', 0.03), ('tell', 0.03), ('you', 0.09),
('anything', 0.06), ('What', 0.03), ('are', 0.06), ('you', 0.09), ('guys', 0.03),
('doing', 0.03), ('Where', 0.03), ('are', 0.06), ('you', 0.09), ('all', 0.03), ('going',
0.06), ('with', 0.03), ('it', 0.03), ('I', 0.15), ('love', 0.09), ('her', 0.03), ('doggie',
0.03)]
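Term frequency here divides each token's count by the total number of tokens (34); a sketch that reproduces the rounded values above:

```python
from collections import Counter

tokens = ("I love movies I love icecream I don't like anything "
          "I am not going to tell you anything What are you guys doing "
          "Where are you all going with it I love her doggie").split()
counts = Counter(tokens)
# Count of each word divided by the total token count, rounded to 2 places.
tf = {w: round(c / len(tokens), 2) for w, c in counts.items()}
print(tf['I'], tf['love'], tf['movies'])  # 0.15 0.09 0.03
```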
TF - IDF
• TF: Term Frequency, which measures how frequently a
term occurs in a document
TF(t) = (Number of times term t appears in a document) /
(Total number of terms in the document).
• IDF: Inverse Document Frequency, which measures how
important a term is:
IDF(t) = log_e(Total number of documents / Number of
documents with term t in it).
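The two formulas can be written directly; a sketch treating each of the eight example sentences as one document (the `tf`/`idf` function names are mine, not from the slides):

```python
import math

docs = [s.split() for s in [
    "I love movies", "I love icecream", "I don't like anything",
    "I am not going to tell you anything", "What are you guys doing",
    "Where are you all going with it", "I love her", "doggie",
]]

def tf(term, doc):
    # (Number of times term t appears in a document) / (Total terms in the document)
    return doc.count(term) / len(doc)

def idf(term):
    # log_e(Total number of documents / Number of documents with term t in it)
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

print(round(idf('doggie'), 3))  # ln(8/1) = 2.079: a rare term scores high
print(round(idf('love'), 2))    # ln(8/3) = 0.98: appears in 3 of 8 documents
```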
Tf-idf for our dataset
• 8×22 matrix (8 records × 22 unique words; 34 tokens in total)
Columns (22 unique words): u'all', u'am', u'anything', u'are', u'doggie', u'doing', u'don', u'going', u'guys', u'her', u'icecream', u'it', u'like', u'love', u'movies', u'not', u'tell', u'to', u'what', u'where', u'with', u'you'
I love movies 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.59 0.81 0.00 0.00 0.00 0.00 0.00 0.00 0.00
I love icecream 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.81 0.00 0.00 0.59 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
I don’t like anything 0.00 0.00 0.51 0.00 0.00 0.00 0.61 0.00 0.00 0.00 0.00 0.00 0.61 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
I am not going to tell you anything 0.00 0.41 0.34 0.00 0.00 0.00 0.00 0.34 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.41 0.41 0.41 0.00 0.00 0.00 0.30
What are you guys doing 0.00 0.00 0.00 0.41 0.00 0.49 0.00 0.00 0.49 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.49 0.00 0.00 0.35
Where are you all going with it 0.41 0.00 0.00 0.34 0.00 0.00 0.00 0.34 0.00 0.00 0.00 0.41 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.41 0.41 0.30
Unigrams, Bi-grams and Tri-grams
• I love movies
-- bi-grams: I love, love movies
In our dataset,
[u'all', u'all going', u'all going with', u'am', u'am not', u'am not going', u'anything', u'are', u'are you',
u'are you all', u'are you guys', u'doggie', u'doing', u'don', u'don like', u'don like anything', u'going',
u'going to', u'going to tell', u'going with', u'going with it', u'guys', u'guys doing', u'her', u'icecream',
u'it', u'like', u'like anything', u'love', u'love her', u'love icecream', u'love movies', u'movies', u'not',
u'not going', u'not going to', u'tell', u'tell you', u'tell you anything', u'to', u'to tell', u'to tell you',
u'what', u'what are', u'what are you', u'where', u'where are', u'where are you', u'with', u'with it',
u'you', u'you all', u'you all going', u'you anything', u'you guys', u'you guys doing']
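N-grams are just sliding windows over the token sequence; a minimal sketch (the `ngrams` helper is mine, for illustration):

```python
def ngrams(tokens, n):
    # All contiguous runs of n tokens, joined with spaces.
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I love movies".split()
print(ngrams(tokens, 2))  # ['I love', 'love movies']
print(ngrams(tokens, 3))  # ['I love movies']
```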
Python code to generate tf-idf matrix
Input dataset (List of strings)-
[u'I love movies', u'I love icecream ', u"I don't like anything", u'I am not going to tell you anything', u'What
are you guys doing', u'Where are you all going with it', u'I love her', u'doggie ']
Code:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(min_df=0.0, analyzer=u'word', ngram_range=(1, 4), stop_words=None)
tfidf_matrix = tfidf_vectorizer.fit_transform(tt1)
tf1 = tfidf_matrix.todense()
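For reference, running a unigram-only vectorizer over these eight strings reproduces the 8×22 matrix shown earlier; under scikit-learn's defaults (smoothed IDF, L2 normalisation, and a tokenizer that drops one-character tokens such as 'I'), the "I love movies" row comes out with the 0.59/0.81 weights from the table:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [u'I love movies', u'I love icecream ', u"I don't like anything",
        u'I am not going to tell you anything', u'What are you guys doing',
        u'Where are you all going with it', u'I love her', u'doggie ']

vec = TfidfVectorizer(analyzer='word', ngram_range=(1, 1))
X = vec.fit_transform(docs)
print(X.shape)  # (8, 22)
love, movies = vec.vocabulary_['love'], vec.vocabulary_['movies']
print(round(X[0, love], 2), round(X[0, movies], 2))  # 0.59 0.81
```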
Text Classification
Classifying text - Methods
• Supervised classification:
– Requires labelled data
– Classification algorithms – SVM, LR, Ensemble,
RF, etc.
– Can measure accuracy precisely
– Needed for highly actionable applications
Classifying text - Methods
• Unsupervised
- No labels required
- Accuracy is a ‘loose’ measure
- Measuring homogeneity of clusters
- Useful for quick insights or where grouping is
required
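One common way to quantify cluster homogeneity when reference labels are available for evaluation is scikit-learn's homogeneity score: 1.0 means every cluster contains members of a single class, and the cluster IDs themselves do not matter. A sketch on toy labels:

```python
from sklearn.metrics import homogeneity_score

true_labels = ['baggage', 'baggage', 'check in', 'check in']
# Cluster IDs are arbitrary; only the grouping matters.
print(homogeneity_score(true_labels, [1, 1, 0, 0]))  # 1.0: perfectly homogeneous
print(homogeneity_score(true_labels, [0, 1, 0, 1]))  # ~0.0: each cluster is fully mixed
```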
Classifying text - Methods
• Semi-supervised learning is a class of
supervised learning tasks and techniques that
also make use of unlabeled data for training -
typically a small amount of labeled data with a
large amount of unlabeled data.
Supervised Learning – Case Study
Let's look at some text
line class
20 get me to check in check in
21 check in internet check in
22 what is free baggage allowance baggage
23 how much baggage baggage
24 I have 35 kg should I pay baggage
25 how much can I carry baggage
26 lots of bags I have baggage
27 till how much baggage is free baggage
28 how many bags are free baggage
29 upto what weight I can carry baggage
30 how much can I carry baggage
31 baggage carry baggage
32 baggage to carry baggage
33 number of bags baggage
34 carrying bags baggage
35 travelling with bags baggage
36 money for luggage baggage
37 how much luggage I can carry baggage
38 too much luggage baggage
Class Distribution
[Bar chart: class frequencies, ranging from 0% to roughly 30%, for the classes login, other, baggage, check in, greetings, thanks, cancel]
Preprocess the data
• Map similar words into a single word group
(for example, different place names can be
replaced with one group name)
• Use regex to normalize dates, dollar values,
etc.
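A sketch of the regex normalisation step (the `_DOLLAR_` and `_DATE_` group names are illustrative choices, not from the slides):

```python
import re

def normalize(text):
    # Collapse dollar amounts and dd/mm/yyyy dates into single group tokens.
    text = re.sub(r'\$\s?\d+(?:\.\d+)?', '_DOLLAR_', text)
    text = re.sub(r'\b\d{1,2}/\d{1,2}/\d{2,4}\b', '_DATE_', text)
    return text

print(normalize("refund $45 by 12/08/2018"))  # refund _DOLLAR_ by _DATE_
```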
Stop Words
How do you generate stop words from a corpus?
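One simple corpus-driven answer: treat the highest document-frequency words as stop-word candidates. A sketch on the eight example sentences (the 0.35 threshold is an illustrative choice; on a corpus this tiny, content words like 'love' leak in, which is why a larger corpus and a tuned threshold matter in practice):

```python
from collections import Counter

docs = ["I love movies", "I love icecream", "I don't like anything",
        "I am not going to tell you anything", "What are you guys doing",
        "Where are you all going with it", "I love her", "doggie"]

df = Counter()
for d in docs:
    df.update(set(d.lower().split()))  # count each word once per document

# Words appearing in at least 35% of documents become stop-word candidates.
stopwords = {w for w, c in df.items() if c / len(docs) >= 0.35}
print(sorted(stopwords))  # ['i', 'love', 'you']
```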
Stemming
• Stemming is the process of reducing a word
to its stem, i.e. its root form. The root form
is not necessarily a word by itself, but it can be
used to generate words by concatenating the
right suffix.
Stemmed words
fish, fishes and fishing --- fish
study, studies and studying --- studi
Difference between stemming and lemmatization:
stemming – may produce meaningless tokens
lemmatization – produces meaningful words
Stemming and Lemmatizing
Code
from nltk.stem import PorterStemmer
#from nltk.tokenize import sent_tokenize, word_tokenize
ps = PorterStemmer()
ps.stem("having")
from nltk.stem.lancaster import LancasterStemmer
lancaster_stemmer = LancasterStemmer()
lancaster_stemmer.stem("maximum")
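To contrast stemming with lemmatization, a toy lookup-based lemmatizer; the `LEMMAS` table here is hypothetical and for illustration only (real lemmatizers, such as NLTK's WordNetLemmatizer, consult a full lexicon):

```python
# Hypothetical hand-built lemma table, illustration only.
LEMMAS = {"studies": "study", "studying": "study",
          "fishes": "fish", "fishing": "fish"}

def lemmatize(word):
    # Fall back to the word itself when no lemma is known.
    return LEMMAS.get(word.lower(), word)

print(lemmatize("studies"))  # study: a real word, unlike the stem 'studi'
```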
Spell checker
• https://github.com/mattalcock/blog/blob/master/2012/12/5/python-spell-checker.rst
• https://pypi.python.org/pypi/autocorrect/0.1.0
Sampling – Train and Validation
from sklearn.model_selection import StratifiedShuffleSplit  # sklearn.cross_validation was removed; use model_selection
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in sss.split(tf1, tgt3):
    # print("TRAIN:", train_index, "TEST:", test_index)
    a_train_b, a_test_b = tf1[train_index], tf1[test_index]
    b_train_b, b_test_b = tgt3[train_index], tgt3[test_index]
Generate features or word tokens and vectorize
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(min_df=0.0, analyzer=u'word', ngram_range=(1, 4), stop_words=None)
tfidf_matrix = tfidf_vectorizer.fit_transform(tt1)
tf1 = tfidf_matrix.todense()
Feature Selection
from sklearn.feature_selection import SelectPercentile, f_classif
selector = SelectPercentile(f_classif, percentile=100)
# fit_transform both fits the selector and transforms the training set
a_train_b = selector.fit_transform(a_train_b, b_train_b)
a_test_b = selector.transform(a_test_b)
Build Model
• Logistic Regression
• GBM
• SVM
• RF
• Neural Nets
• NB
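Putting the pieces together, a minimal end-to-end sketch: a tf-idf + logistic regression pipeline fitted on a few rows from the case-study data (the sample is tiny, so this is purely illustrative of the workflow, not of real accuracy):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A handful of labelled utterances from the case-study table.
texts = ["get me to check in", "check in internet",
         "how much baggage", "money for luggage",
         "how much luggage I can carry", "number of bags"]
labels = ["check in", "check in", "baggage", "baggage", "baggage", "baggage"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["too much luggage"]))  # ['baggage']
```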
Getting started with text mining By Mathangi Sri Head of Data Science at PhonePe at CYPHER 2018