Getting Started with Text Mining
Mathangi Sri R
Let's look at some text
1. I love movies
2. I love icecream
3. I don’t like anything
4. I am not going to tell you anything
5. What are you guys doing
6. Where are you all going with it
7. I love her
8. doggie
When asked a question: what do you love?
the tokens..?
['I', 'love', 'movies', 'I', 'love', 'icecream', 'I', "don't", 'like',
'anything', 'I', 'am', 'not', 'going', 'to', 'tell', 'you', 'anything',
'What', 'are', 'you', 'guys', 'doing', 'Where', 'are', 'you', 'all',
'going', 'with', 'it', 'I', 'love', 'her', 'doggie']
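A minimal sketch of how such a token list can be produced, assuming a plain whitespace split (which matches the output above):

```python
# Whitespace-tokenize the eight example sentences into one flat token list.
sentences = [
    "I love movies", "I love icecream", "I don't like anything",
    "I am not going to tell you anything", "What are you guys doing",
    "Where are you all going with it", "I love her", "doggie",
]
tokens = [tok for s in sentences for tok in s.split()]
print(len(tokens))  # 34 tokens in total
```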
word frequency
[('I', 5), ('love', 3), ('movies', 1), ('I', 5), ('love', 3),
('icecream', 1), ('I', 5), ("don't", 1), ('like', 1),
('anything', 2), ('I', 5), ('am', 1), ('not', 1), ('going',
2), ('to', 1), ('tell', 1), ('you', 3), ('anything', 2),
('What', 1), ('are', 2), ('you', 3), ('guys', 1), ('doing',
1), ('Where', 1), ('are', 2), ('you', 3), ('all', 1),
('going', 2), ('with', 1), ('it', 1), ('I', 5), ('love', 3),
('her', 1), ('doggie', 1)]
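The list above pairs every token with its total count, duplicates included, which is what an expression like `[(w, tokens.count(w)) for w in tokens]` produces. A deduplicated frequency table is easier to read; a sketch with `collections.Counter`:

```python
from collections import Counter

# Same 34 tokens as in the tokenization step.
tokens = ("I love movies I love icecream I don't like anything "
          "I am not going to tell you anything What are you guys doing "
          "Where are you all going with it I love her doggie").split()
counts = Counter(tokens)
print(counts['I'])  # 5
```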
Term Frequency
[('I', 0.15), ('love', 0.09), ('movies', 0.03), ('I', 0.15), ('love', 0.09), ('icecream',
0.03), ('I', 0.15), ("don't", 0.03), ('like', 0.03), ('anything', 0.06), ('I', 0.15),
('am', 0.03), ('not', 0.03), ('going', 0.06), ('to', 0.03), ('tell', 0.03), ('you', 0.09),
('anything', 0.06), ('What', 0.03), ('are', 0.06), ('you', 0.09), ('guys', 0.03),
('doing', 0.03), ('Where', 0.03), ('are', 0.06), ('you', 0.09), ('all', 0.03), ('going',
0.06), ('with', 0.03), ('it', 0.03), ('I', 0.15), ('love', 0.09), ('her', 0.03), ('doggie',
0.03)]
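Term frequency here divides each token's count by the total number of tokens (34); a sketch that reproduces the rounded values above:

```python
from collections import Counter

tokens = ("I love movies I love icecream I don't like anything "
          "I am not going to tell you anything What are you guys doing "
          "Where are you all going with it I love her doggie").split()
counts = Counter(tokens)
# Count of each word divided by the total token count, rounded to 2 places.
tf = {w: round(c / len(tokens), 2) for w, c in counts.items()}
print(tf['I'], tf['love'], tf['movies'])  # 0.15 0.09 0.03
```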
TF - IDF
• TF: Term Frequency, which measures how frequently a
term occurs in a document
TF(t) = (Number of times term t appears in a document) /
(Total number of terms in the document).
• IDF: Inverse Document Frequency, which measures how
important a term is:
IDF(t) = log_e(Total number of documents / Number of
documents with term t in it).
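The two formulas can be written directly; a sketch treating each of the eight example sentences as one document (the `tf`/`idf` function names are mine, not from the slides):

```python
import math

docs = [s.split() for s in [
    "I love movies", "I love icecream", "I don't like anything",
    "I am not going to tell you anything", "What are you guys doing",
    "Where are you all going with it", "I love her", "doggie",
]]

def tf(term, doc):
    # (Number of times term t appears in a document) / (Total terms in the document)
    return doc.count(term) / len(doc)

def idf(term):
    # log_e(Total number of documents / Number of documents with term t in it)
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

print(round(idf('doggie'), 3))  # ln(8/1) = 2.079: a rare term scores high
print(round(idf('love'), 2))    # ln(8/3) = 0.98: appears in 3 of 8 documents
```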
Tf-idf for our dataset
• 8×22 matrix (8 records × 22 unique words; 34 tokens in total)
Columns (22 unique words): u'all', u'am', u'anything', u'are', u'doggie', u'doing', u'don', u'going', u'guys', u'her', u'icecream', u'it', u'like', u'love', u'movies', u'not', u'tell', u'to', u'what', u'where', u'with', u'you'
I love movies 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.59 0.81 0.00 0.00 0.00 0.00 0.00 0.00 0.00
I love icecream 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.81 0.00 0.00 0.59 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
I don’t like anything 0.00 0.00 0.51 0.00 0.00 0.00 0.61 0.00 0.00 0.00 0.00 0.00 0.61 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
I am not going to tell you anything 0.00 0.41 0.34 0.00 0.00 0.00 0.00 0.34 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.41 0.41 0.41 0.00 0.00 0.00 0.30
What are you guys doing 0.00 0.00 0.00 0.41 0.00 0.49 0.00 0.00 0.49 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.49 0.00 0.00 0.35
Where are you all going with it 0.41 0.00 0.00 0.34 0.00 0.00 0.00 0.34 0.00 0.00 0.00 0.41 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.41 0.41 0.30
Unigrams, Bi-grams and Tri-grams
• I love movies
-- bi-grams: I love, love movies
In our dataset,
[u'all', u'all going', u'all going with', u'am', u'am not', u'am not going', u'anything', u'are', u'are you',
u'are you all', u'are you guys', u'doggie', u'doing', u'don', u'don like', u'don like anything', u'going',
u'going to', u'going to tell', u'going with', u'going with it', u'guys', u'guys doing', u'her', u'icecream',
u'it', u'like', u'like anything', u'love', u'love her', u'love icecream', u'love movies', u'movies', u'not',
u'not going', u'not going to', u'tell', u'tell you', u'tell you anything', u'to', u'to tell', u'to tell you',
u'what', u'what are', u'what are you', u'where', u'where are', u'where are you', u'with', u'with it',
u'you', u'you all', u'you all going', u'you anything', u'you guys', u'you guys doing']
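N-grams are just sliding windows over the token sequence; a minimal sketch (the `ngrams` helper is mine, for illustration):

```python
def ngrams(tokens, n):
    # All contiguous runs of n tokens, joined with spaces.
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I love movies".split()
print(ngrams(tokens, 2))  # ['I love', 'love movies']
print(ngrams(tokens, 3))  # ['I love movies']
```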
Python code to generate tf-idf matrix
Input dataset (List of strings)-
[u'I love movies', u'I love icecream ', u"I don't like anything", u'I am not going to tell you anything', u'What
are you guys doing', u'Where are you all going with it', u'I love her', u'doggie ']
Code:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(min_df=0.0, analyzer=u'word', ngram_range=(1, 4), stop_words=None)
tfidf_matrix = tfidf_vectorizer.fit_transform(tt1)
tf1 = tfidf_matrix.todense()
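For reference, running a unigram-only vectorizer over these eight strings reproduces the 8×22 matrix shown earlier; under scikit-learn's defaults (smoothed IDF, L2 normalisation, and a tokenizer that drops one-character tokens such as 'I'), the "I love movies" row comes out with the 0.59/0.81 weights from the table:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [u'I love movies', u'I love icecream ', u"I don't like anything",
        u'I am not going to tell you anything', u'What are you guys doing',
        u'Where are you all going with it', u'I love her', u'doggie ']

vec = TfidfVectorizer(analyzer='word', ngram_range=(1, 1))
X = vec.fit_transform(docs)
print(X.shape)  # (8, 22)
love, movies = vec.vocabulary_['love'], vec.vocabulary_['movies']
print(round(X[0, love], 2), round(X[0, movies], 2))  # 0.59 0.81
```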
Text Classification
Classifying text - Methods
• Supervised classification:
– Requires labelled data
– Classification algorithms – SVM, LR, Ensemble,
RF, etc.
– Can measure accuracy precisely
– Needed for highly actionable applications
Classifying text - Methods
• Unsupervised
- No labels required
- Accuracy is a ‘loose’ measure
- Measuring homogeneity of clusters
- Useful for quick insights or where grouping is
required
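One common way to quantify cluster homogeneity when reference labels are available for evaluation is scikit-learn's homogeneity score: 1.0 means every cluster contains members of a single class, and the cluster IDs themselves do not matter. A sketch on toy labels:

```python
from sklearn.metrics import homogeneity_score

true_labels = ['baggage', 'baggage', 'check in', 'check in']
# Cluster IDs are arbitrary; only the grouping matters.
print(homogeneity_score(true_labels, [1, 1, 0, 0]))  # 1.0: perfectly homogeneous
print(homogeneity_score(true_labels, [0, 1, 0, 1]))  # ~0.0: each cluster is fully mixed
```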
Classifying text - Methods
• Semi-supervised learning is a class of
supervised learning tasks and techniques that
also make use of unlabeled data for training -
typically a small amount of labeled data with a
large amount of unlabeled data.
Supervised Learning – Case Study
Let's look at some text
line class
20 get me to check in check in
21 check in internet check in
22 what is free baggage allowance baggage
23 how much baggage baggage
24 I have 35 kg should I pay baggage
25 how much can I carry baggage
26 lots of bags I have baggage
27 till how much baggage is free baggage
28 how many bags are free baggage
29 upto what weight I can carry baggage
30 how much can I carry baggage
31 baggage carry baggage
32 baggage to carry baggage
33 number of bags baggage
34 carrying bags baggage
35 travelling with bags baggage
36 money for luggage baggage
37 how much luggage I can carry baggage
38 too much luggage baggage
Class Distribution
[Bar chart: class frequencies, ranging from 0% to roughly 30%, for the classes login, other, baggage, check in, greetings, thanks, cancel]
Preprocess the data
• Map similar words into a single word group
(for example, different place names can be
replaced with one group name)
• Use regex to normalize dates, dollar values,
etc.
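A sketch of the regex normalisation step (the `_DOLLAR_` and `_DATE_` group names are illustrative choices, not from the slides):

```python
import re

def normalize(text):
    # Collapse dollar amounts and dd/mm/yyyy dates into single group tokens.
    text = re.sub(r'\$\s?\d+(?:\.\d+)?', '_DOLLAR_', text)
    text = re.sub(r'\b\d{1,2}/\d{1,2}/\d{2,4}\b', '_DATE_', text)
    return text

print(normalize("refund $45 by 12/08/2018"))  # refund _DOLLAR_ by _DATE_
```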
Stop Words
How do you generate stop words from a corpus?
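One simple corpus-driven answer: treat the highest document-frequency words as stop-word candidates. A sketch on the eight example sentences (the 0.35 threshold is an illustrative choice; on a corpus this tiny, content words like 'love' leak in, which is why a larger corpus and a tuned threshold matter in practice):

```python
from collections import Counter

docs = ["I love movies", "I love icecream", "I don't like anything",
        "I am not going to tell you anything", "What are you guys doing",
        "Where are you all going with it", "I love her", "doggie"]

df = Counter()
for d in docs:
    df.update(set(d.lower().split()))  # count each word once per document

# Words appearing in at least 35% of documents become stop-word candidates.
stopwords = {w for w, c in df.items() if c / len(docs) >= 0.35}
print(sorted(stopwords))  # ['i', 'love', 'you']
```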
Stemming
• Stemming is the process of reducing a word
to its stem, i.e. its root form. The root form
is not necessarily a word by itself, but it can be
used to generate words by concatenating the
right suffix.
Stemmed words
fish, fishes and fishing --- fish
study, studies and studying --- studi
Difference between stemming and lemmatization:
stemming – may produce meaningless tokens
lemmatization – produces meaningful words
Stemming and Lemmatizing
Code
from nltk.stem import PorterStemmer
#from nltk.tokenize import sent_tokenize, word_tokenize
ps = PorterStemmer()
ps.stem("having")
from nltk.stem.lancaster import LancasterStemmer
lancaster_stemmer = LancasterStemmer()
lancaster_stemmer.stem("maximum")
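To contrast stemming with lemmatization, a toy lookup-based lemmatizer; the `LEMMAS` table here is hypothetical and for illustration only (real lemmatizers, such as NLTK's WordNetLemmatizer, consult a full lexicon):

```python
# Hypothetical hand-built lemma table, illustration only.
LEMMAS = {"studies": "study", "studying": "study",
          "fishes": "fish", "fishing": "fish"}

def lemmatize(word):
    # Fall back to the word itself when no lemma is known.
    return LEMMAS.get(word.lower(), word)

print(lemmatize("studies"))  # study: a real word, unlike the stem 'studi'
```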
Spell checker
• https://github.com/mattalcock/blog/blob/master/2012/12/5/python-spell-checker.rst
• https://pypi.python.org/pypi/autocorrect/0.1.0
Sampling – Train and Validation
from sklearn.model_selection import StratifiedShuffleSplit  # sklearn.cross_validation was removed; use model_selection
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in sss.split(tf1, tgt3):
    # print("TRAIN:", train_index, "TEST:", test_index)
    a_train_b, a_test_b = tf1[train_index], tf1[test_index]
    b_train_b, b_test_b = tgt3[train_index], tgt3[test_index]
Generate features or word tokens and vectorize
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(min_df=0.0, analyzer=u'word', ngram_range=(1, 4), stop_words=None)
tfidf_matrix = tfidf_vectorizer.fit_transform(tt1)
tf1 = tfidf_matrix.todense()
Feature Selection
from sklearn.feature_selection import SelectPercentile, f_classif
selector = SelectPercentile(f_classif, percentile=100)
# fit_transform both fits the selector and transforms the training set
a_train_b = selector.fit_transform(a_train_b, b_train_b)
a_test_b = selector.transform(a_test_b)
Build Model
• Logistic Regression
• GBM
• SVM
• RF
• Neural Nets
• NB
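Putting the pieces together, a minimal end-to-end sketch: a tf-idf + logistic regression pipeline fitted on a few rows from the case-study data (the sample is tiny, so this is purely illustrative of the workflow, not of real accuracy):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A handful of labelled utterances from the case-study table.
texts = ["get me to check in", "check in internet",
         "how much baggage", "money for luggage",
         "how much luggage I can carry", "number of bags"]
labels = ["check in", "check in", "baggage", "baggage", "baggage", "baggage"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["too much luggage"]))  # ['baggage']
```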
Getting started with text mining By Mathangi Sri Head of Data Science at PhonePe at CYPHER 2018