Text Classification
Russ M. Delos Santos
Overview
Research:
Text classification for RAX Studio
Suggested Use Case:
- Account management through email.
Natural Language Processing
1. Automatic or semi-automatic processing of human language
2. Can be used for various applications such as:
a. Sentiment Analysis
b. Intent Classification
c. Topic Labeling
General Process
Data - Pre-process to the desired text format
Features - Transform the text to vectors (numbers)
Model - Feed the data to the model
Prediction - Set prediction criteria once the model converges
Output - Output the class
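For illustration, a minimal sketch of this whole flow, assuming scikit-learn and a toy labeled dataset (texts and labels below are made-up examples):

# Data -> Features -> Model -> Prediction, sketched with scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Data: pre-processed example texts with labels
texts = ["please reset my password", "invoice attached for july", "cancel my account"]
labels = ["account", "billing", "account"]

# Features: transform the text to vectors (numbers)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Model: feed the data to the model
model = MultinomialNB()
model.fit(X, labels)

# Prediction / Output: output the class for new text
print(model.predict(vectorizer.transform(["how do i reset my account"])))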
Dataset / Text Corpus
- Dictionary or vocabulary which is used to train the model
● Either tagged (for supervised learning) or untagged (for unsupervised).
● Size depends on the algorithm used. Should be pre-processed to remove unwanted characters, to convert to the desired format, etc.
Dataset / Text Corpus
- Open-source dataset samples
● Amazon Reviews
● NYSK Dataset (News Articles)
● Enron Email Dataset
● Ling Spam Dataset
Feature Extraction
- Transforms text into numbers (vector space model)
- Choices:
● One-hot encoding
● Bag-of-words + TF*IDF
● Word2vec
One-hot encoding
- Creates a binary encoding of words: a 1 is placed at the index of the word in the vocabulary
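A minimal one-hot sketch in plain Python/NumPy, using an assumed toy vocabulary:

# One-hot encoding: a vector of vocabulary length with a 1 at the word's index
import numpy as np

vocabulary = ["eat", "to", "live"]                     # toy vocabulary
index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    vec = np.zeros(len(index), dtype=int)
    vec[index[word]] = 1
    return vec

print(one_hot("to"))   # -> [0 1 0]

Note the downside mentioned in the notes: for a real vocabulary this produces a large matrix in which most values are zeroes.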
Bag-of-words
- Uses the count of each word in the document as its feature
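A quick sketch with Python's collections.Counter shows how word order is discarded:

# Bag-of-words: features are word counts; sequence information is lost
from collections import Counter

doc = "eat to live not live to eat".split()
bow = Counter(doc)
print(bow["to"])    # -> 2
print(bow["live"])  # -> 2  ('live to eat' and 'eat to live' count identically)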
TF*IDF
- Term Frequency * Inverse Document Frequency
● Frequently occurring words are typically not important / have less weight (stopwords such as "is, are, the, etc.")
● Weights are assigned per word.
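One common formulation (variants exist) weights term t in document d as:

tfidf(t, d) = tf(t, d) * log(N / df(t))

where tf(t, d) is the count of t in d, N is the total number of documents, and df(t) is the number of documents containing t. Words that appear in almost every document (stopwords) get a near-zero IDF and thus a low weight.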
BOW + TF*IDF
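A sketch of the combined BOW + TF*IDF features, assuming scikit-learn (CountVectorizer builds the bag-of-words counts; TfidfVectorizer folds counting and re-weighting into one step):

# BOW counts re-weighted by TF*IDF in one step
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["live to eat", "eat to live", "the cat sat"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)           # sparse document-term matrix

print(vectorizer.get_feature_names_out())    # learned vocabulary (scikit-learn >= 1.0)
print(X.toarray().round(2))                  # TF*IDF weights per document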
word2vec
Uses the weights of the hidden layer of a neural network as the features of the words
● Can predict a context or a word based on the nearby words in the corpus
● Uses a continuous bag-of-words or skip-gram model with a 1-1-1 (one input, one hidden, one output layer) neural network.
word2vec
- Captures better semantic/syntactic relationships between words through vectors
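A minimal training sketch, assuming the gensim library (4.x API); the corpus and parameters below are toy values:

# word2vec with gensim: learn embeddings from tokenized sentences
from gensim.models import Word2Vec

sentences = [["eat", "to", "live"], ["live", "to", "eat"],
             ["the", "queen", "ruled", "the", "kingdom"]]

# sg=1 selects skip-gram; sg=0 would use continuous bag-of-words (CBOW)
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["queen"][:5])           # first 5 dimensions of the embedding
print(model.wv.most_similar("eat"))    # nearest words in the vector space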
Machine Learning Model
- A classifier algorithm that maps an input to the desired class
● Naive Bayes
● K-nearest neighbors
● Multilayer Perceptron
● Recurrent Neural Network with Long Short-Term Memory (LSTM)
Naive Bayes
- Probabilistic model that relies on word count
● Uses bag of words as features
● Assumes that the position of words doesn't matter and
words are independent of each other
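A sketch of Naive Bayes over bag-of-words counts, assuming scikit-learn and a toy spam/ham dataset:

# Naive Bayes on word counts: position is ignored, words assumed independent
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win money now", "meeting at noon", "win a free prize", "lunch at noon"]
labels = ["spam", "ham", "spam", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)       # bag-of-words features

clf = MultinomialNB()
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["free money"])))   # -> ['spam']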
K-Nearest Neighbors
- Assigns the class of the nearest known examples, measured by distance in the feature space
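A short sketch, again assuming scikit-learn, with TF*IDF vectors as the feature space:

# k-NN: the class is decided by the k closest labeled examples
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

texts = ["reset my password", "update my password", "invoice for july", "billing question"]
labels = ["account", "account", "billing", "billing"]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, labels)
print(knn.predict(vec.transform(["password question"])))   # -> ['account']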
Multilayer Perceptron
- A feed-forward neural network
● Has one or more hidden layers
● Sigmoid function - binary classification
● Softmax function - multiclass classification
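A sketch with scikit-learn's MLPClassifier, assuming two hidden layers as in the implementation plan below; scikit-learn uses a logistic (sigmoid) output for binary tasks and softmax for multiclass:

# Feed-forward MLP over TF*IDF features
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

texts = ["reset my password", "close my account", "invoice attached", "refund request"]
labels = ["account", "account", "billing", "billing"]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)

mlp = MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=2000, random_state=0)
mlp.fit(X.toarray(), labels)
print(mlp.predict(vec.transform(["refund my invoice"]).toarray()))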
Assessment
Option 1
● Features: BOW + TF*IDF
● ML Algorithm: Naive Bayes
● Pros: Easier to implement
● Cons: Relies on word counts instead of word sequence
● Ex. 'Live to eat' and 'Eat to live' look the same to the model
Option 2
● Features: word2vec word
embeddings
● ML Algorithm: Multilayer
Perceptron
● Pros: Produces better results,
semantically and syntactically
● Cons: Needs a big labeled
dataset to perform well
Main Blocks
ML.NET Learning Curve
- Still studying the framework.
- Not as well documented as Python frameworks/libraries.
- Ex. Has a method called TextCatalog.FeaturizeText(), but there's no indication of the kind of feature extraction it performs.
Supervised Learning Needs Big Data
- We can use open-source datasets as benchmarks.
- But we need datasets with specific labels for the algorithm to work.
Main Blocks
Model Update Criteria
- Retraining the model for every unknown word is impractical.
- Suggestion (see the sketch below):
- Set a minimum number of occurrences of new words before the model is retrained
- Ignore rare new words, since they may not affect the overall intent, sentiment, or meaning of the text.
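A sketch of this criterion; the names and the threshold value are illustrative assumptions:

# Count out-of-vocabulary words; retrain only once one becomes frequent
from collections import Counter

vocabulary = {"password", "account", "invoice"}   # assumed known vocabulary
oov_counts = Counter()
RETRAIN_THRESHOLD = 50    # assumed minimum occurrences before retraining

def should_retrain(tokens):
    for tok in tokens:
        if tok not in vocabulary:
            oov_counts[tok] += 1
    # rare new words are ignored until they cross the threshold
    return any(c >= RETRAIN_THRESHOLD for c in oov_counts.values())

print(should_retrain("please update my subscription".split()))   # False at first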
Implementation Plan
- Email Cleaner (see the sketch below)
- Clean special characters, HTML tags, header and footer of the email, etc.
- Set a standard file format (tsv, csv, txt, etc., or transform to bin)
- Use a spam dataset in the meantime as a benchmark (binary classification)
- Sentence Tokenizer + Feature Extraction
- Split emails into sentences + word2vec
- Create Neural Network
- 1 input, 2 hidden, 1 output layer
- Activation function - sigmoid
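A minimal email-cleaner sketch using only the Python standard library; the regex rules are assumptions, and real header/footer removal would need format-specific logic:

# Strip HTML tags and special characters, normalize whitespace and case
import re

def clean_email(raw):
    text = re.sub(r"<[^>]+>", " ", raw)           # drop HTML tags
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)   # drop special characters
    return re.sub(r"\s+", " ", text).strip().lower()

print(clean_email("<p>Hello,<br>Please reset my password!</p>"))
# -> 'hello please reset my password'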
References
[1] D. Jurafsky and J. Martin, Speech and Language Processing. Upper Saddle River, N.J.: Pearson Prentice Hall, 2009.
[2] https://developers.google.com/machine-learning/
[3] Various Stack Overflow / Stack Exchange / Kaggle threads
[4] Various Medium posts
Editor's Notes
  • #5: Key processes needed for natural language processing
  • #7: Can be used for benchmark testing depending on the needed classes
  • #9: Pros: Easy to implement. Cons: Outputs a large matrix in which most values are zeroes
  • #10: Pros: Easy, since it's basically just counting. Cons: The sequence of words doesn't matter, which is a largely erroneous assumption
  • #11: Provides the needed weight per word
  • #15: Sequential word embeddings - the order of words matters, thus providing semantics and syntax
  • #16: Vector space model
  • #17: Sample implementation result of word2vec in Python. The corpus text is initialized as corpus_raw. As you can see, 'daughter' and 'infant' are roughly equidistant from 'prisoners'. 'Kingdom' is very closely related to 'madam', i.e. the queen, as told in the story
  • #20: Classified into the class with the highest probability P(class | keyword)
  • #21: A lazy classifier, since only the distance determines the class
  • #22: Deep learning - uses a neural network