UNIT – III: CLASSIFICATION
Topic 3: NAÏVE TEXT CLASSIFICATION
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES
UNIT III : TEXT CLASSIFICATION AND CLUSTERING
1. A Characterization of Text Classification
2. Unsupervised Algorithms: Clustering
3. Naïve Text Classification
4. Supervised Algorithms
5. Decision Tree
6. k-NN Classifier
7. SVM Classifier
8. Feature Selection or Dimensionality Reduction
9. Evaluation metrics
10. Accuracy and Error
11. Organizing the classes
12. Indexing and Searching
13. Inverted Indexes
14. Sequential Searching
15. Multi-dimensional Indexing
NAÏVE TEXT CLASSIFICATION
INTRODUCTION TO NAÏVE TEXT CLASSIFICATION
• Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem.
• Naive Bayes is not a single algorithm but a family of algorithms that share a common principle: every feature is assumed to be conditionally independent of every other feature, given the class.
• Naive Bayes classifiers have been heavily used for text classification and text analysis problems in machine learning.
The Naive Bayes algorithm
• The dataset is divided into two parts: the feature matrix and the response (target) vector.
The Naive Bayes algorithm
• The feature matrix (X) contains all the vectors (rows) of the dataset, where each vector holds the values of the features (the predictor attributes). With d features, each row has the form x = (x1, x2, ..., xd).
• The response/target vector (y) contains the value of the class/group variable for each row of the feature matrix.
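As a rough illustration of this split, the sketch below stores a feature matrix X as a list of rows and the target vector y as a parallel list. The attribute values and labels are hypothetical and are not taken from any dataset in these slides.

```python
# Hypothetical toy data: each row of X is one sample with d = 3 features.
X = [
    ("sunny", "hot", "high"),
    ("rainy", "mild", "high"),
    ("overcast", "cool", "normal"),
]
# y holds one class label per row of X (labels are illustrative only).
y = ["no", "yes", "yes"]

n_samples = len(X)
d = len(X[0])

# Every sample must have the same number of features, and y must align with X.
assert all(len(row) == d for row in X)
assert len(y) == n_samples

print(n_samples, d)
```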
The Bayes’ Theorem
Bayes’ Theorem gives the probability of an event occurring, given that another event has already occurred.
Bayes’ theorem is stated mathematically as follows:
P(A | B) = P(B | A) P(A) / P(B)
The Bayes’ Theorem
• where:
• A and B are called events.
• P(A | B) is the posterior probability of event A, given that event B is true (has occurred).
• Event B is also termed the evidence; P(B) is the probability of the evidence and must be non-zero.
• P(A) is the prior of A (the prior probability, i.e. the probability of the event before the evidence is seen).
• P(B | A) is the likelihood, i.e. the probability of observing the evidence B given that event A is true.
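The definitions above can be tried out numerically. The following is a minimal sketch, not part of the original slides: the function name `posterior` and all probability values are hypothetical, and P(B) is obtained from the law of total probability over A and not-A.

```python
def posterior(prior_a, likelihood_b_given_a, evidence_b):
    """Return P(A | B) = P(B | A) * P(A) / P(B)."""
    if evidence_b <= 0:
        raise ValueError("P(B) must be positive")
    return likelihood_b_given_a * prior_a / evidence_b


# Hypothetical numbers chosen only for illustration.
p_a = 0.2              # prior P(A)
p_b_given_a = 0.9      # likelihood P(B | A)
p_b_given_not_a = 0.1  # P(B | not A)

# Evidence P(B) via the law of total probability.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

p_a_given_b = posterior(p_a, p_b_given_a, p_b)
print(round(p_a_given_b, 4))
```

With these made-up inputs, seeing the evidence B raises the probability of A from the prior 0.2 to roughly 0.69.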
The Bayes’ Theorem
• Summary
Dealing with text data
• Text analysis is a major application field for machine learning algorithms.
• However, the raw data, a sequence of symbols (i.e. strings), cannot be fed directly to the algorithms, as most of them expect numerical feature vectors of a fixed size rather than raw text documents of variable length.
Dealing with text data
• To address this, scikit-learn provides utilities for the most common ways to extract numerical features from text content, namely:
• tokenizing strings and giving an integer id to each possible token, for instance by using white-space and punctuation as token separators.
• counting the occurrences of tokens in each document.
• In this scheme, features and samples are defined as follows:
• each individual token occurrence frequency is treated as a feature.
• the vector of all the token frequencies for a given document is considered a multivariate sample.
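The tokenize-then-count scheme above can be sketched with the Python standard library alone. This is a hand-rolled illustration, not scikit-learn's implementation (scikit-learn's own utility for this job is `CountVectorizer`); the helper names `tokenize`, `build_vocabulary` and `count_vector` and the two sample documents are assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase, then split on anything that is not a letter or digit,
    # so white-space and punctuation both act as token separators.
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


def build_vocabulary(docs):
    # Assign an integer id to each distinct token, in sorted order.
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def count_vector(doc, vocab):
    # Fixed-size vector: one occurrence count per vocabulary entry.
    counts = Counter(tokenize(doc))
    return [counts.get(tok, 0) for tok in sorted(vocab, key=vocab.get)]


docs = ["The cat sat.", "The cat saw the dog!"]
vocab = build_vocabulary(docs)
vectors = [count_vector(doc, vocab) for doc in docs]

print(vocab)    # {'cat': 0, 'dog': 1, 'sat': 2, 'saw': 3, 'the': 4}
print(vectors)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each document, whatever its length, becomes a vector with one entry per vocabulary token, which is the fixed-size numerical input a classifier expects.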
Example 1 : Using the Naive Bayesian Classifier
• We will consider the following training set.
• The data samples are described by the attributes age, income, student, and credit.
• The class label attribute, buy, indicates whether the person buys a computer and has two distinct values: yes (class C1) and no (class C2).
Example 1 : Using the Naive Bayesian Classifier
RID  Age          Income  Student  Credit     Ci: buy
1    youth        high    no       fair       C2: no
2    youth        high    no       excellent  C2: no
3    middle-aged  high    no       fair       C1: yes
4    senior       medium  no       fair       C1: yes
5    senior       low     yes      fair       C1: yes
6    senior       low     yes      excellent  C2: no
7    middle-aged  low     yes      excellent  C1: yes
8    youth        medium  no       fair       C2: no
9    youth        low     yes      fair       C1: yes
10   senior       medium  yes      fair       C1: yes
11   youth        medium  yes      excellent  C1: yes
12   middle-aged  medium  no       excellent  C1: yes
13   middle-aged  high    yes      fair       C1: yes
14   senior       medium  no       excellent  C2: no
Example 1 : Using the Naive Bayesian Classifier
• The sample we wish to classify is
• X = (age = youth, income = medium, student = yes, credit = fair)
• We need to maximize P(X|Ci)P(Ci), for i = 1, 2. P(Ci), the a priori probability of each class, can be estimated from the training samples:
• P(buy = yes) = 9/14
• P(buy = no) = 5/14
Example 1 : Using the Naive Bayesian Classifier
• To compute P(X|Ci), for i = 1, 2, we compute the following conditional probabilities:
• P(age = youth | buy = yes) = 2/9
• P(income = medium | buy = yes) = 4/9
• P(student = yes | buy = yes) = 6/9
• P(credit = fair | buy = yes) = 6/9
• P(age = youth | buy = no) = 3/5
• P(income = medium | buy = no) = 2/5
• P(student = yes | buy = no) = 1/5
• P(credit = fair | buy = no) = 2/5
Example 1 : Using the Naive Bayesian Classifier
• Using the above probabilities, we obtain
• P(X | buy = yes) = 2/9 × 4/9 × 6/9 × 6/9 ≈ 0.044
• Similarly,
• P(X | buy = no) = 3/5 × 2/5 × 1/5 × 2/5 ≈ 0.019
• To find the class that maximizes P(X|Ci)P(Ci), we compute
• P(X | buy = yes) P(buy = yes) = 0.044 × 9/14 ≈ 0.028
• P(X | buy = no) P(buy = no) = 0.019 × 5/14 ≈ 0.007
• Thus the naive Bayesian classifier predicts buy = yes for sample X.
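The arithmetic of Example 1 can be checked with a short script. This is a sketch written for these notes, not code from the original deck: the training rows are copied from the table in Example 1, the function name `score` is an assumption, and probabilities are plain relative frequencies with no smoothing.

```python
from fractions import Fraction

# Training set from Example 1: (age, income, student, credit, buy)
rows = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle-aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle-aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle-aged", "medium", "no", "excellent", "yes"),
    ("middle-aged", "high", "yes", "fair", "yes"),
    ("senior", "medium", "no", "excellent", "no"),
]


def score(x, label):
    """Return P(X | C) * P(C) from relative frequencies (no smoothing)."""
    in_class = [r for r in rows if r[-1] == label]
    result = Fraction(len(in_class), len(rows))  # prior P(C)
    for j, value in enumerate(x):
        matches = sum(1 for r in in_class if r[j] == value)
        result *= Fraction(matches, len(in_class))  # P(x_j | C)
    return result


x = ("youth", "medium", "yes", "fair")
scores = {label: score(x, label) for label in ("yes", "no")}
prediction = max(scores, key=scores.get)

print({k: round(float(v), 3) for k, v in scores.items()}, prediction)
# {'yes': 0.028, 'no': 0.007} yes
```

Exact fractions are used so that the only rounding happens when printing, which reproduces the 0.028 and 0.007 obtained by hand.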
Example 2: Predicting a class label using naïve Bayesian classification
• The training data set is given below.
• The data tuples are described by the attributes Owns Home, Married, Gender and Employed.
• The class label attribute, Risk Class, has three distinct values.
• Let C1 correspond to class A, C2 to class B, and C3 to class C.
Example 2: Predicting a class label using naïve Bayesian classification
• The tuple to classify is
• X = (Owns Home = Yes, Married = No, Gender = Female, Employed = Yes)
Owns Home  Married  Gender  Employed  Risk Class
Yes        Yes      Male    Yes       B
No         No       Female  Yes       A
Yes        Yes      Female  Yes       C
Yes        No       Male    No        B
No         Yes      Female  Yes       C
No         No       Female  Yes       A
No         No       Male    No        B
Yes        No       Female  Yes       A
No         Yes      Female  Yes       C
Yes        Yes      Female  Yes       C
Example 2: Predicting a class label using naïve Bayesian classification
• Solution
• There are 10 samples and three classes.
• Risk class A = 3, Risk class B = 3, Risk class C = 4
• The prior probabilities are obtained by dividing these frequencies by the total number of samples in the training data:
• P(A) = 3/10 = 0.3, P(B) = 3/10 = 0.3, P(C) = 4/10 = 0.4
Example 2: Predicting a class label using naïve Bayesian classification
• To compute P(X|Ci) = P({Yes, No, Female, Yes} | Ci) for each of the classes, we need the following conditional probabilities:
• P(Owns Home = Yes | A) = 1/3 = 0.33
• P(Married = No | A) = 3/3 = 1
• P(Gender = Female | A) = 3/3 = 1
• P(Employed = Yes | A) = 3/3 = 1

• P(Owns Home = Yes | B) = 2/3 = 0.67
• P(Married = No | B) = 2/3 = 0.67
• P(Gender = Female | B) = 0/3 = 0
• P(Employed = Yes | B) = 1/3 = 0.33

• P(Owns Home = Yes | C) = 2/4 = 0.5
• P(Married = No | C) = 0/4 = 0
• P(Gender = Female | C) = 4/4 = 1
• P(Employed = Yes | C) = 4/4 = 1
Example 2: Predicting a class label using naïve Bayesian classification
• Using the above probabilities, we obtain
• P(X|A) = P(Owns Home = Yes | A) × P(Married = No | A) × P(Gender = Female | A) × P(Employed = Yes | A) = 0.33 × 1 × 1 × 1 = 0.33
• Similarly, P(X|B) = 0 and P(X|C) = 0, because each product contains a zero factor (P(Gender = Female | B) = 0 and P(Married = No | C) = 0).
• To find the class Ci that maximizes P(X|Ci)P(Ci), we compute
• P(X|A) P(A) = 0.33 × 0.3 = 0.099
• P(X|B) P(B) = 0 × 0.3 = 0
• P(X|C) P(C) = 0 × 0.4 = 0
• Therefore X is assigned to class A.
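Example 2 can also be reproduced by first building count tables and then scoring the query tuple, which separates a training step from a classification step. The sketch below is illustrative only: the function names `train` and `classify` are assumptions, the rows are copied from the Example 2 table, and no smoothing is applied, so a single zero count drives a class score to zero just as in the hand calculation.

```python
from collections import Counter, defaultdict

# Training set from Example 2: (owns_home, married, gender, employed, risk_class)
data = [
    ("Yes", "Yes", "Male", "Yes", "B"),
    ("No", "No", "Female", "Yes", "A"),
    ("Yes", "Yes", "Female", "Yes", "C"),
    ("Yes", "No", "Male", "No", "B"),
    ("No", "Yes", "Female", "Yes", "C"),
    ("No", "No", "Female", "Yes", "A"),
    ("No", "No", "Male", "No", "B"),
    ("Yes", "No", "Female", "Yes", "A"),
    ("No", "Yes", "Female", "Yes", "C"),
    ("Yes", "Yes", "Female", "Yes", "C"),
]


def train(rows):
    # class_counts[label] = number of training rows with that label.
    class_counts = Counter(r[-1] for r in rows)
    # value_counts[label][j][value] = rows of `label` whose attribute j equals value.
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for *features, label in rows:
        for j, value in enumerate(features):
            value_counts[label][j][value] += 1
    return class_counts, value_counts


def classify(x, class_counts, value_counts):
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        s = n / total  # prior P(Ci)
        for j, value in enumerate(x):
            s *= value_counts[label][j][value] / n  # P(x_j | Ci)
        scores[label] = s
    return max(scores, key=scores.get), scores


class_counts, value_counts = train(data)
x = ("Yes", "No", "Female", "Yes")
prediction, scores = classify(x, class_counts, value_counts)
print(prediction, scores)
```

The score for class A comes out as about 0.1 rather than 0.099 because the script uses 1/3 directly, whereas the hand calculation rounds it to 0.33 first; the predicted class is the same.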
Advantages and Disadvantages
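• Advantages:
a) In theory, Bayesian classifiers have the minimum error rate in comparison to all other classifiers; in practice this holds only to the extent that the underlying assumptions are accurate.
b) Easy to implement.
c) Good results are obtained in most cases.
d) They provide a theoretical justification for other classifiers that do not explicitly use Bayes’ theorem.
• Disadvantages:
a) Lack of available probability data.
b) Inaccuracies in the assumptions made for their use, such as class-conditional independence.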
Any Questions?
