Building a classifier

Assume you want to build a spam filter that learns from you telling the program which emails are spam and which are not. How do you write a classifier that you can train? Think about what information the classifier must keep when you train it, and how it classifies a new document. Which words or features of a document are the ones that make it spam? How does the classifier use the information it has learned so far to classify a new document?

Training: telling the filter what a spam document is

The classifier learns by being trained. We train it by providing examples of documents along with their correct classifications. The more examples of documents and correct classifications the classifier sees, the better it will be at making predictions.
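
For concreteness, a training session might look like the sketch below. This is a hypothetical interface built on the `train` helper sketched under a later slide; the example texts and category names are illustrative, not from the original listing.

```python
# Hypothetical training session; train() is sketched under a later slide,
# and the example texts and category names are illustrative.
wc, cc = {}, {}   # word counts and category counts (see later slides)
train(wc, cc, 'Nobody owns the water.', 'good')
train(wc, cc, 'the quick rabbit jumps fences', 'good')
train(wc, cc, 'buy pharmaceuticals now', 'bad')
train(wc, cc, 'make quick money at the online casino', 'bad')
```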

What information does the classifier keep when you train it?

That depends on the type of classifier you build; here we focus on building a naïve Bayesian classifier. Using a Bayesian classifier to classify a document basically means calculating the probability Pr(Category | Document): given a specific document, what is the probability that it fits into the "spam" category? So what information does the classifier need to learn during training so that it can later calculate Pr(Category | Document)? If we want to classify a new document as "spam" or "not spam", we can simply compare the two probabilities:

If Pr("good" | Document) > Pr("bad" | Document), the document is classified as non-spam.
If Pr("good" | Document) < Pr("bad" | Document), the document is classified as spam.

How do we compare Pr("bad" | Document) vs. Pr("good" | Document)?

Use Bayes' theorem: Pr(A | B) = Pr(B | A) × Pr(A) / Pr(B), so

Pr(Category | Document) = Pr(Document | Category) × Pr(Category) / Pr(Document)

Calculate p1 = Pr(doc | "bad") × Pr("bad") / Pr(doc)
Calculate p2 = Pr(doc | "good") × Pr("good") / Pr(doc)
Compare p1 with p2.

Pr(doc) is the same for every category, so we can cancel it. The problem now reduces to calculating

p1 = Pr(doc | "bad") × Pr("bad")
p2 = Pr(doc | "good") × Pr("good")
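
To make the cancellation concrete, here is a toy comparison with made-up numbers; only which score is larger matters, so Pr(doc) never has to be computed.

```python
# Toy numbers (made up) to show the comparison once Pr(doc) cancels:
p1 = 0.02 * 0.5   # Pr(doc | 'bad')  x Pr('bad')
p2 = 0.01 * 0.5   # Pr(doc | 'good') x Pr('good')
print('spam' if p1 > p2 else 'not spam')   # -> spam
```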

What information is needed to calculate Pr("category")?

Pr(Category) is the probability that a randomly selected document belongs to this category: the number of documents in the category divided by the total number of documents. To calculate Pr(Category), the classifier needs a dictionary data structure that keeps the count of documents in each category.
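
A minimal sketch of that bookkeeping, assuming the per-category document counts live in the dictionary `cc` (the name used on a later slide); the function name and example counts are illustrative.

```python
# cc maps each category to the number of training documents seen for it.
def catprob(cc, cat):
    # Pr(Category): documents in the category / total number of documents.
    total = sum(cc.values())
    return cc.get(cat, 0) / total if total else 0.0

cc = {'good': 2, 'bad': 2}    # illustrative counts
print(catprob(cc, 'bad'))     # -> 0.5
```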

What information is needed to calculate Pr(doc | "category")?

Pr(Doc | Category) is the probability that an entire document belongs in a given category. Naïve Bayes assumes that the probability of one word in the document being in a specific category is unrelated to the probability of the other words being in that category. This is a false assumption, so we cannot use the number produced by a naïve Bayesian classifier as the actual probability that a document belongs in a category. However, we can compare the results for different categories and see which one has the highest probability. In practice, despite the flawed underlying assumption, this has proven to be a surprisingly effective way to classify documents.

What information is needed to calculate Pr(doc | "category")?

With the assumption that the probability of one word is unrelated to the probability of the other words being in a category, we can calculate Pr(Doc | Category) as

Pr(Doc | Cat) = Pr(word1 | Cat) × Pr(word2 | Cat) × … × Pr(wordn | Cat)

To calculate Pr(word | Cat), the classifier needs a dictionary data structure that keeps the counts for different words in different classifications.
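
A sketch of that structure and of training on top of it: `wc` (the name used on the next slide) maps each word to its per-category counts and is filled in by `train`. Here `getwords` is an assumed tokenizer, and the helper names and signatures are illustrative, not from the original listing.

```python
import re

def getwords(doc):
    # Assumed tokenizer: unique lowercase words of reasonable length.
    return set(w.lower() for w in re.split(r'\W+', doc) if 2 < len(w) < 20)

def train(wc, cc, doc, cat):
    # Count each word of the document under its category, then count
    # the document itself (for Pr(Category)).
    for word in getwords(doc):
        wc.setdefault(word, {})
        wc[word][cat] = wc[word].get(cat, 0) + 1
    cc[cat] = cc.get(cat, 0) + 1

def docprob(wc, cc, doc, cat, wordprob):
    # Pr(Doc | Cat): product of Pr(word | Cat) over the document's words.
    p = 1.0
    for word in getwords(doc):
        p *= wordprob(wc, cc, word, cat)
    return p
```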

Writing the classifier

With the data structures cc and wc set up, we can implement the naive Bayes classifier. This implementation encapsulates what the classifier has learned so far. The wprob function calculates the probability that a word is in a particular category, but it has a slight problem: using only the information it has seen so far makes it incredibly sensitive during early training and to words that appear very rarely. To get around this, we decide on an assumed probability, which is used when we have very little information about the word in question. A good number to start with is 0.5. We also need to decide how much to weight the assumed probability: a weight of 1 means the assumed probability is weighted the same as one word. The weightedprob function fixes the wprob problem by calculating a weighted average of wprob and the assumed probability.
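
A sketch of the two functions under those assumptions (assumed probability ap = 0.5, weight = 1). The names wprob and weightedprob come from the slide; the exact signatures here are illustrative.

```python
def wprob(wc, cc, word, cat):
    # Pr(word | cat): fraction of documents in `cat` containing the word.
    if cc.get(cat, 0) == 0:
        return 0.0
    return wc.get(word, {}).get(cat, 0) / cc[cat]

def weightedprob(wc, cc, word, cat, weight=1.0, ap=0.5):
    # Weighted average of wprob and the assumed probability `ap`, where
    # the word's total count across all categories acts as its weight.
    basic = wprob(wc, cc, word, cat)
    totals = sum(wc.get(word, {}).values())
    return (weight * ap + totals * basic) / (weight + totals)
```

With this smoothing, a word the classifier has never seen gets the neutral probability 0.5 in every category, and its estimate drifts toward the observed frequency as more training examples arrive.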

Writing the classify function

Once we have trained the classifier with examples of documents and their correct classifications, how can we implement a classify function that classifies a new document properly? The simplest approach would be to calculate the probability of the document being in each of the categories and choose the category with the highest probability. But in spam filtering it is much more important to avoid having good email messages classified as spam than it is to catch every single spam message. To deal with this, we set up a minimum threshold for each category. For spam filtering the threshold for "bad" is 3, so the probability for bad has to be 3 times higher than the probability for good. The threshold for "good" is 1, so anything is classified as good if its probability is at all better than that of the bad category. Any message where the probability for bad is higher, but not 3 times higher, is classified as unknown. See the program listing for a complete classifier implementation!
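
Putting the pieces together, a classify function under those thresholds might look like the sketch below. It builds on the helpers sketched above; the threshold rule follows this slide, while the generalization to any number of categories is an assumption.

```python
def prob(wc, cc, doc, cat):
    # Unnormalized Pr(Category | Document) = Pr(Doc | Cat) x Pr(Category).
    return docprob(wc, cc, doc, cat, weightedprob) * catprob(cc, cat)

def classify(wc, cc, doc, thresholds=None, default='unknown'):
    if thresholds is None:
        thresholds = {'bad': 3.0, 'good': 1.0}   # thresholds from this slide
    probs = {cat: prob(wc, cc, doc, cat) for cat in cc}
    best = max(probs, key=probs.get)
    # Keep `best` only if it beats every other category by its threshold;
    # otherwise fall back to the default ('unknown') classification.
    for cat, p in probs.items():
        if cat != best and p * thresholds.get(best, 1.0) > probs[best]:
            return default
    return best

# Usage (after the training calls shown earlier):
# classify(wc, cc, 'make quick money at the casino')   # -> 'bad'
```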
