Building a classifier

Assume you want to build a spam filter that learns from you telling the program which emails are spam and which are not. How do you write a classifier that you can train? Think about what information the classifier must keep when you train it, and how it classifies a new document. Which words or features of a document are the ones that make it spam? How does the classifier use the information it has learned so far to classify a new document?

Training: telling the filter what a spam document is

The classifier learns by being trained. We train it by providing examples of documents along with their correct classifications. The more examples of documents and correct classifications the classifier sees, the better it will be at making predictions.
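
For concreteness, a training session might look like the sketch below. This is a hypothetical interface built on the `train` helper sketched under a later slide; the example texts and category names are illustrative, not from the original listing.

```python
# Hypothetical training session; train() is sketched under a later slide,
# and the example texts and category names are illustrative.
wc, cc = {}, {}   # word counts and category counts (see later slides)
train(wc, cc, 'Nobody owns the water.', 'good')
train(wc, cc, 'the quick rabbit jumps fences', 'good')
train(wc, cc, 'buy pharmaceuticals now', 'bad')
train(wc, cc, 'make quick money at the online casino', 'bad')
```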

What information does the classifier keep when you train it?

That depends on the type of classifier you build; here we focus on building a naïve Bayesian classifier. Using a Bayesian classifier to classify a document basically means calculating the probability Pr(Category | Document): given a specific document, what is the probability that it fits into the "spam" category? So what information does the classifier need to learn during training so that it can later calculate Pr(Category | Document)? If we want to classify a new document as "spam" or "not spam", we can simply compare the two probabilities:

If Pr("good" | Document) > Pr("bad" | Document), the document is classified as non-spam.
If Pr("good" | Document) < Pr("bad" | Document), the document is classified as spam.

How do we compare Pr("bad" | Document) vs. Pr("good" | Document)?

Use Bayes' theorem: Pr(A | B) = Pr(B | A) × Pr(A) / Pr(B), so

Pr(Category | Document) = Pr(Document | Category) × Pr(Category) / Pr(Document)

Calculate p1 = Pr(doc | "bad") × Pr("bad") / Pr(doc)
Calculate p2 = Pr(doc | "good") × Pr("good") / Pr(doc)
Compare p1 with p2.

Pr(doc) is the same for every category, so we can cancel it. The problem now reduces to calculating

p1 = Pr(doc | "bad") × Pr("bad")
p2 = Pr(doc | "good") × Pr("good")
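
To make the cancellation concrete, here is a toy comparison with made-up numbers; only which score is larger matters, so Pr(doc) never has to be computed.

```python
# Toy numbers (made up) to show the comparison once Pr(doc) cancels:
p1 = 0.02 * 0.5   # Pr(doc | 'bad')  x Pr('bad')
p2 = 0.01 * 0.5   # Pr(doc | 'good') x Pr('good')
print('spam' if p1 > p2 else 'not spam')   # -> spam
```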

What information is needed to calculate Pr("category")?

Pr(Category) is the probability that a randomly selected document belongs to this category: the number of documents in the category divided by the total number of documents. To calculate Pr(Category), the classifier needs a dictionary data structure that keeps the count of documents in each category.
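
A minimal sketch of that bookkeeping, assuming the per-category document counts live in the dictionary `cc` (the name used on a later slide); the function name and example counts are illustrative.

```python
# cc maps each category to the number of training documents seen for it.
def catprob(cc, cat):
    # Pr(Category): documents in the category / total number of documents.
    total = sum(cc.values())
    return cc.get(cat, 0) / total if total else 0.0

cc = {'good': 2, 'bad': 2}    # illustrative counts
print(catprob(cc, 'bad'))     # -> 0.5
```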

What information is needed to calculate Pr(doc | "category")?

Pr(Doc | Category) is the probability that an entire document belongs in a given category. Naïve Bayes assumes that the probability of one word in the document being in a specific category is unrelated to the probability of the other words being in that category. This is a false assumption, so we cannot use the number produced by a naïve Bayesian classifier as the actual probability that a document belongs in a category. However, we can compare the results for different categories and see which one has the highest probability. In practice, despite the flawed underlying assumption, this has proven to be a surprisingly effective way to classify documents.

What information is needed to calculate Pr(doc | "category")?

With the assumption that the probability of one word is unrelated to the probability of the other words being in a category, we can calculate Pr(Doc | Category) as

Pr(Doc | Cat) = Pr(word1 | Cat) × Pr(word2 | Cat) × … × Pr(wordn | Cat)

To calculate Pr(word | Cat), the classifier needs a dictionary data structure that keeps the counts for different words in different classifications.
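
A sketch of that structure and of training on top of it: `wc` (the name used on the next slide) maps each word to its per-category counts and is filled in by `train`. Here `getwords` is an assumed tokenizer, and the helper names and signatures are illustrative, not from the original listing.

```python
import re

def getwords(doc):
    # Assumed tokenizer: unique lowercase words of reasonable length.
    return set(w.lower() for w in re.split(r'\W+', doc) if 2 < len(w) < 20)

def train(wc, cc, doc, cat):
    # Count each word of the document under its category, then count
    # the document itself (for Pr(Category)).
    for word in getwords(doc):
        wc.setdefault(word, {})
        wc[word][cat] = wc[word].get(cat, 0) + 1
    cc[cat] = cc.get(cat, 0) + 1

def docprob(wc, cc, doc, cat, wordprob):
    # Pr(Doc | Cat): product of Pr(word | Cat) over the document's words.
    p = 1.0
    for word in getwords(doc):
        p *= wordprob(wc, cc, word, cat)
    return p
```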

Writing the classifier

With the data structures cc and wc set up, we can implement the naive Bayes classifier. This implementation encapsulates what the classifier has learned so far. The wprob function calculates the probability that a word is in a particular category, but it has a slight problem: using only the information it has seen so far makes it incredibly sensitive during early training and to words that appear very rarely. To get around this, we decide on an assumed probability, which is used when we have very little information about the word in question. A good number to start with is 0.5. We also need to decide how much to weight the assumed probability: a weight of 1 means the assumed probability is weighted the same as one word. The weightedprob function fixes the wprob problem by calculating a weighted average of wprob and the assumed probability.
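
A sketch of the two functions under those assumptions (assumed probability ap = 0.5, weight = 1). The names wprob and weightedprob come from the slide; the exact signatures here are illustrative.

```python
def wprob(wc, cc, word, cat):
    # Pr(word | cat): fraction of documents in `cat` containing the word.
    if cc.get(cat, 0) == 0:
        return 0.0
    return wc.get(word, {}).get(cat, 0) / cc[cat]

def weightedprob(wc, cc, word, cat, weight=1.0, ap=0.5):
    # Weighted average of wprob and the assumed probability `ap`, where
    # the word's total count across all categories acts as its weight.
    basic = wprob(wc, cc, word, cat)
    totals = sum(wc.get(word, {}).values())
    return (weight * ap + totals * basic) / (weight + totals)
```

With this smoothing, a word the classifier has never seen gets the neutral probability 0.5 in every category, and its estimate drifts toward the observed frequency as more training examples arrive.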

Writing the classify function

Once we have trained the classifier with examples of documents and their correct classifications, how can we implement a classify function that classifies a new document properly? The simplest approach would be to calculate the probability of the document being in each of the categories and choose the category with the highest probability. But in spam filtering it is much more important to avoid having good email messages classified as spam than it is to catch every single spam message. To deal with this, we set up a minimum threshold for each category. For spam filtering the threshold for "bad" is 3, so the probability for bad has to be 3 times higher than the probability for good. The threshold for "good" is 1, so anything is classified as good if its probability is at all better than that of the bad category. Any message where the probability for bad is higher, but not 3 times higher, is classified as unknown. See the program listing for a complete classifier implementation!
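
Putting the pieces together, a classify function under those thresholds might look like the sketch below. It builds on the helpers sketched above; the threshold rule follows this slide, while the generalization to any number of categories is an assumption.

```python
def prob(wc, cc, doc, cat):
    # Unnormalized Pr(Category | Document) = Pr(Doc | Cat) x Pr(Category).
    return docprob(wc, cc, doc, cat, weightedprob) * catprob(cc, cat)

def classify(wc, cc, doc, thresholds=None, default='unknown'):
    if thresholds is None:
        thresholds = {'bad': 3.0, 'good': 1.0}   # thresholds from this slide
    probs = {cat: prob(wc, cc, doc, cat) for cat in cc}
    best = max(probs, key=probs.get)
    # Keep `best` only if it beats every other category by its threshold;
    # otherwise fall back to the default ('unknown') classification.
    for cat, p in probs.items():
        if cat != best and p * thresholds.get(best, 1.0) > probs[best]:
            return default
    return best

# Usage (after the training calls shown earlier):
# classify(wc, cc, 'make quick money at the casino')   # -> 'bad'
```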
