E-Mail Filtering Soonyeon Kim
Good Sites for Data Mining http://liinwww.ira.uka.de/bibliography/ - The Collection of Computer Science Bibliographies Major Conferences in Data Mining - KDD 2000 of ACM SIGKDD http://www.acm.org/sigs/sigkdd/kdd2000/ - SIGMOD 2000 of ACM SIGMOD Other Conferences - VLDB, IEEE ICDE, PAKDD
Text Mining: Finding Nuggets in Mountains of Textual Data Authors - Jochen Dorre, Peter Gerstl, Roland Seiffert - {doerre,gerstl,seiffert}@de.ibm.com Method to find this paper - searching The Collection of Computer Science Bibliographies - keywords used: data mining & text classification
Brief Description What is Text Mining? - applying the same analytical functions of data mining to the domain of textual information How does text mining differ from data mining? - Data mining: addresses only a limited part of the available data (structured information in databases) - Text mining: helps dig out the hidden gold in textual information & requires much more complex feature extraction The paper then describes in more detail the unique technologies that are key to successful text mining
Ifile: An Application of Machine Learning to E-mail Filtering Author - Jason D. M. Rennie Artificial Intelligence Lab, MIT - jrennie@ai.mit.edu Method to find this paper - KDD 2000 of ACM SIGKDD
Outline of Paper Introduction - need for automated e-mail filtering - Ishmail - important issues regarding mail filtering Mail Filtering - Classification Efficiency - Features - Naïve Bayes algorithm IFILE Experiment Conclusion
Introduction Popular e-mail clients allow users to organize their mail into folders by meaningful topic - popular e-mail clients: Netscape Messenger, Pine, Microsoft Outlook, Eudora and EXMH Ishmail - serves as a prioritization system - alerts the user when high-priority mail arrives or when a large number of messages have accumulated in a lower-priority folder Barriers - implementation of mail filters (speed efficiency, database size, collection of supervised training data) - integration into e-mail clients
Classification Efficiency Traditional classification methods - kNN, C4.5, Naïve Bayes Recent developments - SVM (Support Vector Machines), MED (Maximum Entropy Discrimination) Efficiency problems - SVM and MED provide significant improvements in accuracy, but at the cost of simplicity and time efficiency - kNN: slow at classification time
Classification Efficiency(2) Naïve Bayes - efficient training, quick classification and extensibility to iterative learning - training : updating word counts - classification : normalized sum of the counts corresponding to the words in the document in question
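To make the efficiency claim concrete, here is a minimal count-based Naïve Bayes sketch in Python. It is an illustration, not ifile's actual implementation: training only increments word counts, classification is a sum of smoothed log-counts, and the add-one smoothing anticipates the m-estimate slide later in this deck.

```python
import math
from collections import defaultdict

class CountingNB:
    """Count-based Naive Bayes: training is just count updates,
    which is what makes incremental (iterative) learning cheap."""

    def __init__(self):
        self.word_counts = defaultdict(lambda: defaultdict(int))  # folder -> word -> count
        self.total_words = defaultdict(int)                       # folder -> total words
        self.doc_counts = defaultdict(int)                        # folder -> messages seen

    def train(self, folder, words):
        for w in words:
            self.word_counts[folder][w] += 1
        self.total_words[folder] += len(words)
        self.doc_counts[folder] += 1

    def classify(self, words):
        vocab_size = len({w for c in self.word_counts.values() for w in c})
        n_docs = sum(self.doc_counts.values())
        scores = {}
        for folder in self.word_counts:
            score = math.log(self.doc_counts[folder] / n_docs)   # log prior
            n = self.total_words[folder]
            for w in words:
                n_j = self.word_counts[folder].get(w, 0)
                score += math.log((n_j + 1) / (n + vocab_size))  # add-one smoothing
            scores[folder] = score
        return max(scores, key=scores.get)
```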
Personal E-mail Filtering Every user has a unique collection of e-mail Users organize their e-mail in unique ways that pertain directly to their own preferences Key fact for effective personal e-mail filtering - use the information the user already provides through the mail client's interface
Learning Architecture A label is assigned to each newly filtered e-mail message, which is then added to the classification model Updating the classification model: every filtered e-mail is a training example - the assigned label is assumed correct if the user does not move the message to another folder - the model is updated if the user moves a misclassified mail into the appropriate folder Update for Naïve Bayes - shift word counts from one folder to another (see the sketch below)
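A sketch of that correction step, reusing the CountingNB tables from the sketch above (the function name is my own):

```python
def move_message(model, words, from_folder, to_folder):
    """The user dragged a misclassified message to the right folder:
    shift its word counts rather than retraining from scratch."""
    for w in words:
        model.word_counts[from_folder][w] -= 1
        model.word_counts[to_folder][w] += 1
    model.total_words[from_folder] -= len(words)
    model.total_words[to_folder] += len(words)
    model.doc_counts[from_folder] -= 1
    model.doc_counts[to_folder] += 1
```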
Features A classification model acts as a function f : F → C - F: features, C: classes A mail filter is a special classifier f : F → C - F: characteristics of an e-mail message, C: mail folders - by considering each e-mail message as a bag of words, the function f maps an unordered set of words to a folder name
Features(2) Naïve Bayes keeps track of word frequency statistics Reducing the number of features used for classification makes filtering more efficient Feature selection cutoff - old, infrequent words are dropped - a word that occurs log(age)-1 or fewer times is discarded from the model - age: number of e-mail messages added to the model since statistics have been kept for that word e.g. if “baseball” occurred in the 1st document and 5 or fewer times in the next 63 documents, the word and its statistics would be eliminated from the database
Maintaining Dictionary Cutoff algorithm - a word that occurs log(age)-1 or fewer times is discarded from the model e.g. “datamining” occurred in the 1st document, and 63 documents arrive after it age = 1 + 63 = 64 log(age) - 1 = log2(64) - 1 = 5 if “datamining” appears 5 or fewer times, the word and its statistics are eliminated from the database
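A sketch of the cutoff, assuming a base-2 logarithm (inferred from the worked example, where age 64 gives threshold 5):

```python
import math

def prune_dictionary(word_stats):
    """Keep a word only if its frequency exceeds log2(age) - 1,
    where age counts messages added since the word was first seen."""
    return {word: (age, freq)
            for word, (age, freq) in word_stats.items()
            if freq > math.log2(age) - 1}

stats = {"datamining": (64, 5), "baseball": (64, 6)}
print(prune_dictionary(stats))  # {'baseball': (64, 6)} -- "datamining" is pruned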
Maintaining Dictionary(2) The .idata model file (word entries are: word, age, folder:frequency pairs):
-----------.idata-------------
A B C           list of folders (A:0 B:1 C:2)
5 2 6           total word instances per folder
2 1 1           number of messages per folder
party   4   0:2 1:1
belch   3   0:1
yellow  4   0:2 2:3
kick    2   1:1
peep    1   2:2
Built from: two msgs in A - "party party belch yellow yellow"; one msg in B - "party kick"; one msg in C - "peep peep yellow yellow yellow"
Word Selection Header Trimming E-mail Body: content Header: list of fields pertaining to the message From:, To:, Subject: - keep these fields Received:, Date:, Message-Id: - remove these fields
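A minimal header-trimming sketch along these lines (it ignores folded, multi-line header fields for simplicity):

```python
KEEP = ("from:", "to:", "subject:")

def trim_headers(raw_message):
    """Keep only the informative header fields; drop routing fields such
    as Received:, Date:, Message-Id:; pass the body through untouched."""
    header, _, body = raw_message.partition("\n\n")
    kept = [line for line in header.splitlines()
            if line.lower().startswith(KEEP)]
    return "\n".join(kept) + "\n\n" + body
```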
Tokenizing Text Two techniques Using a stop list - decreases the amount of noise in the data by eliminating uninformative words e.g. pronouns, modifiers, adverbs Stemming - links together words which share the same root e.g. serve, service, serves, served => same root serv
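A toy illustration of both techniques; a real system would use a full stop list and a proper Porter stemmer rather than this crude suffix stripping:

```python
STOP_WORDS = {"the", "a", "an", "this", "that", "very", "it"}

def crude_stem(word):
    """Illustrative suffix stripping only; not a real Porter stemmer."""
    for suffix in ("ices", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    return [crude_stem(t) for t in text.lower().split()
            if t not in STOP_WORDS]

print(tokenize("the server serves services"))  # ['server', 'serv', 'serv']
```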
Naïve Bayes What is Naïve Bayes? - a simple, yet effective classifier of text documents - a statistical machine learning algorithm Assumptions - each document is treated as a bag of words - each word is independent of the others given the class
Naïve Bayes - 1st step Probability of d having been generated by c_i - with the assumption that attribute values are independent: P(d|c_i) = Π_j P(w_j|c_i), the product taken over the words w_j of document d
Naïve Bayes - 2nd step Compute P(c_i|d) for all classes and choose the class with maximum likelihood: c = argmax_i P(c_i) P(d|c_i) - since the probability values are only used for comparison purposes, the constant denominator P(d) can be dropped
Naïve Bayes M-estimate purpose: to give a reasonable probability in the case of sparse data P(w_j|c_i) = (n_j + 1) / (n + |Vocabulary|) - n_j: number of instances of w_j in the documents of class c_i - n: total number of words in documents of class c_i - |Vocabulary|: number of distinct words
Experimental Result Information about the e-mail corpora on which classification experiments were conducted Four volunteers, including the author
Experimental Result (table not reproduced here)
Experimental Result Individual experiments with different settings 1. Alpha lexer, stop list used, header trimming, feature selection, no stemming 2. Alpha-only lexer replaces alpha lexer 3. White lexer replaces alpha lexer 4. No stop list is used 5. Stemming is used 6. No feature selection is used 7. All headers are used for classification purposes
Experimental Result Three Lexers Alpha lexer - default lexer - tokenizes strings of alphabetic characters Alpha only lexer - tokenizes only strings of alphabetic characters - does not lex e-mail addresses into tokens White lexer - tokenizes strings separated by whitespace
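Regex sketches of the three lexers; these are guesses at the behavior described above, and ifile's actual tokenization rules may differ:

```python
import re

def alpha_lexer(text):
    # Alphabetic runs, keeping e-mail addresses together as single tokens.
    return re.findall(r"[A-Za-z]+(?:[@.][A-Za-z]+)+|[A-Za-z]+", text)

def alpha_only_lexer(text):
    # Alphabetic runs only; an address falls apart into its words.
    return re.findall(r"[A-Za-z]+", text)

def white_lexer(text):
    # Anything separated by whitespace is a token.
    return text.split()

msg = "mail jrennie@ai.mit.edu about $4.95"
print(alpha_lexer(msg))       # ['mail', 'jrennie@ai.mit.edu', 'about']
print(alpha_only_lexer(msg))  # ['mail', 'jrennie', 'ai', 'mit', 'edu', 'about']
print(white_lexer(msg))       # ['mail', 'jrennie@ai.mit.edu', 'about', '$4.95']
```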
Experimental Result Results - no experimental setting provides the best results across all users Experiment with highest average accuracy - experiment #1 shows the best average result (89% accuracy) - per-user accuracy ranges from 86% to 91%
Experimental Result Time efficiency - Naïve Bayes: “fast enough” - 27 seconds to build a model of 7000+ e-mail messages (average 259 msgs/second; tar-gzip of the same messages requires 17 seconds) Space efficiency - a classification model built on 7000+ messages across 49 folders requires only 447,090 bytes
Filtering Junk E-Mail Soonyeon Kim
A Bayesian Approach to Filtering Junk E-Mail Authors - Mehran Sahami, Susan Dumais, David Heckerman, Eric Horvitz - Stanford University & Microsoft Research From - AAAI 98 (American Association for Artificial Intelligence)
Problems of Junk Mail Wasted time - many users must now spend a non-trivial portion of their time dealing with unwanted messages Content of material - some of these messages contain offensive material, such as graphic pornography Space problem - junk mail also quickly fills up file server storage space
Machine Learning Approach Learning - system S learns from experience E with respect to a class of tasks T and performance measure P Learning in junk-mail filtering S: e-mail classifier T: classify an e-mail message as junk/legitimate P: fraction of correct predictions E: a set of pre-classified e-mail messages Vector space model - represents mail messages as feature vectors - each e-mail message has a single fixed-length feature vector - an individual message can be represented as a binary vector denoting which words are present or absent (1 for present, 0 for absent), as sketched below
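A minimal sketch of the binary encoding (the vocabulary and message here are made up):

```python
def to_binary_vector(message_words, vocabulary):
    """1 if the vocabulary word occurs in the message, 0 otherwise."""
    present = set(message_words)
    return [1 if w in present else 0 for w in vocabulary]

vocabulary = ["money", "free", "meeting", "report"]
print(to_binary_vector(["free", "money", "now"], vocabulary))  # [1, 1, 0, 0]
```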
Bayesian Classifier An e-mail message is a vector of N features X = ⟨X_1, X_2, X_3, ..., X_N⟩ - for example, X_42 might be ‘the e-mail contains “money”’ - x_42 = 0 means the message described by x does not contain “money” Classify messages into K classes C = {c_1, c_2} = {junk, legit} (K = 2) Now suppose we see a new e-mail message, with encoding x. We seek the probability that the class C is junk, Pr[C=junk | X=x], shorthand for Pr[C=junk | X_1=x_1 & X_2=x_2 & … & X_N=x_N]
Bayesian networks (a) a Naïve Bayesian classifier (b) a more complex Bayesian classifier with limited dependencies between the features
Bayesian Rule Bayes' theorem: Pr[C | X=x] = Pr[C] · Pr[X=x | C] / Pr[X=x] Assume that each X_i is independent given the class, so Pr[X=x | C] = Π_i Pr[X_i=x_i | C]
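Putting the rule and the independence assumption together, a sketch with made-up parameter names: `prior[c]` = Pr[C=c] and `p[c][i]` = Pr[X_i=1 | C=c], all assumed strictly between 0 and 1.

```python
import math

def pr_junk(x, prior, p):
    """Pr[C=junk | X=x] for a binary feature vector x, via Bayes' rule
    with class-conditional independence of the features."""
    def joint(c):
        log_j = math.log(prior[c])
        for i, xi in enumerate(x):
            log_j += math.log(p[c][i] if xi else 1.0 - p[c][i])
        return math.exp(log_j)
    j, l = joint("junk"), joint("legit")
    return j / (j + l)   # Pr[X=x] cancels in the normalization

# The paper's cautious decision rule: predict junk only above 99.9%.
# is_junk = pr_junk(x, prior, p) > 0.999
```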
Features Words - fixed-width vector X = ⟨X_1, X_2, …, X_N⟩ Hand-crafted phrasal features - “FREE!”, “only $” (as in “only $4.95”) and “be over 21” - 35 such hand-crafted phrases are included Domain-specific features - domain type of the sender (.edu or .com) - junk mail is usually not from a .edu domain Resolving familiar e-mail addresses - i.e. replace sdumais@microsoft.com with Susan Dumais Time - most junk e-mail is sent at night
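A few of these non-textual features, sketched in Python. The paper uses roughly 20 such features; this selection and the punctuation threshold are my own illustrations, not the authors' definitions:

```python
from email.utils import parseaddr

def domain_specific_features(headers, body):
    sender = parseaddr(headers.get("From", ""))[1]
    subject = headers.get("Subject", "")
    nonalnum = sum(1 for c in subject if not c.isalnum() and not c.isspace())
    return {
        "has_FREE!": "FREE!" in body,                 # hand-crafted phrase
        "has_only_$": "only $" in body,               # hand-crafted phrase
        "from_edu_domain": sender.endswith(".edu"),   # junk rarely is
        "peculiar_punctuation": nonalnum > len(subject) / 4,  # illustrative cutoff
    }
```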
Features(2) Peculiar punctuation - percentage of non-alphanumeric characters in the subject of a mail - e.g. “$$$$$ MONEY $$$$$” (Figure: X axis - subject has peculiar punctuation; Y axis - percentage of total messages)
Feature Selection Mutual Information - mutual information MI(A, B) is a numeric measure of what we can conclude about A if we know B, and vice versa - example: if A and B are independent, then we can’t conclude anything: MI(A, B) = 0 - select the 500 features with the greatest MI value
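A sketch of the MI computation over an empirical joint distribution (the dictionary representation is my own; the paper computes MI between each binary feature and the class, then keeps the top 500):

```python
import math

def mutual_information(joint):
    """MI(A, B) = sum_{a,b} P(a,b) * log( P(a,b) / (P(a) * P(b)) ),
    where joint[(a, b)] is an empirical joint probability."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

# Independent A and B -> MI = 0, as on the slide:
print(mutual_information({(0, 0): .25, (0, 1): .25, (1, 0): .25, (1, 1): .25}))  # 0.0
```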
Evaluation Three ways Using domain-specific features - words only - words + phrases - words + phrases + extra features Three-way categorization - 3 categories {porn-junk, other-junk, legit} instead of 2 categories {junk, legit} “Real” usage scenario
Using different features The cost of missing legitimate e-mail is much higher than the cost of inadvertently reading junk The authors wanted to make their system very “optimistic”, so that it only predicts “junk” when it is very certain -- it uses a threshold of 99.9% 1789 hand-tagged e-mail messages: 1578 junk, 211 legit Split into 1538 training messages (86%) and 251 testing messages (14%)
Using different features Results of the experiment - words only - words + 35 phrasal features - words + phrasal features + 20 non-textual domain-specific features
Using different features
                predict Junk           predict Legit
actually Junk   A (true positives)     B (false negatives)
actually Legit  C (false positives)    D (true negatives)
Junk Precision = A / (A + C)    Junk Recall = A / (A + B)
Legit Precision = D / (D + B)   Legit Recall = D / (D + C)
Junk precision is of greatest concern to most users, because they would not want their legitimate mail discarded as junk
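The four measures, computed straight from the confusion matrix above:

```python
def junk_legit_metrics(A, B, C, D):
    """A = junk predicted junk, B = junk predicted legit,
    C = legit predicted junk, D = legit predicted legit."""
    return {
        "junk_precision":  A / (A + C),
        "junk_recall":     A / (A + B),
        "legit_precision": D / (D + B),
        "legit_recall":    D / (D + C),
    }
```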
Using different features Precision/recall curves for junk mail
Sub-classes of junk E-mail Three-way categorization - 3 categories {porn-junk, other-junk, legit} instead of 2 categories {junk, legit} Classification is counted as correct if a “junk” message is classified as either “porn-junk” or “other-junk” Unfortunately, it didn’t work! - probably because more parameters means (exponentially!) more data is needed to estimate them accurately - some features may be clearly indicative of junk versus legitimate mail, yet not powerful across three categories (they do not distinguish well between the sub-classes of junk)
Sub-classes of junk E-mail Precision/recall curves considering sub-groups of junk mail
Real Usage Scenario Three kinds of messages 1. Read and keep 2. Read and discard (e.g. a joke from a friend) 3. Junk Result Misclassified mails - news stories from an e-mail news service that the user subscribes to (no loss of significant information)
Real Usage Scenario Precision/recall curves in real usage scenario
