An Experimental Comparison of Naive Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages. Authors: Ion Androutsopoulos, John Koutsias, Konstantinos V. Chandrinos, Constantine D. Spyropoulos. Source: SIGIR 2000
Outline: Introduction; Feature selection; The Naive Bayesian classifier; Result
Introduction: There is a great deal of spam e-mail. This work compares a Naive Bayesian classifier with a keyword-based anti-spam mechanism. Sahami et al. trained a Naive Bayesian classifier on manually categorized legitimate and spam messages.
The Naive Bayesian classifier: Each message is represented by a vector x = (x_1, x_2, x_3, ..., x_n), where x_1, ..., x_n are the values of attributes X_1, ..., X_n. Each attribute shows whether or not a particular word (e.g. "adult") is present in the message. Additional attributes can correspond to phrases (e.g. "be over 21") and to non-textual properties (e.g. whether or not the message contains attachments).
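The attribute vector described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `message_to_vector`, the tokenizing regular expression, and the example attribute lists are all assumptions made for this sketch.

```python
import re

def message_to_vector(text, word_attrs, phrase_attrs, has_attachment):
    """Return a binary attribute vector (x_1, ..., x_n) for one message.

    word_attrs and phrase_attrs are hypothetical attribute lists; the
    final component is a non-textual attribute (attachment present).
    """
    lowered = text.lower()
    # Crude tokenizer (an assumption): runs of letters, digits, $ and '.
    tokens = set(re.findall(r"[a-z0-9$']+", lowered))
    vector = [1 if w in tokens else 0 for w in word_attrs]
    # Phrase attributes are matched as substrings of the lowercased text.
    vector += [1 if p in lowered else 0 for p in phrase_attrs]
    vector.append(1 if has_attachment else 0)
    return vector

words = ["adult", "money"]
phrases = ["be over 21"]
x = message_to_vector("You must be over 21 to view ADULT content",
                      words, phrases, has_attachment=False)
print(x)
```

Each position of `x` corresponds to one attribute X_i, so the same attribute lists must be used for every message.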
Mutual information: Use mutual information (MI) to select candidate attributes. Compute MI(X; C) between each attribute X and the category C, then select the attributes with the highest mutual information values.
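The formula on this slide did not survive extraction. The standard definition of mutual information for a binary attribute X and a category C is MI(X; C) = sum over x in {0, 1} and c in {spam, legitimate} of P(X = x, C = c) * log( P(X = x, C = c) / (P(X = x) * P(C = c)) ). The sketch below estimates it from counts and keeps the top-scoring attributes; the function names and the use of raw relative frequencies are assumptions of this illustration.

```python
import math
from collections import Counter

def mutual_information(xs, cs):
    """Estimate MI(X; C) from paired samples using relative frequencies."""
    n = len(xs)
    joint = Counter(zip(xs, cs))
    px = Counter(xs)
    pc = Counter(cs)
    mi = 0.0
    # Only observed (x, c) pairs appear in joint, so zero-probability
    # terms are skipped (the usual 0 * log 0 = 0 convention).
    for (x, c), count in joint.items():
        p_xc = count / n
        mi += p_xc * math.log(p_xc / ((px[x] / n) * (pc[c] / n)))
    return mi

def select_attributes(vectors, labels, k):
    """Return the indices of the k attributes with the highest MI."""
    n_attrs = len(vectors[0])
    scores = [mutual_information([v[i] for v in vectors], labels)
              for i in range(n_attrs)]
    ranked = sorted(range(n_attrs), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

For example, an attribute that is 1 in every spam message and 0 in every legitimate one scores log 2 on a balanced sample, while an attribute independent of the category scores 0.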
The Naive Bayesian classifier
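The formulas on these slides were images and are missing from the transcript. The standard Naive Bayesian posterior, which assumes the attributes are conditionally independent given the category, is P(C = c | X = x) = P(C = c) * prod_i P(X_i = x_i | C = c) / sum_k [ P(C = k) * prod_i P(X_i = x_i | C = k) ]. The sketch below is one way to estimate and evaluate it for binary attributes; the add-one smoothing and the function names `train` and `posterior` are assumptions of this illustration, not details taken from the slides.

```python
import math

def train(vectors, labels):
    """Estimate P(C = c) and P(X_i = 1 | C = c) from binary vectors."""
    classes = sorted(set(labels))
    n_attrs = len(vectors[0])
    model = {}
    for c in classes:
        rows = [v for v, y in zip(vectors, labels) if y == c]
        prior = len(rows) / len(vectors)
        # Add-one smoothing (an assumption) avoids zero probabilities.
        p_one = [(sum(r[i] for r in rows) + 1) / (len(rows) + 2)
                 for i in range(n_attrs)]
        model[c] = (prior, p_one)
    return model

def posterior(model, x):
    """Return P(C = c | X = x) for every category c."""
    log_scores = {}
    for c, (prior, p_one) in model.items():
        s = math.log(prior)
        for xi, p in zip(x, p_one):
            s += math.log(p if xi == 1 else 1.0 - p)
        log_scores[c] = s
    # Normalize in log space for numerical stability.
    m = max(log_scores.values())
    unnorm = {c: math.exp(s - m) for c, s in log_scores.items()}
    total = sum(unnorm.values())
    return {c: v / total for c, v in unnorm.items()}
```

Working with log probabilities keeps the product over many attributes from underflowing when hundreds of attributes are retained.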
 
 
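L -> S (legitimate misclassified as spam) and S -> L (spam misclassified as legitimate) denote the two error types. We assume that L -> S is λ times more costly than S -> L. Classify a message as spam if the following classification criterion is met: P(C = spam | X = x) / P(C = legitimate | X = x) > λ, or equivalently P(C = spam | X = x) > t, with t = λ / (1 + λ).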
 
λ = 999 (t = 0.999): mistakenly blocking a legitimate message is taken to be as bad as letting 999 spam messages pass the filter. λ = 9 (t = 0.9): when a message is blocked, an apology message and a riddle are returned to the sender. λ = 1 (t = 0.5): the recipient does not care about the extra work imposed on the sender.
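The three settings above are consistent with the relation t = λ / (1 + λ), where λ is the cost ratio between the two error types and t is the threshold on the spam posterior. This follows from requiring P(spam | x) / P(legitimate | x) > λ with P(legitimate | x) = 1 - P(spam | x). A small sketch, with hypothetical function names:

```python
def threshold(lam):
    """Posterior threshold t equivalent to a cost ratio lam."""
    return lam / (1.0 + lam)

def is_spam(p_spam, lam):
    """Classify as spam only if P(spam | x) exceeds t = lam / (1 + lam)."""
    return p_spam > threshold(lam)

for lam in (999, 9, 1):
    print(lam, threshold(lam))
```

A message with a spam posterior of 0.95 would be blocked under λ = 9 or λ = 1, but would pass the filter under λ = 999.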
Result
1789 messages, consisting of 211 legitimate messages that users had saved and 1578 spam messages. First experiment: only word attributes were used. Second experiment: candidate attributes corresponding to phrases were added (e.g. "be over 21", "only $"). Third experiment: non-textual attributes were added (e.g. whether or not the message contains attachments, or a high proportion of non-alphanumeric characters).
Experiments with the PU1 corpus: 481 spam messages and 618 legitimate messages. The Naive Bayesian classifier was evaluated with ten-fold cross-validation to reduce random variation, and results were then averaged over the ten runs. The number of retained attributes was varied from 50 to 700 in steps of 50. Preprocessing options: lemmatizer and stop-list.
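The evaluation protocol above can be sketched as a ten-fold loop whose scores are averaged. The random shuffle, the seed, and the `evaluate` callback are assumptions of this illustration; the slides do not say how the corpus was partitioned.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Split indices 0..n-1 into k disjoint folds after shuffling."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(n, evaluate, k=10, seed=0):
    """Average evaluate(train_idx, test_idx) over the k folds.

    evaluate is a hypothetical callback that trains on train_idx,
    tests on test_idx, and returns a numeric score.
    """
    folds = k_fold_indices(n, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f_i, f in enumerate(folds) if f_i != i for j in f]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / len(scores)

# Attribute counts from the slide: 50 to 700 in steps of 50.
ATTRIBUTE_COUNTS = list(range(50, 701, 50))

# Corpus size from the slide: 481 spam + 618 legitimate messages.
N_MESSAGES = 481 + 618
```

In a full experiment, `cross_validate` would be called once per entry in `ATTRIBUTE_COUNTS`, with attribute selection repeated inside each training fold so that test messages do not influence which attributes are kept.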
 
 
 
 

