An experimental comparison of naive bayesian and keyword based

An Experimental Comparison of Naive Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages
Authors: Ion Androutsopoulos, John Koutsias, Konstantinos V. Chandrinos, Constantine D. Spyropoulos
Source: SIGIR 2000

Outline
Introduction
Feature selection
The Naive Bayesian classifier
Results

Introduction
Spam e-mail is abundant.
This work compares a Naive Bayesian classifier with a keyword-based anti-spam mechanism.
Sahami et al. trained a Naive Bayesian classifier on manually categorized legitimate and spam messages.

The Naive Bayesian classifier
Each message is represented by a vector $x = (x_1, x_2, x_3, \ldots, x_n)$, where $x_1, \ldots, x_n$ are the values of the attributes $X_1, \ldots, X_n$.
Each attribute shows whether or not a particular word (e.g. "adult") is present in the message.
Additional attributes can correspond to phrases (e.g. "be over 21") or to non-textual properties (e.g. whether or not the message contains attachments).
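The representation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `to_attribute_vector`, the tokenization rule, and the sample attribute list are hypothetical and are not taken from the paper.

```python
import re

def to_attribute_vector(message, attributes):
    """Return a list of 0/1 values, one per attribute, marking presence in the message.

    Assumption: single-word attributes are matched against lowercase tokens,
    and multi-word (phrase) attributes by substring search on the lowercase text.
    """
    text = message.lower()
    tokens = set(re.findall(r"[a-z0-9$]+", text))
    vector = []
    for attr in attributes:
        if " " in attr:
            # Phrase attribute, e.g. "be over 21"
            vector.append(1 if attr in text else 0)
        else:
            # Word attribute, e.g. "adult"
            vector.append(1 if attr in tokens else 0)
    return vector

# Hypothetical attribute list: one word, one phrase, one absent word.
attributes = ["adult", "be over 21", "meeting"]
print(to_attribute_vector("You must be over 21 to view this ADULT offer", attributes))
# -> [1, 1, 0]
```

Non-textual attributes (such as the presence of attachments) would be appended to the same vector as further 0/1 values.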

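Mutual information
Use mutual information (MI) to select candidate attributes. For each attribute $X$ and the category variable $C$:
$MI(X;C) = \sum_{x \in \{0,1\},\; c \in \{spam,\, legitimate\}} P(X=x, C=c) \, \log \frac{P(X=x, C=c)}{P(X=x)\,P(C=c)}$
Then select the attributes with the highest mutual information values.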
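A minimal sketch of MI-based attribute selection follows, using the standard definition of mutual information with probabilities estimated as relative frequencies. The function names, the base-2 logarithm, and the toy data are assumptions made for illustration; the ranking of attributes does not depend on the logarithm base.

```python
import math

def mutual_information(vectors, labels, index):
    """MI between the binary attribute at `index` and the class label.

    Probabilities are estimated as relative frequencies (an assumption).
    """
    n = len(vectors)
    mi = 0.0
    for x in (0, 1):
        p_x = sum(1 for v in vectors if v[index] == x) / n
        for c in set(labels):
            p_c = sum(1 for y in labels if y == c) / n
            joint = sum(1 for v, y in zip(vectors, labels)
                        if v[index] == x and y == c) / n
            if joint > 0:  # terms with zero joint probability contribute nothing
                mi += joint * math.log2(joint / (p_x * p_c))
    return mi

def select_attributes(vectors, labels, k):
    """Return the indices of the k attributes with the highest MI."""
    scores = [(mutual_information(vectors, labels, i), i)
              for i in range(len(vectors[0]))]
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scores[:k]]

# Toy data: attribute 0 separates the classes perfectly, attribute 1 is uninformative.
vectors = [[1, 0], [1, 1], [0, 0], [0, 1]]
labels = ["spam", "spam", "legitimate", "legitimate"]
print(select_attributes(vectors, labels, 1))  # -> [0]
```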

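$L \to S$ (legitimate to spam) and $S \to L$ (spam to legitimate) denote the two error types.
We assume that $L \to S$ is $\lambda$ times more costly than $S \to L$.
Classify a message as spam if the following classification criterion is met:
$\frac{P(C = spam \mid X = x)}{P(C = legitimate \mid X = x)} > \lambda$, or equivalently $P(C = spam \mid X = x) > t$ with $t = \frac{\lambda}{1 + \lambda}$.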

$\lambda = 999$ ($t = 0.999$): mistakenly blocking a legitimate message is taken to be as bad as letting 999 spam messages pass the filter.
$\lambda = 9$ ($t = 0.9$): when a message is blocked, an apology message and a riddle are returned to the sender.
$\lambda = 1$ ($t = 0.5$): the recipient does not care about the extra work imposed on the sender.
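The three cost settings correspond to thresholds t = λ / (1 + λ) on the spam posterior. The sketch below shows how a Naive Bayesian classifier over binary attributes could apply such a threshold. It is not the authors' implementation: the function names, the Laplace smoothing, and the toy training data are assumptions made for illustration.

```python
import math

def threshold_from_lambda(lam):
    """Threshold t such that P(spam | x) > t  <=>  P(spam | x) / P(legitimate | x) > lam."""
    return lam / (1.0 + lam)

def train(vectors, labels):
    """Estimate P(C) and P(X_i = 1 | C); Laplace smoothing is an assumption."""
    n_attrs = len(vectors[0])
    model = {}
    for c in sorted(set(labels)):
        rows = [v for v, y in zip(vectors, labels) if y == c]
        prior = len(rows) / len(vectors)
        p_one = [(sum(r[i] for r in rows) + 1) / (len(rows) + 2)
                 for i in range(n_attrs)]
        model[c] = (prior, p_one)
    return model

def spam_posterior(model, x):
    """P(C = spam | X = x) under the naive independence assumption.

    Expects the training labels to include the class "spam".
    """
    log_scores = {}
    for c, (prior, p_one) in model.items():
        s = math.log(prior)
        for xi, p in zip(x, p_one):
            s += math.log(p if xi == 1 else 1.0 - p)
        log_scores[c] = s
    m = max(log_scores.values())  # subtract the maximum for numerical stability
    total = sum(math.exp(s - m) for s in log_scores.values())
    return math.exp(log_scores["spam"] - m) / total

def is_spam(model, x, lam):
    """Apply the cost-sensitive criterion with cost ratio lam."""
    return spam_posterior(model, x) > threshold_from_lambda(lam)

# Toy training set (hypothetical): two spam and two legitimate messages.
vectors = [[1, 1], [1, 0], [0, 0], [0, 1]]
labels = ["spam", "spam", "legitimate", "legitimate"]
model = train(vectors, labels)

for lam in (1, 9, 999):
    print(lam, threshold_from_lambda(lam), is_spam(model, [1, 0], lam))
```

On this toy data the posterior for `[1, 0]` is 0.75, so the message is blocked at λ = 1 but passes at λ = 9 and λ = 999, which shows how a larger λ makes the filter more reluctant to block.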

1789 messages, consisting of 211 legitimate messages that users had saved and 1578 spam messages.
First experiment: only word attributes were used.
Second experiment: candidate attributes corresponding to phrases were added (e.g. "be over 21", "only $").
Third experiment: non-textual candidate attributes were added (e.g. whether or not the message contains attachments, or a high proportion of non-alphanumeric characters).

Experiments with the PU1 corpus
481 spam messages and 618 legitimate messages.
The Naive Bayesian classifier was evaluated with ten-fold cross-validation to reduce random variation; results were then averaged over the ten runs.
The number of retained attributes was varied from 50 to 700 in steps of 50.
Preprocessing: lemmatizer and stop-list.
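The evaluation protocol can be sketched as follows. This is a hedged illustration only: the random shuffling, the seed, and the `evaluate` callback are assumptions, and the paper's actual partitioning of the corpus may differ.

```python
import random

def ten_fold_splits(n_items, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) pairs; each fold is the test set exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # assumption: random assignment to folds
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

def cross_validate(n_items, evaluate, n_folds=10):
    """Average a score over the folds; `evaluate(train, test)` is a hypothetical callback."""
    scores = [evaluate(train, test)
              for train, test in ten_fold_splits(n_items, n_folds)]
    return sum(scores) / len(scores)

# Corpus size from the slide: 481 spam + 618 legitimate messages.
n_messages = 481 + 618

# Numbers of retained attributes: 50, 100, ..., 700.
attribute_counts = list(range(50, 701, 50))

print([len(test) for _, test in ten_fold_splits(n_messages)])
print(attribute_counts)
```

In a full experiment, `evaluate` would select the attributes on the training part only, train the classifier, and score it on the held-out part, once per value in `attribute_counts`.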
