Minority Report in Fraud Detection: Classification of Skewed Data Clifton Phua, Damminda Alahakoon, and Vincent Lee SIGKDD 2004 Reporter: Ping-Hua Yang
Abstract This paper proposes an innovative fraud detection method to deal with the data mining problem of skewed data distributions. The method uses Back-propagation (BP), together with Naïve Bayesian (NB) and C4.5 algorithms, on data partitions derived from minority over-sampling with replacement. The new fraud detection method is compared against C4.5 trained using under-sampling, over-sampling, and SMOTE without partitioning. The most interesting finding is the confirmation that the combination of classifiers producing the best cost savings draws contributions from all three algorithms.
Outline: Introduction, Fraud detection, Experiments, Results, Discussion, Conclusion
Introduction Fraud, or criminal deception, is a costly problem for many profit-driven organizations. Data mining can minimize some of these losses by making use of the massive collections of customer data. However, fraud detection data is typically highly skewed or imbalanced. There are two typical ways to proceed when faced with this problem: the first approach is to apply different algorithms; the second approach is to manipulate the class distribution.
Introduction This paper introduces a new fraud detection method for skewed data. The innovative use of NB, C4.5, and BP classifiers to process the same partitioned numerical data has the potential to achieve better cost savings. The best classifiers from the different algorithms are selected using stacking, and their predictions are merged. One related problem caused by skewed data is measuring the performance of the classifiers: success cannot be defined in terms of predictive accuracy, because errors on the minority class in skewed data usually carry a significantly higher cost.
Fraud detection: Existing fraud detection methods, The new fraud detection method, Fraud detection algorithms
Existing fraud detection methods Insurance fraud: The hot spot methodology applies a three-step process: k-means for cluster detection, C4.5 for decision tree rule induction, and domain knowledge, statistical summaries, and visualization tools for rule evaluation. [Williams G, Huang Z, 1997] The hot spot architecture was later expanded to use a genetic algorithm to generate the rules and to involve the domain user directly in the rule discovery process. [Williams G, 1999] Credit card fraud: A comparison of Bayesian Belief Networks (BBN) and Artificial Neural Networks (ANN) used the STAGE algorithm for the BBN and the BP algorithm for the ANN in fraud detection. [Maes S, Tuyls K, Vanschoenwinkel B, Manderick B, 2002]
Existing fraud detection methods Telecommunications fraud: The Advanced Security for Personal Communications Technologies (ASPECT) research group focuses on neural networks that train on legitimate current user profiles, which store recent user information, and user profile histories, which store long-term information, to define normal patterns of use. [Weatherford M, 2002]
The new fraud detection method The idea is to simulate the book's Precrime method of multiple precogs and an integration mechanism, using existing data mining methods and techniques.
Fraud detection algorithms This study uses a slight variation of cross-validation: instead of ten data partitions, an odd number, eleven, is used. Bagging combines the classifiers trained by the same algorithm using unweighted majority voting on each example or instance. Stacking combines multiple classifiers generated by different algorithms with a meta-classifier: to classify an instance, the base classifiers from the three algorithms present their predictions to the meta-classifier, which then makes the final prediction. This paper proposes stacking-bagging, a hybrid technique: the simplest learning algorithm is trained first, followed by the more complex ones.
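A minimal sketch of how stacking-bagging might be assembled, assuming scikit-learn estimators as stand-ins (GaussianNB for NB, an entropy-based tree for C4.5, MLPClassifier for BP) and a simplified ranking step in place of the paper's meta-classifier; none of this is the authors' exact implementation.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

def train_base_classifiers(partitions):
    """Train NB, a C4.5-like tree, and a BP network on each (X, y)
    data partition; with eleven partitions this yields 3 x 11 = 33
    base classifiers. Labels are assumed to be 1 = fraud, 0 = legal."""
    classifiers = []
    for X, y in partitions:  # simplest algorithm first, as in the slide
        for make in (lambda: GaussianNB(),
                     lambda: DecisionTreeClassifier(criterion="entropy"),
                     lambda: MLPClassifier(max_iter=500)):
            clf = make()
            clf.fit(X, y)
            classifiers.append(clf)
    return classifiers

def stack_select(classifiers, X_eval, y_eval, keep=15):
    """Stacking step, simplified: rank base classifiers on an evaluation
    partition and keep the best ones (the paper uses a meta-classifier
    rather than this plain accuracy ranking)."""
    ranked = sorted(classifiers, key=lambda c: c.score(X_eval, y_eval),
                    reverse=True)
    return ranked[:keep]

def bag_predict(selected, X):
    """Bagging step: unweighted majority vote over the selected
    classifiers; ties go to the fraud class here."""
    votes = np.stack([c.predict(X) for c in selected])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```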
Experiments: Data Understanding, Cost Model, Data Preparation, Modeling
Data Understanding The fraud detection data set used is an automobile insurance data set provided with the Angoss KnowledgeSeeker software. The main data set is split into a training data set and a scoring data set: the class labels of the training data are known, and the training data is historical relative to the scoring data. The data set contains 11,338 examples from January 1994 to December 1995 (training data) and 4,083 instances from January 1996 to December 1996 (scoring data). It has a 6% fraudulent and 94% legitimate distribution. The original data set has 6 numerical attributes and 25 categorical attributes.
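A minimal sketch of the chronological train/score split, assuming a pandas DataFrame; the file name and the "year" column name are hypothetical.

```python
import pandas as pd

claims = pd.read_csv("carclaims.csv")  # hypothetical file name

train = claims[claims["year"].between(1994, 1995)]  # 11,338 examples
score = claims[claims["year"] == 1996]              # 4,083 instances
```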
Cost Model The cost model has two assumptions: all alerts must be investigated, and the average cost per claim must be higher than the average cost per investigation. In 1996, the average cost per claim for the score data set is approximated at USD $2,640.
Cost Model The evaluation metrics for the predictive models on the score data set are the costs and the cost savings under the cost model, relative to the baseline of investigating no claims at all.
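Since the slide's formulas are not reproduced here, the following is only a hedged reconstruction of such a cost computation: missed frauds are assumed to be paid out in full, every alert incurs an investigation cost, and avg_investigation_cost is a hypothetical parameter.

```python
def cost_savings(hits, false_alarms, misses,
                 avg_claim_cost=2640.0,          # USD, from the slide
                 avg_investigation_cost=203.0):  # hypothetical value
    """Cost savings relative to investigating nothing (a reconstruction,
    not the paper's verbatim formula)."""
    # Baseline: no investigations, so every fraudulent claim is paid out.
    cost_no_action = (hits + misses) * avg_claim_cost
    # Model: every alert is investigated; missed frauds are still paid out.
    cost_model = ((hits + false_alarms) * avg_investigation_cost
                  + misses * avg_claim_cost)
    return cost_no_action - cost_model
```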
Data preparation In a related study, it is recommended that data partitions should be neither too large for the time complexity of the learning algorithms nor too small to produce poor classifiers. Different legal examples from the years 1994 and 1995 (10,840 legal examples) are randomly selected into eleven sets of y legal examples (923 each). The x fraud examples (615) are combined with a different set of y to form eleven x:y partitions (615:923) with a fraud:legal distribution of 40:60; replicating the same fraud examples into every partition amounts to minority over-sampling with replacement. Other possible distributions are 50:50 (923:923) and 30:70 (396:923). In rotation, each data partition of a given distribution is used for training, testing, and evaluation: a training partition is used to build a classifier, a test partition to optimize the classifier's parameters, and an evaluation partition to compare the classifier with others.
Data preparation Example rotation: the algorithm is trained on partition 1 to generate classifier 1, tested on partition 2 to refine the classifier, and evaluated on partition 3 to assess the classifier's expected accuracy.
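A minimal sketch of the partitioning and rotation scheme, assuming X_fraud and X_legal are arrays of feature rows; the 615:923 figures come from the slides, while the shuffling and wrap-around details are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_partitions(X_fraud, X_legal, n_parts=11, n_legal=923):
    """Build eleven 615:923 partitions (40:60 fraud:legal). The same 615
    fraud examples go into every partition (minority over-sampling with
    replacement); each partition gets a different, disjoint random set
    of 923 legal examples."""
    X_legal = rng.permutation(X_legal)
    parts = []
    for i in range(n_parts):
        legal_i = X_legal[i * n_legal:(i + 1) * n_legal]
        X = np.concatenate([X_fraud, legal_i])
        y = np.concatenate([np.ones(len(X_fraud)),   # 1 = fraud
                            np.zeros(len(legal_i))])  # 0 = legal
        parts.append((X, y))
    return parts

def rotate(parts, k):
    """Rotation k: train on partition k, test on k+1, evaluate on k+2."""
    n = len(parts)
    return parts[k], parts[(k + 1) % n], parts[(k + 2) % n]
```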
Modeling
Modeling In Figure 3, each rectangle represents an experiment, each circle depicts a comparison of cost savings between experiments, and each bold arrow indicates the best experiment from a comparison. The decision threshold (except for experiments V and IX) and the cost model remain unchanged across these experiments. Experiments V and IX produce continuous BP predictions that need to be converted into categorical ones using the decision threshold value.
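A one-line illustration of the threshold conversion, assuming the BP network outputs a fraud score in [0, 1]; the scores and the 0.5 threshold are illustrative, not values from the paper.

```python
import numpy as np

scores = np.array([0.12, 0.81, 0.47, 0.66])  # continuous BP outputs (made up)
threshold = 0.5                              # illustrative decision threshold
labels = (scores >= threshold).astype(int)   # 1 = fraud, 0 = legal
```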
Modeling
Modeling Table 4 lists the eleven tests, labeled A to K, which were repeated for each of experiments I to V; in other words, there are 55 tests in total for experiments I to V. Each test consisted of training, testing, evaluation, and scoring. The score set was the same for all classifiers, but the data partitions, labeled 1 to 11, were rotated. The overall success rate denotes the ability of an ensemble of classifiers to provide correct predictions. The bagged overall success rates X and Z were compared with the averaged overall success rates W and Y.
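A sketch of the two success-rate summaries being compared, assuming preds is a (classifiers x instances) array of 0/1 predictions and y the true labels; the W/X naming follows the slide, the computation details are assumptions.

```python
import numpy as np

def averaged_success_rate(preds, y):
    """W (or Y on the score set): mean accuracy of the individual
    classifiers, each judged on its own."""
    return np.mean([(p == y).mean() for p in preds])

def bagged_success_rate(preds, y):
    """X (or Z on the score set): accuracy of the unweighted majority
    vote of the ensemble."""
    vote = (np.asarray(preds).mean(axis=0) >= 0.5).astype(int)
    return (vote == y).mean()
```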
Modeling Experiments I, II, and III were designed to determine the best training distribution under the cost model: which of the three training distributions is best for the data partitions? Experiments IV and V used the best training distribution determined from comparison 1 and produce a bagged Z. Experiments VI, VII, and VIII determine which ensemble mechanism produces the best cost savings: experiment VI used bagging to combine the three sets of predictions from each algorithm, experiment VII used stacking to combine all predictions, and experiment VIII bagged the best classifiers determined by stacking.
Modeling Experiment IX implemented the BP algorithm on unsampled and unpartitioned data and was compared with the six experiments before it: which of the seven classifier systems attains the highest cost savings? Experiments X, XI, and XII were constructed to find out how each sampling method performs on unpartitioned data and whether it can yield better results than the multiple-classifier approach. Experiment XII's data consists of the same number of examples as XI's, but for XII the minority class was over-sampled with SMOTE. Can the best classifier system perform better than the sampling approaches in the results that follow?
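A sketch of the three sampling baselines on unpartitioned data, using the imbalanced-learn package; the package choice is an assumption (the paper predates it), and the 40:60 ratio shown is just one of the distributions from the slides.

```python
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

# X, y: unpartitioned training features and 0/1 labels (assumed given).
# Target a 40:60 fraud:legal distribution (one of the slides' settings).
ratio = 40 / 60

X_under, y_under = RandomUnderSampler(sampling_strategy=ratio).fit_resample(X, y)
X_over,  y_over  = RandomOverSampler(sampling_strategy=ratio).fit_resample(X, y)
X_smote, y_smote = SMOTE(sampling_strategy=ratio).fit_resample(X, y)
```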
Results Table 5 shows that in experiments I, II, and III the bagged success rates X outperformed all the averaged success rates W. When applied to the score set, the bagged success rate Z performed marginally better than the averaged success rate Y.
Results In Figure 4, experiment IV highlights C4.5 as the best learning algorithm for this particular automobile insurance data set. The resulting predictions of experiment VIII (stacking-bagging) were better than those of the C4.5 algorithm alone.
Results
Results In Figure 5, these three experiments performed comparably well at 40:60 and 50:50. Experiments XI and XII substantiate the claim that SMOTE is superior to minority over-sampling with replacement. Although the under-sampled data provides the highest cost savings of $165,242 at 60:40, it also incurs the highest expenditure (-$266,529), most likely because the number of legal examples becomes very small.
Discussion Table 6 ranks all the experiments by cost savings. Stacking-bagging achieves the highest cost savings, almost twice that of the conventional BP procedure used in much fraud detection work. The optimum success rate for the highest cost savings in this skewed data set is 60%; as the success rate increases further, cost savings decrease.
Discussion
Discussion Table 7 lists the top fifteen of the 33 classifiers, as ranked by stacking.
Conclusion In this paper, existing fraud detection methods are explored and a new fraud detection method is recommended. The choice of the three classification algorithms and one hybrid meta-learning technique is justified for the new method. Future work is to extend the fraud detection method based on Minority Report to find out which properties of a data set, data partition, or data cluster make one classifier more appropriate than another.
