DATA MODELLING
PRESENTATION
Assessment 3
School of Computing Technologies
Prepared by: Aliia Gismatullina
s4051304
TABLE OF
CONTENTS
INTRODUCTION TO THE CASE STUDY
UNDERSTANDING THE CHALLENGE
DATA PREPARATION AND EXPLORATION
MODELING APPROACH
K-Nearest Neighbors (KNN)
Decision Tree
HYPERPARAMETER TUNING AND MODEL EVALUATION
COMPARATIVE RESULTS OF CLASSIFIERS
CONCLUSION AND RECOMMENDATIONS
As part of the dynamic team at Revolution Consulting,
we've been presented with a compelling challenge
by Connect5G.
Despite offering premium services and ensuring the best possible coverage,
Connect5G has encountered a significant hurdle impacting its customer
base —the persistent issue of spam messages.
Our task was clear but far from simple: create a solution to mitigate this
problem and preserve the integrity of Connect5G's customer experience.
INTRODUCTION TO THE CASE STUDY
INTRODUCTION TO THE CASE STUDY
Our objective was to develop an innovative
machine learning solution capable of
automatically distinguishing between spam
and legitimate messages—termed 'ham'.
This approach aimed to automate the
classification process, making it both efficient
and reliable.
UNDERSTANDING THE CHALLENGE
• The Spam Problem at Connect5G
⚬ Rising customer dissatisfaction due to spam
texts.
• Previous Attempts to Solve
⚬ Ineffective blacklisting—spammers are always
a step ahead.
• The Need for a Sophisticated Solution
⚬ Shift to a machine learning solution for
dynamic spam detection.
• Objective
⚬ Automate spam detection to enhance user
experience without intrusion.
DATA PREPARATION
AND EXPLORATION
DATA PREPARATION AND
EXPLORATORY DATA ANALYSIS
• Dataset Overview
⚬ Rich dataset of spam and genuine
messages from the UK and Singapore.
• Challenge: Data Imbalance
⚬ A clear imbalance between spam and
genuine messages was identified: 86.8% of
messages are non-spam (ham) and 13.2%
are spam.
• Data Cleaning and Tokenization
⚬ Standardized the text format: converted to
lowercase and removed stopwords;
⚬ Segmented text into tokens for analysis.
• The initial dataset
• Cleaned dataset ready for modelling
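The cleaning and tokenization steps above can be sketched as follows. This is a minimal illustration, not the exact pipeline from the report: the stopword list here is a small hand-picked subset standing in for the full list used in practice.

```python
# Minimal sketch of the cleaning step: lowercase, strip punctuation,
# drop stopwords, and split the message into tokens.
import re

STOPWORDS = {"the", "a", "an", "to", "is", "you", "your", "and", "of"}

def clean_and_tokenize(text):
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # keep only letters, digits, spaces
    return [t for t in text.split() if t not in STOPWORDS]

print(clean_and_tokenize("WIN a FREE prize! Reply to claim your reward."))
# → ['win', 'free', 'prize', 'reply', 'claim', 'reward']
```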
DATA PREPARATION AND EXPLORATORY DATA ANALYSIS
• Feature extraction with CountVectorizer on the
sms_joined column produced a numerical
representation of the text data, restricted to the
top 1000 words by frequency.
• The most common words for spam and ham
SMS Messages
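The vectorization step looks roughly like this. The three-message corpus is a toy stand-in for the real SMS data; the report's setting of `max_features=1000` is kept to show where the "top 1000 words" limit is applied.

```python
# Sketch of the feature-extraction step: CountVectorizer turns each
# message into a row of word counts, keeping at most max_features of
# the most frequent words across the corpus.
from sklearn.feature_extraction.text import CountVectorizer

messages = [
    "win a free prize now",
    "are you free for lunch",
    "free entry win cash now",
]
vectorizer = CountVectorizer(max_features=1000)
X = vectorizer.fit_transform(messages)   # sparse matrix: one row per message
print(X.shape)                           # (3, number of distinct words kept)
print(sorted(vectorizer.vocabulary_))    # the learned word list
```

Note that CountVectorizer's default tokenizer drops single-character tokens such as "a", so no manual stopword pass is shown here.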
DATA
MODELLING
APPROACH
K-Nearest Neighbors (KNN) and
the Decision Tree classifier
• K-Nearest Neighbors (KNN)
⚬ Chosen for its simplicity and effectiveness in identifying
similar data points.
• Decision Tree Classifier
⚬ Valued for its structured decision-making process and
interpretability.
• Balancing the Dataset with SMOTE
⚬ Applied SMOTE to generate synthetic spam samples,
ensuring a balanced training set.
MODELING APPROACH
• KNN Algorithm Introduction
⚬ Simple to implement and interpret.
⚬ Effective for classification tasks.
• Data Preparation
⚬ Vectorization with CountVectorizer.
⚬ Feature set limited to top 1000 words by frequency.
• Balancing the Dataset
⚬ Class imbalance tackled with SMOTE.
⚬ Achieved a 50/50 balance for training.
KNN MODEL TRAINING AND EVALUATION
The plots are confusion-matrix heatmaps for
the KNN classifier:
1. without SMOTE;
2. with SMOTE.
• Top-left cell (True Label 0, Predicted Label 0):
True Negatives (TN), where the model correctly identified non-
spam messages.
• Top-right cell (True Label 0, Predicted Label 1):
False Positives (FP), where the model incorrectly flagged
non-spam messages as spam.
• Bottom-left cell (True Label 1, Predicted Label 0):
False Negatives (FN), where the model incorrectly identified
spam messages as non-spam.
• Bottom-right cell (True Label 1, Predicted Label 1):
True Positives (TP), where the model correctly identified spam
messages.
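This cell layout follows scikit-learn's convention for `confusion_matrix` (rows are true labels, columns are predictions), which a short check makes concrete. The labels here are toy values, not the report's data.

```python
# Illustrative check of the confusion-matrix layout: with labels 0 = ham
# and 1 = spam, cm[0, 1] counts ham flagged as spam (FP) and cm[1, 0]
# counts spam missed as ham (FN).
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)   # → 3 1 1 2
```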
HYPERPARAMETER TUNING AND
MODEL PERFORMANCE
• Hyperparameter Tuning
⚬ Explored 'n_neighbors' and 'metric' options.
⚬ Used GridSearchCV for optimization.
• KNN Model Evaluation
⚬ High precision for non-spam detection.
⚬ Good recall for spam detection.
⚬ F1-score demonstrates the model's balanced accuracy.
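The grid search over `n_neighbors` and `metric` can be sketched as below. The candidate values and the synthetic dataset are illustrative assumptions; the report's actual grid and tuned values are not reproduced here.

```python
# Sketch of the tuning step: GridSearchCV exhaustively evaluates each
# (n_neighbors, metric) pair with cross-validation and keeps the best.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

param_grid = {
    "n_neighbors": [3, 5, 7, 9],
    "metric": ["euclidean", "manhattan"],
}
search = GridSearchCV(KNeighborsClassifier(), param_grid,
                      cv=5, scoring="balanced_accuracy")
search.fit(X, y)
print(search.best_params_)
```

Scoring by `balanced_accuracy` rather than plain accuracy is a sensible choice here given the class imbalance discussed earlier.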
DECISION TREE MODEL TRAINING AND EVALUATION
• Decision Tree Introduction
⚬ Offers transparent decision-making paths.
⚬ Suited for complex classification.
• Hyperparameter Selection
⚬ Adjusted 'max_depth' and 'min_samples_leaf'.
⚬ Explored 'gini' and 'entropy' criteria.
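A minimal sketch of the Decision Tree setup with the hyperparameters named above; the specific values and the synthetic data are illustrative, not the tuned configuration from the report.

```python
# Sketch of the Decision Tree classifier: max_depth caps tree height,
# min_samples_leaf prevents tiny leaves, and criterion selects the
# split-quality measure ('gini' or 'entropy').
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=1)

tree = DecisionTreeClassifier(max_depth=10, min_samples_leaf=5,
                              criterion="entropy", random_state=1)
tree.fit(X, y)
print(tree.get_depth(), tree.score(X, y))
```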
DECISION TREE PERFORMANCE WITH AND WITHOUT
SMOTE
• Decision Tree Performance Metrics
⚬ Comparing accuracy and balanced accuracy.
⚬ With and without SMOTE application.
• Model Performance Comparison
⚬ Impact of SMOTE on predictive power.
Performance of a Decision Tree classifier on a spam detection task:
1. without the use of SMOTE;
2. with SMOTE.
1) Decision Tree without SMOTE:
⚬ True Negatives (TN): 913 (Non-spam
correctly identified as non-spam)
⚬ False Negatives (FN): 7 (Spam incorrectly
identified as non-spam)
⚬ False Positives (FP): 41 (Non-spam incorrectly
identified as spam)
⚬ True Positives (TP): 110 (Spam correctly
identified as spam)
2) Decision Tree with SMOTE:
⚬ True Negatives (TN): 907
⚬ False Negatives (FN): 13
⚬ False Positives (FP): 52
⚬ True Positives (TP): 99
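The headline metrics can be recomputed directly from the counts above (taking spam as the positive class). Using the without-SMOTE figures:

```python
# Metrics from the reported Decision Tree confusion matrix (no SMOTE):
# TN=913, FN=7, FP=41, TP=110.
tn, fn, fp, tp = 913, 7, 41, 110

accuracy = (tn + tp) / (tn + fn + fp + tp)
precision = tp / (tp + fp)   # of messages flagged as spam, fraction truly spam
recall = tp / (tp + fn)      # of actual spam, fraction caught
print(round(accuracy, 3), round(precision, 3), round(recall, 3))
# → 0.955 0.728 0.94
```

The same arithmetic applied to the with-SMOTE counts shows slightly lower values on this test split, consistent with the comparison on the next slide.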
COMPARATIVE RESULTS OF CLASSIFIERS
1. Accuracy and Balanced Accuracy
Comparison
• KNN with SMOTE shows significant
gains in both accuracy and
balanced accuracy.
• Decision Tree also benefits from
SMOTE, though less dramatically.
2. Training Time Evaluation
• Decision Tree with SMOTE has
longer training times, indicating a
more complex fitting process.
3. Prediction Time per Message
• Decision Tree models, especially
with SMOTE, offer rapid prediction
times, which is crucial for real-time
applications.
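The timing comparison can be sketched by wrapping `fit()` and `predict()` in `time.perf_counter()` calls; the data is synthetic and the absolute times are machine-dependent, but the pattern (KNN trains fast and predicts slowly, the tree trains slower and predicts quickly) is the one the comparison above relies on.

```python
# Sketch of measuring training time and per-message prediction time.
import time
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

for name, model in [("KNN", KNeighborsClassifier()),
                    ("Tree", DecisionTreeClassifier(random_state=0))]:
    t0 = time.perf_counter()
    model.fit(X, y)
    fit_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    preds = model.predict(X)
    pred_ms = (time.perf_counter() - t0) / len(X) * 1000  # ms per message

    print(f"{name}: fit {fit_s:.4f}s, predict {pred_ms:.5f} ms/message")
```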
CONCLUSION AND
RECOMMENDATIONS
CONCLUSION AND RECOMMENDATIONS
• Adopt SMOTE: Integrate SMOTE in the model training process
to enhance the model's ability to generalize and maintain
performance as spam tactics evolve.
• Prioritize Prediction Time: Given the real-time nature of
messaging, we advise prioritizing models that offer rapid
prediction capabilities, making the Decision Tree with SMOTE a
favorable choice.
• Monitor and Update: Continuously monitor the model's
performance and retrain with new data to adapt to emerging
spam trends, ensuring sustained effectiveness.
• User Feedback Loop: Implement a feedback system where
users can report misclassifications, providing valuable data to
further refine the model.
THANK YOU
