DATA MODELLING
PRESENTATION
Assessment 3
School of Computing Technologies
Prepared by: Aliia Gismatullina
s4051304
TABLE OF
CONTENTS
INTRODUCTION TO THE CASE STUDY
UNDERSTANDING THE CHALLENGE
DATA PREPARATION AND EXPLORATION
MODELING APPROACH
K-Nearest Neighbors (KNN)
Decision Tree
HYPERPARAMETER TUNING AND MODEL EVALUATION
COMPARATIVE RESULTS OF CLASSIFIERS
CONCLUSION AND RECOMMENDATIONS
As part of the dynamic team at Revolution Consulting,
we've been presented with a compelling challenge
by Connect5G.
Despite offering premium services and ensuring the best possible coverage,
Connect5G has encountered a significant hurdle impacting its customer
base —the persistent issue of spam messages.
Our task was clear but far from simple: create a solution to mitigate this
problem and preserve the integrity of Connect5G's customer experience.
INTRODUCTION TO THE CASE STUDY
INTRODUCTION TO THE CASE STUDY
Our objective was to develop an innovative
machine learning solution capable of
automatically distinguishing between spam
and legitimate messages—termed 'ham'.
This approach aimed to automate the
classification process, making it both efficient
and reliable.
UNDERSTANDING THE CHALLENGE
• The Spam Problem at Connect5G
⚬ Rising customer dissatisfaction due to spam
texts.
• Previous Attempts to Solve
⚬ Ineffective blacklisting—spammers are always
a step ahead.
• The Need for a Sophisticated Solution
⚬ Shift to a machine learning solution for
dynamic spam detection.
• Objective
⚬ Automate spam detection to enhance user
experience without intrusion.
DATA PREPARATION
AND EXPLORATION
DATA PREPARATION AND
EXPLORATORY DATA ANALYSIS
• Dataset Overview
⚬ Rich dataset of spam and genuine
messages from the UK and Singapore.
• Challenge: Data Imbalance
⚬ A clear imbalance between spam and
genuine messages was identified: 86.8% of
messages are non-spam (ham) and 13.2%
are spam.
• Data Cleaning and Tokenization
⚬ Standardized the text format: converted to
lowercase and removed stopwords;
⚬ Segmented text into tokens for analysis.
• The initial dataset
• Cleaned dataset ready for modelling
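The cleaning and tokenization steps above can be sketched as follows. This is a minimal illustration, not the exact pipeline from the report: the stopword list here is a small hand-picked subset standing in for the full list used in practice.

```python
# Minimal sketch of the cleaning step: lowercase, strip punctuation,
# drop stopwords, and split the message into tokens.
import re

STOPWORDS = {"the", "a", "an", "to", "is", "you", "your", "and", "of"}

def clean_and_tokenize(text):
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # keep only letters, digits, spaces
    return [t for t in text.split() if t not in STOPWORDS]

print(clean_and_tokenize("WIN a FREE prize! Reply to claim your reward."))
# → ['win', 'free', 'prize', 'reply', 'claim', 'reward']
```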
DATA PREPARATION AND EXPLORATORY DATA ANALYSIS
• Feature extraction with CountVectorizer on the
sms_joined column produced a numerical
representation of the text data, restricted to the
top 1000 words by frequency.
• The most common words for spam and ham
SMS Messages
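The vectorization step looks roughly like this. The three-message corpus is a toy stand-in for the real SMS data; the report's setting of `max_features=1000` is kept to show where the "top 1000 words" limit is applied.

```python
# Sketch of the feature-extraction step: CountVectorizer turns each
# message into a row of word counts, keeping at most max_features of
# the most frequent words across the corpus.
from sklearn.feature_extraction.text import CountVectorizer

messages = [
    "win a free prize now",
    "are you free for lunch",
    "free entry win cash now",
]
vectorizer = CountVectorizer(max_features=1000)
X = vectorizer.fit_transform(messages)   # sparse matrix: one row per message
print(X.shape)                           # (3, number of distinct words kept)
print(sorted(vectorizer.vocabulary_))    # the learned word list
```

Note that CountVectorizer's default tokenizer drops single-character tokens such as "a", so no manual stopword pass is shown here.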
DATA
MODELLING
APPROACH
K-Nearest Neighbors (KNN) and
the Decision Tree classifier
• K-Nearest Neighbors (KNN)
⚬ Chosen for its simplicity and effectiveness in identifying
similar data points.
• Decision Tree Classifier
⚬ Valued for its structured decision-making process and
interpretability.
• Balancing the Dataset with SMOTE
⚬ Applied SMOTE to generate synthetic spam samples,
ensuring a balanced training set.
MODELING APPROACH
• KNN Algorithm Introduction
⚬ Simple to implement and interpret.
⚬ Effective for classification tasks.
• Data Preparation
⚬ Vectorization with CountVectorizer.
⚬ Feature set limited to top 1000 words by frequency.
• Balancing the Dataset
⚬ Class imbalance tackled with SMOTE.
⚬ Achieved a 50/50 balance for training.
KNN MODEL TRAINING AND EVALUATION
The plots are confusion-matrix heatmaps for
the KNN classifier:
1. without SMOTE;
2. with SMOTE.
• Top-left cell (True Label 0, Predicted Label 0):
True Negatives (TN), where the model correctly identified non-
spam messages.
• Top-right cell (True Label 0, Predicted Label 1):
False Positives (FP), where the model incorrectly flagged
non-spam messages as spam.
• Bottom-left cell (True Label 1, Predicted Label 0):
False Negatives (FN), where the model incorrectly identified
spam messages as non-spam.
• Bottom-right cell (True Label 1, Predicted Label 1):
True Positives (TP), where the model correctly identified spam
messages.
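This cell layout follows scikit-learn's convention for `confusion_matrix` (rows are true labels, columns are predictions), which a short check makes concrete. The labels here are toy values, not the report's data.

```python
# Illustrative check of the confusion-matrix layout: with labels 0 = ham
# and 1 = spam, cm[0, 1] counts ham flagged as spam (FP) and cm[1, 0]
# counts spam missed as ham (FN).
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)   # → 3 1 1 2
```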
HYPERPARAMETER TUNING AND
MODEL PERFORMANCE
• Hyperparameter Tuning
⚬ Explored 'n_neighbors' and 'metric' options.
⚬ Used GridSearchCV for optimization.
• KNN Model Evaluation
⚬ High precision for non-spam detection.
⚬ Good recall for spam detection.
⚬ F1-score demonstrates the model's balanced accuracy.
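The grid search over `n_neighbors` and `metric` can be sketched as below. The candidate values and the synthetic dataset are illustrative assumptions; the report's actual grid and tuned values are not reproduced here.

```python
# Sketch of the tuning step: GridSearchCV exhaustively evaluates each
# (n_neighbors, metric) pair with cross-validation and keeps the best.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

param_grid = {
    "n_neighbors": [3, 5, 7, 9],
    "metric": ["euclidean", "manhattan"],
}
search = GridSearchCV(KNeighborsClassifier(), param_grid,
                      cv=5, scoring="balanced_accuracy")
search.fit(X, y)
print(search.best_params_)
```

Scoring by `balanced_accuracy` rather than plain accuracy is a sensible choice here given the class imbalance discussed earlier.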
DECISION TREE MODEL TRAINING AND EVALUATION
• Decision Tree Introduction
⚬ Offers transparent decision-making paths.
⚬ Suited for complex classification.
• Hyperparameter Selection
⚬ Adjusted 'max_depth' and 'min_samples_leaf'.
⚬ Explored 'gini' and 'entropy' criteria.
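A minimal sketch of the Decision Tree setup with the hyperparameters named above; the specific values and the synthetic data are illustrative, not the tuned configuration from the report.

```python
# Sketch of the Decision Tree classifier: max_depth caps tree height,
# min_samples_leaf prevents tiny leaves, and criterion selects the
# split-quality measure ('gini' or 'entropy').
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=1)

tree = DecisionTreeClassifier(max_depth=10, min_samples_leaf=5,
                              criterion="entropy", random_state=1)
tree.fit(X, y)
print(tree.get_depth(), tree.score(X, y))
```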
DECISION TREE PERFORMANCE WITH AND WITHOUT
SMOTE
• Decision Tree Performance Metrics
⚬ Comparing accuracy and balanced accuracy.
⚬ With and without SMOTE application.
• Model Performance Comparison
⚬ Impact of SMOTE on predictive power.
Performance of a Decision Tree classifier on a spam detection task:
1. without the use of SMOTE;
2. with SMOTE.
1) Decision Tree without SMOTE:
⚬ True Negatives (TN): 913 (Non-spam
correctly identified as non-spam)
⚬ False Negatives (FN): 7 (Spam incorrectly
identified as non-spam)
⚬ False Positives (FP): 41 (Non-spam incorrectly
identified as spam)
⚬ True Positives (TP): 110 (Spam correctly
identified as spam)
2) Decision Tree with SMOTE:
⚬ True Negatives (TN): 907
⚬ False Negatives (FN): 13
⚬ False Positives (FP): 52
⚬ True Positives (TP): 99
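The headline metrics can be recomputed directly from the counts above (taking spam as the positive class). Using the without-SMOTE figures:

```python
# Metrics from the reported Decision Tree confusion matrix (no SMOTE):
# TN=913, FN=7, FP=41, TP=110.
tn, fn, fp, tp = 913, 7, 41, 110

accuracy = (tn + tp) / (tn + fn + fp + tp)
precision = tp / (tp + fp)   # of messages flagged as spam, fraction truly spam
recall = tp / (tp + fn)      # of actual spam, fraction caught
print(round(accuracy, 3), round(precision, 3), round(recall, 3))
# → 0.955 0.728 0.94
```

The same arithmetic applied to the with-SMOTE counts shows slightly lower values on this test split, consistent with the comparison on the next slide.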
COMPARATIVE RESULTS OF CLASSIFIERS
1. Accuracy and Balanced Accuracy
Comparison
• KNN with SMOTE shows significant
gains in both accuracy and
balanced accuracy.
• Decision Tree also benefits from
SMOTE, though less dramatically.
2. Training Time Evaluation
• Decision Tree with SMOTE has
longer training times, indicating a
more complex fitting process.
3. Prediction Time per Message
• Decision Tree models, especially
with SMOTE, offer rapid prediction
times, which is crucial for real-time
applications.
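The timing comparison can be sketched by wrapping `fit()` and `predict()` in `time.perf_counter()` calls; the data is synthetic and the absolute times are machine-dependent, but the pattern (KNN trains fast and predicts slowly, the tree trains slower and predicts quickly) is the one the comparison above relies on.

```python
# Sketch of measuring training time and per-message prediction time.
import time
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

for name, model in [("KNN", KNeighborsClassifier()),
                    ("Tree", DecisionTreeClassifier(random_state=0))]:
    t0 = time.perf_counter()
    model.fit(X, y)
    fit_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    preds = model.predict(X)
    pred_ms = (time.perf_counter() - t0) / len(X) * 1000  # ms per message

    print(f"{name}: fit {fit_s:.4f}s, predict {pred_ms:.5f} ms/message")
```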
CONCLUSION AND
RECOMMENDATIONS
CONCLUSION AND RECOMMENDATIONS
• Adopt SMOTE: Integrate SMOTE in the model training process
to enhance the model's ability to generalize and maintain
performance as spam tactics evolve.
• Prioritize Prediction Time: Given the real-time nature of
messaging, we advise prioritizing models that offer rapid
prediction capabilities, making the Decision Tree with SMOTE a
favorable choice.
• Monitor and Update: Continuously monitor the model's
performance and retrain with new data to adapt to emerging
spam trends, ensuring sustained effectiveness.
• User Feedback Loop: Implement a feedback system where
users can report misclassifications, providing valuable data to
further refine the model.
THANK YOU
