NAME: POOJA SHAH
Date of Assignment: 18/11/23
Date of Submission: 11/12/23
Project 2
Title: EMPLOYEE CHURN PREDICTION
Project Aim
 To determine whether an employee will churn or not, as well as the loss incurred if they do churn.
 To create a system to prevent such churn and support the sustainable operation of the company.
 This capstone project aims to uncover the factors that lead to employee attrition and to explore important questions by developing an employee churn prediction system.
Overview of Project
Predicting employee churn involves using machine
learning models to forecast whether an employee is
likely to leave a company in the near future. This is
a crucial task for organizations as it allows them to
take preventive measures such as improving work
conditions, offering incentives, or providing career
development opportunities to retain valuable
employees.
Project Contents-
• Problem Formulation
• Data collection
• Importing libraries, loading and
understanding the data
• Exploratory Data Analysis
• Data Preprocessing
• Data Visualization
• Graphs Analysis
• Checking imbalance in dataset
• Balancing the data using SMOTE
• Feature Scaling
• Feature Extraction using PCA
• Model building & Evaluation
• Logistic Regression
• KNN
• Decision Tree Classifier
• Random Forest
• AdaBoost
• Support Vector Classifier
• Comparing different models
• Conclusion
Importing libraries, loading and
understanding the data-
• We will be using the following libraries (see the import sketch below):
1) pandas
2) NumPy
3) Seaborn
4) matplotlib.pyplot
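A minimal sketch of the imports and the data loading assumed for the rest of the walkthrough; the CSV filename is a placeholder, not the actual dataset path.

```python
# Core libraries used throughout the project.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Load the HR attrition dataset (filename is an assumed placeholder).
df = pd.read_csv("employee_attrition.csv")
```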
Problem Formulation, Data Collection & Loading the Dataset
Exploratory Data Analysis
• df.info() – The info() method returns summary information about the DataFrame, including the non-null count and dtype of each column.
Exploratory Data Analysis
• df.shape –
The shape attribute returns the overall number of rows and columns in the data.
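A quick sketch of these two checks:

```python
# Structural overview: dtypes and non-null counts, then dimensions.
df.info()
print(df.shape)   # (rows, columns)
```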
Exploratory Data Analysis
 df.isnull() - creates a DataFrame of the
same shape as df, where each entry is True
if the corresponding element in df is NaN
(null), and False otherwise.
 .sum() then calculates the sum of True
values along each column, resulting in a
Series that contains the total number of
missing values for each column.
 .to_frame() converts the Series into a
DataFrame.
 .rename(columns={0:"Total No. of
Missing Values"}) renames the column
containing the total number of missing
values to "Total No. of Missing Values."
missing_data["% of Missing Values"] =
df.isnull().mean()*100:
df.isnull().mean() calculates the proportion of missing values for each column by taking the mean (average) of the Boolean values in the DataFrame.
*100 then converts these proportions into percentages.
The result is assigned to a new column in
the missing_data DataFrame called "% of
Missing Values."
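Putting the steps above together, a minimal sketch (the variable name missing_data follows the slide):

```python
# Summary table of missing values per column.
missing_data = (
    df.isnull().sum()                                     # NaN count per column
      .to_frame()                                         # Series -> DataFrame
      .rename(columns={0: "Total No. of Missing Values"})
)
missing_data["% of Missing Values"] = df.isnull().mean() * 100
print(missing_data)
```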
Exploratory Data Analysis
• df.duplicated()
 This method flags duplicate rows in the data.
• df.duplicated().mean()*100
 This expresses the share of duplicate rows as a percentage.
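A short sketch of the duplicate check:

```python
# Count duplicate rows and express them as a percentage of all rows.
print(df.duplicated().sum())
print(df.duplicated().mean() * 100)
```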
Exploratory Data Analysis
• column_data_types = df.dtypes:
df.dtypes returns a Series containing the data
type of each column in the DataFrame.
 Counting numerical and categorical
columns:
 This loop iterates through each column in the
DataFrame and checks its data type.
• np.issubdtype(data_type, np.number)
 checks if the data type is a numerical type. If
true, it increments numerical_count; otherwise,
it increments categorical_count.
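A sketch of the counting loop described above (variable names follow the slide):

```python
# Count numerical vs. categorical columns by inspecting each dtype.
column_data_types = df.dtypes
numerical_count, categorical_count = 0, 0
for column, data_type in column_data_types.items():
    if np.issubdtype(data_type, np.number):
        numerical_count += 1
    else:
        categorical_count += 1
print("Numerical columns:", numerical_count)
print("Categorical columns:", categorical_count)
```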
• describe().T –
• It generates descriptive statistics of the DataFrame's numeric columns.
• .T is the transpose operation. It switches the rows and columns of the result obtained from describe().
• Count: The number of non-null values in each column.
• Mean: The average value of each column.
• Standard Deviation (std): It indicates how much individual
data points deviate from the mean.
• Minimum (min): The smallest value in each column.
• 25th Percentile (25%): Also known as the first quartile, it's
the value below which 25% of the data falls.
• Median (50%): Also known as the second quartile or the
median, it's the middle value when the data is sorted. It
represents the central tendency.
• 75th Percentile (75%): Also known as the third quartile,
it's the value below which 75% of the data falls.
• Maximum (max): The largest value in each column
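In code, this is a one-liner:

```python
# Transposed descriptive statistics: one row per numeric column.
print(df.describe().T)
```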
Pre-Processing
• df.rename(columns={"Attrition":
"Employee_Churn"}, inplace=True)
 The provided code is using the
rename method in pandas to rename a
column in a DataFrame.
• df.drop(columns=["Over18",
"EmployeeCount",
"EmployeeNumber",
"StandardHours"], inplace=True)
After executing this code, the
specified columns ("Over18",
"EmployeeCount",
"EmployeeNumber", and
"StandardHours") will be removed
from your DataFrame (df).
• df.columns
 Returns the names of all columns.
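A sketch of these pre-processing steps, taken directly from the slide:

```python
# Rename the target column and drop constant / identifier columns.
df.rename(columns={"Attrition": "Employee_Churn"}, inplace=True)
df.drop(columns=["Over18", "EmployeeCount", "EmployeeNumber", "StandardHours"],
        inplace=True)
print(df.columns)
```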
Pre-Processing
 We will see the names of categorical columns and numerical columns in the DataFrame
printed to the console. This information can be helpful for further analysis, preprocessing, or
visualization tasks that may require handling different types of data separately.
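One way to produce those two lists; select_dtypes is an assumption, and the original code may iterate over df.dtypes instead:

```python
# Split column names into categorical and numerical groups.
categorical_cols = df.select_dtypes(include="object").columns.tolist()
numerical_cols = df.select_dtypes(include=np.number).columns.tolist()
print("Categorical columns:", categorical_cols)
print("Numerical columns:", numerical_cols)
```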
Pre-Processing
This code is a common approach for identifying and handling outliers in a dataset using the
IQR method, and it also provides visualizations to assess the impact of the outlier handling
process. It ensures that extreme outliers do not unduly affect the analysis of the data.
 The result is a grid of boxplots, where each subplot corresponds to a numerical column in the
DataFrame. This visualization is useful for understanding the distribution and variability of
values in each numerical feature.
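A minimal sketch of IQR-based outlier handling followed by the boxplot grid; capping values at the IQR fences is an assumed strategy, since the original code is not shown:

```python
# Cap extreme values at the IQR fences, then draw a boxplot per numerical column.
for col in numerical_cols:
    q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    iqr = q3 - q1
    df[col] = df[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

n_cols = 4
n_rows = int(np.ceil(len(numerical_cols) / n_cols))
fig, axes = plt.subplots(n_rows, n_cols, figsize=(16, 4 * n_rows))
for ax, col in zip(axes.ravel(), numerical_cols):
    sns.boxplot(y=df[col], ax=ax)
    ax.set_title(col)
plt.tight_layout()
plt.show()
```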
VISUALISATION – UNIVARIATE ANALYSIS – count plot & Pie Chart sub plot
• The result is a figure containing a count plot and a pie chart, both illustrating employee
churn in terms of counts and percentages, respectively. The count plot shows the
distribution of churn and non-churn instances, while the pie chart provides a visual
representation of the churn rate as a percentage.
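A sketch of the two-panel figure described above:

```python
# Count plot and pie chart of the churn distribution side by side.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
sns.countplot(x="Employee_Churn", data=df, ax=ax1)
ax1.set_title("Employee Churn Counts")
df["Employee_Churn"].value_counts().plot.pie(autopct="%1.1f%%", ax=ax2)
ax2.set_ylabel("")
ax2.set_title("Employee Churn Rate (%)")
plt.show()
```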
VISUALISATION – BIVARIATE ANALYSIS – count plot
• Bivariate analysis is a
statistical analysis
technique that involves the
examination of the
relationship between two
variables. It is often used to
understand how one
variable affects or is related
to another variable.
• We then create count plots
for 2 categorical variables
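A sketch of the bivariate count plots against churn; the two categorical columns shown are assumptions chosen for illustration:

```python
# Categorical feature vs. churn, one count plot per feature.
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
for ax, col in zip(axes, ["Department", "OverTime"]):   # assumed columns
    sns.countplot(x=col, hue="Employee_Churn", data=df, ax=ax)
    ax.set_title(f"{col} vs. Employee Churn")
plt.tight_layout()
plt.show()
```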
VISUALISATION – BIVARIATE ANALYSIS – Hist Plot
• The provided code defines a function named hist_plot that creates a histogram with a kernel
density estimate (KDE) for a specified column in a DataFrame (df).
• plt.show() is used to display all the created plots.
• Each histogram provides a visual representation of the distribution of the specified
numerical columns, and the bars are colored based on whether an employee has churned or
not (as indicated by the 'Employee_Churn' column). This allows for a quick comparison of
the distributions for employees who have churned versus those who haven't in terms of age,
monthly income, and years at the company.
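A sketch of the hist_plot helper; the exact column names are assumptions based on the features mentioned (age, monthly income, years at the company):

```python
# Histogram + KDE for a column, colored by churn status.
def hist_plot(column):
    plt.figure(figsize=(8, 4))
    sns.histplot(data=df, x=column, hue="Employee_Churn", kde=True)
    plt.title(f"Distribution of {column} by Employee Churn")

for col in ["Age", "MonthlyIncome", "YearsAtCompany"]:   # assumed column names
    hist_plot(col)
plt.show()
```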
VISUALISATION – MULTIVARIATE ANALYSIS – scatter plot
• Scatter plots are used to visualize
the relationship between two
continuous variables.
• Each data point is plotted on a
graph, with one variable on the x-
axis and the other on the y-axis.
• This helps you visualize patterns,
trends, and potential correlations between the variables.
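A minimal sketch; the plotted columns are assumptions:

```python
# Two continuous variables with churn encoded as the hue.
sns.scatterplot(data=df, x="Age", y="MonthlyIncome", hue="Employee_Churn")
plt.title("Age vs. Monthly Income by Employee Churn")
plt.show()
```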
REPLACE
• df['Employee_Churn']:
 This selects the 'Employee_Churn' column in the DataFrame df.
• .replace({'No': 0, 'Yes': 1}):
 This method replaces values in the specified column according to
the provided dictionary. In this case, it replaces 'No' with 0 and 'Yes'
with 1.
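Combined, the mapping is one line:

```python
# Encode the target: 'No' -> 0, 'Yes' -> 1.
df["Employee_Churn"] = df["Employee_Churn"].replace({"No": 0, "Yes": 1})
```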
LABEL ENCODER
• This code defines a function
named labelencoder that uses
scikit-learn's LabelEncoder to
encode categorical columns in a
pandas DataFrame into numerical
values.
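A sketch of what the labelencoder helper might look like, assuming it encodes every remaining object-dtype column:

```python
# Encode categorical (object-dtype) columns into integers.
from sklearn.preprocessing import LabelEncoder

def labelencoder(dataframe):
    le = LabelEncoder()
    for col in dataframe.select_dtypes(include="object").columns:
        dataframe[col] = le.fit_transform(dataframe[col])
    return dataframe

df = labelencoder(df)
```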
FEATURE SELECTION
This code is a useful way to visualize the pairwise correlations between features in your dataset. It helps identify relationships between variables and can be valuable for feature selection and understanding the underlying structure of your data.
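The original code is not shown; a correlation heatmap is one common way to produce this view:

```python
# Pairwise correlation heatmap of the (now fully numeric) features.
plt.figure(figsize=(18, 12))
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Feature Correlation Heatmap")
plt.show()
```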
Checking For Imbalance In Dataset
The code creates a pie chart to visually represent the imbalanced data, where the two slices represent the "Churn" and "Not Churn" classes, with different explode offsets and colors used to highlight the imbalance.
The percentages of each class are displayed on the chart, and a legend is added for clarity.
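A minimal sketch; the explode offsets and colors are illustrative, and it assumes the majority (non-churn) class comes first in value_counts:

```python
# Pie chart highlighting the class imbalance in the target.
counts = df["Employee_Churn"].value_counts()
plt.pie(counts, labels=["Not Churn", "Churn"], autopct="%1.1f%%",
        explode=(0, 0.1), colors=["#4c72b0", "#dd8452"])
plt.legend(["Not Churn", "Churn"])
plt.title("Class Imbalance in Employee_Churn")
plt.show()
```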
Balancing The Data using SMOTE
 SMOTE (Synthetic Minority Over-sampling Technique) is applied to the training data to generate synthetic samples for the minority class (the class treated as the minority is specified by the sampling_strategy parameter).
 This way, you can address class imbalance in your dataset and create a balanced training set for your machine learning models.
 We split our data before applying SMOTE, as sketched below.
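A minimal sketch, assuming the imbalanced-learn package and an illustrative test_size and sampling_strategy:

```python
# Split first, then oversample only the training portion with SMOTE.
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

X = df.drop(columns=["Employee_Churn"])
y = df["Employee_Churn"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

smote = SMOTE(sampling_strategy="minority", random_state=42)
x_sampled, y_sampled = smote.fit_resample(X_train, y_train)
print(y_sampled.value_counts())   # both classes now have equal counts
```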
 The bar plot
provides a visual
representation of
the balanced or
adjusted
distribution of
classes in the
target variable
after SMOTE.
FEATURE SCALING
 Standardization, also known as feature scaling or normalization, is a preprocessing technique commonly used in machine learning to bring all features or variables to a similar scale.
 This process helps algorithms perform better by ensuring that no single feature dominates the learning process due to its larger magnitude. It is particularly important for algorithms that rely on distances or gradients, such as k-nearest neighbors, as it ensures that all features contribute equally to the computations.
 The goal of standardization is to transform the features so that they have a mean of 0 and a standard deviation of 1. This transformation does not change the shape of the distribution of the data; it simply scales and shifts the data to make it more suitable for modeling.
 In this case, the features in x_sampled are standardized using the StandardScaler, and the result is stored in the DataFrame standard_df. Each column in standard_df now represents a standardized version of the corresponding feature in the original dataset.
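A sketch of the scaling step, reusing x_sampled from the SMOTE snippet:

```python
# Standardize the resampled features to zero mean and unit variance.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
standard_df = pd.DataFrame(scaler.fit_transform(x_sampled),
                           columns=x_sampled.columns)
print(standard_df.describe().T[["mean", "std"]])   # means ~0, stds ~1
```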
PCA – PRINCIPAL COMPONENT ANALYSIS
PCA stands for Principal Component Analysis. It is a dimensionality reduction technique commonly used in machine learning and statistics.
The main goal of PCA is to transform high-dimensional data into a new coordinate system, capturing the most important information while minimizing information loss.
PCA achieves this by finding a set of orthogonal axes (principal components) along which the data varies the most.
FEATURE EXTRACTION USING PCA
KEY STEPS IN PCA
Standardization: Standardize the features (subtract the mean and divide by the standard
deviation) to ensure that all features have a similar scale.
Covariance Matrix: Compute the covariance matrix for the standardized data. The covariance
matrix represents the relationships between pairs of features.
Eigenvalue Decomposition: Perform eigenvalue decomposition on the covariance matrix. This
yields a set of eigenvalues and corresponding eigenvectors.
Principal Components: The eigenvectors represent the principal components. These are the
directions in feature space along which the data varies the most. The corresponding eigenvalues
indicate the amount of variance captured by each principal component.
Projection: Project the original data onto the new coordinate system defined by the principal
components. This results in a reduced-dimensional representation of the data.
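A minimal PCA sketch on the standardized features; keeping roughly 95% of the variance is an assumed choice:

```python
# Fit PCA and project the standardized features onto the principal components.
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)    # retain components explaining ~95% of the variance
x_pca = pca.fit_transform(standard_df)
print("Components kept:", pca.n_components_)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```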
TRAIN TEST SPLIT
 By splitting your data into training and testing sets, you can use X_train and y_train to train
your machine learning model and then use X_test to evaluate its performance.
 This is a common practice to assess how well your model generalizes to unseen data.
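Since the split was made before SMOTE, the held-out test set still has to pass through the same fitted scaler and PCA before evaluation. A sketch, reusing the objects from the earlier snippets:

```python
# Bring the test set into the same feature space as the training data.
X_train_pca = x_pca                                     # balanced, scaled, PCA-transformed
X_test_pca = pca.transform(scaler.transform(X_test))    # same transformers, fitted on training data
```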
MODEL BUILDING, CLASSIFICATION
REPORT & EVALUATION
• We will now build and evaluate the following models (a combined training sketch follows this list):
• Logistic Regression
• K-Nearest Neighbors
• Decision Tree Classifier
• Random Forest
• Ada Boost
• Support Vector Classifier
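A combined sketch that trains each of the six models on the prepared features and records its metrics; all hyperparameters are scikit-learn defaults, so it will not reproduce the exact scores reported on the following slides:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, roc_auc_score

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "SVC": SVC(probability=True, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train_pca, y_sampled)                   # train on the balanced set
    y_pred = model.predict(X_test_pca)
    y_proba = model.predict_proba(X_test_pca)[:, 1]     # churn probability
    results[name] = {
        "report": classification_report(y_test, y_pred, output_dict=True),
        "auc": roc_auc_score(y_test, y_proba),
    }
    print(name, "AUC:", round(results[name]["auc"], 4))
```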
Classification Report
• A classification report is a summary of the performance metrics for a classification model.
• Precision: Precision is a measure of how many of the predicted positive instances were actually true positives.
• Precision = (True Positives) / (True Positives + False Positives)
• High precision indicates that the model makes fewer false positive errors.
• Recall (also known as Sensitivity or True Positive Rate): Recall measures the proportion of actual positive instances that were
correctly predicted by the model.
• Recall = (True Positives) / (True Positives + False Negatives)
• High recall indicates that the model captures a large portion of the positive instances.
• F1-Score: The F1-Score is the harmonic mean of precision and recall. It balances the trade-off between precision and recall
and is particularly useful when you want to consider both false positives and false negatives.
• F1-Score = 2 * (Precision * Recall) / (Precision + Recall)
• The F1-Score ranges between 0 and 1, where a higher value indicates a better balance between precision and recall.
• Support: Support represents the number of instances in each class in the test dataset. It gives you an idea of the distribution of
data across different classes.
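For any fitted model, the report is a single call; for example, using the models dictionary from the sketch above:

```python
# Text classification report for one model on the test set.
print(classification_report(y_test, models["Logistic Regression"].predict(X_test_pca)))
```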
AUCROC_CURVE
• This code helps you visualize the performance of the model in terms of its ability to discriminate between the positive and negative classes. The higher the AUC score, the better the model's performance.
Interpreting the AUC:
0.5 (Random Classifier): If the AUC is 0.5, it means that the model's performance is no better than random chance.
It's essentially saying that the model cannot distinguish between positive and negative cases effectively.
< 0.5 (Worse than Random): If the AUC is less than 0.5, it suggests that the model's performance is worse than
random chance. It is misclassifying cases in the opposite direction.
> 0.5 (Better than Random): If the AUC is greater than 0.5, it indicates that the model is performing better than
random chance. The higher the AUC, the better the model is at discriminating between the classes.
1.0 (Perfect Classifier): An AUC of 1.0 represents a perfect classifier. This means the model achieves perfect
discrimination, correctly classifying all positive cases while avoiding false positives.
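The slides refer to an AUCROC_curve helper; a sketch of what such a function might look like:

```python
# Plot the ROC curve and report the AUC for a fitted classifier.
from sklearn.metrics import roc_curve, roc_auc_score

def AUCROC_curve(model, X, y):
    y_proba = model.predict_proba(X)[:, 1]
    fpr, tpr, _ = roc_curve(y, y_proba)
    plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y, y_proba):.4f}")
    plt.plot([0, 1], [0, 1], linestyle="--", label="Random classifier")
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.legend()
    plt.show()

AUCROC_curve(models["Logistic Regression"], X_test_pca, y_test)
```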
Logistic Regression – Modelling & Classification Report
• Logistic regression is a statistical
and machine learning model
used for binary classification,
which means it's used when the
target variable (the variable you
want to predict) has two possible
outcomes or classes.
• Classification Report
              Class 0   Class 1
  Precision     0.79      0.83
  Recall        0.86      0.74
  F1 Score      0.83      0.78
AUCROC_CURVE - Evaluation
AUC Score – 0.8964
K-Nearest Neighbour (KNN)– Modelling &
Classification Report
• KNN operates based on the principle
that similar data points tend to have
similar labels or values.
• It's a non-parametric algorithm, which
means it doesn't make assumptions
about the underlying data distribution.
• KNN considers all available training
data when making predictions, which
can be advantageous in some cases but
might be computationally expensive for
large datasets.
• Classification Report
              Class 0   Class 1
  Precision     0.94      0.83
  Recall        0.83      0.94
  F1 Score      0.88      0.88
AUCROC_CURVE - Evaluation
AUC Score – 0.9325
Decision Tree – Modelling & Classification Report
• A Decision Tree is a popular
supervised ML algorithm used
for both classification and
regression tasks. It is a non-
parametric, non-linear model that
makes predictions by recursively
partitioning the dataset into
subsets based on the most
significant attribute(s) at each
node.
• Classification Report
              Class 0   Class 1
  Precision     0.78      0.73
  Recall        0.74      0.76
  F1 Score      0.76      0.74
AUCROC_CURVE - Evaluation
AUC Score – 0.7527
Random Forest– Modelling & Classification Report
• Random Forest is an ensemble
machine learning algorithm that is
widely used for both classification
and regression tasks. It is a
powerful and versatile algorithm
known for its high accuracy and
robustness. Random Forest builds
multiple decision trees during
training and combines their
predictions to produce more reliable
and generalizable results.
• Classification Report
              Class 0   Class 1
  Precision     0.82      0.91
  Recall        0.93      0.78
  F1 Score      0.87      0.84
AUCROC_CURVE - Evaluation
AUC Score – 0.9374
Feature importance
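The original chart is not reproduced here; a sketch of how Random Forest importances over the principal components might be plotted (the component labels are illustrative):

```python
# Bar chart of Random Forest feature importances on the principal components.
rf = models["Random Forest"]
importances = pd.Series(
    rf.feature_importances_,
    index=[f"PC{i + 1}" for i in range(X_train_pca.shape[1])],
).sort_values(ascending=False)
importances.plot.bar(title="Random Forest Feature Importance")
plt.show()
```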
AdaBoost – Modelling & Classification Report
• AdaBoost, short for Adaptive
Boosting, is an ensemble learning
method used for classification and
regression tasks. It is particularly
effective in improving the
performance of weak learners
(models that perform slightly better
than random chance). The basic
idea behind AdaBoost is to combine
multiple weak learners to create a
strong classifier.
• Classification Report
              Class 0   Class 1
  Precision     0.79      0.80
  Recall        0.83      0.76
  F1 Score      0.81      0.78
AUCROC_CURVE - Evaluation
AUC Score – 0.8904
Support Vector Classifier– Modelling & Classification
Report
• SVMs are adaptable and efficient
in a variety of applications
because they can manage high-
dimensional data and nonlinear
relationships.
• The SVM algorithm can ignore outliers and find the hyperplane that maximizes the margin, which makes SVM robust to outliers.
• Classification Report
              Class 0   Class 1
  Precision     0.85      0.90
  Recall        0.92      0.82
  F1 Score      0.89      0.85
AUCROC_CURVE - Evaluation
AUC Score – 0.9524
COMPARING CLASSIFICATION REPORT & AUC SCORE OF VARIOUS MODELS
• Creating dictionary to compare Classification report and AUC Score of different models
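A sketch of that comparison, built from the results dictionary collected in the model-building loop (reporting the class-1 metrics is one reasonable choice):

```python
# Side-by-side table of precision, recall, F1 (for the churn class) and AUC.
comparison = pd.DataFrame({
    name: {
        "Precision (class 1)": res["report"]["1"]["precision"],
        "Recall (class 1)": res["report"]["1"]["recall"],
        "F1 (class 1)": res["report"]["1"]["f1-score"],
        "AUC": res["auc"],
    }
    for name, res in results.items()
}).T
print(comparison.round(4))
```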
Conclusion-
• In this employee churn prediction process, we started by examining a dataset with 1470 rows and 35 columns. It contained numerical and categorical variables, and we noticed an imbalance in the Employee_Churn column.
•To address the data's characteristics, we performed data preprocessing.
• We bifurcated data into categorical and numerical to find any outliers using boxplot.
• Visualization was done using 3 types
 Univariate Analysis – Count plot & Pie Chart
 Bivariate Analysis – Count plots & Hist plots
 Multivariate Analysis – Scatter diagram
• Later we balanced the imbalanced data using SMOTE.
• Standardization was used to scale the features for better model performance.
• Principal Component Analysis, a dimensionality reduction technique, was used to transform the high-dimensional data into a new coordinate system, capturing the most important information while minimizing information loss.
• We divided the dataset into training and testing sets and explored six different machine learning models: Logistic Regression, K-Nearest Neighbour, Decision Tree Classifier, Random Forest, AdaBoost and Support Vector Classifier.
• However, our focus was on the SVC, which showed the best performance in terms of accuracy, AUC score and precision.
• The chosen Support Vector Classifier model achieved an accuracy of approximately 87.387%, a precision score of 0.9047 and an AUC score of 0.9524. This performance helps ensure that the company can minimize potential employee churn.
• After applying PCA, the top two influential factors were identified for PC1 and PC3.
• In conclusion, by systematically preprocessing the data and selecting the right model, we successfully built a model that predicts employee churn accurately. This helps the company make more informed retention decisions and reduces the financial impact of unexpected attrition.
Thank you!!!
  • 61. • We divided the dataset into training and testing sets and explored Six different machine learning models: Logistic Regression, K-Nearest Neighbour, Decision Tree Classifier, Random forest, AdaBoost and Support Vector Classifier. • However, our focus was on SVC , which showed the best performance in terms of accuracy, AUC Score and Precision. • The chosen Support Vector Classifier model achieved an accuracy of approximately 87.387%, Precision score of 0.9047, & AUC score is 0.9524. This result ensures that the company can minimize potential employee churn. • The top two influential factors for PC1 and PC 3 after applying PCA for prediction Detection. • In conclusion, by systematically preprocessing the data, selecting the right model, we successfully built a model that enhances accuracy and lowers the risk of employee churn prediction. This helps the company make more informed lending decisions and reduces the chances of financial setbacks. Conclusion-