Employee Retention Prediction: A Data Science Project by Devangi Shukla
Employee Retention Prediction
Descriptive Objective:
This project focuses on predicting employee attrition using
machine learning to help companies anticipate when
employees are likely to leave. By analysing historical HR data,
including factors like job role, tenure, and demographics, the
goal is to predict turnover and take proactive retention
measures. This approach falls under HR Analytics, which
leverages data to improve workforce management.
Importance:
By predicting attrition, organizations can implement
retention strategies like improving career development or
enhancing workplace conditions. This also allows businesses
to plan future hiring needs, reducing the costs of turnover
and improving workforce stability. In summary, this solution
helps companies make informed, data-driven decisions to
retain talent, reduce churn, and improve long-term success.
The Challenges of Employee Turnover
Financial Impact
Turnover costs businesses significant
resources in recruitment, training,
and lost productivity.
Team Morale
Losing valuable employees disrupts
team dynamics and can negatively
impact morale.
Loss of Expertise
The departure of experienced
employees can lead to a loss of
institutional knowledge and
expertise.
Factors Influencing
Employee Retention
1 Compensation and
Benefits
Competitive salaries and
benefits packages are
essential for attracting and
retaining talent.
2 Work-Life Balance
Employees value flexible
work arrangements and
opportunities to prioritize
their well-being.
3 Career Growth Opportunities
Clear paths for advancement, training, and development
motivate employees to stay.
INFORMATION ON THE DATASET
company_size: Size of the company where the enrollee works or worked.
city: City where the enrollee is located.
city_development_index: Indicator of the development level of the city.
enrollee_id: Unique identifier for each enrollee.
major_discipline: Academic field or discipline of the enrollee's major.
relevent_experience: Whether the enrollee has relevant work experience.
experience: Number of years of work experience.
education_level: Highest level of education attained by the enrollee.
gender: Gender of the enrollee.
training_hours: Total number of hours spent on training.
company_type: Type or category of the company.
DATA UNDERSTANDING AND
INSIGHTS
EDA- EXPLORATORY DATA ANALYSIS
HANDLING MISSING VALUES
DATA ENCODING AND OUTLIER
DETECTION
MODEL BUILDING – LR, XG, RF
CONCLUSION
LIST OF CONTENTS
FOLLOWING LIBRARIES HAVE BEEN USED
Descriptions of these libraries are as follows:
* Pandas for DataFrame operations
* NumPy for numeric operations
* Matplotlib and Seaborn for data visualisation
* Scikit-Learn for all the machine learning algorithms
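A typical import cell for this stack might look like the following (the aliases and the specific scikit-learn modules shown are assumptions; the original notebook's imports are not reproduced in this export):

```python
import pandas as pd                  # DataFrame operations
import numpy as np                   # numeric operations
import matplotlib.pyplot as plt      # plotting
import seaborn as sns                # statistical visualisation
from sklearn.model_selection import train_test_split  # data splitting
from sklearn.linear_model import LogisticRegression   # baseline model
```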
Descriptive Statistics
Descriptive Statistics for Numerical Columns:

| Column                 | Mean  | Std Dev | Min   | 25th Percentile | 50th Percentile (Median) | 75th Percentile | Max   |
|------------------------|-------|---------|-------|-----------------|--------------------------|-----------------|-------|
| city_development_index | 0.775 | 0.075   | 0.624 | 0.748           | 0.776                    | 0.834           | 0.920 |
| experience             | 6.4   | 6.5     | <1    | <1              | 5                        | >20             | >20   |
| training_hours         | 45.2  | 30.4    | 8     | 47              | 52                       | 83              | 83    |
| target                 | 0.6   | 0.5     | 0     | 0               | 1                        | 1               | 1     |
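A table like this is typically produced with pandas' describe(); a minimal sketch on toy data (the values below are illustrative, not the project's dataset):

```python
import pandas as pd

# toy stand-ins for three of the numerical columns
df = pd.DataFrame({
    "city_development_index": [0.624, 0.748, 0.776, 0.834, 0.920],
    "training_hours": [8, 47, 52, 60, 83],
    "target": [0, 0, 1, 1, 1],
})

# describe() returns count, mean, std, min, quartiles and max per column
stats = df.describe()
print(stats.loc["mean", "target"])  # mean of the toy target column
```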
Descriptive Statistics for Categorical Columns:
OBSERVATIONS FOR THE DESCRIPTIVE STATISTICS
city_development_index: shows moderate variation in city development (ranging from 0.624 to 0.920).
Experience: heavily skewed towards higher values (many employees have more than 5 years of experience, with a few having <1 year or >20 years).
Training hours: range widely, with some employees having very high training hours (up to 83 hours).
Attrition (Target): has a fairly balanced distribution between 0 and 1, suggesting that employees who stay and leave are relatively equally represented.
Gender: predominantly male (80% of the entries), while education level is mostly Graduate (80%).
DATA VISUALISATION
The dataset is primarily concentrated in cities with higher development
indices (e.g., 0.920), while cities with lower indices (e.g., 0.625) have
minimal representation.
The dataset shows a significant gender imbalance, with the majority of entries
being male (17729), followed by a smaller number of females (1238) and a few
entries with missing or unspecified gender (191).
The dataset indicates that most employees (13792) have relevant
experience, while a smaller group (5366) lacks relevant
experience.
The majority of employees have no enrolment in university programs
(14203), followed by those enrolled in full-time courses (3757), with a smaller
number in part-time courses (1198).
DATA VISUALISATION
The dataset shows that most employees
have a graduate education level (11598),
followed by those with master's degrees
(4361), and fewer with doctoral or other
higher education levels.
The dataset reveals that the majority
of employees come from STEM
disciplines (14492), followed by those
with a Business Degree (2813), and
smaller groups with other disciplines
like Arts and Humanities.
The dataset shows the largest groups
of employees work in companies with
50-99 employees (3083), followed
by 100-500 employees (2571), with
smaller representation in larger
companies (10000+ and 5000-9999).
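Category counts like those quoted above are typically obtained with pandas' value_counts() (and visualised with seaborn's countplot); a sketch on a toy column standing in for the real gender field (the counts here are illustrative, not the dataset's):

```python
import pandas as pd

# toy stand-in for the 'gender' column, including a missing entry
gender = pd.Series(["Male"] * 5 + ["Female"] * 2 + [None])

# dropna=False also counts the missing/unspecified entries
counts = gender.value_counts(dropna=False)
print(counts)
```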
HANDLING MISSING VALUES
THESE ARE THE MISSING VALUES IN EACH COLUMN IN PERCENTAGES:
THAT SHOWS:
A substantial share of the data is missing for the following variables:
Gender (22.5%)
Major_Discipline (15%)
Company_Size (31%)
Company_Type (32%)
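Percentages like these can be computed from the share of nulls per column; a sketch on toy data:

```python
import pandas as pd

# toy frame: half the gender values are missing, experience is complete
df = pd.DataFrame({
    "gender": ["M", None, "F", None],
    "experience": [1, 2, 3, 4],
})

# isna().mean() gives the fraction of missing values per column
missing_pct = df.isna().mean() * 100
print(missing_pct)
```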
We cleaned the data and removed the missing values with the code below, where we defined the *cleanNaN* function with parameter (dfa).
That resulted in the cleaned data:
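The cleanNaN code itself is not reproduced in this export; a plausible minimal reconstruction, assuming it simply drops rows containing missing values (the dropna strategy and the reset of the index are assumptions):

```python
import pandas as pd

def cleanNaN(dfa):
    """Hypothetical reconstruction: drop rows with any missing value."""
    return dfa.dropna().reset_index(drop=True)

# toy frame with one incomplete row
dfa = pd.DataFrame({"gender": ["M", None, "F"], "experience": [1, 2, 3]})
cleaned = cleanNaN(dfa)
print(len(cleaned))  # 2
```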
OUTLIER DETECTION AND Z-SCORE
Performing z-test to remove outliers in Train Dataset
Performing z-test to remove outliers in Test Dataset
Outlier detection using the Z-score involves calculating the number of standard deviations a data point lies from the mean; points with an absolute Z-score above a threshold (e.g., 3) are considered outliers. These outliers can be removed or capped to reduce their impact on modeling.
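A sketch of that Z-score filter on toy data, using the conventional |Z| < 3 rule described above (the data and column name are illustrative):

```python
import pandas as pd

# 20 typical values plus one extreme outlier
df = pd.DataFrame({"training_hours": [50] * 20 + [500]})

# Z-score: how many standard deviations each point lies from the mean
col = df["training_hours"]
z = (col - col.mean()) / col.std()

# keep only points within the conventional threshold of 3
filtered = df[z.abs() < 3]
print(len(filtered))  # 20
```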
Normalization of Data
The resulting data, after sampling, needs to be normalized to a fixed range of values so that the model won't be biased towards variables with large magnitudes.
The data has been normalized to values between 0 & 1, independently of the statistical distribution each variable follows.
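Scaling to the [0, 1] range can be done with scikit-learn's MinMaxScaler; a minimal sketch (the choice of MinMaxScaler is an assumption consistent with the 0-to-1 range described above):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0], [20.0], [30.0]])  # toy single-feature data

# MinMaxScaler rescales each feature independently to [0, 1]
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.ravel())  # [0.  0.5 1. ]
```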
Train and Test Data Splitting
* The data has been split into test data and training data, and the model is trained with the training data.
* The held-out test data is not seen by the model during training, so evaluating on it measures how well the model generalizes.
* This is done so the model can be assessed on data it has not been fitted to.
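A sketch of the split with scikit-learn's train_test_split (the 80/20 split and the random seed are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features
y = np.array([0, 1] * 5)

# hold out 20% of the rows as unseen test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```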
BUILDING MACHINE LEARNING MODELS
LOGISTIC REGRESSION
Logistic Regression is a machine learning classification algorithm used to predict the probability of a categorical dependent variable. It is not as sophisticated as the ensemble methods, so it provides us with a good benchmark.
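A minimal sketch of fitting such a benchmark with scikit-learn (the single toy feature and labels are illustrative, not the project's data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy feature (e.g. hours) vs. toy binary target: 1 = left, 0 = stayed
X = np.array([[5], [10], [15], [40], [45], [50]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# predict class labels for two unseen samples
print(model.predict(np.array([[8], [48]])))
```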
BUILDING MACHINE LEARNING MODELS
DECISION TREE CLASSIFIER
A decision tree classifier is a machine learning algorithm used for both classification and regression tasks; it predicts the value of a target variable by learning simple decision rules inferred from the input features.
* Decision trees have a hierarchical, tree-like structure in which each internal node represents a feature or attribute and each branch represents a decision rule based on that attribute. The leaf nodes represent the final predicted outcome or class label.
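A minimal sketch with scikit-learn's DecisionTreeClassifier on toy data (max_depth and the data are illustrative assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# two well-separated toy groups
X = np.array([[1], [2], [3], [10], [11], [12]])
y = np.array([0, 0, 0, 1, 1, 1])

# a shallow tree: each internal node tests a feature threshold,
# each leaf holds a predicted class label
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)
print(tree.predict([[2], [11]]))
```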
BUILDING MACHINE LEARNING MODELS
RANDOM FOREST CLASSIFIER
Random Forest is a machine learning method capable of solving both regression and classification problems. It is a form of ensemble learning, as it relies on an ensemble of decision trees: it aggregates classification (or regression) trees.
* Random Forest fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting. Random Forest can handle a large number of features and is helpful for estimating which of your variables are important in the underlying data being modeled.
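A minimal sketch with scikit-learn's RandomForestClassifier, including the feature-importance estimate mentioned above (the toy data and hyperparameters are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# feature 0 separates the classes; feature 1 is uninformative noise
X = np.array([[1, 0], [2, 1], [3, 0], [10, 1], [11, 0], [12, 1]])
y = np.array([0, 0, 0, 1, 1, 1])

# 100 trees fit on bootstrap sub-samples; predictions aggregated by vote
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

# feature_importances_ estimates how much each variable contributes
print(rf.feature_importances_)
```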
CONFUSION MATRIX, ROC CURVE AND AUC SCORE
The model performs well overall, with an accuracy of 84.2%, but it struggles with recall (43.4%) for detecting attrition (employees who leave). This indicates that the model is better at predicting employees who stay (True Negatives) than those who leave.
Precision (56.8%) and F1-Score (49.3%) suggest that while the model is better at predicting employees who stayed, its predictions for employees who left are less reliable, as evidenced by a moderate number of False Negatives (539) and False Positives (314).
AUC (Area Under the Curve): The AUC score
is 0.7802, which indicates that the model has
good discriminative ability. An AUC of 0.5
would indicate random guessing, while 1
would indicate perfect classification.
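Both metrics can be computed with scikit-learn; a sketch on toy labels and scores (the numbers below are illustrative, not the project's 84.2% accuracy or 0.7802 AUC):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true  = np.array([0, 0, 1, 1, 1, 0])          # actual classes
y_pred  = np.array([0, 0, 1, 0, 1, 1])          # hard predictions
y_score = np.array([0.2, 0.1, 0.9, 0.4, 0.8, 0.6])  # predicted probabilities

# rows = actual class, columns = predicted class
cm = confusion_matrix(y_true, y_pred)
print(cm)

# AUC: probability a random positive is scored above a random negative
print(roc_auc_score(y_true, y_score))
```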
GRID AND RANDOM SEARCHING FOR
FINE TUNING OF HYPER PARAMETERS
Grid search works by creating a grid of all possible combinations of hyperparameter values specified by the user. It then trains and evaluates the model using each combination of hyperparameters and selects the one that yields the best performance based on a predefined evaluation metric, such as accuracy, precision, or F1 score.
It systematically explores all possible combinations of hyperparameters, ensuring that the best
combination is found within the specified search space. However, this exhaustive search can be
computationally expensive.
Random Search is a hyperparameter tuning technique where random combinations of
hyperparameters are sampled and evaluated to find the best-performing model. It's an alternative to
Grid Search that can be more efficient, especially when the hyperparameter space is large. Random
search can explore a wide range of hyperparameters quickly, while grid search can be computationally
expensive if the grid is large.
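A sketch of grid search over a Random Forest with scikit-learn's GridSearchCV (the parameter grid, cv folds, and toy data are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# toy data: 40 samples, 3 features, balanced binary target
X = np.random.RandomState(0).rand(40, 3)
y = np.array([0, 1] * 20)

# every combination in param_grid is fitted and scored by cross-validation
param_grid = {"n_estimators": [10, 50], "max_depth": [2, 4]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, scoring="f1")
search.fit(X, y)
print(search.best_params_)
```

RandomizedSearchCV has the same interface but takes distributions and an n_iter budget instead of an exhaustive grid, which is the efficiency trade-off described above.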
GRID SEARCH FOR RANDOM FOREST CLASSIFIER
We are taking the results of the RF as it has proven to be the best performer overall
RANDOM SEARCH FOR RANDOM FOREST CLASSIFIER
We are taking the results of the RF as it has proven to be the best performer overall
Employee turnover is driven by 3 key factors:
Lack of Career Advancement
Employees may leave if they feel there are limited opportunities for growth or advancement within the company.
Inadequate Compensation and Benefits
When pay and benefits don't align with industry standards or employee expectations, workers may seek better offers elsewhere.
Poor Work Culture and Environment
A toxic work environment, lack of recognition, or ineffective leadership can lead to dissatisfaction, causing employees to leave in search of a more supportive and rewarding workplace.
Conclusion:
The dataset provides insights into factors influencing employee
attrition, such as experience, education, and tenure. Understanding
these factors helps develop targeted retention strategies.
In model performance, Random Forest (RF3) leads with the highest
F1 score (0.73) and accuracy (0.78), offering a strong balance
between precision, recall, and overall performance.
XGBoost (XGB3) follows closely, with an F1 score of 0.72 and similar
accuracy.
Decision Tree Regressor (DTR) excels in precision for class 0 but
struggles with class 1, while Logistic Regression performs poorly
across metrics.
Overall, Random Forest is the best model for predicting employee
attrition.