Employee Retention Prediction: A Data Science Project by Devangi Shukla
Employee Retention Prediction
Descriptive Objective:
This project focuses on predicting employee attrition using
machine learning to help companies anticipate when
employees are likely to leave. By analysing historical HR data,
including factors like job role, tenure, and demographics, the
goal is to predict turnover and take proactive retention
measures. This approach falls under HR Analytics, which
leverages data to improve workforce management.
Importance:
By predicting attrition, organizations can implement
retention strategies like improving career development or
enhancing workplace conditions. This also allows businesses
to plan future hiring needs, reducing the costs of turnover
and improving workforce stability. In summary, this solution
helps companies make informed, data-driven decisions to
retain talent, reduce churn, and improve long-term success.
The Challenges of Employee Turnover
Financial Impact
Turnover costs businesses significant
resources in recruitment, training,
and lost productivity.
Team Morale
Losing valuable employees disrupts
team dynamics and can negatively
impact morale.
Loss of Expertise
The departure of experienced
employees can lead to a loss of
institutional knowledge and
expertise.
Factors Influencing
Employee Retention
1 Compensation and
Benefits
Competitive salaries and
benefits packages are
essential for attracting and
retaining talent.
2 Work-Life Balance
Employees value flexible
work arrangements and
opportunities to prioritize
their well-being.
3 Career Growth Opportunities
Clear paths for advancement, training, and development
motivate employees to stay.
INFORMATION ON THE DATASET
company_size: Size of the company where the enrollee works or worked.
city: City where the enrollee is located.
city_development_index: Indicator of the development level of the city.
enrollee_id: Unique identifier for each enrollee.
major_discipline: Academic field or discipline of the enrollee's major.
relevent_experience: Whether the enrollee has relevant work experience.
experience: Number of years of work experience.
education_level: Highest level of education attained by the enrollee.
gender: Gender of the enrollee.
training_hours: Total number of hours spent on training.
company_type: Type or category of the company.
DATA UNDERSTANDING AND
INSIGHTS
EDA- EXPLORATORY DATA ANALYSIS
HANDLING MISSING VALUES
DATA ENCODING AND OUTLIER
DETECTION
MODEL BUILDING – LR, XG, RF
CONCLUSION
LIST OF CONTENTS
FOLLOWING LIBRARIES HAVE BEEN USED
Descriptions of these libraries are as follows:
* Pandas for DataFrame operations
* NumPy for numeric operations
* Matplotlib and Seaborn for data visualisation
* Scikit-Learn for all the machine learning algorithms
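A typical import cell for this stack might look like the following (the aliases and the specific scikit-learn modules shown are assumptions; the original notebook's imports are not reproduced in this export):

```python
import pandas as pd                  # DataFrame operations
import numpy as np                   # numeric operations
import matplotlib.pyplot as plt      # plotting
import seaborn as sns                # statistical visualisation
from sklearn.model_selection import train_test_split  # data splitting
from sklearn.linear_model import LogisticRegression   # baseline model
```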
Descriptive Statistics
Descriptive Statistics for Numerical Columns:

| Column                 | Mean  | Std Dev | Min   | 25th Percentile | 50th Percentile (Median) | 75th Percentile | Max   |
|------------------------|-------|---------|-------|-----------------|--------------------------|-----------------|-------|
| city_development_index | 0.775 | 0.075   | 0.624 | 0.748           | 0.776                    | 0.834           | 0.920 |
| experience             | 6.4   | 6.5     | <1    | <1              | 5                        | >20             | >20   |
| training_hours         | 45.2  | 30.4    | 8     | 47              | 52                       | 83              | 83    |
| target                 | 0.6   | 0.5     | 0     | 0               | 1                        | 1               | 1     |
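A table like this is typically produced with pandas' describe(); a minimal sketch on toy data (the values below are illustrative, not the project's dataset):

```python
import pandas as pd

# toy stand-ins for three of the numerical columns
df = pd.DataFrame({
    "city_development_index": [0.624, 0.748, 0.776, 0.834, 0.920],
    "training_hours": [8, 47, 52, 60, 83],
    "target": [0, 0, 1, 1, 1],
})

# describe() returns count, mean, std, min, quartiles and max per column
stats = df.describe()
print(stats.loc["mean", "target"])  # mean of the toy target column
```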
Descriptive Statistics for Categorical Columns:
OBSERVATIONS FOR THE DESCRIPTIVE STATISTICS
city_development_index: shows moderate variation in city development (ranging from 0.624 to 0.920).
Experience: heavily skewed towards higher values (many employees have more than 5 years of experience, with a few having <1 year or >20 years).
Training hours: range widely, with some employees having very high training hours (up to 83 hours).
Attrition (Target): has a fairly balanced distribution between 0 and 1, suggesting that employees who stay and leave are relatively equally represented.
Gender: predominantly male (80% of the entries), while education level is mostly Graduate (80%).
DATA VISUALISATION
The dataset is primarily concentrated in cities with higher development
indices (e.g., 0.920), while cities with lower indices (e.g., 0.625) have
minimal representation.
The dataset shows a significant gender imbalance, with the majority of entries
being male (17729), followed by a smaller number of females (1238) and a few
entries with missing or unspecified gender (191).
The dataset indicates that most employees (13792) have relevant
experience, while a smaller group (5366) lacks relevant
experience.
The majority of employees have no enrolment in university programs
(14203), followed by those enrolled in full-time courses (3757), with a smaller
number in part-time courses (1198).
DATA VISUALISATION
The dataset shows that most employees
have a graduate education level (11598),
followed by those with master's degrees
(4361), and fewer with doctoral or other
higher education levels.
The dataset reveals that the majority
of employees come from STEM
disciplines (14492), followed by those
with a Business Degree (2813), and
smaller groups with other disciplines
like Arts and Humanities.
The dataset shows the largest groups
of employees work in companies with
50-99 employees (3083), followed
by 100-500 employees (2571), with
smaller representation in larger
companies (10000+ and 5000-9999).
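Category counts like those quoted above are typically obtained with pandas' value_counts() (and visualised with seaborn's countplot); a sketch on a toy column standing in for the real gender field (the counts here are illustrative, not the dataset's):

```python
import pandas as pd

# toy stand-in for the 'gender' column, including a missing entry
gender = pd.Series(["Male"] * 5 + ["Female"] * 2 + [None])

# dropna=False also counts the missing/unspecified entries
counts = gender.value_counts(dropna=False)
print(counts)
```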
HANDLING MISSING VALUES
THESE ARE THE MISSING VALUES IN EACH COLUMN IN PERCENTAGES:
THAT SHOWS:
A substantial share of the data is missing for the following variables:
Gender (22.5%)
Major_Discipline (15%)
Company_Size (31%)
Company_Type (32%)
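Percentages like these can be computed from the share of nulls per column; a sketch on toy data:

```python
import pandas as pd

# toy frame: half the gender values are missing, experience is complete
df = pd.DataFrame({
    "gender": ["M", None, "F", None],
    "experience": [1, 2, 3, 4],
})

# isna().mean() gives the fraction of missing values per column
missing_pct = df.isna().mean() * 100
print(missing_pct)
```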
We cleaned the data and removed the missing values with the code below, where we defined the *cleanNaN* function with parameter (dfa).
That resulted in the cleaned data:
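The cleanNaN code itself is not reproduced in this export; a plausible minimal reconstruction, assuming it simply drops rows containing missing values (the dropna strategy and the reset of the index are assumptions):

```python
import pandas as pd

def cleanNaN(dfa):
    """Hypothetical reconstruction: drop rows with any missing value."""
    return dfa.dropna().reset_index(drop=True)

# toy frame with one incomplete row
dfa = pd.DataFrame({"gender": ["M", None, "F"], "experience": [1, 2, 3]})
cleaned = cleanNaN(dfa)
print(len(cleaned))  # 2
```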
OUTLIER DETECTION AND Z-SCORE
Performing z-test to remove outliers in Train Dataset
Performing z-test to remove outliers in Test Dataset
Outlier detection using the Z-score involves calculating the number of standard deviations a data point lies from the mean; points with an absolute Z-score above a threshold (e.g., 3) are considered outliers. These outliers can be removed or capped to reduce their impact on modeling.
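A sketch of that Z-score filter on toy data, using the conventional |Z| < 3 rule described above (the data and column name are illustrative):

```python
import pandas as pd

# 20 typical values plus one extreme outlier
df = pd.DataFrame({"training_hours": [50] * 20 + [500]})

# Z-score: how many standard deviations each point lies from the mean
col = df["training_hours"]
z = (col - col.mean()) / col.std()

# keep only points within the conventional threshold of 3
filtered = df[z.abs() < 3]
print(len(filtered))  # 20
```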
Normalization of Data
The resulting data, after sampling, needs to be normalized to a fixed range of values so that the model won't be biased towards variables with large magnitudes.
The data has been normalized to values between 0 & 1, independently of the statistical distribution each variable follows.
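Scaling to the [0, 1] range can be done with scikit-learn's MinMaxScaler; a minimal sketch (the choice of MinMaxScaler is an assumption consistent with the 0-to-1 range described above):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0], [20.0], [30.0]])  # toy single-feature data

# MinMaxScaler rescales each feature independently to [0, 1]
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.ravel())  # [0.  0.5 1. ]
```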
Train and Test Data Splitting
* The data has been split into test data and training data, and the model is trained with the training data.
* The held-out test data is not seen by the model during training, so evaluating on it measures how well the model generalizes.
* This is done so the model can be assessed on data it has not been fitted to.
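A sketch of the split with scikit-learn's train_test_split (the 80/20 split and the random seed are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features
y = np.array([0, 1] * 5)

# hold out 20% of the rows as unseen test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```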
BUILDING MACHINE LEARNING MODELS
LOGISTIC REGRESSION
Logistic Regression is a machine learning classification algorithm used to predict the probability of a categorical dependent variable. It is not as sophisticated as the ensemble methods, so it provides us with a good benchmark.
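A minimal sketch of fitting such a benchmark with scikit-learn (the single toy feature and labels are illustrative, not the project's data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy feature (e.g. hours) vs. toy binary target: 1 = left, 0 = stayed
X = np.array([[5], [10], [15], [40], [45], [50]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# predict class labels for two unseen samples
print(model.predict(np.array([[8], [48]])))
```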
BUILDING MACHINE LEARNING MODELS
DECISION TREE CLASSIFIER
A decision tree classifier is a machine learning algorithm used for both classification and regression tasks; it predicts the value of a target variable by learning simple decision rules inferred from the input features.
* Decision trees have a hierarchical, tree-like structure in which each internal node represents a feature or attribute and each branch represents a decision rule based on that attribute. The leaf nodes represent the final predicted outcome or class label.
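A minimal sketch with scikit-learn's DecisionTreeClassifier on toy data (max_depth and the data are illustrative assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# two well-separated toy groups
X = np.array([[1], [2], [3], [10], [11], [12]])
y = np.array([0, 0, 0, 1, 1, 1])

# a shallow tree: each internal node tests a feature threshold,
# each leaf holds a predicted class label
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)
print(tree.predict([[2], [11]]))
```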
BUILDING MACHINE LEARNING MODELS
RANDOM FOREST CLASSIFIER
Random Forest is a machine learning method capable of solving both regression and classification problems. It is a form of ensemble learning, as it relies on an ensemble of decision trees: it aggregates classification (or regression) trees.
* Random Forest fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting. Random Forest can handle a large number of features and is helpful for estimating which of your variables are important in the underlying data being modeled.
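A minimal sketch with scikit-learn's RandomForestClassifier, including the feature-importance estimate mentioned above (the toy data and hyperparameters are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# feature 0 separates the classes; feature 1 is uninformative noise
X = np.array([[1, 0], [2, 1], [3, 0], [10, 1], [11, 0], [12, 1]])
y = np.array([0, 0, 0, 1, 1, 1])

# 100 trees fit on bootstrap sub-samples; predictions aggregated by vote
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

# feature_importances_ estimates how much each variable contributes
print(rf.feature_importances_)
```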
CONFUSION MATRIX, ROC CURVE AND AUC SCORE
The model performs well overall, with an accuracy of 84.2%, but it struggles with recall (43.4%) for detecting attrition (employees who leave). This indicates that the model is better at predicting employees who stay (True Negatives) than those who leave.
Precision (56.8%) and F1-Score (49.3%) suggest that while the model is better at predicting employees who stayed, its predictions for employees who left are less reliable, as evidenced by a moderate number of False Negatives (539) and False Positives (314).
AUC (Area Under the Curve): The AUC score
is 0.7802, which indicates that the model has
good discriminative ability. An AUC of 0.5
would indicate random guessing, while 1
would indicate perfect classification.
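Both metrics can be computed with scikit-learn; a sketch on toy labels and scores (the numbers below are illustrative, not the project's 84.2% accuracy or 0.7802 AUC):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true  = np.array([0, 0, 1, 1, 1, 0])          # actual classes
y_pred  = np.array([0, 0, 1, 0, 1, 1])          # hard predictions
y_score = np.array([0.2, 0.1, 0.9, 0.4, 0.8, 0.6])  # predicted probabilities

# rows = actual class, columns = predicted class
cm = confusion_matrix(y_true, y_pred)
print(cm)

# AUC: probability a random positive is scored above a random negative
print(roc_auc_score(y_true, y_score))
```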
GRID AND RANDOM SEARCHING FOR
FINE TUNING OF HYPER PARAMETERS
Grid search works by creating a grid of all possible combinations of hyperparameter values specified by the user. It then trains and evaluates the model using each combination of hyperparameters and selects the one that yields the best performance based on a predefined evaluation metric, such as accuracy, precision, or F1 score.
It systematically explores all possible combinations of hyperparameters, ensuring that the best
combination is found within the specified search space. However, this exhaustive search can be
computationally expensive.
Random Search is a hyperparameter tuning technique where random combinations of
hyperparameters are sampled and evaluated to find the best-performing model. It's an alternative to
Grid Search that can be more efficient, especially when the hyperparameter space is large. Random
search can explore a wide range of hyperparameters quickly, while grid search can be computationally
expensive if the grid is large.
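A sketch of grid search over a Random Forest with scikit-learn's GridSearchCV (the parameter grid, cv folds, and toy data are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# toy data: 40 samples, 3 features, balanced binary target
X = np.random.RandomState(0).rand(40, 3)
y = np.array([0, 1] * 20)

# every combination in param_grid is fitted and scored by cross-validation
param_grid = {"n_estimators": [10, 50], "max_depth": [2, 4]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, scoring="f1")
search.fit(X, y)
print(search.best_params_)
```

RandomizedSearchCV has the same interface but takes distributions and an n_iter budget instead of an exhaustive grid, which is the efficiency trade-off described above.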
GRID SEARCH FOR RANDOM FOREST CLASSIFIER
We are taking the results of the RF as it has proven to be the best performer overall
RANDOM SEARCH FOR RANDOM FOREST CLASSIFIER
We are taking the results of the RF as it has proven to be the best performer overall
Employee turnover is driven by 3 key factors:
Lack of Career Advancement
Employees may leave if they feel there are limited opportunities for growth or advancement within the company.
Inadequate Compensation and Benefits
When pay and benefits don't align with industry standards or employee expectations, workers may seek better offers elsewhere.
Poor Work Culture and Environment
A toxic work environment, lack of recognition, or ineffective leadership can lead to dissatisfaction, causing employees to leave in search of a more supportive and rewarding workplace.
Conclusion:
The dataset provides insights into factors influencing employee
attrition, such as experience, education, and tenure. Understanding
these factors helps develop targeted retention strategies.
In model performance, Random Forest (RF3) leads with the highest
F1 score (0.73) and accuracy (0.78), offering a strong balance
between precision, recall, and overall performance.
XGBoost (XGB3) follows closely, with an F1 score of 0.72 and similar
accuracy.
Decision Tree Regressor (DTR) excels in precision for class 0 but
struggles with class 1, while Logistic Regression performs poorly
across metrics.
Overall, Random Forest is the best model for predicting employee
attrition.