Predicting Insurance Responses: Leveraging Data Science for Better Outcomes

CONFIDENTIAL: The information in this document belongs to Boston Institute of Analytics LLC. Any unauthorized sharing of this
material is prohibited and subject to legal action under breach of IP and confidentiality clauses.
Insurance Customer Response Prediction

Agenda
• Dataset Overview
• Exploratory Data Analysis (EDA)
• Data Preprocessing
• Model Selection and Cross-Validation
• Model Training and Model Evaluation
• Results and Interpretation

Dataset overview
Introduction:
In this section, we analyze and predict customer responses to a health insurance
marketing campaign using a dataset containing 50,882 instances and 14 variables.
Dataset Description:
The dataset includes key features such as ID, City Code, Region, Accommodation Type, and
Recommended Insurance Type. The primary target variable, Response, indicates
whether a customer accepted or rejected the recommended insurance. We explore
the dataset’s structure, summarize the unique values of each feature, and examine
their data types to gain a deeper understanding of the data.
Data source: Kaggle

Exploratory Data Analysis (EDA)
Examine dataset dimensions:
• The dataset contains 50,882 instances (rows) and 14
variables (columns).
• Provides an overview of the data size and structure.
Preview dataset:
• Display the first few rows to inspect data entries and get an
understanding of features.
Check for missing values:
• Identify missing or incomplete data by using functions like
isnull().
• Assess how many entries are missing in each column and
determine how to handle them.
Analyze target variable:
• Investigate the distribution of the target variable ‘Response’
to understand how many customers accepted (1) or rejected
(0) the recommended insurance.
[Figures: Missing Values; Target Variable]
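The checks listed above can be sketched with pandas roughly as follows. This is a minimal sketch on a tiny made-up frame, not the real data; every column name except Response is an assumption modeled on the feature names mentioned in this deck.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample standing in for the real 50,882 x 14 dataset.
# Column names are assumed for illustration only.
df = pd.DataFrame({
    "Accommodation_Type": ["Rented", "Owned", "Owned", "Rented", "Owned"],
    "Health_Indicator": ["X1", np.nan, "X2", "X1", np.nan],
    "Upper_Age": [36, 75, 32, 52, 44],
    "Response": [0, 0, 1, 0, 1],
})

print(df.shape)   # dimensions as (rows, columns)
print(df.head())  # first few rows

# Number of missing entries in each column.
missing_counts = df.isnull().sum()
print(missing_counts)

# Class balance of the target: absolute counts and proportions.
response_counts = df["Response"].value_counts()
response_share = df["Response"].value_counts(normalize=True)
print(response_counts)
print(response_share)
```

Looking at the proportions as well as the counts makes any class imbalance in Response visible before a model is chosen.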

Explore unique values of categorical features:
• Analyze variables like Accommodation Type, Reco
Insurance Type, and Health Indicator.
Visualize relationships with target variable:
• Use count plots and bar charts to visualize how
categorical features relate to ‘Response’.
• Examine how features like accommodation type and
health indicator differ between customers who
accepted and those who rejected the offer.
Generate correlation heatmap:
• Create a correlation heatmap to identify
relationships between numerical and encoded
categorical variables.
• Helps to understand which features are strongly
correlated with each other and with the target
variable.
[Figure: Compare by target]
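The numbers behind these charts can be computed directly, which is a useful check before plotting. The sketch below uses made-up rows and assumed column names; it computes the acceptance rate per category (what a count plot split by Response conveys) and the correlation matrix that a heatmap would display.

```python
import pandas as pd

# Made-up sample; column names are assumptions for illustration.
df = pd.DataFrame({
    "Accommodation_Type": ["Rented", "Owned", "Owned", "Rented", "Owned", "Rented"],
    "Upper_Age": [36, 75, 32, 52, 44, 28],
    "Reco_Policy_Premium": [11628.0, 30510.0, 7450.0, 17780.0, 10404.0, 9800.0],
    "Response": [0, 0, 1, 0, 1, 1],
})

# Unique values of a categorical feature.
print(df["Accommodation_Type"].unique())

# Share of accepted offers within each category.
rate_by_type = df.groupby("Accommodation_Type")["Response"].mean()
print(rate_by_type)

# Pairwise Pearson correlations between numeric columns, target included.
corr = df.select_dtypes("number").corr()
print(corr["Response"])

# With seaborn installed, the same tables can be drawn as charts, e.g.
#   sns.countplot(data=df, x="Accommodation_Type", hue="Response")
#   sns.heatmap(corr, annot=True)
```

Pearson correlation on label-encoded categories depends on the arbitrary integer order of the labels, so those cells of a heatmap are best read as rough hints.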

Data Preprocessing
Handle Missing Values:
• Identify and fill missing values in key columns like Holding Policy Type and Holding Policy Duration.
• Replace NaN values with 0 for new customers who don't have an existing policy.
Fill Missing Values in Categorical Features:
• For Health indicator, fill missing values with a placeholder (x0), indicating missing health data
Data Type Conversion:
• Convert Holding policy duration and other columns with inconsistent types into numeric values for
model compatibility.
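A minimal sketch of these three steps, assuming pandas. The column names and the text value "14+" are assumptions used to illustrate an inconsistent type, and the placeholder is written X0 here to match the made-up categories.

```python
import numpy as np
import pandas as pd

# Made-up rows; column names and the "14+" value are assumptions.
df = pd.DataFrame({
    "Holding_Policy_Duration": ["3.0", np.nan, "14+", "1.0"],
    "Holding_Policy_Type": [2.0, np.nan, 3.0, 1.0],
    "Health_Indicator": ["X1", np.nan, "X2", np.nan],
})

# New customers have no existing policy, so missing holding info becomes 0.
df["Holding_Policy_Type"] = df["Holding_Policy_Type"].fillna(0)
df["Holding_Policy_Duration"] = df["Holding_Policy_Duration"].fillna("0")

# Missing health data gets its own placeholder category.
df["Health_Indicator"] = df["Health_Indicator"].fillna("X0")

# Strip the non-numeric suffix, then convert the text column to numbers.
df["Holding_Policy_Duration"] = pd.to_numeric(
    df["Holding_Policy_Duration"].str.replace("+", "", regex=False)
)

print(df)
print(df.dtypes)
```

Mapping "14+" to 14 is one possible choice; a capped value such as 15 would work equally well as long as it is applied consistently.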

Drop Irrelevant Columns:
• Remove columns such as ID, City Code, and Region to avoid high cardinality and unnecessary data in
the model
Feature Encoding:
• Apply Label Encoding to categorical variables like Health Indicator, Accommodation Type, and Reco
Insurance Type for numerical modeling.
Scaling:
• Use StandardScaler to standardize numerical features like Upper Age, Lower Age, and Reco Policy
Premium so that each has a mean of 0 and a standard deviation of 1, which helps scale-sensitive
models such as SVM.
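The dropping, encoding, and scaling steps might look like the sketch below, assuming pandas and scikit-learn. The rows and column names are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Made-up rows; column names are assumptions for illustration.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "City_Code": ["C3", "C5", "C5", "C24"],
    "Accommodation_Type": ["Rented", "Owned", "Owned", "Rented"],
    "Upper_Age": [36, 75, 32, 52],
    "Reco_Policy_Premium": [11628.0, 30510.0, 7450.0, 17780.0],
})

# Drop identifier-like, high-cardinality columns.
df = df.drop(columns=["ID", "City_Code"])

# Label encoding maps each category to an integer (classes sorted alphabetically).
encoder = LabelEncoder()
df["Accommodation_Type"] = encoder.fit_transform(df["Accommodation_Type"])

# Standardize numeric columns to mean 0 and population standard deviation 1.
numeric_cols = ["Upper_Age", "Reco_Policy_Premium"]
scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

print(df)
```

When a train/test split is used, fitting the scaler on the training rows only and then calling transform on the test rows avoids leaking test statistics into training.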

Model Selection and Cross-Validation
• Model Comparison:
• Evaluate multiple models including
SVM (Support Vector Machine),
Decision Tree, Random Forest,
XGBoost, and CatBoost to identify the
best-performing model for the
insurance response prediction task.
• Cross-Validation:
• Use K-Fold Cross Validation (5-fold) to
assess each model's performance
with accuracy as the evaluation
metric.
• Compute fold-wise accuracy and
mean accuracy for each model to
judge how stable its performance is
across folds.
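A comparison of this kind can be sketched with scikit-learn as shown below. The data is synthetic (make_classification), so the scores say nothing about the real dataset, and only the three scikit-learn models are included. XGBoost and CatBoost ship classifiers that follow the same estimator interface (XGBClassifier, CatBoostClassifier) and could be added to the dictionary if those packages are installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed features and the Response target.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    "SVM": SVC(kernel="linear"),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# The same 5 shuffled folds for every model, so scores are comparable.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, model in models.items():
    fold_scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    results[name] = fold_scores
    print(f"{name}: folds={fold_scores.round(3)}, mean={fold_scores.mean():.3f}")
```

A small spread between fold scores suggests the model's accuracy does not depend much on which rows land in the validation fold.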

Model Training and Evaluation
Data Split:
Split the dataset into training (80%) and testing (20%) sets
to train the model and evaluate its performance.
Model Training:
Train the selected model (SVM with a linear kernel) on the
training data using the optimal parameters.
Prediction:
Generate predictions on the test set and evaluate how well
the model generalizes.
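The split, training, and prediction steps could be sketched as follows with scikit-learn. The data is synthetic. The value C=1.0 is only the library default standing in for the tuned parameters, which the deck does not state, and stratify=y is an added choice that keeps class proportions equal in both sets.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed features and the Response target.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# 80% training, 20% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Linear-kernel SVM; C=1.0 is a placeholder, not the deck's tuned value.
model = SVC(kernel="linear", C=1.0)
model.fit(X_train, y_train)

# Predictions on rows the model has not seen during training.
y_pred = model.predict(X_test)
print(y_pred[:10])
```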
Evaluation Metrics:
• Accuracy Score: Calculate overall accuracy on the test data.
• Confusion Matrix: Visualize model performance with true
positive, true negative, false positive, and false negative
counts.
• Classification Report: Analyze detailed metrics, including
precision, recall, and F1-score, for both classes (Accepted
and Rejected).
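To make the three metrics concrete, here is a small sketch using scikit-learn on ten hand-made labels (not model output). With these labels there are 5 true negatives, 1 false positive, 2 false negatives, and 2 true positives, so accuracy is 7/10.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hand-made labels for illustration: 1 = Accepted, 0 = Rejected.
y_test = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

accuracy = accuracy_score(y_test, y_pred)

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()

print(accuracy)
print(cm)
# Per-class precision, recall, and F1-score.
print(classification_report(y_test, y_pred, target_names=["Rejected", "Accepted"]))
```

In this toy case, recall for the Accepted class is only 2/4 even though accuracy is 0.7, which is why the per-class report is worth reading alongside the overall score.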

Results and Interpretation
Final Model Accuracy:
The SVM model achieved an accuracy of 75% on the test set, showing it performs moderately
well in predicting customer insurance responses based on this dataset.
Confusion Matrix Insights:
The confusion matrix indicates balanced prediction for customers likely to accept or reject insurance
offers, with relatively few misclassifications. This balance reflects the model's ability to reasonably handle
both positive and negative responses in the dataset.
Conclusion:
Based on the data, the model captures patterns in customer demographics and insurance preferences, helping
predict purchase likelihood. With further tuning, such as refining features or trying other algorithms, the model’s
performance could improve. This would support more efficient decision-making, helping the company better target
high-potential customers.

Questions?

Thank You!
