End-to-End Insurance Cost Predictor: A Machine Learning Web App
I had long envisioned building a full-stack web application powered by a machine learning model, and I finally developed my first: an Insurance Cost Predictor web app. You can explore it here: Insurance Cost Predictor Web App.
In this article, I'll walk you through the entire process - from acquiring the dataset to deploying the app on a server. Whether you are a beginner or an experienced developer, I hope this article will inspire you to build your own machine learning-powered application.
The Journey: A 4D Data Science Framework
For this project, I followed a structured 4D Data Science Framework:
Define the problem
Discover the data
Develop the model
Deploy the model
Below is how each step unfolded.
Step 1: Define the Problem
The objective was simple yet significant: predict health insurance costs as accurately as possible from individual attributes. Such an app could help users estimate their insurance premiums and identify the factors driving the charges.
Step 2: Discover the Data
I sourced the dataset from Kaggle: Kaggle Dataset. It is a straightforward dataset with seven features:
Age: The individual’s age
Sex: Male or female
BMI: Body Mass Index
Children: Number of dependents
Smoker: Yes or no
Region: Geographic area (northeast, northwest, southeast, southwest)
Charges: Insurance cost (the target variable)
The dataset was already clean, requiring minimal preprocessing. While exploring it, I uncovered key patterns from the correlation matrix:
Smoker Status: A strong positive correlation with the target variable Charges; digging deeper with various bar charts confirmed this initial finding.
Age: A clear linear relationship - older individuals paid higher premiums.
BMI: Little standalone correlation with Charges, but combined with Smoker status it revealed a stark contrast in costs.
To prepare the data, I applied one-hot encoding to the Region feature, converting it from categorical to numerical values. You can check out my EDA process in my notebook here: Insurance Cost Prediction Notebook
Step 3: Develop the Model
With insights from EDA, I moved to model development. Here's how I approached it:
1. Data Preparation: Split the dataset into training and test sets using the train_test_split method from scikit-learn, then normalized the features with MinMaxScaler so that all attributes were on the same scale.
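As a sketch of this step (using a few synthetic rows in place of the Kaggle data, with sex and smoker already label-encoded and region one-hot encoded), it might look like:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# A few synthetic rows standing in for the Kaggle data, with sex/smoker
# label-encoded and region one-hot encoded as in the EDA step.
df = pd.DataFrame({
    "age": [19, 33, 46, 61],
    "sex": [1, 0, 1, 0],            # 1 = male, 0 = female
    "bmi": [27.9, 22.7, 30.1, 25.8],
    "children": [0, 1, 3, 2],
    "smoker": [1, 0, 0, 1],         # 1 = yes, 0 = no
    "region_northeast": [1, 0, 0, 0],
    "region_northwest": [0, 1, 0, 0],
    "region_southeast": [0, 0, 1, 0],
    "region_southwest": [0, 0, 0, 1],
    "charges": [16884.92, 4509.46, 8823.28, 28101.33],
})

X = df.drop(columns=["charges"])
y = df["charges"]

# Hold out a test set before scaling so no test-set information leaks in.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on the training split only
X_test_scaled = scaler.transform(X_test)        # reuse the training min/max
```

Fitting the scaler on the training split only prevents information from the test set leaking into the features.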
2. Model Selection: I initially tested four machine learning algorithms with default settings to establish a baseline:
Linear Regression
Support Vector Machines (SVM)
Random Forest Regressor
Gradient Boosting Regressor
The evaluation metric was Mean Absolute Error (MAE). The baseline results were as follows:
Linear Regression: 4181.19
Support Vector Regressor: 8618.36
Random Forest Regressor: 2569.72
Gradient Boosting Regressor: 2407.29
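A baseline loop along these lines produced the numbers above. Here is a self-contained sketch using synthetic regression data in place of the insurance features, so the printed MAEs will differ from the article's:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic regression data standing in for the scaled insurance features.
X, y = make_regression(n_samples=300, n_features=9, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Support Vector Regressor": SVR(),
    "Random Forest Regressor": RandomForestRegressor(random_state=42),
    "Gradient Boosting Regressor": GradientBoostingRegressor(random_state=42),
}

baseline_mae = {}
for name, model in models.items():
    model.fit(X_train, y_train)              # default hyperparameters only
    preds = model.predict(X_test)
    baseline_mae[name] = mean_absolute_error(y_test, preds)
    print(f"{name}: {baseline_mae[name]:.2f}")
```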
After obtaining the baseline results, the objective was to beat these baselines through hyperparameter tuning. Linear Regression and SVM underperformed, so I focused on Random Forest and Gradient Boosting.
3. Hyperparameter Tuning:
Random Forest: Used RandomizedSearchCV with 10-fold cross-validation for hyperparameter tuning. Best parameters:
n_estimators: 310, min_samples_split: 10, min_samples_leaf: 3, max_features: 'sqrt', max_depth: 71, bootstrap: False
Mean Absolute Error on Test Set: 2632.60
Since the dataset is quite simple, the tuned Random Forest model may have overfit the training set, which could explain why its test error is slightly higher than the default model's.
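The Random Forest search can be sketched as follows. The data is synthetic, and the candidate lists are abbreviated stand-ins for the real search space (they include the best values reported above):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data; the real search ran on the insurance features.
X, y = make_regression(n_samples=200, n_features=9, noise=10.0, random_state=0)

param_distributions = {
    "n_estimators": [50, 100, 200, 310],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 3, 5],
    "max_features": ["sqrt", "log2"],
    "max_depth": [10, 40, 71, None],
    "bootstrap": [True, False],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                            # kept tiny here; the real search tried more candidates
    cv=10,                               # 10-fold cross-validation, as in the article
    scoring="neg_mean_absolute_error",
    random_state=0,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)
```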
Gradient Boosting: Also tuned with RandomizedSearchCV. Best parameters:
subsample: 0.9, n_estimators: 50, min_samples_split: 20, min_samples_leaf: 19, max_depth: 14, loss: 'huber', learning_rate: 0.1
Mean Absolute Error on Test Set: 1761.07
Here, the Gradient Boosting Regressor's MAE was the lowest of all the models, so I refined it further with GridSearchCV, narrowing the parameter ranges to systematically test every combination in the specified grid.
Best hyperparameters after GridSearchCV:
learning_rate: 0.1, loss: 'huber', max_depth: 16, min_samples_leaf: 20, min_samples_split: 10, n_estimators: 40, subsample: 0.8
Best R2 score after GridSearchCV: 0.8487
Final Test MAE: 1751.75
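The GridSearchCV refinement might look like the sketch below, with a narrow grid centered on the values found by the randomized search (again on synthetic stand-in data, so the scores will differ from the article's):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=9, noise=10.0, random_state=0)

# Narrow grid around the best randomized-search values (abbreviated here).
param_grid = {
    "learning_rate": [0.05, 0.1],
    "loss": ["huber"],
    "max_depth": [14, 16],
    "min_samples_leaf": [19, 20],
    "min_samples_split": [10, 20],
    "n_estimators": [40, 50],
    "subsample": [0.8, 0.9],
}

grid = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid=param_grid,               # every combination is tried exhaustively
    scoring="neg_mean_absolute_error",
    cv=5,
    n_jobs=-1,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```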
4. Model Export: The tuned Gradient Boosting model was saved with the joblib library as cost_predictor_model.pkl for deployment.
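Exporting with joblib is a one-liner. This sketch trains a small model on synthetic data, saves it, and verifies the round trip:

```python
import joblib
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Small synthetic model standing in for the tuned Gradient Boosting model.
X, y = make_regression(n_samples=100, n_features=9, noise=10.0, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Persist the fitted model; the Flask app loads this file at prediction time.
joblib.dump(model, "cost_predictor_model.pkl")

# Round-trip check: the reloaded model predicts identically.
reloaded = joblib.load("cost_predictor_model.pkl")
```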
That completed the model development.
Step 4: Deploy the Model
With the model ready, I built and deployed a full-stack web app.
The project structure looks like this:
insurance_cost_prediction_app/
├── app.py
├── cost_predictor_model.pkl
├── app.yaml
├── requirements.txt
├── templates/
│   └── index.html
└── static/
    ├── styles.css
    ├── script.js
    └── images/
Back-End: Main Flask app (app.py)
The app.py file is the hub of the web application. Flask, a lightweight Python web framework, requires a Python script like this to define the application instance, routes, and logic.
The app is initialized with app = Flask(__name__), which sets up its environment and allows it to handle HTTP requests.
Key routes:
app.route('/'): Renders the HTML template.
app.route('/predict', methods=['POST']): Accepts user inputs, processes them, and returns the predicted cost.
Below is a snippet of the code. (full version at app.py)
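The sketch below is illustrative rather than the exact file: the FEATURE_ORDER list and the lazy model loading are assumptions, and the real app may also apply the saved scaler to incoming values before predicting.

```python
from flask import Flask, render_template, request, jsonify
import joblib

app = Flask(__name__)

# Feature order the model expects -- an assumption for illustration;
# it must match the column order used during training.
FEATURE_ORDER = ["age", "sex", "bmi", "children", "smoker",
                 "region_northeast", "region_northwest",
                 "region_southeast", "region_southwest"]

_model = None

def get_model():
    """Load the exported model lazily, on the first prediction request."""
    global _model
    if _model is None:
        _model = joblib.load("cost_predictor_model.pkl")
    return _model

@app.route("/")
def index():
    # Serves templates/index.html
    return render_template("index.html")

@app.route("/predict", methods=["POST"])
def predict():
    data = request.get_json()
    row = [[float(data[name]) for name in FEATURE_ORDER]]
    prediction = get_model().predict(row)[0]
    return jsonify({"prediction": round(float(prediction), 2)})

if __name__ == "__main__":
    app.run(debug=True)
```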
Front-End: index.html
The front end is a simple form built with HTML, styled with styles.css, and made interactive with script.js. It collects the input data such as age, sex, and smoker status, then sends it to the server. See it here: index.html
Client-Side Logic: script.js
The client-side JavaScript runs in the user's browser. It prevents the form's default submission behavior, collects the user's inputs, and sends them to the server as a JSON payload. When app.py returns a response, script.js updates the page dynamically without a full reload.
Here is the snippet of the code. (full version at script.js)
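A sketch of what script.js might contain; the form field names, element IDs (#predict-form, #result), and encoding rules are assumptions that must match the actual HTML and back end:

```javascript
// Collect form values into the JSON payload the /predict route expects.
// Field names and encodings are assumptions for illustration.
function buildPayload(form) {
  return {
    age: Number(form.age),
    sex: form.sex === "male" ? 1 : 0,
    bmi: Number(form.bmi),
    children: Number(form.children),
    smoker: form.smoker === "yes" ? 1 : 0,
    region_northeast: form.region === "northeast" ? 1 : 0,
    region_northwest: form.region === "northwest" ? 1 : 0,
    region_southeast: form.region === "southeast" ? 1 : 0,
    region_southwest: form.region === "southwest" ? 1 : 0,
  };
}

// Browser-only wiring: intercept the submit, POST the payload,
// and update the page in place without a reload.
if (typeof document !== "undefined") {
  document.querySelector("#predict-form").addEventListener("submit", async (event) => {
    event.preventDefault(); // stop the full-page form submission
    const form = Object.fromEntries(new FormData(event.target));
    const response = await fetch("/predict", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(buildPayload(form)),
    });
    const result = await response.json();
    document.querySelector("#result").textContent = `Estimated cost: $${result.prediction}`;
  });
}
```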
Dependencies: requirements.txt
After creating the server-side and client-side files, it is time to freeze the project's dependencies so that the application runs smoothly on any machine it is deployed to. The file lists the libraries along with their exact versions and is generated from the local environment with a single command.
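With the project's virtual environment activated, the standard way to generate it is:

```shell
# Capture the exact versions installed in the active environment.
pip freeze > requirements.txt
```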
GAE configuration: app.yaml
Since I deployed this application on Google Cloud Platform, I configured app.yaml to define how the application runs on GAE's infrastructure. It specifies the runtime environment, URL routing handlers, and static file mappings.
Note: If you are using Heroku instead of GCP, create a Procfile instead of app.yaml.
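A minimal app.yaml for the GAE standard environment might look like this (the runtime version is an assumption):

```yaml
runtime: python39          # GAE standard environment runtime (version is an assumption)

handlers:
  - url: /static
    static_dir: static     # serve CSS/JS/images directly from App Engine
  - url: /.*
    script: auto           # route everything else to the Flask app
```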
Local Testing:
Before going live, I ran the app locally to check that everything worked. Once it works on a local server, it is safe to deploy to a remote one.
For that, start the development server from the terminal:
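Assuming app.py ends with the usual app.run() guard:

```shell
# Launch the Flask development server (defaults to http://127.0.0.1:5000).
python app.py
```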
This ran the app at http://127.0.0.1:5000. I filled out the form, submitted it, and verified the predictions worked.
Cloud Deployment: GCP
1. Created a new project on GCP (e.g., insurance-cost-predictor-app).
2. Navigated to “APIs & Services” > “Library.”
3. Enabled “App Engine Admin API”.
4. Installed the Google Cloud SDK and set the project:
5. Replaced the local server URL in the front-end code with the URL provided by GAE.
6. Deployed with:
7. Launched the app:
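The commands behind steps 4, 6, and 7 are roughly these (the project ID is an example; they require the Google Cloud SDK and cannot be run outside a configured GCP account):

```shell
# 4. Point the gcloud CLI at the new project.
gcloud config set project insurance-cost-predictor-app

# 6. Deploy the app described by app.yaml to App Engine.
gcloud app deploy

# 7. Open the deployed app in the default browser.
gcloud app browse
```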
This launches the web application in the browser. Fill out the form and hit the predict button to see your estimated cost! You have successfully built a full-stack web application powered by a machine learning model.
In case you encounter issues, open the browser developer tools (press F12) and debug via the console.
Thank you for reading! Building this app was a rewarding journey, blending data science and web development. I hope you are now excited to build a web application yourself. I'd love to hear your feedback - feel free to comment or reach out. Happy coding!