End-to-End Insurance Cost Predictor: A Machine Learning Web App
I had long envisioned building a full-stack web application powered by a machine learning model, and I finally developed my first: an Insurance Cost Predictor web app. You can explore it here: Insurance Cost Predictor Web App.
In this article, I'll walk you through the entire process - from acquiring the dataset to deploying the app on a server. Whether you are a beginner or an experienced developer, I hope this article will inspire you to build your own machine learning-powered application.
The Journey: A 4D Data Science Framework
For this project, I followed a structured 4D Data Science Framework:
Define the problem
Discover the data
Develop the model
Deploy the model
Below is how each step unfolded.
Step 1: Define the Problem
The objective was simple yet significant: predict health insurance costs as accurately as possible from individual attributes. Such an app could help users estimate their insurance premiums and identify the factors driving the charges.
Step 2: Discover the Data
I sourced the dataset from Kaggle: Kaggle Dataset. It is a straightforward dataset with seven features:
Age: The individual’s age
Sex: Male or female
BMI: Body Mass Index
Children: Number of dependents
Smoker: Yes or no
Region: Geographic area (northeast, northwest, southeast, southwest)
Charges: Insurance cost (the target variable)
The dataset was already clean, requiring minimal preprocessing. While exploring it, I uncovered key patterns from the correlation matrix:
Smoker Status: A strong positive correlation with the target variable Charges; digging deeper with various bar charts confirmed this initial finding.
Age: A clear linear relationship - older individuals paid higher premiums.
BMI: Little standalone correlation with Charges, but combined with Smoker status it revealed a stark contrast in costs.
To prepare the data, I applied one-hot encoding to the Region feature, converting it from categorical to numerical values. You can check out my EDA process in my notebook here: Insurance Cost Prediction Notebook
Step 3: Develop the Model
With insights from EDA, I moved to model development. Here's how I approached it:
1. Data Preparation: Split the dataset into training and test sets using the train_test_split method from scikit-learn, then normalized the features with MinMaxScaler so that all attributes were on the same scale.
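As a sketch of this step (using a few synthetic rows in place of the Kaggle data, with sex and smoker already label-encoded and region one-hot encoded), it might look like:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# A few synthetic rows standing in for the Kaggle data, with sex/smoker
# label-encoded and region one-hot encoded as in the EDA step.
df = pd.DataFrame({
    "age": [19, 33, 46, 61],
    "sex": [1, 0, 1, 0],            # 1 = male, 0 = female
    "bmi": [27.9, 22.7, 30.1, 25.8],
    "children": [0, 1, 3, 2],
    "smoker": [1, 0, 0, 1],         # 1 = yes, 0 = no
    "region_northeast": [1, 0, 0, 0],
    "region_northwest": [0, 1, 0, 0],
    "region_southeast": [0, 0, 1, 0],
    "region_southwest": [0, 0, 0, 1],
    "charges": [16884.92, 4509.46, 8823.28, 28101.33],
})

X = df.drop(columns=["charges"])
y = df["charges"]

# Hold out a test set before scaling so no test-set information leaks in.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on the training split only
X_test_scaled = scaler.transform(X_test)        # reuse the training min/max
```

Fitting the scaler on the training split only prevents information from the test set leaking into the features.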
2. Model Selection: I initially tested four machine learning algorithms with default settings to establish a baseline:
Linear Regression
Support Vector Machines (SVM)
Random Forest Regressor
Gradient Boosting Regressor
The evaluation metric was Mean Absolute Error (MAE). The baseline results were as follows:
Linear Regression: 4181.19
Support Vector Regressor: 8618.36
Random Forest Regressor: 2569.72
Gradient Boosting Regressor: 2407.29
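A baseline loop along these lines produced the numbers above. Here is a self-contained sketch using synthetic regression data in place of the insurance features, so the printed MAEs will differ from the article's:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic regression data standing in for the scaled insurance features.
X, y = make_regression(n_samples=300, n_features=9, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Support Vector Regressor": SVR(),
    "Random Forest Regressor": RandomForestRegressor(random_state=42),
    "Gradient Boosting Regressor": GradientBoostingRegressor(random_state=42),
}

baseline_mae = {}
for name, model in models.items():
    model.fit(X_train, y_train)              # default hyperparameters only
    preds = model.predict(X_test)
    baseline_mae[name] = mean_absolute_error(y_test, preds)
    print(f"{name}: {baseline_mae[name]:.2f}")
```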
After obtaining the baseline results, the objective was to beat these baselines through hyperparameter tuning. Linear Regression and SVM underperformed, so I focused on Random Forest and Gradient Boosting.
3. Hyperparameter Tuning:
Random Forest: Used RandomizedSearchCV with 10-fold cross-validation for hyperparameter tuning. Best parameters:
n_estimators: 310, min_samples_split: 10, min_samples_leaf: 3, max_features: 'sqrt', max_depth: 71, bootstrap: False
Mean Absolute Error on Test Set: 2632.60
Since the dataset is quite simple, the tuned Random Forest model may have overfit the training set, which could explain why its test error is slightly higher than the default model's.
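The Random Forest search can be sketched as follows. The data is synthetic, and the candidate lists are abbreviated stand-ins for the real search space (they include the best values reported above):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data; the real search ran on the insurance features.
X, y = make_regression(n_samples=200, n_features=9, noise=10.0, random_state=0)

param_distributions = {
    "n_estimators": [50, 100, 200, 310],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 3, 5],
    "max_features": ["sqrt", "log2"],
    "max_depth": [10, 40, 71, None],
    "bootstrap": [True, False],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                            # kept tiny here; the real search tried more candidates
    cv=10,                               # 10-fold cross-validation, as in the article
    scoring="neg_mean_absolute_error",
    random_state=0,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)
```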
Gradient Boosting: Also tuned with RandomizedSearchCV. Best parameters:
subsample: 0.9, n_estimators: 50, min_samples_split: 20, min_samples_leaf: 19, max_depth: 14, loss: 'huber', learning_rate: 0.1
Mean Absolute Error on Test Set: 1761.07
Here, the Gradient Boosting Regressor's MAE was the lowest of all the models, so I refined it further with GridSearchCV, narrowing the parameter ranges to systematically test every combination in the specified grid.
Best hyperparameters after GridSearchCV:
learning_rate: 0.1, loss: 'huber', max_depth: 16, min_samples_leaf: 20, min_samples_split: 10, n_estimators: 40, subsample: 0.8
Best R2 score after GridSearchCV: 0.8487
Final Test MAE: 1751.75
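The GridSearchCV refinement might look like the sketch below, with a narrow grid centered on the values found by the randomized search (again on synthetic stand-in data, so the scores will differ from the article's):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=9, noise=10.0, random_state=0)

# Narrow grid around the best randomized-search values (abbreviated here).
param_grid = {
    "learning_rate": [0.05, 0.1],
    "loss": ["huber"],
    "max_depth": [14, 16],
    "min_samples_leaf": [19, 20],
    "min_samples_split": [10, 20],
    "n_estimators": [40, 50],
    "subsample": [0.8, 0.9],
}

grid = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid=param_grid,               # every combination is tried exhaustively
    scoring="neg_mean_absolute_error",
    cv=5,
    n_jobs=-1,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```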
4. Model Export: The tuned Gradient Boosting model was saved with the joblib library as cost_predictor_model.pkl for deployment.
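Exporting with joblib is a one-liner. This sketch trains a small model on synthetic data, saves it, and verifies the round trip:

```python
import joblib
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Small synthetic model standing in for the tuned Gradient Boosting model.
X, y = make_regression(n_samples=100, n_features=9, noise=10.0, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Persist the fitted model; the Flask app loads this file at prediction time.
joblib.dump(model, "cost_predictor_model.pkl")

# Round-trip check: the reloaded model predicts identically.
reloaded = joblib.load("cost_predictor_model.pkl")
```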
That completed the model development.
Step 4: Deploy the Model
With the model ready, I built and deployed a full-stack web app.
The project structure looks like this:
insurance_cost_prediction_app/
├── app.py
├── cost_predictor_model.pkl
├── app.yaml
├── requirements.txt
├── templates/
│   └── index.html
└── static/
    ├── styles.css
    ├── script.js
    └── images/
Back-End: Main Flask app (app.py)
The app.py file is the hub of the web application. Flask, a lightweight Python web framework, requires a Python script like this to define the application instance, routes, and logic.
The app is initialized with app = Flask(__name__), which sets up its environment and allows it to handle HTTP requests.
Key routes:
app.route('/'): Renders the HTML template.
app.route('/predict', methods=['POST']): Accepts user inputs, processes them, and returns the predicted cost.
Below is a snippet of the code. (full version at app.py)
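The sketch below is illustrative rather than the exact file: the FEATURE_ORDER list and the lazy model loading are assumptions, and the real app may also apply the saved scaler to incoming values before predicting.

```python
from flask import Flask, render_template, request, jsonify
import joblib

app = Flask(__name__)

# Feature order the model expects -- an assumption for illustration;
# it must match the column order used during training.
FEATURE_ORDER = ["age", "sex", "bmi", "children", "smoker",
                 "region_northeast", "region_northwest",
                 "region_southeast", "region_southwest"]

_model = None

def get_model():
    """Load the exported model lazily, on the first prediction request."""
    global _model
    if _model is None:
        _model = joblib.load("cost_predictor_model.pkl")
    return _model

@app.route("/")
def index():
    # Serves templates/index.html
    return render_template("index.html")

@app.route("/predict", methods=["POST"])
def predict():
    data = request.get_json()
    row = [[float(data[name]) for name in FEATURE_ORDER]]
    prediction = get_model().predict(row)[0]
    return jsonify({"prediction": round(float(prediction), 2)})

if __name__ == "__main__":
    app.run(debug=True)
```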
Front-End: index.html
The front end is a simple form built with HTML, styled with styles.css, and made interactive with script.js. It collects the input data such as age, sex, and smoker status, then sends it to the server. See it here: index.html
Client-Side Logic: script.js
The client-side JavaScript runs in the user's browser. It prevents the form's default submission behavior, collects the user's inputs, and sends them to the server as a JSON payload. When app.py returns a response, script.js updates the page dynamically without a full reload.
Here is the snippet of the code. (full version at script.js)
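A sketch of what script.js might contain; the form field names, element IDs (#predict-form, #result), and encoding rules are assumptions that must match the actual HTML and back end:

```javascript
// Collect form values into the JSON payload the /predict route expects.
// Field names and encodings are assumptions for illustration.
function buildPayload(form) {
  return {
    age: Number(form.age),
    sex: form.sex === "male" ? 1 : 0,
    bmi: Number(form.bmi),
    children: Number(form.children),
    smoker: form.smoker === "yes" ? 1 : 0,
    region_northeast: form.region === "northeast" ? 1 : 0,
    region_northwest: form.region === "northwest" ? 1 : 0,
    region_southeast: form.region === "southeast" ? 1 : 0,
    region_southwest: form.region === "southwest" ? 1 : 0,
  };
}

// Browser-only wiring: intercept the submit, POST the payload,
// and update the page in place without a reload.
if (typeof document !== "undefined") {
  document.querySelector("#predict-form").addEventListener("submit", async (event) => {
    event.preventDefault(); // stop the full-page form submission
    const form = Object.fromEntries(new FormData(event.target));
    const response = await fetch("/predict", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(buildPayload(form)),
    });
    const result = await response.json();
    document.querySelector("#result").textContent = `Estimated cost: $${result.prediction}`;
  });
}
```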
Dependencies: requirements.txt
After creating the server-side and client-side files, it is time to freeze the project's dependencies so that the application runs smoothly on any machine it is deployed to. The file lists the libraries along with their exact versions and is generated from the local environment with a single command.
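With the project's virtual environment activated, the standard way to generate it is:

```shell
# Capture the exact versions installed in the active environment.
pip freeze > requirements.txt
```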
GAE configuration: app.yaml
Since I deployed this application on Google Cloud Platform, I configured app.yaml to define how the application runs on GAE's infrastructure. It specifies the runtime environment, URL routing handlers, and static file mappings.
Note: If you are using Heroku instead of GCP, create a Procfile instead of app.yaml.
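A minimal app.yaml for the GAE standard environment might look like this (the runtime version is an assumption):

```yaml
runtime: python39          # GAE standard environment runtime (version is an assumption)

handlers:
  - url: /static
    static_dir: static     # serve CSS/JS/images directly from App Engine
  - url: /.*
    script: auto           # route everything else to the Flask app
```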
Local Testing:
Before going live, I ran the app locally to check that everything worked. Once it works on a local server, it is safe to deploy to a remote one.
For that, start the development server from the terminal:
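Assuming app.py ends with the usual app.run() guard:

```shell
# Launch the Flask development server (defaults to http://127.0.0.1:5000).
python app.py
```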
This ran the app at http://127.0.0.1:5000. I filled out the form, submitted it, and verified the predictions worked.
Cloud Deployment: GCP
1. Created a new project on GCP (e.g., insurance-cost-predictor-app).
2. Navigated to “APIs & Services” > “Library.”
3. Enabled “App Engine Admin API”.
4. Installed the Google Cloud SDK and set the project:
5. Replaced the local server URL in the front-end code with the URL provided by GAE.
6. Deployed with:
7. Launched the app:
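The commands behind steps 4, 6, and 7 are roughly these (the project ID is an example; they require the Google Cloud SDK and cannot be run outside a configured GCP account):

```shell
# 4. Point the gcloud CLI at the new project.
gcloud config set project insurance-cost-predictor-app

# 6. Deploy the app described by app.yaml to App Engine.
gcloud app deploy

# 7. Open the deployed app in the default browser.
gcloud app browse
```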
This launches the web application in the browser. Fill out the form and hit the predict button to see your estimated cost! You have successfully built a full-stack web application powered by a machine learning model.
In case you encounter issues, open the browser developer tools (press F12) and debug via the console.
Thank you for reading! Building this app was a rewarding journey, blending data science and web development. I hope you are now excited to build a web application yourself. I'd love to hear your feedback - feel free to comment or reach out. Happy coding!