Machine Learning project
Team members: Jack, Harry & Abhishek
Homesite problem: predicting quote conversion
- Homesite sells home insurance to home buyers in the United States
- Insurance quotes are offered to customers based on several factors

What Homesite knows:
- The customer's geographical, personal, financial and home-ownership details
- The quote offered to every customer

What Homesite doesn't know:
- The customer's likelihood of buying that insurance contract
Data shared: training & test
Task: binary classification
Training set: 261k rows, 298 predictors, 1 binary response
Test set: 200k rows, 298 columns
Predictors: customer activity, geography, personal, property & coverage
Response: customer conversion

What's good about the Homesite data:
- 296 of the variables have no NAs or bad data entries
- Not many levels in the nominal variables
- Plenty of binary variables
- Plenty of ordinal variables
- No unbalanced variables
- Almost no missing values
- No textual columns
Data cleaning steps:
1. Removing constant columns
2. Removing identifier columns
3. Synthesizing features from the date column
4. Treating NA variables
5. Treating bad levels (-1)
6. Treating false categoricals
7. Converting categoricals to dummy variables
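The seven steps above can be sketched with pandas. The column names ("QuoteNumber", "Original_Quote_Date") are assumptions borrowed from the public Homesite Kaggle data; the steps, not the names, are the point.

```python
import numpy as np
import pandas as pd

def clean(df, id_col="QuoteNumber", date_col="Original_Quote_Date"):
    # 1. Remove constant columns (they carry no information).
    df = df.drop(columns=[c for c in df.columns if df[c].nunique() <= 1])
    # 2. Remove the identifier column.
    if id_col in df.columns:
        df = df.drop(columns=[id_col])
    # 3. Synthesize year/month/weekday features from the date column.
    if date_col in df.columns:
        d = pd.to_datetime(df[date_col])
        df = df.drop(columns=[date_col]).assign(
            year=d.dt.year, month=d.dt.month, weekday=d.dt.weekday)
    # 4-5. Treat the bad "-1" level as missing, then impute NAs with 0.
    df = df.replace(-1, np.nan).fillna(0)
    # 6-7. Encode the remaining (true) categoricals as dummy variables.
    return pd.get_dummies(df)
```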
Gradient Boosting (iterative corrections)
Pros:
- Learns from past mistakes
- Can reach nearly zero training error
- Weighted scoring of multiple trees
Cons:
- Hard to tune, as there are many parameters to adjust
- Often overfits, and the stopping point is hard to decide
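A minimal sketch of the trade-off above, using scikit-learn's GradientBoostingClassifier on synthetic data (not the Homesite set): with enough stages the training error approaches zero, and the train/test gap is the overfitting risk.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# learning_rate, max_depth and n_estimators are the knobs that make
# boosting "hard to tune"; 200 shallow stages is just one choice.
gbm = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=3, random_state=0)
gbm.fit(X_tr, y_tr)
train_acc = gbm.score(X_tr, y_tr)  # approaches 1.0: "nearly 0 training error"
test_acc = gbm.score(X_te, y_te)   # the train/test gap signals overfitting
```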
Random Forests (majority wins)
Pros:
- Handles missing data
- Handles redundancy easily
- Reduces variance in the results
- Produces an out-of-bag error rate
- Produces de-correlated trees via random subspaces & random splits
Cons:
- Bias sometimes increases, as the trees are shallower
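The out-of-bag estimate mentioned above comes for free in scikit-learn: with oob_score=True, each tree is scored on the bootstrap samples it never saw. Synthetic data again, not Homesite's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
rf = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",  # random subspace at each split de-correlates trees
    oob_score=True,
    random_state=0,
)
rf.fit(X, y)
oob_error = 1.0 - rf.oob_score_  # a free validation estimate, no hold-out set
```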
Gradient Boosting + Random Forest
- Handles missing data
- Handles redundancy easily
- Reduces variance in the results
- Produces an out-of-bag error rate
- Produces de-correlated trees via random subspaces & random splits
- Does not overfit
- Little bias, due to the corrections
- Easy to tune
Drawback: quite slow and computationally expensive; optimizing these constraints could be an excellent area for research.

Our score: AUC = 0.95
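The slides don't say how the two models were combined; one common, minimal approach is to average their predicted probabilities, sketched here on synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gbm = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Average the two models' predicted probabilities for the positive class;
# the blend tends to keep boosting's low bias and the forest's low variance.
p_blend = (gbm.predict_proba(X_te)[:, 1] + rf.predict_proba(X_te)[:, 1]) / 2
auc_blend = roc_auc_score(y_te, p_blend)
```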
Calculating AUC

ID   True class   Predicted probability
1    1            0.8612
2    0            0.2134
3    0            0.1791
4    0            0.1134
5    1            0.7898
6    0            0.0612

- Pick a classification threshold
- Calculate the true positive rate (y) and false positive rate (x)
- Plot the point (x, y)
- Repeat for each threshold value in [0, 1]
- The resulting curve is the ROC curve
- The area under this curve is the AUC
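The six rows above, scored with scikit-learn. Every positive has a higher predicted probability than every negative, so the ranking is perfect and the AUC is exactly 1.0.

```python
from sklearn.metrics import roc_auc_score

# True classes and predicted probabilities from the table above.
y_true = [1, 0, 0, 0, 1, 0]
y_prob = [0.8612, 0.2134, 0.1791, 0.1134, 0.7898, 0.0612]

# roc_auc_score sweeps the thresholds and integrates the ROC curve.
auc = roc_auc_score(y_true, y_prob)  # 1.0: perfect separation
```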
War for the highest AUC

What we have already employed:
- Categorical-to-continuous conversion
- Continuous-to-ordinal conversion
- Variable bucketing
- SVM / logistic regression
- Random forests / decision trees
- Lasso / ridge / elastic net
- Gradient boosting
- Multicollinearity elimination
- Outlier treatment
- K-fold cross-validation
- Imputation for NAs
- Model tuning
- Variable transformation

What we look forward to using:
- Most importantly, your suggestions
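K-fold cross-validation and model tuning from the list above can be combined in one step. A small, hypothetical grid, scored by AUC as in the competition, on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 5-fold CV over a tiny parameter grid, selecting by ROC AUC.
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"max_depth": [2, 3], "learning_rate": [0.05, 0.1]},
    scoring="roc_auc",
    cv=5,
)
grid.fit(X, y)
best_auc = grid.best_score_  # mean cross-validated AUC of the best setting
```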
THANK YOU
