© Copyright 2011 EMC Corporation. All rights reserved. EMC Restricted Confidential
Israel Chavez
Ngadhnjim Halilaj
Anusha Kodali
Marcos Quezada
Jyoti Shrestha
Sarat Tadi
April 28, 2016
EMC Education Services
Data Science & Big Data Analytics
Project Goals
• Create a model that will allow FPC to provide a loan predicting service to
its customers.
• Identify the necessary attributes that will enable the model to give a better
prediction.
• Test the Marketing Department threshold suggestions.
• Advise FPC on the suggestions it could offer its customers
to increase their chances of getting a loan.
Situation
• FPC wants to expand the services it offers its customers by creating
an online site for loan advice.
• Provide a fast and reliable planning platform for customers to manage their
personal finances.
• Attract potential customers who want to know their eligibility for loans, thus
increasing FPC business.
Executive Summary
Logistic regression and decision tree models are both moderately effective
at predicting the loan outcome
• Logistic Regression
  – Precision: 0.786
  – Recall: 0.984
• Decision Tree
  – Precision: 0.784
  – Recall: 0.984
Approach - Discovery
• Used the 2010 housing loan database assembled under the Home Mortgage Disclosure Act (HMDA).
• Filtered the data on:
  – Occupancy: owner-occupied
  – Property type: 1-4 family
  – Action type: loan originated, application approved but not accepted,
    application denied, application withdrawn
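The three discovery filters can be sketched as a simple record predicate. This is a minimal illustration, not the team's script: the field names (`owner_occupancy`, `property_type`, `action_type`) and the numeric code values are assumptions about how the HMDA loan records are encoded, and the sample records are made up.

```python
# Assumed HMDA code values; the slide names the categories but not the codes.
OWNER_OCCUPIED = 1
ONE_TO_FOUR_FAMILY = 1
KEPT_ACTIONS = {
    1: "Loan originated",
    2: "Application approved but not accepted",
    3: "Application denied",
    4: "Application withdrawn",
}


def keep_record(record):
    """Return True if a loan record passes all three discovery filters."""
    return (
        record["owner_occupancy"] == OWNER_OCCUPIED
        and record["property_type"] == ONE_TO_FOUR_FAMILY
        and record["action_type"] in KEPT_ACTIONS
    )


def filter_records(records):
    """Keep owner-occupied, 1-4 family records with a retained action type."""
    return [r for r in records if keep_record(r)]


# Made-up records, for illustration only.
sample = [
    {"owner_occupancy": 1, "property_type": 1, "action_type": 1},
    {"owner_occupancy": 2, "property_type": 1, "action_type": 3},  # not owner-occupied
    {"owner_occupancy": 1, "property_type": 1, "action_type": 6},  # action type not retained
]
filtered = filter_records(sample)
```

Only the first sample record survives, because it is the only one that satisfies all three conditions.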
Data Preparation
• Data Conditioning:
  – Converted categorical variables to factors and removed incomplete records to create the working data set.
  – Releveled factor variables to set the reference level for logistic regression.
  – Tested correlation among the numeric variables with a correlation matrix.
  – Reduced the data set to “Originated” and “Denied” loans.
• Data Visualization:
  – Reviewed the data to check distributions and noise.
  – Found two sources of noise:
    ▪ Home improvement loans
    ▪ Loan amounts > $400K
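The correlation check on numeric variables can be illustrated with a small pairwise Pearson correlation matrix. This is a sketch under assumptions: the column names and values are invented for the example, and the slides do not say which correlation coefficient was used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Map each (name_a, name_b) pair of columns to its Pearson correlation."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}


# Made-up values, for illustration only.
columns = {
    "income": [40.0, 55.0, 70.0, 90.0, 120.0],
    "loan_amount": [100.0, 140.0, 160.0, 230.0, 300.0],
    "tract_minority_pct": [35.0, 10.0, 22.0, 5.0, 18.0],
}
matrix = correlation_matrix(columns)
```

Pairs with a correlation close to 1 or -1 carry largely the same information, which is one reason to check the matrix before fitting a regression.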
Approach - Model Planning
• Model Selection:
  – Two methods:
    ▪ Logistic Regression
    ▪ Classification Tree
• Regression:
  – Used the 0.5 and 0.75 thresholds suggested by the Marketing Department.
Approach - Model Planning
• Variable Selection:
  – Created a small set to test three variable-selection scenarios:
    ▪ Absence of personal data
    ▪ Absence of county data
    ▪ Absence of personal and county data
• Developed two full models:
  – Model 1: included everything that the example script suggested.
  – Model 2: included only the variables that we chose to build the model with.
• Pseudo-R² was used to gauge how much of the variation each model explains.
• ROC & AUC were used to check the performance of our model.
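The slides do not say which pseudo-R² was used. As one common choice, the sketch below assumes McFadden's pseudo-R², which compares the log-likelihood of the fitted model with that of an intercept-only model that always predicts the base approval rate. The outcomes and probabilities are invented for illustration.

```python
import math


def log_likelihood(y_true, p_pred):
    """Bernoulli log-likelihood of 0/1 outcomes under predicted probabilities."""
    return sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for y, p in zip(y_true, p_pred)
    )


def mcfadden_pseudo_r2(y_true, p_pred):
    """McFadden's pseudo-R²: 1 - LL(model) / LL(intercept-only model)."""
    base_rate = sum(y_true) / len(y_true)
    ll_null = log_likelihood(y_true, [base_rate] * len(y_true))
    ll_model = log_likelihood(y_true, p_pred)
    return 1.0 - ll_model / ll_null


# Made-up outcomes (1 = approved) and predicted probabilities.
y = [1, 1, 1, 0, 1, 0, 1, 1]
p = [0.9, 0.8, 0.7, 0.4, 0.85, 0.3, 0.6, 0.75]
r2 = mcfadden_pseudo_r2(y, p)
```

A model that only predicts the base rate scores 0 on this measure, and values closer to 1 indicate a better fit than the intercept-only baseline.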
Approach - Model Building
• Created a holdout set with 25% of the data to test the models.
• Logistic Regression:
  – Categorized the holdout predictions into three bins:
    ▪ Low (<50%)
    ▪ Medium (50-74%)
    ▪ High (>=75%)
• To further test the regression model, we experimented with a binary
classification (Loan Rejected / Loan Approved):
  – First prediction: threshold 0.5.
  – Second prediction: threshold 0.7.
• Decision Tree:
  – Used binary classification (Loan Rejected / Loan Approved).
• A confusion matrix was built for each method to compare them.
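The binning and thresholding steps above can be sketched as two small functions over a predicted approval probability. The function names are hypothetical, and treating a probability exactly at the threshold as approved is an assumption; the slides do not state how ties are handled.

```python
def probability_bin(p):
    """Assign a predicted approval probability to the Low/Medium/High bin."""
    if p < 0.50:
        return "Low"
    if p < 0.75:
        return "Medium"
    return "High"


def classify(p, threshold=0.5):
    """Binary decision: approved when the probability reaches the threshold."""
    return "Loan Approved" if p >= threshold else "Loan Rejected"


# Made-up probabilities, for illustration only.
probabilities = [0.32, 0.50, 0.74, 0.75, 0.91]
bins = [probability_bin(p) for p in probabilities]
decisions_050 = [classify(p, threshold=0.5) for p in probabilities]
decisions_070 = [classify(p, threshold=0.7) for p in probabilities]
```

Raising the threshold moves borderline applicants from "Loan Approved" to "Loan Rejected", which trades recall for precision.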
Approach - Model Results and Accuracy
• The logistic regression model with threshold 0.5 has predictive power
at least as good as the decision tree model.

Logistic Regression (threshold = 0.5)
                Predicted FALSE   Predicted TRUE
Actual FALSE              2,452           23,657
Actual TRUE               1,385           87,383

Decision Tree
                Predicted FALSE   Predicted TRUE
Actual FALSE              2,082           24,027
Actual TRUE               1,349           87,419

             Logistic Regression   Decision Tree
Accuracy                   0.780           0.779
Precision                  0.786           0.784
Recall                     0.984           0.984
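Accuracy, precision, and recall follow directly from the four cells of each confusion matrix. The sketch below recomputes them from the counts on this slide, treating TRUE (approved) as the positive class. The function name is hypothetical, and the recomputed values may differ from the reported figures in the third decimal place, possibly because of rounding or truncation on the slide.

```python
def metrics(tn, fp, fn, tp):
    """Accuracy, precision and recall from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,   # share of all predictions that are correct
        "precision": tp / (tp + fp),     # share of predicted approvals that were approved
        "recall": tp / (tp + fn),        # share of actual approvals that were predicted
    }


# Counts taken from the two confusion matrices on the slide.
logistic = metrics(tn=2452, fp=23657, fn=1385, tp=87383)
tree = metrics(tn=2082, fp=24027, fn=1349, tp=87419)

for name, result in (("Logistic Regression", logistic), ("Decision Tree", tree)):
    print(name, {k: round(v, 3) for k, v in result.items()})
```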
Logistic Regression Prediction
Decision Tree Visualization
A decision tree model provides a useful benchmark for the predictive power of the
logistic regression model
Model Description
• Overview of basic methodology: predict the likelihood of a person getting a loan
from FPC.
• Model: logistic regression and decision tree.
• Dependent variable: “Approved”, indicating whether the loan application was approved.
• Scope:
  – 662,997 total observations for 2010, extracted from the housing loan
    database assembled by federal agencies pursuant to the Home
    Mortgage Disclosure Act (HMDA).
  – After thorough cleaning, 550,336 observations remained for modeling.
• Sampling:
  – Small set: 10% of the data.
  – Holdout set: 25% of the data.
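The sampling scheme can be sketched with a single seeded shuffle. This is an illustration under assumptions: the slides do not say whether the 10% small set was drawn from the full data or only from the non-holdout portion, so the sketch takes it from the training portion so that it never overlaps the holdout set. The function name and seed are hypothetical.

```python
import random


def split_sets(records, holdout_frac=0.25, small_frac=0.10, seed=42):
    """Shuffle once, then carve out a holdout set and a small exploratory set."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_frac)
    holdout = shuffled[:n_holdout]
    training = shuffled[n_holdout:]
    # Assumption: the small set is sized as a fraction of all records
    # but drawn only from the training portion.
    n_small = int(len(shuffled) * small_frac)
    small = training[:n_small]
    return training, holdout, small


# Stand-in record ids; the real data had 550,336 cleaned observations.
records = list(range(1000))
training, holdout, small = split_sets(records)
```

Fixing the seed keeps the split reproducible, so both models are evaluated on the same holdout records.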
Data distribution visualization
Plotting each variable's distribution and checking it against a normal shape helps
gauge how well it may serve as a predictor
Data distribution visualization
Removing unwanted noise from the data increases the predictive
power of the model
ROC/AUC
The ROC curves of the reduced models lie just inside the full-model curve;
essentially they are the same model.
• Full model: AUC 0.70
• Personal data removed: AUC 0.69
• Personal and county data removed: AUC 0.68
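AUC can be read as the probability that a randomly chosen approved loan receives a higher score than a randomly chosen denied one. The sketch below computes it that way, with ties counted as half. It is a pairwise illustration on made-up scores, not the tool used for the slide, and it is too slow for a data set of this size.

```python
def auc(y_true, scores):
    """AUC as the chance that a random positive outranks a random negative.

    Ties count as half a win. Runs in O(n_pos * n_neg) time, which is fine
    for a small illustration but slow on large data.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = approved) and model scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
example_auc = auc(labels, scores)
```

On this scale 0.5 corresponds to random ranking and 1.0 to perfect ranking, so AUC values of 0.68 to 0.70 indicate modest separation between approved and denied loans.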
Recommendations
• The available data supports moderately effective predictions.
• Logistic regression and the classification tree yield similar results.
• Logistic regression should be used, given the web app's response-time
requirement.
• The model provides an estimate, not an assurance, that a specific customer
will or will not get a loan.
• Sensitive personal information has little effect on the model.
• County information has little effect on the model.
• Higher income increases the chances of getting a loan.
• A higher percentage of minority population in the customer's tract reduces the
predicted chances of getting a loan (we do not recommend showing this finding on the website).
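One reason logistic regression suits a response-time requirement is that scoring an applicant is a weighted sum followed by a sigmoid. The sketch below illustrates that with hypothetical coefficients and feature names. They are not the fitted model from this project, and the output is an estimate rather than an assurance.

```python
import math

# Hypothetical coefficients, for illustration only; not the fitted model.
INTERCEPT = 0.4
COEFFICIENTS = {"income_thousands": 0.012, "loan_amount_thousands": -0.002}


def approval_probability(features):
    """Score one applicant: sigmoid of the intercept plus weighted features."""
    z = INTERCEPT + sum(COEFFICIENTS[name] * features[name] for name in COEFFICIENTS)
    return 1.0 / (1.0 + math.exp(-z))


# Two made-up applicants who differ only in income.
low = approval_probability({"income_thousands": 40, "loan_amount_thousands": 200})
high = approval_probability({"income_thousands": 120, "loan_amount_thousands": 200})
```

With a positive income coefficient, the higher-income applicant receives the higher estimated probability, consistent with the income finding above.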
Loan predicting web service
