Mortgage Data for Machine Learning
Algorithms
Predictive Loan Approval Analysis on HMDA Data
09/07/2019
AGENDA
2
1. Introduction
2. Architecture & Design
3. Data
4. Exploratory Data Analysis
5. Data Wrangling & Feature Engineering
6. Machine Learning
7. Results
8. Conclusion
1. Introduction
3
Team Loan Canoe:
Anne Klieve, Blake Zenuni, Tamana Naheeme
HYPOTHESIS GENERATION
• WHY This Dataset: It is expansive and
feature-rich.
• Purpose: protect consumers by providing
transparency and monitoring for
discrimination.
WHAT IS HMDA?
i. Administered by the Federal Reserve
Board (1975-2011), then the CFPB
(2011 - present)
ii. Mandates that Lending Institutions
report public loan data
• What are my chances of being approved for a Home
Mortgage? Can we build an algorithmic product to
answer this question for consumers?
• Are there biases within some categorical
features in loan approval?
Problem Framing
i. Big banks have tools to classify and
predict loan approvals
ii. Consumers are relegated to
inferior products
• Median Household Income for the MSA should
have a strong direct relationship to
P(approval); other census-tract data
mappings may not matter as much.
• Income and loan amount should have a strong
direct relationship to P(approval); co-
applicant income might not have as strong
an effect.
• Features with a clear one-for-one effect on
P(approval) should also exhibit strong
correlations with most of the other
features.
4
Home Mortgage Disclosure Act (HMDA)
2. Architecture & Design
5
6
A. Data Ingestion
B. Data
Munging and
Wrangling
C. Statistical
Analyses
D. Modeling and
Application
E. Visualization and
Reporting
Data Science Project Pipeline
7
Project Toolkit
3. Data
8
9
• Home Mortgage Disclosure Act (HMDA) Data
• Provides nationwide, loan-level data on U.S. mortgages.
• Frequently used for research.
• Millions of instances for every year (i.e., the action taken on each individual loan
application: approved, denied, etc.).
• Used data from 2010 - 2017 for models.
• Served as source for all features
• Accessed via the Consumer Financial Protection Bureau’s public disclosure
data portal
Data Source
10
• Home Mortgage Disclosure Act (HMDA) Data Sample
• Random Sample from each year, balanced for each outcome class
• Raw data set contains 47 features
• Storage in PostgreSQL
• Advantages: Storage capacity, wrangling flexibility
• Considerations: Storage space requirements
• AWS Relational Database Service (RDS)
• Advantages: Shared database for collaborative purposes
• Considerations: Cost and storage limits
Data Ingestion and Storage
11
Database Management
HMDA 2010 – 2017 Loan Applications Reported Raw
• 47 features
• 11.2MM - 19MM records
• 9.9GB - 16.6GB
4. Exploratory Data Analysis
12
13
Exploratory Data Analysis - Loan Amount
14
Exploratory Data Analysis - Count by Outcome
15
Feature Analysis - Numeric Features
16
5. Data Wrangling & Feature Engineering
17
18
Data Wrangling
• Remove irrelevant categories from outcome of interest
• Convert outcome variable to binary 1/0
• Scale numeric features
• Drop:
• Features with a large majority of values missing, plus tuples with missing values
(FFIEC advises that missing values do not occur systematically)
• Features that would result in model leakage
• Frequency feature created for MSA
• One-hot encoding of categorical variables (21 -> 59 features)
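The wrangling steps above (binary outcome, scaled numerics, one-hot encoding) can be sketched with scikit-learn. This is a minimal illustration, not the project's actual pipeline; the toy frame and its HMDA-style column names are stand-ins.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame standing in for the HMDA extract; values are made up.
df = pd.DataFrame({
    "action_taken": [1, 3, 1, 3],  # 1 = originated, 3 = denied (HMDA codes)
    "applicant_income_000s": [64.0, 41.0, 120.0, 33.0],
    "loan_amount_000s": [180.0, 95.0, 400.0, 60.0],
    "property_type": ["1", "1", "2", "1"],
    "agency_code": ["CFPB", "HUD", "CFPB", "OCC"],
})

# Convert the outcome variable to binary 1/0 (approved vs. not).
y = (df["action_taken"] == 1).astype(int)
X = df.drop(columns="action_taken")

numeric = ["applicant_income_000s", "loan_amount_000s"]
categorical = ["property_type", "agency_code"]

# Scale numeric features; one-hot encode categorical features.
pre = ColumnTransformer([
    ("scale", StandardScaler(), numeric),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
])

X_t = pre.fit_transform(X)
print(X_t.shape)  # 2 scaled numerics + 5 one-hot columns -> (4, 7)
```

`handle_unknown="ignore"` keeps the transform from failing on category levels that appear only in held-out years.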
Data Wrangling and Feature Selection:
Final Dataset for Preprocessing
19
+ Unbalanced Randomized Sampling Technique => 25,000 per year, 200,000 total tuples all years 2010-2017
+ Balanced Randomized Sampling Technique => 12,000 per Loan Application Outcome => 200,000 total tuples 2010-2017
Three Separate AWS RDS Instances; ETL begins by extracting the data for one year in a SQL CTE, then applies scope and filter logic
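The balanced sampling step was performed in SQL against AWS RDS; a minimal pandas equivalent, using a toy frame with illustrative column names, might look like:

```python
import pandas as pd

# Illustrative stand-in for one HMDA extract; in the project this ran as SQL.
df = pd.DataFrame({
    "as_of_year": [2010] * 6 + [2011] * 6,
    "action_taken": [1, 1, 1, 0, 0, 0] * 2,
    "loan_amount_000s": list(range(12)),
})

PER_CLASS = 2  # the project drew 12,000 per outcome class

# Draw an equal-sized random sample for each (year, outcome) cell.
balanced = df.groupby(["as_of_year", "action_taken"]).sample(
    n=PER_CLASS, random_state=0
)
print(balanced["action_taken"].value_counts().to_dict())
```

Sampling within each (year, outcome) group keeps the classes balanced per year rather than only in aggregate.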
20
Data Wrangling and Feature Selection:
Final Dataset for Preprocessing
ETL => Execute UNION ALL on the two randomized sub-datasets, then perform type-casting and any other transformations
21
Data Wrangling and Feature Selection:
Final Dataset for Preprocessing
ETL => Establish connections across all three AWS hosts
via search_path as pg_catalog and dblink_connect
Complete feature selection and the SQL part of wrangling with UNION ALL
on the transformed single-year data files across the AWS RDS databases
22
Visualizing Missing Values
23
Outliers
24
Reconciliation of Features for Final Models
6. Machine Learning
25
26
Model Selection Process
27
Model Building Process
• Phase 1 and 2 Models
o GaussianNB, MultinomialNB, BernoulliNB,
o tree.DecisionTreeClassifier,
o LinearDiscriminantAnalysis,
o OLS, LogisticRegression, LogisticRegressionCV,
o BaggingClassifier, ExtraTreesClassifier, RandomForestClassifier,
o LinearSVC, SVM
• Phase 3 Candidate Models
o Logistic Regression,
o LinearDiscriminantAnalysis,
o LinearSVC,
o RandomForestClassifier
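The Phase 3 candidate comparison can be sketched as a cross-validated scoring loop. This is a hedged illustration on a synthetic dataset standing in for the wrangled HMDA matrix, not the team's actual run.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Synthetic binary-classification stand-in for the HMDA feature matrix.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "LinearDiscriminantAnalysis": LinearDiscriminantAnalysis(),
    "LinearSVC": LinearSVC(max_iter=5000),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
}

# Mean 5-fold cross-validated F1 for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="f1").mean()
    for name, model in candidates.items()
}
for name, f1 in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {f1:.3f}")
```

Cross-validation here guards against the perfect-looking Phase 1 scores that leaked features can produce.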
28
Building Predictive Models - Phase 1
● Random Samples of 25,000
Instances for Single Years
● Initial Feature Selection of
132 Features inc. States,
MSA Binary variables
● Data Wrangling in Pandas,
no Pipeline
● Included features the team
would later remove to avoid
model leakage
Model Form F1 Precision Recall
GaussianNB 0.871 0.984 0.782
MultinomialNB 0.624 0.722 0.550
BernoulliNB 0.889 0.979 0.979
tree.DecisionTreeClassifier 1.000 1.000 1.000
LinearDiscriminantAnalysis 0.887 0.973 0.814
LogisticRegression 0.790 0.654 0.999
LogisticRegressionCV 0.853 0.780 0.941
BaggingClassifier 0.994 0.999 0.989
ExtraTreesClassifier 1.000 1.000 1.000
RandomForestClassifier 1.000 1.000 1.000
29
Building Predictive Models - Phase 2
● Single Year and All Years models
● Removed two additional features to avoid model leakage
● Shifted from Pandas to Scikit-Learn Pipeline
● Added cross-validation and balanced sample data
● Broke out precision, recall, and F1 scores for action_taken = {0,1}
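Breaking out per-class precision, recall, and F1 for action_taken = {0,1} can be sketched as below; the pipeline and synthetic data are illustrative stand-ins for the Phase 2 setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Balanced synthetic stand-in for the wrangled HMDA matrix.
X, y = make_classification(n_samples=1000, weights=[0.5, 0.5], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# Scikit-Learn Pipeline: scaling + classifier in one estimator.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

# Per-class precision / recall / F1 for action_taken = 0 and 1.
prec, rec, f1, _ = precision_recall_fscore_support(
    y_te, pipe.predict(X_te), labels=[0, 1]
)
for cls, p, r, f in zip([0, 1], prec, rec, f1):
    print(f"action_taken={cls}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Reporting both classes separately, as in the Phase 2/3 tables, exposes asymmetries that a single overall score would hide.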
30
Building Predictive Models - Phase 2
31
Building Predictive Models - Phase 3
7. Results
32
33
Results
• Final Model Selections:
• Logistic Regression
• Random Forest Classifier
Model Form Statistic Overall action_taken=1 action_taken=0
Logistic Regression
Precision 0.701676 0.703876 0.699477
Recall 0.701635 0.69617 0.7071
F1 0.70162 0.699986 0.703253
RandomForestClassifier
Precision 0.742424 0.741125 0.743723
Recall 0.742395 0.74508 0.73971
F1 0.742387 0.743081 0.741694
8. Conclusion
34
35
Further Implications
• Further EDA (e.g., mutual information).
• Analysis of merged data, including additional features.
• The HMDA API has been re-implemented with improvements, and the current tool is being
sunset. With the enhanced API, it would be best to query the API directly
and use MongoDB as the intermediary.
• Robustness checks with different versions of geographic features.
36
Conclusion
• Within Project Scoping, Sufficient Number of Important Features to Produce Models
• Applicant Income, Loan Amount, Property Type, Agency, Median Household Income, Sex, Race
• Filters: Outcome variable designations; Loan Type
• Model Selection
• Our data responded well to Logistic Regression and RandomForestClassifier
• Team Lessons Learned
• Generally, we were able to accomplish what we sought to achieve
• Decisions regarding cut-off points for timely completion
• Close coordination and robust planning tools were critical to the team’s success
• Best to start posting code quickly, then iteratively refine