Mortgage Data for Machine Learning
Algorithms
Predictive Loan Approval Analysis on HMDA Data
09/07/2019
AGENDA
2
1. Introduction
2. Architecture & Design
3. Data
4. Exploratory Data Analysis
5. Data Wrangling & Feature Engineering
6. Machine Learning
7. Results
8. Conclusion
1. Introduction
3
Team Loan Canoe:
Anne Klieve, Blake Zenuni, Tamana Naheeme
HYPOTHESIS GENERATION
• WHY This Dataset: It is expansive and
feature-rich.
• Purpose: protect consumers by providing
transparency and monitoring for
discrimination.
WHAT IS HMDA?
i. Administered by the Federal Reserve
Board (1975-2011), then the CFPB
(2011 - present)
ii. Mandates that Lending Institutions
report public loan data
• What are my chances of being approved for a Home
Mortgage? Can we build an algorithmic product to
answer this question for consumers?
• Are there biases within some categorical
features in loan approval?
Problem Framing
i. Big banks have tools to classify and
predict loan approvals
ii. Consumers are relegated to
inferior products
• Median Household Income for the MSA should
have a strong direct relationship to
P(approval); other census-tract data
mappings may not matter as much.
• Income and loan amount should have a strong
direct relationship to P(approval); co-
applicant income might not have as strong
an effect.
• Features with a clear one-for-one effect on
P(approval) should also exhibit strong
correlations with most of the other
features.
4
Home Mortgage Disclosure Act (HMDA)
2. Architecture & Design
5
6
A. Data Ingestion
B. Data
Munging and
Wrangling
C. Statistical
Analyses
D. Modeling and
Application
E. Visualization and
Reporting
Data Science Project Pipeline
7
Project Toolkit
3. Data
8
9
• Home Mortgage Disclosure Act (HMDA) Data
• Provides nationwide, loan-level data on U.S. mortgages.
• Frequently used for research.
• Millions of instances for every year (i.e., the action taken on each individual loan
application: approved, denied, etc.).
• Used data from 2010 - 2017 for models.
• Served as source for all features
• Accessed via the Consumer Financial Protection Bureau’s public disclosure
data portal
Data Source
10
• Home Mortgage Disclosure Act (HMDA) Data Sample
• Random Sample from each year, balanced for each outcome class
• Raw data set contains 47 features
• Storage in PostgreSQL
• Advantages: Storage capacity, wrangling flexibility
• Considerations: Storage space requirements
• AWS Relational Database Service (RDS)
• Advantages: Shared database for collaborative purposes
• Considerations: Cost and storage limits
Data Ingestion and Storage
11
Database Management
HMDA 2010 – 2017 Loan Applications Reported Raw
• 47 features
• 11.2MM - 19MM records
• 9.9GB - 16.6GB
4. Exploratory Data Analysis
12
13
Exploratory Data Analysis - Loan Amount
14
Exploratory Data Analysis - Count by Outcome
15
Feature Analysis - Numeric Features
16
5. Data Wrangling & Feature Engineering
17
18
Data Wrangling
• Remove irrelevant categories from outcome of interest
• Convert outcome variable to binary 1/0
• Scale numeric features
• Drop:
• Features with a large majority of values missing, plus tuples with missing values
(FFIEC advises that missing values do not occur systematically)
• Features that would result in model leakage
• Frequency feature created for MSA
• One-hot encoding of categorical variables (21 -> 59 features)
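The wrangling steps above (binary outcome, scaled numerics, one-hot encoding) can be sketched with scikit-learn. This is a minimal illustration, not the project's actual pipeline; the toy frame and its HMDA-style column names are stand-ins.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame standing in for the HMDA extract; values are made up.
df = pd.DataFrame({
    "action_taken": [1, 3, 1, 3],  # 1 = originated, 3 = denied (HMDA codes)
    "applicant_income_000s": [64.0, 41.0, 120.0, 33.0],
    "loan_amount_000s": [180.0, 95.0, 400.0, 60.0],
    "property_type": ["1", "1", "2", "1"],
    "agency_code": ["CFPB", "HUD", "CFPB", "OCC"],
})

# Convert the outcome variable to binary 1/0 (approved vs. not).
y = (df["action_taken"] == 1).astype(int)
X = df.drop(columns="action_taken")

numeric = ["applicant_income_000s", "loan_amount_000s"]
categorical = ["property_type", "agency_code"]

# Scale numeric features; one-hot encode categorical features.
pre = ColumnTransformer([
    ("scale", StandardScaler(), numeric),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
])

X_t = pre.fit_transform(X)
print(X_t.shape)  # 2 scaled numerics + 5 one-hot columns -> (4, 7)
```

`handle_unknown="ignore"` keeps the transform from failing on category levels that appear only in held-out years.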
Data Wrangling and Feature Selection:
Final Dataset for Preprocessing
19
+ Unbalanced Randomized Sampling Technique => 25,000 per year, 200,000 total tuples all years 2010-2017
+ Balanced Randomized Sampling Technique => 12,000 per Loan Application Outcome => 200,000 total tuples 2010-2017
Three Separate AWS RDS Instances; ETL begins by extracting the data for one year in a SQL CTE, then applies scope and filter logic
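The balanced sampling step was performed in SQL against AWS RDS; a minimal pandas equivalent, using a toy frame with illustrative column names, might look like:

```python
import pandas as pd

# Illustrative stand-in for one HMDA extract; in the project this ran as SQL.
df = pd.DataFrame({
    "as_of_year": [2010] * 6 + [2011] * 6,
    "action_taken": [1, 1, 1, 0, 0, 0] * 2,
    "loan_amount_000s": list(range(12)),
})

PER_CLASS = 2  # the project drew 12,000 per outcome class

# Draw an equal-sized random sample for each (year, outcome) cell.
balanced = df.groupby(["as_of_year", "action_taken"]).sample(
    n=PER_CLASS, random_state=0
)
print(balanced["action_taken"].value_counts().to_dict())
```

Sampling within each (year, outcome) group keeps the classes balanced per year rather than only in aggregate.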
20
Data Wrangling and Feature Selection:
Final Dataset for Preprocessing
ETL => Execute UNION ALL on the two randomized sub-datasets, then perform type-casting and any other transformations
21
Data Wrangling and Feature Selection:
Final Dataset for Preprocessing
ETL => Establish connections across all three AWS hosts
via search_path as pg_catalog and dblink_connect
Complete feature selection and the SQL part of wrangling with UNION ALL
on the transformed single-year data files across the AWS RDS databases
22
Visualizing Missing Values
23
Outliers
24
Reconciliation of Features for Final Models
6. Machine Learning
25
26
Model Selection Process
27
Model Building Process
• Phase 1 and 2 Models
o GaussianNB, MultinomialNB, BernoulliNB,
o tree.DecisionTreeClassifier,
o LinearDiscriminantAnalysis,
o OLS, LogisticRegression, LogisticRegressionCV,
o BaggingClassifier, ExtraTreesClassifier, RandomForestClassifier,
o LinearSVC, SVM
• Phase 3 Candidate Models
o Logistic Regression,
o LinearDiscriminantAnalysis,
o LinearSVC,
o RandomForestClassifier
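The Phase 3 candidate comparison can be sketched as a cross-validated scoring loop. This is a hedged illustration on a synthetic dataset standing in for the wrangled HMDA matrix, not the team's actual run.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Synthetic binary-classification stand-in for the HMDA feature matrix.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "LinearDiscriminantAnalysis": LinearDiscriminantAnalysis(),
    "LinearSVC": LinearSVC(max_iter=5000),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
}

# Mean 5-fold cross-validated F1 for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="f1").mean()
    for name, model in candidates.items()
}
for name, f1 in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {f1:.3f}")
```

Cross-validation here guards against the perfect-looking Phase 1 scores that leaked features can produce.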
28
Building Predictive Models - Phase 1
● Random Samples of 25,000
Instances for Single Years
● Initial Feature Selection of
132 Features inc. States,
MSA Binary variables
● Data Wrangling in Pandas,
no Pipeline
● Included features the team
would later remove to avoid
model leakage
Model Form F1 Precision Recall
GaussianNB 0.871 0.984 0.782
MultinomialNB 0.624 0.722 0.550
BernoulliNB 0.889 0.979 0.979
tree.DecisionTreeClassifier 1.000 1.000 1.000
LinearDiscriminantAnalysis 0.887 0.973 0.814
LogisticRegression 0.790 0.654 0.999
LogisticRegressionCV 0.853 0.780 0.941
BaggingClassifier 0.994 0.999 0.989
ExtraTreesClassifier 1.000 1.000 1.000
RandomForestClassifier 1.000 1.000 1.000
29
Building Predictive Models - Phase 2
● Single Year and All Years models
● Removed two additional features to avoid model leakage
● Shifted from Pandas to Scikit-Learn Pipeline
● Added cross-validation and balanced sample data
● Broke out precision, recall, and F1 scores for action_taken = {0,1}
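Breaking out per-class precision, recall, and F1 for action_taken = {0,1} can be sketched as below; the pipeline and synthetic data are illustrative stand-ins for the Phase 2 setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Balanced synthetic stand-in for the wrangled HMDA matrix.
X, y = make_classification(n_samples=1000, weights=[0.5, 0.5], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# Scikit-Learn Pipeline: scaling + classifier in one estimator.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

# Per-class precision / recall / F1 for action_taken = 0 and 1.
prec, rec, f1, _ = precision_recall_fscore_support(
    y_te, pipe.predict(X_te), labels=[0, 1]
)
for cls, p, r, f in zip([0, 1], prec, rec, f1):
    print(f"action_taken={cls}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Reporting both classes separately, as in the Phase 2/3 tables, exposes asymmetries that a single overall score would hide.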
30
Building Predictive Models - Phase 2
31
Building Predictive Models - Phase 3
7. Results
32
33
Results
• Final Model Selections:
• Logistic Regression
• Random Forest Classifier
Model Form Statistic Overall action_taken=1 action_taken=0
Logistic Regression
Precision 0.701676 0.703876 0.699477
Recall 0.701635 0.69617 0.7071
F1 0.70162 0.699986 0.703253
RandomForestClassifier
Precision 0.742424 0.741125 0.743723
Recall 0.742395 0.74508 0.73971
F1 0.742387 0.743081 0.741694
8. Conclusion
34
35
Further Implications
• Further EDA (e.g., mutual information).
• Analysis of merged data, including additional features.
• The HMDA API has been re-implemented with improvements, and the current tool is being
sunset. With the enhanced API, it would be best to query the API directly
and use MongoDB as the intermediary.
• Robustness checks with different versions of geographic features.
36
Conclusion
• Within Project Scoping, Sufficient Number of Important Features to Produce Models
• Applicant Income, Loan Amount, Property Type, Agency, Median Household Income, Sex, Race
• Filters: Outcome variable designations; Loan Type
• Model Selection
• Our data responded well to Logistic Regression and RandomForestClassifier
• Team Lessons Learned
• Generally, we were able to accomplish what we sought to achieve
• Decisions regarding cut-off points for timely completion
• Close coordination and robust planning tools were critical to the team’s success
• Best to start posting code quickly, then iteratively refine