Challenges for Credit Risk
Predictive Analytics in Bulgaria
Vladimir Labov, FRM
Agenda
• The idea behind
• Solutions to practical problems in credit risk analytics:
• Outliers
• Missing values
• Logical coefficient signs
• Binning (grouping)
• Categorical variables
• Multicollinearity
• Applicants with high income and high indebtedness
• Unofficial income
• Current place of residence verification
11.02.2015 Vladimir Labov, Data Science Society 2
The Idea Behind
• Predictive analytics for credit risk tries to answer the question - is a
borrower going to give the money back?
• Why is it important? – it makes sure most of the money you keep in a bank
is lent to the right people
• Since this problem boils down to predicting whether the customer is
good or bad, statistical classification algorithms provide the best
solution.
• In practice, credit risk is quantified by estimating the expected loss
from each borrower.
• This is why methods that produce a probability of default between 0
and 1 are preferred, for example logistic regression.
• Two types of predictive models for credit risk - application and
behavioural scorecards. Application scorecards are more important,
because the actual lending decision depends on them.
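The link between expected loss and the probability of default can be written with the standard textbook decomposition. The symbols LGD and EAD do not appear on the slide and are added here only as the usual notation:

```latex
$$ EL = PD \times LGD \times EAD $$
```

Here PD is the probability of default (between 0 and 1), LGD is the loss given default (the share of the exposure that is lost), and EAD is the exposure at default. A model that outputs only a good/bad label gives no PD to plug into this product, which is one way to read the preference for logistic regression.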
The Problem of Outliers
• Problem – outliers distort estimates of regression coefficients
• Classical solution – trim them, use robust regression or quantile regression
• Elegant solution – transform the variables to weight of evidence or default
rates
• Weight of evidence calculation:
$$ WoE = \ln\left( \frac{\text{goods in group} / \text{all goods}}{\text{bads in group} / \text{all bads}} \right) \times 100 $$
• Interpretation:
- positive values: share of goods > share of bads
- negative values: share of goods < share of bads
- zero: share of goods = share of bads
• Advantage: no distortion of estimates; extreme values both in the
estimation sample and the holdout sample fall into the marginal WoE
groups
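A minimal sketch of the weight of evidence calculation, assuming each borrower already carries a group label and a default flag. The function name, the income bands and the toy counts are hypothetical and only illustrate the formula:

```python
import math
from collections import Counter

def weight_of_evidence(groups, is_bad):
    """Return {group: WoE}, WoE = ln((goods in group / all goods) / (bads in group / all bads)) * 100."""
    goods = Counter(g for g, bad in zip(groups, is_bad) if not bad)
    bads = Counter(g for g, bad in zip(groups, is_bad) if bad)
    all_goods = sum(goods.values())
    all_bads = sum(bads.values())
    woe = {}
    for g in set(groups):
        # Assumption: every group holds at least one good and one bad;
        # otherwise the logarithm is undefined and the group should be merged.
        share_goods = goods[g] / all_goods
        share_bads = bads[g] / all_bads
        woe[g] = math.log(share_goods / share_bads) * 100
    return woe

# Hypothetical toy sample: income band per borrower and a default flag (1 = bad).
income_band = ["low"] * 10 + ["mid"] * 10 + ["high"] * 10
default_flag = [1] * 4 + [0] * 6 + [1] * 2 + [0] * 8 + [1] * 1 + [0] * 9
woe_by_band = weight_of_evidence(income_band, default_flag)
```

An extreme income simply lands in the "high" band and receives that band's WoE, which is why outliers stop distorting the estimates. The same function works unchanged when the group labels are the categories of a categorical variable.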
The Problem of Missing Values
• Problem – missing values for a variable make the whole observation
useless
• Classical solution – trim them, use the mean value or multiple
imputation
• Elegant solutions:
- missing age or gender can be inferred from the ID number (ЕГН, the
uniform civil number, in Bulgaria)
- if missing values are few: assign them to the group with the closest
default rate, or to the most logical group (to the lowest income
group, lowest years of employment history group, etc.)
- transform the variables to weight of evidence or default rates
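The rule "assign missing values to the group with the closest default rate" could be sketched as follows. The helper name and the toy data are hypothetical, and the sketch assumes missing values are marked with None and that at least one observation is missing:

```python
def closest_group_by_default_rate(groups, is_bad, missing=None):
    """Return the non-missing group whose default rate is closest to that of the missing group."""
    totals, bads = {}, {}
    for g, bad in zip(groups, is_bad):
        totals[g] = totals.get(g, 0) + 1
        bads[g] = bads.get(g, 0) + int(bad)
    rates = {g: bads[g] / totals[g] for g in totals}
    # Assumption: the missing marker occurs at least once, else pop() raises KeyError.
    missing_rate = rates.pop(missing)
    return min(rates, key=lambda g: abs(rates[g] - missing_rate))

# Hypothetical toy data: None marks a missing income band, 1 marks a bad borrower.
bands = ["low"] * 10 + ["high"] * 10 + [None] * 5
flags = [1] * 4 + [0] * 6 + [1] * 1 + [0] * 9 + [1] * 2 + [0] * 3
target = closest_group_by_default_rate(bands, flags)
filled = [target if b is None else b for b in bands]
```

In this toy sample the missing observations default at the same rate as the "low" band, so they are merged into it. This is only sensible when the missing values are few, as the slide says.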
The Problem of
Logical Coefficient Signs
• Problem – certain variables may have a significant p-value, but with a
coefficient sign that defies economic logic
• Solutions:
- Estimate univariate regressions on each variable to help you get a
feel for what the logical sign should be
- The no-brainer: just use weight-of-evidence, all variables should
have a negative coefficient sign, example why:
coeff   WoE value    Z     pd
 -1        -1        1    0.73
 -1        -2        2    0.88
 +1        -1       -1    0.26
 +1        -2       -2    0.11

$$ Z = a + \sum_{i=1}^{n} b_i x_i = a + \sum_{i=1}^{n} \text{coeff}_i \cdot \text{WoE}_i $$
$$ pd = \frac{1}{1 + \exp(-Z)} $$
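The four rows of the example can be reproduced with a short sketch. It assumes a single variable and an intercept a = 0, which is consistent with Z equalling coeff times WoE in the example; the function name is hypothetical:

```python
import math

def probability_of_default(intercept, coeffs, woe_values):
    """pd = 1 / (1 + exp(-Z)), with Z = intercept + sum(coeff_i * WoE_i)."""
    z = intercept + sum(b * x for b, x in zip(coeffs, woe_values))
    return 1.0 / (1.0 + math.exp(-z))

# (coeff, WoE value) pairs from the example; intercept assumed to be 0.
rows = [(-1, -1), (-1, -2), (1, -1), (1, -2)]
pds = [probability_of_default(0.0, [c], [w]) for c, w in rows]
```

A negative WoE means the group holds relatively more bads. With a negative coefficient the pd rises above 0.5 and grows as the WoE gets more negative, which is the logical direction; with a positive coefficient the same groups would get a pd below 0.5. The pd values in the example appear to be truncated to two decimals (for instance 0.269 shown as 0.26).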
The Problem of Binning
• Problem – how to determine the optimal groups for WoE
transformation of numerical variables?
• Solution:
- split every variable into 10 deciles
- observe if the average default rate for the deciles changes in a
logical fashion (e.g. monotonically if this is the expected relationship)
- combine groups in which the average default rate is close enough
- adjust the cut-off points for the groups whose default rate is out of
line with the adjacent groups
- if in doubt, use the Information Value (IV) criterion to compare two
binnings: the binning with the higher Information Value differentiates
better between the distribution of goods and the distribution of
bads, the ultimate goal in scorecard development
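The last step, comparing two binnings by Information Value, might look like the sketch below. The IV formula used is the standard one, the sum over groups of (share of goods - share of bads) * ln(share of goods / share of bads); it is not spelled out on the slide. The scores, default flags and cut-off points are hypothetical:

```python
import math
from bisect import bisect_right

def assign_bins(values, cutoffs):
    """Map each value to a bin index; bin k holds cutoffs[k-1] <= value < cutoffs[k]."""
    return [bisect_right(cutoffs, v) for v in values]

def information_value(bins, is_bad):
    """IV = sum over bins of (share of goods - share of bads) * ln(share of goods / share of bads)."""
    goods, bads = {}, {}
    for b, bad in zip(bins, is_bad):
        counter = bads if bad else goods
        counter[b] = counter.get(b, 0) + 1
    all_goods, all_bads = sum(goods.values()), sum(bads.values())
    iv = 0.0
    for b in set(bins):
        # Assumption: every bin contains both goods and bads.
        share_goods = goods[b] / all_goods
        share_bads = bads[b] / all_bads
        iv += (share_goods - share_bads) * math.log(share_goods / share_bads)
    return iv

# Hypothetical sample: 20 borrowers ranked by some numerical variable.
scores = list(range(1, 21))
bad_scores = {1, 2, 3, 5, 8, 14, 18}
flags = [1 if s in bad_scores else 0 for s in scores]

iv_a = information_value(assign_bins(scores, [7, 14]), flags)   # binning A
iv_b = information_value(assign_bins(scores, [10, 15]), flags)  # binning B
better = "A" if iv_a > iv_b else "B"
```

Both candidate binnings use three groups here. Comparing binnings with the same number of groups is a design choice of this sketch, since merging groups can only lower IV or leave it unchanged, so IV on its own favours the finer split.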
The Problem of Categorical Variables
• Problem – How to represent categorical variables?
• Classical Solution: use dummy variables, but:
- often some categories turn out insignificant
- difficult to interpret the overall significance of a variable split into 5
dummy regressors
• Elegant Solution:
- again assign a WoE value to each category
The Problem of Multicollinearity
• Problem – correlated variables distort the coefficient signs or make
individually significant variables insignificant in a multivariable context
• Classical Solution: drop the variable with the wrong sign
• Elegant Solution:
- combine the correlated variables into a new variable
- example: income & source of income
Applicants with High Income
and High Indebtedness
• Problem – high income is an indicator of lower risk, but at the same
time individuals with high income may face difficulties if their debt
level is high as well
• Solution: take the income net of debt payments (disposable income)
The Problem of Unofficial Income
• Problem – some applicants are paid a salary higher than the officially
declared one, and the difference cannot be verified by an NSSI (National
Social Security Institute) check
• Solution:
- request a declaration from the employer stating the real salary
- take max(income verified by the NSSI check; income verified by the employer's declaration)
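The income adjustments from this slide and the previous one (disposable income) can be sketched together. The function names and the monthly figures are hypothetical:

```python
def verified_income(nssi_income, declared_income):
    """Take the larger of the NSSI-verified income and the employer-declared income."""
    return max(nssi_income, declared_income)

def disposable_income(income, debt_payments):
    """Income net of debt payments."""
    return income - debt_payments

# Hypothetical monthly figures: the official salary understates the real one.
income = verified_income(nssi_income=800.0, declared_income=1500.0)
net = disposable_income(income, debt_payments=900.0)
```

Applying the max rule first and netting debt payments second is an assumption of this sketch; the slides treat the two adjustments separately and do not state an order.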
The Problem of Current Residence
• Problem – a lot of people in Bulgaria do not update their current
and/or permanent address, while the place of residence is a
somewhat important demographic factor
• Solution:
- take the branch where the application was submitted as the
current place of residence
An Astrological Detour :)
• Some people believe you can tell a lot about the character of a
person based on their astrological sign. So can you predict whether
they are reliable borrowers from it?
• A regression of default on the astrological sign in our consumer loan
database had a Gini coefficient of 10.25% (AUROC of 0.55) – lower
than most of the variables that made it into the final model. So sorry,
ladies, but astrology can’t tell you everything about a person :)
• For you astrology aficionados that still want to believe, people born
under Leo are the most risky, while people under Capricorn are the
most reliable payers :)
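The two figures quoted for the astrological sign are linked by the standard identity Gini = 2 * AUROC - 1, which the slide does not state. A minimal sketch, with a hypothetical function name:

```python
def gini_from_auroc(auroc):
    """Gini coefficient of a scoring variable from its AUROC: Gini = 2 * AUROC - 1."""
    return 2.0 * auroc - 1.0

gini = gini_from_auroc(0.55)  # roughly 0.10, i.e. about 10%
```

Read the other way, the reported Gini of 10.25% corresponds to an AUROC of about 0.551, which rounds to the 0.55 quoted. A variable with no predictive power would sit at an AUROC of 0.5 and a Gini of 0.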
THANK YOU!
QUESTIONS TIME!
