Credit risk predictive analytics

Challenges for Credit Risk
Predictive Analytics in Bulgaria
Vladimir Labov, FRM

Agenda
• The idea behind
• Solutions to practical problems in credit risk analytics:
• Outliers
• Missing values
• Logical coefficient signs
• Binning (grouping)
• Categorical variables
• Multicollinearity
• Applicants with high income and high indebtedness
• Unofficial income
• Current place of residence verification
11.02.2015 Vladimir Labov, Data Science Society 2

The Idea Behind
• Predictive analytics for credit risk tries to answer the question: is a
borrower going to pay the money back?
• Why is it important? It makes sure most of the money you keep in a bank
is lent to the right people.
• Since this problem boils down to predicting whether the customer is
good or bad, statistical classification algorithms are a natural fit.
• In practice, credit risk is quantified by estimating the expected loss
from each borrower.
• This is why methods that produce a probability of default between 0
and 1, such as logistic regression, are preferred.
• There are two types of predictive models for credit risk: application and
behavioural scorecards. Application scorecards are more important,
because the actual lending decision depends on them.

The Problem of Outliers
• Problem – outliers distort estimates of regression coefficients
• Classical solution – trim them, use robust regression or quantile regression
• Elegant solution – transform the variables to weight of evidence or default
rates
• Weight of evidence calculation:
$$\mathrm{WoE} = \ln\left(\frac{\text{goods in group} / \text{all goods}}{\text{bads in group} / \text{all bads}}\right) \times 100$$
• Interpretation:
- positive values: share of goods > share of bads
- negative values: share of goods < share of bads
- zero: share of goods = share of bads
• Advantage: no distortion of estimates; extreme values both in the
estimation sample and the holdout sample fall into the marginal WoE
groups
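The weight of evidence calculation can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function name `weight_of_evidence`, the income bands and the default flags are all made up for the example, and every group is assumed to contain at least one good and one bad.

```python
import math
from collections import Counter

def weight_of_evidence(groups, is_bad):
    """Return {group: WoE} with WoE = ln(share of goods / share of bads) * 100."""
    goods = Counter()
    bads = Counter()
    for group, bad in zip(groups, is_bad):
        if bad:
            bads[group] += 1
        else:
            goods[group] += 1
    total_goods = sum(goods.values())
    total_bads = sum(bads.values())
    woe = {}
    for group in set(goods) | set(bads):
        # Assumption: every group holds at least one good and one bad;
        # otherwise the logarithm is undefined and the groups should be merged.
        share_goods = goods[group] / total_goods
        share_bads = bads[group] / total_bads
        woe[group] = math.log(share_goods / share_bads) * 100
    return woe

# Hypothetical toy sample: income band per applicant and a default flag (1 = bad).
income_band = ["low"] * 10 + ["high"] * 10
defaulted = [1] * 4 + [0] * 6 + [1] * 1 + [0] * 9
woe = weight_of_evidence(income_band, defaulted)
```

In this toy sample the "low" band holds a smaller share of the goods than of the bads, so its WoE is negative, while the "high" band gets a positive WoE. An extreme income simply lands in one of these groups and receives that group's value.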

The Problem of Missing Values
• Problem – missing values for a variable make the whole observation
useless
• Classical solution – trim them, use the mean value or multiple
imputation
• Elegant solutions:
- missing age or gender can be inferred from the ID number (ЕГН in
Bulgaria)
- if missing values are few: assign them to the group with the closest
default rate, or to the most logical group (to the lowest income
group, lowest years of employment history group, etc.)
- transform the variables to weight of evidence or default rates
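Inferring age and gender from the ID number could look like the sketch below. It assumes the commonly documented ЕГН layout (YYMMDD, with 20 added to the month for births in the 1800s and 40 for births from 2000 on, and a ninth digit that is even for men and odd for women). The function names are hypothetical, the sample numbers are synthetic, and the check digit is not validated here.

```python
from datetime import date

def parse_egn(egn):
    """Infer (birth_date, gender) from a 10-digit ЕГН string (check digit not verified)."""
    if len(egn) != 10 or not (egn.isascii() and egn.isdigit()):
        raise ValueError("an ЕГН must consist of exactly 10 digits")
    year, month, day = int(egn[0:2]), int(egn[2:4]), int(egn[4:6])
    # Assumed layout: the month field also encodes the century of birth.
    if month > 40:
        year, month = 2000 + year, month - 40
    elif month > 20:
        year, month = 1800 + year, month - 20
    else:
        year += 1900
    # Assumed layout: the ninth digit is even for men and odd for women.
    gender = "M" if int(egn[8]) % 2 == 0 else "F"
    return date(year, month, day), gender  # date() rejects impossible dates

def age_on(birth_date, reference_date):
    """Completed years of age on the reference date."""
    had_birthday = (reference_date.month, reference_date.day) >= (birth_date.month, birth_date.day)
    return reference_date.year - birth_date.year - (0 if had_birthday else 1)

# Synthetic example number, not a real person's ID.
birth_date, gender = parse_egn("8503150049")
age = age_on(birth_date, date(2015, 2, 11))
```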

The Problem of
Logical Coefficient Signs
• Problem – certain variables may have a significant p-value, but with a
coefficient sign that defies economic logic
• Solutions:
- Estimate univariate regressions on each variable to help you get a
feel for what the logical sign should be
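- The no-brainer: just use weight of evidence. All variables should then
have a negative coefficient sign. Example with a single variable and a = 0:

coeff   WoE value    Z    pd
-1      -1           1   0.73
-1      -2           2   0.88
+1      -1          -1   0.27
+1      -2          -2   0.12

$$Z = a + \sum_{i=1}^{n} b_i x_i = a + \sum_{i=1}^{n} \mathrm{coeff}_i \cdot \mathrm{WoE}_i$$
$$pd = \frac{1}{1 + \exp(-Z)}$$

A more negative WoE marks a riskier group, so only the negative coefficient
gives that group the higher pd.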
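The sign example can be checked numerically. The sketch below assumes a single WoE variable and a zero intercept, which is what the Z column of the example implies; the function name `probability_of_default` is made up for the illustration.

```python
import math

def probability_of_default(coeff, woe, intercept=0.0):
    """pd = 1 / (1 + exp(-Z)) with Z = intercept + coeff * WoE."""
    z = intercept + coeff * woe
    return 1.0 / (1.0 + math.exp(-z))

# Walk through the four coefficient / WoE combinations of the example.
for coeff in (-1, +1):
    for woe_value in (-1, -2):
        p = probability_of_default(coeff, woe_value)
        print(f"coeff={coeff:+d}  WoE={woe_value:+d}  Z={coeff * woe_value:+d}  pd={p:.2f}")
```

Rounded to two decimals, the four probabilities are 0.73, 0.88, 0.27 and 0.12 (truncating instead of rounding gives 0.26 and 0.11 for the last two). With the negative coefficient the riskier group (WoE of -2) gets the higher pd, and with the positive coefficient it gets the lower one, which is the illogical case.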

The Problem of Binning
• Problem – how to determine the optimal groups for WoE
transformation of numerical variables?
• Solution:
- split every variable into deciles (10 equal-sized groups)
- observe if the average default rate for the deciles changes in a
logical fashion (e.g. monotonically if this is the expected relationship)
- combine groups in which the average default rate is close enough
- adjust the cut-off points for the groups whose default rate is out of
line with the adjacent groups
- if in doubt, use the Information Value (IV) criterion to compare two
binnings: the binning with the higher Information Value differentiates
better between the distribution of goods and the distribution of
bads, the ultimate goal in scorecard development
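The binning steps and the Information Value comparison can be sketched as follows. All function names and the good/bad counts are hypothetical. IV is computed here with the usual formula, the sum over groups of (share of goods - share of bads) * ln(share of goods / share of bads), in natural-log units without the factor of 100, and no group is assumed to be empty of goods or bads.

```python
import math
from bisect import bisect_left

def decile_cut_points(values, n_bins=10):
    """Upper cut-offs that split the sorted sample into n_bins roughly equal groups."""
    ordered = sorted(values)
    return [ordered[len(ordered) * k // n_bins - 1] for k in range(1, n_bins)]

def count_by_bin(values, is_bad, cut_points):
    """One [goods, bads] pair per bin; bin k holds values <= cut_points[k]."""
    counts = [[0, 0] for _ in range(len(cut_points) + 1)]
    for value, bad in zip(values, is_bad):
        counts[bisect_left(cut_points, value)][1 if bad else 0] += 1
    return counts

def default_rates(counts):
    """Share of bads in each bin, used to check for a logical (e.g. monotonic) trend."""
    return [bads / (goods + bads) for goods, bads in counts]

def merge_bins(counts, k):
    """Combine bin k with bin k + 1, as done when their default rates are close."""
    merged = [counts[k][0] + counts[k + 1][0], counts[k][1] + counts[k + 1][1]]
    return counts[:k] + [merged] + counts[k + 2:]

def information_value(counts):
    """Sum over bins of (share of goods - share of bads) * ln(share of goods / share of bads)."""
    total_goods = sum(goods for goods, _ in counts)
    total_bads = sum(bads for _, bads in counts)
    iv = 0.0
    for goods, bads in counts:
        # Assumption: every bin holds at least one good and one bad.
        share_goods = goods / total_goods
        share_bads = bads / total_bads
        iv += (share_goods - share_bads) * math.log(share_goods / share_bads)
    return iv

# Hypothetical [goods, bads] counts for four bins of one variable.
fine = [[90, 10], [80, 20], [78, 22], [60, 40]]
coarse = merge_bins(fine, 1)  # bins 1 and 2 have close default rates (20% vs 22%)
iv_fine = information_value(fine)
iv_coarse = information_value(coarse)
```

Merging bins can only keep or lower the IV, so the comparison is most useful between binnings with the same number of groups, or to see how little is lost when two groups with close default rates are combined.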

The Problem of Categorical Variables
• Problem – How to represent categorical variables?
• Classical Solution: use dummy variables, but:
- often some categories turn out insignificant
- difficult to interpret the overall significance of a variable split into 5
dummy regressors
• Elegant Solution:
- again assign a WoE value to each category

The Problem of Multicollinearity
• Problem – Correlated variables distort the coefficient signs or make
individually significant variables insignificant in a multivariable context
• Classical Solution: drop the variable with the wrong sign
• Elegant Solution:
- combine the correlated variables into a new variable
- example: income & source of income

Applicants with High Income
and High Indebtedness
• Problem – high income is an indicator of lower risk, but at the same
time individuals with high income may face difficulties if their debt
level is high as well
• Solution: take the income net of debt payments (disposable income)
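As a small illustration of the disposable income variable, with a made-up function name and made-up monthly amounts:

```python
def disposable_income(monthly_income, monthly_debt_payments):
    """Income net of debt payments."""
    return monthly_income - monthly_debt_payments

# Hypothetical applicants (monthly amounts in BGN).
high_earner_high_debt = disposable_income(5000, 4200)
modest_earner_no_debt = disposable_income(1500, 0)
```

Here the high earner is left with less disposable income than the modest earner, which raw income alone would not show.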

The Problem of Unofficial Income
• Problem – some applicants are paid a salary higher than the officially
declared one, and the difference cannot be verified by an NSSI (National
Social Security Institute) check
• Solution:
- request a declaration from the employer for the real salary
- take max(income verified by the NSSI check; income verified by the
declaration)
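The max rule can be written as a one-line helper. The function and parameter names are hypothetical, and falling back to the NSSI figure when no declaration exists is an assumption of this sketch.

```python
def income_for_scoring(nssi_verified, employer_declared=None):
    """Take the larger of the NSSI-verified and the employer-declared salary."""
    if employer_declared is None:
        # Assumption: without a declaration only the NSSI figure is available.
        return nssi_verified
    return max(nssi_verified, employer_declared)

# Hypothetical monthly salaries in BGN.
with_declaration = income_for_scoring(800, employer_declared=1400)
without_declaration = income_for_scoring(800)
```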

The Problem of Current Residence
• Problem – a lot of people in Bulgaria do not update their current
and/or permanent address, while the place of residence is a
somewhat important demographic factor
• Solution:
- take the branch where the application was submitted as the
current place of residence

An Astrological Detour
• Some people believe you can tell a lot about the character of a
person based on their astrological sign. So can you predict whether
they are reliable borrowers from it?
• A regression of default on the astrological sign in our consumer loan
database had a Gini coefficient of 10.25% (AUROC of 0.55), lower
than most of the variables that made it into the final model. So sorry,
ladies, but astrology can’t tell you everything about a person
• For astrology aficionados who still want to believe, people born
under Leo are the most risky, while people born under Capricorn are the
most reliable payers
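The two figures quoted for the astrological sign are linked by the standard relation Gini = 2 * AUROC - 1. The sketch below shows that relation together with a simple pairwise AUROC calculation; the function names and the four toy scores are made up, and the loan data behind the quoted figures is not reproduced.

```python
def auroc(scores, is_bad):
    """Share of (bad, good) pairs in which the bad has the higher score; ties count 0.5."""
    bad_scores = [s for s, b in zip(scores, is_bad) if b]
    good_scores = [s for s, b in zip(scores, is_bad) if not b]
    wins = 0.0
    for bad in bad_scores:
        for good in good_scores:
            if bad > good:
                wins += 1.0
            elif bad == good:
                wins += 0.5
    return wins / (len(bad_scores) * len(good_scores))

def gini_from_auroc(auc):
    """Gini coefficient implied by an AUROC value."""
    return 2.0 * auc - 1.0

# Hypothetical predicted default probabilities and observed outcomes (1 = bad).
toy_auc = auroc([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0])
toy_gini = gini_from_auroc(toy_auc)
```

A Gini of 10.25% corresponds to an AUROC of 0.55125, which rounds to the 0.55 quoted above. An AUROC of 0.5 (Gini of 0) would mean no discriminatory power at all.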

THANK YOU!
QUESTIONS TIME!
