Estimating Future Initial Margin
with Machine Learning
Andres Hernandez
Estimating Initial Margin
Set up
Consider a bank with a portfolio of OTC contracts traded with a
counterparty, covered under a single netting agreement, which can
include variation margin VM and initial margin IM. The exposure
of the bank at time t is

E(t) = (V(t) − VM(t) + U(t) − IM(t))^+

V(t) is the value of the portfolio at time t
VM(t) is the variation margin available to the bank at time t
U(t) is the value of cashflows scheduled to be paid up to time t
IM(t) is the initial margin available to the bank at time t
3
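As a minimal illustration of the exposure definition above, the following sketch evaluates E(t) elementwise across simulated scenarios; the array names are hypothetical.

```python
import numpy as np

def exposure(V, VM, U, IM):
    """E(t) = (V(t) - VM(t) + U(t) - IM(t))^+ applied elementwise over scenarios."""
    return np.maximum(V - VM + U - IM, 0.0)
```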
Initial Margin
Our purpose is to produce a forecast of initial margin.
For the purpose of MVA, the expectation E_t[IM(u)] of future initial margin
is needed, e.g. Green and Kenyon

MVA = −\int_t^T ((1 − R_B) λ_B(u) − s_I(u)) e^{−\int_t^u (r(s) + λ_B(s) + λ_C(s)) ds} E_t[IM(u)] du,

but in general, if the intention is to calculate exposure, what we will
strive for is the forecast along a particular scenario path m

IM_m(t) = Q_{99}(Δ_m(t; δ_IM) | F_t)

where Q_{99} is the 99th percentile, δ_IM is the MPoR (margin period of risk),
and Δ_m(t; δ_IM) is the clean change in portfolio value

Δ_m(t; δ_IM) = V_m(t + δ_IM) + U_m(t, t + δ_IM) − V_m(t)    (1)
4
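A minimal sketch of the clean P&L change in Eq. (1) and an empirical 99th percentile across scenarios. The arrays V and U and the index conventions are assumptions; note the talk ultimately targets the quantile conditional on F_t along a path, not this unconditional one.

```python
import numpy as np

# Hypothetical inputs: V[m, t] simulated portfolio values, U[m, t] cashflows
# paid in (t, t + delta_IM], for M scenarios and T time steps.
def clean_pnl(V, U, t, delta_steps):
    """Delta_m(t; delta_IM) = V_m(t + delta_IM) + U_m(t, t + delta_IM) - V_m(t), per Eq. (1)."""
    return V[:, t + delta_steps] + U[:, t] - V[:, t]

def q99_unconditional(V, U, t, delta_steps):
    """Empirical 99th percentile of the clean P&L change across scenarios."""
    return np.quantile(clean_pnl(V, U, t, delta_steps), 0.99)
```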
Assumptions
We follow L. Andersen, M. Pykhtin, and A. Sokol and assume that,
as δ_IM is relatively short, the P&L under the quantile is Gaussian
with zero drift

IM_m(t) ≈ σ_m(t) Φ^{−1}(99%)

and where the variance is defined as

σ_m^2(t) = E[Δ_m^2(t, δ_IM) | F_t]

Forecasting the variance σ_m^2(t) is really the purpose of this talk.
5
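Under this zero-drift Gaussian assumption the per-path IM is just the forecast standard deviation scaled by the normal quantile; a one-line sketch:

```python
from scipy.stats import norm

def im_gaussian(sigma_m, q=0.99):
    """IM_m(t) ~= sigma_m(t) * Phi^{-1}(q) under the zero-drift Gaussian assumption."""
    return sigma_m * norm.ppf(q)
```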
Estimating σ_m(t) through Regression
Longstaff-Schwartz Regression
F(t_0, t) = E[f(x_t; t_0, t) | F_{t_0}]

For square-integrable functions in L^2(Ω, F, Q), the expansion via
orthonormal functions can be used to resolve the expectation

f(x_t; t_0, t) = \sum_{i=0}^{∞} a_i(t_0, t) L_i(x_t)

where L_i(x_t) is part of an orthonormal function sequence which
covers the L^2 space

F(t_0, t) = \sum_{i=0}^{∞} a_i(t_0, t)
7
Longstaff-Schwartz Regression
The coefficients a_i(t_0, t) are given by

a_i(t_0, t) = E[f(x_t; t_0, t) L_i(x_t) | F_{t_0}]    (2)

With such a sequence, one can guarantee that for a chosen error
tolerance ϵ there exists an N such that

|F(t_0, t) − \sum_{i=0}^{N} a_i(t_0, t)| < ϵ

However, if one needed to evaluate Eq. (2) directly in order to use
the method, the method would be of little use.
8
Longstaff-Schwartz Regression
Instead of evaluating the coefficients a_i, one chooses an N and
estimates them by regressing against the available Monte Carlo
simulation.
In the original Longstaff-Schwartz paper, a basis of Laguerre and
Hermite functions was used. Note, however, that the orthonormal
sequence is required only to guarantee that the precision can be
increased; in practice one does not rely on this property and often
just uses a polynomial basis.
9
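A minimal sketch of the practical procedure: at a given time step, regress the simulated targets on a polynomial in the state variable by least squares. The variable names and the degree are assumptions, not the talk's configuration.

```python
import numpy as np

def ls_fit(x, y, degree=2):
    """Longstaff-Schwartz-style regression at one time step: fit the basis
    1, x, ..., x^degree to the simulated targets y by least squares."""
    coeffs = np.polyfit(x, y, degree)
    return np.poly1d(coeffs)   # callable estimate of the conditional expectation

# usage with hypothetical data: sigma2_hat = ls_fit(rate_t, delta2_t)(rate_t)
```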
Regression with Machine Learning
Machine Learning tools
While a myriad of methods is available from the machine-learning
toolkit, due to time constraints we will look at the following methods:
Least-Squares Regression (LSE)
Nadaraya-Watson kernel Regression (NWK)
k-Nearest Neighbor Regression (kNR)
Gradient Boosted Regression Trees (GBRT)
Recurrent Neural Network (LSTM)
All attempt to approximate the conditional expectation of Y relative
to a variable X
E [Y |X] = m(X)
11
Least Squares Regression
A form for the function is proposed, e.g.

m̂(X) = \sum_{n=0}^{N} a_n X^n

and the coefficients are determined by minimising the squared errors.
In the following, N = 1 will be used. For a sample set {(x_i, y_i)}, with
i = 1, ..., M,

min \sum_i w_i (y_i − m̂(x_i))^2

One could introduce some clever choice of weights w_i, but in the
following all points are equally weighted.
12
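A sketch with scikit-learn, assuming a one-dimensional state variable and N = 1 as above; equal weights correspond to leaving the sample weights unset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def lse_fit(x, y, w=None):
    """Weighted least squares with a linear basis (N = 1); w=None means equal weights."""
    model = LinearRegression()
    model.fit(np.asarray(x).reshape(-1, 1), y, sample_weight=w)
    return model

# usage: sigma2_hat = lse_fit(rate_t, delta2_t).predict(np.asarray(rate_t).reshape(-1, 1))
```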
Nadaraya-Watson kernel Regression
Nadaraya-Watson uses a locally weighted average, with the weights
provided by a kernel K. The estimate of m, m̂_h, is then given by

m̂_h(x) = \frac{\sum_i K_h(x − x_i) y_i}{\sum_i K_h(x − x_i)}

The parameter h, called the bandwidth, determines how much the
kernel focuses on local over global features. For example, the
radial basis function kernel

K_h(x, x_i) = exp(−‖x − x_i‖^2 / (2h^2))

As h varies from 0 to ∞, the kernel moves from weighting only an
exact match with some x_i to weighting all points equally.
13
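A direct implementation for a one-dimensional state variable; the bandwidth h is the free parameter discussed above, and the data names are hypothetical.

```python
import numpy as np

def nadaraya_watson(x_train, y_train, x_query, h):
    """Nadaraya-Watson estimator with a Gaussian (RBF) kernel of bandwidth h (1-D sketch)."""
    d2 = (np.asarray(x_query)[:, None] - np.asarray(x_train)[None, :]) ** 2
    K = np.exp(-d2 / (2.0 * h ** 2))          # kernel weights
    return (K @ np.asarray(y_train)) / K.sum(axis=1)   # locally weighted average
```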
k-Nearest Neighbor Regression
To estimate the value at an input x, the distance to all points in the
sample set is calculated. The k samples with the shortest distance
are picked, and the output value is simply the weighted average over
them. In our case, the points are weighted by inverse distance, so
that the nearest points carry more weight.
14
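The scikit-learn k-nearest-neighbour regressor with inverse-distance weighting matches this description; k = 25 here is an arbitrary assumption, not the value used in the talk.

```python
from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors=25, weights="distance")
# usage with hypothetical per-timestep data:
# knn.fit(rate_t.reshape(-1, 1), delta2_t)
# sigma2_hat = knn.predict(rate_t.reshape(-1, 1))
```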
Gradient Boosted Regression Trees
In a decision regression tree, a decision based on one of the
predictor variables is made at each node, leading to a final
prediction at the leaves.

A GBRT fits an additive set of decision trees (weak learners)

F(x) = \sum_m γ_m h_m(x).

The model is built up one tree at a time,

F_m(x) = F_{m−1}(x) + γ_m h_m(x),

by having the new decision tree h_m(x) minimize the error of F_{m−1}(x).
15
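A sketch with scikit-learn's gradient-boosted trees; the hyperparameters are assumptions, not the talk's settings.

```python
from sklearn.ensemble import GradientBoostingRegressor

# Each new tree h_m is fit to the residual error of F_{m-1}, scaled by the learning rate.
gbrt = GradientBoostingRegressor(n_estimators=200, max_depth=3, learning_rate=0.1)
# gbrt.fit(rate_t.reshape(-1, 1), delta2_t)
# sigma2_hat = gbrt.predict(rate_t.reshape(-1, 1))
```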
Artificial Neural Networks
An ANN is simply a network of regression units stacked in a particular
configuration. Each regression unit, called a neuron, takes input
from the previous layer¹, combines that input according to a rule,
and applies a function to the result:

[Diagram: a neuron computes a = σ(\sum_i w_i x_i + b) from inputs x_1, ..., x_n, weights w_1, ..., w_n, and bias b]

¹There are more complicated topologies, e.g. Recursive Neural Networks or
Restricted Boltzmann machines.
16
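The diagram corresponds to the following single-unit computation (a sigmoid activation is assumed for concreteness):

```python
import numpy as np

def neuron(x, w, b):
    """One regression unit: a = sigma(w . x + b) with a sigmoid nonlinearity."""
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))
```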
Artificial Neural Networks
In ANNs independent regression units are stacked together in layers,
with layers stacked on top of each other
17
Many to Many Recurrent Neural Network
A long short-term memory (LSTM) architecture was used. The standard
LSTM block is composed of several gates with an internal state:

[Diagram of an LSTM block, from Wikimedia]

The LSTM blocks are grouped into layers, and several layers can be
stacked on top of each other.
18
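A minimal sketch of a stacked, many-to-many LSTM regressor in Keras; the layer sizes, the single input feature, and the loss are assumptions rather than the configuration used in the talk.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, TimeDistributed

# Input shape: (n_paths, n_timesteps, n_features); one sigma^2 prediction per time step.
model = Sequential([
    LSTM(32, return_sequences=True, input_shape=(None, 1)),  # first LSTM layer
    LSTM(32, return_sequences=True),                          # stacked second layer
    TimeDistributed(Dense(1)),                                # per-timestep output
])
model.compile(optimizer="adam", loss="mse")
```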
Estimating σ_m(t) with Machine Learning
Benchmark
Originally we intended to use forward SIMM as the benchmark, but
since what we are calculating and SIMM are different quantities that
need to be scaled before they can be compared (see Anfuso et al.
2016), a different benchmark was used:

σ_m^2(t) ≈ E[Δ_m^2(t, δ_IM) | V(t) = V_m(t)]
         ≈ E[Δ_m^2(t, δ_IM) | V(t) ≈ V_m(t)]
         = E[Δ_m^2(t, δ_IM) | |V(t) − V_m(t)| < ϵ]

For the regular calculations 1k scenarios are used, but 100k for the
benchmark calculations. ϵ is chosen on a per-timestep basis, being at
most half the distance between the two nearest points on that time
step.
20
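A brute-force sketch of the benchmark conditioning: average Δ² over the benchmark simulation's scenarios whose portfolio value at t lies within ϵ of the target path's value. The array names are hypothetical.

```python
import numpy as np

def benchmark_sigma2(V_bench_t, delta2_bench_t, v_m, eps):
    """E[Delta^2 | |V(t) - V_m(t)| < eps], estimated over the 100k benchmark scenarios."""
    mask = np.abs(V_bench_t - v_m) < eps
    return delta2_bench_t[mask].mean() if mask.any() else np.nan
```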
IR Swap
[Figure: E[σ²(t)] against time step (0–250) for LSE, NWK, kNR, GBRT and LSTM versus the benchmark]
21
5 × 5 European Swaption - Physical Exercise
[Figure: E[σ²(t)] against time step (0–250) for LSE, NWK, kNR, GBRT and LSTM versus the benchmark]
22
5 × 5 European Swaption - Physical Exercise
While all methods so far would probably be acceptable to calculate
E[IM(t)], and hence MVA, not all would be acceptable to calculate
exposure
[Figure: Δ²_m against rate at time step 180, benchmark (left) and LSE (right)]
23
5 × 5 European Swaption - Physical Exercise
[Figure: Δ²_m against rate at time step 180 for kNR, GBRT and LSTM]
24
Portfolio
Multiple currencies: EUR, USD, SEK, AUD
Multiple indices: EUR 6M, EUR 3M, EUR 12M
15 IR Swaps, 5 XCCy Swaps, 4 FX Swaps
7 Bermudan Swaptions with physical exercise and multiple
exercise dates
While not all products are treated equally under regulation, for the
purpose of this exercise, they will all be included in the same netting
set.
25
Portfolio
[Figure: E[σ²(t)] against time step (0–250) for LSE, NWK, kNR, GBRT and LSTM versus the benchmark]
26
Single model for the whole simulation
The LSTM produces a much smoother prediction because it is
"cheating". Instead of training a new model at each time step, à la
Longstaff-Schwartz, the LSTM is provided with the benchmark itself,
albeit a few days old: the current simulation is used to predict, but a
much bigger simulation in the past was used to train. Apart from LSE,
for which it makes no sense to even try, this could be done for the
other methods as well.
27
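A sketch of the "single model" idea using kNR: fit once on a large, slightly older simulation, then reuse the fitted model at every time step of today's 1k-scenario run. The feature construction and data names are assumptions.

```python
from sklearn.neighbors import KNeighborsRegressor

# Train once, in the past, on the large benchmark-style simulation.
pretrained = KNeighborsRegressor(n_neighbors=25, weights="distance")
# pretrained.fit(X_large_old, delta2_large_old)

# Reuse at each time step of the current, much smaller simulation.
# sigma2_today_t = pretrained.predict(X_today_t)
```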
k-Nearest Neighbor Single Model for Swaption
[Figure: E[σ²(t)] against time step with a 2-day delay: pre-trained kNR vs. timestep-trained kNR vs. benchmark]
28
k-Nearest Neighbor Single Model for Swaption
[Figure: E[σ²(t)] against time step with a 10-day delay: pre-trained kNR vs. timestep-trained kNR vs. benchmark]
29
Summary
A regression-based approach allows for a fast calculation of expected
initial margin, and hence MVA, but care needs to be taken for exposure
calculations. So far, the best solution seems to be k-Nearest Neighbor
Regression: simple, intuitive, and fast.
Moving forward
Validate the Gaussian assumption for more complex portfolios
Backtest the stability of the mapping to forward SIMM
Improve the neural network response by trying out generative models
Use transfer learning and other tools to attempt to replace ever more
parts of the Monte Carlo workflow with neural networks trained on
large data sets.
30
Thank you
© 2017 PricewaterhouseCoopers GmbH Wirtschaftsprüfungsgesellschaft. All rights reserved. In this
document, PwC refers to PricewaterhouseCoopers GmbH Wirtschaftsprüfungsgesellschaft, which is a member
firm of PricewaterhouseCoopers International Limited (PwCIL). Each member firm of PwCIL is a separate
and independent legal entity.