Deep Learning for Cyber Security

Steven Hutt steven.c.hutt@gmail.com
28 February, 2017
Cyber Security

www.dropbox.gov

Why now?

• DNC, Sony, Yahoo, ...
• Attack vectors are constantly changing
• Static detection approaches are failing
• Large amounts of data are available
• Deep Learning is a natural fit for anomaly detection
• Strong corporate and government motivation
The Challenge

Can we apply Deep Learning to Cyber Security to better identify malicious network traffic?

Reasons to be optimistic:

• Plenty of data
• Good progress in the general area of unsupervised feature selection
• Good progress in the general area of anomaly detection
• Very topical subject, so that's good!

Reasons to be skeptical:

• Practical usage requires a very low false positive rate
• Essentially no labelled data
• Data is a hybrid of categorical (mostly) and numeric (some)
• Very topical subject, so why are there so few commercial successes?

Reasons to try:

• Potentially huge market
• It's fun
Network Flow: the unit of data

A network flow is a record of the information exchanged via packets between a source and a destination machine during the course of the network protocol session.

    ipv4Source   192.104.50.16    hasAttach     True
    ipv4Dest     147.135.57.43    sizeAttach    87
    portSource   80               mimeType      avi
    portDest     1639             numAttach     1
    latitude     -5.31858         cookie        'name=kpl; expires=...'
    longitude    81.52040         subject       'Re: meeting'
    duration     13.87            searchString  'free beer'
    timeStamp    1486034657716    urlString     'https://test-cloud-p...'

Network flow values are a hybrid of categorical, numerical, and text data. Deep Learning for numerical and text data has been extensively developed. Here we focus on Deep Learning for categorical data.
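Before any model can be applied, categorical flow fields must be encoded numerically. A minimal sketch of one-hot encoding, using two hypothetical fields and vocabularies (not part of the original deck):

```python
import numpy as np

def one_hot(value, vocabulary):
    """Encode a categorical value as a one-hot vector over a fixed vocabulary."""
    vec = np.zeros(len(vocabulary))
    vec[vocabulary.index(value)] = 1.0
    return vec

# Hypothetical vocabularies for two categorical flow fields.
mime_types = ["avi", "pdf", "exe", "html"]
ports = [80, 443, 1639, 8080]

flow = {"mimeType": "avi", "portDest": 1639}
encoded = np.concatenate([
    one_hot(flow["mimeType"], mime_types),
    one_hot(flow["portDest"], ports),
])
print(encoded)  # length-8 vector with a 1 at each matching position
```

In practice each field gets its own vocabulary, and the concatenated one-hot blocks form the categorical part of the model input.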
Unsupervised Anomaly Detection

We attempt to fit a probability distribution to the data {x_k}, k = 1, . . . , N. Many approaches are possible. We focus here on generative models with a latent variable z:

    prior:      p_θ(z)
    likelihood: p_θ(x | z)
    marginal:   p_θ(x)
    posterior:  p_θ(z | x)

Maximize the log-likelihood of the data:

    p_θ(x) = Σ_z p_θ(x | z) p_θ(z)

    θ̂ = arg max_θ Σ_{k=1}^{N} log Σ_z p_θ(x_k | z) p_θ(z)

Possible uses of the generative model:

• A data point x is labelled anomalous if p_θ̂(x) is below some threshold.
• The posterior p_θ̂(z | x) determines an unsupervised feature extraction z of the data x. The features z can then be used in a subsequent anomaly detection model.
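The thresholding idea is simple enough to show directly. A toy sketch where the "fitted model" is just an empirical categorical distribution over observed symbols (in practice p_θ̂ would come from the trained generative model, and the threshold would be tuned on held-out data):

```python
import numpy as np

# Toy stand-in for a fitted model: the empirical distribution of observed symbols.
data = np.array([0, 0, 0, 1, 1, 0, 2, 0, 1, 0])
values, counts = np.unique(data, return_counts=True)
p_hat = dict(zip(values, counts / len(data)))

def is_anomalous(x, threshold=0.15):
    """Flag x as anomalous if its estimated probability falls below the threshold."""
    return p_hat.get(x, 0.0) < threshold

print(is_anomalous(0))  # frequent symbol (p = 0.6) -> False
print(is_anomalous(2))  # rare symbol (p = 0.1) -> True
```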
Categorical Distributions

Deep Learning for categorical data has been less studied than for numerical or text data, so we focus on modelling multivariate categorical distributions.

Let x = (x_1, . . . , x_p) be categorical variables with x_j ∈ {1, . . . , d_j}, j = 1, . . . , p. Let π be a probability distribution on x:

    π_{c_1 ... c_p} = P(x_1 = c_1, . . . , x_p = c_p).

Then there exist:

1. an integer k > 0
2. a mixing variable z ∈ {1, . . . , k} with distribution ν = (ν_1, . . . , ν_k)
3. a set of independent distributions ψ^{(j)}_{h c_j} = P(x_j = c_j | z = h)

such that

    π_{c_1 ... c_p} = P(x_1 = c_1, . . . , x_p = c_p) = Σ_{h=1}^{k} ν_h Π_{j=1}^{p} ψ^{(j)}_{h c_j}.

In other words, every multivariate categorical distribution is a mixture of multivariate categorical distributions with independent marginals.
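The mixture representation above can be made concrete with a tiny example (k = 2 components, p = 2 variables; all probabilities are illustrative, not from the deck):

```python
import numpy as np

rng = np.random.default_rng(0)

nu = np.array([0.6, 0.4])                      # mixing distribution over z
psi = [
    np.array([[0.9, 0.1], [0.2, 0.8]]),        # psi[j][h, c] = P(x_j = c | z = h)
    np.array([[0.7, 0.3], [0.1, 0.9]]),
]

def sample_mixture():
    """Draw x = (x_1, x_2): first sample z ~ nu, then each x_j independently given z."""
    h = rng.choice(len(nu), p=nu)
    return tuple(rng.choice(psi[j].shape[1], p=psi[j][h]) for j in range(len(psi)))

def joint_prob(c):
    """Exact pi_{c_1 c_2} = sum_h nu_h * prod_j psi[j][h, c_j]."""
    return sum(nu[h] * np.prod([psi[j][h, c[j]] for j in range(len(psi))])
               for h in range(len(nu)))

samples = [sample_mixture() for _ in range(2000)]
print(joint_prob((0, 0)))  # 0.6*0.9*0.7 + 0.4*0.2*0.1 = 0.386
```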
Problems

While the mixture representation result is encouraging, there are several challenges:

1. Direct likelihood maximization is computationally very expensive
2. Stochastic Gradient Descent is not possible, as the variables are discrete
3. The size k of the mixing variable grows geometrically in the dimension p

To address these challenges, we will use the following:

1. Variational inference for approximate likelihood maximization
2. Gumbel softmax to relax categorical variables to continuous variables
3. Dirichlet processes to incorporate k as part of the inference
Variational Autoencoders in a Nutshell

Recall we wish to compute

    θ̂ = arg max_θ Σ_{k=1}^{N} log Σ_z p_θ(x_k | z) p_θ(z)

so we need to efficiently compute Σ_z p_θ(x_k | z) p_θ(z).

We can (badly) approximate by sampling:

    Σ_z p_θ(x_k | z) p_θ(z) ≃ (1/M) Σ_{i=1}^{M} p_θ(x_k | z_i),  where z_i ∼ p_θ(z).

But most of the time p_θ(x_k | z_i) ≃ 0, so sample from z_i ∼ p_θ(z | x_k) instead.
But p_θ(z | x_k) is the wrong distribution and is unknown...
... so we learn an approximation q_ϕ(z | x_k) to the unknown p_θ(z | x_k)
... and account for the wrong distribution via a lower bound on the log-likelihood (the ELBO):

    L_θ,ϕ(x_k) = −D_KL(q_ϕ(z | x_k) || p_θ(z))    [regularization term]
                 + E_{q_ϕ}[log p_θ(x_k | z)]       [reconstruction term]

Think of q_ϕ(z | x) : x → z as an encoder and p_θ(x | z) : z → x as a decoder.
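For a small discrete latent, both ELBO terms can be computed exactly, which makes the bound easy to verify numerically. A sketch with made-up toy distributions (k = 3 latent values):

```python
import numpy as np

# Illustrative toy values only; in a VAE these come from the networks.
p_z = np.array([0.5, 0.3, 0.2])                        # prior p_theta(z)
q_z = np.array([0.7, 0.2, 0.1])                        # approx posterior q_phi(z | x_k)
log_p_x_given_z = np.log(np.array([0.4, 0.1, 0.05]))   # log p_theta(x_k | z)

kl = np.sum(q_z * np.log(q_z / p_z))                   # D_KL(q || p)
reconstruction = np.sum(q_z * log_p_x_given_z)         # E_q[log p(x_k | z)]
elbo = -kl + reconstruction

log_marginal = np.log(np.sum(p_z * np.exp(log_p_x_given_z)))
print(elbo <= log_marginal)  # the ELBO never exceeds the true log-likelihood
```

Training pushes the ELBO up; the gap to the true log-likelihood is exactly D_KL(q_ϕ(z | x) || p_θ(z | x)).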
The Reparametrization Trick

Stochastic compute graph: input x → encoder q_ϕ(z | x) → (sample) hidden z → decoder p_θ(x | z) → loss L.

Deterministic compute graph: input x → encoder q_ϕ(z | x) → hidden ẑ = g(ϵ), with the noise ϵ fed in as a separate input → decoder p_θ(x | ẑ) → loss L.

The feedforward step involves sampling the random variable z ∼ q_ϕ(z | x). However, sampling does not admit a gradient, so back-propagation of gradients fails.

The Reparametrization Trick replaces z ∼ q_ϕ(z | x) with

    ẑ = g(ϵ), where ϵ ∼ p(ϵ).

Now the back-propagation path passes through deterministic nodes only.
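The trick is easiest to see in the Gaussian case, where g(ϵ) = μ + σ·ϵ. A minimal numpy sketch (μ and σ here are made-up stand-ins for encoder outputs):

```python
import numpy as np

rng = np.random.default_rng(42)

# Pretend encoder outputs for one input x (illustrative values only).
mu, sigma = 1.5, 0.3

def sample_z(mu, sigma, eps):
    """Deterministic function g: all randomness lives in eps ~ N(0, 1)."""
    return mu + sigma * eps

# Gradients of z w.r.t. mu and sigma are now well defined:
# dz/dmu = 1 and dz/dsigma = eps, regardless of the sampled value.
eps = rng.standard_normal(100_000)
z = sample_z(mu, sigma, eps)
print(z.mean(), z.std())  # approximately mu and sigma
```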
Of course, our variables are categorical, so back-propagation of gradients fails anyway...
Gumbel Softmax Distribution

[Figure: sample densities on the simplex at temperatures τ = 0.0, τ = 0.5, and τ = 1.0.]

How to sample from a categorical random variable? Let z be of dimension k with probabilities (ν_1, . . . , ν_k). We may sample from z as follows:

    z = one-hot( arg max_h { γ_1 + ln ν_1, . . . , γ_k + ln ν_k } )

where γ_h ∼ Gumbel(0, 1) are IID, h = 1, . . . , k.

One-hot vectors are the vertices of the simplex ∆^{k−1}. Define a continuous distribution on the interior of the simplex, y = (y_1, . . . , y_k) ∈ ∆^{k−1} with Σ_{h=1}^{k} y_h = 1, by

    y_h = e^{(ln ν_h + γ_h)/τ} / Σ_{j=1}^{k} e^{(ln ν_j + γ_j)/τ},  for τ > 0.

As τ → 0 the continuous distribution y converges to the categorical distribution z.
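The softmax relaxation above is only a few lines of numpy; a sketch (class probabilities and temperatures are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

def gumbel_softmax_sample(nu, tau):
    """Draw one relaxed sample y on the simplex for class probabilities nu."""
    gumbel = rng.gumbel(size=len(nu))              # gamma_h ~ Gumbel(0, 1), IID
    logits = (np.log(nu) + gumbel) / tau
    e = np.exp(logits - logits.max())              # numerically stable softmax
    return e / e.sum()

nu = np.array([0.2, 0.5, 0.3])
y_soft = gumbel_softmax_sample(nu, tau=1.0)    # diffuse point inside the simplex
y_hard = gumbel_softmax_sample(nu, tau=0.01)   # nearly one-hot
print(y_soft.sum())  # always 1.0: every sample lies on the simplex
```

At small τ the samples concentrate near the vertices (near-one-hot), which is exactly the regime where they approximate true categorical draws.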
Variational Autoencoder for Categorical Variables

Relax the categorical assumptions of the model in order to obtain a continuous model on which back-propagation of gradients applies.

Continuous model: x → encoder q_ϕ(y | x) → relaxed latent y = g(γ) on the simplex, with Gumbel noise γ fed in as a separate input → decoder p_θ(x | y) → loss L. As τ → 0 this recovers the categorical model: x → q_ϕ(z | x) → z → p_θ(x | z) → L.

Note:

• for τ small: close to categorical, but high variance of gradients
• for τ large: far from categorical, but low variance of gradients

In practice, start training with a large τ and anneal to a small τ.
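One common way to realize the annealing is an exponential schedule clipped at a floor; a sketch (the constants τ_0, τ_min, and the decay rate are hypothetical hyperparameters, not values from the deck):

```python
import numpy as np

def tau_schedule(step, tau0=5.0, tau_min=0.1, rate=1e-3):
    """Exponential annealing: start with a large temperature, decay toward tau_min."""
    return max(tau_min, tau0 * np.exp(-rate * step))

print(tau_schedule(0))       # 5.0 at the start of training
print(tau_schedule(10_000))  # clipped at tau_min = 0.1
```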
Application

Train the variational autoencoder to obtain parameters θ̂, ϕ̂.

Possible approaches for anomaly detection:

• Use q_ϕ̂(z | x) : x → z as input to a machine learning anomaly detector
• Use p_θ̂ : x → p ∈ [0, 1] to identify rare events
• Use the reconstruction error of x → z → x (encode with q_ϕ̂, decode with p_θ̂) as an anomaly detector

Other approaches are possible...
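The reconstruction-error approach can be sketched with stubbed encode/decode functions (toy stand-ins for the trained q_ϕ̂ and p_θ̂; a real model would be the learned networks):

```python
import numpy as np

def reconstruct(x):
    """Toy decode(encode(x)): snaps to the nearest one-hot vector, mimicking a
    model trained only on one-hot (in-distribution) inputs."""
    out = np.zeros_like(x)
    out[np.argmax(x)] = 1.0
    return out

def anomaly_score(x):
    """Reconstruction error: large when the model cannot reproduce x."""
    return float(np.sum((x - reconstruct(x)) ** 2))

normal = np.array([1.0, 0.0])   # one-hot input like those seen in training
odd = np.array([0.5, 0.5])      # off-distribution input
print(anomaly_score(normal) < anomaly_score(odd))  # normal traffic scores lower
```

Flows whose score exceeds a threshold calibrated on normal traffic would be flagged for analyst review.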
Questions?
