Deep Learning for Cyber Security

Steven Hutt steven.c.hutt@gmail.com
28 February, 2017
Cyber Security

www.dropbox.gov

Why now?

• DNC, Sony, Yahoo, ...
• Attack vectors are constantly changing
• Static detection approaches are failing
• Large amounts of data are available
• Deep Learning is a natural fit for anomaly detection
• Strong corporate and government motivation
The Challenge

Can we apply Deep Learning to Cyber Security to better identify malicious network traffic?

Reasons to be optimistic:

• Plenty of data
• Good progress in the general area of unsupervised feature selection
• Good progress in the general area of anomaly detection
• Very topical subject, so that's good!

Reasons to be skeptical:

• Practical usage requires a very low false positive rate
• Essentially no labelled data
• Data is a hybrid of categorical (mostly) and numeric (some)
• Very topical subject, so why are there so few commercial successes?

Reasons to try:

• Potentially huge market
• It's fun
Network Flow: the unit of data

A network flow is a record of the information exchanged via packets between a source and a destination machine during the course of the network protocol session.

    ipv4Source   192.104.50.16    hasAttach     True
    ipv4Dest     147.135.57.43    sizeAttach    87
    portSource   80               mimeType      avi
    portDest     1639             numAttach     1
    latitude     -5.31858         cookie        'name=kpl; expires=...'
    longitude    81.52040         subject       'Re: meeting'
    duration     13.87            searchString  'free beer'
    timeStamp    1486034657716    urlString     'https://test-cloud-p...'

Network flow values are a hybrid of categorical, numerical, and text data. Deep Learning for numerical and text data has been extensively developed. Here we focus on Deep Learning for categorical data.
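Before any model can be applied, categorical flow fields must be encoded numerically. A minimal sketch of one-hot encoding, using two hypothetical fields and vocabularies (not part of the original deck):

```python
import numpy as np

def one_hot(value, vocabulary):
    """Encode a categorical value as a one-hot vector over a fixed vocabulary."""
    vec = np.zeros(len(vocabulary))
    vec[vocabulary.index(value)] = 1.0
    return vec

# Hypothetical vocabularies for two categorical flow fields.
mime_types = ["avi", "pdf", "exe", "html"]
ports = [80, 443, 1639, 8080]

flow = {"mimeType": "avi", "portDest": 1639}
encoded = np.concatenate([
    one_hot(flow["mimeType"], mime_types),
    one_hot(flow["portDest"], ports),
])
print(encoded)  # length-8 vector with a 1 at each matching position
```

In practice each field gets its own vocabulary, and the concatenated one-hot blocks form the categorical part of the model input.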
Unsupervised Anomaly Detection

We attempt to fit a probability distribution to the data {x_k}, k = 1, . . . , N. Many approaches are possible. We focus here on generative models with a latent variable z:

    prior:      p_θ(z)
    likelihood: p_θ(x | z)
    marginal:   p_θ(x)
    posterior:  p_θ(z | x)

Maximize the log-likelihood of the data:

    p_θ(x) = Σ_z p_θ(x | z) p_θ(z)

    θ̂ = arg max_θ Σ_{k=1}^{N} log Σ_z p_θ(x_k | z) p_θ(z)

Possible uses of the generative model:

• A data point x is labelled anomalous if p_θ̂(x) is below some threshold.
• The posterior p_θ̂(z | x) determines an unsupervised feature extraction z of the data x. The features z can then be used in a subsequent anomaly detection model.
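The thresholding idea is simple enough to show directly. A toy sketch where the "fitted model" is just an empirical categorical distribution over observed symbols (in practice p_θ̂ would come from the trained generative model, and the threshold would be tuned on held-out data):

```python
import numpy as np

# Toy stand-in for a fitted model: the empirical distribution of observed symbols.
data = np.array([0, 0, 0, 1, 1, 0, 2, 0, 1, 0])
values, counts = np.unique(data, return_counts=True)
p_hat = dict(zip(values, counts / len(data)))

def is_anomalous(x, threshold=0.15):
    """Flag x as anomalous if its estimated probability falls below the threshold."""
    return p_hat.get(x, 0.0) < threshold

print(is_anomalous(0))  # frequent symbol (p = 0.6) -> False
print(is_anomalous(2))  # rare symbol (p = 0.1) -> True
```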
Categorical Distributions

Deep Learning for categorical data has been less studied than for numerical or text data, so we focus on modelling multivariate categorical distributions.

Let x = (x_1, . . . , x_p) be categorical variables with x_j ∈ {1, . . . , d_j}, j = 1, . . . , p. Let π be a probability distribution on x:

    π_{c_1 ... c_p} = P(x_1 = c_1, . . . , x_p = c_p).

Then there exist:

1. an integer k > 0
2. a mixing variable z ∈ {1, . . . , k} with distribution ν = (ν_1, . . . , ν_k)
3. a set of independent distributions ψ^{(j)}_{h c_j} = P(x_j = c_j | z = h)

such that

    π_{c_1 ... c_p} = P(x_1 = c_1, . . . , x_p = c_p) = Σ_{h=1}^{k} ν_h Π_{j=1}^{p} ψ^{(j)}_{h c_j}.

In other words, every multivariate categorical distribution is a mixture of multivariate categorical distributions with independent marginals.
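The mixture representation above can be made concrete with a tiny example (k = 2 components, p = 2 variables; all probabilities are illustrative, not from the deck):

```python
import numpy as np

rng = np.random.default_rng(0)

nu = np.array([0.6, 0.4])                      # mixing distribution over z
psi = [
    np.array([[0.9, 0.1], [0.2, 0.8]]),        # psi[j][h, c] = P(x_j = c | z = h)
    np.array([[0.7, 0.3], [0.1, 0.9]]),
]

def sample_mixture():
    """Draw x = (x_1, x_2): first sample z ~ nu, then each x_j independently given z."""
    h = rng.choice(len(nu), p=nu)
    return tuple(rng.choice(psi[j].shape[1], p=psi[j][h]) for j in range(len(psi)))

def joint_prob(c):
    """Exact pi_{c_1 c_2} = sum_h nu_h * prod_j psi[j][h, c_j]."""
    return sum(nu[h] * np.prod([psi[j][h, c[j]] for j in range(len(psi))])
               for h in range(len(nu)))

samples = [sample_mixture() for _ in range(2000)]
print(joint_prob((0, 0)))  # 0.6*0.9*0.7 + 0.4*0.2*0.1 = 0.386
```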
Problems

While the mixture representation result is encouraging, there are several challenges:

1. Direct likelihood maximization is computationally very expensive
2. Stochastic Gradient Descent is not possible, as the variables are discrete
3. The size k of the mixing variable grows geometrically in the dimension p

To address these challenges, we will use the following:

1. Variational inference for approximate likelihood maximization
2. Gumbel softmax to relax categorical variables to continuous variables
3. Dirichlet processes to incorporate k as part of the inference
Variational Autoencoders in a Nutshell

Recall we wish to compute

    θ̂ = arg max_θ Σ_{k=1}^{N} log Σ_z p_θ(x_k | z) p_θ(z)

so we need to efficiently compute Σ_z p_θ(x_k | z) p_θ(z).

We can (badly) approximate by sampling:

    Σ_z p_θ(x_k | z) p_θ(z) ≃ (1/M) Σ_{i=1}^{M} p_θ(x_k | z_i),  where z_i ∼ p_θ(z).

But most of the time p_θ(x_k | z_i) ≃ 0, so sample from z_i ∼ p_θ(z | x_k) instead.
But p_θ(z | x_k) is the wrong distribution and is unknown...
... so we learn an approximation q_ϕ(z | x_k) to the unknown p_θ(z | x_k)
... and account for the wrong distribution via a lower bound on the log-likelihood (the ELBO):

    L_θ,ϕ(x_k) = −D_KL(q_ϕ(z | x_k) || p_θ(z))    [regularization term]
                 + E_{q_ϕ}[log p_θ(x_k | z)]       [reconstruction term]

Think of q_ϕ(z | x) : x → z as an encoder and p_θ(x | z) : z → x as a decoder.
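For a small discrete latent, both ELBO terms can be computed exactly, which makes the bound easy to verify numerically. A sketch with made-up toy distributions (k = 3 latent values):

```python
import numpy as np

# Illustrative toy values only; in a VAE these come from the networks.
p_z = np.array([0.5, 0.3, 0.2])                        # prior p_theta(z)
q_z = np.array([0.7, 0.2, 0.1])                        # approx posterior q_phi(z | x_k)
log_p_x_given_z = np.log(np.array([0.4, 0.1, 0.05]))   # log p_theta(x_k | z)

kl = np.sum(q_z * np.log(q_z / p_z))                   # D_KL(q || p)
reconstruction = np.sum(q_z * log_p_x_given_z)         # E_q[log p(x_k | z)]
elbo = -kl + reconstruction

log_marginal = np.log(np.sum(p_z * np.exp(log_p_x_given_z)))
print(elbo <= log_marginal)  # the ELBO never exceeds the true log-likelihood
```

Training pushes the ELBO up; the gap to the true log-likelihood is exactly D_KL(q_ϕ(z | x) || p_θ(z | x)).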
The Reparametrization Trick

Stochastic compute graph: input x → encoder q_ϕ(z | x) → (sample) hidden z → decoder p_θ(x | z) → loss L.

Deterministic compute graph: input x → encoder q_ϕ(z | x) → hidden ẑ = g(ϵ), with the noise ϵ fed in as a separate input → decoder p_θ(x | ẑ) → loss L.

The feedforward step involves sampling the random variable z ∼ q_ϕ(z | x). However, sampling does not admit a gradient, so back-propagation of gradients fails.

The Reparametrization Trick replaces z ∼ q_ϕ(z | x) with

    ẑ = g(ϵ), where ϵ ∼ p(ϵ).

Now the back-propagation path passes through deterministic nodes only.
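The trick is easiest to see in the Gaussian case, where g(ϵ) = μ + σ·ϵ. A minimal numpy sketch (μ and σ here are made-up stand-ins for encoder outputs):

```python
import numpy as np

rng = np.random.default_rng(42)

# Pretend encoder outputs for one input x (illustrative values only).
mu, sigma = 1.5, 0.3

def sample_z(mu, sigma, eps):
    """Deterministic function g: all randomness lives in eps ~ N(0, 1)."""
    return mu + sigma * eps

# Gradients of z w.r.t. mu and sigma are now well defined:
# dz/dmu = 1 and dz/dsigma = eps, regardless of the sampled value.
eps = rng.standard_normal(100_000)
z = sample_z(mu, sigma, eps)
print(z.mean(), z.std())  # approximately mu and sigma
```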
Of course, our variables are categorical, so back-propagation of gradients fails anyway...
Gumbel Softmax Distribution

[Figure: sample densities on the simplex at temperatures τ = 0.0, τ = 0.5, and τ = 1.0.]

How to sample from a categorical random variable? Let z be of dimension k with probabilities (ν_1, . . . , ν_k). We may sample from z as follows:

    z = one-hot( arg max_h { γ_1 + ln ν_1, . . . , γ_k + ln ν_k } )

where γ_h ∼ Gumbel(0, 1) are IID, h = 1, . . . , k.

One-hot vectors are the vertices of the simplex ∆^{k−1}. Define a continuous distribution on the interior of the simplex, y = (y_1, . . . , y_k) ∈ ∆^{k−1} with Σ_{h=1}^{k} y_h = 1, by

    y_h = e^{(ln ν_h + γ_h)/τ} / Σ_{j=1}^{k} e^{(ln ν_j + γ_j)/τ},  for τ > 0.

As τ → 0 the continuous distribution y converges to the categorical distribution z.
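The softmax relaxation above is only a few lines of numpy; a sketch (class probabilities and temperatures are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

def gumbel_softmax_sample(nu, tau):
    """Draw one relaxed sample y on the simplex for class probabilities nu."""
    gumbel = rng.gumbel(size=len(nu))              # gamma_h ~ Gumbel(0, 1), IID
    logits = (np.log(nu) + gumbel) / tau
    e = np.exp(logits - logits.max())              # numerically stable softmax
    return e / e.sum()

nu = np.array([0.2, 0.5, 0.3])
y_soft = gumbel_softmax_sample(nu, tau=1.0)    # diffuse point inside the simplex
y_hard = gumbel_softmax_sample(nu, tau=0.01)   # nearly one-hot
print(y_soft.sum())  # always 1.0: every sample lies on the simplex
```

At small τ the samples concentrate near the vertices (near-one-hot), which is exactly the regime where they approximate true categorical draws.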
Variational Autoencoder for Categorical Variables

Relax the categorical assumptions of the model in order to obtain a continuous model on which back-propagation of gradients applies.

Continuous model: x → encoder q_ϕ(y | x) → relaxed latent y = g(γ) on the simplex, with Gumbel noise γ fed in as a separate input → decoder p_θ(x | y) → loss L. As τ → 0 this recovers the categorical model: x → q_ϕ(z | x) → z → p_θ(x | z) → L.

Note:

• for τ small: close to categorical, but high variance of gradients
• for τ large: far from categorical, but low variance of gradients

In practice, start training with a large τ and anneal to a small τ.
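One common way to realize the annealing is an exponential schedule clipped at a floor; a sketch (the constants τ_0, τ_min, and the decay rate are hypothetical hyperparameters, not values from the deck):

```python
import numpy as np

def tau_schedule(step, tau0=5.0, tau_min=0.1, rate=1e-3):
    """Exponential annealing: start with a large temperature, decay toward tau_min."""
    return max(tau_min, tau0 * np.exp(-rate * step))

print(tau_schedule(0))       # 5.0 at the start of training
print(tau_schedule(10_000))  # clipped at tau_min = 0.1
```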
Application

Train the variational autoencoder to obtain parameters θ̂, ϕ̂.

Possible approaches for anomaly detection:

• Use q_ϕ̂(z | x) : x → z as input to a machine learning anomaly detector
• Use p_θ̂ : x → p ∈ [0, 1] to identify rare events
• Use the reconstruction error of x → z → x (encode with q_ϕ̂, decode with p_θ̂) as an anomaly detector

Other approaches are possible...
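The reconstruction-error approach can be sketched with stubbed encode/decode functions (toy stand-ins for the trained q_ϕ̂ and p_θ̂; a real model would be the learned networks):

```python
import numpy as np

def reconstruct(x):
    """Toy decode(encode(x)): snaps to the nearest one-hot vector, mimicking a
    model trained only on one-hot (in-distribution) inputs."""
    out = np.zeros_like(x)
    out[np.argmax(x)] = 1.0
    return out

def anomaly_score(x):
    """Reconstruction error: large when the model cannot reproduce x."""
    return float(np.sum((x - reconstruct(x)) ** 2))

normal = np.array([1.0, 0.0])   # one-hot input like those seen in training
odd = np.array([0.5, 0.5])      # off-distribution input
print(anomaly_score(normal) < anomaly_score(odd))  # normal traffic scores lower
```

Flows whose score exceeds a threshold calibrated on normal traffic would be flagged for analyst review.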
Questions?
