Ch10. Auto-encoders
KH Wong
Ch10. Auto and variational encoders
v230607d
Two types of autoencoders
• Part 1: Vanilla (i.e., traditional or classical) Autoencoder
– or simply called an Autoencoder
• Part 2: Variational Autoencoder
Part 1:
Overview of Vanilla
(traditional/classical) Autoencoder
• Introduction
• Theory
• Architecture
• Application
• Examples
Introduction
• What is an autoencoder?
– An unsupervised method
• Applications
– Noise removal
– Dimensionality reduction
• Method
– Use noise-free ground-truth data (e.g. MNIST) plus self-generated noise to train the network
– The trained network can then remove noise from the input (e.g. handwritten characters); the output will be similar to the ground-truth data
Noise removal
• https://www.slideshare.net/billlangjun/simple-introduction-to-autoencoder
Result:
plt.title('Original images: top rows,'
'Corrupted Input: middle rows, '
'Denoised Input: third rows')
Perfect input + noise
Auto encoder Structure
An autoencoder is a
feedforward neural network
that learns to predict the
input (corrupted by noise)
itself in the output.
• The input-to-hidden part
corresponds to an encoder
• The hidden-to-output part
corresponds to a decoder.
• Input and output are of
the same dimension and
size.
https://towardsdatascience.com/deep-autoencoders-using-tensorflow-c68f075fd1a3
Noisy
Input
x
De-noised
Output
x‘
encoder decoder
Neural network after training
x‘
x
Z (code)
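For concreteness, a minimal dense autoencoder with this encoder/code/decoder shape might be sketched in Keras as below; the layer sizes (784–32–784) are illustrative assumptions, not values from the slides.

# Minimal sketch of the encoder -> code z -> decoder structure (sizes are assumptions).
import numpy as np
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

x_in = Input(shape=(784,), name='x')                          # noisy input x
z = Dense(32, activation='relu', name='code')(x_in)           # encoder: input -> code z
x_out = Dense(784, activation='sigmoid', name='x_rec')(z)     # decoder: code z -> de-noised x'

autoencoder = Model(x_in, x_out)
autoencoder.compile(optimizer='adam', loss='mse')             # reconstruction (squared-error) loss
autoencoder.summary()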
Theory
(W=weight, b=bias)
Autoencoders are trained to
minimize reconstruction errors
(such as squared errors), often
referred to as the "loss (L)":
• Encoder: z = σ(Wx + b)   (*)
• Decoder: x′ = σ′(W′z + b′)   (**)
• By combining (*) and (**), the loss is
Loss L(x, x′) = ||x − x′||² = ||x − σ′( W′ σ(Wx + b) + b′ )||²
• Encoder: input x → code z; Decoder: code z → output x′
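A small NumPy sketch of equations (*), (**) and the loss, with σ taken as the logistic sigmoid and arbitrary assumed dimensions:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
x = rng.random(6)                              # input x (6-dim, arbitrary)
W, b = rng.normal(size=(3, 6)), np.zeros(3)    # encoder weights W and bias b
Wp, bp = rng.normal(size=(6, 3)), np.zeros(6)  # decoder weights W' and bias b'

z = sigmoid(W @ x + b)                         # (*)  z  = sigma(W x + b)
x_rec = sigmoid(Wp @ z + bp)                   # (**) x' = sigma'(W' z + b')
L = np.sum((x - x_rec) ** 2)                   # L(x, x') = ||x - x'||^2
print(L)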
Exercise 1a,b,c
• How many input layers, hidden layers, output
layers in the figure shown? MC choices: How
many
• (a) input layer(s)?
• (b) hidden layer(s)?
• (c) Output layer(s)?
• How many neurons in these layers? MC
choices: How many neurons in these layers?
• (d) input layer?
• (e) hidden layers: choices:
– 1) 3
– 2) 6
– 3) 8
– 4) 10
• (f) output layer?
• (g) Which is true on the number of neurons?
– 1) input neurons more than output neurons
2) input neurons same as output neurons
– 3) input neurons less than output neurons
Input Output
Answer : Exercise 1
• How many input layers,
hidden layers, output layers in
the figure shown?
– Answer: input=1, hidden=3,
output layer=1
• How many neurons in these
layers?
– Answer: input(=4),
hidden(3,2,3),total=8 (choice
3), output (=4)
• What is the relation between
the number of input and
output neurons?
– Answer: same (choice 2)
Input Output
Architecture
• Encoder and decoder
• Training can use
typical
backpropagation
methods
https://towardsdatascience.com/how-to-
reduce-image-noises-by-autoencoder-
65d5e6de543
Training
• Apply clean MNIST data set + added noise to be used as input,
• Use clean MNIST data set as output
• Train the autoencoder using backpropagation
Added noise
Autoencoder training by
backpropagation
+
Clean MNIST
samples
Clean MNIST samples
same
Recall
• After training, autoencoders can be used to
remove noise
Trained
autoencoder
Noisy
Input
De-noised
Output
Exercise 2a,b: Auto-encoder training
• (Q.2a) For (epoch=1;epoch <=max_epoch ; epoch++)
– {For all 10,000 images{
• Core code:
• Use backpropagation to train the whole
autoencoder network (encoder + decoder)}
• Break if Loss is too small }
• MC question: In core code, choices:
1. Feed each clean image to the input, and Present
the clean image to the output
2. Feed each clean image+noise to the output, and
Present the clean image to the input
3. Feed each clean image+noise to the input, and
Present the clean image to the output
• (Q.2b) If the trained encoder receives a noisy image of a
handwritten numeral, what do you expect at the output?
– MC choice: 1) a denoised image; 2) input + noise
– 3) same as input ; 4) pure random noise
Noise clean image
for numeral
“2”
auto-encoder
Input output
Answer: Exercise 2a,b
• Answer 2(a): Auto-encoder training
• For (epoch=1;epoch <=max_epoch ; epoch++)
– {For all 10,000 images{
• Feed each clean image plus noise to the
(encoder) input
• Present the clean image of the numerical to
the output (of the decoder),
• Use backpropagation to train the whole
autoencoder network (encoder + decoder)
• }
• Break if Loss is too small
– }
• Ex.2(b) Autoencoder usage: If the trained encoder
receives a noisy image of a handwritten numeral,
what do you expect at the output?
– Answer 2(b): a denoised image of the real input numeral image (choice 1 is correct)
+
Noise clean image
for numeral
“2”
auto-encoder
Core code
Choice 3
is correct
Input Output
Sample
Code:
Part(i):
obtain
dataset
and add
noise
https://towardsdatascience.
com/how-to-reduce-image-
noises-by-autoencoder-
65d5e6de543
• #part1 ---------------------------------------------------
• np.random.seed(1337)
• # MNIST dataset
• (x_train, _), (x_test, _) = mnist.load_data()
• image_size = x_train.shape[1]
• x_train = np.reshape(x_train, [-1, image_size, image_size, 1])
• x_test = np.reshape(x_test, [-1, image_size, image_size, 1])
• x_train = x_train.astype('float32') / 255
• x_test = x_test.astype('float32') / 255
• # Generate corrupted MNIST images by adding noise with normal dist
• # centered at 0.5 and std=0.5
• noise = np.random.normal(loc=0.5, scale=0.5, size=x_train.shape)
• x_train_noisy = x_train + noise
• noise = np.random.normal(loc=0.5, scale=0.5, size=x_test.shape)
• x_test_noisy = x_test + noise
• x_train_noisy = np.clip(x_train_noisy, 0., 1.)
• x_test_noisy = np.clip(x_test_noisy, 0., 1.)
Part (ii):First build
the Encoder Model
• #part2 ---------------------------------------------------
• # Network parameters
• input_shape = (image_size, image_size, 1)
• batch_size = 128
• kernel_size = 3
• latent_dim = 16
• # Encoder/Decoder number of CNN layers and filters per layer
• layer_filters = [32, 64]
• # Build the Autoencoder Model
• # First build the Encoder Model
• inputs = Input(shape=input_shape, name='encoder_input')
• x = inputs
• # Stack of Conv2D blocks
• # Notes:
• # 1) Use Batch Normalization before ReLU on deep networks
• # 2) Use MaxPooling2D as alternative to strides>1
• # - faster but not as good as strides>1
• for filters in layer_filters:
• x = Conv2D(filters=filters,
• kernel_size=kernel_size,
• strides=2,
• activation='relu',
• padding='same')(x)
• # Shape info needed to build Decoder Model
• shape = K.int_shape(x)
• # Generate the latent vector
• x = Flatten()(x)
• latent = Dense(latent_dim, name='latent_vector')(x)
• # Instantiate Encoder Model
• encoder = Model(inputs, latent, name='encoder')
• encoder.summary()
Part (iii):Build the
Decoder Model
• #part3 ---------------------------------------------------
• # Build the Decoder Model
• latent_inputs = Input(shape=(latent_dim,), name='decoder_input')
• x = Dense(shape[1] * shape[2] * shape[3])(latent_inputs)
• x = Reshape((shape[1], shape[2], shape[3]))(x)
• # Stack of Transposed Conv2D blocks
• # Notes:
• # 1) Use Batch Normalization before ReLU on deep networks
• # 2) Use UpSampling2D as alternative to strides>1
• # - faster but not as good as strides>1
• for filters in layer_filters[::-1]:
• x = Conv2DTranspose(filters=filters,
• kernel_size=kernel_size,
• strides=2,
• activation='relu',
• padding='same')(x)
• x = Conv2DTranspose(filters=1,
• kernel_size=kernel_size,
• padding='same')(x)
• outputs = Activation('sigmoid', name='decoder_output')(x)
• # Instantiate Decoder Model
• decoder = Model(latent_inputs, outputs, name='decoder')
• decoder.summary()
• # Autoencoder = Encoder + Decoder
• # Instantiate Autoencoder Model
• autoencoder = Model(inputs, decoder(encoder(inputs)), name='autoencoder')
• autoencoder.summary()
• autoencoder.compile(loss='mse', optimizer='adam')
Part (iv): Train the
autoencoder,
decode images
display result
• #part4 ---------------------------------------------------
• # Train the autoencoder
• autoencoder.fit(x_train_noisy,
• x_train,
• validation_data=(x_test_noisy, x_test),
• epochs=30,
• batch_size=batch_size)
• # Predict the Autoencoder output from corrupted test images
• x_decoded = autoencoder.predict(x_test_noisy)
• # Display the 1st 8 corrupted and denoised images
• rows, cols = 10, 30
• num = rows * cols
• imgs = np.concatenate([x_test[:num], x_test_noisy[:num],
x_decoded[:num]])
• imgs = imgs.reshape((rows * 3, cols, image_size, image_size))
• imgs = np.vstack(np.split(imgs, rows, axis=1))
• imgs = imgs.reshape((rows * 3, -1, image_size, image_size))
• imgs = np.vstack([np.hstack(i) for i in imgs])
• imgs = (imgs * 255).astype(np.uint8)
• plt.figure()
• plt.axis('off')
• plt.title('Original images: top rows, '
• 'Corrupted Input: middle rows, '
• 'Denoised Input: third rows')
• plt.imshow(imgs, interpolation='none', cmap='gray')
• Image.fromarray(imgs).save('corrupted_and_denoised.png')
• plt.show()
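As the comments in the listing suggest, the trained encoder can also be used on its own to produce latent vectors (e.g. for low-dimensional visualization); a minimal sketch reusing the variable names defined above:

# Reuse the trained encoder alone to get the 16-D latent codes of the test images.
latent_codes = encoder.predict(x_test_noisy)   # shape: (num_test_images, latent_dim)
print(latent_codes.shape)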
Code https://towardsdatascience.com/how-to-reduce-image-noises-by-autoencoder-65d5e6de543
Result: plt.title('Original images: top rows, '
'Corrupted Input: middle rows, '
'Denoised Image: third rows')
• '''Trains a denoising autoencoder on MNIST dataset.
• https://towardsdatascience.com/how-to-reduce-image-noises-by-autoencoder-65d5e6de543
• Denoising is one of the classic applications of autoencoders.
• The denoising process removes unwanted noise that corrupted the
• true signal.
• Noise + Data ---> Denoising Autoencoder ---> Data
• Given a training dataset of corrupted data as input and
• true signal as output, a denoising autoencoder can recover the
• hidden structure to generate clean data.
• This example has modular design. The encoder, decoder and autoencoder
• are 3 models that share weights. For example, after training the
• autoencoder, the encoder can be used to generate latent vectors
• of input data for low-dim visualization like PCA or TSNE.
• '''
• #keras >> tensorflow.keras, modification by khw
• from __future__ import absolute_import
• from __future__ import division
• from __future__ import print_function
• import tensorflow.keras as keras
• from tensorflow.keras.layers import Activation, Dense, Input
• from tensorflow.keras.layers import Conv2D, Flatten
• from tensorflow.keras.layers import Reshape, Conv2DTranspose
• from tensorflow.keras.models import Model
• from tensorflow.keras import backend as K
• from tensorflow.keras.datasets import mnist
• import numpy as np
• import matplotlib.pyplot as plt
• from PIL import Image
• # (The remainder of the listing repeats Parts (i)–(iv) above: load and corrupt MNIST, build the encoder and decoder, train the autoencoder, and display the results.)
Exercise 3
• Discuss applications of a Vanilla (traditional)
autoencoder.
• Which of the following is true? MC choices:
1) Image recognition
2) Denoise input images + Image recognition
3) Denoise input images +Dimensionality Reduction
4) Denoise input images only
Answer: Exercise 3
• Discuss applications of a Vanilla (traditional) autoencoder.
• Which of the following is true? MC choices:
1) Image recognition
2) Denoise input images + Image recognition
3) Denoise input images +Dimensionality Reduction (correct)
4) Denoise input images only
• More information, see https://en.wikipedia.org/wiki/Autoencoder
– Dimensionality Reduction
– Relationship with principal component analysis (PCA)
– Information Retrieval
– Anomaly Detection
– Image Processing
– Drug discovery
Part 2: Variational autoencoder
Will learn
• What a variational autoencoder is
• How to train it
• How to use it
Some math background is needed:
• https://ljvmiranda921.github.io/notebook/20
17/08/13/softmax-and-the-negative-log-
likelihood/
• See appendix2: The expected negative log
likelihood
• Conditional expectation etc.
Variational Autoencoder (VAE) v.s. Traditional
Autoencoder
• Autoencoders (vanilla or traditional)
– During training you present a pattern with artificially added noise to the encoder, and feed the same input pattern (as target, or teacher) to the output. Then use backpropagation to train the autoencoder network.
– So it is unsupervised learning (no labelled data is needed).
– It can be used for data compression and noise removal.
– During recall, when a noisy pattern is presented to the input, a de-noised image will appear at the output.
• Variational autoencoders
– Instead of learning from an input pattern, Variational autoencoders
learn the parameters of a probability distribution function from the
input patterns. We then use the parameters learned to generate new
data. So, it is a generative model like GAN (Generative Adversarial
Network) in functionality.
Variational autoencoder
https://jaan.io/what-is-variational-autoencoder-vae-tutorial/
• Variational autoencoders are cool. They
let us design complex generative models
of data and fit them to large datasets.
They can generate images of fictional
celebrity faces and high-resolution digital
artwork.
• VAE faces
• VAE faces demo
• VAE MNIST
• VAE street addresses
• https://jaan.io/what-is-variational-
autoencoder-vae-tutorial/
• Similar techniques may be used in software such as Deepfake (https://en.wikipedia.org/wiki/Deepfake)
Fictional celebrity faces generated by a variational autoencoder (by Alec Radford).
Example: Applying VAE for MNIST data
set extension
•
https://arxiv.org/pdf/1312.6114.pdf
Output: generated image
Dataset (images extended)
Input: original image
data set
Some background:
Univariate and Multivariate Gaussian
• https://ttic.uchicago.edu/~shubhendu/Slides/Estimation.pdf
Univariate Gaussian (1-dimensional), data sample x, mean μ, variance σ²:
N(x | μ, σ²) = 1/(2πσ²)^(1/2) · exp( −(x − μ)² / (2σ²) )
Multivariate Gaussian (d-dimensional), data sample x, mean μ, covariance Σ:
N(x | μ, Σ) = 1/((2π)^(d/2) |Σ|^(1/2)) · exp( −½ (x − μ)ᵀ Σ⁻¹ (x − μ) )
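A quick numerical check of these formulas (a sketch using SciPy; the example mean and covariance are arbitrary assumptions):

import numpy as np
from scipy.stats import norm, multivariate_normal

# Univariate: N(x | mu=0, sigma^2=1) at x=0 equals 1/sqrt(2*pi) ~= 0.3989
print(norm.pdf(0.0, loc=0.0, scale=1.0))

# Multivariate (d=2): the density at the mean equals 1/((2*pi)^(d/2) * sqrt(det(Sigma)))
mu = np.array([3.0, 3.0])
Sigma = np.diag([2.5**2, 2.5**2])
print(multivariate_normal.pdf(mu, mean=mu, cov=Sigma))
print(1.0 / (2 * np.pi * np.sqrt(np.linalg.det(Sigma))))   # same value, by the formula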
Properties of Gaussian (Normal) distribution
• Standard Normal
distribution (1-dimension):
• Red line, when mean()=0,
Sigma ()=1
– At (x-)=0,  =1
– G(x) =1/sqrt(2*pi)=0.3989
• At x=1*, drops off to
– (1/sqrt(2*pi))*exp(-1^1/2)=0.2420
– Area covered 68.2%
• At x=2*, drops off to
– (1/sqrt(2*pi))*exp(-2^2/2)= 0.0540
– Area covered 95.44%
• At x=3*, drops off to
– (1/sqrt(2*pi))*exp(-2^2/2)= ??
(exercise)
– Area covered 99.73%
http://en.wikipedia.org/wiki/Normal_distribution
Probability density function of the 1-D Gaussian (standard deviation σ, variance σ², mean μ):
G(x) = 1/√(2πσ²) · exp( −(x − μ)² / (2σ²) ),  ∫ G(x) dx = 1
(Figure: standard normal distribution and the area covered, total = 100%.)
μ sets the horizontal shift; σ controls the shape.
The so-called 95% confidence interval is μ ± 2σ.
Gaussian (Normal) functions 1D,2D
1-D Gaussian (standard deviation σ, mean μ):
G(x) = 1/√(2πσ²) · exp( −(x − μ)² / (2σ²) )
2-D Gaussian:
G(x, y) = G(x)G(y) = 1/(2πσ²) · exp( −((x − μx)² + (y − μy)²) / (2σ²) )
(Figures: 1-D and 2-D Gaussian plots.)
Example : A 1-D and 2-D Gaussian
distribution
• %2-D Gaussian distribution P(xj)
• %matlab code----------
• clear, N=10
• [X1,X2]=meshgrid(-N:N,-N:N);
• sigma =2.5; mean=[3 3]'
• G=1/(2*pi*sigma^2)*exp(-((X1-mean(1)).^2+(X2-mean(2)).^2)/(2*sigma^2));
• G=G./sum(G(:)) %normalise it
• 'sigma is ', sigma
• 'sum(G(:)) is ',sum(G(:))
• 'max(max(G(:))) is',max(max(G(:)))
• figure(1), clf
• surf(X1,X2,G);
• xlabel('x1'),ylabel('x2')
1-D Gaussian, a sample x, mean μ0, variance σ0²:
N(x | μ0, σ0²) = 1/(2πσ0²)^(1/2) · exp( −(x − μ0)² / (2σ0²) )
2-D isotropic (circularly symmetric) Gaussian, assume mean = 0:
N(x1, x2 | 0, σ²) = 1/(2πσ²) · exp( −(x1² + x2²) / (2σ²) )
Exercise 4
• In Box 1, sigma ()=2
• x=mx y=my
• Mc choices:
1) G(x,y)=1/(2*pi*2+2)
2) G(x,y)=1/(2*pi*2)
3) G(x,y)=1/(2*pi*2^4)
4) G(x,y)=1/(2*pi*2^2)
• Student
exercise:
• Fill in the blanks of this Gaussian mask of size 9x9, sigma (σ) = 2
• Sketch the
function
• G(x,y)=
• 0.0007 0.0017 0.0033 0.0048 0.0054 0.0048 0.0033 0.0017 0.0007
• 0.0017 0.0042 0.0078 0.0114 0.0129 0.0114 0.0078 0.0042 0.0017
• 0.0033 0.0078 0.0146 0.0213 0.0241 0.0213 0.0146 0.0078 0.0033
• 0.0048 0.0114 0.0213 0.0310 0.0351 0.0310 0.0213 0.0114 0.0048
• 0.0054 0.0129 0.0241 0.0351 BOX1 ? ____? 0.0241 0.0129 0.0054
• 0.0048 0.0114 0.0213 0.0310 0.0351 ____? 0.0213 0.0114 0.0048
• 0.0033 0.0078 0.0146 0.0213 0.0241 0.0213 0.0146 0.0078 0.0033
• 0.0017 0.0042 0.0078 0.0114 0.0129 0.0114 0.0078 0.0042 0.0017
• 0.0007 0.0017 0.0033 0.0048 0.0054 0.0048 0.0033 0.0017 0.0007
2-D Gaussian, mean (mx, my):
G(x, y) = G(x)G(y) = 1/(2πσ²) · exp( −((x − mx)² + (y − my)²) / (2σ²) )
(Labels: Box1 is the centre cell at x = mx, y = my; the cell next to it is at x = 1+mx, y = my.)
Answer: Exercise 4
Fill in the blanks of the Gaussian mask of size 9x9, sigma (σ) = 2
• 0.0007 0.0017 0.0033 0.0048 0.0054 0.0048 0.0033 0.0017 0.0007
• 0.0017 0.0042 0.0078 0.0114 0.0129 0.0114 0.0078 0.0042 0.0017
• 0.0033 0.0078 0.0146 0.0213 0.0241 0.0213 0.0146 0.0078 0.0033
• 0.0048 0.0114 0.0213 0.0310 0.0351 0.0310 0.0213 0.0114 0.0048
• 0.0054 0.0129 0.0241 0.0351 0.0398 0.0351 0.0241 0.0129 0.0054
• 0.0048 0.0114 0.0213 0.0310 0.0351 0.0310 0.0213 0.0114 0.0048
• 0.0033 0.0078 0.0146 0.0213 0.0241 0.0213 0.0146 0.0078 0.0033
• 0.0017 0.0042 0.0078 0.0114 0.0129 0.0114 0.0078 0.0042 0.0017
• 0.0007 0.0017 0.0033 0.0048 0.0054 0.0048 0.0033 0.0017 0.0007
clear %matlab
sigma=2 % in matlab, no -ve index for looping, so shift the center to (5,5)
mean_x=5 , mean_y=5
for y=1:9
for x=1:9
g(x,y)=(1/(2*pi*sigma^2))*exp(-((x-mean_x)^2+(y-mean_y)^2)/(2*sigma^2))
end
end
mesh(g)
title('2D Gaussian function')
Box 1 = 1/(2*pi*2^2): choice 4 is correct, because x = mx, y = my, thus exp( −((x−mx)² + (y−my)²) / (2σ²) ) = 1.
The neighbouring cell at x = 1+mx, y = my is 1/(2*pi*2^2)*exp(-1/8); the diagonal neighbour is 1/(2*pi*2^2)*exp(-2/8).
2-D Gaussian, mean (mx, my): G(x, y) = G(x)G(y) = 1/(2πσ²) · exp( −((x − mx)² + (y − my)²) / (2σ²) )
Variational autoencoder
• A neural network view
https://www.jeremyjordan.me/variational-autoencoders/
Multivariate Gaussian:
Mean = µ
σ = standard deviation
Variance = σ²
Generative Models concept
• It is an unsupervised learning method that generates new samples using training data from the same distribution
• E.g., you have a limited number of samples but want to create more samples from the same probability distribution for machine learning purposes. Other examples include:
– Creating new cartoon figures
– Generating faces from images of celebrities.
– Creating new fashions.
– Creating new written characters for training optical character
recognition systems of some languages
• Generative model algorithms
– Variational autoencoder (discussed here)
– Generative adversarial network (GAN) not discussed here
Variational autoencoder for generative
models
• Use training samples to learn hidden data (the parameters of a multivariate Gaussian: standard deviations = σs, means = µs). After training you may create new output from some input and the weighted σs and µs. You may change the weights of the σs and µs for a variety of related but different outputs.
https://www.quora.com/Whats-the-difference-between-a-Variational-Autoencoder-VAE-and-an-Autoencoder
Parameters of a multivariate Gaussian (standard deviations = σs, means = µs), e.g. 50 µs, 30 σs
Application example: Use Generative
Models for MNIST data extension
http://yann.lecun.com/exdb/mnist/
•
During training, patterns are fed into the input and output one by one; learn µ, σ by minimizing the loss.
After training: data generation phase.
Generated extended data set
MNIST original data set
Random generator layer using 30 µs, 30 σs
z
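A sketch of this generation phase in Keras, assuming a decoder model like the one trained earlier (the variable names and latent_dim are assumptions):

import numpy as np

# Sample latent vectors z from a standard normal and decode them into new images.
latent_dim = 16                          # must match the trained decoder (assumption)
z = np.random.normal(size=(25, latent_dim))
new_images = decoder.predict(z)          # e.g. shape (25, 28, 28, 1) for MNIST-sized output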
Exercise 5:What is the architectural difference
between Vanilla (traditional) autoencoder and
Variational autoencoder?
• MC: Which is incorrect?
1) In Vanilla (traditional)
autoencoder: input to output
are directly connected by
neurons and weights.
2) In Variational autoencoder: the encoder turns the input (x) into means (µs) and standard deviations (σs) of a multivariate Gaussian distribution, then uses a random sampling method to create the output.
3) In Variational autoencoder :
input to output are directly
connected by neurons and
weights.
4) In Variational autoencoder: the number of mean (µs) and standard deviation (σs) neurons are the same.
Vanilla autoencoder
E.g. 30 µs, 30 σs
z
Answer Exercise 5:What is the architectural
difference between Vanilla (traditional)
autoencoder and Variational autoencoder?
• MC: Which is incorrect?
1) In Vanilla (traditional)
autoencoder: input to output
are directly connected by
neurons and weights.
2) In Variational autoencoder: the encoder turns the input (x) into means (µs) and standard deviations (σs) of a multivariate Gaussian distribution, then uses a random sampling method to create the output.
3) In Variational autoencoder :
input to output are directly
connected by neurons and
weights. (This is incorrect)
4) In Variational autoencoder: the number of mean (µs) and standard deviation (σs) neurons are the same.
Vanilla autoencoder
E.g. 30 µs, 30 σs
z
Exercise 6a,b for Variational
autoencoder VAE
• Which statement is incorrect for
VAE?: MC choices:
1) Because the search space is large, there are too many combinations of means (µs) and standard deviations (σs) for generating the same output.
2) There are multiple solutions for the means (µs) and standard deviations (σs)
3) There is a deterministic linear
solution for VAE
4) Neural network provides a
solution for VAE.
• (b) Discuss exercise for students:
what is a multivariate-Gaussian
distribution.
From https://en.wikipedia.org/wiki/Multivariate_normal_distribution (2 dimensions)
Answer: Exercise 6a,b for
Variational autoencoder VAE
• Which statement is incorrect for
VAE?: MC choices:
(choice3)There is a deterministic linear
solution for VAE (this is incorrect)
• (b) Discuss exercise for students:
what is a multivariate-Gaussian
distribution.
• Answer: Multivariate-dimensional
Gaussian:
• In probability theory and statistics,
the multivariate normal
distribution, multivariate Gaussian
distribution, or joint normal
distribution is a generalization of
the one-dimensional
(univariate) normal distribution to
higher dimensions. One definition is
that a random vector is said to be k-
variate normally distributed if
every linear combination of
its k components has a univariate
normal distribution.
From https://en.wikipedia.org/wiki/Multivariate_normal_distribution (2 dimensions)
Example of variational autoencoder
• Neural network
https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf
By random sampling
Random generator layer
Z
X̂
X
Training of Vanilla and Variational
Autoencoders
• Training of variational autoencoders is like training the
vanilla autoencoders. E.g., for the de-noised application,
presents noisy images to the input and clean image
versions to the output. Use backpropagation to train the
network. Read our previous discussion on vanilla
autoencoder
https://www.edureka.co/blog/autoencoders-tutorial/
http://www.math.purdue.edu/~buzzard/MA598-Spring2019/Lectures/Lec18%20-%20VAE.pptx
Variational Autoencoder (VAE)
• The latent variables, Z, are drawn from a probability distribution depending on the input, X, and the reconstruction is chosen probabilistically from z.
• That means after you obtain mean = µ and variance = σ², you sample from X (n = 500 neurons) to get Z (k = 30 neurons)
• X = (x1, x2, …, xn)
• Z = (z1, z2, …, zk)
https://jaan.io/what-is-variational-autoencoder-vae-tutorial/
https://jaan.io/what-is-variational-autoencoder-vae-tutorial/
Z
Encoder
Q (z|X)
Decoder
P (X|z)
Z=Latent
Variables
By sampling:
Z = sample from a distribution N(µ, σ)
X X̂
Three difficult concepts in VAE
1) Train the neural network to
maximize input/output likelihood
2) Use of Divergence (DKL)
3) Reparameterization
Variational Autoencoders
VAE Concept 1
Train the neural network to maximize
input/output likelihood
Tutorial on Variational Autoencoders
Carl Doersch
https://arxiv.org/abs/1606.05908
VAE Encoder
• The Encoder q(en)(z|x) takes input x and returns Hidden
parameters Z (random generated from µ,). (=encoder
parameters. weights/biases)
• From Z, use sampling to create input to the decoder
• Encoders and Decoders are neural networks (NN)
• Parameters in the NN are needed to be learned – so we have
to set up a loss function.
https://jaan.io/what-is-variational-autoencoder-vae-tutorial/
http://gregorygundersen.com/blog/2018/04/29/reparameterization/
https://jaan.io/what-is-variational-autoencoder-vae-tutorial/
Encoder(XZ)
q(en)(z|x)
Input
Data
Decoder(Z )
Hidden
Z
Output
ted
Reconstruc
X
X̂
 
Z
X
P de |
ˆ
)
(

X-> encoder –>Z->decoder x^
X̂
 
VAE Decoder
• The decoder takes the hidden variable Z (generated from the means and standard deviations) as input, and reconstructs the image X̂ using random sampling methods. (θ = decoder parameters: weights/biases)
• Encoders and Decoders are Neural Networks (NN)
• The parameters (θ, φ) in the NN need to be learned – so we have to set up a loss function.
https://jaan.io/what-is-variational-autoencoder-vae-tutorial/
https://jaan.io/what-is-variational-autoencoder-vae-tutorial/
Encoder (X→Z): q(en)φ(z|x); Input data: X; Hidden: Z; Decoder (Z→X̂): P(de)θ(X̂|z); Reconstructed output: X̂
The reconstruction loss (l(rec)) = the "expected negative log-likelihood" of the VAE
• Given xi ∈ X, z ∼ Q, E() is the expected value
• The idea is to train the Encoder/Decoder (neural network) to maximize the likelihood, or equivalently to minimize the binary cross-entropy (BCE) or mean squared error (MSE) between x and the reconstructed x̂
• To maximize the likelihood, we minimize the reconstruction loss = the "expected negative log-likelihood" (li) of the i-th datapoint xi (see appendix 2)
l_i^(rec)(θ, φ) = −E_{z∼Q_φ(z|x_i)}[ log P_θ^(de)( x̂_i | z ) ]
(Diagram: input data x_i → encoder q(en)φ(z|x_i) → hidden Z (µ, σ) → decoder P(de)θ(x̂_i|z) → reconstructed output x̂_i; the reconstruction loss l_i^(rec), measured by MSE or BCE between x_i and x̂_i, is to be minimized.)
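In code, this expected negative log-likelihood term is typically implemented as a per-image MSE (or BCE) between x_i and x̂_i; a minimal Keras-backend sketch (an illustrative formulation, not code from the slides):

import tensorflow.keras.backend as K

def reconstruction_loss(x, x_rec):
    # x, x_rec: (batch, num_pixels) flattened images.
    # Per-image squared error, i.e. the negative Gaussian log-likelihood up to constants.
    return K.sum(K.square(x - x_rec), axis=-1)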
Variational Autoencoders
VAE Concept 2
Use of Divergence (DKL):
Similar training images should produce
similar hidden data (means and
standard deviations)
http://mi.eng.cam.ac.uk/~mjfg/local/4F10/lect4.pdf
https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
https://jhui.github.io/2017/03/06/Variational-autoencoders/ (for relating
covariance and standard deviations, with good example)
How to make sure the neural networks produce similar hidden
data (means & standard deviations) from similar training images
• Problem: Inputs that we regard as similar may end up very different in z space (hidden means and standard deviations). That means some solutions may give a small loss l_i^(all)(θ, φ) even when q(en) and p(de) have very different distributions.
• Solution: Use p(z) = N(0, 1) and try to force q(en)(z|xi) (a neural network) to act similarly to a standard normal probability density function. We can use the Kullback-Leibler divergence (DKL) to do the checking.
For encoder and decoder
We discussed this in concept 1:
https://jaan.io/what-is-variational-autoencoder-vae-tutorial/
https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
http://gregorygundersen.com/blog/2018/04/29/reparameterization/
This is for concept 2:
We will minimize L(all):
L^(all)(θ, φ) = Σ_{i=1}^{n} [ l_i^(rec)(θ, φ) + D_KL( q^(en)(z|x_i) || N(0, I) ) ],  x_i ∈ X
where the first term is the loss between input x_i and output x̂_i, and the second term is the difference between q^(en) and a Gaussian.
Math background: Kullback–Leibler divergence (also known as relative
entropy) measures how one probability distribution is different from
another one -- reference probability distribution over the same variable
X.
•
Tutorial on Variational Autoencoders by Carl Doersch &
https://arxiv.org/abs/1606.05908
For two multivariate normal distributions N(μ1, Σ1) and N(μ2, Σ2) of dimension d:
D_KL( N(μ1, Σ1) || N(μ2, Σ2) ) = ½ [ tr(Σ2⁻¹Σ1) + (μ2 − μ1)ᵀ Σ2⁻¹ (μ2 − μ1) − d + ln( det Σ2 / det Σ1 ) ]   (I)
If N1 = N(μX, ΣX) and N2 = N(0, I), this becomes
D_KL( N(μX, ΣX) || N(0, I) ) = ½ [ tr(ΣX) + μXᵀ μX − d − ln det ΣX ]
So for the encoder distribution Q(z|x_i) = N(μX, ΣX):
D_KL( Q(z|x_i) || N(0, I) ) = ½ [ tr(ΣX) + μXᵀ μX − d − ln det ΣX ]
For equation (I) See https://arxiv.org/pdf/1907.08956.pdf
https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_diver
gence
Kullback–Leibler divergence D_KL(D1 || D2) = 0 indicates the two distributions D1, D2 are identical; hence μ2 = 0, σ2² = 1.
N(0, I) = zero-mean, unit-variance Gaussian
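A NumPy sketch of the closed form D_KL( N(μ, Σ) || N(0, I) ) above, checking that it is 0 when (μ, Σ) = (0, I) (the test values are arbitrary):

import numpy as np

def kl_gaussian_vs_standard_normal(mu, Sigma):
    # D_KL( N(mu, Sigma) || N(0, I) ) = 0.5 * [ tr(Sigma) + mu^T mu - d - ln det(Sigma) ]
    d = mu.shape[0]
    return 0.5 * (np.trace(Sigma) + mu @ mu - d - np.log(np.linalg.det(Sigma)))

print(kl_gaussian_vs_standard_normal(np.zeros(3), np.eye(3)))             # 0.0 (identical distributions)
print(kl_gaussian_vs_standard_normal(np.array([1.0, 0.0, 0.0]), np.eye(3)))  # 0.5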
Training: Combining concept 1 and 2 to minimize Loss li (X), of X= {x1,x2,..,xN} ,
E()=expected value . For the whole X, the average loss is
• Input to the encoder: x_i; output of the decoder: x̂_i.
• Prob. distribution of z (the latent, hidden variable) generated by the encoder side: Q(z|x_i). Prob. distribution of x̂_i generated by the decoder side: P(x̂_i|z).
• E_{z∼Q(z|x_i)}[ log P(x̂_i|z) ] = expected value of log P of the x̂_i generated at the decoder output, where z is the random variable generated from a Gaussian {mean µ(z|x_i), stdev σ(z|x_i)}. At this stage z could be any distribution, but we can assume a Gaussian (en.wikipedia.org/wiki/Normal_distribution); it can be formed by scaling N(0, 1). The advantage is that once µ, σ are found by the encoder, we use a random generator to generate z, and the decoder uses z to produce the output x̂_i.
• We want to maximize log P(x̂_i|z) (make the input/output likelihood similar). In practice we use E_{x_i∼X}[ E_{z∼Q(z|x_i)}[ log P(x̂_i|z) ] ], and maximizing it is the same as minimizing the negative log-likelihood:
Objective_function1 = −E_{x_i∼X}[ E_{z∼Q(z|x_i)}[ log P(x̂_i|z) ] ]
• Since P(x̂_i|z) is Gaussian, minimizing Objective_function1 reduces to minimizing the squared error between x_i and x̂_{i|z} (scaled by 1/(2σ²_{x̂|z})), i.e. the reconstruction loss l_i^(rec)(θ, φ) of Concept 1.
Concept 1
See http://bjlkeng.github.io/posts/variational-autoencoders/ & https://arxiv.org/abs/1312.6114
Concept 2 (Objective_function2):
Recall: q^(en)(z|x_i) is the prob. distribution of z generated by the encoder side. We mentioned earlier that we want q^(en)(z|x_i) to be close to a Gaussian, so put p(z) = N(0, I):
Objective_function2 = D_KL( q^(en)(z|x_i) || N(0, I) ) — the difference between q^(en)(z|x_i) and a Gaussian (see the previous slides on DKL).
We have shown earlier that D_KL( N(μX, ΣX) || N(0, I) ) = ½ [ tr(ΣX) + μXᵀ μX − d − ln det ΣX ], thus
Overall_objective_function = Objective_function1 + Objective_function2:
L^(all)(θ, φ) = Σ_{x_i∈X} [ l_i^(rec)(θ, φ) + D_KL( q^(en)(z|x_i) || N(0, I) ) ]
We will run an iterative algorithm to minimize L^(all).
Training: Combining concept 1 and 2 to minimize the loss l_i(X), of X = {x1, x2, .., xN}, E() = expected value. For the whole X, the average loss is
l_i^(1)(θ, φ) = l_i^(rec)(θ, φ) + D_KL( q^(en)(z|x_i) || N(0, I) )
Concept 1 (reconstruction loss): l_i^(rec)(θ, φ)
Concept 2: D_KL( q^(en)(z|x_i) || N(0, I) )
See http://bjlkeng.github.io/posts/variational-autoencoders/ & https://arxiv.org/abs/1312.6114
For VAE implementation
• Input X = (x1, x2, …, xn)
• Using the encoder, from X we obtain k different Gaussian distributions: N(mean_j, StdDev_j)
• Each z_j is generated by N(µ_j, σ_j), where j = 1, .., k; then we have Z = (z1, z2, .., zk)
From the previous slide,
D_KL( q^(en)(z|x_i) || N(0, I) ) = D_KL( N( (μ1, .., μk)ᵀ, diag(σ1², .., σk²) ) || N(0, I) )
= ½ Σ_{j=1}^{k} ( σ_j² + μ_j² − 1 − ln σ_j² )
(this is the term to be minimized for the VAE application).
See https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
https://wiseodd.github.io/techblog/2016/12/10/variational-autoencoder/
Concept 2:
In practice
• we replace 2 with exp(2) to enable stability
in calculation. And for the minimization of DKL,
this replacement gives the same result
Ch10. Auto and variational encoders
v230607d
55
   
 
   
   
   
 
   
   
   
 
on
minimizati
during
use
will
we
function
actual
the
is
This
1
)
exp(
2
1
,
0
||
,..,
,
,..,
with
)
ln(
and
)
exp(
with
replace
,
n
calculatio
numerical
in
stablity
enable
To
ln
1
2
1
,
0
||
,..,
,
,..,
,
0
||
|
earlier
seen
have
We
1
2
2
2
1
1
2
2
2
2
1
2
2
2
1
1
)
(













k
j
j
j
j
k
T
k
KL
k
j
j
j
j
k
T
k
KL
i
en
KL
I
N
diag
N
D
I
N
diag
N
D
I
N
x
z
q
D



















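With the log-variance convention above (the network output is treated as ln σ_j²), this term is commonly coded as in the following Keras-backend sketch (a standard formulation; the variable names are assumptions):

import tensorflow.keras.backend as K

def kl_term(z_mean, z_log_var):
    # 0.5 * sum_j( exp(log sigma_j^2) + mu_j^2 - 1 - log sigma_j^2 ), per sample
    return 0.5 * K.sum(K.exp(z_log_var) + K.square(z_mean) - 1.0 - z_log_var, axis=-1)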
Use neural networks to implement the system
Use backpropagation to minimize
the loss function (concept3):
Binary_cross_entropy (BCE) or Mean
squared error (MSE) between input X
and output 𝑋
Use backpropagation to
minimize
the loss function L(all) of encoder:
(concept1 & 2)
Encoder neural network
Decoder neural network
Minimize loss L^(all):
L^(all)(θ, φ) = Σ_i [ || x_i − x̂_{i|z} ||² / (2 σ²_{x̂|z}) + D_KL( q^(en)(z|x_i) || N(0, I) ) ]
where the first term (concept 1) is the reconstruction loss between input X and output X̂, and the second term (concept 2) is D_KL( q^(en)(z|x_i) || N(0, I) ).
Input
Data
Concept 2
Concept 1
The training method
http://anotherdatum.com/vae.html
The latent
vector
represents
Gaussian
distributions
Input
and
output
are
similar
Minimize loss (L(all))
Using Concept 1 &2
X̂
X
z
Variational Autoencoders
VAE Concept 3
Reparameterization: the method to
enable backpropagation for training
neural network that involves random
processes
VAE generative model
• In theory, we can sample z_i from N(µ_i, σ_i) produced by the encoder. Note: N() = Gaussian function.
• Z is the input to the decoder to produce the output.
• Alternatively, we find z by sampling ε (called epsilon or eps) from N(0, 1) (Gaussian mean = 0, StdDev = 1), then find z using: z_i = µ_i + ε·σ_i
• Then z has mean = µ_i and StdDev = σ_i as required
• See gen_data_using_mean0_sigma1.m in the appendix
• This is called Reparameterization
• Reason: with this form we can back-propagate through the function during training
Train the variational-encoder
• How to train the
auto-encoder neural
network?
• Difficulty
– Since a random
process is involved,
backpropagation
cannot be executed
• Solution
– Use of the re-
parameterization trick
Generate z by
random sampling
Training : an example
https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf
Random
generator
layer Z
X̂
X
• Learning algorithm : The probability function (left side diagram)
cannot be back-propagated, therefore Reparameterization trick
(right side diagram) should be applied
http://bjlkeng.github.io/posts/variational-autoencoders/
Figure 3: An initial attempt at a variational autoencoder
without the "Reparameterization trick". Objective functions
shown in red. We cannot back-propagate through the
stochastic sampling operation because it is not a continuous
deterministic function.
Figure 4: A variational autoencoder with the
"Reparameterization trick". Notice that all
operations between the inputs and objectives are
continuous deterministic functions, allowing back-
propagation to occur.
StdDev=

Qq(z|x)
P(X|z)
This Q(z|x)
=N(µz|X,z|X) should
be close to N(0,I)
We also want
the output to be
similar to the
input
Problem:
Cannot backpropagate
Solution:
Reparamete
rization
trick
Random
generator
layer 
Intuition of the Reparameterization trick
• The encoder uses random sampling to generate z
• Backpropagation (during training) is not possible for the random sampling process
• Reparameterization can produce the same effect for the encoder
• Backpropagation (during training) is then possible because no random process is involved
Encoder
Path by
random
sampling
Backpropagation
path

Reparameterization
Z can be produced by a scaled N(0,I)
• Reparameterization generates any
Gaussian distribution of known mean
(µx), standard-deviation (x) by using
the equation (Z= µx+ x ) based on
the variable  generated by N(0,1) .
• After the forward pass,  is
generated, so  is not random. It is a
data to be used in backpropagation
during training.
• N(0,1) =Gaussian with mean=0 and
standard deviation=1
•  = the generated variable of N(0,1)
• µx =mean
• x = standard-deviation
• Z= µx+ x Ch10. Auto and variational encoders
v230607d
64
mean Standard deviation

To produce the random
variable  N(0,1): mean
=0, std=1
Input data
https://learnopencv.com/variational-
autoencoder-in-tensorflow/
Summary for reparameterization
• ε = the variable generated by sampling N(0, 1)
• µx = mean
• σx = standard deviation
• z = µx + σx·ε ; this equation is deterministic, so it can be backpropagated
• See the code in
• https://learnopencv.com/variational-autoencoder-in-tensorflow/
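This is the same idea as the sampling function used in the Keras VAE example later in these slides; a self-contained sketch:

import tensorflow.keras.backend as K

def sampling(args):
    # Reparameterization: z = mu + sigma * eps, with eps ~ N(0, I).
    z_mean, z_log_var = args                       # encoder outputs: mean and log-variance
    eps = K.random_normal(shape=K.shape(z_mean))   # eps is sampled once, then treated as data
    return z_mean + K.exp(0.5 * z_log_var) * eps   # exp(0.5 * log var) = sigma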
Exercise 7
• In the reparameterization of the variational autoencoder method shown below, ε = 0.35 is a value randomly sampled from the normal distribution with mean = 0 and standard deviation = 1. If the output of the encoder network has µ_z|x = mean = 0.3 and σ_z|x = standard deviation = 0.8, find the value z.
• MC choices:
1) 0.50
2) 0.54
3) 0.56
4) 0.58
Answer: Exercise 7
• In the reparameterization of the variational autoencoder method shown below, ε = 0.35 is a value randomly sampled from the normal distribution with mean = 0 and standard deviation = 1. If the output of the encoder network has µ_z|x = mean = 0.3 and σ_z|x = standard deviation = 0.8, find the value z.
• MC choices:
1) 0.50
2) 0.54
3) 0.56
4) 0.58 (correct)
Answer:
z = µ + ε·σ_z|x ; here ε = 0.35, µ = 0.3, standard deviation σ_z|x = 0.8
z = 0.3 + 0.35*0.8 = 0.58
Exercise 8
• Discuss exercise
• why Reparameterization is needed?
Answer: Exercise 8
Discuss why Reparameterization is needed.
• Answer: Z is generated by a random process if you have mean = µx and standardDev = σx. Since the VAE system is implemented using neural networks, they need backpropagation for training the weights/parameters, and the random process of generating Z cannot be backpropagated.
• Solution: The reparameterization trick converts the random process into a deterministic process (z = µx + σx·ε) with the help of a random variable ε generated from the normal distribution with mean = 0 and standardDev = 1: N(0, 1). Hence this deterministic process can be backpropagated.
Reparameterization trick
Demo Matlab code gen_data_using_mean0_sigma1.m shows the idea: X = µx + σx·eps is the formula for generating X from eps (generated by a normal distribution with mean = 0, std = 1)
https://nbviewer.jupyter.org/github/gokererdogan/Notebooks/blob/master/Reparameterization%20Trick.ipynb
• %gen_data_using_mean0_sigma1.m
• clear
• %%large number of samples %%
• eps=randn(10000,1);
• mu_x=2 %this is your mean
• sigma_x=1 %this is your std
• x=mu_x+(eps*sigma_x);
• grad2_of_mean=
sum(2*(mu_x+eps))/length(x);
• 'grad2 of mean='
• grad2_of_mean
• 'mean(x)='
• mean(x)
• 'std(x)='
• std(x)
• Result:grad2_of_mean = 3.9933
• mean(x)= 1.9960 (approximate 2)
• std(x)= 0.9984 (approximate 1)
• σx = standard deviation of x
• µx = mean of x
• eps = N(mean=0, std=1), normal dist.
• X = µx + σx·eps
• And the gradient of the mean is expected_val_of(2(eps+mu_x)), assuming σx = 1 for simplicity
• The above is not random, because eps has already been generated and µx is the current mean. We can use this in our backpropagation formula to find the updated mean.
Using X = µx + σx·eps, we can find its gradient, bypassing the random process. Because eps is generated by a random process during the neural net forward pass, during backpropagation it is just data (now available deterministically) to be used. Note: grad2_of_mean = expected_value_of(2(eps+mu_x))
Implementation
Using Keras
https://github.com/keras-
team/keras/tree/master/
Keras
(Figure: VAE network diagram; the latent layer outputs mean µ and StdDev σ.)
Keras implementation of VAE
• x = Input(shape=(original_dim,))
• h = Dense(intermediate_dim, activation='relu')(x)
• z_mu = Dense(latent_dim)(h)
• z_log_var = Dense(latent_dim)(h)
• z_mu, z_log_var = KLDivergenceLayer()([z_mu, z_log_var])
• # Use of lambda: normalize log variance to std dev
• z_sigma = Lambda(lambda t: K.exp(.5*t))(z_log_var)
• eps = Input(tensor=K.random_normal(shape=(K.shape(x)[0],
• latent_dim)))
• z_eps = Multiply()([z_sigma, eps])
• z = Add()([z_mu, z_eps])
• decoder = Sequential([
• Dense(intermediate_dim, input_dim=latent_dim,
activation='relu'),
• Dense(original_dim, activation='sigmoid')
• ])
• x_pred = decoder(z)
http://louistiao.me/posts/implementing-variational-autoencoders-in-keras-beyond-the-quickstart-tutorial/
original_dim = 784
intermediate_dim = 256
latent_dim = 2
batch_size = 100
epochs = 50
epsilon_std = 1.0
StdDev = σ; predicted output x_pred
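The snippet above refers to a KLDivergenceLayer and a reconstruction loss that are not shown on this slide; one possible completion in the spirit of the linked tutorial (this is a hedged sketch, not the tutorial's exact code) is:

import tensorflow.keras.backend as K
from tensorflow.keras.layers import Layer
from tensorflow.keras.models import Model
from tensorflow.keras.losses import binary_crossentropy

class KLDivergenceLayer(Layer):
    """Identity layer that adds the KL term to the model loss (one possible implementation)."""
    def call(self, inputs):
        z_mu, z_log_var = inputs
        kl = 0.5 * K.sum(K.exp(z_log_var) + K.square(z_mu) - 1.0 - z_log_var, axis=-1)
        self.add_loss(K.mean(kl))
        return inputs

def nll(x_true, x_pred):
    # Per-image reconstruction loss: sum of pixel-wise binary cross-entropy (original_dim = 784 above).
    return original_dim * binary_crossentropy(x_true, x_pred)

# Assemble and compile the VAE using the tensors defined in the snippet above (x, eps, x_pred).
vae = Model(inputs=[x, eps], outputs=x_pred)
vae.compile(optimizer='rmsprop', loss=nll)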
variational_autoencoder_deconv .py
from https://github.com/keras-team/keras/tree/master/
• '''Example of VAE on MNIST dataset using CNN
•
• The VAE has a modular design. The encoder, decoder and VAE
• are 3 models that share weights. After training the VAE model,
• the encoder can be used to generate latent vectors.
• The decoder can be used to generate MNIST digits by sampling the
• latent vector from a Gaussian distribution with mean=0 and std=1.
•
• # Reference
•
• [1] Kingma, Diederik P., and Max Welling.
• "Auto-encoding variational bayes."
• https://arxiv.org/abs/1312.6114
• '''
•
• from __future__ import absolute_import
• from __future__ import division
• from __future__ import print_function
•
• from tensorflow.keras.layers import Dense, Input
• from tensorflow.keras.layers import Conv2D, Flatten, Lambda
• from tensorflow.keras.layers import Reshape, Conv2DTranspose
• from tensorflow.keras.models import Model
• from tensorflow.keras.datasets import mnist
• from tensorflow.keras.losses import mse, binary_crossentropy
• from tensorflow.keras.utils import plot_model
• from tensorflow.keras import backend as K
•
• import numpy as np
• import matplotlib.pyplot as plt
• import argparse
• import os
•
•
• # reparameterization trick
• # instead of sampling from Q(z|X), sample eps = N(0,I)
• # then z = z_mean + sqrt(var)*eps
• def sampling(args):
• """Reparameterization trick by sampling fr an isotropic unit Gaussian.
•
• # Arguments
• args (tensor): mean and log of variance of Q(z|X)
•
In variational_autoencoder_deconv, use:
vae.save_weights('vae_cnn_mnist.tf') # instead of vae.save_weights('vae_cnn_mnist.h5')
Results:
Epoch 30/30
60000/60000 [==============================] - 91s 2ms/sample - loss: 145.7313 - val_loss: 146.8615
To run this, you need to install:
>>conda install graphviz
>>conda install pydot
variational_autoencoder_deconv.py
from https://github.com/keras-team/keras/tree/master/
• Results
Summary
• Learned vanilla autoencoder
• Learned variational autoencoder
• Learned the Reparameterization trick to
enable learning in variational autoencoder
Reference
• https://nbviewer.jupyter.org/github/gokererdogan/Notebooks/blob/master/Reparameterization%20Trick.ipynb
Appendices
Appendix 1: Training: Combining concept 1 and 2 to minimize Loss L.
X={x1,x2,..,xN} , E()=expected value . For the whole X, the average loss is
Start from the marginal likelihood of each datapoint x_i and the KL divergence between the encoder distribution Q(z|x_i) and the true posterior P(z|x_i):
log P(x_i) − D_KL( Q(z|x_i) || P(z|x_i) ) = E_{z∼Q(z|x_i)}[ log P(x_i|z) ] − D_KL( Q(z|x_i) || P(z) )   (II)
Averaging over the whole data set X = {x1, .., xN}:
E_{x_i∼X}[ log P(x_i) − D_KL( Q(z|x_i) || P(z|x_i) ) ] ≈ (1/N) Σ_i [ E_{z∼Q(z|x_i)}[ log P(x_i|z) ] − D_KL( Q(z|x_i) || P(z) ) ]
Note: P(z) = N(0, I). If Q(z|x_i) = N( μ_{z|x_i}, Σ_{z|x_i} ), use the formula in the previous slide:
D_KL( N(μX, ΣX) || N(0, I) ) = ½ [ tr(ΣX) + μXᵀ μX − d − ln det ΣX ]
So, for the whole X, the average loss to be minimized is
L(θ, φ) = Σ_{x_i∈X} [ l_i^(rec)(θ, φ) + D_KL( Q(z|x_i) || N(0, I) ) ]
We will run an iterative algorithm to minimize L.
Concept 1 (reconstruction term); Concept 2 (DKL term)
See http://bjlkeng.github.io/posts/variational-autoencoders/ & https://arxiv.org/abs/1312.6114
Appendix 2
Probability likelihood
A tutorial
KH Wong
Overview
• Bayesian rules
• Gaussian distribution
• Probability vs likelihood
• Log-likelihood and maximum likelihood
• Negative log-likelihood
Bayesian rules
Bayesian rules
• P(B|A)=P(A|B)P(B)/P(A)
• P(A and B)=P(A,B)=P(A|B) P(B)
• P(A,B|C)=P(A|B,C) P(B|C)
• Prove the above as exercises
In each cell, the joint probability p(r, c) is re-expressed by the equivalent form
p(r | c) p(c) from the definition of conditional probability in Equation 5.3.
The marginal probability p(r) =Σc*p(r | c*) p(c*),
https://www.sciencedirect.com/topics/mathematics/marginal-probability
Gaussian distribution
• %2-D Gaussian distribution
P(xj)
• %matlab code----------
• clear, N=10
• [X1,X2]=meshgrid(-N:N,-N:N);
• sigma =2.5;mean=[3 3]'
• G=1/(2*pi*sigma^2)*exp(-((X1-mean(1)).^2+(X2-mean(2)).^2)/(2*sigma^2));
• G=G./sum(G(:)) %normalise it
• 'sigma is ', sigma
• 'sum(G(:)) is ',sum(G(:))
• 'max(max(G(:)))
is',max(max(G(:)))
• figure(1), clf
• surf(X1,X2,G);
• xlabel('x1'),ylabel('x2')
1-D Gaussian, a sample x, mean μ0, variance σ0²:
N(x | μ0, σ0²) = 1/(2πσ0²)^(1/2) · exp( −(x − μ0)² / (2σ0²) )
2-D isotropic (circularly symmetric) Gaussian:
N(x1, x2 | 0, σ²) = 1/(2πσ²) · exp( −(x1² + x2²) / (2σ²) )
Probability vs likelihood
• It is two sides of a coin.
• P() Probability function :
– Given a Gaussian model (with mean µo and variance o), the
probability function P(X| µo,o) measures the probability that the
observation X is generated by the model.
• L() likelihood function:
– Given data X, the Likelihood function L(µo,o| X) measures the
probability that X fits the Gaussian model with mean µo and variance
o.
– Major application: Given data X, we can maximize the Likelihood
function L(µo,o| X) to find the model (µo,o) that fits the data. This is
called the maximum likelihood method.
– Log-likelihood rather than likelihood is more convenient for finding
the maximum, hence it is often used.
P(X | µo, σo²) = L(µo, σo² | X)
Likelihood function L(θ) of n-dimensional data
• Likelihood function
• Intuition: the likelihood function L(µ, σ|X) means: given a Gaussian model N(mean, variance), how well the multivariate data X = [x1, x2, x3, .., xn] fits the model with parameters (µ, σ).
X = [x1, x2, …, xn]
L(µ, σ²|X) = Π_{j=1}^{n} 1/√(2πσ²) · exp( −(x_j − µ)² / (2σ²) ) = (2πσ²)^(−n/2) · exp( −Σ_{j=1}^{n} (x_j − µ)² / (2σ²) )
Proof: Given the assumption that the observations from the sample are IID, the likelihood function factorizes into the product of the individual Gaussian densities N(x_j | µ, σ²), which gives the expression above.
A more useful representation is Log-Likelihood function= Log(L( ))=l ()
• Intuition:
• The peak of Likelihood and
Log-Likelihood functions
should be the same.
• The two are one to one
mapping hence no data
loss.
• Log based method is
easier to be handled by
math, so log-Likelihood
function is often used
• For computers, log
numbers are smaller
hence may save memory.
Using log, we can use
addition rather than
multiplication which
makes computation easier.
Ch10. Auto and variational encoders
v230607d
89
 
 
   
 
   
   
  proved!
• Proof:
l(µ, σ² | x1, x2, ..., xn) = ln L(µ, σ² | x1, x2, ..., xn)
  = ln [ (2πσ²)^(−n/2) · exp( −(1/(2σ²)) Σ_{j=1..n} (x_j − µ)² ) ]
  = ln (2πσ²)^(−n/2) + ln exp( −(1/(2σ²)) Σ_{j=1..n} (x_j − µ)² )
  = −(n/2) ln(2π) − (n/2) ln(σ²) − (1/(2σ²)) Σ_{j=1..n} (x_j − µ)² , proved!

Log-Likelihood function: for X = [x1, x2, ..., xn], by defn.
L(µ, σ² | X) = (2πσ²)^(−n/2) · exp( −(1/(2σ²)) Σ_{j=1..n} (x_j − µ)² )
l(µ, σ² | x1, ..., xn) = −(n/2) ln(2π) − (n/2) ln(σ²) − (1/(2σ²)) Σ_{j=1..n} (x_j − µ)²
Maximum Likelihood V.S. Log-Likelihood
Ch10. Auto and variational encoders
v230607d
90
Given X = [x1, x2, ..., xn], the Gaussian parameter set is θ = (µ, σ²).

Likelihood function:
L(µ, σ² | X) = (2πσ²)^(−n/2) · exp( −(1/(2σ²)) Σ_{j=1..n} (x_j − µ)² )

Take the Log of the likelihood function to get the Log-Likelihood function:
l(µ, σ² | X) = Log( L(µ, σ² | X) ) = −(n/2) ln(2π) − (n/2) ln(σ²) − (1/(2σ²)) Σ_{j=1..n} (x_j − µ)²
Since the logarithm is a monotonic function, the likelihood L and the log-likelihood l are maximized by the same parameters:
arg_maxθ L(θ|X) = arg_maxθ l(θ|X)
The maximum happens at θ = (µ, σ²), where µ = (1/n) Σ_{j=1..n} x_j and variance σ² = (1/n) Σ_{j=1..n} (x_j − µ)².
http://jrmeyer.github.io/machinelearning/2017/08/18/mle.html
https://towardsdatascience.com/probability-concepts-explained-maximum-likelihood-estimation-c7b4342fdbb1
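A quick numerical sanity check of this result (a sketch with arbitrary synthetic data): the sample mean and the 1/n sample variance give a larger log-likelihood than nearby parameter values.

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=5000)
n = x.size

def loglik(mu, var):
    # l(mu, sigma^2 | X) from the slide above
    return -n/2*np.log(2*np.pi) - n/2*np.log(var) - np.sum((x - mu)**2) / (2*var)

mu_hat  = x.mean()                  # MLE mean  = (1/n) * sum(x_j)
var_hat = np.mean((x - mu_hat)**2)  # MLE variance, note the 1/n factor

print(loglik(mu_hat, var_hat))          # maximum
print(loglik(mu_hat + 0.1, var_hat))    # smaller
print(loglik(mu_hat, var_hat * 1.1))    # smaller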
•
Important
Proof 1: Maximum Log-Likelihood function of a
Multivariate Gaussian distribution
Ch10. Auto and variational encoders
v230607d
91
Max log-likelihood is at d ln L(µ, σ² | x1, x2, ..., xn) / dµ = 0.
For a Gaussian function, we showed earlier that
l(X) = −(n/2) ln(2π) − (n/2) ln(σ²) − (1/(2σ²)) Σ_{j=1..n} (x_j − µ)² ,
so
dl(X)/dµ = d/dµ [ −(n/2) ln(2π) − (n/2) ln(σ²) − (1/(2σ²)) Σ_{j=1..n} (x_j − µ)² ]
         = (1/σ²) Σ_{j=1..n} (x_j − µ) = 0 ,
hence µ = (1/n) Σ_{j=1..n} x_j = mean of x.
So what is the expression of σ² that maximizes the log-likelihood?
https://towardsdatascience.com/probability-concepts-explained-maximum-likelihood-estimation-c7b4342fdbb1
Likelihood function:
L(µ, σ² | X) = (2πσ²)^(−n/2) · exp( −(1/(2σ²)) Σ_{j=1..n} (x_j − µ)² )
Log-Likelihood function:
l(µ, σ² | X) = −(n/2) ln(2π) − (n/2) ln(σ²) − (1/(2σ²)) Σ_{j=1..n} (x_j − µ)²
The maximum Log-likelihood happens when mean µ = (1/n) Σ_{j=1..n} x_j and variance σ² = (1/n) Σ_{j=1..n} (x_j − µ)².
http://jrmeyer.github.io/machinelearning/2017/08/18/mle.html
Proof 2 : Maximum Log-Likelihood function of
a Multivariate Gaussian distribution
Ch10. Auto and variational encoders
v230607d
92
Maximum log-likelihood happens when d ln L(µ, σ² | x1, x2, ..., xn) / dσ² = 0.
For a Gaussian function,
l = −(n/2) ln(2π) − (n/2) ln(σ²) − (1/(2σ²)) Σ_{j=1..n} (x_j − µ)² ,
so
dl/dσ² = −(n/2)(1/σ²) + (1/(2σ⁴)) Σ_{j=1..n} (x_j − µ)² = 0 ,
hence σ̂² = (1/n) Σ_{j=1..n} (x_j − µ)² = variance of x.
That means: given (µ, σ²), the data is most likely to be generated by a Gaussian distribution whose mean is the mean of the x_j and whose variance is the variance of the x_j.
http://people.stat.sfu.ca/~raltman/stat402/402L4.pdf
https://towardsdatascience.com/probability-concepts-explained-maximum-likelihood-estimation-c7b4342fdbb1
Note: d ln(z) / dz = 1/z
Likelihood function:
L(µ, σ² | X) = (2πσ²)^(−n/2) · exp( −(1/(2σ²)) Σ_{j=1..n} (x_j − µ)² )
Log-Likelihood function:
l(µ, σ² | X) = −(n/2) ln(2π) − (n/2) ln(σ²) − (1/(2σ²)) Σ_{j=1..n} (x_j − µ)²
The maximum Log-likelihood happens when mean µ = (1/n) Σ_{j=1..n} x_j and variance σ² = (1/n) Σ_{j=1..n} (x_j − µ)².
Alternative proof: Maximum Log_likelihood
Find the most suitable variance σ²
• Maximum likelihood is at
Ch10. Auto and variational encoders
v230607d
93
µ̂ = (1/n) Σ_{j=1..n} x_j ,   σ̂² = (1/n) Σ_{j=1..n} (x_j − µ̂)²

• Proof: Solve the maximum Log_likelihood problem
∂ l(µ, σ² | x1, x2, ..., xn) / ∂σ² = 0 :
∂/∂σ² [ −(n/2) ln(2π) − (n/2) ln(σ²) − (1/(2σ²)) Σ_{j=1..n} (x_j − µ)² ]
  = −(n/2)(1/σ²) + (1/(2σ⁴)) Σ_{j=1..n} (x_j − µ)² ,
which is equal to zero only if σ̂² = (1/n) Σ_{j=1..n} (x_j − µ)²
(the maximum Log_likelihood of the Gaussian occurs here), done!
Negative Log-Likelihood (NLL)
And its application in softmax
To maximize log-likelihood, we can
minimize its negative log-likelihood
(NLL) function
Ch10. Auto and variational encoders
v230607d
94
Softmax function
• https://medium.com/data-science-bootcamp/understand-the-softmax-function-in-minutes-f3a59641e86d
• y=[2 , 1, 0.1]’
• Softmax(y)=[0.6590, 0.242,0.0986]’
• exp(2)/((exp(2)+exp(1)+exp(0.1))=0.6590
• exp(1)/((exp(2)+exp(1)+exp(0.1))= 0.2424
• exp(0.1)/((exp(2)+exp(1)+exp(0.1))= 0.0986
Ch10. Auto and variational encoders
v230607d
95
softmax(y_i) = exp(y_i) / Σ_{i=1..n} exp(y_i) ,   for i = 1, 2, ..., n
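The worked example above (softmax of y = [2, 1, 0.1]) can be reproduced with a few lines of Python (a minimal sketch):

import numpy as np

def softmax(y):
    e = np.exp(y - np.max(y))   # subtract the max for numerical stability
    return e / e.sum()

y = np.array([2.0, 1.0, 0.1])
print(softmax(y))               # approx. [0.6590, 0.2424, 0.0986]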
Softmax Activation Function
• https://ljvmiranda921.github.io/notebook/2017/08/13/softmax-and-the-negative-log-likelihood/#nll
Ch10. Auto and variational encoders
v230607d
96
exp(5)/(exp(5)+exp(4)+exp(2)) = 0.705
exp(4)/(exp(5)+exp(4)+exp(2)) = 0.259
Negative Log-Likelihood (NLL)
• To maximize the likelihood, we
pick the minimum negative log-
likelihood (NLL)
Ch10. Auto and variational encoders
v230607d
97
https://ljvmiranda921.github.io/notebook/2017/08/13/softmax-and-the-negative-log-likelihood/#nll
=-ln(likelihood)
=-ln(0.02)=3.91
=-ln(0)=infinity
=-ln(0.98)=0.02
Minimum negative log-
likelihood (NLL) is
picked, so 0.02 is
selected
Softmax
output as the
likelihood
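A minimal sketch of the NLL computation illustrated above (the class scores and the class index are example values): the loss is the negative log of the softmax output assigned to the correct class, so a confident correct prediction gives a small loss and a near-zero output gives a very large one.

import numpy as np

def softmax(y):
    e = np.exp(y - np.max(y))
    return e / e.sum()

def nll(scores, true_class):
    p = softmax(scores)[true_class]   # softmax output used as the likelihood
    return -np.log(p)

print(-np.log(0.02))                        # approx. 3.91
print(-np.log(0.98))                        # approx. 0.02
print(nll(np.array([5.0, 4.0, 2.0]), 0))    # -ln(0.705), approx. 0.35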
•
Ch10. Auto and variational encoders
v230607d
98
Continue
FAQ on VAE
• FAQ Assign3, 2020 Nov 17
• Question 3.1:
• Hi, sorry for interrupting, i got a question for assignment 3 auto-encoder part. in question one, encoder hidden layer and decoder hidden layer are in different size(15 and 18), and the
neurons for (means, variance) and samples are different as well, does it mean in variational auto-encoder, encoder hidden layer size and decoder hidden layer size can be different, and
neuron numbers for means, variance and samples don't have to match? if so, is there some random drop-out functions when means, variance and samples size don't match? Thanks.
• Answer 3.1:
• This is a very good question. In my notes, mean, variance and sample_z are of the same sizes, but I have found some implementations showing that this may not be the only case. Yes, it is a kind of dropout as described by the papers shown below. I think the rule is that the mean and variance neurons should have the same number because they go in pairs, but the randomly generated sample_z can be of a different size. It is done by randomly (via a Monte Carlo method) selecting the pair of mean and variance used to generate each value of sample_z. Neural computing is a trial-and-error method: you may try different approaches, and the preferred method is the one which gives you a good result. You may explore more papers and see whether my interpretation is correct.
• See section3.4 of
• https://arxiv.org/pdf/1706.03643.pdf
• Also
• https://deeplearn.org/arxiv/92996/generating-data-using-monte-carlo-dropout
• ////////////////////////////////////////////////////////
• Question 3.2 on VAE (variational Auto-encoder)
• Question 3.2a:
• In your notes, variational auto encoder turns input x into means and deviations of a multivariate Gaussian distribution, then use a random sampling method to create output. The output
is Z and Z is generating random sample to the next layer of neuron.
• (i) How do we train the neuron network if the input is from random sampling? (ii) And How do we force a multivariate Gaussian distribution Z to uni-variate Gaussian distribution N(0,1)?
• Answer 3.2a: I will answer part (ii) of the above question first. It is not to avoid over-fitting. From the input to the latent (hidden) representation z, there is a random process. A random process can have many different forms: it can be Gaussian, Laplace, Cauchy, etc., or some unknown form. If there is no control, you may not be able to repeat the process, hence training becomes useless. In the VAE paper (https://arxiv.org/abs/1312.6114), the authors propose to force the random probability distribution to be Gaussian (I guess you may force it to be Laplace etc. and it can still work, but you have to be consistent in using one model). How? The method uses D_KL (Kullback–Leibler divergence). It is concept 2 in my notes, used to make sure the random process is Gaussian.
• ////--------------------------------------------------------------------------------------------------------------------
• Question 3.2b : Why do we still need re-parameterization to do back propagation?
• Answer 3.2b: It is known that a random process cannot be back-propagated through, but re-parameterization provides a means to back-propagate. First, zi is not generated directly by a random generator with mean=µi, std_dev=σi, but rather by an indirect method of finding zi (using zi = µi + ε*σi) through ε, which is generated by N(0,1) = N_Gaussian(mean=0, std_dev=1). If you have doubt, run my Matlab program on p.71 of 5707_10_auto-encoder (1).pptx. In short, it is found that if we use zi = µi + ε*σi to generate zi, then zi will have the characteristics of (mean=µi, std_dev=σi).
• Then, why do we use that indirect method? Because during the forward pass of neural computing, ε is already calculated from N(0,1); it is a real number, not a random variable (the same holds for the mean=µi and std_dev=σi neuron outputs), so during back-propagation we can use zi = µi + ε*σi to find out how much to back-propagate to change the weights of the neurons. In the lecture note p.67 of 5707_10_auto-encoder.pptx, the gradient is calculated (recall that for neural back-propagation computing, the gradient is needed to find de/dw), and we can form our weight-updating program based on this formulation. The idea is that with this gradient, we know how to change µi, σi if we know the change of zi (if it were a plain random process, we simply would not know how). However, you don't need to enter this gradient into the VAE program because it is already in the Tensorflow-Keras library; it is done automatically by Tensorflow-Keras as long as you provide the zi = µi + ε*σi formulation of the forward pass.
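The check described above (that z_i = µ_i + ε*σ_i with ε drawn from N(0,1) indeed has mean µ_i and standard deviation σ_i) can be reproduced with a short NumPy sketch in place of the Matlab program mentioned; the parameter values here are arbitrary:

import numpy as np

rng = np.random.default_rng(0)
mu_i, sigma_i = 1.5, 0.7             # example mean / std for one latent unit
eps = rng.standard_normal(100_000)   # eps ~ N(0, 1)
z_i = mu_i + sigma_i * eps           # reparameterized samples

print(z_i.mean(), z_i.std())         # approx. 1.5 and 0.7, as claimed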
• ////--------------------------------------------------------------------------------------------------------------------
• Question 3.2c: If we put N(0.15, 2.3), does that mean the input mean is 0.15 and the std is 2.3? Then it goes through KL to compute the error with the expected distribution.
• Answer 3.2c: The use of N(0,1) (a Gaussian with mean=0, std_dev=1) is to make the formulation easier to program and calculate; see p.51, where the D_KL formulation (the loss function is based on it) becomes simpler. I guess you may assume all distributions to be N(0.15, 2.3), but then your loss function becomes more complex. The idea is to make sure zi is generated by a Gaussian process; zi can be generated with a different mean and std_dev, but it needs to be Gaussian. It is done by reducing D_KL(random process that generates zi || N(0,1)). So comparing the process that generates zi with a typical Gaussian like N(0,1) to form the loss function is reasonable.
Ch10. Auto and variational encoders
v230607d
99
•
Ch10. Auto and variational encoders
v230607d
100
To prove Eq [x2]=Ep [2( +)],
https://stats.stackexchange.com/questions/199605/how-does-the-reparameterization-trick-for-vaes-work-and-why-is-it-important
Alternative Derivation: To prove ∇µ Eq[x²] = Ep[2(µ + ε)]
•
Ch10. Auto and variational encoders
v230607d
101
We want to minimize Eq[x²] over µ, so we need ∇µ Eq[x²], where q(x) = N(µ, 1) is a normal distribution of mean µ and variance 1.

(i) By the definition of expectation, Eq[x²] = ∫ q(x) x² dx, so
∇µ Eq[x²] = ∫ x² ∇µ q(x) dx = ∫ x² q(x) ∇µ log q(x) dx = Eq[ x² ∇µ log q(x) ]
(using ∇µ q(x) = q(x) ∇µ log q(x), since d log q / dµ = (1/q) dq/dµ).

(ii) Since q(x) = N(µ, 1) = (1/(2π)^(1/2)) · exp( −(x − µ)²/2 ),
log q(x) = −(x − µ)²/2 − (1/2) log(2π), therefore ∇µ log q(x) = (x − µ).

Putting (ii) into (i): ∇µ Eq[x²] = Eq[ x² (x − µ) ].

Alternatively, since x = µ + ε with ε ~ N(0, 1), and p is the distribution of ε (i.e. p = N(0, 1)),
Eq[x²] = Ep[ (µ + ε)² ], therefore the derivative of Eq[x²] with respect to µ is Ep[ 2(µ + ε) ].
Reparameterization:
Backpropagation needs derivative of a function (process)
•
Ch10. Auto and variational encoders
v230607d
102
https://stats.stackexchange.com/questions/199605/how-does-the-reparameterization-trick-for-vaes-work-and-why-is-it-important
Derivative of a random process
is not possible
Derivative of the Reparameterization process
(no random node is involved) is possible
•
Ch10. Auto and variational encoders
v230607d
103
https://stats.stackexchange.com/questions/199605/how-does-the-reparameterization-trick-for-vaes-work-and-why-is-it-important
Explanation:
Summary: Backpropagation
• The gradient during backpropagation is
• ∇µx Eq[z²] = ∇µx Ep[(µx + ε)²] = Ep[2(µx + ε)] ------(*)
• This gradient is required for the neural network
learning (back-propagation) process
• ε is the variable generated from N(0,1) during the forward
pass
• µx is the current mean and is given at the forward pass
• So, the gradient (see formula * above) can be found
and used in backpropagation
Ch10. Auto and variational encoders
v230607d
104
Gradient for backpropagation
• Eq() = expectation
• µ = mean, σ = standard deviation
• z = µ + σε, with ε sampled from N(0,I)
• The above is deterministic, so we can take the derivative with
respect to µ and thus find the derivative of Eq[z²]
• Eq[z²] = Ep[(µ + σε)²]
• Assume σ = 1 for simplicity (µ, σ are independent)
• Derivative of Eq[z²] = ∂Eq[z²]/∂µ = ∇µ Eq[z²] = Ep[2(µ + ε)]
• (The proof is in the appendix: To prove ∇µ Eq[z²] = Ep[2(µ + ε)])
• If we have enough samples of ε, we can estimate ∇µ Eq[z²]. This
gradient is required for the neural network learning (back-
propagation) process
• µ = current mean, ε = randomly generated by N(0,I) during the
forward pass
• For σ, we can apply the same treatment for updating
Ch10. Auto and variational encoders
v230607d
105
https://nbviewer.jupyter.org/github/gokererdogan/Notebooks/blob/master/Reparameterization%20Trick.ipynb
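As a sketch of how this gradient could drive learning (illustrative only, not the lecture's code): estimate ∇µ Eq[z²] = Ep[2(µ + ε)] by Monte Carlo during the forward pass and take gradient-descent steps on µ; since Eq[z²] = µ² + 1 when σ = 1, the loop drives µ towards 0.

import numpy as np

rng = np.random.default_rng(0)
mu, lr = 2.0, 0.05                   # current mean and learning rate

for step in range(200):
    eps = rng.standard_normal(1000)  # eps ~ N(0, 1), generated at the forward pass
    grad = np.mean(2 * (mu + eps))   # Monte-Carlo estimate of grad_mu E_q[z^2]
    mu -= lr * grad                  # gradient-descent update
print(mu)                            # ends up close to 0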
Demo gen_data_using_mean0_sigma1.py
Reparameterization trick
• import numpy as np
• N = 1000
• theta = 2.0
• eps = np.random.randn(N)
• x = theta + eps
• # grad1: estimate of d/d_theta E[x^2] without the reparameterization trick (score-function form)
• grad1 = lambda x: np.sum(np.square(x)*(x-theta)) / x.size
• # grad2: reparameterization-trick estimate, d/d_theta E[(theta+eps)^2] = E[2(theta+eps)]
• grad2 = lambda eps: np.sum(2*(theta + eps)) / x.size
• print(grad1(x))
• print(grad2(eps))
• # 3.86872102149
• # 4.03506045463
• Let us plot the variance for different sample sizes.
• Ns = [10, 100, 1000, 10000, 100000]
• reps = 100
• means1 = np.zeros(len(Ns))
• vars1 = np.zeros(len(Ns))
• means2 = np.zeros(len(Ns))
• vars2 = np.zeros(len(Ns))
• est1 = np.zeros(reps)
• est2 = np.zeros(reps)
• for i, N in enumerate(Ns):
•     for r in range(reps):
•         x = np.random.randn(N) + theta
•         est1[r] = grad1(x)
•         eps = np.random.randn(N)
•         est2[r] = grad2(eps)
•     means1[i] = np.mean(est1)
•     means2[i] = np.mean(est2)
•     vars1[i] = np.var(est1)
•     vars2[i] = np.var(est2)
• print(means1)
• print(means2)
• print(vars1)
• print(vars2)
• # [ 4.10377908 4.07894165 3.97133622 4.00847457 3.99620013]
• # [ 3.95374031 4.0025519 3.99285189 4.00065614 4.00154934]
• # [ 8.63411090e+00 8.90650401e-01 8.94014392e-02 8.95798809e-03 1.09726802e-03]
• # [ 3.70336929e-01 4.60841910e-02 3.59508788e-03 3.94404543e-04 3.97245142e-05]
• %matplotlib inline
• import matplotlib.pyplot as plt
• plt.plot(vars1)
• plt.plot(vars2)
• plt.legend(['no rt', 'rt'])
Ch10. Auto and variational encoders
v230607d
106
Variance of the estimates using reparameterization trick is
one order of magnitude smaller than the estimates from the
first method!
5707_10_auto-encoder.pptx

  • 1. Ch10. Auto-encoders KH Wong Ch10. Auto and variational encoders v230607d 1
  • 2. Two types of autoencoders • Part1 : Vanilla (means traditional or classical) Autoencoder – or simply called Autoencoder • Part 2: Variational Autoencoder Ch10. Auto and variational encoders v230607d 2
  • 3. Part 1: Overview of Vanilla (traditional/classical) Autoencoder • Introduction • Theory • Architecture • Application • Examples Ch10. Auto and variational encoders v230607d 3
  • 4. Introduction • What is auto-decoder? – An unsupervised method • Application – For noise removal – Dimensional reduction • Method – Use noise-free ground truth data (e.g. MNIST)+ self generative noise to train the network – The final network can remove noise of in the input (e.g. hand written characters), the output will be similar to the ground truth data Ch10. Auto and variational encoders v230607d 4
  • 5. Noise removal • https://www.slideshare.net/billlangjun/simple-introduction-to-autoencoder Ch10. Auto and variational encoders v230607d 5 Result: plt.title('Original images: top rows,' 'Corrupted Input: middle rows, ' 'Denoised Input: third rows') Perfect input + noise
  • 6. Auto encoder Structure An autoencoder is a feedforward neural network that learns to predict the input (corrupted by noise) itself in the output. • The input-to-hidden part corresponds to an encoder • The hidden-to-output part corresponds to a decoder. • Input and output are of the same dimension and size. Ch10. Auto and variational encoders v230607d 6 https://towardsdatascience.com/deep-autoencoders-using-tensorflow-c68f075fd1a3 Noisy Input x De-noised Output x‘ encoder decoder Neural network after training x‘ x Z (code)
  • 7. Theory (W=weight, b=bias) Autoencoders are trained to minimize reconstruction errors (such as squared errors), often referred to as the "loss (L)": • By combining (*) and (**) Ch10. Auto and variational encoders v230607d 7  ’ x’ X W b Z W’ b’ (**) ) ' ' ( ' ' (*) ) ( '                b z W x b Wx z x z x   2 2 ) ' ) ( ' ( ' ' ) ' , ( b b Wx W x x x x x L Loss          ' x z x   Encoder decoder Input code output
  • 8. Exercise 1a,b,c • How many input layers, hidden layers, output layers in the figure shown? MC choices: How many • (a) input layer(s)? • (b) hidden layer(s)? • (c) Output layer(s)? • How many neurons in these layers? MC choices: How many neurons in these layers? • (d) input layer? • (e) hidden layers: choices: – 1) 3 – 2) 6 – 3) 8 – 4) 10 • (f) output layer? • (g) Which is true on the number of neurons? – 1) input neurons more than output neurons 2) input neurons same as output neurons – 3) input neurons less than output neurons Ch10. Auto and variational encoders v230607d 8 Input Output
  • 9. Answer : Exercise 1 • How many input layers, hidden layers, output layers in the figure shown? – Answer: input=1, hidden=3, output layer=1 • How many neurons in these layers? – Answer: input(=4), hidden(3,2,3),total=8 (choice 3), output (=4) • What is the relation between the number of input and output neurons? – Answer: same (choice 2) Ch10. Auto and variational encoders v230607d 9 Input Output
  • 10. Architecture • Encoder and decoder • Training can use typical backpropagation methods Ch10. Auto and variational encoders v230607d 10 https://towardsdatascience.com/how-to- reduce-image-noises-by-autoencoder- 65d5e6de543
  • 11. Training • Apply clean MNIST data set + added noise to be used as input, • Use clean MNIST data set as output • Train the autoencoder using backpropagation Ch10. Auto and variational encoders v230607d 11 Added noise Autoencoder training by backpropagation + Clean MINST samples Clean MNIST samples same
  • 12. Recall • After training, autoencoders can be used to remove noise Ch10. Auto and variational encoders v230607d 12 Trained autoencoder Noisy Input De-noised Output
  • 13. Exercise 2a,b: Auto-encoder training • (Q.2a) For (epoch=1;epoch <=max_epoch ; epoch++) – {For all 10,000 images{ • Core code: • Use backpropagation to train the whole autoencoder network (encoder + decoder)} • Break if Loss is too small } • MC question: In core code, choices: 1. Feed each clean image to the input, and Present the clean image to the output 2. Feed each clean image+noise to the output, and Present the clean image to the input 3. Feed each clean image+noise to the input, and Present the clean image to the output • (Q.2b) If the trained encoder receives a noisy image of a handwritten numeral, what do you expect at the output? – MC choice: 1) a denoised image; 2) input + noise – 3) same as input ; 4) pure random noise Ch10. Auto and variational encoders v230607d 13 Noise clean image for numeral “2” auto-encoder Input output
  • 14. Answer: Exercise 2a,b • Answer 2(a): Auto-encoder training • For (epoch=1;epoch <=max_epoch ; epoch++) – {For all 10,000 images{ • Feed each clean image plus noise to the (encoder) input • Present the clean image of the numerical to the output (of the decoder), • Use backpropagation to train the whole autoencoder network (encoder + decoder) • } • Break if Loss is too small – } • Ex.2(b) Autoencoder usage: If the trained encoder receives a noisy image of a handwritten numeral, what do you expect at the output? – Answer 2(b): a denoised image of the realinput numeral image (choice 1 is correct) Ch10. Auto and variational encoders v230607d 14 + Noise clean image for numeral “2” auto-encoder Core code Choice 3 is correct Input Output
  • 15. Sample Code: Part(i): obtain dataset and add noise https://towardsdatascience. com/how-to-reduce-image- noises-by-autoencoder- 65d5e6de543 • #part1 --------------------------------------------------- • np.random.seed(1337) • # MNIST dataset • (x_train, _), (x_test, _) = mnist.load_data() • image_size = x_train.shape[1] • x_train = np.reshape(x_train, [-1, image_size, image_size, 1]) • x_test = np.reshape(x_test, [-1, image_size, image_size, 1]) • x_train = x_train.astype('float32') / 255 • x_test = x_test.astype('float32') / 255 • # Generate corrupted MNIST images by adding noise with normal dist • # centered at 0.5 and std=0.5 • noise = np.random.normal(loc=0.5, scale=0.5, size=x_train.shape) • x_train_noisy = x_train + noise • noise = np.random.normal(loc=0.5, scale=0.5, size=x_test.shape) • x_test_noisy = x_test + noise • x_train_noisy = np.clip(x_train_noisy, 0., 1.) • x_test_noisy = np.clip(x_test_noisy, 0., 1.) Ch10. Auto and variational encoders v230607d 15
  • 16. Part (ii):First build the Encoder Model • #part2 --------------------------------------------------- • # Network parameters • input_shape = (image_size, image_size, 1) • batch_size = 128 • kernel_size = 3 • latent_dim = 16 • # Encoder/Decoder number of CNN layers and filters per layer • layer_filters = [32, 64] • # Build the Autoencoder Model • # First build the Encoder Model • inputs = Input(shape=input_shape, name='encoder_input') • x = inputs • # Stack of Conv2D blocks • # Notes: • # 1) Use Batch Normalization before ReLU on deep networks • # 2) Use MaxPooling2D as alternative to strides>1 • # - faster but not as good as strides>1 • for filters in layer_filters: • x = Conv2D(filters=filters, • kernel_size=kernel_size, • strides=2, • activation='relu', • padding='same')(x) • # Shape info needed to build Decoder Model • shape = K.int_shape(x) • # Generate the latent vector • x = Flatten()(x) • latent = Dense(latent_dim, name='latent_vector')(x) • # Instantiate Encoder Model • encoder = Model(inputs, latent, name='encoder') • encoder.summary() Ch10. Auto and variational encoders v230607d 16
  • 17. Part (iii):Build the Decoder Model • #part3 --------------------------------------------------- • # Build the Decoder Model • latent_inputs = Input(shape=(latent_dim,), name='decoder_input') • x = Dense(shape[1] * shape[2] * shape[3])(latent_inputs) • x = Reshape((shape[1], shape[2], shape[3]))(x) • # Stack of Transposed Conv2D blocks • # Notes: • # 1) Use Batch Normalization before ReLU on deep networks • # 2) Use UpSampling2D as alternative to strides>1 • # - faster but not as good as strides>1 • for filters in layer_filters[::-1]: • x = Conv2DTranspose(filters=filters, • kernel_size=kernel_size, • strides=2, • activation='relu', • padding='same')(x) • x = Conv2DTranspose(filters=1, • kernel_size=kernel_size, • padding='same')(x) • outputs = Activation('sigmoid', name='decoder_output')(x) • # Instantiate Decoder Model • decoder = Model(latent_inputs, outputs, name='decoder') • decoder.summary() • # Autoencoder = Encoder + Decoder • # Instantiate Autoencoder Model • autoencoder = Model(inputs, decoder(encoder(inputs)), name='autoencoder') • autoencoder.summary() • autoencoder.compile(loss='mse', optimizer='adam') Ch10. Auto and variational encoders v230607d 17
  • 18. Part (iv): Train the autoencoder, decode images display result • #part4 --------------------------------------------------- • # Train the autoencoder • autoencoder.fit(x_train_noisy, • x_train, • validation_data=(x_test_noisy, x_test), • epochs=30, • batch_size=batch_size) • # Predict the Autoencoder output from corrupted test images • x_decoded = autoencoder.predict(x_test_noisy) • # Display the 1st 8 corrupted and denoised images • rows, cols = 10, 30 • num = rows * cols • imgs = np.concatenate([x_test[:num], x_test_noisy[:num], x_decoded[:num]]) • imgs = imgs.reshape((rows * 3, cols, image_size, image_size)) • imgs = np.vstack(np.split(imgs, rows, axis=1)) • imgs = imgs.reshape((rows * 3, -1, image_size, image_size)) • imgs = np.vstack([np.hstack(i) for i in imgs]) • imgs = (imgs * 255).astype(np.uint8) • plt.figure() • plt.axis('off') • plt.title('Original images: top rows, ' • 'Corrupted Input: middle rows, ' • 'Denoised Input: third rows') • plt.imshow(imgs, interpolation='none', cmap='gray') • Image.fromarray(imgs).save('corrupted_and_denoised.png') • plt.show() Ch10. Auto and variational encoders v230607d 18
  • 19. Code https://towardsdatascience.com/how-to-reduce-image-noises-by-autoencoder-65d5e6de543 Result: plt.title('Original images: top rows, ' 'Corrupted Input: middle rows, ' 'Denoised Image: third rows') • '''Trains a denoising autoencoder on MNIST dataset. • https://towardsdatascience.com/how-to-reduce-image-noises-by-autoencoder-65d5e6de543 • Denoising is one of theclassic applications of autoencoders. • The denoising process removes unwantednoisethatcorrupted the • truesignal. • Noise+ Data ---> Denoising Autoencoder ---> Data • Given a training dataset of corrupted data as input and • truesignal as output, a denoising autoencoder can recover the • hidden structureto generateclean data. • This example has modular design. The encoder, decoder and autoencoder • are 3 models that shareweights. For example, after training the • autoencoder, theencoder can be used to generate latent vectors • of input data for low-dim visualizationlikePCA or TSNE. • ''' • #keras>> tensorflow.keras, modificationby khw • from __future__ import absolute_import • from __future__ import division • from __future__ import print_function • import tensorflow.keras as keras • from tensorflow.keras.layers import Activation, Dense, Input • from tensorflow.keras.layers import Conv2D, Flatten • from tensorflow.keras.layers import Reshape, Conv2DTranspose • from tensorflow.keras.models importModel • from tensorflow.keras importbackend as K • from tensorflow.keras.datasets import mnist • import numpyas np • import matplotlib.pyplot as plt • from PIL import Image • np.random.seed(1337) • # MNIST dataset • (x_train, _), (x_test, _) = mnist.load_data() • image_size = x_train.shape[1] • x_train = np.reshape(x_train, [-1, image_size, image_size, 1]) • x_test = np.reshape(x_test, [-1, image_size, image_size, 1]) • x_train = x_train.astype('float32') / 255 • x_test = x_test.astype('float32') / 255 • # Generate corrupted MNIST images by adding noisewith normal dist • # centered at 0.5 and std=0.5 • noise= np.random.normal(loc=0.5, scale=0.5, size=x_train.shape) • x_train_noisy =x_train + noise • noise= np.random.normal(loc=0.5, scale=0.5, size=x_test.shape) • x_test_noisy=x_test + noise • x_train_noisy =np.clip(x_train_noisy, 0., 1.) • x_test_noisy=np.clip(x_test_noisy, 0., 1.) 
• # Network parameters • input_shape =(image_size, image_size, 1) • batch_size =128 • kernel_size = 3 • latent_dim = 16 • # Encoder/Decoder number of CNN layers and filters per layer • layer_filters = [32, 64] • # Build theAutoencoder Model • # First build theEncoder Model • inputs =Input(shape=input_shape, name='encoder_input') • x = inputs • # Stack of Conv2Dblocks • # Notes: • # 1) UseBatch Normalization before ReLU on deep networks • # 2) UseMaxPooling2Das alternativeto strides>1 • # - faster but not as good as strides>1 • for filters in layer_filters: • x = Conv2D(filters=filters, • kernel_size=kernel_size, • strides=2, • activation='relu', • padding='same')(x) • # Shapeinfo needed to build Decoder Model • shape= K.int_shape(x) • # Generate thelatent vector • x = Flatten()(x) • latent = Dense(latent_dim, name='latent_vector')(x) • # InstantiateEncoder Model • encoder = Model(inputs, latent, name='encoder') • encoder.summary() • # Build theDecoder Model • latent_inputs =Input(shape=(latent_dim,), name='decoder_input') • x = Dense(shape[1] * shape[2] * shape[3])(latent_inputs) • x = Reshape((shape[1], shape[2], shape[3]))(x) • # Stack of Transposed Conv2Dblocks • # Notes: • # 1) UseBatch Normalization before ReLU on deep networks • # 2) UseUpSampling2Das alternativeto strides>1 • # - faster but not as good as strides>1 • for filters in layer_filters[::-1]: • x = Conv2DTranspose(filters=filters, • kernel_size=kernel_size, • strides=2, • activation='relu', • padding='same')(x) • x = Conv2DTranspose(filters=1, • kernel_size=kernel_size, • padding='same')(x) • outputs=Activation('sigmoid', name='decoder_output')(x) • # InstantiateDecoder Model • decoder = Model(latent_inputs, outputs, name='decoder') • decoder.summary() • # Autoencoder = Encoder + Decoder • # InstantiateAutoencoder Model • autoencoder =Model(inputs, decoder(encoder(inputs)), name='autoencoder') • autoencoder.summary() • autoencoder.compile(loss='mse', optimizer='adam') • # Train theautoencoder • autoencoder.fit(x_train_noisy, • x_train, • validation_data=(x_test_noisy, x_test), • epochs=30, • batch_size=batch_size) • # Predict theAutoencoder outputfrom corruptedtest images • x_decoded = autoencoder.predict(x_test_noisy) • # Display the1st 8 corrupted and denoised images • rows, cols = 10, 30 • num = rows * cols • imgs = np.concatenate([x_test[:num], x_test_noisy[:num], x_decoded[:num]]) • imgs = imgs.reshape((rows *3, cols, image_size, image_size)) • imgs = np.vstack(np.split(imgs, rows, axis=1)) • imgs = imgs.reshape((rows *3, -1, image_size, image_size)) • imgs = np.vstack([np.hstack(i) for i in imgs]) • imgs = (imgs * 255).astype(np.uint8) • plt.figure() • plt.axis('off') • plt.title('Original images: top rows, ' • 'Corrupted Input:middlerows, ' • 'Denoised Input: third rows') • plt.imshow(imgs, interpolation='none', cmap='gray') • Image.fromarray(imgs).save('corrupted_and_denoised.png') • plt.show() Ch10. Auto and variational encoders v230607d 19
  • 20. Exercise 3 • Discuss applications of a Vanilla (traditional) autoencoder. • Which of the following is true? MC choices: 1) Image recognition 2) Denoise input images + Image recognition 3) Denoise input images +Dimensionality Reduction 4) Denoise input images only Ch10. Auto and variational encoders v230607d 20
  • 21. Answer: Exercise 3 • Discuss applications of a Vanilla (traditional) autoencoder. • Which of the following is true? MC choices: 1) Image recognition 2) Denoise input images + Image recognition 3) Denoise input images +Dimensionality Reduction (correct) 4) Denoise input images only • More information, see https://en.wikipedia.org/wiki/Autoencoder – Dimensionality Reduction – Relationship with principal component analysis (PCA) – Information Retrieval – Anomaly Detection – Image Processing – Drug discovery Ch10. Auto and variational encoders v230607d 21
  • 22. Part 2: Variational autoencoder Will learn • Learn what is Variational autoencoder • How to train it? • How to use it? Ch10. Auto and variational encoders v230607d 22
  • 23. Some math background is needed: • https://ljvmiranda921.github.io/notebook/20 17/08/13/softmax-and-the-negative-log- likelihood/ • See appendix2: The expected negative log likelihood • Conditional expectation etc. Ch10. Auto and variational encoders v230607d 23
  • 24. Variational Autoencoder (VAE) v.s. Traditional Autoencoder • Autoencoders (vanilla or traditional) – During training you present a pattern with artificial added noise to the encoder, and feed the same input pattern (as target, or teacher) to the output. Then, use backpropagation to train the Autoencoder network. – So, it is unsupervised learning (no label data is needed). – It can be used for data compression and noise removal. – During recall, when a noisy pattern is presented to the input, a de- noise image will appear at the output. • Variational autoencoders – Instead of learning from an input pattern, Variational autoencoders learn the parameters of a probability distribution function from the input patterns. We then use the parameters learned to generate new data. So, it is a generative model like GAN (Generative Adversarial Network) in functionality. Ch10. Auto and variational encoders v230607d 24
  • 25. Variational autoencoder https://jaan.io/what-is-variational-autoencoder-vae-tutorial/ • Variational autoencoders are cool. They let us design complex generative models of data and fit them to large datasets. They can generate images of fictional celebrity faces and high-resolution digital artwork. • VAE faces • VAE faces demo • VAE MNIST • VAE street addresses • https://jaan.io/what-is-variational- autoencoder-vae-tutorial/ • May be or similar to that used in software such as Deepfake (https://en.wikipedia.org/wiki/Deepfake) FICTIONAL CELEBRITY FACES GENERATED BY A VARIATIONAL AUTOENCODER (BY ALEC RADFORD). Ch10. Auto and variational encoders v230607d 25
  • 26. Example: Applying VAE for MNIST data set extension • Ch10. Auto and variational encoders v230607d 26 https://arxiv.org/pdf/1312.6114.pdf Output: generated image Dataset (images extended) Input: original image data set
  • 27. Some background: Univariate and Multivariate Gaussian • https://ttic.uchicago.edu/~shubhendu/Slides/Estimation.pdf Ch10. Auto and variational encoders v230607d 27                   2 2 2 / 1 2 univariate 2 2 1 exp 2 1 ) ( dimension - 1 variance mean, , _ Gaussian Univariate      x x N sample data x                           x x x N d co sample data x T d 1 2 / 1 2 / te multivaria 2 1 exp 2 1 ) ( dimension - variance mean, , _ Gaussian, te Multivaria
  • 28. Properties of Gaussian (Normal) distribution • Standard Normal distribution (1-dimension): • Red line, when mean()=0, Sigma ()=1 – At (x-)=0,  =1 – G(x) =1/sqrt(2*pi)=0.3989 • At x=1*, drops off to – (1/sqrt(2*pi))*exp(-1^1/2)=0.2420 – Area covered 68.2% • At x=2*, drops off to – (1/sqrt(2*pi))*exp(-2^2/2)= 0.0540 – Area covered 95.44% • At x=3*, drops off to – (1/sqrt(2*pi))*exp(-2^2/2)= ?? (exercise) – Area covered 99.73% http://en.wikipedia.org/wiki/Normal_distribution Probability density function               1 ) ( 2 1 mean variance, deviation, standard Gaussian D 1 2 2 2 2 2 dx x G e πσ G(x) σ μ x    Standard Normal distribution Area covered (total= 100%) G G Ch10. Auto and variational encoders v230607d 28  sets the horizontal shift  Controls the shape So called 95% confident value µ(+/-)2
  • 29. Gaussian (Normal) functions 1D,2D • 2 2 / 1      2 2 2 2 2 2 1 G(x)G(y) y) G(x, Gaussian D 2     y x y x e          2 2 2 2 2 1 G(x) mean deviation, standard Gaussian D 1            x e G(x) x y x  x y 1-D Gaussian 2-D Gaussian 2 2 / 1  Ch10. Auto and variational encoders v230607d 29
  • 30. Example : A 1-D and 2-D Gaussian distribution • %2-D Gaussian distribution P(xj) • %matlab code---------- • clear, N=10 • [X1,X2]=meshgrid(-N:N,-N:N); • sigma =2.5;mean=[3 3]' • G=1/(2*pi*sigma^2)* • (exp(-((X1-mean(1)).^2+(X2-mean(2)).^2)) /(2*sigma^2)); • G=G./sum(G(:)) %normalise it • 'sigma is ', sigma • 'sum(G(:)) is ',sum(G(:)) • 'max(max(G(:))) is',max(max(G(:))) • figure(1), clf • surf(X1,X2,G); • xlabel('x1'),ylabel('x2') Ch10. Auto and variational encoders v230607d 30                    2 0 2 0 2 / 1 2 0 2 0 0 2 1 exp 2 1 ) ( variance mean, , _ , Gaussian 1      j j j x x N sample a x D                2 0 2 2 2 1 2 0 2 1 2 exp 2 1 ) , ( 0 mean assume Gaussian symmetric) (circular isotropic an 2   x x x x N D
  • 31. Exercise 4 • In Box 1, sigma ()=2 • x=mx y=my • Mc choices: 1) G(x,y)=1/(2*pi*2+2) 2) G(x,y)=1/(2*pi*2) 3) G(x,y)=1/(2*pi*2^4) 4) G(x,y)=1/(2*pi*2^2) • Student exercise: • Fill in the blanks of this Gaussian mask of size 9x9 , sigma ()=2 • Sketch the function • G(x,y)= • 0.0007 0.0017 0.0033 0.0048 0.0054 0.0048 0.0033 0.0017 0.0007 • 0.0017 0.0042 0.0078 0.0114 0.0129 0.0114 0.0078 0.0042 0.0017 • 0.0033 0.0078 0.0146 0.0213 0.0241 0.0213 0.0146 0.0078 0.0033 • 0.0048 0.0114 0.0213 0.0310 0.0351 0.0310 0.0213 0.0114 0.0048 • 0.0054 0.0129 0.0241 0.0351 BOX1 ? ____? 0.0241 0.0129 0.0054 • 0.0048 0.0114 0.0213 0.0310 0.0351 ____? 0.0213 0.0114 0.0048 • 0.0033 0.0078 0.0146 0.0213 0.0241 0.0213 0.0146 0.0078 0.0033 • 0.0017 0.0042 0.0078 0.0114 0.0129 0.0114 0.0078 0.0042 0.0017 • 0.0007 0.0017 0.0033 0.0048 0.0054 0.0048 0.0033 0.0017 0.0007 Ch10. Auto and variational encoders v230607d 31     2 2 2 2 2 2 1 G(x)G(y) y) G(x, mean Gaussian, D 2   y x m y m x y x e ) ,m (m        x=mx y=my x=1+mx y=my Box1
  • 32. Answer: Exercise 4 Fill in the blanks Gaussian mask of size the 9x9 , sigma ()=2 • 0.0007 0.0017 0.0033 0.0048 0.0054 0.0048 0.0033 0.0017 0.0007 • 0.0017 0.0042 0.0078 0.0114 0.0129 0.0114 0.0078 0.0042 0.0017 • 0.0033 0.0078 0.0146 0.0213 0.0241 0.0213 0.0146 0.0078 0.0033 • 0.0048 0.0114 0.0213 0.0310 0.0351 0.0310 0.0213 0.0114 0.0048 • 0.0054 0.0129 0.0241 0.0351 0.0398 0.0351 0.0241 0.0129 0.0054 • 0.0048 0.0114 0.0213 0.0310 0.0351 0.0310 0.0213 0.0114 0.0048 • 0.0033 0.0078 0.0146 0.0213 0.0241 0.0213 0.0146 0.0078 0.0033 • 0.0017 0.0042 0.0078 0.0114 0.0129 0.0114 0.0078 0.0042 0.0017 • 0.0007 0.0017 0.0033 0.0048 0.0054 0.0048 0.0033 0.0017 0.0007 Ch10. Auto and variational encoders v230607d 32 clear %matlab sigma=2 % in matlab , no -ve index for looping, so shift center to (5,5) mean_x=5 , mean_y=5 for y=1:9 for x=1:9 g(x,y)=(1/(2*pi*sigma^2))*exp(-((x- mean_x)^2+(y-mean_y)^2) /(2*sigma^2)) end end mesh(g) title('2D Gaussian function') 1/(2*pi*2^2): choice 4 is correct, because x=mx, y=my. ,thus 𝑒 − 𝑥−𝑚𝑥 2+ 𝑦−𝑚𝑦 2 2𝜎2 =1 1/(2*pi*2^2)*exp(- 1/8) 1/(2*pi*2^2)*exp (-2/8) Box 1 x=mx y=my x=1+mx y=my 2 − D Gaussian, mean (𝑚𝑥, 𝑚𝑦) G(x,y) = G(x)G(y) = 1 2𝜋𝜎2 𝑒 − 𝑥−𝑚𝑥 2+ 𝑦−𝑚𝑦 2 2𝜎2
  • 33. Variational autoencoder • A neural network view Ch10. Auto and variational encoders v230607d 33 https://www.jeremyjordan.me/variational-autoencoders/ Multivariate Gaussian: Mean = µ  = standard dedication Variance = 2
  • 34. Generative Models concept • It is an unsupervised learning method that generates new samples by using training data from the same distribution • E.g., You have limited number of samples but want to create more samples of the same probability distributions to be used in machine learning purposes. Others include: – Creating new cartoon figures – Generating faces from images of celebrities. – Creating new fashions. – Creating new written characters for training optical character recognition systems of some languages • Generative model algorithms – Variational autoencoder (discussed here) – Generative adversarial network (GAN) not discussed here Ch10. Auto and variational encoders v230607d 34
  • 35. Variational autoencoder for generative models • Use training samples to train hidden data (parameters of multi-variate Gaussian standard deviations=s, means = µs ). After training you may create new output from some input and weighted s and µs . You may change the weights of s and µs for a variety of related different outputs. Ch10. Auto and variational encoders v230607d 35 https://www.quora.com/Whats-the-difference-between-a-Variational-Autoencoder-VAE-and-an-Autoencoder parameters of multi-variate Gaussian standard deviations= s, means= µs ) E.g. 50µs, 30s
  • 36. Application example: Use Generative Models for MNIST data extension http://yann.lecun.com/exdb/mnist/ • Ch10. Auto and variational encoders v230607d 36 During training , patterns are fed into input and output one by one, learn µ, by minimize loss After training, data generation phase Generated extended data set MNIST original data set Random generator layer using 30µs, 30s z
  • 37. Exercise 5:What is the architectural difference between Vanilla (traditional) autoencoder and Variational autoencoder? • MC: Which is incorrect? 1) In Vanilla (traditional) autoencoder: input to output are directly connected by neurons and weights. 2) In Variational autoencoder: The encoder turns input (x) into means (µs) and standard deviations (s) of a multivariate Gaussian distribution, then use a random sampling method to create the output. 3) In Variational autoencoder : input to output are directly connected by neurons and weights. 4) In Variational autoencoder: The number of mean (µs) and standard deviations (s) neurons are the same. Ch10. Auto and variational encoders v230607d 37 Vanilla autoencoder E.g. 30µs, 30s z
  • 38. Answer Exercise 5:What is the architectural difference between Vanilla (traditional) autoencoder and Variational autoencoder? • MC: Which is incorrect? 1) In Vanilla (traditional) autoencoder: input to output are directly connected by neurons and weights. 2) In Variational autoencoder: The encoder turns input (x) into means (µs) and standard deviations (s) of a multivariate Gaussian distribution, then use a random sampling method to create the output. 3) In Variational autoencoder : input to output are directly connected by neurons and weights. (This is incorrect) 4) In Variational autoencoder: The number of mean (µs) and standard deviations (s) neurons are the same. Ch10. Auto and variational encoders v230607d 38 Vanilla autoencoder E.g. 30µs, 30s z
  • 39. Exercise 6a,b for Variational autoencoder VAE • Which statement is incorrect for VAE?: MC choices: 1) Because the search space is large, there are too many combinations of means (µs) and standard deviations (s) for generating the same output. 2) There are multiple solutions for means (µs) and standard deviations (s) 3) There is a deterministic linear solution for VAE 4) Neural network provides a solution for VAE. • (b) Discuss exercise for students: what is a multivariate-Gaussian distribution. Ch10. Auto and variational encoders v230607d 39 form https://en.wikipedia.org/wiki/Multiv ariate_normal_distribution of 2 dimensions
  • 40. Answer: Exercise 6a,b for Variational autoencoder VAE • Which statement is incorrect for VAE?: MC choices: (choice3)There is a deterministic linear solution for VAE (this is incorrect) • (b) Discuss exercise for students: what is a multivariate-Gaussian distribution. • Answer: Multivariate-dimensional Gaussian: • In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions. One definition is that a random vector is said to be k- variate normally distributed if every linear combination of its k components has a univariate normal distribution. Ch10. Auto and variational encoders v230607d 40 form https://en.wikipedia.org/wiki/Multiv ariate_normal_distribution of 2 dimensions
  • 41. Example of variational autoencoder • Neural network Ch10. Auto and variational encoders v230607d 41 https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf By random sampling Random generator layer Z X̂ X
  • 42. Training of Vanilla and Variational Autoencoders • Training of variational autoencoders is like training the vanilla autoencoders. E.g., for the de-noised application, presents noisy images to the input and clean image versions to the output. Use backpropagation to train the network. Read our previous discussion on vanilla autoencoder https://www.edureka.co/blog/autoencoders-tutorial/ http://www.math.purdue.edu/~buzzard/MA598-Spring2019/Lectures/Lec18%20-%20VAE.pptx Ch10. Auto and variational encoders v230607d 42
  • 43. Variational Autoencoder (VAE) • The latent variables, Z, are drawn from a probability distribution depending on the input, X, and the reconstruction is chosen probabilistically from z. • That means after you obtained mean=µ,variance 2, sample from X (n=500 neurons) to get Z (k=30 neurons) • X=(x1,x2,…………,xn) • Z=(z1,z2,…,zk) https://jaan.io/what-is-variational-autoencoder-vae-tutorial/ Ch10. Auto and variational encoders v230607d 43 https://jaan.io/what-is-variational-autoencoder-vae-tutorial/ Z Encoder Q (z|X) Decoder P (X|z) Z=Latent Variables By sampling Z=Sample from a distribution N(µ,) X X̂
  • 44. Three difficult concepts in VAE 1) Train the neural network to maximize input/output likelihood 2) Use of Divergence (DKL) 3) Reparameterization Ch10. Auto and variational encoders v230607d 44
  • 45. Variational Autoencoders VAE Concept 1 Train the neural network to maximize input/output likelihood Ch10. Auto and variational encoders v230607d 45 Tutorial on Variational Autoencoders Carl Doersch https://arxiv.org/abs/1606.05908
  • 46. VAE Encoder • The Encoder q(en)(z|x) takes input x and returns Hidden parameters Z (random generated from µ,). (=encoder parameters. weights/biases) • From Z, use sampling to create input to the decoder • Encoders and Decoders are neural networks (NN) • Parameters in the NN are needed to be learned – so we have to set up a loss function. https://jaan.io/what-is-variational-autoencoder-vae-tutorial/ http://gregorygundersen.com/blog/2018/04/29/reparameterization/ Ch10. Auto and variational encoders v230607d 46 https://jaan.io/what-is-variational-autoencoder-vae-tutorial/ Encoder(XZ) q(en)(z|x) Input Data Decoder(Z ) Hidden Z Output ted Reconstruc X X̂   Z X P de | ˆ ) (  X-> encoder –>Z->decoder x^ X̂  
  • 47. VAE Decoder • The decoder takes hidden variable Z (gen. from means and standard deviations) as input, and reconstructs the image using random sampling methods. ( =decoder parameters weights/biases) • Encoders and Decoders are Neural Networks (NN) • Parameters ( ,) in the NN are needed to be learned – so we have to set up a loss function. https://jaan.io/what-is-variational-autoencoder-vae-tutorial/ Ch10. Auto and variational encoders v230607d 47 https://jaan.io/what-is-variational-autoencoder-vae-tutorial/   Z X P de | ˆ ) (  Encoder(XZ) q(en)(z|x) Input Data Decoder(Z ) Hidden Z Output ted Reconstruc X X̂   Z X P de | ˆ ) (  X̂
  • 48. The reconstruction loss =(l(rec) )= “expected negative log-likelihood” of VAE • Given xi X, zQ, E() is expected value • The idea is to train the Encoder/Decoder (Neural Network) to maximum the likelihood (or minimize binary_cross_entropy (BCE) or Mean squared error (MSE) between x and reconstructed • To maximize likelihood, we minimize the reconstruction loss=“expected negative log-likelihood” (li ) of the i-th datapoint xi. (see appendix 2) Ch10. Auto and variational encoders v230607d 48         z x P E E x l i de Q z X x i rec i i | ˆ log | , ) ( ) (        Encoder q(en)(z|xi) Decoder Hidden Z (µ,) i x data Input   minimized be to , function loss tion Reconstruc ) (   rec i l i x̂ output ted Reconstruc i x BCE or MSE   z x P i de | ˆ ) (  X xi ˆ ˆ 
  • 49. Variational Autoencoders VAE Concept 2 Use of Divergence (DKL): Similar training images should produce similar hidden data (means and standard deviations) Ch10. Auto and variational encoders v230607d 49 http://mi.eng.cam.ac.uk/~mjfg/local/4F10/lect4.pdf https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence https://jhui.github.io/2017/03/06/Variational-autoencoders/ (for relating covariance and standard deviations, with good example)
  • 50. How to make sure the neural networks produce similar hidden data (means & standard deviations) from similar training images • Problem: Input that we regard as similar may end up very different in z space (hidden, means and standard deviations). That means some solutions may give small loss li (all)(,  ), even q(en) and p(de) are of very different distributions. • Solution: Use p(z)=N(0,1), try to force q(en)(z|xi) (a neural network) to act similarly to a standard normal probability density function. We can use Kullback-Leibler divergence (DKL) to do the checking. Ch10. Auto and variational encoders v230607d 50 For encoder and decoder We discussed this in concept 1: https://jaan.io/what-is-variational-autoencoder-vae-tutorial/ https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence http://gregorygundersen.com/blog/2018/04/29/reparameterization/ This is for concept 2: We will minimize (L(all) )     Gaussian and between difference ˆ output and input between loss | , ) ( 1 ) ( en n i i i i i all Q x x x L                  , ) (rec i l       I N x z q D i en KL , 0 || | ) ( 
  • 51. Math background: Kullback–Leibler divergence (also known as relative entropy) measures how one probability distribution is different from another one -- reference probability distribution over the same variable X. • Ch10. Auto and variational encoders v230607d 51 Tutorial on Variational Autoencoders by Carl Doersch & https://arxiv.org/abs/1606.05908                                                                                                       X X X I X tr I N X X N D X X N x z Q I N x z Q D X X X I X tr I N X X N D I N N X X N N I I tr N N D T KL i i T KL T T KL 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 1 2 2 2 1 1 2 2 2 1 2 1 2 2 2 2 2 2 1 1 det log 2 1 , 0 || , , | , 0 || | For det log 2 1 , 0 || , , 0 , also ; , , If ) ( det det log * 2 1 , || ,                                                                               For equation (I) See https://arxiv.org/pdf/1907.08956.pdf https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_diver gence Kullback–Leibler divergence DKL (D1|| D2)=0 indicates the two distributions D1,D2 are identical ℎ𝑒𝑛𝑐𝑒 , µ2 = 0, 2 2=1 N(0,I)=Zero_mean, variance=1 Gaussian
  • 52. Training: combining Concepts 1 and 2 to minimize the loss $l_i(X)$, with $X=\{x_1,x_2,\dots,x_N\}$ and $E()$ = expected value. For the whole $X$, the average loss is built as follows (Concept 1, the reconstruction term):
– Input to the encoder: $x_i \in X$; output of the decoder: $\hat{x}_i \in \hat{X}$.
– $Q(z|x_i)$ = prob. distribution of $z$ generated by $x_i$ (encoder side); $P(\hat{x}_i|z)$ = prob. distribution of $\hat{x}_i$ generated by $z$ (decoder side); $P(z)$ = prob. distribution of the latent (hidden) variable.
– $E_{z\in Q}[\log P(\hat{x}_i|z)]$ = expected value of the log-likelihood of $\hat{x}_i$ generated at the decoder output; $E_{x_i\in X}\big[E_{z\in Q}[\log P(\hat{x}_i|z)]\big]$ = the same, averaged over the input data.
– $\varepsilon$ = random variable generated by a Gaussian function with mean 0 and stdev 1, i.e. $\varepsilon \sim N(0,I)$. At this stage $z$ could have any distribution, but we can assume it is Gaussian, $N(\mu_{z|x_i},\sigma_{z|x_i})$; it can be formed by scaling $\varepsilon \sim N(0,I)$ (see en.wikipedia.org/wiki/Normal_distribution). The advantage: once the encoder has found $(\mu_{z|x_i},\sigma_{z|x_i})$, we use the random $\varepsilon$ to generate $z$, and the decoder uses $z$ to produce $\hat{x}_i$.
– In practice we use $\log P\big(\hat{x}_i \,|\, z=\mu_{z|x_i}(x_i)+\sigma_{z|x_i}(x_i)\ast\varepsilon\big)$.
– We want to maximize $E_{x_i\in X}\big[E_{z\in Q}[\log P(\hat{x}_i|z)]\big]$ (make the input and output likelihoods similar), which is the same as minimizing the negative log-likelihood:
Objective_function1 $= -E_{x_i\in X}\Big[E_{z\in Q}\big[\log P\big(\hat{x}_i \,|\, z=\mu_{z|x_i}(x_i)+\sigma_{z|x_i}(x_i)\ast\varepsilon\big)\big]\Big]$
– Since $P$ is Gaussian, minimizing Objective_function1 amounts to minimizing
Objective_function1 $= \frac{1}{N}\sum_{x_i\in X}\frac{1}{2\sigma^2_{\hat{x}_i|z}}\big(x_i-\mu_{\hat{x}_i|z}\big)^2$
Concept 1. See http://bjlkeng.github.io/posts/variational-autoencoders/ & https://arxiv.org/abs/1312.6114 Ch10. Auto and variational encoders v230607d 52
  • 53. Training: combining Concepts 1 and 2 to minimize the loss $l_i(X)$, with $X=\{x_1,x_2,\dots,x_N\}$, $E()$ = expected value (Concept 2, the divergence term, and the overall objective):
– Recall $q_{\theta(en)}(z|x_i)$ = prob. distribution of $z$ generated by $x_i$ (encoder side). We mentioned earlier that we want $q_{\theta(en)}(z|x_i)$ to be close to a Gaussian, so put $P(z)=N(0,I)$.
– $D_{KL}\big[q_{\theta(en)}(z|x_i) \,\|\, N(0,I)\big]$ = difference between $q_{\theta(en)}(z|x_i)$ and a Gaussian (see the previous slides).
– Objective_function2 $= D_{KL}\big[q_{\theta(en)}(z|x_i) \,\|\, N(0,I)\big]$, which is to be minimized.
– Overall objective function = Objective_function1 + Objective_function2
$= \frac{1}{N}\sum_{x_i\in X}\frac{1}{2\sigma^2_{\hat{x}_i|z}}\big(x_i-\mu_{\hat{x}_i|z}\big)^2 + D_{KL}\big[q_{\theta(en)}(z|x_i) \,\|\, N(0,I)\big]$
– We have shown earlier that $D_{KL}\big[q_{\theta(en)}(z|x_i) \,\|\, N(0,I)\big] = \tfrac{1}{2}\big\{ tr\big(\sigma^2(X)-I\big)+\mu(X)^T\mu(X)-\log\det\sigma^2(X) \big\}$, thus
$L^{(all)} = \frac{1}{N}\sum_{x_i\in X}\frac{1}{2\sigma^2_{\hat{x}_i|z}}\big(x_i-\mu_{\hat{x}_i|z}\big)^2 + \tfrac{1}{2}\big\{ tr\big(\sigma^2(X)-I\big)+\mu(X)^T\mu(X)-\log\det\sigma^2(X) \big\}$
– The first term is Concept 1 (the reconstruction loss $l_i^{(rec)}(\theta,\phi)$); the second term is Concept 2. We will run an iterative algorithm to minimize $L^{(all)}$.
See http://bjlkeng.github.io/posts/variational-autoencoders/ & https://arxiv.org/abs/1312.6114 Ch10. Auto and variational encoders v230607d 53
  • 54. For VAE implementation
– Input $X=(x_1,x_2,\dots,x_n)$.
– Using the encoder, from $X$ we obtain $k$ Gaussian distributions $N(\mu_j,\sigma_j)$, one per latent dimension.
– Each $z_j$ is generated by $N(\mu_j,\sigma_j)$, where $j=1,\dots,k$; then we have $Z=(z_1,z_2,\dots,z_k)$.
– From the previous slide, $D_{KL}\big[q_{\theta(en)}(z|x_i) \,\|\, N(0,I)\big]$ is to be minimized. For the VAE application (diagonal covariance), it becomes (Concept 2):
$D_{KL}\big[ N\big((\mu_1,\dots,\mu_k)^T, \mathrm{diag}(\sigma_1^2,\dots,\sigma_k^2)\big) \,\|\, N(0,I) \big] = \tfrac{1}{2}\sum_{j=1}^{k}\big(\sigma_j^2+\mu_j^2-1-\ln\sigma_j^2\big)$
See https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence https://wiseodd.github.io/techblog/2016/12/10/variational-autoencoder/ Ch10. Auto and variational encoders v230607d 54
  • 55. In practice
– We replace $\sigma_j^2$ with $\exp(\sigma_j^2)$ and $\ln(\sigma_j^2)$ with $\sigma_j^2$ to enable stability in the numerical calculation; in other words, the encoder output is interpreted as the log-variance. For the minimization of $D_{KL}$, this replacement gives the same result.
– We have seen earlier: $D_{KL}\big[q_{\theta(en)}(z|x_i) \,\|\, N(0,I)\big] = D_{KL}\big[N\big((\mu_1,\dots,\mu_k)^T,\mathrm{diag}(\sigma_1^2,\dots,\sigma_k^2)\big) \,\|\, N(0,I)\big] = \tfrac{1}{2}\sum_{j=1}^{k}\big(\sigma_j^2+\mu_j^2-1-\ln\sigma_j^2\big)$
– With the replacement, it becomes
$D_{KL}\big[N\big((\mu_1,\dots,\mu_k)^T,\mathrm{diag}(\sigma_1^2,\dots,\sigma_k^2)\big) \,\|\, N(0,I)\big] = \tfrac{1}{2}\sum_{j=1}^{k}\big(\exp(\sigma_j^2)+\mu_j^2-1-\sigma_j^2\big)$
– This is the actual function we will use during minimization.
Ch10. Auto and variational encoders v230607d 55
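A minimal numpy sketch of this practical form (an illustration, assuming the encoder's second output head is interpreted as the log-variance, which is what the substitution above amounts to; the numbers are made up):

import numpy as np

mu = np.array([0.3, -1.2])        # encoder output: means (k = 2 latent dimensions)
log_var = np.array([-0.5, 0.1])   # encoder output: log-variances (the "sigma^2" after the substitution)

# D_KL[ N(mu, diag(exp(log_var))) || N(0, I) ] = 0.5 * sum( exp(log_var) + mu^2 - 1 - log_var )
d_kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
print(d_kl)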
  • 56. Use neural networks to implement the system
– Encoder neural network and decoder neural network; input data $X$, reconstructed output $\hat{X}$.
– Use backpropagation to minimize the decoder's reconstruction loss (Concept 1): binary cross-entropy (BCE) or mean squared error (MSE) between input $X$ and output $\hat{X}$.
– Use backpropagation to minimize the loss function $L^{(all)}$ of the encoder (Concepts 1 & 2):
Minimize loss $L^{(all)} = \frac{1}{N}\sum_{x_i\in X}\frac{1}{2\sigma^2_{\hat{x}_i|z}}\big(x_i-\mu_{\hat{x}_i|z}\big)^2 + D_{KL}\big[q_{\theta(en)}(z|x_i) \,\|\, N(0,I)\big]$
(first term: Concept 1; second term: Concept 2)
Ch10. Auto and variational encoders v230607d 56
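Putting the two concepts together, a minimal numpy sketch of the total loss for one batch (array names and shapes are illustrative assumptions, not the lecture's code; MSE is used for the reconstruction term):

import numpy as np

def vae_loss(x, x_hat, mu, log_var):
    # Concept 1: reconstruction loss (here MSE between input and reconstruction)
    rec = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    # Concept 2: D_KL[ q(z|x) || N(0, I) ] for a diagonal Gaussian encoder
    kl = np.mean(0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=1))
    return rec + kl

# Illustrative shapes: batch of 4, input dimension 784, latent dimension 2
x = np.random.rand(4, 784)
x_hat = np.random.rand(4, 784)
mu = np.random.randn(4, 2)
log_var = np.random.randn(4, 2)
print(vae_loss(x, x_hat, mu, log_var))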
  • 57. The training method
– The latent vector represents Gaussian distributions ($\mu$, $\sigma$) of $z$.
– Input $X$ and output $\hat{X}$ should be similar.
– Minimize the loss $L^{(all)}$ using Concepts 1 & 2.
http://anotherdatum.com/vae.html Ch10. Auto and variational encoders v230607d 57
  • 58. Variational Autoencoders VAE Concept 3 Reparameterization: the method to enable backpropagation for training neural network that involves random processes Ch10. Auto and variational encoders v230607d 58
  • 59. VAE generative model
– In theory, we can sample $z_i$ from $N(\mu_i,\sigma_i)$ produced by the encoder. Note: $N()$ = Gaussian function.
– $Z$ is the input to the decoder, which produces the output.
– Alternatively, we find $z$ by sampling $\varepsilon$ (called epsilon or eps) from $N(0,1)$ (Gaussian with mean 0, StdDev 1), then compute $z_i=\mu_i+\varepsilon\ast\sigma_i$.
– Then $z_i$ has mean $=\mu_i$ and StdDev $=\sigma_i$, as required.
– See gen_data_using_mean0_sigma1.m in the appendix.
– This is called reparameterization. Reason: with this form we can back-propagate through this function during training.
Ch10. Auto and variational encoders v230607d 59
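A minimal numpy sketch of this sampling step (the mean and standard deviation values are made up); the printed statistics confirm that the reparameterized samples have approximately the requested mean and standard deviation:

import numpy as np

mu, sigma = 2.0, 0.5              # (mu_i, sigma_i) as produced by the encoder (illustrative values)
eps = np.random.randn(100000)     # eps ~ N(0, 1)
z = mu + sigma * eps              # reparameterization: z ~ N(mu, sigma^2)

print(z.mean(), z.std())          # approximately 2.0 and 0.5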
  • 60. Train the variational-encoder • How to train the auto-encoder neural network? • Difficulty – Since a random process is involved, backpropagation cannot be executed • Solution – Use of the re- parameterization trick Ch10. Auto and variational encoders v230607d 60 Generate z by random sampling
  • 61. Training : an example • example Ch10. Auto and variational encoders v230607d 61 https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf Random generator layer Z X̂ X
  • 62. Learning algorithm: the probability function (left-side diagram) cannot be back-propagated, therefore the reparameterization trick (right-side diagram) should be applied.
– Figure 3: an initial attempt at a variational autoencoder without the "reparameterization trick". Objective functions shown in red. We cannot back-propagate through the stochastic sampling operation because it is not a continuous deterministic function. Problem: cannot backpropagate.
– Figure 4: a variational autoencoder with the "reparameterization trick". Notice that all operations between the inputs and objectives are continuous deterministic functions, allowing back-propagation to occur. Solution: reparameterization trick, with a random generator layer producing $\varepsilon$.
– $Q(z|x)=N(\mu_{z|X},\sigma_{z|X})$ should be close to $N(0,I)$; $P(X|z)$ is the decoder. We also want the output to be similar to the input. StdDev $=\sigma$.
http://bjlkeng.github.io/posts/variational-autoencoders/ Ch10. Auto and variational encoders v230607d 62
  • 63. Intuition of the reparameterization trick
– The encoder uses random sampling to generate $z$.
– Backpropagation (during training) is not possible through the random sampling process.
– Reparameterization produces the same effect for the encoder.
– Backpropagation (during training) becomes possible because no random process lies on the backpropagation path.
– (Figure: encoder, path by random sampling, backpropagation path, $\varepsilon$.)
Ch10. Auto and variational encoders v230607d 63
  • 64. Reparameterization: $Z$ can be produced by a scaled $N(0,I)$
– Reparameterization generates any Gaussian distribution with known mean ($\mu_x$) and standard deviation ($\sigma_x$) using the equation $Z=\mu_x+\sigma_x\varepsilon$, where the variable $\varepsilon$ is generated by $N(0,1)$.
– After the forward pass, $\varepsilon$ has already been generated, so it is no longer random; it is data to be used in backpropagation during training.
– $N(0,1)$ = Gaussian with mean 0 and standard deviation 1; $\varepsilon$ = the variable generated from $N(0,1)$; $\mu_x$ = mean; $\sigma_x$ = standard deviation; $Z=\mu_x+\sigma_x\varepsilon$.
– (Figure: input data feeds the mean and standard deviation; $N(0,1)$ produces the random variable $\varepsilon$.)
Ch10. Auto and variational encoders v230607d 64
  • 66. Summary for reparameterization
– $\varepsilon$ = the variable obtained by sampling $N(0,1)$
– $\mu_x$ = mean
– $\sigma_x$ = standard deviation
– $z=\mu_x+\sigma_x\varepsilon$; this equation is deterministic, so it can be backpropagated through.
– See the code in https://learnopencv.com/variational-autoencoder-in-tensorflow/
Ch10. Auto and variational encoders v230607d 66
  • 67. Exercise 7
– In the reparameterization of the variational autoencoder method shown below, $\varepsilon=0.35$ is a value randomly sampled from the normal distribution with mean 0 and standard deviation 1. If the output of the encoder network has $\mu_{z|x}$ = mean = 0.3 and $\sigma_{z|x}$ = standard deviation = 0.8, find the value of $z$.
– MC choices: 1) 0.50 2) 0.54 3) 0.56 4) 0.58
Ch10. Auto and variational encoders v230607d 67
  • 68. Answer: Exercise 7
– In the reparameterization of the variational autoencoder method shown below, $\varepsilon=0.35$ is a value randomly sampled from the normal distribution with mean 0 and standard deviation 1. If the output of the encoder network has $\mu_{z|x}$ = mean = 0.3 and $\sigma_{z|x}$ = standard deviation = 0.8, find the value of $z$.
– MC choices: 1) 0.50 2) 0.54 3) 0.56 4) 0.58 (correct)
– Answer: $z=\mu+\varepsilon\ast\sigma_{z|x}$; here $\varepsilon=0.35$, $\mu=0.3$, standard deviation $\sigma_{z|x}=0.8$, so $z=0.3+0.35\times 0.8=0.58$.
Ch10. Auto and variational encoders v230607d 68
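The arithmetic can be checked with a two-line computation (values taken from the exercise):

mu, sigma, eps = 0.3, 0.8, 0.35
print(mu + eps * sigma)   # 0.58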
  • 69. Exercise 8
– Discussion exercise: why is reparameterization needed?
Ch10. Auto and variational encoders v230607d 69
  • 70. Answer: Exercise 8. Discuss why reparameterization is needed.
– Answer: $Z$ is generated by a random process from mean $\mu_x$ and standard deviation $\sigma_x$. Since the VAE system is implemented using neural networks, backpropagation is needed to train the weights/parameters, and the random process of generating $Z$ cannot be backpropagated.
– Solution: the reparameterization trick converts the random process into a deterministic process ($z=\mu_x+\sigma_x\varepsilon$) with the help of a random variable $\varepsilon$ drawn from the normal distribution with mean 0 and standard deviation 1, i.e. $N(0,1)$; this deterministic process can be backpropagated.
Reparameterization trick. Ch10. Auto and variational encoders v230607d 70
  • 71. Demo Matlab code gen_data_using_mean0_sigma1.m shows the idea: $X=\mu_x+\sigma_x\ast eps$ is the formula for generating $X$ from eps (generated by a normal distribution with mean 0, std 1). https://nbviewer.jupyter.org/github/gokererdogan/Notebooks/blob/master/Reparameterization%20Trick.ipynb
• %gen_data_using_mean0_sigma1.m
• clear
• %%large number of samples %%
• eps=randn(10000,1);
• mu_x=2 %this is your mean
• sigma_x=1 %this is your std
• x=mu_x+(eps*sigma_x);
• grad2_of_mean= sum(2*(mu_x+eps))/length(x);
• 'grad2 of mean='
• grad2_of_mean
• 'mean(x)='
• mean(x)
• 'std(x)='
• std(x)
• Result: grad2_of_mean = 3.9933
• mean(x) = 1.9960 (approximately 2)
• std(x) = 0.9984 (approximately 1)
• $\sigma_x$ = standard deviation of x
• $\mu_x$ = mean of x
• eps ~ N(mean=0, std=1), normal distribution
• $X=\mu_x+\sigma_x\ast eps$
• The gradient with respect to the mean is expected_value_of(2(eps+mu_x)), assuming $\sigma_x=1$ for simplicity.
• The above is not random, because eps has already been generated and $\mu_x$ is the current mean. We can use this in our backpropagation formula to find the updated mean.
Using $X=\mu_x+\sigma_x\ast eps$, we can find the gradient while bypassing the random process. Because eps is generated by a random process during the neural net forward pass, during backpropagation it is just data (now available deterministically) to be used. Note: grad2_of_mean = expected_value_of(2(eps+mu_x)). Ch10. Auto and variational encoders v230607d 71
  • 73. Keras (figure; StdDev = $\sigma$) Ch10. Auto and variational encoders v230607d 73
  • 74. Keras implementation of VAE • x = Input(shape=(original_dim,)) • h = Dense(intermediate_dim, activation='relu')(x) • z_mu = Dense(latent_dim)(h) • z_log_var = Dense(latent_dim)(h) • z_mu, z_log_var = KLDivergenceLayer()([z_mu, z_log_var]) • # Use of lambda: normalize log variance to std dev • z_sigma = Lambda(lambda t: K.exp(.5*t))(z_log_var) • eps = Input(tensor=K.random_normal(shape=(K.shape(x)[0], • latent_dim))) • z_eps = Multiply()([z_sigma, eps]) • z = Add()([z_mu, z_eps]) • decoder = Sequential([ • Dense(intermediate_dim, input_dim=latent_dim, activation='relu'), • Dense(original_dim, activation='sigmoid') • ]) • x_pred = decoder(z) Ch10. Auto and variational encoders v230607d 74 http://louistiao.me/posts/implementing-variational-autoencoders-in-keras-beyond-the-quickstart-tutorial/ original_dim = 784 intermediate_dim = 256 latent_dim = 2 batch_size = 100 epochs = 50 epsilon_std = 1.0 StdDev= Predicted output
  • 76. variational_autoencoder_deconv .py from https://github.com/keras-team/keras/tree/master/ • '''Example of VAE on MNIST dataset using CNN • • The VAE has a modular design. The encoder, decoder and VAE • are 3 models that share weights. After training the VAE model, • the encoder can be used to generate latent vectors. • The decoder can be used to generate MNIST digits by sampling the • latent vector from a Gaussian distribution with mean=0 and std=1. • • # Reference • • [1] Kingma, Diederik P., and Max Welling. • "Auto-encoding variational bayes." • https://arxiv.org/abs/1312.6114 • ''' • • from __future__ import absolute_import • from __future__ import division • from __future__ import print_function • • from tensorflow.keras.layers import Dense, Input • from tensorflow.keras.layers import Conv2D, Flatten, Lambda • from tensorflow.keras.layers import Reshape, Conv2DTranspose • from tensorflow.keras.models import Model • from tensorflow.keras.datasets import mnist • from tensorflow.keras.losses import mse, binary_crossentropy • from tensorflow.keras.utils import plot_model • from tensorflow.keras import backend as K • • import numpy as np • import matplotlib.pyplot as plt • import argparse • import os • • • # reparameterization trick • # instead of sampling from Q(z|X), sample eps = N(0,I) • # then z = z_mean + sqrt(var)*eps • def sampling(args): • """Reparameterization trick by sampling fr an isotropic unit Gaussian. • • # Arguments • args (tensor): mean and log of variance of Q(z|X) • Ch10. Auto and variational encoders v230607d 76 n variational_autoencoder_deconv : use: vae.save_weights('vae_cnn_mnist.tf') #instead of vae.save_weights('vae_cnn_mnist.h5') Resulst Epoch 30/30 60000/60000 [==============================] - 91s 2ms/sample - loss: 145.7313 - val_loss: 146.8615 To run this, you need to install: >>conda install graphviz >>conda install pydot
  • 78. Summary • Learned vanilla autoencoder • Learned variational autoencoder • Learned the Reparameterization trick to enable learning in variational autoencoder Ch10. Auto and variational encoders v230607d 78
  • 80. Appendices Ch10. Auto and variational encoders v230607d 80
  • 81. Appendix 1: Training: combining Concepts 1 and 2 to minimize the loss $L$. $X=\{x_1,x_2,\dots,x_N\}$, $E()$ = expected value. For the whole $X$, the average loss is:
– $Q(z|x_i)$ = prob. distribution of $z$ generated by $x_i$ (encoder side); $P(\hat{x}_i|z)$ = prob. distribution of $\hat{x}_i$ generated by $z$ (decoder side); $P(z)=N(0,I)$ is the prior over the latent variable.
– For each datapoint, the loss is the negative expected log-likelihood (Concept 1) plus the divergence from the prior (Concept 2):
$l_i(X) = -E_{z\sim Q(z|x_i)}\big[\log P(\hat{x}_i|z)\big] + D_{KL}\big[Q(z|x_i) \,\|\, N(0,I)\big]$
– Averaging over the dataset and using Gaussian encoder/decoder distributions (as on slides 52-53):
$L = \frac{1}{N}\sum_{x_i\in X}\frac{1}{2\sigma^2_{\hat{x}_i|z}}\big(x_i-\mu_{\hat{x}_i|z}\big)^2 + \tfrac{1}{2}\big\{ tr\big(\sigma^2(X)-I\big)+\mu(X)^T\mu(X)-\log\det\sigma^2(X) \big\}$
– The first term is Concept 1, the second term is Concept 2. We will run an iterative algorithm to minimize $L$.
See http://bjlkeng.github.io/posts/variational-autoencoders/ & https://arxiv.org/abs/1312.6114 Ch10. Auto and variational encoders v230607d 81
  • 82. Appendix 2 Probability likelihood A tutorial KH Wong Ch10. Auto and variational encoders v230607d 82
  • 83. Overview • Bayesian rules • Gaussian distribution • Probability vs likelihood • Log-likelihood and maximum likelihood • Negative log-likelihood Ch10. Auto and variational encoders v230607d 83
  • 84. Bayesian rules Ch10. Auto and variational encoders v230607d 84
  • 85. Bayesian rules • P(B|A)=P(A|B)P(B)/P(A) • P(A and B)=P(A,B)=P(A|B) P(B) • P(A,B|C)=P(A|B,C) P(B|C) • Prove the above as exercises Ch10. Auto and variational encoders v230607d 85 In each cell, the joint probability p(r, c) is re-expressed by the equivalent form p(r | c) p(c) from the definition of conditional probability in Equation 5.3. The marginal probability p(r) =Σc*p(r | c*) p(c*), https://www.sciencedirect.com/topics/mathematics/marginal-probability
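These identities can be checked numerically on any small joint probability table; a hedged sketch with a made-up 2x2 joint distribution:

import numpy as np

# Hypothetical joint distribution P(A, B) over A in {0,1} (rows) and B in {0,1} (columns)
P_AB = np.array([[0.10, 0.30],
                 [0.20, 0.40]])

P_A = P_AB.sum(axis=1)            # marginal P(A) = sum_B P(A, B)
P_B = P_AB.sum(axis=0)            # marginal P(B) = sum_A P(A, B)

P_A_given_B = P_AB / P_B          # P(A|B) = P(A, B) / P(B)  (broadcast over columns)
P_B_given_A = (P_AB.T / P_A).T    # P(B|A) = P(A, B) / P(A)  (broadcast over rows)

# Check Bayes' rule: P(B|A) = P(A|B) P(B) / P(A), for A=1, B=0
a, b = 1, 0
print(np.isclose(P_B_given_A[a, b], P_A_given_B[a, b] * P_B[b] / P_A[a]))   # True

# Check the product rule: P(A, B) = P(A|B) P(B)
print(np.allclose(P_AB, P_A_given_B * P_B))                                  # True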
  • 86. Gaussian distribution
• %2-D Gaussian distribution P(xj)
• %matlab code----------
• clear, N=10
• [X1,X2]=meshgrid(-N:N,-N:N);
• sigma =2.5;mean=[3 3]'
• G=1/(2*pi*sigma^2)*exp(-((X1-mean(1)).^2+(X2-mean(2)).^2)/(2*sigma^2));
• G=G./sum(G(:)) %normalise it
• 'sigma is ', sigma
• 'sum(G(:)) is ',sum(G(:))
• 'max(max(G(:))) is',max(max(G(:)))
• figure(1), clf
• surf(X1,X2,G);
• xlabel('x1'),ylabel('x2')
– 1-D Gaussian, $x_j$ = a sample, $\mu_0$ = mean, $\sigma_0^2$ = variance:
$N(x_j)=\frac{1}{(2\pi\sigma_0^2)^{1/2}}\exp\Big(-\frac{1}{2}\frac{(x_j-\mu_0)^2}{\sigma_0^2}\Big)$
– 2-D isotropic (circularly symmetric) Gaussian, assuming zero mean:
$N(x_1,x_2)=\frac{1}{2\pi\sigma_0^2}\exp\Big(-\frac{x_1^2+x_2^2}{2\sigma_0^2}\Big)$
Ch10. Auto and variational encoders v230607d 86
  • 87. Probability vs likelihood
– They are two sides of the same coin.
– P() probability function: given a Gaussian model (with mean $\mu_0$ and variance $\sigma_0^2$), the probability function $P(X|\mu_0,\sigma_0^2)$ measures the probability that the observation $X$ is generated by the model.
– L() likelihood function: given data $X$, the likelihood function $L(\mu_0,\sigma_0^2|X)$ measures the probability that $X$ fits the Gaussian model with mean $\mu_0$ and variance $\sigma_0^2$.
– Major application: given data $X$, we can maximize the likelihood function $L(\mu_0,\sigma_0^2|X)$ to find the model $(\mu_0,\sigma_0^2)$ that fits the data. This is called the maximum likelihood method.
– The log-likelihood rather than the likelihood is more convenient for finding the maximum, hence it is often used.
$P(X|\mu_0,\sigma_0^2)=L(\mu_0,\sigma_0^2|X)$
Ch10. Auto and variational encoders v230607d 87
  • 88. Likelihood function L( ) of n-dimensional data
– For $X=[x_1,x_2,\dots,x_n]$, the likelihood function is
$L(\mu,\sigma^2|X)=\big(2\pi\sigma^2\big)^{-n/2}\exp\Big(-\frac{1}{2\sigma^2}\sum_{j=1}^{n}(x_j-\mu)^2\Big)$
– Intuition: the likelihood function $L(\mu,\sigma|X)$ means: given a Gaussian model N(mean, variance), how much the multivariate data $X=[x_1,x_2,x_3,\dots,x_n]$ fits the model with parameters $(\mu,\sigma)$.
– Proof: given the assumption that the observations from the sample are IID, the likelihood function can be written as
$L(\mu,\sigma^2|X)=\prod_{j=1}^{n}N(x_j|\mu,\sigma^2)=\prod_{j=1}^{n}\big(2\pi\sigma^2\big)^{-1/2}\exp\Big(-\frac{(x_j-\mu)^2}{2\sigma^2}\Big)=\big(2\pi\sigma^2\big)^{-n/2}\exp\Big(-\frac{1}{2\sigma^2}\sum_{j=1}^{n}(x_j-\mu)^2\Big)$
Ch10. Auto and variational encoders v230607d 88
  • 89. A more useful representation is the log-likelihood function $l()=\log(L())$
– Intuition: the peak of the likelihood and the log-likelihood functions is at the same place.
– The two are a one-to-one mapping, hence no information is lost.
– Log-based expressions are easier to handle mathematically, so the log-likelihood function is often used.
– For computers, logs of numbers are smaller, which may save memory; using logs we can use addition rather than multiplication, which makes computation easier.
– By definition, for $X=[x_1,x_2,\dots,x_n]$ the log-likelihood function is
$l(\mu,\sigma^2|x_1,\dots,x_n)=\ln L(\mu,\sigma^2|x_1,\dots,x_n)= -\frac{n}{2}\ln(2\pi)-\frac{n}{2}\ln(\sigma^2)-\frac{1}{2\sigma^2}\sum_{j=1}^{n}(x_j-\mu)^2$
– Proof:
$\ln L(\mu,\sigma^2|X)=\ln\Big[\big(2\pi\sigma^2\big)^{-n/2}\exp\Big(-\frac{1}{2\sigma^2}\sum_{j=1}^{n}(x_j-\mu)^2\Big)\Big] = -\frac{n}{2}\ln(2\pi)-\frac{n}{2}\ln(\sigma^2)-\frac{1}{2\sigma^2}\sum_{j=1}^{n}(x_j-\mu)^2$, proved!
Ch10. Auto and variational encoders v230607d 89
  • 90. Maximum Likelihood vs. Log-Likelihood
– Given $X=[x_1,x_2,\dots,x_n]$ and the Gaussian parameter set $\theta=(\mu,\sigma^2)$, the likelihood function is
$L(\theta|X)=L(\mu,\sigma^2|X)=\big(2\pi\sigma^2\big)^{-n/2}\exp\Big(-\frac{1}{2\sigma^2}\sum_{j=1}^{n}(x_j-\mu)^2\Big)$
– Taking the log of the likelihood gives the log-likelihood function
$l(\theta|X)=\ln L(\theta|X)= -\frac{n}{2}\ln(2\pi)-\frac{n}{2}\ln(\sigma^2)-\frac{1}{2\sigma^2}\sum_{j=1}^{n}(x_j-\mu)^2$
– Important: since the log is a monotonic function, $\arg\max_\theta L(\theta|X)=\arg\max_\theta l(\theta|X)$.
– The maximum happens at $\theta=(\mu,\sigma^2)$, where $\mu=\frac{1}{n}\sum_{j=1}^{n}x_j$ and variance $\sigma^2=\frac{1}{n}\sum_{j=1}^{n}(x_j-\mu)^2$.
http://jrmeyer.github.io/machinelearning/2017/08/18/mle.html https://towardsdatascience.com/probability-concepts-explained-maximum-likelihood-estimation-c7b4342fdbb1 Ch10. Auto and variational encoders v230607d 90
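A minimal numpy sketch of this result (the data, seed, and grid ranges are illustrative assumptions): maximizing the log-likelihood over a grid of (mu, sigma^2) recovers approximately the sample mean and the (biased) sample variance.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=5000)   # data drawn from N(mu=3, sigma=2)

def log_likelihood(mu, var, x):
    # l(mu, var | x) = -n/2*ln(2*pi) - n/2*ln(var) - sum((x-mu)^2) / (2*var)
    n = x.size
    return (-0.5 * n * np.log(2 * np.pi) - 0.5 * n * np.log(var)
            - np.sum((x - mu) ** 2) / (2 * var))

# Brute-force grid search over (mu, var)
mus = np.linspace(2.0, 4.0, 101)
vars_ = np.linspace(2.0, 6.0, 201)
best = max((log_likelihood(m, v, x), m, v) for m in mus for v in vars_)
print(best[1], best[2])          # grid maximizer, close to the closed-form answers below
print(x.mean(), x.var())         # mu_hat = sample mean, sigma^2_hat = biased sample variance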
  • 91. Proof 1: maximum of the log-likelihood function of a Gaussian distribution with respect to the mean
– Likelihood function: $L(\mu,\sigma^2|X)=\big(2\pi\sigma^2\big)^{-n/2}\exp\Big(-\frac{1}{2\sigma^2}\sum_{j=1}^{n}(x_j-\mu)^2\Big)$; log-likelihood: $l(X)=-\frac{n}{2}\ln(2\pi)-\frac{n}{2}\ln(\sigma^2)-\frac{1}{2\sigma^2}\sum_{j=1}^{n}(x_j-\mu)^2$.
– The maximum log-likelihood with respect to $\mu$ is at $\frac{d\,l(X)}{d\mu}=\frac{d}{d\mu}\ln L(\mu,\sigma^2|x_1,\dots,x_n)=0$.
– $\frac{d\,l(X)}{d\mu}=\frac{d}{d\mu}\Big[-\frac{n}{2}\ln(2\pi)-\frac{n}{2}\ln(\sigma^2)-\frac{1}{2\sigma^2}\sum_{j=1}^{n}(x_j-\mu)^2\Big]=\frac{1}{\sigma^2}\sum_{j=1}^{n}(x_j-\mu)$
– Setting $\frac{1}{\sigma^2}\sum_{j=1}^{n}(x_j-\mu)=0$ gives $\sum_{j=1}^{n}x_j-n\mu=0$, hence $\mu=\frac{1}{n}\sum_{j=1}^{n}x_j$ = mean of $x$.
– So what is the expression of $\sigma^2$ that maximizes the log-likelihood? (See Proof 2.)
https://towardsdatascience.com/probability-concepts-explained-maximum-likelihood-estimation-c7b4342fdbb1 http://jrmeyer.github.io/machinelearning/2017/08/18/mle.html Ch10. Auto and variational encoders v230607d 91
  • 92. Proof 2: maximum of the log-likelihood function of a Gaussian distribution with respect to the variance
– Maximum log-likelihood happens when $\frac{d}{d\sigma^2}\ln L(x_1,\dots,x_n|\mu,\sigma^2)=0$; note $\frac{d\ln(z)}{dz}=\frac{1}{z}$.
– For a Gaussian, $l=-\frac{n}{2}\ln(2\pi)-\frac{n}{2}\ln(\sigma^2)-\frac{1}{2\sigma^2}\sum_{j=1}^{n}(x_j-\mu)^2$, so
$\frac{dl}{d\sigma^2}=-\frac{n}{2\sigma^2}+\frac{1}{2\sigma^4}\sum_{j=1}^{n}(x_j-\mu)^2=0$
– Hence $\frac{n}{2\sigma^2}=\frac{1}{2\sigma^4}\sum_{j=1}^{n}(x_j-\mu)^2$, and $\hat{\sigma}^2=\frac{1}{n}\sum_{j=1}^{n}(x_j-\mu)^2$ = variance of $x$.
– That means, given the data, it is most likely to be generated by a Gaussian distribution with mean = mean of $x_j$ and variance = variance of $x_j$. Note: if $n=1$, $\hat{\sigma}^2=(x_j-\mu)^2$.
http://people.stat.sfu.ca/~raltman/stat402/402L4.pdf https://towardsdatascience.com/probability-concepts-explained-maximum-likelihood-estimation-c7b4342fdbb1 Ch10. Auto and variational encoders v230607d 92
  • 93. Alternative proof: maximum log-likelihood, finding the most suitable variance $\sigma^2$
– Proof: solve the maximum log-likelihood problem $\frac{dl}{d\sigma}=0$ for $l(\mu,\sigma^2|x_1,\dots,x_n)$:
$\frac{d}{d\sigma}\Big[-\frac{n}{2}\ln(2\pi)-n\ln(\sigma)-\frac{1}{2\sigma^2}\sum_{j=1}^{n}(x_j-\mu)^2\Big] = -\frac{n}{\sigma}+\frac{1}{\sigma^3}\sum_{j=1}^{n}(x_j-\mu)^2 = 0$
– which equals zero only if $\hat{\sigma}^2=\frac{1}{n}\sum_{j=1}^{n}(x_j-\mu)^2$ (the maximum of the Gaussian log-likelihood occurs here), done. Note: if $n=1$, $\hat{\sigma}^2=(x_j-\mu)^2$.
– The maximum likelihood is at $\hat{\mu}=\frac{1}{n}\sum_{j=1}^{n}x_j$, $\hat{\sigma}^2=\frac{1}{n}\sum_{j=1}^{n}(x_j-\hat{\mu})^2$.
Ch10. Auto and variational encoders v230607d 93
  • 94. Negative Log-Likelihood (NLL) And its application in softmax To maximize log-likelihood, we can minimize its negative log-likelihood (NLL) function Ch10. Auto and variational encoders v230607d 94
  • 95. Softmax function
– https://medium.com/data-science-bootcamp/understand-the-softmax-function-in-minutes-f3a59641e86d
– $\mathrm{softmax}(y_i)=\frac{\exp(y_i)}{\sum_{i=1}^{n}\exp(y_i)}$, for $i=1,2,\dots,n$
– Example: y = [2, 1, 0.1]'
– softmax(y) = [0.6590, 0.2424, 0.0986]'
– exp(2)/(exp(2)+exp(1)+exp(0.1)) = 0.6590
– exp(1)/(exp(2)+exp(1)+exp(0.1)) = 0.2424
– exp(0.1)/(exp(2)+exp(1)+exp(0.1)) = 0.0986
Ch10. Auto and variational encoders v230607d 95
  • 96. Softmax Activation Function
– https://ljvmiranda921.github.io/notebook/2017/08/13/softmax-and-the-negative-log-likelihood/#nll
– For logits [5, 4, 2]: exp(5)/(exp(5)+exp(4)+exp(2)) = 0.705 and exp(4)/(exp(5)+exp(4)+exp(2)) = 0.2595 (note that the naive ratio 5/(5+4+2) does not give the softmax value).
Ch10. Auto and variational encoders v230607d 96
  • 97. Negative Log-Likelihood (NLL) • To maximize likelihood, minimum negative log- likelihood (NLL) is picked Ch10. Auto and variational encoders v230607d 97 https://ljvmiranda921.github.io/notebook/2017/08/13/softmax-and-the-negative-log-likelihood/#nll =-ln(likelihood) =-ln(0.02)=3.91 =-ln(0)=infinity =-ln(0.98)=0.02 Minimum negative log- likelihood (NLL) is picked, so 0.02 is selected Softmax output as the likelihood
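A minimal numpy sketch reproducing the numbers on these softmax/NLL slides (the logits and probabilities are the slides' own example values):

import numpy as np

def softmax(y):
    e = np.exp(y - np.max(y))      # subtract the max for numerical stability
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p)                           # approximately [0.659, 0.242, 0.099]

# Negative log-likelihood of the probability assigned to the true class
print(-np.log(0.98))               # ~0.02  (confident, correct prediction -> small NLL, picked)
print(-np.log(0.02))               # ~3.91  (confident, wrong prediction -> large NLL)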
  • 99. FAQ on VAE
• FAQ Assign3, 2020 Nov 17
• Question 3.1:
• Hi, sorry for interrupting, I have a question about the auto-encoder part of assignment 3. In question one, the encoder hidden layer and decoder hidden layer are of different sizes (15 and 18), and the numbers of neurons for (means, variance) and samples are different as well. Does it mean that in a variational auto-encoder the encoder hidden layer size and decoder hidden layer size can be different, and the neuron numbers for means, variance and samples don't have to match? If so, is there some random drop-out function when the means, variance and sample sizes don't match? Thanks.
• Answer 3.1:
• This is a very good question. In my notes, (mean, variance, sample_z) are of the same sizes, but I have found some implementations showing that this may not be the only case. Yes, it is a kind of dropout as described in the papers shown below. I think the rule is that mean and variance should have the same number because they go in pairs, but the randomly generated sample_z can be of a different size. It is done by randomly (via the Monte Carlo method) selecting the pair of mean and variance for generating the value of sample_z. Neural computing is a trial-and-error method; you may try different approaches, and the preferred method is the one that gives you a good result. You may explore more papers and see whether my interpretation is correct or not.
• See section 3.4 of
• https://arxiv.org/pdf/1706.03643.pdf
• Also
• https://deeplearn.org/arxiv/92996/generating-data-using-monte-carlo-dropout
• ////////////////////////////////////////////////////////
• Question 3.2 on VAE (variational auto-encoder)
• Question 3.2a:
• In your notes, the variational auto-encoder turns the input x into means and deviations of a multivariate Gaussian distribution, then uses a random sampling method to create the output. The output is Z, and Z is a generated random sample fed to the next layer of neurons.
• (i) How do we train the neural network if the input is from random sampling? (ii) And how do we force a multivariate Gaussian distribution Z towards the univariate Gaussian distribution N(0,1)?
• Answer 3.2a: I will answer part (ii) of the above question first. It is not to avoid over-fitting. From the input to the latent (hidden) representation z, there is a random process. A random process can have many different forms: it can be Gaussian, Laplace or Cauchy etc., or some unknown form. If there is no control, you may not be able to repeat the process, hence training becomes useless. In the VAE paper (https://arxiv.org/abs/1312.6114), the authors propose to force the random probability distribution to be Gaussian (I guess you might force it to be Laplace etc. and it could still work, but you have to be consistent in using one model). How? The method uses D_KL (the Kullback–Leibler divergence). It is Concept 2 in my notes, used to make sure the random process is Gaussian.
• ////--------------------------------------------------------------------------------------------------------------------
• Question 3.2b: Why do we still need re-parameterization to do backpropagation?
• Answer 3.2b: It is known that a random process cannot be back-propagated, but re-parameterization provides a means to back-propagate. First, zi is not generated by a random generator with mean=µi, std_dev=σi, but rather by an indirect method of finding zi (using zi = µi + ε*σi) through ε, which is generated by N(0,1) = N_Gaussian(mean=0, std_dev=1). If you have doubts, run my Matlab program on p.71 of 5707_10_auto-encoder (1).pptx. In short, it is found that if we use zi = µi + ε*σi to generate zi, then zi will have the required characteristics (mean=µi, std_dev=σi).
• Then, why do we use that indirect method? Because during the forward pass of neural computing, ε has already been calculated by N(0,1); it is a real number, not a random variable (the same holds for the mean=µi and std_dev=σi neuron outputs), so during back-propagation we can use zi = µi + ε*σi to find out how much to back-propagate to change the weights of the neurons. In the lecture notes, p.67 of 5707_10_auto-encoder.pptx, the gradient is calculated (please recall that for neural back-propagation computing, the gradient is needed to find de/dw); we can form our weight-updating program based on this formulation. The idea is that with this gradient, we know how to change µi, σi if we know the change of zi (if it were a pure random process, we simply would not know how). However, you don't need to enter this gradient into the VAE program because it is already in the TensorFlow-Keras library; it is handled automatically by TensorFlow-Keras as long as you provide the zi = µi + ε*σi formulation in the forward pass.
• ////--------------------------------------------------------------------------------------------------------------------
• Question 3.2c: If we put N(0.15, 2.3), does that mean the input mean is 0.15 and the std is 2.3? Then it goes through KL to compute the error with the expected distribution.
• Answer 3.2c: The use of N(0,1) (a Gaussian with mean=0, std_dev=1) is to make the formulation easier to program or calculate; see p.51, where the D_KL formulation (the loss function is based on it) becomes simpler. I guess you could assume all distributions to be N(0.15, 2.3), but then your loss function becomes more complex. The idea is to make sure zi is generated by a Gaussian process; zi can be generated with a different mean and std_dev, but it needs to be Gaussian. It is done by reducing D_KL(random process that generates zi || N(0,1)). So comparing the process of generating zi to a typical Gaussian like N(0,1) to form the loss function is reasonable. Ch10. Auto and variational encoders v230607d 99
  • 100. To prove $\nabla_\theta E_q[x^2]=E_p[2(\theta+\varepsilon)]$, where $\theta$ is the mean and $\varepsilon\sim N(0,1)$ (see the derivation on the next slide). https://stats.stackexchange.com/questions/199605/how-does-the-reparameterization-trick-for-vaes-work-and-why-is-it-important Ch10. Auto and variational encoders v230607d 100
  • 101. Alternative derivation: to prove $\nabla_\theta E_q[x^2]=E_p[2(\theta+\varepsilon)]$
– We want to find $\min_\theta E_q[x^2]$, thus we need to find $\nabla_\theta E_q[x^2]$.
– Since $\frac{d\log(y)}{dx}=\frac{1}{y}\frac{dy}{dx}$, we have $\nabla_\theta\log q_\theta(x)=\frac{1}{q_\theta(x)}\nabla_\theta q_\theta(x)$ ----- (i)
– $\nabla_\theta E_q[x^2]=\nabla_\theta\int q_\theta(x)\,x^2\,dx$, by definition of expectation
$=\int\nabla_\theta q_\theta(x)\,\frac{q_\theta(x)}{q_\theta(x)}\,x^2\,dx=\int\Big(\frac{1}{q_\theta(x)}\nabla_\theta q_\theta(x)\Big)q_\theta(x)\,x^2\,dx$ ----- (ii); putting (i) into (ii):
$\nabla_\theta E_q[x^2]=\int q_\theta(x)\,\nabla_\theta\log q_\theta(x)\,x^2\,dx=E_q\big[\nabla_\theta\log q_\theta(x)\,x^2\big]$, also by definition of expectation.
– If $q_\theta=N(\theta,1)$ is a normal distribution with mean $\theta$ and variance 1, i.e. $q_\theta(x)=\frac{1}{(2\pi)^{1/2}}\exp\big(-\tfrac{1}{2}(x-\theta)^2\big)$, then $\log q_\theta(x)=-\tfrac{1}{2}(x-\theta)^2+\text{const}$, hence $\nabla_\theta\log q_\theta(x)=x-\theta$, and therefore $\nabla_\theta E_q[x^2]=E_q[x^2(x-\theta)]$.
– Since $x=\theta+\varepsilon$ with $\varepsilon\sim N(0,1)$, we have $E_q[x^2]=E_p[(\theta+\varepsilon)^2]$, where $p$ is the distribution of $\varepsilon$, i.e. $\varepsilon\sim N(0,1)$. Therefore the derivative is
$\nabla_\theta E_q[x^2]=\nabla_\theta E_p[(\theta+\varepsilon)^2]=E_p[2(\theta+\varepsilon)]$
Ch10. Auto and variational encoders v230607d 101
  • 102. Reparameterization: Backpropagation needs derivative of a function (process) • Ch10. Auto and variational encoders v230607d 102 https://stats.stackexchange.com/questions/199605/how-does-the-reparameterization- trick-for-vaes-work-and-why-is-it-important Derivative of a random process is not possible Derivative of the Reparameterization process (no random node is involved) is possible
  • 104. Summary: Backpropagation
– The gradient during backpropagation is
$\nabla_\theta E_q[z^2]=\nabla_\theta E_p[(\mu_x+\varepsilon)^2]=E_p[2(\mu_x+\varepsilon)]$ ------ (*)
– This gradient is required for the neural network learning (back-propagation) process.
– $\varepsilon$ = the variable generated from $N(0,1)$ during the forward pass.
– $\mu_x$ is the current mean and is given at the forward pass.
– So the gradient (see formula (*) above) can be found and used in backpropagation.
Ch10. Auto and variational encoders v230607d 104
  • 105. Gradient for backpropagation
– $E_q()$ = expectation, $\mu$ = mean, $\sigma$ = standard deviation.
– $z=\mu+\sigma\varepsilon$, with $\varepsilon$ sampled from $N(0,I)$.
– The above is deterministic, so we can differentiate with respect to $\mu$ and $\sigma$, and thus find the derivative of $E_q[z^2]$.
– $E_q[z^2]=E_p[(\mu+\sigma\varepsilon)^2]$. Assume $\sigma=1$ for simplicity ($\mu$ and $\sigma$ are independent), so $E_q[z^2]=E_p[(\mu+\varepsilon)^2]$.
– Derivative of $E_q[z^2]$: $\partial E_q[z^2]/\partial\mu=\nabla_\mu E_q[z^2]=E_p[2(\mu+\varepsilon)]$.
– (The proof is in the appendix: to prove $\nabla_\mu E_q[z^2]=E_p[2(\mu+\varepsilon)]$.)
– If we have enough samples of $\varepsilon$, we can estimate $\nabla_\mu E_q[z^2]$. This gradient is required for the neural network learning (back-propagation) process.
– $\mu$ = current mean, $\varepsilon$ = randomly generated by $N(0,I)$ during the forward pass.
– For $\sigma$, we can apply the same treatment for updating.
Ch10. Auto and variational encoders v230607d 105
  • 106. https://nbviewer.jupyter.org/github/gokererdogan/Notebooks/blob/master/Reparameterization%20Trick.ipynb Demo gen_data_using_mean0_sigma1.py Reparameterization trick
• import numpy as np
• N = 1000
• theta = 2.0
• eps = np.random.randn(N)
• x = theta + eps
• grad1 = lambda x: np.sum(np.square(x)*(x-theta)) / x.size
• grad2 = lambda eps: np.sum(2*(theta + eps)) / x.size
• print(grad1(x))
• print(grad2(eps))
• 3.86872102149
• 4.03506045463
• Let us plot the variance for different sample sizes.
• Ns = [10, 100, 1000, 10000, 100000]
• reps = 100
• means1 = np.zeros(len(Ns))
• vars1 = np.zeros(len(Ns))
• means2 = np.zeros(len(Ns))
• vars2 = np.zeros(len(Ns))
• est1 = np.zeros(reps)
• est2 = np.zeros(reps)
• for i, N in enumerate(Ns):
•     for r in range(reps):
•         x = np.random.randn(N) + theta
•         est1[r] = grad1(x)
•         eps = np.random.randn(N)
•         est2[r] = grad2(eps)
•     means1[i] = np.mean(est1)
•     means2[i] = np.mean(est2)
•     vars1[i] = np.var(est1)
•     vars2[i] = np.var(est2)
• print(means1)
• print(means2)
• print()
• print(vars1)
• print(vars2)
• [ 4.10377908 4.07894165 3.97133622 4.00847457 3.99620013]
• [ 3.95374031 4.0025519 3.99285189 4.00065614 4.00154934]
• [ 8.63411090e+00 8.90650401e-01 8.94014392e-02 8.95798809e-03 1.09726802e-03]
• [ 3.70336929e-01 4.60841910e-02 3.59508788e-03 3.94404543e-04 3.97245142e-05]
• %matplotlib inline
• import matplotlib.pyplot as plt
• plt.plot(vars1)
• plt.plot(vars2)
• plt.legend(['no rt', 'rt'])
Ch10. Auto and variational encoders v230607d 106 The variance of the estimates using the reparameterization trick is one order of magnitude smaller than the estimates from the first method!

Editor's Notes

  • #8: \subsection{Theory} %%eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee $x \rightarrow F \rightarrow x'$\\ $z=\sigma (Wx+b)------(*)$\\ $x'=\sigma'(W'z+b')---(**)$\\ Autoencoders are trained to minimize reconstruction errors (such as squared errors), often referred to as the "loss (L)":\\ By combining (*) and (**)\\ $Loss=L(x,x')=\| x-x' \|^2$\\ $\| x-\sigma' ( W' \sigma (W x+b)+b' \|^2$\\
  • #28: subsection{Uni-variate and Multivariate Gaussian} %https://ttic.uchicago.edu/~shubhendu/Slides/Estimation.pdf $N_{univariate}(x)=\frac{1}{(2\pi \sigma^2)^{(1/2)}} exp\big( -\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2} \big)$\\ Multivariate Gaussian, $x$=data sample, %$\mu$=mean, $\sum$=covariance \\ d-dimension\\ $N_{univariate}(x)= \frac{1}{(2\pi)^{(d/2)} | \Sigma |^{(1/2)}} exp\big( -\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu) \big)$
  • #30: %%%%%%%%%%%% slide 54 %%%%%%%%%%%%%%%%%%%% %%% eeeeee eq:filtering_01A eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee %%% 1st order (A) Gaussian \begin{flalign} \label{eq:filtering_01a} \begin{aligned} G(x) = \frac{1} {\sqrt{2\pi{\sigma^2 }}} e^{- \frac{(x-\mu)^2} {2\sigma^2} } \end{aligned} \end{flalign} %%% eeeeee eq:filtering_01B eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee 1st order (B) Gaussian\\ \begin{flalign} \label{eq:filtering_01b} \begin{aligned} G(x) = \frac{1} {\sqrt{2\pi{\sigma^2 }}} exp\left({- \frac{(x-\mu)^2} {2\sigma^2} }\right) \end{aligned} \end{flalign} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%% eeeeee eq:filtering_02A eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 2nd order (A) Gaussian\\ \begin{flalign} \label{eq:filtering_02A} \begin{aligned} G(x,y) =G(x)G(y)= \frac{1} {{2\pi{\sigma^2 }}} e^{- \frac{(x-\mu_x)^2+(y-\mu_y)^2} {2\sigma^2} } \end{aligned} \end{flalign} %%% 2nd order (B) Gaussian %%% eeeeee eq:filtering_02B eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee \begin{flalign} \label{eq:filtering_02B} \begin{aligned} G(x,y) =G(x)G(y)= \frac{1} {{2\pi{\sigma^2 }}} exp\left({- \frac{(x-\mu_x)^2+(y-\mu_y)^2} {2\sigma^2} }\right) \end{aligned} \end{flalign}
  • #31: %%%%%%%%%%%%%slide 28 %%%%%%%%%%%%%%%%% 1-D Gaussian $x_j$= a sample,\\ $\mu_0=mean, \sigma_0=variance$\\ $N(x_j)=\frac{1}{(2 \pi \sigma_0^2)^{1/2}} exp \big (-\frac{1}{2}\frac{(x_j - \mu_0)^2}{\sigma_0^2} \big)$\\ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 2-D an isotropic (circular symmetric) Gaussian \\ assume mean is 0\\ $N(x_1,x_2)=\frac{1}{(2 \pi \sigma_0^2)} exp \big (-\frac{1}{2}\frac{(x_1^2+x_2^2 )^2}{\sigma_0^2} \big)$\\
  • #32: %%%%%%%%%%slide 29 , 30%%%%%%%%%%%%%%%%%%%% ---------------slide 29,30 \\ 2-D an isotropic (circular symmetric) Gaussian for variable $(x,y)$ \\ assume 2-D mean is $(m_x,m_y)$\\ $G(x,y)=G(x)G(y)$\\ $N(x_1,x_2)=\frac{1}{(2 \pi \sigma_0^2)} exp \big (-\frac{1}{2}\frac{((x-m_x)^2+(y-m_y)^2 )^2}{\sigma_0^2} \big)$\\
  • #33: %%%%%%%%%%slide 29 , 30%%%%%%%%%%%%%%%%%%%% ---------------slide 29,30 \\ 2-D an isotropic (circular symmetric) Gaussian for variable $(x,y)$ \\ assume 2-D mean is $(m_x,m_y)$\\ $G(x,y)=G(x)G(y)$\\ $N(x_1,x_2)=\frac{1}{(2 \pi \sigma_0^2)} exp \big (-\frac{1}{2}\frac{((x-m_x)^2+(y-m_y)^2 )^2}{\sigma_0^2} \big)$\\
  • #49: %%%%%%%%%%slide 46%%%%%%%%%%%%%%%%%%%% ---------------slide 46 \\ \subsection{slide 46:The reconstruction loss $l()$} Given $x_i \epsilon X, z \epsilon Q, \text { and } E()$ is the expected value The idea is to train the Encoder/Decoder (Neural Network) to maximum the likelihood of the Mean squared error (MSE) between x and reconstructed $\hat x_i \epsilon \hat X$\\ % To maximize likelihood, we can minimize the “expected negative log-likelihood” $(l_i)$ of the $ i^{th}$ datapoint $x_i$. \\ % $ l_i(\theta, \phi | x_i)=-E_{x_i \epsilon X} \big[ E_{z \epsilon Q}[log P_{\phi (de)}(\hat x_i | z)] \big]$\\ $q_{\theta(en)}(z|x_i)$\\ $P_{\phi(de)}(\hat x_i |z)$\\ % $l_i(\theta,\phi)$\\
  • #51: %%%%%%%%%%%%%%%%%%%%%%%% \subsection{slide 48: How to make sure the neural networks produce similar hidden data (means and standard deviations) from similar training images} Problem: Input that we regard as similar may end up very different in z space (hidden, means and standard deviations). That means some solutions may give small loss $l_i{(\theta, \phi)}$, even $q_{\theta(en)}$ and $p_{\phi(de)}$ are of very different distributions.\\ Solution: Use $p(z)=N(0,1)$, try to force $q_{\theta(en)}(z|x_i)$ (a neural network) to act similar to a standard normal probability density function. We can use Kullback-Leibler divergence $(D_{KL})$ to do the checking. % $l_i(\theta,\phi|x_i)$\\ $l_i()$
  • #52: \subsection{Slide 49: Math background: Kullback–Leibler divergence} %https://ttic.uchicago.edu/~shubhendu/Slides/Estimation.pdf Math background: Kullback–Leibler divergence (also known as relative entropy) measures how one probability distribution is different from a second, reference probability distribution over the same variable X. %1 Define:\\ $D_{KL}\big[ N(\mu_1,\sigma_1^2) || ( \mu_2,\sigma_2^2) \big]= \frac{1}{2} \Big[ tr \big ([\sigma_2^2]^T \cdot \sigma_1^2-I \big) +(\mu_1-\mu_2)^T[\sigma^2]^{-1}(\mu_1 - \mu_2) +log \big( \frac{det(\sigma_2^2)}{det(\sigma_1^2)} \big) \Big]$-----(I)\\ If $N(\mu_1,\sigma_1^2)=N(\mu(X),\sigma^2(X))$; \\Also $N(\mu_2,\sigma^2_2)=N(0,I)$\\ %2 % $D_{KL}\big[ N(\mu(X),\sigma^2(X)) || ( N(0,I) \big]= \frac{1}{2} \Big[ tr \big ([\sigma^2]^T \cdot \sigma^2-I \big) +(\mu(X))^T(\mu(X))) -log (det(\sigma^2(X))) \Big]$\\ %3 For $D[Q(z|x_i) || N(0,I)] \text{, where } Q(z|x_i) =N(\mu(X),\sigma^2(X))$\\ $D_{KL}\big[ N(\mu(X),\sigma^2(X)) || ( N(0,I) \big]= \frac{1}{2} \Big[ tr \big ([\sigma^2]^T \cdot \sigma^2-I \big) +(\mu(X))^T(\mu(X))) -log (det(\sigma^2(X))) \Big]$\\ Kullback–Leibler divergence $D_{KL} (D_1 || D_2)=0$ indicates the two distributions ${D_1,D_2}$ are identical %Tutorial on Variational Autoencoders by Carl Doersch & https://arxiv.org/abs/1606.05908
  • #53: \subsection{Slide50:Training (concept 1)} %See http://bjlkeng.github.io/posts/variational-autoencoders/ & https://arxiv.org/abs/1312.6114 Combining concept 1 and 2 to minimize Loss $l_i (X), \text{ of } X= {x_1,x_2,..,x_N} , E()$=expected value . For the whole $X$, the average loss is \\ Input to encoder = $x_i \epsilon X$\\ Output to encoder = $\hat{x_i} \epsilon \hat{X}$\\ $P(\hat{x_i} | z)$=Prob. distribution of $x_i$ generated by $z$ (decoder side)\\ % $P(z | x_i)$=Prob. distribution of $x_i$ generated by $z$ (decoder side)\\ % $P(z)$=Prob. distribution of the latent (hidden) variables,\\ % $E_{Z \epsilon Q}[log P(\hat x_i | z)]$=expected value (exp. val) of $\hat x_i$ generated at the decoder output\\ % $E_{x_i \epsilon X}[E_{Z \epsilon Q}[log(P(\hat {x_i | z)})]]$= exp. val. of $\hat{x_i}$ gen. at the decoder output when input = $x_i \epsilon X$ \\ % $\varepsilon = \text { random variable generated by a Gaussian function }\{ mean (\mu)=0, stdev( \sigma)=1\}, \varepsilon N(0,I) $\\ % At this stage $z$ can be any distribution, but we can assume $z$= Gaussian $N(\mu_{x_i|z}), \sigma_{x_i | z})$\\ % It can be formed by scaling $\varepsilon \epsilon N(),I)$, (\url{en. https://en.wikipedia.org/wiki/Normal_distribution})\\ The advantage is if $(\mu_{x_i | z}, \sigma_{x_i | z)})$ are found, then use a random gen. $N(\mu_{x_i | z}, \sigma_{x_i | z})$ to gen. $z$.\\ % Hence $log P(\hat{x_i} | z)=log P \big( \hat {x_i |z}=\mu_{x_i |z}(x_i)+\sigma_{x_i | z}(x_i) \ast \varepsilon \big)$ \\ % We want to maximize $E_{x_i \epsilon X}[ E_{z_i \epsilon Q}[log P (\hat x_i | z]]$ (to make input output similar)\\ % It is the same as to minimize $-\big( E_{x_i \epsilon X}[ E_{z_i \epsilon Q}[log P (\hat x_i | z]] \big)$\\ Objective function1=$-E_{x_i | X} \Big[ E_{z \epsilon Q}[ log P(\hat{x_i} | z=\mu_{x_i |z}(x_i)+\sigma_{x_i | z}(x_i) \ast \varepsilon )$ ] \Big]\\ Since P is Gaussian, we minimize Objective$\_$function1=$\frac{1}{N} \sum\limits_{x_i \epsilon X} \Big( \frac{1}{2 \sigma^2_{\hat {x_i}} |z} (x_i - \mu_{\hat{x_i}|z})^2 \Big)$
  • #54: %%%%%%%%%%%%%%%%%%%%%%%% \subsection{slide 51, concept2} Training: Combining concept 1 and 2 to minimize Loss $l_i (X), of X= {x_1,x_2,..,x_N} , E()$=expected value . For the whole $X$, the average loss is Recall $q_{\theta(en)(z|x_i)}$=prob.distribution of $z$ generated by $x_i$ (encoder side)\\ % We mentioned earlier we want $q_{\theta(en)(z|x_i)}$ to be close to Gaussian, put $P(z)=N(0,I)$\\ % $D_{KL}\big[ q_{\theta(en)}(z|x_i) || ( N(0,I) \big]$=difference of $q_{\theta(en)(z|x_i)}$ and Gaussian, see previous discussion on $D_KL[]$\\ % objective$\_$func2= $D_{KL}\big[ q_{\theta(en)(z|x_i)} || ( N(0,I) \big]$, this is to be minimized\\ % Overall$\_$objective$\_$function =objective$\_$funct1+objective$\_$func2\\ $=\frac{1}{N} \sum\limits_{x_i \epsilon X} \Big( \frac{1}{2 \sigma^2_{\hat {x_i}} |z} (x_i - \mu_{\hat{x_i}|z})^2 \Big) +D_{KL}\big[ q_{\theta(en)(z|x_i)} || ( N(0,I) \big]$\\ We have shown earlier that\\ $D_{KL}\big[ q_{\theta(en)}(z|x_i) || ( N(0,I) \big]= % +\frac{1}{2} \big\{ tr (\sigma^2(X)-I) +\mu(X)^T \mu(X)-log(det(\sigma^2(X))) \big\}$ % % %% $l=\frac{1}{N} \sum\limits_{x_i \epsilon X} \Big( \frac{1}{2 \sigma^2_{\hat {x_i}} |z} (x_i - \mu_{\hat{x_i}|z})^2 \Big) +\frac{1}{2} \big\{ tr (\sigma^2(X)-I) +\mu(X)^T \mu(X)-log(det(\sigma^2(X))) \big\}$\\ The first term is for concept 1 and the second term is for concept 2\\ We will run an iterative algorithm to minimize $l$ See \url{ http://bjlkeng.github.io/posts/variational-autoencoders/ & https://arxiv.org/abs/1312.6114}
  • #57: \subsection{slide 52:Use neural networks to implement system} $l=\frac{1}{N} \sum\limits_{x_i \epsilon X} \Big( \frac{1}{2 \sigma^2_{\hat {x_i}} |z} (x_i - \mu_{\hat{x_i}|z})^2 \Big) +\frac{1}{2} \big\{ tr (\sigma^2(X)-I) +\mu(X)^T \mu(X)-log(det(\sigma^2(X))) \big\}$\\
  • #60: %%%%%%%%%%%%%%%%%%%% \subsection{slide 55: VAE generative model} In theory, we use a sample of z from $q_{\theta(en)} (z|x_i)$ as input to sample from $p_{\phi(de)}(\hat X_i| Z) $ to give an approximate reconstruction of $x_i$\\ Alternatively, if we sample any $z$ from $N(0,1)$ and use it as input to sample from $p_{\phi(de)}(\hat X_i | Z) $ then we can approximate the entire data distribution $p( )$. I.e., we can generate new samples that look like the input but aren’t in the input.
  • #65: %%%%%%%%%%%%%%%%%%%% \subsection{slide 55: VAE generative model} In theory, we use a sample of z from $q_{\theta(en)} (z|x_i)$ as input to sample from $p_{\phi(de)}(\hat X_i| Z) $ to give an approximate reconstruction of $x_i$\\ Alternatively, if we sample any $z$ from $N(0,1)$ and use it as input to sample from $p_{\phi(de)}(\hat X_i | Z) $ then we can approximate the entire data distribution $p( )$. I.e., we can generate new samples that look like the input but aren’t in the input.
  • #67: \subsection{slide 67: Summary: Forward pass} $\epsilon$= the generated variable\\ $\mu$ = mean\\ $\sigma$= variance\\ $Z= \epsilon \sigma_x +\mu_x$
  • #102: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{slide 64 :Derivation} We want to find $\min_\theta(E_q[X^2])$, thus we need to find $\nabla_\theta E_q[x^2]$\\ Since $\nabla_\theta q_\theta (x)=q_\theta(x)$, and $\frac{d(log(y))}{dx}=\frac{1}{log(y)}\frac{dy}{dx}$\\ Thus, $\nabla_\theta log(q_\theta (x)) =\frac{1}{q_\theta (x)} \nabla (q_\theta(x))$ -----(i)\\ % $\nabla\theta E_q[x^2]=\nabla_\theta \int q_\theta(x) x^2 dx$, by definition of exception\\ % $\nabla\theta E_q[x^2]= \int \nabla_\theta q_\theta(x) \frac{q_\theta(x)}{q_\theta(x)} \Big) x^2 dx$,\\ $=\int \Big( \frac{1}{q_\theta(x)} \nabla_\theta q_\theta(x) \Big) q_\theta(x) x^2 dx$---(ii), put (i) in (ii)\\ %%% $\nabla _\theta E_q[x^2]= \int q_\theta(x) \nabla_\theta log (q_\theta(x)) x^2 dx$\\ $=E_q[\nabla_\theta log(q_\theta(x))x^2]$, also by definition of expectation\\ % If $q_\theta = N(\theta, I) $ is a normal distribution of mean =$\theta$, variance =1\\ hence $\nabla_\theta log (q_\theta(x))=x-\theta$ (see appendix 1), therefore\\ $\nabla_\theta E_q[x^2]=E_q[x^2(x-\theta)]$, since $X=\theta + \epsilon, \epsilon \approx N(0,1)$\\ % then $E_q[x^2]=E_p[(\theta+\epsilon)^2]$, where p is the distribution of $\epsilon$,\\ i.e. $\epsilon \approx N(0,1)$, therefore derivative of $E_q[x^2]$ is \\ $\nabla_\theta E_q[x^2]=\nabla_\theta E_p [(\theta+\epsilon)^2]=E_p[(2\theta+\epsilon)]$\\ %%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%% Note by definition: $\frac{d}{dx}ln(x)=\frac{1}{x}, \text{ and also } \frac{log_b(x)}{dx}=\frac{1}{ln(b) \cdot x}$\\
  • #105: %%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{slide 68:Summary: Backpropagation} % The gradient during backpropagation is\\ $\nabla E_q[x^2]=\nabla_\theta E_p [(\theta+\epsilon)^2]=E_p[2(\theta+\epsilon)]$\\ % $\epsilon$ is found the generated variable\\ $\theta$ is given\\ So the gradient can be found and used in backpropagation
  • #106: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{slid 63:Gradient for back propagation} Derivation of $\nabla_{\theta}E_q[x^2]$\\ $X=\theta+\epsilon,\epsilon \approx N(0,I) $\\ Then $E_q [x^2]=E_p[(\theta+\epsilon)^2]$\\ $P= \text { distribution of } \epsilon \approx N(0,I)$\\ Thus, derivative of $E_q[x^2]$ $=\nabla_\theta E_q[x^2]=\nabla_\theta E_p[(\theta + \epsilon)^2]$ $=E_p[2(\theta+\epsilon)]$