Ch10. Auto-encoders
KH Wong
Ch10. Auto and variational encoders
v230607d
Two types of autoencoders
• Part 1: Vanilla (i.e., traditional or classical) Autoencoder
– or simply called an Autoencoder
• Part 2: Variational Autoencoder
Part 1:
Overview of Vanilla
(traditional/classical) Autoencoder
• Introduction
• Theory
• Architecture
• Application
• Examples
Introduction
• What is an autoencoder?
– An unsupervised method
• Applications
– Noise removal
– Dimensionality reduction
• Method
– Use noise-free ground-truth data (e.g. MNIST) plus self-generated noise to train the network
– The trained network can then remove noise from the input (e.g. handwritten characters); the output will be similar to the ground-truth data
Noise removal
• https://www.slideshare.net/billlangjun/simple-introduction-to-autoencoder
Result:
plt.title('Original images: top rows,'
'Corrupted Input: middle rows, '
'Denoised Input: third rows')
Perfect input + noise
Auto encoder Structure
An autoencoder is a
feedforward neural network
that learns to predict the
input (corrupted by noise)
itself in the output.
• The input-to-hidden part
corresponds to an encoder
• The hidden-to-output part
corresponds to a decoder.
• Input and output are of
the same dimension and
size.
https://towardsdatascience.com/deep-autoencoders-using-tensorflow-c68f075fd1a3
Noisy
Input
x
De-noised
Output
x‘
encoder decoder
Neural network after training
x‘
x
Z (code)
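For concreteness, a minimal dense autoencoder with this encoder/code/decoder shape might be sketched in Keras as below; the layer sizes (784–32–784) are illustrative assumptions, not values from the slides.

# Minimal sketch of the encoder -> code z -> decoder structure (sizes are assumptions).
import numpy as np
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

x_in = Input(shape=(784,), name='x')                          # noisy input x
z = Dense(32, activation='relu', name='code')(x_in)           # encoder: input -> code z
x_out = Dense(784, activation='sigmoid', name='x_rec')(z)     # decoder: code z -> de-noised x'

autoencoder = Model(x_in, x_out)
autoencoder.compile(optimizer='adam', loss='mse')             # reconstruction (squared-error) loss
autoencoder.summary()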
Theory
(W=weight, b=bias)
Autoencoders are trained to
minimize reconstruction errors
(such as squared errors), often
referred to as the "loss (L)":
• Encoder: z = σ(Wx + b)   (*)
• Decoder: x′ = σ′(W′z + b′)   (**)
• By combining (*) and (**), the loss is
Loss L(x, x′) = ||x − x′||² = ||x − σ′( W′ σ(Wx + b) + b′ )||²
• Encoder: input x → code z; Decoder: code z → output x′
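A small NumPy sketch of equations (*), (**) and the loss, with σ taken as the logistic sigmoid and arbitrary assumed dimensions:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
x = rng.random(6)                              # input x (6-dim, arbitrary)
W, b = rng.normal(size=(3, 6)), np.zeros(3)    # encoder weights W and bias b
Wp, bp = rng.normal(size=(6, 3)), np.zeros(6)  # decoder weights W' and bias b'

z = sigmoid(W @ x + b)                         # (*)  z  = sigma(W x + b)
x_rec = sigmoid(Wp @ z + bp)                   # (**) x' = sigma'(W' z + b')
L = np.sum((x - x_rec) ** 2)                   # L(x, x') = ||x - x'||^2
print(L)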
Exercise 1a,b,c
• How many input layers, hidden layers, output
layers in the figure shown? MC choices: How
many
• (a) input layer(s)?
• (b) hidden layer(s)?
• (c) Output layer(s)?
• How many neurons in these layers? MC
choices: How many neurons in these layers?
• (d) input layer?
• (e) hidden layers: choices:
– 1) 3
– 2) 6
– 3) 8
– 4) 10
• (f) output layer?
• (g) Which is true on the number of neurons?
– 1) input neurons more than output neurons
2) input neurons same as output neurons
– 3) input neurons less than output neurons
Input Output
Answer : Exercise 1
• How many input layers,
hidden layers, output layers in
the figure shown?
– Answer: input=1, hidden=3,
output layer=1
• How many neurons in these
layers?
– Answer: input(=4),
hidden(3,2,3),total=8 (choice
3), output (=4)
• What is the relation between
the number of input and
output neurons?
– Answer: same (choice 2)
Input Output
Architecture
• Encoder and decoder
• Training can use
typical
backpropagation
methods
https://towardsdatascience.com/how-to-
reduce-image-noises-by-autoencoder-
65d5e6de543
Training
• Apply clean MNIST data set + added noise to be used as input,
• Use clean MNIST data set as output
• Train the autoencoder using backpropagation
Added noise
Autoencoder training by
backpropagation
+
Clean MNIST
samples
Clean MNIST samples
same
Recall
• After training, autoencoders can be used to
remove noise
Trained
autoencoder
Noisy
Input
De-noised
Output
Exercise 2a,b: Auto-encoder training
• (Q.2a) For (epoch=1;epoch <=max_epoch ; epoch++)
– {For all 10,000 images{
• Core code:
• Use backpropagation to train the whole
autoencoder network (encoder + decoder)}
• Break if Loss is too small }
• MC question: In core code, choices:
1. Feed each clean image to the input, and Present
the clean image to the output
2. Feed each clean image+noise to the output, and
Present the clean image to the input
3. Feed each clean image+noise to the input, and
Present the clean image to the output
• (Q.2b) If the trained encoder receives a noisy image of a
handwritten numeral, what do you expect at the output?
– MC choice: 1) a denoised image; 2) input + noise
– 3) same as input ; 4) pure random noise
Noise clean image
for numeral
“2”
auto-encoder
Input output
Answer: Exercise 2a,b
• Answer 2(a): Auto-encoder training
• For (epoch=1;epoch <=max_epoch ; epoch++)
– {For all 10,000 images{
• Feed each clean image plus noise to the
(encoder) input
• Present the clean image of the numerical to
the output (of the decoder),
• Use backpropagation to train the whole
autoencoder network (encoder + decoder)
• }
• Break if Loss is too small
– }
• Ex.2(b) Autoencoder usage: If the trained encoder
receives a noisy image of a handwritten numeral,
what do you expect at the output?
– Answer 2(b): a denoised image of the real input numeral image (choice 1 is correct)
+
Noise clean image
for numeral
“2”
auto-encoder
Core code
Choice 3
is correct
Input Output
Sample
Code:
Part(i):
obtain
dataset
and add
noise
https://towardsdatascience.
com/how-to-reduce-image-
noises-by-autoencoder-
65d5e6de543
• #part1 ---------------------------------------------------
• np.random.seed(1337)
• # MNIST dataset
• (x_train, _), (x_test, _) = mnist.load_data()
• image_size = x_train.shape[1]
• x_train = np.reshape(x_train, [-1, image_size, image_size, 1])
• x_test = np.reshape(x_test, [-1, image_size, image_size, 1])
• x_train = x_train.astype('float32') / 255
• x_test = x_test.astype('float32') / 255
• # Generate corrupted MNIST images by adding noise with normal dist
• # centered at 0.5 and std=0.5
• noise = np.random.normal(loc=0.5, scale=0.5, size=x_train.shape)
• x_train_noisy = x_train + noise
• noise = np.random.normal(loc=0.5, scale=0.5, size=x_test.shape)
• x_test_noisy = x_test + noise
• x_train_noisy = np.clip(x_train_noisy, 0., 1.)
• x_test_noisy = np.clip(x_test_noisy, 0., 1.)
Part (ii):First build
the Encoder Model
• #part2 ---------------------------------------------------
• # Network parameters
• input_shape = (image_size, image_size, 1)
• batch_size = 128
• kernel_size = 3
• latent_dim = 16
• # Encoder/Decoder number of CNN layers and filters per layer
• layer_filters = [32, 64]
• # Build the Autoencoder Model
• # First build the Encoder Model
• inputs = Input(shape=input_shape, name='encoder_input')
• x = inputs
• # Stack of Conv2D blocks
• # Notes:
• # 1) Use Batch Normalization before ReLU on deep networks
• # 2) Use MaxPooling2D as alternative to strides>1
• # - faster but not as good as strides>1
• for filters in layer_filters:
• x = Conv2D(filters=filters,
• kernel_size=kernel_size,
• strides=2,
• activation='relu',
• padding='same')(x)
• # Shape info needed to build Decoder Model
• shape = K.int_shape(x)
• # Generate the latent vector
• x = Flatten()(x)
• latent = Dense(latent_dim, name='latent_vector')(x)
• # Instantiate Encoder Model
• encoder = Model(inputs, latent, name='encoder')
• encoder.summary()
Part (iii):Build the
Decoder Model
• #part3 ---------------------------------------------------
• # Build the Decoder Model
• latent_inputs = Input(shape=(latent_dim,), name='decoder_input')
• x = Dense(shape[1] * shape[2] * shape[3])(latent_inputs)
• x = Reshape((shape[1], shape[2], shape[3]))(x)
• # Stack of Transposed Conv2D blocks
• # Notes:
• # 1) Use Batch Normalization before ReLU on deep networks
• # 2) Use UpSampling2D as alternative to strides>1
• # - faster but not as good as strides>1
• for filters in layer_filters[::-1]:
• x = Conv2DTranspose(filters=filters,
• kernel_size=kernel_size,
• strides=2,
• activation='relu',
• padding='same')(x)
• x = Conv2DTranspose(filters=1,
• kernel_size=kernel_size,
• padding='same')(x)
• outputs = Activation('sigmoid', name='decoder_output')(x)
• # Instantiate Decoder Model
• decoder = Model(latent_inputs, outputs, name='decoder')
• decoder.summary()
• # Autoencoder = Encoder + Decoder
• # Instantiate Autoencoder Model
• autoencoder = Model(inputs, decoder(encoder(inputs)), name='autoencoder')
• autoencoder.summary()
• autoencoder.compile(loss='mse', optimizer='adam')
Part (iv): Train the
autoencoder,
decode images
display result
• #part4 ---------------------------------------------------
• # Train the autoencoder
• autoencoder.fit(x_train_noisy,
• x_train,
• validation_data=(x_test_noisy, x_test),
• epochs=30,
• batch_size=batch_size)
• # Predict the Autoencoder output from corrupted test images
• x_decoded = autoencoder.predict(x_test_noisy)
• # Display the 1st 8 corrupted and denoised images
• rows, cols = 10, 30
• num = rows * cols
• imgs = np.concatenate([x_test[:num], x_test_noisy[:num],
x_decoded[:num]])
• imgs = imgs.reshape((rows * 3, cols, image_size, image_size))
• imgs = np.vstack(np.split(imgs, rows, axis=1))
• imgs = imgs.reshape((rows * 3, -1, image_size, image_size))
• imgs = np.vstack([np.hstack(i) for i in imgs])
• imgs = (imgs * 255).astype(np.uint8)
• plt.figure()
• plt.axis('off')
• plt.title('Original images: top rows, '
• 'Corrupted Input: middle rows, '
• 'Denoised Input: third rows')
• plt.imshow(imgs, interpolation='none', cmap='gray')
• Image.fromarray(imgs).save('corrupted_and_denoised.png')
• plt.show()
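As the comments in the listing suggest, the trained encoder can also be used on its own to produce latent vectors (e.g. for low-dimensional visualization); a minimal sketch reusing the variable names defined above:

# Reuse the trained encoder alone to get the 16-D latent codes of the test images.
latent_codes = encoder.predict(x_test_noisy)   # shape: (num_test_images, latent_dim)
print(latent_codes.shape)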
Code https://towardsdatascience.com/how-to-reduce-image-noises-by-autoencoder-65d5e6de543
Result: plt.title('Original images: top rows, '
'Corrupted Input: middle rows, '
'Denoised Image: third rows')
• '''Trains a denoising autoencoder on MNIST dataset.
• https://towardsdatascience.com/how-to-reduce-image-noises-by-autoencoder-65d5e6de543
• Denoising is one of the classic applications of autoencoders.
• The denoising process removes unwanted noise that corrupted the
• true signal.
• Noise + Data ---> Denoising Autoencoder ---> Data
• Given a training dataset of corrupted data as input and
• true signal as output, a denoising autoencoder can recover the
• hidden structure to generate clean data.
• This example has modular design. The encoder, decoder and autoencoder
• are 3 models that share weights. For example, after training the
• autoencoder, the encoder can be used to generate latent vectors
• of input data for low-dim visualization like PCA or TSNE.
• '''
• #keras >> tensorflow.keras, modification by khw
• from __future__ import absolute_import
• from __future__ import division
• from __future__ import print_function
• import tensorflow.keras as keras
• from tensorflow.keras.layers import Activation, Dense, Input
• from tensorflow.keras.layers import Conv2D, Flatten
• from tensorflow.keras.layers import Reshape, Conv2DTranspose
• from tensorflow.keras.models import Model
• from tensorflow.keras import backend as K
• from tensorflow.keras.datasets import mnist
• import numpy as np
• import matplotlib.pyplot as plt
• from PIL import Image
• # (The remainder of the listing repeats Parts (i)–(iv) above: load and corrupt MNIST, build the encoder and decoder, train the autoencoder, and display the results.)
Exercise 3
• Discuss applications of a Vanilla (traditional)
autoencoder.
• Which of the following is true? MC choices:
1) Image recognition
2) Denoise input images + Image recognition
3) Denoise input images +Dimensionality Reduction
4) Denoise input images only
Answer: Exercise 3
• Discuss applications of a Vanilla (traditional) autoencoder.
• Which of the following is true? MC choices:
1) Image recognition
2) Denoise input images + Image recognition
3) Denoise input images +Dimensionality Reduction (correct)
4) Denoise input images only
• More information, see https://en.wikipedia.org/wiki/Autoencoder
– Dimensionality Reduction
– Relationship with principal component analysis (PCA)
– Information Retrieval
– Anomaly Detection
– Image Processing
– Drug discovery
Part 2: Variational autoencoder
Will learn
• What a variational autoencoder is
• How to train it
• How to use it
Some math background is needed:
• https://ljvmiranda921.github.io/notebook/20
17/08/13/softmax-and-the-negative-log-
likelihood/
• See appendix2: The expected negative log
likelihood
• Conditional expectation etc.
Variational Autoencoder (VAE) v.s. Traditional
Autoencoder
• Autoencoders (vanilla or traditional)
– During training you present a pattern with artificially added noise to the encoder, and feed the same input pattern (as target, or teacher) to the output. Then use backpropagation to train the autoencoder network.
– So it is unsupervised learning (no labelled data is needed).
– It can be used for data compression and noise removal.
– During recall, when a noisy pattern is presented to the input, a de-noised image will appear at the output.
• Variational autoencoders
– Instead of learning from an input pattern, Variational autoencoders
learn the parameters of a probability distribution function from the
input patterns. We then use the parameters learned to generate new
data. So, it is a generative model like GAN (Generative Adversarial
Network) in functionality.
Variational autoencoder
https://jaan.io/what-is-variational-autoencoder-vae-tutorial/
• Variational autoencoders are cool. They
let us design complex generative models
of data and fit them to large datasets.
They can generate images of fictional
celebrity faces and high-resolution digital
artwork.
• VAE faces
• VAE faces demo
• VAE MNIST
• VAE street addresses
• https://jaan.io/what-is-variational-
autoencoder-vae-tutorial/
• Similar techniques may be used in software such as Deepfake (https://en.wikipedia.org/wiki/Deepfake)
Fictional celebrity faces generated by a variational autoencoder (by Alec Radford).
Example: Applying VAE for MNIST data
set extension
•
https://arxiv.org/pdf/1312.6114.pdf
Output: generated image
Dataset (images extended)
Input: original image
data set
Some background:
Univariate and Multivariate Gaussian
• https://ttic.uchicago.edu/~shubhendu/Slides/Estimation.pdf
Univariate Gaussian (1-dimensional), data sample x, mean μ, variance σ²:
N(x | μ, σ²) = 1/(2πσ²)^(1/2) · exp( −(x − μ)² / (2σ²) )
Multivariate Gaussian (d-dimensional), data sample x, mean μ, covariance Σ:
N(x | μ, Σ) = 1/((2π)^(d/2) |Σ|^(1/2)) · exp( −½ (x − μ)ᵀ Σ⁻¹ (x − μ) )
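A quick numerical check of these formulas (a sketch using SciPy; the example mean and covariance are arbitrary assumptions):

import numpy as np
from scipy.stats import norm, multivariate_normal

# Univariate: N(x | mu=0, sigma^2=1) at x=0 equals 1/sqrt(2*pi) ~= 0.3989
print(norm.pdf(0.0, loc=0.0, scale=1.0))

# Multivariate (d=2): the density at the mean equals 1/((2*pi)^(d/2) * sqrt(det(Sigma)))
mu = np.array([3.0, 3.0])
Sigma = np.diag([2.5**2, 2.5**2])
print(multivariate_normal.pdf(mu, mean=mu, cov=Sigma))
print(1.0 / (2 * np.pi * np.sqrt(np.linalg.det(Sigma))))   # same value, by the formula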
Properties of Gaussian (Normal) distribution
• Standard Normal
distribution (1-dimension):
• Red line, when mean()=0,
Sigma ()=1
– At (x-)=0,  =1
– G(x) =1/sqrt(2*pi)=0.3989
• At x=1*, drops off to
– (1/sqrt(2*pi))*exp(-1^1/2)=0.2420
– Area covered 68.2%
• At x=2*, drops off to
– (1/sqrt(2*pi))*exp(-2^2/2)= 0.0540
– Area covered 95.44%
• At x=3*, drops off to
– (1/sqrt(2*pi))*exp(-2^2/2)= ??
(exercise)
– Area covered 99.73%
http://en.wikipedia.org/wiki/Normal_distribution
Probability density function of the 1-D Gaussian (standard deviation σ, variance σ², mean μ):
G(x) = 1/√(2πσ²) · exp( −(x − μ)² / (2σ²) ),  ∫ G(x) dx = 1
(Figure: standard normal distribution and the area covered, total = 100%.)
μ sets the horizontal shift; σ controls the shape.
The so-called 95% confidence interval is μ ± 2σ.
Gaussian (Normal) functions 1D,2D
1-D Gaussian (standard deviation σ, mean μ):
G(x) = 1/√(2πσ²) · exp( −(x − μ)² / (2σ²) )
2-D Gaussian:
G(x, y) = G(x)G(y) = 1/(2πσ²) · exp( −((x − μx)² + (y − μy)²) / (2σ²) )
(Figures: 1-D and 2-D Gaussian plots.)
Example : A 1-D and 2-D Gaussian
distribution
• %2-D Gaussian distribution P(xj)
• %matlab code----------
• clear, N=10
• [X1,X2]=meshgrid(-N:N,-N:N);
• sigma =2.5; mean=[3 3]'
• G=1/(2*pi*sigma^2)*exp(-((X1-mean(1)).^2+(X2-mean(2)).^2)/(2*sigma^2));
• G=G./sum(G(:)) %normalise it
• 'sigma is ', sigma
• 'sum(G(:)) is ',sum(G(:))
• 'max(max(G(:))) is',max(max(G(:)))
• figure(1), clf
• surf(X1,X2,G);
• xlabel('x1'),ylabel('x2')
1-D Gaussian, a sample x, mean μ0, variance σ0²:
N(x | μ0, σ0²) = 1/(2πσ0²)^(1/2) · exp( −(x − μ0)² / (2σ0²) )
2-D isotropic (circularly symmetric) Gaussian, assume mean = 0:
N(x1, x2 | 0, σ²) = 1/(2πσ²) · exp( −(x1² + x2²) / (2σ²) )
Exercise 4
• In Box 1, sigma ()=2
• x=mx y=my
• Mc choices:
1) G(x,y)=1/(2*pi*2+2)
2) G(x,y)=1/(2*pi*2)
3) G(x,y)=1/(2*pi*2^4)
4) G(x,y)=1/(2*pi*2^2)
• Student
exercise:
• Fill in the blanks of this Gaussian mask of size 9x9, sigma (σ) = 2
• Sketch the
function
• G(x,y)=
• 0.0007 0.0017 0.0033 0.0048 0.0054 0.0048 0.0033 0.0017 0.0007
• 0.0017 0.0042 0.0078 0.0114 0.0129 0.0114 0.0078 0.0042 0.0017
• 0.0033 0.0078 0.0146 0.0213 0.0241 0.0213 0.0146 0.0078 0.0033
• 0.0048 0.0114 0.0213 0.0310 0.0351 0.0310 0.0213 0.0114 0.0048
• 0.0054 0.0129 0.0241 0.0351 BOX1 ? ____? 0.0241 0.0129 0.0054
• 0.0048 0.0114 0.0213 0.0310 0.0351 ____? 0.0213 0.0114 0.0048
• 0.0033 0.0078 0.0146 0.0213 0.0241 0.0213 0.0146 0.0078 0.0033
• 0.0017 0.0042 0.0078 0.0114 0.0129 0.0114 0.0078 0.0042 0.0017
• 0.0007 0.0017 0.0033 0.0048 0.0054 0.0048 0.0033 0.0017 0.0007
2-D Gaussian, mean (mx, my):
G(x, y) = G(x)G(y) = 1/(2πσ²) · exp( −((x − mx)² + (y − my)²) / (2σ²) )
(Labels: Box1 is the centre cell at x = mx, y = my; the cell next to it is at x = 1+mx, y = my.)
Answer: Exercise 4
Fill in the blanks of the Gaussian mask of size 9x9, sigma (σ) = 2
• 0.0007 0.0017 0.0033 0.0048 0.0054 0.0048 0.0033 0.0017 0.0007
• 0.0017 0.0042 0.0078 0.0114 0.0129 0.0114 0.0078 0.0042 0.0017
• 0.0033 0.0078 0.0146 0.0213 0.0241 0.0213 0.0146 0.0078 0.0033
• 0.0048 0.0114 0.0213 0.0310 0.0351 0.0310 0.0213 0.0114 0.0048
• 0.0054 0.0129 0.0241 0.0351 0.0398 0.0351 0.0241 0.0129 0.0054
• 0.0048 0.0114 0.0213 0.0310 0.0351 0.0310 0.0213 0.0114 0.0048
• 0.0033 0.0078 0.0146 0.0213 0.0241 0.0213 0.0146 0.0078 0.0033
• 0.0017 0.0042 0.0078 0.0114 0.0129 0.0114 0.0078 0.0042 0.0017
• 0.0007 0.0017 0.0033 0.0048 0.0054 0.0048 0.0033 0.0017 0.0007
clear %matlab
sigma=2 % in matlab, no -ve index for looping, so shift the center to (5,5)
mean_x=5 , mean_y=5
for y=1:9
for x=1:9
g(x,y)=(1/(2*pi*sigma^2))*exp(-((x-mean_x)^2+(y-mean_y)^2)/(2*sigma^2))
end
end
mesh(g)
title('2D Gaussian function')
Box 1 = 1/(2*pi*2^2): choice 4 is correct, because x = mx, y = my, thus exp( −((x−mx)² + (y−my)²) / (2σ²) ) = 1.
The neighbouring cell at x = 1+mx, y = my is 1/(2*pi*2^2)*exp(-1/8); the diagonal neighbour is 1/(2*pi*2^2)*exp(-2/8).
2-D Gaussian, mean (mx, my): G(x, y) = G(x)G(y) = 1/(2πσ²) · exp( −((x − mx)² + (y − my)²) / (2σ²) )
Variational autoencoder
• A neural network view
https://www.jeremyjordan.me/variational-autoencoders/
Multivariate Gaussian:
Mean = µ
σ = standard deviation
Variance = σ²
Generative Models concept
• It is an unsupervised learning method that generates new samples using training data from the same distribution
• E.g., you have a limited number of samples but want to create more samples from the same probability distribution for machine learning purposes. Other examples include:
– Creating new cartoon figures
– Generating faces from images of celebrities.
– Creating new fashions.
– Creating new written characters for training optical character
recognition systems of some languages
• Generative model algorithms
– Variational autoencoder (discussed here)
– Generative adversarial network (GAN) not discussed here
Variational autoencoder for generative
models
• Use training samples to learn hidden data (the parameters of a multivariate Gaussian: standard deviations = σs, means = µs). After training you may create new output from some input and the weighted σs and µs. You may change the weights of the σs and µs for a variety of related but different outputs.
https://www.quora.com/Whats-the-difference-between-a-Variational-Autoencoder-VAE-and-an-Autoencoder
Parameters of a multivariate Gaussian (standard deviations = σs, means = µs), e.g. 50 µs, 30 σs
Application example: Use Generative
Models for MNIST data extension
http://yann.lecun.com/exdb/mnist/
•
During training, patterns are fed into the input and output one by one; learn µ, σ by minimizing the loss.
After training: data generation phase.
Generated extended data set
MNIST original data set
Random generator layer using 30 µs, 30 σs
z
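A sketch of this generation phase in Keras, assuming a decoder model like the one trained earlier (the variable names and latent_dim are assumptions):

import numpy as np

# Sample latent vectors z from a standard normal and decode them into new images.
latent_dim = 16                          # must match the trained decoder (assumption)
z = np.random.normal(size=(25, latent_dim))
new_images = decoder.predict(z)          # e.g. shape (25, 28, 28, 1) for MNIST-sized output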
Exercise 5:What is the architectural difference
between Vanilla (traditional) autoencoder and
Variational autoencoder?
• MC: Which is incorrect?
1) In Vanilla (traditional)
autoencoder: input to output
are directly connected by
neurons and weights.
2) In Variational autoencoder: the encoder turns the input (x) into means (µs) and standard deviations (σs) of a multivariate Gaussian distribution, then uses a random sampling method to create the output.
3) In Variational autoencoder :
input to output are directly
connected by neurons and
weights.
4) In Variational autoencoder: the number of mean (µs) and standard deviation (σs) neurons are the same.
Vanilla autoencoder
E.g. 30 µs, 30 σs
z
Answer Exercise 5:What is the architectural
difference between Vanilla (traditional)
autoencoder and Variational autoencoder?
• MC: Which is incorrect?
1) In Vanilla (traditional)
autoencoder: input to output
are directly connected by
neurons and weights.
2) In Variational autoencoder: the encoder turns the input (x) into means (µs) and standard deviations (σs) of a multivariate Gaussian distribution, then uses a random sampling method to create the output.
3) In Variational autoencoder :
input to output are directly
connected by neurons and
weights. (This is incorrect)
4) In Variational autoencoder: the number of mean (µs) and standard deviation (σs) neurons are the same.
Vanilla autoencoder
E.g. 30 µs, 30 σs
z
Exercise 6a,b for Variational
autoencoder VAE
• Which statement is incorrect for
VAE?: MC choices:
1) Because the search space is large, there are too many combinations of means (µs) and standard deviations (σs) for generating the same output.
2) There are multiple solutions for the means (µs) and standard deviations (σs)
3) There is a deterministic linear
solution for VAE
4) Neural network provides a
solution for VAE.
• (b) Discuss exercise for students:
what is a multivariate-Gaussian
distribution.
From https://en.wikipedia.org/wiki/Multivariate_normal_distribution (2 dimensions)
Answer: Exercise 6a,b for
Variational autoencoder VAE
• Which statement is incorrect for
VAE?: MC choices:
(choice3)There is a deterministic linear
solution for VAE (this is incorrect)
• (b) Discuss exercise for students:
what is a multivariate-Gaussian
distribution.
• Answer: Multivariate-dimensional
Gaussian:
• In probability theory and statistics,
the multivariate normal
distribution, multivariate Gaussian
distribution, or joint normal
distribution is a generalization of
the one-dimensional
(univariate) normal distribution to
higher dimensions. One definition is
that a random vector is said to be k-
variate normally distributed if
every linear combination of
its k components has a univariate
normal distribution.
From https://en.wikipedia.org/wiki/Multivariate_normal_distribution (2 dimensions)
Example of variational autoencoder
• Neural network
https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf
By random sampling
Random generator layer
Z
X̂
X
Training of Vanilla and Variational
Autoencoders
• Training of variational autoencoders is like training the
vanilla autoencoders. E.g., for the de-noised application,
presents noisy images to the input and clean image
versions to the output. Use backpropagation to train the
network. Read our previous discussion on vanilla
autoencoder
https://www.edureka.co/blog/autoencoders-tutorial/
http://www.math.purdue.edu/~buzzard/MA598-Spring2019/Lectures/Lec18%20-%20VAE.pptx
Variational Autoencoder (VAE)
• The latent variables, Z, are drawn from a probability distribution depending on the input, X, and the reconstruction is chosen probabilistically from z.
• That means after you obtain mean = µ and variance = σ², you sample from X (n = 500 neurons) to get Z (k = 30 neurons)
• X = (x1, x2, …, xn)
• Z = (z1, z2, …, zk)
https://jaan.io/what-is-variational-autoencoder-vae-tutorial/
https://jaan.io/what-is-variational-autoencoder-vae-tutorial/
Z
Encoder
Q (z|X)
Decoder
P (X|z)
Z=Latent
Variables
By sampling:
Z = sample from a distribution N(µ, σ)
X X̂
Three difficult concepts in VAE
1) Train the neural network to
maximize input/output likelihood
2) Use of Divergence (DKL)
3) Reparameterization
Variational Autoencoders
VAE Concept 1
Train the neural network to maximize
input/output likelihood
Tutorial on Variational Autoencoders
Carl Doersch
https://arxiv.org/abs/1606.05908
VAE Encoder
• The Encoder q(en)(z|x) takes input x and returns Hidden
parameters Z (random generated from µ,). (=encoder
parameters. weights/biases)
• From Z, use sampling to create input to the decoder
• Encoders and Decoders are neural networks (NN)
• Parameters in the NN are needed to be learned – so we have
to set up a loss function.
https://jaan.io/what-is-variational-autoencoder-vae-tutorial/
http://gregorygundersen.com/blog/2018/04/29/reparameterization/
https://jaan.io/what-is-variational-autoencoder-vae-tutorial/
Encoder(XZ)
q(en)(z|x)
Input
Data
Decoder(Z )
Hidden
Z
Output
ted
Reconstruc
X
X̂
 
Z
X
P de |
ˆ
)
(

X-> encoder –>Z->decoder x^
X̂
 
VAE Decoder
• The decoder takes the hidden variable Z (generated from the means and standard deviations) as input, and reconstructs the image X̂ using random sampling methods. (θ = decoder parameters: weights/biases)
• Encoders and Decoders are Neural Networks (NN)
• The parameters (θ, φ) in the NN need to be learned – so we have to set up a loss function.
https://jaan.io/what-is-variational-autoencoder-vae-tutorial/
https://jaan.io/what-is-variational-autoencoder-vae-tutorial/
Encoder (X→Z): q(en)φ(z|x); Input data: X; Hidden: Z; Decoder (Z→X̂): P(de)θ(X̂|z); Reconstructed output: X̂
The reconstruction loss (l(rec)) = the "expected negative log-likelihood" of the VAE
• Given xi ∈ X, z ∼ Q, E() is the expected value
• The idea is to train the Encoder/Decoder (neural network) to maximize the likelihood, or equivalently to minimize the binary cross-entropy (BCE) or mean squared error (MSE) between x and the reconstructed x̂
• To maximize the likelihood, we minimize the reconstruction loss = the "expected negative log-likelihood" (li) of the i-th datapoint xi (see appendix 2)
l_i^(rec)(θ, φ) = −E_{z∼Q_φ(z|x_i)}[ log P_θ^(de)( x̂_i | z ) ]
(Diagram: input data x_i → encoder q(en)φ(z|x_i) → hidden Z (µ, σ) → decoder P(de)θ(x̂_i|z) → reconstructed output x̂_i; the reconstruction loss l_i^(rec), measured by MSE or BCE between x_i and x̂_i, is to be minimized.)
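In code, this expected negative log-likelihood term is typically implemented as a per-image MSE (or BCE) between x_i and x̂_i; a minimal Keras-backend sketch (an illustrative formulation, not code from the slides):

import tensorflow.keras.backend as K

def reconstruction_loss(x, x_rec):
    # x, x_rec: (batch, num_pixels) flattened images.
    # Per-image squared error, i.e. the negative Gaussian log-likelihood up to constants.
    return K.sum(K.square(x - x_rec), axis=-1)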
Variational Autoencoders
VAE Concept 2
Use of Divergence (DKL):
Similar training images should produce
similar hidden data (means and
standard deviations)
http://mi.eng.cam.ac.uk/~mjfg/local/4F10/lect4.pdf
https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
https://jhui.github.io/2017/03/06/Variational-autoencoders/ (for relating
covariance and standard deviations, with good example)
How to make sure the neural networks produce similar hidden
data (means & standard deviations) from similar training images
• Problem: Inputs that we regard as similar may end up very different in z space (hidden means and standard deviations). That means some solutions may give a small loss l_i^(all)(θ, φ) even when q(en) and p(de) have very different distributions.
• Solution: Use p(z) = N(0, 1) and try to force q(en)(z|xi) (a neural network) to act similarly to a standard normal probability density function. We can use the Kullback-Leibler divergence (DKL) to do the checking.
For encoder and decoder
We discussed this in concept 1:
https://jaan.io/what-is-variational-autoencoder-vae-tutorial/
https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
http://gregorygundersen.com/blog/2018/04/29/reparameterization/
This is for concept 2:
We will minimize L(all):
L^(all)(θ, φ) = Σ_{i=1}^{n} [ l_i^(rec)(θ, φ) + D_KL( q^(en)(z|x_i) || N(0, I) ) ],  x_i ∈ X
where the first term is the loss between input x_i and output x̂_i, and the second term is the difference between q^(en) and a Gaussian.
Math background: Kullback–Leibler divergence (also known as relative
entropy) measures how one probability distribution is different from
another one -- reference probability distribution over the same variable
X.
•
Tutorial on Variational Autoencoders by Carl Doersch &
https://arxiv.org/abs/1606.05908
For two multivariate normal distributions N(μ1, Σ1) and N(μ2, Σ2) of dimension d:
D_KL( N(μ1, Σ1) || N(μ2, Σ2) ) = ½ [ tr(Σ2⁻¹Σ1) + (μ2 − μ1)ᵀ Σ2⁻¹ (μ2 − μ1) − d + ln( det Σ2 / det Σ1 ) ]   (I)
If N1 = N(μX, ΣX) and N2 = N(0, I), this becomes
D_KL( N(μX, ΣX) || N(0, I) ) = ½ [ tr(ΣX) + μXᵀ μX − d − ln det ΣX ]
So for the encoder distribution Q(z|x_i) = N(μX, ΣX):
D_KL( Q(z|x_i) || N(0, I) ) = ½ [ tr(ΣX) + μXᵀ μX − d − ln det ΣX ]
For equation (I) See https://arxiv.org/pdf/1907.08956.pdf
https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_diver
gence
Kullback–Leibler divergence D_KL(D1 || D2) = 0 indicates the two distributions D1, D2 are identical; hence μ2 = 0, σ2² = 1.
N(0, I) = zero-mean, unit-variance Gaussian
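A NumPy sketch of the closed form D_KL( N(μ, Σ) || N(0, I) ) above, checking that it is 0 when (μ, Σ) = (0, I) (the test values are arbitrary):

import numpy as np

def kl_gaussian_vs_standard_normal(mu, Sigma):
    # D_KL( N(mu, Sigma) || N(0, I) ) = 0.5 * [ tr(Sigma) + mu^T mu - d - ln det(Sigma) ]
    d = mu.shape[0]
    return 0.5 * (np.trace(Sigma) + mu @ mu - d - np.log(np.linalg.det(Sigma)))

print(kl_gaussian_vs_standard_normal(np.zeros(3), np.eye(3)))             # 0.0 (identical distributions)
print(kl_gaussian_vs_standard_normal(np.array([1.0, 0.0, 0.0]), np.eye(3)))  # 0.5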
Training: Combining concept 1 and 2 to minimize Loss li (X), of X= {x1,x2,..,xN} ,
E()=expected value . For the whole X, the average loss is
• Input to the encoder: x_i; output of the decoder: x̂_i.
• Prob. distribution of z (the latent, hidden variable) generated by the encoder side: Q(z|x_i). Prob. distribution of x̂_i generated by the decoder side: P(x̂_i|z).
• E_{z∼Q(z|x_i)}[ log P(x̂_i|z) ] = expected value of log P of the x̂_i generated at the decoder output, where z is the random variable generated from a Gaussian {mean µ(z|x_i), stdev σ(z|x_i)}. At this stage z could be any distribution, but we can assume a Gaussian (en.wikipedia.org/wiki/Normal_distribution); it can be formed by scaling N(0, 1). The advantage is that once µ, σ are found by the encoder, we use a random generator to generate z, and the decoder uses z to produce the output x̂_i.
• We want to maximize log P(x̂_i|z) (make the input/output likelihood similar). In practice we use E_{x_i∼X}[ E_{z∼Q(z|x_i)}[ log P(x̂_i|z) ] ], and maximizing it is the same as minimizing the negative log-likelihood:
Objective_function1 = −E_{x_i∼X}[ E_{z∼Q(z|x_i)}[ log P(x̂_i|z) ] ]
• Since P(x̂_i|z) is Gaussian, minimizing Objective_function1 reduces to minimizing the squared error between x_i and x̂_{i|z} (scaled by 1/(2σ²_{x̂|z})), i.e. the reconstruction loss l_i^(rec)(θ, φ) of Concept 1.
Concept 1
See http://bjlkeng.github.io/posts/variational-autoencoders/ & https://arxiv.org/abs/1312.6114
Concept 2 (Objective_function2):
Recall: q^(en)(z|x_i) is the prob. distribution of z generated by the encoder side. We mentioned earlier that we want q^(en)(z|x_i) to be close to a Gaussian, so put p(z) = N(0, I):
Objective_function2 = D_KL( q^(en)(z|x_i) || N(0, I) ) — the difference between q^(en)(z|x_i) and a Gaussian (see the previous slides on DKL).
We have shown earlier that D_KL( N(μX, ΣX) || N(0, I) ) = ½ [ tr(ΣX) + μXᵀ μX − d − ln det ΣX ], thus
Overall_objective_function = Objective_function1 + Objective_function2:
L^(all)(θ, φ) = Σ_{x_i∈X} [ l_i^(rec)(θ, φ) + D_KL( q^(en)(z|x_i) || N(0, I) ) ]
We will run an iterative algorithm to minimize L^(all).
Training: Combining concept 1 and 2 to minimize the loss l_i(X), of X = {x1, x2, .., xN}, E() = expected value. For the whole X, the average loss is
l_i^(1)(θ, φ) = l_i^(rec)(θ, φ) + D_KL( q^(en)(z|x_i) || N(0, I) )
Concept 1 (reconstruction loss): l_i^(rec)(θ, φ)
Concept 2: D_KL( q^(en)(z|x_i) || N(0, I) )
See http://bjlkeng.github.io/posts/variational-autoencoders/ & https://arxiv.org/abs/1312.6114
For VAE implementation
• Input X = (x1, x2, …, xn)
• Using the encoder, from X we obtain k different Gaussian distributions: N(mean_j, StdDev_j)
• Each z_j is generated by N(µ_j, σ_j), where j = 1, .., k; then we have Z = (z1, z2, .., zk)
From the previous slide,
D_KL( q^(en)(z|x_i) || N(0, I) ) = D_KL( N( (μ1, .., μk)ᵀ, diag(σ1², .., σk²) ) || N(0, I) )
= ½ Σ_{j=1}^{k} ( σ_j² + μ_j² − 1 − ln σ_j² )
(this is the term to be minimized for the VAE application).
See https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
https://wiseodd.github.io/techblog/2016/12/10/variational-autoencoder/
Concept 2:
In practice
• we replace 2 with exp(2) to enable stability
in calculation. And for the minimization of DKL,
this replacement gives the same result
Ch10. Auto and variational encoders
v230607d
55
   
 
   
   
   
 
   
   
   
 
on
minimizati
during
use
will
we
function
actual
the
is
This
1
)
exp(
2
1
,
0
||
,..,
,
,..,
with
)
ln(
and
)
exp(
with
replace
,
n
calculatio
numerical
in
stablity
enable
To
ln
1
2
1
,
0
||
,..,
,
,..,
,
0
||
|
earlier
seen
have
We
1
2
2
2
1
1
2
2
2
2
1
2
2
2
1
1
)
(













k
j
j
j
j
k
T
k
KL
k
j
j
j
j
k
T
k
KL
i
en
KL
I
N
diag
N
D
I
N
diag
N
D
I
N
x
z
q
D



















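With the log-variance convention above (the network output is treated as ln σ_j²), this term is commonly coded as in the following Keras-backend sketch (a standard formulation; the variable names are assumptions):

import tensorflow.keras.backend as K

def kl_term(z_mean, z_log_var):
    # 0.5 * sum_j( exp(log sigma_j^2) + mu_j^2 - 1 - log sigma_j^2 ), per sample
    return 0.5 * K.sum(K.exp(z_log_var) + K.square(z_mean) - 1.0 - z_log_var, axis=-1)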
Use neural networks to implement the system
Use backpropagation to minimize
the loss function (concept3):
Binary_cross_entropy (BCE) or Mean
squared error (MSE) between input X
and output 𝑋
Use backpropagation to
minimize
the loss function L(all) of encoder:
(concept1 & 2)
Encoder neural network
Decoder neural network
Minimize loss L^(all):
L^(all)(θ, φ) = Σ_i [ || x_i − x̂_{i|z} ||² / (2 σ²_{x̂|z}) + D_KL( q^(en)(z|x_i) || N(0, I) ) ]
where the first term (concept 1) is the reconstruction loss between input X and output X̂, and the second term (concept 2) is D_KL( q^(en)(z|x_i) || N(0, I) ).
Input
Data
Concept 2
Concept 1
The training method
http://anotherdatum.com/vae.html
The latent
vector
represents
Gaussian
distributions
Input
and
output
are
similar
Minimize loss (L(all))
Using Concept 1 &2
X̂
X
z
Variational Autoencoders
VAE Concept 3
Reparameterization: the method to
enable backpropagation for training
neural network that involves random
processes
VAE generative model
• In theory, we can sample z_i from N(µ_i, σ_i) produced by the encoder. Note: N() = Gaussian function.
• Z is the input to the decoder to produce the output.
• Alternatively, we find z by sampling ε (called epsilon or eps) from N(0, 1) (Gaussian mean = 0, StdDev = 1), then find z using: z_i = µ_i + ε·σ_i
• Then z has mean = µ_i and StdDev = σ_i as required
• See gen_data_using_mean0_sigma1.m in the appendix
• This is called Reparameterization
• Reason: with this form we can back-propagate through the function during training
Train the variational-encoder
• How to train the
auto-encoder neural
network?
• Difficulty
– Since a random
process is involved,
backpropagation
cannot be executed
• Solution
– Use of the re-
parameterization trick
Generate z by
random sampling
Training : an example
https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf
Random
generator
layer Z
X̂
X
• Learning algorithm : The probability function (left side diagram)
cannot be back-propagated, therefore Reparameterization trick
(right side diagram) should be applied
http://bjlkeng.github.io/posts/variational-autoencoders/
Figure 3: An initial attempt at a variational autoencoder
without the "Reparameterization trick". Objective functions
shown in red. We cannot back-propagate through the
stochastic sampling operation because it is not a continuous
deterministic function.
Figure 4: A variational autoencoder with the
"Reparameterization trick". Notice that all
operations between the inputs and objectives are
continuous deterministic functions, allowing back-
propagation to occur.
StdDev=

Qq(z|x)
P(X|z)
This Q(z|x)
=N(µz|X,z|X) should
be close to N(0,I)
We also want
the output to be
similar to the
input
Problem:
Cannot backpropagate
Solution:
Reparamete
rization
trick
Random
generator
layer 
Intuition of the Reparameterization trick
• The encoder uses random sampling to generate z
• Backpropagation (during training) is not possible for the random sampling process
• Reparameterization can produce the same effect for the encoder
• Backpropagation (during training) is then possible because no random process is involved
Encoder
Path by
random
sampling
Backpropagation
path

Reparameterization
Z can be produced by a scaled N(0,I)
• Reparameterization generates any
Gaussian distribution of known mean
(µx), standard-deviation (x) by using
the equation (Z= µx+ x ) based on
the variable  generated by N(0,1) .
• After the forward pass,  is
generated, so  is not random. It is a
data to be used in backpropagation
during training.
• N(0,1) =Gaussian with mean=0 and
standard deviation=1
•  = the generated variable of N(0,1)
• µx =mean
• x = standard-deviation
• Z= µx+ x Ch10. Auto and variational encoders
v230607d
64
mean Standard deviation

To produce the random
variable  N(0,1): mean
=0, std=1
Input data
https://learnopencv.com/variational-
autoencoder-in-tensorflow/
Summary for reparameterization
• ε = the variable generated by sampling N(0, 1)
• µx = mean
• σx = standard deviation
• z = µx + σx·ε ; this equation is deterministic, so it can be backpropagated
• See the code in
• https://learnopencv.com/variational-autoencoder-in-tensorflow/
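This is the same idea as the sampling function used in the Keras VAE example later in these slides; a self-contained sketch:

import tensorflow.keras.backend as K

def sampling(args):
    # Reparameterization: z = mu + sigma * eps, with eps ~ N(0, I).
    z_mean, z_log_var = args                       # encoder outputs: mean and log-variance
    eps = K.random_normal(shape=K.shape(z_mean))   # eps is sampled once, then treated as data
    return z_mean + K.exp(0.5 * z_log_var) * eps   # exp(0.5 * log var) = sigma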
Exercise 7
• In the reparameterization of the variational autoencoder method shown below, ε = 0.35 is a value randomly sampled from the normal distribution with mean = 0 and standard deviation = 1. If the output of the encoder network has µ_z|x = mean = 0.3 and σ_z|x = standard deviation = 0.8, find the value z.
• MC choices:
1) 0.50
2) 0.54
3) 0.56
4) 0.58
Answer: Exercise 7
• In the reparameterization of the variational autoencoder method shown below, ε = 0.35 is a value randomly sampled from the normal distribution with mean = 0 and standard deviation = 1. If the output of the encoder network has µ_z|x = mean = 0.3 and σ_z|x = standard deviation = 0.8, find the value z.
• MC choices:
1) 0.50
2) 0.54
3) 0.56
4) 0.58 (correct)
Answer:
z = µ + ε·σ_z|x ; here ε = 0.35, µ = 0.3, standard deviation σ_z|x = 0.8
z = 0.3 + 0.35*0.8 = 0.58
Exercise 8
• Discuss exercise
• why Reparameterization is needed?
Answer: Exercise 8
Discuss why Reparameterization is needed.
• Answer: Z is generated by a random process if you have mean = µx and standardDev = σx. Since the VAE system is implemented using neural networks, they need backpropagation for training the weights/parameters, and the random process of generating Z cannot be backpropagated.
• Solution: The reparameterization trick converts the random process into a deterministic process (z = µx + σx·ε) with the help of a random variable ε generated from the normal distribution with mean = 0 and standardDev = 1: N(0, 1). Hence this deterministic process can be backpropagated.
Reparameterization trick
Demo Matlab code gen_data_using_mean0_sigma1.m shows the idea: X = µx + σx·eps is the formula for generating X from eps (generated by a normal distribution with mean = 0, std = 1)
https://nbviewer.jupyter.org/github/gokererdogan/Notebooks/blob/master/Reparameterization%20Trick.ipynb
• %gen_data_using_mean0_sigma1.m
• clear
• %%large number of samples %%
• eps=randn(10000,1);
• mu_x=2 %this is your mean
• sigma_x=1 %this is your std
• x=mu_x+(eps*sigma_x);
• grad2_of_mean=
sum(2*(mu_x+eps))/length(x);
• 'grad2 of mean='
• grad2_of_mean
• 'mean(x)='
• mean(x)
• 'std(x)='
• std(x)
• Result:grad2_of_mean = 3.9933
• mean(x)= 1.9960 (approximate 2)
• std(x)= 0.9984 (approximate 1)
• σx = standard deviation of x
• µx = mean of x
• eps = N(mean=0, std=1), normal dist.
• X = µx + σx·eps
• And the gradient of the mean is expected_val_of(2(eps+mu_x)), assuming σx = 1 for simplicity
• The above is not random, because eps has already been generated and µx is the current mean. We can use this in our backpropagation formula to find the updated mean.
Using X = µx + σx·eps, we can find its gradient, bypassing the random process. Because eps is generated by a random process during the neural net forward pass, during backpropagation it is just data (now available deterministically) to be used. Note: grad2_of_mean = expected_value_of(2(eps+mu_x))
Implementation
Using Keras
https://github.com/keras-
team/keras/tree/master/
Keras
(Figure: VAE network diagram; the latent layer outputs mean µ and StdDev σ.)
Keras implementation of VAE
• x = Input(shape=(original_dim,))
• h = Dense(intermediate_dim, activation='relu')(x)
• z_mu = Dense(latent_dim)(h)
• z_log_var = Dense(latent_dim)(h)
• z_mu, z_log_var = KLDivergenceLayer()([z_mu, z_log_var])
• # Use of lambda: normalize log variance to std dev
• z_sigma = Lambda(lambda t: K.exp(.5*t))(z_log_var)
• eps = Input(tensor=K.random_normal(shape=(K.shape(x)[0],
• latent_dim)))
• z_eps = Multiply()([z_sigma, eps])
• z = Add()([z_mu, z_eps])
• decoder = Sequential([
• Dense(intermediate_dim, input_dim=latent_dim,
activation='relu'),
• Dense(original_dim, activation='sigmoid')
• ])
• x_pred = decoder(z)
http://louistiao.me/posts/implementing-variational-autoencoders-in-keras-beyond-the-quickstart-tutorial/
original_dim = 784
intermediate_dim = 256
latent_dim = 2
batch_size = 100
epochs = 50
epsilon_std = 1.0
StdDev = σ; predicted output x_pred
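The snippet above refers to a KLDivergenceLayer and a reconstruction loss that are not shown on this slide; one possible completion in the spirit of the linked tutorial (this is a hedged sketch, not the tutorial's exact code) is:

import tensorflow.keras.backend as K
from tensorflow.keras.layers import Layer
from tensorflow.keras.models import Model
from tensorflow.keras.losses import binary_crossentropy

class KLDivergenceLayer(Layer):
    """Identity layer that adds the KL term to the model loss (one possible implementation)."""
    def call(self, inputs):
        z_mu, z_log_var = inputs
        kl = 0.5 * K.sum(K.exp(z_log_var) + K.square(z_mu) - 1.0 - z_log_var, axis=-1)
        self.add_loss(K.mean(kl))
        return inputs

def nll(x_true, x_pred):
    # Per-image reconstruction loss: sum of pixel-wise binary cross-entropy (original_dim = 784 above).
    return original_dim * binary_crossentropy(x_true, x_pred)

# Assemble and compile the VAE using the tensors defined in the snippet above (x, eps, x_pred).
vae = Model(inputs=[x, eps], outputs=x_pred)
vae.compile(optimizer='rmsprop', loss=nll)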
variational_autoencoder_deconv .py
from https://github.com/keras-team/keras/tree/master/
• '''Example of VAE on MNIST dataset using CNN
•
• The VAE has a modular design. The encoder, decoder and VAE
• are 3 models that share weights. After training the VAE model,
• the encoder can be used to generate latent vectors.
• The decoder can be used to generate MNIST digits by sampling the
• latent vector from a Gaussian distribution with mean=0 and std=1.
•
• # Reference
•
• [1] Kingma, Diederik P., and Max Welling.
• "Auto-encoding variational bayes."
• https://arxiv.org/abs/1312.6114
• '''
•
• from __future__ import absolute_import
• from __future__ import division
• from __future__ import print_function
•
• from tensorflow.keras.layers import Dense, Input
• from tensorflow.keras.layers import Conv2D, Flatten, Lambda
• from tensorflow.keras.layers import Reshape, Conv2DTranspose
• from tensorflow.keras.models import Model
• from tensorflow.keras.datasets import mnist
• from tensorflow.keras.losses import mse, binary_crossentropy
• from tensorflow.keras.utils import plot_model
• from tensorflow.keras import backend as K
•
• import numpy as np
• import matplotlib.pyplot as plt
• import argparse
• import os
•
•
• # reparameterization trick
• # instead of sampling from Q(z|X), sample eps = N(0,I)
• # then z = z_mean + sqrt(var)*eps
• def sampling(args):
• """Reparameterization trick by sampling fr an isotropic unit Gaussian.
•
• # Arguments
• args (tensor): mean and log of variance of Q(z|X)
•
In variational_autoencoder_deconv, use:
vae.save_weights('vae_cnn_mnist.tf') # instead of vae.save_weights('vae_cnn_mnist.h5')
Results:
Epoch 30/30
60000/60000 [==============================] - 91s 2ms/sample - loss: 145.7313 - val_loss: 146.8615
To run this, you need to install:
>>conda install graphviz
>>conda install pydot
variational_autoencoder_deconv.py
from https://github.com/keras-team/keras/tree/master/
• Results
Summary
• Learned vanilla autoencoder
• Learned variational autoencoder
• Learned the Reparameterization trick to
enable learning in variational autoencoder
Reference
• https://nbviewer.jupyter.org/github/gokererdogan/Notebooks/blob/master/Reparameterization%20Trick.ipynb
Appendices
Appendix 1: Training: Combining concept 1 and 2 to minimize Loss L.
X={x1,x2,..,xN} , E()=expected value . For the whole X, the average loss is
Start from the marginal likelihood of each datapoint x_i and the KL divergence between the encoder distribution Q(z|x_i) and the true posterior P(z|x_i):
log P(x_i) − D_KL( Q(z|x_i) || P(z|x_i) ) = E_{z∼Q(z|x_i)}[ log P(x_i|z) ] − D_KL( Q(z|x_i) || P(z) )   (II)
Averaging over the whole data set X = {x1, .., xN}:
E_{x_i∼X}[ log P(x_i) − D_KL( Q(z|x_i) || P(z|x_i) ) ] ≈ (1/N) Σ_i [ E_{z∼Q(z|x_i)}[ log P(x_i|z) ] − D_KL( Q(z|x_i) || P(z) ) ]
Note: P(z) = N(0, I). If Q(z|x_i) = N( μ_{z|x_i}, Σ_{z|x_i} ), use the formula in the previous slide:
D_KL( N(μX, ΣX) || N(0, I) ) = ½ [ tr(ΣX) + μXᵀ μX − d − ln det ΣX ]
So, for the whole X, the average loss to be minimized is
L(θ, φ) = Σ_{x_i∈X} [ l_i^(rec)(θ, φ) + D_KL( Q(z|x_i) || N(0, I) ) ]
We will run an iterative algorithm to minimize L.
Concept 1 (reconstruction term); Concept 2 (DKL term)
See http://bjlkeng.github.io/posts/variational-autoencoders/ & https://arxiv.org/abs/1312.6114
Appendix 2
Probability likelihood
A tutorial
KH Wong
Overview
• Bayesian rules
• Gaussian distribution
• Probability vs likelihood
• Log-likelihood and maximum likelihood
• Negative log-likelihood
Bayesian rules
Bayesian rules
• P(B|A)=P(A|B)P(B)/P(A)
• P(A and B)=P(A,B)=P(A|B) P(B)
• P(A,B|C)=P(A|B,C) P(B|C)
• Prove the above as exercises
In each cell, the joint probability p(r, c) is re-expressed by the equivalent form
p(r | c) p(c) from the definition of conditional probability in Equation 5.3.
The marginal probability p(r) =Σc*p(r | c*) p(c*),
https://www.sciencedirect.com/topics/mathematics/marginal-probability
Gaussian distribution
• %2-D Gaussian distribution
P(xj)
• %matlab code----------
• clear, N=10
• [X1,X2]=meshgrid(-N:N,-N:N);
• sigma =2.5;mean=[3 3]'
• G=1/(2*pi*sigma^2)*exp(-((X1-mean(1)).^2+(X2-mean(2)).^2)/(2*sigma^2));
• G=G./sum(G(:)) %normalise it
• 'sigma is ', sigma
• 'sum(G(:)) is ',sum(G(:))
• 'max(max(G(:)))
is',max(max(G(:)))
• figure(1), clf
• surf(X1,X2,G);
• xlabel('x1'),ylabel('x2')
1-D Gaussian, a sample x, mean μ0, variance σ0²:
N(x | μ0, σ0²) = 1/(2πσ0²)^(1/2) · exp( −(x − μ0)² / (2σ0²) )
2-D isotropic (circularly symmetric) Gaussian:
N(x1, x2 | 0, σ²) = 1/(2πσ²) · exp( −(x1² + x2²) / (2σ²) )
Probability vs likelihood
• It is two sides of a coin.
• P() Probability function :
– Given a Gaussian model (with mean µo and variance o), the
probability function P(X| µo,o) measures the probability that the
observation X is generated by the model.
• L() likelihood function:
– Given data X, the Likelihood function L(µo,o| X) measures the
probability that X fits the Gaussian model with mean µo and variance
o.
– Major application: Given data X, we can maximize the Likelihood
function L(µo,o| X) to find the model (µo,o) that fits the data. This is
called the maximum likelihood method.
– Log-likelihood rather than likelihood is more convenient for finding
the maximum, hence it is often used.
P(X | µo, σo²) = L(µo, σo² | X)
Likelihood function L(θ) of n-dimensional data
• Likelihood function
• Intuition: the likelihood function L(µ, σ|X) means: given a Gaussian model N(mean, variance), how well the multivariate data X = [x1, x2, x3, .., xn] fits the model with parameters (µ, σ).
X = [x1, x2, …, xn]
L(µ, σ²|X) = Π_{j=1}^{n} 1/√(2πσ²) · exp( −(x_j − µ)² / (2σ²) ) = (2πσ²)^(−n/2) · exp( −Σ_{j=1}^{n} (x_j − µ)² / (2σ²) )
Proof: Given the assumption that the observations from the sample are IID, the likelihood function factorizes into the product of the individual Gaussian densities N(x_j | µ, σ²), which gives the expression above.
A more useful representation is Log-Likelihood function= Log(L( ))=l ()
• Intuition:
• The peak of Likelihood and
Log-Likelihood functions
should be the same.
• The two are one to one
mapping hence no data
loss.
• Log based method is
easier to be handled by
math, so log-Likelihood
function is often used
• For computers, log
numbers are smaller
hence may save memory.
Using log, we can use
addition rather than
multiplication which
makes computation easier.
Ch10. Auto and variational encoders
v230607d
89
 
 
   
 
   
   
  proved!
• Proof:
l(µ, σ² | x1, x2, ..., xn) = ln L(µ, σ² | x1, x2, ..., xn)
  = ln [ (2πσ²)^(−n/2) · exp( −(1/(2σ²)) Σ_{j=1..n} (x_j − µ)² ) ]
  = ln (2πσ²)^(−n/2) + ln exp( −(1/(2σ²)) Σ_{j=1..n} (x_j − µ)² )
  = −(n/2) ln(2π) − (n/2) ln(σ²) − (1/(2σ²)) Σ_{j=1..n} (x_j − µ)² , proved!

Log-Likelihood function: for X = [x1, x2, ..., xn], by defn.
L(µ, σ² | X) = (2πσ²)^(−n/2) · exp( −(1/(2σ²)) Σ_{j=1..n} (x_j − µ)² )
l(µ, σ² | x1, ..., xn) = −(n/2) ln(2π) − (n/2) ln(σ²) − (1/(2σ²)) Σ_{j=1..n} (x_j − µ)²
Maximum Likelihood V.S. Log-Likelihood
Ch10. Auto and variational encoders
v230607d
90
Given X = [x1, x2, ..., xn], the Gaussian parameter set is θ = (µ, σ²).

Likelihood function:
L(µ, σ² | X) = (2πσ²)^(−n/2) · exp( −(1/(2σ²)) Σ_{j=1..n} (x_j − µ)² )

Take the Log of the likelihood function to get the Log-Likelihood function:
l(µ, σ² | X) = Log( L(µ, σ² | X) ) = −(n/2) ln(2π) − (n/2) ln(σ²) − (1/(2σ²)) Σ_{j=1..n} (x_j − µ)²
Since the logarithm is a monotonic function, the likelihood L and the log-likelihood l are maximized by the same parameters:
arg_maxθ L(θ|X) = arg_maxθ l(θ|X)
The maximum happens at θ = (µ, σ²), where µ = (1/n) Σ_{j=1..n} x_j and variance σ² = (1/n) Σ_{j=1..n} (x_j − µ)².
http://jrmeyer.github.io/machinelearning/2017/08/18/mle.html
https://towardsdatascience.com/probability-concepts-explained-maximum-likelihood-estimation-c7b4342fdbb1
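A quick numerical sanity check of this result (a sketch with arbitrary synthetic data): the sample mean and the 1/n sample variance give a larger log-likelihood than nearby parameter values.

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=5000)
n = x.size

def loglik(mu, var):
    # l(mu, sigma^2 | X) from the slide above
    return -n/2*np.log(2*np.pi) - n/2*np.log(var) - np.sum((x - mu)**2) / (2*var)

mu_hat  = x.mean()                  # MLE mean  = (1/n) * sum(x_j)
var_hat = np.mean((x - mu_hat)**2)  # MLE variance, note the 1/n factor

print(loglik(mu_hat, var_hat))          # maximum
print(loglik(mu_hat + 0.1, var_hat))    # smaller
print(loglik(mu_hat, var_hat * 1.1))    # smaller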
•
Important
Proof 1: Maximum Log-Likelihood function of a
Multivariate Gaussian distribution
Ch10. Auto and variational encoders
v230607d
91
Max log-likelihood is at d ln L(µ, σ² | x1, x2, ..., xn) / dµ = 0.
For a Gaussian function, we showed earlier that
l(X) = −(n/2) ln(2π) − (n/2) ln(σ²) − (1/(2σ²)) Σ_{j=1..n} (x_j − µ)² ,
so
dl(X)/dµ = d/dµ [ −(n/2) ln(2π) − (n/2) ln(σ²) − (1/(2σ²)) Σ_{j=1..n} (x_j − µ)² ]
         = (1/σ²) Σ_{j=1..n} (x_j − µ) = 0 ,
hence µ = (1/n) Σ_{j=1..n} x_j = mean of x.
So what is the expression of σ² that maximizes the log-likelihood?
https://towardsdatascience.com/probability-concepts-explained-maximum-likelihood-estimation-c7b4342fdbb1
Likelihood function:
L(µ, σ² | X) = (2πσ²)^(−n/2) · exp( −(1/(2σ²)) Σ_{j=1..n} (x_j − µ)² )
Log-Likelihood function:
l(µ, σ² | X) = −(n/2) ln(2π) − (n/2) ln(σ²) − (1/(2σ²)) Σ_{j=1..n} (x_j − µ)²
The maximum Log-likelihood happens when mean µ = (1/n) Σ_{j=1..n} x_j and variance σ² = (1/n) Σ_{j=1..n} (x_j − µ)².
http://jrmeyer.github.io/machinelearning/2017/08/18/mle.html
Proof 2 : Maximum Log-Likelihood function of
a Multivariate Gaussian distribution
Ch10. Auto and variational encoders
v230607d
92
Maximum log-likelihood happens when d ln L(µ, σ² | x1, x2, ..., xn) / dσ² = 0.
For a Gaussian function,
l = −(n/2) ln(2π) − (n/2) ln(σ²) − (1/(2σ²)) Σ_{j=1..n} (x_j − µ)² ,
so
dl/dσ² = −(n/2)(1/σ²) + (1/(2σ⁴)) Σ_{j=1..n} (x_j − µ)² = 0 ,
hence σ̂² = (1/n) Σ_{j=1..n} (x_j − µ)² = variance of x.
That means: given (µ, σ²), the data is most likely to be generated by a Gaussian distribution whose mean is the mean of the x_j and whose variance is the variance of the x_j.
http://people.stat.sfu.ca/~raltman/stat402/402L4.pdf
https://towardsdatascience.com/probability-concepts-explained-maximum-likelihood-estimation-c7b4342fdbb1
Note: d ln(z) / dz = 1/z
Likelihood function:
L(µ, σ² | X) = (2πσ²)^(−n/2) · exp( −(1/(2σ²)) Σ_{j=1..n} (x_j − µ)² )
Log-Likelihood function:
l(µ, σ² | X) = −(n/2) ln(2π) − (n/2) ln(σ²) − (1/(2σ²)) Σ_{j=1..n} (x_j − µ)²
The maximum Log-likelihood happens when mean µ = (1/n) Σ_{j=1..n} x_j and variance σ² = (1/n) Σ_{j=1..n} (x_j − µ)².
Alternative proof: Maximum Log_likelihood
Find the most suitable variance σ²
• Maximum likelihood is at
Ch10. Auto and variational encoders
v230607d
93
µ̂ = (1/n) Σ_{j=1..n} x_j ,   σ̂² = (1/n) Σ_{j=1..n} (x_j − µ̂)²

• Proof: Solve the maximum Log_likelihood problem
∂ l(µ, σ² | x1, x2, ..., xn) / ∂σ² = 0 :
∂/∂σ² [ −(n/2) ln(2π) − (n/2) ln(σ²) − (1/(2σ²)) Σ_{j=1..n} (x_j − µ)² ]
  = −(n/2)(1/σ²) + (1/(2σ⁴)) Σ_{j=1..n} (x_j − µ)² ,
which is equal to zero only if σ̂² = (1/n) Σ_{j=1..n} (x_j − µ)²
(the maximum Log_likelihood of the Gaussian occurs here), done!
Negative Log-Likelihood (NLL)
And its application in softmax
To maximize log-likelihood, we can
minimize its negative log-likelihood
(NLL) function
Ch10. Auto and variational encoders
v230607d
94
Softmax function
• https://medium.com/data-science-bootcamp/understand-the-softmax-function-in-minutes-f3a59641e86d
• y=[2 , 1, 0.1]’
• Softmax(y)=[0.6590, 0.242,0.0986]’
• exp(2)/((exp(2)+exp(1)+exp(0.1))=0.6590
• exp(1)/((exp(2)+exp(1)+exp(0.1))= 0.2424
• exp(0.1)/((exp(2)+exp(1)+exp(0.1))= 0.0986
Ch10. Auto and variational encoders
v230607d
95
softmax(y_i) = exp(y_i) / Σ_{i=1..n} exp(y_i) ,   for i = 1, 2, ..., n
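The worked example above (softmax of y = [2, 1, 0.1]) can be reproduced with a few lines of Python (a minimal sketch):

import numpy as np

def softmax(y):
    e = np.exp(y - np.max(y))   # subtract the max for numerical stability
    return e / e.sum()

y = np.array([2.0, 1.0, 0.1])
print(softmax(y))               # approx. [0.6590, 0.2424, 0.0986]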
Softmax Activation Function
• https://ljvmiranda921.github.io/notebook/2017/08/13/softmax-and-the-negative-log-likelihood/#nll
Ch10. Auto and variational encoders
v230607d
96
exp(5)/(exp(5)+exp(4)+exp(2)) = 0.705
exp(4)/(exp(5)+exp(4)+exp(2)) = 0.259
Negative Log-Likelihood (NLL)
• To maximize the likelihood, we
pick the minimum negative log-
likelihood (NLL)
Ch10. Auto and variational encoders
v230607d
97
https://ljvmiranda921.github.io/notebook/2017/08/13/softmax-and-the-negative-log-likelihood/#nll
=-ln(likelihood)
=-ln(0.02)=3.91
=-ln(0)=infinity
=-ln(0.98)=0.02
Minimum negative log-
likelihood (NLL) is
picked, so 0.02 is
selected
Softmax
output as the
likelihood
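A minimal sketch of the NLL computation illustrated above (the class scores and the class index are example values): the loss is the negative log of the softmax output assigned to the correct class, so a confident correct prediction gives a small loss and a near-zero output gives a very large one.

import numpy as np

def softmax(y):
    e = np.exp(y - np.max(y))
    return e / e.sum()

def nll(scores, true_class):
    p = softmax(scores)[true_class]   # softmax output used as the likelihood
    return -np.log(p)

print(-np.log(0.02))                        # approx. 3.91
print(-np.log(0.98))                        # approx. 0.02
print(nll(np.array([5.0, 4.0, 2.0]), 0))    # -ln(0.705), approx. 0.35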
•
Ch10. Auto and variational encoders
v230607d
98
Continue
FAQ on VAE
• FAQ Assign3, 2020 Nov 17
• Question 3.1:
• Hi, sorry for interrupting, i got a question for assignment 3 auto-encoder part. in question one, encoder hidden layer and decoder hidden layer are in different size(15 and 18), and the
neurons for (means, variance) and samples are different as well, does it mean in variational auto-encoder, encoder hidden layer size and decoder hidden layer size can be different, and
neuron numbers for means, variance and samples don't have to match? if so, is there some random drop-out functions when means, variance and samples size don't match? Thanks.
• Answer 3.1:
• This is a very good question. In my notes, mean, variance and sample_z are of the same sizes, but I have found some implementations showing that this may not be the only case. Yes, it is a kind of dropout as described by the papers shown below. I think the rule is that the mean and variance neurons should have the same number because they go in pairs, but the randomly generated sample_z can be of a different size. It is done by randomly (via a Monte Carlo method) selecting the pair of mean and variance used to generate each value of sample_z. Neural computing is a trial-and-error method: you may try different approaches, and the preferred method is the one which gives you a good result. You may explore more papers and see whether my interpretation is correct.
• See section3.4 of
• https://arxiv.org/pdf/1706.03643.pdf
• Also
• https://deeplearn.org/arxiv/92996/generating-data-using-monte-carlo-dropout
• ////////////////////////////////////////////////////////
• Question 3.2 on VAE (variational Auto-encoder)
• Question 3.2a:
• In your notes, variational auto encoder turns input x into means and deviations of a multivariate Gaussian distribution, then use a random sampling method to create output. The output
is Z and Z is generating random sample to the next layer of neuron.
• (i) How do we train the neuron network if the input is from random sampling? (ii) And How do we force a multivariate Gaussian distribution Z to uni-variate Gaussian distribution N(0,1)?
• Answer 3.2a: I will answer part (ii) of the above question first. It is not to avoid over-fitting. From the input to the latent (hidden) representation z, there is a random process. A random process can have many different forms: it can be Gaussian, Laplace, Cauchy, etc., or some unknown form. If there is no control, you may not be able to repeat the process, hence training becomes useless. In the VAE paper (https://arxiv.org/abs/1312.6114), the authors propose to force the random probability distribution to be Gaussian (I guess you may force it to be Laplace etc. and it can still work, but you have to be consistent in using one model). How? The method uses D_KL (Kullback–Leibler divergence). It is concept 2 in my notes, used to make sure the random process is Gaussian.
• ////--------------------------------------------------------------------------------------------------------------------
• Question 3.2b : Why do we still need re-parameterization to do back propagation?
• Answer 3.2b: It is known that a random process cannot be back-propagated through, but re-parameterization provides a means to back-propagate. First, zi is not generated directly by a random generator with mean=µi, std_dev=σi, but rather by an indirect method of finding zi (using zi = µi + ε*σi) through ε, which is generated by N(0,1) = N_Gaussian(mean=0, std_dev=1). If you have doubt, run my Matlab program on p.71 of 5707_10_auto-encoder (1).pptx. In short, it is found that if we use zi = µi + ε*σi to generate zi, then zi will have the characteristics of (mean=µi, std_dev=σi).
• Then, why do we use that indirect method? Because during the forward pass of neural computing, ε is already calculated from N(0,1); it is a real number, not a random variable (the same holds for the mean=µi and std_dev=σi neuron outputs), so during back-propagation we can use zi = µi + ε*σi to find out how much to back-propagate to change the weights of the neurons. In the lecture note p.67 of 5707_10_auto-encoder.pptx, the gradient is calculated (recall that for neural back-propagation computing, the gradient is needed to find de/dw), and we can form our weight-updating program based on this formulation. The idea is that with this gradient, we know how to change µi, σi if we know the change of zi (if it were a plain random process, we simply would not know how). However, you don't need to enter this gradient into the VAE program because it is already in the Tensorflow-Keras library; it is done automatically by Tensorflow-Keras as long as you provide the zi = µi + ε*σi formulation of the forward pass.
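The check described above (that z_i = µ_i + ε*σ_i with ε drawn from N(0,1) indeed has mean µ_i and standard deviation σ_i) can be reproduced with a short NumPy sketch in place of the Matlab program mentioned; the parameter values here are arbitrary:

import numpy as np

rng = np.random.default_rng(0)
mu_i, sigma_i = 1.5, 0.7             # example mean / std for one latent unit
eps = rng.standard_normal(100_000)   # eps ~ N(0, 1)
z_i = mu_i + sigma_i * eps           # reparameterized samples

print(z_i.mean(), z_i.std())         # approx. 1.5 and 0.7, as claimed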
• ////--------------------------------------------------------------------------------------------------------------------
• Question 3.2c: If we put N(0.15, 2.3), does that mean the input mean is 0.15 and the std is 2.3? Then it goes through KL to compute the error with the expected distribution.
• Answer 3.2c: The use of N(0,1) (a Gaussian with mean=0, std_dev=1) is to make the formulation easier to program and calculate; see p.51, where the D_KL formulation (the loss function is based on it) becomes simpler. I guess you may assume all distributions to be N(0.15, 2.3), but then your loss function becomes more complex. The idea is to make sure zi is generated by a Gaussian process; zi can be generated with a different mean and std_dev, but it needs to be Gaussian. It is done by reducing D_KL(random process that generates zi || N(0,1)). So comparing the process that generates zi with a typical Gaussian like N(0,1) to form the loss function is reasonable.
Ch10. Auto and variational encoders
v230607d
99
•
Ch10. Auto and variational encoders
v230607d
100
To prove Eq [x2]=Ep [2( +)],
https://stats.stackexchange.com/questions/199605/how-does-the-reparameterization-trick-for-vaes-work-and-why-is-it-important
Alternative Derivation: To prove ∇µ Eq[x²] = Ep[2(µ + ε)]
•
Ch10. Auto and variational encoders
v230607d
101
We want to minimize Eq[x²] over µ, so we need ∇µ Eq[x²], where q(x) = N(µ, 1) is a normal distribution of mean µ and variance 1.

(i) By the definition of expectation, Eq[x²] = ∫ q(x) x² dx, so
∇µ Eq[x²] = ∫ x² ∇µ q(x) dx = ∫ x² q(x) ∇µ log q(x) dx = Eq[ x² ∇µ log q(x) ]
(using ∇µ q(x) = q(x) ∇µ log q(x), since d log q / dµ = (1/q) dq/dµ).

(ii) Since q(x) = N(µ, 1) = (1/(2π)^(1/2)) · exp( −(x − µ)²/2 ),
log q(x) = −(x − µ)²/2 − (1/2) log(2π), therefore ∇µ log q(x) = (x − µ).

Putting (ii) into (i): ∇µ Eq[x²] = Eq[ x² (x − µ) ].

Alternatively, since x = µ + ε with ε ~ N(0, 1), and p is the distribution of ε (i.e. p = N(0, 1)),
Eq[x²] = Ep[ (µ + ε)² ], therefore the derivative of Eq[x²] with respect to µ is Ep[ 2(µ + ε) ].
Reparameterization:
Backpropagation needs derivative of a function (process)
•
Ch10. Auto and variational encoders
v230607d
102
https://stats.stackexchange.com/questions/199605/how-does-the-reparameterization-trick-for-vaes-work-and-why-is-it-important
Derivative of a random process
is not possible
Derivative of the Reparameterization process
(no random node is involved) is possible
•
Ch10. Auto and variational encoders
v230607d
103
https://stats.stackexchange.com/questions/199605/how-does-the-reparameterization-trick-for-vaes-work-and-why-is-it-important
Explanation:
Summary: Backpropagation
• The gradient during backpropagation is
• ∇µx Eq[z²] = ∇µx Ep[(µx + ε)²] = Ep[2(µx + ε)] ------(*)
• This gradient is required for the neural network
learning (back-propagation) process
• ε is the variable generated from N(0,1) during the forward
pass
• µx is the current mean and is given at the forward pass
• So, the gradient (see formula * above) can be found
and used in backpropagation
Ch10. Auto and variational encoders
v230607d
104
Gradient for backpropagation
• Eq() = expectation
• µ = mean, σ = standard deviation
• z = µ + σε, with ε sampled from N(0,I)
• The above is deterministic, so we can take the derivative with
respect to µ and thus find the derivative of Eq[z²]
• Eq[z²] = Ep[(µ + σε)²]
• Assume σ = 1 for simplicity (µ, σ are independent)
• Derivative of Eq[z²] = ∂Eq[z²]/∂µ = ∇µ Eq[z²] = Ep[2(µ + ε)]
• (The proof is in the appendix: To prove ∇µ Eq[z²] = Ep[2(µ + ε)])
• If we have enough samples of ε, we can estimate ∇µ Eq[z²]. This
gradient is required for the neural network learning (back-
propagation) process
• µ = current mean, ε = randomly generated by N(0,I) during the
forward pass
• For σ, we can apply the same treatment for updating
Ch10. Auto and variational encoders
v230607d
105
https://nbviewer.jupyter.org/github/gokererdogan/Notebooks/blob/master/Reparameterization%20Trick.ipynb
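As a sketch of how this gradient could drive learning (illustrative only, not the lecture's code): estimate ∇µ Eq[z²] = Ep[2(µ + ε)] by Monte Carlo during the forward pass and take gradient-descent steps on µ; since Eq[z²] = µ² + 1 when σ = 1, the loop drives µ towards 0.

import numpy as np

rng = np.random.default_rng(0)
mu, lr = 2.0, 0.05                   # current mean and learning rate

for step in range(200):
    eps = rng.standard_normal(1000)  # eps ~ N(0, 1), generated at the forward pass
    grad = np.mean(2 * (mu + eps))   # Monte-Carlo estimate of grad_mu E_q[z^2]
    mu -= lr * grad                  # gradient-descent update
print(mu)                            # ends up close to 0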
Demo gen_data_using_mean0_sigma1.py
Reparameterization trick
• import numpy as np
• N = 1000
• theta = 2.0
• eps = np.random.randn(N)
• x = theta + eps
• # grad1: estimate of d/d_theta E[x^2] without the reparameterization trick (score-function form)
• grad1 = lambda x: np.sum(np.square(x)*(x-theta)) / x.size
• # grad2: reparameterization-trick estimate, d/d_theta E[(theta+eps)^2] = E[2(theta+eps)]
• grad2 = lambda eps: np.sum(2*(theta + eps)) / x.size
• print(grad1(x))
• print(grad2(eps))
• # 3.86872102149
• # 4.03506045463
• Let us plot the variance for different sample sizes.
• Ns = [10, 100, 1000, 10000, 100000]
• reps = 100
• means1 = np.zeros(len(Ns))
• vars1 = np.zeros(len(Ns))
• means2 = np.zeros(len(Ns))
• vars2 = np.zeros(len(Ns))
• est1 = np.zeros(reps)
• est2 = np.zeros(reps)
• for i, N in enumerate(Ns):
•     for r in range(reps):
•         x = np.random.randn(N) + theta
•         est1[r] = grad1(x)
•         eps = np.random.randn(N)
•         est2[r] = grad2(eps)
•     means1[i] = np.mean(est1)
•     means2[i] = np.mean(est2)
•     vars1[i] = np.var(est1)
•     vars2[i] = np.var(est2)
• print(means1)
• print(means2)
• print(vars1)
• print(vars2)
• # [ 4.10377908 4.07894165 3.97133622 4.00847457 3.99620013]
• # [ 3.95374031 4.0025519 3.99285189 4.00065614 4.00154934]
• # [ 8.63411090e+00 8.90650401e-01 8.94014392e-02 8.95798809e-03 1.09726802e-03]
• # [ 3.70336929e-01 4.60841910e-02 3.59508788e-03 3.94404543e-04 3.97245142e-05]
• %matplotlib inline
• import matplotlib.pyplot as plt
• plt.plot(vars1)
• plt.plot(vars2)
• plt.legend(['no rt', 'rt'])
Ch10. Auto and variational encoders
v230607d
106
Variance of the estimates using reparameterization trick is
one order of magnitude smaller than the estimates from the
first method!
5707_10_auto-encoder.pptx

  • 1. Ch10. Auto-encoders KH Wong Ch10. Auto and variational encoders v230607d 1
  • 2. Two types of autoencoders • Part1 : Vanilla (means traditional or classical) Autoencoder – or simply called Autoencoder • Part 2: Variational Autoencoder Ch10. Auto and variational encoders v230607d 2
  • 3. Part 1: Overview of Vanilla (traditional/classical) Autoencoder • Introduction • Theory • Architecture • Application • Examples Ch10. Auto and variational encoders v230607d 3
  • 4. Introduction • What is auto-decoder? – An unsupervised method • Application – For noise removal – Dimensional reduction • Method – Use noise-free ground truth data (e.g. MNIST)+ self generative noise to train the network – The final network can remove noise of in the input (e.g. hand written characters), the output will be similar to the ground truth data Ch10. Auto and variational encoders v230607d 4
  • 5. Noise removal • https://www.slideshare.net/billlangjun/simple-introduction-to-autoencoder Ch10. Auto and variational encoders v230607d 5 Result: plt.title('Original images: top rows,' 'Corrupted Input: middle rows, ' 'Denoised Input: third rows') Perfect input + noise
  • 6. Auto encoder Structure An autoencoder is a feedforward neural network that learns to predict the input (corrupted by noise) itself in the output. • The input-to-hidden part corresponds to an encoder • The hidden-to-output part corresponds to a decoder. • Input and output are of the same dimension and size. Ch10. Auto and variational encoders v230607d 6 https://towardsdatascience.com/deep-autoencoders-using-tensorflow-c68f075fd1a3 Noisy Input x De-noised Output x‘ encoder decoder Neural network after training x‘ x Z (code)
  • 7. Theory (W=weight, b=bias) Autoencoders are trained to minimize reconstruction errors (such as squared errors), often referred to as the "loss (L)": • By combining (*) and (**) Ch10. Auto and variational encoders v230607d 7  ’ x’ X W b Z W’ b’ (**) ) ' ' ( ' ' (*) ) ( '                b z W x b Wx z x z x   2 2 ) ' ) ( ' ( ' ' ) ' , ( b b Wx W x x x x x L Loss          ' x z x   Encoder decoder Input code output
  • 8. Exercise 1a,b,c • How many input layers, hidden layers, output layers in the figure shown? MC choices: How many • (a) input layer(s)? • (b) hidden layer(s)? • (c) Output layer(s)? • How many neurons in these layers? MC choices: How many neurons in these layers? • (d) input layer? • (e) hidden layers: choices: – 1) 3 – 2) 6 – 3) 8 – 4) 10 • (f) output layer? • (g) Which is true on the number of neurons? – 1) input neurons more than output neurons 2) input neurons same as output neurons – 3) input neurons less than output neurons Ch10. Auto and variational encoders v230607d 8 Input Output
  • 9. Answer : Exercise 1 • How many input layers, hidden layers, output layers in the figure shown? – Answer: input=1, hidden=3, output layer=1 • How many neurons in these layers? – Answer: input(=4), hidden(3,2,3),total=8 (choice 3), output (=4) • What is the relation between the number of input and output neurons? – Answer: same (choice 2) Ch10. Auto and variational encoders v230607d 9 Input Output
  • 10. Architecture • Encoder and decoder • Training can use typical backpropagation methods Ch10. Auto and variational encoders v230607d 10 https://towardsdatascience.com/how-to- reduce-image-noises-by-autoencoder- 65d5e6de543
  • 11. Training • Apply clean MNIST data set + added noise to be used as input, • Use clean MNIST data set as output • Train the autoencoder using backpropagation Ch10. Auto and variational encoders v230607d 11 Added noise Autoencoder training by backpropagation + Clean MINST samples Clean MNIST samples same
  • 12. Recall • After training, autoencoders can be used to remove noise Ch10. Auto and variational encoders v230607d 12 Trained autoencoder Noisy Input De-noised Output
  • 13. Exercise 2a,b: Auto-encoder training • (Q.2a) For (epoch=1;epoch <=max_epoch ; epoch++) – {For all 10,000 images{ • Core code: • Use backpropagation to train the whole autoencoder network (encoder + decoder)} • Break if Loss is too small } • MC question: In core code, choices: 1. Feed each clean image to the input, and Present the clean image to the output 2. Feed each clean image+noise to the output, and Present the clean image to the input 3. Feed each clean image+noise to the input, and Present the clean image to the output • (Q.2b) If the trained encoder receives a noisy image of a handwritten numeral, what do you expect at the output? – MC choice: 1) a denoised image; 2) input + noise – 3) same as input ; 4) pure random noise Ch10. Auto and variational encoders v230607d 13 Noise clean image for numeral “2” auto-encoder Input output
  • 14. Answer: Exercise 2a,b • Answer 2(a): Auto-encoder training • For (epoch=1;epoch <=max_epoch ; epoch++) – {For all 10,000 images{ • Feed each clean image plus noise to the (encoder) input • Present the clean image of the numerical to the output (of the decoder), • Use backpropagation to train the whole autoencoder network (encoder + decoder) • } • Break if Loss is too small – } • Ex.2(b) Autoencoder usage: If the trained encoder receives a noisy image of a handwritten numeral, what do you expect at the output? – Answer 2(b): a denoised image of the realinput numeral image (choice 1 is correct) Ch10. Auto and variational encoders v230607d 14 + Noise clean image for numeral “2” auto-encoder Core code Choice 3 is correct Input Output
  • 15. Sample Code: Part(i): obtain dataset and add noise https://towardsdatascience. com/how-to-reduce-image- noises-by-autoencoder- 65d5e6de543 • #part1 --------------------------------------------------- • np.random.seed(1337) • # MNIST dataset • (x_train, _), (x_test, _) = mnist.load_data() • image_size = x_train.shape[1] • x_train = np.reshape(x_train, [-1, image_size, image_size, 1]) • x_test = np.reshape(x_test, [-1, image_size, image_size, 1]) • x_train = x_train.astype('float32') / 255 • x_test = x_test.astype('float32') / 255 • # Generate corrupted MNIST images by adding noise with normal dist • # centered at 0.5 and std=0.5 • noise = np.random.normal(loc=0.5, scale=0.5, size=x_train.shape) • x_train_noisy = x_train + noise • noise = np.random.normal(loc=0.5, scale=0.5, size=x_test.shape) • x_test_noisy = x_test + noise • x_train_noisy = np.clip(x_train_noisy, 0., 1.) • x_test_noisy = np.clip(x_test_noisy, 0., 1.) Ch10. Auto and variational encoders v230607d 15
  • 16. Part (ii):First build the Encoder Model • #part2 --------------------------------------------------- • # Network parameters • input_shape = (image_size, image_size, 1) • batch_size = 128 • kernel_size = 3 • latent_dim = 16 • # Encoder/Decoder number of CNN layers and filters per layer • layer_filters = [32, 64] • # Build the Autoencoder Model • # First build the Encoder Model • inputs = Input(shape=input_shape, name='encoder_input') • x = inputs • # Stack of Conv2D blocks • # Notes: • # 1) Use Batch Normalization before ReLU on deep networks • # 2) Use MaxPooling2D as alternative to strides>1 • # - faster but not as good as strides>1 • for filters in layer_filters: • x = Conv2D(filters=filters, • kernel_size=kernel_size, • strides=2, • activation='relu', • padding='same')(x) • # Shape info needed to build Decoder Model • shape = K.int_shape(x) • # Generate the latent vector • x = Flatten()(x) • latent = Dense(latent_dim, name='latent_vector')(x) • # Instantiate Encoder Model • encoder = Model(inputs, latent, name='encoder') • encoder.summary() Ch10. Auto and variational encoders v230607d 16
  • 17. Part (iii):Build the Decoder Model • #part3 --------------------------------------------------- • # Build the Decoder Model • latent_inputs = Input(shape=(latent_dim,), name='decoder_input') • x = Dense(shape[1] * shape[2] * shape[3])(latent_inputs) • x = Reshape((shape[1], shape[2], shape[3]))(x) • # Stack of Transposed Conv2D blocks • # Notes: • # 1) Use Batch Normalization before ReLU on deep networks • # 2) Use UpSampling2D as alternative to strides>1 • # - faster but not as good as strides>1 • for filters in layer_filters[::-1]: • x = Conv2DTranspose(filters=filters, • kernel_size=kernel_size, • strides=2, • activation='relu', • padding='same')(x) • x = Conv2DTranspose(filters=1, • kernel_size=kernel_size, • padding='same')(x) • outputs = Activation('sigmoid', name='decoder_output')(x) • # Instantiate Decoder Model • decoder = Model(latent_inputs, outputs, name='decoder') • decoder.summary() • # Autoencoder = Encoder + Decoder • # Instantiate Autoencoder Model • autoencoder = Model(inputs, decoder(encoder(inputs)), name='autoencoder') • autoencoder.summary() • autoencoder.compile(loss='mse', optimizer='adam') Ch10. Auto and variational encoders v230607d 17
  • 18. Part (iv): Train the autoencoder, decode images display result • #part4 --------------------------------------------------- • # Train the autoencoder • autoencoder.fit(x_train_noisy, • x_train, • validation_data=(x_test_noisy, x_test), • epochs=30, • batch_size=batch_size) • # Predict the Autoencoder output from corrupted test images • x_decoded = autoencoder.predict(x_test_noisy) • # Display the 1st 8 corrupted and denoised images • rows, cols = 10, 30 • num = rows * cols • imgs = np.concatenate([x_test[:num], x_test_noisy[:num], x_decoded[:num]]) • imgs = imgs.reshape((rows * 3, cols, image_size, image_size)) • imgs = np.vstack(np.split(imgs, rows, axis=1)) • imgs = imgs.reshape((rows * 3, -1, image_size, image_size)) • imgs = np.vstack([np.hstack(i) for i in imgs]) • imgs = (imgs * 255).astype(np.uint8) • plt.figure() • plt.axis('off') • plt.title('Original images: top rows, ' • 'Corrupted Input: middle rows, ' • 'Denoised Input: third rows') • plt.imshow(imgs, interpolation='none', cmap='gray') • Image.fromarray(imgs).save('corrupted_and_denoised.png') • plt.show() Ch10. Auto and variational encoders v230607d 18
  • 19. Code https://towardsdatascience.com/how-to-reduce-image-noises-by-autoencoder-65d5e6de543 Result: plt.title('Original images: top rows, ' 'Corrupted Input: middle rows, ' 'Denoised Image: third rows') • '''Trains a denoising autoencoder on MNIST dataset. • https://towardsdatascience.com/how-to-reduce-image-noises-by-autoencoder-65d5e6de543 • Denoising is one of theclassic applications of autoencoders. • The denoising process removes unwantednoisethatcorrupted the • truesignal. • Noise+ Data ---> Denoising Autoencoder ---> Data • Given a training dataset of corrupted data as input and • truesignal as output, a denoising autoencoder can recover the • hidden structureto generateclean data. • This example has modular design. The encoder, decoder and autoencoder • are 3 models that shareweights. For example, after training the • autoencoder, theencoder can be used to generate latent vectors • of input data for low-dim visualizationlikePCA or TSNE. • ''' • #keras>> tensorflow.keras, modificationby khw • from __future__ import absolute_import • from __future__ import division • from __future__ import print_function • import tensorflow.keras as keras • from tensorflow.keras.layers import Activation, Dense, Input • from tensorflow.keras.layers import Conv2D, Flatten • from tensorflow.keras.layers import Reshape, Conv2DTranspose • from tensorflow.keras.models importModel • from tensorflow.keras importbackend as K • from tensorflow.keras.datasets import mnist • import numpyas np • import matplotlib.pyplot as plt • from PIL import Image • np.random.seed(1337) • # MNIST dataset • (x_train, _), (x_test, _) = mnist.load_data() • image_size = x_train.shape[1] • x_train = np.reshape(x_train, [-1, image_size, image_size, 1]) • x_test = np.reshape(x_test, [-1, image_size, image_size, 1]) • x_train = x_train.astype('float32') / 255 • x_test = x_test.astype('float32') / 255 • # Generate corrupted MNIST images by adding noisewith normal dist • # centered at 0.5 and std=0.5 • noise= np.random.normal(loc=0.5, scale=0.5, size=x_train.shape) • x_train_noisy =x_train + noise • noise= np.random.normal(loc=0.5, scale=0.5, size=x_test.shape) • x_test_noisy=x_test + noise • x_train_noisy =np.clip(x_train_noisy, 0., 1.) • x_test_noisy=np.clip(x_test_noisy, 0., 1.) 
• # Network parameters • input_shape =(image_size, image_size, 1) • batch_size =128 • kernel_size = 3 • latent_dim = 16 • # Encoder/Decoder number of CNN layers and filters per layer • layer_filters = [32, 64] • # Build theAutoencoder Model • # First build theEncoder Model • inputs =Input(shape=input_shape, name='encoder_input') • x = inputs • # Stack of Conv2Dblocks • # Notes: • # 1) UseBatch Normalization before ReLU on deep networks • # 2) UseMaxPooling2Das alternativeto strides>1 • # - faster but not as good as strides>1 • for filters in layer_filters: • x = Conv2D(filters=filters, • kernel_size=kernel_size, • strides=2, • activation='relu', • padding='same')(x) • # Shapeinfo needed to build Decoder Model • shape= K.int_shape(x) • # Generate thelatent vector • x = Flatten()(x) • latent = Dense(latent_dim, name='latent_vector')(x) • # InstantiateEncoder Model • encoder = Model(inputs, latent, name='encoder') • encoder.summary() • # Build theDecoder Model • latent_inputs =Input(shape=(latent_dim,), name='decoder_input') • x = Dense(shape[1] * shape[2] * shape[3])(latent_inputs) • x = Reshape((shape[1], shape[2], shape[3]))(x) • # Stack of Transposed Conv2Dblocks • # Notes: • # 1) UseBatch Normalization before ReLU on deep networks • # 2) UseUpSampling2Das alternativeto strides>1 • # - faster but not as good as strides>1 • for filters in layer_filters[::-1]: • x = Conv2DTranspose(filters=filters, • kernel_size=kernel_size, • strides=2, • activation='relu', • padding='same')(x) • x = Conv2DTranspose(filters=1, • kernel_size=kernel_size, • padding='same')(x) • outputs=Activation('sigmoid', name='decoder_output')(x) • # InstantiateDecoder Model • decoder = Model(latent_inputs, outputs, name='decoder') • decoder.summary() • # Autoencoder = Encoder + Decoder • # InstantiateAutoencoder Model • autoencoder =Model(inputs, decoder(encoder(inputs)), name='autoencoder') • autoencoder.summary() • autoencoder.compile(loss='mse', optimizer='adam') • # Train theautoencoder • autoencoder.fit(x_train_noisy, • x_train, • validation_data=(x_test_noisy, x_test), • epochs=30, • batch_size=batch_size) • # Predict theAutoencoder outputfrom corruptedtest images • x_decoded = autoencoder.predict(x_test_noisy) • # Display the1st 8 corrupted and denoised images • rows, cols = 10, 30 • num = rows * cols • imgs = np.concatenate([x_test[:num], x_test_noisy[:num], x_decoded[:num]]) • imgs = imgs.reshape((rows *3, cols, image_size, image_size)) • imgs = np.vstack(np.split(imgs, rows, axis=1)) • imgs = imgs.reshape((rows *3, -1, image_size, image_size)) • imgs = np.vstack([np.hstack(i) for i in imgs]) • imgs = (imgs * 255).astype(np.uint8) • plt.figure() • plt.axis('off') • plt.title('Original images: top rows, ' • 'Corrupted Input:middlerows, ' • 'Denoised Input: third rows') • plt.imshow(imgs, interpolation='none', cmap='gray') • Image.fromarray(imgs).save('corrupted_and_denoised.png') • plt.show() Ch10. Auto and variational encoders v230607d 19
  • 20. Exercise 3 • Discuss applications of a Vanilla (traditional) autoencoder. • Which of the following is true? MC choices: 1) Image recognition 2) Denoise input images + Image recognition 3) Denoise input images +Dimensionality Reduction 4) Denoise input images only Ch10. Auto and variational encoders v230607d 20
  • 21. Answer: Exercise 3 • Discuss applications of a Vanilla (traditional) autoencoder. • Which of the following is true? MC choices: 1) Image recognition 2) Denoise input images + Image recognition 3) Denoise input images +Dimensionality Reduction (correct) 4) Denoise input images only • More information, see https://en.wikipedia.org/wiki/Autoencoder – Dimensionality Reduction – Relationship with principal component analysis (PCA) – Information Retrieval – Anomaly Detection – Image Processing – Drug discovery Ch10. Auto and variational encoders v230607d 21
  • 22. Part 2: Variational autoencoder Will learn • Learn what is Variational autoencoder • How to train it? • How to use it? Ch10. Auto and variational encoders v230607d 22
  • 23. Some math background is needed: • https://ljvmiranda921.github.io/notebook/20 17/08/13/softmax-and-the-negative-log- likelihood/ • See appendix2: The expected negative log likelihood • Conditional expectation etc. Ch10. Auto and variational encoders v230607d 23
  • 24. Variational Autoencoder (VAE) v.s. Traditional Autoencoder • Autoencoders (vanilla or traditional) – During training you present a pattern with artificial added noise to the encoder, and feed the same input pattern (as target, or teacher) to the output. Then, use backpropagation to train the Autoencoder network. – So, it is unsupervised learning (no label data is needed). – It can be used for data compression and noise removal. – During recall, when a noisy pattern is presented to the input, a de- noise image will appear at the output. • Variational autoencoders – Instead of learning from an input pattern, Variational autoencoders learn the parameters of a probability distribution function from the input patterns. We then use the parameters learned to generate new data. So, it is a generative model like GAN (Generative Adversarial Network) in functionality. Ch10. Auto and variational encoders v230607d 24
  • 25. Variational autoencoder https://jaan.io/what-is-variational-autoencoder-vae-tutorial/ • Variational autoencoders are cool. They let us design complex generative models of data and fit them to large datasets. They can generate images of fictional celebrity faces and high-resolution digital artwork. • VAE faces • VAE faces demo • VAE MNIST • VAE street addresses • https://jaan.io/what-is-variational- autoencoder-vae-tutorial/ • May be or similar to that used in software such as Deepfake (https://en.wikipedia.org/wiki/Deepfake) FICTIONAL CELEBRITY FACES GENERATED BY A VARIATIONAL AUTOENCODER (BY ALEC RADFORD). Ch10. Auto and variational encoders v230607d 25
  • 26. Example: Applying VAE for MNIST data set extension • Ch10. Auto and variational encoders v230607d 26 https://arxiv.org/pdf/1312.6114.pdf Output: generated image Dataset (images extended) Input: original image data set
  • 27. Some background: Univariate and Multivariate Gaussian • https://ttic.uchicago.edu/~shubhendu/Slides/Estimation.pdf Ch10. Auto and variational encoders v230607d 27                   2 2 2 / 1 2 univariate 2 2 1 exp 2 1 ) ( dimension - 1 variance mean, , _ Gaussian Univariate      x x N sample data x                           x x x N d co sample data x T d 1 2 / 1 2 / te multivaria 2 1 exp 2 1 ) ( dimension - variance mean, , _ Gaussian, te Multivaria
  • 28. Properties of Gaussian (Normal) distribution • Standard Normal distribution (1-dimension): • Red line, when mean()=0, Sigma ()=1 – At (x-)=0,  =1 – G(x) =1/sqrt(2*pi)=0.3989 • At x=1*, drops off to – (1/sqrt(2*pi))*exp(-1^1/2)=0.2420 – Area covered 68.2% • At x=2*, drops off to – (1/sqrt(2*pi))*exp(-2^2/2)= 0.0540 – Area covered 95.44% • At x=3*, drops off to – (1/sqrt(2*pi))*exp(-2^2/2)= ?? (exercise) – Area covered 99.73% http://en.wikipedia.org/wiki/Normal_distribution Probability density function               1 ) ( 2 1 mean variance, deviation, standard Gaussian D 1 2 2 2 2 2 dx x G e πσ G(x) σ μ x    Standard Normal distribution Area covered (total= 100%) G G Ch10. Auto and variational encoders v230607d 28  sets the horizontal shift  Controls the shape So called 95% confident value µ(+/-)2
  • 29. Gaussian (Normal) functions 1D,2D • 2 2 / 1      2 2 2 2 2 2 1 G(x)G(y) y) G(x, Gaussian D 2     y x y x e          2 2 2 2 2 1 G(x) mean deviation, standard Gaussian D 1            x e G(x) x y x  x y 1-D Gaussian 2-D Gaussian 2 2 / 1  Ch10. Auto and variational encoders v230607d 29
  • 30. Example : A 1-D and 2-D Gaussian distribution • %2-D Gaussian distribution P(xj) • %matlab code---------- • clear, N=10 • [X1,X2]=meshgrid(-N:N,-N:N); • sigma =2.5;mean=[3 3]' • G=1/(2*pi*sigma^2)* • (exp(-((X1-mean(1)).^2+(X2-mean(2)).^2)) /(2*sigma^2)); • G=G./sum(G(:)) %normalise it • 'sigma is ', sigma • 'sum(G(:)) is ',sum(G(:)) • 'max(max(G(:))) is',max(max(G(:))) • figure(1), clf • surf(X1,X2,G); • xlabel('x1'),ylabel('x2') Ch10. Auto and variational encoders v230607d 30                    2 0 2 0 2 / 1 2 0 2 0 0 2 1 exp 2 1 ) ( variance mean, , _ , Gaussian 1      j j j x x N sample a x D                2 0 2 2 2 1 2 0 2 1 2 exp 2 1 ) , ( 0 mean assume Gaussian symmetric) (circular isotropic an 2   x x x x N D
  • 31. Exercise 4 • In Box 1, sigma ()=2 • x=mx y=my • Mc choices: 1) G(x,y)=1/(2*pi*2+2) 2) G(x,y)=1/(2*pi*2) 3) G(x,y)=1/(2*pi*2^4) 4) G(x,y)=1/(2*pi*2^2) • Student exercise: • Fill in the blanks of this Gaussian mask of size 9x9 , sigma ()=2 • Sketch the function • G(x,y)= • 0.0007 0.0017 0.0033 0.0048 0.0054 0.0048 0.0033 0.0017 0.0007 • 0.0017 0.0042 0.0078 0.0114 0.0129 0.0114 0.0078 0.0042 0.0017 • 0.0033 0.0078 0.0146 0.0213 0.0241 0.0213 0.0146 0.0078 0.0033 • 0.0048 0.0114 0.0213 0.0310 0.0351 0.0310 0.0213 0.0114 0.0048 • 0.0054 0.0129 0.0241 0.0351 BOX1 ? ____? 0.0241 0.0129 0.0054 • 0.0048 0.0114 0.0213 0.0310 0.0351 ____? 0.0213 0.0114 0.0048 • 0.0033 0.0078 0.0146 0.0213 0.0241 0.0213 0.0146 0.0078 0.0033 • 0.0017 0.0042 0.0078 0.0114 0.0129 0.0114 0.0078 0.0042 0.0017 • 0.0007 0.0017 0.0033 0.0048 0.0054 0.0048 0.0033 0.0017 0.0007 Ch10. Auto and variational encoders v230607d 31     2 2 2 2 2 2 1 G(x)G(y) y) G(x, mean Gaussian, D 2   y x m y m x y x e ) ,m (m        x=mx y=my x=1+mx y=my Box1
  • 32. Answer: Exercise 4 Fill in the blanks Gaussian mask of size the 9x9 , sigma ()=2 • 0.0007 0.0017 0.0033 0.0048 0.0054 0.0048 0.0033 0.0017 0.0007 • 0.0017 0.0042 0.0078 0.0114 0.0129 0.0114 0.0078 0.0042 0.0017 • 0.0033 0.0078 0.0146 0.0213 0.0241 0.0213 0.0146 0.0078 0.0033 • 0.0048 0.0114 0.0213 0.0310 0.0351 0.0310 0.0213 0.0114 0.0048 • 0.0054 0.0129 0.0241 0.0351 0.0398 0.0351 0.0241 0.0129 0.0054 • 0.0048 0.0114 0.0213 0.0310 0.0351 0.0310 0.0213 0.0114 0.0048 • 0.0033 0.0078 0.0146 0.0213 0.0241 0.0213 0.0146 0.0078 0.0033 • 0.0017 0.0042 0.0078 0.0114 0.0129 0.0114 0.0078 0.0042 0.0017 • 0.0007 0.0017 0.0033 0.0048 0.0054 0.0048 0.0033 0.0017 0.0007 Ch10. Auto and variational encoders v230607d 32 clear %matlab sigma=2 % in matlab , no -ve index for looping, so shift center to (5,5) mean_x=5 , mean_y=5 for y=1:9 for x=1:9 g(x,y)=(1/(2*pi*sigma^2))*exp(-((x- mean_x)^2+(y-mean_y)^2) /(2*sigma^2)) end end mesh(g) title('2D Gaussian function') 1/(2*pi*2^2): choice 4 is correct, because x=mx, y=my. ,thus 𝑒 − 𝑥−𝑚𝑥 2+ 𝑦−𝑚𝑦 2 2𝜎2 =1 1/(2*pi*2^2)*exp(- 1/8) 1/(2*pi*2^2)*exp (-2/8) Box 1 x=mx y=my x=1+mx y=my 2 − D Gaussian, mean (𝑚𝑥, 𝑚𝑦) G(x,y) = G(x)G(y) = 1 2𝜋𝜎2 𝑒 − 𝑥−𝑚𝑥 2+ 𝑦−𝑚𝑦 2 2𝜎2
  • 33. Variational autoencoder • A neural network view Ch10. Auto and variational encoders v230607d 33 https://www.jeremyjordan.me/variational-autoencoders/ Multivariate Gaussian: Mean = µ  = standard dedication Variance = 2
  • 34. Generative Models concept • It is an unsupervised learning method that generates new samples by using training data from the same distribution • E.g., You have limited number of samples but want to create more samples of the same probability distributions to be used in machine learning purposes. Others include: – Creating new cartoon figures – Generating faces from images of celebrities. – Creating new fashions. – Creating new written characters for training optical character recognition systems of some languages • Generative model algorithms – Variational autoencoder (discussed here) – Generative adversarial network (GAN) not discussed here Ch10. Auto and variational encoders v230607d 34
  • 35. Variational autoencoder for generative models • Use training samples to train hidden data (parameters of multi-variate Gaussian standard deviations=s, means = µs ). After training you may create new output from some input and weighted s and µs . You may change the weights of s and µs for a variety of related different outputs. Ch10. Auto and variational encoders v230607d 35 https://www.quora.com/Whats-the-difference-between-a-Variational-Autoencoder-VAE-and-an-Autoencoder parameters of multi-variate Gaussian standard deviations= s, means= µs ) E.g. 50µs, 30s
  • 36. Application example: Use Generative Models for MNIST data extension http://yann.lecun.com/exdb/mnist/ • Ch10. Auto and variational encoders v230607d 36 During training , patterns are fed into input and output one by one, learn µ, by minimize loss After training, data generation phase Generated extended data set MNIST original data set Random generator layer using 30µs, 30s z
  • 37. Exercise 5:What is the architectural difference between Vanilla (traditional) autoencoder and Variational autoencoder? • MC: Which is incorrect? 1) In Vanilla (traditional) autoencoder: input to output are directly connected by neurons and weights. 2) In Variational autoencoder: The encoder turns input (x) into means (µs) and standard deviations (s) of a multivariate Gaussian distribution, then use a random sampling method to create the output. 3) In Variational autoencoder : input to output are directly connected by neurons and weights. 4) In Variational autoencoder: The number of mean (µs) and standard deviations (s) neurons are the same. Ch10. Auto and variational encoders v230607d 37 Vanilla autoencoder E.g. 30µs, 30s z
  • 38. Answer Exercise 5:What is the architectural difference between Vanilla (traditional) autoencoder and Variational autoencoder? • MC: Which is incorrect? 1) In Vanilla (traditional) autoencoder: input to output are directly connected by neurons and weights. 2) In Variational autoencoder: The encoder turns input (x) into means (µs) and standard deviations (s) of a multivariate Gaussian distribution, then use a random sampling method to create the output. 3) In Variational autoencoder : input to output are directly connected by neurons and weights. (This is incorrect) 4) In Variational autoencoder: The number of mean (µs) and standard deviations (s) neurons are the same. Ch10. Auto and variational encoders v230607d 38 Vanilla autoencoder E.g. 30µs, 30s z
  • 39. Exercise 6a,b for Variational autoencoder VAE • Which statement is incorrect for VAE?: MC choices: 1) Because the search space is large, there are too many combinations of means (µs) and standard deviations (s) for generating the same output. 2) There are multiple solutions for means (µs) and standard deviations (s) 3) There is a deterministic linear solution for VAE 4) Neural network provides a solution for VAE. • (b) Discuss exercise for students: what is a multivariate-Gaussian distribution. Ch10. Auto and variational encoders v230607d 39 form https://en.wikipedia.org/wiki/Multiv ariate_normal_distribution of 2 dimensions
  • 40. Answer: Exercise 6a,b for Variational autoencoder VAE • Which statement is incorrect for VAE?: MC choices: (choice3)There is a deterministic linear solution for VAE (this is incorrect) • (b) Discuss exercise for students: what is a multivariate-Gaussian distribution. • Answer: Multivariate-dimensional Gaussian: • In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions. One definition is that a random vector is said to be k- variate normally distributed if every linear combination of its k components has a univariate normal distribution. Ch10. Auto and variational encoders v230607d 40 form https://en.wikipedia.org/wiki/Multiv ariate_normal_distribution of 2 dimensions
  • 41. Example of variational autoencoder • Neural network Ch10. Auto and variational encoders v230607d 41 https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf By random sampling Random generator layer Z X̂ X
  • 42. Training of Vanilla and Variational Autoencoders • Training of variational autoencoders is like training the vanilla autoencoders. E.g., for the de-noised application, presents noisy images to the input and clean image versions to the output. Use backpropagation to train the network. Read our previous discussion on vanilla autoencoder https://www.edureka.co/blog/autoencoders-tutorial/ http://www.math.purdue.edu/~buzzard/MA598-Spring2019/Lectures/Lec18%20-%20VAE.pptx Ch10. Auto and variational encoders v230607d 42
  • 43. Variational Autoencoder (VAE) • The latent variables, Z, are drawn from a probability distribution depending on the input, X, and the reconstruction is chosen probabilistically from z. • That means after you obtained mean=µ,variance 2, sample from X (n=500 neurons) to get Z (k=30 neurons) • X=(x1,x2,…………,xn) • Z=(z1,z2,…,zk) https://jaan.io/what-is-variational-autoencoder-vae-tutorial/ Ch10. Auto and variational encoders v230607d 43 https://jaan.io/what-is-variational-autoencoder-vae-tutorial/ Z Encoder Q (z|X) Decoder P (X|z) Z=Latent Variables By sampling Z=Sample from a distribution N(µ,) X X̂
  • 44. Three difficult concepts in VAE 1) Train the neural network to maximize input/output likelihood 2) Use of Divergence (DKL) 3) Reparameterization Ch10. Auto and variational encoders v230607d 44
  • 45. Variational Autoencoders VAE Concept 1 Train the neural network to maximize input/output likelihood Ch10. Auto and variational encoders v230607d 45 Tutorial on Variational Autoencoders Carl Doersch https://arxiv.org/abs/1606.05908
  • 46. VAE Encoder • The Encoder q(en)(z|x) takes input x and returns Hidden parameters Z (random generated from µ,). (=encoder parameters. weights/biases) • From Z, use sampling to create input to the decoder • Encoders and Decoders are neural networks (NN) • Parameters in the NN are needed to be learned – so we have to set up a loss function. https://jaan.io/what-is-variational-autoencoder-vae-tutorial/ http://gregorygundersen.com/blog/2018/04/29/reparameterization/ Ch10. Auto and variational encoders v230607d 46 https://jaan.io/what-is-variational-autoencoder-vae-tutorial/ Encoder(XZ) q(en)(z|x) Input Data Decoder(Z ) Hidden Z Output ted Reconstruc X X̂   Z X P de | ˆ ) (  X-> encoder –>Z->decoder x^ X̂  
  • 47. VAE Decoder • The decoder takes hidden variable Z (gen. from means and standard deviations) as input, and reconstructs the image using random sampling methods. ( =decoder parameters weights/biases) • Encoders and Decoders are Neural Networks (NN) • Parameters ( ,) in the NN are needed to be learned – so we have to set up a loss function. https://jaan.io/what-is-variational-autoencoder-vae-tutorial/ Ch10. Auto and variational encoders v230607d 47 https://jaan.io/what-is-variational-autoencoder-vae-tutorial/   Z X P de | ˆ ) (  Encoder(XZ) q(en)(z|x) Input Data Decoder(Z ) Hidden Z Output ted Reconstruc X X̂   Z X P de | ˆ ) (  X̂
  • 48. The reconstruction loss =(l(rec) )= “expected negative log-likelihood” of VAE • Given xi X, zQ, E() is expected value • The idea is to train the Encoder/Decoder (Neural Network) to maximum the likelihood (or minimize binary_cross_entropy (BCE) or Mean squared error (MSE) between x and reconstructed • To maximize likelihood, we minimize the reconstruction loss=“expected negative log-likelihood” (li ) of the i-th datapoint xi. (see appendix 2) Ch10. Auto and variational encoders v230607d 48         z x P E E x l i de Q z X x i rec i i | ˆ log | , ) ( ) (        Encoder q(en)(z|xi) Decoder Hidden Z (µ,) i x data Input   minimized be to , function loss tion Reconstruc ) (   rec i l i x̂ output ted Reconstruc i x BCE or MSE   z x P i de | ˆ ) (  X xi ˆ ˆ 
  • 49. Variational Autoencoders VAE Concept 2 Use of Divergence (DKL): Similar training images should produce similar hidden data (means and standard deviations) Ch10. Auto and variational encoders v230607d 49 http://mi.eng.cam.ac.uk/~mjfg/local/4F10/lect4.pdf https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence https://jhui.github.io/2017/03/06/Variational-autoencoders/ (for relating covariance and standard deviations, with good example)
  • 50. How to make sure the neural networks produce similar hidden data (means & standard deviations) from similar training images • Problem: Input that we regard as similar may end up very different in z space (hidden, means and standard deviations). That means some solutions may give small loss li (all)(,  ), even q(en) and p(de) are of very different distributions. • Solution: Use p(z)=N(0,1), try to force q(en)(z|xi) (a neural network) to act similarly to a standard normal probability density function. We can use Kullback-Leibler divergence (DKL) to do the checking. Ch10. Auto and variational encoders v230607d 50 For encoder and decoder We discussed this in concept 1: https://jaan.io/what-is-variational-autoencoder-vae-tutorial/ https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence http://gregorygundersen.com/blog/2018/04/29/reparameterization/ This is for concept 2: We will minimize (L(all) )     Gaussian and between difference ˆ output and input between loss | , ) ( 1 ) ( en n i i i i i all Q x x x L                  , ) (rec i l       I N x z q D i en KL , 0 || | ) ( 
  • 51. Math background: Kullback–Leibler divergence (also known as relative entropy) measures how one probability distribution is different from another one -- reference probability distribution over the same variable X. • Ch10. Auto and variational encoders v230607d 51 Tutorial on Variational Autoencoders by Carl Doersch & https://arxiv.org/abs/1606.05908                                                                                                       X X X I X tr I N X X N D X X N x z Q I N x z Q D X X X I X tr I N X X N D I N N X X N N I I tr N N D T KL i i T KL T T KL 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 1 2 2 2 1 1 2 2 2 1 2 1 2 2 2 2 2 2 1 1 det log 2 1 , 0 || , , | , 0 || | For det log 2 1 , 0 || , , 0 , also ; , , If ) ( det det log * 2 1 , || ,                                                                               For equation (I) See https://arxiv.org/pdf/1907.08956.pdf https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_diver gence Kullback–Leibler divergence DKL (D1|| D2)=0 indicates the two distributions D1,D2 are identical ℎ𝑒𝑛𝑐𝑒 , µ2 = 0, 2 2=1 N(0,I)=Zero_mean, variance=1 Gaussian
  • 52. Training: combining Concepts 1 and 2 to minimize the loss $l_i(X)$, with $X=\{x_1,x_2,\dots,x_N\}$ and $E()$ = expected value. For the whole $X$, the average loss is built as follows (Concept 1, the reconstruction term):
– Input to the encoder: $x_i \in X$; output of the decoder: $\hat{x}_i \in \hat{X}$.
– $Q(z|x_i)$ = prob. distribution of $z$ generated by $x_i$ (encoder side); $P(\hat{x}_i|z)$ = prob. distribution of $\hat{x}_i$ generated by $z$ (decoder side); $P(z)$ = prob. distribution of the latent (hidden) variable.
– $E_{z\in Q}[\log P(\hat{x}_i|z)]$ = expected value of the log-likelihood of $\hat{x}_i$ generated at the decoder output; $E_{x_i\in X}\big[E_{z\in Q}[\log P(\hat{x}_i|z)]\big]$ = the same, averaged over the input data.
– $\varepsilon$ = random variable generated by a Gaussian function with mean 0 and stdev 1, i.e. $\varepsilon \sim N(0,I)$. At this stage $z$ could have any distribution, but we can assume it is Gaussian, $N(\mu_{z|x_i},\sigma_{z|x_i})$; it can be formed by scaling $\varepsilon \sim N(0,I)$ (see en.wikipedia.org/wiki/Normal_distribution). The advantage: once the encoder has found $(\mu_{z|x_i},\sigma_{z|x_i})$, we use the random $\varepsilon$ to generate $z$, and the decoder uses $z$ to produce $\hat{x}_i$.
– In practice we use $\log P\big(\hat{x}_i \,|\, z=\mu_{z|x_i}(x_i)+\sigma_{z|x_i}(x_i)\ast\varepsilon\big)$.
– We want to maximize $E_{x_i\in X}\big[E_{z\in Q}[\log P(\hat{x}_i|z)]\big]$ (make the input and output likelihoods similar), which is the same as minimizing the negative log-likelihood:
Objective_function1 $= -E_{x_i\in X}\Big[E_{z\in Q}\big[\log P\big(\hat{x}_i \,|\, z=\mu_{z|x_i}(x_i)+\sigma_{z|x_i}(x_i)\ast\varepsilon\big)\big]\Big]$
– Since $P$ is Gaussian, minimizing Objective_function1 amounts to minimizing
Objective_function1 $= \frac{1}{N}\sum_{x_i\in X}\frac{1}{2\sigma^2_{\hat{x}_i|z}}\big(x_i-\mu_{\hat{x}_i|z}\big)^2$
Concept 1. See http://bjlkeng.github.io/posts/variational-autoencoders/ & https://arxiv.org/abs/1312.6114 Ch10. Auto and variational encoders v230607d 52
  • 53. Training: combining Concepts 1 and 2 to minimize the loss $l_i(X)$, with $X=\{x_1,x_2,\dots,x_N\}$, $E()$ = expected value (Concept 2, the divergence term, and the overall objective):
– Recall $q_{\theta(en)}(z|x_i)$ = prob. distribution of $z$ generated by $x_i$ (encoder side). We mentioned earlier that we want $q_{\theta(en)}(z|x_i)$ to be close to a Gaussian, so put $P(z)=N(0,I)$.
– $D_{KL}\big[q_{\theta(en)}(z|x_i) \,\|\, N(0,I)\big]$ = difference between $q_{\theta(en)}(z|x_i)$ and a Gaussian (see the previous slides).
– Objective_function2 $= D_{KL}\big[q_{\theta(en)}(z|x_i) \,\|\, N(0,I)\big]$, which is to be minimized.
– Overall objective function = Objective_function1 + Objective_function2
$= \frac{1}{N}\sum_{x_i\in X}\frac{1}{2\sigma^2_{\hat{x}_i|z}}\big(x_i-\mu_{\hat{x}_i|z}\big)^2 + D_{KL}\big[q_{\theta(en)}(z|x_i) \,\|\, N(0,I)\big]$
– We have shown earlier that $D_{KL}\big[q_{\theta(en)}(z|x_i) \,\|\, N(0,I)\big] = \tfrac{1}{2}\big\{ tr\big(\sigma^2(X)-I\big)+\mu(X)^T\mu(X)-\log\det\sigma^2(X) \big\}$, thus
$L^{(all)} = \frac{1}{N}\sum_{x_i\in X}\frac{1}{2\sigma^2_{\hat{x}_i|z}}\big(x_i-\mu_{\hat{x}_i|z}\big)^2 + \tfrac{1}{2}\big\{ tr\big(\sigma^2(X)-I\big)+\mu(X)^T\mu(X)-\log\det\sigma^2(X) \big\}$
– The first term is Concept 1 (the reconstruction loss $l_i^{(rec)}(\theta,\phi)$); the second term is Concept 2. We will run an iterative algorithm to minimize $L^{(all)}$.
See http://bjlkeng.github.io/posts/variational-autoencoders/ & https://arxiv.org/abs/1312.6114 Ch10. Auto and variational encoders v230607d 53
  • 54. For VAE implementation
– Input $X=(x_1,x_2,\dots,x_n)$.
– Using the encoder, from $X$ we obtain $k$ Gaussian distributions $N(\mu_j,\sigma_j)$, one per latent dimension.
– Each $z_j$ is generated by $N(\mu_j,\sigma_j)$, where $j=1,\dots,k$; then we have $Z=(z_1,z_2,\dots,z_k)$.
– From the previous slide, $D_{KL}\big[q_{\theta(en)}(z|x_i) \,\|\, N(0,I)\big]$ is to be minimized. For the VAE application (diagonal covariance), it becomes (Concept 2):
$D_{KL}\big[ N\big((\mu_1,\dots,\mu_k)^T, \mathrm{diag}(\sigma_1^2,\dots,\sigma_k^2)\big) \,\|\, N(0,I) \big] = \tfrac{1}{2}\sum_{j=1}^{k}\big(\sigma_j^2+\mu_j^2-1-\ln\sigma_j^2\big)$
See https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence https://wiseodd.github.io/techblog/2016/12/10/variational-autoencoder/ Ch10. Auto and variational encoders v230607d 54
  • 55. In practice
– We replace $\sigma_j^2$ with $\exp(\sigma_j^2)$ and $\ln(\sigma_j^2)$ with $\sigma_j^2$ to enable stability in the numerical calculation; in other words, the encoder output is interpreted as the log-variance. For the minimization of $D_{KL}$, this replacement gives the same result.
– We have seen earlier: $D_{KL}\big[q_{\theta(en)}(z|x_i) \,\|\, N(0,I)\big] = D_{KL}\big[N\big((\mu_1,\dots,\mu_k)^T,\mathrm{diag}(\sigma_1^2,\dots,\sigma_k^2)\big) \,\|\, N(0,I)\big] = \tfrac{1}{2}\sum_{j=1}^{k}\big(\sigma_j^2+\mu_j^2-1-\ln\sigma_j^2\big)$
– With the replacement, it becomes
$D_{KL}\big[N\big((\mu_1,\dots,\mu_k)^T,\mathrm{diag}(\sigma_1^2,\dots,\sigma_k^2)\big) \,\|\, N(0,I)\big] = \tfrac{1}{2}\sum_{j=1}^{k}\big(\exp(\sigma_j^2)+\mu_j^2-1-\sigma_j^2\big)$
– This is the actual function we will use during minimization.
Ch10. Auto and variational encoders v230607d 55
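A minimal numpy sketch of this practical form (an illustration, assuming the encoder's second output head is interpreted as the log-variance, which is what the substitution above amounts to; the numbers are made up):

import numpy as np

mu = np.array([0.3, -1.2])        # encoder output: means (k = 2 latent dimensions)
log_var = np.array([-0.5, 0.1])   # encoder output: log-variances (the "sigma^2" after the substitution)

# D_KL[ N(mu, diag(exp(log_var))) || N(0, I) ] = 0.5 * sum( exp(log_var) + mu^2 - 1 - log_var )
d_kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
print(d_kl)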
  • 56. Use neural networks to implement the system
– Encoder neural network and decoder neural network; input data $X$, reconstructed output $\hat{X}$.
– Use backpropagation to minimize the decoder's reconstruction loss (Concept 1): binary cross-entropy (BCE) or mean squared error (MSE) between input $X$ and output $\hat{X}$.
– Use backpropagation to minimize the loss function $L^{(all)}$ of the encoder (Concepts 1 & 2):
Minimize loss $L^{(all)} = \frac{1}{N}\sum_{x_i\in X}\frac{1}{2\sigma^2_{\hat{x}_i|z}}\big(x_i-\mu_{\hat{x}_i|z}\big)^2 + D_{KL}\big[q_{\theta(en)}(z|x_i) \,\|\, N(0,I)\big]$
(first term: Concept 1; second term: Concept 2)
Ch10. Auto and variational encoders v230607d 56
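Putting the two concepts together, a minimal numpy sketch of the total loss for one batch (array names and shapes are illustrative assumptions, not the lecture's code; MSE is used for the reconstruction term):

import numpy as np

def vae_loss(x, x_hat, mu, log_var):
    # Concept 1: reconstruction loss (here MSE between input and reconstruction)
    rec = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    # Concept 2: D_KL[ q(z|x) || N(0, I) ] for a diagonal Gaussian encoder
    kl = np.mean(0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=1))
    return rec + kl

# Illustrative shapes: batch of 4, input dimension 784, latent dimension 2
x = np.random.rand(4, 784)
x_hat = np.random.rand(4, 784)
mu = np.random.randn(4, 2)
log_var = np.random.randn(4, 2)
print(vae_loss(x, x_hat, mu, log_var))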
  • 57. The training method
– The latent vector represents Gaussian distributions ($\mu$, $\sigma$) of $z$.
– Input $X$ and output $\hat{X}$ should be similar.
– Minimize the loss $L^{(all)}$ using Concepts 1 & 2.
http://anotherdatum.com/vae.html Ch10. Auto and variational encoders v230607d 57
  • 58. Variational Autoencoders VAE Concept 3 Reparameterization: the method to enable backpropagation for training neural network that involves random processes Ch10. Auto and variational encoders v230607d 58
  • 59. VAE generative model
– In theory, we can sample $z_i$ from $N(\mu_i,\sigma_i)$ produced by the encoder. Note: $N()$ = Gaussian function.
– $Z$ is the input to the decoder, which produces the output.
– Alternatively, we find $z$ by sampling $\varepsilon$ (called epsilon or eps) from $N(0,1)$ (Gaussian with mean 0, StdDev 1), then compute $z_i=\mu_i+\varepsilon\ast\sigma_i$.
– Then $z_i$ has mean $=\mu_i$ and StdDev $=\sigma_i$, as required.
– See gen_data_using_mean0_sigma1.m in the appendix.
– This is called reparameterization. Reason: with this form we can back-propagate through this function during training.
Ch10. Auto and variational encoders v230607d 59
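A minimal numpy sketch of this sampling step (the mean and standard deviation values are made up); the printed statistics confirm that the reparameterized samples have approximately the requested mean and standard deviation:

import numpy as np

mu, sigma = 2.0, 0.5              # (mu_i, sigma_i) as produced by the encoder (illustrative values)
eps = np.random.randn(100000)     # eps ~ N(0, 1)
z = mu + sigma * eps              # reparameterization: z ~ N(mu, sigma^2)

print(z.mean(), z.std())          # approximately 2.0 and 0.5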
  • 60. Train the variational-encoder • How to train the auto-encoder neural network? • Difficulty – Since a random process is involved, backpropagation cannot be executed • Solution – Use of the re- parameterization trick Ch10. Auto and variational encoders v230607d 60 Generate z by random sampling
  • 61. Training : an example • example Ch10. Auto and variational encoders v230607d 61 https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf Random generator layer Z X̂ X
  • 62. Learning algorithm: the probability function (left-side diagram) cannot be back-propagated, therefore the reparameterization trick (right-side diagram) should be applied.
– Figure 3: an initial attempt at a variational autoencoder without the "reparameterization trick". Objective functions shown in red. We cannot back-propagate through the stochastic sampling operation because it is not a continuous deterministic function. Problem: cannot backpropagate.
– Figure 4: a variational autoencoder with the "reparameterization trick". Notice that all operations between the inputs and objectives are continuous deterministic functions, allowing back-propagation to occur. Solution: reparameterization trick, with a random generator layer producing $\varepsilon$.
– $Q(z|x)=N(\mu_{z|X},\sigma_{z|X})$ should be close to $N(0,I)$; $P(X|z)$ is the decoder. We also want the output to be similar to the input. StdDev $=\sigma$.
http://bjlkeng.github.io/posts/variational-autoencoders/ Ch10. Auto and variational encoders v230607d 62
  • 63. Intuition of the reparameterization trick
– The encoder uses random sampling to generate $z$.
– Backpropagation (during training) is not possible through the random sampling process.
– Reparameterization produces the same effect for the encoder.
– Backpropagation (during training) becomes possible because no random process lies on the backpropagation path.
– (Figure: encoder, path by random sampling, backpropagation path, $\varepsilon$.)
Ch10. Auto and variational encoders v230607d 63
  • 64. Reparameterization: $Z$ can be produced by a scaled $N(0,I)$
– Reparameterization generates any Gaussian distribution with known mean ($\mu_x$) and standard deviation ($\sigma_x$) using the equation $Z=\mu_x+\sigma_x\varepsilon$, where the variable $\varepsilon$ is generated by $N(0,1)$.
– After the forward pass, $\varepsilon$ has already been generated, so it is no longer random; it is data to be used in backpropagation during training.
– $N(0,1)$ = Gaussian with mean 0 and standard deviation 1; $\varepsilon$ = the variable generated from $N(0,1)$; $\mu_x$ = mean; $\sigma_x$ = standard deviation; $Z=\mu_x+\sigma_x\varepsilon$.
– (Figure: input data feeds the mean and standard deviation; $N(0,1)$ produces the random variable $\varepsilon$.)
Ch10. Auto and variational encoders v230607d 64
  • 66. Summary for reparameterization
– $\varepsilon$ = the variable obtained by sampling $N(0,1)$
– $\mu_x$ = mean
– $\sigma_x$ = standard deviation
– $z=\mu_x+\sigma_x\varepsilon$; this equation is deterministic, so it can be backpropagated through.
– See the code in https://learnopencv.com/variational-autoencoder-in-tensorflow/
Ch10. Auto and variational encoders v230607d 66
  • 67. Exercise 7
– In the reparameterization of the variational autoencoder method shown below, $\varepsilon=0.35$ is a value randomly sampled from the normal distribution with mean 0 and standard deviation 1. If the output of the encoder network has $\mu_{z|x}$ = mean = 0.3 and $\sigma_{z|x}$ = standard deviation = 0.8, find the value of $z$.
– MC choices: 1) 0.50 2) 0.54 3) 0.56 4) 0.58
Ch10. Auto and variational encoders v230607d 67
  • 68. Answer: Exercise 7
– In the reparameterization of the variational autoencoder method shown below, $\varepsilon=0.35$ is a value randomly sampled from the normal distribution with mean 0 and standard deviation 1. If the output of the encoder network has $\mu_{z|x}$ = mean = 0.3 and $\sigma_{z|x}$ = standard deviation = 0.8, find the value of $z$.
– MC choices: 1) 0.50 2) 0.54 3) 0.56 4) 0.58 (correct)
– Answer: $z=\mu+\varepsilon\ast\sigma_{z|x}$; here $\varepsilon=0.35$, $\mu=0.3$, standard deviation $\sigma_{z|x}=0.8$, so $z=0.3+0.35\times 0.8=0.58$.
Ch10. Auto and variational encoders v230607d 68
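The arithmetic can be checked with a two-line computation (values taken from the exercise):

mu, sigma, eps = 0.3, 0.8, 0.35
print(mu + eps * sigma)   # 0.58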
  • 69. Exercise 8
– Discussion exercise: why is reparameterization needed?
Ch10. Auto and variational encoders v230607d 69
  • 70. Answer: Exercise 8. Discuss why reparameterization is needed.
– Answer: $Z$ is generated by a random process from mean $\mu_x$ and standard deviation $\sigma_x$. Since the VAE system is implemented using neural networks, backpropagation is needed to train the weights/parameters, and the random process of generating $Z$ cannot be backpropagated.
– Solution: the reparameterization trick converts the random process into a deterministic process ($z=\mu_x+\sigma_x\varepsilon$) with the help of a random variable $\varepsilon$ drawn from the normal distribution with mean 0 and standard deviation 1, i.e. $N(0,1)$; this deterministic process can be backpropagated.
Reparameterization trick. Ch10. Auto and variational encoders v230607d 70
  • 71. Demo Matlab code gen_data_using_mean0_sigma1.m shows the idea: $X=\mu_x+\sigma_x\ast eps$ is the formula for generating $X$ from eps (generated by a normal distribution with mean 0, std 1). https://nbviewer.jupyter.org/github/gokererdogan/Notebooks/blob/master/Reparameterization%20Trick.ipynb
• %gen_data_using_mean0_sigma1.m
• clear
• %%large number of samples %%
• eps=randn(10000,1);
• mu_x=2 %this is your mean
• sigma_x=1 %this is your std
• x=mu_x+(eps*sigma_x);
• grad2_of_mean= sum(2*(mu_x+eps))/length(x);
• 'grad2 of mean='
• grad2_of_mean
• 'mean(x)='
• mean(x)
• 'std(x)='
• std(x)
• Result: grad2_of_mean = 3.9933
• mean(x) = 1.9960 (approximately 2)
• std(x) = 0.9984 (approximately 1)
• $\sigma_x$ = standard deviation of x
• $\mu_x$ = mean of x
• eps ~ N(mean=0, std=1), normal distribution
• $X=\mu_x+\sigma_x\ast eps$
• The gradient with respect to the mean is expected_value_of(2(eps+mu_x)), assuming $\sigma_x=1$ for simplicity.
• The above is not random, because eps has already been generated and $\mu_x$ is the current mean. We can use this in our backpropagation formula to find the updated mean.
Using $X=\mu_x+\sigma_x\ast eps$, we can find the gradient while bypassing the random process. Because eps is generated by a random process during the neural net forward pass, during backpropagation it is just data (now available deterministically) to be used. Note: grad2_of_mean = expected_value_of(2(eps+mu_x)). Ch10. Auto and variational encoders v230607d 71
  • 73. Keras (figure; StdDev = $\sigma$) Ch10. Auto and variational encoders v230607d 73
  • 74. Keras implementation of VAE • x = Input(shape=(original_dim,)) • h = Dense(intermediate_dim, activation='relu')(x) • z_mu = Dense(latent_dim)(h) • z_log_var = Dense(latent_dim)(h) • z_mu, z_log_var = KLDivergenceLayer()([z_mu, z_log_var]) • # Use of lambda: normalize log variance to std dev • z_sigma = Lambda(lambda t: K.exp(.5*t))(z_log_var) • eps = Input(tensor=K.random_normal(shape=(K.shape(x)[0], • latent_dim))) • z_eps = Multiply()([z_sigma, eps]) • z = Add()([z_mu, z_eps]) • decoder = Sequential([ • Dense(intermediate_dim, input_dim=latent_dim, activation='relu'), • Dense(original_dim, activation='sigmoid') • ]) • x_pred = decoder(z) Ch10. Auto and variational encoders v230607d 74 http://louistiao.me/posts/implementing-variational-autoencoders-in-keras-beyond-the-quickstart-tutorial/ original_dim = 784 intermediate_dim = 256 latent_dim = 2 batch_size = 100 epochs = 50 epsilon_std = 1.0 StdDev= Predicted output
  • 76. variational_autoencoder_deconv .py from https://github.com/keras-team/keras/tree/master/ • '''Example of VAE on MNIST dataset using CNN • • The VAE has a modular design. The encoder, decoder and VAE • are 3 models that share weights. After training the VAE model, • the encoder can be used to generate latent vectors. • The decoder can be used to generate MNIST digits by sampling the • latent vector from a Gaussian distribution with mean=0 and std=1. • • # Reference • • [1] Kingma, Diederik P., and Max Welling. • "Auto-encoding variational bayes." • https://arxiv.org/abs/1312.6114 • ''' • • from __future__ import absolute_import • from __future__ import division • from __future__ import print_function • • from tensorflow.keras.layers import Dense, Input • from tensorflow.keras.layers import Conv2D, Flatten, Lambda • from tensorflow.keras.layers import Reshape, Conv2DTranspose • from tensorflow.keras.models import Model • from tensorflow.keras.datasets import mnist • from tensorflow.keras.losses import mse, binary_crossentropy • from tensorflow.keras.utils import plot_model • from tensorflow.keras import backend as K • • import numpy as np • import matplotlib.pyplot as plt • import argparse • import os • • • # reparameterization trick • # instead of sampling from Q(z|X), sample eps = N(0,I) • # then z = z_mean + sqrt(var)*eps • def sampling(args): • """Reparameterization trick by sampling fr an isotropic unit Gaussian. • • # Arguments • args (tensor): mean and log of variance of Q(z|X) • Ch10. Auto and variational encoders v230607d 76 n variational_autoencoder_deconv : use: vae.save_weights('vae_cnn_mnist.tf') #instead of vae.save_weights('vae_cnn_mnist.h5') Resulst Epoch 30/30 60000/60000 [==============================] - 91s 2ms/sample - loss: 145.7313 - val_loss: 146.8615 To run this, you need to install: >>conda install graphviz >>conda install pydot
  • 78. Summary • Learned vanilla autoencoder • Learned variational autoencoder • Learned the Reparameterization trick to enable learning in variational autoencoder Ch10. Auto and variational encoders v230607d 78
  • 80. Appendices Ch10. Auto and variational encoders v230607d 80
  • 81. Appendix 1: Training: combining Concepts 1 and 2 to minimize the loss $L$. $X=\{x_1,x_2,\dots,x_N\}$, $E()$ = expected value. For the whole $X$, the average loss is:
– $Q(z|x_i)$ = prob. distribution of $z$ generated by $x_i$ (encoder side); $P(\hat{x}_i|z)$ = prob. distribution of $\hat{x}_i$ generated by $z$ (decoder side); $P(z)=N(0,I)$ is the prior over the latent variable.
– For each datapoint, the loss is the negative expected log-likelihood (Concept 1) plus the divergence from the prior (Concept 2):
$l_i(X) = -E_{z\sim Q(z|x_i)}\big[\log P(\hat{x}_i|z)\big] + D_{KL}\big[Q(z|x_i) \,\|\, N(0,I)\big]$
– Averaging over the dataset and using Gaussian encoder/decoder distributions (as on slides 52-53):
$L = \frac{1}{N}\sum_{x_i\in X}\frac{1}{2\sigma^2_{\hat{x}_i|z}}\big(x_i-\mu_{\hat{x}_i|z}\big)^2 + \tfrac{1}{2}\big\{ tr\big(\sigma^2(X)-I\big)+\mu(X)^T\mu(X)-\log\det\sigma^2(X) \big\}$
– The first term is Concept 1, the second term is Concept 2. We will run an iterative algorithm to minimize $L$.
See http://bjlkeng.github.io/posts/variational-autoencoders/ & https://arxiv.org/abs/1312.6114 Ch10. Auto and variational encoders v230607d 81
  • 82. Appendix 2 Probability likelihood A tutorial KH Wong Ch10. Auto and variational encoders v230607d 82
  • 83. Overview • Bayesian rules • Gaussian distribution • Probability vs likelihood • Log-likelihood and maximum likelihood • Negative log-likelihood Ch10. Auto and variational encoders v230607d 83
  • 84. Bayesian rules Ch10. Auto and variational encoders v230607d 84
  • 85. Bayesian rules • P(B|A)=P(A|B)P(B)/P(A) • P(A and B)=P(A,B)=P(A|B) P(B) • P(A,B|C)=P(A|B,C) P(B|C) • Prove the above as exercises Ch10. Auto and variational encoders v230607d 85 In each cell, the joint probability p(r, c) is re-expressed by the equivalent form p(r | c) p(c) from the definition of conditional probability in Equation 5.3. The marginal probability p(r) =Σc*p(r | c*) p(c*), https://www.sciencedirect.com/topics/mathematics/marginal-probability
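These identities can be checked numerically on any small joint probability table; a hedged sketch with a made-up 2x2 joint distribution:

import numpy as np

# Hypothetical joint distribution P(A, B) over A in {0,1} (rows) and B in {0,1} (columns)
P_AB = np.array([[0.10, 0.30],
                 [0.20, 0.40]])

P_A = P_AB.sum(axis=1)            # marginal P(A) = sum_B P(A, B)
P_B = P_AB.sum(axis=0)            # marginal P(B) = sum_A P(A, B)

P_A_given_B = P_AB / P_B          # P(A|B) = P(A, B) / P(B)  (broadcast over columns)
P_B_given_A = (P_AB.T / P_A).T    # P(B|A) = P(A, B) / P(A)  (broadcast over rows)

# Check Bayes' rule: P(B|A) = P(A|B) P(B) / P(A), for A=1, B=0
a, b = 1, 0
print(np.isclose(P_B_given_A[a, b], P_A_given_B[a, b] * P_B[b] / P_A[a]))   # True

# Check the product rule: P(A, B) = P(A|B) P(B)
print(np.allclose(P_AB, P_A_given_B * P_B))                                  # True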
  • 86. Gaussian distribution
• %2-D Gaussian distribution P(xj)
• %matlab code----------
• clear, N=10
• [X1,X2]=meshgrid(-N:N,-N:N);
• sigma =2.5;mean=[3 3]'
• G=1/(2*pi*sigma^2)*exp(-((X1-mean(1)).^2+(X2-mean(2)).^2)/(2*sigma^2));
• G=G./sum(G(:)) %normalise it
• 'sigma is ', sigma
• 'sum(G(:)) is ',sum(G(:))
• 'max(max(G(:))) is',max(max(G(:)))
• figure(1), clf
• surf(X1,X2,G);
• xlabel('x1'),ylabel('x2')
– 1-D Gaussian, $x_j$ = a sample, $\mu_0$ = mean, $\sigma_0^2$ = variance:
$N(x_j)=\frac{1}{(2\pi\sigma_0^2)^{1/2}}\exp\Big(-\frac{1}{2}\frac{(x_j-\mu_0)^2}{\sigma_0^2}\Big)$
– 2-D isotropic (circularly symmetric) Gaussian, assuming zero mean:
$N(x_1,x_2)=\frac{1}{2\pi\sigma_0^2}\exp\Big(-\frac{x_1^2+x_2^2}{2\sigma_0^2}\Big)$
Ch10. Auto and variational encoders v230607d 86
  • 87. Probability vs likelihood
– They are two sides of the same coin.
– P() probability function: given a Gaussian model (with mean $\mu_0$ and variance $\sigma_0^2$), the probability function $P(X|\mu_0,\sigma_0^2)$ measures the probability that the observation $X$ is generated by the model.
– L() likelihood function: given data $X$, the likelihood function $L(\mu_0,\sigma_0^2|X)$ measures the probability that $X$ fits the Gaussian model with mean $\mu_0$ and variance $\sigma_0^2$.
– Major application: given data $X$, we can maximize the likelihood function $L(\mu_0,\sigma_0^2|X)$ to find the model $(\mu_0,\sigma_0^2)$ that fits the data. This is called the maximum likelihood method.
– The log-likelihood rather than the likelihood is more convenient for finding the maximum, hence it is often used.
$P(X|\mu_0,\sigma_0^2)=L(\mu_0,\sigma_0^2|X)$
Ch10. Auto and variational encoders v230607d 87
  • 88. Likelihood function L( ) of n-dimensional data
– For $X=[x_1,x_2,\dots,x_n]$, the likelihood function is
$L(\mu,\sigma^2|X)=\big(2\pi\sigma^2\big)^{-n/2}\exp\Big(-\frac{1}{2\sigma^2}\sum_{j=1}^{n}(x_j-\mu)^2\Big)$
– Intuition: the likelihood function $L(\mu,\sigma|X)$ means: given a Gaussian model N(mean, variance), how much the multivariate data $X=[x_1,x_2,x_3,\dots,x_n]$ fits the model with parameters $(\mu,\sigma)$.
– Proof: given the assumption that the observations from the sample are IID, the likelihood function can be written as
$L(\mu,\sigma^2|X)=\prod_{j=1}^{n}N(x_j|\mu,\sigma^2)=\prod_{j=1}^{n}\big(2\pi\sigma^2\big)^{-1/2}\exp\Big(-\frac{(x_j-\mu)^2}{2\sigma^2}\Big)=\big(2\pi\sigma^2\big)^{-n/2}\exp\Big(-\frac{1}{2\sigma^2}\sum_{j=1}^{n}(x_j-\mu)^2\Big)$
Ch10. Auto and variational encoders v230607d 88
  • 89. A more useful representation is the log-likelihood function $l()=\log(L())$
– Intuition: the peak of the likelihood and the log-likelihood functions is at the same place.
– The two are a one-to-one mapping, hence no information is lost.
– Log-based expressions are easier to handle mathematically, so the log-likelihood function is often used.
– For computers, logs of numbers are smaller, which may save memory; using logs we can use addition rather than multiplication, which makes computation easier.
– By definition, for $X=[x_1,x_2,\dots,x_n]$ the log-likelihood function is
$l(\mu,\sigma^2|x_1,\dots,x_n)=\ln L(\mu,\sigma^2|x_1,\dots,x_n)= -\frac{n}{2}\ln(2\pi)-\frac{n}{2}\ln(\sigma^2)-\frac{1}{2\sigma^2}\sum_{j=1}^{n}(x_j-\mu)^2$
– Proof:
$\ln L(\mu,\sigma^2|X)=\ln\Big[\big(2\pi\sigma^2\big)^{-n/2}\exp\Big(-\frac{1}{2\sigma^2}\sum_{j=1}^{n}(x_j-\mu)^2\Big)\Big] = -\frac{n}{2}\ln(2\pi)-\frac{n}{2}\ln(\sigma^2)-\frac{1}{2\sigma^2}\sum_{j=1}^{n}(x_j-\mu)^2$, proved!
Ch10. Auto and variational encoders v230607d 89
  • 90. Maximum Likelihood vs. Log-Likelihood
– Given $X=[x_1,x_2,\dots,x_n]$ and the Gaussian parameter set $\theta=(\mu,\sigma^2)$, the likelihood function is
$L(\theta|X)=L(\mu,\sigma^2|X)=\big(2\pi\sigma^2\big)^{-n/2}\exp\Big(-\frac{1}{2\sigma^2}\sum_{j=1}^{n}(x_j-\mu)^2\Big)$
– Taking the log of the likelihood gives the log-likelihood function
$l(\theta|X)=\ln L(\theta|X)= -\frac{n}{2}\ln(2\pi)-\frac{n}{2}\ln(\sigma^2)-\frac{1}{2\sigma^2}\sum_{j=1}^{n}(x_j-\mu)^2$
– Important: since the log is a monotonic function, $\arg\max_\theta L(\theta|X)=\arg\max_\theta l(\theta|X)$.
– The maximum happens at $\theta=(\mu,\sigma^2)$, where $\mu=\frac{1}{n}\sum_{j=1}^{n}x_j$ and variance $\sigma^2=\frac{1}{n}\sum_{j=1}^{n}(x_j-\mu)^2$.
http://jrmeyer.github.io/machinelearning/2017/08/18/mle.html https://towardsdatascience.com/probability-concepts-explained-maximum-likelihood-estimation-c7b4342fdbb1 Ch10. Auto and variational encoders v230607d 90
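A minimal numpy sketch of this result (the data, seed, and grid ranges are illustrative assumptions): maximizing the log-likelihood over a grid of (mu, sigma^2) recovers approximately the sample mean and the (biased) sample variance.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=5000)   # data drawn from N(mu=3, sigma=2)

def log_likelihood(mu, var, x):
    # l(mu, var | x) = -n/2*ln(2*pi) - n/2*ln(var) - sum((x-mu)^2) / (2*var)
    n = x.size
    return (-0.5 * n * np.log(2 * np.pi) - 0.5 * n * np.log(var)
            - np.sum((x - mu) ** 2) / (2 * var))

# Brute-force grid search over (mu, var)
mus = np.linspace(2.0, 4.0, 101)
vars_ = np.linspace(2.0, 6.0, 201)
best = max((log_likelihood(m, v, x), m, v) for m in mus for v in vars_)
print(best[1], best[2])          # grid maximizer, close to the closed-form answers below
print(x.mean(), x.var())         # mu_hat = sample mean, sigma^2_hat = biased sample variance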
  • 91. Proof 1: maximum of the log-likelihood function of a Gaussian distribution with respect to the mean
– Likelihood function: $L(\mu,\sigma^2|X)=\big(2\pi\sigma^2\big)^{-n/2}\exp\Big(-\frac{1}{2\sigma^2}\sum_{j=1}^{n}(x_j-\mu)^2\Big)$; log-likelihood: $l(X)=-\frac{n}{2}\ln(2\pi)-\frac{n}{2}\ln(\sigma^2)-\frac{1}{2\sigma^2}\sum_{j=1}^{n}(x_j-\mu)^2$.
– The maximum log-likelihood with respect to $\mu$ is at $\frac{d\,l(X)}{d\mu}=\frac{d}{d\mu}\ln L(\mu,\sigma^2|x_1,\dots,x_n)=0$.
– $\frac{d\,l(X)}{d\mu}=\frac{d}{d\mu}\Big[-\frac{n}{2}\ln(2\pi)-\frac{n}{2}\ln(\sigma^2)-\frac{1}{2\sigma^2}\sum_{j=1}^{n}(x_j-\mu)^2\Big]=\frac{1}{\sigma^2}\sum_{j=1}^{n}(x_j-\mu)$
– Setting $\frac{1}{\sigma^2}\sum_{j=1}^{n}(x_j-\mu)=0$ gives $\sum_{j=1}^{n}x_j-n\mu=0$, hence $\mu=\frac{1}{n}\sum_{j=1}^{n}x_j$ = mean of $x$.
– So what is the expression of $\sigma^2$ that maximizes the log-likelihood? (See Proof 2.)
https://towardsdatascience.com/probability-concepts-explained-maximum-likelihood-estimation-c7b4342fdbb1 http://jrmeyer.github.io/machinelearning/2017/08/18/mle.html Ch10. Auto and variational encoders v230607d 91
  • 92. Proof 2: maximum of the log-likelihood function of a Gaussian distribution with respect to the variance
– Maximum log-likelihood happens when $\frac{d}{d\sigma^2}\ln L(x_1,\dots,x_n|\mu,\sigma^2)=0$; note $\frac{d\ln(z)}{dz}=\frac{1}{z}$.
– For a Gaussian, $l=-\frac{n}{2}\ln(2\pi)-\frac{n}{2}\ln(\sigma^2)-\frac{1}{2\sigma^2}\sum_{j=1}^{n}(x_j-\mu)^2$, so
$\frac{dl}{d\sigma^2}=-\frac{n}{2\sigma^2}+\frac{1}{2\sigma^4}\sum_{j=1}^{n}(x_j-\mu)^2=0$
– Hence $\frac{n}{2\sigma^2}=\frac{1}{2\sigma^4}\sum_{j=1}^{n}(x_j-\mu)^2$, and $\hat{\sigma}^2=\frac{1}{n}\sum_{j=1}^{n}(x_j-\mu)^2$ = variance of $x$.
– That means, given the data, it is most likely to be generated by a Gaussian distribution with mean = mean of $x_j$ and variance = variance of $x_j$. Note: if $n=1$, $\hat{\sigma}^2=(x_j-\mu)^2$.
http://people.stat.sfu.ca/~raltman/stat402/402L4.pdf https://towardsdatascience.com/probability-concepts-explained-maximum-likelihood-estimation-c7b4342fdbb1 Ch10. Auto and variational encoders v230607d 92
  • 93. Alternative proof: maximum log-likelihood, finding the most suitable variance $\sigma^2$
– Proof: solve the maximum log-likelihood problem $\frac{dl}{d\sigma}=0$ for $l(\mu,\sigma^2|x_1,\dots,x_n)$:
$\frac{d}{d\sigma}\Big[-\frac{n}{2}\ln(2\pi)-n\ln(\sigma)-\frac{1}{2\sigma^2}\sum_{j=1}^{n}(x_j-\mu)^2\Big] = -\frac{n}{\sigma}+\frac{1}{\sigma^3}\sum_{j=1}^{n}(x_j-\mu)^2 = 0$
– which equals zero only if $\hat{\sigma}^2=\frac{1}{n}\sum_{j=1}^{n}(x_j-\mu)^2$ (the maximum of the Gaussian log-likelihood occurs here), done. Note: if $n=1$, $\hat{\sigma}^2=(x_j-\mu)^2$.
– The maximum likelihood is at $\hat{\mu}=\frac{1}{n}\sum_{j=1}^{n}x_j$, $\hat{\sigma}^2=\frac{1}{n}\sum_{j=1}^{n}(x_j-\hat{\mu})^2$.
Ch10. Auto and variational encoders v230607d 93
  • 94. Negative Log-Likelihood (NLL) And its application in softmax To maximize log-likelihood, we can minimize its negative log-likelihood (NLL) function Ch10. Auto and variational encoders v230607d 94
  • 95. Softmax function
– https://medium.com/data-science-bootcamp/understand-the-softmax-function-in-minutes-f3a59641e86d
– $\mathrm{softmax}(y_i)=\frac{\exp(y_i)}{\sum_{i=1}^{n}\exp(y_i)}$, for $i=1,2,\dots,n$
– Example: y = [2, 1, 0.1]'
– softmax(y) = [0.6590, 0.2424, 0.0986]'
– exp(2)/(exp(2)+exp(1)+exp(0.1)) = 0.6590
– exp(1)/(exp(2)+exp(1)+exp(0.1)) = 0.2424
– exp(0.1)/(exp(2)+exp(1)+exp(0.1)) = 0.0986
Ch10. Auto and variational encoders v230607d 95
  • 96. Softmax Activation Function
– https://ljvmiranda921.github.io/notebook/2017/08/13/softmax-and-the-negative-log-likelihood/#nll
– For logits [5, 4, 2]: exp(5)/(exp(5)+exp(4)+exp(2)) = 0.705 and exp(4)/(exp(5)+exp(4)+exp(2)) = 0.2595 (note that the naive ratio 5/(5+4+2) does not give the softmax value).
Ch10. Auto and variational encoders v230607d 96
  • 97. Negative Log-Likelihood (NLL) • To maximize likelihood, minimum negative log- likelihood (NLL) is picked Ch10. Auto and variational encoders v230607d 97 https://ljvmiranda921.github.io/notebook/2017/08/13/softmax-and-the-negative-log-likelihood/#nll =-ln(likelihood) =-ln(0.02)=3.91 =-ln(0)=infinity =-ln(0.98)=0.02 Minimum negative log- likelihood (NLL) is picked, so 0.02 is selected Softmax output as the likelihood
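A minimal numpy sketch reproducing the numbers on these softmax/NLL slides (the logits and probabilities are the slides' own example values):

import numpy as np

def softmax(y):
    e = np.exp(y - np.max(y))      # subtract the max for numerical stability
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p)                           # approximately [0.659, 0.242, 0.099]

# Negative log-likelihood of the probability assigned to the true class
print(-np.log(0.98))               # ~0.02  (confident, correct prediction -> small NLL, picked)
print(-np.log(0.02))               # ~3.91  (confident, wrong prediction -> large NLL)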
  • 99. FAQ on VAE
• FAQ Assign3, 2020 Nov 17
• Question 3.1:
• Hi, sorry for interrupting, I have a question about the auto-encoder part of assignment 3. In question one, the encoder hidden layer and decoder hidden layer are of different sizes (15 and 18), and the numbers of neurons for (means, variance) and samples are different as well. Does it mean that in a variational auto-encoder the encoder hidden layer size and decoder hidden layer size can be different, and the neuron numbers for means, variance and samples don't have to match? If so, is there some random drop-out function when the means, variance and sample sizes don't match? Thanks.
• Answer 3.1:
• This is a very good question. In my notes, (mean, variance, sample_z) are of the same sizes, but I have found some implementations showing that this may not be the only case. Yes, it is a kind of dropout as described in the papers shown below. I think the rule is that mean and variance should have the same number because they go in pairs, but the randomly generated sample_z can be of a different size. It is done by randomly (via the Monte Carlo method) selecting the pair of mean and variance for generating the value of sample_z. Neural computing is a trial-and-error method; you may try different approaches, and the preferred method is the one that gives you a good result. You may explore more papers and see whether my interpretation is correct or not.
• See section 3.4 of
• https://arxiv.org/pdf/1706.03643.pdf
• Also
• https://deeplearn.org/arxiv/92996/generating-data-using-monte-carlo-dropout
• ////////////////////////////////////////////////////////
• Question 3.2 on VAE (variational auto-encoder)
• Question 3.2a:
• In your notes, the variational auto-encoder turns the input x into means and deviations of a multivariate Gaussian distribution, then uses a random sampling method to create the output. The output is Z, and Z is a generated random sample fed to the next layer of neurons.
• (i) How do we train the neural network if the input is from random sampling? (ii) And how do we force a multivariate Gaussian distribution Z towards the univariate Gaussian distribution N(0,1)?
• Answer 3.2a: I will answer part (ii) of the above question first. It is not to avoid over-fitting. From the input to the latent (hidden) representation z, there is a random process. A random process can have many different forms: it can be Gaussian, Laplace or Cauchy etc., or some unknown form. If there is no control, you may not be able to repeat the process, hence training becomes useless. In the VAE paper (https://arxiv.org/abs/1312.6114), the authors propose to force the random probability distribution to be Gaussian (I guess you might force it to be Laplace etc. and it could still work, but you have to be consistent in using one model). How? The method uses D_KL (the Kullback–Leibler divergence). It is Concept 2 in my notes, used to make sure the random process is Gaussian.
• ////--------------------------------------------------------------------------------------------------------------------
• Question 3.2b: Why do we still need re-parameterization to do backpropagation?
• Answer 3.2b: It is known that a random process cannot be back-propagated, but re-parameterization provides a means to back-propagate. First, zi is not generated by a random generator with mean=µi, std_dev=σi, but rather by an indirect method of finding zi (using zi = µi + ε*σi) through ε, which is generated by N(0,1) = N_Gaussian(mean=0, std_dev=1). If you have doubts, run my Matlab program on p.71 of 5707_10_auto-encoder (1).pptx. In short, it is found that if we use zi = µi + ε*σi to generate zi, then zi will have the required characteristics (mean=µi, std_dev=σi).
• Then, why do we use that indirect method? Because during the forward pass of neural computing, ε has already been calculated by N(0,1); it is a real number, not a random variable (the same holds for the mean=µi and std_dev=σi neuron outputs), so during back-propagation we can use zi = µi + ε*σi to find out how much to back-propagate to change the weights of the neurons. In the lecture notes, p.67 of 5707_10_auto-encoder.pptx, the gradient is calculated (please recall that for neural back-propagation computing, the gradient is needed to find de/dw); we can form our weight-updating program based on this formulation. The idea is that with this gradient, we know how to change µi, σi if we know the change of zi (if it were a pure random process, we simply would not know how). However, you don't need to enter this gradient into the VAE program because it is already in the TensorFlow-Keras library; it is handled automatically by TensorFlow-Keras as long as you provide the zi = µi + ε*σi formulation in the forward pass.
• ////--------------------------------------------------------------------------------------------------------------------
• Question 3.2c: If we put N(0.15, 2.3), does that mean the input mean is 0.15 and the std is 2.3? Then it goes through KL to compute the error with the expected distribution.
• Answer 3.2c: The use of N(0,1) (a Gaussian with mean=0, std_dev=1) is to make the formulation easier to program or calculate; see p.51, where the D_KL formulation (the loss function is based on it) becomes simpler. I guess you could assume all distributions to be N(0.15, 2.3), but then your loss function becomes more complex. The idea is to make sure zi is generated by a Gaussian process; zi can be generated with a different mean and std_dev, but it needs to be Gaussian. It is done by reducing D_KL(random process that generates zi || N(0,1)). So comparing the process of generating zi to a typical Gaussian like N(0,1) to form the loss function is reasonable. Ch10. Auto and variational encoders v230607d 99
  • 100. To prove $\nabla_\theta E_q[x^2]=E_p[2(\theta+\varepsilon)]$, where $\theta$ is the mean and $\varepsilon\sim N(0,1)$ (see the derivation on the next slide). https://stats.stackexchange.com/questions/199605/how-does-the-reparameterization-trick-for-vaes-work-and-why-is-it-important Ch10. Auto and variational encoders v230607d 100
  • 101. Alternative derivation: to prove $\nabla_\theta E_q[x^2]=E_p[2(\theta+\varepsilon)]$
– We want to find $\min_\theta E_q[x^2]$, thus we need to find $\nabla_\theta E_q[x^2]$.
– Since $\frac{d\log(y)}{dx}=\frac{1}{y}\frac{dy}{dx}$, we have $\nabla_\theta\log q_\theta(x)=\frac{1}{q_\theta(x)}\nabla_\theta q_\theta(x)$ ----- (i)
– $\nabla_\theta E_q[x^2]=\nabla_\theta\int q_\theta(x)\,x^2\,dx$, by definition of expectation
$=\int\nabla_\theta q_\theta(x)\,\frac{q_\theta(x)}{q_\theta(x)}\,x^2\,dx=\int\Big(\frac{1}{q_\theta(x)}\nabla_\theta q_\theta(x)\Big)q_\theta(x)\,x^2\,dx$ ----- (ii); putting (i) into (ii):
$\nabla_\theta E_q[x^2]=\int q_\theta(x)\,\nabla_\theta\log q_\theta(x)\,x^2\,dx=E_q\big[\nabla_\theta\log q_\theta(x)\,x^2\big]$, also by definition of expectation.
– If $q_\theta=N(\theta,1)$ is a normal distribution with mean $\theta$ and variance 1, i.e. $q_\theta(x)=\frac{1}{(2\pi)^{1/2}}\exp\big(-\tfrac{1}{2}(x-\theta)^2\big)$, then $\log q_\theta(x)=-\tfrac{1}{2}(x-\theta)^2+\text{const}$, hence $\nabla_\theta\log q_\theta(x)=x-\theta$, and therefore $\nabla_\theta E_q[x^2]=E_q[x^2(x-\theta)]$.
– Since $x=\theta+\varepsilon$ with $\varepsilon\sim N(0,1)$, we have $E_q[x^2]=E_p[(\theta+\varepsilon)^2]$, where $p$ is the distribution of $\varepsilon$, i.e. $\varepsilon\sim N(0,1)$. Therefore the derivative is
$\nabla_\theta E_q[x^2]=\nabla_\theta E_p[(\theta+\varepsilon)^2]=E_p[2(\theta+\varepsilon)]$
Ch10. Auto and variational encoders v230607d 101
  • 102. Reparameterization: Backpropagation needs derivative of a function (process) • Ch10. Auto and variational encoders v230607d 102 https://stats.stackexchange.com/questions/199605/how-does-the-reparameterization- trick-for-vaes-work-and-why-is-it-important Derivative of a random process is not possible Derivative of the Reparameterization process (no random node is involved) is possible
  • 104. Summary: Backpropagation
– The gradient during backpropagation is
$\nabla_\theta E_q[z^2]=\nabla_\theta E_p[(\mu_x+\varepsilon)^2]=E_p[2(\mu_x+\varepsilon)]$ ------ (*)
– This gradient is required for the neural network learning (back-propagation) process.
– $\varepsilon$ = the variable generated from $N(0,1)$ during the forward pass.
– $\mu_x$ is the current mean and is given at the forward pass.
– So the gradient (see formula (*) above) can be found and used in backpropagation.
Ch10. Auto and variational encoders v230607d 104
  • 105. Gradient for backpropagation
– $E_q()$ = expectation, $\mu$ = mean, $\sigma$ = standard deviation.
– $z=\mu+\sigma\varepsilon$, with $\varepsilon$ sampled from $N(0,I)$.
– The above is deterministic, so we can differentiate with respect to $\mu$ and $\sigma$, and thus find the derivative of $E_q[z^2]$.
– $E_q[z^2]=E_p[(\mu+\sigma\varepsilon)^2]$. Assume $\sigma=1$ for simplicity ($\mu$ and $\sigma$ are independent), so $E_q[z^2]=E_p[(\mu+\varepsilon)^2]$.
– Derivative of $E_q[z^2]$: $\partial E_q[z^2]/\partial\mu=\nabla_\mu E_q[z^2]=E_p[2(\mu+\varepsilon)]$.
– (The proof is in the appendix: to prove $\nabla_\mu E_q[z^2]=E_p[2(\mu+\varepsilon)]$.)
– If we have enough samples of $\varepsilon$, we can estimate $\nabla_\mu E_q[z^2]$. This gradient is required for the neural network learning (back-propagation) process.
– $\mu$ = current mean, $\varepsilon$ = randomly generated by $N(0,I)$ during the forward pass.
– For $\sigma$, we can apply the same treatment for updating.
Ch10. Auto and variational encoders v230607d 105
  • 106. https://nbviewer.jupyter.org/github/gokererdogan/Notebooks/blob/master/Reparameterization%20Trick.ipynb Demo gen_data_using_mean0_sigma1.py Reparameterization trick
• import numpy as np
• N = 1000
• theta = 2.0
• eps = np.random.randn(N)
• x = theta + eps
• grad1 = lambda x: np.sum(np.square(x)*(x-theta)) / x.size
• grad2 = lambda eps: np.sum(2*(theta + eps)) / x.size
• print(grad1(x))
• print(grad2(eps))
• 3.86872102149
• 4.03506045463
• Let us plot the variance for different sample sizes.
• Ns = [10, 100, 1000, 10000, 100000]
• reps = 100
• means1 = np.zeros(len(Ns))
• vars1 = np.zeros(len(Ns))
• means2 = np.zeros(len(Ns))
• vars2 = np.zeros(len(Ns))
• est1 = np.zeros(reps)
• est2 = np.zeros(reps)
• for i, N in enumerate(Ns):
•     for r in range(reps):
•         x = np.random.randn(N) + theta
•         est1[r] = grad1(x)
•         eps = np.random.randn(N)
•         est2[r] = grad2(eps)
•     means1[i] = np.mean(est1)
•     means2[i] = np.mean(est2)
•     vars1[i] = np.var(est1)
•     vars2[i] = np.var(est2)
• print(means1)
• print(means2)
• print()
• print(vars1)
• print(vars2)
• [ 4.10377908 4.07894165 3.97133622 4.00847457 3.99620013]
• [ 3.95374031 4.0025519 3.99285189 4.00065614 4.00154934]
• [ 8.63411090e+00 8.90650401e-01 8.94014392e-02 8.95798809e-03 1.09726802e-03]
• [ 3.70336929e-01 4.60841910e-02 3.59508788e-03 3.94404543e-04 3.97245142e-05]
• %matplotlib inline
• import matplotlib.pyplot as plt
• plt.plot(vars1)
• plt.plot(vars2)
• plt.legend(['no rt', 'rt'])
Ch10. Auto and variational encoders v230607d 106 The variance of the estimates using the reparameterization trick is one order of magnitude smaller than the estimates from the first method!

Editor's Notes

  • #8: \subsection{Theory} %%eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee $x \rightarrow F \rightarrow x'$\\ $z=\sigma (Wx+b)------(*)$\\ $x'=\sigma'(W'z+b')---(**)$\\ Autoencoders are trained to minimize reconstruction errors (such as squared errors), often referred to as the "loss (L)":\\ By combining (*) and (**)\\ $Loss=L(x,x')=\| x-x' \|^2$\\ $\| x-\sigma' ( W' \sigma (W x+b)+b' \|^2$\\
  • #28: subsection{Uni-variate and Multivariate Gaussian} %https://ttic.uchicago.edu/~shubhendu/Slides/Estimation.pdf $N_{univariate}(x)=\frac{1}{(2\pi \sigma^2)^{(1/2)}} exp\big( -\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2} \big)$\\ Multivariate Gaussian, $x$=data sample, %$\mu$=mean, $\sum$=covariance \\ d-dimension\\ $N_{univariate}(x)= \frac{1}{(2\pi)^{(d/2)} | \Sigma |^{(1/2)}} exp\big( -\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu) \big)$
  • #30: %%%%%%%%%%%% slide 54 %%%%%%%%%%%%%%%%%%%% %%% eeeeee eq:filtering_01A eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee %%% 1st order (A) Gaussian \begin{flalign} \label{eq:filtering_01a} \begin{aligned} G(x) = \frac{1} {\sqrt{2\pi{\sigma^2 }}} e^{- \frac{(x-\mu)^2} {2\sigma^2} } \end{aligned} \end{flalign} %%% eeeeee eq:filtering_01B eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee 1st order (B) Gaussian\\ \begin{flalign} \label{eq:filtering_01b} \begin{aligned} G(x) = \frac{1} {\sqrt{2\pi{\sigma^2 }}} exp\left({- \frac{(x-\mu)^2} {2\sigma^2} }\right) \end{aligned} \end{flalign} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%% eeeeee eq:filtering_02A eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 2nd order (A) Gaussian\\ \begin{flalign} \label{eq:filtering_02A} \begin{aligned} G(x,y) =G(x)G(y)= \frac{1} {{2\pi{\sigma^2 }}} e^{- \frac{(x-\mu_x)^2+(y-\mu_y)^2} {2\sigma^2} } \end{aligned} \end{flalign} %%% 2nd order (B) Gaussian %%% eeeeee eq:filtering_02B eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee \begin{flalign} \label{eq:filtering_02B} \begin{aligned} G(x,y) =G(x)G(y)= \frac{1} {{2\pi{\sigma^2 }}} exp\left({- \frac{(x-\mu_x)^2+(y-\mu_y)^2} {2\sigma^2} }\right) \end{aligned} \end{flalign}
  • #31: %%%%%%%%%%%%%slide 28 %%%%%%%%%%%%%%%%% 1-D Gaussian $x_j$= a sample,\\ $\mu_0=mean, \sigma_0=variance$\\ $N(x_j)=\frac{1}{(2 \pi \sigma_0^2)^{1/2}} exp \big (-\frac{1}{2}\frac{(x_j - \mu_0)^2}{\sigma_0^2} \big)$\\ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 2-D an isotropic (circular symmetric) Gaussian \\ assume mean is 0\\ $N(x_1,x_2)=\frac{1}{(2 \pi \sigma_0^2)} exp \big (-\frac{1}{2}\frac{(x_1^2+x_2^2 )^2}{\sigma_0^2} \big)$\\
  • #32: %%%%%%%%%%slide 29 , 30%%%%%%%%%%%%%%%%%%%% ---------------slide 29,30 \\ 2-D an isotropic (circular symmetric) Gaussian for variable $(x,y)$ \\ assume 2-D mean is $(m_x,m_y)$\\ $G(x,y)=G(x)G(y)$\\ $N(x_1,x_2)=\frac{1}{(2 \pi \sigma_0^2)} exp \big (-\frac{1}{2}\frac{((x-m_x)^2+(y-m_y)^2 )^2}{\sigma_0^2} \big)$\\
  • #33: %%%%%%%%%%slide 29 , 30%%%%%%%%%%%%%%%%%%%% ---------------slide 29,30 \\ 2-D an isotropic (circular symmetric) Gaussian for variable $(x,y)$ \\ assume 2-D mean is $(m_x,m_y)$\\ $G(x,y)=G(x)G(y)$\\ $N(x_1,x_2)=\frac{1}{(2 \pi \sigma_0^2)} exp \big (-\frac{1}{2}\frac{((x-m_x)^2+(y-m_y)^2 )^2}{\sigma_0^2} \big)$\\
  • #49: %%%%%%%%%%slide 46%%%%%%%%%%%%%%%%%%%% ---------------slide 46 \\ \subsection{slide 46:The reconstruction loss $l()$} Given $x_i \epsilon X, z \epsilon Q, \text { and } E()$ is the expected value The idea is to train the Encoder/Decoder (Neural Network) to maximum the likelihood of the Mean squared error (MSE) between x and reconstructed $\hat x_i \epsilon \hat X$\\ % To maximize likelihood, we can minimize the “expected negative log-likelihood” $(l_i)$ of the $ i^{th}$ datapoint $x_i$. \\ % $ l_i(\theta, \phi | x_i)=-E_{x_i \epsilon X} \big[ E_{z \epsilon Q}[log P_{\phi (de)}(\hat x_i | z)] \big]$\\ $q_{\theta(en)}(z|x_i)$\\ $P_{\phi(de)}(\hat x_i |z)$\\ % $l_i(\theta,\phi)$\\
  • #51: %%%%%%%%%%%%%%%%%%%%%%%% \subsection{slide 48: How to make sure the neural networks produce similar hidden data (means and standard deviations) from similar training images} Problem: Input that we regard as similar may end up very different in z space (hidden, means and standard deviations). That means some solutions may give small loss $l_i{(\theta, \phi)}$, even $q_{\theta(en)}$ and $p_{\phi(de)}$ are of very different distributions.\\ Solution: Use $p(z)=N(0,1)$, try to force $q_{\theta(en)}(z|x_i)$ (a neural network) to act similar to a standard normal probability density function. We can use Kullback-Leibler divergence $(D_{KL})$ to do the checking. % $l_i(\theta,\phi|x_i)$\\ $l_i()$
  • #52: \subsection{Slide 49: Math background: Kullback–Leibler divergence} %https://ttic.uchicago.edu/~shubhendu/Slides/Estimation.pdf Math background: Kullback–Leibler divergence (also known as relative entropy) measures how one probability distribution is different from a second, reference probability distribution over the same variable X. %1 Define:\\ $D_{KL}\big[ N(\mu_1,\sigma_1^2) || ( \mu_2,\sigma_2^2) \big]= \frac{1}{2} \Big[ tr \big ([\sigma_2^2]^T \cdot \sigma_1^2-I \big) +(\mu_1-\mu_2)^T[\sigma^2]^{-1}(\mu_1 - \mu_2) +log \big( \frac{det(\sigma_2^2)}{det(\sigma_1^2)} \big) \Big]$-----(I)\\ If $N(\mu_1,\sigma_1^2)=N(\mu(X),\sigma^2(X))$; \\Also $N(\mu_2,\sigma^2_2)=N(0,I)$\\ %2 % $D_{KL}\big[ N(\mu(X),\sigma^2(X)) || ( N(0,I) \big]= \frac{1}{2} \Big[ tr \big ([\sigma^2]^T \cdot \sigma^2-I \big) +(\mu(X))^T(\mu(X))) -log (det(\sigma^2(X))) \Big]$\\ %3 For $D[Q(z|x_i) || N(0,I)] \text{, where } Q(z|x_i) =N(\mu(X),\sigma^2(X))$\\ $D_{KL}\big[ N(\mu(X),\sigma^2(X)) || ( N(0,I) \big]= \frac{1}{2} \Big[ tr \big ([\sigma^2]^T \cdot \sigma^2-I \big) +(\mu(X))^T(\mu(X))) -log (det(\sigma^2(X))) \Big]$\\ Kullback–Leibler divergence $D_{KL} (D_1 || D_2)=0$ indicates the two distributions ${D_1,D_2}$ are identical %Tutorial on Variational Autoencoders by Carl Doersch & https://arxiv.org/abs/1606.05908
  • #53: \subsection{Slide50:Training (concept 1)} %See http://bjlkeng.github.io/posts/variational-autoencoders/ & https://arxiv.org/abs/1312.6114 Combining concept 1 and 2 to minimize Loss $l_i (X), \text{ of } X= {x_1,x_2,..,x_N} , E()$=expected value . For the whole $X$, the average loss is \\ Input to encoder = $x_i \epsilon X$\\ Output to encoder = $\hat{x_i} \epsilon \hat{X}$\\ $P(\hat{x_i} | z)$=Prob. distribution of $x_i$ generated by $z$ (decoder side)\\ % $P(z | x_i)$=Prob. distribution of $x_i$ generated by $z$ (decoder side)\\ % $P(z)$=Prob. distribution of the latent (hidden) variables,\\ % $E_{Z \epsilon Q}[log P(\hat x_i | z)]$=expected value (exp. val) of $\hat x_i$ generated at the decoder output\\ % $E_{x_i \epsilon X}[E_{Z \epsilon Q}[log(P(\hat {x_i | z)})]]$= exp. val. of $\hat{x_i}$ gen. at the decoder output when input = $x_i \epsilon X$ \\ % $\varepsilon = \text { random variable generated by a Gaussian function }\{ mean (\mu)=0, stdev( \sigma)=1\}, \varepsilon N(0,I) $\\ % At this stage $z$ can be any distribution, but we can assume $z$= Gaussian $N(\mu_{x_i|z}), \sigma_{x_i | z})$\\ % It can be formed by scaling $\varepsilon \epsilon N(),I)$, (\url{en. https://en.wikipedia.org/wiki/Normal_distribution})\\ The advantage is if $(\mu_{x_i | z}, \sigma_{x_i | z)})$ are found, then use a random gen. $N(\mu_{x_i | z}, \sigma_{x_i | z})$ to gen. $z$.\\ % Hence $log P(\hat{x_i} | z)=log P \big( \hat {x_i |z}=\mu_{x_i |z}(x_i)+\sigma_{x_i | z}(x_i) \ast \varepsilon \big)$ \\ % We want to maximize $E_{x_i \epsilon X}[ E_{z_i \epsilon Q}[log P (\hat x_i | z]]$ (to make input output similar)\\ % It is the same as to minimize $-\big( E_{x_i \epsilon X}[ E_{z_i \epsilon Q}[log P (\hat x_i | z]] \big)$\\ Objective function1=$-E_{x_i | X} \Big[ E_{z \epsilon Q}[ log P(\hat{x_i} | z=\mu_{x_i |z}(x_i)+\sigma_{x_i | z}(x_i) \ast \varepsilon )$ ] \Big]\\ Since P is Gaussian, we minimize Objective$\_$function1=$\frac{1}{N} \sum\limits_{x_i \epsilon X} \Big( \frac{1}{2 \sigma^2_{\hat {x_i}} |z} (x_i - \mu_{\hat{x_i}|z})^2 \Big)$
  • #54: %%%%%%%%%%%%%%%%%%%%%%%% \subsection{slide 51, concept2} Training: Combining concept 1 and 2 to minimize Loss $l_i (X), of X= {x_1,x_2,..,x_N} , E()$=expected value . For the whole $X$, the average loss is Recall $q_{\theta(en)(z|x_i)}$=prob.distribution of $z$ generated by $x_i$ (encoder side)\\ % We mentioned earlier we want $q_{\theta(en)(z|x_i)}$ to be close to Gaussian, put $P(z)=N(0,I)$\\ % $D_{KL}\big[ q_{\theta(en)}(z|x_i) || ( N(0,I) \big]$=difference of $q_{\theta(en)(z|x_i)}$ and Gaussian, see previous discussion on $D_KL[]$\\ % objective$\_$func2= $D_{KL}\big[ q_{\theta(en)(z|x_i)} || ( N(0,I) \big]$, this is to be minimized\\ % Overall$\_$objective$\_$function =objective$\_$funct1+objective$\_$func2\\ $=\frac{1}{N} \sum\limits_{x_i \epsilon X} \Big( \frac{1}{2 \sigma^2_{\hat {x_i}} |z} (x_i - \mu_{\hat{x_i}|z})^2 \Big) +D_{KL}\big[ q_{\theta(en)(z|x_i)} || ( N(0,I) \big]$\\ We have shown earlier that\\ $D_{KL}\big[ q_{\theta(en)}(z|x_i) || ( N(0,I) \big]= % +\frac{1}{2} \big\{ tr (\sigma^2(X)-I) +\mu(X)^T \mu(X)-log(det(\sigma^2(X))) \big\}$ % % %% $l=\frac{1}{N} \sum\limits_{x_i \epsilon X} \Big( \frac{1}{2 \sigma^2_{\hat {x_i}} |z} (x_i - \mu_{\hat{x_i}|z})^2 \Big) +\frac{1}{2} \big\{ tr (\sigma^2(X)-I) +\mu(X)^T \mu(X)-log(det(\sigma^2(X))) \big\}$\\ The first term is for concept 1 and the second term is for concept 2\\ We will run an iterative algorithm to minimize $l$ See \url{ http://bjlkeng.github.io/posts/variational-autoencoders/ & https://arxiv.org/abs/1312.6114}
  • #57: \subsection{slide 52:Use neural networks to implement system} $l=\frac{1}{N} \sum\limits_{x_i \epsilon X} \Big( \frac{1}{2 \sigma^2_{\hat {x_i}} |z} (x_i - \mu_{\hat{x_i}|z})^2 \Big) +\frac{1}{2} \big\{ tr (\sigma^2(X)-I) +\mu(X)^T \mu(X)-log(det(\sigma^2(X))) \big\}$\\
  • #60: %%%%%%%%%%%%%%%%%%%% \subsection{slide 55: VAE generative model} In theory, we use a sample of z from $q_{\theta(en)} (z|x_i)$ as input to sample from $p_{\phi(de)}(\hat X_i| Z) $ to give an approximate reconstruction of $x_i$\\ Alternatively, if we sample any $z$ from $N(0,1)$ and use it as input to sample from $p_{\phi(de)}(\hat X_i | Z) $ then we can approximate the entire data distribution $p( )$. I.e., we can generate new samples that look like the input but aren’t in the input.
  • #65: %%%%%%%%%%%%%%%%%%%% \subsection{slide 55: VAE generative model} In theory, we use a sample of z from $q_{\theta(en)} (z|x_i)$ as input to sample from $p_{\phi(de)}(\hat X_i| Z) $ to give an approximate reconstruction of $x_i$\\ Alternatively, if we sample any $z$ from $N(0,1)$ and use it as input to sample from $p_{\phi(de)}(\hat X_i | Z) $ then we can approximate the entire data distribution $p( )$. I.e., we can generate new samples that look like the input but aren’t in the input.
  • #67: \subsection{slide 67: Summary: Forward pass} $\epsilon$= the generated variable\\ $\mu$ = mean\\ $\sigma$= variance\\ $Z= \epsilon \sigma_x +\mu_x$
  • #102: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{slide 64 :Derivation} We want to find $\min_\theta(E_q[X^2])$, thus we need to find $\nabla_\theta E_q[x^2]$\\ Since $\nabla_\theta q_\theta (x)=q_\theta(x)$, and $\frac{d(log(y))}{dx}=\frac{1}{log(y)}\frac{dy}{dx}$\\ Thus, $\nabla_\theta log(q_\theta (x)) =\frac{1}{q_\theta (x)} \nabla (q_\theta(x))$ -----(i)\\ % $\nabla\theta E_q[x^2]=\nabla_\theta \int q_\theta(x) x^2 dx$, by definition of exception\\ % $\nabla\theta E_q[x^2]= \int \nabla_\theta q_\theta(x) \frac{q_\theta(x)}{q_\theta(x)} \Big) x^2 dx$,\\ $=\int \Big( \frac{1}{q_\theta(x)} \nabla_\theta q_\theta(x) \Big) q_\theta(x) x^2 dx$---(ii), put (i) in (ii)\\ %%% $\nabla _\theta E_q[x^2]= \int q_\theta(x) \nabla_\theta log (q_\theta(x)) x^2 dx$\\ $=E_q[\nabla_\theta log(q_\theta(x))x^2]$, also by definition of expectation\\ % If $q_\theta = N(\theta, I) $ is a normal distribution of mean =$\theta$, variance =1\\ hence $\nabla_\theta log (q_\theta(x))=x-\theta$ (see appendix 1), therefore\\ $\nabla_\theta E_q[x^2]=E_q[x^2(x-\theta)]$, since $X=\theta + \epsilon, \epsilon \approx N(0,1)$\\ % then $E_q[x^2]=E_p[(\theta+\epsilon)^2]$, where p is the distribution of $\epsilon$,\\ i.e. $\epsilon \approx N(0,1)$, therefore derivative of $E_q[x^2]$ is \\ $\nabla_\theta E_q[x^2]=\nabla_\theta E_p [(\theta+\epsilon)^2]=E_p[(2\theta+\epsilon)]$\\ %%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%% Note by definition: $\frac{d}{dx}ln(x)=\frac{1}{x}, \text{ and also } \frac{log_b(x)}{dx}=\frac{1}{ln(b) \cdot x}$\\
  • #105: %%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{slide 68:Summary: Backpropagation} % The gradient during backpropagation is\\ $\nabla E_q[x^2]=\nabla_\theta E_p [(\theta+\epsilon)^2]=E_p[2(\theta+\epsilon)]$\\ % $\epsilon$ is found the generated variable\\ $\theta$ is given\\ So the gradient can be found and used in backpropagation
  • #106: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{slid 63:Gradient for back propagation} Derivation of $\nabla_{\theta}E_q[x^2]$\\ $X=\theta+\epsilon,\epsilon \approx N(0,I) $\\ Then $E_q [x^2]=E_p[(\theta+\epsilon)^2]$\\ $P= \text { distribution of } \epsilon \approx N(0,I)$\\ Thus, derivative of $E_q[x^2]$ $=\nabla_\theta E_q[x^2]=\nabla_\theta E_p[(\theta + \epsilon)^2]$ $=E_p[2(\theta+\epsilon)]$