2. Overview
● Motivation for deep learning
● Areas of Deep Learning
● Convolutional neural networks
● Recurrent neural networks
● Deep learning tools
4. What about the MLPs we learnt in class?
Recall:
● Input layer
● Hidden layer
● Activations
● Outputs
Pic Credit: Becoming Human: Artificial Intelligence Magazine
5. What about the MLPs we learnt in class?
Expensive to learn. Will not generalize well.
Does not exploit the order and local relations in the data!
64x64x3 = 12,288 weights per neuron in the first layer, and we also want many layers.
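To make the cost concrete, here is a quick back-of-the-envelope count; the 1,000-unit hidden layer is a made-up example, not a number from the slides.

```python
# Fully connecting a flattened 64x64 RGB image to a hypothetical 1,000-unit hidden layer.
inputs = 64 * 64 * 3               # 12,288 input values after flattening the image
hidden = 1000                      # assumed hidden-layer width (illustrative only)
params = inputs * hidden + hidden  # weights + biases for just this one layer
print(inputs, params)              # 12288 and 12,289,000 parameters
```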
6. Overview
● Motivation for deep learning
● Areas of Deep Learning
● Convolutional neural networks
● Recurrent neural networks
● Deep learning tools
7. What are the different pillars of deep learning?
● Convolutional NN: images
● Recurrent NN: time series
● Deep RL: control systems
● Graph NN: networks / relational data
8. Overview
● Motivation for deep learning
● Areas of Deep Learning
● Convolutional neural networks
● Recurrent neural networks
● Deep learning tools
13. Convolving Filters
● Why not extract features using filters?
● Better, why not let the data dictate what filters to use?
● Learnable filters!
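A minimal sketch of the contrast between a hand-picked filter and a learnable one; TensorFlow/Keras is assumed here (the slides only list TensorFlow later as one possible tool), and the filter count is arbitrary.

```python
import numpy as np
import tensorflow as tf

# Hand-crafted feature extraction: a fixed vertical-edge filter chosen by a human.
edge_filter = np.array([[1., 0., -1.],
                        [1., 0., -1.],
                        [1., 0., -1.]])

# Learnable filters: 8 filters start as random values, and gradient descent
# lets the data dictate what they become during training.
learnable = tf.keras.layers.Conv2D(filters=8, kernel_size=3, activation="relu")
```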
14. Convolution on multiple channels
● Images are generally RGB!
● How would a filter work on an image with RGB channels?
● The filter should also have 3 channels.
● Now the output has a channel for every filter we have used.
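A minimal NumPy sketch of one convolution over an RGB image, assuming a 64x64 image and 8 filters of size 3x3: each filter carries 3 channels, and each filter contributes exactly one output channel.

```python
import numpy as np

H, W, C = 64, 64, 3
image = np.random.rand(H, W, C)                 # an RGB image
n_filters = 8
filters = np.random.rand(n_filters, 3, 3, C)    # every 3x3 filter also has 3 channels

out = np.zeros((H - 2, W - 2, n_filters))       # "valid" output: one channel per filter
for f in range(n_filters):
    for i in range(H - 2):
        for j in range(W - 2):
            patch = image[i:i + 3, j:j + 3, :]          # 3x3x3 patch of the image
            out[i, j, f] = np.sum(patch * filters[f])   # sum over all 3 channels
print(out.shape)                                # (62, 62, 8)
```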
30. Parameter Sharing
The fewer the parameters, the less computationally intensive the training. This is a win-win, since we are reusing parameters.
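To see the saving, compare rough parameter counts for the 64x64x3 image from earlier; the layer sizes are illustrative assumptions, not numbers from the slides.

```python
H, W, C = 64, 64, 3
# Fully connected: every one of 1,000 units gets its own weight for every pixel and channel.
fc_params = (H * W * C) * 1000 + 1000      # about 12.3 million
# Convolutional: 8 filters of size 3x3x3 are shared across every spatial position.
conv_params = 8 * (3 * 3 * C) + 8          # 224
print(fc_params, conv_params)
```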
31. Translational invariance
Since we are training filters to detect cats and then moving these filters over the data, a differently positioned cat will also get detected by the same set of filters.
32. Filters? Layers of filters?
Images that maximize filter outputs at certain layers: we observe that these images get more complex for filters situated deeper in the network.
Deeper layers can learn higher-level features: an eye is made up of multiple curves, and a face is made up of two eyes.
33. How do we use convolutions?
Let convolutions extract features!
Image credit: LeCun et al.
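A minimal sketch of this conv-features-then-classify pattern in the spirit of LeCun-style networks; TensorFlow/Keras, the specific layer sizes, and the 32x32 grayscale input are assumptions, not the exact architecture pictured on the slide.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # Convolution + pooling layers extract increasingly abstract features.
    tf.keras.layers.Conv2D(6, 5, activation="relu", input_shape=(32, 32, 1)),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(16, 5, activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    # A small fully connected head turns the extracted features into class scores.
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(120, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.summary()
```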
34. Fun fact: convolution really is just a linear operation
● In fact, convolution is a giant matrix multiplication.
● We can expand the 2-dimensional image into a vector and the conv operation into a matrix (see the sketch below).
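A tiny 1D sketch of that expansion (sizes made up): sliding a filter over a signal gives exactly the same numbers as multiplying by a matrix whose rows are shifted copies of the filter.

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5.])     # the "image", flattened to a vector
w = np.array([1., 0., -1.])            # the filter
n, k = len(x), len(w)

# Build the convolution as a matrix: each row is the filter shifted by one position.
M = np.zeros((n - k + 1, n))
for i in range(n - k + 1):
    M[i, i:i + k] = w

sliding = np.array([np.dot(w, x[i:i + k]) for i in range(n - k + 1)])
print(M @ x)      # [-2. -2. -2.]
print(sliding)    # identical: the convolution is a linear (matrix) operation
```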
35. How do we learn?
We now have a network with:
● a bunch of weights
● a loss function
To learn:
● Just do gradient descent and backpropagate the error derivatives
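A minimal sketch of that recipe on a toy linear model (the data, model, and learning rate are made up): compute the loss, backpropagate the error derivative, take a gradient step.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

w = np.zeros(3)                               # the "bunch of weights"
lr = 0.1
for step in range(200):
    pred = X @ w
    loss = np.mean((pred - y) ** 2)           # the loss function
    grad = 2 * X.T @ (pred - y) / len(y)      # error derivative w.r.t. the weights
    w -= lr * grad                            # plain gradient descent update
print(w, loss)
```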
36. How do we learn?
Instead of plain gradient descent, there are "optimizers":
● Momentum: gradient + momentum
● Nesterov: momentum + gradient
● Adagrad: normalize with the sum of squared gradients
● RMSprop: normalize with a moving average of the sum of squares
● ADAM: RMSprop + momentum
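A sketch of two of the update rules behind those names (the default hyperparameter values are assumptions): momentum accumulates a velocity from past gradients, while ADAM combines that moving average with RMSprop's moving average of squared gradients.

```python
import numpy as np

def momentum_step(w, grad, v, lr=0.01, beta=0.9):
    v = beta * v + grad                      # velocity: gradient + momentum
    return w - lr * v, v

def adam_step(w, grad, m, s, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad             # momentum part: moving average of gradients
    s = b2 * s + (1 - b2) * grad ** 2        # RMSprop part: moving average of squared gradients
    m_hat = m / (1 - b1 ** t)                # bias correction for early steps
    s_hat = s / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(s_hat) + eps), m, s
```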
37. Mini-batch Gradient Descent
Expensive to compute the gradient for a large dataset:
● memory size
● compute time
Mini-batch: take a sample of the training data at each step.
How do we sample intelligently?
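A minimal sketch of the usual baseline answer, assuming in-memory NumPy arrays: shuffle once per epoch, then walk through the data in fixed-size batches (the X_train/y_train names are placeholders).

```python
import numpy as np

def minibatches(X, y, batch_size=32, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))             # shuffle the example order each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

# for xb, yb in minibatches(X_train, y_train):   # hypothetical training arrays
#     ...compute the gradient on this small batch only...
```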
38. Is deeper better?
Deeper networks seem to be more powerful but harder to train:
● loss of information during forward propagation
● loss of gradient information during backpropagation
There are many ways to "keep the gradient going".
39. Solution
Connect the layers: create a gradient highway or information highway.
ResNet (2015)
Image credit: He et al.
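A minimal residual-block sketch in the spirit of ResNet (TensorFlow/Keras, the layer sizes, and the activation placement are assumptions, not He et al.'s exact block): the skip connection adds the input straight back, which is the "highway".

```python
import tensorflow as tf

def residual_block(x, filters=64):
    y = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = tf.keras.layers.Conv2D(filters, 3, padding="same")(y)
    y = tf.keras.layers.Add()([x, y])            # skip connection: the gradient highway
    return tf.keras.layers.Activation("relu")(y)

inputs = tf.keras.Input(shape=(32, 32, 64))      # input depth must match `filters` here
outputs = residual_block(inputs)
model = tf.keras.Model(inputs, outputs)
```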
40. Initialization
● Can we initialize all neurons to zero?
● If all the weights are the same, we will not be able to break the symmetry of the network, and all filters will end up learning the same thing.
● Large initial values might knock ReLU units out.
● Once a ReLU unit is knocked out and its output is zero, its gradient flow also becomes zero.
● We need small random numbers at initialization: mean 0, standard deviation about 1/sqrt(n) (variance about 1/n).
Popular initialization setups: (Xavier, He) × (Uniform, Normal).
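A sketch of the normal-distribution versions of those two recipes (one common form; Xavier is also often written with variance 2/(n_in + n_out)): zero mean, with a scale that shrinks as the fan-in n grows.

```python
import numpy as np

def xavier_init(n_in, n_out, seed=0):
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))   # Var = 1/n_in

def he_init(n_in, n_out, seed=0):
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))   # Var = 2/n_in, suited to ReLU
```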
41. Dropout
● What does cutting off some network connections do?
● Trains multiple smaller networks in an ensemble.
● Can drop an entire layer too!
● Acts like a really good regularizer.
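A minimal sketch of (inverted) dropout on a layer's activations, assuming a drop probability of 0.5: random units are zeroed during training and the rest are rescaled, so nothing special is needed at test time.

```python
import numpy as np

def dropout(a, p_drop=0.5, training=True, seed=0):
    if not training:
        return a                                   # test time: use the full network
    rng = np.random.default_rng(seed)
    mask = rng.random(a.shape) >= p_drop           # keep each unit with prob 1 - p_drop
    return a * mask / (1.0 - p_drop)               # rescale so the expected activation is unchanged
```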
42. Tricks for training
● Data augmentation if your data set is small. This helps the network generalize better.
● Early stopping when validation loss stops improving even though training loss keeps decreasing.
● Random hyperparameter search or grid search?
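A sketch of what random hyperparameter search can look like; the ranges and the train_and_eval call are made-up placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
trials = []
for _ in range(20):
    config = {
        "lr": 10 ** rng.uniform(-5, -2),            # sample the learning rate on a log scale
        "dropout": rng.uniform(0.1, 0.6),
        "batch_size": int(rng.choice([32, 64, 128])),
    }
    # score = train_and_eval(config)                # hypothetical training-and-validation call
    # trials.append((score, config))
# best_config = max(trials)[1] if trials else None
```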
43. Overview
● Motivation for deep learning
● Areas of Deep Learning
● Convolutional neural networks
● Recurrent neural networks
● Deep learning tools
44. CNN sounds like fun! What are some deep learning pillars?
● Convolutional NN
● Recurrent NN: time series
● Deep RL
● Graph NN
45. We can also have 1D architectures (remember this)
● CNNs work on any data where there is a local pattern.
● We use 1D convolutions on DNA sequences, text sequences, and music notes.
● But what if the time series has a causal dependency, or any kind of sequential dependency?
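A minimal 1D sketch (TensorFlow/Keras assumed; the length-100, 4-channel one-hot DNA encoding is an illustrative assumption): the same convolution idea slides along a sequence instead of an image.

```python
import tensorflow as tf

x = tf.random.uniform((1, 100, 4))        # (batch, sequence length, channels e.g. A/C/G/T)
y = tf.keras.layers.Conv1D(filters=16, kernel_size=5, activation="relu")(x)
print(y.shape)                            # (1, 96, 16): local patterns detected along the sequence
```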
46. To address sequential dependency?
Use a recurrent neural network (RNN).
Unrolling an RNN: at each time step the RNN cell takes the current input and the previous (latent) output and produces a new output.
It is really the same cell at every time step, NOT many different cells like the kernels of a CNN.
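A minimal NumPy sketch of unrolling a vanilla RNN (sizes and initialization are made up): the same Wx, Wh, and b are reused at every time step, updating one hidden state as the sequence is read.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 8, 16
Wx = rng.normal(0, 0.1, (d_h, d_in))
Wh = rng.normal(0, 0.1, (d_h, d_h))
b = np.zeros(d_h)

h = np.zeros(d_h)                          # hidden state before reading anything
sequence = rng.normal(size=(4, d_in))      # e.g. 4 word vectors: "I", "love", "CS", "!"
for x_t in sequence:                       # one application of the SAME cell per time step
    h = np.tanh(Wx @ x_t + Wh @ h + b)
# h now summarizes the whole sequence and can feed an output layer
```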
47. How does an RNN produce a result?
Example: reading the sentence "I love CS !" one token at a time.
An evolving "embedding" (hidden state) is updated after each token; the result is produced after reading the full sentence.
48. There are 2 types of RNN cells
● Long Short-Term Memory (LSTM): stores information in a "long term memory" (cell state) alongside its response to the current input.
● Gated Recurrent Unit (GRU): an update gate, a reset gate, and a response to the current input.
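A minimal sketch of a single GRU step (biases omitted, weight shapes assumed): the update gate decides how much old memory to keep, and the reset gate decides how much of it to use when forming the response to the current input.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    z = sigmoid(Wz @ x + Uz @ h)                 # update gate
    r = sigmoid(Wr @ x + Ur @ h)                 # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))     # response to the current input
    return (1 - z) * h + z * h_tilde             # mix old memory with the new candidate
```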
50. "Recurrent" AND convolutional?
● Temporal convolutional network (TCN)
● Temporal dependency achieved through "one-sided" (causal) convolution
● More efficient, because deep learning packages are optimized for matrix multiplication = convolution
● No hard sequential dependency between time steps
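A minimal sketch of the "one-sided" convolution (TensorFlow/Keras assumed; the shapes and dilation are illustrative): with causal padding, the output at time t only sees inputs at times up to t.

```python
import tensorflow as tf

x = tf.random.normal((1, 100, 8))                      # (batch, time steps, features)
y = tf.keras.layers.Conv1D(16, kernel_size=3, padding="causal",
                           dilation_rate=2, activation="relu")(x)
print(y.shape)                                         # (1, 100, 16): same length, no peeking at the future
```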
51. More? Take CS230, CS236, CS231N, CS224N
● Convolutional NN: images
● Recurrent NN: time series
● Deep RL: control systems
● Graph NN: networks / relational data
52. Not today, but take CS234 and CS224W
● Convolutional NN: images
● Recurrent NN: time series
● Deep RL: control systems
● Graph NN: networks / relational data
53. Overview
● Motivation for deep learning
● Areas of Deep Learning
● Convolutional neural networks
● Recurrent neural networks
● Deep learning tools
55. Where can I get free stuff?
● Google Colab: free (limited-ish) GPU access, works nicely with TensorFlow, links to Google Drive.
● Register a new Google Cloud account => instant $300 of credit??
● AWS free tier (limited compute)
● Azure education account, $200?
● Azure Notebooks
● Kaggle kernels???
● Amazon SageMaker?
To SAVE money, CLOSE your GPU instance (~$1 an hour).