Deep learning
Outline
• Machine Learning basics
• Introduction to Deep Learning
  • What is Deep Learning?
  • Why is it useful?
• Main components/hyper-parameters:
  • activation functions
  • optimizers, cost functions and training
  • regularization methods
  • tuning
  • classification vs. regression tasks
• DNN basic architecture:
  • convolutional neural networks
Most material from the CS224n "NLP with Deep Learning" course at Stanford.
Machine Learning Basics
Machine learning is a field of computer science that gives computers the ability to learn without being explicitly programmed: methods that can learn from, and make predictions on, data.
[Diagram: labeled data → machine learning algorithm → learned model (training phase); new data → learned model → prediction (prediction phase).]
Types of Learning
• Supervised: learning with a labeled training set.
  Example: email classification with already-labeled emails.
• Unsupervised: discover patterns in unlabeled data.
  Example: cluster similar documents based on their text.
• Reinforcement learning: learn to act based on feedback/reward.
  Example: learn to play Go; reward: win or lose.
[Figure: example plots illustrating classification, regression, and clustering.]
http://mbjoseph.github.io/2013/11/27/measure.html
ML vs. Deep Learning
Most machine learning methods work well because of human-designed representations and input features; ML then becomes just optimizing weights to best make a final prediction.
What is Deep Learning (DL)?
Deep learning algorithms attempt to learn (multiple levels of) representation by using a hierarchy of multiple layers. If you provide the system tons of information, it begins to understand it and respond in useful ways.
https://www.xenonstack.com/blog/static/public/uploads/media/machine-learning-vs-deep-learning.png
Why is DL useful?
• Manually designed features are often over-specified, incomplete, and take a long time to design and validate.
• Learned features are easy to adapt and fast to learn.
• Deep learning provides a very flexible, (almost?) universal, learnable framework for representing world, visual, and linguistic information.
• It can learn in both unsupervised and supervised settings.
• It enables effective end-to-end joint system learning.
• It can utilize large amounts of training data.
Around 2010, DL started outperforming other ML techniques, first in speech and vision, then in NLP.
Neural Network Intro
Demo
How do we train?
h = σ(W1·x + b1)
y = σ(W2·h + b2)
For the pictured network (3 inputs x, a hidden layer h of 4 neurons, an output layer y of 2 neurons):
• 4 + 2 = 6 neurons (not counting inputs)
• [3 × 4] + [4 × 2] = 20 weights
• 4 + 2 = 6 biases
• 26 learnable parameters
[Figure: network diagram labeling the weights and activation functions.]
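A minimal NumPy sketch of this two-layer forward pass (the 3 → 4 → 2 layer sizes match the parameter count above; the weight values are random placeholders):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # 12 weights + 4 biases
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # 8 weights + 2 biases

x = np.array([1.0, 2.0, 3.0])        # 3 inputs
h = sigmoid(W1 @ x + b1)             # hidden layer: 4 neurons
y = sigmoid(W2 @ h + b2)             # output layer: 2 neurons

n_params = W1.size + b1.size + W2.size + b2.size
print(y, n_params)                   # 26 learnable parameters
```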
Training
1. Sample labeled data (a batch).
2. Forward it through the network, get predictions.
3. Back-propagate the errors.
4. Update the network weights.
Optimize (minimize or maximize) the objective/cost function J(θ): generate an error signal that measures the difference between predictions and target values, then use the error signal to change the weights and get more accurate predictions. Subtracting a fraction of the gradient moves you towards the (local) minimum of the cost function.
https://medium.com/@ramrajchandradevan/the-evolution-of-gradient-descend-optimization-algorithm-4106a6702d39
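A hedged sketch of this loop for the 26-parameter network above: plain gradient descent on a squared-error cost J(θ) = ½·‖y − t‖², with the back-propagation deltas written out by hand (a single training example for brevity; all values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

x = np.array([0.5, -0.2, 0.1])   # one labeled example
t = np.array([1.0, 0.0])         # its target values
lr = 0.5                         # learning rate (the "fraction of the gradient")

for step in range(1000):
    # forward pass: get predictions
    h = sigmoid(W1 @ x + b1)
    y = sigmoid(W2 @ h + b2)
    # error signal: difference between predictions and targets
    d2 = (y - t) * y * (1 - y)         # output-layer delta (MSE + sigmoid)
    d1 = (W2.T @ d2) * h * (1 - h)     # delta back-propagated to the hidden layer
    # update: subtract a fraction of the gradient
    W2 -= lr * np.outer(d2, h); b2 -= lr * d2
    W1 -= lr * np.outer(d1, x); b1 -= lr * d1

print(y)   # predictions approach t as training proceeds
```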
Activation functions
Non-linearities are needed to learn complex (non-linear) representations of the data; otherwise the NN would be just a linear function, since W2(W1·x) = W·x for a single matrix W = W2·W1. More layers and neurons can approximate more complex functions.
Full list: https://en.wikipedia.org/wiki/Activation_function
http://cs231n.github.io/assets/nn1/layer_sizes.jpeg
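A quick NumPy check of that claim: two stacked linear layers (no activation in between) collapse to a single linear map.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x)    # linear layer stacked on a linear layer
one_layer = (W2 @ W1) @ x     # the single equivalent map W = W2·W1
print(np.allclose(two_layers, one_layer))   # True: no added expressiveness
```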
Activation: Sigmoid
Takes a real-valued number and "squashes" it into the range between 0 and 1: ℝⁿ → [0, 1]ⁿ.
+ Nice interpretation as the firing rate of a neuron:
  • 0 = not firing at all
  • 1 = fully firing
− Sigmoid neurons saturate and kill gradients, so the NN will barely learn:
  • when the neuron's activations are 0 or 1, it saturates;
  • the gradient in these regions is almost zero;
  • almost no signal will flow back to its weights;
  • if the initial weights are too large, most neurons saturate.
http://adilmoujahid.com/images/activation.png
Activation: ReLU
Takes a real-valued number and thresholds it at zero: f(x) = max(0, x), i.e. ℝⁿ → ℝ₊ⁿ.
Most deep networks use ReLU nowadays:
• Trains much faster: accelerates the convergence of SGD, thanks to its linear, non-saturating form.
• Less expensive operations: compared to sigmoid/tanh (no exponentials); implemented by simply thresholding a matrix at zero.
• More expressive.
• Mitigates the vanishing-gradient problem: the gradient does not saturate for positive inputs.
http://adilmoujahid.com/images/activation.png
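A small NumPy sketch comparing the two activations and their gradients; it makes the saturation argument above concrete:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)              # ≈ 0 for large |z|: saturation

def relu(z):
    return np.maximum(0, z)

def relu_grad(z):
    return (z > 0).astype(float)    # constant 1 wherever the unit is active

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid_grad(z))   # [~0  0.197  0.25  0.197  ~0] → gradients vanish
print(relu_grad(z))      # [0   0      0     1      1 ] → no saturation for z > 0
```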
Overfitting
The learned hypothesis may fit the training data very well, even outliers (noise), but fail to generalize to new examples (test data).
http://wiki.bethanycrane.com/overfitting-of-data
https://www.neuraldesigner.com/images/learning/selection_error.svg
Regularization
L2 regularization (weight decay)
• A regularization term that penalizes big weights is added to the objective:
  J_reg(θ) = J(θ) + λ Σ_k θ_k²
• The weight-decay value λ determines how dominant regularization is during gradient computation.
• A big weight-decay coefficient → a big penalty for big weights.
Dropout
• Randomly drop units (along with their connections) during training.
• Each unit is retained with a fixed probability p, independent of other units.
• The hyper-parameter p has to be chosen (tuned).
Early stopping
• Use the validation error to decide when to stop training.
• Stop when the monitored quantity has not improved after n subsequent epochs; n is called the patience.
Srivastava, Nitish, et al. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting." Journal of Machine Learning Research (2014)
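A hedged Keras sketch combining all three techniques (the layer sizes, λ = 0.01, p = 0.5, and patience n = 5 are illustrative placeholders, not values from the slides):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(100,),
                 kernel_regularizer=regularizers.l2(0.01)),  # L2 weight decay, λ = 0.01
    layers.Dropout(0.5),        # drop rate 0.5, i.e. each unit retained with p = 0.5
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="sgd", loss="binary_crossentropy")

early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss",          # watch the validation error
    patience=5,                  # stop after n = 5 epochs without improvement
    restore_best_weights=True)

x = np.random.rand(1000, 100)    # toy data, for illustration only
y = np.random.randint(0, 2, size=1000)
model.fit(x, y, validation_split=0.2, epochs=100,
          callbacks=[early_stop], verbose=0)
```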
Convolutional Neural Networks
• We know it is good to learn a small model.
• From this fully connected model, do we really need all the edges?
• Can some of these be shared?
Consider learning an image:
• Some patterns are much smaller than the whole image. A "beak" detector, for example, can represent a small region with fewer parameters.
• The same pattern appears in different places, so the detectors can be compressed! What about training a lot of such "small" detectors, where each detector must "move around"? An "upper-left beak" detector and a "middle beak" detector can be compressed to the same parameters.
The whole CNN
[Pipeline: input image → Convolution → Max Pooling → Convolution → Max Pooling (this pair can repeat many times) → Flattened → Fully Connected Feedforward network → cat, dog, ...]
A convolutional layer
A CNN is a neural network with some convolutional layers (and some other layers). A convolutional layer has a number of filters that perform the convolution operation; a filter acts as a detector, e.g. a beak detector.
Convolution
6 × 6 image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

Filter 1 (3 × 3):
 1 -1 -1
-1  1 -1
-1 -1  1

Filter 2 (3 × 3):
-1  1 -1
-1  1 -1
-1  1 -1

Sliding Filter 1 over the image with stride = 1 gives:
 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1

Sliding Filter 2 over the image with stride = 1 gives:
-1 -1 -1 -1
-1 -1 -2  1
-1 -1 -2  1
-1  0 -4  3

Repeat this for each filter: the two 4 × 4 images form a 2 × 4 × 4 feature map.
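A minimal NumPy sketch of this valid, stride-1 convolution (cross-correlation, as is standard in CNN layers); it reproduces the 4 × 4 feature map for Filter 1:

```python
import numpy as np

image = np.array([[1, 0, 0, 0, 0, 1],
                  [0, 1, 0, 0, 1, 0],
                  [0, 0, 1, 1, 0, 0],
                  [1, 0, 0, 0, 1, 0],
                  [0, 1, 0, 0, 1, 0],
                  [0, 0, 1, 0, 1, 0]])

filter1 = np.array([[ 1, -1, -1],
                    [-1,  1, -1],
                    [-1, -1,  1]])

def conv2d(img, filt, stride=1):
    """Valid 2-D cross-correlation: slide the filter, take dot products."""
    fh, fw = filt.shape
    out_h = (img.shape[0] - fh) // stride + 1
    out_w = (img.shape[1] - fw) // stride + 1
    out = np.zeros((out_h, out_w), dtype=img.dtype)
    for i in range(out_h):
        for j in range(out_w):
            patch = img[i*stride:i*stride+fh, j*stride:j*stride+fw]
            out[i, j] = np.sum(patch * filt)
    return out

print(conv2d(image, filter1))
# [[ 3 -1 -3 -1]
#  [-3  1  0 -3]
#  [-3 -3  0  1]
#  [ 3 -2 -2 -1]]
```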
Convolution vs. Fully Connected
[Figure: the 6 × 6 image is flattened into 36 inputs x1 … x36 feeding a fully connected layer. Convolution is the sparse, weight-shared special case: each output value connects to only 9 of the 36 inputs (one 3 × 3 patch), and the same 9 filter weights are reused at every position.]
Max Pooling
The two 4 × 4 feature maps produced by Filter 1 and Filter 2 are each downsampled: take the maximum of every non-overlapping 2 × 2 block, keeping only the strongest filter response in each region.
Why Pooling
• Subsampling pixels will not change the object: a subsampled bird is still a bird.
• We can subsample the pixels to make the image smaller, so fewer parameters are needed to characterize the image.
A CNN compresses a fully connected network in two ways:
• reducing the number of connections
• sharing weights on the edges
Max pooling further reduces the complexity.
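A NumPy sketch of 2 × 2 max pooling (non-overlapping blocks), applied to the Filter 1 feature map from the convolution example:

```python
import numpy as np

def max_pool2d(fmap, size=2):
    """Max over non-overlapping size × size blocks."""
    h, w = fmap.shape
    trimmed = fmap[:h - h % size, :w - w % size]   # crop to a multiple of size
    blocks = trimmed.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

fmap1 = np.array([[ 3, -1, -3, -1],
                  [-3,  1,  0, -3],
                  [-3, -3,  0,  1],
                  [ 3, -2, -2, -1]])
print(max_pool2d(fmap1))
# [[3 0]
#  [3 1]]
```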
The whole CNN
[Pipeline: image → Convolution → Max Pooling → Convolution → Max Pooling; this pair can repeat many times.]
Each convolution + max-pooling round produces a new image, smaller than the original; its number of channels equals the number of filters. In the running example, the new image is 2 × 2 with 2 channels:
Channel 1:   Channel 2:
3 0          -1 1
3 1           0 3
The whole CNN
[Pipeline: input image → Convolution → Max Pooling → Convolution → Max Pooling → Flattened → Fully Connected Feedforward network → cat, dog, ...; each convolution + pooling round yields a new image.]
Flattening
The 2 × 2 × 2 output is flattened into an 8-dimensional vector (reading each channel row by row: 3, 0, 3, 1, −1, 1, 0, 3) and fed into the fully connected feedforward network.
CNN in Keras
Only the network structure and the input format change (vector → 3-D tensor).
input_shape = (28, 28, 1): 28 × 28 pixels, 1 channel (black/white; 3 for RGB).
The first convolution layer has 25 3 × 3 filters, e.g.:
 1 -1 -1      -1  1 -1
-1  1 -1      -1  1 -1
-1 -1  1      -1  1 -1
[Pipeline: input → Convolution → Max Pooling → Convolution → Max Pooling.]
CNN in Keras
Only the network structure and the input format change (vector → 3-D array). Shapes through the network:
• Input: 1 × 28 × 28
• Convolution (25 filters, 3 × 3): 25 × 26 × 26. How many parameters for each filter? 9 (a 3 × 3 filter on a 1-channel input).
• Max Pooling: 25 × 13 × 13
• Convolution (50 filters, 3 × 3): 50 × 11 × 11. How many parameters for each filter? 225 = 25 × 9 (each filter now spans the 25 input channels).
• Max Pooling: 50 × 5 × 5
CNN in Keras
Convolution
Max Pooling
Convolution
Max Pooling
Input
1 x 28 x 28
25 x 26 x 26
25 x 13 x 13
50 x 11 x 11
50 x 5 x 5
Flattened
1250
Fully connected
feedforward network
Output
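A hedged Keras sketch that reproduces these shapes (the slides give the architecture but not the code; the 10-class softmax output is an illustrative assumption):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),         # 28 × 28 pixels, 1 channel
    layers.Conv2D(25, (3, 3)),               # → 26 × 26 × 25 (9 weights per filter)
    layers.MaxPooling2D((2, 2)),             # → 13 × 13 × 25
    layers.Conv2D(50, (3, 3)),               # → 11 × 11 × 50 (225 weights per filter)
    layers.MaxPooling2D((2, 2)),             # → 5 × 5 × 50
    layers.Flatten(),                        # → 1250
    layers.Dense(10, activation="softmax"),  # assumed: 10 output classes
])
model.summary()   # prints the shapes above, including the 1250-long flattened vector
```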
References
• Srivastava, N., et al. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting." Journal of Machine Learning Research (2014)
• Bergstra, J., and Bengio, Y. "Random Search for Hyper-Parameter Optimization." Journal of Machine Learning Research (2012)
• Kim, Y. "Convolutional Neural Networks for Sentence Classification." EMNLP (2014)
• Severyn, A., and Moschitti, A. "UNITN: Training Deep Convolutional Neural Network for Twitter Sentiment Classification." SemEval@NAACL-HLT (2015)
• Cho, K., et al. "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation." EMNLP (2014)
• Sutskever, I., et al. "Sequence to Sequence Learning with Neural Networks." NIPS (2014)
• Bahdanau, D., et al. "Neural Machine Translation by Jointly Learning to Align and Translate." ICLR (2015)
• Gal, Y., Islam, R., and Ghahramani, Z. "Deep Bayesian Active Learning with Image Data." ICML (2017)
• Nair, V., and Hinton, G.E. "Rectified Linear Units Improve Restricted Boltzmann Machines." ICML (2010)
• Collobert, R., et al. "Natural Language Processing (Almost) from Scratch." JMLR (2011)
• Kumar, S. "A Survey of Deep Learning Methods for Relation Extraction." arXiv preprint arXiv:1705.03645 (2017)
• Lin, Y., et al. "Neural Relation Extraction with Selective Attention over Instances." ACL (2016) [code]
• Zeng, D., et al. "Relation Classification via Convolutional Deep Neural Network." COLING (2014)
• Nguyen, T.H., and Grishman, R. "Relation Extraction: Perspective from CNNs." VS@HLT-NAACL (2015)
• Zhang, D., and Wang, D. "Relation Classification via Recurrent NN." arXiv preprint arXiv:1508.01006 (2015)
• Zhou, P., et al. "Attention-Based Bidirectional LSTM Networks for Relation Classification." ACL (2016)
• Mintz, M., et al. "Distant Supervision for Relation Extraction without Labeled Data." ACL-IJCNLP (2009)
Editor's Notes
• #9 (Training): hyper-parameters.
• #13 (ReLU): ReLU units can "die": a large gradient flowing through a ReLU can update the weights in such a way that the neuron never activates on any data point again, and from that point on the gradient flowing through the unit is forever zero. With a proper setting of the learning rate this is less frequently an issue.