IMAGE CAPTIONING
MUHAMMAD ZBEEDAT
MAY 2019
INTRODUCTION
• What do you see in the picture?
• Well, some of you might say “A white dog in a grassy area”, some may say “White dog with brown spots”, and yet others might say “A dog on grass and some pink flowers”.
• All of these captions are certainly relevant for this image, and there may be others as well. But the point I want to make is that it’s so easy for us, as human beings, to glance at a picture and describe it in appropriate language. Even a 5-year-old could do this with utmost ease.
• But can you write a computer program that takes an image as input and produces a relevant caption as output?
• Just prior to the recent development of DNNs (Deep Neural Networks), this problem was inconceivable even to the most advanced researchers in Computer Vision. But with the advent of Deep Learning, it can be solved fairly easily if we have the required dataset.
• This problem was well researched by Andrej Karpathy in his PhD thesis at Stanford; he is now the Director of AI at Tesla.
• To get a better feel for this problem, I strongly recommend trying the state-of-the-art system created by Microsoft called CaptionBot. Just go to this link and try uploading any picture you want; the system will generate a caption for it.
MOTIVATION
We must first understand how important this problem is to real-world scenarios. Let’s look at a few applications where a solution to this problem can be very useful:
MOTIVATION
Aid to the blind: We can create a product for the blind that guides them when travelling on the roads without the support of anyone else. We can do this by first converting the scene into text and then the text into voice. Both are now well-known applications of Deep Learning. Refer to this link, which shows how Nvidia research is trying to create such a product.
Self-driving cars: Automatic driving is one of the biggest challenges, and if we can properly caption the scene around the car, it can give a boost to the self-driving system.
MOTIVATION
Automatic captioning can help make Google Image Search as good as Google Search: every image could first be converted into a caption, and then the search could be performed based on that caption.
In web development, it’s good practice to provide a description for any image that appears on the page, so that the image can be read or heard rather than just seen. This makes web content accessible.
MOTIVATION
CCTV (closed-circuit television) cameras are everywhere today, but along with viewing the world, if we can also generate relevant captions for what they see, then we can raise alarms as soon as some malicious activity is going on somewhere. This could probably help reduce crime and/or accidents.
It can also be used to describe video in real time.
DATA COLLECTION
• There are many open-source datasets available for this problem, like Flickr 8k (containing 8k images), Flickr 30k (containing 30k images), MS COCO (containing 180k images), etc.
• COCO stands for Common Objects in Context; it contains a large variety of images, where each image comes with a set of five human-written captions.
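As a hedged sketch of what the data looks like, the MS COCO captions can be loaded with the pycocotools package (the annotation-file path below is an assumption about where the dataset was downloaded):

```python
# Sketch: reading the captions attached to one MS COCO image.
from pycocotools.coco import COCO

coco = COCO('annotations/captions_train2017.json')  # assumed download location
img_id = coco.getImgIds()[0]                        # pick the first image
anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id))
for ann in anns:                                    # typically five captions
    print(ann['caption'])
```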
EXAMPLE
CAPTIONING MODEL
A captioning model relies on two main components, a CNN and an RNN. Captioning is all about merging the two to combine their most powerful attributes:
CNNs (Convolutional Neural Networks) excel at preserving spatial information and recognizing objects in images.
RNNs (Recurrent Neural Networks) work well with any kind of sequential data, such as generating a sequence of words.
So by merging the two, you can get a model that can find patterns in images, and then use that information to help generate a description of those images.
CNN
Convolutional Neural Networks, or CNNs, were designed to map image data to an output variable.
They have proven so effective that they are the go-to method for any type of prediction problem involving image
data as an input.
The benefit of using CNNs is their ability to develop an internal representation of a two-dimensional image. This allows the model to learn position- and scale-invariant structures in the data, which is important when working with images.
Use CNNs For:
• Image data
• Classification prediction problems
• Regression prediction problems
More generally, CNNs work well with data that has a spatial relationship.
Although not specifically developed for non-image data, CNNs achieve state-of-the-art results on problems such as document classification (for example, there is an order relationship between words in a document of text), used in sentiment analysis and related problems.
Sentiment analysis: the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic, product, etc. is positive, negative, or neutral.
RNN
RNNs in general, and LSTMs in particular, have seen the most success when working with sequences of words and paragraphs, an area generally called natural language processing.
This includes both sequences of text and sequences of spoken language represented as a time series. They are also used as generative models that require a sequence output, not only with text, but in applications such as generating handwriting.
Use RNNs For:
• Text data
• Speech data
• Classification prediction problems
• Regression prediction problems
• Generative models
Don’t Use RNNs For:
• Tabular data (as you would see in a CSV file or spreadsheet)
• Image data
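To make the sequence side concrete, here is a minimal, hedged Keras sketch of an RNN used generatively: an LSTM that predicts the next word from the words seen so far (vocab_size and max_len are assumed values, not from the slides):

```python
from tensorflow.keras import layers, models

vocab_size, max_len = 5000, 20  # assumed vocabulary size and sequence length

model = models.Sequential([
    layers.Embedding(vocab_size, 128, input_length=max_len),  # word ids -> vectors
    layers.LSTM(256),                                         # summarize the sequence
    layers.Dense(vocab_size, activation='softmax'),           # next-word distribution
])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
```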
MODEL OF IMAGE CAPTIONING
APPROACH
• Say you’re asked to write a caption that describes this image; how would you approach this task?
• Based on how these objects are placed in an image and their relationship to each other, you might think that a dog is looking at the sky. He is smiling, so he might well be happy. He is outside, so he might well be in a park. The sky isn’t blue, so it might be sunset or sunrise.
After collecting these visual observations, you could put together a phrase that describes the image as, “A happy dog is looking at the sky”.
CHALLENGES
1. Recognize objects in the image
2. Generate a fluent description in natural language
NEURAL OBJECT RECOGNITION
• A solved problem: Convolutional Neural Networks (CNNs) do the trick.
• A CNN is an architecture specialized in finding topological invariants in the input.
• It finds relationships between atomic elements of the input and infers higher abstractions.
• Highly resistant to noise and spatial transformations.
• It learns automatically which features are relevant to extract from an input.
• Not limited to images: CNNs can be applied to text, audio, etc.
ARCHITECTURE OF CNN
1. Convolution
2. Non-linearity (ReLU)
3. Pooling or sub-sampling
4. Classification (fully connected layer)
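As a minimal Keras sketch of these four stages (my illustration, with an assumed input size and class count, not the exact network from the slides):

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), input_shape=(224, 224, 3)),  # 1. convolution
    layers.Activation('relu'),                             # 2. non-linearity (ReLU)
    layers.MaxPooling2D((2, 2)),                           # 3. pooling / sub-sampling
    layers.Flatten(),
    layers.Dense(10, activation='softmax'),                # 4. fully connected classifier
])
model.summary()
```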
OBJECT DETECTION
POOLING - MAX/SUM/AVG
• Downsample the image by “hashing” it to fewer values: we can take the max, the sum, or the average of each window.
CLASSIFICATION: FULLY CONNECTED LAYER
• After a couple of “convolve, ReLU and pool” cycles, we may have 128 channels of 14×14 activation maps.
• Concatenate and reshape them into a linear array of 25,088 (14×14×128) cells.
• Feed it to a feed-forward neural network that will output our classes.
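A tiny NumPy sketch of that reshape-and-classify step (illustrative shapes only; 1000 classes is an assumption):

```python
import numpy as np

features = np.random.rand(14, 14, 128)      # stand-in for the final conv/pool output
flat = features.reshape(-1)                 # linear array of 25088 = 14*14*128 cells
W = np.random.rand(1000, flat.size) * 0.01  # fully connected weights for 1000 classes
logits = W @ flat                           # one score per class
print(flat.size, logits.shape)              # 25088 (1000,)
```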
• Since we want a set of features that represents the spatial content in the image, we’re going to remove the final fully connected layer that classifies the image, and look at an earlier layer that processes the spatial information in the image.
CNN MODEL
• Feed an image into a CNN. We can use a pre-trained network like VGG16, ResNet, AlexNet and more...
• https://neurohive.io/en/popular-networks/vgg16/
• https://medium.com/@sidereal/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5
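A hedged Keras sketch of the two slides above: load a pre-trained VGG16, cut off the final classification layer, and keep the 4096-d 'fc2' activations as the image features ('dog.jpg' is a placeholder file name):

```python
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing import image

base = VGG16(weights='imagenet')                           # full 1000-class network
encoder = Model(base.input, base.get_layer('fc2').output)  # drop the softmax layer

img = image.load_img('dog.jpg', target_size=(224, 224))    # placeholder image
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
features = encoder.predict(x)                              # shape (1, 4096)
```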
VGG16
• VGG16 is a convolutional neural network model proposed by K. Simonyan and A. Zisserman from the University of Oxford in the paper “Very Deep Convolutional Networks for Large-Scale Image Recognition”. The model achieves 92.7% top-5 test accuracy on ImageNet, a dataset of over 14 million images belonging to 1000 classes. It was one of the famous models submitted to ILSVRC-2014. It improves over AlexNet by replacing large kernel-sized filters (11 and 5 in the first and second convolutional layers, respectively) with multiple 3×3 kernel-sized filters one after another. VGG16 was trained for weeks on NVIDIA Titan Black GPUs.
VGG-16 VS ALEX-NET
• Similar to AlexNet: only 3×3 convolutions, but lots of filters.
[Diagram: AlexNet layer stack]
VGG-16 LAYERS
VGG-16# 3D CONVOLUTION LAYERS
Filter size: 3×3 (the smallest size that captures the notion of left/right, up/down, center).
In one of the configurations, it also utilizes 1×1 convolution filters, which can be seen as a linear transformation of the input channels.
The convolution stride is fixed to 1 pixel.
VGG-16# THRESHOLD LAYERS
VGG-16# MAX POOLING LAYERS
Max-pooling is performed over a 2×2 pixel window,
with stride 2.
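Putting the two preceding slides together, one VGG-style block might look like this in Keras (a sketch, not the full 16-layer network; the channel count is assumed):

```python
from tensorflow.keras import layers, models

vgg_block = models.Sequential([
    layers.Conv2D(64, (3, 3), strides=1, padding='same', activation='relu',
                  input_shape=(224, 224, 3)),          # 3x3 conv, stride 1
    layers.Conv2D(64, (3, 3), strides=1, padding='same', activation='relu'),
    layers.MaxPooling2D(pool_size=(2, 2), strides=2),  # 2x2 max-pool, stride 2
])
```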
VGG-16# MULTILAYER CLASSIFIER
VGG-16# SOFTMAX LAYER
VGG-16# WHAT DO THEY LEARN?
VGG-16# TRAINING
• Backward propagation
LEARNING OBJECT PARTS
EXAMPLE
CNN DEMO TIME
Real-time web handwritten digit recognition:
http://scs.ryerson.ca/~aharley/vis/conv/flat.html
There are a lot of “famous” nets that can be freely downloaded and used off the shelf, like ResNet, which achieves a 3.57% top-5 error rate on ImageNet’s 1000 categories.
VGG-16 and more…
• So now the CNN acts as a feature extractor that compresses the information in the original image into a smaller representation. Since it encodes the content of the image into a smaller feature vector, this CNN is often called the encoder.
When we process this feature vector and use it as an initial input to the following RNN, the RNN is called the decoder, because it decodes the processed feature vector and turns it into natural language.
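A hedged Keras sketch of such an encoder-decoder captioner, in the “merge” style of the caption-generation tutorial cited in the references (the slides describe injecting the feature vector as the RNN’s initial input, a closely related variant; vocab_size and max_len are assumptions, and the 4096-d input matches VGG16 fc2 features):

```python
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size, max_len = 5000, 34  # assumed vocabulary and caption length

# Encoder side: the precomputed CNN feature vector, compressed to 256-d.
img_in = Input(shape=(4096,))
img_vec = Dense(256, activation='relu')(Dropout(0.5)(img_in))

# Decoder side: the partial caption so far, summarized by an LSTM.
txt_in = Input(shape=(max_len,))
txt_emb = Embedding(vocab_size, 256, mask_zero=True)(txt_in)
txt_vec = LSTM(256)(Dropout(0.5)(txt_emb))

# Merge both views and predict the next word of the caption.
merged = Dense(256, activation='relu')(add([img_vec, txt_vec]))
out = Dense(vocab_size, activation='softmax')(merged)

model = Model(inputs=[img_in, txt_in], outputs=out)
model.compile(loss='categorical_crossentropy', optimizer='adam')
```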
RNN - RECURRENT NETWORKS OFFER A LOT OF FLEXIBILITY
Vanilla mode of processing without RNN, from fixed-sized input to fixed-sized output (e.g. image classification).
Sequence output (e.g. image captioning takes an image and outputs a sentence of words).
Sequence input (e.g. sentiment analysis where a given sentence is classified as expressing positive or negative sentiment).
Sequence input and sequence output (e.g. machine translation: an RNN reads a sentence in English and then outputs a sentence in French).
Synced sequence input and output (e.g. video classification where we wish to label each frame of the video).
SEQUENTIAL PROCESSING OF FIXED INPUTS (IN ABSENCE OF SEQUENCES)
• The figure below shows results from two very nice papers from DeepMind:
An algorithm learns a recurrent network policy that steers its attention around an image; in particular, it learns to read out house numbers from left to right.
A recurrent network generates images of digits by learning to sequentially add color to a canvas.
LANGUAGE GENERATION WITH RECURRENT NEURAL NETWORKS
• Language generation is a serial task. We generate words one after another.
• This is well modeled by recurrent neural cells: a neuron that uses itself over and over again to accept serial inputs, outputting a new value each time.
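This word-by-word loop can be sketched as follows (hedged: the 'startseq'/'endseq' markers, the tokenizer, and the two-input model follow the captioning tutorial in the references, and are assumptions rather than details from the slides):

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, photo_features, max_len=34):
    """Greedy decoding: repeatedly predict the next word and feed it back."""
    text = 'startseq'                         # assumed start-of-caption marker
    for _ in range(max_len):
        seq = tokenizer.texts_to_sequences([text])[0]
        seq = pad_sequences([seq], maxlen=max_len)
        probs = model.predict([photo_features, seq], verbose=0)[0]
        word = tokenizer.index_word.get(int(np.argmax(probs)))
        if word is None or word == 'endseq':  # assumed end-of-caption marker
            break
        text += ' ' + word
    return text.replace('startseq', '').strip()
```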
RNN HIDDEN STATES
TRAINING RNN
TRAINING PHASE
[Diagram: an input image X of size 3×224×224 passes through a pre-trained VGG-16; its features enter the RNN hidden state through the weight matrix Wih]
TEST PHASE
RNN AND THE PROBLEM OF MEMORY
All network state is held in a single cell, used over and over again. Internal
state
can get really complicated. Moving the values around during training can lead
to
loss of data.
RNN has a “plugin” architecture, in which we can use different types of cells:
• Simple RNN cell: fastest, but breaks over long sequences. Outdated.
• LSTM cell: slower, supports selectively forgetting and keeping data.
Standard.
• GRU cell: like LSTM, but faster due to simpler internal architecture. State of
art.
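In Keras this “plugin” choice is literally a one-word swap (a sketch; the input shape of 20 timesteps of 128-d vectors is assumed):

```python
from tensorflow.keras import layers

simple = layers.SimpleRNN(256, input_shape=(20, 128))  # fastest, outdated
lstm   = layers.LSTM(256, input_shape=(20, 128))       # the standard choice
gru    = layers.GRU(256, input_shape=(20, 128))        # simpler and faster than LSTM
```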
A REGULAR RNN
• Vanilla RNN activation function
LSTM - LONG SHORT-TERM MEMORY
• With LSTM
LSTM
• Overall structure
• Recommended video: https://www.youtube.com/watch?v=WCUNPb-5EYI
GATES
GATING
MEMORY - ANOTHER ENTIRELY SEPARATE NEURAL NETWORK THAT LEARNS WHEN TO FORGET WHAT
MEMORY - ANOTHER ENTIRELY SEPARATE NEURAL NETWORK THAT LEARNS WHAT PREDICTIONS TO PASS AND WHAT NOT
LONG SHORT-TERM MEMORY WITH ATTENTION
WORKS WITH ATTENTION TO DECIDE WHAT TO IGNORE
LSTM
• Core idea
LSTM
• Step by step
LSTM
• Step by step - Update
GRU - GATED RECURRENT UNIT
A VARIATION OF THE LSTM
GRU
An LSTM has 3 gates: input, output and forget. In a GRU we only have two gates: an update gate z and a reset gate r.
Update gate: decides how much of the previous memory to keep around.
Reset gate: defines how to combine the new input with the previous value.
Unlike the LSTM, the GRU has no persistent cell state distinct from the hidden state.
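A single GRU step written out in NumPy, following the description above (a sketch; this uses one common convention for which gate keeps the old state):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    z = sigmoid(Wz @ x + Uz @ h_prev)              # update gate: how much memory to keep
    r = sigmoid(Wr @ x + Ur @ h_prev)              # reset gate: how to mix input and past
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev))  # candidate hidden state
    return (1 - z) * h_prev + z * h_tilde          # no separate cell state, unlike LSTM
```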
SOFT ATTENTION MECHANISM
SOFT ATTENTION
• Example
SOFT ATTENTION
• Results
DENSE-CAP
• Fully Convolutional Localization Networks for Dense Captioning
DENSECAP
• Model Architecture
REFERENCES
• VGG16 – Convolutional Network for Classification and Detection
• The Unreasonable Effectiveness of Recurrent Neural Networks
• Illustrated Guide to LSTM’s and GRU’s: A step by step explanation
• Image Captioning with Keras
• Automatic Image Captioning: Building an image-caption generator from scratch
• Multi-Modal Methods: Image Captioning (From Translation to
Attention)
• TensorFlow Tutorial #22 - Image Captioning
RECOMMENDED VIDEOS
• Convolutional Neural Network (CNN) models
• CS231n Winter 2016: Lecture 7: Convolutional Neural Networks
• Convolutional Neural Network Course
• Recurrent Neural Networks (RNN) and Long Short-Term Memory
(LSTM)
• CS231n Lecture 10 - Recurrent Neural Networks, Image Captioning,
LSTM
• Andrej Karpathy - Automated Image Captioning with ConvNets and
Recurrent Nets
• Illustrated Guide to LSTM's and GRU's: A step by step explanation
• https://paperswithcode.com/task/text-generation
• https://machinelearningmastery.com/develop-a-deep-learning-caption-generation-model-in-python/