IMAGE CAPTIONING
MUHAMMAD ZBEEDAT
MAY 2019
INTRODUCTION
• What do you see in the picture?
• Well, some of you might say “A white dog in a grassy area”, some may say “White dog with brown spots”, and yet others might say “A dog on grass and some pink flowers”.
• All of these captions are certainly relevant for this image, and there may be others as well. But the point I want to make is that it’s so easy for us, as human beings, to glance at a picture and describe it in appropriate language. Even a 5-year-old could do this with utmost ease.
• But can you write a computer program that takes an image as input and produces a relevant caption as output?
• Just prior to the recent development of DNNs (Deep Neural Networks), this problem was inconceivable even to the most advanced researchers in Computer Vision. But with the advent of Deep Learning, it can be solved fairly easily if we have the required dataset.
• This problem was well researched by Andrej Karpathy in his PhD thesis at Stanford; he is now the Director of AI at Tesla.
• To get a better feel for this problem, I strongly recommend trying the state-of-the-art system created by Microsoft called CaptionBot. Just go to this link and try uploading any picture you want; the system will generate a caption for it.
MOTIVATION
We must first understand how important this problem is to real-world scenarios. Let’s look at a few applications where a solution to this problem can be very useful:
MOTIVATION
Aid to the blind: We can create a product for the blind that guides them when travelling on the roads without the support of anyone else. We can do this by first converting the scene into text and then the text into voice. Both are now well-known applications of Deep Learning. Refer to this link, which shows how Nvidia research is trying to create such a product.
Self-driving cars: Automatic driving is one of the biggest challenges, and if we can properly caption the scene around the car, it can give a boost to the self-driving system.
MOTIVATION
Automatic captioning can help make Google Image Search as good as Google Search: every image could first be converted into a caption, and then the search could be performed based on that caption.
In web development, it’s good practice to provide a description for any image that appears on the page, so that the image can be read or heard rather than just seen. This makes web content accessible.
MOTIVATION
CCTV (closed-circuit television) cameras are everywhere today, but along with viewing the world, if we can also generate relevant captions for what they see, then we can raise alarms as soon as some malicious activity is going on somewhere. This could probably help reduce crime and/or accidents.
It can also be used to describe video in real time.
DATA COLLECTION
• There are many open-source datasets available for this problem, like Flickr 8k (containing 8k images), Flickr 30k (containing 30k images), MS COCO (containing 180k images), etc.
• COCO stands for Common Objects in Context; it contains a large variety of images, where each image comes with a set of five human-written captions.
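As a hedged sketch of what the data looks like, the MS COCO captions can be loaded with the pycocotools package (the annotation-file path below is an assumption about where the dataset was downloaded):

```python
# Sketch: reading the captions attached to one MS COCO image.
from pycocotools.coco import COCO

coco = COCO('annotations/captions_train2017.json')  # assumed download location
img_id = coco.getImgIds()[0]                        # pick the first image
anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id))
for ann in anns:                                    # typically five captions
    print(ann['caption'])
```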
EXAMPLE
CAPTIONING MODEL
A captioning model relies on two main components, a CNN and an RNN. Captioning is all about merging the two to combine their most powerful attributes:
CNNs (Convolutional Neural Networks) excel at preserving spatial information and recognizing objects in images.
RNNs (Recurrent Neural Networks) work well with any kind of sequential data, such as generating a sequence of words.
So by merging the two, you can get a model that can find patterns in images, and then use that information to help generate a description of those images.
CNN
Convolutional Neural Networks, or CNNs, were designed to map image data to an output variable.
They have proven so effective that they are the go-to method for any type of prediction problem involving image
data as an input.
The benefit of using CNNs is their ability to develop an internal representation of a two-dimensional image. This allows the model to learn position- and scale-invariant structures in the data, which is important when working with images.
Use CNNs For:
• Image data
• Classification prediction problems
• Regression prediction problems
More generally, CNNs work well with data that has a spatial relationship.
Although not specifically developed for non-image data, CNNs achieve state-of-the-art results on problems such as document classification (for example, there is an order relationship between words in a document of text), used in sentiment analysis and related problems.
Sentiment analysis: the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic, product, etc. is positive, negative, or neutral.
RNN
RNNs in general, and LSTMs in particular, have seen the most success when working with sequences of words and paragraphs, an area generally called natural language processing.
This includes both sequences of text and sequences of spoken language represented as a time series. They are also used as generative models that require a sequence output, not only with text, but in applications such as generating handwriting.
Use RNNs For:
• Text data
• Speech data
• Classification prediction problems
• Regression prediction problems
• Generative models
Don’t Use RNNs For:
• Tabular data (as you would see in a CSV file or spreadsheet)
• Image data
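To make the sequence side concrete, here is a minimal, hedged Keras sketch of an RNN used generatively: an LSTM that predicts the next word from the words seen so far (vocab_size and max_len are assumed values, not from the slides):

```python
from tensorflow.keras import layers, models

vocab_size, max_len = 5000, 20  # assumed vocabulary size and sequence length

model = models.Sequential([
    layers.Embedding(vocab_size, 128, input_length=max_len),  # word ids -> vectors
    layers.LSTM(256),                                         # summarize the sequence
    layers.Dense(vocab_size, activation='softmax'),           # next-word distribution
])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
```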
MODEL OF IMAGE CAPTIONING
APPROACH
• Say you’re asked to write a caption that describes this image; how would you approach this task?
• Based on how these objects are placed in an image and their relationship to each other, you might think that a dog is looking at the sky. He is smiling, so he might well be happy. He is outside, so he might well be in a park. The sky isn’t blue, so it might be sunset or sunrise.
After collecting these visual observations, you could put together a phrase that describes the image as, “A happy dog is looking at the sky”.
CHALLENGES
1. Recognize objects in the image
2. Generate a fluent description in natural language
NEURAL OBJECT RECOGNITION
• A solved problem: Convolutional Neural Networks (CNNs) do the trick.
• A CNN is an architecture specialized in finding topological invariants in the input.
• It finds relationships between atomic elements of the input and infers higher abstractions.
• Highly resistant to noise and spatial transformations.
• It learns automatically which features are relevant to extract from an input.
• Not limited to images: CNNs can be applied to text, audio, etc.
ARCHITECTURE OF CNN
1. Convolution
2. Non-linearity (ReLU)
3. Pooling or sub-sampling
4. Classification (fully connected layer)
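As a minimal Keras sketch of these four stages (my illustration, with an assumed input size and class count, not the exact network from the slides):

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), input_shape=(224, 224, 3)),  # 1. convolution
    layers.Activation('relu'),                             # 2. non-linearity (ReLU)
    layers.MaxPooling2D((2, 2)),                           # 3. pooling / sub-sampling
    layers.Flatten(),
    layers.Dense(10, activation='softmax'),                # 4. fully connected classifier
])
model.summary()
```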
OBJECT DETECTION
POOLING - MAX/SUM/AVG
• Downsample the image by “hashing” it to fewer values: we can take the max, the sum, or the average of each window.
CLASSIFICATION: FULLY CONNECTED LAYER
• After a couple of “convolve, ReLU and pool” cycles, we may have 128 channels of 14×14 activation maps.
• Concatenate and reshape them into a linear array of 25,088 (14×14×128) cells.
• Feed it to a feed-forward neural network that will output our classes.
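A tiny NumPy sketch of that reshape-and-classify step (illustrative shapes only; 1000 classes is an assumption):

```python
import numpy as np

features = np.random.rand(14, 14, 128)      # stand-in for the final conv/pool output
flat = features.reshape(-1)                 # linear array of 25088 = 14*14*128 cells
W = np.random.rand(1000, flat.size) * 0.01  # fully connected weights for 1000 classes
logits = W @ flat                           # one score per class
print(flat.size, logits.shape)              # 25088 (1000,)
```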
• Since we want a set of features that represents the spatial content in the image, we’re going to remove the final fully connected layer that classifies the image, and look at an earlier layer that processes the spatial information in the image.
CNN MODEL
• Feed an image into a CNN. We can use a pre-trained network like VGG16, ResNet, AlexNet and more...
• https://neurohive.io/en/popular-networks/vgg16/
• https://medium.com/@sidereal/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5
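A hedged Keras sketch of the two slides above: load a pre-trained VGG16, cut off the final classification layer, and keep the 4096-d 'fc2' activations as the image features ('dog.jpg' is a placeholder file name):

```python
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing import image

base = VGG16(weights='imagenet')                           # full 1000-class network
encoder = Model(base.input, base.get_layer('fc2').output)  # drop the softmax layer

img = image.load_img('dog.jpg', target_size=(224, 224))    # placeholder image
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
features = encoder.predict(x)                              # shape (1, 4096)
```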
VGG16
• VGG16 is a convolutional neural network model proposed by K. Simonyan and A. Zisserman from the University of Oxford in the paper “Very Deep Convolutional Networks for Large-Scale Image Recognition”. The model achieves 92.7% top-5 test accuracy on ImageNet, a dataset of over 14 million images belonging to 1000 classes. It was one of the famous models submitted to ILSVRC-2014. It improves over AlexNet by replacing large kernel-sized filters (11 and 5 in the first and second convolutional layers, respectively) with multiple 3×3 kernel-sized filters one after another. VGG16 was trained for weeks on NVIDIA Titan Black GPUs.
VGG-16 VS ALEX-NET
• Similar to AlexNet: only 3×3 convolutions, but lots of filters.
[Diagram: AlexNet layer stack]
VGG-16 LAYERS
VGG-16# 3D CONVOLUTION LAYERS
Filter size: 3×3 (the smallest size that captures the notion of left/right, up/down, center).
In one of the configurations, it also utilizes 1×1 convolution filters, which can be seen as a linear transformation of the input channels.
The convolution stride is fixed to 1 pixel.
VGG-16# THRESHOLD LAYERS
VGG-16# MAX POOLING LAYERS
Max-pooling is performed over a 2×2 pixel window,
with stride 2.
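Putting the two preceding slides together, one VGG-style block might look like this in Keras (a sketch, not the full 16-layer network; the channel count is assumed):

```python
from tensorflow.keras import layers, models

vgg_block = models.Sequential([
    layers.Conv2D(64, (3, 3), strides=1, padding='same', activation='relu',
                  input_shape=(224, 224, 3)),          # 3x3 conv, stride 1
    layers.Conv2D(64, (3, 3), strides=1, padding='same', activation='relu'),
    layers.MaxPooling2D(pool_size=(2, 2), strides=2),  # 2x2 max-pool, stride 2
])
```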
VGG-16# MULTILAYER CLASSIFIER
VGG-16# SOFTMAX LAYER
VGG-16# WHAT DO THEY LEARN?
VGG-16# TRAINING
• Backward propagation
LEARNING OBJECT PARTS
EXAMPLE
CNN DEMO TIME
Real-time web handwritten digit recognition:
http://scs.ryerson.ca/~aharley/vis/conv/flat.html
There are a lot of “famous” nets that can be freely downloaded and used off the shelf, like ResNet, which achieves a 3.57% top-5 error rate on ImageNet’s 1000 categories.
VGG-16 and more…
• So now the CNN acts as a feature extractor that compresses the information in the original image into a smaller representation. Since it encodes the content of the image into a smaller feature vector, this CNN is often called the encoder.
When we process this feature vector and use it as an initial input to the following RNN, the RNN is called the decoder, because it decodes the processed feature vector and turns it into natural language.
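A hedged Keras sketch of such an encoder-decoder captioner, in the “merge” style of the caption-generation tutorial cited in the references (the slides describe injecting the feature vector as the RNN’s initial input, a closely related variant; vocab_size and max_len are assumptions, and the 4096-d input matches VGG16 fc2 features):

```python
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size, max_len = 5000, 34  # assumed vocabulary and caption length

# Encoder side: the precomputed CNN feature vector, compressed to 256-d.
img_in = Input(shape=(4096,))
img_vec = Dense(256, activation='relu')(Dropout(0.5)(img_in))

# Decoder side: the partial caption so far, summarized by an LSTM.
txt_in = Input(shape=(max_len,))
txt_emb = Embedding(vocab_size, 256, mask_zero=True)(txt_in)
txt_vec = LSTM(256)(Dropout(0.5)(txt_emb))

# Merge both views and predict the next word of the caption.
merged = Dense(256, activation='relu')(add([img_vec, txt_vec]))
out = Dense(vocab_size, activation='softmax')(merged)

model = Model(inputs=[img_in, txt_in], outputs=out)
model.compile(loss='categorical_crossentropy', optimizer='adam')
```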
RNN - RECURRENT NETWORKS OFFER A LOT OF FLEXIBILITY
Vanilla mode of processing without RNN, from fixed-sized input to fixed-sized output (e.g. image classification).
Sequence output (e.g. image captioning takes an image and outputs a sentence of words).
Sequence input (e.g. sentiment analysis where a given sentence is classified as expressing positive or negative sentiment).
Sequence input and sequence output (e.g. machine translation: an RNN reads a sentence in English and then outputs a sentence in French).
Synced sequence input and output (e.g. video classification where we wish to label each frame of the video).
SEQUENTIAL PROCESSING OF FIXED INPUTS (IN ABSENCE OF SEQUENCES)
• The figure below shows results from two very nice papers from DeepMind:
An algorithm learns a recurrent network policy that steers its attention around an image; in particular, it learns to read out house numbers from left to right.
A recurrent network generates images of digits by learning to sequentially add color to a canvas.
LANGUAGE GENERATION WITH RECURRENT NEURAL NETWORKS
• Language generation is a serial task. We generate words one after another.
• This is well modeled by recurrent neural cells: a neuron that uses itself over and over again to accept serial inputs, outputting a new value each time.
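This word-by-word loop can be sketched as follows (hedged: the 'startseq'/'endseq' markers, the tokenizer, and the two-input model follow the captioning tutorial in the references, and are assumptions rather than details from the slides):

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, photo_features, max_len=34):
    """Greedy decoding: repeatedly predict the next word and feed it back."""
    text = 'startseq'                         # assumed start-of-caption marker
    for _ in range(max_len):
        seq = tokenizer.texts_to_sequences([text])[0]
        seq = pad_sequences([seq], maxlen=max_len)
        probs = model.predict([photo_features, seq], verbose=0)[0]
        word = tokenizer.index_word.get(int(np.argmax(probs)))
        if word is None or word == 'endseq':  # assumed end-of-caption marker
            break
        text += ' ' + word
    return text.replace('startseq', '').strip()
```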
RNN HIDDEN STATES
TRAINING RNN
TRAINING PHASE
[Diagram: an input image X of size 3×224×224 passes through a pre-trained VGG-16; its features enter the RNN hidden state through the weight matrix Wih]
TEST PHASE
RNN AND THE PROBLEM OF MEMORY
All network state is held in a single cell, used over and over again. Internal
state
can get really complicated. Moving the values around during training can lead
to
loss of data.
RNN has a “plugin” architecture, in which we can use different types of cells:
• Simple RNN cell: fastest, but breaks over long sequences. Outdated.
• LSTM cell: slower, supports selectively forgetting and keeping data.
Standard.
• GRU cell: like LSTM, but faster due to simpler internal architecture. State of
art.
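In Keras this “plugin” choice is literally a one-word swap (a sketch; the input shape of 20 timesteps of 128-d vectors is assumed):

```python
from tensorflow.keras import layers

simple = layers.SimpleRNN(256, input_shape=(20, 128))  # fastest, outdated
lstm   = layers.LSTM(256, input_shape=(20, 128))       # the standard choice
gru    = layers.GRU(256, input_shape=(20, 128))        # simpler and faster than LSTM
```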
A REGULAR RNN
• Vanilla RNN activation function
LSTM - LONG SHORT-TERM MEMORY
• With LSTM
LSTM
• Overall structure
• Recommended video: https://www.youtube.com/watch?v=WCUNPb-5EYI
GATES
GATING
MEMORY - ANOTHER ENTIRELY SEPARATE NEURAL NETWORK THAT LEARNS WHEN TO FORGET WHAT
MEMORY - ANOTHER ENTIRELY SEPARATE NEURAL NETWORK THAT LEARNS WHAT PREDICTIONS TO PASS AND WHAT NOT
LONG SHORT-TERM MEMORY WITH ATTENTION
WORKS WITH ATTENTION TO DECIDE WHAT TO IGNORE
LSTM
• Core idea
LSTM
• Step by step
LSTM
• Step by step - Update
GRU - GATED RECURRENT UNIT
A VARIATION OF THE LSTM
GRU
An LSTM has 3 gates: input, output and forget. In a GRU we only have two gates: an update gate z and a reset gate r.
Update gate: decides how much of the previous memory to keep around.
Reset gate: defines how to combine the new input with the previous value.
Unlike the LSTM, the GRU has no persistent cell state distinct from the hidden state.
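A single GRU step written out in NumPy, following the description above (a sketch; this uses one common convention for which gate keeps the old state):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    z = sigmoid(Wz @ x + Uz @ h_prev)              # update gate: how much memory to keep
    r = sigmoid(Wr @ x + Ur @ h_prev)              # reset gate: how to mix input and past
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev))  # candidate hidden state
    return (1 - z) * h_prev + z * h_tilde          # no separate cell state, unlike LSTM
```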
SOFT ATTENTION MECHANISM
SOFT ATTENTION
• Example
SOFT ATTENTION
• Results
DENSE-CAP
• Fully Convolutional Localization Networks for Dense Captioning
DENSECAP
• Model Architecture
REFERENCES
• VGG16 – Convolutional Network for Classification and Detection
• The Unreasonable Effectiveness of Recurrent Neural Networks
• Illustrated Guide to LSTM’s and GRU’s: A step by step explanation
• Image Captioning with Keras
• Automatic Image Captioning: Building an image-caption generator from scratch
• Multi-Modal Methods: Image Captioning (From Translation to
Attention)
• TensorFlow Tutorial #22 - Image Captioning
RECOMMENDED VIDEOS
• Convolutional Neural Network (CNN) models
• CS231n Winter 2016: Lecture 7: Convolutional Neural Networks
• Convolutional Neural Network Course
• Recurrent Neural Networks (RNN) and Long Short-Term Memory
(LSTM)
• CS231n Lecture 10 - Recurrent Neural Networks, Image Captioning,
LSTM
• Andrej Karpathy - Automated Image Captioning with ConvNets and
Recurrent Nets
• Illustrated Guide to LSTM's and GRU's: A step by step explanation
• https://paperswithcode.com/task/text-generation
• https://machinelearningmastery.com/develop-a-deep-learning-caption-generation-model-in-python/