Image Caption Generation using Convolutional Neural Network and LSTM

Team 21
Omkar Reddy Gojala
Mrinalini Injeti Ramakanth

Two dogs are wrestling in
the grass
 Goal is to generate a descriptive sentence of an image
 Project was inspired by the works of Andrej Karpathy and Marc Tanti et al.(2017)
 Potential Applications:
 Aiding visually impaired
 Generating video summary using individual frames
Neural
Network

 We used Flickr8K dataset for this
project
 Flickr8K dataset contains a variety of
images depicting scenes and situations
 The dataset consists of 8000 images
and each image has 5 corresponding
descriptions
 We split the data into 6000, 1000, &
1000 images as training, validation
and testing sets respectively
 The images are of different dimensions
• A man riding his bike on a hill
• A man with helmet and backpack standing on dirt
bike in a hilly grassy area
• A person rides a motorbike through a grassy field
• Man on motorcycle riding in dry field wearing a
helmet and backpack
• The biker is riding through a grassy plain .

 Each description is tokenized and converted to lowercase
 Removed alpha-numeric characters and punctuation marks
 We use startseq and endseq as prefix and postfix for each caption respectively
 Filtered out unique words from the corpus and represented each word by an integer
 To generate a fixed length word vector we calculated the maximum length caption

 Resized all images to a fixed size of 299x299x3 using OpenCV
 Employed transfer learning using pre-trained InceptionV3 CNN model to encode images
 We removed the last softmax layer from the InceptionV3 network to extract 2048 image
vector

 For each image we will train the model by temporally injecting incremental
sequences of the description
 In this phase, we essentially create labels in our training data
Image Partial Caption Target Word
Image startseq a
Image startseq a young
Image startseq a young boy
…… …… ……
Image startseq a young boy
wearing a helmet and
riding a bike in a park
endseq

 We used an encoder-decoder architecture
 2048 image vector is fed to a Dense layer to
generate 256 length image vector
 34 length word vector is fed to LSTM/RNN to
output 256 length word vector
 Decoder model adds both the encoder outputs
and is fed to Dense 256 layer
 The last Dense layer will have as many nodes as
the vocabulary size
 The last softmax layer predicts the next word
present in the output vocabulary

 Caption is predicted word by word
 Image is fed along with the first word(startseq) to the RNN to predict the second
word
 Again the same image along with first word + second word is fed to the RNN to
predict the third word and so on until the last word(endseq) is encountered
Neural Network
model
(i=0) startseq
(i=1) startseq little
(i=2) …..
(i=0) little
(i=1) boy
(i=2) ….
Target Word

LSTM (Long Short Term Memory)
Bilingual Evaluation Understudy Score (BLEU)
 BLEU is a metric for evaluating a generated sentence to a reference sentence
 BLEU score lies between 0 and 1
Simple RNN (Recurrent Neural Network)
BLEU N-GRAM SCORE
BLEU-1 0.572214
BLEU-2 0.339204
BLEU-3 0.237129
BLEU-4 0.116733
BLEU N-GRAM SCORE
BLEU-1 0.364472
BLEU-2 0.181942
BLEU-3 0.103185
BLEU-4 0.085675

Correct Predictions
Actual Caption:
a boy with a blue helmet is
riding a bike
Predicted Caption:
little boy rides bike with
helmet
Actual Caption:
white fluffy dog running in
the dirt
Predicted Caption:
white dog runs across the
sand
Actual Caption:
a boy dribbles a basketball
in the gymnasium
Predicted Caption:
boy in white shirt is playing
basketball

Funny Predictions ??
Actual Caption:
man fly fishing in a small
river with steam in the
background
Predicted Caption:
Man is swinging on a swing
Actual Caption:
a woman wearing a black and
white outfit while holding her
sunglasses
Predicted Caption:
man in pink dress is holding
her head
Actual Caption:
a group of different people
are walking in all different
directions in a city
Predicted Caption:
group of people walking
ocean

Predictions that went really wrong!
Actual Caption:
A man wearing a red life
jacket is holding a purple
rope while waterskiing
Predicted Caption:
man in white and white and
white shorts leash on swing
Actual Caption:
A dog is chewing on a metal
pole
Predicted Caption:
dog is standing in its mouth
Actual Caption:
a young hockey player
playing in the ice rink
Predicted Caption:
chasing player in motorcycle
is playing chasing

FUTURE
WORK
 We can enhance the predictions by using
more training examples. For example using
Flickr32k dataset which has 32000 images
 Implement visual attention techniques,
which focuses on interesting parts of the
image
 Creating an application for visually
impaired to convert the generated caption
into voice output

Image Caption Generation using Convolutional Neural Network and LSTM

Image Caption Generation using Convolutional Neural Network and LSTM

More Related Content

What's hot (20)

Similar to Image Caption Generation using Convolutional Neural Network and LSTM (20)

Recently uploaded (20)

Image Caption Generation using Convolutional Neural Network and LSTM