Team 21
Omkar Reddy Gojala
Mrinalini Injeti Ramakanth
[Figure: an image is passed through a neural network, which outputs the caption "Two dogs are wrestling in the grass"]
• The goal is to generate a descriptive sentence for an image
• The project was inspired by the work of Andrej Karpathy and Marc Tanti et al. (2017)
• Potential applications:
  • Aiding the visually impaired
  • Generating video summaries from individual frames
• We used the Flickr8K dataset for this project
• Flickr8K contains a variety of images depicting scenes and situations
• The dataset consists of 8000 images, and each image has 5 corresponding descriptions
• We split the data into 6000, 1000, and 1000 images as training, validation, and test sets respectively
• The images have different dimensions

Example descriptions for one image:
• A man riding his bike on a hill
• A man with helmet and backpack standing on dirt bike in a hilly grassy area
• A person rides a motorbike through a grassy field
• Man on motorcycle riding in dry field wearing a helmet and backpack
• The biker is riding through a grassy plain.
• Each description is tokenized and converted to lowercase
• Punctuation marks and tokens containing digits are removed
• startseq and endseq are added as prefix and suffix to each caption, respectively
• The unique words in the corpus are collected into a vocabulary, and each word is represented by an integer
• To generate fixed-length word vectors, we calculated the length of the longest caption
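
The preprocessing steps above can be sketched in plain Python. This is a minimal illustration, not the team's code: the function names are hypothetical, and reserving id 0 for padding and padding on the left are assumptions, since the slides only say that a fixed length is derived from the longest caption.

```python
import string

def clean_caption(caption):
    """Lowercase, strip punctuation, drop tokens with digits, add markers."""
    table = str.maketrans("", "", string.punctuation)
    tokens = caption.lower().translate(table).split()
    tokens = [t for t in tokens if t.isalpha()]
    return ["startseq"] + tokens + ["endseq"]

def build_vocab(token_lists):
    """Map each unique word to an integer; 0 is kept free for padding (assumption)."""
    words = sorted({w for tokens in token_lists for w in tokens})
    return {w: i + 1 for i, w in enumerate(words)}

def encode(tokens, word_to_id, max_len):
    """Convert tokens to integer ids and left-pad with zeros to max_len (assumption)."""
    ids = [word_to_id[t] for t in tokens]
    return [0] * (max_len - len(ids)) + ids

captions = [
    "A man riding his bike on a hill",
    "A person rides a motorbike through a grassy field",
]
cleaned = [clean_caption(c) for c in captions]
word_to_id = build_vocab(cleaned)
max_len = max(len(tokens) for tokens in cleaned)  # length of the longest caption
encoded = [encode(tokens, word_to_id, max_len) for tokens in cleaned]
```

With the full corpus, max_len would be the fixed word-vector length used by the model.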
• Resized all images to a fixed size of 299x299x3 using OpenCV
• Employed transfer learning, using the pre-trained InceptionV3 CNN model to encode images
• Removed the last softmax layer from the InceptionV3 network to extract a 2048-dimensional feature vector for each image
• For each image, we train the model by temporally injecting incrementally longer sequences of the description
• In this phase, we essentially create the labels for our training data

Image    Partial Caption                                                       Target Word
Image    startseq                                                              a
Image    startseq a                                                            young
Image    startseq a young                                                      boy
……       ……                                                                    ……
Image    startseq a young boy wearing a helmet and riding a bike in a park     endseq
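
The table above can be produced mechanically from one tokenized caption. The sketch below is illustrative only; the function name and the image identifier are hypothetical.

```python
def make_training_pairs(image_id, tokens):
    """Return (image, partial caption, target word) rows for one caption."""
    return [(image_id, tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = "startseq a young boy endseq".split()
pairs = make_training_pairs("image_1", tokens)

for image_id, partial, target in pairs:
    print(image_id, "|", " ".join(partial), "|", target)
```

A caption of n tokens (including startseq and endseq) yields n - 1 training rows, each pairing the same image with a longer prefix.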
• We used an encoder-decoder architecture
• The 2048-dimensional image vector is fed to a Dense layer to generate an image vector of length 256
• The word vector of length 34 is fed to an LSTM/RNN, which outputs a word vector of length 256
• The decoder adds the two encoder outputs and feeds the sum to a Dense layer with 256 units
• The last Dense layer has as many nodes as the vocabulary size
• The final softmax layer predicts the next word from the output vocabulary
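
To make the dimensions concrete, here is a shape-level NumPy sketch of the merge step with random, untrained weights. Several things are assumptions rather than details from the slides: the vocabulary size of 1000, the ReLU activations, and the random stand-ins used in place of the real InceptionV3 and LSTM/RNN outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE = 1000  # hypothetical; the slides do not state the vocabulary size
HIDDEN = 256

def dense(x, w, b, activation=None):
    """Fully connected layer: x @ w + b with an optional activation."""
    y = x @ w + b
    if activation == "relu":
        return np.maximum(y, 0.0)
    if activation == "softmax":
        e = np.exp(y - y.max())
        return e / e.sum()
    return y

def init(n_in, n_out):
    """Small random weights and zero biases for one dense layer."""
    return rng.normal(0.0, 0.01, size=(n_in, n_out)), np.zeros(n_out)

w_img, b_img = init(2048, HIDDEN)
w_dec, b_dec = init(HIDDEN, HIDDEN)
w_out, b_out = init(HIDDEN, VOCAB_SIZE)

image_features = rng.normal(size=2048)   # stand-in for the InceptionV3 vector
text_features = rng.normal(size=HIDDEN)  # stand-in for the LSTM/RNN output

image_vec = dense(image_features, w_img, b_img, "relu")   # 2048 -> 256
merged = image_vec + text_features                        # element-wise add
hidden = dense(merged, w_dec, b_dec, "relu")              # 256 -> 256
next_word_probs = dense(hidden, w_out, b_out, "softmax")  # 256 -> vocabulary
```

Adding the two 256-length vectors keeps the image and text branches separate until the decoder, which is why both branches must produce outputs of the same length.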
• The caption is predicted word by word
• The image is fed along with the first word (startseq) to the RNN to predict the second word
• The same image is then fed along with the first and second words to predict the third word, and so on until the last word (endseq) is encountered

[Figure: partial captions are fed to the neural network model, which outputs the target word]
Step     Input (partial caption)    Target Word
(i=0)    startseq                   little
(i=1)    startseq little            boy
(i=2)    …..                        ….
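
The word-by-word loop can be sketched as follows. This assumes greedy decoding (taking the single predicted word at each step), which the slides do not state explicitly. The predict_next callable is a hypothetical stand-in for the trained network, and the toy version here simply replays a fixed caption.

```python
def generate_caption(predict_next, image, max_len=34):
    """Greedy decoding: append the predicted word until endseq or max_len."""
    words = ["startseq"]
    while len(words) < max_len:
        next_word = predict_next(image, words)
        if next_word == "endseq":
            break
        words.append(next_word)
    return " ".join(words[1:])  # drop the startseq marker

def toy_predict_next(image, words):
    """Stand-in for the trained model: replays a fixed caption."""
    script = ["little", "boy", "rides", "bike", "endseq"]
    return script[len(words) - 1]

caption = generate_caption(toy_predict_next, image=None)
print(caption)
```

The max_len cap guards against a model that never emits endseq; 34 matches the word-vector length given earlier in the slides.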
LSTM (Long Short Term Memory)
Bilingual Evaluation Understudy Score (BLEU)
• BLEU is a metric for evaluating a generated sentence against one or more reference sentences
• BLEU score lies between 0 and 1; higher scores indicate closer agreement with the references
Simple RNN (Recurrent Neural Network)
BLEU n-gram    Score
BLEU-1         0.572214
BLEU-2         0.339204
BLEU-3         0.237129
BLEU-4         0.116733

BLEU n-gram    Score
BLEU-1         0.364472
BLEU-2         0.181942
BLEU-3         0.103185
BLEU-4         0.085675
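
For reference, a cumulative BLEU-N score for a single sentence can be computed as below. This is a simplified sketch (uniform n-gram weights, no smoothing, sentence level), not the evaluation code behind the scores above, which were presumably computed over the whole test set with a library.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clipped n-gram precision of the candidate against all references."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())

def bleu(candidate, references, max_n=4):
    """Cumulative BLEU-N with uniform weights and a brevity penalty."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c = len(candidate)
    # reference length closest to the candidate length
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)

reference = "white fluffy dog running in the dirt".split()
predicted = "white dog runs across the sand".split()
bleu_1 = bleu(predicted, [reference], max_n=1)
bleu_2 = bleu(predicted, [reference], max_n=2)
```

Without smoothing, a candidate that shares no bigram with the reference scores 0 for BLEU-2 and above, which is one reason higher-order scores drop so quickly on short captions.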
Correct Predictions
Actual Caption: a boy with a blue helmet is riding a bike
Predicted Caption: little boy rides bike with helmet

Actual Caption: white fluffy dog running in the dirt
Predicted Caption: white dog runs across the sand

Actual Caption: a boy dribbles a basketball in the gymnasium
Predicted Caption: boy in white shirt is playing basketball
Funny Predictions ??
Actual Caption: man fly fishing in a small river with steam in the background
Predicted Caption: Man is swinging on a swing

Actual Caption: a woman wearing a black and white outfit while holding her sunglasses
Predicted Caption: man in pink dress is holding her head

Actual Caption: a group of different people are walking in all different directions in a city
Predicted Caption: group of people walking ocean
Predictions that went really wrong!
Actual Caption: A man wearing a red life jacket is holding a purple rope while waterskiing
Predicted Caption: man in white and white and white shorts leash on swing

Actual Caption: A dog is chewing on a metal pole
Predicted Caption: dog is standing in its mouth

Actual Caption: a young hockey player playing in the ice rink
Predicted Caption: chasing player in motorcycle is playing chasing
FUTURE WORK
• Predictions could be improved by training on more examples, for instance the Flickr30k dataset, which has roughly 31,000 images
• Implement visual attention techniques, which focus on the most relevant parts of the image
• Build an application for the visually impaired that converts the generated caption into voice output
Image Caption Generation using Convolutional Neural Network and LSTM
