Semantic Summarization of Videos
Team
• Mohammed Kaif Shaikh
• Akshat Jain
• Pooja Patil
• Darsh Jain
This paper proposes a Discriminative Latent Semantic Graph
(D-LSG) framework to generate natural language captions
that can summarize the visual contents in long videos. The
model has three main components:
• A conditional graph is used to enhance object proposals
by fusing contextual information from the video frames
• A dynamic graph aggregates the enhanced proposals into
compact visual words with higher semantic meaning
• A discriminative module validates the generated captions by reconstructing visual words and scoring them against the original visual words to ensure fidelity and relevance.
The model can effectively leverage complex object interactions, extract salient visual concepts from videos, and generate content-relevant captions.
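To make these three components concrete, below is a minimal PyTorch-style sketch of how a conditional graph, a dynamic graph, and a discriminative module could fit together. The module names, attention-based graph operations, and dimensions are illustrative assumptions for this presentation, not the authors' actual implementation.

```python
# Illustrative sketch of the three D-LSG components described above.
# All names, dimensions, and attention-based graph operations are assumptions.
import torch
import torch.nn as nn


class ConditionalGraph(nn.Module):
    """Enhance object proposals with frame-level context (sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, proposals, frame_feats):
        # proposals: (B, N, D) object features; frame_feats: (B, T, D) frame context
        enhanced, _ = self.attn(proposals, frame_feats, frame_feats)
        return proposals + enhanced


class DynamicGraph(nn.Module):
    """Aggregate enhanced proposals into a few compact 'visual words' (sketch)."""
    def __init__(self, dim, num_words=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_words, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, proposals):
        q = self.queries.unsqueeze(0).expand(proposals.size(0), -1, -1)
        words, _ = self.attn(q, proposals, proposals)
        return words  # (B, num_words, D)


class Discriminator(nn.Module):
    """Score reconstructed visual words against the originals (sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, original_words, reconstructed_words):
        pair = torch.cat([original_words, reconstructed_words], dim=-1)
        return torch.sigmoid(self.score(pair)).squeeze(-1).mean(dim=1)  # (B,) fidelity


if __name__ == "__main__":
    B, N, T, D = 2, 20, 16, 256
    proposals, frames = torch.randn(B, N, D), torch.randn(B, T, D)
    enhanced = ConditionalGraph(D)(proposals, frames)
    words = DynamicGraph(D)(enhanced)
    score = Discriminator(D)(words, words + 0.01 * torch.randn_like(words))
    print(words.shape, score.shape)  # torch.Size([2, 8, 256]) torch.Size([2])
```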
Video captioning aims to use natural language descriptions
to summarize the visual contents in video data. This is a
challenging task as it requires:
• Modeling complex dependencies between objects and
their interactions
• Extracting high-level visual concepts from
spatio-temporal video data
• Generating captions that accurately reflect visual
content and are semantically rich
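As a point of reference, a generic encoder-decoder video captioner can be sketched as follows: mean-pooled CNN frame features feed an LSTM decoder that greedily emits one word at a time. This is a simplified baseline sketch, not the D-LSG model, and all sizes and token indices are placeholders.

```python
# Generic baseline captioner: pooled frame features + LSTM decoder (sketch only).
import torch
import torch.nn as nn


class TinyCaptioner(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, vocab_size=1000):
        super().__init__()
        self.encode = nn.Linear(feat_dim, hidden)   # project pooled video features
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lstm = nn.LSTMCell(hidden, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    @torch.no_grad()
    def greedy_caption(self, frame_feats, bos_idx=1, eos_idx=2, max_len=20):
        # frame_feats: (T, feat_dim) CNN features for one video
        h = torch.tanh(self.encode(frame_feats.mean(dim=0, keepdim=True)))  # (1, hidden)
        c = torch.zeros_like(h)
        token = torch.tensor([bos_idx])
        words = []
        for _ in range(max_len):
            h, c = self.lstm(self.embed(token), (h, c))
            token = self.out(h).argmax(dim=-1)
            if token.item() == eos_idx:
                break
            words.append(token.item())
        return words


print(TinyCaptioner().greedy_caption(torch.randn(30, 512)))  # list of token ids
```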
Literature Review (each entry lists Title, Year, Authors, Methodology, Feature Extraction Technique, Classifier, Accuracy, Issues, and Research Gap):

2. Video Joint Modelling Based on Hierarchical Transformer for Co-summarization (2022)
Authors: Haopeng Li, Qiuhong Ke, Mingming Gong, and Rui Zhang
Methodology: ML
Feature extraction technique: GoogLeNet (CNN); Video Joint Modelling based on Hierarchical Transformer (VJMHT)
Classifier: Transformer-based models (F-Transformer and S-Transformer) for modeling intra-shot and inter-shot dependencies in the video summarization process
Accuracy: 80%
Issues: Low F-measure or rank correlation; long training and inference times
Research gap: Need for generalization across diverse datasets and for handling long videos and temporal context

3. Video summarization using deep learning techniques: a detailed analysis and investigation (2023)
Authors: Parul Saini, Krishan Kumar, Shamal Kashid, Ashray Saini, Alok Negi
Methodology: Deep Learning
Feature extraction technique: 3D CNN, FCNN, DG-CNN
Classifier: K-Nearest Neighbors (K-NN), Deep Belief Network (DBN)
Accuracy: 88%
Issues: Some GAN-based models may produce very short summaries that lack important details, making it challenging to strike the right balance
Research gap: Additional effort should go into video summarization algorithms that optimize summaries for the intended audience

4. Semantic Text Summarization of Long Videos (2017)
Authors: Shagan Sah, Sourabh Kulhare, Allison Gray, Subhashini Venugopalan, Emily Prud’hommeaux, Raymond Ptucha
Methodology: Deep Learning, Neural Network
Feature extraction technique: 3D CNN
Classifier: Deep visual-captioning techniques for feature extraction in video summarization
Accuracy: 70%
Issues: Annotated ground-truth data for semantic video summarization may be limited or expensive to obtain, hindering supervised training of CNN-based models
Research gap: Enhancing CNNs' semantic understanding of video content is essential; explore techniques that allow CNNs to recognize actions and relationships within video frames

5. Towards Diverse Paragraph Captioning for Untrimmed Videos (2021)
Authors: Yuqing Song, Shizhe Chen, Qin Jin (Renmin University of China, INRIA)
Methodology: Machine Learning, Reinforcement Learning
Feature extraction technique: ResNet, VGG
Classifier: MFT, Vtransformer, AdvInf, MART
Accuracy: 79%
Issues: The vanilla encoder adds a computation burden for long paragraph generation; both MLE and RL training make the model generate high-frequency words and phrases
Research gap: Scalability to longer videos; training with limited data; evaluation metrics beyond the state of the art

6. A Comprehensive Review of the Video-to-Text Problem (2021)
Authors: Jesus Perez-Martin, Benjamin Bustos, Silvio Jamil F. Guimaraes, Ivan Sipiran, Jorge Pérez, Grethel Coello
Methodology: ML, DL
Feature extraction technique: 2D CNN, NLG (Natural Language Generator), RNN
Classifier: AlexNet, ImageNet, ILSVRC
Accuracy: 71%
Issues: An essential issue for exact and precise video description generation is the selection of the most informative frames
Research gap: Model adaptation (fine-tuning a pre-trained AlexNet on ImageNet may not always lead to optimal results for specific tasks); content variability; comparative analysis; limited mention of multimodal integration (researchers can choose or create datasets that inherently require integrating both visual and textual information)

1. Discriminative Latent Semantic Graph for Video Captioning (2021)
Authors: Yang Bai, Junyan Wang, Yang Long, Bingzhang Hu, Yang Song, Maurice Pagnucco, Yu Guan
Methodology: Sequence-to-Sequence Model, Deep Learning
Feature extraction technique: 2D CNNs, Faster R-CNN, LSTM models
Classifier: Language LSTM, Multimodal bilinear pooling
Accuracy: 70%
7. Real Time Video to Text Summarization using Neural Network (2020)
Authors: Abhishek Yadav, Anjali Vishwakarma, Shyama Panickar, and Prof. Satish Kuchiwale
Methodology: Deep Learning
Feature extraction technique: Convolutional Neural Network
Classifier: RNN, SoftMax layer
Accuracy: 75%
Issues: Training RNNs for video summarization can suffer from the vanishing gradient problem, where gradients become too small; this can impact training stability and convergence
Research gap: Research should aim to develop effective regularization techniques and architectural innovations to mitigate overfitting in RNN-based video summarization models

8. Video Summarization by Learning Deep Side Semantic Embedding (2019)
Authors: Yitian Yuan, Tao Mei (Senior Member, IEEE), Peng Cui, and Wenwu Zhu
Methodology: Deep Learning
Feature extraction technique: 3D-CNN
Classifier: DSSE model
Accuracy: 80%
Issues: Effectively measuring the semantic relevance between video frames and query information
Research gap: A Deep Side Semantic Embedding (DSSE) model to address these issues by leveraging side information to select semantically meaningful segments from videos

9. Spatiotemporal Modeling for Video Summarization Using Convolutional Recurrent Neural Network (2019)
Authors: Yuan Yuan, Haopeng Li, Qi Wang
Methodology: Deep Learning
Feature extraction technique: 2D-CNN, DCNNs
Classifier: AlexNet and GoogLeNet
Accuracy: 85%
Issues: The increasing amount of video data, the difficulty of retrieving valuable information conveyed by videos, and the extremely heavy burden of data storage
Research gap: Improving computational efficiency; further research into enabling real-time summarization, especially for applications that require rapid results

10. Text Semantics Based Automatic Summarization for Chinese Videos (2015)
Authors: WANG Xingqi, ZHA Taotao, WU Chunming, FANG Jinglong
Methodology: ML
Feature extraction technique: HLAC, HOG
Classifier: Ant Colony
Accuracy: -
Issues: Broad range of content
Research gap: There had been no attempt at text-semantics-based video summarization prior to their proposed method

11. Video and Text Summarization Using VDAN and RNN (2021)
Authors: Joys Princia A, Ms. J Sangeetha Priya, Kalai Selvi J, Rithi Afra J
Methodology: Deep Learning and Neural Network
Feature extraction technique: VDAN
Classifier: Random Forest
Accuracy: -
Issues: Visual gaps and breaks between frames
Research gap: Short-term dependencies of simple RNNs
The key problems this model aims to address:
• Current video captioning models cannot effectively leverage complex object-level
interactions and relationships in the video data.
• They fail to extract high-level visual concepts that capture salient information from
spatio-temporal video data.
• Existing models struggle to validate the fidelity and relevance of generated captions to
the source video's visual content.
1. Literature Review
• Survey prior work in video captioning and summarization
• Understand limitations of existing methods
• Identify opportunities for improvement
2. Problem Definition
• Clearly define the problem to be solved
• Set project objectives and scope
3. Data Collection
• Gather relevant video datasets for training and testing
• Ensure diversity of video content.
4. Model Development
• Implement base encoder-decoder architecture
• Incorporate conditional graph for enhancing object proposals
• Develop dynamic graph for latent proposal aggregation
5. Training and Optimization
• Prepare training data and protocols
• Train the model end-to-end with suitable loss functions (an illustrative training step is sketched below)
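For step 5, the sketch below shows one possible end-to-end training step that combines a caption cross-entropy loss with a discriminative fidelity term. The weighting, the assumed model outputs, and the helper names are illustrative, not the paper's exact objective.

```python
# Hypothetical end-to-end training step: caption loss + fidelity term (sketch).
import torch.nn.functional as F

def training_step(model, discriminator, optimizer, video_feats, captions, pad_idx=0, lam=0.1):
    """One illustrative update; `model` and `discriminator` interfaces are assumptions."""
    optimizer.zero_grad()
    # Assumed interface: the model returns per-token logits plus the original and
    # reconstructed visual words that the discriminative module compares.
    logits, visual_words, recon_words = model(video_feats, captions[:, :-1])
    caption_loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # (B*L, vocab)
        captions[:, 1:].reshape(-1),           # shifted ground-truth tokens
        ignore_index=pad_idx,
    )
    fidelity = discriminator(visual_words, recon_words)   # (B,) scores in [0, 1]
    loss = caption_loss + lam * (1.0 - fidelity).mean()   # reward faithful reconstruction
    loss.backward()
    optimizer.step()
    return loss.item()
```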
The following are possible future scopes and applications of the introduced model:
1. Video Search and Retrieval
• Generate textual captions to index video content
• Enable text-based semantic search of a video database (a small retrieval sketch follows after this list)
2. Video Highlight Detection
• Identify key moments and events in long videos
• Generate concise summaries for skimming videos
3. Law Enforcement:
• Scan and index video evidence from body-worn cameras
• Surface video segments containing threats, violations etc.
• Assist investigators in reviewing large volumes of footage
4. Multi-lingual Subtitling
• The generated text can be translated to create multi-lingual
subtitles and aid localization of video content.
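As an illustration of application 1, generated captions can be indexed and searched with simple TF-IDF cosine similarity. The captions below are made up, and the scikit-learn retrieval is just one way to realize text-based semantic search.

```python
# Toy caption index + text query over it; captions and file names are fabricated examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

video_captions = {
    "clip_001.mp4": "a man is cooking pasta in a kitchen",
    "clip_002.mp4": "two dogs are playing in the park",
    "clip_003.mp4": "a woman is explaining a chart in a meeting",
}

ids = list(video_captions)
vectorizer = TfidfVectorizer()
index = vectorizer.fit_transform(video_captions.values())

def search(query, top_k=2):
    # Rank videos by cosine similarity between the query and their captions.
    scores = cosine_similarity(vectorizer.transform([query]), index)[0]
    ranked = sorted(zip(ids, scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]

print(search("dog playing outside"))  # clip_002.mp4 should rank first
```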