Enhancing Video Summarization via
Vision-Language Embedding
Bryan A. Plummer, Matthew Brown, Svetlana Lazebnik
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Background
• Summarizing video typically involves a tradeoff
between
 Segments that are interesting
 Segments that are representative of the story
• Gygli et al. proposed an optimization approach for
balancing the criteria of interestingness and
representativeness [1]
• Rich supervision in the form of freeform language → a more sophisticated model
• Two-branch neural network of [2] to learn a nonlinear
embedding using paired images and text
• Compute the similarity between two video segments
without requiring language input at test time
[1] M. Gygli, H. Grabner, and L. Van Gool. Video summarization by learning submodular mixtures of objectives. In CVPR, 2015.
[2] L. Wang, Y. Li, and S. Lazebnik. Learning deep structure-preserving image-text embeddings. In CVPR, 2016.
Can retrieve semantically consistent results
Overview
• Submod: a mixture of submodular objectives on top of vision-only features [1]
 Optimizes a linear combination of objectives desired in the output summary (a greedy-selection sketch follows this slide)
 Goal: select the best summary Y ⊂ V, where V is the video consisting of n segments
 The mixture weights are learned from pairs of videos and reference summaries
• Augment with vision-language objectives computed in the cross-modal embedding
space
[1] M. Gygli, H. Grabner, and L. Van Gool. Video summarization by learning submodular mixtures of objectives. In CVPR, 2015
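The weighted mixture is maximized greedily under a length budget. Below is a minimal Python sketch of that selection loop, intended as an illustration only: the objective functions, their learned weights, and the segment budget are placeholders, not the authors' implementation.

```python
from typing import Callable, List, Set

def greedy_summary(n_segments: int,
                   objectives: List[Callable[[Set[int]], float]],
                   weights: List[float],
                   budget: int) -> Set[int]:
    """Greedily add the segment with the largest marginal gain of the
    weighted objective mixture until the budget is reached."""
    def score(Y: Set[int]) -> float:
        return sum(w * f(Y) for w, f in zip(weights, objectives))

    selected: Set[int] = set()
    while len(selected) < budget:
        best_gain, best_seg = 0.0, None
        for s in range(n_segments):
            if s in selected:
                continue
            gain = score(selected | {s}) - score(selected)
            if gain > best_gain:
                best_gain, best_seg = gain, s
        if best_seg is None:   # no positive marginal gain left
            break
        selected.add(best_seg)
    return selected
```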
Visual Objectives
• Representativeness
 Include major events of a video
 Visual features are extracted from each segment
 K-medoids loss function is employed
 Total squared reconstruction error (see the reconstruction after this slide):
 where the feature vector 𝑓𝑖 comes from each segment of the original video and 𝑓𝑠 is the closest codebook center
 Submodular objective:
 where p′ represents a phantom exemplar
• Uniformity
 Temporal coherence

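The two equations on this slide were lost in extraction. A plausible reconstruction of the k-medoids formulation of [1], using the notation above, is:

E(Y) = \sum_{i \in V} \min_{s \in Y} \lVert f_i - f_s \rVert_2^2

R(Y) = E(\{p'\}) - E(Y \cup \{p'\})

where E(Y) is the total squared reconstruction error of the summary Y and R(Y) is the submodular representativeness objective obtained via the phantom exemplar p′.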
Visual Objectives
• Interestingness
 Some segments preferred over others in the same event
 Per-frame interestingness scores I(y) are computed for all frames in a video segment
 The objective sums I(y) over all the unique frames y in the current summary Y (see the equation after this slide)
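The interestingness equation is likewise missing; assuming the formulation of [1], it accumulates the per-frame scores over the summary:

I_{obj}(Y) = \sum_{y \in Y} I(y)

with the sum taken over all unique frames y in the current summary Y.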
Visual-Language Objectives
• Joint vision-language embedding model using the two-branch
network of Wang et al.[2]
 Visual features branch + Text features branch
 Each branch consists of two fully connected layers
• ReLU nonlinearities between them
• L2 normalization
 Network trained with
• a margin-based triplet loss combining bi-directional ranking
terms
• a neighborhood-preserving term (see the loss sketch after this slide)
 Two different embeddings are learned
• One trained on the dense text annotations that come with the video datasets
• One trained on the Flickr30k dataset (31,783 still images with 5 sentences each)
[2] L. Wang, Y. Li, and S. Lazebnik. Learning deep structure-preserving image-text embeddings. In CVPR, 2016.
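A minimal PyTorch sketch of the two-branch embedding and the bi-directional margin-based ranking loss. This is a simplification under assumptions: it uses in-batch negatives, omits the neighborhood-preserving term, and the layer sizes, margin, and 2048-d visual input dimensionality are placeholders rather than the settings of [2].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingBranch(nn.Module):
    """Two fully connected layers with a ReLU in between; output is L2-normalized."""
    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int = 512):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.fc2(F.relu(self.fc1(x)))
        return F.normalize(x, p=2, dim=-1)

def bidirectional_ranking_loss(img_emb: torch.Tensor,
                               txt_emb: torch.Tensor,
                               margin: float = 0.1) -> torch.Tensor:
    """Margin-based triplet loss in both directions, using every other item
    in the batch as a negative. Row i of img_emb matches row i of txt_emb."""
    sim = img_emb @ txt_emb.t()                      # cosine similarities (unit-norm inputs)
    pos = sim.diag().unsqueeze(1)                    # similarity of each matching pair
    mask = 1.0 - torch.eye(sim.size(0), device=sim.device)
    loss_i2t = (F.relu(margin + sim - pos) * mask).sum() / mask.sum()      # image -> text
    loss_t2i = (F.relu(margin + sim.t() - pos) * mask).sum() / mask.sum()  # text -> image
    return loss_i2t + loss_t2i

# Example with a (hypothetical) 2048-d visual branch and a 6000-d HGLMM text branch
visual_branch = EmbeddingBranch(2048, 1024)
text_branch = EmbeddingBranch(6000, 1024)
loss = bidirectional_ranking_loss(visual_branch(torch.randn(8, 2048)),
                                  text_branch(torch.randn(8, 6000)))
```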
Visual-Language Objectives
• Visual side
 ResNet features
• Text side
 6000-dimensional HGLMM features
• Output dimensionality: 512
• Compute semantic representativeness and semantic interestingness
 By mapping segment visual features into the shared semantic space (see the sketch after this slide)
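A minimal NumPy sketch of how the semantic variant of representativeness could be computed once segment features are mapped into the shared 512-d space. This is an illustration under assumptions, not the paper's exact definition; the branch weights W1, b1, W2, b2 stand in for the trained visual branch.

```python
import numpy as np

def embed_visual(feats: np.ndarray, W1, b1, W2, b2) -> np.ndarray:
    """Map raw visual features through the trained visual branch
    (fc -> ReLU -> fc -> L2 normalization)."""
    h = np.maximum(feats @ W1 + b1, 0.0)
    z = h @ W2 + b2
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def semantic_reconstruction_error(seg_emb: np.ndarray, summary_idx) -> float:
    """K-medoids-style reconstruction error measured in the embedding space:
    every segment is 'explained' by its most similar selected segment."""
    selected = seg_emb[list(summary_idx)]            # (k, 512)
    sims = seg_emb @ selected.T                      # cosine similarities (rows are unit norm)
    dists = 1.0 - sims.max(axis=1)                   # distance to the closest selected segment
    return float((dists ** 2).sum())
```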
Text-Guided Summarization
• The user can optionally supply a description of the desired summary
 Augment the objective function
• Constrained text guidance
 Each sentence maps onto a single desired segment
 Cosine similarity between 𝑔𝑠 (segment feature representation in the embedding space) and 𝑡𝑠 (sentence representation from description D)
• Unconstrained text guidance
 Input sentences and associated video segments are not assumed to be in the same order
 Solved as an assignment problem with the Hungarian algorithm (see the sketch after this slide)
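A minimal sketch of the two text-guidance scores, assuming unit-normalized segment embeddings (rows of seg_emb, i.e. 𝑔𝑠) and sentence embeddings (rows of sent_emb, i.e. 𝑡𝑠). The function names are illustrative; the unconstrained case uses SciPy's Hungarian-algorithm solver.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def constrained_text_score(seg_emb: np.ndarray, sent_emb: np.ndarray) -> float:
    """Constrained guidance: sentence i must describe selected segment i,
    so the score is the sum of per-pair cosine similarities."""
    return float((seg_emb * sent_emb).sum(axis=1).sum())

def unconstrained_text_score(seg_emb: np.ndarray, sent_emb: np.ndarray) -> float:
    """Unconstrained guidance: sentence order is unknown, so find the
    one-to-one sentence-to-segment matching with maximum total similarity
    via the Hungarian algorithm (minimize the negated similarities)."""
    sim = sent_emb @ seg_emb.T
    rows, cols = linear_sum_assignment(-sim)
    return float(sim[rows, cols].sum())
```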
Experiments
• UT Egocentric (UTE) dataset
 Consists of four wearable camera videos capturing daily activities
 3-5 hours each, 17 hours total
• TV Episodes dataset
 4 videos of 3 different TV shows, 45 min each
• Testing and evaluation
 2 min summary on UTE dataset
 4 min summary on TV Episodes dataset
 Each automatically produced summary is compared against 3 human-provided reference summaries
• Recall and F-measure are computed using the ROUGE-SU score (see the formula after this slide)
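For reference, the reported F-measure is the harmonic mean of ROUGE-SU precision P and recall R (assuming the standard β = 1 setting):

F = \frac{2 P R}{P + R}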
UTE Dataset Results
• Improvement in f-measure of 5% and recall of 4%
TV Episodes Results
• Interestingness objective not used
• Improvement in f-measure of 1.5% and recall of 3%
Text-Guided Summarization Results
• The constrained version performs better on the UTE dataset
• The two versions perform about the same on the TV Episodes dataset
• Scenes in UTE tend to change gradually, so sentence order matches segment order
• TV episodes contain repetitive scenes, where the unconstrained model would be more robust
Conclusion
• The feature representation in the embedding space has the potential to better capture
the story elements
• Summaries can be generated from freeform text input
• Limitation
 The available training and test data are limited in size
Editor's Notes

  • #3: Prior work uses notions of sparsity, graph connectedness, or statistical significance → abstract concepts expressed as mathematical terms. Here: semantic understanding of high-level categories/concepts via a joint image-text embedding.
  • #5: Using medoids (mid-points) → better clusters can be formed. DeCAF features: A Deep Convolutional Activation Feature for Generic Visual Recognition; replaced with more up-to-date deep ResNet features.
  • #7: Triplet loss: minimize the distance between the anchor input and the positive input, and maximize the distance to the negative input. Here, the distance is measured between each image feature and the text features.
  • #13: With audio, similar scenes could be recognized differently.