Enhancing Video Summarization via
Vision-Language Embedding
Bryan A. Plummer, Matthew Brown, Svetlana Lazebnik
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Background
• Summarizing video typically involves a tradeoff
between
 Segments that are interesting
 Segments that are representative of the story
• Gygli et al. proposed an optimization approach for
balancing the criteria of interestingness and
representativeness [1]
• Rich supervision in the form of freeform language → a more sophisticated model
• Two-branch neural network of [2] to learn a nonlinear
embedding using paired images and text
• Compute the similarity between two video segments
without requiring language input at test time
[1] M. Gygli, H. Grabner, and L. Van Gool. Video summarization by learning submodular mixtures of objectives. In CVPR, 2015.
[2] L. Wang, Y. Li, and S. Lazebnik. Learning deep structure-preserving image-text embeddings. In CVPR, 2016.
Can retrieve semantically consistent results
Overview
• Submod: a mixture of submodular objectives on top of vision-only features [1]
 Optimizes a linear combination of objectives desired in the output summary (a greedy-selection sketch follows this slide)
 Goal: select the best summary Y ⊂ V, where V is the video consisting of n segments
 The mixture weights are learned from pairs of videos and reference summaries
• Augment with vision-language objectives computed in the cross-modal embedding
space
[1] M. Gygli, H. Grabner, and L. Van Gool. Video summarization by learning submodular mixtures of objectives. In CVPR, 2015
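The weighted mixture is maximized greedily under a length budget. Below is a minimal Python sketch of that selection loop, intended as an illustration only: the objective functions, their learned weights, and the segment budget are placeholders, not the authors' implementation.

```python
from typing import Callable, List, Set

def greedy_summary(n_segments: int,
                   objectives: List[Callable[[Set[int]], float]],
                   weights: List[float],
                   budget: int) -> Set[int]:
    """Greedily add the segment with the largest marginal gain of the
    weighted objective mixture until the budget is reached."""
    def score(Y: Set[int]) -> float:
        return sum(w * f(Y) for w, f in zip(weights, objectives))

    selected: Set[int] = set()
    while len(selected) < budget:
        best_gain, best_seg = 0.0, None
        for s in range(n_segments):
            if s in selected:
                continue
            gain = score(selected | {s}) - score(selected)
            if gain > best_gain:
                best_gain, best_seg = gain, s
        if best_seg is None:   # no positive marginal gain left
            break
        selected.add(best_seg)
    return selected
```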
Visual Objectives
• Representativeness
 Include major events of a video
 Visual features are extracted from each segment
 K-medoids loss function is employed
 Total squared reconstruction error (see the reconstruction after this slide):
 where the feature vector 𝑓𝑖 comes from each segment of the original video and 𝑓𝑠 is the closest codebook center
 Submodular objective:
 where p′ represents a phantom exemplar
• Uniformity
 Temporal coherence

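The two equations on this slide were lost in extraction. A plausible reconstruction of the k-medoids formulation of [1], using the notation above, is:

E(Y) = \sum_{i \in V} \min_{s \in Y} \lVert f_i - f_s \rVert_2^2

R(Y) = E(\{p'\}) - E(Y \cup \{p'\})

where E(Y) is the total squared reconstruction error of the summary Y and R(Y) is the submodular representativeness objective obtained via the phantom exemplar p′.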
Visual Objectives
• Interestingness
 Some segments preferred over others in the same event
 Per-frame interestingness scores I(y) are computed for all frames in a video segment
 The objective sums I(y) over all the unique frames y in the current summary Y (see the equation after this slide)
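The interestingness equation is likewise missing; assuming the formulation of [1], it accumulates the per-frame scores over the summary:

I_{obj}(Y) = \sum_{y \in Y} I(y)

with the sum taken over all unique frames y in the current summary Y.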
Visual-Language Objectives
• Joint vision-language embedding model using the two-branch
network of Wang et al.[2]
 Visual features branch + Text features branch
 Each branch consists of two fully connected layers
• ReLU nonlinearities between them
• L2 normalization
 Network trained with
• a margin-based triplet loss combining bi-directional ranking
terms
• a neighborhood-preserving term (see the loss sketch after this slide)
 Two different embeddings are learned
• One trained on the dense text annotations that come with the video datasets
• One trained on the Flickr30k dataset (31,783 still images with 5 sentences each)
[2] L. Wang, Y. Li, and S. Lazebnik. Learning deep structure-preserving image-text embeddings. In CVPR, 2016.
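A minimal PyTorch sketch of the two-branch embedding and the bi-directional margin-based ranking loss. This is a simplification under assumptions: it uses in-batch negatives, omits the neighborhood-preserving term, and the layer sizes, margin, and 2048-d visual input dimensionality are placeholders rather than the settings of [2].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingBranch(nn.Module):
    """Two fully connected layers with a ReLU in between; output is L2-normalized."""
    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int = 512):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.fc2(F.relu(self.fc1(x)))
        return F.normalize(x, p=2, dim=-1)

def bidirectional_ranking_loss(img_emb: torch.Tensor,
                               txt_emb: torch.Tensor,
                               margin: float = 0.1) -> torch.Tensor:
    """Margin-based triplet loss in both directions, using every other item
    in the batch as a negative. Row i of img_emb matches row i of txt_emb."""
    sim = img_emb @ txt_emb.t()                      # cosine similarities (unit-norm inputs)
    pos = sim.diag().unsqueeze(1)                    # similarity of each matching pair
    mask = 1.0 - torch.eye(sim.size(0), device=sim.device)
    loss_i2t = (F.relu(margin + sim - pos) * mask).sum() / mask.sum()      # image -> text
    loss_t2i = (F.relu(margin + sim.t() - pos) * mask).sum() / mask.sum()  # text -> image
    return loss_i2t + loss_t2i

# Example with a (hypothetical) 2048-d visual branch and a 6000-d HGLMM text branch
visual_branch = EmbeddingBranch(2048, 1024)
text_branch = EmbeddingBranch(6000, 1024)
loss = bidirectional_ranking_loss(visual_branch(torch.randn(8, 2048)),
                                  text_branch(torch.randn(8, 6000)))
```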
Visual-Language Objectives
• Visual side
 ResNet features
• Text side
 6000-dimensional HGLMM features
• Output dimensionality: 512
• Compute semantic representativeness and semantic interestingness
 By mapping segment visual features into the shared semantic space (see the sketch after this slide)
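A minimal NumPy sketch of how the semantic variant of representativeness could be computed once segment features are mapped into the shared 512-d space. This is an illustration under assumptions, not the paper's exact definition; the branch weights W1, b1, W2, b2 stand in for the trained visual branch.

```python
import numpy as np

def embed_visual(feats: np.ndarray, W1, b1, W2, b2) -> np.ndarray:
    """Map raw visual features through the trained visual branch
    (fc -> ReLU -> fc -> L2 normalization)."""
    h = np.maximum(feats @ W1 + b1, 0.0)
    z = h @ W2 + b2
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def semantic_reconstruction_error(seg_emb: np.ndarray, summary_idx) -> float:
    """K-medoids-style reconstruction error measured in the embedding space:
    every segment is 'explained' by its most similar selected segment."""
    selected = seg_emb[list(summary_idx)]            # (k, 512)
    sims = seg_emb @ selected.T                      # cosine similarities (rows are unit norm)
    dists = 1.0 - sims.max(axis=1)                   # distance to the closest selected segment
    return float((dists ** 2).sum())
```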
Text-Guided Summarization
• The user can optionally supply a description of the desired summary
 Augment the objective function
• Constrained text guidance
 Each sentence maps onto a single desired segment
 Cosine similarity between 𝑔𝑠 (segment feature representation in the embedding space) and 𝑡𝑠 (sentence representation from description D)
• Unconstrained text guidance
 Input sentences and associated video segments are not assumed to be in the same order
 Solved as an assignment problem with the Hungarian algorithm (see the sketch after this slide)
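A minimal sketch of the two text-guidance scores, assuming unit-normalized segment embeddings (rows of seg_emb, i.e. 𝑔𝑠) and sentence embeddings (rows of sent_emb, i.e. 𝑡𝑠). The function names are illustrative; the unconstrained case uses SciPy's Hungarian-algorithm solver.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def constrained_text_score(seg_emb: np.ndarray, sent_emb: np.ndarray) -> float:
    """Constrained guidance: sentence i must describe selected segment i,
    so the score is the sum of per-pair cosine similarities."""
    return float((seg_emb * sent_emb).sum(axis=1).sum())

def unconstrained_text_score(seg_emb: np.ndarray, sent_emb: np.ndarray) -> float:
    """Unconstrained guidance: sentence order is unknown, so find the
    one-to-one sentence-to-segment matching with maximum total similarity
    via the Hungarian algorithm (minimize the negated similarities)."""
    sim = sent_emb @ seg_emb.T
    rows, cols = linear_sum_assignment(-sim)
    return float(sim[rows, cols].sum())
```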
Experiments
• UT Egocentric (UTE) dataset
 Consists of four wearable camera videos capturing daily activities
 3-5 hours each, 17 hours total
• TV Episodes dataset
 4 videos of 3 different TV shows, 45 min each
• Testing and evaluation
 2 min summary on UTE dataset
 4 min summary on TV Episodes dataset
 Each automatically produced summary is compared against 3 human-provided reference summaries
• Recall and F-measure are computed using the ROUGE-SU score (see the formula after this slide)
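For reference, the reported F-measure is the harmonic mean of ROUGE-SU precision P and recall R (assuming the standard β = 1 setting):

F = \frac{2 P R}{P + R}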
UTE Dataset Results
• Improvement in f-measure of 5% and recall of 4%
TV Episodes Results
• Interestingness objective not used
• Improvement in f-measure of 1.5% and recall of 3%
Text-Guided Summarization Results
• The constrained version performs better on the UTE dataset
• The two versions perform about the same on the TV Episodes dataset
• Scenes in UTE tend to change gradually, so sentence order matches segment order
• TV episodes contain repetitive scenes, where the unconstrained model would be more robust
Conclusion
• The feature representation in the embedding space has the potential to better capture
the story elements
• Summaries can be generated from freeform text input
• Limitation
 The available training and test data are limited in size
Editor's Notes

  • #3: Prior work uses notions of sparsity, graph connectedness, or statistical significance → abstract concepts expressed as mathematical terms. Here: semantic understanding of high-level categories/concepts via a joint image-text embedding.
  • #5: Using medoids (mid-points) → better clusters can be formed. DeCAF features: A Deep Convolutional Activation Feature for Generic Visual Recognition; replaced with more up-to-date deep ResNet features.
  • #7: Triplet loss: minimize the distance between the anchor input and the positive input, and maximize the distance to the negative input. Here, the distance is measured between each image feature and the text features.
  • #13: With audio, similar scenes could be recognized differently.