A human-annotated video dataset for training and evaluation of 360-degree video summarization methods
Ioannis Kontostathis, Evlampios Apostolidis, Vasileios Mezaris
Information Technologies Institute
Centre for Research and Technology Hellas
VIDEO4IMX-2024 Workshop
@ ACM IMX 2024
Stockholm, Sweden, 12 June 2024
Introduction
Current status: Increasing interest in the production and distribution of
360° video content, supported by:
• Availability of advanced 360° video recording devices (GoPro, Insta360)
• Compatibility of the most popular social networks and video sharing
platforms with this type of video content
Potential use: Transforming 360° videos into concise 2D video summaries
that can be viewed on traditional devices (TV sets, smartphones) would:
• i) enable repurposing, ii) increase consumption (via additional devices),
and iii) facilitate browsing and retrieval of 360° video content
Observed need for:
• Technologies that could support the summarization of 360° videos
• Datasets for training these technologies
[Figure: a 360° video and its 2D video summary]
Related Work
Most existing datasets can be used to train networks for NFOV selection and saliency prediction, to support:
• The creation of NFOV videos from 360° videos (Pano2Vid)
• The navigation of the viewer in the content of 360° videos (Pano2Vid, Sports-360)
• The prediction of viewport-dependent 360° video saliency (PVS-HM, VR-EyeTracking)
These datasets can only partially support the training of methods for 360° video highlight detection or summarization
• Datasets suitable for training 360° video highlight detection and story-based
summarization methods: i) assume only one important activity or narrative, and ii)
are not publicly available
The 360-VSumm dataset: Video generation process
Basis: The VR-EyeTracking dataset [21], which comprises 208 dynamic HD 360° videos with a diverse
range of content (e.g., indoor/outdoor scenes, underwater activities, sports games, short films)
Processing steps:
• Used the ERP frames and their associated ground-truth saliency maps
• Ran the “2D video production” algorithm from Kontostathis et al. (2024) to produce 2D videos showing
the detected salient activities and events in the 360° videos, and to compute the frames’ saliency
• Selected a subset of 40 2D videos (a minimal sketch of this filtering step follows the citation below), with:
• Dynamic and diverse visual content
• Duration longer than 1 min.
I. Kontostathis, E. Apostolidis, V. Mezaris. 2024. An Integrated System for Spatio-temporal
Summarization of 360-Degrees Videos. MultiMedia Modeling. Springer Nature Switzerland,
Cham, 202–215.
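The selection step can be illustrated with a minimal Python sketch. It assumes the produced 2D videos are available as MP4 files and uses a hypothetical get_duration_sec() helper built on OpenCV; the “dynamic and diverse content” criterion was a manual curation judgement and is modeled here only as a set of pre-approved video IDs.

```python
# Minimal sketch of the 2D-video selection step. Assumptions: videos are
# MP4 files on disk; "dynamic and diverse content" was judged manually,
# modeled here as a pre-approved set of video IDs.
from pathlib import Path
import cv2  # OpenCV, used here only to read video metadata

def get_duration_sec(path: Path) -> float:
    """Return a video's duration in seconds (frame count / FPS)."""
    cap = cv2.VideoCapture(str(path))
    n_frames = cap.get(cv2.CAP_PROP_FRAME_COUNT)
    fps = cap.get(cv2.CAP_PROP_FPS)
    cap.release()
    return n_frames / fps if fps > 0 else 0.0

def select_videos(video_dir: Path, approved_ids: set, k: int = 40) -> list:
    """Keep up to k manually approved videos longer than 1 minute."""
    selected = [p for p in sorted(video_dir.glob("*.mp4"))
                if p.stem in approved_ids and get_duration_sec(p) > 60.0]
    return selected[:k]
```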
The 360-VSumm dataset: Video annotation process
• Performed by 15 annotators who were asked to select the most important/interesting parts of
each video, to form a summary that lasts approximately 15% of the video’s length
• Each 2D-video’s fragments (corresponding to different salient activities) were further segmented
into M 2-sec. sub-fragments
• Using a purpose-built, user-friendly interactive tool (see figure), each annotator
had to select N sub-fragments, with N = 15% of M, to form the video summary
(see the sketch after this list)
• The tool allows users to repeat the selection process and inspect the
generated summary multiple times, before settling on the most suitable summary
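A minimal sketch of the sub-fragment bookkeeping described above, assuming fragments are given as (start_frame, end_frame) index pairs and a hypothetical fixed frame rate fps; the exact segmentation code in the 360-VSumm repository may differ.

```python
# Minimal sketch of fragment segmentation and the annotator's selection
# budget. Assumptions: fragments are (start_frame, end_frame) pairs and
# the frame rate `fps` is fixed; not the repository's exact code.
def split_into_subfragments(fragments, fps=25, sub_len_sec=2.0):
    """Split each salient fragment into consecutive 2-second sub-fragments."""
    step = int(round(sub_len_sec * fps))
    subs = []
    for start, end in fragments:
        for s in range(start, end, step):
            subs.append((s, min(s + step, end)))
    return subs

def summary_budget(subfragments, ratio=0.15):
    """Number N of sub-fragments each annotator selects: 15% of M."""
    return max(1, round(ratio * len(subfragments)))
```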
The 360-VSumm dataset: Characteristics
• 40 2D-videos with dynamic and diverse visual content showing multiple events that overlap
in time or run in parallel (see figure)
• 15 ground-truth annotations (human-generated summaries) for each video; binary vectors
indicating which frames have been selected for inclusion in the summary and which have not
• A mean ground-truth summary that can be used for supervised training (the frame-level
average of the 15 ground-truth summaries; see the sketch below)
• Data about the fragments/sub-fragments of
these videos and their frames’ saliency
• Publicly available at:
https://github.com/IDT-ITI/360-VSumm
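How the mean ground-truth summary can be formed from the per-annotator binary vectors, as a minimal numpy sketch (an illustration consistent with the description above, not code taken from the 360-VSumm repository):

```python
# Minimal sketch: frame-level average of the 15 binary annotation vectors.
import numpy as np

def mean_ground_truth(annotations: np.ndarray) -> np.ndarray:
    """annotations: (15, n_frames) binary matrix, one row per annotator.
    Returns an (n_frames,) vector of frame importance scores in [0, 1]."""
    assert annotations.ndim == 2, "expected one row per annotator"
    return annotations.mean(axis=0)
```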
Experiments: Research questions and exp. settings
Research questions:
1. Can we use pre-trained models of state-of-the-art methods for conventional video summarization
to produce summaries for 360° videos?
2. Are there any performance gains after re-training these models using data from 360° videos?
3. Does it help to take the frames’ saliency into account?
Experimental settings:
• Used two state-of-the-art (SoA) methods for traditional 2D video summarization
• PGL-SUM: supervised method; combines global and local multi-head attention mechanisms
with positional encoding to model frame dependencies at various levels of granularity
• CA-SUM: unsupervised method; contains a concentrated attention mechanism and incorporates
knowledge about the uniqueness and diversity of the video frames
• Estimated the similarity between a machine-generated and a user-defined summary using the
F-Score (see the sketch below)
• Split the dataset into five different train/test splits to perform 5-fold cross-validation
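A minimal sketch of the F-Score between two frame-level binary summaries, following the standard key-fragment evaluation protocol for video summarization (an assumption about the exact formulation; the actual evaluation code lives in the 360-VSumm repository):

```python
# Minimal sketch of the F-Score between a machine-generated and a
# user-defined summary, given as frame-level binary vectors (an
# assumption; see the 360-VSumm repository for the actual code).
import numpy as np

def f_score(machine: np.ndarray, user: np.ndarray) -> float:
    """machine, user: (n_frames,) binary vectors marking selected frames."""
    overlap = float(np.sum(machine * user))
    if overlap == 0.0:
        return 0.0
    precision = overlap / float(np.sum(machine))  # matching machine picks
    recall = overlap / float(np.sum(user))        # user picks recovered
    return 2.0 * precision * recall / (precision + recall)
```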
Experiments: Quantitative results
Question: “Can we use pre-trained models of state-of-the-art methods for conventional video
summarization to produce summaries for 360° videos?”
• Measured the performance of a random summarizer and of pre-trained models of PGL-SUM and CA-SUM
(trained on the SumMe and TVSum video summarization datasets) on the test splits of 360-VSumm
• Results indicated that models trained for conventional video summarization show random-level
performance on 360° video summarization
Answer: “No, we need methods that are better tailored to the visual characteristics of 360° videos”
Experiments: Quantitative results
Question: “Are there any performance gains after re-training these models using data from 360° videos?”
Investigation using PGL-SUM:
• First sought the optimal number of local attention mechanisms; then
explored different options for the number of attention heads
• Results showed that: i) using more local attention mechanisms leads
to improved performance; and ii) using the maximum number of attention
heads per mechanism further improves the performance by 1.6%
Investigation using CA-SUM:
• Ran experiments with different regularization factors; then considered
various choices for the block size of the concentrated attention
• Results showed that setting the regularization factor to 0.7 and the
block size to 70 leads to the best performance
Answer: “Yes, using the 360-VSumm dataset we can effectively train
methods that were designed for conventional video summarization”
Experiments: Quantitative results
Question: “Does it help to take the frames’ saliency into account?”
• Evaluated variants of PGL-SUM and CA-SUM that use the frames’ saliency scores to weight
the deep representations of the visual content of the 2D-video frames (see the sketch below)
• Results showed that the use of the frames’ saliency improves the summarization performance of
both PGL-SUM and CA-SUM
Answer: “Yes, the frames’ saliency is a useful
auxiliary signal for the training process”
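A minimal sketch of the saliency-aware weighting just described: per-frame saliency scores scale the deep frame representations before they enter the summarizer (an illustration consistent with the description, not the exact implementation of the PGL-SUM / CA-SUM variants):

```python
# Minimal sketch of saliency-aware feature weighting (an assumption about
# the variants' mechanics, not their exact implementation).
import numpy as np

def saliency_weighted_features(features: np.ndarray,
                               saliency: np.ndarray) -> np.ndarray:
    """features: (n_frames, d) deep frame representations;
    saliency: (n_frames,) per-frame saliency scores, assumed in [0, 1]."""
    return features * saliency[:, None]  # broadcast over feature dims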
Experiments: Qualitative results
Frame-based overview of the presented events in the video (top part), and the produced summaries
of CA-SUM, PGL-SUM and their saliency-aware variants (bottom part). Bounding boxes of the same
color indicate activities that take place at the same time in different views of the 360° video.
Conclusions and future work
Conclusions:
• Presented the 360-VSumm dataset for training and evaluating 360° video summarization methods
• Trained two SoA methods for conventional 2D-video summarization and evaluated their
performance, to establish a baseline for future comparisons
• Considered two saliency-aware variants of these methods and documented the positive
impact of using data about the frames' saliency during the summarization process
• Developed an interactive tool for annotation purposes that can facilitate similar
annotation activities
Future work:
• Extract additional data about the frames of the produced 2D-videos (e.g. their spatial positioning
in the 360° video) and use it as extra auxiliary data for training 360° video summarization methods
References
1. E. Apostolidis, E. Adamantidou, A. I. Metsai, V. Mezaris, I. Patras. 2020. Performance over Random: A Robust Evaluation Protocol for Video
Summarization Methods. 28th ACM Int. Conf. on Multimedia (MM ’20). ACM, New York, NY, USA, 1056–1064.
2. E. Apostolidis, E. Adamantidou, A. I. Metsai, V. Mezaris, I. Patras. 2021. Video Summarization Using Deep Neural Networks: A Survey. Proc. IEEE 109,
11 (2021), 1838–1863.
3. E. Apostolidis, G. Balaouras, V. Mezaris, I. Patras. 2021. Combining Global and Local Attention with Positional Encoding for Video Summarization.
2021 IEEE Int. Symposium on Multimedia (ISM). 226–234.
4. E. Apostolidis, G. Balaouras, V. Mezaris, I. Patras. 2022. Summarizing Videos using Concentrated Attention and Considering the Uniqueness and
Diversity of the Video Frames. 2022 Int. Conf. on Multimedia Retrieval (ICMR ’22). ACM, New York, NY, USA, 407–415.
5. E. Bernal-Berdun, D. Martin, D. Gutierrez, B. Masia. 2022. SST-Sal: A spherical spatio-temporal approach for saliency prediction in 360 videos.
Computers & Graphics 106 (2022), 200–209.
6. Y. Dahou, M. Tliba, K. McGuinness, N. O’Connor. 2021. ATSal: An Attention Based Architecture for Saliency Prediction in 360 Videos. In Pattern
Recognition. ICPR Int. Workshops and Challenges. Springer International Publishing, Cham, 305–320.
7. M. Gygli, H. Grabner, H. Riemenschneider, L. Van Gool. 2014. Creating Summaries from User Videos. European Conf. on Computer Vision (ECCV)
2014. Springer International Publishing, Cham, 505–520.
8. H.-N. Hu, Y.-C. Lin, M.-Y. Liu, H.-T. Cheng, Y.-J. Chang, M. Sun. 2017. Deep 360 Pilot: Learning a Deep Agent for Piloting Through 360deg Sports
Videos. IEEE Conf. on Computer Vision and Pattern Recognition.
9. M. Hu, R. Hu, Z. Wang, Z. Xiong, R. Zhong. 2022. Spatiotemporal two-stream LSTM network for unsupervised video summarization. Multimedia
Tools and Applications 81 (2022), 40489–40510.
10. K. Kang, S. Cho. 2019. Interactive and Automatic Navigation for 360° Video Playback. ACM Trans. Graph. 38, 4, Article 108 (2019), 11 pages.
11. I. Kontostathis, E. Apostolidis, V. Mezaris. 2024. An Integrated System for Spatio-temporal Summarization of 360-Degrees Videos. MultiMedia
Modeling. Springer Nature Switzerland, Cham, 202–215.
12. S. Lee, J. Sung, Y. Yu, G. Kim. 2018. A Memory Network Approach for Story-Based Temporal Summarization of 360° Videos. IEEE Conf. on Computer
Vision and Pattern Recognition (CVPR).
References
13. G. Liang, Y. Lv, S. Li, S. Zhang, Y. Zhang. 2022. Video summarization with a convolutional attentive adversarial network. Pattern Recognition 131
(2022), 108840.
14. M. Qiao, M. Xu, Z. Wang, A. Borji. 2021. Viewport-Dependent Saliency Prediction in 360° Video. IEEE Transactions on Multimedia 23 (2021), 748–760.
15. Y. Song, J. Vallmitjana, A. Stent, A. Jaimes. 2015. TVSum: Summarizing web videos using titles. 2015 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).
16. Y.-C. Su, K. Grauman. 2017. Making 360deg Video Watchable in 2D: Learning Videography for Click Free Viewing. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).
17. Y.-C. Su, D. Jayaraman, K. Grauman. 2016. Pano2Vid: Automatic Cinematography for Watching 360 Videos. Asian Conf. on Computer Vision (ACCV).
18. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, et al. 2015. Going deeper with convolutions. 2015 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). 1–9.
19. M. Wang, Y.-J. Li, W.-X. Zhang, C. Richardt, et al. 2020. Transitioning360: Content-aware NFoV Virtual Camera Paths for 360° Video Playback. 2020 IEEE Int. Symposium on Mixed and Augmented Reality (ISMAR). 185–194.
20. M. Xu, Y. Song, J. Wang, M. Qiao, et al. 2019. Predicting Head Movement in Panoramic Video: A Deep Reinforcement Learning Approach. IEEE Trans. on Pattern Analysis and Machine Intelligence 41, 11 (2019), 2693–2708.
21. Y. Xu, Y. Dong, J. Wu, Z. Sun, Z. Shi, J. Yu, S. Gao. 2018. Gaze Prediction in Dynamic 360° Immersive Videos. 2018 IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR). 5333–5342.
22. Y. Yu, S. Lee, J. Na, J. Kang, G. Kim. 2018. A Deep Ranking Model for Spatio-Temporal Highlight Detection From a 360 Video. 2018 AAAI Conf. on Artificial Intelligence.
23. S.-S. Zang, H. Yu, Y. Song, R. Zeng. 2023. Unsupervised video summarization using deep Non-Local video summarization networks. Neurocomputing 519 (2023), 26–35.
24. Y. Zhang, Y. Liu, C. Wu. 2024. Attention-guided multigranularity fusion model for video summarization. Expert Systems with Applications 249 (2024), 123568.
25. W. Zhu, J. Lu, Y. Han, J. Zhou. 2022. Learning multiscale hierarchical attention for video summarization. Pattern Recognition 122 (2022), 108312.
Thank you for your attention!
Questions?
Vasileios Mezaris, bmezaris@iti.gr
Code and dataset publicly available at:
https://github.com/IDT-ITI/360-VSumm
(Traditional) Video summarization demos:
https://multimedia2.iti.gr/videosummarization/service/start.html (fully automatic)
https://multimedia2.iti.gr/interactivevideosumm/service/start.html (interactive)
This work was supported by the EU Horizon Europe and Horizon 2020 programmes
under grant agreements 101070109 TransMIXR and 951911 AI4Media, respectively