A human-annotated video dataset for training and evaluation of 360-degree video summarization methods
Ioannis Kontostathis, Evlampios Apostolidis, Vasileios Mezaris
Information Technologies Institute
Centre for Research and Technology Hellas
VIDEO4IMX-2024 Workshop
@ ACM IMX 2024
Stockholm, Sweden, 12 June 2024
Introduction
Current status: Increasing interest in the production and distribution of
360° video content, supported by:
• Availability of advanced 360° video recording devices (GoPro, Insta360)
• Compatibility of the most popular social networks and video sharing
platforms with this type of video content
Potential use: Transforming 360° videos into concise 2D video summaries
that can be viewed on traditional devices (TV sets, smartphones) would:
• i) enable repurposing, ii) increase consumption (via additional devices),
and iii) facilitate browsing and retrieval of 360° video content
Observed need for:
• Technologies that could support the summarization of 360° videos
• Datasets for training these technologies
[Figure: a 360° video and its 2D video summary]
Related Work
Most existing datasets can be used to train networks for NFOV selection and saliency prediction, to support:
• The creation of NFOV videos from 360° videos (Pano2Vid)
• The navigation of the viewer in the content of 360° videos (Pano2Vid, Sports-360)
• The prediction of viewport-dependent 360° video saliency (PVS-HM, VR-EyeTracking)
These datasets can only partially support the training of methods for 360° video highlight detection or summarization
• Datasets suitable for training 360° video highlight detection and story-based
summarization methods: i) assume only one important activity or narrative, and ii)
are not publicly available
The 360-VSumm dataset: Video generation process
Basis: The VR-EyeTracking dataset [21], which comprises 208 dynamic HD 360° videos with a diverse
range of content (e.g., indoor/outdoor scenes, underwater activities, sports games, short films)
Processing steps:
• Used the ERP frames and their associated ground-truth saliency maps
• Ran the “2D video production” algorithm from Kontostathis et al. (2024) to produce 2D videos showing
the detected salient activities and events in the 360° videos, and to compute the frames’ saliency
• Selected a subset of 40 2D videos (a minimal sketch of this filtering step follows the citation below), with:
• Dynamic and diverse visual content
• Duration longer than 1 min.
I. Kontostathis, E. Apostolidis, V. Mezaris. 2024. An Integrated System for Spatio-temporal
Summarization of 360-Degrees Videos. MultiMedia Modeling. Springer Nature Switzerland,
Cham, 202–215.
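The selection step can be illustrated with a minimal Python sketch. It assumes the produced 2D videos are available as MP4 files and uses a hypothetical get_duration_sec() helper built on OpenCV; the “dynamic and diverse content” criterion was a manual curation judgement and is modeled here only as a set of pre-approved video IDs.

```python
# Minimal sketch of the 2D-video selection step. Assumptions: videos are
# MP4 files on disk; "dynamic and diverse content" was judged manually,
# modeled here as a pre-approved set of video IDs.
from pathlib import Path
import cv2  # OpenCV, used here only to read video metadata

def get_duration_sec(path: Path) -> float:
    """Return a video's duration in seconds (frame count / FPS)."""
    cap = cv2.VideoCapture(str(path))
    n_frames = cap.get(cv2.CAP_PROP_FRAME_COUNT)
    fps = cap.get(cv2.CAP_PROP_FPS)
    cap.release()
    return n_frames / fps if fps > 0 else 0.0

def select_videos(video_dir: Path, approved_ids: set, k: int = 40) -> list:
    """Keep up to k manually approved videos longer than 1 minute."""
    selected = [p for p in sorted(video_dir.glob("*.mp4"))
                if p.stem in approved_ids and get_duration_sec(p) > 60.0]
    return selected[:k]
```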
The 360-VSumm dataset: Video annotation process
• Performed by 15 annotators who were asked to select the most important/interesting parts of
each video, to form a summary that lasts approximately 15% of the video’s length
• Each 2D-video’s fragments (corresponding to different salient activities) were further segmented
into M 2-sec. sub-fragments
• Using a purpose-built, user-friendly interactive tool (see figure), each annotator
had to select N sub-fragments, with N = 15% of M, to form the video summary
(see the sketch after this list)
• The tool allows users to repeat the selection process and inspect the
generated summary multiple times, before settling on the most suitable summary
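A minimal sketch of the sub-fragment bookkeeping described above, assuming fragments are given as (start_frame, end_frame) index pairs and a hypothetical fixed frame rate fps; the exact segmentation code in the 360-VSumm repository may differ.

```python
# Minimal sketch of fragment segmentation and the annotator's selection
# budget. Assumptions: fragments are (start_frame, end_frame) pairs and
# the frame rate `fps` is fixed; not the repository's exact code.
def split_into_subfragments(fragments, fps=25, sub_len_sec=2.0):
    """Split each salient fragment into consecutive 2-second sub-fragments."""
    step = int(round(sub_len_sec * fps))
    subs = []
    for start, end in fragments:
        for s in range(start, end, step):
            subs.append((s, min(s + step, end)))
    return subs

def summary_budget(subfragments, ratio=0.15):
    """Number N of sub-fragments each annotator selects: 15% of M."""
    return max(1, round(ratio * len(subfragments)))
```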
The 360-VSumm dataset: Characteristics
• 40 2D-videos with dynamic and diverse visual content showing multiple events that overlap
in time or run in parallel (see figure)
• 15 ground-truth annotations (human-generated summaries) for each video; binary vectors
indicating which frames have been selected for inclusion in the summary and which have not
• A mean ground-truth summary that can be used for supervised training (the frame-level
average of the 15 ground-truth summaries; see the sketch below)
• Data about the fragments/sub-fragments of
these videos and their frames’ saliency
• Publicly available at:
https://github.com/IDT-ITI/360-VSumm
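How the mean ground-truth summary can be formed from the per-annotator binary vectors, as a minimal numpy sketch (an illustration consistent with the description above, not code taken from the 360-VSumm repository):

```python
# Minimal sketch: frame-level average of the 15 binary annotation vectors.
import numpy as np

def mean_ground_truth(annotations: np.ndarray) -> np.ndarray:
    """annotations: (15, n_frames) binary matrix, one row per annotator.
    Returns an (n_frames,) vector of frame importance scores in [0, 1]."""
    assert annotations.ndim == 2, "expected one row per annotator"
    return annotations.mean(axis=0)
```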
Experiments: Research questions and exp. settings
Research questions:
1. Can we use pre-trained models of state-of-the-art methods for conventional video summarization
to produce summaries for 360° videos?
2. Are there any performance gains after re-training these models using data from 360° videos?
3. Does it help to take the frames’ saliency into account?
Experimental settings:
• Used two state-of-the-art (SoA) methods for traditional 2D video summarization
• PGL-SUM: supervised method; combines global and local multi-head attention mechanisms
with positional encoding to model frame dependencies at various levels of granularity
• CA-SUM: unsupervised method; contains a concentrated attention mechanism and incorporates
knowledge about the uniqueness and diversity of the video frames
• Estimated the similarity between a machine-generated and a user-defined summary using the
F-Score (see the sketch below)
• Split the dataset into five different train/test splits to perform 5-fold cross-validation
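A minimal sketch of the F-Score between two frame-level binary summaries, following the standard key-fragment evaluation protocol for video summarization (an assumption about the exact formulation; the actual evaluation code lives in the 360-VSumm repository):

```python
# Minimal sketch of the F-Score between a machine-generated and a
# user-defined summary, given as frame-level binary vectors (an
# assumption; see the 360-VSumm repository for the actual code).
import numpy as np

def f_score(machine: np.ndarray, user: np.ndarray) -> float:
    """machine, user: (n_frames,) binary vectors marking selected frames."""
    overlap = float(np.sum(machine * user))
    if overlap == 0.0:
        return 0.0
    precision = overlap / float(np.sum(machine))  # matching machine picks
    recall = overlap / float(np.sum(user))        # user picks recovered
    return 2.0 * precision * recall / (precision + recall)
```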
Experiments: Quantitative results
Question: “Can we use pre-trained models of state-of-the-art methods for conventional video
summarization to produce summaries for 360° videos?”
• Measured the performance of a random summarizer and of pre-trained models of PGL-SUM and CA-SUM
(trained on the SumMe and TVSum video summarization datasets) on the test splits of 360-VSumm
• Results indicated that models trained for conventional video summarization show random-level
performance on 360° video summarization
Answer: “No, we need methods that are better tailored to the visual characteristics of 360° videos”
Experiments: Quantitative results
Question: “Are there any performance gains after re-training these models using data from 360° videos?”
Investigation using PGL-SUM:
• First sought the optimal number of local attention mechanisms; then
explored different options for the number of attention heads
• Results showed that: i) using more local attention mechanisms leads
to improved performance; and ii) using the maximum number of attention
heads per mechanism further improves the performance by 1.6%
Investigation using CA-SUM:
• Ran experiments with different regularization factors; then considered
various choices for the block size of the concentrated attention
• Results showed that setting the regularization factor to 0.7 and the
block size to 70 leads to the best performance
Answer: “Yes, using the 360-VSumm dataset we can effectively train
methods that were designed for conventional video summarization”
Experiments: Quantitative results
Question: “Does it help to take the frames’ saliency into account?”
• Evaluated variants of PGL-SUM and CA-SUM that use the frames’ saliency scores to weight
the deep representations of the visual content of the 2D-video frames (see the sketch below)
• Results showed that the use of the frames’ saliency improves the summarization performance of
both PGL-SUM and CA-SUM
Answer: “Yes, the frames’ saliency is a useful
auxiliary signal for the training process”
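A minimal sketch of the saliency-aware weighting just described: per-frame saliency scores scale the deep frame representations before they enter the summarizer (an illustration consistent with the description, not the exact implementation of the PGL-SUM / CA-SUM variants):

```python
# Minimal sketch of saliency-aware feature weighting (an assumption about
# the variants' mechanics, not their exact implementation).
import numpy as np

def saliency_weighted_features(features: np.ndarray,
                               saliency: np.ndarray) -> np.ndarray:
    """features: (n_frames, d) deep frame representations;
    saliency: (n_frames,) per-frame saliency scores, assumed in [0, 1]."""
    return features * saliency[:, None]  # broadcast over feature dims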
Experiments: Qualitative results
Frame-based overview of the presented events in the video (top part), and the produced summaries
of CA-SUM, PGL-SUM and their saliency-aware variants (bottom part). Bounding boxes of the same
color indicate activities that take place at the same time in different views of the 360° video.
Conclusions and future work
Conclusions:
• Presented the 360-VSumm dataset for training and evaluating 360° video summarization methods
• Trained two SoA methods for conventional 2D-video summarization and evaluated their
performance, to establish a baseline for future comparisons
• Considered two saliency-aware variants of these methods and documented the positive
impact of using data about the frames' saliency during the summarization process
• Developed an interactive tool for annotation purposes that can facilitate similar
annotation activities
Future work:
• Extract additional data about the frames of the produced 2D-videos (e.g. their spatial positioning
in the 360° video) and use it as extra auxiliary data for training 360° video summarization methods
References
1. E. Apostolidis, E. Adamantidou, A. I. Metsai, V. Mezaris, I. Patras. 2020. Performance over Random: A Robust Evaluation Protocol for Video
Summarization Methods. 28th ACM Int. Conf. on Multimedia (MM ’20). ACM, New York, NY, USA, 1056–1064.
2. E. Apostolidis, E. Adamantidou, A. I. Metsai, V. Mezaris, I. Patras. 2021. Video Summarization Using Deep Neural Networks: A Survey. Proc. IEEE 109,
11 (2021), 1838–1863.
3. E. Apostolidis, G. Balaouras, V. Mezaris, I. Patras. 2021. Combining Global and Local Attention with Positional Encoding for Video Summarization.
2021 IEEE Int. Symposium on Multimedia (ISM). 226–234.
4. E. Apostolidis, G. Balaouras, V. Mezaris, I. Patras. 2022. Summarizing Videos using Concentrated Attention and Considering the Uniqueness and
Diversity of the Video Frames. 2022 Int. Conf. on Multimedia Retrieval (ICMR ’22). ACM, New York, NY, USA, 407–415.
5. E. Bernal-Berdun, D. Martin, D. Gutierrez, B. Masia. 2022. SST-Sal: A spherical spatio-temporal approach for saliency prediction in 360 videos.
Computers & Graphics 106 (2022), 200–209.
6. Y. Dahou, M. Tliba, K. McGuinness, N. O’Connor. 2021. ATSal: An Attention Based Architecture for Saliency Prediction in 360 Videos. In Pattern
Recognition. ICPR Int. Workshops and Challenges. Springer International Publishing, Cham, 305–320.
7. M. Gygli, H. Grabner, H. Riemenschneider, L. Van Gool. 2014. Creating Summaries from User Videos. European Conf. on Computer Vision (ECCV)
2014. Springer International Publishing, Cham, 505–520.
8. H.-N. Hu, Y.-C. Lin, M.-Y. Liu, H.-T. Cheng, Y.-J. Chang, M. Sun. 2017. Deep 360 Pilot: Learning a Deep Agent for Piloting Through 360deg Sports
Videos. IEEE Conf. on Computer Vision and Pattern Recognition.
9. M. Hu, R. Hu, Z. Wang, Z. Xiong, R. Zhong. 2022. Spatiotemporal two-stream LSTM network for unsupervised video summarization. Multimedia
Tools and Applications 81 (2022), 40489–40510.
10. K. Kang, S. Cho. 2019. Interactive and Automatic Navigation for 360° Video Playback. ACM Trans. Graph. 38, 4, Article 108 (2019), 11 pages.
11. I. Kontostathis, E. Apostolidis, V. Mezaris. 2024. An Integrated System for Spatio-temporal Summarization of 360-Degrees Videos. MultiMedia
Modeling. Springer Nature Switzerland, Cham, 202–215.
12. S. Lee, J. Sung, Y. Yu, G. Kim. 2018. A Memory Network Approach for Story-Based Temporal Summarization of 360° Videos. IEEE Conf. on Computer
Vision and Pattern Recognition (CVPR).
References
13. G. Liang, Y. Lv, S. Li, S. Zhang, Y. Zhang. 2022. Video summarization with a convolutional attentive adversarial network. Pattern Recognition 131
(2022), 108840.
14. M. Qiao, M. Xu, Z. Wang, A. Borji. 2021. Viewport-Dependent Saliency Prediction in 360° Video. IEEE Transactions on Multimedia 23 (2021), 748–760.
15. Y. Song, J. Vallmitjana, A. Stent, A. Jaimes. 2015. TVSum: Summarizing web videos using titles. 2015 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).
16. Y.-C. Su, K. Grauman. 2017. Making 360deg Video Watchable in 2D: Learning Videography for Click Free Viewing. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).
17. Y.-C. Su, D. Jayaraman, K. Grauman. 2016. Pano2Vid: Automatic Cinematography for Watching 360 Videos. Asian Conf. on Computer Vision (ACCV).
18. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, et al. 2015. Going deeper with convolutions. 2015 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). 1–9.
19. M. Wang, Y.-J. Li, W.-X. Zhang, C. Richardt, et al. 2020. Transitioning360: Content-aware NFoV Virtual Camera Paths for 360° Video Playback. 2020 IEEE Int. Symposium on Mixed and Augmented Reality (ISMAR). 185–194.
20. M. Xu, Y. Song, J. Wang, M. Qiao, et al. 2019. Predicting Head Movement in Panoramic Video: A Deep Reinforcement Learning Approach. IEEE Trans. on Pattern Analysis and Machine Intelligence 41, 11 (2019), 2693–2708.
21. Y. Xu, Y. Dong, J. Wu, Z. Sun, Z. Shi, J. Yu, S. Gao. 2018. Gaze Prediction in Dynamic 360° Immersive Videos. 2018 IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR). 5333–5342.
22. Y. Yu, S. Lee, J. Na, J. Kang, G. Kim. 2018. A Deep Ranking Model for Spatio-Temporal Highlight Detection From a 360 Video. 2018 AAAI Conf. on Artificial Intelligence.
23. S.-S. Zang, H. Yu, Y. Song, R. Zeng. 2023. Unsupervised video summarization using deep Non-Local video summarization networks. Neurocomputing 519 (2023), 26–35.
24. Y. Zhang, Y. Liu, C. Wu. 2024. Attention-guided multigranularity fusion model for video summarization. Expert Systems with Applications 249 (2024), 123568.
25. W. Zhu, J. Lu, Y. Han, J. Zhou. 2022. Learning multiscale hierarchical attention for video summarization. Pattern Recognition 122 (2022), 108312.
Thank you for your attention!
Questions?
Vasileios Mezaris, bmezaris@iti.gr
Code and dataset publicly available at:
https://github.com/IDT-ITI/360-VSumm
(Traditional) Video summarization demos:
https://multimedia2.iti.gr/videosummarization/service/start.html (fully automatic)
https://multimedia2.iti.gr/interactivevideosumm/service/start.html (interactive)
This work was supported by the EU Horizon Europe and Horizon 2020 programmes
under grant agreements 101070109 TransMIXR and 951911 AI4Media, respectively