Visual and audio scene classification
for detecting discrepancies in video:
a baseline method and experimental protocol
Konstantinos Apostolidis¹, Jakob Abeßer², Luca Cuccovillo², Vasileios Mezaris¹
¹ Information Technologies Institute, CERTH
² Fraunhofer Institute for Digital Media Technology
ACM ICMR 2024, MAD workshop
Phuket, Thailand, June 10-13
Introduction
● Digital Disinformation: fabricated news, tampered images, manipulated videos
● Verifying the integrity of media is crucial for journalists, security experts, and emergency management
● AI-Based Tools can support this task, offering scalability and efficiency
Motivation
[Figure: two voice clips (#1 and #2) joined at a processing point, with an ambient soundscape added to mask the edit]
Motivation
● Audio Editing: Use of soundscapes to enhance immersion OR mask edits
● Malicious Manipulations Scenario: Lack of authentic soundscapes leading to
audio-visual inconsistencies
● Need for methods to detect subtle discrepancies in audio-visual content
Proposed Method Overview
[Pipeline figure, visual and audio streams over the TAU and VADD datasets:
1. Training of a robust audio-visual scene classifier
2. Evaluate on the scene classification dataset (TAU)
3. Re-train distinct visual and audio classifiers based on the complete architecture
4. Employ on audio-visual discrepancy detection (VADD; binary task: A/V streams match / do not match)]
Proposed Method Overview
Adapting visual and audio scene classification techniques to detect discrepancies
between the audio and video modalities in multimedia content
Key Contributions:
● Novel experimental protocol and benchmark dataset
● Free dataset and source code for community use
● A baseline method that adapts visual- and audio-scene classification techniques to
detect such discrepancies
Datasets
[Pipeline figure from the previous slide, annotated with the two datasets:
● TAU - Original Scene Classification Dataset (scene classifier training and evaluation)
● VADD - Visual-Audio Discrepancies Dataset (audio-visual discrepancy detection; binary task: A/V streams match / do not match)]
Original Scene Classification Dataset
● TAU Audio-Visual Urban Scenes 2021
● 10 scene classes
● Used in Task 1B of the DCASE 2021 challenge, which involves categorizing videos into scene classes based on their A/V content
● Participants in this challenge are required to develop systems that jointly analyze audio and visual information to determine the scene type
Visual-Audio Discrepancies Experimental Protocol
● Goal: enable the evaluation of methods that can detect discrepancies between the
visual and audio streams.
● Leverage the wealth of visual and auditory data already available in the existing TAU dataset
● Created a subset of videos in which the visual content portrays one class, while the accompanying audio track is sourced from a different class (see the sketch below)
● Designed a procedure to ensure balanced pristine and manipulated sets
● 3-class and 10-class variants for varying difficulty levels
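The released protocol code may differ, but a minimal sketch of how such a mismatched subset could be assembled is shown below; build_discrepancy_subset and its inputs are illustrative assumptions, with samples holding (video_path, audio_path, scene_class) triples from TAU:

```python
import random

def build_discrepancy_subset(samples, seed=42):
    """Pair half of the clips' visual streams with audio from a different scene
    class, keeping the pristine and manipulated sets balanced in size.
    `samples` is a list of (video_path, audio_path, scene_class) triples."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2

    # Pristine half: keep the original audio (label 0 = A/V streams match)
    pristine = [(video, audio, 0) for video, audio, _ in shuffled[:half]]

    # Manipulated half: swap in audio from a clip of a different scene class
    # (label 1 = A/V streams do not match)
    manipulated = []
    for video, _, scene in shuffled[half:]:
        donor = rng.choice([s for s in samples if s[2] != scene])
        manipulated.append((video, donor[1], 1))

    return pristine + manipulated
```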
Visual Scene Representations
Followed a transfer learning approach utilizing 3 pre-trained models:
● ViT - Vision Transformer; activations from the penultimate layer; embedding vector size 1024
● CLIP - Contrastive Language-Image Pre-training; image encoding vector; embedding vector size 1000
● ResNet - Residual Networks; activations from the penultimate layer; embedding vector size 2048
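As an illustration of this transfer learning step, the sketch below extracts penultimate-layer activations from a pre-trained ResNet-50 via torchvision; the exact backbones, checkpoints and preprocessing used in the paper may differ, "frame.jpg" is a hypothetical keyframe, and the ViT and CLIP embeddings would be obtained analogously from their respective libraries:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# ResNet-50 with the classification head replaced by an identity mapping,
# so the forward pass returns the 2048-d penultimate-layer activations
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

frame = Image.open("frame.jpg").convert("RGB")  # a keyframe sampled from the video
with torch.no_grad():
    embedding = backbone(preprocess(frame).unsqueeze(0))  # shape: (1, 2048)
```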
Audio Scene Representations
Deep Audio Embeddings (DAEs): Pre-trained DNN models on large datasets (e.g. AudioSet)
● OpenL3 - separate audio and video networks, combined via fusion layers; embedding vector size 512
● PANN - Pre-trained Audio Neural Network; embedding vector size 512
● IOV - ResNet model trained for a different but similar task; embedding vector size 256
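For concreteness, a minimal sketch of extracting OpenL3 deep audio embeddings with the openl3 package is given below; "clip_audio.wav" is a hypothetical file, and the average pooling over time frames is an assumption rather than necessarily the aggregation used in the paper:

```python
import numpy as np
import openl3
import soundfile as sf

# Load the clip's audio track and extract frame-level OpenL3 embeddings
audio, sr = sf.read("clip_audio.wav")
emb, timestamps = openl3.get_audio_embedding(
    audio, sr, content_type="env", embedding_size=512)

# Average-pool over time to obtain a single 512-d clip-level embedding
clip_embedding = np.mean(emb, axis=0)
```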
Combining Modalities
● Model architecture: Concatenation of embeddings with a self-attention mechanism
and two fully-connected layers
[Architecture figure: visual embeddings (ViT, CLIP, ResNet) and audio embeddings (OpenL3, PANN, IOV) → Concat → Self-attention → FC → FC]
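A minimal PyTorch sketch of this fusion head is given below, assuming the embedding sizes listed on the previous slides; the hidden width, the single attention head, and other details are illustrative assumptions rather than the authors' exact configuration:

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Concatenate visual and audio embeddings, apply self-attention,
    then classify with two fully-connected layers (sketch, not the released code)."""
    def __init__(self, dim=1024 + 1000 + 2048 + 512 + 512 + 256,
                 hidden=512, num_classes=10):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=1, batch_first=True)
        self.fc1 = nn.Linear(dim, hidden)
        self.fc2 = nn.Linear(hidden, num_classes)

    def forward(self, visual_emb, audio_emb):
        x = torch.cat([visual_emb, audio_emb], dim=-1)  # (batch, dim)
        x = x.unsqueeze(1)                              # treat as a length-1 sequence
        x, _ = self.attn(x, x, x)                       # late self-attention
        x = torch.relu(self.fc1(x.squeeze(1)))
        return self.fc2(x)                              # scene-class logits
```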
Experimental Results: Scene Classification
● Performance Evaluation - comparison with the winner of Task 1B of the DCASE 2021 challenge
● Results:
○ 97.24% accuracy on TAU (proposed) compared to 95.1% accuracy on TAU (DCASE 2021 Task 1B winner)
○ Superior performance due to modern features and self-attention mechanisms
○ Given this high classification accuracy, a detected discrepancy between the predicted audio and visual scenes indicates a high likelihood of an actual inconsistency
Experimental Results: Visual-Audio Discrepancies
● Detection of Manipulated Samples - Evaluation on the 3-class and 10-class variants
of the VADD dataset
● Results:
○ 3-class VADD variant: 95.54 F1-score
○ 10-class VADD variant: 79.16 F1-score
○ High accuracy in the 3-class variant
○ 10-class variant is a more realistic and challenging scenario
Ablation study
● Self-attention layer placement - different variants tested (see the sketch after this list):
○ Late self-attention (LS): Applied after concatenating all input embeddings (default
placement)
○ Early self-attention (ES): Applied directly to individual visual and audio embeddings before
concatenation
○ Per-modality self-attention (MS): Applied separately to the concatenated visual embeddings and to the concatenated audio embeddings
○ Combined self-attention: Various combinations of ES, MS, and LS approaches
○ No self-attention (NS)
● Data augmentation techniques
● Single vs. double FC layers
● Results showed that the chosen model architecture, data augmentation strategies, and the number of FC layers used are well-suited for the task at hand
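To make the attention placement variants concrete, the shape-level sketch below shows where the self-attention is applied in each case; the randomly initialized layers and the helper attend are purely illustrative, not the evaluated models:

```python
import torch
import torch.nn as nn

def attend(x):
    """Single-head self-attention over a length-1 sequence (illustrative only)."""
    attn = nn.MultiheadAttention(embed_dim=x.shape[-1], num_heads=1, batch_first=True)
    out, _ = attn(x.unsqueeze(1), x.unsqueeze(1), x.unsqueeze(1))
    return out.squeeze(1)

vit, clip, resnet = torch.randn(1, 1024), torch.randn(1, 1000), torch.randn(1, 2048)
ol3, pann, iov = torch.randn(1, 512), torch.randn(1, 512), torch.randn(1, 256)

# Late self-attention (LS): one attention pass over the fully concatenated vector
ls = attend(torch.cat([vit, clip, resnet, ol3, pann, iov], dim=-1))

# Early self-attention (ES): attention on each embedding before concatenation
es = torch.cat([attend(e) for e in (vit, clip, resnet, ol3, pann, iov)], dim=-1)

# Per-modality self-attention (MS): attention on each modality's concatenated embeddings
ms = torch.cat([attend(torch.cat([vit, clip, resnet], dim=-1)),
                attend(torch.cat([ol3, pann, iov], dim=-1))], dim=-1)
```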
Conclusions and Outlook
● Visual/Audio Scene Discrepancy Detection to Counter Disinformation
● Key contributions:
○ Baseline method adapting existing visual- and audio-scene classification techniques
○ Novel experimental protocol and benchmark dataset
○ Free dataset and source code for community use
● Next steps:
○ Alternative contrastive learning / fusion approaches
○ Incorporate temporal information and go beyond global analysis
Our GitHub repository!
Thank you!
Any questions?
This work was supported by the EU Horizon Europe programme
under grant agreement 101070093 vera.ai