Visual and audio scene classification
for detecting discrepancies in video:
a baseline method and experimental protocol
Konstantinos Apostolidis¹, Jakob Abeßer², Luca Cuccovillo², Vasileios Mezaris¹
¹ Information Technologies Institute, CERTH
² Fraunhofer Institute for Digital Media Technology
ACM ICMR 2024, MAD workshop
Phuket, Thailand, June 10-13
Introduction
● Digital Disinformation: fabricated news, tampered images, manipulated videos
● Verifying the integrity of media is crucial for journalists, security experts, and emergency management
● AI-Based Tools can support this task, offering scalability and efficiency
Motivation
[Figure: two voice clips (#1 and #2) joined at a processing point, with an ambient soundscape added to mask the edit]
Motivation
● Audio Editing: Use of soundscapes to enhance immersion OR mask edits
● Malicious Manipulations Scenario: Lack of authentic soundscapes leading to
audio-visual inconsistencies
● Need for methods to detect subtle discrepancies in audio-visual content
Proposed Method Overview
[Pipeline figure, visual and audio streams over the TAU and VADD datasets:
1. Training of a robust audio-visual scene classifier
2. Evaluate on the scene classification dataset (TAU)
3. Re-train distinct visual and audio classifiers based on the complete architecture
4. Employ on audio-visual discrepancy detection (VADD; binary task: A/V streams match / do not match)]
Proposed Method Overview
Adapting visual and audio scene classification techniques to detect discrepancies
between the audio and video modalities in multimedia content
Key Contributions:
● Novel experimental protocol and benchmark dataset
● Free dataset and source code for community use
● A baseline method that adapts visual- and audio-scene classification techniques to
detect such discrepancies
Datasets
[Pipeline figure from the previous slide, annotated with the two datasets:
● TAU - Original Scene Classification Dataset (scene classifier training and evaluation)
● VADD - Visual-Audio Discrepancies Dataset (audio-visual discrepancy detection; binary task: A/V streams match / do not match)]
Original Scene Classification Dataset
● TAU Audio-Visual Urban Scenes 2021
● 10 scene classes
● Used in Task 1B of the DCASE 2021 challenge, which involves categorizing videos into scene classes based on their A/V content
● Participants in this challenge are required to develop systems that jointly analyze audio and visual information to determine the scene type
Visual-Audio Discrepancies Experimental Protocol
● Goal: enable the evaluation of methods that can detect discrepancies between the
visual and audio streams.
● Leverage the wealth of visual and auditory data already available in the existing TAU dataset
● Created a subset of videos in which the visual content portrays one class, while the accompanying audio track is sourced from a different class (see the sketch below)
● Designed a procedure to ensure balanced pristine and manipulated sets
● 3-class and 10-class variants for varying difficulty levels
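The released protocol code may differ, but a minimal sketch of how such a mismatched subset could be assembled is shown below; build_discrepancy_subset and its inputs are illustrative assumptions, with samples holding (video_path, audio_path, scene_class) triples from TAU:

```python
import random

def build_discrepancy_subset(samples, seed=42):
    """Pair half of the clips' visual streams with audio from a different scene
    class, keeping the pristine and manipulated sets balanced in size.
    `samples` is a list of (video_path, audio_path, scene_class) triples."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2

    # Pristine half: keep the original audio (label 0 = A/V streams match)
    pristine = [(video, audio, 0) for video, audio, _ in shuffled[:half]]

    # Manipulated half: swap in audio from a clip of a different scene class
    # (label 1 = A/V streams do not match)
    manipulated = []
    for video, _, scene in shuffled[half:]:
        donor = rng.choice([s for s in samples if s[2] != scene])
        manipulated.append((video, donor[1], 1))

    return pristine + manipulated
```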
Visual Scene Representations
Followed a transfer learning approach utilizing 3 pre-trained models:
● ViT - Vision Transformer; activations from the penultimate layer; embedding vector size 1024
● CLIP - Contrastive Language-Image Pre-training; image encoding vector; embedding vector size 1000
● ResNet - Residual Networks; activations from the penultimate layer; embedding vector size 2048
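As an illustration of this transfer learning step, the sketch below extracts penultimate-layer activations from a pre-trained ResNet-50 via torchvision; the exact backbones, checkpoints and preprocessing used in the paper may differ, "frame.jpg" is a hypothetical keyframe, and the ViT and CLIP embeddings would be obtained analogously from their respective libraries:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# ResNet-50 with the classification head replaced by an identity mapping,
# so the forward pass returns the 2048-d penultimate-layer activations
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

frame = Image.open("frame.jpg").convert("RGB")  # a keyframe sampled from the video
with torch.no_grad():
    embedding = backbone(preprocess(frame).unsqueeze(0))  # shape: (1, 2048)
```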
Audio Scene Representations
Deep Audio Embeddings (DAEs): Pre-trained DNN models on large datasets (e.g. AudioSet)
● OpenL3 - separate audio and video networks, combined via fusion layers; embedding vector size 512
● PANN - Pre-trained Audio Neural Network; embedding vector size 512
● IOV - ResNet model trained for a different but similar task; embedding vector size 256
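For concreteness, a minimal sketch of extracting OpenL3 deep audio embeddings with the openl3 package is given below; "clip_audio.wav" is a hypothetical file, and the average pooling over time frames is an assumption rather than necessarily the aggregation used in the paper:

```python
import numpy as np
import openl3
import soundfile as sf

# Load the clip's audio track and extract frame-level OpenL3 embeddings
audio, sr = sf.read("clip_audio.wav")
emb, timestamps = openl3.get_audio_embedding(
    audio, sr, content_type="env", embedding_size=512)

# Average-pool over time to obtain a single 512-d clip-level embedding
clip_embedding = np.mean(emb, axis=0)
```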
Combining Modalities
● Model architecture: Concatenation of embeddings with a self-attention mechanism
and two fully-connected layers
[Architecture figure: visual embeddings (ViT, CLIP, ResNet) and audio embeddings (OpenL3, PANN, IOV) → Concat → Self-attention → FC → FC]
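A minimal PyTorch sketch of this fusion head is given below, assuming the embedding sizes listed on the previous slides; the hidden width, the single attention head, and other details are illustrative assumptions rather than the authors' exact configuration:

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Concatenate visual and audio embeddings, apply self-attention,
    then classify with two fully-connected layers (sketch, not the released code)."""
    def __init__(self, dim=1024 + 1000 + 2048 + 512 + 512 + 256,
                 hidden=512, num_classes=10):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=1, batch_first=True)
        self.fc1 = nn.Linear(dim, hidden)
        self.fc2 = nn.Linear(hidden, num_classes)

    def forward(self, visual_emb, audio_emb):
        x = torch.cat([visual_emb, audio_emb], dim=-1)  # (batch, dim)
        x = x.unsqueeze(1)                              # treat as a length-1 sequence
        x, _ = self.attn(x, x, x)                       # late self-attention
        x = torch.relu(self.fc1(x.squeeze(1)))
        return self.fc2(x)                              # scene-class logits
```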
Experimental Results: Scene Classification
● Performance Evaluation - comparison with the winner of Task 1B of the DCASE 2021 challenge
● Results:
○ 97.24% accuracy on TAU (proposed) compared to 95.1% accuracy on TAU (DCASE 2021 Task 1B winner)
○ Superior performance due to modern features and self-attention mechanisms
○ Given this high classification accuracy, a detected discrepancy between the predicted audio and visual scenes indicates a high likelihood of an actual inconsistency
Experimental Results: Visual-Audio Discrepancies
● Detection of Manipulated Samples - Evaluation on the 3-class and 10-class variants
of the VADD dataset
● Results:
○ 3-class VADD variant: 95.54 F1-score
○ 10-class VADD variant: 79.16 F1-score
○ High accuracy in the 3-class variant
○ 10-class variant is a more realistic and challenging scenario
Ablation study
● Self-attention layer placement - different variants tested (see the sketch after this list):
○ Late self-attention (LS): Applied after concatenating all input embeddings (default
placement)
○ Early self-attention (ES): Applied directly to individual visual and audio embeddings before
concatenation
○ Per-modality self-attention (MS): Applied separately to the concatenated visual embeddings and to the concatenated audio embeddings
○ Combined self-attention: Various combinations of ES, MS, and LS approaches
○ No self-attention (NS)
● Data augmentation techniques
● Single vs. double FC layers
● Results showed that the chosen model architecture, data augmentation strategies, and the number of FC layers used are well-suited for the task at hand
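To make the attention placement variants concrete, the shape-level sketch below shows where the self-attention is applied in each case; the randomly initialized layers and the helper attend are purely illustrative, not the evaluated models:

```python
import torch
import torch.nn as nn

def attend(x):
    """Single-head self-attention over a length-1 sequence (illustrative only)."""
    attn = nn.MultiheadAttention(embed_dim=x.shape[-1], num_heads=1, batch_first=True)
    out, _ = attn(x.unsqueeze(1), x.unsqueeze(1), x.unsqueeze(1))
    return out.squeeze(1)

vit, clip, resnet = torch.randn(1, 1024), torch.randn(1, 1000), torch.randn(1, 2048)
ol3, pann, iov = torch.randn(1, 512), torch.randn(1, 512), torch.randn(1, 256)

# Late self-attention (LS): one attention pass over the fully concatenated vector
ls = attend(torch.cat([vit, clip, resnet, ol3, pann, iov], dim=-1))

# Early self-attention (ES): attention on each embedding before concatenation
es = torch.cat([attend(e) for e in (vit, clip, resnet, ol3, pann, iov)], dim=-1)

# Per-modality self-attention (MS): attention on each modality's concatenated embeddings
ms = torch.cat([attend(torch.cat([vit, clip, resnet], dim=-1)),
                attend(torch.cat([ol3, pann, iov], dim=-1))], dim=-1)
```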
Conclusions and Outlook
● Visual/Audio Scene Discrepancy Detection to Counter Disinformation
● Key contributions:
○ Baseline method adapting existing visual- and audio-scene classification techniques
○ Novel experimental protocol and benchmark dataset
○ Free dataset and source code for community use
● Next steps:
○ Alternative contrastive learning / fusion approaches
○ Incorporate temporal information and go beyond global analysis
Our GitHub repository!
Thank you!
Any questions?
This work was supported by the EU Horizon Europe programme
under grant agreement 101070093 vera.ai