Gated-ViGAT: Efficient Bottom-Up Event Recognition and Explanation
Using a New Frame Selection Policy and Gating Mechanism
Nikolaos Gkalelis, Dimitrios Daskalakis, Vasileios Mezaris
CERTH-ITI, Thermi - Thessaloniki, Greece
IEEE Int. Symposium on Multimedia,
Naples, Italy, Dec. 2022
Introduction
• The recognition of high-level events in unconstrained video is an important topic
with applications in security (e.g. “making a bomb”), the automotive industry (e.g.
“pedestrian crossing the street”), etc.
• Most approaches are top-down: “patchify” the frame (context agnostic); use the
label and loss function to learn to focus on frame regions related to the event
• Bottom-up approaches: use an object detector, feature extractor and graph
network to extract and process features from the main objects in the video
[Figure: example video event “walking the dog”]
ViGAT
• Our recent bottom-up approach, with SOTA performance on many datasets
• Uses a graph attention network (GAT) head to process local (object) & global
(frame) information
• Also provides frame/object-level explanations (in contrast to top-down approaches)
[Figure: video event “removing ice from car” miscategorized as “shoveling snow”;
the object-level explanation shows that the classifier does not focus on the car object]
ViGAT block
• Cornerstone of the ViGAT head; transforms a feature matrix (representing the graph’s
nodes) into a feature vector (representing the whole graph)
• Computes the explanation significance (weighted in-degrees, WiDs) of each node
using the graph’s adjacency matrix
[Figure: ViGAT block. The attention mechanism computes an attention matrix from the
node features X (K × F) and, from the attention coefficients, an adjacency matrix
A (K × K); multiplying the node features with the adjacency matrix yields Z (K × F);
graph pooling then produces the vector representation of the whole graph, η (1 × F).]
The WiDs, i.e. the explanation significance of the l-th node, are the weighted
in-degrees of the adjacency matrix:
$$\varphi_l = \sum_{k=1}^{K} a_{k,l}, \qquad l = 1, \ldots, K$$
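To make the pooling and WiD computation concrete, below is a minimal NumPy sketch. The scaled dot-product softmax attention is an illustrative stand-in (an assumption, not ViGAT’s actual attention mechanism); the point is that the adjacency matrix A drives both the pooled graph vector and the per-node WiDs φ_l = Σ_k a_{k,l}.

```python
import numpy as np

def vigat_block_sketch(X):
    """Minimal sketch of a ViGAT-style block.

    X: (K, F) node feature matrix (one row per object/frame node).
    Returns the graph vector eta (1, F) and the per-node WiDs (K,).
    The attention here is a plain scaled dot-product softmax, used
    only for illustration; ViGAT's actual attention differs.
    """
    K, F = X.shape
    scores = X @ X.T / np.sqrt(F)                # (K, K) raw attention scores
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)            # row-stochastic adjacency matrix
    Z = A @ X                                    # (K, F) node features mixed via A
    eta = Z.mean(axis=0, keepdims=True)          # (1, F) graph representation
    wids = A.sum(axis=0)                         # phi_l = sum_k a_{k,l}: WiD of node l
    return eta, wids
```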
ViGAT architecture
[Figure: ViGAT architecture. The object detector o extracts K objects from each of the
P video frames, and the feature extractor b produces K object-level features per frame
as well as P frame-level global features. Local branch: a GAT block ω2 turns each
frame’s K object features into a frame-level local feature, and ω3 (followed by a mean)
aggregates these into the video-level local feature, yielding object WiDs and frame
WiDs (local info); the top event-supporting objects/frames are picked via max3/max over
the WiDs. Global branch: ω1 aggregates the frame-level global features into the
video-level global feature, yielding frame WiDs (global info). The two video-level
features are concatenated into the video feature and passed to the classification head
u, e.g. “Recognized Event: Playing beach volleyball!”, with the event-supporting frames
and objects as the explanation.]
o: object detector; b: feature extractor; u: classification head; GAT blocks: ω1, ω2, ω3;
global branch: ω1; local branch: ω2, ω3
ViGAT
• ViGAT has a high computational cost due to local (object) information processing
(e.g., P = 120 frames, K = 50 objects per frame, PK = 6000 objects/video)
• Efficient video processing has been investigated in the top-down (frame) paradigm:
- Frame selection policy: identify the most important frames for classification
- Gating component: stop processing frames when sufficient evidence is achieved
• Unexplored topic in the bottom-up paradigm: can we use such techniques to reduce
the computational complexity of ViGAT’s local processing pipeline?
Gated-ViGAT
[Figure: Gated-ViGAT, local information processing pipeline. The video-level global
feature, the frame-level global features u1, …, uP, and the frame WiDs (global info)
are computed once by the global branch. At stage s, the frame selection policy picks
Q(s) of the P extracted video frames; the object detector o and feature extractor b
provide the K object features for each selected frame, which the GAT blocks ω2 and ω3
process into the video-level local feature ζ(s), together with frame WiDs and object
WiDs (local info) obtained via max/max3. The gate g(s) (ON/OFF) inspects Z(s): if the
gate is closed, Q(s+1) − Q(s) additional frames are requested; if it is open, ζ(s) is
concatenated with the global feature and classified by the head u, e.g. “Recognized
Event: Playing beach volleyball!”, with the event-supporting frames and objects as the
explanation.]
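The early-exit control flow of this pipeline can be summarized in a short Python sketch. All component callables (select, local_branch, gate, classify) are hypothetical stand-ins for the modules in the figure above; the sketch only illustrates the gating logic, not the actual implementation.

```python
from typing import Callable, Sequence
import numpy as np

def gated_inference(frames: np.ndarray,          # (P, ...) extracted video frames
                    global_feature: np.ndarray,  # precomputed video-level global feature
                    budgets: Sequence[int],      # frame budgets Q(1) < ... < Q(S)
                    select: Callable,            # frame selection policy -> frame indices
                    local_branch: Callable,      # omega2/omega3 -> (zeta_s, Z_s)
                    gate: Callable,              # g(s): (s, Z_s) -> True (open) / False
                    classify: Callable):         # classification head u
    """Schematic early-exit loop of Gated-ViGAT's local pipeline."""
    for s, q in enumerate(budgets):
        idx = select(frames, q)                       # select Q(s) frames
        zeta_s, Z_s = local_branch(frames[idx])       # video-level local feature
        if gate(s, Z_s) or s == len(budgets) - 1:     # gate open, or last budget reached
            video_feature = np.concatenate([global_feature, zeta_s])
            return classify(video_feature)
        # gate closed: the next iteration requests Q(s+1) - Q(s) additional frames
```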
Gated-ViGAT: Frame selection policy
• Iterative algorithm to select Q frames
Input: Q, frame index p1, P feature vectors (the frame-level global features
u1, …, uP and their frame WiDs γ1, …, γP (global info))

1. Initialize: normalize the WiD vectors, $\gamma_p \leftarrow \gamma_p / \lVert \gamma_p \rVert$, and start from the given frame index p1.
2. Select the remaining Q − 1 frames, iterating for i = 2 to Q:
- compute a diversity weight of each frame with respect to the last selected one, $\alpha_p = \tfrac{1}{2}\bigl(1 - \gamma_p^{\mathsf{T}} \gamma_{p_{i-1}}\bigr)$;
- re-weight the frame features, $u_p \leftarrow \alpha_p u_p$;
- select $p_i = \arg\max_p$ over the min-max-normalized scores of the re-weighted features.
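A self-contained NumPy sketch of this policy follows the slide’s equations. The exact quantity ranked at each step is not fully recoverable from the slide, so ranking the min-max-normalized norms of the re-weighted features is an assumption, as is the tie-handling.

```python
import numpy as np

def select_frames(u, gamma, p1, Q):
    """Frame selection policy sketch.

    u:     (P, F) frame-level global feature vectors u_1..u_P
    gamma: (P, D) frame WiD vectors (global info) gamma_1..gamma_P
    p1:    index of the initially selected frame (given as input, per the slide)
    Q:     total number of frames to select
    """
    u = u.astype(float).copy()
    gamma = gamma / np.linalg.norm(gamma, axis=1, keepdims=True)  # gamma_p <- gamma_p/|gamma_p|

    def minmax(x):                                   # min-max normalization to [0, 1]
        return (x - x.min()) / (x.max() - x.min() + 1e-12)

    selected = [p1]
    for _ in range(Q - 1):                           # select the remaining Q - 1 frames
        prev = selected[-1]
        # alpha_p = (1/2)(1 - gamma_p^T gamma_{p_{i-1}}): large when frame p's WiD
        # pattern is dissimilar to the last selected frame's (diversity term)
        alpha = 0.5 * (1.0 - gamma @ gamma[prev])
        u = alpha[:, None] * u                       # u_p <- alpha_p * u_p
        scores = minmax(np.linalg.norm(u, axis=1))   # assumed ranking criterion
        scores[selected] = -np.inf                   # never re-select a frame
        selected.append(int(np.argmax(scores)))
    return selected
```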
Gated-ViGAT: Gate training
• Each gate has a GAT block-like structure and a binary classification
head (open/close); corresponds to a specified number of frames Q(s);
trained to output 1 (i.e. open) when the ViGAT loss is low; design
hyperparameters: Q(s), β (sensitivity)

For each training video:
- Use the frame selection policy to select Q(s) frames for gate g(s)
- Compute the video-level local feature ζ(s) (and Z(s))
- Compute the ViGAT classification loss: $l_{ce} = \mathrm{CE}(\text{label}, y)$
- Derive the pseudolabel o(s): 1 if $l_{ce} \leq \beta e^{s/2}$; zero otherwise
- Compute the gate component loss: $L = \frac{1}{S} \sum_{s=1}^{S} l_{bce}\bigl(g^{(s)}(\mathbf{Z}^{(s)}),\, o^{(s)}\bigr)$
- Perform backpropagation to update the gate weights
[Figure: gate training setup. For each gate g(s), Q(s) video frames are selected and
passed through the local ViGAT branch to obtain ζ(s) and Z(s); ζ(s) is concatenated
with the pre-computed video-level global feature into the video feature, whose
prediction y is compared with the ground-truth label via cross entropy; the resulting
pseudolabel o(s) supervises the gate g(s) on Z(s) through binary cross entropy.]
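The pseudolabel and loss computation can be sketched in a few lines of PyTorch. Tensor shapes and module interfaces here are assumptions; the ViGAT classifier is kept frozen and only the gate weights receive gradients, as described above.

```python
import math
import torch
import torch.nn.functional as F

def gate_loss(gates, Z_list, y_list, label, beta=1e-8):
    """Sketch of the gate training loss for one video (assumed shapes).

    gates:  list of S gate modules g(1)..g(S), each mapping Z(s) to a scalar logit
    Z_list: list of S tensors Z(s) (features seen by the gate's GAT-like block)
    y_list: list of S (C,) frozen-ViGAT class logits, one per frame budget Q(s)
    label:  () long tensor with the ground-truth class index
    """
    loss = torch.zeros(())
    S = len(gates)
    for s, (g, Z_s, y_s) in enumerate(zip(gates, Z_list, y_list), start=1):
        l_ce = F.cross_entropy(y_s.unsqueeze(0), label.view(1))  # ViGAT loss with Q(s) frames
        o_s = (l_ce <= beta * math.exp(s / 2)).float()           # pseudolabel o(s)
        logit = g(Z_s).view(())                                  # gate's open/close logit
        loss = loss + F.binary_cross_entropy_with_logits(logit, o_s)
    return loss / S  # L = (1/S) sum_s l_bce(g(s)(Z(s)), o(s))
```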
Experiments
• ActivityNet v1.3: 200 events/actions, 10K/5K training/testing, 5 to 10 mins; multilabel
• MiniKinetics: 200 events/actions, 80K/5K training/testing, 10 secs duration; single-label
• Video representation: 120/30 frames with uniform sampling for ActivityNet/MiniKinetics
• Pretrained ViGAT components: Faster R-CNN (pretrained/finetuned on Imagenet1K/VG, K=50
objects), ViT-B/16 backbone (pretrained/finetuned on Imagenet11K/Imagenet1K), 3 GAT blocks
(pretrained on the respective dataset, i.e., ActivityNet or MiniKinetics)
• Gates: S= 6 / 5 (number of gates), {Q(s)} = {9, 12, 16, 20, 25, 30} / {2, 4, 6, 8, 10} (sequence lengths),
for ActivityNet/MiniKinetics
• Gate training hyperparameters: β = 10⁻⁸, epochs = 40, lr = 10⁻⁴ (multiplied by 0.1 at epochs 16 and 35)
• Evaluation Measures: mAP (ActivityNet), top-1 accuracy (MiniKinetics), FLOPs
• Gated-ViGAT is compared against top-scoring methods in the two datasets
Experiments: results
Methods in MiniKinetics Top-1%
TBN [30] 69.5
BAT [7] 70.6
MARS (3D ResNet) [31] 72.8
Fast-S3D (Inception) [14] 78
ATFR (X3D-S) [18] 78
ATFR (R(2+1D)) [18] 78.2
RMS (SlowOnly) [28] 78.6
ATFR (I3D) [18] 78.8
Ada3D (I3D, Kinetics) [32] 79.2
ATFR (3D Resnet) [18] 79.3
CGNL (Modified ResNet) [17] 79.5
TCPNet (ResNet, Kinetics) [3] 80.7
LgNet (R3D) [3] 80.9
FrameExit (EfficientNet) [1] 75.3
ViGAT [9] 82.1
Gated-ViGAT (proposed) 81.3
• Gated-ViGAT outperforms all top-down approaches
• Slightly underperforms ViGAT, but with approx. 5.5× and 4× FLOPs reduction on
ActivityNet and MiniKinetics, respectively (see the FLOPs table below)
• As expected, it has higher computational complexity than many top-down
approaches (e.g. see [3], [4]) but can provide explanations
Methods in ActivityNet mAP%
AdaFrame [21] 71.5
ListenToLook [23] 72.3
LiteEval [33] 72.7
SCSampler [25] 72.9
AR-Net [13] 73.8
FrameExit [1] 77.3
AR-Net (EfficientNet) [13] 79.7
MARL (ResNet, Kinetics) [22] 82.9
FrameExit (X3D-S) [1] 87.4
ViGAT [9] 88.1
Gated-ViGAT (proposed) 87.3
FLOPs in the 2 datasets ViGAT Gated-ViGAT
ActivityNet 137.4 24.8
MiniKinetics 34.4 8.7
*Best and second best performance
are denoted with bold and underline
Experiments: method insight
• Computed the # of videos processed and the recognition performance for each gate
• Average number of frames for ActivityNet / MiniKinetics: 20 / 7
• The recognition rate drops as the gate number increases; this behavior shows more
clearly on ActivityNet (longer videos)
• Conclusion: “easy” videos exit early, while “difficult” videos remain difficult to
recognize even with many frames (a similar conclusion to [1])
ActivityNet g(1) g(2) g(3) g(4) g(5) g(6)
# frames 9 12 16 20 25 30
# videos 793 651 722 502 535 1722
mAP% 99.8 94.5 93.8 92.7 86 71.6
MiniKinetics g(1) g(2) g(3) g(4) g(5)
# frames 2 4 6 8 10
# videos 179 686 1199 458 2477
Top-1% 84.9 83 81.1 84.9 80.7
Experiments: examples
• Bullfighting (top) and Cricket (bottom) test videos of ActivityNet exited at the first
gate, i.e., they were recognized using only 9 frames out of the 120 required by ViGAT
• The frames selected with the proposed policy both explain the recognition result and
provide a diverse view of the video; this helps to recognize the video with fewer frames
[Figure: selected frames for the Bullfighting (top) and Cricket (bottom) videos]
Experiments: examples
• Can also provide explanations at the object level (in contrast to top-down methods)
[Figure: object-level explanations for “Waterskiing” predicted as “Making a sandwich”,
“Playing accordion” predicted as “Playing guitarra”, and “Breakdancing” (correct
prediction)]
Experiments: ablation study on frame selection policies
Policy / #frames 10 20 30
Random 83 85.5 86.5
WiD-based 84.9 86.1 86.9
Random on local 85.4 86.6 86.9
WiD-based on local 86.6 87.1 87.5
FrameExit policy 86.2 87.3 87.5
Proposed policy 86.7 87.3 87.6
Gated-ViGAT (proposed) 86.8 87.5 87.7
• Comparison (mAP%) on ActivityNet
• Gated-ViGAT selects diverse frames with high explanation potential
• The proposed policy is second best (surpassing FrameExit [1], the current SOTA)
Random: Θ frames selected randomly for the local/global features
WiD-based: Θ frames selected using the global WiDs
Random on local: P frames derive the global feature; Θ frames selected randomly
WiD-based on local: P frames derive the global feature; Θ frames selected using the global WiDs
FrameExit policy: Θ frames selected using the policy in [1]
Proposed policy: P frames derive the global feature; Θ frames selected using the proposed policy
Gated-ViGAT: in addition to the above, the gate component selects Θ frames on average
Experiments: ablation study example
• Top-6 frames of a “bungee jumping” video selected with the WiD-based vs the proposed policy
[Figure: frames selected by the proposed policy and by the WiD-based policy, together
with the updated WiDs]
Conclusions
• An efficient bottom-up event recognition and explanation approach was presented
• It utilizes a new policy algorithm to select frames that: a) best explain the
classifier’s decision, b) provide diverse information about the underlying event
• It utilizes a gating mechanism that instructs the model to stop extracting bottom-up
(object) information when sufficient evidence of the event is achieved
• Evaluation on 2 datasets showed competitive recognition performance and an approx.
5× FLOPs reduction in comparison to the previous SOTA
• Future work: further efficiency improvements, e.g. a faster object detector and
feature extractor, frame selection also for the global information pipeline, etc.
Thank you for your attention!
Questions?
Nikolaos Gkalelis, gkalelis@iti.gr
Vasileios Mezaris, bmezaris@iti.gr
Code publicly available at:
https://github.com/bmezaris/Gated-ViGAT
This work was supported by the EU's Horizon 2020 research and innovation programme
under grant agreement 101021866 CRiTERIA.