Gated-ViGAT: Efficient Bottom-Up Event Recognition and Explanation
Using a New Frame Selection Policy and Gating Mechanism
Nikolaos Gkalelis, Dimitrios Daskalakis, Vasileios Mezaris
CERTH-ITI, Thermi - Thessaloniki, Greece
IEEE Int. Symposium on Multimedia,
Naples, Italy, Dec. 2022
Introduction
• The recognition of high-level events in unconstrained video is an important topic
with applications in security (e.g. “making a bomb”), the automotive industry (e.g.
“pedestrian crossing the street”), etc.
• Most approaches are top-down: “patchify” the frame (context agnostic); use the
label and loss function to learn to focus on frame regions related to the event
• Bottom-up approaches: use an object detector, feature extractor and graph
network to extract and process features from the main objects in the video
[Figure: example video event “walking the dog”]
ViGAT
• Our recent bottom-up approach, with SOTA performance on many datasets
• Uses a graph attention network (GAT) head to process local (object) & global
(frame) information
• Also provides frame/object-level explanations (in contrast to top-down approaches)
[Figure: video event “removing ice from car” miscategorized as “shoveling snow”;
the object-level explanation shows that the classifier does not focus on the car object]
ViGAT block
• Cornerstone of the ViGAT head; transforms a feature matrix (representing the graph’s
nodes) into a feature vector (representing the whole graph)
• Computes the explanation significance (weighted in-degrees, WiDs) of each node
using the graph’s adjacency matrix
[Figure: ViGAT block. The attention mechanism computes an attention matrix from the
node features X (K × F) and, from the attention coefficients, an adjacency matrix
A (K × K); multiplying the node features with the adjacency matrix yields Z (K × F);
graph pooling then produces the vector representation of the whole graph, η (1 × F).]
The WiDs, i.e. the explanation significance of the l-th node, are the weighted
in-degrees of the adjacency matrix:
$$\varphi_l = \sum_{k=1}^{K} a_{k,l}, \qquad l = 1, \ldots, K$$
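To make the pooling and WiD computation concrete, below is a minimal NumPy sketch. The scaled dot-product softmax attention is an illustrative stand-in (an assumption, not ViGAT’s actual attention mechanism); the point is that the adjacency matrix A drives both the pooled graph vector and the per-node WiDs φ_l = Σ_k a_{k,l}.

```python
import numpy as np

def vigat_block_sketch(X):
    """Minimal sketch of a ViGAT-style block.

    X: (K, F) node feature matrix (one row per object/frame node).
    Returns the graph vector eta (1, F) and the per-node WiDs (K,).
    The attention here is a plain scaled dot-product softmax, used
    only for illustration; ViGAT's actual attention differs.
    """
    K, F = X.shape
    scores = X @ X.T / np.sqrt(F)                # (K, K) raw attention scores
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)            # row-stochastic adjacency matrix
    Z = A @ X                                    # (K, F) node features mixed via A
    eta = Z.mean(axis=0, keepdims=True)          # (1, F) graph representation
    wids = A.sum(axis=0)                         # phi_l = sum_k a_{k,l}: WiD of node l
    return eta, wids
```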
ViGAT architecture
[Figure: ViGAT architecture. The object detector o extracts K objects from each of the
P video frames, and the feature extractor b produces K object-level features per frame
as well as P frame-level global features. Local branch: a GAT block ω2 turns each
frame’s K object features into a frame-level local feature, and ω3 (followed by a mean)
aggregates these into the video-level local feature, yielding object WiDs and frame
WiDs (local info); the top event-supporting objects/frames are picked via max3/max over
the WiDs. Global branch: ω1 aggregates the frame-level global features into the
video-level global feature, yielding frame WiDs (global info). The two video-level
features are concatenated into the video feature and passed to the classification head
u, e.g. “Recognized Event: Playing beach volleyball!”, with the event-supporting frames
and objects as the explanation.]
o: object detector; b: feature extractor; u: classification head; GAT blocks: ω1, ω2, ω3;
global branch: ω1; local branch: ω2, ω3
ViGAT
• ViGAT has a high computational cost due to local (object) information processing
(e.g., P = 120 frames, K = 50 objects per frame, PK = 6000 objects/video)
• Efficient video processing has been investigated in the top-down (frame) paradigm:
- Frame selection policy: identify the most important frames for classification
- Gating component: stop processing frames when sufficient evidence is achieved
• Unexplored topic in the bottom-up paradigm: can we use such techniques to reduce
the computational complexity of ViGAT’s local processing pipeline?
Gated-ViGAT
[Figure: Gated-ViGAT, local information processing pipeline. The video-level global
feature, the frame-level global features u1, …, uP, and the frame WiDs (global info)
are computed once by the global branch. At stage s, the frame selection policy picks
Q(s) of the P extracted video frames; the object detector o and feature extractor b
provide the K object features for each selected frame, which the GAT blocks ω2 and ω3
process into the video-level local feature ζ(s), together with frame WiDs and object
WiDs (local info) obtained via max/max3. The gate g(s) (ON/OFF) inspects Z(s): if the
gate is closed, Q(s+1) − Q(s) additional frames are requested; if it is open, ζ(s) is
concatenated with the global feature and classified by the head u, e.g. “Recognized
Event: Playing beach volleyball!”, with the event-supporting frames and objects as the
explanation.]
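The early-exit control flow of this pipeline can be summarized in a short Python sketch. All component callables (select, local_branch, gate, classify) are hypothetical stand-ins for the modules in the figure above; the sketch only illustrates the gating logic, not the actual implementation.

```python
from typing import Callable, Sequence
import numpy as np

def gated_inference(frames: np.ndarray,          # (P, ...) extracted video frames
                    global_feature: np.ndarray,  # precomputed video-level global feature
                    budgets: Sequence[int],      # frame budgets Q(1) < ... < Q(S)
                    select: Callable,            # frame selection policy -> frame indices
                    local_branch: Callable,      # omega2/omega3 -> (zeta_s, Z_s)
                    gate: Callable,              # g(s): (s, Z_s) -> True (open) / False
                    classify: Callable):         # classification head u
    """Schematic early-exit loop of Gated-ViGAT's local pipeline."""
    for s, q in enumerate(budgets):
        idx = select(frames, q)                       # select Q(s) frames
        zeta_s, Z_s = local_branch(frames[idx])       # video-level local feature
        if gate(s, Z_s) or s == len(budgets) - 1:     # gate open, or last budget reached
            video_feature = np.concatenate([global_feature, zeta_s])
            return classify(video_feature)
        # gate closed: the next iteration requests Q(s+1) - Q(s) additional frames
```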
Gated-ViGAT: Frame selection policy
• Iterative algorithm to select Q frames
Input: Q, frame index p1, P feature vectors (the frame-level global features
u1, …, uP and their frame WiDs γ1, …, γP (global info))

1. Initialize: normalize the WiD vectors, $\gamma_p \leftarrow \gamma_p / \lVert \gamma_p \rVert$, and start from the given frame index p1.
2. Select the remaining Q − 1 frames, iterating for i = 2 to Q:
- compute a diversity weight of each frame with respect to the last selected one, $\alpha_p = \tfrac{1}{2}\bigl(1 - \gamma_p^{\mathsf{T}} \gamma_{p_{i-1}}\bigr)$;
- re-weight the frame features, $u_p \leftarrow \alpha_p u_p$;
- select $p_i = \arg\max_p$ over the min-max-normalized scores of the re-weighted features.
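A self-contained NumPy sketch of this policy follows the slide’s equations. The exact quantity ranked at each step is not fully recoverable from the slide, so ranking the min-max-normalized norms of the re-weighted features is an assumption, as is the tie-handling.

```python
import numpy as np

def select_frames(u, gamma, p1, Q):
    """Frame selection policy sketch.

    u:     (P, F) frame-level global feature vectors u_1..u_P
    gamma: (P, D) frame WiD vectors (global info) gamma_1..gamma_P
    p1:    index of the initially selected frame (given as input, per the slide)
    Q:     total number of frames to select
    """
    u = u.astype(float).copy()
    gamma = gamma / np.linalg.norm(gamma, axis=1, keepdims=True)  # gamma_p <- gamma_p/|gamma_p|

    def minmax(x):                                   # min-max normalization to [0, 1]
        return (x - x.min()) / (x.max() - x.min() + 1e-12)

    selected = [p1]
    for _ in range(Q - 1):                           # select the remaining Q - 1 frames
        prev = selected[-1]
        # alpha_p = (1/2)(1 - gamma_p^T gamma_{p_{i-1}}): large when frame p's WiD
        # pattern is dissimilar to the last selected frame's (diversity term)
        alpha = 0.5 * (1.0 - gamma @ gamma[prev])
        u = alpha[:, None] * u                       # u_p <- alpha_p * u_p
        scores = minmax(np.linalg.norm(u, axis=1))   # assumed ranking criterion
        scores[selected] = -np.inf                   # never re-select a frame
        selected.append(int(np.argmax(scores)))
    return selected
```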
Gated-ViGAT: Gate training
• Each gate has a GAT block-like structure and a binary classification
head (open/close); corresponds to a specified number of frames Q(s);
trained to output 1 (i.e. open) when the ViGAT loss is low; design
hyperparameters: Q(s), β (sensitivity)

For each training video:
- Use the frame selection policy to select Q(s) frames for gate g(s)
- Compute the video-level local feature ζ(s) (and Z(s))
- Compute the ViGAT classification loss: $l_{ce} = \mathrm{CE}(\text{label}, y)$
- Derive the pseudolabel o(s): 1 if $l_{ce} \leq \beta e^{s/2}$; zero otherwise
- Compute the gate component loss: $L = \frac{1}{S} \sum_{s=1}^{S} l_{bce}\bigl(g^{(s)}(\mathbf{Z}^{(s)}),\, o^{(s)}\bigr)$
- Perform backpropagation to update the gate weights
[Figure: gate training setup. For each gate g(s), Q(s) video frames are selected and
passed through the local ViGAT branch to obtain ζ(s) and Z(s); ζ(s) is concatenated
with the pre-computed video-level global feature into the video feature, whose
prediction y is compared with the ground-truth label via cross entropy; the resulting
pseudolabel o(s) supervises the gate g(s) on Z(s) through binary cross entropy.]
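The pseudolabel and loss computation can be sketched in a few lines of PyTorch. Tensor shapes and module interfaces here are assumptions; the ViGAT classifier is kept frozen and only the gate weights receive gradients, as described above.

```python
import math
import torch
import torch.nn.functional as F

def gate_loss(gates, Z_list, y_list, label, beta=1e-8):
    """Sketch of the gate training loss for one video (assumed shapes).

    gates:  list of S gate modules g(1)..g(S), each mapping Z(s) to a scalar logit
    Z_list: list of S tensors Z(s) (features seen by the gate's GAT-like block)
    y_list: list of S (C,) frozen-ViGAT class logits, one per frame budget Q(s)
    label:  () long tensor with the ground-truth class index
    """
    loss = torch.zeros(())
    S = len(gates)
    for s, (g, Z_s, y_s) in enumerate(zip(gates, Z_list, y_list), start=1):
        l_ce = F.cross_entropy(y_s.unsqueeze(0), label.view(1))  # ViGAT loss with Q(s) frames
        o_s = (l_ce <= beta * math.exp(s / 2)).float()           # pseudolabel o(s)
        logit = g(Z_s).view(())                                  # gate's open/close logit
        loss = loss + F.binary_cross_entropy_with_logits(logit, o_s)
    return loss / S  # L = (1/S) sum_s l_bce(g(s)(Z(s)), o(s))
```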
Experiments
• ActivityNet v1.3: 200 events/actions, 10K/5K training/testing, 5 to 10 mins; multilabel
• MiniKinetics: 200 events/actions, 80K/5K training/testing, 10 secs duration; single-label
• Video representation: 120/30 frames with uniform sampling for ActivityNet/MiniKinetics
• Pretrained ViGAT components: Faster R-CNN (pretrained/finetuned on Imagenet1K/VG, K=50
objects), ViT-B/16 backbone (pretrained/finetuned on Imagenet11K/Imagenet1K), 3 GAT blocks
(pretrained on the respective dataset, i.e., ActivityNet or MiniKinetics)
• Gates: S= 6 / 5 (number of gates), {Q(s)} = {9, 12, 16, 20, 25, 30} / {2, 4, 6, 8, 10} (sequence lengths),
for ActivityNet/MiniKinetics
• Gate training hyperparameters: β = 10⁻⁸, epochs = 40, lr = 10⁻⁴ (multiplied by 0.1 at epochs 16 and 35)
• Evaluation Measures: mAP (ActivityNet), top-1 accuracy (MiniKinetics), FLOPs
• Gated-ViGAT is compared against top-scoring methods in the two datasets
Experiments: results
Methods in MiniKinetics Top-1%
TBN [30] 69.5
BAT [7] 70.6
MARS (3D ResNet) [31] 72.8
Fast-S3D (Inception) [14] 78
ATFR (X3D-S) [18] 78
ATFR (R(2+1D)) [18] 78.2
RMS (SlowOnly) [28] 78.6
ATFR (I3D) [18] 78.8
Ada3D (I3D, Kinetics) [32] 79.2
ATFR (3D Resnet) [18] 79.3
CGNL (Modified ResNet) [17] 79.5
TCPNet (ResNet, Kinetics) [3] 80.7
LgNet (R3D) [3] 80.9
FrameExit (EfficientNet) [1] 75.3
ViGAT [9] 82.1
Gated-ViGAT (proposed) 81.3
• Gated-ViGAT outperforms all top-down approaches
• Slightly underperforms ViGAT, but with approx. 5.5× and 4× FLOPs reduction on
ActivityNet and MiniKinetics, respectively (see the FLOPs table below)
• As expected, it has higher computational complexity than many top-down
approaches (e.g. see [3], [4]) but can provide explanations
Methods in ActivityNet mAP%
AdaFrame [21] 71.5
ListenToLook [23] 72.3
LiteEval [33] 72.7
SCSampler [25] 72.9
AR-Net [13] 73.8
FrameExit [1] 77.3
AR-Net (EfficientNet) [13] 79.7
MARL (ResNet, Kinetics) [22] 82.9
FrameExit (X3D-S) [1] 87.4
ViGAT [9] 88.1
Gated-ViGAT (proposed) 87.3
FLOPs in the 2 datasets ViGAT Gated-ViGAT
ActivityNet 137.4 24.8
MiniKinetics 34.4 8.7
*Best and second best performance
are denoted with bold and underline
Experiments: method insight
• Computed the # of videos processed and the recognition performance for each gate
• Average number of frames for ActivityNet / MiniKinetics: 20 / 7
• The recognition rate drops as the gate number increases; this behavior shows more
clearly on ActivityNet (longer videos)
• Conclusion: “easy” videos exit early, while “difficult” videos remain difficult to
recognize even with many frames (a similar conclusion to [1])
ActivityNet g(1) g(2) g(3) g(4) g(5) g(6)
# frames 9 12 16 20 25 30
# videos 793 651 722 502 535 1722
mAP% 99.8 94.5 93.8 92.7 86 71.6
MiniKinetics g(1) g(2) g(3) g(4) g(5)
# frames 2 4 6 8 10
# videos 179 686 1199 458 2477
Top-1% 84.9 83 81.1 84.9 80.7
Experiments: examples
• Bullfighting (top) and Cricket (bottom) test videos of ActivityNet exited at the first
gate, i.e., they were recognized using only 9 frames out of the 120 required by ViGAT
• The frames selected with the proposed policy both explain the recognition result and
provide a diverse view of the video; this helps to recognize the video with fewer frames
[Figure: selected frames for the Bullfighting (top) and Cricket (bottom) videos]
Experiments: examples
• Can also provide explanations at the object level (in contrast to top-down methods)
[Figure: object-level explanations for “Waterskiing” predicted as “Making a sandwich”,
“Playing accordion” predicted as “Playing guitarra”, and “Breakdancing” (correct
prediction)]
Experiments: ablation study on frame selection policies
Policy / #frames 10 20 30
Random 83 85.5 86.5
WiD-based 84.9 86.1 86.9
Random on local 85.4 86.6 86.9
WiD-based on local 86.6 87.1 87.5
FrameExit policy 86.2 87.3 87.5
Proposed policy 86.7 87.3 87.6
Gated-ViGAT (proposed) 86.8 87.5 87.7
• Comparison (mAP%) on ActivityNet
• Gated-ViGAT selects diverse frames with high explanation potential
• The proposed policy is second best (surpassing FrameExit [1], the current SOTA)
Random: Θ frames selected randomly for the local/global features
WiD-based: Θ frames selected using the global WiDs
Random on local: P frames derive the global feature; Θ frames selected randomly
WiD-based on local: P frames derive the global feature; Θ frames selected using the global WiDs
FrameExit policy: Θ frames selected using the policy in [1]
Proposed policy: P frames derive the global feature; Θ frames selected using the proposed policy
Gated-ViGAT: in addition to the above, the gate component selects Θ frames on average
Experiments: ablation study example
• Top-6 frames of a “bungee jumping” video selected with the WiD-based vs the proposed policy
[Figure: frames selected by the proposed policy and by the WiD-based policy, together
with the updated WiDs]
Conclusions
• An efficient bottom-up event recognition and explanation approach was presented
• It utilizes a new policy algorithm to select frames that: a) best explain the
classifier’s decision, b) provide diverse information about the underlying event
• It utilizes a gating mechanism that instructs the model to stop extracting bottom-up
(object) information when sufficient evidence of the event is achieved
• Evaluation on 2 datasets showed competitive recognition performance and an approx.
5× FLOPs reduction in comparison to the previous SOTA
• Future work: further efficiency improvements, e.g. a faster object detector and
feature extractor, frame selection also for the global information pipeline, etc.
Thank you for your attention!
Questions?
Nikolaos Gkalelis, gkalelis@iti.gr
Vasileios Mezaris, bmezaris@iti.gr
Code publicly available at:
https://github.com/bmezaris/Gated-ViGAT
This work was supported by the EU's Horizon 2020 research and innovation programme
under grant agreement 101021866 CRiTERIA.