【ECCV 2016 BNMW】Human Action Recognition without Human

He Yun (1, 2), Soma Shirakabe (1, 2), Yutaka Satoh (1, 2), Hirokatsu Kataoka (1)
(1) Computer Vision Research Group, AIST, Japan
(2) Human-Centered Vision Lab., University of Tsukuba, Japan

Motion representation
•  Databases: UCF101, HMDB51, ActivityNet
•  Approaches: IDT, Two-Stream CNN
–  Both databases and approaches are already well established in the field

Action Database
http://www.thumos.info/

The problem setting in action recognition
•  Video-level prediction
–  One action label is predicted per input video
[Figure: input video → Motion Descriptor → "Tennis Swing"]
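The slide states only that one label is output per video. A common way to get there, assumed here rather than taken from the slides, is to score several clips or frames of the video, average the scores, and take the highest class. The function name `predict_video_label` and the averaging rule are illustrative assumptions.

```python
from typing import Sequence

import numpy as np


def predict_video_label(clip_scores: np.ndarray, class_names: Sequence[str]) -> str:
    """Collapse per-clip class scores of one video into a single video-level label.

    clip_scores: array of shape (num_clips, num_classes), e.g. softmax outputs for
    clips sampled from the same video (hypothetical setup for illustration).
    """
    # Average over the clip axis to obtain one score vector for the whole video.
    video_scores = clip_scores.mean(axis=0)
    # The predicted action is the class with the highest averaged score.
    return class_names[int(np.argmax(video_scores))]
```

For example, three clips that mostly favor the first class yield that class as the single video-level prediction, even if one clip disagrees.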

Dense Trajectories (DT) [Wang+, CVPR11]
•  Trajectory-based representation
–  A large number of trajectories is extracted
–  Features are described along the trajectories (HOG, HOF, MBH)
–  A codeword vector is generated
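The "codeword vector" step can be sketched as a hard-assignment bag-of-words histogram. This sketch assumes a codebook has already been learned (for example by k-means on training descriptors); IDT pipelines often use Fisher vector encoding instead, which is not shown. The function name `codeword_histogram` is hypothetical.

```python
import numpy as np


def codeword_histogram(descriptors: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Encode local descriptors as an L1-normalized bag-of-words histogram.

    descriptors: (num_trajectories, dim) array, e.g. HOG/HOF/MBH descriptors.
    codebook:    (num_words, dim) array of codewords (assumed precomputed).
    """
    # Squared Euclidean distance between every descriptor and every codeword.
    diffs = descriptors[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    # Hard-assign each descriptor to its nearest codeword.
    assignments = np.argmin(dists, axis=1)
    # Count assignments per codeword; normalize so videos with different
    # numbers of trajectories produce comparable vectors.
    hist = np.bincount(assignments, minlength=codebook.shape[0]).astype(float)
    return hist / max(hist.sum(), 1.0)
```

The resulting fixed-length vector can then be fed to a classifier such as an SVM, regardless of how many trajectories the video produced.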

Two-Stream CNN [Simonyan+, NIPS14]
•  Spatial and temporal convolution
–  Spatial stream: from an RGB image
–  Temporal stream: from stacked optical flows
–  Score fusion: averaging or SVM
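A minimal sketch of the two-stream inputs and the averaging variant of score fusion, assuming each optical-flow field is stored as an (H, W, 2) array of horizontal and vertical displacements. The SVM fusion variant is not shown. The names `stack_flows`, `fuse_scores`, and the `temporal_weight` parameter are hypothetical; a weight of 0.5 is a plain average.

```python
from typing import Sequence

import numpy as np


def stack_flows(flows: Sequence[np.ndarray]) -> np.ndarray:
    """Stack L flow fields of shape (H, W, 2) into one (H, W, 2L) temporal-stream input."""
    # Channels end up ordered as x1, y1, x2, y2, ... for consecutive frames.
    return np.concatenate(flows, axis=2)


def softmax(logits: np.ndarray) -> np.ndarray:
    # Subtract the max for numerical stability before exponentiating.
    z = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)


def fuse_scores(spatial_logits: np.ndarray, temporal_logits: np.ndarray,
                temporal_weight: float = 0.5) -> np.ndarray:
    """Late fusion: weighted average of the two streams' class probabilities."""
    p_spatial = softmax(spatial_logits)
    p_temporal = softmax(temporal_logits)
    return (1.0 - temporal_weight) * p_spatial + temporal_weight * p_temporal
```

Because fusion happens on class probabilities, each stream can be trained independently and combined only at test time.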

Is background enough to classify actions?
•  RGB input is too strong!
–  The two-stream CNN [Simonyan+, NIPS14] reported that the spatial stream alone recognizes actions better than expected
•  72.4% accuracy with the spatial stream (RGB) on UCF101
•  “Human Action Recognition without Human”

Without Human?
•  Can human actions be recognized from the motion of the background alone?
[Figure: with the human visible, a Motion Descriptor predicts "Tennis Swing"; with the human hidden, can it still predict "Tennis Swing"?]

Detailed settings with and without human
•  With human and without human settings
–  Without human setting: center-blind images from UCF101
–  With human setting: the inverse of the without human setting
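$I'(x, y) = f(x, y) \cdot I(x, y)$, where the mask $f(x, y)$ splits the frame into 1/4, 1/2, 1/4 of its width and height
[Figure: Without Human Setting (f = 0 in the central region, 1 elsewhere) | With Human Setting (f = 1 in the central region, 0 elsewhere)]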
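The two masks can be sketched as follows, assuming masked pixels are simply set to zero and the 1/4 : 1/2 : 1/4 split is rounded down to whole pixels (both are assumptions; the slide gives only the ratios). The function names `center_mask` and `apply_mask` are hypothetical.

```python
import numpy as np


def center_mask(height: int, width: int, keep_center: bool) -> np.ndarray:
    """Binary mask f(x, y) over a 1/4 : 1/2 : 1/4 split of the frame.

    keep_center=False -> Without Human Setting (central region zeroed).
    keep_center=True  -> With Human Setting (only the central region kept).
    """
    top, bottom = height // 4, height - height // 4
    left, right = width // 4, width - width // 4
    center = np.zeros((height, width), dtype=bool)
    center[top:bottom, left:right] = True
    mask = center if keep_center else ~center
    return mask.astype(np.float32)


def apply_mask(frame: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """I'(x, y) = f(x, y) * I(x, y), broadcasting the mask over color channels."""
    return frame * mask[..., None] if frame.ndim == 3 else frame * mask
```

By construction the two masks are complementary: adding them gives all ones, so together they partition every frame into a "human" region and a "background" region.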

Framework
–  Baseline: Very deep two-stream CNN [Wang+, arXiv15]
–  Two different scenarios: without human and with human

Exploratory experiment
•  On UCF101
–  Very deep two-stream CNN pre-trained on UCF101
–  With/without human settings

Visual results (Without Human Setting)

Without Human
•  The concept of “Human Action Recognition without Human”
–  The accuracies with and without human are very close
•  The with human setting is +9.49% better than the without human setting
–  Current motion representations rely heavily on the background

Future work
•  This is a suggestive finding
–  We must accept it in order to realize better motion representations
–  Pure motion representation is an urgent research topic!
•  More sophisticated approaches
•  Human-only motion
