Frontiers of
Human Activity Analysis

                J. K. Aggarwal
               Michael S. Ryoo
                 Kris M. Kitani
Overview

        Human Activity Recognition



Single-layer:  Space-time, Sequential
Hierarchical:  Statistical, Syntactic, Descriptive


                                                     2
Motivation

How do we interpret a sequence of actions?




                                             3
Hierarchy
Hierarchy implies decomposition into sub-parts




                                                 4
Now we’ll cover…

        Human Activity Recognition



Single-layer:  Space-time, Sequential
Hierarchical:  Statistical, Syntactic, Descriptive


                                                     5
Syntactic
Approaches


             6
Syntactic Models


Activities as strings of symbols.




What is the underlying structure?


                                    7
Early applications to Vision
Attributed Grammar-A Tool for Combining Syntactic and Statistical Approaches to Pattern Recognition.
Tsai and Fu 1980.




                                                                                                       8
Hierarchical syntactic approach
  Useful for activities with:
    Deep hierarchical structure
    Repetitive (cyclic) structure


  Not for
    Systems with a lot of errors and uncertainty
    Activities with shallow structure



                                                    9
Basics
                  Context-Free Grammar



      Generic Language            Natural Languages


       Start Symbol (S)               Sentences


  Set of Terminal Symbols (T)           Words


Set of Non-Terminal Symbols (N)     Parts of Speech


  Set of Production Rules (P)        Syntax Rules



                                                      10
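The correspondence above can be made concrete by treating a grammar as a weighted rewriting system and sampling "sentences" from it. The sketch below is illustrative code (not from the tutorial) using the swat/flies grammar that appears on the next slide:

```python
import random

# PCFG from the "Parsing with a grammar" slide: each non-terminal maps to
# (right-hand side, probability) alternatives; lowercase strings are terminals.
RULES = {
    "S":    [(("NP", "VP"), 0.8), (("VP",), 0.2)],
    "NP":   [(("NOUN",), 0.4), (("NOUN", "PP"), 0.4), (("NOUN", "NP"), 0.2)],
    "VP":   [(("VERB",), 0.3), (("VERB", "NP"), 0.3),
             (("VERB", "PP"), 0.2), (("VERB", "NP", "PP"), 0.2)],
    "PP":   [(("PREP", "NP"), 1.0)],
    "PREP": [(("like",), 1.0)],
    "VERB": [(("swat",), 0.2), (("flies",), 0.4), (("like",), 0.4)],
    "NOUN": [(("swat",), 0.05), (("flies",), 0.45), (("ants",), 0.5)],
}

def sample(symbol, rng):
    """Expand a symbol top-down by sampling productions; terminals pass through."""
    if symbol not in RULES:                      # terminal symbol
        return [symbol]
    rhss, probs = zip(*RULES[symbol])
    rhs = rng.choices(rhss, weights=probs)[0]    # pick one production
    return [w for s in rhs for w in sample(s, rng)]

rng = random.Random(0)
for _ in range(3):
    print(" ".join(sample("S", rng)))
```

Every string the sampler emits is, by construction, grammatical; recognition runs the process in reverse, asking which derivation best explains an observed string.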
Parsing with a grammar
S → NP VP         (0.8)           PP → PREP NP   (1.0)
S → VP            (0.2)           PREP → like    (1.0)
NP → NOUN         (0.4)           VERB → swat    (0.2)
NP → NOUN PP      (0.4)           VERB → flies   (0.4)
NP → NOUN NP      (0.2)           VERB → like    (0.4)
VP → VERB         (0.3)           NOUN → swat    (0.05)
VP → VERB NP      (0.3)           NOUN → flies   (0.45)
VP → VERB PP      (0.2)           NOUN → ants    (0.5)
VP → VERB NP PP   (0.2)




 swat                     flies          like      ants
                                                          11
Parsing with a grammar
S → NP VP            (0.8)               PP → PREP NP      (1.0)
S → VP               (0.2)               PREP → like       (1.0)
NP → NOUN            (0.4)               VERB → swat       (0.2)
NP → NOUN PP         (0.4)               VERB → flies      (0.4)
NP → NOUN NP         (0.2)               VERB → like       (0.4)
VP → VERB            (0.3)               NOUN → swat       (0.05)
VP → VERB NP         (0.3)               NOUN → flies      (0.45)
VP → VERB PP         (0.2)               NOUN → ants       (0.5)
VP → VERB NP PP      (0.2)

S (0.8)
├─ NP (0.2)
│  ├─ NOUN (0.05) → swat
│  └─ NP (0.4)
│     └─ NOUN (0.45) → flies
└─ VP (0.3)
   ├─ VERB (0.4) → like
   └─ NP (0.4)
      └─ NOUN (0.5) → ants

(each node is annotated with the probability of the production applied at it)
                                                                            12
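The probability of this parse is just the product of the probabilities of the productions it uses. A minimal sketch (trees as nested tuples; only the rules this parse needs are listed):

```python
# Score a parse tree under the slide's PCFG: multiply the probability of every
# production used. Trees are nested tuples: (label, child, child, ...); a
# (label, "word") pair is a pre-terminal.
RULE_P = {
    ("S", ("NP", "VP")): 0.8,  ("NP", ("NOUN", "NP")): 0.2,
    ("NP", ("NOUN",)): 0.4,    ("VP", ("VERB", "NP")): 0.3,
    ("NOUN", ("swat",)): 0.05, ("NOUN", ("flies",)): 0.45,
    ("NOUN", ("ants",)): 0.5,  ("VERB", ("like",)): 0.4,
}

def tree_prob(tree):
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):   # pre-terminal
        return RULE_P[(label, (children[0],))]
    rhs = tuple(c[0] for c in children)
    p = RULE_P[(label, rhs)]
    for c in children:
        p *= tree_prob(c)
    return p

# The parse of "swat flies like ants" drawn on the slide:
parse = ("S",
         ("NP", ("NOUN", "swat"), ("NP", ("NOUN", "flies"))),
         ("VP", ("VERB", "like"), ("NP", ("NOUN", "ants"))))
print(tree_prob(parse))   # 0.8*0.2*0.05*0.4*0.45*0.3*0.4*0.4*0.5 ≈ 3.456e-05
```

A probabilistic parser (e.g. CKY or Earley) searches over all such trees for the one with the highest score.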
Video analysis with CFGs

     The “Inverse Hollywood problem”:
     From video to scripts and storyboards via causal analysis.
     Brand 1997




     Action Recognition using Probabilistic Parsing.
     Bobick and Ivanov 1998




     Recognizing Multitasked Activities from Video using
     Stochastic Context-Free Grammar.
     Moore and Essa 2001




                                                                  13
CFG for human activities




enter   detach   leave   enter   detach   attach    touch    touch     detach      attach       leave




                                                   M. Brand. The "Inverse Hollywood Problem":
                                                      From video to scripts and storyboards
                                                         via causal analysis. AAAI 1997.




                                                                                                        14
Parse tree
[Parse-tree figure] SCENE (Open up a PC) decomposes into IN, ACTION (Open PC), ACTION (unscrew), and OUT; intermediate nodes (ADD, MOVE, MOTION, REMOVE) bottom out in the primitive string:

enter   detach   leave   enter   detach   attach   touch   touch   detach   attach   leave

•  Deterministic low-level primitive detection
•  Deterministic parsing



M. Brand. The "Inverse Hollywood Problem": From video to scripts and storyboards via causal analysis. AAAI 1997.
                                                                                                                                         15
Stochastic CFGs
Action Recognition using Probabilistic Parsing.
          Bobick and Ivanov 1998




                                                  16
Gesture analysis with CFGs
                                       Primitive recognition with HMMs




Action Recognition using Probabilistic Parsing. Bobick and Ivanov 1998
                                                                         17
left-right




Action Recognition using Probabilistic Parsing. Bobick and Ivanov 1998
                                                                                18
up-down




Action Recognition using Probabilistic Parsing. Bobick and Ivanov 1998
                                                                            19
right-left




Action Recognition using Probabilistic Parsing. Bobick and Ivanov 1998
                                                                                20
down-up




Action Recognition using Probabilistic Parsing. Bobick and Ivanov 1998
                                                                            21
Parse Tree
S
└─ RH
   ├─ TOP
   │  └─ LR → left-right
   ├─ UD → up-down
   ├─ BOT
   │  └─ RL → right-left
   └─ DU → down-up

                                               22
Errors
                   Likelihood value over time (not discrete symbols)


          HMM a

          HMM b



                                                  Errors are inevitable…
                      but the grammar acts as a top-down constraint


Action Recognition using Probabilistic Parsing. Bobick and Ivanov 1998
                                                                           23
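The point of this slide is that the low-level HMM detectors emit a likelihood trace over time, not hard symbols. A standalone sketch of that trace, using the forward algorithm on two toy discrete HMMs (all numbers hypothetical, not from the paper):

```python
import math

def forward_loglik(obs, pi, A, B):
    """Running log-likelihood log p(o_1..o_t) of a discrete HMM, per frame t.
    pi: initial state probs, A[i][j]: transition, B[i][o]: emission."""
    alpha = [pi[i] * B[i][obs[0]] for i in range(len(pi))]
    out = [math.log(sum(alpha))]
    for o in obs[1:]:
        alpha = [B[j][o] * sum(alpha[i] * A[i][j] for i in range(len(pi)))
                 for j in range(len(pi))]
        out.append(math.log(sum(alpha)))
    return out

# Two hypothetical 2-state detectors: HMM "a" prefers symbol 0, HMM "b" symbol 1.
pi  = [0.5, 0.5]
A   = [[0.9, 0.1], [0.1, 0.9]]
B_a = [[0.8, 0.2], [0.6, 0.4]]
B_b = [[0.2, 0.8], [0.4, 0.6]]

obs = [0, 0, 0, 1, 1]
print("HMM a:", [round(v, 2) for v in forward_loglik(obs, pi, A, B_a)])
print("HMM b:", [round(v, 2) for v in forward_loglik(obs, pi, A, B_b)])
```

Neither trace is ever exactly zero, so competing detectors always overlap; the grammar's job is to arbitrate between them top-down.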
Dealing with uncertainty & errors
        Stolcke-Earley (probabilistic) parser
        SKIP rules to deal with insertion errors


               HMM a


               HMM b


               HMM c



Action Recognition using Probabilistic Parsing. Bobick and Ivanov 1998
                                                                         24
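The SKIP idea lets the parser ignore spurious inserted symbols at a cost. As a standalone sketch of that idea (not Bobick and Ivanov's parser), here is a dynamic program that matches an observed symbol stream to an expected sequence, charging a penalty for each skipped insertion:

```python
def match_with_skips(observed, expected, skip_cost=1.0):
    """Min-cost alignment that must consume all of `expected` in order, and may
    skip extra observed symbols at `skip_cost` each (insertion errors).
    Returns total skip cost, or None if `expected` cannot be matched."""
    INF = float("inf")
    m = len(expected)
    # dp[j] = min cost to have matched expected[:j] with the observed prefix so far
    dp = [0.0] + [INF] * m
    for sym in observed:
        new = [dp[0] + skip_cost] + [INF] * m         # j=0: must skip sym
        for j in range(1, m + 1):
            new[j] = dp[j] + skip_cost                # option 1: skip sym
            if sym == expected[j - 1] and dp[j - 1] < new[j]:
                new[j] = dp[j - 1]                    # option 2: consume sym
        dp = new
    return dp[m] if dp[m] < INF else None

# "x" is a spurious detection inside the gesture sequence:
print(match_with_skips(["LR", "x", "UD", "RL", "DU"],
                       ["LR", "UD", "RL", "DU"]))     # 1.0 (one skip)
print(match_with_skips(["LR", "UD"], ["LR", "RL"]))   # None (RL never observed)
```

In the SCFG setting the same effect is achieved grammatically, by adding low-probability SKIP productions rather than an explicit alignment cost.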
SCFG for Blackjack
Recognizing Multitasked Activities from Video using
       Stochastic Context-Free Grammar.
             Moore and Essa 2001




  •  Deals with more complex activities
  •  Deals with more error types

                                                      25
extracting primitive actions




                               26
Game grammar




Recognizing Multitasked Activities from Video using Stochastic Context-Free Grammar. Moore and Essa 2001
                                                                                                           27
Dealing with errors

  Ungrammatical strings cause parser to fail
  Account for errors with multiple hypotheses
     Insertion, deletion, substitution

  Issues
     How many errors should we tolerate?
     Potentially exponential hypothesis space
     Ungrammatical strings: vision problem or illegal
      activity?


                                                         28
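The "potentially exponential hypothesis space" is easy to see by enumeration: even one insertion, deletion, or substitution spawns dozens of alternative strings, and the count multiplies with each additional error. A small illustrative sketch:

```python
def one_error_variants(s, alphabet):
    """All strings reachable from s by exactly one insertion, deletion,
    or substitution over the given alphabet."""
    out = set()
    for i in range(len(s) + 1):                      # insertions
        for a in alphabet:
            out.add(s[:i] + a + s[i:])
    for i in range(len(s)):                          # deletions
        out.add(s[:i] + s[i + 1:])
    for i in range(len(s)):                          # substitutions
        for a in alphabet:
            if a != s[i]:
                out.add(s[:i] + a + s[i + 1:])
    return out

s, alphabet = "abcd", "abcd"
h1 = one_error_variants(s, alphabet)
print(len(h1))                       # dozens of hypotheses after one error
h2 = {v for u in h1 for v in one_error_variants(u, alphabet)}
print(len(h2))                       # hundreds after two -- growth compounds
```

Bounding the number of tolerated errors (or pruning by parse probability) is what keeps this search tractable.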
Observations
  CFGs good for structured activities
  Can incorporate uncertainty in observations
  Natural contextual prior for recognizing errors

  Not clear how to deal with errors
  Assumes ‘good’ action classifiers
  Need to define grammar manually

     Can we learn the grammar from data?

                                                     29
Heuristic Grammatical Induction

                                 1.  Lexicon learning
                                     •  Learn HMMs
                                     •  Cluster HMMs

                                 2.  Convert video to string

                                 3.  Learn Grammar




  Unsupervised Analysis of Human Gestures. Wang et al 2001
                                                               30
COMPRESSIVE
  a b c d a b c d b c d a b a b
                For each repeated substring, record its length and number of
                occurrences; deletion of the substring's occurrences is paired
                with insertion of a new rule, and each occurrence is rewritten
                as the new symbol.




On-Line and Off-Line Heuristics for Inferring Hierarchies of Repetitions in Sequences.
                                Nevill-Manning 2000.
                                                                                           31
example

S → a b c d a b c d b c d a b a b          (DL = 16)

A → b c d
S → a A a A A a b a b                      (DL = 14)

     Repeat until the compression gain becomes 0.
                                                    32
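The example above can be reproduced with a small greedy induction loop in the spirit of COMPRESSIVE. This is a sketch, not the paper's algorithm: DL is taken as one unit per rule plus one per right-hand-side symbol (which matches the slide's DL = 16 → 14), and occurrences are counted greedily left to right:

```python
def dl(rules):
    """Description length: one unit per rule plus one per right-hand symbol."""
    return sum(1 + len(rhs) for rhs in rules.values())

def best_substring(seq):
    """Most compressive repeated substring: replacing k non-overlapping
    occurrences of an n-symbol substring saves k*n symbols and costs k new
    symbols plus an (n+1)-unit rule, so the gain is k*n - k - (n+1)."""
    best, best_gain = None, 0
    for n in range(2, len(seq) // 2 + 1):
        for i in range(len(seq) - n + 1):
            sub = tuple(seq[i:i + n])
            k, j = 0, 0                       # greedy non-overlapping count
            while j <= len(seq) - n:
                if tuple(seq[j:j + n]) == sub:
                    k, j = k + 1, j + n
                else:
                    j += 1
            gain = k * n - k - (n + 1)
            if k >= 2 and gain > best_gain:
                best, best_gain = sub, gain
    return best

def compress(string):
    """Greedy COMPRESSIVE-style induction: replace the best substring with a
    new non-terminal until no replacement shortens the grammar."""
    rules, next_sym = {}, iter("ABCDEFGH")
    seq = list(string)
    while (sub := best_substring(seq)) is not None:
        sym = next(next_sym)
        rules[sym] = list(sub)
        out, j = [], 0
        while j < len(seq):
            if tuple(seq[j:j + len(sub)]) == sub:
                out.append(sym)
                j += len(sub)
            else:
                out.append(seq[j])
                j += 1
        seq = out
    rules["S"] = seq
    return rules

g = compress("abcdabcdbcdabab")
print(g)        # {'A': ['b','c','d'], 'S': ['a','A','a','A','A','a','b','a','b']}
print(dl(g))    # 14, down from 16 for the flat string
```

On the slide's string the best first step is A → b c d (gain 2), after which no substitution yields a positive gain, exactly as shown.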
Critical assumption
  No uncertainty
  No errors
    insertions
    deletions
    substitutions


 Can we learn grammars despite errors?


                                         33
Learning with noise
 Can we learn the basic structure of a transaction?




Recovering the basic structure of human activities from
 noisy video-based symbol strings. Kitani et al 2008.
                                                          34
extracting primitives




Recovering the basic structure of human activities from noisy video-based symbol strings. Kitani et al 2008.
                                                                                                               35
Underlying structure?

  D → a x b y c a b x c y a b c x




Recovering the basic structure of human activities from noisy video-based symbol strings. Kitani et al 2008.
                                                                                                               36
Underlying structure?

  D → a x b y c a b x c y a b c x


          D → a b c   a b c   a b c        (noise symbols x, y removed)



Recovering the basic structure of human activities from noisy video-based symbol strings. Kitani et al 2008.
                                                                                                                        37
Underlying structure?

  D → a x b y c a b x c y a b c x


          D → a b c   a b c   a b c        (noise symbols x, y removed)

          A → a b c                        D → A A A
              Simple grammar                   Efficient compression


Recovering the basic structure of human activities from noisy video-based symbol strings. Kitani et al 2008.
                                                                                                                         39
Information Theory Problem (MDL)

              Ĝ = argmin_G { DL(G) + DL(D|G) }
                  (model complexity + data compression)




Recovering the basic structure of human activities from noisy video-based symbol strings. Kitani et al 2008.
                                                                                                                            40
Information Theory Problem (MDL)

              Ĝ = argmin_G { DL(G) + DL(D|G) }
                  (model complexity + data compression)

              DL(G) = −log p(G)                        (model complexity)
                    = −log p(θ_S, G_S)
                    = −log p(θ_S | G_S) − log p(G_S)
                    = DL(θ_S | G_S) + DL(G_S)
                      (grammar parameters + grammar structure)




Recovering the basic structure of human activities from noisy video-based symbol strings. Kitani et al 2008.
                                                                                                                                   41
Information Theory Problem (MDL)

              Ĝ = argmin_G { DL(G) + DL(D|G) }
                  (model complexity + data compression)

              DL(G) = −log p(G)                        (model complexity)
                    = −log p(θ_S, G_S)
                    = −log p(θ_S | G_S) − log p(G_S)
                    = DL(θ_S | G_S) + DL(G_S)
                      (grammar parameters + grammar structure)

              DL(D|G) = −log p(D|G)
                        (data compression = negative log-likelihood,
                         computed with inside probabilities)

Recovering the basic structure of human activities from noisy video-based symbol strings. Kitani et al 2008.
                                                                                                                                   42
Minimum Description Length




Recovering the basic structure of human activities from noisy video-based symbol strings. Kitani et al 2008.
                                                                                                               43
Minimum Description Length




Recovering the basic structure of human activities from noisy video-based symbol strings. Kitani et al 2008.
                                                                                                               44
Recovering the basic structure of human activities from noisy video-based symbol strings. Kitani et al 2008.
                                                                                                               45
Recovering the basic structure of human activities from noisy video-based symbol strings. Kitani et al 2008.
                                                                                                               46
Conclusions
  Possible to learn basic structure
  Robust to errors
   (insertion, deletion, substitution)

  Need a lot of training data
  High computational complexity




                                         47
Bayesian Approaches




Infinite Hierarchical Hidden Markov Models.   The Infinite PCFG using Hierarchical Dirichlet Processes.
              Heller et al 2009.                               Liang et al 2007.




                                                                                                        48
Take home message
              Hierarchical Syntactic Models


  Useful for activities with:
    Deep hierarchical structure
    Repetitive (cyclic) structure


  Not for
    Systems with a lot of errors and uncertainty
    Activities with weak structure



                                                    49
Statistical
Approaches


               50
Using a hierarchical statistical approach

  Use when
    Low-level action detectors are noisy
    Structure of activity is sequential
    Integrating dynamics


  Not for
    Activities with deep hierarchical structure
    Activities with complex temporal structure


                                                   51
Statistical (State-based) Model
Activities as a stochastic path.




                   What are the underlying dynamics?

                                                       52
Characteristics
  Strong Markov assumption
  Strong dynamics prior
  Robust to uncertainty

  Modifications to account for
    Hierarchical structure
    Concurrent structure



                                  53
Hierarchical activities
      Problem:
  How do we model
hierarchical activities?

                           combinatorial state space!




      Solution:
 “stack” actions for
hierarchical activities

                                                      54
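The "stacking" idea can be sketched generatively: each abstract top-level state owns a sub-chain over primitives, and control pops back to the parent when the sub-chain finishes. The model below is a hypothetical toy (states, primitives, and probabilities invented for illustration), not Nguyen et al.'s HHMM:

```python
import random

# Hypothetical two-level model: each abstract state owns a sub-Markov-chain
# over primitives; when the sub-chain reaches END, control pops back to the
# parent, which moves to the next abstract state.
SUB = {
    "enter_room": {"approach":  [("open_door", 1.0)],
                   "open_door": [("walk_in", 1.0)],
                   "walk_in":   [("END", 1.0)]},
    "use_pc":     {"sit":   [("type", 1.0)],
                   "type":  [("type", 0.6), ("stand", 0.4)],
                   "stand": [("END", 1.0)]},
}
SUB_START = {"enter_room": "approach", "use_pc": "sit"}
TOP = {"enter_room": "use_pc", "use_pc": None}   # linear parent chain

def sample_activity(rng):
    """Run the parent chain, descending into each state's sub-chain (the stack)."""
    trace, state = [], "enter_room"
    while state is not None:
        prim = SUB_START[state]
        while prim != "END":
            trace.append((state, prim))
            nxts, ws = zip(*SUB[state][prim])
            prim = rng.choices(nxts, weights=ws)[0]
        state = TOP[state]                       # pop back to the parent
    return trace

for s, p in sample_activity(random.Random(0)):
    print(f"{s:>10} / {p}")
```

Inference inverts this process: the HHMM avoids the combinatorial flat state space by sharing sub-models across parents.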
Hierarchical hidden Markov model




 Learning and Detecting Activities from Movement Trajectories Using the
        Hierarchical Hidden Markov Models. Nguyen et al 2005
                                                                          55
Context-free activity grammar




Learning and Detecting Activities from Movement Trajectories Using the Hierarchical Hidden Markov Models. Nguyen et al 2005
                                                                                                                              56
Context-free activity grammar




Learning and Detecting Activities from Movement Trajectories Using the Hierarchical Hidden Markov Models. Nguyen et al 2005
                                                                                                                              57
Observations
  Tree structures useful for hierarchies
  Tight integration of trajectories with
   abstract semantic states

  Activities are not always a single
   sequence
    (i.e., they sometimes happen in parallel)



                                              58
Concurrent activities
     Problem:
 How do we model
concurrent activities?

                         combinatorial state space!




     Solution:
“stand-up” model for
 concurrent activities

                                                    59
Propagation network




Propagation Networks for Recognition of Partially Ordered Sequential Action. Shi et al 2004
                                                                                              60
Propagation Networks for Recognition of Partially Ordered Sequential Action. Shi et al 2004
                                                                                              61
temporal inference




Inference by standing the state transition model on its side
                                                               62
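The core constraint a propagation network enforces is a partial order over actions: some pairs must be ordered, others may interleave freely. A standalone sketch of that constraint check (actions and constraints are hypothetical, not from Shi et al.):

```python
def respects_partial_order(sequence, before):
    """True iff every (a, b) constraint holds: the first occurrence of a
    precedes the first occurrence of b in the observed sequence."""
    first = {}
    for i, act in enumerate(sequence):
        first.setdefault(act, i)
    return all(a in first and b in first and first[a] < first[b]
               for a, b in before)

# Hypothetical partial order: water must be boiled, and the cup fetched,
# before pouring -- but boiling and fetching may happen in either order.
constraints = [("boil_water", "pour"), ("fetch_cup", "pour")]

print(respects_partial_order(
    ["fetch_cup", "boil_water", "pour"], constraints))   # True
print(respects_partial_order(
    ["boil_water", "fetch_cup", "pour"], constraints))   # True (order swapped)
print(respects_partial_order(
    ["pour", "boil_water", "fetch_cup"], constraints))   # False
```

A propagation network goes further by scoring such sequences probabilistically under noisy observations, rather than accepting or rejecting them outright.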
Inferring structure (storylines)
              Understanding Videos, Constructing Plots –
  Learning a Visually Grounded Storyline Model from Annotated Videos
             Gupta, Srinivasan, Shi and Davis CVPR 2009




            Learn AND-OR graphs from weakly labeled data

                                                                       63
Scripts from structure




Understanding Videos, Constructing Plots - Learning a Visually Grounded Storyline Model from Annotated Videos.
Gupta, Srinivasan, Shi and Davis CVPR 2009
                                                                                                                 64
Take home message
              Hierarchical statistical model


  Use when
    Low-level action detectors are noisy
    Structure of activity is sequential
    Integrating dynamics


  Not for
    Activities with deep hierarchical structure
    Activities with complex temporal structure


                                                   65
Contrasting hierarchical approaches


                 Actions as:             Activities as:   Model      Characteristic

 Statistical     probabilistic states    paths            DBN        Robust to uncertainty

 Syntactic       discrete symbols        strings          CFG        Describes deep hierarchy

 Descriptive     logical relationships   sets             CFG, MLN   Encodes complex logic



                                                                         66
References
                          (not included in ACM survey paper)


  W. Tsai and K.S. Fu. Attributed Grammar-A Tool for Combining Syntactic and
   Statistical Approaches to Pattern Recognition. SMC 1980.
  M. Brand. The "Inverse Hollywood Problem": From video to scripts and
   storyboards via causal analysis. AAAI 1997.
  T. Wang, H. Shum, Y. Xu, N. Zheng. Unsupervised Analysis of Human Gestures.
   PCM 2001.
  C.G. Nevill-Manning, I.H. Witten. On-Line and Off-Line Heuristics for
   Inferring Hierarchies of Repetitions in Sequences. Proc. IEEE 2000.
  K. Heller, Y.W. Teh and D. Gorur. Infinite Hierarchical Hidden Markov Models.
   AISTATS 2009.
  P. Liang, S. Petrov, M. Jordan, D. Klein. The Infinite PCFG using
   Hierarchical Dirichlet Processes. EMNLP 2007.
  A. Gupta, N. Srinivasan, J. Shi and L. Davis. Understanding Videos,
   Constructing Plots - Learning a Visually Grounded Storyline Model from
   Annotated Videos. CVPR 2009.


                                                                                67


cvpr2011: human activity recognition - part 4: syntactic

  • 1. Frontiers of Human Activity Analysis J. K. Aggarwal Michael S. Ryoo Kris M. Kitani
  • 2. Overview Human Activity Recognition Single-layer Hierarchical Space-time Statistical Syntactic Sequential Descriptive 2
  • 3. Motivation How do we interpret a sequence of actions? 3
  • 5. Now we’ll cover… Human Activity Recognition Single-layer Hierarchical Space-time Statistical Syntactic Sequential Descriptive 5
  • 7. Syntactic Models Activities as strings of symbols. What is the underlying structure? 7
  • 8. Early applications to Vision Tsai and Fu 1980. Attributed Grammar-A Tool for Combining Syntactic and Statistical Approaches to Pattern Recognition. 8
  • 9. Hierarchical syntactic approach   Useful for activities with:   Deep hierarchical structure   Repetitive (cyclic) structure   Not for   Systems with a lot of errors and uncertainty   Activities with shallow structure 9
  • 10. Basics Context-Free Grammar Generic Language Natural Languages Start Symbol (S) Sentences Set of Terminal Symbols (T) Words Set of Non-Terminal Symbols (N) Parts of Speech Set of Production Rules (P) Syntax Rules 10
  • 11. Parsing with a grammar S → NP VP (0.8) PP → PREP NP (1.0) S → VP (0.2) PREP → like (1.0) NP → NOUN (0.4) VERB → swat (0.2) NP → NOUN PP (0.4) VERB → flies (0.4) NP → NOUN NP (0.2) VERB → like (0.4) VP → VERB (0.3) NOUN → swat (0.05) VP → VERB NP (0.3) NOUN → flies (0.45) VP → VERB PP (0.2) NOUN → ants (0.5) VP → VERB NP PP (0.2) swat flies like ants 11
  • 12. Parsing with a grammar S → NP VP (0.8) PP → PREP NP (1.0) S → VP (0.2) PREP → like (1.0) NP → NOUN (0.4) VERB → swat (0.2) NP → NOUN PP (0.4) VERB → flies (0.4) NP → NOUN NP (0.2) VERB → like (0.4) VP → VERB (0.3) NOUN → swat (0.05) VP → VERB NP (0.3) NOUN → flies (0.45) VP → VERB PP (0.2) NOUN → ants (0.5) VP → VERB NP PP (0.2) S NP (0.8) VP (0.2) (0.3) NOUN NP NP (0.4) (0.4) NOUN VERB NOUN (0.05) (0.45) (0.4) (0.5) swat flies like ants 12
  • 13. Video analysis with CFGs The “Inverse Hollywood problem”: From video to scripts and storyboards via causal analysis. Brand 1997 Action Recognition using Probabilistic Parsing. Bobick and Ivanov 1998 Recognizing Multitasked Activities from Video using Stochastic Context-Free Grammar. Moore and Essa 2001 13
  • 14. CFG for human activities enter detach leave enter detach attach touch touch detach attach leave M. Brand. The "Inverse Hollywood Problem": From video to scripts and storyboards via causal analysis. AAAI 1997. 14
  • 15. Parse tree SCENE (Open up a PC) IN ACTION (Open PC) ACTION (unscrew) OUT OUT IN MOVE REMOVE ADD ADD MOTION MOTION enter detach leave enter detach attach touch touch detach attach leave •  Deterministic low-level primitive detection •  Deterministic parsing M. Brand. The "Inverse Hollywood Problem": From video to scripts and storyboards via causal analysis. AAAI 1997. 15
  • 16. Stochastic CFGs Action Recognition using Probabilistic Parsing. Bobick and Ivanov 1998 16
  • 17. Gesture analysis with CFGs Primitive recognition with HMMs Action Recognition using Probabilistic Parsing. Bobick and Ivanov 1998 17
  • 18. left-right Action Recognition using Probabilistic Parsing. Bobick and Ivanov 1998 18
  • 19. up-down Action Recognition using Probabilistic Parsing. Bobick and Ivanov 1998 19
  • 20. right-left Action Recognition using Probabilistic Parsing. Bobick and Ivanov 1998 20
  • 21. down-up Action Recognition using Probabilistic Parsing. Bobick and Ivanov 1998 21
• 22. Parse tree
  [parse-tree figure: S → RH, with nonterminals TOP, BOT, UD, DU, LR, RL over the gesture primitives left-right, up-down, right-left, down-up] 22
  • 23. Errors Likelihood value over time (not discrete symbols) HMM a HMM b Errors are inevitable… but the grammar acts as a top-down constraint Action Recognition using Probabilistic Parsing. Bobick and Ivanov 1998 23
• 24. Dealing with uncertainty & errors   Stolcke-Earley (probabilistic) parser   SKIP rules to deal with insertion errors HMM a HMM b HMM c Action Recognition using Probabilistic Parsing. Bobick and Ivanov 1998 24
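The SKIP idea can be sketched as a small dynamic program: extra observed symbols (insertion errors) may be skipped at a fixed log-penalty, while the expected primitive sequence must still be matched in order. This is a deliberate simplification of the Stolcke-Earley machinery, with the function name and penalty value chosen for illustration:

```python
def skip_match(expected, observed, skip_logp=-2.0):
    """Best log-score for matching an expected primitive sequence
    against observed symbols, where extra observed symbols may be
    skipped at a fixed log-penalty. Returns None when the expected
    sequence cannot be matched at all (deletion errors are not
    modelled, mirroring what simple SKIP rules handle)."""
    NEG = float("-inf")
    n, m = len(expected), len(observed)
    # dp[i][j]: best score matching expected[:i] against observed[:j]
    dp = [[NEG] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if dp[i][j] == NEG:
                continue
            if j < m:  # skip an inserted observation
                dp[i][j + 1] = max(dp[i][j + 1], dp[i][j] + skip_logp)
            if i < n and j < m and expected[i] == observed[j]:
                dp[i + 1][j + 1] = max(dp[i + 1][j + 1], dp[i][j])
    best = dp[n][m]
    return None if best == NEG else best

print(skip_match(list("abc"), list("axbc")))  # one insertion skipped: -2.0
print(skip_match(list("abc"), list("ab")))    # a deletion: None
```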
  • 25. SCFG for Blackjack Recognizing Multitasked Activities from Video using Stochastic Context-Free Grammar. Moore and Essa 2001 •  Deals with more complex activities •  Deals with more error types 25
  • 27. Game grammar Recognizing Multitasked Activities from Video using Stochastic Context-Free Grammar. Moore and Essa 2001 27
• 28. Dealing with errors   Ungrammatical strings cause the parser to fail   Account for errors with multiple hypotheses   Insertion, deletion, substitution   Issues   How many errors should we tolerate?   Potentially exponential hypothesis space   Ungrammatical strings: vision problem or illegal activity? 28
  • 29. Observations   CFGs good for structured activities   Can incorporate uncertainty in observations   Natural contextual prior for recognizing errors   Not clear how to deal with errors   Assumes ‘good’ action classifiers   Need to define grammar manually Can we learn the grammar from data? 29
• 30. Heuristic Grammatical Induction
  1.  Lexicon learning
      •  Learn HMMs
      •  Cluster HMMs
  2.  Convert video to string
  3.  Learn grammar
  Unsupervised Analysis of Human Gestures. Wang et al 2001 30
• 31. Compressive heuristic
  [figure: a repeated substring of a b c d a b c d b c d a b a b is replaced by a new symbol via a new rule; the saving depends on the substring's length and its number of occurrences]
  On-Line and Off-Line Heuristics for Inferring Hierarchies of Repetitions in Sequences. Nevill-Manning 2000. 31
• 32. example
  S → a b c d a b c d b c d a b a b   (DL = 16)
  A → b c d
  S → a A a A A a b a b               (DL = 14)
  Repeat until compression becomes 0. 32
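A toy version of this greedy loop, assuming a simplified savings measure (symbols removed minus symbols added) in place of the slide's exact DL accounting, so the numbers differ slightly from the 16 → 14 on the slide:

```python
def nonoverlap_count(seq, sub):
    """Count non-overlapping occurrences of sub in seq."""
    count = i = 0
    k = len(sub)
    while i <= len(seq) - k:
        if seq[i:i + k] == sub:
            count += 1
            i += k
        else:
            i += 1
    return count

def compress_once(seq, name):
    """Replace the repeated substring that saves the most symbols with a
    new nonterminal. Returns (new_seq, rule_body), or None when no
    replacement helps (compression has reached 0)."""
    best, best_sub = 0, None
    for k in range(2, len(seq) // 2 + 1):
        for i in range(len(seq) - k + 1):
            sub = seq[i:i + k]
            n = nonoverlap_count(seq, sub)
            saving = n * k - (n + k)  # symbols removed minus symbols added
            if n >= 2 and saving > best:
                best, best_sub = saving, sub
    if best_sub is None:
        return None
    out, i, k = [], 0, len(best_sub)
    while i < len(seq):
        if seq[i:i + k] == best_sub:
            out.append(name)
            i += k
        else:
            out.append(seq[i])
            i += 1
    return out, best_sub

seq = "a b c d a b c d b c d a b a b".split()
rules = {}
name = "A"
while (step := compress_once(seq, name)) is not None:
    seq, rules[name] = step
    name = chr(ord(name) + 1)
print(" ".join(seq), rules)  # a A a A A a b a b, with A -> b c d
```

On the slide's string this picks out b c d first, exactly as shown, and then stops because no further substitution saves anything.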
  • 33. Critical assumption   No uncertainty   No errors   insertions   deletions   substitution Can we learn grammars despite errors? 33
  • 34. Learning with noise Can we learn the basic structure of a transaction? Recovering the basic structure of human activities from noisy video-based symbol strings. Kitani et al 2008. 34
  • 35. extracting primitives Recovering the basic structure of human activities from noisy video-based symbol strings. Kitani et al 2008. 35
  • 36. Underlying structure? D → a x b y c a b x c y a b c x Recovering the basic structure of human activities from noisy video-based symbol strings. Kitani et al 2008. 36
  • 37. Underlying structure? D → a x b y c a b x c y a b c x D→a b c a b c a b c Recovering the basic structure of human activities from noisy video-based symbol strings. Kitani et al 2008. 37
  • 38. Underlying structure? D → a x b y c a b x c y a b c x D→a b c a b c a b c Recovering the basic structure of human activities from noisy video-based symbol strings. Kitani et al 2008. 38
  • 39. Underlying structure? D → a x b y c a b x c y a b c x D→a b c a b c a b c A→a b c D → A A A Simple grammar Efficient compression Recovering the basic structure of human activities from noisy video-based symbol strings. Kitani et al 2008. 39
• 40. Information Theory Problem (MDL)
  Ĝ = argmin_G { DL(G) + DL(D|G) }   (model complexity + data compression)
  Recovering the basic structure of human activities from noisy video-based symbol strings. Kitani et al 2008. 40
• 41. Information Theory Problem (MDL)
  Ĝ = argmin_G { DL(G) + DL(D|G) }   (model complexity + data compression)
  Model complexity:
    DL(G) = − log p(G)
          = − log p(θ_S, G_S)
          = − log p(θ_S | G_S) − log p(G_S)
          = DL(θ_S | G_S) + DL(G_S)
    (grammar parameters + grammar structure)
  Recovering the basic structure of human activities from noisy video-based symbol strings. Kitani et al 2008. 41
• 42. Information Theory Problem (MDL)
  Ĝ = argmin_G { DL(G) + DL(D|G) }   (model complexity + data compression)
  Model complexity:
    DL(G) = − log p(G)
          = − log p(θ_S, G_S)
          = − log p(θ_S | G_S) − log p(G_S)
          = DL(θ_S | G_S) + DL(G_S)
    (grammar parameters + grammar structure)
  Data compression:
    DL(D|G) = − log p(D|G)
    (likelihood, via inside probabilities)
  Recovering the basic structure of human activities from noisy video-based symbol strings. Kitani et al 2008. 42
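The tradeoff can be illustrated numerically on the noisy string from the earlier slides. As a crude stand-in for the paper's description lengths, the sketch below costs a grammar by its symbol count and costs the data by the edit distance between the grammar's expansion and the observed string; both stand-ins are my simplifications, not the paper's actual encoding:

```python
def edit_distance(a, b):
    """Levenshtein distance, used here as a rough proxy for
    -log p(D|G): each insertion/deletion/substitution costs 1."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[m][n]

data = "a x b y c a b x c y a b c x".split()

# Candidate 1: memorize the noisy string exactly (perfect fit, big model).
g1_size, g1_fit = len(data), 0

# Candidate 2: A -> a b c, D -> A A A (expands to a b c a b c a b c).
g2_size = 6  # 3 symbols in each of the two rule bodies
g2_fit = edit_distance("a b c a b c a b c".split(), data)

print(g1_size + g1_fit, g2_size + g2_fit)  # 14 11: MDL prefers the compact grammar
```

Even under this crude costing, the compact grammar wins: paying 5 edit units for the noise symbols is cheaper than memorizing them.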
  • 43. Minimum Description Length Recovering the basic structure of human activities from noisy video-based symbol strings. Kitani et al 2008. 43
  • 44. Minimum Description Length Recovering the basic structure of human activities from noisy video-based symbol strings. Kitani et al 2008. 44
  • 45. Recovering the basic structure of human activities from noisy video-based symbol strings. Kitani et al 2008. 45
  • 46. Recovering the basic structure of human activities from noisy video-based symbol strings. Kitani et al 2008. 46
  • 47. Conclusions   Possible to learn basic structure   Robust to errors (insertion, deletion, substitution)   Need a lot of training data   Computational complexity 47
• 48. Bayesian Approaches
  Infinite Hierarchical Hidden Markov Models. Heller et al 2009.
  The Infinite PCFG using Hierarchical Dirichlet Processes. Liang et al 2007. 48
  • 49. Take home message Hierarchical Syntactic Models   Useful for activities with:   Deep hierarchical structure   Repetitive (cyclic) structure   Not for   Systems with a lot of errors and uncertainty   Activities with weak structure 49
  • 51. Using a hierarchical statistical approach   Use when   Low-level action detectors are noisy   Structure of activity is sequential   Integrating dynamics   Not for   Activities with deep hierarchical structure   Activities with complex temporal structure 51
  • 52. Statistical (State-based) Model Activities as a stochastic path. What are the underlying dynamics? 52
• 53. Characteristics   Strong Markov assumption   Strong dynamics prior   Robust to uncertainty   Modifications to account for   Hierarchical structure   Concurrent structure 53
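The robustness to uncertainty comes from marginalizing over all state paths rather than committing to one. A minimal forward-algorithm sketch, using a hypothetical 2-state left-to-right model whose numbers are purely illustrative:

```python
import math

def forward_loglik(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM,
    computed with the forward algorithm (states/symbols are indices)."""
    n = len(pi)
    alpha = [pi[s] * B[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[sp] * A[sp][s] for sp in range(n)) * B[s][o]
                 for s in range(n)]
    return math.log(sum(alpha))

# Hypothetical left-to-right model: state 0 mostly emits symbol 0,
# state 1 mostly emits symbol 1, transitions only move forward.
pi = [1.0, 0.0]
A = [[0.7, 0.3], [0.0, 1.0]]
B = [[0.9, 0.1], [0.1, 0.9]]

clean = forward_loglik([0, 0, 1, 1], pi, A, B)
noisy = forward_loglik([0, 1, 0, 1], pi, A, B)  # one out-of-place symbol
print(clean, noisy)
```

The noisy sequence still gets a finite (if lower) likelihood: the emission probabilities absorb the mislabeled frame instead of causing a hard parse failure, which is exactly the contrast with purely symbolic grammars.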
• 54. Hierarchical activities Problem: How do we model hierarchical activities? combinatorial state space! Solution: “stack” actions for hierarchical activities 54
  • 55. Hierarchical hidden Markov model Learning and Detecting Activities from Movement Trajectories Using the Hierarchical Hidden Markov Models. Nguyen et al 2005 55
  • 56. Context-free activity grammar Learning and Detecting Activities from Movement Trajectories Using the Hierarchical Hidden Markov Models. Nguyen et al 2005 56
  • 57. Context-free activity grammar Learning and Detecting Activities from Movement Trajectories Using the Hierarchical Hidden Markov Models. Nguyen et al 2005 57
• 58. Observations   Tree structures useful for hierarchies   Tight integration of trajectories with abstract semantic states   Activities are not always a single sequence (i.e., they sometimes happen in parallel) 58
• 59. Concurrent activities Problem: How do we model concurrent activities? combinatorial state space! Solution: “stand-up” model for concurrent activities 59
  • 60. Propagation network Propagation Networks for Recognition of Partially Ordered Sequential Action. Shi et al 2004 60
  • 61. Propagation Networks for Recognition of Partially Ordered Sequential Action. Shi et al 2004 61
  • 62. temporal inference Inference by standing the state transition model on its side 62
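The “partially ordered” part can be sketched independently of the probabilistic machinery: a propagation network constrains which actions must precede which, while unrelated actions may interleave freely. A minimal consistency check along those lines, with a hypothetical lab-protocol example; real propagation networks additionally model durations and observation probabilities, which this omits:

```python
def respects_partial_order(sequence, constraints):
    """Check that an observed action sequence is consistent with a set
    of (before, after) ordering constraints, i.e. a partial order:
    actions not related by any constraint may occur in any order."""
    seen = set()
    pending = {}  # action -> set of actions that must precede it
    for before, after in constraints:
        pending.setdefault(after, set()).add(before)
    for action in sequence:
        if not pending.get(action, set()) <= seen:
            return False  # a required predecessor has not happened yet
        seen.add(action)
    return True

# Hypothetical protocol: "measure" must follow both "pour" and "stir",
# but "pour" and "stir" may happen in either order.
constraints = {("pour", "measure"), ("stir", "measure")}
print(respects_partial_order(["stir", "pour", "measure"], constraints))  # True
print(respects_partial_order(["pour", "measure", "stir"], constraints))  # False
```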
  • 63. Inferring structure (storylines) Understanding Videos, Constructing Plots – Learning a Visually Grounded Storyline Model from Annotated Videos Gupta, Srinivasan, Shi and Davis CVPR 2009 Learn AND-OR graphs from weakly labeled data 63
  • 64. Scripts from structure Understanding Videos, Constructing Plots - Learning a Visually Grounded Storyline Model from Annotated Videos. Gupta, Srinivasan, Shi and Davis CVPR 2009 64
  • 65. Take home message Hierarchical statistical model   Use when   Low-level action detectors are noisy   Structure of activity is sequential   Integrating dynamics   Not for   Activities with deep hierarchical structure   Activities with complex temporal structure 65
• 66. Contrasting hierarchical approaches
              Actions as:             Activities as:   Model      Characteristic
  Statistical probabilistic states    paths            DBN        Robust to uncertainty
  Syntactic   discrete symbols        strings          CFG        Describes deep hierarchy
  Descriptive logical relationships   sets             CFG, MLN   Encodes complex logic
  66
• 67. References (not included in ACM survey paper)
  W. Tsai and K.S. Fu. Attributed Grammar-A Tool for Combining Syntactic and Statistical Approaches to Pattern Recognition. SMC 1980.
  M. Brand. The "Inverse Hollywood Problem": From video to scripts and storyboards via causal analysis. AAAI 1997.
  T. Wang, H. Shum, Y. Xu, N. Zheng. Unsupervised Analysis of Human Gestures. PCM 2001.
  C.G. Nevill-Manning, I.H. Witten. On-Line and Off-Line Heuristics for Inferring Hierarchies of Repetitions in Sequences. Proceedings of the IEEE, 2000.
  K. Heller, Y.W. Teh and D. Gorur. Infinite Hierarchical Hidden Markov Models. AISTATS 2009.
  P. Liang, S. Petrov, M. Jordan, D. Klein. The Infinite PCFG using Hierarchical Dirichlet Processes. EMNLP 2007.
  A. Gupta, N. Srinivasan, J. Shi and L. Davis. Understanding Videos, Constructing Plots - Learning a Visually Grounded Storyline Model from Annotated Videos. CVPR 2009. 67