6.870 Grounding object recognition and scene understanding
Wednesdays 1-4pm
Room 13-1143
Instructor: Antonio Torralba
Email: torralba@csail.mit.edu

http://people.csail.mit.edu/torralba/courses/6.870/6.870.recognition.htm
Some slides are borrowed from other classes (see links on the course
web site). Let me know if I forget to give credit to the right people.
http://groups.csail.mit.edu/vision/courses/6.869/
Grading

•  Class participation: 20%

•  Paper presentations: 40%

•  Course project: 40%
Course project
•  Topics for projects: a project can derive from one
   of the papers studied or from your own
   research.

•  Work individually or in pairs.

•  Results written up as a 4-page paper in
   CVPR format

•  Short presentation at the end of the
   semester
Paper presentations (40%)
Email me at the end of the class to schedule the next week. We will
  first decide together how to structure the week.

•  Presenter:
    –  Present the key ideas, background material, and technical details.
    –  Show me the slides two days before the class.
    –  Test the basic ideas of the paper(s), using code available online or
       writing toy code.
    –  Create toy test problems that reveal something about the algorithm.
    –  Offer constructive criticism.
Readings
6.870 Grounding object recognition and scene understanding

Lecture 1
Class goals and a short introduction
What is vision?
•  What does it mean, to see? “to know what is where by looking”.
•  How to discover from images what is present in the world, where things are, what actions are taking place.

from Marr, 1982
The importance of images
Some images are more important than others

“Dora Maar au Chat”
Pablo Picasso, 1941

100 million $
Why is vision hard?
The structure of ambient light
The Plenoptic Function
Adelson & Bergen, 91

The intensity P can be parameterized as:

P(θ, φ, λ, t, X, Y, Z)
“The complete set of all convergence points constitutes the permanent possibilities
of vision.” Gibson
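The parameterization above can be made concrete with a toy sketch. The scene function below is entirely made up (`toy_plenoptic`, its constants, and the `snapshot` helper are illustrative assumptions, not anything from Adelson & Bergen); the point is only that a grayscale picture is a 2D slice of P obtained by fixing the viewpoint (X, Y, Z), the time t, and the wavelength λ, and sweeping the viewing angles (θ, φ).

```python
import math

def toy_plenoptic(theta, phi, wavelength, t, x, y, z):
    """Hypothetical intensity P(theta, phi, lambda, t, X, Y, Z) for a made-up scene."""
    # Smooth pattern over viewing direction that shifts with time and viewpoint.
    direction_term = 0.5 + 0.5 * math.cos(theta + 0.1 * x + 0.2 * t) * math.cos(phi + 0.1 * y + 0.05 * z)
    # Arbitrary spectral falloff centered at 550 nm.
    spectral_term = math.exp(-((wavelength - 550.0) / 100.0) ** 2)
    return direction_term * spectral_term

def snapshot(viewpoint, t, wavelength, rows=4, cols=6, fov=math.pi / 3):
    """Sample a 2D image: fix (X, Y, Z), t and lambda, sweep theta and phi."""
    x, y, z = viewpoint
    image = []
    for r in range(rows):
        phi = -fov / 2 + fov * r / (rows - 1)
        row = []
        for c in range(cols):
            theta = -fov / 2 + fov * c / (cols - 1)
            row.append(toy_plenoptic(theta, phi, wavelength, t, x, y, z))
        image.append(row)
    return image

# Two snapshots taken from different viewpoints at the same time and wavelength.
image_a = snapshot((0.0, 0.0, 0.0), t=0.0, wavelength=550.0)
image_b = snapshot((1.0, 0.0, 0.0), t=0.0, wavelength=550.0)
```

Moving the viewpoint, changing t, or changing λ selects a different slice of the same seven-dimensional function, which is one way to read why measuring light alone leaves so much of the scene undetermined.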
MIT6.870 Grounding Object Recognition and Scene Understanding: lecture 1
Why is vision hard?
Measuring light vs. measuring scene properties

We perceive two squares, one on top of the other.
Measuring light vs. measuring scene properties

by Roger Shepard (“Turning the Tables”)

Depth processing is automatic, and we cannot shut it down…
Measuring light vs. measuring scene properties

(c) 2006 Walt Anthony
Assumptions can be wrong

Ames room
By Aude Oliva
Why is vision hard?
Some things have strong variations in appearance
Some things know that you have eyes

Brady, M. J., & Kersten, D. (2003). Bootstrapped learning of novel objects. J Vis, 3(6), 413-422
A short history of vision
The early optimism
The crisis of the 80’s
Object recognition
Is it really so hard?

Yes, object recognition is hard…
(or at least it seems so for now…)
Challenges 1: viewpoint variation




Michelangelo 1475-1564
Challenges 2: illumination




                             slide credit: S. Ullman
Challenges 3: occlusion




         Magritte, 1957
Challenges 4: scale
Challenges 5: deformation




                            Xu, Beihong 1943
Challenges 6: background clutter




      Klimt, 1913
Challenges 7: intra-class variation
Challenges




Brady, M. J., & Kersten, D. (2003). Bootstrapped learning of novel objects. J Vis, 3(6), 413-422
Discover the camouflaged object




Brady, M. J., & Kersten, D. (2003). Bootstrapped learning of novel objects. J Vis, 3(6), 413-422
Any guesses?
So, let’s make the problem simpler:
Block world




Nice framework to develop fancy math, but too far from reality…
                                           Object Recognition in the Geometric Era:
                                           a Retrospective. Joseph L. Mundy. 2006
Binford and generalized cylinders




                                 Object Recognition in the Geometric Era:
                                 a Retrospective. Joseph L. Mundy. 2006
Recognition by components



Irving Biederman
Recognition-by-Components: A Theory of Human Image Understanding.
Psychological Review, 1987.
Recognition by components
The fundamental assumption of the proposed theory, recognition-by-components (RBC), is that a modest set of generalized-cone components, called geons (N = 36), can be derived from contrasts of five readily detectable properties of edges in a two-dimensional image: curvature, collinearity, symmetry, parallelism, and cotermination.

The “contribution lies in its proposal for a particular vocabulary of components derived from perceptual mechanisms and its account of how an arrangement of these components can access a representation of an object in memory.”
A do-it-yourself example

1)  We recognize that this object is not anything we know
2)  We can split this object into parts that everybody will agree on
3)  We can see how it resembles something familiar: “a hot dog cart”


“The naive realism that emerges in descriptions of nonsense objects may be
   reflecting the workings of a representational system by which objects are
   identified.”
Stages of processing




“Parsing is performed, primarily at concave regions, simultaneously with a
detection of nonaccidental properties.”
Nonaccidental properties
Certain properties of edges in a two-dimensional image are taken by the visual
system as strong evidence that the edges in the three-dimensional world contain those
same properties.

Nonaccidental properties (Witkin & Tenenbaum, 1983): properties that would only rarely be
produced by accidental alignments of viewpoint and object features, and that consequently are
generally unaffected by slight variations in viewpoint.

                                         image




                                                          ?
Examples:
•  Collinearity
•  Smoothness
•  Symmetry
•  Parallelism
•  Cotermination
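The idea of a nonaccidental property can be checked numerically for the first example in the list. The sketch below is an illustration under stated assumptions: the pinhole camera model, the `project` and `is_collinear_2d` helpers, and the chosen points and angles are all hypothetical, not taken from the slides. Three collinear 3D points project to collinear image points from every viewpoint tried, while three non-collinear 3D points look collinear only from one special ("accidental") viewpoint.

```python
import math

def project(point, yaw, focal=1.0, depth_offset=5.0):
    """Pinhole projection after rotating the scene by `yaw` about the vertical axis."""
    x, y, z = point
    xr = math.cos(yaw) * x + math.sin(yaw) * z
    zr = -math.sin(yaw) * x + math.cos(yaw) * z + depth_offset
    return (focal * xr / zr, focal * y / zr)

def is_collinear_2d(p, q, r, tol=1e-9):
    """True when the triangle p, q, r has (numerically) zero area."""
    area2 = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return abs(area2) < tol

def looks_collinear(points, yaw):
    p, q, r = (project(pt, yaw) for pt in points)
    return is_collinear_2d(p, q, r)

on_a_line = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (2.0, 2.0, 2.0)]
off_a_line = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (2.0, 2.0, 0.0)]

# Collinear in 3D: collinear in the image for every viewpoint tried.
stable = all(looks_collinear(on_a_line, yaw) for yaw in (-0.5, -0.2, 0.1, 0.4))
# Not collinear in 3D: aligned only from the accidental viewpoint yaw = 0.
accidental = looks_collinear(off_a_line, 0.0)
generic = looks_collinear(off_a_line, 0.3)
```

With these inputs `stable` and `accidental` are true and `generic` is false: a slight change of viewpoint destroys the accidental alignment but leaves the true collinearity intact, which is why image collinearity is taken as evidence of collinearity in the world.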
From generalized cylinders to GEONS




“From variation over only two or three levels in the nonaccidental relations of four
attributes of generalized cylinders, a set of 36 GEONS can be generated.”
 Geons represent a restricted form of generalized cylinders.
Objects and their geons
Scenes and geons




                      Mezzanotte & Biederman
The importance of spatial arrangement
Parts and Structure approaches
With a different perspective, these models focused more on the
   geometry than on defining the constituent elements:

•    Fischler & Elschlager 1973
•    Yuille ’91
•    Brunelli & Poggio ’93
•    Lades, v.d. Malsburg et al. ’93
•    Cootes, Lanitis, Taylor et al. ’95
•    Amit & Geman ’95, ’99
•    Perona et al. ’95, ’96, ’98, ’00, ’03, ’04, ’05
•    Felzenszwalb & Huttenlocher ’00, ’04
•    Crandall & Huttenlocher ’05, ’06
•    Leibe & Schiele ’03, ’04
•    Many papers since 2000

Figure from [Fischler & Elschlager 73]
But, despite promising initial results… things did not work out so well (lack of data, lack of processing power, lack of reliable methods for low-level and mid-level vision).

Instead, a different way of thinking about object detection started making some progress: learning-based approaches and classifiers, which ignored low-level and mid-level vision.

Maybe the time has come to return to some of the earlier models, more grounded in intuitions about visual perception.
Renewed optimism
Neocognitron
          Fukushima (1980). Hierarchical multilayered neural network




S-cells work as feature-extracting cells. They resemble simple cells of the
primary visual cortex in their response.
C-cells, which resemble complex cells in the visual cortex, are inserted in the
network to allow for positional errors in the features of the stimulus. The input
connections of C-cells, which come from S-cells of the preceding layer, are fixed
and invariable. Each C-cell receives excitatory input connections from a group
of S-cells that extract the same feature, but from slightly different positions. The
C-cell responds if at least one of these S-cells yields an output.
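The S-cell / C-cell description above can be sketched in one dimension. This is a binary simplification for illustration only, not Fukushima's actual formulation: the function names `s_layer` and `c_layer`, the exact-match S-cell, and the non-overlapping pooling groups are all assumptions. Each S-cell fires when its local window matches a template; each C-cell fires if at least one S-cell in its group fires, which amounts to a logical OR (a max over binary responses).

```python
def s_layer(signal, template):
    """S-cells: one feature detector per position, firing on an exact template match."""
    k = len(template)
    return [1 if signal[i:i + k] == template else 0
            for i in range(len(signal) - k + 1)]

def c_layer(s_out, pool):
    """C-cells: each one fires if at least one S-cell in its group of `pool` fires."""
    return [max(s_out[i:i + pool]) for i in range(0, len(s_out), pool)]

template = [1, 0, 1]
signal_a = [0, 1, 0, 1, 0, 0, 0, 0]   # feature starts at position 1
signal_b = [0, 0, 1, 0, 1, 0, 0, 0]   # same feature shifted by one position

s_a, s_b = s_layer(signal_a, template), s_layer(signal_b, template)
c_a, c_b = c_layer(s_a, pool=3), c_layer(s_b, pool=3)
```

Here `s_a` and `s_b` differ (the S-cell that fires moves with the feature), while `c_a` and `c_b` are both `[1, 0]`: the C-cell output tolerates the one-position shift as long as the feature stays inside the same pooling group.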
Neocognitron




         Learning is done greedily for each layer
Convolutional Neural Network




                                                   LeCun et al., 98




The output neurons share all the intermediate levels
Face detection and the success of learning-based approaches




•  The representation and matching of pictorial structures - Fischler, Elschlager (1973)
•  Face recognition using eigenfaces - M. Turk and A. Pentland (1991)
•  Human Face Detection in Visual Scenes - Rowley, Baluja, Kanade (1995)
•  Graded Learning for Object Detection - Fleuret, Geman (1999)
•  Robust Real-time Object Detection - Viola, Jones (2001)
•  Feature Reduction and Hierarchy of Classifiers for Fast Object Detection in Video Images - Heisele, Serre, Mukherjee, Poggio (2001)
•  …
Faces everywhere




http://www.marcofolio.net/imagedump/faces_everywhere_15_images_8_illusions.html
The face age




  FERET dataset, 1996 DARPA

Rapid Object Detection Using a Boosted Cascade of Simple Features




                             Paul Viola     Michael J. Jones
                    Mitsubishi Electric Research Laboratories (MERL)
                                      Cambridge, MA


                Most of this work was done at Compaq CRL before the authors moved to MERL

Manuscript available on web:
http://citeseer.ist.psu.edu/cache/papers/cs/23183/http:zSzzSzwww.ai.mit.eduzSzpeoplezSzviolazSzresearchzSzpublicationszSzICCV01-Viola-Jones.pdf/viola01robust.pdf
Haar-like filters and cascades
Viola and Jones, ICCV 2001




The average intensity in the block is computed with four sums, independently of the block size.
Also Fleuret and Geman, 2001
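The constant-cost block sum mentioned above is what the integral image (summed-area table) of the Viola and Jones paper provides. The following is a minimal pure-Python sketch, not the authors' implementation: the function names, the zero-padded table layout, and the particular two-rectangle feature are illustrative assumptions. Once the table is built, any block sum needs only four table lookups, whatever the block size.

```python
def integral_image(img):
    """ii[r][c] holds the sum of img over rows < r and columns < c (zero-padded)."""
    rows, cols = len(img), len(img[0])
    ii = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        row_sum = 0
        for c in range(cols):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii

def block_sum(ii, top, left, height, width):
    """Sum of the block using four lookups, independent of the block size."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]

def two_rect_haar(ii, top, left, height, width):
    """Hypothetical two-rectangle Haar-like feature: left half minus right half."""
    half = width // 2
    return (block_sum(ii, top, left, height, half)
            - block_sum(ii, top, left + half, height, half))

image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12]]
ii = integral_image(image)
total = block_sum(ii, 0, 0, 3, 4)        # whole image
inner = block_sum(ii, 1, 1, 2, 2)        # 6 + 7 + 10 + 11
feature = two_rect_haar(ii, 0, 0, 3, 4)  # left two columns minus right two columns
```

Dividing a block sum by `height * width` gives the average intensity; a Haar-like filter response is then a difference of a few such block sums, so evaluating it costs the same at every scale.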
Face detection
Families of recognition algorithms

Bag of words models
•  Csurka, Dance, Fan, Willamowski, and Bray 2004
•  Sivic, Russell, Freeman, Zisserman, ICCV 2005

Voting models
•  Viola and Jones, ICCV 2001
•  Heisele, Poggio, et. al., NIPS 01
•  Schneiderman, Kanade 2004
•  Vidal-Naquet, Ullman 2003

Shape matching, deformable models
•  Berg, Berg, Malik, 2005
•  Cootes, Edwards, Taylor, 2001

Constellation models
•  Fischler and Elschlager, 1973
•  Burl, Leung, and Perona, 1995
•  Weber, Welling, and Perona, 2000
•  Fergus, Perona, & Zisserman, CVPR 2003

Rigid template models
•  Sirovich and Kirby 1987
•  Turk, Pentland, 1991
•  Dalal & Triggs, 2006
Scene understanding
•  Torralba, Sinha (2001)
•  Fink & Perona (2003)
•  Torralba, Murphy, Freeman (2004)
•  Carbonetto, de Freitas & Barnard (2004)
•  Sudderth, Torralba, Willsky, Freeman (2005)
•  Hoiem, Efros, Hebert (2005)
•  Kumar, Hebert (2005)
•  Rabinovich et al (2007)
•  Heitz and Koller (2008)
•  Desai, Ramanan, and Fowlkes (2009)
•  Choi, Lim, Torralba, Willsky (2010)
NSF Frontiers in computer vision workshop, 2011
MobilEye
Demo: Google Goggles
The labeling crisis
[Figure: park scene annotated with object labels: SKY, TREE, PERSON, BENCH, PATH, LAKE, DUCK, SIGN, GRASS]
So what does object recognition involve?




                            Slide by Fei-Fei, Fergus, Torralba
Verification: is that a lamp?




                                Slide by Fei-Fei, Fergus, Torralba
Detection: are there people?




                               Slide by Fei-Fei, Fergus, Torralba
Identification: is that Potala Palace?




                              Slide by Fei-Fei, Fergus, Torralba
Object categorization
[Figure: scene annotated with category labels: mountain, tree, building, banner, street lamp, vendor, people]

                                Slide by Fei-Fei, Fergus, Torralba
Scene and context categorization
                        •  outdoor
                        •  city
                        •  …




                               Slide by Fei-Fei, Fergus, Torralba
Is this space large or small?
How far are the buildings in the back?




                             Slide by Fei-Fei, Fergus, Torralba
Activity




What is this person doing?
                             What are these two doing??




                                              Slide by Fei-Fei, Fergus, Torralba
What are we tuned to?

The visual system is tuned to process structures typically found in the world.
The visual system seems to be tuned to a set of images:




                                                    Demo inspired from D. Field
Remember these images
Did you see this image?
Remember these images
        Test 2
Did you see this image?
Data
Human vision
• Many input modalities
• Active
• Supervised, unsupervised, semi supervised
learning. It can look for supervision.




Robot vision
• Many poor input modalities
• Active, but it does not go far


Internet vision
• Many input modalities
• It can reach everywhere
• Tons of data
Kinect
Active stereo with structured light



Li Zhang’s one-shot stereo
[Figure: two setups, a projector with camera 1 and camera 2, and a projector with camera 1 only]

Project “structured” light patterns onto the object
•  simplifies the correspondence problem
Li Zhang, Brian Curless, and Steven M. Seitz. Rapid Shape Acquisition Using Color Structured
Light and Multi-pass Dynamic Programming. In Proceedings of the 1st International
Symposium on 3D Data Processing, Visualization, and Transmission (3DPVT), Padova, Italy,
June 19-21, 2002, pp. 24-36.	

Slide credit: Rick Szeliski, CSE 576, Spring 2008 (Stereo matching)
Willow garage




 http://www.willowgarage.com/pages/pr2/overview
Class goals

•  Vision and language

•  Vision and robotics

•  Vision and others
  The strategies our visual system uses are tuned to our visual world


          To provide the right vision tools for non-experts in vision
          Thinking about the tasks to find new representations

  • 1. 6.870 Grounding object recognition and scene understanding Wednesdays 1-4pm Room 13-1143 Instructor: Antonio Torralba Email: torralba@csail.mit.edu http://people.csail.mit.edu/torralba/courses/6.870/6.870.recognition.htm Some slides are borrowed from other classes (see links on the course web site). Let me know if I forget to give credit to the right people.
  • 3. Grading •  Class participation: 20% •  Paper presentations: 40% •  Course project: 40%
  • 4. Course project •  Topics for projects: It can derive from one of the papers studied or from your own research. •  Work individually or in pairs. •  Results described as a 4 pages CVPR paper •  Short presentation at the end of the semester
  • 5. Paper presentations (40%) Email me at the end of the class for scheduling the next week. We will first decide how to structure the week together. •  Presenter: –  Present the key ideas, background material, and technical details. –  Show me the slides two days before the class. –  To test the basic ideas of the paper(s), using code available online or writing toy code. –  Create toy test problems that reveal something about the algorithm. –  Constructive criticism.
  • 7. 6.870 Grounding object recognition and scene understanding Lecture  1    Class  goals  and    a  short  introduc2on  
  • 8. What  is  vision?   •  What  does  it  mean,  to  see?    “to  know  what  is   where  by  looking”.   •  How  to  discover  from  images  what  is  present   in  the  world,  where  things  are,  what  ac2ons   are  taking  place.   from  Marr,  1982  
  • 9. The  importance  of  images   Some  images  are  more  important  than  others     “Dora  Maar  au  Chat”   Pablo  Picasso,  1941   100  million  $  
  • 10. Why  is  vision  hard?  
  • 11. The  structure  of  ambient  light  
  • 12. The  structure  of  ambient  light  
  • 13. The  Plenop2c  Func2on   Adelson & Bergen, 91 The intensity P can be parameterized as: P (θ, φ, λ, t, X, Y, Z) “The complete set of all convergence points constitutes the permanent possibilities of vision.” Gibson
  • 18. Why  is  vision  hard?  
  • 19. Measuring  light  vs.  measuring   scene  proper2es   We perceive two squares, one on top of each other.
  • 20. Measuring  light  vs.  measuring  scene   proper2es   by Roger Shepard (”Turning the Tables”) Depth processing is automatic, and we can not shut it down…
  • 21. Measuring  light  vs.  measuring   scene  proper2es  
  • 22. Measuring  light  vs.  measuring   scene  proper2es  
  • 23. Measuring  light  vs.  measuring   scene  proper2es   (c) 2006 Walt Anthony
  • 24. Assump2ons  can  be  wrong   Ames  room  
  • 26. Why  is  vision  hard?  
  • 27. Some  things  have  strong  varia2ons   in  appearance  
  • 28. Some  things  know  that  you  have  eyes   Brady,  M.  J.,  &  Kersten,  D.  (2003).  Bootstrapped  learning  of  novel  objects.  J  Vis,  3(6),  413-­‐422    
  • 29. A  short  history  of  vision  
  • 32. The  crisis  of  the  80’s  
  • 33. Object  recogni2on   Is  it  really  so  hard?   Yes,  object  recogni2on  is  hard…   (or at least it seems so for now…)
  • 34. Challenges 1: view point variation Michelangelo 1475-1564
  • 35. Challenges 2: illumination slide credit: S. Ullman
  • 36. Challenges 3: occlusion Magritte, 1957
  • 38. Challenges 5: deformation Xu, Beihong 1943
  • 39. Challenges 6: background clutter Klimt, 1913
  • 41. Challenges Brady, M. J., & Kersten, D. (2003). Bootstrapped learning of novel objects. J Vis, 3(6), 413-422
  • 42. Discover the camouflaged object Brady, M. J., & Kersten, D. (2003). Bootstrapped learning of novel objects. J Vis, 3(6), 413-422
  • 43. Discover the camouflaged object Brady, M. J., & Kersten, D. (2003). Bootstrapped learning of novel objects. J Vis, 3(6), 413-422
  • 51. So,  let’s  make  the  problem  simpler:   Block  world   Nice framework to develop fancy math, but too far from reality… Object Recognition in the Geometric Era: a Retrospective. Joseph L. Mundy. 2006
  • 52. Binford  and  generalized  cylinders   Object Recognition in the Geometric Era: a Retrospective. Joseph L. Mundy. 2006
  • 53. Binford  and  generalized  cylinders  
  • 54. Recogni2on  by  components   Irving Biederman Recognition-by-Components: A Theory of Human Image Understanding. Psychological Review, 1987.
  • 55. Recogni2on  by  components   The  fundamental  assump2on  of  the  proposed  theory,   recogni2on-­‐by-­‐components  (RBC),  is  that  a  modest  set  of   generalized-­‐cone  components,  called  geons  (N  =  36),  can  be   derived  from  contrasts  of  five  readily  detectable  proper2es  of   edges  in  a  two-­‐dimensional  image:  curvature,  collinearity,   symmetry,  parallelism,  and  cotermina2on.   The  “contribu2on  lies  in  its  proposal  for  a  par2cular  vocabulary   of  components  derived  from  perceptual  mechanisms  and  its   account  of  how  an  arrangement  of  these  components  can   access  a  representa2on  of  an  object  in  memory.”  
  • 56. A  do-­‐it-­‐yourself  example   1)  We know that this object is nothing we know 2)  We can split this objects into parts that everybody will agree 3)  We can see how it resembles something familiar: “a hot dog cart” “The naive realism that emerges in descriptions of nonsense objects may be reflecting the workings of a representational system by which objects are identified.”
  • 57. Stages of processing. “Parsing is performed, primarily at concave regions, simultaneously with a detection of nonaccidental properties.”
  • 58. Nonaccidental properties. Certain properties of edges in a two-dimensional image are taken by the visual system as strong evidence that the edges in the three-dimensional world contain those same properties. Nonaccidental properties (Witkin & Tenenbaum, 1983) are rarely produced by accidental alignments of viewpoint and object features and consequently are generally unaffected by slight variations in viewpoint.
  • 59. Examples: •  Collinearity •  Smoothness •  Symmetry •  Parallelism •  Cotermination
  • 60. From generalized cylinders to geons. “From variation over only two or three levels in the nonaccidental relations of four attributes of generalized cylinders, a set of 36 GEONS can be generated.” Geons represent a restricted form of generalized cylinders.
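The count of 36 can be checked by enumeration. The sketch below assumes the attribute breakdown described in Biederman (1987): cross-section edge (2 levels), cross-section symmetry (3 levels), cross-section size along the axis (3 levels), and axis shape (2 levels). The level names are shorthand chosen for this example, not the paper's exact wording.

```python
from itertools import product

# Assumed attribute levels (shorthand labels, following Biederman 1987).
ATTRIBUTES = {
    "edge": ["straight", "curved"],
    "symmetry": ["rotational and reflective", "reflective only", "asymmetric"],
    "size": ["constant", "expanded", "expanded and contracted"],
    "axis": ["straight", "curved"],
}

# Every combination of one level per attribute defines one geon.
geons = [dict(zip(ATTRIBUTES, combo)) for combo in product(*ATTRIBUTES.values())]

print(len(geons))  # 2 * 3 * 3 * 2 = 36
```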
  • 61. Objects and their geons
  • 62. Scenes and geons. Mezzanotte & Biederman
  • 63. The importance of spatial arrangement
  • 64. Parts and Structure approaches. With a different perspective, these models focused more on the geometry than on defining the constituent elements: •  Fischler & Elschlager 1973 •  Yuille ’91 •  Brunelli & Poggio ’93 •  Lades, v.d. Malsburg et al. ’93 •  Cootes, Lanitis, Taylor et al. ’95 •  Amit & Geman ’95, ’99 •  Perona et al. ’95, ’96, ’98, ’00, ’03, ’04, ’05 •  Felzenszwalb & Huttenlocher ’00, ’04 •  Crandall & Huttenlocher ’05, ’06 •  Leibe & Schiele ’03, ’04 •  Many papers since 2000. Figure from [Fischler & Elschlager 73]
  • 65. But, despite promising initial results… things did not work out so well (lack of data, processing power, lack of reliable methods for low-level and mid-level vision). Instead, a different way of thinking about object detection started making some progress: learning-based approaches and classifiers, which ignored low- and mid-level vision. Maybe the time has come to return to some of the earlier models, more grounded in intuitions about visual perception.
  • 67. Neocognitron. Fukushima (1980). Hierarchical multilayered neural network. S-cells work as feature-extracting cells. They resemble simple cells of the primary visual cortex in their response. C-cells, which resemble complex cells in the visual cortex, are inserted in the network to allow for positional errors in the features of the stimulus. The input connections of C-cells, which come from S-cells of the preceding layer, are fixed and invariable. Each C-cell receives excitatory input connections from a group of S-cells that extract the same feature, but from slightly different positions. The C-cell responds if at least one of these S-cells yields an output.
  • 68. Neocognitron   Learning is done greedily for each layer
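The C-cell behaviour described for the Neocognitron (respond if at least one S-cell in a small neighbourhood responds) can be illustrated with a toy pooling step. This is a minimal sketch, not Fukushima's formulation: it assumes a maximum over a square window, whereas the original C-cells use fixed weighted connections with a saturating response. The function name `c_cell_response` and the `pool` parameter are hypothetical, and `pool` is assumed to be odd.

```python
import numpy as np

def c_cell_response(s_map, pool=3):
    """Pool S-cell outputs of one feature map over a pool x pool window.

    Assumption: a max over the window stands in for the C-cell's
    'at least one S-cell responds' behaviour.
    """
    h, w = s_map.shape
    pad = pool // 2
    # Zero padding so that border C-cells see a full window.
    padded = np.pad(s_map, pad, mode="constant", constant_values=0.0)
    out = np.zeros_like(s_map, dtype=float)
    for r in range(h):
        for c in range(w):
            out[r, c] = padded[r:r + pool, c:c + pool].max()
    return out

# The same feature at two slightly different positions.
s_a = np.zeros((5, 5)); s_a[2, 2] = 1.0
s_b = np.zeros((5, 5)); s_b[2, 3] = 1.0

# The C-cell at the centre responds in both cases.
print(c_cell_response(s_a)[2, 2], c_cell_response(s_b)[2, 2])
```

The centre C-cell gives the same response for both inputs, which is the tolerance to positional error that the slide describes.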
  • 69. Convolutional Neural Network. LeCun et al., 98. The output neurons share all the intermediate levels
  • 70. Face detection and the success of learning based approaches •  The representation and matching of pictorial structures Fischler, Elschlager (1973). •  Face recognition using eigenfaces M. Turk and A. Pentland (1991). •  Human Face Detection in Visual Scenes - Rowley, Baluja, Kanade (1995) •  Graded Learning for Object Detection - Fleuret, Geman (1999) •  Robust Real-time Object Detection - Viola, Jones (2001) •  Feature Reduction and Hierarchy of Classifiers for Fast Object Detection in Video Images - Heisele, Serre, Mukherjee, Poggio (2001) • ….
  • 73. The face age. FERET dataset, 1996, DARPA
  • 74. Rapid Object Detection Using a Boosted Cascade of Simple Features Paul Viola Michael J. Jones Mitsubishi Electric Research Laboratories (MERL) Cambridge, MA Most of this work was done at Compaq CRL before the authors moved to MERL Manuscript available on web: http://citeseer.ist.psu.edu/cache/papers/cs/23183/http:zSzzSzwww.ai.mit.eduzSzpeoplezSzviolazSzresearchzSzpublicationszSzICCV01-Viola-Jones.pdf/viola01robust.pdf
  • 75. Haar-like filters and cascades. Viola and Jones, ICCV 2001. Using an integral image, the sum of intensities in a block (and hence its average) is computed from four array lookups, independently of the block size. Also Fleuret and Geman, 2001
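The constant-time block sum behind Haar-like filters can be sketched with an integral image (summed-area table). The code below is an illustration under stated assumptions, not the Viola and Jones implementation: the function names are hypothetical, and `two_rect_feature` is one simple example of a two-rectangle feature (left half minus right half).

```python
import numpy as np

def integral_image(img):
    # ii[r, c] holds the sum of img[:r, :c]; the extra zero row and
    # column remove the need for boundary checks in block_sum.
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def block_sum(ii, top, left, height, width):
    # Four lookups, regardless of how large the block is.
    bottom, right = top + height, left + width
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

def block_mean(ii, top, left, height, width):
    return block_sum(ii, top, left, height, width) / (height * width)

def two_rect_feature(ii, top, left, height, width):
    # Hypothetical horizontal two-rectangle feature: left half minus right half.
    half = width // 2
    return (block_sum(ii, top, left, height, half)
            - block_sum(ii, top, left + half, height, half))

img = np.arange(20, dtype=np.float64).reshape(4, 5)
ii = integral_image(img)
print(block_sum(ii, 1, 2, 2, 3) == img[1:3, 2:5].sum())  # True
```

The integral image is built once per image; every filter evaluated afterwards costs a fixed number of lookups, which is what makes exhaustive scanning over positions and scales affordable.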
  • 78. Families of recognition algorithms •  Voting models: Viola and Jones, ICCV 2001; Heisele, Poggio, et al., NIPS 01; Schneiderman, Kanade 2004; Vidal-Naquet, Ullman 2003 •  Shape matching, deformable models: Berg, Berg, Malik, 2005; Cootes, Edwards, Taylor, 2001 •  Bag of words models: Csurka, Dance, Fan, Willamowski, and Bray 2004; Sivic, Russell, Freeman, Zisserman, ICCV 2005 •  Rigid template models: Sirovich and Kirby 1987; Turk, Pentland, 1991; Dalal & Triggs, 2006 •  Constellation models: Fischler and Elschlager, 1973; Burl, Leung, and Perona, 1995; Weber, Welling, and Perona, 2000; Fergus, Perona, & Zisserman, CVPR 2003
  • 79. Scene understanding •  Torralba, Sinha (2001) •  Torralba, Murphy, Freeman (2004) •  Carbonetto, de Freitas & Barnard (2004) •  Fink & Perona (2003) •  Rabinovich et al. (2007) •  Sudderth, Torralba, Willsky, Freeman (2005) •  Hoiem, Efros, Hebert (2005) •  Kumar, Hebert (2005) •  Desai, Ramanan, and Fowlkes (2009) •  Choi, Lim, Torralba, Willsky (2010) •  Heitz and Koller (2008)
  • 80. NSF Frontiers in computer vision workshop, 2011
  • 83. The labeling crisis. Image labels: SKY, TREE, PERSON, BENCH, PERSON, PATH, LAKE, PERSON, DUCK, PERSON, DUCK, SIGN, DUCK, GRASS
  • 84. So what does object recognition involve? Slide by Fei-Fei, Fergus, Torralba
  • 85. Verification: is that a lamp? Slide by Fei-Fei, Fergus, Torralba
  • 86. Detection: are there people? Slide by Fei-Fei, Fergus, Torralba
  • 87. Identification: is that Potala Palace? Slide by Fei-Fei, Fergus, Torralba
  • 88. Object categorization mountain tree building banner street lamp vendor people Slide by Fei-Fei, Fergus, Torralba
  • 89. Scene and context categorization •  outdoor •  city •  … Slide by Fei-Fei, Fergus, Torralba
  • 90. Is this space large or small? How far are the buildings in the back? Slide by Fei-Fei, Fergus, Torralba
  • 91. Activity What is this person doing? What are these two doing?? Slide by Fei-Fei, Fergus, Torralba
  • 92. What are we tuned to? The visual system is tuned to process structures typically found in the world.
  • 93. The visual system seems to be tuned to a set of images: Demo inspired by D. Field
  • 95. Did you see this image?
  • 97. Did you see this image?
  • 98. Data. Human vision: •  Many input modalities •  Active •  Supervised, unsupervised, semi-supervised learning. It can look for supervision. Robot vision: •  Many poor input modalities •  Active, but it does not go far. Internet vision: •  Many input modalities •  It can reach everywhere •  Tons of data
  • 100. Active stereo with structured light. Li Zhang’s one-shot stereo (figure labels: camera 1, projector, camera 2). Project “structured” light patterns onto the object •  simplifies the correspondence problem. Li Zhang, Brian Curless, and Steven M. Seitz. Rapid Shape Acquisition Using Color Structured Light and Multi-pass Dynamic Programming. In Proceedings of the 1st International Symposium on 3D Data Processing, Visualization, and Transmission (3DPVT), Padova, Italy, June 19-21, 2002, pp. 24-36. Slide credit: Rick Szeliski (CSE 576, Spring 2008, Stereo matching)
  • 107. Class goals •  Vision and language •  Vision and robotics •  Vision and others. The strategies our visual system uses are tuned to our visual world. To provide the right vision tools for non-vision experts. Thinking about the tasks to find new representations.