MIT6.870 Grounding Object Recognition and Scene Understanding: lecture 1

6.870 Grounding object
recognition and scene
understanding
Wednesdays 1-4pm
Room 13-1143
Instructor: Antonio Torralba
Email: torralba@csail.mit.edu

http://people.csail.mit.edu/torralba/courses/6.870/6.870.recognition.htm
Some slides are borrowed from other classes (see links on the course
web site). Let me know if I forget to give credit to the right people.

http://groups.csail.mit.edu/vision/courses/6.869/

Grading

•  Class participation: 20%

•  Paper presentations: 40%

•  Course project: 40%

Course project
•  Topics for projects: a project can derive from one
of the papers studied or from your own
research.

•  Work individually or in pairs.

•  Results written up as a 4-page paper in
CVPR format

•  Short presentation at the end of the
semester

Paper presentations (40%)
Email me at the end of class to schedule the following week. We will
first decide together how to structure the week.

•  Presenter:
–  Present the key ideas, background material, and technical details.
–  Show me the slides two days before the class.
–  Test the basic ideas of the paper(s), using code available online or
writing toy code.
–  Create toy test problems that reveal something about the algorithm.
–  Offer constructive criticism.

6.870 Grounding object recognition
and scene understanding

Lecture 1

Class goals and a short introduction

What is vision?

•  What does it mean, to see? “to know what is where by looking”.

•  How to discover from images what is present in the world, where things are, what actions are taking place.

from Marr, 1982

The importance of images

Some images are more important than others

“Dora Maar au Chat”

Pablo Picasso, 1941

$100 million

The structure of ambient light

The Plenoptic Function

Adelson & Bergen, 91

The intensity P can be parameterized as:

P(θ, φ, λ, t, X, Y, Z)
“The complete set of all convergence points constitutes the permanent possibilities
of vision.” Gibson
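
One way to read the plenoptic function is that an ordinary image is a slice of it: fix the viewing position (X, Y, Z), the time t, and the wavelength λ, then sample over the directions (θ, φ). The sketch below illustrates this with a made-up scene. `toy_plenoptic`, `pinhole_image`, and all constants are hypothetical names and values chosen for illustration only; they do not come from Adelson & Bergen.

```python
import math


def toy_plenoptic(theta, phi, lam, t, x, y, z):
    """Hypothetical scene: intensity seen at (x, y, z) in direction (theta, phi),
    at wavelength lam (nm) and time t. The numbers are arbitrary."""
    # A single bright blob centred on one viewing direction.
    blob_theta, blob_phi = 0.3, 1.0
    angular_dist2 = (theta - blob_theta) ** 2 + (phi - blob_phi) ** 2
    directional = math.exp(-angular_dist2 / 0.05)
    # Spectral profile peaking at 550 nm.
    spectral = math.exp(-((lam - 550.0) / 100.0) ** 2)
    # Slow flicker over time; this toy scene ignores the viewpoint (x, y, z).
    temporal = 0.5 * (1.0 + math.cos(t))
    return directional * spectral * temporal


def pinhole_image(P, position, t, lam, thetas, phis):
    """Sample P over a grid of directions with everything else held fixed."""
    x, y, z = position
    return [[P(theta, phi, lam, t, x, y, z) for theta in thetas] for phi in phis]


thetas = [0.1 * i for i in range(8)]        # 8 columns
phis = [0.8 + 0.1 * j for j in range(6)]    # 6 rows
image = pinhole_image(toy_plenoptic, (0.0, 0.0, 0.0), 0.0, 550.0, thetas, phis)

# Locate the brightest pixel as (value, row, col).
brightest = max(
    (value, row, col)
    for row, line in enumerate(image)
    for col, value in enumerate(line)
)
```

Sampling the same function at a second position would give the other view of a stereo pair, and sampling over t would give a video; each is a different slice of the same seven-dimensional function.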


Measuring light vs. measuring scene properties

We perceive two squares, one on top of the other.

Measuring light vs. measuring scene properties

by Roger Shepard (“Turning the Tables”)

Depth processing is automatic, and we cannot shut it off…

Measuring light vs. measuring scene properties

(c) 2006 Walt Anthony

Assumptions can be wrong

Ames room

Some things have strong variations in appearance

Some things know that you have eyes

Brady, M. J., & Kersten, D. (2003). Bootstrapped learning of novel objects. J Vis, 3(6), 413-422

A short history of vision

The crisis of the 80’s

Object recognition

Is it really so hard?

Yes, object recognition is hard…

(or at least it seems so for now…)

Challenges 1: viewpoint variation

Michelangelo 1475-1564

Challenges 2: illumination

slide credit: S. Ullman

Challenges 3: occlusion

Magritte, 1957

Challenges 5: deformation

Xu, Beihong 1943

Challenges 6: background clutter

Klimt, 1913

Challenges 7: intra-class variation

Challenges

Brady, M. J., & Kersten, D. (2003). Bootstrapped learning of novel objects. J Vis, 3(6), 413-422

Discover the camouflaged object

Brady, M. J., & Kersten, D. (2003). Bootstrapped learning of novel objects. J Vis, 3(6), 413-422

So, let’s make the problem simpler: Block world
Nice framework to develop fancy math, but too far from reality…
Object Recognition in the Geometric Era:
a Retrospective. Joseph L. Mundy. 2006

Binford and generalized cylinders

Object Recognition in the Geometric Era:
a Retrospective. Joseph L. Mundy. 2006

Binford and generalized cylinders

Recognition by components

Irving Biederman
Recognition-by-Components: A Theory of Human Image Understanding.
Psychological Review, 1987.

Recognition by components

The fundamental assumption of the proposed theory, recognition-by-components (RBC), is that a modest set of generalized-cone components, called geons (N = 36), can be derived from contrasts of five readily detectable properties of edges in a two-dimensional image: curvature, collinearity, symmetry, parallelism, and cotermination.

The “contribution lies in its proposal for a particular vocabulary of components derived from perceptual mechanisms and its account of how an arrangement of these components can access a representation of an object in memory.”

A do-it-yourself example

1)  We know that this object is unlike anything we know
2)  We can split this object into parts that everybody will agree on
3)  We can see how it resembles something familiar: “a hot dog cart”

“The naive realism that emerges in descriptions of nonsense objects may be
reflecting the workings of a representational system by which objects are
identified.”

Stages of processing

“Parsing is performed, primarily at concave regions, simultaneously with a
detection of nonaccidental properties.”

Nonaccidental properties

Certain properties of edges in a two-dimensional image are taken by the visual
system as strong evidence that the edges in the three-dimensional world contain those
same properties.

Nonaccidental properties (Witkin & Tenenbaum, 1983): properties that are rarely produced by
accidental alignments of viewpoint and object features, and that consequently are generally
unaffected by slight variations in viewpoint.


Examples:
•  Collinearity
•  Smoothness
•  Symmetry
•  Parallelism
•  Cotermination
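
A small sketch of why collinearity counts as nonaccidental, assuming an ideal pinhole camera (the helper names `project`, `view_from`, and `collinear_2d` are made up for this illustration). Three points that are collinear in 3D project to collinear image points from any viewpoint, whereas three non-collinear 3D points look collinear only from an "accidental" viewpoint that lies in their common plane.

```python
def project(point, focal=1.0):
    """Pinhole projection; the camera sits at the origin looking along +Z."""
    x, y, z = point
    return (focal * x / z, focal * y / z)


def view_from(camera, point):
    """Coordinates of a 3D point relative to a translated camera."""
    return tuple(p - c for p, c in zip(point, camera))


def collinear_2d(a, b, c, tol=1e-9):
    """Three image points are collinear when their 2D cross product vanishes."""
    cross = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return abs(cross) < tol


def looks_collinear(points, camera):
    return collinear_2d(*[project(view_from(camera, p)) for p in points])


# Three points on one 3D line: p0 + s * d for s = 0, 1, 2.5.
p0, d = (-1.0, 0.5, 4.0), (1.0, 0.2, 0.5)
straight = [tuple(p + s * q for p, q in zip(p0, d)) for s in (0.0, 1.0, 2.5)]

# Three points that are NOT collinear in 3D (they span the plane z - y = 5).
bent = [(0.0, 0.0, 5.0), (1.0, 0.0, 5.0), (1.0, 1.0, 6.0)]

generic_cameras = [(0.0, 0.0, 0.0), (0.5, -0.3, -1.0), (-2.0, 1.0, 0.5)]
accidental_camera = (0.0, -5.0, 0.0)   # lies in the plane of the bent triple

straight_always = all(looks_collinear(straight, cam) for cam in generic_cameras)
bent_generic = looks_collinear(bent, (0.0, 0.0, 0.0))
bent_accidental = looks_collinear(bent, accidental_camera)
```

Here `straight_always` is true for every camera tried, `bent_generic` is false, and `bent_accidental` is true: observing collinearity in the image is therefore strong (though not infallible) evidence of collinearity in the world.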

From generalized cylinders to GEONS

“From variation over only two or three levels in the nonaccidental relations of four
attributes of generalized cylinders, a set of 36 GEONS can be generated.”
Geons represent a restricted form of generalized cylinders.

Objects and their geons

Scenes and geons

Mezzanotte & Biederman

The importance of spatial arrangement

Parts and Structure approaches
With a different perspective, these models focused more on the
geometry than on defining the constituent elements:

•  Fischler & Elschlager 1973
•  Yuille ’91
•  Brunelli & Poggio ’93
•  Lades, v.d. Malsburg et al. ’93
•  Cootes, Lanitis, Taylor et al. ’95
•  Amit & Geman ’95, ’99
•  Perona et al. ’95, ’96, ’98, ’00, ’03, ’04, ’05
•  Felzenszwalb & Huttenlocher ’00, ’04
•  Crandall & Huttenlocher ’05, ’06
•  Leibe & Schiele ’03, ’04
•  Many papers since 2000

Figure from [Fischler & Elschlager 73]

But, despite promising initial results… things did not work out so well (lack of data, processing power, lack of reliable methods for low-level and mid-level vision)

Instead, a different way of thinking about object detection started making some progress: learning based approaches and classifiers, which ignored low and mid-level vision.

Maybe the time is here to come back to some of the earlier models, more grounded in intuitions about visual perception.

Neocognitron

Fukushima (1980). Hierarchical multilayered neural network

S-cells work as feature-extracting cells. They resemble simple cells of the
primary visual cortex in their response.
C-cells, which resemble complex cells in the visual cortex, are inserted in the
network to allow for positional errors in the features of the stimulus. The input
connections of C-cells, which come from S-cells of the preceding layer, are fixed
and invariable. Each C-cell receives excitatory input connections from a group
of S-cells that extract the same feature, but from slightly different positions. The
C-cell responds if at least one of these S-cells yields an output.
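
The S-cell / C-cell pairing can be sketched in one dimension. This is a simplified stand-in, not Fukushima's formulation: the S-layer here is a thresholded template match and the C-layer is a plain maximum over a fixed window (the actual Neocognitron uses a saturating weighted sum). The function names, the template, and the pooling width are assumptions made for this example.

```python
def s_layer(signal, template, threshold):
    """Feature extraction: respond wherever the local patch matches the template."""
    n = len(template)
    out = []
    for i in range(len(signal) - n + 1):
        response = float(sum(signal[i + k] * template[k] for k in range(n)))
        out.append(response if response >= threshold else 0.0)
    return out


def c_layer(s_out, pool):
    """Position tolerance: each C-cell fires if any S-cell in its window fires.
    The windows are fixed (not learned) and non-overlapping in this sketch."""
    return [max(s_out[i:i + pool]) for i in range(0, len(s_out), pool)]


rising_edge = [-1, 1]                        # hypothetical feature: a step up
stimulus = [0, 0, 1, 1, 0, 0, 0, 0, 0]
shifted = [0, 0, 0, 1, 1, 0, 0, 0, 0]        # same stimulus moved by one sample

s_a = s_layer(stimulus, rising_edge, threshold=1.0)
s_b = s_layer(shifted, rising_edge, threshold=1.0)
c_a = c_layer(s_a, pool=4)
c_b = c_layer(s_b, pool=4)
```

The S-layer outputs `s_a` and `s_b` differ because the edge moved, while the C-layer outputs `c_a` and `c_b` are identical: a small positional error has been absorbed. A shift that crosses a pooling-window boundary would still change the C-layer output, which is why the tolerance holds only for slightly different positions.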

Neocognitron

Learning is done greedily for each layer

Convolutional Neural Network

LeCun et al., 98

The output neurons share all the intermediate levels

Face detection and the success
of learning based approaches

•  The representation and matching of pictorial structures Fischler, Elschlager (1973).
•  Face recognition using eigenfaces M. Turk and A. Pentland (1991).
•  Human Face Detection in Visual Scenes - Rowley, Baluja, Kanade (1995)
•  Graded Learning for Object Detection - Fleuret, Geman (1999)
•  Robust Real-time Object Detection - Viola, Jones (2001)
•  Feature Reduction and Hierarchy of Classifiers for Fast Object Detection in Video Images - Heisele, Serre,
Mukherjee, Poggio (2001)
• ….


Faces everywhere

http://www.marcofolio.net/imagedump/faces_everywhere_15_images_8_illusions.html

The face age

FERET dataset, 1996, DARPA


Rapid Object Detection Using a Boosted
Cascade of Simple Features

Paul Viola Michael J. Jones
Mitsubishi Electric Research Laboratories (MERL)
Cambridge, MA

Most of this work was done at Compaq CRL before the authors moved to MERL

Manuscript available on web:
http://citeseer.ist.psu.edu/cache/papers/cs/23183/http:zSzzSzwww.ai.mit.eduzSzpeoplezSzviolazSzresearchzSzpublicationszSzICCV01-Viola-Jones.pdf/viola01robust.pdf

Haar-like filters and cascades
Viola and Jones, ICCV 2001

The sum of intensities in a block
(and hence its average) is computed
from four integral-image lookups,
independently of the block size.
Also Fleuret and Geman, 2001
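
A minimal sketch of the integral-image trick behind Haar-like filters, written in plain Python for readability. The function names and the tiny test image are assumptions for this example; this is not the Viola-Jones implementation, only the block-sum idea it relies on.

```python
def integral_image(img):
    """ii[r][c] holds the sum of img over rows < r and columns < c.
    The extra row and column of zeros avoid special cases at the border."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0
        for c in range(w):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def block_sum(ii, top, left, height, width):
    """Sum over any rectangle from four lookups, whatever its size."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def haar_two_rect(ii, top, left, height, width):
    """Two-rectangle Haar-like feature: right half minus left half."""
    half = width // 2
    left_sum = block_sum(ii, top, left, height, half)
    right_sum = block_sum(ii, top, left + half, height, half)
    return right_sum - left_sum


# Toy 4x4 image with a vertical edge: dark left half, bright right half.
img = [[0, 0, 10, 10] for _ in range(4)]
ii = integral_image(img)

total = block_sum(ii, 0, 0, 4, 4)            # whole image
centre_mean = block_sum(ii, 1, 1, 2, 2) / 4  # average of the central 2x2 block
edge_response = haar_two_rect(ii, 0, 0, 4, 4)
```

The integral image is built once in a single pass; after that every rectangular feature costs a handful of lookups, which is what makes evaluating many such features at every window position affordable.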


Families of recognition algorithms

Bag of words models
Csurka, Dance, Fan, Willamowski, and Bray 2004
Sivic, Russell, Freeman, Zisserman, ICCV 2005

Voting models
Viola and Jones, ICCV 2001
Heisele, Poggio, et al., NIPS 01
Schneiderman, Kanade 2004
Vidal-Naquet, Ullman 2003

Shape matching
Berg, Berg, Malik, 2005

Deformable models
Cootes, Edwards, Taylor, 2001

Rigid template models
Sirovich and Kirby 1987
Turk, Pentland, 1991
Dalal & Triggs, 2006

Constellation models
Fischler and Elschlager, 1973
Burl, Leung, and Perona, 1995
Weber, Welling, and Perona, 2000
Fergus, Perona, & Zisserman, CVPR 2003

Scene understanding
Torralba, Sinha (2001)
Torralba, Murphy, Freeman (2004)
Carbonetto, de Freitas & Barnard (2004)
Fink & Perona (2003)
Rabinovich et al (2007)
Sudderth, Torralba, Willsky, Freeman (2005)
Hoiem, Efros, Hebert (2005)
Kumar, Hebert (2005)
Choi, Lim, Torralba, Willsky (2010)
Desai, Ramanan, and Fowlkes (2009)
Heitz and Koller (2008)

NSF Frontiers in computer vision workshop, 2011

The labeling crisis

Labels in the annotated example image: SKY, TREE, PERSON, BENCH, PATH, LAKE, DUCK, SIGN, GRASS

So what does object recognition involve?

Slide by Fei-Fei, Fergus, Torralba

Verification: is that a lamp?


Detection: are there people?


Identification: is that Potala Palace?


Object categorization

mountain

tree
building
banner

street lamp

vendor
people

Scene and context categorization
•  outdoor
•  city
•  …


Is this space large or small?
How far are the buildings in the back?


Activity

What is this person doing?
What are these two doing?


What are we tuned to?

The visual system is tuned to process structures typically found in the world.

The visual system seems to be tuned to a set of images:

Demo inspired by D. Field

Remember these images
Test 2

Data
Human vision
• Many input modalities
• Active
• Supervised, unsupervised, semi-supervised
learning. It can look for supervision.

Robot vision
• Many poor input modalities
• Active, but it does not go far

Internet vision
• Many input modalities
• It can reach everywhere
• Tons of data

Active stereo with structured light

Li Zhang’s one-shot stereo

Figure labels: camera 1, projector, camera 2

Project “structured” light patterns onto the object
•  simplifies the correspondence problem
Li Zhang, Brian Curless, and Steven M. Seitz. Rapid Shape Acquisition Using Color Structured
Light and Multi-pass Dynamic Programming. In Proceedings of the 1st International
Symposium on 3D Data Processing, Visualization, and Transmission (3DPVT), Padova, Italy,
June 19-21, 2002, pp. 24-36.
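
To see why structured light simplifies correspondence, consider a deliberately simpler scheme than the color-stripe, one-shot method cited above: time-multiplexed binary Gray-code patterns. Each projector column is given a unique bit sequence, so a camera pixel can identify which projector column illuminated it just by reading its own on/off history, with no search along the epipolar line. Everything below (function names, 8 columns, 3 patterns) is an assumption for illustration.

```python
def gray_code(n):
    """Binary-reflected Gray code: neighbouring columns differ in one bit."""
    return n ^ (n >> 1)


def gray_to_binary(g):
    """Invert gray_code."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def projector_patterns(num_columns, num_bits):
    """patterns[b][col] is 1 if column col is lit in pattern b (most significant bit first)."""
    return [
        [(gray_code(col) >> (num_bits - 1 - b)) & 1 for col in range(num_columns)]
        for b in range(num_bits)
    ]


def decode_column(observed_bits):
    """Recover the projector column from the bits one camera pixel saw."""
    g = 0
    for bit in observed_bits:
        g = (g << 1) | bit
    return gray_to_binary(g)


patterns = projector_patterns(num_columns=8, num_bits=3)

# Suppose a camera pixel is lit by projector column 5 in every pattern.
observed = [pattern[5] for pattern in patterns]
decoded = decode_column(observed)

# Every column decodes back to itself, so the code is unambiguous.
all_decoded = [decode_column([p[col] for p in patterns]) for col in range(8)]
```

Once the projector column for a camera pixel is known, depth follows by triangulating the camera ray against the projector plane, exactly as in two-camera stereo; that step is omitted here.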

Slide credit: Rick Szeliski (CSE 576, Spring 2008, Stereo matching)

Willow Garage

http://www.willowgarage.com/pages/pr2/overview

Class goals

•  Vision and language

•  Vision and robotics

•  Vision and others
The strategies our visual system uses are tuned to our visual world

To provide the right vision tools for people who are not vision experts
To think about the tasks in order to find new representations
