Visual Recognition: Learning Features

                          Raquel Urtasun

                            TTI Chicago


                           Feb 21, 2012




Raquel Urtasun (TTI-C)      Visual Recognition    Feb 21, 2012   1 / 82
Learning Representations




    Sparse coding
    Deep architectures
    Topic models




   Raquel Urtasun (TTI-C)   Visual Recognition   Feb 21, 2012   2 / 82
Image Classification using BoW

              Dictionary Learning:    Dense/Sparse SIFT → K-means → Dictionary

              Image Classification:   Dense/Sparse SIFT → VQ Coding → Spatial Pyramid Pooling → Nonlinear SVM


[Source: K. Yu]
   Raquel Urtasun (TTI-C)           Visual Recognition                       Feb 21, 2012   3 / 82
BoW+SPM: the architecture

     Nonlinear SVM is not scalable
     VQ coding may be too coarse
     Average pooling is not optimal
     Why not learn the whole thing?




              Local Gradients (e.g., SIFT, HOG) → Pooling → VQ Coding → Average Pooling (obtain histogram) → Nonlinear SVM


[Source: K. Yu]
   Raquel Urtasun (TTI-C)                    Visual Recognition                        Feb 21, 2012   4 / 82
Sparse Architecture

    Nonlinear SVM is not scalable → Scalable linear classifier
    VQ coding may be too coarse → Better coding methods
    Average pooling is not optimal → Better pooling methods
    Why not learn the whole thing? → Deep learning




           Local Gradients (e.g., SIFT, HOG) → Pooling → Better Coding → Better Pooling → Scalable Linear Classifier


[Source: A. Ng]
   Raquel Urtasun (TTI-C)                Visual Recognition                      Feb 21, 2012   5 / 82
Feature learning problem
    Given a 14 × 14 image patch x, we can represent it using 196 real numbers.
    Problem: Can we learn a better representation for it?




    Given a set of images, learn a better way to represent images than raw pixels.




[Source: A. Ng]
   Raquel Urtasun (TTI-C)        Visual Recognition                 Feb 21, 2012   6 / 82
Learning an image representation



    Sparse coding [Olshausen & Field, 1996]
    Input: Images x^(1), x^(2), · · · , x^(m) (each in R^{n×n})
    Learn: Dictionary of bases φ_1, · · · , φ_k (also in R^{n×n}), so that each input x
    can be approximately decomposed as

                            x ≈ Σ_{j=1}^{k} a_j φ_j

    such that the a_j's are mostly zero, i.e., sparse.

[Source: A. Ng]




   Raquel Urtasun (TTI-C)               Visual Recognition                            Feb 21, 2012   7 / 82
Sparse Coding Illustration

         !!!!"#$%&#'!()#*+,!                 -+#&.+/!0#,+,!1!!"#"$#"!%&23!!45/*+,6!




     D+,$!+E#)A'+!

                     " 0.8 *            + 0.3 *                + 0.5 *

             !       " 0.8 *     !'%    + 0.3 *        !&(     + 0.5 *    !%'"
                 !789!89!:9!89!!"#9!89!:9!89!!"$9!89!:9!89!!"%9!:;!!
                                                                     FC)A#G$!H!+#,I'J!
                  <!7#=9!:9!#>?;!!!!1@+#$%&+!&+A&+,+.$#BC.2!!          I.$+&A&+$#0'+!


[Source: A. Ng]
   Raquel Urtasun (TTI-C)                 Visual Recognition                       Feb 21, 2012   8 / 82
More examples

    Method hypothesizes that edge-like patches are the most basic elements of a
    scene, and represents an image in terms of the edges that appear in it.
    Used to obtain a more compact, higher-level representation of the scene than
    pixels.




[Source: A. Ng]
   Raquel Urtasun (TTI-C)        Visual Recognition              Feb 21, 2012   9 / 82
Sparse Coding details


    Input: Images x^(1), x^(2), · · · , x^(m) (each in R^{n×n})
    Obtain dictionary elements and weights by

            min_{a,φ}  Σ_{i=1}^{m}  || x^(i) − Σ_{j=1}^{k} a_j^(i) φ_j ||_2^2  +  λ Σ_{j=1}^{k} |a_j^(i)|

    where the second (λ) term enforces sparsity.


    Alternating minimization with respect to the φ_j 's and the a's.
    The a-step (with the L1 term) is the harder one; the φ-step is closed form (least squares).
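
A minimal numpy sketch of this alternation (not from the slides; shapes, names and the
solve_codes placeholder are assumptions): the φ-step is plain least squares, the a-step is
delegated to an L1 solver such as the one sketched on the next slide.

import numpy as np

def dictionary_update(X, A):
    # phi-step: min_Phi ||X - A Phi||_F^2 has a closed-form least-squares solution.
    # X: (m, d) patches as rows, A: (m, k) codes. Returns Phi: (k, d) bases as rows.
    Phi, *_ = np.linalg.lstsq(A, X, rcond=None)
    Phi /= np.linalg.norm(Phi, axis=1, keepdims=True) + 1e-12   # fix the scale ambiguity
    return Phi

def sparse_coding(X, k, solve_codes, lam=0.1, n_iters=20, seed=0):
    # Alternate between the (harder) a-step and the (closed-form) phi-step.
    rng = np.random.default_rng(seed)
    Phi = rng.standard_normal((k, X.shape[1]))
    Phi /= np.linalg.norm(Phi, axis=1, keepdims=True)
    for _ in range(n_iters):
        A = solve_codes(X, Phi, lam)      # L1-regularized coding of every patch
        Phi = dictionary_update(X, A)     # least-squares dictionary refit
    return Phi, A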




   Raquel Urtasun (TTI-C)               Visual Recognition                              Feb 21, 2012   10 / 82
Fast algorithms
    Solving for a is expensive.
    Simple algorithm that works well
        Repeatedly guess sign (+, - or 0) of each of the ai ’s.
        Solve for ai ’s in closed form. Refine guess for signs.
    Other algorithms exist, such as projected gradient descent, stochastic subgradient
    descent, etc.




[Source: A. Ng]
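
The sign-guessing procedure is not spelled out here; as a hedged stand-in, a standard
iterative soft-thresholding (ISTA) solver for the same lasso problem, usable as the
a-step of the previous sketch, looks roughly like this (step size and iteration count
are assumptions):

import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_codes(X, Phi, lam, n_iters=100):
    # min_A ||X - A Phi||_F^2 + lam * |A|_1 via iterative soft-thresholding.
    # X: (m, d), Phi: (k, d); returns codes A: (m, k).
    L = np.linalg.norm(Phi @ Phi.T, 2) + 1e-12        # spectral norm sets the step size
    A = np.zeros((X.shape[0], Phi.shape[0]))
    for _ in range(n_iters):
        grad = (A @ Phi - X) @ Phi.T                  # half the gradient of the quadratic
        A = soft_threshold(A - grad / L, lam / (2 * L))
    return A

With both pieces, sparse_coding(X, k=64, solve_codes=ista_codes) runs the full alternation.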
   Raquel Urtasun (TTI-C)         Visual Recognition              Feb 21, 2012   11 / 82
Recap of sparse coding for feature learning
Training:
     Input: Images x^(1), x^(2), · · · , x^(m) (each in R^{n×n})
     Learn: Dictionary of bases φ_1, · · · , φ_k (also in R^{n×n}):

            min_{a,φ}  Σ_{i=1}^{m}  || x^(i) − Σ_{j=1}^{k} a_j^(i) φ_j ||^2  +  λ Σ_{j=1}^{k} |a_j^(i)|

Test time:
     Input: novel x^(1), x^(2), · · · , x^(m) and learned φ_1, · · · , φ_k.
     Solve for the representation a_1, · · · , a_k of each example:

            min_{a}  || x − Σ_{j=1}^{k} a_j φ_j ||^2  +  λ Σ_{j=1}^{k} |a_j|



   Raquel Urtasun (TTI-C)                         Visual Recognition                                   Feb 21, 2012   12 / 82
Sparse coding recap




    Much better than pixel representation, but still not competitive with SIFT,
    etc.
    Three ways to make it competitive:
            Combine this with SIFT.
            Advanced versions of sparse coding, e.g., LCC.
            Deep learning.

[Source: A. Ng]




   Raquel Urtasun (TTI-C)           Visual Recognition           Feb 21, 2012   13 / 82
Sparse Classification




[Source: A. Ng]
   Raquel Urtasun (TTI-C)   Visual Recognition   Feb 21, 2012   14 / 82
K-means vs sparse coding




[Source: A. Ng]
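
Roughly what the missing figure contrasts (an illustrative sketch, not the slide's code):
k-means / VQ coding assigns a one-hot vector, while sparse coding keeps a few graded
coefficients for the same input.

import numpy as np

def kmeans_code(x, Phi):
    # VQ coding: a one-hot indicator of the nearest dictionary element.
    a = np.zeros(len(Phi))
    a[np.argmin(np.linalg.norm(Phi - x, axis=1))] = 1.0
    return a

# A sparse code (e.g. ista_codes above) instead returns a vector with a handful of
# nonzero real-valued entries, which is what makes the representation richer.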
   Raquel Urtasun (TTI-C)   Visual Recognition   Feb 21, 2012   15 / 82
Why does sparse coding help classification?

     The coding is a nonlinear feature mapping
     Represent data in a higher dimensional space
     Sparsity makes prominent patterns more distinctive




[Source: K. Yu]
   Raquel Urtasun (TTI-C)         Visual Recognition      Feb 21, 2012   16 / 82
A topic model view of sparse coding


     Each basis is a direction or a topic.
     Sparsity: each datum is a linear combination of only a few bases.
     Applicable to image denoising, inpainting, and super-resolution.


[Figure: data points expressed along a few basis directions (Basis 1, Basis 2).]




                                                   Both figures adapted from CVPR10 tutorial by F.
                                                   Bach, J. Mairal, J. Ponce and G. Sapiro



[Source: K. Yu]


   Raquel Urtasun (TTI-C)               Visual Recognition                              Feb 21, 2012   17 / 82
A geometric view of sparse coding


      Each basis is somewhat like a pseudo data point (an anchor point)
     Sparsity: each datum is a sparse combination of neighbor anchors.
     The coding scheme explores the manifold structure of data


[Figure: bases act as anchor points on the data manifold.]




[Source: K. Yu]


   Raquel Urtasun (TTI-C)             Visual Recognition             Feb 21, 2012   18 / 82
Influence of Sparsity
    When SC achieves the best classification accuracy, the learned bases are like
    digits – each basis has a clear local class association.




[Figure: three sets of learned bases, with classification errors 4.54%, 3.75%, and 2.64%.]




   Raquel Urtasun (TTI-C)        Visual Recognition              Feb 21, 2012   19 / 82
The image classification setting for analysis
     Learning an image classifier is a matter of learning nonlinear functions on
     patches:

            f(x) = w^T x = Σ_{i=1}^{m} a_i (w^T φ^(i)) = Σ_{i=1}^{m} a_i f(φ^(i))
                  (f on images)                          (f on patches)

     where x = Σ_{i=1}^{m} a_i φ^(i).




                   Dense local feature → Sparse Coding → Linear Pooling → Linear SVM


[Source: K. Yu]
   Raquel Urtasun (TTI-C)                 Visual Recognition                           Feb 21, 2012   20 / 82
Nonlinear learning via local coding
      We assume x_i ≈ Σ_{j=1}^{k} a_{i,j} φ_j and thus

            f(x_i) = Σ_{j=1}^{k} a_{i,j} f(φ_j)



[Figure: a locally linear approximation of f built from data points and bases.]


[Source: K. Yu]
   Raquel Urtasun (TTI-C)                  Visual Recognition              Feb 21, 2012   21 / 82
How to learn the non-linear function

  1    Learn the dictionary φ1 , · · · , φk from unlabeled data
  2    Use the dictionary to encode data xi → ai,1 , · · · , ai,k




  3    Estimate the parameters




Nonlinear local learning via learning a global linear function
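
A minimal sketch of these three steps end to end (the encoder, dictionary size, and the
use of scikit-learn's LinearSVC are assumptions, not the slides' implementation):

import numpy as np
from sklearn.svm import LinearSVC

def encode(X, Phi, solve_codes, lam=0.1):
    # Step 2: map each x_i to its code [a_{i,1}, ..., a_{i,k}].
    return solve_codes(X, Phi, lam)

def train_pipeline(X_unlabeled, X_train, y_train, solve_codes, k=512):
    # Step 1: learn the dictionary from unlabeled data (see the earlier sketch).
    Phi, _ = sparse_coding(X_unlabeled, k, solve_codes)
    # Step 3: a *global linear* classifier on the codes; the overall function of the
    # raw input is nonlinear because the coding step is.
    clf = LinearSVC(C=1.0).fit(encode(X_train, Phi, solve_codes), y_train)
    return Phi, clf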
      Raquel Urtasun (TTI-C)           Visual Recognition            Feb 21, 2012   22 / 82
Local Coordinate Coding (LCC):

     If f (x) is (α, β)-Lipschitz smooth




[Figure: the bound splits into a function approximation error, a coding error, and a locality term.]

A good coding scheme should [Yu et al. 09]
     have a small coding error,
     and also be sufficiently local
[Source: K. Yu]
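
The missing figure annotates the three terms; a hedged paraphrase of the bound in
[Yu et al. 09], in this deck's notation, is

    | f(x) − Σ_j a_j f(φ_j) |  ≤  α || x − Σ_j a_j φ_j ||  +  β Σ_j |a_j| · || φ_j − Σ_{j'} a_{j'} φ_{j'} ||^2

where the left-hand side is the function approximation error, the first right-hand term
is the coding error, and the second is the locality term.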
   Raquel Urtasun (TTI-C)           Visual Recognition        Feb 21, 2012   23 / 82
Results

     The larger the dictionary, the higher the accuracy, but also the higher the computational cost.




                                    (Yu et al. 09)




                                   (Yang et al. 09)

     The same observation holds for Caltech256, PASCAL, and ImageNet.

[Source: K. Yu]
   Raquel Urtasun (TTI-C)         Visual Recognition              Feb 21, 2012   24 / 82
Locality-constrained linear coding




     A fast implementation of LCC [Wang et al. 10]
     Dictionary learning using k-means; the code for x is then obtained in two steps:
  Step 1 ensure locality: find the K nearest bases [φj ]j∈J(x)
  Step 2 ensure low coding error:

            min_{a}  || x − Σ_{j∈J(x)} a_{i,j} φ_j ||^2 ,    s.t.   Σ_{j∈J(x)} a_{i,j} = 1


[Source: K. Yu]
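
A small sketch of the two steps, following the closed-form solution of [Wang et al. 10]
as I understand it (the ridge term eps is an assumption for numerical stability):

import numpy as np

def llc_code(x, Phi, K=5, eps=1e-6):
    # Phi: (k, d) dictionary from k-means; returns a length-k code that is nonzero
    # only on the K nearest bases and sums to one.
    J = np.argsort(np.sum((Phi - x) ** 2, axis=1))[:K]   # Step 1: locality
    B = Phi[J]                                           # (K, d) local bases
    C = (B - x) @ (B - x).T + eps * np.eye(K)            # Step 2: local covariance
    w = np.linalg.solve(C, np.ones(K))                   # constrained least squares
    w /= w.sum()                                         # enforce the sum-to-one constraint
    a = np.zeros(len(Phi))
    a[J] = w
    return a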




   Raquel Urtasun (TTI-C)                 Visual Recognition                         Feb 21, 2012   25 / 82
Results




[Source: K. Yu]

   Raquel Urtasun (TTI-C)   Visual Recognition   Feb 21, 2012   26 / 82
Interpretation of BoW + linear classifier

                 Piecewise locally constant approximation (zero-order)

[Figure: data points and cluster centers.]



[Source: K. Yu]
   Raquel Urtasun (TTI-C)             Visual Recognition         Feb 21, 2012   27 / 82
Support vector coding [Zhou et al, 10]

                Local tangents: piecewise locally linear approximation (first-order)

[Figure: data points and cluster centers.]



[Source: K. Yu]
   Raquel Urtasun (TTI-C)         Visual Recognition          Feb 21, 2012   28 / 82
Details...


      Let [a_{i,1} , · · · , a_{i,k} ] be the VQ coding of x_i




[Figure: super-vector codes of the data, paired with global linear weights to be learned.]


     No.1 position in PASCAL VOC 2009

[Source: K. Yu]


   Raquel Urtasun (TTI-C)               Visual Recognition          Feb 21, 2012   29 / 82
Summary of coding algorithms

     All lead to higher-dimensional, sparse, and localized coding
     All explore geometric structure of data
     New coding methods are suitable for linear classifiers.
     Their implementations are quite straightforward.




[Source: K. Yu]
   Raquel Urtasun (TTI-C)          Visual Recognition               Feb 21, 2012   30 / 82
PASCAL VOC 2009
      No.1 for 18 of 20 categories
      PASCAL: 20 categories, 10,000 images
       They use only HOG features on gray images

[Table: per-class results with columns Classes | Ours | Best of Other Teams | Difference.]

[Source: K. Yu]
    Raquel Urtasun (TTI-C)            Visual Recognition                    Feb 21, 2012   31 / 82
ImageNet Large-scale Visual Recognition Challenge 2010




                      ImageNet: 1000 categories, 1.2 million images for training


[Source: K. Yu]
   Raquel Urtasun (TTI-C)                 Visual Recognition                       Feb 21, 2012   32 / 82
ImageNet Large-scale Visual Recognition Challenge 2010

     150 teams registered worldwide, resulting in 37 submissions from 11 teams
     SC achieved 52% accuracy on the 1000-class classification task.


             Teams                                              Top 5 Hit Rate
             Our team: NEC-UIUC                                     72.8%
             Xerox European Lab, France                             66.4%
             Univ. of Tokyo                                         55.4%
             Univ. of California, Irvine                            53.4%
             MIT                                                    45.6%
             NTU, Singapore                                         41.7%
             LIG, France                                            39.3%
              IBM T. J. Watson Research Center                       30.0%
             National Institute of Informatics, Tokyo               25.8%
             SRI (Stanford Research Institute)                      24.9%


[Source: K. Yu]
   Raquel Urtasun (TTI-C)                  Visual Recognition                    Feb 21, 2012   33 / 82
Learning Codebooks for Image Classification




Replacing Vector Quantization by Learned Dictionaries
     unsupervised: [Yang et al., 2009]
     supervised: [Boureau et al., 2010, Yang et al., 2010]
[Source: Mairal]



   Raquel Urtasun (TTI-C)          Visual Recognition        Feb 21, 2012   34 / 82
Discriminative Training




[Source: Mairal]
   Raquel Urtasun (TTI-C)   Visual Recognition   Feb 21, 2012   35 / 82
Discriminative Dictionaries




Figure: Top: reconstructive, Bottom: discriminative, Left: Bicycle, Right:
Background

[Source: Mairal]
   Raquel Urtasun (TTI-C)          Visual Recognition             Feb 21, 2012   36 / 82
Application: Edge detection




[Source: Mairal]


   Raquel Urtasun (TTI-C)   Visual Recognition   Feb 21, 2012   37 / 82
Application: Edge detection




[Source: Mairal]


   Raquel Urtasun (TTI-C)   Visual Recognition   Feb 21, 2012   38 / 82
Application: Authentic from fake




[Source: Mairal]
   Raquel Urtasun (TTI-C)   Visual Recognition   Feb 21, 2012   39 / 82
Predictive Sparse Decomposition (PSD)


     Feed-forward predictor function for feature extraction:

        min_{a,φ,K}  Σ_{i=1}^{m}  || x^(i) − Σ_{j=1}^{k} a_j^(i) φ_j ||^2  +  λ Σ_{j=1}^{k} |a_j^(i)|  +  β || a − C(X, K) ||_2^2

     with e.g., C(X, K) = g · tanh(x ∗ k)

Learning is done by
 1) Fix K and a, minimize to get optimal φ
 2) Update a and K using optimal φ
 3) Scale elements of φ to be unit norm.
[Source: Y. LeCun]
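
A rough sketch of the predictor and the objective for a single patch (the patch-level
encoder form g * tanh(W x + b) and all shapes are assumptions in the spirit of the
formula above, not the exact PSD implementation):

import numpy as np

def psd_encoder(x, W, b, g):
    # Feed-forward predictor C(x; K) with trainable K = (W, b, g).
    return g * np.tanh(W @ x + b)

def psd_objective(x, a, Phi, W, b, g, lam=0.1, beta=1.0):
    # ||x - Phi^T a||^2 + lam*|a|_1 + beta*||a - C(x; K)||^2 for one patch.
    # Phi: (k, d) dictionary, a: (k,) code, x: (d,) patch.
    recon = Phi.T @ a
    pred = psd_encoder(x, W, b, g)
    return (np.sum((x - recon) ** 2)
            + lam * np.sum(np.abs(a))
            + beta * np.sum((a - pred) ** 2))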


   Raquel Urtasun (TTI-C)                      Visual Recognition                      Feb 21, 2012   40 / 82
Predictive Sparse Decomposition (PSD)

    12 × 12 natural image patches with 256 dictionary elements




                            Figure: (Left) Encoder, (Right) Decoder


[Source: Y. LeCun]
   Raquel Urtasun (TTI-C)                Visual Recognition           Feb 21, 2012   41 / 82
Training Deep Networks with PSD




    Train layer-wise [Hinton, 06]
            C (X , K^1 )
            C (f (K^1 ), K^2 )
            ···
    Each layer is trained on the output f (x) produced by the previous layer.
    f is a series of non-linearities and pooling operations.

[Source: Y. LeCun]




   Raquel Urtasun (TTI-C)           Visual Recognition           Feb 21, 2012   42 / 82
Object Recognition - Caltech 101




[Source: Y. LeCun]
   Raquel Urtasun (TTI-C)   Visual Recognition   Feb 21, 2012   43 / 82
Problems with sparse coding



    Sparse coding produces filters that are shifted versions of each other
    It ignores that it is going to be used in a convolutional fashion
    Inference is run on all overlapping patches independently

Problems with sparse coding
 1) the representations are redundant, as the training and inference are done at
    the patch level
 2) inference for the whole image is computationally expensive




   Raquel Urtasun (TTI-C)         Visual Recognition               Feb 21, 2012   44 / 82
Solutions via Convolutional Nets
Problems
 1) the representations are redundant, as the training and inference are done at
    the patch level
 2) inference for the whole image is computationally expensive
Solutions
 1) Apply sparse coding to the entire image at once, with the dictionary as a
    convolutional filter bank
            L(x, z, D) = (1/2) || x − Σ_{k=1}^{K} D_k ∗ z_k ||_2^2 + |z|_1

      with D_k an s × s 2D filter, x a w × h image, and z_k a 2D
      (w + s − 1) × (h + s − 1) feature map.
 2) Use feed-forward, non-linear encoders to produce a fast approx. to the
    sparse code
            L(x, z, D, W) = (1/2) || x − Σ_{k=1}^{K} D_k ∗ z_k ||_2^2 + Σ_{k=1}^{K} || z_k − f(W^k ∗ x) ||_2^2 + |z|_1
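
A sketch of evaluating the first objective with scipy (a hedged illustration: the 'valid'
convolution of a (w + s − 1) × (h + s − 1) feature map with an s × s filter gives back a
w × h image; the encoder term of the second formula would simply add Σ_k ||z_k − f(W^k ∗ x)||^2):

import numpy as np
from scipy.signal import convolve2d

def conv_sc_loss(x, z, D, lam=1.0):
    # 0.5 * ||x - sum_k D_k * z_k||_2^2 + lam * |z|_1 for one image.
    # x: (w, h) image, D: list of (s, s) filters, z: matching list of
    # (w + s - 1, h + s - 1) feature maps.
    recon = sum(convolve2d(zk, Dk, mode='valid') for zk, Dk in zip(z, D))
    return 0.5 * np.sum((x - recon) ** 2) + lam * sum(np.sum(np.abs(zk)) for zk in z)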
   Raquel Urtasun (TTI-C)                   Visual Recognition                         Feb 21, 2012    45 / 82
Convolutional Nets




Figure: Image from Kavukcuoglu et al. 10. (left) Dictionary from sparse coding.
(right) Dictionary learned from conv. nets.




   Raquel Urtasun (TTI-C)         Visual Recognition             Feb 21, 2012   46 / 82
Stacking
   One or more additional stages can be stacked on top of the first one.
   Each stage then takes the output of its preceding stage as input and
   processes it using the same series of operations with different architectural
   parameters like size and connections.




             Figure: Image from Kavukcuoglu et al. 10. Second layer.

  Raquel Urtasun (TTI-C)          Visual Recognition             Feb 21, 2012   47 / 82
Convolutional Nets




[Source: Y. LeCun]
   Raquel Urtasun (TTI-C)   Visual Recognition   Feb 21, 2012   48 / 82
Convolutional Nets




[Source: H. Lee]
   Raquel Urtasun (TTI-C)   Visual Recognition   Feb 21, 2012   49 / 82
INRIA pedestrians




             Figure: Image from Kavukcuoglu et al. 10. Second layer.


  Raquel Urtasun (TTI-C)          Visual Recognition             Feb 21, 2012   50 / 82
Many more deep architectures




    Restricted Boltzmann machines (RBMs)
   Deep belief nets
   Deep Boltzmann machines
   Stacked denoising auto-encoders
   ···




  Raquel Urtasun (TTI-C)       Visual Recognition   Feb 21, 2012   51 / 82
Topic Models




Raquel Urtasun (TTI-C)   Visual Recognition   Feb 21, 2012   52 / 82
Topic Models

     Topic models have become a powerful unsupervised learning tool for text
     summarization, document classification, information retrieval, link
     prediction, and collaborative filtering.
     Topic modeling introduces latent topic variables in text and images that
     reveal the underlying structure.
     A topic is a group of related words.




[Source: J. Zen]
   Raquel Urtasun (TTI-C)         Visual Recognition              Feb 21, 2012   53 / 82
Probabilistic Topic Modeling



     Treat data as observations that arise from a generative model composed of
     latent variables.
            The latent variables reflect the semantic structure of the document.
     Infer the latent structure using posterior inference (MAP):
            What are the topics that summarize the document network?
     Predict new data using the estimated topic model
             How does the new data fit into the estimated topic structure?

[Source: J. Zen]




   Raquel Urtasun (TTI-C)           Visual Recognition             Feb 21, 2012   54 / 82
Latent Dirichlet Allocation (LDA) [Blei et al. 03]




     Simple intuition: a document exhibits multiple topics

[Source: J. Zen]
   Raquel Urtasun (TTI-C)         Visual Recognition         Feb 21, 2012   55 / 82
Generative models




     Each document is a mixture of corpus-wide topics.
     Each word is drawn from one of those topics.

[Source: J. Zen]
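
A tiny numpy sketch of that generative story (vocabulary size and the symmetric
Dirichlet hyperparameters are assumptions):

import numpy as np

def lda_generate(n_docs=5, doc_len=50, n_topics=3, vocab=1000,
                 alpha=0.1, eta=0.01, seed=0):
    # Corpus-wide topics, then per-document mixtures, then one topic per word.
    rng = np.random.default_rng(seed)
    topics = rng.dirichlet([eta] * vocab, size=n_topics)     # word distribution per topic
    docs = []
    for _ in range(n_docs):
        theta = rng.dirichlet([alpha] * n_topics)            # this document's topic mixture
        z = rng.choice(n_topics, size=doc_len, p=theta)      # topic assignment per word
        w = [rng.choice(vocab, p=topics[t]) for t in z]      # word drawn from that topic
        docs.append(np.array(w))
    return docs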
   Raquel Urtasun (TTI-C)        Visual Recognition      Feb 21, 2012   56 / 82
The posterior distribution (inference)




     Observations include documents and their words. Our goal is to infer the
     underlying topic structures marked by ? from observations.

[Source: J. Zen]
   Raquel Urtasun (TTI-C)         Visual Recognition             Feb 21, 2012   57 / 82
What does this mean with visual words?




[Source: B. Schiele]
   Raquel Urtasun (TTI-C)   Visual Recognition   Feb 21, 2012   58 / 82
How to read Graphical Models




[Source: B. Schiele]
   Raquel Urtasun (TTI-C)   Visual Recognition   Feb 21, 2012   59 / 82
LDA




[Source: J. Zen]
   Raquel Urtasun (TTI-C)   Visual Recognition   Feb 21, 2012   60 / 82
LDA generative process




[Source: J. Zen]
   Raquel Urtasun (TTI-C)   Visual Recognition   Feb 21, 2012   61 / 82
LDA as matrix factorization




[Source: B. Schiele]
   Raquel Urtasun (TTI-C)   Visual Recognition   Feb 21, 2012   62 / 82
LDA as matrix factorization




[Source: B. Schiele]
   Raquel Urtasun (TTI-C)   Visual Recognition   Feb 21, 2012   62 / 82
Dirichlet Distribution
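
For intuition, a quick numpy experiment (parameter values are arbitrary): samples live on
the probability simplex, concentrate near the center for large α and spread toward the
corners for small α.

import numpy as np

rng = np.random.default_rng(0)
for alpha in (0.1, 1.0, 10.0):
    samples = rng.dirichlet([alpha] * 3, size=5)   # five points on the 3-simplex
    print(alpha, samples.round(2))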




   Raquel Urtasun (TTI-C)   Visual Recognition   Feb 21, 2012   63 / 82
LDA inference




[Source: J. Zen]
   Raquel Urtasun (TTI-C)   Visual Recognition   Feb 21, 2012   64 / 82
LDA: intractable exact inference




[Source: J. Zen]


   Raquel Urtasun (TTI-C)   Visual Recognition   Feb 21, 2012   65 / 82
LDA: approximate inference




[Source: J. Zen]
   Raquel Urtasun (TTI-C)   Visual Recognition   Feb 21, 2012   66 / 82
Computer Vision Applications




   Natural Scene Categorization [Fei-Fei and Perona, 05]
   Object Recognition [Sivic et al. 05]
    Object Recognition by modeling the distribution over patches [Fritz et al.]
   Activity Recognition [Niebles et al. ]
   Text and image [Saenko et al. ]
   ···




  Raquel Urtasun (TTI-C)          Visual Recognition            Feb 21, 2012   67 / 82
Natural Scene Categorization




     A different model is learned for each class.

[Source: J. Zen]
   Raquel Urtasun (TTI-C)          Visual Recognition   Feb 21, 2012   68 / 82
Natural Scene Categorization




    In reality the model is a bit more complicated, as it also reasons about the class
    [Fei-Fei et al., 05].

   Raquel Urtasun (TTI-C)          Visual Recognition               Feb 21, 2012   69 / 82
Natural Scene Categorization




   Raquel Urtasun (TTI-C)   Visual Recognition   Feb 21, 2012   70 / 82
Classification Results




   Raquel Urtasun (TTI-C)   Visual Recognition   Feb 21, 2012   71 / 82
More Results




  Raquel Urtasun (TTI-C)   Visual Recognition   Feb 21, 2012   72 / 82
Object Recognition [Sivic et al. 05]




     Matrix Factorization view

[Source: B. Schiele]
   Raquel Urtasun (TTI-C)        Visual Recognition   Feb 21, 2012   73 / 82
Most likely words given topic




[Source: B. Schiele]


   Raquel Urtasun (TTI-C)   Visual Recognition   Feb 21, 2012   74 / 82
Most likely words given topic




[Source: B. Schiele]


   Raquel Urtasun (TTI-C)   Visual Recognition   Feb 21, 2012   74 / 82
Object Recognition [Sivic et al. 05]




     Matrix Factorization view

[Source: B. Schiele]
   Raquel Urtasun (TTI-C)        Visual Recognition   Feb 21, 2012   75 / 82
Object Recognition [Sivic et al. 05]




[Source: B. Schiele]


   Raquel Urtasun (TTI-C)   Visual Recognition   Feb 21, 2012   76 / 82
Object Recognition [Sivic et al. 05]




[Source: B. Schiele]


   Raquel Urtasun (TTI-C)   Visual Recognition   Feb 21, 2012   76 / 82
Object Recognition [Fritz et al.]




[Source: B. Schiele]
   Raquel Urtasun (TTI-C)   Visual Recognition   Feb 21, 2012   77 / 82
Learning of Generative Decompositions [Fritz et al.]




[Source: B. Schiele]
   Raquel Urtasun (TTI-C)   Visual Recognition   Feb 21, 2012   78 / 82
Topics [Fritz et al.]




[Source: B. Schiele]
   Raquel Urtasun (TTI-C)   Visual Recognition   Feb 21, 2012   79 / 82
Supervised Classification [Fritz et al.]




[Source: B. Schiele]
   Raquel Urtasun (TTI-C)   Visual Recognition   Feb 21, 2012   80 / 82
UIUC Cars




[Source: B. Schiele]
   Raquel Urtasun (TTI-C)   Visual Recognition   Feb 21, 2012   81 / 82
Extensions




    Introduce supervision, e.g., discriminative LDA, supervised LDA
    Introduce spatial information
    Multi-view approaches, e.g., correlated LDA, Pachinko allocation
    Use infinite mixtures, e.g., HDP, PY processes, to model heavy tails.
    And many more.




   Raquel Urtasun (TTI-C)           Visual Recognition          Feb 21, 2012   82 / 82
