Lec07 aggregation-and-retrieval-system

Image Analysis & Retrieval
CS/EE 5590 Special Topics (Class Ids: 44873, 44874)
Fall 2016, M/W 4-5:15pm@Bloch 0012
Lec 07
Feature Aggregation and Image Retrieval System
Zhu Li
Dept of CSEE, UMKC
Office: FH560E, Email: lizhu@umkc.edu, Ph: x 2346.
http://l.web.umkc.edu/lizhu
Image Analysis & Retrieval, 2016

Outline
 ReCap of Lecture 06
 SIFT
 Box Filter
 Image Retrieval System
 Why Aggregation ?
 Aggregation Schemes
 Summary

Scale Space Theory - Lindeberg
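 Scale Space Response via Laplacian of Gaussian (LoG):
$\nabla^2 g = \frac{\partial^2 g}{\partial x^2} + \frac{\partial^2 g}{\partial y^2}, \quad g = e^{-\frac{x^2 + y^2}{2\sigma^2}}$
 The scale is controlled by 𝜎
 Characteristic Scale:
[Figure: LoG response of an image blob of radius r at scales 𝜎 = 0.8r, 1.2r, 2r, …; the characteristic scale is marked]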

SIFT
 Use DoG to approximate LoG
 Separable Gaussian filter
 Difference of image instead of difference of Gaussian kernel
[Figure: scale space construction by Gaussian filtering and image difference, approximating LoG]

Peak Strength & Edge Removal
 Peak Strength:
 Interpolate the true DoG response and pixel location by Taylor expansion
 Edge Removal:
 Re-run a Harris-type detection on the much-reduced pixel set to remove edge responses

Rotation Invariance through Dominant Orientation Coding
 Voting for the dominant orientation
 Weighted by a Gaussian window to give more emphasis to the
gradients closer to the center

SIFT Matching and Repeatability Prediction
 SIFT Distance: $\frac{d(s^1_1, s^2_{k^*})}{d(s^1_1, s^2_k)} \le \theta$
 Not all SIFT are created equal…
 Peak strength (DoG response at interpolated position)
[Figure: combined scale/peak strength pmf]
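The distance ratio criterion on this slide can be sketched as below. This is a minimal illustration, assuming descriptors are stored one per row and that k* is the nearest candidate and k the next nearest; the toy data, the variable names and theta = 0.8 are placeholders, not values from the lecture.

```matlab
% Ratio test sketch: s1 is one query descriptor (1 x d), S2 holds the
% candidate descriptors of the second image (n x d). Toy data only.
s1 = [1 0 0 0];
S2 = [0.9 0.1 0 0;    % close to s1
      0   1   0 0;    % far from s1
      0   0   1 0];   % far from s1
theta = 0.8;          % assumed threshold, to be tuned per data set

% Euclidean distance from s1 to every row of S2
d = sqrt(sum(bsxfun(@minus, S2, s1).^2, 2));
[ds, order] = sort(d, 'ascend');

% nearest distance divided by next-nearest distance
ratio = ds(1) / ds(2);
is_match = ratio <= theta;   % accept only distinctive matches
best_k = order(1);           % index of the nearest candidate
```

A small ratio means the nearest candidate is much closer than any competitor, which is why such matches are considered more reliable.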

Box Filter – CABOX work
 Basic Idea:
 Approximate DoG with linear combination of box filters
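$\min_{\boldsymbol{h}} \|\boldsymbol{g} - B \cdot \boldsymbol{h}\|_{L_2}^2 + \|\boldsymbol{h}\|_{L_1}$
 Solution by LASSO
[Figure: DoG kernel = h1 × box filter 1 + h2 × box filter 2 + …]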

Outline
 SIFT
 Box Filter
 Summary

Image Matching/Retrieval System
SIFT is a sub-image level feature; what we actually care about is how SIFT matches translate into image-level matching/retrieval accuracy.
Say we can compute a single distance from a collection of features:
$d(I_1, I_2) = \sum_k \alpha_k \, d(F^1_k, F^2_k)$
 Then for a database of n images, we can compute an n x n distance matrix:
$D_{j,k} = d(I_j, I_k)$
 This gives us full information on the performance of this feature/distance system
 How do we characterize the performance of such an image matching and retrieval system?
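As a rough sketch of how the weighted image distance and the n x n distance matrix fit together, the snippet below uses Euclidean distance for each per-feature term. The cell array F, the weights alpha and all numbers are made-up placeholders; the lecture does not prescribe a particular per-feature distance.

```matlab
% Toy setup: 3 images, K = 2 feature types, one row per image.
F = cell(1, 2);
F{1} = [0 0; 1 0; 0 2];       % feature type 1 (2-D)
F{2} = [1; 1; 3];             % feature type 2 (1-D)
alpha = [0.5 0.5];            % assumed weights alpha_k

n = size(F{1}, 1);
D = zeros(n, n);
for k = 1:numel(F)
    for j = 1:n
        for i = 1:n
            % accumulate alpha_k * d(F_k of image j, F_k of image i)
            D(j, i) = D(j, i) + alpha(k) * norm(F{k}(j, :) - F{k}(i, :));
        end
    end
end
```

D is symmetric with a zero diagonal, so only the upper triangle carries independent pair distances.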

Thresholding for Matching
 Basically, for any pair of images (documents, in IR jargon), we declare:
$I_j, I_k$ are a match if $d(I_j, I_k) < t$; otherwise $I_j, I_k$ are not a match
 Then for each possible image pair, or each pair we care about, for a given threshold t, there are 4 possible outcomes:
 TP pair: {Ij, Ik} is a true matching pair and is declared matching, d(Ij, Ik) < t;
 FP pair: {Ij, Ik} is a true non-matching pair but is declared matching, d(Ij, Ik) < t;
 TN pair: {Ij, Ik} is a true non-matching pair and is declared non-matching, d(Ij, Ik) >= t;
 FN pair: {Ij, Ik} is a true matching pair but is declared non-matching, d(Ij, Ik) >= t;
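To make the four outcomes concrete, here is a small sketch that counts them for one threshold. The distance matrix D, the ground-truth matrix G and the threshold t are toy values chosen for illustration only.

```matlab
% D: n x n distance matrix, G: ground truth (true = pair really matches).
D = [0   0.2 0.9;
     0.2 0   0.4;
     0.9 0.4 0  ];
G = logical([0 1 0;
             1 0 0;
             0 0 0]);
t = 0.5;                              % assumed threshold

pairs = triu(true(size(D)), 1);       % each unordered pair once, no diagonal
declared = D < t;                     % decision rule: match if d < t

tp = sum( declared(pairs) &  G(pairs));   % true match, declared match
fp = sum( declared(pairs) & ~G(pairs));   % non-match, declared match
tn = sum(~declared(pairs) & ~G(pairs));   % non-match, declared non-match
fn = sum(~declared(pairs) &  G(pairs));   % true match, declared non-match
```

The four counts always add up to the number of pairs examined, here n(n-1)/2 = 3.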

Matching System Performance
 True Positive Rate (Recall):
 Out of all true matching pairs, how many are retrieved as matching pairs
 i.e. the true matching pairs with distance < t
$TPR = \frac{tp}{tp + fn}$
 False Positive Rate:
 Out of all true non-matching pairs, how many are falsely retrieved as matching pairs
$FPR = \frac{fp}{fp + tn}$

TPR-FPR
Definition:
TP rate = TP/(TP+FN)
FP rate = FP/(FP+TN)
Both rates are defined from the actual value point of view.

ROC curve(1)
ROC = receiver operating characteristic
Y:TP rate
X:FP rate

ROC curve(2)
Which method (A or B) is better?
compute ROC area: area under ROC
curve
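One way to compare two methods by ROC area is the trapezoidal rule over their operating points. The sketch below uses made-up (FPR, TPR) points for two hypothetical methods A and B; only `trapz` from base Matlab is used.

```matlab
% Operating points of two hypothetical methods, sorted by increasing FPR.
% All values are invented for illustration.
fpr   = [0 0.1 0.3 0.6 1];
tpr_a = [0 0.6 0.8 0.9 1];
tpr_b = [0 0.3 0.5 0.8 1];

% Area under the ROC curve by the trapezoidal rule
auc_a = trapz(fpr, tpr_a);
auc_b = trapz(fpr, tpr_b);
a_is_better = auc_a > auc_b;   % larger area = better overall trade-off
```

An area of 0.5 corresponds to random guessing and 1.0 to a perfect separation, so the method with the larger area is preferred when no single operating point is fixed in advance.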

Precision, Recall, F-measure
Precision = TP/(TP + FP),
Recall = TP/(TP + FN)
F-measure = 2*(precision*recall)/(precision + recall)
Precision: the probability that a retrieved document is relevant.
Recall: the probability that a relevant document is retrieved in a search.
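A short worked example of the three formulas, using invented counts (tp, fp and fn below are placeholders, not results from the course data):

```matlab
% Made-up counts at one operating point
tp = 8; fp = 2; fn = 4;

precision = tp / (tp + fp);        % 8 of 10 retrieved pairs are correct
recall    = tp / (tp + fn);        % 8 of 12 true pairs are retrieved
f_measure = 2 * precision * recall / (precision + recall);
```

The F-measure is the harmonic mean of precision and recall, so it stays low unless both are reasonably high.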

Matlab Implementation
 We will compute all image pair distances D(j,k)
 How do we compute the TPR-FPR plot?
 Understand that TPR and FPR are actually functions of the threshold t
 We just need to parameterize TPR(t) and FPR(t), and obtain operating points at meaningful thresholds, to generate the plot.
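 Matlab Implementation:
function [tp, fp, tn, fn] = getPrecisionRecall(d0, d1, npt, dbg)
% d0: distances of true matching pairs, d1: distances of non-matching pairs
% npt: number of thresholds (>= 2), dbg: plot the TPR-FPR curve if true
% (the argument list is an assumption; the slide shows getPrecisionRecall() only)
d_min = min([d0(:); d1(:)]);
d_max = max([d0(:); d1(:)]);
delta = (d_max - d_min) / (npt - 1);   % last threshold reaches d_max
tp = zeros(1, npt); fp = tp; tn = tp; fn = tp;
for k=1:npt
    thres = d_min + (k-1)*delta;
    tp(k) = sum(d0(:) <= thres);   % matching pairs declared matching
    fp(k) = sum(d1(:) <= thres);   % non-matching pairs declared matching
    tn(k) = sum(d1(:) > thres);    % non-matching pairs declared non-matching
    fn(k) = sum(d0(:) > thres);    % matching pairs declared non-matching
end
if dbg
    figure(22); grid on; hold on;
    plot(fp./(tn+fp), tp./(tp+fn), '.-r', 'DisplayName', 'tpr-fpr'); legend('show');
end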

TPR-FPR
 Image matching performance is characterized by the function
 TPR(FPR)
 Retrieval set: we want high precision; short list: we want high recall.

Outline
 SIFT
 Box Filter
 Summary

Why Aggregation ?
 What do (local) interest point features bring us?
 Scale and rotation invariance, in the form of an nk x d matrix:
$S_k \mid [x_k, y_k, \theta_k, \sigma_k, h_1, h_2, \ldots, h_{128}], \; k = 1 \ldots n$
 Uncertainty in the number of detected features nk at query time
 Any permutation of the feature rows gives the same representation.
 Problems:
 The feature has state, so we are not able to draw decision boundaries
 Not directly indexable/hashable
 Typically very high dimensionality

Decision Boundary in Matching
 Can we have a decision boundary function for an interest point based representation?

Curse of Dimensionality in Retrieval
 What do feature dimensions do to retrieval efficiency?
 Consider a retrieval that covers 99% of the range in each dimension, and plot the total volume covered against the number of dimensions.
 Matlab: showDimensionCurse.m
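The effect can be reproduced with a few lines. This is not the course script showDimensionCurse.m, only a minimal sketch under the assumption that the data fill a unit hypercube and that the query keeps 99% of the range along every axis.

```matlab
d = [1 2 10 50 100 500];      % feature dimensions to try
coverage = 0.99;              % fraction of the range kept per dimension
volume = coverage .^ d;       % fraction of the unit hypercube covered

for i = 1:numel(d)
    fprintf('d = %4d: volume fraction = %.4f\n', d(i), volume(i));
end
```

Even with 99% locality per dimension, the covered volume shrinks geometrically with d, which is why range-style pruning loses its power on high-dimensional features.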

Aggregation – 30,000ft view
 Bag of Words
 Compute k centroids in feature space, called visual words
 Compute histogram
 k x1 feature, hard assignment
 VLAD
 Compute centroids in feature space
 Compute aggregated difference w.r.t the centroids
 k x d feature, soft assignment
 Fisher Vector
 Compute a Gaussian Mixture Model (GMM) with 2nd order info
 Compute the aggregated feature w.r.t the mean and covariance of
GMM
 2 x k x d feature
 AKULA
 Adaptive centroids and feature count
 Improved with covariance ?
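The Bag of Words entry above (centroids, hard assignment, k x 1 histogram) can be sketched as follows. The descriptors X and centroids C are toy 2-D values; in practice C would come from k-means on training descriptors, which is assumed here rather than computed.

```matlab
% Toy 2-D descriptors (one per row) and k = 3 visual words.
X = [0 0; 0.1 0; 1 1; 0.9 1.1; 5 5];
C = [0 0; 1 1; 5 5];              % assumed centroids, e.g. from k-means
k = size(C, 1);

% squared distance from every descriptor to every centroid (n x k)
D2 = zeros(size(X, 1), k);
for j = 1:k
    D2(:, j) = sum(bsxfun(@minus, X, C(j, :)).^2, 2);
end

[~, word] = min(D2, [], 2);       % hard assignment to the nearest word
h = accumarray(word, 1, [k 1]);   % k x 1 histogram of word counts
h = h / sum(h);                   % normalize to a distribution
```

Whatever the number of descriptors in the image, the result is a fixed-length k x 1 vector.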

Visual Key Words: main idea
Extract some local features from a number of images …
e.g., SIFT descriptor space: each point is 128-dimensional
Slide credit: D. Nister

Visual Key Words: main idea
Slide credit: D. Nister

Visual words: main idea

Visual Key Words
Each point is a local
descriptor, e.g. SIFT
vector.

Visual words
Example: each group of patches belongs to the
same visual word
Figure from Sivic & Zisserman, ICCV 2003

Visual words
Source credit: K. Grauman, B. Leibe
• More recently used for describing scenes and
objects for the sake of indexing or classification.
Sivic & Zisserman 2003;
Csurka, Bray, Dance, & Fan
2004; many others.

Object Bag of ‘words’
ICCV 2005 short course, L. Fei-Fei
Bag of Words

BoW Examples
 Illustration

Bags of visual words
Summarize entire image based on its distribution
(histogram) of word occurrences.
Analogous to bag of words representation
commonly used for documents.
Image credit: Fei-Fei Li

Texture Retrieval
Textons…
Universal texton dictionary
histogram
Source: Lana Lazebnik

BoW Distance Metrics
Rank images by normalized scalar product
between their (possibly weighted) occurrence
counts---nearest neighbor search for similar
images.
[Figure: example occurrence-count vectors [5 1 1 0] and [1 8 1 4] for a database image dj and a query q]
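The normalized scalar product can be computed directly on the two count vectors shown on the slide. Which vector is dj and which is q is not recoverable from the slide text, so the assignment below is arbitrary; the similarity is symmetric, so the result does not depend on it.

```matlab
dj = [5 1 1 0];    % occurrence counts of a database image (assumed labeling)
q  = [1 8 1 4];    % occurrence counts of the query (assumed labeling)

% normalized scalar product (cosine similarity) between the histograms
sim = dot(dj, q) / (norm(dj) * norm(q));
```

Ranking the database by decreasing sim gives the nearest-neighbor list for the query.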

Inverted List
 Image Retrieval via Inverted List
Image credit: A. Zisserman
[Figure: inverted list mapping each visual word number to a list of image numbers]
When will this give us a significant gain in efficiency?

Indexing local features: inverted file index
For text documents, an efficient way to find all pages on which a word occurs is to use an index…
We want to find all images in which a feature occurs.
We need to index each feature by the images it appears in, and also keep the number of occurrences.
Source credit : K. Grauman, B. Leibe
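A minimal sketch of such an inverted file, assuming the database is given as a count matrix H with one row per image and one column per visual word (toy numbers, hypothetical variable names):

```matlab
% H(i, w) = number of times visual word w occurs in image i
H = [2 0 1 0;
     0 3 0 0;
     1 0 0 4];
[n, k] = size(H);

% build: one list per word, each row = [image number, occurrence count]
inv_list = cell(k, 1);
for w = 1:k
    imgs = find(H(:, w) > 0);
    inv_list{w} = [imgs, H(imgs, w)];
end

% query with words 1 and 3: only images on those lists are touched
query_words = [1 3];
hit = false(n, 1);
for w = query_words
    hit(inv_list{w}(:, 1)) = true;
end
candidates = find(hit);
```

The gain comes from sparsity: image 2 shares no word with the query and is never visited, so the cost scales with the length of the touched lists rather than with the database size.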

TF-IDF Weighting
Term Frequency – Inverse Document Frequency
 Describe image by frequency of each visual word within
it, down-weight words that appear often in the database
(Standard weighting for text retrieval)
$t_i = \frac{n_{id}}{n_d} \log \frac{N}{n_i}$
 $n_{id}$: number of occurrences of word i in document d
 $n_d$: number of words in document d
 $n_i$: number of occurrences of word i in whole database
 $N$: total number of words in database
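The weighting can be sketched on a small count matrix. The quantities follow the four labels on this slide (word counts per document and per database); the symbol names n_d, n_i and N are introduced here for illustration, and H is toy data.

```matlab
% H(d, i): count of visual word i in document (image) d
H = [2 0 1 0;
     0 3 0 0;
     1 0 0 4];

n_d = sum(H, 2);               % number of words in each document
n_i = sum(H, 1);               % occurrences of word i in the whole database
N   = sum(H(:));               % total number of words in the database

tf  = bsxfun(@rdivide, H, n_d);        % term frequency
idf = log(N ./ max(n_i, 1));           % guard against words that never occur
W   = bsxfun(@times, tf, idf);         % TF-IDF weighted histograms
```

Words that occur often in the database get a small idf and so contribute little to the weighted histogram.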

BoW Use Case with Spatial Localization
Collecting words within a query region
Query region:
pull out only the SIFT
descriptors whose
positions are within the
polygon

BoW Patch Search
Localizing the BoW representation

Localization with BoW

Hierarchical Assignment of Histogram
Tree construction:
[Nister & Stewenius, CVPR’06]

Vocabulary Tree
Training: Filling the tree

Vocabulary Tree
Slide credit: David Nister

Vocabulary Tree

Vocabulary Tree

Vocabulary Tree
Recognition
RANSAC
verification

Vocabulary Tree: Performance
Evaluated on large databases
 Indexing with up to 1M images
Online recognition for database
of 50,000 CD covers
 Retrieval in ~1s
Find experimentally that large vocabularies can be
beneficial for recognition

Visual Word Vocabulary Size
 Performance w.r.t vocabulary size
Larger vocabularies can be advantageous… But what happens if it is too large?

Bags of words: pros and cons
Good:
+ flexible to geometry / deformations / viewpoint
+ compact summary of image content
+ provides vector representation for sets
+ Inverted List implementation offers practical solution
against large repository
Bad:
- Loss of information at quantization and histogram generation
- basic model ignores geometry – must verify afterwards,
or encode via features
- background and foreground mixed when bag covers
whole image
- interest points or sampling: no guarantee to capture
object-level parts
Source credit: K. Grauman, B. Leibe

Can we improve BoW ?
• E.g. Why isn’t our Bag of Words classifier at 90%
instead of 70%?
• Training Data
– Huge issue, but not necessarily a variable you can manipulate.
• Learning method
– BoW is on top of any feature scheme
• Representation
– Are we losing too much info in the process ?

Standard Kmeans Bag of Words
 BoW revisited
http://www.cs.utexas.edu/~grauman/courses/fall2009/papers/bag_of_visual_words.pdf

Motivation
Bag of Visual Words is only about counting the number
of local descriptors assigned to each Voronoi region
Why not include other statistics/information?

We already looked at the Spatial Pyramid/Pooling
Spatial Pooling
level 0: 1x1, level 1: 2x2, level 2: 4x4
Key take away: Multiple assignment ? Soft Assignment ?

Motivation
Why not include other statistics? For instance:
• mean of local descriptors
• (co)variance of local descriptors

Simple case: Soft Assignment
Called “Kernel codebook encoding” by Chatfield et al.
2011. Cast a weighted vote into the most similar
clusters.
This is fast and easy to implement (try it for Project 3!)
but it does have some downsides for image retrieval –
the inverted file index becomes less sparse.
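A rough sketch of soft assignment with a Gaussian kernel is given below. It is not Chatfield et al.'s implementation; the kernel width sigma, the toy descriptors X and the codebook C are assumptions, and every descriptor votes into all clusters rather than only the few most similar ones.

```matlab
X = [0 0; 1 1; 0.5 0.5];      % toy descriptors, one per row
C = [0 0; 1 1];               % assumed codebook (k x d)
sigma = 0.5;                  % assumed kernel width

n = size(X, 1); k = size(C, 1);
D2 = zeros(n, k);             % squared descriptor-to-centroid distances
for j = 1:k
    D2(:, j) = sum(bsxfun(@minus, X, C(j, :)).^2, 2);
end

Wt = exp(-D2 / (2 * sigma^2));            % Gaussian kernel votes
Wt = bsxfun(@rdivide, Wt, sum(Wt, 2));    % each descriptor casts a total vote of 1
h  = sum(Wt, 1)';                         % k x 1 soft histogram
```

The third descriptor lies halfway between the two centroids and splits its vote evenly. The same effect spreads mass over many bins in real data, which is what makes the inverted file index less sparse.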

A first example: the VLAD
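Given a codebook $\{\mu_i, \; i = 1 \ldots N\}$, e.g. learned with K-means, and a set of local descriptors $X = \{x_t, \; t = 1 \ldots T\}$:
•  assign: $NN(x_t) = \arg\min_{\mu_i} \|x_t - \mu_i\|$
•  compute: $v_i = \sum_{x_t : NN(x_t) = \mu_i} (x_t - \mu_i)$
• concatenate vi's + normalize
Jégou, Douze, Schmid and Pérez, “Aggregating local descriptors into a compact image representation”, CVPR’10.
[Figure: Voronoi cells with centroids μ1 … μ5, a descriptor x and residual sums v1 … v5: ① assign descriptors, ② compute x − μi, ③ vi = sum of x − μi for cell i]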
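The three steps can be sketched in plain Matlab with hard assignment and a global L2 normalization. This is an illustrative version on toy data, not the VLFeat implementation; X, C and the variable names are assumptions.

```matlab
X = [0 0; 0.2 0; 1 1; 1 1.4];     % toy descriptors (n x d)
C = [0 0; 1 1];                   % assumed codebook (k x d), e.g. from k-means
[n, d] = size(X); k = size(C, 1);

% step 1: assign each descriptor to its nearest centroid
D2 = zeros(n, k);
for i = 1:k
    D2(:, i) = sum(bsxfun(@minus, X, C(i, :)).^2, 2);
end
[~, nn] = min(D2, [], 2);

% steps 2-3: sum the residuals x - mu_i within each cell
V = zeros(k, d);
for i = 1:k
    members = (nn == i);
    V(i, :) = sum(bsxfun(@minus, X(members, :), C(i, :)), 1);
end

% concatenate the v_i and L2-normalize
v = reshape(V', [], 1);           % (k*d) x 1, one d-dim block per centroid
v = v / max(norm(v), eps);
```

Unlike the BoW histogram, each cell contributes a d-dimensional residual sum, so the code has k x d entries and keeps first-order information about where the descriptors fall inside each cell.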

A first example: the VLAD
A graphical representation of the per-cell residual sums vi
Jégou, Douze, Schmid and Pérez, “Aggregating local descriptors into a compact image representation”, CVPR’10.

VL_FEAT Implementation
 Matlab:
function [vc]=vladSiftEncoding(sift, codebook)
dbg=1;
if dbg
    if (0) % init VL_FEAT, only need to do once
        run('../../tools/vlfeat-0.9.20/toolbox/vl_setup.m');
    end
    % debug mode: ignore the inputs and build a toy example from one image
    im = imread('../pics/flarsheim-2.jpg');
    [f, sift] = vl_sift(single(rgb2gray(im)));
    sift = single(sift');                  % n x 128, one descriptor per row
    [indx, codebook] = kmeans(sift, 16);   % 16-word codebook
    % make sift # smaller
    sift = sift(1:min(800, size(sift,1)), :);
end
[n, kd]=size(sift);
[m, kd]=size(codebook);
% compute assignment
dist = pdist2(codebook, sift);             % m x n distances
mdist = mean(mean(dist));
% normalize the heat kernel s.t. mean dist is mapped to 0.5
a = -log(0.5)/mdist;
indx = exp(-a*dist);                       % m x n soft assignment weights
% vl_vlad expects all inputs in the same class, so cast to single
vc=vl_vlad(single(sift'), single(codebook'), single(indx));
if dbg
    figure(41); colormap(gray);
    subplot(2,2,1); imshow(im); title('image');
    subplot(2,2,2); imagesc(dist); title('m x n distance');
    subplot(2,2,3); imagesc(indx); title('m x n assignment');
    % vc stacks one kd-dim block per cluster: reshape to kd x m, then transpose
    subplot(2,2,4); imagesc(reshape(vc, [kd, m])'); title('vlad code');
end

VLAD Code
 What are the tweaks ?
 Code book design
 Soft Assignment options

References
 Vocabulary Tree:
 David Nistér, Henrik Stewénius: Scalable Recognition with a Vocabulary
Tree. CVPR (2) 2006: 2161-2168
 VLAD:
 Herve Jegou, Matthijs Douze, Cordelia Schmid:
Improving Bag-of-Features for Large Scale Image Search. International
Journal of Computer Vision 87(3): 316-336 (2010)
 Fisher Vector:
 Florent Perronnin, Jorge Sánchez, Thomas Mensink:
Improving the Fisher Kernel for Large-Scale Image Classification.
ECCV (4) 2010: 143-156
 AKULA:
 Abhishek Nagar, Zhu Li, Gaurav Srivastava, Kyungmo Park:
AKULA - Adaptive Cluster Aggregation for Visual Search. DCC 2014:
13-22

Lec 07 Summary
 Image Retrieval System Metric
 What is true positive, false positive, true negative, false
negative ?
 What is precision, recall, F-score ?
 Why Aggregation ?
 Decision boundary
 Indexing/Hashing
 Bag of Words
 A histogram whose bins are visual words
 Variations: hierarchical assignment with vocabulary tree
 Implementation: Inverted List
 VLAD
 Richer encoding of aggregated info
 Soft assignment of features to codebook bins
 Vectorized representation – no need for inverted list
