Supervised Quantization for Similarity Search (camera-ready)

Supervised Quantization for Similarity Search
Xiaojuan Wang1
Ting Zhang2 ∗
Guo-Jun Qi3
Jinhui Tang4
Jingdong Wang5
1
Sun Yat-sen University, China 2
University of Science and Technology of China, China
3
University of Central Florida, USA 4
Nanjing University of Science and Technology, China
5
Microsoft Research, China
Abstract
In this paper, we address the problem of searching for
semantically similar images from a large database. We
present a compact coding approach, supervised quanti-
zation. Our approach simultaneously learns feature se-
lection that linearly transforms the database points into
a low-dimensional discriminative subspace, and quantizes
the data points in the transformed space. The optimization
criterion is that the quantized points not only approximate
the transformed points accurately, but also are semantically
separable: the points belonging to a class lie in a cluster
that is not overlapped with other clusters corresponding to
other classes, which is formulated as a classification prob-
lem. The experiments on several standard datasets show the
superiority of our approach over the state-of-the art super-
vised hashing and unsupervised quantization algorithms.
1. Introduction
Similarity search has been a fundamental research topic
in machine learning, computer vision, and information re-
trieval. The goal, given a query, is to find the most similar
item from a database, e.g., composed of N d-dimensional
vectors. The recent study shows that the compact cod-
ing approach, including hashing and quantization, is ad-
vantageous in terms of memory cost, search efficiency, and
search accuracy.
The compact coding approach converts the database
items into short codes in which the distance is efficiently
computed. The objective is that the similarity computed
in the coding space is well aligned with the similarity that
is computed based on the Euclidean distance in the in-
put space, or that comes from the given semantic sim-
ilarity (e.g., the data items from the same class should
be similar). The solution to the former kind of similar-
ity search is unsupervised compact coding, such as hash-
∗This work was partly done when Xiaojuan Wang and Ting Zhang were
interns at MSR. They contributed equally to this work.
ing [1,5–7,9–12,16,20,22,28,29,34,36–38] and quantiza-
tion [8,26,39]. The solution to the latter problem is super-
vised compact coding, which is our interest in this paper.
Almost all research efforts in supervised compact cod-
ing focus on developing hashing algorithms to preserve se-
mantic similarities, such as LDA Hashing [30], minimal
loss hashing [24], supervised hashing with kernels [21],
FastHash [18], triplet loss hashing [25], and supervised dis-
crete hashing [27]. In contrast, there is less study in quan-
tization, which however already shows the superior perfor-
mance for Euclidean distance and cosine-based similarity
search. This paper makes a study on the quantization solu-
tion to semantic similarity search.
Our main contributions are as follows: (i) We propose
a supervised composite quantization approach. To the best
of our knowledge, our method is the first attempt to explore
quantization for semantic similarity search. The advantage
of quantization over hashing is that the number of possi-
ble distances is significantly higher, and hence the distance
approximation, accordingly the similarity search accuracy,
is more accurate. (ii) Our approach jointly optimizes the
quantization and learns the discriminative subspace where
the quantization is performed. The criterion is the semantic
separability: the points belonging to a class lie in a cluster
that is not overlapped with other clusters corresponding to
other classes, which is formulated as a classification prob-
lem. (iii) Our method significantly outperforms many state-
of-the-art methods in terms of search accuracy and search
efficiency under the same code length.
2. Related work
There are two main research issues in supervised hash-
ing: how to design hash functions and how to preserve se-
mantic similarity. In essence, most algorithms can adopt
various hash functions, e.g., an algorithm using a linear hash
function usually can also use a kernel hash function. Our
review of the supervised hashing algorithms focuses on the
semantic similarity preserving manners. We roughly divide
them into three categories: pairwise similarity preserving,
multiwise similarity preserving, and classification.

Pairwise similarity preserving hashing aligns the simi-
larity over each pair of items computed in the hash codes
with the semantic similarity in various manners. Repre-
sentative algorithms include LDA Hashing [30], minimal
loss hashing [24], binary reconstructive embedding [15], su-
pervised hashing with kernels [21], two-step hashing [19],
FastHash [18], and so on. The recent work [4], supervised
deep hashing, designs deep neural network as hash func-
tions to seek multiple hierarchical non-linear feature trans-
formations, and preserves the pairwise semantic similarity
by maximizing the inter-class variations and minimizing the
intra-class variations of the hash codes.
Multiwise similarity preserving hashing formulates the
problem by maximizing the agreement of the similarity or-
ders over more than two items between the input space and
the coding space. The representative algorithms include or-
der preserving hashing [34], which directly aligns the rank
orders computed from the input space and the coding space,
triplet loss hashing [25], listwise supervision hashing [32],
and so on. Triplet loss hashing and listwise supervision
hashing adopt different loss functions to align the similarity
order in the coding space and the semantic similarity over
triplets of items. The recent proposed deep semantic rank-
ing based method [40] preserves multilevel semantic simi-
larity between multilabel images by jointly learning feature
representations and mappings from them to hash codes.
The recently-developed supervised discrete hashing
(SDH) algorithm [27] formulates the problem using the rule
that the classification performance over the learned binary
codes is as good as possible. This rule seems inferior com-
pared with pairwise and multiwise similarity preserving, but
yields superior search performance. This is mainly thanks
to its optimization algorithm (directly optimize the binary
codes) and scalability (not necessarily do the sampling as
done in most pairwise and multiwise similarity preserving
algorithms). Semantic separability in our approach, whose
goal is that the points belonging to a class lie in a cluster that
is not overlapped with other clusters corresponding to other
classes, is formulated as a classification problem, which can
also be optimized using all the data points.
Our approach is a supervised version of quantization.
The quantizer we adopt is composite quantization [39],
which is shown to be a generalized version of product quan-
tization [8] and cartesian k-means [26], and achieves better
performance. Rather than performing the quantization in
the input space, our approach conducts the quantization in a
discriminative space, which is jointly learned with the com-
posite quantizer.
3. Formulation
Given a d-dimensional query vector q ∈ Rd
and a
search database consisting of N d-dimensional vectors X =
{xn}N
n=1 with each point xn ∈ Rd
associated with a class
label, denoted by a binary label vector yn ∈ {0, 1}C
in
which the 1-valued entry indicates the class label of xn, the
goal is to find K vectors from the database X that are near-
est to the query so that the found vectors share the same
class label with the query. This paper is interested in the
approximate solution: converting the database vectors into
compact codes and then performing the similarity search in
the compact coding space, which has the advantage of lower
memory cost and higher search efficiency.
Modeling. We present a supervised quantization approach
to approximate each database vector with a vector selected
or composed from a dictionary of base items. Then the
database vector is represented by a short code composed of
the indices of the selected base items. Our approach, rather
than directly quantizing the database vectors in the original
space, learns to transform the database vectors to a discrim-
inative subspace with a matrix P ∈ Rd×r
, and then does
the quantization in the transformed space.
We propose to adopt the state-of-the-art unsupervised
quantization approach: composite quantization [39]. Com-
posite quantization approximates a vector x using the sum
of M elements with each selected from a dictionary, i.e.,
¯x =
M
m=1 cmkm
, where cmkm
is selected from the mth
dictionary with K elements Cm = [cm1 cm2 · · · cmK],
and encodes x by a short code (k1 k2 · · · kM ). Our ap-
proach uses the sum to approximate the transformed vector,
which is formulated by minimizing the approximation error,
PT
x − ¯x 2
2 = PT
x −
M
m=1
cmkm
2
2. (1)
We learn the transformation matrix P such that the quan-
tized data points are semantically separable: the points be-
longing to the same class lie in a cluster, and the clusters
corresponding to different classes are disjointed. We solve
the semantic separation problem by finding C linear deci-
sion surfaces to divide all the points into C clusters1
, each
corresponding to a class, which is formulated as a classifi-
cation problem given as follows,
N
n=1
(yn, WT
¯xn) + λ W 2
F , (2)
where λ is the parameter controlling the regularization term
W 2
F ; W = [w1 w2 · · · wC] ∈ Rr×C
; (·, ·) is a classi-
fication loss function to penalize the cases where the point
¯xn is not assigned to the cluster corresponding to yn based
on the C associated decision functions {wT
k ¯xn}C
k=1. In this
paper, we adopt the regression loss:
(yn, WT
¯xn) = yn − WT
¯xn
2
2 (3)
1C linear decision surfaces can divide the points into more than C clus-
ters.

The proposed approach combines the quantization with
the feature selection, and jointly learns the quantization pa-
rameter and the transform matrix. The overall objective
function is given as follows,
min
W,P,C,{bn}N
n=1,
N
n=1
yn − WT
Cbn
2
2 + λ W 2
F
+ γ
N
n=1
Cbn − PT
xn
2
2 (4)
s. t.
M
i=j
bT
niCT
i Cjbnj = ,
n = 1, 2, · · · , N,
where γ is the parameter controlling the quantization term;
Cbn is the matrix form of
M
m=1 cmkn
m
and bn =
[bT
n1 bT
n2 · · · bT
nM ]T
; bnm ∈ {0, 1}K
is an indicator vec-
tor with only one entry being 1, indicating that the corre-
sponding dictionary element is selected from the mth dic-
tionary. The equality constraint,
M
i=j bT
niCT
i Cjbnj =
M
i=j cT
ikn
i
cjkn
j
= , called constant inter-dictionary-
element-product, is introduced from composite quantiza-
tion [39] for fast distance computation (reduced from O(d)
to O(M)) in the search stage, which is presented below.
Querying. The search process is similar to that in com-
posite quantization [39]. Given a query q, after transfor-
mation, the approximate distance between q (represented
as q = PT
q) and a database vector x (represented as
Cb =
M
m=1 cmkm ) is computed as
q −
M
m=1
cmkm
2
2 = (5)
M
m=1
q − cmkm
2
2 − (M − 1) q 2
2 +
M
i=j
cT
iki
cjkj
.
Given the query q , the second term −(M − 1) q 2
2
in the right-hand side of Equation 5 is a constant for all
database vectors. Meanwhile, the third term
M
i=j cT
iki
cjkj
,
which is equal to thanks to the introduced constant con-
straint, is also a constant. Hence these two constant terms
can be ignored, as they do not affect the sorting results.
As a result, it is enough to compute the distances be-
tween q and the selected dictionary elements {cmkm
}M
m=1:
{ q − cmkm
2
2}M
m=1. We precompute a distance table of
length MK recording the distances between q and the dic-
tionary elements in all the dictionaries before examining the
distance between q and each approximated point ¯x in the
database. Then computing
M
m=1 q −cmkm
2
2 takes only
O(M) distance table lookups and O(M) addition opera-
tions.
4. Optimization
Our problem (4) consists of five groups of unknown vari-
ables: classification matrix W, transformation matrix P,
dictionaries C, binary indicator vectors {bn}N
n=1, and the
constant . We follow [39] and combine the constraints
M
i=j bT
niCT
i Cjbnj = into the objective function using
the quadratic penalty method:
ψ(W, P, C, {bn}N
n=1, ) =
N
n=1
yn − WT
Cbn
2
2 + λ W 2
F
+ γ
N
n=1
Cbn − PT
xn
2
2 + µ
N
n=1
(
M
i=j
bT
niCT
i Cjbnj − )2
,
(6)
where µ is the penalty parameter.
We use the alternative optimization technique to itera-
tively solve the problem, with each iteration updating one
of W, P, , C, and {bn}N
n=1 while fixing the others. The
initialization scheme and the iteration details are presented
as follows.
Initialization. The transformation matrix P is initialized
using principal component analysis (PCA). We use the dic-
tionaries and codes learned from product quantization [8]
in the transformed space to initialize C and {bn}N
n=1 for
the shortest code (16 bits) in our experiment, and we use
the dictionaries and codes learned in the shorter code to do
the initialization for longer code with setting the additional
dictionary elements to zero and randomly initializing the
additional binary codes.
W-Step. With C and {bn}N
n=1 fixed, W is solved by
the regularized least squares problem, resulting in a closed-
form solution:
W∗
= ( ¯X ¯XT
+ λIr)−1 ¯XYT
, (7)
where ¯X = [Cb1 · · · CbN ] ∈ Rr×N
, Y = [y1 · · · yN ] ∈
RC×N
, and Ir is an identity matrix of size r × r.
P-Step. With C and {bn}N
n=1 fixed, the transformation
matrix P is solved using the normal equation:
P∗
= (XXT
)−1
X ¯XT
, (8)
where X = [x1 · · · xN ] ∈ Rd×N
.
-Step. With C and {bn}N
n=1 fixed, the objective function
is a quadratic function with respect to , and it is easy to get
the optimal solution to .
∗
=
1
N
N
n=1
M
i=j
bT
niCT
i Cjbnj. (9)
C-Step. With other variables fixed, the problem is an un-
constrained nonlinear optimization problem with respect to

C. We use the quasi-Newton algorithm and specifically
the L-BFGS algorithm, the limited version of the Broyden-
Fletcher-Goldfarb-Shanno (BFGS) algorithm. The imple-
mentation is publicly available2
. The derivative with respect
to C and the objective function value need to be fed into the
solver. L-BFGS is an iterative algorithm and we set its max-
imum iterations to 100. The partial derivative with respect
to Cm is :
∂ψ
∂Cm
=
N
n=1
[2W(WT
Cbn − yn)bT
nm+ (10)
2γ(Cbn − PT
xn)bT
nm+
4µ(
M
i=j
bT
niCT
i Cjbnj − )(
M
l=1,l=m
Clbnl)bT
nm].
B-Step. The optimization problem with respect to {bn}N
n=1
could be decomposed to N subproblems,
ψn(bn) = ||yn − WT
Cbn||2
2 + γ||Cbn − PT
xn||2
2
+ µ(
M
i=j
bT
niCT
i Cjbnj − )2
. (11)
bn is a binary-integer-mixed vector, and thus the optimiza-
tion is NP-hard. We use the alternative optimization tech-
nique again to solve the M subvectors {bnm}M
m=1 itera-
tively. With {bnl}M
l=1,l=m fixed, we exhaustively check all
the elements in the dictionary Cm, finding the element such
that ψn(bn) is minimized, and accordingly set the corre-
sponding entry of bnm to be 1 and all the others to be 0.
Convergence. Every update step in the algorithm assures
that the objective function value weakly decreases after
each iteration, and the empirical results show that the al-
gorithm takes a few iterations to converge. Figure 1 shows
the convergence curves on NUS-WIDE and ImageNet with
16 bits, which indicates that our algorithm gets converged
in a few iterations.
5. Discussions
Connection with supervised sparse coding. It is pointed
in [39] that composite quantization is related to sparse cod-
ing: the binary indicator vector b is a special sparse code,
containing only M non-zero entries (valued as 1) and each
non-zero entry distributed in a subvector. The proposed su-
pervised quantization approach is close to supervised sparse
coding [23], which introduces supervision to learn the dic-
tionary and the sparse codes, but different from it in the mo-
tivation and the manner of imposing the supervision: our
approach adopts the supervision to help separate the data
2http://users.iems.northwestern.edu/ nocedal/lbfgsb.html
0 5 10 15 20
0
0.4
0.8
1.2
1.6
2
x 10
5 NUS−WIDE 16 bits
iteration number
objectivevalue
(a) NUS-WIDE
0 5 10 15 20 25 30
1.35
1.4
1.45
1.5
1.55
1.6
1.65
x 10
6 ImageNet 16 bits
iteration number
objectivevalue
(b) ImageNet
Figure 1: Convergence curves of our algorithm on NUS-
WIDE and ImageNet with 16 bits. The vertical axis repre-
sents the value of the objective function (6) and the horizon-
tal axis corresponds to the number of iterations.
points into clusters with each corresponding to a class; our
approach imposes the supervision on the approximated data
points while supervised sparse coding imposes the supervi-
sion on the sparse codes.
Classification loss vs. rank loss. There are some hashing
approaches exploring the supervision information through
rank loss [33], such as the triplet loss in [25, 32], and the
pairwise loss in [24,31]. In general, compared with the clas-
sification loss, those two rank losses might be more helpful
to learn the compact codes as they directly align the rank or-
der in the coding space with the given semantic rank infor-
mation. However, they yield a larger number of loss terms,
e.g., O(N2
) for pairwise loss and O(N3
) for triplet loss,
requiring prohibitive computational cost which makes the
optimization difficult and infeasible. Therefore, sampling
is usually adopted for training, which however makes the
results not as good as expected. A comparison with triplet
loss is shown in Section 6.3.
6. Experiment
6.1. Datasets and settings
Datasets. We perform the experiments on four standard
datasets: CIFAR-10 [13], MNIST [17], NUS-WIDE [2],
and ImageNet [3].
The CIFAR-10 dataset consists of 60, 000 32 × 32 color
tinny images, and includes 10 classes with 6, 000 images
per class. We represent each image by a 512-dimensional
GIST feature vector available on the website3
. The dataset
is split into a query set with 1, 000 samples and a training
set with all the remaining samples as done in [27].
The MNIST dataset consists of 70, 000 28×28 greyscale
images of handwritten digits from ’0’ to ’9’. Each image
is represented by the raw pixel values, resulting in a 784-
3http://www.cs.toronto.edu/ kriz/cifar.html

dimensional vector. We split the dataset into a query set
with 1, 000 samples and a training set with all remaining
samples as done in [27].
The NUS-WIDE dataset contains 269, 648 images col-
lected from Flickr, with each image containing multiple se-
mantic labels from 81 concept labels. The 500-dimensional
bag-of-words features provided in [2] are used. Follow-
ing [27], we collect 193,752 images that are from the 21
most frequent labels for evaluation, including sky, clouds,
person, water, animal, grass, building, window, plants, lake,
ocean, road, flowers, sunset, relocation, rocks, vehicles,
snow, tree, beach, and mountain. For each label, 100 images
are uniformly sampled as the query set, and the remaining
images are used as the training set.
The dataset ILSVRC 2012 [3], named as ImageNet in
this paper, contains over 1.2 million images of 1, 000 cat-
egories. We use the provided training set as the retrieval
database and the provided 50, 000 validation images as the
query set since the ground-truth labeling of the test set is
not publicly available. Similar to [27], we use the 4096-
dimensional feature extracted from the convolution neural
networks (CNN) in [14] to represent each image.
Evaluation criteria. We adopt the widely used mean
average precision (MAP) criterion, defined as MAP =
1
Q
Q
i=1 AP(qi), where Q is the number of queries, and
AP is computed as AP(q) = 1
L
R
r=1 Pq(r)δ(r). Here L
is the number of true neighbors for the query q in the R
retrieved items, where R is the size of the database except
that R is 1500 on the ImageNet dataset for evaluation effi-
ciency. Pq(r) denotes the precision when top r data points
are returned, and δ(r) is an indicator function which is 1
when the rth result is a true neighbor and otherwise 0. A
data point is considered as a true neighbor when it shares at
least one class label with the query.
Besides the search accuracy, we also report the search
efficiency by evaluating the query time under various code
lengths. The query time contains the query preprocessing
time and the linear scan search time. For hashing algo-
rithms, the query preprocessing time refers to query en-
coding time; for unsupervised quantization algorithms, the
query preprocessing time refers to distance lookup table
construction time; for our proposed method, the query pre-
processing time includes feature transformation time and
distance lookup table construction time. For all methods,
we use C++ implementations to test the query time on a 64-
bit windows server with 48 GB RAM and 3.33 GHz CPU.
Parameter settings. There are three trade-off parameters in
the objective function (6): γ for the quantization loss term,
µ for penalizing the equality constraint term, and λ for the
regularization term. We select γ and µ via validation. We
choose a subset of the training set as the validation set (the
size of the validation set is the same to that of the query set),
and the best parameters γ and µ are chosen so that the aver-
age search performance in terms of MAP, by regarding the
validation vectors as queries, is the best. It is feasible that
the validation set is a subset of the training set, as the vali-
dation criterion is not the objective function but the search
performance [39]. The empirical analysis about the two pa-
rameters will be given in Section 6.3. The parameter λ is
set to 1, which already shows the satisfactory performance.
We set the dimension r of the discriminative subspace to
256. We do not tune r and λ for saving time while we think
that tuning it might yield better performance. We choose
K = 256 to be the dictionary size as done in [8,26,39], so
that the resulting distance lookup tables are small and each
subindex fits into one byte.
6.2. Comparison
Methods. Our method, denoted by SQ, is compared with
seven state-of-the-art supervised hashing methods: super-
vised discrete hashing (SDH) [27], FastHash [35], super-
vised hashing with kernels (KSH) [21], CCA-ITQ [6],
semi-supervised hashing (SSH) [31], minimal loss hash-
ing (MLH) [24], and binary reconstructive embedding
(BRE) [15], as well as the state-of-the-art unsupervised
quantization method, composite quantization (CQ) [39]. To
the best of our knowledge, there do not exist supervised
quantization algorithms. We use the public implementa-
tions for all the algorithms except that we implement SSH
by ourselves as we do not find the public code, and follow
the corresponding papers/authors to set up the parameters.
For FastHash, we adopt hinge loss as loss function in the
binary code inference step and boosted tree as classifier in
the hash function learning step, which is suggested by the
author to achieve the best performance.
Implementation details. It is infeasible to do the training
over the whole training set for the pairwise-similarity-based
hashing algorithms (SSH, BRE, MLH, KSH, FastHash), as
discussed in [27]. Therefore, for CIFAR-10, MNIST, and
NUS-WIDE, following the recent work [27], we randomly
sample 5000 data points from the training set to do the op-
timization for the pairwise similarity-based algorithms, and
use the whole training set for SDH and CCA-ITQ. For Im-
ageNet, we use as many training samples for optimization
as possible if the 256G RAM in our server is enough for
optimization: 500, 000 for CCA-ITQ, 100, 000 for SDH,
10, 000 for the remaining hashing methods. There are two
hashing algorithms, KSH and SDH, that adopt the kernel-
based representation, i.e., select h anchor points {aj}h
j=1
and use φ(x) = [exp(−||x − a1||2
2/2σ2
) . . . exp(−||x −
ah||2
2/2σ2
)]T
∈ Rh
to represent x. Our approach also uses
the kernel-based representation for CIFAR-10, MNIST, and
NUS-WIDE. Following [27], h = 1000 and σ is chosen
based on the rule σ = 1
N
N
n=1 min{ xn − aj 2}h
j=1.

16 32 64 128
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
code length
MAP
CIFAR−10
CQ
CCA−ITQ
BRE
MLH
SSH
KSH
FastHash
SDH
SQ
(a) CIFAR-10
16 32 64 128
0.4
0.5
0.6
0.7
0.8
0.9
1
code length
MAP
MNIST
CQ
CCA−ITQ
BRE
MLH
SSH
KSH
FastHash
SDH
SQ
(b) MNIST
16 32 64 128
0.38
0.4
0.42
0.44
0.46
0.48
0.5
0.52
0.54
0.56
0.58
code length
MAP
NUS−WIDE
CQ
CCA−ITQ
BRE
MLH
SSH
KSH
FastHash
SDH
SQ
(c) NUS-WIDE
Figure 2: Search performance (in terms of MAP) comparison of different methods on CIFAR-10, MNIST, and NUS-WIDE
with code length of 16, 32, 64, and 128.
Search accuracy. The results on CIFAR-10, MNIST, and
NUS-WIDE with the code length of 16, 32, 64, and 128,
are shown in Figure 2. It can be seen that our approach,
SQ, achieves the best performance, and SDH is the second
best. In comparison with SDH, our approach gains large im-
provement on CIFAR-10 and NUS-WIDE, e.g., 23.66% im-
provement on CIFAR-10 with 64 bits, and 4.65% improve-
ment on NUS-WIDE with 16 bits. It is worth noting that
on these two datasets, the performance of SQ with 16 bits
is even much better than that of SDH with 128 bits. Our
approach gets relatively small improvement over SDH on
MNIST. The reason might be that SDH already achieves a
high performance, and it is not easy to get a large improve-
ment further. Compared with the unsupervised quantiza-
tion algorithm, composite quantization (CQ), whose perfor-
mance is lower than most of the supervised hashing algo-
rithms, our approach obtains significant improvement, e.g.,
42.57% improvement on CIFAR-10 with 16 bits, 46.14%
on MNIST with 16 bits, and 15.39% on NUS-WIDE with
16 bits. This shows that learning with supervision indeed
benefits the search performance.
The result on ImageNet is shown in Figure 3. The perfor-
mance of our approach again outperforms other algorithms,
and CQ is the second best. The reason might be the power-
ful discrimination ability of the original CNN features. To
achieve a comprehensive analysis, we provide the Euclidean
baseline (see Figure 3) that simply computes the distances
between the query and the database vectors using the orig-
inal CNN features and returns the top R retrieved items.
As shown in Figure 3, our proposed SQ also outperforms
the Euclidean baseline by a large margin, and CQ is a lit-
tle lower than the baseline. This shows that our approach is
able to learn better quantizer through the supervision though
it is known that the CNN features are already good. The best
supervised hashing algorithm, SDH, uses the kernel-based
representation in our experiment as suggested in its original
paper [27]. To further verify the superiority of our approach
16 32 64 128
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
code length
MAP
ImageNet
CQ
CCA−ITQ
BRE
MLH
SSH
KSH
FastHash
SDH
SDH−Linear
SQ
Euclidean
Figure 3: Search performance (in terms of MAP) compari-
son of different methods on ImageNet with code length of
16, 32, 64, and 128.
over SDH, we also report the result of SDH without us-
ing the kernel representation (denoted by “SDH-Linear” in
Figure 3), and find that it is still lower than our approach.
This further shows the effectiveness of quantization: quan-
tization has much more different differences compared with
hashing, which has only a few Hamming distances for the
same code length.
Search efficiency. We report the query time of our pro-
posed approach SQ, the unsupervised quantization method
CQ, and the supervised hashing method SDH, which out-
performs other supervised hashing algorithms in our exper-
iments. Figure 4 shows the search performance and the cor-
responding query time under the code length of 16, 32, 64,
and 128 on the four datasets.
Compared with CQ, our proposed SQ obtains much
higher search performance for the same query time. It can

0 0.5 1 1.5 2 2.5 3
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
query time cost / ms
MAP
CIFAR−10
CQ SDH SQ
(a) CIFAR-10
0 1 2 3 4
0.4
0.5
0.6
0.7
0.8
0.9
1
MAP
MNIST
CQ SDH SQ
(b) MNIST
0 1 2 3 4 5
0.35
0.4
0.45
0.5
0.55
0.6
MAP
NUS−WIDE
CQ SDH SQ
(c) NUS-WIDE
0 5 10 15 20
0.2
0.3
0.4
0.5
0.6
0.65
MAP
ImageNet
CQ SDH SQ
(d) ImageNet
Figure 4: Query time comparison of SQ, CQ, and SDH under various code lengths on CIFAR-10, MNIST, NUS-WIDE, and
ImageNet. The vertical axis represents the search performance, and the horizontal axis corresponds to the query time cost
(milliseconds). The markers from left to right on each curve indicate the code length of 16, 32, 64, and 128 respectively.
be seen that on CIFAR-10, MNIST, and NUS-WIDE, SQ
takes more time than CQ under the code length of 16 and
32, and less time under the code length of 128: SQ takes ex-
tra time to do feature transformation; the querying process,
however, is carried out in a lower-dimensional transformed
subspace, therefore the search efficiency is still compara-
ble to CQ. It can also be observed that SQ takes almost
equal time as CQ on ImageNet. This is because CQ also
takes time to do feature transformation here and the query-
ing process is carried out in the 256-dimensional PCA sub-
space (it is cost prohibitive to tune the parameter of CQ on
high-dimensional large-scale dataset).
Compared with SDH, SQ outperforms SDH for the same
query time on ImageNet and NUS-WIDE. For example, SQ
with 32 bits outperforms SDH with 16 bits by a margin
of 40.82% on ImageNet, and SQ with 16 bits outperforms
SDH with 128 bits by a margin of 2% on NUS-WIDE, while
they take almost the same query time.
On CIFAR-10, SQ with 16 bits outperforms SDH with
128 bits by 12.4% while taking slightly more time (0.16
milliseconds), and this trend indicates that for the same
query time, SQ could also obtain higher performance. On
MNIST, SQ achieves the same performance as SDH while
taking slightly more query time. The reason is that the
query preprocessing time of SQ (mainly refers to distance
lookup table construction time here) is relatively long com-
pared with the linear scan search time on the small-scale
database. In real-word scenarios, retrieval tasks that require
quantization solution usually are conducted on large-scale
databases, and the scale usually is at least 200, 000.
6.3. Empirical analysis
Classification loss vs. triplet loss. We empirically
compare the performances between the proposed formula-
tion (4) that uses the classification loss for semantic sep-
aration, and an intuitive formulation that uses triplet loss
Table 1: MAP comparison of classification loss (denoted by
“c-loss”) and triplet loss (denoted by “t-loss”).
Datasets Methods 16 bits 32 bits 64 bits 128 bits
CIFAR-10
t-loss 0.3284 0.3679 0.5305 0.5469
c-loss 0.6045 0.6855 0.7042 0.7120
MNIST
t-loss 0.4347 0.5286 0.6442 0.7500
c-loss 0.9329 0.9374 0.9377 0.9400
to discriminate a semantically similar pair and a semanti-
cally dissimilar pair. The triplet loss formulation is written
as (i,j,l)[||Cbi − Cbj||2
2 − ||Cbi − Cbl||2
2 + ρ]+. The
triplet (i, j, l) is composed of three points where i and j are
from the same class and l is from a different class; ρ ≥ 0 is
a constant indicating the distance margin; [·]+ = max(0, ·)
is the standard hinge loss function.
We optimize the formulation with triplet loss using the
alternative optimization algorithm similar to that for opti-
mizing problem (4). The parameters γ and µ are chosen
through validation. It is infeasible to do the optimization
with all the triplets. Therefore we borrow the idea of ac-
tive set, and select the triplets that are most likely to trigger
the hinge loss at each iteration, which is efficiently imple-
mented by maintaining an approximate nearest neighbor list
for each database vector.
The results on CIFAR-10 and MNIST under various
code lengths are shown in Table 1. It is observed that the
results with classification loss are much better than those
with triplet loss. It seems to us that the triplet loss is better
than classification loss, as the search goal is essentially to
rank similar pairs before dissimilar pairs, which is explicitly
formulated in triplet loss. The reason of the lower perfor-
mance of triplet loss most likely lies in the difficulty of the
optimization (e.g., too many (O(N3
)) loss terms results in
the sampling technique used for training, which makes the

Figure 5: Illustration of the effect of γ and µ on the search performance in the validation sets of CIFAR-10, MNIST, NUS-
WIDE, and ImageNet with 16 bits. γ ranges from 1e-7 to 1e+2 and µ ranges from 1e-1 to 1e+2.
Table 2: MAP comparison of the formulation with feature
transformation (denoted by “with fea.”) and that without
feature transformation (denoted by “no fea.”).
Datasets Methods 16 bits 32 bits 64 bits 128 bits
CIFAR-10
no fea. 0.5140 0.5174 0.5274 0.5301
with fea. 0.6045 0.6855 0.7042 0.7120
MNIST
no fea. 0.4534 0.4538 0.4617 0.4650
with fea. 0.9329 0.9374 0.9377 0.9400
results not as good as expected).
Feature transformation. Our approach learns the feature
transformation matrix P, and quantizes the database vectors
in the learned discriminative subspace. To verify the effec-
tiveness of feature transformation in our formulation (4),
we empirically compare the performances between the pro-
posed formulation and the formulation that does not learn
feature transformation. We take CIFAR-10 and MNIST as
examples and the results are shown in Table 2. As shown,
SQ significantly outperforms the formulation that does not
learn feature transformation, which indicates the impor-
tance of feature transformation in our proposed formulation.
The Effect of γ and µ. We empirically show how the pa-
rameters γ (for controlling the quantization loss term) and
µ (for penalizing the equality constraint term) affect the
search performance on the validation set, where the param-
eters are tuned to select the best combination. We report the
performances with 16 bits in Figure 5, by varying γ from
1e-7 to 1e+2 and µ from 1e-1 to 1e+2.
It can be seen from Figure 5 that the overall perfor-
mances do not depend much on µ and the performances
change a lot when varying the γ. This is reasonable be-
cause γ controls the quantization loss, and µ is introduced
for accelerating the search. The best search performances
on CIFAR-10, MNIST, NUS-WIDE, and ImageNet are ob-
tained with (γ, µ) = (0.01, 0.1), (γ, µ) = (1e-7, 10), (γ, µ)
= (1e-5, 0.1), and (γ, µ) = (1, 100) respectively. We can see
that the best MAP values 0.6132, 0.9449, and 0.5466 on the
validation sets are close to the values 0.6045, 0.9329, and
0.5452 on the query sets of CIFAR-10, MNIST, and NUS-
WIDE, and that the MAP value 0.5372 on the validation set
is different from the value 0.5039 on the query set of Ima-
geNet. The reason might be that the validation set (sampled
from the training set) and the query set (the validation set
provided in ImageNet) are not of the same distribution.
7. Conclusion
In this paper, we present a supervised compact coding
approach, supervised quantization, to semantic similarity
search. To the best of our knowledge, our approach is the
first attempt to study the quantization for semantic simi-
larity search. The superior performance comes from two
points: (i) The distance differentiation ability of quantiza-
tion is stronger than that of hashing. (ii) The learned dis-
criminative subspace is helpful to find a semantic quantizer.
Acknowledgements
This work was partially supported by the National Ba-
sic Research Program of China (973 Program) under Grant
2014CB347600.

References
[1] M. A. Carreira-Perpinan and R. Raziperchikolaei. Hashing
with binary autoencoders. In CVPR, pages 557–566. 1
[2] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng.
Nus-wide: a real-world web image database from national
university of singapore. In CIVR, page 48, 2009. 4, 5
[3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-
Fei. Imagenet: A large-scale hierarchical image database. In
CVPR, pages 248–255, 2009. 4, 5
[4] V. Erin Liong, J. Lu, G. Wang, P. Moulin, and J. Zhou. Deep
hashing for compact binary codes learning. In CVPR, pages
2475–2483, 2015. 2
[5] A. Gionis, P. Indyk, R. Motwani, et al. Similarity search in
high dimensions via hashing. In VLDB, volume 99, pages
518–529, 1999. 1
[6] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iter-
ative quantization: A procrustean approach to learning bi-
nary codes for large-scale image retrieval. IEEE Trans. Pat-
tern Analysis and Machine Intelligence, 35(12):2916–2929,
2013. 1, 5
[7] P. Jain, B. Kulis, and K. Grauman. Fast image search for
learned metrics. In CVPR, pages 1–8, 2008. 1
[8] H. Jegou, M. Douze, and C. Schmid. Product quantization
for nearest neighbor search. IEEE Trans. Pattern Analysis
and Machine Intelligence, 33(1):117–128, 2011. 1, 2, 3, 5
[9] K. Jiang, Q. Que, and B. Kulis. Revisiting kernelized
locality-sensitive hashing for improved large-scale image re-
trieval. In CVPR, pages 4933–4941, 2015. 1
[10] Q.-Y. Jiang and W.-J. Li. Scalable graph hashing with feature
transformation. In IJCAI, pages 2248–2254, 2015. 1
[11] A. Joly and O. Buisson. Random maximum margin hashing.
In CVPR, pages 873–880, 2011. 1
[12] W. Kong and W.-J. Li. Isotropic hashing. In NIPS, pages
1646–1654, 2012. 1
[13] A. Krizhevsky and G. Hinton. Learning multiple layers of
features from tiny images, 2009. 4
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet
classiﬁcation with deep convolutional neural networks. In
NIPS, pages 1097–1105, 2012. 5
[15] B. Kulis and T. Darrell. Learning to hash with binary recon-
structive embeddings. In NIPS, pages 1042–1050, 2009. 2,
5
[16] B. Kulis and K. Grauman. Kernelized locality-sensitive
hashing for scalable image search. In ICCV, pages 2130–
2137, 2009. 1
[17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-
based learning applied to document recognition. Proceed-
ings of the IEEE, 86(11):2278–2324, 1998. 4
[18] G. Lin, C. Shen, Q. Shi, A. van den Hengel, and D. Suter.
Fast supervised hashing with decision trees for high-
dimensional data. In CVPR, pages 1971–1978, 2014. 1,
2
[19] G. Lin, C. Shen, D. Suter, and A. van den Hengel. A gen-
eral two-step approach to learning-based hashing. In ICCV,
pages 2552–2559, 2013. 2
[20] W. Liu, C. Mu, S. Kumar, and S.-F. Chang. Discrete graph
hashing. In NIPS, pages 3419–3427, 2014. 1
[21] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang. Su-
pervised hashing with kernels. In CVPR, pages 2074–2081,
2012. 1, 2, 5
[22] W. Liu, J. Wang, S. Kumar, and S.-F. Chang. Hashing with
graphs. In ICML, pages 1–8, 2011. 1
[23] J. Mairal, F. R. Bach, J. Ponce, G. Sapiro, and A. Zisserman.
Supervised dictionary learning. In NIPS, pages 1033–1040,
2008. 4
[24] M. Norouzi and D. M. Blei. Minimal loss hashing for com-
pact binary codes. In ICML, pages 353–360, 2011. 1, 2, 4,
5
[25] M. Norouzi, D. M. Blei, and R. R. Salakhutdinov. Hamming
distance metric learning. In NIPS, pages 1061–1069, 2012.
1, 2, 4
[26] M. Norouzi and D. J. Fleet. Cartesian k-means. In CVPR,
pages 3017–3024, 2013. 1, 2, 5
[27] F. Shen, C. Shen, W. Liu, and H. T. Shen. Supervised discrete
hashing. In CVPR, pages 37–45, 2015. 1, 2, 4, 5, 6
[28] F. Shen, C. Shen, Q. Shi, A. Van Den Hengel, and Z. Tang.
Inductive hashing on manifolds. In CVPR, pages 1562–1569,
2013. 1
[29] F. Shen, C. Shen, Q. Shi, A. van den Hengel, Z. Tang, and
H. T. Shen. Hashing on nonlinear manifolds. IEEE Trans.
Image Processing, 24(6):1839–1851, 2015. 1
[30] C. Strecha, A. M. Bronstein, M. M. Bronstein, and P. Fua.
Ldahash: Improved matching with smaller descriptors. IEEE
Trans. Pattern Analysis and Machine Intelligence, 34(1):66–
78, 2012. 1, 2
[31] J. Wang, S. Kumar, and S.-F. Chang. Semi-supervised hash-
ing for large-scale search. IEEE Trans. Pattern Analysis and
Machine Intelligence, 34(12):2393–2406, 2012. 4, 5
[32] J. Wang, W. Liu, A. X. Sun, and Y.-G. Jiang. Learning hash
codes with listwise supervision. In ICCV, pages 3032–3039,
2013. 2, 4
[33] J. Wang, H. T. Shen, J. Song, and J. Ji. Hashing for similarity
search: A survey. CoRR, abs/1408.2927, 2014. 4
[34] J. Wang, J. Wang, N. Yu, and S. Li. Order preserving hashing
for approximate nearest neighbor search. In ACM Multime-
dia, pages 133–142, 2013. 1, 2
[35] J. Wang, J. Wang, G. Zeng, R. Gan, S. Li, and B. Guo. Fast
neighborhood graph search using cartesian concatenation. In
ICCV, pages 2128–2135. 2013. 5
[36] Y. Weiss, R. Fergus, and A. Torralba. Multidimensional
spectral hashing. In ECCV, pages 340–353. 2012. 1
[37] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In
NIPS, pages 1753–1760, 2009. 1
[38] H. Xu, J. Wang, Z. Li, G. Zeng, S. Li, and N. Yu. Comple-
mentary hashing for approximate nearest neighbor search. In
ICCV, pages 1631–1638, 2011. 1
[39] T. Zhang, C. Du, and J. Wang. Composite quantization for
approximate nearest neighbor search. In ICML, pages 838–
846, 2014. 1, 2, 3, 4, 5
[40] F. Zhao, Y. Huang, L. Wang, and T. Tan. Deep semantic rank-
ing based hashing for multi-label image retrieval. In CVPR,
pages 1556–1564, 2015. 2

Supervised Quantization for Similarity Search (camera-ready)

More Related Content

What's hot (17)

Similar to Supervised Quantization for Similarity Search (camera-ready) (20)

Supervised Quantization for Similarity Search (camera-ready)