20140702 xu jiaming hashinglearning - lite

1
Learning to Hash for Large-Scale Search
Xu Jiaming
Chinese Academy of Sciences
2014-07-04 @CUHK

2
Motivation
• Similarity-based search is widely used in many applications
– Image/video search and retrieval: find the most similar images/videos
– Audio search: find similar songs
– Product search: find shoes with a similar style but a different color
– Patient search: find patients with similar diagnostic status
• Two key components:
– Similarity/distance measure
– Indexing scheme
[Figure: Whittlesearch (Kovashka et al. 2013)]
Source: 2013 CIKM Tutorial by Jun Wang

3
A Conceptual Diagram for a Hashing-Based Image Search System
[Diagram components: Image Database; Hash Function Design; Indexing and Search; Similarity Search & Retrieval; Reranking / Refinement; Visual Search Applications]
Designing compact yet accurate hash codes is critical to making the search effective.

4
Outline
• Background (data-independent)
– Locality Sensitive Hashing [1999-VLDB, 2006-FOCS, 2008-Communications]
– SimHash [2002-STOC, 2007-WWW]
• Learning to Hash (data-dependent)
– Unsupervised vs. Supervised: STH [2010-SIGIR] vs. SHK [2012-CVPR]
– One-Step vs. Two-Step: ITQ [2011-CVPR, 2013-TPAMI] vs. TSH [2013-ICCV]
• Others (data-dependent)
– Smart Hashing Update for Fast Response [2013-IJCAI]
– Two-Stage Hashing [2014-ACL]
– Semantic Hashing with Topics and Tags [2013-SIGIR]
– Dual-View Hashing [2013-ICML]
– Multiple View Hashing [2011-SIGIR]
• LSH in MapReduce

6
LSH [1999-VLDB, 2006-FOCS, 2008-Communications]
Locality Sensitive Hashing (LSH)
[Figure: random hash functions split the database items into 0/1 regions; the query falls into the bucket with code 101]
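A minimal sketch of the bucketing idea on this slide. It assumes the random hash functions are random hyperplanes (the sign of a random projection), which is one common LSH family; the slide does not name the family. The function names, the 8-dimensional toy data, and the 3-bit code length are illustrative assumptions.

```python
import numpy as np


def make_hash_planes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes (assumed LSH family: sign of random projection)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_bits, dim))


def lsh_code(planes, x):
    """One bit per hyperplane: 1 if x lies on the non-negative side, else 0."""
    bits = (planes @ x >= 0).astype(int)
    return "".join(str(b) for b in bits)


def build_index(planes, items):
    """Group database item ids by their hash code (one bucket per code)."""
    index = {}
    for item_id, x in enumerate(items):
        index.setdefault(lsh_code(planes, x), []).append(item_id)
    return index


# Toy data: 100 random 8-dimensional items, hashed to 3-bit codes.
planes = make_hash_planes(dim=8, n_bits=3)
database = np.random.default_rng(1).standard_normal((100, 8))
index = build_index(planes, database)

# A query is hashed with the same planes and only its bucket is inspected.
query = database[0].copy()
query_code = lsh_code(planes, query)
candidates = index.get(query_code, [])
```

Only the items in the query's bucket are compared against the query, which is what makes the lookup sublinear in practice; real systems typically use several hash tables to raise recall.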

7
SimHash [2002-STOC, 2007-WWW]
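Step 1: Compute TF-IDF. Each observed feature of the text gets a weight: W1, W2, …, Wn.
Step 2: Hash function. Each feature is hashed to a bit string:
  W1 → 100110
  W2 → 110000
  …
  Wn → 001001
Step 3: Signature. Each 1 bit contributes +W and each 0 bit contributes −W:
   W1  −W1  −W1   W1   W1  −W1
   W2   W2  −W2  −W2  −W2  −W2
  …
  −Wn  −Wn   Wn  −Wn  −Wn   Wn
Step 4: Sum. Add the signatures column by column: 13, 108, −22, −5, −32, 55
Step 5: Generate fingerprint. Keep the sign of each sum (positive → 1, otherwise → 0): 1, 1, 0, 0, 0, 1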
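The five steps can be sketched as follows. This is an illustration under stated assumptions, not the implementation from the cited papers: the per-feature hash is taken from MD5 purely for convenience, the TF-IDF weights are made-up numbers, and all function names are hypothetical.

```python
import hashlib


def feature_bits(feature, n_bits):
    """Step 2: hash a feature to n_bits bits (MD5 is an arbitrary choice; n_bits <= 128)."""
    value = int.from_bytes(hashlib.md5(feature.encode("utf-8")).digest(), "big")
    return [(value >> i) & 1 for i in range(n_bits)]


def weighted_sums(weights, n_bits):
    """Steps 3 and 4: add +w for each 1 bit and -w for each 0 bit, column by column."""
    totals = [0.0] * n_bits
    for feature, w in weights.items():
        for i, bit in enumerate(feature_bits(feature, n_bits)):
            totals[i] += w if bit else -w
    return totals


def to_fingerprint(totals):
    """Step 5: a positive sum becomes 1, anything else becomes 0."""
    return [1 if t > 0 else 0 for t in totals]


def simhash(weights, n_bits=64):
    return to_fingerprint(weighted_sums(weights, n_bits))


def hamming(a, b):
    """Number of differing bits; near-duplicate texts should give a small value."""
    return sum(x != y for x, y in zip(a, b))


# Step 1 is assumed done elsewhere: these TF-IDF weights are illustrative only.
doc_a = {"hashing": 2.1, "search": 1.3, "image": 0.7}
doc_b = {"hashing": 2.0, "search": 1.4, "video": 0.6}
distance = hamming(simhash(doc_a), simhash(doc_b))

# The column sums shown on the slide map to the fingerprint 1, 1, 0, 0, 0, 1.
slide_fingerprint = to_fingerprint([13, 108, -22, -5, -32, 55])
```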

9
STH [2010-SIGIR]
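Self Taught Hashing (STH)
Unsupervised Learning (Laplacian Eigenmap):
$$\min \sum_{i,j} S_{ij} \, \| y_i - y_j \|^2 \quad \text{s.t.} \quad y_i \in \{-1, 1\}^k, \quad \sum_i y_i = 0, \quad \frac{1}{n} \sum_i y_i y_i^T = I$$
In matrix form, where $W$ is the similarity matrix with entries $S_{ij}$ and $D$ is its diagonal degree matrix:
$$\min \operatorname{trace}\left( Y^T (D - W) Y \right) \quad \text{s.t.} \quad Y(i,j) \in \{-1, 1\}, \quad Y^T \mathbf{1} = 0, \quad Y^T Y = I$$
Supervised Learning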
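A rough sketch of the unsupervised step only, assuming the usual spectral relaxation: drop the binary constraint, take the eigenvectors of D − W with the smallest non-zero eigenvalues, and binarize each dimension at its median. The median threshold, the toy two-cluster graph, and the function name are assumptions for illustration; the supervised step (training one classifier per bit to hash unseen items) is omitted.

```python
import numpy as np


def laplacian_eigenmap_codes(W, k):
    """Relaxed solution of min trace(Y^T (D - W) Y), then binarized to {-1, 1}.

    W: symmetric non-negative similarity matrix (n x n) of a connected graph.
    Returns an n x k matrix of codes.
    """
    W = np.asarray(W, dtype=float)
    D = np.diag(W.sum(axis=1))
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(D - W)
    # Skip the constant eigenvector (eigenvalue 0), which violates Y^T 1 = 0.
    Y = eigvecs[:, 1:k + 1]
    # Assumed binarization: threshold each dimension at its median.
    return np.where(Y > np.median(Y, axis=0), 1, -1)


# Toy graph: two tightly connected groups of three items, weakly linked.
W = np.full((6, 6), 0.01)
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)

codes = laplacian_eigenmap_codes(W, k=1)
```

On this toy graph the single bit separates the two groups, which is the behavior the objective rewards: similar items receive the same code.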

10
SHK [2012-CVPR]
Pairwise similarity
Code inner product approximates pairwise similarity
Supervised Hashing with Kernels
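The link between code inner products and similarity can be checked numerically. For codes in {−1, 1} of length r, the inner product of two codes equals r minus twice their Hamming distance, so fitting inner products to a pairwise similarity matrix also controls Hamming distances. The sketch below only illustrates that identity and a Frobenius fitting loss on hand-made codes; the loss form, the variable names, and the toy data are assumptions, and no kernel hash functions are trained.

```python
import numpy as np


def code_similarity(H):
    """Normalized code inner products, in [-1, 1]; H has one {-1, 1} code per row."""
    r = H.shape[1]
    return (H @ H.T) / r


def hamming_distances(H):
    """Pairwise Hamming distances via the identity h_i . h_j = r - 2 * d(h_i, h_j)."""
    r = H.shape[1]
    return (r - H @ H.T) // 2


def fitting_loss(H, S):
    """Squared Frobenius gap between normalized inner products and the labels S."""
    return float(np.linalg.norm(code_similarity(H) - S) ** 2)


# Items 0 and 1 are labeled similar (+1); item 2 is dissimilar to both (-1).
S = np.array([[1, 1, -1],
              [1, 1, -1],
              [-1, -1, 1]])
H = np.array([[1, 1, -1, 1],
              [1, 1, -1, 1],
              [-1, -1, 1, -1]])

loss = fitting_loss(H, S)
distances = hamming_distances(H)
```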

12
ITQ [2011-CVPR, 2013-TPAMI]
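Iterative Quantization
• Apply PCA for dimensionality reduction: find $W$ to maximize
$$\frac{1}{n} \operatorname{tr}\left( W^T X^T X W \right), \quad W^T W = I$$
• Keep the top $c$ eigenvectors of the data covariance matrix $X^T X$ to obtain $W$; the projected data is $V = XW$
• Note that if $W$ is an optimal solution, then $\tilde{W} = WR$ is also optimal for any orthogonal $c \times c$ matrix $R$
• Key idea: find $R$ to minimize the quantization loss:
$$Q(B, R) = \| B - VR \|_F^2$$
• Expanding the norm gives $\| B \|_F^2 + \| V \|_F^2 - 2 \operatorname{tr}(B R^T V^T) = nc + \| V \|_F^2 - 2 \operatorname{tr}(B R^T V^T)$. $nc$ and $V$ are fixed, so this is equivalent to maximizing
$$\operatorname{tr}\left( B R^T V^T \right)$$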
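A compact sketch of the alternating scheme, assuming the standard approach of fixing the rotation to update the binary codes (take signs) and fixing the codes to update the rotation (an orthogonal Procrustes problem solved with an SVD). The random initial rotation, iteration count, toy data, and function names are illustrative assumptions.

```python
import numpy as np


def pca_project(X, c):
    """Center X and project it onto the top-c eigenvectors of X^T X."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)
    W = eigvecs[:, np.argsort(eigvals)[::-1][:c]]
    return Xc @ W, W


def itq_rotation(V, n_iter=50, seed=0):
    """Alternate between binary codes B and an orthogonal rotation R."""
    c = V.shape[1]
    rng = np.random.default_rng(seed)
    # Random orthogonal initialization (Q factor of a Gaussian matrix).
    R, _ = np.linalg.qr(rng.standard_normal((c, c)))
    losses = []
    for _ in range(n_iter):
        # Fix R, update B: the sign of the rotated data minimizes the loss.
        B = np.where(V @ R >= 0, 1.0, -1.0)
        losses.append(float(np.linalg.norm(B - V @ R) ** 2))
        # Fix B, update R: maximize tr(B R^T V^T) via the SVD of V^T B.
        U, _, Vt = np.linalg.svd(V.T @ B)
        R = U @ Vt
    B = np.where(V @ R >= 0, 1.0, -1.0)
    losses.append(float(np.linalg.norm(B - V @ R) ** 2))
    return B, R, losses


# Toy data: 200 random 16-dimensional points reduced to 4-bit codes.
X = np.random.default_rng(42).standard_normal((200, 16))
V, W = pca_project(X, c=4)
B, R, losses = itq_rotation(V)
```

Each half-step can only lower the quantization loss or leave it unchanged, so the recorded `losses` sequence is non-increasing.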

13
TSH [2013-ICCV]
Two-Step Hashing

15
SHU [2013-IJCAI]
Smart Hashing Update
1. Consistency-based Selection;
2. Similarity-based Selection.
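$$\mathrm{Diff}(k, j) = \min\{ \mathrm{num}(k, j, -1), \ \mathrm{num}(k, j, 1) \}$$
$$\min_{H_l \in \{-1, 1\}^{l \times r}} Q = \left\| \frac{1}{r} H_l H_l^T - S \right\|_F^2$$
$$\min_{k \in \{1, 2, \dots, r\}} R_k = \left\| rS - H_{r-1}^{k} \left( H_{r-1}^{k} \right)^T \right\|_F^2$$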

16
TSH [2014-ACL]
Two-Stage Hashing
• LSH for neighbor candidate pruning; ITQ for effective re-ranking.
• LSH captures term similarity; ITQ captures topic similarity.
• Advantages:
– High hash lookup success rate, attained by the LSH stage;
– High search precision, due to the ITQ re-ranking stage;
– Scans only a small portion of the entire dataset;
– Integrates two similarity measures.
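The two stages can be sketched as a bucket lookup followed by a Hamming re-ranking of the surviving candidates. The codes below are hand-made stand-ins: in the method described above the first set would come from LSH and the second from ITQ. All names and the toy data are hypothetical.

```python
from collections import defaultdict

import numpy as np


def build_buckets(lsh_codes):
    """Stage 1 index: map each LSH code to the ids of the documents that share it."""
    buckets = defaultdict(list)
    for doc_id, code in enumerate(lsh_codes):
        buckets[tuple(int(b) for b in code)].append(doc_id)
    return buckets


def two_stage_search(query_lsh, query_itq, buckets, itq_codes, top_k=3):
    """Prune with the LSH bucket, then re-rank candidates by ITQ Hamming distance."""
    candidates = buckets.get(tuple(int(b) for b in query_lsh), [])
    if not candidates:
        return []
    # Stage 2 touches only the candidates, not the whole collection.
    dists = np.count_nonzero(itq_codes[candidates] != query_itq, axis=1)
    order = np.argsort(dists, kind="stable")[:top_k]
    return [(candidates[i], int(dists[i])) for i in order]


# Toy collection of four documents with 2-bit LSH codes and 3-bit ITQ codes.
lsh_codes = np.array([[0, 1], [0, 1], [1, 0], [0, 1]])
itq_codes = np.array([[1, 1, 0], [0, 0, 0], [1, 1, 1], [1, 0, 0]])
buckets = build_buckets(lsh_codes)

results = two_stage_search([0, 1], np.array([1, 1, 0]), buckets, itq_codes, top_k=2)
```

Document 2 is never scored because it falls in a different LSH bucket, which is the "scan only a small portion" advantage in miniature.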

17
SHTTM [2013-SIGIR]
Semantic Hashing Using Tags and Topic Modeling
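Hash Code Learning (Tag Consistency and Similarity Preservation):
$$\min_{Y, U} \ \| T - U^T Y \|_F^2 + C \| U \|^2 + \gamma \| Y - \theta \|^2 \quad \text{s.t.} \quad Y \in \{-1, 1\}^{k \times n}, \quad Y \mathbf{1} = 0$$
Hash Function Learning:
$$y_j = f(x_j) = W x_j$$
$$W^* = \arg\min_{W} \sum_{j=1}^{n} \| y_j - W x_j \|^2 + \lambda \| W \|^2 \quad \Rightarrow \quad W^* = Y X^T \left( X X^T + \lambda I \right)^{-1}$$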
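The hash function learning step is a ridge regression with the closed form W* = Y Xᵀ (X Xᵀ + λI)⁻¹, where X holds one document per column and Y holds the learned codes. The sketch below computes that closed form on random stand-in data; the data, the value of λ, the zero threshold used to binarize new documents, and the function names are assumptions for illustration.

```python
import numpy as np


def learn_hash_function(X, Y, lam=1.0):
    """Closed-form ridge solution W* = Y X^T (X X^T + lam I)^{-1}.

    X: d x n feature matrix (one document per column).
    Y: k x n matrix of learned hash codes.
    """
    d = X.shape[0]
    A = X @ X.T + lam * np.eye(d)
    # A is symmetric, so W = Y X^T A^{-1} equals (A^{-1} X Y^T)^T.
    return np.linalg.solve(A, X @ Y.T).T


def hash_codes(W, X):
    """Hash documents with f(x) = W x, binarized at zero (assumed threshold)."""
    return np.where(W @ X >= 0, 1, -1)


# Random stand-in data: 40 documents, 5 features, 3-bit codes.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 40))
Y = np.where(rng.standard_normal((3, 40)) >= 0, 1.0, -1.0)
lam = 0.5

W = learn_hash_function(X, Y, lam=lam)
# At the minimizer the gradient (W X - Y) X^T + lam W vanishes.
gradient = (W @ X - Y) @ X.T + lam * W
new_codes = hash_codes(W, X)
```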

18
DVH [2013-ICML]
Predictable Dual-View Hashing
The goal is to find two sets of hyperplanes that map the visual and textual spaces into a common subspace.
[Figure labels: CCA; Multi-SVM]

19
MVH [2011-SIGIR]
Composite Hashing with Multiple Information Sources
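• Overall Objective
$$J(\mathbf{Y}, \mathbf{W}, \boldsymbol{\alpha}) = J_S(\mathbf{Y}) + J_C(\mathbf{Y}, \mathbf{W})$$
[Expanded objective garbled in extraction. The recoverable pieces are a trace term $\operatorname{tr}(\cdot)$ in $\mathbf{Y}$ and a Laplacian $\tilde{\mathbf{L}}$, a squared-norm term involving $\mathbf{Y}$, $\alpha_k$, $\mathbf{W}^{(k)}$, and $\mathbf{X}^{(k)}$, and sums over the sources $k = 1, \dots, M$, with weights $C_1$ and $C_2$.]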

21
LSH in MapReduce – Key Idea

22
LSH in MapReduce – First Round of MapReduce

23
LSH in MapReduce – Second Round of MapReduce

24
Reference
[1]. Gionis A, Indyk P, Motwani R. Similarity search in high dimensions via
hashing[C]//VLDB. 1999, 99: 518-529.
[2]. Andoni A, Indyk P. Near-optimal hashing algorithms for approximate nearest neighbor
in high dimensions[C]//Foundations of Computer Science, 2006. FOCS'06. 47th Annual
IEEE Symposium on. IEEE, 2006: 459-468.
[3]. Andoni A, Indyk P. Near-Optimal Hashing Algorithms for Approximate Nearest
Neighbor in High Dimensions[J]. COMMUNICATIONS OF THE ACM, 2008, 51(1): 117.
[4]. Charikar M S. Similarity estimation techniques from rounding
algorithms[C]//Proceedings of the thirty-fourth annual ACM symposium on Theory of
computing. ACM, 2002: 380-388.
[5]. Manku G S, Jain A, Das Sarma A. Detecting near-duplicates for web
crawling[C]//Proceedings of the 16th international conference on World Wide Web. ACM,
2007: 141-150.
[6]. Zhang D, Wang J, Cai D, et al. Self-taught hashing for fast similarity
search[C]//Proceedings of the 33rd international ACM SIGIR conference on Research
and development in information retrieval. ACM, 2010: 18-25.
[7]. Liu W, Wang J, Ji R, et al. Supervised hashing with kernels[C]//Computer Vision and
Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012: 2074-2081.

25
Reference
[8]. Gong Y, Lazebnik S. Iterative quantization: A procrustean approach to learning binary
codes[C]//Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on.
IEEE, 2011: 817-824.
[9]. Gong Y, Lazebnik S, Gordo A, et al. Iterative quantization: A procrustean approach to
learning binary codes for large-scale image retrieval[J]. Pattern Analysis and Machine
Intelligence, IEEE Transactions on, 2013, 35(12): 2916-2929.
[10]. Lin G, Shen C, Suter D, et al. A general two-step approach to learning-based
hashing[C]//Computer Vision (ICCV), 2013 IEEE International Conference on. IEEE,
2013: 2552-2559.
[11]. Yang Q, Huang L K, Zheng W S, et al. Smart hashing update for fast
response[C]//Proceedings of the Twenty-Third international joint conference on Artificial
Intelligence. AAAI Press, 2013: 1855-1861.
[12]. Li H, Liu W, Ji H. Two-Stage Hashing for Fast Document Retrieval[C]. ACL. 2014
[13]. Wang Q, Zhang D, Si L. Semantic hashing using tags and topic
modeling[C]//Proceedings of the 36th international ACM SIGIR conference on Research
and development in information retrieval. ACM, 2013: 213-222.
[14]. Rastegari M, Choi J, Fakhraei S, et al. Predictable Dual-View
Hashing[C]//Proceedings of The 30th International Conference on Machine Learning.
2013: 1328-1336.

26
Reference
[15]. Zhang D, Wang F, Si L. Composite hashing with multiple information
sources[C]//Proceedings of the 34th international ACM SIGIR conference on Research
and development in Information Retrieval. ACM, 2011: 225-234.
[16]. Szmit, Radosław. "Locality Sensitive Hashing for Similarity Search Using
MapReduce on Large Scale Data." Language Processing and Intelligent Information
Systems. Springer Berlin Heidelberg, 2013. 171-178.
[17]. Blog: Location Sensitive Hashing in Map Reduce:
http://horicky.blogspot.hk/2012/09/location-sensitive-hashing-in-map-reduce.html
[18]. Likelike Project: https://github.com/takahi-i/likelike
[19]. Jun Wang. Learning to Hash for Large-Scale Search. 2013 CIKM Tutorial.

27
Discussions and Questions?
Thank you!
2014-07-04
