1
Learning to Hash for Large-Scale Search
Xu Jiaming
Chinese Academy of Sciences
2014-07-04 @CUHK
2
Motivation
• Similarity-based search has become popular in many applications
– Image/video search and retrieval: find the most similar images/videos
– Audio search: find similar songs
– Product search: find shoes with a similar style but a different color
– Patient search: find patients with a similar diagnostic status
• Two key components:
– Similarity/distance measure
– Indexing scheme
Whittlesearch (Kovashka et al. 2013)
- 2013 CIKM Tutorial by Jun Wang
3
A Conceptual Diagram for Hashing Based Image Search System
[Diagram components: Image Database; Hash Function Design; Indexing and Search; Similarity Search & Retrieval; Reranking Refinement; Visual Search Applications]
Designing compact yet accurate hashing codes is a critical component to make the search effective
- 2013 CIKM Tutorial by Jun Wang
4
Outline
• Background (data-independent)
– Locality Sensitive Hashing [1999-VLDB, 2006-FOCS, 2008-Communications]
– SimHash [2002-STOC, 2007-WWW]
• Learning to Hash (data-dependent)
– Unsupervised vs. Supervised: STH [2010-SIGIR] vs. SHK [2012-CVPR]
– One-Step vs. Two-Step: ITQ [2011-CVPR, 2013-TPAMI] vs. TSH [2013-ICCV]
• Others (data-dependent)
– Smart Hashing Update for Fast Response [2013-IJCAI]
– Two-Stage Hashing [2014-ACL]
– Semantic Hashing with Topics and Tags [2013-SIGIR]
– Dual-View Hashing [2013-ICML]
– Multiple View Hashing [2011-SIGIR]
• LSH in MapReduce
6
LSH [1999-VLDB, 2006-FOCS, 2008-Communications]
Locality Sensitive Hashing (LSH)
[Figure: random hash functions each assign the database items a bit, 0 or 1; the query receives the code 101]
- 2013 CIKM Tutorial by Jun Wang
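The figure's idea can be sketched in a few lines of Python. This is only an illustrative sketch under assumptions: it uses random hyperplanes (the cosine-similarity variant of LSH), and the names make_hyperplanes, lsh_code and build_index are hypothetical helpers, not taken from the tutorial.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian hyperplanes in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_code(x, planes):
    """One bit per hyperplane: 1 if x lies on its non-negative side, else 0."""
    bits = []
    for plane in planes:
        dot = sum(w * xi for w, xi in zip(plane, x))
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)


def build_index(items, planes):
    """Group database items into buckets keyed by their hash code."""
    table = {}
    for key, vec in items.items():
        table.setdefault(lsh_code(vec, planes), []).append(key)
    return table


planes = make_hyperplanes(num_bits=3, dim=4)
database = {"a": [1.0, 0.5, 0.0, 2.0], "b": [2.0, 1.0, 0.0, 4.0]}
index = build_index(database, planes)
# A query only needs to be compared against the items in its own bucket.
candidates = index.get(lsh_code([1.0, 0.5, 0.0, 2.0], planes), [])
```

Because the hyperplanes are drawn without looking at the data, this scheme is data-independent, which is what separates it from the learned methods later in the deck.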
7
SimHash [2002-STOC, 2007-WWW]
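Step 1: Compute TF-IDF weights W1, W2, …, Wn for the observed features of the text
Step 2: Hash Function: map each feature to a bit string
W1 → 100110
W2 → 110000
…
Wn → 001001
Step 3: Signature: at each bit position, use +W where the bit is 1 and −W where it is 0
W1: W1 −W1 −W1 W1 W1 −W1
W2: W2 W2 −W2 −W2 −W2 −W2
…
Wn: −Wn −Wn Wn −Wn −Wn Wn
Step 4: Sum the signatures position by position: 13, 108, −22, −5, −32, 55
Step 5: Generate Fingerprint (1 for a positive sum, 0 otherwise): 1, 1, 0, 0, 0, 1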
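The five steps can be sketched as follows. This is a minimal sketch under assumptions: the feature weights are supplied directly (standing in for TF-IDF), MD5 is used as the per-feature hash purely for illustration, and the function names are hypothetical.

```python
import hashlib


def feature_hash(feature, bits=64):
    """Step 2: map a feature string to a fixed-width integer hash."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def simhash(weights, bits=64):
    """Steps 3-5: signed weights per bit, column sums, then sign bits."""
    sums = [0.0] * bits
    for feature, w in weights.items():
        h = feature_hash(feature, bits)
        for i in range(bits):
            # +w where the feature's hash bit is 1, -w where it is 0
            sums[i] += w if (h >> i) & 1 else -w
    fingerprint = 0
    for i, s in enumerate(sums):
        if s > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


doc1 = {"hashing": 2.0, "search": 1.5, "image": 0.5}
doc2 = {"hashing": 2.0, "search": 1.5, "video": 0.5}
distance = hamming(simhash(doc1), simhash(doc2))
```

Documents whose fingerprints differ in only a few bits are treated as near-duplicates; the threshold on the Hamming distance is an application choice.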
9
STH [2010-SIGIR]
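Self Taught Hashing (STH)
Unsupervised Learning (hash codes):
$$\min \sum_{i,j} S_{ij} \, \|y_i - y_j\|^2 \quad \text{s.t.} \quad y_i \in \{-1,1\}^k, \quad \sum_i y_i = 0, \quad \frac{1}{n} \sum_i y_i y_i^T = I$$
Laplacian Eigenmap form:
$$\min \; \mathrm{trace}\big(Y^T (D - W) Y\big) \quad \text{s.t.} \quad Y(i,j) \in \{-1,1\}, \quad Y^T \mathbf{1} = 0, \quad Y^T Y = I$$
where W is the similarity matrix (entries S_ij) and D is its diagonal degree matrix.
Supervised Learning (hash function, trained with the learned codes as labels)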
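The link between the pairwise objective and the trace (Laplacian Eigenmap) objective can be spelled out as below. The derivation assumes a symmetric similarity matrix S (written W in the trace form), D = diag(Σ_j S_ij), and Y holding the codes y_i^T as rows; the constant factor 2 does not change the minimizer.

```latex
\begin{aligned}
\sum_{i,j} S_{ij}\,\|y_i - y_j\|^2
 &= \sum_{i,j} S_{ij}\left(\|y_i\|^2 + \|y_j\|^2 - 2\,y_i^T y_j\right) \\
 &= 2\sum_i D_{ii}\,\|y_i\|^2 - 2\sum_{i,j} S_{ij}\,y_i^T y_j \\
 &= 2\,\mathrm{trace}\!\left(Y^T D\,Y\right) - 2\,\mathrm{trace}\!\left(Y^T S\,Y\right) \\
 &= 2\,\mathrm{trace}\!\left(Y^T (D - S)\,Y\right)
\end{aligned}
```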
10
SHK [2012-CVPR]
Pairwise similarity
Code inner product approximates pairwise similarity
Supervised Hashing with Kernels
- 2013CIKM Tutorial by Jun Wang
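Why the code inner product is a convenient stand-in for similarity: for codes in {−1, 1}^r, the inner product equals r minus twice the Hamming distance. The sketch below checks this identity and computes a squared fitting error between (1/r)·H·Hᵀ and a pairwise similarity matrix S. The function names and the toy data are illustrative assumptions, not the paper's code.

```python
def inner(a, b):
    """Inner product of two codes with entries in {-1, 1}."""
    return sum(x * y for x, y in zip(a, b))


def hamming(a, b):
    """Number of positions where two codes disagree."""
    return sum(1 for x, y in zip(a, b) if x != y)


def fit_residual(H, S):
    """Squared error between (1/r) * H H^T and the similarity matrix S."""
    r = len(H[0])
    total = 0.0
    for i, hi in enumerate(H):
        for j, hj in enumerate(H):
            total += (inner(hi, hj) / r - S[i][j]) ** 2
    return total


a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
# inner(a, b) == r - 2 * hamming(a, b) for any pair of {-1, 1} codes
identity_holds = inner(a, b) == len(a) - 2 * hamming(a, b)

# Toy labels: items 0 and 1 are similar (+1), item 2 is dissimilar (-1).
S = [[1, 1, -1], [1, 1, -1], [-1, -1, 1]]
H = [[1, 1], [1, 1], [-1, -1]]
residual = fit_residual(H, S)
```

A small residual means that Hamming distances between the codes reproduce the supervised similar/dissimilar labels.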
12
ITQ [2011-CVPR, 2013-TPAMI]
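Iterative Quantization
• Apply PCA for dimensionality reduction: find $W$ to maximize $\frac{1}{n}\,\mathrm{tr}(W^T X^T X W)$ subject to $W^T W = I$
• Keep the top c eigenvectors of the data covariance matrix to obtain $W$; the projected data is $V = XW$
• Note that if $W$ is an optimal solution, then $WR$ is also optimal for any orthogonal $c \times c$ matrix $R$
• Key idea: find $R$ to minimize the quantization loss $Q(B, R) = \|B - VR\|_F^2$, where $B \in \{-1,1\}^{n \times c}$ holds the binary codes
• $\|B\|_F^2 = nc$ and $V$ are fixed, so this is equivalent to maximizing $\mathrm{tr}(B R^T V^T)$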
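The alternating scheme implied by these bullets can be sketched with NumPy. This is a hedged sketch, not the authors' code: it assumes X is an n × d data matrix, starts from a random orthogonal R, fixes R to update the codes B by taking signs, and fixes B to update R with an orthogonal Procrustes step. The function name itq and its parameters are assumptions.

```python
import numpy as np


def itq(X, c, n_iter=50, seed=0):
    """Return binary codes B (n x c), PCA projection W (d x c), rotation R (c x c)."""
    rng = np.random.default_rng(seed)
    X = X - X.mean(axis=0)                      # center the data
    cov = X.T @ X / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:c]         # indices of the top c eigenvectors
    W = eigvecs[:, top]
    V = X @ W                                   # projected data
    R, _ = np.linalg.qr(rng.standard_normal((c, c)))  # random orthogonal start
    for _ in range(n_iter):
        B = np.where(V @ R >= 0, 1.0, -1.0)     # fix R, update the codes
        U, _, Vt = np.linalg.svd(V.T @ B)       # fix B, solve Procrustes for R
        R = U @ Vt
    B = np.where(V @ R >= 0, 1.0, -1.0)
    return B, W, R


rng = np.random.default_rng(1)
X_demo = rng.standard_normal((200, 8))
B_demo, W_demo, R_demo = itq(X_demo, c=4, n_iter=10)
```

A new point x would be encoded, under the same assumptions, as the sign of (x − mean) · W · R.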
13
TSH [2013-ICCV]
Two-Step Hashing
15
SHU [2013-IJCAI]
Smart Hashing Update
1. Consistency-based Selection:
$$\mathrm{Diff}(k, j) = \min\{\mathrm{num}(k, j, -1),\ \mathrm{num}(k, j, 1)\}$$
2. Similarity-based Selection:
$$Q = \min_{H_l \in \{-1,1\}^{l \times r}} \left\| \frac{1}{r} H_l H_l^T - S \right\|_F^2$$
$$R = \min_{k \in \{1,2,\ldots,r\}} \left\| rS - H_{r-1}^{k} \big(H_{r-1}^{k}\big)^T \right\|_F^2$$
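Consistency-based selection can be sketched as follows. The sketch rests on assumptions: num(k, j, b) is read as the number of items of class j whose k-th bit equals b, per-class Diff values are summed to score a bit, and the t highest-scoring (least consistent) bits are chosen for updating. The helper names are hypothetical.

```python
def diff(codes, labels, k, j):
    """Diff(k, j): size of the minority bit value at position k within class j."""
    bits = [code[k] for code, label in zip(codes, labels) if label == j]
    return min(bits.count(-1), bits.count(1))


def select_bits(codes, labels, t):
    """Pick the t bit positions that are least consistent within classes."""
    r = len(codes[0])
    classes = sorted(set(labels))
    # Assumption: a bit's score is the sum of Diff(k, j) over all classes j.
    scores = [sum(diff(codes, labels, k, j) for j in classes) for k in range(r)]
    return sorted(range(r), key=lambda k: -scores[k])[:t]


codes = [[1, 1, 1], [1, -1, 1], [-1, 1, -1], [-1, -1, -1]]
labels = [0, 0, 1, 1]
to_update = select_bits(codes, labels, t=1)
```

In this toy data, bits 0 and 2 agree within each class, while bit 1 splits both classes evenly, so bit 1 is the one selected.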
16
TSH [2014-ACL]
Two-Stage Hashing
• LSH for neighbor candidate pruning; ITQ for effective re-ranking.
• LSH captures term similarity; ITQ captures topic similarity.
• Advantages:
– A high hash lookup success rate is attained by the LSH stage;
– High search precision is achieved by the ITQ re-ranking stage;
– Only a small portion of the entire dataset is scanned;
– Two similarity measures are integrated.
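The prune-then-re-rank flow can be sketched as below. It is a minimal sketch under assumptions: the LSH bucket table and the per-document ITQ codes are taken as precomputed inputs (integers used as bit strings), and two_stage_search is a hypothetical name.

```python
def hamming(a, b):
    """Hamming distance between two integer-encoded binary codes."""
    return bin(a ^ b).count("1")


def two_stage_search(query_lsh, query_itq, lsh_table, itq_codes, top_k=3):
    """Stage 1: fetch the query's LSH bucket. Stage 2: re-rank it by ITQ codes."""
    candidates = lsh_table.get(query_lsh, [])          # candidate pruning
    ranked = sorted(candidates,
                    key=lambda doc: hamming(itq_codes[doc], query_itq))
    return ranked[:top_k]


lsh_table = {"101": ["d1", "d2", "d3"], "010": ["d4"]}
itq_codes = {"d1": 0b1111, "d2": 0b1100, "d3": 0b1000, "d4": 0b1100}
results = two_stage_search("101", 0b1100, lsh_table, itq_codes, top_k=2)
```

Only the three documents in bucket "101" are scanned; d4 is never compared even though its ITQ code matches the query exactly, which is the trade-off the pruning stage accepts.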
17
SHTTM [2013-SIGIR]
Semantic Hashing Using Tags and Topic Modeling
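Hash Code Learning:
$$\min_{Y, U} \; \|T - U^T Y\|_F^2 + C\,\|U\|^2 + \gamma\,\|Y - \theta\|^2 \quad \text{s.t.} \quad Y \in \{-1,1\}^{k \times n}, \quad Y\mathbf{1} = 0$$
(first term: Tag Consistency; last term: Similarity Preservation)
Hash Function Learning:
$$y_j = f(x_j) = W x_j$$
$$W^* = \arg\min_{W} \sum_{j=1}^{n} \|y_j - W x_j\|^2 + \lambda \|W\|^2 \;\Rightarrow\; W^* = Y X^T (X X^T + \lambda I)^{-1}$$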
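The closed-form hash function can be sketched with NumPy. This is a sketch under assumptions: X is d × n with one column per document, Y is k × n with one code per column, and new documents are binarized by taking the sign of W·x (the slide gives only the linear map). The function names are hypothetical.

```python
import numpy as np


def fit_hash_function(X, Y, lam=1.0):
    """Ridge solution W* = Y X^T (X X^T + lam I)^(-1), computed via a linear solve."""
    d = X.shape[0]
    A = X @ X.T + lam * np.eye(d)          # symmetric positive definite for lam > 0
    return np.linalg.solve(A, X @ Y.T).T   # equals Y X^T A^(-1) because A is symmetric


def hash_codes(W, X):
    """Assumed binarization: sign of the linear projection, as values in {-1, 1}."""
    return np.where(W @ X >= 0, 1, -1)


rng = np.random.default_rng(0)
X_train = rng.standard_normal((5, 50))
W_true = rng.standard_normal((3, 5))
Y_train = W_true @ X_train                 # noiseless targets for a sanity check
W_fit = fit_hash_function(X_train, Y_train, lam=1e-8)
codes = hash_codes(W_fit, X_train)
```

Solving the linear system avoids forming the explicit inverse, which is the usual numerically safer way to evaluate this expression.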
18
DVH [2013-ICML]
Predictable Dual-View Hashing
The goal is to find two sets of hyperplanes that map the visual and textual spaces into a common subspace.
[Figure labels: CCA; Multi-SVM]
19
MVH [2011-SIGIR]
Composite Hashing with Multiple Information Sources
• Overall Objective:
$$J(Y, W, \alpha) = J_S(Y) + J_C(Y, W) = C_1 \, \mathrm{tr}\Big(Y \sum_{k=1}^{M} \tilde{L}^{(k)} Y^T\Big) + C_2 \, \Big\| Y - \sum_{k=1}^{M} \alpha_k \big(W^{(k)}\big)^T X^{(k)} \Big\|^2 + \sum_{k=1}^{M} \big\| W^{(k)} \big\|^2$$
21
LSH in MapReduce – Key Idea
22
LSH in MapReduce – First Round of MapReduce
23
LSH in MapReduce – Second Round of MapReduce
24
Reference
[1]. Gionis A, Indyk P, Motwani R. Similarity search in high dimensions via
hashing[C]//VLDB. 1999, 99: 518-529.
[2]. Andoni A, Indyk P. Near-optimal hashing algorithms for approximate nearest neighbor
in high dimensions[C]//Foundations of Computer Science, 2006. FOCS'06. 47th Annual
IEEE Symposium on. IEEE, 2006: 459-468.
[3]. Andoni A, Indyk P. Near-Optimal Hashing Algorithms for Approximate Nearest
Neighbor in High Dimensions[J]. COMMUNICATIONS OF THE ACM, 2008, 51(1): 117.
[4]. Charikar M S. Similarity estimation techniques from rounding
algorithms[C]//Proceedings of the thirty-fourth annual ACM symposium on Theory of
computing. ACM, 2002: 380-388.
[5]. Manku G S, Jain A, Das Sarma A. Detecting near-duplicates for web
crawling[C]//Proceedings of the 16th international conference on World Wide Web. ACM,
2007: 141-150.
[6]. Zhang D, Wang J, Cai D, et al. Self-taught hashing for fast similarity
search[C]//Proceedings of the 33rd international ACM SIGIR conference on Research
and development in information retrieval. ACM, 2010: 18-25.
[7]. Liu W, Wang J, Ji R, et al. Supervised hashing with kernels[C]//Computer Vision and
Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012: 2074-2081.
[8]. Gong Y, Lazebnik S. Iterative quantization: A procrustean approach to learning binary
codes[C]//Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on.
IEEE, 2011: 817-824.
[9]. Gong Y, Lazebnik S, Gordo A, et al. Iterative quantization: A procrustean approach to
learning binary codes for large-scale image retrieval[J]. Pattern Analysis and Machine
Intelligence, IEEE Transactions on, 2013, 35(12): 2916-2929.
[10]. Lin G, Shen C, Suter D, et al. A general two-step approach to learning-based
hashing[C]//Computer Vision (ICCV), 2013 IEEE International Conference on. IEEE,
2013: 2552-2559.
[11]. Yang Q, Huang L K, Zheng W S, et al. Smart hashing update for fast
response[C]//Proceedings of the Twenty-Third international joint conference on Artificial
Intelligence. AAAI Press, 2013: 1855-1861.
[12]. Li H, Liu W, Ji H. Two-Stage Hashing for Fast Document Retrieval[C]. ACL. 2014
[13]. Wang Q, Zhang D, Si L. Semantic hashing using tags and topic
modeling[C]//Proceedings of the 36th international ACM SIGIR conference on Research
and development in information retrieval. ACM, 2013: 213-222.
[14]. Rastegari M, Choi J, Fakhraei S, et al. Predictable Dual-View
Hashing[C]//Proceedings of The 30th International Conference on Machine Learning.
2013: 1328-1336.
[15]. Zhang D, Wang F, Si L. Composite hashing with multiple information
sources[C]//Proceedings of the 34th international ACM SIGIR conference on Research
and development in Information Retrieval. ACM, 2011: 225-234.
[16]. Szmit, Radosław. "Locality Sensitive Hashing for Similarity Search Using
MapReduce on Large Scale Data." Language Processing and Intelligent Information
Systems. Springer Berlin Heidelberg, 2013. 171-178.
[17]. Blog: Location Sensitive Hashing in Map Reduce:
http://horicky.blogspot.hk/2012/09/location-sensitive-hashing-in-map-reduce.html
[18]. Likelike Project: https://github.com/takahi-i/likelike
[19]. Jun Wang. Learning to Hash for Large-Scale Search. 2013 CIKM Tutorial.
27
Discussions and Questions?
Thank you!
2014-07-04

Editor's Notes

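  • #5: Go straight into the individual hashing models.
  • #6: Go straight into the individual hashing models.
  • #7: LSH, the most widely applied work.
  • #8: Work from Google, used to de-duplicate text content in web crawling.
  • #9: Go straight into the individual hashing models.
  • #12: Go straight into the individual hashing models.
  • #13: First reduce the dimensionality with PCA to obtain the low-dimensional vectors V. The most direct approach is then to fit these vectors as they are, i.e. to binarize V directly. In the PCA problem, however, the optimal solution W remains optimal after any orthogonal transformation. We can therefore apply an arbitrary orthogonal transformation to the low-dimensional vectors and have the hash codes fit the rotated matrix.
  • #14: This ICCV paper is from the University of Adelaide, Australia. Its motivation is that most current hashing methods learn the hash encoding (dimensionality reduction) of the dataset and the hash prediction function jointly. This tight coupling limits flexibility and makes the optimization problem complicated and hard to solve. The authors propose a framework that splits the hashing problem into two stages: the first learns hash codes for the existing dataset, and the second learns hash functions from those codes. If the motivation is unclear, consider the illustration below, taken from the paper's main reference work, Self-Taught Hashing (SIGIR 2010), a typical two-stage hash learning method. It comes from Luo Si's lab at Purdue University; the third author is Deng Cai of Zhejiang University, so the work may have been done jointly during an exchange visit. In the figure, a collection of texts is first mapped to binary hash codes by an unsupervised dimensionality-reduction method; this is the first stage. The learned hash codes are then used as binary labels to learn a hash function with a supervised learning method. Both stages are offline learning, whereas querying happens online. STH is thus already a two-stage framework, and the ICCV paper essentially generalizes it into a summarizing two-step hash learning framework.
  • #15: Go straight into the individual hashing models.
  • #16: This builds on the earlier Two-Step framework and updates the hash functions. The IJCAI paper is from Sun Yat-sen University and builds on a 2012 DMKD article on active-learning-based hashing (DMKD is a B-ranked journal in the retrieval area). The motivation is practical: existing hashing-based methods already achieve fairly good results, but most of them are passive hashing learners and assume that the labeled data is supplied in advance. This paper considers how to update the hashing model as labeled data gradually accumulates, so that the system can respond to the user quickly; this is called Smart Hashing Update. Active learning here means that the system automatically selects some data for the user to label and then updates the whole hashing model from the existing data and the newly labeled data. The algorithm flow is shown in the figure below: each time the user labels new data, it is added to the existing dataset, and the system selects which hash bits need updating; the hash functions of the t selected bits take part in the current round of updating. How to choose these t bits is therefore the key point, and the paper gives two strategies: (1) Consistency-based Selection and (2) Similarity-based Selection. Consistency-based selection looks, over the whole dataset, at how consistent each bit of the hash codes is within a class, i.e. whether the {-1, 1} values at the same bit agree (all +1 or all -1). Bits with poor consistency are selected for updating. The drawback of this strategy is that it ignores the similarity between the existing data and the new external data. The second strategy is therefore similarity-based and measures how good the hash codes are within a class. CVPR 2012 gives a performance measure in which H is the hash codes of one class and S is the affinity matrix; the smaller the measure, the better the codes. To pick the t worst hash functions, the measure is modified here: the k-th bit is removed from the hash functions in turn, the removals are compared to see which one makes the measure drop most noticeably, and the t bits with the largest effect are selected. This completes the selection, after which the selected functions are re-learned from the labeled data.
  • #17: Different from Two-Step Hashing: two hashing methods are applied one after the other.
  • #18: Takes topic-model and tag information into account; a two-step hashing approach.
  • #19: A one-step hashing method. Constraints are added to prevent trivial solutions, and CCA is used to solve the optimization. However, orthogonality is sometimes unnecessary and even harmful.
  • #20: A one-step hashing method.
  • #21: Go straight into the individual hashing models.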