Image Similarity Detection at Scale Using LSH and Tensorflow with Andrey Gusev
Image Similarity Detection Using LSH and Tensorflow
Andrey Gusev, June 6, 2018
Help you discover and do
what you love.
200m+ people on Pinterest each month
100b+ Pins
2b+ Boards
10b+ Recommendations/Day
Agenda
1. Neardup, clustering and LSH
2. Candidate generation
3. Deep dive
4. Candidate selection
5. TF on Spark
Neardup
[Figure: example images labeled Not Neardup, Unrelated, Neardup and Duplicate]
Clustering
Not An Equivalence Class
Formulation
For each image, find a canonical image that represents an equivalence class.
Problem
Neardup is not an equivalence relation because it is not transitive.
This means we cannot find a perfect partition in which all images within a
cluster are closer to each other than to images in other clusters.
Incremental approximate K-Cut
Incrementally:
1. Generate candidates via batch LSH search
2. Select candidates via a TF model
3. Take a transitive closure over selected candidates
4. Pass over clusters and greedily select sub-clusters (K-Cut).
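Step 3 above can be illustrated with a small union-find. This is a minimal sketch in plain Scala collections, not the Spark job from the talk (the summary describes that as a two-pass transitive closure, which is not reproduced here). The `ImgId` alias, the object name and the choice of the smallest id as the canonical image are assumptions made for illustration. The K-Cut pass of step 4 is not shown.

```scala
object TransitiveClosureSketch {
  // Hypothetical alias; the slides do not define the image id type.
  type ImgId = Int

  /** Groups ids into connected components of the graph formed by the selected pairs. */
  def clusters(pairs: Seq[(ImgId, ImgId)]): Set[Set[ImgId]] = {
    val parent = scala.collection.mutable.Map.empty[ImgId, ImgId]

    // Finds the representative of x, compressing the path along the way.
    def find(x: ImgId): ImgId = {
      val p = parent.getOrElseUpdate(x, x)
      if (p == x) x
      else {
        val root = find(p)
        parent(x) = root
        root
      }
    }

    // Union: point the larger root at the smaller one, so the canonical id is the minimum.
    pairs.foreach { case (a, b) =>
      val (ra, rb) = (find(a), find(b))
      if (ra != rb) parent(math.max(ra, rb)) = math.min(ra, rb)
    }

    parent.keys.toList.groupBy(find).values.map(_.toSet).toSet
  }
}
```

For example, the pairs (1,2) and (2,3) end up in one cluster even though (1,3) was never selected, which is exactly why a later pass is needed to split clusters that the closure made too large.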
LSH
Embeddings and LSH
- Visual embeddings are high-dimensional vector representations of
entities (in our case images) which capture semantic similarity.
- Produced via neural networks like VGG16, Inception, etc.
- Locality-sensitive hashing (LSH) is a technique used to reduce the
dimensionality of high-dimensional data while approximately preserving
pairwise distances between individual points.
LSH: Locality Sensitive Hashing
- Pick random projection vectors (black in the figure); each one defines a hyperplane
- For each embedding vector, determine on which side of each hyperplane
it lands
- On the same side as the projection vector: set bit to 1
- On the opposite side: set bit to 0
Result 1: <1 1 0>
Result 2: <1 0 1>
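The hashing step above can be sketched as follows. This is an illustrative random-hyperplane LSH in plain Scala, not the production code from the talk. The function names, the Gaussian sampling of the projection vectors, the fixed seed and the convention that a non-negative dot product maps to bit 1 are all assumptions.

```scala
import scala.util.Random

object RandomProjectionLshSketch {
  /** Draws numBits random projection vectors of the given dimension (Gaussian entries). */
  def randomHyperplanes(numBits: Int, dim: Int, seed: Long): Array[Array[Double]] = {
    val rng = new Random(seed)
    Array.fill(numBits, dim)(rng.nextGaussian())
  }

  /** One bit per hyperplane: 1 if the dot product with the projection vector is non-negative, else 0. */
  def signature(embedding: Array[Double], planes: Array[Array[Double]]): Array[Int] =
    planes.map { normal =>
      val dot = normal.zip(embedding).map { case (a, b) => a * b }.sum
      if (dot >= 0) 1 else 0
    }

  /** Number of positions in which two signatures differ. */
  def hamming(a: Array[Int], b: Array[Int]): Int =
    a.zip(b).count { case (x, y) => x != y }
}
```

Scaling an embedding does not change its signature, while negating it flips every bit, so the Hamming distance between signatures tracks the angle between embeddings rather than their length.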
LSH terms
Pick optimal number of terms and bits per term
- 1001110001011000 -> [00]1001 - [01]1100 - [10]0101 - [11]1000
- [x] → a term index
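The term construction in the example above can be sketched in a few lines. The helper below is hypothetical (the slides do not show how terms are built); it assumes the signature is given as a bit string and that the term index is stored in the bits above the chunk.

```scala
object LshTermsSketch {
  /** Splits a bit string into fixed-width chunks and prefixes each chunk with its index. */
  def toTerms(bits: String, bitsPerTerm: Int): Seq[Int] = {
    require(bits.length % bitsPerTerm == 0, "signature length must be a multiple of bitsPerTerm")
    bits.grouped(bitsPerTerm).zipWithIndex.map { case (chunk, index) =>
      // The term index occupies the high bits; the chunk occupies the low bitsPerTerm bits.
      (index << bitsPerTerm) | Integer.parseInt(chunk, 2)
    }.toList
  }
}
```

With `toTerms("1001110001011000", 4)` the four terms are 9, 28, 37 and 56, which in binary are 001001, 011100, 100101 and 111000, matching the slide. The index prefix ensures that equal chunks at different positions never collide as terms.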
Candidate Generation
Neardup Candidate Generation
- Input Data:
RDD[(ImgId, List[LSHTerm])] // billions
- Goal:
RDD[(ImgId, TopK[(ImgId, Overlap)])]
K-Nearest Neighbor (KNN) problem formulation
Neardup Candidate Generation
Given a set of documents each described by LSH terms, example:
A → (1,2,3)
B → (1,3,10)
C → (2,10)
And more generally:
Di → [tj]
Where each Di is a document and [tj] is a list of LSH terms (assume each is a 4-byte integer)
Results:
A → (B,2), (C,1)
B → (A,2), (C,1)
C → (A,1), (B,1)
Spark Candidate Generation
1. Input RDD[(ImgId, List[LSHTerm])] ← both index and query sets
2. flatMap, groupBy input into RDD[(LSHTerm, PostingList)] ← an inverted index
3. flatMap, groupBy into RDD[(LSHTerm, PostingList)] ← a query list
4. Join (2) and (3), flatMap over the query posting lists, and groupBy query ImgId:
RDD[(ImgId, List[PostingList])] ← search results by query.
5. Merge List[List[ImgId]] into TopK[(ImgId, Overlap)] by counting the number of times each ImgId is
seen → RDD[(ImgId, TopK[(ImgId, Overlap)])].
* PostingList = List[ImgId]
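To make the five steps concrete, here is a minimal sketch on local Scala collections rather than RDDs. It treats the same set as both index and query set, and the names `topKOverlaps` and `CandidateGenerationSketch`, the `String` image ids and the tie-breaking by id are assumptions for illustration only.

```scala
object CandidateGenerationSketch {
  type ImgId = String
  type LSHTerm = Int

  /** For each query image, counts shared LSH terms with every other image and keeps the top k. */
  def topKOverlaps(docs: Map[ImgId, List[LSHTerm]], k: Int): Map[ImgId, List[(ImgId, Int)]] = {
    // Step 2: inverted index from term to posting list.
    val invertedIndex: Map[LSHTerm, List[ImgId]] =
      docs.toList
        .flatMap { case (img, terms) => terms.distinct.map(term => (term, img)) }
        .groupBy(_._1)
        .map { case (term, hits) => term -> hits.map(_._2) }

    docs.map { case (query, terms) =>
      // Steps 3-4: look up every query term and collect the matching posting lists.
      val hits = terms.distinct
        .flatMap(term => invertedIndex.getOrElse(term, Nil))
        .filter(_ != query)
      // Step 5: count how often each image appears, then keep the k largest overlaps.
      val counts = hits.groupBy(identity).map { case (img, seen) => img -> seen.size }
      query -> counts.toList.sortBy { case (img, overlap) => (-overlap, img) }.take(k)
    }
  }
}
```

Running it on the documents A → (1,2,3), B → (1,3,10), C → (2,10) reproduces the results listed earlier, for instance A → (B,2), (C,1). At billions of images the intermediate posting lists dominate the cost, which is what the optimizations in the deep dive address.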
Orders of magnitude too slow.
Deep Dive
Dictionary encoding
def mapDocToInt(termIndexRaw: RDD[(String, List[TermId])]): RDD[(String, DocId)] = {
  // ensure that mapping between string and id is stable by sorting
  // this allows attempts to re-use partial stage completions
  termIndexRaw.keys.distinct().sortBy(x => x).zipWithIndex()
}
val stringArray = (for (ind <- 0 to 1000) yield randomString(32)).toArray
val intArray = (for (ind <- 0 to 1000) yield ind).toArray
stringArray: 108128 Bytes*
intArray: 4024 Bytes*
25x memory reduction
* https://www.javamex.com/classmexer/
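The same idea on local collections, as a minimal sketch: distinct keys are sorted and numbered, and every later structure stores the small integer instead of the string. The function names and the sample keys are hypothetical.

```scala
object DictionaryEncodingSketch {
  /** Assigns each distinct string key a dense integer id; sorting keeps the mapping stable across runs. */
  def buildDictionary(keys: Seq[String]): Map[String, Int] =
    keys.distinct.sorted.zipWithIndex.toMap

  /** Replaces string keys by their integer ids. */
  def encode(keys: Seq[String], dictionary: Map[String, Int]): Seq[Int] =
    keys.map(dictionary)
}
```

Because the ids depend only on the sorted set of keys, rebuilding the dictionary from the same input yields the same mapping, which is the property the comment in `mapDocToInt` relies on.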
Variable Byte Encoding
- One bit of each byte is a continuation bit (this is the overhead)
- int → byte (best case)
- 32-char string: up to 25 x 4 = 100x memory reduction
https://nlp.stanford.edu/IR-book/html/htmledition/variable-byte-codes-1.html
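A minimal sketch of a variable byte codec follows. It uses one common convention (seven payload bits per byte, low-order group first, high bit set when more bytes follow), which differs in bit polarity and byte order from the variant in the linked IR book chapter. The object and function names are assumptions, and the codec actually used in the talk is not shown in the slides.

```scala
object VariableByteSketch {
  /** Encodes a non-negative Int in 1 to 5 bytes, 7 payload bits per byte, low-order group first. */
  def encode(value: Int): Array[Byte] = {
    require(value >= 0, "this sketch only handles non-negative values")
    val out = scala.collection.mutable.ArrayBuffer.empty[Byte]
    var remaining = value
    while (remaining >= 0x80) {
      // High bit set: more bytes follow.
      out += ((remaining & 0x7f) | 0x80).toByte
      remaining = remaining >>> 7
    }
    out += remaining.toByte
    out.toArray
  }

  /** Decodes a single value produced by encode. */
  def decode(bytes: Array[Byte]): Int =
    bytes.zipWithIndex.foldLeft(0) { case (acc, (b, i)) => acc | ((b & 0x7f) << (7 * i)) }
}
```

Values below 128 take a single byte, which is the "int → byte" best case; larger ids grow one byte at a time up to five bytes.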
Inverted Index Partitioning
Inverted index is skewed
/**
 * Build a partitioned inverted index by taking docId modulo the number of partitions.
 */
def buildPartitionedInvertedIndex(flatTermIndexAndFreq: RDD[(TermId, (DocId, TermFreq))]):
    RDD[((TermId, TermPartition), Iterable[DocId])] = {
  flatTermIndexAndFreq.map { case (termId, (docId, _)) =>
    // partition documents within the same term to improve balance
    // and reduce the posting list length;
    // floorMod stays non-negative even for Int.MinValue, where Math.abs overflows
    // (assumes DocId and TERM_PARTITIONING are Int)
    ((termId, Math.floorMod(docId, TERM_PARTITIONING).toByte), docId)
  }.groupByKey()
}
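The effect of the extra key component can be seen on a toy posting list. This is a sketch on local collections; `TermPartitioning = 4` is a hypothetical value (the slides do not state the number of partitions per term), and the `Int` aliases are assumptions.

```scala
object PartitionedIndexSketch {
  type TermId = Int
  type DocId = Int

  // Hypothetical value; the slides do not state the number of partitions per term.
  val TermPartitioning = 4

  /** Keys each posting by (termId, docId mod partitions) so a hot term is split across several keys. */
  def build(postings: Seq[(TermId, DocId)]): Map[(TermId, Byte), Seq[DocId]] =
    postings
      .groupBy { case (termId, docId) => (termId, Math.floorMod(docId, TermPartitioning).toByte) }
      .map { case (key, group) => key -> group.map(_._2) }
}
```

A term that appears in 100 documents is stored as four posting lists of 25 entries each instead of one list of 100, so no single key carries the whole skewed term.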
Packing
(Int, Byte) => Long
Before:
  Unsorted: 128.77 MB in 549ms
  Sort+Limit: 4.41 KB in 7511ms
After:
  Unsorted: 38.83 MB in 219ms
  Sort+Limit: 4.41 KB in 467ms
def packDocIdAndByteIntoLong(docId: DocId, docFreq: DocFreq): Long = {
  // frequency goes into the high 32 bits, docId into the low 32 bits
  (docFreq.toLong << 32) | (docId & 0xffffffffL)
}
def unpackDocIdAndByteFromLong(packed: Long): (DocId, DocFreq) = {
  (packed.toInt, (packed >> 32).toByte)
}
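A self-contained round trip of the packing functions, plus one way the packed form could be used. The `Int` and `Byte` aliases follow the "(Int, Byte) => Long" slide heading; the `topK` helper is a hypothetical illustration and not code from the talk.

```scala
object PackingSketch {
  type DocId = Int
  type DocFreq = Byte

  def pack(docId: DocId, docFreq: DocFreq): Long =
    (docFreq.toLong << 32) | (docId & 0xffffffffL)

  def unpack(packed: Long): (DocId, DocFreq) =
    (packed.toInt, (packed >> 32).toByte)

  /** Top-k postings by frequency using a plain sort over primitive longs. */
  def topK(postings: Seq[(DocId, DocFreq)], k: Int): List[(DocId, DocFreq)] = {
    val packed = postings.map { case (d, f) => pack(d, f) }.toArray
    // Ascending sort; the frequency sits in the high bits, so it dominates the order.
    java.util.Arrays.sort(packed)
    packed.reverse.take(k).map(unpack).toList
  }
}
```

Since the frequency occupies the high bits, sorting the longs orders postings by frequency first, and an array of primitive longs avoids allocating one tuple object per posting.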
Slicing
Split query set into slices to reduce spill and size for
“widest” portion of the computation. Union at the end.
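Slicing can be sketched as a small higher-order function: split the queries by hash, run the search on one slice at a time, and concatenate the results. The name `searchInSlices`, the hash-modulo split and the generic signature are assumptions; in Spark the final concatenation would be a union of the per-slice RDDs.

```scala
object SlicingSketch {
  /** Splits the queries into numSlices groups, runs the search per slice, and unions the results. */
  def searchInSlices[Q, R](queries: Seq[Q], numSlices: Int)(search: Seq[Q] => Seq[R]): Seq[R] = {
    val slices = queries.groupBy(q => Math.floorMod(q.hashCode, numSlices))
    // Process slices one at a time so only one slice's intermediate data is live at once.
    slices.toList.sortBy(_._1).flatMap { case (_, slice) => search(slice) }
  }
}
```

The total work is unchanged, but the peak size of the joined intermediate data is bounded by the largest slice rather than by the whole query set.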
Additional Optimizations
- Cost-based optimizer: significant runtime improvements can be realized by
analyzing input data sets and setting performance parameters automatically.
- Counting: Jaccard overlap counting is done via low-level, high-performance collections.
- Off-heap serialization when possible (spark.kryo.unsafe).
Generic Batch LSH Search
- Can be applied generically to KNN; embedding agnostic.
- Can work on an arbitrarily large query set via slicing.
Candidate Selection
TF DNN Classifier
- Transfer learning over VGG16
- Visual embeddings
- XOR hamming bits
- Learning still happens at >1B pairs
- Batch size of 1024, Adam optimizer
[Figure: network diagram with layer labels 4096, 2048, 256, 128, 1]
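The "XOR hamming bits" feature is not spelled out in the slides. One plausible reading, sketched below as an assumption, is that the LSH bit signatures of the two images in a pair are XORed and the result is fed to the classifier as a float vector next to the embeddings. The function names are hypothetical.

```scala
object XorFeaturesSketch {
  /** XORs two equal-length bit signatures; a 1 marks a position where the images disagree. */
  def xorBits(a: Array[Int], b: Array[Int]): Array[Float] = {
    require(a.length == b.length, "signatures must have the same length")
    a.zip(b).map { case (x, y) => (x ^ y).toFloat }
  }

  /** The Hamming distance is the number of set bits in the XOR. */
  def hammingDistance(a: Array[Int], b: Array[Int]): Int =
    xorBits(a, b).count(_ == 1.0f)
}
```

For the two example signatures from the LSH slide, <1 1 0> and <1 0 1>, the XOR is <0 1 1> and the Hamming distance is 2.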
Vectorization: mapPartitions + grouped
- During training and inference, vectorization reduces overhead.
- Spark mapPartitions + grouped allows for large batches while controlling
their size. Works well for inference.
- 2ms/prediction on c3.8xl CPUs with a network of 10MM parameters.
input.mapPartitions { partition: Iterator[(ImgInfo, ImgInfo)] =>
  // break down large partitions into groups and score per group
  partition.grouped(BATCH_SIZE).flatMap { group: Seq[(ImgInfo, ImgInfo)] =>
    // create tensors and score as features: Array[Array[Float]] --> Tensor.create(features)
  }
}
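The batching pattern itself can be tried without Spark or TensorFlow, since `grouped` is a plain Scala `Iterator` method. In this sketch `scoreBatch` is a stand-in that averages each feature row (it is not a model call), and `BatchSize = 3` is a hypothetical value chosen to keep the example small.

```scala
object BatchedScoringSketch {
  // Hypothetical batch size; the slides mention training batches of 1024.
  val BatchSize = 3

  /** Stand-in for a vectorized model call: scores a whole batch of feature rows at once. */
  def scoreBatch(features: Array[Array[Float]]): Array[Float] =
    features.map(row => row.sum / row.length)

  /** Mirrors the body passed to mapPartitions: split the iterator into batches and score each one. */
  def scorePartition(partition: Iterator[Array[Float]]): Iterator[Float] =
    partition.grouped(BatchSize).flatMap { group =>
      scoreBatch(group.toArray).iterator
    }
}
```

Because `grouped` is lazy, only one batch of rows is materialized at a time, and the last batch is simply shorter when the partition size is not a multiple of the batch size.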
One TF Session per JVM
- Reduce model loading overhead: load once per JVM; thread-safe.
object TensorflowModel {
  lazy val model: Session = {
    SavedModelBundle.load(...).session()
  }
}
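The load-once behavior comes from Scala itself: a `lazy val` in an `object` is initialized on first access and the initialization is synchronized. The sketch below shows the pattern with a stand-in class; `FakeSession`, `loadCount` and `scoreAll` are hypothetical and no TensorFlow API is involved.

```scala
object LazyModelSketch {
  // Counts how often the expensive load runs.
  var loadCount = 0

  /** Stand-in for a loaded model handle; not a TensorFlow type. */
  final class FakeSession {
    def score(x: Float): Float = x * 2
  }

  object Model {
    // Evaluated on first access only; later accesses reuse the same instance.
    lazy val session: FakeSession = {
      loadCount += 1
      new FakeSession
    }
  }

  /** Scores every input through the shared session. */
  def scoreAll(inputs: Seq[Float]): Seq[Float] =
    inputs.map(x => Model.session.score(x))
}
```

However many tasks call `scoreAll` inside one JVM, the load block runs once, so the model loading cost is paid per executor process rather than per task or per record.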
Summary
- Candidate Generation uses Batch LSH Search over terms from visual
embeddings.
- Batch LSH scales to billions of objects in the index and is embedding
agnostic.
- Candidate Selection uses a TF classifier over raw visual embeddings.
- Two-pass transitive closure to cluster results.
Thanks!
