Liangliang Cao
http://www.llcao.net
UMass (now at Google AI*)
* The research in this talk was done before
joining Google/Facebook
Visual Search and Question Answering
Lu Jiang
http://www.lujiang.info/
Google AI
Yannis Kalantidis
http://www.skamalas.com/
Facebook AI*
ICME 2019 Tutorial
July 8th 13:30--17:00
Outline
I. Overview of Visual Search and Understanding (Liangliang)
II. Visual Representations and Indexing (Yannis)
III. MemexQA (Lu)
Section II:
Visual Representations and Indexing
Visual Search: We want to see more of the “same”
Color Similarity
*slide credit: Clayton Mellina, Huy Nguyen
Compositional Similarity
*slide credit: Clayton Mellina, Huy Nguyen
Identity Similarity
*slide credit: Clayton Mellina, Huy Nguyen
Semantic Similarity
*slide credit: Clayton Mellina, Huy Nguyen
Visual Search Applications
Similarity search:
● Given an image as query, show me visually similar images
● Useful tool for commercial photo search & licensing
● Visually congruent native ads
Clustering and deduplication:
● Cluster images of a large collection for browsing
● Personal photo album summarization
● Deduplicate or diversify image search results
Batch search and recommendations:
● Use all photos from a group to recommend photos to the group admin
● Use all photos favorited by a user to get recommendations
● Visual recommendations can be combined with social metadata
Basic Ingredients for large-scale search
Representation Learning
Documents/images/videos are represented as vectors
Quantization and Indexing
● Storing high-dimensional features can be prohibitive
○ Hashing (bad performance, reconstruction not possible)
○ Quantization (better performance, allows approx. reconstruction)
● Search is feasible only if a very small percentage
of the collection is checked → Indexing
Visual Representations
Some Recent Visual Representations
A (highly biased) set of recent CNN architectures that aim at:
● Reducing network parameters
○ Multi-Fiber Networks [ECCV 2018]
● Reducing memory for attention mechanisms
○ A²-Nets: Double Attention Networks [NeurIPS 2018]
● Reasoning with global context
○ Global Reasoning Networks [CVPR 2019]
● Reducing spatial redundancy
○ Octave Convolutions [arXiv 2019]
The Multi-fiber Unit
Idea: slice the complex residual unit into N parallel and separated units (called
fibers), each of which is isolated from the others
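A rough parameter count illustrates why slicing helps. This is a minimal sketch that assumes plain convolutions without bias; the channel sizes and the helper names `conv_params` and `fiber_params` are illustrative and not taken from the talk.

```python
def conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Number of weights in one k x k convolution without bias."""
    return c_in * c_out * k * k


def fiber_params(c_in: int, c_out: int, n_fibers: int, k: int = 3) -> int:
    """Weights when the same layer is sliced into n_fibers isolated fibers."""
    assert c_in % n_fibers == 0 and c_out % n_fibers == 0
    per_fiber = conv_params(c_in // n_fibers, c_out // n_fibers, k)
    return n_fibers * per_fiber


dense = conv_params(256, 256)         # one wide layer
sliced = fiber_params(256, 256, 16)   # the same layer as 16 parallel fibers
print(dense, sliced, dense // sliced)
```

Under these assumptions the weight count shrinks by the number of fibers, at the price of no information flow between fibers.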
The Multi-fiber Unit
● One fiber cannot access or use
the features learned by
the others.
● Transistor component:
facilitates information flow
across these fibers
● The number of first-layer output
channels is set to be 4 times smaller
(which reduces the cost by a
factor of 2)
[Chen, Kalantidis, et al. Multi-Fiber Networks. ECCV 2018]
Results on ImageNet
Reducing computations for attention mechanisms
Incorporating global context
● e.g. the attention mechanisms [Vaswani et al. 2017, Wang et al. 2018]
● Enables interactions between locations over the full coordinate space
● Requires computing and storing a (quadratic) matrix of all input location pairs
Convolutional Neural Networks model local relations
● Operate on the (spatio-temporal) coordinate space grid
● Require stacking multiple layers to capture relations
between distant locations
[Vaswani et al. Attention is all you need. NIPS 2017]
[Wang et al. Non-local Neural Networks. CVPR 2018]
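To give a feel for the quadratic cost, the sketch below estimates the storage for one attention matrix over all location pairs. It assumes a single head, float32 values and no batching; the feature-map sizes are illustrative.

```python
def attention_matrix_bytes(height: int, width: int, frames: int = 1,
                           bytes_per_value: int = 4) -> int:
    """Bytes needed for one (locations x locations) attention matrix."""
    locations = height * width * frames
    return locations * locations * bytes_per_value


for side in (14, 28, 56):
    mib = attention_matrix_bytes(side, side) / 2**20
    print(f"{side}x{side} feature map: {mib:.1f} MiB")
```

Doubling the side of the feature map multiplies the matrix size by 16, and adding a temporal dimension makes it grow further.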
A²-Nets: Double Attention Networks
Decomposed attention mechanism
Aggregate and propagate features from the entire
(spatio-temporal) input space efficiently
● First attention: Gather features from the entire
space into a compact set through second-order
attention pooling
● Second attention: Adaptively select and
distribute features to each location.
[Chen, Kalantidis, et al. A²-Nets: Double Attention Networks. NeurIPS 2018]
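The gather and distribute steps can be sketched with plain matrix products. This is a simplified NumPy illustration, not the authors' implementation: the three weight matrices stand in for 1x1 convolutions, all names are hypothetical, and the final channel expansion and residual connection are left out.

```python
import numpy as np


def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def double_attention(x, w_feat, w_gather, w_distribute):
    """x: (L, C) features at L locations.
    w_feat: (C, M), w_gather: (C, N), w_distribute: (C, N)."""
    feats = x @ w_feat                               # (L, M)
    # First attention: N maps over locations, each summing to 1.
    gather = softmax(x @ w_gather, axis=0)           # (L, N)
    descriptors = feats.T @ gather                   # (M, N) compact global set
    # Second attention: each location picks a mixture of the N descriptors.
    distribute = softmax(x @ w_distribute, axis=1)   # (L, N)
    return distribute @ descriptors.T                # (L, M)


rng = np.random.default_rng(0)
L, C, M, N = 49, 32, 8, 4
x = rng.normal(size=(L, C))
out = double_attention(x, rng.normal(size=(C, M)),
                       rng.normal(size=(C, N)), rng.normal(size=(C, N)))
print(out.shape)
```

Note that no L x L matrix is ever formed: the largest intermediates are L x N and M x N.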
Accuracy on ImageNet
Beyond the simple attention mechanism
Global context modeling is highly important
● Attention-like mechanisms are becoming standard across ML
A limitation of current global context modeling approaches
● Follow the Gather → Distribute model
● Only focus on delivering information
● Rely on convolutional layers for reasoning
Can we capture and reason on global region
interactions efficiently?
Global Reasoning Networks
Gather → Reason → Distribute
Can we construct a (latent) space where relations over sets of features scattered
over the coordinate space translate to simple feature interactions?
[Figure: Coordinate Space and Interaction Space]
[Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019]
Global Reasoning in Three Steps
1) From Coordinate Space to Interaction Space → Weighted projections
2) Reasoning in Interaction Space → Graph convolutions
3) From Interaction Space (back) to Coordinate Space → Weighted broadcasting
From Coordinate Space to Interaction Space
● We want to learn a set of projections for (arbitrary) region features
● Given a set of input features, compute a projection function with learnable projection weights
[Figure: input features of size C × H × W, projection weights b_i, and N projected features]
● After projection → N feature vectors
● Relations between arbitrary regions → interactions between features
● What is an efficient way of reasoning over feature interactions?
Reasoning in Interaction Space
How to model interactions?
● Treat each feature as a node in a fully-connected graph
● Learn the edge weights that correspond to interactions of features
● Graph convolution formulation by [Kipf & Welling], with an N x N (learnt) adjacency matrix and a state update
[Kipf & Welling. Semi-supervised classification with graph convolutional networks. ICLR 2017]
From Interaction Space to Coordinate Space
● Reverse projection: Distribute the updated states back
● Reuse projection weights
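The three steps (project, reason, reverse project) can be put together in a few lines of NumPy. This is a hedged sketch rather than the released code: the projection weights are assumed to be a linear function of the input, the `(I - A)` form of the graph convolution is an assumption based on the cited paper and is not shown on the slides, and all parameter names are hypothetical. Channel reduction and normalization are omitted.

```python
import numpy as np


def glore_unit(x, w_proj, adjacency, w_state):
    """x: (L, C) features at L locations.
    w_proj: (C, N) produces the projection weights (assumed linear).
    adjacency: (N, N) learnt adjacency matrix.
    w_state: (C, C) state update weights."""
    b = (x @ w_proj).T                     # (N, L): one weight map per node
    v = b @ x                              # (N, C): weighted global pooling
    n = adjacency.shape[0]
    z = ((np.eye(n) - adjacency) @ v) @ w_state   # graph convolution over N nodes
    y = b.T @ z                            # (L, C): weighted broadcasting, reusing b
    return x + y                           # residual connection


rng = np.random.default_rng(0)
L, C, N = 49, 16, 4
x = rng.normal(size=(L, C))
out = glore_unit(x, rng.normal(size=(C, N)),
                 0.1 * rng.normal(size=(N, N)), 0.1 * rng.normal(size=(C, C)))
print(out.shape)
```

The reasoning step works on only N nodes, so its cost does not depend on the number of locations L.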
Global Reasoning (GloRe) Unit
● Projection: Weighted global pooling
● Reasoning: Graph Convolution
● Reverse projection: Weighted broadcasting
What do the learnt projection weights look like?
[Figure: Visualization of projection weights]
Global Reasoning Networks
The Global Reasoning (GloRe) unit
● Is highly efficient (smaller computational cost than self-attention)
● Is a plug-and-play residual unit that can be inserted in CNNs for different tasks
Image Classification & Action Recognition backbone CNNs
● Insert one or more units at different positions
Semantic segmentation
● Insert before bottleneck
Figure from [Noa et al ICCV 2015]
Ablations on ImageNet
How many blocks to add and where?
How many graph convolutions?
Experiments on ImageNet
Reducing Spatial Redundancy
Many approaches exploit multi-scale inputs
• Recent Examples
• Multi-scale DenseNets [Huang et al.]: Multi-resolution paths over a DenseNet
• Big-Little Nets [Chen et al.]: Multi-resolution paths, synchronizing at every block
• Network architecture is altered
Spatial redundancy in feature maps
• ConvNet kernels are highly local
• Some feature maps must contain low-frequency
information (smooth and slowly varying)
[Huang et al. Multi-Scale Dense Networks for Resource Efficient Image Classification. ICLR 2018]
[Chen et al. Big-Little Net: An Efficient Multi-Scale Feature Representation for Visual and Speech Recognition. ICLR 2019]
Octave Convolution
Advantages
• Multi-scale processing with effective
communication between the low- and
high-frequency maps
• Gains in terms of FLOPs
• Gains in terms of memory
• Larger receptive field for low-frequency
feature maps
[Figure: The Octave Convolution kernel]
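The communication between the two frequency groups can be sketched as four paths: high to high, low to high, high to low and low to low. The NumPy sketch below is an illustration under simplifying assumptions: it uses 1x1 (pointwise) convolutions instead of spatial kernels, average pooling for downsampling and nearest-neighbour upsampling, and all function and weight names are hypothetical.

```python
import numpy as np


def avg_pool2(x):
    """2x2 average pooling on a (C, H, W) array with even H and W."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))


def upsample2(x):
    """Nearest-neighbour upsampling by 2 on a (C, H, W) array."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def pointwise(x, w):
    """1x1 convolution: w is (C_out, C_in), x is (C_in, H, W)."""
    return np.einsum("oc,chw->ohw", w, x)


def octave_conv(x_high, x_low, w_hh, w_hl, w_lh, w_ll):
    """x_high: (C_h, H, W); x_low: (C_l, H/2, W/2), stored at half resolution."""
    y_high = pointwise(x_high, w_hh) + upsample2(pointwise(x_low, w_lh))
    y_low = pointwise(x_low, w_ll) + pointwise(avg_pool2(x_high), w_hl)
    return y_high, y_low


rng = np.random.default_rng(0)
x_high = rng.normal(size=(6, 8, 8))    # high-frequency channels, full resolution
x_low = rng.normal(size=(2, 4, 4))     # low-frequency channels, half resolution
y_high, y_low = octave_conv(
    x_high, x_low,
    w_hh=rng.normal(size=(6, 6)), w_hl=rng.normal(size=(2, 6)),
    w_lh=rng.normal(size=(6, 2)), w_ll=rng.normal(size=(2, 2)),
)
print(y_high.shape, y_low.shape)
```

Because the low-frequency group lives at half resolution, every operation on it touches a quarter of the spatial positions, which is where the FLOPs and memory gains come from.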
import OctConv as conv
Ablation study on ImageNet for varying models and ratios
ImageNet Classification
Is the speedup real?
• On CPU (i.e. FB production): Reaching (almost) theoretical gains!
• On GPU: An optimized CUDA-level implementation is required
Results for ResNet-50
Recent Visual Representations
Code online:
● Multi-Fiber Networks [ECCV 2018]
○ https://github.com/cypw/PyTorch-MFNet
● Global Reasoning Networks [CVPR 2019]
○ https://github.com/facebookresearch/GloRe (coming soon)
● Octave Convolutions [arXiv 2019]
○ https://github.com/facebookresearch/OctConv
Indexing
Quantization: k-means
Idea: Create a “vocabulary” in high-dimensional space through clustering
Represent each vector with the index of its closest “word”
Pros:
● Very high compression
Cons:
● Hard to train for large k
● Performance is good only for large k
[MacQueen 1967]
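As a minimal sketch of this idea, the code below trains a small vocabulary with a few Lloyd iterations and stores each vector as a single index. The data is random and the sizes are illustrative assumptions; `train_kmeans` and `assign` are hypothetical helper names.

```python
import numpy as np


def assign(data, centroids):
    """Index of the closest centroid ("word") for every vector."""
    d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def train_kmeans(data, k, iters=20, seed=0):
    """Plain Lloyd iterations starting from k random data points."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(iters):
        codes = assign(data, centroids)
        for j in range(k):
            members = data[codes == j]
            if len(members):               # keep the old centroid if a cell is empty
                centroids[j] = members.mean(axis=0)
    return centroids


rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 16))
centroids = train_kmeans(data, k=32)
codes = assign(data, centroids)            # one small integer per vector
reconstruction = centroids[codes]          # approximate reconstruction
print(codes[:5], float(((data - reconstruction) ** 2).mean()))
```

Each vector is stored as one index into the vocabulary, which gives the high compression, while the reconstruction quality depends on how large k can be made.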
Quantization: product quantization
Idea: Split the vector into multiple sub-vectors, create a vocabulary for each sub-vector
Represent each feature with the list of indices for its closest words
[Gray, ASSP 1984]
[Jegou, Douze & Schmid, PAMI 2011]
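Encoding and approximate reconstruction with product quantization can be sketched as follows. The codebooks here are random stand-ins for vocabularies that would normally be trained with k-means on each sub-vector; the function names and sizes are hypothetical.

```python
import numpy as np


def pq_encode(x, codebooks):
    """x: (n, d) vectors; codebooks: (m, k, d // m). Returns (n, m) codes."""
    m, k, sub_dim = codebooks.shape
    parts = x.reshape(len(x), m, sub_dim)
    codes = np.empty((len(x), m), dtype=np.int64)
    for j in range(m):
        d2 = ((parts[:, j, None, :] - codebooks[j][None, :, :]) ** 2).sum(axis=2)
        codes[:, j] = d2.argmin(axis=1)    # closest word in vocabulary j
    return codes


def pq_decode(codes, codebooks):
    """Approximate reconstruction: concatenate the selected words."""
    m = codebooks.shape[0]
    return np.concatenate([codebooks[j][codes[:, j]] for j in range(m)], axis=1)


rng = np.random.default_rng(0)
m, k, sub_dim = 4, 16, 8                   # 4 sub-vectors, 16 words each
codebooks = rng.normal(size=(m, k, sub_dim))
x = rng.normal(size=(5, m * sub_dim))
codes = pq_encode(x, codebooks)
approx = pq_decode(codes, codebooks)
print(codes.shape, approx.shape)
```

With m sub-vectors and k words per vocabulary, a vector is stored as m small indices; for example k = 256 and m = 8 takes 8 bytes per vector.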
Quantization: product quantization
Pros:
● Tunable compression & better reconstruction
● Easy & fast to train: a vocabulary of size k
gives you k^m effective “cells” for m subvectors
Cons:
● Independence assumption (“fix”: PCA)
● Unbalanced partitioning (fix: OPQ)
Optimized product quantization
[Ge et al, CVPR 2013, PAMI 2014]
Locally Optimized Product Quantization
Idea: Locally optimize residuals, balance variance across subspaces, use multi-index
● Balance variance across subspaces
● Local optimization using OPQ
● 20% improvement in precision
over state-of-the-art
● Overhead independent of database size
Stats for multi-LOPQ:
● 1 Billion 128-dimensional vectors
● ~22GB memory
● less than 55ms search time
[Kalantidis & Avrithis, CVPR 2014]
Indexing
[Figure: a feature vector with id 123984 is quantized and its id is appended to the posting list of its cell, e.g. cell 11 for a single quantizer or cell (5,4) for two sub-quantizers]
Indexing: multi-index
Idea: Use product quantization for indexing: Split into 2 sub-vectors
Pros:
● 2-step quantization: in the second stage one can quantize residuals
● Finer partitioning / smaller residuals
● Need to search many cells/posting lists:
multi-sequence: fast algorithm for traversing neighboring cells
[Babenko & Lempitsky, CVPR 2012]
Multi-LOPQ: Searching in a multi-index
● split query vector
● sort PQ centroids by ascending
distance for each subvector
● start at the cell (Q1[0], Q2[0]), the
first clusters in each posting list
● for the current cell (Q1[a], Q2[b]),
insert both its bottom and right
neighbors into a priority queue
with priority:
dist(x_L, Q1[a]) + dist(x_R, Q2[b])
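The traversal described above can be sketched with a binary heap. This is a simplified variant for illustration: it assumes the distances from the two query halves to their centroids are already computed, and it uses a visited set to avoid inserting a cell twice. The function name and inputs are hypothetical.

```python
import heapq


def multi_sequence(dist_left, dist_right, max_cells):
    """Visit cells (i, j) of a 2-part multi-index in order of increasing
    dist_left[i] + dist_right[j]; i and j index the original centroid lists."""
    # Sort the centroids of each half by ascending distance to the query half.
    order_l = sorted(range(len(dist_left)), key=dist_left.__getitem__)
    order_r = sorted(range(len(dist_right)), key=dist_right.__getitem__)
    # Start at the cell made of the closest centroid in each half.
    heap = [(dist_left[order_l[0]] + dist_right[order_r[0]], 0, 0)]
    seen = {(0, 0)}
    cells = []
    while heap and len(cells) < max_cells:
        _, a, b = heapq.heappop(heap)
        cells.append((order_l[a], order_r[b]))
        # Insert the bottom and right neighbors of the current cell.
        for na, nb in ((a + 1, b), (a, b + 1)):
            if na < len(order_l) and nb < len(order_r) and (na, nb) not in seen:
                seen.add((na, nb))
                priority = dist_left[order_l[na]] + dist_right[order_r[nb]]
                heapq.heappush(heap, (priority, na, nb))
    return cells


print(multi_sequence([9, 1, 5], [4, 2], max_cells=4))
```

Only the cells that are actually visited are ever scored, so the search does not need to enumerate every cell of the multi-index.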
Thank you!
Yannis Kalantidis
ykalant@image.ntua.gr
http://www.skamalas.com
Locally Optimized Product Quantization
https://github.com/yahoo/lopq
[Kalantidis & Avrithis, CVPR 2014]
[Kalantidis et al, ECCV-W 2016]

More Related Content

PDF
Human Behavior Understanding: From Human-Oriented Analysis to Action Recognit...
PDF
Architecture Design for Deep Neural Networks III
PDF
Architecture Design for Deep Neural Networks I
PDF
Object Detection Beyond Mask R-CNN and RetinaNet II
PDF
Object Detection Beyond Mask R-CNN and RetinaNet I
PDF
Intelligent Multimedia Recommendation
PDF
Big Data Intelligence: from Correlation Discovery to Causal Reasoning
PPTX
[Mmlab seminar 2016] deep learning for human pose estimation
Human Behavior Understanding: From Human-Oriented Analysis to Action Recognit...
Architecture Design for Deep Neural Networks III
Architecture Design for Deep Neural Networks I
Object Detection Beyond Mask R-CNN and RetinaNet II
Object Detection Beyond Mask R-CNN and RetinaNet I
Intelligent Multimedia Recommendation
Big Data Intelligence: from Correlation Discovery to Causal Reasoning
[Mmlab seminar 2016] deep learning for human pose estimation

What's hot (20)

PDF
Region-oriented Convolutional Networks for Object Retrieval
PPTX
Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...
PDF
2019 cvpr paper_overview
PDF
Deep Visual Saliency - Kevin McGuinness - UPC Barcelona 2017
PPTX
[DL輪読会]ClearGrasp
PDF
Content-Based Image Retrieval (D2L6 Insight@DCU Machine Learning Workshop 2017)
PDF
Visual Saliency Prediction with Deep Learning - Kevin McGuinness - UPC Barcel...
PDF
Content-based Image Retrieval - Eva Mohedano - UPC Barcelona 2018
PDF
DeepFix: a fully convolutional neural network for predicting human fixations...
PDF
Deep Learning for Computer Vision: Image Retrieval (UPC 2016)
PDF
Convolutional Features for Instance Search
PDF
CNN vs SIFT-based Visual Localization - Laura Leal-Taixé - UPC Barcelona 2018
PDF
Backbone can not be trained at once rolling back to pre trained network for p...
PDF
An Introduction to Neural Architecture Search
PDF
Object Detection Using R-CNN Deep Learning Framework
PDF
Object Detection and Recognition
PDF
Interpretability of Convolutional Neural Networks - Eva Mohedano - UPC Barcel...
PPTX
Transformer in Vision
PDF
Scene classification using Convolutional Neural Networks - Jayani Withanawasam
PPTX
Object Detection Methods using Deep Learning
Region-oriented Convolutional Networks for Object Retrieval
Recent Breakthroughs in AI + Learning Visual-Linguistic Representation in the...
2019 cvpr paper_overview
Deep Visual Saliency - Kevin McGuinness - UPC Barcelona 2017
[DL輪読会]ClearGrasp
Content-Based Image Retrieval (D2L6 Insight@DCU Machine Learning Workshop 2017)
Visual Saliency Prediction with Deep Learning - Kevin McGuinness - UPC Barcel...
Content-based Image Retrieval - Eva Mohedano - UPC Barcelona 2018
DeepFix: a fully convolutional neural network for predicting human fixations...
Deep Learning for Computer Vision: Image Retrieval (UPC 2016)
Convolutional Features for Instance Search
CNN vs SIFT-based Visual Localization - Laura Leal-Taixé - UPC Barcelona 2018
Backbone can not be trained at once rolling back to pre trained network for p...
An Introduction to Neural Architecture Search
Object Detection Using R-CNN Deep Learning Framework
Object Detection and Recognition
Interpretability of Convolutional Neural Networks - Eva Mohedano - UPC Barcel...
Transformer in Vision
Scene classification using Convolutional Neural Networks - Jayani Withanawasam
Object Detection Methods using Deep Learning
Ad

Similar to Visual Search and Question Answering II (20)

PPTX
On Integrating Information Visualization Techniques into Data Mining: A Revie...
PPTX
Attentive Relational Networks for Mapping Images to Scene Graphs
PPTX
[NS][Lab_Seminar_240909]Sparse Multi-Relational Graph Convolutional Network f...
PDF
Scalable Graph Convolutional Network Based Link Prediction on a Distributed G...
PDF
On Execution Platforms for Large-Scale Aggregate Computing
PPTX
Semantic Segmentation on Satellite Imagery
PDF
Deep 3D Analysis - Javier Ruiz-Hidalgo - UPC Barcelona 2018
PPTX
240628_Thanh_LabSeminar[Explore Internal and External Similarity for Single I...
PDF
The Importance of Time in Visual Attention Models
PDF
Parn pyramidal+affine+regression+networks+for+dense+semantic+correspondence
PDF
Deep 3D Visual Analysis - Javier Ruiz-Hidalgo - UPC Barcelona 2017
PPTX
240315_Thanh_LabSeminar[G-TAD: Sub-Graph Localization for Temporal Action Det...
PPTX
[20240628_LabSeminar_Huy]ScalableSTGNN.pptx
PPTX
240527_Thanh_LabSeminar[Transitivity Recovering Decompositions: Interpretable...
PDF
最近の研究情勢についていくために - Deep Learningを中心に -
PDF
Comparison of Various RCNN techniques for Classification of Object from Image
PDF
Declarative Macro-Programming of Collective Systems with Aggregate Computing:...
PPTX
[NS][Lab_Seminar_240611]Graph R-CNN.pptx
PDF
Memory Efficient Graph Convolutional Network based Distributed Link Prediction
PPTX
CrowdMap: Accurate Reconstruction of Indoor Floor Plan from Crowdsourced Sens...
On Integrating Information Visualization Techniques into Data Mining: A Revie...
Attentive Relational Networks for Mapping Images to Scene Graphs
[NS][Lab_Seminar_240909]Sparse Multi-Relational Graph Convolutional Network f...
Scalable Graph Convolutional Network Based Link Prediction on a Distributed G...
On Execution Platforms for Large-Scale Aggregate Computing
Semantic Segmentation on Satellite Imagery
Deep 3D Analysis - Javier Ruiz-Hidalgo - UPC Barcelona 2018
240628_Thanh_LabSeminar[Explore Internal and External Similarity for Single I...
The Importance of Time in Visual Attention Models
Parn pyramidal+affine+regression+networks+for+dense+semantic+correspondence
Deep 3D Visual Analysis - Javier Ruiz-Hidalgo - UPC Barcelona 2017
240315_Thanh_LabSeminar[G-TAD: Sub-Graph Localization for Temporal Action Det...
[20240628_LabSeminar_Huy]ScalableSTGNN.pptx
240527_Thanh_LabSeminar[Transitivity Recovering Decompositions: Interpretable...
最近の研究情勢についていくために - Deep Learningを中心に -
Comparison of Various RCNN techniques for Classification of Object from Image
Declarative Macro-Programming of Collective Systems with Aggregate Computing:...
[NS][Lab_Seminar_240611]Graph R-CNN.pptx
Memory Efficient Graph Convolutional Network based Distributed Link Prediction
CrowdMap: Accurate Reconstruction of Indoor Floor Plan from Crowdsourced Sens...
Ad

More from Wanjin Yu (9)

PDF
Architecture Design for Deep Neural Networks II
PDF
Causally regularized machine learning
PDF
Computer vision for transportation
PDF
Object Detection Beyond Mask R-CNN and RetinaNet III
PDF
Intelligent Image Enhancement and Restoration - From Prior Driven Model to Ad...
PDF
Intelligent Image Enhancement and Restoration - From Prior Driven Model to Ad...
PDF
Intelligent Image Enhancement and Restoration - From Prior Driven Model to Ad...
PDF
Intelligent Image Enhancement and Restoration - From Prior Driven Model to Ad...
PDF
Human Behavior Understanding: From Human-Oriented Analysis to Action Recognit...
Architecture Design for Deep Neural Networks II
Causally regularized machine learning
Computer vision for transportation
Object Detection Beyond Mask R-CNN and RetinaNet III
Intelligent Image Enhancement and Restoration - From Prior Driven Model to Ad...
Intelligent Image Enhancement and Restoration - From Prior Driven Model to Ad...
Intelligent Image Enhancement and Restoration - From Prior Driven Model to Ad...
Intelligent Image Enhancement and Restoration - From Prior Driven Model to Ad...
Human Behavior Understanding: From Human-Oriented Analysis to Action Recognit...

Recently uploaded (20)

PPTX
Digital Literacy And Online Safety on internet
PDF
Automated vs Manual WooCommerce to Shopify Migration_ Pros & Cons.pdf
PPTX
PptxGenJS_Demo_Chart_20250317130215833.pptx
PDF
Best Practices for Testing and Debugging Shopify Third-Party API Integrations...
PPTX
Introuction about ICD -10 and ICD-11 PPT.pptx
PPTX
introduction about ICD -10 & ICD-11 ppt.pptx
PPTX
presentation_pfe-universite-molay-seltan.pptx
PPTX
Job_Card_System_Styled_lorem_ipsum_.pptx
PPTX
CHE NAA, , b,mn,mblblblbljb jb jlb ,j , ,C PPT.pptx
PPTX
Slides PPTX World Game (s) Eco Economic Epochs.pptx
PDF
SASE Traffic Flow - ZTNA Connector-1.pdf
PDF
WebRTC in SignalWire - troubleshooting media negotiation
PDF
Vigrab.top – Online Tool for Downloading and Converting Social Media Videos a...
PDF
Testing WebRTC applications at scale.pdf
PDF
💰 𝐔𝐊𝐓𝐈 𝐊𝐄𝐌𝐄𝐍𝐀𝐍𝐆𝐀𝐍 𝐊𝐈𝐏𝐄𝐑𝟒𝐃 𝐇𝐀𝐑𝐈 𝐈𝐍𝐈 𝟐𝟎𝟐𝟓 💰
PDF
Paper PDF World Game (s) Great Redesign.pdf
PPTX
Module 1 - Cyber Law and Ethics 101.pptx
PPTX
Internet___Basics___Styled_ presentation
PDF
Slides PDF The World Game (s) Eco Economic Epochs.pdf
PPT
tcp ip networks nd ip layering assotred slides
Digital Literacy And Online Safety on internet
Automated vs Manual WooCommerce to Shopify Migration_ Pros & Cons.pdf
PptxGenJS_Demo_Chart_20250317130215833.pptx
Best Practices for Testing and Debugging Shopify Third-Party API Integrations...
Introuction about ICD -10 and ICD-11 PPT.pptx
introduction about ICD -10 & ICD-11 ppt.pptx
presentation_pfe-universite-molay-seltan.pptx
Job_Card_System_Styled_lorem_ipsum_.pptx
CHE NAA, , b,mn,mblblblbljb jb jlb ,j , ,C PPT.pptx
Slides PPTX World Game (s) Eco Economic Epochs.pptx
SASE Traffic Flow - ZTNA Connector-1.pdf
WebRTC in SignalWire - troubleshooting media negotiation
Vigrab.top – Online Tool for Downloading and Converting Social Media Videos a...
Testing WebRTC applications at scale.pdf
💰 𝐔𝐊𝐓𝐈 𝐊𝐄𝐌𝐄𝐍𝐀𝐍𝐆𝐀𝐍 𝐊𝐈𝐏𝐄𝐑𝟒𝐃 𝐇𝐀𝐑𝐈 𝐈𝐍𝐈 𝟐𝟎𝟐𝟓 💰
Paper PDF World Game (s) Great Redesign.pdf
Module 1 - Cyber Law and Ethics 101.pptx
Internet___Basics___Styled_ presentation
Slides PDF The World Game (s) Eco Economic Epochs.pdf
tcp ip networks nd ip layering assotred slides

Visual Search and Question Answering II

  • 1. Liangliang Cao http://www.llcao.net UMass (now at Google AI*) * The research in this talk are done before joining Google/Facebook Visual Search and Question Answering Lu Jiang http://www.lujiang.info/ Google AI Yannis Kalantidis http://www.skamalas.com/ Facebook AI* ICME 2019 Tutorial July 8th 13:30--17:00
  • 2. I. Overview of Visual Search and Understanding (Liangliang). II. Visual Representations and Indexing (Yannis) III. MemexQA (Lu) Outline 2
  • 4. Visual Search: We want to see more of the “same” 4
  • 5. Color Similarity *slide credit: Clayton Mellina, Huy Nguyen5
  • 6. Compositional Similarity *slide credit: Clayton Mellina, Huy Nguyen6
  • 7. Identity Similarity *slide credit: Clayton Mellina, Huy Nguyen7
  • 8. Semantic Similarity *slide credit: Clayton Mellina, Huy Nguyen8
  • 9. Visual Search Applications Similarity search: ● Given an image as query, show me visually similar images ● Useful tool for commercial photo search & licensing ● Visually congruent native ads Clustering and deduplication: ● Cluster images of a large collection for browsing ● Personal photo album summarization ● Deduplicate or diversify image search results Batch search and recommendations: ● Use all photos from a group to recommend photos to the group admin ● Use all photos favorited by a user to get recommendations ● Visual recommendations can be combined with social metadata 9
  • 10. Basic Ingredients for large-scale search Representation Learning Documents/images/videos are represented as vectors Quantization and Indexing ● Storing high dimensional features could be prohibitive ○ Hashing (bad performance, reconstruction not possible) ○ Quantization (better performance, allows approx. reconstruction) ● Searching in them can only be feasible if only a very small percentage of the collection is checked → Indexing 10
  • 12. Some Recent Visual Representations A (highly biased) set of recent CNN architectures that aim at: ● Reducing network parameters ○ Multi-Fiber Networks [ECCV 2018] ● Reducing memory for attention mechanisms ○ A2 -Nets: Double Attention Networks [NeurIPS 2018] ● Reasoning with global context ○ Global Reasoning Networks [CVPR 2019] ● Reducing spatial redundancy ○ Octave Convolutions [arXiv 2019] 12
  • 13. Visual Representations A (highly biased) set of recent CNN architectures that aim at: ● Reducing network parameters ○ Multi-Fiber Networks [ECCV 2018] ● Reducing memory for attention mechanisms ○ A2 -Nets: Double Attention Networks [NeurIPS 2018] ● Reasoning with global context ○ Global Reasoning Networks [CVPR 2019] ● Reducing spatial redundancy ○ Octave Convolutions [arXiv 2019] 13
  • 14. The Multi-fiber Unit Idea: slice the complex residual unit into N parallel and separated units (called fibers), each of which is isolated from the others 14
  • 15. The Multi-fiber Unit ● one fiber cannot access and utilize the feature learned from the others. ● Transistor component: facilitates information flow across these fibers ● number of the first-layer output channels to be 4 times smaller (cost would be reduced by a factor of 2) [Chen, Kalantidis, et al. Multi-Fiber Networks. ECCV 2018] 15
  • 16. Results on Imagenet [Chen, Kalantidis, et al. Multi-Fiber Networks. ECCV 2018] 16
  • 17. Results on Imagenet [Chen, Kalantidis, et al. Multi-Fiber Networks. ECCV 2018] 17
  • 18. Visual Representations A (highly biased) set of recent CNN architectures that aim at: ● Reducing network parameters ○ Multi-Fiber Networks [ECCV 2018] ● Reducing memory for attention mechanisms ○ A2 -Nets: Double Attention Networks [NeurIPS 2018] ● Reasoning with global context ○ Global Reasoning Networks [CVPR 2019] ● Reducing spatial redundancy ○ Octave Convolutions [arXiv 2019] 18
  • 19. Reducing computations for attention mechanisms Incorporating global context ● e.g. the attention mechanisms [Vaswani et al. 2017, Wang et al. 2018] ● Enables interactions between locations over the full coordinate space ● Requires computing and storing a (quadratic) matrix of all input location pairs Convolutional Neural Networks model local relations ● Operate on the (spatio-temporal) coordinate space grid ● Require stacking multiple layers to capture relations between distant locations [Vaswani et al. Attention is all you need. NIPS 2017] [Wang et al. Non-local Neural Networks. CVPR, 2018] 19
  • 20. A2 -Nets: Double Attention Networks Decomposed attention mechanism Aggregate and propagate features from the entire (spatio-temporal) input space efficiently ● First attention: Gather features from the entire space into a compact set through second-order attention pooling ● Second attention: Adaptively select and distribute features to each location. [Chen, Kalantidis, et al. A2 -Nets: Double Attention Networks. NeurIPS 2018] 20
  • 21. Accuracy on Imagenet A2 -Nets: Double Attention Networks [Chen, Kalantidis, et al. A2 -Nets: Double Attention Networks. NeurIPS 2018] 21
  • 22. Visual Representations A (highly biased) set of recent CNN architectures that aim at: ● Reducing network parameters ○ Multi-Fiber Networks [ECCV 2018] ● Reducing memory for attention mechanisms ○ A2 -Nets: Double Attention Networks [NeurIPS 2018] ● Reasoning with global context ○ Global Reasoning Networks [CVPR 2019] ● Reducing spatial redundancy ○ Octave Convolutions [arXiv 2019] 22
  • 23. Global context modeling is highly important ● Attention-like mechanisms becoming standard across ML A limitation of current global context modeling approaches ● Follow the Gather → Distribute model ● Only focus on delivering information ● Rely on convolutional layers for reasoning Can we capture and reason on global region interactions efficiently? 23 Beyond the simple attention mechanism
  • 24. Gather → Reason → Distribute Can we construct a (latent) space, where relations over sets of features scattered over the coordinate space, translate to simple feature interactions? 24 [Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019] Global Reasoning Networks Coordinate Space Interaction Space
  • 25. 1) From Coordinate Space to Interaction Space 2) Reasoning in Interaction Space 3) From Interaction Space (back) to Coordinate Space → Weighted projections → Graph convolutions → Weighted broadcasting 25 Global Reasoning in Three Steps Coordinate Space Interaction Space [Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019]
  • 26. From Coordinate Space to Interaction Space ● We want to learn a set of projections for (arbitrary) region features [Figure labels: Coordinate Space, Projection, Interaction Space] [Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019]
  • 27. From Coordinate Space to Interaction Space. Given a set of input features, compute a projection function with learnable projection weights [Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019]
  • 28. From Coordinate Space to Interaction Space. Given a set of input features, compute a projection function [Figure labels: input of size C × H × W, projection weights b_i] [Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019]
  • 29. From Coordinate Space to Interaction Space. Given a set of input features, compute a projection function [Figure labels: H, W, C, N] [Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019]
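The symbols on these three slides are missing from the transcript. One way to write the projection, assuming the linear form used in the cited CVPR 2019 paper, is given below. The names X, B, V and L = H·W are assumptions introduced here, chosen to match the figure labels C, H, W, N and b_i.

```latex
\[
X \in \mathbb{R}^{L \times C}, \qquad
B = [b_1, \dots, b_N] \in \mathbb{R}^{N \times L}, \qquad
V = f(X) = BX \in \mathbb{R}^{N \times C},
\]
\[
v_i = b_i X = \sum_{j=1}^{L} b_{ij}\, x_j , \qquad i = 1, \dots, N .
\]
```

Each of the N rows of V is a weighted pooling of all L location features, so arbitrary regions of the coordinate space map to single vectors in the interaction space.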
  • 30. From Coordinate Space to Interaction Space ● After projection → N feature vectors [Figure labels: Coordinate Space, Projection] [Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019]
  • 31. From Coordinate Space to Interaction Space ● After projection → N feature vectors ● Relations between arbitrary regions → interactions between features ● What is an efficient way of reasoning over feature interactions? [Figure labels: Coordinate Space, Projection, Interaction Space] [Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019]
  • 32. Reasoning in Interaction Space. How to model interactions? ● Treat each feature as a node in a fully-connected graph ● Learn the edge weights that correspond to interactions of features ● Graph convolution formulation by [Kipf & Welling], with an N × N (learnt) adjacency matrix and a state update [Figure label: Reverse Projection] [Kipf & Welling. Semi-supervised classification with graph convolutional networks. ICLR, 2017] [Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019]
  • 33. From Interaction Space to Coordinate Space ● Reverse projection: Distribute the updated states back ● Reuse projection weights [Figure labels: Interaction Space, Reverse Projection, Coordinate Space] [Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019]
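The three steps (project, reason, reverse-project) can be sketched end to end in a few lines of NumPy. This is a simplified illustration under stated assumptions: the projection weights `B` are passed in as a fixed matrix, the graph convolution is written as (I − A_g) V W_g following the form assumed from the cited paper, and all names (`glore_unit`, `A_g`, `W_g`) are hypothetical.

```python
import numpy as np

def glore_unit(X, B, A_g, W_g):
    """Sketch of one global reasoning step with a residual connection.

    X:   (L, C) features at L = H*W locations (coordinate space)
    B:   (N, L) projection weights, assumed given here
    A_g: (N, N) learnt adjacency matrix over the N nodes
    W_g: (C, C) state-update weights
    """
    n_nodes = A_g.shape[0]
    V = B @ X                             # 1) weighted global pooling -> (N, C)
    Z = (np.eye(n_nodes) - A_g) @ V @ W_g # 2) graph convolution in interaction space
    Y = B.T @ Z                           # 3) weighted broadcasting, reusing B -> (L, C)
    return X + Y                          # plug-and-play residual output
```

Since the graph has only N nodes, with N much smaller than L, the reasoning step works on an N × N matrix rather than an L × L one.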
  • 34. Global Reasoning (GloRe) Unit ● Projection: Weighted global pooling [Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019]
  • 35. Global Reasoning (GloRe) Unit ● Projection: Weighted global pooling ● Reasoning: Graph Convolution [Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019]
  • 36. Global Reasoning (GloRe) Unit ● Projection: Weighted global pooling ● Reasoning: Graph Convolution ● Reverse projection: Weighted broadcasting [Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019]
  • 38. Global Reasoning (GloRe) Unit ● Projection: Weighted global pooling ● Reasoning: Graph Convolution ● Reverse projection: Weighted broadcasting. What do the learnt projection weights look like? [Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019]
  • 39. Global Reasoning (GloRe) Unit. What do the learnt projections look like? [Figure: visualization of projection weights] [Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019]
  • 40. Global Reasoning Networks. The Global Reasoning (GloRe) unit ● Is highly efficient (smaller computational cost than self-attention) ● Is a plug-and-play residual unit that can be inserted in CNNs for different tasks. Image Classification & Action Recognition backbone CNNs ● Insert one or more units at different positions. Semantic segmentation ● Insert before bottleneck [Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019] Figure from [Noh et al., ICCV 2015]
  • 41. Ablations on ImageNet. How many blocks to add and where? How many graph convolutions? [Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019]
  • 42. Experiments on ImageNet [Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019]
  • 43. Visual Representations. A (highly biased) set of recent CNN architectures that aim at: ● Reducing network parameters ○ Multi-Fiber Networks [ECCV 2018] ● Reducing memory for attention mechanisms ○ A²-Nets: Double Attention Networks [NeurIPS 2018] ● Reasoning with global context ○ Global Reasoning Networks [CVPR 2019] ● Reducing spatial redundancy ○ Octave Convolutions [arXiv 2019]
  • 44. Reducing Spatial Redundancy. Many approaches exploit multi-scale inputs ● Recent examples ○ Multi-scale DenseNets [Huang et al.]: Multi-resolution paths over a DenseNet ○ Big-Little Nets [Chen et al.]: Multi-resolution paths, synchronizing at every block ● Network architecture is altered. Spatial redundancy in feature maps ● ConvNet kernels are highly local ● Some feature maps must contain low-frequency information (smooth and slowly varying) [Huang et al. Multi-Scale Dense Networks for Resource Efficient Image Classification, ICLR 2018] [Chen et al. Big-Little Net: An Efficient Multi-Scale Feature Representation for Visual and Speech Recognition, ICLR 2019]
  • 46. Octave Convolution. Advantages ● Multi-scale processing with effective communication between the low- and high-frequency maps ● Gains in terms of FLOPs ● Gains in terms of memory ● Larger receptive field for low-frequency feature maps [Figure: the Octave Convolution kernel]
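The communication between high- and low-frequency maps can be illustrated with a small NumPy sketch. It is deliberately simplified and rests on assumptions: it uses 1×1 (pointwise) kernels instead of k×k ones, average pooling for downsampling, and nearest-neighbor upsampling; the function and weight names are hypothetical and this is not the released OctConv code.

```python
import numpy as np

def avg_pool2(x):
    """2x2 average pooling on a (C, H, W) array with even H and W."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample2(x):
    """Nearest-neighbor upsampling by a factor of 2 on a (C, H, W) array."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def conv1x1(x, w):
    """Pointwise convolution: w is (C_out, C_in), x is (C_in, H, W)."""
    return np.einsum("oc,chw->ohw", w, x)

def octave_conv(x_h, x_l, w_hh, w_hl, w_lh, w_ll):
    """One octave convolution step over a high/low-frequency pair.

    x_h: (C_h, H, W) high-frequency maps at full resolution
    x_l: (C_l, H/2, W/2) low-frequency maps at half resolution
    """
    # High-frequency output: high->high path plus upsampled low->high path.
    y_h = conv1x1(x_h, w_hh) + upsample2(conv1x1(x_l, w_lh))
    # Low-frequency output: low->low path plus pooled high->low path.
    y_l = conv1x1(x_l, w_ll) + conv1x1(avg_pool2(x_h), w_hl)
    return y_h, y_l
```

The low-frequency maps hold a quarter of the spatial positions of the high-frequency ones, which is where the FLOPs and memory gains listed above would come from in this sketch.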
  • 47. import OctConv as conv. Ablation study on ImageNet for varying models and ratios
  • 49. Is the speedup real? ● On CPU (i.e. FB production): Reaching (almost) theoretical gains! ● On GPU: An optimized CUDA-level implementation is required [Figure: results for ResNet-50]
  • 50. Recent Visual Representations. Code online: ● Multi-Fiber Networks [ECCV 2018] ○ https://github.com/cypw/PyTorch-MFNet ● Global Reasoning Networks [CVPR 2019] ○ https://github.com/facebookresearch/GloRe (coming soon) ● Octave Convolutions [arXiv 2019] ○ https://github.com/facebookresearch/OctConv
  • 52. Basic Ingredients for large-scale search. Representation Learning: Documents/images/videos are represented as vectors. Quantization and Indexing ● Storing high dimensional features could be prohibitive ○ Hashing (bad performance, reconstruction not possible) ○ Quantization (better performance, allows approx. reconstruction) ● Searching in them can only be feasible if only a very small percentage of the collection is checked → Indexing
  • 53. Quantization: k-means. Idea: Create a “vocabulary” in high-dimensional space through clustering. Represent each vector with the index of its closest “word”. Pros: ● Very high compression. Cons: ● Hard to train for large k ● Performance is good only for large k [MacQueen 1967]
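A minimal NumPy sketch of the idea: cluster the data with Lloyd's k-means to build the vocabulary, then store only the index of the closest word for each vector. The function names and the plain random initialization are assumptions made for illustration.

```python
import numpy as np

def assign(X, centroids):
    """Return, for each row of X, the index of its closest centroid ("word")."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd iterations; initial centroids are k distinct data points."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        codes = assign(X, centroids)
        for j in range(k):
            members = X[codes == j]
            if len(members):          # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids

# Usage: each vector is stored as one integer code (log2(k) bits).
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
vocab = kmeans(X, k=2)
codes = assign(X, vocab)
approx = vocab[codes]                 # approximate reconstruction
```

Each Lloyd iteration costs on the order of n·k distance computations, which is one reason training gets hard for large k.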
  • 54. Quantization: product quantization. Idea: Split the vector into multiple sub-vectors and create a vocabulary for each sub-vector. Represent each feature with the list of indices of its closest words [Gray, ASSP 1984] [Jegou, Douze & Schmid, PAMI 2011]
  • 55. Quantization: product quantization. Pros: ● Tunable compression & better reconstruction ● Easy & fast to train: a vocabulary of size k gives you k^m effective “cells” for m sub-vectors. Cons: ● Independence assumption (“fix”: PCA) ● Unbalanced partitioning (fix: OPQ) [Gray, ASSP 1984] [Jegou, Douze & Schmid, PAMI 2011]
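Encoding and decoding with a product quantizer can be sketched as follows, assuming the m per-sub-vector codebooks have already been trained (for example with k-means on each slice). The function names are hypothetical, and the vector dimension is assumed to be divisible by m.

```python
import numpy as np

def pq_encode(X, codebooks):
    """Encode rows of X as m small integers, one per sub-vector.

    codebooks: list of m arrays, each of shape (k, d/m).
    Returns an (n, m) integer array of codes.
    """
    subvectors = np.split(X, len(codebooks), axis=1)
    codes = [
        ((s[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        for s, cb in zip(subvectors, codebooks)
    ]
    return np.stack(codes, axis=1)

def pq_decode(codes, codebooks):
    """Approximate reconstruction: concatenate the selected sub-centroids."""
    return np.hstack([cb[codes[:, i]] for i, cb in enumerate(codebooks)])

# Usage: m = 2 codebooks with k = 2 words each give k**m = 4 effective cells.
codebooks = [np.array([[0.0, 0.0], [1.0, 1.0]]),
             np.array([[0.0, 0.0], [5.0, 5.0]])]
X = np.array([[0.9, 1.1, 4.8, 5.2],
              [0.1, -0.1, 0.2, 0.1]])
codes = pq_encode(X, codebooks)
approx = pq_decode(codes, codebooks)
```

Only m·k centroids are stored and trained, yet the number of distinct reconstructions grows as k^m, which is what makes the compression tunable.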
  • 56. Optimized product quantization [Ge et al, CVPR 2013, PAMI 2014]
  • 57. Locally Optimized Product Quantization. Idea: Locally optimize residuals, balance variance across subspaces, use multi-index [Kalantidis & Avrithis, CVPR 2014]
  • 58. Locally Optimized Product Quantization [Kalantidis & Avrithis, CVPR 2014]
  • 59. Locally Optimized Product Quantization. Idea: Locally optimize residuals, balance variance across subspaces, use multi-index ● Balance variance across subspaces ● Local optimization using OPQ ● 20% improvement in precision over state-of-the-art ● Overhead independent of database size. Stats for multi-LOPQ: ● 1 billion 128-dimensional vectors ● ~22GB memory ● Less than 55ms search time [Kalantidis & Avrithis, CVPR 2014]
  • 60. Indexing [Figure: a feature vector with id 123984 is assigned to a posting list, keyed either by a pair of indices (5,4) or by a single cell index (11)]
  • 61. Indexing: multi-index. Idea: Use product quantization for indexing: split into 2 sub-vectors. Pros: ● 2-step quantization: in the second stage one can quantize residuals ● Finer partitioning / smaller residuals ● Many cells/posting lists need to be searched, but the multi-sequence algorithm traverses neighboring cells fast [Babenko & Lempitsky, CVPR 2012]
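Building such an index can be sketched in a few lines, assuming two already-trained coarse codebooks, one per half of the vector. The names `build_multi_index`, `C1` and `C2` are hypothetical, and the residual quantization of the second stage is left out.

```python
from collections import defaultdict

import numpy as np

def nearest(x, centroids):
    """Index of the centroid closest to x (squared Euclidean distance)."""
    return int(((centroids - x) ** 2).sum(axis=1).argmin())

def build_multi_index(X, C1, C2):
    """Map each cell (i, j) to the posting list of vector ids that fall in it.

    C1 and C2 are the codebooks for the first and second half of each vector.
    """
    half = X.shape[1] // 2
    index = defaultdict(list)
    for vec_id, x in enumerate(X):
        cell = (nearest(x[:half], C1), nearest(x[half:], C2))
        index[cell].append(vec_id)
    return index
```

With k centroids per half the index has k² cells, so the partition is much finer than a single codebook of size k at the same training cost.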
  • 62. Multi-LOPQ: Searching in a multi-index ● Split the query vector into x_L and x_R ● Sort PQ centroids by ascending distance for each sub-vector ● Start at the cell (Q1[0], Q2[0]), the first clusters in each posting list ● For the current cell (Q1[a], Q2[b]), insert both its bottom and right neighbors into a priority queue with priority: dist(x_L, Q1[a]) + dist(x_R, Q2[b])
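The traversal can be sketched with a binary heap. This is a simplified variant under assumptions: it takes the two lists of query-to-centroid distances as input, pushes each neighbor with its own summed distance as priority, and uses a `seen` set to avoid inserting a cell twice, which is a shortcut rather than the exact bookkeeping of the multi-sequence algorithm. The function name is hypothetical.

```python
import heapq

def multi_sequence(d1, d2, max_cells):
    """Visit multi-index cells in ascending order of summed distance.

    d1[i]: distance from the left query half x_L to centroid i of codebook 1
    d2[j]: distance from the right query half x_R to centroid j of codebook 2
    Returns up to max_cells cells as (i, j) centroid-index pairs.
    """
    # Q1 and Q2: centroid ids sorted by ascending distance to each query half.
    order1 = sorted(range(len(d1)), key=d1.__getitem__)
    order2 = sorted(range(len(d2)), key=d2.__getitem__)

    heap = [(d1[order1[0]] + d2[order2[0]], 0, 0)]  # start at (Q1[0], Q2[0])
    seen = {(0, 0)}
    cells = []
    while heap and len(cells) < max_cells:
        _, a, b = heapq.heappop(heap)
        cells.append((order1[a], order2[b]))
        # Insert the bottom and right neighbors of the current cell.
        for na, nb in ((a + 1, b), (a, b + 1)):
            if na < len(order1) and nb < len(order2) and (na, nb) not in seen:
                seen.add((na, nb))
                priority = d1[order1[na]] + d2[order2[nb]]
                heapq.heappush(heap, (priority, na, nb))
    return cells
```

In a search, the posting lists of the returned cells would be scanned in this order until enough candidates have been collected.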
  • 63. Locally Optimized Product Quantization [Kalantidis & Avrithis, CVPR 2014]
  • 64. Thank you! Yannis Kalantidis ykalant@image.ntua.gr http://www.skamalas.com
  • 65. Locally Optimized Product Quantization https://github.com/yahoo/lopq [Kalantidis & Avrithis, CVPR 2014] [Kalantidis et al, ECCV-W 2016]