Visual Translation Embedding Network for Visual
Relation Detection
Slides by Fran Roldán
ReadAI Reading Group, UPC
20th March, 2017
Hanwang Zhang, Zawlin Kyaw, Shih-Fu Chang,
Tat-Seng Chua, [arxiv] (27 Feb 2017) [demo]
Index
1. Visual Relation Detection
2. Visual Translation Embedding (VTransE)
3. VTransE Feature Extraction
4. VTransE Network
5. Evaluation
6. Conclusion
Visual Relation Detection
● Modeling and understanding the relationships between objects in a scene (e.g. “person ride bike”).
● Better generalization for other tasks such as image captioning or VQA.
● Visual relations are subject-predicate-object triplets, which we can model jointly or separately.
VTransE
Translation Embedding
● For N objects and R predicates, the number of models we have to learn is:
○ Joint model: N²R (one per subject-predicate-object triplet).
○ Separate model: N+R.
● However, the separate model must cope with large appearance changes of each predicate (e.g. the predicate “ride” looks different when the object is a bike than when the object is an elephant).
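The gap between the two counts is easy to check numerically. The sketch below plugs in the VRD sizes quoted in the Evaluation slides (100 object categories, 70 predicates); the function names are illustrative only.

```python
def joint_model_count(num_objects: int, num_predicates: int) -> int:
    """One model per (subject, predicate, object) triplet: N^2 * R."""
    return num_objects ** 2 * num_predicates


def separate_model_count(num_objects: int, num_predicates: int) -> int:
    """One model per object class plus one per predicate: N + R."""
    return num_objects + num_predicates


# VRD sizes as quoted in the Evaluation slides.
N, R = 100, 70
print(joint_model_count(N, R))     # 700000
print(separate_model_count(N, R))  # 170
```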
...is there any solution?
VTransE
Translation Embedding
● Based on Translation Embeddings (TransE) for representing large-scale knowledge bases.
● Map the features of objects and predicates into a low-dimensional space, where a relation triplet can be interpreted as a vector translation (e.g. person + ride ≈ bike).
We only need to learn the “ride” translation vector in the relation space.
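A toy example may make the translation idea concrete. The 2-D embeddings below are chosen by hand purely for illustration (they are not learned and do not come from the paper); the helper picks the predicate whose vector best explains the offset from subject to object.

```python
import numpy as np

# Hypothetical hand-picked 2-D embeddings, for illustration only.
objects = {
    "person": np.array([0.0, 0.0]),
    "bike": np.array([1.0, 0.0]),
    "elephant": np.array([1.0, 0.1]),
    "table": np.array([0.0, 1.0]),
}
predicates = {
    "ride": np.array([1.0, 0.0]),
    "next to": np.array([0.0, 1.0]),
}


def best_predicate(subject: str, obj: str) -> str:
    """Return the predicate p minimising ||subject + p - object||."""
    offset = objects[obj] - objects[subject]
    return min(predicates, key=lambda p: np.linalg.norm(offset - predicates[p]))


print(best_predicate("person", "bike"))      # ride
print(best_predicate("person", "elephant"))  # ride
print(best_predicate("person", "table"))     # next to
```

A single “ride” vector serves both the bike and the elephant, which is the point of the slide: the predicate is learned once rather than once per object pair.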
VTransE
Visual Translation Embedding
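Suppose $x_s, x_o \in \mathbb{R}^M$ are M-dim features of subject and object. We must learn a relation translation vector $t_p \in \mathbb{R}^r$ ($r \ll M$) and the projection matrices $W_s, W_o \in \mathbb{R}^{r \times M}$, such that $W_s x_s + t_p \approx W_o x_o$.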
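The following is a minimal sketch of the projected translation, assuming the paper's notation: subject and object features x_s, x_o of size M, projection matrices W_s, W_o of shape r x M, and a predicate translation vector t_p of size r. All values are random and the sizes are arbitrary; the object feature is constructed so that the relation holds exactly, which real features would only do approximately.

```python
import numpy as np

rng = np.random.default_rng(0)
M, r = 8, 3                      # feature and relation-space sizes (illustrative)

W_s = rng.normal(size=(r, M))    # subject projection matrix
W_o = rng.normal(size=(r, M))    # object projection matrix
t_p = rng.normal(size=r)         # translation vector of one predicate
x_s = rng.normal(size=M)         # subject feature

# Build an object feature that satisfies W_s x_s + t_p = W_o x_o exactly.
x_o = np.linalg.pinv(W_o) @ (W_s @ x_s + t_p)


def translation_residual(x_s, x_o, W_s, W_o, t_p):
    """Distance between the translated subject and the object in relation space."""
    return np.linalg.norm(W_s @ x_s + t_p - W_o @ x_o)


print(translation_residual(x_s, x_o, W_s, W_o, t_p))                  # close to 0
print(translation_residual(x_s, rng.normal(size=M), W_s, W_o, t_p))   # larger
```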
VTransE
Visual Translation Embedding
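Loss function to reward only deterministically accurate predicates:
$L_{rel} = \sum_{(s,p,o) \in \mathcal{R}} -\log \mathrm{softmax}\left(t_p^{\top} (W_o x_o - W_s x_s)\right)$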
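As a sketch of how such a loss could be computed for one subject-object pair, assume the softmax form used in the VTransE paper: the score of predicate p is t_p · (W_o x_o − W_s x_s), and the loss is the negative log-softmax of the true predicate's score. The function and argument names are illustrative.

```python
import numpy as np


def relation_loss(x_s, x_o, W_s, W_o, T, true_predicate):
    """Negative log-softmax of the true predicate's score.

    T stacks one translation vector per predicate (shape R x r).
    """
    offset = W_o @ x_o - W_s @ x_s            # shape (r,)
    scores = T @ offset                        # one score per predicate, shape (R,)
    scores = scores - scores.max()             # stabilise the exponentials
    log_probs = scores - np.log(np.exp(scores).sum())
    return -log_probs[true_predicate]


# Toy check with identity projections and two predicates.
I = np.eye(2)
T = np.array([[5.0, 0.0], [0.0, 5.0]])
x_s, x_o = np.array([0.0, 0.0]), np.array([1.0, 0.0])
print(relation_loss(x_s, x_o, I, I, T, 0))  # small: predicate 0 matches the offset
print(relation_loss(x_s, x_o, I, I, T, 1))  # large: predicate 1 does not
```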
VTransE Feature Extraction
Knowledge Transfer in Relation
● Region proposal network (RPN) and a classification layer.
● Knowledge transfer between objects and predicates, carried out in a single forward/backward pass.
● Novel feature extraction layer:
○ Classeme (i.e. class probabilities).
○ Location (i.e. bounding box coordinates and scales).
○ RoI visual features (using bilinear feature interpolation instead of RoI pooling).
VTransE Feature Extraction
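To extract $x_s$ and $x_o$, we combine three types of features:
● Classeme: (N+1)-dim vector of class probabilities (N classes and 1 background) obtained from object classification.
● Location: 4-dim vector $(t_x, t_y, t_w, t_h)$ such that:
$t_x = \frac{x - x'}{w'}, \quad t_y = \frac{y - y'}{h'}, \quad t_w = \log\frac{w}{w'}, \quad t_h = \log\frac{h}{h'}$
where $(x, y, w, h)$ and $(x', y', w', h')$ are the bounding box coordinates of subject and object respectively.
● Visual Features: D-dim vector transformed from a convolutional feature of shape $X \times Y \times C$.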
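The location feature can be sketched as below, assuming the box parameterization from the VTransE paper: a scale-invariant translation plus a log-space width/height ratio of one box relative to its counterpart. Boxes are taken as (x, y, w, h) tuples, and the function name is illustrative.

```python
import math


def location_feature(box, other):
    """Return (t_x, t_y, t_w, t_h) of `box` relative to `other`.

    Both boxes are (x, y, w, h) with positive width and height.
    """
    x, y, w, h = box
    xp, yp, wp, hp = other
    return (
        (x - xp) / wp,       # translation in x, normalised by the other box's width
        (y - yp) / hp,       # translation in y, normalised by the other box's height
        math.log(w / wp),    # log width ratio
        math.log(h / hp),    # log height ratio
    )


subject_box = (20, 40, 100, 50)
object_box = (10, 20, 50, 100)
print(location_feature(subject_box, object_box))
```

Identical boxes map to the zero vector, and scaling both boxes by the same factor leaves the feature unchanged.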
VTransE Feature Extraction
Bilinear Interpolation
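Smooth function of two inputs: feature map $F$ and an object bounding box.
$V_{i,j,c} = \sum_{i'} \sum_{j'} F_{i',j',c} \, k(i' - G_{i,j,1}) \, k(j' - G_{i,j,2})$, with $k(x) = \max(0, 1 - |x|)$
$G \in \mathbb{R}^{X \times Y \times 2}$: positions of the X x Y grid split in the box.
Since $G$ is a linear function of the box, gradients from $V$ can be back-propagated to the bounding box coordinates.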
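Below is a NumPy sketch of bilinear RoI sampling with the kernel k(d) = max(0, 1 − |d|). It makes several assumptions: the box is given as (x0, y0, x1, y1) in feature-map coordinates, the grid places its samples evenly from one box edge to the other (the paper's exact grid placement may differ), and the output is laid out as (Y, X, C).

```python
import numpy as np


def bilinear_roi_features(F, box, X, Y):
    """Sample an X x Y grid of features from F inside `box`.

    F: feature map of shape (H, W, C).
    box: (x0, y0, x1, y1) in feature-map coordinates.
    Returns an array of shape (Y, X, C).
    """
    H, W, _ = F.shape
    x0, y0, x1, y1 = box
    # Grid positions are linear in the box coordinates.
    gx = np.linspace(x0, x1, X)
    gy = np.linspace(y0, y1, Y)
    # Bilinear kernel weights between every grid position and every pixel index.
    kx = np.maximum(0.0, 1.0 - np.abs(np.arange(W)[None, :] - gx[:, None]))  # (X, W)
    ky = np.maximum(0.0, 1.0 - np.abs(np.arange(H)[None, :] - gy[:, None]))  # (Y, H)
    # V[j, i, c] = sum over h, w of ky[j, h] * kx[i, w] * F[h, w, c]
    return np.einsum("jh,iw,hwc->jic", ky, kx, F)


# Feature map whose single channel equals 10 * row + column.
F = (np.arange(4)[:, None] * 10 + np.arange(5)[None, :]).astype(float)[:, :, None]
print(bilinear_roi_features(F, (0.5, 0.5, 0.5, 0.5), 1, 1)[0, 0, 0])  # 5.5
```

Because each grid position depends linearly on the box corners and the kernel is piecewise linear, an autodiff framework could propagate gradients from the sampled features back to the box, which RoI pooling's hard quantization does not allow.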
VTransE
Optimization
● Multi-task loss function:
○ Object detection loss: $L_{obj}$
○ Relation detection loss: $L_{rel}$
● Loss trade-off: $L = L_{obj} + 0.4\,L_{rel}$
VTransE Network
VTransE is built upon an object detection module and incorporates the proposed feature extraction layer.
Evaluation
Q1: Is the idea of embedding relations effective in the visual domain?
Q2: What are the effects of the features in relation detection and knowledge
transfer?
Q3: How does the overall VTransE network perform compared to the other
state-of-the-art visual relation models?
Evaluation
● Datasets:
○ Visual Relationship Dataset (VRD): 5,000 images with 100 object categories and 70
predicates. In total, VRD contains 37,993 relation annotations with 6,672 unique relations
and 24.25 predicates per object category.
○ Visual Genome Version 1.2 (VG): 99,658 images with 200 object categories and 100
predicates, resulting in 1,174,692 relation annotations with 19,237 unique relations and
57 predicates per object category.
Evaluation (Q1)
Q1: Is the idea of embedding relations effective in the visual domain?
Isolate VTransE from object detection and perform the task of Predicate Prediction.
R@K computes the fraction of times a true relation is predicted among the top K most confident relation predictions in an image.
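R@K for a single image can be sketched as follows. This is a simplification: triplets are compared by label only, whereas a full relation detection evaluation would also require the predicted boxes to overlap the ground-truth boxes. The function name and data layout are illustrative.

```python
def recall_at_k(predictions, ground_truth, k):
    """Fraction of ground-truth triplets found among the top-k predictions.

    predictions: list of (confidence, triplet) pairs for one image.
    ground_truth: non-empty set of true triplets for that image.
    """
    top_k = sorted(predictions, key=lambda p: p[0], reverse=True)[:k]
    hits = {triplet for _, triplet in top_k if triplet in ground_truth}
    return len(hits) / len(ground_truth)


predictions = [
    (0.9, ("person", "ride", "bike")),
    (0.8, ("person", "wear", "hat")),
    (0.3, ("bike", "on", "street")),
]
ground_truth = {("person", "ride", "bike"), ("bike", "on", "street")}
print(recall_at_k(predictions, ground_truth, 1))  # 0.5
print(recall_at_k(predictions, ground_truth, 3))  # 1.0
```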
Evaluation (Q2)
Q2: What are the effects of the features in relation detection and knowledge transfer?
[Figure: panels for VRD and VG]
Evaluation (Q3)
Q3: How does the overall VTransE network perform compared to the other
state-of-the-art visual relation models?
Conclusions
● The visual relation task provides comprehensive scene understanding, connecting computer vision and natural language.
● VTransE is designed to perform object detection and relation prediction simultaneously.
● A novel feature extraction layer enables object-relation knowledge transfer.
