2020 IRRLAB Presentation
IRRLAB
Neural Motifs:
Scene Graph Parsing with Global Context

Sangmin Woo
2020.04.29

Rowan Zellers¹, Mark Yatskar¹,², Sam Thomson³, Yejin Choi¹,²
¹Paul G. Allen School of Computer Science & Engineering, University of Washington
²Allen Institute for Artificial Intelligence
³School of Computer Science, Carnegie Mellon University
Contents
• Scene Graph Generation
• Scene Graph Analysis
• Model: Neural Motifs
• Experimental Results
  • Quantitative Results
  • Qualitative Results
• References
Scene Graph Generation
Scene Graph Generation (SGG), a.k.a. Scene Graph Parsing
• The task of producing graph representations of real-world images that provide semantic summaries of objects and their relationships.
Scene Graph Analysis
• Object and relation types in the Visual Genome dataset
Scene Graph Analysis
• Types of edges between high-level categories in Visual Genome
Scene Graph Analysis
How much information is gained by knowing the identity of different parts in a scene graph?
• In general, the identity of the edge involved in a relationship is not highly informative of the other elements of the structure, while the identities of the head or tail provide significant information, both to each other and to the edge label.
Scene Graph Analysis
Larger Motifs
• Higher-order structure
Model
Stacked Motif Network (MOTIFNET)
• The graph distribution factorizes into three stages (object detection, object labeling, and relation labeling; a pipeline sketch follows below):
  P(G | I) = P(B | I) · P(O | B, I) · P(R | B, O, I)
• Region proposals B: B = {b_1, …, b_n}, b_i ∈ ℝ^4
• Object labels O
• Relation labels R
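To make the factorization concrete, here is a minimal pipeline sketch in Python. The callables (`detector`, `object_decoder`, `relation_decoder`) are hypothetical stand-ins for the three stages, not the authors' released code.

```python
def scene_graph(image, detector, object_decoder, relation_decoder):
    """Three-stage sketch of P(G|I) = P(B|I) P(O|B,I) P(R|B,O,I)."""
    # Stage 1, P(B|I): propose regions with RoI features and per-class
    # label distributions.
    boxes, feats, label_dists = detector(image)              # B, f_i, l_i
    # Stage 2, P(O|B,I): contextualize the proposals and decode one
    # object label per box.
    obj_labels = object_decoder(boxes, feats, label_dists)   # O
    # Stage 3, P(R|B,O,I): score a predicate for every ordered pair of
    # boxes, conditioned on the decoded object labels.
    rel_labels = relation_decoder(boxes, feats, obj_labels)  # R
    return boxes, obj_labels, rel_labels
```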
Model
Stacked Motif Network (MOTIFNET)
Model
Bounding Boxes
• Faster R-CNN is used as the object detector.
• For each image I, the detector predicts a set of region proposals B = {b_1, …, b_n}.
• For each proposal b_i ∈ B, it also outputs a feature vector f_i and a vector l_i ∈ ℝ^|C| of (non-contextualized) object label probabilities.
Model
Objects
• Context Encoding (sketched in code below)
  • Construct a contextualized representation for object prediction based on the set of proposal regions B.
  • Elements of B are first organized into a linear sequence, [(b_1, f_1, l_1), …, (b_n, f_n, l_n)].
  • The object context C is then computed using a bidirectional LSTM: C = biLSTM([f_i; W_1 l_i]), i = 1, …, n.
  • C = [c_1, …, c_n] contains the final LSTM layer's hidden states for each element in the linearization of B.
  • W_1 is a parameter matrix that maps the distribution of predicted classes, l_i, to ℝ^100.
  • The biLSTM allows all elements of B to contribute information about potential object identities.
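A minimal sketch of this encoder, assuming PyTorch and hypothetical dimensions (4096-d RoI features, 100-d label embedding, 512-d hidden state); it illustrates the biLSTM contextualization, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ObjectContext(nn.Module):
    def __init__(self, num_classes, feat_dim=4096, hidden=512):
        super().__init__()
        self.W1 = nn.Linear(num_classes, 100, bias=False)  # maps l_i to R^100
        self.bilstm = nn.LSTM(feat_dim + 100, hidden,
                              bidirectional=True, batch_first=True)

    def forward(self, feats, label_dists):
        # feats: (n, feat_dim) RoI features f_i, in the chosen RoI order
        # label_dists: (n, num_classes) detector label distributions l_i
        x = torch.cat([feats, self.W1(label_dists)], dim=-1)
        C, _ = self.bilstm(x.unsqueeze(0))   # (1, n, 2*hidden)
        return C.squeeze(0)                  # object context c_1..c_n
```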
Model
Objects
• Decoding (sketched in code below)
  • The context C is used to sequentially decode labels for each proposal bounding region, conditioning on previously decoded labels.
  • An LSTM decodes a category label for each contextualized representation in C.
  • Hidden states h_i are discarded; the object class commitments o_i are passed on to the relation model.
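A sketch of the sequential label decoder under the same assumed dimensions; feeding back an embedding of the previous commitment is one way to realize the "conditioning on previously decoded labels" above.

```python
import torch
import torch.nn as nn

class ObjectDecoder(nn.Module):
    def __init__(self, num_classes, ctx_dim=1024, embed=100, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(num_classes, embed)
        self.cell = nn.LSTMCell(ctx_dim + embed, hidden)
        self.Wo = nn.Linear(hidden, num_classes)

    def forward(self, C):
        # C: (n, ctx_dim) object context c_1..c_n
        h = torch.zeros(1, self.cell.hidden_size)
        c = torch.zeros_like(h)
        prev = torch.zeros(1, dtype=torch.long)   # "background" start token
        labels = []
        for i in range(C.size(0)):
            inp = torch.cat([C[i:i+1], self.embed(prev)], dim=-1)
            h, c = self.cell(inp, (h, c))
            prev = self.Wo(h).argmax(dim=-1)      # commit to o_i; h_i discarded
            labels.append(prev)
        return torch.cat(labels)                  # object labels o_1..o_n
```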
Model
Relations
• Context Encoding (sketched in code below)
  • Construct a contextualized representation of the bounding regions B and objects O using additional bidirectional LSTM layers: D = biLSTM([c_i; W_2 o_i]), i = 1, …, n.
  • The edge context D = [d_1, …, d_n] contains the states for each bounding region at the final layer, and W_2 is a parameter matrix mapping o_i into ℝ^100.
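A matching sketch for the edge-context encoder. Realizing W_2 as an embedding of the hard label o_i is an assumption, consistent with the slide's description of W_2 mapping o_i into ℝ^100.

```python
import torch
import torch.nn as nn

class EdgeContext(nn.Module):
    def __init__(self, num_classes, ctx_dim=1024, hidden=512):
        super().__init__()
        self.W2 = nn.Embedding(num_classes, 100)  # maps o_i to R^100
        self.bilstm = nn.LSTM(ctx_dim + 100, hidden,
                              bidirectional=True, batch_first=True)

    def forward(self, C, obj_labels):
        # C: (n, ctx_dim) object context; obj_labels: (n,) decoded labels o_i
        x = torch.cat([C, self.W2(obj_labels)], dim=-1)
        D, _ = self.bilstm(x.unsqueeze(0))
        return D.squeeze(0)   # edge context d_1..d_n
```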
Model
Relations
• Decoding (sketched in code below)
  • There is a quadratic number of possible relations in a scene graph.
  • For each possible edge, say between b_i and b_j, the probability that the edge has label x_{i→j} is computed.
  • The distribution uses the global context D and a feature vector f_{i,j} for the union of the two boxes:
    P(x_{i→j}) = softmax(W_r [(W_h d_i) ∘ (W_t d_j) ∘ f_{i,j}] + w_{o_i,o_j})
  • ∘ denotes an element-wise product; all three factors lie in ℝ^4096.
  • W_h and W_t project the head and tail context into ℝ^4096.
  • w_{o_i,o_j} is a bias vector specific to the head and tail labels.
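A sketch of the edge scorer with a naive O(n²) loop for clarity; the dimensions and the lookup-table realization of the w_{o_i,o_j} bias are assumptions.

```python
import torch
import torch.nn as nn

class RelationDecoder(nn.Module):
    def __init__(self, num_classes, num_predicates, edge_dim=1024):
        super().__init__()
        self.Wh = nn.Linear(edge_dim, 4096)   # head projection
        self.Wt = nn.Linear(edge_dim, 4096)   # tail projection
        self.Wr = nn.Linear(4096, num_predicates)
        # per-(head, tail)-label bias, the w_{o_i,o_j} term
        self.bias = nn.Parameter(
            torch.zeros(num_classes, num_classes, num_predicates))

    def forward(self, D, union_feats, obj_labels):
        # D: (n, edge_dim); union_feats[i][j]: (4096,) union-box feature f_ij
        logits = {}
        for i in range(D.size(0)):
            for j in range(D.size(0)):
                if i == j:
                    continue
                # element-wise product of head, tail, and union features
                g = self.Wh(D[i]) * self.Wt(D[j]) * union_feats[i][j]
                logits[(i, j)] = (self.Wr(g)
                                  + self.bias[obj_labels[i], obj_labels[j]])
        return logits   # softmax over predicates gives P(x_{i->j})
```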
Model
Frequency Baselines
• To support the finding that object labels are highly predictive of edge labels, two frequency baselines built from training-set statistics are additionally introduced (a lookup sketch follows below):
  1) FREQ uses a pre-trained detector to predict object labels for each RoI.
    • To obtain predicate probabilities between boxes i and j, the empirical distribution over relationships between objects o_i and o_j is looked up.
    • Intuitively, while this baseline does not look at the image to compute P(x_{i→j} | o_i, o_j), it demonstrates the value of conditioning on the object label predictions o.
  2) FREQ-OVERLAP additionally requires that the two boxes intersect for the pair to count as a valid relation.
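A sketch of the FREQ lookup table built from training statistics; the triple format and the smoothing term are assumptions.

```python
import numpy as np

def build_freq_table(train_triples, num_classes, num_predicates, eps=1e-8):
    """Empirical P(predicate | head_label, tail_label) from training triples.

    train_triples: iterable of (head_label, predicate, tail_label) ints.
    """
    counts = np.zeros((num_classes, num_classes, num_predicates))
    for h, p, t in train_triples:
        counts[h, t, p] += 1
    return counts / (counts.sum(axis=-1, keepdims=True) + eps)

# FREQ: predict the most frequent predicate for the detected label pair,
# e.g. pred = freq_table[o_i, o_j].argmax() -- the image is never consulted.
# FREQ-OVERLAP additionally zeroes out pairs whose boxes do not intersect.
```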
Model
Experimental Setup
• Alternating Highway LSTMs (sketched in code below)
  • To mitigate vanishing gradients as information flows upward, highway connections are added to all LSTMs.
  • To further reduce the number of parameters, the LSTM direction alternates from layer to layer.
  • Each alternating highway LSTM step wraps a conventional LSTM update, where x_i is the input, h_i is the hidden state, and s is the direction: s = 1 if the current layer is even, and −1 otherwise.
  • For MOTIFNET, 2 alternating highway LSTM layers are used for the object context and 4 for the edge context.
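The update equation on the original slide was an image; the sketch below reconstructs it from the surrounding description as a gated mix of an LSTM step and a linear projection of the input, so treat the exact gating form as an assumption.

```python
import torch
import torch.nn as nn

class HighwayLSTMLayer(nn.Module):
    """One alternating highway LSTM layer (sketch).

    r_i = sigmoid(W_g [x_i; h_{i-s}] + b_g)
    h_i = r_i * LSTM(x_i, h_{i-s}) + (1 - r_i) * W x_i
    with s = +1 on even layers and -1 on odd layers (here: `reverse`).
    """
    def __init__(self, in_dim, hidden, reverse):
        super().__init__()
        self.reverse = reverse                 # alternate direction per layer
        self.cell = nn.LSTMCell(in_dim, hidden)
        self.gate = nn.Linear(in_dim + hidden, hidden)
        self.proj = nn.Linear(in_dim, hidden)

    def forward(self, xs):                     # xs: (n, in_dim)
        n = xs.size(0)
        order = range(n - 1, -1, -1) if self.reverse else range(n)
        h = torch.zeros(1, self.cell.hidden_size)
        c = torch.zeros_like(h)
        out = [None] * n
        for i in order:
            x = xs[i:i+1]
            r = torch.sigmoid(self.gate(torch.cat([x, h], dim=-1)))
            h_lstm, c = self.cell(x, (h, c))
            h = r * h_lstm + (1 - r) * self.proj(x)   # highway mix
            out[i] = h
        return torch.cat(out)                  # (n, hidden)
```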
Model
Stacked Motif Network (MOTIFNET)
Model
Experimental Setup
• RoI ordering for LSTMs (sketched in code below)
  • LEFTRIGHT (default): sort the regions left-to-right by the central x-coordinate. This is expected to encourage the model to predict edges between nearby objects, which is beneficial because objects appearing in relationships tend to be close together.
  • CONFIDENCE: order bounding regions by the confidence of the maximum non-background prediction from the detector. This lets the detector commit to "easy" regions first, building context for more difficult regions.
  • SIZE: sort bounding boxes in descending order of size, possibly predicting global scene information first.
  • RANDOM: order the regions randomly.
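The four orderings as sorting keys, assuming (x1, y1, x2, y2) boxes and class 0 as background:

```python
import torch

def order_rois(boxes, label_dists, scheme="leftright"):
    """Return an ordering of RoI indices for the LSTM (sketch).

    boxes: (n, 4) float tensor as (x1, y1, x2, y2)
    label_dists: (n, num_classes), class 0 assumed to be background
    """
    if scheme == "leftright":     # central x-coordinate, left to right
        key = (boxes[:, 0] + boxes[:, 2]) / 2
        return torch.argsort(key)
    if scheme == "confidence":    # most confident non-background first
        key = label_dists[:, 1:].max(dim=-1).values
        return torch.argsort(key, descending=True)
    if scheme == "size":          # largest boxes first
        key = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
        return torch.argsort(key, descending=True)
    return torch.randperm(boxes.size(0))   # random
```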
Model
Experimental Setup
• Predicate Visual Features (sketched in code below)
  • To extract visual features for a predicate between boxes b_i and b_j, the detector's features corresponding to the union box of b_i and b_j are resized to 7×7×256.
  • Geometric relations are modeled with a 14×14×2 binary input, one channel per box.
  • Two convolutional layers are applied to this input, and the resulting 7×7×256 representation is added to the detector features.
  • Finally, fine-tuned VGG fully-connected layers are applied to obtain a 4096-dimensional representation.
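A sketch of this feature extractor; the conv hyperparameters and the single fc layer standing in for VGG's fine-tuned fc stack are assumptions chosen to match the stated shapes.

```python
import torch
import torch.nn as nn

class UnionFeatures(nn.Module):
    """Predicate visual features from a union box (sketch, assumed shapes).

    Combines 7x7x256 detector features for the union box with a 14x14x2
    binary mask (one channel per box) passed through two conv layers.
    """
    def __init__(self):
        super().__init__()
        self.mask_convs = nn.Sequential(
            nn.Conv2d(2, 128, kernel_size=7, stride=2, padding=3),  # 14 -> 7
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, kernel_size=3, padding=1),          # 7 -> 7
        )
        # stand-in for VGG's fine-tuned fully-connected layers
        self.fc = nn.Sequential(
            nn.Flatten(), nn.Linear(256 * 7 * 7, 4096), nn.ReLU(inplace=True))

    def forward(self, union_feats, box_masks):
        # union_feats: (k, 256, 7, 7); box_masks: (k, 2, 14, 14) binary
        x = union_feats + self.mask_convs(box_masks.float())
        return self.fc(x)   # (k, 4096) predicate feature f_ij
```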
Results
Quantitative Results & Ablation Studies
Results
Qualitative Results
Thank You
2020 IRRLAB Presentation
IRRLAB
shmwoo9395@{gist.ac.kr, gmail.com}
Editor's Notes
• #3: The talk proceeds in the order: related work, scene graph analysis, model, experimental results.
• #4: A scene graph structures a scene by representing each object as a node and the relationship between a pair of objects as an edge. The SGG task is to generate such a scene graph from a single input image.
• #5: Among entity types, part types account for more than 25%. Among relations, geometric and possessive types together account for more than 90%.
• #6: Shows how edges connect entity categories in the Visual Genome dataset. Geometric, possessive, and semantic edges account for 50.9%, 40.9%, and 8.7%, respectively. Semantic edges mostly occur between people and other objects.
• #7: Plots the likelihood of predicting the head, tail, or edge label given the other elements of the graph; the x-axis is k in top-k and the y-axis is accuracy. In general, the edge of a relationship is less informative than the head or tail. Edges tend to be determined by the object pair: given only the object pair, the edge can be predicted with roughly 70% accuracy at k=1.
• #8: (Same note as #7.)
• #9: Formalizes the overall process: given an image I, produce a scene graph G. This splits into three stages: object detection, object label prediction, and relation label prediction. Region proposals are denoted B (for bounding box); n is the number of objects, since an image can contain several. A bounding box is given by the coordinates of two points, hence 4 dimensions. Object labels are denoted O, with one label per bounding box. Relation labels are denoted R; counting the possible connections among n bounding boxes gives n(n−1) of them.
• #10: Scene graph parsing is tackled in three stages. First, bounding boxes: object proposals are found with Faster R-CNN on a VGG16 backbone. Next, for the detected object regions, context is encoded with a bi-LSTM and decoded with an LSTM to predict object labels. The object context and the predicted labels then serve together as input for encoding the edge context, from which relationship labels are predicted.
• #11: The first step is finding bounding boxes, using Faster R-CNN. For each image I, the detector first predicts region proposals B. Each proposal b_i carries a feature vector f_i and an object label probability distribution l_i.
• #12: To predict object labels, the context of the detected bounding boxes B is encoded first. B is reorganized into a linear sequence, where b is a bounding box, f a visual feature, and l an object label probability distribution. The object context C is computed with a bi-LSTM; C holds the final LSTM layer's hidden state for each element. W is a weight matrix mapping to 100 dimensions.
• #13: This stage decodes the previously encoded context, again with an LSTM. Feeding the context c and the previous step's object class into the LSTM yields a hidden state h; multiplying h by a weight matrix W and taking the argmax gives the most probable object class, which is passed on as input to the relation model.
• #14: Encodes the context for relations: the bounding boxes B and objects O are encoded with a bi-LSTM. The edge context D holds the final-layer state for each bounding region. W_2 is a weight matrix mapping to 100 dimensions.
• #15: A scene graph admits a quadratic number of candidate relations. For each possible edge, the probability that the edge is present is computed. D is the global context from the relation encoding, and f_ij is the feature vector for the union box. W_h and W_t are weight matrices projecting the head and tail into 4096 dimensions, and w_{oi,oj} is a bias vector.
• #16: Two frequency baselines are introduced to show that edge labels can be predicted reasonably well given only the object labels. They are built from training-set statistics: for example, if "riding" is the most frequent relation between "man" and "bicycle" in the training set, "riding" is predicted on the test set as well. The first baseline, FREQ, uses a pretrained detector for each RoI; to compute the probability of a relation between boxes i and j, it looks up the distribution of relationships observed between objects o_i and o_j. That is, the image is not used at all, and the prediction conditions only on the object labels. FREQ-OVERLAP is identical to FREQ, but counts a relation as valid only when the two bounding boxes intersect.
• #17: (Same note as #16.)
• #18: (Same note as #10.)
• #19: Experiments with how to order the RoIs fed into the LSTM. LEFTRIGHT sorts RoIs left to right by x-coordinate, modeling edges between nearby objects, since closer objects are more likely to be related. CONFIDENCE sorts by the detector's prediction confidence, modeling easy regions before hard ones. SIZE sorts by bounding box size, modeling global scene information first. RANDOM orders the regions randomly.
• #20: Visual features for objects come directly from the upstream Faster R-CNN, but a predicate's visual feature involves two objects, so it is taken from the union bounding box of the two objects, resized here to 7x7x256. The geometric relation between the two objects within the union box is modeled with a 14x14x2 binary input; applying two conv layers turns it into a 7x7x256 representation, which is added to the detector features. Finally, VGG's fine-tuned fc layers yield a 4096-dimensional representation.
• #21: FREQ-OVERLAP alone improves over the previous state of the art (message passing) by 1.4 mean recall. That MOTIFNET-NOCONTEXT outperforms FREQ-OVERLAP shows that visual features carry cues for edge prediction; comparing MOTIFNET-NOCONTEXT with the full model (MOTIFNET-LEFTRIGHT) shows that context matters.
• #22: The most common failure cases: first, predicate ambiguity such as "wearing" vs. "wears"; second, cases where the detector misdetects an object.