2020 IRRLAB Presentation
IRRLAB
Neural Motifs:
Scene Graph Parsing with Global Context

Sangmin Woo
2020.04.29

Rowan Zellers¹, Mark Yatskar¹,², Sam Thomson³, Yejin Choi¹,²
¹Paul G. Allen School of Computer Science & Engineering, University of Washington
²Allen Institute for Artificial Intelligence
³School of Computer Science, Carnegie Mellon University
Contents
• Scene Graph Generation
• Scene Graph Analysis
• Model: Neural Motifs
• Experimental Results
  • Quantitative Results
  • Qualitative Results
• References
Scene Graph Generation
Scene Graph Generation (SGG), a.k.a. Scene Graph Parsing
• The task of producing graph representations of real-world images that provide semantic summaries of objects and their relationships.
Scene Graph Analysis
• Object and relation types in the Visual Genome dataset
Scene Graph Analysis
• Types of edges between high-level categories in Visual Genome
Scene Graph Analysis
How much information is gained by knowing the identity of different parts in a scene graph?
• In general, the identity of the edge involved in a relationship is not highly informative of the other elements of the structure, while the identities of the head or tail provide significant information, both to each other and to the edge label.
Scene Graph Analysis
Larger Motifs
• Higher-order structure
Model
Stacked Motif Network (MOTIFNET)
• The graph distribution factorizes into three stages (object detection, object labeling, and relation labeling; a pipeline sketch follows below):
  P(G | I) = P(B | I) · P(O | B, I) · P(R | B, O, I)
• Region proposals B: B = {b_1, …, b_n}, b_i ∈ ℝ^4
• Object labels O
• Relation labels R
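To make the factorization concrete, here is a minimal pipeline sketch in Python. The callables (`detector`, `object_decoder`, `relation_decoder`) are hypothetical stand-ins for the three stages, not the authors' released code.

```python
def scene_graph(image, detector, object_decoder, relation_decoder):
    """Three-stage sketch of P(G|I) = P(B|I) P(O|B,I) P(R|B,O,I)."""
    # Stage 1, P(B|I): propose regions with RoI features and per-class
    # label distributions.
    boxes, feats, label_dists = detector(image)              # B, f_i, l_i
    # Stage 2, P(O|B,I): contextualize the proposals and decode one
    # object label per box.
    obj_labels = object_decoder(boxes, feats, label_dists)   # O
    # Stage 3, P(R|B,O,I): score a predicate for every ordered pair of
    # boxes, conditioned on the decoded object labels.
    rel_labels = relation_decoder(boxes, feats, obj_labels)  # R
    return boxes, obj_labels, rel_labels
```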
Model
Stacked Motif Network (MOTIFNET)
Model
Bounding Boxes
• Faster R-CNN is used as the object detector.
• For each image I, the detector predicts a set of region proposals B = {b_1, …, b_n}.
• For each proposal b_i ∈ B, it also outputs a feature vector f_i and a vector l_i ∈ ℝ^|C| of (non-contextualized) object label probabilities.
Model
Objects
• Context Encoding (sketched in code below)
  • Construct a contextualized representation for object prediction based on the set of proposal regions B.
  • Elements of B are first organized into a linear sequence, [(b_1, f_1, l_1), …, (b_n, f_n, l_n)].
  • The object context C is then computed using a bidirectional LSTM: C = biLSTM([f_i; W_1 l_i]), i = 1, …, n.
  • C = [c_1, …, c_n] contains the final LSTM layer's hidden states for each element in the linearization of B.
  • W_1 is a parameter matrix that maps the distribution of predicted classes, l_i, to ℝ^100.
  • The biLSTM allows all elements of B to contribute information about potential object identities.
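A minimal sketch of this encoder, assuming PyTorch and hypothetical dimensions (4096-d RoI features, 100-d label embedding, 512-d hidden state); it illustrates the biLSTM contextualization, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ObjectContext(nn.Module):
    def __init__(self, num_classes, feat_dim=4096, hidden=512):
        super().__init__()
        self.W1 = nn.Linear(num_classes, 100, bias=False)  # maps l_i to R^100
        self.bilstm = nn.LSTM(feat_dim + 100, hidden,
                              bidirectional=True, batch_first=True)

    def forward(self, feats, label_dists):
        # feats: (n, feat_dim) RoI features f_i, in the chosen RoI order
        # label_dists: (n, num_classes) detector label distributions l_i
        x = torch.cat([feats, self.W1(label_dists)], dim=-1)
        C, _ = self.bilstm(x.unsqueeze(0))   # (1, n, 2*hidden)
        return C.squeeze(0)                  # object context c_1..c_n
```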
Model
Objects
• Decoding (sketched in code below)
  • The context C is used to sequentially decode labels for each proposal bounding region, conditioning on previously decoded labels.
  • An LSTM decodes a category label for each contextualized representation in C.
  • Hidden states h_i are discarded; the object class commitments o_i are passed on to the relation model.
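A sketch of the sequential label decoder under the same assumed dimensions; feeding back an embedding of the previous commitment is one way to realize the "conditioning on previously decoded labels" above.

```python
import torch
import torch.nn as nn

class ObjectDecoder(nn.Module):
    def __init__(self, num_classes, ctx_dim=1024, embed=100, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(num_classes, embed)
        self.cell = nn.LSTMCell(ctx_dim + embed, hidden)
        self.Wo = nn.Linear(hidden, num_classes)

    def forward(self, C):
        # C: (n, ctx_dim) object context c_1..c_n
        h = torch.zeros(1, self.cell.hidden_size)
        c = torch.zeros_like(h)
        prev = torch.zeros(1, dtype=torch.long)   # "background" start token
        labels = []
        for i in range(C.size(0)):
            inp = torch.cat([C[i:i+1], self.embed(prev)], dim=-1)
            h, c = self.cell(inp, (h, c))
            prev = self.Wo(h).argmax(dim=-1)      # commit to o_i; h_i discarded
            labels.append(prev)
        return torch.cat(labels)                  # object labels o_1..o_n
```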
Model
Relations
• Context Encoding (sketched in code below)
  • Construct a contextualized representation of the bounding regions B and objects O using additional bidirectional LSTM layers: D = biLSTM([c_i; W_2 o_i]), i = 1, …, n.
  • The edge context D = [d_1, …, d_n] contains the states for each bounding region at the final layer, and W_2 is a parameter matrix mapping o_i into ℝ^100.
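A matching sketch for the edge-context encoder. Realizing W_2 as an embedding of the hard label o_i is an assumption, consistent with the slide's description of W_2 mapping o_i into ℝ^100.

```python
import torch
import torch.nn as nn

class EdgeContext(nn.Module):
    def __init__(self, num_classes, ctx_dim=1024, hidden=512):
        super().__init__()
        self.W2 = nn.Embedding(num_classes, 100)  # maps o_i to R^100
        self.bilstm = nn.LSTM(ctx_dim + 100, hidden,
                              bidirectional=True, batch_first=True)

    def forward(self, C, obj_labels):
        # C: (n, ctx_dim) object context; obj_labels: (n,) decoded labels o_i
        x = torch.cat([C, self.W2(obj_labels)], dim=-1)
        D, _ = self.bilstm(x.unsqueeze(0))
        return D.squeeze(0)   # edge context d_1..d_n
```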
Model
Relations
• Decoding (sketched in code below)
  • There is a quadratic number of possible relations in a scene graph.
  • For each possible edge, say between b_i and b_j, the probability that the edge has label x_{i→j} is computed.
  • The distribution uses the global context D and a feature vector f_{i,j} for the union of the two boxes:
    P(x_{i→j}) = softmax(W_r [(W_h d_i) ∘ (W_t d_j) ∘ f_{i,j}] + w_{o_i,o_j})
  • ∘ denotes an element-wise product; all three factors lie in ℝ^4096.
  • W_h and W_t project the head and tail context into ℝ^4096.
  • w_{o_i,o_j} is a bias vector specific to the head and tail labels.
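A sketch of the edge scorer with a naive O(n²) loop for clarity; the dimensions and the lookup-table realization of the w_{o_i,o_j} bias are assumptions.

```python
import torch
import torch.nn as nn

class RelationDecoder(nn.Module):
    def __init__(self, num_classes, num_predicates, edge_dim=1024):
        super().__init__()
        self.Wh = nn.Linear(edge_dim, 4096)   # head projection
        self.Wt = nn.Linear(edge_dim, 4096)   # tail projection
        self.Wr = nn.Linear(4096, num_predicates)
        # per-(head, tail)-label bias, the w_{o_i,o_j} term
        self.bias = nn.Parameter(
            torch.zeros(num_classes, num_classes, num_predicates))

    def forward(self, D, union_feats, obj_labels):
        # D: (n, edge_dim); union_feats[i][j]: (4096,) union-box feature f_ij
        logits = {}
        for i in range(D.size(0)):
            for j in range(D.size(0)):
                if i == j:
                    continue
                # element-wise product of head, tail, and union features
                g = self.Wh(D[i]) * self.Wt(D[j]) * union_feats[i][j]
                logits[(i, j)] = (self.Wr(g)
                                  + self.bias[obj_labels[i], obj_labels[j]])
        return logits   # softmax over predicates gives P(x_{i->j})
```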
Model
Frequency Baselines
• To support the finding that object labels are highly predictive of edge labels, two frequency baselines built from training-set statistics are additionally introduced (a lookup sketch follows below):
  1) FREQ uses a pre-trained detector to predict object labels for each RoI.
    • To obtain predicate probabilities between boxes i and j, the empirical distribution over relationships between objects o_i and o_j is looked up.
    • Intuitively, while this baseline does not look at the image to compute P(x_{i→j} | o_i, o_j), it demonstrates the value of conditioning on the object label predictions o.
  2) FREQ-OVERLAP additionally requires that the two boxes intersect for the pair to count as a valid relation.
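A sketch of the FREQ lookup table built from training statistics; the triple format and the smoothing term are assumptions.

```python
import numpy as np

def build_freq_table(train_triples, num_classes, num_predicates, eps=1e-8):
    """Empirical P(predicate | head_label, tail_label) from training triples.

    train_triples: iterable of (head_label, predicate, tail_label) ints.
    """
    counts = np.zeros((num_classes, num_classes, num_predicates))
    for h, p, t in train_triples:
        counts[h, t, p] += 1
    return counts / (counts.sum(axis=-1, keepdims=True) + eps)

# FREQ: predict the most frequent predicate for the detected label pair,
# e.g. pred = freq_table[o_i, o_j].argmax() -- the image is never consulted.
# FREQ-OVERLAP additionally zeroes out pairs whose boxes do not intersect.
```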
Model
Experimental Setup
• Alternating Highway LSTMs (sketched in code below)
  • To mitigate vanishing gradients as information flows upward, highway connections are added to all LSTMs.
  • To further reduce the number of parameters, the LSTM direction alternates from layer to layer.
  • Each alternating highway LSTM step wraps a conventional LSTM update, where x_i is the input, h_i is the hidden state, and s is the direction: s = 1 if the current layer is even, and −1 otherwise.
  • For MOTIFNET, 2 alternating highway LSTM layers are used for the object context and 4 for the edge context.
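The update equation on the original slide was an image; the sketch below reconstructs it from the surrounding description as a gated mix of an LSTM step and a linear projection of the input, so treat the exact gating form as an assumption.

```python
import torch
import torch.nn as nn

class HighwayLSTMLayer(nn.Module):
    """One alternating highway LSTM layer (sketch).

    r_i = sigmoid(W_g [x_i; h_{i-s}] + b_g)
    h_i = r_i * LSTM(x_i, h_{i-s}) + (1 - r_i) * W x_i
    with s = +1 on even layers and -1 on odd layers (here: `reverse`).
    """
    def __init__(self, in_dim, hidden, reverse):
        super().__init__()
        self.reverse = reverse                 # alternate direction per layer
        self.cell = nn.LSTMCell(in_dim, hidden)
        self.gate = nn.Linear(in_dim + hidden, hidden)
        self.proj = nn.Linear(in_dim, hidden)

    def forward(self, xs):                     # xs: (n, in_dim)
        n = xs.size(0)
        order = range(n - 1, -1, -1) if self.reverse else range(n)
        h = torch.zeros(1, self.cell.hidden_size)
        c = torch.zeros_like(h)
        out = [None] * n
        for i in order:
            x = xs[i:i+1]
            r = torch.sigmoid(self.gate(torch.cat([x, h], dim=-1)))
            h_lstm, c = self.cell(x, (h, c))
            h = r * h_lstm + (1 - r) * self.proj(x)   # highway mix
            out[i] = h
        return torch.cat(out)                  # (n, hidden)
```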
Model
Stacked Motif Network (MOTIFNET)
Model
Experimental Setup
• RoI ordering for LSTMs (sketched in code below)
  • LEFTRIGHT (default): sort the regions left-to-right by the central x-coordinate. This is expected to encourage the model to predict edges between nearby objects, which is beneficial because objects appearing in relationships tend to be close together.
  • CONFIDENCE: order bounding regions by the confidence of the maximum non-background prediction from the detector. This lets the detector commit to "easy" regions first, building context for more difficult regions.
  • SIZE: sort bounding boxes in descending order of size, possibly predicting global scene information first.
  • RANDOM: order the regions randomly.
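The four orderings as sorting keys, assuming (x1, y1, x2, y2) boxes and class 0 as background:

```python
import torch

def order_rois(boxes, label_dists, scheme="leftright"):
    """Return an ordering of RoI indices for the LSTM (sketch).

    boxes: (n, 4) float tensor as (x1, y1, x2, y2)
    label_dists: (n, num_classes), class 0 assumed to be background
    """
    if scheme == "leftright":     # central x-coordinate, left to right
        key = (boxes[:, 0] + boxes[:, 2]) / 2
        return torch.argsort(key)
    if scheme == "confidence":    # most confident non-background first
        key = label_dists[:, 1:].max(dim=-1).values
        return torch.argsort(key, descending=True)
    if scheme == "size":          # largest boxes first
        key = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
        return torch.argsort(key, descending=True)
    return torch.randperm(boxes.size(0))   # random
```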
Model
Experimental Setup
• Predicate Visual Features (sketched in code below)
  • To extract visual features for a predicate between boxes b_i and b_j, the detector's features corresponding to the union box of b_i and b_j are resized to 7×7×256.
  • Geometric relations are modeled with a 14×14×2 binary input, one channel per box.
  • Two convolutional layers are applied to this input, and the resulting 7×7×256 representation is added to the detector features.
  • Finally, fine-tuned VGG fully-connected layers are applied to obtain a 4096-dimensional representation.
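A sketch of this feature extractor; the conv hyperparameters and the single fc layer standing in for VGG's fine-tuned fc stack are assumptions chosen to match the stated shapes.

```python
import torch
import torch.nn as nn

class UnionFeatures(nn.Module):
    """Predicate visual features from a union box (sketch, assumed shapes).

    Combines 7x7x256 detector features for the union box with a 14x14x2
    binary mask (one channel per box) passed through two conv layers.
    """
    def __init__(self):
        super().__init__()
        self.mask_convs = nn.Sequential(
            nn.Conv2d(2, 128, kernel_size=7, stride=2, padding=3),  # 14 -> 7
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, kernel_size=3, padding=1),          # 7 -> 7
        )
        # stand-in for VGG's fine-tuned fully-connected layers
        self.fc = nn.Sequential(
            nn.Flatten(), nn.Linear(256 * 7 * 7, 4096), nn.ReLU(inplace=True))

    def forward(self, union_feats, box_masks):
        # union_feats: (k, 256, 7, 7); box_masks: (k, 2, 14, 14) binary
        x = union_feats + self.mask_convs(box_masks.float())
        return self.fc(x)   # (k, 4096) predicate feature f_ij
```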
Results
Quantitative Results & Ablation Studies
Results
Qualitative Results
Thank You
2020 IRRLAB Presentation
IRRLAB
shmwoo9395@{gist.ac.kr, gmail.com}
Editor's Notes
• #3: The talk proceeds in the order: related work, scene graph analysis, model, experimental results.
• #4: A scene graph structures a scene by representing each object as a node and the relationship between a pair of objects as an edge. The SGG task is to generate such a scene graph from a single input image.
• #5: Among entity types, part types account for more than 25%. Among relations, geometric and possessive types together account for more than 90%.
• #6: Shows how edges connect entity categories in the Visual Genome dataset. Geometric, possessive, and semantic edges account for 50.9%, 40.9%, and 8.7%, respectively. Semantic edges mostly occur between people and other objects.
• #7: Plots the likelihood of predicting the head, tail, or edge label given the other elements of the graph; the x-axis is k in top-k and the y-axis is accuracy. In general, the edge of a relationship is less informative than the head or tail. Edges tend to be determined by the object pair: given only the object pair, the edge can be predicted with roughly 70% accuracy at k=1.
• #8: (Same note as #7.)
• #9: Formalizes the overall process: given an image I, produce a scene graph G. This splits into three stages: object detection, object label prediction, and relation label prediction. Region proposals are denoted B (for bounding box); n is the number of objects, since an image can contain several. A bounding box is given by the coordinates of two points, hence 4 dimensions. Object labels are denoted O, with one label per bounding box. Relation labels are denoted R; counting the possible connections among n bounding boxes gives n(n−1) of them.
• #10: Scene graph parsing is tackled in three stages. First, bounding boxes: object proposals are found with Faster R-CNN on a VGG16 backbone. Next, for the detected object regions, context is encoded with a bi-LSTM and decoded with an LSTM to predict object labels. The object context and the predicted labels then serve together as input for encoding the edge context, from which relationship labels are predicted.
• #11: The first step is finding bounding boxes, using Faster R-CNN. For each image I, the detector first predicts region proposals B. Each proposal b_i carries a feature vector f_i and an object label probability distribution l_i.
• #12: To predict object labels, the context of the detected bounding boxes B is encoded first. B is reorganized into a linear sequence, where b is a bounding box, f a visual feature, and l an object label probability distribution. The object context C is computed with a bi-LSTM; C holds the final LSTM layer's hidden state for each element. W is a weight matrix mapping to 100 dimensions.
• #13: This stage decodes the previously encoded context, again with an LSTM. Feeding the context c and the previous step's object class into the LSTM yields a hidden state h; multiplying h by a weight matrix W and taking the argmax gives the most probable object class, which is passed on as input to the relation model.
• #14: Encodes the context for relations: the bounding boxes B and objects O are encoded with a bi-LSTM. The edge context D holds the final-layer state for each bounding region. W_2 is a weight matrix mapping to 100 dimensions.
• #15: A scene graph admits a quadratic number of candidate relations. For each possible edge, the probability that the edge is present is computed. D is the global context from the relation encoding, and f_ij is the feature vector for the union box. W_h and W_t are weight matrices projecting the head and tail into 4096 dimensions, and w_{oi,oj} is a bias vector.
• #16: Two frequency baselines are introduced to show that edge labels can be predicted reasonably well given only the object labels. They are built from training-set statistics: for example, if "riding" is the most frequent relation between "man" and "bicycle" in the training set, "riding" is predicted on the test set as well. The first baseline, FREQ, uses a pretrained detector for each RoI; to compute the probability of a relation between boxes i and j, it looks up the distribution of relationships observed between objects o_i and o_j. That is, the image is not used at all, and the prediction conditions only on the object labels. FREQ-OVERLAP is identical to FREQ, but counts a relation as valid only when the two bounding boxes intersect.
• #17: (Same note as #16.)
• #18: (Same note as #10.)
• #19: Experiments with how to order the RoIs fed into the LSTM. LEFTRIGHT sorts RoIs left to right by x-coordinate, modeling edges between nearby objects, since closer objects are more likely to be related. CONFIDENCE sorts by the detector's prediction confidence, modeling easy regions before hard ones. SIZE sorts by bounding box size, modeling global scene information first. RANDOM orders the regions randomly.
• #20: Visual features for objects come directly from the upstream Faster R-CNN, but a predicate's visual feature involves two objects, so it is taken from the union bounding box of the two objects, resized here to 7x7x256. The geometric relation between the two objects within the union box is modeled with a 14x14x2 binary input; applying two conv layers turns it into a 7x7x256 representation, which is added to the detector features. Finally, VGG's fine-tuned fc layers yield a 4096-dimensional representation.
• #21: FREQ-OVERLAP alone improves over the previous state of the art (message passing) by 1.4 mean recall. That MOTIFNET-NOCONTEXT outperforms FREQ-OVERLAP shows that visual features carry cues for edge prediction; comparing MOTIFNET-NOCONTEXT with the full model (MOTIFNET-LEFTRIGHT) shows that context matters.
• #22: The most common failure cases: first, predicate ambiguity such as "wearing" vs. "wears"; second, cases where the detector misdetects an object.