16. Datasets
• The VQA v1.0 dataset
• 204,721 MS-COCO images with 614,163 associated question-answer pairs
• train (248,349 questions), validation (121,512) and test (244,302)
• The VQA v2.0 dataset
• further balanced the v1.0 dataset by collecting complementary
images such that each question in the balanced dataset has a
pair of similar images but with different answers to the question
• 1,105,904 questions from 204,721 images (roughly twice as many questions as the v1.0 dataset)
17. Datasets
• The Visual Genome dataset
• 108,249 images labeled with question-answer pairs, objects
and attributes
• On average, 17 question-answer pairs are collected per image
• Visual Genome Relationship (VGR) annotations
• Describes the relationships between different objects in an image as triples <s, r, t> (a representation sketched below)
• Exactly the kind of entries we want for the visual knowledge base
• 1,531,448 knowledge triples, about 14 triples per image
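Below is a minimal, illustrative sketch (not code from the paper) of how such an <s, r, t> relationship annotation could be represented in Python; the example triples are hypothetical.

from collections import namedtuple

# One knowledge-base entry: subject, relation, target.
Triple = namedtuple("Triple", ["subject", "relation", "target"])

# Hypothetical VGR-style relationship annotations for a single image.
triples = [
    Triple("man", "riding", "horse"),
    Triple("horse", "standing on", "grass"),
]

for tr in triples:
    print(f"<{tr.subject}, {tr.relation}, {tr.target}>")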
18. Building Visual Knowledge Base
• Build a dedicated knowledge base for VQA in which each entry has the structure <s, r, t>
• The entries of this knowledge base (i.e., the visual knowledge base) come from the following two sources:
(i) the question-answer pairs of the VQA v1.0 train+val set
(ii) the VGR dataset
• Merging these two sources yields a visual knowledge base with about 159,970 distinct triples (a merging sketch follows this list)
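A minimal sketch of how the two triple sources might be merged and de-duplicated; the function and variable names are hypothetical, and the real pipeline additionally has to convert question-answer pairs into triples first.

def build_knowledge_base(qa_triples, vgr_triples):
    """Both arguments are iterables of (subject, relation, target) string tuples."""
    kb = set()
    for s, r, t in list(qa_triples) + list(vgr_triples):
        # Normalize lightly so duplicate triples from the two sources collapse.
        kb.add((s.strip().lower(), r.strip().lower(), t.strip().lower()))
    return kb

# Usage with hypothetical inputs: kb = build_knowledge_base(vqa_qa_triples, vgr_triples)
# The paper reports about 159,970 distinct triples for the two sources above.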
#4: Figure 2 illustrates how the proposed VKMN model works on visual question answering.
#6: The VKMN model consists of four modules (a schematic sketch in code follows this list):
• an input module, which encodes the input image and question into feature vectors;
• a knowledge spotting module, which retrieves related knowledge entries from the pre-built visual knowledge base by sub-graph hashing, based on the query question or automatic image captions;
• a joint embedding module, which embeds visual features and knowledge triples (or parts of the triples) jointly for ease of storage in key-value memory networks;
• an answering module, which receives the query question, reads the stored key-value memories to form the visual-knowledge feature, and predicts the answer.
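The wiring of these modules could look roughly like the Python skeleton below; the class, argument, and method names are placeholders chosen for illustration, not the released implementation.

class VKMN:
    """Schematic pipeline: each sub-module is assumed to be a callable."""

    def __init__(self, input_module, knowledge_spotter, joint_embedder, answerer):
        self.input_module = input_module            # image, question -> (visual feature u, question feature q)
        self.knowledge_spotter = knowledge_spotter  # question / caption -> relevant knowledge triples
        self.joint_embedder = joint_embedder        # (u, triples) -> key/value memory slots
        self.answerer = answerer                    # (q, keys, values) -> answer distribution

    def forward(self, image, question):
        u, q = self.input_module(image, question)
        entries = self.knowledge_spotter(question, image)
        keys, values = self.joint_embedder(u, entries)
        return self.answerer(q, keys, values)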
#7: The input image I and question q need to be processed into feature vectors before being fed into the memory networks. Several methods have been proposed to learn this multimodal joint embedding in an end-to-end manner for VQA, including the VQA 2016 challenge winner MCB [14] and the state-of-the-art MLB [20].
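A compact PyTorch-style sketch of the core of MLB-style low-rank bilinear pooling, the kind of multimodal joint embedding referred to above; the dimensions are illustrative assumptions, and MLB's attention and final output projection are omitted.

import torch
import torch.nn as nn

class LowRankBilinear(nn.Module):
    def __init__(self, q_dim=2400, v_dim=2048, joint_dim=1200):
        super().__init__()
        self.proj_q = nn.Linear(q_dim, joint_dim)  # question-side projection
        self.proj_v = nn.Linear(v_dim, joint_dim)  # image-side projection

    def forward(self, q_feat, v_feat):
        # Hadamard (element-wise) product of the two projected, tanh-squashed features.
        return torch.tanh(self.proj_q(q_feat)) * torch.tanh(self.proj_v(v_feat))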
#9: Figure 4 illustrates one example of sub-graph hashing and knowledge triple expansion.
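Knowledge spotting (the retrieval step behind Figure 4) can be pictured with the simplified keyword-index sketch below; the actual system hashes sub-graphs built from the question and automatic captions, so this is only an assumption-level approximation.

from collections import defaultdict

def build_index(kb):
    """kb: set of (subject, relation, target) string triples."""
    index = defaultdict(set)
    for triple in kb:
        for word in " ".join(triple).split():
            index[word].add(triple)
    return index

def spot_knowledge(question, caption, index):
    """Return every triple sharing a word with the question or the image caption."""
    keywords = set(question.lower().split()) | set(caption.lower().split())
    hits = set()
    for w in keywords:
        hits |= index.get(w, set())
    return hits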
#10: Our starting point is the spatially attentive visual feature u from the input module and the knowledge entry e from the knowledge spotting module. We need to learn a joint embedding that combines u ∈ R^d and e.
#12: The choice of which part of a triple serves as the key and which as the value is, in principle, quite flexible. For VQA, however, a single fixed choice does not work, because we do not know which part of the knowledge triple is missing from the visual question. In practice, we build three memory blocks, one for each of the three possible cases, as shown in Figure 2, and call this mechanism triple replication.
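A minimal sketch (an assumption of how the mechanism could be organized, not the authors' code) of triple replication: every triple (s, r, t) is stored three times, once per possible missing element, so that any of s, r or t can play the role of the value to be retrieved.

def replicate_triples(triples):
    """triples: iterable of (subject, relation, target); returns three memory blocks."""
    memories = {"predict_t": [], "predict_s": [], "predict_r": []}
    for s, r, t in triples:
        memories["predict_t"].append(((s, r), t))  # key = (s, r), value = t
        memories["predict_s"].append(((r, t), s))  # key = (r, t), value = s
        memories["predict_r"].append(((s, t), r))  # key = (s, t), value = r
    return memories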
#14: The original key-value memory networks [29] support iteratively refining the query and reading the memories.
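For reference, key-value memory reading with iterative query refinement can be sketched as follows; the hop count, similarity measure and query-update rule are illustrative assumptions rather than the exact configuration used in VKMN.

import torch
import torch.nn.functional as F

def read_memory(query, keys, values, hops=2):
    """query: (d,); keys, values: (n, d) tensors of embedded memory slots."""
    for _ in range(hops):
        scores = F.softmax(keys @ query, dim=0)        # key addressing
        o = (scores.unsqueeze(1) * values).sum(dim=0)  # value reading (weighted sum)
        query = query + o                              # simple additive query update
    return query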
#19: General-purpose knowledge bases such as Freebase contain billions of knowledge facts, most of which are irrelevant to visual questions; this motivates building a compact, VQA-specific visual knowledge base instead.
#21: Table 1 lists the detailed results of the four cases in comparison to our designed model (baseline) on different question categories. This ablation study verifies the effectiveness of our design choice.
Results show that TransE performs better than BOW, especially on the “other” answer-type (57.0 vs 56.1).
This comparison verifies the importance of visual attention in VKMN.
VKMN clearly outperforms MLB, especially on the “other” answer-type (57.0 vs 54.9). This study verifies the effectiveness of the proposed memory network module.
This study shows that the triple replication mechanism is important for avoiding ambiguity in the visual-question pair, especially on the “other” answer-type (57.0 vs 53.9).
#22: We further list the full benchmark results on both the test-dev and test-standard sets of the VQA v1.0 dataset in Table 2. For an easy and fair comparison, we also list the results of state-of-the-art methods with a single model. VKMN outperforms state-of-the-art results on both the Open-Ended and Multiple-Choice tasks, especially for the “other” answer-type, which shows that VKMN is effective at incorporating external knowledge for answering 6W questions (what, where, when, who, why, and how).
#23: Figure 5 further illustrates some qualitative examples in comparison to the state-of-the-art method MLB. Below each example, we also show the confidence scores of the top-5 knowledge triples according to Eq. 5. The VKMN model clearly attends to highly relevant knowledge triples. Note that although the top-1 triple is sometimes relevant but not entirely accurate (due to its appearance frequency in the training set), the final decision is made by the softmax classifier (Eq. 7) over the weight-averaged knowledge representation (Eq. 6), which tends to produce the right answer.
#24: Besides, we show some failure cases in Figure 6 along with the MLB attention maps. These cases involve spatial relationship reasoning, for which MLB does not obtain correct attention regions. The problem may be alleviated by resorting to a more advanced attention mechanism such as structured attention [50].
#25: We further evaluated our method on the VQA v2.0 dataset, which has about twice as many questions as VQA v1.0 and is much more difficult. Table 3 lists the comparison results on the test-standard set. A single VKMN model achieves 64.36 overall accuracy, which is much better than MCB and MLB, especially on the “other” answer-type.
Although the top-3 solutions in the VQA v2.0 challenge show better accuracy than our VKMN, they rely heavily on model ensembling, and their best results are obtained by ensembling dozens or even hundreds of models. Even for the first-place solution [37], the best single-model result is still much worse than that of our single VKMN model, especially on the “other” answer-type, when the exhaustive bottom-up attention from object detection results is left out.