tions of neighboring objects (e.g., <motorcycle-next-to-car>), to align with spatial descriptions in the question. Another direction is to learn semantic dependencies between objects (e.g., <girl-eating-cake>) to capture the interactive dynamics in the visual scene.

Motivated by this, we propose a Relation-aware Graph Attention Network (ReGAT) for VQA, introducing a novel relation encoder that captures these inter-object relations beyond static object/region detection. These visual relation features can reveal more fine-grained visual concepts in the image, which in turn provides a holistic scene interpretation that can be used for answering semantically complicated questions. To cover the high variance in image scenes and question types, both explicit (e.g., spatial/positional, semantic/actionable) relations and implicit relations are learned by the relation encoder, where images are represented as graphs and interactions between objects are captured via a graph attention mechanism.

Furthermore, the graph attention is learned based on the context of the question, permitting the injection of semantic information from the question into the relation encoding stage. In this way, the features learned by the relation encoder not only capture object-interactive visual contents in the image, but also absorb the semantic clues in the question, to dynamically focus on particular relation types and instances for each question on the fly.

Figure 1 shows an overview of the proposed model. First, a Faster R-CNN is used to generate a set of object region proposals, and a question encoder is used for question embedding. The convolutional and bounding-box features of each region are then injected into the relation encoder to learn relation-aware, question-adaptive, region-level representations of the image. These relation-aware visual features and the question embeddings are then fed into a multimodal fusion module to produce a joint representation, which is used in the answer prediction module to generate an answer.

In principle, our work is different from (and compatible with) existing VQA systems. It pivots on a new dimension: using question-adaptive inter-object relations to enrich image representations in order to enhance VQA performance. The contributions of our work are three-fold:

• We propose a novel graph-based relation encoder to learn both explicit and implicit relations between visual objects via graph attention networks.
• The learned relations are question-adaptive, meaning that they can dynamically capture the visual object relations that are most relevant to each question.
• We show that our ReGAT model is a generic approach that can be used to improve state-of-the-art VQA models on the VQA 2.0 dataset. Our model also achieves state-of-the-art performance on the more challenging VQA-CP v2 dataset.

2. Related Work

2.1. Visual Question Answering

The current dominant framework for VQA systems consists of an image encoder, a question encoder, multimodal fusion, and an answer predictor. Instead of directly using visual features from CNN-based feature extractors, [56, 11, 41, 33, 49, 38, 63, 36] explored various image attention mechanisms to locate regions that are relevant to the question. To learn a better representation of the question, [33, 38, 11] proposed to perform question-guided image attention and image-guided question attention collaboratively, to merge knowledge from both visual and textual modalities in the encoding stage. [15, 25, 60, 4, 24] explored higher-order fusion methods to better combine textual information with visual information (e.g., using bilinear pooling instead of simpler first-order methods such as summation, concatenation and multiplication).

To make the model more interpretable, some literature [30, 59, 29, 54, 55, 53] also exploited high-level semantic information in the image, such as attributes, captions and visual relation facts. Most of these methods applied VQA-independent models to extract semantic knowledge from the image, while [34] built a Relation-VQA dataset and directly mined VQA-specific relation facts to feed additional semantic information to the model. A few recent studies [48, 35, 29] investigated how to incorporate memory to aid the reasoning step, especially for difficult questions.

However, the semantic knowledge brought in by either memory or high-level semantic information is usually converted into a textual representation, instead of being used directly as a visual representation, which carries richer and more indicative information about the image. Our work is complementary to these prior studies in that we encode object relations directly into the image representation, and the relation encoding step is generic and can naturally fit into any state-of-the-art VQA model.

2.2. Visual Relationship

Visual relationships had been explored before deep learning became popular. Early work [10, 14, 7, 37] presented methods to re-score detected objects by considering object relations (e.g., co-occurrence [10], position and size [5]) as post-processing steps for object detection. Some previous work [16, 17] also probed the idea that spatial relationships (e.g., "above", "around", "below" and "inside") between objects can help improve image segmentation.

Visual relationships have proven to be crucial to many computer vision tasks. For example, they aided the cognitive task of mapping images to captions [13, 12, 58] and improved image search [47, 23] and object localization [45, 21]. Recent work on visual relationships [45, 43, 9] focused more on non-spatial relations, also known as "semantic relations" (i.e., actions of, or interactions between, objects). A few neural network architectures have been designed for the visual relationship prediction task [32, 8, 61].
2.3. Relational Reasoning

We refer to the aforementioned visual relationships as explicit relations, which have been shown to be effective for image captioning [58]. Specifically, [58] exploited pre-defined semantic relations learned from the Visual Genome dataset [28] and spatial relations between objects. A graph was then constructed based on these relations, and a Graph Convolutional Network (GCN) [26] was used to learn representations for each object.

Another line of research focuses on implicit relations, where no explicit semantic or spatial relations are used to construct the graph. Instead, all relations are implicitly captured by an attention module or via higher-order methods over the fully-connected graph of an input image [46, 21, 6, 57], to model the interactions between detected objects. For example, [46] reasons over all possible pairs of objects in an image via simple MLPs. In [6], a bilinear fusion method, called MuRel cell, was introduced to perform pairwise relationship modeling.

Other methods [50, 39, 52] have been proposed for learning question-conditioned graph representations of images. Specifically, [39] introduced a graph learner module that is conditioned on question representations to compute the image representations using pairwise attention and spatial graph convolutions. [50] exploited structured question representations such as parse trees, and used GRUs to model contextualized interactions between both objects and words. A more recent work [52] introduced a sparser graph defined by inter/intra-class edges, in which relationships are implicitly learned via a language-guided graph attention mechanism. However, all these works still focus on implicit relations, which are less interpretable than explicit relations.

Our contributions. Our work is inspired by [21, 58]. However, different from them, ReGAT considers both explicit and implicit relations to enrich image representations. For explicit relations, our model uses a Graph Attention Network (GAT) rather than the simple GCN used in [58]. As opposed to GCNs, the use of GAT allows assigning different levels of importance to nodes of the same neighborhood. For implicit relations, our model learns a graph that is adaptive to each question by filtering out question-irrelevant relations, instead of treating all relations equally as in [21]. In experiments, we conduct detailed ablation studies to demonstrate the effectiveness of each individual design.

Figure 2. Model architecture of the proposed ReGAT for visual question answering. Faster R-CNN is employed to detect a set of object regions. These region-level features are then fed into different relation encoders to learn relation-aware, question-adaptive visual features, which are fused with the question representation to predict an answer. Multimodal fusion and the answer predictor are omitted for simplicity.

3. Relation-aware Graph Attention Network

The VQA task is defined as follows: given a question q grounded in an image I, the goal is to predict an answer â ∈ A that best matches the ground-truth answer a*. As is common practice in the VQA literature, this is formulated as a classification problem:

    â = argmax_{a ∈ A} p_θ(a | I, q),    (1)

where p_θ is the trained model.

Figure 2 gives a detailed illustration of our proposed model, which consists of an Image Encoder, a Question Encoder, and a Relation Encoder. For the Image Encoder, Faster R-CNN [2] is used to identify a set of objects V = {v_i}_{i=1}^{K}, where each object v_i is associated with a visual feature vector v_i ∈ R^{d_v} and a bounding-box feature vector b_i ∈ R^{d_b} (K = 36, d_v = 2048, and d_b = 4 in our experiments). Each b_i = [x, y, w, h] corresponds to a 4-dimensional spatial coordinate, where (x, y) denotes the coordinate of the top-left corner of the bounding box, and h/w corresponds to the height/width of the box. For the Question Encoder, we use a bidirectional RNN with Gated Recurrent Units (GRU) and perform self-attention on the sequence of RNN hidden states to generate the question embedding q ∈ R^{d_q} (d_q = 1024 in our experiments). The following subsections explain the details of the Relation Encoder.
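To make the formulation above concrete, the following is a minimal PyTorch-style sketch of Eq. (1) together with the two encoders just described (K = 36 regions, d_v = 2048, d_b = 4, d_q = 1024). The relation encoder and the multimodal fusion are stand-ins only (a linear projection and question-weighted pooling), since their actual definitions are given in later subsections; the word-embedding size, question length and answer-vocabulary size are illustrative assumptions rather than values stated here.

```python
# Minimal sketch of Eq. (1) and the encoder interfaces; relation encoding and
# fusion are simplified placeholders, not the paper's actual modules.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionEncoder(nn.Module):
    """Bidirectional GRU + self-attention over hidden states -> q in R^{d_q}."""
    def __init__(self, word_dim=300, d_q=1024):       # word_dim is an assumption (e.g., GloVe)
        super().__init__()
        self.gru = nn.GRU(word_dim, d_q // 2, batch_first=True, bidirectional=True)
        self.att = nn.Linear(d_q, 1)                   # scores each hidden state

    def forward(self, word_embs):                      # [B, T, word_dim]
        h, _ = self.gru(word_embs)                     # [B, T, d_q]
        alpha = F.softmax(self.att(h), dim=1)          # [B, T, 1] self-attention weights
        return (alpha * h).sum(dim=1)                  # [B, d_q]

class VQAClassifier(nn.Module):
    """â = argmax_a p_θ(a | I, q): score every candidate answer a in A."""
    def __init__(self, d_v=2048, d_b=4, d_q=1024, num_answers=3000):  # |A| is illustrative
        super().__init__()
        self.q_enc = QuestionEncoder(d_q=d_q)
        # Placeholder for the relation encoder: project each region's visual +
        # bounding-box features into a common space.
        self.region_proj = nn.Linear(d_v + d_b, d_q)
        self.classifier = nn.Sequential(
            nn.Linear(d_q, d_q), nn.ReLU(), nn.Linear(d_q, num_answers))

    def forward(self, v, b, word_embs):                # v: [B, K, d_v], b: [B, K, d_b]
        q = self.q_enc(word_embs)                                   # [B, d_q]
        regions = self.region_proj(torch.cat([v, b], dim=-1))       # [B, K, d_q]
        # Placeholder fusion: question-weighted pooling over regions.
        weights = F.softmax((regions * q.unsqueeze(1)).sum(-1), dim=1)
        joint = (weights.unsqueeze(-1) * regions).sum(1) * q        # [B, d_q]
        return self.classifier(joint)                               # logits over A

# Usage with random stand-ins for Faster R-CNN features and word embeddings.
model = VQAClassifier()
logits = model(torch.randn(2, 36, 2048), torch.randn(2, 36, 4), torch.randn(2, 14, 300))
answer_idx = logits.argmax(dim=-1)  # â for each question in the batch
```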
Table 1. Performance on VQA 2.0 validation set with different fusion methods. Consistent improvements are observed across three popular fusion methods, which demonstrates that our model is compatible with generic VQA frameworks. (†) Results based on our re-implementations.

Table 2. Model accuracy on the VQA-CP v2 benchmark (open-ended setting on the test split).
Model | SOTA [6] | Baseline | Sem   | Spa   | Imp   | All
Acc.  | 39.54    | 39.24    | 39.54 | 40.30 | 39.58 | 40.42

Table 3. Model accuracy on the VQA 2.0 benchmark (open-ended setting on the test-dev and test-std splits). The Overall, Y/N, Num and Other columns are on test-dev; the last column is the overall test-std accuracy.
Model              | Overall | Y/N   | Num   | Other | Test-std
BUTD [49]          | 65.32   | 81.82 | 44.21 | 56.05 | 65.67
MFH [60]           | 68.76   | 84.27 | 50.66 | 60.50 | -
Counter [62]       | 68.09   | 83.14 | 51.62 | 58.97 | 68.41
Pythia [22]        | 70.01   | -     | -     | -     | 70.24
BAN [24]           | 70.04   | 85.42 | 54.04 | 60.52 | 70.35
v-AGCN [57]        | 65.94   | 82.39 | 56.46 | 45.93 | 66.17
Graph learner [39] | -       | -     | -     | -     | 66.18
MuRel [6]          | 68.03   | 84.77 | 49.84 | 57.85 | 68.41
ReGAT (ours)       | 70.27   | 86.08 | 54.42 | 60.33 | 70.58

Table 4. Performance on VQA 2.0 validation set for the ablation study (Q-adaptive: question-adaptive; Att.: attention).
Att. | Q-adaptive | Semantic | Spatial | Implicit | All
No   | No         | 63.20    | 63.04   | n/a      | n/a
Yes  | No         | 63.90    | 63.85   | 63.36    | 64.98
No   | Yes        | 63.31    | 63.13   | n/a      | n/a
Yes  | Yes        | 64.11    | 64.02   | 64.10    | 65.30

performance. Our best results are achieved by combining the best single-relation models through a weighted sum. To verify that the performance gain is significant, we performed a t-test on the results of our BAN baseline and our proposed model with each single relation. We report the standard deviation in Table 1; the p-value is 0.001459, so the improvement from our method is significant at p < 0.05. We also compare with an additional baseline model that uses a BiLSTM as the contextualized relation encoder; the results show that using a BiLSTM hurts performance.
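As a concrete illustration of the significance test described above, the sketch below runs a t-test with SciPy over per-run validation accuracies. The accuracy lists are made-up placeholders (not the paper's runs), and since the text does not state whether the test was paired or unpaired, both variants are shown.

```python
# Hedged sketch of the significance test: compare per-run accuracies of a
# baseline and a relation-augmented model. Numbers are made-up placeholders.
from scipy import stats

baseline_acc = [63.02, 63.10, 62.95, 63.08, 63.01]   # e.g., BAN baseline over 5 runs
regat_acc    = [63.90, 63.97, 63.85, 63.92, 63.88]   # e.g., + one relation type

# Paired test if both models share the same seeds/splits, otherwise unpaired.
t_paired, p_paired = stats.ttest_rel(regat_acc, baseline_acc)
t_unpaired, p_unpaired = stats.ttest_ind(regat_acc, baseline_acc, equal_var=False)

print(f"paired:   t={t_paired:.3f}, p={p_paired:.6f}")
print(f"unpaired: t={t_unpaired:.3f}, p={p_unpaired:.6f}")
```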
To demonstrate the generalizability of our ReGAT model, we also conduct experiments on the VQA-CP v2 dataset, where the distributions of the training and test splits are very different from each other. Table 2 shows results on the VQA-CP v2 test split. Here we use BAN with four glimpses as the baseline model. Consistent with what we observed on VQA 2.0, our ReGAT model surpasses the baseline by a large margin. With only a single relation, our model already achieves state-of-the-art performance on VQA-CP v2 (40.30 vs. 39.54). When adding all the relations, the performance gain is further lifted to +0.88.

Table 3 shows single-model results on the VQA 2.0 test-dev and test-std splits. The top five rows show results from models without relational reasoning, and the bottom four rows show results from models with relational reasoning. Our model surpasses all previous work with or without relational reasoning. Our final model uses bilinear attention with four glimpses as the multimodal fusion method. Compared to BAN [24], which uses eight bilinear attention maps, our model outperforms BAN with fewer glimpses. Pythia [22] achieved 70.01 by adding additional grid-level features and using 100 object proposals from a Faster R-CNN fine-tuned on the VQA dataset for all images. Our model, without any of the feature augmentation used in their work, surpasses Pythia's performance by a large margin.

4.4. Ablation Study

In Table 4, we compare three ablated instances of ReGAT with its complete form. Specifically, we validate the importance of the attention mechanism and of concatenating question features to each object representation. All the results reported in Table 4 are based on the BUTD model architecture. To remove the attention mechanism from our relation encoder, we simply replace the graph attention network with a graph convolutional network, which also learns node representations from graphs, but via a simple linear transformation.

Firstly, we validate the effectiveness of using the attention mechanism to learn relation-aware visual features. Adding the attention mechanism leads to higher accuracy for all three types of relations. Comparison between line 1 and line 2 shows a gain of +0.70 for the semantic relation and +0.81 for the spatial relation. Secondly, we validate the effectiveness of question-adaptive relation features. Between line 1 and line 3, we see a gain of approximately +0.1 for both semantic and spatial relations. Finally, the attention mechanism and question-adaptive features are combined to give the complete ReGAT model, which achieves the highest accuracy (line 4). Surprisingly, comparing line 1 and line 4 shows that combining graph attention with the question-adaptive mechanism yields a larger gain than simply adding the individual gains from the two methods. It is worth mentioning that for the implicit relation, adding the question-adaptive mechanism improves model performance by +0.74, which is higher than the corresponding gain for the two explicit relations. When all the relations are considered, we observe a consistent performance gain from adding the question-adaptive mechanism.
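To make this ablation concrete, below is a hedged sketch (not the paper's actual relation encoder) contrasting the two aggregation styles it compares: a GCN-style layer that mixes neighbors with a plain linear transformation and uniform weights, and a single-head GAT-style layer that learns per-neighbor attention weights, with the question embedding concatenated to every object feature so that those weights become question-adaptive. The dimensions, the single attention head, and the fully-connected toy graph are all assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    """Ablated variant: aggregate neighbors with a plain linear transform
    and uniform weights (no attention)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.proj = nn.Linear(d_in, d_out)

    def forward(self, x, adj):                 # x: [B, K, d_in], adj: [B, K, K]
        adj = adj / adj.sum(-1, keepdim=True).clamp(min=1)   # row-normalize
        return F.relu(adj @ self.proj(x))

class QuestionAdaptiveGATLayer(nn.Module):
    """Single-head graph attention: each object attends to its neighbors with
    learned weights; concatenating the question embedding q to every object
    feature makes these weights question-adaptive."""
    def __init__(self, d_in, d_q, d_out):
        super().__init__()
        self.proj = nn.Linear(d_in + d_q, d_out)
        self.att = nn.Linear(2 * d_out, 1)

    def forward(self, x, q, adj):              # x: [B, K, d_in], q: [B, d_q]
        K = x.size(1)
        h = self.proj(torch.cat([x, q.unsqueeze(1).expand(-1, K, -1)], dim=-1))
        pairs = torch.cat([h.unsqueeze(2).expand(-1, -1, K, -1),
                           h.unsqueeze(1).expand(-1, K, -1, -1)], dim=-1)
        e = F.leaky_relu(self.att(pairs).squeeze(-1))        # [B, K, K] raw scores
        e = e.masked_fill(adj == 0, float('-inf'))           # keep graph edges only
        alpha = F.softmax(e, dim=-1)                         # per-neighbor weights
        return F.relu(alpha @ h)

# Toy usage: K=36 regions (2048-d visual + 4-d box features), fully-connected graph.
x, q = torch.randn(2, 36, 2052), torch.randn(2, 1024)
adj = torch.ones(2, 36, 36)
out_gcn = GCNLayer(2052, 1024)(x, adj)                    # "Att. = No" row
out_gat = QuestionAdaptiveGATLayer(2052, 1024, 1024)(x, q, adj)  # "Att. = Yes, Q-adaptive = Yes" row
```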
To better understand how these two components help answer questions, we further visualize and compare the attention maps learned by the ablated instances in Section 4.5.

Figure 4. Visualization of attention maps learned from ablated instances. The three bounding boxes shown in each image are the top-3 attended regions; the numbers are attention weights.

Figure 5. Visualization of different types of visual object relations in the VQA task. The three bounding boxes shown in each image are the top-3 attended regions. Green arrows indicate relations from subject to object. Labels and numbers in green boxes are class labels for explicit relations and attention weights for implicit relations.

4.5. Visualization

To better illustrate the effectiveness of adding graph attention and the question-adaptive mechanism, we compare the attention maps learned by the complete ReGAT model in a single-relation setting with those learned by two ablated models. As shown in Figure 4, the second, third and last rows correspond to lines 1, 3 and 4 in Table 4, respectively. Comparing row 2 with row 3 leads to the observation that graph attention helps capture the interactions between objects, which contributes to a better alignment between image regions and questions. Rows 3 and 4 show that adding the question-adaptive attention mechanism produces sharper attention maps and focuses on more relevant regions. These visualization results are consistent with the quantitative results reported in Table 4.

Figure 5 provides visualization examples of how different types of relations help improve performance. In each example, we show the top-3 attended regions and the learned relations between these regions. As shown in these examples, each relation type contributes to a better alignment between image regions and questions. For example, in Figure 5(a), the semantic relations "Holding" and "Riding" resonate with the same words that appear in the corresponding questions. Figure 5(b) shows how spatial relations capture the relative geometric positions between regions. To visualize implicit relations, Figure 5(c) shows the attention weights to the top-1 region from every other region. Surprisingly, the learned implicit relations are able to capture both spatial and semantic interactions. For example, the top image in Figure 5(c) illustrates the spatial interaction "on" between the table and the vase, and the bottom image illustrates the semantic interaction "walking" between the traffic light and the person.

5. Conclusion

We have presented Relation-aware Graph Attention Network (ReGAT), a novel framework for visual question answering that models multi-type object relations with a question-adaptive attention mechanism. ReGAT exploits two types of visual object relations, Explicit Relations and Implicit Relations, to learn a relation-aware region representation through graph attention. Our method achieves state-of-the-art results on both the VQA 2.0 and VQA-CP v2 datasets. The proposed ReGAT model is compatible with generic VQA models: comprehensive experiments on two VQA datasets show that it can be infused into state-of-the-art VQA architectures in a plug-and-play fashion. For future work, we will investigate how to fuse the three relations more effectively and how to utilize each relation to solve specific question types.
References

[1] Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. Don't just assume; look and answer: Overcoming priors for visual question answering. In CVPR, 2018.
[2] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, 2018.
[3] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual Question Answering. In ICCV, 2015.
[4] Hedi Ben-Younes, Rémi Cadene, Matthieu Cord, and Nicolas Thome. Mutan: Multimodal tucker fusion for visual question answering. In ICCV, 2017.
[5] Irving Biederman, Robert J Mezzanotte, and Jan C Rabinowitz. Scene perception: Detecting and judging objects undergoing relational violations. Cognitive Psychology, 1982.
[6] Remi Cadene, Hedi Ben-younes, Matthieu Cord, and Nicolas Thome. Murel: Multimodal relational reasoning for visual question answering. In CVPR, 2019.
[7] Myung Jin Choi, Antonio Torralba, and Alan S Willsky. A tree-based context model for object recognition. PAMI, 2012.
[8] Bo Dai, Yuqi Zhang, and Dahua Lin. Detecting visual relationships with deep relational networks. In CVPR, 2017.
[9] Santosh K Divvala, Ali Farhadi, and Carlos Guestrin. Learning everything about anything: Webly-supervised visual concept learning. In CVPR, 2014.
[10] Santosh K Divvala, Derek Hoiem, James H Hays, Alexei A Efros, and Martial Hebert. An empirical study of context in object detection. In CVPR, 2009.
[11] Haoqi Fan and Jiatong Zhou. Stacked latent attention for multimodal reasoning. In CVPR, 2018.
[12] Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh K Srivastava, Li Deng, Piotr Dollár, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C Platt, et al. From captions to visual concepts and back. In CVPR, 2015.
[13] Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. Every picture tells a story: Generating sentences from images. In ECCV, 2010.
[14] Pedro F Felzenszwalb, Ross B Girshick, David McAllester, and Deva Ramanan. Object detection with discriminatively trained part-based models. PAMI, 2010.
[15] Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. In EMNLP, 2016.
[16] Carolina Galleguillos, Andrew Rabinovich, and Serge Belongie. Object categorization using co-occurrence, location and appearance. In CVPR, 2008.
[17] Stephen Gould, Jim Rodgers, David Cohen, Gal Elidan, and Daphne Koller. Multi-class segmentation with relative location prior. IJCV, 2008.
[18] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
[19] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[21] Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, and Yichen Wei. Relation networks for object detection. In CVPR, 2018.
[22] Yu Jiang, Vivek Natarajan, Xinlei Chen, Marcus Rohrbach, Dhruv Batra, and Devi Parikh. Pythia v0.1: the winning entry to the VQA challenge 2018. arXiv preprint arXiv:1807.09956, 2018.
[23] Justin Johnson, Ranjay Krishna, Michael Stark, Li-Jia Li, David Shamma, Michael Bernstein, and Li Fei-Fei. Image retrieval using scene graphs. In CVPR, 2015.
[24] Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. Bilinear attention networks. In NeurIPS, 2018.
[25] Jin-Hwa Kim, Kyoung-Woon On, Woosang Lim, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. Hadamard product for low-rank bilinear pooling. In ICLR, 2017.
[26] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
[27] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, Michael Bernstein, and Li Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. arXiv preprint arXiv:1602.07332, 2016.
[28] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 2017.
[29] Guohao Li, Hang Su, and Wenwu Zhu. Incorporating external knowledge to answer open-domain visual questions with dynamic memory networks. In CVPR, 2018.
[30] Qing Li, Jianlong Fu, Dongfei Yu, Tao Mei, and Jiebo Luo. Tell-and-answer: Towards explainable visual question answering using attributes and captions. In EMNLP, 2018.
[31] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[32] Cewu Lu, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. Visual relationship detection with language priors. In ECCV, 2016.
[33] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image co-attention for visual question answering. In NIPS, 2016.
[34] Pan Lu, Lei Ji, Wei Zhang, Nan Duan, Ming Zhou, and Jianyong Wang. R-VQA: Learning visual relation facts with semantic attention for visual question answering. In SIGKDD, 2018.
[35] Chao Ma, Chunhua Shen, Anthony Dick, Qi Wu, Peng Wang, Anton van den Hengel, and Ian Reid. Visual question answering with memory-augmented networks. In CVPR, 2018.
[36] Mateusz Malinowski, Carl Doersch, Adam Santoro, and Peter Battaglia. Learning visual question answering by bootstrapping hard attention. In ECCV, 2018.
[37] Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. The role of context for object detection and semantic segmentation in the wild. In CVPR, 2014.
[38] Hyeonseob Nam, Jung-Woo Ha, and Jeonghee Kim. Dual attention networks for multimodal reasoning and matching. In CVPR, 2017.
[39] Will Norcliffe-Brown, Stathis Vafeias, and Sarah Parisot. Learning conditioned graph structures for interpretable visual question answering. In NeurIPS, 2018.
[40] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
[41] Badri Patro and Vinay P Namboodiri. Differential attention for visual question answering. In CVPR, 2018.
[42] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.
[43] Vignesh Ramanathan, Congcong Li, Jia Deng, Wei Han, Zhen Li, Kunlong Gu, Yang Song, Samy Bengio, Charles Rosenberg, and Li Fei-Fei. Learning semantic relationships for better action retrieval in images. In CVPR, 2015.
[44] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[45] Mohammad Amin Sadeghi and Ali Farhadi. Recognition using visual phrases. In CVPR, 2011.
[46] Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. In NIPS, 2017.
[47] Sebastian Schuster, Ranjay Krishna, Angel Chang, Li Fei-Fei, and Christopher D Manning. Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In Proceedings of the Fourth Workshop on Vision and Language, 2015.
[48] Zhou Su, Chen Zhu, Yinpeng Dong, Dongqi Cai, Yurong Chen, and Jianguo Li. Learning visual knowledge memory networks for visual question answering. In CVPR, 2018.
[49] Damien Teney, Peter Anderson, Xiaodong He, and Anton van den Hengel. Tips and tricks for visual question answering: Learnings from the 2017 challenge. In CVPR, 2018.
[50] Damien Teney, Lingqiao Liu, and Anton van den Hengel. Graph-structured representations for visual question answering. In CVPR, 2017.
[51] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
[52] Peng Wang, Qi Wu, Jiewei Cao, Chunhua Shen, Lianli Gao, and Anton van den Hengel. Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks. In CVPR, 2019.
[53] Peng Wang, Qi Wu, Chunhua Shen, and Anton van den Hengel. The VQA-machine: Learning how to use existing vision algorithms to answer new questions. In CVPR, 2017.
[54] Qi Wu, Chunhua Shen, Peng Wang, Anthony Dick, and Anton van den Hengel. Image captioning and visual question answering based on attributes and external knowledge. PAMI, 2017.
[55] Qi Wu, Peng Wang, Chunhua Shen, Anthony Dick, and Anton van den Hengel. Ask me anything: Free-form visual question answering based on knowledge from external sources. In CVPR, 2016.
[56] Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked attention networks for image question answering. In CVPR, 2016.
[57] Zhuoqian Yang, Jing Yu, Chenghao Yang, Zengchang Qin, and Yue Hu. Multi-modal learning with prior visual relation reasoning. arXiv preprint arXiv:1812.09681, 2018.
[58] Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. Exploring visual relationship for image captioning. In ECCV, 2018.
[59] Dongfei Yu, Jianlong Fu, Tao Mei, and Yong Rui. Multi-level attention networks for visual question answering. In CVPR, 2017.
[60] Zhou Yu, Jun Yu, Chenchao Xiang, Jianping Fan, and Dacheng Tao. Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering. IEEE Transactions on Neural Networks and Learning Systems, 2018.
[61] Hanwang Zhang, Zawlin Kyaw, Shih-Fu Chang, and Tat-Seng Chua. Visual translation embedding network for visual relation detection. In CVPR, 2017.
[62] Yan Zhang, Jonathon S. Hare, and Adam Prügel-Bennett. Learning to count objects in natural images for visual question answering. In ICLR, 2018.
[63] Chen Zhu, Yanpeng Zhao, Shuaiyi Huang, Kewei Tu, and Yi Ma. Structured attentions for visual question answering. In ICCV, 2017.