
Relation-Aware Graph Attention Network for Visual Question Answering

Linjie Li, Zhe Gan, Yu Cheng, Jingjing Liu


Microsoft Dynamics 365 AI Research
{lindsey.li, zhe.gan, yu.cheng, jingjl}@microsoft.com

Abstract
arXiv:1903.12314v3 [cs.CV] 9 Oct 2019

In order to answer semantically-complicated questions about an image, a Visual Question Answering (VQA) model needs to fully understand the visual scene in the image, especially the interactive dynamics between different objects. We propose a Relation-aware Graph Attention Network (ReGAT), which encodes each image into a graph and models multi-type inter-object relations via a graph attention mechanism, to learn question-adaptive relation representations. Two types of visual object relations are explored: (i) Explicit Relations that represent geometric positions and semantic interactions between objects; and (ii) Implicit Relations that capture the hidden dynamics between image regions. Experiments demonstrate that ReGAT outperforms prior state-of-the-art approaches on both the VQA 2.0 and VQA-CP v2 datasets. We further show that ReGAT is compatible with existing VQA architectures, and can be used as a generic relation encoder to boost model performance for VQA. (Code is available at https://github.com/linjieli222/VQA_ReGAT.)

Figure 1. An overview of the ReGAT model. Both explicit relations (semantic and spatial) and implicit relations are considered. The proposed relation encoder captures question-adaptive object interactions via Graph Attention.

1. Introduction

Recent advances in deep learning have driven tremendous progress in both Computer Vision and Natural Language Processing (NLP). Interdisciplinary areas between language and vision, such as image captioning, text-to-image synthesis and visual question answering (VQA), have attracted rapidly growing attention from both the vision and NLP communities. Take VQA as an example: the goal (and the main challenge) is to train a model that can achieve comprehensive and semantically-aligned understanding of multimodal input. Specifically, given an image and a natural language question grounded on the image, the task is to associate visual features in the image with the semantic meaning in the question, in order to correctly answer the question.

Most state-of-the-art approaches to VQA [56, 11, 38, 33, 49] focus on learning a multimodal joint representation of images and questions. Specifically, a Convolutional Neural Network (CNN) or Region-based CNN (R-CNN) is commonly used as a visual feature extractor for image encoding, and a Recurrent Neural Network (RNN) is used for question encoding. After obtaining a sparse set of image regions from the visual feature extractor, multimodal fusion is applied to learn a joint representation that captures the alignment between each individual region and the question. This joint representation is then fed into an answer predictor to produce an answer.

This framework has proven to be useful for the VQA task, but there still exists a significant semantic gap between image and natural language. For example, given an image of a group of zebras (see Figure 1), the model may recognize the black and white pixels, but not which white and black pixels belong to which zebra. Thus, it is difficult to answer questions such as "Is the zebra at the far right a baby zebra?" or "Are all the zebras eating grass?". A VQA system needs to recognize not only the objects ("zebras") and the surrounding environment ("grass"), but also the semantics about actions ("eating") and locations ("at the far right") in both images and questions.

In order to capture this type of action and location information, we need to go beyond mere object detection in image understanding, and learn a more holistic view of the visual scene in the image, by interpreting the dynamics and interactions between different objects in an image. One possible solution is to detect the relative geometrical positions of neighboring objects (e.g., <motorcycle-next-to-car>), to align with spatial descriptions in the question. Another direction is to learn semantic dependencies between objects (e.g., <girl-eating-cake>) to capture the interactive dynamics in the visual scene.

Motivated by this, we propose a Relation-aware Graph Attention Network (ReGAT) for VQA, introducing a novel relation encoder that captures these inter-object relations beyond static object/region detection. These visual relation features can reveal more fine-grained visual concepts in the image, which in turn provide a holistic scene interpretation that can be used for answering semantically-complicated questions. In order to cover the high variance in image scenes and question types, both explicit (e.g., spatial/positional, semantic/actionable) relations and implicit relations are learned by the relation encoder, where images are represented as graphs and interactions between objects are captured via a graph attention mechanism.

Furthermore, the graph attention is learned based on the context of the question, permitting the injection of semantic information from the question into the relation encoding stage. In this way, the features learned by the relation encoder not only capture object-interactive visual contents in the image, but also absorb the semantic clues in the question, to dynamically focus on particular relation types and instances for each question on the fly.

Figure 1 shows an overview of the proposed model. First, a Faster R-CNN is used to generate a set of object region proposals, and a question encoder is used for question embedding. The convolutional and bounding-box features of each region are then injected into the relation encoder to learn relation-aware, question-adaptive, region-level representations from the image. These relation-aware visual features and the question embeddings are then fed into a multimodal fusion module to produce a joint representation, which is used in the answer prediction module to generate an answer.

In principle, our work is different from (and compatible with) existing VQA systems. It is pivoted on a new dimension: using question-adaptive inter-object relations to enrich image representations in order to enhance VQA performance. The contributions of our work are three-fold:

• We propose a novel graph-based relation encoder to learn both explicit and implicit relations between visual objects via graph attention networks.
• The learned relations are question-adaptive, meaning that they can dynamically capture visual object relations that are most relevant to each question.
• We show that our ReGAT model is a generic approach that can be used to improve state-of-the-art VQA models on the VQA 2.0 dataset. Our model also achieves state-of-the-art performance on the more challenging VQA-CP v2 dataset.

2. Related Work

2.1. Visual Question Answering

The current dominant framework for VQA systems consists of an image encoder, a question encoder, multimodal fusion, and an answer predictor. In lieu of directly using visual features from CNN-based feature extractors, [56, 11, 41, 33, 49, 38, 63, 36] explored various image attention mechanisms to locate regions that are relevant to the question. To learn a better representation of the question, [33, 38, 11] proposed to perform question-guided image attention and image-guided question attention collaboratively, to merge knowledge from both visual and textual modalities in the encoding stage. [15, 25, 60, 4, 24] explored higher-order fusion methods to better combine textual information with visual information (e.g., using bilinear pooling instead of simpler first-order methods such as summation, concatenation and multiplication).

To make the model more interpretable, some literature [30, 59, 29, 54, 55, 53] also exploited high-level semantic information in the image, such as attributes, captions and visual relation facts. Most of these methods applied VQA-independent models to extract semantic knowledge from the image, while [34] built a Relation-VQA dataset and directly mined VQA-specific relation facts to feed additional semantic information to the model. A few recent studies [48, 35, 29] investigated how to incorporate memory to aid the reasoning step, especially for difficult questions.

However, the semantic knowledge brought in by either memory or high-level semantic information is usually converted into a textual representation, instead of being used directly as a visual representation, which contains richer and more indicative information about the image. Our work is complementary to these prior studies in that we encode object relations directly into the image representation, and the relation encoding step is generic and can naturally fit into any state-of-the-art VQA model.

2.2. Visual Relationship

Visual relationship was explored before deep learning became popular. Early work [10, 14, 7, 37] presented methods to re-score the detected objects by considering object relations (e.g., co-occurrence [10], position and size [5]) as post-processing steps for object detection. Some previous work [16, 17] also probed the idea that spatial relationships (e.g., "above", "around", "below" and "inside") between objects can help improve image segmentation.

Visual relationship has proven to be crucial to many computer vision tasks. For example, it aided the cognitive task of mapping images to captions [13, 12, 58] and improved image search [47, 23] and object localization [45, 21]. Recent work on visual relationship [45, 43, 9] focused more on non-spatial relation, also known as "semantic
Figure 2. Model architecture of the proposed ReGAT for visual question answering. Faster R-CNN is employed to detect a set of object
regions. These region-level features are then fed into different relation encoders to learn relation-aware question-adaptive visual features,
which will be fused with question representation to predict an answer. Multimodal fusion and answer predictor are omitted for simplicity.
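To make the data flow in this architecture concrete, here is a minimal, illustrative skeleton of the pipeline (the module names, the stand-in MLP relation encoder, and the toy elementwise fusion are our own assumptions for the sketch, not the released implementation, which uses graph attention over spatial/semantic/implicit graphs):

```python
import torch
import torch.nn as nn

class ReGATSketch(nn.Module):
    """Illustrative skeleton of the pipeline in Figure 2 (assumed names).

    Region features from an object detector are conditioned on the
    question by a stand-in MLP over concatenated [region || question]
    vectors, pooled, fused with the question embedding, and scored
    over the answer vocabulary. Only the data flow matches the paper.
    """

    def __init__(self, d_v=2048, d_q=1024, n_answers=3129):
        super().__init__()
        # Stand-in for the relation encoder (the paper uses graph attention).
        self.relation_enc = nn.Sequential(nn.Linear(d_v + d_q, d_v), nn.ReLU())
        self.q_proj = nn.Linear(d_q, d_v)  # project question into visual space
        self.classifier = nn.Sequential(
            nn.Linear(d_v, d_q), nn.ReLU(), nn.Linear(d_q, n_answers))

    def forward(self, v, q):
        # v: (B, K, d_v) region features; q: (B, d_q) question embedding.
        K = v.size(1)
        vq = torch.cat([v, q.unsqueeze(1).expand(-1, K, -1)], dim=-1)
        v_rel = v + self.relation_enc(vq)      # question-conditioned, residual
        fused = v_rel.mean(dim=1) * self.q_proj(q)  # toy multimodal fusion
        return self.classifier(fused)          # answer logits, (B, n_answers)

logits = ReGATSketch()(torch.randn(2, 36, 2048), torch.randn(2, 1024))
print(logits.shape)  # torch.Size([2, 3129])
```

The dimensions (K = 36 regions, 2048-d visual features, 1024-d question embedding, 3,129 answer candidates) follow the values reported later in Sections 3 and 4.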

relation" (i.e., actions of, or interactions between, objects). A few neural network architectures have been designed for the visual relationship prediction task [32, 8, 61].

2.3. Relational Reasoning

We refer to the aforementioned visual relationships as explicit relations, which have been shown to be effective for image captioning [58]. Specifically, [58] exploited pre-defined semantic relations learned from the Visual Genome dataset [28] and spatial relations between objects. A graph was then constructed based on these relations, and a Graph Convolutional Network (GCN) [26] was used to learn representations for each object.

Another line of research focuses on implicit relations, where no explicit semantic or spatial relations are used to construct the graph. Instead, all the relations are implicitly captured by an attention module or via higher-order methods over the fully-connected graph of an input image [46, 21, 6, 57], to model the interactions between detected objects. For example, [46] reasons over all the possible pairs of objects in an image via the use of simple MLPs. In [6], a bilinear fusion method, called the MuRel cell, was introduced to perform pairwise relationship modeling.

Some other works [50, 39, 52] have been proposed for learning question-conditioned graph representations for images. Specifically, [39] introduced a graph learner module that is conditioned on question representations to compute the image representations using pairwise attention and spatial graph convolutions. [50] exploited structured question representations such as parse trees, and used a GRU to model contextualized interactions between both objects and words. A more recent work [52] introduced a sparser graph defined by inter/intra-class edges, in which relationships are implicitly learned via a language-guided graph attention mechanism. However, all these works still focused on implicit relations, which are less interpretable than explicit relations.

Our contributions Our work is inspired by [21, 58]. However, different from them, ReGAT considers both explicit and implicit relations to enrich image representations. For explicit relations, our model uses a Graph Attention Network (GAT) rather than a simple GCN as used in [58]. As opposed to GCNs, the use of GAT allows for assigning different importance to nodes of the same neighborhood. For implicit relations, our model learns a graph that is adaptive to each question by filtering out question-irrelevant relations, instead of treating all the relations equally as in [21]. In experiments, we conduct detailed ablation studies to demonstrate the effectiveness of each individual design.

3. Relation-aware Graph Attention Network

The VQA task is defined as follows: given a question q grounded in an image I, the goal is to predict an answer â ∈ A that best matches the ground-truth answer a*. As is common practice in the VQA literature, this can be defined as a classification problem:

    â = arg max_{a∈A} p_θ(a | I, q),    (1)

where p_θ is the trained model.

Figure 2 gives a detailed illustration of our proposed model, consisting of an Image Encoder, a Question Encoder, and a Relation Encoder. For the Image Encoder, Faster R-CNN [2] is used to identify a set of objects V = {v_i}_{i=1}^K, where each object v_i is associated with a visual feature vector v_i ∈ R^{d_v} and a bounding-box feature vector b_i ∈ R^{d_b} (K = 36, d_v = 2048, and d_b = 4 in our experiments). Each b_i = [x, y, w, h] corresponds to a 4-dimensional spatial coordinate, where (x, y) denotes the coordinate of the top-left corner of the bounding box, and h/w corresponds to the height/width of the box. For the Question Encoder, we use a bidirectional RNN with Gated Recurrent Units (GRU) and perform self-attention on the sequence of RNN hidden states to generate the question embedding q ∈ R^{d_q} (d_q = 1024 in our experiments). The following sub-sections explain the details of the Relation Encoder.
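The question encoder just described (word embeddings fed to a bidirectional GRU, followed by self-attention pooling over the hidden states) might be sketched as follows; the single-layer attention scorer is a plausible choice we assume, since the paper does not spell out its exact form:

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """Bi-GRU question encoder with self-attention pooling (sketch).

    Produces one question embedding q of dimension d_q from a padded
    sequence of word-embedding vectors. Each GRU direction uses
    d_q // 2 units so the concatenated hidden states match d_q.
    """

    def __init__(self, word_dim=600, d_q=1024):
        super().__init__()
        self.gru = nn.GRU(word_dim, d_q // 2, batch_first=True,
                          bidirectional=True)
        self.att = nn.Linear(d_q, 1)  # scores each time step (assumed form)

    def forward(self, word_embs):           # (B, T, word_dim)
        h, _ = self.gru(word_embs)          # (B, T, d_q)
        weights = torch.softmax(self.att(h), dim=1)  # (B, T, 1)
        return (weights * h).sum(dim=1)     # (B, d_q)

# Example: batch of 2 questions, 14 tokens, 600-d word embeddings
# (the dimensions reported in Section 4.2).
q = QuestionEncoder()(torch.randn(2, 14, 600))
print(q.shape)  # torch.Size([2, 1024])
```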

3.1. Graph Construction

Fully-connected Relation Graph By treating each object v_i in the image as one vertex, we can construct a fully-connected undirected graph G_imp = (V, E), where E is the set of K × (K − 1) edges. Each edge represents an implicit relation between two objects, which can be reflected by the learned weight assigned to each edge through graph attention. All the weights are learned implicitly, without any prior knowledge. We name the relation encoder built on this graph G_imp the implicit relation encoder.

Pruned Graph with Prior Knowledge On the other hand, if explicit relations between vertices are available, one can readily transform the fully-connected graph G_imp into an explicit relation graph, by pruning the edges where the corresponding explicit relation does not exist. For each pair of objects i, j, if <i-p-j> is a valid relation, an edge is created from i to j, with an edge label p. In addition, we assign each object node i a self-loop edge and label this edge as identical. In this way, the graph becomes sparse, and each edge encodes prior knowledge about one inter-object relation in the image. We name the relation encoder built upon this graph the explicit relation encoder.

The explicit nature of these features requires pre-trained classifiers to extract the relations in the form of discrete class labels, which represent the dynamics and interactions between objects explicit to the human eye. Different types of explicit relations can be learned based on this pruned graph. In this paper, we explore two instances, spatial and semantic graphs, to capture positional and actionable relations between objects, which is imperative for visual question answering.

Spatial Graph Let spa_{i,j} = <object_i-predicate-object_j> denote the spatial relation that represents the relative geometric position of object_j against object_i. In order to construct a spatial graph G_spa, given two object region proposals object_i and object_j, we classify spa_{i,j} into 11 different categories [58] (e.g., object_i is inside object_j (class 1), object_j is inside object_i (class 2), as illustrated in Figure 3(a)), including a no-relation class retained for objects that are too far away from each other. Note that edges formed by spatial relations are symmetric: if <object_i-p_{i,j}-object_j> is a valid spatial relation, there must be a valid spatial relation spa_{j,i} = <object_j-p_{j,i}-object_i>. However, the two predicates p_{i,j} and p_{j,i} are different.

Semantic Graph In order to construct the semantic graph G_sem, semantic relations between objects need to be extracted (e.g., <subject-predicate-object>).

Figure 3. Illustration of spatial and semantic relations: (a) Spatial Relation; (b) Semantic Relation. The green arrows denote the direction of relations (subject → object). Labels in green boxes are class labels of relations. Red and blue boxes contain class labels of objects.

This can be formulated as a classification task [58] by training a semantic relation classifier on a visual relationship dataset (e.g., Visual Genome [27]). Given two object regions i and j, the goal is to determine which predicate p represents a semantic relation <i-p-j> between these two regions. Here, the relations between the subject i and the object j are not interchangeable, meaning that the edges formed by semantic relations are not symmetric. For a valid <i-p_{i,j}-j>, there may not exist a relation <j-p_{j,i}-i> within our definition. For example, <man-holding-bat> is a valid relation, while there is no semantic relation from bat to man.

The classification model takes three inputs: the feature vector of the subject region v_i, the feature vector of the object region v_j, and the region-level feature vector v_{i,j} of the union bounding box containing both i and j. These three types of features are obtained from a pre-trained object detection model, and then transformed via an embedding layer. The embedded features are then concatenated and fed into a classification layer to produce a softmax probability over 14 semantic relations, with an additional no-relation class. The trained classifier is then used to predict relations between any pair of object regions in a given image. Examples of semantic relations are shown in Figure 3(b).

3.2. Relation Encoder

Question-adaptive Graph Attention The proposed relation encoder is designed to encode relational dynamics between objects in an image. For the VQA task, different types of relations may be useful for different question types. Thus, in designing the relation encoder, we use a question-adaptive attention mechanism to inject semantic information from the question into the relation graphs, dynamically assigning higher weights to those relations that are most relevant to each question. This is achieved by first concatenating the question embedding q with each of the K visual features v_i, denoted as

    v'_i = [v_i || q]  for i = 1, . . . , K.    (2)

Self-attention is then performed on the vertices, which generates hidden relation features {v*_i}_{i=1}^K that characterize the relations between a target object and its neighboring objects. Based on this, each relation graph goes through the following attention mechanism:

    v*_i = σ( Σ_{j∈N_i} α_{ij} · W v'_j ).    (3)

For different types of relation graph, the definition of the attention coefficients α_{ij} varies, as does the projection matrix W ∈ R^{d_h×(d_q+d_v)} and the neighborhood N_i of object i. σ(·) is a nonlinear function such as ReLU. To stabilize the learning process of self-attention, we also extend the above graph attention mechanism by employing multi-head attention, where M independent attention mechanisms are executed and their output features are concatenated, resulting in the following output feature representation:

    v*_i = ||_{m=1}^M σ( Σ_{j∈N_i} α^m_{ij} · W^m v'_j ).    (4)

In the end, v*_i is added to the original visual feature v_i to serve as the final relation-aware feature.

Implicit Relation Since the graph for learning implicit relations is fully-connected, N_i contains all the objects in the image, including object i itself. Inspired by [21], we design the attention weight α_{ij} to depend not only on the visual-feature weight α^v_{ij}, but also on the bounding-box weight α^b_{ij}. Specifically,

    α_{ij} = α^b_{ij} · exp(α^v_{ij}) / Σ_{j=1}^K α^b_{ij} · exp(α^v_{ij}),    (5)

where α^v_{ij} represents the similarity between the visual features, computed by scaled dot-product [51]:

    α^v_{ij} = (U v'_i)^T · V v'_j,    (6)

where U, V ∈ R^{d_h×(d_q+d_v)} are projection matrices. α^b_{ij} measures the relative geometric position between any pair of regions:

    α^b_{ij} = max{0, w · f_b(b_i, b_j)},    (7)

where f_b(·, ·) first computes a 4-dimensional relative geometry feature (log(|x_i − x_j|/w_i), log(|y_i − y_j|/h_i), log(w_j/w_i), log(h_j/h_i)), then embeds it into a d_h-dimensional feature by computing cosine and sine functions of different wavelengths. w ∈ R^{d_h} transforms the d_h-dimensional feature into a scalar weight, which is further trimmed at 0. Unlike the explicit relation setting, where we assume no-relation for objects that are too far away from each other, the restrictions for implicit relations are learned through w and the zero-trimming operation.

Explicit Relation We consider the semantic relation encoder first. Since edges in the semantic graph E_sem now contain label information and are directional, we design the attention mechanism in (3) to be sensitive to both directionality (v_i-to-v_j, v_j-to-v_i and v_i-to-v_i) and labels. Specifically,

    v*_i = σ( Σ_{j∈N_i} α_{ij} · (W_{dir(i,j)} v'_j + b_{lab(i,j)}) ),    (8)

    α_{ij} = exp((U v'_i)^T · V_{dir(i,j)} v'_j + c_{lab(i,j)}) / Σ_{j∈N_i} exp((U v'_i)^T · V_{dir(i,j)} v'_j + c_{lab(i,j)}),

where W_{·}, V_{·} are matrices, and b_{·}, c_{·} are bias terms. dir(i, j) selects the transformation matrix with respect to the directionality of each edge, and lab(i, j) represents the label of each edge. Consequently, after encoding all the regions {v'_i}_{i=1}^K via the above graph attention mechanism, the refined region-level features {v*_i}_{i=1}^K are endowed with the prior semantic relations between objects.

As opposed to graph convolutional networks, this graph attention mechanism effectively assigns different weights of importance to nodes of the same neighborhood. Combined with the question-adaptive mechanism, the learned attention weights can reflect which relations are relevant to a specific question. The relation encoder works in the same manner on the spatial graph E_spa, with a different set of parameters to be learned; details are omitted for simplicity.

3.3. Multimodal Fusion and Answer Prediction

After obtaining relation-aware visual features, we want to fuse the question information q with each visual representation v*_i through a multimodal fusion strategy. Since our relation encoder preserves the dimensionality of visual features, it can be incorporated with any existing multimodal fusion method to learn a joint representation J:

    J = f(v*, q; Θ),    (9)

where f is a multimodal fusion method and Θ are the trainable parameters of the fusion module.

For the Answer Predictor, we adopt a two-layer multi-layer perceptron (MLP) as the classifier, with the joint representation J as the input. Binary cross entropy is used as the loss function, similar to [2].

In the training stage, the different relation encoders are trained independently. In the inference stage, we combine the three graph attention networks with a weighted sum of the predicted answer distributions. Specifically, the final answer distribution is calculated by:

    Pr(a = a_i) = α Pr_sem(a = a_i) + β Pr_spa(a = a_i) + (1 − α − β) Pr_imp(a = a_i),    (10)

where α and β are trade-off hyper-parameters (0 ≤ α + β ≤ 1, 0 ≤ α, β ≤ 1).
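Numerically, the inference-time combination in Eq. (10) is just a convex combination of the three answer distributions. A minimal sketch (the function name and toy distributions are ours; α = 0.4, β = 0.3 are the example values quoted in Section 4.3):

```python
def ensemble_answer_distribution(pr_sem, pr_spa, pr_imp, alpha=0.4, beta=0.3):
    """Weighted sum of per-relation answer distributions, as in Eq. (10).

    pr_sem, pr_spa, pr_imp are probability vectors over the answer
    vocabulary from the semantic, spatial and implicit models.
    """
    assert 0.0 <= alpha and 0.0 <= beta and alpha + beta <= 1.0
    gamma = 1.0 - alpha - beta  # weight left over for the implicit model
    return [alpha * s + beta * p + gamma * m
            for s, p, m in zip(pr_sem, pr_spa, pr_imp)]

# Toy distributions over a 4-answer vocabulary.
p = ensemble_answer_distribution([0.7, 0.1, 0.1, 0.1],
                                 [0.2, 0.6, 0.1, 0.1],
                                 [0.25, 0.25, 0.25, 0.25])
print(round(sum(p), 6))  # 1.0 — a convex combination stays a distribution
```

Because the weights are non-negative and sum to one, the combined vector is itself a valid probability distribution, so the arg max in Eq. (1) can be applied to it directly.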
Pr_sem(a = a_i), Pr_spa(a = a_i) and Pr_imp(a = a_i) denote the predicted probability for answer a_i from the model trained with semantic, spatial and implicit relations, respectively.

4. Experiments

We evaluate our proposed model on the VQA 2.0 and VQA-CP v2 datasets [3, 19, 1]. In addition, Visual Genome [27] is used to pre-train the semantic relation classifier. It is also used to augment the VQA dataset when testing on the test-dev and test-std splits. We use accuracy as the evaluation metric:

    Acc(ans) = min(1, (#humans provided ans) / 3).    (11)

4.1. Datasets

VQA 2.0 is composed of real images from MSCOCO [31] with the same train/validation/test splits. For each image, an average of 3 questions are generated. These questions are divided into 3 categories: Y/N, Number and Other. 10 answers are collected for each image-question pair from human annotators, and the most frequent answer is selected as the correct answer. Both open-ended and multiple-choice question types are included in this dataset. In this work, we focus on the open-ended task, and take the answers that appeared more than 9 times in the training set as candidate answers, which produces 3,129 answer candidates. The model is trained on the training set; when testing on the test set, both the training and validation sets are used for training, and the most probable answer is selected as the predicted answer.

VQA-CP v2 is a derivation of the VQA 2.0 dataset, introduced to evaluate and reduce question-oriented bias in VQA models. In particular, the distribution of answers with respect to question types differs between the training and test splits.

Visual Genome contains 108K images with densely annotated objects, attributes and relationships, which we use to pre-train the semantic relation classifier in our model. We filtered out those images that also appear in the VQA validation set, and split the relation data into 88K for training, 8K for validation, and 8K for testing. Furthermore, we selected the top-14 most frequent predicates in the training data, after normalizing the predicates with the relationship aliases provided in Visual Genome. The final semantic relation classifier is trained over 14 relation classes plus a no-relation class.

4.2. Implementation Details

Each question is tokenized and each word is embedded using 600-dimensional word embeddings (including 300-dimensional GloVe word embeddings [42]). The sequence of embedded words is then fed into the GRU, one token per time step, up to the 14th token (similar to [24]). Questions shorter than 14 words are padded at the end with zero vectors. The dimension of the hidden layer in the GRU is set to 1024. We employ multi-head attention with 16 heads for all three graph attention networks. The dimension of relation features is set to 1024. For implicit relations, we set the embedded relative geometry feature dimension d_h to 64.

For the semantic relation classifier, we extract pre-trained object detection features with known bounding boxes from a Faster R-CNN [44] model in conjunction with ResNet-101 [20]. More specifically, the features are the output of the Pool5 layer after RoI pooling from the Res4b22 feature map [58]. The Faster R-CNN model is trained over 1,600 selected object classes and 400 attribute classes, similar to the bottom-up attention [2].

Our model is implemented in PyTorch [40]. In experiments, we use the Adamax optimizer for training, with a mini-batch size of 256. For the learning rate, we employ a warm-up strategy [18]. Specifically, we begin with a learning rate of 0.0005, linearly increasing it at each epoch until it reaches 0.002 at epoch 4. After 15 epochs, the learning rate is decreased by 1/2 every 2 epochs, up to 20 epochs. Every linear mapping is regularized by weight normalization and dropout (p = 0.2, except for the classifier with 0.5).

4.3. Experimental Results

This sub-section provides experimental results on the VQA 2.0 and VQA-CP v2 datasets. By design, the relation encoder can be composed into different VQA architectures as a plug-and-play component. In our experiments, we consider three popular VQA models with different multimodal fusion methods: Bottom-up Top-down [2] (BUTD), Multimodal Tucker Fusion [4] (MUTAN), and Bilinear Attention Network [24] (BAN). Table 1 reports results on the VQA 2.0 validation set in the following settings:

• Imp / Sem / Spa: only one single type of relation (implicit, semantic or spatial) is used to incorporate bottom-up attention features.
• Imp+Sem / Imp+Spa / Sem+Spa: two different types of relations are used via weighted sum.
• All: all three types of relations are utilized, through weighted sum (e.g., α = 0.4, β = 0.3). See Eqn. (10) for details.

Fusion Method | Baseline            | BiLSTM | Imp        | Sem        | Spa        | Imp+Sem | Imp+Spa | Sem+Spa | All
BUTD [2]      | 63.15 (63.38†)      | 61.95  | 64.10      | 64.11      | 64.02      | 64.93   | 64.92   | 64.84   | 65.30
MUTAN [4]     | 58.16 (61.36†)      | 61.22  | 62.45      | 62.60      | 62.01      | 63.99   | 63.70   | 63.89   | 64.37
BAN [24]      | 65.36±0.14 (65.51†) | 64.55  | 65.93±0.06 | 65.97±0.05 | 66.02±0.12 | 66.81   | 66.76   | 66.85   | 67.18

Table 1. Performance on the VQA 2.0 validation set with different fusion methods. Consistent improvements are observed across 3 popular fusion methods, which demonstrates that our model is compatible with generic VQA frameworks. (†) Results based on our re-implementations.

Compared to the baseline models, we observe consistent performance gains for all three architectures after adding the proposed relation encoder. These results demonstrate that our ReGAT model is a generic approach that can be used to improve state-of-the-art VQA models. Furthermore, the results indicate that each single relation helps improve the performance, and pairwise combinations of relations achieve consistent performance gains. When all three types are combined, our model achieves the best performance. Our best results are achieved by combining the best single-relation models through weighted sum. To verify that the performance gain is significant, we performed a t-test on the results of our BAN baseline and our proposed model with each single relation. We report the standard deviation in Table 1; the p-value is 0.001459, so the improvement from our method is significant at p < 0.05. We also compare with an additional baseline model that uses a BiLSTM as a contextualized relation encoder; the results show that using BiLSTM hurts the performance.

To demonstrate the generalizability of our ReGAT model, we also conduct experiments on the VQA-CP v2 dataset, where the distributions of the training and test splits

Model | SOTA [6] | Baseline | Sem   | Spa   | Imp   | All
Acc.  | 39.54    | 39.24    | 39.54 | 40.30 | 39.58 | 40.42

Table 2. Model accuracy on the VQA-CP v2 benchmark (open-ended setting on the test split).

BAN [24], which uses eight bilinear attention maps, our model outperforms BAN with fewer glimpses. Pythia [22] achieved 70.01 by adding additional grid-level features and using 100 object proposals from a fine-tuned Faster R-CNN on the VQA dataset for all images. Our model, without any feature augmentation used in their work, surpasses Pythia's performance by a large margin.

Model              | Test-dev: Overall | Y/N   | Num   | Other | Test-std
BUTD [49]          | 65.32             | 81.82 | 44.21 | 56.05 | 65.67
MFH [60]           | 68.76             | 84.27 | 50.66 | 60.50 | -
Counter [62]       | 68.09             | 83.14 | 51.62 | 58.97 | 68.41
Pythia [22]        | 70.01             | -     | -     | -     | 70.24
BAN [24]           | 70.04             | 85.42 | 54.04 | 60.52 | 70.35
v-AGCN [57]        | 65.94             | 82.39 | 56.46 | 45.93 | 66.17
Graph learner [39] | -                 | -     | -     | -     | 66.18
MuRel [6]          | 68.03             | 84.77 | 49.84 | 57.85 | 68.41
ReGAT (ours)       | 70.27             | 86.08 | 54.42 | 60.33 | 70.58

Table 3. Model accuracy on the VQA 2.0 benchmark (open-ended setting on the test-dev and test-std splits).

4.4. Ablation Study

In Table 4, we compare three ablated instances of ReGAT with its complete form. Specifically, we validate the importance of concatenating question features to each object representation and of the attention mechanism. All the results reported in Table 4 are based on the BUTD model architecture. To remove the attention mechanism from our relation encoder, we simply replace the graph attention network with a graph convolutional network, which can also learn node representations from graphs, but with a simple linear transformation.

Att. | Q-adaptive | Semantic | Spatial | Implicit | All
No   | No         | 63.20    | 63.04   | n/a      | n/a
Yes  | No         | 63.90    | 63.85   | 63.36    | 64.98
No   | Yes        | 63.31    | 63.13   | n/a      | n/a
Yes  | Yes        | 64.11    | 64.02   | 64.10    | 65.30

Table 4. Performance on the VQA 2.0 validation set for the ablation study (Q-adaptive: question-adaptive; Att: Attention).

Firstly, we validate the effectiveness of using the attention mechanism to learn relation-aware visual features. Adding the attention mechanism leads to higher accuracy for all three types of relation. Comparison between line 1 and line 2 shows a gain of +0.70 for semantic relation and +0.81 for
are very different from each other. Table 2 shows results on spatial relation. Secondly, we validate the effectiveness of
VQA-CP v2 test split. Here we use BAN with four glimpses question-adaptive relation features. Between line 1 and line
as the baseline model. Consistent with what we have ob- 3, we see a gain of approximately +0.1 for both seman-
served on VQA 2.0, our ReGAT model surpasses the base- tic and spatial relations. Finally, attention mechanism and
line by a large margin. With only single relation, our model question-adaptive features are added to give the complete
has already achieved state-of-the-art performance on VQA- ReGAT model. This instance gives the highest accuracy
CP v2 (40.30 vs. 39.54). When adding all the relations, the (line 4). Surprisingly, by comparing line 1 and line 4, we
performance gain was further lifted to +0.88. can observe that combining graph attention with question-
Table 3 shows single-model results on VQA 2.0 test-dev adaptive gives better gain than simply adding the individual
and test-std splits. The top five rows show results from gains from the two methods. It is worth mentioning that
models without relational reasoning, and the bottom four for implicit relation, adding question-adaptive improves the
rows are results from models with relational reasoning. Our model performance by +0.74, which is higher than the gain
model surpasses all previous work with or without relational from question-adaptive for the two explicit relations. When
reasoning. Our final model uses bilinear attention with four all the relations are considered, we observe consistent per-
glimpses as the multimodal fusion method. Compared to formance gain by adding the question-adaptive mechanism.
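The attention-removal ablation described above can be sketched in a few lines: a graph-convolution update aggregates each node's neighbors with uniform, degree-normalized weights, whereas the graph-attention update re-weights neighbors with learned, input-dependent coefficients before aggregating. This is an illustrative single-head sketch with random weights, not the paper's implementation; the feature sizes, the ReLU activations, and the dot-product form of the attention logit are assumptions.

```python
import numpy as np

def gcn_update(H, A, W):
    """Graph-convolution node update: neighbors (plus self-loop) are
    averaged with uniform degree-normalized weights, then linearly
    transformed -- no attention."""
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    D_inv = np.diag(1.0 / A_hat.sum(axis=1))       # degree normalization
    return np.maximum(D_inv @ A_hat @ H @ W, 0.0)  # ReLU

def gat_update(H, A, W, a):
    """Graph-attention node update: the same aggregation, but neighbor
    weights come from a learned softmax instead of 1/degree."""
    N = A.shape[0]
    Z = H @ W
    A_hat = A + np.eye(N)
    scores = np.full((N, N), -np.inf)              # -inf masks non-edges
    for i in range(N):
        for j in range(N):
            if A_hat[i, j] > 0:
                # attention logit from the projected node pair
                scores[i, j] = np.concatenate([Z[i], Z[j]]) @ a
    alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)      # row-wise softmax
    return np.maximum(alpha @ Z, 0.0)              # ReLU

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))   # 4 object nodes with 8-d visual features
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
W = rng.normal(size=(8, 8))
a = rng.normal(size=(16,))
print(gcn_update(H, A, W).shape, gat_update(H, A, W, a).shape)
```

The two functions differ only in how neighbor contributions are weighted (fixed D⁻¹(A+I) vs. a learned row-stochastic matrix), which is exactly what the Att. column of Table 4 toggles.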
Figure 4. Visualization of attention maps learned from ablated instances: the three bounding boxes shown in each image are the top-3 attended regions. The numbers are attention weights.

Figure 5. Visualization of different types of visual object relations in the VQA task. The three bounding boxes shown in each image are the top-3 attended regions. Green arrows indicate relations from subject to object. Labels and numbers in green boxes are class labels for explicit relations and attention weights for implicit relations.

To better understand how these two components help answer questions, we further visualize and compare the attention maps learned by the ablated instances in Section 4.5.

4.5. Visualization

To better illustrate the effectiveness of adding the graph attention and question-adaptive mechanisms, we compare the attention maps learned by the complete ReGAT model in a single-relation setting with those learned by two ablated models. As shown in Figure 4, the second, third and last rows correspond to lines 1, 3 and 4 in Table 4, respectively. Comparing row 2 with row 3 leads to the observation that graph attention helps capture the interactions between objects, which contributes to a better alignment between image regions and questions. Rows 3 and 4 show that adding the question-adaptive attention mechanism produces sharper attention maps and focuses on more relevant regions. These visualization results are consistent with the quantitative results reported in Table 4.

Figure 5 provides visualization examples of how different types of relations help improve performance. In each example, we show the top-3 attended regions and the learned relations between these regions. As shown in these examples, each relation type contributes to a better alignment between image regions and questions. For example, in Figure 5(a), the semantic relations "Holding" and "Riding" resonate with the same words appearing in the corresponding questions. Figure 5(b) shows how spatial relations capture the relative geometric positions between regions. To visualize implicit relations, Figure 5(c) shows the attention weights to the top-1 region from every other region. Surprisingly, the learned implicit relations are able to capture both spatial and semantic interactions. For example, the top image in Figure 5(c) illustrates the spatial interaction "on" between the table and the vase, and the bottom image illustrates the semantic interaction "walking" between the traffic light and the person.

5. Conclusion

We have presented Relation-aware Graph Attention Network (ReGAT), a novel framework for visual question answering that models multi-type object relations with a question-adaptive attention mechanism. ReGAT exploits two types of visual object relations, Explicit Relations and Implicit Relations, to learn a relation-aware region representation through graph attention. Our method achieves state-of-the-art results on both the VQA 2.0 and VQA-CP v2 datasets. The proposed ReGAT model is compatible with generic VQA models: comprehensive experiments on two VQA datasets show that it can be infused into state-of-the-art VQA architectures in a plug-and-play fashion. For future work, we will investigate how to fuse the three relations more effectively and how to utilize each relation to solve specific question types.
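The significance check reported in Section 4.3 (p = 0.001459) can be reproduced in outline with a two-sample t-test over per-run accuracies. The sketch below uses Welch's unequal-variance form, which is an assumption (the paper does not specify the variant), and the per-run accuracies are hypothetical placeholders chosen only to echo the means and spreads reported in Table 1.

```python
import math

def welch_t(xs, ys):
    """Welch's two-sample t statistic and its Welch-Satterthwaite
    degrees of freedom (no equal-variance assumption)."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)  # sample variances
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    se2 = vx / nx + vy / ny
    t = (mx - my) / math.sqrt(se2)
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df

# Hypothetical per-run accuracies (placeholders, not the paper's raw
# numbers): means and spreads echo the BAN row of Table 1.
ban_baseline = [65.22, 65.36, 65.50]   # mean 65.36, std 0.14
regat_imp    = [65.87, 65.93, 65.99]   # mean 65.93, std 0.06

t, df = welch_t(regat_imp, ban_baseline)
print(f"t = {t:.2f}, df = {df:.2f}")
```

The t statistic can then be converted to a p-value with the Student-t survival function at the computed degrees of freedom (e.g., via scipy.stats, if available).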
References

[1] Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. Don't just assume; look and answer: Overcoming priors for visual question answering. In CVPR, 2018. 6
[2] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, 2018. 3, 5, 6, 7
[3] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In ICCV, 2015. 6
[4] Hedi Ben-Younes, Rémi Cadene, Matthieu Cord, and Nicolas Thome. MUTAN: Multimodal Tucker fusion for visual question answering. In ICCV, 2017. 2, 6, 7
[5] Irving Biederman, Robert J. Mezzanotte, and Jan C. Rabinowitz. Scene perception: Detecting and judging objects undergoing relational violations. Cognitive Psychology, 1982. 2
[6] Remi Cadene, Hedi Ben-younes, Matthieu Cord, and Nicolas Thome. MuRel: Multimodal relational reasoning for visual question answering. In CVPR, 2019. 3, 7
[7] Myung Jin Choi, Antonio Torralba, and Alan S. Willsky. A tree-based context model for object recognition. PAMI, 2012. 2
[8] Bo Dai, Yuqi Zhang, and Dahua Lin. Detecting visual relationships with deep relational networks. In CVPR, 2017. 3
[9] Santosh K. Divvala, Ali Farhadi, and Carlos Guestrin. Learning everything about anything: Webly-supervised visual concept learning. In CVPR, 2014. 2
[10] Santosh K. Divvala, Derek Hoiem, James H. Hays, Alexei A. Efros, and Martial Hebert. An empirical study of context in object detection. In CVPR, 2009. 2
[11] Haoqi Fan and Jiatong Zhou. Stacked latent attention for multimodal reasoning. In CVPR, 2018. 1, 2
[12] Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh K. Srivastava, Li Deng, Piotr Dollár, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C. Platt, et al. From captions to visual concepts and back. In CVPR, 2015. 2
[13] Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. Every picture tells a story: Generating sentences from images. In ECCV, 2010. 2
[14] Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester, and Deva Ramanan. Object detection with discriminatively trained part-based models. PAMI, 2010. 2
[15] Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. In EMNLP, 2016. 2
[16] Carolina Galleguillos, Andrew Rabinovich, and Serge Belongie. Object categorization using co-occurrence, location and appearance. In CVPR, 2008. 2
[17] Stephen Gould, Jim Rodgers, David Cohen, Gal Elidan, and Daphne Koller. Multi-class segmentation with relative location prior. IJCV, 2008. 2
[18] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017. 6
[19] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017. 6
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016. 6
[21] Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, and Yichen Wei. Relation networks for object detection. In CVPR, 2018. 2, 3, 5
[22] Yu Jiang, Vivek Natarajan, Xinlei Chen, Marcus Rohrbach, Dhruv Batra, and Devi Parikh. Pythia v0.1: The winning entry to the VQA Challenge 2018. arXiv preprint arXiv:1807.09956, 2018. 7
[23] Justin Johnson, Ranjay Krishna, Michael Stark, Li-Jia Li, David Shamma, Michael Bernstein, and Li Fei-Fei. Image retrieval using scene graphs. In CVPR, 2015. 2
[24] Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. Bilinear attention networks. In NeurIPS, 2018. 2, 6, 7
[25] Jin-Hwa Kim, Kyoung-Woon On, Woosang Lim, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. Hadamard product for low-rank bilinear pooling. In ICLR, 2017. 2
[26] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017. 3
[27] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael Bernstein, and Li Fei-Fei. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. arXiv preprint arXiv:1602.07332, 2016. 4, 6
[28] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 2017. 3
[29] Guohao Li, Hang Su, and Wenwu Zhu. Incorporating external knowledge to answer open-domain visual questions with dynamic memory networks. In CVPR, 2018. 2
[30] Qing Li, Jianlong Fu, Dongfei Yu, Tao Mei, and Jiebo Luo. Tell-and-answer: Towards explainable visual question answering using attributes and captions. In EMNLP, 2018. 2
[31] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014. 6
[32] Cewu Lu, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. Visual relationship detection with language priors. In ECCV, 2016. 3
[33] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image co-attention for visual question answering. In NIPS, 2016. 1, 2
[34] Pan Lu, Lei Ji, Wei Zhang, Nan Duan, Ming Zhou, and Jianyong Wang. R-VQA: Learning visual relation facts with semantic attention for visual question answering. In SIGKDD, 2018. 2
[35] Chao Ma, Chunhua Shen, Anthony Dick, Qi Wu, Peng Wang, Anton van den Hengel, and Ian Reid. Visual question answering with memory-augmented networks. In CVPR, 2018. 2
[36] Mateusz Malinowski, Carl Doersch, Adam Santoro, and Peter Battaglia. Learning visual question answering by bootstrapping hard attention. In ECCV, 2018. 2
[37] Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. The role of context for object detection and semantic segmentation in the wild. In CVPR, 2014. 2
[38] Hyeonseob Nam, Jung-Woo Ha, and Jeonghee Kim. Dual attention networks for multimodal reasoning and matching. In CVPR, 2017. 1, 2
[39] Will Norcliffe-Brown, Stathis Vafeias, and Sarah Parisot. Learning conditioned graph structures for interpretable visual question answering. In NeurIPS, 2018. 3, 7
[40] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017. 6
[41] Badri Patro and Vinay P. Namboodiri. Differential attention for visual question answering. In CVPR, 2018. 2
[42] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In EMNLP, 2014. 6
[43] Vignesh Ramanathan, Congcong Li, Jia Deng, Wei Han, Zhen Li, Kunlong Gu, Yang Song, Samy Bengio, Charles Rosenberg, and Li Fei-Fei. Learning semantic relationships for better action retrieval in images. In CVPR, 2015. 2
[44] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015. 6
[45] Mohammad Amin Sadeghi and Ali Farhadi. Recognition using visual phrases. In CVPR, 2011. 2
[46] Adam Santoro, David Raposo, David G. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. In NIPS, 2017. 3
[47] Sebastian Schuster, Ranjay Krishna, Angel Chang, Li Fei-Fei, and Christopher D. Manning. Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In Proceedings of the Fourth Workshop on Vision and Language, 2015. 2
[48] Zhou Su, Chen Zhu, Yinpeng Dong, Dongqi Cai, Yurong Chen, and Jianguo Li. Learning visual knowledge memory networks for visual question answering. In CVPR, 2018. 2
[49] Damien Teney, Peter Anderson, Xiaodong He, and Anton van den Hengel. Tips and tricks for visual question answering: Learnings from the 2017 challenge. In CVPR, 2018. 1, 2, 7
[50] Damien Teney, Lingqiao Liu, and Anton van den Hengel. Graph-structured representations for visual question answering. In CVPR, 2017. 3
[51] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017. 5
[52] Peng Wang, Qi Wu, Jiewei Cao, Chunhua Shen, Lianli Gao, and Anton van den Hengel. Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks. In CVPR, 2019. 3
[53] Peng Wang, Qi Wu, Chunhua Shen, and Anton van den Hengel. The VQA-Machine: Learning how to use existing vision algorithms to answer new questions. In CVPR, 2017. 2
[54] Qi Wu, Chunhua Shen, Peng Wang, Anthony Dick, and Anton van den Hengel. Image captioning and visual question answering based on attributes and external knowledge. PAMI, 2017. 2
[55] Qi Wu, Peng Wang, Chunhua Shen, Anthony Dick, and Anton van den Hengel. Ask me anything: Free-form visual question answering based on knowledge from external sources. In CVPR, 2016. 2
[56] Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked attention networks for image question answering. In CVPR, 2016. 1, 2
[57] Zhuoqian Yang, Jing Yu, Chenghao Yang, Zengchang Qin, and Yue Hu. Multi-modal learning with prior visual relation reasoning. arXiv preprint arXiv:1812.09681, 2018. 3, 7
[58] Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. Exploring visual relationship for image captioning. In ECCV, 2018. 2, 3, 4, 6
[59] Dongfei Yu, Jianlong Fu, Tao Mei, and Yong Rui. Multi-level attention networks for visual question answering. In CVPR, 2017. 2
[60] Zhou Yu, Jun Yu, Chenchao Xiang, Jianping Fan, and Dacheng Tao. Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering. IEEE Transactions on Neural Networks and Learning Systems, 2018. 2, 7
[61] Hanwang Zhang, Zawlin Kyaw, Shih-Fu Chang, and Tat-Seng Chua. Visual translation embedding network for visual relation detection. In CVPR, 2017. 3
[62] Yan Zhang, Jonathon S. Hare, and Adam Prügel-Bennett. Learning to count objects in natural images for visual question answering. In ICLR, 2018. 7
[63] Chen Zhu, Yanpeng Zhao, Shuaiyi Huang, Kewei Tu, and Yi Ma. Structured attentions for visual question answering. In ICCV, 2017. 2
