Improving Graph Neural Networks With Simple Architecture Design
Sunil Kumar Maurya, Tokyo Institute of Technology, Tokyo, Japan ([email protected])
Xin Liu, AIRC, AIST, Tokyo, Japan ([email protected])
Tsuyoshi Murata, Tokyo Institute of Technology, Tokyo, Japan ([email protected])
ABSTRACT

Graph Neural Networks have emerged as a useful tool to learn on data by applying additional constraints based on the graph structure. These graphs are often created with assumed intrinsic relations between the entities. In recent years, there have been tremendous improvements in architecture design, pushing performance up in various prediction tasks. In general, these neural architectures combine layer depth and node feature aggregation steps. This makes it challenging to analyze the importance of features at various hops and the expressiveness of the neural network layers. As different graph datasets show varying levels of homophily and heterophily in features and class label distribution, it becomes essential to understand which features are important for the prediction tasks without any prior information. In this work, we decouple the node feature aggregation step and the depth of the graph neural network, and introduce several key design strategies for graph neural networks. More specifically, we propose to use softmax as a regularizer and "Soft-Selector" of features aggregated from neighbors at different hop distances, and "Hop-Normalization" over GNN layers. Combining these techniques, we present a simple and shallow model, Feature Selection Graph Neural Network (FSGNN), and show empirically that the proposed model outperforms other state-of-the-art GNN models and achieves up to 64% improvement in accuracy on node classification tasks. Moreover, analyzing the learned soft-selection parameters of the model provides a simple way to study the importance of features in the prediction tasks. Finally, we demonstrate with experiments that the model is scalable for large graphs with millions of nodes and billions of edges. Source code is available at https://github.com/sunilkmaurya/FSGNN

KEYWORDS

Graph Neural Networks, Node Classification, Model Design, Feature Selection

ACM Reference Format:
Sunil Kumar Maurya, Xin Liu, and Tsuyoshi Murata. 2021. Improving Graph Neural Networks with Simple Architecture Design. In Proceedings of ACM Conference (Conference'17). ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

Graph Neural Networks (GNNs) have opened a unique path to learning on data by leveraging the intrinsic relations between entities that can be structured as a graph. By imposing these structural constraints, additional information can be learned and used for many types of prediction tasks. With the rapid development of the field and easy accessibility of computation and data, GNNs have been used to solve a variety of problems like node classification [15, 27, 1, 7], link prediction [32, 3, 4], graph classification [33, 34], prediction of molecular properties [10, 18], natural language processing [19], node ranking [20] and so on.

In this work, we focus on the node classification task using graph neural networks. Since the success of early GNN models like GCN [15], researchers have successively proposed numerous variants [30] to address various shortcomings in model training and to improve the prediction capabilities. Some of the techniques used in these variants include neighbor sampling [12, 6], attention mechanisms to assign different weights to neighbors [27], use of a Personalized PageRank matrix instead of the adjacency matrix [16], and simplified model design [29]. Also, there has been a growing interest in making the models deeper by stacking more layers and using residual connections to improve the expressiveness of the model [23, 7]. However, most of these models by design are more suitable for homophily datasets, where nodes linked to each other are more likely to belong to the same class. They may not perform well on heterophily datasets, which are more likely to have nodes with different labels connected together. Zhu et al. [35] highlight this problem and propose separating a node's ego-embedding from its neighbor-embedding to improve performance on heterophily datasets.

In general, GNN models combine feature aggregation and transformation using a learnable weight matrix in the same layer, often referred to as a graph convolutional layer. These layers are stacked together with non-linear transformations (e.g., ReLU) and regularization (e.g., Dropout) as a learning framework on the graph data. Stacking the layers also has the effect of introducing powers of the adjacency matrix (or laplacian matrix), which helps to generate a new set of features for a node by aggregating neighbors' features at multiple hops, thus encoding the neighborhood information. The number of these unique features depends on the propagation steps, or the depth of the model. The final node embeddings are the output of just the stacked layers or, for some models, also have a skip connection or residual connection combined at the final layer.

However, such a combination muddles the distinction between the importance of features and the expressiveness of the MLP. It becomes challenging to analyze which features are essential and how much expressiveness the MLP requires for a specific task. To overcome this challenge, we provide a framework to treat feature propagation and learning separately.
With this freedom, we propose a simple GNN model with three unique design considerations: soft-selection of features using the softmax function, hop-normalization, and unique mapping of features. With experimental results, we show that our simple 2-layer GNN outperforms other state-of-the-art GNN models (both shallow and deep) and achieves up to 64% higher node classification accuracy. In addition, analyzing the model parameters gives us an insight into identifying which features are most responsible for classification accuracy. One interesting observation we find is regarding the Chameleon and Squirrel datasets. These are dense graph datasets and are generally regarded as being low-quality heterophily datasets. However, in our experiments with our proposed model, we find them to show strong heterophily properties with improved classification results.

Furthermore, we demonstrate that due to the simple design of our model, it can scale up to very large graph datasets. We run experiments on the ogbn-papers100M dataset, which is the largest publicly available node classification dataset, and achieve higher accuracy than the state-of-the-art models.

The rest of the paper is organized as follows: Section 2 outlines the formulation of graph neural networks and details the node classification task. In Section 3, we discuss design strategies for GNNs and propose the GNN model FSGNN. In Section 4, we briefly introduce relevant GNN literature. Section 5 contains the experimental details and comparison with other GNN models. In Section 6, we empirically analyze our proposed design strategies and their effect on the model's performance. Section 7 summarizes the paper.

2 PRELIMINARIES

Let G = (V, E) be an undirected graph with n nodes and m edges. For numerical calculations, the graph is represented as an adjacency matrix denoted by A ∈ {0, 1}^{n×n}, with each element A_ij = 1 if there exists an edge between nodes v_i and v_j, and A_ij = 0 otherwise. If self-loops are added to the graph, the resultant adjacency matrix is denoted as Ã = A + I. The diagonal degree matrices of A and Ã are denoted as D and D̃, respectively. Each node is associated with a d-dimensional feature vector, and the feature matrix for all nodes is represented as X ∈ R^{n×d}.

2.1 Graph Neural Networks

Graph Neural Networks (GNNs) leverage a feature propagation mechanism [10] to aggregate neighborhood information of a node and use a non-linear transformation with a trainable weight matrix to get the final embeddings for the nodes. Conventionally, a simple GNN layer is defined as

    H^{(i+1)} = σ(Ã_sym H^{(i)} W^{(i)})    (1)

where Ã_sym = D̃^{-1/2} Ã D̃^{-1/2} is the symmetrically normalized adjacency matrix with added self-loops, H^{(i)} represents features from the previous layer, W^{(i)} denotes the learnable weight matrix, and σ is a non-linear activation function, usually ReLU in most implementations of GNNs. However, this formulation is suitable for homophily datasets, as features are cumulatively aggregated, i.e., a node's own features are added together with its neighbors' features. For heterophily datasets, we require a propagation scheme that separates the features of neighbors from the node's own features. So we use the following formulation for the GNN layer,

    H^{(i+1)} = σ(A_sym H^{(i)} W^{(i)})    (2)

where A_sym = D^{-1/2} A D^{-1/2} is the symmetrically normalized adjacency matrix without added self-loops. To combine features from multiple hops, a concatenation operator can be used before the final layer. Following the conventional GNN formulation using Ã, a simple 2-layered GNN can be represented as [15],

    Z = Ã_sym σ(Ã_sym X W^{(0)}) W^{(1)}    (3)

2.2 Node Classification

Node classification is an extensively studied graph-based semi-supervised learning problem. It encompasses training the GNN to predict labels of nodes based on the features and neighborhood structure of the nodes. The GNN model is considered as a function f(X, A) conditioned on the node features X and the adjacency matrix A. Taking the example of Eq. 3, the GNN aggregates the features of two hops of neighbors and outputs Z. The softmax function is applied row-wise, and the cross-entropy error is calculated over all labeled training examples. The gradients of the loss are back-propagated through the GNN layers. Once trained, the model can be used for the prediction of labels of nodes in the test set.

2.3 Homophily vs Heterophily

The node classification problem relies on the graph structure and the features of the nodes to identify the labels of the nodes. Under homophily, nodes are assumed to have neighbors with similar features and labels. Thus, the cumulative aggregation of a node's self-features with those of its neighbors reinforces the signal corresponding to the label and helps to improve the accuracy of the predictions. In the case of heterophily, nodes are assumed to have dissimilar features and labels. Here, the cumulative aggregation will reduce the signal and add more noise, causing the neural network to learn poorly and causing a drop in performance. Thus it is essential to keep a node's self-features separate from the neighbors' features. In real-world datasets, homophily and heterophily levels may vary, hence it is optimal to have both aggregation schemes (Eq. 1 & 2).

3 PROPOSED ARCHITECTURE

For the design of a GNN with good generalization capability and performance, there are many aspects of the data that need to be considered. The feature propagation and aggregation scheme is governed by whether the class label distribution has strong homophily or heterophily or some combination of both. The number of hops for feature aggregation (and the depth of the model for many GNN models) depends on the graph structure and size as well as the label distribution among the neighbors of the nodes. Also, the type and amount of regularization during training need to be decided, for example, using dropout on input features or on graph edges.

Keeping these aspects under consideration, we propose three design strategies that help to create a versatile and simple GNN model.
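Before moving on, the two propagation matrices used in Eq. 1 and Eq. 2 can be made concrete with a minimal sketch. This is our illustration, not the authors' released code; the function and variable names are ours, and a sparse representation would be used for large graphs instead of the dense tensors shown here:

    import torch

    def sym_norm(adj: torch.Tensor, add_self_loops: bool) -> torch.Tensor:
        # Return D^{-1/2} A D^{-1/2}, optionally adding self-loops first.
        if add_self_loops:
            adj = adj + torch.eye(adj.size(0))
        deg = adj.sum(dim=1)                       # node degrees
        d_inv_sqrt = deg.pow(-0.5)
        d_inv_sqrt[torch.isinf(d_inv_sqrt)] = 0.0  # guard isolated nodes
        return d_inv_sqrt.unsqueeze(1) * adj * d_inv_sqrt.unsqueeze(0)

    # toy 3-node path graph
    A = torch.tensor([[0., 1., 0.],
                      [1., 0., 1.],
                      [0., 1., 0.]])
    A_sym = sym_norm(A, add_self_loops=False)        # propagation matrix of Eq. 2
    A_tilde_sym = sym_norm(A, add_self_loops=True)   # propagation matrix of Eq. 1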
Figure 1: Model diagram of FSGNN. Input features are generated based on powers of A and Ã.
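Following the caption above, the per-hop input matrices can be pre-computed once before training. A small sketch under that reading (precompute_features and its arguments are illustrative names; it reuses matrices such as those produced by the sym_norm sketch above):

    import torch

    def precompute_features(X, A_sym, A_tilde_sym, K):
        # Build the 2K+1 input matrices {X, A_sym X, Ã_sym X, ..., A_sym^K X, Ã_sym^K X}.
        feats = [X]                  # hop-0: raw node features
        xa, xat = X, X
        for _ in range(K):
            xa = A_sym @ xa          # propagation without self-loops
            xat = A_tilde_sym @ xat  # propagation with self-loops
            feats.extend([xa, xat])
        return feats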
3.1 Design Strategies for GNNs

3.1.1 Decouple feature generation and representation learning.
As discussed in Sec. 2.1, these features can be aggregated cumulatively (homophily-based) or non-cumulatively (heterophily-based). Moreover, the features can also be combined based on some arbitrary criteria. We assume a function

    g(X, A, K) ↦ {X_1, X_2, ..., X_p}

The function takes X as the node feature matrix, A as an adjacency matrix, and K as the power of the adjacency matrix or the number of hops to propagate features, and outputs a set of aggregated features. These features can then be combined using a sum or concatenation operation to get the final representation of the node. However, in the node classification task, for a given label distribution, only a subset of these features is useful to predict the label of the node. For example, features of a node's neighbors that lie at a greater distance in the graph may not be sufficiently informative or useful for the node's label prediction.

Conventionally, GNN models have feature propagation and transformation combined into a single layer, and the layers are stacked together. This makes it difficult to distinguish the importance of the features from the role of the MLP. To overcome this limitation, we propose to separate the feature generation step from representation learning over the features. This provides us with three main benefits.

(i) Features generated for nodes are not constrained by the design of the GNN model. We get the freedom to choose the feature set as required by the problem and a neural network design that is sufficiently expressive.
(ii) We can precompute and fix the node feature set and experiment with the neural network architectures for the best performance. Precomputing features also helps to scale the training of the model for large graphs with batchwise training.
(iii) In the conventional GNN setting, stacking many layers also causes oversmoothing of node features [5] and adversely affects the performance of the model. Recently proposed models use skip connections or residual connections to overcome this issue. However, they fail to demonstrate which features are useful. We provide an alternate scheme where the model can learn weights that identify which features are useful for the prediction task.

For the model design, instead of a single input channel, we propose to have all these features as input in parallel. Please refer to Fig. 1 for the illustration. Each feature is mapped to a separate linear layer. Hence the linear transformations are uniquely learned for all input features.

3.1.2 Feature Selection.
As features are aggregated over many hops, some features are useful and correlate with the label distribution, while others are not very useful for learning and act more like noise for the model. As we propose to input the feature set in parallel channels, we can design the model to learn which features are more relevant for a lower loss value, giving higher weights to those features while simultaneously reducing the weights on the other features. We propose to weight these features with a single scalar value that is multiplied with each input feature matrix, and to impose a constraint on these values via the softmax function. Let α_i be the scalar value for the i-th feature matrix; then α_i scales the magnitude of the features as α_i X_i W_i^{(0)}.
The softmax function is used in deep learning as a non-linear normalizer, and its output is often practically interpreted as probabilities. Before training, the scalar values corresponding to each feature matrix are initialized with equal values, and softmax is applied on these values. The resultant normalized values α_i are then multiplied with the input features, and the concatenation operator is applied. Considering L input feature matrices X_l, l ∈ {1 .. L}, the formulation can be described as

    H^{(1)} = ∥_{l=1}^{L} α_l X_l W_l^{(0)},   where ∥ denotes concatenation and Σ_{l=1}^{L} α_l = 1    (4)

While training, the scalar values of the features relevant to the labels increase towards 1 while the others decrease towards 0. The features that are not useful and represent more noise than signal have their magnitudes reduced with the corresponding decrease in their scalar values. Since we are not using a binary selection of features, we term this selection procedure "soft-selection" of features.

This formulation can be understood in two ways. GNNs have been represented with a polynomial filter,

    g_θ(P) = Σ_{k=0}^{K-1} θ_k P^k    (5)

where θ ∈ R^K is a vector of polynomial coefficients and P can be the adjacency matrix [15][7], the laplacian matrix [21] or a PageRank-based matrix [2]. Since the polynomial coefficients are scalar parameters, our scheme can be considered as applying regularization on these parameters using the softmax function. The other way is to simply consider it as a weighting scheme: the input features can be arbitrarily chosen, and instead of a scalar weighting scheme, a more sophisticated scheme can be used.

For practical implementation, since all weights are initialized as equal, they can be set equal to 1. After normalizing with the softmax function, the individual scalar values become equal to 1/L. During training, these values change, denoting the importance of the features. In some cases, the initial value α_l = 1/L may be too small and may adversely affect training. In that case, a constant γ may be multiplied in after softmax normalization to increase the initial magnitude, as γ α_l X_l W_l^{(0)}. Since γ remains constant during training, it does not affect the softmax regularization of the scalar parameters.

As the scalar values affect the magnitude of the features, they also affect the gradients propagated back to the linear layer which transforms the input features. Hence it is important to have a unique weight matrix for each input feature matrix.

3.1.3 Hop-Normalization.
The third strategy we propose is hop-normalization. It is a common practice in the deep learning field to use different types of normalization schemes, for example, batch normalization [14], layer normalization, weight normalization, and so on. However, in graph neural network frameworks, normalization of activations after hidden layers is not commonly used. It may be in part due to the common practice of normalizing node/edge features and the symmetric/non-symmetric normalization of the adjacency matrix. We propose to normalize all aggregated features from different hops after the linear transformation, hence the term "Hop-Normalization". We propose to row-wise L2-normalize the hidden layer activations as

    h_ij = h_ij / ∥h_i∥_2    (6)

where h_i represents the i-th row vector of activations and h_ij represents its individual values. L2-normalization scales the node embedding vectors to lie on the "unit sphere". In a later section, we empirically show significant improvements in the performance of the model with the use of this scheme.

3.2 Feature Selection Graph Neural Network

Combining the design strategies proposed earlier, we propose a simple and shallow (2-layered) GNN model called Feature Selection Graph Neural Network (FSGNN). Figure 1 shows the diagrammatic representation of our model. Input features are precomputed using A_sym and Ã_sym and transformed using a linear layer unique to each feature matrix. Hop-normalization is applied on the output activations of the first layer, and the activations are weighted with scalar weights regularized by the softmax function. The output features are then concatenated, non-linearly transformed using ReLU, and mapped to the second linear layer. The cross-entropy loss is calculated with the output logits of the second layer.

Algorithm 1: Pseudocode for FSGNN (forward propagation)
Input: A_sym; Ã_sym; number of hops K; weight matrices W^{(k)}; α vector of dimension 2K+1
Output: Logits

    α_i ← 1.0, i = 1 ... 2K+1
    α ← SOFTMAX(α)
    list_mat ← [X]
    X_A ← X
    X_Ã ← X
    for k = 1 ... K do
        X_A ← A_sym X_A
        X_Ã ← Ã_sym X_Ã
        list_mat.APPEND(X_A)
        list_mat.APPEND(X_Ã)
    end
    list_cat ← LIST()
    for j = 1 ... 2K+1 do
        X_f ← list_mat[j]
        Out ← HOPNORM(X_f W_j^{(0)})
        list_cat.APPEND(α_j ⊙ Out)
    end
    H^{(1)} ← CONCAT(list_cat)
    Z ← ReLU(H^{(1)}) W^{(2)}
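For readers who prefer runnable code, a condensed PyTorch sketch of this forward pass is given below. It is our reading of Algorithm 1 and Eqs. 4-6, not the authors' reference implementation; the class name, layer sizes, and the gamma argument are placeholders:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FSGNNSketch(nn.Module):
        def __init__(self, in_dim, hidden, num_classes, num_feats, gamma=1.0):
            super().__init__()
            # one first-layer weight matrix W_l^(0) per pre-computed feature matrix
            self.fc1 = nn.ModuleList([nn.Linear(in_dim, hidden) for _ in range(num_feats)])
            # scalar soft-selection parameters, softmax-normalized in forward()
            self.alpha = nn.Parameter(torch.ones(num_feats))
            self.fc2 = nn.Linear(hidden * num_feats, num_classes)  # W^(2)
            self.gamma = gamma

        def forward(self, feat_list):
            alpha = torch.softmax(self.alpha, dim=0)     # soft-selection weights (Eq. 4)
            parts = []
            for a, lin, x in zip(alpha, self.fc1, feat_list):
                h = F.normalize(lin(x), p=2, dim=1)      # hop-normalization (Eq. 6)
                parts.append(self.gamma * a * h)
            h1 = torch.cat(parts, dim=1)                 # concatenation over all hops
            return self.fc2(F.relu(h1))                  # logits Z

The feat_list argument is the list of 2K+1 pre-computed feature matrices described in Section 3.1.1.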
Table 2: Mean classification accuracy on the fully-supervised node classification task. Results for GCN, GAT, GraphSAGE, Cheby+JK, MixHop and H2GCN-1 are taken from [35]. For GEOM-GCN and GCNII, results are taken from the respective articles. The best performance for each dataset is marked in bold and the second best is underlined for comparison.
Figure 3: t-SNE plots of trained embeddings (3-hop) of the Squirrel and Chameleon datasets without (left) and with (right) hop-normalization. Points represent nodes and colors represent their respective labels. Mean classification accuracy without and with hop-normalization is 39.92% and 73.48% for Squirrel, and 61.38% and 78.14% for Chameleon, respectively.
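Plots in the style of Figure 3 can be produced from the trained embeddings with a short t-SNE sketch such as the one below (assumes scikit-learn and matplotlib; embeddings and labels are placeholder names for arrays extracted from the trained model):

    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    def plot_embeddings(embeddings, labels):
        # embeddings: (num_nodes, hidden) array of trained activations for one hop
        xy = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
        plt.scatter(xy[:, 0], xy[:, 1], c=labels, s=5, cmap="tab10")
        plt.show()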
that self-looped features are given more importance. Among heterophily datasets, Wisconsin, Cornell, Texas, and Actor have the most weight on the node's ego features. In these datasets, the graph structure plays a limited role in the performance accuracy of the model. For the Chameleon and Squirrel datasets, we observed that the node's own features and first-hop features (without self-loops) were more useful for classification than any other features.

6.3 Hop-Normalization

In our experimental results, we find that the Chameleon and Squirrel datasets show significant improvements. To understand the results better, we create 2-dimensional plots of the trained embeddings of both datasets using t-SNE [17]. Figure 4 shows the comparison of embeddings with and without hop-normalization. Without hop-normalization, the embeddings of the nodes are not separated clearly, thus resulting in lower classification performance. We observe similar behavior with other GNN models. With hop-normalization, the node embeddings are well separated into clusters corresponding to their labels, leading to the higher observed performance of our model.

for 3-hop aggregation. The dimension of the hidden layer is set to 256, and γ is set to L = 7 (equal to the number of input features) to provide stable training. The model is trained batchwise with input features for 10 random initializations, and we report mean accuracy.

We compare the accuracy of our model with SGC [29], Node2Vec [11] and SIGN [9]. Similar to our method, input features can be precomputed in SGC and SIGN, thus making them scalable for larger datasets. Once features are computed, the model can be trained with small input batches of node features on the GPU. Many other GNN models cannot be trained on larger graphs as the feature generation and model training are combined.

Table 4 shows the mean node classification accuracy along with the published results of other methods taken from [9][13]. Our model outperforms all other methods, with SIGN having the closest performance to ours. However, SIGN uses the adjacency matrices of both the directed and undirected versions of the graph for feature transformations, while our model only utilizes the adjacency matrix of the undirected graph.
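Because the hop features are pre-computed, the batchwise training described above reduces to slicing rows of each feature matrix; no neighborhood sampling is needed. A hedged sketch (model, feat_list, labels, and train_idx are placeholders for the trained module and pre-computed data):

    import torch
    import torch.nn.functional as F

    def train_epoch(model, optimizer, feat_list, labels, train_idx, batch_size=10000):
        model.train()
        perm = train_idx[torch.randperm(len(train_idx))]
        for idx in torch.split(perm, batch_size):
            batch = [x[idx] for x in feat_list]   # slice every pre-computed hop matrix
            optimizer.zero_grad()
            loss = F.cross_entropy(model(batch), labels[idx])
            loss.backward()
            optimizer.step()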
Table 4: Mean classification accuracy on the ogbn-papers100M dataset. The SGC result is taken from [13], and the Node2Vec and SIGN results are taken from [9]. The best performance is marked in bold and the second best is underlined.

    Method      Accuracy
    SGC         63.29±0.19
    Node2Vec    58.07±0.28
    SIGN        65.11±0.14
    FSGNN       67.17±0.14

is intuitive as aggregated features from higher hops are not very useful, and the model can learn to place low weights on them.

[Figure: classification accuracy (%) of the model on Cora and Chameleon for 2, 4, 8, 16, and 32 hops.]

IMPLEMENTATION DETAILS

For reproducibility of experimental results, we provide the details of our experiment setup and the hyperparameters of the model. We use PyTorch 1.6.0 as the deep learning framework on Python 3.8. Model training is done on an Nvidia V100 GPU with 16 GB graphics memory and CUDA version 10.2.89.

For node classification results (Table 2), we do a grid search over the learning rate and weight decay of the layers and the dropout between the layers. Hyperparameters are set for the first layer fc1, the second layer fc2, and the scalar weight parameters sca. ReLU is used as the non-linear activation and Adam is used as the optimizer. Table 5 shows the details of the hyperparameter search space. Tables 6 and 7 show the best hyperparameters for the model in the 3-hop and 8-hop configurations, respectively.

Table 5: Hyperparameter search space

    Hyperparameter    Values
    WD_sca            0.0, 0.0001, 0.001, 0.01, 0.1
    LR_sca            0.04, 0.02, 0.01, 0.005
    WD_fc1            0.0, 0.0001, 0.001
    WD_fc2            0.0, 0.0001, 0.001
    LR_fc             0.01, 0.005
    Dropout           0.5, 0.6, 0.7

Table 6: Hyperparameters of the 3-hop model

For experiments on the ogbn-papers100M dataset, we did not do a grid search. Based on the data from earlier experiments, we manually tuned the hyperparameters to get the accuracy result. A batch size of 10000 was used for the training data. Table 8 shows the relevant hyperparameters for the model.

Table 8: Hyperparameters for the ogbn-papers100M dataset

    Dataset            WD_sca   LR_sca   WD_fc1   WD_fc2     LR_fc1    LR_fc2   Dropout
    ogbn-papers100M    0.1      0.0001   0.001    0.000001   0.00005   0.0002   0.5
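One way to realize the per-part hyperparameters above (separate learning rate and weight decay for fc1, fc2 and the scalar weights sca) is with Adam parameter groups. The sketch below plugs in the ogbn-papers100M values from Table 8 and assumes an FSGNN-style module, as in the earlier sketch, exposing fc1, fc2 and the scalar parameter alpha:

    import torch

    # `model`: an FSGNN-style module from the earlier sketch (placeholder)
    optimizer = torch.optim.Adam([
        {"params": model.fc1.parameters(), "lr": 0.00005, "weight_decay": 0.001},     # fc1
        {"params": model.fc2.parameters(), "lr": 0.0002,  "weight_decay": 0.000001},  # fc2
        {"params": [model.alpha],          "lr": 0.0001,  "weight_decay": 0.1},       # sca
    ])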