Multimodal Learning With Graphs
‡ Correspondence: [email protected]
∗ Equal contribution
Abstract
Artificial intelligence for graphs has achieved remarkable success in modeling complex systems, ranging
from dynamic networks in biology to interacting particle systems in physics. However, the increasingly
heterogeneous graph datasets call for multimodal methods that can combine different inductive biases—
the set of assumptions that algorithms use to make predictions for inputs they have not encountered
during training. Learning on multimodal datasets presents fundamental challenges because the inductive
biases can vary by data modality and graphs might not be explicitly given in the input. To address
these challenges, multimodal graph AI methods combine different modalities while leveraging cross-
modal dependencies using graphs. Diverse datasets are combined using graphs and fed into sophisticated
multimodal architectures, specified as image-intensive, knowledge-grounded and language-intensive
models. Using this categorization, we introduce a blueprint for multimodal graph learning, use it to study
existing methods and provide guidelines to design new models.
1 Introduction
Deep learning on graphs has contributed to breakthroughs in biology [1, 2], chemistry [3, 4], physics [5, 6],
and the social sciences [7]. The predominant use of graph neural networks [8] is to learn representations of
various graph components—such as nodes, edges, subgraphs, and entire graphs—based on neural message
passing strategies. The learned representations are used for downstream tasks, including label prediction
via semi-supervised learning [9], self-supervised learning [10], and graph design and generation [11, 12]. In
most existing applications, datasets explicitly describe graphs in the form of nodes, edges, and additional
information representing contextual knowledge, such as node, edge, and graph attributes.
Modeling complex systems requires measurements that describe the same objects from different perspectives,
at different scales, or through multiple modalities, such as images, sensor readings, language sequences,
and compact mathematical statements. Multimodal learning [13] studies how such heterogeneous, complex
descriptors can be optimized to create learning systems that are broadly generalizable, robust to changes in the
underlying data distributions, and trainable with less labeled data. While multimodal learning has been
[Figure 1 schematic: data modalities (images, language, chemistry, biology) feed the hidden layers of a multimodal architecture and map to tasks including image classification, semantic image segmentation, image restoration, image denoising, human-object interaction, visual reasoning, relation extraction, question answering, summarization, knowledge graph construction, sentiment analysis, molecular property prediction, chemical reaction prediction, molecular generation, drug-target interaction, drug screening, protein structure prediction, protein-protein interaction prediction, and protein binding site identification.]
Fig. 1 Graph-centric multimodal learning. Shown on the left are the different data modalities. Shown on the right
are machine learning tasks for which multimodal graph learning has proved valuable. We introduce the multimodal
graph learning (MGL) blueprint that serves as a unifying framework for multimodal graph neural architectures realized
through learning systems in computer vision, natural language processing, and natural sciences.
successfully used in settings where unimodal methods fail [14, 15, 16], it presents several challenges that must
be overcome to enable its broad use in AI [13, 17]. These challenges include finding representations optimized
for machine learning analyses and fusing combined information from various modalities to create predictive
models [18, 19, 20]. These challenges have proven difficult. For example, multimodal methods tend to focus on
only a subset of modalities that are most helpful during model training while ignoring modalities that might
be informative when the model is deployed—a pitfall known as modality collapse [21]. Moreover, in contrast to
the frequent assumption that every object must exist in all modalities, the complete set of modalities is rarely
available due to limitations of data collection and measurement technologies—a challenge known as missing
modalities [22, 23]. Because different modalities can lead to intricate relational dependencies, simple modality
fusion cannot fully leverage multimodal datasets [24]. Graph learning can model such data systems [25, 26, 27] by connecting data points from different modalities as edges in optimally defined graphs [28, 29, 30] and by building learning systems for a wide range of tasks [31, 32].
We introduce a blueprint for multimodal graph learning (MGL). The MGL blueprint provides a framework
that can express existing algorithms and help develop new methods for multimodal learning leveraging graphs.
This framework allows for learning fused graph representations and studying the aforementioned challenges
of modality collapse and missing modalities [18, 13]. We apply this formulation across a broad spectrum of
domains, ranging from computer vision and language processing to the natural sciences (Figure 1). We consider
image-intensive graphs (IIG) for image and video reasoning (Section 3), language-intensive graphs (LIG) for
processing natural and biological sequences (Section 4), and knowledge-intensive graphs (KIG) used to aid in
scientific discovery (Section 5).
[Figure 2 schematic: a, single-modal architectures for Data 1 through Data N joined by a combining architecture; b, an all-in-one multimodal architecture whose mixing module produces output and downstream representations; c, structure learning (SL) with identified and constructed nodes carrying multimodal attributes (4 modalities), and learning on structure (LoS) where information propagates over 1-hop and 2-hop neighborhoods to update node representations.]
Fig. 2 Overview of multimodal graph learning (MGL) blueprint. a, A standard approach to multimodal learning
involves combining different unimodal architectures, each optimized for a distinct data modality. b, In contrast, an
all-in-one multimodal architecture considers inductive biases specialized for each data modality and optimizes model
parameters in an end-to-end manner, enabling expressive data fusion. c, The MGL blueprint comprises four components:
identifying entities, uncovering topology, propagating information, and mixing representations. These components are
grouped into two phases: structure learning and learning on the structure.
2 Multimodal Graph Learning
We formulate a graph-centric multimodal learning approach that, given a collection of multimodal input data, yields output representations that are used in downstream tasks.
We refer to this methodology as multimodal graph learning (MGL). MGL can be seen as a blueprint consisting
of four learning components that are connected in an end-to-end fashion. In Figure 2a,b, we highlight the
difference between a conventional combination of unimodal architectures for treating multimodal data and
the suggested all-in-one multimodal architecture.
The first two components of MGL, identifying entities and uncovering topology, can be grouped as the
structure learning (SL) phase (Figure 2c):
Component 1: Identifying Entities. The first component identifies relevant entities in various data modali-
ties and projects them into a shared namespace. For example, in precision medicine, the state of a patient might
be described by matched pathology slides and clinical notes, giving rise to patient nodes with the combined
image and language information. In another example from computer vision (Figure 3), entity identification
entails defining superpixels in an image.
Component 2: Uncovering Topology. With the entities of our problem defined, the second component
discovers the interactions and interaction types among the nodes across the modalities. Interactions are
often explicitly provided, in which case the graph is given and this component combines the existing graph structure with the remaining modalities (e.g., in Figure 5c, the Uncovering Topology component combines protein surface information with the protein structure itself). When the data lacks an a priori network structure, this component explores possible adjacency matrices based on explicit features (e.g., spatial and visual characteristics) or implicit features (e.g., similarities in representations). In the latter case, examples from natural language processing construct graphs from text input that express relations among words (Figure 4b).
After graphs are specified or adaptively optimized (SL phase in MGL; Figure 2c), various strategies can be used to learn on the graphs. The last two MGL components, grouped together as the learning on structure (LoS) phase (Figure 2c), capture these strategies.
Component 3: Propagating Information. The third component propagates information across the uncovered structure: neural messages are exchanged along the graph's edges to produce node-level representations, using propagation models such as graph convolutions or attention-based message passing (Supplementary Note 1).
Component 4: Mixing Representations. The last component transforms the learned node-level representations according to the downstream task. The propagation models output representations over the nodes, which can be combined and mixed depending on the level of the final representation (e.g., a graph-level or subgraph-level label). Popular mixing strategies range from simple aggregation operators (e.g., summation or averaging) to more sophisticated functions that incorporate neural network architectures.
Figure 2c shows all MGL components, going from multimodal input data to optimized representations used
for downstream tasks. Mathematical formulations are in Box 1 and summaries of multimodal graph learning
methods are in Supplementary Note 2.
Box 1: The blueprint for multimodal graph learning. The blueprint for graph-centric multimodal learning
has four components.
1. Identifying Entities: Information from different sources is combined and projected into a shared namespace. Nodes are identified independently as set elements, and no interactions are given yet. Let there be k modalities C = {C1, ..., Ck}, where Ci is an information matrix of the i-th modality that describes every entity by an information vector. We define an Identifyi module for every modality i as:

Xi ← Identifyi(Ci), (1)

which maps the information of each modality into the same namespace. If k = 1, we recover a unimodal variant of MGL.
2. Uncovering Topology: Let there be data modalities X = {X1, ..., Xk}. We define Connectj modules, j = 1, ..., m, to specify connections between entities in X based on m distance measures as:

Aj ← Connectj(X), (2)

If Xi is already given as an adjacency matrix, the associated Connectj module specifies predefined neighborhoods.
3. Propagating Information: Neural messages are exchanged along edges in the adjacency matrices A = {A1, ..., Am} to produce node representations:

H ← Propagate(X, A). (3)

When multiple adjacency matrices are given, the Propagate module can specify multiple independent propagation models (Supplementary Note 1) or operate on a combined adjacency matrix.
4. Mixing Representations: Representations are mixed and transformed into latent representations opti-
mized for a downstream task:
Z ← Mix(H, A). (4)
The mixing module Mix transforms node representations into final representations of entities Z on which
downstream tasks are defined. Established strategies to mix representations include aggregation oper-
ators, such as summation [39], averaging [40], multi-hop aggregation [41], and methods using adjacency
information A.
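To make the blueprint concrete, the sketch below instantiates the four components in PyTorch on a two-modality toy example. The module names, dimensions, and the kNN topology rule are illustrative assumptions of ours rather than an implementation of any specific published method.

```python
# A minimal sketch of the MGL blueprint (Box 1) in PyTorch.
# All module names, dimensions, and the kNN topology rule are illustrative
# assumptions, not a reference implementation of any published method.
import torch
import torch.nn as nn

class Identify(nn.Module):
    """Component 1: project modality-specific features into a shared namespace."""
    def __init__(self, in_dim, shared_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, shared_dim)

    def forward(self, C):          # C: (num_entities, in_dim)
        return self.proj(C)        # X: (num_entities, shared_dim)

def connect_knn(X, k=5):
    """Component 2: uncover topology with a kNN rule on entity features."""
    d = torch.cdist(X, X)                              # pairwise distances
    idx = d.topk(k + 1, largest=False).indices[:, 1:]  # skip self
    A = torch.zeros(X.size(0), X.size(0))
    A.scatter_(1, idx, 1.0)
    return ((A + A.t()) > 0).float()                   # symmetrize

class Propagate(nn.Module):
    """Component 3: one round of mean-aggregation message passing."""
    def __init__(self, dim):
        super().__init__()
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, X, A):
        deg = A.sum(1, keepdim=True).clamp(min=1)
        h_agg = (A @ X) / deg                          # neighborhood mean
        return torch.relu(self.update(torch.cat([X, h_agg], dim=-1)))

def mix_mean(H, A):
    """Component 4: graph-level readout by averaging node representations."""
    return H.mean(dim=0)                               # Eq. (4): Z <- Mix(H, A)

# Toy run with two modalities describing the same 10 entities.
C1, C2 = torch.randn(10, 32), torch.randn(10, 16)
X = Identify(32, 64)(C1) + Identify(16, 64)(C2)        # fuse in shared namespace
A = connect_knn(X.detach())
H = Propagate(64)(X, A)
Z = mix_mean(H, A)                                     # downstream representation
```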
Visual Comprehension
Visual comprehension remains a cornerstone of visual analysis, and multimodal graph learning has proven helpful in classifying, segmenting, and enhancing images. Image classification identifies the set of object categories present in an image [51]. In contrast, image segmentation divides an image into segments and assigns each segment to a category [52, 44, 45].
[Figure 3 schematic: a, image comprehension, where SLIC segmentation turns an image into a superpixel graph (identifying entities, uncovering topology); b, image denoising, where patch aggregation links non-local similar patches; c-d, human-object interaction, where Faster R-CNN detections (human, surfboard, ocean) yield a human graph over body parts (head, arms, torso, legs, feet) for uncovering topology and propagating information.]
Fig. 3 Application of multimodal graph learning blueprint to images. a, Modality identification for image com-
prehension where nodes represent aggregated regions of interest, or superpixels, generated by the SLIC segmentation
algorithm. b, Topology uncovering for image denoising where image patches (nodes) are connected to other non-local
similar patches. c, Topology uncovering in human-object interaction where two graphs are created. A human-centric
graph maps body parts to their anatomical neighbors, and an interaction graph connects body parts to other objects based on their relative distances in the image. d, Information propagation in human-object interaction where spatially conditioned graphs
modify message passing to incorporate edge features that enforce the relative direction of objects in an image [50].
Finally, image restoration and denoising transform low-quality images into high-resolution counterparts [53]. The information required for these tasks
lies in objects, segments, and image patches, as well as in the long-range context surrounding them [52].
IIG construction (corresponding to MGL Components 1 and 2) begins with a segmentation algorithm such
as simple linear iterative clustering (SLIC) [54] to identify meaningful regions [55, 56, 44] (Figure 3a). These
regions define nodes used to extract feature maps and summary visual features for each region [45, 52], whose
attributes are initialized from CNNs like FCN-16 [57] or VGG19 [58]. Moreover, the nodes are connected to
their k nearest neighbors in the CNN-learned feature space [55, 45, 46, 47] (Figure 3b), to spatially adjacent
regions [51, 59, 44, 56], or to an arbitrary number of neighbors based on a previously defined similarity threshold
between nodes [47, 56].
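The following sketch illustrates this SL phase for images using scikit-image's SLIC and a kNN rule; mean superpixel color stands in for the CNN features (e.g., VGG19 activations) used by the methods above, and the parameter values are illustrative.

```python
# Sketch of IIG construction (MGL Components 1-2): SLIC superpixels as nodes,
# edges to k nearest neighbors in a feature space. Mean RGB stands in for the
# CNN features (e.g., VGG19) used by the methods discussed in the text.
import numpy as np
from skimage import data
from skimage.segmentation import slic
from scipy.spatial import cKDTree

image = data.astronaut()                         # (H, W, 3) test image
labels = slic(image, n_segments=200, compactness=10, start_label=0)

n_nodes = labels.max() + 1
feats = np.zeros((n_nodes, 3))
for s in range(n_nodes):                         # Component 1: node features
    feats[s] = image[labels == s].mean(axis=0)   # mean color per superpixel

k = 5
_, nbrs = cKDTree(feats).query(feats, k=k + 1)   # Component 2: kNN topology
edges = [(u, v) for u in range(n_nodes) for v in nbrs[u, 1:]]
print(f"{n_nodes} superpixel nodes, {len(edges)} directed edges")
```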
Once the SL phase of MGL is completed, propagation models (MGL Component 3) based on graph convolu-
tions [52, 59, 56, 45] and graph attention [60] (GAT) are used to weigh node neighbors in the graph based on
learned attention scores [51, 47]. In addition, methods such as graph-convolutional denoiser networks (GCDNs) [61], internal graph neural networks (IGNNs) [46], and residual GCNs [62, 44] consider edge similarities to indicate the
relative distance between image regions.
Visual Reasoning
Visual reasoning goes beyond recognizing visual elements by asking questions about the relationships between
entities in images. These relationships can involve humans and objects as in human-object interaction [48]
(HOI) or, more broadly, visual, semantic, and numeric entities as in visual question answering [63, 64, 65]
(VQA).
In HOI, MGL methods identify two types of entities, human body parts (e.g., hands and face) and objects (e.g., a surfboard or bike) [48, 50], that interact in fully connected [48, 49], bipartite [50, 66], or partially connected
topologies [67, 68]. MGL methods for VQA construct a new topology [69] that spans interconnected visual,
semantic, and numeric graphs. Entities represent visual objects identified by an extractor, such as Faster R-
CNN [70], scene text identified by optical character recognition, and number-type texts. Interactions between
these entities are defined based on spatial localization: entities occurring near each other are connected by
edges.
To learn about these structures (MGL Component 3), methods distinguish between propagating information
between entities of the same type and entities of different types. In HOI, knowledge about entities of the
same kind (i.e., intra-class neural messages) is exchanged by following edges and applying transformations
defined by a GAT [60], which weighs neural messages by the similarity of latent vectors of nodes. In contrast,
information between different entities (i.e., inter-class neural messages) is propagated using a graph parsing neural network (GPNN) [48]
where the weights are adaptively learned [49]. Models can have multiple channels that reason over entities of
the same class and share information across classes. For example, in HOI, relation parsing neural networks
[68] use a two-channel model where human and object-centric message passing is performed before mixing
these representations for the final prediction (Figure 3c). The same occurs in VQA, where visual, semantic,
and numeric channels perform independent message passing before sharing information via visual-semantic
aggregation and semantic-numeric aggregation [69, 71]. Other neural architectures can serve as drop-in
replacements to graph-based channels [66, 67].
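A minimal sketch of this multi-channel pattern is shown below: each channel performs intra-class message passing on its own graph, and a cross-channel attention step shares information between classes. The shapes, random adjacency matrices, and attention form are illustrative assumptions.

```python
# Sketch of the multi-channel pattern used in HOI and VQA models: each channel
# (e.g., human-centric vs. object-centric) propagates messages independently,
# then representations are shared across channels before prediction.
import torch
import torch.nn as nn

class Channel(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(dim, dim)

    def forward(self, H, A):                     # intra-class message passing
        deg = A.sum(1, keepdim=True).clamp(min=1)
        return torch.relu(H + (A @ self.msg(H)) / deg)

def cross_mix(H_a, H_b):
    """Inter-class sharing: nodes of channel a attend over nodes of channel b."""
    att = torch.softmax(H_a @ H_b.t() / H_a.size(-1) ** 0.5, dim=-1)
    return H_a + att @ H_b

dim = 64
H_human, H_obj = torch.randn(6, dim), torch.randn(4, dim)  # body parts, objects
A_human = (torch.rand(6, 6) > 0.5).float()                 # anatomical edges
A_obj = (torch.rand(4, 4) > 0.5).float()                   # spatial edges

H_human = Channel(dim)(H_human, A_human)    # channel 1: human-centric passing
H_obj = Channel(dim)(H_obj, A_obj)          # channel 2: object-centric passing
H_human = cross_mix(H_human, H_obj)         # mix before final prediction
```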
[Figure 4 schematic: a, a dependency-parsed sentence with relations such as nsubj, dobj, amod, and attr, and documents decomposed into word/entity/aspect (WEA), mention, sentence, and document nodes linked by WEA-to-WEA, entity-to-mention, mention-to-mention, mention-to-document, and mention-to-sentence edges; b, graph modeling with a virtual node; c-d, intra-aspect and inter-aspect modeling blocks over sentences and aspects.]
Fig. 4 Application of multimodal graph learning blueprint to language. a, The different levels of context in text
inputs from sentences to documents and the individual units identified at each context level, an example of modality identification, the first component of the MGL blueprint. b, A simplified construction of a language-intensive graph from text input, an application of the topology uncovering component of the MGL blueprint. c,d, Examples of learning on LIGs for aspect-based sentiment analysis (ABSA), which aims to assign a sentiment (positive, negative, or neutral) to a sentence with regard to a given aspect. By grouping relations by type within a sentence
(shown in c) or modeling relations between sentences and aspects (shown in d), these methods integrate inductive biases
relevant to ABSA and innovate in MGL’s third component, information propagation.
Language-intensive graphs are commonly constructed by connecting words according to the underlying dependency tree [79]. Beyond words, other entities are included to capture cross-sentence topology [77, 80] (Figure 4a-b).
The sentiment of neighboring or similar sentences is essential to determine the aspect-based sentiment of
the document [81]. Cooperative graph attention networks (CoGAN) incorporate this via the cooperation
between two graph-based modeling blocks: the inter- and intra-aspect modeling blocks (Figure 4d) [81].
These blocks capture the relation of sentences to other sentences with the same aspect (intra-aspect) and to
neighboring sentences in the document that contain different aspects (inter-aspect). The outputs of the intra-
and inter-aspect blocks are mixed in an interaction block, passing through a series of hidden layers. Finally,
the intermediate representations between each hidden layer are fused via learned attention weights to create
a final sentence representation (MGL Component 4).
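The sketch below illustrates this fusion step: intermediate representations from successive hidden layers are stacked and combined through learned attention weights. The layer count and dimensions are hypothetical.

```python
# Sketch of the CoGAN-style fusion step (MGL Component 4): intermediate sentence
# representations from successive hidden layers are combined via learned
# attention weights. The layer count and dimensions are illustrative.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, layer_reps):               # list of (n_sent, dim) tensors
        stacked = torch.stack(layer_reps, dim=1) # (n_sent, n_layers, dim)
        w = torch.softmax(self.score(stacked), dim=1)
        return (w * stacked).sum(dim=1)          # fused sentence representation

reps = [torch.randn(8, 128) for _ in range(3)]   # outputs of 3 hidden layers
fused = AttentionFusion(128)(reps)               # (8, 128)
```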
[Figure 5 schematic: a, physical interactions, with particles carrying atomic features such as mass, position, velocity, acceleration, and momentum, combined with physics-informed differential operators (Hψ = Eψ); b, molecular reasoning, with intramolecular forces, atomic features (atomic number, connectivity, valence, aromaticity), and bond descriptors (bond length, bond order); c, protein modeling, with a surface graph and a structure graph combined via molecular superpixels.]
Fig. 5 Applications of multimodal graph learning to natural sciences. a, Information propagation in physical
interactions where physics-informed neural message passing is used to update the states of particles in a system due to
inter-particle interactions and other forces. b, Information propagation in molecular reasoning where a global attention
mechanism is used to model the potential interaction between atoms in two molecules to predict whether two molecules
will react. c, Topology uncovering in protein modeling where a multiscale graph representation is used to integrate
primary, secondary, and tertiary structures of a protein with higher-level protein motifs summarized in molecular
superpixels to represent a protein [26]. This multiscale topology improves performance on tasks such as protein-ligand binding affinity prediction.
Standard neighborhood aggregation is invariant to the ordering of a node's neighbors and therefore cannot distinguish the spatial arrangements that define chirality [104]. To mitigate this issue, permutation (PERM) and permutation-concatenation (PERM-CAT) aggregation [102]
update every atom in a chiral group via a weighted sum of every permutation of its respective chiral group.
Though the identity of the neighbors is the same in every permutation, the spatial arrangement varies. By
weighing each permutation, PERM and PERM-CAT encode this inductive bias by modifying how information
is propagated in the underlying graph (MGL Component 3).
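A sketch of the permutation-based aggregation idea follows: every ordering of a tetrahedral center's four neighbors yields an order-sensitive message, and learned weights combine them so that spatial arrangement influences the update. The concatenation-then-linear message function is an illustrative choice, not the exact PERM parameterization.

```python
# Sketch of permutation-based aggregation for a tetrahedral chiral center:
# every ordering of the four neighbors yields an order-sensitive message, and
# learned weights combine them so spatial arrangement affects the update.
import itertools
import torch
import torch.nn as nn

class PermAggregate(nn.Module):
    def __init__(self, dim, n_neighbors=4):
        super().__init__()
        self.perms = list(itertools.permutations(range(n_neighbors)))
        self.weights = nn.Parameter(torch.ones(len(self.perms)))
        self.msg = nn.Linear(n_neighbors * dim, dim)  # order-sensitive message

    def forward(self, h_neighbors):               # (n_neighbors, dim)
        msgs = torch.stack([
            self.msg(h_neighbors[list(p)].reshape(-1))
            for p in self.perms
        ])                                        # (n_perms, dim)
        w = torch.softmax(self.weights, dim=0)
        return (w.unsqueeze(-1) * msgs).sum(dim=0)

h_nbrs = torch.randn(4, 32)                       # four substituents
update = PermAggregate(32)(h_nbrs)                # chirality-aware update
```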
Moreover, MGL can help predict the chemical products that molecules form through reactions [105, 106, 107, 108]. For example, to predict whether two molecules react, QM-GNN [105], a quantum chemistry-augmented GNN, represents each reactant by its molecular graph with chemistry-informed initial representations for
every atom and bond. After rounds of message passing, the atom representations are updated through a
global attention mechanism (Figure 5b). The attention mechanism uncovers a novel topology where atoms
can interact with atoms on other molecules. It incorporates a principle from chemistry that intermolecular
interactions between particles inform reactivity. The final representations are combined with descriptors,
such as atomic charges and bond lengths, and used for prediction. Such an approach integrates structural
knowledge about molecules in a GNN with relevant chemistry knowledge, allowing for accurate prediction
on small training datasets [105]. The inclusion of domain knowledge by fusing GNN outputs illustrates the
Mix module in MGL (Section 2, Box 1). Graph learning on molecules has created new opportunities for virtual drug screening [109], molecule generation and design [110, 111, 27], and drug-target identification [112, 113].
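The following sketch captures the QM-GNN pattern described above: a global attention step lets atoms of one reactant attend to atoms of the other, and the resulting representations are concatenated with quantum mechanical descriptors. The dimensions and dot-product attention form are assumptions for illustration.

```python
# Sketch of the QM-GNN pattern: after intra-molecule message passing, a global
# attention step lets atoms in one reactant attend to atoms in the other, and
# final representations are concatenated with chemistry descriptors.
import torch

def global_attention(H_a, H_b):
    """Atoms in molecule a attend over atoms in molecule b (inter-molecular)."""
    att = torch.softmax(H_a @ H_b.t() / H_a.size(-1) ** 0.5, dim=-1)
    return H_a + att @ H_b

H_mol1 = torch.randn(12, 64)     # atom representations after message passing
H_mol2 = torch.randn(9, 64)
H_mol1 = global_attention(H_mol1, H_mol2)        # cross-molecule topology

qm_descriptors = torch.randn(12, 8)              # e.g., partial charges per atom
H_final = torch.cat([H_mol1, qm_descriptors], dim=-1)  # Mix via concatenation
```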
Beyond individual molecules, MGL can help understand the properties of complex structures across multiple
scales, the most pertinent of which are proteins. At the scale of the primary amino acid sequence, the hallmark task is predicting 3D structure from sequence alone. AlphaFold constructs a KIG where
nodes are amino acids with representations derived from sequence homology [25]. To propagate information
in this KIG, AlphaFold introduces a triangle multiplicative update and triangle self-attention update. These
triangle modifications integrate the inductive bias that learned representations must abide by the triangle
inequality on distances to represent 3D structures. Multimodal graph learning, among other innovations,
enabled AlphaFold to predict 3D protein structure from amino acid sequence [25].
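A greatly simplified sketch of the triangle multiplicative update is given below: the pair representation z[i, j] is updated from projections of z[i, k] and z[j, k] summed over the third residue k, so each update sees all three sides of a triangle. AlphaFold's gating, normalization, and "incoming" variant are omitted.

```python
# Greatly simplified sketch of a triangle multiplicative update ("outgoing"
# edges): z[i, j] is updated by combining projections of z[i, k] and z[j, k]
# over the third node k, so the update always sees all sides of the triangle.
import torch
import torch.nn as nn

class TriangleUpdate(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.left = nn.Linear(dim, dim)
        self.right = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, z):                        # z: (n, n, dim) pair features
        a, b = self.left(z), self.right(z)
        tri = torch.einsum('ikd,jkd->ijd', a, b) # combine edges (i,k) and (j,k)
        return z + self.out(tri)

z = torch.randn(16, 16, 32)                      # 16-residue pair representation
z = TriangleUpdate(32)(z)
```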
Beyond 3D structure, protein surfaces mediate critical processes in cellular function and disease, and thus modeling the geometric and physical properties of proteins is essential [1, 114, 115]. For example, MaSIF [114]
trains a GNN on molecular surfaces described as multimodal graphs to predict protein interactions. The
initial representation of the nodes is based on geometric and chemical features. Next, Gaussian kernels are
defined on every node to propagate information, encoding complex geometric shapes of molecular surfaces
and extending the notion of a convolution. The final representations can be used to predict protein-protein
interactions [114], structural configurations of protein complexes [116], and protein-ligand binding [26].
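A sketch of Gaussian-kernel aggregation in this spirit is shown below; Euclidean distances between random points stand in for MaSIF's learned local geodesic coordinate system, and the feature dimensions are illustrative.

```python
# Sketch of Gaussian-kernel aggregation on a surface, in the spirit of MaSIF:
# each node aggregates neighbor features weighted by a Gaussian of distance.
import torch

def gaussian_aggregate(feats, coords, sigma=1.0):
    d2 = torch.cdist(coords, coords) ** 2        # squared pairwise distances
    w = torch.exp(-d2 / (2 * sigma ** 2))        # Gaussian kernel weights
    w = w / w.sum(dim=1, keepdim=True)           # normalize per node
    return w @ feats                             # kernel-weighted neighborhood mean

coords = torch.randn(100, 3)                     # surface vertex positions
feats = torch.randn(100, 16)                     # geometric + chemical features
h = gaussian_aggregate(feats, coords)
```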
6 Outlook
Multimodal graph learning is an emerging field with applications across natural sciences, vision and language
domains. We anticipate that growth in MGL will be driven by fully multimodal graph architectures and by new uses in the natural sciences and medicine. We also outline when MGL is valuable, when it is unhelpful, and where methods need improvement to address challenges posed by multimodal inductive biases or the lack of explicit graphs.
In many applications, methods must automatically capture novel, unspecified, and relevant interactions. Some applications use node
feature similarity to dynamically construct local adjacencies after each layer to discover new interactions [85].
However, this cannot capture novel interactions among distant nodes since information is only passed among
closely connected nodes in message passing. Methods address this limitation by incorporating attention layers
with induced sparsity to discover these interactions [105]. In applications without strong relational structure,
such as molecular property prediction [99, 100, 101], particle classification [85], and text classification [74], node
features often have more predictive value than any encoded structure. As a result, non-graph methods have been shown to outperform graph-based methods [129, 130].
Data and code availability We provide a continually updated summary of multimodal graph learning (MGL) methods at https://yashaektefaie.github.io/mgl, a live table to which future MGL methods will be added as a resource to the community.
Acknowledgements Y.E., G.D., and M.Z. gratefully acknowledge the support of US Air Force Contract No.
FA8702-15-D-0001, and awards from Harvard Data Science Initiative, Amazon Research, Bayer Early Excellence
in Science, AstraZeneca Research, and Roche Alliance with Distinguished Scientists. Y.E. is supported by
grant T32 HG002295 from the National Human Genome Research Institute and the NDSEG fellowship. G.D. is
supported by the Harvard Data Science Initiative Postdoctoral Fellowship. Any opinions, findings, conclusions
or recommendations expressed in this material are those of the authors and do not necessarily reflect the
views of the funders.
References
[1] Greener, J. G., Kandathil, S. M., Moffat, L. & Jones, D. T. A guide to machine learning for biologists.
Nature Reviews Molecular Cell Biology 23, 40–55 (2022).
[2] Yu, M. K. et al. Visible machine learning for biomedicine. Cell 173, 1562–1565 (2018).
[3] Wu, Z. et al. MoleculeNet: a benchmark for molecular machine learning. Chemical Science 9, 513–530
(2017).
[4] Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O. & Dahl, G. E. Neural message passing for quantum
chemistry. In Precup, D. & Teh, Y. W. (eds.) Proceedings of the 34th International Conference on Machine
Learning, vol. 70 of Proceedings of Machine Learning Research, 1263–1272 (PMLR, 2017).
[5] Sanchez-Gonzalez, A. et al. Graph networks as learnable physics engines for inference and control. In
Dy, J. & Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning, vol. 80 of
Proceedings of Machine Learning Research, 4470–4479 (PMLR, 2018).
[6] Sanchez-Gonzalez, A. et al. Learning to simulate complex physics with graph networks. In III, H. D. &
Singh, A. (eds.) Proceedings of the 37th International Conference on Machine Learning, vol. 119 of
Proceedings of Machine Learning Research, 8459–8468 (PMLR, 2020).
[7] Liu, Q., Kusner, M. J. & Blunsom, P. A survey on contextual embeddings. In CoRR (2020). 2003.07278.
[8] Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M. & Monfardini, G. The graph neural network model.
IEEE Transactions on Neural Networks 20, 61–80 (2009).
[9] Kipf, T. N. & Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. In
Proceedings of the 5th International Conference on Learning Representations, ICLR ’17 (2017).
[10] Kipf, T. N. & Welling, M. Variational graph auto-encoders. NIPS Workshop on Bayesian Deep Learning
(2016).
[11] Grover, A., Zweig, A. & Ermon, S. Graphite: Iterative generative modeling of graphs. In Chaudhuri, K.
& Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning, vol. 97 of
Proceedings of Machine Learning Research, 2434–2444 (PMLR, 2019).
[12] Guo, X. & Zhao, L. A systematic survey on deep generative models for graph generation. In CoRR (2020).
[13] Baltrušaitis, T., Ahuja, C. & Morency, L.-P. Multimodal machine learning: A survey and taxonomy.
CoRR (2017).
[14] Hong, C., Yu, J., Wan, J., Tao, D. & Wang, M. Multimodal deep autoencoder for human pose recovery.
IEEE Transactions on Image Processing 24, 5659–5670 (2015).
[15] Khattar, D., Goud, J. S., Gupta, M. & Varma, V. Mvae: Multimodal variational autoencoder for fake news
detection. In The World Wide Web Conference, WWW ’19, 2915–2921 (Association for Computing
Machinery, New York, NY, USA, 2019).
[16] Mao, J., Xu, J., Jing, Y. & Yuille, A. Training and evaluating multimodal word embeddings with
large-scale web annotated images. In Proceedings of the 30th International Conference on Neural
Information Processing Systems, NIPS’16, 442–450 (Curran Associates Inc., Red Hook, NY, USA, 2016).
[17] Huang, Y., Lin, J., Zhou, C., Yang, H. & Huang, L. Modality competition: What makes joint training of
multi-modal network fail in deep learning? (Provably). In Chaudhuri, K. et al. (eds.) Proceedings of the
39th International Conference on Machine Learning, vol. 162 of Proceedings of Machine Learning Research,
9226–9259 (PMLR, 2022).
[18] Xu, P., Zhu, X. & Clifton, D. A. Multimodal learning with transformers: A survey (2022).
[19] Bayoudh, K., Knani, R., Hamdaoui, F. & Mtibaa, A. A survey on deep multimodal learning for computer
vision: advances, trends, applications, and datasets. The Visual Computer 38, 2939–2970 (2022).
[20] Zhang, C., Yang, Z., He, X. & Deng, L. Multimodal intelligence: Representation learning, information
fusion, and applications. IEEE Journal of Selected Topics in Signal Processing 14, 478–493 (2020).
[21] Javaloy, A., Meghdadi, M. & Valera, I. Mitigating modality collapse in multimodal VAEs via impartial
optimization. In Chaudhuri, K. et al. (eds.) Proceedings of the 39th International Conference on Machine
Learning, vol. 162 of Proceedings of Machine Learning Research, 9938–9964 (PMLR, 2022).
[22] Ma, M. et al. Smil: Multimodal learning with severely missing modality. Proceedings of the AAAI
Conference on Artificial Intelligence 35, 2302–2310 (2021).
[23] Poklukar, P. et al. Gmc – geometric multimodal contrastive representation learning. In Proceedings of
the 39th International Conference on Machine Learning, Proceedings of Machine Learning Research
(arXiv, 2022).
[24] Zitnik, M. et al. Machine learning for integrating data in biology and medicine: Principles, practice, and
opportunities. Information Fusion 50, 71–91 (2019).
[25] Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
[26] Somnath, V. R., Bunne, C. & Krause, A. Multi-scale representation learning on proteins. In Beygelzimer,
A., Dauphin, Y., Liang, P. & Vaughan, J. W. (eds.) Advances in Neural Information Processing Systems (2021).
[27] Walters, W. P. & Barzilay, R. Applications of deep learning in molecule generation and molecular
property prediction. Accounts of Chemical Research 54, 263–270 (2021). PMID: 33370107.
[28] Wang, J., Hu, J., Qian, S., Fang, Q. & Xu, C. Multimodal graph convolutional networks for high quality
content recognition. Neurocomputing 412, 42–51 (2020).
[29] Mai, S., Hu, H. & Xing, S. Modality to modality translation: An adversarial representation learning and
graph fusion network for multimodal fusion. Proceedings of the AAAI Conference on Artificial Intelligence
34, 164–172 (2020).
[30] Zhang, X., Zeman, M., Tsiligkaridis, T. & Zitnik, M. Graph-guided network for irregularly sampled
multivariate time series. In International Conference on Learning Representations, ICLR (2022).
[31] Zhao, F. & Wang, D. Multimodal Graph Meta Contrastive Learning, 3657–3661 (Association for Computing
Machinery, New York, NY, USA, 2021).
[32] Zheng, S. et al. Multi-modal graph learning for disease prediction. CoRR 2107.00206 (2021).
2107.00206.
[33] Ramachandram, D. & Taylor, G. W. Deep multimodal learning: A survey on recent advances and trends.
IEEE Signal Processing Magazine 34, 96–108 (2017).
[34] Ngiam, J. et al. Multimodal deep learning. In Proceedings of the 28th International Conference on
International Conference on Machine Learning, ICML’11, 689–696 (Omnipress, Madison, WI, USA, 2011).
[35] Aafaq, N., Akhtar, N., Liu, W., Gilani, S. Z. & Mian, A. Spatio-temporal dynamics and semantic attribute
enriched visual encoding for video captioning. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR) (2019).
[36] Fang, Z., Gokhale, T., Banerjee, P., Baral, C. & Yang, Y. Video2Commonsense: Generating commonsense
descriptions to enrich video captioning. In Proceedings of the 2020 Conference on Empirical Methods in
Natural Language Processing (EMNLP), 840–860 (Association for Computational Linguistics, Online,
2020).
[37] Kiros, R., Salakhutdinov, R. & Zemel, R. Multimodal neural language models. In Xing, E. P. & Jebara, T.
(eds.) Proceedings of the 31st International Conference on Machine Learning, vol. 32 of Proceedings of
Machine Learning Research, 595–603 (PMLR, Bejing, China, 2014).
[38] Rezaei-Shoshtari, S., Hogan, F. R., Jenkin, M., Meger, D. & Dudek, G. Learning intuitive physics with
multimodal generative models. Proceedings of the AAAI Conference on Artificial Intelligence 35, 6110–6118
(2021).
[39] Xu, K., Hu, W., Leskovec, J. & Jegelka, S. How powerful are graph neural networks? In International
Conference on Learning Representations (2019).
[40] Hamilton, W., Ying, Z. & Leskovec, J. Inductive representation learning on large graphs. In Guyon, I.
et al. (eds.) Advances in Neural Information Processing Systems, vol. 30 (Curran Associates, Inc., 2017).
[41] Xu, K. et al. Representation learning on graphs with jumping knowledge networks. In Dy, J. & Krause,
A. (eds.) Proceedings of the 35th International Conference on Machine Learning, vol. 80 of Proceedings of
Machine Learning Research, 5453–5462 (PMLR, 2018).
[42] Bronstein, M. M., Bruna, J., Cohen, T. & Veličković, P. Geometric deep learning: Grids, groups, graphs,
geodesics, and gauges. CoRR (2021). 2104.13478.
[43] Chen, Y. et al. Graph-based global reasoning networks. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR) (2019).
[44] Varga, V. & Lorincz, A. Fast interactive video object segmentation with graph neural networks. In
International Joint Conference on Neural Networks, IJCNN 2021, Shenzhen, China, July 18-22, 2021, 1–10
(IEEE, 2021).
[45] Liu, Q., Kampffmeyer, M., Jenssen, R. & Salberg, A.-B. Self-constructing graph neural networks to
model long-range pixel dependencies for semantic segmentation of remote sensing images.
International Journal of Remote Sensing 42, 6184–6208 (2021).
[46] Zhou, S., Zhang, J., Zuo, W. & Loy, C. C. Cross-scale internal graph neural network for image
super-resolution. In Advances in Neural Information Processing Systems (2020).
[47] Mou, C. & Zhang, J. Graph attention neural network for image restoration. In 2021 IEEE International
Conference on Multimedia and Expo (ICME) (2021).
[48] Qi, S., Wang, W., Jia, B., Shen, J. & Zhu, S.-C. Learning human-object interactions by graph parsing
neural networks. In Proceedings of the European Conference on Computer Vision (ECCV) (2018).
[49] Wang, H., Zheng, W.-s. & Yingbiao, L. Contextual heterogeneous graph network for human-object
interaction detection. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August
23–28, 2020, Proceedings, Part XVII, 248–264 (Springer-Verlag, Berlin, Heidelberg, 2020).
[50] Zhang, F. Z., Campbell, D. & Gould, S. Spatially conditioned graphs for detecting human-object
interactions. In CVPR, 13319–13327 (2021).
[51] Avelar, P. C., Tavares, A. R., da Silveira, T. T., Jung, C. R. & Lamb, L. C. Superpixel image classification
with graph attention networks. In 2020 33rd SIBGRAPI Conference on Graphics, Patterns and Images
(SIBGRAPI), 203–209 (IEEE Computer Society, Los Alamitos, CA, USA, 2020).
[52] Lu, Y., Chen, Y., Zhao, D. & Chen, J. Graph-fcn for image semantic segmentation. In Lu, H., Tang, H. &
Wang, Z. (eds.) Advances in Neural Networks – ISNN 2019, 97–105 (Springer International Publishing,
Cham, 2019).
[53] Kim, J., Lee, J. K. & Lee, K. M. Deeply-recursive convolutional network for image super-resolution. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016).
[54] Achanta, R. et al. Slic superpixels compared to state-of-the-art superpixel methods. IEEE Transactions
on Pattern Analysis and Machine Intelligence 34, 2274–2282 (2012).
[55] Zeng, H., Liu, Q., Zhang, M., Han, X. & Wang, Y. Semi-supervised hyperspectral image classification
with graph clustering convolutional networks. arXiv preprint arXiv:2012.10932 (2020).
[56] Wan, S. et al. Multiscale dynamic graph convolutional network for hyperspectral image classification.
IEEE Transactions on Geoscience and Remote Sensing 58, 3162–3177 (2019).
[57] Long, J., Shelhamer, E. & Darrell, T. Fully convolutional networks for semantic segmentation. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015).
[58] Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. In
Bengio, Y. & LeCun, Y. (eds.) 3rd International Conference on Learning Representations, ICLR 2015, San
Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (2015).
[59] Knyazev, B., Lin, X., Amer, M. R. & Taylor, G. W. Image classification with hierarchical multigraph
networks. In British Machine Vision Conference (BMVC) (2019).
[60] Veličković, P. et al. Graph attention networks. In International Conference on Learning Representations
(2018).
[61] Valsesia, D., Fracastoro, G. & Magli, E. Deep graph-convolutional image denoising. In CoRR (2019).
1907.08448.
[62] Bresson, X. & Laurent, T. Residual gated graph convnets. In Computing Research Repository, vol.
1711.07553 (2017). 1711.07553.
[63] Biten, A. F. et al. Scene text visual question answering. In Proceedings of the IEEE/CVF International
Conference on Computer Vision (ICCV) (2019).
[64] Singh, A. et al. Towards vqa models that can read. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR) (2019).
[65] Liu, C. et al. Graph structured network for image-text matching. In IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR) (2020).
[66] Ulutan, O., Iftekhar, A. S. M. & Manjunath, B. S. Vsgnet: Spatial attention network for detecting human
object interactions using graph convolutions. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR) (2020).
[67] Gao, C., Xu, J., Zou, Y. & Huang, J.-B. Drg: Dual relation graph for human-object interaction detection.
In Vedaldi, A., Bischof, H., Brox, T. & Frahm, J.-M. (eds.) Computer Vision – ECCV 2020, 696–712 (Springer
International Publishing, Cham, 2020).
[68] Zhou, P. & Chi, M. Relation parsing neural network for human-object interaction detection. In
Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2019).
[69] Gao, D., Li, K., Wang, R., Shan, S. & Chen, X. Multi-modal graph neural network for joint reasoning on
vision and scene text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR) (2020).
[70] Ren, S., He, K., Girshick, R. & Sun, J. Faster R-CNN: towards real-time object detection with region
proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, 1137–1149 (2016).
[71] Wu, T. et al. Ginet: Graph interaction network for scene parsing. In Vedaldi, A., Bischof, H., Brox, T. &
Frahm, J.-M. (eds.) Computer Vision – ECCV 2020, 34–51 (Springer International Publishing, Cham, 2020).
[72] Wu, L. et al. Graph neural networks for natural language processing: A survey. In CoRR (2021).
2106.06090.
[73] Vaswani, A. et al. Attention is all you need. In Guyon, I. et al. (eds.) Advances in Neural Information
Processing Systems, vol. 30 (Curran Associates, Inc., 2017).
[74] Li, I., Li, T., Li, Y., Dong, R. & Suzumura, T. Heterogeneous Graph Neural Networks for Multi-label
Text Classification. arXiv:2103.14620 [cs] (2021). ArXiv: 2103.14620.
[75] Huang, L., Ma, D., Li, S., Zhang, X. & Wang, H. Text level graph neural network for text classification. In
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th
International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3444–3450 (Association
for Computational Linguistics, Hong Kong, China, 2019).
[76] Zhang, Y. et al. Every document owns its structure: Inductive text classification via graph neural
networks. CoRR (2020). 2004.13826.
[77] Pan, J., Peng, M. & Zhang, Y. Mention-centered graph neural network for document-level relation
extraction. In CoRR (2021). 2103.08200.
[78] Zhu, H. et al. Graph Neural Networks with Generated Parameters for Relation Extraction.
arXiv:1902.00756 [cs] (2019). ArXiv: 1902.00756.
[79] Guo, Z., Zhang, Y. & Lu, W. Attention guided graph convolutional networks for relation extraction. In
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 241–251
(Association for Computational Linguistics, Florence, Italy, 2019).
[80] Zeng, S., Xu, R., Chang, B. & Li, L. Double graph based reasoning for document-level relation
extraction. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing
(EMNLP), 1630–1640 (Association for Computational Linguistics, Online, 2020).
[81] Chen, X. et al. Aspect sentiment classification with document-level sentiment preference modeling. In
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 3667–3677
(Association for Computational Linguistics, Online, 2020).
[82] Zhang, C., Li, Q. & Song, D. Aspect-based sentiment classification with aspect-specific graph
convolutional networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language
Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP),
4568–4578 (Association for Computational Linguistics, Hong Kong, China, 2019).
[83] Zhang, M. & Qian, T. Convolution over hierarchical syntactic and lexical graphs for aspect level
sentiment analysis. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language
Processing (EMNLP), 3540–3549 (Association for Computational Linguistics, Online, 2020).
[84] Pouran Ben Veyseh, A. et al. Improving aspect-based sentiment analysis with gated graph convolutional
networks and syntax-based regulation. In Findings of the Association for Computational Linguistics:
EMNLP 2020, 4543–4548 (Association for Computational Linguistics, Online, 2020).
[85] Shlomi, J., Battaglia, P. & Vlimant, J.-R. Graph neural networks in particle physics. Machine Learning:
Science and Technology 2, 021001 (2021).
[86] Henrion, I. et al. Neural message passing for jet physics. In Deep Learning for Physical Sciences Workshop
at the 31st Conference on Neural Information Processing Systems (NeurIPS) (2017).
[87] Qasim, S. R., Kieseler, J., Iiyama, Y. & Pierini, M. Learning representations of irregular particle-detector
geometry with distance-weighted graph networks. The European Physical Journal C 79 (2019).
[88] Mikuni, V. & Canelli, F. Abcnet: an attention-based method for particle tagging. The European Physical
Journal Plus 135 (2020).
[89] Ju, X. et al. Graph neural networks for particle reconstruction in high energy physics detectors. In
CoRR (2020). 2003.11603.
[90] Shukla, K., Xu, M., Trask, N. & Karniadakis, G. E. Scalable algorithms for physics-informed neural and
graph networks. Data-Centric Engineering 3, e24 (2022).
[91] Seo, S. & Liu, Y. Differentiable physics-informed graph networks. In CoRR (2019). 1902.02950.
[92] Li, W. & Deka, D. Physics based gnns for locating faults in power grids. In CoRR (2021). 2107.02275.
[93] Battaglia, P. W. et al. Relational inductive biases, deep learning, and graph networks. In CoRR (2018).
1806.01261.
[94] Veličković, P., Ying, R., Padovano, M., Hadsell, R. & Blundell, C. Neural execution of graph algorithms.
In International Conference on Learning Representations (2020).
[95] Schuetz, M. J. A., Brubaker, J. K. & Katzgraber, H. G. Combinatorial optimization with physics-inspired
graph neural networks. Nature Machine Intelligence 4, 367–377 (2022).
[96] Mirhoseini, A. et al. A graph placement methodology for fast chip design. Nature 594, 207–212 (2021).
[97] Gasteiger, J., Groß, J. & Günnemann, S. Directional message passing for molecular graphs. In
International Conference on Learning Representations (2020).
[98] Jørgensen, P. B., Jacobsen, K. W. & Schmidt, M. N. Neural message passing with edge updates for
predicting properties of molecules and materials. CoRR (2018). 1806.03146.
[99] Gasteiger, J., Yeshwanth, C. & Günnemann, S. Directional message passing on molecular graphs via
synthetic coordinates. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P. & Vaughan, J. W. (eds.)
Advances in Neural Information Processing Systems, vol. 34, 15421–15433 (Curran Associates, Inc., 2021).
[100] Liu, M. et al. Fast quantum property prediction via deeper 2d and 3d graph networks. In CoRR (2021).
2106.08551.
[101] St. John, P. C., Guan, Y., Kim, Y., Kim, S. & Paton, R. S. Prediction of organic homolytic bond
dissociation enthalpies at near chemical accuracy with sub-second computational cost. Nature
Communications 11, 2328 (2020).
[102] Pattanaik, L. et al. Message passing networks for molecules with tetrahedral chirality. In CoRR (2020).
2012.00094.
[103] Fey, M., Yuen, J.-G. & Weichert, F. Hierarchical inter-message passing for learning on molecular graphs.
In CoRR (2020). 2006.12179.
[104] Ariëns, E. Chirality in bioactive agents and its pitfalls. Trends in Pharmacological Sciences 7, 200–205
(1986).
[105] Guan, Y. et al. Regio-selectivity prediction with a machine-learned reaction representation and
on-the-fly quantum mechanical descriptors. Chem. Sci. 12, 2198–2208 (2021).
[106] Coley, C. W. et al. A graph-convolutional neural network model for the prediction of chemical
reactivity. Chem. Sci. 10, 370–377 (2019).
[107] Struble, T. J., Coley, C. W. & Jensen, K. F. Multitask prediction of site selectivity in aromatic c–h
functionalization reactions. React. Chem. Eng. 5, 896–902 (2020).
[108] Stuyver, T. & Coley, C. W. Quantum chemistry-augmented neural networks for reactivity prediction:
Performance, generalizability, and explainability. J Chem Phys 156, 084104 (2022).
[109] Stokes, J. M. et al. A deep learning approach to antibiotic discovery. Cell 180, 688–702.e13 (2020).
[110] Fu, T. et al. Differentiable scaffolding tree for molecule optimization. In International Conference on
Learning Representations (2022).
[111] Mercado, R. et al. Graph networks for molecular design. Machine Learning: Science and Technology 2,
025023 (2021).
[112] Torng, W. & Altman, R. B. Graph convolutional neural networks for predicting drug-target interactions.
Journal of Chemical Information and Modeling 59, 4131–4149 (2019). PMID: 31580672,
https://doi.org/10.1021/acs.jcim.9b00628.
[113] Moon, S., Zhung, W., Yang, S., Lim, J. & Kim, W. Y. Pignet: a physics-informed deep learning model
toward generalized drug–target interaction predictions. Chemical Science 13, 3661–3673 (2022).
[114] Gainza, P. et al. Deciphering interaction fingerprints from protein molecular surfaces using geometric
deep learning. Nature Methods 17, 184–192 (2020).
[115] Sanner, M. F., Olson, A. J. & Spehner, J.-C. Reduced surface: an efficient way to compute molecular
surfaces. Biopolymers 38, 305–320 (1996).
[116] Sverrisson, F., Feydy, J., Correia, B. E. & Bronstein, M. M. Fast end-to-end learning on protein surfaces.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
15272–15281 (2021).
[117] Feng, Y., You, H., Zhang, Z., Ji, R. & Gao, Y. Hypergraph neural networks. Proceedings of the AAAI
Conference on Artificial Intelligence 33, 3558–3565 (2019).
[118] Srinivasan, B., Zheng, D. & Karypis, G. Learning over Families of Sets - Hypergraph Representation
Learning for Higher Order Tasks, 756–764 (SIAM Activity Group on Data Science, 2021).
[119] Jo, J. et al. Edge representation learning with hypergraphs. In Ranzato, M., Beygelzimer, A., Dauphin, Y.,
Liang, P. & Vaughan, J. W. (eds.) Advances in Neural Information Processing Systems, vol. 34, 7534–7546
(Curran Associates, Inc., 2021).
[120] Zhang, C., Song, D., Huang, C., Swami, A. & Chawla, N. V. Heterogeneous graph neural network. In
Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining,
KDD ’19, 793–803 (Association for Computing Machinery, New York, NY, USA, 2019).
[121] Chandak, P., Huang, K. & Zitnik, M. Building a knowledge graph to enable precision medicine.
Scientific Data (2023).
[122] Lee, S. & Song, B. C. Graph-based knowledge distillation by multi-head attention network. In Sidorov,
K. & Hicks, Y. (eds.) Proceedings of the British Machine Vision Conference (BMVC), 162.1–162.12 (BMVA
Press, 2019).
[123] Zhou, S. et al. Distilling holistic knowledge with graph neural networks. In Proceedings of the IEEE/CVF
International Conference on Computer Vision (ICCV), 10387–10396 (2021).
[124] Sun, L., Gou, J., Yu, B., Du, L. & Tao, D. Collaborative teacher-student learning via multiple knowledge
transfer. In CoRR (2021). 2101.08471.
[125] Park, W., Kim, D., Lu, Y. & Cho, M. Relational knowledge distillation. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR) (2019).
[126] Liu, Y. et al. Knowledge distillation via instance relationship graph. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR) (2019).
[127] Ma, J. et al. Using deep learning to model the hierarchical structure and function of a cell. Nature
Methods 15, 290–298 (2018).
[128] Nicholson, D. N. & Greene, C. S. Constructing knowledge graphs and their biomedical applications.
Computational and Structural Biotechnology Journal 18, 1414–1428 (2020).
[129] Borisov, V. et al. Deep neural networks and tabular data: A survey. ArXiv abs/2110.01889 (2021).
[130] Jiang, D. et al. Could graph neural networks learn better molecular representation for drug discovery? a
comparison study of descriptor-based and graph-based models. Journal of Cheminformatics 13, 12 (2021).
Supplementary Note 1: Overview of Graph Neural Networks
Graph representation learning is a rapidly growing research area, driven by the prominence of graph neural network (GNN) models. GNNs have become popular mainly because of their wide applicability and their flexible framework for defining simple or complex information propagation processes. In most applications, however, graph learning models require predefined graphs on which to apply diffusion operators.
A GNN layer propagates information between a node and its neighborhood in a differentiable manner. Assigning a parameterization
over the node attributes and using the information propagation model, we learn node representations using
gradient-based optimization. Depending on the downstream task, the learned node representations can then be
transformed into an appropriate form.
Despite the wide variety of models, a GNN layer is characterized by the following three computational steps:
1) First, neighborhood aggregation is a differentiable function responsible for generating a representation for a
node’s neighborhood:
h_u^{agg} = Agg({h_v | v ∈ N_u}), (5)
where N_u is the neighborhood of node u. Although N_u can be defined in many ways, in most GNN architectures [8, 9], it corresponds to the nodes that are adjacent to the examined node (i.e., the 1-hop neighborhood).
The choice of the aggregation function can output very different architectures and can be classified into two
main categories, the convolutional and the message-passing models, as discussed in Box S1. 2) Second, node
update is a differentiable function that learns node representations by combining the current state of the nodes
with the aggregated representation of their neighborhoods:

h′_u = Update(h_u, h_u^{agg}). (6)
3) Finally, representation transformation denotes the application of a parametric or non-parametric function that
transforms the learned node representations into final embeddings that correspond to the target prediction
labels as:
h^{out} = Transform({h′_u | u ∈ V }). (7)
Specifically, for edge-level tasks, such as link prediction and edge labeling, non-parametric operators can be
used (e.g. concatenation, summation, or averaging of the embeddings of node pairs [2, 1]). For graph-level
tasks, a parametric or non-parametric function that mixes node representations is referred to as Readout.
Examples of the Readout functions include summation (e.g. in graph isomorphism networks [39]) and a
Set2Set function [6] (e.g. in principal neighborhood aggregation [7]).
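The three steps can be summarized in a minimal layer, sketched below with mean aggregation (Eq. 5), a concatenation-based update (Eq. 6), and a sum readout (Eq. 7); these particular choices are illustrative.

```python
# Minimal sketch of the three computational steps of a GNN layer (Eqs. 5-7):
# neighborhood aggregation, node update, and representation transformation.
import torch
import torch.nn as nn

class GNNLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.update = nn.Linear(2 * dim, dim)
        self.readout = nn.Linear(dim, 1)         # graph-level Transform

    def forward(self, H, A):
        deg = A.sum(1, keepdim=True).clamp(min=1)
        h_agg = (A @ H) / deg                    # Eq. (5): Agg over N_u
        h_new = torch.relu(self.update(torch.cat([H, h_agg], -1)))  # Eq. (6)
        return self.readout(h_new.sum(dim=0))    # Eq. (7): sum Readout

H = torch.randn(7, 16)                           # node states
A = (torch.rand(7, 7) > 0.6).float()             # adjacency matrix
y = GNNLayer(16)(H, A)                           # graph-level output
```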
Box S1: Types of aggregation The choice of the Agg function is crucial for a graph neural network, as it
defines the way that neighborhood information is processed. Based on this choice, we meet two large model
categories.
Graph convolutions. The first category is based on the connection of propagation models with graph signal
processing [8]. In particular, assuming a filter g ∈ R^n that expresses the behavior of graph nodes, information propagation can be expressed through a graph convolution: x ∗ g = U g U^⊤ x, where the matrix U ∈ R^{n×n} contains the eigenvectors of the Laplacian matrix L, a transformation of the graph adjacency matrix. Simplifying the graph convolution into an aggregation scheme, we obtain the following:

h_u^{agg} = φ({Θ_{uv} ψ(h_v) | v ∈ N_u}), (8)
where ψ is a learnable function such as a multilayer perceptron (MLP) and φ is an aggregation operator,
depending on the choice of filter g. For example, in both GCN [9] and SGC [9] models, ψ is the identity function,
Θ_{uv} = 1 for all (u, v) ∈ E, and φ is the average function. Other typical examples of this category are the ChebNet [8] and CayleyNet [10] models, where the choices of Θ, φ, and ψ are functions of the graph filters used.
Neural message-passing. In message passing, aggregation is a process of encoding messages for each node
based on its neighbors. The difference from the convolutional variant is that the importance of edge (u, v) is parametrized, rather than given by a non-parametric constant Θ_{uv}. Thus, aggregation is formulated as:

h_u^{agg} = φ({ψ(h_u, h_v) | v ∈ N_u}), (9)
where ψ is, similarly, a learnable function (e.g. an MLP) and φ is an aggregator operator. For instance,
the family of MPNNs [4] utilizes a summation operator as φ, and ψ can take various forms from simple
concatenation to a trainable neural network. Similarly, the GIN model [39] chooses summation as φ and an MLP as ψ. Going beyond the choice of a single function, principal neighborhood aggregation [7] combines multiple aggregators with degree scalers.
This category also contains many models whose functions φ and ψ are based on attention mechanisms tailored to graph structures. The GAT [11] model was the first to define self-attention coefficients for graphs. Since then, attention-based models combined with learned positional or spectral encodings have been introduced, such as SAN [12] and Graph Transformer architectures, e.g., Graphormer [13] and GraphGPS [14].
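The sketch below contrasts the two aggregation families of Box S1: a convolutional variant with a fixed constant Θ_uv (uniform averaging, as in GCN/SGC) and a message-passing variant where each edge's message comes from a learnable function ψ of both endpoint states.

```python
# Sketch contrasting the two Agg families from Box S1: a convolutional variant
# with a fixed constant Theta_uv (uniform averaging) versus a message-passing
# variant where each edge's message is parametrized by a learnable psi.
import torch
import torch.nn as nn

def conv_agg(H, A):
    """Convolutional: Theta_uv = 1, phi = average, psi = identity."""
    deg = A.sum(1, keepdim=True).clamp(min=1)
    return (A @ H) / deg

class MessagePassingAgg(nn.Module):
    """Message passing: psi is an MLP over endpoint pairs, phi = summation."""
    def __init__(self, dim):
        super().__init__()
        self.psi = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, H, A):
        n = H.size(0)
        pairs = torch.cat([H.unsqueeze(1).expand(n, n, -1),
                           H.unsqueeze(0).expand(n, n, -1)], dim=-1)
        msgs = self.psi(pairs) * A.unsqueeze(-1)  # mask non-edges
        return msgs.sum(dim=1)                    # phi = summation over N_u

H, A = torch.randn(5, 8), (torch.rand(5, 5) > 0.5).float()
h_conv, h_mp = conv_agg(H, A), MessagePassingAgg(8)(H, A)
```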
| Method | Identifying entities | Uncovering topology | Propagating information | Mixing representations | Application |
| --- | --- | --- | --- | --- | --- |
| FuNet [15] | Hyperspectral pixels | Radial basis function similarity | miniGCN (GCN mini-batching) | Concatenation, sum, or product | Hyperspectral image classification |
| Graph-FCN [16] | Pixels | Edge weights based on a Gaussian kernel | GCN on weighted edges | Graph loss added with fully connected network | Image semantic segmentation |
| GSMN [65] | Images, relations, and attributes | Visual graph for images combined with textual graph | Node-level and graph-level matching | Similarity function for positive and negative pairs | Image-text matching |
| RAG-GAT [51] | Superpixels | Region adjacency graph | Graph attention network | Sum pooling combined with an MLP | Superpixel image classification |
| TextGCN [17] | Words and documents | Occurrence edges in text and corpus | GCN | No mixing, single-channel model | Text classification |
| CoGAN [81] | Sentences and aspects | Sentences and aspects as nodes | Cooperative graph attention | Softmax decoding block | Aspect sentiment classification |
| MCN [77] | Sentences, mentions, and entities | Document-level graph | Relation-aware GCN | Sigmoid activation on entity pairs | Document-level relation extraction |
| GP-GNN [78] | Word and position encodings | Generated adjacency matrix | Neural message passing | Pair-wise product | Relation extraction |
| QM-GNN [105] | Atoms | Chemical bonds | Weisfeiler-Lehman network and global attention | Concatenation with quantum mechanical descriptors | Regio-selectivity prediction |
| GNS [6] | Particles | Radial particle connectivity | Graph network (learned directed message passing) | No mixing, single-channel model | Particle-based simulation |
| MaSIF [114] | Discretized protein surface mesh | Overlapping geodesic radial features | Gaussian kernels on a local geodesic system | No mixing, single-channel model | Ligand site prediction and classification |
| MMGL [32] | Patients | Modality-aware latent graph | Adaptive GCN | Sub-branch prediction neural network | Disease prediction |
Tab. S1 Classification of existing methods according to the multimodal graph learning (MGL) blueprint. The four components of MGL are identified for every method. Methods are listed for image-intensive graphs (top), language-intensive graphs (middle), and knowledge-grounded graphs (bottom).
References
[1] Kim, J., Kim, T., Kim, S. & Yoo, C. D. Edge-labeling graph neural network for few-shot learning. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019).
[2] Zhang, M. & Chen, Y. Link prediction based on graph neural networks. In Bengio, S. et al. (eds.) Advances
in Neural Information Processing Systems, vol. 31 (Curran Associates, Inc., 2018).
[3] Beani, D. et al. Directional graph networks. In Meila, M. & Zhang, T. (eds.) Proceedings of the 38th
International Conference on Machine Learning, vol. 139 of Proceedings of Machine Learning Research,
748–758 (PMLR, 2021).
[4] Bodnar, C. et al. Weisfeiler and lehman go cellular: CW networks. In Thirty-Fifth Conference on Neural
Information Processing Systems (2021).
[5] Alsentzer, E., Finlayson, S. G., Li, M. M. & Zitnik, M. Subgraph neural networks. Advances in Neural
Information Processing Systems (2020).
[6] Vinyals, O., Bengio, S. & Kudlur, M. Order matters: Sequence to sequence for sets. In International
Conference on Learning Representations (ICLR) (2016).
[7] Corso, G., Cavalleri, L., Beaini, D., Liò, P. & Veličković, P. Principal neighbourhood aggregation for graph
nets. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F. & Lin, H. (eds.) Advances in Neural
Information Processing Systems, vol. 33, 13260–13271 (Curran Associates, Inc., 2020).
[8] Defferrard, M., Bresson, X. & Vandergheynst, P. Convolutional neural networks on graphs with fast
localized spectral filtering. In Lee, D., Sugiyama, M., Luxburg, U., Guyon, I. & Garnett, R. (eds.) Advances
in Neural Information Processing Systems, vol. 29 (Curran Associates, Inc., 2016).
[9] Wu, F. et al. Simplifying graph convolutional networks. In Chaudhuri, K. & Salakhutdinov, R. (eds.)
Proceedings of the 36th International Conference on Machine Learning, vol. 97 of Proceedings of Machine
Learning Research, 6861–6871 (PMLR, 2019).
[10] Levie, R., Monti, F., Bresson, X. & Bronstein, M. M. Cayleynets: Graph convolutional neural networks
with complex rational spectral filters. IEEE Transactions on Signal Processing 67, 97–109 (2019).
[11] Veličković, P. et al. Graph attention networks. In International Conference on Learning Representations
(2018).
[12] Kreuzer, D., Beaini, D., Hamilton, W. L., Létourneau, V. & Tossou, P. Rethinking graph transformers with
spectral attention. In Beygelzimer, A., Dauphin, Y., Liang, P. & Vaughan, J. W. (eds.) Advances in Neural
Information Processing Systems (2021).
[13] Ying, C. et al. Do transformers really perform badly for graph representation? In Beygelzimer, A.,
Dauphin, Y., Liang, P. & Vaughan, J. W. (eds.) Advances in Neural Information Processing Systems (2021).
[14] Rampasek, L. et al. Recipe for a general, powerful, scalable graph transformer. In Oh, A. H., Agarwal, A.,
Belgrave, D. & Cho, K. (eds.) Advances in Neural Information Processing Systems (2022).
[15] Hong, D. et al. Graph convolutional networks for hyperspectral image classification. IEEE Transactions
on Geoscience and Remote Sensing 59, 5966–5978 (2021).
[16] Lu, Y., Chen, Y., Zhao, D. & Chen, J. Graph-fcn for image semantic segmentation. CoRR (2020).
2001.00335.
[17] Yao, L., Mao, C. & Luo, Y. Graph convolutional networks for text classification. In Proceedings of the
Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial
Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence,
AAAI’19/IAAI’19/EAAI’19 (AAAI Press, 2019).