CS224w Machine Learning With Graphs
Leni Aniva
Autumn 2024
• Instructor: Jure Leskovec
• Website: http://cs224w.stanford.edu
Contents
0.1 Why Graphs?
0.2 Choice of Graph Embeddings
2 Node Embeddings
2.1 Random Walk Embedding
2.2 Embedding Entire Graphs
2.3 Relations to Matrix Factorization
2.4 Applications and Limitations
8 Graph Transformers
8.1 Self-Attention
8.2 Self-Attention and Message Passing
8.3 A New Design Landscape for Graph Transformers
8.4 Positional Encodings for Graph Transformers
22 Conclusion
22.1 GNN Design Space and Task Space
22.2 GraphGym
22.3 Pre-Training Graph Neural Networks
Introduction
0.1 Why Graphs?
Graphs are a general language for describing and analyzing entities with relations and
interactions.
Applications:
• Molecules: Vertices are atoms and edges are bonds
• Event Graphs
• Computer Networks
• Disease Pathways
• Code Graphs
Complex domains have a rich relational structure which can be represented as a relational
graph. By explicitly modeling relationships, we achieve better performance with lower
modeling capacity.
The modern ML toolbox processes tensors, e.g. images (2D) and text/speech (1D). The modern
deep learning toolbox is designed for simple sequences and grids, but not everything can be
represented as a sequence or a grid. How can we develop neural networks that are much
more broadly applicable? We can use graphs: graphs connect things.
• Graph neural network is the 3rd most popular keyword in ICLR ’22.
• Graph learning is also very difficult due to the complex and less structured nature of
graphs.
• Graph learning is also associated with representation learning. In some cases it may
be possible to learn a d-dimensional embedding for each node in the graph such that
similar nodes have closer embeddings.
A number of different tasks can be executed on graph data.
• Node level prediction: characterize the structure and position of a node in the
network.
Example: In protein folding, each atom is a node and the task is to predict the
coordinates of the node.
• Edge/Link level prediction: Predicting a property for a pair of nodes. This can be
either finding missing links or finding new links as time progresses.
Example: Graph-based recommender systems and drug side effects.
Figure 0.1: When a machine learning model is applied to a graph, each node defines its own
computational graph in its neighbourhood.
• Undirected/Directed edges
• Allow/Disallow self-loop
Most real-world networks are sparse. The adjacency matrix is a sparse matrix with mostly
0's. The density of the matrix ($E/N^2$) is $1.51 \times 10^{-5}$ for the WWW and $2.27 \times 10^{-8}$ for MSN IM.
1 Traditional Machine Learning on Graphs
Figure 1.2: Classifying nodes on a graph when a few labels are provided
$$c_v := \frac{1}{\sum_{u \neq v} (\text{length of the shortest path between } u, v)} \qquad \text{(closeness centrality)}$$
$$e_v := \frac{|\{\text{edges among } N(v)\}|}{\binom{k_v}{2}} \in [0, 1] \qquad \text{(clustering coëfficient)}$$
Clustering coëfficient counts the number of triangles in the ego-network (the network
formed by {v} ∪ N (v), where v is the ego). We can generalise the above by counting
graphlets.
An induced graph is a graph formed by taking a subset of vertices in a larger graph such
that only edges connecting the remaining vertices are preserved.
Two graphs with identical topologies are isomorphic.
Graphlets are small subgraphs that describe the structure of u’s neighbourhood network.
Specifically, they are rooted, connected, induced, non-isomorphic subgraphs. Considering
graphs of size 2 to 5 nodes we get a vector of 73 (number of graphlets with 2 to 5 vertices)
elements that describes the topology of a node’s neighbourhood. This vector is the
graphlet degree vector (GDV) of a node.
The features we have discussed so far capture local topological properties of the graph but
cannot distinguish points in a global scale.
The problem with the three indices above is that they are always 0 if u, v do not
share a neighbour.
• Katz Index:
This can be computed by powers of the adjacency matrix. The matrix counting all
walks of length $n$ between vertices is $A^n$, so the Katz index can be computed by
$$C := \sum_{i=1}^{\infty} \beta^i A^i = (I - \beta A)^{-1} - I$$
where the decay factor $\beta < 1$ is necessary to prevent $C$ from blowing up to $+\infty$ (see the sketch below).
An analogous definition exists for directed graphs.
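A minimal numerical sketch of the closed form above, assuming the graph is given as a small dense numpy adjacency matrix (illustrative only; for convergence β must in fact be smaller than the reciprocal of the largest eigenvalue of A):

import numpy as np

# Sketch: Katz index C = (I - beta*A)^{-1} - I for a small adjacency matrix.
def katz_index(A, beta=0.05):
    n = A.shape[0]
    return np.linalg.inv(np.eye(n) - beta * A) - np.eye(n)

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
print(katz_index(A))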
• $k(G, G') \in \mathbb{R}$
• There exists a feature representation $\phi$ such that $k(G, G') = \phi(G) \cdot \phi(G')$, which can
even be infinite-dimensional.
We could use a bag-of-words (BoW) representation for a graph. Recall that in NLP, BoW simply uses the
word counts as features for documents, with no ordering considered. We regard nodes
as words.
Graph-level graphlet features count the number of different graphlets in a graph.
The graphlets here are not rooted and do not have to be connected, so this definition of
graphlet is slightly different from the one used for node-level features. A limitation of this
definition is that counting graphlets is expensive: counting size-$k$ graphlets for a graph of size $n$
by enumeration takes $O(n^k)$ time due to costly subgraph isomorphism tests. If the graph's
nodes have bounded degree $d$, the time can be brought down to $O(n d^{k-1})$.
So far we have only considered features related to the graph structure, and not the
attributes of nodes and their neighbours.
2 Node Embeddings
Representation learning avoids the need of doing feature engineering every time. The goal
is to map individual nodes or an entire graph into vectors, or embeddings.
In node embeddings, we would like the embedding to have the following properties:
Figure 2.1: Node Level Embeddings of graphs map each node to a vector
• Decoder: Given two embeddings, measure the similarity. Usually chosen to be the
dot product Dec(z, w) = z · w
• Efficiency: The graph does not need to be exhaustively traversed during training.
We define:
• $\mathrm{Enc}(u) := z_u$ : the embedding vector of node $u$
• We select a random walk strategy R and use such strategy to determine PR (v|u), the
probability that a random walk starting from u visits v. The strategy defines NR (u),
a random multiset (can have repeats for nodes visited multiple times, essentially a
probability distribution) collected from combining all short fixed-length random walks
starting at u
Now we are ready to mathematically state the objective function, the negative log-likelihood
$$\mathcal{L} := \sum_{u \in V} \sum_{v \in N_R(u)} -\log P(v \mid z_u)$$
where the predicted probability of $u, v$ co-occurring is parameterised by a softmax over all vertices:
$$P(v \mid z_u) := \frac{\exp(z_u \cdot z_v)}{\sum_{n \in V} \exp(z_u \cdot z_n)}$$
However, this function is expensive to evaluate: the two sums over $V$ already give $O(|V|^2)$
time complexity. The solution to this problem is negative sampling¹, which provides the
estimate
$$\log \frac{\exp(z_u \cdot z_v)}{\sum_{n \in V} \exp(z_u \cdot z_n)} \simeq \log \sigma(z_u \cdot z_v) - \sum_{i=1}^{k} \log \sigma(z_u \cdot z_{n_i}) \qquad (n_i \sim P_V)$$
where $P_V$ is a probability distribution over $V$. Instead of normalising w.r.t. all nodes, we just
normalise against $k$ random negative samples $n_i$. We could select $P_V$ such that
$P_V(n) \propto \deg n$. The value of $k$ is usually chosen to be 5 to 20, since a higher $k$ gives
more robust estimates but also a higher bias towards negative events.
¹ This is a form of Noise Contrastive Estimation (NCE). See https://arxiv.org/pdf/1402.3722.pdf
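A minimal PyTorch-style sketch of this estimate, assuming a learnable embedding matrix Z and a degree vector used to sample negatives with P_V(n) ∝ deg n (names and shapes are illustrative, not the course's reference code):

import torch

N, d = 1000, 64
Z = torch.nn.Parameter(0.1 * torch.randn(N, d))       # one embedding z_u per node

def neg_sampling_loss(u, v, degrees, k=10):
    # objective to maximise: log sigma(z_u . z_v) - sum_i log sigma(z_u . z_{n_i})
    pos = torch.sigmoid((Z[u] * Z[v]).sum(-1))
    neg_idx = torch.multinomial(degrees.float(), k, replacement=True)   # n_i ~ P_V
    neg = torch.sigmoid(Z[u] @ Z[neg_idx].T)
    return -(torch.log(pos) - torch.log(neg).sum())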
Figure 2.3: Comparison of neighbourhood NR generated by BFS and DFS strategies. BFS
provides a micro-view of the neighbourhood while DFS provides a macro-view.
2. For all u:
• Compute ∂L/∂z u
• Make a step z u ← z u − η · ∂L/∂z u . η is the learning rate.
A couple different options are in order for the random walk strategy R. In DeepWalk,
this is just an unbiased random walk starting from each node but this could be too
constrained. In node2vec, the strategy is chosen to embed nodes with similar network
neighbourhood close in the feature space. We frame this as a maximum-likelihood
optimisation problem which is independent of the downstream prediction task. The key
observation is that a flexible notion of $N_R$ leads to rich node embeddings.
Two classic strategies define a neighbourhood NR : Breadth First Search and Depth First
Search. We could interpolate BFS and DFS using two parameters:
• In-out parameter q: Moving outwards (DFS) vs. inwards (BFS). Intuitively q is the
interpolation parameter.
• Leave N (s1 ) and explore further (distance 2) with weight 1/q for each node further
out.
Figure 2.4: Biased 2nd order neighbourhoods along with unnormalized probabilities.
A BFS-like walk has a low value of $p$ (easy to backtrack) and a DFS-like walk has a low value
of $q$.
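A sketch of the biased second-order walk with the return parameter p and in-out parameter q, assuming the graph is a plain dict mapping each node to a set of neighbours (not node2vec's reference implementation):

import random

def biased_walk(graph, start, length, p=1.0, q=1.0):
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        if len(walk) == 1:
            walk.append(random.choice(list(graph[cur])))
            continue
        prev = walk[-2]
        candidates, weights = list(graph[cur]), []
        for nxt in candidates:
            if nxt == prev:
                weights.append(1.0 / p)       # return to the previous node
            elif nxt in graph[prev]:
                weights.append(1.0)           # stay at distance 1 (BFS-like)
            else:
                weights.append(1.0 / q)       # move further out (DFS-like)
        walk.append(random.choices(candidates, weights=weights, k=1)[0])
    return walk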
In a survey in 2017, node2vec performs better on node classification tasks, while
alternative methods perform better on link prediction. No one method wins in all cases.
Random walk approaches are generally more efficient.
2.2 Embedding Entire Graphs
The simplest approach to embedding an entire graph is to sum (or average) its node embeddings:
$$z_G := \sum_{v \in G} z_v$$
Another approach is to introduce a “virtual node” to represent the subgraph and run a
node embedding algorithm to use its embedding.
We will discuss hierarchical embeddings, which successively summarises the graph in
smaller clusters to generate an embedding.
2.3 Relations to Matrix Factorization
Random-walk embeddings are implicitly factorising the matrix
$$\log\!\left(\mathrm{vol}(G)\left(\frac{1}{T}\sum_{r=1}^{T}(D^{-1}A)^r\right)D^{-1}\right) - \log b$$
where $\mathrm{vol}(G) := \sum_{i,j} A_{i,j}$ is the volume of the graph, $D_{u,u} := \deg u$ is the degree matrix, $T := |N_R(u)|$ is the context window size, and $b$ is the number of negative samples.
2.4 Applications and Limitations
Node embeddings can be used for:
• Clustering/Community detection
• Node classification
Node embeddings via matrix factorisation and random walks have some limitations:
• Transductivity: The embedding can only be generated after all nodes in the graph
are seen. Cannot obtain embeddings for nodes not in the training set.
Deep Representation Learning and Graph Neural Networks mitigate these limitations,
which will be covered in depth in Section 4 and Section 5.
3 Graph Neural Networks
Instead of computing the full gradient, we take a random sample $\sum_{i \in I} \nabla_\theta \mathcal{L}(y_i, f(x_i; \theta))$,
where $I \subseteq \{1, \ldots, n\}$. This is stochastic gradient descent, and $I$ is the batch. $|I|$ is the
batch size, and the number of full passes through the dataset is the epoch.
– Other optimisers such as RMSprop, Adam, and Adagrad add momentum and adaptive learning rates.
Suppose we have a graph $G$ with vertex set $V$, adjacency matrix $A \in \{0, 1\}^{|V| \times |V|}$, and
node features $X \in \mathbb{R}^{|V| \times d}$.
A naı̈ve approach would be to join the adjacency matrix and features and feed them into a
deep neural net. The problems with this idea are
• O(|V |) parameters
• Graph size is inherently baked into the size of the neural network
One solution is to use convolutional networks, which use a sliding kernel which is
invariant across all points on the graph. There is no fixed notion of locality or sliding
window on a graph, and graphs do not give an inherent order to their vertices.
Suppose we learn a function $f$ that maps a graph $G := (A, X)$ to a vector in $\mathbb{R}^d$. Then we
would like $f(A_1, X_1) = f(A_2, X_2)$ for two different orderings of the vertices of $G$; i.e. the
graph function $f : \mathbb{R}^{|V| \times m} \times \mathbb{R}^{|V| \times |V|} \to \mathbb{R}^d$ should be permutation invariant.
Figure 3.1: General structure of GNN which consists of permutation equivariant convolu-
tional layers and permutation invariant pooling layers.
Figure 3.2: Computation graph defined from a node's neighbourhood. Each node defines a
computation graph based on its neighbourhood, and this can change from node to node.
$$h_v^{(0)} := x_v \qquad \text{(initial 0th layer embedding)}$$
$$h_v^{(k+1)} := \sigma\!\left(W_k \frac{1}{|N(v)|} \sum_{u \in N(v)} h_u^{(k)} + B_k h_v^{(k)}\right) \qquad (k = 0, \ldots, K-1)$$
where the first term is the average of the neighbours' previous-layer embeddings and the second term uses the embedding of $v$ itself at layer $k$. The encoder output is
$$z_v := h_v^{(K)}$$
• Supervised setting: train directly on a node-level loss, e.g. $\mathcal{L} := \sum_{v \in V} \mathrm{CE}(y_v, \theta^\intercal z_v)$, where $\theta$ are the classification weights.
• Unsupervised setting: When no node labels are available, we can use the graph's
structure as supervision by requiring that similar nodes have similar embeddings, i.e. we
optimise
$$\mathcal{L} = \sum_{u,v} \mathrm{CE}(y_{u,v}, \mathrm{Dec}(z_u, z_v))$$
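A sketch of the mean-aggregation layer from the equations above, assuming PyTorch and a dense adjacency matrix (illustrative only):

import torch
import torch.nn as nn

class MeanAggLayer(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)   # neighbour transform W_k
        self.B = nn.Linear(d_in, d_out, bias=False)   # self transform B_k

    def forward(self, H, A):
        # H: (|V|, d_in) node embeddings, A: (|V|, |V|) adjacency matrix
        deg = A.sum(dim=1, keepdim=True).clamp(min=1)  # |N(v)|
        neigh_mean = (A @ H) / deg                     # average of neighbours
        return torch.relu(self.W(neigh_mean) + self.B(H))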
Overall:
1. Define a neighbourhood aggregation function
2. Define a loss function on the embeddings
3. Train on a set of nodes, i.e. a batch of computation graphs.
4. Generate embeddings of nodes (even for nodes that the model never trained on)
A GNN is inductive as opposed to transductive: the same model generalises to unseen
nodes.
4 A General Perspective on Graph Neural Networks
1. Message
2. Aggregation Function
A GNN Layer is composed of the message and aggregation. Different
implementations include Graph Convolutional Networks (GCN), GraphSAGE, and
GAT (Graph Attention).
3. Layer Connectivity
Layer Connectivity refers to the topological connection between layers including skip
connections.
• The message function converts the hidden state of each node into a message, which
will be sent to other nodes later.
$$m_u^{(l)} := \mathrm{Msg}^{(l)}(h_u^{(l-1)})$$
• The aggregation function defines how the node's neighbours' messages are combined:
$$h_v^{(l)} := \mathrm{Agg}^{(l)}(\{m_u^{(l)} : u \in N(v)\})$$
Example: the aggregation can be a summation, mean, or maximum.
The issue here is that the information from node $v$ itself could get lost, since $h_v^{(l)}$ does
not directly depend on $h_v^{(l-1)}$. The solution is to include $h_v^{(l-1)}$ in the computation of
$h_v^{(l)}$: we can compute a message for $v$ itself, and then use
$$h_v^{(l)} := \mathrm{concat}\!\left(\mathrm{Agg}(\{m_u^{(l)} : u \in N(v)\}),\; m_v^{(l)}\right)$$
Examples:
(1) Graph Convolutional Networks (GCN), where the message and aggregation functions are
$$h_v^{(l)} := \sigma\!\left(\sum_{u \in N(v)} W^{(l)} \frac{h_u^{(l-1)}}{\deg v}\right)$$
i.e. the message is normalised by the node degree and the aggregation is a summation:
$$m_u^{(l)} = \frac{1}{|N(v)|} W^{(l)} h_u^{(l-1)}, \qquad h_v^{(l)} = \sigma\!\left(\sum\{m_u^{(l)} : u \in N(v)\}\right)$$
In GCN, graphs are assumed to have self-edges that are included in the summation.
(2) GraphSAGE:
$$h_v^{(l)} := \sigma\!\left(W^{(l)} \cdot \mathrm{concat}\!\left(h_v^{(l-1)},\; \mathrm{Agg}(\{h_u^{(l-1)} : u \in N(v)\})\right)\right)$$
(stage 1: aggregation from the neighbours; stage 2: aggregation with the node itself). The aggregation can be chosen in several ways:
• Mean: $\mathrm{Agg}(v) := \sum_{u \in N(v)} \frac{h_u^{(l-1)}}{|N(v)|}$
• Pool: Transform the neighbour vectors and apply a symmetric vector function such as mean or
max: $\mathrm{Agg}(v) := \mathrm{mean}_{u \in N(v)} \mathrm{MLP}(h_u^{(l-1)})$
• LSTM: Apply an LSTM to the reshuffled neighbours.
Optionally, $h_v^{(l)}$ can be $\ell_2$-normalised at every layer; without this, the embedding vectors
have different scales. In some cases, normalisation results in performance improvements.
(3) Graph Attention Networks (GAT):
$$h_v^{(l)} := \sigma\!\left(\sum_{u \in N(v)} \alpha_{v,u} W^{(l)} h_u^{(l-1)}\right)$$
GAT assigns different importance to messages coming from different nodes. When
$\alpha_{v,u} = \frac{1}{|N(v)|}$, attention reduces to GCN/GraphSAGE, where $\alpha_{v,u}$ is defined explicitly
based on the structural properties of the graph, specifically the node degree $\deg v$. The
attention mechanism in GAT is inspired by cognitive attention and focuses on the
important parts of the input data.
An attention mechanism computes $\alpha_{v,u}$. Define the attention coëfficients
$$e_{v,u} := a(W^{(l)} h_u^{(l-1)}, W^{(l)} h_v^{(l-1)})$$
Then we normalise $e_{v,u}$ into the attention weights using softmax:
$$\alpha_{v,u} := \frac{\exp e_{v,u}}{\sum_{k \in N(v)} \exp e_{v,k}}$$
In multi-head attention, multiple attention scores are used and the result of each
attention “head” is aggregated:
$$h_v^{(l)}[j] := \sigma\!\left(\sum_{u \in N(v)} \alpha_{v,u}[j]\, W^{(l)} h_u^{(l-1)}\right), \qquad h_v^{(l)} := \mathrm{Agg}(h_v^{(l)}[j] : j)$$
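A sketch of a single attention head following the equations above, assuming PyTorch, a dense adjacency matrix, and a single linear layer on the concatenated transformed embeddings as the attention function a (one common choice, not the only one):

import torch
import torch.nn as nn

class GATHead(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)
        self.a = nn.Linear(2 * d_out, 1, bias=False)    # attention function a(., .)

    def forward(self, H, A):
        Wh = self.W(H)                                  # (|V|, d_out)
        n = Wh.size(0)
        pairs = torch.cat([Wh.repeat(n, 1),             # W h_u
                           Wh.repeat_interleave(n, 0)], # W h_v
                          dim=1)
        e = self.a(pairs).view(n, n)                    # e[v, u] = a(W h_u, W h_v)
        e = e.masked_fill(A == 0, float('-inf'))        # attend only to neighbours
        alpha = torch.softmax(e, dim=1)                 # normalise over N(v)
        return torch.relu(alpha @ Wh)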
The benefits of attention mechanism are:
• Allows for implicitly specifying different importance values of neighbours
• Computationally efficient: Attention can be parallelised across all edges of the
graph.
• Storage efficient: Sparse matrix operations do not require more than O(V + E)
entries. The number of parameters is fixed.
• Localised: Only attends over local neighbourhoods
• Inductive capability: Does not depend on global graph structure.
• Solution 1: Increase the expressive power within each GNN layer: We can make
aggregation/transformation into a deep neural network.
• Solution 2: Add layers that do not pass messages. A GNN does not necessarily
only contain GNN layers. We can apply MLP layers to each node before and after
the GNN layers as pre-processing and post-processing layers. In practice,
adding these layers is beneficial.
2. Add skip connections in GNNs: Since earlier GNN layers can sometimes be better
in differentiating nodes, we add shortcuts to the GNN.
A skip connection creates a mixture of models. We get a mixture of shallow and deep
GNNs using mixed connections. When we have $N$ skip connections, information has
$2^N$ possible pathways of transmission. An example of a skip connection:
$$h_v^{(l)} := \sigma\Bigg(\underbrace{W^{(l)} \sum_{u \in N(v)} \frac{h_u^{(l-1)}}{|N(v)|}}_{F(x)} + \underbrace{h_v^{(l-1)}}_{x}\Bigg)$$
(a) A shortcut or skip connection (b) Skip connections to the last layer
• Feature Level:
5 GNN Augmentation and Training
For edge-level prediction, the prediction head takes a pair of node embeddings:
$$\hat{y}_{u,v} := \mathrm{Head}(h_u^{(L)}, h_v^{(L)})$$
– Dot product: $\mathrm{Head}(h_u^{(L)}, h_v^{(L)}) = h_u^{(L)\intercal} W h_v^{(L)}$
For graph-level prediction, the head takes all the node embeddings:
$$\hat{y}_G := \mathrm{Head}(\{h_v^{(L)} : v \in V(G)\})$$
This is similar to the aggregation step in GNNs. Global pooling, e.g. mean pooling,
max pooling, sum pooling, work great for smaller graphs.
The main issue is global pooling over a large graph loses information. A solution is to
aggregate the graph hierarchically. We train two independent GNNs at each level.
• GNN-A: Compute node embeddings (embedding task)
• GNN-B: Compute the cluster a node belongs to (clustering task)
For each pooling layer, use clustering assignments from GNN-B to aggregate node
embeddings generated by GNN-A.
Figure 5.3: An inductive dataset of three different graphs. Each graph is given its own
message and supervision edges.
Sometimes we cannot guarantee that the test set will be really held out, in which case we
could use a random split and report the average performance over different random seeds.
Splitting graphs is special and has its own quirks compared to image datasets. If we
split a graph into different vertex sets, the nodes are not independent: the nodes in the
“unseen” validation or test set will affect our prediction on the nodes in the training set
because of the message-passing mechanics. There are two solutions to this issue:
• Transductive Setting: The input graph can be observed in all datasets splits, but
only the training sets have visible labels.
• Inductive Setting: Break the edges between splits to get multiple graphs. In this
case the nodes in different components of the graph are truly independent.
Only this setting is applicable to graph classification.
In a link-prediction task, the setup of the task is tricky. It is a self-supervised task and we
need to generate labels and datasets on our own. A practical method is to hide edges from
the GNN and let GNN predict if those edges exist. We split edges twice:
1. Assign two types of edges in the original graph: Message edges and Supervision edges.
Message edges will be visible to the GNN while supervision edges will not.
2. Split edges into train/validation/test. We can use either an inductive (Figure 5.3) or a
transductive (Figure 5.4) setting.
The transductive setting is the default when people talk about link prediction. In this
case there is only one graph, observable in all dataset splits.
6 Theory of Graph Neural Networks
Settings
We focus on message passing in GNNs:
$$m_u^{(l)} := \mathrm{Message}^{(l)}(h_u^{(l-1)})$$
Figure 6.1: When f produces one-hot encodings, the ϕ counts the number of occurrences of
each element of the multi-set.
A GNN distinguishes graph structures using the computation graphs induced by the
neighbourhood of each node. If the k-hop neighbourhood structures of two nodes are
identical and the nodes have the same features, a GNN would not be able to distinguish
between them. A computation subgraph is a rooted subtree with root at each node.
We can measure expressive power using injections. The most expressive GNNs should map
subtrees to node embeddings injectively. If each step of the GNN's aggregation can completely
retain the neighbourhood information of each node, the generated node embeddings can
distinguish different rooted structures. In other words, the most expressive GNN uses
injective neighbourhood aggregation.
Aggregate({x u : u ∈ N (v)})
In GCN, this is the mean function, and in GraphSAGE, it is the max pool. For example,
both pooling functions will create the same aggregation over the neighbour message
multi-sets:
$$\left\{\begin{bmatrix}1\\0\end{bmatrix}, \begin{bmatrix}0\\1\end{bmatrix}\right\} \quad\text{and}\quad \left\{\begin{bmatrix}1\\0\end{bmatrix}, \begin{bmatrix}1\\0\end{bmatrix}, \begin{bmatrix}0\\1\end{bmatrix}, \begin{bmatrix}0\\1\end{bmatrix}\right\}$$
In general, the discriminative power decreases in the series sum > mean > max-pooling.
Theorem 6.1 (Xu et al. ICLR 2019). Any injective multi-set function can be expressed as
$$\phi\!\left(\sum_{x \in S} f(x)\right)$$
Proof. (sketch) $f$ produces one-hot encodings, and $\phi$ adds them together. See Figure 6.1.
To model $\phi$ and $f$, we can use an MLP.
GIN uses the two-MLP structure above and it is the most expressive message passing GNN.
Question: In the hash table, we would not be able to control the output
(almost random), but in our case the output seems to be deterministic.
The discussion here is mainly about how to design an injective function over a multi-
set.
If the input features $c^{(0)}(v)$ are one-hot, direct summation is injective. In this case, we only
need $\phi$ to ensure injectivity:
$$\mathrm{GINConv}\!\left(c^{(k)}(v), \{c^{(k)}(u) : u \in N(v)\}\right) := \mathrm{MLP}_\phi\!\left((1 + \epsilon)\, c^{(k)}(v) + \sum_{u \in N(v)} c^{(k)}(u)\right)$$
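A sketch of the GINConv update above, assuming PyTorch and a dense adjacency matrix without self-loops:

import torch
import torch.nn as nn

class GINConv(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.eps = nn.Parameter(torch.zeros(1))        # learnable epsilon
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, C, A):
        # C: (|V|, d) node colours/features, A: (|V|, |V|) adjacency matrix
        neigh_sum = A @ C                              # sum of neighbour colours
        return self.mlp((1 + self.eps) * C + neigh_sum)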
• Node embeddings are low-dimensional. Hence they can capture the fine-grained
similarity of different nodes.
WL Kernel has been both theoretically and empirically shown to distinguish most of the
real-world graphs [Cai et al. 1992]. Hence GIN is also powerful enough to distinguish most
of the real graphs.
• Activation Function:
7 Limits of Graph Neural Networks
Figure 7.1: A k-layer GNN embeds a node based on the k-hop neighborhood structure.
Figure 7.2: The two square nodes have the same computational graphs and therefore the
same embedding despite having different neighbourhood structures.
Ideally, we would like the following two properties:
1. If two nodes have the same neighborhood structure, they must have the same
embedding
2. If two nodes have different neighborhood structures, they must have different
embeddings
Observation (2) is often unsatisfiable: there are basic structures that existing GNN
frameworks cannot distinguish, such as the length of cycles. The power of GNNs can be
improved to resolve this problem.
Observation (1) could also be problematic: Sometimes we may want to assign different
embeddings to nodes that have different positions in the graph. e.g. In position-aware
tasks.
We’ll resolve these issues by building more expressive GNNs.
Figure 7.3: Failure 1: The computational graph of a node on a cycle is always the same
Figure 7.4: Failure 2: The computational graph of • and • are the same, so the link-level
prediction on two dashed edges will be identical.
Figure 7.5: Failure 3: Nodes on two different graphs have identical computational graphs
Figure 7.6: The WL Kernel inherits graph symmetries. Symmetric colours are associated
with limitations involving spectral decomposition of a graph.
• Node Level: Different inputs with the same computational graph lead to GNN
failure (Figure 7.3)
• Edge Level: Edge prediction tasks may fail since the nodes on the edges have
identical computational graphs. (Figure 7.4)
• Graph Level: Same overall computation graphs on different graphs lead to same
prediction. (Figure 7.5)
$$A = V \Lambda V^\intercal$$
where $\Lambda$ is the diagonal matrix of eigenvalues $\lambda_1, \ldots, \lambda_N$.
Example
The number of cycles in a graph can be viewed as a function of eigenvalues and eigenvectors, e.g.
$$\mathrm{diag}(A^3) = \sum_{n=1}^{N} \lambda_n^3\, (v_n \odot v_n)$$
Thus the weights of the first MLP layer depend on the eigenvalues and the dot product
between the eigenvectors $v_n$ and the colours at the previous level $C^{(l)}[:, i]$.
With uniform initial colours, we have $C^{(l)}[:, i] = \mathbf{1}$. The new node embeddings only
depend on the eigenvectors that are not orthogonal to $\mathbf{1}$. However, graphs with symmetries
admit eigenvectors orthogonal to $\mathbf{1}$.
In a nutshell: WL cannot distinguish between symmetric nodes in the graph, since the
embeddings and the graph structure admit the same symmetry. The limitations of the WL
kernel are limitations of the initial node colour.
Question: How does the dot product relate to the initial colour?
It is a little bit beyond the scope of the course, but the dot product involves the initial
colour $C^{(1)}$. In graphs that have symmetries, the inner product $v_n \cdot \mathbf{1}$ goes to zero, and the
information from $C^{(l)}[:, i]$ is extinguished.
One-hot encoding: has low generalizability and cannot generalize to new nodes, since new
nodes introduce new IDs.
Constant encoding: in terms of expressive power, all nodes are identical, but the GNN can
still learn from structure.
We can also use the diagonals of the adjacency powers as augmented node features. They
correspond to the closed loops each node is involved in.
Theorem 7.1. If two graphs have adjacency matrices with different eigenvalues, there
exists a GNN with closed-loop initial node features that can always tell them apart.
GNNs with structural initial node features can produce different representations for almost
all real-world graphs (“almost all” because distinguishing arbitrary graphs is an open problem).
In this case, a GIN with structural initial node features is strictly more powerful than the WL
kernel.
Certain structures are hard to learn by GNN. For example, the cycle count feature (the
length of a cycle that v resides in). We could embed the cycle count as a feature. Other
commonly used augmented features are clustering coëfficient, PageRank, centrality.
Structurally aware node features:
$$\bar{y} := \mathbb{E}[y] \simeq \frac{1}{m}\sum_{j=1}^{m} y^{(j)}$$
We allow the GNN to momentarily break equivariance during each individual sample, but
equivariance holds after taking the expectation.
Figure 7.7: Structure and Position Aware Tasks; GNNs often work well for structure-aware
tasks but fail at position-aware tasks.
2. The expressive power of GNNs is upper bounded by the WL test. For example,
message passing GNNs cannot count the length of a cycle in a graph.
Solution: Identity-aware GNNs
• Structure-Aware Tasks: Nodes are labeled by their structural roles in the graph
GNNs always fail for position aware tasks due to the similarity of computational graphs.
We could randomly pick a node s1 as an anchor node and represent other nodes by their
relative distances w.r.t. s1 . The anchor node serves as a coördinate axis. We can pick more
nodes s2 , s3 , . . . as anchor nodes to better characterise node positions in the graph.
Theorem 7.2 (Bourgain's Theorem, Informal). Let $(V, d)$ be the metric space on the graph
vertices $V$ and $c$ a constant. Select random node sets $S_{i,j} \subseteq V$ such that each node in $V$ is
included with probability $2^{-i}$.
Consider the following embedding function
$$f(v) := [d_{\min}(v, S_{1,1}), d_{\min}(v, S_{1,2}), \ldots, d_{\min}(v, S_{\log n,\, c \log n})] \in \mathbb{R}^{c \log^2 n}$$
P-GNN follows this theory: it samples $O(\log^2 n)$ anchor sets $S_{i,j}$ and embeds each node
via $f$. The positional information in the embedding can be used in two ways:
• Simple solution: Use the position encodings as an augmented node feature. The
problem with this is since the encoding is tied to a random anchor set, dimensions of
positional encoding can be randomly permuted without changing its meaning.
• Inference: Given a new unseen graph, new anchor sets are sampled.
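A sketch of these anchor-set distance features, assuming the graph is a dict mapping nodes to neighbour sets; the helper names are hypothetical and the sampling follows the 2^{-i} inclusion probabilities from Theorem 7.2:

import math
import random
from collections import deque

def bfs_distances(graph, sources):
    # shortest-path distance from every node to the nearest node in `sources`
    dist = {v: float('inf') for v in graph}
    queue = deque()
    for s in sources:
        dist[s] = 0
        queue.append(s)
    while queue:
        v = queue.popleft()
        for u in graph[v]:
            if dist[u] == float('inf'):
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist

def position_encoding(graph, c=2):
    n = len(graph)
    log_n = max(1, int(math.log2(n)))
    anchor_sets = [[v for v in graph if random.random() < 2 ** -(i + 1)]
                   for i in range(log_n) for _ in range(c * log_n)]
    dists = [bfs_distances(graph, S or [random.choice(list(graph))])
             for S in anchor_sets]
    return {v: [d[v] for d in dists] for v in graph}   # one vector of anchor distances per node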
Question: Is there a case where the node colouring would not be useful in
network A and B?
The networks A and B might converge to the same result, but they won't if your
objective function forces them to distinguish the nodes.
8 Graph Transformers
We know a lot about the design space of GNNs. What does the design space of graph
transformers look like?
Figure 7.9: Inductive node colouring distinguishes the computational graphs on cycles of length
3 and 4.
(a) 3-cycle (b) 4-cycle
8.1 Self-Attention
The attention layer processes each input token $x_i$ into three values, the query $q_i$, key $k_i$,
and value $v_i$⁷, via three trainable weight matrices:
$$q_i := W_q x_i, \qquad k_i := W_k x_i, \qquad v_i := W_v x_i$$
⁷ This terminology comes from search engines, where the user inputs a query which gets matched with keys.
$q_i, k_i$ must have the same length $d$. Then the layer computes a score between the query
and the key:
$$a_i := \frac{1}{\sqrt{d}}\, q_i \cdot k_i$$
Finally, the output of the layer is a mixture of the $v_i$'s weighted by the softmax of the $a_i$'s:
$$z := \sum_i \mathrm{softmax}(a_1, \ldots, a_N)_i\, v_i$$
We can represent the same calculation in matrix form. With the input matrix $X \in \mathbb{R}^{M \times N}$,
$$Q := X W_q, \qquad K := X W_k, \qquad V := X W_v$$
Then we can compute the attention score
$$Z := \mathrm{softmax}\!\left(\frac{1}{\sqrt{d}} Q K^\intercal\right) V$$
Multi-head Attention is the same as executing many instances of this process in parallel.
Question: Can we take mean pool over the outputs of the heads of the
multi-head attention?
Mean pool ignores ordering, so it would not be very useful.
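To make the matrix form above concrete, a minimal single-head numpy sketch (illustrative only):

import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (1/sqrt(d)) Q K^T
    return softmax(scores, axis=-1) @ V       # Z = softmax(.) V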
Excursion:
Aniva: I think this would be easier to see using the Einstein summation convention,
where if an index appears on only one side, it is assumed to be summed over. Suppose
xij is the feature, q i,µ the query, kµi the key, and vνi the value, then one self-attention
layer is
$$q^{i,\mu} := (W_q)_{j,\mu}\, x^{ij}, \qquad k^i_\mu := (W_k)_{j\mu}\, x^{ij}, \qquad v^i_\nu := (W_v)_{j\nu}\, x^{ij}$$
and one attention layer is
$$z^i_\nu := \mathrm{softmax}_{i'}\!\left(\frac{1}{\sqrt{d}}\, q^{i,\mu} k^{i'}_\mu\right) v^{i'}_\nu$$
Similar to transformers, GNNs also take in a sequence of vectors (in no particular order)
and output a sequence of embeddings. The difference is that where a GNN uses message passing,
a transformer uses attention:
$$z_1 := \sum_j \mathrm{softmax}_j(q_1 \cdot k_j)\, v_j$$
This shows Self-attention can be written as message and aggregation – i.e., it is a GNN!
Every node receives information from every other node. In other words, the graph is fully
connected.
At the moment, the transformer model we have is oblivious to token positions, since the attention
mechanism and the softmax weighting ignore index ordering. To fix this issue we need
positional encoding. For NLP tasks, each token $x_i$ in the input is concatenated with a
vector $p$ indicating its position, e.g. $[\cos i/N, \sin i/N]$. Then we use the concatenated vector
$[x, p]$ as the input to an attention layer instead of $x$.
Figure 8.3: Node features on a graph can be used as the inputs features for a transformer
1. Relative distances based on random walks from anchor set: This is particularly
strong for tasks that require counting cycles. Pick anchor vertices v1 , . . . , vl , and each
vertex v gets the position encoding
p := [d(v, v1 ), . . . , d(v, vl )]
Relative distances are useful for position-aware tasks but not structure-aware tasks.
2. Laplacian eigenvectors:
$$L := D - A$$
where $D$ is the diagonal degree matrix and $A$ the adjacency matrix.
Several Laplacian variants add degree information differently. The Laplacian matrix
captures the graph structure, and its eigenvectors inherit this structure.
Eigenvectors with small eigenvalues correspond to global structure, and those with large
eigenvalues correspond to local symmetries. We can calculate the eigendecomposition
of the Laplacian matrix $L = \Sigma \Lambda \Sigma^\intercal$ and use $\Sigma$ (with index order [data,
feature]) as the position encoding.
A simple task, such as deciding whether a graph has a cycle, can be solved by a GNN with
the assistance of the Laplacian eigenvectors.
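A sketch of Laplacian positional encodings, assuming a dense symmetric adjacency matrix and keeping the eigenvectors with the smallest non-trivial eigenvalues as per-node features (the number kept, k, is a free choice):

import numpy as np

def laplacian_positional_encoding(A, k=4):
    D = np.diag(A.sum(axis=1))
    L = D - A                                 # graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)      # ascending eigenvalues for symmetric L
    return eigvecs[:, 1:k + 1]                # skip the constant eigenvector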
Question: Does the Laplacian only encodes structure or both structure and
position?
Both. See Figure 8.4.
Finally, we need to find out how to embed the edge features x i,j . The only place in the
attention mechanism where pairs of vertices come in is during the computation of the
attention scores [ai,j ] = QK ⊺ . We can adjust this based on the edge features
$a_{i,j} \mapsto a_{i,j} + c_{i,j}$, i.e.
$$c_{i,j} := \begin{cases} w_e^\intercal x_{i,j} & \text{an edge } e \text{ exists between } i, j \\ \sum_k w_{e_k}^\intercal x_{e_k} & \text{a path } e_1, \ldots, e_n \text{ exists between } i, j \end{cases}$$
9 Machine Learning with Heterogeneous Graphs
Figure 9.1: People, conferences, and publications in academia can be represented by a
heterogeneous graph.
Question: Can we have multiple edges of different types between two nodes
Yes.
• In a graph with multiple relation types, different neural network weights $W_{R(e)}$ can
be used for different edge types:
$$h_v^{(l)} := \sigma\!\left(\sum_{u \to v} \frac{W^{(l)}_{R(u \to v)}\, h_u^{(l-1)}}{\deg_{\mathrm{in}} v}\right)$$
$$h_v^{(l)} := \sigma\!\left(\sum_{r \in R} \sum_{u \in N_r(v)} \frac{1}{c_{v,r}} W_r^{(l)} h_u^{(l-1)} + W_0^{(l)} h_v^{(l-1)}\right)$$
In message-aggregation form:
• Message:
$$m_{u,r}^{(l)} := \frac{1}{c_{v,r}} W_r^{(l)} h_u^{(l)}, \qquad m_v^{(l)} := W_0^{(l)} h_v^{(l)}$$
• Aggregation:
$$h_v^{(l+1)} = \sigma\!\left(\sum_{u \in N(v)} m_{u,R(u \to v)}^{(l)} + m_v^{(l)}\right)$$
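A sketch of this R-GCN layer, assuming PyTorch and one dense adjacency matrix per relation type (illustrative only):

import torch
import torch.nn as nn

class RGCNLayer(nn.Module):
    def __init__(self, d_in, d_out, num_rels):
        super().__init__()
        self.W_r = nn.ModuleList([nn.Linear(d_in, d_out, bias=False)
                                  for _ in range(num_rels)])
        self.W_0 = nn.Linear(d_in, d_out, bias=False)       # self transform

    def forward(self, H, adj_per_rel):
        out = self.W_0(H)                                    # W_0 h_v
        for A_r, W in zip(adj_per_rel, self.W_r):
            c = A_r.sum(dim=1, keepdim=True).clamp(min=1)    # c_{v,r} = |N_r(v)|
            out = out + (A_r @ W(H)) / c                     # (1/c_{v,r}) sum_u W_r h_u
        return torch.relu(out)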
Each relation has $L$ matrices $W_r^{(1)}, \ldots, W_r^{(L)}$. The size of each $W_r^{(l)}$ is $d^{(l+1)} \times d^{(l)}$. In total
this leads to rapid growth of the number of parameters w.r.t. the number of relations, so
overfitting may become an issue. Two methods of regularisation exist:
• Block Diagonal Matrices: Use $B$ diagonal blocks for $W_r$:
$$W_r := \begin{pmatrix} W_{r,1} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & W_{r,B} \end{pmatrix}$$
Limitation: only nearby neurons can interact via $W_r$. This reduces the number of
parameters from $d^{(l+1)} \times d^{(l)}$ to $\frac{d^{(l+1)}}{B} \times \frac{d^{(l)}}{B}$ per block.
• Basis/Dictionary Learning: Share weights across different relations.
We could represent the matrix of each relation $W_r$ as a linear combination of
learnable basis matrices:
$$W_r := \sum_{b=1}^{B} a_{r,b} V_b$$
Example
Consider the following graph:
(Diagram: a small example graph on nodes A–F whose edges are labeled with relation types 1, 2, 3.)
– Hits@k: Fix a value $k$; the count $|\{i : r_i < k,\ y_i = 1\}|$, where $y_i$ is the ground truth
label, measures the number of relevant hits among the top-$k$ ranked edges.
– Reciprocal Rank: $\sum_i y_i / r_i$ (a higher score is better)
Question: How can we generate negative edges for link prediction if the
graph is very dense?
The question of link prediction is ill-defined when the graph is dense: if there are
no places to insert new edges, prediction of new edges cannot fail.
Question: For negative sampling, how do you account for the imbalance of
edge types?
The prediction of edges $(u, r, v)$ is equivalent to sampling from the marginal distribution
of $v$ given $u, r$ fixed. This marginal distribution is not affected by the imbalance of
edges.
$$\mathrm{Attention}(Q, K, V) := \mathrm{softmax}\!\left(\frac{QK^\intercal}{\sqrt{d_k}}\right) V$$
where $Q$ is the query, $K$ is the key, and $V$ is the value. All three matrices have the shape
(batch size, $d_k$).
Recall the attention weights used when applying GAT to a homogeneous graph.
Without decomposition, the number of attention parameters can quickly overwhelm the
model, since $|T|$ node types and $|R|$ relation types produce $|T|^2 |R|$ different weight
matrices.
⁸ Hu et al. WWW ’20.
$$m_u^{(l)} := W^{\mathrm{msg}}_{\phi(u \to v)}\, N_{\tau(u)}\, h_u^{(l-1)}$$
where $W^{\mathrm{msg}}_{\phi(u \to v)}$ is a weight for each edge type and $N_{\tau(u)}$ is a linear head for each node type.
A layer of HGT is given by
$$h_v^{(l)} := \mathrm{Aggregate}\!\left(\mathrm{softmax}\!\left(\left\{\frac{\alpha_{v,u}}{\sqrt{d_k}} : u \in N(v)\right\}\right) \cdot \left\{m_u^{(l)} : u \in N(v)\right\}\right)$$
where $\alpha_{v,u}$ are the attention scores and the messages $m_u^{(l)}$ play the role of the values $V$.
On the ogbn-mag benchmark (predicting paper venues), HGT uses far fewer parameters
than R-GCN and performs better, even though its attention computation is more expensive.
Aggregation:
$$h_v^{(l)} := \bigoplus_{r \in R} \sum_{u \in N_r(v)} m_u^{(l)}$$
where $\oplus$ is concatenation.
10 Knowledge Graph Embeddings
Table 10.1: A summary of various knowledge graph completion methods. The last 5 prop-
erty columns represent the model’s capability to model symmetry, antisymmetry, inverse,
composition, and 1-to-n relationships, respectively.
Real world applications: FreeBase, Wikipedia, YAGO, etc. They are used for serving
information and question answering agents.
Common Characteristics:
Enumerating all the possible facts is impossible, but we can predict plausible but missing
links.
• TransE:
Intuition: for a triple $(h, r, t)$, $h + r \simeq t$ if the given fact is true. The scoring function
is $f_r(h, t) := -\|h + r - t\|$. TransE originated from an observation in language
models that the embeddings of words often have analogous relations.
• TransR:
Like TransE, but the entities are first mapped into the relation-specific space of $r$ by a projection matrix
$M_r \in \mathbb{R}^{k \times d}$.
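A sketch of the TransE score, assuming PyTorch tensors of entity and relation embeddings; the margin ranking loss shown is a common training choice rather than something prescribed by these notes:

import torch

def transe_score(ent_emb, rel_emb, h, r, t):
    # f_r(h, t) = -||h + r - t||
    return -torch.norm(ent_emb[h] + rel_emb[r] - ent_emb[t], p=2, dim=-1)

def margin_loss(pos_score, neg_score, margin=1.0):
    # push true triples above corrupted ones by at least `margin`
    return torch.clamp(margin - pos_score + neg_score, min=0).mean()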
There is not a single general embedding that works for all KGs; use a table such as Table 10.1 to select a model.
11 Reasoning in Knowledge Graphs
A path query has the form
$$q := (v_a, (r_1, \ldots, r_n))$$
where $v_a$ is the anchor entity and $(r_1, \ldots, r_n)$ are the path relations.
(Diagram: a path query starts at the anchor entity $v_a$ and follows the relations $r_1, \ldots, r_n$ to the unknown answer node.)
Answering queries seems easy: just traverse the graph. However, knowledge graphs are
incomplete, so one is not able to identify all the answer entities by traversal alone.
Can we first do KG completion and then traverse the completed probabilistic KG? No,
since the probabilistic graph is dense and the time complexity of traversing such a
graph is exponential in the query path length $L$.
A solution of this predictive query problem would be able to answer arbitrary queries
while implicitly accounting for the missing information.
q := v a + r 1 + · · · + r n
Then, we can query for node embeddings close to q and label them as the answer to the
query.
Question: Does the order of the path not matter in TransE since it is a
vector addition?
TransE would not be able to model ordered paths.
Since TransE can handle compositional relations, it can handle path queries by translating
multiple relations into a composition. TransR, DistMult, ComplEx cannot handle
composition and hence cannot be easily extended to handle path queries.
Can we answer more complex queries with a conjunction operation? If a conjunctive query
is $q = (q_1, q_2)$, then $[\![q]\!]_G = [\![q_1]\!]_G \cap [\![q_2]\!]_G$.
11.2 Query2Box
We have two problems to solve in conjunctive queries:
• Each intermediate node represents a set of entities. How can we represent it?
• How do we define the intersection operation in latent space when two queries have to
be simultaneously satisfied.
Question: Can we use the TransE approach and embed the paths separately
and take the intersection?
Yes, but representing an entire set with a single point is difficult and we can easily
come up with counterexamples.
Settings
Let $d$ be the embedding dimensionality, $|V|$ the number of entities, and $|R|$ the number of
relations.
In Query2Box:
Figure 11.6: Box Embeddings for biomedicine that encloses some answer entities
• Relation embeddings ($2d|R|$ parameters): Each relation takes a box and produces a
new box. This is the projection operator $\mathcal{P} : \mathrm{Box} \times R \to \mathrm{Box}$. There is
one projection operator for each relation type, where the centre and offset of the box are moved
by the centre and offset of the relation embedding.
• Intersection operator I: Inputs are boxes and output is a box. The centre of the new
box should be “close” to the centres of the input boxes, and the offset should shrink
since the intersected box is smaller than the size of all previous boxes. It does not
have to be a strict geometric intersection.
We can define the result of the intersection to be
$$\mathrm{centre}(q_\cap) := \sum_i w_i \odot \mathrm{centre}(q_i), \qquad w_i := \frac{\exp(f_c(\mathrm{centre}(q_i)))}{\sum_j \exp(f_c(\mathrm{centre}(q_j)))}$$
where $\odot$ is the Hadamard (elementwise) product. The score function $f_q(v)$ is defined as the
negative distance of $v$ to $q$: $f_q(v) := -d_{\mathrm{box}}(q, v)$.
The intuition is that the distance inside the box is shrunk relative to the outside.
The distance used in the slides is the Manhattan ($L_1$) distance, which is very natural
since balls in the $L_1$ norm are boxes. Other distances can be used as well.
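A sketch of the intersection operator, assuming PyTorch; f_c is a hypothetical stand-in network that scores each box centre, and the elementwise minimum is one simple way to realise "the offset should shrink":

import torch

def intersect_boxes(centres, offsets, f_c):
    # centres, offsets: (num_boxes, d)
    w = torch.softmax(f_c(centres), dim=0)      # weights w_i from the formula above
    centre = (w * centres).sum(dim=0)           # sum_i w_i ⊙ centre(q_i)
    offset = offsets.min(dim=0).values          # no larger than any input box
    return centre, offset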
Figure 11.9: Converting a query to disjunctive normal form where unions are done at the
last step
We would like the question answering model to include points belonging to q ∈ Q but
exclude the points belonging to q ̸∈ Q. When q1 , . . . , qd have non-overlapping answers, a
dimensionality of Θ(d) is needed to handle all OR queries.
For arbitrary real world queries, this number d is often very large, so we cannot embed
and-or queries in low dimensional space. A solution to this is to leave the union operation
to the very last step. This is the disjunctive normal form of the query, and any query
can be written in the form q = q1 ∨ · · · ∨ qn .
We can use the following objective to train the model to score answer entities $v \in [\![q]\!]$ above
non-answers $v' \notin [\![q]\!]$ for Disjunctive Normal Form (DNF) queries:
$$\ell(f) := \mathbb{E}_{q \sim \mathrm{Query}(G),\, v \in [\![q]\!],\, v' \notin [\![q]\!]}\left[-\log \sigma(f_q(v)) - \log(1 - \sigma(f_q(v')))\right]$$
The queries $q$ are generated from query templates. A query template outlines the
topological structure of a query. Generating a grounded query from a template involves
tracing back from the answer node of the query template and grounding each query edge
backwards.⁹
Question: What if, instead of one box, we use beam search to avoid sparsity?
Good idea; it would be interesting to try out.
12 GNNs for Recommender Systems
• Edges $E$ connect users and items, indicate user-item interactions, and are often
associated with a timestamp.
Given past user-item interactions, we wish to predict new items each user will interact with
in the future. This is a link prediction problem. For each u ∈ U, v ∈ V , the model should
generate a score f (u, v) which ranks the recommendations. Since |V | is large, evaluating
every user-item pair (u, v) is infeasible. The solution to this problem is to break down the
recommender into two stages.
⁹ See the SMORE paper for details on this subject.
Figure 12.1: A recommender system predicts possible user-item interaction edges given past edges.
• Binary Loss: Define positive/negative edges. The set of positive edges $E$ is
observed, and the set of negative edges is $E_- := \{(u, v) : (u, v) \notin E\}$.
The binary loss is
$$-\frac{1}{|E|} \sum_{(u,v) \in E} \log \sigma(f_\theta(u, v)) - \frac{1}{|E_-|} \sum_{(u,v) \in E_-} \log(1 - \sigma(f_\theta(u, v)))$$
during training, both sums are approximated with mini-batches of positive and
negative edges.
The binary loss pushes the scores of positive edges higher than those of negative edges. An
issue is that the scores of all positive edges are pushed higher than the scores of all
negative edges, even though the recommendations for a user $u$ are not affected by the
recommendations for another user, so this unnecessarily penalises the model.
Question: Why can’t we just use a hinge loss to plateau the loss when
the ranking is correct?
The main goal of the BPR loss is to not compare scores for different users,
because the scores across users do not matter in the rankings.
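A sketch of a BPR-style loss in PyTorch, comparing scores only within each user as discussed above (the logistic form follows Rendle et al. 2009):

import torch

def bpr_loss(pos_scores, neg_scores):
    # pos_scores: (batch,)    score f(u, v_pos) for one positive item per user
    # neg_scores: (batch, k)  scores f(u, v_neg) for k sampled negative items
    diff = pos_scores.unsqueeze(1) - neg_scores      # compare within each user only
    return -torch.log(torch.sigmoid(diff)).mean()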
Question: Why sample many more negative edges than positive edges?
The problem is inherently imbalanced. Each user only interacts with a tiny
subsets of all items, so a lot of negative examples are needed.
¹⁰ The term “Bayesian” is not essential to the loss definition. The original paper (Rendle et al. 2009)
considers a Bayesian prior over the parameters (essentially acting as parameter regularization), which we omit
here.
$$h_u^{(k+1)} := \mathrm{Combine}\!\left(h_u^{(k)},\ \mathrm{Aggregate}(\{h_v^{(k)} : v \in N(u)\})\right)$$
12.3 LightGCN
NGCF jointly learns two kinds of parameters: Shallow user/item embeddings, and
parameters for the GNN. The embeddings are already quite expressive. They are learned
for every user and item node and the total number of parameters in them is O(N D) (where
N is the number of nodes), whereas the number of parameters in the GNN is only $O(D^2)$.
The GNN parameters may not be very essential.
We can simplify the GNN used in NGCF. The adjacency and embedding matrices of an
undirected bipartite graph can be expressed as
$$A = \begin{bmatrix} 0 & R \\ R^\intercal & 0 \end{bmatrix}, \qquad E = \begin{bmatrix} E_U \\ E_V \end{bmatrix}$$
where $E_U$ is the user embedding and $E_V$ the item embedding.
Recall the diffusion matrix of the Correct and Smooth method (Section 19.2). Let $D$ be the
degree matrix of $A$ and $\tilde{A} := D^{-1/2} A D^{-1/2}$ the normalised adjacency. Each layer of a GCN's
aggregation can then be written in the form
$$E^{(k+1)} = \mathrm{ReLU}\!\left(\tilde{A}\, E^{(k)} W^{(k)}\right)$$
where $W^{(k)}$ is a learnable linear transformation.
LightGCN simplifies this by removing the ReLU non-linearity. Iterating the layers, we see
that the weight matrices can be collapsed and the embedding after $K$ layers is simply
$\tilde{A}^K E^{(0)}$ (up to a single linear transform). Removing the ReLU significantly simplifies the GCN.
The LightGCN algorithm applies $E \leftarrow \tilde{A} E$ a total of $K$ times, and each multiplication diffuses the
embeddings to their neighbours. The matrix $\tilde{A}^K$ is dense and is not stored in memory;
instead, the multiplication step is executed $K$ times. We could also consider multi-scale diffusion:
$$E := \sum_{k=0}^{K} \alpha_k E^{(k)} = \sum_{k=0}^{K} \alpha_k \tilde{A}^k E^{(0)}$$
where $\alpha_0 E^{(0)}$ acts as a self-connection. For simplicity, LightGCN uses $\alpha_k := 1/(k + 1)$.
Intuitively, the simple diffusion propagation encourages embeddings of similar users and
items to be similar. LightGCN is similar to GCN and C&S, except that self-loops are not
added, and the final embedding is the average of the layer embeddings.
LightGCN performs better than shallow encoders but is also more computationally costly.
The simplification from NGCF leads to better performance.
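A sketch of the LightGCN diffusion in numpy, using the normalised adjacency and the α_k = 1/(k+1) weighting stated above (illustrative only):

import numpy as np

def lightgcn_embeddings(A, E0, K=3):
    # A: (N, N) bipartite adjacency, E0: (N, D) initial shallow embeddings
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    nz = deg > 0
    d_inv_sqrt[nz] = deg[nz] ** -0.5
    A_tilde = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]   # D^-1/2 A D^-1/2
    E, out = E0, E0.copy()                    # alpha_0 = 1
    for k in range(1, K + 1):
        E = A_tilde @ E                       # one diffusion step E <- Ã E
        out = out + E / (k + 1)               # alpha_k = 1/(k+1) as in the text
    return out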
Figure 12.2: “Bed rail” may look like “Garden fence” but they are rarely adjacent in the
graph
12.4 PinSAGE
The PinSAGE algorithm was developed for Pinterest to recommend pins and is the largest
industry deployment of a GCN. Each pin embedding unifies visual, textual, and graph
information. It works for fresh content and is available a few seconds after pin creation.
The task of pin recommendation is to learn node embeddings zi such that the distance of
similar pins are shorter than the distance of dissimilar pins. e.g.
$d(z_{\mathrm{cake1}}, z_{\mathrm{cake2}}) < d(z_{\mathrm{cake1}}, z_{\mathrm{sweater}})$. There are 1B+ repin pairs from the Related Pins surface,
which capture semantic relatedness.
The graph has tens of billions of nodes and edges. In addition to the GNN model, the PinSAGE
paper introduces several methods to scale the GNN:
• Shared negative samples across users:
Recall that in the BPR loss, for each user $u^* \in \hat{U}$, we sample one positive item and a set
of negative items $V_-$. Using more negative samples per user improves
recommendation but is also expensive: the number of computation graphs is
$|\hat{U}| \cdot |V_-|$.
The key idea is that the same set of negative samples can be used across all users in
the mini-batch. This saves computational cost by a factor of $|\hat{U}|$.
• Curriculum learning:
Make the negative samples gradually harder in the process of training. At the nth
epoch, we add n − 1 hard negative samples. For each user node, the hard negatives
are item nodes that are close but not connected to the user node in the graph.
13 Relational Deep Learning
Historically, computer vision (CV) was done with handcrafted features. For
example, the wheel and window of a car could be detected by feature detectors.
Modern computer vision mostly uses end-to-end learning: the input data
does not undergo manual feature engineering before being fed to the machine learning model.
We would like to apply deep learning on end-to-end relational database tasks. This has 4
benefits:
A classical method of machine learning on databases is Tabular ML. e.g. decision trees
on single tables. The advantage of using deep learning over Tabular ML is being able to
operate on multiple tables.
Another way is statistical relational learning (SRL), which learns a distribution
over relational structures. RDL is a scalable, expressive inheritor of SRL.
Different from knowledge graphs, each entity can also have features.
Most tasks are temporal: User’s label and the database changes all the time. To train a
GNN for such a task, we define a training table containing (entityId, time, label). This
could be used for Classification, Regression, or Multi-class categorization tasks. The time
label is essential to temporal prediction tasks. An entity may have different labels at
different times. In the churn example, we create a training table with columns
(user, time, churn), and attach it to the database.
Each node’s neighbourhood defines a computational graph. The computation graph for
each node is time-dependent, and so are the message and aggregation process.
$$R \bowtie S := \{r \cup s : r \in R,\ s \in S,\ r.\mathrm{key} = s.\mathrm{key}\}$$
GNNs' aggregation and message-passing mechanisms allow them to learn SQL Join and
Agg operations. GNNs can perform multi-hop reasoning and discover patterns across
multiple samples.
13.3 RelBench
The paper and benchmark can be found online under the name “RelBench”.
RelBench is a collection of databases and tools for evaluating GNNs. We can use this to
compare the efficacy of GNNs against that of an expert data scientist. For the problem “Will
a user be active in the next 6 months?”, the data scientist's workflow consists of
4h Exploratory Data Analysis (EDA): Observe plots from individual and joined tables
0.5h Feature ideation: Come up with possibly indicative features
5h SQL Query writing
2h XGBoost hparam sweep
1h SHAP (feature importance analysis)
GNNs consistently outperform human experts in tasks in RelBench.
14 Advanced Topics in Graph Neural Networks
3. We can train a model which can perform different and diverse tasks via
in-context learning: The model can perform a task using a description for the task:
fi (x) ≃ g(x, i)
The power of in-context learning is few-shot prompting: Prompting the pre-trained model
with only a few examples is sufficient for the model to run other tasks. This is common for
large language models (LLMs). Note that during few-shot prompting, the model’s
gradients are not updated.
Performing in-context learning on graphs is more difficult than text. The main problems
are
1. How to represent the few-shot prompt for different graph tasks in the same input
format, so that it can be consumed by one shared model?
2. How to pretrain a model that can solve any task in this format?
PRODIGY is a method of in-context learning for graphs. To solve the first problem, we
use prompt graphs, which are meta hierarchical graphs.
Consider the task of link prediction. Suppose we have prompt graphs $G_i := (V_i, E_i)$, where
each $V_i$ has two specially labeled nodes $s_i, t_i$ and the ground truth label $y_i$ for the edge
$(s_i, t_i)$, and we want to predict the link $(s_0, t_0)$ on $G_0 := (V_0, E_0)$. PRODIGY creates the
prompt graph
$$V := \{y_i : i\} \cup \{v_i : i\} \cup V_0 \cup \bigcup_i V_i, \qquad E := \{(v_i, y_i), (v_i, G_i) : i\} \cup E_0 \cup \bigcup_i E_i$$
where the first two sets form the task graph and the remaining sets form the data graph.
The task of the GNN is to predict the link (v0 , y) over possible labels y. This converts a
link/node/graph level prediction task into a link-level prediction task with a consistent
graph format.
The task graph contains hierarchical edges (vi , Gi ). This could be processed by having one
GNN for each prompt example.
To pre-train PRODIGY, we generate data in the PromptGraph format and pretrain our
model over them. There are two ways to do this:
1. Neighbour Matching: Train a task to classify which neighborhood a node is in,
where each neighborhood is defined by other nodes in it
2. Multi-Task: Train the model with multiple task data combined into PromptGraph
data
Text and image features can be incorporated using language model embeddings or CNNs.
If we can quantify uncertainty on GNNs, we can stop trusting a model's results when its
uncertainty is high. Instead of producing a single output, the model can produce a set of plausible outputs.
Settings
We want to construct provable prediction sets with confidence α < 1 from test data
Dtest .
Set outputs enable a rigorous notion of reliability. The coverage of a prediction set is
$$\mathrm{Coverage} := \frac{1}{|D_{\mathrm{test}}|} \sum_{i \in D_{\mathrm{test}}} \mathbb{1}(Y_i \in C(X_i))$$
Consider a categorical prediction task where each category $y$ has confidence $\hat{\mu}(x)_y$, often
based on softmax scores. We define the non-conformity score function $V : X \times Y \to \mathbb{R}$ to be
$$V(x, y) = 1 - \hat{\mu}(x)_y$$
When the softmax score is high, the non-conformity score is low, and the model is more
confident. We calibrate the model over $n$ data points $(x_i, y_i)$ and take the quantile
$$\hat{\eta} := \mathrm{quantile}_i\!\left(V(x_i, y_i),\ (1-\alpha)\Big(1 + \frac{1}{n}\Big)\right)$$
where $\alpha$ is the prescribed confidence level. For each new point $x_{n+1}$ we can then construct the
prediction set $C(x_{n+1}) := \{y : V(x_{n+1}, y) \le \hat{\eta}\}$, which satisfies
$$P(y_{n+1} \in C(x_{n+1})) \ge 1 - \alpha$$
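A sketch of this calibration step in numpy, with the non-conformity score 1 − μ̂(x)_y (shapes and names are illustrative):

import numpy as np

def conformal_threshold(probs_cal, y_cal, alpha=0.1):
    # probs_cal: (n, num_classes) softmax scores on calibration points, labels y_cal
    n = len(y_cal)
    scores = 1.0 - probs_cal[np.arange(n), y_cal]      # V(x_i, y_i)
    q = min(1.0, (1 - alpha) * (1 + 1 / n))            # finite-sample correction
    return np.quantile(scores, q)

def prediction_set(probs_test, eta):
    # include every label whose non-conformity score is below the threshold
    return [np.where(1.0 - p <= eta)[0] for p in probs_test]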
This guarantee requires the calibration and test points $z_i := (x_i, y_i)$ to be exchangeable.
Exchangeability is difficult for graph data: there is dependency between test and
calibration nodes (i.e. they are not IID), and message passing during training includes
calibration and test nodes.
Having coverage is not enough. We also need to ensure the coverage interval is efficient. An
infinitely large interval has 100% coverage but is useless. Inefficiency is defined as
$$\mathrm{Inefficiency} := \frac{1}{|D_{\mathrm{test}}|} \sum_{i \in D_{\mathrm{test}}} |C(x_i)|$$
GNN prediction scores are not optimised for conformal efficiency. We can design a loss
function which approximates this metric. The prediction set size proxy is
$$\mathcal{L}_{\mathrm{set}} := \frac{1}{N} \sum_{i \in V_{\mathrm{ct}}} \sum_{y \in Y} \sigma\!\left(\frac{\hat{\eta} - V(x_i, y)}{\tau}\right)$$
and the prediction interval length proxy is
$$\mathcal{L}_{\mathrm{interval}} := \frac{1}{N} \sum_{i \in V_{\mathrm{ct}}} \left(\tilde{\mu}_{1-\alpha/2}(x_i) + \hat{\eta}\right) - \left(\tilde{\mu}_{\alpha/2}(x_i) - \hat{\eta}\right)$$
14.3 Robustness
Deep convolutional neural networks are vulnerable to adversarial attacks.
Adversarial examples are also reported in natural language processing and audio pro-
cessing.
The existence of adversarial examples prevents the reliable deployment of deep learning
models in the real world, as adversaries may try to actively interfere with the neural networks.
Deep learning models are often not robust.
• Model: GCN
• The attacker has access to A (the adjacency matrix), X (the feature matrix),
Y (the label matrix), and the learning algorithm.
• The attacker can modify (A, X) to (A′ , X ′ ) with the assumption (A′ , X ′ ) ≃
(A, X). The manipulation is unnoticeably small.
• cv (resp. c′v ) is the class label of node v predicted by GCN with parameters θ
(resp. θ ′ ).
Question: Why we assume only the adjacency matrix and feature matrix
can be changed, but not the labels?
If the attacker can change the label the adversary problem becomes very easy.
Question: Does the attacker need to know the structure of the model itself?
Yes.
Attack possibilities:
2. Pick one which obtains the highest difference in the log-probabilities indicated by the
score function
GCN is not robust to direct adversarial attacks but it is somewhat robust to indirect and
random attacks.
15 Foundation Models for Knowledge Graphs
Suppose we have a knowledge graph G := (V, E) where the set of all relations is R.
The edges are triples (h, r, t), where head h is related via r to tail t. Suppose that
nodes and edges are associated with shallow embeddings (not learning a GNN yet).
Given a triple $(h, r, t)$, the goal is that, in the embedding space, $(h, r)$ should have an
embedding similar to that of $t$.
In comparison with natural language tasks, each token in a language corpus is a node of
the same type. The vocabulary is homogeneous. Knowledge graphs have heterogeneous
vocabulary. In natural language tasks, the words are connected sequentially, which is not
true for knowledge graphs.
There are two essential tasks for Knowledge Graphs
AR is a weighted adjacency matrix. In practice we are looking for entities that serve as
head and/or tail for relation pairs. A GNN can then act on this relation graph and
generate embeddings for each relation.
InGram is similar to the transformer architecture: Transformer learns a fully connected
graph between tokens. InGram learns a similar graph between all relations.
The next step is entity-level message passing on the original $(V, E)$ knowledge graph. We
use an attention mechanism similar to the one in Section 4.1:
$$h_v^{*(l+1)} := \sigma\!\left(\sum_{r \in R} \sum_{u \in N_r(v)} W_1^{(l)} h_u^{(l)} + W_0^{(l)} h_v^{(l)}\right)$$
$$h_v^{\dagger(l+1)} := \sigma\!\left(\sum_{r \in R} \sum_{u \in N_r(v)} \alpha_{urv} \left(W_1^{(l)} h_u^{(l)} + W_0^{(l)} h_v^{(l)}\right)\right)$$
$$h_v^{(l+1)} = h_v^{*(l+1)} + h_v^{\dagger(l+1)}$$
where $\alpha_{urv}$ are attention weights.
$$f(h, r, t) := z_h^\intercal\, \mathrm{Diag}(W w_r)\, z_t$$
where $w_r$ is the embedding of relation $r$, trained with the usual link-prediction loss.
InGram can be used for inductive link prediction across new entities and new relations. It
could be trained into a foundation model.
16 Deep Generative Models for Graphs
3. Deep graph generative models: Learn the graph formation process from data.
In this lecture we will cover deep graph decoders: models which produce a graph
structure from an embedding.
• The data distribution pdata (x) is not known to us, but we have samples xi ∼
pdata (x).
To bring pmodel (x|θ) close to pdata (x), we use the principle of maximum likelihood, where
we find θ such that the likelihood of drawing xi ∼ pmodel (x|θ) is the greatest.
$$\theta^* := \arg\max_\theta \mathbb{E}_{x \sim p_{data}}\big[\log p_{model}(x|\theta)\big] \simeq \arg\max_\theta \sum_i \log p_{model}(x_i|\theta)$$
To sample from pmodel (x|θ), there are a couple of approaches. The most common
approach is to sample from a noise latent distribution z ∼ N (0 , I ), and transform the
noise with a function f to obtain x := f (z|θ). The distribution of x is the pushforward
of N (0 , I ). In deep generative models, f (·) is a neural network.
To generate a sequence, we can use auto-regressive models, where the chain rule is used to factorise the joint distribution of $x_1, \dots, x_n$:
$$p_{model}(x;\theta) = \prod_{t=1}^{n} p_{model}(x_t \mid x_1, \dots, x_{t-1};\theta)$$
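For instance, with $n = 3$ the factorisation reads
$$p_{model}(x_1, x_2, x_3;\theta) = p_{model}(x_1;\theta)\, p_{model}(x_2 \mid x_1;\theta)\, p_{model}(x_3 \mid x_1, x_2;\theta),$$
and in GraphRNN (below) each $x_t$ will be the adjacency-matrix column describing how node $t$ connects to the previously generated nodes.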
Figure 16.2: Generating a graph using GraphRNN is plotting its adjacency matrix column
by column
A graph together with a node ordering is a sequence of sequences. The node ordering is randomly selected. This generates an adjacency matrix column by column. One drawback of this approach is the huge modeling capacity required to generate longer and longer columns. We have transformed a graph generation problem into a sequence generation problem.
An RNN cell takes inputs st−1 (previous hidden state) and xt , and outputs yt and st
(next hidden state). It is defined as
$$s_t := \sigma(W \cdot x_t + U \cdot s_{t-1}), \qquad y_t := V \cdot s_t$$
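A minimal sketch of this cell in code; the sigmoid nonlinearity and the matrix shapes follow the definition above, and nothing here is specific to GraphRNN:

```python
import numpy as np

def rnn_cell(x_t, s_prev, W, U, V):
    """One step of the vanilla RNN cell: s_t = sigma(W x_t + U s_{t-1}), y_t = V s_t."""
    s_t = 1.0 / (1.0 + np.exp(-(W @ x_t + U @ s_prev)))  # hidden state update
    y_t = V @ s_t                                         # output at step t
    return y_t, s_t
```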
More expressive RNN cells such as GRU and LSTM have been developed to combat
vanishing gradient problems.
An RNN can be used to generate sequences by feeding back $x_{t+1} := y_t$ (i.e. using the previous output as the next input). The sequence is initialised by a special SOS (start of sequence) symbol, and an EOS (end of sequence) symbol is emitted (as an extra RNN output) to signal halting the generation process.
An RNN modeled in this fashion is completely deterministic. To introduce stochasticity into the model, each output $y_t = p_{model}(x_{t+1} \mid x_1, \dots, x_t;\theta)$ can be a categorical probability distribution, and we sample $x_{t+1} \sim y_t$.
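Putting the last two paragraphs together, stochastic autoregressive generation with SOS/EOS handling might look like the following minimal sketch; the `step` callable standing in for one application of the RNN cell plus output layer is a hypothetical name:

```python
import numpy as np

def generate(step, s0, sos, eos, max_len=100, rng=None):
    """Autoregressively sample a sequence from an RNN-like model.

    step(x_t, s_prev) -> (probs, s_t): probs is a categorical distribution
    over the next symbol, s_t the next hidden state.  Generation starts from
    the SOS symbol and halts once EOS is sampled (or max_len is reached).
    """
    rng = rng or np.random.default_rng()
    x, s, out = sos, s0, []
    for _ in range(max_len):
        probs, s = step(x, s)                 # y_t = p(x_{t+1} | x_1..x_t)
        x = rng.choice(len(probs), p=probs)   # sample x_{t+1} ~ y_t
        if x == eos:
            break
        out.append(x)
    return out
```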
A sequence generation RNN can be trained using teacher forcing, where the inputs to the RNN are forced to be the ground-truth sequence and the loss is computed between the (shifted) ground-truth sequence and the RNN output.
RNNs are trained using Backpropagation Through Time (BPTT) which accu-
mulates gradients across time steps.
GraphRNN has a node-level RNN and an edge-level RNN. The node-level RNN generates the initial state for the edge-level RNN, and the edge-level RNN sequentially predicts whether the new node will connect to each of the previous nodes. The edge-level RNN at node-level step $i$ outputs a probability $y_j$ that nodes $i, j$ have an edge; given the ground-truth label $\hat y_j$, it is trained using binary cross-entropy:
$$\mathcal L := -\big( \hat y_j \log(y_j) + (1 - \hat y_j) \log(1 - y_j) \big)$$
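A minimal numpy version of this per-edge binary cross-entropy; `y_pred` are the edge-level RNN's output probabilities for one adjacency column and `y_true` the ground-truth column (the clipping constant is an implementation detail, not from the lecture):

```python
import numpy as np

def edge_bce_loss(y_pred, y_true, eps=1e-12):
    """Binary cross-entropy between predicted edge probabilities and 0/1 targets."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
```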
Question: Could we have vanishing gradients in GraphRNN?
If the generation sequence is too long this could be a problem.
Question: If the graph grows in multiple directions, how can the model
build one part of the graph and then another part of the graph?
(#610)
Not sure if you have a specific context in mind, but one way to mitigate forgetting other branches of a graph could be to use attention networks. Attention would help with the forgetting, and depending on the problem you can tweak the network to address the particular kind of forgetting you are facing.
Some models can generate larger graph structures (e.g. a clique) at a time and this
mitigates some of the issue.
Figure 16.3: Ordering the nodes using BFS reduces the number of memory steps required
to generate the graph
• GCPN uses GNN to predict the generation action (more expressive, but takes longer
time to compute)
Steps in GCPN:
1. Supervised training: Train policy by imitating the action given by real observed
graphs.
17 Geometric Graph Learning
A graph $G = (A, S)$ is a set $V$ of $n$ nodes connected by edges. Each node has scalar attributes (e.g. atom type). $A$ is the adjacency matrix and $S \in \mathbb{R}^{|V| \times f}$ is the matrix of scalar node features.
A geometric graph is a graph $G = (A, S, R)$ where each node is additionally embedded in $d$-dimensional Euclidean space, i.e. $R \in \mathbb{R}^{|V| \times d}$.
Molecules can be represented as a graph G with node features si (atom type, charges) and
edge features ai,j (valence bond type). Sometimes we also know the 3D positions of each
node r i .
Geometric graphs lead to a variety of GNN models: Geometric GNN, Geometric
Generative model.
To design a GNN which processes geometric graphs, we need to overcome some obstacles. A change of the coördinate system used to describe the graph geometry transforms the node coördinates, and the output of a traditional GNN would be affected by this transformation. We would therefore like the GNN to be aware of the symmetries of the coördinate system (e.g. rotations and translations).
Let a symmetry group act on the input space $X$ via $\rho_X$ and on the output space $Y$ via $\rho_Y$. A function $F : X \to Y$ is
• Invariant if $F \circ \rho_X = F$;
• Equivariant if $F \circ \rho_X = \rho_Y \circ F$.
Figure 17.1: A pair of molecules exhibiting geometric isomerism: they have identical scalar quantities (distances and angles) but are distinguished by directional (normal) and geometric information.
$$\Delta v_i^{(l)} := \sum_j v_j^{(l)} \odot \phi_{vv}(s_j^{(l)}) \odot W_{vv}(\|r_{i,j}\|) + \sum_j \phi_{vs}(s_j^{(l)}) \odot W_{vs}(\|r_{i,j}\|)\, \frac{r_{i,j}}{\|r_{i,j}\|}$$
where ϕ, W are neural networks. This passes invariant scalar messages and equivariant
vector messages through each layer, thus keeping the equivariant properties.
18 Fast Neural Subgraph Matching and Counting
• Significant: more frequent than expected, e.g. compared to randomly generated graphs
What is the distribution of such random graphs?
Motifs help us understand how graphs work and make predictions based on their presence or absence in a graph dataset.
Let GQ be a small graph and GT be a target graph. There are two definitions of frequency
of GQ in GT .
If the dataset contains multiple graphs, we can treat the dataset as a giant graph GT with
disconnected components corresponding to individual graphs.
To define significance, we need to have a null-model (point of comparison). Subgraphs that
occur in a real network much more often than in a random network have functional
significance.
Methods of generating random graphs:
• A configuration model graph: Based on a real graph $G^{real}$, gather its nodes' degree sequence and create nodes with "spokes" corresponding to their degrees. Then randomly pair up the spokes. This results in a graph $G^{rand}$ with the same degree sequence as $G^{real}$.
• A switching graph: Start from a given graph $G$ and repeat a switching step $Q \cdot |E|$ times, where $Q$ is large (e.g. 100). The switching step selects a pair of edges $(a, b), (c, d)$ at random and exchanges their endpoints, giving $(a, d), (c, b)$. Perform the exchange only if no multi-edges or self-edges are generated (see the sketch after this list).
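A minimal sketch of the switching null model described in the last bullet; the edge-list representation, the undirected-graph assumption, and the default Q are illustrative choices:

```python
import random

def switching_null_model(edges, Q=100, seed=0):
    """Degree-preserving randomisation by repeated edge switching.

    edges: list of undirected edges (u, v) of the original graph.
    Each step rewires (a, b), (c, d) to (a, d), (c, b) unless that would
    create a self-loop or a multi-edge, so node degrees are preserved.
    """
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    edge_set = {frozenset(e) for e in edges}
    for _ in range(Q * len(edges)):
        i, j = rng.randrange(len(edges)), rng.randrange(len(edges))
        (a, b), (c, d) = edges[i], edges[j]
        if len({a, b, c, d}) < 4:                        # avoid self-loops
            continue
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in edge_set or new2 in edge_set:         # avoid multi-edges
            continue
        edge_set -= {frozenset((a, b)), frozenset((c, d))}
        edge_set |= {new1, new2}
        edges[i], edges[j] = (a, d), (c, b)
    return edges
```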
This creates a randomly wired graph with the same node degrees as the original.
Question: If you are rearranging the edges, does it also preserve clus-
tering coëfficient or centrality?
When you generate such a random graph you have to decide what properties not
to preserve. The random switching process destroys the local structure so it has
no reason to preserve clustering coëfficient.
Motifs are over-represented in a real graph compared to a random graph, and statistical tools can be used to evaluate the significance of their occurrence counts. The Z-score
$$Z_i := \frac{N_i^{real} - \bar N_i^{rand}}{\operatorname{Std}\big(N_i^{rand}\big)}$$
where $N_i^{real}$ is the count of motif $i$ in the real graph $G^{real}$ and $N_i^{rand}$ is the (random-variable) count of motif $i$ in the random graphs, measures this significance. The network significance profile (SP) is the vector of normalised Z-scores:
$$SP_i := Z_i \Big/ \sqrt{\sum_j Z_j^2}$$
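Given motif counts for the real graph and for an ensemble of random graphs (e.g. generated by the switching model above), the Z-scores and the SP vector can be computed directly; a minimal sketch:

```python
import numpy as np

def significance_profile(real_counts, rand_counts):
    """Z-scores and network significance profile (SP) from motif counts.

    real_counts: shape (M,) counts of each motif in G_real.
    rand_counts: shape (R, M) counts of each motif in R random graphs.
    """
    real_counts = np.asarray(real_counts, dtype=float)
    rand_counts = np.asarray(rand_counts, dtype=float)
    z = (real_counts - rand_counts.mean(axis=0)) / rand_counts.std(axis=0)
    sp = z / np.sqrt(np.sum(z ** 2))   # normalise the Z-score vector
    return z, sp
```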
Figure 18.3: Examples of SP: Networks from the same domain have similar significance
profiles
A GNN can be used to predict subgraph isomorphism. We are going to work with the node-anchored definition of frequency. We wish to generate embeddings from the anchored target neighbourhood and from the query graph, and the comparison between the embeddings leads to a binary label. The intuition of this method is to exploit the geometric structure of the embedding space to capture properties of subgraph isomorphism.
The algorithm first decomposes the input graph $G_T$ into neighbourhoods. Then each neighbourhood is embedded into the embedding space, and the comparison is executed in this embedding space.
1. For each node $v$ in $G_T$, obtain a $k$-hop neighbourhood around the anchor $v$ (e.g. using BFS)
Figure 18.5: Order satisfies transitivity, anti-symmetry, and closure under intersection.
Question: Can we break a big query into smaller anchored queries and find
those smaller queries in the engine?
This is an interesting problem to research.
How can we design a loss function to ensure the GNN learns the ordering? We design loss functions based on the order constraint:
$$\forall i \in \{1, \dots, D\}.\;\; z_q[i] \le z_t[i] \iff G_Q \subseteq G_T$$
where $z_q$ is the query embedding, $z_t$ the target embedding, $i$ ranges over embedding dimensions, and $\subseteq$ is the subgraph relation. The GNN is trained with a max-margin loss, i.e. the penalty is the square of the amount of violation of the order constraint:
$$E(G_q, G_t) := \sum_{i=1}^{D} \max\big(0,\, z_q[i] - z_t[i]\big)^2$$
To learn such embeddings, we generate training examples $(G_q, G_t)$ such that $G_q \subseteq G_t$ with probability $1/2$, and we minimise
$$L(G_q, G_t) := \begin{cases} E(G_q, G_t) & G_q \subseteq G_t \\ \max\big(0,\, \alpha - E(G_q, G_t)\big) & G_q \not\subseteq G_t \end{cases}$$
where $\alpha > 0$ is a margin hyperparameter.
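A minimal PyTorch-style sketch of the violation energy $E$ and the max-margin loss above; the tensor shapes and batching convention are assumptions:

```python
import torch

def order_violation(z_q, z_t):
    """E(G_q, G_t): squared violation of the order constraint z_q <= z_t."""
    return torch.clamp(z_q - z_t, min=0).pow(2).sum(dim=-1)

def max_margin_loss(z_q, z_t, is_subgraph, alpha=1.0):
    """Max-margin loss over a batch of (query, target) embedding pairs.

    is_subgraph: boolean tensor, True where G_q is a subgraph of G_t.
    Positive pairs push E towards zero; negative pairs push E above alpha.
    """
    e = order_violation(z_q, z_t)
    return torch.where(is_subgraph, e, torch.clamp(alpha - e, min=0)).mean()
```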
1. Initialize $S := \{v\}$, $V := \emptyset$.
2. Let $N(S)$ be the set of all neighbours of nodes in $S$. At every step, sample 10% of the nodes in $N(S) \setminus V$ and put them in $S$; put the remaining nodes in $V$.
3. After $K$ steps, take the subgraph of $G$ induced by $S$, anchored at $v$.
If you keep the query size constant but increase the number of hops (bigger neighbourhoods), performance could drop.
During inference, embed anchored query GQ and target GT graphs. Output whether the
query is a node-anchored subgraph of the target using the predicate E(GQ , GT ) < ϵ.
$$S_E := \{\, z_Q : z_Q \preceq z_N,\; G_N \subseteq G_T \,\}$$
where $z_Q$ is the embedding of the motif $G_Q$ and $S_E$ is the supergraph region determined by the embeddings $z_N$ of the node-anchored neighbourhoods $G_N$ of $G_T$.
Figure 18.7: Motif Walk and the supergraph region representing the complement of total
violation
19 Label Propagation
Given a graph with labels on some nodes, how can we assign labels to the other nodes of the graph? Node embeddings are a partial solution to this problem.
Settings
• Homophily: The tendency of individuals to associate and bond with similar others.
$$P^{(t+1)}(Y_v = c) = \frac{1}{\sum_u A_{v,u}} \sum_u A_{v,u}\, P^{(t)}(Y_u = c)$$
where the $A_{v,u}$ are the edge weights.
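A minimal dense sketch of this iteration; the uniform initialisation and the clamping of labeled nodes after each step are common conventions, assumed here rather than stated above:

```python
import numpy as np

def label_propagation(A, labels, num_classes, num_iters=50):
    """Iterative label propagation on a weighted adjacency matrix A.

    labels: length-n array with a class id for labeled nodes and -1 otherwise.
    Labeled nodes are clamped back to their ground truth after every update.
    """
    n = A.shape[0]
    P = np.full((n, num_classes), 1.0 / num_classes)        # uninformative start
    labeled = labels >= 0
    P[labeled] = np.eye(num_classes)[labels[labeled]]        # one-hot for labeled
    for _ in range(num_iters):
        P = A @ P                                            # weighted neighbour average
        P /= P.sum(axis=1, keepdims=True) + 1e-12            # renormalise rows
        P[labeled] = np.eye(num_classes)[labels[labeled]]    # clamp labeled nodes
    return P
```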
Issues:
• Convergence may be very slow and not guaranteed.
Question: We say this is fast but slow to converge. How does it compare
with GNN?
Fast here refers to one step of label propagation. In most cases, it is still faster than training and applying a deep neural network.
However, inference with a GNN is faster than label propagation.
Figure 19.1: A GNN applied to two graphs with different labels but identical node features leads to very different performance, since the GNN does not take the labels into account. The resulting node embeddings do not have sufficient differentiating power.
Question: Why can’t we include the labels in the messages instead of doing
correct and smooth?
The motivation of Correct and Smooth is that it is model-agnostic, so it can make predictions with your favourite model. Directly leveraging the labels is possible too.
The core idea of Correct and Smooth¹² is that we expect the errors of a base label
predictor to be correlated along edges of the graph, so we should spread such uncertainty
over the graph. We define the diffusion matrix to be
$$\tilde A := D^{-1/2} A D^{-1/2}$$
where $D$ is the diagonal degree matrix, $D_{i,i} := \deg i$.

¹² See Zhu et al. ICML 2013 for details.
Theorem 19.1. All the eigenvalues of $\tilde A$ are in the range $[-1, +1]$, and the maximum eigenvalue is always $1$ with eigenvector $D^{1/2}\mathbf 1$, so the powers of $\tilde A$ are well-behaved for any $K$.
Proof.
$$\tilde A D^{1/2}\mathbf 1 = D^{-1/2} A D^{-1/2} D^{1/2}\mathbf 1 = D^{-1/2} A \mathbf 1 = D^{-1/2} D \mathbf 1 = D^{1/2}\mathbf 1$$
1. Train a base predictor which predicts soft labels (class probabilities) over all nodes. Labeled nodes can be used as training/validation data.
2. Apply the base predictor to all nodes (including the labeled ones) to obtain soft labels.
• (Correct): The errors of the soft labels are biased over the graph; we need to correct for this bias.
Compute the training error of each node:
$$e_v := \begin{cases} p_v - \hat p_v & v \text{ is labeled} \\ 0 & v \text{ is unlabeled} \end{cases}$$
where $p_v$ is the ground-truth one-hot label and $\hat p_v$ the predicted soft label.
We add the scaled diffused training errors to the soft labels:
$$\tilde P := \hat P + s E^{(T)}$$
where $E^{(T)}$ is the matrix of training errors after $T$ diffusion steps and $s$ is a scaling hyperparameter.
• (Smooth): The predicted soft labels may not be smooth over the graph; we need to smooth them.
Assumption: neighbouring nodes tend to share the same labels.
Diffuse the corrected soft labels $z_v$ along the edges:
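The smoothing equation itself is cut off in these notes. The following is a minimal end-to-end sketch of Correct & Smooth; the specific diffusion update $E \leftarrow (1-\alpha)E + \alpha \tilde A E$ (and the analogous update for the labels) is an assumption of this sketch rather than something stated above.

```python
import numpy as np

def correct_and_smooth(A_norm, soft, y_onehot, labeled, s=1.0,
                       alpha=0.8, num_iters=50):
    """Correct & Smooth sketch on top of a base predictor's soft labels.

    A_norm:   normalised adjacency D^{-1/2} A D^{-1/2}.
    soft:     (n, C) soft labels from the base predictor.
    y_onehot: (n, C) one-hot ground truth (rows of unlabeled nodes are ignored).
    labeled:  boolean mask of labeled nodes.
    """
    # Correct: diffuse the training errors and add them back, scaled by s.
    E = np.zeros_like(soft)
    E[labeled] = y_onehot[labeled] - soft[labeled]
    for _ in range(num_iters):
        E = (1 - alpha) * E + alpha * (A_norm @ E)
    Z = soft + s * E

    # Smooth: clamp labeled nodes to ground truth, then diffuse the labels.
    Z[labeled] = y_onehot[labeled]
    for _ in range(num_iters):
        Z = (1 - alpha) * Z + alpha * (A_norm @ Z)
    return Z
```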
20 Scaling Up GNNs
In modern applications, graphs are very large.
• In knowledge graphs such as Wikidata or Freebase, there are about $10^8$ entities.
Tasks: KG Completion, Reasoning
Why is training GNNs on large graphs difficult?
Naïve full-batch processing iterates the GCN layers on the entire graph:
$$H^{(k+1)} := \sigma\big( \tilde A H^{(k)} W_k^\top + H^{(k)} B_k^\top \big)$$
This is infeasible for large graphs since the memory of a GPU is extremely limited (10-20 GB).
We introduce three methods for scaling up GNNs.
• Two methods sample smaller subgraphs:
Neighbour Sampling and Cluster GCN
• $H$ controls the trade-off between aggregation efficiency and variance. A smaller $H$ leads to more efficient computation but increases the variance, which gives less stable training results.
• The size of the computation graph is still exponential with respect to $K$: one more GNN layer makes the computation $H$ times more expensive.
The neighbours can be sampled uniformly at random, which is fast but might not be optimal in natural graphs. We could instead use random walks with restarts and sample the nodes with the highest restart scores; this works better in practice.
Overall time complexity: $M \cdot H^K$, where $M$ is the number of nodes in a mini-batch.
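A minimal sketch of building one node's sampled computation graph with at most $H$ neighbours per hop (the adjacency-dict representation and the recursive return format are assumptions for illustration):

```python
import random

def sample_computation_graph(adj, node, num_hops, H, rng=random):
    """Neighbour sampling: keep at most H random neighbours at each hop,
    so the computation tree of a K-layer GNN has at most H^K leaves.

    adj: dict mapping node -> list of neighbours.
    Returns a nested dict {neighbour: subtree, ...} rooted at `node`.
    """
    if num_hops == 0:
        return {}
    neighbours = adj.get(node, [])
    sampled = rng.sample(neighbours, min(H, len(neighbours)))
    return {u: sample_computation_graph(adj, u, num_hops - 1, H, rng)
            for u in sampled}
```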
2. Mini-batch training: Sample one node group $V_i$ at a time and apply the GNN to its induced subgraph $G_i$.
• The sampled node group tends to cover only a small, concentrated portion of the entire data.
• The sampled nodes are not diverse enough to be representative of the entire graph structure.
This leads to very different gradients across clusters, which translates into high training variance and slow convergence of SGD.
2. Mini-batch training: Sample node groups Vi1 , . . . , Viq and apply GNN to the
induced graph of Vi1 ∪ · · · ∪ Viq .
Overall time complexity of Cluster-GCN: K · M · Davg , where Davg is the average node
degree. This linear growth is much more efficient than neighbourhood sampling.
Question: Since we are removing the activation function and the perfor-
mance is still strong, does this mean a lot of things we learned are linear?
Yes.
to have similar pre-processed features. This is due to the high dimensionality of embedding
space.
21 Trustworthy Graph AI
Trustworthy AI/GNN includes explainability, fairness, robustness, privacy, etc. The role of graph topology in these problems was previously unexplored. This lecture covers robustness and explainability.
21.1 Explainability
Deep learning models are black boxes, which makes it a major challenge to explain them and extract insights from them. Explainable Artificial Intelligence (XAI) is an umbrella term for any research trying to solve the black-box problem for AI.
It is useful since it enables
• Trust: Explainability is a prerequisite for humans to trust and accept the model’s
prediction.
• Causality: Explainability (e.g. attribute importance) conveys causality to the
system’s target prediction: attribute X causes the data to be Y
• Transferability: The model needs to convey an understanding of decision-making for humans before it can be safely deployed to unseen data.
• Fair and Ethical Decision Making: Knowing the reasons for a certain decision is
a societal need, in order to perceive if the prediction conforms to ethical standards.
Excursion: Explainable Models
A model is explainable when
Examples:
• Linear models: The slope (coefficient) is explainable as the amount of effect a variable has on the prediction.
• Decision Trees: A very explainable set of models where each node represents a
logical decision. We can compute statistics for each decision node.
• Explaining model predictions: Why does the model recommend no loan for person X?
21.2 GNNExplainer
GNNExplainer is a post-hoc, model agnostic explanation method for GNNs.
• Training time: Optimise GNN on training graphs and save the trained model
GNNExplainer can explain different tasks, including Node classification, Link prediction,
and Graph classification. It can be adapted to GAT, Gated Graph Sequence, Graph
Networks, GraphSAGE, etc.
In a general message-passing framework, we can produce structural explanations and feature explanations. GNNExplainer explains both aspects simultaneously.
Settings
Without loss of generality, consider the node classification task. The input is
Recall the mutual information
$$I(X, Y) := \operatorname{KL}\big( P_{(X,Y)} \,\big\|\, P_X \otimes P_Y \big)$$
where $P_X, P_Y$ are the marginal distributions of $X, Y$, respectively, and $P_{(X,Y)}$ is their joint distribution. $I(X, Y)$ measures how much the joint distribution deviates from the product distribution, i.e. from the hypothetical case in which $X, Y$ have no correlation. Note that $I(X, Y) = I(Y, X)$.
A good explanation should have a high correlation with the model prediction, so GNNExplainer's goal is
$$\max_{(G_S, X_S)} I\big(Y, (G_S, X_S)\big) = H(Y) - H\big(Y \mid G = G_S, X = X_S\big)$$
where the explanation $(G_S, X_S)$ consists of a subgraph $G_S$ of the computation graph and the corresponding feature subset $X_S$.
Note the approximation used here, since $H$ is not a convex function. GNNExplainer uses $A_c \odot \sigma(M)$, where $\sigma(M)$ is a soft edge mask, to approximate $\mathbb E_{\mathcal A}[\tilde A]$; $\sigma$ is the sigmoid function, which squashes $M$ into $[0, 1]$, representing whether an edge should be kept or dropped.
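A minimal sketch of optimising such a soft edge mask; the `gnn` callable, the sparsity weight, and the use of Adam are illustrative assumptions, and the actual GNNExplainer objective also includes an entropy regulariser on the mask:

```python
import torch

def explain_node(gnn, A_c, X, target_class, num_steps=200, lr=0.01):
    """Learn a soft edge mask M so that A_c * sigmoid(M) keeps the prediction.

    gnn(A, X) is assumed to return class log-probabilities for the target node.
    Returns sigmoid(M), interpreted as edge importance scores.
    """
    M = torch.zeros_like(A_c, requires_grad=True)        # learnable edge mask
    opt = torch.optim.Adam([M], lr=lr)
    for _ in range(num_steps):
        masked_A = A_c * torch.sigmoid(M)                 # relaxed adjacency
        log_probs = gnn(masked_A, X)
        pred_loss = -log_probs[target_class]              # keep the original prediction
        sparsity = torch.sigmoid(M).mean()                # prefer small explanations
        loss = pred_loss + 0.1 * sparsity
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(M).detach()
```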
Figure 21.2: GNNExplainer thresholding graphs using the relaxed adjacency matrix
Alternative approaches:
• GNN Saliency Map: record the gradients of the output scores with respect to the inputs of the GNN
Figure 21.4: GraphFramEx explanation framework focuses on the phenomenon and the
model.
The fidelity metrics are $\mathrm{fid}^\pm$, corresponding to removing the important subgraph and to using only the important subgraph, respectively:
$$\mathrm{fid}^+ := \frac{1}{N}\sum_{i=1}^{N} \Big( \mathbb 1(\hat y_i = y_i) - \mathbb 1\big(\hat y_i(G_{C \setminus S}) = y_i\big) \Big)$$
$$\mathrm{fid}^- := \frac{1}{N}\sum_{i=1}^{N} \Big( \mathbb 1(\hat y_i = y_i) - \mathbb 1\big(\hat y_i(G_S) = y_i\big) \Big)$$
where $\hat y_i$ is the original prediction (probability/confidence), $\hat y_i(G_{C\setminus S})$ the prediction after removing the important subgraph $S$ from the computation graph, and $\hat y_i(G_S)$ the prediction when keeping only the important subgraph $S$.
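Computing these metrics from hard predictions is straightforward; a minimal sketch (variants based on predicted probabilities/confidences exist, as noted above):

```python
import numpy as np

def fidelity(y_true, pred_full, pred_without_S, pred_only_S):
    """fid+ and fid- from hard predictions on N explained instances.

    pred_full:      predictions on the full computation graph.
    pred_without_S: predictions with the important subgraph S removed.
    pred_only_S:    predictions keeping only the important subgraph S.
    """
    acc = lambda p: np.mean(np.asarray(p) == np.asarray(y_true))
    fid_plus = acc(pred_full) - acc(pred_without_S)   # large => S was necessary
    fid_minus = acc(pred_full) - acc(pred_only_S)     # small => S is sufficient
    return fid_plus, fid_minus
```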
Evaluation criteria are multidimensional:
Types of explanations:
22 Conclusion
22.1 GNN Design Space and Task Space
How do we find a good GNN design for a specific GNN task? Redoing a hyperparameter grid search for each new task is not feasible.
Overall there are about 300,000 possible designs from an assorted combination of parameters. The total size of the design space is huge ($> 10^5$); we cannot cover all possible designs.
GNN tasks can be categorised into node/edge/graph level tasks. This is not precise
enough, since for example “predicting clustering coëfficient” and “predicting a node’s
subject area in citation networks” are completely different.
22.2 GraphGym
GraphGym is a platform for exploring different GNN architectures.
In GraphGym, a quantitative task similarity metric is defined as
2. Randomly sample N models from our design space (e.g. sample 100 models)
3. Sort these models based on their performance (e.g. sample 12 models in the
experiments)
Example
Evaluating a design dimension: suppose we want to ask "is Batch Normalisation generally useful for GNNs?". The common practice is to select one model and compare it with batch normalisation turned on and off.
In GraphGym, the process is
2. Rank the models with Batch Normalisation on and off, with the computational budget of the models controlled. The lower the ranking, the better.
3. Recommend the best designs from existing tasks with high similarity
2. Out-of-distribution prediction
Settings
In this section we investigate a setting in molecule classification.
• Task: Binary classification of molecules
Figure 22.5: Pre-training both node and graph embeddings compared to training them
individually
The naïve strategy is multi-task supervised pre-training on relevant labels. This has limited performance on downstream tasks and often leads to negative transfer.
The key idea for improving this is to pre-train both node and graph embeddings.
Pre-training methods:
• Attribute Masking (node-level, self-supervised):
When different GNN models are pre-trained, the most expressive model (GIN) benefits the
most from pre-training.