
CS 224w: Machine Learning with Graphs

Leni Aniva
Autumn 2024

• Instructor: Jure Leskovec

• Website: http://cs224w.stanford.edu

• Note: A page with a dark background indicates it is from Winter 2023.

Super! – Dr. Jure Leskovec


Contents

0.1 Why Graphs?
0.2 Choice of Graph Embeddings

1 Traditional Machine Learning on Graphs
   1.1 Node-Level Features
   1.2 Link-Level Features
   1.3 Graph Kernels

2 Node Embeddings
   2.1 Random Walk Embedding
   2.2 Embedding Entire Graphs
   2.3 Relations to Matrix Factorization
   2.4 Applications and Limitations

3 Graph Neural Networks
   3.1 Basics of Deep Learning
   3.2 Deep Learning for Graphs
   3.3 Graph Convolutional Networks
   3.4 GNNs subsume CNNs

4 A General Perspective on Graph Neural Networks
   4.1 A Single Layer of GNN
   4.2 GNN Layers in Practice
   4.3 Stacking GNN Layers
   4.4 Graph Manipulation in GNNs

5 GNN Augmentation and Training
   5.1 Predictions with GNNs
   5.2 Training Graph Neural Networks
   5.3 Setup GNN Prediction

6 Theory of Graph Neural Networks
   6.1 Designing the Most Powerful GNN
   6.2 When things don't go as planned

7 Limits of Graph Neural Networks
   7.1 Spectral Perspective of Message Passing
   7.2 Feature-Augmentation: Structurally-Aware GNNs
   7.3 Counting Graph Sub-Structures
   7.4 Position-Aware GNN
   7.5 Identity-Aware GNNs

8 Graph Transformers
   8.1 Self-Attention
   8.2 Self-Attention and Message Passing
   8.3 A New Design Landscape for Graph Transformers
   8.4 Positional Encodings for Graph Transformers

9 Machine Learning with Heterogeneous Graphs
   9.1 Heterogeneous Graphs
   9.2 Relational GCN
   9.3 Heterogeneous Graph Transformer
   9.4 Design Space for Heterogeneous GNNs

10 Knowledge Graph Embeddings
   10.1 Knowledge Graph Completion

11 Reasoning in Knowledge Graphs
   11.1 Answering Predictive Queries on Knowledge Graphs
   11.2 Query2Box
   11.3 Training Query2Box

12 GNNs for Recommender Systems
   12.1 Recommender Systems: Embedding Based Models
   12.2 Neural Graph Collaborative Filtering
   12.3 LightGCN
   12.4 PinSAGE

13 Relational Deep Learning
   13.1 Relational Database Graph
   13.2 Predictive Tasks in Relational Databases
   13.3 RelBench

14 Advanced Topics in Graph Neural Networks
   14.1 PRODIGY: Enabling In-Context Learning Over Graphs
   14.2 Conformal GNNs
   14.3 Robustness

15 Foundation Models for Knowledge Graphs
   15.1 InGram: Inductive KG Embedding via Relation Graphs

16 Deep Generative Models for Graphs
   16.1 Machine Learning for Graph Generation
   16.2 Generating Realistic Graphs
   16.3 Scaling Up and Evaluating Graph Generation
   16.4 Graph Convolutional Policy Network

17 Geometric Graph Learning
   17.1 Invariant GNNs
   17.2 Equivariant GNNs
   17.3 Geometric Generative Models

18 Fast Neural Subgraph Matching and Counting
   18.1 Subgraphs and Motifs
   18.2 Neural Subgraph Representations
   18.3 Mining Frequent Motifs

19 Label Propagation
   19.1 Label Propagation
   19.2 Correct and Smooth
   19.3 Masked Label Prediction

20 Scaling Up GNNs
   20.1 Neighbour Sampling
   20.2 Cluster GCN
   20.3 Simplified GCN

21 Trustworthy Graph AI
   21.1 Explainability
   21.2 GNNExplainer
   21.3 Explainability Evaluation

22 Conclusion
   22.1 GNN Design Space and Task Space
   22.2 GraphGym
   22.3 Pre-Training Graph Neural Networks

Introduction
0.1 Why Graphs?
Graphs are a general language for describing and analyzing entities with relations and
interactions.
Applications:
• Molecules: Vertices are atoms and edges are bonds

• Event Graphs

• Computer Networks

• Disease Pathways

• Code Graphs
Complex domains have a rich relational structure which can be represented as a relational
graph. By explicitly modeling relationships we achieve better performance with lower
modeling capacity.
The modern ML toolbox processes tensors, e.g. images (2D) and text/speech (1D). The modern
deep learning toolbox is designed for simple sequences and grids, but not everything can be
represented as a sequence or a grid. How can we develop neural networks that are much
more broadly applicable? We can use graphs. Graphs connect things.
• Graph neural network is the 3rd most popular keyword in ICLR ’22.

• Graph learning is also very difficult due to the complex and less structured nature of
graphs.

• Graph learning is also associated with representation learning. In some cases it may
be possible to learn a d-dimensional embedding for each node in the graph such that
similar nodes have closer embeddings.
A number of different tasks can be executed on graph data.
• Node level prediction: to characterize the structure and position of a node in the
network.
Example: In protein folding, each atom is a node and the task is to predict the
coordinates of the node.

• Edge/Link level prediction: Predicting a property for a pair of nodes. This can be
either trying to find missing links or finding new links as time progresses.
Example: Graph-based recommender systems and drug side effects.

• Graph level prediction: Predict for an entire subgraph or graph


Examples: traffic prediction, drug discovery, physics simulation, and weather
forecasting


Figure 0.1: When a machine learning model is applied to a graph, each node defines its own
computational graph in its neighbourhood.

0.2 Choice of Graph Embeddings


A graph has several components:

• Objects N : Nodes, vertices

• Interactions E: Edges, links

• Systems G(N, E): Networks, graphs

Sometimes there is a ubiquitous representation in a particular case. Sometimes there is
not. The choice of representation determines what information can be mined from the
graph. A graph may also have some other properties:

• Undirected/Directed edges

• Allow/Disallow self-loop

• Allow/Disallow multi-graphs (multiple edges between nodes)

• Heterogeneous Graphs: A graph G = (V, E, T, R, τ, ϕ) where nodes have types


τ (v) : T and edges have types ϕ(e) : R.
Many graphs are heterogeneous. For example, drug-protein interaction graph is
heterogeneous.

• Bipartite Graphs: e.g. Author-Paper graph, Actor-Films graph

Most real-world networks are sparse. The adjacency matrix is a sparse matrix with mostly
0's. The density of the matrix (E/N^2) is 1.51 × 10^{-5} for the WWW and 2.27 × 10^{-8} for MSN IM.


Figure 1.1: Different levels of tasks on a graph

Figure 1.2: Classifying nodes on a graph when a few labels are provided

1 Traditional Machine Learning on Graphs


In a traditional ML pipeline such as logistic regression, random forests, or neural
networks, the model is first trained on features of a graph, and then the model can be
applied to a new graph. Using effective features over graphs is the key to achieving good
model performance. For simplicity, in this section we focus on undirected graphs.

1.1 Node-Level Features


A simple example of a node-level task is node classification given a few labelled samples;
it is illustrated in Figure 1.2.
A couple different measures characterise the structure and position of a node on a graph:

• Node degree: Number of neighbours

• Node centralities: A measure of the “importance” of a node in a graph. There are
several different types of centrality.

  – Eigenvector centrality: A node is important if it is surrounded by important
    neighbouring nodes. We define the centrality of node v in terms of the centralities of
    its neighbouring nodes. This leads to a set of |N| simultaneous linear equations:

    c_v := \frac{1}{\lambda} \sum_{u \in N(v)} c_u \iff \lambda c = A c

    where λ is a normalisation constant and A is the adjacency matrix of the graph.
    By the Perron-Frobenius theorem, the largest eigenvalue λ_max is always positive
    and unique, and its corresponding eigenvector can be used for centrality.
    When λ is the second-largest eigenvalue, c_v has a different meaning.

  – Betweenness centrality: A node is important if it is a gatekeeper, i.e. when
    it lies on many shortest paths between other nodes:

    c_v := \sum_{s \neq v \neq t} \frac{|\{\text{shortest paths between } s, t \text{ containing } v\}|}{|\{\text{shortest paths between } s, t\}|}

    This is important for social networks.

  – Closeness centrality:

    c_v := \frac{1}{\sum_{u \neq v} \text{length of the shortest path between } u, v}

• Clustering coëfficients: How connected v’s neighbouring nodes are:

    e_v := \frac{|\{\text{edges among } N(v)\}|}{\binom{k_v}{2}} \in [0, 1]

  where k_v is the degree of v.

Figure 1.3: Examples of clustering coefficients

Social networks have a huge amount of clustering.

Clustering coëfficient counts the number of triangles in the ego-network (the network
formed by {v} ∪ N (v), where v is the ego). We can generalise the above by counting
graphlets.
An induced subgraph is a graph formed by taking a subset of vertices in a larger graph such
that only the edges connecting the retained vertices are preserved.
Two graphs with identical topologies are isomorphic.


Figure 1.4: Different levels of features distinguish nodes in different ways

Graphlets are small subgraphs that describe the structure of u’s neighbourhood network.
Specifically, they are rooted, connected, induced, non-isomorphic subgraphs. Considering
graphs of size 2 to 5 nodes we get a vector of 73 (number of graphlets with 2 to 5 vertices)
elements that describes the topology of a node’s neighbourhood. This vector is the
graphlet degree vector (GDV) of a node.
The features we have discussed so far capture local topological properties of the graph but
cannot distinguish nodes on a global scale.
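To make these node-level features concrete, the following is a minimal sketch that computes them on a small example graph, assuming the NetworkX library (which is not otherwise used in these notes):

import networkx as nx

G = nx.karate_club_graph()   # small undirected example graph
v = 0

degree = G.degree[v]                       # node degree
clustering = nx.clustering(G, v)           # clustering coefficient e_v
eigen = nx.eigenvector_centrality(G)[v]    # eigenvector centrality
between = nx.betweenness_centrality(G)[v]  # betweenness centrality
# Note: NetworkX normalises closeness by (n - 1); the definition above omits this factor.
close = nx.closeness_centrality(G)[v]

print(degree, clustering, eigen, between, close)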

1.2 Link-Level Features


Two formulations of link-level prediction:
• Links missing at random: Remove a random set of links and then aim to predict
them. e.g. drug interactions.
• Links over time: Given G[t_0, t_0'], a graph defined by edges up to time t_0', output a
ranked list L of edges that are predicted to appear in G[t_1, t_1'].
Evaluation: n = |E_new| is the number of new edges that appear during the test period.
Methodology: For each pair of nodes (x, y), compute a score c(x, y) and predict the top
n pairs as links.
A couple of measures exist for local neighbourhood overlap:
• Common neighbours:

  c(u, v) := |N(u) \cap N(v)|

• Jaccard's coëfficient:

  c(u, v) := \frac{|N(u) \cap N(v)|}{|N(u) \cup N(v)|}

• Adamic-Adar index:

  c(u, v) := \sum_{w \in N(u) \cap N(v)} \frac{1}{k_w}

  where k_w is the degree of w.

The problem with the three indices above is that they are always 0 if u, v do not
share a neighbour.


• Katz index:

  c(u, v) := |{all walks of all lengths between u, v}|

  This can be computed from powers of the adjacency matrix. The matrix counting all
  walks of length n between vertices is A^n, so the Katz index can be computed by

  C := \sum_{i=1}^{\infty} \beta^i A^i = (I - \beta A)^{-1} - I

  where the decay factor β < 1 is necessary to prevent C from blowing up to +∞.
  An analogous definition exists for directed graphs.
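As a sanity check of the closed form above, here is a minimal NumPy sketch (the graph and β are illustrative; convergence of the series additionally requires β < 1/λ_max(A)):

import numpy as np

# Adjacency matrix of a small undirected example graph (a 4-cycle)
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
beta = 0.1                                     # decay factor
I = np.eye(len(A))

C_closed = np.linalg.inv(I - beta * A) - I     # closed form (I - beta A)^{-1} - I
C_series = sum(beta**i * np.linalg.matrix_power(A, i) for i in range(1, 50))  # truncated series

assert np.allclose(C_closed, C_series)
print(C_closed[0, 2])                          # Katz score between nodes 0 and 2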

1.3 Graph Kernels


The goal of graph kernels is to create a feature vector for the entire graph.
Kernel methods are widely used for traditional ML for graph-level prediction. Instead of
designing feature vectors, we design kernels:

• k(G, G′ ) ∈ R

• Kernel matrix K = [K(G, G′ )]G,G′ must always be positive semi-definite.

• There exists a feature representation ϕ such that K(G, G′) = ϕ(G)^⊺ ϕ(G′), which can
even be infinite-dimensional.

We could use a bag-of-words (BoW) representation for a graph. Recall that in NLP, BoW simply uses the
word counts as features for documents, with no ordering being considered. We regard nodes
as words.
Graph-level graphlet features count the number of different graphlets in a graph.
The graphlets here are not rooted and do not have to be connected, so this definition of
graphlet is slightly different from the one used for node-level features. A limitation of this
approach is that counting graphlets is expensive: counting size-k graphlets for a graph of size n
by enumeration takes O(n^k) time due to costly subgraph isomorphism tests. If the graph's
node degree is bounded by d, the time can be reduced to O(n d^{k-1}).
So far we have only considered features related to the graph structure and have not considered
the attributes of nodes and their neighbours.

2 Node Embeddings
Representation learning avoids the need of doing feature engineering every time. The goal
is to map individual nodes or an entire graph into vectors, or embeddings.
In node embeddings, we would like the embedding to have the following properties:

• Similarity of embeddings between nodes indicates their similarity in the network.


e.g. Nodes closer together could be considered similar.


Figure 2.1: Node Level Embeddings of graphs map each node to a vector

Figure 2.2: Embeddings of Karate Club Graph


• Encode network information

• Potentially useful for downstream tasks


Suppose we have a graph G and we wish to devise an embedding for the nodes V of G. The
goal is to encode nodes so that similarity in the embedding space approximates similarity in
the graph. We have 3 design choices: the encoder, the decoder, and a target similarity function.
• Encoder: Enc : V → R^d, Enc(u) := z_u
This could be defined as a simple lookup table from nodes to vectors (the columns of an
embedding matrix Z ∈ R^{d×|V|}); in this case it is a shallow encoder. Deep encoders using
GNNs will be covered in Subsection 5.1. This is the only learnable function.

• Decoder: Given two embeddings, measure the similarity. Usually chosen to be the
dot product Dec(z, w) = z · w

• Similarity: A measure of similarity of the nodes. This could be distance in the


graph, neighbourhood topology, etc. It is approximated by a combination of encoder
and decoder. similarity(u, v) ≃ Dec(Enc(u), Enc(v))
In this example, it is similarity(u, v) ≃ z u · z v
Many methods stem from this simple setting, including DeepWalk and node2vec.

2.1 Random Walk Embedding


An unsupervised/self-supervised node similarity metric is based on random walks, in which
case we are not using node labels or features; the embeddings are task independent. The
rationale for choosing random walks:
• Expressivity: a flexible stochastic definition of node similarity that incorporates lower-
and higher-order neighbourhood information.

• Efficiency: the graph does not need to be traversed exhaustively during training.
We define:
• Vector Enc(u) := z u : Embedding of node u

• Probability P (v|z u ): Predicted probability of visiting v on random walks starting


from u.

• Softmax function: turns a vector of k values into k probabilities that sum to 1:

  \sigma(z)_i = \frac{\exp z_i}{\sum_{j=1}^{k} \exp z_j}

• Sigmoid function: compresses R into (0, 1):

  S(x) = \frac{1}{1 + \exp(-x)}


• A random walk is a sequence v0 , v1 , . . . of random vertices such that vi+1 is chosen


randomly from N (vi ).

• The target similarity function is

similarity(u, v) := P (u, v occur on a random walk over G)

• We select a random walk strategy R and use such strategy to determine PR (v|u), the
probability that a random walk starting from u visits v. The strategy defines NR (u),
a random multiset (can have repeats for nodes visited multiple times, essentially a
probability distribution) collected from combining all short fixed-length random walks
starting at u

Now we are ready to mathematically state the objective function: the negative
log-likelihood

L(\mathrm{Enc}) := \sum_{u \in V} -\log P(N_R(u) \mid z_u) = \sum_{u \in V} \sum_{v \in N_R(u)} -\log P(v \mid z_u)

where the outer sum runs over all vertices, the inner sum runs over the nodes seen on random
walks from u, and P(v | z_u) is the predicted probability of u, v co-occurring.

We parameterise P(v | z_u) using softmax. Essentially, we view the exponentiated similarity
scores exp(z_u · z_v) as unnormalised probabilities:

P(v \mid z_u) := \frac{\exp(z_u \cdot z_v)}{\sum_{n \in V} \exp(z_u \cdot z_n)}

However, this function is expensive to evaluate: the two sums over V already give O(|V|^2)
time complexity. The solution to this problem is negative sampling¹, which provides the
estimate

\log \frac{\exp(z_u \cdot z_v)}{\sum_{n \in V} \exp(z_u \cdot z_n)} \simeq \log \sigma(z_u \cdot z_v) - \sum_{i=1}^{k} \log \sigma(z_u \cdot z_{n_i}), \qquad n_i \sim P_V

where P_V is a probability distribution over V. Instead of normalising w.r.t. all nodes, we just
normalise against k random negative samples n_i. We could select P_V such that
P_V(n) ∝ deg n. The value of k is usually chosen to be 5 to 20 since

• Higher k gives more robust estimates

• Higher k corresponds to higher bias on negative events.

With this in mind, we can robustly evaluate ∂L/∂z_u and execute stochastic gradient descent
(SGD) to optimize Enc.

1
This is a form of Noise Contrastive Estimation (NCE). See https://arxiv.org/pdf/1402.3722.pdf
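A minimal sketch of this negative-sampling objective for a single co-occurring pair (u, v), assuming PyTorch and a shallow embedding matrix Z (the names and sizes here are illustrative, not from the notes):

import torch

num_nodes, dim, k = 1000, 64, 10
Z = torch.nn.Parameter(torch.randn(num_nodes, dim) * 0.1)  # shallow encoder: one row z_u per node
deg = torch.ones(num_nodes)                                 # node degrees (placeholder values)
P_V = deg / deg.sum()                                       # negative-sampling distribution, P_V(n) proportional to deg n

def loss_uv(u, v):
    """Negative-sampling estimate of -log P(v | z_u)."""
    neg = torch.multinomial(P_V, k, replacement=True)       # n_1, ..., n_k ~ P_V
    pos_term = torch.log(torch.sigmoid(Z[u] @ Z[v]))
    neg_term = torch.log(torch.sigmoid(Z[u] @ Z[neg].T)).sum()
    return -(pos_term - neg_term)

opt = torch.optim.SGD([Z], lr=0.01)
loss = loss_uv(0, 1)      # in practice, summed over all pairs (u, v) seen on the random walks
opt.zero_grad(); loss.backward(); opt.step()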


Figure 2.3: Comparison of neighbourhood NR generated by BFS and DFS strategies. BFS
provides a micro-view of the neighbourhood while DFS provides a macro-view.

Excursion: Stochastic Gradient Descent


A recap of the stochastic gradient descent algorithm in the setting of random walk
embedding:

1. Initialize z u at random for all u

2. For all u:

• Compute ∂L/∂z u
• Make a step z u ← z u − η · ∂L/∂z u . η is the learning rate.

3. Return to step (2) until convergence.

A couple of different options are available for the random walk strategy R. In DeepWalk,
this is just an unbiased random walk starting from each node, but this could be too
constrained. In node2vec, the strategy is chosen to embed nodes with similar network
neighbourhoods close in the feature space. We frame this as a maximum-likelihood
optimisation problem which is independent of the downstream prediction task. The key
observation is that a flexible notion of N_R leads to rich node embeddings.
Two classic strategies define a neighbourhood N_R: Breadth-First Search and Depth-First
Search. We could interpolate between BFS and DFS using two parameters:

• Return parameter p: Return back to previous node.

• In-out parameter q: Moving outwards (DFS) vs. inwards (BFS). Intuitively q is the
interpolation parameter.

We define a biased 2nd-order random walk to explore network neighbourhoods. Suppose
we have just traversed the edge (s_1, w) and are now at w. Three choices are ahead in N(w):

• Return to s_1 (distance 0) with weight 1/p.

• Stay in N(s_1) (distance 1) with weight 1 for each such node.

• Leave N(s_1) and explore further (distance 2) with weight 1/q for each node further
out.


Figure 2.4: Biased 2nd order neighbourhoods along with unnormalized probabilities.

Figure 2.5: DiffPool: Hierarchical Embedding

A BFS-like walk has a low value of p (easy to backtrack) and a DFS-like walk has a low value
of q.
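A minimal sketch of these unnormalised 2nd-order transition weights, assuming the graph is given as a plain adjacency-list dictionary (an illustrative helper, not something defined in the notes):

def next_step_weights(graph, s1, w, p, q):
    """Unnormalised node2vec weights for leaving w, having just arrived from s1.
    graph: dict mapping each node to the set of its neighbours."""
    weights = {}
    for x in graph[w]:
        if x == s1:                # distance 0: return to the previous node
            weights[x] = 1.0 / p
        elif x in graph[s1]:       # distance 1: stay within N(s1)
            weights[x] = 1.0
        else:                      # distance 2: move further out (DFS-like)
            weights[x] = 1.0 / q
    return weights                 # normalise these to obtain transition probabilities

# Example: triangle s1-w-a plus a node b attached only to w
graph = {"s1": {"w", "a"}, "w": {"s1", "a", "b"}, "a": {"s1", "w"}, "b": {"w"}}
print(next_step_weights(graph, "s1", "w", p=2.0, q=0.5))   # {'s1': 0.5, 'a': 1.0, 'b': 2.0}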
In a survey in 2017, node2vec performs better on node classification tasks while
alternative methods perform better on link prediction. No one method wins in all cases.
Random walk approaches are generally more efficient.

2.2 Embedding Entire Graphs


We sometimes wish to embed an entire graph in some embedding space. This could be
useful for, e.g., classifying toxic vs. non-toxic molecules or identifying anomalous graphs. A
simple approach is to sum all node embeddings produced by an existing node embedding method:

z_G := \sum_{v \in G} z_v

Another approach is to introduce a “virtual node” to represent the subgraph and run a
node embedding algorithm to use its embedding.
We will discuss hierarchical embeddings, which successively summarises the graph in
smaller clusters to generate an embedding.

2.3 Relations to Matrix Factorization


Recall that the encoder can be viewed as an embedding lookup into a matrix Z of size
d × |V |.
The simplest node similarity is given by the adjacency matrix: two nodes are considered
similar when they are connected by an edge and not otherwise. The decoder target is then the
adjacency matrix A, and an attempt to learn the embedding is a matrix factorisation A = Z^⊺ Z.
Since d ≪ |V|, this factorisation cannot be done exactly due to rank constraints, so we may instead
optimise Z to minimise the Frobenius norm \min_Z \|A - Z^⊺ Z\|_F.
In the example of DeepWalk, the embedding factorizes the matrix

\log\left( \operatorname{vol}(G) \left( \frac{1}{T} \sum_{r=1}^{T} (D^{-1} A)^r \right) D^{-1} \right) - \log b

where vol(G) := \sum_{i,j} A_{i,j} is the volume of the graph, D is the degree matrix with
D_{u,u} := \deg u, T := |N_R(u)| is the context window size, and b is the number of negative
samples. Essentially D^{-1} A is a Markov (random-walk transition) matrix.
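A minimal NumPy sketch of building this matrix and factorising it with a truncated SVD; applying the element-wise log to max(·, 1) is a common practical choice and an assumption here, not something stated in the notes:

import numpy as np

def deepwalk_matrix_embedding(A, T=10, b=5, d=16):
    """Factorise the matrix that DeepWalk implicitly factorises (formula above)."""
    vol = A.sum()                                       # vol(G) = sum_ij A_ij
    D_inv = np.diag(1.0 / A.sum(axis=1))                # D^{-1}
    P = D_inv @ A                                       # random-walk transition matrix
    S = sum(np.linalg.matrix_power(P, r) for r in range(1, T + 1)) / T
    M = vol * S @ D_inv / b                             # vol(G) (1/T sum_r P^r) D^{-1} / b
    logM = np.log(np.maximum(M, 1.0))                   # element-wise log, clipped at 1
    U, sigma, _ = np.linalg.svd(logM)
    return U[:, :d] * np.sqrt(sigma[:d])                # rank-d node embeddings

n = 20
A = np.zeros((n, n))
for i in range(n):                                      # ring graph as a toy example
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
print(deepwalk_matrix_embedding(A, d=8).shape)          # (20, 8)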

2.4 Applications and Limitations


Embeddings can be used for

• Clustering/Community detection

• Node classification

• Link prediction: Predict edge (i, j) based on (z i , z j )

• Graph classification: Classification of z G

Node embeddings via matrix factorisation and random walks have some limitations:

• O(|V | d) parameters are needed.


No sharing of parameters is possible between nodes and each node has its own unique
embedding.

• Transductivity: The embedding can only be generated after all nodes in the graph
are seen. Cannot obtain embeddings for nodes not in the training set.

• Cannot capture structural similarity of local topologies.

• Cannot utilize node, edge, and graph features.

Deep Representation Learning and Graph Neural Networks mitigate these limitations,
which will be covered in depth in Section 4 and Section 5.


3 Graph Neural Networks


Recall from Lecture 2 that node embeddings map nodes to d-dimensional embeddings such
that similar nodes in the graph are embedded close together. To learn such an embedding,
we need to define encoder Enc, decoder Dec, and a target similarity function
similarity(u, v).
Today we discuss a class of deep learning methods based on Graph Neural Networks. We
use a Deep Graph Encoder as Enc. The modern machine learning toolbox is built around
regular, repeating lattices or grids, which cannot be easily adapted to graphs since the
structure of a graph is far more complex than a rectangular grid. There is no fixed node
ordering or reference point, and graphs are often dynamic.

3.1 Basics of Deep Learning


This is a brief introduction to deep learning.

Excursion: Supervised Learning


In supervised learning, we are given inputs x and our goal is to predict output y.
The input x can be vectors, sequences, matrices, or graphs. We formulate the task as an
optimisation problem

\min_\theta L(y, f(x))

where

• θ is a set of parameters we optimise. e.g. in shallow encoder, θ = {Z}.

• L is the loss function.


Common choices of L:

– L_2-loss: L(y, \hat{y}) := \|y - \hat{y}\|

– Cross entropy: L(y, \hat{y}) := -\sum_{i=1}^{C} y_i \log \hat{y}_i
  In this case y \in \{0, 1\}^C is a one-hot encoding of the ground-truth category,
  while \hat{y} \in [0, 1]^C is a probability vector (\sum_i \hat{y}_i = 1).

• The optimisation problem is solved using gradients.

  – The gradient ∇_θ L is the directional derivative in the direction of largest increase.
  – An iterative algorithm which updates θ ← θ − η ∇_θ L moves θ in the opposite
    direction of ∇_θ L until convergence.
  – η is the learning rate.
  – Evaluating the gradient w.r.t. the entire training set could be expensive, so often
    ∇_θ L = \sum_{i=1}^{n} ∇_θ L(y_i, f(x_i; θ)) is approximated by a random sample
    \sum_{i \in I} ∇_θ L(y_i, f(x_i; θ)) where I ⊆ {1, …, n}. This is
    stochastic gradient descent and I is the batch; |I| is the batch size
    and the number of full passes through the dataset is the epoch.
  – Other, more sophisticated optimisers exist such as RMSprop, Adam, Adagrad, etc.

• A multi-layer perceptron (MLP) is a neural network constructed by stacking
  layers of the form f_i(x) = σ(W x + b), where W, b are learnable and σ is a
  non-linear function referred to as the activation function.

3.2 Deep Learning for Graphs


Settings

Suppose we have a graph G with vertex set V , adjacency matrix A ∈ {0, 1}|V |×|V | ,
node features X ∈ R|V |×d .

A naïve approach would be to join the adjacency matrix and features and feed them into a
deep neural net. The problems with this idea are

• O(|V |) parameters

• Graph size is inherently baked into the size of the neural network

• Sensitive to node ordering

One might hope to use convolutional networks, which slide a kernel that is shared across all
positions of an image; however, there is no fixed notion of locality or sliding window on a
graph, and graphs do not give an inherent order to their vertices.
Suppose we learn a function f that maps a graph G := (A, X) to a vector in R^d. Then we
would like f(A_1, X_1) = f(A_2, X_2) for two different orderings of the vertices of G. For any
graph function f : R^{|V|×m} × R^{|V|×|V|} → R^d,

• f is permutation invariant if f (A, X) = f (PAP ⊺ , PX) for any permutation


matrix P.

• f is permutation equivariant if Pf (A, X) = f (PAP ⊺ , PX) for any permutation


matrix P.

Examples:

• f(A, X) = 1^⊺ X is permutation invariant (summing all node features).

• f(A, X) = X is permutation equivariant.

• f(A, X) = AX is permutation equivariant.

Graph neural networks consist of multiple permutation equivariant/invariant layers. Other
neural network architectures, e.g. MLPs, do not exhibit this property.
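These properties are easy to verify numerically; a quick NumPy check on a random graph (illustrative only):

import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 3
A = rng.integers(0, 2, size=(n, n)); A = np.triu(A, 1); A = A + A.T   # random undirected graph
X = rng.normal(size=(n, m))                                            # node features
P = np.eye(n)[rng.permutation(n)]                                      # permutation matrix

# Invariant: summing all node features ignores the node ordering
assert np.allclose(np.ones(n) @ X, np.ones(n) @ (P @ X))
# Equivariant: the rows of AX are permuted in the same way as the nodes
assert np.allclose(P @ (A @ X), (P @ A @ P.T) @ (P @ X))
print("permutation checks passed")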


Figure 3.1: General structure of GNN which consists of permutation equivariant convolu-
tional layers and permutation invariant pooling layers.

Figure 3.2: Computation graph defined from a node's neighbourhood: (a) neighbour computation
graph; (b) propagate and transform information. Each node defines a computation graph based
on its neighbourhood, and this can change from node to node.

3.3 Graph Convolutional Networks


A critical observation is that a node's neighbourhood defines a computation graph.
Information can be passed along this computation graph to combine information from
different parts of the graph. A model constructed in this fashion can be of arbitrary depth.
The layer-0 embedding of node v is its input feature x_v, and the layer-k embeddings are
constructed from the layer-(k − 1) embeddings. Neighbourhood aggregation is the
approach to aggregate information across layers. A basic approach is to average information
from neighbours and apply a neural network. This leads to the Deep Encoder:

h_v^{(0)} := x_v   (initial 0-th layer embedding)

h_v^{(k+1)} := \sigma\left( W_k \frac{1}{|N(v)|} \sum_{u \in N(v)} h_u^{(k)} + B_k h_v^{(k)} \right), \qquad k = 0, \ldots, K - 1

z_v := h_v^{(K)}   (embedding after K layers of neighbourhood aggregation)

where σ is a non-linear activation function, the first term averages the neighbours'
previous-layer embeddings, W_k and B_k are learnable weights, and K is the total number of layers.

GNN computation is permutation equivariant. A GNN can be trained using SGD by
updating the weight matrices W_k, B_k, for aggregation and transformation, respectively.
Many aggregations can be performed efficiently as sparse matrix operations. If
H^{(k)} := [h_1^{(k)}, \ldots, h_{|V|}^{(k)}]^⊺ and D is the diagonal degree matrix with D_{v,v} := \deg v, then

\frac{1}{\deg v} \sum_{u \in N(v)} h_u^{(k)} \implies \hat{H}^{(k+1)} = D^{-1} A H^{(k)}

We can re-write the update function in matrix form²:

H^{(k+1)} := \sigma\left( D^{-1} A H^{(k)} W_k^⊺ + H^{(k)} B_k^⊺ \right)

where the first term is the neighbourhood aggregation and the second is the self transformation.
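A minimal NumPy sketch of this matrix-form update for a single layer (the weights and sizes are illustrative):

import numpy as np

def gcn_layer(A, H, W, B):
    """H^{(k+1)} = ReLU(D^{-1} A H W^T + H B^T)."""
    deg = np.maximum(A.sum(axis=1), 1.0)             # avoid division by zero for isolated nodes
    Z = np.diag(1.0 / deg) @ A @ H @ W.T + H @ B.T   # aggregation + self transformation
    return np.maximum(Z, 0.0)                        # sigma = ReLU

rng = np.random.default_rng(0)
n, d_in, d_out = 6, 4, 8
A = rng.integers(0, 2, size=(n, n)).astype(float)
A = np.triu(A, 1); A = A + A.T                       # undirected graph
H = rng.normal(size=(n, d_in))                       # layer-k embeddings
W = rng.normal(size=(d_out, d_in))
B = rng.normal(size=(d_out, d_in))
print(gcn_layer(A, H, W, B).shape)                   # (6, 8)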
To train a GNN:
• Supervised setting: We could minimise a loss function between the node label y_v and a
prediction f(z_v) made from the node embedding: \min_\theta L(y_v, f(z_v)).
For example, predicting a binary label on the nodes can use a binary cross-entropy
(logistic) loss of the form

L = -\sum_{v \in V} \Big( y_v \log \sigma(z_v^⊺ \theta) + (1 - y_v) \log\big(1 - \sigma(z_v^⊺ \theta)\big) \Big)

where y_v is the node class label, z_v is the encoder output, and θ are the classification weights.

• Unsupervised setting: When no node labels are available we could use the graph's
structure as supervision, by requiring that similar nodes have similar embeddings, i.e. we
optimise

L = \sum_{u,v} \mathrm{CE}(y_{u,v}, \mathrm{Dec}(z_u, z_v))

where y_{u,v} is a similarity score of the nodes.


2
not all GNNs can be expressed in matrix form, especially when aggregation functions are complex.


Figure 3.3: Image and graph comparison of CNN vs GNN: (a) image; (b) graph.

Overall:
1. Define a neighbourhood aggregation function
2. Define a loss function on the embeddings
3. Train on a set of nodes, i.e. a batch of computation graphs.
4. Generate embeddings of nodes (even for nodes that the model never trained on)
A GNN is inductive as opposed to transductive: the same model generalises to unseen
nodes.

3.4 GNNs subsume CNNs


GNNs can be viewed as a more general class of architectures than CNNs. A
convolutional layer with a 3 × 3 filter can be formulated as

h_v^{(l+1)} := \sigma\left( \sum_{u \in N(v) \cup \{v\}} W_{l,u} h_u^{(l)} \right), \qquad l = 0, \ldots, L - 1

where N(v) is the set of 8 neighbouring pixels of v.


The key difference is that on an image we can learn a different W_{l,u} for each “neighbour” u of
pixel v, because we can pick a particular ordering of relative positions w.r.t. the centre
pixel v. CNNs can therefore be seen as special GNNs with fixed neighbour size and ordering; a
CNN is not permutation invariant or equivariant.

4 A General Perspective on Graph Neural Networks


In this section we expand upon the previous construction of GNNs and create a general
GNN framework. A GNN consists of 5 parts:


1. Message

2. Aggregation Function
A GNN Layer is composed of the message and aggregation. Different
implementations include Graph Convolutional Networks (GCN), GraphSAGE, and
GAT (Graph Attention).

3. Layer Connectivity
Layer Connectivity refers to the topological connection between layers including skip
connections.

4. Graph Augmentation: Feature augmentation, structure augmentation, etc.

The core idea is that the raw input graph should not be used directly as the
computation graph, for a number of reasons we shall explain later.

5. Learning Objective: Supervised/Unsupervised, Node/Edge/Graph level objectives.

4.1 A Single Layer of GNN


A GNN layer compresses a (variable sized) set of vectors into a single vector. This involves
taking the inputs h_v^{(l-1)} and {h_u^{(l-1)} : u ∈ N(v)} and outputting the node embedding h_v^{(l)}.

• The message function converts the hidden state of each node into a message, which
will be sent to other nodes later:

  m_u^{(l)} := \mathrm{Msg}^{(l)}(h_u^{(l-1)})

An example is a linear layer m_u^{(l)} := W^{(l)} h_u^{(l-1)}.
Question: Can a node send different messages to different neighbours?
Yes. We shall see an example of this, which is Graph Attention Networks.

• The aggregation function defines how the node's neighbours' messages are combined:

  h_v^{(l)} := \mathrm{Agg}^{(l)}(\{m_u^{(l)} : u \in N(v)\})

Example: Agg can be summation, mean, or maximum.

The issue here is that the information from node v itself could get lost, since h_v^{(l)} does
not directly depend on h_v^{(l-1)}. The solution is to include h_v^{(l-1)} in the computation of
h_v^{(l)}: we can compute a message for v itself, and then use

  h_v^{(l)} := \mathrm{concat}\big(\mathrm{Agg}(\{m_u^{(l)} : u \in N(v)\}), m_v^{(l)}\big)

• A non-linear activation function follows message or aggregation to add


expressiveness.


Examples:
(1) Graph Convolutional Networks (GCN), where the message and aggregation
functions are

  h_v^{(l)} := \sigma\left( \sum_{u \in N(v)} W^{(l)} \frac{h_u^{(l-1)}}{\deg v} \right)

i.e. the message, normalised by the node degree, is

  m_u^{(l)} = \frac{1}{|N(v)|} W^{(l)} h_u^{(l-1)}

and the aggregation is a sum followed by the activation,

  h_v^{(l)} = \sigma\big( \textstyle\sum \{ m_u^{(l)} : u \in N(v) \} \big)

In GCN, graphs are assumed to have self-edges that are included in the summation.

(2) GraphSAGE:

  h_v^{(l)} := \sigma\left( W^{(l)} \cdot \mathrm{concat}\big(h_v^{(l-1)}, \mathrm{Agg}(\{h_u^{(l-1)} : u \in N(v)\})\big) \right)

where stage 1 aggregates from the neighbours and stage 2 aggregates with the node itself;
the message is computed within Agg(·).

A couple of different aggregation functions exist:

• Mean: weighted average of the neighbours,

  \mathrm{Agg} := \sum_{u \in N(v)} \frac{h_u^{(l-1)}}{|N(v)|}

• Pool: transform the neighbour vectors and apply a symmetric vector function such as mean or
max,

  \mathrm{Agg} := \mathrm{mean}_{u \in N(v)} \mathrm{MLP}(h_u^{(l-1)})

• LSTM: apply an LSTM to the reshuffled neighbours,

  \mathrm{Agg} := \mathrm{LSTM}\big([h_u^{(l-1)} : u \in \pi(N(v))]\big)

Optionally, apply ℓ_2-normalisation to h_v^{(l)} at every layer. Without ℓ_2 normalisation,
the embedding vectors can have very different scales. In some cases, normalisation
results in performance improvements.

(3) Graph Attention Networks (GAT):

  h_v^{(l)} := \sigma\left( \sum_{u \in N(v)} \alpha_{v,u} W^{(l)} h_u^{(l-1)} \right)

where the α_{v,u} are attention weights.
GAT assigns different importance to messages coming from different nodes. When
α_{v,u} = 1/|N(v)|, attention reduces to GCN/GraphSAGE, where α_{v,u} is defined explicitly
based on the structural properties of the graph, specifically the node degree deg v. The
attention mechanism in GAT is inspired by cognitive attention and focuses on the
important parts of the input data.
An attention mechanism computes α_{v,u}. Define the attention coëfficients

  e_{v,u} := a\big(W^{(l)} h_u^{(l-1)}, W^{(l)} h_v^{(l-1)}\big)

Then we normalise e_{v,u} into the attention weights using softmax:

  \alpha_{v,u} := \frac{\exp e_{v,u}}{\sum_{k \in N(v)} \exp e_{v,k}}

In multi-head attention, multiple attention scores are used and the result of each
attention “head” is aggregated:

  h_v^{(l)}[j] := \sigma\left( \sum_{u \in N(v)} \alpha_{v,u}[j] W^{(l)} h_u^{(l-1)} \right)

  h_v^{(l)} := \mathrm{Agg}\big(h_v^{(l)}[j] : j\big)
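A minimal PyTorch sketch of computing the attention weights α_{v,u} for a single node v, where the attention function a(·, ·) is taken to be a linear layer over the concatenated transformed embeddings (that specific choice of a is an assumption, not fixed by the notes):

import torch

d_in, d_out = 16, 8
W = torch.nn.Linear(d_in, d_out, bias=False)   # shared transform W^{(l)}
a = torch.nn.Linear(2 * d_out, 1, bias=False)  # attention function a(., .)

def attention_weights(h_v, h_neigh):
    """alpha_{v,u} for all neighbours u of v; h_neigh has shape [num_neighbours, d_in]."""
    wv, wu = W(h_v), W(h_neigh)
    e = a(torch.cat([wu, wv.expand_as(wu)], dim=-1)).squeeze(-1)  # attention coefficients e_{v,u}
    return torch.softmax(e, dim=0)                                # normalise over the neighbourhood

alpha = attention_weights(torch.randn(d_in), torch.randn(5, d_in))
print(alpha.sum())   # tensor(1.), up to floating-point error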
The benefits of attention mechanism are:
• Allows for implicitly specifying different importance values of neighbours
• Computationally efficient: Attention can be parallelised across all edges of the
graph.
• Storage efficient: Sparse matrix operations do not require more than O(V + E)
entries. The number of parameters is fixed.
• Localised: Only attends over local neighbourhoods
• Inductive capability: Does not depend on global graph structure.

4.2 GNN Layers in Practice


In practice, the classic GNN layers are a starting point.
• We can often get better performance by considering a general GNN layer design
• Concretely, we can include modern deep learning modules. e.g. Batch Normalisation,
Dropout3 , Attention/Gating, and others.
One particular note about activation function: Empirically the Parametric ReLU
function, defined as
PReLU(x) := max(x, 0) + a min(x, 0)
performs better than ReLU.
Designing novel GNN layers is still an active research frontier. You can explore diverse
GNN designs or try your own ideas in GraphGym.
3
In GNNs, dropout is applied to the linear layer of the message function.


Figure 4.1: Standard way of stacking GNN layers

4.3 Stacking GNN Layers


How do we connect GNN layers into a GNN? In the standard way, we stack GNN layers
sequentially. This leads to the over-smoothing problem: as each node's receptive
field becomes larger and larger, the receptive fields of two different nodes in the graph
eventually converge into one, which causes all the node embeddings to become nearly identical.
Hence
1. Be cautious when stacking GNN layers. More layers do not always help, unlike
neural networks in other domains. The number of layers should be a bit larger than
the radius of the receptive field we need, but there should not be too many layers.

• Solution 1: Increase the expressive power within each GNN layer: We can make
aggregation/transformation into a deep neural network.
• Solution 2: Add layers that do not pass messages. A GNN does not necessarily
only contain GNN layers. We can apply MLP layers to each node,
before and after the GNN layers, as pre-processing and post-processing layers. In practice,
adding these layers is beneficial.

2. Add skip connections in GNNs: Since earlier GNN layers can sometimes be better
at differentiating nodes, we add shortcuts to the GNN.
A skip connection creates a mixture of models: we get a mixture of shallow and deep
GNNs. When we have N skip connections, information has
2^N possible pathways of transmission. An example of a skip connection (the first term is
F(x), the second the shortcut x):

  h_v^{(l)} := \sigma\left( \sum_{u \in N(v)} W^{(l)} \frac{h_u^{(l-1)}}{|N(v)|} + h_v^{(l-1)} \right)

Another option is to directly skip to the last layer.


Figure 4.2: GNN with pre- and post-process layers

(a) A shortcut or skip connection (b) Skip connections to the last layer

4.4 Graph Manipulation in GNNs


The raw input graph is not always suitable for use as a computation graph. The reasons
could be:

• Feature Level:


– Input graph lacks features: Feature augmentation


(a) Assign constant values to nodes: Drawback on expressive power.
(b) Assign unique ids to nodes: Drawback on inductive learning and
computational cost.
Certain structures are hard to learn by GNN. For example, the cycle count feature
(the length of a cycle that v resides in). We could embed the cycle count as a feature.
Other commonly used augmented features are clustering coëfficient, PageRank,
centrality.
• Structure level:
– The graph is too sparse leading to inefficient message passing
(a) We could add virtual edges: e.g. Connect 2-hop neighbours via virtual
edges. Intuitively, we use A + A2 as the adjacency matrix instead of just A.
This is useful in bipartite graphs.
(b) Add virtual node that connects to all nodes in the graph. This greatly
improves message passing in sparse graphs.
– The graph is too dense leading to costly message passing
We could sample neighbours when doing message passing, i.e. during
aggregation, only a randomly selected subset of N(v) have their messages passed to
v.
– The graph is too large which cannot fit the computational graph into GPU.
We could sample subgraphs to compute embeddings. This will be covered later
in lecture: Scaling up GNNs.
It is unlikely that the input graph happens to be the optimal computation graph.

5 GNN Augmentation and Training


5.1 Predictions with GNNs
So far we have covered graph neural networks and node embeddings. With the node
embeddings we have created, we can execute different prediction tasks. Different levels of
tasks require different prediction heads.
• Node-level prediction: We can directly make predictions from node embeddings
using a linear prediction head:

  \hat{y}_v := \mathrm{Head}(h_v^{(L)}) = W h_v^{(L)}

• Edge-level prediction: Make predictions using pairs of node embeddings:

  \hat{y}_{u,v} := \mathrm{Head}(h_u^{(L)}, h_v^{(L)})

This can be implemented using


Figure 5.1: DiffPool: A Hierarchical Pooling

– Concatenation and linear layers:

  \mathrm{Head}(h_u^{(L)}, h_v^{(L)}) = W_1 h_u^{(L)} + W_2 h_v^{(L)}

– Dot product:

  \mathrm{Head}(h_u^{(L)}, h_v^{(L)}) = h_u^{(L)\,⊺} W h_v^{(L)}

where W can be I, or a set of k matrices for k-way label prediction.


• Graph-level prediction:

  \hat{y}_G := \mathrm{Head}(\{h_v^{(L)} : v \in V(G)\})

This is similar to the aggregation step in GNNs. Global pooling, e.g. mean pooling,
max pooling, or sum pooling, works well for smaller graphs.
The main issue is global pooling over a large graph loses information. A solution is to
aggregate the graph hierarchically. We train two independent GNNs at each level.
• GNN-A: Compute node embeddings (embedding task)
• GNN-B: Compute the cluster a node belongs to (clustering task)
For each pooling layer, use clustering assignments from GNN-B to aggregate node
embeddings generated by GNN-A.

Question: Is GNN-A trained on the entire graph?


In the first layer of clustering, it is. In the subsequent layers it is trained on the
clustered graph.

5.2 Training Graph Neural Networks


The ground truth for training GNNs comes from either
• (Supervised learning) External data e.g. predict drug likeness of a molecular graph.
• (Unsupervised learning) Graph features e.g. link prediction, predicting whether two
nodes are connected. This is sometimes called self-supervision.
Sometimes the differences are blurry.


Figure 5.2: A training/validation/test split schematic. In transductive settings, the grey
edges would be included. In inductive settings they would not.

Figure 5.3: An inductive dataset of three different graphs. Each graph is given its own
message and supervision edges.

5.3 Setup GNN Prediction


How to split dataset into train/validation/test set? The three datasets are used for:

• Training set: Used for optimising GNN parameters

• Validation set: Used to develop model and hyperparameters

• Test set: Used to report final performance.

Sometimes we cannot guarantee that the test set will be truly held out, in which case we
could use a random split and report the average performance over different random seeds.
Splitting graphs is special and has its own quirks compared to image datasets. If we
split a graph's vertices into different sets, the nodes are not independent: the nodes in the
“unseen” validation or test set will affect our predictions on the nodes in the training set,
because of the message passing mechanics. There are two solutions to this issue:

• Transductive Setting: The input graph can be observed in all datasets splits, but
only the training sets have visible labels.

• Inductive Setting: Break the edges between splits to get multiple graphs. In this
case the nodes in different components of the graph are truly independent.
Only this setting is applicable to graph classification.

In a link-prediction task, the setup of the task is tricky. It is a self-supervised task and we
need to generate labels and datasets on our own. A practical method is to hide edges from
the GNN and let GNN predict if those edges exist. We split edges twice:


Figure 5.4: A transductive dataset

1. Assign two types of edges in the original graph: Message edges and Supervision edges.
Message edges will be visible to the GNN while supervision edges will not.

2. Split edges into train/validation/test. We can use either an inductive (Figure 5.3) or a
transductive (Figure 5.4) setting.
The transductive setting is the default when people talk about link prediction. In this
case there is only one graph, observable in all dataset splits.

Stage        GNN Input                                           Labels
Training     Training message edges                              Training supervision edges
Validation   Training message/supervision edges                  Validation edges
Test         Training message/supervision and validation edges   Test edges

Link prediction is fully supported in PyG and GraphGym.

6 Theory of Graph Neural Networks


A question we wish to answer: How powerful are GNNs? Specifically:

• What is the expressive power of GNNs such as GCN, GAT, GraphSage?

• How to design a maximally expressive GNN model?

Settings
We focus on message passing in GNNs:

1. Message: Each node computes a message

  m_u^{(l)} := \mathrm{Message}^{(l)}(h_u^{(l)})

2. Aggregate: Each node aggregates the messages from its neighbours

  h_v^{(l)} := \mathrm{Aggregate}^{(l)}\big(\{m_u^{(l)} : u \in N(v) \cup \{v\}\}\big)


Figure 6.1: When f produces one-hot encodings, the ϕ counts the number of occurrences of
each element of the multi-set.

A GNN distinguishes graph structures using the computation graphs induced by the
neighbourhood of each node. If the k-hop neighbourhood structures of two nodes are
identical and the nodes have the same features, a GNN would not be able to distinguish
between them. A computation subgraph is a rooted subtree with root at each node.
We can measure expressive power using injections. The most expressive GNNs map rooted
subtrees to node embeddings injectively. If each step of the GNN's aggregation completely
retains the neighbourhood information of each node, the generated node embeddings can
distinguish different rooted structures. In other words, the most expressive GNN uses
injective neighbourhood aggregation.

6.1 Designing the Most Powerful GNN


Observe that the expressiveness of a GNN can be characterised by the neighbourhood
aggregation function it uses. It can be abstracted as a function over a multi-set

  \mathrm{Aggregate}(\{x_u : u \in N(v)\})

In GCN, this is the mean function, and in GraphSAGE, it is the max pool. For example,
both pooling functions will create the same aggregation over the neighbour message
multi-sets

  \left\{ \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \end{bmatrix} \right\}
  \quad\text{and}\quad
  \left\{ \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \end{bmatrix} \right\}

In general, the discriminative power decreases in the series

  sum (multiset) > mean (distribution) > max (set)

Theorem 6.1 (Xu et al. ICLR 2019). Any injective multi-set function can be expressed as

  \phi\left( \sum_{x \in S} f(x) \right)

where ϕ, f are non-linear functions.

Proof. (sketch) f produces one-hot encodings, and the sum adds them together, counting the
occurrences of each element; see Figure 6.1.

To model ϕ and f, we can use an MLP.


Figure 6.2: MLP with one hidden layer

Theorem 6.2 (Universal Approximation Theorem, Hornik et al., 1989). A 1-hidden-layer
MLP with sufficiently large hidden dimensionality and a non-polynomial activation function
σ can approximate any function to arbitrary accuracy.
Therefore we can use the following structure to model any injective multiset function.
Usually a hidden dimension of 100 to 500 is sufficient. This brings us to the most expressive
GNN, the Graph Isomorphism Network (GIN):

  \mathrm{MLP}_\phi\left( \sum_{x \in S} \mathrm{MLP}_f(x) \right)

GIN uses the two-MLP structure above and is the most expressive message-passing GNN.

Question: Why are we using a MLP as the multiset function?


We could also use hash functions, but the benefit of a neural network is that we will
be able to learn depending on the specific task.

Question: Can we just assign randomness to each node to distinguish them?


Yes, but we want to specifically map same computation graphs to the same output.
Randomness would distinguish every node.

Question: Why should we map same neighbourhood structure to the same


embedding and disregard the individual identity of the node?
In real use cases it's very rare to see two nodes with identical computation graphs.
Another architecture, position-aware GNN, solves this problem. We shall see it in a
future lecture.

Question: Why would we want the aggregation function to be learnable?


Why can’t we just use a hash function in place of MLP?
The hash function is indeed very expressive. The main motivation for using a neural
network instead is that

1. Each node has features.

2. The neural network is differentiable and can be trained in conjunction with the rest of the model.


GIN is comparable to the Weisfeiler-Lehman kernel (WL kernel), or
colour-refinement algorithm. Given a graph G with nodes V,
1. Assign an initial colour c^{(0)}(v) to each node v.
2. Iteratively refine node colours by

  c^{(k+1)}(v) := \mathrm{hash}\big(\{c^{(k)}(v)\} \cup \{c^{(k)}(u) \mid u \in N(v)\}\big)

3. After k steps of colour refinement, c^{(k)} summarises the structure of the k-hop
neighbourhood.
The WL kernel is computationally efficient: the total time complexity is O(|E|), and
the aggregation function is simply a hash function.
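A minimal sketch of this colour-refinement loop, assuming the graph is a Python dictionary and using a tuple of (own colour, sorted neighbour colours) in place of the hash:

def wl_colours(graph, num_iters=3):
    """graph: dict node -> iterable of neighbours; returns the refined colour of each node."""
    colour = {v: 0 for v in graph}                 # uniform initial colours c^(0)
    for _ in range(num_iters):
        sigs = {v: (colour[v], tuple(sorted(colour[u] for u in graph[v]))) for v in graph}
        palette = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        colour = {v: palette[sigs[v]] for v in graph}   # relabel to compact colour ids
    return colour

path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}       # path graph 0-1-2-3
print(wl_colours(path))                             # {0: 0, 1: 1, 2: 1, 3: 0}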
GIN models the hash function in the WL kernel using neural networks.
Theorem 6.3 (Xu et al. ICLR 2019). Any injective function over the tuple

  \big( c^{(k)}(v), \{ c^{(k)}(u) \}_{u \in N(v)} \big)

(the root feature and the neighbouring features) can be modelled as

  \mathrm{GINConv}\big(c^{(k)}(v), \{c^{(k)}(u) : u \in N(v)\}\big)
  := \mathrm{MLP}_\phi\left( (1 + \epsilon)\,\mathrm{MLP}_f(c^{(k)}(v)) + \sum_{u \in N(v)} \mathrm{MLP}_f(c^{(k)}(u)) \right)

Question: Why is the ϵ needed here?


It allows differentiating the node itself from its neighbours.

Question: In the hash table, we would not be able to control the output
(almost random), but in our case the output seems to be deterministic.

The discussion here is mainly about how to design an injective function over a multi-
set.

Question: What happens to nodes in the network with no edges? These
nodes cannot be distinguished by arbitrarily deep WL kernel hashing or
GNNs.
In this case, the two nodes are just isolated nodes. These nodes would have identical
neighbourhood structure and identical embeddings.

If the input features c^{(0)}(v) are one-hot, direct summation is injective. In this case, we only
need ϕ to ensure injectivity:

  \mathrm{GINConv}\big(c^{(k)}(v), \{c^{(k)}(u) : u \in N(v)\}\big) := \mathrm{MLP}\left( (1 + \epsilon)\,c^{(k)}(v) + \sum_{u \in N(v)} c^{(k)}(u) \right)
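A minimal PyTorch sketch of this simplified GINConv update (the MLP architecture and sizes are illustrative):

import torch

class GINConv(torch.nn.Module):
    def __init__(self, dim, eps=0.0):
        super().__init__()
        self.eps = torch.nn.Parameter(torch.tensor(eps))
        self.mlp = torch.nn.Sequential(            # the MLP of the formula above
            torch.nn.Linear(dim, dim), torch.nn.ReLU(), torch.nn.Linear(dim, dim))

    def forward(self, C, A):
        """C: [num_nodes, dim] node colours/features; A: [num_nodes, num_nodes] adjacency."""
        return self.mlp((1 + self.eps) * C + A @ C)  # (1 + eps) c(v) + sum over neighbours

n, dim = 5, 7
A = torch.zeros(n, n)
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0                  # path graph
print(GINConv(dim)(torch.randn(n, dim), A).shape)    # torch.Size([5, 7])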

Advantages of GIN over WL:


                  Update target                        Update function
WL graph kernel   Node colours (one-hot)               hash
GIN               Node embeddings (low-dim vectors)    GINConv

• Node embeddings are low-dimensional. Hence they can capture the fine-grained
similarity of different nodes.

• Parameters of update function can be learned from downstream tasks.

WL Kernel has been both theoretically and empirically shown to distinguish most of the
real-world graphs [Cai et al. 1992]. Hence GIN is also powerful enough to distinguish most
of the real graphs.

6.2 When things don’t go as planned


• Data preprocessing is important! Node attributes should be normalized.

• Optimiser: Adam is relatively robust to learning rate

• Activation Function:

– ReLU often works well.


– Other good alternatives: LeakyReLU, PReLU
– No activation at the output layer
– Include bias at every layer

• Embedding dimensions: 32, 64, 128 are good starting points.

Debugging Deep Neural Networks:

• Loss/Accuracy not converging:

– Check pipeline (e.g. in PyTorch we need to zero the gradients)


– Adjust hyperparameters such as learning rate
– Pay attention to weight parameter initialisation
– Scrutinise loss function

• Important for model development:

– Overfit on (part of) training data


With a small training dataset, loss should be essentially close to 0 with an
expressive neural network.
– Monitor the training and validation loss curve.


Figure 7.1: A k-layer GNN embeds a node based on the k-hop neighborhood structure.

Figure 7.2: The two square nodes have the same computational graphs and therefore the
same embedding despite having different neighbourhood structures.

7 Limits of Graph Neural Networks


What should a perfect GNN do? Intuitively, a perfect GNN should build an injective
function between neighbourhood structure and node embeddings. Therefore, in a perfect
GNN:

1. If two nodes have the same neighborhood structure, they must have the same
embedding

2. If two nodes have different neighborhood structure, they must have different
embeddings


Observation (2) is often unsatisfiable: there are basic structures that existing GNN
frameworks cannot distinguish, such as the length of cycles. The expressive power of GNNs
can be improved to resolve this problem.
Observation (1) could also be problematic: Sometimes we may want to assign different
embeddings to nodes that have different positions in the graph. e.g. In position-aware
tasks.
We’ll resolve these issues by building more expressive GNNs.


(a) Node on a cycle (b) Computational graph

Figure 7.3: Failure 1: The computational graph of a node on a cycle is always the same

Figure 7.4: Failure 2: The computational graph of • and • are the same, so the link-level
prediction on two dashed edges will be identical.

Figure 7.5: Failure 3: Nodes on two different graphs have identical computational graphs

Figure 7.6: The WL Kernel inherits graph symmetries. Symmetric colours are associated
with limitations involving spectral decomposition of a graph.

7.1 Spectral Perspective of Message Passing


Due to their high symmetry, GNNs cannot perform perfectly in structure-aware tasks either.
GNNs exhibit three levels of failure in structure-aware tasks:

• Node Level: Different inputs with the same computational graph leads to GNN
failure (Figure 7.3)

• Edge Level: Edge prediction tasks may fail since the nodes on the edges have
identical computational graphs. (Figure 7.4)

• Graph Level: Same overall computation graphs on different graphs lead to same
prediction. (Figure 7.5)


Recall the definition of the Graph Isomorphism Network (GIN):

  c_v^{(l+1)} := \mathrm{MLP}\left( (1 + \epsilon)\,c^{(l)}(v) + \sum_{u \in N(v)} c^{(l)}(u) \right)

(the root feature plus the neighbouring features give the next layer's colour).
We can unroll the first MLP layer:

  c_v^{(l+1)} := \mathrm{MLP}_{-1}\left( \sigma\left( W_0^{(l)} (1 + \epsilon)\,c^{(l)}(v) + W_1^{(l)} \sum_{u \in N(v)} c^{(l)}(u) \right) \right)

where MLP_{-1} denotes all MLP layers except the first. This can be written in matrix form:

  C^{(l+1)} = \mathrm{MLP}_{-1}\left( \sigma\left( C^{(l)} W_0^{(l)} + A C^{(l)} W_1^{(l)} \right) \right) = \mathrm{MLP}_{-1}\left( \sigma\left( \sum_{k=0}^{1} A^k C^{(l)} W_k^{(l)} \right) \right) \qquad (1)

where A is the adjacency matrix and C^{(l)}[v, :] = c_v^{(l)}.
We can compute the eigen-decomposition of the adjacency matrix

$$A = V \Lambda V^\top$$

where Λ is the diagonal matrix of eigenvalues λ₁, …, λ_N. The eigen-decomposition of A is a
universal characterization of the graph.

Example
The number of cycles in a graph can be viewed as a function of the eigenvalues and
eigenvectors, e.g.

$$\operatorname{diag}(A^3) = \sum_{n=1}^{N} \lambda_n^3\, (v_n \odot v_n)$$

where ⊙ denotes the elementwise product.
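As a sanity check, this identity can be verified numerically. A small NumPy sketch (the example graph, a triangle plus a pendant node, is an arbitrary choice):

import numpy as np

# Adjacency matrix: a triangle (nodes 0,1,2) plus a pendant node 3 attached to node 2.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# Direct computation: diag(A^3)[i] = number of closed walks of length 3 through node i.
direct = np.diag(np.linalg.matrix_power(A, 3))

# Spectral computation: A = V diag(lam) V^T, so diag(A^3) = sum_n lam_n^3 * (v_n ⊙ v_n).
lam, V = np.linalg.eigh(A)            # columns of V are the eigenvectors v_n
spectral = (V ** 2) @ (lam ** 3)

print(direct)                          # [2. 2. 2. 0.]: each triangle node lies on 2 closed 3-walks
print(spectral)                        # matches up to numerical error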

We can interpret GIN layers as MLPs operating on the eigenvectors. Substituting A = VΛV^⊺ into Equation 1 and defining

$$VZ := \sum_{k=0}^{1} A^k C^{(l)} W_k^{(l)}$$

we get

$$C^{(l+1)} = \mathrm{MLP}_{-1}\big(\sigma(VZ)\big)$$

$$Z[n, f] = \Big(V^\top \sum_{k=0}^{1} A^k C^{(l)} W_k^{(l)}\Big)[n, f] = \sum_{i=1}^{d} \sum_{k=0}^{1} \lambda_n^k\, W_k^{(l)}[i, f]\, \big(v_n \cdot C^{(l)}[:, i]\big)$$

Thus the output of the first MLP layer depends on the eigenvalues and on the dot products
between the eigenvectors v_n and the colours at the previous level C^{(l)}[:, i].
With uniform initial colours, we have C^{(0)}[:, i] = 1. The new node embeddings then only
depend on the eigenvectors that are not orthogonal to 1. However, graphs with symmetries
admit eigenvectors orthogonal to 1.


In a nutshell: WL cannot distinguish between symmetric nodes in the graph since the
embeddings and graphs structure admit the same symmetry. The limitations of the WL
kernel are limitations of the initial node color.

Question: How does the dot product relate to the initial colour?

It is a little bit beyond the scope of the course, but the dot product is taken with the initial
colour C^{(0)}. In graphs that have symmetries, the inner product v_n · 1 goes to zero for some
eigenvectors, and the information from C^{(l)}[:, i] along those eigenvectors is extinguished.

7.2 Feature-Augmentation: Structurally-Aware GNNs


A naïve solution is one-hot encoding: assign each node a different id. The issues are:

1. Non-scalable: Needs O(n) feature dimensions

2. Non-inductive: Cannot generalize to new graphs. A graph with a different ordering of
nodes but the same structure will have a different embedding.

One-hot encoding:

+ Has high expressive power since each node has a unique id

- Has low generalizability and cannot generalize to new nodes. New nodes introduce new
id’s.

- Has high computational cost

Is mainly used for small graphs and transductive settings.

In comparison, constant node features:

~ In terms of expressive power, all nodes are identical, but the GNN can still learn from
structure.

+ Is simple to generalize to new nodes by assigning constant features to them.

+ Has low computational cost (1 dimensional feature)

+ Can be used for any graph and inductive settings.

We can also use the diagonals of the adjacency powers as augmented node features. They
correspond to the closed loops each node is involved in:

$$C^{(0)} := [\operatorname{Diag}(A^0), \dots, \operatorname{Diag}(A^{D-1})] \in \mathbb{N}^{N\times D}$$
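A minimal NumPy sketch of building these closed-loop features (the value of D and the example adjacency matrix are arbitrary choices):

import numpy as np

def closed_loop_features(A: np.ndarray, D: int) -> np.ndarray:
    """Stack diag(A^0), ..., diag(A^(D-1)) as initial node features C^(0)."""
    N = A.shape[0]
    feats, power = [], np.eye(N)
    for _ in range(D):
        feats.append(np.diag(power).copy())  # closed walks of the current length
        power = power @ A
    return np.stack(feats, axis=1)           # shape (N, D)

A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)  # a triangle
print(closed_loop_features(A, D=4))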

Theorem 7.1. If two graphs have adjacency matrices with different eigenvalues, there
exists a GNN with closed-loop initial node features that can always tell them apart.


GNNs with structural initial node features can produce different representations for almost
all real-world graphs ("almost all" because distinguishing arbitrary graphs is an open
problem). In this case, a GIN with structural initial node features is strictly more powerful
than the WL kernel.
Certain structures are hard for a GNN to learn, for example the cycle count feature (the
length of the cycles that v resides in). We can embed the cycle count directly as a feature.
Other commonly used augmented features are the clustering coëfficient, PageRank, and centrality.
Structurally aware node feature:

+ Has high expressive power, so node-specific information can be stored.

+ Is simple to generalize to new nodes. Can count triangles or closed loops.

+ Has low computational cost

+ Can be used for any graph

7.3 Counting Graph Sub-Structures


Can we count graph substructures with GNNs only? It turns out we can. Suppose we
assign unique random ids to the nodes, where each assignment leads to an output y⁴. To keep
the final output equivariant, we take the expectation over the random id assignments:

$$y := \mathbb{E}[\mathrm{y}]$$

In practice, it is computed as

$$y := \frac{1}{m} \sum_{j=1}^{m} y^{(j)}$$

We allow the GNN to momentarily break equivariance during each individual sample, but
equivariance holds after taking the expectation.
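A sketch of this sampling procedure (the toy one-hop GNN, the id dimension, and the number of samples are placeholder assumptions; any model that accepts node features would do):

import torch
import torch.nn as nn

def expected_output(model, A, num_samples=10, id_dim=16):
    """Average model outputs over random node-id features to restore equivariance."""
    n = A.shape[0]
    outs = []
    for _ in range(num_samples):
        random_ids = torch.randn(n, id_dim)   # each sample breaks equivariance ...
        outs.append(model(random_ids, A))
    return torch.stack(outs).mean(dim=0)      # ... but the average is equivariant in expectation

class OneHopGNN(nn.Module):                   # placeholder "GNN": mean aggregation + linear layer
    def __init__(self, id_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(id_dim, out_dim)
    def forward(self, x, A):
        deg = A.sum(dim=1, keepdim=True).clamp(min=1)
        return self.lin((A @ x) / deg)

A = torch.tensor([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])
print(expected_output(OneHopGNN(16, 4), A).shape)   # torch.Size([3, 4])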

Question: Is this more computationally efficient than eigen-decomposition?

It depends on the size of the graph. Dense eigen-decomposition is cubic in the number of
nodes, so sampling random ids is often more efficient.

Question: Does the distribution matter?

Sometimes. If the distribution is structurally aware, you can compute more. It's a
trade-off between the complexity of the distribution and the power of the NN.

Question: Is this affected by the number of layers in the GNN?


Yes, if you don’t normalize.
4
Following Ian Goodfellow’s convention of using upright letters as random variables.


(a) Structure-Aware Task (b) Position-Aware Task

Figure 7.7: Structure and Position Aware Tasks; GNNs often work well for structure-aware
tasks but fail at position-aware tasks.

1. Position-aware Tasks: We may wish to assign different embeddings to nodes with
different positions in the network.
Solution: Position-aware GNNs
A naïve approach assigns a unique one-hot label to each vertex. This is infeasible since
it is

• Non-scalable: It requires O(|V|) features

• Non-inductive: It cannot generalise to new graphs

2. The expressive power of GNNs is upper bounded by the WL test. For example,
message passing GNNs cannot count the length of a cycle in a graph.
Solution: Identity-aware GNNs

7.4 Position-Aware GNN


Two types of tasks on graphs:

• Structure-Aware Tasks: Nodes are labeled by their structural roles in the graph

• Position-Aware Tasks: Nodes are labeled by their position in the graph.

GNNs always fail for position aware tasks due to the similarity of computational graphs.
We could randomly pick a node s1 as an anchor node and represent other nodes by their
relative distances w.r.t. s1 . The anchor node serves as a coördinate axis. We can pick more
nodes s2 , s3 , . . . as anchor nodes to better characterise node positions in the graph.

Theorem 7.2 (Bourgain's Theorem, Informal). Let (V, d) be the metric space on the graph
vertices V and c a constant. Select random anchor sets S_{i,j} ⊆ V such that each node of V is
included in S_{i,j} independently with probability 2^{-i}.
Consider the following embedding function

$$f(v) := \big[d_{\min}(v, S_{1,1}),\ d_{\min}(v, S_{1,2}),\ \dots,\ d_{\min}(v, S_{\log n,\, c\log n})\big] \in \mathbb{R}^{c \log^2 n}$$

where d_min(v, S) := min_{u∈S} d(v, u).

Then the embedding distance produced by f under ∥·∥₂ is "close" to d.⁵

⁵ See https://www.cs.toronto.edu/~avner/teaching/S6-2414/LN2.pdf


Figure 7.8: Position-aware GNN using permutation invariant NN

P-GNN⁶ follows this theory. It samples O(log² n) anchor sets S_{i,j} and embeds each node
via f. The resulting positional information can be used in two ways (a sketch of the simple
anchor-distance encoding follows this list):

• Simple solution: Use the position encoding as an augmented node feature. The
problem is that, since the encoding is tied to a random anchor set, the dimensions of the
positional encoding can be randomly permuted without changing its meaning.

• Rigorous solution: Use a special NN that maintains the permutation-invariance
property of the position embedding.
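A minimal sketch of computing anchor-set distance encodings with BFS shortest-path distances (the number of anchor sets and the geometric sampling probabilities are simplifying assumptions):

import random
from collections import deque

def bfs_distances(adj, sources):
    """Min shortest-path distance from each node to the closest node in `sources`."""
    dist = {v: float("inf") for v in adj}
    queue = deque()
    for s in sources:
        dist[s] = 0
        queue.append(s)
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if dist[w] == float("inf"):
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def anchor_position_encoding(adj, num_sets=8):
    """p[v] = [d_min(v, S_1), ..., d_min(v, S_k)] for random anchor sets S_j."""
    nodes = list(adj)
    encodings = {v: [] for v in adj}
    for j in range(num_sets):
        p = 2.0 ** (-(j % 4 + 1))            # anchor sets of geometrically varying size
        S = [v for v in nodes if random.random() < p] or [random.choice(nodes)]
        dist = bfs_distances(adj, S)
        for v in nodes:
            encodings[v].append(dist[v])
    return encodings

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}  # a path graph as a toy example
print(anchor_position_encoding(adj))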

Question: How is this embedding different from Node2Vec?


Node2Vec boils down to a particular matrix factorisation. The Bourgain-theorem embedding
is about embedding a metric space into R^n and is more geometric.

Question: What do we pick for d in Bourgain Theorem?


Usually people use the shortest path distance. This is dependent on the application.

• Training: New anchor sets are re-sampled every time

• Inference: Given a new unseen graph, new anchor sets are sampled.

Question: Why sample new anchor nodes every time?


Although the anchor nodes are randomised, the distances from these anchor nodes
follow a relatively stable distribution.

Question: Why is it better to have multiple anchor nodes?


Larger anchor sets reduce the variance of the distribution of node distances.

In some applications it may be better to manually engineer an anchor set.

Question: Do different metrics afford different performances?


There is not a clear answer.
6
See https://arxiv.org/abs/1906.04817


7.5 Identity-Aware GNNs


ID-GNN uses heterogeneous message passing: nodes of different colourings use
different message and aggregation functions. For each node v, a K-layer ID-GNN applies a
special message function to every node u in the K-hop neighbourhood of v:

$$h^{(k)}_{u|v} := \operatorname{Aggregate}^{(k)}\Big(\big\{\operatorname{Message}^{(k)}_{\mathbb{1}[s=v]}\big(h^{(k-1)}_{s}\big) : s \in N(u)\big\},\ h^{(k-1)}_{u}\Big)$$

where the message function depends on whether s is the centre node v, and the embedding of
v is set to h_v := h^{(K)}_{v|v}.
Intuitively, ID-GNN can count cycles originating from a given node, but a plain GNN cannot.
Based on this, we propose a simplified version ID-GNN-Fast, where we use the cycle count
(the number of times the coloured node is reached) as an augmented node feature.

Question: Is there a case where the node colouring would not be useful in
network A and B?
The networks A and B might converge to the same result, but they will not if the
objective function forces them to distinguish the nodes.

Question: Is the node colouring similar to node anchors?


Position-Aware is more macroscopic and colouring is more microscopic. These two
concepts are orthogonal.

Question: Can we view the coloured graphs as a heterogeneous graph?


Yes.

8 Graph Transformers
We know a lot about the design space of GNNs. What does the design space of graph
transformers look like?

Figure 7.9: Inductive Node Colouring distinguishes computational graphs on cycles of length
3 and 4 (left: graph; right: computation graph).


(a) 3-cycle (b) 4-cycle

Figure 7.10: Cycle counts

A transformer is a type of deep learning model featuring:


1. Self-Attention Mechanism: An attention layer with the same query, key, and
value vectors.
2. Encoder-Decoder Architecture: The input to the transformer is first embedded
in an embedding space. The output undergoes the reverse transformation during
generation.
3. Positional Encoding: Attention mechanism has no innate knowledge of position, so
a positional encoding helps the mechanism.
4. Multi-head Attention: Multiple attention layers are used for each transformer
model.
5. Feed-forward NN: A feed-forward network at the output end of the attentions
processes the output.
The applications are:
• Natural Language: Transformers are the basis of BERT, GPT, and T5
• Vision: Vision Transformers (ViTs) apply the attention mechanism to sections of the
input
• Graphs: Transformers can model relationships between nodes
Transformers map 1D vectors of input data to 1D vectors of output data. Each element in
the vector is a token. Each token represents a “unit” of data e.g. a word. The output of a
transformer can either be this token sequence or a pooled output by combining all output
symbols.
Before we discuss multi-head attention, we have to discuss single-head attention.

8.1 Self-Attention
The attention layer processes each input token x_i into three vectors, the query q_i, key k_i,
and value v_i,⁷ via three trainable linear maps:

$$q_i := W_q x_i, \qquad k_i := W_k x_i, \qquad v_i := W_v x_i$$
7
This terminology comes from search engines where the user inputs a query which gets matched with
keys.


Figure 8.1: A multi-head transformer


Figure 8.2: Self attention layer in a transformer

q_i and k_i must have the same length d. Then the layer computes a score between the query
and the key:

$$a_i := \frac{1}{\sqrt{d}}\, q_i \cdot k_i$$

Finally, the output of the layer is a mixture of the v_i weighted by the softmax of the scores:

$$z := \sum_i \operatorname{softmax}(a_1, \dots, a_N)_i\, v_i$$

We can represent the same calculation in matrix form, with the input matrix X ∈ R^{M×N} and
query, key, and value matrices

$$Q := X W_q, \qquad K := X W_k, \qquad V := X W_v$$

Then we can compute the output from the attention scores:

$$Z := \operatorname{softmax}\Big(\frac{1}{\sqrt{d}}\, Q K^\top\Big)\, V$$

Multi-head attention is the same as executing many instances of this process in parallel.
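A minimal NumPy sketch of one self-attention head matching the matrix form above (the dimensions are arbitrary example values):

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention: Z = softmax(QK^T / sqrt(d)) V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (num_tokens, num_tokens) attention scores
    return softmax(scores, axis=-1) @ V  # each output token is a weighted mix of values

rng = np.random.default_rng(0)
M, N, d = 5, 8, 4                        # 5 tokens, feature dim 8, head dim 4
X = rng.normal(size=(M, N))
Z = self_attention(X, rng.normal(size=(N, d)), rng.normal(size=(N, d)), rng.normal(size=(N, d)))
print(Z.shape)                           # (5, 4)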

Question: Can we take mean pool over the outputs of the heads of the
multi-head attention?
Mean pool ignores ordering, so it would not be very useful.


Excursion:
Aniva: I think this would be easier to see using the Einstein summation convention,
where if an index appears on only one side, it is assumed to be summed over. Suppose
x_{ij} is the feature of token i, q^i_µ the query, k^i_µ the key, and v^i_ν the value. Then one
self-attention layer computes

$$q^i_\mu := (W_q)_{j\mu}\, x_{ij}, \qquad k^i_\mu := (W_k)_{j\mu}\, x_{ij}, \qquad v^i_\nu := (W_v)_{j\nu}\, x_{ij}$$

and one attention layer is

$$z^{i}_\nu := \operatorname{softmax}_{i'}\Big(\frac{1}{\sqrt d}\, q^{i}_\mu\, k^{i'}_\mu\Big)\, v^{i'}_\nu$$

You can replicate this with the Python package einops.

Similar to transformers, GNNs also take in a sequence of vectors (in no particular order)
and output a sequence of embeddings. The difference is that a GNN uses message passing,
while a transformer uses attention.

8.2 Self-Attention and Message Passing


Consider the attention output of just one token:

$$z_1 := \sum_j \operatorname{softmax}_j(q_1 \cdot k_j)\, v_j$$

We can represent this as:

1. Compute the message from j: (v_j, k_j) := Message(W_v x_j, W_k x_j)

2. Compute the query from token 1: q_1 := W_q x_1

3. Aggregate all messages:

$$\operatorname{Aggregate}\big(q_1, \{\operatorname{Message}(x_j) : j\}\big) := \sum_j \operatorname{softmax}_j(q_1 \cdot k_j)\, v_j$$

Question: Does this assume the ordering of tokens is not important?


Yes, and we will have to fix it.

This shows Self-attention can be written as message and aggregation – i.e., it is a GNN!
Every node receives information from every other node. In other words, the graph is fully
connected.
At the moment, the transformer model we have is oblivious to token ordering, since the
attention mechanism and the softmax weighting ignore index order. To fix this issue we need
positional encoding. For NLP tasks, each token x_i in the input is concatenated with a
vector p indicating its position, e.g. [cos i/N, sin i/N]. Then we use the concatenated vector
[x, p] as the input to an attention layer instead of x.


Figure 8.3: Node features on a graph can be used as the inputs features for a transformer

8.3 A New Design Landscape for Graph Transformers


How do we input a graph into a transformer? We need to understand the key components
of a transformer, (1) tokenization, (2) positional encoding, and (3) self-attention, and make
graph versions of them. A graph transformer must take as inputs (1) node features, (2)
adjacency information, and (3) edge features.
For (1), a sensible choice is the node features. The main problem is that we then
completely lose adjacency information. We can add this information back through
positional encodings based on adjacency information.
A few options for positional encoding exist:

1. Relative distances based on random walks from an anchor set: This is particularly
strong for tasks that require counting cycles. Pick anchor vertices v_1, …, v_l, and each
vertex v gets the position encoding

$$p := [d(v, v_1), \dots, d(v, v_l)]$$

Relative distances are useful for position-aware tasks but not for structure-aware tasks.

2. Laplacian Eigenvector Positional Encoding: Each graph has a Laplacian matrix

$$L := D - A$$

where D is the diagonal degree matrix and A the adjacency matrix. Several Laplacian
variants exist that incorporate degree information differently. The Laplacian matrix
captures the graph structure, and its eigenvectors inherit this structure.
Eigenvectors with small eigenvalues correspond to global structure, and those with large
eigenvalues correspond to local symmetries. We can calculate the eigen-decomposition
of the Laplacian matrix L = ΣΛΣ^⊺ and use Σ (with index order [data, feature]) as the
position encoding.
A simple task, such as deciding whether a graph has a cycle, can be solved by a GNN with
the assistance of the Laplacian eigenvectors.
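A small NumPy sketch of computing Laplacian eigenvector positional encodings (keeping the k smallest-eigenvalue eigenvectors is an assumed, common convention):

import numpy as np

def laplacian_positional_encoding(A: np.ndarray, k: int) -> np.ndarray:
    """Return the k eigenvectors of L = D - A with smallest eigenvalues as node PEs."""
    D = np.diag(A.sum(axis=1))
    L = D - A
    eigvals, eigvecs = np.linalg.eigh(L)      # eigh returns eigenvalues in ascending order
    return eigvecs[:, :k]                     # shape (num_nodes, k); signs are arbitrary

A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)     # a 4-cycle
print(laplacian_positional_encoding(A, k=2))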


Figure 8.4: Examples of eigenvectors in a Laplacian matrix

Question: Does the Laplacian only encodes structure or both structure and
position?
Both. See Figure 8.4.

Finally, we need to find out how to embed the edge features x_{i,j}. The only place in the
attention mechanism where pairs of vertices enter is during the computation of the
attention scores [a_{i,j}] = QK^⊺. We can adjust these based on the edge features,
a_{i,j} ↦ a_{i,j} + c_{i,j}, where

$$c_{i,j} := \begin{cases} w_e^\top x_{e_{i,j}} & \text{if an edge exists between } i, j\\ \sum_k w_{e_k}^\top x_{e_k} & \text{if a path } e_1, \dots, e_n \text{ exists between } i, j \end{cases}$$

8.4 Positional Encodings for Graph Transformers


Laplacian eigenvectors are not the best we can do; they have structure that we have so far
ignored. For example, if v is an eigenvector of L, then cv for any c ≠ 0 is also an
eigenvector of L. The iterative algorithms used to compute eigenvectors produce
eigenvectors with random signs, and this arbitrariness affects the prediction of our GNN.
A simple idea is to randomly flip the signs of the eigenvectors during training. The problem
is that an exponential number of sign choices exists. Alternatively, we can design an NN
which is invariant to sign flips.
Theorem 8.1. For f : Rⁿ → Rⁿ, f(x) = f(−x) for all x if and only if there exists g such that
f(x) = g(x) + g(−x).
Proof. If f(x) = g(x) + g(−x), then f(−x) = g(−x) + g(x) = f(x).
Conversely, if f(x) = f(−x), define g := f/2; then g(x) + g(−x) = f(x)/2 + f(−x)/2 = f(x).


Hence we can use a neural network with this structure:

$$f(x_1, \dots, x_n) := \operatorname{Aggregate}_i\big(g(x_i) + g(-x_i)\big)$$

This is known as SignNet (a small sketch is given after the pipeline below).
Theorem 8.2. If f is sign invariant, there are functions g, h such that
f(x_1, …, x_n) = h(g(x_1) + g(−x_1), …, g(x_n) + g(−x_n))
With this, we have the full structure of positional encoding for GNNs:
With this, we have the full structure of positional encoding for GNN:
1. Compute eigenvectors Σ
2. Get eigenvector embeddings using SignNet
3. Concatenate SignNet embeddings p i with feature vectors x i
4. Pass through main GNN/Transformer
5. (Training) Back-propagate gradients to train SignNet and Prediction models jointly.
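A minimal PyTorch sketch of a SignNet-style sign-invariant encoder (the MLP sizes and the flatten-then-MLP aggregation are illustrative assumptions, not the paper's exact architecture):

import torch
import torch.nn as nn

class SignInvariantEncoder(nn.Module):
    """Encodes Laplacian eigenvectors so that flipping any eigenvector's sign has no effect."""
    def __init__(self, num_eigvecs, hidden, out_dim):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.h = nn.Sequential(nn.Linear(num_eigvecs * hidden, out_dim), nn.ReLU())

    def forward(self, eigvecs):                     # eigvecs: (num_nodes, num_eigvecs)
        x = eigvecs.unsqueeze(-1)                   # (num_nodes, num_eigvecs, 1)
        sym = self.g(x) + self.g(-x)                # sign-invariant per eigenvector
        return self.h(sym.flatten(start_dim=1))     # positional encoding per node

pe = SignInvariantEncoder(num_eigvecs=4, hidden=16, out_dim=8)
print(pe(torch.randn(10, 4)).shape)                 # torch.Size([10, 8])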

9 Machine Learning with Heterogeneous Graphs


So far, we have only handled graphs with one edge type. In this section we describe
learning algorithms on heterogeneous graphs.

9.1 Heterogeneous Graphs


Many real world datasets are naturally described as heterogeneous graphs, graphs whose
nodes and edges are of more than one type. For example, in a publication graph, we could
have paper nodes and author nodes, and the edges could be differentiated into cite edges
and like edges.
A heterogeneous graph is a graph G := (V, E, τ, ϕ), where
• V is the set of nodes v ∈ V.
• E is the set of edges e ∈ E.
• τ maps each node v ∈ V to its node type τ(v).
• ϕ maps each edge e ∈ E to its edge type ϕ(e).
The relation type for edge e is
r(u, v) = (τ (u), ϕ(u, v), τ (v))
Moreover, for edge type r, we define the r-neighbourhood of v ∈ V to be
Nr (v) := {u ∈ N (v) : ϕ(u, v) = r}
We could treat types of nodes and edges as features. For example, we could encode the
one-hot feature [1, 0] for author nodes and [0, 1] for paper nodes.
When do we need a heterogeneous graph?

48
9 MACHINE LEARNING WITH HETEROGENEOUS GRAPHS

Figure 9.1: People, conferences, and publications in academia can be represented by a het-
ereogeneous graph.

• Different nodes/edges have different shapes of features


• We know different relation types represent different types of interactions.
Heterogeneous graphs are a more expressive class of graphs, but this comes with drawbacks
such as higher computational overhead and harder implementation. There are ways of
converting a heterogeneous graph to a homogeneous graph.

Question: Can we have multiple edges of different types between two nodes
Yes.

9.2 Relational GCN


We shall extend Graph Convolutional Networks (GCN) so they can operate on
heterogeneous graphs.

• Directed graph G := (V, E) with one relation:
In a directed graph, we can only pass messages along the direction of the edges:

$$h^{(l)}_v := \sigma\Big(\sum_{u \to v} \frac{W^{(l)}\, h^{(l-1)}_u}{\deg_{in}(v)}\Big)$$

where the normalised term W^{(l)} h^{(l-1)}_u / deg_in(v) is the message, the sum is the
aggregation, and σ is the activation.

• In a graph with multiple relation types, different neural network weights W^{(l)}_{R(e)} can
be used on different edge types:

$$h^{(l)}_v := \sigma\Big(\sum_{u \to v} \frac{W^{(l)}_{R(u\to v)}\, h^{(l-1)}_u}{\deg_{in}(v)}\Big)$$

Question: We would expect some degree of correlation in the edge relations


of real-world data. Could we choose whether to use RGCN or convert the
graph to a homogeneous one based on correlations in the data?
In practice choose the simpler homogeneous graph models first.

Relational Graph Convolutional Network (RGCN) introduces a set of neural
network weights for each relation type on the heterogeneous graph G := (V, T, R, E):

$$h^{(l)}_v := \sigma\Big(\sum_{r\in R}\ \sum_{u\in N_r(v)} \frac{1}{c_{v,r}}\, W^{(l)}_r h^{(l-1)}_u + W^{(l)}_0 h^{(l-1)}_v\Big)$$

where N_r(v) is the r-neighbourhood and c_{v,r} := |N_r(v)| is the r-degree.
In message-aggregation form:
• Message:

$$m^{(l)}_{u,r} := \frac{1}{c_{v,r}}\, W^{(l)}_r h^{(l)}_u, \qquad m^{(l)}_v := W^{(l)}_0 h^{(l)}_v$$

• Aggregation:

$$h^{(l+1)}_v = \sigma\Big(\sum_{u\in N(v)} m^{(l)}_{u, R(u\to v)} + m^{(l)}_v\Big)$$
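A compact PyTorch sketch of one RGCN layer in this message-aggregation form (dense per-relation adjacency matrices and the toy graph are simplifying assumptions):

import torch
import torch.nn as nn

class RGCNLayer(nn.Module):
    def __init__(self, num_relations, in_dim, out_dim):
        super().__init__()
        self.W_r = nn.ModuleList([nn.Linear(in_dim, out_dim, bias=False) for _ in range(num_relations)])
        self.W_0 = nn.Linear(in_dim, out_dim, bias=False)   # self-loop transform

    def forward(self, h, adj_per_relation):
        # adj_per_relation: list of (N, N) 0/1 adjacency matrices, one per relation r
        out = self.W_0(h)                                    # self message
        for A_r, W_r in zip(adj_per_relation, self.W_r):
            deg = A_r.sum(dim=1, keepdim=True).clamp(min=1)  # c_{v,r} = |N_r(v)|
            out = out + (A_r @ W_r(h)) / deg                 # normalised sum of relation-r messages
        return torch.relu(out)

layer = RGCNLayer(num_relations=2, in_dim=8, out_dim=16)
h = torch.randn(5, 8)
A1 = torch.eye(5).roll(1, dims=0)            # toy relation-1 edges
A2 = torch.eye(5).roll(2, dims=0)            # toy relation-2 edges
print(layer(h, [A1, A2]).shape)              # torch.Size([5, 16])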

Each relation has L matrices W^{(1)}_r, …, W^{(L)}_r, and each W^{(l)}_r has size d^{(l+1)} × d^{(l)}. In total
this leads to rapid growth of the number of parameters w.r.t. the number of relations, so
overfitting may become an issue. Two methods of regularisation exist:
• Block Diagonal Matrices: Use B block-diagonal matrices for W_r:

$$W_r := \begin{pmatrix} W_{r,1} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & W_{r,B} \end{pmatrix}$$

Limitation: only nearby neurons can interact via W_r. This reduces the number of
parameters from d^{(l+1)} × d^{(l)} to B × (d^{(l+1)}/B) × (d^{(l)}/B).
• Basis/Dictionary Learning: Share weights across different relations.
We can represent the matrix of each relation W_r as a linear combination of
learnable basis matrices

$$W_r := \sum_{b=1}^{B} a_{r,b}\, V_b$$


where the V_b are shared across relations and (a_{r,b})_{b=1}^{B} are learnable scalars. When the
number of relation types is large, this requires only |R|·B + B·d^{(l+1)} × d^{(l)} parameters per
layer rather than |R| × d^{(l+1)} × d^{(l)} parameters per layer, which is an improvement.
Question: How to choose the basis matrices?
One way is to train homogeneous networks and use their weights as bases, or you could
make them random.

Example
Consider the following graph:

(A small graph on nodes A–F whose edges are labelled with relation types 1, 2, and 3.)

Two different tasks could be executed on this graph using an RGCN:


• Label Prediction: If there are |T| different node labels possible, the final-layer
prediction head h^{(L)}_A for node A encodes the class probabilities for node A.
• Link Prediction: Assume (E, A) is a training supervision edge and all other edges
are training message edges, we could score the RGCN on (E, A) using a relation-
specific score function
fr (h E , h A ) := h ⊺E Wr h A (Wr ∈ Rd×d )
This is specific to one relation and the model is tasked with determining the
probability of there being an edge between E and A. A more general task head
can determine the existence of an edge on top of the edge’s category, but in
many real world cases (e.g. drug discovery, paper citation) the type of the edge
is already known.
A negative supervision edge can be created by perturbing the tail of (E, A) to
become for example (E, B). Note that negative supervision edges should not be
training message edges, so (E, C) cannot be a negative training supervision edge.
Then, the output from f1 is used as a logit in a sigmoid function (σ(f1 (E, A)))
which is trained against the ground truth label of the supervision edges.
The edges not among the training message and supervision edges can then be ranked
by their predicted scores (logits); let r_1, …, r_m denote the resulting ranks. The
performance of the model can be measured with a variety of metrics:


– Hits@k: Fix a value k; then |{i : r_i ≤ k, y_i = 1}|, where y_i is the ground-truth
label, measures the number of relevant hits among the top-k ranked edges.
– Reciprocal Rank: Σ_i y_i / r_i (a higher score is better)

Question: How can we generate negative edges for link prediction if the
graph is very dense.
The question of link prediction is ill defined when the graph is dense since if there are
no places to insert new edges, prediction of new edges cannot fail.

Question: How can we choose the negative edge examples


We can corrupt the tail of existing edges so they point to somewhere not linked in the
original graph.

Question: For negative sampling, how do you account for the imbalance of
edge types?

The prediction of edges (u, r, v) is equivalent to sampling from the conditional distribution
of v given u, r fixed. This conditional distribution is not affected by the imbalance of
edge types.

9.3 Heterogeneous Graph Transformer


Graph Attention Networks (GAT) can be adapted for heterogeneous graphs. Introducing a
new neural network for each relation type is too expensive for attention.
Heterogeneous Graph Transformer (HGT)8 uses scaled dot-product attention
from transformer, where the attention weights are defined as

QK ⊺
!
Attention(Q, K , V ) := softmax √ V
dk
where Q is the query, K is the key, and V is the value. All 3 matrices have the shape
(batchsize, dk ).
Recall that when applying GAT to a homogeneous graph,

$$h^{(l)}_v := \operatorname{Aggregate}\big(\{\alpha_{v,u} \cdot m^{(l)}_u : u \in N(v)\}\big)$$

where α_{v,u} are the attention weights.
Without decomposition, the number of attention parameters can quickly overwhelm the
model, since |T| node types and |R| relation types produce |T|²|R| different weight
matrices.
8
Hu et al. WWW ’20


In Heterogeneous Mutual Attention, we decompose homogeneous attention into node-
and edge-type-dependent attention mechanisms. Specifically, the attention head is defined
as a learned quadratic form

$$\alpha_{v,u} := \Big(k_{\tau(u)}\big(h^{(l-1)}_u\big)\Big)^\top W^{att}_{\phi(u\to v)}\ q_{\tau(v)}\big(h^{(l-1)}_v\big)$$

Note that the edge type directly parameterizes W^{att}_{ϕ(u→v)}, and the node types parameterize
the key and query maps k_{τ(u)}, q_{τ(v)} : d^{(l-1)} → d_k.
Moreover, the message head is also decomposed into node and relation types:

$$m^{(l)}_u := \underbrace{W^{msg}_{\phi(u\to v)}}_{\text{weight for each edge type}}\ \underbrace{N_{\tau(u)}}_{\text{linear head for each node type}}\ h^{(l-1)}_u$$

A layer of HGT is given by

$$h^{(l)}_v := \operatorname{Aggregate}\Big(\Big\{\operatorname{softmax}\Big(\frac{1}{\sqrt{d_k}}\, \alpha_{v,u}\Big) \cdot m^{(l)}_u : u \in N(v)\Big\}\Big)$$

where the softmax terms are the attention scores and the m^{(l)}_u play the role of the values V.

On the ogbn-mag benchmark (predicting paper venues), HGT uses far fewer parameters than
R-GCN and performs better, even though the attention computation is more expensive.

Question: What does ⊕ mean (in the slides)?

Aggregation.

Question: Why attention weights alone aren’t enough to distinguish differ-


ent relation types?
Attention weights don’t take into account the different messages emitted by different
relation types.

9.4 Design Space for Heterogeneous GNNs


So far the message aggregation function treats all messages on an equal footing. In
Heterogeneous Message Aggregation, a different aggregation can be used for each
relation type:

$$h^{(l)}_v := \operatorname{Aggregate}_{r\in R}\Big(\operatorname{Aggregate}^{(l)}_r\big(\{m^{(l)}_u : u \in N_r(v)\}\big)\Big)$$

A common case is where the outer aggregation is concatenation and the inner aggregation
is summation. Since the number of relations is fixed, this produces fixed-size aggregations:

$$h^{(l)}_v := \bigoplus_{r\in R}\ \sum_{u\in N_r(v)} m^{(l)}_u$$

where ⊕ is concatenation.


Model      Score                     Embedding                               S.   A.   I.   C.   M.

TransE     −∥h + r − t∥              h, t, r ∈ R^k                                ✓    ✓    ✓
TransR     −∥M_r h + r − M_r t∥      M_r ∈ R^{d×k}; h, t ∈ R^k; r ∈ R^d      ✓    ✓    ✓    ✓    ✓
DistMult   ⟨h, r, t⟩                 h, t, r ∈ R^k                           ✓                   ✓
ComplEx    Re⟨h, r, t̄⟩               h, t, r ∈ C^k                           ✓    ✓    ✓         ✓

Table 10.1: A summary of various knowledge graph completion methods. The last 5 prop-
erty columns represent the model's capability to model symmetry, antisymmetry, inverse,
composition, and 1-to-n relationships, respectively.

10 Knowledge Graph Embeddings


Knowledge Graphs (KG) are heterogeneous graphs that capture entities, types, and
relationships.

• Nodes are entities

• Nodes are labeled by their types

• Edges between nodes capture relationships between entities

Real world applications: FreeBase, Wikipedia, YAGO, etc. They are used for serving
information and question answering agents.
Common Characteristics:

• Massive: Millions of nodes and edges

• Incomplete: Many true edges are missing

Enumerating all the possible facts is impossible, but we can predict plausible but missing
links.

10.1 Knowledge Graph Completion


The completion task is the following: given a head h and a relation r, predict the missing tail
nodes t. Note that this is a bit different from link prediction tasks. Edges in KGs are represented
as triples (h, r, t).
To solve this task, we would like to have a shallow embedding for (h, r) pairs and
embeddings for t’s such that for plausible edges we have e (h,r) ≃ e t . Note that we do not
use a GNN here. There are two design questions:

• How to embed (h, r)?

• How to define close-ness?

Relations in a knowledge graph can have the following properties:

• Symmetric: r(h, t) =⇒ r(t, h). e.g. Family, Roommate


• Antisymmetric: r(h, t) =⇒ ¬r(t, h)

• Inverse: If r1 (h, t) =⇒ r2 (t, h) e.g. Advisor/Advisee

• Composition/Transitivity: r(x, y) ∧ r(y, z) =⇒ r(x, z)


e.g. Subsumes, logical consequence

• 1-to-n: r(h, ti ) : i = 1, . . . , n are all true.


e.g. r is “student of”.

Question: Are h, r, t vectors learned at the same time?


See the learning TransE algorithm. Each mini-batch updates the embedding.
Training is done by updating the embeddings for losses in the criterion function fr .

Question: How often do we update embeddings?


Knowledge graphs are very stable. Once they’re created they’re sort of fixed. The
embeddings are frozen even if new knowledge comes in.

• TransE:
Intuition: For a triple (h, r, t), h + r ≃ t if the given fact is true. The scoring function
is fr (h, t) := − ∥h + r − t∥. TransE originated from an observation in Language
models that the embeddings of words often have analogous relations

(Washington) − (USA) ≃ (Tokyo) − (Japan)

• TransR:
Like TransE, but the entities are first mapped into the relation-specific space by a
projection matrix M_r ∈ R^{d×k} before translating by r.

Question: Can you make a generalised claim that symmetry implies


1-to-n?
No. In general these are different properties that do not relate to each other.

Question: Why use a linear transformation M r in TransR instead of


a non-linear transformation like a MLP?
You could.

• DistMult: We use a bilinear scoring function f_r(h, t) := Σ_i h_i r_i t_i. This can be
intuitively viewed as the (unnormalised) cosine similarity between h ⊙ r and t.
Intuitively, DistMult defines a hyperplane for each (h, r) pair.


Algorithm 10.1 TransE Training Algorithm

Require: Training set S = {(h, r, t)}, entity set E, relation set L, margin γ, embedding
dimension k, batch size b.
  l ∼ U(−6/√k, +6/√k) for each l ∈ L
  l ← l/∥l∥ for each l ∈ L
  e ∼ U(−6/√k, +6/√k) for each e ∈ E          ▷ Initialise relation and entity embeddings
  loop
    e ← e/∥e∥ for each e ∈ E
    S_batch ∼ MiniBatch(S, b)                  ▷ Sample mini-batch
    T_batch ← ∅
    for (h, r, t) ∈ S_batch do
      (h′, r, t′) ∼ Corrupted-Triplet((h, r, t))
      T_batch ← T_batch ∪ {((h, r, t), (h′, r, t′))}   ▷ Add positive/negative sample pair
    end for
    Loss ← Σ_{((h,r,t),(h′,r,t′)) ∈ T_batch} [γ + d(h + l, t) − d(h′ + l, t′)]₊   ▷ Contrastive (margin) loss
    Update embeddings using ∇Loss
  end loop

• ComplEx: Uses a similar bilinear modeling function but with h, r, t ∈ C^k:

$$f_r(h, t) = \operatorname{Re}\Big(\sum_i h_i r_i \bar{t}_i\Big)$$

This allows modeling of asymmetric relations due to the asymmetry introduced by
the complex conjugate.

• RotatE: TransE in a complex space

There is not a general embedding that works for all KGs. Use a table to select models.
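A small NumPy sketch of the scoring functions discussed above (TransE, DistMult, ComplEx), just to make the shapes concrete; the toy embeddings are random placeholders:

import numpy as np

def transe_score(h, r, t):
    """f_r(h, t) = -||h + r - t||  (higher = more plausible)."""
    return -np.linalg.norm(h + r - t)

def distmult_score(h, r, t):
    """f_r(h, t) = sum_i h_i r_i t_i."""
    return np.sum(h * r * t)

def complex_score(h, r, t):
    """f_r(h, t) = Re(sum_i h_i r_i conj(t_i)) with complex embeddings."""
    return np.real(np.sum(h * r * np.conj(t)))

k = 8
rng = np.random.default_rng(0)
h, r, t = rng.normal(size=(3, k))
print(transe_score(h, r, t), distmult_score(h, r, t))
hc, rc, tc = rng.normal(size=(3, k)) + 1j * rng.normal(size=(3, k))
print(complex_score(hc, rc, tc))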

11 Reasoning in Knowledge Graphs


Multi-Hop Reasoning refers to answering a complex query on an incomplete, massive
knowledge graph. The 1-hop reasoning problem is simply KG completion:

• KG Completion: Is the link (h, r, t) in KG?

• 1-Hop Reasoning: Is t the answer to (h, r)?

Multi-Hop Reasoning generalises this notion by allowing intermediate steps in the
reasoning chain. An n-hop path query q can be represented by

$$q := (v_a, (r_1, \dots, r_n))$$

where v_a is the anchor entity and r_1, …, r_n are the path relations. The answer to q in graph G
is denoted ⟦q⟧_G.


Figure 11.1: Example knowledge graph from biomedicine

Query Type Example Natural Language and Query


One-hop queries What adverse event is caused by Fulvestrant?
(e:Fulvestrant, (r:Causes))
Path queries What protein is associated with the adverse event caused by Fulvestrant?
(e:Fulvestrant, (r:Causes, r:Assoc))
Conjunctive queries
What is the drug that treats breast cancer and caused headache?
(e:BreastCancer, (r:Treatedby)), (e:Migraine, (r:CausedBy))

Table 11.1: Examples of queries on knowledge graphs

Figure 11.2: Query plan of q: starting from the anchor v_a, follow relations r_1, r_2, …, r_n to
reach the answer node.


Figure 11.3: Learning to reason in latent space

Answering queries seems easy: just traverse the graph. However, knowledge graphs are
incomplete, so one is not able to identify all the answer entities by traversal alone.
Can we first do KG completion and then traverse the completed probabilistic KG? No: the
probabilistic graph is dense, and the time complexity of traversing such a graph is
exponential in the query path length L.
A solution to this predictive query problem should be able to answer arbitrary queries
while implicitly accounting for the missing information.

11.1 Answering Predictive Queries on Knowledge Graphs


The key idea is to embed nodes and relations in a graph and learn to reason in the space of
embeddings.
Recall that TransE is a method used for Knowledge Graph completions, where the link
(h, r, t) is considered positive if h + r ≃ t. We can generalise this by using multiple
relations. Define the embedding of a query q to be

q := v a + r 1 + · · · + r n

Then, we can query for node embeddings close to q and label them as the answer to the
query.

Question: Does the order of the path not matter in TransE since it is a
vector addition?
TransE would not be able to model ordered paths.

Since TransE can handle compositional relations, it can handle path queries by translating
multiple relations into a composition. TransR, DistMult, ComplEx cannot handle
composition and hence cannot be easily extended to handle path queries.
Can we answer more complex queries with a conjunction operation? For a conjunctive query
q = (q₁, q₂), we have ⟦q⟧_G = ⟦q₁⟧_G ∩ ⟦q₂⟧_G.



Figure 11.4: TransE for completion and query answering

Figure 11.5: A conjunctive query

11.2 Query2Box
We have two problems to solve in conjunctive queries:

• Each intermediate node represents a set of entities. How can we represent it?

• How do we define the intersection operation in latent space when two queries have to
be simultaneously satisfied.

In Query2Box, each query is embedded as a hyperrectangle (box). A benefit of using
boxes is that the intersection of boxes is well defined. Each box is represented by a centre
and an offset. Some other methods use other geometric shapes, e.g. cones or beta
distributions.

Question: Can we use the TransE approach and embed the paths separately
and take the intersection?
Yes, but representing an entire set with a single point is difficult and we can easily
come up with counterexamples.

Settings

Let d be the embedding dimension, |V| be the number of entities, and |R| be the number of
relations.

In Query2Box:

• Entity embeddings (d |V | parameters): Entities are seen as zero-volume boxes


Figure 11.6: Box Embeddings for biomedicine that encloses some answer entities


Figure 11.7: Projection operator acting on query q with relation r

• Relation embeddings (2d |R| parameters): Each relation takes a box and produces a
new box. This is the projection operator P, which maps Box × R → Box. There is
one projection operator for each relation type, where the centre and offset are moved
by

centre(q ′ ) := centre(q) + centre(r)


offset(q ′ ) := offset(q) + offset(r)

• Intersection operator I: The inputs are boxes and the output is a box. The centre of the new
box should be "close" to the centres of the input boxes, and the offset should shrink,
since the intersected box is smaller than all of the input boxes. It does not
have to be a strict geometric intersection.
We can define the result of the intersection to be

$$\operatorname{centre}(q_\cap) := \sum_i w_i \odot \operatorname{centre}(q_i), \qquad w_i := \frac{\exp\big(f_c(\operatorname{centre}(q_i))\big)}{\sum_j \exp\big(f_c(\operatorname{centre}(q_j))\big)}$$

$$\operatorname{offset}(q_\cap) := \min\{\operatorname{offset}(q_1), \dots, \operatorname{offset}(q_n)\} \odot \sigma\big(f_o(\operatorname{offset}(q_1), \dots, \operatorname{offset}(q_n))\big)$$

where ⊙ is the Hadamard (elementwise) product, σ is the sigmoid function, and the
elementwise min ensures shrinking,

Figure 11.8: Conjunctive query with Query2Box

where fc , fo are neural networks. fc assigns attention scores to each box.

Question: Why do we have to learn the intersection operator? Why


not just use a fixed one?
Learned operator allows more expressivity for the model.

• Score function f_q(v) (captures the relevancy of v w.r.t. q):
We can define this via the sum of the out-distance and the in-distance,

$$d_{box}(q, v) := d_{out}(q, v) + \alpha \cdot d_{in}(q, v)$$

where 0 < α < 1 is a hyperparameter, d_out is the distance from v to the box, and d_in is the
distance inside the box. The score function f_q(v) is then the negative distance of v to q:

$$f_q(v) := -d_{box}(q, v)$$

The intuition is that the distance inside the box is shrunk relative to the outside.
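A minimal NumPy sketch of the box distance and score under an L1 (Manhattan) distance, one natural choice here; the box parameters and α are illustrative:

import numpy as np

def box_distance(center, offset, v, alpha=0.2):
    """d_box(q, v) = d_out + alpha * d_in for an axis-aligned box (center, offset)."""
    lower, upper = center - offset, center + offset
    d_out = np.abs(v - np.clip(v, lower, upper)).sum()      # distance from v to the box surface
    d_in = np.abs(center - np.clip(v, lower, upper)).sum()  # distance inside the box
    return d_out + alpha * d_in

def score(center, offset, v):
    return -box_distance(center, offset, v)

center, offset = np.array([0.0, 0.0]), np.array([1.0, 1.0])
print(score(center, offset, np.array([0.5, 0.5])))   # inside the box: small penalty
print(score(center, offset, np.array([3.0, 0.0])))   # outside the box: larger penalty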

Question: Does it matter which norm we use to measure the distance


of node embedding to box centre?

The distance used in the slides is Manhattan (L1 ) distance. It is very natural
since balls in L1 norm are boxes. Other distances can be used as well.

Question: How can we modify negative relations as a box?


If a query contains a negation it is not a conjunctive query. This would require things
like Beta distributions. See the paper BetaE.

Question: Can we answer disjunctive queries?

The union of boxes is no longer a box, so this can't be done directly, but
you can rewrite the query into disjunctive normal form and perform the union at the last
step.


Figure 11.9: Converting a query to disjunctive normal form where unions are done at the
last step

Question: How does a box handle the curse of dimensionality?


There is no way around the curse of dimensionality. It can be mitigated by only using
tens of dimensions instead of hundreds in real world cases.

Conjunctive queries together with disjunction are called Existential Positive First-Order (EPFO)
queries. We will call them and-or queries.
Given d queries q₁, …, q_d, for any subset Q ⊆ {q₁, …, q_d} we can form the query ⋁_{q∈Q} q.
We would like the question-answering model to include the points belonging to the q ∈ Q but
exclude the points belonging to the q ∉ Q. When q₁, …, q_d have non-overlapping answers, a
dimensionality of Θ(d) is needed to handle all such OR queries.
For arbitrary real-world queries, this number d is often very large, so we cannot embed
and-or queries in a low-dimensional space. A solution is to leave the union operation
to the very last step. This is the disjunctive normal form of the query, and any query
can be written in the form q = q₁ ∨ · · · ∨ q_n.
to the very last step. This is the disjunctive normal form of the query, and any query
can be written in the form q = q1 ∨ · · · ∨ qn .
We can use the following metric to measure the distance of an entity embedding to a
Disjunctive Normal Form (DNF) query:

dbox (q, v) = min{dbox (qi , v) : i = 1, . . . , n}

Question: Is there a case where knowledge is dynamic or changing over


time? How would we deal with that?
Yes. Retrain the embeddings when the knowledge graph gets updated.

11.3 Training Query2Box


Similar to KG completion, the training goal is, for any query embedding q, to maximise the
score f_q(v) for positive answers v ∈ ⟦q⟧ and minimise the score f_q(v′) for negative answers
v′ ∉ ⟦q⟧. This involves minimising the following objective on the training graph G:

$$\ell(f) := \mathbb{E}_{q\sim \mathrm{Query}(G),\ v\in [\![q]\!],\ v'\notin [\![q]\!]}\big[-\log \sigma(f_q(v)) - \log\big(1 - \sigma(f_q(v'))\big)\big]$$

i.e. a logistic loss, with queries q sampled from the training graph.



Question: Does it have to be a box? Can it be something else?


The property that sets apart boxes is that the intersection of boxes is a box, so changing
it to another shape would add complexity.

The queries q are generated from query templates. A query template outlines the
topological structure of a query. Generating a grounded query from a template involves
tracing back from the answer node of the query template and grounding each query edge
backwards.⁹

Question: If we have a language based KG where words can have multiple


meanings, yet our graph has only 1 embedding for each node, how can we
deal with this?
Knowledge graphs should be unambiguous, so the ambiguity should be dealt with
outside and not inside.

Question: Do we care about the intermediate step nodes in a query?


In some cases we would not care, but this would be interesting to study what are the
intermediate representations.

Question: What if, instead of one box, we use beam search to avoid sparsity?
Good idea; it would be interesting to try out.

12 GNNs for Recommender Systems


Settings
Personalised recommender systems can be naturally modeled as a bipartite graph:

• Nodes are users U and items V

• Edges E connect users and items and indicate user-item interaction and are often
associated with timestamp.

Given past user-item interactions, we wish to predict new items each user will interact with
in the future. This is a link prediction problem. For each u ∈ U, v ∈ V , the model should
generate a score f (u, v) which ranks the recommendations. Since |V | is large, evaluating
every user-item pair (u, v) is infeasible. The solution to this problem is to break down the
recommender into two stages.
9
See the SMORE paper for detail on this subject.


Figure 12.1: A recommendation system predicts possible user-item interaction edges given
past edges (users and items form the two sides of a bipartite graph).

• Candidate generation (cheap, fast):


Retrieve about 1000 candidate items from embeddings of millions

• Ranking (slow, accurate):


Score/rank the items using f (u, v).
For each user, we recommend the top K items. For recommendation to be effective, K
(usually 10 to 100) has to be much smaller than the total number of items (usually
billions). The goal is to include as many positive items (items that the user will interact
with in the future) in the top-K recommended items as possible.
The evaluation metric is Recall@K, defined as follows. Let Pu be a set of positive items
that the user will interact with in the future. Let Ru be a set of items recommended by the
model. In the top-K recommendation, |Ru | = K. The recall is the intersection region in
the Venn diagram of Pu and Ru :
$$\mathrm{Recall@}K := \frac{|P_u \cap R_u|}{|P_u|}$$
The final Recall@K is the average of all user Recall@K values.
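A tiny Python sketch of this metric (set inputs are assumed):

def recall_at_k(recommended: set, positives: set) -> float:
    """Recall@K = |P_u ∩ R_u| / |P_u| for one user; R_u already has size K."""
    if not positives:
        return 0.0
    return len(recommended & positives) / len(positives)

# Example: K = 3 recommendations, 2 of the user's 4 future items were recovered.
print(recall_at_k({"a", "b", "c"}, {"b", "c", "d", "e"}))  # 0.5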

12.1 Recommender Systems: Embedding Based Models


To get the top-K items, the K items with the largest scores for a given user u, excluding
already-interacted items, are recommended. In an embedding-based model, each user and
item corresponds to an embedding. If u ∈ R^D (resp. v ∈ R^D) is the embedding of user u ∈ U
(resp. item v ∈ V), the score of (u, v) is a parameterised function fθ(·, ·) : R^D × R^D → R.
The training objective function is to achieve high recall@K on seen (training sample)
user-item interactions. This objective is not differentiable, so instead there are two widely
used surrogate loss functions to enable gradient based training. They are differentiable and
align well with the original training objective.

• Binary Loss: Define positive/negative edges. The set of positive edges E is
observed, and the set of negative edges is E_− := {(u, v) : (u, v) ∉ E}.
The binary loss is

$$-\frac{1}{|E|}\sum_{(u,v)\in E} \log \sigma\big(f_\theta(u,v)\big) - \frac{1}{|E_-|}\sum_{(u,v)\in E_-} \log\big(1 - \sigma(f_\theta(u,v))\big)$$


During training, both sums are approximated with mini-batches of positive and
negative edges.
The binary loss pushes the scores of positive edges above those of negative edges. An
issue is that the scores of all positive edges are pushed higher than the scores of all
negative edges, even though the recommendations for a user u are not affected by the
recommendations for another user, so this unnecessarily penalises the model.

Question: Why can’t we just use a hinge loss to plateau the loss when
the ranking is correct?
The main goal of the BPR loss is to not compare scores for different users,
because the scores across users do not matter in the rankings.

Question: What is the problem with the binary loss in practice?


In practice the binary loss is harder to optimise, and it is easy for the model to
focus on the wrong comparisons (scores across different users).

• Bayesian Personalised Ranking (BPR)¹⁰: For each user u* ∈ U, define the
rooted positive/negative edges as
– E(u*) := {(u*, v) : (u*, v) ∈ E}
– E_−(u*) := {(u*, v) : (u*, v) ∈ E_−}
Training objective: for each user u* we want the scores of the rooted positive edges to
exceed those of the rooted negative edges. The per-user loss is

$$L(u^*) := \frac{1}{|E(u^*)|\,|E_-(u^*)|} \sum_{(u^*,v^+)\in E(u^*)}\ \sum_{(u^*,v^-)\in E_-(u^*)} -\log \sigma\big(f_\theta(u^*, v^+) - f_\theta(u^*, v^-)\big)$$

The overall BPR loss is

$$L := \frac{1}{|U|}\sum_{u^*\in U} L(u^*)$$

In a mini-batch, we sample a subset of users Û ⊆ U. For each u* ∈ Û, we sample one
positive item v⁺ and a set of sampled negative items V_−. The mini-batch loss is then

$$\frac{1}{|\hat U|}\sum_{u^*\in \hat U} \frac{1}{|V_-|}\sum_{v^-\in V_-} -\log \sigma\big(f_\theta(u^*, v^+) - f_\theta(u^*, v^-)\big)$$

(a small sketch of this mini-batch loss is given after the Q&A below).

Question: Why sample many more negative edges than positive edges?
The problem is inherently imbalanced. Each user only interacts with a tiny
subsets of all items, so a lot of negative examples are needed.

10
The term “Bayesian” is not essential to the loss definition. The original paper Rendle et al. 2009
considers the Bayesian prior over parameters (essentially acts as a parameter regularization), which we omit
here.


Question: How to select the negative edges?


See slides at the end. Just picking random edges makes it too easy for the
model to learn. The candidate negative edge generator has to generate difficult
examples to force the model to pick out the first hundreds out of billions.
We could introduce hard negative examples. We will see distance based sampling
(evenly across embedding distances) to solve this issue.
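As referenced above, a minimal PyTorch sketch of the BPR mini-batch loss (the score function here is a simple dot product of shallow embeddings, an illustrative choice):

import torch
import torch.nn.functional as F

def bpr_minibatch_loss(user_emb, item_emb, users, pos_items, neg_items):
    """BPR loss: -log sigma(f(u, v+) - f(u, v-)), averaged over users and negatives.

    users: (B,) user indices; pos_items: (B,) one positive item per user;
    neg_items: (B, M) M sampled negative items per user.
    """
    u = user_emb[users]                                   # (B, D)
    pos = (u * item_emb[pos_items]).sum(dim=-1)           # (B,)   f(u, v+)
    neg = (u.unsqueeze(1) * item_emb[neg_items]).sum(-1)  # (B, M) f(u, v-)
    return -F.logsigmoid(pos.unsqueeze(1) - neg).mean()

user_emb = torch.randn(100, 16, requires_grad=True)
item_emb = torch.randn(500, 16, requires_grad=True)
loss = bpr_minibatch_loss(user_emb, item_emb,
                          users=torch.tensor([0, 1]),
                          pos_items=torch.tensor([3, 7]),
                          neg_items=torch.randint(0, 500, (2, 5)))
loss.backward()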

Why do embedding models work? The core idea is collaborative filtering:


Recommended items for a user are related to preferences of many other similar users. When
the embeddings are low dimensional, they cannot simply memorise all user-item interaction
data, and they are forced to learn similarities between users and items to fit the data.

Question: Do we consider people who share accounts?


We model this heterogeneity outside. The splitting of one user node into multiple real
users should be done before.

12.2 Neural Graph Collaborative Filtering


The conventional collaborative filtering model is based on shallow encoders. There are no
user/item features, and a shallow encoder embeds each user u and item v. The score
function is just the dot product f(u, v) = u · v. This model does not explicitly capture graph
structure; the structure is only implicitly captured in the training objective.
With shallow encoders, only first-order graph structure (edges) is captured. Higher-order
graph structure, such as K-hop paths, is not explicitly captured.
We want a model that explicitly captures higher-order graph structure. GNNs are a natural
approach to achieve this. In Neural Graph Collaborative Filtering (NGCF), the initial
non-graph-aware shallow embeddings are propagated through a GNN to create graph-aware
embeddings.
For every user u and item v, set h^{(0)}_u, h^{(0)}_v to the user's/item's shallow embedding. Then we
iteratively update node embeddings using neighbourhood embeddings:

$$h^{(k+1)}_v := \operatorname{Combine}\big(h^{(k)}_v,\ \operatorname{Aggregate}(\{h^{(k)}_u : u \in N(v)\})\big)$$

$$h^{(k+1)}_u := \operatorname{Combine}\big(h^{(k)}_u,\ \operatorname{Aggregate}(\{h^{(k)}_v : v \in N(u)\})\big)$$

After K rounds of neighbour aggregation, we get the final user/item embeddings
u ← h^{(K)}_u, v ← h^{(K)}_v. The score function is the inner product u^⊺v.

12.3 LightGCN
NGCF jointly learns two kinds of parameters: shallow user/item embeddings, and
parameters for the GNN. The shallow embeddings are already quite expressive: they are learned
for every user and item node, and the total number of parameters in them is O(ND) (where
N is the number of nodes), whereas the number of parameters in the GNN is only O(D²).
The GNN parameters may not be very essential.


We can simplify the GNN used in NGCF. The adjacency and embedding matrices of an
undirected bipartite graph can be expressed as

$$A = \begin{pmatrix} 0 & R \\ R^\top & 0 \end{pmatrix}, \qquad E = \begin{pmatrix} E_U \\ E_V \end{pmatrix}$$

where E_U is the user embedding matrix and E_V the item embedding matrix.
Recall the diffusion matrix of the Correct and Smooth method (Section 19.2). Let D be the
degree matrix of A. Each layer of a GCN's aggregation can be written in the form

$$E^{(k+1)} := \operatorname{ReLU}\big(\tilde{A}\, E^{(k)}\, W^{(k)}\big)$$

where Ã := D^{-1/2} A D^{-1/2} is the normalized adjacency matrix and W^{(k)} is a learnable
linear transformation.

LightGCN simplifies this by removing the ReLU non-linearity. Iterating the layers, we see
that

$$E^{(K)} = \tilde{A} E^{(K-1)} W^{(K-1)} = \tilde{A} \cdots \tilde{A}\, E^{(0)} W^{(0)} \cdots W^{(K-1)} = \tilde{A}^K E^{(0)} \underbrace{\big(W^{(0)} \cdots W^{(K-1)}\big)}_{W}$$

Removing the ReLU significantly simplifies the GCN. The LightGCN algorithm applies
E ← ÃE a total of K times, and each multiplication diffuses the embeddings to their
neighbours. The matrix Ã^K is dense and is not stored in memory; instead, the above
multiplication step is executed K times. We can also consider multi-scale diffusion:

$$E := \sum_{k=0}^{K} \alpha_k E^{(k)} = \sum_{k=0}^{K} \alpha_k \tilde{A}^k E^{(0)}$$

where α₀ E^{(0)} acts as a self connection. For simplicity, LightGCN uses α_k := 1/(k + 1).
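A small NumPy sketch of this multi-scale diffusion (dense matrices for clarity; real deployments keep Ã sparse and never materialise Ã^K):

import numpy as np

def lightgcn_embeddings(A: np.ndarray, E0: np.ndarray, K: int) -> np.ndarray:
    """E = sum_k alpha_k * A_norm^k E0 with alpha_k = 1/(k+1)."""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    A_norm = d_inv_sqrt @ A @ d_inv_sqrt          # normalised adjacency D^{-1/2} A D^{-1/2}
    E = E0.copy()
    out = E0.copy()                               # k = 0 term (alpha_0 = 1)
    for k in range(1, K + 1):
        E = A_norm @ E                            # one diffusion step
        out = out + E / (k + 1)                   # alpha_k = 1/(k+1)
    return out

R = np.array([[1, 0, 1], [0, 1, 0]], dtype=float)                 # 2 users x 3 items
A = np.block([[np.zeros((2, 2)), R], [R.T, np.zeros((3, 3))]])
print(lightgcn_embeddings(A, np.random.randn(5, 8), K=3).shape)   # (5, 8)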

Question: Do we do all the matrix multiplication in memory?


Multiplying big matrices is not a problem. e.g. using MapReduce.
The bottleneck is storing the embeddings.

Question: Can we do SVD to aid with the matrix multiplication?


We can’t do SVD because its cubic. CUR decomposition is fine but its beyond this
course.

Intuitively, the simple diffusion propagation encourages the embeddings of similar users and
items to be similar. LightGCN is similar to GCN and C&S, except that self-loops are not
added and the final embedding is the average of the layer embeddings.
LightGCN performs better than shallow encoders but is also more computationally costly.
The simplification relative to NGCF leads to better performance.


Figure 12.2: “Bed rail” may look like “Garden fence” but they are rarely adjacent in the
graph

Question: Is averaging of embedding across layers a general approach not


limited to recommender systems?
It is more general but it was originally developed for this purpose.

12.4 PinSAGE
The PinSAGE algorithm was developed at Pinterest to recommend pins and is the largest
industry deployment of a GCN. Each pin embedding unifies visual, textual, and graph
information. It works for fresh content and is available a few seconds after pin creation.
The task of pin recommendation is to learn node embeddings z_i such that the distance
between similar pins is smaller than the distance between dissimilar pins, e.g.
d(z_cake1, z_cake2) < d(z_cake1, z_sweater). There are 1B+ repin pairs from the related-pins
surface, which capture semantic relatedness.
Graph has tens of billions of nodes and edges. In addition to the GNN model, PinSAGE
paper introduces several methods to scale the GNN:
• Shared negative samples across users:
Recall that in the BPR loss, for each user u* ∈ Û, we sample one positive item and a set
of negative items V_−. Using more negative samples per user improves
recommendation quality but is also expensive, since the number of computational graphs is
|Û| · |V_−|.
The key idea is that the same set of negative samples can be shared across all users in
the mini-batch. This saves computational cost by a factor of |Û|.

• Hard negative samples:


Industrial recommendation systems need extremely fine-grained predictions. Each
user has 10 to 100 items to recommend while there are billions of items in total. A
random sample from all items contains mostly easy negatives. We need a way to
sample hard negatives to force the model to be fine-grained. Hard negatives are
obtained as follows:

1. Compute personalized page rank (PPR) for user u


Figure 12.3: Examples of easy and hard negatives

2. Sort items in the descending order of their PPR scores


3. Randomly sample item nodes that are ranked high but not too high. e.g. 2000th
to 5000th.

The paper Wu et al. 2017 describes sampling based on distances so the


query-negative distance distribution is about U [0.5, 1.4].

• Curriculum learning:
Make the negative samples gradually harder in the process of training. At the nth
epoch, we add n − 1 hard negative samples. For each user node, the hard negatives
are item nodes that are close but not connected to the user node in the graph.

• Mini-batch training of GNNs:


This will be covered in a future lecture.

13 Relational Deep Learning


A large amount of data is available in the form of relational databases.
e.g.

• Which products will a user purchase?

• Will an active user churn over the next 90 days?

We will see a way to execute predictions on relational data.


Settings
Learn a user churn model based on their sales, purchased products, and browsing
behaviour.

A simple application of deep learning to the user churn model leads to an
impedance mismatch, a set of issues:
impedance mismatch: A set of issues

• Features are chosen arbitrarily


• Only a limited set of data is used

• Issues with point-in-time correctness/information leakage

Excursion: End-To-End Learning

Historically, computer vision (CV) was done with handcrafted features. For
example, the wheels and windows of a car could be detected by dedicated feature detectors.
Modern computer vision mostly uses end-to-end learning: the input data
does not undergo handcrafted feature extraction before being fed to the machine learning model.
We would like to apply deep learning on end-to-end relational database tasks. This has 4
benefits:

• More accurate models (no feature engineering)

• More robust models (model-learned features automatically update)

• Shorter time to models (no mundane ETL work)

• Simpler infrastructure (no pipelines, no feature stores, etc.)

A classical method of machine learning on databases is tabular ML, e.g. decision trees
on single tables. The advantage of deep learning over tabular ML is the ability to
operate on multiple tables.
Another approach is statistical relational learning (SRL), which learns a distribution
over relational structures. Relational deep learning (RDL) is a scalable, expressive inheritor of SRL.

13.1 Relational Database Graph


A database is a graph.
Real-world data are stored in databases. For example, a transaction, a user, or a product
could be a row in a table. A database is a set of tables T = {T₁, …, T_n}, and links
between tables are L ⊆ T_i × T_j. The schema graph represents the allowed connections of
the database graph. Each table T_i is a set of entities v ∈ T_i, each with a primary key and
optional foreign keys. The relational entity graph is defined as the graph where

• the nodes are V := ⋃ T, the union of all tables, and

• the edges are E := {(u, v) ∈ V × V : primary_key(u) ∈ foreign_keys(v)}.

Different from knowledge graphs, each entity can also have features.


Figure 13.1: A database is a graph

Figure 13.2: A schema graph

Figure 13.3: Overview of the approach


Figure 13.4: Training table attached to the main graph

Figure 13.5: Overall pipeline

13.2 Predictive Tasks in Relational Databases


Settings
Consider an example: Predict whether a user is going to churn in the next 30 days?

Most tasks are temporal: the user's label and the database change all the time. To train a
GNN for such a task, we define a training table containing (entityId, time, label). This
could be used for classification, regression, or multi-class categorization tasks. The time
label is essential to temporal prediction tasks: an entity may have different labels at
different times. In the churn example, we create a training table with columns
(user, time, churn), and attach it to the database.
Each node's neighbourhood defines a computational graph. The computation graph for
each node is time-dependent, and so are the message and aggregation steps.

Model AUCROC Labour Hours Lines of Code Training Hours


XGBoost 0.68 - - -
GNN 0.90 1 54.00 1
Expert 0.86 11 682.00 1

Excursion: SQL Join


Given two tables R, S where each entity is a set of values, the join operation defines
the Cartesian product

R ▷◁ S := {r ∪ s : r ∈ R, s ∈ S, r.key = s.key}

GNN’s Aggreagtion and Message passing mechanisms allow them to learn a SQL Join and
Agg operation. GNNs can perform multi-hop reasoning and discover patterns across
multiple samples.
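
The join definition above can be mirrored one-to-one in code; a minimal sketch with hypothetical rows:

```python
# Minimal sketch: the join of two tables (rows as dicts), mirroring
# R ⋈ S = {r ∪ s : r ∈ R, s ∈ S, r.key = s.key}. Names are hypothetical.
R = [{"key": 1, "name": "alice"}, {"key": 2, "name": "bob"}]
S = [{"key": 1, "amount": 30.0}, {"key": 1, "amount": 5.0}]

join = [{**r, **s} for r in R for s in S if r["key"] == s["key"]]
print(join)  # both S-rows with key 1 pair up with alice
```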

13.3 RelBench
The corresponding paper can be found under the name “RelBench”.
RelBench is a collection of databases and tools for evaluating GNNs. We can use this to
compare the efficacy of GNNs against that of an expert data scientist. For the problem “Will
a user be active in the next 6 months?”, the data scientist’s workflow consists of

• 4 h: Exploratory Data Analysis (EDA), observing plots from individual and joined tables

• 0.5 h: Feature ideation, coming up with possibly indicative features

• 5 h: SQL query writing

• 2 h: XGBoost hyperparameter sweep

• 1 h: SHAP (feature importance analysis)
GNNs consistently outperform human experts in tasks in RelBench.

14 Advanced Topics in Graph Neural Networks


14.1 PRODIGY: Enabling In-Context Learning Over Graphs
Suppose there are 3 tasks x 7→ f1,2,3 (x) and we want to train a machine learning model for
them. There are a few paradigms:
1. We can train one model g1 , g2 , g3 for each task f1 , f2 , f3
2. We can train a common model (head) for all 3 tasks, and fine-tune a tail model for
each task. i.e.
fi (x) ≃ hi (g(x))
where hi and g are learnable.


Figure 14.1: Example of Node Prediction for In-Context Learning

Figure 14.2: Example of Link Prediction for In-Context Learning

3. We can train a model which can perform different and diverse tasks via
in-context learning: The model can perform a task using a description for the task:
fi (x) ≃ g(x, i)

The power of in-context learning is few-shot prompting: Prompting the pre-trained model
with only a few examples is sufficient for the model to run other tasks. This is common for
large language models (LLMs). Note that during few-shot prompting, the model’s
gradients are not updated.
Performing in-context learning on graphs is more difficult than text. The main problems
are
1. How to represent the few-shot prompt for different graph tasks in the same input
format, so that it can be consumed by one shared model?
2. How to pretrain a model that can solve any task in this format?
PRODIGY is a method of in-context learning for graphs. To solve the first problem, we
use prompt graphs, which are meta hierarchical graphs.
Consider the task of link prediction. Suppose we have prompt graphs Gi := (Vi , Ei ), where
each Vi has two specially labeled nodes si , ti and the ground truth label yi for the edge
(si , ti ) and we want to predict the link (s0 , t0 ) on G0 := (V0 , E0 ). PRODIGY creates the
prompt graph with

V := {yi : i} ∪ {vi : i} ∪ V0 ∪ ⋃i Vi

E := {(vi , yi ), (vi , Gi ) : i} ∪ E0 ∪ ⋃i Ei

where the label nodes yi and example nodes vi form the task graph, and V0 together with the Vi form the data graph.


Figure 14.3: Example of Graph Prediction for In-Context Learning

Figure 14.4: A Prompt Graph for link prediction

The task of the GNN is to predict the link (v0 , y) over possible labels y. This converts a
link/node/graph level prediction task into a link-level prediction task with a consistent
graph format.
The task graph contains hierarchical edges (vi , Gi ). This could be processed by having one
GNN for each prompt example.
To pre-train PRODIGY, we generate data in the PromptGraph format and pretrain our
model over them. There are two ways to do this:
1. Neighbour Matching: Train a task to classify which neighborhood a node is in,
where each neighborhood is defined by other nodes in it
2. Multi-Task: Train the model with multiple task data combined into PromptGraph
data

Figure 14.5: Message passing over hierarchical edges


Figure 14.6: Neighbour Matching

Text and image features could be incorporated using language model embeddings or CNNs.

14.2 Conformal GNNs


In high-stake settings, errors from neural networks have a huge cost. Can we produce a
measure of the uncertainty in a model?

1. How to evaluate if an uncertainty estimation method is good?

2. How to produce uncertainty estimates with reliability guarantees?

3. How to produce reliable uncertainty estimates for graphs?

If we can quantify uncertainty on GNNs, we could stop trusting their results when
uncertainty is high. Instead of having a single output, the model can

1. (Point Prediction) Output a range of predictions C(xn+1 ) instead of a point ŷn+1

2. (Classification) Output a set {1, 2} instead of one class 2

3. (Regression) Output an interval [a, b] instead of a point y

Settings
We want to construct provable prediction sets with confidence α < 1 from test data
Dtest .

Set outputs enable a rigorous notion of reliability. The coverage of a prediction set is

Coverage := (1 / |Dtest|) ∑_{i ∈ Dtest} 1(Yi ∈ C(Xi))

With set predictions, we could produce predictions with provable reliability.


Figure 14.7: Quantile Computation

Figure 14.8: Prediction Set Construction

Consider a categorical prediction task where each category y has confidence µ̂(x)y . This is
often based on softmax scores. We define the non-conformity score function
V : X × Y → R to be
V (x, y) = 1 − µ̂(x)y
When the softmax score is high, the non-conformity score is low, and the model is more
confident. We calibrate the model over many data points (xi , yi ) and take the 1 − α (where
α is a prescribed confidence level) quantile

η̂ := quantile_i (V (xi , yi ); (1 − α)(1 + 1/n))

For each predicted point Ŷn+1 , we can then construct a set based on this interval:

C(Xn+1 ) := {y ∈ Y : V (xn+1 , y) ≤ η̂}

Similarly, for regression tasks x 7→ y, we can construct a heuristic uncertainty


[µ̂α/2 (x), µ̂1−α/2 (x)] and compute the non-conformity score

V (x, y) := max(µ̂α/2 (x) − y, y − µ̂1−α/2 (x))

and the prediction interval is

C(xn+1 ) = [µ̂α/2 (xn+1 ) − η̂, µ̂1−α/2 (xn+1 ) + η̂]
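
A minimal sketch of the split conformal procedure above for classification; the model µ̂ and the calibration data are hypothetical stand-ins:

```python
import numpy as np

# Minimal sketch of split conformal prediction for classification,
# using the score V(x, y) = 1 - predicted probability of class y.
rng = np.random.default_rng(0)

def mu_hat(x):                       # hypothetical 3-class probabilistic model
    logits = np.stack([x, -x, x**2], axis=-1)
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

alpha = 0.1
x_cal, y_cal = rng.normal(size=100), rng.integers(0, 3, size=100)
probs = mu_hat(x_cal)
scores = 1.0 - probs[np.arange(len(y_cal)), y_cal]     # non-conformity scores
n = len(scores)
q_level = min(1.0, (1 - alpha) * (1 + 1 / n))          # finite-sample correction
eta = np.quantile(scores, q_level)                     # calibrated threshold

x_new = rng.normal(size=5)
pred_sets = [np.where(1.0 - p <= eta)[0] for p in mu_hat(x_new)]
print(eta, pred_sets)   # each set covers the true label with prob. >= 1 - alpha
```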

Theorem 14.1 ((Vovk, Gammerman, and Saunders, 1999)). Given exchangeability


between calibration set (xi , yi )ni=1 and xn+1 , yn+1 ,

P (yn+1 ∈ C(xn+1 )) ≥ 1 − α


where exchangeability is defined as follows: for any permutation π of {1, . . . , n + 1} and any
values (a1 , . . . , an+1 ), it holds that

P ((zπ(1) , . . . , zπ(n+1) ) = (a1 , . . . , an+1 )) = P ((z1 , . . . , zn+1 ) = (a1 , . . . , an+1 ))

where zi := (xi , yi ).

Exchangeability is difficult to satisfy for graph data: there is dependency between test and
calibration nodes (i.e. they are not IID), and message passing during training includes
calibration and test nodes.

Theorem 14.2 ((Huang, Jin, Candès, Leskovec)). In transductive node-level prediction


problem under random data split, graph exchangeability holds given permutation invariance.

Having coverage is not enough. We also need to ensure the coverage interval is efficient. An
infinitely large interval has 100% coverage but is useless. Inefficiency is defined as

Inefficiency := (1 / |Dtest|) ∑_{i ∈ Dtest} |C(xi)|

GNN prediction scores are not optimized for conformal efficiency. We can design a loss
function which approximates this metric. The prediction set size proxy is

Lset := (1/N) ∑_{i ∈ Vct} ∑_{y ∈ Y} σ((η̂ − V (xi , y))/τ )

The prediction interval length proxy is

Linterval := (1/N) ∑_{i ∈ Vct} [(µ̃1−α/2 (xi ) + η̂) − (µ̃α/2 (xi ) − η̂)]

Based on Theorem 14.2, there is still coverage guarantee.

14.3 Robustness
Deep convolutional neural networks are vulnerable to adversarial attacks.

Excursion: Adversarial Attacks


Imperceptible changes can be added to neural network input to change the prediction.


Adversarial examples are also reported in natural language processing and audio processing.
The existence of adversarial examples prevents the reliable deployment of deep learning
models to the real world. Adversaries may try to actively interfere with the neural networks.
Deep learning models are often not robust.

How robust are GNNs?


Settings
• Task: Semi-supervised node classification

• Model: GCN

• Target node t ∈ V : Node whose label prediction we want to change

• Attacker nodes S ⊆ V : Nodes the attacker can change


The attacker can modify node feature, add connections, or remove connections.

• The attacker has access to A (the adjacency matrix), X (the feature matrix),
Y (the label matrix), and the learning algorithm.

• The attacker can modify (A, X) to (A′ , X ′ ) with the assumption (A′ , X ′ ) ≃
(A, X). The manipulation is unnoticeably small.

• θ (resp. θ ′ ) is the model parameter learned over (A, X, Y ) (resp. (A′ , X ′ , Y ))

• cv (resp. c′v ) is the class label of node v predicted by GCN with parameters θ
(resp. θ ′ ).

• The attacker wants to make c′v ̸= cv , i.e.


The change of prediction on the target node v is

∆(v; A′ , X ′ ) := log fθ′ (A′ , X ′ )v,c′v − log fθ′ (A′ , X ′ )v,cv

where the first term is the predicted log probability of the newly predicted class (which the
attacker wants to increase) and the second is the predicted log probability of the originally
predicted class (which the attacker wants to decrease).


• The optimisation objective of the attacker is


arg max_{A′ ,X ′} ∆(v; A′ , X ′ ) subject to (A′ , X ′ ) ≃ (A, X)

i.e. maximise the change of the target node’s label prediction, subject to the graph
manipulation being small.

Question: Why we assume only the adjacency matrix and feature matrix
can be changed, but not the labels?
If the attacker can change the label the adversary problem becomes very easy.

Question: Does the attacker need to know the structure of the model itself?
Yes.

Attack possibilities:

• Direct attack: Node is the target S = {t}

• Indirect attack: The target node is not in the attacker nodes: t ̸∈ S

Optimising the objective function ∆(v; A′ , X ′ ) is challenging since A′ is a discrete object


and for every modified graph (A′ , X ′ ) the GCN needs to be retrained. The solution [Zügner
et al. KDD 2018] is to follow an iterative, locally optimal strategy (sketched in code after the two steps below):

1. Sequentially manipulate the most promising element/entry from A

2. Pick one which obtains the highest difference in the log-probabilities indicated by the
score function
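
A rough sketch of this greedy loop (not the actual implementation of Zügner et al.; the score function, the candidate set of entries, and any model retraining or surrogate model are assumptions supplied from outside):

```python
import numpy as np

# Schematic sketch of the iterative locally-optimal attack loop: greedily
# flip the single adjacency entry that maximises the attacker's objective.
# `score(A, X, v)` stands in for log f(A, X)_{v, c'} - log f(A, X)_{v, c}
# computed with a (surrogate) GCN; A is a dense numpy adjacency matrix.
def greedy_structure_attack(A, X, v, score, budget, candidates):
    A = A.copy()
    for _ in range(budget):
        best_gain, best_edge = -np.inf, None
        for (i, j) in candidates:                 # allowed entries to flip
            A[i, j] = A[j, i] = 1 - A[i, j]       # tentatively flip the edge
            gain = score(A, X, v)
            A[i, j] = A[j, i] = 1 - A[i, j]       # undo the flip
            if gain > best_gain:
                best_gain, best_edge = gain, (i, j)
        i, j = best_edge
        A[i, j] = A[j, i] = 1 - A[i, j]           # commit the best single flip
    return A
```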

GCN is not robust to direct adversarial attacks but it is somewhat robust to indirect and
random attacks.

Question: Has there been work of using Reinforcement Learning to train


the attacker to attack a black box?
Yes, in this work they assume the overall architecture of the model but not its weights.

15 Foundation Models for Knowledge Graphs


A foundation model is a general purpose machine learning model for multiple tasks.
Foundation models are pre-trained on a large amount of data and can be used as zero-shot
or few-shot learners. Foundation models have seen applications in natural language
processing, vision, biology, and other fields.
Can we develop a foundation model for knowledge graphs? Recall that PRODIGY (14.1)
was a unified framework to learn graph tasks of multiple levels and is agnostic to the graph


Figure 15.1: Two Knowledge Graphs with the same relations

type. To construct a foundation model for knowledge graphs, we need to train a model on
large amounts of data in an unsupervised fashion.
Settings

Suppose we have a knowledge graph G := (V, E) where the set of all relations is R.
The edges are triples (h, r, t), where head h is related via r to tail t. Suppose that
nodes and edges are associated with shallow embeddings (not learning a GNN yet).
Given a triple (h, r, t), the goal is that, in the embedding space, (h, r) should have an
embedding similar to that of t.

In comparison with natural language tasks, each token in a language corpus is a node of
the same type. The vocabulary is homogeneous. Knowledge graphs have heterogeneous
vocabulary. In natural language tasks, the words are connected sequentially, which is not
true for knowledge graphs.
There are two essential tasks for Knowledge Graphs

• Transductive Link Prediction:


Given an enormous KG, can we complete the KG? i.e. For a given (h, r), we predict
the targets t.
Recall that structure agnostic embeddings learn shallow embeddings for entities and
relations. KG embedding methods can only be used on a single KG. If a new entity is
introduced to the knowledge graph, we need to re-train the shallow encoder.
Observe that a KG is a heterogeneous graph. Since a KG has multiple relation types,
we can use a different neural network for each relation type. This is the Relational
GCN (Section 9.2).

• Inductive Link Prediction: What if I want to transfer knowledge from one KG to


another?
Relational GCN fails at Inductive Link Prediction since it cannot transfer knowledge
between KGs with different relation types.


Figure 15.2: Working principle of Ingram

15.1 InGram: Inductive KG Embedding via Relation Graphs


To transfer knowledge between knowledge graphs, we can build double equivariant models.
Recall that for inductive link prediction, structure-agnostic shallow encoders moved the
structure of the KG from the loss function into the model. This allows
the model to learn embeddings for new entities. We can do the same for relations.
InGram Model operates on the relation graph. Define
Eh [h, r] := 1(Entity h is head in relation r)
Et [t, r] := 1(Entity t is tail in relation r)
Dh [h, h] := Degree of entity h as head
Dt [t, t] := Degree of entity t as tail
and the adjacency matrix of the relation graph to be

A_R := E_h^⊺ D_h^{−2} E_h + E_t^⊺ D_t^{−2} E_t ∈ R^{|R|×|R|}

AR is a weighted adjacency matrix. In practice we are looking for entities that serve as
head and/or tail for relation pairs. A GNN can then act on this relation graph and
generate embeddings for each relation.
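
A minimal NumPy sketch of assembling this relation-graph adjacency from a toy list of triples (entity and relation indices are hypothetical; a small guard against zero degrees is added):

```python
import numpy as np

# Minimal sketch: A_R = E_h^T D_h^{-2} E_h + E_t^T D_t^{-2} E_t
triples = [(0, 0, 1), (0, 1, 2), (1, 0, 2)]      # (head, relation, tail)
n_ent, n_rel = 3, 2

E_h = np.zeros((n_ent, n_rel)); E_t = np.zeros((n_ent, n_rel))
for h, r, t in triples:
    E_h[h, r] = 1.0                              # entity h is head in relation r
    E_t[t, r] = 1.0                              # entity t is tail in relation r

d_h = np.maximum(E_h.sum(axis=1), 1.0)           # degree of entity as head (guarded)
d_t = np.maximum(E_t.sum(axis=1), 1.0)           # degree of entity as tail (guarded)
A_R = E_h.T @ np.diag(d_h ** -2) @ E_h + E_t.T @ np.diag(d_t ** -2) @ E_t
print(A_R.shape)                                 # (|R|, |R|) weighted relation graph
```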
InGram is similar to the transformer architecture: Transformer learns a fully connected
graph between tokens. InGram learns a similar graph between all relations.
The next step is entity-level message passing on the original (V, E) knowledge graph. We
use an attention mechanism similar to the one in Section 4.1:

h_v^{*(l+1)} := σ( Σ_{r∈R} Σ_{u∈N_r(v)} ( W_1^{(l)} h_u^{(l)} + W_0^{(l)} h_v^{(l)} ) )

h_v^{†(l+1)} := σ( Σ_{r∈R} Σ_{u∈N_r(v)} α_{urv} ( W_1^{(l)} h_u^{(l)} + W_0^{(l)} h_v^{(l)} ) )

h_v^{(l+1)} := h_v^{*(l+1)} + h_v^{†(l+1)}

where the α_{urv} are attention coefficients.


Figure 15.3: Ingram scoring function

Figure 15.4: Transductive Link Prediction Performance of Ingram

The entity update depends on the relation update.


The final step is to perform link prediction on the original knowledge graph. We generate
the score

f (h, r, t) := z_h^⊺ Diag(W w_r) z_t

where z_h, z_t are the embeddings of h and t and w_r is the embedding of relation r, with
the loss function

L := Σ_{(h,r,t)} Σ_{(h°,r,t°)} max(0, γ − f (h, r, t) + f (h°, r, t°))

a hinge loss with margin γ, summed over positive examples (h, r, t) and negative examples (h°, r, t°).

InGram can be used for inductive link prediction across new entities and new relations. It
could be trained into a foundation model.


Figure 15.5: Inductive Link Prediction Performance of Ingram

16 Deep Generative Models for Graphs


We want to generate realistic graphs using graph generative models, which we expect to be
similar to real graphs. Its applications include

• Drug discovery, material design

• Social network modeling

We study graph generation because

• Insights: Understand the formulation of graphs

• Predictions: Predict how will the graph further evolve

• Simulations: Generate novel graph instances

• Anomaly detection: Detect if a graph is normal/abnormal

History of graph generation

1. Properties of real-world graphs: A successful Graph Generative Model should fit


these properties.

2. Traditional graph generative models: Each come with different assumptions on


the formulation process

3. Deep graph generative models: Learn the graph formulation process from data

In this lecture we will cover Deep Graph decoders: A model which produces a graph
structure from an embedding.


16.1 Machine Learning for Graph Generation


Two types of generation tasks:
• Realistic Graph Generation: Generate graphs that are similar to a given set of
graphs.
• Goal-directed graph generation: Generate graphs that optimize an objective or
constraint.
A Graph Generative Model uses the data distribution pdata (G) and learns the
distribution pmodel (G) from which samples can be generated.

Excursion: Basics of Generative Models


Assume we want to learn a generative model from data points {xi }.

• The data distribution pdata (x) is not known to us, but we have samples xi ∼
pdata (x).

• The model pmodel (x|θ) is parameterised by θ and can be tractably sampled. We


wish to make pmodel (x|θ) and pdata (x) similar.

To bring pmodel (x|θ) close to pdata (x), we use the principle of maximum likelihood, where
we find θ such that the likelihood of drawing xi ∼ pmodel (x|θ) is the greatest.

θ∗ := arg max_θ E_{x∼p_data} [log pmodel (x|θ)] ≈ arg max_θ Σ_i log pmodel (xi |θ)

To sample from pmodel (x|θ), there are a couple of approaches. The most common
approach is to sample from a noise latent distribution z ∼ N (0 , I ), and transform the
noise with a function f to obtain x := f (z|θ). The distribution of x is the pushforward
of N (0 , I ). In deep generative models, f (·) is a neural network.
To generate a sequence, we can use auto-regressive models, where the chain rule is
used to factor the joint distribution of x1 , . . . , xn :

pmodel (x; θ) = ∏_{t=1}^{n} pmodel (xt |x1 , . . . , xt−1 ; θ)

16.2 Generating Realistic Graphs


A graph can be generated by successively adding nodes and edges. This is the core idea of
GraphRNN. For a graph with node ordering π, we have the sequence (Siπ )ni=1 . The
sequence S π has two levels:
• Node-level generation: Add nodes, one at a time
• Edge-level generation: Add edges between existing nodes.
For example, S_i^π = [S_{i,1}^π , . . . , S_{i,m}^π ], where each S_{i,j}^π indicates whether an edge exists
between nodes i and j.


Figure 16.1: Generation process of a graph

Figure 16.2: Generating a graph using GraphRNN is plotting its adjacency matrix column
by column

A graph and a node ordering is a sequence of sequences. The node ordering is randomly
selected. This generates an adjacency matrix column by column. One drawback of this
approach is the huge modeling capacity required to generate longer and longer columns. We
have transformed a graph generation problem into a sequence generation problem.

Excursion: Recurrent Neural Networks


Recurrent Neural Networks(RNN) is a neural network designed for sequential
data. An RNN sequentially takes an input sequence to update its hidden states. The
hidden states are passed between time steps of the RNN and the update is conducted
via RNN cells.
[Diagram: an RNN unrolled over two time steps. The cell maps (s0 , x1 ) to (y1 , s1 ), then (s1 , x2 ) to (y2 , s2 ).]

An RNN cell takes inputs st−1 (previous hidden state) and xt , and outputs yt and st
(next hidden state). It is defined as

st := σ(W · xt + U · st−1 ), yt := V · st


More expressive RNN cells such as GRU and LSTM have been developed to combat
vanishing gradient problems.
An RNN can be used to generate sequences by feeding back xt+1 := yt (i.e. using
the previous output as input). The sequence is initialised by a special SOS (start of
sequence) symbol and EOS (end of sequence) symbol is emitted (as an extra RNN
output) to signal halting the generation process.
An RNN modeled in this fashion is completely deterministic. To introduce stochastic-
ity in the model, each yi = pmodel (xt |x1 , . . . , xt−1 ; θ) could be a categorical probability
distribution and we can sample xi+1 ∼ yi .
A sequence generation RNN can be trained using teacher forcing, where the inputs
to the RNN is forced to be the ground truth sequence and the loss is computed between
the (shifted) ground truth sequence and RNN output.
RNNs are trained using Backpropagation Through Time (BPTT) which accu-
mulates gradients across time steps.
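
A minimal NumPy sketch of the vanilla RNN cell described above (shapes and the tanh nonlinearity are assumptions):

```python
import numpy as np

# Minimal sketch of the RNN cell: s_t = sigma(W x_t + U s_{t-1}), y_t = V s_t.
rng = np.random.default_rng(0)
d_in, d_hid, d_out = 4, 8, 3
W = rng.normal(size=(d_hid, d_in))
U = rng.normal(size=(d_hid, d_hid))
V = rng.normal(size=(d_out, d_hid))

def rnn_step(x_t, s_prev):
    s_t = np.tanh(W @ x_t + U @ s_prev)   # hidden-state update
    y_t = V @ s_t                         # output at this time step
    return y_t, s_t

s = np.zeros(d_hid)                       # s_0
for x_t in rng.normal(size=(5, d_in)):    # unroll over a length-5 input sequence
    y, s = rnn_step(x_t, s)
```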

GraphRNN has a node-level RNN and an edge-level RNN. Node-level RNN generates the
initial state for edge-level RNN, and edge-level RNN sequentially predict if the new node
will connect to each of the previous nodes. The edge-level RNN at node-level step i
outputs a binary label ŷj of whether nodes i, j have an edge, and it is trained using binary
cross-entropy
L := −(ŷj log(yj ) + (1 − ŷj ) log(1 − yj ))
Question: Could we have vanishing gradients in GraphRNN?
If the generation sequence is too long this could be a problem.


Question: Does the generation process depend on the ordering of the


nodes?
Yes.

Question: If the graph grows in multiple directions, how can the model
build one part of the graph and then another part of the graph?

(#610)
Not sure if you have a specific context in mind but one way to mitigate forgetting
other branches of a graph could be to use attention networks. It would definitely help
in improving the forgetting part and depending on the problem, you can tweak the
network to be more useful for any particular kind of forgetting you are facing.

Question: Sequence generation order agnostic supervision, would this pe-


nalise the model unnecessarily? Could we feed in stochastic inputs when
generating cliques as teacher forcing?

Some models can generate larger graph structures (e.g. a clique) at a time and this
mitigates some of the issue.

16.3 Scaling Up and Evaluating Graph Generation


The generation steps of GraphRNN could become intractable since any node can connect
to any prior node. This is too many steps for edge generation and requires complex long-
range dependencies. To mitigate this issue, we can use Breadth-First Search (BFS) node ordering.
The nodes are ordered using a BFS algorithm. If the algorithm is concerned with the
generation of two nodes i < j and no edges to j were produced at step i, at step j there is
no need to consider an edge to node i. Using BFS ordering, the number of “look-back”
steps is reduced from n − 1 to maxv deg v.

Question: Why could 5 not have link to 1?


If 5 is connected to 1, in the BFS ordering 5 would be traversed before 4, so it would
be 4 instead.

Question: Could you have edge RNN generate stop token?


Yes.

16.4 Graph Convolutional Policy Network


Sometimes we would like to generate graphs that:

• Optimise a given objective (High-score): e.g. drug-likeness


Figure 16.3: Ordering the nodes using BFS reduces the number of memory steps required
to generate the graph

Figure 16.4: Example of generating a graph using BFS node ordering


Figure 16.5: Overview of GCPN

This introduces a black-box to graph generation. Objectives like drug-likeness are


governed by physical laws which are assumed to be unknown to us.

• Obey underlying rules (Valid): e.g. Chemical validity

• Are learned from examples (Realistic): Imitates a molecule graph dataset.

Excursion: Reinforcement Learning

In Reinforcement Learning (RL), a ML agent observes the environment and takes


action to interact with the environment. The agent receives positive or negative reward.
The ML agent is trained from this loop. The environment is a black box to the agent
but the agent can directly learn from it.

Graph Convolutional Policy Network (GCPN) combines graph representation and


reinforcement learning. GCPN generates graphs sequentially like GraphRNN but has some
differences:

• GCPN uses GNN to predict the generation action (more expressive, but takes longer
time to compute)

• GCPN uses RL to direct graph generation

Steps in GCPN:

(a) Insert nodes

(b,c) Use GNN to predict which nodes to connect

(d) Take an action

(e,f) Compute reward

• Step Reward: Learn to take valid action


• Final Reward: Optimise desired properties

GCPN is trained in two parts:

1. Supervised training: Train policy by imitating the action given by real observed
graphs.

2. RL training: Train policy to optimise rewards using a policy gradient algorithm.


Figure 16.6: Training process of GCPN

Question: Can graph completion task delete edges?


Not sure how this could be useful.

17 Geometric Graph Learning


Settings

A graph G = (A, S) is a set V of n nodes connected by edges. Each node has scalar
attributes (e.g. atom type). A is the adjacency matrix and S ∈ R^{|V|×f} is the matrix of node attributes.
A geometric graph is a graph G = (A, S, R), where each node is embedded in d-
dimensional Euclidean space, i.e. R ∈ R|V |×d .

Molecules can be represented as a graph G with node features si (atom type, charges) and
edge features ai,j (valence bond type). Sometimes we also know the 3D positions of each
node r i .
Geometric graphs lead to a variety of GNN models: Geometric GNN, Geometric
Generative model.
To design a GNN which processes geometric graphs, we need to overcome some obstacles.
A change of the coördinate system used to describe the graph geometry transforms the node coördinates.
The output of a traditional GNN will be affected by this transformation, so we would like the
GNN to be aware of the symmetries of the coördinate system.
A function F : X → Y is

• Equivariant if it commutes with a rigid transformation. i.e. for a transformation ρ


it satisfies F ◦ ρX = ρY ◦ F .

• Invariant if F ◦ ρX = F .


For example, force on a molecule is equivariant, and energy in a molecule is invariant.


For geometric graphs, we consider the group of 3D special Euclidean symmetries SE(3),
which consists of rotations and translations.
For ML models with no explicit handling of symmetry, we can use data augmentation:
Create more training data by randomly generating rigidly transformed original data. This
is expensive. An alternative is to develop a GNN which respects the symmetry in the
input. This substantially shrinks the search space of the training process.
There are two classes of Geometric GNNs:
• Invariant GNNs for learning invariant scalar features
• Equivariant GNNs for learning equivariant tensor features

17.1 Invariant GNNs


Continuous-Filter Convolutions refers to convolutions executed on a continuous grid.
Such convolution filters may be useful when the continuous position of the irregular grid
becomes important, e.g. positions of atoms in a molecule. A continuous-filter convolution
on the features H^{(l)} is

h_i^{(l+1)} := (H^{(l)} ∗ W^{(l)})_i = Σ_j h_j^{(l)} ⊙ W^{(l)}(r_i − r_j)

where ⊙ denotes element-wise multiplication, W^{(l)} : R^d → R^f is the filter-generating
function, and r_i − r_j is the relative position.
SchNet is a class of invariant GNNs in which the filter-generating function W^{(l)} is a learnable
function (MLP) of the radial basis functions

e_k(r_i − r_j) := exp(−γ (d_{i,j} − µ_k)^2)

where d_{i,j} := ∥r_i − r_j∥ is the interatomic distance and the µ_k are the radial basis centres.
This is to curb initial training difficulties (see the SchNet paper). SchNet makes W invariant
by scalarising relative positions. This gives up the angular information in the edges.
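
A minimal sketch of the radial basis expansion above; γ and the centres µ_k are hyperparameters with hypothetical values:

```python
import numpy as np

# Minimal sketch of the radial basis featurisation of interatomic distances.
def rbf_expand(r_i, r_j, mu=np.linspace(0.0, 5.0, 32), gamma=10.0):
    d_ij = np.linalg.norm(r_i - r_j)            # invariant: depends only on distance
    return np.exp(-gamma * (d_ij - mu) ** 2)    # one feature per basis centre

print(rbf_expand(np.array([0.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0])).shape)  # (32,)
```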
DimeNet improves upon SchNet by using both distance and angle features during
message passing.
Still, distances/angles are incomplete descriptors for uniquely identifying geometric
structure (e.g. cis-1,2-dichloroethene and trans-1,2-dichloroethene).
Limitations of invariant GNNs: You have to guarantee that your input features already
contain any necessary equivariant interactions.

17.2 Equivariant GNNs


PaiNN is a class of equivariant GNNs. PaiNN still takes learnable weights W conditioned
on the relative distance but each node has two features, a scalar feature si and vector
feature v i . They are defined by
s_i^{(0)} := e_i (the atom embedding),    s_i^{(l+1)} := s_i^{(l)} + Δs_i^{(l)}

v_i^{(0)} := 0,    v_i^{(l+1)} := v_i^{(l)} + Δv_i^{(l)}


Figure 17.1: A pair of molecules exhibiting geometric isomerism, having identical scalar
quantities (distance and angle) but are distinguished by directional (normal) and geometric
information.

Figure 17.2: Molecular conformation generation

where the updates ∆si , ∆v i are defined by


Δs_i^{(l)} := (φ_s(s^{(l)}) ∗ W_s)_i = Σ_j φ_s(s_j^{(l)}) ⊙ W_s(∥r_{i,j}∥)

Δv_i^{(l)} := Σ_j v_j^{(l)} ⊙ φ_{vv}(s_j^{(l)}) ⊙ W_{vv}(∥r_{i,j}∥) + Σ_j φ_{vs}(s_j^{(l)}) ⊙ W_{vs}(∥r_{i,j}∥) · (r_{i,j} / ∥r_{i,j}∥)

where ϕ, W are neural networks. This passes invariant scalar messages and equivariant
vector messages through each layer, thus keeping the equivariant properties.

17.3 Geometric Generative Models


Geometric Generative Models have application in molecule/protein design, biomolecule
structure prediction, and protein-molecule interaction simulation.
The problem of Molecular Conformation Generation is to generate stable
conformations from a molecule graph. The model converts a molecular graph (atom-bond
graph) G to a conformation C with 3D coördinates. C follows the Boltzmann distribution
p(C ) ∝ exp(−E(C )/T ).


Figure 17.3: Geometric Diffusion

Excursion: Diffusion Models


A forward diffusion process destroys data by gradually adding noise and
diffusion models learn to reverse this noise i.e. denoise.
Let t1 = 0, . . . , tn = T be time steps. Corrupt the data x0 by adding noise: xt :=
µt x0 + σt ϵ. The diffusion model is trained to reverse every step of this process.

Geometric Diffusion (GeoDiff) samples by an equivariant denoising procedure. The


noise generation process is a distribution q(C (t) |C (t−1) ) and GeoDiff learns its inverse
pθ (C (t−1) |G, C (t) ). The denoising network is parameterized by an equivariant GNN.


18 Fast Neural Subgraph Matching and Counting


Subgraphs are the building blocks of networks. They have the power to discriminate and
characterize networks. For example, functional groups are the building blocks of organic
molecules.

18.1 Subgraphs and Motifs


Settings

Let G = (V, E) be a graph.

• A node-induced subgraph G′ = (V ′ , E ′ ) is a subgraph such that V ′ ⊆ V and


E ′ = {(u, v) ∈ E : u, v ∈ V ′ }.

• An edge-induced subgraph G′ = (V ′ , E ′ ) is a subgraph such that E ′ ⊆ E and


V ′ = {v ∈ V : ∃u.(v, u) ∈ E ′ }

Question: How do we define the boundary of a subgraph?


Right now we are not worried about the boundary of a subgraph. We will expand on
the definition later.

How do we express the statement “G1 ’s topology is contained in G2 ”? We can use


isomorphism. Two graphs G1 = (V1 , E1 ), G2 = (V2 , E2 ) are isomorphic if there is a
bijection f : V1 → V2 such that

(u, v) ∈ E1 ⇐⇒ (f (u), f (v)) ∈ E2

f is the (graph) isomorphism. Finding such a mapping is a computational problem that is
not known to be NP-hard, but for which no polynomial-time algorithm is known either.
G2 is subgraph-isomorphic to G1 if some (node or edge induced) subgraph of G2 is
isomorphic to G1 . We also commonly say G1 is a subgraph of G2 . Determining subgraph
isomorphism is NP-hard.
A network motif is a recurring, significant pattern of interconnections.

• Pattern: Small (node-induced) subgraph

• Recurring: Found many times


How to define frequent?

• Significant: More frequent than expected than e.g. randomly generated graphs
What is the distribution of such random graphs?


Figure 18.1: Common motifs in graphs. (a) Feed-Forward Loop: found in networks of neurons, where they neutralize “biological noise”. (b) Parallel Loop: found in food webs. (c) Single-Input Module: found in gene control networks.

Motifs help us understand how graphs work and make prediction based on presence or lack
of presence in a graph dataset.

Question: Is it fine for subgraphs to have overlaps during counting?


Yes.

Let GQ be a small graph and GT be a target graph. There are two definitions of frequency
of GQ in GT .

• Graph-level Frequency: The frequency of GQ in GT is the number of unique


subsets of nodes VT for which the subgraph of GT induced by VT is isomorphic to GQ .
Frequency can get very large due to permutations.

• Node-level Frequency: In addition to the above, GQ has an anchor node v ∈ VQ . The
frequency is the number of nodes u ∈ VT for which some subgraph of GT is isomorphic
to GQ via an isomorphism that maps u to v (each such u is counted once).

If the dataset contains multiple graphs, we can treat the dataset as a giant graph GT with
disconnected components corresponding to individual graphs.
To define significance, we need to have a null-model (point of comparison). Subgraphs that
occur in a real network much more often than in a random network have functional
significance.
Methods of generating random graphs:

• An Erdős–Rényi (ER) random graph is a graph Gn,p where n is the number of
nodes and each edge (u, v) appears i.i.d. with probability p.
The distribution of deg v for v ∈ Gn,p is binomial, which can be unrealistic for social
graphs.

• A configuration model graph: Based on a real graph Greal , gather its nodes’ degree
sequence and create nodes with “spokes” corresponding to its degree. Then randomly
pair up nodes. This results in a graph Grand with the same degree sequence as Greal .

• A switching graph: Start from a given graph G and repeat a switching step Q · |E|
times, where Q is large (e.g. 100). The switching step selects a pair of edges
(a, b), (c, d) at random and exchange their endpoints, giving (a, d), (c, b). Exchange
only if no multi-edges or self-edges are generated.


Figure 18.2: Configuration Model

This creates a randomly wired graph with the same node degrees as the original.

Question: If you are rearranging the edges, does it also preserve clus-
tering coëfficient or centrality?
When you generate such a random graph you have to decide what properties not
to preserve. The random switching process destroys the local structure so it has
no reason to preserve clustering coëfficient.

This is the slowest method.

Motifs are over-represented in a real graph compared to a random graph. The number of
motifs in a graph can be measured with statistical tools to evaluate its significance.
We can use statistical methods to evaluate the occurrence significance of a motif. The
Z-score, defined as

Z_i := (N_i^{real} − N̄_i^{rand}) / Std(N_i^{rand}),

measures the significance, where N_i^{real} is the number of occurrences of motif i in the graph
G_real and N_i^{rand} is the (random) number of occurrences of motif i in the random graphs.
The network significance profile (SP) is a vector of normalized Z-scores:

SP_i := Z_i / √(Σ_j Z_j^2)

The SP vector emphasises relative significance of subgraphs. It is important for comparison


of networks of different sizes. Generally, larger graphs display somewhat larger Z-scores.
For each subgraph, the Z-score metric is capable of classifying the subgraph “significance”,
where negative (resp. positive) values indicate under- (resp. over-)representation.
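
A minimal sketch of computing Z-scores and the significance profile from motif counts (the counts are hypothetical):

```python
import numpy as np

# Minimal sketch: Z-scores and significance profile from motif counts in the
# real graph and in an ensemble of random graphs.
def significance_profile(counts_real, counts_rand):
    # counts_real: (n_motifs,)   counts_rand: (n_random_graphs, n_motifs)
    z = (counts_real - counts_rand.mean(axis=0)) / counts_rand.std(axis=0)
    return z / np.sqrt(np.sum(z ** 2))

counts_real = np.array([120.0, 30.0, 5.0])
counts_rand = np.array([[80.0, 35.0, 6.0], [90.0, 33.0, 4.0], [85.0, 36.0, 5.0]])
print(significance_profile(counts_real, counts_rand))
```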
Variations on the motif concepts:

• Extensions: Directed/Undirected, Coloured/Uncoloured, Temporal/Static

• Variations on the concept: Different frequency concepts, significance concepts,


under-representation (anti-motifs), null models.


Figure 18.3: Examples of SP: Networks from the same domain have similar significance
profiles

Question: How do we choose the motif we use the analyse?


Usually choose subgraphs up to a certain size.
Or if there is domain knowledge, we could use the method described in the next section.

18.2 Neural Subgraph Representations


Settings

Subgraph matching: Given a large target graph (can be disconnected) GT and


query graph (connected) GQ , how can we decide if a GQ is a subgraph in GT ?

A GNN can be used to predict subgraph isomorphism. We are going to work with
node-anchored definition of frequency. We wish to generate embedding from the target
anchored neighbourhood and the query graph, and the comparison between the
embeddings lead to a binary label. The intuition of this method is to exploit the geometric
structure of the embedding space to capture properties of subgraph isomorphism.

Question: So far we have only used Euclidean manifold embeddings. Can


the subgraphs be embedded into non-Euclidean manifolds?
Would be cool to investigate. There have been experiments of embeddings in hyper-
bolic space. See. Dr. Chris Ré’s research.

The algorithm first should decompose the input graph GT into neighbourhoods. Then each
neighbourhood is embedded into the embedding space, and comparison is executed on this


Figure 18.4: By comparing the embedding, we find (Query 1) ≤ GT but (Query 2) ̸≤ GT .

embedding space to determine if GQ is isomorphic to a subgraph of the neighbourhood.


The benefits of using anchored node-level frequency definition is that the embedding of
vectors naturally translate to embeddings of anchored subgraphs.
We impose a partial order on the embedding space. We define u, v ∈ Rn to satisfy u ≤ v
if ui ≤ vi for all i. It is trivial to check that ≤ defined on Rn is a partial order, and in
particular it is transitive. If the GNN’s output embeddings are arranged such that the
subgraph isomorphism relation is mapped to ≤ in embedding space, we can quickly
determine subgraph embedding by directly comparing the embeddings. Moreover, we force
the embedding to be non-negative.

1. For each node v in GT , Obtain a k-hop neighbourhood around the anchor v (e.g.
using BFS)

2. Apply (1) to GQ to obtain the neighbourhoods

3. Embed neighbourhoods using a GNN.

Question: Is the embedding of the subgraph the same as the embedding of


the anchor node?
Yes

Question: How do we choose the anchor of the query?


The user chooses anchor i.e. its task specific.


Figure 18.5: Order satisfies transitivity, anti-symmetry, and closure under intersection.

Question: If my query’s diameter is greater than the depth of the GNN,


do we have trouble capturing the query?
Yes.

Question: How do we handle real life cases with big queries?


We can form several decompositions of the network based on the structure of the query.

Question: Can we break a big query into smaller anchored queries and find
those smaller queries in the engine?
This is an interesting problem to research.

How can we design a loss function to ensure GNN learns the ordering? We design loss
functions based on the order constraint:

∀i : zq [i] ≤ zt [i] ⟺ GQ ⊆ GT

where zq is the query embedding, zt the target embedding, and i ranges over the embedding
dimensions. The GNN is trained with a max-margin loss, i.e. the penalty is the square of the
amount of violation of the order constraint:

E(Gq , Gt ) := Σ_{i=1}^{D} max(0, zq [i] − zt [i])^2

To learn such embeddings, we generate training examples (Gq , Gt ) such that Gq ⊂ Gt with
probability 1/2, and we minimize:

L(Gq , Gt ) := E(Gq , Gt ) if Gq ⊆ Gt , and L(Gq , Gt ) := max(0, α − E(Gq , Gt )) if Gq ⊄ Gt ,

where α > 0 is a margin hyperparameter.
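
A minimal sketch of the order-embedding violation E and the max-margin loss above, for given non-negative embeddings:

```python
import numpy as np

# Minimal sketch of the violation E(G_q, G_t) and the max-margin loss.
def violation(z_q, z_t):
    return np.sum(np.maximum(0.0, z_q - z_t) ** 2)     # 0 iff z_q <= z_t coordinate-wise

def max_margin_loss(z_q, z_t, is_subgraph, alpha=1.0):
    E = violation(z_q, z_t)
    return E if is_subgraph else max(0.0, alpha - E)   # push negatives to violate by >= alpha

z_q, z_t = np.array([0.1, 0.2]), np.array([0.3, 0.5])
print(max_margin_loss(z_q, z_t, is_subgraph=True))     # 0.0: order constraint satisfied
```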

Question: Why is the loss capped at α?


Because if the reward for negative examples can be arbitrarily large, the network may
be incentivized to push them infinitely far. In this case no learning can occur.


At each iteration, generate random samples (GT , G_Q^+ , G_Q^− ):

• Positive: Sample induced subgraph GQ ⊆ GT using BFS Sampling

1. Initialize S := {v}, V := ∅
2. Let N (S) be the all neighbours of nodes in S. At every step, sample 10% of the
nodes in N (S) \ V and put them in S. Put the remaining nodes in V .
3. After K steps, take the subgraph of G induced by S anchored at q.

Usually use 3 − 5 layers of BFS. Trades off runtime and performance.

• Negative: Corrupt GQ by adding/removing nodes/edges so it is no longer a subgraph.

Question: What types of accuracy do you expect from this technique?


There will be some improvement from subgraph embedding methods. This is state of
the art for motif counting.

Question: Do you see performance drops in size of the hops?

If you keep query size constant but increase the size of hops (bigger neighbourhoods),
performance could drop.

Question: Can you find bijection between the subgraphs?


This method determines the correspondence of the anchors.

During inference, embed anchored query GQ and target GT graphs. Output whether the
query is a node-anchored subgraph of the target using the predicate E(GQ , GT ) < ϵ.

18.3 Mining Frequent Motifs


Generally, finding the most frequent size-k motifs requires solving two challenges:

• Enumerating all size k subgraphs.

• Counting number of occurrences for each subgraph type.

Just knowing whether a subgraph exists in a graph is a computationally difficult problem.
Traditional motif counting is only feasible for small motif sizes (3 to 7 nodes).
Settings
In the problem of frequent motif mining, we wish to find among all possible graphs
of k nodes, the r graphs with the highest frequency in GT .

The SPMiner algorithm:


Figure 18.6: SPMiner algorithm

1. Decompose input graph GT into neighbourhoods

2. Embed neighbourhoods into an order embedding space


The key benefit of order embedding is that we can quickly predict the frequency of a
given subgraph GQ .

3. Estimate the frequency of each motif GQ by counting the set

S_E := { G_N ⊆ G_T : z_Q ≤ z_N }

where z_Q is the embedding of the motif GQ , the G_N are randomly sampled node-anchored
neighbourhoods of GT , and the region {z : z_Q ≤ z} is the super-graph region of GQ .

4. Execute motif walk, a kind of beam search:

• Start by randomly picking a starting node u in the target graph GT . Set


S := {u}
• Grow a motif by iteratively choosing a neighbour in GT of a node in S and adding
that node to S. Grow the motif S to find larger frequent motifs. At each step,
maximize |SE |.
This maximisation is a greedy procedure and the complement of SE is the
total violation of GQ . The greedy heuristic adds the node that results in the
smallest total violation.
• Terminate upon reaching a desired motif size.

Question: In SPMiner algorithm, what is the goal of the network


Find top k common motifs of a given size. e.g. Out of graphs of size 20, which ones
are the most common.


Figure 18.7: Motif Walk and the supergraph region representing the complement of total
violation

Question: Would exam questions be similar to homework?


We will release some sample exam questions. They will be less involved than the
homework questions. No coding on exam questions.

Question: What is the origin of converting space of graphs to space of


embeddings
That is the innovation of the entire deep learning field.

19 Label Propagation
Given a graph with labels on some nodes, how can we assign labels to other nodes on the
graph? Node embeddings are a partial way of solving this problem.
Settings

Transductive (also called semi-supervised) Node Classification: We have labels Yv :


v ∈ V of some nodes in G = (V, E). We wish to predict the labels of other nodes in
the graph.

Today we discuss an alternative framework: Label propagation. The intuition is that


correlations exist in the networks, and connected nodes tend to share the same label.
Behaviours of nodes are correlated across the links of the network.
There are two explanations of why behaviours of nodes in a network are correlated:

• Homophily: The tendency of individuals to associate and bond with similar others.

• Influence: Social connections can influence the individual characteristics of a person.


We will look at three techniques:


• Label propagation:
Directly propagate known labels to all the nodes

• Correct and smooth:


First define a base predictor, then correct and smooth the predictions with label
propagation

• Masked label prediction:


Construct a self-supervised ML task, let graph ML model to propagate label
information

19.1 Label Propagation


Label Propagation (LP) is the following algorithm: Propagate node labels across the
network. Suppose Yv ∈ {0, 1}. For labeled nodes, initialize Ŷv := Yv , and for unlabeled
nodes v, let Ŷv := 1/2. Then we update all unlabeled nodes until convergence or a maximum
number of iterations is reached:

P^{(t+1)}(Yv = c) = (1 / Σ_{(v,u)∈E} A_{v,u}) Σ_{(v,u)∈E} A_{v,u} P^{(t)}(Yu = c)

where P(Yv = c) is the probability of node v having label c and the A_{v,u} are edge weights.

This is iterated until convergence (ϵ > 0 is a hyperparameter):

|P^{(t)}(Yv ) − P^{(t−1)}(Yv )| ≤ ϵ, ∀v ∈ V
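
A minimal sketch of this label propagation loop on a toy graph (the adjacency matrix and observed labels are hypothetical):

```python
import numpy as np

# Minimal sketch of label propagation for binary labels on a small graph.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
labels = {0: 1, 3: 0}                      # observed labels
n = A.shape[0]

p = np.full(n, 0.5)                        # P(Y_v = 1); 0.5 for unlabeled nodes
for v, y in labels.items():
    p[v] = y

for _ in range(100):
    p_new = (A @ p) / A.sum(axis=1)        # weighted average over neighbours
    for v, y in labels.items():            # keep labeled nodes clamped
        p_new[v] = y
    converged = np.max(np.abs(p_new - p)) <= 1e-6
    p = p_new
    if converged:
        break
print(p)                                   # soft labels for every node
```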

Question: In the homework, there are over-smoothing problems and bipar-


tite graphs may never converge. Does the same issue happen here?
This algorithm also suffers from the non-convergence issue. We may need to add
random noise, but this is a limitation of the algorithm.

Issues:
• Convergence may be very slow and not guaranteed.

• Label propagation does not use node attributes

Question: We say this is fast but slow to converge. How does it compare
with GNN?
Fast here refers to one-step of label propagation. In most cases, it is still faster than
training and applying a deep neural network.
However, inference of GNN is faster than label propagation.


(a) Indicative features

(b) Non-indicative features

Figure 19.1: GNN performed on two graphs with different labels and identical node features
lead to very different performance since GNN does not take the labels into account. The
resulting node embeddings do not have sufficient differentiation power.

19.2 Correct and Smooth


The problem with GNNs is that they can fail catastrophically when the node features are
only mildly predictive.

Question: Why can’t we include the labels in the messages instead of doing
correct and smooth?
The motivation of Correct and Smooth is that its model agnostic, so it can make
prediction with your favourite model. Directly leveraging the label is possible too.

The core idea of Correct and Smooth¹² is that we expect the errors of a base label
predictor to be correlated along edges of the graph, so we should spread such uncertainty
over the graph. We define the diffusion matrix to be

Ã := D^{−1/2} A D^{−1/2}

where D is the diagonal degree matrix with D_{i,i} := deg i.

¹²See Zhu et al. ICML 2013 for details.

Theorem 19.1. All the eigenvalues of à are in the range of [−1, +1], and the maximum
eigenvalue is always 1 with eigenvector D 1/2 1 so the power of à is well-behaved for any K.

Proof.
ÃD 1/2 1 = D −1/2 AD −1/2 D 1/2 1 = D −1/2 A1 = D −1/2 D1 = D 1/2 1

If i, j are connected, then


Ãi,j = 1 / √(di · dj)
which takes into account the connectivity of both i and j.
We also add self edges to adjacency matrices, i.e. Ai,i := 1

1. Train base predictor which predicts soft labels (class probabilities) over all nodes.
Labeled nodes can be used for train/validation data.

2. Apply base predictor to all (including the truth) nodes to obtain soft labels.

3. Correct and Smooth:

• (Correct): The errors of the soft labels are biased; we need to correct for the bias.
Compute the training errors of the nodes,

e_v := p_v − p̂_v if v is labeled, and e_v := 0 if v is unlabeled,

where p_v is the ground-truth one-hot label and p̂_v the predicted soft label. Then,
diffuse the training errors E^{(0)} along the edges,

E^{(t+1)} := (1 − α) E^{(t)} + α Ã E^{(t)}

where α is a hyperparameter. We add the scaled diffused training errors into the soft labels:

P̃ := P̂ + s E^{(T)}

This results in the corrected labels

z_v := p_v if v is labeled, and z_v := p̃_v if v is unlabeled.


• (Smooth): The predicted soft labels may not be smooth over the graph. We
need to smoothen the labels.
Assumption: Neighboring nodes tend to share the same labels.
Diffuse z v along the edges:

Z (t+1) := (1 − α)Z (t) + αÃZ (t)
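
A minimal NumPy sketch of the full correct-and-smooth procedure above; the base predictions P̂, the one-hot labels, and the hyperparameter values are assumed inputs:

```python
import numpy as np

# Minimal sketch of Correct & Smooth on top of any base predictor.
# P_hat: (n, C) soft predictions, Y: (n, C) one-hot labels,
# train_mask: boolean array marking labeled nodes.
def correct_and_smooth(A, P_hat, Y, train_mask, alpha=0.8, s=1.0, T=10):
    A = A + np.eye(A.shape[0])                       # add self-loops
    d = A.sum(axis=1)
    A_tilde = A / np.sqrt(np.outer(d, d))            # D^{-1/2} A D^{-1/2}

    # Correct: diffuse the training errors along the edges
    E = np.zeros_like(P_hat)
    E[train_mask] = Y[train_mask] - P_hat[train_mask]
    for _ in range(T):
        E = (1 - alpha) * E + alpha * (A_tilde @ E)
    Z = P_hat + s * E
    Z[train_mask] = Y[train_mask]                    # keep ground truth where known

    # Smooth: diffuse the corrected labels along the edges
    for _ in range(T):
        Z = (1 - alpha) * Z + alpha * (A_tilde @ Z)
    return Z
```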

Question: What does the hyperparameter α control?


It controls the degree of homophily in the network

Question: Could I use it for graph prediction during training?


No because it needs some ground truth.
However, during training stage of graph prediction, a method like this could be used
to diffuse the features if you do not trust the features, but this is a different paradigm.

Correct and Smooth:

• Significantly improves the performance of the base model.

• Outperforms the smooth-only baseline.

• Can be combined with GNNs

19.3 Masked Label Prediction


This is inspired by BERT objective in NLP. We treat labels as additional features. We
concat the label matrix Y with the node feature matrix X and use the partially observed
labels Ẏ to predict the remaining labels.

• Training: Corrupt Ẏ into Ỹ by randomly masking a portion of the node labels to


0’s. Then use [X, Ỹ ] to predict the masked node labels.

• Inference: Employ all of Ẏ to predict the remaining unlabeled nodes.

Masked Label Prediction is also a self-supervised task like link prediction.

20 Scaling Up GNNs
In modern applications, graphs are very large.

• In recommender systems used by Amazon/Youtube/Pinterest etc., the number of


users is on the order of 10^8 to 10^9, and the number of products/videos is on the order
of 10^7 to 10^9.
Tasks: Recommend items, Classify users/items


• In social networks used by Facebook/Twitter/Instagram etc., the number of users


is on the order of 108 to 109
Tasks: Friend recommendation, User property recommendation

• In academic graphs such as Microsoft Academic Graph


Tasks: Paper categorisation, author collaboration recommendation, citation
recommendation

• In knowledge graphs such as Wikidata or Freebase, there are about 10^8 entities.
Tasks: KG Completion, Reasoning
Why is training GNN difficult?

Excursion: Training Machine Learning Models


The objective of training machine learning models is usually to minimise some averaged
loss:
ℓ(θ) := (1/N) Σ_{i=0}^{N−1} ℓ_i(θ)

We perform stochastic gradient descent by randomly sampling batches B of size M ≪
N , calculate the loss ℓ̂(θ) over B, and update using θ ← θ − α∇ℓ̂(θ).

Naïve full-batch processing iterates the GCN layers on the entire graph,

H^{(k+1)} := σ(Ã H^{(k)} W_k^⊺ + H^{(k)} B_k^⊺),

which is infeasible in large graphs since the memory on a GPU is extremely limited (10-20 GB).
We introduce 3 methods to scaling up GNNs.
• Two methods sample smaller subgraphs:
Neighbour Sampling and Cluster GCN

• One method simplifies GNN into a feature preprocessing operation:


Simplified GCN

20.1 Neighbour Sampling


Recall that GNNs generate node embeddings via neighbour aggregation. To compute the
embedding of a single node n, all we need is the K-hop neighbourhood. We can sample
M ≪ N nodes, construct their K-hop neighbourhoods, and use the resulting computational
graph to train a GNN. The issue with this method is the size of K-hop neighbourhoods can
be exponentially large, especially when it hits a hub node (a node with a high degree).
In Neighbourhood Sampling, each hop only samples at most Hk neighbours at the kth
level. We then use the pruned computational graph to train the GNN. A K-layer GNN will
involve at most ∏_{k=1}^{K} H_k leaf nodes in the computational graph.

Benefits and Drawbacks


Figure 20.1: Neighbourhood Sampling with H = 2

• H is the trade off between aggregation efficiency and variance. A smaller H leads to
more efficient computation but increases the variance which gives more unstable
training results.

• The size of the computation graph is still exponential w.r.t. K. One more GNN layer
makes computation H times more expensive.

The neighbours can be sampled randomly, which is fast but might not be optimal. In
natural graphs. We could use random walks with restarts and sample nodes with the
highest restart scores. This works better in practice.
Overall time complexity: M · H^K.
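
A minimal sketch of neighbour sampling: building the pruned computation graph of one seed node with per-hop fanouts H_k (the adjacency structure is a hypothetical dictionary of neighbour lists):

```python
import random

# Minimal sketch: sample the pruned K-hop computation graph of a seed node,
# keeping at most H_k sampled neighbours at hop k.
def sample_computation_graph(adj, seed, fanouts):       # fanouts = [H_1, ..., H_K]
    layers, frontier = [[seed]], [seed]
    for H_k in fanouts:
        next_frontier = []
        for u in frontier:
            neigh = adj.get(u, [])
            next_frontier += random.sample(neigh, min(H_k, len(neigh)))
        layers.append(next_frontier)
        frontier = next_frontier
    return layers                                        # nodes needed at each hop

adj = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2], 4: [0]}
print(sample_computation_graph(adj, seed=0, fanouts=[2, 2]))
```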

20.2 Cluster GCN


Observe that when nodes share many neighbours, computation of the embeddings of these
neighbours becomes redundant. In a full-batch GNN, only 2|E| messages (two per edge) need
to be computed per layer, and overall only 2K|E| messages are needed in a K-layer GNN.
Layer-wise node embedding update allows the re-use of embeddings from the previous
layer. This is infeasible in large graphs due to limited GPU memory.
In Cluster-GCN, we can sample a small subgraph of the large original graph and perform
efficient layer-wise node embedding updates in the smaller graph. What subgraphs are
good for training GNNs? Since GNNs pass messages along edges, the subgraphs should
retain as much connectivity from the original graph as possible. Real-world graphs exhibit
community structure, and we can sample a community as a subgraph, each of which retains
essential connection patterns.
Cluster-GCN:


Figure 20.2: Cluster GCN

1. Preprocessing: Given a large graph, partition it into groups of nodes


Partition V of a large graph G = (V, E) into C groups V1 , . . . , VC using a community
detection algorithm e.g. Louvain, METIS. Each Vi induces a subgraph Gi . Note that
between group edges in G are dropped.

2. Mini-batch training: Sample node group Vi at a time and apply GNN on node
group Gi .

The issues with this algorithm are:

• Between group edges are dropped


This leads to systematically biased gradient estimates.

• Sampled node group tends to only cover the small-concentrated portion of the entire
data.

• Sampled nodes are not diverse enough to represent the entire graph structure.
This leads to very different gradients across clusters, which translate to high training
variance and slow convergence of SGD.

The solution is to aggregate multiple node groups per mini-batch. This is


Advanced Cluster-GCN:

1. Preprocessing: Given a large graph, partition it into groups of nodes


Partition V of a large graph G = (V, E) into C groups V1 , . . . , VC using a community
detection algorithm. They have to be small enough such that multiple node groups
can be loaded into the GPU at a time.

2. Mini-batch training: Sample node groups Vi1 , . . . , Viq and apply GNN to the
induced graph of Vi1 ∪ · · · ∪ Viq .
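
A minimal sketch of this mini-batching scheme; the partition would come from a community detection algorithm such as METIS or Louvain, but it is given by hand here:

```python
import random

# Minimal sketch of (advanced) Cluster-GCN mini-batching: sample q node
# groups per step and train on the subgraph induced by their union.
def cluster_gcn_batches(edges, partition, q, n_steps):
    for _ in range(n_steps):
        groups = random.sample(partition, q)
        batch_nodes = set().union(*groups)
        # keep only edges with both endpoints inside the batch
        batch_edges = [(u, v) for (u, v) in edges if u in batch_nodes and v in batch_nodes]
        yield batch_nodes, batch_edges                  # feed to a full-batch GNN update

edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 5), (1, 4)]
partition = [{0, 1}, {2, 3}, {4, 5}]
for nodes, es in cluster_gcn_batches(edges, partition, q=2, n_steps=2):
    print(nodes, es)
```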


Question: What about nodes that have no communities?


Assign the isolated nodes into some community

Question: Can we train different models on different subgraphs?


It is better to train one model to prevent models from overfitting on a subgraph.

Question: Induced subgraphs lose connection between subgraphs. Can we


treat induced subgraphs as hypernodes?
That is an interesting research problem. It is unclear if you can represent the entire
subgraph using one hypernode.
See GNNAutoScale.

Question: Is the computational trade off of selecting a large number of


small communities vs. small number of larger communities worth it?

Yes. Many community selection algorithms are linear in |V |.

Question: Can Cluster-GCN be easily adapted to heterogeneous graphs?


This might introduce bias for the edge types. This could be a good research question.

Overall time complexity of Cluster-GCN: K · M · Davg , where Davg is the average node
degree. This linear growth is much more efficient than neighbourhood sampling.

20.3 Simplified GCN


We start from Graph Convolutional Network (GCN). Recall that the GCN aggregation
function is of the form E (k+1) := ReLU(ÃE (k) W(k) ).
Simplified GCN uses self-loops on the adjacency matrix A: A ← A + I . It also assumes
the node embeddings E are given as features. These are the main differences from
LightGCN. Since the node features are fixed, the matrix Ẽ := Ã^K E can be calculated only
once, as a preprocessing step.
Ẽ already has very rich features and can be fed to existing machine learning models.
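
A minimal sketch of this preprocessing step (the normalisation and the downstream classifier are standard assumptions, not tied to a specific implementation):

```python
import numpy as np

# Minimal sketch of the Simplified GCN preprocessing:
# E_tilde = A_tilde^K X is computed once, then any classifier can be used.
def simplified_gcn_features(A, X, K):
    A = A + np.eye(A.shape[0])                 # add self-loops
    d = A.sum(axis=1)
    A_tilde = A / np.sqrt(np.outer(d, d))      # normalised adjacency
    E = X.copy()
    for _ in range(K):                         # K rounds of linear propagation
        E = A_tilde @ E
    return E                                   # feed into e.g. logistic regression
```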

Question: Since we are removing the activation function and the perfor-
mance is still strong, does this mean a lot of things we learned are linear?
Yes.

The drawback is that Simplified GCN is less expressive.


Despite being less expressive, Simplified GCN performs comparably to the original GNNs.
The reason for this is homophily in many node classification tasks, i.e. nodes connected
by edges tend to share the same target labels. As a result, nodes connected by edges tend


to have similar pre-processed features. This is due to the high dimensionality of embedding
space.

21 Trustworthy Graph AI
Trustworthy AI/GNN include explainability, fairness, robustness, privacy, etc. Previously
the role of graph topology is unexplored in these problems. This lecture covers robustness
and explainability.

21.1 Explainability
Deep learning models are black boxes, which makes it a major challenge to explain them and
extract insights from them. Explainable Artificial Intelligence (XAI) is an umbrella term for
any research trying to solve the black-box problem for AI.
It is useful since it enables
• Trust: Explainability is a prerequisite for humans to trust and accept the model’s
prediction.
• Causality: Explainability (e.g. attribute importance) conveys causality to the
system’s target prediction: attribute X causes the data to be Y
• Transferability: The model needs to convey an understanding of decision-making
for humans before it can be safely deployed to unseen data.
• Fair and Ethical Decision Making: Knowing the reasons for a certain decision is
a societal need, in order to perceive if the prediction conforms to ethical standards.
Excursion: Explainable Models
A model is explainable when

• Importance values: Scores for input features

• Attribution: straightforward relationships between prediction and input fea-


tures

Examples:

• Linear Regression: The simplest model is linear regression, specified by

y := w1 x1 + · · · + wd xd

where the xi are features, the wi are weights, and y is the prediction. Each weight
(slope) is explainable as the amount of effect a variable has on the prediction.

• Dimension Reduction: Allows us to visualise the training data distribution geo-


metrically with a boundary characterising different classes.

112
21 TRUSTWORTHY GRAPH AI

• Decision Trees: A very explainable set of models where each node represents a
logical decision. We can compute statistics for each decision node.

What makes models explainable?

• Importance values: Assigning importance to features, pixels, words, nodes, etc.

• Attributions: Straightforward relationships between prediction and input features.

• Concepts and Prototypes

Some architectures provide explainability:

• Proxy Model: Learn an interpretable model that locally approximates the original model.

• Saliency Map: Compute the gradients of outputs w.r.t. inputs (see the sketch after this list).

• Attention: Visualise attention weights in attention models such as transformers and GAT.
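A minimal saliency-map sketch in PyTorch; `model`, its call signature, and the class-score indexing are assumptions for illustration, not a specific library API:

import torch

def saliency(model, x, target_class):
    """Gradient of the target-class score w.r.t. the input features."""
    x = x.clone().detach().requires_grad_(True)
    score = model(x)[..., target_class].sum()  # scalar score for the target class
    score.backward()                           # populates x.grad
    return x.grad.abs()                        # per-entry importance of the input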

Explainability settings can be classified into:

• Instance-level: Local explanation for a single input x and prediction ŷ

• Model-level: A global explanation of a specific dataset D or classes of D


Explanation in Graph Learning: an important subgraph structure and a small subset of node features that play a crucial role in the GNN's prediction. Examples:

• Explaining the ground-truth phenomenon: What are the characteristics of a toxic molecule?

• Explaining model predictions: Why does the model recommend no loan for person X?

Question: Can we build an explainer of the post-hoc explainer model, and so on?

Yes. This is done in computer vision, and it often gives better performance.

21.2 GNNExplainer
GNNExplainer is a post-hoc, model-agnostic explanation method for GNNs.

• Training time: Optimise the GNN on training graphs and save the trained model

• Test time: Explain the predictions of the GNN on unseen instances

GNNExplainer can explain different tasks, including node classification, link prediction, and graph classification. It can be adapted to GAT, Gated Graph Sequence Neural Networks, Graph Networks, GraphSAGE, etc.
In a general message-passing framework, we can produce structural explanations and feature explanations. GNNExplainer explains both aspects simultaneously.
Settings
Without loss of generality, consider the node classification task. The input is

• The computation graph G_c(v)

• The adjacency matrix A_c(v)

• The node features X_c(v) := {x_u : u ∈ G_c(v)}

• The GNN model ϕ, which learns a distribution P_ϕ(Y | A_c(v), X_c(v))

GNNExplainer outputs (Â, X̂^F), where

• Ĝ is the graph with adjacency matrix Â, a subgraph of G_c(v)

• X̂^F := {x_u^F : u ∈ Ĝ} are the masked node features

• F is a mask which masks out unimportant features

Excursion: Mutual Information

The mutual information of two random variables X, Y measures the correlation between them. It is defined as

I(X, Y) := D_KL(P_(X,Y) ∥ P_X ⊗ P_Y)
         = H(X) − H(X|Y) = H(Y) − H(Y|X)
         = H(X) + H(Y) − H(X, Y)
         = H(X, Y) − H(X|Y) − H(Y|X)

where P_X, P_Y are the marginal distributions of X, Y, resp., and P_(X,Y) is their joint distribution. I(X, Y) measures how much the joint distribution deviates from the product distribution, i.e. from the hypothetical case where X and Y are independent. Note that I(X, Y) = I(Y, X).
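A small numeric sketch of the definition on a discrete joint distribution (illustrative values only):

import numpy as np

P = np.array([[0.4, 0.1],   # joint distribution P(X, Y) over {0, 1} x {0, 1}
              [0.1, 0.4]])
Px, Py = P.sum(axis=1), P.sum(axis=0)   # marginals
I = sum(P[i, j] * np.log(P[i, j] / (Px[i] * Py[j]))
        for i in range(2) for j in range(2))
print(I)  # positive, since X and Y are correlated; 0 iff independent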

A good explanation should have a high correlation with the model prediction, so GNNExplainer's goal is

max I(Y, (Â, X̂^F)) = H(Y) − H(Y | A = Â, X = X̂^F)

where Y is the label, Â is the explanation subgraph, and X̂^F is the selected feature subset.

Finding Â that minimises the conditional entropy H(Y | · · ·) is computationally expensive due to the exponentially many possible Â. The solution in GNNExplainer is to treat the explanation as a distribution of plausible explanations instead of a single graph. This has two benefits:

• Captures multiple possible explanations for the same node

• Turns a discrete optimisation into a continuous one

GNNExplainer instead optimises over the expected adjacency matrix:

min_A E_{Ã∼A} H(P_ϕ(Y = y | A = Ã, X = X̂^F)) ≃ min_A H(P_ϕ(Y = y | A = E_A[Ã], X = X̂^F))
                                               ≃ min_A H(P_ϕ(Y = y | A = A_c ⊙ σ(M), X = X̂^F))

Note the approximation used here, since H is not a convex function. GNNExplainer uses A_c ⊙ σ(M), where σ(M) is a mask, to approximate E_A[Ã]. σ is the sigmoid function, which squashes M into [0, 1], representing whether an edge should be kept or dropped.


Figure 21.2: GNNExplainer thresholding graphs using the relaxed adjacency matrix

Figure 21.3: GNN Post-hoc explanation pipeline

After optimisation, threshold A_c ⊙ σ(M) to obtain Ĝ. Similarly, select features by optimising a feature mask σ(F). Finally, we add regularisation terms, and this forms the training objective of GNNExplainer (a sketch of the optimisation loop follows the list below).
Regularisation

min_{M,F} H(P_ϕ(Y = y | A = A_c ⊙ σ(M), X = X̂^F)) + λ_1 Σ σ(M) + λ_2 Σ σ(F)

• Node classification: optimise the masks (M, F) on the node's neighbourhood

• Edge prediction: optimise the masks (M, F) on the two nodes' neighbourhoods

• Graph classification: optimise the masks (M, F) on the entire graph
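A minimal sketch of the relaxed mask optimisation in PyTorch. The trained model `gnn`, its call signature `gnn(A, X)` returning per-node class logits, and the hyperparameter values are assumptions for illustration, not the reference implementation:

import torch

def explain_node(gnn, v, A_c, X, y, epochs=200, lam1=0.005, lam2=0.005):
    """Optimise an edge mask M and a feature mask F for node v's prediction y."""
    M = torch.randn_like(A_c, requires_grad=True)    # edge-mask logits
    F = torch.randn(X.shape[1], requires_grad=True)  # feature-mask logits
    opt = torch.optim.Adam([M, F], lr=0.01)
    for _ in range(epochs):
        opt.zero_grad()
        A_masked = A_c * torch.sigmoid(M)            # A_c ⊙ σ(M): relaxed subgraph
        X_masked = X * torch.sigmoid(F)              # soft feature selection
        log_prob = torch.log_softmax(gnn(A_masked, X_masked)[v], dim=-1)
        loss = (-log_prob[y]                         # keep the prediction for y likely
                + lam1 * torch.sigmoid(M).sum()      # regulariser: sparse subgraph
                + lam2 * torch.sigmoid(F).sum())     # regulariser: few features
        loss.backward()
        opt.step()
    # threshold the relaxed masks to obtain the discrete explanation
    return torch.sigmoid(M) > 0.5, torch.sigmoid(F) > 0.5

Here the negative log-probability of the predicted label serves as the practical stand-in for the entropy term in the objective above.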

Alternative approaches:

• GNN Saliency Map: Record the gradients of the output scores w.r.t. the inputs of a GNN

• Attention from GAT: Attention scores provide some explanation.


Figure 21.4: GraphFramEx explanation framework focuses on the phenomenon and the
model.


21.3 Explainability Evaluation


During evaluation, the ground truth might not always be available, and evaluation is multi-dimensional: it can be done on the goal (phenomenon/model), the masking, or the type (sufficiency/necessity).

• Phenomenon explanation refers to explaining the underlying reasons for the ground-truth phenomenon

• Model explanation refers to explaining why the model makes a particular prediction

The fidelity metrics are fid^+ and fid^-, corresponding to removing the important subgraph and keeping only the important subgraph, respectively:

fid^+ := (1/N) Σ_{i=1}^{N} [ 1(ŷ_i = y_i) − 1(ŷ_i(G_{C∖S}) = y_i) ]   (remove the important subgraph S)

fid^- := (1/N) Σ_{i=1}^{N} [ 1(ŷ_i = y_i) − 1(ŷ_i(G_S) = y_i) ]        (keep only the important subgraph S)

where ŷ_i is the original prediction; the indicator terms can also be replaced by the prediction probability/confidence.
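A small sketch of the accuracy-based fidelity scores (function and argument names are illustrative):

import numpy as np

def fidelity(y, y_hat_full, y_hat_without_S, y_hat_only_S):
    """Each argument is an array of predicted labels over N instances."""
    acc_full    = np.mean(y_hat_full == y)       # original predictions
    acc_without = np.mean(y_hat_without_S == y)  # important subgraph S removed
    acc_only    = np.mean(y_hat_only_S == y)     # only the subgraph S kept
    fid_plus  = acc_full - acc_without  # high -> the explanation is necessary
    fid_minus = acc_full - acc_only     # low  -> the explanation is sufficient
    return fid_plus, fid_minus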
Evaluation criteria are multidimensional:

• Quality: High fidelity/characterisation scores

• Stability: Consistency across random optimisation seeds

• Complexity: Explanation should be concise and easy to understand

Types of explanations:

• Sufficiency: An explanation is sufficient if it leads on its own to the model's initial prediction (fid^- → 0)

• Necessity: An explanation is necessary if the model prediction changes when it is removed from the initial graph (fid^+ → 1)

The characterisation score summarises the explanation quality as a weighted harmonic mean of fid^+ and (1 − fid^-):

c := (w_+ + w_-) / (w_+ / fid^+ + w_- / (1 − fid^-)) = ((w_+ + w_-) · fid^+ · (1 − fid^-)) / (w_+ · (1 − fid^-) + w_- · fid^+)

where w_± are the weights of the two fidelity metrics, and usually w_± = 1.
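As a quick sanity check with illustrative numbers: for w_± = 1, fid^+ = 0.9 and fid^- = 0.1, the score is c = (2 · 0.9 · 0.9) / (0.9 + 0.9) = 0.9, so an explanation that is both necessary (high fid^+) and sufficient (low fid^-) scores close to 1.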


Other types of explanations:

• Counterfactual explanations: What makes an instance belong to a different class? What perturbation is needed to move an instance to a different class?

• Model-level explanations: What are the general characteristics of all instances belonging to a certain class?


22 Conclusion
22.1 GNN Design Space and Task Space
How do we find a good GNN design for a specific GNN task? Redoing a hyperparameter grid search for each new task is not feasible.

• Design: a concrete model instantiation, e.g. a 4-layer GraphSAGE

• Design Dimension: characterises a design, e.g. the number of layers l ∈ {2, 4, 6, 8}

• Design choice: the actual selected value of a design dimension, e.g. the number of layers is l := 2

• Design space: the Cartesian product of the design dimensions

• Task: a specific task of interest, e.g. node classification on Cora, graph classification on ENZYMES

• Task Space: consists of all the tasks we care about

The GNN design space consists of

• Intra-Layer Design: a GNN layer is transformation and aggregation, e.g. Batch Normalisation, Dropout (0, 0.3, 0.6), Activation (ReLU, Swish), Aggregation

• Inter-Layer Design: different ways of organising GNN layers, e.g. layer connectivity (skip connections), pre-process layers (1, 2, 3), message-passing layers (2, 4, 6, 8), post-process layers (1, 2, 3)

• Learning Configuration: e.g. batch size, learning rate, optimiser, training epochs

Overall there are over 300,000 possible designs from assorted combinations of these parameters. The total size of the design space is huge (> 10^5); we cannot cover all possible designs.
GNN tasks can be categorised into node/edge/graph-level tasks. This is not precise enough, since for example “predicting the clustering coëfficient” and “predicting a node’s subject area in citation networks” are completely different.


Figure 22.1: The design space of GNNs

22.2 GraphGym
GraphGym is a platform for exploring different GNN architectures.
In GraphGym, a quantitative task-similarity metric is defined as follows (a sketch is given after the lists below):

1. Select anchor models (M_1, . . . , M_5)

2. Characterise a task by ranking the performance of the anchor models on it

3. Tasks with similar rankings are considered similar

Anchor models can be selected as follows:

1. Select a small dataset (e.g. node classification on Cora)

2. Randomly sample N models from the design space (e.g. sample 100 models)

3. Sort these models based on their performance and pick a spread of them as anchors (e.g. 12 anchor models in the experiments)
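A minimal sketch of a rank-based task-similarity computation; the use of Kendall's tau here is an illustrative choice, not necessarily the exact metric in GraphGym:

from scipy.stats import kendalltau
import numpy as np

def task_similarity(perf_task_a, perf_task_b):
    """Each argument: performance of the same anchor models on one task."""
    rank_a = np.argsort(np.argsort(perf_task_a))  # convert scores to ranks
    rank_b = np.argsort(np.argsort(perf_task_b))
    tau, _ = kendalltau(rank_a, rank_b)           # rank correlation in [-1, 1]
    return tau

# e.g. 12 anchor models evaluated on two tasks (made-up performance numbers)
sim = task_similarity(np.random.rand(12), np.random.rand(12))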


Figure 22.2: Task similarities of 3 tasks

Example
Evaluating a design dimension: suppose we want to ask “Is Batch Normalisation generally useful for GNNs?” The common practice is to select one model and compare it with batch normalisation turned on and off.
In GraphGym, the process is

1. Sample from the roughly 10^7 ≃ 32 (tasks) · 315,000 (models) model–task combinations

2. Rank the models with Batch Normalisation on and off, while controlling the computational budget of the models. The lower the ranking, the better.

3. Plot the average/distribution of the rankings

Question: Is there any autoencoder task in the task space?

In the set of 32 tasks, they are all node/structure prediction tasks.

To apply this paradigm to a novel task:

1. Measure the performance of the 12 anchor models on the new task

2. Compute the similarity between the new task and existing tasks

3. Recommend the best designs from existing tasks with high similarity

22.3 Pre-Training Graph Neural Networks


Challenges of applying ML to scientific domains

1. Scarcity of labeled data

2. Out-of-distribution prediction


Figure 22.3: Comparison of designs on a selection of 32 GNN tasks: (a) intra-layer designs, (b) inter-layer designs, (c) learning configuration designs

Excursion: Pre-Training and Fine-Tuning


Deep learning models are extremely prone to overfitting on small labeled datasets and extrapolate poorly.
To improve a model's out-of-distribution prediction performance even with limited data, we can inject domain knowledge into the model before training on the scarcely labeled tasks.

• Pre-training: training a model on relevant tasks with large amounts of data

• Fine-tuning: training the pre-trained model on the small, out-of-distribution downstream task

Settings
In this section we investigate a setting in molecule classification.
• Task: Binary classification of molecules


Figure 22.4: Best GNN designs in different tasks vary


Figure 22.5: Pre-training both node and graph embeddings compared to training them
individually

• Evaluation metric: ROC-AUC

• Supervised pre-training data: 1310 diverse binary bioassays annotated over 450,000 molecules

• Downstream task (target): 8 molecular classification datasets, each with 1,000 to 100,000 molecules

• Data split: scaffold split (test molecules are out of distribution)

The naïve strategy is multi-task supervised pre-training on relevant labels. This has limited performance on downstream tasks and often leads to negative transfer.
The key idea to improving this is to pre-train both node and graph embeddings.
Pre-training methods:

• Attribute Masking (Node-level, self-supervised):

  1. Mask node attributes and use GNNs to generate node embeddings
  2. Use these embeddings to predict the masked attributes (e.g. atom types in a molecule)

  Intuitively, this forces the GNN to learn domain knowledge.

• Context Prediction (Node-level, self-supervised):

  1. For each graph, sample one centre node
  2. Extract its neighbourhood and context graphs
  3. Use GNNs to encode the neighbourhood and context graphs into vectors
  4. Maximise/minimise the inner product between true/false (neighbourhood, context) pairs

  Intuitively, subgraphs that are surrounded by similar contexts are semantically similar. A minimal sketch of this objective is given after this list.


Figure 22.6: Context prediction

Figure 22.7: Comparison of naïve and effective graph pre-training

• Supervised Attribute Prediction (Graph-level): multi-task supervised training on many relevant labels

• Structural Similarity Prediction (Graph-level)
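A minimal sketch of the context-prediction objective under simplifying assumptions; the encoders producing the embeddings and the negative-sampling scheme are placeholders, not the original implementation:

import torch
import torch.nn.functional as F

def context_prediction_loss(z_neigh, z_ctx_pos, z_ctx_neg):
    """z_neigh: neighbourhood embeddings [B, d]; z_ctx_*: context embeddings [B, d]."""
    pos_score = (z_neigh * z_ctx_pos).sum(dim=-1)   # inner product for true pairs
    neg_score = (z_neigh * z_ctx_neg).sum(dim=-1)   # inner product for corrupted pairs
    return (F.binary_cross_entropy_with_logits(pos_score, torch.ones_like(pos_score))
          + F.binary_cross_entropy_with_logits(neg_score, torch.zeros_like(neg_score)))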

When different GNN models are pre-trained, the most expressive model (GIN) benefits the
most from pre-training.
