

Large Language Models on Graphs: A Comprehensive Survey

Bowen Jin*, Gang Liu*, Chi Han*, Meng Jiang, Heng Ji, Jiawei Han


arXiv:2312.02783v1 [cs.CL] 5 Dec 2023

Abstract—Large language models (LLMs), such as ChatGPT and LLaMA, are creating significant advancements in natural language processing, due to their strong text encoding/decoding ability and newly found emergent capability (e.g., reasoning). While LLMs are mainly designed to process pure texts, there are many real-world scenarios where text data are associated with rich structure information in the form of graphs (e.g., academic networks, and e-commerce networks) or scenarios where graph data are paired with rich textual information (e.g., molecules with descriptions). Besides, although LLMs have shown their pure text-based reasoning ability, it is underexplored whether such ability can be generalized to graph scenarios (i.e., graph-based reasoning). In this paper, we provide a systematic review of scenarios and techniques related to large language models on graphs. We first summarize potential scenarios of adopting LLMs on graphs into three categories, namely pure graphs, text-rich graphs, and text-paired graphs. We then discuss detailed techniques for utilizing LLMs on graphs, including LLM as Predictor, LLM as Encoder, and LLM as Aligner, and compare the advantages and disadvantages of different schools of models. Furthermore, we mention the real-world applications of such methods and summarize open-source codes and benchmark datasets. Finally, we conclude with potential future research directions in this fast-growing field. The related source can be found at https://github.com/PeterGriffinJin/Awesome-Language-Model-on-Graphs.

Index Terms—Large Language Models, Graph Neural Networks, Natural Language Processing, Graph Representation Learning

[Figure 1: the top half ("Graph-Text Relationship") shows example graph types (traffic networks, protein networks, molecule graphs with captions such as "Benzene is toxic" and "Water is less toxic", academic graphs, and social networks) grouped into pure graphs, text-paired graphs, and text-rich graphs; the bottom half ("Large Language Models' Roles") sketches the LLM as Predictor, LLM as Aligner, and LLM as Encoder architectures.]

Fig. 1. According to the relationship between graph and text, we categorize three LLM on graph scenarios. Depending on the role of LLM, we summarize three LLM-on-graph techniques. "LLM as Predictor" is where LLMs are responsible for predicting the final answer. "LLM as Aligner" will align the inputs-output pairs with those of GNNs. "LLM as Encoder" refers to using LLMs to encode and obtain feature vectors.

1 INTRODUCTION
LARGE language models (LLMs) (e.g., BERT [22], T5 [30], LLaMA [119]), which are pretrained on very large text corpora, have been demonstrated to be very powerful in solving natural language processing (NLP) tasks, including question answering [1], text generation [2] and document understanding [3]. Early LLMs (e.g., BERT [22], RoBERTa [23]) adopt an encoder-only architecture and are mainly applied for text representation learning [4] and natural language understanding [3]. In recent years, more focus has been given to decoder-only architectures [119] or encoder-decoder architectures [30]. As the model size scales up, such LLMs have also shown reasoning ability and even more advanced emergent ability [5], exposing a strong potential for Artificial General Intelligence (AGI).

While LLMs are extensively applied to process pure texts, there is an increasing number of applications where the text data are associated with structure information which is represented in the form of graphs. As presented in Fig. 1, in academic networks, papers (with title and description) and authors (with profile text) are interconnected with authorship relationships. Understanding both the author/paper's text information and author-paper structure information on such graphs can contribute to advanced author/paper modeling and accurate recommendations for collaboration. In the scientific domain, molecules are represented as graphs and are often paired with text that describes their basic information (e.g., toxicity). Joint modeling of both the molecule structure (graph) and the associated rich knowledge (text) is important for deeper molecule understanding. Since LLMs are mainly proposed for modeling texts that lie in a sequential fashion, those scenarios mentioned above pose new challenges on how to enable LLMs to encode the structure information on graphs.

• * The first three authors contributed equally to this work.
• Bowen Jin, Chi Han, Heng Ji, Jiawei Han: University of Illinois at Urbana-Champaign. {bowenj4, chihan3, hengji, hanj}@illinois.edu
• Gang Liu, Meng Jiang: University of Notre Dame. {gliu7, mjiang2}@nd.edu

In addition, since LLMs have demonstrated their superb text-based reasoning ability, it is promising to explore whether they have the potential to address fundamental graph reasoning problems on pure graphs. These graph reasoning tasks include inferring connectivity [6], shortest path [7], and subgraph matching [8].

Recently, there has been an increasing interest in extending LLMs for graph-based applications (summarized in Fig. 1). According to the relationship between graph and text presented in Fig. 1, the application scenarios can be categorized into pure graphs, text-rich graphs, and text-paired graphs. Depending on the role of LLMs and their interaction with graph neural networks (GNNs), the LLM on graphs techniques can be classified into treating LLMs as the task predictor (LLM as Predictor), treating LLMs as the feature encoder for GNNs (LLM as Encoder), and aligning LLMs with GNNs (LLM as Aligner).

There are a limited number of existing surveys exploring the intersection between LLMs and graphs. Related to deep learning on graphs, Wu et al. [17] give a comprehensive overview of graph neural networks (GNNs) with detailed illustrations on recurrent graph neural networks, convolutional graph neural networks, graph autoencoders, and spatial-temporal graph neural networks. Liu et al. [18] discuss pretrained foundation models on graphs, including their backbone architectures, pretraining methods, and adaptation techniques. Pan et al. [19] review the connection between LLMs and knowledge graphs (KGs), especially on how KGs can enhance LLMs training and inference, and how LLMs can facilitate KG construction and reasoning. In summary, existing surveys either focus more on GNNs rather than LLMs or fail to provide a systematic perspective on their applications in various graph scenarios as in Fig. 1. Our paper provides a comprehensive review of the LLMs on graphs, for broader researchers from diverse backgrounds besides the computer science and machine learning community who want to enter this rapidly developing field.

Our Contributions. The notable contributions of our paper are summarized as follows:
• Categorization of Graph Scenarios. We systematically summarize the graph scenarios where language models can be adopted into: pure graphs, text-rich graphs, and text-paired graphs.
• Systematic Review of Techniques. We provide the most comprehensive overview of language models on graph techniques. For different graph scenarios, we summarize the representative models, provide detailed illustrations of each of them, and make necessary comparisons.
• Abundant Resources. We collect abundant resources on language models on graphs, including benchmark datasets, open-source codebases, and practical applications.
• Future Directions. We delve into the foundational principles of language models on graphs and propose six prospective avenues for future exploration.

Organization of Survey. The rest of this survey is organized as follows. Section 2 introduces the background of LLMs and GNNs, lists commonly used notations, and defines related concepts. Section 3 categorizes graph scenarios where LLMs can be adopted and summarizes LLMs on graph techniques. Sections 4-6 provide a detailed illustration of LLM methodologies for different graph scenarios. Section 7 delivers available datasets, open-source codebases, and a collection of applications across various domains. Section 8 introduces some potential future directions. Section 9 summarizes the paper.

TABLE 1
Notations of Concepts.

|·| : The length of a set.
[A, B] : The concatenation of A and B.
∥ : Concatenate operation.
G : A graph.
V : The set of nodes in a graph.
v : A node v ∈ V.
E : The set of edges in a graph.
e : An edge e ∈ E.
Gv : The ego graph associated with v in G.
N(v) : The neighbors of a node v.
M : A meta-path or a meta-graph.
NM(v) : The nodes which are reachable from node v with meta-path or meta-graph M.
D : The text set.
s ∈ S : The text token in a text sentence S.
dvi : The text associated with the node vi.
deij : The text associated with the edge eij.
dG : The text associated with the graph G.
n : The number of nodes, n = |V|.
b : The dimension of a node hidden state.
xvi ∈ R^d : The initial feature vector of the node vi.
Hv ∈ R^{n×b} : The node hidden feature matrix.
hvi ∈ R^b : The hidden representation of node vi.
hG ∈ R^b : The hidden representation of a graph G.
hdv ∈ R^b : The representation of text dv.
Hdv ∈ R^{|dv|×b} : The hidden states of tokens in dv.
W, Θ, w, θ : Learnable model parameters.
LLM(·) : Large language model.
GNN(·) : Graph neural network.

2 DEFINITIONS & BACKGROUND

2.1 Definitions

In this section, we provide definitions of various types of graphs and introduce the notations (as shown in Table 1) used in this paper.

Definition 1 (Graph): A graph can be defined as G = (V, E). Here V signifies the set of nodes, while E denotes the set of edges. A specific node can be represented by vi ∈ V, and an edge directed from node vj to vi can be expressed as eij = (vi, vj) ∈ E. The set of nodes adjacent to a particular node v is articulated as N(v) = {u ∈ V | (v, u) ∈ E}.

A graph containing a node type set A and an edge type set R, where |A| + |R| > 2, is called a heterogeneous graph. A heterogeneous graph is also associated with a node type mapping function ϕ : V → A and an edge type mapping function ψ : E → R.

Definition 2 (Graph with node-level textual information): A graph with node-level textual information can be denoted as G = (V, E, D), where V, E and D are node set, edge set, and text set, respectively. Each vi ∈ V is associated with some textual information dvi ∈ D.

For instance, in an academic citation network, one can interpret v ∈ V as the scholarly articles, e ∈ E as the citation links between them, and d ∈ D as the textual content of these articles. A graph with node-level textual information is also called a text-rich graph [32], a text-attributed graph [62], or a textual graph [73].

Definition 3 (Graph with edge-level textual information): A graph with edge-level textual information can be denoted as G = (V, E, D), where V, E and D are node set, edge set, and text set, respectively. Each eij ∈ E is associated with some textual information deij ∈ D. For example, in a social network, one can interpret v ∈ V as the users, e ∈ E as the interaction between the users, and d ∈ D as the textual content of the messages sent between the users.

Definition 4 (Graph with graph-level textual information): A graph data object with graph-level textual information can be denoted as the pair (G, dG), where G = (V, E). V and E are node set and edge set. dG is the text set paired to the graph G. For instance, in a molecular graph G, v ∈ V denotes an atom, e ∈ E represents the strong attractive forces or chemical bonds that hold molecules together, and dG represents the textual description of the molecule.

2.2 Background

(Large) Language Models. Language Models (LMs), or language modeling, is a field of natural language processing (NLP) on understanding and generation from text distributions. In recent years, large language models (LLMs) have demonstrated impressive capabilities in tasks such as machine translation, text summarization, and question answering [25], [44], [110]–[113].

Large language models have evolved significantly over time. Initially, word vectors such as Word2Vec [114] and GloVe [115] represent words in a continuous vector space where semantically similar words are mapped close together. These embeddings w ∈ R^d usually model the word-level correlations in a corpus, such as the conditional word probability on skip-grams or continuous bag-of-words (CBOW), and are useful for preliminary tasks like word analogy and word similarity.

The advent of BERT [22] marks significant progress in language modeling and representation. BERT models the conditional probability of a word given its bidirectional context, also named the masked language modeling (MLM) objective. Its training involves masking a random subset of words in a sentence and predicting them based on the remaining words. For an intuitive understanding, if simplified to the case with only one masked word, the objective corresponds to the following equation:

    E_{S∼D} [ Σ_{si∈S} log p(si | s1, ..., si−1, si+1, ..., sNS) ],    (1)

where S is a sentence sampled from the corpus D, si is the i-th word in the sentence, and NS is the length of the sentence. BERT utilizes the Transformer architecture with attention mechanisms as the core building block. In the vanilla Transformer, the attention mechanism is defined as:

    Attention(Q, K, V) = softmax( QK^⊤ / √dk ) V,    (2)

where Q, K, V ∈ R^{NS×dk} are the query, key, and value vectors for each word in the sentence, respectively. The attention mechanism is designed to capture the dependencies between words in a sentence in a flexible way, an advantage also potentially useful for combining with other input formats like graphs. BERT is useful as a text representation model, where the last layer outputs the representation of the input text hS ∈ R^d. Following BERT, many other masked language models are proposed, such as RoBERTa [23], ALBERT [116], and ELECTRA [117], with similar architectures and objectives of text representation. This type of model is also called the pretrained language model (PLM).

Although the original Transformer paper [94] was experimented on machine translation, it was not until the release of GPT-2 [113] that causal language modeling (i.e., text generation) became impactful on downstream tasks. Causal language modeling is the task of predicting the next word given the previous words in a sentence. The objective of causal language modeling is defined as:

    E_{S∼D} [ Σ_{si∈S} log p(si | s1, ..., si−1) ].    (3)

To adapt the Transformer to causal language modeling, causal attention is introduced, which masks the attention weights for the future words. Simple but powerful, subsequent models like GPT-3 [25], GPT-4 [118], LLaMA [119], LLaMA2 [119], Mistral 7B [120], and T5 [30] show impressive emergent capabilities such as few-shot learning, chain-of-thought reasoning, and even programming. Efforts have also been made to combine language models with other modalities such as vision [97], [121] and biochemical structures [48], [122], [123]. Their combination with graphs is another exciting topic we will discuss in the current paper.

We would like to point out that the word "large" in LLM is not associated with a clear and static threshold to divide language models: current models that we call "large" will eventually appear small compared to future models, and many smaller models were large at the time they were developed. "Large" actually refers to a direction in which language models are inevitably evolving. More importantly, it represents an empirical rule that researchers discovered recently: larger foundational models tend to possess significantly more representation and generalization power. Hence, we define LLMs to encompass both medium-scale PLMs, such as BERT, and large-scale LMs, like GPT-4, as suggested by [19].

Graph Neural Networks & Graph Transformers. GNN is proposed as a deep learning architecture for graph data. Primary GNNs including GCN [85], GraphSAGE [86], and GAT [87] are designed for solving node-level tasks. They mainly adopt a propagation-aggregation paradigm to obtain node representations:

    a^(l−1)_{vi,vj} = PROP^(l)( h^(l−1)_{vi}, h^(l−1)_{vj} ), ∀ vj ∈ N(vi);    (4)
    h^(l)_{vi} = AGG^(l)( h^(l−1)_{vi}, { a^(l−1)_{vi,vj} | vj ∈ N(vi) } ).    (5)

Later works such as GIN [205] explore GNNs for solving graph-level tasks. They obtain graph representations by adopting a READOUT function on node representations:

    hG = READOUT({ hvi | vi ∈ G }).    (6)
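To make the propagation-aggregation paradigm of Eqs. (4)-(6) concrete, the following is a minimal sketch of a mean-aggregation GNN layer and a mean-pooling READOUT. It is an illustrative example only (not tied to any specific model discussed in this survey), and the class and function names are our own.

    import torch
    import torch.nn as nn

    class SimpleGNNLayer(nn.Module):
        """One propagation-aggregation step, cf. Eqs. (4)-(5)."""
        def __init__(self, dim):
            super().__init__()
            self.prop = nn.Linear(2 * dim, dim)   # PROP: message from (h_vi, h_vj)
            self.agg = nn.Linear(2 * dim, dim)    # AGG: combine h_vi with aggregated messages

        def forward(self, h, edge_index):
            # h: [n, dim] node states; edge_index: [2, |E|] with rows (src, dst)
            src, dst = edge_index
            messages = self.prop(torch.cat([h[dst], h[src]], dim=-1))        # a_{vi,vj}
            agg = torch.zeros_like(h).index_add_(0, dst, messages)           # sum over N(v_i)
            deg = torch.zeros(h.size(0), 1).index_add_(
                0, dst, torch.ones(dst.size(0), 1)).clamp(min=1)             # neighbor counts
            return torch.relu(self.agg(torch.cat([h, agg / deg], dim=-1)))   # h^{(l)}_{vi}

    def readout(h):
        """Graph-level representation h_G, cf. Eq. (6), using mean pooling."""
        return h.mean(dim=0)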

The READOUT functions include mean pooling, max pooling, and so on. Subsequent work on GNNs tackles the issues of over-smoothing [137], over-squashing [138], interpretability [147], and bias [143]. While message-passing-based GNNs have demonstrated advanced structure encoding capability, researchers are exploring further enhancing their expressiveness with Transformers (i.e., graph Transformers). Graph Transformers utilize a global multi-head attention mechanism to expand the receptive field of each graph encoding layer [141]. They integrate the inductive biases of graphs into the model by positional encoding, structural encoding, the combination of message-passing layers with attention layers [142], or improving the efficiency of attention on large graphs [144]. Graph Transformers have been proven as the state-of-the-art solution for many pure graph problems. We refer readers to [140] for the most recent advances in GTs. GTs can be treated as a special type of GNN.

Language Models vs. Graph Transformers. Modern language models and graph Transformers both use Transformers [94] as the base model architecture. This makes the two concepts hard to distinguish, especially when the language models are adopted on graph applications. In this paper, "Transformers" typically refers to Transformer language models for simplicity. Here, we provide three points to help distinguish them: 1) Tokens (word token vs. node token): Transformers take a token sequence as inputs. For language models, the tokens are word tokens; while for graph Transformers, the tokens are node tokens. In those cases where tokens include both word tokens and node tokens, if the backbone Transformer is pretrained on text corpus (e.g., BERT [22] and LLaMA [119]), we will call it a "language model". 2) Positional Encoding (sequence vs. graph): language models typically adopt the absolute or relative positional encoding considering the position of the word token in the sequence, while graph Transformers adopt shortest path distance [141], random walk distance, or the eigenvalues of the graph Laplacian [142] to consider the distance of nodes in the graph. 3) Goal (text vs. graph): The language models are originally proposed for text encoding and generation, while graph Transformers are proposed for node encoding or graph encoding. In those cases where texts serve as nodes/edges on the graph, if the backbone Transformer is pretrained on text corpus, we will call it a "language model".

3 CATEGORIZATION AND FRAMEWORK

In this section, we first introduce our categorization of graph scenarios where language models can be adopted. Then we discuss the categorization of LLM on graph techniques. Finally, we summarize the training & inference framework for language models on graphs.

3.1 Categorization of Graph Scenarios with Language Models

Pure Graphs without Textual Information are graphs with no text information or no semantically rich text information. Examples of those graphs include traffic graphs and power transmission graphs. Those graphs often serve as context to test the graph reasoning ability of large language models (solve graph theory problems) or serve as knowledge sources to enhance the large language models (alleviate hallucination).

Text-Rich Graphs refers to graphs where nodes or edges are associated with semantically rich text information. Such graphs are also called text-rich networks [32], text-attributed graphs [62], textual graphs [73] or textual-edge networks [75]. Real-world examples include academic networks, e-commerce networks, social networks, and legal case networks. On these graphs, people are interested in learning representations for nodes or edges with both textual information and structure information [73] [75].

Text-Paired Graphs are graphs where the textual description is defined on the whole graph structure. Such graphs include molecules or proteins, where nodes represent atoms and edges represent chemical bonds. The text description can be molecule captions or protein textual features. Although the graph structure is the most significant factor influencing molecular properties, text descriptions of molecules can serve as a complementary knowledge source to help understand molecules [148]. The graph scenarios can be found in Fig. 1.

3.2 Categorization of LLM on Graph Techniques

According to the roles of LLMs and what are the final components for solving graph-related problems, we classify LLM on graph techniques into three main categories:

LLM as Predictor. This category of methods serves LLM as the final component to output representations or predictions. It can be enhanced with GNNs and can be classified depending on how the graph information is injected into LLM: 1) Graph as Sequence: This type of method makes no changes to the LLM architecture, but makes it aware of graph structure by taking a "graph token sequence" as input. The "graph token sequence" can be natural language descriptions for a graph or hidden representations outputted by graph encoders. 2) Graph-Empowered LLM: This type of method modifies the architecture of the LLM base model (i.e., Transformers) and enables it to conduct joint text and graph encoding inside its architecture. 3) Graph-Aware LLM Finetuning: This type of method makes no changes to the input of the LLMs or LLM architectures, but only fine-tunes the LLMs with supervision from the graph.

LLM as Encoder. This method is mostly utilized for graphs where nodes or edges are associated with text information (solving node-level or edge-level tasks). GNNs are the final components, and we adopt LLM as the initial text encoder (a minimal pipeline sketch is given after these three categories). To be specific, LLMs are first utilized to encode the text associated with the nodes/edges. The outputted feature vectors by LLMs then serve as input embeddings for GNNs for graph structure encoding. The output embeddings from the GNNs are adopted as final node/edge representations for downstream tasks. However, these methods suffer from convergence issues, sparse data issues, and inefficiency issues, for which we summarize solutions from optimization, data augmentation, and knowledge distillation perspectives.

LLM as Aligner. This category of methods adopts LLMs as text-encoding components and aligns them with GNNs which serve as graph structure encoding components. LLMs and GNNs are adopted together as the final components for task solving.
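Below is a minimal sketch of the LLM-as-Encoder pipeline described above, assuming a BERT-style encoder from the Hugging Face transformers library and reusing the illustrative SimpleGNNLayer sketched at the end of Section 2.2. It is a schematic only; the surveyed methods use task-specific GNNs, checkpoints, and training strategies.

    import torch
    from transformers import AutoTokenizer, AutoModel

    # Any encoder-style LM would do; "bert-base-uncased" is just an example checkpoint.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    lm = AutoModel.from_pretrained("bert-base-uncased")

    @torch.no_grad()
    def encode_node_texts(texts):
        """Step 1: the LLM encodes each node's text into a feature vector ([CLS] state)."""
        batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        return lm(**batch).last_hidden_state[:, 0, :]   # [n_nodes, hidden_dim]

    # Step 2: the LM embeddings become GNN input features; the GNN output is the
    # final node representation fed to the downstream task head.
    node_texts = ["Paper A: graph neural networks ...", "Paper B: language models ..."]
    edge_index = torch.tensor([[0, 1], [1, 0]])          # toy 2-node citation graph
    x = encode_node_texts(node_texts)
    gnn = SimpleGNNLayer(dim=x.size(-1))                 # illustrative layer from Sec. 2.2 sketch
    node_repr = gnn(x, edge_index)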

[Figure 2 presents the following taxonomy as a tree.]

• Pure Graphs: LLM as Predictor
  - Direct Answering: Zero-Shot [124]–[126], [128], [131]; Few-Shot [124], [125], [128], [131]; GraphLLM [43]; Role Prompting [126]; Format Explanation [126]
  - Heuristic Reasoning: CoT [124]–[126], [128], [131], [132]; Self-Consistency [124]; BaG [124], [131]; RoG [129]; StructGPT [130]; ToG [132]
  - Algorithmic Reasoning: Algorithmic Prompting [124]; Graph-ToolFormer [127]
• Text-Rich Graphs
  - LLM as Predictor
    - Graph as Sequence (Rule-based): InstructGLM [47]; GraphText [66]; [84]
    - Graph as Sequence (GNN-based): GNP [42]; GraphGPT [46]; DGTL [77]; METERN [76]
    - Graph-Empowered LLM: GreaseLM [68]; DRAGON [83]; GraphFormers [73]; Patton [32]; Heterformer [74]; Edgeformers [75]
    - Graph-Aware LLM Finetuning: SPECTER [52]; SciNCL [53]; Touchup-G [55]; TwHIN-BERT [57]; MICoL [60]; E2EG [61]
  - LLM as Encoder
    - Optimization (One-step): TextGNN [79]; AdsGNN [80]; GNN-LM [67]
    - Optimization (Two-step): GIANT [59]; LM-GNN [69]; SimTeG [36]; GaLM [82]
    - Data Augmentation: LLM-GNN [64]; TAPE [71]; ENG [72]
    - Knowledge Distillation: AdsGNN [80]; GraD [70]
  - LLM as Aligner
    - Prediction Alignment: LTRN [58]; GLEM [62]
    - Latent Space Alignment: ConGrat [54]; GRENADE [56]; G2P2 [63]; THLM [34]
• Text-Paired Graphs
  - LLM as Predictor
    - Graph as Sequence: CatBERTa [168]; LLaMA-Mol [169]; LLM4Mol [172]; RT [173]; MolReGPT [174]; ChatMol [175]; MolXPT [178]; LLM-ICL [177]; Text+Chem T5 [180]; MolT5 [123]; KV-PLM [184]; Chemformer [164]; MFBERT [185]; Galatica [187]; SMILES-BERT [188]
    - Graph-Empowered LLM: ReLM [166]; Prot2Text [170]; GIMLET [48]; Text2Mol [122]
  - LLM as Aligner
    - Latent Space Alignment: MolCA [176]; GIT-Mol [167]; MolFM [171]; CLAMP [179]; MoMu-v2 [182]; MoleculeSTM [181]; MoMu [183]

Fig. 2. A taxonomy of LLM on graph scenarios and techniques with representative examples.

To be specific, the alignment between LLMs and GNNs can be categorized into 1) Prediction Alignment, where the generated pseudo labels from one modality are utilized for training on the other modality in an iterative learning fashion, and 2) Latent Space Alignment, where contrastive learning is adopted to align text embeddings generated by LLMs and graph embeddings generated by GNNs.

3.3 Training & Inference Framework with LLMs

There are two typical training and inference paradigms to apply language models on graphs: 1) Pretraining-then-finetuning, typically adopted for medium-scale large language models; and 2) Pretraining-then-prompting, typically adopted for large-scale large language models.

Pretraining denotes training the language model with unsupervised objectives to initialize it with language understanding and inference ability for downstream tasks. Typical pretraining objectives for pure text include masked language modeling [22], auto-regressive causal language modeling [25], corruption-reconstruction language modeling [29] and text-to-text transfer modeling [30]. When extended to the graph domain, language model pretraining strategies include document relation prediction [31], network-contextualized masked language modeling [32], contrastive social prediction [33] and context graph prediction [34].

Finetuning refers to the process of training the language model with labeled data for the downstream tasks. Language model fine-tuning methodology can be further categorized into full fine-tuning, efficient fine-tuning, and instruction tuning.
• Full Finetuning means updating all the parameters inside the language model. It is the most commonly used fine-tuning method that fully stimulates the language model's potential for downstream tasks, but can suffer from heavy computational overhead [37] and result in overfitting issues [36].
• Efficient Finetuning refers to only fine-tuning a subset of parameters inside the language model. Efficient tuning methods for pure text include prompt tuning [38], prefix tuning [39], adapter [40] and LoRA [41] (a LoRA-style sketch follows this list). Efficient language model fine-tuning methods particularly designed for graph data include graph neural prompt [42] and graph-enhanced prefix [43].
• Instruction Tuning denotes fine-tuning the language model with downstream task instructions [44] [45] to encourage model generalization to unseen tasks in inference. It is an orthogonal concept to full fine-tuning and efficient fine-tuning; in other words, one can choose either full fine-tuning or efficient fine-tuning for instruction tuning. Instruction tuning is adopted in the graph domain for node classification [46], link prediction [47], and graph-level tasks [48].
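As a concrete illustration of the efficient fine-tuning idea, here is a minimal LoRA-style adapter sketch in plain PyTorch. It is our own simplified sketch (not the implementation of [41] or of any surveyed method): the pretrained weight is frozen and only a low-rank update is trained.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Wraps a frozen linear layer with a trainable low-rank update (W + scale * B A)."""
        def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():      # freeze the pretrained weights
                p.requires_grad = False
            self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, rank))
            self.scale = alpha / rank

        def forward(self, x):
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    # Usage: replace, e.g., one attention projection and fine-tune only the LoRA parameters.
    layer = LoRALinear(nn.Linear(768, 768))
    trainable = [p for p in layer.parameters() if p.requires_grad]   # only A and B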

Prompting is a technique to apply a language model to downstream task solving without updating the model parameters. One needs to formulate the test samples into natural language sequences and ask the language model to directly conduct inference based on the in-context demonstrations. This is a technique particularly popular for large-scale autoregressive language models. Apart from direct prompting, follow-up works propose chain-of-thought prompting [49], tree-of-thought prompting [50], and graph-of-thought prompting [51].

In the following sections, we will follow our categorization in Section 3 and discuss detailed methodologies for each graph scenario.

4 PURE GRAPHS

Problems on pure graphs provide a fundamental motivation for why and how LLMs are introduced into graph-related reasoning problems. Investigated thoroughly in graph theory, pure graphs serve as a universal representation format for a wide range of classical algorithmic problems in all perspectives in computer science. Many graph-based concepts, such as shortest paths, particular sub-graphs, and flow networks, have strong connections with real-world applications [133]–[135]. Therefore, pure graph-based reasoning is vital in providing theoretical solutions and insights for reasoning problems grounded in real-world applications.

Nevertheless, many reasoning tasks require a computation capacity beyond traditional GNNs. GNNs are typically designed to carry out a bounded number of operations given a graph size. In contrast, graph reasoning problems can require up to indefinite complexity depending on the task's nature. Training conventional GNNs on general reasoning-intensive problems is challenging without prior assumptions and specialized model design. This fundamental gap motivates researchers to seek to incorporate LLMs in graph problems. On the other hand, LLMs have recently demonstrated excellent emergent reasoning ability [49], [110], [111]. This is partially due to their autoregressive mechanism, which enables computing indefinite sequences of intermediate steps with careful prompting or training [49], [50].

The following subsections discuss the attempts to incorporate LLMs into pure graph reasoning problems. We will also discuss these works' challenges, limitations, and findings. Table 2 lists a rough categorization of these efforts. Usually, input graphs are serialized as part of the input sequence, either by verbalizing the graph structure [124]–[126], [128]–[132] or by encoding the graph structure into implicit feature sequences [43]. The studied reasoning problems range from simpler ones like connectivity, shortest paths, and cycle detection to harder ones like maximum flow and Hamiltonian pathfinding (an NP-complete problem). A comprehensive list of the studied problems is given in Table 3. Note that we only list representative problems here. This table does not include more domain-specific problems, such as the spatial-temporal reasoning problems in [128].

4.1 Direct Answering

Although graph-based reasoning problems usually involve complex computation, researchers still attempt to let language models directly generate answers from the serialized input graphs as a starting point or a baseline, partially because of the simplicity of the approach and partially in awe of other emergent abilities of LLMs. This approach can be viewed as a probe of graph understanding of LLMs (in contrast to graph "reasoning"), which tests if LLMs acquire a good enough internal representation to directly "guess" the answers. Although various attempts have been made to optimize how graphs are presented in the input sequence, which we will discuss in the following sections, bounded by the finite sequence length and computational operations, there is a fundamental limitation of this approach to solving complex reasoning problems such as NP-complete ones. Unsurprisingly, most studies find that LLMs possess preliminary graph understanding ability, but the performance is less satisfactory on more complex problems or larger graphs [43], [124]–[126], [128], [131]. In the following, we will discuss the details of these studies based on their input representation methods and their main difference.

Plainly Verbalizing Graphs. Verbalizing the graph structure in natural language is the most straightforward way of representing graphs. Representative approaches include describing the edge and adjacency lists, widely studied in [124], [125], [128], [131]. For example, for a triangle graph with three nodes, the edge list can be written as "[(0, 1), (1, 2), (2, 0)]", which means node 0 is connected to node 1, node 1 is connected to node 2, and node 2 is connected to node 0. It can also be written in natural language such as "There is an edge between node 0 and node 1, an edge between node 1 and node 2, and an edge between node 2 and node 0." On the other hand, we can describe the adjacency list from the nodes' perspective. For example, for the same triangle graph, the adjacency list can be written as "Node 0 is connected to node 1 and node 2. Node 1 is connected to node 0 and node 2. Node 2 is connected to node 0 and node 1." On these inputs, one can prompt LLMs to answer questions either in zero-shot or few-shot (in-context learning) settings: the former directly asks questions given the graph structure, while the latter asks questions about the graph structure after providing a few examples of questions and answers (a prompt-construction sketch is given below). [124]–[126] do confirm that LLMs can answer easier questions such as connectivity, neighbor identification, and graph size counting but fail to answer more complex questions such as cycle detection and Hamiltonian pathfinding. Their results also reveal that providing more examples in the few-shot setting increases the performance, especially on easier problems, although it is still not satisfactory.

Paraphrasing Graphs. The verbalized graphs can be lengthy, unstructured, and complicated to read, even for humans, so they might not be the best input format for LLMs to infer the answers. To this end, researchers also attempt to paraphrase the graph structure into more natural or concise sentences. [126] find that by prompting LLMs to generate a format explanation of the raw graph inputs for itself (Format-Explanation) or to pretend to play a role in a natural task (Role Prompting), the performance on some problems can be improved but not systematically. [131] explores the effect of grounding the pure graph in a real-world scenario, such as social networks, friendship graphs, or co-authorship graphs. In such graphs, nodes are described as people, and edges are relationships between people. Results indicate that encoding in real-world scenarios can improve the performance on some problems, but still not consistently.
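To make the edge-list and adjacency-list formats discussed above concrete, here is a minimal sketch of how such verbalized prompts can be assembled. The helper names and question template are illustrative and not taken from any surveyed paper.

    def edge_list_prompt(edges, question):
        """Verbalize a graph as an edge list, e.g., for the triangle graph above."""
        described = ", ".join(
            f"an edge between node {u} and node {v}" for u, v in edges
        )
        return f"In an undirected graph, there is {described}. {question}"

    def adjacency_list_prompt(edges, question):
        """Verbalize the same graph from each node's perspective."""
        neighbors = {}
        for u, v in edges:
            neighbors.setdefault(u, set()).add(v)
            neighbors.setdefault(v, set()).add(u)
        described = " ".join(
            f"Node {u} is connected to {', '.join(f'node {w}' for w in sorted(ns))}."
            for u, ns in sorted(neighbors.items())
        )
        return f"{described} {question}"

    triangle = [(0, 1), (1, 2), (2, 0)]
    prompt = edge_list_prompt(triangle, "Is there a path between node 0 and node 2?")
    # Few-shot prompting would simply prepend a few (graph, question, answer) examples
    # rendered with the same templates before this query.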

TABLE 2
A collection of LLM reasoning methods on pure graphs discussed in Section 4. We do not include the backbone models used in these methods studied in the original papers, as these methods generally apply to any LLMs. The "Papers" column lists the papers that study the specific methods. Each entry below gives: graph format or encoding; reasoning process; reasoning category; papers.

• Zero-Shot. Verbalized edge or adjacency list; directly answering; Direct Answering; [124]–[126], [128], [131].
• Role Prompting. Verbalized edge or adjacency list; directly answering by designating a specific role to the LLM; Direct Answering; [126].
• Format Explanation. Verbalized edge or adjacency list; encouraging the LLM to explain the input graph format first; Direct Answering; [126].
• GraphLLM. Prefix tokens encoded by a graph encoder; directly answering; Direct Answering; [43].
• Few-Shot (In-Context Learning). Verbalized edge or adjacency lists preceded with a few demonstrative examples; directly answering by following the examples; Direct Answering; [124], [125], [128], [131].
• Chain-of-Thought. Verbalized edge or adjacency lists preceded with a few demonstrative examples; reasoning through a series of intermediate reasoning steps in the generation following the examples; Heuristic Reasoning; [124]–[126], [128], [131], [132].
• Self-Consistency. Verbalized edge or adjacency lists preceded with a few demonstrative examples; reasoning through a series of intermediate reasoning steps in generation, and then selecting the most consistent answer; Heuristic Reasoning; [124].
• Build-a-Graph. Verbalized edge or adjacency list; reconstructing the graph in the output, and then reasoning on the graph; Heuristic Reasoning; [124], [131].
• Context-Summarization. Verbalized edge or adjacency list; directly answering by first summarizing the key elements in the graph; Heuristic Reasoning; [126].
• Reasoning-on-Graph. Retrieved paths from external graphs; first planning the reasoning process in the form of paths to be retrieved and then inferring on the retrieved paths; Heuristic Reasoning; [129].
• Iterative Reading-then-Reasoning. Retrieved neighboring edges or nodes from external graphs; iteratively retrieving neighboring edges or nodes from external graphs and inferring from the retrieved information; Heuristic Reasoning; [130], [132].
• Algorithmic Reasoning. Verbalized edge or adjacency list; simulating the reasoning process of a relevant algorithm in a generation; Algorithmic Reasoning; [124].
• Calling APIs. External knowledge base; generating the reasoning process as (probably nested) API calls to be executed externally on the knowledge base; Algorithmic Reasoning; [127], [132].
Encoding Graphs Into Implicit Feature Sequences. Finally, researchers also attempt to encode the graph structure into implicit feature sequences as part of the input sequence [43]. Unlike the previous verbalizing approaches, this usually involves training a graph encoder to encode the graph structure into a sequence of features and fine-tuning the LLMs to adapt to the new input format. [43] demonstrates drastic performance improvement on problems including substructure counting, maximum triplet sum, shortest path, and bipartite matching, evidence that fine-tuning LLMs has great fitting power on a specific task distribution.

4.2 Heuristic Reasoning

Direct mapping to the output leverages the LLMs' strong representation power to "guess" the answers. Still, it does not fully utilize the LLMs' impressive emergent reasoning ability, which is essential for solving complex reasoning problems. To this end, attempts have been made to let LLMs perform heuristic reasoning on graphs. This approach encourages LLMs to perform a series of intermediate reasoning steps that might heuristically lead to the correct answer.

Reasoning Step by Step. Encouraged by the success of chain-of-thought (CoT) reasoning [49], [111], researchers also attempt to let LLMs perform reasoning step by step on graphs. Chain-of-thought encourages LLMs to roll out a sequence of reasoning steps to solve a problem, similar to how humans solve problems. It usually incorporates a few demonstrative examples to guide the reasoning process. Zero-shot CoT is a similar approach that does not require any examples. These techniques are studied in [43], [124]–[126], [128], [131], [132]. Results indicate that CoT-style reasoning can improve the performance on simpler problems, such as cycle detection and shortest path. Still, the improvement is inconsistent or diminishes on more complex problems, such as Hamiltonian pathfinding and topological sorting.

Retrieving Subgraphs as Evidence. Many graph reasoning problems, such as node degree counting and neighborhood detection, only involve reasoning on a subgraph of the whole graph. Such properties allow researchers to let LLMs retrieve the subgraphs as evidence and perform reasoning on the subgraphs. Build-a-Graph prompting [124] encourages LLMs to reconstruct the relevant graph structures to the questions and then perform reasoning on them. This method demonstrates promising results on problems except for Hamiltonian pathfinding, a notoriously tricky problem requiring reasoning on the whole graph. Another approach, Context-Summarization [126], encourages LLMs to summarize the key nodes, edges, or sub-graphs and perform reasoning. They evaluate only on node classification, and results show improvement when combined with CoT-style reasoning, an intuitive outcome considering the local nature of the node classification problem.

Searching on Graphs. This kind of reasoning is related to the search algorithms on graphs, such as breadth-first search (BFS) and depth-first search (DFS). Although not universally applicable, BFS and DFS are the most intuitive and effective ways to solve some graph reasoning problems. Numerous explorations have been made to simulate searching-based reasoning, especially on knowledge-graph question answering.

TABLE 3
A collection of pure graph reasoning problems studied in Section 4. G = (V, E) denotes a graph with vertices V and edges E. v and e denote individual vertices and edges, respectively. The "Papers" column lists the papers that study the problem using LLMs. The "Complexity" column lists the time complexity of standard algorithms for the problem, ignoring more advanced but complex algorithms that are not comparable to LLMs' reasoning processes. Each entry below gives: definition; applications; typical complexity; papers.

• Connectivity. Given a graph G and two nodes u and v, tell if they are connected by a path. Applications: Relationship Detection, Link Prediction. Complexity: O(|E|) or O(|V|^2). Papers: [124], [125].
• Neighbor Detection. Given a graph G and a node v, find the nodes connected to v. Applications: Recommendation, Knowledge QA. Complexity: O(min(|E|, |V|)). Papers: [126].
• Node Degree. Given a graph G and a node v, find the number of edges connected to v. Applications: Entity Popularity, Importance Ranking. Complexity: O(min(|E|, |V|)). Papers: [125], [126].
• Attribute Retrieval. Given a graph G with node-level information and a node v, return the attribute of v. Applications: Recommendation, Node Classification, Node QA. Complexity: O(1). Papers: [126].
• Graph Size. Given a graph G, find the number of nodes and edges. Applications: Graph-level Classification. Complexity: O(|V| + |E|). Papers: [126].
• Cycle Detection. Given a graph G, tell if it contains a cycle. Applications: Loop Elimination, Program Loop Detection. Complexity: O(|V|). Papers: [124].
• Diameter. Given a graph G, find the diameter of G. Applications: Graph-level Classification, Clustering. Complexity: O(|V|^3) or O(|V|^2 log|V| + |V||E|). Papers: [126].
• Topological Sort. Given a directed acyclic graph G, find a topological ordering of its vertices so that for every edge (u, v), u comes before v in the ordering. Applications: Timeline Generation, Dependency Parsing, Scheduling. Complexity: O(|V| + |E|). Papers: [124].
• Wedge or Triangle Detection. Given a graph G and a vertex v, identify if there is a wedge or triangle centered at v. Applications: Relationship Detection, Link Prediction. Complexity: O(|V| + |E|). Papers: [125].
• Maximum Triplet Sum. Given a graph G, find the maximum sum of the weights of three vertices that are connected. Applications: Community Detection. Complexity: O(|V|^3). Papers: [43].
• Shortest Path. Given a graph G and two nodes u and v, find the shortest path between u and v. Applications: Navigation, Planning. Complexity: O(|E|) or O(|V|^2). Papers: [43], [124], [125].
• Maximum Flow. Given a directed graph G with a source node s and a sink node t, find the maximum flow from s to t. Applications: Transportation Planning, Network Design. Complexity: O(|V||E|^2), O(|E||V| log|V|) or O(|V|^3). Papers: [124].
• Bipartite Graph Matching. Given a bipartite graph G with two disjoint sets of vertices V1 and V2, find a matching between V1 and V2 that maximizes the number of matched pairs. Applications: Recommendation, Resource Allocation, Scheduling. Complexity: O(|E|√|V|). Papers: [43], [124].
• Graph Neural Networks. Given a graph G with node features X of dimension d, simulate a graph neural network with l layers and return the encoded node features. Applications: Node Classification, Graph-level Classification. Complexity: O(l d |V|^2). Papers: [124].
• Clustering Coefficient. Given a graph G, find the clustering coefficient of G. Applications: Community Detection, Node Clustering. Complexity: O(|V|^3). Papers: [126].
• Substructure Counting. Given a graph G and a subgraph G′, count the number of occurrences of G′ in G. Applications: Pattern Matching, Subgraph Detection, Abnormality Detection. Complexity: NP-Complete. Papers: [43].
• Hamilton Path. Given a graph G, find a path that visits every vertex exactly once. Applications: Route Planning, Drilling Machine Planning, DNA Sequencing. Complexity: NP-Complete. Papers: [124].
• (Knowledge) Graph QA. Given a (knowledge) graph G and a question q, find the answer to q. Applications: Dialogue System, Smart Assistant, Recommendation. Complexity: not listed. Papers: [126], [129]–[132].
• Graph Query Language Generation. Given a graph G and a query q, generate a query language that can be used to query G. Applications: Graph Summarization, FAQ Generation, Query Suggestions. Complexity: not listed. Papers: [126].
• Node Classification. Given a graph G, predict the class of a node v. Applications: Recommendation, User Profiling, Abnormality Detection. Complexity: not listed. Papers: [126], [127].
• Graph Classification. Given a graph G, predict the class of G. Applications: Molecule Property Prediction, Molecule QA, Graph QA. Complexity: not listed. Papers: [126], [127].
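For reference, ground-truth answers for the simpler tasks in Table 3 (e.g., connectivity and shortest path on unweighted graphs) can be generated with a standard breadth-first search. The sketch below is an evaluation oracle of this kind, not an LLM technique from the surveyed papers.

    from collections import deque

    def bfs_shortest_path(edges, source, target):
        """Unweighted shortest-path length between source and target; None if disconnected.
        Runs in O(|V| + |E|)."""
        adj = {}
        for u, v in edges:
            adj.setdefault(u, []).append(v)
            adj.setdefault(v, []).append(u)
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            if u == target:
                return dist[u]
            for w in adj.get(u, []):
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        return None

    # Connectivity is then simply: bfs_shortest_path(edges, u, v) is not None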

This approach enjoys the advantage of providing interpretable evidence besides the answer. Reasoning-on-Graphs (RoG) [129] is a representative approach that prompts LLMs to generate several relation paths as plans, which are then retrieved from the knowledge graph (KG) and used as evidence to answer the questions. Another approach is to iteratively retrieve and reason on the subgraphs from the KG [130], [132], simulating a dynamic searching process. At each step, the LLMs retrieve neighbors of the current nodes and then decide to answer the question or continue the next search step.
search step.
algorithms. Their results, however, do not show consistent
improvement over the heuristic reasoning approach, such

This might be because the algorithmic reasoning approach is still complex to simulate without more careful techniques. A more direct approach, Graph-ToolFormer [127], lets LLMs generate API calls as explicit reasoning steps. These API calls are then executed externally to acquire answers on an external graph. This approach is suitable for converting tasks grounded in real applications into pure graph reasoning problems, demonstrating efficacy on various applications such as knowledge graphs, social networks, and recommender systems.

4.4 Discussion

The above approaches are not mutually exclusive, and they can be combined to achieve better performance. Moreover, strictly speaking, heuristic reasoning can also conduct direct answering, while algorithmic reasoning contains the capacity of heuristic reasoning as a special case. Researchers are advised to select the most suitable approach for a specific problem. For example, direct answering is suitable for problems that are easy to solve and where the pre-training dataset provides sufficient bias for a good guess, such as common entity classification and relationship detection. Heuristic reasoning is suitable for problems that are hard to deal with explicitly, but where intuition can provide some guidance, such as graph-based question answering and knowledge graph reasoning. Algorithmic reasoning is suitable for problems that are hard to solve but where the algorithmic solution is well-defined, such as route planning and pattern matching.

5 TEXT-RICH GRAPHS

Graphs with node/edge-level textual information (text-rich graphs) exist ubiquitously in the real world, e.g., academic networks, social networks, and legal case networks. Learning on such networks requires the model to encode both the textual information associated with the nodes/edges and the structure information lying inside the input graph. Depending on the role of the LLM, existing works can be categorized into three types: LLM as Predictor, LLM as Encoder, and LLM as Aligner. We summarize all surveyed methods in Table 5.

5.1 LLM as Predictor

These methods serve the language model as the main model architecture to capture both the text information and graph structure information. They can be categorized into three types: Graph as Sequence methods, Graph-Empowered LLMs, and Graph-Aware LLM finetuning methods, depending on how structure information in graphs is injected into language models (input vs. architecture vs. loss). In the Graph as Sequence methods, graphs are converted into sequences that can be understood by language models together with texts from the inputs. In the Graph-Empowered LLMs methods, people modify the architecture of Transformers (which is the base architecture for LLMs) to enable it to encode text and graph structure simultaneously. In the Graph-Aware LLM finetuning methods, the LLM is fine-tuned with graph structure supervision and can generate graph-contextualized representations.

5.1.1 Graph as Sequence.

In these methods, the graph information is mainly encoded into the LLM from the "input" side. The ego-graphs associated with nodes/edges are serialized into a sequence HGv which can be fed into the LLM together with the texts dv:

    HGv = Graph2Seq(Gv),    (7)
    hv = LLM([HGv, dv]).    (8)

Depending on the choice of the Graph2Seq(·) function, the methods can be further categorized into rule-based methods and GNN-based methods. The illustration of the categories can be found in Fig. 3.

Rule-based: Linearizing Graphs into Text Sequence with Rules. These methods design rules to describe the structure with natural language and adopt a text prompt template as Graph2Seq(·). For example, given an ego-graph Gvi of the paper node vi connecting to author nodes vj and vk and venue nodes vt and vs, HGvi = Graph2Seq(Gvi) = "The center paper node is vi. Its author neighbor nodes are vj and vk and its venue neighbor nodes are vt and vs". This is the most straightforward and easiest way (without introducing extra model parameters) to encode graph structures into language models. Along this line, InstructGLM [47] designs templates to describe local ego-graph structure (maximum 3-hop connection) for each node and conducts instruction tuning for node classification and link prediction. GraphText [66] further proposes a syntax tree-based method to structure the graph into a text sequence. Researchers [84] also study when and why the linearized structure information on graphs can improve the performance of LLM on node classification and find that the structure information is beneficial when the textual information associated with the node is scarce.

GNN-based: Encoding Graphs into Special Tokens with GNNs. Different from rule-based methods which use natural language prompts to linearize graphs into sequences, GNN-based methods adopt graph encoder models (i.e., GNNs) to encode the ego-graph associated with nodes into special token representations which are concatenated with the pure text information into the language model:

    HGv = Graph2Seq(Gv) = GraphEnc(Gv).    (9)

The strength of these methods is that they can capture the hidden representations of useful structure information with a strong graph encoder, while the challenge is how to fill the gap between graph modality and text modality. GNP [42] adopts a similar philosophy from LLaVA [92], where they utilize a GNN to generate graph tokens and then project the graph tokens into the text token space with learnable projection matrices (see the sketch at the end of this subsection). The projected graph tokens are concatenated with text tokens and fed into the language model. GraphGPT [46] further proposes to train a text-grounded GNN for the projection with a text encoder and contrastive learning. DGTL [77] introduces disentangled graph learning, serves graph representations as positional encoding, and adds them to the text sequence. METERN [76] adds learnable relation embeddings to node textual sequences for text-based multiplex representation learning on graphs [93].



[Figure 3 sketches the three designs with toy inputs: (a) a rule-based prompt such as "Center: v1; 1-hop neighbors: v2, v3; ..." followed by "What is the category of the node?"; (b) a GNN-encoded graph token sequence (e.g., for the molecule SMILES "CNC(C)C1=CC=C2C(=C1)OCO2") concatenated with the question; (c) joint encoding of the word sequence and the graph hidden state sequence inside the Transformer layers.]

Fig. 3. The illustration of various LLM as Predictor methods, including (a) Rule-based Graph as Sequence, (b) GNN-based Graph as Sequence, and (c) Graph-Empowered LLMs.

5.1.2 Graph-Empowered LLMs.

In these methods, researchers design advanced LLM architectures (i.e., Graph-Empowered LLMs) which can conduct joint text and graph encoding inside their model architecture. Transformers [94] serve as the base model for nowadays pretrained LMs [22] and LLMs [37]. However, they are designed for natural language (sequence) encoding and do not take non-sequential structure information into consideration. To this end, Graph-Empowered LLMs are proposed. They have a shared philosophy of introducing virtual structure tokens HGv inside each Transformer layer:

    H̃^(l)_{dv} = [H^(l)_{Gv}, H^(l)_{dv}],    (10)

where HGv can be learnable embeddings or output from graph encoders. Then the original multi-head attention (MHA) in Transformers is modified into an asymmetric MHA to take the structure tokens into consideration:

    MHA_asy(H^(l)_{dv}, H̃^(l)_{dv}) = ∥_{u=1..U} head_u(H^(l)_{dv}, H̃^(l)_{dv}),
    where head_u(H^(l)_{dv}, H̃^(l)_{dv}) = softmax( Q^(l)_u K̃^(l)⊤_u / √(d/U) ) · Ṽ^(l)_u,
    Q^(l)_u = H^(l)_{dv} W^(l)_{Q,u},  K̃^(l)_u = H̃^(l)_{dv} W^(l)_{K,u},  Ṽ^(l)_u = H̃^(l)_{dv} W^(l)_{V,u}.    (11)

With the asymmetric MHA mechanism, the node encoding process of the (l+1)-th layer will be:

    H′^(l)_{dv} = Normalize(H^(l)_{dv} + MHA_asy(H^(l)_{dv}, H̃^(l)_{dv})),
    H^(l+1)_{dv} = Normalize(H′^(l)_{dv} + MLP(H′^(l)_{dv})).    (12)

Along this line of work, GreaseLM [68] proposes to have a language encoding component and a graph encoding component in each layer. The two components interact through an MInt layer, where a special structure token is added to the text Transformer input, and a special node is added to the graph encoding layer. DRAGON [83] further proposes strategies to pretrain GreaseLM with unsupervised signals. GraphFormers [73] are designed for node representation learning on homogeneous text-attributed networks where the current layer [CLS] token hidden states of neighboring documents are aggregated and added as a new token on the current layer center node text encoding. Patton [32] further proposes to pretrain GraphFormers with two novel strategies: network-contextualized masked language modeling and masked node prediction. Heterformer [74] is introduced for learning representations on heterogeneous text-attributed networks where some nodes are associated with text (text-rich) and others are not (textless). Virtual neighbor tokens for text-rich neighbors and textless neighbors are concatenated with the original text tokens and inputted into each Transformer layer. Edgeformers [75] are proposed for representation learning on textual-edge networks where edges are associated with rich textual information. When conducting edge encoding, virtual node tokens will be concatenated onto the original edge text tokens for joint encoding.

5.1.3 Graph-Aware LLM finetuning.

In these methods, the graph information is mainly injected into the LLM by "fine-tuning on graphs". Researchers assume that the structure of graphs can provide hints on what documents are "semantically similar" to what other documents. For example, papers citing each other in an academic graph can be of similar topics; items co-purchased by many users in an e-commerce graph can be of related functions. These methods adopt vanilla language models that take text as input (e.g., BERT [22] and SciBERT [24]) as the base model and fine-tune them with structure signals on the graph. After that, the LLMs will learn node/edge representations that capture the graph homophily from the text perspective.

Most methods adopt the two-tower encoding and training pipeline, where the representation of each node is obtained separately:

    hvi = LLMθ(dvi),    (13)

and the model is optimized by

    min_θ f(hvi, {hvi+}, {hvi−}).    (14)

Here vi+ represents the positive nodes to vi, vi− represents the negative nodes to vi, and f(·) denotes the pairwise training objective. Different methods have different strategies for vi+ and vi− with different training objectives f(·). SPECTER [52] constructs the positive text/node pairs with the citation relation, explores random negatives and structure hard negatives, and fine-tunes SciBERT [24] with the triplet loss. SciNCL [53] extends SPECTER by introducing more advanced positive and negative sampling methods based on embeddings trained on graphs. Touchup-G [55] proposes the measurement of feature homophily on graphs and brings up a binary cross-entropy fine-tuning objective. TwHIN-BERT [57] mines positive node pairs with off-the-shelf heterogeneous information network embeddings and trains the model with a contrastive social loss.

MICoL [60] discovers semantically positive node pairs with meta-path [91] and adopts the InfoNCE objective. E2EG [61] utilizes a similar philosophy from GIANT [59] and adds a neighbor prediction objective apart from the downstream task objective. A summarization of the two-tower graph-centric LLM fine-tuning objectives can be found in Table 4.

TABLE 4
A summarization of Graph-Aware LLM finetuning objectives on text-rich graphs. v_i^+ and v_i^- denote a positive training node and a negative training node to v_i respectively.

Method | Positive v_i^+ | Negative v_i^- | Objective f(·)
SPECTER [52] | (v_i, v_i^+) ∈ E | (v_i, v_i^-) ∉ E; (v_i, v_u) ∈ E, (v_u, v_i^-) ∈ E, (v_i, v_i^-) ∉ E | max{||h_{v_i} − h_{v_i^+}||_2 − ||h_{v_i} − h_{v_i^-}||_2 + m, 0}
SciNCL [53] | ||h_{v_i} − h_{v_i^+}||_2 ∈ (k^+ − c^+; k^+] | ||h_{v_i} − h_{v_i^-}||_2 ∈ (k_hard^- − c_hard^-; k_hard^-] | max{||h_{v_i} − h_{v_i^+}||_2 − ||h_{v_i} − h_{v_i^-}||_2 + m, 0}
Touchup-G [55] | (v_i, v_i^+) ∈ E | (v_i, v_i^-) ∉ E | log(h_{v_i} · h_{v_i^+}) + log(1 − h_{v_i} · h_{v_i^-})
TwHIN-BERT [57] | cos(x_{v_i}, x_{v_i^+}) < k | in-batch random | −log [ exp(cos(h_{v_i}, h_{v_i^+})/η) / Σ_{v_i^-} exp(cos(h_{v_i}, h_{v_i^-})/η) ]
MICoL [60] | v_i^+ ∈ N_M(v_i) | in-batch random | −log [ exp(cos(h_{v_i}, h_{v_i^+})/η) / Σ_{v_i^-} exp(cos(h_{v_i}, h_{v_i^-})/η) ]
E2EG [61] | (v_i, v_i^+) ∈ E | (v_i, v_i^-) ∉ E | log(h_{v_i} · h_{v_i^+}) + Σ_{v_i^-} log(1 − h_{v_i} · h_{v_i^-})
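To make the two-tower pipeline of Eq. (13)-(14) and the objectives in Table 4 concrete, the following is a minimal illustrative sketch of graph-aware fine-tuning with a SPECTER-style triplet objective. The checkpoint name, margin, and the way anchor/positive/negative texts are mined from graph edges are placeholder assumptions rather than the exact setup of any surveyed method.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# Placeholder encoder; any BERT-style checkpoint can serve as LLM_theta in Eq. (13).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
optimizer = torch.optim.AdamW(encoder.parameters(), lr=2e-5)

def encode(texts):
    """Two-tower encoding: each node document is encoded independently (Eq. 13)."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0]  # [CLS] vector as the node representation

def training_step(anchor_texts, positive_texts, negative_texts, margin=1.0):
    """Pairwise objective f(.) of Eq. (14), here a triplet margin loss (Table 4, SPECTER row)."""
    h_a = encode(anchor_texts)
    h_p = encode(positive_texts)   # e.g., texts of cited / linked neighbors
    h_n = encode(negative_texts)   # e.g., texts of unlinked (random or hard) nodes
    loss = F.triplet_margin_loss(h_a, h_p, h_n, margin=margin)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy batch with made-up documents standing in for graph-mined triples.
training_step(["abstract of paper A"],
              ["abstract of a paper cited by A"],
              ["abstract of an unrelated paper"])
```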

There are other methods using the one-tower pipeline, where node pairs are concatenated and encoded together:

  h_{v_i, v_j} = LLM_θ(d_{v_i}, d_{v_j}),   (15)
  min_θ f(h_{v_i, v_j}).   (16)

LinkBERT [31] proposes a document relation prediction objective (an extension of next sentence prediction in BERT [22]) which aims to classify the relation of two node text pairs as contiguous, random, or linked. MICoL [60] explores predicting the node pairs' binary meta-path or meta-graph indicated relation with the one-tower language model.

5.1.4 Discussion
We provide simple guidance on the selection of the methods above when facing real-world problems, based on the task: 1) Representation Learning: Graph-Aware LLM fine-tuning (more efficient but less effective) and Graph-Empowered LLMs (less efficient but more effective); 2) Generation: Rule-based Graph As Sequence (supports zero-shot inference, limited expressiveness) and GNN-based Graph As Sequence (needs training, stronger expressiveness). Although the community is making good progress, there are still some open questions to be solved.

Graph as Code Sequence. Existing graph-as-sequence methods are mainly rule-based or GNN-based. The former relies on natural language to describe the graphs, which is not natural for structured data, while the latter has a GNN component that needs to be trained. A more promising way is to obtain a structure-aware sequence for graphs that can support zero-shot inference. A potential solution is to adopt codes (that can capture structures) to describe the graphs and utilize code LLMs [21].

Advanced Graph-Empowered LLM techniques. Graph-empowered LLM is a promising direction to achieve foundational models for graphs. However, existing works are far from enough: 1) Task. Existing methods are mainly designed for representation learning (with encoder-only LLMs) which are hard to adopt for generation tasks. A potential solution is to design Graph-Empowered LLMs with decoder-only or encoder-decoder LLMs as the base architecture. 2) Pretraining. Pretraining is important to enable LLMs with contextualized data understanding capability, which can be generalized to other tasks. However, existing works mainly focus on pretraining LLMs on homogeneous text-rich networks. Future studies are needed to explore LLM pretraining in more diverse real-world scenarios including heterogeneous text-rich networks [74], dynamic text-rich networks [128], and textual-edge networks [75].

5.2 LLM as Encoder
LLMs extract textual features to serve as initial node feature vectors for GNNs, which then generate node/edge representations and make predictions. These methods typically adopt an LLM-GNN cascaded architecture to obtain the final representation h_{v_i} for node v_i:

  x_{v_i} = LLM(d_{v_i}),   (17)
  h_{v_i} = GNN(X_v, G).   (18)

Here x_{v_i} is the feature vector that captures the textual information d_{v_i} associated with v_i. The final representation h_{v_i} will contain both textual information and structure information of v_i and can be used for downstream tasks. In the following sections, we will discuss the optimization, augmentation, and distillation of such models. The illustrations of these techniques can be found in Fig. 4.

5.2.1 Optimization
One-step training refers to training the LLM and GNN together in the cascaded architecture for the downstream tasks. TextGNN [79] explores GCN [85], GraphSAGE [86], and GAT [87] as the base GNN architecture, adds a skip connection between the LLM output and the GNN output, and optimizes the whole architecture for the sponsored search task. AdsGNN [80] further extends TextGNN by proposing edge-level information aggregation. GNN-LM [67] adds GNN layers to enable the vanilla language model to reference similar contexts in the corpus for language modeling. Jointly training LLMs and GNNs in a cascaded pipeline is straightforward but may suffer from efficiency [69] (only supporting the sampling of a few one-hop neighbors due to memory complexity) and local minima [36] (the LLM underfits the data) issues.
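As a concrete reference for the cascaded architecture of Eq. (17)-(18) and the one-step (joint) training just described, below is a minimal sketch with assumed components (a BERT-style text encoder and a two-layer GCN). It is an illustrative composition, not a reproduction of TextGNN, AdsGNN, or GNN-LM.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
from torch_geometric.nn import GCNConv

class LLMGNNCascade(torch.nn.Module):
    """x_v = LLM(d_v); h_v = GNN(X, G) (Eq. 17-18), optimized end to end."""

    def __init__(self, lm_name="bert-base-uncased", hidden=128, num_classes=2):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(lm_name)
        self.lm = AutoModel.from_pretrained(lm_name)
        self.conv1 = GCNConv(self.lm.config.hidden_size, hidden)
        self.conv2 = GCNConv(hidden, num_classes)

    def forward(self, node_texts, edge_index):
        batch = self.tokenizer(node_texts, padding=True, truncation=True,
                               return_tensors="pt")
        x = self.lm(**batch).last_hidden_state[:, 0]   # textual node features (Eq. 17)
        h = F.relu(self.conv1(x, edge_index))          # structure-aware propagation (Eq. 18)
        return self.conv2(h, edge_index)               # per-node class logits

# Toy graph: three nodes with placeholder texts and two undirected edges.
texts = ["text of node 0", "text of node 1", "text of node 2"]
edge_index = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]])
labels = torch.tensor([0, 1, 0])

model = LLMGNNCascade()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss = F.cross_entropy(model(texts, edge_index), labels)
loss.backward()          # gradients flow into both the GNN and the LM (one-step training)
optimizer.step()
```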
Fig. 4. The illustration of various techniques related to LLM as Encoder, including (a) One-step Training, (b) Two-step Training, (c) Data Augmentation, and (d) Knowledge Distillation.

Two-step training means first adapting LLMs to the graph, and then finetuning the whole LLM-GNN cascaded pipeline. This strategy can effectively alleviate the insufficient training of the LLM, which contributes to higher text representation quality. GIANT [59] proposes to conduct neighborhood prediction with the use of XR-Transformers [81] and results in an LLM that can output better feature vectors than bag-of-words and vanilla BERT [22] embeddings for node classification. LM-GNN [69] introduces graph-aware pre-fine-tuning to warm up the LLM on the given graph before fine-tuning the whole LLM-GNN pipeline, demonstrating significant performance gains. SimTeG [36] finds that the simple framework of first training the LLMs on the downstream task and then fixing the LLMs and training the GNNs can result in outstanding performance. They further find that using an efficient fine-tuning method, e.g., LoRA [41], to tune the LLM can alleviate overfitting issues. GaLM [82] explores ways to pretrain the LLM-GNN cascaded architecture.

5.2.2 Data Augmentation
With its demonstrated zero-shot capability [44], LLMs can be used for data augmentation to generate additional text data for the LLM-GNN cascaded architecture. The philosophy of using LLMs to generate pseudo data is widely explored in NLP [90]. LLM-GNN [64] proposes to conduct zero-shot node classification on text-attributed networks by labeling a few nodes and using the pseudo labels to fine-tune GNNs. TAPE [71] presents a method that uses an LLM to generate prediction text and explanation text, which serve as augmented text data compared with the original text data. A following medium-scale language model is adopted to encode the texts and output features for augmented texts and original text respectively before feeding into GNNs. ENG [72] brings forward the idea of generating labeled nodes for each category, adding edges between labeled nodes and other nodes, and conducting semi-supervised GNN learning for node classification.

5.2.3 Knowledge Distillation
The LLM-GNN cascaded pipeline is capable of capturing both text information and structure information. However, the pipeline suffers from time complexity issues during inference, since GNNs need to conduct neighbor sampling and LLMs need to encode the text associated with both the center node and its neighbors. A straightforward solution is to serve the LLM-GNN cascade pipeline as the teacher model and distill it into an LLM as the student model. In this case, during inference, the model (which is a pure LLM) only needs to encode the text on the center node and avoids time-consuming neighbor sampling. AdsGNN [80] proposes an L2-loss to force the outputs of the student model to preserve topology after the teacher model is trained. GraD [70] introduces three strategies including the distillation objective and task objective to optimize the teacher model and distill its capability to the student model.

5.2.4 Discussion
Given that GNNs are demonstrated as powerful models in encoding graphs, “LLMs as encoders” seems to be the most straightforward way to utilize LLMs on graphs. Although we have discussed much research on “LLMs as encoders”, there are still open questions to be solved.
Limited Task: Go Beyond Representation Learning. Current “LLMs as encoders” methods or LLM-GNN cascaded architectures are mainly focusing on representation learning, given the single embedding propagation-aggregation mechanism of GNNs, which prevents them from being adopted for generation tasks (e.g., node/text generation). A potential solution to this challenge can be to conduct GNN encoding for LLM-outputted token-level representations and to design proper decoders that can perform generation based on the LLM-GNN cascaded model outputs.
Low Efficiency: Advanced Knowledge Distillation. The LLM-GNN cascaded pipeline suffers from time complexity issues since the model needs to conduct neighbor sampling and then embedding encoding for each neighboring node. Although there are methods that explore distilling the learned LLM-GNN model into an LLM model for fast inference, they are far from enough given that the inference of the LLM itself is time-consuming. A potential solution is to distill the model into a much smaller LM or even an MLP. Similar methods [88] have been proven effective in GNN-to-MLP distillation and are worth exploring for the LLM-GNN cascaded pipeline as well.

5.3 LLM as Aligner
These methods contain an LLM component for text encoding and a GNN component for structure encoding. The two components are served equally and trained iteratively or in parallel.

LLMs and GNNs can mutually enhance each other since the LLMs can provide textual signals to GNNs, while the GNNs can deliver structure information to LLMs. According to how the LLM and the GNN interact, these methods can be further categorized into: LLM-GNN Prediction Alignment and LLM-GNN Latent Space Alignment. The illustration of the two categories of methods can be found in Fig. 5.

Fig. 5. The illustration of LLM as Aligner methods, including (a) LLM-GNN Prediction Alignment and (b) LLM-GNN Latent Space Alignment.

5.3.1 LLM-GNN Prediction Alignment
This refers to training the LLM with the text data on a graph and training the GNN with the structure data on a graph iteratively. The LLM will generate labels for nodes from the text perspective and serve them as pseudo-labels for GNN training, while the GNN will generate labels for nodes from the structure perspective and serve them as pseudo-labels for LLM training. By this design, the two modality encoders can learn from each other and contribute to a final joint text and graph encoding. In this direction, LTRN [58] proposes a novel GNN architecture with personalized PageRank [95] and an attention mechanism for structure encoding while adopting BERT [22] as the language model. The pseudo labels generated by the LLM and GNN are merged for the next iteration of training. GLEM [62] formulates the iterative training process into a pseudo-likelihood variational framework, where the E-step is to optimize the LLM and the M-step is to train the GNN.

5.3.2 LLM-GNN Latent Space Alignment
This denotes connecting text encoding (LLM) and structure encoding (GNN) with cross-modality contrastive learning [96]:

  h_{d_{v_i}} = LLM(d_{v_i}),  h_{v_i} = GNN(G_v),   (19)
  l(h_{d_{v_i}}, h_{v_i}) = Sim(h_{d_{v_i}}, h_{v_i}) / Σ_{j≠i} Sim(h_{d_{v_i}}, h_{v_j}),   (20)
  L = (1 / (2|G|)) Σ_{v_i ∈ G} ( l(h_{d_{v_i}}, h_{v_i}) + l(h_{v_i}, h_{d_{v_i}}) ).   (21)

A similar philosophy is widely used in vision-language joint modality learning [97]. Along this line of works, ConGraT [54] adopts GAT [87] as the graph encoder and tries MPNet [35] and DistillGPT [20] as the language model encoder. They extend the original InfoNCE loss by adding graph-specific elements about the most likely second, third, and further choices for the nodes a text comes from and the texts a node produces. In addition to the node-level multi-modality contrastive objective, GRENADE [56] proposes KL-divergence-based neighbor-level knowledge alignment: minimize the neighborhood similarity distribution calculated between the LLM and the GNN. G2P2 [63] further extends node-text contrastive learning by adding text-summary interaction and node-summary interaction. Then, they introduce using label texts in the text modality for zero-shot classification, and using soft prompts for few-shot classification. THLM [34] proposes to pretrain the language model by contrastive learning with a heterogeneous GNN on heterogeneous text-attributed networks. The pretrained LLM can be fine-tuned on downstream tasks.

5.3.3 Discussion
In “LLMs as Aligners” methods, most research is adopting shallow GNNs (e.g., GCN, GAT, with thousands of parameters) to be the graph encoders that are aligned with LLMs through iterative training (i.e., prediction alignment) or contrastive training (i.e., latent space alignment). Although LLMs (with millions or billions of parameters) have strong expressive capability, the shallow GNNs (with limited representative capability) can constrain the mutual learning effectiveness between LLMs and GNNs. A potential solution is to adopt GNNs which can be scaled up [89]. Furthermore, deeper research to explore what is the best model size combination for LLMs and GNNs in such an “LLMs as Aligners” LLM-GNN mutual enhancement framework is very important.

6 TEXT-PAIRED GRAPHS
Graphs are prevalent data objects in scientific disciplines such as cheminformatics [194], material informatics [190], bioinformatics [149], computer vision [150], and quantum computing [151]. Within these diverse fields, graphs frequently come paired with critical graph-level text information. For instance, molecular graphs in cheminformatics are annotated with text properties such as toxicity, water solubility, and permeability properties [190], [194], [214]. Research on such graphs (scientific discovery) could be accelerated by the text information and the adoption of LLMs. In this section, we review the application of LLMs on text-captioned graphs with a focus on molecular graphs. According to the technique categorization in Section 3.2, we begin by investigating methods that utilize LLMs as Predictor. Then, we discuss methods that align GNNs with LLMs. We summarize all surveyed methods in Table 6.

6.1 LLM as Predictor
In this subsection, we review how to conduct “LLM as Predictor” for graph-level tasks. Existing methods can be categorized into Graph as Sequence (treat graph data as sequence input) and Graph-Empowered LLMs (design model architecture to encode graphs).

6.1.1 Graph as Sequence
For text-paired graphs, we have three steps to utilize existing LLMs for graph inputs. Step 1: Linearize graphs into sequences with rule-based methods. Step 2: Tokenize the linearized sequence. Step 3: Train/finetune different LLMs (e.g., encoder-only, encoder-decoder, decoder-only) for specific tasks. We will discuss each step as follows.

TABLE 5
Summary of large language models on text-rich graphs. Role of LM: “TE”, “SE”, “ANN” and “AUG” denote text encoder, structure encoder, annotator
(labeling the node/edges), and augmentator (conduct data augmentation). Task: “NC”, “UAP”, “LP”, “Rec”, “QA”, “NLU”, “EC”, “LM”, “RG” denote
node classification, user activity prediction, link prediction, recommendation, question answering, natural language understanding, edge
classification, language modeling, and regression task.

Approach Time Category Role of LM LM Size Focus Task


GNN-LM [67] 2021.10 LLM as Encoder TE 237M Task LM
GIANT [59] 2021.11 LLM as Encoder TE 110M Task NC
TextGNN [79] 2022.1 LLM as Encoder TE 110M Task Search
AdsGNN [80] 2022.4 LLM as Encoder TE 110M Task Search
LM-GNN [69] 2022.6 LLM as Encoder TE 110M Efficiency NC, LP, EC
GraD [70] 2023.4 LLM as Encoder TE 110M/66M Efficiency LP, NC
TAPE [71] 2023.5 LLM as Encoder TE, AUG 129M/GPT-3.5 Task NC
SimTeG [36] 2023.8 LLM as Encoder TE 80M/355M Task NC, LP
LLM-GNN [64] 2023.10 LLM as Encoder ANN GPT-3.5 Task NC
ENG [72] 2023.10 LLM as Encoder TE, AUG 80M/GPT-3.5 Task NC
SPECTER [52] 2020.4 LLM as Predictor TE 110M Representation NC, UAP, LP, Rec
GraphFormers [73] 2021.5 LLM as Predictor TE, SE 110M Representation LP
GreaseLM [68] 2022.1 LLM as Predictor TE, SE 355M Task QA
SciNCL [53] 2022.2 LLM as Predictor TE 110M Representation NC, UAP, LP, Rec
MICoL [60] 2022.2 LLM as Predictor TE 110M Supervision NC
LinkBERT [31] 2022.3 LLM as Predictor TE 110M Pretraining QA, NLU
Heterformer [74] 2022.5 LLM as Predictor TE, SE 110M Representation NC, LP
E2EG [61] 2022.8 LLM as Predictor TE 66M Task NC
TwHIN-BERT [57] 2022.9 LLM as Predictor TE 110M/355M Pretraining NC, LP
Edgeformers [75] 2023.1 LLM as Predictor TE, SE 110M Representation NC, LP, EC
Patton [32] 2023.5 LLM as Predictor TE, RE 110M Pretraining NC, LP, Search
InstructGLM [47] 2023.8 LLM as Predictor TE, SE 250M/7B Generalization NC, LP
GNP [42] 2023.9 LLM as Predictor TE, SE 3B/11B Task QA
Touchup-G [55] 2023.9 LLM as Predictor TE 110M Representation NC, LP
DGTL [77] 2023.10 LLM as Predictor TE, SE 13B Task NC
GraphText [66] 2023.10 LLM as Predictor TE, SE GPT-3.5/4 Task NC
GraphGPT [46] 2023.10 LLM as Predictor TE, SE 7B Generalization NC
METERN [76] 2023.10 LLM as Predictor TE, RE 110M Representation NC, LP, Rec, RG
LTRN [58] 2021.2 LLM as Aligner TE 110M Supervision NC
GLEM [62] 2023.1 LLM as Aligner TE 110M Task NC
G2P2 [63] 2023.5 LLM as Aligner TE 110M Supervision NC
ConGraT [54] 2023.5 LLM as Aligner TE 110M/82M Representation LP, LM, NC
GRENADE [56] 2023.10 LLM as Aligner TE 110M Representation NC, LP
THLM [34] 2023.10 LLM as Aligner TE 110B Pretraining NC, LP

Step 1: Rule-based Graph Linearization. Rule-based linearization converts molecular graphs into text sequences that can be processed by LLMs. To achieve this, researchers develop specifications based on human expertise in the form of line notations [152]. For example, the Simplified Molecular-Input Line-Entry System (SMILES) [152] records the symbols of nodes encountered during a depth-first traversal of a molecular graph. The International Chemical Identifier (InChI) [153], created by the International Union of Pure and Applied Chemistry (IUPAC), encodes molecular structures into unique string texts with more hierarchical information. Canonicalization algorithms produce unique SMILES for each molecule, often referred to as canonical SMILES. However, there can be more than one SMILES string corresponding to a single molecule, and SMILES sometimes represents invalid molecules; LLMs learned from these linearized sequences can easily generate invalid molecules (e.g., incorrect ring closure symbols and unmatched parentheses) due to syntactical errors. To this end, DeepSMILES [154] is proposed. It can alleviate this issue in most cases but does not guarantee 100% robustness. The linearized string could still violate basic physical constraints. To fully address this problem, SELFIES [155] is introduced, which consistently yields valid molecular graphs.

Step 2: Tokenization. The tokenization approaches for these linearized sequences are typically language-independent. They operate at both the character level [176], [187] and the substring level [171], [178], [182]–[185], based on SentencePiece [163] or BPE [162]. Additionally, RT [173] proposes a tokenization approach that facilitates handling regression tasks within LM Transformers.

TABLE 6
Model collection in Section 6 for text-captioned graphs. “Lin.” and “Vec.” represent Linearized Graph Encoding and Vectorized Graph Encoding. “Classif.”, “Regr.”, “NER”, “RE”, “Retr.”, “Gen.”, “Cap.” represent classification, regression, named entity recognition, relation extraction, (molecule) graph retrieval, (molecule) graph generation, and (molecule) graph captioning.

Model | Time | LM Encoder | Graph Encoder | Gen. Decoder | LM Size | Task

LLM as Predictor in Section 6.1:
SMILES-BERT [188] | 2019.09 | Transformer [94] | Linearized | N.A. | 30M-565M | Classification
Text2Mol [122] | 2021.11 | SciBERT [24] | GCN | Transformer [94] | ≥110M | Retrieval
MolGPT [186] | 2021.10 | N.A. | Linearized | GPT | 6M | Generation
Chemformer [164] | 2022.01 | BART [29] | Linearized | BART [29] | 45M-230M | Regression, Gen.
KV-PLM [184] | 2022.02 | BERT [22] | Linearized | N.A. | 110M-340M | Classif., NER, RE, Retrieval
MFBERT [185] | 2022.06 | RoBERTa [23] | Linearized | N.A. | 110M-340M | Classification
Galactica [187] | 2022.11 | N.A. | Linearized | Transformer [94] | 125M-120B | Classification
MolT5 [123] | 2022.12 | T5.1.1 [30] | Linearized | Transformer | 80M-780M | Gen., Cap.
Text+Chem T5 [180] | 2023.05 | T5 [30] | Linearized | T5 [30] | 80M-780M | Classif., Gen., Cap.
LLM-ICL [177] | 2023.05 | N.A. | Linearized | GPT-3.5/4, LLaMA2 [119], Galactica [187] | ≥780M | Classif., Gen., Cap.
GIMLET [48] | 2023.05 | T5 [30] | GT | T5 [30] | 80M-780M | Classif., Regr.
MolXPT [178] | 2023.05 | N.A. | Linearized | GPT-2 | 350M | Classif., Gen., Cap.
ChatMol [175] | 2023.06 | T5 [30] | Linearized | T5 [30] | 80M-780M | Gen., Cap.
MolReGPT [174] | 2023.06 | N.A. | Linearized | GPT-3.5 | N.A. | Gen., Cap.
RT [173] | 2023.06 | N.A. | Linearized | XLNet [26] | 27M | Regr., Gen.
LLM4Mol [172] | 2023.07 | RoBERTa [23] | Linearized | GPT-3.5 | N.A. | Classif., Regr.
LLaMA-Mol [169] | 2023.07 | N.A. | Linearized | LLaMA [119] | 7B | Regr., Gen.
Prot2Text [170] | 2023.07 | N.A. | GNN | GPT-2 | 256M-760M | Cap.
CatBERTa [168] | 2023.10 | N.A. | Linearized | RoBERTa [23] | N.A. | Regression
ReLM [166] | 2023.10 | N.A. | GNN | GPT-3.5 | N.A. | Classification

LLM as Aligner in Section 6.2:
MoMu [183] | 2022.12 | SciBERT [24], KV-PLM [184] | GNN | MolT5 [123], MoFlow [220] | 82M-782M | Classif., Gen., Cap., Retr.
MoleculeSTM [181] | 2022.12 | BART [29] | GIN, Linearized | BART [29] | 45M-230M | Classif., Gen., Cap.
CLAMP [179] | 2023.05 | BioBERT [189], CLIP [97], T5 [30] | GNN, Lin., Vec. | N.A. | ≤11B | Classif., Retrieval
MolFM [171] | 2023.06 | BERT [22] | GIN | MolT5 [123] | 61.8M | Classif., Gen., Cap., Retr.
MoMu-v2 [182] | 2023.07 | SciBERT [24] | GIN | N.A. | 82M-782M | Classification
GIT-Mol [167] | 2023.08 | SciBERT [24] | GIN, Linearized | MolT5 [123] | 190M-890M | Classif., Gen., Cap.
MolCA [176] | 2023.10 | Galactica [187] | GIN | N.A. | 100M-877M | Classif., Regr., Retrieval

Step 3: Encoding the Linearized Graph with LLMs.
Encoder-only LLMs. Earlier LLMs like SciBERT [24] and BioBERT [189] are trained on scientific literature to understand natural language descriptions related to molecules but are not capable of comprehending molecular graph structures. To this end, SMILES-BERT [188] and MFBERT [185] are proposed for molecular graph classification with linearized SMILES strings. Since scientific natural language descriptions contain human expertise which can serve as a supplement for molecular graph structures, recent advances emphasize the joint understanding of them [168], [184]: the linearized graph sequence is concatenated with the raw natural language data and then input into the LLMs. Specifically, KV-PLM [184] is built based on BERT [22] to understand the molecular structure in a biomedical context. CatBERTa [168], as developed from RoBERTa [23], specializes in the prediction of catalyst properties for molecular graphs.
Encoder-Decoder LLMs. Encoder-only LLMs may lack the capability for generation tasks. In this section, we discuss LLMs with encoder-decoder architectures. For example, Chemformer [164] uses a similar architecture to BART [29]. The representation from the encoder can be used for property prediction tasks, and the whole encoder-decoder architecture can be optimized for molecule generation.

Other works focus on molecule captioning (which involves generating textual descriptions from a molecule) and text-based molecular generation (where a molecular graph structure is generated from a natural description). Specifically, MolT5 [123] is developed based on the T5 [30], suitable for these two tasks. It formulates molecule-text translation as a multilingual problem and initializes the model using the T5 checkpoint. The model was pre-trained on two monolingual corpora: the Colossal Clean Crawled Corpus (C4) [30] for the natural language modality and one million SMILES [164] for the molecule modality. Text+Chem T5 [180] extends the input and output domains to include both SMILES and texts, unlocking LLMs for more generation functions such as text or reaction generation. ChatMol [175] exploits the interactive capabilities of LLMs and proposes designing molecule structures through multi-turn dialogs with T5.
Decoder-only LLMs. Decoder-only architectures have been adopted for recent LLMs due to their advanced generation ability. MolGPT [186] and MolXPT [178] are GPT-style models used for molecule classification and generation. Specifically, MolGPT [186] focuses on conditional molecule generation tasks using scaffolds, while MolXPT [178] formulates the classification task as a question-answering problem with yes or no responses. RT [173] adopts XLNet [26] and focuses on molecular regression tasks. It frames the regression as a conditional sequence modeling problem. Galactica [187] is a set of LLMs with a maximum of 120 billion parameters, which is pretrained on two million compounds from PubChem [194]. Therefore, Galactica could understand molecular graph structures through SMILES. With instruction tuning data and domain knowledge, researchers also adapt general-domain LLMs such as LLaMA to recognize molecular graph structures and solve molecule tasks [169]. Recent studies also explore the in-context learning capabilities of LLMs on graphs. LLM-ICL [177] assesses the performance of LLMs across eight tasks in the molecular domain, ranging from property classification to molecule-text translation. MolReGPT [174] proposes a method to retrieve molecules with similar structures and descriptions to improve in-context learning. LLM4Mol [172] utilizes the summarization capability of LLMs as a feature extractor and combines it with a smaller, tunable LLM for specific prediction tasks.

6.1.2 Graph-Empowered LLMs
Different from the methods that adopt the original LLM architecture (i.e., Transformers) and input the graphs as sequences to LLMs, graph-empowered LLMs attempt to design LLM architectures that can conduct joint encoding of text and graph structures. Some works modify the positional encoding of Transformers. For instance, GIMLET [48] treats nodes in a graph as tokens. It uses a single Transformer to manage both the graph structure and text sequence [v_1, v_2, ..., v_{|V|}, s_{|V|+1}, ..., s_{|V|+|d_G|}], where v ∈ V is a node and s ∈ d_G is a token in the text associated with G. It has three sub-encoding approaches for positional encodings to cater to different data modalities and their interactions. Specifically, it adopts the structural position encoding (PE) from the Graph Transformer and defines the relative distance between tokens i and j as follows:

  PE(i, j) =
    i − j,                                          if i, j ∈ d_G,
    GSD(i, j) + Mean_{e_k ∈ SP(i,j)} x_{e_k},       if i, j ∈ V,
    −∞,                                             if i ∈ V, j ∈ d_G,
    0,                                              if i ∈ d_G, j ∈ V.   (22)

Here, GSD denotes the graph shortest distance between two nodes, and Mean_{e_k ∈ SP(i,j)} represents the mean pooling of the edge features x_{e_k} along the shortest path SP(i, j) between nodes i and j. GIMLET [48] adapts bi-directional attention for node tokens and enables texts to selectively attend to nodes. These designs render the Transformer's submodule, which handles the graph part, equivalent to a Graph Transformer [141].
There are other works that modify cross-attention modules to facilitate interaction between graph and text representations. Given the graph hidden state h_G, its node-level hidden state H_v and text hidden state H_{d_G}, Text2Mol [122] implements interaction between representations in the hidden layers of encoders, while Prot2Text [170] implements this interaction within the layers between the encoder and decoder:

  H_{d_G} = softmax( W_Q H_{d_G} · (W_K H_v)^⊤ / \sqrt{d_k} ) · W_V H_v,   (23)

where W_Q, W_K, W_V are trainable parameters that transform the query modality (e.g., sequences) and the key/value modality (e.g., graphs) into the attention space. Furthermore, Prot2Text [170] utilizes two trainable parameter matrices W_1 and W_2 to integrate the graph representation into the sequence representation:

  H_{d_G} = H_{d_G} + 1_{|d_G|} h_G W_1 W_2.   (24)

6.1.3 Discussion
LLM Inputs with Sequence Prior. The first challenge is that advanced linearization methods have not progressed in tandem with the development of LLMs. Emerging around 2020, linearization methods for molecular graphs like SELFIES offer significant grammatical advantages, yet advanced LMs and LLMs from the graph machine learning and language model communities might not fully utilize these, as these encoded results are not part of pretraining corpora prior to their proposal. Consequently, recent studies [177] indicate that LLMs, such as GPT-3.5/4, may be less adept at using SELFIES compared to SMILES. Therefore, the performance of LM-only and LLM-only methods may be limited by the expressiveness of older linearization methods, as there is no way to optimize these hard-coded rules during the learning pipeline of LLMs. However, a second challenge remains: the inductive bias of graphs may be broken by linearization. Rule-based linearization methods introduce inductive biases for sequence modeling, thereby breaking the permutation invariance assumption inherent in molecular graphs. This may reduce task difficulty by introducing sequence order to reduce the search space. However, it does not guarantee model generalization. Specifically, there could be multiple string-based representations for a single graph from single or different approaches. Numerous studies [156]–[159] have shown that training on different string-based views of the same molecule can improve the sequential model's performance, as these data augmentation approaches manage to retain the permutation-invariance nature of graphs.

These advantages are also achievable with a permutation-invariant GNN, potentially simplifying the model by reducing the need for complex, string-based data augmentation design.

LLM Inputs with Graph Prior. Rule-based linearization may be considered less expressive and generalizable compared to the direct graph representation with rich node features, edge features, and the adjacency matrix [203]. Various atomic features include atomic number, chirality, degree, formal charge, number of hydrogen atoms, number of radical electrons, hybridization state, aromaticity, and presence in a ring. Bond features encompass the bond's type (e.g., single, double, or triple), the bond's stereochemistry (e.g., E/Z or cis/trans), and whether the bond is conjugated [204]. Each feature provides specific information about atomic properties and structure, crucial for molecular modeling and cheminformatics. One may directly vectorize the molecular graph structure into binary vectors [199] and then apply parameterized Multilayer Perceptrons (MLPs) on top of these vectors to get the graph representation. These vectorization approaches are also known as fingerprints, such as MACCS [200], ECFP [201], and CDK fingerprints [202], and are based on human-defined rules. These rules take a molecule as input and output a vector consisting of 0/1 bits. Each bit denotes a specific type of substructure related to functional groups that could be used for various property predictions. Fingerprints consider atoms and structures, but they still fall short of automatically learning from the raw graph structure. GNNs could serve as automatic feature extractors to replace or enhance fingerprints. Some specific methods are explored in Section 6.1.2, while other graph priors such as the eigenvectors of a graph Laplacian and the random walk prior could also be used [142].

LLM Outputs for Prediction. LMs like KV-PLM [184], SMILES-BERT [188], MFBERT [185], and Chemformer [164] use a prediction head on the output vector of the last layer. These models are finetuned with standard classification and regression losses but may not fully utilize all the parameters and advantages of the complete architecture. In contrast, models like RT [173], MolXPT [178], and Text+Chem T5 [180] frame prediction as a text generation task. These models are trained with either masked language modeling or autoregressive targets, which requires a meticulous design of the context words in the text [173]. Specifically, domain knowledge instructions may be necessary to activate the in-context learning ability of LLMs, thereby making them domain experts [177]. For example, a possible template could be divided into four parts: {General Description}{Task-Specific Description}{Question-Answer Examples}{Test Question}.

LLM Outputs for Reasoning. Since string representations of molecular graphs usually carry new and in-depth domain knowledge, which is beyond the knowledge of LLMs, recent work [148], [166], [174] also attempts to utilize the reasoning ability of LLMs, instead of using them as a knowledge source for predicting the property of molecular graphs. ReLM [166] utilizes GNNs to suggest top-k candidates, which are then used to construct multiple-choice answers for in-context learning. ChemCrow [148] designs the LLM as a chemical agent to operate various chemical tools, avoiding direct inference in an expertise-intensive domain.

6.2 LLM as Aligner
6.2.1 Latent Space Alignment
One may directly align the latent spaces of the GNN and LLM through contrastive learning and predictive regularization. Typically, a graph representation from a GNN can be read out by summarizing all node-level representations, and a sequence representation can be obtained from the [CLS] token. We first use two projection heads, which are usually MLPs, to map the separate representation vectors from the GNN and LLM into a unified space as h_G and h_{d_G}, and then align them within this space. Specifically, MoMu [183] and MoMu-v2 [182] retrieve two sentences from the corpus for each molecular graph. During training, graph data augmentation is applied to the molecular graphs, creating two augmented views. Consequently, there are four pairs of G and d_G. For each pair, the contrastive loss for space alignment is as follows:

  ℓ_MoMu = −log [ exp(cos(h_G, h_{d_G})/τ) / Σ_{\tilde{d}_G ≠ d_G} exp(cos(h_G, h_{\tilde{d}_G})/τ) ],   (25)

where τ is the temperature hyper-parameter and \tilde{d}_G denotes a sequence not paired to the graph G. MoleculeSTM [181] also applies contrastive learning to minimize the representation distance between a molecular graph G and its corresponding texts d_G, while maximizing the distance between the molecule and unrelated descriptions. Specifically, it considers two contrastive learning strategies, EBM-NCE and InfoNCE:

  ℓ_STM-EBM = −(1/2) [ E_{h_G, h_{d_G}} [log σ(h_G^⊤ h_{d_G})] + E_{h_G, h_{\tilde{d}_G}} [log(1 − σ(h_G^⊤ h_{\tilde{d}_G}))] ]
              −(1/2) [ E_{h_G, h_{d_G}} [log σ(h_G^⊤ h_{d_G})] + E_{h_{\tilde{G}}, h_{d_G}} [log(1 − σ(h_{\tilde{G}}^⊤ h_{d_G}))] ],   (26)

  ℓ_STM-Info = −(1/2) E_{h_G, h_{d_G}} [ log( exp(h_G^⊤ h_{d_G}) / Σ_{\tilde{d}_G ≠ d_G} exp(h_G^⊤ h_{\tilde{d}_G}) )
              + log( exp(h_G^⊤ h_{d_G}) / Σ_{\tilde{G} ≠ G} exp(h_{\tilde{G}}^⊤ h_{d_G}) ) ].   (27)

MoleculeSTM [181] randomly samples negative graphs or texts to construct negative pairs of (G, \tilde{d}) and (\tilde{G}, d). Similarly, MolFM [171] and GIT-Mol [167] implement the contrastive loss with mutual information and negative sampling, as shown in Eq. (27). These two methods also use cross-entropy to regularize the unified space with the assumption that randomly permuted graph and text inputs are predictable if they originate from the same molecule. However, the aforementioned methods cannot leverage task labels. Given a classification label y, CLAMP [179] learns to map active molecules (y = 1) so that they align with the corresponding assay description for each molecular graph G:

  ℓ_CLAMP = y log σ(τ^{−1} h_G^⊤ h_{d_G}) + (1 − y) log(1 − σ(τ^{−1} h_G^⊤ h_{d_G})).   (28)

CLAMP [179] requires labels to encourage that active molecules and their corresponding text descriptions are clustered together in the latent space.
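The latent space alignment objectives in Eqs. (25)-(28) share a common skeleton. As a generic, illustrative sketch (not the configuration of MoMu, MoleculeSTM, or CLAMP), the following computes a symmetric InfoNCE loss between already-projected GNN graph embeddings and LM text embeddings; the projection heads, batch construction, and temperature are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

def graph_text_infonce(h_graph, h_text, temperature=0.1):
    """Symmetric InfoNCE over a batch of paired (graph, text) embeddings.

    h_graph, h_text: [batch, dim] outputs of the GNN/LM projection heads;
    row i of both tensors describes the same molecule (positive pair), and
    every other row in the batch serves as an in-batch negative.
    """
    h_graph = F.normalize(h_graph, dim=-1)
    h_text = F.normalize(h_text, dim=-1)
    logits = h_graph @ h_text.t() / temperature       # pairwise cosine similarities
    targets = torch.arange(h_graph.size(0))
    loss_g2t = F.cross_entropy(logits, targets)       # graph -> text direction
    loss_t2g = F.cross_entropy(logits.t(), targets)   # text -> graph direction
    return 0.5 * (loss_g2t + loss_t2g)

# Toy usage with random tensors standing in for projected GNN/LM outputs.
loss = graph_text_infonce(torch.randn(8, 256), torch.randn(8, 256))
print(float(loss))
```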

To advance the alignment between the two modalities, MolCA [176] trains the Query Transformer (Q-Former) [207] for molecule-text projection and contrastive alignment. The Q-Former initializes N_q learnable query tokens {q_k}_{k=1}^{N_q}. These query tokens are updated with self-attention and interact with the output of GNNs through cross-attention to obtain the k-th queried molecular representation vector (h_G)_k := Q-Former(q_k). The query tokens share the same self-attention modules with the texts, but use different MLPs, allowing the Q-Former to be used for obtaining the representation of a text sequence h_{d_G} := Q-Former([CLS]):

  ℓ_g2t = log [ exp(max_k cos((h_G)_k, h_{d_G})/τ) / Σ_{\tilde{d}_G ≠ d_G} exp(max_k cos((h_G)_k, h_{\tilde{d}_G})/τ) ],
  ℓ_t2g = log [ exp(max_k cos(h_{d_G}, (h_G)_k)/τ) / Σ_{\tilde{G} ≠ G} exp(max_k cos(h_{d_G}, (h_{\tilde{G}})_k)/τ) ],
  ℓ_MolCA = −ℓ_g2t − ℓ_t2g.   (29)

6.2.2 Discussion
Larger-Scale GNNs. GNNs integrate atomic and graph structural features for molecular representation learning [147]. Specifically, Text2Mol [122] utilizes the GCN [85] as its graph encoder and extracts unique identifiers for node features based on Morgan fingerprints [201]. MoMu [183], MoMu-v2 [182], MolFM [171], GIT-Mol [167], and MolCA [176] prefer GIN [205] as the backbone, as GIN has been proven to be as expressive and powerful as the Weisfeiler-Lehman graph isomorphism test [206]. As described in Section 2.2, there has been notable progress in making GNNs deeper, more generalizable, and more powerful since the proposal of the GCN [85] in 2016 and the GIN [205] in 2018. However, most reviewed works [167], [171], [176], [182], [183] are developed using the GIN [205] as a proof of concept for their approaches. These pretrained GINs feature five layers and 300 hidden dimensions. The scale of GNNs could be a bottleneck in learning semantic meanings in their representation vectors, and there is a risk of over-reliance on one modality, neglecting the other. Therefore, for future large-scale GNN designs comparable to LLMs, scaling up the dimension size and adding deeper layers, which have proven effective in recent material discovery tasks [231], may be considered. Recent studies [139], [140], [142] also suggest that designing advanced GNNs with Transformer encoder layers may improve the expressive power of GNNs. Therefore, the deployment of real-world and large-scale GNNs should be promising when combining the aforementioned methods.

Generation Decoder with GNNs. Except for MoMu [183], which utilizes the flow-based MoFlow method [220] as the molecular graph generator, aligned GNNs are often used as encoders, not as decoders for generation. The prevalent decoder architecture is mostly text-based, generating linearized graph structures such as SMILES. These approaches can hardly utilize the permutation-invariant characteristic of graphs, as discussed in Section 6.1.3. Recent advances in generative diffusion models on graphs [219], [232] could be utilized in future work to design generators with GNNs.

7 APPLICATIONS
7.1 Datasets, Splitting and Evaluation
We summarize the datasets for the three scenarios (namely pure graphs, text-rich graphs, and text-paired graphs) and show them in Table 4, Table 7, and Table 8, respectively.

7.1.1 Pure Graphs
In Table 4, we summarize the pure graph reasoning problems discussed in Section 4. Many problems are shared or revisited in different datasets due to their generality. NLGraph [124], LLMtoGraph [125] and GUC [126] study a set of standard graph reasoning problems, including connectivity, shortest path, and graph diameter. GraphQA [131] benchmarks a similar set of problems but additionally describes the graphs in real-world scenarios to study the effect of graph grounding. LLM4DyG [128] focuses on reasoning tasks on temporally evolving graphs. Accuracy is the most common evaluation metric as these problems are primarily formulated as graph question-answering tasks.

7.1.2 Text-Rich Graphs
We summarize the well-known datasets for evaluating models on text-rich graphs in Table 7. The datasets are mostly from the academic, e-commerce, book, social media, and Wikipedia domains. The popular tasks to evaluate models on those datasets include node classification, link prediction, edge classification, regression, and recommendation. The evaluation metrics for node/edge classification include Accuracy, Macro-F1, and Micro-F1. For link prediction and recommendation evaluation, Mean Reciprocal Rank (MRR), Normalized Discounted Cumulative Gain (NDCG), and Hit Ratio (Hit) usually serve as metrics. When evaluating model performance on regression tasks, people tend to adopt mean absolute error (MAE) or root mean square error (RMSE).

7.1.3 Text-Paired Graphs
Table 8 shows text-paired graph datasets (including text-available and graph-only datasets). For data splitting, options include random splitting, source-based splitting, activity cliffs [221] and scaffolds [222], and data balancing [143]. Graph classification usually adopts AUC [204] as the metric, while regression uses MAE, RMSE, and R2 [147]. For text generation evaluation, people tend to use the Bilingual Evaluation Understudy (BLEU) score; for molecule generation evaluation, heuristic evaluation methods (based on factors including validity, novelty, and uniqueness) are adopted. However, it is worth noting that the BLEU score is efficient but less accurate, while heuristic evaluation methods are problematic as they are subject to unintended modes, such as the superfluous addition of carbon atoms in [223].

7.2 Open-source Implementations
Hugging Face. HF Transformers is the most popular Python library for Transformers-based language models. Besides, it also provides two additional packages: Datasets for easily accessing and sharing datasets and Evaluate for easily evaluating machine learning models and datasets.
Fairseq. Fairseq is another open-source Python library for Transformers-based language models.

TABLE 7
Data collection in Section 5 for text-rich graphs. Task: “NC”, “UAP”, “LP”, “Rec”, “EC”, “RG” denote node classification, user activity prediction, link prediction, recommendation, edge classification, and regression task.

Data | Year | Task | # Nodes | # Edges | Domain | Source & Notes

Text on nodes:
ogb-arxiv | 2020.5 | NC | 169,343 | 1,166,243 | Academic | OGB [204]
ogb-products | 2020.5 | NC | 2,449,029 | 61,859,140 | E-commerce | OGB [204]
ogb-papers110M | 2020.5 | NC | 111,059,956 | 1,615,685,872 | Academic | OGB [204]
ogb-citation2 | 2020.5 | LP | 2,927,963 | 30,561,187 | Academic | OGB [204]
Cora | 2000 | NC | 2,708 | 5,429 | Academic | [9]
Citeseer | 1998 | NC | 3,312 | 4,732 | Academic | [10]
DBLP | 2023.1 | NC, LP | 5,259,858 | 36,630,661 | Academic | www.aminer.org/citation
MAG | 2020 | NC, LP, Rec, RG | ∼10M | ∼50M | Academic | multiple domains [11], [12]
Goodreads-books | 2018 | NC, LP | ∼2M | ∼20M | Books | multiple domains [13]
Amazon-items | 2018 | NC, LP, Rec | ∼15.5M | ∼100M | E-commerce | multiple domains [14]
SciDocs | 2020 | NC, UAP, LP, Rec | - | - | Academic | [52]
PubMed | 2020 | NC | 19,717 | 44,338 | Academic | [15]
Wikidata5M | 2021 | LP | ∼4M | ∼20M | Wikipedia | [16]
Twitter | 2023 | NC, LP | 176,279 | 2,373,956 | Social | [54]

Text on edges:
Goodreads-reviews | 2018 | EC, LP | ∼3M | ∼100M | Books | multiple domains [13]
Amazon-reviews | 2018 | EC, LP | ∼15.5M | ∼200M | E-commerce | multiple domains [14]
Stackoverflow | 2023 | EC, LP | 129,322 | 281,657 | Social | [75]

PyTorch Geometric. PyG is an open-source Python library for graph machine learning. It packages more than 60 types of GNN layers, combined with various aggregation and pooling layers.
Deep Graph Library. DGL is another open-source Python library for graph machine learning.
RDKit. RDKit is one of the most popular open-source cheminformatics software programs that facilitates various operations and visualizations for molecular graphs. It offers many useful APIs, such as the linearization implementation for molecular graphs, to convert them into easily stored SMILES and to convert these SMILES back into graphs.

7.3 Practical applications
7.3.1 Scientific Discovery
Virtual Screening. While we may have numerous unlabeled molecule candidates for drug and material design, chemists are often interested in only a small portion of them that are located in a specific area of chemical space [226]. Machine learning models could help researchers automatically screen out trivial candidates. However, training accurate models is not an easy task because labeled molecule datasets often have small sizes and imbalanced data distribution [143]. There are many efforts to improve GNNs against data sparsity [143], [147], [229]. However, it is difficult, if not impossible, for a model to generalize and understand in-depth domain knowledge that it has never been trained on. Texts, therefore, could be complementary sources of knowledge. Discovering task-related content from massive scientific papers and using them as instructions has great potential to improve GNNs in accurate virtual screening tasks [48].
Optimizing Scientific Hypotheses. Molecular generation and optimization represent one of the fundamental goals in chemical science for drug and material discovery [227]. Scientific hypotheses, such as the complex molecules [228], can be represented in the joint space of GNNs and LLMs. Then, one may search in the latent space for a better hypothesis that aligns with the text description (human requirements) and adheres to structural constraints like chemical validity. Chemical space has been found to contain more than 10^60 molecules [225], which is beyond the capacity of exploration in wet lab experiments. One of the biggest challenges lies in generating high-quality candidates, rather than randomly producing candidates in irrelevant subspaces. Molecular generation with multiple conditions (textual, numerical, categorical) shows promise to solve this problem.
Synthesis Planning. Synthesis designs start from available molecules and involve planning a sequence of steps that can finally produce a desired chemical compound through a series of reactions [228]. This procedure includes a sequence of reactant molecules and reaction conditions. Both graphs and texts play important roles in this process. For example, graphs may represent the fundamental structure of molecules, while texts may describe the reaction conditions, additives, and solvents. LLMs can also assist in the planning by suggesting possible synthesis paths directly or by serving as agents to operate on existing planning tools [148].

7.3.2 Computational Social Science
In computational social science, researchers are interested in modeling the behavior of people/users and discovering new knowledge that can be utilized to forecast the future. The behaviors of users and interactions between users can be modeled as graphs, where the nodes are associated with rich text information (e.g., user profile, messages, emails). We will show two example scenarios below.
E-commerce. In E-commerce platforms, there are many interactions (e.g., purchase, view) between users and products. For example, users can view, cart, or purchase products. In addition, the users, products, and their interactions are associated with rich text information. For instance, products have titles/descriptions and users can leave a review of products. In this case, we can construct a graph where nodes are users and products, while edges are their interactions. Both nodes and edges are associated with text. It is important to utilize both the text information and the graph structure information (user behavior) to model users and items and solve complex downstream tasks (e.g., item recommendation [104], bundle recommendation [105], and product understanding [106]).
Social Media. In social media platforms, there are many users and they interact with each other through messages, emails, and so on.

TABLE 8
Data collection in Section 6 for text-captioned graphs. The availability of text refers to the text descriptions of graphs, not the format of the linearized graphs like the SMILES-represented molecules. “PT”, “FT”, “Cap.”, “GC”, “Retr.”, “Gen.”, and “GR” refer to pretraining, finetuning, caption, graph classification, retrieval, graph generation, and graph regression, respectively. The superscripts on the sizes denote # graph-text pairs (^1), # graphs (^2), and # assays (^3).

Data | Date | Task | Size | Tested Model | Source & Notes

Both graph and text are available:
ChEMBL-2023 [196] | 2023 | Various | 2.4M^2, 20.3M^3 | Various | Drug-like
PubChem [194] | 2019 | Various | 96M^2, 237M^3 | Various | Biomedical
PC324K [176] | 2023 | PT, Cap. | 324K^1 | MolCA [176] | PubChem [194]
MolXPT-PT [178] | 2023 | PT | 30M^2 | MolXPT [178] | PubChem [194], PubMed, ChEBI [193]
ChE-bio [48] | 2023 | PT | 365K^2 | GIMLET [48] | ChEMBL [195]
ChE-phy [48] | 2023 | PT | 365K^2 | GIMLET [48] | ChEMBL [195]
ChE ZS [48] | 2023 | GC | 91K^2 | GIMLET [48] | ChEMBL [195]
PC223M [179] | 2023 | PT, Retr. | 223M^1, 2M^2, 20K^3 | CLAMP [179] | PubChem [194]
PCSTM [181] | 2022 | PT | 281K^1 | MoleculeSTM [181] | PubChem [194]
PCdes [194] | 2022 | FT, Cap., Retr. | 15K^1 | KV-PLM [184], MoMu [183], MolFM [171], ChatMol [175], MolCA [176] | PubChem [194]
ChEBI-20 [122] | 2021 | FT, Retr., Gen., Cap. | 33K^1 | Text2Mol [122], MoMu [183], Text+Chem T5 [180], LLM-ICL [177], MolXPT [178], MolFM [171], ChatMol [175], MolReGPT [174], GIT-Mol [167], MolCA [176] | PubChem [194], ChEBI [193]

Graph only:
ZINC15 [197] | 2015 | PT | 120M^2 | Various | Drug-like
GDB13 [198] | 2009 | PT | 977M^2 | MFBERT [185] | Drug-like
FS-Mol [192] | 2021 | GC | 27,065^2 | CLAMP [179] | ChEMBL [195]
PCBA [208] | 2018 | GC | 439,863^2 | SMILES-BERT [188], GIMLET [48] | PubChem-Bio [194]
MUV [208] | 2018 | GC | 93,127^2 | MoMu [183], GIMLET [48], MolFM [171] | PubChem-Bio [194]
HIV [208] | 2018 | GC | 41,913^2 | KV-PLM [184], MoMu [183], LLM-ICL [177], MolXPT [178], CLAMP [179], GIMLET [48], MolFM [171] | Drug Therapeutics Program AIDS Antiviral Screen Data [209]
CYP450 [191] | 2018 | GC | 16,896^2 | GIMLET [48] | Pharmacokinetic
BACE [208] | 2018 | GC | 1,522^2 | Galactica [187], MoMu [183], MoMu-v2 [182], LLM-ICL [177], MolXPT [178], CLAMP [179], GIMLET [48], MolFM [171], GIT-Mol [167], MolCA [176] | Binding for inhibitors of human β-secretase 1 [210]
BBBP [208] | 2018 | GC | 2,053^2 | Galactica [187], KV-PLM [184], MoMu [183], MoMu-v2 [182], LLM-ICL [177], MolXPT [178], CLAMP [179], GIMLET [48], MolFM [171], GIT-Mol [167], MolCA [176] | Blood-brain barrier penetration permeability [211]
Tox21 [208] | 2018 | GC | 8,014^2 | Galactica [187], KV-PLM [184], MoMu [183], MoMu-v2 [182], LLM-ICL [177], MolXPT [178], CLAMP [179], GIMLET [48], MolFM [171], GIT-Mol [167], MolCA [176] | Toxicology in the 21st Century initiative [212]
Toxcast [208] | 2018 | GC | 8,615^2 | MoMu [183], MoMu-v2 [182], CLAMP [179], GIMLET [48], MolFM [171], GIT-Mol [167], MolCA [176] | Toxicology in the 21st Century initiative [212]
SIDER [208] | 2018 | GC | 1,427^2 | Galactica [187], KV-PLM [184], MoMu [183], MoMu-v2 [182], MolXPT [178], MolFM [171], LLM4Mol [172], GIT-Mol [167], MolCA [176] | DeepChem [213] for side effect resource of marketed drugs
ClinTox [208] | 2018 | GC | 1,491^2 | Galactica [187], MoMu [183], MoMu-v2 [182], LLM-ICL [177], MolXPT [178], CLAMP [179], MolFM [171], LLM4Mol [172], GIT-Mol [167], MolCA [176] | Drugs approved by FDA or failed clinical trials [214], [215]
Lipophilicity [208] | 2018 | GR | 4,200^2 | Chemformer [164], GIMLET [48], LLM4Mol [172] | ChEMBL [195]
ESOL [208] | 2018 | GR | 1,128^2 | Chemformer [164], GIMLET [48], LLM4Mol [172] | [217]
FreeSolv [208] | 2018 | GR | 642^2 | Chemformer [164], GIMLET [48] | [216]

In this case, we can build a graph where nodes are users and edges are the interactions between users. There will be text associated with nodes (e.g., user profile) and edges (e.g., messages). Interesting research questions will be how to do joint text and graph structure modeling to deeply understand the users for friend recommendation [107], user analysis [108], and community detection [109].

7.3.3 Specific Domains
In many specific domains, text data are interconnected and lie in the format of graphs. The structure information on the graphs can be utilized to better understand the text unit and contribute to advanced problem-solving.
Academic Domain. In the academic domain, networks [11] are constructed with papers as nodes and their relations (e.g., citation, authorship, etc.) as edges. The representation learned for papers on such networks can be utilized for paper recommendation [101], paper classification [102], and author identification [103].
Legal Domain. In the legal domain, opinions given by the judges always contain references to opinions given for previous cases. In such a scenario, people can construct an opinion network [98] based on the citation relations between the opinions. The representations learned on such a network with both text and structure information can be utilized for clause classification [99] and opinion recommendation [100].
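As a small illustration of how the text-rich user-product graphs described in the e-commerce scenario above can be materialized in code, the sketch below uses PyG's HeteroData together with a Hugging Face text encoder (both introduced in Section 7.2). The node texts, edge set, and checkpoint are made-up placeholders, not a recipe from any surveyed system.

```python
import torch
from torch_geometric.data import HeteroData
from transformers import AutoTokenizer, AutoModel

# Toy node texts (placeholders for user profiles and product descriptions).
user_texts = ["likes running shoes", "buys camping gear"]
product_texts = ["trail running shoe, lightweight", "two-person tent"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
lm = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        return lm(**batch).last_hidden_state[:, 0]   # text feature per node

data = HeteroData()
data["user"].x = embed(user_texts)
data["product"].x = embed(product_texts)
# user 0 purchased product 0; user 1 purchased product 1. Edge texts such as
# reviews could likewise be encoded and stored as edge attributes.
data["user", "purchases", "product"].edge_index = torch.tensor([[0, 1], [0, 1]])

print(data)
```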

8 FUTURE DIRECTIONS

Better Benchmark Datasets. For pure graphs, most existing benchmarks are constructed for evaluating the reasoning ability of LLMs on homogeneous graphs, while lacking evaluation on heterogeneous graphs or spatial-temporal graphs. For text-rich graphs, as summarized in Table 7, most benchmark datasets are from the academic and e-commerce domains. However, in the real world, text-rich graphs are ubiquitous across multiple domains (e.g., legal and health). More diverse datasets are needed to comprehensively evaluate LLMs in real-world scenarios. For text-paired graphs, as summarized in Table 8, there is a lack of comprehensive datasets covering various machine learning tasks in chemistry. Although a massive number of scientific papers are available, preprocessing them into a ready-to-use format and pairing them with the specific molecular graph data points of interest remains a cumbersome and challenging task.
Broader Task Space with LLMs. More comprehensive studies on the performance of LLMs for graph tasks hold promise for the future. While LLM-as-encoder approaches have been explored for text-rich graphs, their application to text-captioned molecular graphs remains underexplored. Promising directions include using LLMs for data augmentation and knowledge distillation to design domain-specific GNNs for various text-paired graph tasks. Furthermore, although graph generation has been approached in text-paired graphs, it remains an open problem for text-rich graphs (i.e., how to conduct joint text and graph structure generation).
Multi-Modal Foundation Models. One open question is, “Should we use one foundation model to unify different modalities, and how?” The modalities can include texts, graphs, and even images. For instance, molecules can be represented as graphs, described as texts, and photographed as images; products can be treated as nodes in a graph, associated with a title/description, and combined with an image. Designing a model that can conduct joint encoding for all modalities will be useful but challenging. This problem is even harder given that data from graph modalities are quite diverse (e.g., molecular graphs are quite different from social networks). Furthermore, there has always been a tension between efforts in building a uniform foundational model and in customizing model architectures for different domains. It is thus intriguing to ask whether a unified architecture will suit different data types, or if tailoring model designs according to domains will be necessary. Correctly answering this question can save economic and intellectual resources from unnecessary attempts and also shed light on a deeper understanding of graph-related tasks.
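As a toy illustration of what joint encoding could look like for just two of these modalities (text and molecular graphs), the sketch below fuses a bag-of-tokens text view and a mean-pooled atom view into one representation. The embedding tables, dimensions, and fusion layer are arbitrary placeholders, not a proposed architecture; image inputs are omitted entirely.

```python
import torch
import torch.nn as nn

class ToyJointEncoder(nn.Module):
    """Toy two-modality encoder (text + molecular graph); dimensions are arbitrary."""
    def __init__(self, vocab_size=10000, num_atom_types=120, d=64):
        super().__init__()
        self.text_encoder = nn.EmbeddingBag(vocab_size, d)   # stand-in for an LLM
        self.atom_encoder = nn.Embedding(num_atom_types, d)  # stand-in for a GNN
        self.fusion = nn.Linear(2 * d, d)

    def forward(self, token_ids, atom_ids):
        text_vec = self.text_encoder(token_ids.unsqueeze(0))                # (1, d)
        graph_vec = self.atom_encoder(atom_ids).mean(dim=0, keepdim=True)   # (1, d)
        return self.fusion(torch.cat([text_vec, graph_vec], dim=-1))        # (1, d)

encoder = ToyJointEncoder()
tokens = torch.randint(0, 10000, (12,))  # token ids of a molecule description (toy)
atoms = torch.randint(0, 120, (9,))      # atom types of a molecular graph (toy)
print(encoder(tokens, atoms).shape)      # torch.Size([1, 64])
```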
Efficient LLMs on Graphs. While LLMs have shown a strong capability to learn on graphs, they suffer from inefficiency issues in terms of graph linearization and model optimization. On one hand, as discussed in Sections 5.1.1 and 6.1.1, many methods rely on transferring graphs into sequences that can be inputted into LLMs. However, the length of the transferred sequence increases significantly as the size of the graph increases. This poses challenges since LLMs always have a maximum sequence input length, and a long input sequence leads to higher time and memory complexity. On the other hand, optimizing LLMs itself is computationally expensive. Although some general efficient tuning methods such as LoRA have been proposed, there is a lack of discussion on graph-aware LLM efficient tuning methods.
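To make the linearization bottleneck concrete, the sketch below verbalizes a random graph's edge list into a prompt string and reports how the description grows with graph size. The verbalization template is an illustrative assumption, and whitespace word counts are used only as a rough stand-in for LLM tokens.

```python
import networkx as nx

def linearize(graph: nx.Graph) -> str:
    """Flatten a graph into text; the template here is an illustrative choice."""
    edges = ", ".join(f"node {u} is connected to node {v}" for u, v in graph.edges())
    return f"The graph has {graph.number_of_nodes()} nodes. {edges}."

# The description grows roughly linearly with the number of edges, so even
# moderately sized graphs can exceed a fixed LLM context window.
for n in (10, 100, 1000):
    g = nx.gnp_random_graph(n, p=0.1, seed=0)
    prompt = linearize(g)
    print(f"nodes={n} edges={g.number_of_edges()} words={len(prompt.split())}")
```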
Generalizable and Robust LLMs on Graphs. Another interesting direction is to explore the generalizability and robustness of LLMs on graphs. Generalizability refers to the ability to transfer knowledge learned from one domain graph to another, while robustness denotes making consistent predictions under obfuscations and attacks. Although LLMs have demonstrated strong generalizability and robustness in processing text, it is still an open problem whether these abilities exist for graph data.
LLM as Dynamic Agents on Graphs. Although LLMs have shown advanced capability in generating text, one-pass generation suffers from hallucination and misinformation issues due to the lack of accurate parametric knowledge. Simply augmenting retrieved knowledge in context is also bottlenecked by the capacity of the retriever. In many real-world scenarios, graphs such as academic networks and Wikipedia are dynamically looked up by humans for knowledge-guided reasoning. Simulating such a role of dynamic agents can help LLMs more accurately retrieve relevant information via multi-hop reasoning, thereby correcting their answers and alleviating hallucinations.
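A minimal sketch of such an agent loop is shown below: a (stubbed) LLM repeatedly decides which neighbor of the current node to expand on a toy citation graph, accumulating evidence over multiple hops before answering. The graph, the node texts, the stub policy, and the stopping rule are all invented for illustration.

```python
import networkx as nx

# Toy "academic network": papers as nodes, citation links as edges (invented).
G = nx.DiGraph()
G.add_edges_from([("paper_A", "paper_B"), ("paper_B", "paper_C")])
texts = {
    "paper_A": "Surveys large language models on graphs.",
    "paper_B": "Proposes a graph-aware language model.",
    "paper_C": "Introduces message passing on citation graphs.",
}

def llm_choose(candidates, evidence):
    """Stub for an LLM call that picks the next node to expand.
    A real agent would prompt an LLM with the evidence gathered so far."""
    return min(candidates) if candidates else None

def multi_hop_lookup(start, max_hops=2):
    evidence, current = [texts[start]], start
    for _ in range(max_hops):              # iterative, multi-hop retrieval
        nxt = llm_choose(set(G.successors(current)), evidence)
        if nxt is None:
            break
        evidence.append(texts[nxt])
        current = nxt
    return evidence                        # grounded context for the final answer

print(multi_hop_lookup("paper_A"))
```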
9 CONCLUSION
In this paper, we provide a comprehensive review of large language models on graphs. We first categorize the graph scenarios where LMs can be adopted and summarize the techniques for large language models on graphs. We then provide a thorough review, analysis, and comparison of the methods within each scenario. Furthermore, we summarize available datasets, open-source codebases, and multiple applications. Finally, we suggest future directions for large language models on graphs.

ACKNOWLEDGMENTS
This should be a simple paragraph before the References to thank those individuals and institutions who have supported your work on this article.

REFERENCES
[1] Yang, W., Xie, Y., Lin, A., Li, X., Tan, L., Xiong, K., Li, M. and Lin, J., “End-to-end open-domain question answering with bertserini,” in NAACL, 2019.
[2] Liu, Y. and Lapata, M., “Text Summarization with Pretrained Encoders,” in EMNLP, 2019.
[3] Wang, A., Singh, A., Michael, J., Hill, F., Levy, O. and Bowman, S.R., “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding,” in ICLR, 2018.
[4] Reimers, N. and Gurevych, I., “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks,” in EMNLP, 2019.
[5] Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D. and Chi, E.H., “Emergent Abilities of Large Language Models,” in TMLR, 2022.
[6] Nagamochi, H. and Ibaraki, T., “Algorithmic aspects of graph connectivity,” Cambridge University Press, 2018.
[7] Goldberg, A.V. and Harrelson, C., “Computing the shortest path: A search meets graph theory,” in SODA (Vol. 5, pp. 156-165), 2005.
[8] Sun, Z., Wang, H., Wang, H., Shao, B. and Li, J., “Efficient subgraph matching on billion node graphs,” in arXiv preprint arXiv:1205.6691, 2012.

[9] McCallum, A.K., Nigam, K., Rennie, J. and Seymore, K., “Automat- [34] Zou, T., Yu, L., Huang, Y., Sun, L. and Du, B., “Pretraining Language
ing the construction of internet portals with machine learning,” in Models with Text-Attributed Heterogeneous Graphs,” in arXiv
Information Retrieval, 3, pp.127-163, 2000. preprint arXiv:2310.12580, 2023.
[10] Giles, C.L., Bollacker, K.D. and Lawrence, S., “CiteSeer: An au- [35] Song, K., Tan, X., Qin, T., Lu, J. and Liu, T.Y., “Mpnet: Masked and
tomatic citation indexing system,” in Proceedings of the third ACM permuted pre-training for language understanding,” in NeurIPs.,
conference on Digital libraries (pp. 89-98), 1998. 2020.
[11] Wang, K., Shen, Z., Huang, C., Wu, C.H., Dong, Y. and Kanakia, [36] Duan, K., Liu, Q., Chua, T.S., Yan, S., Ooi, W.T., Xie, Q. and He, J.,
A., “Microsoft academic graph: When experts are not enough,” in “Simteg: A frustratingly simple approach improves textual graph
Quantitative Science Studies, 1(1), pp.396-413, 2020. learning,” in arXiv preprint arXiv:2308.02565., 2023.
[12] Zhang, Y., Jin, B., Zhu, Q., Meng, Y. and Han, J., “The Effect of [37] Kasneci, E., Seßler, K., Küchemann, S., Bannert, M., Dementieva, D.,
Metadata on Scientific Literature Tagging: A Cross-Field Cross- Fischer, F., Gasser, U., Groh, G., Günnemann, S., Hüllermeier, E. and
Model Study,” in WWW, 2023. Krusche, S., “ChatGPT for good? On opportunities and challenges
[13] Wan, M. and McAuley, J., “Item recommendation on monotonic of large language models for education,” in Learning and individual
behavior chains,” in Proceedings of the 12th ACM conference on differences, 103., 2023.
recommender systems, 2018. [38] Lester, B., Al-Rfou, R. and Constant, N., “The power of scale for
[14] Ni, J., Li, J. and McAuley, J., “Justifying recommendations using parameter-efficient prompt tuning,” in EMNLP, 2021.
distantly-labeled reviews and fine-grained aspects,” in EMNLP- [39] Li, X.L. and Liang, P., “Prefix-tuning: Optimizing continuous
IJCNLP, 2019. prompts for generation,” in ACL, 2021.
[15] Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B. and Eliassi- [40] Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Larous-
Rad, T., “Collective classification in network data,” in AI magazine, silhe, Q., Gesmundo, A., Attariyan, M. and Gelly, S., “Parameter-
29(3), pp.93-93, 2008. efficient transfer learning for NLP,” in ICML, 2019.
[16] Wang, X., Gao, T., Zhu, Z., Zhang, Z., Liu, Z., Li, J. and Tang, [41] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang,
J., “KEPLER: A unified model for knowledge embedding and pre- L. and Chen, W., “Lora: Low-rank adaptation of large language
trained language representation,” in TACL, 2021. models,” in ICLR, 2022.
[17] Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., & Philip, S. Y., [42] Tian, Y., Song, H., Wang, Z., Wang, H., Hu, Z., Wang, F., Chawla,
“A comprehensive survey on graph neural networks,” in IEEE N.V. and Xu, P., “Graph Neural Prompting with Large Language
transactions on neural networks and learning systems, 32(1), 4-24, 2020. Models,” in arXiv preprint arXiv:2309.15427., 2023.
[18] Liu, J., Yang, C., Lu, Z., Chen, J., Li, Y., Zhang, M., Bai, T., Fang, Y., [43] Chai, Z., Zhang, T., Wu, L., Han, K., Hu, X., Huang, X. and Yang, Y.,
Sun, L., Yu, P.S. and Shi, C., “Towards Graph Foundation Models: A “GraphLLM: Boosting Graph Reasoning Ability of Large Language
Survey and Beyond,” in arXiv preprint arXiv:2310.11829, 2023. Model,” in arXiv preprint arXiv:2310.05845., 2023.
[19] Pan, S., Luo, L., Wang, Y., Chen, C., Wang, J. and Wu, X., “Unifying [44] Wei, J., Bosma, M., Zhao, V.Y., Guu, K., Yu, A.W., Lester, B., Du, N.,
Large Language Models and Knowledge Graphs: A Roadmap,” in Dai, A.M. and Le, Q.V., “Finetuned language models are zero-shot
arXiv preprint arXiv:2306.08302, 2023. learners,” in ICLR., 2022.
[20] Sanh, V., Debut, L., Chaumond, J. and Wolf, T., “DistilBERT, a [45] Sanh, V., Webson, A., Raffel, C., Bach, S.H., Sutawika, L., Alyafeai,
distilled version of BERT: smaller, faster, cheaper and lighter.,” in Z., Chaffin, A., Stiegler, A., Scao, T.L., Raja, A. and Dey, M.,
arXiv preprint arXiv:1910.01108, 2019. “Multitask prompted training enables zero-shot task generalization,”
in ICLR., 2022.
[21] Wang, Y., Le, H., Gotmare, A.D., Bui, N.D., Li, J. and Hoi, S.C.,
[46] Tang, J., Yang, Y., Wei, W., Shi, L., Su, L., Cheng, S., Yin, D.
“Codet5+: Open code large language models for code understanding
and Huang, C., “GraphGPT: Graph Instruction Tuning for Large
and generation.,” in arXiv preprint arXiv:2305.07922, 2023.
Language Models,” in arXiv preprint arXiv:2310.13023., 2023.
[22] Devlin, J., Chang, M.W., Lee, K. and Toutanova, K., “Bert: Pre-
[47] Ye, R., Zhang, C., Wang, R., Xu, S. and Zhang, Y., “Natural language
training of deep bidirectional transformers for language understand-
is all a graph needs,” in arXiv preprint arXiv:2308.07134., 2023.
ing,” in NAACL, 2019.
[48] Zhao, H., Liu, S., Ma, C., Xu, H., Fu, J., Deng, Z.H., Kong, L. and Liu,
[23] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis,
Q., “GIMLET: A Unified Graph-Text Model for Instruction-Based
M., Zettlemoyer, L. and Stoyanov, V., “Roberta: A robustly optimized
Molecule Zero-Shot Learning,” in bioRxiv, pp.2023-05., 2023.
bert pretraining approach,” in arXiv preprint arXiv:1907.11692, 2019.
[49] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le,
[24] Beltagy, I., Lo, K. and Cohan, A., “SciBERT: A pretrained language Q.V. and Zhou, D., “Chain-of-thought prompting elicits reasoning
model for scientific text,” in arXiv preprint arXiv:1903.10676, 2019. in large language models,” in NeurIPs., 2022.
[25] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, [50] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y. and
P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., and Agarwal, Narasimhan, K., “Tree of thoughts: Deliberate problem solving with
“Language models are few-shot learners,” in NeurIPS, 2020. large language models,” in arXiv preprint arXiv:2305.10601., 2023.
[26] Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R.R. and Le, [51] Besta, M., Blach, N., Kubicek, A., Gerstenberger, R., Gianinazzi, L.,
Q.V., “Xlnet: Generalized autoregressive pretraining for language Gajda, J., Lehmann, T., Podstawski, M., Niewiadomski, H., Nyczyk,
understanding,” in NeurIPS, 2019. P. and Hoefler, T., “Graph of thoughts: Solving elaborate problems
[27] Clark, K., Luong, M.T., Le, Q.V. and Manning, C.D., “Electra: Pre- with large language models,” in arXiv preprint arXiv:2308.09687.,
training text encoders as discriminators rather than generators,” in 2023.
ICLR, 2020. [52] Cohan, A., Feldman, S., Beltagy, I., Downey, D. and Weld, D.S.,
[28] Meng, Y., Xiong, C., Bajaj, P., Bennett, P., Han, J. and Song, X., “Specter: Document-level representation learning using citation-
“Coco-lm: Correcting and contrasting text sequences for language informed transformers,” in ACL., 2020.
model pretraining,” in NeurIPs, 2021. [53] Ostendorff, M., Rethmeier, N., Augenstein, I., Gipp, B. and Rehm,
[29] Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, G., “Neighborhood contrastive learning for scientific document
A., Levy, O., Stoyanov, V. and Zettlemoyer, L., “Bart: Denoising representations with citation embeddings,” in EMNLP., 2022.
sequence-to-sequence pre-training for natural language generation, [54] Brannon, W., Fulay, S., Jiang, H., Kang, W., Roy, B., Kabbara, J. and
translation, and comprehension,” in ACL, 2020. Roy, D., “ConGraT: Self-Supervised Contrastive Pretraining for Joint
[30] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, Graph and Text Embeddings,” in arXiv preprint arXiv:2305.14321.,
M., Zhou, Y., Li, W. and Liu, P.J., “Exploring the limits of transfer 2023.
learning with a unified text-to-text transformer,” in JMLR, 2020. [55] Zhu, J., Song, X., Ioannidis, V.N., Koutra, D. and Faloutsos, C.,
[31] Yasunaga, M., Leskovec, J. and Liang, P., “LinkBERT: Pretraining “TouchUp-G: Improving Feature Representation through Graph-
Language Models with Document Links,” in ACL, 2022. Centric Finetuning,” in arXiv preprint arXiv:2309.13885., 2023.
[32] Jin, B., Zhang, W., Zhang, Y., Meng, Y., Zhang, X., Zhu, Q. and Han, [56] Li, Y., Ding, K. and Lee, K., “GRENADE: Graph-Centric Lan-
J., “Patton: Language Model Pretraining on Text-Rich Networks,” in guage Model for Self-Supervised Representation Learning on Text-
ACL, 2023. Attributed Graphs,” in EMNLP., 2023.
[33] Zhang, X., Malkov, Y., Florez, O., Park, S., McWilliams, B., Han, J. [57] Zhang, X., Malkov, Y., Florez, O., Park, S., McWilliams, B., Han, J.
and El-Kishky, A., “TwHIN-BERT: a socially-enriched pre-trained and El-Kishky, A., “TwHIN-BERT: A Socially-Enriched Pre-trained
language model for multilingual Tweet representations,” in KDD, Language Model for Multilingual Tweet Representations at Twitter,”
2023. in KDD., 2023.

[58] Zhang, X., Zhang, C., Dong, X.L., Shang, J. and Han, J., “Minimally- [83] Yasunaga, M., Bosselut, A., Ren, H., Zhang, X., Manning,
supervised structure-rich text categorization via learning on text-rich C.D., Liang, P.S. and Leskovec, J., “Deep bidirectional language-
networks,” in WWW., 2021. knowledge graph pretraining,” in NeurIPs., 2022.
[59] Chien, E., Chang, W.C., Hsieh, C.J., Yu, H.F., Zhang, J., Milenkovic, [84] Huang, J., Zhang, X., Mei, Q. and Ma, J., “CAN LLMS EF-
O., and Dhillon, I.S., “Node feature extraction by self-supervised FECTIVELY LEVERAGE GRAPH STRUCTURAL INFORMATION:
multi-scale neighborhood prediction,” in ICLR., 2022. WHEN AND WHY,” in arXiv preprint arXiv:2309.16595.., 2023.
[60] Zhang, Y., Shen, Z., Wu, C.H., Xie, B., Hao, J., Wang, Y.Y., Wang, K. [85] Kipf, T.N. and Welling, M., “Semi-supervised classification with
and Han, J., “Metadata-induced contrastive learning for zero-shot graph convolutional networks,” in ICLR., 2017.
multi-label text classification,” in WWW., 2022. [86] Hamilton, W., Ying, Z. and Leskovec, J., “Inductive representation
[61] Dinh, T.A., Boef, J.D., Cornelisse, J. and Groth, P., “E2EG: End- learning on large graphs,” in NeurIPs., 2017.
to-End Node Classification Using Graph Topology and Text-based [87] Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P. and
Node Attributes,” in arXiv preprint arXiv:2208.04609., 2022. Bengio, Y., “Graph attention networks,” in ICLR., 2018.
[62] Zhao, J., Qu, M., Li, C., Yan, H., Liu, Q., Li, R., Xie, X. and Tang, [88] Zhang, S., Liu, Y., Sun, Y. and Shah, N., “Graph-less Neural
J., “Learning on large-scale text-attributed graphs via variational Networks: Teaching Old MLPs New Tricks Via Distillation,” in
inference,” in ICLR., 2023. ICLR., 2022.
[63] Wen, Z. and Fang, Y., “Augmenting Low-Resource Text Classifica- [89] Liu, M., Gao, H. and Ji, S., “Towards deeper graph neural networks,”
tion with Graph-Grounded Pre-training and Prompting,” in SIGIR., in KDD., 2020.
2023. [90] Meng, Y., Huang, J., Zhang, Y. and Han, J., “Generating training
[64] Chen, Z., Mao, H., Wen, H., Han, H., Jin, W., Zhang, H., Liu, H. data with language models: Towards zero-shot language under-
and Tang, J., “Label-free Node Classification on Graphs with Large standing,” in NeurIPs., 2022.
Language Models (LLMS),” in arXiv preprint arXiv:2310.04668., 2023. [91] Sun, Y., Han, J., Yan, X., Yu, P.S. and Wu, T., “Pathsim: Meta
[65] Huang, X., Han, K., Bao, D., Tao, Q., Zhang, Z., Yang, Y. and Zhu, path-based top-k similarity search in heterogeneous information
Q., “Prompt-based Node Feature Extractor for Few-shot Learning on networks,” in VLDB., 2011.
Text-Attributed Graphs,” in arXiv preprint arXiv:2309.02848., 2023. [92] Liu, H., Li, C., Wu, Q. and Lee, Y.J., “Visual instruction tuning,” in
[66] Zhao, J., Zhuo, L., Shen, Y., Qu, M., Liu, K., Bronstein, M., Zhu, Z. NeurIPs., 2023.
and Tang, J., “Graphtext: Graph reasoning in text space,” in arXiv [93] Park, C., Kim, D., Han, J. and Yu, H., “Unsupervised attributed
preprint arXiv:2310.01089., 2023. multiplex network embedding,” in AAAI., 2020.
[67] Meng, Y., Zong, S., Li, X., Sun, X., Zhang, T., Wu, F. and Li, J., [94] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez,
“Gnn-lm: Language modeling based on global contexts via gnn,” in A.N., Kaiser, Ł. and Polosukhin, I., “Attention is all you need,” in
ICLR., 2022. NeurIPs., 2017.
[68] Zhang, X., Bosselut, A., Yasunaga, M., Ren, H., Liang, P., Manning, [95] Haveliwala, T.H., “Topic-sensitive pagerank,” in WWW., 2002.
C.D. and Leskovec, J., “Greaselm: Graph reasoning enhanced [96] Oord, A.V.D., Li, Y. and Vinyals, O., “Representation learning with
language models for question answering,” in ICLR., 2022. contrastive predictive coding,” in arXiv preprint arXiv:1807.03748.,
[69] Ioannidis, V.N., Song, X., Zheng, D., Zhang, H., Ma, J., Xu, Y., Zeng, 2018.
B., Chilimbi, T. and Karypis, G., “Efficient and effective training of [97] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agar-
language and graph neural network models,” in AAAI, 2023. wal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J. and Krueger,
[70] Mavromatis, C., Ioannidis, V.N., Wang, S., Zheng, D., Adeshina, S., G., “Learning transferable visual models from natural language
Ma, J., Zhao, H., Faloutsos, C. and Karypis, G., “Train Your Own supervision,” in ICML., 2021.
GNN Teacher: Graph-Aware Distillation on Textual Graphs,” in [98] Whalen, R., “Legal networks: The promises and challenges of legal
PKDD, 2023. network analysis,” in Mich. St. L. Rev.., 2016.
[71] He, X., Bresson, X., Laurent, T. and Hooi, B., “Explanations as [99] Friedrich, A. and Palmer, A. and Pinkal, M., “Situation entity types:
Features: LLM-Based Features for Text-Attributed Graphs,” in arXiv automatic classification of clause-level aspect,” in Proceedings of the
preprint arXiv:2305.19523., 2023. 54th Annual Meeting of the Association for Computational Linguistics
[72] Yu, J., Ren, Y., Gong, C., Tan, J., Li, X. and Zhang, X., “Empower Text- (Volume 1: Long Papers)., 2016.
Attributed Graphs Learning with Large Language Models (LLMs),” [100] Guha, N., Nyarko, J., Ho, D.E., Ré, C., Chilton, A., Narayana,
in arXiv preprint arXiv:2310.09872., 2023. A., Chohlas-Wood, A., Peters, A., Waldon, B., Rockmore, D.N. and
[73] Yang, J., Liu, Z., Xiao, S., Li, C., Lian, D., Agrawal, S., Singh, A., Zambrano, D., “Legalbench: A collaboratively built benchmark for
Sun, G. and Xie, X., “GraphFormers: GNN-nested transformers for measuring legal reasoning in large language models,” in arXiv
representation learning on textual graph,” in NeurIPs., 2021. preprint arXiv:2308.11462., 2023.
[74] Jin, B., Zhang, Y., Zhu, Q. and Han, J., “Heterformer: Transformer- [101] Bai, X., Wang, M., Lee, I., Yang, Z., Kong, X. and Xia, F., “Scientific
based deep node representation learning on heterogeneous text-rich paper recommendation: A survey,” in Ieee Access, 7, pp.9324-9339,
networks,” in KDD., 2023. 2019.
[75] Jin, B., Zhang, Y., Meng, Y. and Han, J., “Edgeformers: Graph- [102] Chowdhury, S. and Schoen, M.P., “Research paper classification
Empowered Transformers for Representation Learning on Textual- using supervised machine learning techniques,” in Intermountain
Edge Networks,” in ICLR., 2023. Engineering, Technology and Computing, 2020.
[76] Jin, B., Zhang, W., Zhang, Y., Meng, Y., Zhao, H. and Han, J., [103] Madigan, D., Genkin, A., Lewis, D.D., Argamon, S., Fradkin, D.
“Learning Multiplex Embeddings on Text-rich Networks with One and Ye, L., “Author identification on the large scale,” in Proceedings
Text Encoder,” in arXiv preprint arXiv:2310.06684., 2023. of the 2005 Meeting of the Classification Society of North America (CSNA),
[77] Qin, Y., Wang, X., Zhang, Z. and Zhu, W., “Disentangled Represen- 2005.
tation Learning with Large Language Models for Text-Attributed [104] He, X., Deng, K., Wang, X., Li, Y., Zhang, Y. and Wang, M.,
Graphs,” in arXiv preprint arXiv:2310.18152., 2023. “Lightgcn: Simplifying and powering graph convolution network
[78] Zhang, Y., Shen, Z., Dong, Y., Wang, K. and Han, J., “MATCH: for recommendation,” in SIGIR, 2020.
Metadata-aware text classification in a large hierarchy,” in WWW., [105] Chang, J., Gao, C., He, X., Jin, D. and Li, Y., “Bundle recommenda-
2021. tion with graph convolutional networks,” in SIGIR, 2020.
[79] Zhu, J., Cui, Y., Liu, Y., Sun, H., Li, X., Pelger, M., Yang, T., Zhang, [106] Xu, H., Liu, B., Shu, L. and Yu, P., “Open-world learning and
L., Zhang, R. and Zhao, H., “Textgnn: Improving text encoder via application to product classification,” in WWW, 2019.
graph neural network in sponsored search,” in WWW., 2021. [107] Chen, L., Xie, Y., Zheng, Z., Zheng, H. and Xie, J., “Friend recom-
[80] Li, C., Pang, B., Liu, Y., Sun, H., Liu, Z., Xie, X., Yang, T., Cui, mendation based on multi-social graph convolutional network,” in
Y., Zhang, L. and Zhang, Q., “Adsgnn: Behavior-graph augmented IEEE Access, 8, pp.43618-43629, 2020.
relevance modeling in sponsored search,” in SIGIR., 2021. [108] Wang, G., Zhang, X., Tang, S., Zheng, H. and Zhao, B.Y., “Unsu-
[81] Zhang, J., Chang, W.C., Yu, H.F. and Dhillon, I., “Fast multi- pervised clickstream clustering for user behavior analysis,” in CHI,
resolution transformer fine-tuning for extreme multi-label text 2016.
classification,” in NeurIPs., 2021. [109] Shchur, O. and Günnemann, S., “Overlapping community detec-
[82] Xie, H., Zheng, D., Ma, J., Zhang, H., Ioannidis, V.N., Song, X., Ping, tion with graph neural networks,” in arXiv preprint arXiv:1909.12201.,
Q., Wang, S., Yang, C., Xu, Y. and Zeng, B., “Graph-Aware Language 2019.
Model Pre-Training on a Large Graph Corpus Can Help Multiple [110] Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S.,
Graph Applications,” in KDD., 2023. Yogatama, D., Bosma, M., Zhou, D., Metzler, D. and Chi, E.H., 2022.

”Emergent Abilities of Large Language Models” in Transactions on of large language model with knowledge graph” in arXiv preprint
Machine Learning Research, 2022. arXiv:2307.07697, 2023.
[111] Kojima, T., Gu, S.S., Reid, M., Matsuo, Y. and Iwasawa, Y., 2022. [133] Danny Z. Chen. 1996. ”Developing algorithms and software for
”Large language models are zero-shot reasoners” in Advances in geometric path planning problems” in ACM Comput. Surv. 28, 4es
neural information processing systems, 35, pp.22199-22213. (Dec. 1996), 18–es. https://doi.org/10.1145/242224.242246, 1996.
[112] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, [134] Iqbal A., Hossain Md., Ebna A. (2018). ”Airline Scheduling with
E., Le, Q.V. and Zhou, D., 2022. ”Chain-of-thought prompting Max Flow algorithm” in International Journal of Computer Applications,
elicits reasoning in large language models” in Advances in Neural 2018.
Information Processing Systems, 35, pp.24824-24837. [135] Li Jiang, Xiaoning Zang, Ibrahim I.Y. Alghoul, Xiang Fang, Junfeng
[113] Radford, A., 2019. ”Language Models are Unsupervised Multitask Dong, Changyong Liang, 2022. ”Scheduling the covering delivery
Learners” in OpenAI Blog, 2019. problem in last mile delivery” in Expert Systems with Applications,
[114] Mikolov, T., Chen, K., Corrado, G. and Dean, J., 2013. ”Efficient 2022.
estimation of word representations in vector space” in arXiv preprint [136] Zhang, X., Wang, L., Helwig, J., Luo, Y., Fu, C., Xie, Y., ... & Ji, S.
arXiv:1301.3781. (2023). Artificial intelligence for science in quantum, atomistic, and
[115] Pennington, J., Socher, R. and Manning, C.D., 2014, October. continuum systems. arXiv preprint arXiv:2307.08423.
”Glove: Global vectors for word representation” in Proceedings of [137] Rusch, T. K., Bronstein, M. M., & Mishra, S. (2023). A sur-
the 2014 conference on empirical methods in natural language processing vey on oversmoothing in graph neural networks. arXiv preprint
(EMNLP) (pp. 1532-1543). arXiv:2303.10993.
[116] Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P. and Soricut, [138] Topping, J., Di Giovanni, F., Chamberlain, B. P., Dong, X., & Bron-
R., 2019, September. ”ALBERT: A Lite BERT for Self-supervised stein, M. M. (2021). Understanding over-squashing and bottlenecks
Learning of Language Representations” in International Conference on graphs via curvature. arXiv preprint arXiv:2111.14522.
on Learning Representations. [139] Zhang, B., Luo, S., Wang, L., & He, D. (2023). Rethinking the
[117] Clark, K., Luong, M.T., Le, Q.V. and Manning, C.D., 2019, expressive power of gnns via graph biconnectivity. arXiv preprint
September. ”ELECTRA: Pre-training Text Encoders as Discriminators arXiv:2301.09505.
Rather Than Generators” in International Conference on Learning [140] Müller L, Galkin M, Morris C, Rampášek L. Attending to graph
Representations. transformers. arXiv preprint arXiv:2302.04181. 2023 Feb 8.
[118] Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, [141] Ying, C., Cai, T., Luo, S., Zheng, S., Ke, G., He, D., ... & Liu, T. Y.
E., Kamar, E., Lee, P., Lee, Y.T., Li, Y., Lundberg, S. and Nori, H., (2021). Do transformers really perform badly for graph represen-
2023. ”Sparks of artificial general intelligence: Early experiments tation?. Advances in Neural Information Processing Systems, 34,
with gpt-4” in arXiv preprint arXiv:2303.12712. 28877-28888.
[119] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, [142] Rampášek, L., Galkin, M., Dwivedi, V. P., Luu, A. T., Wolf, G.,
Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S. and Bikel, D., & Beaini, D. (2022). Recipe for a general, powerful, scalable graph
2023. ”Llama 2: Open foundation and fine-tuned chat models” in transformer. Advances in Neural Information Processing Systems,
arXiv preprint arXiv:2307.09288. 35, 14501-14515.
[120] Jiang, A.Q., Sablayrolles, A., Mensch, A., Bamford, C., Chap-
[143] Liu, G., Zhao, T., Inae, E., Luo, T., & Jiang, M. (2023).
lot, D.S., Casas, D.D.L., Bressand, F., Lengyel, G., Lample, G.,
Semi-Supervised Graph Imbalanced Regression. arXiv preprint
Saulnier, L. and Lavaud, L.R., 2023. ”Mistral 7B” in arXiv preprint
arXiv:2305.12087.
arXiv:2310.06825.
[144] Wu Q, Zhao W, Li Z, Wipf DP, Yan J. Nodeformer: A scalable graph
[121] Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson,
structure learning transformer for node classification. Advances in
Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M. and Ring, R.,
Neural Information Processing Systems. 2022 Dec 6;35:27387-401.
2022. ”Flamingo: a visual language model for few-shot learning” in
Advances in Neural Information Processing Systems (pp. 23716-23736). [145] Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2021).
Roformer: Enhanced transformer with rotary position embedding.
[122] Edwards, C., Zhai, C. and Ji, H., 2021, November. ”Text2mol:
arXiv preprint arXiv:2104.09864.
Cross-modal molecule retrieval with natural language queries” in
Proceedings of the 2021 Conference on Empirical Methods in Natural [146] Balaban, A. T., Applications of graph theory in chemistry. Journal
Language Processing (pp. 595-607). of chemical information and computer sciences, 25(3), 334-343, 1985.
[123] Edwards, C., Lai, T., Ros, K., Honke, G., Cho, K. and Ji, H., 2022, [147] Liu, G., Zhao, T., Xu, J., Luo, T., & Jiang, M., Graph rationalization
December. ”Translation between Molecules and Natural Language” with environment-based augmentations, In ACM SIGKDD, 2022.
in Proceedings of the 2022 Conference on Empirical Methods in Natural [148] Bran, A. M., Cox, S., White, A. D., & Schwaller, P., ChemCrow:
Language Processing (pp. 375-413). Augmenting large-language models with chemistry tools, arXiv
[124] Wang, H., Feng, S., He, T., Tan, Z., Han, X. and Tsvetkov, Y., 2023. preprint arXiv:2304.05376, 2023.
”Can Language Models Solve Graph Problems in Natural Language?” [149] Borgwardt, K. M., Ong, C. S., Schönauer, S., Vishwanathan, S. V.
in arXiv preprint arXiv:2305.10037., 2023. N., Smola, A. J., & Kriegel, H. P., Protein function prediction via
[125] Liu, C. and Wu, B., 2023. ”Evaluating large language models on graph kernels. Bioinformatics, 21, i47-i56, 2005.
graphs: Performance insights and comparative analysis” in arXiv [150] Riesen, K., & Bunke, H., IAM graph database repository for
preprint arXiv:2308.11224, 2023. graph based pattern recognition and machine learning. In Structural,
[126] Guo, J., Du, L. and Liu, H., 2023. ”GPT4Graph: Can Large Syntactic, and Statistical Pattern Recognition: Joint IAPR International
Language Models Understand Graph Structured Data? An Empirical Workshop, SSPR & SPR 2008, Orlando, USA, December 4-6, 2008.
Evaluation and Benchmarking” in arXiv preprint arXiv:2305.15066, Proceedings (pp. 287-297). Springer Berlin Heidelberg.
2023. [151] Jain, N., Coyle, B., Kashefi, E., & Kumar, N., Graph neural network
[127] Zhang, J., 2023. ”Graph-ToolFormer: To Empower LLMs with initialisation of quantum approximate optimisation. Quantum, 6,
Graph Reasoning Ability via Prompt Augmented by ChatGPT” in 861, 2022.
arXiv preprint arXiv:2304.11116, 2023. [152] Weininger, D., SMILES, a chemical language and information
[128] Zhang, Z., Wang, X., Zhang, Z., Li, H., Qin, Y., Wu, S. and Zhu, system. 1. Introduction to methodology and encoding rules. Journal
W., 2023. ”LLM4DyG: Can Large Language Models Solve Problems of chemical information and computer sciences, 28(1), 31-36, 1988
on Dynamic Graphs?” in arXiv preprint arXiv:2310.17110, 2023. [153] Heller S, McNaught A, Stein S, Tchekhovskoi D, Pletnev I. InChI-
[129] Luo, L., Li, Y.F., Haffari, G. and Pan, S., 2023. ”Reasoning on the worldwide chemical structure identifier standard. Journal of
graphs: Faithful and interpretable large language model reasoning” cheminformatics. 2013 Dec;5(1):1-9.
in arXiv preprint arXiv:2310.01061, 2023. [154] O’Boyle, N., & Dalke, A., DeepSMILES: an adaptation of SMILES
[130] Jiang, J., Zhou, K., Dong, Z., Ye, K., Zhao, W.X. and Wen, J.R., 2023. for use in machine-learning of chemical structures, 2018.
”Structgpt: A general framework for large language model to reason [155] Krenn, M., Häse, F., Nigam, A., Friederich, P., & Aspuru-Guzik, A.,
over structured data” in arXiv preprint arXiv:2305.09645, 2023. Self-referencing embedded strings (SELFIES): A 100% robust molec-
[131] Fatemi, B., Halcrow, J. and Perozzi, B., 2023. ”Talk like a graph: ular string representation. Machine Learning: Science and Technology,
Encoding graphs for large language models” in arXiv preprint 1(4), 045024, 2020.
arXiv:2310.04560, 2023. [156] Bjerrum, E. J. (2017). SMILES enumeration as data augmenta-
[132] Sun, J., Xu, C., Tang, L., Wang, S., Lin, C., Gong, Y., Shum, H.Y. tion for neural network modeling of molecules. arXiv preprint
and Guo, J., 2023. ”Think-on-graph: Deep and responsible reasoning arXiv:1703.07076.

[157] Arús-Pous, J., Johansson, S. V., Prykhodko, O., Bjerrum, E. J., [182] Lacombe, R., Gaut, A., He, J., Lüdeke, D., & Pistunova, K., Extract-
Tyrchan, C., Reymond, J. L., ... & Engkvist, O. (2019). Randomized ing Molecular Properties from Natural Language with Multimodal
SMILES strings improve the quality of molecular generative models. Contrastive Learning, ICML Workshop on Computational Biology, 2023.
Journal of cheminformatics, 11(1), 1-13. [183] Su, B., Du, D., Yang, Z., Zhou, Y., Li, J., Rao, A., ... & Wen, J. R.,
[158] Tetko IV, Karpov P, Bruno E, Kimber TB, Godin G. Augmentation A molecular multimodal foundation model associating molecule
is what you need!. InInternational Conference on Artificial Neural graphs with natural language, arXiv preprint arXiv:2209.05481. 2022.
Networks 2019 Sep 9 (pp. 831-835). Cham: Springer International [184] Zeng, Z., Yao, Y., Liu, Z., & Sun, M., A deep-learning system
Publishing. bridging molecule structure and biomedical text with comprehen-
[159] van Deursen R, Ertl P, Tetko IV, Godin G. GEN: highly efficient sion comparable to human professionals, Nature communications,
SMILES explorer using autodidactic generative examination net- 13(1), 862 ,2022.
works. Journal of Cheminformatics. 2020 Dec;12(1):1-4. [185] Iwayama, M., Wu, S., Liu, C., & Yoshida, R., Functional Output
[160] Schwaller, P., Gaudin, T., Lanyi, D., Bekas, C., & Laino, T., Regression for Machine Learning in Materials Science. Journal of
“Found in Translation”: predicting outcomes of complex organic Chemical Information and Modeling, 62(20), 4837-4851, 2022.
chemistry reactions using neural sequence-to-sequence models. [186] Bagal V, Aggarwal R, Vinod PK, Priyakumar UD. MolGPT:
Chemical science, 9(28), 6091-6098, 2018. molecular generation using a transformer-decoder model. Journal of
[161] Morgan, H. L.,. The generation of a unique machine description Chemical Information and Modeling. 2021 Oct 25;62(9):2064-76.
for chemical structures-a technique developed at chemical abstracts [187] Taylor, R., Kardas, M., Cucurull, G., Scialom, T., Hartshorn, A.,
service. Journal of chemical documentation, 5(2), 107-113, 1965. Saravia, E., ... & Stojnic, R., Galactica: A large language model for
[162] Sennrich, R., Haddow, B., & Birch, A. Neural machine translation science. arXiv preprint arXiv:2211.09085, 2022.
of rare words with subword units, in ACL, 2016. [188] Wang, S., Guo, Y., Wang, Y., Sun, H., & Huang, J., Smiles-bert: large
[163] Kudo, T., & Richardson, J., Sentencepiece: A simple and language scale unsupervised pre-training for molecular property prediction.
independent subword tokenizer and detokenizer for neural text In BCB, 2019
processing, in EMNLP, 2018. [189] Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., & Kang, J.,
[164] Irwin, R., Dimitriadis, S., He, J., & Bjerrum, E. J. (2022). Chem- BioBERT: a pre-trained biomedical language representation model
former: a pre-trained transformer for computational chemistry. for biomedical text mining. Bioinformatics, 36(4), 1234-1240, 2020.
Machine Learning: Science and Technology, 3(1), 015022. [190] Ma, R., & Luo, T. (2020). PI1M: a benchmark database for polymer
[165] Zhao WX, Zhou K, Li J, Tang T, Wang X, Hou Y, Min Y, Zhang B, informatics. Journal of Chemical Information and Modeling, 60(10),
Zhang J, Dong Z, Du Y. A survey of large language models. arXiv 4684-4690.
preprint arXiv:2303.18223. 2023 Mar 31. [191] Li, X., Xu, Y., Lai, L., & Pei, J. Prediction of human cytochrome
[166] Shi, Y., Zhang, A., Zhang, E., Liu, Z., & Wang, X., ReLM: P450 inhibition using a multitask deep autoencoder neural network.
Leveraging Language Models for Enhanced Chemical Reaction Molecular pharmaceutics, 15(10), 4336-4345, 2020.
Prediction, in EMNLP, 2023. [192] Stanley M, Bronskill JF, Maziarz K, Misztela H, Lanini J, Segler M,
[167] Liu P, Ren Y, Ren Z., Git-mol: A multi-modal large language model Schneider N, Brockschmidt M. Fs-mol: A few-shot learning dataset
for molecular science with graph, image, and text, arXiv preprint of molecules, in NeurIPS, 2021.
arXiv:2308.06911, 2023 [193] Hastings, J., Owen, G., Dekker, A., Ennis, M., Kale, N., Muthukr-
[168] Ock J, Guntuboina C, Farimani AB. Catalyst Property Prediction ishnan, V., ... & Steinbeck, C., ChEBI in 2016: Improved services and
with CatBERTa: Unveiling Feature Exploration Strategies through an expanding collection of metabolites. Nucleic acids research, 44(D1),
Large Language Models. arXiv preprint arXiv:2309.00563, 2023. D1214-D1219, 2016.
[169] Fang Y, Liang X, Zhang N, Liu K, Huang R, Chen Z, Fan X, Chen [194] Kim, S., Chen, J., Cheng, T., Gindulyte, A., He, J., He, S., ... &
H., Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset Bolton, E. E., PubChem 2019 update: improved access to chemical
for Large Language Models. arXiv preprint arXiv:2306.08018, 2023. data, Nucleic acids research, 47(D1), D1102-D1109, 2019.
[170] Abdine H, Chatzianastasis M, Bouyioukos C, Vazirgiannis M., [195] Gaulton, A., Bellis, L. J., Bento, A. P., Chambers, J., Davies, M.,
Prot2Text: Multimodal Protein’s Function Generation with GNNs Hersey, A., ... & Overington, J. P., ChEMBL: a large-scale bioactivity
and Transformers, arXiv preprint arXiv:2307.14367, 2023. database for drug discovery. Nucleic acids research, 40(D1), D1100-
[171] Luo Y, Yang K, Hong M, Liu X, Nie Z., MolFM: A Multimodal D1107. 2012.
Molecular Foundation Model, arXiv preprint arXiv:2307.09484, 2023. [196] Zdrazil B, Felix E, Hunter F, Manners EJ, Blackshaw J, Corbett
[172] Qian, C., Tang, H., Yang, Z., Liang, H., & Liu, Y., Can large S, de Veij M, Ioannidis H, Lopez DM, Mosquera JF, Magarinos MP.
language models empower molecular property prediction? arXiv The ChEMBL Database in 2023: a drug discovery platform spanning
preprint arXiv:2307.07443, 2023 multiple bioactivity data types and time periods. Nucleic Acids
[173] Born, J., & Manica, M., Regression Transformer enables concur- Research. 2023 Nov 2:gkad1004.
rent sequence regression and generation for molecular language [197] Sterling, T. and Irwin, J.J., ZINC 15–ligand discovery for everyone.
modelling. Nature Machine Intelligence, 5(4), 432-444, 2023. Journal of chemical information and modeling, 55(11), pp.2324-2337,
[174] Li J, Liu Y, Fan W, Wei XY, Liu H, Tang J, Li Q., Empower- 2015.
ing Molecule Discovery for Molecule-Caption Translation with [198] Blum LC, Reymond JL. 970 million druglike small molecules for
Large Language Models: A ChatGPT Perspective. arXiv preprint virtual screening in the chemical universe database GDB-13. Journal
arXiv:2306.06615, 2023. of the American Chemical Society. 2009 Jul 1;131(25):8732-3.
[175] Zeng, Z., Yin, B., Wang, S., Liu, J., Yang, C., Yao, H., ... & Liu, [199] Mellor, C. L., Robinson, R. M., Benigni, R., Ebbrell, D., Enoch, S.
Z., Interactive Molecular Discovery with Natural Language. arXiv J., Firman, J. W., ... & Cronin, M. T. D. (2019). Molecular fingerprint-
preprint arXiv:2306.11976, 2023. derived similarity measures for toxicological read-across: Recom-
[176] Liu Z, Li S, Luo Y, Fei H, Cao Y, Kawaguchi K, Wang X, Chua TS., mendations for optimal use. Regulatory Toxicology and Pharmacology,
MolCA: Molecular Graph-Language Modeling with Cross-Modal 101, 121-134.
Projector and Uni-Modal Adapter, in EMNLP, 2023. [200] Maggiora, G., Vogt, M., Stumpfe, D., & Bajorath, J. (2014). Molec-
[177] Guo T, Guo K, Liang Z, Guo Z, Chawla NV, Wiest O, Zhang X. ular similarity in medicinal chemistry: miniperspective. Journal of
What indeed can GPT models do in chemistry? A comprehensive medicinal chemistry, 57(8), 3186-3204.
benchmark on eight tasks. in NeurIPS, 2023. [201] Rogers, D., & Hahn, M. (2010). Extended-connectivity fingerprints.
[178] Liu Z, Zhang W, Xia Y, Wu L, Xie S, Qin T, Zhang M, Liu TY., Journal of chemical information and modeling, 50(5), 742-754.
MolXPT: Wrapping Molecules with Text for Generative Pre-training, [202] Beisken, S., Meinl, T., Wiswedel, B., de Figueiredo, L. F., Berthold,
in ACL, 2023. M., & Steinbeck, C. (2013). KNIME-CDK: Workflow-driven chemin-
[179] Seidl, P., Vall, A., Hochreiter, S., & Klambauer, G., Enhancing formatics. BMC bioinformatics, 14, 1-4.
activity prediction models in drug discovery with the ability to [203] Krenn, M., Ai, Q., Barthel, S., Carson, N., Frei, A., Frey, N. C., ...
understand human language, in ICML, 2023. & Aspuru-Guzik, A. (2022). SELFIES and the future of molecular
[180] Christofidellis, D., Giannone, G., Born, J., Winther, O., Laino, T., string representations. Patterns, 3(10).
& Manica, M., Unifying molecular and textual representations via [204] Hu, W., Fey, M., Zitnik, M., Dong, Y., Ren, H., Liu, B., ... &
multi-task language modelling, in ICML, 2023. Leskovec, J., Open graph benchmark: Datasets for machine learning
[181] Liu, S., Nie, W., Wang, C., Lu, J., Qiao, Z., Liu, L., ... & Anandku- on graphs. In NeurIPS, 2020.
mar, A. Multi-modal molecule structure-text model for text-based [205] Xu, K., Hu, W., Leskovec, J., & Jegelka, S. How powerful are graph
retrieval and editing, Nature Machine Intelligence, 2023. neural networks? In ICLR, 2019.

[206] Leman, A. A., & Weisfeiler, B. (1968). A reduction of a graph [232] Jo, J., Lee, S., & Hwang, S. J. (2022, June). Score-based genera-
to a canonical form and an algebra arising during this reduction. tive modeling of graphs via the system of stochastic differential
Nauchno-Technicheskaya Informatsiya, 2(9), 12-16. equations. In International Conference on Machine Learning (pp.
[207] Li, J., Li, D., Savarese, S., & Hoi, S., Blip-2: Bootstrapping language- 10362-10383). PMLR.
image pre-training with frozen image encoders and large language
models. arXiv preprint arXiv:2301.12597.
[208] Wu, Z., Ramsundar, B., Feinberg, E. N., Gomes, J., Geniesse, C.,
Pappu, A. S., ... & Pande, V. (2018). MoleculeNet: a benchmark for
molecular machine learning. Chemical science, 9(2), 513-530.
[209] AIDS Antiviral Screen Data. https://wiki.nci.nih.gov/display/NCIDTPdata/
AIDS+Antiviral+Screen+Data, Accessed: 2017-09-27
[210] Subramanian, G., Ramsundar, B., Pande, V., & Denny, R. A. (2016).
Computational modeling of β -secretase 1 (BACE-1) inhibitors using
ligand based approaches. Journal of chemical information and
modeling, 56(10), 1936-1949.
[211] Martins, I. F., Teixeira, A. L., Pinheiro, L., & Falcao, A. O. (2012).
A Bayesian approach to in silico blood-brain barrier penetration
modeling. Journal of chemical information and modeling, 52(6),
1686-1697.
[212] Tox21 Challenge. https://tripod.nih.gov/tox21/challenge/, Ac-
cessed: 2017-09- 27
[213] Altae-Tran H, Ramsundar B, Pappu AS, Pande V. Low data drug
discovery with one-shot learning. ACS central science. 2017 Apr
26;3(4):283-93.
[214] Novick PA, Ortiz OF, Poelman J, Abdulhay AY, Pande VS.
SWEETLEAD: an in silico database of approved drugs, regulated
chemicals, and herbal isolates for computer-aided drug discovery.
PloS one. 2013 Nov 1;8(11):e79568.
[215] Aggregate Analysis of ClincalTrials.gov (AACT) Database.
https://www.ctti-clinicaltrials.org/aact-database, Accessed: 2017-
09-27.
[216] Mobley DL, Guthrie JP. FreeSolv: a database of experimental
and calculated hydration free energies, with input files. Journal of
computer-aided molecular design. 2014 Jul;28:711-20.
[217] Delaney, J. S. (2004). ESOL: estimating aqueous solubility directly
from molecular structure. Journal of chemical information and
computer sciences, 44(3), 1000-1005.
[218] Zhao, T., Liu, G., Wang, D., Yu, W., & Jiang, M. (2022, June). Learn-
ing from counterfactual links for link prediction. In International
Conference on Machine Learning (pp. 26911-26926). PMLR.
[219] Vignac, C., Krawczuk, I., Siraudin, A., Wang, B., Cevher, V., &
Frossard, P. (2022). Digress: Discrete denoising diffusion for graph
generation. arXiv preprint arXiv:2209.14734.
[220] Zang, C., & Wang, F. Moflow: an invertible flow model for
generating molecular graphs. In ACM SIGKDD, 2020.
[221] Deng, J., Yang, Z., Wang, H., Ojima, I., Samaras, D., & Wang, F.
(2023). A systematic study of key elements underlying molecular
property prediction. Nature Communications, 14(1), 6395.
[222] Böhm, H. J., Flohr, A., & Stahl, M. (2004). Scaffold hopping. Drug
discovery today: Technologies, 1(3), 217-224.
[223] Renz, P., Van Rompaey, D., Wegner, J. K., Hochreiter, S., &
Klambauer, G. (2019). On failure modes in molecule generation
and optimization. Drug Discovery Today: Technologies, 32, 55-63.
[224] Polykovskiy, D., Zhebrak, A., Sanchez-Lengeling, B., Golovanov,
S., Tatanov, O., Belyaev, S., ... & Zhavoronkov, A. (2020). Molecular
sets (MOSES): a benchmarking platform for molecular generation
models. Frontiers in pharmacology, 11, 565644.
[225] Reymond, J. L. (2015). The chemical space project. Accounts of
Chemical Research, 48(3), 722-730.
[226] Lin, A., Horvath, D., Afonina, V., Marcou, G., Reymond, J. L., &
Varnek, A. (2018). Mapping of the Available Chemical Space versus
the Chemical Universe of Lead-Like Compounds. ChemMedChem,
13(6), 540-554.
[227] Gao, W., Fu, T., Sun, J., & Coley, C. (2022). Sample efficiency mat-
ters: a benchmark for practical molecular optimization. Advances in
Neural Information Processing Systems, 35, 21342-21357.
[228] Wang, H., Fu, T., Du, Y., Gao, W., Huang, K., Liu, Z., ... & Zitnik,
M. (2023). Scientific discovery in the age of artificial intelligence.
Nature, 620(7972), 47-60.
[229] Liu, G., Inae, E., Zhao, T., Xu, J., Luo, T., & Jiang, M. (2023). Data-
Centric Learning from Unlabeled Graphs with Diffusion Model.
arXiv preprint arXiv:2303.10108.
[230] https://practicalcheminformatics.blogspot.com/2023/08/we-
need-better-benchmarks-for-machine.html?m=1
[231] Merchant, A., Batzner, S., Schoenholz, S. S., Aykol, M., Cheon, G.,
& Cubuk, E. D. (2023). Scaling deep learning for materials discovery.
Nature, 1-6.
