Large Language Model
A large language model (LLM) is a computational model capable of language generation and other natural
language processing tasks. As language models, LLMs acquire these abilities by learning statistical
relationships from vast amounts of text during a self-supervised and semi-supervised training process.[1]
The largest and most capable LLMs, as of August 2024, are artificial neural networks built with a decoder-
only transformer-based architecture, which enables efficient processing and generation of large-scale text
data. Modern models can be fine-tuned for specific tasks or can be guided by prompt engineering.[2] These
models acquire predictive power regarding syntax, semantics, and ontologies[3] inherent in human language
corpora, but they also inherit inaccuracies and biases present in the data they are trained on.[4]
Some notable LLMs are OpenAI's GPT series of models (e.g., GPT-3.5, GPT-4 and GPT-4o; used in ChatGPT
and Microsoft Copilot), Google's Gemini (used in the chatbot of the same name), Meta's LLaMA family of
models, IBM's Granite models initially released with Watsonx, Anthropic's Claude models, and Mistral AI's
models.
History
Before 2017, there were a few language models that were large relative to the capacities then available. In
the 1990s, the IBM alignment models pioneered statistical language modelling. In 2001, a smoothed n-gram model
trained on 0.3 billion words achieved state-of-the-art (SOTA) perplexity at the time.[5] In the 2000s, as Internet
use became prevalent, some researchers constructed Internet-scale language datasets ("web as corpus"[6]),
upon which they trained statistical language models.[7][8] In 2009, statistical language models dominated
over symbolic language models in most language processing tasks, as they could usefully ingest large
datasets.[9]
After neural networks became dominant in image processing around 2012, they were applied to language
modelling as well. Google converted its translation service to Neural Machine Translation in 2016. Because this
predated the transformer, it was done by seq2seq deep LSTM networks.
(Figure: an illustration of the main components of the transformer model from the original paper, where layers were normalized after, instead of before, multi-headed attention.)
At the 2017 NeurIPS conference, Google researchers introduced the transformer architecture in their
landmark paper "Attention Is All You Need". This paper's goal was to improve upon 2014 Seq2seq
technology,[10] and was based mainly on the attention mechanism developed by Bahdanau et al. in 2014.[11]
The following year in 2018, BERT was introduced and quickly became "ubiquitous".[12] Though the original
transformer has both encoder and decoder blocks, BERT is an encoder-only model.
Although decoder-only GPT-1 was introduced in 2018, it was GPT-2 in 2019 that caught widespread attention
because OpenAI at first deemed it too powerful to release publicly, out of fear of malicious use.[13] GPT-3 in
2020 went a step further and as of 2024 is available only via API with no offering of downloading the model
to execute locally. But it was the 2022 consumer-facing browser-based ChatGPT that captured the
imaginations of the general population and caused some media hype and online buzz.[14] The 2023 GPT-4
was praised for its increased accuracy and as a "holy grail" for its multimodal capabilities.[15] OpenAI did not
reveal the high-level architecture or the number of parameters of GPT-4.
Competing language models have for the most part been attempting to equal the GPT series, at least in
terms of number of parameters.[16]
Since 2022, source-available models have been gaining popularity, especially at first with BLOOM and LLaMA,
though both have restrictions on the field of use. Mistral AI's models Mistral 7B and Mixtral 8x7b have the
more permissive Apache License. As of June 2024, the instruction-fine-tuned variant of the Llama 3 70-billion-
parameter model is the most powerful open LLM according to the LMSYS Chatbot Arena Leaderboard, being
more powerful than GPT-3.5 but not as powerful as GPT-4.[17]
As of 2024, the largest and most capable models are all based on the Transformer architecture. Some recent
implementations are based on other architectures, such as recurrent neural network variants and Mamba (a
state space model).[18][19][20]
Dataset preprocessing
Tokenization
Because machine learning algorithms process numbers rather than text, the text must be converted to
numbers. In the first step, a vocabulary is decided upon, then integer indices are arbitrarily but uniquely
assigned to each vocabulary entry, and finally, an embedding is associated with the integer index. Algorithms
include byte-pair encoding (BPE) and WordPiece. There are also special tokens serving as control characters,
such as [MASK] for a masked-out token (as used in BERT), and [UNK] ("unknown") for characters not
appearing in the vocabulary. Also, some special symbols are used to denote special text formatting. For
example, "Ġ" denotes a preceding whitespace in RoBERTa and GPT. "##" denotes continuation of a preceding
word in BERT.[21]
For example, the BPE tokenizer used by GPT-3 (Legacy) would split an input text into a series of numerical
"tokens", as illustrated below.
Tokenization also compresses the datasets. Because LLMs generally require input to be an array that is not
jagged, the shorter texts must be "padded" until they match the length of the longest one. How many tokens
are, on average, needed per word depends on the language of the dataset.[22][23]
BPE
As an example, consider a tokenizer based on byte-pair encoding. In the first step, all unique characters
(including blanks and punctuation marks) are treated as an initial set of n-grams (i.e. initial set of uni-grams).
Successively, the most frequent pair of adjacent characters is merged into a bi-gram and all instances of the
pair are replaced by it. All occurrences of adjacent pairs of (previously merged) n-grams that most frequently
occur together are then again merged into even lengthier n-grams, until a vocabulary of the prescribed size is
obtained (in the case of GPT-3, the size is 50257).[24] After a tokenizer is trained, any text can be tokenized by it,
as long as it does not contain characters not appearing in the initial set of uni-grams.[25]
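A toy sketch of this merge loop (illustrative only; a real tokenizer is trained on a large corpus and handles byte-level details):

```python
# Toy byte-pair encoding trainer: repeatedly merge the most frequent adjacent pair.
from collections import Counter

def train_bpe(text: str, num_merges: int):
    tokens = list(text)                       # start from single characters (uni-grams)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)      # most frequent adjacent pair
        merges.append(best)
        merged, i = [], 0
        while i < len(tokens):                # replace every occurrence of the pair
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens

merges, tokens = train_bpe("low lower lowest", num_merges=5)
print(merges)   # learned merge rules, e.g. ('l', 'o'), ('lo', 'w'), ...
print(tokens)   # the text rewritten with the merged n-grams
```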
Problems
A token vocabulary based on the frequencies extracted from mainly English corpora uses as few tokens as
possible for an average English word. An average word in another language encoded by such an English-
optimized tokenizer is, however, split into a suboptimal number of tokens. The GPT-2 tokenizer can use up to 15
times more tokens per word for some languages, for example for the Shan language from Myanmar. Even
more widespread languages such as Portuguese and German have "a premium of 50%" compared to
English.[26]
Dataset cleaning
In the context of training LLMs, datasets are typically cleaned by removing toxic passages, discarding low-quality
data, and de-duplicating.[28] Cleaned datasets can increase training efficiency and lead
to improved downstream performance.[29][30] A trained LLM can be used to clean datasets for training a
further LLM.[31]
With the increasing proportion of LLM-generated content on the web, data cleaning in the future may include
filtering out such content. LLM-generated content can pose a problem if the content is similar to human text
(making filtering difficult) but of lower quality (degrading performance of models trained on it).[32]
Synthetic data
Training of the largest language models might need more linguistic data than is naturally available, or the
naturally occurring data may be of insufficient quality. In these cases, synthetic data might be used. Microsoft's Phi
series of LLMs is trained on textbook-like data generated by another LLM.[33]
Reinforcement learning from human feedback
Reinforcement learning from human feedback (RLHF) through algorithms such as proximal policy
optimization is used to further fine-tune a model based on a dataset of human preferences.[34]
Instruction tuning
Using "self-instruct" approaches, LLMs have been able to bootstrap correct responses, replacing any naive
responses, starting from human-generated corrections of a few cases. For example, in the instruction "Write
an essay about the main themes represented in Hamlet," an initial naive completion might be "If you submit
the essay after March 17, your grade will be reduced by 10% for each day of delay," based on the frequency of
this textual sequence in the corpus.[35]
Mixture of experts
The largest LLMs may be too expensive to train and use directly. For such models, mixture of experts (MoE)
can be applied, a line of research pursued by Google researchers since 2017 to train models reaching up to 1
trillion parameters.[36][37][38]
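A minimal sketch of the routing idea behind MoE layers, with made-up dimensions and plain numpy in place of a real framework: a gating network picks a few experts per token, so only a fraction of the total parameters is used for any one input.

```python
# Toy top-k mixture-of-experts layer: route each token to its k best experts.
import numpy as np

def moe_layer(x, gate_w, expert_ws, k=2):
    """x: (d,) token activation; gate_w: (d, n_experts); expert_ws: list of (d, d) matrices."""
    logits = x @ gate_w
    top = np.argsort(logits)[-k:]                               # indices of the k best experts
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()   # softmax over the selected experts
    return sum(w * (x @ expert_ws[i]) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 4
x = rng.normal(size=d)
out = moe_layer(x, rng.normal(size=(d, n_experts)),
                [rng.normal(size=(d, d)) for _ in range(n_experts)])
print(out.shape)   # (8,) -- only 2 of the 4 experts were evaluated for this token
```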
Prompt engineering, attention mechanism, and context window
Most results previously achievable only by (costly) fine-tuning can be achieved through prompt engineering,
although limited to the scope of a single conversation (more precisely, limited to the scope of a context
window).[39]
In order to find out which tokens are relevant to each other within the scope of the context window, the
attention mechanism calculates "soft" weights for each token, more precisely for its embedding, by using
multiple attention heads, each with its own "relevance" for calculating its own soft weights. For example, the
small GPT-2 model (117M parameters) has twelve attention heads and a context window of only
1k tokens.[41] In its medium version, it has 345M parameters and contains 24 layers, each with 12 attention
heads. For training with gradient descent, a batch size of 512 was used.[25]
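A numpy sketch of multi-head scaled dot-product attention over a short context window (dimensions are illustrative and much smaller than GPT-2's):

```python
# Toy causal multi-head attention: each head computes its own "soft" relevance weights.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, wq, wk, wv, n_heads):
    """x: (seq_len, d_model); wq, wk, wv: (d_model, d_model) projection matrices."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    split = lambda m: m.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    q, k, v = split(x @ wq), split(x @ wk), split(x @ wv)      # (n_heads, seq_len, d_head)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)        # (n_heads, seq_len, seq_len)
    causal_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(causal_mask, -1e9, scores)               # tokens cannot attend to the future
    weights = softmax(scores)                                   # the "soft" weights per head
    out = weights @ v                                           # weighted mix of value vectors
    return out.transpose(1, 0, 2).reshape(seq_len, d_model)

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 48, 12
x = rng.normal(size=(seq_len, d_model))
y = multi_head_attention(x, *(rng.normal(size=(d_model, d_model)) for _ in range(3)), n_heads=n_heads)
print(y.shape)   # (6, 48)
```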
The largest models, such as Google's Gemini 1.5, presented in February 2024, can have a context window
of up to 1 million tokens (a context window of 10 million was also "successfully tested").[42] Other models with
large context windows include Anthropic's Claude 2.1, with a context window of up to 200k tokens.[43] Note
that this maximum refers to the number of input tokens and that the maximum number of output tokens
differs from the input and is often smaller. For example, the GPT-4 Turbo model has a maximum output of
4096 tokens.[44]
The length of a conversation that the model can take into account when generating its next answer is also limited
by the size of the context window. If a conversation, for example with ChatGPT, is longer than
the context window, only the parts inside the context window are taken into account when generating the next
answer, or the model needs to apply some algorithm to summarize the parts of the conversation that are too distant.
The shortcomings of making a context window larger include higher computational cost and possibly diluting
the focus on local context, while making it smaller can cause a model to miss an important long-range
dependency. Balancing them is a matter of experimentation and domain-specific considerations.
Given a segment from its training dataset, a model may be pre-trained either to predict how the segment
continues or to predict what is missing from the segment.[45] It can be either:
autoregressive (i.e. predicting how the segment continues, the way GPTs do it): for example, given the
segment "I like to eat", the model predicts "ice cream" or "sushi";
"masked" (i.e. filling in the parts missing from the segment, the way BERT[46] does it): for example, given the
segment "I like to [__] [__] cream", the model predicts that "eat" and "ice" are missing.
Models may be trained on auxiliary tasks which test their understanding of the data distribution, such as Next
Sentence Prediction (NSP), in which pairs of sentences are presented and the model must predict whether
they appear consecutively in the training corpus.[46] During training, a regularization loss is also used to
stabilize training. However, regularization loss is usually not used during testing and evaluation.
Infrastructure
Training cost
For transformer-based LLMs, training cost is much higher than inference cost. It costs 6 FLOPs per parameter
to train on one token, whereas it costs 1 to 2 FLOPs per parameter to infer on one token.[54]
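A back-of-the-envelope application of this rule of thumb (the model and dataset sizes below are hypothetical examples, not a specific system):

```python
# Estimate training and inference compute from the 6 FLOPs/parameter/token rule.
params = 70e9            # a hypothetical 70B-parameter model
train_tokens = 1.4e12    # a hypothetical 1.4 trillion-token corpus

train_flops = 6 * params * train_tokens
infer_flops_per_token = 2 * params        # upper end of the 1-2 FLOPs/parameter estimate

petaflop_day = 8.64e19                    # 1 petaFLOP/s sustained for one day
print(f"training: {train_flops:.2e} FLOP (~{train_flops / petaflop_day:,.0f} petaFLOP-days)")
print(f"inference: {infer_flops_per_token:.2e} FLOP per generated token")
```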
Tool use
There are certain tasks that, in principle, cannot be solved by any LLM, at least not without the use of external
tools or additional software. An example of such a task is responding to the user's input '354 * 139 = ',
provided that the LLM has not already encountered a continuation of this calculation in its training corpus. In
such cases, the LLM needs to resort to running program code that calculates the result, which can then be
included in its response. Another example is 'What is the time now? It is ', where a separate program
interpreter would need to execute code to get the system time, so that the LLM could include it in its
reply.[55][56] This basic strategy can be made more sophisticated with multiple attempts at generating programs, and other
sampling strategies.[57]
Generally, in order to get an LLM to use tools, one must fine-tune it for tool use. If the number of tools is finite,
then fine-tuning may be done just once. If the number of tools can grow arbitrarily, as with online API services,
then the LLM can be fine-tuned to be able to read API documentation and call APIs correctly.[58][59]
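A minimal sketch of this pattern; `llm_generate` is a hypothetical stand-in for any LLM API, and the "CALL calculator(...)" convention is invented here for illustration, not a real model's output format:

```python
# Toy tool-use loop: the model emits a tool call, the host runs it, and the
# result is spliced back into the prompt so the model can finish its answer.
import re

def llm_generate(prompt: str) -> str:
    # Hypothetical stub standing in for a tool-use fine-tuned model.
    if "Tool result" in prompt:
        return "354 * 139 = 49206"
    return "CALL calculator(354 * 139)"

def run_with_tools(user_input: str) -> str:
    reply = llm_generate(user_input)
    match = re.match(r"CALL calculator\((.+)\)", reply)
    if match:
        result = eval(match.group(1), {"__builtins__": {}})   # toy calculator tool
        reply = llm_generate(f"{user_input}\nTool result: {result}\nAnswer:")
    return reply

print(run_with_tools("354 * 139 = "))   # 354 * 139 = 49206
```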
A simpler form of tool use is retrieval-augmented generation: the augmentation of an LLM with document
retrieval. Given a query, a document retriever is called to retrieve the most relevant documents. This is usually
done by encoding the query and the documents into vectors, then finding the documents with vectors
(usually stored in a vector database) most similar to the vector of the query. The LLM then generates an
output based on both the query and context included from the retrieved documents.[60]
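A minimal retrieval-augmented generation sketch; `embed` is a toy placeholder for a trained text encoder, and the final LLM call is left as a comment because no specific API is assumed:

```python
# Toy RAG pipeline: embed query and documents, pick the nearest documents,
# and prepend them to the prompt that is sent to the LLM.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder encoder: hash bytes into a fixed-size unit vector.
    v = np.zeros(64)
    for i, b in enumerate(text.lower().encode()):
        v[i % 64] += b
    return v / (np.linalg.norm(v) + 1e-9)

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    q = embed(query)
    scores = [float(q @ embed(d)) for d in documents]          # cosine similarity
    return [documents[i] for i in np.argsort(scores)[-k:][::-1]]

docs = ["The Sharks lost the 2016 Stanley Cup finals to the Penguins.",
        "Giant pandas mostly eat bamboo.",
        "The Stanley Cup is awarded annually to the NHL playoff champion."]
question = "Have the San Jose Sharks won the Stanley Cup?"
context = "\n".join(retrieve(question, docs))
prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
# answer = llm_generate(prompt)   # hypothetical LLM call using the retrieved context
print(prompt)
```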
Agency
An LLM is a language model, not an agent, as it has no goal, but it can be used as a component of an
intelligent agent.[61] Researchers have described several methods for such integrations.
The ReAct pattern, a portmanteau of "Reason + Act", constructs an agent out of an LLM, using the LLM as a
planner. The LLM is prompted to "think out loud". Specifically, the language model is prompted with a textual
description of the environment, a goal, a list of possible actions, and a record of the actions and observations
so far. It generates one or more thoughts before generating an action, which is then executed in the
environment.[62] The linguistic description of the environment given to the LLM planner can even be the
LaTeX code of a paper describing the environment.[63]
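A toy sketch of the ReAct loop; the LLM and the environment are passed in as callables, since no specific API or environment is assumed:

```python
# Toy ReAct-style loop: alternate thought and action, append observations.
def react_agent(goal, llm_generate, env_step, max_steps=5):
    transcript = f"Goal: {goal}\n"
    for _ in range(max_steps):
        thought = llm_generate(transcript + "Thought:")
        action = llm_generate(transcript + f"Thought: {thought}\nAction:")
        observation, done = env_step(action)           # execute the action in the environment
        transcript += (f"Thought: {thought}\nAction: {action}\n"
                       f"Observation: {observation}\n")
        if done:
            break
    return transcript
```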
In the DEPS ("Describe, Explain, Plan and Select") method, an LLM is first connected to the visual world via
image descriptions, then it is prompted to produce plans for complex tasks and behaviors based on its
pretrained knowledge and environmental feedback it receives.[64]
The Reflexion method[65] constructs an agent that learns over multiple episodes. At the end of each episode,
the LLM is given the record of the episode, and prompted to think up "lessons learned", which would help it
perform better at a subsequent episode. These "lessons learned" are given to the agent in the subsequent
episodes.
Monte Carlo tree search can use an LLM as a rollout heuristic. When a programmatic world model is not
available, an LLM can also be prompted with a description of the environment to act as a world model.[66]
For open-ended exploration, an LLM can be used to score observations for their "interestingness", which can
be used as a reward signal to guide a normal (non-LLM) reinforcement learning agent.[67] Alternatively, it can
propose increasingly difficult tasks for curriculum learning.[68] Instead of outputting individual actions, an
LLM planner can also construct "skills", or functions for complex action sequences. The skills can be stored
and later invoked, allowing increasing levels of abstraction in planning.[68]
LLM-powered agents can keep a long-term memory of their previous contexts, and the memory can be
retrieved in the same way as in retrieval-augmented generation. Multiple such agents can interact socially.[69]
Compression
Typically, LLMs are trained with single- or half-precision floating point numbers (float32 and float16). One
float16 has 16 bits, or 2 bytes, and so one billion parameters require 2 gigabytes. The largest models typically
have 100 billion parameters, requiring 200 gigabytes to load, which places them outside the range of most
consumer electronics.[70]
Post-training quantization[71] aims to decrease the space requirement by lowering precision of the
parameters of a trained model, while preserving most of its performance.[72][73] The simplest form of
quantization simply truncates all numbers to a given number of bits. It can be improved by using a different
quantization codebook per layer. Further improvement can be done by applying different precisions to
different parameters, with higher precision for particularly important parameters ("outlier weights").[74] See [75]
for a visual guide.
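A toy sketch of the simplest scheme, symmetric per-tensor rounding to 8-bit integers (real methods use per-layer or per-channel codebooks and special handling of outlier weights):

```python
# Toy post-training quantization: round weights to int8 with a single scale.
import numpy as np

def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max() / 127.0                    # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
q, scale = quantize_int8(w)
print(q.nbytes, "bytes instead of", w.nbytes)          # 16 bytes instead of 64
print(np.abs(w - dequantize(q, scale)).max())          # small round-off error
```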
While quantized models are typically frozen, and only pre-quantized models are fine-tuned, quantized models
can still be fine-tuned.[76]
Multimodality
Multimodality means "having several modalities", and a "modality" refers to a type of input or output, such as
video, image, audio, text, proprioception, etc.[77] There have been many AI models trained specifically to
ingest one modality and output another modality, such as AlexNet for image to label,[78] visual question
answering for image-text to text,[79] and speech recognition for speech to text.
A common method to create multimodal models out of an LLM is to "tokenize" the output of a trained
encoder. Concretely, one can construct an LLM that can understand images as follows: take a trained LLM
and a trained image encoder E. Make a small multilayer perceptron f so that, for any image y, the
post-processed vector f(E(y)) has the same dimensions as an encoded token. That is an "image token".
Then, one can interleave text tokens and image tokens. The compound model is then fine-tuned on an image-
text dataset. This basic construction can be applied with more sophistication to improve the model. The
image encoder may be frozen to improve stability.[80]
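A numpy sketch of that construction with made-up dimensions; the image encoder and the LLM are stand-ins, and only the projection into "image tokens" is shown:

```python
# Toy "image token" projection: map image-encoder features into the LLM's
# token-embedding space so they can be interleaved with text tokens.
import numpy as np

rng = np.random.default_rng(0)
d_image, d_hidden, d_model = 512, 1024, 768            # illustrative sizes
w1 = rng.normal(size=(d_image, d_hidden)) * 0.02
w2 = rng.normal(size=(d_hidden, d_model)) * 0.02

def image_to_tokens(image_features: np.ndarray) -> np.ndarray:
    """image_features: (n_patches, d_image) output of a (frozen) image encoder."""
    hidden = np.maximum(image_features @ w1, 0)         # small MLP with ReLU
    return hidden @ w2                                   # (n_patches, d_model) image tokens

image_features = rng.normal(size=(16, d_image))         # stand-in encoder output
text_embeddings = rng.normal(size=(5, d_model))         # stand-in text token embeddings
sequence = np.concatenate([image_to_tokens(image_features), text_embeddings])
print(sequence.shape)    # (21, 768): image and text tokens fed to the LLM together
```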
Flamingo demonstrated the effectiveness of the tokenization method, fine-tuning a pretrained
language model and image encoder pair to perform better on visual question answering than models trained from
scratch.[81] Google's PaLM model was fine-tuned into a multimodal model, PaLM-E, using the tokenization
method, and applied to robotic control.[82] LLaMA models have also been turned multimodal using the
tokenization method, to allow image inputs,[83] and video inputs.[84]
GPT-4 can use both text and image as inputs[85] (although the vision component was not released to the
public until GPT-4V[86]); Google DeepMind's Gemini is also multimodal.[87] Mistral introduced its own
multimodal Pixtral 12B model in September 2024.[88]
Properties
Scaling laws
The performance of an LLM after (pre-)training depends on quantities including the:
cost of (pre-)training (C),
size of the artificial neural network itself, such as the number of parameters N (i.e. the number of neurons in its
layers, the number of weights between them, and biases),
size of its (pre-)training dataset (i.e. the number of tokens in the corpus, D).
They are related by simple statistical laws, called "scaling laws". One particular scaling law ("Chinchilla
scaling") for an LLM autoregressively trained for one epoch, with a log-log learning rate schedule, states that:[89]
C = C0 N D, where C0 ≈ 6,
meaning that it costs about 6 FLOPs per parameter to train on one token. Note that training cost is much
higher than inference cost, where it costs 1 to 2 FLOPs per parameter to infer on one token.[54]
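A rough sizing sketch that combines C = 6·N·D with the commonly cited Chinchilla rule of thumb of roughly 20 training tokens per parameter (both are approximations, and the compute budget below is hypothetical):

```python
# Compute-optimal sizing under C = 6 * N * D and D ~= 20 * N.
budget_flops = 1e23                        # hypothetical training budget
tokens_per_param = 20                      # approximate Chinchilla-optimal ratio

n_params = (budget_flops / (6 * tokens_per_param)) ** 0.5
n_tokens = tokens_per_param * n_params
print(f"~{n_params / 1e9:.0f}B parameters trained on ~{n_tokens / 1e12:.2f}T tokens")
```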
Emergent abilities
The performance of bigger models on various tasks, when plotted on a log-log scale, appears as a linear
extrapolation of the performance achieved by smaller models. However, this linearity may be punctuated by
"break(s)"[90] in the scaling law, where the slope of the line changes abruptly and larger models acquire
"emergent abilities".[39][91] These abilities arise from the complex interaction of the model's components and are not
explicitly programmed or designed.[92]
The most intriguing among the emergent abilities is in-context learning from example demonstrations.[93] In-
context learning is involved in tasks such as:
reported arithmetic, decoding the International Phonetic Alphabet, unscrambling a word's letters,
disambiguating words in context,[39][94][95] converting spatial words, cardinal directions (for example, replying
"northeast" upon [0, 0, 1; 0, 0, 0; 0, 0, 0]), and color terms represented in text.[96]
chain-of-thought prompting: Model outputs are improved by chain-of-thought prompting only when model
size exceeds 62B. Smaller models perform better when prompted to answer immediately, without chain of
thought.[97]
identifying offensive content in paragraphs of Hinglish (a combination of Hindi and English), and
generating a similar English equivalent of Kiswahili proverbs.[98]
Schaeffer et al. argue that the emergent abilities are not unpredictably acquired, but predictably acquired
according to a smooth scaling law. The authors considered a toy statistical model of an LLM solving multiple-
choice questions, and showed that this statistical model, modified to account for other types of tasks, applies
to these tasks as well.[99]
Let x be the parameter count and y be the performance of the model.
When y = average Pr(correct token), then y as a function of log x is an exponential curve (before it hits the plateau at one), which
looks like emergence.
When y = average log(Pr(correct token)), then the plot is a straight line (before it hits the plateau at zero),
which does not look like emergence.
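A toy numerical illustration of that argument (all numbers are made up): the per-token metric improves smoothly with scale, but an exact-match metric over a 10-token answer stays near zero and then shoots up, which looks like emergence.

```python
# Same smooth improvement, two metrics: per-token accuracy vs. exact-match accuracy.
import numpy as np

log_n = np.linspace(7, 11, 9)          # log10 of parameter count
log_p = -2.0 + 0.18 * log_n            # per-token log10 Pr(correct) improves linearly
p = 10 ** log_p                        # per-token accuracy (smooth curve)
exact_match = p ** 10                  # all 10 tokens of an answer must be correct

for ln, a, e in zip(log_n, p, exact_match):
    print(f"10^{ln:.1f} params: per-token {a:.3f}, exact-match {e:.6f}")
```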
Interpretation
Large language models by themselves are "black boxes", and it is not clear how they can perform linguistic
tasks. There are several methods for understanding how LLMs work.
In one example, researchers trained small transformers on modular arithmetic addition. The resulting
models were reverse-engineered, and it turned out they used the discrete Fourier transform.[103]
NLP researchers were evenly split when asked, in a 2022 survey, whether (untuned) LLMs "could (ever)
understand natural language in some nontrivial sense".[104] Proponents of "LLM understanding" believe that
some LLM abilities, such as mathematical reasoning, imply an ability to "understand" certain concepts. A
Microsoft team argued in 2023 that GPT-4 "can solve novel and difficult tasks that span mathematics, coding,
vision, medicine, law, psychology and more" and that GPT-4 "could reasonably be viewed as an early (yet still
incomplete) version of an artificial general intelligence system": "Can one reasonably say that a system that
passes exams for software engineering candidates is not really intelligent?"[105][106] Some researchers
characterize LLMs as "alien intelligence".[107][108] For example, Conjecture CEO Connor Leahy considers
untuned LLMs to be like inscrutable alien "Shoggoths", and believes that RLHF tuning creates a "smiling
facade" obscuring the inner workings of the LLM: "If you don't push it too far, the smiley face stays on. But
then you give it [an unexpected] prompt, and suddenly you see this massive underbelly of insanity, of weird
thought processes and clearly non-human understanding."[109][110]
In contrast, some proponents of the "LLMs lack understanding" school believe that existing LLMs are "simply
remixing and recombining existing writing",[108] a phenomenon known as stochastic parrot, or they point to
the deficits existing LLMs continue to have in prediction skills, reasoning skills, agency, and explainability.[104]
For example, GPT-4 has natural deficits in planning and in real-time learning.[106] Generative LLMs have been
observed to confidently assert claims of fact which do not seem to be justified by their training data, a
phenomenon which has been termed "hallucination".[111] Specifically, hallucinations in the context of LLMs
correspond to the generation of text or responses that seem syntactically sound, fluent, and natural but are
factually incorrect, nonsensical, or unfaithful to the provided source input.[112] Neuroscientist Terrence
Sejnowski has argued that "The diverging opinions of experts on the intelligence of LLMs suggests that our
old ideas based on natural intelligence are inadequate".[104]
The matter of LLMs exhibiting intelligence or understanding has two main aspects – the first is how to model
thought and language in a computer system, and the second is how to enable the computer system to
generate human-like language.[104] These aspects of language as a model of cognition have been developed
in the field of cognitive linguistics. American linguist George Lakoff presented Neural Theory of Language
(NTL)[113] as a computational basis for using language as a model of learning tasks and understanding. The
NTL Model (https://www.icsi.berkeley.edu/icsi/projects/ai/ntl) outlines how specific neural structures of
the human brain shape the nature of thought and language and in turn what are the computational properties
of such neural systems that can be applied to model thought and language in a computer system. After a
framework for modeling language in computer systems was established, the focus shifted to establishing
frameworks for computer systems to generate language with acceptable grammar. In his 2014 book titled
The Language Myth: Why Language Is Not An Instinct, British cognitive linguist and digital communication
technologist Vyvyan Evans mapped out the role of probabilistic context-free grammar (PCFG) in enabling NLP
to model cognitive patterns and generate human-like language.[114][115]
Evaluation
Perplexity
The most commonly used measure of a language model's performance is its perplexity on a given text
corpus. Perplexity is a measure of how well a model is able to predict the contents of a dataset; the higher
the likelihood the model assigns to the dataset, the lower the perplexity. Mathematically, perplexity is defined
as the exponential of the average negative log likelihood per token:
Perplexity = exp( -(1/N) * Σ_{i=1..N} log Pr(token_i | context for token_i) )
Here, N is the number of tokens in the text corpus, and "context for token i" depends on the specific type of
LLM used. If the LLM is autoregressive, then "context for token i" is the segment of text appearing before
token i. If the LLM is masked, then "context for token i" is the segment of text surrounding token i.
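A minimal sketch of that computation; the per-token log-probabilities below are made-up numbers standing in for model outputs:

```python
# Perplexity as the exponential of the average negative log likelihood per token.
import math

token_log_probs = [-2.1, -0.3, -1.7, -0.9, -2.4]   # log Pr(token_i | context for token_i)
avg_nll = -sum(token_log_probs) / len(token_log_probs)
perplexity = math.exp(avg_nll)
print(f"perplexity = {perplexity:.2f}")            # lower is better
```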
Because language models may overfit to their training data, models are usually evaluated by their perplexity
on a test set of unseen data.[46] This presents particular challenges for the evaluation of large language
models. As they are trained on increasingly large corpora of text largely scraped from the web, it becomes
increasingly likely that models' training data inadvertently includes portions of any given test set.[2]
In information theory, the concept of entropy is intricately linked to perplexity, a relationship notably
established by Claude Shannon.[116] This relationship is mathematically expressed as
Entropy = log2(Perplexity).
Entropy, in this context, is commonly quantified in terms of bits per word (BPW) or bits per character (BPC),
which hinges on whether the language model utilizes word-based or character-based tokenization.
Notably, in the case of larger language models that predominantly employ sub-word tokenization, bits per
token (BPT) emerges as a seemingly more appropriate measure. However, due to the variance in tokenization
methods across different Large Language Models (LLMs), BPT does not serve as a reliable metric for
comparative analysis among diverse models. To convert BPT into BPW, one can multiply it by the average
number of tokens per word.
In the evaluation and comparison of language models, cross-entropy is generally the preferred metric over
entropy. The underlying principle is that a lower BPW is indicative of a model's enhanced capability for
compression. This, in turn, reflects the model's proficiency in making accurate predictions.
A large number of testing datasets and benchmarks have also been developed to evaluate the capabilities of
language models on more specific downstream tasks. Tests may be designed to evaluate a variety of
capabilities, including general knowledge, commonsense reasoning, and mathematical problem-solving.
One broad category of evaluation dataset is question answering datasets, consisting of pairs of questions
and correct answers, for example, ("Have the San Jose Sharks won the Stanley Cup?", "No").[117] A question
answering task is considered "open book" if the model's prompt includes text from which the expected
answer can be derived (for example, the previous question could be adjoined with some text which includes
the sentence "The Sharks have advanced to the Stanley Cup finals once, losing to the Pittsburgh Penguins in
2016."[117]). Otherwise, the task is considered "closed book", and the model must draw on knowledge retained
during training.[118] Some examples of commonly used question answering datasets include TruthfulQA, Web
Questions, TriviaQA, and SQuAD.[118]
Evaluation datasets may also take the form of text completion, having the model select the most likely word
or sentence to complete a prompt, for example: "Alice was friends with Bob. Alice went to visit her friend,
____".[2]
Some composite benchmarks have also been developed which combine a diversity of different evaluation
datasets and tasks. Examples include GLUE, SuperGLUE, MMLU, BIG-bench, and HELM.[116][118] OpenAI has
released tools for running composite benchmarks, but noted that the eval results are sensitive to the
prompting method.[119][120] Some public datasets contain questions that are mislabeled, ambiguous,
unanswerable, or otherwise of low-quality, which can be cleaned to give more reliable benchmark scores.[121]
It was previously standard to report results on a heldout portion of an evaluation dataset after doing
supervised fine-tuning on the remainder. It is now more common to evaluate a pre-trained model directly
through prompting techniques, though researchers vary in the details of how they formulate prompts for
particular tasks, particularly with respect to how many examples of solved tasks are adjoined to the prompt
(i.e. the value of n in n-shot prompting).
Because of the rapid pace of improvement of large language models, evaluation benchmarks have suffered
from short lifespans, with state of the art models quickly "saturating" existing benchmarks, exceeding the
performance of human annotators, leading to efforts to replace or augment the benchmark with more
challenging tasks.[122] In addition, there are cases of "shortcut learning" wherein AIs sometimes "cheat" on
multiple-choice tests by using statistical correlations in superficial test question wording in order to guess
the correct responses, without necessarily understanding the actual question being asked.[104]
Some datasets have been constructed adversarially, focusing on particular problems on which extant
language models seem to have unusually poor performance compared to humans. One example is the
TruthfulQA dataset, a question answering dataset consisting of 817 questions which language models are
susceptible to answering incorrectly by mimicking falsehoods to which they were repeatedly exposed during
training. For example, an LLM may answer "No" to the question "Can you teach an old dog new tricks?"
because of its exposure to the English idiom you can't teach an old dog new tricks, even though this is not
literally true.[123]
Another example of an adversarial evaluation dataset is Swag and its successor, HellaSwag, collections of
problems in which one of multiple options must be selected to complete a text passage. The incorrect
completions were generated by sampling from a language model and filtering with a set of classifiers. The
resulting problems are trivial for humans but at the time the datasets were created state of the art language
models had poor accuracy on them. For example:
We see a fitness center sign. We then see a man talking to the camera and sitting
and laying on a exercise ball. The man...
a) demonstrates how to increase efficient exercise work by running up and down
balls.
b) moves all his arms and legs and builds up a lot of muscle.
c) then plays the ball and we see a graphics and hedge trimming demonstration.
d) performs sit ups while on the ball and talking.[124]
BERT selects b) as the most likely completion, though the correct answer is d).[124]
Wider impact
In 2023, Nature Biomedical Engineering wrote that "it is no longer possible to accurately distinguish" human-
written text from text created by large language models, and that "It is all but certain that general-purpose
large language models will rapidly proliferate... It is a rather safe bet that they will change many industries
over time."[125] Goldman Sachs suggested in 2023 that generative language AI could increase global GDP by
7% in the next ten years, and could expose to automation 300 million jobs globally.[126][127]
Memorization is an emergent behavior in LLMs in which long strings of text are occasionally output verbatim
from training data, contrary to typical behavior of traditional artificial neural nets. Evaluations of controlled
LLM output measure the amount memorized from training data (focused on GPT-2-series models) as
variously over 1% for exact duplicates[128] or up to about 7%.[129]
Security
Some commenters expressed concern over accidental or deliberate creation of misinformation, or other
forms of misuse.[130] For example, the availability of large language models could reduce the skill-level
required to commit bioterrorism; biosecurity researcher Kevin Esvelt has suggested that LLM creators should
exclude from their training data papers on creating or enhancing pathogens.[131]
A study by researchers at Google and several universities, including Cornell University and University of
California, Berkeley, showed that there are potential security risks in language models such as ChatGPT. In
their study, they examined and confirmed the possibility that questioners could get, from ChatGPT, the
training data that the AI model used. For example, when asking ChatGPT 3.5 turbo to repeat the word "poem"
forever, the model will say "poem" hundreds of times and then diverge, deviating from the standard
dialogue style and emitting nonsense phrases, thereby reproducing its training data verbatim. The researchers
have seen more than 10,000 examples of the AI model exposing its training data in a similar manner. The
researchers said that it was hard to tell if the AI model was actually safe or not.[132]
The potential presence of "sleeper agents" within LLM models is another emerging security concern. These
are hidden functionalities built into the model that remain dormant until triggered by a specific event or
condition. Upon activation, the LLM deviates from its expected behavior and takes insecure actions.[133]
Large language model (LLM) applications accessible to the public, like ChatGPT or Claude, typically
incorporate safety measures designed to filter out harmful content. However, implementing these controls
effectively has proven challenging. For instance, research by Kang et al.[134] demonstrated a method for
circumventing LLM safety systems. Similarly, Wang[135] illustrated how a criminal could potentially
bypass ChatGPT 4o's safety controls to obtain information on establishing a drug trafficking operation.
Algorithmic bias
While LLMs have shown remarkable capabilities in generating human-like text, they are susceptible to
inheriting and amplifying biases present in their training data. This can manifest in skewed representations or
unfair treatment of different demographics, such as those based on race, gender, language, and cultural
groups.[136] Since English data is overrepresented in current large language models' training data, it may also
downplay non-English views.[137]
Stereotyping
AI models can reinforce a wide range of stereotypes, including those based on gender, ethnicity, age,
nationality, religion, or occupation. This can lead to outputs that unfairly generalize or caricature groups of
people, sometimes in harmful or derogatory ways.[138]
Notably, gender bias refers to the tendency of these models to produce outputs that are unfairly prejudiced
towards one gender over another. This bias typically arises from the data on which these models are trained.
Large language models often assign roles and characteristics based on traditional gender norms.[136] For
example, they might associate nurses or secretaries predominantly with women and engineers or CEOs with
men.[139]
Political bias
Political bias refers to the tendency of algorithms to systematically favor certain political viewpoints,
ideologies, or outcomes over others. Language models may also exhibit political biases. Since the training
data includes a wide range of political opinions and coverage, the models might generate responses that lean
towards particular political ideologies or viewpoints, depending on the prevalence of those views in the
data.[140]
List
For the training cost column, 1 petaFLOP-day = 1 petaFLOP/sec × 1 day = 8.64E19 FLOP. Also, only the largest
model's cost is written.
| Name | Release date[a] | Developer | Number of parameters (billion)[b] | Corpus size | Training cost (petaFLOP-day) | License[c] | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| BERT | October 2018 | Google | 0.340[143] | 3.3 billion words[143] | 9[144] | Apache 2.0[145] | An early and influential language model.[4] Encoder-only and thus not built to be prompted or generative.[146] Training took 4 days on 64 TPUv2 chips.[147] |
| XLNet | June 2019 | Google | 0.340[151] | 33 billion words | 330 | Apache 2.0[152] | An alternative to BERT; designed as encoder-only. Trained on 512 TPU v3 chips for 5.5 days.[153] |
| GPT-2 | February 2019 | OpenAI | 1.5[154] | 40GB[155] (~10 billion tokens)[156] | 28[157] | MIT[158] | Trained on 32 TPUv3 chips for 1 week.[157] |
| GPT-3 | May 2020 | OpenAI | 175[50] | 300 billion tokens[156] | 3640[159] | Proprietary | A fine-tuned variant of GPT-3, termed GPT-3.5, was made available to the public through a web interface called ChatGPT in 2022.[160] |
| GPT-Neo | March 2021 | EleutherAI | 2.7[161] | 825 GiB[162] | | MIT[163] | The first of a series of free GPT-3 alternatives released by EleutherAI. GPT-Neo outperformed an equivalent-size GPT-3 model on some benchmarks, but was significantly worse than the largest GPT-3.[163] |
| GPT-J | June 2021 | EleutherAI | 6[164] | 825 GiB[162] | 200[165] | Apache 2.0 | GPT-3-style language model |
| Megatron-Turing NLG | October 2021[166] | Microsoft and Nvidia | 530[167] | 338.6 billion tokens[167] | | Restricted web access | Standard architecture but trained on a supercomputing cluster. |
| Ernie 3.0 Titan | December 2021 | Baidu | 260[168] | 4 Tb | | Proprietary | Chinese-language LLM. Ernie Bot is based on this model. |
| Claude[169] | December 2021 | Anthropic | 52[170] | 400 billion tokens[170] | | beta | Fine-tuned for desirable behavior in conversations.[171] |
| GLaM (Generalist Language Model) | December 2021 | Google | 1200[38] | 1.6 trillion tokens[38] | 5600[38] | Proprietary | Sparse mixture of experts model, making it more expensive to train but cheaper to run inference compared to GPT-3. |
| Gopher | December 2021 | DeepMind | 280[172] | 300 billion tokens[173] | 5833[174] | Proprietary | Later developed into the Chinchilla model. |
| GPT-NeoX | February 2022 | EleutherAI | 20[177] | 825 GiB[162] | 740[165] | Apache 2.0 | Based on the Megatron architecture |
| Chinchilla | March 2022 | DeepMind | 70[178] | 1.4 trillion tokens[178][173] | 6805[174] | Proprietary | Reduced-parameter model trained on more data. Used in the Sparrow bot. Often cited for its neural scaling law. |
| OPT (Open Pretrained Transformer) | May 2022 | Meta | 175[180] | 180 billion tokens[181] | 310[165] | Non-commercial research[d] | GPT-3 architecture with some adaptations from Megatron |
| YaLM 100B | June 2022 | Yandex | 100[182] | 1.7TB[182] | | Apache 2.0 | English-Russian model based on Microsoft's Megatron-LM. |
| BLOOM | July 2022 | Large collaboration led by Hugging Face | 175[185] | 350 billion tokens (1.6TB)[186] | | Responsible AI | Essentially GPT-3 but trained on a multi-lingual corpus (30% English excluding programming languages) |
| Galactica | November 2022 | Meta | 120 | 106 billion tokens[187] | Unknown | CC-BY-NC-4.0 | Trained on scientific text and modalities. |
| AlexaTM (Teacher Models) | November 2022 | Amazon | 20[188] | 1.3 trillion[189] | | Proprietary[190] | Bidirectional sequence-to-sequence architecture |
| Neuro-sama | December 2022 | Independent | Unknown | Unknown | | Privately owned | A language model designed for live-streaming on Twitch. |
| LLaMA (Large Language Model Meta AI) | February 2023 | Meta AI | 65[191] | 1.4 trillion[191] | 6300[192] | Non-commercial research[e] | Corpus has 20 languages. "Overtrained" (compared to Chinchilla scaling law) for better performance with fewer parameters.[191] |
| Cerebras-GPT | March 2023 | Cerebras | 13[195] | | 270[165] | Apache 2.0 | Trained with Chinchilla formula. |
| Falcon | March 2023 | Technology Innovation Institute | 40[196] | 1 trillion tokens, from RefinedWeb (filtered web text corpus)[197] plus some "curated corpora"[198] | 2800[192] | Apache 2.0[199] | |
| PanGu-Σ | March 2023 | Huawei | 1085 | 329 billion tokens[201] | | Proprietary | |
| OpenAssistant[202] | March 2023 | LAION | 17 | 1.5 trillion tokens | | Apache 2.0 | Trained on crowdsourced open data |
| PaLM 2 (Pathways Language Model 2) | May 2023 | Google | 340[205] | 3.6 trillion tokens[205] | 85,000[192] | Proprietary | Was used in Bard chatbot.[206] |
| Claude 2 | July 2023 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Used in Claude chatbot.[209] |
| Granite 13b | July 2023 | IBM | Unknown | Unknown | Unknown | Proprietary | Used in IBM Watsonx.[210] |
| Claude 2.1 | November 2023 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Used in Claude chatbot. Has a context window of 200,000 tokens, or ~500 pages.[212] |
| Grok-1[213] | November 2023 | x.AI | 314 | Unknown | Unknown | Apache 2.0 | Used in Grok chatbot. Grok-1 has a context length of 8,192 tokens and has access to X (Twitter).[214] |
| Gemini 1.0 | December 2023 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | Multimodal model, comes in three sizes. Used in the chatbot of the same name.[215] |
| Mixtral 8x7B | December 2023 | Mistral AI | 46.7 | Unknown | Unknown | Apache 2.0 | Outperforms GPT-3.5 and Llama 2 70B on many benchmarks.[216] Mixture of experts model, with 12.9 billion parameters activated per token.[217] |
| Mixtral 8x22B | April 2024 | Mistral AI | 141 | Unknown | Unknown | Apache 2.0 | [218] |
| Phi-2 | December 2023 | Microsoft | 2.7 | 1.4T tokens | 419[219] | MIT | Trained on real and synthetic "textbook-quality" data, for 14 days on 96 A100 GPUs.[219] |
| Gemini 1.5 | February 2024 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | Multimodal model, based on a Mixture-of-Experts (MoE) architecture. Context window above 1 million tokens.[220] |
| Gemma | February 2024 | Google DeepMind | 7 | 6T tokens | Unknown | Gemma Terms of Use[221] | |
| Claude 3 | March 2024 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Includes three models, Haiku, Sonnet, and Opus.[222] |
| DBRX | March 2024 | Databricks and Mosaic ML | 136 | 12T tokens | | Databricks Open Model License | Training cost 10 million USD. |
| Fugaku-LLM | May 2024 | Fujitsu, Tokyo Institute of Technology, etc. | 13 | 380B tokens | | | The largest model ever trained on CPU-only, on the Fugaku.[223] |
| Phi-3 | April 2024 | Microsoft | 14[224] | 4.8T tokens | | MIT | Microsoft markets them as "small language model".[225] |
| Granite Code Models | May 2024 | IBM | Unknown | Unknown | Unknown | Apache 2.0 | |
| Nemotron-4 | June 2024 | Nvidia | 340 | 9T tokens | 200,000 | NVIDIA Open Model License | Trained for 1 epoch. Trained on 6144 H100 GPUs between December 2023 and May 2024.[227][228] |
See also
Foundation models
Notes
a. This is the date that documentation describing the model's architecture was first released.
b. In many cases, researchers release or report on multiple versions of a model having different sizes. In
these cases, the size of the largest model is listed here.
c. This is the license of the pre-trained model weights. In almost all cases the training code itself is open-
source or can be easily replicated.
d. The smaller models including 66B are publicly available, while the 175B model is available on request.
e. Facebook's license and distribution scheme restricted access to approved researchers, but the model
weights were leaked and became widely available.
f. As stated in Technical report: "Given both the competitive landscape and the safety implications of
large-scale models like GPT-4, this report contains no further details about the architecture (including
model size), hardware, training compute, dataset construction, training method ..."[193]
References
2. Brown, Tom B.; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared; Dhariwal, Prafulla;
Neelakantan, Arvind; Shyam, Pranav; Sastry, Girish; Askell, Amanda; Agarwal, Sandhini; Herbert-Voss,
Ariel; Krueger, Gretchen; Henighan, Tom; Child, Rewon; Ramesh, Aditya; Ziegler, Daniel M.; Wu, Jeffrey;
Winter, Clemens; Hesse, Christopher; Chen, Mark; Sigler, Eric; Litwin, Mateusz; Gray, Scott; Chess,
Benjamin; Clark, Jack; Berner, Christopher; McCandlish, Sam; Radford, Alec; Sutskever, Ilya; Amodei,
Dario (Dec 2020). Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M.F.; Lin, H. (eds.). "Language Models
are Few-Shot Learners" (https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac
142f64a-Paper.pdf) (PDF). Advances in Neural Information Processing Systems. 33. Curran
Associates, Inc.: 1877–1901. Archived (https://web.archive.org/web/20231117204007/https://proceedi
ngs.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf) (PDF) from the
original on 2023-11-17. Retrieved 2023-03-14.
3. Fathallah, Nadeen; Das, Arunav; De Giorgis, Stefano; Poltronieri, Andrea; Haase, Peter; Kovriguina, Liubov
(2024-05-26). NeOn-GPT: A Large Language Model-Powered Pipeline for Ontology Learning (https://202
4.eswc-conferences.org/wp-content/uploads/2024/05/77770034.pdf) (PDF). Extended Semantic Web
Conference 2024. Hersonissos, Greece.
6. Kilgarriff, Adam; Grefenstette, Gregory (September 2003). "Introduction to the Special Issue on the Web
as Corpus" (https://direct.mit.edu/coli/article/29/3/333-347/1816) . Computational Linguistics. 29 (3):
333–347. doi:10.1162/089120103322711569 (https://doi.org/10.1162%2F089120103322711569) .
ISSN 0891-2017 (https://search.worldcat.org/issn/0891-2017) .
7. Banko, Michele; Brill, Eric (2001). "Scaling to very very large corpora for natural language
disambiguation" (https://dx.doi.org/10.3115/1073012.1073017) . Proceedings of the 39th Annual
Meeting on Association for Computational Linguistics - ACL '01. Morristown, NJ, USA: Association for
Computational Linguistics: 26–33. doi:10.3115/1073012.1073017 (https://doi.org/10.3115%2F107301
2.1073017) .
8. Resnik, Philip; Smith, Noah A. (September 2003). "The Web as a Parallel Corpus" (https://direct.mit.edu/
coli/article/29/3/349-380/1809) . Computational Linguistics. 29 (3): 349–380.
doi:10.1162/089120103322711578 (https://doi.org/10.1162%2F089120103322711578) . ISSN 0891-
2017 (https://search.worldcat.org/issn/0891-2017) . Archived (https://web.archive.org/web/20240607
172811/https://direct.mit.edu/coli/article/29/3/349-380/1809) from the original on 2024-06-07.
Retrieved 2024-06-07.
9. Halevy, Alon; Norvig, Peter; Pereira, Fernando (March 2009). "The Unreasonable Effectiveness of Data" (h
ttps://ieeexplore.ieee.org/document/4804817) . IEEE Intelligent Systems. 24 (2): 8–12.
doi:10.1109/MIS.2009.36 (https://doi.org/10.1109%2FMIS.2009.36) . ISSN 1541-1672 (https://search.
worldcat.org/issn/1541-1672) .
10. Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N; Kaiser,
Łukasz; Polosukhin, Illia (2017). "Attention is All you Need" (https://proceedings.neurips.cc/paper/2017/
file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) (PDF). Advances in Neural Information
Processing Systems. 30. Curran Associates, Inc. Archived (https://web.archive.org/web/202402211411
13/https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf)
(PDF) from the original on 2024-02-21. Retrieved 2024-01-21.
11. Bahdanau, Dzmitry; Cho, Kyunghyun; Bengio, Yoshua (2014). "Neural Machine Translation by Jointly
Learning to Align and Translate". arXiv:1409.0473 (https://arxiv.org/abs/1409.0473) [cs.CL (https://arx
iv.org/archive/cs.CL) ].
12. Rogers, Anna; Kovaleva, Olga; Rumshisky, Anna (2020). "A Primer in BERTology: What We Know About
How BERT Works" (https://aclanthology.org/2020.tacl-1.54) . Transactions of the Association for
Computational Linguistics. 8: 842–866. arXiv:2002.12327 (https://arxiv.org/abs/2002.12327) .
doi:10.1162/tacl_a_00349 (https://doi.org/10.1162%2Ftacl_a_00349) . S2CID 211532403 (https://api.s
emanticscholar.org/CorpusID:211532403) . Archived (https://web.archive.org/web/20220403103310/
https://aclanthology.org/2020.tacl-1.54/) from the original on 2022-04-03. Retrieved 2024-01-21.
13. Hern, Alex (14 February 2019). "New AI fake text generator may be too dangerous to release, say
creators" (https://www.theguardian.com/technology/2019/feb/14/elon-musk-backed-ai-writes-convinci
ng-news-fiction) . The Guardian. Archived (https://web.archive.org/web/20190214173112/https://ww
w.theguardian.com/technology/2019/feb/14/elon-musk-backed-ai-writes-convincing-news-fiction)
from the original on 14 February 2019. Retrieved 20 January 2024.
14. "ChatGPT a year on: 3 ways the AI chatbot has completely changed the world in 12 months" (https://ww
w.euronews.com/next/2023/11/30/chatgpt-a-year-on-3-ways-the-ai-chatbot-has-completely-changed-th
e-world-in-12-months) . Euronews. November 30, 2023. Archived (https://web.archive.org/web/202401
14025250/https://www.euronews.com/next/2023/11/30/chatgpt-a-year-on-3-ways-the-ai-chatbot-has-c
ompletely-changed-the-world-in-12-months) from the original on January 14, 2024. Retrieved
January 20, 2024.
15. Heaven, Will (March 14, 2023). "GPT-4 is bigger and better than ChatGPT—but OpenAI won't say why" (htt
ps://www.technologyreview.com/2023/03/14/1069823/gpt-4-is-bigger-and-better-chatgpt-openai/) .
MIT Technology Review. Archived (https://web.archive.org/web/20230317224201/https://www.technolo
gyreview.com/2023/03/14/1069823/gpt-4-is-bigger-and-better-chatgpt-openai/) from the original on
March 17, 2023. Retrieved January 20, 2024.
18. Peng, Bo; et al. (2023). "RWKV: Reinventing RNNS for the Transformer Era". arXiv:2305.13048 (https://ar
xiv.org/abs/2305.13048) [cs.CL (https://arxiv.org/archive/cs.CL) ].
20. Gu, Albert; Dao, Tri (2023-12-01), Mamba: Linear-Time Sequence Modeling with Selective State Spaces,
arXiv:2312.00752 (https://arxiv.org/abs/2312.00752)
21. Kaushal, Ayush; Mahowald, Kyle (2022-06-06), What do tokens know about their characters and how do
they know it? (https://arxiv.org/abs/2206.02608) , arXiv:2206.02608 (https://arxiv.org/abs/2206.0260
8) , retrieved 2024-09-08
22. Yennie Jun (2023-05-03). "All languages are NOT created (tokenized) equal" (https://web.archive.org/we
b/20230817165705/https://blog.yenniejun.com/p/all-languages-are-not-created-tokenized) . Language
models cost much more in some languages than others. Archived from the original (https://blog.yenniej
un.com/p/all-languages-are-not-created-tokenized) on 2023-08-17. Retrieved 2023-08-17. "In other
words, to express the same sentiment, some languages require up to 10 times more tokens."
23. Petrov, Aleksandar; Malfa, Emanuele La; Torr, Philip; Bibi, Adel (June 23, 2023). "Language Model
Tokenizers Introduce Unfairness Between Languages" (https://openreview.net/forum?id=Pj4YYuxTq9) .
NeurIPS. arXiv:2305.15425 (https://arxiv.org/abs/2305.15425) . Archived (https://web.archive.org/we
b/20231215212906/https://openreview.net/forum?id=Pj4YYuxTq9) from the original on December 15,
2023. Retrieved September 16, 2023 – via openreview.net.
25. Paaß, Gerhard; Giesselbach, Sven (2022). "Pre-trained Language Models" (https://link.springer.com/cha
pter/10.1007/978-3-031-23190-2_2) . Foundation Models for Natural Language Processing. Artificial
Intelligence: Foundations, Theory, and Algorithms. pp. 19–78. doi:10.1007/978-3-031-23190-2_2 (http
s://doi.org/10.1007%2F978-3-031-23190-2_2) . ISBN 9783031231902. Archived (https://web.archive.or
g/web/20230803212329/https://link.springer.com/chapter/10.1007/978-3-031-23190-2_2) from the
original on 3 August 2023. Retrieved 3 August 2023.
26. Petrov, Aleksandar; Emanuele La Malfa; Torr, Philip H. S.; Bibi, Adel (2023). "Language Model Tokenizers
Introduce Unfairness Between Languages". arXiv:2305.15425 (https://arxiv.org/abs/2305.15425)
[cs.CL (https://arxiv.org/archive/cs.CL) ].
27. Lundberg, Scott (2023-12-12). "The Art of Prompt Design: Prompt Boundaries and Token Healing" (http
s://towardsdatascience.com/the-art-of-prompt-design-prompt-boundaries-and-token-healing-3b2448b0
be38) . Medium. Retrieved 2024-08-05.
28. Dodge, Jesse; Sap, Maarten; Marasović, Ana; Agnew, William; Ilharco, Gabriel; Groeneveld, Dirk; Mitchell,
Margaret; Gardner, Matt (2021). "Documenting Large Webtext Corpora: A Case Study on the Colossal
Clean Crawled Corpus". arXiv:2104.08758 (https://arxiv.org/abs/2104.08758) [cs.CL (https://arxiv.org/
archive/cs.CL) ].
29. Lee, Katherine; Ippolito, Daphne; Nystrom, Andrew; Zhang, Chiyuan; Eck, Douglas; Callison-Burch, Chris;
Carlini, Nicholas (May 2022). "Deduplicating Training Data Makes Language Models Better" (https://acla
nthology.org/2022.acl-long.577.pdf) (PDF). Proceedings of the 60th Annual Meeting of the
Association for Computational Linguistics. 1: Long Papers: 8424–8445. doi:10.18653/v1/2022.acl-
long.577 (https://doi.org/10.18653%2Fv1%2F2022.acl-long.577) .
30. Li, Yuanzhi; Bubeck, Sébastien; Eldan, Ronen; Del Giorno, Allie; Gunasekar, Suriya; Lee, Yin Tat (2023-09-
11), Textbooks Are All You Need II: phi-1.5 technical report, arXiv:2309.05463 (https://arxiv.org/abs/230
9.05463)
31. Lin, Zhenghao; Gou, Zhibin; Gong, Yeyun; Liu, Xiao; Shen, Yelong; Xu, Ruochen; Lin, Chen; Yang, Yujiu;
Jiao, Jian (2024-04-11). "Rho-1: Not All Tokens Are What You Need". arXiv:2404.07965 (https://arxiv.org/
abs/2404.07965) [cs.CL (https://arxiv.org/archive/cs.CL) ].
32. Brown, Tom B.; et al. (2020). "Language Models are Few-Shot Learners". arXiv:2005.14165 (https://arxiv.
org/abs/2005.14165) [cs.CL (https://arxiv.org/archive/cs.CL) ].
33. Abdin, Marah; Jacobs, Sam Ade; Awan, Ammar Ahmad; Aneja, Jyoti; Awadallah, Ahmed; Awadalla, Hany;
Bach, Nguyen; Bahree, Amit; Bakhtiari, Arash (2024-04-23). "Phi-3 Technical Report: A Highly Capable
Language Model Locally on Your Phone". arXiv:2404.14219 (https://arxiv.org/abs/2404.14219) [cs.CL
(https://arxiv.org/archive/cs.CL) ].
34. Ouyang, Long; Wu, Jeff; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll L.; Mishkin, Pamela; Zhang,
Chong; Agarwal, Sandhini; Slama, Katarina; Ray, Alex; Schulman, John; Hilton, Jacob; Kelton, Fraser;
Miller, Luke; Simens, Maddie; Askell, Amanda; Welinder, Peter; Christiano, Paul; Leike, Jan; Lowe, Ryan
(2022). "Training language models to follow instructions with human feedback". arXiv:2203.02155 (http
s://arxiv.org/abs/2203.02155) [cs.CL (https://arxiv.org/archive/cs.CL) ].
35. Wang, Yizhong; Kordi, Yeganeh; Mishra, Swaroop; Liu, Alisa; Smith, Noah A.; Khashabi, Daniel; Hajishirzi,
Hannaneh (2022). "Self-Instruct: Aligning Language Model with Self Generated Instructions".
arXiv:2212.10560 (https://arxiv.org/abs/2212.10560) [cs.CL (https://arxiv.org/archive/cs.CL) ].
36. Shazeer, Noam; Mirhoseini, Azalia; Maziarz, Krzysztof; Davis, Andy; Le, Quoc; Hinton, Geoffrey; Dean, Jeff
(2017-01-01). "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer".
arXiv:1701.06538 (https://arxiv.org/abs/1701.06538) [cs.LG (https://arxiv.org/archive/cs.LG) ].
37. Lepikhin, Dmitry; Lee, HyoukJoong; Xu, Yuanzhong; Chen, Dehao; Firat, Orhan; Huang, Yanping; Krikun,
Maxim; Shazeer, Noam; Chen, Zhifeng (2021-01-12). "GShard: Scaling Giant Models with Conditional
Computation and Automatic Sharding". arXiv:2006.16668 (https://arxiv.org/abs/2006.16668) [cs.CL (h
ttps://arxiv.org/archive/cs.CL) ].
38. Dai, Andrew M; Du, Nan (December 9, 2021). "More Efficient In-Context Learning with GLaM" (https://ai.g
oogleblog.com/2021/12/more-efficient-in-context-learning-with.html) . ai.googleblog.com. Archived (ht
tps://web.archive.org/web/20230312072042/https://ai.googleblog.com/2021/12/more-efficient-in-cont
ext-learning-with.html) from the original on 2023-03-12. Retrieved 2023-03-09.
39. Wei, Jason; Tay, Yi; Bommasani, Rishi; Raffel, Colin; Zoph, Barret; Borgeaud, Sebastian; Yogatama, Dani;
Bosma, Maarten; Zhou, Denny; Metzler, Donald; Chi, Ed H.; Hashimoto, Tatsunori; Vinyals, Oriol; Liang,
Percy; Dean, Jeff; Fedus, William (31 August 2022). "Emergent Abilities of Large Language Models" (http
s://openreview.net/forum?id=yzkSU5zdwD) . Transactions on Machine Learning Research. ISSN 2835-
8856 (https://search.worldcat.org/issn/2835-8856) . Archived (https://web.archive.org/web/20230322
210052/https://openreview.net/forum?id=yzkSU5zdwD) from the original on 22 March 2023. Retrieved
19 March 2023.
41. Allamar, Jay. "The Illustrated GPT-2 (Visualizing Transformer Language Models)" (https://jalammar.githu
b.io/illustrated-gpt2/) . Retrieved 2023-08-01.
42. "Our next-generation model: Gemini 1.5" (https://blog.google/technology/ai/google-gemini-next-generati
on-model-february-2024/#context-window) . Google. 15 February 2024. Archived (https://web.archive.o
rg/web/20240218141522/https://blog.google/technology/ai/google-gemini-next-generation-model-febr
uary-2024/#context-window) from the original on 18 February 2024. Retrieved 18 February 2024.
45. Zaib, Munazza; Sheng, Quan Z.; Emma Zhang, Wei (4 February 2020). "A Short Survey of Pre-trained
Language Models for Conversational AI-A New Age in NLP" (https://www.researchgate.net/publication/
338931711) . Proceedings of the Australasian Computer Science Week Multiconference. pp. 1–4.
arXiv:2104.10810 (https://arxiv.org/abs/2104.10810) . doi:10.1145/3373017.3373028 (https://doi.org/
10.1145%2F3373017.3373028) . ISBN 9781450376976. S2CID 211040895 (https://api.semanticschola
r.org/CorpusID:211040895) .
46. Jurafsky, Dan; Martin, James H. (7 January 2023). Speech and Language Processing (https://web.stanfo
rd.edu/~jurafsky/slp3/ed3book_jan72023.pdf) (PDF) (3rd edition draft ed.). Archived (https://web.arch
ive.org/web/20230323210221/https://web.stanford.edu/~jurafsky/slp3/ed3book_jan72023.pdf)
(PDF) from the original on 23 March 2023. Retrieved 24 May 2022.
47. "From bare metal to a 70B model: infrastructure set-up and scripts" (https://imbue.com/research/70b-in
frastructure/) . imbue.com. Archived (https://web.archive.org/web/20240726203419/https://imbue.co
m/research/70b-infrastructure/) from the original on 2024-07-26. Retrieved 2024-07-24.
49. Albrecht, Josh (2024-07-23). "State of the Art: Training >70B LLMs on 10,000 H100 clusters" (https://ww
w.latent.space/p/llm-training-2024) . www.latent.space. Retrieved 2024-07-24.
50. Wiggers, Kyle (28 April 2022). "The emerging types of language models and why they matter" (https://te
chcrunch.com/2022/04/28/the-emerging-types-of-language-models-and-why-they-matter/) .
TechCrunch. Archived (https://web.archive.org/web/20230316072443/https://techcrunch.com/2022/0
4/28/the-emerging-types-of-language-models-and-why-they-matter/) from the original on 16 March
2023. Retrieved 9 March 2023.
51. Sharir, Or; Peleg, Barak; Shoham, Yoav (2020). "The Cost of Training NLP Models: A Concise Overview".
arXiv:2004.08900 (https://arxiv.org/abs/2004.08900) [cs.CL (https://arxiv.org/archive/cs.CL) ].
52. Biderman, Stella; Schoelkopf, Hailey; Anthony, Quentin; Bradley, Herbie; Khan, Mohammad Aflah; Purohit,
Shivanshu; Prashanth, USVSN Sai (April 2023). "Pythia: A Suite for Analyzing Large Language Models
Across Training and Scaling". arXiv:2304.01373 (https://arxiv.org/abs/2304.01373) [cs.CL (https://arxi
v.org/archive/cs.CL) ].
53. Maslej, Nestor; Fattorini, Loredana; Brynjolfsson, Erik; Etchemendy, John; Ligett, Katrina; Lyons, Terah;
Manyika, James; Ngo, Helen; Niebles, Juan Carlos (2023-10-05), Artificial Intelligence Index Report 2023,
arXiv:2310.03715 (https://arxiv.org/abs/2310.03715)
54. Section 2.1 and Table 1, Kaplan, Jared; McCandlish, Sam; Henighan, Tom; Brown, Tom B.; Chess,
Benjamin; Child, Rewon; Gray, Scott; Radford, Alec; Wu, Jeffrey; Amodei, Dario (2020). "Scaling Laws for
Neural Language Models". arXiv:2001.08361 (https://arxiv.org/abs/2001.08361) [cs.LG (https://arxiv.o
rg/archive/cs.LG) ].
55. Gao, Luyu; Madaan, Aman; Zhou, Shuyan; Alon, Uri; Liu, Pengfei; Yang, Yiming; Callan, Jamie; Neubig,
Graham (2022-11-01). "PAL: Program-aided Language Models". arXiv:2211.10435 (https://arxiv.org/abs/
2211.10435) [cs.CL (https://arxiv.org/archive/cs.CL) ].
57. Paranjape, Bhargavi; Lundberg, Scott; Singh, Sameer; Hajishirzi, Hannaneh; Zettlemoyer, Luke; Tulio
Ribeiro, Marco (2023-03-01). "ART: Automatic multi-step reasoning and tool-use for large language
models". arXiv:2303.09014 (https://arxiv.org/abs/2303.09014) [cs.CL (https://arxiv.org/archive/cs.C
L) ].
58. Liang, Yaobo; Wu, Chenfei; Song, Ting; Wu, Wenshan; Xia, Yan; Liu, Yu; Ou, Yang; Lu, Shuai; Ji, Lei; Mao,
Shaoguang; Wang, Yun; Shou, Linjun; Gong, Ming; Duan, Nan (2023-03-01). "TaskMatrix.AI: Completing
Tasks by Connecting Foundation Models with Millions of APIs". arXiv:2303.16434 (https://arxiv.org/abs/
2303.16434) [cs.AI (https://arxiv.org/archive/cs.AI) ].
59. Patil, Shishir G.; Zhang, Tianjun; Wang, Xin; Gonzalez, Joseph E. (2023-05-01). "Gorilla: Large Language
Model Connected with Massive APIs". arXiv:2305.15334 (https://arxiv.org/abs/2305.15334) [cs.CL (ht
tps://arxiv.org/archive/cs.CL) ].
60. Lewis, Patrick; Perez, Ethan; Piktus, Aleksandra; Petroni, Fabio; Karpukhin, Vladimir; Goyal, Naman;
Küttler, Heinrich; Lewis, Mike; Yih, Wen-tau; Rocktäschel, Tim; Riedel, Sebastian; Kiela, Douwe (2020).
"Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (https://proceedings.neurips.cc/p
aper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html) . Advances in Neural
Information Processing Systems. 33. Curran Associates, Inc.: 9459–9474. arXiv:2005.11401 (https://arx
iv.org/abs/2005.11401) . Archived (https://web.archive.org/web/20230612171229/https://proceeding
s.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html) from the
original on 2023-06-12. Retrieved 2023-06-12.
61. Huang, Wenlong; Abbeel, Pieter; Pathak, Deepak; Mordatch, Igor (2022-06-28). "Language Models as
Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents" (https://proceedings.mlr.pr
ess/v162/huang22a.html) . Proceedings of the 39th International Conference on Machine Learning.
PMLR: 9118–9147. arXiv:2201.07207 (https://arxiv.org/abs/2201.07207) .
62. Yao, Shunyu; Zhao, Jeffrey; Yu, Dian; Du, Nan; Shafran, Izhak; Narasimhan, Karthik; Cao, Yuan (2022-10-
01). "ReAct: Synergizing Reasoning and Acting in Language Models". arXiv:2210.03629 (https://arxiv.or
g/abs/2210.03629) [cs.CL (https://arxiv.org/archive/cs.CL) ].
63. Wu, Yue; Prabhumoye, Shrimai; Min, So Yeon (24 May 2023). "SPRING: GPT-4 Out-performs RL
Algorithms by Studying Papers and Reasoning". arXiv:2305.15486 (https://arxiv.org/abs/2305.15486)
[cs.AI (https://arxiv.org/archive/cs.AI) ].
64. Wang, Zihao; Cai, Shaofei; Liu, Anji; Ma, Xiaojian; Liang, Yitao (2023-02-03). "Describe, Explain, Plan and
Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents".
arXiv:2302.01560 (https://arxiv.org/abs/2302.01560) [cs.AI (https://arxiv.org/archive/cs.AI) ].
65. Shinn, Noah; Cassano, Federico; Labash, Beck; Gopinath, Ashwin; Narasimhan, Karthik; Yao, Shunyu
(2023-03-01). "Reflexion: Language Agents with Verbal Reinforcement Learning". arXiv:2303.11366 (http
s://arxiv.org/abs/2303.11366) [cs.AI (https://arxiv.org/archive/cs.AI) ].
66. Hao, Shibo; Gu, Yi; Ma, Haodi; Jiahua Hong, Joshua; Wang, Zhen; Zhe Wang, Daisy; Hu, Zhiting (2023-05-
01). "Reasoning with Language Model is Planning with World Model". arXiv:2305.14992 (https://arxiv.or
g/abs/2305.14992) [cs.CL (https://arxiv.org/archive/cs.CL) ].
67. Zhang, Jenny; Lehman, Joel; Stanley, Kenneth; Clune, Jeff (2 June 2023). "OMNI: Open-endedness via
Models of human Notions of Interestingness". arXiv:2306.01711 (https://arxiv.org/abs/2306.01711)
[cs.AI (https://arxiv.org/archive/cs.AI) ].
68. "Voyager | An Open-Ended Embodied Agent with Large Language Models" (https://voyager.minedojo.or
g/) . voyager.minedojo.org. Archived (https://web.archive.org/web/20230608225054/https://voyager.
minedojo.org/) from the original on 2023-06-08. Retrieved 2023-06-09.
69. Park, Joon Sung; O'Brien, Joseph C.; Cai, Carrie J.; Ringel Morris, Meredith; Liang, Percy; Bernstein,
Michael S. (2023-04-01). "Generative Agents: Interactive Simulacra of Human Behavior".
arXiv:2304.03442 (https://arxiv.org/abs/2304.03442) [cs.HC (https://arxiv.org/archive/cs.HC) ].
70. Mann, Tobias. "How to run an LLM locally on your PC in less than 10 minutes" (https://www.theregister.c
om/2024/03/17/ai_pc_local_llm/) . www.theregister.com. Retrieved 2024-05-17.
71. Nagel, Markus; Amjad, Rana Ali; Baalen, Mart Van; Louizos, Christos; Blankevoort, Tijmen (2020-11-21).
"Up or Down? Adaptive Rounding for Post-Training Quantization" (https://proceedings.mlr.press/v119/na
gel20a.html) . Proceedings of the 37th International Conference on Machine Learning. PMLR: 7197–
7206. Archived (https://web.archive.org/web/20230614080854/https://proceedings.mlr.press/v119/na
gel20a.html) from the original on 2023-06-14. Retrieved 2023-06-14.
72. Polino, Antonio; Pascanu, Razvan; Alistarh, Dan (2018-02-01). "Model compression via distillation and
quantization". arXiv:1802.05668 (https://arxiv.org/abs/1802.05668) [cs.NE (https://arxiv.org/archive/c
s.NE) ].
73. Frantar, Elias; Ashkboos, Saleh; Hoefler, Torsten; Alistarh, Dan (2022-10-01). "GPTQ: Accurate Post-
Training Quantization for Generative Pre-trained Transformers". arXiv:2210.17323 (https://arxiv.org/abs/
2210.17323) [cs.LG (https://arxiv.org/archive/cs.LG) ].
74. Dettmers, Tim; Svirschevski, Ruslan; Egiazarian, Vage; Kuznedelev, Denis; Frantar, Elias; Ashkboos, Saleh;
Borzunov, Alexander; Hoefler, Torsten; Alistarh, Dan (2023-06-01). "SpQR: A Sparse-Quantized
Representation for Near-Lossless LLM Weight Compression". arXiv:2306.03078 (https://arxiv.org/abs/2
306.03078) [cs.CL (https://arxiv.org/archive/cs.CL) ].
76. Dettmers, Tim; Pagnoni, Artidoro; Holtzman, Ari; Zettlemoyer, Luke (2023-05-01). "QLoRA: Efficient
Finetuning of Quantized LLMs". arXiv:2305.14314 (https://arxiv.org/abs/2305.14314) [cs.LG (https://ar
xiv.org/archive/cs.LG) ].
77. Kiros, Ryan; Salakhutdinov, Ruslan; Zemel, Rich (2014-06-18). "Multimodal Neural Language Models" (htt
ps://proceedings.mlr.press/v32/kiros14.html) . Proceedings of the 31st International Conference on
Machine Learning. PMLR: 595–603. Archived (https://web.archive.org/web/20230702195952/https://pr
oceedings.mlr.press/v32/kiros14.html) from the original on 2023-07-02. Retrieved 2023-07-02.
78. Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E (2012). "ImageNet Classification with Deep
Convolutional Neural Networks" (https://proceedings.neurips.cc/paper/2012/hash/c399862d3b9d6b76
c8436e924a68c45b-Abstract.html) . Advances in Neural Information Processing Systems. 25. Curran
Associates, Inc. Archived (https://web.archive.org/web/20230702195952/https://proceedings.neurips.c
c/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html) from the original on 2023-
07-02. Retrieved 2023-07-02.
79. Antol, Stanislaw; Agrawal, Aishwarya; Lu, Jiasen; Mitchell, Margaret; Batra, Dhruv; Zitnick, C. Lawrence;
Parikh, Devi (2015). "VQA: Visual Question Answering" (https://openaccess.thecvf.com/content_iccv_20
15/html/Antol_VQA_Visual_Question_ICCV_2015_paper.html) . ICCV: 2425–2433. Archived (https://we
b.archive.org/web/20230702195952/https://openaccess.thecvf.com/content_iccv_2015/html/Antol_V
QA_Visual_Question_ICCV_2015_paper.html) from the original on 2023-07-02. Retrieved 2023-07-02.
80. Li, Junnan; Li, Dongxu; Savarese, Silvio; Hoi, Steven (2023-01-01). "BLIP-2: Bootstrapping Language-
Image Pre-training with Frozen Image Encoders and Large Language Models". arXiv:2301.12597 (http
s://arxiv.org/abs/2301.12597) [cs.CV (https://arxiv.org/archive/cs.CV) ].
81. Alayrac, Jean-Baptiste; Donahue, Jeff; Luc, Pauline; Miech, Antoine; Barr, Iain; Hasson, Yana; Lenc, Karel;
Mensch, Arthur; Millican, Katherine; Reynolds, Malcolm; Ring, Roman; Rutherford, Eliza; Cabi, Serkan;
Han, Tengda; Gong, Zhitao (2022-12-06). "Flamingo: a Visual Language Model for Few-Shot Learning" (ht
tps://proceedings.neurips.cc/paper_files/paper/2022/hash/960a172bc7fbf0177ccccbb411a7d800-Abs
tract-Conference.html) . Advances in Neural Information Processing Systems. 35: 23716–23736.
arXiv:2204.14198 (https://arxiv.org/abs/2204.14198) . Archived (https://web.archive.org/web/202307
02195951/https://proceedings.neurips.cc/paper_files/paper/2022/hash/960a172bc7fbf0177ccccbb41
1a7d800-Abstract-Conference.html) from the original on 2023-07-02. Retrieved 2023-07-02.
82. Driess, Danny; Xia, Fei; Sajjadi, Mehdi S. M.; Lynch, Corey; Chowdhery, Aakanksha; Ichter, Brian; Wahid,
Ayzaan; Tompson, Jonathan; Vuong, Quan; Yu, Tianhe; Huang, Wenlong; Chebotar, Yevgen; Sermanet,
Pierre; Duckworth, Daniel; Levine, Sergey (2023-03-01). "PaLM-E: An Embodied Multimodal Language
Model". arXiv:2303.03378 (https://arxiv.org/abs/2303.03378) [cs.LG (https://arxiv.org/archive/cs.L
G) ].
83. Liu, Haotian; Li, Chunyuan; Wu, Qingyang; Lee, Yong Jae (2023-04-01). "Visual Instruction Tuning".
arXiv:2304.08485 (https://arxiv.org/abs/2304.08485) [cs.CV (https://arxiv.org/archive/cs.CV) ].
84. Zhang, Hang; Li, Xin; Bing, Lidong (2023-06-01). "Video-LLaMA: An Instruction-tuned Audio-Visual
Language Model for Video Understanding". arXiv:2306.02858 (https://arxiv.org/abs/2306.02858)
[cs.CL (https://arxiv.org/archive/cs.CL) ].
87. Pichai, Sundar (10 May 2023), Google Keynote (Google I/O '23) (https://www.youtube.com/watch?v=cNf
INi5CNbY&t=931s) , timestamp 15:31, retrieved 2023-07-02
88. Wiggers, Kyle (11 September 2024). "Mistral releases Pixtral 12B, its first multimodal model" (https://tec
hcrunch.com/2024/09/11/mistral-releases-pixtral-its-first-multimodal-model/?utm_medium=aisecret.us
&utm_source=aisecret.us&utm_campaign=aisecret.us) . TechCrunch. Retrieved 14 September 2024.
89. Hoffmann, Jordan; Borgeaud, Sebastian; Mensch, Arthur; Buchatskaya, Elena; Cai, Trevor; Rutherford,
Eliza; Casas, Diego de Las; Hendricks, Lisa Anne; Welbl, Johannes; Clark, Aidan; Hennigan, Tom; Noland,
Eric; Millican, Katie; Driessche, George van den; Damoc, Bogdan (2022-03-29). "Training Compute-
Optimal Large Language Models". arXiv:2203.15556 (https://arxiv.org/abs/2203.15556) [cs.CL (http
s://arxiv.org/archive/cs.CL) ].
90. Caballero, Ethan; Gupta, Kshitij; Rish, Irina; Krueger, David (2022). "Broken Neural Scaling Laws".
arXiv:2210.14891 (https://arxiv.org/abs/2210.14891) [cs.LG (https://arxiv.org/archive/cs.LG) ].
93. Hahn, Michael; Goyal, Navin (2023-03-14). "A Theory of Emergent In-Context Learning as Implicit
Structure Induction". arXiv:2303.07971 (https://arxiv.org/abs/2303.07971) [cs.LG (https://arxiv.org/arc
hive/cs.LG) ].
94. Pilehvar, Mohammad Taher; Camacho-Collados, Jose (June 2019). "Proceedings of the 2019
Conference of the North" (https://aclanthology.org/N19-1128) . Proceedings of the 2019 Conference of
the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for
Computational Linguistics: 1267–1273. doi:10.18653/v1/N19-1128 (https://doi.org/10.18653%2Fv1%2F
N19-1128) . S2CID 102353817 (https://api.semanticscholar.org/CorpusID:102353817) . Archived (htt
ps://web.archive.org/web/20230627202732/https://aclanthology.org/N19-1128/) from the original on
2023-06-27. Retrieved 2023-06-27.
96. Patel, Roma; Pavlick, Ellie (2021-10-06). "Mapping Language Models to Grounded Conceptual Spaces" (h
ttps://openreview.net/forum?id=gJcEM8sxHK) . ICLR. Archived (https://web.archive.org/web/2023062
4191940/https://openreview.net/forum?id=gJcEM8sxHK) from the original on 2023-06-24. Retrieved
2023-06-27.
98. Ornes, Stephen (March 16, 2023). "The Unpredictable Abilities Emerging From Large AI Models" (https://
www.quantamagazine.org/the-unpredictable-abilities-emerging-from-large-ai-models-20230316/) .
Quanta Magazine. Archived (https://web.archive.org/web/20230316203438/https://www.quantamagazi
ne.org/the-unpredictable-abilities-emerging-from-large-ai-models-20230316/) from the original on
March 16, 2023. Retrieved March 16, 2023.
99. Schaeffer, Rylan; Miranda, Brando; Koyejo, Sanmi (2023-04-01). "Are Emergent Abilities of Large
Language Models a Mirage?". arXiv:2304.15004 (https://arxiv.org/abs/2304.15004) [cs.AI (https://arxi
v.org/archive/cs.AI) ].
100. Li, Kenneth; Hopkins, Aspen K.; Bau, David; Viégas, Fernanda; Pfister, Hanspeter; Wattenberg, Martin
(2022-10-01). "Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic
Task". arXiv:2210.13382 (https://arxiv.org/abs/2210.13382) [cs.LG (https://arxiv.org/archive/cs.LG) ].
101. "Large Language Model: world models or surface statistics?" (https://thegradient.pub/othello/) . The
Gradient. 2023-01-21. Retrieved 2023-06-12.
102. Jin, Charles; Rinard, Martin (2023-05-01). "Evidence of Meaning in Language Models Trained on
Programs". arXiv:2305.11169 (https://arxiv.org/abs/2305.11169) [cs.LG (https://arxiv.org/archive/cs.L
G) ].
103. Nanda, Neel; Chan, Lawrence; Lieberum, Tom; Smith, Jess; Steinhardt, Jacob (2023-01-01). "Progress
measures for grokking via mechanistic interpretability". arXiv:2301.05217 (https://arxiv.org/abs/2301.05
217) [cs.LG (https://arxiv.org/archive/cs.LG) ].
104. Mitchell, Melanie; Krakauer, David C. (28 March 2023). "The debate over understanding in AI's large
language models" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10068812) . Proceedings of the
National Academy of Sciences. 120 (13): e2215907120. arXiv:2210.13966 (https://arxiv.org/abs/2210.1
3966) . Bibcode:2023PNAS..12015907M (https://ui.adsabs.harvard.edu/abs/2023PNAS..12015907
M) . doi:10.1073/pnas.2215907120 (https://doi.org/10.1073%2Fpnas.2215907120) . PMC 10068812
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10068812) . PMID 36943882 (https://pubmed.ncbi.nl
m.nih.gov/36943882) .
105. Metz, Cade (16 May 2023). "Microsoft Says New A.I. Shows Signs of Human Reasoning" (https://www.n
ytimes.com/2023/05/16/technology/microsoft-ai-human-reasoning.html) . The New York Times.
106. Bubeck, Sébastien; Chandrasekaran, Varun; Eldan, Ronen; Gehrke, Johannes; Horvitz, Eric; Kamar, Ece;
Lee, Peter; Lee, Yin Tat; Li, Yuanzhi; Lundberg, Scott; Nori, Harsha; Palangi, Hamid; Ribeiro, Marco Tulio;
Zhang, Yi (2023). "Sparks of Artificial General Intelligence: Early experiments with GPT-4".
arXiv:2303.12712 (https://arxiv.org/abs/2303.12712) [cs.CL (https://arxiv.org/archive/cs.CL) ].
107. "ChatGPT is more like an 'alien intelligence' than a human brain, says futurist" (https://www.zdnet.com/a
rticle/chatgpt-is-more-like-an-alien-intelligence-than-a-human-brain-says-futurist/) . ZDNET. 2023.
Archived (https://web.archive.org/web/20230612065937/https://www.zdnet.com/article/chatgpt-is-mor
e-like-an-alien-intelligence-than-a-human-brain-says-futurist/) from the original on 12 June 2023.
Retrieved 12 June 2023.
108. Newport, Cal (13 April 2023). "What Kind of Mind Does ChatGPT Have?" (https://www.newyorker.com/sc
ience/annals-of-artificial-intelligence/what-kind-of-mind-does-chatgpt-have) . The New Yorker. Archived
(https://web.archive.org/web/20230612071443/https://www.newyorker.com/science/annals-of-artificia
l-intelligence/what-kind-of-mind-does-chatgpt-have) from the original on 12 June 2023. Retrieved
12 June 2023.
109. Roose, Kevin (30 May 2023). "Why an Octopus-like Creature Has Come to Symbolize the State of A.I." (ht
tps://www.nytimes.com/2023/05/30/technology/shoggoth-meme-ai.html) The New York Times.
Archived (https://web.archive.org/web/20230530193814/https://www.nytimes.com/2023/05/30/techn
ology/shoggoth-meme-ai.html) from the original on 30 May 2023. Retrieved 12 June 2023.
110. "The A to Z of Artificial Intelligence" (https://time.com/6271657/a-to-z-of-artificial-intelligence/) . Time
Magazine. 13 April 2023. Archived (https://web.archive.org/web/20230616123839/https://time.com/62
71657/a-to-z-of-artificial-intelligence/) from the original on 16 June 2023. Retrieved 12 June 2023.
111. Ji, Ziwei; Lee, Nayeon; Frieske, Rita; Yu, Tiezheng; Su, Dan; Xu, Yan; Ishii, Etsuko; Bang, Yejin; Dai,
Wenliang; Madotto, Andrea; Fung, Pascale (November 2022). "Survey of Hallucination in Natural
Language Generation" (https://dl.acm.org/doi/pdf/10.1145/3571730) (pdf). ACM Computing Surveys.
55 (12). Association for Computing Machinery: 1–38. arXiv:2202.03629 (https://arxiv.org/abs/2202.036
29) . doi:10.1145/3571730 (https://doi.org/10.1145%2F3571730) . S2CID 246652372 (https://api.se
manticscholar.org/CorpusID:246652372) . Archived (https://web.archive.org/web/20230326145635/ht
tps://dl.acm.org/doi/pdf/10.1145/3571730) from the original on 26 March 2023. Retrieved 15 January
2023.
112. Varshney, Neeraj; Yao, Wenlin; Zhang, Hongming; Chen, Jianshu; Yu, Dong (2023). "A Stitch in Time
Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation".
arXiv:2307.03987 (https://arxiv.org/abs/2307.03987) [cs.CL (https://arxiv.org/archive/cs.CL) ].
113. Lakoff, George (1999). Philosophy in the Flesh: The Embodied Mind and Its Challenge to Western
Philosophy; Appendix: The Neural Theory of Language Paradigm. New York Basic Books. pp. 569–583.
ISBN 978-0-465-05674-3.
114. Evans, Vyvyan. (2014). The Language Myth. Cambridge University Press. ISBN 978-1-107-04396-1.
115. Friston, Karl J. (2022). Active Inference: The Free Energy Principle in Mind, Brain, and Behavior; Chapter 4
The Generative Models of Active Inference. The MIT Press. ISBN 978-0-262-36997-8.
116. Huyen, Chip (October 18, 2019). "Evaluation Metrics for Language Modeling" (https://thegradient.pub/un
derstanding-evaluation-metrics-for-language-models/) . The Gradient. Retrieved January 14, 2024.
117. Clark, Christopher; Lee, Kenton; Chang, Ming-Wei; Kwiatkowski, Tom; Collins, Michael; Toutanova,
Kristina (2019). "BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions".
arXiv:1905.10044 (https://arxiv.org/abs/1905.10044) [cs.CL (https://arxiv.org/archive/cs.CL) ].
118. Wayne Xin Zhao; Zhou, Kun; Li, Junyi; Tang, Tianyi; Wang, Xiaolei; Hou, Yupeng; Min, Yingqian; Zhang,
Beichen; Zhang, Junjie; Dong, Zican; Du, Yifan; Yang, Chen; Chen, Yushuo; Chen, Zhipeng; Jiang, Jinhao;
Ren, Ruiyang; Li, Yifan; Tang, Xinyu; Liu, Zikang; Liu, Peiyu; Nie, Jian-Yun; Wen, Ji-Rong (2023). "A Survey
of Large Language Models". arXiv:2303.18223 (https://arxiv.org/abs/2303.18223) [cs.CL (https://arxiv.
org/archive/cs.CL) ].
122. Srivastava, Aarohi; et al. (2022). "Beyond the Imitation Game: Quantifying and extrapolating the
capabilities of language models". arXiv:2206.04615 (https://arxiv.org/abs/2206.04615) [cs.CL (https://
arxiv.org/archive/cs.CL) ].
123. Lin, Stephanie; Hilton, Jacob; Evans, Owain (2021). "TruthfulQA: Measuring How Models Mimic Human
Falsehoods". arXiv:2109.07958 (https://arxiv.org/abs/2109.07958) [cs.CL (https://arxiv.org/archive/c
s.CL) ].
124. Zellers, Rowan; Holtzman, Ari; Bisk, Yonatan; Farhadi, Ali; Choi, Yejin (2019). "HellaSwag: Can a Machine
Really Finish Your Sentence?". arXiv:1905.07830 (https://arxiv.org/abs/1905.07830) [cs.CL (https://arx
iv.org/archive/cs.CL) ].
125. "Prepare for truly useful large language models". Nature Biomedical Engineering. 7 (2): 85–86. 7 March
2023. doi:10.1038/s41551-023-01012-6 (https://doi.org/10.1038%2Fs41551-023-01012-6) .
PMID 36882584 (https://pubmed.ncbi.nlm.nih.gov/36882584) . S2CID 257403466 (https://api.semanti
cscholar.org/CorpusID:257403466) .
128. Peng, Zhencan; Wang, Zhizhi; Deng, Dong (13 June 2023). "Near-Duplicate Sequence Search at Scale for
Large Language Model Memorization Evaluation" (https://people.cs.rutgers.edu/~dd903/assets/paper
s/sigmod23.pdf) (PDF). Proceedings of the ACM on Management of Data. 1 (2): 1–18.
doi:10.1145/3589324 (https://doi.org/10.1145%2F3589324) . S2CID 259213212 (https://api.semantics
cholar.org/CorpusID:259213212) . Archived (https://web.archive.org/web/20240827053753/https://pe
ople.cs.rutgers.edu/~dd903/assets/papers/sigmod23.pdf) (PDF) from the original on 2024-08-27.
Retrieved 2024-01-20. Citing Lee et al 2022.
130. Alba, Davey (1 May 2023). "AI chatbots have been used to create dozens of news content farms" (http
s://www.japantimes.co.jp/news/2023/05/01/business/tech/ai-fake-news-content-farms/) . The Japan
Times. Retrieved 18 June 2023.
131. "Could chatbots help devise the next pandemic virus?" (https://www.science.org/content/article/could-c
hatbots-help-devise-next-pandemic-virus) . Science. 14 June 2023. doi:10.1126/science.adj2463 (http
s://doi.org/10.1126%2Fscience.adj2463) . Archived (https://web.archive.org/web/20230618013834/ht
tps://www.science.org/content/article/could-chatbots-help-devise-next-pandemic-virus) from the
original on 18 June 2023. Retrieved 18 June 2023.
132. Stephen Council (1 Dec 2023). "How Googlers cracked an SF rival's tech model with a single word" (http
s://www.sfgate.com/tech/article/google-openai-chatgpt-break-model-18525445.php) . SFGATE.
Archived (https://web.archive.org/web/20231216160941/https://www.sfgate.com/tech/article/google-
openai-chatgpt-break-model-18525445.php) from the original on 16 December 2023.
133. Hubinger, Evan (10 January 2024). "Sleeper Agents: Training Deceptive LLMs that Persist Through
Safety Training". arXiv:2401.05566 (https://arxiv.org/abs/2401.05566) [cs.CR (https://arxiv.org/archiv
e/cs.CR) ].
134. Kang, Daniel (2023). "Exploiting programmatic behavior of LLMs: Dual-use through standard security
attacks". arXiv:2302.05733 (https://arxiv.org/abs/2302.05733) [cs.CR (https://arxiv.org/archive/cs.C
R) ].
135. Wang, Yongge (20 June 2024). "Encryption Based Covert Channel for Large Language Models" (https://e
print.iacr.org/2024/586.pdf) (PDF). IACR ePrint 2024/586. Archived (https://web.archive.org/web/202
40624191233/https://eprint.iacr.org/2024/586.pdf) (PDF) from the original on 24 June 2024.
Retrieved 24 June 2024.
136. Stokel-Walker, Chris (November 22, 2023). "ChatGPT Replicates Gender Bias in Recommendation
Letters" (https://www.scientificamerican.com/article/chatgpt-replicates-gender-bias-in-recommendation
-letters/) . Scientific American. Archived (https://web.archive.org/web/20231229043124/https://www.s
cientificamerican.com/article/chatgpt-replicates-gender-bias-in-recommendation-letters/) from the
original on 2023-12-29. Retrieved 2023-12-29.
137. Luo, Queenie; Puett, Michael J.; Smith, Michael D. (2023-03-28). "A Perspectival Mirror of the Elephant:
Investigating Language Bias on Google, ChatGPT, Wikipedia, and YouTube". arXiv:2303.16281v2 (https://
arxiv.org/abs/2303.16281v2) [cs.CY (https://arxiv.org/archive/cs.CY) ].
138. Cheng, Myra; Durmus, Esin; Jurafsky, Dan (2023-05-29), Marked Personas: Using Natural Language
Prompts to Measure Stereotypes in Language Models, arXiv:2305.18189 (https://arxiv.org/abs/2305.18
189)
139. Kotek, Hadas; Dockum, Rikker; Sun, David (2023-11-05). "Gender bias and stereotypes in Large
Language Models" (https://dl.acm.org/doi/10.1145/3582269.3615599) . Proceedings of the ACM
Collective Intelligence Conference. CI '23. New York, NY, USA: Association for Computing Machinery.
pp. 12–24. doi:10.1145/3582269.3615599 (https://doi.org/10.1145%2F3582269.3615599) . ISBN 979-
8-4007-0113-9.
140. Heikkilä, Melissa (August 7, 2023). "AI language models are rife with different political biases" (https://w
ww.technologyreview.com/2023/08/07/1077324/ai-language-models-are-rife-with-political-biases/) .
MIT Technology Review. Retrieved 2023-12-29.
143. Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (11 October 2018). "BERT: Pre-training
of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805v2 (https://arxiv.org/
abs/1810.04805v2) [cs.CL (https://arxiv.org/archive/cs.CL) ].
144. Prickett, Nicole Hemsoth (2021-08-24). "Cerebras Shifts Architecture To Meet Massive AI/ML Models" (h
ttps://www.nextplatform.com/2021/08/24/cerebras-shifts-architecture-to-meet-massive-ai-ml-model
s/) . The Next Platform. Archived (https://web.archive.org/web/20230620151619/https://www.nextpla
tform.com/2021/08/24/cerebras-shifts-architecture-to-meet-massive-ai-ml-models/) from the original
on 2023-06-20. Retrieved 2023-06-20.
146. Patel, Ajay; Li, Bryan; Rasooli, Mohammad Sadegh; Constant, Noah; Raffel, Colin; Callison-Burch, Chris
(2022). "Bidirectional Language Models Are Also Few-shot Learners". arXiv:2209.14500 (https://arxiv.or
g/abs/2209.14500) [cs.LG (https://arxiv.org/archive/cs.LG) ].
147. Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (11 October 2018). "BERT: Pre-training
of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805v2 (https://arxiv.org/
abs/1810.04805v2) [cs.CL (https://arxiv.org/archive/cs.CL) ].
148. Raffel, Colin; Shazeer, Noam; Roberts, Adam; Lee, Katherine; Narang, Sharan; Matena, Michael; Zhou,
Yanqi; Li, Wei; Liu, Peter J. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text
Transformer" (http://jmlr.org/papers/v21/20-074.html) . Journal of Machine Learning Research. 21
(140): 1–67. arXiv:1910.10683 (https://arxiv.org/abs/1910.10683) . ISSN 1533-7928 (https://search.wo
rldcat.org/issn/1533-7928) .
153. Yang, Zhilin; Dai, Zihang; Yang, Yiming; Carbonell, Jaime; Salakhutdinov, Ruslan; Le, Quoc V. (2 January
2020). "XLNet: Generalized Autoregressive Pretraining for Language Understanding". arXiv:1906.08237
(https://arxiv.org/abs/1906.08237) [cs.CL (https://arxiv.org/archive/cs.CL) ].
159. Table D.1 in Brown, Tom B.; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared; Dhariwal,
Prafulla; Neelakantan, Arvind; Shyam, Pranav; Sastry, Girish; Askell, Amanda; Agarwal, Sandhini; Herbert-
Voss, Ariel; Krueger, Gretchen; Henighan, Tom; Child, Rewon; Ramesh, Aditya; Ziegler, Daniel M.; Wu,
Jeffrey; Winter, Clemens; Hesse, Christopher; Chen, Mark; Sigler, Eric; Litwin, Mateusz; Gray, Scott; Chess,
Benjamin; Clark, Jack; Berner, Christopher; McCandlish, Sam; Radford, Alec; Sutskever, Ilya; Amodei,
Dario (May 28, 2020). "Language Models are Few-Shot Learners". arXiv:2005.14165v4 (https://arxiv.org/
abs/2005.14165v4) [cs.CL (https://arxiv.org/archive/cs.CL) ].
160. "ChatGPT: Optimizing Language Models for Dialogue" (https://openai.com/blog/chatgpt/) . OpenAI.
2022-11-30. Archived (https://web.archive.org/web/20221130180912/https://openai.com/blog/chatgp
t/) from the original on 2022-11-30. Retrieved 2023-01-13.
162. Gao, Leo; Biderman, Stella; Black, Sid; Golding, Laurence; Hoppe, Travis; Foster, Charles; Phang, Jason;
He, Horace; Thite, Anish; Nabeshima, Noa; Presser, Shawn; Leahy, Connor (31 December 2020). "The
Pile: An 800GB Dataset of Diverse Text for Language Modeling". arXiv:2101.00027 (https://arxiv.org/ab
s/2101.00027) [cs.CL (https://arxiv.org/archive/cs.CL) ].
163. Iyer, Abhishek (15 May 2021). "GPT-3's free alternative GPT-Neo is something to be excited about" (http
s://venturebeat.com/ai/gpt-3s-free-alternative-gpt-neo-is-something-to-be-excited-about/) .
VentureBeat. Archived (https://web.archive.org/web/20230309012717/https://venturebeat.com/ai/gpt-
3s-free-alternative-gpt-neo-is-something-to-be-excited-about/) from the original on 9 March 2023.
Retrieved 13 March 2023.
164. "GPT-J-6B: An Introduction to the Largest Open Source GPT Model | Forefront" (https://web.archive.org/
web/20230309205439/https://www.forefront.ai/blog-posts/gpt-j-6b-an-introduction-to-the-largest-open-
sourced-gpt-model) . www.forefront.ai. Archived from the original (https://www.forefront.ai/blog-posts/
gpt-j-6b-an-introduction-to-the-largest-open-sourced-gpt-model) on 2023-03-09. Retrieved 2023-02-28.
165. Dey, Nolan; Gosal, Gurpreet; Zhiming; Chen; Khachane, Hemant; Marshall, William; Pathria, Ribhu; Tom,
Marvin; Hestness, Joel (2023-04-01). "Cerebras-GPT: Open Compute-Optimal Language Models Trained
on the Cerebras Wafer-Scale Cluster". arXiv:2304.03208 (https://arxiv.org/abs/2304.03208) [cs.LG (htt
ps://arxiv.org/archive/cs.LG) ].
166. Alvi, Ali; Kharya, Paresh (11 October 2021). "Using DeepSpeed and Megatron to Train Megatron-Turing
NLG 530B, the World's Largest and Most Powerful Generative Language Model" (https://www.microsoft.
com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-
largest-and-most-powerful-generative-language-model/) . Microsoft Research. Archived (https://web.ar
chive.org/web/20230313180531/https://www.microsoft.com/en-us/research/blog/using-deepspeed-an
d-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-languag
e-model/) from the original on 13 March 2023. Retrieved 13 March 2023.
167. Smith, Shaden; Patwary, Mostofa; Norick, Brandon; LeGresley, Patrick; Rajbhandari, Samyam; Casper,
Jared; Liu, Zhun; Prabhumoye, Shrimai; Zerveas, George; Korthikanti, Vijay; Zhang, Elton; Child, Rewon;
Aminabadi, Reza Yazdani; Bernauer, Julie; Song, Xia (2022-02-04). "Using DeepSpeed and Megatron to
Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model". arXiv:2201.11990 (http
s://arxiv.org/abs/2201.11990) [cs.CL (https://arxiv.org/archive/cs.CL) ].
168. Wang, Shuohuan; Sun, Yu; Xiang, Yang; Wu, Zhihua; Ding, Siyu; Gong, Weibao; Feng, Shikun; Shang,
Junyuan; Zhao, Yanbin; Pang, Chao; Liu, Jiaxiang; Chen, Xuyi; Lu, Yuxiang; Liu, Weixin; Wang, Xi; Bai,
Yangfan; Chen, Qiuliang; Zhao, Li; Li, Shiyong; Sun, Peng; Yu, Dianhai; Ma, Yanjun; Tian, Hao; Wu, Hua; Wu,
Tian; Zeng, Wei; Li, Ge; Gao, Wen; Wang, Haifeng (December 23, 2021). "ERNIE 3.0 Titan: Exploring
Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation".
arXiv:2112.12731 (https://arxiv.org/abs/2112.12731) [cs.CL (https://arxiv.org/archive/cs.CL) ].
170. Askell, Amanda; Bai, Yuntao; Chen, Anna; et al. (9 December 2021). "A General Language Assistant as a
Laboratory for Alignment". arXiv:2112.00861 (https://arxiv.org/abs/2112.00861) [cs.CL (https://arxiv.o
rg/archive/cs.CL) ].
171. Bai, Yuntao; Kadavath, Saurav; Kundu, Sandipan; et al. (15 December 2022). "Constitutional AI:
Harmlessness from AI Feedback". arXiv:2212.08073 (https://arxiv.org/abs/2212.08073) [cs.CL (http
s://arxiv.org/archive/cs.CL) ].
172. "Language modelling at scale: Gopher, ethical considerations, and retrieval" (https://www.deepmind.co
m/blog/language-modelling-at-scale-gopher-ethical-considerations-and-retrieval) .
www.deepmind.com. 8 December 2021. Archived (https://web.archive.org/web/20230320082323/http
s://www.deepmind.com/blog/language-modelling-at-scale-gopher-ethical-considerations-and-retrieva
l) from the original on 20 March 2023. Retrieved 20 March 2023.
173. Hoffmann, Jordan; Borgeaud, Sebastian; Mensch, Arthur; et al. (29 March 2022). "Training Compute-
Optimal Large Language Models". arXiv:2203.15556 (https://arxiv.org/abs/2203.15556) [cs.CL (http
s://arxiv.org/archive/cs.CL) ].
174. Table 20 and page 66 of PaLM: Scaling Language Modeling with Pathways (https://storage.googleapis.c
om/pathways-language-model/PaLM-paper.pdf) Archived (https://web.archive.org/web/20230610040
050/https://storage.googleapis.com/pathways-language-model/PaLM-paper.pdf) 2023-06-10 at the
Wayback Machine
175. Cheng, Heng-Tze; Thoppilan, Romal (January 21, 2022). "LaMDA: Towards Safe, Grounded, and High-
Quality Dialog Models for Everything" (https://ai.googleblog.com/2022/01/lamda-towards-safe-grounde
d-and-high.html) . ai.googleblog.com. Archived (https://web.archive.org/web/20220325014118/http
s://ai.googleblog.com/2022/01/lamda-towards-safe-grounded-and-high.html) from the original on
2022-03-25. Retrieved 2023-03-09.
176. Thoppilan, Romal; De Freitas, Daniel; Hall, Jamie; Shazeer, Noam; Kulshreshtha, Apoorv; Cheng, Heng-
Tze; Jin, Alicia; Bos, Taylor; Baker, Leslie; Du, Yu; Li, YaGuang; Lee, Hongrae; Zheng, Huaixiu Steven;
Ghafouri, Amin; Menegali, Marcelo (2022-01-01). "LaMDA: Language Models for Dialog Applications".
arXiv:2201.08239 (https://arxiv.org/abs/2201.08239) [cs.CL (https://arxiv.org/archive/cs.CL) ].
177. Black, Sidney; Biderman, Stella; Hallahan, Eric; et al. (2022-05-01). GPT-NeoX-20B: An Open-Source
Autoregressive Language Model (https://aclanthology.org/2022.bigscience-1.9/) . Proceedings of
BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models.
Vol. Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large
Language Models. pp. 95–136. Archived (https://web.archive.org/web/20221210082456/https://aclant
hology.org/2022.bigscience-1.9/) from the original on 2022-12-10. Retrieved 2022-12-19.
178. Hoffmann, Jordan; Borgeaud, Sebastian; Mensch, Arthur; Sifre, Laurent (12 April 2022). "An empirical
analysis of compute-optimal large language model training" (https://www.deepmind.com/blog/an-empir
ical-analysis-of-compute-optimal-large-language-model-training) . Deepmind Blog. Archived (https://we
b.archive.org/web/20220413014510/https://www.deepmind.com/blog/an-empirical-analysis-of-comput
e-optimal-large-language-model-training) from the original on 13 April 2022. Retrieved 9 March 2023.
179. Narang, Sharan; Chowdhery, Aakanksha (April 4, 2022). "Pathways Language Model (PaLM): Scaling to
540 Billion Parameters for Breakthrough Performance" (https://ai.googleblog.com/2022/04/pathways-la
nguage-model-palm-scaling-to.html) . ai.googleblog.com. Archived (https://web.archive.org/web/2022
0404161447/https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html)
from the original on 2022-04-04. Retrieved 2023-03-09.
180. Susan Zhang; Mona Diab; Luke Zettlemoyer. "Democratizing access to large-scale language models with
OPT-175B" (https://ai.facebook.com/blog/democratizing-access-to-large-scale-language-models-with-o
pt-175b/) . ai.facebook.com. Archived (https://web.archive.org/web/20230312231820/https://ai.faceb
ook.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/) from the
original on 2023-03-12. Retrieved 2023-03-12.
181. Zhang, Susan; Roller, Stephen; Goyal, Naman; Artetxe, Mikel; Chen, Moya; Chen, Shuohui; Dewan,
Christopher; Diab, Mona; Li, Xian; Lin, Xi Victoria; Mihaylov, Todor; Ott, Myle; Shleifer, Sam; Shuster, Kurt;
Simig, Daniel; Koura, Punit Singh; Sridhar, Anjali; Wang, Tianlu; Zettlemoyer, Luke (21 June 2022). "OPT:
Open Pre-trained Transformer Language Models". arXiv:2205.01068 (https://arxiv.org/abs/2205.0106
8) [cs.CL (https://arxiv.org/archive/cs.CL) ].
182. Khrushchev, Mikhail; Vasilev, Ruslan; Petrov, Alexey; Zinov, Nikolay (2022-06-22), YaLM 100B (https://gith
ub.com/yandex/YaLM-100B) , archived (https://web.archive.org/web/20230616050056/https://github.
com/yandex/YaLM-100B) from the original on 2023-06-16, retrieved 2023-03-18
183. Lewkowycz, Aitor; Andreassen, Anders; Dohan, David; Dyer, Ethan; Michalewski, Henryk; Ramasesh,
Vinay; Slone, Ambrose; Anil, Cem; Schlag, Imanol; Gutman-Solo, Theo; Wu, Yuhuai; Neyshabur, Behnam;
Gur-Ari, Guy; Misra, Vedant (30 June 2022). "Solving Quantitative Reasoning Problems with Language
Models". arXiv:2206.14858 (https://arxiv.org/abs/2206.14858) [cs.CL (https://arxiv.org/archive/cs.C
L) ].
184. "Minerva: Solving Quantitative Reasoning Problems with Language Models" (https://ai.googleblog.com/
2022/06/minerva-solving-quantitative-reasoning.html) . ai.googleblog.com. 30 June 2022. Retrieved
20 March 2023.
185. Ananthaswamy, Anil (8 March 2023). "In AI, is bigger always better?" (https://www.nature.com/articles/d
41586-023-00641-w) . Nature. 615 (7951): 202–205. Bibcode:2023Natur.615..202A (https://ui.adsabs.
harvard.edu/abs/2023Natur.615..202A) . doi:10.1038/d41586-023-00641-w (https://doi.org/10.1038%
2Fd41586-023-00641-w) . PMID 36890378 (https://pubmed.ncbi.nlm.nih.gov/36890378) .
S2CID 257380916 (https://api.semanticscholar.org/CorpusID:257380916) . Archived (https://web.archi
ve.org/web/20230316181013/https://www.nature.com/articles/d41586-023-00641-w) from the
original on 16 March 2023. Retrieved 9 March 2023.
187. Taylor, Ross; Kardas, Marcin; Cucurull, Guillem; Scialom, Thomas; Hartshorn, Anthony; Saravia, Elvis;
Poulton, Andrew; Kerkez, Viktor; Stojnic, Robert (16 November 2022). "Galactica: A Large Language
Model for Science". arXiv:2211.09085 (https://arxiv.org/abs/2211.09085) [cs.CL (https://arxiv.org/arch
ive/cs.CL) ].
188. "20B-parameter Alexa model sets new marks in few-shot learning" (https://www.amazon.science/blog/2
0b-parameter-alexa-model-sets-new-marks-in-few-shot-learning) . Amazon Science. 2 August 2022.
Archived (https://web.archive.org/web/20230315190223/https://www.amazon.science/blog/20b-para
meter-alexa-model-sets-new-marks-in-few-shot-learning) from the original on 15 March 2023.
Retrieved 12 March 2023.
189. Soltan, Saleh; Ananthakrishnan, Shankar; FitzGerald, Jack; et al. (3 August 2022). "AlexaTM 20B: Few-
Shot Learning Using a Large-Scale Multilingual Seq2Seq Model". arXiv:2208.01448 (https://arxiv.org/ab
s/2208.01448) [cs.CL (https://arxiv.org/archive/cs.CL) ].
190. "AlexaTM 20B is now available in Amazon SageMaker JumpStart | AWS Machine Learning Blog" (https://
aws.amazon.com/blogs/machine-learning/alexatm-20b-is-now-available-in-amazon-sagemaker-jumpsta
rt/) . aws.amazon.com. 17 November 2022. Archived (https://web.archive.org/web/20230313163933/
https://aws.amazon.com/blogs/machine-learning/alexatm-20b-is-now-available-in-amazon-sagemaker-j
umpstart/) from the original on 13 March 2023. Retrieved 13 March 2023.
192. "The Falcon has landed in the Hugging Face ecosystem" (https://huggingface.co/blog/falcon) .
huggingface.co. Archived (https://web.archive.org/web/20230620002832/https://huggingface.co/blog/
falcon) from the original on 2023-06-20. Retrieved 2023-06-20.
193. "GPT-4 Technical Report" (https://cdn.openai.com/papers/gpt-4.pdf) (PDF). OpenAI. 2023. Archived (ht
tps://web.archive.org/web/20230314190904/https://cdn.openai.com/papers/gpt-4.pdf) (PDF) from
the original on March 14, 2023. Retrieved March 14, 2023.
194. Schreiner, Maximilian (2023-07-11). "GPT-4 architecture, datasets, costs and more leaked" (https://the-d
ecoder.com/gpt-4-architecture-datasets-costs-and-more-leaked/) . THE DECODER. Archived (https://w
eb.archive.org/web/20230712123915/https://the-decoder.com/gpt-4-architecture-datasets-costs-and-
more-leaked/) from the original on 2023-07-12. Retrieved 2024-07-26.
195. Dey, Nolan (March 28, 2023). "Cerebras-GPT: A Family of Open, Compute-efficient, Large Language
Models" (https://www.cerebras.net/blog/cerebras-gpt-a-family-of-open-compute-efficient-large-languag
e-models/) . Cerebras. Archived (https://web.archive.org/web/20230328213339/https://www.cerebras.
net/blog/cerebras-gpt-a-family-of-open-compute-efficient-large-language-models/) from the original
on March 28, 2023. Retrieved March 28, 2023.
196. "Abu Dhabi-based TII launches its own version of ChatGPT" (https://fastcompanyme.com/news/abu-dha
bi-based-tii-launches-its-own-version-of-chatgpt/) . tii.ae. Archived (https://web.archive.org/web/20230
403021729/https://fastcompanyme.com/news/abu-dhabi-based-tii-launches-its-own-version-of-chatgp
t/) from the original on 2023-04-03. Retrieved 2023-04-03.
197. Penedo, Guilherme; Malartic, Quentin; Hesslow, Daniel; Cojocaru, Ruxandra; Cappelli, Alessandro;
Alobeidli, Hamza; Pannier, Baptiste; Almazrouei, Ebtesam; Launay, Julien (2023-06-01). "The RefinedWeb
Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only".
arXiv:2306.01116 (https://arxiv.org/abs/2306.01116) [cs.CL (https://arxiv.org/archive/cs.CL) ].
199. "UAE's Falcon 40B, World's Top-Ranked AI Model from Technology Innovation Institute, is Now Royalty-Free" (https://www.businesswire.com/news/home/20230531005608/en/UAE's-Falcon-40B-World's-Top-Ranked-AI-Model-from-Technology-Innovation-Institute-is-Now-Royalty-Free). 31 May 2023. Archived (https://web.archive.org/web/20240208133040/https://www.businesswire.com/news/home/20230531005608/en/UAE%27s-Falcon-40B-World%27s-Top-Ranked-AI-Model-from-Technology-Innovation-Institute-is-Now-Royalty-Free) from the original on 2024-02-08.
200. Wu, Shijie; Irsoy, Ozan; Lu, Steven; Dabravolski, Vadim; Dredze, Mark; Gehrmann, Sebastian; Kambadur, Prabhanjan; Rosenberg, David; Mann, Gideon (March 30, 2023). "BloombergGPT: A Large Language Model for Finance". arXiv:2303.17564 (https://arxiv.org/abs/2303.17564) [cs.LG (https://arxiv.org/archive/cs.LG)].
201. Ren, Xiaozhe; Zhou, Pingyi; Meng, Xinfan; Huang, Xinjing; Wang, Yadao; Wang, Weichao; Li, Pengfei; Zhang, Xiaoda; Podolskiy, Alexander; Arshinov, Grigory; Bout, Andrey; Piontkovskaya, Irina; Wei, Jiansheng; Jiang, Xin; Su, Teng; Liu, Qun; Yao, Jun (March 19, 2023). "PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing". arXiv:2303.10845 (https://arxiv.org/abs/2303.10845) [cs.CL (https://arxiv.org/archive/cs.CL)].
202. Köpf, Andreas; Kilcher, Yannic; von Rütte, Dimitri; Anagnostidis, Sotiris; Tam, Zhi-Rui; Stevens, Keith; Barhoum, Abdullah; Duc, Nguyen Minh; Stanley, Oliver; Nagyfi, Richárd; ES, Shahul; Suri, Sameer; Glushkov, David; Dantuluri, Arnav; Maguire, Andrew (2023-04-14). "OpenAssistant Conversations – Democratizing Large Language Model Alignment". arXiv:2304.07327 (https://arxiv.org/abs/2304.07327) [cs.CL (https://arxiv.org/archive/cs.CL)].
203. Wrobel, Sharon. "Tel Aviv startup rolls out new advanced AI language model to rival OpenAI" (https://www.timesofisrael.com/ai21-labs-rolls-out-new-advanced-ai-language-model-to-rival-openai/). www.timesofisrael.com. Archived (https://web.archive.org/web/20230724191823/https://www.timesofisrael.com/ai21-labs-rolls-out-new-advanced-ai-language-model-to-rival-openai/) from the original on 2023-07-24. Retrieved 2023-07-24.
204. Wiggers, Kyle (2023-04-13). "With Bedrock, Amazon enters the generative AI race" (https://techcrunch.com/2023/04/13/with-bedrock-amazon-enters-the-generative-ai-race/). TechCrunch. Archived (https://web.archive.org/web/20230724102458/https://techcrunch.com/2023/04/13/with-bedrock-amazon-enters-the-generative-ai-race/) from the original on 2023-07-24. Retrieved 2023-07-24.
205. Elias, Jennifer (16 May 2023). "Google's newest A.I. model uses nearly five times more text data for training than its predecessor" (https://www.cnbc.com/2023/05/16/googles-palm-2-uses-nearly-five-times-more-text-data-than-predecessor.html). CNBC. Archived (https://web.archive.org/web/20230516225326/https://www.cnbc.com/2023/05/16/googles-palm-2-uses-nearly-five-times-more-text-data-than-predecessor.html) from the original on 16 May 2023. Retrieved 18 May 2023.
207. "Introducing Llama 2: The Next Generation of Our Open Source Large Language Model" (https://ai.meta.com/llama/). Meta AI. 2023. Archived (https://web.archive.org/web/20240105234629/https://ai.meta.com/llama/) from the original on 2024-01-05. Retrieved 2023-07-19.
210. Nirmal, Dinesh (2023-09-07). "Building AI for business: IBM's Granite foundation models" (https://www.ibm.com/blog/building-ai-for-business-ibms-granite-foundation-models). IBM Blog. Archived (https://web.archive.org/web/20240722083855/https://www.ibm.com/blog/building-ai-for-business-ibms-granite-foundation-models/) from the original on 2024-07-22. Retrieved 2024-08-11.
211. "Announcing Mistral 7B" (https://mistral.ai/news/announcing-mistral-7b/). Mistral. 2023. Archived (https://web.archive.org/web/20240106051047/https://mistral.ai/news/announcing-mistral-7b/) from the original on 2024-01-06. Retrieved 2023-10-06.
216. Franzen, Carl (11 December 2023). "Mistral shocks AI community as latest open source model eclipses GPT-3.5 performance" (https://venturebeat.com/ai/mistral-shocks-ai-community-as-latest-open-source-model-eclipses-gpt-3-5-performance/). VentureBeat. Archived (https://web.archive.org/web/20231211213640/https://venturebeat.com/ai/mistral-shocks-ai-community-as-latest-open-source-model-eclipses-gpt-3-5-performance/) from the original on 11 December 2023. Retrieved 12 December 2023.
219. Hughes, Alyssa (12 December 2023). "Phi-2: The surprising power of small language models" (https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/). Microsoft Research. Archived (https://web.archive.org/web/20231212232647/https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/) from the original on 12 December 2023. Retrieved 13 December 2023.
229. Llama Team, AI @ Meta (July 23, 2024). "The Llama 3 Herd of Models" (https://ai.meta.com/research/publications/the-llama-3-herd-of-models/).
Further reading
Jurafsky, Dan; Martin, James H. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (https://web.stanford.edu/~jurafsky/slp3/ed3book_jan72023.pdf), 3rd Edition draft, 2023.
Zhao, Wayne Xin; et al. (2023). "A Survey of Large Language Models". arXiv:2303.18223 (https://arxiv.org/abs/2303.18223) [cs.CL (https://arxiv.org/archive/cs.CL)].
Kaddour, Jean; et al. (2023). "Challenges and Applications of Large Language Models". arXiv:2307.10169 (https://arxiv.org/abs/2307.10169) [cs.CL (https://arxiv.org/archive/cs.CL)].
Yin, Shukang; Fu, Chaoyou; Zhao, Sirui; Li, Ke; Sun, Xing; Xu, Tong; Chen, Enhong (2023-06-01). "A Survey on Multimodal Large Language Models". arXiv:2306.13549 (https://arxiv.org/abs/2306.13549) [cs.CV (https://arxiv.org/archive/cs.CV)].
Frank, Michael C. (27 June 2023). "Baby steps in evaluating the capacities of large language models" (https://www.nature.com/articles/s44159-023-00211-x). Nature Reviews Psychology. 2 (8): 451–452. doi:10.1038/s44159-023-00211-x (https://doi.org/10.1038%2Fs44159-023-00211-x). ISSN 2731-0574 (https://search.worldcat.org/issn/2731-0574). S2CID 259713140 (https://api.semanticscholar.org/CorpusID:259713140). Retrieved 2 July 2023.