paniit-demystifying-llms
Subbarao Kambhampati
School of Computing & AI
Agenda for today
• Trends in AI Technology leading to LLMs
• LLM basics—auto-regressive training of n-gram models
• LLMs as doing Approximate Retrieval
• Hallucinations—always or sometimes?
• Style vs. Content/Form vs. Factuality
• Quest to improve Factuality
• Prompt engineering; Fine-tuning; RAGging
• Data Quality; Synthetic Data
• LLMs as approximate knowledge sources
• ..and resurgence of (approximate) knowledge-based systems
• Are LLMs AI-Complete?
• Planning/Reasoning
• Societal Impacts
Trends in AI Technology-1
From Deep & Narrow to Broad & Shallow
• AI systems used to have deep expertise in narrow domains
  • The old "expert systems"; Deep Blue for Chess; Alpha Go for Go; Alpha Fold for Protein Folding, etc.
• Recent trend is to develop systems with broad expertise. But they tend to be shallow in their understanding
  • Large Language Models, Diffusion Models
• (Thinking in terms of Broad & Shallow vs. Deep & Narrow is more instructive than talking of AGI vs. AI..)
• The implied guarantees are different..
Trends in AI Technology-2
From Discriminative Classification to Generative Imagination
• AI systems used to focus on "identification" and "classification"
  • Is this a picture of a dog? Is this an x-ray of a malignant tumor? Is this a spam mail?
  • P(dog|Picture); P(tumor|x-ray); P(Spam|text)
• Recent trend is to learn the "distribution of the objects"
  • Draw me a picture of a dog. Write me a spam mail.
  • Learning P(tumor, x-ray); P(Spam, text)
LLMs ~ Broad & Shallow Generative Systems
for Language (or any other sequential data)
Scope of Today’s Talk
• We will focus mostly on auto-regressively
trained Large Language models (LLMs)
• LLMs are one part of the Generative AI
paradigm of learning distribution of
language, vision and audio
Training LLMs
..but the count table is Ginormous! (and is VERY sparse)
• With an n-gram model, you need to keep track of the conditional distributions for (n-1)-sized prefixes.
• With a vocabulary size |V| (~50,000), there are |V|^(n-1) different prefixes!!
  • Easy for unigram (1 prefix), bigram (|V| prefixes) and trigram (|V|^2 prefixes)
  • For ChatGPT's 3001-gram model, with a 50,000-word vocabulary, we are looking at a whopping (50,000)^3000 conditional distributions
  • (and most entries will be zero—as the chance of seeing the same 3000-word sequence again is vanishingly small!)
• What LLMs do is essentially compress/approximate this ginormous count table with a function
  • That is, while high capacity (176 billion weights!), it is still vanishingly small compared to the ginormous count table ((50,000)^3000 >> 176 billion or a trillion!)
  • ..and oh by the way, the compressed function winds up having fewer zeros
  • It approximates both the non-zero counts and the zero counts, so.. GENERALIZATION!!!
  • In essence the function learns to "abstract" and "cluster" over "similar" sequences (see the sketch below)
• Aside: Transformers are a (not particularly principled) parallelization of recurrent neural networks (graphic by James Campbell)
So ChatGPT is just completing your prompt by
repeatedly predicting the next word given the
previous 3000 words
• But, the function it learns to predict the next word is a very high capacity
one (with 175 billion parameters for ChatGPT and over a trillion for GPT4)
• This function is learned by analyzing ~500 GB of text
• The learning phase is very time consuming (and is feasible only because of the
extreme computational power utilized)
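Schematically, that completion loop can be sketched as below; `next_token_probs` is a hypothetical stand-in for the learned high-capacity function, not an actual API.

```python
# Autoregressive completion: repeatedly predict the next token from the
# (at most) `window` most recent tokens, appending each prediction to the context.
import random

def complete(prompt_tokens, next_token_probs, max_new=50, window=3000):
    tokens = list(prompt_tokens)
    for _ in range(max_new):
        context = tokens[-window:]            # fixed-size context window
        probs = next_token_probs(context)     # dict: token -> probability
        nxt = random.choices(list(probs), weights=list(probs.values()))[0]
        if nxt == "<eos>":                    # hypothetical end-of-sequence token
            break
        tokens.append(nxt)                    # the model's own output becomes context
    return tokens
```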
Answer: MAGIC..!
Some possible factors:
Style vs. Content
Form vs. Factuality
• LLMs (and Generative AI in general)
capture the distribution of the data
they are trained on
• Style is a distributional property
• ..and LLMs are able to learn this (they
have been called the vibe machines..)
• Correctness/factuality is an instance
level property
• ..LLMs can’t guarantee this
• Civilizationally, we had always thought
style is harder than content
• And even assumed that good style implies
good content!
• LLMs (and GenAI in general) turn this
intuition on its head!
Agenda for today
• Trends in AI Technology leading to LLMs
• LLM basics—auto-regressive training of n-gram models
• LLMs as doing Approximate Retrieval
• Hallucinations—always or sometimes?
• Style vs. Content/Form vs. Factuality
• Quest to improve Factuality
• Prompt engineering; Fine-tuning; RAGging
• Data Quality; Synthetic Data
• LLMs as approximate knowledge sources
• ..and resurgence of (approximate) knowledge-based systems
• Are LLMs AI-Complete?
• Planning/Reasoning
• Societal Impacts
Standard ways to improve LLM responses
Prompting ("in-context learning")
(doesn't change LLM parameters)
• If you don't like what an LLM is giving as an answer to your prompt, you can add additional prompts
• The LLM will then take the new context window (including what it said and what you said) to predict the next sequence of words/tokens
• Every word in the context window—including the ones the LLM added—is changing the conditional distribution with which the next token is being predicted.
  • Note that all these conditional distributions have been precomputed!
  • Nothing inside the LLM is changing because of your prompts
• The undeniable attraction of "prompting" is that it is natural for us! It is sort of how we interact with each other!
• There is a whole cottage industry on the "art" of good prompting
  • "How to ask LLMs nicely?"
• If you give k examples of good answers as part of the prompt, it is called "k-shot in-context learning" (see the sketch below)
Fine Tuning
(changes LLM parameters)
• Fine tune the parameters of a pre-trained LLM by making it look specifically at the data of interest to you
  • Give it lots of plan sequences, so it learns better conditional distributions on predicting plans
  • Use labeled <prompt, response> pairs to make its responses "more palatable"
  • Use supervised techniques or RL techniques to improve parameters to be more consistent with the fine-tuning data
  • [There is also evidence that big companies use more "polished"/"annotated" data during the fine-tuning phase—including paying humans to generate data adorned with derivational information—which is often not included in web text]
• Because fine tuning is changing the parameters of the LLM, while its performance on the specific task (be a better planner, be less offensive) may improve, it also changes its performance on other tasks in unpredictable ways
  • Microsoft claims that GPT4 had more AGI sparks before it was lobotomized with RLHF to be less offensive!
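As an illustration of the prompting side, here is a minimal sketch of building a k-shot prompt; `llm_complete` is a hypothetical completion function, not a specific vendor API.

```python
# k-shot in-context learning: prepend k worked examples to the query.
# No LLM parameter changes -- only the context window changes.
def k_shot_prompt(examples, query):
    """examples: list of (question, good_answer) pairs shown to the model."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {query}\nA:"

examples = [("2+2?", "4"), ("3+5?", "8")]
prompt = k_shot_prompt(examples, "7+6?")
# answer = llm_complete(prompt)  # conditions the precomputed distributions on the shots
```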
Back-Prompting by Humans
(..and the Clever Hans peril..)
• Humans doing the verification & giving helpful prompts to the LLM
• Okay when the humans know the domain and can
correct the plan (with some guarantees)
• Okay for "this essay looks good enough" kind of critiquing
• But for planning, with end users not aware of the domain
physics, the plans that humans are happy with may still not
be actually executable
• When humans know the correct answer (plan) there
is also the very significant possibility of Clever Hans
effect
• Humans unwittingly/unknowingly/non-deliberately giving
important hints
RAGging on LLMs to Improve their Factuality
• Given LLMs don't stick to the script, you might want to:
  • Send the user prompt/query to Google (or some IR system)
  • Instead of old-style keyword search, do search with embeddings ("Vector DB")
  • Take the top result(s) and have the LLM summarize them
  • Or alternately, just add those results to the context window to provide more factual background for further queries..
• Not that different from what LLM "search engines" like Perplexity do..
• Pure LLMs do a kind of "search by imagination.." (a pipeline sketch follows below)
[Diagram: Prompt → LLM ↔ IR (Google)]
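A minimal sketch of that RAG pipeline, assuming hypothetical `embed` (text to vector) and `llm_complete` (prompt to text) functions supplied from outside:

```python
# Retrieval-Augmented Generation: embedding search first, then stuff the
# top results into the LLM's context window as factual background.
import numpy as np

def retrieve(query, docs, embed, k=3):
    qv = embed(query)
    scores = [float(np.dot(qv, embed(d))) for d in docs]  # cosine, if vectors are normalized
    top = sorted(range(len(docs)), key=lambda i: -scores[i])[:k]
    return [docs[i] for i in top]

def rag_answer(query, docs, embed, llm_complete):
    context = "\n".join(retrieve(query, docs, embed))
    return llm_complete(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
```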
Impact of Training Data Type/Quality on
LLM Performance
• Quality of training data matters in improving the quality of LLM completions
  • This is why most modern LLMs are trained without 4Chan (..and upsample good quality sources like Wikipedia, NYTimes etc.; a mixing sketch follows below)
• When the data is not readily available, getting it can
be quite costly
• Web contains “correct data” but not as much “corrections
data”
• Getting derivation/thought behind the data can be quite
expensive.
• But even sticking just to purely factual data for
training still doesn’t eliminate hallucinations
• Remember LLMs are not retrieving stored documents, but
completing the prompt on the fly (approximate retrieval)
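One common way the quality point shows up in practice is in the training mixture; the sources and weights below are purely illustrative, not an actual LLM training recipe.

```python
# Upsampling good-quality sources: draw training documents source-by-source
# with weights, so e.g. Wikipedia is seen far more often than its raw share.
import random

mixture = {
    "wikipedia": 5.0,   # upsampled high-quality source
    "news":      3.0,
    "web_crawl": 1.0,
    # low-quality sources (e.g. 4Chan) are simply left out of the mix
}

def sample_source(mixture):
    sources, weights = zip(*mixture.items())
    return random.choices(sources, weights=weights)[0]

print(sample_source(mixture))  # each training example's source is drawn this way
```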
Can’t LLMs “Self-Train”
on Synthetic Data?
• If getting quality data is hard, one
alternative would be generating
“synthetic data” and training LLMs on it
• The question is how is synthetic data
generated?
• Idea 1: LLMs generate their own data, test it
for correctness, and use it in further
training
• Idea 2: Use external solvers to generate the
synthetic training data
• Let me compile an outside System 2 to my
System 1
Synthetic Data Conundrums
Solving Blocksworld: GoFAI vs LLaMAI
GOFAI
• Get the domain model
• Get a combinatorial search planner
• Have the planner solve the problem
LLaMAI (sketched in code below)
• Get the domain model
• Get a combinatorial search planner
• Make a trillion Blocksworld problems
• Make the planner solve them all
• Finetune GPT4 with the problems and solutions
  • (Alternately, index the trillion solutions in a vector DB for later RAG)
• Have the finetuned/RAG'ed GPT4 guess the solution for the given problem
  • (Ensure the correctness of the guess with an external validator/simulator working LLM-Modulo)
• If, by luck, it guesses right, write a NeurIPS/ICLR paper about the effectiveness of synthetic data
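A minimal code sketch of the LLaMAI recipe above; every callable (`make_problem`, `planner_solve`, `finetune`, `validate`) is a hypothetical stand-in supplied from outside.

```python
# Synthetic data via an external solver, checked LLM-Modulo style.
def llamai_pipeline(make_problem, planner_solve, finetune, base_llm, validate,
                    n=10**6):  # "a trillion" is aspirational; use what you can afford
    # 1. Generate problems and have the combinatorial planner solve them all.
    data = [(p, planner_solve(p)) for p in (make_problem() for _ in range(n))]
    # 2. Finetune the LLM on <problem, solution> pairs
    #    (or, alternately, index `data` in a vector DB for later RAG).
    model = finetune(base_llm, data)

    def solve(problem):
        guess = model(problem)                 # the finetuned LLM guesses a plan
        # LLM-Modulo: never trust the guess; check it with an external validator.
        return guess if validate(problem, guess) else None

    return solve
```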
LLMs as Approximate Knowledge Sources
Planning
LLMs as Behavior Critics to catch undesirable robot behaviors
Can LLMs capture human preferences in embodied AI tasks?
(To be presented at ICML 2024)