Introduction to Large Language Models
Dr. Aijun Zhang
October 2024
Recommended Texts
• Tunstall, L., von Werra, L. and Wolf, T. (2022). Natural Language Processing with Transformers. O'Reilly.
• Bishop, C. and Bishop, H. (2023). Deep Learning: Foundations and Concepts. Springer.
• Alammar, J. and Grootendorst, M. (2024). Hands-On Large Language Models. O'Reilly.
• Raschka, S. (2024). Build a Large Language Model (From Scratch). Manning.
Outline
1 LLMs - A Quick Overview
2 Attention is All You Need (2017)
Transformer Architecture
Attention Mechanism
Specialized Transformer LLMs
3 Interpreting Contextual Embeddings
4 Text Classification and Fine-Tuning
Landscape of Artificial Intelligence
Source: Raschka (2024)
Evolution of Language Models
Source: Alammar and Grootendorst (2024)
Growing Scales of Language Models
Scaling laws: increasing parameters, data quality, and compute resources generally improves LLM performance.
Interactive chart: VizSweet
Key Tasks of Large Language Models
• Text Classification: Sentiment analysis, Spam detection, Named Entity Recognition (NER), Natural Language Inference (NLI)
• Text Generation: Creative writing, Chatbots, Translation, Coding
• Summarization: News article summaries, Research paper abstracts, Document condensation
• Question Answering: Customer support queries, Educational FAQs
• Knowledge Integration: Retrieval-Augmented Generation (RAG) for up-to-date responses and evidence-based information generation
• Reasoning: Logical deduction, Mathematical problem solving, Chain-of-Thought analysis
Outline
1 LLMs - A Quick Overview
2 Attention is All You Need (2017)
Transformer Architecture
Attention Mechanism
Specialized Transformer LLMs
3 Interpreting Contextual Embeddings
4 Text Classification and Fine-Tuning
Attention is All You Need (2017)
• By Google Brain: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł. and Polosukhin, I. (NIPS 2017)
• It revolutionized neural networks by introducing the Transformer, a
highly flexible architecture enabling LLMs like BERT, GPT, and T5.
• The core innovation of self-attention allows each token to capture
global context efficiently, eliminating the need for recurrence.
• It sparked groundbreaking models across NLP and other domains, with
applications extending to vision, audio, and multimodal tasks.
Transformer Architecture
• Encoder processes input
• Decoder generates output, predicting the next token auto-regressively
• Feed-forward network (deep learning)
• Multi-head self-attention
• Masked attention for decoder
• Positional encoding
• See the PyTorch reference: torch.nn.Transformer (a minimal usage sketch follows)
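A minimal usage sketch of the built-in PyTorch module; the random tensors are stand-ins for real embedded source and target sequences:

```python
import torch
import torch.nn as nn

# Encoder-decoder Transformer as specified in Vaswani et al. (2017)
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)

src = torch.rand(10, 32, 512)  # (source length, batch size, d_model)
tgt = torch.rand(20, 32, 512)  # (target length, batch size, d_model)
out = model(src, tgt)          # decoder output at each target position
print(out.shape)               # torch.Size([20, 32, 512])
```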
Attention Mechanism
Figure: single-head attention vs. multi-head attention
Scaled Dot-Product Self-Attention
For the k-th token, self-attention outputs a weighted average of all token embeddings:

$$
y^{(k)} \leftarrow \sum_{i=1}^{N} \alpha_{ki}\, x^{(i)},
\qquad
\alpha_{ki} = \frac{\exp\!\big(x^{(k)} x^{(i)\mathsf{T}}\big)}{\sum_{j=1}^{N} \exp\!\big(x^{(k)} x^{(j)\mathsf{T}}\big)}
$$

Expressed in matrix form:

$$
Y = \mathrm{Softmax}\big[\, X X^{\mathsf{T}} \,\big]\, X
$$

$(Q, K, V)$-parameterization with weights $W^{(q)}, W^{(k)}, W^{(v)} \in \mathbb{R}^{D \times D}$:

$$
Y = \mathrm{Softmax}\big[\, X W^{(q)} \big(X W^{(k)}\big)^{\mathsf{T}} \,\big]\, X W^{(v)}
\equiv \mathrm{Softmax}\big[\, Q K^{\mathsf{T}} \,\big]\, V
$$

Scaled self-attention:

$$
Y = \mathrm{Softmax}\!\left[\frac{Q K^{\mathsf{T}}}{\sqrt{D}}\right] V
\equiv \mathrm{Attention}(Q, K, V)
$$
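A from-scratch sketch of the scaled self-attention formula above (single head, plain PyTorch tensors; the function name and random weights are illustrative):

```python
import torch
import torch.nn.functional as F

def scaled_self_attention(X, Wq, Wk, Wv):
    """Y = Softmax(Q K^T / sqrt(D)) V with Q = X Wq, K = X Wk, V = X Wv."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv               # each (N, D)
    scores = Q @ K.T / K.shape[-1] ** 0.5          # (N, N) scaled similarities
    A = F.softmax(scores, dim=-1)                  # attention weights; rows sum to 1
    return A @ V                                   # contextualized token representations

N, D = 5, 8                                        # 5 tokens, embedding dimension 8
X = torch.randn(N, D)
Wq, Wk, Wv = (torch.randn(D, D) for _ in range(3))
print(scaled_self_attention(X, Wq, Wk, Wv).shape)  # torch.Size([5, 8])
```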
Specialized Transformer LLMs
• Encoder-Only Models: BERT (Bidirectional Encoder Representations from Transformers), DistilBERT, RoBERTa, DeBERTa, etc. Ideal for understanding tasks such as text classification, sentiment analysis, and named entity recognition, with bidirectional attention capturing full context.
• Decoder-Only Models: GPT (Generative Pretrained Transformer), GPT-2, GPT-3, and beyond. Optimized for generation tasks, with unidirectional attention. Live example: ChatGPT (GPT-3.5+)
• Encoder-Decoder Models: T5 (Text-to-Text Transfer Transformer), BART. Balance input understanding with output generation, suitable for tasks like translation, summarization, and paraphrasing.
Highlight: encoder-only models (BERT) are for understanding tasks, while decoder-only models (GPT) are for generative tasks.
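To make the highlight concrete, a small sketch with the Hugging Face pipeline API (the default sentiment checkpoint is a DistilBERT model; exact outputs vary by version):

```python
from transformers import pipeline

# Encoder-only (BERT family): understanding task, e.g. sentiment classification
clf = pipeline("sentiment-analysis")  # defaults to a DistilBERT SST-2 checkpoint
print(clf("A thoroughly enjoyable film."))  # e.g. [{'label': 'POSITIVE', 'score': ...}]

# Decoder-only (GPT family): generation task
gen = pipeline("text-generation", model="gpt2")
print(gen("The Transformer architecture", max_new_tokens=20)[0]["generated_text"])
```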
Transformer Explainer: Interactive Learning of GPT-2, by Cho et al. (2024)
Live Demo: BertViz
BertViz: Visualize Attention in NLP Models (BERT, GPT2, BART, etc.)
https://github.com/jessevig/bertviz
Try it on Google Colab: BertViz Interactive Tutorial
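A minimal notebook sketch along the lines of the BertViz tutorial (model choice and example sentence are illustrative):

```python
from transformers import AutoModel, AutoTokenizer
from bertviz import head_view

name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_attentions=True)

inputs = tokenizer.encode("The cat sat on the mat", return_tensors="pt")
attention = model(inputs).attentions                 # one (batch, heads, N, N) tensor per layer
tokens = tokenizer.convert_ids_to_tokens(inputs[0])
head_view(attention, tokens)                         # interactive view in a notebook cell
```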
Outline
1 LLMs - A Quick Overview
2 Attention is All You Need (2017)
Transformer Architecture
Attention Mechanism
Specialized Transformer LLMs
3 Interpreting Contextual Embeddings
4 Text Classification and Fine-Tuning
Static and Contextual Embeddings
• Embeddings represent textual data as dense, continuous vectors in a high-dimensional space, capturing the semantic meaning of words, phrases, and documents.
• Traditional Static Embeddings: Word2Vec, GloVe, etc. Easy to compute and able to capture basic semantic relationships; however, each word has a single, fixed embedding regardless of context.
• Contextual Embeddings: BERT, GPT, etc. Generate dynamic, context-dependent embeddings for each token. BERT is most popular as it captures both the left and right context of a word in a sentence.
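To see the context-dependence directly, a small sketch comparing BERT's embeddings of the same word in two senses (the sentences and the helper name are illustrative):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_embedding(sentence, word):
    """Return the contextual embedding of `word` within `sentence`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        states = model(**enc).last_hidden_state[0]   # (num tokens, 768)
    pos = tokenizer.convert_ids_to_tokens(enc["input_ids"][0]).index(word)
    return states[pos]

v1 = word_embedding("she sat by the river bank", "bank")
v2 = word_embedding("he deposited cash at the bank", "bank")
print(torch.cosine_similarity(v1, v2, dim=0))        # well below 1: same word, different vectors
```

A static embedding such as Word2Vec would return the identical vector for "bank" in both sentences.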
Interpreting Contextual Embeddings
• Interpretability is crucial: understanding how LLMs work and make decisions fosters transparency, trust, and reliability.
• Challenges in Interpreting LLMs:
– Complexity of contextual embeddings, high dimensionality
– Black-box nature: millions/billions of parameters, highly nonlinear
– Semantic ambiguity: polysemy (words with multiple meanings) in diverse contexts
• Here’s a structured process to interpret contextual embeddings (sketched in code after this list):
1. Dimensionality reduction for effective clustering
2. Extract topics for each cluster
3. Further dimensionality reduction for 2D or 3D visualization
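A compact sketch of the process under illustrative choices (MiniLM sentence encoder, UMAP, k-means, TF-IDF; the toy documents are placeholders for a real corpus):

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from umap import UMAP

docs = [
    "A wonderful, heartfelt movie.",
    "Great acting and a moving script.",
    "The plot was dull and lifeless.",
    "I fell asleep halfway through.",
]

# Step 1: embed, then reduce dimensionality before clustering
emb = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)               # (4, 384)
low = UMAP(n_components=2, n_neighbors=2, random_state=0).fit_transform(emb)
labels = KMeans(n_clusters=2, n_init="auto", random_state=0).fit_predict(low)

# Step 2: TF-IDF keywords per cluster as cheap topic labels
vec = TfidfVectorizer(stop_words="english")
tfidf = vec.fit_transform(docs)
terms = np.array(vec.get_feature_names_out())
for c in np.unique(labels):
    rows = np.where(labels == c)[0]
    scores = np.asarray(tfidf[rows].mean(axis=0)).ravel()
    print(f"cluster {c}:", terms[scores.argsort()[-3:]])

# Step 3: `low` is already 2D, so it can be scatter-plotted directly
```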
Interpreting Contextual Embeddings
Source: Alammar and Grootendorst (2024)
Dimensionality Reduction Techniques
| Technique | Type | Strength | Weakness | Best Use |
|---|---|---|---|---|
| PCA | Linear | Captures maximum variance | Assumes linearity, less flexible | Medium data with linear relationships |
| t-SNE | Nonlinear | Preserves local structure | Computationally expensive | Small data, ideal for 2D/3D visualization |
| UMAP | Nonlinear | Balances local & global structure, scalable | Requires tuning | Clustering in large, complex datasets |
| Random Projection | Linear (Random) | Fast, scalable, preserves distances | Lower interpretability | High-dim data with simple structure |
| AE (Auto-Encoders) | Nonlinear | Learns nonlinear relationships, customizable | Requires tuning & significant data | Complex datasets with non-linear patterns |
| VAE (Variational AE) | Nonlinear (Probabilistic) | Captures data variability, generates new data | Complex training, requires tuning | When data variability and generation are important |
| MDS | Nonlinear | Flexible metrics, preserves distances | Computationally intensive | Semantic similarity in embeddings |
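For instance, two of the techniques above applied to stand-in embedding vectors (shapes are arbitrary; umap-learn is assumed installed):

```python
import numpy as np
from sklearn.decomposition import PCA
from umap import UMAP

X = np.random.rand(200, 384)                        # stand-in for 200 contextual embeddings
X_pca = PCA(n_components=2).fit_transform(X)        # linear: maximal-variance projection
X_umap = UMAP(n_components=2, random_state=0).fit_transform(X)  # nonlinear: local + global structure
print(X_pca.shape, X_umap.shape)                    # (200, 2) (200, 2)
```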
Clustering Techniques
| Technique | How It Works | Advantages | Limitations | Best Use |
|---|---|---|---|---|
| k-Means Clustering | Minimizes distances to centroids | Simple, efficient, scalable | Requires k in advance; assumes spherical clusters | When clusters are approximately spherical and equally sized |
| Agglomerative/Hierarchical Clustering | Merges closest clusters iteratively, forming a hierarchy | Captures hierarchical structure, no need for k | Computationally intensive, sensitive to noise | Data with inherent hierarchical structure |
| DBSCAN | Groups densely packed points; labels sparse points as outliers | Handles non-convex shapes, detects outliers | Sensitive to parameters, not good for varying densities | Arbitrary shapes, outlier detection, unknown clusters |
| Spectral Clustering | Uses similarity matrix and eigenvalues for clustering | Good for non-convex clusters, adaptable similarity measures | Computationally expensive, requires k | Small data with complex relationships |
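A quick contrast of two rows of the table on synthetic non-convex data (the two-moons dataset; parameter values are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)  # two crescent-shaped clusters

km = KMeans(n_clusters=2, n_init="auto", random_state=0).fit_predict(X)
db = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)            # label -1 marks outliers

# k-means splits the crescents incorrectly (spherical assumption);
# DBSCAN recovers them by density.
print(np.unique(km), np.unique(db))
```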
Topic Extraction Techniques
| Technique | Description | Use in Clusters |
|---|---|---|
| TF-IDF (Term Frequency-Inverse Document Frequency) | Identifies important terms by comparing term frequency in a document to its frequency in the corpus. High scores indicate terms important to the document but uncommon in the corpus. | Applied to each cluster to find keywords that best represent its content. |
| KeyBERT (Keyword Extraction with BERT) | A transformer-based method using BERT embeddings to extract semantically relevant keywords. | Identifies representative keywords or phrases, providing rich, contextually accurate topic descriptions for each cluster. |
| LDA (Latent Dirichlet Allocation) | A topic modeling technique that assumes documents are mixtures of topics, each with a unique word distribution. | Extracts coherent topics from clusters, revealing specific themes within broader topics. |
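A short sketch of the KeyBERT row (default sentence-transformers backbone; the document string is illustrative):

```python
from keybert import KeyBERT

doc = ("BERT embeddings capture surrounding context, enabling "
       "semantically aware keyword extraction from each cluster.")
kw_model = KeyBERT()  # loads a sentence-transformers model under the hood
print(kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), top_n=3))
```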
BERTopic Pipeline for Interpretable Topic Modeling
https://github.com/MaartenGr/BERTopic
Source: Alammar and Grootendorst (2024)
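With BERTopic's defaults, the whole pipeline collapses to a few lines (the dataset choice mirrors the demo on the next slide):

```python
from bertopic import BERTopic
from datasets import load_dataset

docs = load_dataset("rotten_tomatoes", split="train")["text"]

# Defaults: sentence-transformer embeddings -> UMAP -> HDBSCAN -> c-TF-IDF topics
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())  # topic sizes and top keywords
```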
Live Demo: BERTopic
Try it on Google Colab: BERTopic Demo with Rotten Tomatoes Dataset
Outline
1 LLMs - A Quick Overview
2 Attention is All You Need (2017)
Transformer Architecture
Attention Mechanism
Specialized Transformer LLMs
3 Interpreting Contextual Embeddings
4 Text Classification and Fine-Tuning
Text Classification with Pretrained Transformers
Source: Tunstall et al. (2022)
1. Select a pretrained transformer model, e.g. SBERT or DistilBERT;
2. Prepare the labeled text data, split into training and validation sets;
3. Add a classifier layer with sigmoid/softmax activation;
4. Freeze the transformer model parameters, train only the classifier;
5. Evaluate performance and perform outcome analysis.
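A minimal sketch of steps 1-4 using a frozen DistilBERT as feature extractor with a logistic-regression head (toy data; in practice use a real train/validation split):

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = AutoModel.from_pretrained("distilbert-base-uncased")
encoder.eval()  # step 4: the transformer parameters stay frozen

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state
    return hidden[:, 0].numpy()  # first-token embedding as the sentence feature

texts = ["great movie", "loved every minute", "terrible film", "waste of time"]
labels = [1, 1, 0, 0]
clf = LogisticRegression().fit(embed(texts), labels)  # only the head is trained
print(clf.predict(embed(["an absolute delight"])))    # expect [1]
```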
Fine-Tuning Transformers for Classification
Source: Tunstall et al. (2022)
• Pros: Fine-tuning the entire model adapts fully to the task, yielding
higher accuracy and flexibility.
• Cons: Increased computational demands, potential risk of overfitting.
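A minimal full fine-tuning sketch with the Hugging Face Trainer (hyperparameters are illustrative, not tuned):

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

ds = load_dataset("rotten_tomatoes")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
ds = ds.map(lambda b: tokenizer(b["text"], truncation=True, padding="max_length"),
            batched=True)

# All transformer weights are updated, not just the classifier head
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

args = TrainingArguments(output_dir="distilbert-rt", num_train_epochs=2,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args,
        train_dataset=ds["train"], eval_dataset=ds["validation"]).train()
```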
Live Demo: DistilBERT
• DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base.
• Check the Hugging Face Transformers page: DistilBERT
Table: DistilBERT Classification on Rotten Tomatoes Data

| | DistilBERT-raw | DistilBERT-finetuned |
|---|---|---|
| train-AUC | 0.918097 | 0.980089 |
| test-AUC | 0.882378 | 0.911010 |
Try it on Google Colab: Text Classification using DistilBERT
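For reference, a test AUC of the kind shown above could be computed like this (using an off-the-shelf SST-2-finetuned DistilBERT checkpoint, not the exact models behind the table):

```python
from datasets import load_dataset
from sklearn.metrics import roc_auc_score
from transformers import pipeline

test = load_dataset("rotten_tomatoes", split="test")
clf = pipeline("text-classification",
               model="distilbert-base-uncased-finetuned-sst-2-english",
               top_k=None)  # return scores for both classes

# Probability of the positive class for each review
scores = [{d["label"]: d["score"] for d in out}["POSITIVE"]
          for out in clf(test["text"], truncation=True)]
print("test-AUC:", roc_auc_score(test["label"], scores))
```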
Thank you!
https://www.linkedin.com/in/ajzhang/