Introduction to Large Language Models
Dr. Aijun Zhang
October 2024
Recommended Texts
• Tunstall, L., von Werra, L. and Wolf, T. (2022). Natural Language Processing with Transformers. O'Reilly.
• Bishop, C. and Bishop, H. (2023). Deep Learning: Foundations and Concepts. Springer.
• Alammar, J. and Grootendorst, M. (2024). Hands-On Large Language Models. O'Reilly.
• Raschka, S. (2024). Build a Large Language Model (From Scratch). Manning.
Outline
1 LLMs - A Quick Overview
2 Attention is All You Need (2017)
Transformer Architecture
Attention Mechanism
Specialized Transformer LLMs
3 Interpreting Contextual Embeddings
4 Text Classification and Fine-Tuning
Landscape of Artificial Intelligence
Source: Raschka (2024)
Evolution of Language Models
Source: Alammar and Grootendorst (2024)
Growing Scales of Language Models
Scaling laws: increasing parameters, data quality, and compute resources generally improves LLM performance.
Interactive chart: VizSweet
Key Tasks of Large Language Models
• Text Classification: Sentiment analysis, Spam detection, Named Entity Recognition (NER), Natural Language Inference (NLI)
• Text Generation: Creative writing, Chatbots, Translation, Coding
• Summarization: News article summaries, Research paper abstracts, Document condensation
• Question Answering: Customer support queries, Educational FAQs
• Knowledge Integration: Retrieval-Augmented Generation (RAG) for up-to-date responses and evidence-based information generation
• Reasoning: Logical deduction, Mathematical problem solving, Chain-of-Thought analysis
Outline
1 LLMs - A Quick Overview
2 Attention is All You Need (2017)
Transformer Architecture
Attention Mechanism
Specialized Transformer LLMs
3 Interpreting Contextual Embeddings
4 Text Classification and Fine-Tuning
Attention is All You Need (2017)
• By Google Brain: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł. and Polosukhin, I. (NIPS 2017)
• It revolutionized neural networks by introducing the Transformer, a
highly flexible architecture enabling LLMs like BERT, GPT, and T5.
• The core innovation of self-attention allows each token to capture
global context efficiently, eliminating the need for recurrence.
• It sparked groundbreaking models across NLP and other domains, with
applications extending to vision, audio, and multimodal tasks.
Transformer Architecture
• Encoder processes input
• Decoder generates output, predicting the next token auto-regressively
• Feed-forward network (deep learning)
• Multi-head self-attention
• Masked attention for decoder
• Positional encoding
• See the PyTorch reference: torch.nn.Transformer (a minimal usage sketch follows)
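A minimal usage sketch of the built-in PyTorch module; the random tensors are stand-ins for real embedded source and target sequences:

```python
import torch
import torch.nn as nn

# Encoder-decoder Transformer as specified in Vaswani et al. (2017)
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)

src = torch.rand(10, 32, 512)  # (source length, batch size, d_model)
tgt = torch.rand(20, 32, 512)  # (target length, batch size, d_model)
out = model(src, tgt)          # decoder output at each target position
print(out.shape)               # torch.Size([20, 32, 512])
```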
Attention Mechanism
Figure: single-head attention vs. multi-head attention
Scaled Dot-Product Self-Attention
For the k-th token, self-attention outputs a weighted average of all token embeddings:

$$
y^{(k)} \leftarrow \sum_{i=1}^{N} \alpha_{ki}\, x^{(i)},
\qquad
\alpha_{ki} = \frac{\exp\!\big(x^{(k)} x^{(i)\mathsf{T}}\big)}{\sum_{j=1}^{N} \exp\!\big(x^{(k)} x^{(j)\mathsf{T}}\big)}
$$

Expressed in matrix form:

$$
Y = \mathrm{Softmax}\big[\, X X^{\mathsf{T}} \,\big]\, X
$$

$(Q, K, V)$-parameterization with weights $W^{(q)}, W^{(k)}, W^{(v)} \in \mathbb{R}^{D \times D}$:

$$
Y = \mathrm{Softmax}\big[\, X W^{(q)} \big(X W^{(k)}\big)^{\mathsf{T}} \,\big]\, X W^{(v)}
\equiv \mathrm{Softmax}\big[\, Q K^{\mathsf{T}} \,\big]\, V
$$

Scaled self-attention:

$$
Y = \mathrm{Softmax}\!\left[\frac{Q K^{\mathsf{T}}}{\sqrt{D}}\right] V
\equiv \mathrm{Attention}(Q, K, V)
$$
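A from-scratch sketch of the scaled self-attention formula above (single head, plain PyTorch tensors; the function name and random weights are illustrative):

```python
import torch
import torch.nn.functional as F

def scaled_self_attention(X, Wq, Wk, Wv):
    """Y = Softmax(Q K^T / sqrt(D)) V with Q = X Wq, K = X Wk, V = X Wv."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv               # each (N, D)
    scores = Q @ K.T / K.shape[-1] ** 0.5          # (N, N) scaled similarities
    A = F.softmax(scores, dim=-1)                  # attention weights; rows sum to 1
    return A @ V                                   # contextualized token representations

N, D = 5, 8                                        # 5 tokens, embedding dimension 8
X = torch.randn(N, D)
Wq, Wk, Wv = (torch.randn(D, D) for _ in range(3))
print(scaled_self_attention(X, Wq, Wk, Wv).shape)  # torch.Size([5, 8])
```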
Specialized Transformer LLMs
• Encoder-Only Models: BERT (Bidirectional Encoder Representations from Transformers), DistilBERT, RoBERTa, DeBERTa, etc. Ideal for understanding tasks such as text classification, sentiment analysis, and named entity recognition, with bidirectional attention capturing full context.
• Decoder-Only Models: GPT (Generative Pretrained Transformer), GPT-2, GPT-3, and beyond. Optimized for generation tasks, with unidirectional attention. Live example: ChatGPT (GPT-3.5+)
• Encoder-Decoder Models: T5 (Text-to-Text Transfer Transformer), BART. Balance input understanding with output generation, suitable for tasks like translation, summarization, and paraphrasing.
Highlight: encoder-only models (BERT) are for understanding tasks, while decoder-only models (GPT) are for generative tasks.
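To make the highlight concrete, a small sketch with the Hugging Face pipeline API (the default sentiment checkpoint is a DistilBERT model; exact outputs vary by version):

```python
from transformers import pipeline

# Encoder-only (BERT family): understanding task, e.g. sentiment classification
clf = pipeline("sentiment-analysis")  # defaults to a DistilBERT SST-2 checkpoint
print(clf("A thoroughly enjoyable film."))  # e.g. [{'label': 'POSITIVE', 'score': ...}]

# Decoder-only (GPT family): generation task
gen = pipeline("text-generation", model="gpt2")
print(gen("The Transformer architecture", max_new_tokens=20)[0]["generated_text"])
```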
Transformer Explainer: Interactive Learning of GPT-2, by Cho et al. (2024)
Live Demo: BertViz
BertViz: Visualize Attention in NLP Models (BERT, GPT2, BART, etc.)
https://github.com/jessevig/bertviz
Try it on Google Colab: BertViz Interactive Tutorial
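A minimal notebook sketch along the lines of the BertViz tutorial (model choice and example sentence are illustrative):

```python
from transformers import AutoModel, AutoTokenizer
from bertviz import head_view

name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_attentions=True)

inputs = tokenizer.encode("The cat sat on the mat", return_tensors="pt")
attention = model(inputs).attentions                 # one (batch, heads, N, N) tensor per layer
tokens = tokenizer.convert_ids_to_tokens(inputs[0])
head_view(attention, tokens)                         # interactive view in a notebook cell
```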
Outline
1 LLMs - A Quick Overview
2 Attention is All You Need (2017)
Transformer Architecture
Attention Mechanism
Specialized Transformer LLMs
3 Interpreting Contextual Embeddings
4 Text Classification and Fine-Tuning
Static and Contextual Embeddings
• Embeddings represent textual data as dense, continuous vectors in a high-dimensional space, capturing the semantic meaning of words, phrases, and documents.
• Traditional Static Embeddings: Word2Vec, GloVe, etc. Easy to compute and able to capture basic semantic relationships; however, each word has a single, fixed embedding regardless of context.
• Contextual Embeddings: BERT, GPT, etc. Generate dynamic, context-dependent embeddings for each token. BERT is most popular as it captures both the left and right context of a word in a sentence.
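To see the context-dependence directly, a small sketch comparing BERT's embeddings of the same word in two senses (the sentences and the helper name are illustrative):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_embedding(sentence, word):
    """Return the contextual embedding of `word` within `sentence`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        states = model(**enc).last_hidden_state[0]   # (num tokens, 768)
    pos = tokenizer.convert_ids_to_tokens(enc["input_ids"][0]).index(word)
    return states[pos]

v1 = word_embedding("she sat by the river bank", "bank")
v2 = word_embedding("he deposited cash at the bank", "bank")
print(torch.cosine_similarity(v1, v2, dim=0))        # well below 1: same word, different vectors
```

A static embedding such as Word2Vec would return the identical vector for "bank" in both sentences.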
Interpreting Contextual Embeddings
• Interpretability is crucial: understanding how LLMs work and make decisions fosters transparency, trust, and reliability.
• Challenges in Interpreting LLMs:
– Complexity of contextual embeddings, high dimensionality
– Black-box nature: millions/billions of parameters, highly nonlinear
– Semantic ambiguity: polysemy (words with multiple meanings) in diverse contexts
• Here’s a structured process to interpret contextual embeddings (sketched in code after this list):
1. Dimensionality reduction for effective clustering
2. Extract topics for each cluster
3. Further dimensionality reduction for 2D or 3D visualization
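A compact sketch of the process under illustrative choices (MiniLM sentence encoder, UMAP, k-means, TF-IDF; the toy documents are placeholders for a real corpus):

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from umap import UMAP

docs = [
    "A wonderful, heartfelt movie.",
    "Great acting and a moving script.",
    "The plot was dull and lifeless.",
    "I fell asleep halfway through.",
]

# Step 1: embed, then reduce dimensionality before clustering
emb = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)               # (4, 384)
low = UMAP(n_components=2, n_neighbors=2, random_state=0).fit_transform(emb)
labels = KMeans(n_clusters=2, n_init="auto", random_state=0).fit_predict(low)

# Step 2: TF-IDF keywords per cluster as cheap topic labels
vec = TfidfVectorizer(stop_words="english")
tfidf = vec.fit_transform(docs)
terms = np.array(vec.get_feature_names_out())
for c in np.unique(labels):
    rows = np.where(labels == c)[0]
    scores = np.asarray(tfidf[rows].mean(axis=0)).ravel()
    print(f"cluster {c}:", terms[scores.argsort()[-3:]])

# Step 3: `low` is already 2D, so it can be scatter-plotted directly
```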
Interpreting Contextual Embeddings
Source: Alammar and Grootendorst (2024)
Dimensionality Reduction Techniques
| Technique | Type | Strength | Weakness | Best Use |
|---|---|---|---|---|
| PCA | Linear | Captures maximum variance | Assumes linearity, less flexible | Medium data with linear relationships |
| t-SNE | Nonlinear | Preserves local structure | Computationally expensive | Small data, ideal for 2D/3D visualization |
| UMAP | Nonlinear | Balances local & global structure, scalable | Requires tuning | Clustering in large, complex datasets |
| Random Projection | Linear (Random) | Fast, scalable, preserves distances | Lower interpretability | High-dim data with simple structure |
| AE (Auto-Encoders) | Nonlinear | Learns nonlinear relationships, customizable | Requires tuning & significant data | Complex datasets with non-linear patterns |
| VAE (Variational AE) | Nonlinear (Probabilistic) | Captures data variability, generates new data | Complex training, requires tuning | When data variability and generation are important |
| MDS | Nonlinear | Flexible metrics, preserves distances | Computationally intensive | Semantic similarity in embeddings |
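For instance, two of the techniques above applied to stand-in embedding vectors (shapes are arbitrary; umap-learn is assumed installed):

```python
import numpy as np
from sklearn.decomposition import PCA
from umap import UMAP

X = np.random.rand(200, 384)                        # stand-in for 200 contextual embeddings
X_pca = PCA(n_components=2).fit_transform(X)        # linear: maximal-variance projection
X_umap = UMAP(n_components=2, random_state=0).fit_transform(X)  # nonlinear: local + global structure
print(X_pca.shape, X_umap.shape)                    # (200, 2) (200, 2)
```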
Clustering Techniques
| Technique | How It Works | Advantages | Limitations | Best Use |
|---|---|---|---|---|
| k-Means Clustering | Minimizes distances to centroids | Simple, efficient, scalable | Requires k in advance; assumes spherical clusters | When clusters are approximately spherical and equally sized |
| Agglomerative/Hierarchical Clustering | Merges closest clusters iteratively, forming a hierarchy | Captures hierarchical structure, no need for k | Computationally intensive, sensitive to noise | Data with inherent hierarchical structure |
| DBSCAN | Groups densely packed points; labels sparse points as outliers | Handles non-convex shapes, detects outliers | Sensitive to parameters, not good for varying densities | Arbitrary shapes, outlier detection, unknown clusters |
| Spectral Clustering | Uses similarity matrix and eigenvalues for clustering | Good for non-convex clusters, adaptable similarity measures | Computationally expensive, requires k | Small data with complex relationships |
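A quick contrast of two rows of the table on synthetic non-convex data (the two-moons dataset; parameter values are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)  # two crescent-shaped clusters

km = KMeans(n_clusters=2, n_init="auto", random_state=0).fit_predict(X)
db = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)            # label -1 marks outliers

# k-means splits the crescents incorrectly (spherical assumption);
# DBSCAN recovers them by density.
print(np.unique(km), np.unique(db))
```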
Topic Extraction Techniques
| Technique | Description | Use in Clusters |
|---|---|---|
| TF-IDF (Term Frequency-Inverse Document Frequency) | Identifies important terms by comparing term frequency in a document to its frequency in the corpus. High scores indicate terms important to the document but uncommon in the corpus. | Applied to each cluster to find keywords that best represent its content. |
| KeyBERT (Keyword Extraction with BERT) | A transformer-based method using BERT embeddings to extract semantically relevant keywords. | Identifies representative keywords or phrases, providing rich, contextually accurate topic descriptions for each cluster. |
| LDA (Latent Dirichlet Allocation) | A topic modeling technique that assumes documents are mixtures of topics, each with a unique word distribution. | Extracts coherent topics from clusters, revealing specific themes within broader topics. |
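A short sketch of the KeyBERT row (default sentence-transformers backbone; the document string is illustrative):

```python
from keybert import KeyBERT

doc = ("BERT embeddings capture surrounding context, enabling "
       "semantically aware keyword extraction from each cluster.")
kw_model = KeyBERT()  # loads a sentence-transformers model under the hood
print(kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), top_n=3))
```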
BERTopic Pipeline for Interpretable Topic Modeling
https://github.com/MaartenGr/BERTopic
Source: Alammar and Grootendorst (2024)
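With BERTopic's defaults, the whole pipeline collapses to a few lines (the dataset choice mirrors the demo on the next slide):

```python
from bertopic import BERTopic
from datasets import load_dataset

docs = load_dataset("rotten_tomatoes", split="train")["text"]

# Defaults: sentence-transformer embeddings -> UMAP -> HDBSCAN -> c-TF-IDF topics
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())  # topic sizes and top keywords
```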
Live Demo: BERTopic
Try it on Google Colab: BERTopic Demo with Rotten Tomatoes Dataset
Outline
1 LLMs - A Quick Overview
2 Attention is All You Need (2017)
Transformer Architecture
Attention Mechanism
Specialized Transformer LLMs
3 Interpreting Contextual Embeddings
4 Text Classification and Fine-Tuning
Text Classification with Pretrained Transformers
Source: Tunstall et al. (2022)
1. Select a pretrained transformer model, e.g. SBERT or DistilBERT;
2. Prepare the labeled text data, split into training and validation sets;
3. Add a classifier layer with sigmoid/softmax activation;
4. Freeze the transformer model parameters, train only the classifier;
5. Evaluate performance and perform outcome analysis.
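A minimal sketch of steps 1-4 using a frozen DistilBERT as feature extractor with a logistic-regression head (toy data; in practice use a real train/validation split):

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = AutoModel.from_pretrained("distilbert-base-uncased")
encoder.eval()  # step 4: the transformer parameters stay frozen

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state
    return hidden[:, 0].numpy()  # first-token embedding as the sentence feature

texts = ["great movie", "loved every minute", "terrible film", "waste of time"]
labels = [1, 1, 0, 0]
clf = LogisticRegression().fit(embed(texts), labels)  # only the head is trained
print(clf.predict(embed(["an absolute delight"])))    # expect [1]
```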
Fine-Tuning Transformers for Classification
Source: Tunstall et al. (2022)
• Pros: Fine-tuning the entire model adapts fully to the task, yielding
higher accuracy and flexibility.
• Cons: Increased computational demands, potential risk of overfitting.
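A minimal full fine-tuning sketch with the Hugging Face Trainer (hyperparameters are illustrative, not tuned):

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

ds = load_dataset("rotten_tomatoes")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
ds = ds.map(lambda b: tokenizer(b["text"], truncation=True, padding="max_length"),
            batched=True)

# All transformer weights are updated, not just the classifier head
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

args = TrainingArguments(output_dir="distilbert-rt", num_train_epochs=2,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args,
        train_dataset=ds["train"], eval_dataset=ds["validation"]).train()
```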
Live Demo: DistilBERT
• DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base.
• Check the Hugging Face Transformers page: DistilBERT
Table: DistilBERT Classification on Rotten Tomatoes Data

| | DistilBERT-raw | DistilBERT-finetuned |
|---|---|---|
| train-AUC | 0.918097 | 0.980089 |
| test-AUC | 0.882378 | 0.911010 |
Try it on Google Colab: Text Classification using DistilBERT
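For reference, a test AUC of the kind shown above could be computed like this (using an off-the-shelf SST-2-finetuned DistilBERT checkpoint, not the exact models behind the table):

```python
from datasets import load_dataset
from sklearn.metrics import roc_auc_score
from transformers import pipeline

test = load_dataset("rotten_tomatoes", split="test")
clf = pipeline("text-classification",
               model="distilbert-base-uncased-finetuned-sst-2-english",
               top_k=None)  # return scores for both classes

# Probability of the positive class for each review
scores = [{d["label"]: d["score"] for d in out}["POSITIVE"]
          for out in clf(test["text"], truncation=True)]
print("test-AUC:", roc_auc_score(test["label"], scores))
```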
Thank you!
https://www.linkedin.com/in/ajzhang/