Transformers For Natural Language Processing and Computer Vision
Professor
Institute of Information Management, National Taipei University
https://web.ntpu.edu.tw/~myday
1
2025-02-25
Syllabus
Week Date Subject/Topics
1 2025/02/18 Introduction to Generative AI Innovative Applications
2 2025/02/25 Transformers for Natural Language Processing and Computer Vision
3 2025/03/04 Large Language Models (LLMs), NVIDIA Building RAG Agents with LLMs Part I
4 2025/03/11 Case Study on Generative AI Innovative Applications I
5 2025/03/18 NVIDIA Building RAG Agents with LLMs Part II
6 2025/03/25 NVIDIA Building RAG Agents with LLMs Part III
2
Syllabus
Week Date Subject/Topics
7 2025/04/01 Self-Learning
8 2025/04/08 Midterm Project Report
9 2025/04/15 Generative AI for Multimodal Information Generation
10 2025/04/22 NVIDIA Generative AI with Diffusion Models Part I
11 2025/04/29 NVIDIA Generative AI with Diffusion Models Part II
12 2025/05/06 Case Study on Generative AI Innovative Applications II
3
Syllabus
Week Date Subject/Topics
13 2025/05/13 NVIDIA Generative AI with Diffusion Models Part III
14 2025/05/20 AI Agents and Large Multimodal Agents (LMAs)
15 2025/05/27 Final Project Report I
16 2025/06/03 Final Project Report II
4
Transformers for
Natural Language Processing
and Computer Vision
5
Denis Rothman (2024),
Transformers for Natural Language Processing and Computer Vision:
Explore Generative AI and Large Language Models with Hugging Face, ChatGPT, GPT-4V, and DALL-E 3,
3rd Edition, Packt Publishing
Source: https://www.amazon.com/Transformers-Natural-Language-Processing-Computer/dp/1805128728 6
Denis Rothman (2024),
Transformers for Natural Language Processing and Computer Vision:
Explore Generative AI and Large Language Models with Hugging Face, ChatGPT, GPT-4V, and DALL-E 3,
3rd Edition, Packt Publishing
1. What Are Transformers?
2. Getting Started with the Architecture of the Transformer Model
3. Emergent vs Downstream Tasks: The Unseen Depths of Transformers
4. Advancements in Translations with Google Trax, Google Translate, and Gemini
5. Diving into Fine-Tuning through BERT
6. Pretraining a Transformer from Scratch through RoBERTa
7. The Generative AI Revolution with ChatGPT
8. Fine-Tuning OpenAI GPT Models
9. Shattering the Black Box with Interpretable Tools
10. Investigating the Role of Tokenizers in Shaping Transformer Models
11. Leveraging LLM Embeddings as an Alternative to Fine-Tuning
12. Toward Syntax-Free Semantic Role Labeling with ChatGPT and GPT-4
13. Summarization with T5 and ChatGPT
14. Exploring Cutting-Edge LLMs with Vertex AI and PaLM 2
15. Guarding the Giants: Mitigating Risks in Large Language Models
16. Beyond Text: Vision Transformers in the Dawn of Revolutionary AI
17. Transcending the Image-Text Boundary with Stable Diffusion
18. Hugging Face AutoTrain: Training Vision Models without Coding
19. On the Road to Functional AGI with HuggingGPT and its Peers
20. Beyond Human-Designed Prompts with Generative Ideation
Source: https://github.com/Denis2054/Transformers-for-NLP-and-Computer-Vision-3rd-Edition 7
Denis Rothman (2024),
Transformers for Natural Language Processing and Computer Vision:
Explore Generative AI and Large Language Models with Hugging Face, ChatGPT, GPT-4V, and DALL-E 3,
3rd Edition, Packt Publishing
Source: https://github.com/Denis2054/Transformers-for-NLP-and-Computer-Vision-3rd-Edition 8
Jay Alammar and Maarten Grootendorst (2024),
Hands-On Large Language Models:
Language Understanding and Generation,
O'Reilly Media
Source: https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961/ 9
Jay Alammar and Maarten Grootendorst (2024),
Hands-On Large Language Models:
Language Understanding and Generation,
O'Reilly Media
Chapter 1: Introduction to Language Models
Chapter 2: Tokens and Embeddings
Chapter 3: Looking Inside Transformer LLMs
Chapter 4: Text Classification
Chapter 5: Text Clustering and Topic Modeling
Chapter 6: Prompt Engineering
Chapter 7: Advanced Text Generation Techniques and Tools
Chapter 8: Semantic Search and Retrieval-Augmented Generation
Chapter 9: Multimodal Large Language Models
Chapter 10: Creating Text Embedding Models
Chapter 11: Fine-tuning Representation Models for Classification
Chapter 12: Fine-tuning Generation Models
Source: https://github.com/HandsOnLLM/Hands-On-Large-Language-Models 10
Generative AI
Large Language Models
(LLMs)
Foundation Models
11
Generative AI
(Gen AI)
AI Generated Content
(AIGC) 12
Generative AI (Gen AI)
AI Generated Content (AIGC)
Source: Yihan Cao, Siyu Li, Yixin Liu, Zhiling Yan, Yutong Dai, Philip S. Yu, and Lichao Sun (2023). "A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT."
arXiv preprint arXiv:2303.04226. 13
AI, ML, DL, Generative AI
Source: Jeong, Cheonsu. "A Study on the Implementation of Generative AI Services Using an Enterprise Data-Based LLM Application Architecture." arXiv preprint arXiv:2309.01105 (2023). 14
lmarena.ai Chatbot Arena Leaderboard
Rank* (UB)  Rank (StyleCtrl)  Model  Arena Score  95% CI  Votes  Organization  License
1 1 chocolate (Early Grok-3) 1403 +6/-6 9992 xAI Proprietary
2 3 Gemini-2.0-Flash-Thinking-Exp-01-21 1385 +4/-6 15083 Google Proprietary
2 3 Gemini-2.0-Pro-Exp-02-05 1380 +5/-6 13000 Google Proprietary
2 1 ChatGPT-4o-latest (2025-01-29) 1377 +5/-5 13470 OpenAI Proprietary
5 3 DeepSeek-R1 1362 +7/-7 6581 DeepSeek MIT
5 8 Gemini-2.0-Flash-001 1358 +7/-7 10862 Google Proprietary
5 3 o1-2024-12-17 1352 +5/-5 17248 OpenAI Proprietary
8 7 o1-preview 1335 +3/-4 33169 OpenAI Proprietary
8 8 Qwen2.5-Max 1334 +5/-5 9282 Alibaba Proprietary
8 7 o3-mini-high 1332 +5/-9 5954 OpenAI Proprietary
11 11 DeepSeek-V3 1318 +4/-5 19461 DeepSeek DeepSeek
11 13 Qwen-Plus-0125 1311 +9/-7 5112 Alibaba Proprietary
11 14 GLM-4-Plus-0111 1310 +6/-9 5134 Zhipu Proprietary
11 13 Gemini-2.0-Flash-Lite-Preview-02-05 1309 +6/-5 10262 Google Proprietary
12 12 o3-mini 1306 +5/-6 12179 OpenAI Proprietary
https://huggingface.co/spaces/lmarena-ai/chatbot-arena-leaderboard 15
lmarena.ai Chatbot Arena Leaderboard
Confidence Intervals on Model Strength (via Bootstrapping)
https://huggingface.co/spaces/lmarena-ai/chatbot-arena-leaderboard 16
Transformer (Attention is All You Need)
(Vaswani et al., 2017)
Source: Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin.
"Attention is all you need." In Advances in neural information processing systems, pp. 5998-6008. 2017. 17
Transformer (Attention is All You Need)
(Vaswani et al., 2017)
Source: Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin.
"Attention is all you need." In Advances in neural information processing systems, pp. 5998-6008. 2017. 18
Transformer Models
Transformer
Encoder Decoder
Source: Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova (2018).
"Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805. 20
Fine-tuning BERT on Different Tasks
Source: Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova (2018).
"Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805. 21
Sentiment Analysis:
Single Sentence Classification
Source: Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova (2018).
"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." arXiv preprint arXiv:1810.04805 22
Fine-tuning BERT on
Question Answering (QA)
Source: Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova (2018).
"Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805. 23
Fine-tuning BERT on Dialogue
Intent Detection (ID; Classification)
Source: Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova (2018).
"Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805. 24
Fine-tuning BERT on Dialogue
Slot Filling (SF)
Source: Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova (2018).
"Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805. 25
Key Features of the Transformer Model
• Self-Attention Mechanism: Allows the model to weigh the importance of different words in a sentence, regardless of their position (a minimal code sketch follows this list).
• Positional Encoding: Since Transformers don’t use recurrence (like RNNs), positional encodings are added to input embeddings to retain word order information.
• Multi-Head Attention: The model attends to different words simultaneously in multiple ways, capturing various relationships.
• Feed-Forward Layers: After attention, the output passes through dense layers for further transformation.
• Layer Normalization & Residual Connections: Improve gradient flow and training stability.
• Encoder-Decoder Architecture:
  • Encoder: Processes input text and converts it into contextual embeddings.
  • Decoder: Generates output text, often used in translation or text generation tasks.
Source: Denis Rothman (2024), Transformers for Natural Language Processing and Computer Vision: Explore Generative AI and Large Language Models with Hugging Face, ChatGPT, GPT-4V, and DALL-E 3, 3rd Edition, Packt Publishing 26
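A minimal sketch (not from the book) of the scaled dot-product self-attention behind the self-attention and multi-head attention bullets above, written in PyTorch; the tensor sizes are arbitrary toy values.

# Minimal sketch: scaled dot-product self-attention, the core Transformer operation.
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (batch, seq_len, d_k)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # similarity of every query with every key
    weights = F.softmax(scores, dim=-1)             # attention weights sum to 1 over the sequence
    return weights @ V                               # weighted sum of value vectors

x = torch.randn(1, 5, 64)                    # toy batch: 5 tokens, 64-dimensional embeddings
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V = x
print(out.shape)                             # torch.Size([1, 5, 64])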
Popular Transformer-Based Models
• BERT (Bidirectional Encoder Representations from Transformers)
• Used for tasks like classification and question answering.
• GPT (Generative Pre-trained Transformer)
• Generates text based on input prompts.
• T5 (Text-To-Text Transfer Transformer)
• Converts all NLP tasks into a text-to-text format.
• ViT (Vision Transformer)
• Applies Transformer architecture to computer vision.
Source: Denis Rothman (2024), Transformers for Natural Language Processing and Computer Vision: Explore Generative AI and Large Language Models with Hugging Face, ChatGPT, GPT-4V, and DALL-E 3, 3rd Edition, Packt Publishing 27
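The snippet below is a minimal sketch of how each of these model families can be loaded from the Hugging Face Hub; the checkpoint names (bert-base-uncased, gpt2, t5-small, google/vit-base-patch16-224) are common public checkpoints assumed for illustration, not taken from the slides.

# Minimal sketch: loading the model families listed above from the Hugging Face Hub.
from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM

bert = AutoModel.from_pretrained("bert-base-uncased")             # BERT: encoder-only
gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")               # GPT family: decoder-only
t5   = AutoModelForSeq2SeqLM.from_pretrained("t5-small")          # T5: text-to-text encoder-decoder
vit  = AutoModel.from_pretrained("google/vit-base-patch16-224")   # ViT: Transformer for images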
Tokens in NLP (Text Processing)
Tokenization: Splitting text into meaningful units (words, subwords, or characters).
Subword Tokens: Methods like Byte Pair Encoding (BPE) and WordPiece break words into reusable subunits (e.g., "unhappiness" → ["un", "happiness"]).
Source: Stuart Russell and Peter Norvig (2020), Artificial Intelligence: A Modern Approach, 4th Edition, Pearson 34
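To make the subword row above concrete, the snippet below runs a WordPiece tokenizer on a single word; the checkpoint name is an assumption and the exact split depends on the vocabulary, so the commented output is indicative only.

# Minimal sketch: WordPiece subword tokenization with a BERT tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("unhappiness"))   # e.g. ['un', '##happiness'] or similar subword pieces, depending on the vocabulary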
Transformer Architecture
for POS Tagging
Source: Stuart Russell and Peter Norvig (2020), Artificial Intelligence: A Modern Approach, 4th Edition, Pearson 35
Transformer (Attention is All You Need)
(Vaswani et al., 2017)
Source: Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin.
"Attention is all you need." In Advances in neural information processing systems, pp. 5998-6008. 2017. 36
Transformer
Source: Stuart Russell and Peter Norvig (2020), Artificial Intelligence: A Modern Approach, 4th Edition, Pearson 46
Masked Language Modeling:
Pretrain a Bidirectional Model
Source: Stuart Russell and Peter Norvig (2020), Artificial Intelligence: A Modern Approach, 4th Edition, Pearson 47
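A minimal sketch of the masked language modeling objective illustrated above, using the Hugging Face fill-mask pipeline; the checkpoint name and example sentence are assumptions for illustration.

# Minimal sketch: masked language modeling with a pretrained bidirectional model.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("The capital of France is [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))   # top predictions for the masked token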
Illustrated BERT
Source: Jay Alammar (2019), The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning),
http://jalammar.github.io/illustrated-bert/ 48
BERT Classification Input Output
Source: Jay Alammar (2019), The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning),
http://jalammar.github.io/illustrated-bert/ 49
BERT Encoder Input
Source: Jay Alammar (2019), The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning),
http://jalammar.github.io/illustrated-bert/ 50
BERT Classifier
Source: Jay Alammar (2019), The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning),
http://jalammar.github.io/illustrated-bert/ 51
Sentiment Analysis:
Single Sentence Classification
Source: Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova (2018).
"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." arXiv preprint arXiv:1810.04805 52
A Visual Guide to
Using BERT for the First Time
(Jay Alammar, 2019)
Source: Jay Alammar (2019), A Visual Guide to Using BERT for the First Time,
http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
53
Sentiment Classification: SST2
Sentences from movie reviews
sentence label
a stirring , funny and finally transporting re imagining of beauty and the beast and 1930s horror films 1
apparently reassembled from the cutting room floor of any given daytime soap 0
they presume their audience won't sit still for a sociology lesson 0
this is a visually stunning rumination on love , memory , history and the war between art and commerce 1
jonathan parker 's bartleby should have been the be all end all of the modern office anomie films 1
Source: Jay Alammar (2019), A Visual Guide to Using BERT for the First Time,
http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
54
Movie Review Sentiment Classifier
Source: Jay Alammar (2019), A Visual Guide to Using BERT for the First Time,
http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
55
Movie Review Sentiment Classifier
Source: Jay Alammar (2019), A Visual Guide to Using BERT for the First Time,
http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
56
Movie Review Sentiment Classifier
Model Training
Source: Jay Alammar (2019), A Visual Guide to Using BERT for the First Time,
http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
57
Step #1: Use DistilBERT to Generate Sentence Embeddings
Source: Jay Alammar (2019), A Visual Guide to Using BERT for the First Time,
http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
58
Step #2: Test/Train Split for Model #2, Logistic Regression
Source: Jay Alammar (2019), A Visual Guide to Using BERT for the First Time,
http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
59
Step #3: Train the logistic regression model using the training set
Source: Jay Alammar (2019), A Visual Guide to Using BERT for the First Time,
http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
60
Tokenization
[CLS] a visually stunning rum ##ination on love [SEP]
a visually stunning rumination on love
Source: Jay Alammar (2019), A Visual Guide to Using BERT for the First Time,
http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
61
Tokenization
tokenizer.encode("a visually stunning rumination on love",
add_special_tokens=True)
Source: Jay Alammar (2019), A Visual Guide to Using BERT for the First Time,
http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
62
Tokenization for BERT Model
Source: Jay Alammar (2019), A Visual Guide to Using BERT for the First Time,
http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
63
Flowing Through DistilBERT
(768 features)
Source: Jay Alammar (2019), A Visual Guide to Using BERT for the First Time,
http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
64
Model #1 Output Class vector as
Model #2 Input
Source: Jay Alammar (2019), A Visual Guide to Using BERT for the First Time,
http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
65
Fine-tuning BERT on
Single Sentence Classification Tasks
Source: Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova (2018).
"Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805. 66
Model #1 Output Class vector as
Model #2 Input
Source: Jay Alammar (2019), A Visual Guide to Using BERT for the First Time,
http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
67
Logistic Regression Model to
classify Class vector
Source: Jay Alammar (2019), A Visual Guide to Using BERT for the First Time,
http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
68
import pandas as pd

df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv', delimiter='\t', header=None)
df.head()
Source: Jay Alammar (2019), A Visual Guide to Using BERT for the First Time,
http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
69
Tokenization
tokenized = df[0].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))
Source: Jay Alammar (2019), A Visual Guide to Using BERT for the First Time,
http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
70
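The snippets that follow reference a padded array and a DistilBERT model that the slides do not define; an intermediate step along these lines (my assumption, following the flow of the guide) would bridge the gap.

# Assumed intermediate step (not shown on the slides): pad the tokenized sequences
# to equal length and load the pretrained DistilBERT model used below.
import numpy as np
from transformers import DistilBertModel

max_len = max(len(ids) for ids in tokenized.values)
padded = np.array([ids + [0] * (max_len - len(ids)) for ids in tokenized.values])

model = DistilBertModel.from_pretrained('distilbert-base-uncased')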
BERT Input Tensor
Source: Jay Alammar (2019), A Visual Guide to Using BERT for the First Time,
http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
71
Processing with DistilBERT
import torch

input_ids = torch.tensor(np.array(padded))   # padded token IDs from the padding step above
last_hidden_states = model(input_ids)        # model: the pretrained DistilBertModel
Source: Jay Alammar (2019), A Visual Guide to Using BERT for the First Time,
http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
72
Unpacking the BERT output tensor
Source: Jay Alammar (2019), A Visual Guide to Using BERT for the First Time,
http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
73
Sentence to last_hidden_state[0]
Source: Jay Alammar (2019), A Visual Guide to Using BERT for the First Time,
http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
74
BERT’s output for the [CLS] tokens
# Slice the output for the first position ([CLS]) for all sequences, take all hidden unit outputs
features = last_hidden_states[0][:,0,:].numpy()
Source: Jay Alammar (2019), A Visual Guide to Using BERT for the First Time,
http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
75
The tensor sliced from BERT's output
Sentence Embeddings
Source: Jay Alammar (2019), A Visual Guide to Using BERT for the First Time,
http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
76
Dataset for Logistic Regression
(768 Features)
The features are the output vectors of BERT for the [CLS] token (position #0)
Source: Jay Alammar (2019), A Visual Guide to Using BERT for the First Time,
http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
77
from sklearn.model_selection import train_test_split

labels = df[1]
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)
Source: Jay Alammar (2019), A Visual Guide to Using BERT for the First Time,
http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
78
Score Benchmarks
Logistic Regression Model
on SST-2 Dataset
from sklearn.linear_model import LogisticRegression

# Training
lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)

# Testing
lr_clf.score(test_features, test_labels)
# Accuracy: 81%

# Reference scores on SST-2:
# Highest accuracy: 96.8%
# Fine-tuned DistilBERT: 90.7%
# Full-size BERT model: 94.9%
Source: Jay Alammar (2019), A Visual Guide to Using BERT for the First Time,
http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
79
Sentiment Classification: SST2
Sentences from movie reviews
sentence label
a stirring , funny and finally transporting re imagining of beauty and the beast and 1930s horror films 1
apparently reassembled from the cutting room floor of any given daytime soap 0
they presume their audience won't sit still for a sociology lesson 0
this is a visually stunning rumination on love , memory , history and the war between art and commerce 1
jonathan parker 's bartleby should have been the be all end all of the modern office anomie films 1
Source: Jay Alammar (2019), A Visual Guide to Using BERT for the First Time,
http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
80
A Visual Notebook to
Using BERT for the First Time
https://colab.research.google.com/github/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_
Using_BERT_for_the_First_Time.ipynb 81
Artificial Intelligence
(AI)
82
Definition
of
Artificial Intelligence
(A.I.)
83
Artificial Intelligence
1. Acting Humanly: The Turing Test Approach (1950)
4. Acting Rationally: The Rational Agent Approach
Source: Stuart Russell and Peter Norvig (2020), Artificial Intelligence: A Modern Approach, 4th Edition, Pearson 87
AI Acting Humanly:
The Turing Test Approach
(Alan Turing, 1950)
• Knowledge Representation
• Automated Reasoning
• Machine Learning (ML)
• Deep Learning (DL)
• Computer Vision (Image, Video)
• Natural Language Processing (NLP)
• Robotics
Source: Stuart Russell and Peter Norvig (2020), Artificial Intelligence: A Modern Approach, 4th Edition, Pearson 88
AI, ML, DL
Artificial Intelligence (AI): Supervised Learning, Unsupervised Learning, Semi-supervised Learning, Reinforcement Learning
Deep Learning (DL): CNN, RNN, LSTM, GRU, GAN
Source: https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/deep_learning.html 89
Comparison of Generative AI
and Traditional AI
Feature Generative AI Traditional AI
91
Neural Network and Deep Learning
Source: 3Blue1Brown (2017), But what *is* a Neural Network? | Chapter 1, deep learning,
https://www.youtube.com/watch?v=aircAruvnKk 92
Gradient Descent
how neural networks learn
Source: 3Blue1Brown (2017), Gradient descent, how neural networks learn | Chapter 2, deep learning,
https://www.youtube.com/watch?v=IHZwWFHWa-w 93
Backpropagation
Source: 3Blue1Brown (2017), What is backpropagation really doing? | Chapter 3, deep learning,
https://www.youtube.com/watch?v=Ilg3gGewQ5U 94
Transformers (how LLMs work)
Source: 3Blue1Brown (2024), Transformers (how LLMs work) explained visually | DL5,
https://www.youtube.com/watch?v=wjZofJX0v4M 95
Attention in Transformers
Source: Cheng, Yuheng, Ceyao Zhang, Zhengwen Zhang, Xiangrui Meng, Sirui Hong, Wenhao Li, Zihao Wang et al. "Exploring large language model based intelligent agents: Definitions, methods, and prospects." arXiv preprint arXiv:2401.03428 (2024). 102
AI Agents
• Traditional AI Agents
  • Simple reflex agents
  • Model-based reflex agents
  • Goal-based agents
  • Utility-based agents
  • Learning agents
• Evolution of AI Agents
  • LLM-based Agents
  • Multi-modal agents
  • Embodied AI agents in virtual environments
  • Collaborative AI agents
103
Reinforcement Learning (RL)
Agent–Environment interaction
Source: Richard S. Sutton & Andrew G. Barto (2018), Reinforcement Learning: An Introduction, 2nd Edition, A Bradford Book. 104
Reinforcement Learning (RL)
Agent–Environment loop: (1) observation, (2) action, (3) reward
Source: Richard S. Sutton & Andrew G. Barto (2018), Reinforcement Learning: An Introduction, 2nd Edition, A Bradford Book. 105
Reinforcement Learning (RL)
Agent–Environment loop: (1) observation Ot, (2) action At, (3) reward Rt
Source: Richard S. Sutton & Andrew G. Barto (2018), Reinforcement Learning: An Introduction, 2nd Edition, A Bradford Book. 106
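A toy sketch of the observation-action-reward loop in the diagrams above; the coin-flip environment and the random agent are invented purely for illustration and are not from Sutton and Barto.

# Toy sketch of the agent-environment loop: observation -> action -> reward.
import random

class CoinFlipEnv:
    def step(self, action):
        observation = random.choice([0, 1])             # next observation O_t
        reward = 1.0 if action == observation else 0.0  # reward R_t
        return observation, reward

class RandomAgent:
    def act(self, observation):
        return random.choice([0, 1])                    # action A_t

env, agent = CoinFlipEnv(), RandomAgent()
observation, total_reward = 0, 0.0
for t in range(10):
    action = agent.act(observation)        # agent chooses an action from its observation
    observation, reward = env.step(action) # environment returns a new observation and a reward
    total_reward += reward
print("total reward:", total_reward)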
Large Language Model (LLM) based Agents
Source: Xi, Z., Chen, W., Guo, X., He, W., Ding, Y., Hong, B., ... & Gui, T. (2023). The rise and potential of large language model based agents: A survey. arXiv preprint arXiv:2309.07864. 107
LLM-based Agents
Source: Cheng, Yuheng, Ceyao Zhang, Zhengwen Zhang, Xiangrui Meng, Sirui Hong, Wenhao Li, Zihao Wang et al. "Exploring large language model based intelligent agents: Definitions, methods, and prospects." arXiv preprint arXiv:2401.03428 (2024). 108
Large Multimodal Agents (LMA)
Source: Xie, J., Chen, Z., Zhang, R., Wan, X., & Li, G. (2024). Large Multimodal Agents: A Survey. ArXiv, abs/2402.15116. 109
Large Multimodal Agents (LMA)
Source: Xie, J., Chen, Z., Zhang, R., Wan, X., & Li, G. (2024). Large Multimodal Agents: A Survey. ArXiv, abs/2402.15116. 110
The Development of LM-based Dialogue Systems
1) Early Stage (1966 - 2015)
2) The Independent Development of TOD and ODD (2015 - 2019)
3) Fusions of Dialogue Systems (2019 - 2022)
4) LLM-based DS (2022 - Now)
Source: Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. (2024) "A survey on multimodal large language models." National Science Review (2024): nwae403. 113
Multimodal Large Language Models (MLLM)
Multimodal LLM
Three types of connectors (a sketch of the projection-based type follows below):
1. projection-based
2. query-based
3. fusion-based connectors
Source: Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. (2024) "A survey on multimodal large language models." National Science Review (2024): nwae403. 114
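To make the first connector type concrete, the sketch below shows a projection-based connector as it is commonly implemented: a learned linear layer that maps vision-encoder features into the LLM's token-embedding space. The dimensions are illustrative assumptions, not values from the cited survey.

# Minimal sketch: projection-based connector between a vision encoder and an LLM.
import torch
import torch.nn as nn

vision_dim, llm_dim = 1024, 4096                    # e.g. ViT feature size -> LLM hidden size (assumed)
projector = nn.Linear(vision_dim, llm_dim)          # learned projection

image_features = torch.randn(1, 256, vision_dim)    # 256 visual patch features from a vision encoder
visual_tokens = projector(image_features)           # pseudo-tokens the LLM can consume
print(visual_tokens.shape)                          # torch.Size([1, 256, 4096])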
Multimodal Large Language Model (MLLM)
for Visual Question Answering
Source: Jiayi Kuang, Jingyou Xie, Haohao Luo, Ronghao Li, Zhe Xu, Xianfeng Cheng, Yinghui Li, Xika Lin, and Ying Shen. (2024) "Natural Language Understanding and Inference with MLLM in Visual Question Answering: A Survey." arXiv preprint arXiv:2411.17558. 115
Multi-task Language Understanding on MMLU
GPT-4, Claude 3.5 Sonnet
Massive Multitask Language Understanding (MMLU)
Source: Minaee, Shervin, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. (2024) "Large language models: A survey." arXiv preprint arXiv:2402.06196 (2024). 117
LLM-powered Multimodal Agents
Large Multimodal Agents (LMAs)
Source: Xie, Junlin, Zhihong Chen, Ruifei Zhang, Xiang Wan, and Guanbin Li. "Large Multimodal Agents: A Survey." arXiv preprint arXiv:2402.15116 (2024). 118
Four Paradigms in NLP (LM)
Source: Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. (2023) "Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing." ACM Computing Surveys 55, no. 9 (2023): 1-35. 119
Large Language Models (LLM)
Three typical learning paradigms
Source: Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. (2024) "A survey on multimodal large language models." National Science Review (2024): nwae403. 120
Popular Generative AI
• OpenAI ChatGPT (o1, GPT-4o, GPT-4)
• Claude.ai (Claude 3.5)
• Google Gemini
• Meta Llama 3.3, Llama 3.2 Vision
• Mistral Pixtral (mistral.ai)
• DeepSeek
• Chat.LMSys.org (lmarena.ai)
• Perplexity.ai
• Stable Diffusion
• Video: D-ID, Synthesia
• Audio: Speechify
121
Claude 3.5 Sonnet State-of-the-art vision
https://www.llama.com/ 124
Mistral Pixtral Large (124B)
Frontier-class multimodal performance
Source: Agrawal, Pravesh, Szymon Antoniak, Emma Bou Hanna, Baptiste Bout, Devendra Chaplot, Jessica Chudnovsky, Diogo Costa et al. (2024) "Pixtral 12B." arXiv preprint arXiv:2410.07073. 126
Large Language Models (LLMs)
Artificial Analysis Quality Index
Open Large Language Models: Chatbot Arena
https://lmarena.ai/ 129
Perplexity.ai
https://www.perplexity.ai/ 130
Generative AI (Gen AI)
AI Generated Content (AIGC)
Image Generation
Source: Yihan Cao, Siyu Li, Yixin Liu, Zhiling Yan, Yutong Dai, Philip S. Yu, and Lichao Sun (2023). "A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT."
arXiv preprint arXiv:2303.04226. 131
Generative AI (Gen AI)
AI Generated Content (AIGC)
Source: Yihan Cao, Siyu Li, Yixin Liu, Zhiling Yan, Yutong Dai, Philip S. Yu, and Lichao Sun (2023). "A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT."
arXiv preprint arXiv:2303.04226. 132
The history of Generative AI
in CV, NLP and VL
Source: Yihan Cao, Siyu Li, Yixin Liu, Zhiling Yan, Yutong Dai, Philip S. Yu, and Lichao Sun (2023). "A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT."
arXiv preprint arXiv:2303.04226. 133
Generative AI
Foundation Models
Source: Yihan Cao, Siyu Li, Yixin Liu, Zhiling Yan, Yutong Dai, Philip S. Yu, and Lichao Sun (2023). "A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT."
arXiv preprint arXiv:2303.04226. 134
Categories of Vision Generative Models
Source: Yihan Cao, Siyu Li, Yixin Liu, Zhiling Yan, Yutong Dai, Philip S. Yu, and Lichao Sun (2023). "A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT."
arXiv preprint arXiv:2303.04226. 135
The General Structure of
Generative Vision Language
Source: Yihan Cao, Siyu Li, Yixin Liu, Zhiling Yan, Yutong Dai, Philip S. Yu, and Lichao Sun (2023). "A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT."
arXiv preprint arXiv:2303.04226. 136
RAG LLM
Dialogue Systems
137
Technology Tree of RAG Research
Retrieval-Augmented Generation (RAG) for Large Language Models (LLMs)
Source: Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., ... & Wang, H. (2023). Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997. 138
Retrieval-Augmented Generation (RAG)
for Large Language Models (LLMs)
Source: Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., ... & Wang, H. (2023). Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997. 139
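A minimal sketch of the retrieve-then-generate idea in these figures: embed a small document collection, retrieve the passages most similar to the query, and prepend them to the prompt. The sentence-transformers model and the documents are assumptions, and the final generation call is left as a placeholder.

# Minimal sketch: retrieval-augmented generation (retrieve, then build a grounded prompt).
from sentence_transformers import SentenceTransformer, util

documents = [
    "The Transformer architecture was introduced in 2017.",
    "BERT is an encoder-only Transformer pretrained with masked language modeling.",
    "DistilBERT is a smaller, faster distilled version of BERT.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")          # assumed embedding model
doc_embeddings = embedder.encode(documents, convert_to_tensor=True)

query = "Which model is a distilled version of BERT?"
query_embedding = embedder.encode(query, convert_to_tensor=True)

hits = util.semantic_search(query_embedding, doc_embeddings, top_k=2)[0]
context = "\n".join(documents[hit["corpus_id"]] for hit in hits)

prompt = f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
# `prompt` would then be sent to an LLM (e.g., via an API call) to generate the answer.
print(prompt)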
Retrieval-Augmented Generation (RAG) Architecture
Source: Zhao, P., Zhang, H., Yu, Q., Wang, Z., Geng, Y., Fu, F., ... & Cui, B. (2024). Retrieval-augmented generation for ai-generated content: A survey. arXiv preprint arXiv:2402.19473. 140
Synthesizing RAG with LLMs
for Question Answering Application
Source: Minaee, S., Mikolov, T., Nikzad, N., Chenaghlu, M., Socher, R., Amatriain, X., & Gao, J. (2024). Large language models: A survey. arXiv preprint arXiv:2402.06196. 141
Synthesizing the KG as a Retriever with LLMs
Source: Minaee, S., Mikolov, T., Nikzad, N., Chenaghlu, M., Socher, R., Amatriain, X., & Gao, J. (2024). Large language models: A survey. arXiv preprint arXiv:2402.06196. 142
HuggingGPT:
An agent-based approach to use tools and planning
Source: Minaee, S., Mikolov, T., Nikzad, N., Chenaghlu, M., Socher, R., Amatriain, X., & Gao, J. (2024). Large language models: A survey. arXiv preprint arXiv:2402.06196. 143
An LLM-based Agent for
Conversational Information Seeking
Source: Minaee, S., Mikolov, T., Nikzad, N., Chenaghlu, M., Socher, R., Amatriain, X., & Gao, J. (2024). Large language models: A survey. arXiv preprint arXiv:2402.06196. 144
Direct LLM, RAG, and GraphRAG
Source: Peng, B., Zhu, Y., Liu, Y., Bo, X., Shi, H., Hong, C., ... & Tang, S. (2024). Graph retrieval-augmented generation: A survey. arXiv preprint arXiv:2408.08921. 145
GraphRAG Framework for Question Answering
Source: Peng, B., Zhu, Y., Liu, Y., Bo, X., Shi, H., Hong, C., ... & Tang, S. (2024). Graph retrieval-augmented generation: A survey. arXiv preprint arXiv:2408.08921. 146
LangChain Architecture
150