Mod 4
• A language model should predict a likely next word, such as "coffee", "tea", or "juice", based on the previous words.
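A minimal sketch of this prediction task, using a toy count-based bigram model in Python; the corpus sentences and the choice of "drink" as the context word are illustrative assumptions, not from the slides.

```python
from collections import Counter, defaultdict

# Toy corpus; the sentences and resulting probabilities are illustrative only.
corpus = [
    "i drink coffee every morning",
    "i drink tea in the evening",
    "she likes to drink juice",
    "they drink coffee at work",
]

# Count which word follows each word (a bigram model).
bigrams = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        bigrams[prev][nxt] += 1

# Estimate P(next word | "drink") from relative frequencies.
counts = bigrams["drink"]
total = sum(counts.values())
for word, count in counts.most_common():
    print(f"P({word!r} | 'drink') = {count / total:.2f}")
```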
Language Model
Types of Language Models
• Statistical Language Models (SLMs)
• Neural Language Models (NLMs)
• Transformer-Based Language Models
Language Model
• Neural Language Models (NLMs) represent words and contexts using neural networks, such as RNNs.
• Self-Attention Mechanism
• Example - In "The cat sat on the mat", "cat" should be closely related to "sat", not "mat".
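A minimal NumPy sketch of scaled dot-product self-attention over this sentence; the embedding size and the random vectors standing in for learned embeddings and projection matrices are illustrative assumptions.

```python
import numpy as np

np.random.seed(0)
tokens = ["The", "cat", "sat", "on", "the", "mat"]
d = 8  # toy embedding size

# Random vectors stand in for learned token embeddings.
X = np.random.randn(len(tokens), d)

# In a real Transformer, Q, K, V come from learned projection matrices;
# random projections are used here only to show the mechanics.
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

# Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
scores = Q @ K.T / np.sqrt(d)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
output = weights @ V  # each row is a context-aware mix of all token values

# How strongly does "cat" attend to every other token?
cat_weights = weights[tokens.index("cat")]
print({tok: round(float(w), 2) for tok, w in zip(tokens, cat_weights)})
```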
Language Model
• Advanced Language Models (Transformers & Pretrained Models)
• BERT (Bidirectional Encoder Representations from Transformers)
• Uses bidirectional context, unlike traditional left-to-right models.
• Pretrained on Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) tasks.
BERT: https://www.youtube.com/watch?v=t45S_MwAcOw&list=WL&index=24
Language Model
• Applications of Language Models
• Machine Translation (Google Translate)
• Uses Transformer models like BERT and T5.
• Future Directions
• More efficient models (quantization, pruning, distillation; a quantization sketch follows below).
• Better multilingual support (handling low-resource languages).
• Ethical AI (reducing bias in NLP models).
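As a rough illustration of the efficiency bullet above, a minimal PyTorch sketch of dynamic INT8 quantization; the small stand-in model is an assumption, not a real language model.

```python
import torch
import torch.nn as nn

# Stand-in model; in practice this would be a pretrained language model.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 2))

# Dynamic quantization stores Linear weights in INT8 and dequantizes on the fly,
# shrinking the model and often speeding up CPU inference.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
print(quantized(x).shape)  # torch.Size([1, 2])
```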
Language Model
• Why Are Language Models Important in AI?
• Improve Human-Computer Interaction - AI assistants (e.g., Siri, Alexa, ChatGPT).
• Automate Text-Based Tasks - AI-driven writing, coding, and summarization.
• Enhance Search & Information Retrieval - Google Search uses BERT for better query understanding.
• Multimodal AI - Some models (GPT-4, Gemini) handle text, images, and audio together.
• Real-World Example
• Google uses BERT (Bidirectional Encoder Representations from Transformers) to understand search
queries better.
• OpenAI’s GPT models power ChatGPT, which generates human-like conversations.
LLM vs. Foundational Models
Feature | LLM | Foundational Model
Training Data | Text-only (books, articles, web data). | Text, images, audio, videos, 3D models.
Use Cases | Chatbots, summarization, translation, code generation. | AI image generation, video generation, multimodal AI applications.
Architectural Foundation | Transformer-based (decoder-only or encoder-decoder). | Transformers, GANs, VAEs, Diffusion Models, and hybrid architectures.
Multimodal Capabilities? | No (text-only). | Yes (text, image, speech, video).
GPT-3
• Key Features
• 175 billion parameters (compared to 1.5B in GPT-2).
BERT
• Key Features
• Bidirectional understanding – reads text both forward and backward, so a word's meaning is informed by context on both sides.
• Example
• Sentence: "The bank approved my loan because I had a good credit score." Context from both directions ("approved my loan", "credit score") tells the model that "bank" means a financial institution.
https://doi.org/10.48550/arXiv.1810.04805
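A hedged sketch of what bidirectional context buys: comparing BERT's contextual vector for "bank" in the slide's sentence against a second, river-related sentence added here purely for illustration. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = [
    "The bank approved my loan because I had a good credit score.",
    "We sat on the bank of the river.",  # second context added for illustration
]

def bank_vector(text):
    # Contextual embedding of the token "bank" from BERT's last hidden layer.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

v1, v2 = (bank_vector(s) for s in sentences)
sim = torch.cosine_similarity(v1, v2, dim=0).item()
# The same word gets a different vector in each context, which is what
# bidirectional (left + right) context makes possible.
print(f"cosine similarity between the two 'bank' vectors: {sim:.2f}")
```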
Transformers: https://arxiv.org/abs/1706.03762v7
BERT
Component | Function
Token Embeddings | Converts words into numerical vectors.
Positional Encoding | Adds word order information.
Bidirectional Self-Attention | Understands relationships between words.
Feedforward Neural Network | Refines token representations.
Output Layer (Softmax) | Predicts missing words or the next sentence.
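A minimal sketch tying the components above to code, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint; the layer names follow that implementation.

```python
import torch
from transformers import AutoTokenizer, BertModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")

# Token embeddings and positional encodings live in the embedding layer.
print(model.embeddings.word_embeddings)      # Embedding(30522, 768): token id -> vector
print(model.embeddings.position_embeddings)  # Embedding(512, 768): position -> vector

# Bidirectional self-attention + feedforward layers refine these vectors
# into contextual token representations.
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state
print(hidden.shape)  # torch.Size([1, sequence_length, 768])
```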
Large amounts of training data
• BERT is designed to be pretrained on very large amounts of text.
https://doi.org/10.48550/arXiv.1810.04805
Masked Language Model
• That is, the model uses both the words before and the words after the hidden word to predict it.
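A minimal sketch of masked word prediction using the Hugging Face fill-mask pipeline; the example sentence and the choice of bert-base-uncased are assumptions for illustration.

```python
from transformers import pipeline

# Masked language modeling: the model sees context on both sides of [MASK]
# and predicts the hidden word.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The cat sat on the [MASK]."):
    print(f'{prediction["token_str"]:>10}  score={prediction["score"]:.3f}')
```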
Next Sentence Prediction
• Massive parallelization makes it feasible to train BERT on extensive data quickly.
• Human brains have limited memory, and machine learning models likewise must learn to pay attention to what matters most.
• When a model does this, it avoids wasting computational resources on processing irrelevant information.
• Transformers create differential weights that signal which words in a sentence are most critical for further processing.
T5
• Key Features
• All NLP tasks are converted into text generation problems.
• Pretrained on a massive text corpus (Colossal Clean Crawled Corpus - C4).
• Works on multiple NLP tasks: summarization, translation, Q&A, and classification.
• Encoder-Decoder Transformer architecture (like seq2seq models).
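A minimal sketch of the text-to-text idea above, assuming the Hugging Face transformers library and the publicly released t5-small checkpoint; the task prefixes follow T5's convention, but the exact prompts are illustrative.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task is phrased as text in, text out; the prefix selects the task.
prompts = [
    "translate English to German: The house is wonderful.",
    "summarize: The tower is 324 metres tall, about the same height as an 81-storey building.",
]
for prompt in prompts:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids   # encoder input
    output_ids = model.generate(input_ids, max_new_tokens=40)      # decoder generates token by token
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```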
T5
• Tasks
• Translation
• Summarization
• Sentiment Analysis
• QA
• Components
• Token Embeddings - Converts words into numerical vectors.
• Positional Encoding - Adds word order information.
• Encoder (Processes Input Text) - Converts input into hidden representations.
• Decoder (Generates Output Text) - Takes encoder output and generates text
token by token.
• Attention Mechanism - Helps model focus on important parts of input text.
T5
• Summary
• The paper introduces a unified "text-to-text" framework for natural language processing tasks, explores various techniques for transfer
learning in this framework, and then combines these insights to achieve state-of-the-art results on a wide range of NLP benchmarks.
• Main Findings
• The key finding is the development of a unified text-to-text framework that can be applied to a diverse set of NLP tasks.
• The encoder-decoder Transformer architecture performed best in their text-to-text framework, despite using more parameters than other
architectural variants.
• The C4 dataset they introduced, which is a cleaned version of the Common Crawl web data, can provide benefits over more curated
unlabeled datasets for some downstream tasks.
RoBERTa
• RoBERTa (Robustly Optimized BERT Pretraining Approach) is an improved version of BERT
developed by Facebook AI (Meta AI) in 2019. It enhances BERT’s pretraining process to
improve performance on NLP tasks.
• Key Features
• Same architecture as BERT (Transformer-based encoder).
• RoBERTa is like "BERT but stronger" with better fine-tuning results on NLP benchmarks!
RoBERTa
• How is RoBERTa Different from BERT?
• RoBERTa uses dynamic masking: different words are masked each time the model sees the same sentence (a toy sketch follows below).
• Example
• Sentence: "The [MASK] jumped over the fence."
• BERT (static masking): the same position stays masked every time the sentence is seen during pretraining.
• RoBERTa: randomly masks different words in different epochs ("dog", "jumped", "fence"), forcing the model to learn better representations.
• Uses bigger batch sizes (8,000 vs. 256 in BERT) for better learning stability.
• https://arxiv.org/pdf/1907.11692
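A toy sketch of the dynamic-masking idea, not RoBERTa's actual preprocessing (which works on subword tokens and also applies random/keep replacements): the masked positions are re-sampled each time the sentence is seen.

```python
import math
import random

def dynamic_mask(sentence, mask_rate=0.15, mask_token="[MASK]"):
    # Re-sample which positions are hidden every time the sentence is seen,
    # so different epochs mask different words.
    words = sentence.split()
    n_mask = max(1, math.ceil(mask_rate * len(words)))
    masked_positions = set(random.sample(range(len(words)), n_mask))
    return " ".join(
        mask_token if i in masked_positions else word
        for i, word in enumerate(words)
    )

sentence = "The dog jumped over the fence."
for epoch in range(3):
    print(f"epoch {epoch}: {dynamic_mask(sentence)}")
```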
Leading Language Models and their Real Life Applications
• Leading language models like GPT-3, BERT, and LaMDA are
revolutionizing various fields, with applications ranging from
customer service and content creation to data analysis and
language translation, demonstrating their ability to understand and
generate human-like text.
Leading Language Models and their Real Life Applications
• Traditional language models produce responses based only on data that they have
been trained with, which may be inaccurate, especially when dealing with recent
events or specialised knowledge domains.
• RAG addresses this limitation by allowing the model to obtain the relevant
information in real time before generating the reply.
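A minimal end-to-end sketch of this retrieve-then-generate flow; the tiny knowledge base, the word-overlap retriever, and the prompt wording are illustrative assumptions (real systems use dense embeddings in a vector store and send the final prompt to an LLM).

```python
# Toy knowledge base; real systems store dense embeddings in a vector database.
knowledge_base = {
    "doc1": "RAG combines a retriever with a generator language model.",
    "doc2": "BERT is an encoder-only Transformer pretrained with masked language modeling.",
    "doc3": "T5 casts every NLP task as text-to-text generation.",
}

def retrieve(query, top_k=1):
    # Toy retriever: rank documents by word overlap with the query.
    q_words = set(query.lower().split())
    ranked = sorted(
        knowledge_base.values(),
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

query = "How does RAG combine a retriever with a generator?"
context = "\n".join(retrieve(query))

# Synthesis: the generator receives the original query plus retrieved context.
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {query}"
)
print(prompt)  # in a real system this prompt is sent to the LLM
```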
Document Embeddings
Storing document embeddings allows for efficient
organization and searching of complex information.
Synthesis of Output
The generator combines original input and retrieved information to create final outputs
effectively.
Contextual Relevance
The generator focuses on maintaining contextual relevance to enhance the quality of the
generated text.
RAG – Simplified Sketch
[Diagram: Phase I – Retrieval: the question becomes a query against the knowledge base, which returns relevant documents as context. Phase II – Generation: the query plus retrieved context form the prompt for the LLM, which generates the output answer.]
Advantages and Limitations of RAG
Enhanced Accuracy
RAG improves the accuracy of information retrieval, ensuring relevant results that meet user needs
effectively.
Contextual Relevance
The contextual relevance of RAG enhances the user experience by providing tailored information
based on context.
Retrieval Noise
Retrieval noise may arise in RAG systems, leading to irrelevant or misleading information being
presented.