
GENERATIVE AI

UNIT – III
Generative AI Concepts: Encoder/decoder architectures as basis for
Generative AI, the role of the latent space, Transformer architectures and
Attention, Conditional Generative Models, Introduction to GPT and its
significance, Architecture and working of GPT models.

Course Code: 21CS4807


Course Instructor: Prof. Arjun Krishnamurthy
Generative AI: Core
Concepts
This presentation unveils the fundamental building blocks of
generative AI models, which are capable of creating new content. From
generating images to synthesizing text, we will explore the underlying
principles that drive these innovative systems. Our focus will be on
Encoder/Decoder architectures, the role of the Latent Space, and the
power of Transformers and Attention mechanisms.
Encoder-Decoder Architectures: The
Foundation
Encoding: Compression of input data into a fixed-length vector representation. Encoders transform complex data, such as a 256x256 image, into a more manageable 1024-dimensional vector.

Decoding: Reconstruction of the original input (or a variation) from the encoded vector. Language models can decode a vector into a coherent and contextually relevant sentence.

Encoder-decoder architectures have found applications in machine translation, image captioning, and generative
adversarial networks (GANs). Neural Machine Translation (NMT) based on encoder-decoders has greatly improved
translation quality. Google Translate implemented this architecture in 2016.
Autoencoders
An autoencoder is a neural network made up of two parts:

• An encoder network that compresses high-dimensional input data into a lower-dimensional representation vector
• A decoder network that decompresses a given representation vector back to the original domain
Building a Variational Autoencoder
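Below is a minimal sketch of the two-part structure just described, extended to the variational case named in this slide title. It uses PyTorch (the slides do not prescribe a framework); the layer sizes, class names, and the flattened 28x28 input dimension are illustrative assumptions rather than material from the slides.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Minimal variational autoencoder sketch: an encoder that outputs a mean and
    log-variance for the latent vector, and a decoder that maps a latent sample
    back to the original input domain."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.fc_mu = nn.Linear(256, latent_dim)       # mean of the latent distribution
        self.fc_logvar = nn.Linear(256, latent_dim)   # log-variance of the latent distribution
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),  # reconstruct the original domain
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return self.decoder(z), mu, logvar
```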
Deep Dive: Encoder Mechanics

CNNs: Convolutional Neural Networks are especially powerful for image encoding, extracting key features through convolutional layers. For instance, ResNet-50 can encode images with high accuracy, achieving <8% error rate on ImageNet.

RNNs: Recurrent Neural Networks, including LSTMs and GRUs, excel at processing sequential data like text. They capture temporal dependencies by encoding sentences into context vectors, and have achieved high BLEU scores on translation tasks.

Key Properties: A good encoder should capture essential information, create a compact representation, and enable effective decoding, ensuring that the critical aspects of the input are preserved and can be reconstructed accurately.
Deep Dive: Decoder Mechanics

CNNs (Transposed Convolutions): Transposed convolutions enable upsampling encoded representations to generate high-resolution images (e.g., 1024x1024) from latent vectors. Decoders use CNNs to create detailed visual outputs.

RNNs/LSTMs/GRUs: These networks generate sequences of data such as text or audio. GPT-2 utilizes a transformer-based decoder to generate realistic text, while other RNN variants can produce coherent audio sequences.

Autoregressive Decoding: This technique generates data one step at a time, conditioning on previous outputs. Autoregressive models are excellent at predicting the next word in a sentence based on the words that came before.
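As a concrete illustration of autoregressive decoding, here is a hedged greedy-decoding sketch in PyTorch. The `model`, its output shape, and the end-of-sequence id are assumptions for illustration only, not part of the slides.

```python
import torch

def greedy_decode(model, start_ids, eos_id=0, max_len=50):
    """Greedy autoregressive decoding: generate one token per step, feeding each
    prediction back in as part of the input for the next step."""
    tokens = start_ids                                   # (1, prompt_len) token ids
    for _ in range(max_len):
        logits = model(tokens)                           # assumed shape (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # most likely next token
        tokens = torch.cat([tokens, next_id], dim=1)     # condition on it at the next step
        if next_id.item() == eos_id:                     # stop at the end-of-sequence token
            break
    return tokens
```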
Latent Space
• Latent Space is like a secret language that AI uses to understand and
organize information. Imagine it as a hidden room where AI stores all
the important details it has learned from the data it's been fed.

• Just like how we might organize our thoughts in a diary or a file cabinet, AI uses latent space to keep things tidy and easy to access when needed.

• In the world of AI, latent space is where all the magic happens. It's like
a translator that takes complex data and simplifies it into a more
manageable form.
• AI processes input data, identifies patterns and
relationships, then organizes this information in latent space
for easier retrieval. This hidden space helps AI make
predictions, generate new data, or categorize information
efficiently.

• By using latent space, AI can turn complicated data into valuable insights that can be used for a wide range of tasks.
The Latent Space: A Compressed Representation of Data

1. Definition: The latent space is a multi-dimensional vector space learned by the encoder, where each point represents a possible data sample.

2. Role: It enables manipulation and generation of new data through interpolation (smooth transitions between different data samples) and arithmetic operations (combining different concepts).

3. Properties: The latent space is continuous, organized, and disentangled, enabling smooth transitions and meaningful manipulations.
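The interpolation described above can be sketched as follows. Here `encoder` and `decoder` are assumed to be trained modules from an autoencoder-style model; the names and the number of steps are illustrative, not prescribed by the slides.

```python
import torch

def interpolate(encoder, decoder, x_a, x_b, steps=8):
    """Decode points on the straight line between the latent codes of two samples,
    which should produce a smooth transition between them."""
    z_a, z_b = encoder(x_a), encoder(x_b)      # latent codes of the two inputs
    outputs = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        z = (1 - alpha) * z_a + alpha * z_b    # point between the two codes
        outputs.append(decoder(z))             # decoded sample for that point
    return outputs
```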
Latent Space
Applications and Examples
• Latent space is often used in the field of machine learning and AI. One practical
example is in the creation of deepfake videos.

• In this process, the AI model learns a latent space of facial expressions and features,
and then uses this knowledge to seamlessly overlay one personʼs face onto another in
a video.

• Another example of latent space application is in natural language processing. AI models can learn a latent space of word embeddings, which represent words in a multi-dimensional space based on their context.

• This allows the AI to understand and generate human-like language, and is used in
applications like language translation and chatbots.
Applications and Examples
• In the field of computer vision, latent space can also be used in
image generation. AI models can learn a latent space of features
and textures from a dataset of images, and then generate new,
realistic images based on this latent space.

• This is used in applications like generating synthetic data for training other AI models or creating realistic visuals for video games and virtual simulations.
Latent Space Exploration:
Examples
VAEs
Variational Autoencoders learn probabilistic latent spaces. VAEs have
several applications, including image generation, anomaly detection,
and representation learning. An example is VAEs generating images
of handwritten digits with controlled style and content.

GANs
Generative Adversarial Networks learn latent spaces through
adversarial training. GANs also have a variety of applications,
including image synthesis, style transfer, and super-resolution.
StyleGAN is a GAN that generates photorealistic faces with
fine-grained control.

Both VAEs and GANs provide powerful tools for exploring and manipulating latent
spaces, enabling the creation of new and interesting data samples.
TRANSFORMER
TRANSFORMER and ATTENTION
• Transformers and attention mechanisms have revolutionized
the field of deep learning, offering a powerful way to process
sequential data and capture long-range dependencies.

• In this unit, we will explore the basics of transformers and the importance of attention mechanisms in enhancing model performance and coherence.
Importance of Attention Mechanisms
• The advent of attention mechanisms has been nothing short of revolutionary in the
realm of deep learning. Attention allows models to dynamically focus on
pertinent parts of the input data, akin to the way humans pay attention to certain
aspects of a visual scene or conversation.

• This selective focus is particularly crucial in tasks where context is key, such as
language understanding or image recognition.

• In the context of transformers, attention mechanisms serve to weigh the influence of different input tokens when producing an output. This is not merely a replication of human attention but an enhancement, enabling machines to surpass human performance in certain tasks.
Importance of Attention Mechanisms
• They provide a means to handle variable-sized inputs by focusing on the most relevant parts.

• Attention-based models can capture long-range dependencies that earlier models like RNNs struggled with.

• They facilitate parallel processing of input data, leading to significant improvements in computational efficiency.

• The elegance of attention mechanisms lies in their simplicity and power. By enabling models to consider the entire context of an input, they have opened up new possibilities in machine learning, leading to breakthroughs in natural language processing and beyond.
Encoder-Decoder Model
• In sequence-to-sequence translation, the encoder processes the input sequence and distills it into a fixed-length representation, often referred to as the context vector. This vector serves as a condensed summary of the input, capturing its essence for the decoder to interpret.

• The decoder, mirroring the encoder’s structure, is initialized with the context vector, which seeds its initial hidden state. Starting from the first token of the output sequence, it generates subsequent tokens one at a time, with each prediction influenced by the previously generated tokens and by the context vector.

• Generation continues until an end-of-sequence token is produced or a predefined maximum sequence length is reached.
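A minimal sketch of this encoder-decoder flow, assuming GRU networks and illustrative vocabulary and dimension sizes (the slides do not fix an implementation): the encoder's final hidden state serves as the context vector and initializes the decoder.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """GRU encoder-decoder: the encoder's final hidden state is the fixed-length
    context vector, which becomes the decoder's initial hidden state."""
    def __init__(self, src_vocab=1000, tgt_vocab=1000, d_model=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d_model)
        self.tgt_emb = nn.Embedding(tgt_vocab, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, context = self.encoder(self.src_emb(src_ids))            # condensed summary of the input
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), context)   # decoder starts from the context
        return self.out(dec_out)                                    # logits for each output position
```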
Transformer Architecture
• The encoder’s role is to meticulously
extract features from the input
sequence.
• This is achieved through a series of
layers, each comprising a multi-head
attention mechanism followed by a
feed-forward neural network.
• These layers are further enhanced with
normalization and residual connections
to ensure stability during training.
• Remarkably, the entire sequence is processed in parallel, which is a stark
departure from the sequential processing of traditional recurrent neural networks
(RNNs).

• The decoder, on the other hand, is tasked with generating the output sequence. It
mirrors the encoder’s structure but includes an additional layer of cross-attention
that allows it to focus on relevant parts of the input sequence as it produces the
output.
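To make the layer structure concrete, here is a sketch using PyTorch's built-in transformer layers: each encoder layer combines multi-head self-attention, a feed-forward network, residual connections, and normalization, and decoder layers add cross-attention over the encoder output. The hyperparameters (6 layers, 8 heads, 512-dimensional model) are illustrative assumptions rather than values from the slides.

```python
import torch
import torch.nn as nn

enc_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048, batch_first=True)
dec_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, dim_feedforward=2048, batch_first=True)
encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
decoder = nn.TransformerDecoder(dec_layer, num_layers=6)

src = torch.randn(2, 10, 512)   # whole input sequence processed in parallel
tgt = torch.randn(2, 7, 512)    # partially generated output sequence
memory = encoder(src)           # encoder features extracted from the input
out = decoder(tgt, memory)      # cross-attention lets the decoder focus on `memory`
print(out.shape)                # torch.Size([2, 7, 512])
```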
Applications of Transformers
• NLP and Language Modeling
• In the realm of Natural Language Processing (NLP), transformers have ushered in a new
“epoch” of language modeling prowess. Transformers, with their attention mechanisms,
have become the cornerstone of modern NLP, excelling in capturing context and
relationships between words.
• Widespread growth from sentiment analysis to language translation, and the development
of sophisticated chatbots and virtual assistants.
• The self-attention mechanism within transformers allows for the nuanced understanding of
language, enabling models to process words in relation to all other words in a sentence,
rather than in isolation.
Applications of Transformers
• Computer Vision
• The advent of Transformers in computer vision marks a paradigm shift from the
conventional convolutional neural networks (CNNs) that dominated the field for years.
• Transformers introduce a novel approach to processing visual data, leveraging self-attention
mechanisms to capture global dependencies within an image, which is particularly
beneficial for tasks such as object detection, image segmentation, and classification.
• The hierarchical structure of vision transformers, such as the Swin Transformer, enables the
model to focus on different scales of an image, enhancing the ability to discern fine details
while maintaining a global perspective.
Transformer Architectures: Revolutionizing Generative AI

1. Introduction: Transformers are powerful architectures that replace RNNs with attention mechanisms for sequence modeling. A key innovation of Transformers is the parallel processing of input sequences.

2. Advantages: Transformers have several advantages, including superior performance (compared to RNNs), scalability, and the ability to capture long-range dependencies.

BERT exemplifies the power of Transformers for pre-training language models, achieving state-of-the-art results on various NLP tasks. Their ability to process sequences in parallel and capture long-range dependencies has led to significant advancements in generative AI.
Attention Mechanisms: Focusing on the Relevant Information

Core Idea: The core idea of attention mechanisms is assigning weights to different parts of the input sequence based on their relevance. In machine translation, this involves focusing on the relevant words in the source sentence when generating the target sentence.

Different Types: There are different types of attention, including self-attention (relating different parts of the same input sequence) and cross-attention (relating different input sequences, e.g., encoder and decoder).

Attention mechanisms involve mathematical formulations with queries, keys, and values, allowing the model to focus on the most relevant information and make more informed decisions.
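A minimal sketch of the query/key/value formulation referenced above, i.e. softmax(QK^T / sqrt(d_k)) V, written in PyTorch; the tensor names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value, mask=None):
    """Compute softmax(Q K^T / sqrt(d_k)) V for batched inputs."""
    d_k = query.size(-1)
    # Similarity scores between every query and every key.
    scores = torch.matmul(query, key.transpose(-2, -1)) / d_k ** 0.5
    if mask is not None:
        # Positions where mask == 0 are excluded from attention.
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)       # attention weights sum to 1 per query
    return torch.matmul(weights, value), weights

# Illustrative shapes: batch of 2 sequences, 5 tokens, 64-dimensional heads.
q = k = v = torch.randn(2, 5, 64)
out, attn = scaled_dot_product_attention(q, k, v)
print(out.shape, attn.shape)                  # torch.Size([2, 5, 64]) torch.Size([2, 5, 5])
```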
Transformers in Generative Models
GPT Series: The GPT series (GPT-3, GPT-4) excels at language generation, producing realistic and coherent text with minimal human intervention. GPT-3, with 175 billion parameters, has demonstrated its ability to write articles, poems, and code.

DALL-E and Imagen: DALL-E and Imagen generate high-quality images from text descriptions, creating diverse and visually appealing content based on textual prompts. For example, DALL-E can generate images of "an astronaut riding a horse in space".

Key components of these models include multi-head attention, feedforward networks, and residual connections, enabling them to capture complex relationships and generate high-quality outputs.
Conditional Generative
Models: An Introduction
Generative models are powerful tools in the field of artificial
intelligence, capable of learning the underlying distribution of data and
generating new samples that resemble the original dataset.
Conditional generative models take this a step further by allowing us
to guide the generation process with additional input, or conditions.
This opens up a wide range of possibilities, from image-to-image
translation to personalized content creation. In this presentation, we
will explore the fascinating world of conditional generative models,
with a special focus on GPT and its significance in this domain.
Types of Conditional Generative Models
Conditional Variational Autoencoders (CVAEs) and Conditional Generative Adversarial Networks (CGANs) are two
prominent types of conditional generative models. CVAEs, extending VAEs with conditional input, offer stable training but
may produce blurry outputs. CGANs, on the other hand, extending GANs, can generate sharp, realistic outputs, but often
suffer from training instability.

CVAE: Generating diverse outputs with specific attributes is a key strength of CVAEs. Their stability makes them suitable for applications where consistency is crucial.

CGAN: For applications requiring high-quality, realistic outputs, CGANs are often preferred. However, their training complexity needs to be carefully managed.
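As a sketch of how conditioning works in practice, the snippet below (an assumed CVAE-style decoder, not code from the slides) concatenates a label embedding to the latent vector so the condition guides what gets generated. Sizes and the 10-class label space are illustrative.

```python
import torch
import torch.nn as nn

class ConditionalDecoder(nn.Module):
    """Decoder that generates an output from a latent vector z plus a condition label."""
    def __init__(self, latent_dim=32, num_classes=10, output_dim=784):
        super().__init__()
        self.label_emb = nn.Embedding(num_classes, 16)
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 16, 256), nn.ReLU(),
            nn.Linear(256, output_dim), nn.Sigmoid(),
        )

    def forward(self, z, labels):
        c = self.label_emb(labels)                  # condition vector for each sample
        return self.net(torch.cat([z, c], dim=-1))  # generation guided by the condition

decoder = ConditionalDecoder()
z = torch.randn(4, 32)                    # latent samples
labels = torch.tensor([0, 3, 3, 7])       # desired attributes (e.g., digit classes)
print(decoder(z, labels).shape)           # torch.Size([4, 784])
```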
Introduction to GPT: Generative Pre-trained
Transformer
GPT, or Generative Pre-trained Transformer, is a transformer-based language model developed by OpenAI. It can
generate human-quality text. GPT models are pre-trained on a massive corpus of text data and then fine-tuned for
specific tasks such as text summarization and question answering. A key characteristic of GPT is its autoregressive nature.

1. Pre-trained: GPT models are pre-trained on a large corpus of text data, enabling them to learn general language patterns and knowledge.

2. Fine-tuned: These models can be fine-tuned for specific tasks, making them versatile for a wide range of applications.

3. Autoregressive: GPT predicts the next word given the previous words, allowing it to generate coherent and contextually relevant text.
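A hedged usage example of GPT's autoregressive generation using the Hugging Face `transformers` library, which is not part of the slides; the prompt and sampling settings are illustrative.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load a pre-trained GPT-2 model and its tokenizer.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# The model generates a continuation one token at a time, each prediction
# conditioned on the prompt plus the tokens generated so far.
input_ids = tokenizer.encode("Generative AI models can", return_tensors="pt")
output_ids = model.generate(input_ids, max_length=30, do_sample=True, top_k=50,
                            pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```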
Introduction to GPT
• GPT is based on the transformer architecture, which was introduced in the paper "Attention is All
You Need" by Vaswani et al. The core idea behind the transformer is the use of self-attention
mechanisms that process words in relation to all other words in a sentence, contrary to traditional
methods that process words in sequential order.

• This allows the model to weigh the importance of each word no matter its position in the sentence,
leading to a more nuanced understanding of language.

• As a generative model, GPT can produce new content. When provided with a prompt or a part of a
sentence, GPT can generate coherent and contextually relevant continuations. This makes it
extremely useful for applications like creating written content, generating creative writing, or even
simulating dialogue.
Architecture of Generative Pre-trained Transformer
• Important elements
• Self-Attention System: This enables the model to evaluate each word's significance within the
context of the complete input sequence. It makes it possible for the model to comprehend word
linkages and dependencies, which is essential for producing content that is logical and suitable for its
context.
• Layer normalization and residual connections: By reducing problems such as disappearing and
exploding gradients, these characteristics aid in training stabilization and enhance network
convergence.
• Feedforward Neural Networks: These networks process the output of the attention mechanism and
add another layer of abstraction and learning capability. They are positioned between self-attention
layers.
GPT Architecture
1.Input Embedding
• Input: The raw text input is tokenized into
individual tokens (words or subwords).
• Embedding: Each token is converted into a dense
vector representation using an embedding layer.
2.Positional Encoding: Since transformers do not
inherently understand the order of tokens, positional
encodings are added to the input embeddings to retain
the sequence information.
3.Dropout Layer: A dropout layer is applied to the
embeddings to prevent overfitting during training.
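A minimal PyTorch sketch of steps 1 to 3, assuming learned positional embeddings (as in GPT-2); the vocabulary size, embedding width, and context length are illustrative values, not taken from the slides.

```python
import torch
import torch.nn as nn

class GPTEmbedding(nn.Module):
    """Token embedding + positional encoding + dropout (steps 1-3 above)."""
    def __init__(self, vocab_size, d_model, max_len=1024, dropout=0.1):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)   # step 1: token -> dense vector
        self.pos_emb = nn.Embedding(max_len, d_model)        # step 2: position information
        self.dropout = nn.Dropout(dropout)                   # step 3: regularization

    def forward(self, token_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.token_emb(token_ids) + self.pos_emb(positions)
        return self.dropout(x)

emb = GPTEmbedding(vocab_size=50257, d_model=768)
ids = torch.randint(0, 50257, (2, 16))   # batch of 2 sequences, 16 tokens each
print(emb(ids).shape)                    # torch.Size([2, 16, 768])
```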
GPT Architecture
4.Transformer Blocks
• LayerNorm: Each transformer block starts with a layer
normalization.
• Multi-Head Self-Attention: The core component, where the
input passes through multiple attention heads.
• Add & Norm: The output of the attention mechanism is added
back to the input (residual connection) and normalized again.
• Feed-Forward Network: A position-wise feed-forward
network is applied, typically consisting of two linear
transformations with a GeLU activation in between.
• Dropout: Dropout is applied to the feed-forward network
output.
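A sketch of one such transformer block in PyTorch, using a pre-norm layout with a causal mask so each position attends only to earlier positions; the dimensions, head count, and dropout rate are illustrative assumptions rather than an exact GPT configuration.

```python
import torch
import torch.nn as nn

class GPTBlock(nn.Module):
    """One decoder block: LayerNorm -> masked multi-head self-attention -> residual,
    then LayerNorm -> GeLU feed-forward -> dropout -> residual (step 4 above)."""
    def __init__(self, d_model=768, n_heads=12, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        T = x.size(1)
        # Causal mask: position i may only attend to positions <= i.
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal)
        x = x + attn_out                    # residual connection around attention
        x = x + self.ff(self.ln2(x))        # position-wise feed-forward with residual
        return x

block = GPTBlock()
print(block(torch.randn(2, 16, 768)).shape)  # torch.Size([2, 16, 768])
```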
GPT Architecture
5.Layer Stack: The transformer blocks are stacked to form a deeper
model, allowing the network to capture more complex patterns and
dependencies in the input.

6.Final Layers
• LayerNorm: A final layer normalization is applied.
• Linear: The output is passed through a linear layer to map it
to the vocabulary size.
• Softmax: A softmax layer is applied to produce the final
probabilities for each token in the vocabulary.
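Continuing the sketch, steps 5 and 6 stack the blocks and add the final normalization, vocabulary projection, and softmax. This assumes the GPTEmbedding and GPTBlock classes from the previous sketches; the depth and sizes are illustrative.

```python
import torch
import torch.nn as nn

class MiniGPT(nn.Module):
    """Embedding -> stacked transformer blocks -> final LayerNorm -> linear head -> softmax."""
    def __init__(self, vocab_size=50257, d_model=768, n_layers=12):
        super().__init__()
        self.embed = GPTEmbedding(vocab_size, d_model)                 # steps 1-3 (earlier sketch)
        self.blocks = nn.ModuleList([GPTBlock(d_model) for _ in range(n_layers)])  # step 5: layer stack
        self.ln_f = nn.LayerNorm(d_model)                              # step 6: final LayerNorm
        self.head = nn.Linear(d_model, vocab_size)                     # step 6: map to vocabulary size

    def forward(self, token_ids):
        x = self.embed(token_ids)
        for block in self.blocks:
            x = block(x)
        logits = self.head(self.ln_f(x))
        return torch.softmax(logits, dim=-1)   # step 6: probability over the vocabulary per position
```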
Significance of GPT Models
GPT models have revolutionized Natural Language Processing (NLP),
achieving state-of-the-art results in various tasks. Their zero-shot and
few-shot learning capabilities allow them to perform tasks with minimal
task-specific training data.

NLP Revolution: GPT models have significantly advanced NLP, enabling more accurate and human-like language processing.

Zero/Few-shot Learning: Their ability to perform tasks without extensive training data makes them highly adaptable and efficient.

AI Research Impact: GPT has inspired new research directions in generative modeling and transfer learning, pushing the boundaries of AI.
GPT Architecture: Transformer
Basics
GPT's architecture is based on the transformer, an attention-based neural network. The key
components of the transformer include multi-head self-attention mechanisms, feedforward
neural networks, residual connections, and layer normalization. These components work
together to enable parallel processing of input sequences and better handling of long-range
dependencies.

1. Multi-Head Self-Attention: Allows the model to attend to different parts of the input sequence, capturing complex relationships between words.

2. Feedforward Neural Networks: Process the output of the attention mechanism, adding non-linearity and enabling the model to learn more complex patterns.

3. Residual Connections: Help to stabilize training and improve performance by allowing gradients to flow more easily through the network.
GPT Training
• Training a GPT model is a computationally intensive process
that involves feeding it massive amounts of text data and
employing a self-supervised learning approach.

• The model doesn't rely on explicitly labeled data; instead, it learns by identifying patterns and relationships within the data itself.
The Training Process Steps
• Data preparation: The first step is to gather and prepare a massive dataset of text and code. This dataset is carefully curated to
be as diverse and representative as possible, covering a wide range of topics, writing styles, and languages.

• Tokenization: The text data is then divided into smaller units called "tokens." These can be individual words, parts of words, or
even characters, depending on the specific GPT model and the desired level of granularity.

• Model initialization: The GPT model is initialized with random parameters. These parameters will be adjusted during the
training process as the model learns from the data.

• Self-supervised learning: The model is then fed the tokenized text data and tasked with predicting the next token in a sequence.
For example, given the input "The cat sat on the", the model might predict "mat."

• Backpropagation and optimization: The model's predictions are compared to the actual next tokens in the training data, and the
difference between them is used to calculate a "loss" value. This loss represents how far off the model's predictions are from the
truth. The model then uses backpropagation to adjust its internal parameters to minimize this loss. This iterative process of
prediction, loss calculation, and parameter adjustment continues over many epochs, with the model gradually improving its
ability to predict the next token in a sequence accurately.
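A hedged sketch of the prediction, loss calculation, and backpropagation loop described above, assuming a `model` that maps token ids to per-position logits over the vocabulary; all names and the optimizer are illustrative.

```python
import torch
import torch.nn.functional as F

def training_step(model, batch, optimizer):
    """One self-supervised next-token training step on a batch of token ids."""
    inputs, targets = batch[:, :-1], batch[:, 1:]        # predict token t+1 from tokens up to t
    logits = model(inputs)                               # (batch, seq_len, vocab_size)
    # Loss: how far the predicted distribution is from the actual next tokens.
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()                                      # backpropagation
    optimizer.step()                                     # parameter adjustment to reduce the loss
    return loss.item()
```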
Applications of GPT in AI
Content creation

• GPT models can assist in creating high-quality content for websites, blogs, social media, and more. This can be a valuable tool
for businesses and individuals who need to create engaging and informative content on a regular basis.

• One example is using GPT models to draft custom social media posts or write product descriptions, based on the specific
prompts and information given to the model. This can help free up time for other tasks.

Customer service

• These models can be used to power chatbots and virtual assistants that can provide customer support, answer questions, and
resolve issues. This can help businesses to improve customer satisfaction and reduce support costs.

• Imagine being able to get instant customer service support at any time of day or night, without having to wait on hold or
navigate complicated phone menus. This is the potential of AI-powered customer service.
Chatbots

• Outside of customer support, chatbots can also be used by a wider audience to answer questions, and even engage in casual
conversation. As GPT technology continues to develop, expect to see even more sophisticated and human-like chatbots in the
future.

Code generation

• GPT technology has the potential to revolutionize the way developers work. It can be used to assist in computer code
generation, which can be a valuable tool for developers who are looking to automate tasks or speed up the development
process.

• This can free up developers to focus on more complex and creative tasks. Imagine a future where even those with limited
coding experience could bring their ideas to life with the help of AI-powered code generation tools.

Education

• GPT has the possibility to transform education by offering personalized learning experiences tailored to each student's needs. It
can provide tailored feedback, practice problems, interactive modules, study plans, virtual tutors, and language support. This
integration of AI can create an inclusive, engaging, and effective learning environment for all students.
GPT Architecture: Details
GPT employs a decoder-only transformer structure with masked self-attention to prevent attending to future words
during training. Layer normalization stabilizes training and improves performance. GPT models have been scaled up to
billions of parameters.

1. Decoder-Only
2. Masked Self-Attention
3. Layer Normalization
Working of GPT Models:
Pre-training
GPT models undergo pre-training through unsupervised learning on a
massive dataset of unlabeled text data. During pre-training, the model
learns to predict the next word in a sequence. The objective function is
to maximize the likelihood of the training data.
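Written out, this objective is the standard autoregressive log-likelihood (a generic formulation, not copied from the slides): maximize, over the model parameters theta, the summed log-probability of each token given the tokens before it.

```latex
\mathcal{L}(\theta) = \sum_{t=1}^{T} \log P_{\theta}\!\left(x_t \mid x_1, \ldots, x_{t-1}\right)
```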

Data size: 45 TB. GPT-3 was trained on 45 TB of text data, enabling it to learn a vast amount of knowledge about language and the world.
Working of GPT Models: Fine-tuning
After pre-training, GPT models are fine-tuned on specific tasks using labeled data through supervised learning. Task-specific datasets such as
question answering and machine translation are utilized. This process leverages the knowledge learned during pre-training to improve
performance on downstream tasks.

Supervised Learning: Fine-tuning involves supervised learning on specific tasks using labeled data.

Task-Specific Datasets: Datasets tailored to specific tasks such as question answering and translation are used.

Transfer Learning: Leveraging pre-training knowledge improves performance on downstream tasks.
GPT Model Variants and
Advancements
GPT models have seen several variants and advancements over the years.
GPT-2 featured an improved architecture and a larger dataset compared to
GPT-1. GPT-3 significantly expanded the model size with enhanced
capabilities. GPT-4 is a multimodal model capable of accepting image and
text inputs.

Model and key improvements:
• GPT-2: Improved architecture, larger dataset
• GPT-3: Significantly larger model, enhanced capabilities
• GPT-4: Multimodal, accepts image and text inputs
Conclusion: The Future of
Generative AI
1. Crucial Concepts: Encoder-decoder architectures, latent spaces, transformers, and attention mechanisms are crucial concepts in generative AI, paving the way for innovative applications and advancements.

2. Continuous Advancements: Continuous advancements in generative AI models are expected, leading to more sophisticated and versatile systems that can tackle complex tasks and generate increasingly realistic content.

3. Ethical Development: Ethical considerations and responsible development are essential to ensure that generative AI is used for beneficial purposes and does not perpetuate biases or create harmful content.

Generative AI is transforming industries through personalized content creation, drug discovery, and materials science, offering new possibilities and opportunities for innovation.
