
Diffusion

By Aryan Jain
Agenda
1. Theory of diffusion
2. Diffusion for image generation
3. Tricks to improve image synthesis models
4. Latent Diffusion Models
5. Examples of recent diffusion models
Diffusion
Density Modeling for Data Synthesis

Assume that all data comes from a distribution p_data(x):

● The goal of generative machine learning models is to learn this distribution as well as possible; the distribution approximated by the model is denoted p_θ(x)
● We generate new data by sampling from the learned distribution
● In practice, we train models to maximize the expected log likelihood under p_θ(x) (equivalently, minimize the negative log likelihood), which amounts to minimizing the divergence between p_θ(x) and p_data(x) (in symbols below)
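In symbols, a standard way to write this objective (not spelled out on the slide itself) is maximum likelihood, which is equivalent to minimizing a KL divergence because the entropy of p_data does not depend on θ:

$$\theta^{*} = \arg\max_{\theta}\ \mathbb{E}_{x \sim p_{\text{data}}}\big[\log p_{\theta}(x)\big] \;=\; \arg\min_{\theta}\ D_{\mathrm{KL}}\big(p_{\text{data}}(x)\,\|\,p_{\theta}(x)\big)$$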
Prior Methods

VAE:

GAN:
Sampling from Noise
Diffusion

● Another kind of generative modeling technique that takes inspiration from physics (non-equilibrium statistical physics and stochastic differential equations, to be exact)!
● Main idea: convert a well-known and simple base distribution (like a Gaussian) to the target (data) distribution iteratively, with small step sizes, via a Markov chain:
○ Treat the output of the Markov chain as the model's approximation of the learned distribution
○ Inspiration? Estimating and analyzing many small steps is more tractable than describing a single non-normalizable jump from random noise to the learned distribution (which is what VAEs/GANs are doing)
Anatomy of a Diffusion Model

1. Forward Process
2. Reverse Process
Forward Process

● Take a datapoint x_0 and keep gradually adding very small amounts of Gaussian noise to it
○ Vary the parameters of the Gaussian according to a noise schedule controlled by beta_t
● Repeat this process for T steps: as the timestep increases, more and more features of the original input are destroyed
● You can prove with some math that as T approaches infinity, you eventually end up with an isotropic Gaussian (i.e. pure random noise); the forward step is written out below
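In standard DDPM notation, the forward (noising) step and the full forward chain are:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\big), \qquad q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$$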
A neat (reparametrization) trick!

Define α_t = 1 − β_t and ᾱ_t = ∏_{s≤t} α_s.

Then q(x_t | x_0) = N(x_t; √ᾱ_t x_0, (1 − ᾱ_t)I), i.e. x_t = √ᾱ_t x_0 + √(1 − ᾱ_t) ε with ε ~ N(0, I), so we can sample x_t directly from x_0 in a single step.
Reverse Process

● The goal of a diffusion model is to learn the reverse denoising process to iteratively undo the
forward process
● In this way, the reverse process appears as if it is generating new data from random noise!
Finding the exact distribution is hard

The marginal distribution of each timestep and the true reverse conditional q(x_{t-1} | x_t) both depend on the entire data distribution:

● Computing these is computationally intractable (where else have we seen this dilemma?)
● However, we still need them to describe the reverse process. Can we approximate them somehow?
What should the distribution look like?

It turns out that for small enough forward steps (i.e. small β_t), each reverse step can be approximated by a Gaussian distribution too (take a course on stochastic differential equations if you want to learn more)!

Therefore, we can parametrize the learned reverse process as a per-step Gaussian whose mean and covariance are predicted by the model.
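In standard DDPM notation, this parametrization reads:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big), \qquad p_\theta(x_{0:T}) = p(x_T)\prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t), \qquad p(x_T) = \mathcal{N}(0, \mathbf{I})$$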
A preliminary objective

The VAE (ELBO) loss is a bound on the true log likelihood (also called the variational lower bound)

Apply the same trick to diffusion:

Expanding out,
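In the standard form used by the DDPM paper, the bound and its expansion are:

$$\mathbb{E}\big[-\log p_\theta(x_0)\big] \;\le\; \mathbb{E}_q\!\left[-\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right] \;=\; \mathbb{E}_q\Big[\underbrace{D_{\mathrm{KL}}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big)}_{L_T} + \sum_{t>1}\underbrace{D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)}_{L_{t-1}} \underbrace{-\log p_\theta(x_0 \mid x_1)}_{L_0}\Big]$$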
A simplified objective

The reverse step conditioned on x_0 is a Gaussian:
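In standard DDPM notation, this posterior is:

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\big(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t \mathbf{I}\big), \qquad \tilde{\mu}_t = \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,x_0, \qquad \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t$$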

After doing some algebra, each loss term can be approximated by
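With a fixed reverse-step covariance $\sigma_t^2 \mathbf{I}$, each KL term reduces to a weighted mean-matching loss (standard DDPM result):

$$L_{t-1} = \mathbb{E}_q\!\left[\frac{1}{2\sigma_t^2}\,\big\|\tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t)\big\|^2\right] + C$$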


A simplified objective

Instead of predicting the mean μ directly, Ho et al. say that we should predict the noise ε instead!

Thus, our loss becomes
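With the reparametrization x_t = √ᾱ_t x_0 + √(1 − ᾱ_t) ε, the mean can be written in terms of the predicted noise, and the per-step loss becomes (standard DDPM form):

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right), \qquad L_{t-1} = \mathbb{E}_{x_0,\,\epsilon}\!\left[\frac{\beta_t^2}{2\sigma_t^2\,\alpha_t\,(1-\bar{\alpha}_t)}\,\big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\big)\big\|^2\right]$$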


A simplified objective

The authors of DDPM say that it’s fine to drop all that baggage in the front and instead just use

Note that this is not a variational lower bound on the log-likelihood anymore: in fact, you can view it
as a reweighted version of ELBO that emphasizes reconstruction quality!
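Concretely, the simplified objective is just the unweighted noise-prediction error:

$$L_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\Big[\big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\big)\big\|^2\Big]$$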
Training
Sampling
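The two slides above correspond to DDPM's training and sampling algorithms. Below is a minimal PyTorch-style sketch of both loops; `eps_model(x_t, t)` is an assumed noise-prediction network, and the names, schedule, and shapes are illustrative rather than taken from the deck.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # linear beta schedule (DDPM's default)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)    # cumulative products: alpha_bar_t


def training_step(eps_model, x0):
    """One step of the simplified DDPM objective: predict the injected noise."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)          # random timestep per example
    eps = torch.randn_like(x0)                               # the noise we will try to recover
    ab = alpha_bars.to(x0.device)[t].view(b, *([1] * (x0.dim() - 1)))
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps           # sample q(x_t | x_0) in one shot
    return ((eps - eps_model(x_t, t)) ** 2).mean()           # L_simple


@torch.no_grad()
def sample(eps_model, shape, device="cpu"):
    """Ancestral sampling: start from pure noise and denoise for T steps."""
    bet, alp, ab = betas.to(device), alphas.to(device), alpha_bars.to(device)
    x = torch.randn(shape, device=device)                    # x_T ~ N(0, I)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps_hat = eps_model(x, t_batch)
        mean = (x - bet[t] / (1.0 - ab[t]).sqrt() * eps_hat) / alp[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)   # no noise at the last step
        x = mean + bet[t].sqrt() * noise                     # use sigma_t^2 = beta_t
    return x
```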
Diffusion for Image
Generation
Forward process: converting the image distribution to pure noise

Reverse process: sampling from the image distribution, starting with pure noise
UNet + Other Stuff

Diffusion models typically use a U-Net on steroids as the noise prediction model: you take the good ol' model that you are already familiar with and add:
● Positional embeddings
● ResNet blocks
● ConvNeXt blocks
● Attention modules
● Group normalization
● Swish and GELU activations
It's a massive kitchen sink of modern CV tricks
Tricks for Improving
Generation
Linear vs Cosine Schedule

● A linear noise schedule converts initial data to noise really quickly, making the reverse process
harder for the model to learn.
● Researchers hypothesized that a cosine-like function that is changing relatively gradually near
the endpoints might work better
○ Note: It did end up working better but this choice of cosine was completely arbitrary

Linear (top) vs Cosine (bottom)
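For reference, the cosine schedule from Nichol and Dhariwal defines ᾱ_t directly as:

$$\bar{\alpha}_t = \frac{f(t)}{f(0)}, \qquad f(t) = \cos^2\!\left(\frac{t/T + s}{1 + s}\cdot\frac{\pi}{2}\right), \qquad s = 0.008$$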


Learning a Covariance matrix

● The DDPM authors found it better to use a fixed covariance matrix Σ_θ(x_t, t) = σ_t² I, with either σ_t² = β_t or σ_t² = β̃_t.
○ The intuition is that the covariance does not contribute as significantly as the mean to the learned conditional distributions during the reverse process
○ However, learning it can still help us improve log-likelihood!
● So, Nichol and Dhariwal propose
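In the Improved DDPM paper, this takes the form of an interpolation between the two fixed choices in log space, with v output by the network per dimension:

$$\Sigma_\theta(x_t, t) = \exp\!\big(v\,\log \beta_t + (1 - v)\,\log \tilde{\beta}_t\big)$$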

This modification leads to better likelihood estimates while maintaining image quality!
Architecture Improvements

Nichol and Dhariwal proposed several architectural changes that seem to help diffusion training:

1. Increasing model depth vs. width (not both): both help, but increasing width is computationally cheaper while providing similar gains to increased depth
2. Increasing the number of attention heads and applying attention at multiple resolutions
3. Stealing BigGAN residual blocks for upsampling and downsampling
4. Adaptive group normalization: hopes to better incorporate timestep (and potentially class) information during the training/reverse process (see the form below)
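As a concrete example of item 4, the adaptive group normalization layer used in that paper injects the timestep (and class) embedding through a learned scale and shift:

$$\mathrm{AdaGN}(h, y) = y_s \cdot \mathrm{GroupNorm}(h) + y_b$$

where $y = [y_s, y_b]$ comes from a linear projection of the embedding.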
Classifier Guidance

Recall conditional GANs from the previous lectures: they can be conditioned on class labels to
synthesize specific kinds of images. We can apply the same idea to diffusion!

The main idea is this:


● Take a pre-trained unconditional diffusion model
● During sampling, inject the gradients of a classifier model (that is trained from scratch on noisy
images) into the unconditional reverse process
● Classifier guidance trades off image diversity for sample fidelity, allowing it to push the performance of a diffusion model past that of a GAN (the modified sampling step is written out below)
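Concretely, in the Diffusion Models Beat GANs paper, the Gaussian reverse step is shifted by the scaled classifier gradient, so sampling becomes roughly:

$$x_{t-1} \sim \mathcal{N}\!\big(\mu_\theta(x_t, t) + s\,\Sigma_\theta(x_t, t)\,\nabla_{x_t}\log p_\phi(y \mid x_t),\ \ \Sigma_\theta(x_t, t)\big)$$

where $s$ is the guidance scale and $p_\phi$ is the classifier trained on noisy images.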
Classifier Guidance
Classifier Guidance

At a high level:
● FID and sFID capture image quality
● Precision measures image fidelity (“resemblance to training images”)
● Recall measures image diversity/distribution coverage

Lower FID/sFID is better; higher precision and recall are better.
Diffusion Models Beat GANs

(Figures: sample grids comparing BigGAN, Diffusion, and the Training Set)
Classifier-Free Guidance

Classifier guidance worked great, but:

● Can’t use an off-the-shelf pre-trained classifier, since the classifier must be trained on noisy data
● Since classifier guidance injects classifier gradients into the sampling process, is it really much more than a gradient-based adversarial attack on the classifier? It is hard to interpret what classifier guidance is actually doing
● To avoid these dilemmas, researchers proposed forgoing an external classifier altogether and jointly training a class-conditional and an unconditional diffusion model simultaneously
● The goal of this paper was to understand the behavior of guidance, not to push the boundaries of image synthesis, but…
Classifier-Free Guidance
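At sampling time, classifier-free guidance combines the conditional and unconditional noise predictions; with guidance weight w this is commonly written as:

$$\tilde{\epsilon}_\theta(x_t, c) = (1 + w)\,\epsilon_\theta(x_t, c) - w\,\epsilon_\theta(x_t)$$

The conditioning signal c is randomly dropped during training, so a single network learns to play both the conditional and the unconditional role.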
Classifier vs Classifier-Free Guidance

1. Classifier-free guidance is pretty simple to implement, while classifier guidance requires training an external classifier on noisy data
2. Classifier-free guidance does not use any classifier gradients and so cannot be interpreted as an adversarial attack: it is more in line with traditional generative models
3. Classifier-free guidance is slower than classifier guidance, since it requires two model evaluations (conditional and unconditional) at every reverse diffusion step
4. Both lower sample diversity to increase sample fidelity/quality: is this really acceptable?
Diversity (unguided samples) vs. Fidelity (guided samples)
Latent Diffusion Models
Latent Diffusion Models Motivation

● Training models in pixel space is excessively computationally expensive (it can easily take multiple days on a V100 GPU)
○ Even image synthesis is very slow compared to GANs
○ Images are high dimensional → more things to model
● Researchers observed that most “bits” of an image describe its perceptual details, since aggressively compressing an image usually preserves its semantic and conceptual composition
○ In layman’s terms, more bits go toward describing pixel-level details and fewer bits toward describing “the meaning” within an image
○ Generative models should focus on learning the latter
● Can we separate these two components?
Latent Diffusion Models

Latent Diffusion Models can be divided into two stages:


1. Training perceptual compression models that strip away perceptually irrelevant, high-frequency details and learn a latent space that is perceptually equivalent to the image pixel space
a. The loss is a combination of a reconstruction loss, an adversarial loss (remember GANs?) that promotes high-quality decoder reconstructions, and regularization terms

2. Performing the diffusion process in this latent space. There are several benefits to this:
a. The diffusion process focuses only on the relevant semantic bits of the data
b. Performing diffusion in a low-dimensional space is significantly more efficient (a small code sketch follows)
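A minimal sketch of the two-stage idea, assuming a pretrained autoencoder (encoder `E`, decoder `D`), an assumed noise-prediction network `eps_model`, and the same `alpha_bars` schedule as in the earlier DDPM sketch; none of these names come from the actual LDM codebase.

```python
import torch

def ldm_training_step(E, eps_model, x, alpha_bars, T=1000):
    """Stage 2: ordinary noise-prediction training, but on latents z = E(x) instead of pixels."""
    z0 = E(x)                                                  # stage 1: perceptual compression
    b = z0.shape[0]
    t = torch.randint(0, T, (b,), device=z0.device)
    eps = torch.randn_like(z0)
    ab = alpha_bars.to(z0.device)[t].view(b, *([1] * (z0.dim() - 1)))
    z_t = ab.sqrt() * z0 + (1.0 - ab).sqrt() * eps             # forward diffusion in latent space
    return ((eps - eps_model(z_t, t)) ** 2).mean()


@torch.no_grad()
def ldm_sample(D, sample_latents):
    """Run the usual reverse process in latent space, then decode back to pixel space."""
    z0 = sample_latents()                                      # e.g. the sample() loop from the earlier sketch
    return D(z0)
```

The efficiency gain comes from the latents being much smaller than the input images (downsampled by a factor of 4 to 8 in typical LDM configurations), so the U-Net operates on far fewer dimensions.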
U-Net
Examples of Recent
Diffusion Models
DALLE 2 (Text-to-Image)

Prompts: "Teddy bears mixing sparkling chemicals as mad scientists" | "An astronaut riding a horse in a photorealistic style" | "A bowl of soup as a planet in the universe"
Imagen (Text-to-Image)

Prompts: "A cute corgi lives in a house made of sushi" | "A majestic oil painting of a raccoon Queen wearing red French royal gown." | "A robot couple fine-dining with the Eiffel Tower in the background"
Video Diffusion (Text-to-Video)
Make-A-Video (Text-to-Video)

Prompts: "An artist’s brush painting on a canvas close up" | "A young couple walking in heavy rain" | "Horse drinking water"
Make-A-Video (Text-to-Video)

Prompts: "A confused grizzly bear in a calculus class" | "A golden retriever eating ice cream on a beautiful tropical beach at sunset, high resolution" | "A panda playing on a swing set"
Imagen Video (Text-to-Video)
DreamFusion (Text-to-3D)

Prompts: "a fox holding a video game controller" | "a lobster playing the saxophone" | "a corgi wearing a beret and holding a baguette, standing up on two hind legs" | "a human skeleton drinking a glass of red wine"
Diffuser (Trajectory Planning)
Diffusion-QL (Offline RL)

Even applies to RL!

● Recall that diffusion is basically a method for learning unknown distributions.
● In offline RL, the goal is to train an agent from offline datasets
● To ensure
Wrap-Up
Summary

We went over

● A quick tour of generative modeling and how image synthesis can be viewed as sampling from
a density
● Preliminary theory of diffusion (don’t worry if this is confusing; this is a very theory-rich subject and even I don’t know all the details!)
● Some tricks that modern diffusion models employ for image generation:
○ A U-Net architecture equipped with all kinds of modifications
○ Other architecture improvements
○ Several implementation tricks (different noise schedules, covariance parametrizations)
○ Classifier and classifier-free guidance
● Latent diffusion models for improving diffusion quality and efficiency
Main papers referenced here!
● Deep Unsupervised Learning using Nonequilibrium Thermodynamics: https://arxiv.org/pdf/1503.03585.pdf
● Denoising Diffusion Probabilistic Models: https://arxiv.org/pdf/2006.11239.pdf
● Improved Denoising Diffusion Probabilistic Models: https://arxiv.org/pdf/2102.09672.pdf
● Diffusion Models Beat GANs on Image Synthesis: https://arxiv.org/pdf/2105.05233.pdf
● Classifier-Free Diffusion Guidance: https://arxiv.org/pdf/2207.12598.pdf
● High-Resolution Image Synthesis with Latent Diffusion Models: https://arxiv.org/pdf/2112.10752.pdf

Disclaimer: some of the foundational work done on diffusion is relatively math and notation heavy!
Other cool papers to check out!
● Denoising Diffusion Implicit Models: https://arxiv.org/pdf/2010.02502.pdf
● Generative Modeling by Estimating Gradients of the Data Distribution: https://yang-song.net/blog/2021/score/
● Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions: https://arxiv.org/pdf/2209.11215.pdf
Even more resources!
Other resources:
● Lilian Weng’s blog: https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
● The Annotated Diffusion Model: https://huggingface.co/blog/annotated-diffusion
● The Illustrated Stable Diffusion: https://jalammar.github.io/illustrated-stable-diffusion/
● PyTorch implementation of the DDPM UNet: https://nn.labml.ai/diffusion/ddpm/unet.html
● Guidance: a cheat code for diffusion models: https://benanne.github.io/2022/05/26/guidance.html
● Understanding Diffusion Models: A Unified Perspective: https://arxiv.org/pdf/2208.11970.pdf
