Diffusion
By Aryan Jain
Agenda
1. Theory of diffusion
2. Diffusion for image generation
3. Tricks to improve image synthesis models
4. Latent Diffusion Models
5. Examples of recent diffusion models
Diffusion
Density Modeling for Data Synthesis
VAE: learns a density p(x) by maximizing a variational lower bound (the ELBO) on the log-likelihood
GAN: learns a generator that maps noise to samples by playing an adversarial minimax game against a discriminator
Sampling from Noise
Diffusion
● Another kind of generative modeling technique that takes inspiration from physics (non-equilibrium statistical physics and stochastic differential equations, to be more exact)!
● Main idea: convert a well-known and simple base distribution (like a Gaussian) to the target (data) distribution iteratively, with small step sizes, via a Markov chain:
○ Treat the output of the Markov chain as the model’s approximation of the learned distribution
○ Inspiration? Estimating and analyzing small steps is more tractable than describing a single, non-normalizable jump from random noise to the learned distribution (which is what VAEs/GANs are doing)
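In symbols (standard diffusion notation, not written out on the slide): the model generates data by running a learned Markov chain backwards from pure noise,

p_\theta(x_0) = \int p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t) \, dx_{1:T}, \qquad p(x_T) = \mathcal{N}(0, I)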
Anatomy of a Diffusion Model
1. Forward Process
2. Reverse Process
Forward Process
● Take a datapoint x_0 and keep gradually adding very small amounts of Gaussian noise to it
○ Vary the parameters of the Gaussian according to a noise schedule controlled by beta_t
● Repeat this process for T steps: as the timesteps increase, more and more features of the original input are destroyed
● You can prove with some math that as T approaches infinity, you eventually end up with an isotropic Gaussian (i.e. pure random noise)
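For concreteness, the standard DDPM forward step is

q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right)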
A neat (reparametrization) trick!
Define \alpha_t = 1 - \beta_t and \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s.
Then x_t can be sampled directly from x_0 in a single step:

q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t) I\right), \quad \text{i.e.} \quad x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \;\; \epsilon \sim \mathcal{N}(0, I)
Reverse Process
● The goal of a diffusion model is to learn the reverse denoising process to iteratively undo the
forward process
● In this way, the reverse process appears as if it is generating new data from random noise!
Finding the exact distribution is hard
By Bayes’ rule, the true reverse conditional depends on the marginal distribution of every timestep:

q(x_{t-1} \mid x_t) = \frac{q(x_t \mid x_{t-1})\, q(x_{t-1})}{q(x_t)}

● Computing the marginals q(x_t) requires integrating over the entire data distribution, which is computationally intractable (where else have we seen this dilemma?)
● However, we still need these conditionals to describe the reverse process. Can we approximate them somehow?
What should the distribution look like?
If each beta_t is small enough, every reverse step can be well approximated by a Gaussian too (take a course on stochastic differential equations if you want to learn more)! So we learn a Gaussian

p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)

such that p_\theta(x_{t-1} \mid x_t) \approx q(x_{t-1} \mid x_t).
A preliminary objective
The VAE (ELBO) loss is a bound on the true log-likelihood (also called the variational lower bound).
Expanding out,

L_{\text{VLB}} = \mathbb{E}_q\Big[ \underbrace{D_{KL}\big(q(x_T \mid x_0) \,\|\, p(x_T)\big)}_{L_T} + \sum_{t=2}^{T} \underbrace{D_{KL}\big(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\big)}_{L_{t-1}} - \underbrace{\log p_\theta(x_0 \mid x_1)}_{L_0} \Big]
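A useful fact that makes the KL terms above computable in closed form (standard DDPM algebra, not shown on the slide): once we also condition on x_0, the reverse step becomes a tractable Gaussian,

q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I\right), \quad \tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t, \quad \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\, \beta_t

(this \tilde{\beta}_t is the same quantity that reappears in the covariance discussion later).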
A simplified objective
Instead of predicting mu directly, Ho et al. say that we should predict the noise epsilon instead, reparametrizing the mean as

\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t)\right)

The authors of DDPM say that it’s fine to drop all the weighting terms in front and instead just use

L_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[ \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2 \right]

Note that this is not a variational lower bound on the log-likelihood anymore: in fact, you can view it as a reweighted version of the ELBO that emphasizes reconstruction quality!
Training
Sampling
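A minimal sketch of what these two algorithms look like in code, assuming a noise-prediction network eps_model(x_t, t); the function and variable names are illustrative, not from the slides:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # \bar{alpha}_t

def training_step(eps_model, x0):
    """One DDPM training step: add noise at a random timestep, predict it back."""
    t = torch.randint(0, T, (x0.shape[0],))               # t ~ Uniform({0..T-1})
    eps = torch.randn_like(x0)                            # eps ~ N(0, I)
    ab = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over image dims
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps          # sample from q(x_t | x_0)
    return ((eps - eps_model(x_t, t)) ** 2).mean()        # L_simple

@torch.no_grad()
def sample(eps_model, shape):
    """DDPM ancestral sampling: start from pure noise, denoise for T steps."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        eps = eps_model(x, torch.full((shape[0],), t))
        mean = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * z  # fixed variance sigma_t^2 = beta_t
    return x
```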
Diffusion for Image Generation
Forward process: converting the image distribution to pure noise
● A linear noise schedule converts the initial data to noise really quickly, making the reverse process harder for the model to learn.
● Researchers hypothesized that a cosine-like function that changes relatively gradually near the endpoints might work better (sketched below)
○ Note: it did end up working better, but this particular choice of cosine was fairly arbitrary
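A sketch of the cosine schedule from Improved DDPM (Nichol and Dhariwal); the offset s and the 0.999 clipping follow the paper, the function names are illustrative:

```python
import math

def cosine_alpha_bars(T, s=0.008):
    # \bar{alpha}_t = f(t) / f(0), with f(t) = cos^2(((t/T) + s) / (1 + s) * pi/2)
    f = lambda t: math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(T + 1)]

def cosine_betas(T, s=0.008, max_beta=0.999):
    # Recover per-step betas: beta_t = 1 - \bar{alpha}_t / \bar{alpha}_{t-1},
    # clipped away from 1 to avoid numerical blow-ups near t = T
    ab = cosine_alpha_bars(T, s)
    return [min(1 - ab[t + 1] / ab[t], max_beta) for t in range(T)]
```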
● DDPM authors said that it’s better to use a fixed covariance matrix

\Sigma_\theta(x_t, t) = \sigma_t^2 I, \quad \text{where } \sigma_t^2 = \beta_t \text{ or } \sigma_t^2 = \tilde{\beta}_t

○ The intuition is that covariance does not contribute as significantly as the mean does to the learned conditional distributions during the reverse process
○ However, learning it can still help us improve log-likelihood!
● So, Nichol and Dhariwal propose learning the covariance as an interpolation between the two extremes (in log space):

\Sigma_\theta(x_t, t) = \exp\!\left(v \log \beta_t + (1 - v) \log \tilde{\beta}_t\right)

This modification leads to better likelihood estimates while maintaining image quality!
Architecture Improvements
Nichol and Dhariwal proposed several architectural changes that seem to help diffusion training:
1. Increasing model width rather than depth: both help, but increasing width is computationally cheaper while providing similar gains to increased depth
2. Increasing the number of attention heads and applying attention at multiple resolutions
3. Stealing BigGAN residual blocks for upsampling and downsampling
4. Adaptive Group Normalization, which aims to better incorporate timestep (and potentially class) information during the training/reverse process (see the sketch after this list)
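A rough sketch of point 4. The module layout and the (1 + scale) residual formulation are my assumptions; the paper describes it as AdaGN(h, y) = y_s · GroupNorm(h) + y_b, where y comes from the timestep/class embedding:

```python
import torch.nn as nn

class AdaGN(nn.Module):
    """Adaptive Group Normalization (sketch): a timestep/class embedding
    produces a per-channel scale and shift applied after GroupNorm."""
    def __init__(self, channels, emb_dim, groups=32):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels)
        self.proj = nn.Linear(emb_dim, 2 * channels)  # embedding -> (scale, shift)

    def forward(self, h, emb):
        scale, shift = self.proj(emb).chunk(2, dim=-1)
        # broadcast (B, C) -> (B, C, 1, 1) to match image feature maps
        return self.norm(h) * (1 + scale[..., None, None]) + shift[..., None, None]
```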
Classifier Guidance
Recall conditional GANs from the previous lectures: they can be conditioned on class labels to
synthesize specific kinds of images. We can apply the same idea to diffusion!
At a high level: train a classifier on noisy images x_t, then use its gradients to nudge each reverse diffusion step toward the desired class label.
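Written out (reconstructed from the Diffusion Models Beat GANs paper; s is the guidance scale), each reverse step samples around a shifted mean:

\tilde{\mu} = \mu_\theta(x_t, t) + s\, \Sigma_\theta(x_t, t)\, \nabla_{x_t} \log p_\phi(y \mid x_t)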
Evaluating the samples:
● FID and sFID capture image quality (lower FID/sFID is better)
● Precision measures image fidelity (“resemblance to training images”)
● Recall measures image diversity/distribution coverage
Higher Precision and Recall is better
Diffusion Models Beat GANs
Classifier-free guidance vs. classifier guidance:
1. Classifier-free guidance is pretty simple to implement, while classifier guidance requires training an external classifier on noisy data
2. Classifier-free guidance does not employ any kind of classifier gradients and so cannot be interpreted as an adversarial attack: it is more in line with traditional generative models
3. Classifier-free guidance is slower than classifier guidance by virtue of requiring two model evaluations (conditional and unconditional) per reverse diffusion step
4. Both lower sample diversity to increase sample fidelity/quality: is this really acceptable?
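For reference, the classifier-free guidance rule (from Ho and Salimans; w is the guidance weight and \varnothing denotes the null/unconditional input):

\tilde{\epsilon}_\theta(x_t, c) = (1 + w)\, \epsilon_\theta(x_t, c) - w\, \epsilon_\theta(x_t, \varnothing)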
Diversity (unguided samples) vs. Fidelity (guided samples)
Latent Diffusion Models
Latent Diffusion Models Motivation
● Training models in the pixel space is excessively computationally expensive (training can easily take multiple days on a V100 GPU)
○ Even image synthesis is very slow compared to GANs
○ Images are high dimensional → more things to model
● Researchers observed that most “bits” of an image contribute to its perceptual characteristics, since aggressively compressing an image usually maintains its semantic and conceptual composition
○ In layman’s terms, more bits go toward describing pixel-level details and fewer bits toward describing “the meaning” within an image
○ Generative models should learn the latter
● Can we separate these two components?
Latent Diffusion Models
The idea boils down to two stages:
1. Training an autoencoder that compresses images into a lower-dimensional latent space while preserving their semantic content
2. Performing the diffusion process in this latent space. There are several benefits to this (see the sketch after this list):
a. The diffusion process only focuses on the relevant semantic bits of the data
b. Performing diffusion in a low-dimensional space is significantly more efficient
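A minimal sketch of that two-stage recipe, reusing the hypothetical training_step and sample functions from the DDPM sketch earlier; encoder and decoder stand in for a pretrained, frozen stage-1 autoencoder:

```python
import torch

def ldm_training_step(encoder, eps_model, x0):
    # Stage 1 (frozen): compress the image into a compact latent
    with torch.no_grad():
        z0 = encoder(x0)
    # Stage 2: ordinary DDPM training, but on latents instead of pixels
    return training_step(eps_model, z0)

@torch.no_grad()
def ldm_sample(decoder, eps_model, latent_shape):
    z = sample(eps_model, latent_shape)  # reverse diffusion in latent space
    return decoder(z)                    # decode the latent back into an image
```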
U-Net: the denoising network epsilon_theta is typically a U-Net, an encoder-decoder with skip connections between matching resolutions.
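A toy illustration of the U-Net shape; this is not the paper’s architecture, and it omits the timestep embedding, attention, and residual blocks that real diffusion U-Nets use:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Downsample, bottleneck, upsample, with a skip connection
    joining matching resolutions."""
    def __init__(self, ch=64):
        super().__init__()
        self.down1 = nn.Conv2d(3, ch, 3, stride=2, padding=1)       # H -> H/2
        self.down2 = nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1)  # H/2 -> H/4
        self.mid = nn.Conv2d(ch * 2, ch * 2, 3, padding=1)
        self.up1 = nn.ConvTranspose2d(ch * 2, ch, 4, stride=2, padding=1)  # H/4 -> H/2
        self.up2 = nn.ConvTranspose2d(ch * 2, 3, 4, stride=2, padding=1)   # H/2 -> H

    def forward(self, x):
        d1 = torch.relu(self.down1(x))
        d2 = torch.relu(self.down2(d1))
        m = torch.relu(self.mid(d2))
        u1 = torch.relu(self.up1(m))
        u1 = torch.cat([u1, d1], dim=1)  # skip connection: reuse encoder features
        return self.up2(u1)
```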
Examples of Recent
Diffusion Models
DALLE 2 (Text-to-Image)
Teddy bears mixing sparkling chemicals as mad scientists · An astronaut riding a horse in a photorealistic style · A bowl of soup as a planet in the universe
Imagen (Text-to-Image)
A cute corgi lives in a house made of sushi · A majestic oil painting of a raccoon Queen wearing red French royal gown · A robot couple fine-dining with the Eiffel Tower in the background
Video Diffusion (Text-to-Video)
Make-A-Video (Text-to-Video)
An artist’s brush painting on a canvas, close up · A young couple walking in heavy rain · Horse drinking water
Make-A-Video (Text-to-Video)
A confused grizzly bear in a calculus class · A golden retriever eating ice cream on a beautiful tropical beach at sunset, high resolution · A panda playing on a swing set
Imagen Video (Text-to-Video)
DreamFusion (Text-to-3D)
a corgi wearing a beret and holding a baguette, standing up on two hind legs · a human skeleton drinking a glass of red wine
Diffuser (Trajectory Planning)
Diffusion-QL (Offline RL)
We went over
● A quick tour of generative modeling and how image synthesis can be viewed as sampling from
a density
● Preliminary theory of diffusion (don’t worry if this is confusing; this is a very theory-rich subject and even I don’t know all the details!)
● Some tricks that modern diffusion models employ for image generation:
○ A U-Net architecture equipped with all kinds of modifications
○ Other architecture improvements
○ Several implementation tricks (different noise schedules, covariance parametrizations)
○ Classifier and classifier-free guidance
● Latent diffusion models for improving diffusion quality and efficiency
Main papers referenced here!
● Deep Unsupervised Learning using Nonequilibrium Thermodynamics: https://arxiv.org/pdf/1503.03585.pdf
● Denoising Diffusion Probabilistic Models: https://arxiv.org/pdf/2006.11239.pdf
● Improved Denoising Diffusion Probabilistic Models: https://arxiv.org/pdf/2102.09672.pdf
● Diffusion Models Beat GANs on Image Synthesis: https://arxiv.org/pdf/2105.05233.pdf
● Classifier-free Diffusion Guidance: https://arxiv.org/pdf/2207.12598.pdf
● High-Resolution Image Synthesis with Latent Diffusion Models: https://arxiv.org/pdf/2112.10752.pdf

Disclaimer: some of the foundational work done on diffusion is relatively math and notation heavy!

Other cool papers to check out!
● Denoising Diffusion Implicit Models: https://arxiv.org/pdf/2010.02502.pdf
● Generative Modeling by Estimating Gradients of the Data Distribution: https://yang-song.net/blog/2021/score/
● Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions: https://arxiv.org/pdf/2209.11215.pdf