Diffusion Based Causal Representation Learning
3 Max Planck Institute for Intelligent Systems, Tübingen, Germany
4 Technical University of Munich
5 Department of Computer Science, University of Oxford
Abstract
Causal reasoning can be considered a cornerstone of intelligent systems. Having access
to an underlying causal graph comes with the promise of cause-effect estimation and the
identification of efficient and safe interventions. However, learning causal representations
remains a major challenge, due to the complexity of many real-world systems. Previous
work on causal representation learning has mostly focused on Variational Auto-Encoders
(VAEs). These methods only provide representations from a point estimate, and they are
unsuitable for handling high dimensions. To overcome these problems, we propose a new
Diffusion-based Causal Representation Learning (DCRL) algorithm, which uses
diffusion-based representations for causal discovery. DCRL provides access to
infinite-dimensional latent codes that encode different levels of information.
As a first proof of principle, we investigate the use of DCRL for causal representation
learning and demonstrate experimentally that this approach performs comparably
well in identifying the causal structure and the causal variables.
1 Introduction
Causal representation learning aims to uncover a system’s latent causal factors and their
relationships from observed low-level data. It finds applications in domains such as
autonomous driving [Schölkopf et al., 2021], robotics [Hellström, 2021],
healthcare [Anwar et al., 2014], climate studies [Runge et al., 2019], epidemiology [Hernán
et al., 2000, Robins et al., 2000], and finance [Hiemstra and Jones, 1994]. In these tasks,
the underlying causal variables are often unknown, and we only have access to low-level
representations.
Causal representation learning is a challenging problem: identifying latent causal factors
is generally impossible from observational data alone. There has been an ongoing effort to study
sets of assumptions that ensure the identifiability of causal variables and their relationships
[Brehmer et al., 2022, Liu et al., 2022, Schölkopf et al., 2021, Subramanian et al., 2022, Yang
et al., 2020]. These approaches assume the availability of additional information or place
assumptions on the underlying causal structure of the data-generating process (DGP). Interestingly,
Brehmer et al. [2022] consider a weak form of supervision in which we have access to a data pair
corresponding to the state of the system before and after a random, unknown intervention. Brehmer et al.
[2022] prove that, in this weakly-supervised setting, the structure and the causal variables are
identifiable up to a relabeling and element-wise reparameterization.
There has been a growing interest in leveraging generative models to learn causal representations
with specific properties. For example, disentangled and object-centric representations have
been shown to be helpful for complex downstream tasks and generalization [Dittadi et al., 2022,
Papa et al., 2022, Van Steenkiste et al., 2019, Wu et al., 2022, Yoon et al., 2023]. Variational
Autoencoders (VAE) [Kingma and Welling, 2014] are among the most widely studied generative
models, and they have been successfully used for disentanglement and causal representation
learning [Brehmer et al., 2022, Locatello et al., 2020]. However, the problem of learning causal
representations has not yet been approached with more powerful generative models.
Recently, diffusion models have emerged as state-of-the-art generative models, and they have
demonstrated remarkable success across several domains [Dhariwal and Nichol, 2021a, Ho
et al., 2022b, Höppe et al., 2022, Ramesh et al., 2022, Saharia et al., 2022]. Diffusion models
draw on concepts and principles from diffusion processes to learn the data distribution [Cai
et al., 2020, Chen et al., 2021, Dhariwal and Nichol, 2021b, Ho et al., 2020, 2022a, Luhman
and Luhman, 2021, Mehrjou et al., 2017, Niu et al., 2020, Sajjadi et al., 2018, Saremi et al.,
2018, Sohl-Dickstein et al., 2015, Song et al., 2021a,b,c]. These
models exploit diffusion behavior to produce diverse, high-quality, and realistic samples.
Furthermore, diffusion-based models have the appealing property of infinite-dimensional latent
codes [Abstreiter et al., 2022], which allows representations to be learned efficiently across different
downstream tasks. Despite their remarkable performance and advantages, diffusion models
have not yet been employed for causal representation learning, indicating that their potential
has yet to be explored in this context.
Our contribution. In this work, we study the connection between diffusion-based models
and causal structure learning. In particular, our contributions are the following:
• We propose DCRL, a diffusion-based model for causal representation learning. We study
and test the connection between the learned representations of DCRL and the causal variables,
utilizing both finite and infinite-dimensional representations.
• We derive the Evidence Lower Bound (ELBO) for DCRL, in the case of both finite and
infinite-dimensional representations.
• We empirically illustrate that the noise and diffusion-based representations contain
equivalent information about the underlying causal variables and causal mechanisms,
and can be used interchangeably.
2 Related Work
Diffusion-based Representation Learning. Learning representations with diffusion models
remains a relatively unexplored area. Several works train an external module (e.g.,
an encoder) along with the score function of the diffusion model to extract representations.
Abstreiter et al. [2022] and Mittal et al. [2022] condition the score function of a diffusion model
on a time-independent and a time-dependent encoder, obtaining finite and infinite-dimensional
representations, respectively. Wang et al. [2023] use the same conditioning but regularize
the objective function with the mutual information between the input data and the learned
representations. Traub [2022] also uses this conditioning, but within Latent Diffusion Models
[Rombach et al., 2022], where the inputs of the diffusion model are latent variables obtained by
applying a pre-trained autoencoder to the input. Furthermore, Kwon et al. [2022] propose an
asymmetric reverse process that discovers the semantic latent space of a frozen diffusion model,
in which modifications synthesize various attributes on input images. In principle, however,
diffusion models lack a semantic latent space, and it is unclear how to efficiently learn
representations using their capabilities.
2.1 Overview
The fundamental concept behind diffusion-based generative models is to learn to generate data
by inverting a diffusion process. Diffusion models comprise two processes: a forward process
and a backward process. The forward process gradually adds noise to data and maps data
to (almost) pure noise. The backward process, on the other hand, is used to go from a noise
sample back to the original data space.
The forward process is defined by a stochastic differential equation (SDE) across a continuous
time domain t ∈ [0, 1], aiming to transform the data distribution to a known prior distribution,
typically a standard multivariate Gaussian. Given x0 sampled from a data distribution p(x),
the forward process constructs a trajectory (xt )t∈[0,1] across the time domain. We utilize the
Variance Exploding SDE [Song et al., 2021c] for the forward process, which is defined as:
$$\mathrm{d}x = f(x, t)\,\mathrm{d}t + g(t)\,\mathrm{d}w := \sqrt{\frac{\mathrm{d}[\sigma^2(t)]}{\mathrm{d}t}}\,\mathrm{d}w,$$

where w is the standard Wiener process and σ²(t) is the noise variance of the diffusion process
at time t. The backward process is also formulated as an SDE in the following manner:

$$\mathrm{d}x = \big[f(x, t) - g^2(t)\,\nabla_x \log p_t(x)\big]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{w},$$
where w̄ is the standard Wiener process in reverse time.
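To make the forward process concrete, the following sketch samples from the VE-SDE perturbation kernel p_t(x_t|x_0) = N(x_0, [σ²(t) − σ²(0)]I); the values of σ_min and σ_max mirror Appendix B, while the function names and batch shapes are illustrative.

```python
import torch

def sigma(t, sigma_min=0.01, sigma_max=50.0):
    """Noise scale of the Variance Exploding SDE at time t in [0, 1]."""
    return sigma_min * (sigma_max / sigma_min) ** t

def perturb(x0, t, sigma_min=0.01, sigma_max=50.0):
    """Sample x_t ~ p_t(x_t | x_0) = N(x_0, [sigma^2(t) - sigma^2(0)] I)."""
    std = torch.sqrt(sigma(t, sigma_min, sigma_max) ** 2 - sigma_min ** 2)
    std = std.view(-1, *([1] * (x0.dim() - 1)))       # broadcast over data dims
    noise = torch.randn_like(x0)
    return x0 + std * noise, noise, std

# Example: perturb a batch of 16-dimensional inputs at random times.
x0 = torch.randn(64, 16)
xt, noise, std = perturb(x0, torch.rand(64))
```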
Score matching. To use this backward process, the score function ∇x log pt (x) is required.
It is usually approximated by a neural score function sθ (·) which can be trained by Explicit
Score Matching [Hyvärinen and Dayan, 2005] defined as:
" #
h i
2
L(θ) = Et λ(t)Ep(xt ) ||sθ (xt , t) − ∇xt log pt (xt )|| ,
However, the ground-truth score function ∇x log pt (x) is generally not known. Vincent [2011]
addresses this issue by proposing Denoising Score Matching. The approximate score function
is then learned by minimizing the loss function:
" #
h i
2
L(θ) = λ(t)Ex0 Ep(xt |x0 ) ||sθ (xt , t) − ∇xt log pt (xt |x0 )|| ,
where the conditional distribution of xt given x0 is pt(xt|x0) = N(xt; x0, [σ²(t) − σ²(0)]I) and
λ(t) is a positive weighting function. This objective originates from the evidence lower
bound (ELBO) of the data distribution, and it has been shown that, with a specific weighting
function, it becomes exactly a term in the ELBO [Song et al., 2021c]. For
more details, see Appendix A.
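As a sketch of how denoising score matching is optimized in practice, the snippet below implements the loss for the VE SDE with a generic score network; the weighting choice λ(t) = σ²(t) and the approximation σ(0) ≈ 0 are simplifying assumptions of this sketch.

```python
import torch

def dsm_loss(score_net, x0, sigma_min=0.01, sigma_max=50.0):
    """Denoising score matching for the VE SDE (a sketch).

    With p_t(x_t | x_0) ~= N(x_0, sigma(t)^2 I) (sigma(0) is negligibly small),
    the target score is grad_xt log p_t(x_t | x_0) = -(x_t - x_0) / sigma(t)^2.
    """
    t = torch.rand(x0.shape[0], device=x0.device) * (1 - 1e-5) + 1e-5
    std = (sigma_min * (sigma_max / sigma_min) ** t).view(-1, *([1] * (x0.dim() - 1)))
    noise = torch.randn_like(x0)
    xt = x0 + std * noise                  # sample from the perturbation kernel
    target = -noise / std                  # score of the Gaussian kernel
    weight = std ** 2                      # one possible choice of lambda(t)
    return (weight * (score_net(xt, t) - target) ** 2).mean()
```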
To learn representations, Abstreiter et al. [2022] extend this objective to

$$\mathcal{L}(\theta, \phi) = \mathbb{E}_t\Big[\lambda(t)\,\mathbb{E}_{x_0}\,\mathbb{E}_{p(x_t|x_0)}\big[\|s_\theta(x_t, E_\phi(x_0), t) - \nabla_{x_t}\log p_t(x_t|x_0)\|^2\big]\Big], \tag{1}$$

where the score function is conditioned on a module Eϕ(x0), which provides additional information
about the data to the diffusion model through a learned encoder with parameters ϕ. In
fact, the encoder learns to extract the information from x0, in a reduced-dimensional space,
that helps recover x0 by denoising xt. Abstreiter et al. [2022] also present an alternative
objective where the encoder is a function of time. Formally, the new objective is
" #
h i
L(θ, ϕ) = Et λ(t)Ex0 Ep(xt |x0 ) ||sθ (xt , Eϕ (x0 , t), t) − ∇xt log pt (xt |x0 )||2 , (2)
With this objective, the encoder learns a representation trajectory of x0 instead of a single
representation. Since the objective can in principle be minimized to zero, the encoder Eϕ(·)
is encouraged to learn meaningful, distinct representations at different timesteps
[Abstreiter et al., 2022, Mittal et al., 2022].
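A corresponding sketch of the conditioned objective in Eq. 2, with a hypothetical time-dependent encoder `encoder(x0, t)` whose output is passed to the score network as an additional input; dropping the encoder's time argument recovers the single-code objective of Eq. 1.

```python
import torch

def conditioned_dsm_loss(score_net, encoder, x0, sigma_min=0.01, sigma_max=50.0):
    """Eq. 2 (sketch): the score network additionally receives the encoder
    output E_phi(x_0, t); dropping the time argument gives Eq. 1."""
    t = torch.rand(x0.shape[0], device=x0.device) * (1 - 1e-5) + 1e-5
    std = (sigma_min * (sigma_max / sigma_min) ** t).view(-1, *([1] * (x0.dim() - 1)))
    noise = torch.randn_like(x0)
    xt = x0 + std * noise                   # noisy input
    code = encoder(x0, t)                   # representation trajectory e_t
    residual = score_net(xt, code, t) + noise / std   # s_theta minus target score
    return ((std * residual) ** 2).mean()   # lambda(t) = sigma^2(t) weighting
```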
Figure 1: Overview of our framework. Here we have a paired image of a face before and after
an intervention (the smile). The paired image is mapped to latent variables by a stochastic
encoder. The intervention target is determined by applying the intervention encoder to these
latent variables. To maintain the weakly supervised structure, the latent variables are projected
into a new pair and then, serve as the conditioning module for a conditional diffusion model
(The projected latent variables are diffusion-based representations of the input pair). Finally,
they are utilized in neural solution functions together with the intervention target to obtain
the latent causal variables.
The representations along the trajectory contain different levels of information, as highlighted by Mittal et al. [2022].
In this work, we first explore a time-independent single code, employing Eq. 1, and show
that with a certain weighting function this objective becomes the ELBO. We then apply
the same experiments with the infinite-dimensional latent code (Eq. 2) and study the
benefits and implications of both formulations for causal representation learning.
3 Problem Description
We consider a system described by an unknown underlying structural causal model (SCM) over the latent causal
variables Z, for which we only have access to low-level data pairs (x, x̃) ∼ p(x, x̃) representing the system
before and after a random, unknown, atomic intervention. Under this
weakly supervised setting, the causal variables and causal mechanisms are identifiable
up to a permutation and elementwise reparameterization of the variables [Brehmer et al.,
2022]. Our objective is to learn an SCM that accurately represents the true underlying SCM
associated with the given data, up to such a permutation and elementwise reparameterization.
To this end, we train an SCM by maximizing the likelihood of the data.
With sufficient data and perfect optimization, we can find the SCM that is equivalent to the
ground-truth SCM.
The Encoding and the Intervention Module. The encoding module consists of two
main parts: the stochastic encoder and the projection module. The stochastic encoder q(e|x)
maps data pairs (x, x̃) to pre-projection latent variables (e, ẽ). The encoded inputs are then
utilized in the intervention module q(I|x, x̃) to infer the intervention target I for the data pair
(x, x̃). Based on our data generation process, the encoded inputs have the property that
e_i ≠ ẽ_i only for the elements that are intervened upon, i ∈ I, while the remaining elements
stay the same. Based on this property, in order to infer interventions, we employ an intervention
module q(I|e, ẽ), defined heuristically as
$$\log q(i \in I \mid x, \tilde{x}) = \frac{1}{Z}\Big(\alpha + \beta\,\big|\mu_e(x)_i - \mu_e(\tilde{x})_i\big| + \gamma\,\big|\mu_e(x)_i - \mu_e(\tilde{x})_i\big|^2\Big),$$
where µ_e(x) is the mean of the stochastic encoder q(e|x), α, β, and γ are learnable parameters,
and Z is a normalization constant. This simple heuristic assigns a higher likelihood to a
component the more it changes between the encoded inputs in response to the intervention.
Once the intervention is inferred from the pre-projection latent variables,
we apply the projection module. The projection module depends on the inferred intervention
target I and projects the encoded input (e, ẽ) to new latent variables such that for the
components that are not intervened upon, i ∉ I, the pre-intervention and post-intervention
latent components are equal, e_i = ẽ_i. This prevents the solution functions from deviating from
the weakly supervised structure.
We write the combination of the encoder and the projection module as q(e, ẽ|x, x̃, I) and refer
to it as the encoding module. By this definition, the encoding module q(e, ẽ|x, x̃, I) maps the
input (x, x̃) to latent variables (e, ẽ), and the intervention module infers the intervention I
based on the pre-projection latent variables.
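The following sketch illustrates the intervention heuristic and the projection step described above. The softmax normalization and the hard argmax choice of a single (atomic) intervention target are illustrative assumptions; in the model, α, β, and γ are learnable and the posterior is used probabilistically.

```python
import torch
import torch.nn.functional as F

def intervention_posterior(mu, mu_tilde, alpha, beta, gamma):
    """q(i in I | x, x~): score each latent component by how much its encoder
    mean changes between the paired inputs; the softmax plays the role of the
    normalization constant Z."""
    diff = (mu - mu_tilde).abs()
    return F.softmax(alpha + beta * diff + gamma * diff ** 2, dim=-1)

def project(e, e_tilde, intervened):
    """Projection: set e~_i = e_i for every non-intervened component i,
    preserving the weakly supervised structure."""
    return e, torch.where(intervened, e_tilde, e)

# Example with a batch of 5-dimensional pre-projection latents.
mu, mu_tilde = torch.randn(64, 5), torch.randn(64, 5)
q_I = intervention_posterior(mu, mu_tilde, alpha=0.0, beta=1.0, gamma=0.1)
intervened = F.one_hot(q_I.argmax(dim=-1), num_classes=5).bool()  # hard atomic pick
e, e_tilde_proj = project(mu, mu_tilde, intervened)
```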
Prior. Given the intervention target I and latent variables (e, ẽ), we define the prior as
p(e, ẽ, I) = p(I)p(e)p(ẽ|e, I). The role of the prior distribution is to implicitly capture
the causal structure and causal mechanisms of the system. Specifically, p(I) and p(e)
denote the prior distributions over intervention targets and latent variables, and are
configured as a uniform categorical and a standard Gaussian distribution, respectively.
According to our data generation process, when an intervention is applied, only the intervened-upon
elements of the latent variables are altered; the other elements remain unchanged
and independent of each other. Consequently, we can define p(ẽ|e, I) as follows:
$$p(\tilde{e} \mid e, I) = \prod_{i \notin I} \delta(\tilde{e}_i - e_i)\;\prod_{i \in I} p(\tilde{e}_i \mid e).$$
In this equation, δ(·) is the Dirac delta function, which enforces equality of the non-intervened
elements of the latent variables.
Neural Solution Functions. Finally, in order to encode the information about the intervened
variables, we incorporate a conditional normalizing flow p(ẽi |e) defined as
$$p(\tilde{e}_i \mid e) = \tilde{p}\big(h_i(\tilde{e}_i; e_i)\big)\left|\frac{\partial h_i(\tilde{e}_i; e_i)}{\partial \tilde{e}_i}\right|,$$
where h(·) are the solution functions of the SCM, defined as invertible affine transformations
whose parameters are learned with neural networks. Therefore, by learning the solution
functions, i.e., learning to transform e to z, we implicitly model the causal graph within the
framework and obtain the latent causal variables. For more details about the implementation,
see Appendix B.
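A minimal sketch of one conditional affine solution function: an invertible affine map of ẽ_i whose shift and log-scale are predicted by a small network. Conditioning on the full latent vector e and using a standard normal base density p̃ are assumptions of this sketch; the exact conditioning and parameterization follow Appendix B.

```python
import math
import torch
import torch.nn as nn

class AffineSolutionFunction(nn.Module):
    """One conditional affine flow p(e~_i | e): an invertible affine map
    h_i(e~_i; .) with shift and log-scale predicted from the conditioning e."""

    def __init__(self, latent_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),                       # [shift, log_scale]
        )

    def log_prob(self, e_tilde_i, e):
        # e_tilde_i: (batch, 1), e: (batch, latent_dim)
        shift, log_scale = self.net(e).chunk(2, dim=-1)
        z = (e_tilde_i - shift) * torch.exp(-log_scale)   # h_i(e~_i; .)
        base = -0.5 * (z ** 2 + math.log(2 * math.pi))    # log of standard normal base
        log_det = -log_scale                              # log |d h_i / d e~_i|
        return (base + log_det).squeeze(-1)

flow = AffineSolutionFunction(latent_dim=5)
lp = flow.log_prob(torch.randn(64, 1), torch.randn(64, 5))  # shape (64,)
```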
Training Objective. Combining the encoding module, the intervention module, the prior, and the
conditional diffusion model, the ELBO of a data pair (derived in Appendix A) is

$$\begin{aligned}
\log p(x, \tilde{x}) \geq\; & \mathbb{E}_{q(I|x,\tilde{x})}\,\mathbb{E}_{q(e,\tilde{e}|x,\tilde{x},I)}\,\mathbb{E}_{t\sim U(0,1)}\,\mathbb{E}_{q(u_t|x)}\,\mathbb{E}_{q(\tilde{u}_t|\tilde{x})}\Big[\log p(I) + \log p(e) + \log p(\tilde{e}\,|\,e, I) \\
& - \log q(I\,|\,x,\tilde{x}) - \log q(e,\tilde{e}\,|\,x,\tilde{x},I) + \lambda(t)\,\|s_\theta(u_t, e, t) - \nabla_{u_t}\log p(u_t|x)\|_2^2 \\
& + \lambda(t)\,\|s_\theta(\tilde{u}_t, \tilde{e}, t) - \nabla_{\tilde{u}_t}\log p(\tilde{u}_t|\tilde{x})\|_2^2\Big],
\end{aligned}$$
where λ(t) is a positive weighting function. We train the model by minimizing a reweighted
loss function reminiscent of β-VAEs:
"
Lmodel = Ep(x,x̃) Eq(I|x,x̃) Eq(e,ẽ|x,x̃,I) Et∼U (0,1) Eq(ut |x) Eq(ũt |x̃) λ(t)||sθ (ut , e, t)
h
− ∇ut log p(ut |x)||22 + λ(t)||sθ (ũt , ẽ, t) − ∇ũt log p(ũt |x̃)||22 + β log p(I) + log p(e)
#
i
+ log p(ẽ|e, I) − log q(I|x, x̃) − log q(e, ẽ|x, x̃, I) ,
When using infinite-dimensional representations (Eq. 2), the objective function becomes:
"
Lmodel = Ep(x,x̃) Eq(I|x,x̃) Et∼U (0,1) Eq(et ,e˜t |x,x̃,I) Eq(ut |x) Eq(ũt |x̃) λ(t)||sθ (ut , et , t)
h
− ∇ut log p(ut |x)||22 + λ(t)||sθ (ũt , e˜t , t) − ∇ũt log p(ũt |x̃)||22 + β log p(I) + log p(et )
#
i
+ log p(ẽt |et , I) − log q(I|x, x̃) − log q(et , ẽt |x, x̃, I) , (3)
where (e_t)_{t∈[0,1]} is the trajectory-based representation and e_t ∈ R^d is the single point of the
trajectory at time t. For more details about the problem formulation, see Appendix A. To
prevent a collapse of the latent space to a lower-dimensional subspace, we add the negative
entropy of the batch-aggregated intervention posterior, q_I^batch(I) = E_{(x,x̃)∈batch}[q(I|x, x̃)], as a
regularization term to the loss function:
$$\mathcal{L}_{\text{entropy}} = \mathbb{E}_{\text{batches}}\Big[-\sum_{I} q_I^{\text{batch}}(I)\,\log q_I^{\text{batch}}(I)\Big],$$
where E_batches[·] denotes the expectation over the batches of data. After training,
the framework contains information about the underlying causal structure and the latent causal
variables and can be used in different downstream tasks.
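A sketch of the entropy regularizer for a single batch, following the formula above; averaging the per-batch values over training batches realizes the outer expectation.

```python
import torch

def entropy_regularizer(q_I, eps=1e-8):
    """L_entropy for one batch: aggregate q(I | x, x~) over the batch and
    evaluate the entropy expression above on the aggregated posterior."""
    q_batch = q_I.mean(dim=0)                                  # q_I^batch(I)
    return -(q_batch * torch.log(q_batch + eps)).sum()

# Example: q_I holds q(I | x, x~) for a batch of 64 pairs and 5 components.
loss_entropy = entropy_regularizer(torch.softmax(torch.randn(64, 5), dim=-1))
```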
5 Experiments
5.1 Overview of the Experiments
Here we analyze the performance of the proposed model, DCRL, on synthetic data. We
employ DCRL for the task of causal discovery and subsequently use ENCO [Lippe et al.,
2021], a continuous optimization structure learning method that leverages observational and
interventional data, on top of DCRL to infer the underlying causal graph. Furthermore, we
evaluate the learned latent variables with the DCI framework [Eastwood and Williams, 2018].
Data Generation. To generate latent variables, we adopt random graphs in which each edge in
a fixed topological order is sampled from a Bernoulli distribution with parameter 0.5. We
consider a linear Gaussian SCM and sample the weights from a multivariate normal distribution
with zero mean and unit variance, making sure the weights are not close to zero to avoid
violating the faithfulness assumption. We introduce additive Gaussian noise with equal
variance across all nodes, set to 0.1. Latent causal variables are then sampled using
ancestral sampling, and we generate 10^5 training samples, 10^4 validation samples, and 10^4
test samples. Finally, to generate the input data x, we apply a random linear projection to
the obtained latent variables, keeping the dimension of x fixed to 16. We consider SCMs with
5, 10, and 15 variables. To enhance the robustness of the results, we generate data for 4
different seeds and repeat our experiments for each seed.
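A minimal sketch of this data-generating process (random upper-triangular DAG, linear Gaussian SCM, ancestral sampling, and a random linear projection to 16 dimensions); the rejection of near-zero weights is omitted for brevity.

```python
import numpy as np

def generate_data(n_vars=5, n_samples=100_000, x_dim=16, noise_var=0.1, seed=0):
    """Linear Gaussian SCM over a random DAG, ancestral sampling, and a random
    linear projection of the latent causal variables to x_dim dimensions."""
    rng = np.random.default_rng(seed)
    # Random DAG in a fixed topological order: each edge is present w.p. 0.5.
    adj = np.triu(rng.binomial(1, 0.5, size=(n_vars, n_vars)), k=1)
    # Edge weights from a standard normal (the rejection of near-zero weights
    # used to preserve faithfulness is omitted in this sketch).
    weights = rng.normal(0.0, 1.0, size=(n_vars, n_vars)) * adj
    # Ancestral sampling with additive Gaussian noise of variance noise_var.
    z = np.zeros((n_samples, n_vars))
    for i in range(n_vars):
        z[:, i] = z @ weights[:, i] + rng.normal(0.0, np.sqrt(noise_var), n_samples)
    # Random linear projection to the observed low-level data x.
    x = z @ rng.normal(0.0, 1.0, size=(n_vars, x_dim))
    return z, x, adj, weights

z, x, adj, weights = generate_data()
```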
Figure 2: Comparison of models on different metrics when using single-point representation.
Our approach outperforms or competes favorably with the baseline methods on all metrics.
Particularly in higher dimensions, our method excels by capturing additional information about
the causal variables and the underlying causal structure.
Baselines. We consider ILCM, the implicit latent causal model of Brehmer et al. [2022], as our
main baseline. To the best of our knowledge, there are no other methods that consider the same
weakly-supervised assumptions. We also compare against a variant of the disentanglement VAE
proposed by Locatello et al. [2020], tailored to weakly supervised settings. This model, referred
to as d-VAE, models the weakly supervised process but assumes unconnected factors of variation
instead of a causal relationship among the variables. Similarly, we apply ENCO on top of both
baselines to obtain the learned graph.
5.2 Single-point Representations
Using single-point representations, where e ∈ R^d is independent of time, our method
demonstrates superior or competitive performance compared to the baselines, as indicated by
the metrics shown in Figure 2. In higher dimensions, our method excels by capturing more
information about the causal variables and the underlying causal structure.
6 Conclusion
Identifying the underlying causal variables and mechanisms of a system solely from observational
data is considered impossible without additional assumptions. In this work, we use weak
supervision as an inductive bias and study whether the information encoded in the latent code of
diffusion-based representations carries useful knowledge about the causal variables and the
underlying causal graph.
References
K. Abstreiter, S. Mittal, S. Bauer, B. Schölkopf, and A. Mehrjou. Diffusion-based representation
learning. CoRR, abs/2105.14257, 2022.
N. Chen, Y. Zhang, H. Zen, R. J. Weiss, M. Norouzi, and W. Chan. Wavegrad: Estimating
gradients for waveform generation. In Proc. of ICLR, 2021.
P. Dhariwal and A. Nichol. Diffusion models beat gans on image synthesis. In Proc. of NeurIPS,
pages 8780–8794, 2021a.
P. Dhariwal and A. Q. Nichol. Diffusion models beat gans on image synthesis. In Proc. of
NeurIPS, pages 8780–8794, 2021b.
C. Hiemstra and J. D. Jones. Testing for linear and nonlinear granger causality in the stock
price-volume relation. The Journal of Finance, 49(5):1639–1664, 1994.
J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. In Proc. of NeurIPS,
pages 6840–6851, 2020.
T. Höppe, A. Mehrjou, S. Bauer, D. Nielsen, and A. Dittadi. Diffusion models for video
prediction and infilling. CoRR, abs/2206.07696, 2022.
A. Komanduri, Y. Wu, W. Huang, F. Chen, and X. Wu. Scm-vae: Learning identifiable causal
representations via structural knowledge. In IEEE Big Data, pages 1014–1023, 2022.
M. Kwon, J. Jeong, and Y. Uh. Diffusion models already have a semantic latent space. CoRR,
abs/2210.10960, 2022.
P. Lippe, T. Cohen, and E. Gavves. Efficient neural causal discovery without acyclicity
constraints. CoRR, abs/2107.10483, 2021.
E. Luhman and T. Luhman. Knowledge distillation in iterative generative models for improved
sampling speed. CoRR, abs/2101.02388, 2021.
C. Niu, Y. Song, J. Song, S. Zhao, A. Grover, and S. Ermon. Permutation invariant graph
generation via score-based generative modeling. In Proc. of AISTATS, volume 108, pages
4474–4484, 2020.
S. Papa, O. Winther, and A. Dittadi. Inductive biases for object-centric representations in the
presence of complex textures. In UAI 2022 Workshop on Causal Representation Learning,
2022.
J. Runge, S. Bathiany, E. Bollt, G. Camps-Valls, D. Coumou, E. Deyle, C. Glymour,
M. Kretschmer, M. D. Mahecha, J. Muñoz-Marí, et al. Inferring causation from time
series in earth system sciences. Nature Communications, 10(1):2553, 2019.
J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models. In Proc. of ICLR, 2021a.
Y. Wang, Y. Schiff, A. Gokaslan, W. Pan, F. Wang, C. De Sa, and V. Kuleshov. Infodiffusion:
Representation learning using information maximizing diffusion models. arXiv preprint
arXiv:2306.08757, 2023.
Z. Wu, N. Dvornik, K. Greff, T. Kipf, and A. Garg. Slotformer: Unsupervised visual dynamics
simulation with object-centric models. CoRR, abs/2210.05861, 2022.
M. Yang, F. Liu, Z. Chen, X. Shen, J. Hao, and J. Wang. Causalvae: Structured causal
disentanglement in variational autoencoder. CoRR, abs/2208.14153, 2020.
J. Yoon, Y.-F. Wu, H. Bae, and S. Ahn. An investigation into pre-training object-centric
representations for reinforcement learning. CoRR, abs/2302.04419, 2023.
Appendix
A Problem Formulation & ELBO
The ELBO for the proposed framework is derived below. For simplicity, we only derive the ELBO
for single representations that are independent of time, i.e., e ∈ R^d; the derivation for the
infinite-dimensional case is analogous.
$$\begin{aligned}
\log p(x, \tilde{x}) &\geq \mathbb{E}_{q(e,\tilde{e},u,\tilde{u},I|x,\tilde{x})}\left[\log\frac{p(x,\tilde{x},u,\tilde{u},e,\tilde{e},I)}{q(e,\tilde{e},I,u,\tilde{u}\,|\,x,\tilde{x})}\right] \\
&= \mathbb{E}_{q(e,\tilde{e},u,\tilde{u},I|x,\tilde{x})}\left[\log\frac{p(I)}{q(I\,|\,x,\tilde{x})} + \log\frac{p(e)\,p(\tilde{e}\,|\,e,I)}{q(e,\tilde{e}\,|\,x,\tilde{x},I)} + \log\frac{p(x,u\,|\,e)}{q(u\,|\,x)} + \log\frac{p(\tilde{x},\tilde{u}\,|\,\tilde{e})}{q(\tilde{u}\,|\,\tilde{x})}\right] \\
&= \mathbb{E}_{q(I|x,\tilde{x})}\,\mathbb{E}_{q(e,\tilde{e}|x,\tilde{x},I)}\,\mathbb{E}_{q(u|x)}\,\mathbb{E}_{q(\tilde{u}|\tilde{x})}\bigg[\Big[\log p(I) + \log p(e) + \log p(\tilde{e}\,|\,e,I) - \log q(I\,|\,x,\tilde{x}) - \log q(e,\tilde{e}\,|\,x,\tilde{x},I)\Big] \\
&\qquad\qquad + \Big[\log\frac{p(x,u\,|\,e)}{q(u\,|\,x)} + \log\frac{p(\tilde{x},\tilde{u}\,|\,\tilde{e})}{q(\tilde{u}\,|\,\tilde{x})}\Big]\bigg].
\end{aligned}$$
The terms in the first bracket correspond to the intervention encoder and the noise encoding
module, respectively, and the terms in the second bracket correspond to the diffusion model
conditioned on pre- and post-intervention noise encodings.
Song et al. [2021c] show that the discretization of the SDE formulation of diffusion models is
equivalent to discrete-time diffusion models. Therefore, for simplicity, we derive the ELBO for
discrete-time diffusion models. Following Luo [2022], for a discrete-time diffusion model with
t ∈ [1, T], we have
" #
p(x, u|e)
Eq(I|x,x̃) Eq(e,ẽ|x,x̃,I) Eq(u|x) Eq(ũ|x̃) log
q(u|x)
"
= Eq(I|x,x̃) Eq(e,ẽ|x,x̃,I) Eq(u|x) Eq(ũ|x̃) Eq(u1 |x) [log p(x|u1 )] − DKL (q(uT |x)||p(uT ))
T
#
X
− Eq(ut |x) [DKL (q(ut−1 |ut , x, e)||p(ut−1 |ut , e)] (4)
t=2
The weight λ(t) of the denoising matching terms is related to the diffusion coefficient of the forward
SDE. For a Variance Exploding SDE, the weight is λ(t) = 2σ²(t) log(σ_max/σ_min),
with σ(t) = σ_min · (σ_max/σ_min)^t.
Therefore, by combining (4) with (5), the ELBO becomes
$$\begin{aligned}
\log p(x, \tilde{x}) \geq\; & \mathbb{E}_{p(x,\tilde{x})}\,\mathbb{E}_{q(I|x,\tilde{x})}\,\mathbb{E}_{q(e,\tilde{e}|x,\tilde{x},I)}\,\mathbb{E}_{t\sim U(0,1)}\,\mathbb{E}_{q(u_t|x)}\,\mathbb{E}_{q(\tilde{u}_t|\tilde{x})}\Big[\log p(I) + \log p(e) + \log p(\tilde{e}\,|\,e, I) \\
& - \log q(I\,|\,x,\tilde{x}) - \log q(e,\tilde{e}\,|\,x,\tilde{x},I) \\
& + \lambda(t)\big[\|s_\theta(u_t, e, t) - \nabla_{u_t}\log p(u_t|x)\|_2^2 + \|s_\theta(\tilde{u}_t, \tilde{e}, t) - \nabla_{\tilde{u}_t}\log p(\tilde{u}_t|\tilde{x})\|_2^2\big]\Big].
\end{aligned}$$
For infinite-dimensional representations, we can derive the ELBO using a similar argument. In
this case, the formula for the ELBO is
$$\begin{aligned}
\log p(x, \tilde{x}) \geq\; & \mathbb{E}_{p(x,\tilde{x})}\,\mathbb{E}_{q(I|x,\tilde{x})}\,\mathbb{E}_{t\sim U(0,1)}\,\mathbb{E}_{q(e_t,\tilde{e}_t|x,\tilde{x},I)}\,\mathbb{E}_{q(u_t|x)}\,\mathbb{E}_{q(\tilde{u}_t|\tilde{x})}\Big[\log p(I) + \log p(e_t) + \log p(\tilde{e}_t\,|\,e_t, I) \\
& - \log q(I\,|\,x,\tilde{x}) - \log q(e_t,\tilde{e}_t\,|\,x,\tilde{x},I) \\
& + \lambda(t)\,\|s_\theta(u_t, e_t, t) - \nabla_{u_t}\log p(u_t|x)\|_2^2 + \lambda(t)\,\|s_\theta(\tilde{u}_t, \tilde{e}_t, t) - \nabla_{\tilde{u}_t}\log p(\tilde{u}_t|\tilde{x})\|_2^2\Big].
\end{aligned}$$
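For reference, the VE-SDE noise scale σ(t) and weighting λ(t) used above can be computed as follows, with σ_min and σ_max as in Appendix B:

```python
import math

def ve_sigma(t, sigma_min=0.01, sigma_max=50.0):
    """sigma(t) = sigma_min * (sigma_max / sigma_min)^t."""
    return sigma_min * (sigma_max / sigma_min) ** t

def ve_weight(t, sigma_min=0.01, sigma_max=50.0):
    """lambda(t) = 2 * sigma(t)^2 * log(sigma_max / sigma_min)."""
    return 2.0 * ve_sigma(t, sigma_min, sigma_max) ** 2 * math.log(sigma_max / sigma_min)
```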
B Implementation Details
Training. For training, we follow the four-phase schedule of Brehmer et al. [2022] but consider
only the first three phases. In summary, we proceed as follows (a schematic sketch is given after the list):
(1) We begin by training the diffusion model and the encoding module together on data
pairs for 20 epochs. This can be interpreted as a warm-up on the diffusion model and
the encoding module to extract meaningful representations of data.
(2) We include all modules except for the solution functions and consider p(ẽi|e) to be a uniform
probability density. We run this phase for 50 epochs.
(3) We include the solution functions and train the whole framework with the proposed loss
for 50 epochs.
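A schematic of the three-phase schedule above; `model.set_phase` and `model.loss` are hypothetical hooks standing in for the per-phase configuration, and only the loop structure and epoch counts follow the description.

```python
def train(model, loader, optimizer):
    """Three-phase schedule: (1) warm up the diffusion model and the encoding
    module, (2) add the intervention module and prior with uniform p(e~_i | e),
    (3) enable the solution functions and train with the full objective."""
    schedule = [("warmup", 20), ("no_solution_functions", 50), ("full", 50)]
    for phase, n_epochs in schedule:
        model.set_phase(phase)                 # hypothetical per-phase switch
        for _ in range(n_epochs):
            for x, x_tilde in loader:
                loss = model.loss(x, x_tilde)  # L_model + L_entropy in phase "full"
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
```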
We find that, given our data generation process, including the fourth training phase
of Brehmer et al. [2022] has no impact on the model's performance; consequently, we disregard it
in our analysis. We use the loss in Eq. 3 as the objective function and set the coefficient of the
regularization term L_entropy to 1, so the overall loss is L = L_model + L_entropy.
Architectures & Hyperparameters. We train the model for 120 epochs with a learning rate
of 3e-4 and a batch size of 64. β is initially set to 0 and increased to 1 during training.
The noise encoder is Gaussian, with mean and standard deviation parameterized by an MLP
with two hidden layers of 64 units each and ReLU activations. The architecture of the score
function of the diffusion model is based on the NCSN++ architecture [Song et al., 2021c] with
the same set of hyperparameters. As the input x is 16-dimensional and the score model follows
a convolutional architecture, we reshape the input into a 4 × 4 format before feeding it into
the diffusion model. Furthermore, in the forward SDE, σ_min and σ_max are set to 0.01 and 50,
respectively.
C Missing Plots