
CS 236 Homework 2 Solutions

Instructors: Stefano Ermon and Aditya Grover


{ermon,adityag}@cs.stanford.edu

Available: 10/15/2018; Due: 23:59 PST, 10/29/2018

Problem 1: Implementing the Variational Autoencoder (VAE) (25 points)


For this problem we will be using PyTorch to implement the variational autoencoder (VAE) and learn a
probabilistic model of the MNIST dataset of handwritten digits. Formally, we observe a sequence of binary
pixels x ∈ {0, 1}d , and let z ∈ Rk denote a set of latent variables. Our goal is to learn a latent variable model
pθ (x) of the high-dimensional data distribution pdata (x).
The VAE is a latent variable model that learns a specific parameterization pθ (x) = ∫ pθ (x, z) dz = ∫ p(z) pθ (x|z) dz.
Specifically, the VAE is defined by the following generative process:

p(z) = N (z|0, I)

pθ (x|z) = Bern(x|fθ (z))


In other words, we assume that the latent variables z are sampled from a unit Gaussian distribution N (z|0, I).
The latent z are then passed through a neural network decoder fθ (·) to obtain the parameters of the d Bernoulli
random variables which model the pixels in each image.
Although we would like to maximize the marginal likelihood pθ (x), computation of pθ (x) = ∫ p(z) pθ (x|z) dz is
generally intractable as it involves integration over all possible values of z. Therefore, we posit a variational
approximation to the true posterior and perform amortized inference as we have seen in class:

qφ (z|x) = N (z|µφ (x), diag(σφ2 (x)))

Specifically, we pass each image x through a neural network which outputs the mean µφ and diagonal covariance
diag(σφ2 (x)) of the multivariate Gaussian distribution that approximates the distribution over latent variables
z given x. We then maximize a lower bound to the marginal log-likelihood, known as the evidence lower bound (ELBO):

log pθ (x) ≥ ELBO(x; θ, φ) = Eqφ (z|x) [log pθ (x|z)] − DKL (qφ (z|x)||p(z))

Notice that the ELBO as shown on the right-hand side of the above expression decomposes into two terms: (1) the reconstruction loss −Eqφ (z|x) [log pθ (x|z)], and (2) the Kullback-Leibler (KL) term DKL (qφ (z|x)||p(z)).
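For concreteness, a minimal sketch of what such an amortized encoder might look like is given below. The architecture, layer sizes, and softplus parameterization of the variance are illustrative assumptions; the starter code provides its own encoder.

import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    # Hypothetical amortized encoder for q_phi(z|x) = N(z | mu_phi(x), diag(sigma_phi^2(x))).
    # Layer sizes are illustrative only; the provided codebase defines its own networks.
    def __init__(self, x_dim=784, z_dim=10, h_dim=300):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim, h_dim), nn.ELU(),
            nn.Linear(h_dim, 2 * z_dim),   # jointly outputs the mean and a pre-variance
        )

    def forward(self, x):
        h = self.net(x)
        m, h_v = h.chunk(2, dim=-1)
        v = F.softplus(h_v) + 1e-8         # enforce a strictly positive variance
        return m, v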

Your objective is to implement the variational autoencoder by modifying utils.py and vae.py.

1. [5 points] Implement the reparameterization trick in the function sample_gaussian of utils.py. Specifically, your answer will take in the mean m and variance v of the Gaussian distribution qφ (z|x) and return
a sample z ∼ qφ (z|x).

Solution:
Code:
def sample_gaussian(m, v):
    eps = torch.randn_like(v)
    z = m + torch.sqrt(v) * eps
    return z
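As a quick sanity check (not required by the assignment), the sample produced this way remains differentiable with respect to the variational parameters, which is exactly what the reparameterization trick buys us:

import torch

m = torch.zeros(4, 2, requires_grad=True)
v = torch.ones(4, 2, requires_grad=True)
z = m + torch.sqrt(v) * torch.randn_like(v)   # same computation as sample_gaussian
z.sum().backward()
print(m.grad.shape, v.grad.shape)             # gradients reach both variational parameters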

2. [5 points] Next, implement negative_elbo_bound in the file vae.py. Several of the functions in utils.py will be helpful, so please check what is provided. Note that we ask for the negative ELBO, as PyTorch optimizers minimize the loss function. Additionally, since we are computing the negative ELBO over a mini-batch of data {x(i) }_{i=1}^{n}, make sure to compute the average −(1/n) Σ_{i=1}^{n} ELBO(x(i) ; θ, φ) over the mini-batch. Finally, note that the ELBO itself cannot be computed exactly, since exact computation of the reconstruction term is intractable. Instead we ask that you estimate the reconstruction term via Monte Carlo sampling

−Eqφ (z|x) [log pθ (x|z)] ≈ − log pθ (x|z(1) ),

where z(1) ∼ qφ (z|x) denotes a single sample. The function kl_normal in utils.py will be helpful. Note: negative_elbo_bound also expects you to return the average reconstruction loss and KL divergence.

Solution:
Code:
def negative_elbo_bound(self, x):
    m, v = self.enc.encode(x)
    z = ut.sample_gaussian(m, v)
    logits = self.dec.decode(z)

    kl = ut.kl_normal(m, v, self.z_prior[0], self.z_prior[1])
    rec = -ut.log_bernoulli_with_logits(x, logits)
    nelbo = kl + rec

    nelbo, kl, rec = nelbo.mean(), kl.mean(), rec.mean()

    return nelbo, kl, rec
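The kl_normal helper is provided in utils.py and not reproduced in this handout; a minimal sketch of the analytic KL divergence between two diagonal Gaussians that it presumably computes is shown below (the exact signature in utils.py may differ):

import torch

def kl_normal(qm, qv, pm, pv):
    # KL( N(qm, diag(qv)) || N(pm, diag(pv)) ), summed over the last dimension.
    element_wise = 0.5 * (torch.log(pv) - torch.log(qv) + qv / pv + (qm - pm).pow(2) / pv - 1)
    return element_wise.sum(-1)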

3. [10 points] To test your implementation, run python run_vae.py to train the VAE. Once the run is
complete (20000 iterations), it will output (assuming your implementation is correct): the average (1)
negative ELBO, (2) KL term, and (3) reconstruction loss as evaluated on a test subset that we have
selected. Report the three numbers you obtain as part of the write-up. Since we're using stochastic optimization, you may wish to run the model multiple times and report each metric's mean and corresponding standard error. (Hint: the negative ELBO on the test subset should be somewhere around 100.)

Solution:
Standard deviations are provided (based on 50 runs). We provide numbers computed on both the CPU
and GPU. CPU numbers are slightly better since the test subset is slightly different.
GPU Numbers:
(a) VAE negative ELBO: 101.21 ± 0.61
(b) VAE KL: 19.34 ± 0.19
(c) VAE Rec: 81.86 ± 0.66
CPU Numbers:
(a) VAE negative ELBO: 99.20 ± 0.56
(b) VAE KL: 19.35 ± 0.18
(c) VAE Rec: 79.84 ± 0.61
4. [5 points] Visualize 200 digits (generate a single image tiled in a grid of 10 × 20 digits) sampled from
pθ (x).

Solution:
Visualization of VAE samples
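The sampled grid itself is omitted here. Below is a hedged sketch of how the 10 × 20 panel might be produced; the trained model object, its dec.decode method, the flattened 28 × 28 output shape, and the output path are assumptions based on the snippets above.

import torch
from torchvision.utils import save_image

with torch.no_grad():
    z = torch.randn(200, model.z_dim)              # 200 draws from p(z) = N(0, I)
    probs = torch.sigmoid(model.dec.decode(z))     # Bernoulli pixel means, assumed shape (200, 784)
    x = torch.bernoulli(probs)                     # sample binary digits (or plot probs directly)
    save_image(x.view(200, 1, 28, 28), 'vae_samples.png', nrow=20)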

Problem 2: Implementing the Mixture of Gaussians VAE (GMVAE) (30 points)
Recall that in Problem 1, the VAE’s prior distribution was a parameter-free isotropic Gaussian p(z) =
N (z|0, I). While this original setup works well, there are settings in which we desire more expressivity to better
model our data. In this problem we will implement the GMVAE, which has a mixture of Gaussians as the prior
distribution. Specifically:
pθ (z) = (1/k) Σ_{i=1}^{k} N (z|µi , diag(σi2 ))

where i ∈ {1, . . . , k} denotes the ith cluster index. For notational simplicity, we shall subsume our mixture of Gaussians parameters {µi , σi }_{i=1}^{k} into our generative model parameters θ. For simplicity, we have also assumed fixed uniform weights 1/k over the possible different clusters. Apart from the prior, the GMVAE shares an identical setup with the VAE:
qφ (z|x) = N (z|µφ (x), diag(σφ2 (x)))
pθ (x|z) = Bern(x|fθ (z))

Although the ELBO for the GMVAE, Eqφ (z|x) [log pθ (x|z)] − DKL (qφ (z|x)||pθ (z)), is identical in form to that of the VAE, we
note that the KL term DKL (qφ (z|x)||pθ (z)) cannot be computed analytically between a Gaussian distribution
qφ (z|x) and a mixture of Gaussians pθ (z). However, we can obtain its unbiased estimator via Monte Carlo
sampling:

DKL (qφ (z|x)||pθ (z)) ≈ log qφ (z(1) |x) − log pθ (z(1) )


= log N (z(1) |µφ (x), diag(σφ2 (x))) − log [ (1/k) Σ_{i=1}^{k} N (z(1) |µi , diag(σi2 )) ],

where z(1) ∼ qφ (z|x) denotes a single sample.

1. [15 points] Implement the (1) log_normal and (2) log_normal_mixture functions in utils.py, and the function negative_elbo_bound in gmvae.py. The function log_mean_exp in utils.py will be helpful for
this problem.

Solution:
Code:
def log_normal(x, m, v):
    element_wise = -0.5 * (torch.log(v) + (x - m).pow(2) / v + np.log(2 * np.pi))
    log_prob = element_wise.sum(-1)
    return log_prob

def log_normal_mixture(z, m, v):
    # (batch, dim) -> (batch, 1, dim)
    z = z.unsqueeze(1)
    # (batch, 1, dim) -> (batch, mix, dim) -> (batch, mix)
    log_prob = log_normal(z, m, v)
    # (batch, mix) -> (batch,)
    log_prob = log_mean_exp(log_prob, dim=1)
    return log_prob
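The log_mean_exp helper is provided in utils.py; a sketch of what it presumably computes (a numerically stable log of the mean of exponentials) is:

import math
import torch

def log_mean_exp(x, dim):
    # log( mean( exp(x) ) ) along `dim`, computed stably via logsumexp.
    return torch.logsumexp(x, dim) - math.log(x.size(dim))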

def negative_elbo_bound(self, x):
    # Prior
    prior = ut.gaussian_parameters(self.z_pre, dim=1)

    m, v = self.enc.encode(x)

    z = ut.sample_gaussian(m, v)
    logits = self.dec.decode(z)

    kl = ut.log_normal(z, m, v) - ut.log_normal_mixture(z, *prior)
    rec = -ut.log_bernoulli_with_logits(x, logits)
    nelbo = kl + rec

    nelbo, kl, rec = nelbo.mean(), kl.mean(), rec.mean()

    return nelbo, kl, rec

2. [10 points] To test your implementation, run python run_gmvae.py to train the GMVAE. Once the
run is complete (20000 iterations), it will output: the average (1) negative ELBO, (2) KL term, and (3)
reconstruction loss as evaluated on a test subset that we have selected. Report the three numbers you
obtain as part of the write-up. Since we’re using stochastic optimization, you may wish to run the model
multiple times and report each metric’s mean and the corresponding standard error.

Solution:
Standard deviations are provided (based on 50 runs). We provide numbers computed on both the CPU
and GPU. CPU numbers are slightly better since the test subset is slightly different.
GPU Numbers:
(a) GMVAE negative ELBO: 98.58 ± 0.46
(b) GMVAE KL: 17.82 ± 0.18
(c) GMVAE Rec: 80.77 ± 0.51

CPU Numbers:
(a) GMVAE negative ELBO: 96.75 ± 0.56
(b) GMVAE KL: 17.78 ± 0.20
(c) GMVAE Rec: 78.96 ± 0.59

3. [5 points] Visualize 200 digits (generate a single image tiled in a grid of 10 × 20 digits) sampled from
pθ (x).

Solution:
Visualization of GMVAE samples
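Sampling from the GMVAE differs from the VAE only in how z is drawn: a cluster index is picked uniformly at random and z is sampled from the corresponding Gaussian component. A hedged sketch is shown below; the trained model object, its z_pre attribute, and the ut (utils) module follow the solution code above and may differ in your setup.

import torch

with torch.no_grad():
    pm, pv = ut.gaussian_parameters(model.z_pre, dim=1)   # assumed shape (1, k, z_dim) means/variances
    idx = torch.randint(pm.size(1), (200,))                # uniform cluster assignments
    m, v = pm[0, idx], pv[0, idx]
    z = m + torch.sqrt(v) * torch.randn_like(v)            # draw z from the chosen components
    probs = torch.sigmoid(model.dec.decode(z))             # decode and tile exactly as in the VAE case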

Problem 3: Implementing the Importance Weighted Autoencoder (IWAE) (25 points)
While the ELBO serves as a lower bound to the true marginal log-likelihood, it may be loose if the variational
posterior qφ (z|x) is a poor approximation to the true posterior pθ (z|x). It is worth noting that, for a fixed
choice of x, the ELBO is the expectation of the log of the unnormalized density ratio

pθ (x, z) / qφ (z|x) = ( pθ (z|x) / qφ (z|x) ) · pθ (x),

where z ∼ qφ (z|x). As can be seen from the RHS, the density ratio is unnormalized, since it is multiplied by the constant pθ (x). We can obtain a tighter bound by averaging multiple unnormalized density
ratios. This is the key idea behind IWAE, which uses m > 1 samples from the approximate posterior qφ (z|x)
to obtain the following IWAE bound:
Lm (x; θ, φ) = E_{z(1),...,z(m) i.i.d.∼ qφ (z|x)} [ log ( (1/m) Σ_{i=1}^{m} pθ (x, z(i)) / qφ (z(i) |x) ) ]

Notice that for the special case of m = 1, the IWAE objective reduces to the standard ELBO.

1. [5 points] Prove that the IWAE bound is a valid lower bound of the log-likelihood, and that the ELBO lower bounds the IWAE bound:
log pθ (x) ≥ Lm (x) ≥ L1 (x)
for any m ≥ 1. [Hint: consider Jensen’s Inequality]

Solution:
A step-by-step proof is given below. The crucial steps are (3) and (4), where we apply Jensen's inequality; identifying these two steps is sufficient for full credit.

log pθ (x) = log E_{z(i) ∼ qφ (z|x)} [ (1/m) Σ_{i=1}^{m} pθ (x, z(i)) / qφ (z(i) | x) ]                              (1)
           = log E_{z(1),...,z(m) i.i.d.∼ qφ (z|x)} [ (1/m) Σ_{i=1}^{m} pθ (x, z(i)) / qφ (z(i) | x) ]               (2)
           ≥ E_{z(1),...,z(m) i.i.d.∼ qφ (z|x)} [ log ( (1/m) Σ_{i=1}^{m} pθ (x, z(i)) / qφ (z(i) | x) ) ] = Lm (x)  (3)
           ≥ E_{z(1),...,z(m) i.i.d.∼ qφ (z|x)} [ (1/m) Σ_{i=1}^{m} log ( pθ (x, z(i)) / qφ (z(i) | x) ) ]           (4)
           = (1/m) Σ_{i=1}^{m} E_{z(i) ∼ qφ (z|x)} [ log ( pθ (x, z(i)) / qφ (z(i) | x) ) ]                          (5)
           = (1/m) Σ_{i=1}^{m} E_{z ∼ qφ (z|x)} [ log ( pθ (x, z) / qφ (z | x) ) ]                                   (6)
           = E_{z ∼ qφ (z|x)} [ log ( pθ (x, z) / qφ (z | x) ) ] = L1 (x).                                           (7)

Step (1) holds because each unnormalized density ratio has expectation pθ (x) under qφ (z|x); step (2) rewrites the expectation jointly over the m i.i.d. samples; step (3) applies Jensen's inequality to the concave log, moving it inside the expectation, which yields Lm (x); step (4) applies Jensen's inequality again to the inner average, using log((1/m) Σ_i a_i) ≥ (1/m) Σ_i log a_i; steps (5)-(7) use linearity of expectation and the fact that the z(i) are identically distributed.
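The ordering log pθ (x) ≥ Lm (x) ≥ L1 (x) can also be checked numerically on a toy model whose marginal likelihood is known in closed form. The snippet below is a self-contained illustration only, not part of the required solution; it assumes p(z) = N(0, 1), p(x|z) = N(z, 1), and a deliberately crude posterior q(z|x) = N(0, 1), for which pθ (x) = N(x; 0, 2).

import math
import torch

torch.manual_seed(0)

def log_normal_pdf(value, mean, var):
    return -0.5 * (math.log(2 * math.pi) + torch.log(var) + (value - mean) ** 2 / var)

x = torch.tensor(1.5)
zero, one = torch.tensor(0.0), torch.tensor(1.0)
exact = log_normal_pdf(x, zero, torch.tensor(2.0))          # exact marginal: p(x) = N(x; 0, 2)

def iwae_estimate(m, n_rep=20000):
    z = torch.randn(n_rep, m)                                # z ~ q(z|x) = N(0, 1)
    # log importance weight: log p(z) + log p(x|z) - log q(z|x)
    log_w = log_normal_pdf(z, zero, one) + log_normal_pdf(x, z, one) - log_normal_pdf(z, zero, one)
    return (torch.logsumexp(log_w, dim=1) - math.log(m)).mean()

print(exact.item(), iwae_estimate(100).item(), iwae_estimate(1).item())
# Observed ordering (up to Monte Carlo noise): exact log p(x) >= L_100 >= L_1.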

2. [5 points] Implement IWAE for the VAE in the negative_iwae_bound function in vae.py. The functions duplicate and log_mean_exp defined in utils.py will be helpful.

Solution:
Code:

def negative_iwae_bound(self, x, iw):
    m, v = self.enc.encode(x)

    # Duplicate
    m = ut.duplicate(m, iw)
    v = ut.duplicate(v, iw)
    x = ut.duplicate(x, iw)
    z = ut.sample_gaussian(m, v)
    logits = self.dec.decode(z)

    kl = ut.log_normal(z, m, v) - ut.log_normal(z, self.z_prior[0], self.z_prior[1])

    # Important: it is technically incorrect to calculate the KL analytically.
    # The IWAE bound requires that we calculate the importance sample weight,
    # so do not do: kl = ut.kl_normal(m, v, self.z_prior[0], self.z_prior[1])
    rec = -ut.log_bernoulli_with_logits(x, logits)
    nelbo = kl + rec
    niwae = -ut.log_mean_exp(-nelbo.reshape(iw, -1), dim=0)

    niwae, kl, rec = niwae.mean(), kl.mean(), rec.mean()

    return niwae, kl, rec
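The duplicate helper is provided in utils.py; a sketch of the tiling it presumably performs (so that reshape(iw, -1) later recovers the grouping by importance sample) is:

import torch

def duplicate(x, rep):
    # Tile x `rep` times along a new leading dimension and flatten:
    # (batch, ...) -> (rep * batch, ...).
    return x.expand(rep, *x.shape).reshape(-1, *x.shape[1:])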

3. [10 points] Run python run_vae.py --train 0 to evaluate your implementation against the test subset.
This will output IWAE bounds for m = {1, 10, 100, 1000}. Check that the IWAE-1 result is consistent
with your reported ELBO for the VAE. Report all four IWAE bounds for this write-up.

Solution:
Standard deviations are provided (based on 50 runs). We provide numbers computed on both the CPU
and GPU. CPU numbers are slightly better since the test subset is slightly different.
GPU Numbers:

(a) VAE negative IWAE-1: 101.21 ± 0.61


(b) VAE negative IWAE-10: 98.50 ± 0.51
(c) VAE negative IWAE-100: 97.46 ± 0.50
(d) VAE negative IWAE-1000: 96.93 ± 0.45
CPU Numbers:

(a) VAE negative IWAE-1: 99.20 ± 0.56


(b) VAE negative IWAE-10: 96.52 ± 0.46
(c) VAE negative IWAE-100: 95.54 ± 0.41
(d) VAE negative IWAE-1000: 95.07 ± 0.40

4. [5 points] As IWAE only requires the averaging of multiple unnormalized density ratios, the IWAE
bound is also applicable to the GMVAE model. Repeat parts 2 and 3 for the GMVAE by implementing
the negative_iwae_bound function in gmvae.py. Compare and contrast IWAE bounds for GMVAE and
VAE.

Solution:
Code:
def negative_iwae_bound(self, x, iw):
    # Prior
    prior = ut.gaussian_parameters(self.z_pre, dim=1)

    m, v = self.enc.encode(x)

    # Duplicate
    m = ut.duplicate(m, iw)
    v = ut.duplicate(v, iw)
    x = ut.duplicate(x, iw)
    z = ut.sample_gaussian(m, v)
    logits = self.dec.decode(z)

    kl = ut.log_normal(z, m, v) - ut.log_normal_mixture(z, *prior)
    rec = -ut.log_bernoulli_with_logits(x, logits)
    nelbo = kl + rec
    niwae = -ut.log_mean_exp(-nelbo.reshape(iw, -1), dim=0)

    niwae, kl, rec = niwae.mean(), kl.mean(), rec.mean()

    return niwae, kl, rec

Standard deviations are provided (based on 50 runs). We provide numbers computed on both the CPU
and GPU. CPU numbers are slightly better since the test subset is slightly different.
GPU Numbers:

(a) GMVAE negative IWAE-1: 98.58 ± 0.46


(b) GMVAE negative IWAE-10: 96.12 ± 0.42
(c) GMVAE negative IWAE-100: 95.16 ± 0.42
(d) GMVAE negative IWAE-1000: 94.71 ± 0.39

CPU Numbers:
(a) GMVAE negative IWAE-1: 96.75 ± 0.56
(b) GMVAE negative IWAE-10: 94.30 ± 0.48
(c) GMVAE negative IWAE-100: 93.43 ± 0.46
(d) GMVAE negative IWAE-1000: 92.98 ± 0.41

Problem 4: Implementing the Semi-Supervised VAE (SSVAE) (20 points)
So far we have dealt with generative models in the unsupervised setting. We now consider semi-supervised learning on the MNIST dataset, where we have a small number of labeled pairs Xℓ = {(x(i) , y(i) )}_{i=1}^{100} in our training data and a large amount of unlabeled data Xu = {x(i) }_{i=101}^{60000} . A label y(i) for an image is simply the number that the image x(i) represents. We are interested in building a classifier that predicts the label y given the sample x. One approach is to train a classifier with standard supervised methods using only the labeled data. However, we would like to leverage the large amount of unlabeled data that we have to improve our classifier's performance.

We will use a latent variable generative model (a VAE), where the labels y are partially observed, and z are
always unobserved. The benefit of a generative model is that it allows us to naturally incorporate unlabeled
data into the maximum likelihood objective function simply by marginalizing y when it is unobserved. We
will implement the Semi-Supervised VAE (SSVAE) for this task, which follows the generative process specified
below:
p(z) = N (z|0, I)
p(y) = Categorical(y|π) = 1/10
pθ (x|y, z) = Bern(x|fθ (y, z))
where π = (1/10, . . . , 1/10) is a fixed uniform prior over the 10 possible labels and each sequence of pixels x is
modeled by a Bernoulli random variable parameterized by the output of a neural network decoder fθ (·).


Figure 1: Graphical model for SSVAE. Gray nodes denote observed variables; unshaded nodes denote latent
variables. Left: SSVAE for the setting where the labels y are unobserved; Right: SSVAE where some data
points (x, y) have observed labels.

To train a model on the datasets Xℓ and Xu , the principle of maximum likelihood suggests that we find the model pθ which maximizes the likelihood over both datasets. Assuming the samples from Xℓ and Xu are drawn i.i.d., this translates to the following objective

max_{θ} Σ_{x∈Xu} log pθ (x) + Σ_{(x,y)∈Xℓ} log pθ (x, y),

where
pθ (x) = Σ_{y∈Y} ∫ pθ (x, y, z) dz
pθ (x, y) = ∫ pθ (x, y, z) dz.

To overcome the intractability of exact marginalization of the latent variables z, we will instead maximize their
respective evidence lower bounds,
max_{θ,φ} Σ_{x∈Xu} ELBO(x; θ, φ) + Σ_{(x,y)∈Xℓ} ELBO(x, y; θ, φ),

where we introduce some amortized inference model qφ (y, z|x) = qφ (y|x)qφ (z|x, y). Specifically,

qφ (y|x) = Categorical(y|fφ (x))

qφ (z|x, y) = N (z|µφ (x, y), diag(σφ2 (x, y)))

where the parameters of the Gaussian distribution are obtained through a forward pass of the encoder. We note
that qφ (y|x) = Categorical(y|fφ (x)) is actually an MLP classifier that is also a part of the inference model, and
it predicts the probability of the label y given the observed data x.

We use this amortized inference model to construct the ELBOs.


 
ELBO(x; θ, φ) = Eqφ (y,z|x) [ log ( pθ (x, y, z) / qφ (y, z|x) ) ]
ELBO(x, y; θ, φ) = Eqφ (z|x,y) [ log ( pθ (x, y, z) / qφ (z|x, y) ) ].

However, Kingma et al. (2014)1 observed that maximizing the lower bound of the log-likelihood is not sufficient
to learn a good classifier. Instead, they proposed to introduce an additional training signal that directly trains
the classifier on the labeled data
max_{θ,φ} Σ_{x∈Xu} ELBO(x; θ, φ) + Σ_{(x,y)∈Xℓ} ELBO(x, y; θ, φ) + α Σ_{(x,y)∈Xℓ} log qφ (y|x),

where α ≥ 0 weights the importance of the classification accuracy. In this problem, we will consider a simpler
variant of this objective that works just as well in practice,
max_{θ,φ} Σ_{x∈X} ELBO(x; θ, φ) + α Σ_{(x,y)∈Xℓ} log qφ (y|x).

It is worth noting that the introduction of the classification loss has a natural interpretation as maximizing the
ELBO subject to the soft constraint that the classifier qφ (y|x) (which is a component of the amortized inference
model) achieves good performance on the labeled dataset. This approach of controlling the generative model
by constraining its inference model is thus a form of amortized inference regularization.2
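As a concrete illustration, a hypothetical training-step sketch of this simplified objective is given below. The attribute names (negative_elbo_bound, cls.classify) follow the solution snippets in this document; the actual training loop lives in the provided run script and may differ.

import torch.nn.functional as F

def ssvae_loss(model, x_all, x_labeled, y_labeled, alpha=1.0):
    # -ELBO(x) term over the inputs, with y marginalized out inside the model.
    nelbo, kl_z, kl_y, rec = model.negative_elbo_bound(x_all)
    # Classification term on the small labeled set: alpha * E[-log q_phi(y|x)].
    y_logits = model.cls.classify(x_labeled)
    ce = F.cross_entropy(y_logits, y_labeled)
    return nelbo + alpha * ce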

1. [1 point] Run python run_ssvae.py --gw 0. The --gw flag denotes how much weight to put on the
ELBO(x) term in the objective function; scaling the term by zero corresponds to a traditional supervised
learning setting on the small labeled dataset only, where we ignore the unlabeled data. Report your
classification accuracy on the test set after the run completes (30000 iterations).

Solution:
Classifier accuracy using only the limited labeled data: 73.5%
2. [10 points] Implement the negative_elbo_bound function in ssvae.py. Note that the function expects
as output the negative Evidence Lower Bound as well as its decomposition into the following terms,
 
−ELBO(x; θ, φ) = −Eqφ (y|x) [ log ( p(y) / qφ (y|x) ) ] − Eqφ (y|x) Eqφ (z|x,y) [ log ( p(z) / qφ (z|x, y) ) + log pθ (x|z, y) ]
               = DKL (qφ (y|x)||p(y)) + Eqφ (y|x) [ DKL (qφ (z|x, y)||p(z)) ] + Eqφ (y,z|x) [− log pθ (x|z, y)],
where the three terms are, respectively, the label KL term (KLy ), the latent KL term (KLz ), and the reconstruction term.

Since there are only ten labels, we shall compute the expectations with respect to qφ (y|x) exactly, while
using a single Monte Carlo sample of the latent variables z sampled from each qφ (z|x, y) when dealing
with the reconstruction term. In other words, we approximate the negative ELBO with
DKL (qφ (y|x)||p(y)) + Σ_{y∈Y} qφ (y|x) [ DKL (qφ (z|x, y)||p(z)) − log pθ (x|z(y) , y) ],

where z(y) ∼ qφ (z|x, y) denotes a sample from the inference distribution when conditioned on a possible
(x, y) pairing. The functions kl_normal and kl_cat in utils.py will be useful.
1 Kingma et al. Semi-Supervised Learning with Deep Generative Models. Neural Information Processing Systems, 2014.
2 Shu et al. Amortized Inference Regularization. Neural Information Processing Systems, 2018.

Solution:
Code:
def negative_elbo_bound(self, x):
    y_logits = self.cls.classify(x)
    y_logprob = F.log_softmax(y_logits, dim=1)
    y_prob = torch.softmax(y_logprob, dim=1)  # (batch, y_dim)

    # Duplicate y based on x's batch size. Then duplicate x
    y = np.repeat(np.arange(self.y_dim), x.size(0))
    y = x.new(np.eye(self.y_dim)[y])
    x = ut.duplicate(x, self.y_dim)

    m, v = self.enc.encode(x, y)
    z = ut.sample_gaussian(m, v)
    x_logits = self.dec.decode(z, y)

    kl_y = ut.kl_cat(y_prob, y_logprob, np.log(1.0 / self.y_dim))
    kl_z = ut.kl_normal(m, v, self.z_prior[0], self.z_prior[1])  # (y_dim * batch)
    rec = -ut.log_bernoulli_with_logits(x, x_logits)             # (y_dim * batch)

    # Compute the expected reconstruction and kl (under the distribution q(y|x))
    rec = (y_prob.t() * rec.reshape(self.y_dim, -1)).sum(0)      # (batch,)
    kl_z = (y_prob.t() * kl_z.reshape(self.y_dim, -1)).sum(0)    # (batch,)

    # Reduce to means
    kl_y, kl_z, rec = kl_y.mean(), kl_z.mean(), rec.mean()
    nelbo = rec + kl_z + kl_y
    return nelbo, kl_z, kl_y, rec
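The kl_cat helper is provided in utils.py; a sketch of the categorical KL divergence it presumably computes (against a prior supplied in log space, here the scalar log(1/y_dim) for the uniform prior) is:

import torch

def kl_cat(q, log_q, log_p):
    # KL( Categorical(q) || Categorical(exp(log_p)) ), summed over the category dimension.
    # log_p may be a scalar (e.g. log(1 / y_dim)) thanks to broadcasting.
    return (q * (log_q - log_p)).sum(-1)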

3. [9 points] Run python run_ssvae.py. This will run the SSVAE with the ELBO(x) term included, and
thus perform semi-supervised learning. Report your classification accuracy on the test set after the run
completes.
SSVAE Accuracy: 93.52 ± 1.09%

Bonus: Style and Content Disentanglement in SVHN (10 points)
A curious property of the SSVAE graphical model is that, in addition to the latent variable y learning to encode the content (i.e., the label) of the image, the latent variables z also learn to encode the style of the image. We shall demonstrate this phenomenon on the SVHN dataset. To make the problem simpler, we will only consider the fully-supervised scenario where y is fully observed. This yields the fully-supervised VAE, shown below.


Figure 2: Graphical model for FSVAE. Gray nodes denote observed variables.

1. [3 points] Since the fully-supervised VAE (FSVAE) always conditions on an observed y in order to generate the sample x, it is a special case of the conditional variational autoencoder. Derive the Evidence Lower
Bound ELBO(x; θ, φ, y) of the conditional log probability log pθ (x|y). You are allowed to introduce the
amortized inference model qφ (z|x, y).

Solution:
The only difference from the vanilla VAE is that we condition on the additional information y. The
prior p(z) is not conditioned on y since y does not point to z in the generative model.
 
log pθ (x | y) ≥ Eqφ (z|x,y) [ log ( p(z) pθ (x | z, y) / qφ (z | x, y) ) ] = Eqφ (z|x,y) [log pθ (x | z, y)] − DKL (qφ (z | x, y)||p(z)).    (8)

2. [7 points] Implement the negative_elbo_bound function in fsvae.py. In contrast to the MNIST dataset,
the SVHN dataset has a continuous observation space

p(z) = N (z|0, I)

pθ (x|y, z) = N (x|µθ (y, z), diag(σθ2 (y, z))).


To simplify the problem further, we shall assume that the variance is fixed at diag(σθ2 (y, z)) = (1/10) I,
and only train the decoder mean function µθ . Once you have implemented negative_elbo_bound, run python run_fsvae.py. The default settings will use a max iteration count of 1 million. We suggest checking the image quality of clip(µθ (y, z)), where clip(·) clips outputs element-wise to the range [0, 1], every 10k iterations and stopping the training whenever the digit classes are recognizable.3
Once you have learned a sufficiently good model, generate twenty latent variables z(0) , . . . , z(19) i.i.d.∼ p(z). Then, generate 200 SVHN digits (a single image tiled in a grid of 10 × 20 digits) where the digit in the ith row, jth column (assuming zero-indexing) is the image clip(µθ (y = i, z = z(j) )).

Solution:
Code:
3 An important note about sample quality inspection when modeling continuous images using VAEs with Gaussian observation
decoders: modeling continuous image data distributions is quite challenging. Rather than truly sampling x ∼ pθ (x|y), a common
heuristic is to simply sample clip(µθ (y, z)) instead.

def negative_elbo_bound(self, x, y):
    m, v = self.enc.encode(x, y)
    z = ut.sample_gaussian(m, v)
    x_mean = self.dec.decode(z, y)

    kl_z = ut.kl_normal(m, v, self.z_prior[0], self.z_prior[1]).mean()
    rec = -ut.log_normal(x, x_mean, 0.1 * torch.ones_like(x_mean)).mean()
    nelbo = kl_z + rec
    return nelbo, kl_z, rec
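A hedged sketch of the conditional sampling described above is given below. The trained model object and its dec.decode signature follow the earlier snippets, and the flattened 3 × 32 × 32 SVHN output shape is an assumption.

import torch
from torchvision.utils import save_image

with torch.no_grad():
    z = torch.randn(20, model.z_dim)                 # z(0), ..., z(19) ~ p(z), one per column
    rows = []
    for i in range(10):                              # one row per digit class y = i
        y = torch.zeros(20, 10)
        y[:, i] = 1.0                                # one-hot label
        mu = model.dec.decode(z, y)                  # decoder mean mu_theta(y, z)
        rows.append(mu.clamp(0, 1))                  # clip(.) to [0, 1]
    grid = torch.cat(rows, dim=0).reshape(200, 3, 32, 32)   # assumes flattened RGB 32x32 output
    save_image(grid, 'fsvae_samples.png', nrow=20)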

Visualization of FSVAE samples

