

CS 236, Fall 2018


Midterm Exam
This exam is worth 130 points. You have 3 hours to complete it. You are allowed to consult notes and books and to use a laptop, but no communication or network access is allowed. Good luck!

Stanford University Honor Code


The Honor Code is the University’s statement on academic integrity written by students in 1921. It articulates University expectations of students and faculty in establishing and maintaining the highest standards in academic work:

• The Honor Code is an undertaking of the students, individually and collectively:

– that they will not give or receive aid in examinations; that they will not give or receive unpermitted
aid in class work, in the preparation of reports, or in any other work that is to be used by the
instructor as the basis of grading;
– that they will do their share and take an active part in seeing to it that others as well as themselves
uphold the spirit and letter of the Honor Code.

• The faculty on its part manifests its confidence in the honor of its students by refraining from proctoring
examinations and from taking unusual and unreasonable precautions to prevent the forms of dishonesty
mentioned above. The faculty will also avoid, as far as practicable, academic procedures that create
temptations to violate the Honor Code.
• While the faculty alone has the right and obligation to set academic requirements, the students and
faculty will work together to establish optimal conditions for honorable academic work.

Signature
I attest that I have not given or received aid in this examination, and that I have done my share and taken an
active part in seeing to it that others as well as myself uphold the spirit and letter of the Stanford University
Honor Code.

Name / SUnetID:

Signature:

Question    Score        Question    Score
1           / 15         5           / 20
2           / 20         6           / 20
3           / 15         7           / 20
4           / 20

Total score: / 130



Note: Partial credit will be given for partially correct answers. Zero points will be given to
answers left blank.

1. [15 points total] Comparison of Models


In this course, we discussed four major types of generative models: autoregressive models, variational
autoencoders, flow models, and generative adversarial networks.
(a) [5 points] Suppose we are interested in quickly generating i.i.d. samples from a trained model.
Which of these models can we sample from efficiently (i.e., in a time polynomial in the number
of dimensions of a sample, such as in linear time)?
Answer: All of them.
(b) [5 points] Suppose we are interested in exactly evaluating the likelihood of a data point under
the trained model. In which of these models can we exactly evaluate a data point’s likelihood in
an efficient way (i.e., in a time polynomial in the number of dimensions of a sample, such as in
linear time)?
Answer: Autoregressive and flow models allow for exact likelihood evaluation.
(c) [5 points] Suppose we are interested in learning a latent representation for new data points.
Which of these models are most appropriate for this task, and why?
Answer: In a VAE, qφ(z|x) can serve as an encoder. In a flow model, f⁻¹(x) can serve as an
encoder. GANs can also learn representations, e.g., via BiGAN.
2. [20 points total] Masked Autoregressive Distribution Estimation (MADE)
An autoencoder learns a feed-forward, hidden representation h(x) of its input x ∈ RD such that, from
it, we can obtain a reconstruction x̂ which is as close as possible to x. Specifically, we have

h(x) = g(b + Wx)


x̂ = sigmoid(c + Vh(x))

where W and V are weight matrices, b and c are bias vectors, g is a nonlinear activation function and
sigmoid(a) = 1/(1 + exp(−a)).
MADE modifies the autoencoder to build an autoregressive model. To satisfy the autoregressive
property p(x) = ∏_{d=1}^{D} p(xd | x<d), we use the d-th output of MADE, x̂d, to parameterize the conditional
probability p(xd | x<d), which means that x̂d must depend only on the preceding inputs x<d. In order
to enforce this property, MADE multiplies each weight matrix element-wise by a mask matrix. For a single hidden
layer autoencoder, we write

h(x) = g(b + (W ⊙ MW)x)

x̂ = sigmoid(c + (V ⊙ MV)h(x))

where MW and MV are the masks for W and V respectively, and ⊙ denotes element-wise multiplication.
Note that the entries of a mask matrix can only be 0 or 1.
In this question, we consider a MADE model with a single hidden layer. The dimension of the input
x is assumed to be D, and the number of hidden units in h(x) is D − 1.

(a) [2 points] What are the shapes (number of rows and columns) of the mask matrices MW and
MV ?
Answer: Shape of MW : (D − 1, D) Shape of MV : (D, D − 1)
(b) [8 points] Consider two candidate MADE models as shown below. For both models, D = 3, and
there are 2 hidden units. In the figures, an arrow connecting two neurons a → b indicates
that the value of b depends on the value of a. Check whether they satisfy the autoregressive
property. If yes, write down the mask matrices MW , MV and compute MV MW . Else, explain
why the autoregressive property is violated.

[Figure: two candidate MADE architectures, (a) and (b). Each has inputs x1, x2, x3, hidden units h1, h2, and outputs x̂1, x̂2, x̂3; the arrows indicating which units feed into which are given in the original figure.]

Answer: (a) is a valid MADE. (b) is not, because x̂1 depends on x1, and the autoregressive property requires that x̂1 depend on no input at all.
For (a), the mask matrices and their product are

MW = | 1 0 0 |        MV = | 0 0 |        MV MW = | 0 0 0 |
     | 1 1 0 |             | 1 0 |                | 1 0 0 |
                           | 0 1 |                | 1 1 0 |

(c) [5 points] Let M = MV MW . What is the maximum number of non-zero entries that M can
have in order to preserve the autoregressive property? Briefly explain your results. [Hint: The
answer should be a function of D.]
Answer: D(D − 1)/2. To preserve the autoregressive property, M must be strictly lower triangular, and a strictly lower triangular D × D matrix has at most D(D − 1)/2 non-zero entries.
(d) [5 points] It can often be advantageous to have direct connections between the input x and
output layer x̂. In this context, the reconstruction part of MADE becomes

x̂ = sigmoid(c + (V ⊙ MV)h(x) + (A ⊙ MA)x),

where A is the weight matrix that directly connects input and output, and MA is its mask matrix.
What is the maximum number of non-zero entries that MA can have to satisfy the autoregressive
property? Briefly explain your results. [Hint: The answer should be a function of D.]
Answer: D(D − 1)/2. The mask MA must be strictly lower triangular so that the direct connections only pass x<d into x̂d, giving at most D(D − 1)/2 non-zero entries.
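For illustration, the following is a minimal NumPy sketch of the degree-based mask construction from the MADE paper for a single hidden layer with D = 3 and two hidden units; the degree assignment for the hidden units is an assumed example and need not match the particular figure above.

    import numpy as np

    D = 3                                  # input dimension
    m_input = np.arange(1, D + 1)          # degree of input/output d is d
    m_hidden = np.array([1, 2])            # assumed degrees for the 2 hidden units

    # Hidden unit k may see input d iff m(k) >= d.
    MW = (m_hidden[:, None] >= m_input[None, :]).astype(int)   # shape (2, D)
    # Output d may see hidden unit k iff d > m(k).
    MV = (m_input[:, None] > m_hidden[None, :]).astype(int)    # shape (D, 2)

    # MV MW encodes which inputs each output depends on; it must be
    # strictly lower triangular to satisfy the autoregressive property.
    M = MV @ MW
    assert np.all(np.triu(M) == 0)         # nothing on or above the diagonal
    print(MW, MV, M, sep="\n\n")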

3. [15 points total] Variational Autoencoders Basics


For each of the following questions, state true or false. Explain your answer for full points.
(a) [5 points] Suppose we are training a VAE where the prior p(z) is such that each dimension of
the latent variable z is Bernoulli distributed. We can use reparameterization with z to get an
unbiased estimate of the gradient of the variational objective function. (False: the reparameterization trick requires a continuous latent distribution, and a Bernoulli sample cannot be written as a differentiable function of its parameter and an independent noise source.)

(b) [5 points] Suppose we have trained a VAE parameterized by φ and θ. We can obtain a sample
from pdata (x) by first drawing a sample z0 ∼ pθ(z), then drawing another sample x0 ∼ pθ(x|z0).
(False: this procedure samples from the model distribution pθ(x), which equals pdata(x) only if the model is perfect; True if they say that the VAE is optimal.)

(c) [5 points] After learning a VAE model on a dataset, Alice gives Bob the trained decoder pθ(x|z)
and the prior p(z) she used. However, she forgets to give Bob the encoder. Given sufficient
computation, can Bob still infer a latent representation for a new test point x0? (True: the posterior p(z|x0) ∝ pθ(x0|z)p(z) depends only on the decoder and the prior, so Bob can approximate it, or a point estimate of z, given enough computation.)
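As an illustrative sketch of one way Bob could do this, the PyTorch snippet below finds a MAP estimate of z for x0 by gradient ascent on log pθ(x0|z) + log p(z); the function name decoder_log_px_given_z is a hypothetical stand-in for Bob's decoder, and a standard normal prior is assumed.

    import torch

    def infer_latent(decoder_log_px_given_z, x0, latent_dim, steps=500, lr=1e-2):
        # decoder_log_px_given_z(x, z) is assumed to return log p_theta(x | z)
        # as a differentiable scalar; p(z) = N(0, I) is assumed for the prior.
        z = torch.zeros(latent_dim, requires_grad=True)
        opt = torch.optim.Adam([z], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            log_prior = -0.5 * (z ** 2).sum()          # log N(z; 0, I) up to a constant
            loss = -(decoder_log_px_given_z(x0, z) + log_prior)
            loss.backward()
            opt.step()
        return z.detach()                              # MAP estimate of the latent code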

4. [20 points total] Evidence Lower Bound


Consider the joint distribution of a latent variable model denoted by p(x, z). The model is capable of
sampling only two images {x(1) , x(2) }. You may imagine that these are two binarized MNIST images
where x(i) ∈ {0, 1}^784. Note that this latent variable model is equipped with a scalar latent variable
z ∈ R. Furthermore, this model is described by

p(z) = N(z; 0, 1)

p(x | z) = 1 if z ≥ 0 ∧ x = x(1)
           0 if z ≥ 0 ∧ x ≠ x(1)
           1 if z < 0 ∧ x = x(2)
           0 if z < 0 ∧ x ≠ x(2)

where p(x | z) is a probability mass function and p(z) is a probability density function.
In other words, the generative model will always sample the first image x(1) when conditioned on
z ≥ 0, and the model will always sample the second image x(2) when conditioned on z < 0. N (z; 0, 1)
indicates a Gaussian distribution with mean zero and variance 1.

(a) [4 points] We can consider the log-likelihood of some image x under our model

ℓ_like(x) := log p(x).

Based on the model described above, what is ℓ_like(x(1))?

Answer: ℓ_like(x(1)) = log P(z ≥ 0) = log 0.5 = −log 2
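A quick Monte Carlo sanity check of this value: under the model above, p(x(1) | z) is the indicator of z ≥ 0, so p(x(1)) = P(z ≥ 0) under the standard normal prior. A minimal NumPy sketch:

    import numpy as np

    rng = np.random.default_rng(0)
    z = rng.standard_normal(1_000_000)     # z ~ N(0, 1)
    p_x1 = np.mean(z >= 0.0)               # Monte Carlo estimate of P(z >= 0)
    print(p_x1, np.log(p_x1))              # ~0.5 and ~-0.693 = -log 2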
(b) [2 points] One can also reason about the posterior p(z | x). What is the set of all points z for
which the posterior density is positive p(z | x(1) ) > 0 when conditioned on x(1) ?
Answer: [0, +∞)
(c) [8 points] The log-likelihood can be lower bounded by the Evidence Lower Bound (ELBO) as
follows:

ℓ_ELBO(x; q) := E_q(z)[ log( p(x, z) / q(z) ) ].

Note that this lower bound is a function of some variational distribution q(z). Prove that

E_q(z)[ log( p(x, z) / q(z) ) ] = log p(x) − DKL(q(z) ‖ p(z | x)).

Answer:

E_q(z)[ log( p(x, z) / q(z) ) ] = log p(x) + E_q(z)[ log( p(z | x) / q(z) ) ]     (1)
                                = log p(x) − E_q(z)[ log( q(z) / p(z | x) ) ]      (2)
                                = log p(x) − DKL(q(z) ‖ p(z | x)).                 (3)

(d) [2 points] For parts (d) and (e), suppose q(z) is a univariate Gaussian distribution with positive
variance. What is the set of all points z for which q(z) > 0?
Answer: R

(e) [4 points] Using what you have determined so far, select the option that is correct.
The log-likelihood ℓ_like(x(1)) is:
i. Finite and negative
ii. 0
iii. Finite and positive
iv. None of the above
Answer: Finite and negative.
For any q that is univariate Gaussian with positive variance, ℓ_ELBO(x(1); q) is:
i. Finite and negative
ii. 0
iii. Finite and positive
iv. None of the above
Answer: None of the above. The ELBO is −∞: q(z) assigns non-zero probability mass to the interval (−∞, 0), but p(z | x(1)) assigns zero probability mass to that interval, so DKL(q(z) ‖ p(z | x(1))) is infinite.

5. [20 points total] Normalizing Flow Models Basics

(a) [5 points] Let Z ∼ Uniform[−2, 3] and X = exp(Z). What is pX(5)?
Answer: By change of variables, pX(x) = (1/x) pZ(log x), so pX(5) = (1/5)(1/5) = 1/25.
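A minimal NumPy sketch that checks this value empirically by estimating the density of X near 5 from samples (the window half-width eps is an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(0)
    z = rng.uniform(-2.0, 3.0, size=2_000_000)           # Z ~ Uniform[-2, 3]
    x = np.exp(z)                                         # X = exp(Z)

    eps = 0.01
    p_hat = np.mean(np.abs(x - 5.0) < eps) / (2 * eps)    # P(|X - 5| < eps) / (2 eps)
    print(p_hat)                                          # ~0.04 = 1/25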
For each of the statements below, state true or false. Explain your answer for full points.
(b) [5 points] For efficient learning and inference in flow models, any discrete or continuous distribu-
tion which allows for efficient sampling and likelihood evaluation can be used to specify the prior
distribution over latent variables.
Answer: (False, only continuous distributions can be used.)
(c) [5 points] In Parallel WaveNet, evaluating the likelihood assigned by the student model for any
external data point is computationally intractable (i.e., requires exponential time in the number
of dimensions of the sample).
Answer: (False, they are expensive to compute but not computationally intractable.)
(d) [5 points] A permutation matrix is defined as a binary square matrix with {0, 1} entries such
that every column and every row sums to 1. The Jacobian for a RealNVP model can be expressed
as the product of a series of (upper or lower) triangular matrices and permutation matrices.
Answer: (True, a permutation matrix does not affect invertibility)

6. [20 points total] Flow + GAN: Maximum Likelihood vs. Adversarial Training
Let pdata (x) denote a data distribution that we are trying to learn with a generative model, where
x ∈ Rn . Consider the simple generative model parameterized by a single invertible matrix A ∈ Rn×n ,
where a sample is obtained by first sampling an n-dimensional vector z ∼ p(z) from a given distribution
p(z), and returning the matrix-vector product Az. Let D be a training set of samples from pdata (x).

(a) [10 points] Write a loss function L(A) that trains this model using maximum likelihood.
Answer:
L(A) = −Ex∼pdata (x) [log p(A−1 x) + log | det(A−1 )|]
(b) [10 points] Write a loss function L(A) that trains this model as the generator in a generative
adversarial network. You may assume that a discriminator Dφ : Rn → R that outputs the
probability that the input is real has been defined and is trained alongside the generative model.
Answer:
L(A) = Ez∼p(z) [log(1 − Dφ (Az))]
or
L(A) = −Ez∼p(z) [log Dφ (Az)]
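For illustration, a minimal PyTorch sketch of both objectives, assuming p(z) = N(0, I) and a discriminator module (called discriminator below) defined and trained elsewhere; these names are placeholders for this sketch, not part of the exam.

    import math
    import torch

    def mle_loss(A, x_batch):
        # Negative log-likelihood under x = A z, z ~ N(0, I):
        # log p(x) = log N(A^{-1} x; 0, I) + log |det A^{-1}|.
        n = A.shape[0]
        z = torch.linalg.solve(A, x_batch.T).T                   # A^{-1} x, shape (B, n)
        log_pz = -0.5 * (z ** 2).sum(dim=1) - 0.5 * n * math.log(2 * math.pi)
        _, logabsdet = torch.linalg.slogdet(A)                   # log |det A|
        return -(log_pz - logabsdet).mean()

    def gan_generator_loss(A, z_batch, discriminator):
        # Non-saturating generator loss; discriminator maps R^n -> (0, 1).
        x_fake = z_batch @ A.T
        return -torch.log(discriminator(x_fake)).mean()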

7. [20 points total] Flow + VAE: Augmenting variational posteriors


We wish to use flexible flow models for variational autoencoders. Let x ∈ RD denote the inputs, z the
latent variables, pθ(x|z) the generative model, p(z) the prior, and rφ(z|x) the basic inference model,
a Gaussian distribution N(z; µφ(x), diag(σφ(x)^2)).
Instead of using rφ (z|x) directly as the inference model, we will transform this distribution using a
normalizing flow to obtain a richer variational posterior distribution qφ,ψ . Specifically, let fψ : RF →
RF be an invertible transformation, µφ : RD → RF , and σφ : RD → RF . We use the following
procedure to sample z from qφ,ψ (z|x) given x:
• Sample z̃ ∼ N (z; µφ (x), diag(σφ (x)2 ))
• Compute z = fψ (z̃)

(a) [8 points] Derive an expression for log qφ,ψ (z|x). The function should take x and z as input,
output a scalar value, and depend on µφ , σφ , and fψ . You can use N (u; µ, diag(σ 2 )) to denote
the pdf for normal distribution with mean µ and covariance diag(σ 2 ) evaluated at u.
Answer: The log probability of z̃ is

log N(z̃; µφ(x), diag(σφ(x)^2))

and since z = fψ(z̃), by the change of variables formula we have

log qφ,ψ(z|x) = log N(fψ⁻¹(z); µφ(x), diag(σφ(x)^2)) + log |det( ∂fψ⁻¹(z) / ∂z )|
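A minimal PyTorch sketch of this computation, assuming fψ is implemented as a flow module whose hypothetical inverse(z) method returns both z̃ = fψ⁻¹(z) and log |det ∂fψ⁻¹(z)/∂z| (many flow layers expose exactly this pair); mu and sigma stand for µφ(x) and σφ(x).

    import torch

    def log_q(z, mu, sigma, flow):
        # flow.inverse(z) is assumed to return (z_tilde, log_det_inv) with
        # z_tilde = f_psi^{-1}(z) and log_det_inv = log |det d f_psi^{-1}(z) / dz|.
        z_tilde, log_det_inv = flow.inverse(z)
        base = torch.distributions.Normal(mu, sigma)       # N(mu, diag(sigma^2))
        log_base = base.log_prob(z_tilde).sum(dim=-1)      # log N(z_tilde; mu, diag(sigma^2))
        return log_base + log_det_inv                      # log q_{phi,psi}(z | x)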

(b) [12 points] Consider rφ (z|x) as the basic Gaussian inference model (without using the flow
model), with the following sampling process:
• Sample z ∼ N (z; µφ (x), diag(σφ (x)2 ))
Show that the best evidence lower bound we can achieve with qφ,ψ is at least as tight as the best
one we can achieve with rφ , i.e.,

max_{θ,φ,ψ} ELBO(x; pθ, qφ,ψ) ≥ max_{θ,φ} ELBO(x; pθ, rφ)

where
ELBO(x; p, q) = Eq(z|x) [log p(x, z) − log q(z|x)]
You may assume that fψ can represent any invertible function RF → RF .
Answer: For any instance of φ selected for rφ , we can always choose fψ to be the identity function
for qφ,ψ . This function is invertible and preserves volume. Therefore, for this instance of ψ,

qφ,ψ (z|x) = rφ (z|x)

and for any solution rφ, we have a qφ,ψ that achieves the same ELBO. Therefore, the maximum ELBO
of qφ,ψ is greater than or equal to that of rφ.
