CS 236, Fall 2018 Midterm Exam: Stanford University Honor Code
– that they will not give or receive aid in examinations; that they will not give or receive unpermitted
aid in class work, in the preparation of reports, or in any other work that is to be used by the
instructor as the basis of grading;
– that they will do their share and take an active part in seeing to it that others as well as themselves
uphold the spirit and letter of the Honor Code.
• The faculty on its part manifests its confidence in the honor of its students by refraining from proctoring
examinations and from taking unusual and unreasonable precautions to prevent the forms of dishonesty
mentioned above. The faculty will also avoid, as far as practicable, academic procedures that create
temptations to violate the Honor Code.
• While the faculty alone has the right and obligation to set academic requirements, the students and
faculty will work together to establish optimal conditions for honorable academic work.
Signature
I attest that I have not given or received aid in this examination, and that I have done my share and taken an
active part in seeing to it that others as well as myself uphold the spirit and letter of the Stanford University
Honor Code.
Name / SUnetID:
Signature:
Question 1: / 15    Question 5: / 20
Question 2: / 20    Question 6: / 20
Question 3: / 15    Question 7: / 20
Question 4: / 20
Note: Partial credit will be given for partially correct answers. Zero points will be given to
answers left blank.
where W and V are weight matrices, b and c are bias vectors, g is a nonlinear activation function and
sigmoid(a) = 1/(1 + exp(−a)).
MADE modifies the autoencoder to build an autoregressive model. To satisfy the autoregressive
property p(x) = ∏_{d=1}^{D} p(x_d | x_{<d}), we use the d-th output of MADE, x̂_d, to parameterize the conditional
probability p(x_d | x_{<d}), which means that x̂_d must depend only on the preceding inputs x_{<d}. In order
to enforce this property, MADE multiplies each weight matrix by a mask matrix. For a single hidden
layer autoencoder, we write

h(x) = g(b + (W ⊙ MW) x)
x̂ = sigmoid(c + (V ⊙ MV) h(x))

where MW and MV are the masks for W and V respectively, and ⊙ denotes element-wise multiplication. Note that the entries of a mask matrix can only be 0 or 1.
In this question, we consider a MADE model with a single hidden layer. The dimension of the input
x is assumed to be D, and the number of hidden units in h(x) is D − 1.
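As an illustration (not part of the exam), here is a minimal numpy sketch of the degree-based mask construction from the MADE paper and the masked forward pass, assuming g = tanh; the dimension and weight values are arbitrary placeholders:

import numpy as np

rng = np.random.default_rng(0)
D = 4                 # input dimension (placeholder value)
H = D - 1             # number of hidden units, as in this question

# Assign hidden unit k the degree m[k] in {1, ..., D-1}.
m = np.arange(1, D)

# MW[k, d] = 1 iff hidden unit k may depend on input x_{d+1}, i.e. (d+1) <= m[k].
MW = (np.arange(1, D + 1)[None, :] <= m[:, None]).astype(float)   # shape (D-1, D)
# MV[d, k] = 1 iff output x_hat_{d+1} may depend on hidden unit k, i.e. m[k] < d+1.
MV = (m[None, :] < np.arange(1, D + 1)[:, None]).astype(float)    # shape (D, D-1)

W = 0.1 * rng.normal(size=(H, D)); b = np.zeros(H)
V = 0.1 * rng.normal(size=(D, H)); c = np.zeros(D)

def made_forward(x):
    h = np.tanh(b + (W * MW) @ x)                  # h(x) = g(b + (W ⊙ MW) x)
    return 1 / (1 + np.exp(-(c + (V * MV) @ h)))   # x_hat = sigmoid(c + (V ⊙ MV) h(x))

# The product MV MW encodes which inputs each output can see;
# it is strictly lower triangular, so x_hat_d depends only on x_{<d}.
M = MV @ MW
print(np.allclose(np.triu(M), 0))   # True
print(made_forward(rng.normal(size=D)))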
(a) [2 points] What are the shapes (number of rows and columns) of the mask matrices MW and
MV ?
Answer: Shape of MW: (D − 1, D); shape of MV: (D, D − 1).
(b) [8 points] Consider two candidate MADE models as shown below. For both models, D = 3, and
there are 2 hidden units. In the figures, an arrow connecting two neurons a → b indicates
that the value of b depends on the value of a. Check whether they satisfy the autoregressive
property. If yes, write down the mask matrices MW, MV and compute MV MW. Otherwise, explain
why the autoregressive property is violated.
[Figure: two candidate MADE connectivity diagrams, (a) and (b), each with inputs x1, x2, x3 and hidden units h1, h2.]
Answer: (a) is a valid MADE. (b) is not, because x̂1 cannot depend on x1.

For (a):

MW =
  1 0 0
  1 1 0

MV =
  0 0
  1 0
  0 1

MV MW =
  0 0 0
  1 0 0
  1 1 0
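A quick numpy check (not part of the exam) that the masks above for model (a) satisfy the autoregressive property:

import numpy as np

MW = np.array([[1, 0, 0],
               [1, 1, 0]])   # mask for W, shape (D-1, D) with D = 3
MV = np.array([[0, 0],
               [1, 0],
               [0, 1]])      # mask for V, shape (D, D-1)

M = MV @ MW
print(M)                          # [[0 0 0], [1 0 0], [1 1 0]]
print(np.allclose(np.triu(M), 0)) # True: M is strictly lower triangular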
(c) [5 points] Let M = MV MW . What is the maximum number of non-zero entries that M can
have in order to preserve the autoregressive property? Briefly explain your results. [Hint: The
answer should be a function of D.]
Answer: D(D − 1)/2. The matrix M must be strictly lower triangular, and a strictly lower triangular
D × D matrix has at most 1 + 2 + · · · + (D − 1) = D(D − 1)/2 non-zero entries.
(d) [5 points] It can often be advantageous to have direct connections between the input x and
the output layer x̂. In this context, the reconstruction part of MADE becomes

x̂ = sigmoid(c + (V ⊙ MV) h(x) + (A ⊙ MA) x)
where A is the weight matrix that directly connects input and output, and MA is its mask matrix.
What is the maximum number of non-zero entries that MA can have to satisfy the autoregressive
property? Briefly explain your results. [Hint: The answer should be a function of D.]
Answer: D(D − 1)/2. The matrix MA must be strictly lower triangular: a direct connection from x_j to
x̂_d is allowed only if j < d, which again permits at most D(D − 1)/2 non-zero entries.
(b) [5 points] Suppose we have trained a VAE parameterized by φ and θ. We can obtain a sample
from pdata(x) by first drawing a sample z0 ∼ pθ(z), then drawing another sample x0 ∼ pθ(x | z0).
(False; this procedure samples from the model distribution pθ(x), which need not equal pdata(x). It is True only if the VAE is optimal, i.e., pθ(x) = pdata(x).)
(c) [5 points] After learning a VAE model on a dataset, Alice gives Bob the trained decoder pθ (x|z)
and the prior p(z) she used. However, she forgets to give Bob the encoder. Given sufficient
computation, can Bob still infer a latent representation for a new test point x0? (True; with enough computation Bob can evaluate the posterior pθ(z | x0) ∝ p(z) pθ(x0 | z) directly via Bayes' rule, without the encoder.)
p(z) = N(z; 0, 1)

p(x | z) = 1  if z ≥ 0 ∧ x = x(1)
           0  if z ≥ 0 ∧ x ≠ x(1)
           1  if z < 0 ∧ x = x(2)
           0  if z < 0 ∧ x ≠ x(2)

where p(x | z) is a probability mass function and p(z) is a probability density function.
In other words, the generative model will always sample the first image x(1) when conditioned on
z ≥ 0, and the model will always sample the second image x(2) when conditioned on z < 0. N (z; 0, 1)
indicates a Gaussian distribution with mean zero and variance 1.
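A minimal Python sketch of this sampling process (not part of the exam; the strings "x(1)" and "x(2)" are stand-ins for the two fixed images):

import numpy as np

rng = np.random.default_rng(0)

def sample():
    z = rng.standard_normal()              # z ~ N(0, 1)
    return "x(1)" if z >= 0 else "x(2)"    # deterministic choice given the sign of z

# Marginally, each image is generated with probability P(z >= 0) = P(z < 0) = 1/2.
print([sample() for _ in range(8)])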
(a) [4 points] We can consider the log-likelihood of some image x under our model, ℓ_like(x) := log p(x). Show that the evidence lower bound ℓ_ELBO(x; q) := E_{q(z)}[log(p(x, z)/q(z))], for an arbitrary distribution q(z), satisfies ℓ_ELBO(x; q) = log p(x) − DKL(q(z) ‖ p(z | x)).
Answer: Since p(x, z) = p(x) p(z | x),

E_{q(z)}[log (p(x, z)/q(z))] = log p(x) + E_{q(z)}[log (p(z | x)/q(z))]    (1)
                             = log p(x) − E_{q(z)}[log (q(z)/p(z | x))]    (2)
                             = log p(x) − DKL(q(z) ‖ p(z | x)).            (3)
(d) [2 points] For parts (d) and (e), suppose q(z) is a univariate Gaussian distribution with positive
variance. What is the set of all points z for which q(z) > 0?
Answer: R (the entire real line: a Gaussian with positive variance has strictly positive density everywhere).
(e) [4 points] Using what you have determined so far, select the option that is correct.
The log-likelihood ℓ_like(x(1)) is:
i. Finite and negative
ii. 0
iii. Finite and positive
iv. None of the above
Answer: Finite and negative. Since p(x(1)) = P(z ≥ 0) = 1/2, we have ℓ_like(x(1)) = log(1/2) ≈ −0.693.
For any q that is univariate Gaussian with positive variance, ℓ_ELBO(x(1); q) is:
i. Finite and negative
ii. 0
iii. Finite and positive
iv. None of the above
Answer: None of the above. The KL divergence DKL(q(z) ‖ p(z | x(1))) is infinite, since q(z) assigns non-zero probability mass to the interval (−∞, 0) while p(z | x(1)) assigns zero mass to that interval. Hence ℓ_ELBO(x(1); q) = log p(x(1)) − DKL(q(z) ‖ p(z | x(1))) = −∞, which is neither 0 nor finite.
(a) [5 points] Let Z ∼ Uniform[−2, 3] and X = exp(Z). What is pX(5)?
Answer: By the change of variables formula, pX(x) = (1/x) pZ(log x). Since log 5 ≈ 1.61 ∈ [−2, 3], we have pZ(log 5) = 1/5, so pX(5) = (1/5)(1/5) = 1/25.
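A quick Monte Carlo check of this value (a sketch, not part of the exam): estimate the density of X = exp(Z) near 5 from samples.

import numpy as np

rng = np.random.default_rng(0)
z = rng.uniform(-2.0, 3.0, size=2_000_000)    # Z ~ Uniform[-2, 3]
x = np.exp(z)                                 # X = exp(Z)

eps = 0.05
density_at_5 = np.mean(np.abs(x - 5.0) < eps) / (2 * eps)
print(density_at_5, 1 / 25)                   # both close to 0.04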
For each of the statements below, state true or false. Explain your answer for full points.
(b) [5 points] For efficient learning and inference in flow models, any discrete or continuous distribu-
tion which allows for efficient sampling and likelihood evaluation can be used to specify the prior
distribution over latent variables.
Answer: (False; only continuous distributions can be used. Flow models rely on the change-of-variables formula, which requires the prior to have a density, and an invertible map cannot turn a discrete prior into a continuous distribution over x.)
(c) [5 points] In Parallel Wavenet, evaluating the likelihood assigned by the student model for any
external data point is computationally intractable (i.e., requires exponential time in the number
of dimensions of the sample).
Answer: (False; likelihood evaluation for the student (inverse autoregressive flow) model is expensive because inverting the flow for an external data point must be done sequentially over the dimensions, but this takes linear rather than exponential time, so it is not computationally intractable.)
(d) [5 points] A permutation matrix is defined as a binary square matrix with {0, 1} entries such
that every column and every row sums to 1. The Jacobian for a RealNVP model can be expressed
as the product of a series of (upper or lower) triangular matrices and permutation matrices.
Answer: (True; each affine coupling layer has an (upper or lower) triangular Jacobian, and permuting the dimensions between layers contributes permutation-matrix factors, so by the chain rule the Jacobian of the full RealNVP model is a product of triangular matrices and permutation matrices.)
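To illustrate (not part of the exam), a small numpy sketch of a 2-D affine coupling layer followed by a permutation; the scale and shift networks s and t are toy placeholders. The coupling layer's Jacobian is lower triangular, and composing with the permutation matrix gives a permutation-times-triangular factor whose determinant is still cheap to compute.

import numpy as np

def s(x1): return np.tanh(x1)     # toy scale network (placeholder)
def t(x1): return 0.5 * x1        # toy shift network (placeholder)

def coupling(x):
    # y1 = x1, y2 = x2 * exp(s(x1)) + t(x1): an affine coupling layer
    x1, x2 = x
    return np.array([x1, x2 * np.exp(s(x1)) + t(x1)])

P = np.array([[0.0, 1.0],
              [1.0, 0.0]])        # permutation matrix swapping the two dimensions

def numerical_jacobian(f, x, eps=1e-6):
    J = np.zeros((len(x), len(x)))
    for j in range(len(x)):
        d = np.zeros(len(x)); d[j] = eps
        J[:, j] = (f(x + d) - f(x - d)) / (2 * eps)
    return J

x = np.array([0.3, -1.2])
J_coupling = numerical_jacobian(coupling, x)
print(np.allclose(J_coupling, np.tril(J_coupling)))            # True: triangular Jacobian
print(np.abs(np.linalg.det(P @ J_coupling)), np.exp(s(x[0])))  # determinants match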
6. [20 points total] Flow + GAN: Maximum Likelihood vs. Adversarial Training
Let pdata (x) denote a data distribution that we are trying to learn with a generative model, where
x ∈ Rn . Consider the simple generative model parameterized by a single invertible matrix A ∈ Rn×n ,
where a sample is obtained by first sampling an n-dimensional vector z ∼ p(z) from a given distribution
p(z), and returning the matrix-vector product Az. Let D be a training set of samples from pdata (x).
(a) [10 points] Write a loss function L(A) that trains this model using maximum likelihood.
Answer:
L(A) = −E_{x∼pdata(x)}[ log p(A^{-1} x) + log |det(A^{-1})| ]
(b) [10 points] Write a loss function L(A) that trains this model as the generator in a generative
adversarial network. You may assume that a discriminator Dφ : Rn → R that outputs the
probability that the input is real has been defined and is trained alongside the generative model.
Answer:
L(A) = E_{z∼p(z)}[ log(1 − Dφ(Az)) ]
or
L(A) = −E_{z∼p(z)}[ log Dφ(Az) ]
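A short numpy sketch of both losses (not part of the exam), assuming a standard normal prior p(z) = N(0, I) and a toy placeholder discriminator Dφ:

import numpy as np

rng = np.random.default_rng(0)
n = 2                                    # data dimension (assumed for illustration)
A = np.array([[2.0, 0.5], [0.0, 1.5]])   # invertible generator matrix

def log_prior(z):
    # assumed prior p(z) = N(0, I): log density of a standard normal
    return -0.5 * np.sum(z ** 2, axis=-1) - 0.5 * n * np.log(2 * np.pi)

def mle_loss(A, x_batch):
    # L(A) = -E_x[ log p(A^{-1} x) + log|det(A^{-1})| ]
    A_inv = np.linalg.inv(A)
    z = x_batch @ A_inv.T
    _, logabsdet = np.linalg.slogdet(A_inv)
    return -np.mean(log_prior(z) + logabsdet)

def D_phi(x_batch):
    # placeholder discriminator: any map R^n -> (0, 1) would do for this sketch
    return 1.0 / (1.0 + np.exp(-x_batch @ np.array([1.0, -1.0])))

def gan_generator_loss(A, z_batch):
    # minimax form: L(A) = E_z[ log(1 - D_phi(A z)) ]
    x_fake = z_batch @ A.T
    return np.mean(np.log(1.0 - D_phi(x_fake)))

x_batch = rng.normal(size=(128, n)) @ A.T      # stand-in for samples from the training set D
z_batch = rng.normal(size=(128, n))
print(mle_loss(A, x_batch), gan_generator_loss(A, z_batch))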
(a) [8 points] Derive an expression for log qφ,ψ (z|x). The function should take x and z as input,
output a scalar value, and depend on μφ, σφ, and fψ. You can use N(u; μ, diag(σ²)) to denote
the pdf of a normal distribution with mean μ and covariance diag(σ²), evaluated at u.
Answer: By the change of variables formula, the log probability of z is

log qφ,ψ(z | x) = log N(fψ^{-1}(z); μφ(x), diag(σφ(x)²)) + log |det (∂fψ^{-1}(z) / ∂z)|
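To make the formula concrete (not part of the exam), a small numpy sketch, assuming for illustration that fψ is an elementwise affine flow z = a ⊙ ẑ + b (invertible when all entries of a are non-zero) and that μφ(x), σφ(x) are fixed vectors:

import numpy as np

mu_phi = np.array([0.0, 1.0, -0.5])      # encoder mean  mu_phi(x)
sigma_phi = np.array([1.0, 0.5, 2.0])    # encoder std   sigma_phi(x)
a = np.array([2.0, -1.0, 0.5])           # flow scale (invertible since all a != 0)
b = np.array([0.1, 0.0, -0.3])           # flow shift

def f_inv(z):
    return (z - b) / a                   # f_psi^{-1}(z) for the assumed affine flow

def log_q(z):
    z_hat = f_inv(z)
    # log N(f^{-1}(z); mu, diag(sigma^2)) for a diagonal Gaussian
    log_normal = -0.5 * np.sum(((z_hat - mu_phi) / sigma_phi) ** 2
                               + np.log(2 * np.pi * sigma_phi ** 2))
    # log |det d f^{-1}(z) / dz| = -sum log|a| for an elementwise affine flow
    log_det = -np.sum(np.log(np.abs(a)))
    return log_normal + log_det

print(log_q(np.array([0.7, -0.2, 1.3])))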
(b) [12 points] Consider rφ (z|x) as the basic Gaussian inference model (without using the flow
model), with the following sampling process:
• Sample z ∼ N (z; µφ (x), diag(σφ (x)2 ))
Show that the best evidence lower bound we can achieve with qφ,ψ is at least as tight as the best
one we can achieve with rφ, i.e.,

max_{φ,ψ} ELBO(x; p, qφ,ψ) ≥ max_φ ELBO(x; p, rφ),

where
ELBO(x; p, q) = E_{q(z|x)}[ log p(x, z) − log q(z | x) ]
You may assume that fψ can represent any invertible function RF → RF .
Answer: For any φ chosen for rφ, we can choose fψ to be the identity function in qφ,ψ. The identity map
is invertible and has unit Jacobian, so for this choice of ψ we have qφ,ψ(z | x) = rφ(z | x), and the two
models achieve exactly the same ELBO. Hence for every rφ there is a qφ,ψ attaining the same ELBO, so
the maximum ELBO over qφ,ψ is greater than or equal to the maximum over rφ.