ADL Midterm Mock Exam 2021
Instructions
1. This exam is open book. However, computers, mobile phones, and other handheld devices
are not allowed.
2. Any reference materials used in the exam (other than materials distributed on
the course webpage) must be pre-approved by the instructor before the exam.
3. No additional resources (other than those pre-approved) are allowed for use in the exam.
4. Notation: bold symbols are vectors, capital bold symbols are matrices, and regular symbols
are scalars.
Name - ................................
Dept. - ....................
SR Number - ....................
1. The variational autoencoder model attempts to learn the parameters by maximizing the
data likelihood. Let $p_D(x)$ denote the true underlying distribution of $x$. If the latent
representation $z$, with prior distribution $p(z)$, is approximated using $q_\phi(z|x)$ and the
reconstruction is modeled as $p_\theta(x|z)$, then the training objective is:
$$\mathbb{E}_{p_D(x)}[\log p_\theta(x)] = \mathbb{E}_{p_D(x)}\big[\log \mathbb{E}_{p(z)}[p_\theta(x|z)]\big]$$
whose lower bound is obtained as:
$$\mathcal{L}_{\mathrm{ELBO}}(x) = -D_{\mathrm{KL}}\big(q_\phi(z|x)\,\|\,p(z)\big) + \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$$
A modification to the training objective is to include $I_q(x;z) = \mathbb{E}_{q_\phi(z,x)}\left[\log \frac{q_\phi(z|x)}{q_\phi(z)}\right]$, which
represents the mutual information between the visible variable $x$ and the latent variable $z$, with a factor of $\alpha$, and to weight
the KL divergence term by a factor of $\lambda$. Write down the new objective function.
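For intuition (not part of the exam), the KL term of the ELBO can be checked numerically. The sketch below assumes a Gaussian encoder $q_\phi(z|x) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$ and a standard normal prior $p(z) = \mathcal{N}(0, I)$, for which $D_{\mathrm{KL}}$ has a closed form; the particular values of `mu` and `sigma` are illustrative assumptions.

```python
import numpy as np

# Assumed encoder outputs for one data point (illustrative values only).
mu = np.array([0.5, -0.3])
sigma = np.array([0.8, 1.2])

# Closed-form KL between N(mu, diag(sigma^2)) and the standard normal prior:
#   D_KL = 0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2)
kl = 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - np.log(sigma**2))

# In an objective of the form  alpha * I_q(x; z) - lambda * D_KL + reconstruction,
# lambda simply rescales this term; lambda = 1 recovers the standard ELBO weight.
```

Note that the KL term is always non-negative and vanishes only when the encoder matches the prior exactly.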
2. t-SNE: Let the joint probability of two vectors $(x_i, x_j)$ in the high-dimensional space be
$p_{ij}$, and let the joint probability of their corresponding vectors $(y_i, y_j)$ in the lower-dimensional
space, $q_{ij}$, be
$$q_{ij} = \frac{y_i^T \mathbf{W} y_j}{\sum_l \sum_{k \neq l} y_k^T \mathbf{W} y_l}$$
The cost function of the model is the KL divergence between the two joint probability distributions, $C = \mathrm{KL}(P\,\|\,Q)$. Now, what is the gradient of the cost function with respect
to $y_i$? (Let $p_{ii} = q_{ii} = 0$, and let $\mathbf{W}$ be symmetric.) [20 marks]
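For intuition (not part of the exam), the quantities in the question can be computed on a toy example. The sketch below is a hedged illustration: the matrices `Y`, `W`, and the uniform `P` are assumptions, chosen so that all $y_i^T \mathbf{W} y_j$ are positive and $Q$ is a valid distribution.

```python
import numpy as np

# Assumed low-dimensional points (rows are y_i) and a symmetric W,
# chosen with positive entries so every bilinear term is positive.
Y = np.array([[1.0, 0.5],
              [0.4, 1.2],
              [0.9, 0.3]])
W = np.array([[2.0, 0.5],
              [0.5, 1.0]])               # symmetric, as the question requires

S = Y @ W @ Y.T                          # S[i, j] = y_i^T W y_j
mask = ~np.eye(len(Y), dtype=bool)       # q_ii = 0 by convention
Q = np.where(mask, S, 0.0)
Q = Q / Q.sum()                          # normalize over all pairs i != j

# A uniform P over off-diagonal pairs, purely for illustration.
P = np.where(mask, 1.0 / mask.sum(), 0.0)

# Cost C = KL(P || Q), summed over the off-diagonal entries.
C = np.sum(P[mask] * np.log(P[mask] / Q[mask]))
```

Verifying the derived gradient against finite differences of `C` on such an example is a standard sanity check.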
Figure 1:
$$h^{(t)} = \phi(\mathbf{W}x^{(t)} + b)$$
$$y^{(t)} = \begin{cases} \phi(v^T h^{(t)} + r y^{(t-1)} + c) & \text{for } t > 1 \\ \phi(v^T h^{(t)} + c_0) & \text{for } t = 1 \end{cases}$$
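For intuition (not part of the exam), the recurrence above can be unrolled for a short input sequence. The sketch below assumes $\phi = \tanh$ and small hand-picked parameter values; all shapes and numbers are illustrative assumptions.

```python
import numpy as np

phi = np.tanh                     # assumed activation

# Assumed parameters: W maps 2-dim inputs to 3 hidden units.
W = np.array([[0.1, -0.2],
              [0.3,  0.1],
              [0.0,  0.2]])
b = np.zeros(3)
v = np.array([0.5, -0.4, 0.3])
r, c, c0 = 0.5, 0.1, 0.2

xs = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
ys = []
for t, x in enumerate(xs, start=1):
    h = phi(W @ x + b)            # h^(t) = phi(W x^(t) + b)
    if t == 1:
        y = phi(v @ h + c0)       # first step uses the separate bias c_0
    else:
        y = phi(v @ h + r * ys[-1] + c)   # later steps feed back y^(t-1)
    ys.append(float(y))
```

The output recurrence makes $y^{(t)}$ depend on $y^{(t-1)}$ directly, in addition to the hidden state.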
(a) Write down, for each state (i.e., for all the combinations of settings of the units), the
expression for energy, the un-normalized probability and the normalized probability.
(b) Compute the probability that the visible units are in the state $(v_1, v_2) = (1, 1)$ when
the network is generating data freely (i.e., when the visible units are not clamped).
(c) If the network is being trained on a single data point where the visible units are in
the state $(v_1, v_2) = (1, 1)$, what is the derivative of the log probability of the data
with respect to $w_{v_2 h}$?
[20 marks]
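For intuition (not part of the exam), the enumeration asked for in parts (a) and (b) can be carried out programmatically. The sketch below is a hedged illustration: since Figure 1 is not reproduced here, the network size (2 visible units, 1 hidden unit), the energy form $E(v,h) = -v^T \mathbf{W} h - a^T v - b^T h$, and all weight values are assumptions.

```python
import itertools
import numpy as np

# Assumed parameters for a tiny RBM with visible units (v1, v2) and one hidden unit h.
W = np.array([[1.0], [-0.5]])   # visible-to-hidden weights (assumed)
a = np.array([0.0, 0.0])        # visible biases (assumed)
b = np.array([0.0])             # hidden bias (assumed)

# (a)-style enumeration: energy and un-normalized probability for every state.
unnorm = {}
for v1, v2, h in itertools.product([0, 1], repeat=3):
    v, hh = np.array([v1, v2]), np.array([h])
    E = -(v @ W @ hh + a @ v + b @ hh)   # energy of this joint state
    unnorm[(v1, v2, h)] = float(np.exp(-E))

Z = sum(unnorm.values())                 # partition function
probs = {s: p / Z for s, p in unnorm.items()}

# (b)-style marginal: probability of visible state (1, 1), summed over h.
p_11 = probs[(1, 1, 0)] + probs[(1, 1, 1)]
```

With actual weights from Figure 1 substituted in, the same enumeration yields the required table and marginal.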