
E9 309 – Advanced Deep Learning

Midterm Mock Question Paper


October 2021

Instructions

1. This exam is open book. However, computers, mobile phones and other handheld devices
are not allowed.

2. Any reference materials that are used in the exam (other than materials distributed in
the course webpage) should be pre-approved with the instructor before the exam.

3. No additional resources (other than those pre-approved) are allowed for use in the exam.

4. Academic integrity and ethics of highest order are expected.

5. Notation - bold symbols are vectors, capital bold symbols are matrices and regular symbols
are scalars.

6. Answer all questions.

7. Total Duration - 180 minutes including answer upload

8. Total Marks - 100 points

Name - ................................

Dept. - ....................

SR Number - ....................
1. The variational autoencoder model attempts to learn the parameters by maximizing the
data likelihood. Let $p_D(x)$ denote the true underlying distribution of $x$. If the latent
representation $z$, with a prior distribution $p(z)$, is approximated using $q_\phi(z|x)$ and the
reconstruction is modeled as $p_\theta(x|z)$, then the training objective is

$$\mathbb{E}_{p_D(x)}[\log p_\theta(x)] = \mathbb{E}_{p_D(x)}\big[\log \mathbb{E}_{p(z)}[p_\theta(x|z)]\big]$$

whose lower bound is obtained as

$$\mathcal{L}_{ELBO}(x) = -D_{KL}(q_\phi(z|x)\,\|\,p(z)) + \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$$

A modification to the training objective is to include $I_q(x;z) = \mathbb{E}_{q_\phi(z,x)}\big[\log \tfrac{q_\phi(z|x)}{q_\phi(z)}\big]$,
which represents the mutual information between the visible variable $x$ and the latent
variable $z$, weighted by a factor of $\alpha$, and to weight the KL divergence term by a
factor of $\lambda$. Thus, the new objective function is

$$-\lambda D_{KL}(q_\phi(z|x)\,\|\,p(z)) + \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] + \alpha I_q(x;z)$$

Show that the new objective can be simplified as

$$\mathbb{E}_{p_D(x)}\big[\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]\big] - (1-\alpha)\,\mathbb{E}_{p_D(x)}\big[D_{KL}(q_\phi(z|x)\,\|\,p(z))\big] - (\alpha+\lambda-1)\,D_{KL}(q_\phi(z)\,\|\,p(z))$$
[25 marks]
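For context (not part of the exam), the requested simplification relies on decomposing the expected KL term into the mutual information $I_q(x;z)$ plus an aggregate-posterior KL. The short NumPy sketch below numerically verifies that decomposition on a small, made-up discrete example; all distributions are arbitrary placeholders.

```python
# Hedged sketch: numerically check
#   E_{p_D(x)}[ D_KL(q(z|x) || p(z)) ] = I_q(x; z) + D_KL(q(z) || p(z))
# on a toy discrete problem (all tables below are made up for illustration).
import numpy as np

rng = np.random.default_rng(0)
n_x, n_z = 4, 3

p_D = rng.random(n_x); p_D /= p_D.sum()                 # data distribution p_D(x)
q_z_given_x = rng.random((n_x, n_z))                    # encoder q(z|x), rows sum to 1
q_z_given_x /= q_z_given_x.sum(axis=1, keepdims=True)
p_z = rng.random(n_z); p_z /= p_z.sum()                 # prior p(z)

q_xz = p_D[:, None] * q_z_given_x                       # joint q(x, z)
q_z = q_xz.sum(axis=0)                                  # aggregate posterior q(z)

kl = lambda a, b: np.sum(a * np.log(a / b))

lhs = np.sum(p_D * np.array([kl(q_z_given_x[i], p_z) for i in range(n_x)]))
mi = np.sum(q_xz * np.log(q_z_given_x / q_z))           # I_q(x; z)
rhs = mi + kl(q_z, p_z)

print(lhs, rhs)   # the two values agree up to floating-point error
```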

2. t-SNE: Let the joint probability of two vectors $(x_i, x_j)$ in the high-dimensional space be
$p_{ij}$, and let the joint probability of their corresponding vectors $(y_i, y_j)$ in the
lower-dimensional space, $q_{ij}$, be

$$q_{ij} = \frac{y_i^\top W y_j}{\sum_{l}\sum_{k \neq l} y_k^\top W y_l}$$

The cost function of the model is the KL divergence between the two joint probability dis-
tributions, $C = KL(P\,\|\,Q)$. Now, what is the gradient of the cost function with respect
to $y_i$? (Let $p_{ii} = q_{ii} = 0$, and let $W$ be symmetric.) [20 marks]
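As an illustration only (not part of the exam), the NumPy sketch below builds the $q_{ij}$ matrix from the bilinear similarity defined above (note this is the question's similarity, not the usual Student-t kernel of t-SNE) and evaluates the cost $C = KL(P\,\|\,Q)$. The embeddings, $W$, and $P$ are arbitrary nonnegative placeholders chosen so the affinities form a valid distribution.

```python
# Hedged sketch: the low-dimensional affinities q_ij and the KL cost from Question 2.
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 2
Y = rng.random((n, d))                         # low-dimensional embeddings y_i (nonnegative placeholders)
W = rng.random((d, d)); W = (W + W.T) / 2      # symmetric W, as stated in the question

S = Y @ W @ Y.T                                # S[i, j] = y_i^T W y_j
np.fill_diagonal(S, 0.0)                       # q_ii = 0
Q = S / S.sum()                                # normalize over all pairs k != l

P = rng.random((n, n))                         # placeholder high-dimensional affinities p_ij
np.fill_diagonal(P, 0.0); P /= P.sum()

mask = P > 0
C = np.sum(P[mask] * np.log(P[mask] / Q[mask]))   # C = KL(P || Q)
print(C)
```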
Figure 1: Structure of the LSTM layer with the attention framework for question answering.

3. A question-answering LSTM machine is shown in Figure 1. To find the answers to
questions in a multiple-choice problem, the model uses an LSTM-attention neural
architecture. Each question has four options, each of which is a word, phrase, value, or
sentence. The structure of the LSTM layer with the attention framework is shown in Figure
1. Given $D$-dimensional input sentence sequence embeddings $X = x(1), \ldots, x(n)$ for each
question and its four options, we pass them to the LSTM layer. We obtain $o = o(1), \ldots, o(n)$
as the $H$-dimensional output vector sequence. For each question, we obtain the
output embedding $o_q$ by average pooling these vectors. On the options (answers) side, an
attention-based embedding is generated as follows:

$$m_{a,q}(t) = \tanh(W_{am}\, o_a(t) + W_{qm}\, o_q)$$

$$s_{a,q}(t) = \mathrm{softmax}\big(w_{ms}^\top [m_{a,q}(1), \ldots, m_{a,q}(n)]\big)(t)$$

$$\tilde{o}_a(t) = o_a(t)\, s_{a,q}(t)$$

where $W_{am} \in \mathbb{R}^{H \times H}$, $W_{qm} \in \mathbb{R}^{H \times H}$ and $w_{ms} \in \mathbb{R}^{H}$ are attention weights. Finally, an
average pooling is done over all $\tilde{o}_a(t)$ to generate the answer embedding $o_a$. Let $o_q$, $o_{ap}$
and $o_{an}$ be the network outputs for a question input, its correct answer (ap) and an incorrect
answer (an). We aim to maximize the similarity between $o_q$ and $o_{ap}$ and minimize the similarity
between $o_q$ and $o_{an}$ using the triplet loss defined as

$$L(o_q, o_{ap}, o_{an}) = -o_q^\top o_{ap} + o_q^\top o_{an} + \alpha$$

where $\alpha$ is an arbitrary constant. How will you derive the update equation for the attention
weight $W_{am} \in \mathbb{R}^{H \times H}$?
[20 marks]
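For illustration only (not part of the exam), here is a minimal NumPy sketch of the option-side attention step defined above. The random arrays are placeholders standing in for the LSTM outputs $o_a(t)$ and the pooled question embedding $o_q$, and the weights are arbitrary.

```python
# Hedged sketch: the attention computation of Question 3 in NumPy.
import numpy as np

rng = np.random.default_rng(0)
H, n = 4, 6
o_a = rng.standard_normal((n, H))      # option-side LSTM outputs o_a(t), t = 1..n (placeholders)
o_q = rng.standard_normal(H)           # question embedding after average pooling (placeholder)

W_am = rng.standard_normal((H, H))     # attention weights (placeholders)
W_qm = rng.standard_normal((H, H))
w_ms = rng.standard_normal(H)

m = np.tanh(o_a @ W_am.T + o_q @ W_qm.T)         # m_{a,q}(t) = tanh(W_am o_a(t) + W_qm o_q)
logits = m @ w_ms                                # w_ms^T m_{a,q}(t), one scalar per t
s = np.exp(logits - logits.max()); s /= s.sum()  # softmax over t -> s_{a,q}(t)
o_tilde = o_a * s[:, None]                       # o~_a(t) = o_a(t) s_{a,q}(t)
o_a_emb = o_tilde.mean(axis=0)                   # average pooling -> answer embedding o_a
print(o_a_emb.shape)                             # (H,)
```

The derivation asked for in the question would backpropagate the triplet loss through this computation to obtain the gradient with respect to $W_{am}$.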
4. RNN: Suppose we receive two binary sequences $x_1 = (x_1^{(1)}, \ldots, x_1^{(T)})$ and $x_2 = (x_2^{(1)}, \ldots, x_2^{(T)})$
of equal length, and we would like to design an RNN to determine if they are identical.
We will use the following (rather unusual) architecture, drawn with self-loops on the left
and unrolled on the right:

The computation in each step is as follows:

$$h^{(t)} = \phi(W x^{(t)} + b)$$

$$y^{(t)} = \begin{cases} \phi(v^\top h^{(t)} + r\,y^{(t-1)} + c) & \text{for } t > 1 \\ \phi(v^\top h^{(t)} + c_0) & \text{for } t = 1, \end{cases}$$

where $\phi$ denotes the hard threshold activation function

$$\phi(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z \le 0 \end{cases}$$

The parameters are a 2 × 2 weight matrix $W$, a 2-dimensional bias vector $b$, a 2-dimensional
weight vector $v$, a scalar recurrent weight $r$, a scalar bias $c$ for all but the first time
step, and a separate bias $c_0$ for the first time step.
We’ll use the following strategy. We’ll proceed one step at a time, and at time $t$, the
binary-valued elements $x_1^{(t)}$ and $x_2^{(t)}$ will be fed as inputs. The output unit $y^{(t)}$ at time $t$
will compute whether all pairs of elements have matched up to time $t$. The two hidden
units $h_1^{(t)}$ and $h_2^{(t)}$ will help determine if both inputs match at a given time step. Give
parameters which correctly implement this function: $W$, $b$, $v$, $r$, $c$, $c_0$. Hint: we can have
$h_1^{(t)}$ determine if both inputs are 0, and $h_2^{(t)}$ determine if both inputs are 1.
[15 marks]
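As a study aid (not part of the exam, and not the answer), the sketch below runs the recurrence above for a candidate parameter setting and exhaustively checks it on short sequences. The zero-valued parameters are placeholders to be replaced by the values you derive.

```python
# Hedged sketch: a harness for testing a candidate (W, b, v, r, c, c0) for Question 4.
import itertools
import numpy as np

phi = lambda z: (np.asarray(z) > 0).astype(float)   # hard threshold activation

def rnn_equal(x1, x2, W, b, v, r, c, c0):
    """Run the recurrence and return y^(T) for the two binary sequences."""
    y = None
    for t, (x1_t, x2_t) in enumerate(zip(x1, x2)):
        x_t = np.array([x1_t, x2_t], dtype=float)
        h = phi(W @ x_t + b)                        # h^(t) = phi(W x^(t) + b)
        if t == 0:
            y = phi(v @ h + c0)                     # first step uses bias c0
        else:
            y = phi(v @ h + r * y + c)              # later steps feed back y^(t-1)
    return float(y)

# Placeholder parameters -- replace with the values you derive.
W, b, v = np.zeros((2, 2)), np.zeros(2), np.zeros(2)
r, c, c0 = 0.0, 0.0, 0.0

# Exhaustive check on short sequences: output should be 1 iff the sequences match.
T = 4
ok = all(rnn_equal(s1, s2, W, b, v, r, c, c0) == float(s1 == s2)
         for s1 in itertools.product([0, 1], repeat=T)
         for s2 in itertools.product([0, 1], repeat=T))
print("candidate parameters correct on all length-4 sequences:", ok)
```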
5. Consider the Boltzmann machine shown here with three units: two visible and one hidden.
The units take on values 0 or 1, and the weights between the units are $w_{v_1 v_2} = \log_e 3$,
$w_{v_1 h} = \log_e 2$, $w_{v_2 h} = \log_e 2$.

(a) Write down, for each state (i.e., for all the combinations of settings of the units), the
expression for energy, the un-normalized probability and the normalized probability.
(b) Compute the probability that the visible units are in the state $(v_1, v_2) = (1, 1)$ when
the network is generating data freely (i.e., when the visible units are not clamped).
(c) If the network is being trained on a single data point where the visible units are in
the state $(v_1, v_2) = (1, 1)$, what is the derivative of the log probability of the data
with respect to $w_{v_2 h}$?
[20 marks]
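For checking hand calculations (not part of the exam), the sketch below enumerates all $2^3$ states and computes energies, un-normalized probabilities, and the free-running probability of $(v_1, v_2) = (1, 1)$. It assumes the standard Boltzmann machine energy with no bias terms, which is consistent with the weights given above.

```python
# Hedged sketch: enumerate the states of the three-unit Boltzmann machine in Question 5,
# assuming E(v1, v2, h) = -(w_{v1v2} v1 v2 + w_{v1h} v1 h + w_{v2h} v2 h) with no biases.
import itertools
import numpy as np

w12, w1h, w2h = np.log(3), np.log(2), np.log(2)

states, unnorm = [], []
for v1, v2, h in itertools.product([0, 1], repeat=3):
    E = -(w12 * v1 * v2 + w1h * v1 * h + w2h * v2 * h)   # energy of this state
    states.append((v1, v2, h))
    unnorm.append(np.exp(-E))                            # un-normalized probability

Z = sum(unnorm)                                          # partition function
probs = {s: u / Z for s, u in zip(states, unnorm)}       # normalized probabilities

# Part (b): P(v1 = 1, v2 = 1) under free running = sum over h of P(1, 1, h)
p_v11 = probs[(1, 1, 0)] + probs[(1, 1, 1)]
print(p_v11)
```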
