Week 7
(a) It adds a penalty term to the loss function that is proportional to the absolute
value of the weights.
(b) It results in sparse solutions for w.
(c) It adds a penalty term to the loss function that is proportional to the square of
the weights.
(d) It is equivalent to adding Gaussian noise to the weights.
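For reference, a minimal sketch of the two penalty terms these options describe: an L1 penalty on the absolute values of the weights and an L2 penalty on their squares. The weight vector and regularization strength below are hypothetical placeholders, not values from the question.

```python
import numpy as np

# Hypothetical weight vector and regularization strength (illustrative only).
w = np.array([0.5, -1.2, 0.0, 3.0])
lam = 0.01

# L1 penalty: proportional to the absolute values of the weights
# (tends to drive some weights exactly to zero, i.e. sparse solutions).
l1_penalty = lam * np.sum(np.abs(w))

# L2 penalty: proportional to the squares of the weights
# (shrinks weights smoothly, rarely makes them exactly zero).
l2_penalty = lam * np.sum(w ** 2)

print(l1_penalty, l2_penalty)
```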
$$\hat{f}_2(x) = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4 + w_5 x^5$$
$$y = 7x^3 + 12x^2 + x + 2.$$
We fit the two models $\hat{f}_1(x)$ and $\hat{f}_2(x)$ to this data and train them using a neural network.
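As an illustrative sketch of this setup: the snippet below fits a degree-3 and a degree-5 polynomial to noisy samples of the cubic above. The sample points, noise level, and the use of a plain least-squares fit (rather than the neural-network training mentioned in the question) are assumptions made only for illustration.

```python
import numpy as np

# Generate noisy samples from the true cubic y = 7x^3 + 12x^2 + x + 2
# (sample size and noise level are hypothetical).
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 50)
y = 7 * x**3 + 12 * x**2 + x + 2 + rng.normal(0, 1.0, size=x.shape)

# f_hat_1: degree-3 polynomial, f_hat_2: degree-5 polynomial.
f1_coeffs = np.polyfit(x, y, deg=3)
f2_coeffs = np.polyfit(x, y, deg=5)

# Training error (mean squared error) of each fit.
mse1 = np.mean((np.polyval(f1_coeffs, x) - y) ** 2)
mse2 = np.mean((np.polyval(f2_coeffs, x) - y) ** 2)
print(mse1, mse2)  # the higher-degree model fits the training data at least as well
```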
(a) The dropout probability p can be different for each hidden layer
(b) Batch gradient descent cannot be used to update the parameters of the network
(c) Dropout with p = 0.5 acts as an ensemble regularizer
(d) The weights of the neurons which were dropped during the forward propagation at the t-th iteration will not get updated during the (t + 1)-th iteration
(a) The dropout probability p can be different for each hidden layer:
• True. It is common practice to apply different dropout rates to different
hidden layers, which allows for more control over the regularization strength
applied to each layer.
(b) Batch gradient descent cannot be used to update the parameters of
the network:
• False. Batch gradient descent, as well as mini-batch gradient descent, can
be used to update the parameters of a network with dropout regularization.
Dropout affects the training phase by randomly dropping neurons but does
not prevent the use of gradient descent algorithms for parameter updates.
(c) Dropout with p = 0.5 acts as an ensemble regularizer:
• True. Dropout with p = 0.5 can be seen as an ensemble method in the sense
that, during training, different subsets of neurons are active, which can
be interpreted as training a large number of “thinned” networks. During
testing, the full network is used but with the weights scaled to account for
the dropout, effectively acting as an ensemble of these thinned networks.
(d) The weights of the neurons which were dropped during the forward propagation at the t-th iteration will not get updated during the (t + 1)-th iteration:
• False. A new dropout mask is sampled independently at every training iteration. A neuron that is dropped at the t-th iteration receives no gradient (and hence no weight update) from that iteration, but at the (t + 1)-th iteration a fresh mask is drawn, so the neuron may be active again and its weights are then updated through backpropagation as usual. Being dropped at iteration t therefore says nothing about whether the weights are updated at iteration t + 1.
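The points above can be made concrete with a minimal inverted-dropout sketch. The layer sizes, dropout probability, learning rate, and data below are hypothetical and only illustrate the mechanism, not the network from the question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-hidden-layer network; with more hidden layers, each layer
# could use its own dropout probability p (point a).
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 1))
p = 0.5
lr = 0.1
x = rng.normal(size=(16, 4))
y = rng.normal(size=(16, 1))

for t in range(2):  # two training iterations on the same batch
    # A fresh mask each iteration trains a different "thinned" sub-network (point c),
    # so a neuron dropped at iteration t can be active and updated at t + 1 (point d).
    mask = (rng.random((1, 8)) > p) / (1 - p)   # inverted dropout: rescale kept units
    h = np.maximum(x @ W1, 0) * mask            # hidden activations with dropout
    y_hat = h @ W2
    grad_out = 2 * (y_hat - y) / len(x)         # gradient of mean squared error
    grad_W2 = h.T @ grad_out
    grad_W1 = x.T @ ((grad_out @ W2.T) * mask * (x @ W1 > 0))
    W1 -= lr * grad_W1                          # batch gradient descent works fine (point b)
    W2 -= lr * grad_W2
```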
5. We have trained four different models on the same dataset using various hyperparameters. The training and validation errors for each model are provided below. Based on this information, which model is likely to perform best on the test dataset?
Model    Training error    Validation error
  1          0.8                1.4
  2          2.5                0.5
  3          1.7                1.7
  4          0.2                0.6
(a) Model 1
(b) Model 2
(c) Model 3
(d) Model 4
$$\frac{\partial L}{\partial w} = 0.8\,w, \qquad \frac{\partial L}{\partial b} = 14\,b.$$
Setting these partial derivatives to zero:
$$0.8\,w = 0 \implies w = 0, \qquad 14\,b = 0 \implies b = 0.$$
Hence the gradient vanishes at this point, $\nabla L(w^*, b^*) = (0, 0)$, and its two components sum to $0 + 0 = 0$.
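For context, these partial derivatives are consistent with a quadratic loss of the form sketched below; this is an inference from the gradients shown in this excerpt (up to an additive constant), since the loss itself is not restated here:

$$L(w, b) = 0.4\,w^2 + 7\,b^2 \;\Longrightarrow\; \frac{\partial L}{\partial w} = 0.8\,w, \quad \frac{\partial L}{\partial b} = 14\,b,$$

so $(w^*, b^*) = (0, 0)$ is the only stationary point and, the loss being a sum of non-negative quadratic terms, its global minimum.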
$$\frac{\partial^2 L}{\partial w^2} = 0.8, \qquad \frac{\partial^2 L}{\partial b^2} = 14, \qquad \frac{\partial^2 L}{\partial w\,\partial b} = \frac{\partial^2 L}{\partial b\,\partial w} = 0.$$
Thus, the Hessian matrix is:
$$H_L(w, b) = \begin{pmatrix} 0.8 & 0 \\ 0 & 14 \end{pmatrix}.$$
9. Compute the eigenvalues and eigenvectors of the Hessian. According to the eigenvalues of the Hessian, which parameter is the loss more sensitive to?
(a) b
(b) w
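A quick numerical check of these eigenvalues and eigenvectors (using numpy); the closing comment states the sensitivity conclusion that follows from the larger curvature along the b direction.

```python
import numpy as np

# Hessian of L at (w*, b*) = (0, 0), taken from the expression above.
H = np.array([[0.8, 0.0],
              [0.0, 14.0]])

eigvals, eigvecs = np.linalg.eigh(H)  # symmetric matrix, so eigh is appropriate
print(eigvals)    # [ 0.8 14. ]
print(eigvecs)    # columns: [1, 0] (w direction) and [0, 1] (b direction)

# The larger eigenvalue (14) corresponds to the b direction, so the loss
# curves more steeply along b, i.e. it is more sensitive to b than to w.
```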
10. Consider the problem of recognizing a letter (in upper case or lower case) of the English language in an image. There are 26 letters in the language. Therefore, a team decided to use a CNN to solve this problem. Suppose that data augmentation is being used for regularization. Which of the following transformation(s) applied to all the training images is (are) appropriate for the problem?