
Deep Learning - Week 7

1. Which of the following statements about L2 regularization is true?

(a) It adds a penalty term to the loss function that is proportional to the absolute
value of the weights.
(b) It results in sparse solutions for w.
(c) It adds a penalty term to the loss function that is proportional to the square of
the weights.
(d) It is equivalent to adding Gaussian noise to the weights.

Correct Answer: (c)


Solution:
It adds a penalty term to the loss function that is proportional to the
square of the weights. L2 regularization, also known as Ridge Regularization,
adds a penalty term to the loss function that is proportional to the sum of the squares
of the weights. The modified loss function typically looks like:
L_reg = L + λ Σ w²

where λ is a hyperparameter that controls the strength of regularization.


Now, let’s analyze the other options:
It adds a penalty term to the loss function that is proportional to the
absolute value of the weights. Incorrect. This describes L1 regularization
(Lasso), not L2.
It results in sparse solutions for w. Incorrect. L2 regularization does not lead
to sparse solutions (i.e., it does not force weights to be exactly zero). Instead, it
shrinks weights toward zero but usually keeps them nonzero. L1 regularization is
the one that encourages sparsity.
It is equivalent to adding Gaussian noise to the weights. Incorrect. While
L2 regularization can be interpreted as a prior in a Bayesian framework (i.e.,
assuming a Gaussian prior on weights), it does not mean that Gaussian noise is
explicitly added to the weights during training.
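To make the penalty concrete, here is a minimal NumPy sketch (not from the original solution) of gradient descent on a ridge-regularized squared loss; the data, λ, and learning rate are illustrative assumptions.

```python
import numpy as np

def ridge_step(w, X, y, lam=0.1, lr=0.01):
    """One gradient step on L(w) = ||Xw - y||^2 + lam * ||w||^2."""
    residual = X @ w - y
    grad = 2 * X.T @ residual + 2 * lam * w   # data-fit gradient + L2 penalty gradient
    return w - lr * grad

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                              # toy inputs
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=20)

w = np.zeros(5)
for _ in range(500):
    w = ridge_step(w, X, y)
print(w)   # weights are shrunk toward zero but typically remain nonzero
```

The penalty gradient 2λw shrinks every weight in proportion to its size, which is why L2 regularization does not produce exact zeros (no sparsity), in contrast to L1.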

Common Data Q2-Q3


Consider two models:

f̂1(x) = w0 + w1x

f̂2(x) = w0 + w1x + w2x² + w4x⁴ + w5x⁵

2. Which of these models has higher complexity?

(a) f̂1(x)
(b) f̂2(x)
(c) It is not possible to decide without knowing the true distribution of data points
in the dataset.

Correct Answer: (b)


Solution: Model f̂2(x) has higher complexity than Model f̂1(x). The complexity of a model generally increases with the degree of its polynomial terms. Model f̂1(x) is a linear model, whereas Model f̂2(x) includes higher-degree polynomial terms (up to x⁵), making it capable of capturing more complex patterns. Therefore, f̂2(x) is more complex.

3. We generate the data using the following model:

y = 7x³ + 12x² + x + 2.

We fit the two models f̂1(x) and f̂2(x) on this data and train them using a neural network. Choose the correct statement(s).

(a) f̂1(x) has a higher bias than f̂2(x).
(b) f̂2(x) has a higher bias than f̂1(x).
(c) f̂2(x) has a higher variance than f̂1(x).
(d) f̂1(x) has a higher variance than f̂2(x).

Correct Answer: (a),(c)


Solution: f̂1(x) has a higher bias than f̂2(x), because f̂1(x) is simpler and cannot capture the true complexity of the data. f̂2(x) has a higher variance than f̂1(x), because f̂2(x) is more complex and may fit the training data too closely.
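A minimal NumPy sketch (not part of the original solution) that makes the bias/variance claim measurable: it refits a degree-1 and a degree-5 polynomial on many resampled training sets drawn from the cubic above. The sample sizes and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n=30):
    # noisy samples from the assumed data-generating cubic y = 7x^3 + 12x^2 + x + 2
    x = rng.uniform(-1, 1, size=n)
    y = 7 * x**3 + 12 * x**2 + x + 2 + rng.normal(scale=1.0, size=n)
    return x, y

x_test = np.linspace(-1, 1, 200)
true_y = 7 * x_test**3 + 12 * x_test**2 + x_test + 2

preds_deg1, preds_deg5 = [], []
for _ in range(200):                                    # many resampled training sets
    x, y = make_data()
    preds_deg1.append(np.polyval(np.polyfit(x, y, 1), x_test))   # f1_hat: linear
    preds_deg5.append(np.polyval(np.polyfit(x, y, 5), x_test))   # f2_hat: degree 5

for name, preds in [("f1_hat", preds_deg1), ("f2_hat", preds_deg5)]:
    preds = np.array(preds)
    bias2 = np.mean((preds.mean(axis=0) - true_y) ** 2)   # squared bias vs. true curve
    var = np.mean(preds.var(axis=0))                       # variance across training sets
    print(name, "bias^2 ~", round(bias2, 3), "variance ~", round(var, 3))
```

With settings like these, the linear model shows the larger squared bias and the degree-5 model the larger variance, matching options (a) and (c).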

4. Suppose that we apply Dropout regularization to a feedforward neural network. Suppose further that the mini-batch gradient descent algorithm is used for updating the parameters of the network. Choose the correct statement(s) from the following statements.

(a) The dropout probability p can be different for each hidden layer
(b) Batch gradient descent cannot be used to update the parameters of the network
(c) Dropout with p = 0.5 acts as an ensemble regularizer
(d) The weights of the neurons that were dropped during the forward propagation at the t-th iteration will not get updated during the (t+1)-th iteration

Correct Answer: (a),(c)


Solution:

(a) The dropout probability p can be different for each hidden layer:
• True. It is common practice to apply different dropout rates to different
hidden layers, which allows for more control over the regularization strength
applied to each layer.
(b) Batch gradient descent cannot be used to update the parameters of
the network:
• False. Batch gradient descent, as well as mini-batch gradient descent, can
be used to update the parameters of a network with dropout regularization.
Dropout affects the training phase by randomly dropping neurons but does
not prevent the use of gradient descent algorithms for parameter updates.
(c) Dropout with p = 0.5 acts as an ensemble regularizer:
• True. Dropout with p = 0.5 can be seen as an ensemble method in the sense
that, during training, different subsets of neurons are active, which can
be interpreted as training a large number of “thinned” networks. During
testing, the full network is used but with the weights scaled to account for
the dropout, effectively acting as an ensemble of these thinned networks.
(d) The weights of the neurons that were dropped during the forward propagation at the t-th iteration will not get updated during the (t+1)-th iteration:
• False. Dropout masks are resampled independently at every iteration, so a neuron that was dropped at the t-th iteration may well be active at the (t+1)-th iteration. Its weights then receive gradients during backpropagation and are updated as usual; being dropped at one iteration does not freeze the weights for later iterations.
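As context for options (a) and (c), here is a minimal sketch (not part of the original solution) of inverted dropout applied to hidden activations, with a possibly different rate per layer; the layer sizes and rates are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p, training=True):
    """Inverted dropout: drop units with probability p and rescale by 1/(1-p).

    At test time the full network is used unchanged, which is what lets dropout
    act like an (approximate) ensemble of the thinned networks seen in training.
    """
    if not training or p == 0.0:
        return h
    mask = (rng.random(h.shape) >= p).astype(h.dtype)   # keep with probability 1-p
    return h * mask / (1.0 - p)

h1 = rng.normal(size=(4, 16))     # activations of hidden layer 1 (batch of 4)
h2 = rng.normal(size=(4, 8))      # activations of hidden layer 2
h1_out = dropout(h1, p=0.2)       # option (a): different p per layer is fine
h2_out = dropout(h2, p=0.5)       # option (c): p = 0.5
```

A fresh mask is drawn for every mini-batch, which is also why option (d) is false: a unit dropped in one iteration is usually active, and updated, in later iterations.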

5. We have trained four different models on the same dataset using various hyperparameters. The training and validation errors for each model are provided below. Based on this information, which model is likely to perform best on the test dataset?

   Model   Training error   Validation error
     1          0.8               1.4
     2          2.5               0.5
     3          1.7               1.7
     4          0.2               0.6

(a) Model 1
(b) Model 2
(c) Model 3
(d) Model 4

Correct Answer: (d)


Solution: Model 4 has both a low training error and a low validation error, whereas the other models show either high errors or a large gap between the two. Hence Model 4 is expected to perform best on the test dataset.
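One illustrative way to encode "both errors should be low" in code (a heuristic sketch, not a rule stated in the document) is to rank the models by the worse of their two errors:

```python
# (model, training error, validation error) from the table above
results = [(1, 0.8, 1.4), (2, 2.5, 0.5), (3, 1.7, 1.7), (4, 0.2, 0.6)]

# "both errors low" ranked by the worse of the two errors per model
best = min(results, key=lambda r: max(r[1], r[2]))
print("selected model:", best[0])   # -> 4
```

Ranking by validation error alone would favor Model 2, but its very high training error suggests something inconsistent about that fit, which is why the combined view points to Model 4.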

Common Data Q6-Q9


Consider the function L(w, b) = 0.4w² + 7b² + 1 and its contour plot (contour plot omitted here).

6. What is the value of L(w∗, b∗), where w∗ and b∗ are the values that minimize the function?
Correct Answer: 1
Solution: To find the value of L(w∗, b∗), where w∗ and b∗ are the values that minimize the function

L(w, b) = 0.4w² + 7b² + 1,

we follow these steps:


1. Find the Minimum Values of w and b:
The partial derivatives of L with respect to w and b are:

∂L/∂w = 0.8w
∂L/∂b = 14b
Setting these partial derivatives to zero:

0.8w = 0 =⇒ w = 0
14b = 0 =⇒ b = 0

Therefore, the values that minimize the function are w∗ = 0 and b∗ = 0.


2. Evaluate L at w∗ and b∗:
Substitute w∗ = 0 and b∗ = 0 into the function L(w, b):

L(w∗, b∗) = L(0, 0) = 0.4(0)² + 7(0)² + 1 = 1

Thus, the value of L(w∗ , b∗ ) is 1.
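As a quick numerical check (not part of the original solution), a few hundred gradient-descent steps on L converge to w ≈ 0, b ≈ 0 with L ≈ 1; the starting point and learning rate below are illustrative choices.

```python
import numpy as np

def L(w, b):
    return 0.4 * w**2 + 7 * b**2 + 1.0

def grad(w, b):
    return np.array([0.8 * w, 14.0 * b])   # (∂L/∂w, ∂L/∂b)

theta = np.array([3.0, -2.0])    # arbitrary starting point (w, b)
lr = 0.05                        # small enough for the steep b-direction (curvature 14)
for _ in range(500):
    theta = theta - lr * grad(*theta)

print(theta, L(*theta))          # ~ [0, 0] and ~ 1.0
```

Note that the b-coordinate shrinks much faster per step than the w-coordinate, a direct consequence of the curvatures 14 vs. 0.8 analyzed in Q8 and Q9 below.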

7. What is the sum of the elements of ∇L(w∗ , b∗ )?


Correct Answer: 0
Solution: The gradient ∇L(w, b) is:

∇L(w, b) = (∂L/∂w, ∂L/∂b) = (0.8w, 14b).

At w∗ = 0 and b∗ = 0, the gradient is:

∇L(w∗ , b∗ ) = (0, 0) .

The sum of the elements of ∇L(w∗ , b∗ ) is:

0 + 0 = 0.

8. What is the determinant of HL(w∗, b∗), where HL is the Hessian of the function?


Correct Answer: 11.2
Solution: The Hessian matrix HL(w, b) is:

HL(w, b) = [ ∂²L/∂w²     ∂²L/∂w∂b ]
           [ ∂²L/∂b∂w    ∂²L/∂b²  ]

Compute the second-order partial derivatives:

∂²L/∂w² = 0.8
∂²L/∂b² = 14
∂²L/∂w∂b = ∂²L/∂b∂w = 0
Thus, the Hessian matrix is:

HL(w, b) = [ 0.8    0 ]
           [ 0     14 ]

The determinant of this matrix is:

Determinant = (0.8 · 14) − (0 · 0) = 11.2.

9. Compute the eigenvalues and eigenvectors of the Hessian. According to the eigenvalues of the Hessian, which parameter is the loss more sensitive to?
(a) b
(b) w

Correct Answer: (a)


Solution: The Hessian matrix is:

HL(w, b) = [ 0.8    0 ]
           [ 0     14 ]

The eigenvalues are λ1 = 0.8 and λ2 = 14, with corresponding eigenvectors (1, 0)ᵀ and (0, 1)ᵀ, respectively. The larger eigenvalue, λ2 = 14, corresponds to the parameter b, so the loss is more sensitive to changes in b.
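The determinant and eigen-decomposition in Q8 and Q9 can be verified with a small NumPy check (not part of the original solution):

```python
import numpy as np

H = np.array([[0.8,  0.0],
              [0.0, 14.0]])           # Hessian of L at (w*, b*)

print(np.linalg.det(H))               # 11.2
eigvals, eigvecs = np.linalg.eigh(H)  # symmetric matrix -> eigh
print(eigvals)                        # [ 0.8 14. ]
print(eigvecs)                        # columns are the eigenvectors (1, 0) and (0, 1)

# The largest eigenvalue (14) lies along the b-axis, so the loss curves most
# steeply, i.e. is most sensitive, in the b direction.
```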

10. Consider the problem of recognizing a letter (in upper case or lower case) of the English language in an image. There are 26 letters in the language. Therefore, a team decided to use a CNN to solve this problem. Suppose that a data augmentation technique is being used for regularization. Which of the following transformation(s) on all the training images is (are) appropriate for the problem?

(a) Rotating the images by ±10°
(b) Rotating the images by ±180°
(c) Translating the images by 1 pixel in all directions
(d) Cropping

Correct Answer: (a),(c),(d)


Solution:
Cropping:
Appropriate. Cropping is useful for augmenting data by varying the parts of the
image that are used for training. This can help the model learn to recognize letters
even if they are partially obscured or not centered perfectly. It ensures that the model
is robust to variations in the position of the letter within the image.

Rotating the images by ±10°:

Appropriate. Rotating images slightly (such as ±10°) helps the model become invariant to small rotational changes. This is useful because in practical scenarios,
characters might be slightly tilted, and the model should be able to recognize them
regardless of minor rotations.

Rotating the images by ±180°:

Not appropriate. Rotating images by 180° is generally not useful for character recognition because it produces completely inverted images. For example, 'A' would become '∀' and 'b' would become 'q', so such a rotation can even turn one valid character into a different one and corrupt the labels. These rotations do not represent valid variations in the context of character recognition.

Translating the image by 1 pixel in all directions:


Appropriate. Translating images by small amounts (such as 1 pixel) helps the model
become robust to slight positional shifts. This can improve the model’s ability to
recognize characters that are not perfectly aligned or are slightly shifted.
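A minimal sketch of such an augmentation pipeline, assuming torchvision is used and 28×28 grayscale character images; the exact parameter values (rotation range, padding, translation fraction) are illustrative choices, not specified in the question.

```python
from torchvision import transforms

# Small rotations, ~1-pixel translations, and modest random crops are
# label-preserving for character images; large (180°) rotations are not.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=10),                # option (a): ±10°
    transforms.RandomAffine(degrees=0,
                            translate=(1 / 28, 1 / 28)),  # option (c): ~1-pixel shift
    transforms.RandomCrop(size=28, padding=2),            # option (d): cropping
    transforms.ToTensor(),
])
```

RandomAffine's translate argument is a fraction of the image size, so 1/28 of a 28-pixel image corresponds to roughly a one-pixel shift in each direction.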
