MCQ1

The document discusses various gradient descent optimization techniques, including Deterministic Gradient Descent, Stochastic Gradient Descent, and Mini-Batch SGD, highlighting their advantages and disadvantages. It emphasizes the importance of learning rate and regularization in preventing overfitting and ensuring convergence. Additionally, it covers the structure and training of Feedforward Neural Networks for tabular data, detailing key components and advanced concepts in deep learning.

Uploaded by

Adithya Ayanam

Gradient Descent Optimization:

Deterministic Gradient Descent: Uses the full dataset to compute the gradient at each step

More accurate gradient estimation

Slow for large datasets

Stochastic Gradient Descent

Uses one data point at a time for updates

Faster updates; better suited to large datasets

Noisy updates with high variance

Mini Batch SGD

Uses small batches of data instead of a single point or the whole dataset

Balances the speed of SGD against the gradient accuracy of full-batch descent

Choosing the right batch size is crucial
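
The three variants above differ only in how many samples feed each update. A minimal sketch, assuming a toy noise-free dataset from y = 2x + 1 and a one-weight linear model (the function names, learning rate, and batch size of 32 are illustrative choices, not from the notes):

```python
import random

def make_data(n=200, seed=0):
    """Synthetic 1-D regression data from y = 2x + 1 (noise-free, for illustration)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 for x in xs]
    return xs, ys

def train(xs, ys, batch_size, epochs=50, lr=0.1, seed=0):
    """Gradient descent on mean squared error; batch_size selects the variant."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    n = len(xs)
    updates = 0
    for _ in range(epochs):
        order = list(range(n))
        rng.shuffle(order)  # without replacement: each point is used once per epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            # Average gradient of (w*x + b - y)^2 over the current batch.
            gw = sum(2.0 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            gb = sum(2.0 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * gw
            b -= lr * gb
            updates += 1
    return w, b, updates

xs, ys = make_data()
for name, bs in [("batch GD", len(xs)), ("SGD", 1), ("mini-batch SGD", 32)]:
    w, b, updates = train(xs, ys, batch_size=bs)
    print(f"{name:15s} w={w:.3f} b={b:.3f} updates={updates}")
```

With the same number of epochs, batch GD makes one update per epoch, SGD makes one per sample, and mini-batch SGD sits in between.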

Sampling in SGD
• With Replacement: Each sample is drawn independently, so a point can appear more than once in an epoch.
• Without Replacement: Each sample is used exactly once per epoch.

Role of Step Size / Learning Rate
• The learning rate η controls how much the parameters change at each step.

Learning Rate η | Effect
Too large | Overshooting, failure to converge.
Too small | Slow convergence.
Well chosen | Fast, stable convergence.
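
The three regimes can be seen on a toy problem. A minimal sketch, assuming the quadratic loss L(w) = (w - 5)^2 with gradient 2(w - 5); the specific η values and the step count are illustrative:

```python
def descend(eta, w0=0.0, steps=50):
    """Run gradient descent on L(w) = (w - 5)^2 and return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * (w - 5.0)
        w -= eta * grad
    return w

for label, eta in [("too large", 1.1), ("too small", 0.001), ("reasonable", 0.1)]:
    w = descend(eta)
    print(f"{label:10s} eta={eta:<6} final w={w:.4g} distance from minimum={abs(w - 5.0):.3g}")
```

Each step multiplies the distance to the minimum by (1 - 2η), so η = 1.1 grows the error, η = 0.001 barely shrinks it, and η = 0.1 shrinks it quickly.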

Regularization reduces overfitting by adding a penalty on large weights to the loss.


✅ Gradient Descent updates weights to minimize loss.
✅ SGD is faster than Batch GD but introduces noise.
✅ Mini-batch SGD balances efficiency and accuracy.
✅ Choosing the right learning rate is crucial for convergence.

Multiple-Choice Questions (MCQs)


1. What is the purpose of adding a regularization term to the loss function?

A) To increase the model complexity


B) To prevent overfitting by penalizing large weights
C) To make training faster
D) To eliminate bias completely

✅ Correct Answer: B

2. What is the key difference between Stochastic Gradient Descent (SGD) and
Batch Gradient Descent?

A) SGD uses the full dataset, while Batch GD updates weights after each data point.
B) Batch GD uses a single data point per update, while SGD uses all data points.
C) SGD updates weights using only one data point per iteration, while Batch GD updates using
all training data.
D) There is no difference.
✅ Correct Answer: C

3. Which loss function is most commonly used for binary classification?

A) Mean Squared Error (MSE)


B) Cross-Entropy Loss
C) Hinge Loss
D) L1 Loss

✅ Correct Answer: B

4. What happens when the learning rate is too high?

A) The model may overshoot and fail to converge.


B) The model will converge slowly.
C) The gradients will vanish completely.
D) The regularization will increase.

✅ Correct Answer: A

5. What is the primary advantage of Mini-Batch SGD over Batch Gradient Descent?

A) It updates parameters more frequently, improving convergence speed.


B) It eliminates noise in updates completely.
C) It requires less memory than Batch GD.
D) Both A and C.

✅ Correct Answer: D

🔹 Theoretical Problems
6. Compute the Number of Parameters in a Regularized Linear Model

Consider a linear regression model with:

• Input features: 15
• Regularization term: L2 (Ridge)
• Bias term included
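
Assuming the task is to count the trainable parameters, a minimal sketch: an L2 penalty changes the loss but adds no weights, so the count is one weight per feature plus the bias. The function name `linear_model_params` is illustrative.

```python
def linear_model_params(n_features, include_bias=True):
    """Trainable parameters of a linear model; an L2 (Ridge) penalty adds none."""
    return n_features + (1 if include_bias else 0)

print(linear_model_params(15))  # 15 weights + 1 bias
```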
🔹 Comparison: Deterministic GD vs. Stochastic GD
Criterion | Deterministic GD (Batch GD) | Stochastic GD (SGD)
Update Per Step | Uses entire dataset | Uses one sample
Speed | Slow for large datasets | Faster
Noise in Updates | Low (smooth) | High (noisy)
Memory Usage | High | Low
Best Used For | Small datasets | Large datasets
📌 Summary
✅ Deterministic GD updates weights using the entire dataset, leading to stable convergence
but high computational cost.
✅ Gradient Computation: Sums (or averages) the per-example gradients over all data points.
✅ Memory-Intensive: Requires holding the entire dataset in memory.
✅ Slower for Large Datasets, so Mini-Batch SGD is often preferred.

🔹 Why Not the Other Options?

Option | Why Not?
A) Machine Translation | Machine translation requires Seq2Seq (Sequence-to-Sequence) models, where both input and output are sequences (e.g., English → French translation).
C) Text Summarization | Requires a Seq2Seq model, as the input is a long text, and the output is another sequence (shortened summary).
D) Image Caption Generation | This involves a CNN + Seq2Seq model, where a CNN extracts image features and an RNN generates a sequence (caption).

Key Takeaways

✅ Seq2Vec is ideal for tasks where an input sequence is converted into a single vector for
classification.
✅ It is used in applications like sentiment analysis, spam detection, and document
classification.
✅ Machine translation and text summarization require Seq2Seq models, not Seq2Vec.

📌 Advanced Multiple-Choice Questions (MCQs) and Theoretical Problems on Gradient Descent and Loss Functions

Here are twisted MCQs and theoretical problems designed to test deep conceptual
understanding.

🔹 Multiple-Choice Questions (MCQs)


1. Which of the following is TRUE about Stochastic Gradient Descent (SGD)?

A) It guarantees convergence to the global minimum for all loss functions.


B) It updates the weights after computing the gradient over the entire dataset.
C) It can escape shallow local minima due to its noisy updates.
D) It is always slower than Batch Gradient Descent.
✅ Correct Answer: C
Explanation: The noise in SGD allows it to escape shallow local minima and potentially find a
better minimum.

2. If the learning rate is too low in gradient descent, what is the most likely
outcome?

A) The algorithm will converge very quickly but might not reach the optimal solution.
B) The algorithm will oscillate and never converge.
C) The algorithm will take a very long time to converge.
D) The algorithm will stop learning after a few iterations.

✅ Correct Answer: C
Explanation: A very small learning rate makes weight updates tiny, leading to slow
convergence.

3. Which of the following is NOT a reason why we use Mini-Batch Gradient Descent?

A) It reduces memory requirements compared to Batch GD.


B) It smooths out the noisy updates of SGD while keeping computational efficiency.
C) It eliminates the need for computing gradients altogether.
D) It balances between stability and efficiency.

✅ Correct Answer: C
Explanation: Mini-Batch GD still requires computing gradients, but in smaller batches.

4. In gradient descent, what is the primary reason for using momentum?

A) To prevent overfitting by reducing the number of parameters.


B) To accelerate learning and smooth out oscillations.
C) To dynamically adjust the learning rate.
D) To make weight updates completely random.

✅ Correct Answer: B
Explanation: Momentum helps reduce oscillations and speed up convergence by
incorporating past gradients.
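
A minimal sketch of classical (heavy-ball) momentum, assuming an ill-conditioned toy loss L(x, y) = 0.5(x^2 + 50y^2); the loss, η = 0.03, and β = 0.9 are illustrative choices. Setting β = 0 recovers plain gradient descent:

```python
def loss(x, y):
    return 0.5 * (x * x + 50.0 * y * y)

def grad(x, y):
    return x, 50.0 * y

def run(eta=0.03, beta=0.0, steps=200):
    """Gradient descent with classical momentum; beta=0 gives plain GD."""
    x, y = 1.0, 1.0
    vx, vy = 0.0, 0.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        # Velocity is an exponentially decaying sum of past gradients.
        vx = beta * vx + gx
        vy = beta * vy + gy
        x -= eta * vx
        y -= eta * vy
    return loss(x, y)

loss_gd = run(beta=0.0)
loss_momentum = run(beta=0.9)
print(f"plain GD loss: {loss_gd:.3e}")
print(f"momentum loss: {loss_momentum:.3e}")
```

On this problem the step size is limited by the steep y direction, so plain GD crawls along the flat x direction; the accumulated velocity lets momentum cover that direction faster.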
5. Which of the following statements is FALSE about the loss function in
machine learning?

A) The loss function measures how well a model’s predictions match actual outputs.
B) Different loss functions can be used depending on the type of task (classification, regression).
C) The loss function is minimized during training using optimization algorithms.
D) A zero loss always means that the model has perfectly generalized to unseen data.

✅ Correct Answer: D
Explanation: A zero loss might indicate overfitting, meaning the model memorized the training
data but may not generalize well to new data.

6. What would happen if you remove the bias term from a linear model?

A) The model will still fit the data perfectly.


B) The model will only be able to fit data that passes through the origin.
C) The model will always perform better without bias terms.
D) The model will take longer to converge in gradient descent.

✅ Correct Answer: B
Explanation: Removing bias forces the model to pass through the origin, limiting its flexibility.

7. A neural network is trained with a very high learning rate. What is the most
likely outcome?

A) The model will converge faster to the optimal solution.


B) The loss function will decrease smoothly over time.
C) The model will overshoot the minimum and might diverge.
D) The model will train more accurately without regularization.

✅ Correct Answer: C
Explanation: A high learning rate causes instability in training, leading to oscillations or
divergence.

8. Which optimizer would be the best choice when working with sparse data?

A) Stochastic Gradient Descent (SGD)


B) Adagrad
C) RMSprop
D) Newton’s Method

✅ Correct Answer: B
Explanation: Adagrad adjusts learning rates for each parameter, making it suitable for sparse
data.

9. If the number of features in a dataset increases significantly, which of the following might happen in gradient descent optimization?

A) The loss function might become non-differentiable.


B) The training might suffer from the curse of dimensionality, slowing convergence.
C) The gradient descent updates will become smaller, leading to faster convergence.
D) The number of neurons in each layer will automatically increase.

✅ Correct Answer: B
Explanation: High-dimensional data can make optimization harder due to sparsity and slower
convergence.

10. In Logistic Regression, why do we use Cross-Entropy Loss instead of Mean Squared Error (MSE)?

A) Cross-Entropy Loss is computationally cheaper than MSE.


B) MSE does not work for binary classification problems.
C) Cross-Entropy Loss ensures better convergence in gradient-based optimization.
D) MSE does not provide probabilistic outputs.

✅ Correct Answer: C
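Explanation: With a sigmoid output, MSE gives a non-convex loss whose gradient shrinks toward zero when the sigmoid saturates, so learning can stall on confidently wrong predictions. Cross-Entropy Loss is convex for logistic regression, and its gradient stays proportional to the prediction error.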
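
A minimal sketch of the gradient difference, assuming a single logit z passed through a sigmoid and a confidently wrong prediction (z = -6 with true label 1; both values are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_mse(z, y):
    """Derivative with respect to z of (sigmoid(z) - y)^2."""
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

def grad_cross_entropy(z, y):
    """Derivative with respect to z of -[y*log(p) + (1-y)*log(1-p)], p = sigmoid(z)."""
    return sigmoid(z) - y

z, y = -6.0, 1.0  # the model is confidently wrong
print(f"MSE gradient:           {grad_mse(z, y):.5f}")
print(f"cross-entropy gradient: {grad_cross_entropy(z, y):.5f}")
```

The extra factor p(1 - p) in the MSE gradient is what makes it nearly vanish when the sigmoid is saturated.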

🔹 Theoretical Problems (Twisted Conceptual Questions)


11. Compute One Step of Gradient Descent

You are training a model with a single weight w, and its loss is:

$L(w) = (w - 5)^2$

📌 Task:

1. Compute $\frac{dL}{dw}$.
2. Find the gradient at $w = 3$.
3. Perform one step of gradient descent using $\eta = 0.1$.

Solution:

1. Compute Gradient:

$\frac{dL}{dw} = 2(w - 5)$

2. Evaluate at $w = 3$:

$\frac{dL}{dw} = 2(3 - 5) = 2(-2) = -4$

3. Update Using Gradient Descent:

$w_{\text{new}} = w - \eta \frac{dL}{dw}$

$w_{\text{new}} = 3 - (0.1 \times -4) = 3 + 0.4 = 3.4$

✅ Final Answer: $w_{\text{new}} = 3.4$
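
The same step can be checked numerically. A minimal sketch (the helper name `dL_dw` is illustrative):

```python
def dL_dw(w):
    """Gradient of L(w) = (w - 5)^2."""
    return 2.0 * (w - 5.0)

w, eta = 3.0, 0.1
g = dL_dw(w)
w_new = w - eta * g
print(f"gradient at w=3: {g}, updated w: {w_new:.4f}")
```

The update moves w toward the minimum at w = 5 because the gradient is negative there.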

12. Impact of Removing the Bias Term

You are training a neural network with ReLU activation and notice that the model
underperforms when the bias term is removed.

📌 Task: Explain why removing the bias term might negatively impact performance.

✅ Answer:

• Without a bias term, each neuron computes ReLU(w·x), so the boundary where it switches on is forced to pass through the origin.
• This reduces the flexibility of the model.
• Certain patterns cannot be learned because the model loses the ability to shift decision boundaries.

13. Convergence of Gradient Descent

A neural network is trained using SGD, and the loss function fluctuates instead of steadily
decreasing.
📌 Task: What are three possible reasons for this behavior?

✅ Answer:

1. Learning Rate is Too High → Causes oscillations and prevents convergence.


2. Mini-Batch Size is Too Small → Updates become noisy and unstable.
3. Gradient Explosion → Weights increase too much, making loss unstable.

14. Role of Step Size in Adaptive Optimizers

Consider two different optimizers: SGD and Adam.

📌 Task: Why does Adam perform better in some cases?

✅ Answer:

• Plain SGD applies one global learning rate to every parameter (constant unless a schedule is used).
• Adam adapts the step size of each parameter using running averages of past gradients and squared gradients, which helps prevent slow updates or divergence.
• This often makes Adam more efficient for training deep networks.
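
A minimal sketch of one Adam update for a scalar parameter, using the commonly cited default hyperparameters (β1 = 0.9, β2 = 0.999, ε = 1e-8); the function name, the `state` tuple layout, and the learning rate of 0.01 are assumptions for illustration:

```python
import math

def adam_step(w, g, state, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; `state` holds (m, v, t)."""
    m, v, t = state
    t += 1
    m = beta1 * m + (1.0 - beta1) * g        # running mean of gradients
    v = beta2 * v + (1.0 - beta2) * g * g    # running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)           # bias correction for the zero initialization
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, (m, v, t)

for g in (1e-3, 1.0, 1e3):
    w_sgd = 0.0 - 0.01 * g                   # plain SGD step scales with the gradient
    w_adam, _ = adam_step(0.0, g, (0.0, 0.0, 0))
    print(f"gradient={g:<8} SGD step={abs(w_sgd):.2e} Adam step={abs(w_adam):.2e}")
```

On the first step Adam moves by roughly the learning rate regardless of the gradient's scale, whereas the SGD step is tiny for small gradients and huge for large ones.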

Deep Learning for Tabular Data (Feedforward Networks)

This lecture covers the fundamental concepts of feedforward neural networks (FNNs) for
tabular data, how they are trained, and improvements to their performance.

🔹 Tabular vs. Non-Tabular Data


Type | Definition | Examples | Best Suited Model
Tabular Data | Structured data stored in tables with rows (samples) and columns (features). | CSV files, relational databases, customer transactions, financial data. | Feedforward Neural Networks (FNNs), XGBoost, Random Forests
Non-Tabular Data | Data that doesn’t fit into a table format and has a more complex structure. | Images, text, audio, videos. | CNNs (for images), RNNs/LSTMs (for sequences), Transformers (for NLP)

🔹 Basic Flow of Operations in a Feedforward Network


A feedforward neural network (FNN) consists of multiple layers where the data moves in one
direction, from input to output (there are no loops, unlike in RNNs).

1. Input Layer: Receives the data (features).


2. Hidden Layers: Transform inputs using weights, biases, and activation functions.
3. Output Layer: Produces predictions (classification or regression).
4. Training: Uses Stochastic Gradient Descent (SGD) and backpropagation to adjust
weights.

🔹 Single Hidden Layer Case


A basic neural network with one hidden layer:

$z = W x + b$
$h = \sigma(z)$
$y = W' h + b'$

where:

• $x$ = input data
• $W, W'$ = weight matrices
• $b, b'$ = bias terms
• $\sigma$ = activation function (e.g., ReLU, sigmoid)
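
The three equations above can be sketched as a forward pass in plain Python. The weights below are made-up values for a 2-input, 3-hidden-unit, 1-output network, ReLU is assumed for σ, and `W2`, `b2` stand in for W′ and b′:

```python
def relu(v):
    return [max(0.0, a) for a in v]

def affine(W, x, b):
    """Compute W x + b for a matrix W (list of rows), vector x, and bias vector b."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def forward(x, W, b, W2, b2):
    z = affine(W, x, b)      # z = W x + b
    h = relu(z)              # h = sigma(z), with sigma = ReLU here
    y = affine(W2, h, b2)    # y = W' h + b'
    return y

# Hypothetical weights: 2 inputs -> 3 hidden units -> 1 output.
W = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]
b = [0.0, -1.0, 0.5]
W2 = [[1.0, 2.0, -1.0]]
b2 = [0.1]

y = forward([2.0, 1.0], W, b, W2, b2)
print(y)
```

Here z = [1.0, 0.5, -0.5], so ReLU zeroes the third hidden unit before the output layer is applied.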

🔹 Key Components in a Neural Network


Component | Description
Nodes/Units | Computational elements in each layer.
Inputs | Feature values from the dataset.
Weights | Learnable parameters that define the importance of each input.
Biases | Additional learnable parameters that shift the output.
Activations | Functions applied to node outputs to introduce non-linearity.

🔹 Common Activation Functions


Activation | Formula | Use Case
ReLU (Rectified Linear Unit) | $\max(0, x)$ | Image processing, deep networks
Sigmoid | $\frac{1}{1+e^{-x}}$ | Binary classification
Softmax | $\frac{e^{x_i}}{\sum_j e^{x_j}}$ | Multi-class classification
Tanh | $\frac{e^x - e^{-x}}{e^x + e^{-x}}$ | Centered around zero, used in RNNs
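
A minimal sketch of these four activations using only the standard library. Subtracting the maximum inside softmax is a common numerical-stability trick and is an addition here, not part of the formula above:

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def softmax(xs):
    m = max(xs)                               # shift by the max to avoid overflow
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

print(relu(-2.0), relu(3.0))
print(sigmoid(0.0))
print(tanh(0.5))
print(softmax([1.0, 2.0, 3.0]))
```

Softmax outputs are positive and sum to 1, which is why it is used to produce class probabilities.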

🔹 Training a Feedforward Neural Network


1. Initialize weights and biases.
2. Forward propagation:
o Compute layer outputs using activations.
3. Compute loss function:
o Cross-Entropy Loss (for classification)
o Mean Squared Error (MSE) (for regression)
4. Backpropagation:
o Compute gradients using the chain rule.
o Adjust weights using Gradient Descent or Adam optimizer.
5. Repeat until convergence.
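
The five steps above can be sketched end to end without any framework. This is a toy illustration under stated assumptions: one input, four tanh hidden units, one linear output, MSE loss, full-batch gradient descent, and a made-up regression target y = x^2. All names and hyperparameters are illustrative:

```python
import math
import random

def init_params(hidden, rng):
    """Step 1: random weights, zero biases (1 input -> hidden -> 1 output)."""
    return {
        "W1": [rng.uniform(-1.0, 1.0) for _ in range(hidden)],
        "b1": [0.0] * hidden,
        "W2": [rng.uniform(-1.0, 1.0) for _ in range(hidden)],
        "b2": 0.0,
    }

def forward(p, x):
    """Step 2: hidden activations h = tanh(W1*x + b1), output y = W2.h + b2."""
    h = [math.tanh(w * x + b) for w, b in zip(p["W1"], p["b1"])]
    y = sum(w * a for w, a in zip(p["W2"], h)) + p["b2"]
    return h, y

def mse(p, xs, ys):
    """Step 3: mean squared error over the dataset."""
    return sum((forward(p, x)[1] - t) ** 2 for x, t in zip(xs, ys)) / len(xs)

def train_epoch(p, xs, ys, lr):
    """Step 4: backpropagate with the chain rule, then take one gradient step."""
    n, hidden = len(xs), len(p["W1"])
    gW1, gb1, gW2, gb2 = [0.0] * hidden, [0.0] * hidden, [0.0] * hidden, 0.0
    for x, t in zip(xs, ys):
        h, y = forward(p, x)
        dy = 2.0 * (y - t) / n                         # dLoss/dy for this sample
        gb2 += dy
        for j in range(hidden):
            gW2[j] += dy * h[j]
            dz = dy * p["W2"][j] * (1.0 - h[j] ** 2)   # tanh'(z) = 1 - tanh(z)^2
            gW1[j] += dz * x
            gb1[j] += dz
    for j in range(hidden):
        p["W1"][j] -= lr * gW1[j]
        p["b1"][j] -= lr * gb1[j]
        p["W2"][j] -= lr * gW2[j]
    p["b2"] -= lr * gb2

xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [x * x for x in xs]                               # toy regression target
params = init_params(hidden=4, rng=random.Random(0))
initial_loss = mse(params, xs, ys)
for _ in range(2000):                                  # Step 5: repeat
    train_epoch(params, xs, ys, lr=0.05)
final_loss = mse(params, xs, ys)
print(f"loss before: {initial_loss:.4f}  after: {final_loss:.4f}")
```

In practice a framework computes the gradients automatically; writing them by hand here only makes the chain rule in step 4 visible.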

🔹 Advanced Concepts in Deep Learning for Tabular Data


1. Adding Depth (Deep Networks)

• More hidden layers can give richer feature representations.
• Helps in learning complex patterns, provided there is enough data to avoid overfitting.

2. Partially Connected Networks & Dropout

• Dropout: Randomly zeroes a fraction of neuron activations during training to prevent overfitting; all neurons are used at inference.
• Batch Normalization: Standardizes activations within each mini-batch to speed up training.
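
A minimal sketch of dropout, assuming the common "inverted" formulation in which surviving activations are scaled by 1/(1 - p) during training so that nothing needs rescaling at inference. The function signature is illustrative:

```python
import random

def dropout(activations, p_drop, rng, training=True):
    """Inverted dropout: zero each activation with probability p_drop, rescale survivors."""
    if not training or p_drop == 0.0:
        return list(activations)   # inference: activations pass through unchanged
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
h = [1.0] * 10000
out = dropout(h, 0.5, rng)
dropped = sum(1 for a in out if a == 0.0)
print(f"fraction dropped: {dropped / len(out):.3f}")
print(f"mean activation after dropout: {sum(out) / len(out):.3f}")
```

Because survivors are scaled up, the expected activation stays the same between training and inference.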

3. Feature Standardization

• Normalize features using Z-score normalization or Min-Max scaling.
• Keeps features on comparable scales, which reduces the risk of exploding gradients and improves training stability.
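
A minimal sketch of both scalings for a single feature column, using the standard library `statistics` module. The income values are made up, and a constant column (zero spread) would need special handling that is omitted here:

```python
import statistics

def z_score(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)   # assumes the column is not constant
    return [(v - mu) / sigma for v in values]

def min_max(values):
    """Rescale linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)   # assumes hi > lo
    return [(v - lo) / (hi - lo) for v in values]

incomes = [20000.0, 35000.0, 50000.0, 65000.0, 80000.0]
print([round(z, 3) for z in z_score(incomes)])
print(min_max(incomes))
```

In a real pipeline the mean, standard deviation, minimum, and maximum are computed on the training split only and then reused for validation and test data.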

4. Improved Optimizers

Optimizer | Key Idea
SGD (Stochastic Gradient Descent) | Updates weights based on mini-batches.
Momentum | Uses past gradients to smooth updates.
Adam (Adaptive Moment Estimation) | Combines momentum & adaptive learning rates.

🔹 When to Use Deep Learning for Tabular Data?
Scenario | Best Approach
Small dataset (<100k rows) | XGBoost, Random Forests
Large dataset (100k+ rows, many features) | Deep Neural Networks (FNNs)
Complex feature interactions | Deep Learning with Feature Engineering
Time-series forecasting | RNNs, LSTMs

📌 Summary

✅ Feedforward Neural Networks are effective for tabular data.


✅ Adding more layers allows for capturing complex interactions.
✅ SGD and backpropagation help optimize weights.
✅ Feature standardization & dropout improve performance.
✅ Choose deep learning when working with large datasets or complex patterns.


📌 Problems Related to Feedforward Networks & Deep Learning for Tabular Data

Below are multiple-choice questions (MCQs), theoretical problems, and coding exercises to
reinforce concepts covered in Lecture 17: Deep Learning for Tabular Data.

🔹 MCQ Problems
1. What is a key difference between tabular and non-tabular data?

A) Tabular data is always small, while non-tabular data is large.


B) Tabular data consists of structured rows and columns, while non-tabular data includes text,
images, and videos.
C) Deep learning always outperforms traditional ML methods for tabular data.
D) Non-tabular data cannot be processed using deep learning models.
✅ Correct Answer: B

2. Which activation function is commonly used in hidden layers of feedforward networks?

A) Softmax
B) Sigmoid
C) ReLU
D) Tanh

✅ Correct Answer: C (ReLU is most commonly used due to its efficiency and non-linearity.)

3. Why is feature standardization important in deep learning for tabular data?

A) It reduces the need for activation functions.


B) It speeds up convergence by preventing large weight updates.
C) It removes the need for backpropagation.
D) It increases the number of layers in the network.

✅ Correct Answer: B

4. Which optimizer is an improvement over standard SGD by incorporating adaptive learning rates and momentum?

A) Adam
B) SGD
C) Newton’s Method
D) Softmax

✅ Correct Answer: A (Adam optimizer combines momentum and adaptive learning rates.)

5. What is the primary advantage of dropout in deep learning?

A) It improves training speed.


B) It prevents overfitting by randomly removing neurons during training.
C) It eliminates the need for batch normalization.
D) It increases the model size.
✅ Correct Answer: B

🔹 Theoretical Problems
6. Compute the Number of Parameters in a Feedforward Neural Network

Consider a feedforward neural network with:

• Input layer: 10 neurons
• Hidden layer: 20 neurons
• Output layer: 5 neurons

Each neuron in a layer is fully connected to the next layer.

📌 Task: Compute the total number of parameters (weights and biases).

Solution:

• Weights between Input and Hidden Layer = 10 × 20 = 200
• Biases for Hidden Layer = 20
• Weights between Hidden and Output Layer = 20 × 5 = 100
• Biases for Output Layer = 5

Total Parameters = 200 + 20 + 100 + 5 = 325

✅ Final Answer: 325 parameters
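
The same count generalizes to any fully connected stack: each pair of adjacent layers contributes n_in × n_out weights plus n_out biases. A minimal sketch (the function name is illustrative):

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network given its layer widths."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out   # weight matrix plus one bias per output unit
    return total

print(count_parameters([10, 20, 5]))
```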


📌 Twisted Theoretical Questions on Adding More Hidden Layers in Neural
Networks

Below are conceptual and twisted problems designed to test your deep understanding of neural
networks and hidden layers.

1. Why Does Adding More Hidden Layers Improve a Neural Network?

A researcher claims that adding more hidden layers always improves performance.

📌 Task:

Explain when adding more layers is beneficial and when it can hurt performance.

✅ Answer:

• Beneficial: When the problem is highly non-linear, deeper networks can learn hierarchical features.
• Harmful: If the dataset is small, adding more layers can lead to overfitting; very deep networks can also suffer from vanishing gradients regardless of dataset size.

2. Why Might a Deeper Neural Network Perform Worse Than a Shallower One?

Consider two neural networks:

• Network A: 3 hidden layers
• Network B: 15 hidden layers

Despite having more layers, Network B performs worse on the test set.

📌 Task: Identify three possible reasons for this behavior.

✅ Answer:

1. Vanishing Gradient Problem: In deeper networks, gradients become too small, preventing effective learning in earlier layers.
2. Overfitting: More parameters can cause the network to memorize training data instead of generalizing.
3. Optimization Issues: More layers make training harder, leading to slower convergence or getting stuck in poor local minima.

3. Does Adding More Hidden Layers Always Increase Model Complexity?

A team of researchers trained two neural networks:

• Network X: 4 hidden layers
• Network Y: 8 hidden layers

Surprisingly, Network X outperforms Network Y even though Network Y has more layers.

📌 Task: How can a deeper network have lower complexity than a shallower one?

✅ Answer:

• Regularization Techniques (Dropout, L1/L2): If Network Y uses aggressive dropout or L2 regularization, its effective capacity is reduced, which can make it underfit and perform worse.
• Wider vs. Deeper Trade-off: A network with fewer layers but more neurons per layer can capture patterns better than an unnecessarily deep one.
• Bad Weight Initialization: If weights are poorly initialized, deeper networks can get stuck in suboptimal solutions.

4. If More Hidden Layers Are Good, Why Don’t We Use Hundreds of Layers?

A company is training a neural network for fraud detection and wants to test a 100-layer model.

📌 Task: Explain why ultra-deep networks are rarely used in practice.

✅ Answer:

1. Computational Cost: More layers require more training time and memory, making it
impractical.
2. Diminishing Returns: After a certain depth, additional layers stop improving accuracy.
3. Gradient Instability: Without batch normalization or residual connections, deep
networks struggle with vanishing or exploding gradients.

5. When Should You Add More Layers vs. More Neurons Per Layer?

You have a neural network with 2 hidden layers and 50 neurons per layer. You want to
improve performance.

📌 Task: Should you add more layers or more neurons per layer? Why?

✅ Answer:

• If the problem is highly complex (e.g., image recognition), adding more layers helps learn hierarchical features.
• If the model is underfitting, adding more neurons per layer increases the network’s capacity.
• Empirical tuning is required: sometimes wider networks (more neurons per layer) work better than deeper networks (more layers).

6. Can a Neural Network with Only One Hidden Layer Be a Universal Approximator?

A professor claims that a single hidden layer neural network can approximate any function.

📌 Task:

1. Is this statement true?


2. If so, why do we use multiple hidden layers in practice?
✅ Answer:

1. True: According to the Universal Approximation Theorem, a single hidden layer MLP with sufficiently many neurons and a suitable non-linear activation can approximate any continuous function on a bounded (compact) input domain to arbitrary accuracy.
2. However, deep networks are preferred because:
o They are more parameter-efficient: For many functions, a deep network needs far fewer neurons than a wide shallow one.
o They often generalize better: Multiple layers allow feature learning at different levels (e.g., edges → shapes → objects in vision tasks).

7. What Happens If We Remove All Hidden Layers from an FNN?

A company wants to build a deep learning model but decides to remove all hidden layers to
make training faster.

📌 Task: Explain how this affects model performance.

✅ Answer:

• Without hidden layers, the model reduces to a single linear map of the inputs (plus any output activation).
• Non-linear relationships cannot be captured, leading to poor performance on complex tasks.
• This is equivalent to logistic regression (for classification) or linear regression (for regression tasks).

8. How Can Skip Connections Help Deep Networks?

A team is building a 50-layer neural network but finds that adding more layers worsens
performance.

📌 Task: How can skip connections (residual connections) solve this issue?

✅ Answer:

• Skip connections allow information to flow directly from earlier layers to later layers, which mitigates vanishing gradients.
• Deep Residual Networks (ResNets) use this technique to train very deep networks (e.g., 100+ layers) successfully.
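
A minimal sketch of why the identity path helps, assuming a deliberately simplified chain of scalar tanh layers with a shared weight w = 0.5 (real ResNets use learned multi-dimensional blocks; this toy only tracks the derivative of the output with respect to the input):

```python
import math

def input_gradient(depth, w=0.5, residual=False, x=1.0):
    """Return d(output)/d(input) for a chain of scalar tanh layers via the chain rule."""
    h, grad = x, 1.0
    for _ in range(depth):
        a = math.tanh(w * h)
        local = w * (1.0 - a * a)                   # derivative of tanh(w*h) with respect to h
        if residual:
            h, grad = h + a, grad * (1.0 + local)   # identity path adds 1 to the local derivative
        else:
            h, grad = a, grad * local
    return grad

print(f"plain 50-layer chain:    {input_gradient(50):.3e}")
print(f"residual 50-layer chain: {input_gradient(50, residual=True):.3e}")
```

In the plain chain every factor is at most 0.5, so the product collapses toward zero; with the skip connection every factor is at least 1, so the gradient cannot vanish.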

9. Why Do Some Deep Networks Perform Well Without Vanishing Gradients?


Despite the vanishing gradient problem, architectures like ResNets and Transformers train
very deep networks.

📌 Task: Explain why these networks avoid vanishing gradients.

✅ Answer:

1. Skip Connections (ResNets): Allow gradients to flow unimpeded, reducing degradation in deep networks.
2. Batch Normalization: Prevents gradient explosion/shrinkage by normalizing activations.
3. Self-Attention (Transformers): Processes input in parallel instead of sequentially like RNNs, avoiding long dependency chains.

10. If Adding More Layers Is Useful, Why Do Some Deep Networks Still Use
Fewer Layers?

Google and OpenAI often use deep networks, but many practitioners still choose models that are
not deep networks at all (e.g., XGBoost, a gradient-boosted tree ensemble).

📌 Task: Explain why deep learning is not always the best choice.

✅ Answer:

• Deep networks require massive data: For small datasets, tree-based models (XGBoost, Random Forests) often outperform deep learning.
• Overfitting risk: More layers can lead to memorization rather than generalization.
• Computation cost: Shallow models train faster and are often more interpretable.

📌 Summary of Twisted Questions


Question Type | Concept Tested
Why deeper networks sometimes fail | Vanishing gradients, overfitting, computational cost
Wider vs. deeper networks | When to add layers vs. more neurons per layer
Why one hidden layer is sometimes enough | Universal Approximation Theorem
Why deep learning is not always the best choice | Small dataset limitations, overfitting, computation
How to improve deep networks | Skip connections, batch normalization, self-attention
