MCQ1
Sampling in SGD
With Replacement: Each sample is chosen independently.
Without Replacement: Each sample is used once per epoch.
Role of Step Size / Learning Rate
The learning rate η controls how much we update the parameters at each step.
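A minimal NumPy sketch of the two sampling schemes (variable names and sizes are illustrative, not from the lecture); the learning rate η only appears in the update rule itself:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, eta = 100, 0.1          # eta is the learning rate

# With replacement: each index is drawn independently, so a sample
# may appear several times (or not at all) within an "epoch".
idx_with = rng.integers(0, n_samples, size=n_samples)

# Without replacement: shuffle once per epoch, so every sample
# is visited exactly once before any sample repeats.
idx_without = rng.permutation(n_samples)

# Either way, the learning rate scales every update:
#   w <- w - eta * grad_i(w),  where grad_i is the gradient on sample i
```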
✅ Correct Answer: B
2. What is the key difference between Stochastic Gradient Descent (SGD) and
Batch Gradient Descent?
A) SGD uses the full dataset, while Batch GD updates weights after each data point.
B) Batch GD uses a single data point per update, while SGD uses all data points.
C) SGD updates weights using only one data point per iteration, while Batch GD updates using
all training data.
D) There is no difference.
✅ Correct Answer: C
✅ Correct Answer: B
✅ Correct Answer: A
✅ Correct Answer: D
🔹 Theoretical Problems
6. Compute the Number of Parameters in a Regularized Linear Model
Input features: 15
Regularization term: L2 (Ridge)
Bias term included
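A brief worked answer, assuming the standard linear model y = w·x + b implied above: each of the 15 input features gets one weight, the bias adds one more parameter, and the L2 (Ridge) penalty changes the loss function but adds no parameters.

Total parameters = 15 (weights) + 1 (bias) = 16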
🔹 Comparison: Deterministic GD vs. Stochastic GD
| Criterion | Deterministic GD (Batch GD) | Stochastic GD (SGD) |
|---|---|---|
| Update per step | Uses entire dataset | Uses one sample |
| Speed | Slow for large datasets | Faster |
| Noise in updates | Low (smooth) | High (noisy) |
| Memory usage | High | Low |
| Best used for | Small datasets | Large datasets |
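A minimal NumPy sketch contrasting the two update rules on toy linear-regression data (all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # 100 samples, 3 features
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
w = np.zeros(3)
eta = 0.01

# Batch (deterministic) GD: one update uses the gradient over ALL samples.
grad_full = 2 * X.T @ (X @ w - y) / len(y)
w_batch = w - eta * grad_full

# SGD: one update uses the gradient of a SINGLE randomly chosen sample.
i = rng.integers(len(y))
grad_i = 2 * X[i] * (X[i] @ w - y[i])
w_sgd = w - eta * grad_i
```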
📌 Summary
✅ Deterministic GD updates weights using the entire dataset, leading to stable convergence
but high computational cost.
✅ Gradient Computation: Uses the sum of all data points.
✅ Memory-Intensive: Requires holding the entire dataset in memory.
✅ Slower for Large Datasets, so Mini-Batch SGD is often preferred.
Key Takeaways
✅ Seq2Vec is ideal for tasks where an input sequence is converted into a single vector for
classification.
✅ It is used in applications like sentiment analysis, spam detection, and document
classification.
✅ Machine translation and text summarization require Seq2Seq models, not Seq2Vec.
Here are twisted MCQs and theoretical problems designed to test deep conceptual
understanding.
2. If the learning rate is too low in gradient descent, what is the most likely
outcome?
A) The algorithm will converge very quickly but might not reach the optimal solution.
B) The algorithm will oscillate and never converge.
C) The algorithm will take a very long time to converge.
D) The algorithm will stop learning after a few iterations.
✅ Correct Answer: C
Explanation: A very small learning rate makes weight updates tiny, leading to slow
convergence.
✅ Correct Answer: C
Explanation: Mini-Batch GD still requires computing gradients, but in smaller batches.
✅ Correct Answer: B
Explanation: Momentum helps reduce oscillations and speed up convergence by
incorporating past gradients.
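For reference, the standard momentum update the explanation refers to (β is the momentum coefficient, η the learning rate; this is the classical heavy-ball form, not necessarily the exact variant used in the lecture):

v_t = β·v_{t−1} + ∇L(w_{t−1})
w_t = w_{t−1} − η·v_t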
5. Which of the following statements is FALSE about the loss function in
machine learning?
A) The loss function measures how well a model’s predictions match actual outputs.
B) Different loss functions can be used depending on the type of task (classification, regression).
C) The loss function is minimized during training using optimization algorithms.
D) A zero loss always means that the model has perfectly generalized to unseen data.
✅ Correct Answer: D
Explanation: A zero loss might indicate overfitting, meaning the model memorized the training
data but may not generalize well to new data.
6. What would happen if you remove the bias term from a linear model?
✅ Correct Answer: B
Explanation: Removing bias forces the model to pass through the origin, limiting its flexibility.
7. A neural network is trained with a very high learning rate. What is the most
likely outcome?
✅ Correct Answer: C
Explanation: A high learning rate causes instability in training, leading to oscillations or
divergence.
8. Which optimizer would be the best choice when working with sparse data?
✅ Correct Answer: B
Explanation: Adagrad adjusts learning rates for each parameter, making it suitable for sparse
data.
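For reference, the Adagrad update that gives each parameter its own effective learning rate (applied element-wise; g_t is the current gradient, ε a small constant — the standard formulation, stated here as context rather than taken from the question):

G_t = G_{t−1} + g_t²
w_t = w_{t−1} − (η / (√G_t + ε)) · g_t

Parameters that receive infrequent (sparse) gradients accumulate a small G_t and therefore keep a larger effective step size.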
✅ Correct Answer: B
Explanation: High-dimensional data can make optimization harder due to sparsity and slower
convergence.
✅ Correct Answer: C
Explanation: MSE can lead to slow and unstable convergence in logistic regression, while
Cross-Entropy Loss is more suitable for probability-based outputs.
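For a single example with label y ∈ {0, 1} and predicted probability ŷ, the binary cross-entropy loss referred to above is:

L(y, ŷ) = −[ y·log ŷ + (1 − y)·log(1 − ŷ) ]

With a sigmoid output, its gradient with respect to the logit is simply (ŷ − y), which is why it converges more reliably than MSE in logistic regression.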
L(w) = (w − 5)²
📌 Task:
1. Compute dL/dw.
2. Find the gradient at w = 3.
3. Perform one step of gradient descent using η = 0.1.
Solution:
1. Compute the gradient: dL/dw = 2(w − 5).
2. Gradient at w = 3: 2(3 − 5) = −4.
3. Gradient descent step: w_new = w − η · dL/dw = 3 − 0.1 × (−4) = 3.4.
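A few lines of Python reproducing the computation above (purely illustrative):

```python
w, eta = 3.0, 0.1

grad = 2 * (w - 5)          # dL/dw = 2(w - 5) -> -4 at w = 3
w_new = w - eta * grad      # 3 - 0.1 * (-4) = 3.4
print(grad, w_new)          # -4.0 3.4
```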
You are training a neural network with ReLU activation and notice that the model
underperforms when the bias term is removed.
📌 Task: Explain why removing the bias term might negatively impact performance.
✅ Answer:
If the bias term is removed, each neuron's decision boundary is forced to pass through the origin.
This reduces the flexibility of the model.
Certain patterns cannot be learned because the model loses the ability to shift its decision boundaries.
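In equation form (the notation here is illustrative, not from the lecture):

without bias:  w·x = 0        (the boundary always contains the origin)
with bias:     w·x + b = 0    (the offset b shifts the boundary away from the origin)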
A neural network is trained using SGD, and the loss function fluctuates instead of steadily
decreasing.
📌 Task: What are three possible reasons for this behavior?
✅ Answer:
1. The learning rate is too high, so each update overshoots the minimum.
2. The gradient is estimated from a single sample (or a very small batch), so the updates are inherently noisy.
3. The data is noisy or poorly standardized, or the learning rate is not decayed as training progresses.
This lecture covers the fundamental concepts of feedforward neural networks (FNNs) for
tabular data, how they are trained, and improvements to their performance.
3. Feature Standardization
4. Improved Optimizers
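A minimal PyTorch sketch of a feedforward network for tabular data, tying these points together (standardized features, ReLU hidden layers, Adam optimizer); layer sizes and names are illustrative, not taken from the lecture:

```python
import torch
import torch.nn as nn

# Illustrative shapes: 15 standardized input features, 3 output classes.
model = nn.Sequential(
    nn.Linear(15, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 3),                 # raw logits; CrossEntropyLoss applies softmax
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Dummy standardized batch (in practice: (x - mean) / std per feature).
x = torch.randn(8, 15)
y = torch.randint(0, 3, (8,))

for _ in range(5):                    # a few training steps
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```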
📌 Summary
Below are multiple-choice questions (MCQs), theoretical problems, and coding exercises to
reinforce concepts covered in Lecture 17: Deep Learning for Tabular Data.
🔹 MCQ Problems
1. What is a key difference between tabular and non-tabular data?
2. Which activation function is most commonly used in the hidden layers of a feedforward neural network?
A) Softmax
B) Sigmoid
C) ReLU
D) Tanh
✅ Correct Answer: C (ReLU is most commonly used due to its efficiency and non-linearity.)
✅ Correct Answer: B
Which optimizer combines momentum with adaptive, per-parameter learning rates?
A) Adam
B) SGD
C) Newton’s Method
D) Softmax
✅ Correct Answer: A (Adam optimizer combines momentum and adaptive learning rates.)
🔹 Theoretical Problems
6. Compute the Number of Parameters in a Feedforward Neural Network
Solution (assuming 10 input features, one hidden layer of 20 neurons, and 5 output units):
Hidden layer: 10 × 20 weights + 20 biases; output layer: 20 × 5 weights + 5 biases.
Total Parameters = 200 + 20 + 100 + 5 = 325
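The same count can be checked in PyTorch for the assumed 10 → 20 → 5 architecture:

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 5))
print(sum(p.numel() for p in model.parameters()))   # 325
```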
Below are conceptual and twisted problems designed to test your deep understanding of neural
networks and hidden layers.
A researcher claims that adding more hidden layers always improves performance.
📌 Task:
Explain when adding more layers is beneficial and when it can hurt performance.
✅ Answer:
Beneficial: When the problem is highly non-linear, deeper networks can learn
hierarchical features.
Harmful: If the dataset is small, adding more layers can lead to overfitting and
vanishing gradients.
2. Why Might a Deeper Neural Network Perform Worse Than a Shallower One?
✅ Answer:
Deeper networks are harder to optimize: gradients can vanish or explode as they propagate through many layers, and the extra capacity makes overfitting more likely on limited data, so a shallower network can end up generalizing better.
Surprisingly, Network X outperforms Network Y even though Network Y has more layers.
📌 Task: How can a deeper network have lower complexity than a shallower one?
✅ Answer:
Model complexity depends on the total number of parameters, not just on depth. A deep network with few neurons per layer can have fewer parameters than a shallow but very wide network, so Network X can be both shallower and better-performing while Network Y's extra layers add optimization difficulty without adding useful capacity.
A company is training a neural network for fraud detection and wants to test a 100-layer model.
✅ Answer:
1. Computational Cost: More layers require more training time and memory, making it
impractical.
2. Diminishing Returns: After a certain depth, additional layers stop improving accuracy.
3. Gradient Instability: Without batch normalization or residual connections, deep
networks struggle with vanishing or exploding gradients.
5. When Should You Add More Layers vs. More Neurons Per Layer?
You have a neural network with 2 hidden layers and 50 neurons per layer. You want to
improve performance.
📌 Task: Should you add more layers or more neurons per layer? Why?
✅ Answer:
If the problem is highly complex (e.g., image recognition), adding more layers helps
learn hierarchical features.
If the model is underfitting, adding more neurons per layer increases the network’s
capacity.
Empirical tuning is required—sometimes wider networks (more neurons per layer)
work better than deeper networks (more layers).
A professor claims that a single hidden layer neural network can approximate any function.
📌 Task: Is this claim correct? Explain its practical limitations.
✅ Answer:
The claim reflects the universal approximation theorem: a single hidden layer with enough neurons can approximate any continuous function on a compact domain to arbitrary accuracy. In practice, however, the required number of neurons can be enormous, and the theorem says nothing about whether such a network can be trained efficiently or will generalize well, which is why deeper architectures are often preferred.
A company wants to build a deep learning model but decides to remove all hidden layers to
make training faster.
✅ Answer:
Without hidden layers, the network reduces to a linear (or logistic) model. Training is faster, but the model can only learn linear decision boundaries and cannot capture non-linear relationships in the data, so accuracy on complex problems will suffer.
A team is building a 50-layer neural network but finds that adding more layers worsens
performance.
📌 Task: How can skip connections (residual connections) solve this issue?
✅ Answer:
Skip connections allow information to flow directly from earlier layers to later layers,
avoiding vanishing gradients.
Deep Residual Networks (ResNets) use this technique to train very deep networks
(e.g., 100+ layers) successfully.
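A minimal sketch of a residual (skip) connection in PyTorch; the block below is illustrative, not the exact ResNet formulation:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = ReLU(x + F(x)): the identity path lets gradients bypass F."""
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return torch.relu(x + self.f(x))   # skip connection: add the input back

x = torch.randn(4, 32)
print(ResidualBlock(32)(x).shape)          # torch.Size([4, 32])
```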
10. If Adding More Layers Is Useful, Why Do Some Deep Networks Still Use
Fewer Layers?
Google and OpenAI often rely on very deep networks, but for many problems practitioners use shallower networks or non-neural methods such as gradient-boosted trees (e.g., XGBoost).
📌 Task: Explain why deep learning is not always the best choice.
✅ Answer:
Deep networks need large amounts of data and compute, require careful tuning, and are relatively hard to interpret. On small or medium-sized tabular datasets, gradient-boosted trees such as XGBoost often match or beat deep networks with far less effort, so deep learning is not always the best choice.