Module 04: Regularization & Normalization
1. Introduction to Regularization and Normalization
What is Overfitting?
• Overfitting occurs when a model learns not only the useful patterns but also the noise
in the training data.
• The model performs well on the training set but poorly on unseen data.
• Overfitting is indicated by:
o Low training error and high validation/test error.
o High model complexity, leading to memorization rather than generalization.
What is Underfitting?
• Underfitting happens when a model is too simple to capture the underlying patterns in
the data.
• It has high bias, meaning it makes strong assumptions about the data.
• Underfitting is indicated by high training error and high test error.
How to Address Overfitting and Underfitting?
• Regularization: Adds constraints or penalties to prevent overfitting.
• Normalization: Scales the data to improve convergence and stability.
• Better Activation Functions & Initialization: Ensure stable learning.
• Data Augmentation: Expands the dataset artificially.
Use Cases:
• Deep Learning: Overfitting is common in deep networks with small datasets.
• Stock Market Prediction: Complex models may fit historical data but fail in future
trends.
2. Key Differences: Overfitting vs Underfitting
• Definition: Overfitting means the model learns noise in the training data and fails on test data; underfitting means the model is too simple and fails to capture the true pattern.
• Training Error: very low when overfitting, high when underfitting.
• Test Error: high in both cases.
• Bias: low when overfitting, high when underfitting.
• Variance: high when overfitting, low when underfitting.
• Complexity: too complex when overfitting, too simple when underfitting.
• Generalization: poor in both cases (an overfit model fails on new data; an underfit model does not even work well on training data).
• Example Model: a deep neural network with too many layers and parameters (overfitting) vs. linear regression used for a non-linear problem (underfitting).
Causes of Overfitting & Underfitting
Causes of Overfitting:
• Too many parameters (complex models).
• Not enough training data.
• Training too long without stopping.
• Noise in the data influencing the model.
Causes of Underfitting:
• Too few parameters (simple models).
• Too much regularization.
• Not training long enough.
• Poor feature selection.
1. Example in Education (Student Performance Prediction)
Scenario
A university wants to predict students' final exam scores based on their study hours, previous
grades, attendance, and extracurricular activities.
Underfitting Case
• The model only considers study hours as the predictor (Linear Regression: Final
Score = a * Study Hours + b).
• It ignores attendance, previous grades, and extracurricular activities, leading to
poor predictions.
• The model cannot learn important trends → Underfitting occurs.
Overfitting Case
• The model includes too many unnecessary factors, such as:
o The brand of pen the student uses
o The color of their notebooks
o The temperature in the exam hall
• These factors do not actually affect the final score, but the model memorizes them
instead of learning real relationships → Overfitting occurs.
Solution: Select relevant features such as study hours, previous grades, and attendance
while ignoring unnecessary details.
2. Example in Medical Diagnosis (Disease Prediction)
Scenario
A hospital builds an AI model to detect whether a patient has diabetes based on age, weight,
blood sugar levels, and exercise habits.
Underfitting Case
• The model only uses age and weight to predict diabetes, ignoring blood sugar levels
and exercise habits.
• Since diabetes is directly related to blood sugar levels, missing that feature leads to
poor accuracy → Underfitting occurs.
Overfitting Case
• The model also considers unnecessary personal details like:
o The brand of shoes the patient wears
o The time of the day the test was taken
• These do not actually cause diabetes, but the model memorizes them instead of
learning real medical patterns → Overfitting occurs.
Solution: Use only clinically relevant factors such as blood sugar levels, weight, and
exercise habits.
Bias in Deep Learning and Types
1. What is Bias in Deep Learning?
Bias in deep learning refers to errors in model predictions due to incorrect assumptions in
the learning algorithm. It represents the inability of a model to capture the true pattern of the
data.
Bias leads to poor generalization and can cause issues such as underfitting (high bias) or
systematic errors due to dataset imbalances or improper training techniques.
1. Selection Bias
Definition:
Selection bias occurs when the training data is not representative of the real-world data,
leading to poor generalization on unseen examples. It happens when certain groups or
patterns are overrepresented or underrepresented in the dataset.
Example
A facial recognition system is trained mostly on light-skinned faces but deployed globally.
The model performs well on light-skinned individuals but fails to recognize darker-skinned
faces accurately.
Use Cases Where Selection Bias Occurs
• Medical AI Models: If a disease detection model is trained only on data from one
hospital, it may not generalize well to patients in other locations.
• Self-Driving Cars: If a self-driving car is trained only on city roads, it may fail to
navigate rural areas.
• Speech Recognition: If a voice assistant is trained on English speakers with
American accents, it may struggle to understand non-native English speakers.
Advantages & Disadvantages
Pros:
• Can be useful for localized models (e.g., region-specific applications).
• Can reduce training complexity if data is controlled.
Cons:
• Leads to poor model generalization.
• Can introduce biases against underrepresented groups.
How to Fix Selection Bias?
• Collect diverse and representative training data.
• Use data augmentation to introduce missing variations.
• Apply stratified sampling to balance different categories (see the sketch below).
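To illustrate the stratified sampling point above, here is a minimal sketch using scikit-learn's train_test_split; the feature matrix and group labels are made up for the example.

```python
# Minimal sketch: stratified splitting with scikit-learn (hypothetical data).
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(-1, 1)      # dummy feature matrix
y = np.array([0] * 15 + [1] * 5)      # imbalanced group labels (assumption)

# stratify=y keeps the group proportions the same in the train and test splits,
# so the minority group is not accidentally left out of training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
print(np.bincount(y_train), np.bincount(y_test))
```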
2. Confirmation Bias
Definition:
Confirmation bias occurs when model predictions are interpreted or used in a way that
reinforces pre-existing beliefs rather than objectively evaluating new data. This can be a
result of biased training data or human intervention in model interpretation.
Example
A hiring AI model is trained using past recruitment data, where male candidates were
historically favored. The model learns this bias and continues to prefer male candidates for
job positions, reinforcing past hiring patterns.
Use Cases Where Confirmation Bias Occurs
• News Feed Algorithms: Social media platforms recommend content that aligns with
user interests, reinforcing echo chambers.
• Credit Scoring: A loan approval model may unfairly reject certain demographics if
past data shows lower approvals in those groups.
• Judicial AI Systems: Predictive policing tools may disproportionately flag minority
groups if trained on biased crime data.
Advantages & Disadvantages
Pros:
• Can improve user engagement in recommendation systems.
• Reinforces successful patterns in historical data.
Cons:
• Leads to unfair and biased decisions.
• Prevents adaptation to new trends.
How to Fix Confirmation Bias?
• Regularly audit and update training data to remove historical bias.
• Introduce fairness constraints in AI models.
• Use counterfactual data augmentation (e.g., train models with synthetic data that challenges existing assumptions).
3. Overgeneralization Bias
Definition:
Overgeneralization bias occurs when a model assumes patterns that do not exist in all
cases. The model learns a rule too broadly, leading to incorrect predictions in special cases.
Example
A sentiment analysis model learns that words like "great" and "amazing" always indicate
a positive sentiment. However, in a sentence like "The movie was amazing… amazingly
bad!", the model still classifies it as positive, missing the sarcasm.
Use Cases Where Overgeneralization Bias Occurs
• Image Classification: A model trained to recognize dogs may mistakenly classify a
wolf as a dog because it overgeneralizes features like fur and four legs.
• Medical Diagnosis: A model trained to detect cancer may flag any abnormality as
cancer, even if it is benign.
• Spam Detection: If a spam filter learns that emails containing “free” are always
spam, it may block legitimate emails with “free” in them.
Advantages & Disadvantages
Pros:
• Can improve efficiency in rule-based classification.
• Reduces model complexity by avoiding too many specific rules.
Cons:
• Leads to false positives and misclassifications.
• May fail on edge cases and nuanced scenarios.
How to Fix Overgeneralization Bias?
• Use fine-grained feature engineering to differentiate subtle patterns.
• Train the model on edge cases and exceptions to the rule.
• Use adversarial examples to challenge model assumptions.
Regularization & Bias-Variance Tradeoff:
1. Bias-Variance Tradeoff
Definition
The bias-variance tradeoff is a key concept in machine learning that describes the balance
between bias (error from overly simplistic models) and variance (error from overly
complex models).
• High Bias (Underfitting): The model is too simple and fails to capture underlying
patterns.
• High Variance (Overfitting): The model is too complex and learns noise along with
patterns.
• Goal: Achieve an optimal balance where the model generalizes well to unseen data.
Mathematical Representation
The total error (Expected Loss) is given by:
Expected Loss = Bias² + Variance + Irreducible Error
• Bias²: Error due to incorrect assumptions in the model.
• Variance: Error due to sensitivity to small fluctuations in the training data.
• Irreducible Error: Noise in the data that cannot be removed.
Examples
1. High Bias (Underfitting Example)
o A linear regression model trying to fit a highly non-linear dataset (e.g., a
quadratic or exponential function).
o The model performs poorly on both training and test data.
2. High Variance (Overfitting Example)
o A deep neural network trained on a small dataset learns unnecessary noise
and outliers, making it perform well on training data but poorly on test data.
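To make the tradeoff concrete, here is a minimal NumPy sketch (the quadratic ground truth, noise level, and polynomial degrees are arbitrary choices for illustration) that fits polynomials of increasing degree and compares training and test error:

```python
# Minimal sketch: underfitting vs. overfitting on synthetic data (assumption:
# the true relationship is quadratic with Gaussian noise).
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3, 3, 30))
y = 0.5 * x**2 + rng.normal(0, 0.5, size=x.shape)   # true pattern + noise
x_test = np.linspace(-3, 3, 100)
y_test = 0.5 * x_test**2                            # noise-free test targets

for degree in (1, 2, 10):          # too simple, about right, too flexible
    coeffs = np.polyfit(x, y, degree)
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

The degree-1 fit shows high bias (large error everywhere), while the high-degree fit shows high variance (low training error, worse test error).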
Regularization Methods
Regularization techniques help reduce overfitting by penalizing complex models, making
them simpler and more generalizable.
2.1 L1 and L2 Regularization
(a) L1 Regularization (Lasso Regression)
L1 regularization adds the absolute value of coefficients as a penalty to the loss function,
encouraging sparsity. This makes it useful for feature selection.
Mathematical Formula:
Loss = Original Loss + λ Σ |wᵢ|
Example:
• Suppose we have a house price prediction model with features:
area, bedrooms, swimming pool, neighborhood.
• If "swimming pool" has low correlation with price, L1 regularization may remove it
from the model by setting its coefficient to zero.
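As a rough illustration of this behaviour, the sketch below uses scikit-learn's Lasso on a synthetic house-price dataset in which the "swimming pool" column is unrelated to price by construction; with a suitable λ (alpha), its coefficient is expected to be driven to or near zero. The data and alpha value are assumptions for the example.

```python
# Minimal sketch: L1 regularization (Lasso) zeroing out an uninformative feature.
# Synthetic data (assumption); column order: area, bedrooms, pool, neighborhood.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 200
area = rng.uniform(50, 200, n)
bedrooms = rng.integers(1, 5, n)
pool = rng.integers(0, 2, n)                  # unrelated to price by construction
neighborhood = rng.uniform(0, 10, n)
price = 3.0 * area + 10.0 * bedrooms + 5.0 * neighborhood + rng.normal(0, 5, n)

X = np.column_stack([area, bedrooms, pool, neighborhood])
model = Lasso(alpha=1.0).fit(X, price)
print(dict(zip(["area", "bedrooms", "pool", "neighborhood"], model.coef_)))
```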
Advantages & Disadvantages
Pros:
• Performs feature selection by setting some weights to zero.
• Handles high-dimensional sparse data well.
Cons:
• May remove useful features.
• Computationally more expensive to optimize.
(b) L2 Regularization (Ridge Regression)
L2 regularization adds squared values of coefficients as a penalty to the loss function.
Unlike L1, it shrinks weights towards zero but does not eliminate them.
Mathematical Formula:
Loss = Original Loss + λ Σ wᵢ²
Example:
• In a fraud detection system, many features may contribute to fraud detection. L2
regularization helps reduce overfitting by preventing large weights without removing
useful features.
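In deep learning frameworks, L2 regularization is typically applied through the optimizer's weight decay term. Below is a minimal PyTorch sketch with an arbitrary small network and a dummy batch; the layer sizes and λ (weight_decay) value are assumptions.

```python
# Minimal sketch: L2 regularization via weight decay in PyTorch (assumed setup).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))

# weight_decay adds an L2 penalty on the weights to each update,
# shrinking them toward zero without eliminating them.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

x, y = torch.randn(32, 20), torch.randn(32, 1)   # dummy batch (assumption)
loss = nn.MSELoss()(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```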
Advantages & Disadvantages
Pros:
• Reduces overfitting while keeping all features.
• Works well when all features are useful.
Cons:
• Does not perform feature selection.
• Requires fine-tuning of λ.
2.2 Early Stopping
Definition
Early stopping monitors validation loss during training and stops before overfitting starts.
Example:
In image classification, if training loss decreases but validation loss starts increasing after 50
epochs, early stopping prevents overfitting.
Mathematical Implementation:
• Compute training loss and validation loss after each epoch.
• If validation loss increases for n consecutive epochs, stop training.
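A minimal, framework-agnostic sketch of this rule is shown below; the validation-loss values are made up, and in practice they would come from evaluating the model after each epoch.

```python
# Minimal sketch of patience-based early stopping (validation losses are made up).
val_losses = [0.90, 0.70, 0.55, 0.50, 0.48, 0.49, 0.50, 0.52, 0.53]

patience = 3                  # allowed epochs without improvement
best_loss = float("inf")
epochs_without_improvement = 0

for epoch, loss in enumerate(val_losses):
    if loss < best_loss:
        best_loss = loss
        epochs_without_improvement = 0   # here you would also checkpoint the model
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f"Early stopping at epoch {epoch} (best val loss {best_loss:.2f})")
            break
```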
Advantages & Disadvantages
Pros:
• Reduces overfitting without modifying the model.
• Saves computational resources.
Cons:
• Requires monitoring validation loss.
• May stop too early if validation loss fluctuates.
2.3 Dataset Augmentation
Definition
Dataset augmentation artificially increases the training dataset using transformations like:
• Rotation, Flipping, Zooming, Cropping, Color Shifting, Noise Injection
Example:
In medical image analysis, if we only have 1000 X-ray images, augmenting them by
flipping, rotating, and adjusting brightness can create more training samples.
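A minimal sketch of such an augmentation pipeline with torchvision transforms is shown below; the specific transforms, parameters, and placeholder image are assumptions for illustration.

```python
# Minimal sketch: image augmentation pipeline with torchvision (example settings).
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                # flipping
    transforms.RandomRotation(degrees=10),                 # small rotation
    transforms.ColorJitter(brightness=0.2),                # brightness shift
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # zoom / crop
    transforms.ToTensor(),
])

img = Image.new("RGB", (256, 256))    # placeholder image (assumption)
augmented = augment(img)              # each call produces a new random variant
print(augmented.shape)                # torch.Size([3, 224, 224])
```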
Advantages & Disadvantages
Pros:
• Improves generalization and reduces overfitting.
• Useful when data collection is expensive.
Cons:
• Increases training time.
• May distort information if not applied correctly.
2.4 Parameter Sharing and Tying
Definition
• Parameter sharing reduces the number of independent parameters by reusing the
same weights in different parts of the model.
• Parameter tying enforces constraints so that certain parameters are identical in
different layers.
Example:
• CNNs (Convolutional Neural Networks) use parameter sharing, where a filter
(kernel) scans the entire image, reducing the number of parameters compared to fully
connected networks.
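The saving from parameter sharing is easy to verify directly. The PyTorch sketch below (with arbitrarily chosen channel counts and image size) compares the parameter count of a 3x3 convolution with that of a fully connected layer producing an output of the same size.

```python
# Minimal sketch: parameter sharing in a convolution vs. a fully connected layer.
import torch.nn as nn

def num_params(layer):
    return sum(p.numel() for p in layer.parameters())

# A 3x3 conv reuses the same small kernel at every spatial position.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

# A dense layer mapping a 32x32 RGB image to the same output size needs a
# separate weight for every (input pixel, output unit) pair.
fc = nn.Linear(3 * 32 * 32, 16 * 32 * 32)

print("conv parameters:", num_params(conv))   # 3*3*3*16 + 16 = 448
print("fc parameters:  ", num_params(fc))     # roughly 50 million
```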
Advantages & Disadvantages
Pros:
• Reduces model size and computation.
• Improves generalization.
Cons:
• Limits flexibility in feature learning.
• Works best on structured data such as images.
Comparison of Regularization Methods
• L1 Regularization: shrinks weights to zero, removing features. Best use case: feature selection in high-dimensional data.
• L2 Regularization: shrinks weights but keeps all features. Best use case: regression and neural networks.
• Early Stopping: stops training when validation loss increases. Best use case: deep learning models (CNNs, RNNs).
• Dataset Augmentation: increases dataset size artificially. Best use case: image processing, NLP.
• Parameter Sharing & Tying: reuses weights to reduce complexity. Best use case: CNNs, transformers, RNNs.
Greedy Layer-wise Pre-training
Definition
Greedy layer-wise pre-training is an unsupervised training approach in which a deep network is trained one layer at a time, rather than training all layers simultaneously. This method helps initialize weights properly, reducing the risk of vanishing gradients.
Why is it needed?
• Deep neural networks often suffer from poor weight initialization and vanishing
gradients.
• Greedy pre-training trains each layer sequentially, providing a good starting point
for the network.
How it Works?
1. First layer is trained as an unsupervised autoencoder (or RBM - Restricted
Boltzmann Machine).
2. Once trained, its weights are frozen and the next layer is added on top.
3. This process continues layer by layer until the entire network is initialized.
4. Finally, the full network is fine-tuned using backpropagation.
Example - Pre-training for an Image Classifier
1. Train the first layer (input → hidden) using unsupervised learning
(autoencoders/RBM).
2. Use the first layer's output as input to train the second layer.
3. Stack more layers, training each one separately.
4. Once all layers are trained, fine-tune the entire model using supervised learning.
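A condensed PyTorch sketch of this procedure using one-layer autoencoders is shown below; the layer sizes, random stand-in data, and training lengths are placeholders rather than a production recipe.

```python
# Minimal sketch: greedy layer-wise pre-training with one-layer autoencoders
# (hypothetical sizes and random data; not a production recipe).
import torch
import torch.nn as nn

data = torch.randn(256, 784)          # stand-in for unlabeled inputs
layer_sizes = [784, 256, 64]
encoders = []
current_input = data

for in_dim, out_dim in zip(layer_sizes[:-1], layer_sizes[1:]):
    encoder = nn.Linear(in_dim, out_dim)
    decoder = nn.Linear(out_dim, in_dim)
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

    # Train this layer to reconstruct its own input (unsupervised).
    for _ in range(100):
        recon = decoder(torch.relu(encoder(current_input)))
        loss = nn.functional.mse_loss(recon, current_input)
        opt.zero_grad(); loss.backward(); opt.step()

    encoders.append(encoder)
    # Freeze this layer's output and use it as input for the next layer.
    current_input = torch.relu(encoder(current_input)).detach()

# Stack the pre-trained encoders and add a classifier head for supervised fine-tuning.
model = nn.Sequential(encoders[0], nn.ReLU(), encoders[1], nn.ReLU(), nn.Linear(64, 10))
```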
Advantages
1. Helps in training very deep networks.
2. Reduces overfitting by providing better weight initialization.
3. Works well when labeled data is scarce.
Disadvantages
1. Computationally expensive due to layerwise training.
2. Less effective for modern deep architectures (like ResNet, Transformers) which
use Batch Normalization instead.
Better Activation Functions
What are Activation Functions?
Activation functions introduce non-linearity in neural networks, allowing them to learn
complex patterns.
Types of Activation Functions
(a) Sigmoid Activation Function
• Output range: (0,1)
• Used in binary classification.
Issues:
1. Causes vanishing gradients → slows down deep networks.
2. Output is not zero-centered, causing slow convergence.
(b) Tanh Activation Function
1. Output range: (-1,1)
2. Zero-centered, making it better than sigmoid.
(c) ReLU (Rectified Linear Unit)
• Output range: [0, ∞)
• Used in CNNs, deep networks.
Advantages:
1. Faster convergence.
2. Avoids vanishing gradients.
Issue: Dying ReLU Problem: neurons can get stuck producing zero outputs.
(d) Leaky ReLU & Parametric ReLU (PReLU)
1. Fixes the Dying ReLU Problem.
2. Used in deep CNNs, GANs.
(e) Softmax Function
• Used in multi-class classification.
• Converts outputs into probabilities.
• Sigmoid: range (0, 1). Pros: smooth, differentiable. Cons: vanishing gradient.
• Tanh: range (-1, 1). Pros: zero-centered. Cons: gradient still vanishes.
• ReLU: range [0, ∞). Pros: fast, no vanishing gradient. Cons: dying neurons.
• Leaky ReLU: range (-∞, ∞). Pros: fixes dying ReLU. Cons: adds small overhead.
• Softmax: range (0, 1). Pros: produces a probability distribution. Cons: expensive computation.
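To compare these functions side by side, the brief PyTorch sketch below applies each one to the same arbitrary sample values.

```python
# Minimal sketch: comparing common activation functions on sample inputs.
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])   # arbitrary inputs

print("sigmoid:   ", torch.sigmoid(x))           # squashes into (0, 1)
print("tanh:      ", torch.tanh(x))              # zero-centered, (-1, 1)
print("relu:      ", torch.relu(x))              # zeroes out negatives
print("leaky_relu:", F.leaky_relu(x, negative_slope=0.01))  # small slope for negatives
print("softmax:   ", F.softmax(x, dim=0))        # probabilities summing to 1
```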
Better Weight Initialization Methods
Poor weight initialization can lead to:
• Vanishing gradients (if weights are too small)
• Exploding gradients (if weights are too large)
(a) Zero Initialization (Bad Method)
wᵢ = 0
• Causes all neurons to have the same weights → model never learns.
(b) Random Initialization
wᵢ ∼ U(−1, 1)
• Works better than zero initialization but still causes problems in deep networks.
(c) Xavier (Glorot) Initialization
• Balances the variance of activations across layers.
• Used in tanh-based networks.
1. Prevents exploding/vanishing gradients.
2. Not ideal for ReLU-based networks.
(d) He Initialization (Best for ReLU)
• Used in ReLU-based networks.
• Works well in deep CNNs.
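A brief PyTorch sketch of applying both schemes (layer sizes are arbitrary):

```python
# Minimal sketch: Xavier vs. He initialization in PyTorch.
import torch.nn as nn

tanh_layer = nn.Linear(256, 128)
relu_layer = nn.Linear(256, 128)

# Xavier/Glorot: keeps activation variance roughly constant for tanh/sigmoid layers.
nn.init.xavier_uniform_(tanh_layer.weight)

# He/Kaiming: accounts for ReLU zeroing out half the activations.
nn.init.kaiming_normal_(relu_layer.weight, nonlinearity="relu")

print(tanh_layer.weight.std().item(), relu_layer.weight.std().item())
```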
Batch Normalization (BN)
Definition
Batch Normalization normalizes activations across a mini-batch to stabilize training.
How it Works?
1. Compute the mean (μ) and variance (σ²) of the activations in each mini-batch.
2. Normalize the activations: x̂ = (x − μ) / √(σ² + ε).
3. Apply learnable scaling (γ) and shifting (β) parameters: y = γ · x̂ + β.
Advantages
1. Speeds up training
2. Reduces internal covariate shift
3. Reduces dependence on weight initialization
Example
Before Batch Normalization:
• A deep CNN takes 200 epochs to converge.
After adding Batch Normalization:
• The same network converges in 50 epochs!
When to Use?
• Used in CNNs, Transformers, GANs.
• Can be applied after every activation layer.
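A minimal PyTorch sketch of where BN layers typically sit in a small CNN is shown below; the channel counts, image size, and batch size are arbitrary.

```python
# Minimal sketch: adding Batch Normalization layers to a small CNN (arbitrary sizes).
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),     # normalizes each of the 16 channels over the mini-batch
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(32, 10),
)

x = torch.randn(8, 3, 32, 32)    # dummy mini-batch of 8 RGB images
print(model(x).shape)            # torch.Size([8, 10])
```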
Batch Normalization (BN) Example:
Let's say we have a small dataset with 5 samples and 1 feature. Before applying batch
normalization, the values vary significantly. Our goal is to normalize them so that the
network trains faster.
Step 1: Given Data (Before Normalization)
Let's assume a mini-batch of 5 samples with feature values:
X = [5, 10, 15, 20, 25]
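Carrying this mini-batch through the normalization step by hand with NumPy (ε, γ, and β are set to typical default values here as an assumption):

```python
# Minimal sketch: batch-normalizing the mini-batch X = [5, 10, 15, 20, 25] by hand.
import numpy as np

x = np.array([5.0, 10.0, 15.0, 20.0, 25.0])
mu = x.mean()                      # 15.0
var = x.var()                      # 50.0 (variance over the batch)
eps = 1e-5                         # small constant for numerical stability
x_hat = (x - mu) / np.sqrt(var + eps)

gamma, beta = 1.0, 0.0             # learnable scale/shift, at their usual initial values
y = gamma * x_hat + beta
print(x_hat)                       # approx [-1.414, -0.707, 0.0, 0.707, 1.414]
```

After normalization the values are centered around 0 with unit variance, which keeps activations in a stable range and speeds up training.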