
Module 04: Regularization & Normalization

1. Introduction to Regularization and Normalization

What is Overfitting?

• Overfitting occurs when a model learns not only the useful patterns but also the noise
in the training data.

• The model performs well on the training set but poorly on unseen data.

• Overfitting is indicated by:

o Low training error and high validation/test error.


o High model complexity, leading to memorization rather than generalization.

What is Underfitting?

• Underfitting happens when a model is too simple to capture the underlying patterns in
the data.

• It has high bias, meaning it makes strong assumptions about the data.

• Underfitting is indicated by high training error and high test error.

How to Address Overfitting and Underfitting?

• Regularization: Adds constraints or penalties to prevent overfitting.

• Normalization: Scales the data to improve convergence and stability.

• Better Activation Functions & Initialization: Ensure stable learning.


• Data Augmentation: Expands the dataset artificially.

Use Cases:

• Deep Learning: Overfitting is common in deep networks with small datasets.

• Stock Market Prediction: Complex models may fit historical data well but fail to predict future trends.

2. Key Differences: Overfitting vs Underfitting

Feature | Overfitting | Underfitting
Definition | Model learns noise in the training data and fails on test data. | Model is too simple and fails to capture the true pattern.
Training Error | Very low | High
Test Error | High | High
Bias | Low | High
Variance | High | Low
Complexity | Too complex | Too simple
Generalization | Poor (does not work well on new data) | Poor (does not even work well on training data)
Example Model | Deep neural network with too many layers and parameters | Linear regression used for a non-linear problem

Causes of Overfitting & Underfitting

Causes of Overfitting | Causes of Underfitting
Too many parameters (complex models). | Too few parameters (simple models).
Not enough training data. | Too much regularization.
Training too long without stopping. | Not training long enough.
Noise in data influencing the model. | Poor feature selection.

Example in Education (Student Performance Prediction)


Scenario

A university wants to predict students' final exam scores based on their study hours, previous
grades, attendance, and extracurricular activities.

Underfitting Case

• The model only considers study hours as the predictor (Linear Regression: Final
Score = a * Study Hours + b).

• It ignores attendance, previous grades, and extracurricular activities, leading to poor predictions.

• The model cannot learn important trends → Underfitting occurs.

Overfitting Case

• The model includes too many unnecessary factors, such as:

o The brand of pen the student uses

o The color of their notebooks

o The temperature in the exam hall


• These factors do not actually affect the final score, but the model memorizes them
instead of learning real relationships → Overfitting occurs.
Solution: Select relevant features such as study hours, previous grades, and attendance
while ignoring unnecessary details.

Example in Medical Diagnosis (Disease Prediction)

Scenario

A hospital builds an AI model to detect whether a patient has diabetes based on age, weight,
blood sugar levels, and exercise habits.

Underfitting Case

• The model only uses age and weight to predict diabetes, ignoring blood sugar levels
and exercise habits.

• Since diabetes is directly related to blood sugar levels, missing that feature leads to
poor accuracy → Underfitting occurs.

Overfitting Case

• The model also considers unnecessary personal details like:

o The brand of shoes the patient wears


o The time of the day the test was taken

• These do not actually cause diabetes, but the model memorizes them instead of
learning real medical patterns → Overfitting occurs.

Solution: Use only clinically relevant factors such as blood sugar levels, weight, and
exercise habits.

Bias in Deep Learning and Its Types

1. What is Bias in Deep Learning?

Bias in deep learning refers to errors in model predictions due to incorrect assumptions in
the learning algorithm. It represents the inability of a model to capture the true pattern of the
data.

Bias leads to poor generalization and can cause issues such as underfitting (high bias) or
systematic errors due to dataset imbalances or improper training techniques.

1. Selection Bias

Definition:
Selection bias occurs when the training data is not representative of the real-world data,
leading to poor generalization on unseen examples. It happens when certain groups or
patterns are overrepresented or underrepresented in the dataset.
Example

A facial recognition system is trained mostly on light-skinned faces but deployed globally.
The model performs well on light-skinned individuals but fails to recognize darker-skinned
faces accurately.

Use Cases Where Selection Bias Occurs

• Medical AI Models: If a disease detection model is trained only on data from one
hospital, it may not generalize well to patients in other locations.

• Self-Driving Cars: If a self-driving car is trained only on city roads, it may fail to
navigate rural areas.

• Speech Recognition: If a voice assistant is trained on English speakers with American accents, it may struggle to understand non-native English speakers.

Advantages & Disadvantages

Pros | Cons
Can be useful for localized models (e.g., region-specific applications) | Leads to poor model generalization
Can reduce training complexity if data is controlled | Can introduce biases against underrepresented groups

How to Fix Selection Bias?

• Collect diverse and representative training data.

• Use data augmentation to introduce missing variations.

• Apply stratified sampling to balance different categories.
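
As a small illustration (not part of the original notes), stratified sampling is available directly in scikit-learn's train_test_split; the imbalanced toy labels below are purely hypothetical:

```python
# A minimal sketch of stratified sampling: class proportions are kept the same
# in the train and test splits, one practical way to reduce selection bias.
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 90 + [1] * 10)        # imbalanced toy labels: 90% class 0, 10% class 1
X = np.arange(100).reshape(-1, 1)        # dummy feature column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(np.bincount(y_train), np.bincount(y_test))  # class ratios preserved in both splits
```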

2. Confirmation Bias

Definition:
Confirmation bias occurs when model predictions are interpreted or used in a way that
reinforces pre-existing beliefs rather than objectively evaluating new data. This can be a
result of biased training data or human intervention in model interpretation.

Example

A hiring AI model is trained using past recruitment data, where male candidates were
historically favored. The model learns this bias and continues to prefer male candidates for
job positions, reinforcing past hiring patterns.
Use Cases Where Confirmation Bias Occurs
• News Feed Algorithms: Social media platforms recommend content that aligns with
user interests, reinforcing echo chambers.
• Credit Scoring: A loan approval model may unfairly reject certain demographics if
past data shows lower approvals in those groups.

• Judicial AI Systems: Predictive policing tools may disproportionately flag minority groups if trained on biased crime data.

Advantages & Disadvantages

Pros | Cons
Can improve user engagement in recommendation systems | Leads to unfair and biased decisions
Reinforces successful patterns in historical data | Prevents adaptation to new trends

How to Fix Confirmation Bias?

• Regularly audit and update training data to remove historical bias.

• Introduce fairness constraints in AI models.

• Use counterfactual data augmentation (e.g., train models with synthetic data that challenges existing assumptions).

3. Overgeneralization Bias

Definition:
Overgeneralization bias occurs when a model assumes patterns that do not exist in all
cases. The model learns a rule too broadly, leading to incorrect predictions in special cases.

Example

A sentiment analysis model learns that words like "great" and "amazing" always indicate
a positive sentiment. However, in a sentence like "The movie was amazing… amazingly
bad!", the model still classifies it as positive, missing the sarcasm.

Use Cases Where Overgeneralization Bias Occurs

• Image Classification: A model trained to recognize dogs may mistakenly classify a wolf as a dog because it overgeneralizes features like fur and four legs.

• Medical Diagnosis: A model trained to detect cancer may flag any abnormality as
cancer, even if it is benign.

• Spam Detection: If a spam filter learns that emails containing “free” are always
spam, it may block legitimate emails with “free” in them.

Advantages & Disadvantages


Pros | Cons
Can improve efficiency in rule-based classification | Leads to false positives and misclassifications
Reduces model complexity by avoiding too many specific rules | May fail on edge cases and nuanced scenarios

How to Fix Overgeneralization Bias?

• Use fine-grained feature engineering to differentiate subtle patterns.

• Train the model on edge cases and exceptions to the rule.

• Use adversarial examples to challenge model assumptions.

Regularization & Bias-Variance Tradeoff


1. Bias-Variance Tradeoff
Definition

The bias-variance tradeoff is a key concept in machine learning that describes the balance
between bias (error from overly simplistic models) and variance (error from overly
complex models).

• High Bias (Underfitting): The model is too simple and fails to capture underlying
patterns.

• High Variance (Overfitting): The model is too complex and learns noise along with
patterns.

• Goal: Achieve an optimal balance where the model generalizes well to unseen data.

Mathematical Representation
The total error (Expected Loss) is given by:

Total Error = Bias² + Variance + Irreducible Error

• Bias²: Error due to incorrect assumptions in the model.


• Variance: Error due to sensitivity to small fluctuations in training data.

• Irreducible Error: Noise in the data that cannot be removed.


Examples

1. High Bias (Underfitting Example)

o A linear regression model trying to fit a highly non-linear dataset (e.g., a quadratic or exponential function).

o The model performs poorly on both training and test data.

2. High Variance (Overfitting Example)


o A deep neural network trained on a small dataset learns unnecessary noise
and outliers, making it perform well on training data but poorly on test data.
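
To make the tradeoff concrete, the short sketch below (an assumed toy setup, not from the notes) fits polynomials of different degrees to noisy quadratic data: degree 1 shows high bias, degree 15 shows high variance, and degree 2 sits near the sweet spot.

```python
# Minimal sketch: under- vs overfitting as a function of polynomial degree.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 30).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(0, 1.0, 30)          # quadratic signal + noise
X_test = np.linspace(-3, 3, 100).reshape(-1, 1)
y_test = X_test.ravel() ** 2                          # noise-free test targets

for degree in (1, 2, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    train_err = mean_squared_error(y, model.predict(X))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_err:.2f}  test MSE={test_err:.2f}")
# Degree 1 typically underfits (both errors high); degree 15 typically overfits
# (lowest training error but a worse test error than degree 2).
```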

Regularization Methods

Regularization techniques help reduce overfitting by penalizing complex models, making them simpler and more generalizable.

2.1 L1 and L2 Regularization

(a) L1 Regularization (Lasso Regression)

L1 regularization adds the absolute value of coefficients as a penalty to the loss function,
encouraging sparsity. This makes it useful for feature selection.

Mathematical Formula:

Loss = Σ (y_i − ŷ_i)² + λ Σ |w_j|

where λ controls the strength of the penalty and w_j are the model coefficients.

Example:

• Suppose we have a house price prediction model with features: area, bedrooms, swimming pool, neighborhood.

• If "swimming pool" has low correlation with price, L1 regularization may remove it
from the model by setting its coefficient to zero.

Advantages & Disadvantages

Pros | Cons
Feature selection by setting some weights to zero | May remove useful features
Handles high-dimensional sparse data well | Computationally expensive optimization
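
A minimal sketch of L1 regularization using scikit-learn's Lasso, with toy data standing in for the house-price features above (only the first two columns actually drive the target):

```python
# L1 regularization (Lasso) drives the weights of irrelevant features to zero.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))                          # 4 standardized toy features
y = 5.0 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(0, 1.0, 200)  # only features 0 and 1 matter

model = Lasso(alpha=0.5)                               # alpha plays the role of λ
model.fit(X, y)
print(model.coef_)   # coefficients of the two irrelevant features are driven to 0
```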

(b) L2 Regularization (Ridge Regression)

L2 regularization adds squared values of coefficients as a penalty to the loss function. Unlike L1, it shrinks weights towards zero but does not eliminate them.

Mathematical Formula:

Loss = Σ (y_i − ŷ_i)² + λ Σ w_j²

Example:

• In a fraud detection system, many features may contribute to fraud detection. L2 regularization helps reduce overfitting by preventing large weights without removing useful features.

Advantages & Disadvantages

Pros | Cons
Reduces overfitting while keeping all features | Does not perform feature selection
Works well when all features are useful | Requires fine-tuning of λ
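
A similar sketch for L2 regularization with scikit-learn's Ridge, on assumed toy data with many weakly informative features; note that weights shrink but none become exactly zero:

```python
# L2 regularization (Ridge) shrinks all weights but removes none of them.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 20))                 # many weakly informative features
w_true = rng.normal(0, 0.5, 20)
y = X @ w_true + rng.normal(0, 2.0, 100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=100.0).fit(X, y)           # alpha plays the role of λ

print(abs(ols.coef_).mean(), abs(ridge.coef_).mean())  # ridge weights are smaller on average
print((ridge.coef_ == 0).sum())                        # 0: no coefficient is exactly zero
```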

2.2 Early Stopping

Definition

Early stopping monitors validation loss during training and stops before overfitting starts.

Example:

In image classification, if training loss decreases but validation loss starts increasing after 50
epochs, early stopping prevents overfitting.

Mathematical Implementation:

• Compute training loss and validation loss after each epoch.

• If validation loss increases for n consecutive epochs, stop training.
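
A framework-free sketch of this patience-based rule is shown below; train_one_epoch and validation_loss are hypothetical placeholders for your own training and evaluation code.

```python
# Early stopping: stop once validation loss has not improved for `patience` epochs.
def train_with_early_stopping(train_one_epoch, validation_loss,
                              max_epochs=200, patience=5):
    best_val = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()                      # one pass over the training set
        val = validation_loss()                # loss on the held-out validation set
        if val < best_val:
            best_val = val
            epochs_without_improvement = 0     # improvement: reset the counter
        else:
            epochs_without_improvement += 1    # no improvement this epoch
            if epochs_without_improvement >= patience:
                print(f"Stopping early at epoch {epoch}: "
                      f"no improvement for {patience} epochs")
                break
    return best_val
```

Keras users can get the same behaviour from the built-in tf.keras.callbacks.EarlyStopping callback (e.g., monitor="val_loss", patience=5, restore_best_weights=True).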

Advantages & Disadvantages


Pros | Cons
Reduces overfitting without modifying the model | Requires monitoring validation loss
Saves computational resources | May stop too early if validation loss fluctuates

2.3 Dataset Augmentation

Definition
Dataset augmentation artificially increases the training dataset using transformations like:

• Rotation, Flipping, Zooming, Cropping, Color Shifting, Noise Injection

Example:

In medical image analysis, if we only have 1000 X-ray images, augmenting them by
flipping, rotating, and adjusting brightness can create more training samples.
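
A hedged sketch using torchvision transforms to generate augmented variants of a single image; the random dummy image below simply stands in for a real X-ray.

```python
# Dataset augmentation sketch: each call to `augment` yields a new randomly
# transformed version of the same source image.
import numpy as np
from PIL import Image
from torchvision import transforms

dummy_image = Image.fromarray(
    np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)  # stand-in for a real scan
)

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),               # flipping
    transforms.RandomRotation(degrees=10),                # small rotations
    transforms.ColorJitter(brightness=0.2),               # brightness adjustment
    transforms.RandomResizedCrop(224, scale=(0.9, 1.0)),  # zoom / crop
])

augmented_samples = [augment(dummy_image) for _ in range(5)]  # 5 new training samples
```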

Advantages & Disadvantages

Pros | Cons
Improves generalization and reduces overfitting | Increases training time
Useful when data collection is expensive | May distort information if not applied correctly

2.4 Parameter Sharing and Tying


Definition

• Parameter sharing reduces the number of independent parameters by reusing the same weights in different parts of the model.

• Parameter tying enforces constraints so that certain parameters are identical in different layers.
Example:

• CNNs (Convolutional Neural Networks) use parameter sharing, where a filter (kernel) scans the entire image, reducing the number of parameters compared to fully connected networks.
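
The following small sketch (assumed layer sizes) makes the parameter savings concrete: a shared 3×3 convolutional filter bank has a few hundred parameters, while an equivalent fully connected mapping needs millions.

```python
# Parameter sharing: a conv filter's size does not depend on the image size,
# unlike a fully connected layer that needs one weight per input-output pair.
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3)  # 16 shared 3x3 filters
fc = nn.Linear(in_features=28 * 28, out_features=16 * 26 * 26)   # dense equivalent for a 28x28 input

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print(n_params(conv))  # 160 parameters (16 * 3*3*1 weights + 16 biases)
print(n_params(fc))    # 8,490,560 parameters for the equivalent dense mapping
```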
Advantages & Disadvantages
Pros | Cons
Reduces model size and computation | Limits flexibility in feature learning
Improves generalization | Works best in structured data like images

Comparison of Regularization Methods

Method | How It Works | Best Use Case
L1 Regularization | Shrinks weights to zero, removing features | Feature selection in high-dimensional data
L2 Regularization | Shrinks weights but keeps all features | Regression and neural networks
Early Stopping | Stops training when validation loss increases | Deep learning models (CNNs, RNNs)
Dataset Augmentation | Increases dataset size artificially | Image processing, NLP
Parameter Sharing & Tying | Reuses weights to reduce complexity | CNNs, transformers, RNNs

Greedy Layer-wise Pre-training


Definition

Greedy layer-wise pre-training is an unsupervised training approach where a deep network is trained one layer at a time, rather than training all layers simultaneously. This method helps initialize weights properly, reducing the risk of vanishing gradients.

Why is it needed?
• Deep neural networks often suffer from poor weight initialization and vanishing
gradients.

• Greedy pre-training trains each layer sequentially, providing a good starting point
for the network.

How it Works?
1. First layer is trained as an unsupervised autoencoder (or RBM - Restricted
Boltzmann Machine).

2. Once trained, its weights are frozen and the next layer is added on top.
3. This process continues layer by layer until the entire network is initialized.
4. Finally, the full network is fine-tuned using backpropagation.

Example - Pre-training for an Image Classifier

1. Train the first layer (input → hidden) using unsupervised learning (autoencoders/RBM).

2. Use the first layer's output as input to train the second layer.

3. Stack more layers, training each one separately.

4. Once all layers are trained, fine-tune the entire model using supervised learning.
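
A hedged PyTorch sketch of this procedure with simple linear autoencoders; the layer sizes, number of epochs, and random data are illustrative only.

```python
# Greedy layer-wise pre-training: train each layer as an autoencoder on the
# output of the previous layer, freeze it, then stack the next layer on top.
import torch
import torch.nn as nn

def pretrain_layer(encoder, data, epochs=5, lr=1e-3):
    """Train `encoder` as one half of an autoencoder that reconstructs `data`."""
    decoder = nn.Linear(encoder.out_features, encoder.in_features)
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        recon = decoder(torch.relu(encoder(data)))   # encode then reconstruct
        loss = loss_fn(recon, data)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return encoder

torch.manual_seed(0)
X = torch.randn(256, 784)                    # stand-in for flattened images

sizes = [784, 256, 64]
encoders, inputs = [], X
for n_in, n_out in zip(sizes[:-1], sizes[1:]):
    enc = pretrain_layer(nn.Linear(n_in, n_out), inputs)   # 1. train this layer
    for p in enc.parameters():
        p.requires_grad = False                             # 2. freeze its weights
    encoders.append(enc)
    inputs = torch.relu(enc(inputs)).detach()               # 3. feed the next layer

# 4. Stack the pre-trained layers with a classifier head; for supervised
#    fine-tuning, requires_grad would be switched back on for all parameters.
model = nn.Sequential(*[nn.Sequential(e, nn.ReLU()) for e in encoders], nn.Linear(64, 10))
```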
Advantages

1. Helps in training very deep networks.


2. Reduces overfitting by providing better weight initialization.
3. Works well when labeled data is scarce.

Disadvantages
1. Computationally expensive due to layerwise training.
2. Less effective for modern deep architectures (like ResNet, Transformers), which rely on batch normalization and better weight initialization instead.

Better Activation Functions

What are Activation Functions?

Activation functions introduce non-linearity in neural networks, allowing them to learn complex patterns.
Types of Activation Functions

(a) Sigmoid Activation Function

• Output range: (0,1)


• Used in binary classification.

Issues:
1. Causes vanishing gradients → slows down learning in deep networks.
2. Output is not zero-centered, causing slow convergence.
(b) Tanh Activation Function

1. Output range: (-1,1)


2. Zero-centered, making it better than sigmoid.

(c) ReLU (Rectified Linear Unit)

• Output range: [0, ∞)

• Used in CNNs, deep networks.

1. Faster convergence
2. Avoids vanishing gradients
Issue: Dying ReLU problem, where neurons can get stuck with zero outputs.

(d) Leaky ReLU & Parametric ReLU (PReLU)

1. Fixes the Dying ReLU Problem.


2. Used in deep CNNs, GANs.

(e) Softmax Function

• Used in multi-class classification.

• Converts outputs into probabilities.


Activation Function | Range | Pros | Cons
Sigmoid | (0, 1) | Smooth, differentiable | Vanishing gradient
Tanh | (-1, 1) | Zero-centered | Still vanishes
ReLU | [0, ∞) | Fast, no vanishing gradient | Dying neurons
Leaky ReLU | (-∞, ∞) | Fixes dying ReLU | Adds small overhead
Softmax | (0, 1) | Probability distribution | Expensive computation
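
For reference, a compact NumPy sketch of these activation functions (an illustrative implementation, not from the notes):

```python
# Common activation functions implemented with NumPy.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))          # output in (0, 1)

def tanh(x):
    return np.tanh(x)                         # output in (-1, 1), zero-centered

def relu(x):
    return np.maximum(0.0, x)                 # output in [0, inf)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)      # small slope for x < 0 avoids dying neurons

def softmax(x):
    e = np.exp(x - np.max(x))                 # subtract max for numerical stability
    return e / e.sum()                        # outputs form a probability distribution

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), relu(z), leaky_relu(z), softmax(z))
```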

Better Weight Initialization Methods

Poor weight initialization can lead to:


• Vanishing gradients (if weights are too small)

• Exploding gradients (if weights are too large)

(a) Zero Initialization (Bad Method)

w_i = 0

• Causes all neurons to have the same weights → model never learns.

(b) Random Initialization

w_i ∼ U(−1, 1)

• Works better than zero initialization but still causes problems in deep networks.

(c) Xavier (Glorot) Initialization

• Balances variance of activations across layers.

• Used in tanh-based networks.


1. Prevents exploding/vanishing gradients
2. Not ideal for ReLU-based networks.
(d) He Initialization (Best for ReLU)

• Used in ReLU-based networks.

• Works well in deep CNNs.
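
A NumPy sketch of the four schemes for a single weight matrix of shape (n_in, n_out); the layer sizes are illustrative, and in practice deep learning frameworks provide these initializers built in.

```python
# Weight initialization schemes for one layer with n_in inputs and n_out outputs.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 512, 256

w_zero   = np.zeros((n_in, n_out))                    # all neurons identical: never learns
w_random = rng.uniform(-1, 1, (n_in, n_out))          # unscaled: unstable in deep networks
w_xavier = rng.normal(0, np.sqrt(2.0 / (n_in + n_out)), (n_in, n_out))  # Glorot: tanh/sigmoid layers
w_he     = rng.normal(0, np.sqrt(2.0 / n_in), (n_in, n_out))            # He: ReLU layers

print(w_xavier.std(), w_he.std())   # roughly 0.051 and 0.063 for these layer sizes
```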

Batch Normalization (BN)

Definition

Batch Normalization normalizes activations across a mini-batch to stabilize training.


How it Works?

1. Compute mean & variance for each mini-batch.

2. Normalize activations: x̂ = (x − μ) / √(σ² + ε)

3. Apply scaling (γ) and shifting (β) parameters: y = γ · x̂ + β

Advantages

1. Speeds up training
2. Reduces internal covariate shift
3. Reduces dependence on weight initialization

Example

Before Batch Normalization:

• A deep CNN takes 200 epochs to converge.


After adding Batch Normalization:

• The same network converges in 50 epochs!

When to Use?

• Used in CNNs, Transformers, GANs.

• Can be applied after every activation layer.


Batch Normalization (BN) Example:

Let's say we have a small dataset with 5 samples and 1 feature. Before applying batch
normalization, the values vary significantly. Our goal is to normalize them so that the
network trains faster.

Step 1: Given Data (Before Normalization)

Let's assume a mini-batch of 5 samples with feature values:

X = [5, 10, 15, 20, 25]
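
A quick sketch of the remaining steps for this mini-batch: compute the batch mean and variance, normalize, and apply the learnable scale (γ) and shift (β); the γ = 1, β = 0 values are illustrative defaults.

```python
# Batch normalization on the mini-batch X = [5, 10, 15, 20, 25].
import numpy as np

X = np.array([5.0, 10.0, 15.0, 20.0, 25.0])
eps = 1e-5                               # small constant for numerical stability

mu = X.mean()                            # batch mean: 15.0
var = X.var()                            # batch variance: 50.0
X_hat = (X - mu) / np.sqrt(var + eps)    # ~[-1.414, -0.707, 0.0, 0.707, 1.414]

gamma, beta = 1.0, 0.0                   # learnable parameters, initialized to identity
y = gamma * X_hat + beta
print(mu, var, X_hat.round(3))
```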
