Gen Ai Mynotes
Activation Functions

Definition:
An activation function is a mathematical function in a neural network that decides whether a
neuron should be activated. It adds non-linearity so the network can learn complex patterns.
Why Needed?
• Without it, the network acts like a simple linear model.
• Helps learn non-linear decision boundaries.
• Makes deep learning powerful for images, speech, and text.

Choosing Functions:
• Hidden layers → ReLU / Leaky ReLU / ELU
• Binary classification output → Sigmoid
• Multiclass classification output → Softmax
• RNN hidden layers → tanh or ReLU
Common Problems:
• Vanishing Gradient: Sigmoid, tanh cause very small gradients.
• Dead Neurons: ReLU units can get stuck outputting 0 permanently ("dying ReLU").
• High computation: Softmax, ELU are slower.
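The functions above can be sketched in plain Python (the 0.01 slope for Leaky ReLU is a common default, not from these notes):

```python
import math

def sigmoid(x):
    # Squashes input to (0, 1); used for binary classification outputs.
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    # Zero for negative inputs; a common default for hidden layers.
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Small slope for negative inputs helps avoid "dead" neurons.
    return x if x > 0 else alpha * x

def softmax(scores):
    # Converts raw scores into probabilities that sum to 1.
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```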
Introduction to Deep Learning
Definition:
Deep Learning is a branch of Machine Learning that uses artificial neural networks with many
layers to automatically learn patterns and features from data.

Scope:
• Handles large & complex datasets (images, speech, text).
• Learns features automatically (no manual feature extraction).
• Used in AI systems that require high accuracy and adaptability.

Applications:
• Image recognition (face detection, medical imaging)
• Speech recognition & virtual assistants (Siri, Alexa)
• Natural Language Processing (chatbots, translation)
• Autonomous vehicles
• Recommendation systems (Netflix, YouTube)

Historical Context & Evolution:


• 1950s: Early neural network concepts (Perceptron).
• 1980s: Backpropagation popularized for training networks.
• 2000s: Growth due to big data and powerful GPUs.
• Present: Advanced architectures like CNNs, RNNs, Transformers.

Deep Learning vs Machine Learning:


Machine Learning                          | Deep Learning
Learns from manually extracted features   | Learns features automatically
Works well with small datasets            | Needs large datasets
Algorithms: Decision Trees, SVM           | Algorithms: CNN, RNN, Transformers
Low computational cost                    | High computational cost (needs GPUs/TPUs)
Feature engineering is crucial            | Minimal feature engineering needed
Easier to interpret                       | Often a "black box" (harder to explain decisions)

Popular Open-Source Libraries:


• TensorFlow – Google
• PyTorch – Meta
• Keras – High-level API for TensorFlow
• MXNet – Amazon
• Caffe – Image-focused
Artificial Neurons – Basics with Structure
An Artificial Neuron is the basic unit of an Artificial Neural Network (ANN), inspired by the
working of biological brain neurons. It processes inputs to produce an output.
Structure:
1. Inputs (x₁, x₂, …, xn): Values or features given to the neuron.
2. Weights (w₁, w₂, …, wn): Each input has a weight that decides its importance.
3. Summation Function: Adds all weighted inputs (Σ wᵢxᵢ).
4. Bias (b): A constant value added to help the neuron make better decisions.
5. Activation Function: Decides whether the neuron should activate and controls the
output range (e.g., Sigmoid, ReLU).
6. Output (y): Final value sent to the next neuron or layer.
Working:
• Multiply each input by its weight.
• Add all results and bias.
• Pass the sum through the activation function to get output.
Formula:
y = Activation(Σ(wᵢxᵢ) + b)
Key Points:
• Mimics biological neuron behavior.
• Helps the network learn patterns.
• Multiple neurons together form layers of ANN.
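The formula above can be coded directly; the input values, weights, and bias below are illustrative, and sigmoid is used as the activation:

```python
import math

def neuron(inputs, weights, bias):
    # Weighted sum of inputs plus bias: z = Σ(wᵢxᵢ) + b
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Activation (sigmoid here) squashes z into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative inputs and weights
y = neuron([1.0, 2.0], [0.5, -0.25], bias=0.1)
```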

Multiclass Classification (FFNN)


Definition: Predicts exactly one label from 3+ classes using a neural network.
1. Applications – Used for digit recognition (MNIST), classifying images (dog/cat/bird),
sorting text topics, or diagnosing multiple diseases.
2. Architecture – Input layer for features → hidden layers with ReLU/tanh → output layer with
one neuron per class using Softmax.
3. Challenges – Faces issues like imbalanced classes, overfitting, high computation, and
confusion between similar classes.
4. Techniques to Improve – Apply data augmentation, dropout/L2 regularization, better
optimizers (Adam/RMSprop), early stopping, and balanced data.
5. Training Process – Do a forward pass, calculate categorical cross-entropy loss,
backpropagate errors, update weights, and repeat till epochs finish.
6. Example – Classify pictures as cat, dog, or horse using a Softmax output layer.
7. Output Activation – Softmax converts raw scores to probabilities summing to 1 for all
classes.
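A quick sketch of point 7: Softmax turning raw class scores into probabilities, then picking the most likely class (the scores and the cat/dog/horse ordering are made up):

```python
import math

def softmax(scores):
    m = max(scores)                          # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for cat / dog / horse
probs = softmax([2.0, 1.0, 0.1])

# Predicted class = index of the highest probability
pred = max(range(len(probs)), key=lambda i: probs[i])
```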
Backpropagation
Definition:
Algorithm to train neural networks by finding how much each weight contributed to error and
updating it to reduce loss.
Why Needed:
• Too many weights for random adjustment.
• Tells exactly which weights to change and by how much.
Steps:
1. Forward Pass: Input → layers → prediction.
2. Loss Calculation: Compare prediction with target (Cross-Entropy, MSE).
3. Backward Pass: Calculate gradients via chain rule from output layer backwards.
4. Weight Update: w ← w − η · ∂L/∂w,
using Gradient Descent / Adam / RMSProp.


5. Repeat: Train for many epochs.
Math Insight:
Chain rule: ∂L/∂w = (∂L/∂a) · (∂a/∂z) · (∂z/∂w), where z is the weighted input and a the activation.
Error flows backward layer-by-layer.


Types of Backpropagation
1. Static – For feed-forward nets, fixed connections (e.g., image classification).
2. Recurrent – For RNNs, unfolded through time.
3. Online – Update after each sample (fast but noisy).
4. Batch – Update after full batch (stable but slower).
Advantages:
• Works for deep networks.
• Efficient & scalable.
• Trains millions of parameters.
Challenges:
• Vanishing / exploding gradients.
• Large data need.
• Local minima.
Improvements:
• ReLU/Leaky ReLU, Batch Norm.
• Momentum optimizers (Adam, RMSProp).
• Dropout, L2 regularization.
Example:
MNIST digit recognition — updates weights to fix misclassifications.
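The forward pass, loss, backward pass, and update steps can be sketched for a single sigmoid neuron with squared-error loss (all values below are illustrative):

```python
import math

# Single-input sigmoid neuron, loss L = (y - t)^2 / 2
w, b, lr = 0.5, 0.0, 0.1
x, t = 1.0, 1.0                      # input and target

# 1. Forward pass
z = w * x + b
y = 1.0 / (1.0 + math.exp(-z))
loss_before = 0.5 * (y - t) ** 2

# 2-3. Backward pass via the chain rule: dL/dw = (y - t) * y(1 - y) * x
grad_w = (y - t) * y * (1 - y) * x
grad_b = (y - t) * y * (1 - y)

# 4. Gradient-descent update
w -= lr * grad_w
b -= lr * grad_b

# Loss after one update is lower
z = w * x + b
y = 1.0 / (1.0 + math.exp(-z))
loss_after = 0.5 * (y - t) ** 2
```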
Hyperparameters in a Fully Connected Neural Network (FCNN)
These are settings you choose before training:
1. Learning Rate – Controls how big each weight update is.
2. Number of Layers (Depth) – How many hidden layers the network has.
3. Neurons per Layer (Width) – Units in each hidden layer.
4. Batch Size – Number of samples processed before updating weights.
5. Epochs – How many times the model sees the entire dataset.
6. Activation Functions – e.g., ReLU, Sigmoid, Tanh.
7. Optimizer – e.g., SGD, Adam, RMSProp (affects training speed & quality).
8. Dropout Rate – % of neurons randomly disabled to prevent overfitting.
9. Weight Initialization Method – e.g., Xavier, He initialization.
10. Regularization Parameters – e.g., L1/L2 penalty to reduce overfitting.

Memory Requirements in FCNN


Memory is mainly used for storing:
1. Weights & Biases –
o Total grows with (neurons_in × neurons_out) per layer.
o Deeper/wider networks → more parameters → more memory.
2. Activations –
o Outputs of every neuron in each layer are stored during the forward pass (needed in backprop).
3. Gradients –
o Each weight and bias has a gradient stored for updating during backprop.
4. Batch Size Effect –
o Larger batch size = more activations & gradients stored in memory.
o Smaller batch size reduces memory but may slow training.
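A quick sketch of counting parameters and rough weight memory for a small FCNN; the layer widths and the 4-byte (float32) assumption are illustrative:

```python
# Hypothetical layer widths: 784 inputs -> 128 -> 64 -> 10 outputs
layers = [784, 128, 64, 10]

params = 0
for n_in, n_out in zip(layers, layers[1:]):
    # Each layer stores n_in * n_out weights plus n_out biases
    params += n_in * n_out + n_out

# Rough memory for weights & biases alone, assuming 32-bit (4-byte) floats
weight_bytes = params * 4
```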


Gradient Descent
Definition:
Optimization algorithm to minimize a loss function by iteratively
updating parameters in the opposite direction of the gradient until
convergence.

Steps
1. Initialize weights & biases.
2. Forward pass – compute predictions.
3. Compute loss (MSE, Cross-Entropy).
4. Backward pass – find gradients via backpropagation.
5. Update: θ ← θ − η · ∇L(θ).
6. Repeat until convergence/epochs complete.
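The steps above can be sketched on a made-up 1-D loss L(θ) = (θ − 3)², whose gradient is 2(θ − 3):

```python
theta = 0.0          # initial parameter
lr = 0.1             # learning rate (η)

for _ in range(100):
    grad = 2 * (theta - 3)      # gradient of L(θ) = (θ - 3)^2
    theta -= lr * grad          # move opposite the gradient
# theta converges toward the minimum at θ = 3
```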

Key Components
• Loss Function – guides optimization.
• Learning Rate (η) – step size (too high → overshoot, too low → slow).
• Gradient – steepest ascent direction; move opposite.
• Convergence Criteria – stop if loss change is minimal or epochs end.

Variants (Optimizers)
Momentum, NAG, AdaGrad, RMSProp, Adam.

Advantages
Simple, works on many problems, supports learning rate scheduling.
Limitations
Learning rate sensitive, may get stuck in local minima, expensive for
large data.
Types
1. Batch GD – full dataset per step (stable but slow, high memory).
2. Stochastic Gradient Descent (SGD)
• One sample per update → updates happen very frequently.
• Fast per iteration (only one sample's gradient to compute).
• Escapes local minima/saddle points due to noise in updates.
• High variance in gradient → path is zig-zag.
• Requires careful learning rate tuning to avoid divergence.
• Often used with shuffling to avoid bias from data order.

3. Mini-Batch Gradient Descent
• Uses small batches (e.g., 32–512 samples) per update.
• Faster than Batch GD due to parallel computation on batches.
• Less noisy than SGD → smoother convergence.
• Can leverage GPU acceleration for matrix operations.
• Better generalization than pure batch (due to mild noise).
• Batch size affects performance – too small = noisy, too big = slow.
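The three variants differ only in how many samples feed each update. A sketch of shuffling and batching (the toy data and batch size are made up):

```python
import random

data = list(range(10))          # 10 toy samples
batch_size = 4

random.shuffle(data)            # shuffle to avoid bias from data order
batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

# Batch GD: one update over all 10 samples per epoch.
# SGD: batch_size = 1 -> 10 noisy updates per epoch.
# Mini-batch: here 3 batches (sizes 4, 4, 2) -> 3 updates per epoch.
```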

Challenges with Gradient Descent (simple points)


1. Local Minima / Saddle Points – Can get stuck in bad points in the
loss surface.
2. Slow Convergence – May take many iterations, especially with poor
learning rate.
3. Learning Rate Sensitivity – Too high overshoots, too low is very
slow.
4. Vanishing / Exploding Gradients – Gradients become too small or
too large, hurting learning.
5. Non-convex Loss – Multiple minima make optimization harder.
6. Overfitting Risk – Model may learn noise if not regularized.
Overfitting in Neural Networks
• Definition: Model learns training data too well, including noise, and fails to
generalize to new/unseen data.
• Symptoms:
o High training accuracy but low validation/test accuracy.
o Gap between training and validation loss grows after a point.
• Causes:
o Too many parameters relative to data size.
o Training too long without regularization.
o Lack of sufficient/varied training data.
• Prevention Techniques:
o Regularization (L1, L2).
o Dropout layers.
o Early stopping during training.
o Data augmentation (for images, text, etc.).
o Reducing model complexity.
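Early stopping can be sketched as watching validation loss with a patience counter; the loss sequence and patience value below are made up:

```python
# Hypothetical validation losses per epoch: improves, then overfits
val_losses = [0.9, 0.7, 0.6, 0.55, 0.56, 0.58, 0.61]

patience = 2            # stop after 2 epochs without improvement
best = float("inf")
wait = 0
stopped_at = len(val_losses)

for epoch, loss in enumerate(val_losses):
    if loss < best:
        best, wait = loss, 0     # improvement: reset patience counter
    else:
        wait += 1
        if wait >= patience:
            stopped_at = epoch   # stop training here
            break
```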

Dropout in Neural Networks


• Definition: Regularization technique where randomly selected neurons are
ignored (“dropped”) during training.
• Mechanism:
o At each training step, a fraction p of neurons is temporarily removed along with their connections.
o During inference (testing), no neurons are dropped; instead, activations are scaled by the keep probability (1 − p) to match the training distribution.
• Purpose:
o Prevents co-adaptation of neurons.
o Forces the network to learn redundant representations → improves generalization.
• Advantages:
o Simple and effective against overfitting.
o Works well for large and deep networks.
• Hyperparameter:
o Dropout rate (p) → typical values are 0.2–0.5.
• Limitations:
o Increases training time.
o May hurt performance if the dataset is very small or the dropout rate is too high.
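A sketch of "inverted" dropout, the variant most frameworks use in practice: surviving activations are scaled up by 1/(1 − p) during training, so inference needs no change. The drop rate and activation values are illustrative:

```python
import random

def dropout(activations, p, training):
    # p is the drop probability; keep probability is 1 - p
    if not training or p == 0.0:
        return activations          # inference: pass through unchanged
    keep = 1.0 - p
    # Zero each unit with probability p, scale survivors by 1/keep
    return [a / keep if random.random() < keep else 0.0 for a in activations]

random.seed(0)
out = dropout([1.0, 2.0, 3.0, 4.0], p=0.5, training=True)
# Each output is either 0 or the input scaled by 1/(1 - p) = 2
```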
Delta Rule
• Also called Widrow-Hoff rule or Least Mean Squares (LMS) rule.
• Used in supervised learning for updating weights in a perceptron or simple
neural network.

• Goal: Minimize the mean squared error between predicted and target
output.
• Works well when activation is linear and problem is continuous-valued.
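The Widrow-Hoff update Δwᵢ = η (t − y) xᵢ can be sketched for a linear unit; the inputs, target, and learning rate below are illustrative:

```python
def delta_rule_step(weights, x, target, lr):
    # Linear unit: y = Σ wᵢxᵢ
    y = sum(w * xi for w, xi in zip(weights, x))
    error = target - y
    # Delta rule: Δwᵢ = η (t - y) xᵢ
    return [w + lr * error * xi for w, xi in zip(weights, x)]

# Repeated updates drive the output toward the target
w = [0.0, 0.0]
for _ in range(50):
    w = delta_rule_step(w, [1.0, 1.0], target=2.0, lr=0.1)
```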

Learning Rate (η)


1. Definition – Controls how much the weights change during each update
step.
2. Small η – More stable convergence but slower learning.
3. Large η – Faster learning but risk of overshooting and divergence.
4. Adaptive Strategies – Can change over time (learning rate decay) or adapt
per parameter (Adam, RMSProp).
5. Balance Needed – Too small → very slow progress and possible stalling; too large → oscillations
or instability.
6. Experimentation Required – Often tuned using trial-and-error or grid search.
7. Influence on Accuracy – Right value ensures both fast and accurate
convergence.
8. Relation to Delta Rule – η in delta rule directly affects the magnitude of
weight change.
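Learning rate decay (point 4) can be sketched as a simple exponential schedule; the initial rate and decay factor are made up:

```python
lr0 = 0.1           # initial learning rate
decay = 0.9         # multiplicative decay per epoch

lrs = [lr0 * decay ** epoch for epoch in range(5)]
# The rate shrinks each epoch: large steps early, finer steps near convergence.
```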
Local Minima
• In training, the loss function is like a hilly surface. A local minimum is a
point where the loss is lower than nearby points but not the lowest
possible.
• Getting stuck here means the model may not reach the best performance.
• Some local minima can still work fine, especially if they are wide and flat.
• Can happen more often in small models; large neural networks often have
many good minima.
• Can cause overfitting if the minimum fits training data too tightly.
• We can reduce the risk by using methods like random restarts, momentum,
or Adam optimizer.

Flat Regions (Plateaus)


• These are areas in the loss surface where the value is almost the same in all
directions.
• Here, the gradient is near zero, so learning slows down or pauses.
• Often caused by dead neurons in ReLU layers or saturated activation
functions (like sigmoid/tanh).
• Some flat regions are actually saddle points that confuse gradient descent.
• Can make training take much longer if the learning rate is too small.
• Using momentum, adaptive optimizers (Adam, RMSProp), or batch
normalization helps escape faster.
