Day 2 - Loss & Activation Functions
Feb 19, 2025

Class #2:
📌 Keywords
✒️ Neural Network - A computational model inspired by the structure of the human brain, built from layers of interconnected neurons.
✒️ Vanishing Gradient Problem - The Vanishing Gradient Problem occurs when
gradients (partial derivatives) in deep neural networks become extremely small,
making it difficult for the earlier layers to learn.

✒️ Exploding Gradient Problem - The Exploding Gradient Problem occurs when gradients become excessively large, leading to unstable weight updates and causing the model to fail to converge.

✒️ Overfitting - The model learns too much from the training data, including noise
and irrelevant patterns. It performs well on training data but poorly on new (test)
data.

✒️ Underfitting - The model is too simple to learn from the data and fails to
capture key patterns. It performs poorly on both training and test data.

✒️ Training Parameters - The learnable weights and biases of a model. For scale: GPT-3 has about 175 billion parameters, while GPT-4 is reported (unofficially) to have on the order of 1 trillion.


✒️ Hyperparameter - A hyperparameter is a parameter set before training a
machine learning model, rather than learned from the data. It controls the training
process and affects model performance.

●​ Manually set.
●​ Not learnt from data.
●​ Settings of the model.
✒️ Hyperparameters & Their Importance

● Learning Rate (α) - Controls how much the model updates weights at each step. Typical values: 0.01, 0.001, 0.0001. Too high -> model overshoots; too low -> model learns too slowly.
● Batch Size - Number of training samples processed before updating weights. Typical values: 16, 32, 64, 128. Small batch -> better generalization, more noise; large batch -> faster but may overfit.
● Number of Epochs - Number of times the model sees the entire dataset. Typical values: 10, 50, 100. Too many -> overfitting; too few -> underfitting.
● Number of Hidden Layers - Determines the depth of the neural network. Typical values: 1, 2, 3, 10. More layers -> better feature extraction but more risk of overfitting.
● Number of Neurons per Layer - Defines the complexity of each layer. Typical values: 32, 64, 128, 512. More neurons -> more complexity but higher computation.
● Dropout Rate - Percentage of neurons randomly dropped during training to prevent overfitting. Typical values: 0.1, 0.2, 0.5. Higher values prevent overfitting but may slow learning.
● Optimizer - Algorithm that adjusts weights to minimize loss. Examples: SGD, Adam, RMSprop. Different optimizers converge at different speeds and with different stability.
● Activation Function - Defines how neurons activate and pass values forward. Examples: ReLU, Sigmoid, Tanh. Affects how well a model captures non-linear relationships.
● Weight Initialization - Sets initial weight values before training. Examples: Xavier, He, Random. Poor initialization leads to slow or stuck learning.
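The hyperparameters listed above all map onto specific places in tf.keras code. Below is a minimal sketch of where each one is set; the particular values, layer sizes, and dataset names (x_train, y_train) are illustrative assumptions, not recommendations.

import tensorflow as tf

# Hyperparameters (illustrative values)
learning_rate = 0.001
batch_size = 32
epochs = 50

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu",            # neurons per layer + activation function
                          kernel_initializer="he_normal",   # weight initialization
                          input_shape=(20,)),
    tf.keras.layers.Dropout(0.2),                            # dropout rate
    tf.keras.layers.Dense(64, activation="relu"),            # a second hidden layer (network depth)
    tf.keras.layers.Dense(1, activation="sigmoid"),          # output layer
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),  # optimizer + learning rate
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

# Batch size and number of epochs are supplied when training:
# model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs)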

✒️ Question - How do we choose the number of epochs, the number of hidden layers, the number of neurons per layer, and how do we initialise the weights and biases?
=> All of these are derived from trial & error.
✒️ Loss/Cost/Error Function - A function that measures how far our predicted value is from the true value.

ȳ = σ(z)
σ(z) = 1 / (1 + e^(-z))
z = wx + b
c = (y - ȳ)^2
∂c/∂w = ∂c/∂ȳ · ∂ȳ/∂z · ∂z/∂w = -2(y - ȳ) · σ(z)(1 - σ(z)) · x
∂c/∂b = ∂c/∂ȳ · ∂ȳ/∂z · ∂z/∂b = -2(y - ȳ) · σ(z)(1 - σ(z))
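These chain-rule gradients can be verified with a few lines of plain Python. This is a minimal sketch; the values of x, y, w, and b below are arbitrary assumptions for illustration.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y = 2.0, 1.0      # one training example (arbitrary values)
w, b = 0.5, -0.1     # current weight and bias (arbitrary values)

z = w * x + b
y_hat = sigmoid(z)            # ȳ = σ(z)
c = (y - y_hat) ** 2          # c = (y - ȳ)^2

# Analytical gradients from the chain rule above
dc_dw = -2 * (y - y_hat) * y_hat * (1 - y_hat) * x
dc_db = -2 * (y - y_hat) * y_hat * (1 - y_hat)

# Numerical check with finite differences
eps = 1e-6
dc_dw_num = ((y - sigmoid((w + eps) * x + b)) ** 2 - c) / eps
dc_db_num = ((y - sigmoid(w * x + (b + eps))) ** 2 - c) / eps

print(dc_dw, dc_dw_num)   # the two values should agree closely
print(dc_db, dc_db_num)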

✒️ Loss functions
● Mean Squared Error (MSE) - tf.keras.losses.MeanSquaredError - Regression problems where errors need to be minimized.
● Mean Absolute Error (MAE) - tf.keras.losses.MeanAbsoluteError - Regression problems where robustness to outliers matters.
● Mean Squared Logarithmic Error (MSLE) - tf.keras.losses.MeanSquaredLogarithmicError - Regression tasks where small differences matter more than large ones.
● Huber Loss - tf.keras.losses.Huber - Regression with outliers; combines MSE and MAE behaviour.
● Binary Cross-Entropy - tf.keras.losses.BinaryCrossentropy - Binary classification tasks (e.g. spam detection, medical diagnosis).
● Categorical Cross-Entropy - tf.keras.losses.CategoricalCrossentropy - Multi-class classification when labels are one-hot encoded.
● Sparse Categorical Cross-Entropy - tf.keras.losses.SparseCategoricalCrossentropy - Same as Categorical Cross-Entropy but with integer labels; used when labels are integer-encoded instead of one-hot.
● Kullback-Leibler Divergence (KL Divergence) - tf.keras.losses.KLDivergence - Probability distributions in variational autoencoders and reinforcement learning.
● Cosine Similarity Loss - tf.keras.losses.CosineSimilarity - Tasks that compare the direction of vectors, such as embedding similarity.
● Hinge Loss - tf.keras.losses.Hinge - Used for Support Vector Machines (SVMs) and max-margin classification tasks.
● Squared Hinge Loss - tf.keras.losses.SquaredHinge - Similar to hinge loss but penalizes large margin violations more.
● Poisson Loss - tf.keras.losses.Poisson - Used when modelling count-based data (e.g. predicting the number of events occurring).
● Log Cosh Loss - tf.keras.losses.LogCosh - Regression tasks; similar to Huber loss but smoother.
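As a rough sketch, each of these tf.keras loss classes can either be called directly on (y_true, y_pred) tensors or passed to model.compile(); the tensors below are made-up examples.

import tensorflow as tf

y_true = tf.constant([[1.0], [0.0], [1.0]])
y_pred = tf.constant([[0.9], [0.2], [0.6]])

mse = tf.keras.losses.MeanSquaredError()
bce = tf.keras.losses.BinaryCrossentropy()
print("MSE:", mse(y_true, y_pred).numpy())
print("BCE:", bce(y_true, y_pred).numpy())

# The same objects (or their string names) can be passed to compile(), e.g.:
# model.compile(optimizer="adam", loss=tf.keras.losses.SparseCategoricalCrossentropy())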

📌 Regression
●​ MSE - Mean Squared Error
●​ MAE - Mean Absolute Error
●​ Huber Loss
✒️ MSE - Mean Squared Error (MSE) is a commonly used loss function for
regression models. It measures the average squared difference between actual
and predicted values.

✒️ MAE - Mean Absolute Error (MAE) is a commonly used loss function for
regression models. It measures the average absolute difference between actual
and predicted values.

✒️ Huber Loss - Huber Loss is a robust loss function that effectively handles
outliers by combining Mean Squared Error (MSE) and Mean Absolute Error (MAE).
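A quick numeric sketch of how these three regression losses react to an outlier; the sample values below are made up, with the last target acting as the outlier.

import tensorflow as tf

y_true = tf.constant([3.0, 5.0, 2.0, 100.0])   # the last target is an outlier
y_pred = tf.constant([2.8, 5.1, 2.2, 8.0])

print("MSE  :", tf.keras.losses.MeanSquaredError()(y_true, y_pred).numpy())   # dominated by the squared outlier error
print("MAE  :", tf.keras.losses.MeanAbsoluteError()(y_true, y_pred).numpy())  # grows only linearly with the outlier
print("Huber:", tf.keras.losses.Huber(delta=1.0)(y_true, y_pred).numpy())     # quadratic for small errors, linear for large ones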

📌 Classification
●​ Binary cross-entropy
●​ Categorical cross-entropy
● Sparse categorical cross-entropy

✒️ Binary cross-entropy - Binary Cross-Entropy (BCE) is a loss function commonly used in binary classification problems in machine learning and deep learning. It measures the difference between the true labels and the predicted probabilities.

✒️ Categorical cross-entropy - Categorical Cross-Entropy (CCE) is a commonly used loss function in machine learning and deep learning for multi-class classification problems where each input belongs to one of several categories.

✒️ Sparse categorical cross-entropy - Sparse Categorical Cross-Entropy (SCCE) is a loss function used for multi-class classification when labels are integers. It is an optimized version of Categorical Cross-Entropy (CCE) for sparse (integer) labels.
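A minimal sketch contrasting the label formats the three classification losses expect; the probabilities and labels below are illustrative values.

import tensorflow as tf

# Binary classification: one probability per sample, labels are 0/1
bce = tf.keras.losses.BinaryCrossentropy()
print(bce(tf.constant([[1.0], [0.0]]), tf.constant([[0.9], [0.1]])).numpy())

# Multi-class with one-hot labels -> Categorical Cross-Entropy
cce = tf.keras.losses.CategoricalCrossentropy()
print(cce(tf.constant([[0.0, 1.0, 0.0]]), tf.constant([[0.1, 0.8, 0.1]])).numpy())

# Same problem with integer labels -> Sparse Categorical Cross-Entropy
scce = tf.keras.losses.SparseCategoricalCrossentropy()
print(scce(tf.constant([1]), tf.constant([[0.1, 0.8, 0.1]])).numpy())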

📌 Training Steps
●​ Forward pass
●​ Gradient computation / Backward propagation
●​ Optimization / Update weights & biases

✒️ Forward pass - The forward pass is the process in a neural network where input data flows through the layers, being transformed by weights, biases, and activation functions, to generate an output (prediction).
✒️ Backward propagation - Backpropagation is an optimization algorithm used to
train neural networks by adjusting weights based on the error from the forward
pass. It works by propagating the error backwards through the network and
updating weights using gradient descent.

✒️ Optimization algorithm - An optimization algorithm in AI is used to adjust the model's parameters (weights and biases) to minimize the loss function and improve performance. A very common algorithm is "Adam".
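The three training steps appear together in a custom training step written with tf.keras. This is a rough sketch: the model architecture, batch size, and random data below are assumptions for illustration only.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
loss_fn = tf.keras.losses.BinaryCrossentropy()
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

x = tf.random.normal((8, 4))                              # dummy batch of 8 samples
y = tf.cast(tf.random.uniform((8, 1)) > 0.5, tf.float32)  # dummy binary labels

with tf.GradientTape() as tape:
    y_pred = model(x, training=True)                      # 1. forward pass
    loss = loss_fn(y, y_pred)                             #    compute the loss

grads = tape.gradient(loss, model.trainable_variables)           # 2. backward propagation (gradients)
optimizer.apply_gradients(zip(grads, model.trainable_variables))  # 3. update weights & biases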

✒️ Optimizers
● Gradient Descent (GD) - Best for: large datasets with offline batch updates. Pros: converges to the optimal solution. Cons: slow; requires the entire dataset for each update.
● Stochastic Gradient Descent (SGD) - Best for: large datasets, real-time updates. Pros: fast; updates weights per sample. Cons: noisy updates may not converge.
● Mini-Batch Gradient Descent - Best for: a balance between GD & SGD. Pros: faster than GD, more stable than SGD. Cons: still has some variance.
● Momentum - Best for: training deep networks with oscillations. Pros: faster convergence; smooths updates. Cons: requires tuning β.
● Nesterov Accelerated Gradient (NAG) - Best for: cases of slow convergence. Pros: better than Momentum on convex loss surfaces. Cons: requires an extra gradient computation.
● Adagrad - Best for: sparse data (NLP, embeddings). Pros: adapts the learning rate per parameter. Cons: the learning rate decreases too much over time.
● RMSprop - Best for: RNNs, NLP, and non-stationary loss functions. Pros: reduces the learning-rate issues of Adagrad. Cons: requires tuning β.
● Adam (Adaptive Moment Estimation) - Best for: the default optimizer for deep learning. Pros: fast, adaptive, and works well for most cases. Cons: uses more memory.
● AdamW (Adam with Weight Decay) - Similar to Adam but includes weight decay. Best for: regularized deep networks. Pros: prevents overfitting. Cons: requires tuning the weight-decay factor.
● AdaDelta - Best for: avoiding manual learning-rate tuning. Pros: no need for a fixed learning rate. Cons: computationally expensive.
● Nadam (Nesterov-Adam) - Adam + Nesterov momentum. Best for: cases of slow convergence. Pros: combines the benefits of NAG & Adam. Cons: may not always be better than Adam.
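Most of these optimizers ship with tf.keras.optimizers and can be swapped in at compile time. The sketch below is illustrative: the learning rates and other settings are assumptions, and AdamW is only available as tf.keras.optimizers.AdamW in recent TensorFlow releases.

import tensorflow as tf

# Each optimizer is a drop-in replacement at compile time (settings are illustrative).
sgd      = tf.keras.optimizers.SGD(learning_rate=0.01)                               # plain SGD
momentum = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)                 # Momentum
nag      = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)  # NAG
adagrad  = tf.keras.optimizers.Adagrad(learning_rate=0.01)
rmsprop  = tf.keras.optimizers.RMSprop(learning_rate=0.001)
adam     = tf.keras.optimizers.Adam(learning_rate=0.001)
adamw    = tf.keras.optimizers.AdamW(learning_rate=0.001, weight_decay=1e-4)         # recent TF versions
adadelta = tf.keras.optimizers.Adadelta()
nadam    = tf.keras.optimizers.Nadam(learning_rate=0.001)

# model.compile(optimizer=adam, loss="sparse_categorical_crossentropy", metrics=["accuracy"])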

✒️ Activations - In AI, the activation of a neuron refers to the output value of a neuron after applying an activation function to the weighted sum of its inputs. This determines whether the neuron should "fire" and pass information to the next layer.

✒️ Activation Functions
●​ Sigmoid
●​ ReLU (Rectified Linear Unit)
●​ Leaky ReLU
●​ Tanh
●​ Softmax
✒️ Activation functions
● Sigmoid - σ(x) = 1 / (1 + e^(-x)) - Binary classification, output layer.
● Tanh - tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)) - Hidden layers; better than Sigmoid for centering data.
● ReLU - f(x) = max(0, x) - Most common in hidden layers of deep networks.
● Leaky ReLU - f(x) = x if x > 0, else αx (small α, e.g. 0.01) - Avoids the "dying ReLU" problem; better for negative inputs.
● Softmax - softmax(x_i) = e^(x_i) / Σ_j e^(x_j) - Multi-class classification, output layer.
● ELU - f(x) = x if x > 0, else α(e^x - 1) - Deep networks; faster convergence than ReLU.
● Swish - f(x) = x · σ(x) - Advanced deep networks; often better than ReLU.
● Softplus - f(x) = ln(1 + e^x) - Smooth approximation of ReLU; avoids zero gradients.
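Most of these activations are exposed in tf.keras.activations. A short sketch: the input values below are arbitrary, and Leaky ReLU is shown via tf.nn.leaky_relu so the slope can be set explicitly.

import tensorflow as tf

x = tf.constant([-2.0, -0.5, 0.0, 0.5, 2.0])

print("sigmoid :", tf.keras.activations.sigmoid(x).numpy())
print("tanh    :", tf.keras.activations.tanh(x).numpy())
print("relu    :", tf.keras.activations.relu(x).numpy())
print("leaky   :", tf.nn.leaky_relu(x, alpha=0.01).numpy())
print("elu     :", tf.keras.activations.elu(x).numpy())
print("swish   :", tf.keras.activations.swish(x).numpy())
print("softplus:", tf.keras.activations.softplus(x).numpy())
print("softmax :", tf.keras.activations.softmax(tf.reshape(x, (1, -1))).numpy())  # softmax expects a batch dimension

# In a model they are usually passed by name, e.g. tf.keras.layers.Dense(64, activation="relu")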
