Intro to Neural Network

Neural Network
A neural network is a type of machine learning algorithm modeled after the structure and function
of the human brain. A deep neural network (DNN) is a type of neural network that has multiple
layers, typically more than two or three. The extra layers allow the network to learn and represent
more complex and abstract features of the input data. DNNs are particularly useful for tasks such
as image and speech recognition, natural language processing, and decision making.
The layers of a DNN are made up of interconnected "neurons," which process and transmit
information. Each neuron takes in a set of inputs, performs a computation on them, and produces
an output. The computation is typically a simple mathematical operation, such as a dot product
followed by a non-linear function called an activation function. The outputs from one layer of
neurons are then fed as inputs to the next layer, and so on.
Training a DNN involves showing it a large dataset of labeled examples and adjusting the weights
of the connections between neurons so that the network can correctly classify new examples. This
is typically done using a variant of stochastic gradient descent, an optimization algorithm that
adjusts the weights incrementally based on the error the network makes on the training examples.
DNNs have had great success in recent years, achieving state-of-the-art results on a variety of tasks
such as image and speech recognition, natural language processing and decision making. With the
help of large amounts of data and powerful hardware, DNNs are able to learn rich, abstract
representations of the input data, allowing them to generalize well to new examples.
One of the main challenges in using DNNs is the need for a large amount of labeled training data
and computational resources. The training process can also be time-consuming and requires a high
level of expertise. Additionally, DNNs can be difficult to interpret, making it hard to understand
how they are making decisions.
Despite these challenges, DNNs have proven to be a powerful tool for solving a wide range of
problems and are an active area of research in the field of machine learning.

Forward Propagation
Forward propagation is the process of passing input data through a neural network to generate
output predictions. It is the first step in the process of training a neural network and also used in
the prediction phase.
In forward propagation, the input data is passed through the layers of the network in a sequential
manner, starting from the input layer and ending at the output layer. Each layer performs a
computation on the inputs it receives, and passes the result to the next layer.
The computation performed by each neuron in a layer typically consists of two steps: a dot product of the input vector with the neuron's weight vector (usually plus a bias term), and an activation function applied to the result. The dot product computes a weighted sum of the inputs, which represents the raw output of the neuron. The activation function introduces non-linearity into the network and produces the final output of the neuron.
In this way, the input data is transformed and propagated through the layers of the network, until
the final output predictions are generated. The output predictions are then compared with the true
labels to compute the error, which is used to update the weights of the network during the training
process.
It is worth noting that forward propagation is a deterministic process, meaning that given the same
input and the same weights, it will always produce the same output.
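As an illustration, the sketch below performs forward propagation through a small two-layer network using NumPy. The layer sizes, weights, and input values are made-up assumptions; each layer computes a weighted sum of its inputs plus a bias and then applies an activation function.

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)            # element-wise non-linearity

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Assumed toy dimensions: 3 inputs, 4 hidden neurons, 1 output.
    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
    W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

    def forward(x):
        h = relu(W1 @ x + b1)                # hidden layer: dot product + activation
        y = sigmoid(W2 @ h + b2)             # output layer: final prediction
        return y

    print(forward(np.array([0.5, -1.2, 3.0])))

Running forward twice with the same input and the same weights returns the same value, which is exactly the determinism noted above.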

Backward Propagation
Backward propagation, also known as backpropagation, is the process of updating the weights of
a neural network in order to minimize the error between the network's predictions and the true
labels. It is the second step in the process of training a neural network, following the forward
propagation step.
The basic idea behind backward propagation is to use the error computed during the forward
propagation step to adjust the weights of the network in a way that will reduce the error on the
next forward pass. This is done using a technique called gradient descent, which is an optimization
algorithm that adjusts the weights incrementally based on the error the network makes on the
training examples.
The process of backward propagation involves computing the gradient of the error with respect to
the weights of the network. This gradient tells us how much the error changes as we adjust the
weights.
The calculation of the gradient starts from the output layer, where the error is computed directly.
Then it propagates backwards through the layers, computing the gradient for each layer using the
gradients of the next layer. This process is called backpropagation of errors.
The backpropagation algorithm is an efficient way to calculate the gradient of the error with respect
to the weights, which is done using the chain rule of calculus. This allows for efficient training of
deep neural networks, which have many layers.
It is worth noting that the optimizer used during backward propagation does not have to be plain gradient descent; variants such as Adam or Adagrad are common. They all share the same goal of reducing the error, but take different approaches to achieve it.
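As a minimal sketch of the backward pass, the example below uses a single sigmoid neuron with a mean squared error loss (the input, label, and learning rate are made-up assumptions). It computes the gradient of the error with respect to the weights via the chain rule and applies one plain gradient descent update.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([0.5, -1.2, 3.0])   # one training example
    t = 1.0                          # true label
    w = np.zeros(3)                  # weights
    b = 0.0                          # bias
    lr = 0.1                         # learning rate

    # Forward pass
    z = w @ x + b
    y = sigmoid(z)
    error = 0.5 * (y - t) ** 2

    # Backward pass: chain rule dE/dw = dE/dy * dy/dz * dz/dw
    dE_dy = (y - t)
    dy_dz = y * (1 - y)              # derivative of the sigmoid
    dE_dw = dE_dy * dy_dz * x
    dE_db = dE_dy * dy_dz

    # Gradient descent update: move opposite to the gradient
    w -= lr * dE_dw
    b -= lr * dE_db

In a multi-layer network the same chain-rule step is repeated layer by layer, reusing the gradients already computed for the layer above.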

Gradient Descent
Gradient descent is an optimization algorithm that is used to update the weights of a neural network
during the training process. The goal of the algorithm is to find the set of weights that minimizes
the error between the network's predictions and the true labels.
The basic idea behind gradient descent is to adjust the weights of the network in the direction that
reduces the error. This is done by computing the gradient of the error with respect to the weights
and adjusting the weights in the opposite direction of the gradient. The amount by which the
weights are adjusted is determined by the learning rate, a hyperparameter that controls the step
size.
There are different variants of the gradient descent algorithm, but the most common one is called
stochastic gradient descent (SGD). In SGD, the weights are updated after each training example.
The gradient is computed for the current training example and the weights are updated accordingly.
Another variant of gradient descent is called mini-batch gradient descent, where the weights are
updated after each mini-batch of training examples, rather than after each example. This is more
computationally efficient than SGD because it allows the use of vectorized operations, which are
faster than iterating over individual examples.
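In every variant the update rule is the same: the weights move a small step in the direction opposite to the gradient, scaled by the learning rate. The sketch below shows mini-batch gradient descent on a toy linear regression problem with NumPy; the data, batch size, learning rate, and number of epochs are arbitrary assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))            # 100 examples, 3 features
    true_w = np.array([2.0, -1.0, 0.5])
    y = X @ true_w + rng.normal(scale=0.1, size=100)

    w = np.zeros(3)
    lr, batch_size = 0.1, 10

    for epoch in range(50):
        idx = rng.permutation(len(X))        # shuffle the examples each epoch
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            pred = X[batch] @ w
            grad = X[batch].T @ (pred - y[batch]) / batch_size   # gradient of the squared error
            w -= lr * grad                   # step opposite to the gradient

    print(w)   # should approach true_w

Setting batch_size to 1 turns this into stochastic gradient descent, while setting it to the full dataset size gives batch gradient descent.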

Loss Function
In a neural network, the loss function is used to measure the error between the network's
predictions and the true labels. The goal of the training process is to find the set of weights that
minimizes the loss function. There are several different types of loss functions that can be used in
a neural network, each with its own strengths and weaknesses. Some of the most commonly used
loss functions are:
Mean Squared Error (MSE): This is a common loss function for regression problems. It measures the average squared difference between the predicted output and the true output. It is also called the L2 loss.
Mean Absolute Error (MAE): This is also a common loss function for regression problems. It measures the average absolute difference between the predicted output and the true output. It is also called the L1 loss.
L2 loss is more sensitive to outliers than L1 loss: when there is a large difference between the predicted and actual values, squaring that already large error makes it even larger.
Root Mean Squared Error (RMSE): This is the square root of the MSE, which puts the error back in the same units as the target.
Binary Cross-Entropy (BCE): This is a common loss function for binary classification problems.
It measures the distance between the predicted probability of the positive class and the true binary
label.
Categorical Cross-Entropy (CCE): This is a common loss function for multi-class classification
problems. It measures the distance between the predicted probability distribution over all classes
and the true categorical label.
The choice of loss function depends on the specific problem that the network is being used to
solve. For example, mean squared error is a common choice for regression problems, while cross-
entropy loss is a common choice for classification problems. It is also worth noting that the choice
of loss function can have a significant impact on the performance of the network and the speed of
convergence during training.
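For concreteness, here is a sketch of these losses written directly with NumPy (a small epsilon is added inside the logarithms to avoid taking the log of zero; the function names are just illustrative):

    import numpy as np

    def mse(y_true, y_pred):
        return np.mean((y_true - y_pred) ** 2)          # L2 loss

    def mae(y_true, y_pred):
        return np.mean(np.abs(y_true - y_pred))         # L1 loss

    def rmse(y_true, y_pred):
        return np.sqrt(mse(y_true, y_pred))

    def binary_cross_entropy(y_true, p, eps=1e-12):
        p = np.clip(p, eps, 1 - eps)                    # predicted probability of the positive class
        return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

    def categorical_cross_entropy(y_true_onehot, p, eps=1e-12):
        # p holds a predicted probability distribution over the classes for each example
        return -np.mean(np.sum(y_true_onehot * np.log(p + eps), axis=1))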

Activation Function
An activation function is a non-linear function applied to the output of each neuron in a neural
network. It is used to introduce non-linearity into the network and is essential for the network to
be able to learn and represent complex patterns in the data. There are several different types of
activation functions that can be used in a neural network, each with their own strengths and
weaknesses. Some of the most commonly used activation functions are:
Sigmoid: The sigmoid function maps any real-valued number to a value between 0 and 1. It is
commonly used in the output layer of a binary classification network.
ReLU (Rectified Linear Unit): The ReLU function maps any negative value to 0 and any positive
value to itself. It is computationally efficient and often used as the default activation function in
deep neural networks.
Tanh (hyperbolic tangent): The tanh function maps any real-valued number to a value between -1
and 1. It is similar to the sigmoid function but its output is zero-centered and covers a wider range of values.
The choice of activation function depends on the specific problem that the network is being used
to solve. For example, the sigmoid activation function is commonly used in the output layer of a binary
classification network, while ReLU and its variants are commonly used in the hidden layers of a
deep neural network. It is also worth noting that the choice of activation function can have a
significant impact on the performance of the network and the speed of convergence during training.
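The three activation functions mentioned above are straightforward to write down; a short sketch with NumPy:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))     # output in (0, 1)

    def relu(z):
        return np.maximum(0.0, z)           # 0 for negative inputs, identity otherwise

    def tanh(z):
        return np.tanh(z)                   # output in (-1, 1)

    z = np.array([-2.0, 0.0, 2.0])
    print(sigmoid(z), relu(z), tanh(z))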

Overfitting Problem
Overfitting is a common problem in machine learning, and it occurs when a model is trained too
well on the training data and performs poorly on unseen data. In other words, the model has learned
the noise in the training data and it is not generalizing well.
In the case of neural networks, overfitting typically occurs when the network has too many parameters and is able to fit the training data almost perfectly, but fails to generalize to new data, because it has learned the noise in the training examples rather than the underlying pattern.
There are several ways to detect and prevent overfitting in a neural network, such as:
1) Using a smaller network: By reducing the number of parameters in the network, the model
is less likely to fit the noise in the training data.
2) Using regularization techniques: Techniques like dropout and early stopping help to
prevent overfitting by adding noise to the training process and by regularizing the model's
parameters.
3) Using a larger dataset: By increasing the amount of training data, the model is less likely
to overfit the data.
4) Using cross-validation: By evaluating the model's performance on a held-out validation
dataset, it is possible to detect overfitting and adjust the model accordingly.
5) Monitoring the learning curve: By monitoring the performance of the model on the training
and validation datasets during the training process, it is possible to detect overfitting and
adjust the model accordingly.
6) DropOut: Dropout is a technique for regularizing neural networks by randomly dropping
out (i.e., setting to zero) a certain percentage of the neurons during the training process.
The idea behind dropout is to prevent the neurons from co-adapting too much, which can
lead to overfitting. By randomly dropping out neurons during each forward pass, the
network is forced to learn multiple independent representations of the input data, which
can make it more robust to overfitting. During the training process, dropout is applied to
the neurons of the network by randomly setting a certain percentage of the neurons' output
to zero. The dropout rate is a hyperparameter that controls the percentage of neurons to drop
out. During the prediction phase, dropout is not applied to the network, and all neurons are
used to make predictions. Dropout is a simple yet powerful technique for regularizing
neural networks, and it has been shown to be effective in reducing overfitting and
improving the generalization performance of deep neural networks (a minimal sketch appears after this list).
7) DropConnect: DropConnect is a regularization technique that is similar to dropout. The
main difference is that instead of dropping out entire neurons, it drops out individual
connections between neurons. It helps to prevent overfitting by adding noise to the training
process and regularizing the model's parameters. It is a way of applying dropout to the
weights instead of the activations. By randomly dropping out connections, the network is
forced to learn multiple independent representations of the input data, which can make it
more robust to overfitting.
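As a minimal sketch of the dropout idea from item 6, the code below uses NumPy and "inverted" dropout, where the surviving activations are rescaled during training so that nothing needs to change at prediction time; the dropout rate of 0.5 is an arbitrary choice.

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout(activations, rate=0.5, training=True):
        if not training:
            return activations                        # no dropout at prediction time
        mask = rng.random(activations.shape) >= rate  # keep each neuron with probability 1 - rate
        return activations * mask / (1.0 - rate)      # rescale so the expected value is unchanged

    h = np.array([0.2, 1.5, -0.3, 0.8])
    print(dropout(h, rate=0.5, training=True))   # roughly half the activations are zeroed
    print(dropout(h, training=False))            # unchanged during prediction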

Data Scaling
1) Normalization: Normalization is a technique that scales the values of the input data to a
specific range, typically between 0 and 1. This is done by subtracting the minimum value
of the data and dividing by the range (i.e., the difference between the maximum and
minimum values). This technique is useful when the input data has different scales, and it
is important to bring them to a common scale before training the model.
2) Standardization: Standardization is a technique that scales the values of the input data to
have zero mean and unit variance. This is done by subtracting the mean and dividing by
the standard deviation. This technique is useful when the input data has a Gaussian
(normal) distribution and it is important to center the data around zero before training the
model (a short sketch of both scaling methods appears after this list).
3) Batch Normalization: Batch normalization is a technique used to normalize the activations
of a neural network during the training process. It helps to stabilize the training process
and improve the generalization performance of the network by normalizing the inputs to
each neuron in a layer. It is typically applied to the inputs of each neuron before the
activation function is applied. It is applied in the forward pass, so the normalization is
different for each batch, hence the name "batch normalization". The normalization is done
by computing the mean and standard deviation of the activations for each batch of training
examples, and then using these statistics to normalize the activations.
4) Layer Normalization: Layer normalization is a technique similar to batch normalization, but instead of normalizing each activation across the examples in a batch, it normalizes the activations across all the features of a single example. This means that the mean and standard deviation are computed per example (i.e., across the feature dimensions rather than across the batch), so the normalization does not depend on the batch size.
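A sketch of the first two scaling methods with NumPy (batch and layer normalization are usually provided by the deep learning framework itself, so they are not reimplemented here):

    import numpy as np

    def normalize(X):
        # Min-max normalization: scale each feature (column) to the range [0, 1]
        return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

    def standardize(X):
        # Standardization: zero mean and unit variance per feature (column)
        return (X - X.mean(axis=0)) / X.std(axis=0)

    X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
    print(normalize(X))
    print(standardize(X))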

Learning Rate
The learning rate is a hyperparameter of a neural network that controls the step size of the gradient
descent optimization algorithm. It determines the amount by which the weights of the network are
updated during each iteration of the training process. A small learning rate implies that the
optimization algorithm will take small steps towards the optimal solution, which can lead to slow
convergence but with a smaller chance of overshooting the optimal solution. On the other hand, a
large learning rate implies that the optimization algorithm will take large steps towards the optimal
solution, which can lead to faster convergence but with a larger chance of overshooting the optimal
solution and getting stuck in a suboptimal solution.
The learning rate is a critical hyperparameter that can have a significant impact on the performance
of a neural network. Finding an appropriate learning rate is essential for the network to converge
to an optimal solution and avoid overfitting or underfitting.
There are different ways to set the learning rate, such as:
1) Manual tuning: This is the simplest approach, where the learning rate is set to a fixed value
and the network is trained until convergence.
2) Adaptive learning rate: This approach adjusts the learning rate during training based on
the performance of the network. For example, the learning rate can be decreased when the
performance of the network stops improving.
3) Learning rate schedule: This approach schedules the learning rate to change over time. For
example, the learning rate can be set to a high value at the beginning of training and
gradually decrease over time.
It is worth noting that finding an appropriate learning rate usually requires some trial and error and can be challenging. Techniques such as the learning rate finder and the learning rate range test can help identify a good value.
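As a small illustration of a learning rate schedule (the third approach above), the sketch below uses an exponential decay; the initial rate and the decay factor are arbitrary assumptions.

    # Exponential decay schedule: the learning rate shrinks by a fixed factor each epoch.
    initial_lr = 0.1     # assumed starting value
    decay = 0.95         # assumed decay factor per epoch

    for epoch in range(10):
        lr = initial_lr * (decay ** epoch)
        print(f"epoch {epoch}: learning rate = {lr:.4f}")
        # ... perform one epoch of training with this learning rate ...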

Weight Initialization
Weight initialization is the process of initializing the weights of a neural network before the
training process begins. The goal of weight initialization is to set the initial values of the weights
in a way that will make the training process more efficient and stable.
Random initialization: The weights are initialized randomly using a distribution such as a Gaussian or uniform distribution. This is a simple and widely used method, but the quality of the starting point depends heavily on the chosen distribution and its scale.
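A minimal sketch of random initialization with NumPy (the layer sizes and the scales below are arbitrary assumptions; in practice the scale is often chosen based on the number of inputs to the layer):

    import numpy as np

    rng = np.random.default_rng(0)
    n_inputs, n_neurons = 3, 4

    # Gaussian initialization with a small scale; biases usually start at zero.
    W = rng.normal(loc=0.0, scale=0.01, size=(n_neurons, n_inputs))
    b = np.zeros(n_neurons)

    # Uniform initialization is another common option.
    W_uniform = rng.uniform(low=-0.05, high=0.05, size=(n_neurons, n_inputs))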
Steps for training a Neural Network
1) Initialize the weights.
2) Pass the first observation through the network and perform forward propagation.
3) Compare the actual value with the predicted value and calculate the cost function.
4) Perform backward propagation and update the weights.
5) Repeat steps 2 to 4 for the remaining observations until the error is acceptably small.
When the whole dataset has been passed through the network once, it is called one epoch; training usually requires multiple epochs. A minimal sketch of this loop is shown below.
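Below is a sketch of these steps for a single sigmoid neuron trained with mean squared error; the toy data, learning rate, and number of epochs are made-up assumptions.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))                              # toy dataset: 20 examples, 3 features
    t = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)    # toy binary labels

    w = rng.normal(scale=0.01, size=3)           # step 1: initialize the weights
    b = 0.0
    lr = 0.5

    for epoch in range(100):                     # one pass over the data = one epoch
        for x, target in zip(X, t):
            y = sigmoid(w @ x + b)               # step 2: forward propagation
            error = 0.5 * (y - target) ** 2      # step 3: compute the cost
            grad_z = (y - target) * y * (1 - y)  # step 4: backward propagation
            w -= lr * grad_z * x                 # update the weights
            b -= lr * grad_z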
