Gradient Descent Algorithm and Back-Propagation Derivation

The document explains the Gradient Descent algorithm for estimating regression parameters in linear models, detailing the functional form, error calculation, and cost function. It outlines the iterative process of adjusting parameters to minimize mean squared error, including the derivation of the back-propagation rule for updating weights in neural networks. The document also distinguishes between training rules for output and hidden unit weights using gradient descent.


Gradient Descent Algorithm

The Gradient Descent (GD) algorithm can be used for estimating the values of regression
parameters, given a dataset with inputs and outputs.
The functional form of a simple linear regression model is given by:

Yi = b0 + b1 Xi + ei (6.2)

where b0 is called the bias or intercept, b1 is the feature weight or regression coefficient, and ei
is the error in prediction.
The predicted value of Yi is written as Ŷi , and it is given by:

Ŷi = b̂0 + b̂1 Xi (6.3)

where b̂0 and b̂1 are the estimated values of b0 and b1 .


The error is given by:
ei = Yi − Ŷi = Yi − b̂0 − b̂1 Xi (6.4)

The cost function for the linear regression model is the total error (mean squared error)
across all N records and is given by:

\[
\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( Y_i - \hat b_0 - \hat b_1 X_i \right)^2 \tag{6.5}
\]

The error is a function of b0 and b1. It is a convex function and has a global minimum, as shown in Figure 6.1. The gradient descent algorithm starts from random initial values (for b0 and b1) and moves toward the optimal solution.

Gradient descent finds the optimal values of b0 and b1 that minimize the loss function using
the following steps:

1. Randomly guess the initial values of b0 (bias or intercept) and b1 (feature weight).

2. Calculate the estimated value of the outcome variable Ŷi for the initialized values of bias
and weights.

3. Calculate the mean square error function (MSE).

4. Adjust the b0 and b1 values by calculating the gradients of the error function:

\[
b_0 := b_0 - \alpha \frac{\partial \text{MSE}}{\partial b_0} \tag{6.6}
\]
\[
b_1 := b_1 - \alpha \frac{\partial \text{MSE}}{\partial b_1} \tag{6.7}
\]

where α is the learning rate (a hyperparameter). Its value controls the size of the update applied to the bias and weights at each iteration.
The partial derivatives of MSE with respect to b0 and b1 are given by:

\[
\frac{\partial \text{MSE}}{\partial b_0} = -\frac{2}{N} \sum_{i=1}^{N} \left( Y_i - \hat Y_i \right) \tag{6.8}
\]
\[
\frac{\partial \text{MSE}}{\partial b_1} = -\frac{2}{N} \sum_{i=1}^{N} \left( Y_i - \hat Y_i \right) X_i \tag{6.9}
\]

5. Repeat steps 2 to 4 for several iterations until the error stops decreasing or the
change in cost becomes negligibly small.

The values of b0 and b1 at the minimum cost points are the best estimates of the model
parameters.
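The five steps above can be sketched in a short NumPy implementation. This is a minimal illustration, not code from the document; the function name, zero initialization, learning rate, and iteration count are all illustrative choices.

```python
import numpy as np

def gradient_descent(X, Y, alpha=0.05, n_iters=5000):
    """Estimate b0 (intercept) and b1 (slope) by gradient descent on the MSE."""
    b0, b1 = 0.0, 0.0                            # step 1: initial guesses
    N = len(X)
    for _ in range(n_iters):
        Y_hat = b0 + b1 * X                      # step 2: predictions, eq. (6.3)
        err = Y - Y_hat                          # step 3: errors entering the MSE
        grad_b0 = -(2.0 / N) * err.sum()         # eq. (6.8)
        grad_b1 = -(2.0 / N) * (err * X).sum()   # eq. (6.9)
        b0 -= alpha * grad_b0                    # eq. (6.6)
        b1 -= alpha * grad_b1                    # eq. (6.7)
    return b0, b1
```

On noise-free data generated as Y = 2 + 3X, the estimates converge to the true parameters, since the convex cost has its global minimum there.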

Derivation of Back-propagation Rule


Stochastic gradient descent iterates through the training examples one at a time. For each training example d, we descend the gradient of the error Ed with respect to that single example.
For each training example d, every weight wji is updated by adding to it ∆wji :

\[
\Delta w_{ji} = -\eta \frac{\partial E_d}{\partial w_{ji}} \tag{1}
\]
where Ed is the error on training example d, summed over all output units in the network:

\[
E_d(\vec{w}) = \frac{1}{2} \sum_{k \in \text{outputs}} (t_k - o_k)^2 \tag{2}
\]

• outputs = the set of output units in the network.

• tk = the target value of unit k for training example d.

• ok = the output of unit k for training example d.
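As a quick numeric illustration of equation (2), with made-up target and output values for a single training example d:

```python
import numpy as np

# Illustrative (made-up) targets and outputs for one training example d
t = np.array([1.0, 0.0])              # target values t_k
o = np.array([0.8, 0.3])              # unit outputs o_k
E_d = 0.5 * np.sum((t - o) ** 2)      # eq. (2): 0.5 * (0.2**2 + 0.3**2) = 0.065
```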

Using the chain rule, and noting that \(net_j = \sum_i w_{ji} x_i\) (so that \(\partial net_j / \partial w_{ji} = x_i\)), we can write:

\[
\frac{\partial E_d}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j} \cdot \frac{\partial net_j}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j}\, x_i \tag{3}
\]

Case 1: Training Rule for Output-Unit Weights


For the output units, we have:

\[
\frac{\partial E_d}{\partial net_j} = \frac{\partial E_d}{\partial o_j} \cdot \frac{\partial o_j}{\partial net_j} \tag{4}
\]

Consider just the first term in (4). From equation (2):

\[
\frac{\partial E_d}{\partial o_j} = \frac{\partial}{\partial o_j} \left( \frac{1}{2} \sum_{k \in \text{outputs}} (t_k - o_k)^2 \right)
\]

The derivative \(\frac{\partial}{\partial o_j}(t_k - o_k)^2\) is zero for all output units \(k\) except \(k = j\). Hence, we drop the summation over output units and simply set \(k = j\):

\[
\frac{\partial E_d}{\partial o_j} = \frac{\partial}{\partial o_j} \left( \frac{1}{2} (t_j - o_j)^2 \right)
= \frac{1}{2} \cdot 2\,(t_j - o_j) \cdot \frac{\partial (t_j - o_j)}{\partial o_j}
= -(t_j - o_j) \tag{5}
\]
Now, consider the second term in (4). Since \(o_j = \sigma(net_j)\), the derivative is the derivative of the sigmoid function:

\[
\frac{\partial o_j}{\partial net_j} = \frac{d\,\sigma(net_j)}{d(net_j)} = o_j (1 - o_j) \tag{6}
\]

Substituting equations (5) and (6) into (4), we obtain:

\[
\frac{\partial E_d}{\partial net_j} = -(t_j - o_j)\, o_j (1 - o_j)
\]

Now, using equation (3), the weight update becomes:

\[
\Delta w_{ji} = -\eta \frac{\partial E_d}{\partial w_{ji}} = \eta\, (t_j - o_j)\, o_j (1 - o_j)\, x_i \tag{7}
\]

This is the final training rule for output-layer weights in backpropagation using gradient descent with a sigmoid activation.

Case 2: Training Rule for Hidden Unit Weights


In the case where j is an internal (hidden) unit in the network, the derivation must take into
account the indirect ways in which wji can influence the network outputs (and hence Ed ). We
write:

\[
\frac{\partial E_d}{\partial net_j}
= \sum_{k \in \text{downstream}(j)} \frac{\partial E_d}{\partial net_k} \cdot \frac{\partial net_k}{\partial net_j}
= \sum_{k \in \text{downstream}(j)} -\delta_k \cdot \frac{\partial net_k}{\partial net_j}
\]
\[
= \sum_{k \in \text{downstream}(j)} -\delta_k \cdot \frac{\partial net_k}{\partial o_j} \cdot \frac{\partial o_j}{\partial net_j}
= \sum_{k \in \text{downstream}(j)} -\delta_k \cdot w_{kj} \cdot \frac{\partial o_j}{\partial net_j}
\]
\[
= \sum_{k \in \text{downstream}(j)} -\delta_k \cdot w_{kj} \cdot o_j (1 - o_j)
\]

Using the notation \(\delta_j\) to denote \(-\frac{\partial E_d}{\partial net_j}\), we have:

\[
\delta_j = o_j (1 - o_j) \sum_{k \in \text{downstream}(j)} \delta_k\, w_{kj}
\]

And the final weight update rule for the hidden units becomes:

\[
\Delta w_{ji} = \eta\, \delta_j\, x_i
\]

This is the standard backpropagation rule for updating weights connected to hidden units
in a neural network.
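Putting Case 1 and Case 2 together, one stochastic-gradient update for a fully connected two-layer sigmoid network might look as follows. This is a minimal sketch under illustrative assumptions (dense layers, no bias terms, names and shapes chosen for the example), not the document's own code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(W_hid, W_out, x, t, eta=0.5):
    """One stochastic-gradient update for a two-layer sigmoid network.

    W_hid has shape (n_hidden, n_inputs); W_out has shape (n_outputs, n_hidden).
    """
    # forward pass
    o_hid = sigmoid(W_hid @ x)
    o_out = sigmoid(W_out @ o_hid)
    # Case 1: output deltas, delta_k = (t_k - o_k) o_k (1 - o_k)
    delta_out = (t - o_out) * o_out * (1.0 - o_out)
    # Case 2: hidden deltas, delta_j = o_j (1 - o_j) * sum_k delta_k w_kj
    delta_hid = o_hid * (1.0 - o_hid) * (W_out.T @ delta_out)
    # weight updates, Delta w_ji = eta * delta_j * x_i (inputs to the
    # output layer are the hidden outputs o_hid)
    W_out = W_out + eta * np.outer(delta_out, o_hid)
    W_hid = W_hid + eta * np.outer(delta_hid, x)
    return W_hid, W_out
```

Each call performs one descent step on Ed for a single example (x, t), so repeated calls over the training set implement the stochastic gradient descent described above.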
