Gradient Descent Algorithm and Back-Propagation Derivation
The Gradient Descent (GD) algorithm can be used to estimate the values of regression parameters, given a dataset with inputs and outputs.
The functional form of a simple linear regression model is given by:
\[
Y_i = b_0 + b_1 X_i + e_i \qquad (6.2)
\]
where $b_0$ is called the bias or intercept, $b_1$ is the feature weight or regression coefficient, and $e_i$ is the error in prediction.
The predicted value of $Y_i$ is written as $\hat{Y}_i$, and it is given by:
\[
\hat{Y}_i = \hat{b}_0 + \hat{b}_1 X_i
\]
The cost function for the linear regression model is the total error (mean squared error) across all $N$ records and is given by:
\[
\text{MSE} = \frac{1}{2N} \sum_{i=1}^{N} \left( Y_i - \hat{b}_0 - \hat{b}_1 X_i \right)^2 \qquad (6.5)
\]
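A minimal sketch of this cost function in Python, assuming NumPy arrays for the inputs and outputs; the function name `mse_cost` is illustrative, not from the text:

```python
import numpy as np

def mse_cost(X, Y, b0, b1):
    """Cost from equation (6.5): squared errors averaged over N records,
    with the conventional 1/2 factor that simplifies the gradients."""
    residuals = Y - b0 - b1 * X          # Y_i - b0_hat - b1_hat * X_i
    return np.sum(residuals ** 2) / (2 * len(Y))
```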
The error is a function of $b_0$ and $b_1$. It is a convex function and has a global minimum, as shown in Figure 6.1. The gradient descent algorithm starts from random initial values of $b_0$ and $b_1$ and moves toward the optimal solution.
Gradient descent finds the optimal values of b0 and b1 that minimize the loss function using
the following steps:
1. Randomly guess the initial values of b0 (bias or intercept) and b1 (feature weight).
2. Calculate the estimated value of the outcome variable, $\hat{Y}_i$, for the initialized values of the bias and weights.
3. Calculate the error (MSE) between the actual and estimated values of the outcome variable.
4. Adjust the $b_0$ and $b_1$ values using the gradients of the error function:
\[
b_0 := b_0 - \alpha \, \frac{\partial \text{MSE}}{\partial b_0} \qquad (6.6)
\]
\[
b_1 := b_1 - \alpha \, \frac{\partial \text{MSE}}{\partial b_1} \qquad (6.7)
\]
where $\alpha$ is the learning rate (a hyperparameter). The value of $\alpha$ determines the magnitude of the update applied to the bias and weights at each iteration.
The partial derivatives of MSE with respect to $b_0$ and $b_1$ are given by:
\[
\frac{\partial \text{MSE}}{\partial b_0} = -\frac{1}{N} \sum_{i=1}^{N} (Y_i - \hat{Y}_i) \qquad (6.8)
\]
\[
\frac{\partial \text{MSE}}{\partial b_1} = -\frac{1}{N} \sum_{i=1}^{N} (Y_i - \hat{Y}_i) \, X_i \qquad (6.9)
\]
5. Repeat steps 2 to 4 for several iterations until the error stops reducing or the change in cost becomes infinitesimally small.
The values of $b_0$ and $b_1$ at the minimum of the cost function are the best estimates of the model parameters.
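A minimal sketch of these steps in Python using NumPy. The function name `gradient_descent`, the default learning rate, and the iteration count are illustrative assumptions; the gradients follow equations (6.8) and (6.9).

```python
import numpy as np

def gradient_descent(X, Y, alpha=0.5, iterations=5000):
    """Estimate b0 and b1 for Y = b0 + b1*X by gradient descent."""
    b0, b1 = np.random.randn(), np.random.randn()   # step 1: random initialization
    for _ in range(iterations):
        Y_hat = b0 + b1 * X                         # step 2: predicted values
        error = Y - Y_hat                           # step 3: prediction errors
        grad_b0 = -np.mean(error)                   # equation (6.8)
        grad_b1 = -np.mean(error * X)               # equation (6.9)
        b0 -= alpha * grad_b0                       # equation (6.6)
        b1 -= alpha * grad_b1                       # equation (6.7)
    return b0, b1

# Illustrative check on synthetic data generated with b0 = 2, b1 = 3
X = np.linspace(0, 1, 100)
Y = 2 + 3 * X + 0.1 * np.random.randn(100)
print(gradient_descent(X, Y))   # estimates close to (2, 3)
```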
The same gradient descent principle underlies the back-propagation training rule for a neural network. For each training example $d$, every weight $w_{ji}$ (the weight on the connection from input $i$ to unit $j$) is updated by adding to it $\Delta w_{ji}$:
\[
\Delta w_{ji} = -\eta \, \frac{\partial E_d}{\partial w_{ji}} \qquad (1)
\]
where $\eta$ is the learning rate and $E_d$ is the error on training example $d$, summed over all output units in the network:
\[
E_d(\vec{w}) = \frac{1}{2} \sum_{k \in \text{outputs}} (t_k - o_k)^2 \qquad (2)
\]
• outputs → Set of output units in the network.
• $t_k$ → Target output of unit $k$ for training example $d$.
• $o_k$ → Actual output of unit $k$ for training example $d$.
By the chain rule, the derivative in (1) can be expressed in terms of $net_j = \sum_i w_{ji} x_i$, the weighted sum of inputs to unit $j$, where $x_i$ is the $i$-th input to unit $j$:
\[
\frac{\partial E_d}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j} \cdot \frac{\partial net_j}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j} \, x_i \qquad (3)
\]
For an output unit $j$, the first factor can again be split by the chain rule:
\[
\frac{\partial E_d}{\partial net_j} = \frac{\partial E_d}{\partial o_j} \cdot \frac{\partial o_j}{\partial net_j} \qquad (4)
\]
Consider the first term in (4). The derivative
\[
\frac{\partial}{\partial o_j} (t_k - o_k)^2
\]
will be zero for all output units $k$ except when $k = j$. Hence, we drop the summation over output units and simply set $k = j$:
\[
\frac{\partial E_d}{\partial o_j} = \frac{\partial}{\partial o_j} \frac{1}{2} (t_j - o_j)^2
= \frac{1}{2} \cdot 2 (t_j - o_j) \cdot \frac{\partial (t_j - o_j)}{\partial o_j}
= -(t_j - o_j) \qquad (5)
\]
Now, consider the second term in (4). Since $o_j = \sigma(net_j)$,
\[
\frac{\partial o_j}{\partial net_j} = \frac{d\,\sigma(net_j)}{d(net_j)} = o_j (1 - o_j) \qquad (6)
\]
Substituting equations (5) and (6) into (4), we obtain:
\[
\frac{\partial E_d}{\partial net_j} = -(t_j - o_j) \, o_j (1 - o_j)
\]
Now, using equation (3), the weight update becomes:
\[
\Delta w_{ji} = -\eta \, \frac{\partial E_d}{\partial w_{ji}} = \eta \, (t_j - o_j) \, o_j (1 - o_j) \, x_i \qquad (7)
\]
This is the final training rule for output-layer weights in backpropagation using gradient
descent with a sigmoid activation.
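A minimal sketch of rule (7) for a single output unit in Python; the array names `w`, `x`, `t` and the single-unit setup are illustrative assumptions, not notation from the text:

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def update_output_weights(w, x, t, eta=0.1):
    """Apply rule (7) to the weight vector w feeding one sigmoid output unit.
    x: inputs to the unit, t: target output, eta: learning rate."""
    o = sigmoid(np.dot(w, x))            # forward pass: o_j = sigma(net_j)
    delta = (t - o) * o * (1 - o)        # (t_j - o_j) * o_j * (1 - o_j)
    return w + eta * delta * x           # Delta w_ji = eta * delta_j * x_i
```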
For the hidden-unit weights, a similar derivation applies. For a hidden unit $j$, the error $E_d$ depends on $net_j$ only through the units in $downstream(j)$, the set of units whose inputs include the output of unit $j$. Writing $\delta_k$ for $-\frac{\partial E_d}{\partial net_k}$ and applying the chain rule:
\[
\frac{\partial E_d}{\partial net_j}
= \sum_{k \in downstream(j)} \frac{\partial E_d}{\partial net_k} \cdot \frac{\partial net_k}{\partial net_j}
= \sum_{k \in downstream(j)} -\delta_k \cdot \frac{\partial net_k}{\partial net_j}
\]
\[
= \sum_{k \in downstream(j)} -\delta_k \cdot \frac{\partial net_k}{\partial o_j} \cdot \frac{\partial o_j}{\partial net_j}
= \sum_{k \in downstream(j)} -\delta_k \cdot w_{kj} \cdot \frac{\partial o_j}{\partial net_j}
= \sum_{k \in downstream(j)} -\delta_k \cdot w_{kj} \cdot o_j (1 - o_j)
\]
Using the notation $\delta_j$ to denote $-\frac{\partial E_d}{\partial net_j}$, we have:
\[
\delta_j = o_j (1 - o_j) \sum_{k \in downstream(j)} \delta_k \, w_{kj}
\]
And the final weight update rule for the hidden units becomes:
\[
\Delta w_{ji} = \eta \, \delta_j \, x_i
\]
This is the standard backpropagation rule for updating weights connected to hidden units
in a neural network.
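A minimal sketch of one full back-propagation step for a network with a single hidden layer, combining rule (7) for the output weights with the hidden-unit rule above. The layer shapes, array names, omission of bias terms, and the vectorized form are illustrative assumptions:

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def backprop_step(x, t, W_hidden, W_out, eta=0.1):
    """One gradient-descent update for a one-hidden-layer sigmoid network.
    W_hidden: (n_hidden, n_in) weights, W_out: (n_out, n_hidden) weights."""
    # Forward pass
    o_hidden = sigmoid(W_hidden @ x)      # hidden outputs o_j
    o_out = sigmoid(W_out @ o_hidden)     # network outputs o_k

    # Output-unit deltas: delta_k = (t_k - o_k) * o_k * (1 - o_k)
    delta_out = (t - o_out) * o_out * (1 - o_out)

    # Hidden-unit deltas: delta_j = o_j (1 - o_j) * sum_k delta_k * w_kj
    delta_hidden = o_hidden * (1 - o_hidden) * (W_out.T @ delta_out)

    # Weight updates: Delta w_ji = eta * delta_j * x_i
    W_out = W_out + eta * np.outer(delta_out, o_hidden)
    W_hidden = W_hidden + eta * np.outer(delta_hidden, x)
    return W_hidden, W_out
```

Note that the hidden-unit deltas reuse the output-unit deltas through the weights `W_out`, which is exactly the summation over $downstream(j)$ in the derivation.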