This document summarizes the gradient descent algorithm for training a sigmoid neuron. It begins by defining the loss function and describing how gradient descent works by taking small steps in the direction that minimizes the loss. It then derives the update rules for the weights and biases by taking the partial derivative of the loss with respect to each parameter. The final formulas show the weight and bias updates subtract the learning rate multiplied by the partial derivative of the loss. This allows the parameters to be adjusted in a way that reduces the loss at each step, gradually minimizing it over many iterations of gradient descent.

Chapter 5 : Gradient Descent

1.3 Gradient Descent, a.k.a. the Sigmoid Neuron Learning Algorithm

1.3.1 Sigmoid Neuron Gradient Descent Algorithm

t ← 0;
epochs ← 1000;
while t < epochs do
    w_{t+1} ← w_t − η·∇w;
    b_{t+1} ← b_t − η·∇b;
    t ← t + 1;
end
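The loop above can be sketched in Python. Here `grad_w` and `grad_b` are hypothetical callables standing in for ∇w and ∇b; their exact form for the sigmoid neuron is derived in the rest of this chapter.

```python
# Sketch of the gradient descent loop above. grad_w and grad_b are
# placeholders for the gradients ∇w and ∇b, derived later in this chapter.
def gradient_descent(w, b, grad_w, grad_b, eta=0.1, epochs=1000):
    t = 0
    while t < epochs:
        gw, gb = grad_w(w, b), grad_b(w, b)   # evaluate both gradients first
        w, b = w - eta * gw, b - eta * gb     # w_{t+1} = w_t − η·∇w, etc.
        t = t + 1
    return w, b
```

Passing the gradients in as functions keeps the loop generic: any differentiable loss can be minimized this way.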

 Why is a value being subtracted from w and b, and not added?

1.3.1.1 Loss function of the sigmoid neuron

L(θ) = (1/2N) · Σ_{i=1}^{N} (y_i − ŷ_i)²

 y_i is the actual output
 ŷ_i is the sigmoid neuron's output

ŷ_i = 1 / (1 + e^(−(wᵀ·x_i + b)))

 θ is a vector containing our trainable parameters, i.e. w and b
 θ = [w, b]ᵀ
 w is a vector of weights
 x_i is the vector of all features at a given data point i
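The neuron's output and the loss can be written directly in NumPy. This is a minimal sketch; the function and variable names are my own, not from the text.

```python
import numpy as np

def sigmoid_output(w, x, b):
    """ŷ_i = 1 / (1 + e^(−(wᵀ·x_i + b))) for one data point x."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

def loss(w, b, X, Y):
    """L(θ) = (1/2N) · Σ (y_i − ŷ_i)² over the N data points in X."""
    y_hat = np.array([sigmoid_output(w, x, b) for x in X])
    return np.mean((Y - y_hat) ** 2) / 2.0

# With zero weights the neuron outputs 0.5 for every point,
# so targets of 0.5 give zero loss.
X = np.array([[1.0, 2.0], [0.5, -1.0]])
Y = np.array([0.5, 0.5])
w, b = np.zeros(2), 0.0
```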

1.3.1.2 Updating the learning parameters, i.e. the weights and bias

We need to adjust our weights and biases so that the loss is minimized.

Let Δθ be the vector containing our small change in weights and bias:

Δθ = [Δw, Δb]ᵀ

To control the size of each update, we scale it by the learning rate η. We can write the update as

θ_new = θ_old + η·Δθ

So we want to find the loss function at θ_new, i.e. L(θ + η·Δθ).

We can use the Taylor series to approximate a function at x + Δx (a small change in x) if we know the value of that function at x:

f(x + Δx) = f(x) + (Δx/1!)·f′(x) + ((Δx)²/2!)·f″(x) + ((Δx)³/3!)·f‴(x) + …
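A quick numeric check of the first-order Taylor approximation, using sin as an arbitrary smooth example function:

```python
import math

# First-order Taylor check: f(x + Δx) ≈ f(x) + Δx·f'(x) for small Δx,
# here with f = sin (so f' = cos).
x, dx = 1.0, 1e-3
exact = math.sin(x + dx)
approx = math.sin(x) + dx * math.cos(x)
error = abs(exact - approx)
# The leftover error is of order (Δx)², which is why the higher-order
# terms can be dropped when the step is small.
```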

We know the value of L at θ, so:

L(θ + η·Δθ) = L(θ) + (η·Δθᵀ/1!)·∇_θL(θ) + (η²/2!)·Δθᵀ·H_θL(θ)·Δθ + …

Since η is a small value, η², η³, … will be much smaller still and the higher-order terms can be neglected:

∴ L(θ + η·Δθ) = L(θ) + η·Δθᵀ·∇_θL(θ) ….(1)

Now, we want the new loss to be less than the old loss, i.e.

L(θ + η·Δθ) < L(θ)
∴ L(θ + η·Δθ) − L(θ) < 0 ….(2)

Substituting eq. (1) into (2):

L(θ + η·Δθ) − L(θ) = η·Δθᵀ·∇_θL(θ) < 0

∴ η·Δθᵀ·∇_θL(θ) < 0

OK, so what are these symbols?

Δθ = [Δw, Δb]ᵀ  and  ∇_θL(θ) = [∂L(w,b)/∂w, ∂L(w,b)/∂b]ᵀ

But a 2×1 vector can't be dot-multiplied with a 2×1 vector, so we transpose Δθ:

Δθᵀ = [Δw  Δb]

So, since we are talking about a dot product, let's look at the angle β between Δθᵀ and ∇_θL(θ):

cos β = Δθᵀ·∇_θL(θ) / (||Δθ|| · ||∇_θL(θ)||)

But we know cos β always lies within the range [−1, 1] (see the graph of cos x).

So,

−1 ≤ cos β = Δθᵀ·∇_θL(θ) / (||Δθ|| · ||∇_θL(θ)||) ≤ 1

We only care about the numerator, so let's remove the denominator by multiplying both sides by it. Let

k = ||Δθ|| · ||∇_θL(θ)||

−k ≤ k·cos β = Δθᵀ·∇_θL(θ) ≤ k ….(3)
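The bound in (3) can be seen numerically: among all steps Δθ of fixed length, the dot product Δθᵀ·∇L is most negative when Δθ points exactly opposite the gradient. The gradient value below is an arbitrary example.

```python
import numpy as np

# An example gradient ∇L with ||∇L|| = 5; unit-length steps Δθ, so
# k = ||Δθ||·||∇L|| = 5.
grad = np.array([3.0, 4.0])
k = np.linalg.norm(grad)

betas = np.radians(np.arange(0, 360))                      # candidate directions
steps = np.stack([np.cos(betas), np.sin(betas)], axis=1)   # unit-length Δθ
dots = steps @ grad                                        # Δθᵀ·∇L per direction
worst = steps[np.argmin(dots)]                             # most negative dot product
```

The minimum of `dots` sits at −k, reached by the direction antiparallel to `grad` (cos β = −1).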

Now, think: from eq. (1),

L(θ + η·Δθ) = L(θ) + η·Δθᵀ·∇_θL(θ)

We want the new loss to be as small as possible. So η·Δθᵀ·∇_θL(θ) should be as small as possible, i.e. as negative as possible.

From eq. (3),

−k ≤ k·cos β = Δθᵀ·∇_θL(θ) ≤ k

Δθᵀ·∇_θL(θ) can only be as negative as −k, and that happens only when cos β takes its most negative value, −1. So β has to be 180°, as cos 180° = −1.

That means Δθᵀ and ∇_θL(θ) have to be 180° from each other for the loss to decrease as much as possible.

But what does this mean? As we already know,

Δθᵀ = [Δw  Δb]

so Δθᵀ is the vector containing the small changes in the weights and biases. And

∇_θL(θ) = [∂L(w,b)/∂w, ∂L(w,b)/∂b]ᵀ

is the vector containing the partial derivatives of the loss function with respect to the weights and biases.

Being 180° apart means the changes in the weights and biases should point opposite to the loss function's derivatives with respect to them:

Δw = −∂L(w,b)/∂w  and  Δb = −∂L(w,b)/∂b

So our weight and bias update equations can now be written as

w_new = w_old + η·Δw
w_new = w_old + η·(−∂L(w,b)/∂w)
w_new = w_old − η·∂L(w,b)/∂w

Similarly,

b_new = b_old − η·∂L(w,b)/∂b

This also answers the earlier question: we subtract because moving opposite to the gradient is what decreases the loss.

1.3.1.3 Final formula

For one data point with one input and one bias only:

Let z = wᵀ·x + b

ŷ = σ(z) = 1 / (1 + e^(−z))

L(w,b) = (1/2)·(y_actual − σ(z))²

∂L(w,b)/∂w = ∂/∂w [ (1/2)·(y_actual − σ(z))² ]
  = (y_actual − σ(z)) · ∂(y_actual − σ(z))/∂w
  = −(y_actual − σ(z)) · ∂σ(z)/∂w
  = −(y_actual − σ(z)) · σ(z)·(1 − σ(z)) · ∂z/∂w

∴ ∂L(w,b)/∂w = −(y_actual − σ(z)) · σ(z)·(1 − σ(z)) · x

And

∂L(w,b)/∂b = ∂/∂b [ (1/2)·(y_actual − σ(z))² ]
  = (y_actual − σ(z)) · ∂(y_actual − σ(z))/∂b
  = −(y_actual − σ(z)) · ∂σ(z)/∂b
  = −(y_actual − σ(z)) · σ(z)·(1 − σ(z)) · ∂z/∂b

∴ ∂L(w,b)/∂b = −(y_actual − σ(z)) · σ(z)·(1 − σ(z))

(since ∂z/∂w = x and ∂z/∂b = 1)
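A quick way to validate the two formulas is a finite-difference check: the analytic gradients should agree with numerical derivatives of the loss. This sketch assumes scalar w and x; the test values of w, b, x, y and the step h are arbitrary choices of mine.

```python
import math

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    # L(w,b) = ½·(y_actual − σ(z))² with z = w·x + b
    return 0.5 * (y - sigma(w * x + b)) ** 2

# Analytic gradients from the derivation above (scalar w, single point).
def dL_dw(w, b, x, y):
    s = sigma(w * x + b)
    return -(y - s) * s * (1 - s) * x

def dL_db(w, b, x, y):
    s = sigma(w * x + b)
    return -(y - s) * s * (1 - s)

# Central finite differences should match the formulas closely.
w, b, x, y, h = 0.7, -0.2, 1.5, 1.0, 1e-6
num_dw = (loss(w + h, b, x, y) - loss(w - h, b, x, y)) / (2 * h)
num_db = (loss(w, b + h, x, y) - loss(w, b - h, x, y)) / (2 * h)
```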

We can now write

w_new = w_old − η·∂L(w,b)/∂w
w_new = w_old − η·(−(y_actual − σ(z)) · σ(z)·(1 − σ(z)) · x)

And

b_new = b_old − η·∂L(w,b)/∂b
b_new = b_old − η·(−(y_actual − σ(z)) · σ(z)·(1 − σ(z)))
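Putting the final formulas into a complete training loop. This is a sketch for the single-data-point case from the derivation; the sample point x = 1.0 with target y = 0.8 and the hyperparameters are arbitrary choices of mine.

```python
import math

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(x, y, eta=1.0, epochs=1000):
    """Gradient descent for one sigmoid neuron on a single data point."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        s = sigma(w * x + b)
        grad_w = -(y - s) * s * (1 - s) * x   # ∂L/∂w from the final formula
        grad_b = -(y - s) * s * (1 - s)       # ∂L/∂b from the final formula
        w -= eta * grad_w                     # w_new = w_old − η·∂L/∂w
        b -= eta * grad_b                     # b_new = b_old − η·∂L/∂b
    return w, b

w, b = train(x=1.0, y=0.8)   # after training, σ(w·x + b) should approach 0.8
```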
