Chapter 5
t ← 0;
epochs ← 1000;
while t < epochs do
    w_{t+1} = w_t − η∇w;
    b_{t+1} = b_t − η∇b;
    t ← t + 1;
end
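A rough Python sketch of this loop (the names grad_w and grad_b are hypothetical stand-ins for ∇w and ∇b, whose exact form is derived later in this chapter):

```python
def gradient_descent(w, b, grad_w, grad_b, eta=0.1, epochs=1000):
    """Repeat the update step: move w and b against their gradients."""
    for t in range(epochs):
        dw = grad_w(w, b)     # gradient of the loss w.r.t. w at step t
        db = grad_b(w, b)     # gradient of the loss w.r.t. b at step t
        w = w - eta * dw      # w_{t+1} = w_t - eta * grad_w
        b = b - eta * db      # b_{t+1} = b_t - eta * grad_b
    return w, b
```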
Why is a value being subtracted from w and b, and not added?
$$L(\theta) = \frac{1}{2N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$
$y_i$ is the actual output
$\hat{y}_i$ is the Sigmoid neuron output:
$$\hat{y}_i = \frac{1}{1 + e^{-(w^T \cdot x + b)}}$$
$\theta = [w \;\; b]$
$w$ is a vector of weights
$x$ is a vector of all features at a given data point $x_i$
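A minimal NumPy sketch of these definitions, assuming (purely for illustration) a feature matrix X with one data point per row and a target vector y:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, b, X):
    # y_hat_i = 1 / (1 + exp(-(w . x_i + b))) for every row x_i of X
    return sigmoid(X @ w + b)

def loss(w, b, X, y):
    # L(theta) = (1 / 2N) * sum_i (y_i - y_hat_i)^2
    y_hat = predict(w, b, X)
    return 0.5 * np.mean((y - y_hat) ** 2)
```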
We need to adjust our weights and biases such that the loss is minimized.
Let $\Delta\theta$ be the vector containing our small change in weights and bias:
$$\Delta\theta = \begin{bmatrix}\Delta w \\ \Delta b\end{bmatrix}$$
To keep each update small, we scale it down by $\eta$, the learning rate. We can write the update as
$$\theta_{new} = \theta_{old} + \eta\,\Delta\theta$$
Expanding $L(\theta + \eta\,\Delta\theta)$ as a Taylor series around $\theta$: since $\eta$ is a small value, the terms containing $\eta^2, \eta^3, \dots$ will be much smaller and can be neglected, leaving
$$L(\theta + \eta\,\Delta\theta) = L(\theta) + \eta\,\Delta\theta^T \cdot \nabla_\theta L(\theta) \quad \dots(1)$$
We want the step to reduce the loss:
$$L(\theta + \eta\,\Delta\theta) < L(\theta)$$
$$\therefore L(\theta + \eta\,\Delta\theta) - L(\theta) < 0 \quad \dots(2)$$
$$\therefore \eta\,\Delta\theta^T \cdot \nabla_\theta L(\theta) < 0$$
Here
$$\Delta\theta = \begin{bmatrix}\Delta w \\ \Delta b\end{bmatrix} \quad\text{and}\quad \nabla_\theta L(\theta) = \begin{bmatrix}\dfrac{\partial L(w,b)}{\partial w} \\[4pt] \dfrac{\partial L(w,b)}{\partial b}\end{bmatrix}$$
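As a quick numerical sanity check of condition (2), take a toy quadratic loss (chosen only for illustration, not the sigmoid loss above): a step whose dot product with the gradient is negative does lower the loss.

```python
import numpy as np

L = lambda theta: np.sum(theta ** 2)      # toy loss, for illustration only
grad = lambda theta: 2 * theta            # its exact gradient

theta = np.array([1.0, -2.0])
eta = 0.01
delta = -grad(theta)                      # a direction with delta . grad < 0

assert eta * (delta @ grad(theta)) < 0    # condition (2) holds ...
assert L(theta + eta * delta) < L(theta)  # ... and the loss indeed decreases
```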
Since $\Delta\theta^T \cdot \nabla_\theta L(\theta)$ is a dot product, let us look at the angle $\beta$ between $\Delta\theta$ and $\nabla_\theta L(\theta)$:
$$\cos\beta = \frac{\Delta\theta^T \cdot \nabla_\theta L(\theta)}{\lVert\Delta\theta\rVert\,\lVert\nabla_\theta L(\theta)\rVert}$$
But we know that $\cos\beta$ always lies within the range $[-1, 1]$.
[Figure: graph of $\cos(x)$]
So
$$-1 \le \cos\beta = \frac{\Delta\theta^T \cdot \nabla_\theta L(\theta)}{\lVert\Delta\theta\rVert\,\lVert\nabla_\theta L(\theta)\rVert} \le 1$$
We only care about the numerator, so let us clear the denominator by multiplying all sides by it. Let
$$k = \lVert\Delta\theta\rVert\,\lVert\nabla_\theta L(\theta)\rVert$$
Then
$$-k \le k\cos\beta = \Delta\theta^T \cdot \nabla_\theta L(\theta) \le k \quad \dots(3)$$
Now, think:
From eqn (1):
$$L(\theta + \eta\,\Delta\theta) = L(\theta) + \eta\,\Delta\theta^T \cdot \nabla_\theta L(\theta)$$
From eqn (3):
$$-k \le k\cos\beta = \Delta\theta^T \cdot \nabla_\theta L(\theta) \le k$$
$\Delta\theta^T \cdot \nabla_\theta L(\theta)$ can only be as negative as $-k$, and that happens only when $\cos\beta$ is at its most negative value, i.e. $-1$, which means $\beta$ has to be $180°$ (as $\cos 180° = -1$).
That means $\Delta\theta$ and $\nabla_\theta L(\theta)$ have to be $180°$ from each other for the loss to decrease as much as possible.
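A small numerical illustration of this conclusion, using a made-up gradient vector: among unit-length directions, none achieves a smaller dot product with the gradient than the one pointing 180° away from it.

```python
import numpy as np

g = np.array([3.0, -4.0])          # a made-up gradient, for illustration
rng = np.random.default_rng(0)

best = -g / np.linalg.norm(g)      # the beta = 180 degrees direction
for _ in range(1000):
    d = rng.normal(size=2)
    d /= np.linalg.norm(d)         # a random unit-length direction
    # d . g = ||g|| cos(beta) can never drop below -||g|| = best . g
    assert d @ g >= best @ g - 1e-9
```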
As we already know,
$$\Delta\theta = \begin{bmatrix}\Delta w \\ \Delta b\end{bmatrix} \quad\text{and}\quad \nabla_\theta L(\theta) = \begin{bmatrix}\dfrac{\partial L(w,b)}{\partial w} \\[4pt] \dfrac{\partial L(w,b)}{\partial b}\end{bmatrix}$$
So $\nabla_\theta L(\theta)$ is the vector containing the partial derivatives of the loss function with respect to the weights and biases.
The changes in the weights and biases should therefore be $180°$ from the derivatives of the loss function with respect to them:
$$\Delta w = -\frac{\partial L(w,b)}{\partial w} \quad\text{and}\quad \Delta b = -\frac{\partial L(w,b)}{\partial b}$$
This is why the gradient is subtracted from $w$ and $b$ rather than added.
Let $z = w^T \cdot x + b$
$$y = \sigma(z) = \frac{1}{1 + e^{-z}}$$
$$L(w,b) = \frac{1}{2}\left(y_{actual} - \sigma(z)\right)^2$$
$$\frac{\partial L(w,b)}{\partial w} = \frac{\partial}{\partial w}\left[\frac{1}{2}\left(y_{actual} - \sigma(z)\right)^2\right]$$
$$= \left(y_{actual} - \sigma(z)\right) \cdot \frac{\partial\left(y_{actual} - \sigma(z)\right)}{\partial w}$$
$$= -\left(y_{actual} - \sigma(z)\right) \cdot \frac{\partial\sigma(z)}{\partial w}$$
$$= -\left(y_{actual} - \sigma(z)\right) \cdot \sigma(z)\left(1 - \sigma(z)\right) \cdot \frac{\partial z}{\partial w}$$
$$\therefore \frac{\partial L(w,b)}{\partial w} = -\left(y_{actual} - \sigma(z)\right) \cdot \sigma(z)\left(1 - \sigma(z)\right) \cdot x$$
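This analytic gradient can be checked against a central finite difference; the data point x, label y_actual, and starting values below are made up purely for the check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y_actual):
    # single-example loss: 0.5 * (y_actual - sigma(z))^2, with z = w.x + b
    return 0.5 * (y_actual - sigmoid(w @ x + b)) ** 2

def grad_w(w, b, x, y_actual):
    # dL/dw = -(y_actual - sigma(z)) * sigma(z) * (1 - sigma(z)) * x
    s = sigmoid(w @ x + b)
    return -(y_actual - s) * s * (1 - s) * x

w, b = np.array([0.5, -0.3]), 0.1
x, y_actual = np.array([1.2, 0.7]), 1.0
eps = 1e-6
for j in range(len(w)):
    e = np.zeros_like(w)
    e[j] = eps
    numeric = (loss(w + e, b, x, y_actual) - loss(w - e, b, x, y_actual)) / (2 * eps)
    assert abs(numeric - grad_w(w, b, x, y_actual)[j]) < 1e-6
```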
Similarly, for the bias:
$$\frac{\partial L(w,b)}{\partial b} = \frac{\partial}{\partial b}\left[\frac{1}{2}\left(y_{actual} - \sigma(z)\right)^2\right]$$
$$= \left(y_{actual} - \sigma(z)\right) \cdot \frac{\partial\left(y_{actual} - \sigma(z)\right)}{\partial b}$$
$$= -\left(y_{actual} - \sigma(z)\right) \cdot \frac{\partial\sigma(z)}{\partial b}$$
$$= -\left(y_{actual} - \sigma(z)\right) \cdot \sigma(z)\left(1 - \sigma(z)\right) \cdot \frac{\partial z}{\partial b}$$
$$\therefore \frac{\partial L(w,b)}{\partial b} = -\left(y_{actual} - \sigma(z)\right) \cdot \sigma(z)\left(1 - \sigma(z)\right)$$
(since $\partial z / \partial b = 1$)
$$w_{new} = w_{old} - \eta \cdot \frac{\partial L(w,b)}{\partial w}$$
$$w_{new} = w_{old} - \eta \cdot \left(-\left(y_{actual} - \sigma(z)\right) \cdot \sigma(z)\left(1 - \sigma(z)\right) \cdot x\right)$$
And
$$b_{new} = b_{old} - \eta \cdot \frac{\partial L(w,b)}{\partial b}$$
$$b_{new} = b_{old} - \eta \cdot \left(-\left(y_{actual} - \sigma(z)\right) \cdot \sigma(z)\left(1 - \sigma(z)\right)\right)$$
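Putting it all together, here is a rough end-to-end sketch that trains a single sigmoid neuron with these update rules. The tiny OR-like dataset, learning rate, and epoch count are made up for illustration, and the per-example gradients are averaged over the data to match the 1/2N loss.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, eta=1.0, epochs=1000):
    """Gradient descent on one sigmoid neuron using the gradients derived above."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        s = sigmoid(X @ w + b)            # sigma(z) for every example
        err = -(y - s) * s * (1 - s)      # common factor in both gradients
        dw = X.T @ err / len(y)           # dL/dw averaged over the examples
        db = err.mean()                   # dL/db averaged over the examples
        w -= eta * dw                     # w_new = w_old - eta * dL/dw
        b -= eta * db                     # b_new = b_old - eta * dL/db
    return w, b

# Made-up OR-like data, purely for illustration.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 1.])
w, b = train(X, y, eta=5.0, epochs=5000)
print(sigmoid(X @ w + b))                 # predictions drift toward y = [0, 1, 1, 1]
```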