TensorFlow supports 11 different optimizer classes, including:
tf.train.Optimizer
tf.train.GradientDescentOptimizer
tf.train.AdadeltaOptimizer
tf.train.AdagradOptimizer
tf.train.AdagradDAOptimizer
tf.train.MomentumOptimizer
tf.train.AdamOptimizer
tf.train.FtrlOptimizer
tf.train.RMSPropOptimizer
tf.train.ProximalAdagradOptimizer
tf.train.ProximalGradientDescentOptimizer
Of these, three are most commonly used:
(1) GradientDescent
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
Uses the (stochastic) gradient descent algorithm: the parameters move in the direction opposite to the gradient, i.e., the direction in which the total loss decreases, and are updated accordingly.
$$W^{[l]} = W^{[l]} - \alpha \, dW^{[l]}$$
$$b^{[l]} = b^{[l]} - \alpha \, db^{[l]}$$
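As a rough end-to-end sketch of how this optimizer is typically wired up (TF 1.x graph/session style; the toy linear-regression data, placeholders, and learning rate below are illustrative assumptions, not from the original):

```python
import numpy as np
import tensorflow as tf

# Toy linear regression: learn W ~ 2.0, b ~ 0.0 from synthetic data
x = tf.placeholder(tf.float32, shape=[None, 1])
y = tf.placeholder(tf.float32, shape=[None, 1])

W = tf.Variable(tf.zeros([1, 1]))
b = tf.Variable(tf.zeros([1]))
y_hat = tf.matmul(x, W) + b

loss = tf.reduce_mean(tf.square(y_hat - y))  # mean squared error
train_op = tf.train.GradientDescentOptimizer(learning_rate=0.5).minimize(loss)

x_data = np.linspace(0.0, 1.0, 100).reshape(-1, 1).astype(np.float32)
y_data = 2.0 * x_data

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(200):
        # Each sess.run applies one update: W := W - alpha*dW, b := b - alpha*db
        sess.run(train_op, feed_dict={x: x_data, y: y_data})
    print(sess.run([W, b]))  # should approach [[2.0]] and [0.0]
```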
(2) Momentum
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum).minimize(loss)
When updating the parameters, a momentum hyperparameter is used to accumulate an exponentially weighted average of past gradients:
$$\begin{cases} v_{dW^{[l]}} = \beta v_{dW^{[l]}} + (1 - \beta)\, dW^{[l]} \\ v_{db^{[l]}} = \beta v_{db^{[l]}} + (1 - \beta)\, db^{[l]} \end{cases}$$
$$\begin{cases} W^{[l]} = W^{[l]} - \alpha\, v_{dW^{[l]}} \\ b^{[l]} = b^{[l]} - \alpha\, v_{db^{[l]}} \end{cases}$$
where:
- $\beta$ is the momentum
- $\alpha$ is the learning rate
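To make the velocity update concrete, here is a minimal NumPy sketch of the two equations above (the function name and arguments are made up for illustration; note that TensorFlow's tf.train.MomentumOptimizer itself accumulates `v = momentum * v + grad`, without the `(1 - beta)` factor, so the effective scale of `alpha` differs slightly from this textbook form):

```python
import numpy as np

def momentum_step(W, b, dW, db, vW, vb, alpha=0.01, beta=0.9):
    """One parameter update following the Momentum equations above."""
    # Exponentially weighted average of past gradients (the "velocity")
    vW = beta * vW + (1 - beta) * dW
    vb = beta * vb + (1 - beta) * db
    # Move the parameters along the accumulated velocity
    W = W - alpha * vW
    b = b - alpha * vb
    return W, b, vW, vb
```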
(3) Adam
optimizer = tf.train.AdamOptimizer(learning_rate=0.001,
beta1=0.9, beta2=0.999,
epsilon=1e-08).minimize(loss)
Adam is an optimization algorithm with adaptive learning rates (here `learning_rate` is passed a fixed value; an exponential-decay schedule is not used). Adam differs from stochastic gradient descent: SGD maintains a single learning rate for all parameter updates, and that rate does not change during training, whereas Adam computes first- and second-moment estimates of the gradients to derive an independent adaptive learning rate for each parameter.
$$\begin{cases} v_{dW^{[l]}} = \beta_1 v_{dW^{[l]}} + (1 - \beta_1) \dfrac{\partial \mathcal{J}}{\partial W^{[l]}} \\ v_{db^{[l]}} = \beta_1 v_{db^{[l]}} + (1 - \beta_1) \dfrac{\partial \mathcal{J}}{\partial b^{[l]}} \end{cases} \quad (\text{momentum: } \beta_1)$$
$$\begin{cases} s_{dW^{[l]}} = \beta_2 s_{dW^{[l]}} + (1 - \beta_2) \left(\dfrac{\partial \mathcal{J}}{\partial W^{[l]}}\right)^2 \\ s_{db^{[l]}} = \beta_2 s_{db^{[l]}} + (1 - \beta_2) \left(\dfrac{\partial \mathcal{J}}{\partial b^{[l]}}\right)^2 \end{cases} \quad (\text{RMSProp: } \beta_2)$$
$$\begin{cases} v^{corrected}_{dW^{[l]}} = \dfrac{v_{dW^{[l]}}}{1 - (\beta_1)^t} \\ v^{corrected}_{db^{[l]}} = \dfrac{v_{db^{[l]}}}{1 - (\beta_1)^t} \\ s^{corrected}_{dW^{[l]}} = \dfrac{s_{dW^{[l]}}}{1 - (\beta_2)^t} \\ s^{corrected}_{db^{[l]}} = \dfrac{s_{db^{[l]}}}{1 - (\beta_2)^t} \end{cases} \quad (\text{bias correction})$$
$$\begin{cases} W^{[l]} = W^{[l]} - \alpha \dfrac{v^{corrected}_{dW^{[l]}}}{\sqrt{s^{corrected}_{dW^{[l]}}} + \varepsilon} \\ b^{[l]} = b^{[l]} - \alpha \dfrac{v^{corrected}_{db^{[l]}}}{\sqrt{s^{corrected}_{db^{[l]}}} + \varepsilon} \end{cases}$$
where:
- $\beta_1$ and $\beta_2$ are hyperparameters that control the two exponentially weighted averages
- $\alpha$ is the learning rate
- $\varepsilon$ is a very small number to avoid division by zero
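To relate these formulas to code, here is a minimal NumPy sketch of one Adam step for a single weight tensor (the function name, argument names, and defaults are illustrative assumptions; this is not TensorFlow's internal implementation):

```python
import numpy as np

def adam_step(W, dW, vW, sW, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter W at time step t (t starts at 1)."""
    # First-moment estimate (momentum term, beta1)
    vW = beta1 * vW + (1 - beta1) * dW
    # Second-moment estimate (RMSProp term, beta2)
    sW = beta2 * sW + (1 - beta2) * (dW ** 2)
    # Bias correction compensates for the zero initialization of vW and sW
    vW_corr = vW / (1 - beta1 ** t)
    sW_corr = sW / (1 - beta2 ** t)
    # Per-parameter adaptive update
    W = W - alpha * vW_corr / (np.sqrt(sW_corr) + eps)
    return W, vW, sW
```

The same update is applied to the biases $b^{[l]}$ with their own $v_{db^{[l]}}$ and $s_{db^{[l]}}$ accumulators.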