NLP (43) Model Tuning Tricks: Warmup and Decay

Warmup and Decay is a common trick for tuning deep learning models. This article gives a brief introduction to Warmup and Decay and shows how to use them in keras_bert.

What are Warmup and Decay?

Warmup and Decay is a strategy for adjusting the learning rate during model training.
Warmup is a learning-rate warm-up method mentioned in the ResNet paper: at the start of training, a smaller learning rate is used for some epochs or steps (for example, 4 epochs or 10,000 steps), after which training switches to the preset learning rate.
Decay, by analogy, is a learning-rate decay method: after training for a certain number of epochs or steps, the learning rate is lowered to a specified value, for example linearly or following a cosine schedule. With Warmup and Decay, the learning rate therefore typically first increases and then decreases.
At the beginning of training, the model's weights are randomly initialized, so choosing a large learning rate right away can make the model unstable (oscillate). Warming up the learning rate keeps it small for the first few epochs or steps, letting the model gradually stabilize under that small rate; once the model is relatively stable, training continues with the preset learning rate, which speeds up convergence and improves the final result. After the model has trained for a while (say 10 epochs), its distribution is largely fixed and there is little new left to learn. Keeping a large learning rate at that point would break this stability; in other words, the model is already close to a local optimum of the loss function, and approaching that optimum calls for small, careful steps.
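To make the shape of this schedule concrete, here is a minimal, self-contained sketch of a piecewise-linear Warmup and Decay schedule. The function lr_at_step and all of its parameters are illustrative assumptions for this article, not keras_bert's actual implementation:

def lr_at_step(step, total_steps, warmup_steps, max_lr=1e-3, min_lr=0.0):
    """Illustrative piecewise-linear Warmup and Decay schedule (not keras_bert's code)."""
    if step < warmup_steps:
        # Warmup phase: grow linearly from (almost) 0 up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    # Decay phase: shrink linearly from max_lr down to min_lr over the remaining steps.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + (max_lr - min_lr) * (1.0 - progress)


# Example: 320 training steps in total, the first 32 of which are warmup.
for step in (0, 16, 31, 32, 176, 319):
    print(step, round(lr_at_step(step, total_steps=320, warmup_steps=32), 6))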

How to use Warmup and Decay in keras_bert?

keras_bert provides the AdamWarmup optimizer class, whose parameters are defined as follows:

class AdamWarmup(keras.optimizers.Optimizer):
    """Adam optimizer with warmup.

    Default parameters follow those provided in the original paper.

    # Arguments
        decay_steps: Learning rate will decay linearly to zero in decay steps.
        warmup_steps: Learning rate will increase linearly to lr in first warmup steps.
        learning_rate: float >= 0. Learning rate.
        beta_1: float, 0 < beta < 1. Generally close to 1.
        beta_2: float, 0 < beta < 1. Generally close to 1.
        epsilon: float >= 0. Fuzz factor. If `None`, defaults to `K.epsilon()`.
        weight_decay: float >= 0. Weight decay.
        weight_decay_pattern: A list of strings. The substring of weight names to be decayed.
                              All weights will be decayed if it is None.
        amsgrad: boolean. Whether to apply the AMSGrad variant of this
            algorithm from the paper "On the Convergence of Adam and
            Beyond".
    """

    def __init__(self, decay_steps, warmup_steps, min_lr=0.0,
                 learning_rate=0.001, beta_1=0.9, beta_2=0.999,
                 epsilon=None, weight_decay=0., weight_decay_pattern=None,
                 amsgrad=False, **kwargs):

In this class we need to specify decay_steps, warmup_steps, learning_rate and min_lr. Their meaning: during the first warmup_steps of training, the learning rate is gradually increased to learning_rate; over decay_steps, it is then decreased linearly to min_lr.
Below is a plot of the learning rate during Warmup and during the decay that follows warmup (e.g., sinusoidal or exponential decay):
[Figure: learning rate Warmup and Decay schematic]
The official keras_bert documentation gives the following example of using Warmup and Decay:

import numpy as np
from keras_bert import AdamWarmup, calc_train_steps

train_x = np.random.standard_normal((1024, 100))

total_steps, warmup_steps = calc_train_steps(
    num_example=train_x.shape[0],
    batch_size=32,
    epochs=10,
    warmup_proportion=0.1,
)

optimizer = AdamWarmup(total_steps, warmup_steps, lr=1e-3, min_lr=1e-5)
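For the numbers in this example, calc_train_steps should give roughly 1024 / 32 = 32 steps per epoch, so about 320 total steps over 10 epochs, and with warmup_proportion=0.1 about 32 warmup steps. The sketch below just reproduces that arithmetic by hand under those assumptions; it is not the library's own code:

import math

num_example, batch_size, epochs, warmup_proportion = 1024, 32, 10, 0.1

# Steps per epoch, rounding up in case the last batch is incomplete.
steps_per_epoch = math.ceil(num_example / batch_size)   # 32
total_steps = steps_per_epoch * epochs                   # 320
warmup_steps = int(total_steps * warmup_proportion)      # 32

print(total_steps, warmup_steps)  # expected: 320 32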

Warmup and Decay in Practice

In the article NLP (34): Implementing Sequence Labeling with keras-bert, I used ReduceLROnPlateau as the learning-rate schedule when training the sequence-labeling model with keras-bert; the code was as follows:

reduce_lr = ReduceLROnPlateau(monitor='val_loss', min_delta=0.0004, patience=2,
                              factor=0.1, min_lr=1e-6, mode='auto', verbose=1)
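For comparison, ReduceLROnPlateau is a Keras callback rather than an optimizer, so it takes effect through the callbacks argument of model.fit. The following toy sketch only shows where the callback plugs in; the tiny Dense model and the random data are placeholders, not the project's actual BERT + BiLSTM + CRF model:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import ReduceLROnPlateau

# Toy model and data, used only to show where the callback is wired in.
model = Sequential([Dense(1, input_dim=10)])
model.compile(optimizer='adam', loss='mse')

reduce_lr = ReduceLROnPlateau(monitor='val_loss', min_delta=0.0004, patience=2,
                              factor=0.1, min_lr=1e-6, mode='auto', verbose=1)

x, y = np.random.rand(100, 10), np.random.rand(100, 1)
# The callback watches val_loss and multiplies the learning rate by `factor`
# whenever it fails to improve by `min_delta` for `patience` epochs.
model.fit(x, y, validation_split=0.2, epochs=5, batch_size=16,
          callbacks=[reduce_lr])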

The evaluation results on three datasets were:

  • People's Daily NER dataset: micro avg F1 = 0.9182
  • Time expression recognition dataset: micro avg F1 = 0.8587
  • CLUENER fine-grained NER dataset: micro avg F1 = 0.7603

We now change the learning-rate schedule to Warmup and Decay, keeping all other model parameters and the datasets unchanged; the code is as follows:

# add warmup
total_steps, warmup_steps = calc_train_steps(
    num_example=len(input_train),
    batch_size=BATCH_SIZE,
    epochs=EPOCH,
    warmup_proportion=0.2,
)
optimizer = AdamWarmup(total_steps, warmup_steps, lr=1e-4, min_lr=1e-7)
model = BertBilstmCRF(max_seq_length=MAX_SEQ_LEN, lstm_dim=64).create_model()
model.compile(
    optimizer=optimizer,
    loss=crf_loss,
    metrics=[crf_accuracy]
)
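Note that warmup_proportion=0.2 makes the warmup phase cover the first 20% of the total training steps: the learning rate climbs linearly to 1e-4 during those steps and then decays linearly toward min_lr=1e-7 for the remainder of training, a somewhat longer warmup than the 0.1 used in the official example above.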

With this trick, the evaluation results on the three datasets are as follows:

  • People's Daily NER dataset

| LR schedule | Prediction 1 | Prediction 2 | Prediction 3 | avg |
| --- | --- | --- | --- | --- |
| Warmup | 0.9276 | 0.9217 | 0.9252 | 0.9248 |

  • Time expression recognition dataset

| LR schedule | Prediction 1 | Prediction 2 | Prediction 3 | avg |
| --- | --- | --- | --- | --- |
| Warmup | 0.8926 | 0.8934 | 0.8820 | 0.8893 |

  • CLUENER fine-grained NER dataset

| LR schedule | Prediction 1 | Prediction 2 | Prediction 3 | avg |
| --- | --- | --- | --- | --- |
| Warmup | 0.7612 | 0.7629 | 0.7607 | 0.7616 |

As we can see, with Warmup and Decay the model improves to varying degrees on each of the datasets.
This project is open source; the code is available at: https://github.com/percent4/keras_bert_sequence_labeling
That is all for this article; thanks for reading!
March 27, 2021, Pudong, Shanghai
