1. Dataset preparation:
data (a DataFrame) is the dataset with two columns: ['train_texts', 'train_labels'].
train_texts is the final form of x before it goes into the model, as follows:
train_texts = data['train_texts'], for example:
train_texts = ['这部电影很好看', '这部电影剧情拖沓,没演技', '非常好看的一部电影,很催泪']
(short Chinese movie-review sentences: "this movie is great", "the plot drags and the acting is poor", "a very good, tear-jerking movie")
train_labels is the final form of the labels before they go into the model, as follows:
train_labels = data['train_labels'], for example:
train_labels = ['好评', '差评', '好评']
(好评 = positive review, 差评 = negative review; 中评 = neutral review appears later)
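For concreteness, a minimal sketch of producing these two lists, assuming the data lives in a hypothetical CSV file named data.csv with exactly these two columns:
import pandas as pd

data = pd.read_csv('data.csv')                 # hypothetical file; any DataFrame with these two columns works
train_texts = data['train_texts'].tolist()     # list of raw text strings
train_labels = data['train_labels'].tolist()   # list of label strings such as '好评' / '中评' / '差评'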
2. Loading the Chinese BERT model (bert_base_chinese)
- Download the pretrained model
Download it from https://huggingface.co/models (currently requires a VPN from mainland China). A Baidu Cloud download is also provided: https://pan.baidu.com/s/1iDY4ANbAgOR6OCOr7QAwwA?pwd=8pu6
Extraction code: 8pu6
- Unzip the downloaded archive and place it in your text-classification project directory, for example:
The chinese_L-12_H-768_A-12 folder is the downloaded pretrained model. It contains three files: vocab.txt, config.json, and pytorch_model.bin. vocab.txt is the vocabulary (the token-to-index mapping that turns tokens into numbers), config.json stores the pretrained model's configuration (architecture, hyperparameters, and other settings), and pytorch_model.bin holds the model weights.
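Since the screenshot (Figure 1) is not reproduced here, the intended layout is roughly the following; train.py is just a hypothetical name for the script that holds the code below:
your_project/
    chinese_L-12_H-768_A-12/
        vocab.txt
        config.json
        pytorch_model.bin
    train.py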
- Import the BERT-related packages (torch and the DataLoader utilities used later are imported here as well):
import torch
from torch.utils.data import TensorDataset, DataLoader
from transformers import BertTokenizer, BertForSequenceClassification
- Encode train_texts:
Load the tokenizer:
tokenizer = BertTokenizer.from_pretrained('./chinese_L-12_H-768_A-12')
Note: this is a relative path; as long as you put the chinese_L-12_H-768_A-12 folder in the location shown in Figure 1, it will load successfully.
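As a quick illustration (not part of the original walkthrough), this is roughly what the tokenizer does to one sentence; bert-base-chinese splits Chinese text into single characters, and the exact ids come from vocab.txt:
tokens = tokenizer.tokenize('这部电影很好看')
print(tokens)                                    # ['这', '部', '电', '影', '很', '好', '看']
print(tokenizer.convert_tokens_to_ids(tokens))   # the integer ids of these characters in vocab.txt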
Encode train_texts. max_length=64 is the maximum length of a single text: longer texts are truncated and shorter ones are padded with 0s. Pick a value that suits the length distribution of your texts:
train_input_ids = []
train_attention_masks = []
for text in train_texts:
    encoded = tokenizer.encode_plus(
        text,
        add_special_tokens=True,      # add [CLS] and [SEP]
        max_length=64,
        padding='max_length',         # pad shorter texts up to max_length
        truncation=True,              # truncate longer texts
        return_attention_mask=True,
        return_tensors='pt'
    )
    train_input_ids.append(encoded['input_ids'])
    train_attention_masks.append(encoded['attention_mask'])
train_input_ids = torch.cat(train_input_ids, dim=0)
train_attention_masks = torch.cat(train_attention_masks, dim=0)
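Optionally (not in the original), a quick sanity check that the encoding produced what we expect, assuming max_length=64:
print(train_input_ids.shape)                  # expected: (number of texts, 64)
print(train_attention_masks.shape)            # expected: (number of texts, 64)
print(tokenizer.decode(train_input_ids[0]))   # [CLS] + the first text + [SEP] + [PAD] padding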
Encode train_labels = ['好评', '差评', '中评'] into numbers. BertForSequenceClassification's built-in cross-entropy loss expects integer class indices rather than one-hot vectors such as [[1,0,0],[0,0,1],[0,1,0]], so each label string is mapped to an index (好评 → 0, 中评 → 1, 差评 → 2):
label_map = {'好评': 0, '中评': 1, '差评': 2}
train_labels = torch.tensor([label_map[label] for label in train_labels])
Create batches; batch_size can be 32, 64, 128, 256, etc.:
train_dataset = TensorDataset(train_input_ids, train_attention_masks, train_labels)
train_dataloader = DataLoader(train_dataset, batch_size=32)
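A common refinement (not in the original snippet) is to shuffle the training batches each epoch:
train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)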
Define the model and optimizer:
model = BertForSequenceClassification.from_pretrained('./chinese_L-12_H-768_A-12', num_labels=3)  # num_labels matches the three classes 好评 / 中评 / 差评
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
Train the model:
print("++++++++++++++++++++++++++++ Start training ++++++++++++++++++++++++++++")
model.train()
for epoch in range(5):
    print(f'Epoch {epoch + 1}')
    total_loss = 0
    for step, batch in enumerate(train_dataloader):
        batch_input_ids = batch[0]
        batch_attention_masks = batch[1]
        batch_labels = batch[2]
        optimizer.zero_grad()
        # Passing labels makes the model compute and return the cross-entropy loss
        outputs = model(batch_input_ids, attention_mask=batch_attention_masks, labels=batch_labels)
        loss = outputs.loss
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
        if step % 10 == 0:
            avg_loss = total_loss / (step + 1)  # average loss over the steps seen so far this epoch
            print(f'step {step}, avg_loss: {avg_loss:.4f}')
Save the model:
torch.save(model.state_dict(), './bert_model.pth')
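Not part of the original write-up, but a minimal sketch of loading the saved weights and classifying a new sentence, assuming the same tokenizer and the label_map defined above:
model = BertForSequenceClassification.from_pretrained('./chinese_L-12_H-768_A-12', num_labels=3)
model.load_state_dict(torch.load('./bert_model.pth'))
model.eval()

encoded = tokenizer.encode_plus('这部电影很好看', add_special_tokens=True, max_length=64,
                                padding='max_length', truncation=True,
                                return_attention_mask=True, return_tensors='pt')
with torch.no_grad():
    logits = model(encoded['input_ids'], attention_mask=encoded['attention_mask']).logits
pred_id = logits.argmax(dim=-1).item()
id2label = {i: name for name, i in label_map.items()}  # invert the label_map defined earlier
print(id2label[pred_id])                               # predicted label, e.g. '好评'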