1. Dataset preparation:
data (a DataFrame) is the dataset with two columns: ['train_texts', 'train_labels'].
train_texts is the final form of x before it goes into the model, as follows:
train_texts = data['train_texts'], for example:
train_texts = ['这部电影很好看', '这部电影剧情拖沓,没演技', '非常好看的一部电影,很催泪']
(short Chinese movie-review sentences: "this movie is great", "the plot drags and the acting is poor", "a very good, tear-jerking movie")
train_labels is the final form of the labels before they go into the model, as follows:
train_labels = data['train_labels'], for example:
train_labels = ['好评', '差评', '好评']
(好评 = positive review, 差评 = negative review; 中评 = neutral review appears later)
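For concreteness, a minimal sketch of producing these two lists, assuming the data lives in a hypothetical CSV file named data.csv with exactly these two columns:
import pandas as pd

data = pd.read_csv('data.csv')                 # hypothetical file; any DataFrame with these two columns works
train_texts = data['train_texts'].tolist()     # list of raw text strings
train_labels = data['train_labels'].tolist()   # list of label strings such as '好评' / '中评' / '差评'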
2. Loading the Chinese BERT model (bert_base_chinese)
- Download the pretrained model
Download it from https://huggingface.co/models (currently requires a VPN from mainland China). A Baidu Cloud download is also provided: https://pan.baidu.com/s/1iDY4ANbAgOR6OCOr7QAwwA?pwd=8pu6
Extraction code: 8pu6
- Unzip the downloaded archive and place it in your text-classification project directory, for example:
The chinese_L-12_H-768_A-12 folder is the downloaded pretrained model. It contains three files: vocab.txt, config.json, and pytorch_model.bin. vocab.txt is the vocabulary (the token-to-index mapping that turns tokens into numbers), config.json stores the pretrained model's configuration (architecture, hyperparameters, and other settings), and pytorch_model.bin holds the model weights.
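Since the screenshot (Figure 1) is not reproduced here, the intended layout is roughly the following; train.py is just a hypothetical name for the script that holds the code below:
your_project/
    chinese_L-12_H-768_A-12/
        vocab.txt
        config.json
        pytorch_model.bin
    train.py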
- Import the BERT-related packages (torch and the DataLoader utilities used later are imported here as well):
import torch
from torch.utils.data import TensorDataset, DataLoader
from transformers import BertTokenizer, BertForSequenceClassification
- Encode train_texts:
Load the tokenizer:
tokenizer = BertTokenizer.from_pretrained('./chinese_L-12_H-768_A-12')
Note: this is a relative path; as long as you put the chinese_L-12_H-768_A-12 folder in the location shown in Figure 1, it will load successfully.
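As a quick illustration (not part of the original walkthrough), this is roughly what the tokenizer does to one sentence; bert-base-chinese splits Chinese text into single characters, and the exact ids come from vocab.txt:
tokens = tokenizer.tokenize('这部电影很好看')
print(tokens)                                    # ['这', '部', '电', '影', '很', '好', '看']
print(tokenizer.convert_tokens_to_ids(tokens))   # the integer ids of these characters in vocab.txt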
Encode train_texts. max_length=64 is the maximum length of a single text: longer texts are truncated and shorter ones are padded with 0s. Pick a value that suits the length distribution of your texts:
train_input_ids = []
train_attention_masks = []
for text in train_texts:
    encoded = tokenizer.encode_plus(
        text,
        add_special_tokens=True,      # add [CLS] and [SEP]
        max_length=64,
        padding='max_length',         # pad shorter texts up to max_length
        truncation=True,              # truncate longer texts
        return_attention_mask=True,
        return_tensors='pt'
    )
    train_input_ids.append(encoded['input_ids'])
    train_attention_masks.append(encoded['attention_mask'])
train_input_ids = torch.cat(train_input_ids, dim=0)
train_attention_masks = torch.cat(train_attention_masks, dim=0)
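Optionally (not in the original), a quick sanity check that the encoding produced what we expect, assuming max_length=64:
print(train_input_ids.shape)                  # expected: (number of texts, 64)
print(train_attention_masks.shape)            # expected: (number of texts, 64)
print(tokenizer.decode(train_input_ids[0]))   # [CLS] + the first text + [SEP] + [PAD] padding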
Encode train_labels = ['好评', '差评', '中评'] into numbers. BertForSequenceClassification's built-in cross-entropy loss expects integer class indices rather than one-hot vectors such as [[1,0,0],[0,0,1],[0,1,0]], so each label string is mapped to an index (好评 → 0, 中评 → 1, 差评 → 2):
label_map = {'好评': 0, '中评': 1, '差评': 2}
train_labels = torch.tensor([label_map[label] for label in train_labels])
Create batches; batch_size can be 32, 64, 128, 256, etc.:
train_dataset = TensorDataset(train_input_ids, train_attention_masks, train_labels)
train_dataloader = DataLoader(train_dataset, batch_size=32)
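A common refinement (not in the original snippet) is to shuffle the training batches each epoch:
train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)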
Define the model and optimizer:
model = BertForSequenceClassification.from_pretrained('./chinese_L-12_H-768_A-12', num_labels=3)  # num_labels matches the three classes 好评 / 中评 / 差评
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
Train the model:
print("++++++++++++++++++++++++++++ Start training ++++++++++++++++++++++++++++")
model.train()
for epoch in range(5):
    print(f'Epoch {epoch + 1}')
    total_loss = 0
    for step, batch in enumerate(train_dataloader):
        batch_input_ids = batch[0]
        batch_attention_masks = batch[1]
        batch_labels = batch[2]
        optimizer.zero_grad()
        # Passing labels makes the model compute and return the cross-entropy loss
        outputs = model(batch_input_ids, attention_mask=batch_attention_masks, labels=batch_labels)
        loss = outputs.loss
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
        if step % 10 == 0:
            avg_loss = total_loss / (step + 1)  # average loss over the steps seen so far this epoch
            print(f'step {step}, avg_loss: {avg_loss:.4f}')
Save the model:
torch.save(model.state_dict(), './bert_model.pth')
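Not part of the original write-up, but a minimal sketch of loading the saved weights and classifying a new sentence, assuming the same tokenizer and the label_map defined above:
model = BertForSequenceClassification.from_pretrained('./chinese_L-12_H-768_A-12', num_labels=3)
model.load_state_dict(torch.load('./bert_model.pth'))
model.eval()

encoded = tokenizer.encode_plus('这部电影很好看', add_special_tokens=True, max_length=64,
                                padding='max_length', truncation=True,
                                return_attention_mask=True, return_tensors='pt')
with torch.no_grad():
    logits = model(encoded['input_ids'], attention_mask=encoded['attention_mask']).logits
pred_id = logits.argmax(dim=-1).item()
id2label = {i: name for name, i in label_map.items()}  # invert the label_map defined earlier
print(id2label[pred_id])                               # predicted label, e.g. '好评'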