Exploring the transformers DataCollator

As a key part of the transformers training pipeline, the DataCollator deserves a fresh look and a proper understanding.


default_data_collator

Test code

import transformers

input_ = [{"input_ids": [1, 2, 3]}]
output = transformers.data.default_data_collator(input_)

Output:

{'input_ids': tensor([[1, 2, 3]])}

Purpose:
Takes a list of dicts of the form {str: [value]} and converts each value into a tensor, stacking them along a batch dimension. A very simple conversion; it also renames a label or label_ids key to labels.
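
As a quick illustration of the label renaming, a minimal sketch (the key order of the printed dict may differ):

import transformers

features = [{"input_ids": [1, 2, 3], "label": 0},
            {"input_ids": [4, 5, 6], "label": 1}]
batch = transformers.data.default_data_collator(features)
print(batch)
# expected: {'labels': tensor([0, 1]), 'input_ids': tensor([[1, 2, 3], [4, 5, 6]])}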

DataCollatorWithPadding

Test code

import transformers

input_ = [{"input_ids": [1, 2, 3]}]

# Any tokenizer with a pad token works; a local Qwen2 checkpoint is used here.
tokenizer = transformers.AutoTokenizer.from_pretrained(r"D:\HFModels\qwen2")
data_with_padding = transformers.data.DataCollatorWithPadding(tokenizer, padding="max_length", max_length=512)

output = data_with_padding(input_)

Output:

{'input_ids': tensor([[     1,      2,      3, 151643, 151643,  ..., 151643]]),
 'attention_mask': tensor([[1, 1, 1, 0, 0,  ..., 0]])}

(Output truncated: input_ids is padded out to length 512 with the tokenizer's pad token 151643, and attention_mask marks the three real tokens with 1 and the padding with 0.)

Purpose:
Converts the input_ids in the input to a tensor and pads it using the tokenizer's pad method. Note that if the input dicts contain no input_ids key, an error is raised. Also, input_ids must already be encoded token ids; passing a raw str directly raises an error. The output is the padded input_ids tensor together with its attention_mask; any keys other than input_ids are automatically converted to tensors but are not padded.
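
For contrast, a small sketch of dynamic padding, reusing the tokenizer loaded above (expected values assume right padding, the usual tokenizer default):

dynamic_collator = transformers.data.DataCollatorWithPadding(tokenizer, padding=True)
batch = dynamic_collator([{"input_ids": [1, 2, 3]}, {"input_ids": [4, 5]}])
print(batch["input_ids"])       # expected: tensor([[1, 2, 3], [4, 5, 151643]])
print(batch["attention_mask"])  # expected: tensor([[1, 1, 1], [1, 1, 0]])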

DataCollatorForSeq2Seq

Test code

import transformers

input_ = [{"input_ids": [1, 2, 3], "tt": [1, 2, 3]}]

# tokenizer is the one loaded above; model is optional and only used by models
# that derive decoder_input_ids from labels.
model = transformers.AutoModelForCausalLM.from_pretrained(r"D:\HFModels\qwen2")
data_with_s2s = transformers.data.DataCollatorForSeq2Seq(tokenizer, model, padding="max_length", max_length=512)

output_ = data_with_s2s(input_)

Output:

{"input_ids": [tensor], "attention_musk": [tensor], "tt": [tensor], "labels": None}

Purpose:
Very similar to DataCollatorWithPadding, except the output has an extra labels field, which requires labels in the input and is None otherwise. If labels are provided, they are automatically padded to the same length as input_ids, and the padded positions in labels are marked with -100 (so the loss ignores them).
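
A small sketch of the -100 label padding, using dynamic padding for readability (reusing the tokenizer from above; expected values assume right padding):

s2s_collator = transformers.data.DataCollatorForSeq2Seq(tokenizer, padding=True)
batch = s2s_collator([{"input_ids": [1, 2, 3], "labels": [7, 8, 9]},
                      {"input_ids": [4, 5], "labels": [7, 8]}])
print(batch["input_ids"])  # expected: tensor([[1, 2, 3], [4, 5, 151643]])
print(batch["labels"])     # expected: tensor([[7, 8, 9], [7, 8, -100]])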

DataCollatorForLanguageModeling

Test code

input_ = [{"input_ids": [1, 2, 3], "tt": [1, 2, 3], "labels": [1, 2]}]

# tokenizer is the one loaded above.
data_with_lm = transformers.data.DataCollatorForLanguageModeling(tokenizer, mlm=False)

output_ = data_with_lm(input_)

Output:

{'input_ids': tensor([[1, 2, 3]]), 'tt': tensor([[1, 2, 3]]), 'attention_mask': tensor([[1, 1, 1]]), 'labels': tensor([[1, 2, 3]])}

Purpose:
labels is automatically set to a copy of input_ids, overwriting any labels passed in, as the output above shows (this applies when mlm=False; masked language modeling is only used by BERT-style models, so mlm is almost always False). When the batch contains padding, the pad positions in labels are additionally set to -100.
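
A small sketch of the padding behavior with two different-length sequences (reusing the tokenizer from above; expected values assume right padding):

lm_collator = transformers.data.DataCollatorForLanguageModeling(tokenizer, mlm=False)
batch = lm_collator([{"input_ids": [1, 2, 3]}, {"input_ids": [4, 5]}])
print(batch["input_ids"])  # expected: tensor([[1, 2, 3], [4, 5, 151643]])
print(batch["labels"])     # expected: tensor([[1, 2, 3], [4, 5, -100]])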

DataCollatorForWholeWordMask

A collator for BERT-style whole-word-masking tasks; not covered in detail here.

DataCollatorForPermutationLanguageModeling

Rarely used in practice; not covered in detail here.

DataCollatorWithFlattening

Same as above.
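
Finally, a note on where a collator fits: it is just a callable that turns a list of feature dicts into a batched tensor dict, so it can be passed straight to a PyTorch DataLoader as collate_fn (or to Trainer via its data_collator argument). A minimal sketch, assuming a hypothetical train_dataset whose items are {"input_ids": [...]} dicts:

from torch.utils.data import DataLoader

loader = DataLoader(train_dataset, batch_size=8, collate_fn=data_with_lm)
for batch in loader:
    # batch is a dict of tensors: input_ids, attention_mask, labels
    break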
