As an important part of the transformers training pipeline, it is worth revisiting and understanding the DataCollator classes.
huggingface transformers datacollator
default_data_collator
Test code:
import transformers

input_ = [{"input_ids": [1, 2, 3]}]
output = transformers.data.default_data_collator(input_)
Output:
{'input_ids': tensor([[1, 2, 3]])}
Purpose:
Converts the values of dicts of the form {str: [value]} into tensors and stacks them along the batch dimension.
A very simple conversion.
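To make the batching behavior concrete, here is a minimal sketch (the extra attention_mask key and the values are made up for illustration) showing two same-length examples being stacked into one tensor per key:

import transformers

# Two examples of the same length; default_data_collator stacks them
# into a single tensor per key, with no padding applied.
batch = transformers.data.default_data_collator([
    {"input_ids": [1, 2, 3], "attention_mask": [1, 1, 1]},
    {"input_ids": [4, 5, 6], "attention_mask": [1, 1, 0]},
])
print(batch["input_ids"].shape)  # torch.Size([2, 3])

Because no padding is done, all examples in a batch must already have the same length.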
DataCollatorWithPadding
Test code:
import transformers

input_ = [{"input_ids": [1, 2, 3]}]
tokenizer = transformers.AutoTokenizer.from_pretrained(r"D:\HFModels\qwen2")
data_with_padding = transformers.data.DataCollatorWithPadding(tokenizer, padding="max_length", max_length=512)
output = data_with_padding(input_)
Output (truncated; input_ids is padded with the pad token id 151643 up to max_length=512, and attention_mask is 1 for the three real tokens and 0 elsewhere):
{'input_ids': tensor([[     1,      2,      3, 151643, 151643,  ..., 151643, 151643]]),
 'attention_mask': tensor([[1, 1, 1, 0, 0,  ..., 0, 0]])}
Purpose:
Converts the input's input_ids to a tensor and pads it using the tokenizer's pad method. Note that if the input does not contain an input_ids key, an error is raised. Also, input_ids must be token ids produced by encoding; passing a raw str directly raises an error. The output is the padded input_ids tensor together with the corresponding attention_mask; any other keys in the input are converted to tensors as well but are not padded.
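In practice the more common setup is dynamic padding, where each batch is padded only to its longest sequence. A minimal sketch, assuming the same local Qwen2 tokenizer path as above:

import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained(r"D:\HFModels\qwen2")

# padding="longest" pads each batch only to the longest sequence it contains.
dynamic_padding = transformers.data.DataCollatorWithPadding(tokenizer, padding="longest")
batch = dynamic_padding([
    {"input_ids": [1, 2, 3]},
    {"input_ids": [4, 5]},
])
print(batch["input_ids"])       # the shorter row is padded with tokenizer.pad_token_id
print(batch["attention_mask"])  # padded positions get 0 in the attention mask

This avoids paying for 512 tokens of padding when the batch only needs a handful.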
DataCollatorForSeq2Seq
Test code:
import transformers

# Reuses the tokenizer from the previous snippet. The model argument is optional;
# when given, it is used to prepare decoder_input_ids for models that support it.
model = transformers.AutoModelForCausalLM.from_pretrained(r"D:\HFModels\qwen2")
input_ = [{"input_ids": [1, 2, 3], "tt": [1, 2, 3]}]
data_with_s2s = transformers.data.DataCollatorForSeq2Seq(tokenizer, model, padding="max_length", max_length=512)
output_ = data_with_s2s(input_)
Output:
{"input_ids": [tensor], "attention_mask": [tensor], "tt": [tensor], "labels": None}
Purpose:
Very similar to DataCollatorWithPadding, except that the output has an extra labels field, which is None unless labels is provided in the input. If labels is provided, it is automatically padded to the same length as input_ids, and the padded positions in labels are marked with -100.
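The -100 padding of labels is the detail that matters for training, since -100 is the index that the loss functions in transformers ignore. A minimal sketch, assuming the same local tokenizer as above and a smaller max_length so the printout stays readable:

import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained(r"D:\HFModels\qwen2")

# Same collator as above, but with max_length=8 so the padding is easy to see.
s2s_collator = transformers.data.DataCollatorForSeq2Seq(tokenizer, padding="max_length", max_length=8)
batch = s2s_collator([{"input_ids": [1, 2, 3, 4], "labels": [1, 2]}])
print(batch["input_ids"])  # padded with tokenizer.pad_token_id up to length 8
print(batch["labels"])     # padded with -100 up to length 8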
DataCollatorForLanguageModeling
Test code:
# Reuses the tokenizer from the previous snippets.
input_ = [{"input_ids": [1, 2, 3], "tt": [1, 2, 3], "labels": [1, 2]}]
data_with_lm = transformers.data.DataCollatorForLanguageModeling(tokenizer, mlm=False)
output_ = data_with_lm(input_)
Output:
{'input_ids': tensor([[1, 2, 3]]), 'tt': tensor([[1, 2, 3]]), 'attention_mask': tensor([[1, 1, 1]]), 'labels': tensor([[1, 2, 3]])}
Purpose:
labels is automatically set to a copy of input_ids (when mlm=False; mlm is only used by some models such as BERT, so for causal LMs it is basically always False).
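One practical detail worth knowing (my understanding of the collator's behavior, not stated above): when the batch needs padding, the collator pads input_ids with the tokenizer and, in the labels copy, replaces the pad positions with -100 so they do not contribute to the loss. A minimal sketch with the same assumed local tokenizer:

import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained(r"D:\HFModels\qwen2")

lm_collator = transformers.data.DataCollatorForLanguageModeling(tokenizer, mlm=False)

# Two sequences of different lengths; the collator pads them itself.
batch = lm_collator([
    {"input_ids": [1, 2, 3]},
    {"input_ids": [4, 5]},
])
print(batch["input_ids"])  # shorter row padded with tokenizer.pad_token_id
print(batch["labels"])     # same values, but pad positions replaced by -100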
DataCollatorForWholeWordMask
A collator for BERT-style whole-word masking tasks; not explained in detail here.
DataCollatorForPermutationLanguageModeling
I have hardly ever used this one; not explained in detail here.
DataCollatorWithFlattening
Same as above.