0 Loading the data
from datasets import load_dataset
dataset = load_dataset("glue", "mrpc", split="train")
dataset[0]
'''
{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
'label': 1,
'idx': 0}
'''
1 Sorting
sorted_dataset = dataset.sort("label")
sorted_dataset['label'][:10], dataset['label'][:10]
#([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 0, 1, 0, 1, 1, 0, 1, 0, 0])
2 Shuffling
shuffled_dataset = dataset.shuffle(seed=42)
shuffled_dataset['idx'][:10]
#[3946, 3683, 3919, 485, 2251, 2173, 3936, 1603, 1351, 736]
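Shuffling does not move the underlying data; it creates an indices mapping on top of it, so rows are no longer read contiguously and access can become slower. To restore performance, call flatten_indices(), which rewrites the dataset in the shuffled order and drops the indices mapping.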