Preface
This series is a set of notes taken while following a Bilibili course.
Course link: 【手把手带你实战HuggingFace Transformers】
This section mainly covers loading datasets and some basic operations on them; a follow-up post will cover loading local data.
Main Content
First, import the datasets library and the other packages we need:
from datasets import load_dataset
import torch
from transformers import AutoConfig,AutoModel,AutoTokenizer
Here we load the dataset madao33/new-title-chinese:
datasets = load_dataset("madao33/new-title-chinese")
The output is:
DatasetDict({
    train: Dataset({
        features: ['title', 'content'],
        num_rows: 5850
    })
    validation: Dataset({
        features: ['title', 'content'],
        num_rows: 1679
    })
})
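If we only need a particular split, load_dataset also accepts a split argument and then returns a single Dataset instead of a DatasetDict. A minimal sketch (train_only is just an illustrative variable name):
# Load only the training split of the same dataset
train_only = load_dataset("madao33/new-title-chinese", split="train")
print(train_only)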
To load a single sub-task of a larger benchmark, such as the boolq subset of super_glue, we pass the configuration name as the second argument:
datasets = load_dataset("super_glue","boolq")
datasets["train"][:3]
The output is:
{'question': ['do iran and afghanistan speak the same language',
'do good samaritan laws protect those who help at an accident',
'is windows movie maker part of windows essentials'],
'passage': ['Persian language -- Persian (/ˈpɜːrʒən, -ʃən/), also known by its endonym Farsi (فارسی fārsi (fɒːɾˈsiː) ( listen)), is one of the Western Iranian languages within the Indo-Iranian branch of the Indo-European language family. It is primarily spoken in Iran, Afghanistan (officially known as Dari since 1958), and Tajikistan (officially known as Tajiki since the Soviet era), and some other regions which historically were Persianate societies and considered part of Greater Iran. It is written in the Persian alphabet, a modified variant of the Arabic script, which itself evolved from the Aramaic alphabet.',
"Good Samaritan law -- Good Samaritan laws offer legal protection to people who give reasonable assistance to those who are, or who they believe to be, injured, ill, in peril, or otherwise incapacitated. The protection is intended to reduce bystanders' hesitation to assist, for fear of being sued or prosecuted for unintentional injury or wrongful death. An example of such a law in common-law areas of Canada: a good Samaritan doctrine is a legal principle that prevents a rescuer who has voluntarily helped a victim in distress from being successfully sued for wrongdoing. Its purpose is to keep people from being reluctant to help a stranger in need for fear of legal repercussions should they make some mistake in treatment. By contrast, a duty to rescue law requires people to offer assistance and holds those who fail to do so liable.",
'Windows Movie Maker -- Windows Movie Maker (formerly known as Windows Live Movie Maker in Windows 7) is a discontinued video editing software by Microsoft. It is a part of Windows Essentials software suite and offers the ability to create and edit videos as well as to publish them on OneDrive, Facebook, Vimeo, YouTube, and Flickr.'],
'idx': [0, 1, 2],
'label': [1, 1, 1]}
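To see which sub-task configurations a benchmark provides, the datasets library exposes get_dataset_config_names; a small sketch:
from datasets import get_dataset_config_names
# List all available configurations (sub-tasks) under super_glue
print(get_dataset_config_names("super_glue"))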
For splitting data we can use the Dataset's built-in train_test_split, where stratify_by_column performs a stratified (class-balanced) split on the given column.
traindata = datasets["train"]
splitdata = traindata.train_test_split(test_size=0.1,stratify_by_column="label")
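Note that stratify_by_column requires the target column to be a ClassLabel feature. boolq's label already is one, so nothing extra is needed above; for a dataset whose label column is a plain int/string Value, you can cast it first with class_encode_column. A hedged sketch (dataset is a placeholder variable):
# Cast "label" to ClassLabel so it can be used for stratification
dataset = dataset.class_encode_column("label")
splitdata = dataset.train_test_split(test_size=0.1, stratify_by_column="label")
print(splitdata)  # DatasetDict with "train" and "test" splits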
To pick out specific rows, use the select function:
datasets["train"].select([0,1])
The output is:
Dataset({
    features: ['question', 'passage', 'idx', 'label'],
    num_rows: 2
})
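select accepts any iterable of indices, so combining it with shuffle gives a quick way to draw a random sample; a small sketch:
# Shuffle with a fixed seed, then take the first 10 rows as a random sample
sample = datasets["train"].shuffle(seed=42).select(range(10))
print(sample)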
For filtering, the filter function is commonly used. An example:
datasets = load_dataset("LooksJuicy/ruozhiba")
filterdata = datasets["train"].filter(lambda example: "篮球" in example["instruction"])
filterdata["instruction"]
# Output:
['人体肠道有一个篮球场大小,为什么不在肠道上打篮球呢?',
'他们都说黑子的篮球好看,为什么我看是金子的白球?',
'苹果下落是因为重力场,篮球会下落是不是篮球场?',
'为什么世界杯的篮球这么轻,可以踢到观众席上去']
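filter also accepts batched=True, in which case the predicate receives a batch (a dict of lists) and must return a list of booleans; on large datasets this is usually faster. A sketch of the same filter in batched form:
# Batched filtering: return one boolean per example in the batch
filterdata = datasets["train"].filter(
    lambda batch: ["篮球" in text for text in batch["instruction"]],
    batched=True,
)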
To transform every example, we usually use map together with a mapping function. Here we prepend "xxx:" to each instruction.
The code:
def yingshe1(example):
    example["instruction"] = ["xxx:" + t for t in example["instruction"]]
    return example
prefix_data = datasets.map(yingshe1, batched=True)
print(prefix_data['train']["instruction"][2])
The result:
xxx:樟脑丸是我吃过最难吃的硬糖有奇怪的味道怎么还有人买
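For comparison, with the default batched=False the mapping function receives a single example, so each field is a plain value rather than a list. A sketch of the equivalent non-batched version (yingshe2 is just an illustrative name):
def yingshe2(example):
    # Non-batched: example["instruction"] is a single string here
    example["instruction"] = "xxx:" + example["instruction"]
    return example
prefix_data = datasets.map(yingshe2)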
Setting batched=True makes map process the data in batches, which is faster. Note that with batched=True, every field the mapping function receives is a list, not a single value. The large speedup from batched mapping also relies on the tokenizer having a fast (Rust-backed) implementation; if the tokenizer only has a slow implementation, batched processing gains little. In that case, if we still want faster processing, we make two changes: 1. pass the tokenizer into the process function as an argument, and 2. use num_proc to process with multiple workers.
For example:
from transformers import AutoConfig, AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("trained_model")
datasets = load_dataset("madao33/new-title-chinese")
def process(batch, tokenizer=tokenizer):
    model_inputs = tokenizer(batch["content"], max_length=128, truncation=True)
    labels = tokenizer(batch["title"], max_length=128, truncation=True)
    # For the summarization task, the labels are the titles
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
pro_dataset = datasets.map(process, num_proc=4)
The output is:
DatasetDict({
    train: Dataset({
        features: ['title', 'content', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 5850
    })
    validation: Dataset({
        features: ['title', 'content', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 1679
    })
})
If we want to drop the original title and content columns, we can pass them to the remove_columns argument of map, as in the sketch below.
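A minimal sketch, continuing from the process function above:
# Drop the raw text columns after tokenization so only the model inputs remain
pro_dataset = datasets.map(process, num_proc=4, remove_columns=["title", "content"])
print(pro_dataset)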
If anything here is unclear or not well explained, feel free to point it out and discuss; thanks for your understanding.