利用合成数据进行后训练--synthetic_continued_pretraining

An_ich

已于 2025-08-01 00:03:35 修改

阅读量805

点赞数 12

CC 4.0 BY-SA版权

文章标签： python 人工智能神经网络深度学习语言模型 pytorch

于 2025-08-01 00:03:09 首次发布

本文链接：https://blog.csdn.net/weixin_62891098/article/details/149816593

一、利用EntiGraph生成数据

（一）加载数据，处理文本

数据结构：

questions:有用的属性有question, options, gold_lable, difficult

数据结构中的questions, title, anthor, article, year, topic形成文本类的实例属性

最终形成用于生成合成数据的文本属性如下：

（二）获取实体

1.生成string文本

包括文章题目，作者，年份和文章内容

2.获得实体的提示词

system:
As a knowledge analyzer, your task is to dissect(解剖) and understand an article provided by the user. You are required to perform(执行) the following steps:
1. Summarize the Article: Provide a concise(简洁的) summary of the entire article, capturing the main points and themes.
2. Extract Entities: Identify(识别) and list all significant(重要的) "nouns(名词)" or entities mentioned(提到的) within the article. These entities should include but not limited to:
* People: Any individuals mentioned in the article, using the names or references provided.
* Places: Both specific locations and abstract(抽象的) spaces relevant to the content.
* Object: Any concrete(具体的) object that is referenced by the provided content.
* Concepts(概念): Any significant abstract ideas or themes that are central(主题) to the article's discussion.

Try to exhaust(详尽) as many entities as possible. Your response should be structured in a JSON format to organize the information effectively. Ensure that the summary is brief yet comprehensive(综合性的), and the list of entities is detailed and accurate(准确).

Here is the format you should use for your response:

{
"summary": "<A concise summary of the article>",
"entities": ["entity1", "entity2", ...]
}

user:

3.获取所有可能组合的二元组

即C(len(entities), 2)

4.获取所有可能组合的三元组

即C(len(entities), 3)

（三）遍历二元组列表，获得两实体关系

1.获取输入文本

上一步的string文本，还有两个实体

2.获取实体对关系的提示词

system:
You will act as a knowledge analyzer tasked with dissecting an article provided by the user. Your role involves two main objectives:
1. Rephrasing(改述) Content: The user will identify two specific entities mentioned in the article. You are required to rephrase the content of the article twice:
* Once, emphasizing(强调) the first entity.
* Again, emphasizing the second entity.
2. Analyzing Interactions: Discuss how the two specified entities interact within the context of the article.

Your responses should provide clear segregation(分离) between the rephrased content and the interaction analysis. Ensure each section(部分) of the output include sufficient(足够的) context, ideally referencing(引用) the article's title to maintain(保持) clarity about the discussion's focus.
Here is the format you should follow for your response:

### Discussion of <title> in relation to <entity1>
<Rephrased content focusing on the first entity>

### Discussion of <title> in relation to <entity2>
<Rephrased content focusing on the second entity>

### Discussion of Interaction between <entity1> and <entity2> in context of <title>
<Discussion on how the two entities interact within the article>

user：

（四）遍历三元组列表，获得三实体的关系

1.获取输入文本

string文本，还有三个实体

2.获取实体对关系的提示词

system:
You will act as a knowledge analyzer tasked with dissecting an article provided by the user. Your role involves three main objectives:
1. Rephrasing Content: The user will identify three specific entities mentioned in the article. You are required to rephrase the content of the article three times:
* Once, emphasizing the first entity.
* Again, emphasizing the second entity.
* Lastly, emphasizing the third entity.
2. Analyzing Interactions: Discuss how these three specified entities interact within the context of the article.

Your responses should provide clear segregation between the rephrased content and the interaction analysis. Ensure each section of the output include sufficient context, ideally referencing the article's title to maintain clarity about the discussion's focus.
Here is the format you should follow for your response:

### Discussion of <title> in relation to <entity1>
<Rephrased content focusing on the first entity>

### Discussion of <title> in relation to <entity2>
<Rephrased content focusing on the second entity>

### Discussion of <title> in relation to <entity3>
<Rephrased content focusing on the third entity>

### Discussion of Interaction between <entity1>, <entity2> and <entity3> in context of <title>
<Discussion on how the three entities interact within the article>

user：

（五）Tokenize

1.加载之前生成的所有文本数据（不包括实体list）

2.将所有文本tokenize

3.转为numpy写到内存里

这里都是用的extend方法，保存为维度为1的numpy数组，后面在加载数据时会block_size进行截断。

二、加入重放数据

定义: replay指的是一种训练策略，其中以前的数据或经验被定期重新引入到训练中，以帮助模型记住旧知识（重复使用旧数据）

（一）下载replay data

这里使用的是togethercomputer/RedPajama-Data-1T-Sample作为重放数据

from huggingface_hub import hf_hub_download
import os

file_path = hf_hub_download(
    repo_id="togethercomputer/RedPajama-Data-1T-Sample",  # 模型 repo ID
    filename="stackexchange_sample.jsonl",  # 文件名（相对路径）
    cache_dir="./hf_cache",
    repo_type="dataset",
    token="xxxx"

（二）加载并tokenize replay data

这里没有什么好说的

注意如果下载单个数据，用json加载

from datasets import load_dataset

dataset = load_dataset('json', data_files="xxx", trust_remote_code=True)

（三）加入replay data

根据设置的重放比例超参数，随机加入重放数据

加入时将保存的一维数组按照block_size进行截断作为一条数据

三、继续训练

遇到问题了，16G的显存训练不了1B的模型。

明天尝试一下qwen3_0.6B吧

精彩这都资源不够.....还是得多捡垃圾啊