1 Data Preparation
Download the pretraining dataset and the fine-tuning (SFT) dataset:
Two datasets are needed:
Pretraining dataset: SylvanL/Traditional-Chinese-Medicine-Dataset-Pretrain
SFT dataset: SylvanL/Traditional-Chinese-Medicine-Dataset-SFT
2 Download the Models
Download the models ColossalAI has already continually pretrained on LLaMA-2: hpcai-tech/Colossal-LLaMA-2-7b-base and hpcai-tech/Colossal-LLaMA-2-13b-base.
The following script can be used for the download:
from huggingface_hub import snapshot_download
from huggingface_hub import login
import argparse
import os
from pathlib import Path

# Before running, execute `export HF_ENDPOINT=https://hf-mirror.com` in the shell
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'
print(os.environ['HF_ENDPOINT'])

# Download a repository with huggingface_hub's snapshot_download
def download_data(repo_id, save_path, repo_type):
    try:
        print(f"Downloading repository {repo_id} to {save_path}...")
        snapshot_download(repo_id=repo_id, local_dir=save_path, repo_type=repo_type, local_dir_use_symlinks=False)
        print(f"Repository successfully downloaded and saved to {save_path}")
    except Exception as e:
        print(f"An error occurred while downloading the repository: {e}")

if __name__ == "__main__":
    # Some resources require a Hugging Face token; log in first with your own token
    login("xx")
    # Set up command-line arguments
    parser = argparse.ArgumentParser(description='Download a Hugging Face repository to a local path.')
    parser.add_argument('repo_id', type=str, help='The repository ID in the format "owner/repo".')
    parser.add_argument('save_path', type=str, help='The local path where the repository will be saved.')
    # A positional argument needs nargs='?' for its default to apply
    parser.add_argument('repo_type', type=str, nargs='?', default="model", help='The repo type, e.g. "model" or "dataset".')
    # Parse the command-line arguments (only once; the original parsed them twice)
    args = parser.parse_args()
    save_path = Path(args.save_path)
    # Create the local save path if it does not exist
    if not save_path.exists():
        os.makedirs(save_path)
    # Run the download
    download_data(args.repo_id, args.save_path, args.repo_type)
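Assuming the script is saved as download_hf.py (the filename is my choice, not from the original), it can download both the model and the datasets, for example:
export HF_ENDPOINT=https://hf-mirror.com
python download_hf.py hpcai-tech/Colossal-LLaMA-2-7b-base /data/base_model/Colossal-LLaMA-2-7b-base model
python download_hf.py SylvanL/Traditional-Chinese-Medicine-Dataset-SFT /data/src_data/SylvanL/Traditional-Chinese-Medicine-Dataset-SFT dataset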
3 Container Image
The base image is hpcaitech/colossalai:v0.4.6. The image modified during this verification has been pushed to swr.cn-north-4.myhuaweicloud.com/tiger202203/hpcaitech/colossalai:0.4.6.a-nccl; the modified code lives under /workspace.
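Pull the modified image with:
docker pull swr.cn-north-4.myhuaweicloud.com/tiger202203/hpcaitech/colossalai:0.4.6.a-nccl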
4 Dataset Processing
SFT dataset processing: the original example code /ColossalAI/applications/Colossal-LLaMA/dataset/prepare_sft_dataset.py calls supervised_tokenize_sft, which cannot directly handle Traditional-Chinese-Medicine-Dataset-SFT (its records use "input"/"output" fields rather than a "messages" list). Modify supervised_tokenize_sft as follows:
def supervised_tokenize_sft(
    data_point: Dict[str, str],
    tokenizer: AutoTokenizer,
    conversation_template: Conversation = default_conversation,
    ignore_index: int = None,
    max_length: int = 4096,
) -> Dict[str, Union[int, str, List[int]]]:
    """
    A tokenization function to tokenize an original supervised data point as following:
        {"messages": [{"from": "human", "content": "xxx"}, {"from": "assistant", "content": "xxx"}]}
    """
    assert tokenizer.add_bos_token is False and tokenizer.add_eos_token is False, (
        "Initially set `tokenizer.add_bos_token` and `tokenizer.add_eos_token` to False, "
        "add <bos> and <eos> manually later"
    )
    assert (
        tokenizer.bos_token == conversation_template.seps[0] and tokenizer.eos_token == conversation_template.seps[1]
    ), "`bos_token` and `eos_token` should be the same with `conversation_template.seps`."

    if ignore_index is None:
        ignore_index = IGNORE_INDEX

    # messages = data_point["messages"]
    template = deepcopy(conversation_template)
    template.messages = []

    # New code: map the dataset's "input"/"output" fields onto the conversation template
    humanMsg = data_point["input"]
    assistantMsg = data_point["output"]
    template.append_message("human", humanMsg)
    template.append_message("assistant", assistantMsg)

    # Old code, commented out:
    # for mess in messages:
    #     from_str = mess["from"]
    #     if from_str.lower() == "human":
    #         from_str = template.roles[0]
    #     elif from_str.lower() == "assistant":
    #         from_str = template.roles[1]
    #     else:
    #         raise ValueError(f"Unsupported role {from_str.lower()}")
    #     template.append_message(from_str, mess["content"])
The original code only handles files with the .jsonl suffix, while this dataset uses .json, so change the .jsonl suffix in prepare_sft_dataset.py to .json:
for ds_dir in input_data_dirs:
    ds_dir = os.path.abspath(ds_dir)
    assert os.path.exists(ds_dir), f"Not find data dir {ds_dir}"
    ds_files = [name for name in os.listdir(ds_dir) if name.endswith(".json")]
    ds_paths = [os.path.join(ds_dir, name) for name in ds_files]
    input_data_paths.extend(ds_paths)
Run the following command to generate the dataset, where llama_version is the Llama version:
python prepare_sft_dataset.py \
--data_input_dirs "/data/src_data/SylvanL/Traditional-Chinese-Medicine-Dataset-SFT" \
--tokenizer_dir "/data/base_model/Meta-Llama-3-8B-Instruct" \
--data_output_dirs "/data/dataset/Traditional-Chinese-Medicine-Dataset-SFT3" \
--max_length 4096 \
--num_spliced_dataset_bins 10 \
--llama_version 2
During execution, a file-parsing error was reported.
The cause is that dataset_info.json is formatted differently from the other files: it is a description of the dataset rather than data. The processing logic could be adapted to honor what dataset_info.json specifies; in this article the file was simply deleted. A skip-based alternative is sketched below.
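As a minimal sketch of the "adapt the logic" alternative (my own suggestion, not the fix applied in this article), the file-collection list comprehension can simply exclude the metadata file:
ds_files = [
    name
    for name in os.listdir(ds_dir)
    if name.endswith(".json") and name != "dataset_info.json"  # skip metadata with a different schema
]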
5 Validation Preparation
Running ColossalAI/applications/Colossal-LLaMA/dataset/prepare_sft_dataset.py to process the dataset fails with a "module not found" error; moving prepare_sft_dataset.py to its parent directory resolves it.
Running ColossalAI/applications/Colossal-LLaMA/inference/inference_example.py for inference validation fails with the same "module not found" error; moving inference_example.py to its parent directory likewise resolves it.
The scripts under ColossalAI/applications/ColossalChat/examples/training_scripts drive the various types of training; before training, move the relevant script into the ColossalAI/applications/ColossalChat directory, otherwise the coati module is not found. (An alternative that avoids moving files is sketched below.)
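Alternatively (a suggestion of mine, not from the original workflow), extend PYTHONPATH instead of moving the scripts, e.g. for the ColossalChat case:
export PYTHONPATH=/path/to/ColossalAI/applications/ColossalChat:$PYTHONPATH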
6 Problems Encountered During Training
6.1 Insufficient memlock and shm-size break connectivity and block training
Symptom: NCCL communication errors appear during training.
nccl-test reports similar errors (for a crash that occurred while running nccl-test itself, see 6.2).
Set export NCCL_DEBUG=INFO in the training script to get verbose output.
The verbose log shows that memory allocation fails on the InfiniBand send path. The container's default memlock limit is 64 KB, which always triggers this problem when IB NICs are used. With memlock raised to 1 GB, fewer than 3 nodes train fine, but with more than 3 nodes the errors return. A too-small shm-size causes related memory errors as well (not shown individually here).
Fix:
Raise memlock further; 10 GB resolved the problem in testing (the command below sets an even larger 1099511627776 bytes, i.e. 1 TiB, together with --shm-size 10g). Start the container with:
docker run -d --gpus all --network host \
    --device=/dev/infiniband/uverbs0 --device=/dev/infiniband/uverbs1 \
    --device=/dev/infiniband/uverbs2 --device=/dev/infiniband/uverbs3 \
    --device=/dev/infiniband/uverbs4 --device=/dev/infiniband/uverbs5 \
    --device=/dev/infiniband/uverbs6 --device=/dev/infiniband/uverbs7 \
    -v /data_mnt:/data -v /train_mnt:/train \
    --ulimit memlock=1099511627776 --shm-size 10g \
    --entrypoint /workspace/keep_run.sh \
    swr.cn-north-4.myhuaweicloud.com/tiger202203/hpcaitech/colossalai:0.4.6.a-nccl
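To confirm the limit took effect (a quick check of my own), run inside the container:
ulimit -l   # max locked memory in KB; should report the configured large value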
6.2 Wrong libnccl version causes nccl-test to coredump
Symptom: running nccl-test inside the container produces a coredump.
The libnccl version inside the container is 2.17.1, but the official downloads list no build of this version for CUDA 12.4; the initial suspicion was a library version mismatch, since nccl-tests was compiled against CUDA 12.4.
Check which libnccl version the installed PyTorch uses:
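One way to check it (my addition; torch.cuda.nccl.version() is the standard PyTorch API for this):
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.nccl.version())"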
Although the base container ships CUDA 12.1 and libnccl 2.17.1, that combination could not be used for nccl-test, which was inconvenient.
According to the official downloads (link below), NCCL 2.20.5 currently has no build for CUDA 12.1, only for CUDA 12.4; the NCCL version in the container is 2.17.1.
https://developer.nvidia.com/nccl/nccl-legacy-downloads
Reinstall a PyTorch build that matches the machine environment, updating libnccl to 2.20.5.
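One possible way to do this (an assumption on my part, using PyTorch's official CUDA 12.4 wheel index; verify the bundled NCCL version afterwards with the command above):
pip install --force-reinstall torch --index-url https://download.pytorch.org/whl/cu124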
Build mpirun:
./configure --prefix=/usr/local/mpi --with-cuda=/usr/local/cuda-12.4
make -j20
make install
Build nccl-tests:
make MPI=1 MPI_HOME=/usr/local/mpi CUDA_HOME=/usr/local/cuda-12.4 NCCL_HOME=./
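Before going multi-node, a single-node sanity run can be done directly with the built binary (standard nccl-tests flags; -g 8 uses all 8 GPUs on one node):
./build/all_reduce_perf -b 8M -e 128M -f 2 -g 8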
Run mpirun across the nodes; with this setup the multi-node run succeeds:
/usr/local/mpi/bin/mpirun -np 56 \
--hostfile hostfile \
--allow-run-as-root -bind-to none -map-by :OVERSUBSCRIBE \
-x NCCL_DEBUG=INFO \
-x NCCL_IB_DISABLE=0 \
-x NCCL_SOCKET_IFNAME=eth0 \
-x NCCL_NET_GDR_LEVEL=2 \
-x NCCL_IB_QPS_PER_CONNECTION=4 \
-x NCCL_IB_TC=160 \
-x NCCL_SHM_DISABLE=1 \
-x LD_LIBRARY_PATH -x PATH \
-mca coll_hcoll_enable 0 -mca pml ob1 -mca btl_tcp_if_include eth0 -mca btl ^openib \
/nccl-tests/build/all_reduce_perf -b 32M -e 1G -i 1000 -f 2 -g 1
where the hostfile contains (7 nodes × 8 slots, matching -np 56):
node1 slots=8
node2 slots=8
node3 slots=8
node4 slots=8
node5 slots=8
node6 slots=8
node7 slots=8
6.3 A GPU drop during training interrupts the job
Symptom:
nccl-test reported Test failure common.cu:891; checking the faulty node with nvidia-smi showed only 7 GPUs, i.e. one GPU had dropped.
Fix: reboot the machine; the GPU is then rediscovered.
6.4 After training, tokenizer loading fails when calling the model for inference
Fix: copy the tokenizer-related files from the base model into the trained model's directory:
cp /data/hpcai-tech/Colossal-LLaMA-2-13b-base/special_tokens_map.json xxx/modeling/
cp /data/hpcai-tech/Colossal-LLaMA-2-13b-base/tokenizer* xxx/modeling/
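A quick load check (my addition; the xxx/modeling/ placeholder matches the copy commands above):
from transformers import AutoTokenizer

# should now load without errors from the fine-tuned model directory
tokenizer = AutoTokenizer.from_pretrained("xxx/modeling/")
print(tokenizer)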
6.5 When training with a plugin, checkpoints are not written at the configured interval
Fix: the checkpoint block in sft.py is mis-indented, so no checkpoint is written during training and only the final model is saved at the end; the pipeline-parallel branch lacks checkpoint logic altogether.
Modified as follows:
def _train(self, epoch: int):
    self.model.train()
    if isinstance(self.plugin, HybridParallelPlugin) and self.plugin.pp_size > 1:
        data_iter = iter(self.train_dataloader)
        step_bar = tqdm(
            range(len(self.train_dataloader)),
            desc="Step",
            disable=not (dist.get_rank() == dist.get_world_size() - 1),
        )
        for step in step_bar:
            outputs = self.booster.execute_pipeline(
                data_iter,
                self.model,
                criterion=lambda outputs, inputs: outputs[0],
                optimizer=self.optimizer,
                return_loss=True,
            )
            loss = outputs["loss"]
            if self.booster.plugin.stage_manager.is_last_stage():
                global_loss = all_reduce_mean(loss, self.plugin)
                if dist.get_rank() == dist.get_world_size() - 1:
                    step_bar.set_postfix({"train/loss": global_loss.item()})
            self.optimizer.step()
            self.optimizer.zero_grad()
            # Added: checkpoint at the configured save_interval
            if (
                self.save_dir is not None
                and self.save_interval is not None
                and (step + 1) % self.save_interval == 0
            ):
                save_checkpoint(
                    save_dir=self.save_dir,
                    booster=self.booster,
                    model=self.model,
                    optimizer=self.optimizer,
                    lr_scheduler=self.scheduler,
                    epoch=epoch,
                    step=step + 1,
                    # batch_size is not bound in this branch; take it from the dataloader
                    batch_size=self.train_dataloader.batch_size,
                    coordinator=self.coordinator,
                )
                self.coordinator.print_on_master(
                    f"Saved checkpoint at epoch {epoch} step {step} at folder {self.save_dir}"
                )
    else:
        step_bar = trange(
            len(self.train_dataloader) // self.accumulation_steps,
            desc=f"Epoch {epoch + 1}/{self.max_epochs}",
            disable=not is_rank_0(),
        )
        for i, batch in enumerate(self.train_dataloader):
            batch = to_device(batch, torch.cuda.current_device())
            batch_size = batch["input_ids"].size(0)
            outputs = self.model(
                batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["labels"] if self.apply_loss_mask else batch["input_ids"],
            )
            loss = outputs.loss
            self.booster.backward(loss=loss, optimizer=self.optimizer)
            loss_mean = all_reduce_mean(tensor=loss)
            self.accumulative_meter.add("loss", loss_mean.to(torch.float16).item())
            # Gradient accumulation
            if (i + 1) % self.accumulation_steps == 0:
                self.optimizer.step()
                self.optimizer.zero_grad()
                self.scheduler.step()
                step_bar.set_postfix({"train/loss": self.accumulative_meter.get("loss")})
                if self.writer:
                    self.writer.add_scalar("train/loss", self.accumulative_meter.get("loss"), self.num_train_step)
                    self.writer.add_scalar("train/lr", self.scheduler.get_last_lr()[0], self.num_train_step)
                self.num_train_step += 1
                self.accumulative_meter.reset()
                step_bar.update()
                # Save checkpoint
                # Indentation fixed: this block now runs inside the accumulation step
                if (
                    self.save_dir is not None
                    and self.save_interval is not None
                    and (self.num_train_step + 1) % self.save_interval == 0
                ):
                    save_checkpoint(
                        save_dir=self.save_dir,
                        booster=self.booster,
                        model=self.model,
                        optimizer=self.optimizer,
                        lr_scheduler=self.scheduler,
                        epoch=epoch,
                        step=self.num_train_step + 1,
                        batch_size=batch_size,
                        coordinator=self.coordinator,
                    )
                    self.coordinator.print_on_master(
                        f"Saved checkpoint at epoch {epoch} step {self.num_train_step} at folder {self.save_dir}"
                    )
        step_bar.close()