ChatGLM2 P-Tuning: A Complete Record of Pitfalls

Latest recommended article published on 2025-06-27 11:25:08

License: CC 4.0 BY-SA

Article tags:

#MachineLearning #DeepLearning #ArtificialIntelligence

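This article describes how to run the P-Tuning training code in the ChatGLM2-6B project: install the dependencies as described in the official documentation, then run a single command to start training. It also lists the problems encountered during training (torchrun failing to create a process, a missing module, a distributed package built without NCCL, and an unexpected keyword argument error) together with the fix for each.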

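The ptuning subdirectory of the ChatGLM2-6B repository contains all of the P-Tuning code. Install the dependencies as described in the official documentation (using conda install), then run the following command to start training: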

torchrun --standalone --nnodes=1 --nproc-per-node=1 main.py --do_train --train_file D:\ChatGLM2-6B\dataset\oss.json --prompt_column prompt --response_column response --overwrite_cache --model_name_or_path D:\ChatGLM2-6B\model --output_dir D:\ChatGLM2-6B\oss_model --overwrite_output_dir --max_source_length 256 --max_target_length 256 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 16 --predict_with_generate --max_steps 300 --logging_steps 10 --save_steps 100 --learning_rate 2e-1 --pre_seq_len 128 --quantization_bit 4
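The command reads its training data from the file given by --train_file and takes the input and target text from the fields named by --prompt_column and --response_column. The layout of oss.json is not shown here, so the following is only a minimal sketch: it assumes one JSON object per line, a layout the Hugging Face datasets JSON loader accepts. The helper name write_train_file and the sample records are hypothetical.

```python
import json
from pathlib import Path


def write_train_file(records, path):
    """Write records as JSON Lines with the 'prompt' and 'response' fields.

    The field names match --prompt_column and --response_column in the
    training command. The one-object-per-line layout is an assumption.
    """
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            row = {"prompt": record["prompt"], "response": record["response"]}
            # ensure_ascii=False keeps Chinese text readable in the file.
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return len(records)


# Hypothetical sample data, for illustration only.
SAMPLE_RECORDS = [
    {"prompt": "What is object storage?", "response": "A service for storing files as objects."},
    {"prompt": "How do I upload a file?", "response": "Use the upload API with a bucket name and a key."},
]
```

Calling write_train_file(SAMPLE_RECORDS, "oss.json") would produce a two-line file whose path can then be passed to --train_file.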

The following problems came up while running the command:

1. torchrun: failed to create process.

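Find torchrun-script.py in site-packages and delete its first line, #!C:\cb\PYTORC~1\_h_env\python.exe. This shebang points to a Python interpreter path that does not exist on the local machine, so the launcher cannot start the process.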
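The same edit can be scripted instead of done by hand. The sketch below is one possible way to do it, not part of the original fix: the helper names strip_shebang and fix_launcher are hypothetical, and the script path must be supplied by the caller because its exact location depends on the local installation. A backup copy is written before the file is changed.

```python
from pathlib import Path


def strip_shebang(script_text: str) -> str:
    """Return script_text without a leading '#!' line, if one is present."""
    lines = script_text.splitlines(keepends=True)
    if lines and lines[0].startswith("#!"):
        return "".join(lines[1:])
    return script_text


def fix_launcher(script_path: Path) -> bool:
    """Remove the shebang from script_path in place.

    Returns True if a shebang line was removed. The original content is
    saved next to the script with a '.py.bak' suffix first.
    """
    original = script_path.read_text(encoding="utf-8")
    fixed = strip_shebang(original)
    if fixed == original:
        return False
    script_path.with_suffix(".py.bak").write_text(original, encoding="utf-8")
    script_path.write_text(fixed, encoding="utf-8")
    return True
```

For example, fix_launcher(Path(r"<your environment>\torchrun-script.py")) would be called with the real path filled in by the reader.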

2. ModuleNotFoundError: No module named 'cchardet'

Install cchardet with: conda install cchardet

3. Distributed package doesn't have NCCL built in

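NCCL is not supported on Windows. Modify the main function in main.py by adding the following two lines, which initialize the process group with the gloo backend instead (add import torch.distributed as dist at the top of the file if dist is not already imported there):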

os.environ['CUDA_VISIBLE_DEVICES'] = '0'
dist.init_process_group(backend='gloo', init_method= 'tcp://localhost:23456', rank=0, world_size=1)
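If the same main.py also has to run on Linux, the backend can be chosen at runtime rather than hard-coded. The following is a minimal sketch under that assumption, not the change made in this article: pick_backend and init_single_process are hypothetical helper names, and the port simply repeats the 23456 used above. torch is imported inside the function so the selection logic can be read and tested without it.

```python
import sys


def pick_backend(platform: str, nccl_available: bool) -> str:
    """Return 'gloo' on Windows or when NCCL is missing, otherwise 'nccl'."""
    if platform.startswith("win") or not nccl_available:
        return "gloo"
    return "nccl"


def init_single_process(port: int = 23456) -> str:
    """Initialize a one-process group and return the backend that was used."""
    # Imported lazily so that pick_backend has no dependency on torch.
    import torch.distributed as dist

    backend = pick_backend(sys.platform, dist.is_nccl_available())
    dist.init_process_group(
        backend=backend,
        init_method=f"tcp://localhost:{port}",
        rank=0,
        world_size=1,
    )
    return backend
```

On Windows, sys.platform is "win32", so pick_backend always returns "gloo" there regardless of how PyTorch was built.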

4. TypeError: field() got an unexpected keyword argument 'alias'

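attrs must be upgraded to 22.2.0 or later. At the time of writing conda only provided versions up to 22.1.0, so the upgrade has to be done with pip install --upgrade attrs.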
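Before and after upgrading, it can help to confirm which attrs version the environment actually resolves. The sketch below uses the standard-library importlib.metadata for that; the helper names parse_version and installed_attrs_ok are hypothetical, and the version parsing is deliberately simple (it reads only the leading digits of the first three components).

```python
from importlib.metadata import PackageNotFoundError, version
from itertools import takewhile

MINIMUM_ATTRS = (22, 2, 0)  # first release that accepts the 'alias' keyword


def parse_version(text: str) -> tuple:
    """Turn a string such as '22.2.0' into a tuple such as (22, 2, 0)."""
    numbers = []
    for piece in text.split(".")[:3]:
        digits = "".join(takewhile(str.isdigit, piece))
        numbers.append(int(digits) if digits else 0)
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def installed_attrs_ok() -> bool:
    """Return True if attrs is installed and at least MINIMUM_ATTRS."""
    try:
        installed = version("attrs")
    except PackageNotFoundError:
        return False
    return parse_version(installed) >= MINIMUM_ATTRS
```

If installed_attrs_ok() returns False inside the conda environment used for training, the pip upgrade above has not taken effect in that environment.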

Reference: TypeError: field() got an unexpected keyword argument 'alias' · Issue #56 · python-jsonschema/referencing · GitHub