```
[rank1]: Traceback (most recent call last):
[rank1]:   File "/data1/users/heyu/find_size_and_weight/train711.py", line 625, in <module>
[rank1]:     train()
[rank1]:   File "/data1/users/heyu/find_size_and_weight/train711.py", line 617, in train
[rank1]:     train_result = trainer.train()
[rank1]:   File "/data1/users/heyu/uv_env/pyhy/lib/python3.10/site-packages/transformers/trainer.py", line 2240, in train
[rank1]:     return inner_training_loop(
[rank1]:   File "/data1/users/heyu/uv_env/pyhy/lib/python3.10/site-packages/transformers/trainer.py", line 2322, in _inner_training_loop
[rank1]:     self.optimizer, self.lr_scheduler = deepspeed_init(self, num_training_steps=max_steps)
[rank1]:   File "/data1/users/heyu/uv_env/pyhy/lib/python3.10/site-packages/transformers/integrations/deepspeed.py", line 444, in deepspeed_init
[rank1]:     hf_deepspeed_config.trainer_config_finalize(args, model, num_training_steps)
[rank1]:   File "/data1/users/heyu/uv_env/pyhy/lib/python3.10/site-packages/transformers/integrations/deepspeed.py", line 268, in trainer_config_finalize
[rank1]:     raise ValueError(
[rank1]: ValueError: Please correct the following DeepSpeed config values that mismatch TrainingArguments values:
[rank1]: - ds scheduler.params.warmup_max_lr=0.0001 vs hf learning_rate=1e-05
[rank1]: The easiest method is to set these DeepSpeed config values to 'auto'.
2025-07-11 16:02:32,818 - ERROR - Training failed: Please correct the following DeepSpeed config values that mismatch TrainingArguments values:
- ds scheduler.params.warmup_max_lr=0.0001 vs hf learning_rate=1e-05
The easiest method is to set these DeepSpeed config values to 'auto'.
```

The code above fails with this error; please fix it.
<think>We have a DeepSpeed configuration error: `warmup_max_lr` in the DeepSpeed scheduler config does not match `learning_rate` in TrainingArguments.

According to the error message, `scheduler.params.warmup_max_lr` in the DeepSpeed config is set to 0.0001 (i.e. 1e-4), while `learning_rate` in TrainingArguments is 1e-5.

The error message suggests setting these values to 'auto', in which case DeepSpeed automatically uses the values from TrainingArguments.

Fix: set `scheduler.params.warmup_max_lr` to 'auto' in the DeepSpeed config, and check whether other related parameters also need to be 'auto'.

In the code, locate the DeepSpeed config section and change `warmup_max_lr` to 'auto'. Change `warmup_min_lr` to 'auto' as well (the error only mentions `warmup_max_lr`, but to be safe set the whole scheduler section to auto).

The error report could list other mismatched values too, but currently it only flags `warmup_max_lr`. Following its suggestion, set all the scheduler parameters to 'auto'.

The modified DeepSpeed scheduler config:

Original:

```python
"scheduler": {
    "type": "WarmupDecayLR",
    "params": {
        "warmup_min_lr": 0,
        "warmup_max_lr": 1e-4,
        "warmup_num_steps": "auto",
        "total_num_steps": "auto",
    }
},
```

Changed to:

```python
"scheduler": {
    "type": "WarmupDecayLR",
    "params": {
        "warmup_min_lr": "auto",
        "warmup_max_lr": "auto",
        "warmup_num_steps": "auto",
        "total_num_steps": "auto",
    }
},
```

With this, DeepSpeed automatically takes the learning-rate settings from TrainingArguments.

TrainingArguments already sets learning_rate=1e-5, so DeepSpeed will use that value.

Modify the `deepspeed_config` part of the CONFIG dict in the code.

Note: since the DeepSpeed config is given here as a dict (rather than a JSON file), make sure all the parameters are set correctly.

The error message also says "The easiest method is to set these DeepSpeed config values to 'auto'.", so we follow that suggestion.

Rerun after the change.</think>The error occurs because the learning-rate parameters in the DeepSpeed config do not match the learning rate set in TrainingArguments. Let's fix it:
```python
def train():
    TRAIN_FILES = ["./data/train_26.4k_20250703.jsonl"]
    VAL_FILES = ['./data/test_1.0k_20250703.jsonl']
    CONFIG = {
        # ... other settings unchanged ...
        "deepspeed_config": {
            "train_micro_batch_size_per_gpu": "auto",
            "gradient_accumulation_steps": "auto",
            "optimizer": {
                "type": "AdamW",
                "params": {
                    "lr": "auto",  # make sure this is 'auto'
                    "weight_decay": "auto",
                    "betas": "auto",
                }
            },
            "scheduler": {
                "type": "WarmupDecayLR",
                "params": {
                    "warmup_min_lr": "auto",   # changed to 'auto'
                    "warmup_max_lr": "auto",   # changed to 'auto'
                    "warmup_num_steps": "auto",
                    "total_num_steps": "auto",
                }
            },
            "fp16": {
                "enabled": "auto",
            },
            "bf16": {
                "enabled": "auto",
            },
            "zero_optimization": {
                "stage": 2,
                "offload_optimizer": {
                    "device": "none",
                    "pin_memory": True
                },
                "allgather_partitions": True,
                "allgather_bucket_size": 2e8,
                "overlap_comm": True,
                "reduce_scatter": True,
                "reduce_bucket_size": 2e8,
                "contiguous_gradients": True
            },
            "gradient_clipping": 1.0,
            "steps_per_print": 5,
            "wall_clock_breakdown": False
        }
    }
    # ... rest of the code unchanged ...
```
### Key changes
1. **DeepSpeed scheduler config**:
```python
"scheduler": {
    "type": "WarmupDecayLR",
    "params": {
        "warmup_min_lr": "auto",   # changed to 'auto'
        "warmup_max_lr": "auto",   # changed to 'auto'
        "warmup_num_steps": "auto",
        "total_num_steps": "auto",
    }
}
```
2. **Optimizer config**:
```python
"optimizer": {
    "type": "AdamW",
    "params": {
        "lr": "auto",  # make sure this is 'auto'
        # ...
    }
}
```
### Complete fixed code
```python
def train():
    TRAIN_FILES = ["./data/train_26.4k_20250703.jsonl"]
    VAL_FILES = ['./data/test_1.0k_20250703.jsonl']
    CONFIG = {
        "model_path": '/nas_data/xiao/models/Qwen2.5-VL-7B-Instruct',
        "processor_path": '/nas_data/xiao/models/Qwen2.5-VL-7B-Instruct',
        "train_files": TRAIN_FILES,
        "val_files": VAL_FILES,
        "max_pixels": 301056,
        "max_length": 4096,
        "output_dir": "./output",
        "learning_rate": 1e-5,
        "batch_size": 4,
        "grad_accum_steps": 2,
        "num_epochs": 1,
        "warmup_ratio": 0.05,
        "weight_decay": 0.1,
        "logging_steps": 5,
        "eval_steps": 500,
        "save_steps": 500,
        "save_total_limit": 5,
        "fp16": False,
        "bf16": True,
        "gradient_checkpointing": True,
        "deepspeed_config": {
            "train_micro_batch_size_per_gpu": "auto",
            "gradient_accumulation_steps": "auto",
            "optimizer": {
                "type": "AdamW",
                "params": {
                    "lr": "auto",
                    "weight_decay": "auto",
                    "betas": "auto",
                }
            },
            "scheduler": {
                "type": "WarmupDecayLR",
                "params": {
                    "warmup_min_lr": "auto",
                    "warmup_max_lr": "auto",
                    "warmup_num_steps": "auto",
                    "total_num_steps": "auto",
                }
            },
            "fp16": {
                "enabled": "auto",
            },
            "bf16": {
                "enabled": "auto",
            },
            "zero_optimization": {
                "stage": 2,
                "offload_optimizer": {
                    "device": "none",
                    "pin_memory": True
                },
                "allgather_partitions": True,
                "allgather_bucket_size": 2e8,
                "overlap_comm": True,
                "reduce_scatter": True,
                "reduce_bucket_size": 2e8,
                "contiguous_gradients": True
            },
            "gradient_clipping": 1.0,
            "steps_per_print": 5,
            "wall_clock_breakdown": False
        }
    }
    # ... rest of the code unchanged ...
```
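The rest of `train()` is not shown in the question, but for reference, here is a minimal sketch of how a `deepspeed_config` dict like this is typically handed to the Hugging Face Trainer. The `TrainingArguments` parameter names are the real ones; the `build_trainer` helper and its model/dataset arguments are placeholders standing in for the unshown code:

```python
from transformers import Trainer, TrainingArguments

def build_trainer(model, train_dataset, eval_dataset, config):
    # TrainingArguments accepts the DeepSpeed config as a dict (or a path to a
    # JSON file) via the `deepspeed` argument. Every "auto" entry in the dict
    # is then resolved from these arguments when the Trainer finalizes the
    # DeepSpeed config (the trainer_config_finalize step in the traceback).
    args = TrainingArguments(
        output_dir=config["output_dir"],
        learning_rate=config["learning_rate"],       # fills warmup_max_lr and optimizer lr
        per_device_train_batch_size=config["batch_size"],
        gradient_accumulation_steps=config["grad_accum_steps"],
        num_train_epochs=config["num_epochs"],
        warmup_ratio=config["warmup_ratio"],         # fills warmup_num_steps
        weight_decay=config["weight_decay"],
        bf16=config["bf16"],
        fp16=config["fp16"],
        deepspeed=config["deepspeed_config"],        # the dict defined above
    )
    return Trainer(model=model, args=args,
                   train_dataset=train_dataset, eval_dataset=eval_dataset)
```

Launching with the `deepspeed` launcher or `torchrun` works the same way; the Trainer initializes the DeepSpeed engine internally via `deepspeed_init`, which is exactly where the original `ValueError` was raised.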
### Why this fix works
1. **DeepSpeed's 'auto' mode**:
- When a value is set to 'auto', DeepSpeed automatically fills it in from the corresponding TrainingArguments value
- This keeps the DeepSpeed config consistent with the Hugging Face Trainer config
2. **Root cause of the error**:
```bash
- ds scheduler.params.warmup_max_lr=0.0001 vs hf learning_rate=1e-05
```
- The original code hard-coded `warmup_max_lr=1e-4`, while TrainingArguments uses `1e-5`
- Setting 'auto' lets DeepSpeed pick up the learning rate from TrainingArguments
3. **Best practice**:
- In a DeepSpeed config, every parameter that the Trainer also controls should be set to 'auto'
- This avoids manually keeping the two configs in sync and reduces configuration errors (see the pre-flight check sketch below)

This change resolves the inconsistency between the DeepSpeed config and TrainingArguments, so training can start normally.
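If you prefer to keep some values hard-coded instead of using 'auto', a small pre-flight check can catch mismatches before training starts. The helper below is a hypothetical illustration (it is not part of transformers or DeepSpeed) and only compares the two fields flagged by this particular error:

```python
def check_ds_config(ds_config: dict, learning_rate: float) -> list[str]:
    """Return a list of mismatches between a DeepSpeed config dict and the
    learning rate intended for TrainingArguments ('auto' always passes)."""
    problems = []
    sched = ds_config.get("scheduler", {}).get("params", {})
    opt = ds_config.get("optimizer", {}).get("params", {})
    for path, value in [("scheduler.params.warmup_max_lr", sched.get("warmup_max_lr")),
                        ("optimizer.params.lr", opt.get("lr"))]:
        # None means the field is absent; "auto" defers to TrainingArguments.
        if value not in (None, "auto") and value != learning_rate:
            problems.append(f"{path}={value} vs learning_rate={learning_rate}")
    return problems

# Example: the original (broken) config would be flagged before training:
# check_ds_config({"scheduler": {"params": {"warmup_max_lr": 1e-4}}}, 1e-5)
# -> ['scheduler.params.warmup_max_lr=0.0001 vs learning_rate=1e-05']
```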