Running `CUDA_VISIBLE_DEVICES=0,2 dbgpt start webserver --config /app/configs/dbgpt-local-vllm.toml` fails with the following error:

```
=========================== VLLMDeployModelParameters ===========================
name: DeepSeek-R1-Distill-Qwen-32B
provider: vllm
verbose: False
concurrency: 100
backend: None
prompt_template: None
context_length: None
reasoning_model: None
path: models/DeepSeek-R1-Distill-Qwen-32B
device: auto
trust_remote_code: True
download_dir: None
load_format: auto
config_format: auto
dtype: auto
kv_cache_dtype: auto
seed: 0
max_model_len: None
distributed_executor_backend: None
pipeline_parallel_size: 1
tensor_parallel_size: 1
max_parallel_loading_workers: None
block_size: None
enable_prefix_caching: None
swap_space: 4.0
cpu_offload_gb: 0.0
gpu_memory_utilization: 0.9
max_num_batched_tokens: None
max_num_seqs: 2
max_logprobs: 20
revision: None
code_revision: None
tokenizer_revision: None
tokenizer_mode: auto
quantization: fp8
max_seq_len_to_capture: 8192
worker_cls: auto
extras: None
======================================================================
2025-08-05 07:43:32 1249bbe41ac7 dbgpt.util.code.server[4248] INFO Code server is ready
INFO 08-05 07:43:36 config.py:520] This model supports multiple tasks: {'score', 'classify', 'generate', 'reward', 'embed'}. Defaulting to 'generate'.
WARNING 08-05 07:43:36 arg_utils.py:1107] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 08-05 07:43:36 config.py:1483] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 08-05 07:43:37 llm_engine.py:232] Initializing an LLM engine (v0.7.0) with config: model='/app/models/DeepSeek-R1-Distill-Qwen-32B', speculative_config=None, tokenizer='/app/models/DeepSeek-R1-Distill-Qwen-32B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/app/models/DeepSeek-R1-Distill-Qwen-32B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[2,1],"max_capture_size":2}, use_cached_outputs=False,
INFO 08-05 07:43:37 cuda.py:225] Using Flash Attention backend.
INFO 08-05 07:43:37 model_runner.py:1110] Starting to load model /app/models/DeepSeek-R1-Distill-Qwen-32B...
/app/packages/dbgpt-core/src/dbgpt/util/model_utils.py:27: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
  if (hasattr(backends, "mps") and backends.mps.is_built()) or torch.has_mps:
2025-08-05 07:43:38 1249bbe41ac7 dbgpt.util.model_utils[4248] INFO Clear torch cache of device: cuda:0
2025-08-05 07:43:38 1249bbe41ac7 dbgpt.util.model_utils[4248] INFO Clear torch cache of device: cuda:1
2025-08-05 07:43:38 1249bbe41ac7 dbgpt.model.cluster.worker.embedding_worker[4248] INFO Load embeddings model: bge-large-zh-v1.5
2025-08-05 07:43:38 1249bbe41ac7 datasets[4248] INFO PyTorch version 2.5.1+cu121 available.
2025-08-05 07:43:38 1249bbe41ac7 datasets[4248] INFO Duckdb version 1.2.0 available.
2025-08-05 07:43:38 1249bbe41ac7 sentence_transformers.SentenceTransformer[4248] INFO Use pytorch device_name: cuda
2025-08-05 07:43:38 1249bbe41ac7 sentence_transformers.SentenceTransformer[4248] INFO Load pretrained SentenceTransformer: /app/models/bge-large-zh-v1.5
INFO: 127.0.0.1:42180 - "POST /api/controller/models HTTP/1.1" 200 OK
2025-08-05 07:43:39 1249bbe41ac7 dbgpt.model.cluster.worker.manager[4248] ERROR Error starting worker manager: model DeepSeek-R1-Distill-Qwen-32B@vllm(172.17.0.2:8002) start failed, Traceback (most recent call last):
  File "/app/packages/dbgpt-core/src/dbgpt/model/cluster/worker/manager.py", line 631, in _start_worker
    await self.run_blocking_func(
  File "/app/packages/dbgpt-core/src/dbgpt/model/cluster/worker/manager.py", line 146, in run_blocking_func
    return await loop.run_in_executor(self.executor, func, *args)
  File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/app/packages/dbgpt-core/src/dbgpt/model/cluster/worker/default_worker.py", line 122, in start
    self.model, self.tokenizer = self.ml.loader_with_params(
  File "/app/packages/dbgpt-core/src/dbgpt/model/adapter/loader.py", line 70, in loader_with_params
    return llm_adapter.load_from_params(model_params)
  File "/app/packages/dbgpt-core/src/dbgpt/model/adapter/vllm_adapter.py", line 488, in load_from_params
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 642, in from_engine_args
    engine = cls(
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 592, in __init__
    self.engine = self._engine_class(*args, **kwargs)
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 265, in __init__
    super().__init__(*args, **kwargs)
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 271, in __init__
    self.model_executor = executor_class(vllm_config=vllm_config, )
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 49, in __init__
    self._init_executor()
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 40, in _init_executor
    self.collective_rpc("load_model")
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 49, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/utils.py", line 2208, in run_method
    return func(*args, **kwargs)
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/worker/worker.py", line 182, in load_model
    self.model_runner.load_model()
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1112, in load_model
    self.model = get_model(vllm_config=self.vllm_config)
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/model_executor/model_loader/__init__.py", line 12, in get_model
    return loader.load_model(vllm_config=vllm_config)
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 376, in load_model
    model = _initialize_model(vllm_config=vllm_config)
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 118, in _initialize_model
    return model_class(vllm_config=vllm_config, prefix=prefix)
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 451, in __init__
    self.model = Qwen2Model(vllm_config=vllm_config,
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/compilation/decorators.py", line 149, in __init__
    old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 305, in __init__
    self.start_layer, self.end_layer, self.layers = make_layers(
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/model_executor/models/utils.py", line 555, in make_layers
    [PPMissingLayer() for _ in range(start_layer)] + [
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/model_executor/models/utils.py", line 556, in <listcomp>
    maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 307, in <lambda>
    lambda prefix: Qwen2DecoderLayer(config=config,
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 206, in __init__
    self.self_attn = Qwen2Attention(
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 134, in __init__
    self.qkv_proj = QKVParallelLinear(
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 728, in __init__
    super().__init__(input_size=input_size,
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 311, in __init__
    self.quant_method.create_weights(
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/fp8.py", line 199, in create_weights
    weight = ModelWeightParameter(data=torch.empty(
  File "//opt/.uv.venv/lib/python3.11/site-packages/torch/utils/_device.py", line 106, in __torch_function__
    return func(*args, **kwargs)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 70.00 MiB. GPU 0 has a total capacity of 44.53 GiB of which 15.94 MiB is free. Process 1334940 has 44.51 GiB memory in use. Of the allocated memory 44.17 GiB is allocated by PyTorch, and 14.30 MiB is reserved by PyTorch but unallocated.
If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
;model bge-large-zh-v1.5@hf(172.17.0.2:8002) start successfully
INFO: Shutting down
```
Judging from the latest log, the problem has shifted from a **multi-process CUDA initialization conflict** to **running out of GPU memory** (`CUDA out of memory`). Detailed analysis and solutions follow:
---
### **Core problem**
1. **Out of GPU memory**
   - Key error line:
```
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 70.00 MiB. GPU 0 has a total capacity of 44.53 GiB of which 15.94 MiB is free.
```
   - Cause: while loading `DeepSeek-R1-Distill-Qwen-32B`, GPU 0's memory is already almost fully occupied (44.51 GiB in use), leaving only 15.94 MiB free.
2. **Multi-GPU configuration not taking effect**
   - Although `CUDA_VISIBLE_DEVICES=0,2` exposes GPUs 0 and 2, the log shows the model still being loaded entirely onto GPU 0 (`Process 1334940 has 44.51 GiB memory in use`), i.e. multi-GPU parallelism is not actually configured (`tensor_parallel_size=1`).
3. **Another process is holding GPU memory**
   - GPU 0 is nearly full (44.51 GiB in use), most likely because another process (for example the embedding model `bge-large-zh-v1.5` or a previous run) has not released its memory; the command below shows how to list the holders.
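A quick way to see which processes are actually holding that memory, assuming `nvidia-smi` is available inside the container:

```bash
# List the PIDs currently holding GPU memory and how much each one uses
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```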
---
### **Possible causes**
1. **Model parallelism not enabled**
   - `tensor_parallel_size=1`: the current configuration runs single-GPU inference and does not use the second GPU at all.
2. **GPU memory fragmentation**
   - The log itself suggests setting `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` to avoid fragmentation.
3. **Embedding model consuming GPU memory**
   - The log shows the embedding model `bge-large-zh-v1.5` loading successfully on CUDA, which takes up part of the memory.
---
### **Solutions**
#### 1. **Enable multi-GPU parallelism**
- **Change the configuration**:
  Set `tensor_parallel_size` to `2` (matching `CUDA_VISIBLE_DEVICES=0,2`) in the model section of your config:
```python
model_params = {
    "tensor_parallel_size": 2,            # shard the 32B model across the 2 visible GPUs
    "device": "auto",                     # GPU selection is driven by CUDA_VISIBLE_DEVICES
    "distributed_executor_backend": "mp"  # multiprocessing backend, avoids a Ray dependency
}
```
- **Verify GPU visibility**:
  Run `nvidia-smi` to confirm the memory usage on GPUs 0 and 2 (see the query below).
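For example, a targeted check of just the two GPUs the launch command exposes (again assuming `nvidia-smi` is available):

```bash
# Memory usage on the two GPUs exposed via CUDA_VISIBLE_DEVICES=0,2;
# both should be close to empty before the 32B model is loaded
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv -i 0,2
```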
#### 2. **Free the GPU memory**
- **Kill the process holding it**:
  Use `nvidia-smi` to find the process occupying GPU 0 (PID 1334940) and terminate it:
```bash
kill -9 1334940
```
- **Restart the service**:
  Make sure nothing else is holding GPU memory, then restart DB-GPT (a more careful sequence is sketched below).
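A minimal sketch of that sequence, preferring a graceful stop before `SIGKILL` and confirming the memory actually came back; the PID is the one from your log, adjust as needed:

```bash
# Try a graceful stop first; fall back to SIGKILL only if the process survives
kill 1334940
sleep 5
ps -p 1334940 > /dev/null && kill -9 1334940

# Confirm GPU 0 and GPU 2 are (nearly) empty before restarting DB-GPT
nvidia-smi --query-gpu=index,memory.used --format=csv -i 0,2
```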
#### 3. **Tune the memory settings**
- **Set the allocator policy**:
  Export the environment variables before the launch command (an equivalent one-line form is shown after this list):
```bash
export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"
export CUDA_VISIBLE_DEVICES=0,2
dbgpt start webserver --config /app/configs/dbgpt-local-vllm.toml
```
- **Reduce the batch size**:
  If memory is still tight, lower `max_num_batched_tokens` (for example from 2048 to 1024).
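The same launch can also be written as a single command with inline environment variables, which keeps both settings scoped to this one invocation:

```bash
# One-off launch: allocator policy and GPU selection apply only to this command
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True CUDA_VISIBLE_DEVICES=0,2 \
  dbgpt start webserver --config /app/configs/dbgpt-local-vllm.toml
```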
#### 4. **Check the embedding model**
- The embedding model `bge-large-zh-v1.5` also takes GPU memory; consider disabling it or moving it to CPU (the parameter names below are illustrative, the exact keys depend on your DB-GPT configuration file):
```python
model_params = {
    "embedding_model": None,    # disable the embedding model entirely, or ...
    "embedding_device": "cpu",  # ... keep it but force it onto the CPU
}
```
---
### **Other notes**
1. **Key clues in the log**
   - `Clear torch cache of device: cuda:0` / `cuda:1`: the worker only clears PyTorch's own cache; it cannot free memory held by another process. Note that with `CUDA_VISIBLE_DEVICES=0,2`, `cuda:1` inside the process is physical GPU 2.
   - `Chunked prefill is enabled`: chunked prefill can add memory pressure; the log itself suggests disabling it via `--enable-chunked-prefill=False`, i.e.:
```python
model_params["enable_chunked_prefill"] = False
```
2. **Edge cases**
   - If GPU 2 has no free memory either, it may be a driver or CUDA problem; verify it with the command below (run without restricting `CUDA_VISIBLE_DEVICES`, so that `cuda:2` refers to physical GPU 2); a broader per-GPU check follows after this block:
```bash
python -c "import torch; print(torch.cuda.get_device_properties('cuda:2'))"
```
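A slightly broader sanity check is to print free/total memory for every GPU the process can see, run with the same `CUDA_VISIBLE_DEVICES` you plan to use for DB-GPT (with `0,2` set, the two cards appear as indices 0 and 1):

```bash
# Free/total memory (bytes) for each visible GPU, as reported by PyTorch
CUDA_VISIBLE_DEVICES=0,2 python -c "import torch; [print(i, torch.cuda.get_device_name(i), torch.cuda.mem_get_info(i)) for i in range(torch.cuda.device_count())]"
```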
---
### **Summary**
The root cause is **insufficient memory on a single GPU** combined with **multi-GPU parallelism not being enabled**. Recommended order of operations:
1. Kill the process occupying GPU 0 (try a graceful `kill PID` first, then `kill -9 PID`).
2. Enable multi-GPU parallelism (`tensor_parallel_size=2`).
3. Set the allocator policy (`PYTORCH_CUDA_ALLOC_CONF`).
4. Turn off the GPU embedding model and/or chunked prefill if they are not needed.
If the problem persists, the model itself may simply be too large (32B parameters) for the available hardware; upgrade the hardware or use stronger quantization (for example a 4-bit AWQ/GPTQ checkpoint via `quantization=awq` or `quantization=gptq`; vLLM has no `fp4` option, and the current config already uses `quantization: fp8`).
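As a rough back-of-the-envelope check of why the current setup is so tight (weights only, ignoring KV cache and activations): a 32B-parameter model needs roughly 2 bytes per parameter in bf16 and roughly 1 byte in fp8, which the quick calculation below illustrates.

```bash
# Approximate weight footprint of a 32B-parameter model (GiB), weights only
python -c "print('bf16: %.1f GiB' % (32e9 * 2 / 2**30)); print('fp8 : %.1f GiB' % (32e9 * 1 / 2**30))"
```

At roughly 60 GiB in bf16 the weights alone exceed one 44.53 GiB card, and even the roughly 30 GiB fp8 footprint leaves little headroom while another process holds 44.51 GiB; freeing GPU 0 is the minimum, and `tensor_parallel_size=2` or a 4-bit checkpoint buys much more room.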