Running `CUDA_VISIBLE_DEVICES=0,2 dbgpt start webserver --config /app/configs/dbgpt-local-vllm.toml` fails with the following error:

```
=========================== VLLMDeployModelParameters ===========================
name: DeepSeek-R1-Distill-Qwen-32B
provider: vllm
verbose: False
concurrency: 100
backend: None
prompt_template: None
context_length: None
reasoning_model: None
path: models/DeepSeek-R1-Distill-Qwen-32B
device: auto
trust_remote_code: True
download_dir: None
load_format: auto
config_format: auto
dtype: auto
kv_cache_dtype: auto
seed: 0
max_model_len: None
distributed_executor_backend: None
pipeline_parallel_size: 1
tensor_parallel_size: 1
max_parallel_loading_workers: None
block_size: None
enable_prefix_caching: None
swap_space: 4.0
cpu_offload_gb: 0.0
gpu_memory_utilization: 0.9
max_num_batched_tokens: None
max_num_seqs: 2
max_logprobs: 20
revision: None
code_revision: None
tokenizer_revision: None
tokenizer_mode: auto
quantization: fp8
max_seq_len_to_capture: 8192
worker_cls: auto
extras: None
======================================================================
2025-08-05 07:43:32 1249bbe41ac7 dbgpt.util.code.server[4248] INFO Code server is ready
INFO 08-05 07:43:36 config.py:520] This model supports multiple tasks: {'score', 'classify', 'generate', 'reward', 'embed'}. Defaulting to 'generate'.
WARNING 08-05 07:43:36 arg_utils.py:1107] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 08-05 07:43:36 config.py:1483] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 08-05 07:43:37 llm_engine.py:232] Initializing an LLM engine (v0.7.0) with config: model='/app/models/DeepSeek-R1-Distill-Qwen-32B', speculative_config=None, tokenizer='/app/models/DeepSeek-R1-Distill-Qwen-32B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/app/models/DeepSeek-R1-Distill-Qwen-32B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[2,1],"max_capture_size":2}, use_cached_outputs=False,
INFO 08-05 07:43:37 cuda.py:225] Using Flash Attention backend.
INFO 08-05 07:43:37 model_runner.py:1110] Starting to load model /app/models/DeepSeek-R1-Distill-Qwen-32B...
/app/packages/dbgpt-core/src/dbgpt/util/model_utils.py:27: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
  if (hasattr(backends, "mps") and backends.mps.is_built()) or torch.has_mps:
2025-08-05 07:43:38 1249bbe41ac7 dbgpt.util.model_utils[4248] INFO Clear torch cache of device: cuda:0
2025-08-05 07:43:38 1249bbe41ac7 dbgpt.util.model_utils[4248] INFO Clear torch cache of device: cuda:1
2025-08-05 07:43:38 1249bbe41ac7 dbgpt.model.cluster.worker.embedding_worker[4248] INFO Load embeddings model: bge-large-zh-v1.5
2025-08-05 07:43:38 1249bbe41ac7 datasets[4248] INFO PyTorch version 2.5.1+cu121 available.
2025-08-05 07:43:38 1249bbe41ac7 datasets[4248] INFO Duckdb version 1.2.0 available.
2025-08-05 07:43:38 1249bbe41ac7 sentence_transformers.SentenceTransformer[4248] INFO Use pytorch device_name: cuda
2025-08-05 07:43:38 1249bbe41ac7 sentence_transformers.SentenceTransformer[4248] INFO Load pretrained SentenceTransformer: /app/models/bge-large-zh-v1.5
INFO: 127.0.0.1:42180 - "POST /api/controller/models HTTP/1.1" 200 OK
2025-08-05 07:43:39 1249bbe41ac7 dbgpt.model.cluster.worker.manager[4248] ERROR Error starting worker manager: model DeepSeek-R1-Distill-Qwen-32B@vllm(172.17.0.2:8002) start failed, Traceback (most recent call last):
  File "/app/packages/dbgpt-core/src/dbgpt/model/cluster/worker/manager.py", line 631, in _start_worker
    await self.run_blocking_func(
  File "/app/packages/dbgpt-core/src/dbgpt/model/cluster/worker/manager.py", line 146, in run_blocking_func
    return await loop.run_in_executor(self.executor, func, *args)
  File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/app/packages/dbgpt-core/src/dbgpt/model/cluster/worker/default_worker.py", line 122, in start
    self.model, self.tokenizer = self.ml.loader_with_params(
  File "/app/packages/dbgpt-core/src/dbgpt/model/adapter/loader.py", line 70, in loader_with_params
    return llm_adapter.load_from_params(model_params)
  File "/app/packages/dbgpt-core/src/dbgpt/model/adapter/vllm_adapter.py", line 488, in load_from_params
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 642, in from_engine_args
    engine = cls(
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 592, in __init__
    self.engine = self._engine_class(*args, **kwargs)
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 265, in __init__
    super().__init__(*args, **kwargs)
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 271, in __init__
    self.model_executor = executor_class(vllm_config=vllm_config, )
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 49, in __init__
    self._init_executor()
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 40, in _init_executor
    self.collective_rpc("load_model")
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 49, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/utils.py", line 2208, in run_method
    return func(*args, **kwargs)
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/worker/worker.py", line 182, in load_model
    self.model_runner.load_model()
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1112, in load_model
    self.model = get_model(vllm_config=self.vllm_config)
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/model_executor/model_loader/__init__.py", line 12, in get_model
    return loader.load_model(vllm_config=vllm_config)
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 376, in load_model
    model = _initialize_model(vllm_config=vllm_config)
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 118, in _initialize_model
    return model_class(vllm_config=vllm_config, prefix=prefix)
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 451, in __init__
    self.model = Qwen2Model(vllm_config=vllm_config,
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/compilation/decorators.py", line 149, in __init__
    old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 305, in __init__
    self.start_layer, self.end_layer, self.layers = make_layers(
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/model_executor/models/utils.py", line 555, in make_layers
    [PPMissingLayer() for _ in range(start_layer)] + [
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/model_executor/models/utils.py", line 556, in <listcomp>
    maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 307, in <lambda>
    lambda prefix: Qwen2DecoderLayer(config=config,
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 206, in __init__
    self.self_attn = Qwen2Attention(
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 134, in __init__
    self.qkv_proj = QKVParallelLinear(
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 728, in __init__
    super().__init__(input_size=input_size,
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 311, in __init__
    self.quant_method.create_weights(
  File "//opt/.uv.venv/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/fp8.py", line 199, in create_weights
    weight = ModelWeightParameter(data=torch.empty(
  File "//opt/.uv.venv/lib/python3.11/site-packages/torch/utils/_device.py", line 106, in __torch_function__
    return func(*args, **kwargs)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 70.00 MiB. GPU 0 has a total capacity of 44.53 GiB of which 15.94 MiB is free. Process 1334940 has 44.51 GiB memory in use. Of the allocated memory 44.17 GiB is allocated by PyTorch, and 14.30 MiB is reserved by PyTorch but unallocated.
If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
;model bge-large-zh-v1.5@hf(172.17.0.2:8002) start successfully
INFO: Shutting down
```
Judging from the latest log, the problem has shifted from a **multi-process CUDA initialization conflict** to **running out of GPU memory** (`CUDA out of memory`). Detailed analysis and solutions follow:
---
### **Core problem**
1. **Out of GPU memory**
   - Key error line:
```
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 70.00 MiB. GPU 0 has a total capacity of 44.53 GiB of which 15.94 MiB is free.
```
   - Cause: while loading `DeepSeek-R1-Distill-Qwen-32B`, GPU 0's memory is already almost fully occupied (44.51 GiB in use), leaving only 15.94 MiB free.
2. **Multi-GPU configuration not taking effect**
   - Although `CUDA_VISIBLE_DEVICES=0,2` exposes GPUs 0 and 2, the log shows the model still being loaded entirely onto GPU 0 (`Process 1334940 has 44.51 GiB memory in use`), i.e. multi-GPU parallelism is not actually configured (`tensor_parallel_size=1`).
3. **Another process is holding GPU memory**
   - GPU 0 is nearly full (44.51 GiB in use), most likely because another process (for example the embedding model `bge-large-zh-v1.5` or a previous run) has not released its memory; the command below shows how to list the holders.
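A quick way to see which processes are actually holding that memory, assuming `nvidia-smi` is available inside the container:

```bash
# List the PIDs currently holding GPU memory and how much each one uses
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```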
---
### **Possible causes**
1. **Model parallelism not enabled**
   - `tensor_parallel_size=1`: the current configuration runs single-GPU inference and does not use the second GPU at all.
2. **GPU memory fragmentation**
   - The log itself suggests setting `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` to avoid fragmentation.
3. **Embedding model consuming GPU memory**
   - The log shows the embedding model `bge-large-zh-v1.5` loading successfully on CUDA, which takes up part of the memory.
---
### **Solutions**
#### 1. **Enable multi-GPU parallelism**
- **Change the configuration**:
  Set `tensor_parallel_size` to `2` (matching `CUDA_VISIBLE_DEVICES=0,2`) in the model section of your config:
```python
model_params = {
    "tensor_parallel_size": 2,            # shard the 32B model across the 2 visible GPUs
    "device": "auto",                     # GPU selection is driven by CUDA_VISIBLE_DEVICES
    "distributed_executor_backend": "mp"  # multiprocessing backend, avoids a Ray dependency
}
```
- **Verify GPU visibility**:
  Run `nvidia-smi` to confirm the memory usage on GPUs 0 and 2 (see the query below).
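For example, a targeted check of just the two GPUs the launch command exposes (again assuming `nvidia-smi` is available):

```bash
# Memory usage on the two GPUs exposed via CUDA_VISIBLE_DEVICES=0,2;
# both should be close to empty before the 32B model is loaded
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv -i 0,2
```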
#### 2. **Free the GPU memory**
- **Kill the process holding it**:
  Use `nvidia-smi` to find the process occupying GPU 0 (PID 1334940) and terminate it:
```bash
kill -9 1334940
```
- **Restart the service**:
  Make sure nothing else is holding GPU memory, then restart DB-GPT (a more careful sequence is sketched below).
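A minimal sketch of that sequence, preferring a graceful stop before `SIGKILL` and confirming the memory actually came back; the PID is the one from your log, adjust as needed:

```bash
# Try a graceful stop first; fall back to SIGKILL only if the process survives
kill 1334940
sleep 5
ps -p 1334940 > /dev/null && kill -9 1334940

# Confirm GPU 0 and GPU 2 are (nearly) empty before restarting DB-GPT
nvidia-smi --query-gpu=index,memory.used --format=csv -i 0,2
```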
#### 3. **Tune the memory settings**
- **Set the allocator policy**:
  Export the environment variables before the launch command (an equivalent one-line form is shown after this list):
```bash
export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"
export CUDA_VISIBLE_DEVICES=0,2
dbgpt start webserver --config /app/configs/dbgpt-local-vllm.toml
```
- **Reduce the batch size**:
  If memory is still tight, lower `max_num_batched_tokens` (for example from 2048 to 1024).
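The same launch can also be written as a single command with inline environment variables, which keeps both settings scoped to this one invocation:

```bash
# One-off launch: allocator policy and GPU selection apply only to this command
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True CUDA_VISIBLE_DEVICES=0,2 \
  dbgpt start webserver --config /app/configs/dbgpt-local-vllm.toml
```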
#### 4. **Check the embedding model**
- The embedding model `bge-large-zh-v1.5` also takes GPU memory; consider disabling it or moving it to CPU (the parameter names below are illustrative, the exact keys depend on your DB-GPT configuration file):
```python
model_params = {
    "embedding_model": None,    # disable the embedding model entirely, or ...
    "embedding_device": "cpu",  # ... keep it but force it onto the CPU
}
```
---
### **Other notes**
1. **Key clues in the log**
   - `Clear torch cache of device: cuda:0` / `cuda:1`: the worker only clears PyTorch's own cache; it cannot free memory held by another process. Note that with `CUDA_VISIBLE_DEVICES=0,2`, `cuda:1` inside the process is physical GPU 2.
   - `Chunked prefill is enabled`: chunked prefill can add memory pressure; the log itself suggests disabling it via `--enable-chunked-prefill=False`, i.e.:
```python
model_params["enable_chunked_prefill"] = False
```
2. **Edge cases**
   - If GPU 2 has no free memory either, it may be a driver or CUDA problem; verify it with the command below (run without restricting `CUDA_VISIBLE_DEVICES`, so that `cuda:2` refers to physical GPU 2); a broader per-GPU check follows after this block:
```bash
python -c "import torch; print(torch.cuda.get_device_properties('cuda:2'))"
```
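A slightly broader sanity check is to print free/total memory for every GPU the process can see, run with the same `CUDA_VISIBLE_DEVICES` you plan to use for DB-GPT (with `0,2` set, the two cards appear as indices 0 and 1):

```bash
# Free/total memory (bytes) for each visible GPU, as reported by PyTorch
CUDA_VISIBLE_DEVICES=0,2 python -c "import torch; [print(i, torch.cuda.get_device_name(i), torch.cuda.mem_get_info(i)) for i in range(torch.cuda.device_count())]"
```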
---
### **Summary**
The root cause is **insufficient memory on a single GPU** combined with **multi-GPU parallelism not being enabled**. Recommended order of operations:
1. Kill the process occupying GPU 0 (try a graceful `kill PID` first, then `kill -9 PID`).
2. Enable multi-GPU parallelism (`tensor_parallel_size=2`).
3. Set the allocator policy (`PYTORCH_CUDA_ALLOC_CONF`).
4. Turn off the GPU embedding model and/or chunked prefill if they are not needed.
If the problem persists, the model itself may simply be too large (32B parameters) for the available hardware; upgrade the hardware or use stronger quantization (for example a 4-bit AWQ/GPTQ checkpoint via `quantization=awq` or `quantization=gptq`; vLLM has no `fp4` option, and the current config already uses `quantization: fp8`).
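As a rough back-of-the-envelope check of why the current setup is so tight (weights only, ignoring KV cache and activations): a 32B-parameter model needs roughly 2 bytes per parameter in bf16 and roughly 1 byte in fp8, which the quick calculation below illustrates.

```bash
# Approximate weight footprint of a 32B-parameter model (GiB), weights only
python -c "print('bf16: %.1f GiB' % (32e9 * 2 / 2**30)); print('fp8 : %.1f GiB' % (32e9 * 1 / 2**30))"
```

At roughly 60 GiB in bf16 the weights alone exceed one 44.53 GiB card, and even the roughly 30 GiB fp8 footprint leaves little headroom while another process holds 44.51 GiB; freeing GPU 0 is the minimum, and `tensor_parallel_size=2` or a 4-bit checkpoint buys much more room.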