Qwen2.5-VL Model Deployment in Practice: From Environment Setup to Inference Calls

Article link: https://blog.csdn.net/Gaga246/article/details/148951873

Qwen2.5-VL, the Tongyi Qianwen 2.5 vision-language model: https://github.com/QwenLM/Qwen2.5-VL

Introduction

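Combining visual perception with natural language processing has become a focus for both industry and academia. Qwen2.5-VL, the flagship vision-language model in the Qwen series, has drawn wide attention for its image-text understanding and its conversational interaction. This article starts from scratch and walks through environment preparation, model download, dependency installation, and hands-on inference scripts for Qwen2.5-VL, with worked examples to help you pick up the key points of deployment and use.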

Creating a Virtual Environment

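# Remove any existing environment with the same name (skip this on a first install)
conda remove -n llmqwen --all
# Create and activate a fresh Python 3.12 environment
conda create -n llmqwen python=3.12
conda activate llmqwen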

Downloading the Model

Download the model you need from the ModelScope community (https://www.modelscope.cn/models?name=qwen&page=1&tabKey=task).

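Install the modelscope package and use it to download the model. The tool supports resumable downloads: if the network is poor, you can kill the command and run it again. Files that were already downloaded are kept, and the download resumes from roughly where the progress bar left off.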

# Install ModelScope
pip install modelscope -i https://pypi.tuna.tsinghua.edu.cn/simple
# Download a complete model repo
modelscope download --model qwen/Qwen2.5-1.5B
modelscope download --model qwen/Qwen2.5-VL-3B-Instruct
# Download a single file (README.md as an example)
modelscope download --model qwen/Qwen2.5-1.5B README.md

Once the download finishes, move the model to a location of your choice (example):

mv /home/xzx/.cache/modelscope/hub/models/qwen/Qwen2___5-VL-3B-Instruct/ /home/xzx/2025-zll/Qwen2.5-VL
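Note that the cache path in the command above spells the model name as `Qwen2___5` rather than `Qwen2.5`. The following is a minimal sketch of that mapping, assuming (based only on the path shown here) that dots in the model name are stored as `___` under a `models/<owner>/` folder. The helper name `modelscope_cache_dir` is hypothetical and not part of the modelscope package.

```python
from pathlib import PurePosixPath


def modelscope_cache_dir(model_id, cache_root):
    """Guess the on-disk cache folder for an id like 'qwen/Qwen2.5-VL-3B-Instruct'.

    Assumption: the cache replaces '.' in the model name with '___',
    as the path in the mv command suggests.
    """
    owner, name = model_id.split("/", 1)
    return PurePosixPath(cache_root) / "models" / owner / name.replace(".", "___")


print(modelscope_cache_dir("qwen/Qwen2.5-VL-3B-Instruct", "/home/xzx/.cache/modelscope/hub"))
# prints /home/xzx/.cache/modelscope/hub/models/qwen/Qwen2___5-VL-3B-Instruct
```

If the printed folder does not exist on your machine, check the final lines of the download command's output, which report where the files were saved.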

After the move, the target directory should contain the model's configuration, tokenizer, and weight files.
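To confirm that the move left a complete model directory, a quick check along these lines may help. This is a sketch under assumptions: the file names in `EXPECTED_FILES` are typical of Hugging Face style checkpoints and are not taken from this article, and both helper functions are hypothetical.

```python
from pathlib import Path

# Assumed file names (typical for Hugging Face style checkpoints); adjust as needed.
EXPECTED_FILES = ("config.json", "tokenizer_config.json", "preprocessor_config.json")


def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]


def has_weights(model_dir):
    """Return True if model_dir holds at least one *.safetensors weight file."""
    root = Path(model_dir)
    return root.is_dir() and any(root.glob("*.safetensors"))


model_dir = "/home/xzx/2025-zll/Qwen2.5-VL"  # target of the mv command above
print("missing files:", missing_model_files(model_dir))
print("weights present:", has_weights(model_dir))
```

An empty "missing files" list together with "weights present: True" suggests the directory is ready to be passed to `from_pretrained`.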

Installing Dependencies

Install pinned versions of transformers and the related packages from PyPI (no source build is needed):
pip install transformers==4.51.3 accelerate -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install "qwen-vl-utils[decord]==0.0.8" -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install torchvision==0.22.1 -i https://pypi.tuna.tsinghua.edu.cn/simple
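Before running inference, it can be useful to confirm that the environment actually holds the pinned versions. The sketch below uses the standard-library `importlib.metadata`. The `PINNED` table is copied from the pip commands above, while the helper `version_mismatches` and its `lookup` parameter are hypothetical names introduced for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Version pins copied from the pip commands above.
PINNED = {"transformers": "4.51.3", "qwen-vl-utils": "0.0.8", "torchvision": "0.22.1"}


def version_mismatches(pinned=PINNED, lookup=version):
    """Return {package: (wanted, found)} for packages that are missing or differ."""
    problems = {}
    for name, wanted in pinned.items():
        try:
            found = lookup(name)
        except PackageNotFoundError:
            found = None
        # Ignore local build tags such as '+cu126' when comparing.
        base = found.split("+")[0] if found is not None else None
        if base != wanted:
            problems[name] = (wanted, found)
    return problems


print(version_mismatches() or "all pinned versions match")
```

A non-empty result names each package together with the wanted and installed versions, where `None` means the package is not installed.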

Inference Script

The contents of zll_test.py are as follows:

from modelscope import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s).
# The path is relative to the working directory; point it at the folder the model was moved to.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto"
)

# default processor
processor = AutoProcessor.from_pretrained("Qwen2.5-VL-3B-Instruct")

# messages = [
#     {
#         "role": "user",
#         "content": [
#             {
#                 "type": "image",
#                 "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
#             },
#             {"type": "text", "text": "Describe this dish and give your assessment of it."},
#         ],
#     }
# ]
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "./2.jpg",
            },
            {"type": "text", "text": "Describe this dish and give your assessment of it."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to the same device the model was loaded on
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print('------------------------------------------begin------------------------------------------')
print(output_text)
print('------------------------------------------end------------------------------------------')
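The list comprehension that builds `generated_ids_trimmed` deserves a note: for this kind of model, `generate` returns each prompt's token ids followed by the newly generated ids, so the script slices off the first `len(in_ids)` entries before decoding. The sketch below shows the same slicing on plain Python lists. The token values and the helper name `trim_prompt` are made up for illustration.

```python
def trim_prompt(input_ids, generated_ids):
    """Drop the echoed prompt tokens from the front of each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Hypothetical token ids: a 4-token prompt followed by 3 newly generated tokens.
prompt_ids = [[101, 7, 8, 9]]
full_output = [[101, 7, 8, 9, 42, 43, 2]]

print(trim_prompt(prompt_ids, full_output))  # prints [[42, 43, 2]]
```

Without this step, the decoded text would start with the chat template and the user prompt instead of the model's answer.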

Inference script for continuous image input, vl.test.py:

import time
import torch
import os
from modelscope import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

def main():
    # Load the model and processor (done only once)
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        "Qwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained("Qwen2.5-VL-3B-Instruct")
    model.eval()

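    print("Qwen2.5-VL interactive inference program")
    print("Enter an image path to get a description; type exit or press Enter to quit.")

    while True:
        img_path = input("\nEnter image path: ").strip()
        if not img_path or img_path.lower() == "exit":
            print("⚠️ Exiting.")
            break

        # Check that the file exists
        if not os.path.isfile(img_path):
            print(f"⚠️ File not found: {img_path}. Check the path and try again.")
            continue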
            continue

        # Build the chat message
        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": img_path},
                    {"type": "text", "text": "Describe this dish: its name, ingredients, cooking method, nutritional information, and so on."},
                ],
            }
        ]
        # Prepare the inputs and move them to the model's device
        text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        image_inputs, video_inputs = process_vision_info(messages)
        inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt").to(model.device)

        # Synchronize the GPU, then start timing
        torch.cuda.synchronize()
        start = time.perf_counter()

        # Inference
        generated_ids = model.generate(**inputs, max_new_tokens=128)

        torch.cuda.synchronize()
        elapsed = (time.perf_counter() - start) * 1000  # milliseconds

        # Decode and print
        trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
        output_text = processor.batch_decode(trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

        print("\n--- Inference result ---")
        print(output_text)
        print(f"Inference time: {elapsed:.1f} ms")

if __name__ == "__main__":
    main()
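The script calls `torch.cuda.synchronize()` on both sides of `model.generate` because GPU work is queued asynchronously. Without the synchronization, the timer could stop before the GPU has finished. The same pattern can be factored into a small helper, sketched below. The name `timed_ms` and its `sync` parameter are hypothetical, and the helper itself does not depend on torch.

```python
import time


def timed_ms(fn, *args, sync=None, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed milliseconds).

    sync is an optional zero-argument callable (for example torch.cuda.synchronize)
    invoked before starting and before stopping the clock.
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    if sync is not None:
        sync()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


# CPU-only demonstration; no GPU or model is needed here.
total, ms = timed_ms(sum, range(1_000_000))
print(f"sum={total}, elapsed={ms:.1f} ms")
```

Inside the loop above, the timing block could then shrink to `generated_ids, elapsed = timed_ms(model.generate, **inputs, max_new_tokens=128, sync=torch.cuda.synchronize)`.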


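With its cross-modal understanding and generation capabilities, Qwen2.5-VL can support a range of industry scenarios, from restaurant and dish reviews to medical image analysis and retail product recognition. Hopefully the environment setup steps and script examples in this article help you get Qwen2.5-VL running in your own projects and open the door to new applications. You are welcome to leave a message on the official account to share your deployment experience and use cases!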