[Training] Qwen2.5VL Multi-Node, Multi-GPU Grounding Box Localization

Previous related articles:

[Deep Learning] Fine-tuning Qwen2-VL with LLaMA-Factory (SFT) for seal recognition

https://www.dong-blog.fun/post/1661

Fine-tuning Qwen2-VL-7B-Instruct with LLaMA-Factory (SFT)

https://www.dong-blog.fun/post/1762

Building the latest LLaMA-Factory image

https://www.dong-blog.fun/post/1799

How to keep grounding boxes from drifting

I went through these issue threads:

  • https://github.com/QwenLM/Qwen2.5-VL/issues/1094
  • https://github.com/QwenLM/Qwen2.5-VL/issues/950
  • https://github.com/QwenLM/Qwen2.5-VL/issues/900
  • https://github.com/QwenLM/Qwen2.5-VL/issues/866
  • https://github.com/QwenLM/Qwen2.5-VL/issues/721
  • https://github.com/QwenLM/Qwen2.5-VL/issues/584
  • https://github.com/QwenLM/Qwen2.5-VL/issues/830
  • https://github.com/QwenLM/Qwen2.5-VL/issues/773

From these, some conclusions about Qwen2.5-VL grounding:

  • The official fine-tuning uses the following format, so no special format is needed in the prompt either:
[
    {"bbox_2d": [x1, y1, x2, y2], "label": "obj_name/description"},
    {"bbox_2d": [x1, y1, x2, y2], "label": "obj_name/description"},
    {"bbox_2d": [x1, y1, x2, y2], "label": "obj_name/description"}
]
  • The official pipeline still resizes first, because the vision backbone cuts images into patches in multiples of 28; without this step the boxes really will drift a bit:
    In Qwen2.5-VL, we first resize the input image so that its width and height are multiples of 28, and then use absolute coordinates on the resized image as the final target.

  • Use (left, top), (right, bottom) coordinates; do not normalize them to 0-1000.

  • Hello, based on your description, I suspect the problem comes from the difference in how Qwen2-VL and Qwen2.5-VL handle bbox coordinates. Specifically, Qwen2.5-VL-7B now uses absolute coordinates instead of the relative coordinates used in Qwen2-VL (which were scaled to [0, 1000]).
    For example, in Qwen2-VL a bounding box [0, 0, 320, 320] in a 640x640 image is represented as (0, 0), (500, 500). In Qwen2.5-VL we use [0, 0, 320, 320] or (0,0),(320,320) directly. Furthermore, if the image is resized to 1280x1280 during augmentation, the coordinates should be scaled accordingly to [0, 0, 640, 640].

    Since Qwen2.5-VL was trained with absolute coordinates, I recommend using the same absolute coordinate system when fine-tuning. If you insist on relative coordinates for some reason, you can train longer and see whether the drift goes away as training continues.

    For reference, the detailed coordinate pipeline is as follows (see the sketch after this list):

    Resize the image so that height and width are multiples of 28:
    resized_w, resized_h = smart_resize(img_w, img_h)
    Rescale the absolute coordinates accordingly:
    new_bbox = bbox / np.array([img_w, img_h, img_w, img_h]) * np.array([resized_w, resized_h, resized_w, resized_h])
    If you still observe obvious grounding drift after switching to absolute coordinates, another possible cause is the image size. If the image is very large or very small (e.g., > 4k x 4k or < 320 x 320), the model will likely output biased bbox results.

  • Only newer versions of transformers are free of the RoPE issue; install: pip install git+https://github.com/huggingface/transformers
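
Putting the above together, the resize-then-rescale step looks roughly like this. This is a minimal sketch: smart_resize mirrors the helper shipped in qwen-vl-utils (the factor and min/max pixel defaults are assumptions taken from that repo), and rescale_bbox is a small helper of my own.

import math
import numpy as np

def smart_resize(height, width, factor=28, min_pixels=56 * 56, max_pixels=14 * 14 * 4 * 1280):
    """Round height/width to multiples of `factor`, keeping the pixel count within [min_pixels, max_pixels]."""
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar

def rescale_bbox(bbox, img_w, img_h, resized_w, resized_h):
    """Map an absolute [x1, y1, x2, y2] box from the original image onto the resized image."""
    scale = np.array([resized_w, resized_h, resized_w, resized_h]) / np.array([img_w, img_h, img_w, img_h])
    return (np.array(bbox, dtype=float) * scale).round().astype(int).tolist()

# Example: a 1000x800 (W x H) image is rounded to multiples of 28, and the box follows.
resized_h, resized_w = smart_resize(800, 1000)     # note: height first, width second
print(resized_w, resized_h)                        # 1008 812
print(rescale_bbox([100, 200, 400, 500], 1000, 800, resized_w, resized_h))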

Processing your own images

Resize images so that height and width are multiples of 28. Use absolute coordinates, with no special prompt format. Use the correct transformers version.
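
Under those constraints, preprocessing your own images can be as simple as the sketch below (a hypothetical helper, reusing smart_resize and rescale_bbox from the earlier sketch; requires Pillow):

from PIL import Image

def preprocess_image(src_path, dst_path, bboxes):
    """Resize one image so width/height are multiples of 28 and remap its absolute boxes."""
    img = Image.open(src_path)
    img_w, img_h = img.size
    resized_h, resized_w = smart_resize(img_h, img_w)   # from the sketch above
    img.resize((resized_w, resized_h), Image.BICUBIC).save(dst_path)
    return [rescale_bbox(b, img_w, img_h, resized_w, resized_h) for b in bboxes]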

LLaMA-Factory's data requirements:

https://llamafactory.readthedocs.io/zh-cn/latest/getting_started/data_preparation.html#id16

How LLaMA-Factory parses this kind of data:

https://www.dong-blog.fun/post/2077

My data sample, xdx_b_intervl8btrain_28.json:

  {
    "messages": [
      {
        "content": "<image>点[56,259]所处位置的信息是什么?",
        "role": "user"
      },
      {
        "content": "<ref>文本-地址</ref><box>[[33, 241, 66, 264]]</box>",
        "role": "assistant"
      }
    ],
    "images": [
      "/img_datasets/img_small_size_28/didichuxing-20240914171548.jpg"
    ]
  }
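
A record like this can be generated with a few lines of Python. A minimal sketch, following the <ref>/<box> target format and the point-style question from my sample above (the helper itself is hypothetical):

import json

def make_record(image_path, point, label, bbox):
    """Build one sharegpt-style record in the same shape as the sample above."""
    return {
        "messages": [
            {"content": f"<image>点[{point[0]},{point[1]}]所处位置的信息是什么?", "role": "user"},
            {"content": f"<ref>{label}</ref><box>[[{', '.join(map(str, bbox))}]]</box>", "role": "assistant"},
        ],
        "images": [image_path],
    }

records = [make_record("/img_datasets/img_small_size_28/didichuxing-20240914171548.jpg",
                       (56, 259), "文本-地址", [33, 241, 66, 264])]
with open("xdx_b_intervl8btrain_28.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)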

The corresponding entry in dataset_info.json should be:

{
    "grounding1": {
      "file_name": "xdx_b_intervl8btrain_28.json",
      "formatting": "sharegpt",
      "columns": {
        "messages": "messages",
        "images": "images"
      },
      "tags": {
        "role_tag": "role",
        "content_tag": "content",
        "user_tag": "user",
        "assistant_tag": "assistant"
      }
    }
}

Launching training

cd LLaMA-Factory

docker run -it --gpus  '"device=0,2,3,4,5,6,7"' \
    -v /data/xiedong/train_qwenvl25_for_grounding/data:/app/data \
    -v ./output:/app/output \
    -v ./examples:/app/examples \
    -v /data/xiedong/train_qwenvl25_for_grounding:/img_datasets \
    -v /data/xiedong/vlm_r1_train_tools/Qwen2.5-VL-7B-Instruct:/Qwen2.5-VL-7B-Instruct \
    --shm-size 32G \
    -p 8034:7860 \
    -p 8035:8000 \
    kevinchina/deeplearning:llamafactory20250311-3 bash

Install swanlab:

pip install swanlab -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

You can open the WebUI to take a look:

llamafactory-cli webui

Single-node training:

llamafactory-cli train \
    --stage sft \
    --do_train True \
    --model_name_or_path /Qwen2.5-VL-7B-Instruct \
    --preprocessing_num_workers 16 \
    --finetuning_type full \
    --template qwen2_vl \
    --flash_attn auto \
    --dataset_dir data \
    --dataset grounding1 \
    --cutoff_len 4096 \
    --learning_rate 5e-05 \
    --num_train_epochs 3.0 \
    --max_samples 100000 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 100 \
    --warmup_steps 0 \
    --packing False \
    --report_to none \
    --output_dir output/Qwen2.5-VL-7B-Instruct/full/train_2025-05-08-07-28-25 \
    --bf16 True \
    --plot_loss True \
    --trust_remote_code True \
    --ddp_timeout 180000000 \
    --include_num_input_tokens_seen True \
    --optim adamw_torch \
    --deepspeed cache/ds_z2_config.json \
    --use_swanlab True \
    --swanlab_project llamafactory \
    --swanlab_mode cloud 
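
The command above points --deepspeed at cache/ds_z2_config.json. For reference, a typical ZeRO-2 config in the style LLaMA-Factory ships under examples/deepspeed looks roughly like this (the bucket sizes and flags here are assumptions; adjust to your setup):

{
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "zero_allow_untested_optimizer": true,
    "bf16": {
        "enabled": "auto"
    },
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 5e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 5e8,
        "contiguous_gradients": true
    }
}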

To add a validation set, append:

    --val_size 0.1 \
    --eval_strategy steps \
    --eval_steps 100 \
    --per_device_eval_batch_size 2 \

To use swanlab:

export SWANLAB_API_KEY=pM7Xvs5OS2EeXPO5gKXfJ   # API key for online tracking; I just put a random value here
export SWANLAB_LOG_DIR=/swanlab_log    # local log directory
export SWANLAB_MODE=cloud     # four modes: cloud (default, online tracking), cloud-only (online only, no local files), local (local tracking only), disabled (no logging, for debugging)
    --use_swanlab True \
    --swanlab_project llamafactory \
    --swanlab_mode cloud \
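
For the multi-node, multi-GPU run promised in the title, LLaMA-Factory launches via torchrun-style environment variables: run the same llamafactory-cli train command on every node and change only NODE_RANK. A sketch, assuming two nodes with the master at 192.168.0.1 (address and port are placeholders):

# Node 0 (master)
FORCE_TORCHRUN=1 NNODES=2 NODE_RANK=0 MASTER_ADDR=192.168.0.1 MASTER_PORT=29500 \
    llamafactory-cli train <same arguments as the single-node command above>

# Node 1
FORCE_TORCHRUN=1 NNODES=2 NODE_RANK=1 MASTER_ADDR=192.168.0.1 MASTER_PORT=29500 \
    llamafactory-cli train <same arguments as the single-node command above>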