[Training] Qwen2.5VL Multi-Node, Multi-GPU Grounding Box Localization

Previous related articles:

[Deep Learning] Fine-tuning Qwen2-VL with LLaMA-Factory (SFT) for seal recognition

https://www.dong-blog.fun/post/1661

Fine-tuning Qwen2-VL-7B-Instruct with LLaMA-Factory (SFT)

https://www.dong-blog.fun/post/1762

Building the latest LLaMA-Factory image

https://www.dong-blog.fun/post/1799

How to keep grounding boxes from drifting

I went through these issue threads:

  • https://github.com/QwenLM/Qwen2.5-VL/issues/1094
  • https://github.com/QwenLM/Qwen2.5-VL/issues/950
  • https://github.com/QwenLM/Qwen2.5-VL/issues/900
  • https://github.com/QwenLM/Qwen2.5-VL/issues/866
  • https://github.com/QwenLM/Qwen2.5-VL/issues/721
  • https://github.com/QwenLM/Qwen2.5-VL/issues/584
  • https://github.com/QwenLM/Qwen2.5-VL/issues/830
  • https://github.com/QwenLM/Qwen2.5-VL/issues/773

From these, some conclusions about Qwen2.5-VL grounding:

  • The official fine-tuning uses the following format, so no special format is needed in the prompt either:
[
    {"bbox_2d": [x1, y1, x2, y2], "label": "obj_name/description"},
    {"bbox_2d": [x1, y1, x2, y2], "label": "obj_name/description"},
    {"bbox_2d": [x1, y1, x2, y2], "label": "obj_name/description"}
]
  • The official pipeline still resizes first, because the vision backbone cuts images into patches in multiples of 28; without this step the boxes really will drift a bit:
    In Qwen2.5-VL, we first resize the input image so that its width and height are multiples of 28, and then use absolute coordinates on the resized image as the final target.

  • Use (left, top), (right, bottom) coordinates; do not normalize them to 0-1000.

  • Hello, based on your description, I suspect the problem comes from the difference in how Qwen2-VL and Qwen2.5-VL handle bbox coordinates. Specifically, Qwen2.5-VL-7B now uses absolute coordinates instead of the relative coordinates used in Qwen2-VL (which were scaled to [0, 1000]).
    For example, in Qwen2-VL a bounding box [0, 0, 320, 320] in a 640x640 image is represented as (0, 0), (500, 500). In Qwen2.5-VL we use [0, 0, 320, 320] or (0,0),(320,320) directly. Furthermore, if the image is resized to 1280x1280 during augmentation, the coordinates should be scaled accordingly to [0, 0, 640, 640].

    Since Qwen2.5-VL was trained with absolute coordinates, I recommend using the same absolute coordinate system when fine-tuning. If you insist on relative coordinates for some reason, you can train longer and see whether the drift goes away as training continues.

    For reference, the detailed coordinate pipeline is as follows (see the sketch after this list):

    Resize the image so that height and width are multiples of 28:
    resized_w, resized_h = smart_resize(img_w, img_h)
    Rescale the absolute coordinates accordingly:
    new_bbox = bbox / np.array([img_w, img_h, img_w, img_h]) * np.array([resized_w, resized_h, resized_w, resized_h])
    If you still observe obvious grounding drift after switching to absolute coordinates, another possible cause is the image size. If the image is very large or very small (e.g., > 4k x 4k or < 320 x 320), the model will likely output biased bbox results.

  • Only newer versions of transformers are free of the RoPE issue; install: pip install git+https://github.com/huggingface/transformers
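
Putting the above together, the resize-then-rescale step looks roughly like this. This is a minimal sketch: smart_resize mirrors the helper shipped in qwen-vl-utils (the factor and min/max pixel defaults are assumptions taken from that repo), and rescale_bbox is a small helper of my own.

import math
import numpy as np

def smart_resize(height, width, factor=28, min_pixels=56 * 56, max_pixels=14 * 14 * 4 * 1280):
    """Round height/width to multiples of `factor`, keeping the pixel count within [min_pixels, max_pixels]."""
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar

def rescale_bbox(bbox, img_w, img_h, resized_w, resized_h):
    """Map an absolute [x1, y1, x2, y2] box from the original image onto the resized image."""
    scale = np.array([resized_w, resized_h, resized_w, resized_h]) / np.array([img_w, img_h, img_w, img_h])
    return (np.array(bbox, dtype=float) * scale).round().astype(int).tolist()

# Example: a 1000x800 (W x H) image is rounded to multiples of 28, and the box follows.
resized_h, resized_w = smart_resize(800, 1000)     # note: height first, width second
print(resized_w, resized_h)                        # 1008 812
print(rescale_bbox([100, 200, 400, 500], 1000, 800, resized_w, resized_h))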

Processing your own images

Resize images so that height and width are multiples of 28. Use absolute coordinates, with no special prompt format. Use the correct transformers version.
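
Under those constraints, preprocessing your own images can be as simple as the sketch below (a hypothetical helper, reusing smart_resize and rescale_bbox from the earlier sketch; requires Pillow):

from PIL import Image

def preprocess_image(src_path, dst_path, bboxes):
    """Resize one image so width/height are multiples of 28 and remap its absolute boxes."""
    img = Image.open(src_path)
    img_w, img_h = img.size
    resized_h, resized_w = smart_resize(img_h, img_w)   # from the sketch above
    img.resize((resized_w, resized_h), Image.BICUBIC).save(dst_path)
    return [rescale_bbox(b, img_w, img_h, resized_w, resized_h) for b in bboxes]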

LLaMA-Factory's data requirements:

https://llamafactory.readthedocs.io/zh-cn/latest/getting_started/data_preparation.html#id16

How LLaMA-Factory parses this kind of data:

https://www.dong-blog.fun/post/2077

My data sample, xdx_b_intervl8btrain_28.json:

  {
    "messages": [
      {
        "content": "<image>点[56,259]所处位置的信息是什么?",
        "role": "user"
      },
      {
        "content": "<ref>文本-地址</ref><box>[[33, 241, 66, 264]]</box>",
        "role": "assistant"
      }
    ],
    "images": [
      "/img_datasets/img_small_size_28/didichuxing-20240914171548.jpg"
    ]
  }
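
A record like this can be generated with a few lines of Python. A minimal sketch, following the <ref>/<box> target format and the point-style question from my sample above (the helper itself is hypothetical):

import json

def make_record(image_path, point, label, bbox):
    """Build one sharegpt-style record in the same shape as the sample above."""
    return {
        "messages": [
            {"content": f"<image>点[{point[0]},{point[1]}]所处位置的信息是什么?", "role": "user"},
            {"content": f"<ref>{label}</ref><box>[[{', '.join(map(str, bbox))}]]</box>", "role": "assistant"},
        ],
        "images": [image_path],
    }

records = [make_record("/img_datasets/img_small_size_28/didichuxing-20240914171548.jpg",
                       (56, 259), "文本-地址", [33, 241, 66, 264])]
with open("xdx_b_intervl8btrain_28.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)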

The corresponding entry in dataset_info.json should be:

{
    "grounding1": {
      "file_name": "xdx_b_intervl8btrain_28.json",
      "formatting": "sharegpt",
      "columns": {
        "messages": "messages",
        "images": "images"
      },
      "tags": {
        "role_tag": "role",
        "content_tag": "content",
        "user_tag": "user",
        "assistant_tag": "assistant"
      }
    }
}

Launching training

cd LLaMA-Factory

docker run -it --gpus  '"device=0,2,3,4,5,6,7"' \
    -v /data/xiedong/train_qwenvl25_for_grounding/data:/app/data \
    -v ./output:/app/output \
    -v ./examples:/app/examples \
    -v /data/xiedong/train_qwenvl25_for_grounding:/img_datasets \
    -v /data/xiedong/vlm_r1_train_tools/Qwen2.5-VL-7B-Instruct:/Qwen2.5-VL-7B-Instruct \
    --shm-size 32G \
    -p 8034:7860 \
    -p 8035:8000 \
    kevinchina/deeplearning:llamafactory20250311-3 bash

Install swanlab:

pip install swanlab -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

You can open the WebUI to take a look:

llamafactory-cli webui

Single-node training:

llamafactory-cli train \
    --stage sft \
    --do_train True \
    --model_name_or_path /Qwen2.5-VL-7B-Instruct \
    --preprocessing_num_workers 16 \
    --finetuning_type full \
    --template qwen2_vl \
    --flash_attn auto \
    --dataset_dir data \
    --dataset grounding1 \
    --cutoff_len 4096 \
    --learning_rate 5e-05 \
    --num_train_epochs 3.0 \
    --max_samples 100000 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 100 \
    --warmup_steps 0 \
    --packing False \
    --report_to none \
    --output_dir output/Qwen2.5-VL-7B-Instruct/full/train_2025-05-08-07-28-25 \
    --bf16 True \
    --plot_loss True \
    --trust_remote_code True \
    --ddp_timeout 180000000 \
    --include_num_input_tokens_seen True \
    --optim adamw_torch \
    --deepspeed cache/ds_z2_config.json \
    --use_swanlab True \
    --swanlab_project llamafactory \
    --swanlab_mode cloud 
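
The command above points --deepspeed at cache/ds_z2_config.json. For reference, a typical ZeRO-2 config in the style LLaMA-Factory ships under examples/deepspeed looks roughly like this (the bucket sizes and flags here are assumptions; adjust to your setup):

{
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "zero_allow_untested_optimizer": true,
    "bf16": {
        "enabled": "auto"
    },
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 5e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 5e8,
        "contiguous_gradients": true
    }
}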

To add a validation set, append:

    --val_size 0.1 \
    --eval_strategy steps \
    --eval_steps 100 \
    --per_device_eval_batch_size 2 \

To use swanlab:

export SWANLAB_API_KEY=pM7Xvs5OS2EeXPO5gKXfJ   # API key for online tracking; I just put a random value here
export SWANLAB_LOG_DIR=/swanlab_log    # local log directory
export SWANLAB_MODE=cloud     # four modes: cloud (default, online tracking), cloud-only (online only, no local files), local (local tracking only), disabled (no logging, for debugging)
    --use_swanlab True \
    --swanlab_project llamafactory \
    --swanlab_mode cloud \
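
For the multi-node, multi-GPU run promised in the title, LLaMA-Factory launches via torchrun-style environment variables: run the same llamafactory-cli train command on every node and change only NODE_RANK. A sketch, assuming two nodes with the master at 192.168.0.1 (address and port are placeholders):

# Node 0 (master)
FORCE_TORCHRUN=1 NNODES=2 NODE_RANK=0 MASTER_ADDR=192.168.0.1 MASTER_PORT=29500 \
    llamafactory-cli train <same arguments as the single-node command above>

# Node 1
FORCE_TORCHRUN=1 NNODES=2 NODE_RANK=1 MASTER_ADDR=192.168.0.1 MASTER_PORT=29500 \
    llamafactory-cli train <same arguments as the single-node command above>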