教程 | 使用SAHI结合supervision进行小目标检测

希言自然_happy

于 2025-08-07 15:24:42 发布

阅读量630

点赞数 2

CC 4.0 BY-SA版权

文章标签：目标检测人工智能计算机视觉

本文链接：https://blog.csdn.net/weixin_45276304/article/details/150012744

小目标检测 `supervision.InferenceSlicer`

教程来源于https://github.com/roboflow/supervision
本文对其进行了翻译整理
需要翻译后的.pynb和md文件自行下载链接：https://pan.quark.cn/s/53ae70342f47

本指南展示了如何使用切片辅助超推理 (SAHI) 结合 supervision 进行小目标检测。
在这里插入图片描述
点击 Open in Colab按钮在 Google Colab上运行.

在你开始之前

You’ll need:

需要注册一个Roboflow账号 Create one here.
一个来自 Roboflow 的 API 密钥。需要帮助获取吗？在此了解更多。

安装所需的包

让我们安装这个项目的依赖项。以下是所需依赖项的列表：

inference: 由 Roboflow 开发的一个用于轻松部署计算机视觉模型的包。
supervision: 由 Roboflow 开发的一个包，用于提供构建和管理计算机视觉应用程序的实用工具。

%pip install inference supervision jupyter_compare_view

使用计算机视觉进行人群计数

你会如何解决人群中人数统计的问题呢？经过一些测试后，我发现最佳方法是检测人的头部。其他身体部位很可能会被其他人遮挡，但头部通常是暴露在外的，尤其是在航拍或高位拍摄的画面中。

使用开源公共模型进行人群检测

检测人（或他们的头部）是一个常见的问题，过去已有许多研究人员对此进行过探讨。在这个项目中，我们将使用一个开源的公共数据集和一个经过微调的模型对图像进行推理。

项目 “people_counterv0 计算机视觉项目” 的一些细节：

包含 4574 张图像的数据集
平均精度均值（mAP）= 49.2% / 精确率 = 74.5% / 召回率 = 39.2%
模型：Roboflow 2.0 目标检测（快速版）
检查点：COCOv6n
创建者：SIT

导入模块

运行以下代码下载并加载本教程所需的模块

import math
import os
import time

import cv2
import matplotlib.pyplot as plt
import numpy as np
import supervision as sv
from inference import get_model
from jupyter_compare_view import compare

下载图片

# Download the image
!wget -O human_tower.jpg "https://upload.wikimedia.org/wikipedia/commons/thumb/d/d0/4_de_8_amb_l%27agulla_carregat_Castellers_de_Barcelona_%2821937141066%29.jpg/2560px-4_de_8_amb_l%27agulla_carregat_Castellers_de_Barcelona_%2821937141066%29.jpg"```

```python
image = cv2.imread("human_tower.jpg")
image_wh = (image.shape[1], image.shape[0])
print(f"Image shape: {image_wh[0]}w x {image_wh[1]}h")
sv.plot_image(image)

你正在查看的是一座“卡斯特尔”（Castell），这是一种传统上在西班牙加泰罗尼亚部分地区的节日期间搭建的人塔，后来这种习俗传播到了巴利阿里群岛和瓦伦西亚自治区。图片来源在此，你可以在维基百科上了解更多关于这些人塔的信息。

让我们测试一下模型的性能

在深入探讨用于小目标检测的SAHI技术之前，先看看经过微调的模型在未经任何预处理或切片的原始图像上的表现是很有帮助的。这样做的目的是了解模型何时会失效，以便我们逐步制定出高效的切片策略。

让我们运行模型吧！

MODEL_ID = "people_counterv0/1"
API_KEY = "" # Retrieve your API key: https://docs.roboflow.com/api-reference/authentication

# If using Google Colab
#from google.colab import userdata
#API_KEY = userdata.get("ROBOFLOW_API_KEY")  #Retrieve your API key: https://docs.roboflow.com/api-reference/authentication

model = get_model(MODEL_ID, api_key=API_KEY)

# Run inference
results = model.infer(image, model_id=MODEL_ID)
detections = sv.Detections.from_inference(results[0])

print(f"Found {len(detections)} people")

bbox_annotator = sv.BoxAnnotator(
    color=sv.ColorPalette.DEFAULT.colors[6],
    thickness=2
)

# Annotate our image with detections.
image_no_sahi = bbox_annotator.annotate(scene=image.copy(), detections=detections)

sv.plot_image(image_no_sahi)

比较源图像与未检测到SAHI的图像

resize_image = (600, 400)

bgr_image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
bgr_image_no_sahi = cv2.cvtColor(image_no_sahi, cv2.COLOR_RGB2BGR)

# Resize the images for better comparison
bgr_image = cv2.resize(bgr_image, resize_image)
bgr_image_no_sahi = cv2.resize(bgr_image_no_sahi, resize_image)

compare(bgr_image, bgr_image_no_sahi, start_mode="horizontal", start_slider_pos=0.5)

该模型在检测图像下半部分的人时表现出色，但在上半部分的边界框预测上却难以做到准确。这带来了两个关键的洞察：其一，该模型能够熟练地从不同角度识别出人的头部；其二，使用 SAHI 可以有效解决图像上半部分的检测难题。现在，是时候尝试一下 SAHI 了！

使用 `sv.InferenceSlicer` 进行小目标检测

InferenceSlicer 是一个用于在大图像上执行基于切片的推理的工具，特别适用于检测小目标。它将大图像分割成较小的切片，对每个切片进行推理，然后合并结果，以形成整个图像的最终检测结果。这种方法被称为切片自适应超推理（Slicing Adaptive Hyper Inference，SAHI），通过聚焦于全尺寸推理中可能遗漏小目标的较小区域，提高了检测精度。

主要功能：

切片策略：将图像分割成可配置大小和重叠的较小切片。
重叠管理：支持不同的重叠策略（基于比例或基于像素），以确保切片之间的过渡平滑。
检测合并：使用非极大值抑制（Non-Maximum Suppression，NMS）或非极大值合并（Non-Maximum Merging，NMM）合并所有切片的检测结果，以处理重叠的检测框。
并行处理：利用多线程并发地对切片进行推理，提高处理速度。
自定义推理回调：允许用户定义自己的推理函数，以便灵活集成各种检测模型。

SAHI 可以被视为一个旨在解决小目标检测挑战的框架。supervision 库中的 InferenceSlicer 类提供了 SAHI 的实现，您可以按如下方式轻松使用：

import cv2
import supervision as sv
from ultralytics import YOLO

image = cv2.imread(SOURCE_IMAGE_PATH)
model = YOLO(...)

def callback(image_slice: np.ndarray) -> sv.Detections:
    result = model(image_slice)[0]
    return sv.Detections.from_ultralytics(result)

slicer = sv.InferenceSlicer(
    # A function that performs inference on a given image slice and returns detections.
    callback=callback,
    # Strategy for filtering or merging overlapping detections in slices.
    overlap_filter=sv.OverlapFilter.NON_MAX_SUPPRESSION,
    # Dimensions of each slice measured in pixels. The tuple should be in the format (width, height).
    slice_wh=(100, 100)
)

detections = slicer(image)

查看 sv.InferenceSlicer 的文档请点击这里。

使用 `supervision` 对图像进行切片

让我们首先可视化这些图块在图像上的显示效果。我们从一个 2x2 的小图块集合开始，图块之间在垂直方向（高度）和水平方向（宽度）上的重叠为零。这些参数的最终取值将取决于你的具体使用场景，因此建议进行反复试验！

下面的一些方法用于可视化图块及其重叠情况。在你的应用程序中，你只需要使用 calculate_tile_size 方法来计算图块的大小。

用于可视化图块的实用函数

def tile_image(image_shape: tuple[int, int], slice_wh: tuple[int, int], overlap_wh: tuple[float, float])-> np.ndarray:
    """
    Computes the coordinates and dimensions of tiles for an image with specified slicing and overlap parameters.
    """
    offsets = sv.InferenceSlicer._generate_offset(
        resolution_wh=image_shape,
        slice_wh=slice_wh,
        overlap_ratio_wh=None,
        overlap_wh=overlap_wh
    )

    offsets = np.ceil(offsets).astype(int)

    return offsets

def draw_transparent_tiles(scene: np.ndarray, x: int, y: int, w:int, h:int) -> np.ndarray:
    """
    Draws a transparent tile with an optional index label on the given scene.
    """
    alpha=0.15

    # Generate a mask for the tile
    rectangle = np.zeros((h, w, 3), dtype=np.uint8)
    rectangle.fill(255)

    rect = sv.Rect(x=x, y=y, width=w, height=h)
    overlay_image = sv.draw_image(scene=scene.copy(), image=rectangle, opacity=alpha, rect=rect)

    # Draw a border around the edge of the mask
    border_color = sv.Color.BLACK
    border_thickness=2
    overlay_image = sv.draw_rectangle(
        scene=overlay_image,
        rect=sv.Rect(x=x, y=y, width=w, height=h),
        color=border_color,
        thickness=border_thickness
    )

    return overlay_image

def draw_tiles(scene: np.ndarray, offsets):
    """
    Draws transparent tiles on a scene based on the given offsets.
    """

    tiled_image = scene.copy()

    for index, offset in enumerate(offsets):
        x = offset[0]
        y = offset[1]
        width = offset[2] - x
        height = offset[3] - y

        tiled_image = draw_transparent_tiles(scene=tiled_image, x=x, y=y, w=width, h=height)

    return tiled_image

def print_offsets(offsets):
    for index, (x1, y1, x2, y2) in enumerate(offsets, 1):
        w, h = x2 - x1, y2 - y1
        print(f"Tile {index + 1}")
        print(f"  w={w}, h={h}, x1={x1}, y1={y1}, x2={x2}, y2={y2}, area={w*h}")

计算图块大小

重要提示：从 supervision==0.23.0 版本开始，您需要手动提供图块大小。您可以使用下面的函数来计算它。

calculate_tile_size 函数在将图像划分为网格时，通过考虑以下参数来确定所需的图块尺寸：

图像尺寸：图像的宽度和高度，以 (宽度, 高度) 的形式指定，例如 (1024, 768)。
网格布局：图块的数量，以 (行数, 列数) 的形式指定，例如 (2, 2)。
重叠区域：相邻图块之间的重叠百分比，水平和垂直重叠分别指定，例如 (0.1, 0.1)。

它返回一个元组，包含：

图块大小：一个元组，表示每个图块的宽度和高度，包括相邻图块之间的重叠区域 (overlap_wh)。
重叠尺寸：一个元组，表示图块之间的像素重叠量 (overlap_wh)。如果重叠比例设置为 (0.0, 0.0)，则该值为 (0, 0)，表示没有重叠。

例如：

>>> image_shape = (1024, 768)
>>> tiles = (4, 4)
>>> overlap_ratio_wh = (0.15, 0.15)
>>> calculate_tile_size(image_shape, tiles, overlap_ratio_wh)
((295, 221), (39, 29))

def calculate_tile_size(image_shape: tuple[int, int], tiles: tuple[int, int], overlap_ratio_wh: tuple[float, float] = (0.0, 0.0)):
    """
    Calculate the size of the tiles based on the image shape, the number of tiles, and the overlap ratio.

    Parameters:
    ----------
    image_shape : tuple[int, int]
        The dimensions of the image as (width, height).

    tiles : tuple[int, int]
        The tiling strategy defined as (rows, columns), specifying the number of tiles along the height and width of the image.

    overlap_ratio_wh : tuple[float, float], optional
        The overlap ratio for width and height as (overlap_ratio_w, overlap_ratio_h). This defines the fraction of overlap between adjacent tiles. Default is (0.0, 0.0), meaning no overlap.

    Returns:
    -------
    tuple[tuple[int, int], tuple[int, int]]
        A tuple containing:
        - The size of each tile as (tile_width, tile_height), accounting for overlap.
        - The overlap dimensions as (overlap_width, overlap_height).

    Example:
    -------
    >>> image_shape = (1024, 768)
    >>> tiles = (4, 4)
    >>> overlap_ratio_wh = (0.15, 0.15)
    >>> calculate_tile_size(image_shape, tiles, overlap_ratio_wh)
    ((295, 221), (39, 29))
    """

    w, h = image_shape
    rows, columns = tiles

    tile_width = (w / columns)
    tile_height = (h / rows)
    overlap_w, overlap_h = overlap_ratio_wh

    tile_width = math.ceil(w / columns * (1 + overlap_w))
    tile_height = math.ceil(h / rows * (1 + overlap_h))
    overlap_wh = (math.ceil(tile_width * overlap_w), math.ceil(tile_height * overlap_h))

    return (tile_width, tile_height), overlap_wh

可视化图像切片

tiles = (2,2)
overlap_ratio_wh = (0.0, 0.0) # The overlap between tiles
slice_wh, overlap_wh = calculate_tile_size(image_wh, tiles, overlap_ratio_wh)
offsets = tile_image(image_wh, slice_wh, overlap_wh)

print(f"Image shape: {image_wh[0]}w x {image_wh[1]}h")
print(f"Tiles: {tiles}")
print(f"Tile size: {slice_wh[0]}w x {image_wh[1]}")
print(f"Generated {len(offsets)} tiles. These are the calculated dimensions")
print_offsets(offsets)

tiled_image = draw_tiles(scene=image.copy(), offsets=offsets)

sv.plot_image(tiled_image)

你可以看到图像已被切割成四个不同的图块。接下来，模型将独立处理每个图块，并且 supervision 会将所有预测结果合并为一组连贯的检测结果。请注意，我们目前没有使用重叠切割（稍后会详细介绍）。

使用 `supervision` 对切割后的图像进行推理

使用来自 Supervision 的 InferenceSlicer 类，对图像切片进行推理非常简单。这个来自 Roboflow 的 API 会将较大的图像分割成较小的切片，对每个切片进行推理，然后将检测结果合并到单个 detections 对象中。

def callback(image_slice: np.ndarray) -> sv.Detections:
  result = get_model(model_id=MODEL_ID, api_key=API_KEY).infer(image_slice )[0]
  return sv.Detections.from_inference(result)

tiles = (2,2) # The number of tiles you want
overlap_ratio_wh = (0.0, 0.0) # The overlap between tiles
slice_wh, overlap_wh = calculate_tile_size(image_wh, tiles, overlap_ratio_wh)
offsets = tile_image(image_wh, slice_wh, overlap_wh)

slicer = sv.InferenceSlicer(
  callback=callback,
  slice_wh=slice_wh,
  overlap_ratio_wh=None,
  overlap_wh=overlap_wh,
  thread_workers=4
)

detections = slicer(image)

print(f"Image shape: {image_wh[0]}w x {image_wh[1]}h")
print(f"Tiles: {tiles}")
print(f"Tile size: {slice_wh[0]}w x {image_wh[1]}")
print(f"Overlap: {overlap_wh[0]}w x {overlap_wh[1]}h. Ratio {overlap_ratio_wh}")
print(f"Found {len(detections)} people")

tiled_image_2x2 = draw_tiles(scene=image.copy(), offsets=offsets)
tiled_image_2x2 = bbox_annotator.annotate(scene=tiled_image_2x2, detections=detections)

sv.plot_image(image=tiled_image_2x2, size=(20, 20))

太棒了！我们检测到了 726 个人，相比最初未使用图像切片时检测到的 185 人有了显著提升。模型仍然能够从不同角度检测到人，但对于广场远处的人，检测仍然存在困难。现在是时候增加图块数量了 —— 换句话说，放大图像，这样模型就能捕捉到更小的人头细节。

Missing detections

增加图块密度：改用 5x5 网格

既然我们已经看到 2x2 网格带来的改进，现在是时候进一步优化模型了。将图块数量增加到 5x5 网格，我们实际上是对图像进行了放大，这样模型就能捕捉到更精细的细节，比如之前可能遗漏的更小、更远的物体。这种方法将帮助我们了解模型在放大程度更高的图像上的表现。让我们来探索这种改变如何影响我们的检测准确率和整体性能。


def callback(image_slice: np.ndarray) -> sv.Detections:
  result = get_model(model_id=MODEL_ID, api_key=API_KEY).infer(image_slice )[0]
  return sv.Detections.from_inference(result)

tiles = (5,5) # The number of tiles you want
overlap_ratio_wh = (0.0, 0.0) # The overlap between tiles
slice_wh, overlap_wh = calculate_tile_size(image_wh, tiles, overlap_ratio_wh)
offsets = tile_image(image_wh, slice_wh, overlap_wh)

slicer = sv.InferenceSlicer(
  callback=callback,
  slice_wh=slice_wh,
  overlap_wh=overlap_wh,
  overlap_ratio_wh=None,
  thread_workers=4
)

detections = slicer(image)

print(f"Image shape: {image_wh[0]}w x {image_wh[1]}h")
print(f"Tiles: {tiles}")
print(f"Tile size: {slice_wh[0]}w x {image_wh[1]}")
print(f"Overlap: {overlap_wh[0]}w x {overlap_wh[1]}h. Ratio {overlap_ratio_wh}")
print(f"Overlap filter: {sv.OverlapFilter.NON_MAX_SUPPRESSION}")
print(f"Found {len(detections)} people")

tiled_image_5x5 = draw_tiles(scene=image.copy(), offsets=offsets)
tiled_image_5x5 = bbox_annotator.annotate(scene=tiled_image_5x5, detections=detections)

sv.plot_image(image=tiled_image_5x5, size=(20, 20),)

我们刚刚使用 25 个图块的网格（5 行 x 5 列）检测到了 1494 人，与使用 4 个图块（2x2）的网格检测到的 726 人相比有了显著增加。然而，随着图块数量的增加，一个新的挑战出现了：图块边缘出现重复检测或漏检的情况。在这些示例中，这种问题变得很明显，图块之间的重叠或间隙会导致我们模型的检测结果不准确。

使用重叠图块改进边界附近的目标检测

当物体（如人）出现在图块边缘时，如果它们跨越了两个图块，可能会被重复检测或完全漏检。这会导致检测结果不准确。为了解决这个问题，我们使用重叠的图块，让模型能够同时看到相邻图块的部分内容。这种重叠有助于确保边界附近的物体被完整捕获，减少重复检测并提高检测精度。
我们将在图块的宽度和高度上设置重叠比例为 (0.2, 0.2)。这种重叠有助于确保边界附近的物体被完整捕获，减少重复检测并提高检测精度。

tiles = (5,5) # The number of tiles you want
overlap_ratio_wh = (0.15, 0.15) # Ratio of overlapping, width/height

slice_wh, overlap_wh = calculate_tile_size(image_wh, tiles, overlap_ratio_wh)
offsets = tile_image(image_wh, slice_wh, overlap_wh)

slicer = sv.InferenceSlicer(
  callback=callback,
  overlap_filter=sv.OverlapFilter.NON_MAX_SUPPRESSION,
  iou_threshold=0.1,
  slice_wh=slice_wh,
  overlap_ratio_wh=None,
  overlap_wh=overlap_wh,
  thread_workers=4
)

detections = slicer(image)

print(f"Image shape: {image_wh[0]}w x {image_wh[1]}h")
print(f"Tiles: {tiles}")
print(f"Tile size: {slice_wh[0]}w x {image_wh[1]}")
print(f"Overlap: {overlap_wh[0]}w x {overlap_wh[1]}h. Ratio {overlap_ratio_wh}")
print(f"Overlap Filter: {sv.OverlapFilter.NON_MAX_SUPPRESSION}")
print(f"Found {len(detections)} people")

tiled_image_5x5_nms = draw_tiles(scene=image.copy(), offsets=offsets)
tiled_image_5x5_nms = bbox_annotator.annotate(scene=tiled_image_5x5_nms, detections=detections)

sv.plot_image(image=tiled_image_5x5_nms, size=(20, 20))

非极大值抑制 vs 非极大值合并

在处理重叠的检测结果时，确定哪些检测结果代表同一个对象，哪些是唯一的至关重要。非极大值抑制（Non-Maximum Suppression，NMS）和非极大值合并（Non-Maximum Merging，NMM）是两种常用于解决这一问题的技术。NMS 通过基于置信度分数消除冗余的检测结果来工作，而 NMM 则将重叠的检测结果合并，以增强对跨越多个图块的对象的表示。理解这两种方法之间的差异有助于优化目标检测，特别是在图块边界附近。
在 supervision 中，overlap_filter 参数允许我们指定处理切片中重叠检测结果的策略。该参数可以取两个值：

sv.OverlapFilter.NON_MAX_SUPRESSION（默认值）：通过保留置信度分数最高的检测结果来消除冗余的检测结果。
sv.OverlapFilter.NON_MAX_MERGE：合并重叠的检测结果，以创建对跨越多个图块的对象更全面的表示。
需要注意的是，这种方法并不完美，可能需要进一步的测试和微调，才能在各种用例中获得最佳结果。你应该验证输出结果，并根据需要调整参数，以有效地处理特定场景。
sv.OverlapFilter.NON_MAX_SUPRESSION（默认值）：通过保留置信度分数最高的检测结果来消除冗余的检测结果。
sv.OverlapFilter.NON_MAX_MERGE：合并重叠的检测结果，以创建对跨越多个图块的对象更全面的表示。

需要注意的是，这种方法并不完美，可能需要进一步的测试和微调，才能在各种用例中获得最佳结果。你应该验证输出结果，并根据需要调整参数，以有效地处理特定场景。

tiles = (5,5) # The number of tiles you want
overlap_ratio_wh = (0.15, 0.15) # The overlap ratio: 20% width, 20% height

slice_wh, overlap_wh = calculate_tile_size(image_wh, tiles, overlap_ratio_wh)
offsets = tile_image(image_wh, slice_wh, overlap_wh)

slicer = sv.InferenceSlicer(
  callback=callback,
  overlap_filter=sv.OverlapFilter.NON_MAX_MERGE,
  #iou_threshold=0.1,
  slice_wh=slice_wh,
  overlap_ratio_wh=None,
  overlap_wh=overlap_wh,
  thread_workers=4
)

detections = slicer(image)

print(f"Image shape: {image_wh[0]}w x {image_wh[1]}h")
print(f"Tile size: {slice_wh[0]}w x {image_wh[1]}")
print(f"Overlap: {overlap_wh[0]}w x {overlap_wh[1]}h. Ratio {overlap_ratio_wh}")
print(f"Overlap Filter: {sv.OverlapFilter.NON_MAX_MERGE}")
print(f"Found {len(detections)} people")

tiled_image_5x5_nmm = draw_tiles(scene=image.copy(), offsets=offsets)
tiled_image_5x5_nmm = bbox_annotator.annotate(scene=tiled_image_5x5_nmm, detections=detections)

sv.plot_image(image=tiled_image_5x5_nmm, size=(20, 20))

比较图像与Sahi处理后的图像

resize_image = (600, 400)

bgr_image = cv2.cvtColor(image.copy(), cv2.COLOR_RGB2BGR)
tiled_image = bbox_annotator.annotate(scene=image.copy(), detections=detections)
bgr_tiled_image = cv2.cvtColor(tiled_image, cv2.COLOR_RGB2BGR)

# Resize the images for better comparison
tiled_image = cv2.resize(bgr_image, resize_image)
bgr_tiled_image = cv2.resize(bgr_tiled_image, resize_image)

compare(tiled_image, bgr_tiled_image, start_mode="horizontal", start_slider_pos=0.5)

结论

在本指南中，我们探索了使用 SAHI 技术增强小目标检测的优势，以及尝试各种切片策略以有效放大图像的重要性。通过结合这些方法，我们可以提高目标检测模型的准确性和可靠性，特别是在目标较小或位于切片边界附近的具有挑战性的场景中。这些方法为计算机视觉中的常见挑战提供了实用的解决方案，使开发人员能够构建更强大、更精确的检测系统。