YOLO-World: Real-Time Open-Vocabulary Object Detection
Tianheng Cheng3,2,∗, Lin Song1,∗,✉, Yixiao Ge1,2,†, Wenyu Liu3, Xinggang Wang3,✉, Ying Shan1,2
∗ equal contribution   † project lead   ✉ corresponding author
1 Tencent AI Lab   2 ARC Lab, Tencent PCG   3 School of EIC, Huazhong University of Science & Technology
demonstrated the promising performance of pre-training large detectors, while pre-training small detectors to endow them with open recognition capabilities remains unexplored.

In this paper, we present YOLO-World, aiming for high-efficiency open-vocabulary object detection, and explore large-scale pre-training schemes to boost the traditional YOLO detectors to a new open-vocabulary world. Compared to previous methods, the proposed YOLO-World is remarkably efficient with high inference speed and easy to deploy for downstream applications. Specifically, YOLO-World follows the standard YOLO architecture [19] and leverages the pre-trained CLIP [36] text encoder to encode the input texts. We further propose the Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) to connect text features and image features for better visual-semantic representation. During inference, the text encoder can be removed and the text embeddings can be re-parameterized into weights of RepVL-PAN for efficient deployment. We further investigate the open-vocabulary pre-training scheme for YOLO detectors through region-text contrastive learning on large-scale datasets, which unifies detection data, grounding data, and image-text data into region-text pairs. The pre-trained YOLO-World with abundant region-text pairs demonstrates a strong capability for large-vocabulary detection, and training with more data leads to greater improvements in open-vocabulary capability.

In addition, we explore a prompt-then-detect paradigm to further improve the efficiency of open-vocabulary object detection in real-world scenarios. As illustrated in Fig. 2, traditional object detectors [15, 19, 22, 38–40, 49] concentrate on fixed-vocabulary (close-set) detection with predefined and trained categories, while previous open-vocabulary detectors [23, 29, 53, 56] encode the prompts of a user with text encoders as an online vocabulary and then detect objects. Notably, those methods tend to employ large detectors with heavy backbones, e.g., Swin-L [31], to increase the open-vocabulary capacity. In contrast, the prompt-then-detect paradigm (Fig. 2 (c)) first encodes the prompts of a user to build an offline vocabulary, and the vocabulary varies with different needs. Then, the efficient detector can infer the offline vocabulary on the fly without re-encoding the prompts. For practical applications, once we have trained the detector, i.e., YOLO-World, we can pre-encode the prompts or categories to build an offline vocabulary and then seamlessly integrate it into the detector.

Our main contributions can be summarized into three folds:
• We introduce YOLO-World, a cutting-edge open-vocabulary object detector with high efficiency for real-world applications.
• We propose a Re-parameterizable Vision-Language PAN to connect vision and language features and an open-vocabulary region-text contrastive pre-training scheme for YOLO-World.
• The proposed YOLO-World pre-trained on large-scale datasets demonstrates strong zero-shot performance and achieves 35.4 AP on LVIS with 52.0 FPS. The pre-trained YOLO-World can be easily adapted to downstream tasks, e.g., open-vocabulary instance segmentation and referring object detection. Moreover, the pre-trained weights and codes of YOLO-World will be open-sourced to facilitate more practical applications.

2. Related Works

2.1. Traditional Object Detection

Prevalent object detection research concentrates on fixed-vocabulary (close-set) detection, in which object detectors are trained on datasets with pre-defined categories, e.g., the COCO dataset [25] and the Objects365 dataset [43], and then detect objects within the fixed set of categories. During the past decades, the methods for traditional object detection can be simply categorized into three groups, i.e., region-based methods, pixel-based methods, and query-based methods. The region-based methods [10, 11, 15, 26, 41], such as Faster R-CNN [41], adopt a two-stage framework for proposal generation [41] and RoI-wise (Region-of-Interest) classification and regression. The pixel-based methods [27, 30, 39, 46, 58] tend to be one-stage detectors, which perform classification and regression over pre-defined anchors or pixels. DETR [1] first explores object detection through transformers [47] and inspires extensive query-based methods [61]. In terms of inference speed, Redmon et al. present the YOLOs [37–39], which exploit simple convolutional architectures for real-time object detection. Several works [9, 22, 32, 49, 52] propose various architectures or designs for YOLO, including path aggregation networks [28], cross-stage partial networks [48], and re-parameterization [6], which further improve both speed and accuracy. In comparison to previous YOLOs, YOLO-World in this paper aims to detect objects beyond the fixed vocabulary with strong generalization ability.

2.2. Open-Vocabulary Object Detection

Open-vocabulary object detection (OVD) [55] has emerged as a new trend for modern object detection, which aims to detect objects beyond the predefined categories. Early works [12] follow the standard OVD setting [55] by training detectors on the base classes and evaluating the novel (unknown) classes. Although this open-vocabulary setting can evaluate the capability of detectors to detect and recognize novel objects, it is still limited for open scenarios and lacks generalization ability to other domains due to training on the limited dataset and vocabulary.
Figure 2. Comparison with Detection Paradigms. (a) Traditional Object Detector: These object detectors can only detect objects within the fixed vocabulary pre-defined by the training datasets, e.g., the 80 categories of the COCO dataset [25]. The fixed vocabulary limits the extension for open scenes. (b) Previous Open-Vocabulary Detectors: Previous methods tend to develop large and heavy detectors for open-vocabulary detection, which intuitively have strong capacity. In addition, these detectors simultaneously encode images and texts as input for prediction, which is time-consuming for practical applications. (c) YOLO-World: We demonstrate the strong open-vocabulary performance of lightweight detectors, e.g., YOLO detectors [19, 39], which is of great significance for real-world applications. Rather than using an online vocabulary, we present a prompt-then-detect paradigm for efficient inference, in which the user generates a series of prompts according to the need and the prompts are encoded into an offline vocabulary. The offline vocabulary can then be re-parameterized into the model weights for deployment and further acceleration.
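To make the prompt-then-detect paradigm of Fig. 2 (c) concrete, the following is a minimal, hypothetical sketch rather than the released YOLO-World API: the user's prompts are encoded once into an offline vocabulary, and only the vision branch runs per image afterwards. The text encoder is simulated here with random normalized embeddings; a real pipeline would use the pre-trained CLIP text encoder.

```python
import torch
import torch.nn.functional as F

prompts = ["person", "red helmet", "dog"]

# Offline step (done once): encode the user's prompts into a vocabulary.
# Random normalized vectors stand in for CLIP text embeddings in this sketch.
torch.manual_seed(0)
offline_vocab = F.normalize(torch.randn(len(prompts), 512), dim=-1)      # (C, D)

def score_objects(object_embeddings: torch.Tensor, vocab: torch.Tensor) -> torch.Tensor:
    """Match K object embeddings produced by the detector against the cached vocabulary."""
    return F.normalize(object_embeddings, dim=-1) @ vocab.t()            # (K, C) similarities

# Online step (per image or video frame): no text encoder in the loop.
for frame_id in range(3):
    obj_emb = torch.randn(8, 512)                 # placeholder for detector outputs
    sim = score_objects(obj_emb, offline_vocab)
    print(frame_id, sim.argmax(dim=-1).tolist())  # best-matching prompt per object
```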
Figure 3. Overall Architecture of YOLO-World. Compared to traditional YOLO detectors, YOLO-World as an open-vocabulary detector adopts text as input. The Text Encoder first encodes the input text into text embeddings. Then the Image Encoder encodes the input image into multi-scale image features, and the proposed RepVL-PAN exploits multi-level cross-modality fusion for both image and text features. Finally, YOLO-World predicts the regressed bounding boxes and the object embeddings for matching the categories or nouns that appear in the input text.
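As a rough illustration of the data flow in Fig. 3, the sketch below wires stand-in modules (a single-scale convolutional "backbone", an embedding table in place of the CLIP text encoder, and generic cross-attention in place of RepVL-PAN) into the same text-to-embedding, image-to-feature, fuse-then-predict pipeline; it is not the actual implementation.

```python
import torch
import torch.nn as nn

class YOLOWorldSketch(nn.Module):
    """Schematic data flow of Fig. 3 built from stand-in modules (not the real implementation)."""
    def __init__(self, dim: int = 256, vocab_size: int = 1000):
        super().__init__()
        self.image_encoder = nn.Conv2d(3, dim, 3, stride=8, padding=1)  # stand-in backbone, one scale only
        self.text_encoder = nn.Embedding(vocab_size, dim)               # stand-in for the CLIP text encoder
        self.fusion = nn.MultiheadAttention(dim, 4, batch_first=True)   # stand-in for RepVL-PAN fusion
        self.box_head = nn.Conv2d(dim, 4, 1)                            # bounding-box regression
        self.obj_head = nn.Conv2d(dim, dim, 1)                          # object embeddings

    def forward(self, image: torch.Tensor, token_ids: torch.Tensor):
        feats = self.image_encoder(image)                 # (B, D, H, W); multi-scale in the real model
        w = self.text_encoder(token_ids)                  # (B, C, D) text embeddings
        b, d, h, wd = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)         # (B, HW, D)
        fused, _ = self.fusion(tokens, w, w)              # image features attend to text embeddings
        fused = fused.transpose(1, 2).reshape(b, d, h, wd)
        return self.box_head(fused), self.obj_head(fused), w  # boxes, object embeddings, text embeddings

model = YOLOWorldSketch()
boxes, obj_emb, w = model(torch.randn(1, 3, 64, 64), torch.randint(0, 1000, (1, 5)))
print(boxes.shape, obj_emb.shape, w.shape)
```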
Inspired by vision-language pre-training [18, 36], recent works [7, 21, 50, 59, 60] formulate open-vocabulary object detection as image-text matching and exploit large-scale image-text data to increase the training vocabulary at scale. GLIP [23] presents a pre-training framework for open-vocabulary detection based on phrase grounding and evaluates in a zero-shot setting. Grounding DINO [29] incorporates the grounded pre-training [23] into detection transformers [57] with cross-modality fusions. Several methods [24, 53, 54, 56] unify detection datasets and image-text datasets through region-text matching and pre-train detectors with large-scale image-text pairs, achieving promising performance and generalization. However, these methods often use heavy detectors like ATSS [58] or DINO [57] with Swin-L [31] as a backbone, leading to high computational demands and deployment challenges. In contrast, we present YOLO-World, aiming for efficient open-vocabulary object detection with real-time inference and easier downstream application deployment. Differing from ZSD-YOLO [51], which also explores open-vocabulary detection [55] with YOLO through language model alignment, YOLO-World introduces a novel YOLO framework with an effective pre-training strategy, enhancing open-vocabulary performance and generalization.
3. Method

3.1. Pre-training Formulation: Region-Text Pairs

The traditional object detection methods, including the YOLO-series [19], are trained with instance annotations $\Omega = \{B_i, c_i\}_{i=1}^{N}$, which consist of bounding boxes $\{B_i\}$ and category labels $\{c_i\}$. In this paper, we reformulate the instance annotations as region-text pairs $\Omega = \{B_i, t_i\}_{i=1}^{N}$, where $t_i$ is the corresponding text for the region $B_i$. Specifically, the text $t_i$ can be the category name, noun phrases, or object descriptions. Moreover, YOLO-World adopts both the image $I$ and texts $T$ (a set of nouns) as input and outputs predicted boxes $\{\hat{B}_k\}$ and the corresponding object embeddings $\{e_k\}$ ($e_k \in \mathbb{R}^{D}$).

3.2. Model Architecture

The overall architecture of the proposed YOLO-World is illustrated in Fig. 3, which consists of a YOLO detector, a Text Encoder, and a Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN). Given the input text, the text encoder in YOLO-World encodes the text into text embeddings. The image encoder in the YOLO detector extracts the multi-scale features from the input image. Then we leverage the RepVL-PAN to enhance both text and image representations by exploiting the cross-modality fusion between image features and text embeddings.

YOLO Detector. YOLO-World is mainly developed based on YOLOv8 [19], which contains a Darknet backbone [19, 40] as the image encoder, a path aggregation network (PAN) for multi-scale feature pyramids, and a head for bounding box regression and object embeddings.

Text Encoder. Given the text $T$, we adopt the Transformer text encoder pre-trained by CLIP [36] to extract the corresponding text embeddings $W = \text{TextEncoder}(T) \in \mathbb{R}^{C \times D}$, where $C$ is the number of nouns and $D$ is the embedding dimension. The CLIP text encoder offers better visual-semantic capabilities for connecting visual objects with texts compared to text-only language encoders [5]. When the input text is a caption or referring expression, we adopt the simple n-gram algorithm to extract the noun phrases and then feed them into the text encoder.

Text Contrastive Head. Following previous works [19], we adopt the decoupled head with two 3×3 convs to regress bounding boxes $\{b_k\}_{k=1}^{K}$ and object embeddings $\{e_k\}_{k=1}^{K}$, where $K$ denotes the number of objects. We present a text contrastive head to obtain the object-text similarity $s_{k,j}$ by:

$s_{k,j} = \alpha \cdot \text{L2-Norm}(e_k) \cdot \text{L2-Norm}(w_j)^{\top} + \beta$,   (1)

where $\text{L2-Norm}(\cdot)$ is the L2 normalization and $w_j \in W$ is the $j$-th text embedding. In addition, we add the affine transformation with the learnable scaling factor $\alpha$ and shifting factor $\beta$. Both the L2 norms and the affine transformations are important for stabilizing the region-text training.

Training with Online Vocabulary. During training, we construct an online vocabulary $T$ for each mosaic sample containing 4 images. Specifically, we sample all positive nouns involved in the mosaic images and randomly sample some negative nouns from the corresponding dataset. The vocabulary for each mosaic sample contains at most $M$ nouns, and $M$ is set to 80 by default.

Inference with Offline Vocabulary. At the inference stage, we present a prompt-then-detect strategy with an offline vocabulary for further efficiency. As shown in Fig. 3, the user can define a series of custom prompts, which might include captions or categories. We then utilize the text encoder to encode these prompts and obtain offline vocabulary embeddings. The offline vocabulary allows for avoiding computation for each input and provides the flexibility to adjust the vocabulary as needed.

3.3. Re-parameterizable Vision-Language PAN

Fig. 4 shows the structure of the proposed RepVL-PAN, which follows the top-down and bottom-up paths in [19, 28] to establish the feature pyramids $\{P_3, P_4, P_5\}$ with the multi-scale image features $\{C_3, C_4, C_5\}$. Furthermore, we propose the Text-guided CSPLayer (T-CSPLayer) and Image-Pooling Attention (I-Pooling Attention) to further enhance the interaction between image features and text features, which can improve the visual-semantic representation for open-vocabulary capability. During inference, the offline vocabulary embeddings can be re-parameterized into weights of convolutional or linear layers for deployment.

Text-guided CSPLayer. As Fig. 4 illustrates, the cross-stage partial layers (CSPLayer) are utilized after the top-down or bottom-up fusion. We extend the CSPLayer (also called C2f) of [19] by incorporating text guidance into multi-scale image features to form the Text-guided CSPLayer. Specifically, given the text embeddings $W$ and image features $X_l \in \mathbb{R}^{H \times W \times D}$ ($l \in \{3, 4, 5\}$), we adopt the max-sigmoid attention after the last dark bottleneck block to aggregate text features into image features by:

$X_l' = X_l \cdot \delta\big(\max_{j \in \{1..C\}} (X_l W_j^{\top})\big)^{\top}$,   (2)

where the updated $X_l'$ is concatenated with the cross-stage features as output and $\delta$ indicates the sigmoid function.
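A minimal, framework-free sketch of the max-sigmoid attention in Eq. (2) is given below; the actual T-CSPLayer additionally wraps this operation in dark bottleneck blocks and cross-stage concatenation, which are omitted here.

```python
import torch

def max_sigmoid_attention(x_l: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Eq. (2): re-weight pixel features by their best-matching text embedding.

    x_l: image features of one pyramid level, shape (B, H, W, D)
    w:   text embeddings, shape (C, D)
    """
    sim = torch.einsum("bhwd,cd->bhwc", x_l, w)             # per-pixel similarity to each noun
    gate = sim.max(dim=-1, keepdim=True).values.sigmoid()   # max over the vocabulary, then sigmoid
    return x_l * gate                                       # broadcast over the channel dimension

# toy shapes: batch 2, 20x20 feature map, D = 256, vocabulary of 80 nouns
x = torch.randn(2, 20, 20, 256)
w = torch.randn(80, 256)
print(max_sigmoid_attention(x, w).shape)                    # torch.Size([2, 20, 20, 256])
```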
Image-Pooling Attention. To enhance the text embeddings with image-aware information, we aggregate image features to update the text embeddings by proposing the Image-Pooling Attention. Rather than directly using cross-attention on image features, we leverage max pooling on multi-scale features to obtain 3×3 regions, resulting in a total of 27 patch tokens $\tilde{X} \in \mathbb{R}^{27 \times D}$. The text embeddings are then updated by:

$W' = W + \text{MultiHead-Attention}(W, \tilde{X}, \tilde{X})$.   (3)

Figure 4. Illustration of the RepVL-PAN. The proposed RepVL-PAN adopts the Text-guided CSPLayer (T-CSPLayer) for injecting language information into image features and the Image-Pooling Attention (I-Pooling Attention) for enhancing image-aware text embeddings.

3.4. Pre-training Schemes

In this section, we present the training schemes for pre-training YOLO-World on large-scale detection, grounding, and image-text datasets.

Learning from Region-Text Contrastive Loss. Given the mosaic sample $I$ and texts $T$, YOLO-World outputs $K$ object predictions $\{B_k, s_k\}_{k=1}^{K}$ along with annotations $\Omega = \{B_i, t_i\}_{i=1}^{N}$. We follow [19] and leverage task-aligned label assignment [8] to match the predictions with ground-truth annotations and assign each positive prediction with a text index as the classification label. Based on this vocabulary, we construct the region-text contrastive loss $\mathcal{L}_{\text{con}}$ with region-text pairs through cross entropy between the object-text (region-text) similarity and the object-text assignments. In addition, we adopt the IoU loss and the distributed focal loss for bounding box regression, and the total training loss is defined as $\mathcal{L}(I) = \mathcal{L}_{\text{con}} + \lambda_I \cdot (\mathcal{L}_{\text{iou}} + \mathcal{L}_{\text{dfl}})$, where $\lambda_I$ is an indicator factor set to 1 when the input image $I$ is from detection or grounding data and set to 0 when it is from image-text data. Considering that image-text datasets have noisy boxes, we only calculate the regression loss for samples with accurate bounding boxes.

Pseudo Labeling with Image-Text Data. Rather than directly using image-text pairs for pre-training, we propose an automatic labeling approach to generate region-text pairs. Specifically, the labeling approach contains three steps: (1) extract noun phrases: we first utilize the n-gram algorithm to extract noun phrases from the text; (2) pseudo labeling: we adopt a pre-trained open-vocabulary detector, e.g., GLIP [23], to generate pseudo boxes for the given noun phrases for each image, thus providing the coarse region-text pairs; (3) filtering: we employ the pre-trained CLIP [36] to evaluate the relevance of image-text pairs and region-text pairs, and filter the low-relevance pseudo annotations and images. We further filter redundant bounding boxes by incorporating methods such as Non-Maximum Suppression (NMS). We suggest the readers refer to the appendix for the detailed approach. With the above approach, we sample and label 246k images from CC3M [44] with 821k pseudo annotations.

4. Experiments

In this section, we demonstrate the effectiveness of the proposed YOLO-World by pre-training it on large-scale datasets and evaluating YOLO-World in a zero-shot manner on both the LVIS benchmark and the COCO benchmark (Sec. 4.2). We also evaluate the fine-tuning performance of YOLO-World on COCO and LVIS for object detection.

4.1. Implementation Details

The YOLO-World is developed based on the MMYOLO toolbox [3] and the MMDetection toolbox [2]. Following [19], we provide three variants of YOLO-World for different latency requirements, e.g., small (S), medium (M), and large (L). We adopt the open-source CLIP [36] text encoder with pre-trained weights to encode the input text. Unless specified, we measure the inference speeds of all models on one NVIDIA V100 GPU without extra acceleration mechanisms, e.g., FP16 or TensorRT.

4.2. Pre-training

Experimental Setup. At the pre-training stage, we adopt the AdamW optimizer [33] with an initial learning rate of 0.002 and weight decay of 0.05. YOLO-World is pre-trained for 100 epochs on 32 NVIDIA V100 GPUs with a total batch size of 512. During pre-training, we follow previous works [19] and adopt color augmentation, random affine, random flip, and mosaic with 4 images for data augmentation. The text encoder is frozen during pre-training.
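A compact sketch of the optimizer setup described above (AdamW with learning rate 0.002 and weight decay 0.05, text encoder frozen) is shown below; the detector and text encoder are generic stand-in modules, not the actual YOLO-World code.

```python
import torch
import torch.nn as nn

# Hypothetical composite model: a vision detector plus a CLIP-style text encoder.
model = nn.ModuleDict({
    "detector": nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.SiLU(), nn.Conv2d(64, 256, 3, padding=1)),
    "text_encoder": nn.Linear(512, 512),    # stand-in for the pre-trained CLIP text encoder
})

# Freeze the text encoder, as done during pre-training.
for p in model["text_encoder"].parameters():
    p.requires_grad_(False)

# AdamW over the remaining (trainable) parameters with the reported hyper-parameters.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=2e-3,
    weight_decay=0.05,
)
print(sum(p.numel() for group in optimizer.param_groups for p in group["params"]))
```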
Dataset             Type        Vocab.  Images  Anno.
Objects365V1 [43]   Detection   365     609k    9,621k
GQA [16]            Grounding   -       621k    3,681k
Flickr [35]         Grounding   -       149k    641k
CC3M† [44]          Image-Text  -       246k    821k

Table 1. Pre-training Data. The specifications of the datasets used for pre-training YOLO-World.

Pre-training Data. For pre-training YOLO-World, we mainly adopt detection or grounding datasets including Objects365 (V1) [43], GQA [16], and Flickr30k [35], as specified in Tab. 1. Following [23], we exclude the images from the COCO dataset in GoldG [20] (GQA and Flickr30k). The annotations of the detection datasets used for pre-training contain both bounding boxes and categories or noun phrases. In addition, we also extend the pre-training data with image-text pairs, i.e., CC3M† [44], for which we have labeled 246k images through the pseudo-labeling method discussed in Sec. 3.4.

Zero-shot Evaluation. After pre-training, we directly evaluate the proposed YOLO-World on the LVIS dataset [13] in a zero-shot manner. The LVIS dataset contains 1203 object categories, which is much more than the categories of the pre-training detection datasets, and can measure the performance on large-vocabulary detection. Following previous works [20, 23, 53, 54], we mainly evaluate on LVIS minival [20] and report the Fixed AP [4] for comparison. The maximum number of predictions is set to 1000.

Main Results on LVIS Object Detection. In Tab. 2, we compare the proposed YOLO-World with recent state-of-the-art methods [20, 29, 53, 54, 56] on the LVIS benchmark in a zero-shot manner. Considering the computation burden and model parameters, we mainly compare with those methods based on lighter backbones, e.g., Swin-T [31]. Remarkably, YOLO-World outperforms previous state-of-the-art methods in terms of zero-shot performance and inference speed. Compared to GLIP, GLIPv2, and Grounding DINO, which incorporate more data, e.g., Cap4M (CC3M+SBU [34]), YOLO-World pre-trained on O365 & GoldG obtains better performance even with fewer model parameters. Compared to DetCLIP, YOLO-World achieves comparable performance (35.4 vs. 34.4) while obtaining a 20× increase in inference speed. The experimental results also demonstrate that small models, e.g., YOLO-World-S with 13M parameters, can be used for vision-language pre-training and obtain strong open-vocabulary capabilities.

4.3. Ablation Experiments

We provide extensive ablation studies to analyze YOLO-World from two primary aspects, i.e., pre-training and architecture. Unless specified, we mainly conduct ablation experiments based on YOLO-World-L pre-trained on Objects365, with zero-shot evaluation on LVIS minival.

Pre-training Data. In Tab. 3, we evaluate the performance of pre-training YOLO-World using different data. Compared to the baseline trained on Objects365, adding GQA can significantly improve performance, with an 8.4 AP gain on LVIS. This improvement can be attributed to the richer textual information provided by the GQA dataset, which can enhance the model's ability to recognize large-vocabulary objects. Adding part of the CC3M samples (8% of the full dataset) can further bring a 0.5 AP gain, with 1.3 AP on rare objects. Tab. 3 demonstrates that adding more data can effectively improve the detection capabilities in large-vocabulary scenarios. Furthermore, as the amount of data increases, the performance continues to improve, highlighting the benefits of leveraging larger and more diverse datasets for training.

Ablations on RepVL-PAN. Tab. 4 demonstrates the effectiveness of the proposed RepVL-PAN of YOLO-World, including Text-guided CSPLayers and Image-Pooling Attention, for zero-shot LVIS detection. Specifically, we adopt two settings, i.e., (1) pre-training on O365 and (2) pre-training on O365 & GQA. Compared to O365, which only contains category annotations, GQA includes rich texts, particularly in the form of noun phrases. As shown in Tab. 4, the proposed RepVL-PAN improves the baseline (YOLOv8-PAN [19]) by 1.1 AP on LVIS, and the improvements are remarkable in terms of the rare categories (APr) of LVIS, which are hard to detect and recognize. In addition, the improvements become more significant when YOLO-World is pre-trained with the GQA dataset, and the experiments indicate that the proposed RepVL-PAN works better with rich textual information.

Text Encoders. In Tab. 5, we compare the performance of using different text encoders, i.e., BERT-base [5] and CLIP-base (ViT-base) [36]. We exploit two settings during pre-training, i.e., frozen and fine-tuned, and the learning rate for fine-tuning the text encoders is a 0.01× factor of the basic learning rate. As Tab. 5 shows, the CLIP text encoder, which is pre-trained with image-text pairs and has better capability for vision-centric embeddings, obtains superior results to BERT (+10.1 AP for rare categories in LVIS). Fine-tuning BERT during pre-training brings significant improvements (+3.7 AP), while fine-tuning CLIP leads to a severe performance drop. We attribute the drop to the fact that
fine-tuning on O365 may degrade the generalization ability of the pre-trained CLIP, as O365 contains only 365 categories and lacks abundant textual information.

Method                  Backbone     Params      Pre-trained Data   FPS          AP    APr   APc   APf
MDETR [20]              R-101 [14]   169M        GoldG              -            24.2  20.9  24.3  24.2
GLIP-T [23]             Swin-T [31]  232M        O365,GoldG         0.12         24.9  17.7  19.5  31.0
GLIP-T [23]             Swin-T [31]  232M        O365,GoldG,Cap4M   0.12         26.0  20.8  21.4  31.0
GLIPv2-T [56]           Swin-T [31]  232M        O365,GoldG         0.12         26.9  -     -     -
GLIPv2-T [56]           Swin-T [31]  232M        O365,GoldG,Cap4M   0.12         29.0  -     -     -
Grounding DINO-T [29]   Swin-T [31]  172M        O365,GoldG         1.5          25.6  14.4  19.6  32.2
Grounding DINO-T [29]   Swin-T [31]  172M        O365,GoldG,Cap4M   1.5          27.4  18.1  23.3  32.7
DetCLIP-T [53]          Swin-T [31]  155M        O365,GoldG         2.3          34.4  26.9  33.9  36.3
YOLO-World-S            YOLOv8-S     13M (77M)   O365,GoldG         74.1 (19.9)  26.2  19.1  23.6  29.8
YOLO-World-M            YOLOv8-M     29M (92M)   O365,GoldG         58.1 (18.5)  31.0  23.8  29.2  33.9
YOLO-World-L            YOLOv8-L     48M (110M)  O365,GoldG         52.0 (17.6)  35.0  27.1  32.8  38.3
YOLO-World-L            YOLOv8-L     48M (110M)  O365,GoldG,CC3M†   52.0 (17.6)  35.4  27.6  34.1  38.0

Table 2. Zero-shot Evaluation on LVIS. We evaluate YOLO-World on LVIS minival [20] in a zero-shot manner. We report the Fixed AP [4] for a fair comparison with recent methods. † denotes the pseudo-labeled CC3M in our setting, which contains 246k samples. The FPS is evaluated on one NVIDIA V100 GPU w/o TensorRT. The parameters and FPS of YOLO-World are evaluated for both the re-parameterized version (w/o brackets) and the original version (w/ brackets).

Pre-trained Data     AP    APr   APc   APf
O365                 23.5  16.2  21.1  27.0
O365,GQA             31.9  22.5  29.9  35.4
O365,GoldG           32.5  22.3  30.6  36.0
O365,GoldG,CC3M†     33.0  23.6  32.0  35.5

Table 3. Ablations on Pre-training Data. We evaluate the zero-shot performance on LVIS of pre-training YOLO-World with different amounts of data.

GQA  T→I  I→T  AP    APr   APc   APf
✗    ✗    ✗    22.4  14.5  20.1  26.0
✗    ✓    ✗    23.2  15.2  20.6  27.0
✗    ✓    ✓    23.5  16.2  21.1  27.0
✓    ✗    ✗    29.7  21.0  27.1  33.6
✓    ✓    ✓    31.9  22.5  29.9  35.4

Table 4. Ablations on the Re-parameterizable Vision-Language Path Aggregation Network. We evaluate the zero-shot performance on LVIS of the proposed Vision-Language Path Aggregation Network. T→I and I→T denote the Text-guided CSPLayers and the Image-Pooling Attention, respectively.

Text Encoder  Frozen?    AP    APr   APc   APf
BERT-base     Frozen     14.6  3.4   10.7  20.0
BERT-base     Fine-tune  18.3  6.6   14.6  23.6
CLIP-base     Frozen     22.4  14.5  20.1  26.0
CLIP-base     Fine-tune  19.3  8.6   15.7  24.8

Table 5. Text Encoder in YOLO-World. We ablate different text encoders in YOLO-World through the zero-shot LVIS evaluation.

4.4. Fine-tuning YOLO-World

In this section, we further fine-tune YOLO-World for close-set object detection on the COCO dataset and the LVIS dataset to demonstrate the effectiveness of the pre-training.

Experimental Setup. We use the pre-trained weights to initialize YOLO-World for fine-tuning. All models are fine-tuned for 80 epochs with the AdamW optimizer, and the initial learning rate is set to 0.0002. In addition, we fine-tune the CLIP text encoder with a learning factor of 0.01. For the LVIS dataset, we follow previous works [7, 12, 60] and fine-tune YOLO-World on LVIS-base (common & frequent) and evaluate it on LVIS-novel (rare).

COCO Object Detection. We compare the pre-trained YOLO-World with previous YOLO detectors [19, 22, 49] in Tab. 6. For fine-tuning YOLO-World on the COCO dataset, we remove the proposed RepVL-PAN for further acceleration, considering that the vocabulary size of the COCO dataset is small. In Tab. 6, it is evident that our approach can achieve decent zero-shot performance on the COCO dataset, which indicates that YOLO-World has strong generalization ability. Moreover, YOLO-World after fine-tuning on COCO train2017 demonstrates
higher performance compared to previous methods trained from scratch.

Method          Pre-train  AP    AP50  AP75  FPS
Training from scratch.
YOLOv6-S [22]   ✗          43.7  60.8  47.0  442
YOLOv6-M [22]   ✗          48.4  65.7  52.7  277
YOLOv6-L [22]   ✗          50.7  68.1  54.8  166
YOLOv7-T [49]   ✗          37.5  55.8  40.2  404
YOLOv7-L [49]   ✗          50.9  69.3  55.3  182
YOLOv7-X [49]   ✗          52.6  70.6  57.3  131
YOLOv8-S [19]   ✗          44.4  61.2  48.1  386
YOLOv8-M [19]   ✗          50.5  67.3  55.0  238
YOLOv8-L [19]   ✗          52.9  69.9  57.7  159
Zero-shot transfer.
YOLO-World-S    O+G        37.6  52.3  40.7  -
YOLO-World-M    O+G        42.8  58.3  46.4  -
YOLO-World-L    O+G        44.4  59.8  48.3  -
YOLO-World-L    O+G+C      45.1  60.7  48.9  -
Fine-tuned w/ RepVL-PAN.
YOLO-World-S    O+G        45.9  62.3  50.1  -
YOLO-World-M    O+G        51.2  68.1  55.9  -
YOLO-World-L    O+G+C      53.3  70.1  58.2  -
Fine-tuned w/o RepVL-PAN.
YOLO-World-S    O+G        45.7  62.3  49.9  373
YOLO-World-M    O+G        50.7  67.2  55.1  231
YOLO-World-L    O+G+C      53.3  70.3  58.1  156

Table 6. Comparison with YOLOs on COCO Object Detection. We fine-tune YOLO-World on COCO train2017 and evaluate on COCO val2017. The results of YOLOv7 [49] and YOLOv8 [19] are obtained from MMYOLO [3]. 'O', 'G', and 'C' denote pre-training using Objects365, GoldG, and CC3M†, respectively. The FPS is measured on one NVIDIA V100 w/ TensorRT.

Method            AP    APr   APc   APf
ViLD [12]         27.8  16.7  26.5  34.2
RegionCLIP [59]   28.2  17.1  -     -
Detic [60]        26.8  17.8  -     -
FVLM [21]         24.2  18.6  -     -
DetPro [7]        28.4  20.8  27.8  32.4
BARON [50]        29.5  23.2  29.3  32.5
YOLOv8-S          19.4  7.4   17.4  27.0
YOLOv8-M          23.1  8.4   21.3  31.5
YOLOv8-L          26.9  10.2  25.4  35.8
YOLO-World-S      23.9  12.8  20.4  32.7
YOLO-World-M      28.8  15.9  24.6  39.0
YOLO-World-L      34.1  20.4  31.1  43.5

Table 7. Comparison with Open-Vocabulary Detectors on LVIS. We train YOLO-World on LVIS-base (including common and frequent) and report the bbox AP. The YOLOv8 models are trained on the full LVIS dataset (including base and novel) with class-balanced sampling.

LVIS Object Detection. In Tab. 7, we evaluate the fine-tuning performance of YOLO-World on the standard LVIS dataset. Firstly, compared to the oracle YOLOv8s [19] trained on the full LVIS dataset, YOLO-World achieves significant improvements, especially for larger models, e.g., YOLO-World-L outperforms YOLOv8-L by 7.2 AP and 10.2 APr. The improvements demonstrate the effectiveness of the proposed pre-training strategy for large-vocabulary detection. Moreover, YOLO-World, as an efficient one-stage detector, outperforms previous state-of-the-art two-stage methods [7, 12, 21, 50, 60] on the overall performance without extra designs, e.g., learnable prompts [7] or region-based alignments [12].

4.5. Open-Vocabulary Instance Segmentation

In this section, we further fine-tune YOLO-World for segmenting objects under the open-vocabulary setting, which can be termed open-vocabulary instance segmentation (OVIS). Previous methods [17] have explored OVIS with pseudo-labelling on novel objects. Differently, considering that YOLO-World has strong transfer and generalization capabilities, we directly fine-tune YOLO-World on a subset of data with mask annotations and evaluate the segmentation performance under large-vocabulary settings. Specifically, we benchmark open-vocabulary instance segmentation under two settings:
• (1) COCO to LVIS setting: we fine-tune YOLO-World on the COCO dataset (including 80 categories) with mask annotations, under which the models need to transfer from 80 categories to 1203 categories (80 → 1203);
• (2) LVIS-base to LVIS setting: we fine-tune YOLO-World on LVIS-base (including 866 categories, common & frequent) with mask annotations, under which the models need to transfer from 866 categories to 1203 categories (866 → 1203).
We evaluate the fine-tuned models on the standard LVIS val2017 with 1203 categories, in which 337 rare categories are unseen and can be used to measure the open-vocabulary performance.

Results. Tab. 8 shows the experimental results of extending YOLO-World for open-vocabulary instance segmentation. Specifically, we adopt two fine-tuning strategies: (1) only fine-tuning the segmentation head and (2) fine-tuning
all modules. Under strategy (1), the fine-tuned YOLO-World still retains the zero-shot capabilities acquired from the pre-training stage, allowing it to generalize to unseen categories without additional fine-tuning. Strategy (2) enables YOLO-World to fit the LVIS dataset better, but it may result in the degradation of the zero-shot capabilities.

Tab. 8 shows the comparisons of fine-tuning YOLO-World with different settings (COCO or LVIS-base) and different strategies (fine-tuning the seg. head or fine-tuning all). Firstly, fine-tuning on LVIS-base obtains better performance compared to that based on COCO. However, the ratios between AP and APr (APr/AP) are nearly unchanged, e.g., the ratios of YOLO-World on COCO and LVIS-base are 76.5% and 74.3%, respectively. Considering that the detector is frozen, we attribute the performance gap to the fact that the LVIS dataset provides more detailed and denser segmentation annotations, which are beneficial for learning the segmentation head. When fine-tuning all modules, YOLO-World obtains remarkable improvements on LVIS, e.g., YOLO-World-L achieves a 9.6 AP gain. However, the fine-tuning might degrade the open-vocabulary performance and lead to a 0.6 box APr drop for YOLO-World-L.

Model          Fine-tune Data  Fine-tune Modules  AP    APr   APc   APf   APb   APbr
YOLO-World-M   COCO            Seg Head           12.3  9.1   10.9  14.6  22.3  16.2
YOLO-World-L   COCO            Seg Head           16.2  12.4  15.0  19.2  25.3  18.0
YOLO-World-M   LVIS-base       Seg Head           16.7  12.6  14.6  20.8  22.3  16.2
YOLO-World-L   LVIS-base       Seg Head           19.1  14.2  17.2  23.5  25.3  18.0
YOLO-World-M   LVIS-base       All                25.9  13.4  24.9  32.6  32.6  15.8
YOLO-World-L   LVIS-base       All                28.7  15.0  28.3  35.2  36.2  17.4

Table 8. Open-Vocabulary Instance Segmentation. We evaluate YOLO-World for open-vocabulary instance segmentation under the two settings. We fine-tune the segmentation head or all modules of YOLO-World and report the Mask AP for comparison. APb denotes the box AP.

4.6. Visualizations

We provide the visualization results of the pre-trained YOLO-World-L under three settings: (a) we perform zero-shot inference with LVIS categories; (b) we input custom prompts with fine-grained categories with attributes; (c) referring detection. The visualizations also demonstrate that YOLO-World has a strong generalization ability for open-vocabulary scenarios along with referring ability.

Zero-shot Inference on LVIS. Fig. 5 shows the visualization results based on the LVIS categories, which are generated by the pre-trained YOLO-World-L in a zero-shot manner. The pre-trained YOLO-World exhibits strong zero-shot transfer capabilities and is able to detect as many objects as possible within the image.

Figure 5. Visualization Results on Zero-shot Inference on LVIS. We adopt the pre-trained YOLO-World-L and infer with the LVIS vocabulary (containing 1203 categories) on COCO val2017.

Inference with User's Vocabulary. In Fig. 6, we explore the detection capabilities of YOLO-World with our defined categories. The visualization results demonstrate that the pre-trained YOLO-World-L also exhibits the capability for (1) fine-grained detection (i.e., detecting the parts of one object) and (2) fine-grained classification (i.e., distinguishing different sub-categories of objects).

Figure 6. Visualization Results on User's Vocabulary. We define the custom vocabulary for each input image and YOLO-World can detect the accurate regions according to the vocabulary. Images are obtained from COCO val2017. The custom vocabularies are {men, women, boy, girl}, {elephant, ear, leg, trunk, ivory}, {golden dog, black dog, spotted dog}, and {grass, sky, zebra, trunk, tree}.

Referring Object Detection. In Fig. 7, we leverage some descriptive (discriminative) noun phrases as input, e.g., the standing person, to explore whether the model can locate regions or objects in the image that match our given input. The visualization results display the phrases and their corresponding bounding boxes, demonstrating that the pre-trained YOLO-World has the referring or grounding capability. This ability can be attributed to the proposed pre-training strategy with large-scale training data.

Figure 7. Visualization Results on Referring Object Detection. We explore the capability of the pre-trained YOLO-World to detect objects with descriptive noun phrases. Images are obtained from COCO val2017.

5. Conclusion

We present YOLO-World, a cutting-edge real-time open-vocabulary detector aiming to improve efficiency and open-vocabulary capability in real-world applications. In this paper, we have reshaped the prevalent YOLOs as a vision-language YOLO architecture for open-vocabulary pre-training and detection and proposed RepVL-PAN, which connects vision and language information within the network and can be re-parameterized for efficient deployment. We further present effective pre-training schemes with detection, grounding, and image-text data to endow YOLO-World with a strong capability for open-vocabulary detection. Experiments demonstrate the superiority of YOLO-World in terms of speed and open-vocabulary performance and indicate the effectiveness of vision-language pre-training on small models, which is insightful for future research. We hope YOLO-World can serve as a new benchmark for addressing real-world open-vocabulary detection.

References
[1] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, pages 213–229, 2020.
[2] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
[3] MMYOLO Contributors. MMYOLO: OpenMMLab YOLO series toolbox and benchmark. https://github.com/open-mmlab/mmyolo, 2022.
[4] Achal Dave, Piotr Dollár, Deva Ramanan, Alexander Kirillov, and Ross B. Girshick. Evaluating large-vocabulary object detectors: The devil is in the details. CoRR, abs/2102.01066, 2021.
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, pages 4171–4186, 2019.
[6] Xiaohan Ding, Xiangyu Zhang, Ningning Ma, Jungong Han, Guiguang Ding, and Jian Sun. RepVGG: Making VGG-style convnets great again. In CVPR, pages 13733–13742, 2021.
[7] Yu Du, Fangyun Wei, Zihe Zhang, Miaojing Shi, Yue Gao, and Guoqi Li. Learning to prompt for open-vocabulary object detection with vision-language model. In CVPR, pages 14064–14073, 2022.
[8] Chengjian Feng, Yujie Zhong, Yu Gao, Matthew R. Scott, and Weilin Huang. TOOD: Task-aligned one-stage object detection. In ICCV, pages 3490–3499, 2021.
[9] Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. YOLOX: Exceeding YOLO series in 2021. CoRR, abs/2107.08430, 2021.
[10] Ross B. Girshick. Fast R-CNN. In ICCV, pages 1440–1448, 2015.
[11] Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587, 2014.
[12] Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary object detection via vision and language knowledge distillation. In ICLR, 2022.
[13] Agrim Gupta, Piotr Dollár, and Ross B. Girshick. LVIS: A dataset for large vocabulary instance segmentation. In CVPR, pages 5356–5364, 2019.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[15] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask R-CNN. In ICCV, pages 2980–2988, 2017.
[16] Drew A. Hudson and Christopher D. Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, pages 6700–6709, 2019.
[17] Dat Huynh, Jason Kuen, Zhe Lin, Jiuxiang Gu, and Ehsan Elhamifar. Open-vocabulary instance segmentation via robust cross-modal pseudo-labeling. In CVPR, pages 7010–7021, 2022.
[18] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, pages 4904–4916, 2021.
[19] Glenn Jocher, Ayush Chaurasia, and Jing Qiu. Ultralytics YOLOv8. https://github.com/ultralytics/ultralytics, 2023.
[20] Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. MDETR - modulated detection for end-to-end multi-modal understanding. In ICCV, pages 1760–1770, 2021.
[21] Weicheng Kuo, Yin Cui, Xiuye Gu, A. J. Piergiovanni, and Anelia Angelova. F-VLM: Open-vocabulary object detection upon frozen vision and language models. CoRR, abs/2209.15639, 2022.
[22] Chuyi Li, Lulu Li, Hongliang Jiang, Kaiheng Weng, Yifei Geng, Liang Li, Zaidan Ke, Qingyuan Li, Meng Cheng, Weiqiang Nie, Yiduo Li, Bo Zhang, Yufei Liang, Linyuan Zhou, Xiaoming Xu, Xiangxiang Chu, Xiaoming Wei, and Xiaolin Wei. YOLOv6: A single-stage object detection framework for industrial applications. CoRR, abs/2209.02976, 2022.
[23] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, and Jianfeng Gao. Grounded language-image pre-training. In CVPR, pages 10955–10965, 2022.
[24] Chuang Lin, Peize Sun, Yi Jiang, Ping Luo, Lizhen Qu, Gholamreza Haffari, Zehuan Yuan, and Jianfei Cai. Learning object-language alignments for open-vocabulary object detection. In ICLR, 2023.
[25] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), pages 740–755, 2014.
[26] Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. Feature pyramid networks for object detection. In CVPR, pages 936–944, 2017.
[27] Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, pages 2999–3007, 2017.
[28] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In CVPR, pages 8759–8768, 2018.
[29] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. CoRR, abs/2303.05499, 2023.
[30] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott E. Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In ECCV, pages 21–37, 2016.
[31] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, pages 9992–10002, 2021.
[32] Xiang Long, Kaipeng Deng, Guanzhong Wang, Yang Zhang, Qingqing Dang, Yuan Gao, Hui Shen, Jianguo Ren, Shumin Han, Errui Ding, and Shilei Wen. PP-YOLO: An effective and efficient implementation of object detector. CoRR, abs/2007.12099, 2020.
[33] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
[34] Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. Im2Text: Describing images using 1 million captioned photographs. In NeurIPS, pages 1143–1151, 2011.
[35] Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. Int. J. Comput. Vis., pages 74–93, 2017.
[36] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763, 2021.
[37] Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger. In CVPR, pages 6517–6525, 2017.
[38] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. CoRR, abs/1804.02767, 2018.
[39] Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In CVPR, pages 779–788, 2016.
[40] Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In CVPR, pages 779–788, 2016.
[41] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1137–1149, 2017.
[42] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1137–1149, 2017.
[43] Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. In ICCV, pages 8429–8438, 2019.
[44] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, pages 2556–2565, 2018.
[45] Cheng Shi and Sibei Yang. EdaDet: Open-vocabulary object detection using early dense alignment. In ICCV, pages 15678–15688, 2023.
[46] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. In ICCV, pages 9626–9635, 2019.
[47] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, pages 5998–6008, 2017.
[48] Chien-Yao Wang, Hong-Yuan Mark Liao, Yueh-Hua Wu, Ping-Yang Chen, Jun-Wei Hsieh, and I-Hau Yeh. CSPNet: A new backbone that can enhance learning capability of CNN. In CVPRW, pages 1571–1580, 2020.
[49] Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In CVPR, pages 7464–7475, 2023.
[50] Size Wu, Wenwei Zhang, Sheng Jin, Wentao Liu, and Chen Change Loy. Aligning bag of regions for open-vocabulary object detection. In CVPR, pages 15254–15264, 2023.
[51] Johnathan Xie and Shuai Zheng. ZSD-YOLO: Zero-shot YOLO detection using vision-language knowledge distillation. CoRR, 2021.
[52] Shangliang Xu, Xinxin Wang, Wenyu Lv, Qinyao Chang, Cheng Cui, Kaipeng Deng, Guanzhong Wang, Qingqing Dang, Shengyu Wei, Yuning Du, and Baohua Lai. PP-YOLOE: An evolved version of YOLO. CoRR, abs/2203.16250, 2022.
[53] Lewei Yao, Jianhua Han, Youpeng Wen, Xiaodan Liang, Dan Xu, Wei Zhang, Zhenguo Li, Chunjing Xu, and Hang Xu. DetCLIP: Dictionary-enriched visual-concept paralleled pre-training for open-world detection. In NeurIPS, 2022.
[54] Lewei Yao, Jianhua Han, Xiaodan Liang, Dan Xu, Wei Zhang, Zhenguo Li, and Hang Xu. DetCLIPv2: Scalable open-vocabulary object detection pre-training via word-region alignment. In CVPR, pages 23497–23506, 2023.
[55] Alireza Zareian, Kevin Dela Rosa, Derek Hao Hu, and Shih-Fu Chang. Open-vocabulary object detection using captions. In CVPR, pages 14393–14402, 2021.
[56] Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian Harold Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq-Neng Hwang, and Jianfeng Gao. GLIPv2: Unifying localization and vision-language understanding. In NeurIPS, 2022.
[57] Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M. Ni, and Heung-Yeung Shum. DINO: DETR with improved denoising anchor boxes for end-to-end object detection. In ICLR, 2023.
[58] Shifeng Zhang, Cheng Chi, Yongqiang Yao, Zhen Lei, and Stan Z. Li. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In CVPR, pages 9756–9765, 2020.
[59] Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, and Jianfeng Gao. RegionCLIP: Region-based language-image pretraining. In CVPR, pages 16772–16782, 2022.
[60] Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, and Ishan Misra. Detecting twenty-thousand classes using image-level supervision. In ECCV, pages 350–368, 2022.
[61] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. In ICLR, 2021.

A. Additional Details

A.1. Re-parameterization for RepVL-PAN

During inference on an offline vocabulary, we adopt re-parameterization for RepVL-PAN for faster inference speed and deployment. Firstly, we pre-compute the text embeddings $W \in \mathbb{R}^{C \times D}$ through the text encoder.

For fine-tuning on the COCO dataset, we initialize YOLO-World using pre-trained weights. The learning rate of fine-tuning is set to 0.0002 with the weight decay set to 0.05. After fine-tuning, we pre-compute the class text embeddings with the given COCO categories and store the embeddings into the weights of the classification layers.

B. Automatic Labeling on Large-scale Image-Text Data

In this section, we provide the detailed procedures for labeling region-text pairs with large-scale image-text data, e.g., CC3M [44]. The overall labeling pipeline is illustrated in Fig. 8, which mainly consists of three procedures, i.e., (1) extract object nouns, (2) pseudo labeling, and (3) filtering. As discussed in Sec. 3.4, we adopt the simple n-gram algorithm to extract nouns from captions.

Region-Text Proposals. After obtaining the set of object nouns $T = \{t_k\}^{K}$ from the first step, we leverage a pre-trained open-vocabulary detector, i.e., GLIP-L [23], to generate pseudo boxes $\{B_i\}$ along with confidence scores $\{c_i\}$.
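The snippet below sketches the region-level filtering step under stated assumptions: `clip_region_score` is a hypothetical stand-in that, in a real pipeline, would crop each pseudo box and score it against its noun phrase with a pre-trained CLIP model; only the thresholding and NMS logic are shown, with illustrative threshold values.

```python
import torch
from torchvision.ops import nms

def clip_region_score(image: torch.Tensor, boxes: torch.Tensor, phrase: str) -> torch.Tensor:
    """Stand-in for CLIP-based region-text relevance scoring (random scores for this sketch)."""
    return torch.rand(boxes.shape[0])

def filter_region_text_pairs(image, boxes, phrase, score_thr: float = 0.3, iou_thr: float = 0.5):
    scores = clip_region_score(image, boxes, phrase)
    keep = scores > score_thr                   # drop low-relevance pseudo boxes
    boxes, scores = boxes[keep], scores[keep]
    kept = nms(boxes, scores, iou_thr)          # remove redundant boxes via NMS
    return boxes[kept], scores[kept]

image = torch.rand(3, 480, 640)
boxes = torch.tensor([[10., 10., 100., 120.],   # (x1, y1, x2, y2) pseudo boxes for one noun phrase
                      [12., 8., 105., 118.],
                      [300., 200., 400., 330.]])
print(filter_region_text_pairs(image, boxes, "a man"))
```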
Figure 8. Labeling Pipeline for Image-Text Data. We first leverage the simple n-gram algorithm to extract object nouns from the captions. We adopt a pre-trained open-vocabulary detector to generate pseudo boxes given the object nouns, which forms the coarse region-text proposals. Then we use a pre-trained CLIP to rescore or relabel the boxes along with filtering.

• (6) Image-level Filtering: we compute the image-level region-text score $s_{\text{region}}$ by averaging the kept region-text scores. Then we obtain the image-level confidence score by $s = \sqrt{s_{\text{img}} \cdot s_{\text{region}}}$ and we keep the images with scores larger than 0.3.

The thresholds mentioned above are empirically set according to part of the labeled results, and the whole pipeline is automatic without human verification. Finally, the labeled samples are used for pre-training YOLO-World. We will provide the pseudo annotations of CC3M for further research.
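As a worked example of the image-level confidence above (with made-up scores), the geometric mean of the CLIP image-text score and the averaged region-text score decides whether an image is kept.

```python
import math

s_img = 0.62                                          # illustrative CLIP image-text score
region_scores = [0.41, 0.58, 0.36]                    # illustrative kept region-text scores
s_region = sum(region_scores) / len(region_scores)    # 0.45
s = math.sqrt(s_img * s_region)                       # geometric mean
print(round(s, 3), s > 0.3)                           # 0.528 True -> the image is kept
```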
Method         Pre-trained Data   Samples  AP    APr   APc   APf
YOLO-World-S   O365               0.61M    16.3  9.2   14.1  20.1
YOLO-World-S   O365+GoldG         1.38M    24.2  16.4  21.7  27.8
YOLO-World-S   O365+CC3M-245k     0.85M    16.5  10.8  14.8  19.1
YOLO-World-S   O365+CC3M-520k     1.13M    19.2  10.7  17.4  22.4
YOLO-World-S   O365+CC3M-750k     1.36M    18.2  11.2  16.0  21.1

Table 9. Zero-shot Evaluation on LVIS. We evaluate the performance of pre-training YOLO-World-S with different amounts of image-text data.