Kim et al. - 2023 - Region-Aware Pretraining for Open-Vocabulary Objec

This CVPR paper is the Open Access version, provided by the Computer Vision Foundation. Except for this watermark, it is identical to the accepted version; the final published version of the proceedings is available on IEEE Xplore.

Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers

Dahun Kim Anelia Angelova Weicheng Kuo

Google Research, Brain Team
{mcahny, anelia, weicheng}@google.com

Abstract

We present Region-aware Open-vocabulary Vision Transformers (RO-ViT), a contrastive image-text pretraining recipe to bridge the gap between image-level pretraining and open-vocabulary object detection. At the pretraining phase, we propose to randomly crop and resize regions of positional embeddings instead of using the whole image positional embeddings. This better matches the use of positional embeddings at region-level in the detection finetuning phase. In addition, we replace the common softmax cross entropy loss in contrastive learning with focal loss to better learn the informative yet difficult examples. Finally, we leverage recent advances in novel object proposals to improve open-vocabulary detection finetuning. We evaluate our full model on the LVIS and COCO open-vocabulary detection benchmarks and zero-shot transfer. RO-ViT achieves a state-of-the-art 32.1 APr on LVIS, surpassing the best existing approach by +5.8 points in addition to competitive zero-shot transfer detection. Surprisingly, RO-ViT improves the image-level representation as well and achieves the state of the art on 9 out of 12 metrics on COCO and Flickr image-text retrieval benchmarks, outperforming competitive approaches with larger models.

[Figure 1: two panels. "Contrastive image-text pretraining": patch embedding plus Cropped PE (CPE, obtained by a random crop-resize of the positional embeddings), ViT encoder, GAP, image embeddings matched with text embeddings. "Open-vocabulary detection finetuning": patch embedding plus whole-image PE, ViT encoder, image feature, region crop, detector heads, region embeddings matched with text embeddings.]

Figure 1. Existing vision-language models are designed for image-level tasks, e.g., classification and retrieval. In this paper, we aim to enhance the image-text pretraining for region-level downstream task: open-vocabulary object detection. At pretraining, we propose to randomly crop and resize regions of positional embeddings (PE) instead of using the whole image PE. This better matches the use of PE at region-level in the detection finetuning and results in better performance in detection and surprisingly also retrieval task.

1. Introduction

The ability to detect objects in the visual world is a hallmark of computer vision and machine intelligence. It is key to enabling many applications, e.g. autonomous agents adapting to new environments with many novel objects, or a shopping system handling fine-grained user queries such as “fedora” or “bonnet” outside of the training vocabulary. However, modern object detectors typically rely on manual annotations of regions and class labels, which limits their vocabulary size to the order of $10^3$, and it is prohibitively expensive to scale up further. This is orders of magnitude smaller than the number of objects we encounter in the visual world.

Recently, the open-vocabulary detection task (OVD) has been proposed to overcome this limitation by leveraging abundant image-text pairs for training and ingesting the text queries provided by users at test time [54]. By treating the categories as text embeddings rather than discrete ids, open-vocabulary detectors can flexibly predict a wide range of objects unseen during training. Most existing works leverage image-text pre-training, which contains rich semantic knowledge of open-vocabulary concepts. Knowledge distillation [13, 19], weak supervision [63], self-training [41, 57, 60], and frozen models [29] have been proposed, and CNN backbones are most commonly used. With the growing popularity of vision transformers in image understanding [12, 32, 56], multimodal [1, 2, 47], and self-supervised tasks [5, 6, 22], it is important to understand how to build capable open-vocabulary detectors with vision transformers [12, 35].

To our best knowledge, all existing works assume pretrained Vision-Language Models (VLM) are given, and develop adaptation or finetuning recipes to bridge the gap between image-level pretraining and object-level finetuning [13, 19, 41, 57, 60]. Since the VLMs are designed for image-level tasks, e.g. classification and retrieval, the notion of objects/regions is not adequately utilized in the pretraining process. We believe it would be beneficial for open-vocabulary detection if we bake in locality information in the image-text pretraining.

We present RO-ViT, a simple recipe to pretrain vision transformers in a region-aware manner for open-vocabulary object detection. Standard pretraining typically uses full-image positional embeddings, which do not generalize well to detection tasks. Thus, we propose a novel positional embedding scheme called “Cropped Positional Embedding” which better matches the use of region crops in detection finetuning (see Fig. 1). In addition, we found it beneficial to replace the softmax cross entropy loss with focal loss in contrastive learning, which affords us more control to learn from harder and more informative examples. Finally, we leverage recent advances in novel object proposals [28] to improve open-vocabulary detection finetuning. The motivation is that existing approaches often miss novel objects in the object proposal stage because the proposals tend to overfit to the foreground categories.

We evaluate the approach on the standard LVIS and COCO open-vocabulary benchmarks. On LVIS, our best model achieves a state-of-the-art 32.1 APr at the system level, surpassing the best existing approach by +5.8 APr. Compared to the best existing ViT-based approach, ours outperforms by a healthy margin of +6.5 APr. Our LVIS-trained model outperforms existing baselines on transfer detection to Objects365 without re-training. Although not explicitly optimized for retrieval, RO-ViT surprisingly achieves state-of-the-art performance on 9 out of 12 metrics in image-text retrieval benchmarks and outperforms strong baselines with standard positional embeddings, cross entropy loss, and larger model capacity. We conduct ablations to confirm the benefits of each proposed component. Visualization of the learnt positional embeddings shows that our approach (CPE) leads to more symmetrical and diverse patterns than the baseline. In summary:

• We propose RO-ViT to bridge the positional embeddings in image-text pretraining to open-vocabulary detection finetuning.
• We show that image-text pretraining with focal loss is more effective than existing softmax CE loss.
• We improve the open-vocabulary detection finetuning recipe with novel object proposals.
• RO-ViT achieves the state of the art on the LVIS open-vocabulary detection benchmark, and on 9 out of 12 metrics on COCO and Flickr image-text retrieval benchmarks.

We hope these findings will facilitate the research community to further explore open-vocabulary detection from the perspective of image-text pretraining with potential benefits for both region-level and image-level tasks.

2. Related Works

Learning open-vocabulary and zero-shot recognition. Building general representation for open-vocabulary and zero-shot recognition is a fundamental machine learning task. DeViSE [15] and ConSE [36] are pioneering works to learn a shared image-text embedding space for zero-shot recognition by convolutional networks. As image and language often co-occur on the internet, the community has explored learning such representation from various data sources including image tags [8, 11, 27], captions/text descriptions [10, 23, 42, 45, 59], alt-texts [26, 37, 55], and image search queries [39]. Recently, contrastive learning has become a popular paradigm to learn from large image-text corpora due to its simplicity and scalability, where increasing model size and data yield consistent improvement on open-vocabulary classification benchmarks. The aforementioned works focus on image-level recognition, whereas we study how to best tailor them for region-level understanding.

Self-supervised representation learning for localization. Due to the challenge of scaling up labeling for localization tasks, many efforts have been made to learn locality-sensitive representation in a self-supervised manner. Existing approaches roughly fall into the contrastive or generative paradigms. Contrastive approaches typically involve region or point-level contrastive learning using sliding windows [49], object proposals [24, 48], or point samples [3]. Most existing contrastive methods are CNN-based, while ViT backbones have been popular recently with generative approaches [5, 22] and have shown promise with contrastive approaches [6]. Masked image modeling is commonly used to learn the ViT features for localization tasks. Even though these self-supervised methods are well-suited for standard localization tasks, they lack the image-text correspondence necessary for open-vocabulary recognition. Thus, we study contrastive image-text representation learning for open-vocabulary detection with ViTs.

Open-vocabulary object detection and segmentation. Zero-shot detection was proposed to scale up detection models beyond their limited training categories. Popular approaches accomplish this by learning the alignment between region visual representation and category word embeddings [4, 9, 40, 58] or hallucinating visual features with a generative model [21, 64]. Without any visual knowledge of the novel categories, zero-shot detectors tend to struggle
with low performance. This motivates the open-vocabulary detection benchmark [54] to bridge the gap between zero-shot and fully-supervised detection.

With the rise of image-text pretraining, many works have explored adapting these pretrained models to open-vocabulary detection and segmentation [17, 19, 30, 60, 61]. For example, ViLD [19] proposes to distill the image-text knowledge into the detector. DetPro [13] improves ViLD by incorporating the idea of prompt optimization [62]. Moreover, region-text self-training has been shown effective on image caption data [60], classification data [41], or even unlabeled data [57]. Weak supervision [63] and phrase grounding [31] approaches have also been proposed. In terms of architecture, CNNs are most commonly used but vision transformers have recently been adopted as well [35, 61]. While existing works typically assume image-text pretrained models are given and focus on finetuning or adaptation strategies, our focus is to improve the upstream image-text pretraining with vision transformers.

3. Method

In this paper we address the problem of open-vocabulary object detection. At training time, the model has access to the detection labels of $C_B$ base categories, but needs to detect objects from a set of $C_N$ novel categories at test time. We follow existing works [13, 19, 29, 60] to leverage pretrained vision and language models (VLM) [39]. Unlike existing works, we explore how to best pretrain our own VLMs with vision transformers [12] for open-vocabulary object detection.

3.1. Preliminaries

Contrastive image-text pretraining. Two-tower contrastive image-text learning involves an image encoder and a text encoder [26, 39]. The text tower is typically a transformer encoder, whereas the image tower can be CNN-based or ViT as in our case. Given an image-text corpus, the model learns to bring each image and its corresponding text together and push other non-matching texts away in the embedding space. The most common objective is softmax cross-entropy loss. The image/text embeddings are typically obtained by taking the pre-pended class token embedding, self-attention pooling, or average pooling. We use the global average pooling followed by L2 normalization for both image and text embeddings.

Open-vocabulary object detector. An open-vocabulary object detector is trained with the labels of $C_B$ base categories, but needs to detect the union of base and novel categories ($C_B \cup C_N$) at test time. Most object detectors use K-way classifiers because the number of categories is the same between train and test time. To deal with additional categories at test time, the common practice is to replace the conventional fixed-size classifier fully-connected layer with the text embeddings of base categories [19, 54]. It is important that the text embeddings come from the matching text encoder of the image encoder during pretraining, so that the open-vocabulary knowledge would be preserved. We represent the background category by a “background” phrase, and the proposals not matched to any annotations in $C_B$ are labeled as background.

During training, we compute the detection scores $p_i$ for each region $i$ as the cosine similarity between the RoI-Align feature (i.e., region embedding) and the text embeddings of $C_B$, followed by a softmax. At test time, we expand the text embeddings from $C_B$ to $C_B \cup C_N$ plus the “background” category for open-vocabulary detection. We extract the VLM embedding of region $i$ by RoI-Align on the output feature map of the ViT backbone, and compute the VLM region scores $z_i$ as the cosine similarity with the $C_B \cup C_N$ text embeddings. Similarly, the detection scores $p_i$ are now computed with the $C_B \cup C_N$ text embeddings. The combined open-vocabulary detection score $s_i^{\text{OVD}}$ is obtained by geometric means [19, 29]:

$$
s_i^{\text{OVD}} =
\begin{cases}
z_i^{(1-\alpha)} \cdot p_i^{\alpha} & \text{if } i \in C_B \\
z_i^{(1-\beta)} \cdot p_i^{\beta} & \text{if } i \in C_N
\end{cases}
\tag{1}
$$

where $\alpha, \beta \in [0, 1]$ control the weights for base and novel categories. The background score comes directly from the detection score $p_i$, because the VLM score with the “background” phrase tends to be less reliable.

We adopt Mask R-CNN heads in our detector and use class-agnostic box regression and mask prediction heads following existing works [13, 19, 29, 54, 60]. Note that we use a contrastively pretrained ViT to initialize our detector backbone, and adopt the simple feature pyramid and windowed attention to handle higher-resolution image inputs as proposed by Li et al. [32].

3.2. Region-Aware Image-Text Pretraining

Existing vision-language models are trained to match an image as a whole to a text description [26, 39]. However, the pretraining is unaware of the alignment between its region-level representations and text tokens, which is essential to downstream open-vocabulary detection. We propose a novel Cropped Positional Embedding (CPE) to bridge this gap, and also find it beneficial to learn from hard examples with a focal loss. Our improvements introduce no extra parameters and minimal computation costs.

Cropped Positional Embedding (CPE). The positional embeddings are important to transformers as they provide the information of where each element in the set comes from. This information is often useful for downstream recognition and localization tasks.
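As a rough illustration of the Cropped Positional Embedding procedure that the rest of Section 3.2 and the pretraining details in Section 4 describe (up-sample the positional embeddings, sample a random region, resize it back to the pretraining grid), the following pure-Python sketch may help. It is not the authors' implementation. The helper names (`crop_and_resize`, `sample_crop_box`, `cropped_positional_embedding`), the pixel-centre sampling convention, the use of rejection sampling, and the reading of "scale ratio" as crop area relative to the full grid are all assumptions made for this example; the aspect-ratio constraint mentioned in Section 4 is omitted for brevity.

```python
import random


def crop_and_resize(grid, box, out_h, out_w):
    """Bilinearly resample the part of `grid` inside a normalized box.

    grid: H x W x D nested lists; box: (x1, y1, x2, y2) in [0, 1].
    """
    x1, y1, x2, y2 = box
    in_h, in_w, dim = len(grid), len(grid[0]), len(grid[0][0])

    def source_coords(k, out_n, lo, hi, in_n):
        # Centre of output cell k -> continuous source index, clamped to the grid.
        s = (lo + (k + 0.5) / out_n * (hi - lo)) * in_n - 0.5
        s = min(max(s, 0.0), in_n - 1.0)
        i0 = int(s)
        return i0, min(i0 + 1, in_n - 1), s - i0

    out = []
    for i in range(out_h):
        ya, yb, wy = source_coords(i, out_h, y1, y2, in_h)
        row = []
        for j in range(out_w):
            xa, xb, wx = source_coords(j, out_w, x1, x2, in_w)
            row.append([
                (1 - wy) * ((1 - wx) * grid[ya][xa][d] + wx * grid[ya][xb][d])
                + wy * ((1 - wx) * grid[yb][xa][d] + wx * grid[yb][xb][d])
                for d in range(dim)
            ])
        out.append(row)
    return out


def sample_crop_box(rng, min_scale=0.1, max_scale=1.0):
    """Sample (x1, y1, x2, y2) as in Section 3.2, rejecting out-of-range areas."""
    while True:
        x1, y1 = rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0)
        x2, y2 = rng.uniform(x1, 1.0), rng.uniform(y1, 1.0)
        if min_scale <= (x2 - x1) * (y2 - y1) <= max_scale:
            return x1, y1, x2, y2


def cropped_positional_embedding(pe, rng, up_size=64):
    """Up-sample the PE grid, crop a random region, resize back to the PE size."""
    out_h, out_w = len(pe), len(pe[0])
    upsampled = crop_and_resize(pe, (0.0, 0.0, 1.0, 1.0), up_size, up_size)
    return crop_and_resize(upsampled, sample_crop_box(rng), out_h, out_w)


# Toy 14 x 14 grid whose 2-d "embedding" is simply its (row, col) index.
pe = [[[float(r), float(c)] for c in range(14)] for r in range(14)]
cpe = cropped_positional_embedding(pe, random.Random(0))
```

In this toy setup `cpe` has the same 14×14 shape as `pe` and would be added to the patch embeddings in place of the whole-image positional embeddings; a fresh box would be sampled for every training example.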

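For reference, the softmax and focal image-to-text losses that Section 3.2 defines in Equations 2 to 4 can be written out directly. The sketch below is a plain-Python transcription under stated assumptions, not the authors' training code: the function names are hypothetical, the temperature `TAU` and focusing parameter `GAMMA` are placeholder values that this section of the paper does not specify, and the log-sigmoid is computed in a numerically stable form that the paper does not discuss.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def softmax_i2t_loss(v, l, tau):
    """Equation 2: softmax cross entropy over each image's row of similarities."""
    batch = len(v)
    total = 0.0
    for i in range(batch):
        logits = [dot(v[i], l[j]) / tau for j in range(batch)]
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += logits[i] - log_denominator
    return -total / batch


def focal_i2t_loss(v, l, tau, gamma):
    """Equations 3-4: sigmoid focal loss summed over all image-text pairs."""
    batch = len(v)
    total = 0.0
    for i in range(batch):
        for j in range(batch):
            logit = dot(v[i], l[j]) / tau
            signed = logit if i == j else -logit  # p = sigmoid(signed)
            log_p = log_sigmoid(signed)
            one_minus_p = math.exp(log_sigmoid(-signed))  # 1 - sigmoid(x) = sigmoid(-x)
            total += (one_minus_p ** gamma) * log_p
    return -total / batch


def total_loss(v, l, tau, gamma):
    """I2T + T2I focal loss; T2I swaps the roles of images and texts."""
    return focal_i2t_loss(v, l, tau, gamma) + focal_i2t_loss(l, v, tau, gamma)


TAU, GAMMA = 0.1, 2.0  # hypothetical values, not taken from the paper
images = [[1.0, 0.0], [0.0, 1.0]]  # already L2-normalized
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]
```

With these toy embeddings both losses are smaller for `matched_texts` than for `shuffled_texts`. Setting `gamma` to 0 removes the modulating factor and leaves a plain per-pair sigmoid cross entropy, which is one way to see how the focal term down-weights pairs that are already classified confidently.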
[Figure 2: three panels. "Focal contrastive image-text pretraining": patch embedding plus CPE, ViT encoder, GAP, image embeddings matched against text embeddings from a text encoder (example caption: ‘two dogs playing on the grass’). "Cropped Positional Embedding (CPE)": positional embeddings are upsampled, then randomly cropped and resized. "RO-ViT: downstream open-vocabulary detector": patch embedding plus full-image PE (initialized with the upsampled pretrained PE), ViT encoder, detector heads, RoI Align on detected regions, region-text similarity giving the detection region score p and the VLM region score z, combined into the OVD score s^OVD over base classes (train & test) and novel classes (test time only).]

Figure 2. RO-ViT framework overview. Our Region-aware Open-vocabulary Vision Transformer (RO-ViT) attempts to bridge the gap between the image-level vision-language model (VLM) pretraining and downstream open-vocabulary detection. For the pretraining, we propose Cropped Positional Embedding (CPE) which randomly crops and resizes a region of positional embeddings instead of using the whole-image PE. In addition, we use focal loss instead of the common softmax cross entropy loss for contrastive learning. The pretrained ViT backbone is transferred to the downstream open-vocabulary detection by replacing the global average pooling with detector heads. RO-ViT detector takes the whole-image positional embeddings as input, and the detections are used to crop out region embeddings from the ViT backbone features. The region embeddings then match with the cached category embeddings to obtain the VLM score $z$, which is combined with the detection score $p$ into the open-vocabulary detection score $s^{\text{OVD}}$ (see Equation 1).
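The score combination mentioned at the end of the caption (Equation 1), together with the objectness re-weighting described in Section 3.3, amounts to a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the defaults (α, β, δ) = (0.65, 0.3, 3) are the values reported in the downstream detection details of Section 4, and the example scores are made-up numbers.

```python
def ovd_score(z, p, is_base, alpha=0.65, beta=0.3):
    """Equation 1: geometric mean of VLM score z and detection score p."""
    weight = alpha if is_base else beta
    return z ** (1.0 - weight) * p ** weight


def final_score(objectness, s_ovd, delta=3.0):
    """Section 3.3: re-weight the ensemble score by objectness ** delta."""
    return objectness ** delta * s_ovd


# One region scored against a base and a novel category (toy numbers).
base = ovd_score(z=0.60, p=0.90, is_base=True)
novel = ovd_score(z=0.60, p=0.10, is_base=False)
```

Because β is smaller than α, a novel category leans more on the VLM score `z` than on the detector score `p`, so a low detector score on an unseen category is penalized less than it would be for a base category.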

There is a mismatch between the way the positional embeddings are used in existing contrastive pretraining approaches and open-vocabulary detection finetuning. The pretraining approaches typically apply full-image positional embeddings [26, 39] during training, and use the same positional embeddings for downstream tasks, e.g., zero-shot recognition. However, the recognition occurs at region-level for open-vocabulary detection finetuning, which requires the full-image positional embeddings to generalize to regions that they never see during the pretraining.

To bridge this gap, we propose Cropped Positional Embedding (CPE) (see center of Figure 2). First, we up-sample the positional embeddings from the image size typical for pretraining, e.g., 224, to that typical for detection tasks, e.g., 1024. Then we randomly crop and resize a region from the up-sampled positional embeddings and use that as the image-level positional embeddings during pretraining. The regions are uniformly sampled from the normalized coordinates as $x_1 \sim \text{Uniform}(0, 1)$, $y_1 \sim \text{Uniform}(0, 1)$, $x_2 \sim \text{Uniform}(x_1, 1)$, $y_2 \sim \text{Uniform}(y_1, 1)$, while keeping the crop scale ratio in [0.1, 1.0].

Intuitively, this causes the model to view an image not as a full image in itself, but as a region crop from some larger unknown image. This better matches the downstream use case of detection where recognition occurs at region- rather than image-level.

Focal Loss. We desire to have finer control over how hard examples are weighted than what the softmax cross entropy loss can provide. Focal loss offers a natural option to tune the weights of hard examples [33]. Let $v_i$ and $l_i$ be the normalized image and text embeddings, and the image-to-text (I2T) contrastive losses be $L_{\text{softmax}}$ and $L_{\text{focal}}$ for the softmax (baseline) or focal loss (RO-ViT). We define the losses mathematically below:

$$
L_{\text{softmax}} = -\frac{1}{B} \sum_{i=1}^{B} \log\left( \frac{\exp(v_i l_i / \tau)}{\sum_{j=1}^{B} \exp(v_i l_j / \tau)} \right)
\tag{2}
$$

$$
L_{\text{focal}} = -\frac{1}{B} \sum_{i=1}^{B} \sum_{j=1}^{B} (1 - p_i)^{\gamma} \log(p_i)
\tag{3}
$$

where $p_i$ denotes the true class probability as below:

$$
p_i =
\begin{cases}
\sigma(v_i l_j / \tau) & \text{if } i = j \\
1 - \sigma(v_i l_j / \tau) & \text{if } i \neq j
\end{cases}
\tag{4}
$$

Here $\sigma$ denotes the sigmoid function. We adopt the simpler non alpha-balanced form of focal loss [33]. The text-to-image (T2I) contrastive losses are symmetrical with the I2T losses by simply exchanging the inner/outer summation loops. The total loss is the sum of both I2T and T2I losses.

3.3. Open-vocabulary Detector Finetuning

Here we present two simple techniques to improve the downstream open-vocabulary detector (Sec. 3.1). Despite the backbone features pretrained from the vast open-vocabulary data, the added detector layers (neck and heads) are newly trained with the downstream detection dataset (e.g., LVIS base categories). Existing approaches often miss novel/unlabeled objects in the object proposal stage because the proposals tend to classify them as background. To remedy this, we leverage recent advances in novel object proposal methods [28] and adopt the localization quality-based objectness (i.e., centerness score) instead of the object-or-not binary classification score. Following [28], we use a single anchor per location and combine the predicted objectness score $o_i$ with the ensemble detection score in Equation 1 to obtain the final OVD score as: $S_i^{\text{OVD}} = o_i^{\delta} \cdot s_i^{\text{OVD}}$.

Additionally, we replace the standard classifier and mask output layer with the normalized layers [46]. This is to reduce their scale variance over categories by L2-normalizing the weights $w$ and features $x$ as: $f(x; w, b, \tau) = \frac{\tau}{\|w\|_2 \|x\|_2} w^T x + b$, where $\tau = 20$. Although we do not have rare categories at training (i.e., open-vocabulary setting), we empirically found it beneficial (see Sec. 4).

| method | pretrained model | detector backbone | APr | AP |
|---|---|---|---|---|
| ConvNet based: | | | | |
| DetPro-Cascade [13] | ViT-B/32 | R-50 | 20.0 | 27.0 |
| Detic-CN2 [63] | ViT-B/32 | R-50 | 24.6 | 32.4 |
| RegionCLIP [60] | R-50x4 | R-50x4 | 22.0 | 32.3 |
| ViLD-Ens [19] | ViT-B/32 | R-152 | 18.7 | 26.0 |
| ViLD-Ens [19] | ViT-L/14 | EffNet-B7 | 21.7 | 29.6 |
| ViLD-Ens [19] | EffNet-B7 | EffNet-B7 | 26.3 | 29.3 |
| VL-PLM [57] | ViT-B/32 | R-50 | 17.2 | 27.0 |
| OV-DETR [53] | ViT-B/32 | R-50 | 17.4 | 26.6 |
| Rasheed et al. [41] | ViT-B/32 | R-50 | 21.1 | 25.9 |
| PromptDet [14] | ViT-B/32 | R-50 | 21.4 | 25.3 |
| ViT based: | | | | |
| OWL-ViT [35] | ViT-H/14 | ViT-H/14 | 23.3 | 35.3 |
| OWL-ViT [35] | ViT-L/14 | ViT-L/14 | 25.6 | 34.7 |
| RO-ViT (ours) | ViT-B/16 | ViT-B/16 | 28.0 | 30.2 |
| RO-ViT (ours) | ViT-L/14 | ViT-L/14† | 31.4 | 34.0 |
| RO-ViT (ours) | ViT-L/16 | ViT-L/16 | 32.1 | 34.0 |

Table 1. LVIS open-vocabulary object detection (mask APs). RO-ViT outperforms the best existing approach by +5.8 APr on novel categories. When using ViT-Base and Large, RO-ViT outperforms OWL-ViT based on ViT-Large by +3.0 and +6.5 APr, respectively. All methods use the same instance-level supervision from LVIS base categories for detection training. †: We use image size 860 during detection for a fair comparison with OWL-ViT.

| method | pretrained model | detector backbone | novel AP | AP |
|---|---|---|---|---|
| ConvNet based: | | | | |
| ViLD [19] | ViT-B/32 | R-50 | 27.6 | 51.3 |
| OV-DETR [53] | ViT-B/32 | R-50 | 29.4 | 52.7 |
| w/ pseudo box labels: | | | | |
| XPM et al. [25] | R-50 | R-50 | 27.0 | 41.2 |
| RegionCLIP [60] † | R-50x4 | R-50x4 | 39.3 | 55.7 |
| PromptDet [14] | ViT-B/32 | R-50 | 26.6 | 50.6 |
| VL-PLM [57] | ViT-B/32 | R-50 | 34.4 | 53.5 |
| Rasheed et al. [41] ‡ | ViT-B/32 | R-50 | 36.9 | 51.5 |
| w/ weak supervision: | | | | |
| Detic-CN2 [63] | ViT-B/32 | R-50 | 24.6 | 32.4 |
| ViT based:* | | | | |
| RO-ViT (ours) | ViT-B/16 | ViT-B/16 | 30.2 | 41.5 |
| RO-ViT (ours) | ViT-L/16 | ViT-L/16 | 33.0 | 47.7 |

Table 2. COCO open-vocabulary object detection (box AP50). RO-ViT represents the first ViT-based approach and demonstrates a very competitive novel AP without using pseudo labeling or weak supervision. †: RegionCLIP uses an off-the-shelf RPN during its pretraining. ‡: Rasheed et al. use an external MViT detector [34] during pretraining. *: The other ViT-based method [35] reports its results on LVIS only.

4. Experimental Results

Pretraining details. Our contrastive image-text pretraining is performed from scratch. We adopt the widely used ViT-B/16 and ViT-L/16 as the image encoder. The input image size is 224×224, which results in 14×14 positional embeddings with patch size 16×16. To generate the Cropped Positional Embedding (CPE), we first interpolate the positional embeddings to size 64×64. We randomly crop a region with the scale ratio in [0.1, 1.0], and the aspect ratio in [0.5, 2.0]. The region crop is resized back to the size 14×14 (i.e., CPE), and is added to the patch embeddings. We use the global average pooling at the last ViT layer to obtain the image embedding. The text encoder is a 12-layer Transformer following [39, 51], and the maximum length of the input text is 64 tokens. Both image and text embeddings are L2 normalized and trained with focal loss. We train on the same image-text dataset as Chao et al. [26]. Unless otherwise specified, we use a batch size of 4096 for ablation and 16384 to compare with other methods. We use the AdamW optimizer with learning rate of 5e-4 and a linear warmup of 10k steps, and train for 500k iterations.

Downstream detection details. We train RO-ViT with base categories $C_B$ for 46.1k/11.3k iterations with image size 1024, large scale jittering [16], batch size 256/128, the SGD optimizer with weight decay 1e-4/1e-2, momentum 0.9 and an initial learning rate of 0.36/0.02 for LVIS/COCO datasets. The pretrained positional embeddings are bilinearly interpolated to adjust to the size of patch embeddings of higher resolution [12]. We set the backbone learning rate lower (e.g., 0.1×) than the rest of the model to retain the pretraining knowledge during detection finetuning. We use the score combination of (α, β, δ) = (0.65, 0.3, 3) in Sec. 3.3. We use CLIP [39] prompt templates and take the average text embeddings of each category.

4.1. Open-vocabulary Object Detection

LVIS benchmark. We conduct evaluation on the LVIS dataset [20] which contains a large and diverse set of 1203
| method | image backbone size | COCO I2T R@1 | COCO I2T R@5 | COCO I2T R@10 | COCO T2I R@1 | COCO T2I R@5 | COCO T2I R@10 | Flickr30K I2T R@1 | Flickr30K I2T R@5 | Flickr30K I2T R@10 | Flickr30K T2I R@1 | Flickr30K T2I R@5 | Flickr30K T2I R@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP [39] | 302M | 58.4 | 81.5 | 88.1 | 37.8 | 62.4 | 72.2 | 88.0 | 98.7 | 99.4 | 68.7 | 90.6 | 95.2 |
| ALIGN [26] | 408M | 58.6 | 83.0 | 89.7 | 45.6 | 69.8 | 78.6 | 88.6 | 98.7 | 99.7 | 75.7 | 93.8 | 96.8 |
| FLAVA [44] | 86M | 42.7 | 76.8 | - | 38.4 | 67.5 | - | 67.7 | 94.0 | - | 65.2 | 89.4 | - |
| FILIP [50] | 302M | 61.3 | 84.3 | 90.4 | 45.9 | 70.6 | 79.3 | 89.8 | 99.2 | 99.8 | 75.0 | 93.4 | 96.3 |
| Florence [52] | 637M | 64.7 | 85.9 | - | 47.2 | 71.4 | - | 90.9 | 99.1 | - | 76.7 | 93.6 | - |
| CoCa-Large [51] | 303M | 65.4 | 85.6 | 91.4 | 50.1 | 73.8 | 81.8 | 91.4 | 99.2 | 99.9 | 79.0 | 95.1 | 97.4 |
| CoCa [51] | 1B | 66.3 | 86.2 | 91.8 | 51.2 | 74.2 | 82.0 | 92.5 | 99.5 | 99.9 | 80.4 | 95.7 | 97.7 |
| RO-ViT (ours) | 303M | 68.9 | 87.8 | 92.2 | 51.8 | 75.0 | 83.0 | 92.1 | 99.4 | 99.7 | 80.7 | 96.1 | 97.7 |

Table 3. Zero-shot image-text retrieval results on COCO (5K test set) and Flickr30K (1K test set) benchmarks. We compare with dual-encoder methods. We achieve state-of-the-art results on COCO benchmark. We outperform CoCa-Large with the same backbone by +3.5 / +1.7 R@1, and even surpass CoCa with 3× larger backbone (ViT-Giant) by +2.6 / +0.6 R@1 on image-to-text / text-to-image retrieval. We also match or outperform the state-of-the-art methods on Flickr benchmark.

object categories suitable for open-vocabulary detection. Following the existing works [13, 19, 59], we use the frequent and common categories as base categories $C_B$ for training, and hold out the rare categories as novel categories $C_N$ for testing. Mask APr is the main benchmark metric. We report the mean over three runs following [19].

As shown in Table 1, our best model achieves 32.1 APr, which significantly outperforms the best existing ViT-based approach OWL-ViT [35] by +6.5 points. Notably, our method with the smaller ViT-B/16 backbone surpasses OWL-ViT with ViT-L/14 by +3.0 APr. Compared to the best existing approach (ViLD-Ens with EffNet-B7 backbone), we outperform by +5.8 APr. We note that RO-ViT is simply trained end-to-end with the cross-entropy loss without the employment of LVIS-tailored losses such as federated loss [35, 60, 63], weak supervision [63], or self-training [41, 57, 60].

COCO benchmark. We also present the comparison on the COCO benchmark where the setup uses 48 base categories for training and 17 novel categories for testing [19]. The main metric is AP50 of novel categories. Due to the smaller number of training categories, we observe a tendency to overfit to these categories. Unlike most competing methods, RO-ViT does not use any auxiliary objectives to mitigate overfitting such as pseudo box/mask labels [14, 25, 41, 57, 60], knowledge distillation [13, 19], or weak supervision [63]. Still, Table 2 shows that RO-ViT is very competitive among the other methods trained with various sources. In addition, RO-ViT represents the first ViT-based method to be reported on this benchmark, as the other method OWL-ViT [35] only benchmarks on LVIS.

4.2. Image-Text Retrieval

Apart from evaluating region-level representation through open-vocabulary detection, we evaluate the image-level representation of RO-ViT in image-text retrieval through the MS-COCO [7] and Flickr30K [38] benchmarks. We further train RO-ViT (ViT-L/16), the same model as the last row of Tables 1 and 4, for 40K more iterations at a higher resolution, e.g. 448, following the standard of existing works [26, 51].

Table 3 shows the comparison with other dual-encoder methods on zero-shot image-to-text (I2T) and text-to-image (T2I) retrieval. Surprisingly, RO-ViT outperforms all published works on the MS COCO benchmark, and is on par with the state of the art [51] on Flickr. Compared to CoCa [51] with the same backbone capacity (ViT-L), RO-ViT outperforms on 11 out of 12 image-text retrieval metrics. Our model has 303M parameters and achieves the best performance overall. This shows that our pretraining method not only improves the region-level representation for open-vocabulary detection but also the global image-level representation for retrieval.

4.3. Transfer Object Detection

We evaluate the generalization ability of RO-ViT in zero-shot transfer detection. We use the same detector trained on the LVIS base categories (Sec. 4.1) and test on the Objects365-v1 validation split [43] following the setup of ViLD [13, 19]. We replace the LVIS with Objects365 vocabulary embeddings to perform the transfer detection without finetuning.

| method | backbone | AP | AP50 | AP75 |
|---|---|---|---|---|
| supervised [19] | R-50 | 25.6 | 38.6 | 28.0 |
| ViLD [19] | R-50 | 11.8 | 18.2 | 12.6 |
| DetPro [13] | R-50 | 12.1 | 18.8 | 12.9 |
| RO-ViT (ours) | ViT-B/16 | 14.0 | 22.3 | 14.9 |
| RO-ViT (ours) | ViT-L/16 | 17.1 | 26.9 | 18.5 |

Table 4. Transfer detection on Objects365 (Box APs). All models are trained on the LVIS base categories and tested on Objects365 dataset, without finetuning.

Table 4 summarizes the box AP scores in comparison with prior works. Our best model achieves 17.1 AP, which outperforms existing works ViLD by +5.3 AP and DetPro
pretraining method CPE focal APr AP frozen backbone APr AP bblr APr AP
base 21.4 (+0.0) 26.6 base 9.7 (+0.0) 12.2 0.0 16.5 17.1
no PE 18.7 (+0.0) 25.2 CPE 16.2 (+6.5) 17.1 0.001 19.7 22.9
SinCos PE 21.7 (+0.0) 26.9 CPE + focal 16.5 (+6.8) 17.1 0.01 20.4 23.0
feat Crop-Resize 20.6 (+0.0) 26.6 0.1 24.3 27.6
ours X 23.8 (+2.4) 27.4 1.0 17.9 25.1
ours X 21.9 (+0.5) 27.4
ours X X 24.3 (+2.9) 27.6
(a) Pretraining strategy: ‘base’ has the standard learnable po- (b) Frozen backbone study: Only the newly (c) Ablating backbone
sitional embeddings (PE). No PE, randomly crop and resize the added detector layers are trained in detection finetuning learning rate
last feature map (feat Crop-Resize), and sinusoidal PE are ei- finetuning, to directly evaluate the pretrained ratio (‘bblr’) w.r.t. added
ther slightly worse or help marginally. Our proposed Cropped features for open-vocabulary detection. Our detector layers.
PE (CPE) improves by +2.4 APr . Focal contrastive loss brings proposed CPE improves the baseline by a large
the gain further to +2.9 APr . margin of +6.5 APr .

backbone   loc.obj.   norm.lyr   APr           AP
ViT-B/16   -          -          21.4 (+0.0)   26.6
ViT-B/16   ✓          -          23.4 (+2.0)   27.2
ViT-B/16   -          ✓          22.0 (+0.6)   26.8
ViT-B/16   ✓          ✓          23.9 (+2.5)   27.5
ViT-L/16   -          -          25.1 (+0.0)   30.1
ViT-L/16   ✓          ✓          27.2 (+1.9)   30.6

(d) Downstream detector improvement: Combining the localization quality-based objectness from the proposal stage to the detection scoring (‘loc.obj.’) improves by ∼2 APr. Normalized classifier and mask output layer (‘norm.lyr’) gives additional gain.

backbone   batch   imp.(d)   CPE + focal   APr           AP
ViT-B/16   4096    -         -             21.4 (+0.0)   26.6
ViT-B/16   4096    ✓         -             23.9 (+0.0)   27.5
ViT-B/16   4096    ✓         ✓             26.2 (+2.3)   29.2
ViT-B/16   16384   ✓         -             26.4 (+0.0)   30.3
ViT-B/16   16384   ✓         ✓             28.0 (+1.6)   30.2
ViT-L/16   4096    ✓         -             27.2 (+0.0)   30.6
ViT-L/16   4096    ✓         ✓             29.9 (+2.7)   31.1
ViT-L/16   16384   ✓         ✓             32.1 (+0.0)   34.0

(e) Scalability w.r.t. model size and contrastive batch size: The benefit of our ‘CPE + focal’ holds with larger model and batch size, improving APr up to +2.7 points. ‘imp.(d)’ denotes the detector improvements in sub-table (d).

Table 5. Ablation studies. We follow the LVIS open-vocabulary detection benchmark. We train on base (‘frequent’ + ‘common’) categories, test on novel (‘rare’) categories, and report APr. We use a ViT-B/16 backbone and a contrastive batch size of 4096 unless otherwise noted.
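
To make the ‘CPE’ rows of Table 5a more concrete, the following is a minimal sketch of cropping and resizing a grid of positional embeddings. It is not the authors' implementation. The function names, the upsampled grid size (64), the crop scale range [0.1, 1.0], the aspect-ratio range [0.5, 2.0], and the choice of bilinear interpolation are all assumptions made for illustration.

```python
import random


def resize_bilinear(grid, out_h, out_w):
    """Bilinearly resize grid[h][w][d] (a grid of embedding vectors) to out_h x out_w."""
    in_h, in_w, dim = len(grid), len(grid[0]), len(grid[0][0])
    out = []
    for i in range(out_h):
        # Map output row i to a fractional input row (corner-aligned sampling).
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            row.append([
                (1 - wy) * ((1 - wx) * grid[y0][x0][k] + wx * grid[y0][x1][k])
                + wy * ((1 - wx) * grid[y1][x0][k] + wx * grid[y1][x1][k])
                for k in range(dim)
            ])
        out.append(row)
    return out


def cropped_positional_embedding(pe, up_size=64, min_scale=0.1, rng=random):
    """Upsample the PE grid, crop a random region, and resize it back.

    All hyperparameters here are assumptions of this sketch.
    """
    h, w = len(pe), len(pe[0])
    up = resize_bilinear(pe, up_size, up_size)
    # Sample the crop area (as a fraction of the upsampled grid) and aspect ratio.
    scale = rng.uniform(min_scale, 1.0)
    ratio = rng.uniform(0.5, 2.0)
    crop_h = max(2, min(up_size, round(up_size * (scale / ratio) ** 0.5)))
    crop_w = max(2, min(up_size, round(up_size * (scale * ratio) ** 0.5)))
    top = rng.randint(0, up_size - crop_h)
    left = rng.randint(0, up_size - crop_w)
    crop = [r[left:left + crop_w] for r in up[top:top + crop_h]]
    # The cropped region plays the role of the whole-image PE during pretraining.
    return resize_bilinear(crop, h, w)


# Toy 14 x 14 grid whose 2-d "embedding" simply encodes the patch position.
toy_pe = [[[i / 13.0, j / 13.0] for j in range(14)] for i in range(14)]
cpe = cropped_positional_embedding(toy_pe, rng=random.Random(0))
```

The output has the same grid shape as the input, so in this sketch it can be added to the patch embeddings exactly where the whole-image positional embeddings would be, which is consistent with the recipe adding no parameters.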

by +5.0 AP. Due to the different backbone capacity (R50 vs ViT), we note that this comparison is to showcase that RO-ViT can achieve strong cross-dataset detection generalization. For transfer detection, we assume all categories are novel and set α, β=(0.0, 0.65) in Equation 1.

4.4. Ablation Study

We provide ablation studies on the LVIS open-vocabulary detection and image-text retrieval benchmarks.

Pretraining strategy. Table 5a ablates our proposed methods in the contrastive image-text pretraining. We start with “base” that uses the common full-image positional embeddings and cross entropy loss. We explore different PE variants and observe that removing the positional embeddings hurts the performance (no PE), while sinusoidal PE (SinCos PE) yields a similar performance to the baseline. We also try the random crop and resize on the last backbone feature map (“feat Crop-Resize”) to use region features during pretraining but see no improvement. In contrast, our proposed Cropped Positional Embedding (CPE) offers a clear benefit of +2.4 APr and focal contrastive loss (focal) yields additional gains. Our method achieves a gain of +2.9 APr in total. This shows that CPE can learn strong region-level representations, and our focal contrastive learning helps preserve the knowledge about the pretrained image-text concepts, both of which are essential for open-vocabulary detection.

                           MS COCO        Flickr30K
backbone   focal   CPE     I2T    T2I     I2T    T2I
ViT-B/16   -       -       59.1   42.5    84.8   70.9
ViT-B/16   ✓       -       60.3   44.0    85.4   71.6
ViT-B/16   ✓       ✓       62.2   44.2    86.5   71.8
ViT-L/16   ✓       ✓       67.0   49.7    89.5   77.2

Table 6. Pretraining evaluation on zero-shot image-text retrieval (Recall@1). We evaluate the image-level representation of our pretrained model on COCO and Flickr30k retrieval tasks. We ablate the focal contrastive loss, Cropped Positional Embedding (CPE) and backbone size.

Image-text retrieval. In Table 6, we demonstrate that our proposed focal contrastive loss and Cropped Positional Embedding (CPE) both improve the image-text retrieval. We use a batch size of 16384, image size 224, and ViT-B/16 backbone as our baseline, and report Recall@1 metrics on MS COCO and Flickr datasets. We first observe that focal loss brings clear improvements on all metrics, and adding CPE brings further improvements with a total of +3.1 / +1.7 image-to-text (I2T) and +1.7 / +0.9 text-to-image (T2I) improvements on the MS COCO / Flickr datasets. By replacing ViT-B/16 with ViT-L/16, we observe marked improvements of +4.8 / +5.5 / +3.0 / +5.4 on these metrics, showing
that our recipe is highly scalable with backbone capacity.

Frozen backbone study. We investigate the effectiveness of the frozen backbone features in open-vocabulary detection in Table 5b. This evaluation setting is more rigorous for representation learning because finetuning an entire network evaluates not only the quality of the representations but also the initialization and optimization method, as discussed in Goyal et al. [18]. During the detection finetuning, all ViT backbone layers are frozen, and only the added detector layers (neck and heads) are trained. We observe that the frozen RO-ViT backbone improves the baseline by +6.8 APr, a much larger margin compared to the full-finetuning setting. This study shows the region-aware representation of RO-ViT is critical for downstream detection tasks.

[Figure 3: two panels, “baseline” and “ours”. Each panel is a 14 × 14 grid of tiles indexed by input patch row and input patch column (1 to 14), with a color scale showing cosine similarity from -1 to 1.]

Figure 3. Visualization of learned positional embeddings. Tiles show the cosine similarity between the positional embedding of the patch (at the indicated row-column position) and the positional embeddings of all other patches. ViT-B/16 backbone is used.
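
The tiles described in the Figure 3 caption can be computed along the following lines. This is a minimal sketch rather than the plotting code behind the figure: the helper names are hypothetical, the positional embeddings are assumed to be available as a nested list `pe[row][col]` of vectors (a 14 × 14 grid for ViT-B/16 at image size 224), and each tile here also contains the reference patch itself, whose similarity is 1.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def similarity_tile(pe, row, col):
    """One tile: similarity of pe[row][col] to the embedding of every patch."""
    ref = pe[row][col]
    return [[cosine_similarity(ref, vec) for vec in grid_row] for grid_row in pe]


def all_tiles(pe):
    """Grid of tiles, one per patch position, laid out as in the figure."""
    return [
        [similarity_tile(pe, r, c) for c in range(len(pe[0]))]
        for r in range(len(pe))
    ]


# Toy 3 x 3 grid with a hand-made embedding that encodes row and column.
toy_pe = [
    [[math.cos(i), math.sin(i), math.cos(j), math.sin(j)] for j in range(3)]
    for i in range(3)
]
tile_top_left = similarity_tile(toy_pe, 0, 0)
```

For this toy embedding, the similarity between patches (i, j) and (i', j') equals (cos(i - i') + cos(j - j')) / 2, so nearby patches are more similar, which is the locality pattern the section describes for both models.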
Backbone learning rate ratio. As discussed in Sec. 3.1, our open-vocabulary object detector depends on the pretrained knowledge in the backbone to recognize novel categories beyond the detector annotations. Therefore, it is important to set the backbone learning rates lower than the rest of the detector layers to prevent forgetting in the finetuning process. We present ablations on the learning rate ratio between the backbone and the detector layers in Table 5c. We found 0.1× to be the sweet spot. Higher ratios lead to forgetting and hurt APr, while lower ratios limit the ability to adapt to detection tasks and hurt AP overall.

Downstream detector improvements. In addition, Table 5d demonstrates our improvements in the downstream detector. Learning the objectness by localization quality, i.e., centerness (“loc.obj.”) [28], and leveraging it in the detection score improves APr by a clear gain of +1.9 points compared to using the conventional binary classification in the proposal. This indicates that novel, unlabeled objects are indeed often missed in the proposal stage, which may limit the final open-vocabulary detection performance. Localization-quality based objectness helps remedy this. The use of normalized activation in the last layers of the classifier and mask heads additionally improves the performance.

Scalability with respect to model size and batch size. While increasing the model capacity and batch size has been beneficial for contrastive learning in general [26, 37, 39, 51], we study their benefits in the downstream open-vocabulary detection in Table 5e. Note the detector improvements presented in Sec. 3.3 are applied. We first observe that increasing the batch size from the default 4096 to 16384 shows gains of +2.1∼2.5 APr for both ViT-B/L. Then, we notice that upgrading ViT-B to ViT-L brings +3.3∼3.7 APr gain across different batch sizes. Last but not least, the gain of our proposed “CPE + focal” pretraining is consistent among all settings by improving +2.2∼2.7 APr. Putting everything together, the best RO-ViT model achieves 32.1 APr.

4.5. Visualization of Positional Embeddings

In Fig. 3, we visualize and compare the learned positional embeddings of RO-ViT with the baseline, based on ViT-B/16 backbone. Each tile is the cosine similarity between positional embeddings of one patch and all other patches. The brightness patterns show that in both models, closer patches have more similar positional embeddings, indicating the encoded distance and locality within the image.

We observe a few differences between RO-ViT positional embeddings and the baseline. First, RO-ViT forms more distinct clusters at different patch locations, i.e., ours shows a symmetrical global pattern around the center patch, while the baseline has similar patterns on opposite ends of the image (e.g., the pattern in the top-left patch is similar to that of the bottom-right patch). Also, the brightness patterns of RO-ViT tend to be more concentrated and strong (i.e., near 1 or -1). To summarize, the visualization indicates RO-ViT positional embeddings acquire more structure and symmetry than the baseline in the pretraining process.

5. Conclusion

We present RO-ViT, a contrastive image-text pretraining framework to bridge the gap between image-level pretraining and open-vocabulary detection finetuning. Our methods are simple, scalable, and easy to apply to any contrastive backbones with minimal computation overhead and no increase in parameters. Experiments show that our RO-ViT achieves the state of the art on the LVIS open-vocabulary detection benchmark. Moreover, RO-ViT achieves the state of the art on 9 out of 12 metrics of the image-text retrieval benchmark, showing that the learnt representation is not only beneficial at region-level but also highly effective at image-level. We hope this study can help the research on open-vocabulary detection from the perspective of image-text pretraining, which can benefit both region-level and image-level tasks.
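
As an illustration of the focal contrastive loss ablated in Table 5a and Table 6, the sketch below applies the focal loss form of [33], -(1 - p_t)^γ log(p_t), to every image-text pair in a batch. It is a sketch under stated assumptions, not the authors' implementation: the sigmoid formulation, the temperature of 0.1, γ = 2, and the function names are all choices made here for illustration.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def focal_contrastive_loss(image_emb, text_emb, temperature=0.1, gamma=2.0):
    """Focal loss over all image-text pairs; matched pairs share the same index.

    temperature and gamma are assumed values, not taken from the paper.
    """
    n = len(image_emb)
    total = 0.0
    for i in range(n):
        for j in range(n):
            logit = sum(a * b for a, b in zip(image_emb[i], text_emb[j])) / temperature
            # p_t is the probability assigned to the correct binary label:
            # "match" on the diagonal, "no match" elsewhere.
            signed = logit if i == j else -logit
            log_pt = log_sigmoid(signed)
            pt = math.exp(log_pt)
            # Easy pairs (pt near 1) are down-weighted by (1 - pt) ** gamma.
            total += -((1.0 - pt) ** gamma) * log_pt
    return total / n


# Two unit-norm image embeddings with matching and mismatched text embeddings.
images = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = focal_contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
loss_swapped = focal_contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

Setting `gamma=0.0` removes the modulating factor and reduces this sketch to a plain sigmoid cross entropy, which makes the down-weighting of easy pairs easy to inspect in isolation.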

References

[1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan. Flamingo: a visual language model for few-shot learning, 2022. 1
[2] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6836–6846, 2021. 1
[3] Yutong Bai, Xinlei Chen, Alexander Kirillov, Alan Yuille, and Alexander C. Berg. Point-level region contrast for object detection pre-training. In CVPR, pages 16061–16070, June 2022. 2
[4] Ankan Bansal, Karan Sikka, Gaurav Sharma, Rama Chellappa, and Ajay Divakaran. Zero-shot object detection. In ECCV, 2018. 2
[5] Hangbo Bao, Li Dong, and Furu Wei. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021. 2
[6] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, pages 9650–9660, October 2021. 2
[7] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. In https://arxiv.org/abs/1504.00325, 2015. 6
[8] Xinlei Chen and Abhinav Gupta. Webly supervised learning of convolutional networks. In ICCV, 2015. 2
[9] Berkan Demirel, Ramazan Gokberk Cinbis, and Nazli Ikizler-Cinbis. Zero-shot object detection by hybrid region embedding. In BMVC, 2018. 2
[10] Karan Desai and Justin Johnson. Virtex: Learning visual representations from textual annotations. In CVPR, 2021. 2
[11] Santosh K Divvala, Ali Farhadi, and Carlos Guestrin. Learning everything about anything: Webly-supervised visual concept learning. In CVPR, 2014. 2
[12] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 1, 2, 3, 5
[13] Yu Du, Fangyun Wei, Zihe Zhang, Miaojing Shi, Yue Gao, and Guoqi Li. Learning to prompt for open-vocabulary object detection with vision-language model. In CVPR, 2022. 1, 2, 3, 5, 6
[14] Chengjian Feng, Yujie Zhong, Zequn Jie, Xiangxiang Chu, Haibing Ren, Xiaolin Wei, Weidi Xie, and Lin Ma. Promptdet: Towards open-vocabulary detection using uncurated images. In European Conference on Computer Vision, pages 701–717. Springer, 2022. 5, 6
[15] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc’Aurelio Ranzato, and Tomas Mikolov. Devise: A deep visual-semantic embedding model. In NeurIPS, 2013. 2
[16] Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D Cubuk, Quoc V Le, and Barret Zoph. Simple copy-paste is a strong data augmentation method for instance segmentation. In CVPR, 2021. 5
[17] Golnaz Ghiasi, Xiuye Gu, Yin Cui, and Tsung-Yi Lin. Scaling open-vocabulary image segmentation with image-level labels. In European Conference on Computer Vision, pages 540–557. Springer, 2022. 3
[18] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017. 8
[19] Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary object detection via vision and language knowledge distillation. In ICLR, 2022. 1, 2, 3, 5, 6
[20] Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. In CVPR, 2019. 5
[21] Nasir Hayat, Munawar Hayat, Shafin Rahman, Salman Khan, Syed Waqas Zamir, and Fahad Shahbaz Khan. Synthesizing the unseen for zero-shot object detection. In ACCV, 2020. 2
[22] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, pages 16000–16009, June 2022. 2
[23] Xiangteng He and Yuxin Peng. Fine-grained image classification via combining vision and language. In CVPR, 2017. 2
[24] Olivier J. Hénaff, Skanda Koppula, Jean-Baptiste Alayrac, Aaron van den Oord, Oriol Vinyals, and João Carreira. Efficient visual pretraining with contrastive detection. In ICCV, pages 10086–10096, October 2021. 2
[25] Dat Huynh, Jason Kuen, Zhe Lin, Jiuxiang Gu, and Ehsan Elhamifar. Open-vocabulary instance segmentation via robust cross-modal pseudo-labeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7020–7031, 2022. 5, 6
[26] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021. 2, 3, 4, 5, 6, 8
[27] Armand Joulin, Laurens van der Maaten, Allan Jabri, and Nicolas Vasilache. Learning visual features from large weakly supervised data. In ECCV, 2016. 2
[28] Dahun Kim, Tsung-Yi Lin, Anelia Angelova, In So Kweon, and Weicheng Kuo. Learning open-world object proposals without learning to classify. IEEE Robotics and Automation Letters, 7(2):5453–5460, 2022. 2, 5, 8

[29] Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, and Anelia Angelova. F-vlm: Open-vocabulary object detection upon frozen vision and language models. arXiv preprint arXiv:2209.15639, 2022. 1, 3
[30] Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and René Ranftl. Language-driven semantic segmentation. arXiv preprint arXiv:2201.03546, 2022. 3
[31] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, and Jianfeng Gao. Grounded language-image pre-training. In CVPR, 2022. 3
[32] Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for object detection. In ECCV, 2022. 1, 3
[33] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, pages 2980–2988, 2017. 4
[34] Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan, Rao Muhammad Anwer, and Ming-Hsuan Yang. Class-agnostic object detection with multi-modal transformer. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part X, pages 512–531. Springer, 2022. 5
[35] Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby. Simple open-vocabulary object detection with vision transformers. In ECCV, 2022. 2, 3, 5, 6
[36] Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, Andrea Frome, Greg S Corrado, and Jeffrey Dean. Zero-shot learning by convex combination of semantic embeddings. 2014. 2
[37] Hieu Pham, Zihang Dai, Golnaz Ghiasi, Hanxiao Liu, Adams Wei Yu, Minh-Thang Luong, Mingxing Tan, and Quoc V. Le. Combined scaling for zero-shot transfer learning. CoRR, abs/2111.10050, 2021. 2, 8
[38] Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, pages 2641–2649, 2015. 6
[39] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021. 2, 3, 4, 5, 6, 8
[40] Shafin Rahman, Salman Khan, and Nick Barnes. Improved visual-semantic alignment for zero-shot object detection. In AAAI, 2020. 2
[41] Hanoona Rasheed, Muhammad Maaz, Muhammad Uzair Khattak, Salman Khan, and Fahad Shahbaz Khan. Bridging the gap between object and image-level representations for open-vocabulary detection. arXiv preprint arXiv:2207.03482, 2022. 1, 2, 3, 5, 6
[42] Mert Bulent Sariyildiz, Julien Perez, and Diane Larlus. Learning visual representations with caption annotations. In ECCV, 2020. 2
[43] Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. In ICCV, 2019. 6
[44] Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela. Flava: A foundational language and vision alignment model. In CVPR, pages 15638–15650, 2022. 6
[45] Josiah Wang, Katja Markert, Mark Everingham, et al. Learning models for object recognition from natural language descriptions. In BMVC, 2009. 2
[46] Jiaqi Wang, Wenwei Zhang, Yuhang Zang, Yuhang Cao, Jiangmiao Pang, Tao Gong, Kai Chen, Ziwei Liu, Chen Change Loy, and Dahua Lin. Seesaw loss for long-tailed instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9695–9704, 2021. 5
[47] Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. Simvlm: Simple visual language model pretraining with weak supervision. arXiv preprint arXiv:2108.10904, 2021. 1
[48] Fangyun Wei, Yue Gao, Zhirong Wu, Han Hu, and Stephen Lin. Aligning pretraining for detection via object-level contrastive learning. In NeurIPS, 2021. 2
[49] Tete Xiao, Colorado J Reed, Xiaolong Wang, Kurt Keutzer, and Trevor Darrell. Region similarity representation learning. In ICCV, pages 10539–10548, October 2021. 2
[50] Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. Filip: Fine-grained interactive language-image pre-training. In ICLR, 2021. 6
[51] Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. TMLR, 2022. 5, 6, 8
[52] Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, Jianfeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, and Pengchuan Zhang. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, November 2021. 6
[53] Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, and Chen Change Loy. Open-vocabulary detr with conditional matching. arXiv preprint arXiv:2203.11876, 2022. 5
[54] Alireza Zareian, Kevin Dela Rosa, Derek Hao Hu, and Shih-Fu Chang. Open-vocabulary object detection using captions. In CVPR, 2021. 1, 3
[55] Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. Lit: Zero-shot transfer with locked-image text tuning. In CVPR, 2022. 2
[56] Bowen Zhang, Zhi Tian, Quan Tang, Xiangxiang Chu, Xiaolin Wei, Chunhua Shen, and Yifan Liu. Segvit: Semantic segmentation with plain vision transformers. arXiv preprint arXiv:2210.05844, 2022. 1
[57] Shiyu Zhao, Zhixing Zhang, Samuel Schulter, Long Zhao, Anastasis Stathopoulos, Manmohan Chandraker, Dimitris Metaxas, et al. Exploiting unlabeled data with vision and language models for object detection. In ECCV, 2022. 1, 2, 3, 5, 6
[58] Ye Zheng, Ruoran Huang, Chuanqi Han, Xi Huang, and Li Cui. Background learnable cascade for zero-shot object detection. In ACCV, 2020. 2
[59] Yiwu Zhong, Jing Shi, Jianwei Yang, Chenliang Xu, and Yin Li. Learning to generate scene graph from natural language supervision. In ICCV, 2021. 2, 6
[60] Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, and Jianfeng Gao. Regionclip: Region-based language-image pretraining. In CVPR, 2022. 1, 2, 3, 5, 6
[61] Chong Zhou, Chen Change Loy, and Bo Dai. Extract free dense labels from clip. In ECCV, 2022. 3
[62] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In CVPR, 2022. 3
[63] Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, and Ishan Misra. Detecting twenty-thousand classes using image-level supervision. In ECCV, 2022. 1, 3, 5, 6
[64] Pengkai Zhu, Hanxiao Wang, and Venkatesh Saligrama. Don’t even look once: Synthesizing features for zero-shot detection. In CVPR, 2020. 2