Kim et al. - 2023 - Region-Aware Pretraining for Open-Vocabulary Objec
supervised tasks [5, 6, 22], it is important to understand how to build capable open-vocabulary detectors with vision transformers [12, 35].
To the best of our knowledge, all existing works assume pretrained Vision-Language Models (VLM) are given, and develop adaptation or finetuning recipes to bridge the gap between image-level pretraining and object-level finetuning [13, 19, 41, 57, 60]. Since the VLMs are designed for image-level tasks, e.g., classification and retrieval, the notion of objects/regions is not adequately utilized in the pretraining process. We believe it would be beneficial for open-vocabulary detection if we bake in locality information in the image-text pretraining.
We present RO-ViT, a simple recipe to pretrain vision transformers in a region-aware manner for open-vocabulary object detection. Standard pretraining typically uses full-image positional embeddings, which do not generalize well to detection tasks. Thus, we propose a novel positional embedding scheme called “Cropped Positional Embedding” which better matches the use of region crops in detection finetuning (see Fig. 1). In addition, we found it beneficial to replace the softmax cross entropy loss with focal loss in contrastive learning, which affords us more control to learn from harder and more informative examples. Finally, we leverage recent advances in novel object proposals [28] to improve open-vocabulary detection finetuning. The motivation is that existing approaches often miss novel objects in the object proposal stage because the proposals tend to overfit to the foreground categories.
We evaluate the approach on the standard LVIS and COCO open-vocabulary benchmarks. On LVIS, our best model achieves a state-of-the-art 32.1 APr at the system level, surpassing the best existing approach by +5.8 APr. Compared to the best existing ViT-based approach, ours outperforms by a healthy margin of +6.5 APr. Our LVIS-trained model outperforms existing baselines on transfer detection to Objects365 without re-training. Although not explicitly optimized for retrieval, RO-ViT surprisingly achieves state-of-the-art performance on 9 out of 12 metrics in image-text retrieval benchmarks and outperforms strong baselines with standard positional embeddings, cross entropy loss, and larger model capacity. We conduct ablations to confirm the benefits of each proposed component. Visualization of the learnt positional embeddings shows that our approach (CPE) leads to more symmetrical and diverse patterns than the baseline. In summary:
• We propose RO-ViT to bridge the positional embeddings in image-text pretraining to open-vocabulary detection finetuning.
• We show that image-text pretraining with focal loss is more effective than existing softmax CE loss.
• We improve the open-vocabulary detection finetuning recipe with novel object proposals.
• RO-ViT achieves the state of the art on the LVIS open-vocabulary detection benchmark, and on 9 out of 12 metrics on COCO and Flickr image-text retrieval benchmarks.
We hope these findings will facilitate the research community to further explore open-vocabulary detection from the perspective of image-text pretraining with potential benefits for both region-level and image-level tasks.
2. Related Works
Learning open-vocabulary and zero-shot recognition. Building general representation for open-vocabulary and zero-shot recognition is a fundamental machine learning task. DeViSE [15] and ConSE [36] are pioneering works to learn a shared image-text embedding space for zero-shot recognition by convolutional networks. As image and language often co-occur on the internet, the community has explored learning such representation from various data sources including image tags [8, 11, 27], captions/text descriptions [10, 23, 42, 45, 59], alt-texts [26, 37, 55], and image search queries [39]. Recently, contrastive learning has become a popular paradigm to learn from large image-text corpora due to its simplicity and scalability, where increasing model size and data yield consistent improvement on open-vocabulary classification benchmarks. The aforementioned works focus on image-level recognition, whereas we study how to best tailor them for region-level understanding.
Self-supervised representation learning for localization. Due to the challenge of scaling up labeling for localization tasks, many efforts have been made to learn locality-sensitive representation in a self-supervised manner. Existing approaches roughly fall into the contrastive or generative paradigms. Contrastive approaches typically involve region or point-level contrastive learning using sliding windows [49], object proposals [24, 48], or point samples [3]. Most existing contrastive methods are CNN-based, while ViT backbones have been popular recently with generative approaches [5, 22] and have shown promise with contrastive approaches [6]. Masked image modeling is commonly used to learn the ViT features for localization tasks. Even though these self-supervised methods are well-suited for standard localization tasks, they lack the image-text correspondence necessary for open-vocabulary recognition. Thus, we study contrastive image-text representation learning for open-vocabulary detection with ViTs.
Open-vocabulary object detection and segmentation. Zero-shot detection was proposed to scale up detection models beyond their limited training categories. Popular approaches accomplish this by learning the alignment between region visual representation and category word embeddings [4, 9, 40, 58] or hallucinating visual features with a generative model [21, 64]. Without any visual knowledge of the novel categories, zero-shot detectors tend to struggle
with low performance. This motivates the open-vocabulary detection benchmark [54] to bridge the gap between zero-shot and fully-supervised detection.
With the rise of image-text pretraining, many works have explored adapting these pretrained models to open-vocabulary detection and segmentation [17, 19, 30, 60, 61]. For example, ViLD [19] proposes to distill the image-text knowledge into the detector. DetPro [13] improves ViLD by incorporating the idea of prompt optimization [62]. Moreover, region-text self-training has been shown effective on image caption data [60], classification data [41], or even unlabeled data [57]. Weak supervision [63] and phrase grounding [31] approaches have also been proposed. In terms of architecture, CNNs are most commonly used but vision transformers have recently been adopted as well [35, 61]. While existing works typically assume image-text pretrained models are given and focus on finetuning or adaptation strategies, our focus is to improve the upstream image-text pretraining with vision transformers.
3. Method
conventional fixed-size classifier fully-connected layer with the text embeddings of base categories [19, 54]. It is important that the text embeddings come from the matching text encoder of the image encoder during pretraining, so that the open-vocabulary knowledge would be preserved. We represent the background category by a “background” phrase and the proposals not matched to any annotations in $C_B$ are labeled as background.
During training, we compute the detection scores $p_i$ for each region $i$ as the cosine similarity between the RoI-Align feature (i.e., region embedding) and the text embeddings of $C_B$, followed by a softmax. At test time, we expand the text embeddings from $C_B$ to $C_B \cup C_N$ plus the “background” category for open-vocabulary detection. We extract the VLM embedding of region $i$ by RoI-Align on the output feature map of the ViT backbone, and compute the VLM region scores $z_i$ as the cosine similarity with the $C_B \cup C_N$ text embeddings. Similarly, the detection scores $p_i$ are now computed with the $C_B \cup C_N$ text embeddings. The combined open-vocabulary detection score $s_i^{\text{OVD}}$ is obtained by geometric means [19, 29]:
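The equation that follows this sentence (Equation 1) is not present in this copy of the text. As a hedged illustration only, the sketch below assumes a weighted geometric mean in which the detection score gets exponent alpha for base categories and beta for novel categories, and the VLM score gets the complementary exponent. That assignment of exponents, the function name `combine_scores`, and the toy inputs are assumptions, not the authors' stated formula; the default weights are the (α, β) = (0.65, 0.3) reported in the experimental details.

```python
import numpy as np

def combine_scores(p, z, is_base, alpha=0.65, beta=0.3):
    """Weighted geometric mean of detection scores p and VLM scores z.

    ASSUMED form (Equation 1 is missing from the text):
        s = z**(1 - w) * p**w,
    with w = alpha for base categories and w = beta for novel categories.
    """
    p = np.asarray(p, dtype=np.float64)
    z = np.asarray(z, dtype=np.float64)
    w = np.where(np.asarray(is_base, dtype=bool), alpha, beta)
    return z ** (1.0 - w) * p ** w

# Toy example: one region scored against a base and a novel category.
p = np.array([0.8, 0.2])   # detection scores
z = np.array([0.5, 0.6])   # VLM region scores
s = combine_scores(p, z, is_base=[True, False])
```

Setting both weights to 0 returns the VLM scores unchanged, and setting both to 1 returns the detection scores, which makes the role of each weight easy to check.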
[Figure 2 panels: Focal contrastive image-text pretraining; Cropped Positional Embedding (CPE); RO-ViT: downstream open-vocabulary detector.]
Figure 2. RO-ViT framework overview. Our Region-aware Open-vocabulary Vision Transformer (RO-ViT) attempts to bridge the gap
between the image-level vision-language model (VLM) pretraining and downstream open-vocabulary detection. For the pretraining, we
propose Cropped Positional Embedding (CPE) which randomly crops and resizes a region of positional embeddings instead of using the
whole-image PE. In addition, we use focal loss instead of the common softmax cross entropy loss for contrastive learning. The pretrained
ViT backbone is transferred to the downstream open-vocabulary detection by replacing the global average pooling with detector heads.
RO-ViT detector takes the whole-image positional embeddings as input, and the detections are used to crop out region embeddings from
the ViT backbone features. The region embeddings then match with the cached category embeddings to obtain the VLM score z, which is
combined with the detection score p into the open-vocabulary detection score sOVD (see Equation 1).
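The Cropped Positional Embedding step described in this caption can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the 64×64 up-sampled grid, the scale range [0.1, 1.0] and the aspect range [0.5, 2.0] are taken from the paper's pretraining details, while the helper names, the corner-aligned bilinear resize, the reading of "scale ratio" as an area fraction, and the use of rejection sampling to satisfy the constraints are all assumptions.

```python
import numpy as np

def bilinear_resize(grid, out_h, out_w):
    """Bilinearly resize an (H, W, D) grid to (out_h, out_w, D), corner-aligned."""
    in_h, in_w, _ = grid.shape
    ys = np.linspace(0.0, in_h - 1.0, out_h)
    xs = np.linspace(0.0, in_w - 1.0, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    top = grid[y0][:, x0] * (1 - wx) + grid[y0][:, x1] * wx
    bottom = grid[y1][:, x0] * (1 - wx) + grid[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def sample_crop_box(rng, scale=(0.1, 1.0), aspect=(0.5, 2.0), max_tries=1000):
    """Sample (x1, y1, x2, y2) in normalized coordinates.

    Assumption: "scale ratio" is the crop's area fraction, enforced here by
    rejection sampling together with the aspect-ratio range.
    """
    for _ in range(max_tries):
        x1, y1 = rng.uniform(0.0, 1.0, size=2)
        x2 = rng.uniform(x1, 1.0)
        y2 = rng.uniform(y1, 1.0)
        w, h = x2 - x1, y2 - y1
        if scale[0] <= w * h <= scale[1] and aspect[0] <= w / h <= aspect[1]:
            return x1, y1, x2, y2
    return 0.0, 0.0, 1.0, 1.0  # fall back to the full grid

def cropped_positional_embedding(pos_emb, rng, up_size=64):
    """Up-sample the PE grid, crop a random region, resize back to the input size."""
    h, w, _ = pos_emb.shape
    up = bilinear_resize(pos_emb, up_size, up_size)
    x1, y1, x2, y2 = sample_crop_box(rng)
    # Convert normalized coordinates to index ranges covering at least one cell.
    c0 = min(int(x1 * up_size), up_size - 1)
    r0 = min(int(y1 * up_size), up_size - 1)
    c1 = min(up_size, max(c0 + 1, int(np.ceil(x2 * up_size))))
    r1 = min(up_size, max(r0 + 1, int(np.ceil(y2 * up_size))))
    return bilinear_resize(up[r0:r1, c0:c1], h, w)

rng = np.random.default_rng(0)
pos_emb = rng.normal(size=(14, 14, 8))   # toy 14x14 grid, embedding width 8
cpe = cropped_positional_embedding(pos_emb, rng)
```

The returned grid has the same shape as the input, so it can be added to the patch embeddings in place of the full-image positional embeddings during pretraining.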
There is a mismatch between the way the positional embeddings are used in existing contrastive pretraining approaches and open-vocabulary detection finetuning. The pretraining approaches typically apply full-image positional embeddings [26, 39] during training, and use the same positional embeddings for downstream tasks, e.g., zero-shot recognition. However, the recognition occurs at region-level for open-vocabulary detection finetuning, which requires the full-image positional embeddings to generalize to regions that they never see during the pretraining.
To bridge this gap, we propose Cropped Positional Embedding (CPE) (see center of Figure 2). First, we up-sample the positional embeddings from the image size typical for pretraining, e.g., 224, to that typical for detection tasks, e.g., 1024. Then we randomly crop and resize a region from the up-sampled positional embeddings and use that as the image-level positional embeddings during pretraining. The regions are uniformly sampled from the normalized coordinates as $x_1 \sim \text{Uniform}(0, 1)$, $y_1 \sim \text{Uniform}(0, 1)$, $x_2 \sim \text{Uniform}(x_1, 1)$, $y_2 \sim \text{Uniform}(y_1, 1)$, while keeping the crop scale ratio in [0.1, 1.0].
Intuitively, this causes the model to view an image not as a full image in itself, but as a region crop from some larger unknown image. This better matches the downstream use case of detection where recognition occurs at region- rather than image-level.
Focal Loss. We desire to have finer control over how hard examples are weighted than what the softmax cross entropy loss can provide. Focal loss offers a natural option to tune the weights of hard examples [33]. Let $v_i$ and $l_i$ be the normalized image and text embeddings, and the image-to-text (I2T) contrastive losses be $L_{\text{softmax}}$ and $L_{\text{focal}}$ for the softmax (baseline) or focal loss (RO-ViT). We define the losses mathematically below:
\[ L_{\text{softmax}} = -\frac{1}{B} \sum_{i=1}^{B} \log\left( \frac{\exp(v_i l_i / \tau)}{\sum_{j=1}^{B} \exp(v_i l_j / \tau)} \right) \tag{2} \]
\[ L_{\text{focal}} = -\frac{1}{B} \sum_{i=1}^{B} \sum_{j=1}^{B} (1 - p_i)^{\gamma} \log(p_i) \tag{3} \]
where $p_i$ denotes the true class probability as below:
\[ p_i = \begin{cases} \sigma(v_i l_j / \tau) & \text{if } i = j \\ 1 - \sigma(v_i l_j / \tau) & \text{if } i \neq j \end{cases} \tag{4} \]
Here $\sigma$ denotes the sigmoid function. We adopt the simpler non alpha-balanced form of focal loss [33]. The text-to-image (T2I) contrastive losses are symmetrical with the I2T losses by simply exchanging the inner/outer summation loops. The total loss is the sum of both I2T and T2I losses.
3.3. Open-vocabulary Detector Finetuning
Here we present two simple techniques to improve the downstream open-vocabulary detector (Sec. 3.1). Despite the backbone features pretrained from the vast open-vocabulary data, the added detector layers (neck and heads)
are newly trained with the downstream detection dataset (e.g., LVIS base categories). Existing approaches often miss novel/unlabeled objects in the object proposal stage because the proposals tend to classify them as background. To remedy this, we leverage recent advances in novel object proposal methods [28] and adopt the localization quality-based objectness (i.e., centerness score) instead of the object-or-not binary classification score. Following [28], we use a single anchor per location and combine the predicted objectness score $o_i$ with the ensemble detection score in Equation 1 to obtain the final OVD score as: $S_i^{\text{OVD}} = o_i^{\delta} \cdot s_i^{\text{OVD}}$.
Additionally, we replace the standard classifier and mask output layer with the normalized layers [46]. This is to reduce their scale variance over categories by L2-normalizing the weights $w$ and features $x$ as: $f(x; w, b, \tau) = \frac{\tau}{\|w\|_2 \|x\|_2} w^T x + b$, where $\tau = 20$. Although we do not have rare categories at training (i.e., open-vocabulary setting), we empirically found it beneficial (see Sec. 4).

method                 pretrained model   detector backbone   APr    AP
ConvNet based:
DetPro-Cascade [13]    ViT-B/32           R-50                20.0   27.0
Detic-CN2 [63]         ViT-B/32           R-50                24.6   32.4
RegionCLIP [60]        R-50x4             R-50x4              22.0   32.3
ViLD-Ens [19]          ViT-B/32           R-152               18.7   26.0
ViLD-Ens [19]          ViT-L/14           EffNet-B7           21.7   29.6
ViLD-Ens [19]          EffNet-B7          EffNet-B7           26.3   29.3
VL-PLM [57]            ViT-B/32           R-50                17.2   27.0
OV-DETR [53]           ViT-B/32           R-50                17.4   26.6
Rasheed et al. [41]    ViT-B/32           R-50                21.1   25.9
PromptDet [14]         ViT-B/32           R-50                21.4   25.3
ViT based:
OWL-ViT [35]           ViT-H/14           ViT-H/14            23.3   35.3
OWL-ViT [35]           ViT-L/14           ViT-L/14            25.6   34.7
RO-ViT (ours)          ViT-B/16           ViT-B/16            28.0   30.2
RO-ViT (ours)          ViT-L/14           ViT-L/14†           31.4   34.0
RO-ViT (ours)          ViT-L/16           ViT-L/16            32.1   34.0
Table 1. LVIS open-vocabulary object detection (mask APs). RO-ViT outperforms the best existing approach by +5.8 APr on novel categories. When using ViT-Base and Large, RO-ViT outperforms OWL-ViT based on ViT-Large by +3.0 and +6.5 APr, respectively. All methods use the same instance-level supervision from LVIS base categories for detection training. †: We use image size 860 during detection for a fair comparison with OWL-ViT.

method                 pretrained model   detector backbone   novel AP   AP
ConvNet based:
ViLD [19]              ViT-B/32           R-50                27.6       51.3
OV-DETR [53]           ViT-B/32           R-50                29.4       52.7
w/ pseudo box labels:
XPM et al. [25]        R-50               R-50                27.0       41.2
RegionCLIP [60] †      R-50x4             R-50x4              39.3       55.7
PromptDet [14]         ViT-B/32           R-50                26.6       50.6
VL-PLM [57]            ViT-B/32           R-50                34.4       53.5
Rasheed et al. [41] ‡  ViT-B/32           R-50                36.9       51.5
w/ weak supervision:
Detic-CN2 [63]         ViT-B/32           R-50                24.6       32.4
ViT based:*
RO-ViT (ours)          ViT-B/16           ViT-B/16            30.2       41.5
RO-ViT (ours)          ViT-L/16           ViT-L/16            33.0       47.7
Table 2. COCO open-vocabulary object detection (box AP50). RO-ViT represents the first ViT-based approach and demonstrates a very competitive novel AP without using pseudo labeling or weak supervision. †: RegionCLIP uses an off-the-shelf RPN during its pretraining. ‡: Rasheed et al. use an external MViT detector [34] during pretraining. *: The other ViT-based method [35] reports its results on LVIS only.

4. Experimental Results
Pretraining details. Our contrastive image-text pretraining is performed from scratch. We adopt the widely used ViT-B/16 and ViT-L/16 as the image encoder. The input image size is 224×224, which results in 14×14 positional embeddings with patch size 16×16. To generate the Cropped Positional Embedding (CPE), we first interpolate the positional embeddings to size 64×64. We randomly crop a region with the scale ratio in [0.1, 1.0], and the aspect ratio in [0.5, 2.0]. The region crop is resized back to the size 14×14 (i.e., CPE), and is added to the patch embeddings. We use the global average pooling at the last ViT layer to obtain the image embedding. The text encoder is a 12-layer Transformer following [39, 51], and the maximum length of the input text is 64 tokens. Both image and text embeddings are L2 normalized and trained with focal loss. We train on the same image-text dataset as Jia et al. [26]. Unless otherwise specified, we use a batch size of 4096 for ablation and 16384 to compare with other methods. We use the AdamW optimizer with learning rate of 5e-4 and a linear warmup of 10k steps, and train for 500k iterations.
Downstream detection details. We train RO-ViT with base categories $C_B$ for 46.1k/11.3k iterations with image size 1024, large scale jittering [16], batch size 256/128, the SGD optimizer with weight decay 1e-4/1e-2, momentum 0.9 and an initial learning rate of 0.36/0.02 for LVIS/COCO datasets. The pretrained positional embeddings are bilinearly interpolated to adjust to the size of patch embeddings of higher resolution [12]. We set the backbone learning rate lower (e.g., 0.1×) than the rest of the model to retain the pretraining knowledge during detection finetuning. We use the score combination of (α, β, δ) = (0.65, 0.3, 3) in Sec. 3.3. We use CLIP [39] prompt templates and take the average text embeddings of each category.
4.1. Open-vocabulary Object Detection
LVIS benchmark. We conduct evaluation on the LVIS dataset [20] which contains a large and diverse set of 1203
                                  MS COCO (5K test set)                      Flickr30K (1K test set)
                   backbone       image-to-text       text-to-image          image-to-text       text-to-image
method             size           R@1   R@5   R@10    R@1   R@5   R@10       R@1   R@5   R@10    R@1   R@5   R@10
CLIP [39]          302M           58.4  81.5  88.1    37.8  62.4  72.2       88.0  98.7  99.4    68.7  90.6  95.2
ALIGN [26]         408M           58.6  83.0  89.7    45.6  69.8  78.6       88.6  98.7  99.7    75.7  93.8  96.8
FLAVA [44]         86M            42.7  76.8  -       38.4  67.5  -          67.7  94.0  -       65.2  89.4  -
FILIP [50]         302M           61.3  84.3  90.4    45.9  70.6  79.3       89.8  99.2  99.8    75.0  93.4  96.3
Florence [52]      637M           64.7  85.9  -       47.2  71.4  -          90.9  99.1  -       76.7  93.6  -
CoCa-Large [51]    303M           65.4  85.6  91.4    50.1  73.8  81.8       91.4  99.2  99.9    79.0  95.1  97.4
CoCa [51]          1B             66.3  86.2  91.8    51.2  74.2  82.0       92.5  99.5  99.9    80.4  95.7  97.7
RO-ViT (ours)      303M           68.9  87.8  92.2    51.8  75.0  83.0       92.1  99.4  99.7    80.7  96.1  97.7
Table 3. Zero-shot image-text retrieval results on COCO and Flickr30K benchmarks. We compare with dual-encoder methods. We
achieve state-of-the-art results on COCO benchmark. We outperform CoCa-Large with the same backbone by +3.5 / +1.7 R@1, and even
surpass CoCa with 3× larger backbone (ViT-Giant) by +2.6 / +0.6 R@1 on image-to-text / text-to-image retrieval. We also match or
outperform the state-of-the-art methods on the Flickr benchmark.
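For readers unfamiliar with the R@K columns above, the sketch below shows one common way to compute Recall@K from a query-by-candidate similarity matrix. It is illustrative only and simplifies the benchmark protocol by assuming exactly one correct candidate per query; the function name `recall_at_k` and the toy similarity matrix are assumptions, not part of the paper.

```python
import numpy as np

def recall_at_k(similarity, gt_index, ks=(1, 5, 10)):
    """Fraction of queries whose correct candidate ranks within the top k.

    similarity: (num_queries, num_candidates) score matrix.
    gt_index:   index of the single correct candidate for each query
                (a simplifying assumption; real benchmarks may have several).
    """
    similarity = np.asarray(similarity, dtype=np.float64)
    gt_index = np.asarray(gt_index)
    gt_scores = similarity[np.arange(len(gt_index)), gt_index]
    # Rank = number of candidates scoring strictly higher than the correct one.
    ranks = (similarity > gt_scores[:, None]).sum(axis=1)
    return {k: float((ranks < k).mean()) for k in ks}

# Toy example with 3 queries and 3 candidates.
sim = np.array([[0.9, 0.1, 0.0],
                [0.2, 0.3, 0.5],
                [0.1, 0.8, 0.7]])
recalls = recall_at_k(sim, gt_index=[0, 1, 2], ks=(1, 2))
```

In this toy case only the first query ranks its correct candidate first, while all three rank it within the top two.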
object categories suitable for open-vocabulary detection. Following the existing works [13, 19, 59], we use the frequent and common categories as base categories $C_B$ for training, and hold out the rare categories as novel categories $C_N$ for testing. Mask APr is the main benchmark metric. We report the mean over three runs following [19].
As shown in Table 1, our best model achieves 32.1 APr, which significantly outperforms the best existing ViT-based approach OWL-ViT [35] by +6.5 points. Notably, our method with the smaller ViT-B/16 backbone surpasses OWL-ViT with ViT-L/14 by +3.0 APr. Compared to the best existing approach (ViLD-Ens with EffNet-B7 backbone), we outperform by +5.8 APr. We note that RO-ViT is simply trained end-to-end with the cross-entropy loss without the employment of LVIS-tailored losses such as federated loss [35, 60, 63], weak supervision [63], or self-training [41, 57, 60].
COCO benchmark. We also present the comparison on the COCO benchmark where the setup uses 48 base categories for training and 17 novel categories for testing [19]. The main metric is AP50 of novel categories. Due to the smaller number of training categories, we observe a tendency to overfit to these categories. Unlike most competing methods, RO-ViT does not use any auxiliary objectives to mitigate overfitting such as pseudo box/mask labels [14, 25, 41, 57, 60], knowledge distillation [13, 19], or weak supervision [63]. Still, Table 2 shows that RO-ViT is very competitive among the other methods trained with various sources. In addition, RO-ViT represents the first ViT-based method to be reported on this benchmark, as the other method OWL-ViT [35] only benchmarks on LVIS.
4.2. Image-Text Retrieval
Apart from evaluating region-level representation through open-vocabulary detection, we evaluate the image-level representation of RO-ViT in image-text retrieval through the MS-COCO [7] and Flickr30K [38] benchmarks. We further train RO-ViT (ViT-L/16), the same model as the last row of Tables 1 and 4, for 40K more iterations at a higher resolution, e.g., 448, following the standard of existing works [26, 51].
Table 3 shows the comparison with other dual-encoder methods on zero-shot image-to-text (I2T) and text-to-image (T2I) retrieval. Surprisingly, RO-ViT outperforms all published works on the MS COCO benchmark, and is on par with the state of the art [51] on Flickr. Compared to CoCa [51] with the same backbone capacity (ViT-L), RO-ViT outperforms on 11 out of 12 image-text retrieval metrics. Our model has 303M parameters and achieves the best performance overall. This shows that our pretraining method not only improves the region-level representation for open-vocabulary detection but also the global image-level representation for retrieval.
4.3. Transfer Object Detection
We evaluate the generalization ability of RO-ViT in zero-shot transfer detection. We use the same detector trained on the LVIS base categories (Sec. 4.1) and test on the Objects365-v1 validation split [43] following the setup of ViLD [13, 19]. We replace the LVIS with Objects365 vocabulary embeddings to perform the transfer detection without finetuning.

method           backbone   AP     AP50   AP75
supervised [19]  R-50       25.6   38.6   28.0
ViLD [19]        R-50       11.8   18.2   12.6
DetPro [13]      R-50       12.1   18.8   12.9
RO-ViT (ours)    ViT-B/16   14.0   22.3   14.9
RO-ViT (ours)    ViT-L/16   17.1   26.9   18.5
Table 4. Transfer detection on Objects365 (Box APs). All models are trained on the LVIS base categories and tested on Objects365 dataset, without finetuning.

Table 4 summarizes the box AP scores in comparison with prior works. Our best model achieves 17.1 AP, which outperforms existing works ViLD by +5.3 AP and DetPro
pretraining method   CPE   focal   APr           AP
base                               21.4 (+0.0)   26.6
no PE                              18.7 (+0.0)   25.2
SinCos PE                          21.7 (+0.0)   26.9
feat Crop-Resize                   20.6 (+0.0)   26.6
ours                 X             23.8 (+2.4)   27.4
ours                       X       21.9 (+0.5)   27.4
ours                 X     X       24.3 (+2.9)   27.6
(a) Pretraining strategy: ‘base’ has the standard learnable positional embeddings (PE). No PE, randomly crop and resize the last feature map (feat Crop-Resize), and sinusoidal PE are either slightly worse or help marginally. Our proposed Cropped PE (CPE) improves by +2.4 APr. Focal contrastive loss brings the gain further to +2.9 APr.

frozen backbone   APr           AP
base              9.7 (+0.0)    12.2
CPE               16.2 (+6.5)   17.1
CPE + focal       16.5 (+6.8)   17.1
(b) Frozen backbone study: Only the newly added detector layers are trained in detection finetuning, to directly evaluate the pretrained features for open-vocabulary detection. Our proposed CPE improves the baseline by a large margin of +6.5 APr.

bblr    APr    AP
0.0     16.5   17.1
0.001   19.7   22.9
0.01    20.4   23.0
0.1     24.3   27.6
1.0     17.9   25.1
(c) Ablating backbone finetuning learning rate ratio (‘bblr’) w.r.t. added detector layers.

backbone   loc.obj.   norm.lyr   APr           AP
ViT-B/16                         21.4 (+0.0)   26.6
ViT-B/16   X                     23.4 (+2.0)   27.2
ViT-B/16              X          22.0 (+0.6)   26.8
ViT-B/16   X          X          23.9 (+2.5)   27.5
ViT-L/16                         25.1 (+0.0)   30.1
ViT-L/16   X          X          27.2 (+1.9)   30.6
(d) Downstream detector improvement: Combining the localization quality-based objectness from the proposal stage to the detection scoring (‘loc.obj.’) improves by ∼2 APr. Normalized classifier and mask output layer (‘norm.lyr’) gives additional gain.

backbone   batch   imp.(d)   CPE + focal   APr           AP
ViT-B/16   4096                            21.4 (+0.0)   26.6
ViT-B/16   4096    X                       23.9 (+0.0)   27.5
ViT-B/16   4096    X         X             26.2 (+2.3)   29.2
ViT-B/16   16384   X                       26.4 (+0.0)   30.3
ViT-B/16   16384   X         X             28.0 (+1.6)   30.2
ViT-L/16   4096    X                       27.2 (+0.0)   30.6
ViT-L/16   4096    X         X             29.9 (+2.7)   31.1
ViT-L/16   16384   X         X             32.1 (+0.0)   34.0
(e) Scalability w.r.t. model size and contrastive batch size: The benefit of our ‘CPE + focal’ holds with larger model and batch size, improving APr up to +2.7 points. ‘imp.(d)’ denotes the detector improvements in sub-table (d).

Table 5. Ablation studies. We follow LVIS open-vocabulary detection benchmark. We train on base (‘frequent’ + ‘common’) categories, test on novel (‘rare’) categories, and report APr. We use ViT-B/16 backbone and contrastive batch size 4096 unless otherwise noted.
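The focal contrastive loss ablated in sub-table (a) can be illustrated with a small NumPy sketch of the image-to-text losses in Equations 2 to 4. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, and the temperature `tau` and focusing parameter `gamma` defaults are placeholder values, since the text does not state the ones used.

```python
import numpy as np

def _log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -np.logaddexp(0.0, -x)

def softmax_contrastive_loss(v, l, tau=0.1):
    """Equation 2: softmax cross entropy over each image's row of logits."""
    logits = v @ l.T / tau                                   # (B, B)
    log_prob = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)
    return float(-np.mean(np.diag(log_prob)))

def focal_contrastive_loss(v, l, tau=0.1, gamma=1.0):
    """Equations 3 and 4: sigmoid-based focal loss over all (i, j) pairs."""
    logits = v @ l.T / tau                                   # (B, B)
    batch = logits.shape[0]
    is_pos = np.eye(batch, dtype=bool)
    # log p = log sigmoid(x) for matched pairs, log(1 - sigmoid(x)) otherwise;
    # note that 1 - sigmoid(x) = sigmoid(-x).
    log_p = np.where(is_pos, _log_sigmoid(logits), _log_sigmoid(-logits))
    p = np.exp(log_p)
    return float(-np.sum((1.0 - p) ** gamma * log_p) / batch)

# Toy batch of L2-normalized image (v) and text (l) embeddings.
rng = np.random.default_rng(0)
v = rng.normal(size=(4, 16))
v /= np.linalg.norm(v, axis=1, keepdims=True)
l = rng.normal(size=(4, 16))
l /= np.linalg.norm(l, axis=1, keepdims=True)

i2t = focal_contrastive_loss(v, l)
t2i = focal_contrastive_loss(l, v)   # swap roles for the text-to-image direction
total = i2t + t2i
```

With `gamma=0` the focal term reduces to a plain sigmoid cross entropy; larger `gamma` shrinks the contribution of pairs the model already classifies confidently, which is the extra control over hard examples that the method section describes.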
by +5.0 AP. Due to the different backbone capacity (R50 vs ViT), we note that this comparison is to showcase that RO-ViT can achieve strong cross-dataset detection generalization. For transfer detection, we assume all categories are novel and set α, β=(0.0, 0.65) in Equation 1.

Columns: backbone, focal, CPE, MS COCO I2T, MS COCO T2I, Flickr30K I2T, Flickr30K T2I
ViT-B/16 59.1 42.5 84.8 70.9
ViT-B/16 X 60.3 44.0 85.4 71.6
ViT-B/16 X X 62.2 44.2 86.5 71.8
ViT-L/16 X X 67.0 49.7 89.5 77.2
that our recipe is highly scalable with backbone capacity.
Frozen backbone study. We investigate the effectiveness
tion in Table 5b. This evaluation setting is more rigorous for representation learning because finetuning an entire network evaluates not only the quality of the representations but also the initialization and optimization method, as discussed in Goyal et al. [18]. During the detection finetuning, all ViT backbone layers are frozen, and only the added
[Figure: cosine similarity maps (color scale from 1 to -1) for ‘baseline’ and ‘ours’; axis: input patch column, 1 to 14.]
References
[1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan. Flamingo: a visual language model for few-shot learning, 2022. 1
[2] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6836–6846, 2021. 1
[3] Yutong Bai, Xinlei Chen, Alexander Kirillov, Alan Yuille, and Alexander C. Berg. Point-level region contrast for object detection pre-training. In CVPR, pages 16061–16070, June 2022. 2
[4] Ankan Bansal, Karan Sikka, Gaurav Sharma, Rama Chellappa, and Ajay Divakaran. Zero-shot object detection. In ECCV, 2018. 2
[5] Hangbo Bao, Li Dong, and Furu Wei. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021. 2
[6] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, pages 9650–9660, October 2021. 2
[7] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. In https://arxiv.org/abs/1504.00325, 2015. 6
[8] Xinlei Chen and Abhinav Gupta. Webly supervised learning of convolutional networks. In ICCV, 2015. 2
[9] Berkan Demirel, Ramazan Gokberk Cinbis, and Nazli Ikizler-Cinbis. Zero-shot object detection by hybrid region embedding. In BMVC, 2018. 2
[10] Karan Desai and Justin Johnson. Virtex: Learning visual representations from textual annotations. In CVPR, 2021. 2
[11] Santosh K Divvala, Ali Farhadi, and Carlos Guestrin. Learning everything about anything: Webly-supervised visual concept learning. In CVPR, 2014. 2
[12] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 1, 2, 3, 5
[13] Yu Du, Fangyun Wei, Zihe Zhang, Miaojing Shi, Yue Gao, and Guoqi Li. Learning to prompt for open-vocabulary object detection with vision-language model. In CVPR, 2022. 1, 2, 3, 5, 6
[14] Chengjian Feng, Yujie Zhong, Zequn Jie, Xiangxiang Chu, Haibing Ren, Xiaolin Wei, Weidi Xie, and Lin Ma. Promptdet: Towards open-vocabulary detection using uncurated images. In European Conference on Computer Vision, pages 701–717. Springer, 2022. 5, 6
[15] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc’Aurelio Ranzato, and Tomas Mikolov. Devise: A deep visual-semantic embedding model. In NeurIPS, 2013. 2
[16] Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D Cubuk, Quoc V Le, and Barret Zoph. Simple copy-paste is a strong data augmentation method for instance segmentation. In CVPR, 2021. 5
[17] Golnaz Ghiasi, Xiuye Gu, Yin Cui, and Tsung-Yi Lin. Scaling open-vocabulary image segmentation with image-level labels. In European Conference on Computer Vision, pages 540–557. Springer, 2022. 3
[18] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017. 8
[19] Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary object detection via vision and language knowledge distillation. In ICLR, 2022. 1, 2, 3, 5, 6
[20] Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. In CVPR, 2019. 5
[21] Nasir Hayat, Munawar Hayat, Shafin Rahman, Salman Khan, Syed Waqas Zamir, and Fahad Shahbaz Khan. Synthesizing the unseen for zero-shot object detection. In ACCV, 2020. 2
[22] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, pages 16000–16009, June 2022. 2
[23] Xiangteng He and Yuxin Peng. Fine-grained image classification via combining vision and language. In CVPR, 2017. 2
[24] Olivier J. Hénaff, Skanda Koppula, Jean-Baptiste Alayrac, Aaron van den Oord, Oriol Vinyals, and João Carreira. Efficient visual pretraining with contrastive detection. In ICCV, pages 10086–10096, October 2021. 2
[25] Dat Huynh, Jason Kuen, Zhe Lin, Jiuxiang Gu, and Ehsan Elhamifar. Open-vocabulary instance segmentation via robust cross-modal pseudo-labeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7020–7031, 2022. 5, 6
[26] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021. 2, 3, 4, 5, 6, 8
[27] Armand Joulin, Laurens van der Maaten, Allan Jabri, and Nicolas Vasilache. Learning visual features from large weakly supervised data. In ECCV, 2016. 2
[28] Dahun Kim, Tsung-Yi Lin, Anelia Angelova, In So Kweon, and Weicheng Kuo. Learning open-world object proposals without learning to classify. IEEE Robotics and Automation Letters, 7(2):5453–5460, 2022. 2, 5, 8
[29] Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, and
Anelia Angelova. F-vlm: Open-vocabulary object detection
upon frozen vision and language models. arXiv preprint
arXiv:2209.15639, 2022. 1, 3
[30] Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen
Koltun, and René Ranftl. Language-driven semantic
segmentation. arXiv preprint arXiv:2201.03546, 2022. 3
[31] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang,
Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu
Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, and
Jianfeng Gao. Grounded language-image pre-training. In
CVPR, 2022. 3
[32] Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He.
Exploring plain vision transformer backbones for object
detection. In ECCV, 2022. 1, 3
[33] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and
Piotr Dollár. Focal loss for dense object detection. In ICCV,
pages 2980–2988, 2017. 4
[34] Muhammad Maaz, Hanoona Rasheed, Salman Khan,
Fahad Shahbaz Khan, Rao Muhammad Anwer, and
Ming-Hsuan Yang. Class-agnostic object detection with
multi-modal transformer. In Computer Vision–ECCV 2022: 17th
European Conference, Tel Aviv, Israel, October 23–27, 2022,
Proceedings, Part X, pages 512–531. Springer, 2022. 5
[35] Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim
Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh
Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran
Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil
Houlsby. Simple open-vocabulary object detection with
vision transformers. In ECCV, 2022. 2, 3, 5, 6
[36] Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram
Singer, Jonathon Shlens, Andrea Frome, Greg S Corrado,
and Jeffrey Dean. Zero-shot learning by convex combination
of semantic embeddings. 2014. 2
[37] Hieu Pham, Zihang Dai, Golnaz Ghiasi, Hanxiao Liu,
Adams Wei Yu, Minh-Thang Luong, Mingxing Tan, and
Quoc V. Le. Combined scaling for zero-shot transfer
learning. CoRR, abs/2111.10050, 2021. 2, 8
[38] Bryan A Plummer, Liwei Wang, Chris M Cervantes,
Juan C Caicedo, Julia Hockenmaier, and Svetlana
Lazebnik. Flickr30k entities: Collecting region-to-phrase
correspondences for richer image-to-sentence models. In ICCV,
pages 2641–2649, 2015. 6
[39] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen
Krueger, and Ilya Sutskever. Learning transferable visual
models from natural language supervision. In ICML, 2021.
2, 3, 4, 5, 6, 8
[40] Shafin Rahman, Salman Khan, and Nick Barnes. Improved
visual-semantic alignment for zero-shot object detection. In
AAAI, 2020. 2
[41] Hanoona Rasheed, Muhammad Maaz, Muhammad Uzair
Khattak, Salman Khan, and Fahad Shahbaz Khan. Bridging
the gap between object and image-level representations
for open-vocabulary detection. arXiv preprint
arXiv:2207.03482, 2022. 1, 2, 3, 5, 6
[42] Mert Bulent Sariyildiz, Julien Perez, and Diane Larlus.
Learning visual representations with caption annotations. In
ECCV, 2020. 2
[43] Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang
Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365:
A large-scale, high-quality dataset for object detection. In
ICCV, 2019. 6
[44] Amanpreet Singh, Ronghang Hu, Vedanuj Goswami,
Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and
Douwe Kiela. Flava: A foundational language and vision
alignment model. In CVPR, pages 15638–15650, 2022. 6
[45] Josiah Wang, Katja Markert, Mark Everingham, et al.
Learning models for object recognition from natural language
descriptions. In BMVC, 2009. 2
[46] Jiaqi Wang, Wenwei Zhang, Yuhang Zang, Yuhang
Cao, Jiangmiao Pang, Tao Gong, Kai Chen, Ziwei Liu,
Chen Change Loy, and Dahua Lin. Seesaw loss for
long-tailed instance segmentation. In Proceedings of the
IEEE/CVF conference on computer vision and pattern
recognition, pages 9695–9704, 2021. 5
[47] Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia
Tsvetkov, and Yuan Cao. Simvlm: Simple visual language
model pretraining with weak supervision. arXiv preprint
arXiv:2108.10904, 2021. 1
[48] Fangyun Wei, Yue Gao, Zhirong Wu, Han Hu, and Stephen
Lin. Aligning pretraining for detection via object-level
contrastive learning. In NeurIPS, 2021. 2
[49] Tete Xiao, Colorado J Reed, Xiaolong Wang, Kurt Keutzer,
and Trevor Darrell. Region similarity representation
learning. In ICCV, pages 10539–10548, October 2021. 2
[50] Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe
Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and
Chunjing Xu. Filip: Fine-grained interactive language-image
pre-training. In ICLR, 2021. 6
[51] Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung,
Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive
captioners are image-text foundation models. TMLR, 2022.
5, 6, 8
[52] Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella,
Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang,
Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu,
Yumao Lu, Yu Shi, Lijuan Wang, Jianfeng Wang, Bin Xiao,
Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, and
Pengchuan Zhang. Florence: A new foundation model for
computer vision. arXiv preprint arXiv:2111.11432,
November 2021. 6
[53] Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, and
Chen Change Loy. Open-vocabulary detr with conditional
matching. arXiv preprint arXiv:2203.11876, 2022. 5
[54] Alireza Zareian, Kevin Dela Rosa, Derek Hao Hu, and
Shih-Fu Chang. Open-vocabulary object detection using captions.
In CVPR, 2021. 1, 3
[55] Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner,
Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer.
Lit: Zero-shot transfer with locked-image text tuning. In
CVPR, 2022. 2
[56] Bowen Zhang, Zhi Tian, Quan Tang, Xiangxiang Chu,
Xiaolin Wei, Chunhua Shen, and Yifan Liu. Segvit: Semantic
segmentation with plain vision transformers. arXiv preprint
arXiv:2210.05844, 2022. 1
[57] Shiyu Zhao, Zhixing Zhang, Samuel Schulter, Long Zhao,
Anastasis Stathopoulos, Manmohan Chandraker, Dimitris
Metaxas, et al. Exploiting unlabeled data with vision and
language models for object detection. In ECCV, 2022. 1, 2,
3, 5, 6
[58] Ye Zheng, Ruoran Huang, Chuanqi Han, Xi Huang, and Li
Cui. Background learnable cascade for zero-shot object
detection. In ACCV, 2020. 2
[59] Yiwu Zhong, Jing Shi, Jianwei Yang, Chenliang Xu, and Yin
Li. Learning to generate scene graph from natural language
supervision. In ICCV, 2021. 2, 6
[60] Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan
Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang
Dai, Lu Yuan, Yin Li, and Jianfeng Gao. Regionclip:
Region-based language-image pretraining. In CVPR, 2022.
1, 2, 3, 5, 6
[61] Chong Zhou, Chen Change Loy, and Bo Dai. Extract free
dense labels from clip. In ECCV, 2022. 3
[62] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and
Ziwei Liu. Conditional prompt learning for vision-language
models. In CVPR, 2022. 3
[63] Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp
Krähenbühl, and Ishan Misra. Detecting twenty-thousand
classes using image-level supervision. In ECCV, 2022. 1, 3,
5, 6
[64] Pengkai Zhu, Hanxiao Wang, and Venkatesh Saligrama.
Don’t even look once: Synthesizing features for zero-shot
detection. In CVPR, 2020. 2