A Survey of Visual Transformers
Abstract— Transformer, an attention-based encoder–decoder model, has already revolutionized the field of natural language processing (NLP). Inspired by such significant achievements, some pioneering works have recently been done on employing Transformer-like architectures in the computer vision (CV) field, which have demonstrated their effectiveness on three fundamental CV tasks (classification, detection, and segmentation) as well as multiple sensory data streams (images, point clouds, and vision-language data). Because of their competitive modeling capabilities, the visual Transformers have achieved impressive performance improvements over multiple benchmarks as compared with modern convolution neural networks (CNNs). In this survey, we have reviewed over 100 different visual Transformers comprehensively according to three fundamental CV tasks and different data stream types, where a taxonomy is proposed to organize the representative methods according to their motivations, structures, and application scenarios. Because of their differences in training settings and dedicated vision tasks, we have also evaluated and compared all these existing visual Transformers under different configurations. Furthermore, we have revealed a series of essential but unexploited aspects that may empower such visual Transformers to stand out from numerous architectures, e.g., slack high-level semantic embeddings to bridge the gap between the visual Transformers and the sequential ones. Finally, two promising research directions are suggested for future investment. We will continue to update the latest articles and their released source codes at https://github.com/liuyang-ict/awesome-visual-transformers.

Index Terms— Classification, computer vision (CV), detection, point clouds, segmentation, self-supervision, visual-linguistic pretraining, visual Transformer.

I. INTRODUCTION

As shown in Fig. 1, Transformers have gradually emerged as the predominant deep learning models for many natural language processing (NLP) tasks. The most recent dominant models are the self-supervised Transformers, which are pretrained over sufficient datasets and then fine-tuned over a small sample set for a given downstream task [2], [3], [4], [5], [6], [7], [8], [9]. The generative pretrained transformer (GPT) families [2], [3], [4] leverage the Transformer decoders to enable autoregressive language modeling, while the bidirectional encoder representations from transformers (BERT) [5] and its variants [6], [7] serve as autoencoder language models built on the Transformer encoders.

In the computer vision (CV) field, prior to the visual Transformers, convolution neural networks (CNNs) had emerged as the dominant paradigm [10], [11], [12]. Inspired by the great success of such self-attention mechanisms (Fig. 2) for the NLP tasks [1], [13], some CNN-based models attempted to capture the long-range dependencies by adding a self-attention layer at either the spatial level [14], [15], [16] or the channel level [17], [18], [19], while others tried to replace the traditional convolutions entirely with global [20] or local self-attention blocks [21], [22], [23], [24], [25], [26], [27]. Although Ramachandran et al. [24] have demonstrated the efficiency of the self-attention block without the help from CNNs, such a pure attention model is still inferior to the state-of-the-art (SoTA) CNN models on the prevailing benchmarks.

With the great achievements of linguistic Transformers and the rapid development of visual attention-based models, numerous recent works have migrated the Transformers
Fig. 1. Odyssey of Transformer application and growth curves of both Transformer [1] and ViT [29] citations according to Google Scholar. Top left: growth
of Transformer citations in the top linguistics and machine learning conference publications. Top right: growth of ViT citations in Arxiv publications. Bottom
left: Odyssey of language model [1], [2], [3], [4], [5], [6], [7], [8]. Bottom right: Odyssey of visual Transformer backbone where the black [29], [35], [36],
[37], [38], [39] is the SoTA with external data and the blue [40], [41], [42], [43], [44] refers to the SoTA without external data (best viewed in color).
B. Positionwise FFNs

The output of the MHSA is then fed into two successive FFNs with a ReLU activation as

\mathrm{FFN}(x) = \mathrm{ReLU}(W_1 x + b_1)W_2 + b_2. \tag{4}

This positionwise feedforward layer can be viewed as a pointwise convolution, which treats each position equally but uses different parameters between each layer.
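As a concrete illustration of (4), the following is a minimal PyTorch sketch of the positionwise FFN sublayer; the module name and the hidden width are illustrative assumptions, not values taken from this survey.

```python
import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    """Two successive linear layers with a ReLU in between, as in (4).
    The same weights are applied at every sequence position."""

    def __init__(self, d_model: int = 512, d_hidden: int = 2048):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)  # W1, b1
        self.fc2 = nn.Linear(d_hidden, d_model)  # W2, b2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, sequence_length, d_model); nn.Linear acts pointwise on the last
        # dimension, which is exactly the "pointwise convolution" view mentioned above.
        return self.fc2(torch.relu(self.fc1(x)))
```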
C. Positional Encoding

Since the Transformer/attention operates on the input embeddings simultaneously and identically, the order of the sequence is neglected. To make use of the sequential information, a common solution is to append an extra positional vector to the inputs, hence the term "positional encoding." There are many choices for positional encoding. For example, a typical choice is cosine functions with different frequencies as

\mathrm{PE}_{(pos,\,i)} = \begin{cases} \sin(pos \cdot \omega_k), & \text{if } i = 2k \\ \cos(pos \cdot \omega_k), & \text{if } i = 2k+1 \end{cases}, \qquad \omega_k = \frac{1}{10000^{2k/d}}, \quad k = 1, \ldots, d/2 \tag{5}

where pos and d are the position and the length of the vector, respectively, and i is the index of each element within the vector.
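Below is a short PyTorch sketch of the fixed sinusoidal encoding in (5), written for illustration; the returned tensor is simply added to the input embeddings, and the tensor shapes are assumptions.

```python
import torch

def sinusoidal_positional_encoding(seq_len: int, d: int) -> torch.Tensor:
    """Builds the (seq_len, d) encoding of (5): sine at even indices, cosine at odd.
    Assumes d is even."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    k = torch.arange(d // 2, dtype=torch.float32)                   # frequency index k
    omega = 1.0 / (10000.0 ** (2.0 * k / d))                        # (d/2,)
    pe = torch.zeros(seq_len, d)
    pe[:, 0::2] = torch.sin(pos * omega)  # i = 2k
    pe[:, 1::2] = torch.cos(pos * omega)  # i = 2k + 1
    return pe
```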
D. Transformer Model

Fig. 4 shows the overall Transformer model with the encoder–decoder architecture. Specifically, it consists of N successive encoder blocks, each of which is composed of two sublayers: 1) an MHSA layer aggregates the relationships within the encoder embeddings and 2) a positionwise FFN layer extracts feature representations. For the decoder, it also involves N consecutive blocks that follow the stack of encoders. Compared with the encoder, each decoder block appends a multihead cross-attention layer to aggregate both the decoder embeddings and the encoder outputs, where Y corresponds to the former and X to the latter as shown in (1). Moreover, all of the sublayers in both the encoder and the decoder employ a residual connection [11] and a layer normalization [162] to enhance the scalability of the Transformer. To record the sequential information, each input embedding is attached with a positional encoding at the beginning of the encoder stack and the decoder stack. Finally, a softmax operation is used for predicting the next word.

Originating from the machine translation task, the Transformer serves as an autoregressive language model. Given a sequence of words, the Transformer vectorizes the input sequence into word embeddings, adds the positional encodings, and feeds the resulting sequence of vectors into an encoder. During training, as shown in Fig. 4, Vaswani et al. [1] designed a masking operation according to the rule of the autoregressive task, where the current position only depends on the outputs of the previous positions. Based on this masking, the Transformer decoder is able to process the sequence of input labels in parallel. At inference time, the sequence of the previously predicted words is processed by the same operation to predict the next word.
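To make the sublayer composition concrete, here is a minimal PyTorch sketch of one encoder block with the post-norm residual placement of [1]; the hyperparameter values are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One encoder block: MHSA sublayer + positionwise FFN sublayer,
    each wrapped with a residual connection and layer normalization."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_hidden: int = 2048):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sublayer 1: self-attention over the whole sequence, then residual + LayerNorm.
        attn_out, _ = self.mhsa(x, x, x)
        x = self.norm1(x + attn_out)
        # Sublayer 2: positionwise FFN, then residual + LayerNorm.
        return self.norm2(x + self.ffn(x))
```

A decoder block would additionally insert a masked self-attention sublayer and a cross-attention sublayer whose keys and values come from the encoder outputs.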
Fig. 5. Taxonomy of visual Transformer backbone (best viewed in color).

III. TRANSFORMER FOR CLASSIFICATION

Following the prominent developments of the Transformers in NLP, recent works attempt to introduce visual Transformers for image classification. This section comprehensively reviews over 40 visual Transformers and groups them into six categories, as shown in Fig. 5. We start by introducing the fully attentional network [24], [28] and the vision Transformer (ViT) [29], which first demonstrates Transformer efficacy on large-scale classification benchmarks. Then, we discuss the Transformer-enhanced CNN methods that utilize the Transformer to enhance representation learning in CNNs. Due to the negligence of local information in the original ViT, the CNN-enhanced Transformer employs an appropriate convolutional inductive bias to augment the visual Transformer, while the local attention-enhanced Transformer redesigns the patch partition and attention blocks to improve their locality. Following the hierarchical and deep structures in CNNs [163], the hierarchical Transformer replaces the fixed-resolution columnar structure with a pyramid stem, while the deep Transformer prevents the attention map from oversmoothing and increases its diversity in the deep layers. Moreover, we also review the existing visual Transformers with self-supervised learning. Finally, we make a brief discussion based on intuitive comparisons for further investigation. More visual Transformer milestones are introduced in the Supplementary Material.
TABLE I
Top-1 Accuracy Comparison of Visual Transformers on ImageNet-1K. "1K Only" Denotes Training on ImageNet-1K Only. "21K Pre." Denotes Pretraining on ImageNet-21K and Fine-Tuning on ImageNet-1K. "Distill." Denotes Applying the Distillation Training Scheme of DeiT [40]. The Color of "Legend" Corresponding to Each Model Also Denotes the Same Model in Fig. 8
freedom for channel adjustment, and 2) last few class-attention stages with frozen patch embeddings. A later class token is inserted to model global representations, similar to DEtection with TRansformer (DETR) [30] with an encoder–decoder structure. This explicit separation is based on the assumption that the class token is invalid for the gradient of patch embeddings in the forward pass. With the distillation training strategy [40], CaiT achieves a new SoTA on ImageNet-1k (86.5% top-1 accuracy) without external data. Although the deep Transformer suffers from attention collapse and oversmoothing problems, it still largely preserves the diversity of the attention map between different heads. Based on this observation, Zhou et al. [66] proposed DeepViT, which aggregates different head attention maps and regenerates a new one by using a linear layer to increase cross-layer feature diversity. Furthermore, Refiner [37] applies a linear layer to expand the dimension of the attention maps (indirectly increasing the head number) for diversity promotion. Then, a distributed local attention (DLA) is employed to achieve better modeling of both the local features and the global ones, which is implemented by a headwise convolution acting on the attention map.

From the aspect of training strategy, Gong et al. [67] presented three patch diversity losses for the deep Transformer that can significantly encourage patch diversity and offset the oversmoothing problem. Similar to [176], a patchwise cosine loss minimizes pairwise cosine similarity among patches. A patchwise contrastive loss regularizes the deeper patches by their corresponding ones in the early layers. Inspired by CutMix [177], a patchwise mixing loss mixes two different images and forces each patch to only attend to the patches from the same image and ignore unrelated ones. Distinct from the similar loss function of LV-ViT [43], it is motivated by patch diversity rather than the patch-based label augmentation that LV-ViT [43] focuses on.
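As an illustration of the patchwise cosine loss described above, the sketch below penalizes the average pairwise cosine similarity among patch tokens; it is a minimal interpretation of the idea, not the exact formulation of [67].

```python
import torch
import torch.nn.functional as F

def patch_cosine_diversity_loss(patch_tokens: torch.Tensor) -> torch.Tensor:
    """patch_tokens: (batch, num_patches, dim). Returns the mean pairwise cosine
    similarity among patches; minimizing it pushes patch features apart."""
    x = F.normalize(patch_tokens, dim=-1)                 # unit-norm patch features
    sim = torch.bmm(x, x.transpose(1, 2))                 # (batch, N, N) cosine similarities
    n = sim.size(1)
    off_diag = sim - torch.eye(n, device=sim.device)      # remove self-similarity terms
    return off_diag.sum(dim=(1, 2)).mean() / (n * (n - 1))
```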
G. Transformers With Self-Supervised Learning

Following the great success of self-supervised learning (SSL) in the NLP field [5], recent works also attempt to design various self-supervised learning schemes for the visual Transformers in both generative and discriminative ways.
For the generative models, Chen et al. [68] proposed an image GPT (iGPT) for self-supervised visual learning. Different from the patch embedding of ViT [29], iGPT directly resizes and flattens the image to a lower resolution sequence. The resized sequences are then input into a GPT-2 [4] for autoregressive pixel prediction. iGPT demonstrates the effectiveness of the Transformer in the visual tasks without any help from image-specific knowledge, but its considerable computation cost is hard to accept (roughly 2500 V100-days for pretraining). Instead of the pixel-wise generation, Bao et al. [70] proposed a BERT-style visual Transformer (BEiT) by reconstructing the masked image in the latent space. Precisely, a dVAE [148] first converts input patches into discrete visual tokens, like BERT's [5] dictionary. These tokens are then employed as latent pseudo labels for SSL.

For the discriminative models, Chen et al. [72] investigated the effects of several fundamental components for stabilized self-supervised ViT training. They observed that the unstable training process mildly affects the eventual performance and extended the MoCo series to MoCo v3, containing a series of training strategies such as freezing the projection layer. Following DeiT [40], Caron et al. [73] further extended the teacher–student recipe to self-supervised learning and proposed DINO. The core concepts of DINO can be summarized into three points. A momentum encoder inherited from SwAV [178] serves as a teacher model that outputs the centered pseudo labels over a batch. An online encoder without the prediction head serves as a student model to fit the teacher's output. A standard cross-entropy loss connects self-training with knowledge distillation. More interestingly, the self-supervised ViT can learn flourishing features for segmentation, which are normally unattainable by the supervised models.
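The three DINO ingredients above can be summarized in a few lines; the following sketch is a simplified interpretation written in the spirit of [73], where the temperature values, the stop-gradient on the teacher, and the centering update are assumptions rather than the exact recipe.

```python
import torch
import torch.nn.functional as F

def dino_loss(student_logits, teacher_logits, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between the centered, sharpened teacher output (momentum encoder)
    and the student output (online encoder); the teacher is never backpropagated."""
    teacher_probs = F.softmax((teacher_logits - center) / tau_t, dim=-1).detach()
    student_log_probs = F.log_softmax(student_logits / tau_s, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

def update_center(center, teacher_logits, momentum=0.9):
    """Exponential moving average of the batch mean, used to center the teacher output."""
    return momentum * center + (1 - momentum) * teacher_logits.mean(dim=0)
```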
H. Discussion

1) Algorithm Evaluation and Comparative Analysis: In our taxonomy, all the existing supervised models are grouped into six categories. Table I summarizes the performances of these existing visual Transformers on the ImageNet-1k benchmark. To evaluate them objectively and intuitively, we use the following three figures to illustrate their performances on ImageNet-1k under different configurations. Fig. 8(a) summarizes the accuracy of each model under a 224² input size. Fig. 8(b) takes the FLOPs as the horizontal axis, which focuses on their performances under higher resolution. Fig. 8(c) focuses on the pretrained models with external datasets.
Fig. 8. Comparisons of recent visual Transformers on the ImageNet-1k benchmark, including [29], [36], [37], [41], [42], [43], [50], [51], [52], [53], [54], [58], [62], [63], [65], [180] (best viewed in color). (a) Bubble plot of the mentioned models with 224² resolution input, where the size of each circle denotes GFLOPs. (b) Comparison on high-resolution inputs, where the square indicates 448² input resolution. (c) Accuracy plot of some pretrained models on ImageNet-21k.
From these comparison results, we briefly summarize several performance improvements on efficiency and scalability as follows.
1) Compared with most structure-improved methods, the basic training strategies, such as DeiT [40] and LV-ViT [43], are more universal for various models.
2) The locality is indispensable for the Transformer, which is reflected by the dominance of VOLO [44] and Swin [35] on various tasks.
3) The convolutional patchify stem (ViTc [58]) and the early convolutional stage (CoAtNet [39]) can significantly boost the accuracy of the Transformers, especially for large models. We speculate that the reason is that they introduce a more stringent high-level feature than the sketchy patch projection in ViT [29] (see the sketch after this list).
4) The deep Transformer, such as Refined-ViT [37] and CaiT [42], has great potential. As the model size grows quadratically with the channel dimension, the tradeoff in the deep Transformer deserves further investigation.
5) CeiT [54] and CvT [36] show significant advantages in training a small or medium model (0–40M), which suggests that such kinds of hybrid attention blocks for lightweight models are worth further exploring.
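To make the contrast in point 3) concrete, the sketch below places the standard ViT-style patchify projection next to a small convolutional stem; the layer counts and widths are illustrative assumptions, not the ViTc [58] or CoAtNet [39] configurations.

```python
import torch.nn as nn

# ViT-style patchify stem: one strided convolution equals a linear projection of
# nonoverlapping 16x16 patches into 768-dim tokens.
patchify_stem = nn.Conv2d(3, 768, kernel_size=16, stride=16)

# Convolutional stem: a few stride-2 convolutions gradually reduce resolution
# before the tokens enter the Transformer blocks (16x total downsampling here).
conv_stem = nn.Sequential(
    nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
    nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
    nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.BatchNorm2d(256), nn.ReLU(),
    nn.Conv2d(256, 768, 3, stride=2, padding=1),
)
```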
2) Brief Discussion on Alternatives: During the development of the visual Transformers, the most common question is whether the visual Transformers can replace the traditional convolution completely. By reviewing the history of the performance improvements in the last year, there is no sign of relative inferiority here. The visual Transformers have returned from a pure structure to a hybrid form, and the global information has gradually returned to a mixed stage with the locality bias. Although the visual Transformers can be equivalent to CNNs or even have a better modeling capability, such a simple and effective convolution operation is enough to process the locality and the semantic features in the shallow layers. In the future, the spirit of combining both of them shall drive more breakthroughs for image classification.

IV. TRANSFORMER FOR DETECTION

In this section, we review visual Transformers for object detection, which can be grouped into two folds: Transformer as the neck and Transformer as the backbone. For the neck detectors, we mainly focus on a new representation specified to the Transformer structure, called object query, i.e., a set of learned parameters that aggregates instance features from the input images. The recent variants try to solve an optimal fusion paradigm in terms of either convergence acceleration or performance improvement. Besides these neck designs, a proportion of backbone detectors also take specific strategies into consideration. Finally, we evaluate them and analyze some potential methods for these detectors.

A. Transformer Neck

We first review DETR [30] and Pix2seq [75], two original Transformer detectors based on different paradigms. Subsequently, we mainly focus on the DETR-based variants, which improve the Transformer detectors in accuracy and convergence from five aspects: sparse attention, spatial prior, structural redesign, assignment optimization, and pretraining model.

1) Original Detectors: DETR [30] is the first end-to-end Transformer detector that eliminates hand-designed representations [180], [181], [182], [183] and the nonmaximum suppression (NMS) postprocessing, which redefines object detection as a set prediction problem. As shown in Fig. 9, a small set of learnable positional encodings, called object queries, is parallelly fed into the Transformer decoder to extract the instance information from the image features. Then, each object query is independently predicted to be a detection result. Instead of the vanilla k-class classification, a special class, the no object label (∅), is added for (k + 1)-class classification. During the training process, a bipartite matching strategy is applied between the predicted objects and the ground truth to identify one-to-one label assignment, hence removing the redundant predictions at the inference time without NMS. In backpropagation, a Hungarian loss includes
more stacks yield even worse results. Dynamic DETR [86] regards the object prediction as a coarse-to-fine process. Different from the previous RoI-based initialization, according to queries' reference boxes, a query-based weight is used to replace the cross-attention layers and directly acts on their corresponding coarse RoI features for query refinement.
4) Transformer With Redesigned Structure: Besides the modifications focusing on the cross attention, some works redesign an encoder-only structure to avoid the problem of the decoder directly. TSP [87] inherits the idea of set prediction [30] and dismisses the decoder and the object query to accelerate convergence. Such an encoder-only DETR reuses previous representations [180], [186] and generates a set of fixed-size features of interests (FoIs) [186] or proposals [180] that are subsequently fed into the Transformer encoder. In addition, a matching distillation is applied to resolve the early instability of the bipartite matching during the training process. Fang et al. [88] presented an encoder-only detector, YOLOS, a pure sequence-to-sequence Transformer to unify the classification and detection tasks. It inherits ViT's structure and replaces the single class token with fixed-size learned detection tokens. These object tokens are first pretrained on the classification tasks for transfer ability and then fine-tuned on the detection benchmark.
5) Transformer With Bipartite Matched Optimization: In DETR [30], the bipartite matching strategy forces the prediction results to fulfill one-to-one label assignment during the training scheme. Such a training strategy simplifies the detection pipeline and directly builds up an end-to-end system without the help of NMS. To deeply understand the efficacy of the end-to-end detector, Sun et al. [189] devoted themselves to exploring a theoretical view of one-to-one prediction. Based on multiple ablation and theoretical analyses, they concluded that the classification cost of the one-to-one matching strategy serves as the key component for significantly avoiding duplicate predictions. Even so, DETR still suffers from multiple problems caused by bipartite matching. Li et al. [91] exploited a denoising DETR (DN-DETR) to mitigate the instability of bipartite matching. Concretely, a series of objects with slight perturbation is supposed to be denoised to their original coordinates. The main ingredients of the denoising part are an attention mask that prevents information leakage between the matching and noised parts, and a specified label embedding to indicate the perturbation. Recently, Zhang et al. [92] presented an improved denoising training model called DINO (2022) by incorporating a contrastive loss for the perturbation groups. Based on DN-DETR [91], DINO attaches a "no object" class for the negative example if the distance is far enough from the perturbation, which avoids redundant prediction due to the confusion of multiple reference points near an object. As a result, DINO attains the current SoTA on the COCO dataset.
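For readers unfamiliar with the one-to-one assignment these methods build on, the following sketch shows a bipartite matching step in the spirit of DETR, using the Hungarian algorithm from SciPy; the cost definition here (class probability plus an L1 box term) is a simplified assumption rather than the full matching cost of [30].

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_probs, pred_boxes, gt_labels, gt_boxes, box_weight=5.0):
    """pred_probs: (num_queries, num_classes + 1), pred_boxes: (num_queries, 4),
    gt_labels: (num_gt,), gt_boxes: (num_gt, 4).
    Returns one-to-one (query_idx, gt_idx) pairs; unmatched queries get 'no object'."""
    # Classification cost: negative probability of the ground-truth class.
    cost_cls = -pred_probs[:, gt_labels]                        # (num_queries, num_gt)
    # Box cost: L1 distance between predicted and ground-truth boxes.
    cost_box = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    cost = cost_cls + box_weight * cost_box
    query_idx, gt_idx = linear_sum_assignment(cost)             # minimizes the total cost
    return list(zip(query_idx.tolist(), gt_idx.tolist()))
```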
6) Transformer Detector With Pretraining: Inspired by the pretrained linguistic Transformers [3], [5], Dai et al. [89] devised an unsupervised pretraining DETR (UP-DETR) to assist the convergence of supervised training. The objective of pretraining is to localize the randomly cropped patches from a given image. Specifically, each patch is assigned to a set of queries and predicted independently via the attention mask. An auxiliary reconstruction loss forces the detector to preserve the feature discrimination so as to avoid overbias toward the localization in pretraining. FP-DETR [90] is devoted to narrowing the gap between the upstream and downstream tasks. During the pretraining, a fully encoder-only DETR such as YOLOS [88] views the query positional embeddings as a visual prompt to enhance target area attention and object discrimination. A task adapter implemented by self-attention is used to enhance object interaction during fine-tuning.

B. Transformer Backbone

We have reviewed numerous Transformer-based backbones for image classification [29], [40] in Section III. These backbones can be easily incorporated into various frameworks [30], [182], [187] to perform dense prediction tasks. For example, the hierarchical structure, such as PVT [41], [65], constructs the visual Transformer as a high-to-low resolution process to learn multiscale features. The locally enhanced structure constructs the backbone as a local-to-global combination, which can efficiently extract both short- and long-range visual dependencies and avoid a quadratic computational overhead, such as Swin-Transformer [35], ViL [61], and Focal Transformer [62]. The Supplementary Material includes more detailed comparisons of these models for the dense prediction tasks. In addition to the generic Transformer backbone, the feature pyramid Transformer (FPT) [93] combines the characteristics across both the spaces and the scales by using self-attention, top-down cross attention, and bottom-up cross-channel attention. Following [190], HRFormer [94] introduces the advantages of multiresolution to the Transformer along with nonoverlapping local self-attention. HRViT [95] redesigns a heterogeneous branch and a cross-shaped attention block to further optimize the tradeoff between efficiency and accuracy.

C. Discussion

We summarize the five groups of Transformer neck detectors in Table II, and more details of the Transformer backbones for dense prediction tasks are referred to in Table SI of the Supplementary Material. The majority of Transformer neck promotions concentrate on the following five aspects.
1) The sparse attention model and the scoring network are proposed to address the problem of redundant feature interaction. These methods can significantly alleviate computational costs and accelerate model convergence.
2) The explicit spatial prior, which is decomposed into the selected feature initialization and the positional information extracted by learned parameters, would enable the detector to predict the results precisely.
3) Multiscale features and iterative box refinement benefit DETR for small object detection.
4) The improved bipartite matching strategy is beneficial to avoid redundant prediction, add positive gradients, and perform end-to-end object detection.
5) The encoder-only structure reduces the overall Transformer stack layers but increases the FLOPs excessively, while the encoder–decoder structure is a good tradeoff between FLOPs and parameters, but the deeper decoder layers may cause slow convergence.

Existing Transformer backbones mostly focus on the classification task, but a few works are developed for the
dense prediction tasks. In the future, we anticipate that the Transformer backbone would cooperate with the deep high-resolution network to solve dense prediction tasks.

TABLE II
Comparison Between Transformer Necks and Representative CNNs With ResNet-50 Backbone on COCO 2017 val Set
V. TRANSFORMER FOR SEGMENTATION

Patch- and query-based Transformers are the two major ways for segmentation. The latter can be further grouped into object query and mask embedding methods.

A. Patch-Based Transformer

Because of the receptive field expansion strategy [191], CNNs require multiple decoder stacks to map the high-level features into the original spatial resolution. Instead, the patch-based Transformer can easily incorporate a simple decoder for segmentation mask prediction because of its global modeling capability and resolution invariance. Zheng et al. extended ViT [29] to semantic segmentation tasks and presented the SEgmentation TRansformer (SETR) [96] by employing three fashions of the decoder to perform per-pixel classification: naive upsampling, progressive upsampling, and multilevel feature aggregation (MLA). SETR demonstrates the feasibility of the visual Transformer for the segmentation tasks, but it also brings unacceptable extra GPU costs. TransUNet [97] is the first Transformer for medical image segmentation. Formally, it can be viewed as either a variant of SETR with
B. Query-Based Transformer

Query embeddings are a set of scratch semantic/instance representations that gradually learn from the image inputs. Unlike patch embeddings, queries can more "fairly" integrate the information from features and naturally join with the set prediction loss [30] for postprocessing elimination. The existing query-based models can be grouped into two categories. One (object query) is driven by both the detection and segmentation tasks simultaneously. The other (mask embedding) is only supervised by the segmentation task.

1) Object Queries: There are three training manners for object query-based methods [Fig. 10(a)–(c)]. With the success of DETR [30] for the object detection tasks, the authors extend it to panoptic segmentation (hence termed panoptic DETR [30]) by training a mask head based on the pretrained object queries [Fig. 10(a)]. In detail, a cross-attention block between the object queries and the encoded features is applied to generate an attention map for each object. After an upsampling FPN-style CNN, a spatial argmax operation fuses the resulting binary masks into a nonoverlapping prediction. Instead of using a multistage serial training process, Cell-DETR [99] and VisTR [100] develop a parallel model for end-to-end instance segmentation [Fig. 10(b)]. Based on DETR [30], Cell-DETR leverages a cross-attention block to extract instancewise features from the box branch and fuses the previous backbone features to augment the CNN decoder for accurate instance mask segmentation of biological cells. Another extension is VisTR [100], which directly formulates the video instance segmentation (VIS) task as parallel sequence prediction. Apart from the similar structure to Cell-DETR [99], the key of VisTR is a bipartite matching loss at the instance sequence level to maintain the order of outputs, so as to adapt DETR [30] to VIS for direct one-to-one predictions. Unlike prior works
TABLE III
Comparison Between CNN- and Transformer-Based Models on ADE20K and COCO for Different Segmentation Tasks. "+MS" Denotes Multiscale Inputs
QueryInst [101] attains the SoTA among various Transformer models. It is worthy of further study to combine the visual Transformers with the hybrid task cascade structures. Table III(c) focuses on panoptic segmentation. Max-DeepLab [31] is general enough to solve both the foreground and the background in the panoptic segmentation task via a mask prediction format. Maskformer [32] successfully employs this format and unifies both semantic and instance segmentation tasks into a single model. It is concluded that the visual Transformers could unify multiple segmentation tasks into one box-free framework with mask prediction.
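The box-free mask-prediction format can be summarized in a few lines: each query emits a class score and a mask, and a pixelwise argmax fuses the per-query masks into a nonoverlapping segmentation. The sketch below is a simplified illustration of this fusion, not the exact inference procedure of [31] or [32].

```python
import torch

def fuse_mask_predictions(class_logits, mask_logits):
    """class_logits: (num_queries, num_classes + 1) with a trailing 'no object' class.
    mask_logits: (num_queries, H, W) per-query mask predictions.
    Returns an (H, W) map of class ids obtained by pixelwise argmax over queries."""
    class_probs = class_logits.softmax(-1)[:, :-1]        # drop the 'no object' column
    mask_probs = mask_logits.sigmoid()                     # (num_queries, H, W)
    query_scores = class_probs.max(-1).values              # confidence of each query
    # Weight each mask by its query confidence and assign every pixel to the best query.
    weighted = query_scores[:, None, None] * mask_probs    # (num_queries, H, W)
    best_query = weighted.argmax(dim=0)                    # (H, W)
    query_classes = class_probs.argmax(-1)                 # predicted class per query
    return query_classes[best_query]                       # (H, W) nonoverlapping prediction
```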
VI. TRANSFORMER FOR 3-D VISUAL RECOGNITION

With the rapid development of 3-D acquisition technology, stereo/monocular images and light detection and ranging (LiDAR) point clouds have become the popular sensory data for 3-D recognition. Discriminated from the RGB(D) data, the point cloud representation pays more attention to distance, geometry, and shape information. Notably, such a geometric feature is significantly suitable for the Transformer on account of its characteristics of sparseness, disorder, and irregularity. Following the success of 2-D visual Transformers, substantial approaches are developed for 3-D visual analysis. This section exhibits a compact review of 3-D visual Transformers following representation learning, cognition mapping, and specific processing.
A. Representation Learning

Compared with conventional hand-designed networks, the visual Transformer is more appropriate for learning semantic representations from point clouds, in which such an irregular and permutation-invariant nature can be transformed into a series of parallel embeddings with positional information. Point Transformer [105] and PCT [106] first demonstrate the efficacy of the visual Transformer in 3-D scenes. The former merges a hierarchical Transformer [105] with the downsampling strategy [199] and extends their previous vector attention block [25] to 3-D point clouds. The latter first aggregates neighbor points and then processes such neighbor embeddings on a global offset Transformer where a knowledge transfer from the graph convolution network (GCN) is applied for noise mitigation. Notably, the positional encoding, a significant operation of the visual Transformer, is diminished in both approaches because of the points' inherent coordinate information. PCT directly processes the coordinates without positional encodings, while Point Transformer adds a learnable relative positional encoding for further enhancement. Lu et al. [107] leveraged a local–global aggregation module, 3DCTN, to achieve local enhancement and cost efficiency. Given the multistride downsampling groups, an explicit graph convolution with a max-pooling operation is used to aggregate the local information within each group. The resulting group embeddings are concatenated and fed into the improved Transformer [105], [106] for global aggregation. Park et al. [108] presented Fast Point Transformer to optimize the model efficiency by using voxel-hashing neighbor search, voxel-bridged relative positional encoding, and similarity-based local attention.

For dense prediction, Pan et al. [109] proposed a customized point-based backbone, Pointformer, to attend to the local and global interactions separately within each layer. Different from previous local–global forms, a coordinate refinement operation after the local attention is adopted to update the centroid point instead of the surface one. Also, a local–global cross-attention model fuses the high-resolution features, followed by global attention. Fan et al. [110] returned to the single-stride sparse transformer (SST) to address the problem of small-scale detection. Similar to Swin [35], a shifted group in continuous Transformer blocks is adopted to attend to each group of tokens separately, which further mitigates the computation problem. In voxel-based methods, the voxel transformer (VoTr) [111] separately operates on the empty and nonempty voxel positions effectively via local attention and dilated attention blocks. VoxSeT [112] further decomposes the self-attention layer into two cross-attention layers, and a group of latent codes links them to preserve global features in a hidden space.

Following the mentioned methods in Section III-G, a series of self-supervised Transformers is also extended to 3-D spaces [113], [114], [115]. Specifically, Point-BERT [113] and Point-MAE [114] directly transfer the previous works [70], [71] to point clouds, while MaskPoint [115] changes the generative training scheme by using a contrastive decoder similar to DINO (2022) [92] for binary noise/part classification. Based on extensive experiments, we can conclude that such generative/contrastive self-training methods empower the visual Transformers to be valid on either images or points.
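Local attention on point clouds usually starts from a neighborhood grouping step like the one sketched below: gather the k nearest neighbors of each (possibly downsampled) point and restrict attention to that group. This is a generic illustration of the idea, not the specific grouping of Point Transformer [105] or 3DCTN [107].

```python
import torch

def knn_group(points: torch.Tensor, features: torch.Tensor, k: int = 16) -> torch.Tensor:
    """points: (N, 3) coordinates, features: (N, C) per-point features.
    Returns (N, k, C) neighbor features on which local attention can be applied."""
    dist = torch.cdist(points, points)              # (N, N) pairwise Euclidean distances
    knn_idx = dist.topk(k, largest=False).indices   # (N, k) indices of nearest neighbors
    return features[knn_idx]                        # gather neighbor features per point

# Local attention then runs within each (k, C) group, e.g., with a small MHSA block,
# so the cost scales with N * k^2 instead of N^2.
```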
B. Cognition Mapping

Given rich representation features, how to directly map the instance/semantic cognition to the target outputs also arouses considerable interest. Different from 2-D images, the objects in 3-D scenes are independent and can be intuitively represented by a series of discrete surface points. To bridge the gap, some existing methods transfer domain knowledge into the 2-D prevailing models. 3DETR [116] first extends the Transformer detector to 3-D object detection via farthest point sampling and Fourier positional embeddings for object query initialization. Group-Free 3-D DETR [117] applies a more specified and stronger structure than [116]. In detail, it directly selects a set of candidate sample points from the extracted point clouds as the object queries and updates them in the decoder layer-by-layer iteratively. Moreover, the K-closest inside points are assigned positive and supervised by a binary objectiveness loss in both the sampler and decoder heads. Sheng et al. [118] proposed a typical two-stage method that leverages a channelwise transformer 3-D detector (CT3D) to simultaneously aggregate proposal-aware embeddings and channelwise context information for the point features within each proposal.

For monocular sensors, both MonoDTR [119] and MonoDETR [120] utilize an auxiliary depth supervision to estimate pseudo depth positional encodings (DPEs) during the training process. In MonoDTR [119], DPEs are first attached to the image features for the Transformer encoder and then serve as the inputs of the DETR-like [30] decoder to initialize the object queries. In MonoDETR [120], both the visual features and DPEs are first extracted by two different encoders parallelly and then interact with the object queries via two successive cross-attention layers. Based on foreground depth supervision and a narrow categorization interval, MonoDETR obtains the SoTA
result on the KITTI benchmark. DETR3D [121] introduces a multicamera 3-D object detection paradigm where both 2-D images and 3-D positions are associated by the camera transformation matrices and a set of 3-D object queries. TransFusion [122] further takes the advantages of both LiDAR points and RGB images by interacting with the object queries through two Transformer decoder layers successively. More multisensory data fusion is introduced in Section VII-A.
C. Specific Processing

Limited by the sensor resolution and view angle, point clouds are afflicted with incompletion, noise, and sparsity problems in real-world scenes. To this end, PoinTr [123] represents the original point cloud as a set of local point proxies and leverages a geometry-aware encoder–decoder Transformer to migrate the center point proxies toward the incomplete points' direction. SnowflakeNet [124] formulates the process of completing point clouds as a snowflake-like growth, which progressively generates child points from their parent points, implemented by a pointwise splitting deconvolution strategy. A skip-Transformer for adjacent layers further refines the spatial-context features between parents and children to enhance their connection regions. Choe et al. [125] unified various generation tasks (e.g., denoising, completing, and super-resolution) into a point cloud reconstruction problem, hence termed PointRecon. Based on voxel hashing, it covers the absolute-scale local geometry and utilizes a PointTransformer-like [105] structure to aggregate each voxel (the query) with its neighbors (the value–key pair) for a fine-grained conversion from the discrete voxels to a group of point sets. Moreover, an amplified positional encoding is adapted to the voxel local attention scheme, implemented by using a negative exponential function with L1-loss as weights for the vanilla positional encodings. Notably, compared with masked generative self-training, the completion task directly generates a set of complete points without the explicit spatial prior of incomplete points.
VII. TRANSFORMER FOR MULTISENSORY DATA STREAM

In the real world, multiple sensors are always used complementarily rather than a single one. To this end, recent works start to explore different fusing methods to cooperate with multisensory data streams effectively. Compared with the typical CNNs, the Transformer is naturally appropriate for multistream data fusion because of its nonspecific embedding and dynamically interactive attention mechanism. This section details these methods according to their data stream sources: homologous stream and heterologous stream.

A. Homologous Stream

A homologous stream is a set of multisensory data with similar inherent characteristics, such as multiview, multidimension, and multimodality visual stream data. They can be categorized into two groups: interactive fusion and transfer fusion, according to their fusion mechanism.

1) Interactive Fusion: The classical fusion pattern of CNNs adopts a channel concatenation operation. However, the same positions from different modalities might be anisotropic, which is unsuitable for the translation-invariant bias of CNNs. Instead, the spatial concatenation operation of the Transformer enables different modalities to interact beyond the local restriction.

For the local interaction, MVT [126] spatially concatenates the patch embeddings from different views and strengthens their interaction via a modal-agnostic Transformer encoder. Considering the redundant features from different modalities, MVDeTr [127] projects each view of features onto the ground plane and extends the multiscale deformable attention [76] to a multiview design. TransFuser [128], COTR [129], and mmFormer [133] deploy a hybrid model. TransFuser models image and LiDAR inputs separately by using two different convolution backbones and links the intermediate feature maps via a Transformer encoder together with a residual connection. COTR shares the CNN backbone for each of the view images and inputs the resulting features into a Transformer encoder block with a spatially expanded mesh-grid positional encoding. mmFormer exploits a modality-specific Transformer encoder for each sequence of MRI images and a modality-correlated Transformer encoder for multimodal modeling.

For the global interaction, Wang et al. [130] leveraged a shared backbone to extract the features for different views. Instead of the pixelwise/patchwise concatenation in COTR [129], the extracted viewwise global features are spatially concatenated to perform view fusion within a Transformer. Considering the angular and position discrepancy across different camera views, TransformerFusion [132] first converts each view feature into an embedding vector with the intrinsics and extrinsics of their camera views. These embeddings are then fed into a global Transformer whose attention weights are used for a frame selection so as to compute efficiently. To unify the multisensory data in 3-D detection, FUTR3D [131] projects the object queries in the DETR-like decoder into a set of 3-D reference points. These points together with their related features are subsequently sampled from different modalities and spatially concatenated to update the object queries.

2) Transfer Fusion: Unlike the interactive fusion implemented by the Transformer encoder with self-attention, the other fusing form is more like a transfer learning from the source data to the target one via a cross-attention mechanism. For instance, Tulder et al. [134] inserted two cooperative cross-attention Transformers into the intermediate backbone features for bridging the unregistered multiview medical images. Instead of the pixelwise attention form, a token-pixel cross attention is further developed to alleviate the arduous computation. Long et al. [135] proposed an epipolar spatiotemporal Transformer for multiview image depth estimation. Given a single video containing a series of static multiview frames, the neighbor frames are first concatenated and the epipolar is then warped into the center camera space. The resulting frame volume finally serves as the source data to perform fusion with the center frame through a cross-attention block. With the spatially aligned data streams, DRT [136] first explicitly models the relation map between different data streams by using a convolution layer. The resulting maps are subsequently fed into a dual-path cross attention to build both local and global relationships parallelly, and thereby, it can collect more regional information for glaucoma diagnosis.
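The two fusion mechanisms above boil down to where the attention operates: interactive fusion spatially concatenates the token sequences of all modalities and runs self-attention over the joint sequence, while transfer fusion keeps the target tokens as queries and attends to the source tokens. A minimal PyTorch sketch of both patterns is given below; the module shapes and the use of nn.MultiheadAttention are illustrative assumptions.

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)

def interactive_fusion(tokens_a: torch.Tensor, tokens_b: torch.Tensor) -> torch.Tensor:
    """Spatially concatenate two modality token sequences and apply joint self-attention."""
    joint = torch.cat([tokens_a, tokens_b], dim=1)     # (B, Na + Nb, 256)
    fused, _ = attn(joint, joint, joint)               # every token attends to both modalities
    return fused

def transfer_fusion(target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
    """Cross-attention: target tokens query the source tokens (transfer from source to target)."""
    fused, _ = attn(target, source, source)            # queries = target, keys/values = source
    return fused
```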
B. Heterologous Stream

Besides the box loss, two auxiliary losses are adopted to enforce the network to model an alignment between the image features and their corresponding phrase tokens. With the large-scale image–text pairs pretraining, MDETR can be easily generalized in few-shot learning, even on long-tail data. Different from MDETR [153], which adds two auxiliary losses for box–phrase alignments, the referring Transformer [156] directly initializes the object queries with phrase-specific embeddings for PG, which explicitly reserves a one-to-one phrase assignment for the final bounding box prediction. VGTR [155] reformulates the REC as a task of single salient object localization from the language features. In detail, a text-guided attention mechanism encapsulates both a self-attention block and a text–image cross-attention one to update the image features simultaneously. The resulting image features, which serve as the key–value pairs, interact with the language queries when regressing the bounding box coordinates in the decoder. Following ViT [29], TransVG [154] keeps the class token to aggregate the image and language features simultaneously for the mentioned object localization in REC. Pseudo-Q [157] focuses on REC for unsupervised learning, where a pseudo-query generation module based on a pretrained detector and a series of attribute&relationship generation algorithms is applied to generate a set of pseudo phrase descriptions, and a query prompt is introduced to match the feature proposals and phrase queries for REC adaptation.
In the 3-D spaces, LanguageRefer [158] redefines the multistream data reasoning as a language modeling problem, whose core idea is to omit the point cloud features and infuse the predicted class embeddings together with a caption into a language model to get a binary prediction for object selection. Following the conventional two-stream methods, TransRefer3D [159] further enhances the relationship of the object features by using a cross-attention between asymmetric object relation maps and linguistic features. Considering the specific view for varied descriptions, Huang et al. [160] presented a multiview Transformer (MVT 2022) for 3-D visual grounding. Given a shared point cloud feature for each object, MVT first appends the converted bounding box coordinates to the shared objects in order to get specific view features. These multiview features are then fed into a stack of the Transformer decoders for text data fusion. Finally, the multiview features are merged by an order-independent aggregation function and converted to the grounding score. MVT achieves the SoTA performance on the Nr3D and Sr3D datasets [205]. In the video space, a specific 3-D data format (with a temporal dimension), Yang et al. [161] proposed TubeDETR to address the problem of spatiotemporal video grounding (STVG). Concretely, a slow-fast encoder sparsely samples the frames, performs cross-modal self-attention between the sampled frames and the text features in the slow branch, and aggregates the updated sample features into the full-frame features from the fast branch via a broadcast operation. A learnable query attached with different time encodings, called time-specific queries in the decoder, is then predicted as either a time-aligned bounding box or "no object." It attains SoTA results on the STVG leaderboards.
VIII. CONCLUSION AND DISCUSSION

This section briefly summarizes the recent improvements in Section VIII-A, discusses some critical issues in Section VIII-B, suggests future research directions in Section VIII-C, and draws the final conclusion in Section VIII-D.

A. Summary of Recent Improvements

We briefly summarize the major performance improvements for three fundamental CV tasks as follows.
1) For classification, a deep hierarchical Transformer backbone is valid for decreasing the computational complexity [41] and avoiding the feature oversmoothing [37], [42], [66], [67] in the deep layer. Meanwhile, the early stage convolution [39] is enough to capture the low-level features, which can significantly enhance the robustness and reduce the computational complexity in the shallow layer. Moreover, both the convolutional projection [54], [55] and the local attention mechanism [35], [44] can improve the locality of the visual Transformers. The former [56], [57] may also be a new approach to replace the positional encoding.
2) For detection, the Transformer necks benefit from the encoder–decoder structure with less computation than the encoder-only Transformer detector [88]. Thus, the decoder is necessary, but it requires more spatial prior [76], [80], [81], [82], [83], [85], [86] due to its slow convergence [87]. Furthermore, sparse attention [76] and the scoring network [78], [79] for foreground sampling are conducive to reducing the computational costs and accelerating the convergence of the visual Transformers.
3) For segmentation, the encoder–decoder Transformer models may unify three segmentation subtasks into a mask prediction problem via a set of learnable mask embeddings [31], [104], [198]. This box-free approach has achieved the latest SoTA performance on multiple benchmarks [198]. Moreover, the specific hybrid task cascade model [101] of the box-based visual Transformers has demonstrated a higher performance for instance segmentation.
4) For 3-D visual recognition, the local hierarchical Transformer with a scoring network could efficiently extract features from the point clouds. Instead of the elaborate local design, the global modeling capability enables the Transformer to easily aggregate the surface points. In addition, the visual Transformers can handle multisensory data in 3-D visual recognition, such as multiview and multidimension data.
5) The mainstream visual-linguistic pretraining has gradually focused on the alignments [147] or similarities [152] among different data streams in the latent space based on the large-scale noised datasets [149]. Another concern is to adapt the downstream visual tasks to the pretraining scheme to perform zero-shot transferring [147].
6) The recent prevailing architecture for multisensory data fusion is the single-stream method, which spatially concatenates different data streams and performs interaction simultaneously. Based on the single-stream model, numerous recent works are devoted to finding a latent space to semantically align different data.

B. Discussion on Visual Transformers

Despite the fact that the visual Transformer models have evolved significantly, the "essential" understanding remains insufficient.
class tokens are aligned to mix-patches via set prediction, such as the data augmentation training strategy in LV-ViT [43]. Furthermore, the label assignment in the set prediction strategy leads to training instability during the early process, which degrades the accuracy of the final results. Redesigning the label assignments and set prediction losses may be helpful for the detection frameworks.

2) Self-Supervised Learning: Self-supervised pretraining of Transformers has standardized the NLP field and obtained tremendous successes in various applications [2], [5]. Because of the popularity of self-supervision paradigms in the CV field, the convolutional Siamese networks employ contrastive learning to perform self-supervised pretraining, which differs from the masked autoencoders used in the NLP field. Recently, some studies have tried to design self-supervised visual Transformers to bridge the discrepancy of pretraining methodology between vision and language. Most of them inherit the masked autoencoders in the NLP field or the contrastive learning schemes in the CV field. There is no pretraining method specifically designed for the visual Transformers, although self-supervised pretraining has revolutionized the NLP tasks, e.g., GPT-3. As described in Section VIII-B4, the encoder–decoder structure may unify the visual tasks by learning the decoder embedding and the positional encoding jointly. Thus, it is worth further investigating the encoder–decoder Transformers for self-supervised learning.

D. Conclusion

Since ViT demonstrated its effectiveness for the CV tasks, the visual Transformers have received considerable attention and undermined the dominance of CNNs in the CV field. In this article, we have comprehensively reviewed more than 100 visual Transformer models that have been successively applied to various vision tasks (i.e., classification, detection, and segmentation) and data streams (e.g., images, point clouds, image–text pairs, and other multiple data streams). For each vision task and data stream, a specific taxonomy is proposed to organize the recently developed visual Transformers, and their performances are further evaluated over various prevailing benchmarks. From our integrative analysis and systematic comparison of all these existing methods, a summary of remarkable improvements is provided in this article, four essential issues for the visual Transformers are also discussed, and several potential research directions are further suggested for future investigation. We do expect that this review article can help readers gain better understandings of various visual Transformers before they decide to perform deep explorations.

ACKNOWLEDGMENT

This work was done at the AI Lab, Lenovo Research, Beijing, China.

REFERENCES

[1] A. Vaswani et al., "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2017, pp. 5998–6008.
[2] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, "Improving language understanding by generative pre-training," OpenAI, Tech. Rep., 2018.
[3] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, "Language models are unsupervised multitask learners," OpenAI, Tech. Rep., 2019.
[4] T. B. Brown et al., "Language models are few-shot learners," in Proc. NeurIPS, 2020, pp. 1877–1901.
[5] J. Devlin, M.-W. Chang, K. Lee, and K. N. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proc. NAACL, 2018, pp. 4171–4186.
[6] Y. Liu et al., "RoBERTa: A robustly optimized BERT pretraining approach," 2019, arXiv:1907.11692.
[7] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, "ALBERT: A lite BERT for self-supervised learning of language representations," in Proc. ICLR, 2020, pp. 1–17.
[8] Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, and Q. V. Le, "XLNet: Generalized autoregressive pretraining for language understanding," in Proc. NeurIPS, 2019, pp. 5753–5763.
[9] D. W. Otter, J. R. Medina, and J. K. Kalita, "A survey of the usages of deep learning for natural language processing," IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 2, pp. 604–624, Feb. 2021.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. NIPS, 2012, pp. 1097–1105.
[11] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[12] M. Tan and Q. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in Proc. ICML, 2019, pp. 6105–6114.
[13] A. Galassi, M. Lippi, and P. Torroni, "Attention in natural language processing," IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 10, pp. 4291–4308, Oct. 2021.
[14] X. Wang, R. B. Girshick, A. Gupta, and K. He, "Non-local neural networks," in Proc. IEEE CVPR, Jun. 2018, pp. 7794–7803.
[15] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu, "CCNet: Criss-cross attention for semantic segmentation," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 603–612.
[16] Y. Cao, J. Xu, S. Lin, F. Wei, and H. Hu, "GCNet: Non-local networks meet squeeze-excitation networks and beyond," in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshop (ICCVW), Oct. 2019, pp. 1971–1980.
[17] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proc. IEEE/CVF CVPR, Jun. 2018, pp. 7132–7141.
[18] S. Woo, J. Park, J. Y. Lee, and I. S. Kweon, "CBAM: Convolutional block attention module," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 3–19.
[19] Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, and Q. Hu, "ECA-Net: Efficient channel attention for deep convolutional neural networks," in Proc. IEEE/CVF CVPR, Jun. 2020, pp. 11534–11542.
[20] N. Parmar et al., "Image transformer," in Proc. ICML, 2018, pp. 4055–4064.
[21] H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei, "Relation networks for object detection," in Proc. IEEE/CVF CVPR, Jun. 2018, pp. 3588–3597.
[22] H. Hu, Z. Zhang, Z. Xie, and S. Lin, "Local relation networks for image recognition," in Proc. IEEE/CVF ICCV, Oct. 2019, pp. 3464–3473.
[23] I. Bello, B. Zoph, A. Vaswani, J. Shlens, and Q. V. Le, "Attention augmented convolutional networks," in Proc. ICCV, 2019, pp. 3286–3295.
[24] P. Ramachandran, N. Parmar, A. Vaswani, I. Bello, A. Levskaya, and J. Shlens, "Stand-alone self-attention in vision models," in Proc. NeurIPS, 2019, pp. 68–80.
[25] H. Zhao, J. Jia, and V. Koltun, "Exploring self-attention for image recognition," in Proc. IEEE/CVF CVPR, Jun. 2020, pp. 10076–10085.
[26] Z. Zheng, G. An, D. Wu, and Q. Ruan, "Global and local knowledge-aware attention network for action recognition," IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 1, pp. 334–347, Jan. 2021.
[27] A. Vaswani, P. Ramachandran, A. Srinivas, N. Parmar, B. Hechtman, and J. Shlens, "Scaling local self-attention for parameter efficient visual backbones," in Proc. IEEE/CVF CVPR, Jun. 2021, pp. 12894–12904.
[28] J.-B. Cordonnier, A. Loukas, and M. Jaggi, "On the relationship between self-attention and convolutional layers," in Proc. ICLR, 2020, pp. 1–18.
[29] A. Dosovitskiy et al., "An image is worth 16 × 16 words: Transformers for image recognition at scale," in Proc. ICLR, 2021, pp. 1–16.
[30] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," in Proc. ECCV, 2020, pp. 213–229.
[31] H. Wang, Y. Zhu, H. Adam, A. Yuille, and L.-C. Chen, "MaX-DeepLab: End-to-end panoptic segmentation with mask transformers," in Proc. IEEE/CVF CVPR, Jun. 2021, pp. 5463–5474.
[32] B. Cheng, A. Schwing, and A. Kirillov, "Per-pixel classification is not all you need for semantic segmentation," in Proc. NeurIPS, 2021, pp. 17864–17875.
[33] X. Chen, B. Yan, J. Zhu, D. Wang, X. Yang, and H. Lu, "Transformer tracking," in Proc. CVPR, 2021, pp. 8126–8135.
[34] Y. Jiang, S. Chang, and Z. Wang, "TransGAN: Two pure transformers can make one strong GAN, and that can scale up," 2021, arXiv:2102.07074.
[35] Z. Liu et al., "Swin transformer: Hierarchical vision transformer using shifted windows," in Proc. IEEE/CVF ICCV, Oct. 2021, pp. 10012–10022.
[36] H. Wu et al., "CvT: Introducing convolutions to vision transformers," in Proc. IEEE/CVF ICCV, Oct. 2021, pp. 22–31.
[37] D. Zhou et al., "Refiner: Refining self-attention for vision transformers," 2021, arXiv:2106.03714.
[38] X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer, "Scaling vision transformers," 2021, arXiv:2106.04560.
[39] Z. Dai, H. Liu, Q. V. Le, and M. Tan, "CoAtNet: Marrying convolution and attention for all data sizes," in Proc. NeurIPS, 2021, pp. 3965–3977.
[40] H. Touvron, M. Cord, D. Matthijs, F. Massa, A. Sablayrolles, and H. Jégou, "Training data-efficient image transformers & distillation through attention," in Proc. ICLR, 2021, pp. 10347–10357.
[41] W. Wang et al., "Pyramid vision transformer: A versatile backbone for dense prediction without convolutions," in Proc. IEEE/CVF ICCV, Oct. 2021, pp. 568–578.
[42] H. Touvron, M. Cord, A. Sablayrolles, G. Synnaeve, and H. Jégou, "Going deeper with image transformers," in Proc. IEEE/CVF ICCV, Oct. 2021, pp. 32–42.
[43] Z.-H. Jiang et al., "All tokens matter: Token labeling for training better vision transformers," in Proc. NeurIPS, 2021, pp. 18590–18602.
[44] L. Yuan, Q. Hou, Z. Jiang, J. Feng, and S. Yan, "VOLO: Vision outlooker for visual recognition," 2021, arXiv:2106.13112.
[45] Y. Tay, M. Dehghani, D. Bahri, and D. Metzler, "Efficient transformers: A survey," ACM Comput. Surv., vol. 55, no. 6, pp. 1–28, Apr. 2022.
[46] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, "Transformers in vision: A survey," ACM Comput. Surv., vol. 54, no. 10s, pp. 1–41, Jan. 2022.
[47] K. Han et al., "A survey on vision transformer," IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 1, pp. 87–110, Jan. 2023.
[48] T. Lin, Y. Wang, X. Liu, and X. Qiu, "A survey of transformers," AI Open, vol. 3, pp. 111–132, Oct. 2022.
[49] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proc. NeurIPS, 2014, pp. 3104–3112.
[50] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, "LSTM: A search space Odyssey," IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 10, pp. 2222–2232, Oct. 2017.
[51] B. Wu et al., "Visual transformers: Token-based image representation and processing for computer vision," 2020, arXiv:2006.03677.
[52] A. Srinivas, T.-Y. Lin, N. Parmar, J. Shlens, P. Abbeel, and A. Vaswani, "Bottleneck transformers for visual recognition," in Proc. IEEE/CVF CVPR, Jun. 2021, pp. 16519–16529.
[53] S. D'Ascoli, H. Touvron, M. Leavitt, A. Morcos, G. Biroli, and L. Sagun, "ConViT: Improving vision transformers with soft convolutional inductive biases," in Proc. ICLR, 2021, pp. 2286–2296.
[54] K. Yuan, S. Guo, Z. Liu, A. Zhou, F. Yu, and W. Wu, "Incorporating convolution designs into visual transformers," in Proc. IEEE/CVF ICCV, Oct. 2021, pp. 579–588.
[55] Y. Li, K. Zhang, J. Cao, R. Timofte, and L. Van Gool, "LocalViT: Bringing locality to vision transformers," 2021, arXiv:2104.05707.
[56] X. Chu et al., "Conditional positional encodings for vision transformers," 2021, arXiv:2102.10882.
[57] Q. Zhang and Y.-B. Yang, "ResT: An efficient transformer for visual recognition," in Proc. NeurIPS, 2021, pp. 15475–15485.
[58] T. Xiao, P. Dollár, M. Singh, E. Mintun, T. Darrell, and R. Girshick, "Early convolutions help transformers see better," in Proc. NeurIPS, 2021, pp. 30392–30400.
[59] K. Han, A. Xiao, E. Wu, J. Guo, C. Xu, and Y. Wang, "Transformer in transformer," in Proc. NeurIPS, 2021, pp. 15908–15919.
[60] X. Chu et al., "Twins: Revisiting the design of spatial attention in vision transformers," in Proc. NeurIPS, 2021, pp. 9355–9366.
[61] P. Zhang et al., "Multi-scale vision longformer: A new vision transformer for high-resolution image encoding," in Proc. IEEE/CVF ICCV, Oct. 2021, pp. 2998–3008.
[62] J. Yang et al., "Focal self-attention for local-global interactions in vision transformers," 2021, arXiv:2107.00641.
[63] L. Yuan et al., "Tokens-to-Token ViT: Training vision transformers from scratch on ImageNet," in Proc. IEEE/CVF ICCV, Oct. 2021, pp. 558–567.
[64] B. Heo, S. Yun, D. Han, S. Chun, J. Choe, and S. J. Oh, "Rethinking spatial dimensions of vision transformers," in Proc. IEEE/CVF ICCV, Oct. 2021, pp. 11936–11945.
[65] W. Wang et al., "PVT v2: Improved baselines with pyramid vision transformer," 2021, arXiv:2106.13797.
[66] D. Zhou et al., "DeepViT: Towards deeper vision transformer," 2021, arXiv:2103.11886.
[67] C. Gong, D. Wang, M. Li, V. Chandra, and Q. Liu, "Vision transformers with patch diversification," 2021, arXiv:2104.12753.
[68] M. Chen et al., "Generative pretraining from pixels," in Proc. ICML, 2020, pp. 1691–1703.
[69] Z. Li et al., "MST: Masked self-supervised transformer for visual representation," in Proc. NeurIPS, 2021, pp. 13165–13176.
[70] H. Bao, L. Dong, and F. Wei, "BEiT: BERT pre-training of image transformers," in Proc. ICLR, 2021, pp. 1–18.
[71] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, "Masked autoencoders are scalable vision learners," in Proc. IEEE/CVF CVPR, Jun. 2022, pp. 16000–16009.
[72] X. Chen, S. Xie, and K. He, "An empirical study of training self-supervised vision transformers," in Proc. IEEE/CVF ICCV, Oct. 2021, pp. 9640–9649.
[73] M. Caron et al., "Emerging properties in self-supervised vision transformers," in Proc. IEEE/CVF ICCV, Oct. 2021, pp. 9650–9660.
[74] Z. Xie et al., "Self-supervised learning with Swin transformers," 2021, arXiv:2105.04553.
[75] T. Chen, S. Saxena, L. Li, D. J. Fleet, and G. Hinton, "Pix2seq: A language modeling framework for object detection," in Proc. ICLR, 2021, pp. 1–17.
[76] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, "Deformable DETR: Deformable transformers for end-to-end object detection," in Proc. ICLR, 2021, pp. 1–16.
[77] M. Zheng et al., "End-to-end object detection with adaptive clustering transformer," 2020, arXiv:2011.09315.
[78] T. Wang, L. Yuan, Y. Chen, J. Feng, and S. Yan, "PnP-DETR: Towards efficient visual analysis with transformers," in Proc. IEEE/CVF ICCV, Oct. 2021, pp. 4661–4670.
[79] B. Roh, J. Shin, W. Shin, and S. Kim, "Sparse DETR: Efficient end-to-end object detection with learnable sparsity," in Proc. ICLR, 2021, pp. 1–23.
[80] P. Gao, M. Zheng, X. Wang, J. Dai, and H. Li, "Fast convergence of DETR with spatially modulated co-attention," in Proc. IEEE/CVF ICCV, Oct. 2021, pp. 3621–3630.
[81] D. Meng et al., "Conditional DETR for fast training convergence," in Proc. IEEE/CVF ICCV, Oct. 2021, pp. 3651–3660.
[82] Y. Wang, X. Zhang, T. Yang, and J. Sun, "Anchor DETR: Query design for transformer-based detector," in Proc. AAAI, 2022, pp. 2567–2575.
[83] S. Liu et al., "DAB-DETR: Dynamic anchor boxes are better queries for DETR," in Proc. ICLR, 2021, pp. 1–19.
[84] Y. Liu et al., "SAP-DETR: Bridging the gap between salient points and queries-based transformer detector for fast model convergency," 2022, arXiv:2211.02006.
[85] Z. Yao, J. Ai, B. Li, and C. Zhang, "Efficient DETR: Improving end-to-end object detector with dense prior," 2021, arXiv:2104.01318.
[86] X. Dai, Y. Chen, J. Yang, P. Zhang, L. Yuan, and L. Zhang, "Dynamic DETR: End-to-end object detection with dynamic attention," in Proc. IEEE/CVF ICCV, Oct. 2021, pp. 2988–2997.
[87] Z. Sun, S. Cao, Y. Yang, and K. Kitani, "Rethinking transformer-based set prediction for object detection," in Proc. IEEE/CVF ICCV, Oct. 2021, pp. 3611–3620.
[88] Y. Fang et al., "You only look at one sequence: Rethinking transformer in vision through object detection," in Proc. NeurIPS, 2021, pp. 26183–26197.
[89] Z. Dai, B. Cai, Y. Lin, and J. Chen, "UP-DETR: Unsupervised pre-training for object detection with transformers," in Proc. IEEE/CVF CVPR, Jun. 2021, pp. 1601–1610.
[90] W. Wang, Y. Cao, J. Zhang, and D. Tao, "FP-DETR: Detection transformer advanced by fully pre-training," in Proc. ICLR, 2021, pp. 1–14.
[91] F. Li, H. Zhang, S. Liu, J. Guo, L. M. Ni, and L. Zhang, "DN-DETR: Accelerate DETR training by introducing query denoising," in Proc. IEEE/CVF CVPR, Jun. 2022, pp. 13619–13627.
[92] H. Zhang et al., "DINO: DETR with improved denoising anchor boxes for end-to-end object detection," 2022, arXiv:2203.03605.
[93] D. Zhang, H. Zhang, J. Tang, M. Wang, X. Hua, and Q. Sun, "Feature pyramid transformer," in Proc. ECCV, 2020, pp. 323–339.
[94] Y. Yuan et al., "HRFormer: High-resolution vision transformer for dense prediction," in Proc. NeurIPS, 2021, pp. 7281–7293.
[95] J. Gu et al., "HRViT: Multi-scale high-resolution vision transformer," 2021, arXiv:2111.01236.
[96] S. Zheng et al., "Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers," in Proc. IEEE/CVF CVPR, Jun. 2021, pp. 6881–6890.
[97] J. Chen et al., "TransUNet: Transformers make strong encoders for medical image segmentation," 2021, arXiv:2102.04306.
[98] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, "SegFormer: Simple and efficient design for semantic segmentation with transformers," in Proc. NeurIPS, 2021, pp. 12077–12090.
[99] T. Prangemeier, C. Reich, and H. Koeppl, "Attention-based transformers for instance segmentation of cells in microstructures," in Proc. IEEE Int. Conf. Bioinf. Biomed. (BIBM), Dec. 2020, pp. 700–707.
[100] Y. Wang et al., "End-to-end video instance segmentation with transformers," in Proc. IEEE/CVF CVPR, Jun. 2021, pp. 8741–8750.
[101] Y. Fang et al., "Instances as queries," in Proc. ICCV, 2021, pp. 6910–6919.
[102] J. Hu et al., "ISTR: End-to-end instance segmentation with transformers," 2021, arXiv:2105.00637.
[103] B. Dong, F. Zeng, T. Wang, X. Zhang, and Y. Wei, "SOLQ: Segmenting objects by learning queries," in Proc. NeurIPS, 2021, pp. 21898–21909.
[104] R. Strudel, R. Garcia, I. Laptev, and C. Schmid, "Segmenter: Transformer for semantic segmentation," in Proc. IEEE/CVF ICCV, Oct. 2021, pp. 7262–7272.
[105] H. Zhao, L. Jiang, J. Jia, P. Torr, and V. Koltun, "Point transformer," in Proc. IEEE/CVF ICCV, Oct. 2021, pp. 16259–16268.
[106] M.-H. Guo, J.-X. Cai, Z.-N. Liu, T.-J. Mu, R. R. Martin, and S.-M. Hu, "PCT: Point cloud transformer," Comput. Vis. Media, vol. 7, no. 2, pp. 187–199, 2021.
[107] D. Lu, Q. Xie, K. Gao, L. Xu, and J. Li, "3DCTN: 3D convolution-transformer network for point cloud classification," IEEE Trans. Intell. Transp. Syst., vol. 23, no. 12, pp. 24854–24865, Dec. 2022.
[108] C. Park, Y. Jeong, M. Cho, and J. Park, "Fast point transformer," in Proc. IEEE/CVF CVPR, Jun. 2022, pp. 16949–16958.
[109] X. Pan, Z. Xia, S. Song, L. E. Li, and G. Huang, "3D object detection with pointformer," in Proc. IEEE/CVF CVPR, Jun. 2021, pp. 7463–7472.
[110] L. Fan et al., "Embracing single stride 3D object detector with sparse transformer," in Proc. IEEE/CVF CVPR, Jun. 2022, pp. 8458–8468.
[111] J. Mao et al., "Voxel transformer for 3D object detection," in Proc. IEEE/CVF ICCV, Oct. 2021, pp. 3144–3153.
[112] C. He, R. Li, S. Li, and L. Zhang, "Voxel set transformer: A set-to-set approach to 3D object detection from point clouds," in Proc. IEEE/CVF CVPR, Jun. 2022, pp. 8417–8427.
[113] X. Yu, L. Tang, Y. Rao, T. Huang, J. Zhou, and J. Lu, "Point-BERT: Pre-training 3D point cloud transformers with masked point modeling," in Proc. IEEE/CVF CVPR, Jun. 2022, pp. 19313–19322.
[114] Y. Pang, W. Wang, F. E. H. Tay, W. Liu, Y. Tian, and L. Yuan, "Masked autoencoders for point cloud self-supervised learning," 2022, arXiv:2203.06604.
[115] H. Liu, M. Cai, and Y. J. Lee, "Masked discrimination for self-supervised learning on point clouds," 2022, arXiv:2203.11183.
[116] I. Misra, R. Girdhar, and A. Joulin, "An end-to-end transformer model for 3D object detection," in Proc. IEEE/CVF ICCV, Oct. 2021, pp. 2906–2917.
[117] Z. Liu, Z. Zhang, Y. Cao, H. Hu, and X. Tong, "Group-free 3D object detection via transformers," in Proc. IEEE/CVF ICCV, Oct. 2021, pp. 2949–2958.
[118] H. Sheng et al., "Improving 3D object detection with channel-wise transformer," in Proc. IEEE/CVF ICCV, Oct. 2021, pp. 2743–2752.
[119] K.-C. Huang, T.-H. Wu, H.-T. Su, and W. H. Hsu, "MonoDTR: Monocular 3D object detection with depth-aware transformer," in Proc. IEEE/CVF CVPR, Jun. 2022, pp. 4012–4021.
[120] R. Zhang et al., "MonoDETR: Depth-guided transformer for monocular 3D object detection," 2022, arXiv:2203.13310.
[121] Y. Wang, V. C. Guizilini, T. Zhang, Y. Wang, H. Zhao, and J. Solomon, "DETR3D: 3D object detection from multi-view images via 3D-to-2D queries," in Proc. 5th Conf. Robot Learn., 2022, pp. 180–191.
[122] X. Bai et al., "TransFusion: Robust LiDAR-camera fusion for 3D object detection with transformers," in Proc. IEEE/CVF CVPR, Jun. 2022, pp. 1090–1099.
[123] X. Yu, Y. Rao, Z. Wang, Z. Liu, J. Lu, and J. Zhou, "PoinTr: Diverse point cloud completion with geometry-aware transformers," in Proc. ICCV, 2021, pp. 12478–12487.
[124] P. Xiang et al., "SnowflakeNet: Point cloud completion by snowflake point deconvolution with skip-transformer," in Proc. IEEE/CVF ICCV, Oct. 2021, pp. 5479–5489.
[125] J. Choe, B. Joung, F. Rameau, J. Park, and I. S. Kweon, "Deep point cloud reconstruction," 2021, arXiv:2111.11704.
[126] S. Chen, T. Yu, and P. Li, "MVT: Multi-view vision transformer for 3D object recognition," 2021, arXiv:2110.13083.
[127] Y. Hou and L. Zheng, "Multiview detection with shadow transformer (and view-coherent data augmentation)," in Proc. 29th ACM Int. Conf. Multimedia, Oct. 2021, pp. 1673–1682.
[128] A. Prakash, K. Chitta, and A. Geiger, "Multi-modal fusion transformer for end-to-end autonomous driving," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 7077–7087.
[129] W. Jiang, E. Trulls, J. Hosang, A. Tagliasacchi, and K. M. Yi, "COTR: Correspondence transformer for matching across images," in Proc. IEEE/CVF ICCV, Oct. 2021, pp. 6207–6217.
[130] D. Wang et al., "Multi-view 3D reconstruction with transformers," in Proc. IEEE/CVF ICCV, Oct. 2021, pp. 5722–5731.
[131] X. Chen, T. Zhang, Y. Wang, Y. Wang, and H. Zhao, "FUTR3D: A unified sensor fusion framework for 3D detection," 2022, arXiv:2203.10642.
[132] A. Bozic, P. Palafox, J. Thies, A. Dai, and M. Nießner, "TransformerFusion: Monocular RGB scene reconstruction using transformers," in Proc. NeurIPS, 2021, pp. 1403–1414.
[133] Y. Zhang et al., "mmFormer: Multimodal medical transformer for incomplete multimodal learning of brain tumor segmentation," in Proc. MICCAI, 2022, pp. 107–117.
[134] G. V. Tulder, Y. Tong, and E. Marchiori, "Multi-view analysis of unregistered medical images using cross-view transformers," in Proc. MICCAI, 2021, pp. 104–113.
[135] X. Long, L. Liu, W. Li, C. Theobalt, and W. Wang, "Multi-view depth estimation using epipolar spatio-temporal networks," in Proc. IEEE/CVF CVPR, Jun. 2021, pp. 8258–8267.
[136] D. Song et al., "Deep relation transformer for diagnosing glaucoma with optical coherence tomography and visual field function," IEEE Trans. Med. Imag., vol. 40, no. 9, pp. 2392–2402, Sep. 2021.
[137] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid, "VideoBERT: A joint model for video and language representation learning," in Proc. IEEE/CVF ICCV, Oct. 2019, pp. 7464–7473.
[138] J. Lu, D. Batra, D. Parikh, and S. Lee, "ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks," in Proc. NeurIPS, 2019, pp. 13–23.
[139] H. Tan and M. Bansal, "LXMERT: Learning cross-modality encoder representations from transformers," in Proc. Conf. Empirical Methods Natural Lang. Process. 9th Int. Joint Conf. Natural Lang. Process. (EMNLP-IJCNLP), 2019, pp. 5100–5111.
[140] L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, and K.-W. Chang, "VisualBERT: A simple and performant baseline for vision and language," 2019, arXiv:1908.03557.
[141] W. Su et al., "VL-BERT: Pre-training of generic visual-linguistic representations," 2019, arXiv:1908.08530.
[142] Y.-C. Chen et al., "UNITER: Universal image-text representation learning," in Proc. ECCV, 2020, pp. 104–120.
[143] X. Li et al., "OSCAR: Object-semantics aligned pre-training for vision-language tasks," in Proc. ECCV, 2020, pp. 121–137.
[144] L. Zhou, H. Palangi, L. Zhang, H. Hu, J. Corso, and J. Gao, "Unified vision-language pre-training for image captioning and VQA," in Proc. AAAI, 2020, pp. 13041–13049.
[145] W. Kim, B. Son, and I. Kim, "ViLT: Vision-and-language transformer without convolution or region supervision," in Proc. ICML, 2021, pp. 5583–5594.
[146] P. Zhang et al., "VinVL: Revisiting visual representations in vision-language models," in Proc. IEEE/CVF CVPR, Jun. 2021, pp. 5579–5588.
[147] A. Radford et al., "Learning transferable visual models from natural language supervision," in Proc. ICML, 2021, pp. 8748–8763.
[148] A. Ramesh et al., "Zero-shot text-to-image generation," in Proc. ICML, 2021, pp. 8821–8831.
[149] C. Jia et al., "Scaling up visual and vision-language representation learning with noisy text supervision," in Proc. ICML, 2021, pp. 4904–4916.
[150] R. Hu and A. Singh, "UniT: Multimodal multitask learning with a unified transformer," in Proc. IEEE/CVF ICCV, Oct. 2021, pp. 1439–1449.
[151] Z. Wang, J. Yu, A. W. Yu, Z. Dai, Y. Tsvetkov, and Y. Cao, "SimVLM: Simple visual language model pretraining with weak supervision," in Proc. ICLR, 2021, pp. 1–17.
[152] A. Baevski, W.-N. Hsu, Q. Xu, A. Babu, J. Gu, and M. Auli, "Data2vec: A general framework for self-supervised learning in speech, vision and language," 2022, arXiv:2202.03555.
[153] A. Kamath, M. Singh, Y. LeCun, G. Synnaeve, I. Misra, and N. Carion, "MDETR - Modulated detection for end-to-end multi-modal understanding," in Proc. IEEE/CVF ICCV, Oct. 2021, pp. 1780–1790.
[154] J. Deng, Z. Yang, T. Chen, W. Zhou, and H. Li, "TransVG: End-to-end visual grounding with transformers," in Proc. IEEE/CVF ICCV, Oct. 2021, pp. 1769–1779.
[155] Y. Du, Z. Fu, Q. Liu, and Y. Wang, "Visual grounding with transformers," in Proc. ICME, 2022, pp. 1–6.
[156] M. Li and L. Sigal, "Referring transformer: A one-step approach to multi-task visual grounding," in Proc. NeurIPS, 2021, pp. 19652–19664.
[157] H. Jiang, Y. Lin, D. Han, S. Song, and G. Huang, "Pseudo-Q: Generating pseudo language queries for visual grounding," in Proc. IEEE/CVF CVPR, Jun. 2022, pp. 15513–15523.
[158] J. Roh, K. Desingh, A. Farhadi, and D. Fox, "LanguageRefer: Spatial-language model for 3D visual grounding," in Proc. 5th Conf. Robot Learn., 2022, pp. 1046–1056.
[159] D. He et al., "TransRefer3D: Entity-and-relation aware transformer for fine-grained 3D visual grounding," in Proc. 29th ACM Int. Conf. Multimedia, 2021, pp. 2344–2352.
[160] S. Huang, Y. Chen, J. Jia, and L. Wang, "Multi-view transformer for 3D visual grounding," in Proc. IEEE/CVF CVPR, Jun. 2022, pp. 15524–15533.
[161] A. Yang, A. Miech, J. Sivic, I. Laptev, and C. Schmid, "TubeDETR: Spatio-temporal video grounding with transformers," in Proc. IEEE/CVF CVPR, Jun. 2022, pp. 16442–16453.
[162] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," 2016, arXiv:1607.06450.
[163] P. P. Brahma, D. Wu, and Y. She, "Why deep learning works: A manifold disentanglement perspective," IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 10, pp. 1997–2008, Oct. 2016.
[164] Y. Chen, Y. Kalantidis, J. Li, S. Yan, and J. Feng, "A2-Nets: Double attention networks," in Proc. NeurIPS, 2018, pp. 352–361.
[165] A. Krizhevsky et al., "Learning multiple layers of features from tiny images," Univ. Toronto, Tech. Rep., 2009.
[166] C. Sun, A. Shrivastava, S. Singh, and A. Gupta, "Revisiting unreasonable effectiveness of data in deep learning era," in Proc. IEEE ICCV, Oct. 2017, pp. 843–852.
[167] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. CVPR, Jun. 2009, pp. 248–255.
[168] Y. Dong, J.-B. Cordonnier, and A. Loukas, "Attention is not all you need: Pure attention loses rank doubly exponentially with depth," in Proc. ICLR, 2021, pp. 2793–2803.
[169] P. Shaw, J. Uszkoreit, and A. Vaswani, "Self-attention with relative position representations," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, Hum. Lang. Technol., vol. 2, 2018, pp. 464–468.
[170] P. W. Battaglia et al., "Relational inductive biases, deep learning, and graph networks," 2018, arXiv:1806.01261.
[171] S. Abnar, M. Dehghani, and W. Zuidema, "Transferring inductive biases through knowledge distillation," 2020, arXiv:2006.00555.
[172] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proc. IEEE/CVF CVPR, Jun. 2018, pp. 4510–4520.
[173] A. Islam, S. Jia, and N. D. B. Bruce, "How much position information do convolutional neural networks encode?" in Proc. ICLR, 2020, pp. 1–11.
[174] J. Lin, C. Gan, and S. Han, "TSM: Temporal shift module for efficient video understanding," in Proc. IEEE/CVF ICCV, Oct. 2019, pp. 7083–7093.
[175] Y. Pang, M. Sun, X. Jiang, and X. Li, "Convolution in convolution for network in network," IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 5, pp. 1587–1597, May 2018.
[176] J. Gao, D. He, X. Tan, T. Qin, L. Wang, and T.-Y. Liu, "Representation degeneration problem in training natural language generation models," in Proc. ICLR, 2019, pp. 1–14.
[177] S. Yun, D. Han, S. Chun, S. J. Oh, Y. Yoo, and J. Choe, "CutMix: Regularization strategy to train strong classifiers with localizable features," in Proc. IEEE/CVF ICCV, Oct. 2019, pp. 6023–6032.
[178] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin, "Unsupervised learning of visual features by contrasting cluster assignments," in Proc. NeurIPS, 2020, pp. 9912–9924.
[179] C.-F.-R. Chen, Q. Fan, and R. Panda, "CrossViT: Cross-attention multi-scale vision transformer for image classification," in Proc. IEEE/CVF ICCV, Oct. 2021, pp. 357–366.
[180] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, Jun. 2017.
[181] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. IEEE CVPR, Jun. 2016, pp. 779–788.
[182] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proc. IEEE ICCV, Oct. 2017, pp. 2980–2988.
[183] Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, "Object detection with deep learning: A review," IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 11, pp. 3212–3232, Nov. 2019.
[184] J. Dai et al., "Deformable convolutional networks," in Proc. ICCV, 2017, pp. 764–773.
[185] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proc. IEEE CVPR, Jul. 2017, pp. 2117–2125.
[186] Z. Tian, C. Shen, H. Chen, and T. He, "FCOS: Fully convolutional one-stage object detection," in Proc. IEEE/CVF ICCV, Oct. 2019, pp. 9627–9636.
[187] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proc. ICCV, 2017, pp. 2961–2969.
[188] Z. Cai and N. Vasconcelos, "Cascade R-CNN: Delving into high quality object detection," in Proc. IEEE CVPR, Jun. 2018, pp. 6154–6162.
[189] P. Sun et al., "What makes for end-to-end object detection?" in Proc. ICML, 2021, pp. 9934–9944.
[190] K. Sun, B. Xiao, D. Liu, and J. Wang, "Deep high-resolution representation learning for human pose estimation," in Proc. IEEE/CVF CVPR, Jun. 2019, pp. 5693–5703.
[191] L.-C. Chen, G. Papandreou, and I. Kokkinos, "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 834–848, Apr. 2018.
[192] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. Med. Image Comput. Comput.-Assist. Intervent., 2015, pp. 234–241.
[193] P. Sun et al., "Sparse R-CNN: End-to-end object detection with learnable proposals," in Proc. IEEE/CVF CVPR, Jun. 2021, pp. 14454–14463.
[194] T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun, "Unified perceptual parsing for scene understanding," in Proc. ECCV, 2018, pp. 418–434.
[195] M. Chen et al., "Searching the search space of vision transformer," in Proc. NeurIPS, 2021, pp. 1–13.
[196] H. Chen, K. Sun, Z. Tian, C. Shen, Y. Huang, and Y. Yan, "BlendMask: Top-down meets bottom-up for instance segmentation," in Proc. IEEE/CVF CVPR, Jun. 2020, pp. 8573–8581.
[197] X. Wang, R. Zhang, T. Kong, L. Li, and C. Shen, "SOLOv2: Dynamic and fast instance segmentation," in Proc. NeurIPS, 2020, pp. 17721–17732.
[198] Y. Wang, R. Huang, S. Song, Z. Huang, and G. Huang, "Not all images are worth 16 × 16 words: Dynamic transformers for efficient image recognition," in Proc. NeurIPS, 2021, pp. 11960–11973.
[199] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, "PointNet++: Deep hierarchical feature learning on point sets in a metric space," in Proc. NeurIPS, 2017, pp. 1–10.
[200] P. Anderson et al., "Bottom-up and top-down attention for image captioning and visual question answering," in Proc. IEEE/CVF CVPR, Jun. 2018, pp. 6077–6086.
[201] R. Krishna et al., "Visual Genome: Connecting language and vision using crowdsourced dense image annotations," Int. J. Comput. Vis., vol. 123, no. 1, pp. 32–73, 2017.
[202] P. Sharma, N. Ding, S. Goodman, and R. Soricut, "Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning," in Proc. 56th Annu. Meeting Assoc. Comput. Linguistics, 2018, pp. 2556–2565.
[203] S. Antol et al., "VQA: Visual question answering," in Proc. IEEE ICCV, Dec. 2015, pp. 2425–2433.
[204] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: A neural image caption generator," in Proc. IEEE CVPR, Jun. 2015, pp. 3156–3164.
[205] P. Achlioptas, A. Abdelreheem, F. Xia, M. Elhoseiny, and L. Guibas, "ReferIt3D: Neural listeners for fine-grained 3D object identification in real-world scenes," in Proc. ECCV, 2020, pp. 422–440.
[206] W. Yu et al., "MetaFormer is actually what you need for vision," in Proc. IEEE/CVF CVPR, Jun. 2022, pp. 10819–10829.
[207] N. Park and S. Kim, "How do vision transformers work?" in Proc. ICLR, 2021, pp. 1–26.