0% found this document useful (0 votes)
27 views21 pages

A Survey of Visual Transformers

Uploaded by

rafsun.sheikh
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
27 views21 pages

A Survey of Visual Transformers

Uploaded by

rafsun.sheikh
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 21

7478 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 35, NO.

6, JUNE 2024

A Survey of Visual Transformers


Yang Liu , Yao Zhang , Yixin Wang , Feng Hou , Jin Yuan , Jiang Tian , Yang Zhang ,
Zhongchao Shi , Jianping Fan , and Zhiqiang He

Abstract— Transformer, an attention-based encoder–decoder As shown in Fig. 1, Transformers have gradually emerged
model, has already revolutionized the field of natural language as the predominant deep learning models for many natural
processing (NLP). Inspired by such significant achievements, language processing (NLP) tasks. The most recent dominant
some pioneering works have recently been done on employing
Transformer-liked architectures in the computer vision (CV) models are the self-supervised Transformers, which are pre-
field, which have demonstrated their effectiveness on three trained over sufficient datasets and then fine-tuned over a
fundamental CV tasks (classification, detection, and segmenta- small sample set for a given downstream task [2], [3], [4],
tion) as well as multiple sensory data stream (images, point [5], [6], [7], [8], [9]. The generative pretrained transformer
clouds, and vision-language data). Because of their competitive (GPT) families [2], [3], [4] leverage the Transformer decoders
modeling capabilities, the visual Transformers have achieved
impressive performance improvements over multiple benchmarks to enable autoregressive language modeling, while the bidirec-
as compared with modern convolution neural networks (CNNs). tional encoder representations from transformers (BERT) [5]
In this survey, we have reviewed over 100 of different visual and its variants [6], [7] serve as autoencoder language models
Transformers comprehensively according to three fundamental built on the Transformer encoders.
CV tasks and different data stream types, where taxonomy is In the computer vision (CV) field, prior to the visual Trans-
proposed to organize the representative methods according to
their motivations, structures, and application scenarios. Because formers, convolution neural networks (CNNs) have emerged
of their differences on training settings and dedicated vision as a dominant paradigm [10], [11], [12]. Inspired by the
tasks, we have also evaluated and compared all these existing great success of such self-attention mechanisms (Fig. 2) for
visual Transformers under different configurations. Furthermore, the NLP tasks [1], [13], some CNN-based models attempted
we have revealed a series of essential but unexploited aspects that to capture the long-range dependencies through adding a
may empower such visual Transformers to stand out from numer-
ous architectures, e.g., slack high-level semantic embeddings to self-attention layer at either spatial level [14], [15], [16] or
bridge the gap between the visual Transformers and the sequen- channel level [17], [18], [19], while others try to replace the
tial ones. Finally, two promising research directions are suggested traditional convolutions entirely with the global [20] or local
for future investment. We will continue to update the latest self-attention blocks [21], [22], [23], [24], [25], [26], [27].
articles and their released source codes at https://github. Although Ramachandran et al. [24] have demonstrated the
com/liuyang-ict/awesome-visual-transformers.
efficiency of self-attention block without the help from CNNs,
Index Terms— Classification, computer vision (CV), detec- such pure attention model is still inferior to the state-of-the-art
tion, point clouds, segmentation, self-supervision, visual-linguistic (SoTA) CNN models on the prevailing benchmarks.
pretraining, visual Transformer.
With the grateful achievements of linguistic Transformers
I. I NTRODUCTION and the rapid development of visual attention-based mod-
els, numerous recent works have migrated the Transformers

T RANSFORMER [1], which adopts an attention-based


structure, has first demonstrated its tremendous effects
on the tasks of sequence modeling and machine translation.
to the CV tasks, and some comparable results have been
achieved. Cordonnier et al. [28] theoretically demonstrated
the equivalence between multihead self-attention (MHSA) and
CNNs, and they designed a pure Transformer by using patch
Manuscript received 10 November 2021; revised 29 April 2022 and 24 July
2022; accepted 26 November 2022. Date of publication 30 March 2023; downsampling and quadratic position encoding to verify their
date of current version 4 June 2024. (Corresponding authors: Zhiqiang He; theoretical conclusion. Dosovitskiy et al. [29] further extended
Zhongchao Shi; Yang Zhang.) such a pure Transformer for large-scale pretraining, which
Yang Liu, Yao Zhang, and Feng Hou are with the Institute of Com-
puting Technology, Chinese Academy of Sciences, Beijing 100000, China, has achieved SoTA performance over many benchmarks.
and also with the School of Computer Science and Technology, Univer- In addition, the visual Transformers have also obtained great
sity of Chinese Academy of Sciences, Beijing 100000, China (e-mail: performances for other CV tasks, such as detection [30],
[email protected]).
Yixin Wang is with the School of Engineering, Stanford University, Palo segmentation [31], [32], tracking [33], and generation [34].
Alto, 94305 USA. As shown in Fig. 1, following the pioneer works [29], [30],
Jin Yuan is with the School of Computer Science and Engineering, Southeast hundreds of Transformer-based models have been proposed for
University, Nanjing 214135, China.
Jiang Tian, Yang Zhang, Zhongchao Shi, and Jianping Fan are with the AI various vision applications within the last year. Thus, a system-
Lab, Lenovo Research, Beijing 100000, China (e-mail: [email protected]; atic literature survey is strongly desired to identify, categorize,
[email protected]). and evaluate these existing visual Transformers. Considering
Zhiqiang He is with the Institute of Computing Technology, Chinese
Academy of Sciences, Beijing 100000, China, also with the University of that the readers may come from different areas, we review
Chinese Academy of Sciences, Beijing 100000, China, and also with Lenovo all these visual Transformers according to three fundamental
Ltd., Beijing 100000, China (e-mail: [email protected]). CV tasks (i.e., classification, detection, and segmentation) and
This article has supplementary downloadable material available at
https://doi.org/10.1109/TNNLS.2022.3227717, provided by the authors. different types of data streams (i.e., images, point clouds, and
Digital Object Identifier 10.1109/TNNLS.2022.3227717 multistream data). As shown in Fig. 3, this survey categorizes
2162-237X © 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Military Institute of Science and Technology. Downloaded on September 12,2024 at 08:52:55 UTC from IEEE Xplore. Restrictions apply.
LIU et al.: SURVEY OF VISUAL TRANSFORMERS 7479

Fig. 1. Odyssey of Transformer application and growth curves of both Transformer [1] and ViT [29] citations according to Google Scholar. Top left: growth
of Transformer citations in the top linguistics and machine learning conference publications. Top right: growth of ViT citations in Arxiv publications. Bottom
left: Odyssey of language model [1], [2], [3], [4], [5], [6], [7], [8]. Bottom right: Odyssey of visual Transformer backbone where the black [29], [35], [36],
[37], [38], [39] is the SoTA with external data and the blue [40], [41], [42], [43], [44] refers to the SoTA without external data (best viewed in color).

for neck detector, and general-purpose mask prediction


scheme for segmentation.
(3) In-depth Analysis. We further provide well-thought
insights from the following aspects.
a) How visual Transformers bridge the traditional
sequential tasks to the visual ones (why does
Transformer work effectively in CV).
b) The correspondence between the visual Transform-
ers and other neural networks.
c) The double edges of the visual Transformers.
d) The correlation of the learnable embeddings (i.e.,
class token, object query, and mask embedding)
Fig. 2. Structure of the attention layer. Left: scaled dot-product attention.
Right: multihead attention mechanism.
adopted in different tasks and data stream types.
Finally, we outline some future research directions.
all these existing methods into multiple groups according to For example, the encoder–decoder Transformer
their dedicated vision tasks, data stream types, motivations, backbone can unify multiple visual tasks and data
and structural characteristics. stream types through query embeddings.
Before us, several reviews have been published. Tay The rest of this article is organized as follows. An overview
et al. [45] reviewed the efficiency of the linguistic Trans- of the architectures and the critical components for the
formers, Khan et al. [46] and Han et al. [47] summarized vanilla sequential Transformers are introduced in Section II.
the early visual Transformers and attention-based models, A comprehensive taxonomy for the Transformer backbones
and Lin et al. [48] provided a systematic review of various is summarized in Section III with a brief discussion of
linguistic Transformers and sketchy vision applications. Dis- their applications for image classification. We then review
tinctively, this article provides a more comprehensive review contemporary Transformer detectors, including Transformer
of the most recent visual Transformers and categorizes them necks and backbones in Section IV. Section V clarifies the
systematically. mainstream and its variants for the visual Transformers in
(1) Comprehensiveness and Readability. This article com- the segmentation field according to their embedding forms
prehensively reviews over 100 visual Transformers (i.e., patch embedding and query embedding). Sections III–V
according to their applications on three fundamental CV also briefly analyze a specific aspect of their corresponding
tasks (i.e., classification, detection, and segmentation) fields with performance evaluation. In addition to 2-D visual
and different types of data streams (i.e., image, point recognition, Section VI briefly introduces the recently devel-
clouds, and multistream data). We select more represen- oped 3-D visual recognition from the perspective of point
tative methods with detailed descriptions and analyses clouds. Section VII further overviews the fusion approaches
but introduce other related works briefly. In addition within the visual Transformers for multiple data stream types
to analyzing each model independently, we also build (e.g., multiview, multimodality, visual-linguistic pretraining,
their internal connections from certain senses such as and visual grounding). Finally, Section VIII provides three
progressive, contrastive, and multiview analysis. aspects for further discussion and points out some promising
(2) Intuitive Comparison. As these visual Transformers research directions for future investment.
follow different training schemes and hyperparameter
settings for various vision tasks, this survey presents II. O RIGINAL T RANSFORMER
multiple lateral comparisons over different datasets and The original Transformer [1] is first applied to the task for
restrictions. More importantly, we summarize a series of sequence-to-sequence autoregression. Compared with previous
promising components designed for each task, including sequence transduction models [49], [50], such original Trans-
shallow local convolution with hierarchical structure for former inherits the encoder–decoder structure but discards
backbone, spatial prior acceleration with sparse attention the recurrence and convolutions entirely by using multihead
Authorized licensed use limited to: Military Institute of Science and Technology. Downloaded on September 12,2024 at 08:52:55 UTC from IEEE Xplore. Restrictions apply.
7480 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 35, NO. 6, JUNE 2024

Fig. 4. Overall architecture of Transformer [1]. The 2-D lattice represents


each state of queries during training (best viewed in color).

attention mechanisms and pointwise feedforward networks


(FFNs). In the following, we will provide an architectural
overview of the original Transformer.

A. Multihead Attention Mechanism


The mechanism with one single head attention can be
grouped into two parts: 1) a transformation layer maps the
input sequences X ∈ Rn x ×dx and Y ∈ Rn y ×d y into three
different vectors (query Q, key K , and value V ), where n and
d are the length and the dimension of the inputs, respectively,
and 2) an attention layer, as shown in Fig. 2, explicitly
aggregates the query with the corresponding key, assigns them
to the value, and updates the output vector.
The formula for the transformation layer is defined as

Q = XW Q, K = YWK, V = YWV (1)


v
where W Q ∈ Rdx ×d , W K ∈ Rd y ×d , and W V ∈ Rd y ×d
k k

are linear matrices; and d k and d v are the dimension of the


query–key pair and the value that are projected from Y and
X, respectively. Such two sequence inputs are referred to
as the cross-attention mechanism. It can also be regarded as
self-attention when Y = X. In form, self-attention is applied
to both Transformer encoder and decoder, while the cross
attention severs as a junction within the decoder.
Then, the scale-dot attention mechanism is formulated as
 
QK T
Attention(Q, K , V ) = Softmax √ V (2)
dk
where the attention weights are generated by a dot-product
operation between Q and K , and a scaling factor (dk )1/2 and
a softmax operation are supplied to translate the attention
weights into a normalized distribution. The resulting weights
are assigned to the corresponding value elements, thereby
yielding the final output vector.
Due to the restricted feature subspace, the modeling capabil-
ity of the single-head attention block is quite coarse. To tackle
this issue, as shown in Fig. 2, an MHSA mechanism is
proposed to linearly project the input into multiple feature
subspaces and process them by using several independent
Fig. 3. Taxonomy of visual Transformers (best viewed in color). attention heads (layers) parallelly. The resulting vectors are

Authorized licensed use limited to: Military Institute of Science and Technology. Downloaded on September 12,2024 at 08:52:55 UTC from IEEE Xplore. Restrictions apply.
LIU et al.: SURVEY OF VISUAL TRANSFORMERS 7481

concatenated and mapped to the final outputs. The process of


MHSA can be formulated as
Q i = X W Q i , K i = X W K i , Vi = X W Vi
Z i = Attention(Q i , K i , Vi ), i = 1 . . . h
MultiHead(Q, K , V ) = Concat(Z 1 , Z 2 , . . . , Z h )W O (3)
where h is the head number, W O ∈ Rhdv ×dmodel denotes the
output projected matrix, Z i denotes the output vector of each
head, and W Q i ∈ Rdmodel ×dk , W K i ∈ Rdmodel ×dk , and W Vi ∈
Rdmodel ×dv are three different groups of matrices. Multihead
attention separates the inputs into h independent attention
heads with dmodel / h-dimensional vectors and integrates each
head feature dependently. Without extra costs, multihead atten-
tion enriches the diversity of the feature subspaces.

B. Positionwise FFNs Fig. 5. Taxonomy of visual Transformer backbone (best viewed in color).
The output of MHSA is then fed into two successive FFNs
with a ReLU activation as stack and the decoder stack. Finally, a softmax operation is
used for predicting the next word.
FFN(x) = RELU(W1 x + b1 )W2 + b2 . (4) In an autoregressive language model, the Transformer
This positionwise feedforward layer can be viewed as a is originated from the machine translation tasks. Given a
pointwise convolution, which treats each position equally but sequence of words, the Transformer vectorizes the input
uses different parameters between each layer. sequence into the word embeddings, adds the positional encod-
ings, and feeds the resulting sequence of the vectors into an
C. Positional Encoding encoder. During training, as shown in Fig. 4, Vaswani et al. [1]
Since the Transformer/attention operates on the input designed a masking operation according to the rule for the
embedding simultaneously and identically, the order of the autoregressive task, where the current position only depends on
sequence is neglected. To make use of the sequential informa- the outputs of the previous positions. Based on this masking,
tion, a common solution is to append an extra positional vector the Transformer decoder is able to process the sequence of the
to the inputs, hence term the “positional encoding.” There are input labels parallelly. During the inference time, the sequence
many choices for positional encoding. For example, a typical of the previously predicted words is processed by the same
choice is cosine functions with different frequencies as operation to predict the next word.

sin(pos · ωk ), if i = 2k III. T RANSFORMER FOR C LASSIFICATION
PE(pos,i) = Following the prominent developments of the Transformers
cos(pos · ωk ), if i = 2k + 1
in NLP recent works attempt to introduce visual Transformers
1
ωk = , k = 1, . . . , d/2 (5) for image classification. This section comprehensively reviews
10 0002k/d over 40 visual Transformers and groups them into six cat-
where pos and d are the position and the length of the vector, egories, as shown in Fig. 5. We start with introducing the
respectively, and i is the index of each element within vector. fully attentional network [24], [28] and the vision Trans-
former (ViT) [29] that first demonstrates Transformer efficacy
D. Transformer Model on large scale classification benchmarks. Then, we discuss
Fig. 4 shows the overall Transformer models with the Transformer-enhanced CNN methods that utilize Transformer
encoder–decoder architecture. Specifically, it consists of N to enhance the representation learning in CNNs. Due to the
successive encoder blocks, each of which is composed of negligence of local information in the original ViT, the CNN-
two sublayers: 1) an MHSA layer aggregates the relationship enhanced transformer employs an appropriate convolutional
within the encoder embeddings and 2) a positionwise FFN inductive bias to augment the visual Transformer, while
layer extracts feature representations. For the decoder, it also the local attention-enhanced Transformer redesigns patch par-
involves N consecutive blocks that follow a stack of the tition and attention blocks to improve their locality. Following
encoders. Compared with the encoder, each decoder block the hierarchical and deep structures in CNNs [163], the hier-
appends to a multihead cross-attention layer to aggregate both archical Transformer replaces the fixed-resolution columnar
decoder embeddings and encoder outputs, where Y corre- structure with a pyramid stem, while the deep Transformer
sponds to the former and X is the latter as shown in (1). More- prevents the attention map from oversmooth and increases
over, all of the sublayers in both encoder and decoder employ its diversity in the deep layer. Moreover, we also review the
a residual connection [11] and a layer normalization [162] to existing visual Transformers with self-supervised learning.
enhance the scalability of the Transformer. In order to record Finally, we make a brief discussion based on intuitive com-
the sequential information, each input embedding is attached parisons for further investigation. More visual Transformers’
with a positional encoding at the beginning of the encoder milestones are introduced in the Supplementary Material.

Authorized licensed use limited to: Military Institute of Science and Technology. Downloaded on September 12,2024 at 08:52:55 UTC from IEEE Xplore. Restrictions apply.
7482 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 35, NO. 6, JUNE 2024

A. Original Visual Transformer


Inspired by the tremendous achievements of the Transform-
ers in the NLP field [2], [3], [4], [5], the previous technology
trends for the vision tasks [14], [15], [16], [17], [164] incor-
porate the attention mechanisms with the convolution models
to augment the models’ receptive field and global dependency.
Beyond such hybrid models, Ramachandran et al. [24]
contemplated whether the attention can completely replace
the convolution and then presented a stand-alone self-attention
network (SANet), which has achieved superior performance on
the vision tasks compared with the original baseline. Given a Fig. 6. Illustration of ViT. The flatten image patches with an additional class
ResNet [11] architecture, the authors straightforwardly replace token are fed into the vanilla Transformer encoder after positional encoding.
Only the class token can be predicted for classification (from [29]).
the spatial convolution layer (3 × 3 kernel) in each bottleneck
block with a locally spatial self-attention layer and keep other namely, VT-block. Such VT-block substitutes the last convo-
structures the same as the original setting in ResNet. Moreover, lution stage to enhance the CNN model’s ability on semantic
lots of ablations have shown that the positional encodings and modeling. Unlike previous approaches that directly replace
convolutional stem can further improve the network efficacy. convolution with attention structure, Srinivas et al. [52] pro-
Following [24], Cordonnier et al. [28] pioneered a prototype posed a conceptual redefinition that the successive bottleneck
design (called fully attentional network in their original paper), blocks with MHSA can be formulated as the bottleneck Trans-
including a fully vanilla Transformer and quadratic positional former (BoTNet) blocks. The relative position encoding [169]
encoding. They also theoretically proved that a convolutional is adopted to further mimic the original Transformer. Based
layer can be approximated by a single MHSA layer with on ResNet [11], BoTNet outperforms the most CNN models
relative positional encoding and sufficient heads. With the with similar parameter settings on the ImageNet benchmark
ablations on CIFAR-10 [165], they further verify that such a and further demonstrates the efficacy of hybrid models.
prototype design does learn to attend a grid-like pattern around
each query pixel, as their theoretical conclusion.
C. CNN-Enhanced Transformer
Different from [28] that only focuses on lite scale model,
the ViT [29] further explores the effectiveness of the vanilla Inductive bias is defined as a set of assumptions on data dis-
Transformer with large-scale pretrained learning, and such a tribution and solution space, whose manifestations within con-
pioneer work impacts the community significantly. Because volution are the locality and the translation invariance [170].
the vanilla Transformers only accept the sequential inputs, the As the covariance within local neighborhoods is large and
input image in ViT is first split into a series of nonoverlapped tends to be gradually stationary across an image, CNNs can
patches and they are then projected into patch embeddings. process an image effectively with the help of the biases. Nev-
Then, a 1-D learnable positional encoding is added on the ertheless, strong biases also limit the upper bound of CNNs
patch embeddings to retain the spatial information, and the when sufficient data are available. Recent efforts attempt to
joint embeddings are then fed into the encoder, as shown in leverage an appropriate CNN bias to enhance Transformer.
Fig. 6. Similar to BERT [5], a learned [class] token is attached Touvron et al. [40] proposed a data-efficient image Trans-
with the patch embeddings to aggregate the global repre- former (DeiT) to moderate the ViT’s dependence on large
sentation and it serves as the final output for classification. datasets. In addition to the existing strategies for data aug-
Moreover, a 2-D interpolation complements the pretrained mentation and regularization, a teacher–student distillation
positional encoding to maintain the consistent order of the strategy is applied for auxiliary representation learning, where
patches when the feeding images are in arbitrary resolu- the student ViT is attached with a distilled token super-
tion. By pretraining with a large-scale private dataset (JFT- vised by the pseudo label from the teacher model. Extensive
300M [166]), ViT has achieved similar or even superior results experiments have demonstrated that CNN is a better teacher
on multiple image recognition benchmarks (ImageNet [167] model than the Transformer. Surprisingly, the distilled student
and CIFAR-100 [165]) compared with the most prevailing Transformer even outperforms its teacher CNN model. These
CNNs methods. However, its generalization capability tends observations are explained in [171]: the teacher CNN transfers
to be eroded with limited training data. its inductive bias in a soft way to the student Transformer
through knowledge distillation. Based on ViT’s architecture,
B. Transformer-Enhanced CNNs DeiT-B attains the top-1 accuracy of 85.2% without external
As described in Section II, the Transformer has two keys: data. ConViT [53] appends a parallel convolution branch with
MHSA and FFN. There exists an approximation between the vanilla Transformer to impose inductive biases softly. The
convolutional layer and the MHSA [28], and Dong et al. [168] main idea of the convolution branch is a learnable embedding
suggested that the Transformer can further mitigate the strong that is first initialized to approximate the locality as similar
bias of MHSA with the help of skip connections and FFN. to the convolution and then explicitly gives each attention
Recently, some methods attempt to integrate the Transformer head freedom to escape the locality by adjusting a learned
into CNNs to enhance representation learning. VTs [51] gating parameter. CeiT [54] and LocalViT [55] extract the
decouple semantic concepts for the input image into different locality by directly adding a depthwise convolution in FFN.
channels and relate them densely through the encoder block, As pointwise convolution is equal to positionwise FFN, they
Authorized licensed use limited to: Military Institute of Science and Technology. Downloaded on September 12,2024 at 08:52:55 UTC from IEEE Xplore. Restrictions apply.
LIU et al.: SURVEY OF VISUAL TRANSFORMERS 7483

extend FFN to an inverted residual block [172] to build a


depthwise convolutional framework. Based on the assumption
of positional encoding [57] and the observation in [173],
ResT [57] and CPVT [56] try to adapt the inherent positional
information of the convolution to the arbitrary size of inputs
instead of interpolating the positional encoding. Including
CvT [36], these methods replace the linear patch projection
and positional encoding with the convolution stacks. Both
methods benefit from such convolutional position embedding,
especially for small model.
Besides the “internal” fusion, many approaches focus on
Fig. 7. Overview of Swin Transformer and TSM. (a) TSM with bidirectional
“apparent” combinations according to different visual Trans- and unidirectional operation. (b) Shifted window method. (c) Two successive
former’s structures. For standard columnar structure, Xiao Transformer blocks of Swin Transformer. The regular and shifted windows
et al. [58] substituted the original patchify stem (single correspond to W-MSA and SW-MSA, respectively (from [35] and [174]).
non-overlap large kernel) with several stacked stride-2 3 × finer level features, including three operations: unfold, linear-
3 kernels. Such a convolutional stem significantly improves weights attention, and refold. Based on [43], it achieves SoTA
ViT by 1%–2% accuracy on ImageNet-1k and facilitates its results on ImageNet without external data.
stability and generalization on the downstream tasks. For
hierarchical structures, Dai et al. [39] investigated an optimal E. Hierarchical Transformer
combination of hybrid models to benefit the performance As its columnar structure produces a fixed resolution fea-
tradeoff. By comparing a series of hybrid models, they propose tures across all Transformer layers, ViT [29] sacrifices fine-
a convolution and attention network (CoAtNet) to leverage grained feature extraction and incurs substantial computational
the strength of both CNNs and Transformer. They observe costs. Followed by the hierarchical models, Tokens-to-Token
that using convolution in the early stages is more effective ViT (T2T-ViT) first introduces a paradigm of hierarchical
than transformers, and depthwise convolution can be naturally Transformer and employs an overlapping unfold operation
integrated into attention blocks for hierarchical structures. for downsampling. However, such an operation brings heavy
It has achieved the SoTA performance across multiple datasets. memory and computation costs. Therefore, pyramid vision
Transformer (PVT) [41] leverages a nonoverlapping patch
D. Local Attention-Enhanced Transformer partition to reduce the feature size. Furthermore, a spatial-
The coarse patchify process in ViT [29] neglects the local reduction attention (SRA) layer is applied in PVT to fur-
image information. In addition to adding CNNs, various local ther reduce the computational cost by learning low-resolution
attention mechanisms are proposed to dynamically attend the key–value pairs. Empirically, PVT adapts the Transformer to
neighbor elements and augment the local extraction ability. the dense prediction tasks on many benchmarks that demand
One of the representative methods is the shifted windows large inputs and fine-grained features with computational effi-
(Swin) Transformer [35]. Similar to TSM [174] [Fig. 7(a)], ciency. Moreover, both PiT [64] and CvT [36] utilize pooling
Swin utilizes a shifted window along the spatial dimension and convolution to perform token downsampling, respectively.
to model the global and boundary features. In detail, two In detail, CvT [36] improves the SRA of PVT [41] by replac-
successive windowwise attention layers can facilitate the ing the linear layer with a convolutional projection. Based on
cross-window interactions [Fig. 7(b)–(c)], similar to the recep- the convolutional bias, CvT [36] can adapt to arbitrary size
tive field expansion in CNNs. Such operation also reduces the inputs without positional encodings.
computational complexity from O(2n 2 C) to O(4M 2 nC) in
one attention layer, where n and M denote the patch length and F. Deep Transformer
the window size, respectively. Swin Transformer achieves the Empirically, increasing model’s depth always strengthens its
84.2% accuracy on ImageNet and the latest SoTA on multiple learning capacity [11]. Recent works apply a deep structure
dense prediction benchmarks (see Section IV-B). to Transformer and massive experiments are conducted to
Inspired by [175], Han et al. [59] leveraged a Transformer- investigate its scalability by analyzing cross-patch [67] and
iN-Transformer (TNT) model to aggregate both patch- and cross-layer [37], [66] similarities, and the contribution of
pixel-level representations. Each layer of TNT consists of two residual blocks [42]. In the deep Transformer, the features
successive blocks: an inner block models the pixelwise interac- from the deeper layers tend to be less representative (attention
tion within each patch and an outer block extracts the global collapse [66]), and the patches are mapped into the indistin-
information. Twins [60] employs a spatially separable self- guishable latent representations (patch oversmoothing [67]).
attention mechanism, similar to depthwise convolution [172] To address such limitations mentioned above, current methods
or windowwise TNT [59], to model the local–global represen- present the corresponding solutions from two aspects.
tation. Another separate form is ViL [61], which replaces the From the aspect of model’s structure, Touvron et al. [42]
single class token with a series of local embeddings (termed as presented efficient class attention in image Transformers
global memory). These local embeddings only perform inner (CaiT), including two stages: 1) multiple self-attention stages
attention and interaction with their corresponding 2-D spatial without class token, in each layer, a learned diagonal matrix
neighbors. VOLO [44] proposes outlook attention, which is initialized by small values is exploited to update the channel
similar to a patchwise dynamic convolution, to focus on the weights dynamically, thereby offering a certain degree of
Authorized licensed use limited to: Military Institute of Science and Technology. Downloaded on September 12,2024 at 08:52:55 UTC from IEEE Xplore. Restrictions apply.
7484 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 35, NO. 6, JUNE 2024

TABLE I
T OP -1 A CCURACY C OMPARISON OF V ISUAL T RANSFORMERS ON I MAGE N ET-1 K . “1 K O NLY ” D ENOTES T RAINING ON I MAGE N ET-1K O NLY. “21 K P RE .”
D ENOTES P RETRAINING ON I MAGE N ET-21K AND F INE -T UNING ON I MAGE N ET-1K. “D ISTILL .” D ENOTES A PPLYING D ISTILLATION T RAINING
S CHEME OF D EI T [40]. T HE C OLOR OF “L EGEND ” C ORRESPONDING TO E ACH M ODEL A LSO D ENOTES THE S AME M ODEL IN F IG . 8

freedom for channel adjustment, and 2) last few class-attention resizes and flattens the image to a lower resolution sequence.
stages with frozen patch embeddings. A later class token is The resized sequences are then input into a GPT-2 [4]
inserted to model global representations, similar to DEtection for autoregressive pixel prediction. iGPT demonstrates the
with TRansformer (DETR) [30] with an encoder–decoder effectiveness of the Transformer in the visual tasks without
structure. This explicit separation is based on the assumption any help from image-specific knowledge, but its considerable
that the class token is invalid for the gradient of patch computation cost is hard to be accepted (roughly 2500 V100-
embeddings in the forward pass. With distillation training days for pretraining). Instead of the pixel-wise generation
strategy [40], CaiT achieves a new SoTA on imagenet-1k Bao et al. [5] proposed a BERT-style visual Transformer
(86.5% top-1 accuracy) without external data. Although deep (BEiT) [70] by reconstructing the masked image in the latent
Transformer suffers from attention collapse and oversmoothing space. Precisely, a dVAE [148] first converts input patches
problems, it still largely preserves the diversity of the attention into discrete visual tokens, like BERT’s [5] dictionary. These
map between different heads. Based on this observation, tokens are then employed as latent pseudo labels for SSL.
Zhou et al. [66] proposed Deep ViT that aggregates different For the discriminative models, Chen et al. [72] investigated
head attention maps and regenerates a new one by using a lin- the effects of several fundamental components for stabilized
ear layer to increase cross-layer feature diversity. Furthermore, self-supervised ViT training. They observed that the unstable
Refiner [37] applies a linear layer to expand the dimension training process mildly affects the eventual performance and
of the attention maps (indirectly increasing the head number) extended MoCo series to MoCo v3, containing a series of
for diversity promotion. Then, a distributed local attention training strategies such as freezing projection layer. Fol-
(DLA) is employed to achieve better modeling of both the lowing DeiT [40], Caron et al. [73] further extended the
local features and the global ones, which is implemented by teacher–student recipe to self-supervised learning and propose
a headwise convolution effecting on the attention map. DINO. The core concepts of DINO can be summarized into
From the aspect of training strategy, Gong et al. [67] three points. A momentum encoder inherited SwAV [178]
presented three patch diversity losses for deep Transformer serves as a teacher model that outputs the centered pseudo
that can significantly encourage patches’ diversity and offset labels over a batch. An online encoder without the prediction
oversmoothing problem. Similar to [176], a patchwise cosine head serves as a student model to fit the teacher’s output.
loss minimizes pairwise cosine similarity among patches. A standard cross-entropy loss connects self-training with
A patchwise contrastive loss regularizes the deeper patches knowledge distillation. More interestingly, self-supervised ViT
by their corresponding one in the early layer. Inspired by can learn flourishing features for segmentation, which are
Cutmix [177], a patchwise mixing loss mixes two different normally unattainable by the supervised models.
images and forces each patch to only attend to the patches
from the same image and ignore unrelated ones. Distinct from
H. Discussion
the similar loss function of LV-ViT [43], it is motivated by
patch diversity rather than patch-based label augmentation that 1) Algorithm Evaluation and Comparative Analysis: In our
LV-ViT [43] focuses on. taxonomy, all the existing supervised models are grouped into
six categories. Table I summarizes the performances of these
G. Transformers With Self-Supervised Learning existing visual Transformers on ImageNet-1k benchmarks.
Following the grateful success of self-supervised learning To evaluate them objectively and intuitively, we use the follow-
(SSL) in the NLP field [5], recent works also attempt to ing three figures to illustrate their performances on ImageNet-
design various self-supervised learning schemes for the visual 1k under different configurations. Fig. 8(a) summarizes the
Transformers in both generative and discriminative ways. accuracy of each model under 2242 inputs size. Fig. 8(b)
For the generative models, Chen et al. [68] proposed an takes the FLOPs as the horizontal axis, which focuses on
image GPT (iGPT) for self-supervised visual learning. Dif- their performances under higher resolution. Fig. 8(c) focuses
ferent from the patch embedding of ViT [29], iGPT directly on the pretrained models with external datasets. From these
Authorized licensed use limited to: Military Institute of Science and Technology. Downloaded on September 12,2024 at 08:52:55 UTC from IEEE Xplore. Restrictions apply.
LIU et al.: SURVEY OF VISUAL TRANSFORMERS 7485

Fig. 8. Comparisons of recent visual Transformers on ImageNet-1k benchmark, including [29], [36], [37], [41], [42], [43], [50], [51], [52], [53], [54],
[58], [62], [63], [65], [180] (best viewed in color). (a) Bubble plot of the mentioned models with 2242 resolution input, the size of cycle denotes GFLOPs.
(b) Comparison on high-resolution inputs, the square indicates 4482 input resolution. (c) Accuracy plot of some pretrained models on ImageNet-21k.

comparison results, we briefly summarize several performance IV. T RANSFORMER FOR D ETECTION
improvements on efficiency and scalability as follows. In this section, we review visual Transformers for object
1) Compared with the most structure-improved methods, detection, which can be grouped into two folds: Transformer
the basic training strategies, such as DeiT [40] and LV- as the neck and Transformer as the backbone. For the neck
ViT [43], are more universal for various models. detectors, we mainly focus on a new representation speci-
2) The locality is indispensable for the Transformer, which fied to the Transformer structure, called object query, that
is reflected by the dominant of VOLO [44] and a set of learned parameters aggregate instance features from
Swin [35] on various tasks. input images. The recent variants try to solve an optimal
3) The convolutional patchify stem (ViTc [58]) and early fusion paradigm in terms of either convergence acceleration
convolutional stage (CoAtNet [39]) can significantly or performance improvement. Besides these neck designs,
boost the accuracy of the Transformers, especially for a proportion of backbone detectors also take specific strategies
large models. We speculate that the reason is because into consideration. Finally, we evaluate them and analyze some
they introduce a more stringent high-level feature than potential methods for these detectors.
the sketchy patch projection in ViT [29].
4) The deep Transformer, such as Refined-ViT [37] and A. Transformer Neck
CaiT [42], has great potential. As the model size grows We first review DETR [30] and Pix2seq [75], two orig-
quadratically with the channel dimension, the tradeoff in inal Transformer detectors based on different paradigms.
deep Transformer is considered for further investigation. Subsequently, we mainly focus on the DETR-based variants,
5) CeiT [54] and CvT [36] show significant advan- improving Transformer detectors in accuracy and convergence
tages in training a small or medium model (0– from five aspects: sparse attention, spatial prior, structural
40M), which suggests that such kinds of hybrid atten- redesign, assignment optimization, and pretraining model.
tion blocks for lightweight models are worth further 1) Original Detectors: DETR [30] is the first end-to-end
exploring. Transformer detector that eliminates hand-designed represen-
2) Brief Discussion on Alternatives: During the develop- tations [180], [181], [182], [183] and nonmaximum sup-
ment of the visual Transformers, the most common question pression (NMS) postprocessing, which redefines the object
is whether the visual Transformers can replace the tradi- detection as a set prediction problem. As shown in Fig. 9,
tional convolution completely. By reviewing the history of a small set of learnable positional encodings, called object
the performance improvements in the last year, there is no queries, are parallelly fed into the Transformer decoder to
sign of relative inferiority here. The visual Transformers have extract the instance information from the image features.
returned from a pure structure to a hybrid form, and the Then, these object queries are independently predicted to be
global information has gradually returned to a mixed stage a detection result. Instead of the vanilla k-class classification,
with the locality bias. Although the visual Transformers can a special class, no object label (∅) is added for k + 1 class
be equivalent to CNN or even has a better modeling capability, classification. During the training process, a bipartite matching
such a simple and effective convolution operation is enough to strategy is applied between the predicted objects and the
process the locality and the semantic features in the shallow ground truth to identify one-to-one label assignment, hence
layer. In the future, the spirit of combining both of them shall removing the redundant predictions at the inference time
drive more breakthroughs for image classification. without NMS. In backpropagation, a Hungarian loss includes
Authorized licensed use limited to: Military Institute of Science and Technology. Downloaded on September 12,2024 at 08:52:55 UTC from IEEE Xplore. Restrictions apply.
7486 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 35, NO. 6, JUNE 2024

the contextual background features into a smaller one. Such


fine–coarse tokens are then fed into DETR to generate the
detection results. Instead of the input sparsification, Sparse
DETR [79] applied a hysteretic scoring network (correspond-
ing to the poll operation in [78]) to update the expected tokens
selectively within the transformer encoder, where the top-
Fig. 9. Overview of DETR. (Modified from [30].)
k selected tokens are supervised by pseudo labels from the
a log-likelihood loss for all classification results and a box loss binarized decoder cross-attention map with BCE loss.
for all the matched pairs. More details about the Hungarian 3) Transformer With Spatial Prior: Unlike anchor or other
matching strategy are available in the Supplementary Material. representations directly generated by content and geometry
Overall, DETR provides a new paradigm for end-to-end features [180], [186], object queries implicitly model the
object detection. The object query gradually learns an instance spatial information with random initialization, which is weakly
representation during the interaction with image features. The related to the bounding box. The mainstream for spatial
bipartite matching allows a direct set prediction and easily prior applications is the one-stage detector with empirical
joints to the one-to-one label assignment, hence eliminating spatial information and the two-stage detector with geometric
traditional postprocessing. DETR achieves competitive per- coordinates initialization or region-of-interest (RoI) features.
formance on the COCO benchmark but suffers from slow In one-stage methods, Gao et al. [80] suggested spatially
convergence as well as poor performance on small objects. modulated cross attention (SMCA) to estimate the object
Another pioneered work is Pix2seq [75], treating generic queries’ spatial prior explicitly. Specifically, a Gaussian-like
object detection as a language modeling task. Given an image weight map generated by object queries is multiplied with
input, a vanilla sequential Transformer is executed to extract the corresponding cross-attention map to augment the RoI
features and generate a series of object descriptions (i.e., class for convergence acceleration. Furthermore, both intrascale
labels and bounding boxes) autoregressively. Such a simplified and multiscale self-attention layers are utilized in the Trans-
but more elaborate image caption method is derived under former encoder for multiscale feature aggregation, and the
the assumption that if a model learns about both location and scale-selection weights generated from object queries are
label of an object, it can be taught to produce a description applied for scale-query adaptation. Meng et al. [81] extracted
with specified sequence [75]. Compared with DETR, Pix2seq the spatial attention map from the cross-attention formulation
attains a better result on small objects. How to combine both and observe that the extreme region of such attention map
kinds of concepts is worthy of further consideration. has larger deviations at the early training. Consequently,
2) Transformer With Sparse Attention: In DETR, the dense they proposed conditional DETR where a new spatial prior
interaction across both object queries and image features mapped from reference points is adopted in the cross-attention
costs unbearable resources and slows down the convergence layer, thereby attending to extreme regions of each object
of DETR. Therefore, the most recent efforts aim to design explicitly. The reference point is predicted by the object query
data-dependent sparse attention to address these issues. or serves as a learned parameter replacing the object query.
Following [184], Zhu et al. [76] developed deformable Following [81], Anchor DETR [82] suggests to explicitly learn
DETR to ameliorate both training convergence and detection the 2-D anchor points ([cx, cy]) and different anchor patterns
performance significantly via multiscale deformable attention. instead of the high-dimensional spatial embedding. Similar
Compared with the original DETR, the deformable attention to [180], the pattern embeddings are assigned to the meshed
module only samples a small set of key (reference) points for anchor points so that they can detect different scale objects
full feature aggregation. Such sparse attention can be easily anywhere. DAB-DETR [83] then extends the 2-D concept
stretched to multiscale feature fusion without the help of to a 4-D anchor box ([cx, cy, w, h]) to explicitly provide
FPN [185]. Moreover, an iterative bounding box refinement proposal bounding box information during the cross attention.
and a two-stage prediction strategy (Section IV-A3) are devel- With the auxiliary decoder loss of the coordinates offset [76],
oped to further enhance the detection accuracy. Empirically, such a 4-D box can be dynamically refined layer-by-layer in
deformable DETR achieves a higher accuracy (especially for the decoder. However, the same reference boxes/points may
small objects) with 10× less training epochs and reduces the severely deteriorate queries’ saliency and confuses the detector
computing complexity to O(2Nq C 2 +min(HWC2 , NqKC 2
)) with due to the indiscriminative spatial prior. By assigning query-
1.6× faster inference speed. Please see the Supplementary specific reference points to object queries, SAP-DETR [84]
Material for more details of deformable attention mechanism. only predicts the distance from each side of the bounding
By visualizing the encoder attention maps of DETR [30], box to these points. Such a query-specific prior discrepancies
Zheng et al. [78] observed that the semantically similar queries’ saliency paves the way for fast model convergence.
and spatially close elements always have similar attention In two-stage methods, Zhe et al. [76] empowered the Top-
maps. As such, they presented an adaptive clustering Trans- K region proposals from encoder features to initialize the
former (ACT), leveraging a multiround sensitivity hashing to decoder embedding instead of the learned parameters. Efficient
dynamically cluster object queries into different prototypes. DETR [85] also adopts a similar initialization operation for
The attention map of prototypes is then broadcast to their dense proposals and refines them in the decoder to get sparse
corresponding queries. Unlike the redesign on sparse attention, prediction by using a shared detection head with the dense
Wang et al. [78] introduced a poll and pool (PnP) sampling parts. More interestingly, it is observed that small stacking
model to extract the fine foreground features and condense decoder layers bring slight performance improvement, but
Authorized licensed use limited to: Military Institute of Science and Technology. Downloaded on September 12,2024 at 08:52:55 UTC from IEEE Xplore. Restrictions apply.
LIU et al.: SURVEY OF VISUAL TRANSFORMERS 7487

more stacks yield even worse results. Dynamic DETR [86] preserve the feature discrimination so as to avoid overbias
regards the object prediction in a coarse-to-fine process. Dif- toward the localization in pretraining. FP-DETR [90] devotes
ferent from the previous RoI-based initialization, according to narrowing the gap between upstream and downstream
to queries’ reference boxes, a query-based weight is used to tasks. During the pretraining, a fully encoder-only DETR
replace cross-attention layers and directly affect their corre- such as YOLOS [88] views query positional embeddings as
sponding coarse RoI features for query refinement. a visual prompt to enhance target area attention and object
4) Transformer With Redesigned Structure: Besides the discrimination. A task adapter implemented by self-attention
modifications focusing on the cross attention, some works is used to enhance object interaction during fine-tuning.
redesign an encoder-only structure to avoid the problem
of the decoder directly. TSP [87] inherits the idea of set B. Transformer Backbone
prediction [30] and dismisses the decoder and the object We have reviewed numerous Transformer-based backbones
query to accelerate convergence. Such encoder-only DETR for image classification [29], [40] in Section III. These back-
reuses previous representations [180], [186] and generates a bones can be easily incorporated into various frameworks [30],
set of fixed-size features of interests (FoIs) [186] or pro- [182], [187] to perform dense prediction tasks. For example,
posals [180] that are subsequently fed into the Transformer the hierarchical structure, such as PVT [41], [65], constructs
encoder. In addition, a matching distillation is applied to the visual Transformer as a high-to-low resolution process to
resolve the early instability of the bipartite matching during learn multiscale features. The locally enhanced structure con-
training process. Fang et al. [88] presented an encoder-only structs the backbone as a local-to-global combination, which
decoder YOLOS, a pure sequence-to-sequence Transformer to can efficiently extract both short- and long-range visual depen-
unify the classification and detection tasks. It inherits ViT’s dencies and avoid quadratic computational overhead, such as
structure and replaces the single class token with fixed-size Swin-Transformer [35], ViL [61], and Focal Transformer [62].
learned detection tokens. These object tokens are first pre- The Supplementary Material includes more detailed compar-
trained on the transfer ability for the classification tasks and isons of these models for the dense prediction tasks. In addi-
then fine-tuned on the detection benchmark. tion to the generic Transformer backbone, the feature pyramid
5) Transformer With Bipartite Matched Optimization: In Transformer (FPT) [93] combines the characteristics across
DETR [30], the bipartite matching strategy forces the predic- both the spaces and the scales, by using self-attention, top-
tion results to fulfill one-to-one label assignment during the down cross attention, and bottom-up cross-channel attention.
training scheme. Such a training strategy simplifies detection Following [190], HRFormer [94] introduces the advantages of
pipeline and directly builds up an end-to-end system without multiresolution to the Transformer along with nonoverlapping
the help of NMS. To deeply understand the efficacy of the local self-attention. HRViT [95] redesigns a heterogeneous
end-to-end detector, Sun et al. [189] devoted to exploring a branch and a cross-shaped attention block to further optimize
theoretical view of one-to-one prediction. Based on multiple the tradeoff between efficiency and accuracy.
ablation and theoretical analyses, they concluded that the
classification cost for one-to-one matching strategy serves C. Discussion
as the key component for significantly avoiding duplicate We summarize fivefold of the Transformer neck detectors
predictions. Even so, DETR is suffering from multiple prob- in Table II, and more details of Transformer backbone for
lems caused by bipartite matching. Li et al. [91] exploited dense prediction tasks are referred to in Table SI of the
a denoising DETR (DN-DETR) to mitigate the instability Supplementary Material. The majority of Transformer neck
of bipartite matching. Concretely, a series of objects with promotions concentrate on the following five aspects.
slight perturbation is supposed to denoise to their original 1) The sparse attention model and the scoring network are
coordinates. The main ingredients of the denoising part are an proposed to address the problem of redundant feature
attention mask that prevents information leakage between the interaction. These methods can significantly alleviate
matching and noised parts, and a specified label embedding to computational costs and accelerate model convergence.
indicate the perturbation. Recently, Zhang et al. [92] presented 2) The explicit spatial prior, which is decomposed into the
an improved denoising training model called DINO (2022) by selected feature initialization and the positional informa-
incorporating a contrastive loss for the perturbation groups. tion extracted by learned parameters, would enable the
Based on DN-DETR [91], DINO attaches a “no object” class detector to predict the results precisely.
for the negative example if the distance is far enough from 3) Multiscale features and iterative box refinement are
the perturbation, which avoids redundant prediction due to the benefit DETR for small object detection.
confusion of multiple reference points near an object. As a 4) The improved bipartite matching strategy is beneficial to
result, DINO attains the current SoTA on the COCO dataset. avoid redundant prediction, add positive gradients, and
6) Transformer Detector With Pretraining: Inspired by the perform end-to-end object detection.
pretrained linguistic Transformer [3], [5], Dai et al. [89] 5) The encoder-only structure reduces the overall Trans-
devised an unsupervised pretraining DETR (UP-DETR) to former stack layers but increases the FLOPs excessively,
assist the convergence for supervised training. The objective while the encoder–decoder structure is a good tradeoff
of pretraining is to localize the random cropped patches between FLOPs and parameters, but the deeper decoder
from a given image. Specifically, each patch is assigned to layers may cause the slow convergence.
a set of queries and predicted independently via the attention Existing Transformer backbones mostly focus on the clas-
mask. An auxiliary reconstruction loss forces the detector to sification task, but a few works are developed for the
Authorized licensed use limited to: Military Institute of Science and Technology. Downloaded on September 12,2024 at 08:52:55 UTC from IEEE Xplore. Restrictions apply.
7488 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 35, NO. 6, JUNE 2024

TABLE II
C OMPARISON B ETWEEN T RANSFORMER N ECKS AND R EPRESENTATIVE
CNNs W ITH R ES N ET-50 BACKBONE ON COCO 2017 VAL S ET

Fig. 10. Query-based frameworks for segmentation tasks. (a) Transfer


learning for fine-tuning mask head. (b) Multitask learning for two independent
tasks. (c) Cascade learning predict fine-grained masks based on box results.
(d) Query embeddings are independently supervised by mask embeddings
and boxes. (e) Box-free model directly predicts masks without box branch
and views segmentation task as a mask prediction problem.

MLA decoder [96] or a hybrid model of U-Net [192] and


Transformer. Due to the strong global modeling capability of
Transformer encoder, Segformer [98] designs a lightweight
decoder with only four MLP layers. Segformer shows superior
performance as well as stronger robustness than CNNs when
tested with multiple corrupted types of images.

B. Query-Based Transformer
Query embeddings are a set of scratch semantic/instance
representations gradually learning from the image inputs.
Unlike patch embeddings, queries can more “fairly” integrate
the information from features and naturally join with the set
prediction loss [30] for postprocessing elimination. The exist-
ing query-based models can be grouped into two categories.
One (object query) is driven by both detection and segmen-
tation tasks, simultaneously. The other (mask embedding) is
dense prediction tasks. In the future, we anticipate that only supervened by segmentation task.
the Transformer backbone would cooperate with the deep 1) Object Queries: There are three training manners for
high-resolution network to solve dense prediction tasks. object query-based methods [Fig. 10(a)–(c)]. With the success
V. T RANSFORMER FOR S EGMENTATION of DETR [30] for the object detection tasks, the authors
Patch- and query-based Transformers are the two major extend it to panoptic segmentation (hence termed panoptic
ways for segmentation. The latter can be further grouped into DETR [30]) by training a mask head based on the pre-
object query and mask embedding methods. trained object queries [Fig. 10(a)]. In detail, a cross-attention
block between the object queries and the encoded features is
A. Patch-Based Transformer applied to generate an attention map for each object. After an
Because of the receptive field expansion strategy [191], upsampling FPN-style CNN, a spatial argmax operation fuses
CNNs require multiple decoder stacks to map the high-level the resulting binary masks to a nonoverlapping prediction.
features into the original spatial resolution. Instead, patch- Instead of using a multistage serial training process, Cell-
based Transformer can easily incorporate with a simple DETR and VisTR develop a parallel model for end-to-end
decoder for segmented mask prediction because of its global instance segmentation [Fig. 10(b)]. Based on DETR [30], Cell-
modeling capability and resolution invariance. Zheng et al. DETR leverages a cross-attention block to extract instancewise
extended ViT [29] for semantic segmentation tasks and pre- features from the box branch and fuses the previous backbone
sented SEgmentation TRansformer (SETR) [96] by employing features to augment the CNN decoder for accurate instance
three fashions of the decoder to perform per-pixel classifi- mask segmentation of biological cells. Another extension
cation: naive upsampling, progressive upsampling, and mul- is VisTR [100] that directly formulates the video instance
tilevel feature aggregation (MLA). SETR demonstrates the segmentation (VIS) task as parallel sequence prediction. Apart
feasibility of the visual Transformer for the segmentation from the similar structure as Cell-DETR [99], the key of VisTR
tasks, but it also brings unacceptably extra GPU costs. Tran- is a bipartite matching loss at the instance sequence level
sUNet [97] is the first for medical image segmentation. For- to maintain the order of outputs, so as to adapt DETR [30]
mally, it can be viewed as either a variant of SETR with to VIS for direct one-to-one predictions. Unlike prior works
Authorized licensed use limited to: Military Institute of Science and Technology. Downloaded on September 12,2024 at 08:52:55 UTC from IEEE Xplore. Restrictions apply.
LIU et al.: SURVEY OF VISUAL TRANSFORMERS 7489

TABLE III
C OMPARISON B ETWEEN CNN- AND T RANSFORMER -BASED M ODEL
ON ADE20K AND COCO FOR D IFFERENT S EGMENTATION TASKS .
“+MS” D ENOTES THE M ULTISCALE I NPUTES

Fig. 11. Illustration of Maskformer (from [32]).

and explored three reversible compression encoding methods.


In detail, a set of unified queries is applied to perform multiple
representation learning parallelly: classification, box regres-
sion, and mask encoding. Based on the original DETR [191],
SOLQ adds a mask branch to produce mask embedding
loss. Both ISTR and SOLQ obtain comparable results and
outperform previous methods even with approximation-based
suboptimal embeddings. However, there exists a huge gap
between APbox and APseg (Table III).
From the box-free perspective, Wang et al. [31] pio-
neered a new paradigm Max-DeepLab that directly predicts
panoptic masks from the query without the help of the
box branch. Specifically, it forces the query to predict the
corresponding mask via a PQ-style bipartite matching loss
and a dual-path Transformer structure. Given a set of mask
embeddings and an image input, Max-DeepLab processes
them separately in both Transformer and CNN path and
then generates a binary mask and a class for each query,
respectively. Max-DeepLab achieves a new SoTA with 51.3%
PQ on the COCO test-dev set but leads to heavy computa-
tional costs due to its dual-path high-resolution processing.
Segmenter [104] views the semantic segmentation task as
a sequence-to-sequence problem. In detail, a set of mask
embeddings that represent different semantic classes are fed
into the Transformer encoder together with image patches, and
that treat detection and mask generation branches separately, then, a set of labeled masks are predicted for each patch via
QueryInst [101] builds a hybrid cascaded network [Fig. 10(c)], an argmax operation.
where the previous box outputs together with the shared Unlike the conventional pixel-wise segmentation, Cheng
queries serve as the inputs of the mask head for accurate mask et al. [32] reformulated the semantic segmentation task as
segmentation. Notably, QueryInst leverages the shared queries a mask prediction problem and enabled this output format
to keep the instance correspondences across multistage so to the query-based Transformer, which is called Maskformer.
that mitigating the problem of inconsistent objects in previous Different from Max-DeepLab [31], as shown in Fig. 11,
nonquery-based methods [188], [193]. QueryInst obtains the Maskformer leverages a simple Transformer decoder without
latest SoTA results on the COCO datasets. redundant connection as well as a sigmoid activation for
2) Mask Embeddings: The other framework makes efforts overlapping binary mask selection. It not only outperforms the
to use queries to predict mask directly, and we refer to current per-pixel classification SoTA on large-class semantic
this learned mask-based query as mask embeddings. Unlike segmentation datasets but also generalizes the panoptic seg-
object queries, mask embeddings are only supervised by the mentation task with a new SoTA result (Table III).
segmentation tasks. As shown in Fig. 10(d), two disjoint sets of
queries are employed parallelly for different tasks, and the box C. Discussion
learning is viewed as an auxiliary loss for further enhancement. We summarize the aforementioned Transformers according
For semantic and box-free instance segmentation, a series of to three different tasks. Table III(a) focuses on ADE20K (170
query-based Transformers predict the mask directly without classes). It can be shown that when trained on datasets with
the help of the box branch [Fig. 10(e)]. large numbers of classes, the segmentation performance of
From the auxiliary training perspective, the core is how to visual Transformers is improved significantly. Table III(b)
enable 1-D sequence outputs to be supervised by 2-D mask focuses on the COCO test dataset for instance segmenta-
labels directly. To this end, ISTR [102] empowered a mask tion. Clearly, the visual Transformers with mask embeddings
precoding method to encode the ground-truth mask into a surpass most prevailing models for both segmentation and
sparse mask vector for instance segmentation. Similarly, Dong detection tasks. However, there is a huge performance gap
et al. [103] proposed a more straightforward pipeline SOLQ between APbox and APseg . With the cascaded framework,
Authorized licensed use limited to: Military Institute of Science and Technology. Downloaded on September 12,2024 at 08:52:55 UTC from IEEE Xplore. Restrictions apply.
7490 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 35, NO. 6, JUNE 2024

QueryInst [101] attains the SoTA among various Trans- global interactions separately within each layer. Different from
former models. It is worthy of further study for combin- previous local–global forms, a coordinate refinement operation
ing the visual Transformers with the hybrid task cascade after the local attention is adopted to update the centroid point
structures. Table III(c) focuses on panoptic segmentation. instead of the surface one. Also, a local–global cross-attention
Max-DeepLab [31] is general to solve both foreground and model fuses the high-resolution features, followed by global
background in the panoptic segmentation task via a mask attention. Fan et al. [110] returned to the single-stride sparse
prediction format. Maskformer [32] successfully employs this transformer (SST) to address the problem for small-scale
format and unifies both semantic and instance segmentation detection. Similar to Swin [35], a shifted group in continuous
tasks into a single model. It is concluded that the visual Transformer block is adopted to attend to each group of tokens
Transformers could unify multiple segmentation tasks into one separately, which further mitigates the computation prob-
box-free framework with mask prediction. lem. In voxel-based methods, voxel transformer (VoTr) [111]
separately operate on the empty and nonempty voxel positions
VI. T RANSFORMER FOR 3-D V ISUAL R ECOGNITION effectively via local attention and dilated attention blocks.
With the rapid development of 3-D acquisition technol- VoxSeT [112] further decomposes the self-attention layer into
ogy, stereo/monocular images and light detection and ranging two cross-attention layers, and a group of latent codes link
(LiDAR) point clouds become the popular sensory data for 3-D them to preserve global features in a hidden space.
recognition. Discriminated from the RGB(D) data, point cloud Following the mentioned methods in Section III-G, a series
representation pays more attention to distance, geometry, and of self-supervised Transformers is also extended to 3-D
shape information. Notably, such a geometric feature is signif- spaces [113], [114], [115]. Specifically, Point-BERT [113] and
icantly suitable for Transformer on account of its characteristic Point-MAE [114] directly transfer the previous works [70],
on sparseness, disorder, and irregularity. Following the success [71] to point clouds, while MaskPoint [115] changes the
of 2-D visual Transformers, substantial approaches are devel- generative training scheme by using a contrastive decoder as
oped for 3-D visual analysis. This section exhibits a compact similar as DINO (2022) [92] for binary noise/part classifica-
review for 3-D visual Transformers following representation tion. Based on large experiments, we can conclude that such
learning, cognition mapping, and specific processing. generative/contrastive self-training methods empower visual
Transformers to be valid in either images or points.
A. Representation Learning
Compared with conventional hand-designed networks, B. Cognition Mapping
visual Transformer is more appropriate for learning semantic Given rich representation features, how to directly map
representations from point clouds, in which such irregular the instance/semantic cognition to the target outputs also
and permutation-invariant nature can be transformed into a arouse considerable interests. Different from 2-D images, the
series of parallel embeddings with positional information. objects in 3-D scenes are independent and can be intuitively
Point Transformer [105] and PCT [106] first demonstrate represented by a series of discrete surface points. To bridge the
the efficacy of the visual Transformer in 3-D scenes. The gap, some existing methods transfer domain knowledge into 2-
former merges a hierarchical Transformer [105] with the D prevailing models. 3DETR [117] first extends Transformer
downsampling strategy [199] and extends their previous vector detector to 3-D object detection via farthest point sampling and
attention block [25] to 3-D point clouds. The latter first Fourier positional embeddings for object query initialization.
aggregates neighbor points and then processes such neighbor Group-Free 3-D DETR [117] applies a more specified and
embeddings on a global offset Transformer where a knowledge stronger structure than [116]. In detail, it directly selects a set
transfer from graph convolution network (GCN) is applied of candidate sample points from the extracted point clouds
for noise mitigation. Notably, the positional encoding, a sig- as the object queries and updates them in the decoder layer-
nificant operation of the visual Transformer, is diminished by-layer iteratively. Moreover, the K -closed inside points are
in both approaches because of points’ inherent coordinate assigned positive and supervised by a binary objectiveness
information. PCT directly processes on the coordinates with- loss in both sampler and decoder heads. Sheng et al. [118]
out positional encodings, while Point Transformer adds a proposed a typical two-stage method that leverages a chan-
learnable relative positional encoding for further enhancement. nelwise transformer 3-D detector (CT3D) to simultaneously
Lu et al. [107] leveraged a local–global aggregation module aggregate proposal-aware embedding and channelwise context
3DCTN to achieve local enhancement and cost efficiency. information for the point features within each proposal.
Given the multistride downsampling groups, an explicit graph For monocular sensors, both MonoDTR [119] and Mon-
convolution with max-pooling operation is used to aggregate oDETR [120] utilize an auxiliary depth supervision to estimate
the local information within each group. The resulting group pseudo depth positional encodings (DPEs) during the training
embeddings are concatenated and fed into the improved trans- process. In MonoDETR [119], DPEs are first attached with
former [105], [106] for global aggregation. Park et al. [108] the image features for Transformer encoder and then serve
presented Fast Point Transformer to optimize the model as the inputs of the DETR-like [30] decoder to initialize the
efficiency by using voxel-hashing neighbor search, voxel- object queries. In MonoDETR [120], both visual features and
bridged relative positional encoding, and similarity-based local DPEs are first extracted by two different encoders parallelly
attention. and then interact with object queries via two successive cross-
For dense prediction, Pan et al. [109] proposed a customized attention layers. Based on foreground depth supervision and
point-based backbone Pointformer for attending the local and narrow categorization interval, MonoDETR obtains the SoTA
Authorized licensed use limited to: Military Institute of Science and Technology. Downloaded on September 12,2024 at 08:52:55 UTC from IEEE Xplore. Restrictions apply.
LIU et al.: SURVEY OF VISUAL TRANSFORMERS 7491

result on the KITTI benchmark. DETR3D [121] introduces enables different modalities to interact beyond the local
a multicamera 3-D object detection paradigm where both restriction.
2-D images and 3-D positions are associated by the camera For the local interaction, MVT [126] spatially concatenates
transformation matrices and a set of 3-D object queries. the patch embeddings from different views and strengthens
TransFusion [122] further takes the advantages of both LiDAR their interaction via a modal-agnostic Transformer encoder.
points and RGB images by interacting with object queries Considering the redundant features from different modalities,
through two Transformer decoder layers successively. More MVDeTr [127] projects each view of features onto the ground
multisensory data fusion is introduced in Section VII-A. plane and extends the multiscale deformable attention [76]
to a multiview design. TransFuser [128], COTR [129], and
C. Specific Processing mmFormer [133] deploy a hybrid model. TransFuser models
Limited by sensor resolution and view angle, point clouds image and LiDAR inputs separately by using two different
are afflicted with incompletion, noise, and sparsity problems convolution backbones and links the intermediate feature maps
in real-world scenes. To this end, PoinTr [123] represents via a Transformer encoder together with a residual connection.
the original point cloud as a set of local point proxies and COTR shares the CNN backbone for each of view images
leverages a geometry-aware encoder–decoder Transformer to and inputs the resulted features into a Transformer encoder
migrate the center point proxies toward incomplete points block with a spatially expanded mesh-grid positional encoding.
direction. SnowflakeNet [124] formulates the process of com- mmFormer exploits a modality-specific Transformer encoder
pleting point clouds as a snowflake-like growth, which pro- for each sequence of MRI image and a modality-correlated
gressively generates child points from their parent points Transformer encoder for multimodal modeling.
implemented by a pointwise splitting deconvolution strat- For the global interaction, Wang et al. [130] leveraged a
egy. A skip-Transformer for adjacent layers further refines shared backbone to extract the features for different views.
the spatial-context features between parents and children to Instead of pixelwise/patchwise concatenation in COTR [129],
enhance their connection regions. Choe et al. [125] unified var- the extracted viewwise global features are spatially concate-
ious generation tasks (e.g., denoising, completing, and super- nated to perform view fusion within a Transformer. Consid-
resolution) into a point cloud reconstruction problem, hence ering the angular and position discrepancy across different
termed PointRecon. Based on voxel hashing, it covers the camera views, TransformerFusion [132] first converts each
absolute-scale local geometry and utilizes a PointTransformer- view feature into an embedding vector with the intrinsics and
like [105] structure to aggregate each voxel (the query) with extrinsics of their camera views. These embeddings are then
its neighbors (the value–key pair) for fine-grained conversion fed into a global Transformer whose attention weights are used
from the discrete voxel to a group of point sets. Moreover, for a frame selection so as to compute efficiently. To unify
an amplified positional encoding is adapted to the voxel local the multisensory data in 3-D detection, FUTR3D [131]
attention scheme, implemented by using a negative exponential projects the object queries in the DETR-like decoder into
function with L1-loss as weights for vanilla positional encod- a set of 3-D reference points. These points together with
ings. Notably, compared with masked generative self-training, their related features are subsequently sampled from different
the completion task directly generates a set of complete points modalities and spatially concatenated to update the object
without the explicit spatial prior of incomplete points. queries.
2) Transfer Fusion: Unlike the interactive fusion imple-
VII. T RANSFORMER FOR M ULTISENSORY DATA S TREAM mented by the Transformer encoder with self-attention, the
In the real world, multiple sensors are always used com- other fusing form is more like a transfer learning from the
plementarily rather than a single one. To this end, recent source data to the target one via a cross-attention mechanism.
works start to explore different fusing methods to coop- For instance, Tulder et al. [134] inserted two cooperative
erate multisensory data stream effectively. Compared with cross-attention Transformers into the intermediate backbone
the typical CNNs, Transformer is naturally appropriate for features for bridging the unregistered multiview medical
multistream data fusion because of its nonspecific embedding images. Instead of the pixelwise attention form, token-pixel
and dynamically interactive attention mechanism. This section cross attention is further developed to alleviate arduous com-
details these methods according to their data stream sources: putation. Long et al. [135] proposed an epipolar spatiotem-
homologous stream and heterologous stream. poral Transformer for multiview image depth estimation.
Given a single video containing a series of static multiview
A. Homologous Stream frames, the neighbor frames are first concatenated and the
Homologous stream is a set of multisensory data with epipolar is then warped into the center camera space. The
similar inherent characteristics, such as multiview, multidi- resulted frame volume finally serves as the source data to
mension, and multimodality visual stream data. They can be perform fusion with the center frame through a cross-attention
categorized into two groups: interactive fusion and transfer block. With the spatially aligned data streams, DRT [136]
fusion, according to their fusion mechanism. first explicitly models the relation map between different
1) Interactive Fusion: The classical fusion pattern of CNN data streams by using a convolution layer. The resulting
adopts a channel concatenation operation. However, the same maps are subsequently fed into a dual-path cross attention
positions from different modalities might be anisotropic, to build both local and global relationships parallelly, and
which is unsuitable for the translation-invariant bias of CNN. thereby, it can collect more regional information for glaucoma
Instead, the spatial concatenation operation of Transformer diagnosis.
Authorized licensed use limited to: Military Institute of Science and Technology. Downloaded on September 12,2024 at 08:52:55 UTC from IEEE Xplore. Restrictions apply.
7492 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 35, NO. 6, JUNE 2024

B. Heterologous Stream

Visual Transformers also perform excellently on heterol-


ogous data fusion, especially in visual-linguistic representa-
tion learning. Although different tasks may adopt different
training schemes, such as supervised/self-supervised learn-
ing or compact/large-scale datasets, we categorize them into
two representative groups only according to their cognitive
forms: visual-linguistic pretraining including vision-language Fig. 12. Overview of CILP (from [147]).
pretraining (VLP) and contrastive language-image pretraining However, these methods rely heavily on the visual extrac-
(CLIP), and visual grounding such as phrase grounding (PG), tor or predefined visual vocabulary, leading to a bottle-
referring expression comprehension (REC). For more details, neck of the VLP expressive upper bound. To address this
see Table SII in the Supplementary Material. issue, VinVL [146] develops an improved object detector for
1) Visual-Linguistic Pretraining: Due to limited anno- VLP pretraining on multiple large-scale dataset combinations.
tated data, early VLP methods commonly rely on off-the- Instead of the object-centric RoI features, ViLT [145] initial-
shelf object detector [200] and text encoder [5] to extract izes the interaction Transformer weights from a pretrained
data-specific features for joint distribution learning. Given ViT and adopts whole word masking and image augmentation
an image–text pair, an object detector pretrained on visual strategy for VLP pretraining. UniT [150] follows the archi-
genome (VG) [201] first extracts a set of object-centric RoI tecture of DETR and applies a wide range of task for uni-
features from the image. The RoI features serving as visual fied Transformer pretraining via different task-specific output
tokens are then merged with text embeddings for predefined heads simultaneously. SimVLM [151] adopts [39] to obtain
task pretraining. Basically, these methods are grouped into image features and designs a prefix language modeling as a
dual- and single-stream fusion. pretraining objective to generalize zero-shot image captioning.
The dual-stream methods, including ViLBERT [138] and Besides the conventional pretraining scheme with multitask
LXMERT [139], apply a vision-language cross-attention supervision, another recent line has been developed for con-
layer between two data-specific frameworks for multimodal trastive learning. The most representative work is CLIP [147].
transferring fusion. Concretely, ViLBERT [138] is pre- Based on the 400M Internet image–text pairs datasets, both
trained through masked language modeling (MLM), masked image and text encoders are jointly trained by a contrastive
region classification (MRC), and image text alignment (ITA) loss for ITA (Fig. 12). Notably, CLIP enables the pretrained
on conceptual captions (CCs) [202] with 3M image–text model with a linear classifier to zero-shot transfer to the
pairs. LXMERT [139] extends the pretraining datasets to a most visual downstream tasks efficiently by embedding the
large-scale combination and further indicates that the pre- whole semantics of the objective datasets classes. Based on
trained task-specific (BERT [5]) weights initialization is harm- extensive experiments on over 30 existing CV tasks (e.g.,
ful to the pretraining of multisensory data fusion. classification and action recognition), CLIP attains superior
VideoBERT [137] is the first single-stream VLP method, results to classical supervised methods, demonstrating that
which clusters latent space features of each video frame as such task-agnostic pretraining is also generalized well in the
visual tokens and organizes the corresponding text embed- CV field. ALIGN [149] further expands a noisy dataset of
dings via a captioning API. These features are then fed into over one billion image alt–text pairs rather than the elaborate
a cross-modality self-attention layer for joint representation filtering or postprocessing steps in CLIP [147].
learning. Following [137], VisualBERT [140] extends such Combining masked modeling and contrastive learning pre-
a single-stream framework for various image–text tasks and training strategy, Data2Vec [152] proposes a self-distilled
adds a segment embedding to distinguish between textual network treating the masked features as a type of data aug-
and visual tokens. VL-BERT [141] suggests that unmatched mentation, whose structure is analogous to DINO (2021) [73].
image–caption pairs over the ITA pretraining may decrease the By testing on different sensory benchmarks (voice, image, and
accuracy of downstream tasks. Also, the authors further intro- language), it achieves competitive or better results compared
duce both text-only corpus and unfrozen detector strategies with the existing self-supervised methods.
for pretraining enhancement. Instead, such a “harmful” pre- 2) Visual Grounding: Compared with VLP, visual
training strategy is refuted by UNITER [142], and the authors grounding has more concrete target signal supervision whose
deploy an optimal transport loss to explicitly build word-region objective is to locate the target objects according to their
alignment (WRA) at the instance level. To the same end, corresponding descriptions. In the image space, modulated
Oscar [143] uses shared linguistic semantic embeddings of a DETR (MDETR) [153] extends its previous work [30]
salient object class (called tag) as an anchor point to link both to PG pretraining that locates and assigns the bounding
region and its paired words. Zhou et al. [144] proposed unified box to each instance phrase in one description. Based on
VLP to handle both generation and understanding tasks via the proposed combined dataset from many existing ones,
a shared Transformer encoder–decoder with two customized MDETR is first pretrained on the 1.3M aligned text–image
attention masks. Without extra auxiliary training, unified VLP pairs for PG and then fine-tuned on other downstream
only adopts MLM during pretraining and attains superior tasks. During pretraining, the image–text pair features are
results on visual question answering (VQA) [203] and visual separately processed by two specific extractors and fed into
captioning (VC) [204] tasks. a DETR-like Transformer for salient object localization.
Authorized licensed use limited to: Military Institute of Science and Technology. Downloaded on September 12,2024 at 08:52:55 UTC from IEEE Xplore. Restrictions apply.
LIU et al.: SURVEY OF VISUAL TRANSFORMERS 7493

Besides the box loss, two auxiliary losses are adopted to in Section VIII-B, future research directions suggestion in
enforce the network to model an alignment between image Section VIII-C, and the final conclusion in Section VIII-D.
feature and their corresponding phrase tokens. With the
A. Summary of Recent Improvements
large-scale image–text pairs pretraining, MDETR can be
easily generalized in few-shot learning, even on long-tail data. We briefly summarize the major performance improvements
Different from MDETR [153] adding two auxiliary losses for for three fundamental CV tasks as follows.
box-phrase alignments, referring Transformer [156] directly 1) For classification, a deep hierarchical Transformer back-
initializes object queries with phrase-specific embeddings bone is valid for decreasing the computational com-
for PG, which explicitly reserves a one-to-one phrase plexity [41] and avoiding the feature oversmooth [37],
assignment for final bounding box prediction. VGTR [155] [42], [66], [67] in the deep layer. Meanwhile, the early
reformulates the REC as a task for single salient object stage convolution [39] is enough to capture the low-level
localization from the language features. In detail, a text- features, which can significantly enhance the robustness
guided attention mechanism encapsulates both self-attention and reduce the computational complexity in the shallow
block and text–image cross-attention one to update the layer. Moreover, both the convolutional projection [54],
image features simultaneously. The resulted image features, [55] and the local attention mechanism [35], [44] can
which serve as the key–value pairs, interact with language improve the locality of the visual Transformers. The
queries when regressing bounding box coordinates in the former [56], [57] may also be a new approach to replace
decoder. Following ViT [29], TransVG [154] keeps the the positional encoding.
class token to aggregate the image and language features 2) For detection, the Transformer necks benefit from
simultaneously for the mentioned object localization in the encoder–decoder structure with less computation
REC. Pseudo-Q [157] focuses on REC for unsupervised than the encoder-only Transformer detector [88]. Thus,
learning, where a pseudo-query generation module based on the decoder is necessary, but it requires more spatial
a pretrained detector and a series of attributes&relationship prior [76], [80], [81], [82], [83], [85], [86] due to its slow
generation algorithm is applied to generate a set of pseudo convergence [87]. Furthermore, sparse attention [76] and
phrase descriptions and a query prompt is introduced to match scoring network [78], [79] for fore-grounding sampling
feature proposals and phrase queries for REC adaptation. are conducive to reducing the computational costs and
In the 3-D spaces, LanguageRefer [158] redefines the accelerating the convergence of visual Transformers.
multistream data reasoning as a language modeling problem, 3) For segmentation, the encoder–decoder Transformer
whose core idea is to omit point cloud features and infuse models may unify three segmentation subtasks into a
the predicted class embeddings together with a caption into a mask prediction problem via a set of learnable mask
language model to get a binary prediction for object selection. embeddings [31], [104], [198]. This box-free approach
Following the conventional two-stream methods, TransRe- has achieved the latest SoTA performance on multiple
fer3D [159] further enhances the relationship of the object benchmarks [198]. Moreover, the specific hybrid task
features by using a cross-attention between asymmetric object is cascaded with the model [101] of the box-based
relation maps and linguistic features. Considering the specific visual Transformers, which have demonstrated a higher
view for varied descriptions, Huang et al. [160] presented a performance for instance segmentation.
multiview Transformer (MVT 2022) for 3-D visual grounding. 4) For 3-D visual recognition, the local hierarchical Trans-
Given a shared point cloud feature for each object, MVT former with a scoring network could efficiently extract
first appends the converted bounding box coordinates to the features from the point clouds. Instead of the elaborate
shared objects in order to get specific view features. These local design, the global modeling capability enables the
multiview features are then fed into a stack of the Transformer Transformer to easily aggregate surface points. In addi-
decoders for text data fusion. Finally, the multiview features tion, the visual Transformers can handle multisensory
are merged by an order-independent aggregation function and data in 3-D visual recognition, such as multiview and
converted to the grounding score. MVT achieves the SoTA multidimension data.
performance on Nr3D and Sr3D datasets [205]. In the video 5) The mainstream visual-linguistic pretraining have gradu-
space, a specific 3-D data (with temporal dimension), Yang ally focused on the alignments [147] or similarities [152]
et al. [161] proposed TubeDETR to address the problem of among different data streams in the latent space based on
spatiotemporal video grounding (STVG). Concretely, a slow- the large-scale noised datasets [149]. Another concern is
fast encoder sparsely samples the frames and performs cross- to adapt the downstream visual tasks to the pretraining
modal self-attention between the sampled frames and the text scheme to perform zero-short transferring [147].
features in the slow branch and aggregates the updated sample 6) The recent prevailing architecture for multisensory data
features into the full-frame features from fast branch via a fusion is the single-stream method, which spatially con-
broadcast operation. A learnable query attached with different catenates different data streams and performs interac-
time encodings, called time-specific queries in the decoder, tion simultaneously. Based on the single-stream model,
is then predicted as either a time-aligned bounding box or “no numerous recent works devote to finding a latent space
object.” It attains SoTA results on STVG leaderboards. to semantically align different data.
VIII. C ONCLUSION AND D ISCUSSION B. Discussion on Visual Transformers
This section briefly summaries the recent improve- Despite that the visual Transformer models are evolved sig-
ments in Section VIII-A, some critical issues disscusion nificantly, the “essential” understanding remains insufficient.
Authorized licensed use limited to: Military Institute of Science and Technology. Downloaded on September 12,2024 at 08:52:55 UTC from IEEE Xplore. Restrictions apply.
7494 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 35, NO. 6, JUNE 2024

Therefore, we will focus on reviewing some key issues for a


deep and comprehensive understanding.
1) How Transformers Bridge the Gap Between Language
and Vision: Transformers are initially designed for machine
translation tasks [1], where each word of a sentence is taken as
a basic unit representing the high-level semantics. These words
can be embedded into a series of vector representations in
the N-dimensional feature space. For visual tasks, each single
pixel of an image is unable to carry semantic information, Fig. 13. Taxonomy of the learnable embedding.
which is not full compliance with the feature embedding
as done for the traditional NLP tasks. Therefore, the key excellent robustness, whereas it is insensitive to low-level
for transferring such feature embeddings (i.e., word embed- features (e.g., complicated textures and edges) compared with
ding) to image features and applying Transformer to various CNN. Accordingly, it is concluded that Transformers at a
vision tasks is to build an image-to-vector transformation and high-level stage play a vital role in various vision tasks.
maintain the image’s characteristics effectively. For example, 4) Learnable Embeddings for Different Visual Tasks: Vari-
ViT [29] transforms an image into patch embeddings with mul- ous learnable embeddings are designed to perform different
tiple low-level information under strong slackness conditions. visual tasks, such as class token, object query, and mask
Also, its votarist [39], [58] leverages convolution to extract the embedding. These learnable tokens are mainly adopted into
low-level features and reduce the redundancy from patches. two different Transformer patterns, i.e., encoder-only and
2) Relationship Between Transformers, Self-Attention, and encoder–decoder ones, as shown in Fig. 13. On the quantity
CNNs: From the perspective of CNNs, its inductive bias level, the number of learned tokens depends on the target
is mainly shown as locality, translation invariance, weight prediction. For example, the visual Transformers [29], [40]
sharing, and sparse connection. Such a simple convolutional in the classification task adopt only one class token, and
kernel can perform template matching efficiently in lower level the DETR’s votarist in detection [30], [81] and segmenta-
semantic processing, but its upper bound tends to be lower than tion [198] tasks employ multiple learned queries. On the
Transformers due to the excessive bias. position level, encoder-only Transformers capitalize on the
From the perspective of Transformers, as detailed in initial token(s) [29], [88] and later token(s) [42], [104],
Sections III-B and III-D, attention layer can theoretically while the learned positional encoding [30], [81], [198] and
express any convolution when a sufficient number of heads the learned decoder input embedding [76] are applied to the
are adopted [28]. Such fully attentional operation can com- encoder–decoder structure. Different from the vanilla ViT with
bine both local-level attention and global-level attention and initial class token, CaiT [42] observes that the later class token
generate attention weights dynamically according to the fea- can reduce FLOPs of Transformer and improve the model per-
ture relationships. Dong et al. [168] demonstrated that the formance slightly. Segmenter [104] also shows such strategy
self-attention layer manifests strong inductive bias toward efficiency for the segmentation tasks. From the viewpoint of
“token uniformity” when it is trained on deep layers without the encoder–decoder Transformer, the decoder input token is
short connection or FFNs. Yu et al. [206] also argued that such considered as a special case of the encoder-only Transformer
an elaborate attention mechanism can be replaced by a pooling with later token. It standardizes visual Transformers in the
operation readily. Therefore, it is concluded that Transformer fields of detection [30] and segmentation [198] by using a
must consist of two key components: a global token mixer small set of object queries (mask embeddings). By combing
(e.g., self-attention layer) aggregates the relationship of tokens both later tokens and object queries (mask embeddings), the
and a positionwise FFN extracts the features from the inputs. structure such as deformable DETR [76], which takes object
By comparison, the visual Transformer has a powerful queries and the learnable decoder embeddings (equivalent to
global modeling capability, making it efficiently attend to the later tokens) as the inputs, may unify the learnable embed-
high-level semantic features. CNNs can effectively process the dings for different tasks into the encoder–decoder Transformer.
low-level features [39], [58], enhance the locality of the visual
Transformers [53], [81], and append the positional features via C. Future Research Directions
padding operations [56], [57], [173]. Visual Transformers have achieved significant progresses
3) Double Edges of Visual Transformers: We conclude three and obtained promising results. However, some key technolo-
double-edged properties of visual Transformers as follows. gies are still insufficient to cope with complicated challenges
Global property enables Transformer to acquire capacious in the CV fields. We point out some promising research
receptive fields and interact easily between various high-level directions for future investigation.
semantic features, while it becomes inefficiency and debility 1) Set Prediction: Touvron et al. [40] found that mul-
during low-level processing because of quadratic computing tiple class tokens would converge consistently due to the
and noised low-level features. Slack bias offers visual Trans- same gradient from the loss function, whereas it does not
former a higher upper bound than CNNs based on sufficient emerge in dense prediction tasks [30], [198]. We conclude
training data without sophistic assumptions but performs infe- that their marked difference lies in the label assignment and
riority and slow convergence in small datasets [207]. Low pass the number of targets. Thus, it is natural to consider a set
is also a significant property of visual Transformer showing prediction design for the classification tasks, e.g., multiple
Authorized licensed use limited to: Military Institute of Science and Technology. Downloaded on September 12,2024 at 08:52:55 UTC from IEEE Xplore. Restrictions apply.
LIU et al.: SURVEY OF VISUAL TRANSFORMERS 7495

class tokens are aligned to mix-patches via set prediction, [4] T. B. Brown et al., “Language models are few-shot learners,” in Proc.
such as the data augmentation training strategy in LV-ViT [43]. NeurIPS, 2020, pp. 1877–1901.
[5] J. Devlin, M.-W. Chang, K. Lee, and K. N. Toutanova, “BERT: Pre-
Furthermore, the label assignment in the set prediction strategy training of deep bidirectional transformers for language understanding,”
leads to training instability during the early process, which in Proc. NAACL, 2018, pp. 4171–4186.
degrades the accuracy of the final results. Redesigning the [6] Y. Liu et al., “RoBERTa: A robustly optimized BERT pretraining
approach,” 2019, arXiv:1907.11692.
label assignments and set prediction losses may be helpful for
[7] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut,
the detection frameworks. “ALBERT: A lite BERT for self-supervised learning of language
2) Self-Supervised Learning: Self-supervised pretraining of representations,” in Proc. ICLR, 2020, pp. 1–17.
Transformers has standardized the NLP field and obtained [8] Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, and
Q. V. Le, “XLNet: Generalized autoregressive pretraining for language
tremendous successes in various applications [2], [5]. Because understanding,” in Proc. NeurIPS, 2019, pp. 5753–5763.
of the popularity of self-supervision paradigms in the CV [9] D. W. Otter, J. R. Medina, and J. K. Kalita, “A survey of the usages
field, the convolutional Siamese networks employ contrastive of deep learning for natural language processing,” IEEE Trans. Neural
Netw. Learn. Syst., vol. 32, no. 2, pp. 604–624, Feb. 2021.
learning to perform self-supervised pretraining, which differs
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classifica-
from the masked autoencoders used in the NLP field. Recently, tion with deep convolutional neural networks,” in Proc. NIPS, 2012,
some studies have tried to design self-supervised visual Trans- pp. 1097–1105.
formers to bridge the discrepancy of pretraining methodology [11] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.
between vision and language. Most of them inherit the masked (CVPR), Jun. 2016, pp. 770–778.
autoencoders in the NLP field or contrastive learning schemes [12] M. Tan and Q. Le, “EfficientNet: Rethinking model scaling for convo-
in the CV field. There is no specific supervised method for lutional neural networks,” in Proc. ICML, 2019, pp. 6105–6114.
the visual Transformers, but it has revolutionized the NLP [13] A. Galassi, M. Lippi, and P. Torroni, “Attention in natural language
processing,” IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 10,
tasks such as GPT-3. As described in Section VIII-B4, the pp. 4291–4308, Oct. 2021.
encoder–decoder structure may unify the visual tasks by learn- [14] X. Wang, R. B. Girshick, A. Gupta, and K. He, “Non-local neural
ing the decoder embedding and the positional encoding jointly. networks,” in Proc. IEEE CVPR, Jun. 2018, pp. 7794–7803.
[15] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu, “CCNet:
Thus, it is worth of further investigating the encoder–decoder Criss-cross attention for semantic segmentation,” in Proc. IEEE/CVF
Transformers for self-supervised learning. Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 603–612.
[16] Y. Cao, J. Xu, S. Lin, F. Wei, and H. Hu, “GCNet: Non-local networks
D. Conclusion meet squeeze-excitation networks and beyond,” in Proc. IEEE/CVF Int.
Conf. Comput. Vis. Workshop (ICCVW), Oct. 2019, pp. 1971–1980.
Since ViT demonstrated its effectiveness for the CV tasks, [17] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in
the visual Transformers have received considerable attention Proc. IEEE/CVF CVPR, Jun. 2018, pp. 7132–7141.
and undermined the dominant of CNNs in the CV field. [18] S. Woo, J. Park, J. Y. Lee, and I. S. Kweon, “CBAM: Convolutional
block attention module,” in Proc. Eur. Conf. Comput. Vis. (ECCV),
In this article, we have comprehensively reviewed more than 2018, pp. 3–19.
100 of visual Transformer models that have been successively [19] Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, and Q. Hu, “ECA-Net: Efficient
applied to various vision tasks (i.e., classification, detection, channel attention for deep convolutional neural networks,” in Proc.
IEEE/CVF CVPR, Jun. 2020, pp. 11534–11542.
and segmentation) and data streams (e.g., images, point clouds,
[20] N. Parmar et al., “Image transformer,” in Proc. ICML, 2018,
image–text pairs, and other multiple data streams). For each pp. 4055–4064.
vision task and data stream, a specific taxonomy is proposed to [21] H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei, “Relation net-
organize the recently developed visual Transformers and their works for object detection,” in Proc. IEEE/CVF CVPR, Jun. 2018,
pp. 3588–3597.
performances are further evaluated over various prevailing [22] H. Hu, Z. Zhang, Z. Xie, and S. Lin, “Local relation networks for image
benchmarks. From our integrative analysis and systematic recognition,” in Proc. IEEE/CVF ICCV, Oct. 2019, pp. 3464–3473.
comparison of all these existing methods, a summary of [23] I. Bello, B. Zoph, A. Vaswani, J. Shlens, and Q. V. Le, “Attention aug-
remarkable improvements is provided in this article, four mented convolutional networks,” in Proc. ICCV, 2019, pp. 3286–3295.
[24] P. Ramachandran, N. Parmar, A. Vaswani, I. Bello, A. Levskaya,
essential issues for the visual Transformers are also discussed, and J. Shlens, “Stand-alone self-attention in vision models,” in Proc.
and several potential research directions are further suggested NeurIPS, 2019, pp. 68–80.
for future investigation. We do expect that this review article [25] H. Zhao, J. Jia, and V. Koltun, “Exploring self-attention for image
recognition,” in Proc. IEEE/CVF CVPR, Jun. 2020, pp. 10076–10085.
can help readers have better understandings of various visual [26] Z. Zheng, G. An, D. Wu, and Q. Ruan, “Global and local knowledge-
Transformers before they decide to perform deep explorations. aware attention network for action recognition,” IEEE Trans. Neural
Netw. Learn. Syst., vol. 32, no. 1, pp. 334–347, Jan. 2021.
ACKNOWLEDGMENT [27] A. Vaswani, P. Ramachandran, A. Srinivas, N. Parmar, B. Hechtman,
and J. Shlens, “Scaling local self-attention for parameter efficient visual
This work was done at the AI Lab, Lenovo Research, backbones,” in Proc. IEEE/CVF CVPR, Jun. 2021, pp. 12894–12904.
Beijing, China. [28] J.-B. Cordonnier, A. Loukas, and M. Jaggi, “On the relationship
between self-attention and convolutional layers,” in Proc. ICLR, 2020,
R EFERENCES pp. 1–18.
[1] A. Vaswani et al., “Attention is all you need,” in Proc. Adv. Neural Inf. [29] A. Dosovitskiy et al., “An image is worth 16 × 16 words: Transformers
Process. Syst. (NIPS), 2017, pp. 5998–6008. for image recognition at scale,” in Proc. ICLR, 2021, pp. 1–16.
[2] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving [30] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and
language understanding by generative pre-training,” OpenAI, Tech. S. Zagoruyko, “End-to-end object detection with transformers,” in
Rep., 2018. Proc. ECCV, 2020, pp. 213–229.
[3] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, [31] H. Wang, Y. Zhu, H. Adam, A. Yuille, and L.-C. Chen, “MaX-
“Language models are unsupervised multitask learners,” OpenAI, Tech. DeepLab: End-to-end panoptic segmentation with mask transformers,”
Rep., 2019. in Proc. IEEE/CVF CVPR, Jun. 2021, pp. 5463–5474.

Authorized licensed use limited to: Military Institute of Science and Technology. Downloaded on September 12,2024 at 08:52:55 UTC from IEEE Xplore. Restrictions apply.
7496 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 35, NO. 6, JUNE 2024

[32] B. Cheng, A. Schwing, and A. Kirillov, “Per-pixel classification is [63] L. Yuan et al., “Tokens-to-Token ViT: Training vision transformers
not all you need for semantic segmentation,” in Proc. NeurIPS, 2021, from scratch on ImageNet,” in Proc. IEEE/CVF ICCV, Oct. 2021,
pp. 17864–17875. pp. 558–567.
[33] X. Chen, B. Yan, J. Zhu, D. Wang, X. Yang, and H. Lu, “Transformer [64] B. Heo, S. Yun, D. Han, S. Chun, J. Choe, and S. J. Oh, “Rethinking
tracking,” in Proc. CVPR, 2021, pp. 8126–8135. spatial dimensions of vision transformers,” in Proc. IEEE/CVF ICCV,
[34] Y. Jiang, S. Chang, and Z. Wang, “TransGAN: Two pure trans- Oct. 2021, pp. 11936–11945.
formers can make one strong GAN, and that can scale up,” 2021, [65] W. Wang et al., “PVT v2: Improved baselines with pyramid vision
arXiv:2102.07074. transformer,” 2021, arXiv:2106.13797.
[35] Z. Liu et al., “Swin transformer: Hierarchical vision transformer [66] D. Zhou et al., “DeepViT: Towards deeper vision transformer,” 2021,
using shifted Windows,” in Proc. IEEE/CVF ICCV, Oct. 2021, arXiv:2103.11886.
pp. 10012–10022. [67] C. Gong, D. Wang, M. Li, V. Chandra, and Q. Liu, “Vision transformers
[36] H. Wu et al., “CvT: Introducing convolutions to vision transformers,” with patch diversification,” 2021, arXiv:2104.12753.
in Proc. IEEE/CVF ICCV, Oct. 2021, pp. 22–31. [68] M. Chen et al., “Generative pretraining from pixels,” in Proc. ICML,
[37] D. Zhou et al., “Refiner: Refining self-attention for vision transform- 2020, pp. 1691–1703.
ers,” 2021, arXiv:2106.03714. [69] Z. Li et al., “MST: Masked self-supervised transformer for visual
[38] X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer, “Scaling vision representation,” in Proc. NeurIPS, 2021, pp. 13165–13176.
transformers,” 2021, arXiv:2106.04560. [70] H. Bao, L. Dong, and F. Wei, “BEiT: BERT pre-training of image
[39] Z. Dai, H. Liu, Q. V. Le, and M. Tan, “CoAtNet: Marrying con- transformers,” in Proc. ICLR, 2021, pp. 1–18.
volution and attention for all data sizes,” in Proc. NeurIPS, 2021, [71] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked
pp. 3965–3977. autoencoders are scalable vision learners,” in Proc. IEEE/CVF CVPR,
[40] H. Touvron, M. Cord, D. Matthijs, F. Massa, A. Sablayrolles, and Jun. 2022, pp. 16000–16009.
H. Jegou, “Training data-efficient image transformers & distillation [72] X. Chen, S. Xie, and K. He, “An empirical study of training self-
through attention,” in Proc. ICLR, 2021, pp. 10347–10357. supervised vision transformers,” in Proc. IEEE/CVF ICCV, Oct. 2021,
[41] W. Wang et al., “Pyramid vision transformer: A versatile backbone pp. 9640–9649.
for dense prediction without convolutions,” in Proc. IEEE/CVF ICCV, [73] M. Caron et al., “Emerging properties in self-supervised vision trans-
Oct. 2021, pp. 568–578. formers,” in Proc. IEEE/CVF ICCV, Oct. 2021, pp. 9650–9660.
[42] H. Touvron, M. Cord, A. Sablayrolles, G. Synnaeve, and H. Jégou, [74] Z. Xie et al., “Self-supervised learning with Swin transformers,” 2021,
“Going deeper with image transformers,” in Proc. IEEE/CVF ICCV, arXiv:2105.04553.
Oct. 2021, pp. 32–42. [75] T. Chen, S. Saxena, L. Li, D. J. Fleet, and G. Hinton, “Pix2seq: A
[43] Z.-H. Jiang et al., “All tokens matter: Token labeling for training better language modeling framework for object detection,” in Proc. ICLR,
vision transformers,” in Proc. NeurIPS, 2021, pp. 18590–18602. 2021, pp. 1–17.
[44] L. Yuan, Q. Hou, Z. Jiang, J. Feng, and S. Yan, “VOLO: Vision [76] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable
outlooker for visual recognition,” 2021, arXiv:2106.13112. DETR: Deformable transformers for end-to-end object detection,” in
[45] Y. Tay, M. Dehghani, D. Bahri, and D. Metzler, “Efficient transformers: Proc. ICLR, 2021, pp. 1–16.
A survey,” ACM Comput. Surv., vol. 55, no. 6, pp. 1–28, Apr. 2022. [77] M. Zheng et al., “End-to-End object detection with adaptive clustering
[46] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, transformer,” 2020, arXiv:2011.09315.
“Transformers in vision: A survey,” ACM Comput. Surv., vol. 54, [78] T. Wang, L. Yuan, Y. Chen, J. Feng, and S. Yan, “PnP-DETR: Towards
no. 10s, pp. 1–41, Jan. 2022. efficient visual analysis with transformers,” in Proc. IEEE/CVF ICCV,
[47] K. Han et al., “A survey on vision transformer,” IEEE Trans. Pattern Oct. 2021, pp. 4661–4670.
Anal. Mach. Intell., vol. 45, no. 1, pp. 87–110, Jan. 2023. [79] B. Roh, J. Shin, W. Shin, and S. Kim, “Sparse DETR: Efficient end-
[48] T. Lin, Y. Wang, X. Liu, and X. Qiu, “A survey of transformers,” AI to-end object detection with learnable sparsity,” in Proc. ICLR, 2021,
Open, vol. 3, pp. 111–132, Oct. 2022. pp. 1–23.
[49] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning [80] P. Gao, M. Zheng, X. Wang, J. Dai, and H. Li, “Fast convergence
with neural networks,” in Proc. NeurIPS, 2014, pp. 3104–3112. of DETR with spatially modulated co-attention,” in Proc. IEEE/CVF
[50] K. Greff, R. K. Srivastava, J. Koutnìk, B. R. Steunebrink, and ICCV, Oct. 2021, pp. 3621–3630.
J. Schmidhuber, “LSTM: A search space Odyssey,” IEEE Trans. [81] D. Meng et al., “Conditional DETR for fast training convergence,” in
Neural Netw. Learn. Syst., vol. 28, no. 10, pp. 2222–2232, Oct. 2017. Proc. IEEE/CVF ICCV, Oct. 2021, pp. 3651–3660.
[51] B. Wu et al., “Visual transformers: Token-based image representation [82] Y. Wang, X. Zhang, T. Yang, and J. Sun, “Anchor DETR: Query design
and processing for computer vision,” 2020, arXiv:2006.03677. for transformer-based detector,” in Proc. AAAI, 2022, pp. 2567–2575.
[52] A. Srinivas, T.-Y. Lin, N. Parmar, J. Shlens, P. Abbeel, and A. Vaswani, [83] S. Liu et al., “DAB-DETR: Dynamic anchor boxes are better queries
“Bottleneck transformers for visual recognition,” in Proc. IEEE/CVF for DETR,” in Proc. ICLR, 2021, pp. 1–19.
CVPR, Jun. 2021, pp. 16519–16529. [84] Y. Liu et al., “SAP-DETR: Bridging the gap between salient points and
[53] S. D’Ascoli, H. Touvron, M. Leavitt, A. Morcos, G. Biroli, and queries-based transformer detector for fast model convergency,” 2022,
L. Sagun, “ConViT: Improving vision transformers with soft convo- arXiv:2211.02006.
lutional inductive biases,” in Proc. ICLR, 2021, pp. 2286–2296. [85] Z. Yao, J. Ai, B. Li, and C. Zhang, “Efficient DETR: Improving end-
[54] K. Yuan, S. Guo, Z. Liu, A. Zhou, F. Yu, and W. Wu, “Incorporating to-end object detector with dense prior,” 2021, arXiv:2104.01318.
convolution designs into visual transformers,” in Proc. IEEE/CVF [86] X. Dai, Y. Chen, J. Yang, P. Zhang, L. Yuan, and L. Zhang, “Dynamic
ICCV, Oct. 2021, pp. 579–588. DETR: End-to-end object detection with dynamic attention,” in Proc.
[55] Y. Li, K. Zhang, J. Cao, R. Timofte, and L. Van Gool, “LocalViT: IEEE/CVF ICCV, Oct. 2021, pp. 2988–2997.
Bringing locality to vision transformers,” 2021, arXiv:2104.05707. [87] Z. Sun, S. Cao, Y. Yang, and K. Kitani, “Rethinking transformer-
[56] X. Chu et al., “Conditional positional encodings for vision transform- based set prediction for object detection,” in Proc. IEEE/CVF ICCV,
ers,” 2021, arXiv:2102.10882. Oct. 2021, pp. 3611–3620.
[57] Q. Zhang and Y.-B. Yang, “ResT: An efficient transformer for visual [88] Y. Fang et al., “You only look at one sequence: Rethinking trans-
recognition,” in Proc. NeurIPS, 2021, pp. 15475–15485. former in vision through object detection,” in Proc. NeurIPS, 2021,
[58] T. Xiao, P. Dollár, M. Singh, E. Mintun, T. Darrell, and R. Girshick, pp. 26183–26197.
“Early convolutions help transformers see better,” in Proc. NeruIPS, [89] Z. Dai, B. Cai, Y. Lin, and J. Chen, “UP-DETR: Unsupervised pre-
2021, pp. 30392–30400. training for object detection with transformers,” in Proc. IEEE/CVF
[59] K. Han, A. Xiao, E. Wu, J. Guo, C. Xu, and Y. Wang, “Transformer CVPR, Jun. 2021, pp. 1601–1610.
in transformer,” in Proc. NeurIPS, 2021, pp. 15908–15919. [90] W. Wang, Y. Cao, J. Zhang, and D. Tao, “FP-DETR: Detection
[60] X. Chu et al., “Twins: Revisiting the design of spatial attention in vision transformer advanced by fully pre-training,” in Proc. ICLR, 2021,
transformers,” in Proc. NeurIPS, 2021, pp. 9355–9366. pp. 1–14.
[61] P. Zhang et al., “Multi-scale vision longformer: A new vision trans- [91] F. Li, H. Zhang, S. Liu, J. Guo, L. M. Ni, and L. Zhang, “DN-DETR:
former for high-resolution image encoding,” in Proc. IEEE/CVF ICCV, Accelerate DETR training by introducing query denoising,” in Proc.
Oct. 2021, pp. 2998–3008. IEEE/CVF CVPR, Jun. 2022, pp. 13619–13627.
[62] J. Yang et al., “Focal self-attention for local-global interactions in vision [92] H. Zhang et al., “DINO: DETR with improved denoising anchor boxes
transformers,” 2021, arXiv:2107.00641. for end-to-end object detection,” 2022, arXiv:2203.03605.

Authorized licensed use limited to: Military Institute of Science and Technology. Downloaded on September 12,2024 at 08:52:55 UTC from IEEE Xplore. Restrictions apply.
LIU et al.: SURVEY OF VISUAL TRANSFORMERS 7497

[93] D. Zhang, H. Zhang, J. Tang, M. Wang, X. Hua, and Q. Sun, “Feature [123] X. Yu, Y. Rao, Z. Wang, Z. Liu, J. Lu, and J. Zhou, “PoinTr: Diverse
pyramid transformer,” in Proc. ECCV, 2020, pp. 323–339. point cloud completion with geometry-aware transformers,” in Proc.
[94] Y. Yuan et al., “HRFormer: High-resolution vision transformer for ICCV, 2021, pp. 12478–12487.
dense predict,” in Proc. NeurIPS, 2021, pp. 7281–7293. [124] P. Xiang et al., “SnowflakeNet: Point cloud completion by snowflake
[95] J. Gu et al., “HRViT: Multi-scale high-resolution vision transformer,” point deconvolution with skip-transformer,” in Proc. IEEE/CVF ICCV,
2021, arXiv:2111.01236. Oct. 2021, pp. 5479–5489.
[96] S. Zheng et al., “Rethinking semantic segmentation from a sequence- [125] J. Choe, B. Joung, F. Rameau, J. Park, and I. So Kweon, “Deep point
to-sequence perspective with transformers,” in Proc. IEEE/CVF CVPR, cloud reconstruction,” 2021, arXiv:2111.11704.
[126] S. Chen, T. Yu, and P. Li, “MVT: Multi-view vision transformer for
Jun. 2021, pp. 6881–6890.
3D object recognition,” 2021, arXiv:2110.13083.
[97] J. Chen et al., “TransUNet: Transformers make strong encoders for
[127] Y. Hou and L. Zheng, “Multiview detection with shadow transformer
medical image segmentation,” 2021, arXiv:2102.04306.
(and view-coherent data augmentation),” in Proc. 29th ACM Int. Conf.
[98] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, Multimedia, Oct. 2021, pp. 1673–1682.
“SegFormer: Simple and efficient design for semantic segmentation [128] A. Prakash, K. Chitta, and A. Geiger, “Multi-modal fusion transformer
with transformers,” in Proc. NeurIPS, 2021, pp. 12077–12090. for end-to-end autonomous driving,” in Proc. IEEE/CVF Conf. Comput.
[99] T. Prangemeier, C. Reich, and H. Koeppl, “Attention-based transform- Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 7077–7087.
ers for instance segmentation of cells in microstructures,” in Proc. IEEE [129] W. Jiang, E. Trulls, J. Hosang, A. Tagliasacchi, and K. M. Yi, “COTR:
Int. Conf. Bioinf. Biomed. (BIBM), Dec. 2020, pp. 700–707. Correspondence transformer for matching across images,” in Proc.
[100] Y. Wang et al., “End-to-end video instance segmentation with trans- IEEE/CVF ICCV, Oct. 2021, pp. 6207–6217.
formers,” in Proc. IEEE/CVF CVPR, Jun. 2021, pp. 8741–8750. [130] D. Wang et al., “Multi-view 3D reconstruction with transformers,” in
[101] Y. Fang et al., “Instances as queries,” in Proc. ICCV, 2021, Proc. IEEE/CVF ICCV, Oct. 2021, pp. 5722–5731.
pp. 6910–6919. [131] X. Chen, T. Zhang, Y. Wang, Y. Wang, and H. Zhao, “FUTR3D:
[102] J. Hu et al., “ISTR: End-to-end instance segmentation with transform- A unified sensor fusion framework for 3D detection,” 2022,
ers,” 2021, arXiv:2105.00637. arXiv:2203.10642.
[103] B. Dong, F. Zeng, T. Wang, X. Zhang, and Y. Wei, “SOLQ: Segmenting [132] A. Bozic, P. Palafox, J. Thies, A. Dai, and M. Nießner, “Transformer-
objects by learning queries,” in Proc. NeurIPS, 2021, pp. 21898–21909. Fusion: Monocular RGB scene reconstruction using transformers,” in
[104] R. Strudel, R. Garcia, I. Laptev, and C. Schmid, “Segmenter: Proc. NeurIPS, 2021, pp. 1403–1414.
Transformer for semantic segmentation,” in Proc. IEEE/CVF ICCV, [133] Y. Zhang et al., “mmFormer: Multimodal medical transformer for
Oct. 2021, pp. 7262–7272. incomplete multimodal learning of brain tumor segmentation,” in Proc.
[105] H. Zhao, L. Jiang, J. Jia, P. Torr, and V. Koltun, “Point transformer,” MICCAI, 2022, pp. 107–117.
in Proc. IEEE/CVF ICCV, Oct. 2021, pp. 16259–16268. [134] G. V. Tulder, Y. Tong, and E. Marchiori, “Multi-view analysis of
[106] M.-H. Guo, J.-X. Cai, Z.-N. Liu, T.-J. Mu, R. R. Martin, and S.-M. Hu, unregistered medical images using cross-view transformers,” in Proc.
“PCT: Point cloud transformer,” Comput. Vis. Media, vol. 7, no. 2, MICCAI, 2021, pp. 104–113.
pp. 187–199, 2021. [135] X. Long, L. Liu, W. Li, C. Theobalt, and W. Wang, “Multi-view
depth estimation using epipolar spatio-temporal networks,” in Proc.
[107] D. Lu, Q. Xie, K. Gao, L. Xu, and J. Li, “3DCTN: 3D convolution-
IEEE/CVF CVPR, Jun. 2021, pp. 8258–8267.
transformer network for point cloud classification,” IEEE Trans. Intell.
[136] D. Song et al., “Deep relation transformer for diagnosing glaucoma
Transp. Syst., vol. 23, no. 12, pp. 24854–24865, Dec. 2022.
with optical coherence tomography and visual field function,” IEEE
[108] C. Park, Y. Jeong, M. Cho, and J. Park, “Fast point transformer,” in Trans. Med. Imag., vol. 40, no. 9, pp. 2392–2402, Sep. 2021.
Proc. IEEE/CVF CVPR, Jun. 2022, pp. 16949–16958. [137] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid,
[109] X. Pan, Z. Xia, S. Song, L. E. Li, and G. Huang, “3D object “VideoBERT: A joint model for video and language representation
detection with pointformer,” in Proc. IEEE/CVF CVPR, Jun. 2021, learning,” in Proc. IEEE/CVF ICCV, Oct. 2019, pp. 7464–7473.
pp. 7463–7472. [138] J. Lu, D. Batra, D. Parikh, and S. Lee, “ViLBERT: Pretraining task-
[110] L. Fan et al., “Embracing single stride 3D object detector with sparse agnostic visiolinguistic representations for vision-and-language tasks,”
transformer,” in Proc. IEEE/CVF CVPR, Jun. 2022, pp. 8458–8468. in Proc. NeurIPS, 2019, pp. 13–23.
[111] J. Mao et al., “Voxel transformer for 3D object detection,” in Proc. [139] H. Tan and M. Bansal, “LXMERT: Learning cross-modality encoder
IEEE/CVF ICCV, Oct. 2021, pp. 3144–3153. representations from transformers,” in Proc. Conf. Empirical Methods
[112] C. He, R. Li, S. Li, and L. Zhang, “Voxel set transformer: A set-to-set Natural Lang. Process. 9th Int. Joint Conf. Natural Lang. Process.
approach to 3D object detection from point clouds,” in Proc. IEEE/CVF (EMNLP-IJCNLP), 2019, pp. 5100–5111.
CVPR, Jun. 2022, pp. 8417–8427. [140] L. Harold Li, M. Yatskar, D. Yin, C.-J. Hsieh, and K.-W. Chang, “Visu-
[113] X. Yu, L. Tang, Y. Rao, T. Huang, J. Zhou, and J. Lu, “Point-BERT: alBERT: A simple and performant baseline for vision and language,”
Pre-training 3D point cloud transformers with masked point modeling,” 2019, arXiv:1908.03557.
in Proc. IEEE/CVF CVPR, Jun. 2022, pp. 19313–19322. [141] W. Su et al., “VL-BERT: Pre-training of generic visual-linguistic
[114] Y. Pang, W. Wang, F. E. H. Tay, W. Liu, Y. Tian, and L. Yuan, representations,” 2019, arXiv:1908.08530.
“Masked autoencoders for point cloud self-supervised learning,” 2022, [142] Y.-C. Chen et al., “UNITER: Universal image-text representation
arXiv:2203.06604. learning,” in Proc. ECCV, 2020, pp. 104–120.
[115] H. Liu, M. Cai, and Y. Jae Lee, “Masked discrimination for self- [143] X. Li et al., “OSCAR: Object-semantics aligned pre-training for vision-
supervised learning on point clouds,” 2022, arXiv:2203.11183. language tasks,” in Proc. ECCV, 2020, pp. 121–137.
[116] I. Misra, R. Girdhar, and A. Joulin, “An end-to-end transformer [144] L. Zhou, H. Palangi, L. Zhang, H. Hu, J. Corso, and J. Gao, “Unified
model for 3D object detection,” in Proc. IEEE/CVF ICCV, Oct. 2021, vision-language pre-training for image captioning and VQA,” in Proc.
pp. 2906–2917. AAAI, 2020, pp. 13041–13049.
[145] W. Kim, B. Son, and I. Kim, “ViLT: Vision-and-language transformer
[117] Z. Liu, Z. Zhang, Y. Cao, H. Hu, and X. Tong, “Group-free 3D object
without convolution or region supervision,” in Proc. ICML, 2021,
detection via transformers,” in Proc. IEEE/CVF ICCV, Oct. 2021,
pp. 5583–5594.
pp. 2949–2958.
[146] P. Zhang et al., “VinVL: Revisiting visual representations in
[118] H. Shenga et al., “Improving 3D object detection with channel-wise vision-language models,” in Proc. IEEE/CVF CVPR, Jun. 2021,
transformer,” in Proc. IEEE/CVF ICCV, Oct. 2021, pp. 2743–2752. pp. 5579–5588.
[119] K.-C. Huang, T.-H. Wu, H.-T. Su, and W. H. Hsu, “MonoDTR: [147] A. Radford et al., “Learning transferable visual models from natural
Monocular 3D object detection with depth-aware transformer,” in Proc. language supervision,” in Proc. ICML, 2021, pp. 8748–8763.
IEEE/CVF CVPR, Jun. 2022, pp. 4012–4021. [148] A. Ramesh et al., “Zero-shot text-to-image generation,” in Proc. ICML,
[120] R. Zhang et al., “MonoDETR: Depth-guided transformer for monocular 2021, pp. 8821–8831.
3D object detection,” 2022, arXiv:2203.13310. [149] C. Jia et al., “Scaling up visual and vision-language representa-
[121] Y. Wang, V. C. Guizilini, T. Zhang, Y. Wang, H. Zhao, and tion learning with noisy text supervision,” in Proc. ICML, 2021,
J. Solomon, “DETR3D: 3D object detection from multi-view images pp. 4904–4916.
via 3D-to-2D queries,” in Proc. 5th Conf. Robot Learn., 2022, [150] R. Hu and A. Singh, “UniT: Multimodal multitask learning with a uni-
pp. 180–191. fied transformer,” in Proc. IEEE/CVF ICCV, Oct. 2021, pp. 1439–1449.
[122] X. Bai et al., “TransFusion: Robust LiDAR-camera fusion for 3D object [151] Z. Wang, J. Yu, A. W. Yu, Z. Dai, Y. Tsvetkov, and Y. Cao, “SimVLM:
detection with transformers,” in Proc. IEEE/CVF CVPR, Jun. 2022, Simple visual language model pretraining with weak supervision,” in
pp. 1090–1099. Proc. ICLR, 2021, pp. 1–17.
Authorized licensed use limited to: Military Institute of Science and Technology. Downloaded on September 12,2024 at 08:52:55 UTC from IEEE Xplore. Restrictions apply.
7498 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 35, NO. 6, JUNE 2024

[152] A. Baevski, W.-N. Hsu, Q. Xu, A. Babu, J. Gu, and M. Auli, “Data2vec: [180] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards
A general framework for self-supervised learning in speech, vision and real-time object detection with region proposal networks,” IEEE
language,” 2022, arXiv:2202.03555. Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149,
[153] A. Kamath, M. Singh, Y. LeCun, G. Synnaeve, I. Misra, and N. Carion, Jun. 2017.
“MDETR—Modulated detection for end-to-end multi-modal under- [181] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only
standing,” in Proc. IEEE/CVF ICCV, Oct. 2021, pp. 1780–1790. look once: Unified, real-time object detection,” in Proc. IEEE CVPR,
[154] J. Deng, Z. Yang, T. Chen, W. Zhou, and H. Li, “TransVG: End-to- Jun. 2016, pp. 779–788.
end visual grounding with transformers,” in Proc. IEEE/CVF ICCV, [182] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal
Oct. 2021, pp. 1769–1779. loss for dense object detection,” in Proc. IEEE ICCV, Oct. 2017,
[155] Y. Du, Z. Fu, Q. Liu, and Y. Wang, “Visual grounding with transform- pp. 2980–2988.
ers,” in Proc. ICME, 2022, pp. 1–6. [183] Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, “Object detection with deep
[156] M. Li and L. Sigal, “Referring transformer: A one-step approach learning: A review,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30,
to multi-task visual grounding,” in Proc. NeurIPS, 2021, no. 11, pp. 3212–3232, Nov. 2019.
pp. 19652–19664. [184] J. Dai et al., “Deformable convolutional networks,” in Proc. ICCV,
[157] H. Jiang, Y. Lin, D. Han, S. Song, and G. Huang, “Pseudo-Q: 2017, pp. 764–773.
Generating pseudo language queries for visual grounding,” in Proc. [185] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie,
IEEE/CVF CVPR, Jun. 2022, pp. 15513–15523. “Feature pyramid networks for object detection,” in Proc. IEEE CVPR,
[158] J. Roh, K. Desingh, A. Farhadi, and D. Fox, “LanguageRefer: Spatial- Jul. 2017, pp. 2117–2125.
language model for 3D visual grounding,” in Proc. 5th Conf. Robot [186] Z. Tian, C. Shen, H. Chen, and T. He, “FCOS: Fully convolutional
Learn., 2022, pp. 1046–1056. one-stage object detection,” in Proc. IEEE/CVF ICCV, Oct. 2019,
[159] D. He et al., “TransRefer3D: Entity-and-relation aware transformer pp. 9627–9636.
for fine-grained 3D visual grounding,” in Proc. 29th ACM Int. Conf. [187] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in
Multimedia (CMM), 2021, pp. 2344–2352. Proc. ICCV, 2017, pp. 2961–2969.
[160] S. Huang, Y. Chen, J. Jia, and L. Wang, “Multi-view transformer [188] Z. Cai and N. Vasconcelos, “Cascade R-CNN: Delving into high quality
for 3D visual grounding,” in Proc. IEEE/CVF CVPR, Jun. 2022, object detection,” in Proc. IEEE CVPR, Jun. 2018, pp. 6154–6162.
pp. 15524–15533. [189] P. Sun et al., “What makes for end-to-end object detection?” in Proc.
[161] A. Yang, A. Miech, J. Sivic, I. Laptev, and C. Schmid, “TubeDETR: ICML, 2021, pp. 9934–9944.
Spatio-temporal video grounding with transformers,” in Proc. [190] K. Sun, B. Xiao, D. Liu, and J. Wang, “Deep high-resolution represen-
IEEE/CVF CVPR, Jun. 2022, pp. 16442–16453. tation learning for human pose estimation,” in Proc. IEEE/CVF CVPR,
[162] J. Lei Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” 2016, Jun. 2019, pp. 5693–5703.
arXiv:1607.06450. [191] L. C. Chen, G. Papandreou, and I. Kokkinos, “DeepLab: Semantic
[163] P. P. Brahma, D. Wu, and Y. She, “Why deep learning works: A image segmentation with deep convolutional nets, atrous convolution,
manifold disentanglement perspective,” IEEE Trans. Neural Netw. and fully connected CRFs,” IEEE Trans. Pattern Anal. Mach. Intell.,
Learn. Syst., vol. 27, no. 10, pp. 1997–2008, Oct. 2016. vol. 40, no. 4, pp. 834–848, Jun. 2016.
[164] Y. Chen, Y. Kalantidis, J. Li, S. Yan, and J. Feng, “A2 -Nets: Double [192] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional
attention networks,” in Proc. NeurIPS, 2018, pp. 352–361. networks for biomedical image segmentation,” in Proc. Med. Image
[165] A. Krizhevsky et al., “Learning multiple layers of features from tiny Comput. Comput.-Assist. Intervent., 2015, pp. 234–241.
images,” Univ. Toronto, Tech. Rep., 2009. [193] P. Sun et al., “Sparse R-CNN: End-to-end object detection
[166] C. Sun, A. Shrivastava, S. Singh, and A. Gupta, “Revisiting unreason- with learnable proposals,” in Proc. IEEE/CVF CVPR, Jun. 2021,
able effectiveness of data in deep learning era,” in Proc. IEEE ICCV, pp. 14454–14463.
Oct. 2017, pp. 843–852. [194] T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun, “Unified per-
[167] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, ceptual parsing for scene understanding,” in Proc. ECCV, 2018,
“ImageNet: A large-scale hierarchical image database,” in Proc. CVPR, pp. 418–434.
Jun. 2009, pp. 248–255. [195] M. Chen et al., “Searching the search space of vision transformer,” in
[168] Y. Dong, J.-B. Cordonnier, and A. Loukas, “Attention is not all you Proc. NeurIPS, 2021, pp. 1–13.
need: Pure attention loses rank doubly exponentially with depth,” in [196] H. Chen, K. Sun, Z. Tian, C. Shen, Y. Huang, and Y. Yan, “Blend-
Proc. ICLR, 2021, pp. 2793–2803. Mask: Top-down meets bottom-up for instance segmentation,” in Proc.
[169] P. Shaw, J. Uszkoreit, and A. Vaswani, “Self-attention with rel- IEEE/CVF CVPR, Jun. 2020, pp. 8573–8581.
ative position representations,” in Proc. Conf. North Amer. Chap- [197] X. Wang, R. Zhang, T. Kong, L. Li, and C. Shen, “SOLOv2:
ter Assoc. Comput. Linguistics, Hum. Lang. Technol., vol. 2, 2018, Dynamic and fast instance segmentation,” in Proc. NeurIPS, 2020,
pp. 464–468. pp. 17721–17732.
[170] P. W. Battaglia et al., “Relational inductive biases, deep learning, and [198] Y. Wang, R. Huang, S. Song, Z. Huang, and G. Huang, “Not all images
graph networks,” 2018, arXiv:1806.01261. are worth 16 × 16 words: Dynamic transformers for efficient image
[171] S. Abnar, M. Dehghani, and W. Zuidema, “Transferring inductive recognition,” in Proc. NeurIPS, 2021, pp. 11960–11973.
biases through knowledge distillation,” 2020, arXiv:2006.00555. [199] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “PointNet++: Deep hierarchical
[172] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, feature learning on point sets in a metric space,” in Proc. NeurIPS,
“MobileNetV2: Inverted residuals and linear bottlenecks,” in Proc. 2017, pp. 1–10.
IEEE/CVF CVPR, Jun. 2018, pp. 4510–4520. [200] P. Anderson et al., “Bottom-up and top-down attention for image
[173] A. Islam, S. Jia, and N. D. B. Bruce, “How much position information captioning and visual question answering,” in Proc. IEEE/CVF CVPR,
do convolutional neural networks encode?” in Proc. ICLR, 2020, Jun. 2018, pp. 6077–6086.
pp. 1–11. [201] R. Krishna et al., “Visual Genome: Connecting language and vision
[174] J. Lin, C. Gan, and S. Han, “TSM: Temporal shift module for using crowdsourced dense image annotations,” Int. J. Comput. Vis.,
efficient video understanding,” in Proc. IEEE/CVF ICCV, Oct. 2019, vol. 123, no. 1, pp. 32–73, 2017.
pp. 7083–7093. [202] P. Sharma, N. Ding, S. Goodman, and R. Soricut, “Conceptual captions:
[175] Y. Pang, M. Sun, X. Jiang, and X. Li, “Convolution in convolution for A cleaned, hypernymed, image alt-text dataset for automatic image
network in network,” IEEE Trans. Neural Netw. Learn. Syst., vol. 29, captioning,” in Proc. 56th Annu. Meeting Assoc. Comput. Linguistics,
no. 5, pp. 1587–1597, May 2018. 2018, pp. 2556–2565.
[176] J. Gao, D. He, X. Tan, T. Qin, L. Wang, and T.-Y. Liu, “Representation [203] S. Antol et al., “VQA: Visual question answering,” in Proc. IEEE
degeneration problem in training natural language generation models,” ICCV, Dec. 2015, pp. 2425–2433.
in Proc. ICLR, 2019, pp. 1–14. [204] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell:
[177] S. Yun, D. Han, S. Chun, S. J. Oh, Y. Yoo, and J. Choe, “CutMix: Reg- A neural image caption generator,” in Proc. IEEE CVPR, Jun. 2015,
ularization strategy to train strong classifiers with localizable features,” pp. 3156–3164.
in Proc. IEEE/CVF ICCV, Oct. 2019, pp. 6023–6032. [205] P. Achlioptas, A. Abdelreheem, F. Xia, M. Elhoseiny, and L. Guibas,
[178] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin, “ReferIt3D: Neural listeners for fine-grained 3D object identification
“Unsupervised learning of visual features by contrasting cluster assign- in real-world scenes,” in Proc. ECCV, 2020, pp. 422–440.
ments,” in Proc. NeurIPS, 2020, pp. 9912–9924. [206] W. Yu et al., “MetaFormer is actually what you need for vision,” in
[179] C.-F.-R. Chen, Q. Fan, and R. Panda, “CrossViT: Cross-attention multi- Proc. IEEE/CVF CVPR, Jun. 2022, pp. 10819–10829.
scale vision transformer for image classification,” in Proc. IEEE/CVF [207] N. Park and S. Kim, “How do vision transformers work?” in Proc.
ICCV, Oct. 2021, pp. 357–366. ICLR, 2021, pp. 1–26.

Authorized licensed use limited to: Military Institute of Science and Technology. Downloaded on September 12,2024 at 08:52:55 UTC from IEEE Xplore. Restrictions apply.

You might also like