
Received 15 April 2025, accepted 14 May 2025, date of publication 20 May 2025, date of current version 6 June 2025.

Digital Object Identifier 10.1109/ACCESS.2025.3571735

Transformers for Vision: A Survey on Innovative Methods for Computer Vision

BALAMURUGAN PALANISAMY1, VIKAS HASSIJA2, ARPITA CHATTERJEE1, ARPITA MANDAL1, DEBANSHI CHAKRABORTY1, AMIT PANDEY3, G. S. S. CHALAPATHI1, (Senior Member, IEEE), AND DHRUV KUMAR4

1 Department of Electrical and Electronics Engineering, Birla Institute of Technology and Science, Pilani, Pilani Campus, Vidya Vihar, Pilani, Rajasthan 333031, India
2 School of Computer Engineering, Kalinga Institute of Industrial Technology (KIIT) Deemed to be University, Bhubaneswar, Odisha 751024, India
3 School of CSET, Bennett University, Gautam Buddha Nagar, Greater Noida 201310, India
4 Department of Computer Science and Information Systems, Birla Institute of Technology and Science, Pilani, Pilani Campus, Vidya Vihar, Pilani, Rajasthan 333031, India
Corresponding author: Dhruv Kumar ([email protected])
This work was supported by the Birla Institute of Technology and Science, Pilani, India.

ABSTRACT Transformers have emerged as a groundbreaking architecture in the field of computer vision,
offering a compelling alternative to traditional convolutional neural networks (CNNs) by enabling the
modeling of long-range dependencies and global context through self-attention mechanisms. Originally
developed for natural language processing, transformers have now been successfully adapted for a wide
range of vision tasks, leading to significant improvements in performance and generalization. This survey
provides a comprehensive overview of the fundamental principles of transformer architectures, highlighting
the core mechanisms such as self-attention, multi-head attention, and positional encoding that distinguish
them from CNNs. We delve into the theoretical adaptations required to apply transformers to visual
data, including image tokenization and the integration of positional embeddings. A detailed analysis
of key transformer-based vision architectures such as ViT, DeiT, Swin Transformer, PVT, Twins, and
CrossViT is presented, alongside their practical applications in image classification, object detection, video understanding, medical imaging, and cross-modal tasks. The paper further compares the performance of vision transformers with CNNs, examining their respective strengths, limitations, and the emergence of hybrid models. Finally, we discuss current challenges in deploying ViTs, such as computational cost, data efficiency, and interpretability, and explore recent advancements and future research directions, including efficient architectures, self-supervised learning, and multimodal integration.

INDEX TERMS Action recognition, computer vision, convolutional neural networks (CNNs), efficient
transformers, image classification, object detection, segmentation, video processing, vision transformers,
visual reasoning.

I. INTRODUCTION
The field of Computer Vision has been fundamentally shaped by Deep Learning, with Convolutional Neural Networks (CNNs) dominating the landscape for nearly a decade. Since AlexNet's breakthrough in 2012, CNNs have become the de facto standard, powering advances in image classification, object detection, and segmentation through architectures like VGG, ResNet, and EfficientNet. The efficacy of these models in capturing hierarchical visual features can be attributed to their inherent inductive biases, specifically translation equivariance and locality, which confer an ability to learn efficiently from limited data, thereby enhancing their performance in diverse visual recognition tasks.

However, CNNs face inherent limitations. Their local receptive fields struggle with long-range dependencies, and their sequential processing of features creates bottlenecks in modeling global context. These constraints become apparent in tasks requiring holistic understanding, such as scene interpretation or cross-modal reasoning.

FIGURE 1. Organization of the paper.

TABLE 1. Comparison of this survey paper with other literature survey papers.

Meanwhile, the Transformer architecture revolutionized natural language processing (NLP) through self-attention mechanisms, offering parallel processing and dynamic feature weighting. The success of models like BERT and GPT demonstrated Transformers' ability to capture complex relationships in sequential data.

The emergence of Vision Transformers (ViTs) was precipitated by the innovative application of the Transformer's self-attention mechanism to visual data, where images are discretized into sequences of patches. Notwithstanding earlier attempts to combine CNNs with attention mechanisms, as exemplified by Non-Local Networks, the Vision Transformer (ViT) provided conclusive evidence in 2020 that Transformer-based architectures, unadulterated by convolutional components, could outperform CNNs when scaled to a sufficient extent. Moreover, recent innovations, including hierarchical attention in Swin Transformers [6] and masked autoencoding in the Masked AutoEncoder (MAE) [7], have substantially bridged the gaps in data efficiency and computational performance, ushering in a novel era in visual information processing.

This review aims to provide a comprehensive overview of Vision Transformers and their hybrid variants, comparing them with traditional CNNs. We explore their architectural foundations, key innovations, performance benchmarks, and ongoing challenges. By synthesizing recent research, we aim to highlight the strengths and limitations of ViTs and outline future directions for advancing transformer-based vision models. A comparison of this survey paper with other survey papers available in the literature is summarized in Table 1.

The primary contributions of this paper are described below:
1) In-depth examination of the key architectures and variants of ViTs, thereby offering a comprehensive understanding of the evolutionary landscape of ViT models.
2) A comparative analysis of ViTs with traditional CNNs and hybrid approaches, highlighting each paradigm's relative strengths and weaknesses and shedding light on the trade-offs between them.
3) Explores the diverse range of ViT applications, highlighting their potential and versatility in various domains and demonstrating their significant impact on real-world problems.
4) Presents various challenges and open problems inherent to the current ViT architectures, underscoring the existing limitations and identifying key areas that necessitate additional research and development.

5) Provides a concise overview of the recent advancements in the field of ViTs, offering a glimpse into the current state-of-the-art and highlighting the prevailing trends.

The organization of the survey is depicted in Figure 1. This survey is organized to comprehensively explore ViTs, beginning with foundational concepts and progressively advancing to cutting-edge developments. Section I is the introduction that establishes the historical context and motivations for transitioning from CNNs to transformer-based vision models, followed by an explanation of core transformer mechanisms in Section II. Section III lays the theoretical groundwork for adapting transformers to visual data, while Section IV traces their evolutionary trajectory in computer vision. A detailed analysis of key architectures and variants is presented in Section V, followed by their diverse applications in Section VI. The paper then compares ViTs and CNNs critically in Section VII, highlighting hybrid approaches and trade-offs. Sections VIII and IX address current challenges and emerging solutions, concluding with future directions and a synthesis of key insights in Section X. This structure facilitates a logical progression from fundamental principles to advanced research frontiers, enabling readers to understand both the theoretical underpinnings and practical implications of ViTs in computer vision.

II. FUNDAMENTALS OF TRANSFORMERS
The fundamentals of Transformers are built upon the self-attention mechanism, which enables the model to weigh the importance of different input elements relative to each other. The Transformer architecture is shown in Figure 2. Additionally, key components include:
1) Multi-Head Self-Attention
2) Encoder-Decoder Architecture
3) Positional Encoding
4) Feed Forward Network
5) Layer Normalization
These components work in tandem to facilitate the processing of sequential data, such as text or images, and have been instrumental in achieving state-of-the-art results in various Natural Language Processing (NLP) and Computer Vision tasks. The following subsections provide the necessary fundamental details to understand the transformer architecture:

FIGURE 2. Transformer architecture [8].

A. SELF-ATTENTION MECHANISM
The self-attention mechanism is the core computational building block of Transformers, including Vision Transformers (ViTs). It enables a model to weigh and process relationships between different parts of an input sequence, allowing it to capture both local and global dependencies efficiently. Unlike convolutional neural networks (CNNs) that process information using local receptive fields, self-attention [9] computes pairwise interactions between all elements, providing a more flexible and scalable approach to feature extraction.

Given an input sequence (e.g., text tokens or image patches), the self-attention mechanism maps each element in the sequence to a new representation by computing attention scores relative to all other elements. This involves three key steps:
1) Input Representation: For an input sequence X ∈ R^(n×d) (where n is the sequence length and d is the feature dimension), three trainable linear projections are applied to obtain the following:
   • Query (Q): Represents the current element being processed.
   • Key (K): Represents other elements in the sequence for comparison.
   • Value (V): Contains the content to be aggregated based on attention.
   These are computed as:
   Q = X W_Q,  K = X W_K,  V = X W_V    (1)
   where W_Q, W_K, and W_V are learnable weight matrices.
2) Attention Calculation: The core operation is computing the similarity between the query and key vectors using the scaled dot product. This measures how much attention each element should pay to others:
   Attention(Q, K, V) = softmax(Q K^T / √d_k) V    (2)
   The dot product measures the similarity between the query and keys. The scaling factor √d_k normalizes the dot product to stabilize gradients and prevent exploding attention values. The softmax activation converts the scores into a probability distribution, ensuring attention weights sum to 1. The weighted sum ensures each value vector is aggregated based on its attention weight.
3) Output Representation: The attention-weighted sum produces an output sequence with the same length as the input but enriched with global contextual information. This process is repeated in multiple layers to capture hierarchical relationships across the entire input.
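To make the three steps above concrete, the following minimal PyTorch sketch implements the scaled dot-product self-attention of Equations (1)–(2); the tensor shapes and random weights are illustrative only and do not correspond to any particular ViT implementation.

```python
import torch

def self_attention(X, W_q, W_k, W_v):
    """Minimal scaled dot-product self-attention (Eqs. (1)-(2)).

    X:             (n, d) sequence of token embeddings
    W_q, W_k, W_v: (d, d_k) learnable projection matrices
    Returns an (n, d_k) sequence enriched with global context.
    """
    Q = X @ W_q                                     # queries, Eq. (1)
    K = X @ W_k                                     # keys
    V = X @ W_v                                     # values
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # scaled dot product
    weights = torch.softmax(scores, dim=-1)         # rows sum to 1
    return weights @ V                              # attention-weighted sum, Eq. (2)

# Toy usage with random data (n = 4 tokens, d = d_k = 8 features).
n, d = 4, 8
X = torch.randn(n, d)
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # torch.Size([4, 8])
```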


FIGURE 3. Self-Attention mechanism (Left). Multi-Head Attention consists of several attention layers running in parallel (Right) [11].

B. MULTI-HEAD SELF-ATTENTION (MHSA)
To enhance the model's ability to capture different types of relationships, Transformers use multi-head self-attention (MHSA) [10]. Instead of performing attention once, the input is projected into multiple subspaces (or ''heads''), and attention is computed independently in each:
   MHSA(X) = Concat(head_1, . . . , head_h) W_O    (3)
Each head corresponds to a different set of learned projections, and h is the number of attention heads. W_O is a final linear projection combining all heads' outputs. MHSA allows the model to focus on diverse aspects of the input simultaneously, such as local texture patterns in one head and global object structures in another. Figure 3 shows the block diagram of the working of self-attention and multi-head attention.

C. ENCODER-DECODER ARCHITECTURE
The encoder-decoder architecture is a fundamental design pattern used in many deep learning models, including Transformers. Originally introduced for sequence-to-sequence (Seq2Seq) tasks like machine translation, this architecture has been adapted for vision tasks in models like Vision Transformers (ViTs) and image-to-image translation models [12]. The encoder captures essential features from the input by processing the input (e.g., an image or image patches) to create a compact latent representation. The decoder uses the encoded representation to generate a structured output, such as an image, class label, or sequence. The following is the mathematical representation of the Encoder and Decoder functions for a given input X:
   Z = Encoder(X);   Y = Decoder(Z)    (4)

D. POSITIONAL ENCODING
Unlike Convolutional Neural Networks (CNNs), which inherently capture spatial hierarchies through their local receptive fields, Vision Transformers (ViTs) lack an innate understanding of image structure [13]. Since the Transformer architecture was initially designed for sequential data in Natural Language Processing (NLP), where token order is crucial, ViTs must explicitly encode spatial information to preserve the relative positions of image patches. This is achieved through positional encoding (PE), which helps the model distinguish the position of each patch in the input sequence. In the ViT architecture, images are divided into non-overlapping patches and then flattened into 1D sequences of embeddings. While the self-attention mechanism can capture global dependencies, it is position-agnostic, which means it does not retain any information about the original spatial arrangement of patches.

E. LAYER NORMALIZATION
Layer Normalization (LayerNorm) is a crucial component of Vision Transformers (ViTs) that stabilizes training and improves model convergence [14]. Originally introduced for sequence-based tasks in natural language processing, LayerNorm is adapted in ViTs to handle high-dimensional patch embeddings. It normalizes activations within a layer, ensuring stable gradients and preventing internal covariate shift during training. Layer Normalization normalizes the inputs across the feature dimension for each data point independently. Without normalization, the distribution of intermediate representations can shift significantly during training, causing unstable gradients and poor convergence. LayerNorm mitigates these issues by ensuring consistent activation magnitudes. Given an input x ∈ R^d (where d is the feature dimension), the normalized output is computed as:
   x̂_i = ((x_i − µ) / σ) · γ + β    (5)
where:
   • x_i represents the input features.
   • µ is the mean of the input across the feature dimension, computed as µ = (1/d) Σ_{i=1}^{d} x_i.
   • σ is the standard deviation, computed as σ = √((1/d) Σ_{i=1}^{d} (x_i − µ)² + ϵ), with a small constant ϵ for numerical stability.
   • γ and β are learnable parameters that scale and shift the normalized output.
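A hedged sketch of the multi-head variant in Equation (3) is given below: the embedding is split into h heads, attention is computed in each subspace, and the concatenated heads are mixed by a final projection W_O. The module and dimension names are illustrative. For Equation (5), PyTorch's nn.LayerNorm provides the same per-token normalization with learnable γ and β.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Minimal MHSA block implementing Eq. (3): attention is run in h
    parallel subspaces and the concatenated heads are mixed by W_O."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0, "dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)   # fused W_Q, W_K, W_V
        self.proj = nn.Linear(dim, dim)      # W_O

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, D = x.shape                    # (batch, tokens, dim)
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)            # each (B, h, N, d_head)
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        weights = scores.softmax(dim=-1)
        heads = weights @ v                             # (B, h, N, d_head)
        heads = heads.transpose(1, 2).reshape(B, N, D)  # Concat(head_1..head_h)
        return self.proj(heads)                         # apply W_O

# Toy usage: a batch of 2 sequences of 16 tokens with dim 64 and 8 heads.
mhsa = MultiHeadSelfAttention(dim=64, num_heads=8)
print(mhsa(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```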


FIGURE 4. ViT model overview. The image is split into fixed-size patches, each patch is linearly embedded, position embeddings are added, and the resulting sequence of vectors is fed to a standard Transformer encoder [8].

F. FEED FORWARD NETWORK
The Feed Forward Network (FFN) [15] is a crucial component of the Vision Transformer (ViT) architecture, situated after the multi-head self-attention (MHSA) mechanism in each Transformer block. Its primary function is to apply point-wise transformations to the token representations, allowing the model to capture non-linear dependencies and complex feature interactions. This section elaborates on the structure, function, and significance of FFNs within the ViT framework. Each FFN in a Transformer block consists of two main layers:
1) Linear Projection Layers: These are fully connected layers that transform the input into a higher-dimensional space and then project it back to the original dimension.
2) Activation Function: Introduces non-linearity, allowing the model to capture complex patterns. ViTs typically use the GELU (Gaussian Error Linear Unit) activation for improved performance.
Given an input token representation X ∈ R^d, the FFN applies the following operations:
   FFN(X) = W_2 σ(W_1 X + b_1) + b_2    (6)
where:
   • W_1 and W_2 are weight matrices.
   • b_1 and b_2 are bias vectors.
   • σ(·) represents a non-linear activation function (commonly GELU).
In each Transformer block, the FFN follows the multi-head self-attention (MHSA) layer and is sandwiched between two Layer Normalization (LN) steps, as shown:
1) Input Processing: Token embeddings enter the MHSA module.
2) Self-Attention: Captures global relationships between patches.
3) Add & Norm: Residual connection and layer normalization.
4) Feed Forward Network (FFN): Applies point-wise transformation to each token.
5) Add & Norm: Another residual connection and layer normalization.
The overall output of a Transformer block is computed as:
   Output = LN(FFN(LN(MHSA(X) + X)) + X)    (7)
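Assembling the components of this section, the sketch below is a minimal encoder block that follows the ordering written in Equation (7) literally (the final residual adds the block input X); note that many practical ViT implementations instead use a pre-norm arrangement, and all dimensions here are illustrative.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One encoder block following Eq. (7):
    Output = LN(FFN(LN(MHSA(X) + X)) + X).
    The 4x FFN expansion and the default sizes are illustrative."""

    def __init__(self, dim: int = 64, num_heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(            # Eq. (6): W_2 GELU(W_1 x + b_1) + b_2
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x)     # MHSA(X)
        h = self.norm1(attn_out + x)         # LN(MHSA(X) + X)
        # Eq. (7) literally adds the block input X in the outer residual;
        # standard post-norm blocks add h instead, and many ViTs use pre-norm.
        return self.norm2(self.ffn(h) + x)

block = TransformerBlock()
tokens = torch.randn(2, 16, 64)              # (batch, tokens, dim)
print(block(tokens).shape)                   # torch.Size([2, 16, 64])
```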


III. VISION TRANSFORMERS: THEORETICAL FOUNDATION
Vision Transformers (ViTs) adapt the Transformer architecture, which was originally designed for sequential data in natural language processing (NLP), to 2D image data, introducing several unique theoretical considerations. A key adaptation involves image tokenization, where an input image is divided into fixed-size patches that are flattened and projected to form a sequence of tokens. This transformation allows images to be processed in a structured format compatible with the Transformer framework. Another essential modification is the use of 2D positional encodings to preserve spatial information, ensuring the model captures the relative position of each patch. These adaptations enable ViTs to model long-range dependencies across an image effectively. Understanding these foundational changes provides insight into how ViTs extend the Transformer's capabilities to visual data. Figure 4 shows an overview of the ViT model. The following subsections briefly explain the theoretical foundations necessary to understand ViTs better:

A. ADAPTING TRANSFORMERS FOR VISION TASKS
The Transformer architecture, originally designed for natural language processing (NLP), operates on one-dimensional sequences of tokens, making it inherently different from the structured nature of images, which exist as two-dimensional grids with complex spatial hierarchies. Adapting Transformers to computer vision introduces several fundamental challenges that must be addressed to achieve effective learning and inference [16].

1) IMAGE-TO-SEQUENCE CONVERSION (PATCH EMBEDDINGS)
Unlike NLP models that process discrete word tokens, images lack a natural tokenization process. Vision Transformers (ViTs) overcome this by partitioning an image into small, non-overlapping patches, which are then flattened and projected into a latent embedding space using a trainable linear layer. This transformation converts the structured 2D image into a 1D sequence of patch embeddings, making it compatible with the Transformer framework [17]. However, this approach discards pixel-level details and imposes constraints on capturing fine-grained local patterns, necessitating additional mechanisms to compensate for lost spatial information.

2) PRESERVING SPATIAL RELATIONSHIPS (POSITIONAL ENCODING)
Transformers lack an intrinsic sense of order, unlike convolutional neural networks (CNNs), which inherently model spatial hierarchies through localized receptive fields. In NLP, positional encodings are used to incorporate word order into the model, but extending this concept to images requires adapting positional embeddings to a 2D domain [18]. ViTs employ 2D positional encodings, which are either learnable or fixed sinusoidal functions, to retain spatial information about patch locations within the image. The effectiveness of these encodings directly impacts the model's ability to understand structural relationships between different regions of the image.

One of the major challenges in scaling Transformers to vision tasks is the computational cost of self-attention. Standard self-attention mechanisms exhibit O(N²) complexity [19], where N represents the number of tokens. In Vision Transformers (ViTs), since an image is tokenized into numerous patches, self-attention computations become prohibitively expensive, especially for high-resolution images. This quadratic scaling in memory and computation restricts the feasibility of training and deploying ViTs on resource-constrained hardware.

To overcome these challenges, researchers have introduced various efficiency-driven innovations that allow Transformers to achieve state-of-the-art performance in vision tasks. These innovations primarily focus on reducing computational overhead while preserving performance, ensuring that ViTs remain practical for real-world applications. Key advancements include:
1) Patch-Based Tokenization: Instead of processing raw pixel values or feature maps extracted from convolutional layers, ViTs divide images into fixed-size patches, treating them as input tokens. This significantly reduces sequence length compared to a per-pixel representation, making computations more feasible. Additionally, hybrid models integrate convolutional feature extractors before applying self-attention, enhancing local feature representation while maintaining the global reasoning capability of Transformers.
2) 2D-Aware Positional Encoding: Unlike NLP Transformers, which use 1D positional embeddings to capture word order, ViTs require 2D positional encodings to retain spatial relationships between image patches. These encodings can be absolute (fixed embeddings based on patch positions) or relative (dynamically adjusted based on inter-patch relationships). Some models further employ learnable embeddings optimized during training, improving adaptability to different image resolutions.
3) Hierarchical Attention Mechanisms: To enhance efficiency and feature extraction, recent advancements such as those seen in Swin Transformers introduce hierarchical attention. Unlike standard ViTs that treat all patches equally, hierarchical models organize patches into progressively coarser levels, resembling the feature pyramids in CNNs. The shifted window attention mechanism, a core technique in Swin Transformers, restricts self-attention computations to local windows while allowing cross-window interactions. This approach significantly reduces computational overhead while preserving long-range dependencies.
By incorporating these innovations, Vision Transformers have demonstrated strong performance across various computer vision tasks, including image classification, object detection, and segmentation. However, ongoing research continues to refine their efficiency and adaptability, paving the way for broader real-world deployment, even in resource-constrained environments.

B. IMAGE PATCH TOKENIZATION
In Vision Transformers (ViTs), image patch tokenization is the process of converting a 2D image into a sequence of token embeddings. This transformation allows images to be processed by the self-attention mechanism of the Transformer model, which traditionally handles 1D sequences [20]. The patch tokenization method is central to how ViTs ''see'' and interpret visual data. The following are the steps involved:
1) Splitting the Image into Patches: Given an input image X ∈ R^(H×W×C), where H is the image height, W is the image width, and C is the number of color channels, the image is divided into N non-overlapping square patches of size P × P (where P is the patch size). The number of patches is calculated as:
   N = (H / P) × (W / P)    (8)
   Each patch x_p ∈ R^(P×P×C) captures a local region of the image, providing a manageable unit of information for subsequent processing.
2) Flattening and Projecting Patches: Once the image is divided into patches, each patch x_p is flattened into a 1D vector of size P² × C, representing the raw pixel information. To prepare these vectors for the Transformer model, they are linearly projected into a latent space of dimension d using a learnable projection matrix:
   z_p = Linear(x_p),  z_p ∈ R^d    (9)
   This operation produces a collection of patch embeddings Z ∈ R^(N×d), where each row represents the embedding of a specific patch. These patch embeddings serve as input tokens to the Transformer model, analogous to word embeddings in NLP.


FIGURE 5. Evolution of vision transformers (ViTs): A Timeline of Key Models from 2020 to 2024 and beyond.

3) Incorporating the [CLS] Token for Classification: Inspired by BERT in NLP, ViTs introduce a learnable class token z_cls to the sequence of patch embeddings. This token is prepended to the patch sequence and serves as a global representation of the entire image. During training, the model learns to store information relevant to classification within this token. The final input sequence to the Transformer is:
   Z_final = [z_cls; z_1, . . . , z_N] ∈ R^((N+1)×d)    (10)
   At the output stage, the model uses the representation of z_cls for downstream classification tasks.

C. POSITIONAL EMBEDDINGS IN ViTs
Transformers lack an inherent understanding of the spatial structure of their input because the self-attention mechanism is permutation-invariant. While this is suitable for natural language processing (NLP) sequences where token order is explicitly encoded, images possess a 2D grid structure that must be preserved for accurate visual interpretation. Positional embeddings (PEs) address this limitation by encoding spatial information and adding it to the patch embeddings, allowing the model to maintain spatial awareness during training and inference [21].

In ViTs, image patches are tokenized and flattened into a 1D sequence. However, the spatial arrangement of these patches is lost during this process. Without positional embeddings, the model would treat all patches as an unordered collection, failing to capture relationships between nearby and distant regions of the image. By incorporating positional information, ViTs can preserve spatial relationships between image patches, distinguish patch locations, enable the model to learn spatial hierarchies, and enhance performance on vision tasks by understanding object structure and context. Given a sequence of patch embeddings Z ∈ R^(N×d) (where N is the number of patches and d is the embedding dimension), a positional embedding matrix P ∈ R^(N×d) is added element-wise:
   Z_input = Z + P    (11)
This enriched representation, combining both patch content and spatial location, is then fed into the Transformer layers. Additionally, the [CLS] token (used for classification) receives a dedicated positional embedding.
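The tokenization and embedding steps of Equations (8)–(11) can be summarized in a single illustrative module, sketched below under assumed default sizes (224×224 images, 16×16 patches, embedding dimension 192); it is not the implementation of any specific ViT variant.

```python
import torch
import torch.nn as nn

class ViTEmbedding(nn.Module):
    """Builds the ViT input sequence of Eqs. (8)-(11): patchify, project,
    prepend a [CLS] token, and add learnable positional embeddings."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=192):
        super().__init__()
        self.patch_size = patch_size
        num_patches = (img_size // patch_size) ** 2                     # Eq. (8)
        self.proj = nn.Linear(patch_size * patch_size * in_chans, dim)  # Eq. (9)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))           # z_cls
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # P

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        P = self.patch_size
        # Split into non-overlapping P x P patches and flatten each one.
        patches = x.unfold(2, P, P).unfold(3, P, P)      # (B, C, H/P, W/P, P, P)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)
        z = self.proj(patches)                           # (B, N, dim)
        cls = self.cls_token.expand(B, -1, -1)           # one [CLS] per image
        z = torch.cat([cls, z], dim=1)                   # Eq. (10)
        return z + self.pos_embed                        # Eq. (11)

emb = ViTEmbedding()
print(emb(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 197, 192])
```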


D. COMPUTATIONAL EFFICIENCY AND SCALABILITY CONSIDERATIONS
Vision Transformers (ViTs) face significant computational challenges due to the quadratic complexity of the self-attention mechanism, which scales poorly with increasing image resolution. To improve efficiency and scalability, several strategies have been developed. Hierarchical models like Swin Transformers use local attention windows to reduce computational cost, while linear and sparse attention techniques (e.g., Linformer, Performer) approximate attention calculations to achieve linear complexity. Token reduction methods, such as patch merging and dynamic pruning, further decrease the sequence length during processing. Hybrid models combining CNNs and Transformers leverage the efficiency of CNNs for local feature extraction while maintaining the global modeling capabilities of Transformers. Efficient scaling requires balancing patch size, model depth, and training data volume, with lightweight variants like MobileViT optimized for resource-constrained devices. These innovations make ViTs more practical for large-scale vision tasks and real-world applications while retaining their strong performance [22].

IV. EVOLUTION OF TRANSFORMERS FOR VISION
The evolution of Transformers for vision has undergone significant transformations, from the introduction of Vision Transformers (ViTs) to the development of variants like Swin Transformers and MAE. These advancements have enabled efficient and effective image recognition, processing, and generation capabilities. The incorporation of self-attention mechanisms and hierarchical architectures has further enhanced the performance of ViTs [23]. As a result, ViTs have become a cornerstone of computer vision research, driving innovation and progress in the field. Figure 5 shows the milestones in ViT development from 2020 to 2024. The following subsections provide a brief study of this evolution:

A. 2020: THE BIRTH OF VISION TRANSFORMERS
The introduction of the Vision Transformer (ViT) by Dosovitskiy et al. marked a significant milestone in computer vision. ViT treated an image as a sequence of patches and applied the transformer architecture, previously dominant in NLP, to visual tasks. Unlike traditional convolutional neural networks (CNNs), ViT leveraged self-attention mechanisms to capture global dependencies in an image, significantly improving classification performance when trained on large-scale datasets like ImageNet. However, ViT required massive computational resources, making it impractical for smaller datasets and real-world applications [1].

To address this limitation, the Data-efficient Image Transformer (DeiT) was introduced by Facebook AI in 2020. DeiT improved ViT's training efficiency by incorporating knowledge distillation, which allowed smaller models to learn from larger teacher models. This breakthrough made ViTs accessible to a broader audience, enabling their deployment in scenarios with limited data.

B. 2021: SCALING AND REFINING VISION TRANSFORMERS
With the success of ViT, researchers focused on enhancing efficiency and adaptability. The Swin Transformer, developed by Microsoft, introduced a hierarchical feature representation and shifted windows, enabling the model to process images at varying scales. This hierarchical approach allowed the Swin Transformer to outperform ViT in segmentation and object detection tasks [1], making it a strong competitor against CNNs. Swin Transformer's ability to handle variable image sizes efficiently made it a leading architecture for real-world vision applications.

Another key development was the Pyramid Vision Transformer (PVT), which integrated a pyramid structure into the transformer design, mirroring the multi-scale feature extraction process of CNNs. PVT enabled ViTs to handle dense prediction tasks like object detection and semantic segmentation more effectively, making it more efficient for vision applications beyond classification [24].

Meanwhile, BoTNet (Bottleneck Transformers) combined CNNs with transformers, leveraging self-attention in the later stages of a ResNet-like architecture. This hybrid approach demonstrated that transformers could complement, rather than replace, CNNs in visual tasks, providing both efficiency and superior feature extraction capabilities.

C. 2022: ENHANCING FEATURE EXTRACTION AND PERFORMANCE
As transformers continued to gain traction, researchers sought ways to improve their computational efficiency and feature extraction capabilities. T2TViT (Tokens-to-Token ViT) refined the patch embedding process in ViTs by introducing a progressive tokenization mechanism. This method improved feature representation, allowing for better performance in image classification and recognition tasks [25].

CrossViT, another innovative model, introduced a dual-branch architecture that processed both small- and large-scale image patches simultaneously. By combining information from different resolutions, CrossViT enhanced the model's ability to capture both local and global image features, improving classification accuracy.

Meanwhile, Swin Transformer V2 was introduced as an upgraded version of the Swin Transformer, optimizing its efficiency for large-scale datasets and high-resolution images. Swin V2 incorporated novel normalization techniques and scaling strategies to improve robustness and adaptability across diverse vision tasks.

D. 2023: OPTIMIZING ATTENTION MECHANISMS AND HYBRID APPROACHES
In 2023, researchers focused on improving attention mechanisms and making transformers more computationally efficient. Focal Transformer introduced focal self-attention, a mechanism that prioritizes the most relevant parts of an image while reducing computational overhead. This optimization made it more effective for tasks requiring fine-grained attention, such as object detection and segmentation.

Hybrid architectures also gained momentum, with models like ConvNeXt blending CNNs and transformers to achieve state-of-the-art performance in image classification and object detection. ConvNeXt demonstrated that CNN-based inductive biases could still play a crucial role in visual tasks, even in the era of transformers [26].

MaxViT emerged as a powerful transformer model, introducing a multi-axis attention mechanism that captured both local and global dependencies simultaneously. This innovation significantly improved ViTs' ability to process complex scenes and high-resolution images while maintaining efficiency.

E. 2024: ADVANCEMENTS IN EFFICIENCY, EDGE AI, AND SELF-SUPERVISED LEARNING
In 2024, Vision Transformers continued evolving toward efficiency, edge computing, and self-supervised learning. DINOv2 (Meta AI) was introduced as an advanced self-supervised vision model, allowing transformers to achieve remarkable performance in image classification and object detection without relying on large labeled datasets. DINOv2 leveraged self-supervised learning techniques to extract meaningful visual representations, making it highly effective in data-scarce environments [27].

EfficientViT addressed the growing need for computationally lightweight transformers by reducing the number of parameters and optimizing hardware efficiency, making it suitable for real-time applications. EfficientViT was particularly impactful for embedded systems, robotics, and real-time video analysis.


FIGURE 6. Key ViT architectures and their variants.

For mobile and edge devices, MobileViT V2 introduced an optimized transformer architecture that maintained high accuracy while operating within low-power constraints. By improving energy efficiency, MobileViT V2 enabled the deployment of transformer models on smartphones, IoT devices, and augmented reality (AR) applications.

As transformers continue to evolve, their adaptability across multiple domains solidifies their position as the future of computer vision and AI-driven perception.

V. KEY ARCHITECTURES AND VARIANTS
Vision Transformers (ViTs) have emerged as a powerful alternative to CNNs for various computer vision tasks. Unlike CNNs, which rely on local feature extraction, ViTs leverage self-attention mechanisms to model global relationships across an image. They take advantage of the transformer architecture, initially designed for natural language processing (NLP), to process image data. Since the introduction of the original ViT, numerous variants have been developed to improve efficiency, scalability, and performance. In the following, we outline the key architectures and their improvements; Figure 6 summarizes the key architectures of ViT and its variants.

A. DATA EFFICIENCY AND TRAINING OPTIMIZATION
These architectures follow the standard ViT design but introduce modifications to improve training efficiency [28], scalability, or robustness. Some of the standard ViT variants are briefly described below:

1) DATA-EFFICIENT IMAGE TRANSFORMER (DeiT)
The Data-Efficient Image Transformer (DeiT), developed by Meta AI, is a variant of the Vision Transformer (ViT) designed to address ViT's heavy reliance on large-scale datasets for effective training. Unlike the original ViT, which required massive pretraining on datasets like JFT-300M or ImageNet-21k, DeiT is capable of achieving competitive performance when trained from scratch on ImageNet-1k, a significantly smaller dataset. A key innovation of DeiT [29] is the introduction of a distillation token, which enables the model to benefit from knowledge distillation by learning from a CNN-based teacher model. Figure 7 shows the distillation procedure, which includes a distillation token alongside the class token to reproduce the label predicted by the teacher instead of the true label. This approach improves training efficiency, generalization, and convergence speed without requiring additional labeled data.

DeiT retains the self-attention-based transformer architecture but incorporates several optimization techniques, including RandAugment, Mixup, CutMix, and stochastic depth, to improve robustness and data efficiency. It comes in different variants, such as DeiT-Ti [32] (Tiny), DeiT-S (Small), and DeiT-B (Base), balancing accuracy and computational cost for various applications. DeiT outperforms traditional CNNs like ResNet-50 in image classification while maintaining a comparable model size and efficiency, making it suitable for edge AI and real-world deployment. By significantly reducing the data requirements for ViTs, DeiT paves the way for wider adoption of transformer-based architectures in computer vision, bridging the gap between CNN efficiency and transformer scalability.

FIGURE 7. Distillation procedure including the distillation token. The objective is to reproduce the (hard) label predicted by the teacher, instead of the true label (based on [30]).
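To illustrate how a distillation token can be trained, the hedged sketch below shows a hard-label distillation objective of the kind DeiT popularized: the class token is supervised with the ground-truth label and the distillation token with the teacher's hard prediction. The equal weighting of the two terms and the tensor names are illustrative assumptions rather than DeiT's exact recipe.

```python
import torch
import torch.nn.functional as F

def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels):
    """Hard-label distillation objective (illustrative).

    cls_logits:     (B, num_classes) predictions from the class token
    dist_logits:    (B, num_classes) predictions from the distillation token
    teacher_logits: (B, num_classes) outputs of a frozen CNN teacher
    labels:         (B,) ground-truth class indices
    """
    teacher_labels = teacher_logits.argmax(dim=-1)             # teacher's hard label
    loss_cls = F.cross_entropy(cls_logits, labels)             # class token vs. ground truth
    loss_dist = F.cross_entropy(dist_logits, teacher_labels)   # distillation token vs. teacher
    return 0.5 * (loss_cls + loss_dist)                        # assumed equal weighting

# Toy usage with random logits for a batch of 4 images and 10 classes.
B, C = 4, 10
loss = hard_distillation_loss(torch.randn(B, C), torch.randn(B, C),
                              torch.randn(B, C), torch.randint(0, C, (B,)))
print(float(loss))
```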


2) BEiT (BERT PRETRAINING FOR VISION TRANSFORMERS)
BERT Pretraining for Vision Transformers (BEiT) [33] is a self-supervised learning framework that adapts BERT-style masked pretraining for Vision Transformers (ViTs). Introduced by Microsoft Research, BEiT follows the concept of Masked Image Modeling (MIM), where random patches of an input image are masked and the model is trained to predict their corresponding visual representations. Unlike traditional supervised learning, BEiT learns rich, transferable visual representations without requiring labeled data, making it particularly effective for pretraining ViTs on large-scale image datasets.

BEiT consists of two main stages: pretraining and fine-tuning. During pretraining, a ViT model learns to reconstruct masked patches by predicting discrete visual tokens derived from a pre-trained tokenizer (e.g., the dVAE from DALL·E) rather than raw pixel values. This forces the model to develop a semantic understanding of objects and textures. In the fine-tuning phase, the pretrained BEiT model is adapted for downstream tasks like image classification, object detection, and segmentation. By leveraging self-supervised masked pretraining, BEiT significantly improves the efficiency and generalization of ViTs, making them more data-efficient and competitive with CNNs in vision tasks.

B. HIERARCHICAL AND MULTI-SCALE ARCHITECTURES
Hierarchical and multi-scale architectures in Vision Transformers (ViTs) are designed to capture features at multiple scales, making them particularly effective for tasks like object detection, segmentation, and other dense prediction tasks [34]. These architectures often mimic the pyramidal structure of Convolutional Neural Networks (CNNs), where features are extracted at progressively higher levels of abstraction. Below is a detailed explanation of the key architectures in this category:

1) SWIN TRANSFORMER
The Swin Transformer (Shifted Window Transformer) [31] is a hierarchical Vision Transformer (ViT) designed to improve efficiency, scalability, and performance for computer vision tasks such as image classification, object detection, and segmentation. Introduced by Microsoft Research, it addresses the computational inefficiency of standard ViTs, which use global self-attention and struggle with high-resolution images. The Swin Transformer introduces a hierarchical feature extraction mechanism, similar to CNNs, where feature maps are progressively reduced in size across layers, making it highly efficient for dense prediction tasks. Figure 8 shows the architecture of the Swin Transformer.

A key innovation of the Swin Transformer is the shifted window attention mechanism, where self-attention is computed within non-overlapping local windows to reduce complexity, followed by a shifted window approach that allows information exchange between neighboring windows. This enables the Swin Transformer to capture both local and global dependencies while significantly reducing computational cost compared to standard ViTs. With its hierarchical structure and efficient self-attention, the Swin Transformer has become a foundational architecture for vision tasks, outperforming CNN-based models like ResNet and enabling state-of-the-art results in image classification, object detection (Faster R-CNN, Cascade R-CNN), and segmentation (Swin-UNet, Mask R-CNN).
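A rough sketch of the window partitioning that underlies window-based attention is shown below: the feature map is split into non-overlapping windows so that self-attention can be restricted to each window. The shifted-window step of the actual Swin Transformer (cyclically rolling the map between consecutive layers, with masking at the borders) is only indicated in a comment; shapes and names are illustrative.

```python
import torch

def window_partition(x: torch.Tensor, window_size: int) -> torch.Tensor:
    """Split a feature map (B, H, W, C) into non-overlapping windows of
    shape (num_windows*B, window_size*window_size, C). Self-attention can
    then be applied to each window independently, reducing the cost from
    O((H*W)^2) to O(H*W * window_size^2)."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)
    return windows

# Toy usage: a 2 x 8 x 8 x 32 feature map split into 4 x 4 windows.
feat = torch.randn(2, 8, 8, 32)
win = window_partition(feat, window_size=4)
print(win.shape)  # torch.Size([8, 16, 32]) -> 4 windows per image, 16 tokens each

# In a Swin-style block, alternate layers would first roll the feature map
# (e.g., torch.roll(feat, shifts=(-2, -2), dims=(1, 2))) so that the next
# window partition mixes information across the previous window boundaries.
```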


FIGURE 8. The architecture of a Swin Transformer (Swin-T) (based on [31]).

FIGURE 9. Architecture of MobileViT (based on [35]).

2) PYRAMID VISION TRANSFORMER (PVT)
The Pyramid Vision Transformer (PVT) [36] is a hierarchical Vision Transformer designed to improve efficiency and scalability for dense vision tasks such as object detection and segmentation. Unlike the standard Vision Transformer (ViT), which maintains a fixed sequence length throughout its layers, PVT progressively reduces the number of tokens as the network deepens, similar to the feature pyramid structure used in CNNs. This hierarchical tokenization enables multi-scale feature representation, making PVT more suitable for vision tasks that require fine-grained spatial details.

A key innovation of PVT is its spatial-reduction attention (SRA), which reduces the number of tokens in deeper layers while still capturing global information. This significantly lowers memory and computational costs compared to traditional ViTs, making PVT more efficient for high-resolution images. Due to its strong feature representation and scalability, PVT has been widely adopted for tasks such as semantic segmentation (e.g., PVT + UPerNet), object detection (e.g., PVT + RetinaNet), and dense prediction applications, outperforming CNN-based architectures in these domains. Its hierarchical structure allows it to bridge the gap between ViTs and CNNs, offering the benefits of global self-attention while maintaining efficiency for real-world applications.
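The spatial-reduction idea behind PVT's SRA can be sketched as follows: keys and values are computed from a spatially downsampled copy of the token map while queries keep full resolution, shrinking the attention matrix by roughly the square of the reduction ratio. The reduction operator (a strided convolution here) and all dimensions are illustrative assumptions rather than PVT's exact configuration.

```python
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    """Illustrative spatial-reduction attention: keys and values come from a
    downsampled token map, so the attention matrix is N x (N / sr_ratio^2)
    instead of N x N."""

    def __init__(self, dim=64, num_heads=4, sr_ratio=2):
        super().__init__()
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, H: int, W: int) -> torch.Tensor:
        # x: (B, N, dim) with N = H * W tokens
        B, N, D = x.shape
        feat = x.transpose(1, 2).reshape(B, D, H, W)
        reduced = self.sr(feat).flatten(2).transpose(1, 2)   # fewer key/value tokens
        out, _ = self.attn(x, reduced, reduced)              # queries keep full resolution
        return out

sra = SpatialReductionAttention()
tokens = torch.randn(2, 64, 64)       # an 8 x 8 token map with dim 64
print(sra(tokens, H=8, W=8).shape)    # torch.Size([2, 64, 64])
```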


3) TWINS TRANSFORMER
The Twins Transformer [37] is a hierarchical Vision Transformer (ViT) that improves the efficiency and scalability of vision models by integrating both local and global self-attention mechanisms. Unlike standard ViTs, which rely on global self-attention throughout, the Twins Transformer introduces a dual-branch architecture that balances efficiency and expressiveness by combining locally-grouped self-attention (LCSA) for fine-grained spatial feature extraction and global sub-sampled attention (GSA) for capturing long-range dependencies.

A key advantage of the Twins Transformer is its hierarchical multi-scale structure, similar to CNNs, which enables better dense prediction tasks such as object detection and segmentation. By limiting self-attention to localized regions while incorporating periodic global attention, Twins achieves lower computational complexity than standard ViTs while maintaining strong feature representation. This makes it particularly effective in vision tasks such as image classification, object detection (Twins + Faster R-CNN), and segmentation (Twins + Mask R-CNN, UPerNet). By striking a balance between local feature extraction and global context understanding, the Twins Transformer serves as a highly efficient alternative to traditional ViTs and CNNs for large-scale vision applications.

4) CrossViT (CROSS-SCALE VISION TRANSFORMER)
The Cross-Scale Vision Transformer (CrossViT) is a multi-scale Vision Transformer [38] designed to improve representation learning by processing images at multiple resolutions simultaneously. Unlike standard ViTs, which operate on a single fixed-size patch embedding, CrossViT introduces a dual-branch architecture, where one branch processes larger patches (coarse features) for global context, while the other processes smaller patches (fine details) for local feature extraction. This cross-scale fusion allows the model to capture both fine-grained and high-level semantic information, leading to improved performance on various vision tasks.

A key innovation in CrossViT is the Cross-Attention Module (CAM), which enables efficient information exchange between the coarse and fine branches. By integrating multi-scale feature representations, CrossViT achieves better generalization and robustness compared to standard ViTs, especially for classification, object detection, and segmentation tasks. It has been shown to outperform traditional ViTs and CNNs on image classification benchmarks while maintaining comparable efficiency. CrossViT is particularly useful for tasks where multi-scale information is crucial, such as scene understanding, medical imaging, and fine-grained object recognition.

C. EFFICIENCY AND LIGHTWEIGHT MODELS
Efficiency-oriented and lightweight Vision Transformers (ViTs) focus on reducing computational complexity and memory usage while maintaining competitive performance. These models are designed for resource-constrained environments, such as mobile devices, edge computing, and real-time applications. Below is a detailed explanation of the key architectures in this category:

1) MobileViT
MobileViT [35] is a lightweight Vision Transformer designed for mobile and edge devices, combining the strengths of convolutions and self-attention to achieve high performance while maintaining efficiency. Figure 9 shows the architecture of MobileViT, depicting its various building blocks. Traditional ViTs require extensive computations, making them unsuitable for real-time applications on resource-constrained devices. To address this, MobileViT introduces a hybrid CNN-Transformer architecture, where convolutional layers are used for early feature extraction and MobileViT blocks replace standard convolutional layers in deeper network stages, enabling global self-attention with minimal computational overhead.

A key innovation of MobileViT is its ability to learn both local and global representations efficiently. It first applies convolutional layers to extract local features, then unfolds feature maps into sequences and processes them using transformer-based self-attention before refolding them back into a spatial representation. This design allows MobileViT to retain the inductive biases of CNNs while leveraging the global context-awareness of transformers. MobileViT achieves state-of-the-art performance on image classification (ImageNet), object detection, and segmentation, all while being lightweight, fast, and memory-efficient, making it ideal for applications such as real-time AI, autonomous systems, and mobile vision tasks.

2) LeViT
LeViT [39] is a hybrid Vision Transformer (ViT) designed to be fast, efficient, and scalable for mobile and edge computing applications. Unlike standard ViTs, which suffer from high computational costs and memory usage, LeViT introduces a convolution-Transformer hybrid approach, optimizing both speed and accuracy. It combines convolutional layers for early feature extraction with lightweight transformer blocks for efficient global self-attention, resulting in a model that is significantly faster than traditional ViTs while maintaining competitive accuracy.

A key innovation in LeViT is the use of hierarchical patch embeddings and downsampling to progressively reduce the number of tokens, thereby reducing computational complexity. Additionally, it replaces standard Multi-Head Self-Attention (MHSA) with an optimized version, making it more efficient for high-resolution images. LeViT achieves better latency and throughput compared to traditional ViTs, making it suitable for real-time applications such as object detection, facial recognition, and edge AI tasks. By striking a balance between performance, efficiency, and computational cost, LeViT serves as a practical alternative to CNNs and standard transformers in low-power environments.

3) CvT (CONVOLUTIONAL VISION TRANSFORMER)
The Convolutional Vision Transformer (CvT) [15] is a hybrid Vision Transformer that integrates convolutional layers into the standard ViT framework to improve efficiency, feature extraction, and inductive biases. Introduced by researchers from Microsoft, CvT enhances tokenization, local feature extraction, and spatial hierarchies by leveraging convolutions in both the patch embedding stage and the self-attention layers. This modification helps address the shortcomings of traditional ViTs, which lack local spatial priors and require large datasets for effective training.

A key innovation in CvT is the Convolutional Token Embedding, which applies overlapping convolutions instead of simple linear projections, improving local feature representation and reducing computational overhead. Additionally, CvT modifies the self-attention mechanism by introducing depthwise convolutions within the projection layers, making the transformer more efficient and structurally similar to CNNs. These improvements enable CvT to outperform standard ViTs and CNNs in image classification, object detection, and segmentation tasks, while maintaining a lower computational footprint. By combining the strengths of CNNs and transformers, CvT offers an effective balance between accuracy, efficiency, and scalability for real-world vision applications.
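To illustrate the convolutional token embedding used by CvT-style hybrids, the sketch below replaces the linear patch projection with an overlapping strided convolution whose output is flattened into a token sequence; the kernel size, stride, and dimensions are illustrative assumptions rather than CvT's exact configuration.

```python
import torch
import torch.nn as nn

class ConvTokenEmbedding(nn.Module):
    """Convolutional token embedding (illustrative, CvT-style): an
    overlapping strided convolution produces a feature map that is then
    flattened into a sequence of tokens for the transformer stages."""

    def __init__(self, in_chans=3, dim=64, kernel_size=7, stride=4, padding=3):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, dim, kernel_size, stride, padding)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.proj(x)                       # (B, dim, H', W')
        B, D, H, W = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)  # (B, H'*W', dim)
        return self.norm(tokens)

emb = ConvTokenEmbedding()
print(emb(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 3136, 64])
```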


TABLE 2. Comparison of vision transformer (ViT) architectures based on parameters, throughput, accuracy, and computational complexity.

D. ATTENTION MECHANISM INNOVATIONS
Attention mechanism innovations in Vision Transformers (ViTs) focus on improving the efficiency, scalability, and effectiveness of self-attention, which is the core component of transformer architectures. These innovations address challenges such as computational complexity, long-range dependency modeling, and task-specific feature extraction. Below is a detailed explanation of the key architectures in this category:

1) CaiT (CLASS-ATTENTION IN IMAGE TRANSFORMERS)
The Class-Attention in Image Transformers (CaiT) model is an improved variant of the Vision Transformer (ViT) introduced to enhance training stability and performance in deep transformer models [40]. Unlike standard ViTs, where the classification token (CLS token) interacts with all layers, CaiT introduces class-attention layers that are placed at the end of the transformer stack. These layers exclusively process the CLS token while keeping the patch tokens frozen, reducing gradient propagation issues and allowing deeper transformer architectures to be trained effectively. This design helps mitigate the training instability of deep transformers and improves representation learning for classification tasks.

CaiT also incorporates LayerScale, a technique that introduces trainable scaling parameters in residual connections, further stabilizing deep model training. Compared to conventional ViTs, CaiT achieves higher accuracy without requiring additional pretraining data and scales efficiently to deeper architectures. By refining token interactions and stabilizing gradient flow, CaiT enables Vision Transformers to achieve competitive results in image classification while maintaining computational efficiency.
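The LayerScale technique mentioned above amounts to a learnable per-channel scaling of each residual branch; a minimal sketch, with an assumed initialization value, is shown below.

```python
import torch
import torch.nn as nn

class LayerScale(nn.Module):
    """LayerScale (illustrative): multiply the output of a residual branch
    by a learnable per-channel scale, initialised to a small value, before
    it is added back to the residual stream."""

    def __init__(self, dim: int, init_value: float = 1e-4):
        super().__init__()
        self.gamma = nn.Parameter(init_value * torch.ones(dim))

    def forward(self, branch_out: torch.Tensor) -> torch.Tensor:
        return self.gamma * branch_out   # broadcast over (B, N, dim)

# Usage inside a block: x = x + layer_scale(attention(norm(x)))
ls = LayerScale(dim=64)
print(ls(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```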


TABLE 3. Description of advanced ViT models. This table provides a description of some of the advanced ViT models, which are categorized based on the
key innovation along with its advantages and challenges.

2) MaxViT (MULTI-AXIS VISION TRANSFORMER)
MaxViT (Multi-Axis Vision Transformer) is a highly efficient Vision Transformer that introduces a grid-based attention mechanism to enhance scalability and computational efficiency for vision tasks [41]. Unlike standard ViTs, which rely on global self-attention and suffer from quadratic complexity, MaxViT incorporates a hybrid local–global attention approach that enables efficient processing of high-resolution images. This design allows MaxViT to balance local feature extraction and long-range dependencies while maintaining a high degree of parallelism for fast and scalable computation.
A key innovation in MaxViT is its multi-axis attention mechanism, which combines block attention (local processing within small windows) and grid attention (global interaction across spatially distant regions) in a structured manner. This hierarchical structure allows MaxViT to capture both fine-grained details and global semantic information efficiently, making it particularly effective for dense vision tasks such as image classification, object detection, and segmentation. Additionally, MaxViT retains CNN-like efficiency while benefiting from transformer-based self-attention, enabling strong performance across various benchmarks. By integrating parallelized attention operations, MaxViT achieves superior efficiency, making it well suited for high-resolution vision tasks and real-time AI applications.

3) ViTDet
ViTDet (Vision Transformer for Detection) [42] is a detection-optimized Vision Transformer designed specifically for object detection and instance segmentation tasks. Unlike traditional ViTs, which are primarily used for image classification, ViTDet is fine-tuned for dense prediction using a feature pyramid structure that allows multi-scale feature extraction. Developed for frameworks like Mask R-CNN and Cascade R-CNN, ViTDet enables high-resolution vision processing while maintaining the benefits of self-attention mechanisms.
A key innovation in ViTDet is its use of non-overlapping window attention, similar to the Swin Transformer, to improve computational efficiency while retaining global receptive fields. This allows ViTDet to handle large-scale objects and fine-grained details effectively. It is designed to work seamlessly with modern object detection pipelines, demonstrating state-of-the-art performance on benchmarks such as COCO. ViTDet outperforms CNN-based models like ResNet-FPN [43], making it a powerful choice for high-precision detection, segmentation, and dense vision applications by leveraging hierarchical feature extraction and efficient transformer-based attention.
Table 2 compares the ViT architectures discussed above in terms of parameters, throughput, accuracy, and computational complexity. The advanced ViT models that are not discussed individually above are summarized in Table 3.

VI. APPLICATIONS OF VISION TRANSFORMERS
With the introduction of Vision Transformers (ViTs), a new approach to tackling computer vision tasks has emerged. Unlike traditional convolutional neural networks (CNNs), which have long dominated the field, ViTs recast these tasks as sequence-modeling problems over image patches, positioning themselves as a formidable alternative. Table 4 summarizes various ViT applications along with their architectural features and some real-world use cases.

A. IMAGE CLASSIFICATION
For classification, trainable attention is commonly divided into two primary streams: hard attention, which selects specific image regions, and soft attention, which is used to create differentiable feature maps. The term ‘‘visual attention’’ was first introduced to select essential regions and locations in an image [44]. Additionally, resizing the input image can help reduce the computational load of the model, and the attention heat map can be used to crop a discriminative sub-region from the global image.
One of the first applications of attention-guided CNNs (AG-CNN) was in medical image classification [45]. SENet [46] introduced soft self-attention to reweight convolutional feature channels, while hard attention requires recalibrating selected feature maps. Saumya et al. [47] leveraged attention maps to reweight intermediate feature estimates in deep neural networks, and Han et al. [48] further enhanced CNN representations by employing attribute-aware attention.
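As a concrete illustration of the attention-based cropping idea mentioned above, the sketch below derives a coarse heat map from the CLS-token attention of the final transformer layer and crops the most attended region. The attention tensor is a stand-in here; in practice it would be hooked from the last ViT block, and the patch geometry and crop size are illustrative assumptions rather than a specific published pipeline.

```python
import torch

def crop_from_cls_attention(image: torch.Tensor, attn: torch.Tensor,
                            patch: int = 16, crop: int = 96) -> torch.Tensor:
    """Crop the region of `image` that the CLS token attends to most.

    image: (3, H, W); attn: (heads, 1 + N, 1 + N) last-layer attention,
    where N = (H // patch) * (W // patch) is the number of patch tokens.
    """
    _, H, W = image.shape
    gh, gw = H // patch, W // patch
    cls_to_patches = attn.mean(0)[0, 1:]           # average heads, CLS -> patch weights
    heat = cls_to_patches.reshape(gh, gw)          # coarse spatial heat map
    iy, ix = divmod(int(heat.argmax()), gw)        # most attended patch cell
    cy, cx = iy * patch + patch // 2, ix * patch + patch // 2
    y0 = max(0, min(H - crop, cy - crop // 2))     # clamp crop box to the image
    x0 = max(0, min(W - crop, cx - crop // 2))
    return image[:, y0:y0 + crop, x0:x0 + crop]

# Stand-in tensors for illustration only.
img = torch.rand(3, 224, 224)
attn = torch.softmax(torch.rand(12, 197, 197), dim=-1)
crop = crop_from_cls_attention(img, attn)          # -> (3, 96, 96)
```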
TABLE 4. Vision transformer applications: Task descriptions, architectural features, and real-world use cases.

FIGURE 10. The architecture of DETR (based on [49]).

iGPT [50] is a powerful yet non-specialized generative sequence model that can be applied to image classification. Traditional image classification relies on supervised training over vast datasets of images and labels, whereas iGPT is trained directly on raw pixel sequences, without labels or captions and without vision-specific architectural priors. Computer vision has long utilized pre-training approaches, and these were revisited and integrated into self-supervised methods by Chen et al. [50]. The pre-training phase is followed by fine-tuning; both auto-regressive and BERT-style objectives are explored during pre-training. Unlike NLP, where the sequence consists of language tokens, here pixel prediction is implemented with a sequence transformer architecture, and pre-training acts much like a favorable initialization combined with early stopping. A small classification head is added in the final fine-tuning stage to adjust all weights and optimize the classification objective. The following are the key observations and achievements:
• Understanding 2D image characteristics: Despite being trained on 1D pixel sequences, iGPT captures 2D image properties such as object appearance and category.
• Generative capabilities: It generates coherent image completions and samples without human-provided labels.
• Competitive features: Features extracted from iGPT achieve state-of-the-art or competitive performance across various image classification datasets, including CIFAR-10 [51], CIFAR-100 [52], STL-10, and ImageNet.

B. OBJECT DETECTION
Object detection is a fundamental task in computer vision that requires simultaneous localization and classification of potential objects within a single image. It is essential for applications such as facial recognition, autonomous driving, pedestrian detection, and medical imaging, and it strongly influences scene understanding, environment perception, and object tracking. Deep learning-based object detection methods have become dominant due to their rapid development; however, several challenges remain in handling varying object scales, creating lightweight models, and balancing efficiency with precision.
Convolutional neural networks (CNNs) have been the foundation of most traditional mainstream object detection techniques, with Faster R-CNN [53], YOLO [54], and SSD [55] as notable examples. The success of transformers in natural language processing has led researchers to adapt transformer topologies to computer vision problems.
Transformers have recently gained significant interest in object detection, leading to high-performance models such as DETR, Deformable DETR, Swin Transformer, and DINO. In-depth analysis and evaluation will be necessary in subsequent research, as these models represent a new paradigm in object detection.
The Detection Transformer (DETR), introduced by Carion et al. [56], is a transformer-based object detection framework that eliminates the need for hand-crafted components such as region proposals and non-maximum suppression. It extracts image features using a CNN backbone, adds fixed positional encodings, and processes them through an encoder–decoder transformer. The decoder produces N output embeddings from learned positional encodings (object queries), where N is predefined rather than dependent on the number of objects in the image. Final predictions, bounding boxes and class labels, are computed from these embeddings by simple feed-forward networks (FFNs). Unlike traditional detectors, DETR employs bipartite matching to assign predictions to ground-truth objects. The architecture of DETR is shown in Figure 10.
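The following minimal sketch shows the core DETR idea of decoding a fixed set of learned object queries against image features with a standard transformer. The backbone features, dimensions, and prediction heads are illustrative assumptions, and the bipartite-matching loss used for training is omitted.

```python
import torch
import torch.nn as nn

class TinyDETRHead(nn.Module):
    """Decode a fixed set of object queries into class logits and boxes."""
    def __init__(self, d_model=256, num_queries=100, num_classes=91):
        super().__init__()
        self.queries = nn.Embedding(num_queries, d_model)       # learned object queries
        self.transformer = nn.Transformer(d_model, nhead=8,
                                          num_encoder_layers=3,
                                          num_decoder_layers=3,
                                          batch_first=True)
        self.class_head = nn.Linear(d_model, num_classes + 1)   # +1 for "no object"
        self.box_head = nn.Linear(d_model, 4)                   # (cx, cy, w, h) in [0, 1]

    def forward(self, feats: torch.Tensor):
        # feats: flattened backbone features with positional encoding, (B, HW, d_model)
        b = feats.size(0)
        tgt = self.queries.weight.unsqueeze(0).expand(b, -1, -1)
        hs = self.transformer(src=feats, tgt=tgt)               # (B, num_queries, d_model)
        return self.class_head(hs), self.box_head(hs).sigmoid()

feats = torch.randn(2, 49, 256)            # e.g., a 7x7 CNN feature map, flattened
logits, boxes = TinyDETRHead()(feats)      # (2, 100, 92), (2, 100, 4)
```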
Although DETR enables end-to-end object detection, it has drawbacks such as long training times and poor performance on small objects. Zhu et al. [57] introduced Deformable DETR, which improves efficiency by using a deformable attention module that attends to a small set of key sampling locations instead of all spatial positions. This accelerates convergence, reduces computational cost, and integrates multi-scale features, achieving 1.6× faster inference and 10× lower training cost than DETR.
To further reduce computational complexity, Zheng et al. [58] proposed the Adaptive Clustering Transformer (ACT), which replaces DETR's self-attention with locality-sensitive hashing (LSH) for efficient query clustering while minimizing the accuracy loss; Multi-Task Knowledge Distillation (MTKD) [59] further mitigates the remaining performance drop through fine-tuning.
Sun et al. [60] identified cross-attention as the main cause of DETR's slow convergence. They proposed an encoder-only variant with improved training stability, introducing the TSP-FCOS and TSP-RCNN models, which enhance performance with feature pyramids.
Dai et al. [61] developed UP-DETR, an unsupervised pre-training approach inspired by NLP that uses random query patch detection. This improves DETR's accuracy, especially on small datasets such as PASCAL VOC.
MaX-DeepLab is designed explicitly for a sub-category of semantic segmentation known as panoptic segmentation. Regular semantic segmentation assigns each pixel a class label (e.g., car, person, background) but does not differentiate between individual object instances, whereas panoptic segmentation combines semantic segmentation and instance segmentation [62]. MaX-DeepLab directly predicts class-labeled masks, eliminating separate steps such as box detection and merging, and achieves advanced performance using a ‘‘dual-path transformer’’ that combines CNNs with transformers [62]. The MaX-DeepLab pipeline, which uses a dual-path transformer architecture, is shown in Figure 11.
FIGURE 11. MaX-DeepLab with its dual-path transformer architecture (based on [63]).

C. SEMANTIC SEGMENTATION
Semantic segmentation requires rich spatial information and a large receptive field. Unfortunately, many recent techniques sacrifice spatial resolution to reach real-time inference speed, resulting in degraded performance. BiSeNet, a network for bilateral segmentation, was proposed by C. Yu et al. [64]. A small-stride Spatial Path is built first to preserve spatial information and produce high-resolution features, while a fast-downsampling Context Path is used in parallel to obtain a sufficient receptive field. A Feature Fusion Module is then introduced to combine the features of the two paths efficiently. The proposed architecture strikes a good compromise between speed and segmentation performance on the Cityscapes, CamVid, and COCO-Stuff datasets, reaching 68.4% mean IoU on Cityscapes with a 2048 × 1024 input.
Deformable VisTR is an extension of the VisTR framework, an end-to-end transformer-based approach to video instance segmentation (VIS). A major challenge with VisTR is its training efficiency: it is computationally expensive, requiring around 1000 GPU hours of training, and it converges slowly. To resolve these issues, Deformable VisTR uses a spatio-temporal deformable attention module [65]. The key idea is that, instead of attending to all points, it focuses on a small, fixed number of crucial spatio-temporal sampling locations. The key benefit is that its computation is linear in the size of the spatio-temporal feature maps while maintaining performance comparable to the original VisTR with significantly fewer GPU training hours (about 10× less).
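To illustrate how transformer patch tokens can be turned into a dense prediction of the kind discussed in this subsection, the sketch below reshapes a ViT-style token sequence back into a 2D grid and upsamples it into per-pixel class scores, in the spirit of SETR-style segmentation decoders. The shapes and the simple bilinear decoder are illustrative assumptions rather than any specific published model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenSegHead(nn.Module):
    """Turn a sequence of patch tokens into a per-pixel segmentation map."""
    def __init__(self, dim=384, num_classes=21, grid=14, out_size=224):
        super().__init__()
        self.grid, self.out_size = grid, out_size
        self.proj = nn.Conv2d(dim, num_classes, kernel_size=1)   # per-token class scores

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim) patch tokens with N = grid * grid (CLS token removed)
        b, n, d = tokens.shape
        fmap = tokens.transpose(1, 2).reshape(b, d, self.grid, self.grid)
        logits = self.proj(fmap)                                  # (B, classes, grid, grid)
        return F.interpolate(logits, size=self.out_size,
                             mode="bilinear", align_corners=False)

tokens = torch.randn(2, 196, 384)        # e.g., 14x14 patches from a 224x224 input
masks = TokenSegHead()(tokens)           # (2, 21, 224, 224)
```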
D. ACTION RECOGNITION
Action recognition involves analyzing human movements in video sequences by capturing both spatial (object presence) and temporal (motion dynamics) information. Vision Transformers (ViTs) have shown remarkable performance in this domain by leveraging self-attention to model long-range dependencies, leading to greater robustness and reduced data requirements compared to traditional methods [66].
Despite their success, ViTs face computational challenges when applied to videos. Researchers have explored various strategies to improve efficiency, including processing shorter video segments instead of full sequences, using skeleton-based representations (body-joint positions) to reduce complexity, and designing specialized ViT architectures tailored to video tasks [67]. Effective action recognition requires capturing both spatial structure (object or joint positions) and temporal patterns (movement dynamics). One approach is skeleton-based modeling, where human movements are represented as graphs, providing a compact and viewpoint-invariant representation [68]. Graph-based methods offer advantages over raw video-based techniques by focusing on motion patterns rather than pixel-level details.
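As a minimal illustration of the skeleton-as-graph idea described above, the sketch below applies one graph-convolution step over per-joint features using a normalized adjacency matrix. The joint count, adjacency, and feature sizes are illustrative assumptions rather than a specific ST-GCN implementation.

```python
import torch
import torch.nn as nn

class SkeletonGraphConv(nn.Module):
    """One graph-convolution step over per-joint features: X' = A_norm @ X @ W."""
    def __init__(self, in_dim: int, out_dim: int, adjacency: torch.Tensor):
        super().__init__()
        a = adjacency + torch.eye(adjacency.size(0))        # add self-loops
        d_inv_sqrt = torch.diag(a.sum(dim=1).pow(-0.5))
        self.register_buffer("a_norm", d_inv_sqrt @ a @ d_inv_sqrt)
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, joints, in_dim) joint features over time
        x = torch.einsum("ij,btjc->btic", self.a_norm, x)    # mix neighboring joints
        return torch.relu(self.linear(x))                    # (batch, frames, joints, out_dim)

# Toy 5-joint skeleton: a short chain with one branch.
edges = [(0, 1), (1, 2), (2, 3), (2, 4)]
adj = torch.zeros(5, 5)
for i, j in edges:
    adj[i, j] = adj[j, i] = 1.0
layer = SkeletonGraphConv(in_dim=3, out_dim=16, adjacency=adj)
out = layer(torch.randn(8, 30, 5, 3))    # 8 clips, 30 frames, 5 joints, 3-D coordinates
```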
ST-GCN [69] is a skeleton-based action recognition model that relies on modeling human movement as a graph, where joints act as nodes and their physical connections as edges. This approach is more robust to variations in viewpoint, occlusion, and appearance than traditional RGB-based methods. Earlier techniques used handcrafted features such as SURF [70] and SIFT [71], but these struggled to capture long-range dependencies and required extensive tuning. Recent advances have therefore shifted towards Graph Convolutional Networks (GCNs), which effectively model spatial and temporal relationships in skeletal data. The block structure of ST-GCN is shown in Figure 12.
FIGURE 12. ST-GCN block structure (based on [69]).
Sijie Yan et al. [72] pioneered the use of GCNs for action recognition by constructing graphs from human skeletons and applying graph convolutions to extract movement features. Li et al. [73] extended this concept by introducing the Actional-Structural Graph Convolutional Network (AS-GCN) [74], which incorporates additional structural and actional connections to improve recognition accuracy. However, many GCN-based methods focus only on natural joint connections, overlooking important relationships between non-adjacent joints. ST-GCN enhances traditional GCN models by incorporating extended skeleton graphs with functional connections and a refined partitioning strategy, significantly improving action recognition performance on large-scale datasets [69], [75]. These advancements open new possibilities for more efficient and accurate skeleton-based action recognition, paving the way for further research in graph-based approaches.
While these methods are widely used in video processing, ViTs offer a more powerful alternative by directly learning spatial and temporal representations through self-attention, reducing reliance on handcrafted features. Recent research focuses on optimizing ViTs for efficient action recognition, making them increasingly viable for large-scale applications.

E. VIDEO PROCESSING
Video processing is another important computer vision task that involves analyzing and manipulating video streams to extract meaningful information. It plays a crucial role in applications that require understanding dynamic scenes, such as action recognition, autonomous driving, and surveillance. ViTs have significantly advanced video processing by modeling complex temporal dynamics and long-range dependencies. Compared to traditional methods, they achieve superior accuracy and quality, leading to remarkable progress in tasks like video completion and translation. By effectively capturing spatial and temporal information, ViTs facilitate tasks that demand a deep understanding of video content and temporal coherence.

1) VIDEO COMPLETION
ViTs excel in video completion by leveraging self-attention to capture long-range dependencies and contextual information, ensuring temporal coherence and realistic results. Their ability to analyze spatial and temporal features makes them well suited for video inpainting and extrapolation, significantly enhancing video continuity and visual quality [101].
Existing approaches to video inpainting often struggle to maintain temporal consistency due to their limited temporal receptive fields. Many methods rely only on adjacent frames, leading to artifacts and inconsistencies when handling complex motion, and they typically assume global affine transformations, which can result in inaccurate reconstructions [102]. Additionally, without dedicated temporal-coherence optimization, frame-by-frame processing requires extensive post-processing, which may fail under severe artifacts. The Spatial-Temporal Transformer Network (STTN) was proposed to overcome these challenges by formulating video inpainting as a ‘‘multi-to-multi’’ problem. STTN fills missing regions across multiple frames by leveraging both nearby and distant frames. A multi-scale patch-based attention module identifies coherent content across spatial and temporal dimensions, with transformer heads computing similarity across spatial patches at different scales. By stacking multiple layers, STTN dynamically refines the attended content in the missing regions, and a spatial-temporal adversarial loss guides the model to generate cohesive and visually realistic content [103].
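To make the idea of attending over spatial patches across multiple frames more concrete, the sketch below splits a clip into non-overlapping spatio-temporal patches (''tubelets'') and flattens them into a token sequence. The clip size and patch geometry are illustrative assumptions rather than the exact tokenization used by STTN or other video transformers.

```python
import torch

def video_to_tubelet_tokens(clip: torch.Tensor, t_patch: int = 2, s_patch: int = 16):
    """Split a video clip into non-overlapping spatio-temporal patch tokens.

    clip: (B, T, C, H, W); returns (B, num_tokens, t_patch * s_patch * s_patch * C).
    """
    b, t, c, h, w = clip.shape
    assert t % t_patch == 0 and h % s_patch == 0 and w % s_patch == 0
    x = clip.reshape(b, t // t_patch, t_patch, c,
                     h // s_patch, s_patch, w // s_patch, s_patch)
    # Group the (t_patch, s_patch, s_patch, C) cells of each tubelet together.
    x = x.permute(0, 1, 4, 6, 2, 5, 7, 3)
    return x.reshape(b, -1, t_patch * s_patch * s_patch * c)

clip = torch.randn(2, 8, 3, 224, 224)            # two 8-frame RGB clips
tokens = video_to_tubelet_tokens(clip)           # (2, 784, 1536)
```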
TABLE 5. Benchmarking ViT performance across tasks: Objectives, Models, datasets, and evaluation criteria.

2) VIDEO CAPTIONING
ViTs are also well suited to video translation and captioning, because their self-attention mechanism captures complex transformations and inter-frame interactions. This enables precise frame interpolation, style transfer, and resolution enhancement; by efficiently gathering spatial and temporal information, transformers produce accurate and superior video translations [104].
Current video captioning methods for untrimmed videos typically involve two phases: proposal localization and caption generation. The Masked Transformer (MT) [105], inspired by machine translation, integrates and jointly optimizes these steps. The MT model uses a proposal encoder to predict event proposals and a caption decoder to generate captions, replacing recurrent neural networks (RNNs) with stacked self-attention blocks to better capture long-range dependencies.
The Accelerated Masked Transformer (AMT) [106] improves upon MT with enhanced efficiency, introducing acceleration strategies for both stages: first, faster localization via local attention and a lightweight, anchor-free proposal predictor, and second, single-shot feature masking and average attention for quicker caption generation. In testing, AMT is nearly twice as fast as a three-layer Masked Transformer, with slightly better performance. AMT demonstrates the power of self-attention in video captioning by effectively handling long-range dependencies and optimizing computational efficiency; the introduction of local attention and single-shot feature masking significantly enhances processing speed without compromising accuracy, making it a strong contender for real-time applications.
Table 5 summarizes benchmark results of ViT performance across different computer vision tasks, together with the corresponding benchmark datasets and the main evaluation metrics.

VII. COMPARISON WITH CNNs AND HYBRID APPROACHES
Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) represent two distinct paradigms for visual understanding. While CNNs excel due to their inductive biases, such as spatial locality and translation invariance, ViTs leverage self-attention to capture long-range dependencies and model global relationships. This section compares their performance, strengths, and trade-offs, while also highlighting the emergence of hybrid models that aim to combine the best aspects of both architectures.
TABLE 6. Image classification performance on ImageNet-1K.

TABLE 7. Performance on COCO (Detection) and ADE20K (Segmentation).

A. PERFORMANCE BENCHMARKS ON STANDARD DATASETS
ViTs have demonstrated competitive or superior performance to CNNs on major computer vision benchmarks, particularly when large-scale datasets are available. Table 6 compares ViT and CNN models on image classification tasks. CNNs such as EfficientNet have been found to be more parameter-efficient when working with small datasets [107]. In contrast, ViT and DeiT typically require large-scale pretraining on extensive datasets, such as ImageNet-21K or JFT, to match the performance of CNNs. However, hybrid models that combine the strengths of both architectures, like ConvNeXt and CoAtNet, have shown great promise in bridging this gap, often outperforming both pure ViTs and CNNs [108].
Table 7 provides the comparison for object detection and segmentation tasks. ViTs excel in high-compute scenarios such as object detection, where models like ViTDet outperform traditional CNN-based detectors when sufficient data is available. Hybrid models, including Swin and PVT, have demonstrated exceptional performance by leveraging the strengths of both architectures, particularly through their ability to learn multi-scale features, and currently dominate many computer vision tasks.

B. STRENGTHS AND WEAKNESSES: CNNs VS. ViTs
Both Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) are foundational architectures in computer vision, each with distinct advantages and limitations. ViTs, inspired by the Transformer model in natural language processing (NLP), leverage self-attention to process images, while CNNs use convolutional layers to extract hierarchical features. Below is an expanded comparison highlighting their strengths and weaknesses across various dimensions.

1) STRENGTHS AND WEAKNESSES OF CNNs
CNNs excel in parameter efficiency and local feature extraction, making them well suited to resource-constrained applications such as mobile devices and to small datasets. Their convolutional operations, honed by decades of research, efficiently capture hierarchical patterns (e.g., edges and textures) and benefit from mature frameworks such as PyTorch [109]. However, their reliance on local receptive fields limits global context understanding, and their sequential layer structure reduces parallelism, hindering scalability. While techniques like dilated convolutions help, CNNs often underperform ViTs in tasks requiring long-range dependencies, such as scene understanding or multimodal learning.

2) STRENGTHS AND WEAKNESSES OF ViTs
ViTs surpass CNNs in global reasoning and scalability, leveraging self-attention to model distant relationships and achieving state-of-the-art results on large datasets. Yet their quadratic computation cost and data hunger can make them impractical for edge deployment or small-scale tasks where CNNs remain dominant. For applications demanding fine-grained local analysis (e.g., medical imaging) or efficient inference, CNNs retain an edge, while ViTs thrive in high-complexity domains such as video analysis and multimodal systems [110]. The choice hinges on trade-offs between computational resources, data availability, and task requirements.
The key strengths and weaknesses of CNNs and ViTs discussed above are summarized in Table 8. While CNNs remain a robust choice for resource-efficient and small-scale applications, ViTs dominate large-scale and complex tasks. Hybrid approaches such as the Swin Transformer and BoTNet [111] combine the strengths of both paradigms, offering promising solutions across diverse domains.
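As a quick way to ground the parameter-efficiency comparison above, the snippet below counts the trainable parameters of one representative CNN and one representative ViT from torchvision (assuming a recent torchvision release). The specific model choices are illustrative assumptions, and accuracy differences ultimately depend on training data and recipes rather than parameter count alone.

```python
import torch
from torchvision import models

def count_params(model: torch.nn.Module) -> float:
    """Return the number of trainable parameters in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# Randomly initialized models are enough for a size comparison (no weights downloaded).
cnn = models.resnet50(weights=None)
vit = models.vit_b_16(weights=None)

print(f"ResNet-50 : {count_params(cnn):6.1f}M parameters")
print(f"ViT-B/16  : {count_params(vit):6.1f}M parameters")
```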
TABLE 8. Comparison of vision transformers (ViTs) and convolutional neural networks (CNNs) across various aspects.

TABLE 9. Popular hybrid architectures combining CNNs and transformers.

C. HYBRID MODELS: COMBINING CNNs AND TRANSFORMERS
Hybrid models aim to leverage the strengths of both Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) to create more powerful and efficient architectures. By combining the local feature extraction capabilities of CNNs with the global modeling and parallelization capabilities of ViTs, hybrid models can achieve state-of-the-art performance on various computer vision tasks. They learn features at multiple scales, combining the local features extracted by CNNs with the global features learned by ViTs [112], and they build a hierarchical representation of the input, with early layers focusing on local features and later layers on global context, while still benefiting from the parallelism of transformer computation at scale.
Hybrid models can therefore surpass pure CNNs and ViTs on many benchmarks. Some of the advanced and popular hybrid architectures in the literature are summarized in Table 9.

D. KEY TRADE-OFFS
The choice of architecture depends on the specific application and requirements. For edge devices and small datasets, Convolutional Neural Networks (CNNs) such as MobileNet and EfficientNet are suitable due to their efficiency and parameter frugality; for example, in medical imaging and robotics, where data is often limited and computational resources are constrained, CNNs have been shown to be effective. In contrast, Vision Transformers (ViTs) are better suited to large-scale pretraining, such as on the JFT or LAION datasets, and to tasks that require a global understanding of the input, like panoptic segmentation. Hybrid models, which combine the strengths of both CNNs and ViTs, offer a balanced trade-off between efficiency and accuracy, making them a good choice when both matter, as in the Swin and ConvNeXt models. Hybrid models have also proven effective for detection and segmentation tasks, as in the PVT and Mask2Former models, where their ability to capture both local and global features is particularly useful. Table 10 summarizes the key trade-offs between CNNs, ViTs, and hybrid models across various aspects.
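The following sketch shows one common way such hybrids are assembled: a small convolutional stem extracts local features and downsamples the image, and its output feature map is flattened into tokens for a transformer encoder. The layer sizes and depths are illustrative assumptions rather than any specific published hybrid.

```python
import torch
import torch.nn as nn

class TinyHybrid(nn.Module):
    """Convolutional stem for local features, transformer encoder for global context."""
    def __init__(self, dim=192, depth=4, num_classes=1000):
        super().__init__()
        self.stem = nn.Sequential(                       # 224x224 image -> 14x14 feature map
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.GELU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.BatchNorm2d(128), nn.GELU(),
            nn.Conv2d(128, dim, 3, stride=4, padding=1),
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.stem(x)                                  # (B, dim, H', W')
        tokens = f.flatten(2).transpose(1, 2)             # (B, H'*W', dim)
        tokens = self.encoder(tokens)
        return self.head(tokens.mean(dim=1))              # global average pooling

logits = TinyHybrid()(torch.randn(2, 3, 224, 224))        # (2, 1000)
```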
TABLE 10. Key trade-offs between CNNs, ViTs, and hybrid models.

VIII. CHALLENGES AND OPEN ISSUES
Vision Transformers (ViTs) have demonstrated remarkable success across various computer vision tasks, but several critical challenges remain unresolved. These limitations present important opportunities for future research and development. Figure 13 summarizes the main challenges and limitations of ViTs in image processing and computer vision, which are briefly explained in the subsections below.
FIGURE 13. ViT challenges and open issues.

A. DATA EFFICIENCY AND PRETRAINING REQUIREMENTS
One of the significant challenges facing Vision Transformers (ViTs) is their requirement for massive datasets to achieve competitive performance. In contrast to Convolutional Neural Networks (CNNs), which can learn effectively from smaller datasets, ViTs typically necessitate large-scale pretraining datasets, such as JFT-300M or LAION [113], to reach state-of-the-art performance. This raises several key issues and open problems that must be addressed to improve the data efficiency of ViTs. The key issues are the following:
• Weak inductive biases: Compared to CNNs, ViTs possess weaker inductive biases, including a lack of built-in translation equivariance and locality. This results in reduced performance when trained on smaller datasets, where the model's ability to generalize is crucial.
• Poor sample efficiency: ViTs demonstrate poor sample efficiency on medium-sized and small datasets, such as those encountered in medical imaging applications. This limitation hinders the adoption of ViTs in domains where data is scarce or expensive to collect.
The following are some of the open challenges on data efficiency:
• Self-supervised learning: Can self-supervised learning (SSL) methods, such as Masked Autoencoders (MAE) or self-distillation (DINO), eliminate the need for supervised pretraining? SSL methods have shown promise in reducing the reliance on large labeled datasets, but their full effectiveness in ViTs remains an open question.
• Designing ViT architectures with stronger priors: How can ViT architectures be designed to incorporate stronger built-in priors, allowing them to perform well in small-data regimes? This may involve introducing additional constraints or biases into the model, such as spatial hierarchies or locality-aware attention mechanisms, to improve its ability to generalize from limited data.
• Knowledge distillation: Are there effective distillation techniques to transfer knowledge from large ViTs to compact versions, enabling the deployment of ViTs in resource-constrained environments? Distillation methods have been applied successfully to CNNs, but their applicability to ViTs remains an open problem.

B. COMPUTATIONAL COST AND MEMORY FOOTPRINT
The Vision Transformer (ViT) architecture has shown remarkable performance in various computer vision tasks, but its computational cost and memory footprint pose significant challenges, particularly when dealing with high-resolution images and video. The quadratic complexity of self-attention, a key component of ViTs, limits their applicability to large-scale inputs. The key issues are the following (a rough cost illustration follows this list):
• Quadratic complexity: The self-attention mechanism [114] in ViTs has a computational complexity of O(N²) and memory requirements of O(N²) for N patches, where N is the number of patches in the input image. This quadratic growth leads to significant computational costs and memory consumption, making it challenging to apply ViTs to high-resolution images and video.
• Heavy memory consumption during training: The large memory requirements of ViTs during training make it difficult to train large models on standard hardware.
• Inefficient inference: Compared to optimized Convolutional Neural Networks (CNNs), ViTs can be less efficient during inference, which can limit their deployment in real-time applications.
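To make the quadratic scaling concrete, the short helper below estimates how the attention matrix alone grows with input resolution under a standard 16×16 patching scheme. The byte accounting covers only the N × N attention weights per head and is a rough illustrative estimate under those assumptions, not a full memory model.

```python
def attention_matrix_cost(image_size: int, patch_size: int = 16,
                          heads: int = 12, bytes_per_value: int = 4):
    """Return (num_tokens, megabytes used by one layer's attention weights)."""
    n = (image_size // patch_size) ** 2          # patch tokens (CLS token ignored)
    mb = heads * n * n * bytes_per_value / 1e6   # each head stores an N x N matrix
    return n, mb

for size in (224, 512, 1024):
    tokens, mb = attention_matrix_cost(size)
    print(f"{size:4d}px -> {tokens:5d} tokens, ~{mb:8.1f} MB of attention weights per layer")
```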
Some of the open challenges on computational and memory efficiency are listed below:
• Linear attention variants: Can linear attention variants, i.e., attention mechanisms with linear complexity, achieve parity with softmax attention in terms of performance? Such variants can potentially reduce the computational complexity and memory requirements of ViTs, making them more applicable to large-scale inputs.
• Optimization for hardware accelerators: How can ViTs be optimized for hardware accelerators, such as Tensor Processing Units (TPUs) or Graphics Processing Units (GPUs), to improve their computational efficiency and reduce memory consumption? Optimizing ViTs for hardware accelerators can enable their deployment in a wide range of applications, from cloud-based services to edge devices.
• Dynamic token sparsification: Are there dynamic token sparsification methods that can maintain the accuracy of ViTs while reducing their computational cost and memory requirements? Token sparsification methods, such as pruning or merging, can reduce the number of tokens processed by the self-attention mechanism, leading to significant computational savings and memory reduction.

C. ROBUSTNESS AND GENERALIZATION
Vision Transformers (ViTs) have demonstrated impressive performance in various computer vision tasks, but their robustness and generalization capabilities are still not well understood. Recent studies have revealed that ViTs can exhibit unexpected failure modes when faced with distribution shifts and adversarial attacks, which raises significant concerns about their reliability and trustworthiness. The following are some of the key challenges:
• Susceptibility to adversarial patches: ViTs have been shown to be vulnerable to adversarial patches, which are small, specially crafted patches that can be added to an image to mislead the model [115]. This vulnerability highlights the need for more robust and secure ViT architectures.
• Poor out-of-distribution generalization: ViTs often struggle to generalize to new, unseen distributions, which can lead to poor performance in real-world applications. This limitation is particularly concerning in domains where distribution shifts are common, such as medical imaging or autonomous driving.
• Attention collapse in deep architectures: Deep ViT architectures can suffer from attention collapse, where the attention mechanism becomes less effective or degenerates, leading to poor performance. This phenomenon is not yet fully understood and requires further investigation.
The following are some of the open issues still present in ViTs for future exploration and study:
• Robustness paradox: Why are ViTs simultaneously robust to some perturbations (e.g., random noise) but vulnerable to others (e.g., adversarial patches)? Understanding this paradox is crucial for developing more robust ViT architectures.
• ViT-specific regularization techniques: Can we develop regularization techniques that are specifically designed for ViTs and that help improve their robustness and generalization? Regularization techniques such as dropout or weight decay are commonly used in CNNs, but their effectiveness in ViTs is still an open question.
• Stabilizing attention mechanisms: How can attention mechanisms be made more stable and robust, particularly in deep architectures? Stabilizing attention is crucial for improving the overall robustness and generalization capabilities of ViTs.

D. INTERPRETABILITY AND EXPLAINABILITY
While Vision Transformers (ViTs) have achieved remarkable performance across computer vision tasks, their decision-making processes remain significantly less interpretable than those of conventional CNNs [116]. This ‘‘black box’’ nature poses substantial challenges for deployment in sensitive domains (e.g., medical imaging, autonomous systems) where model transparency is crucial. Some of the key challenges are listed below:
• Noisy attention maps: Unlike CNN filters, which often correspond to semantically meaningful features (e.g., edge detectors), ViT attention maps frequently exhibit diffuse or counterintuitive patterns.
• Lack of feature-visualization analogues: Patch embeddings are high-dimensional and non-linear, and attention operates on abstract token relationships rather than spatial hierarchies, whereas CNNs allow visualization through filter activation patterns and class activation maps (e.g., Grad-CAM).
• Cross-patch interaction complexity: Global self-attention creates intricate dependency graphs in which any patch can influence any other, making it difficult to trace decision pathways, and the dynamic, context-dependent nature of attention weights prevents static analysis of feature importance.
The following open issues pave the way for future research directions on ViT interpretability and explainability:
• Quantifying interpretability: Formal metrics are needed to evaluate whether attention aligns with human-annotated regions of interest and with causal relationships, together with standardized datasets that provide ground-truth explanations for model decisions.
• ViT-specific visualization tools: Methods to filter noisy attention (e.g., sparsification, clustering) while preserving salient features, and techniques to map attention patterns to semantic concepts and to analyze them over time.
• Human-aligned attention: Training objectives that encourage attention to focus on semantically meaningful regions (e.g., by integrating eye-tracking data), intermediate layers that enforce alignment with human-defined concepts, and architectures that distinguish correlation from causation in patch relationships.

E. REAL-TIME AND EDGE DEPLOYMENT
While Vision Transformers (ViTs) have demonstrated state-of-the-art performance on many vision tasks [117], their computational demands make deployment in latency-sensitive and resource-constrained environments particularly challenging. Unlike CNNs, which benefit from decades of hardware optimization, ViTs face fundamental architectural hurdles for efficient edge deployment [118]. Some of the key issues are as follows:
• High inference latency: The quadratic complexity of self-attention can lead to 2-5× slower inference than optimized CNNs at comparable accuracy. This is particularly problematic for high-resolution inputs (greater than 512 px), where the patch count grows rapidly. Hardware-unfriendly operations include irregular memory-access patterns in attention and a lack of optimized kernels for token-mixing operations.
• Memory constraints: ViTs can require 3-10× more memory than CNNs due to the storage of full attention matrices (O(N²)) and large intermediate activations in feed-forward layers, which limits the deployment of ViT models on mobile and memory-constrained edge devices.
• Framework limitations: Current deployment stacks (TensorRT, ONNX Runtime) are primarily optimized for CNNs; compiler optimizations for dynamic token sparsification, mixed-precision attention, and hardware-specific acceleration of the underlying matrix multiplications are still missing.
The following are some of the open challenges in deploying ViTs for real-time and edge applications:
• Real-time video processing: Architectural innovations are needed for temporal attention compression, keyframe-based attention sharing, and recurrent ViT variants with memory buffers. Latency targets should be below 30 ms/frame for 1080p video (real time at 30 FPS) and below 100 mW power consumption for embedded vision.
• Model compression strategies: Attention-specific quantization schemes and 4-bit integer versus floating-point trade-offs should be considered specifically for ViT deployment. Pruning techniques should be improved for token pruning (adaptive computation), head pruning for multi-head attention, and block-wise sparsity patterns, and distillation methods for CNN-to-ViT knowledge transfer need improvement.
• Mobile deployment optimization: Hardware-aligned designs are needed, from patch embeddings optimized for mobile NPUs to attention approximations for DSP acceleration. Existing or new frameworks should support TFLite delegates for ViT operations and ONNX extensions for dynamic attention.

IX. RECENT ADVANCEMENTS IN ViT
Recent advancements in Vision Transformers (ViTs) have significantly enhanced their performance, efficiency, and applicability across various computer vision tasks. Researchers have developed lightweight and efficient variants like MobileViT and TinyViT, which reduce computational costs while maintaining high accuracy, making them suitable for resource-constrained devices. Techniques such as masked autoencoders (MAE) and self-supervised learning frameworks (e.g., DINO, BEiT) have improved data efficiency, enabling ViTs to generalize well even with limited labeled data. Hierarchical architectures like the Swin Transformer and PVT have addressed scalability challenges, allowing ViTs to handle high-resolution images effectively for dense prediction tasks like segmentation and object detection. Additionally, multimodal models such as CLIP and Flamingo have expanded ViTs' capabilities to bridge vision and language understanding, opening new possibilities for cross-modal applications. These advancements, coupled with ongoing research into interpretability, robustness, and hardware optimization, continue to solidify ViTs as a powerful and versatile tool in modern computer vision.
1) Efficient Vision Transformers: One of the main challenges with ViTs is their computational cost and memory footprint. Several advancements have focused on reducing computational complexity while maintaining high performance [119]; a minimal token-reduction sketch is shown after this list:
a) Swin Transformer: Introduces hierarchical attention with shifted windows, reducing computational cost while preserving spatial locality.
b) Pyramid Vision Transformer (PVT): Uses progressively smaller attention windows, making it efficient for dense prediction tasks.
c) Tokens-To-Token (T2T-ViT): Refines tokenization by aggregating local features before feeding them into the transformer, improving inductive biases.
d) LeViT & MobileViT: Optimized for edge and mobile devices, reducing energy consumption while maintaining competitive accuracy.
e) Token Merging (ToMe): Reduces redundant tokens dynamically to enhance efficiency while preserving accuracy.
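The sketch below illustrates the general token-reduction idea behind approaches such as token pruning and merging: it scores patch tokens by their CLS attention and keeps only the top-k before the remaining transformer blocks. The scoring rule and keep ratio are illustrative assumptions rather than the ToMe algorithm itself.

```python
import torch

def prune_tokens(tokens: torch.Tensor, cls_attn: torch.Tensor, keep_ratio: float = 0.5):
    """Keep the CLS token plus the most-attended patch tokens.

    tokens: (B, 1 + N, D) with the CLS token first; cls_attn: (B, N) CLS-to-patch scores.
    """
    b, n_plus_1, d = tokens.shape
    k = max(1, int((n_plus_1 - 1) * keep_ratio))
    idx = cls_attn.topk(k, dim=1).indices                      # (B, k) most important patches
    idx = idx.unsqueeze(-1).expand(-1, -1, d)                  # gather along the token axis
    kept_patches = tokens[:, 1:, :].gather(1, idx)
    return torch.cat([tokens[:, :1, :], kept_patches], dim=1)  # (B, 1 + k, D)

tokens = torch.randn(2, 197, 384)                 # CLS + 196 patch tokens
cls_attn = torch.rand(2, 196)                     # e.g., mean CLS attention over heads
reduced = prune_tokens(tokens, cls_attn)          # (2, 99, 384)
```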
2) Self-Supervised & Contrastive Learning in ViTs: Self-supervised learning (SSL) has enabled transformers to learn from unlabeled data, improving robustness and generalization [120]:
a) DINO (Self-Distillation with No Labels): Uses knowledge distillation to train ViTs without labels, producing high-quality, semantically meaningful feature representations.
b) MAE (Masked Autoencoders): Adapts masked image modeling (similar to BERT's masked token prediction) to vision tasks, reconstructing missing patches of an image.
c) SimMIM & iBOT: Extend contrastive and masked modeling techniques to enhance self-supervised learning efficiency.
3) Multimodal Vision Transformers: ViTs are increasingly used in multimodal tasks [121] by integrating vision with text, speech, or other modalities:
a) CLIP (Contrastive Language-Image Pretraining): Learns vision representations from natural language supervision, enabling zero-shot classification.
b) DALL-E and Parti: Vision-language models that generate images from textual descriptions, leveraging transformer-based architectures.
c) BEiT (Bidirectional Encoder representation from Image Transformers): Inspired by BERT, it learns bidirectional representations using self-supervised objectives.
4) Vision Transformers in Medical Imaging: ViTs have shown promise in medical imaging applications, improving diagnostic accuracy in radiology, pathology, and ophthalmology [91]:
a) TransUNet: Combines UNet-like convolutional layers with transformers for medical image segmentation.
b) Swin UNETR: Uses hierarchical attention mechanisms to improve medical image processing.
c) ViTs in Pathology: Applied in histopathology for cancer detection, anomaly localization, and cell segmentation.
5) Robustness and Generalization of ViTs: ViTs exhibit improved robustness against adversarial attacks compared to CNNs, and current research focuses on further enhancing generalization across different datasets [122]:
a) Adversarially Trained ViTs: Improve robustness by integrating adversarial learning techniques.
b) RobustViT & Efficient Fine-Tuning: Reduce overfitting by integrating domain adaptation strategies.
c) Few-Shot Learning with ViTs: Enhance learning capabilities in data-scarce environments using meta-learning approaches.
6) Hardware Acceleration for ViTs: Efforts have been made to optimize ViTs for real-time applications on GPUs, TPUs, and specialized AI accelerators [123]:
a) Sparse Transformers: Reduce computational complexity by processing only essential tokens.
b) Quantized ViTs: Use lower bit-precision models for energy-efficient inference.
c) Neuromorphic ViTs: Explore biologically inspired spiking neural networks (SNNs) for ultra-low-power vision tasks.

X. CONCLUSION
Transformers have introduced a paradigm shift in computer vision by replacing traditional, localized feature extractors with globally attentive mechanisms capable of modeling intricate relationships across an image. Through this survey, we have provided a comprehensive exploration of the foundational principles of transformers, their adaptation to vision-specific tasks, and the evolution of architectures such as ViT, DeiT, Swin Transformer, PVT, CrossViT, and others. These models demonstrate that vision transformers can achieve state-of-the-art performance across a wide spectrum of applications, including image classification, object detection, segmentation, medical image analysis, video understanding, and cross-modal learning. Moreover, we have compared transformers with CNNs and hybrid models to illustrate their strengths in capturing global context, flexibility in design, and potential for multi-task learning. However, their superior performance often comes at the cost of higher computational demands, memory usage, and reliance on large-scale datasets for effective training, factors that limit their widespread adoption in resource-constrained environments.
Despite these limitations, recent advancements have shown promising directions for making Vision Transformers more practical and adaptable. Techniques such as pruning, quantization, knowledge distillation, and the design of efficient transformer variants are helping reduce computational overhead. In parallel, self-supervised and contrastive learning approaches are improving data efficiency and model generalization. Furthermore, the growing interest in multimodal architectures, where transformers serve as a unifying backbone across text, vision, and audio, indicates their expanding role in broader AI systems. Looking ahead, future research should focus on improving the interpretability of ViTs, enhancing robustness to adversarial inputs, and optimizing their deployment for real-time and edge computing. With these developments, Vision Transformers are well positioned to become foundational components of intelligent systems across diverse domains, combining accuracy, flexibility, and scalability in ways that go beyond the capabilities of traditional convolutional architectures.

REFERENCES
[1] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, ‘‘Transformers in vision: A survey,’’ ACM Comput. Surv., vol. 54, no. 10s, pp. 1–41, Jan. 2022.
[2] A. Khan, Z. Rauf, A. Sohail, A. R. Khan, H. Asif, A. Asif, and U. Farooq, ‘‘A survey of the vision transformers and their CNN-transformer based variants,’’ Artif. Intell. Rev., vol. 56, no. S3, pp. 2917–2970, Dec. 2023.
[3] S. Jamil, M. Jalil Piran, and O.-J. Kwon, ‘‘A comprehensive survey of transformers for computer vision,’’ Drones, vol. 7, no. 5, p. 287, Apr. 2023.
[4] Y. Wang, Y. Deng, Y. Zheng, P. Chattopadhyay, and L. Wang, [26] J. Pang and S. Dong, ‘‘A novel ensemble system for short-term wind
‘‘Vision transformers for image classification: A comparative survey,’’ speed forecasting based on hybrid decomposition approach and artificial
Technologies, vol. 13, no. 1, p. 32, Jan. 2025. [Online]. Available: intelligence models optimized by self-attention mechanism,’’ Energy
https://www.mdpi.com/2227-7080/13/1/32 Convers. Manage., vol. 307, May 2024, Art. no. 118343.
[5] K. Islam, ‘‘Recent advances in vision transformer: A survey and outlook [27] Z. Yang, H. Du, D. Niyato, X. Wang, Y. Zhou, L. Feng, F. Zhou, W. Li,
of recent work,’’ 2022, arXiv:2203.01536. and X. Qiu, ‘‘Revolutionizing wireless networks with self-supervised
[6] Z. Liu, J. Ning, Y. Cao, Y. Wei, Z. Zhang, S. Lin, and H. Hu, ‘‘Video Swin learning: A pathway to intelligent communications,’’ IEEE Wireless
transformer,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Commun., pp. 1–8, 2025, doi: 10.1109/MWC.002.2400197.
(CVPR), Jun. 2022, pp. 3192–3201. [28] A. Steiner, A. Kolesnikov, X. Zhai, R. Wightman, J. Uszkoreit, and
[7] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, ‘‘Masked L. Beyer, ‘‘How to train your ViT? Data, augmentation, and regularization
autoencoders are scalable vision learners,’’ 2021, arXiv:2111.06377. in vision transformers,’’ 2021, arXiv:2106.10270.
[8] Y. Liu, Y. Zhang, Y. Wang, F. Hou, J. Yuan, J. Tian, Y. Zhang, Z. Shi, [29] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou,
J. Fan, and Z. He, ‘‘A survey of visual transformers,’’ IEEE ‘‘Training data-efficient image transformers & distillation through
Trans. Neural Netw. Learn. Syst., vol. 35, no. 6, pp. 7478–7498, attention,’’ in Proc. Int. Conf. Mach. Learn., 2021, pp. 10347–10357.
Jun. 2024. [30] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and
[9] Y. Li, J. Wang, X. Dai, L. Wang, C.-C. Michael Yeh, Y. Zheng, W. Zhang, H. Jégou, ‘‘Training data-efficient image transformers & distillation
and K.-L. Ma, ‘‘How does attention work in vision transformers? A visual through attention,’’ 2021, arXiv:2012.12877.
analytics attempt,’’ 2023, arXiv:2303.13731. [31] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and
[10] E. Voita, D. Talbot, F. Moiseev, R. Sennrich, and I. Titov, ‘‘Analyzing B. Guo, ‘‘Swin transformer: Hierarchical vision transformer using shifted
multi-head self-attention: Specialized heads do the heavy lifting, the rest windows,’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021,
can be pruned,’’ in Proc. 57th Annu. Meeting Assoc. Comput. Linguistics, pp. 10012–10022.
A. Korhonen, D. Traum, and L. Màrquez, Eds., Florence, Italy: [32] T. Jumphoo, K. Phapatanaburi, W. Pathonsuwan, P. Anchuen,
Association for Computational Linguistics, Jul. 2019, pp. 5797–5808. M. Uthansakul, and P. Uthansakul, ‘‘Exploiting data-efficient image
[Online]. Available: https://aclanthology.org/P19-1580/ transformer-based transfer learning for valvular heart diseases detection,’’
[11] K. Han, Y. Wang, H. Chen, X. Chen, J. Guo, Z. Liu, Y. Tang, A. Xiao, IEEE Access, vol. 12, pp. 15845–15855, 2024.
C. Xu, Y. Xu, Z. Yang, Y. Zhang, and D. Tao, ‘‘A survey on vision [33] H. Bao, L. Dong, S. Piao, and F. Wei, ‘‘BEiT: BERT pre-training of image
transformer,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 1, transformers,’’ 2021, arXiv:2106.08254.
pp. 87–110, Jan. 2023.
[34] A. Ashourvan, Q. K. Telesford, T. Verstynen, J. M. Vettel, and
[12] Y. Gündüç, ‘‘Tensor-to-image: Image-to-image translation with vision D. S. Bassett, ‘‘Multi-scale detection of hierarchical community archi-
transformers,’’ 2021, arXiv:2110.08037. tecture in structural and functional brain networks,’’ PLoS ONE, vol. 14,
[13] X. Chu, Z. Tian, B. Zhang, X. Wang, and C. Shen, ‘‘Conditional positional no. 5, May 2019, Art. no. e0215520.
encodings for vision transformers,’’ 2021, arXiv:2102.10882. [35] S. Mehta and M. Rastegari, ‘‘MobileViT: Light-weight, general-purpose,
[14] R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, and mobile-friendly vision transformer,’’ 2021, arXiv:2110.02178.
Y. Lan, L. Wang, and T.-Y. Liu, ‘‘On layer normalization in the [36] W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, and
transformer architecture,’’ 2020, arXiv:2002.04745. L. Shao, ‘‘Pyramid vision transformer: A versatile backbone for dense
[15] H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, and prediction without convolutions,’’ 2021, arXiv:2102.12122.
L. Zhang, ‘‘CvT: Introducing convolutions to vision transformers,’’ 2021,
[37] J. Li, Y. Bao, W. Liu, P. Ji, L. Wang, and Z. Wang, ‘‘Twins
arXiv:2103.15808.
transformer: Cross-attention based two-branch transformer network for
[16] S. Chen, C. Ge, Z. Tong, J. Wang, Y. Song, J. Wang, and P. Luo, ‘‘Adapt- rotating bearing fault diagnosis,’’ Measurement, vol. 223, Dec. 2023,
Former: Adapting vision transformers for scalable visual recognition,’’ in Art. no. 113687.
Proc. Adv. Neural Inf. Process. Syst., vol. 35, 2022, pp. 16664–16678.
[38] W. Wang, W. Chen, Q. Qiu, L. Chen, B. Wu, B. Lin, X. He, and
[17] A. Shokouhmand, H. Wen, S. Khan, J. A. Puma, A. Patel, P. Green, W. Liu, ‘‘CrossFormer++: A versatile vision transformer hinging on
F. Ayazi, and N. Ebadi, ‘‘Diagnosis of coexisting valvular heart cross-scale attention,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 46,
diseases using image-to-sequence translation of contact microphone no. 5, pp. 3123–3136, May 2024.
recordings,’’ IEEE Trans. Biomed. Eng., vol. 70, no. 9, pp. 2540–2551,
[39] B. Graham, A. El-Nouby, H. Touvron, P. Stock, A. Joulin, H. Jégou, and
Sep. 2023.
M. Douze, ‘‘LeViT: A vision transformer in ConvNet’s clothing for faster
[18] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng,
inference,’’ 2021, arXiv:2104.01136.
T. Xiang, P. H. S. Torr, and L. Zhang, ‘‘Rethinking semantic segmentation
[40] H. Touvron, M. Cord, A. Sablayrolles, G. Synnaeve, and H. Jégou,
from a sequence-to-sequence perspective with transformers,’’ in Proc.
‘‘Going deeper with image transformers,’’ 2021, arXiv:2103.17239.
IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021,
pp. 6881–6890. [41] Z. Tu, H. Talebi, H. Zhang, F. Yang, P. Milanfar, A. Bovik, and Y. Li,
[19] F. D. Keles, P. M. Wijewardena, and C. Hegde, ‘‘On the computational ‘‘MaxViT: Multi-axis vision transformer,’’ 2022, arXiv:2204.01697.
complexity of self-attention,’’ in Proc. 34th Int. Conf. Algorithmic Learn. [42] L. Wang and A. Tien, ‘‘Aerial image object detection with vision
Theory, 2023, pp. 597–619. transformer detector (ViTDet),’’ in Proc. IEEE Int. Geosci. Remote Sens.
[20] C. Esteves, M. Suhail, and A. Makadia, ‘‘Spectral image tokenizer,’’ 2024, Symp. (IGARSS), Jul. 2023, pp. 6450–6453.
arXiv:2412.09607. [43] A. Abdelrahman and S. Viriri, ‘‘FPN-SE-ResNet model for accurate
[21] K. Jiang, P. Peng, Y. Lian, and W. Xu, ‘‘The encoding method of position diagnosis of kidney tumors using CT images,’’ Appl. Sci., vol. 13, no. 17,
embeddings in vision transformer,’’ J. Vis. Commun. Image Represent., p. 9802, Aug. 2023.
vol. 89, Nov. 2022, Art. no. 103664. [44] P. Ramachandran, N. Parmar, A. Vaswani, I. Bello, A. Levskaya, and
[22] C. J. B. Hernndez, D. A. Sierra, S. Varrette, and D. L. Pacheco, ‘‘Energy J. Shlens, ‘‘Stand-alone self-attention in vision models,’’ in Proc. Adv.
efficiency on scalable computing architectures,’’ in Proc. IEEE 11th Int. Neural Inf. Process. Syst., vol. 32, 2019, pp. 1–13.
Conf. Comput. Inf. Technol., Aug. 2011, pp. 635–640. [45] Q. Guan, Y. Huang, Z. Zhong, Z. Zheng, L. Zheng, and Y. Yang,
[23] Y. Xu, Z. Zhang, M. Zhang, K. Sheng, K. Li, W. Dong, L. Zhang, ‘‘Diagnose like a radiologist: Attention guided convolutional neural
C. Xu, and X. Sun, ‘‘Evo-ViT: Slow-fast token evolution for dynamic network for thorax disease classification,’’ 2018, arXiv:1801.09927.
vision transformer,’’ in Proc. AAAI Conf. Artif. Intell., Jun. 2022, vol. 36, [46] J. Hu, L. Shen, and G. Sun, ‘‘Squeeze-and-excitation networks,’’ in
no. 3, pp. 2964–2972. Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018,
[24] X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer, ‘‘Scaling vision pp. 7132–7141.
transformers,’’ 2021, arXiv:2106.04560. [47] J. Saumya, A. L. Nicholas, N. Lee, and H. T. Philip, ‘‘Learn to pay
[25] R. Xu, S. Hu, H. Wan, Y. Xie, Y. Cai, and J. Wen, ‘‘A unified deep learning attention,’’ in Proc. Int. Conf. Learn. Represent., 2018, pp. 1209–1222.
framework for water quality prediction based on time-frequency feature [48] K. Han, J. Guo, C. Zhang, and M. Zhu, ‘‘Attribute-aware attention model
extraction and data feature enhancement,’’ J. Environ. Manage., vol. 351, for fine-grained representation learning,’’ in Proc. 26th ACM Int. Conf.
Feb. 2024, Art. no. 119894. Multimedia, Oct. 2018, pp. 2040–2048.

[49] A. B. Amjoud and M. Amrouch, ‘‘Object detection using deep learning, [74] Z. Yu, D. Jin, Z. Liu, D. He, X. Wang, H. Tong, and J. Han, ‘‘AS-GCN:
CNNs and vision transformers: A review,’’ IEEE Access, vol. 11, Adaptive semantic architecture of graph convolutional networks for text-
pp. 35479–35516, 2023. rich networks,’’ in Proc. IEEE Int. Conf. Data Mining (ICDM), Dec. 2021,
[50] M. Chen, A. Radford, R. Child, J. Wu, H. Jun, D. Luan, and I. Sutskever, pp. 837–846.
‘‘Generative pretraining from pixels,’’ in Proc. Int. Conf. Mach. Learn., [75] O. Keskes and R. Noumeir, ‘‘Vision-based fall detection using ST-GCN,’’
Jul. 2020, pp. 1691–1703. IEEE Access, vol. 9, pp. 28224–28236, 2021.
[51] F. O. Giuste and J. C. Vizcarra, ‘‘CIFAR-10 image classification using [76] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai,
feature ensembles,’’ 2020, arXiv:2002.03846. T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly,
[52] Y. Shima, ‘‘Image augmentation for object image classification based on J. Uszkoreit, and N. Houlsby, ‘‘An image is worth 16×16 words:
combination of pre-trained CNN and SVM,’’ in Proc. J. Phys., Conf., Transformers for image recognition at scale,’’ 2020, arXiv:2010.11929.
2018, vol. 1004, no. 1, Art. no. 012001. [77] M. Li, ‘‘Transformer-based self-supervised learning and distillation for
[53] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, ‘‘You only look once: medical image classification: Improving colorectal cancer detection on
Unified, real-time object detection,’’ in Proc. IEEE Conf. Comput. Vis. NCT-CRC-HE-100K with Swin-T V2,’’ in Proc. 3rd Int. Conf. Cloud
Pattern Recognit. (CVPR), Jun. 2016, pp. 779–788. Comput., Big Data Appl. Softw. Eng. (CBASE), Oct. 2024, pp. 644–648.
[54] Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, ‘‘YOLOX: Exceeding YOLO
[78] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam,
series in 2021,’’ 2021, arXiv:2107.08430.
‘‘Encoder-decoder with atrous separable convolution for semantic image
[55] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and segmentation,’’ 2018, arXiv:1802.02611.
A. C. Berg, ‘‘SSD: Single shot multibox detector,’’ in Proc. 14th Eur.
Conf. Comput. Vis. (ECCV). Amsterdam, The Netherlands: Springer, [79] Q. Tong, Z. Zhu, M. Zhang, K. Cao, and H. Xing, ‘‘Cross former embed-
2016, pp. 21–37. ding DeepLabv3+ for remote sensing images semantic segmentation,’’
Comput., Mater. Continua, vol. 79, no. 1, pp. 1353–1375, 2024.
[56] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and
S. Zagoruyko, ‘‘End-to-end object detection with transformers,’’ in Proc. [80] Y. Xu, Y. Xia, Q. Zhao, K. Yang, and Q. Li, ‘‘A road crack segmentation
Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2020, pp. 213–229. method based on transformer and multi-scale feature fusion,’’ Electronics,
[57] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, ‘‘Deformable vol. 13, no. 12, p. 2257, Jun. 2024.
DETR: Deformable transformers for end-to-end object detection,’’ 2020, [81] S. Akcay, A. Atapour-Abarghouei, and T. P. Breckon, ‘‘GANomaly:
arXiv:2010.04159. Semi-supervised anomaly detection via adversarial training,’’ 2018,
[58] M. Zheng, P. Gao, R. Zhang, K. Li, X. Wang, H. Li, and H. Dong, ‘‘End- arXiv:1805.06725.
to-end object detection with adaptive clustering transformer,’’ 2020, [82] A. Luiz B. Vieira e Silva, F. Simões, D. Kowerko, T. Schlosser,
arXiv:2011.09315. F. Battisti, and V. Teichrieb, ‘‘Attention modules improve mod-
[59] W.-H. Li and H. Bilen, ‘‘Knowledge distillation for multi-task learning,’’ ern image-level anomaly detection: A DifferNet case study,’’ 2024,
in Proc. Eur. Conf. Comput. Vis. (ECCV) Workshops. Glasgow, U.K.: arXiv:2401.08686.
Springer, 2020, pp. 163–176. [83] F. Wu and S. Xu, ‘‘Mask-patchcore: A robust anomaly detection model
[60] Z. Sun, S. Cao, Y. Yang, and K. Kitani, ‘‘Rethinking transformer-based set focusing on interested region,’’ in Proc. 16th Int. Conf. Graph. Image
prediction for object detection,’’ in Proc. IEEE/CVF Int. Conf. Comput. Process. (ICGIP), vol. 13539. Bellingham, WA, USA: SPIE, 2025,
Vis. (ICCV), Oct. 2021, pp. 3611–3620. pp. 88–99.
[61] Z. Dai, B. Cai, Y. Lin, and J. Chen, ‘‘UP-DETR: Unsupervised pre- [84] J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte,
training for object detection with transformers,’’ in Proc. IEEE/CVF Conf. ‘‘SwinIR: Image restoration using Swin transformer,’’ in Proc. IEEE/CVF
Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 1601–1610. Int. Conf. Comput. Vis. Workshops (ICCVW), Oct. 2021, pp. 1833–1844.
[62] H. Wang, Y. Zhu, H. Adam, A. Yuille, and L.-C. Chen, ‘‘MaX-DeepLab: [85] X. Kong, C. Dong, and L. Zhang, ‘‘Towards effective multiple-in-one
End-to-end panoptic segmentation with mask transformers,’’ in Proc. image restoration: A sequential and prompt learning strategy,’’ 2024,
IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, arXiv:2401.03379.
pp. 5463–5474. [86] H. Choi, C. Na, J. Oh, S. Lee, J. Kim, S. Choe, J. Lee, T. Kim, and
[63] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár, ‘‘Panoptic J. Yang, ‘‘Reciprocal attention mixing transformer for lightweight image
segmentation,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. restoration,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit.
(CVPR), Jun. 2019, pp. 9404–9413. Workshops (CVPRW), Jun. 2024, pp. 5992–6002.
[64] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, ‘‘BiSeNet: Bilateral [87] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila,
segmentation network for real-time semantic segmentation,’’ in Proc. Eur. ‘‘Analyzing and improving the image quality of StyleGAN,’’ in Proc.
Conf. Comput. Vis. (ECCV), Sep. 2018, pp. 325–341. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020,
[65] J. Li, X. Liu, M. Zhang, and D. Wang, ‘‘Spatio-temporal deformable pp. 8107–8116.
3D ConvNets with attention for action recognition,’’ Pattern Recognit.,
[88] A. Bhattad, D. McKee, D. Hoiem, and D. A. Forsyth, ‘‘StyleGAN knows
vol. 98, Feb. 2020, Art. no. 107037.
normal, depth, albedo, and more,’’ in Proc. Adv. Neural Inf. Process. Syst.,
[66] Y. Kong and Y. Fu, ‘‘Human action recognition and prediction: A survey,’’
vol. 36, 2023, pp. 73082–73103.
Int. J. Comput. Vis., vol. 130, no. 5, pp. 1366–1401, May 2022.
[89] J. Wang, Y. Jiang, Z. Yuan, B. Peng, Z. Wu, and Y.-G. Jiang,
[67] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black, ‘‘Towards
‘‘OmniTokenizer: A joint image-video tokenizer for visual generation,’’
understanding action recognition,’’ in Proc. IEEE Int. Conf. Comput. Vis.,
in Proc. Adv. Neural Inf. Process. Syst., vol. 37, 2024, pp. 28281–28295.
Dec. 2013, pp. 3192–3199.
[68] S. Herath, M. Harandi, and F. Porikli, ‘‘Going deeper into action [90] A. Hatamizadeh, V. Nath, Y. Tang, D. Yang, H. Roth, and D. Xu, ‘‘Swin
recognition: A survey,’’ Image Vis. Comput., vol. 60, pp. 4–21, Apr. 2017. UNETR: Swin transformers for semantic segmentation of brain tumors
[69] Q. Wang, K. Zhang, and M. A. Asghar, ‘‘Skeleton-based ST-GCN for in MRI images,’’ 2022, arXiv:2201.01266.
human action recognition with extended skeleton graph and partitioning [91] J. Chen, J. Mei, X. Li, Y. Lu, Q. Yu, Q. Wei, X. Luo, Y. Xie,
strategy,’’ IEEE Access, vol. 10, pp. 41403–41410, 2022. E. Adeli, Y. Wang, M. Lungren, L. Xing, L. Lu, A. Yuille, and Y. Zhou,
[70] E. Oyallon and J. Rabin, ‘‘An analysis of the SURF method,’’ Image ‘‘3D TransUNet: Advancing medical image segmentation through vision
Process. Line, vol. 5, pp. 176–218, Jul. 2015. transformers,’’ 2023, arXiv:2310.07781.
[71] J. Wu, Z. Cui, V. S. Sheng, P. Zhao, D. Su, and S. Gong, ‘‘A comparative [92] D. Saadati, O. Nejati Manzari, and S. Mirzakuchaki, ‘‘Dilated-UNet: A
study of SIFT and its variants,’’ Meas. Sci. Rev., vol. 13, no. 3, fast and accurate medical image segmentation approach using a dilated
pp. 122–131, Jun. 2013. transformer and U-Net architecture,’’ 2023, arXiv:2304.11450.
[72] S. Yan, Y. Xiong, and D. Lin, ‘‘Spatial temporal graph convolutional [93] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, ‘‘Beyond a Gaussian
networks for skeleton-based action recognition,’’ in Proc. AAAI Conf. denoiser: Residual learning of deep CNN for image denoising,’’ IEEE
Artif. Intell., 2018, vol. 32, no. 1, pp. 1–9. Trans. Image Process., vol. 26, no. 7, pp. 3142–3155, Jul. 2017.
[73] M. Li, S. Chen, X. Chen, Y. Zhang, Y. Wang, and Q. Tian, ‘‘Actional- [94] Z. Wu, W. Shi, L. Xu, Z. Ding, T. Liu, Z. Zhang, and B. Zheng, ‘‘DIFNet:
structural graph convolutional networks for skeleton-based action Dual-domain information fusion network for image denoising,’’ in Proc.
recognition,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Chin. Conf. Pattern Recognit. Comput. Vis. (PRCV). Singapore: Springer,
(CVPR), Jun. 2019, pp. 3595–3603. 2024, pp. 279–293.

VOLUME 13, 2025 95521


B. Palanisamy et al.: Transformers for Vision: A Survey on Innovative Methods for Computer Vision

BALAMURUGAN PALANISAMY is currently pursuing the Ph.D. degree with the Department of Electrical and Electronics Engineering, Birla Institute of Technology and Science, Pilani, Rajasthan, India. His research interests include natural language processing, deep learning, and generative models.

VIKAS HASSIJA received the B.Tech. degree from M. D. U. University, Rohtak, India, in 2010, the M.S. degree in telecommunication and software engineering from the Birla Institute of Technology and Science (BITS), Pilani, India, in 2014, and the Ph.D. degree in IoT security and blockchain from the Jaypee Institute of Information Technology (JIIT), Noida. He completed his postdoctoral research with the National University of Singapore, Singapore. He is currently an Associate Professor with KIIT, Bhubaneswar, and previously worked as an Assistant Professor with JIIT for four years. He has eight years of industry experience and has worked with telecommunication companies such as Tech Mahindra and Accenture. His research interests include IoT security, network security, blockchain, and distributed computing.

ARPITA CHATTERJEE is currently pursuing the bachelor's degree in computer science and engineering with the Kalinga Institute of Industrial Technology. She is also a research intern with the Department of Electrical and Electronics Engineering, Birla Institute of Technology and Science (BITS), Pilani, under the supervision of Dr. G. S. S. Chalapathi. She has a keen interest in algorithms and data structures and, in addition to her academic pursuits, actively participates in hackathons such as the Smart India Hackathon (SIH). She is passionate about data science and aims to contribute to the field through innovative research and hands-on experience.
ARPITA MANDAL is currently pursuing the bachelor's degree in computer science and engineering with the Kalinga Institute of Industrial Technology. She is also a research intern with the Department of Electrical and Electronics Engineering, Birla Institute of Technology and Science (BITS), Pilani, under the supervision of Dr. G. S. S. Chalapathi. Through coursework and coding practice, she has built a solid foundation in programming, algorithms, and data structures, and has contributed to several research and academic initiatives with a focus on full-stack web development. She is eager to explore CNNs and vision transformers (ViTs) further and to apply her skills to real-world challenges in the IT industry.

DEBANSHI CHAKRABORTY is currently pursuing the bachelor's degree in computer science and engineering with the Kalinga Institute of Industrial Technology. She is also a research intern with the Department of Electrical and Electronics Engineering, Birla Institute of Technology and Science (BITS), Pilani, under the supervision of Dr. G. S. S. Chalapathi. She is deeply interested in machine learning, is actively researching generative AI using ML tools, and has developed expertise in full-stack web development. Through academic projects and research initiatives, she aims to apply her skills to real-world challenges and to further explore advancements in AI and web technologies.

AMIT PANDEY received the M.Tech. degree in computer science from the Deenbandhu Chhotu Ram University of Science and Technology, Sonipat, Haryana, India. He is currently a Research Scholar with the School of Computer Science, Engineering and Technology, Bennett University, Greater Noida, India. His research interests include semantic segmentation for biomedical images and explainable artificial intelligence.

G. S. S. CHALAPATHI (Senior Member, IEEE) received the B.E. degree (Hons.) in electrical and electronics engineering from the Birla Institute of Technology and Science (BITS), Pilani, in 2009, and the M.E. degree in embedded systems and the Ph.D. degree from BITS Pilani, in 2011 and 2019, respectively. He carried out his postdoctoral research with The University of Melbourne, Australia, under the supervision of Prof. Rajkumar Buyya, a Distinguished Professor with The University of Melbourne. During his doctoral studies, he was a Visiting Researcher with the National University of Singapore and Johannes Kepler University, Austria. He is currently an Assistant Professor with the Department of Electrical and Electronics Engineering, BITS Pilani. He has published in reputed journals, such as IEEE WIRELESS COMMUNICATIONS LETTERS, IEEE SENSORS JOURNAL, and FUTURE GENERATION COMPUTER SYSTEMS. His research interests include UAVs, precision agriculture, and embedded systems. He is a member of ACM and a Reviewer of IEEE INTERNET OF THINGS JOURNAL and IEEE ACCESS.

DHRUV KUMAR received the Ph.D. degree in computer science, focusing on large-scale data analytics, from the University of Minnesota (UMN), USA. He is currently an Assistant Professor with the Department of Computer Science and Information Systems, BITS Pilani, Rajasthan, India. He has published over 20 papers in internationally recognized journals and conferences. His current research interests include generative AI and AI agents.