Lightweight transformer image feature extraction network
Wenfeng Zheng1, Siyu Lu1, Youshuai Yang1, Zhengtong Yin2 and Lirong Yin3
1 School of Automation, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
2 College of Resource and Environment Engineering, Guizhou University, Guiyang, Guizhou, China
3 Department of Geography and Anthropology, Louisiana State University, Baton Rouge, LA, United States of America
ABSTRACT
In recent years, the image feature extraction method based on Transformer has become
a research hotspot. However, when using Transformer for image feature extraction,
the model’s complexity increases quadratically with the number of tokens entered.
The quadratic complexity prevents vision transformer-based backbone networks from
modelling high-resolution images and is computationally expensive. To address this
issue, this study proposes two approaches to speed up Transformer models. Firstly, the
self-attention mechanism’s quadratic complexity is reduced to linear, enhancing the
model’s internal processing speed. Next, a parameter-less lightweight pruning method
is introduced, which adaptively samples input images to filter out unimportant tokens,
effectively reducing irrelevant input. Finally, these two methods are combined to create
an efficient attention mechanism. Experimental results demonstrate that the combined
methods can reduce the computation of the original Transformer model by 30%–50%,
while the efficient attention mechanism achieves an impressive 60%–70% reduction in
computation.
vision (CV) tasks and extract features from images using Transformers (Liang et al., 2021;
Chen, Fan & Panda, 2021; Guo et al., 2021). Compared with CNNs, Transformers can model
longer-distance dependencies without being limited by local interactions, support parallel
computing, and have achieved good experimental results in various visual tasks.
For many vision tasks, the final accuracy and time efficiency largely depend on the
network responsible for extracting image features. Therefore, it is very important to design
a good image feature extraction network. In the past few years, work on designing
Transformer-based image feature extraction networks has begun to emerge (Han et al.,
2023; Khan et al., 2022). However, every Transformer-based model has a bottleneck:
given a token sequence as input, the self-attention mechanism associates every pair of
tokens to iteratively learn feature representations, which makes the model's time and
space complexity quadratic in the number of input tokens. This quadratic complexity
prevents Transformer from modelling high-resolution images, and the high
computational cost poses challenges for deploying it on edge devices.
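As a rough illustration of this quadratic growth (a hedged sketch; the common 16 × 16 patch size is assumed here, since the text does not specify one), the number of attention-matrix entries grows as follows:

# Rough illustration of the quadratic bottleneck: the self-attention matrix is
# n x n, where n is the number of tokens. A 16 x 16 patch size is assumed.
for side in (224, 384, 1024):
    n = (side // 16) ** 2
    print(f"{side}x{side} image -> {n} tokens -> {n * n:,} attention entries")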
In academia, most existing linear attention mechanisms aim to approximate the
Softmax operator. RFA (Peng et al., 2021) uses random Fourier features, and Performer
(Krzysztof et al., 2021) uses positive random features to approximate the Softmax operator.
Nevertheless, empirical evidence shows that high sampling rates can make these methods
unstable. These methods achieve linear attention by approximating the Softmax operator
effectively only within a constrained theoretical range, so if the corresponding assumptions
are not fulfilled, or approximation errors accumulate, they do not always outperform the
ordinary structure.
Currently, many pruning methods for Transformer-based image feature extraction
networks that operate on the token dimension introduce an additional neural network
that must be trained to compute a token score, which is then used to judge whether each
token is redundant or should be kept. Dynamic ViT (Rao et al., 2021) is one such example.
However, methods of this type always reduce tokens by a fixed ratio at each stage. While
this approach does externally reduce the computational burden of Transformer-based
image feature extraction networks, it also introduces additional computational cost: the
scoring network must be trained jointly with the Transformer-based image feature
extraction network, and additional hyperparameters and loss terms must be added to the
loss function. Another limitation is the reliance on a fixed pruning ratio: when the model
needs to be deployed on different edge devices, the network must be retrained, which
greatly limits its application scenarios.
This study proposes a Transformer image feature extraction network based on linear
attention and pruning. First, from an external perspective, each input token is individually
scored to estimate its importance to the class token used for classification. Some tokens
are then retained by sampling according to these scores, pruning along the token
dimension. Then, from an internal perspective, a combination function is employed to
substitute for the Softmax operator in calculating the self-attention matrix, resulting in a
linear attention mechanism.
RELATED WORK
In 2017, Google proposed Transformer (Vaswani et al., 2017), a milestone model in the
field of natural language processing. In October 2020, Dosovitskiy et al. (2021) proposed
the ViT (Vision Transformer) model, which was the first to introduce Transformer into
the visual field, proving that Transformer can also achieve good results as a network
for extracting image features. From then on, the research on Transformer as an image
feature extraction network has entered a rapid development process, and the development
direction can be divided into two categories, namely the improvement of training strategies
and models. The training strategy aspect refers to the improvement of the ViT model during
the training process, while the model aspect refers to the improvement of various modules
in the image feature extraction network designed based on Transformer. The current
mainstream training strategy mainly follows the DeiT model (Touvron et al., 2021a),
proposed by Touvron in December 2020 to address the drawback that ViT requires
pre-training on the large JFT-300M dataset. The core improvement of DeiT is to introduce
distillation learning into the training process of the Transformer model, together with a
set of hyperparameters that give good experimental results. Since then, most Transformer
models have adopted this set of hyperparameters in their experiments, and the
hyperparameter settings in this article's experiments are the same. At present,
improvements to the self-attention mechanism can be divided into two directions:
improving global attention and introducing additional local attention. (1) Improving
global attention. The most typical global attention is the multi-head attention in the
Transformer model. When the resolution of the input image is high, the number of tokens
is large and the cost of computing attention is high. Taking PVT (Wang et al., 2021) as an
example, it proposes a spatial reduction module that downsamples the key and value
before attention to reduce this cost.
METHOD
Lightweight based on linear attention
This section internally reduces the complexity of Transformer’s attention mechanism to
linear, and designs a combination function to replace the original Softmax operator.
where Q ∈ R^(n×d), K ∈ R^(n×d), V ∈ R^(n×d) are calculated from the input, respectively.
An in-depth analysis of Eq. (2) shows that it is the Softmax operator used in Eq. (2)
that restricts the performance of the scaled dot-product attention mechanism and makes
it quadratic: QK^T must be computed first, yielding an n × n matrix, so the complexity is
O(n^2). If there were no Softmax operator, the three matrices Q, K^T and V would simply
be multiplied, and by the associative law of matrix multiplication the last two matrices,
K^T V, could be computed first to obtain a d_k × d_v matrix, which is then left-multiplied
by Q. Since in practice d_k and d_v are much smaller than n, the overall complexity
becomes O(n). The Softmax operator is defined as shown in Eq. (3):
Softmax(z_i) = e^{z_i} / Σ_{c=1}^{C} e^{z_c}    (3)
where z_i represents the output value of the i-th node and C denotes the total number of
output nodes. The Softmax operator converts the output values over multiple classes into
a probability distribution whose entries lie between 0 and 1 and sum to 1. Introducing
Eq. (3), Eq. (2) can be rewritten as shown in Eq. (4):
Att(Q,K,V)_i = ( Σ_{j=1}^{n} S(q_i, k_j) v_j ) / ( Σ_{j=1}^{n} S(q_i, k_j) )    (4)
The goal is to design a similarity function S(·,·) that can serve as a replacement for the
Softmax operator while preserving the experimental effect. To achieve this, it is necessary
to identify the crucial characteristics of the current attention mechanism and design a
decomposable similarity function that fulfills the prerequisites for linear attention.
In the formula, f(·) and g(·) are two self-defined functions that ensure the above two
properties, defined as shown in Eqs. (6) and (7):
g(q_i, k_j) = q_i k_j^T sin( π(i + n − j) / (2n) )    (7)
where i,j =1 ,...,n.
Equation (6) is used to ensure the non-negativity of the attention matrix.
Equation (7) is formulated in this form for several reasons:
(1) Local bias is realized. When i and j are close, the corresponding trigonometric
function value is close to 1; when i and j are far apart, their difference approaches n and
the trigonometric function value is close to 0, so the corresponding similarity is negligible
(see the short check after this list);
(2) The trigonometric function itself is nonlinear, so the design can concentrate the
distribution of attention weights and thereby stabilize the training process;
(3) Eq. (7) can be decomposed by the sum-difference product formula, so that the
associative law of matrix multiplication can be used to reduce the complexity of the
attention mechanism to linear.
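As a quick numerical check of property (1) (an illustrative sketch; n = 100 and the index pairs are arbitrary choices):

import math

# The weight sin(pi*(i + n - j)/(2n)) from Eq. (7) is ~1 when i == j
# and decays towards 0 as |i - j| approaches n (here n = 100).
n = 100
for i, j in [(10, 10), (10, 20), (10, 90), (1, 100)]:
    w = math.sin(math.pi * (i + n - j) / (2 * n))
    print(f"i={i:3d}, j={j:3d}, weight={w:.3f}")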
Equation (8) shows the decomposition of the sub-function g(·), which satisfies the
non-linear reweighting property:
g(q'_i, k'_j) = q'_i k'_j^T sin( π(i + n − j) / (2n) )
             = q'_i k'_j^T sin( π(i − j) / (2n) + π/2 )
             = q'_i k'_j^T ( cos(πi/(2n)) cos(πj/(2n)) + sin(πi/(2n)) sin(πj/(2n)) )
             = ( q'_i cos(πi/(2n)) ) ( k'_j^T cos(πj/(2n)) ) + ( q'_i sin(πi/(2n)) ) ( k'_j^T sin(πj/(2n)) )    (8)
Among them, q'_i = f(q_i), k'_j = f(k_j), and f(·) is the function defined in Eq. (6).
Let q_i^cos = q'_i cos(πi/(2n)), q_i^sin = q'_i sin(πi/(2n)), k_j^cos = k'_j^T cos(πj/(2n)) and
k_j^sin = k'_j^T sin(πj/(2n)). Apparently, q_i^cos and q_i^sin are transformed from the i-th
column vector q_i of Q through the self-defined non-negative function f(·) and the cosine
or sine function. Denote the Q after this transformation as Q^cos and Q^sin. Similarly, K
after transformation is K^cos and K^sin. Thus, the final expression after linear
decomposition is obtained, as shown in Eq. (9):
Att(Q,K,V) = g(f(Q,K))V = Q^cos K^cos V + Q^sin K^sin V    (9)
The attention calculation in Eq. (9) proposed in this section can make use of the
associative law of matrix multiplication to reduce the computational complexity of the
attention mechanism to linear. Figure 3 shows the calculation process.
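To make the computation concrete, the following is a minimal single-head PyTorch sketch of Eqs. (4) and (9). It is an illustration, not the authors' released implementation: the non-negative feature map f(·) follows Eq. (17), the cos/sin position weights follow Eq. (8), and the tensor shapes, 1-based positions, and the small epsilon added for numerical stability are our own assumptions.

import math
import torch

def f(x):
    # Non-negative feature map (as in Eq. (17)): x + 1 for x >= 0, exp(x) otherwise.
    return torch.where(x >= 0, x + 1.0, torch.exp(x))

def linear_attention(Q, K, V, eps=1e-6):
    # Q, K, V: (n, d) tensors for a single head.
    n = Q.shape[0]
    pos = torch.arange(1, n + 1, dtype=Q.dtype).unsqueeze(1)   # positions 1..n, shape (n, 1)
    cos_w = torch.cos(math.pi * pos / (2 * n))
    sin_w = torch.sin(math.pi * pos / (2 * n))

    Qp, Kp = f(Q), f(K)                      # q'_i = f(q_i), k'_j = f(k_j)
    Q_cos, Q_sin = Qp * cos_w, Qp * sin_w    # Q^cos, Q^sin
    K_cos, K_sin = Kp * cos_w, Kp * sin_w    # K^cos, K^sin

    # Associative law: compute the (d x d_v) products K^T V first, so the cost is
    # O(n * d * d_v) instead of the O(n^2 * d) cost of materialising the n x n matrix.
    numer = Q_cos @ (K_cos.transpose(0, 1) @ V) + Q_sin @ (K_sin.transpose(0, 1) @ V)

    # Normalisation from Eq. (4): divide by the row sums of the similarity matrix.
    denom = Q_cos @ K_cos.sum(dim=0, keepdim=True).transpose(0, 1) \
          + Q_sin @ K_sin.sum(dim=0, keepdim=True).transpose(0, 1)   # (n, 1)
    return numer / (denom + eps)

For small n, the result can be checked against an explicit (Q^cos K^cosᵀ + Q^sin K^sinᵀ)V computation; the two agree up to floating-point error, while the version above never forms the n × n matrix.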
images are first estimated with their respective scores, and then part of the tokens are
retained through sampling.
Score
Rather than training additional parameters to obtain scores for the input tokens, we can
rely on the information already available inside the Transformer-based image feature
extraction network. All tokens input to the self-attention layer inside the model are divided
into two parts. The first part contains only the class token, which is additionally introduced
to judge the category of the input image at the final stage and is placed at the first position
of the token sequence. The second part consists of all remaining tokens, which are obtained
by splitting and transforming the input image.
Pruning along the token dimension means reducing the second part of the tokens, those
converted from the image, as much as possible; the number of tokens in this part is
recorded as m. That is to say, there are m + 1 tokens at the input and r_0 + 1 tokens after
the output, where r_0 is the number of second-part tokens retained after pruning and
satisfies r_0 ≤ R ≤ m. Here R is a parameter given in advance whose function is to control
the maximum number of tokens retained after sampling.
In the standard self-attention layer, Q ∈ R^((m+1)×d_q), K ∈ R^((m+1)×d_k) and
V ∈ R^((m+1)×d_v) are calculated from the input tokens, recorded as X ∈ R^((m+1)×d).
The self-attention matrix A is calculated from Q and K^T. Note that the attention
calculation at this point is still the standard scaled dot-product self-attention. The
attention matrix captures the similarity, or degree of association, among all input tokens.
For example, in the self-attention matrix A, the elements at [i, j] and [j, i] represent the
degree of correlation between the i-th and j-th tokens among all currently input tokens.
The larger the value, the more related the two tokens are, that is, the more important each
is to the other.
In the equation, a_{1,j} and a_{1,i} represent the elements in the first row of the
self-attention matrix A at columns j and i, respectively, and h_j is the score of the j-th
token. In addition, for the multi-head attention layer, each head's score can be calculated
separately and the scores of all heads can then be added together.
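As a hedged sketch of this parameter-free scoring (the exact scoring equation is not reproduced above, so the final normalisation step is an assumption), the scores can be read off the class-token row of the attention matrix and summed over heads:

import torch

def token_scores(attn):
    # attn: (heads, m + 1, m + 1) self-attention matrix A of one layer,
    # with the class token at index 0.
    cls_row = attn[:, 0, 1:]          # a_{1,j}: attention of the class token to each image token
    scores = cls_row.sum(dim=0)       # add the scores of all heads together
    return scores / scores.sum()      # assumed normalisation, so the scores form a distribution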
Sampling
In the previous section, the importance score of each token was obtained in a
parameter-free way. Now some tokens can be removed from the self-attention matrix
based on these scores. It is natural to first select the tokens with the highest scores, keep
them, and directly remove the remaining tokens with lower scores.
But this method has two problems. First, the number of tokens to remove is not easy to
determine, and it is difficult to adapt dynamically to different input images. Second, if
tokens with lower scores are directly removed from the beginning at a certain stage, those
removed tokens are not necessarily unimportant for the final classification. Just as a
CNN-based image feature extraction network extracts only shallow features such as edges
and textures at the beginning and extracts more advanced semantic features later, different
tokens may serve different functions and represent different meanings at different stages.
The current score is only obtained at the current stage; if a token is simply deleted because
of a low score at an intermediate stage, it cannot enter the subsequent stages, where it
might play an important role, and the final experimental results would suffer instead.
Therefore, according to these scores, some tokens can be randomly selected for retention
by sampling: a token with a lower score has a lower probability of being selected and
retained at the current stage, and vice versa. In terms of experimental results, this method
is indeed better than crudely deleting the tokens with lower scores, because the randomness
offsets, to a certain extent, the deficiency of directly deleting tokens.
The tokens then need to be sampled according to their scores: tokens with higher scores
have a greater probability of being retained, and tokens with lower scores have a lower
probability. That is, we find the probability distribution defined by these scores and sample
according to it. The cumulative distribution function (CDF) corresponding to these scores
can be calculated and inverted to obtain the inverse function of this distribution.
Note that the accumulation starts from the second token, because the first token, the
classification class token, must be retained. After obtaining the CDF, the sampling function
can be obtained from its inverse form, as shown in Eq. (13).
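Since Eq. (13) is not reproduced here, the following is only an illustrative PyTorch sketch of inverse-transform sampling over the token scores; keeping at most R tokens and always preserving the class token follow the description above, while the use of torch.searchsorted and torch.unique is an implementation assumption.

import torch

def sample_tokens(scores, R):
    # scores: (m,) importance scores of the non-class tokens; R: upper bound on kept tokens.
    probs = scores / scores.sum()
    cdf = torch.cumsum(probs, dim=0)                 # cumulative distribution of the scores
    u = torch.rand(R)                                # uniform samples in [0, 1)
    idx = torch.searchsorted(cdf, u)                 # inverse CDF maps each u to a token index
    idx = idx.clamp(max=scores.numel() - 1)          # guard against floating-point round-off
    kept = torch.unique(idx)                         # r0 <= R distinct tokens survive
    return torch.cat([kept.new_zeros(1), kept + 1])  # class token (index 0) is always retained

The returned indices can then be used to gather the retained rows of X before the next block, so that only r_0 + 1 tokens continue through the network.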
A more efficient attention mechanism (e-attention) is obtained after the fusion. Figure 4
shows the calculation process.
Dataset
This study conducts comparative verification experiments on the ImageNet1k dataset,
which is widely used in the field of image classification, and the COCO dataset, which is
commonly used in the field of target detection.
The ImageNet dataset is currently one of the most widely used image datasets in artificial
intelligence (Yang et al., 2022). Most work on image localization, classification, and
detection is based on ImageNet. It is widely used in computer vision, maintained by the
team at Stanford University, and easy to use. ImageNet contains more than 14 million
images in more than 20,000 categories. The ILSVRC competition uses a lightweight version
of the ImageNet dataset; in some papers this light version is called ImageNet 1K, meaning
that it contains 1,000 categories.
The COCO dataset is a large-scale detection and segmentation dataset maintained by
Microsoft, collected mainly from complex daily scenes (Lin et al., 2014). It mainly addresses
three problems: contextual relationships between targets, target detection, and precise
two-dimensional localization. It contains 91 categories. Although this number is much
smaller than in ImageNet, each category contains a very large number of pictures, and
more specific scenes can be obtained in each category. Containing 200,000 images with
more than 500,000 annotations for 80 of the 91 categories, it is arguably the most extensive
public object detection dataset. COCO contains 20 GB of pictures and 500 MB of labels,
and the ratio of the training, test, and validation sets is [Link].
EXPERIMENT
Environment and hyperparameter settings
The environment of the experiment is shown in Table 1.
The hyperparameter settings of all experimental models adopt the hyperparameters
provided by DeiT (Touvron et al., 2021a), which improve accuracy without changing the
structure of the ViT model, as shown in Table 2.
Evaluation indicators
The methods proposed in this study are all aimed at accelerating the model, and the
complexity and speed of the model need to be considered. Therefore, the three indicators
Table 1: Experimental environment.
Name                      Version
Editor                    VS Code
System environment        Windows 10 Education edition, 64-bit, 32 GB memory
Processor                 Intel(R) Core(TM) i7-9700K
Graphics processor        Nvidia GeForce RTX 2080Ti, 11 GB
Software                  Python 3.6
Neural network library    PyTorch
of Params, FLOPs and FPS are selected. On the other hand, to evaluate the model’s
performance, the two indicators of Top-1 Acc and mAP were also selected.
In the image classification task of computer vision, the quality of a classification model
is mainly judged by the accuracy and error rate of the model. The error rate includes the
Top-5 error and the Top-1 error, and the accuracy includes the Top-5 and Top-1 accuracy.
This study selects the Top-1 accuracy indicator to test the classification accuracy of the
model after introducing the two methods of the linear attention mechanism and token
pruning. In addition, this study selects the FLOPs indicator to describe the computational
cost of the model.
In the target detection task, the quality of a detection model is mainly evaluated through
the mAP value. On the other hand, when using various methods to speed up the detection
model and reduce its complexity, the corresponding performance indicators generally use
FPS and Params.
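For reference, the Top-1 accuracy and Params indicators can be computed with the short PyTorch sketch below; FLOPs and FPS are normally measured with external profiling tools and are not shown here.

import torch

def top1_accuracy(logits, labels):
    # Top-1 accuracy: fraction of samples whose highest-scoring class is the true label.
    return (logits.argmax(dim=1) == labels).float().mean().item()

def count_params(model):
    # Params indicator: total number of trainable parameters in the model.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)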
Benchmark model
In the experiment, six benchmark models were selected: DeiT (Touvron et al., 2021a),
PVT (Wang et al., 2021), Swin Transformer (Huang et al., 2021), TNT (Han et al., 2021),
T2T-ViT (Yuan et al., 2021) and CaiT (Touvron et al., 2021b). In the specific experiments,
different sizes of the six models were also compared with each other to further analyze
Table 2: Hyperparameter settings.
Hyperparameter                  Value/Choice
Epochs                          300
Batch size                      1024
Base learning rate              0.0005
Optimizer                       AdamW
Learning rate decay strategy    Cosine
Weight decay                    0.05
Dropout                         0.1
Warmup epochs                   5
Pruning method K                70%
the effect of these three innovations. A brief explanation will be provided for each of these
six models.
(1) DeiT: The DeiT model was proposed to alleviate the limitation of traditional
Transformer models, which achieve good performance and generalization only when
pre-trained on the large JFT-300M dataset of roughly 300 million images. The authors of
DeiT proposed a new training scheme combining the lightweight operation of distillation,
which introduces a distillation token that lets the student learn from the teacher through
attention, and proposed a Transformer-specific teacher-student strategy. When DeiT is
trained only on ImageNet, a convolution-free Transformer with good performance is
obtained. During the training of DeiT, the prediction vectors of the class token and the
distillation token, both introduced for classification, are averaged before being converted
into probability distributions.
(2) PVT: The PVT model was proposed to overcome the difficulties that traditional vision
Transformer models encounter when applied to dense prediction tasks. It is the first vision
Transformer model that can handle dense prediction tasks at different resolutions.
Compared with ViT, which consumes a large amount of computation, it uses progressive
pyramid reduction to reduce computational costs.
(3) Swin Transformer: The Swin Transformer model proposes a hierarchical Transformer
that uses shifted windows to compute feature representations, in order to address the large
variations in the resolution of visual entities. In the internal design of the model,
non-overlapping local windows are divided in advance, the computation of self-attention
is limited to these windows, and cross-window connections are combined to improve
model efficiency. This design can model images of different resolutions and significantly
reduces computational costs.
(4) TNT: The model TNT was proposed because traditional Transformer models ignore
the inherent information of the patches converted from input images. Therefore, TNT
models both patch level and pixel level feature representations.
(5) T2T-ViT: The T2T-ViT model was proposed because ViT cannot model important
local structures such as edges and lines between adjacent pixels. T2T-ViT models the local
structure around tokens by recursively aggregating adjacent tokens layer by layer, so that
multiple adjacent tokens are aggregated into a single token. At the same time, this reduces
the length of the token sequence and the computational cost.
(6) CaiT: CaiT makes it easier for deeper Transformers to converge and improves accuracy.
It proposes a new and efficient method for processing class tokens and makes two changes
to the Transformer architecture that significantly improve the accuracy of deep
Transformers.
g(q_i, k_j) = q_i k_j^T sin( π(i + n − j) / (2n) )    (15)

g(q_i, k_j) = q_i k_j^T    (16)
The second consideration is to keep only the non-linear reweighting scheme and discard
the non-negativity of the attention matrix. Therefore, Eq. (17) is changed to Eq. (18),
resulting in a new attention mechanism Ours-B that only retains the property of the
non-linear reweighting scheme.
f(x) = { x + 1,  x ≥ 0
       { e^x,    x < 0        (17)

f(x) = x    (18)
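As a hedged sketch, the two ablations each swap out one component of the full method (the paragraph introducing the first variant is not reproduced above, so pairing Eq. (16) with a variant named Ours-A is our reading of the ablation):

import math
import torch

def f_eq17(x):                 # Eq. (17): non-negative feature map used by the full method
    return torch.where(x >= 0, x + 1.0, torch.exp(x))

def f_eq18(x):                 # Eq. (18): identity map, non-negativity discarded (Ours-B)
    return x

def g_eq15(qi, kj, i, j, n):   # Eq. (15): dot product with the sine reweighting term
    return (qi @ kj) * math.sin(math.pi * (i + n - j) / (2 * n))

def g_eq16(qi, kj, i, j, n):   # Eq. (16): plain dot product, reweighting discarded
    return qi @ kj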
RESULTS
Experimental results of image classification
The experiments in this section take the average of three repeated runs on the ImageNet1k
dataset for the different benchmark models, so that the experimental results are as accurate
as possible and errors caused by uncertain factors are minimized. The experimental results
are shown in Table 4.
Next, the six models in the table above are selected. Token pruning, linear attention, or
e-attention is added to each of these six different Transformer models, and the FLOPs
indicators are then compared. Table 5 shows the results.
For the six selected models, the FLOPs indicator improves considerably whether either
improvement is introduced alone or the two are introduced together, because the two
methods were designed to accelerate the model from the external and internal perspectives,
respectively. For the DeiT-B model, changing the attention mechanism to linear alone
reduces FLOPs by 52%; introducing the pruning module alone reduces FLOPs by 32%;
and incorporating both improvements reduces FLOPs by 63% in total.
Specifically, for the remaining five models, PVT-M, Swin-S, TNT-S, T2T-ViT-19 and
CaiT-XS36, changing the attention mechanism to the linear attention mechanism alone
reduces FLOPs by 49%, 51%, 48%, 53%, and 52%, respectively. Introducing the lightweight
token pruning module alone reduces FLOPs by 33%, 35%, 33%, 31%, and 32%. Adding
both improvements reduces FLOPs by 62%, 62%, 59%, 65%, and 62%.
Next, the six models in the table above are selected again. Token pruning, linear attention,
or e-attention is added to each of these six different Transformer models, and the Top-1
Accuracy indicators are then compared, as shown in Table 7.
Whether either improvement is introduced alone or the two are introduced together, the
Top-1 Accuracy indicator declines to a certain degree, because compared with the original
model both improvements speed up the model at the expense of part of its performance.
Specifically, for the six models, the decline after changing the attention mechanism to a
linear attention mechanism alone is 1.3%, 1.5%, 1.8%, 1.9%, 2.5%, and 2.0%. After the
lightweight token pruning module is introduced alone, the declines are 0.2%, 0.2%, 0.3%,
0.2%, 0.4%, and 0.3%. After both improvements are added, the declines are 1.5%, 1.6%,
2.3%, 2.2%, 2.8%, and 2.2%.
Funding
This work was supported by the Sichuan Science and Technology Program (2021YFQ0003,
2023YFSY0026, 2023YFH0004). The funders had no role in study design, data collection
and analysis, decision to publish, or preparation of the manuscript.
Grant Disclosures
The following grant information was disclosed by the authors:
Sichuan Science and Technology Program: 2021YFQ0003, 2023YFSY0026, 2023YFH0004.
Competing Interests
The authors declare there are no competing interests.
Author Contributions
• Wenfeng Zheng conceived and designed the experiments, performed the experiments,
performed the computation work, prepared figures and/or tables, authored or reviewed
drafts of the article, and approved the final draft.
• Siyu Lu performed the experiments, prepared figures and/or tables, authored or reviewed
drafts of the article, and approved the final draft.
• Youshuai Yang analyzed the data, performed the computation work, authored or
reviewed drafts of the article, and approved the final draft.
• Zhengtong Yin analyzed the data, authored or reviewed drafts of the article, and approved
the final draft.
• Lirong Yin analyzed the data, authored or reviewed drafts of the article, and approved
the final draft.
Data Availability
The following information was supplied regarding data availability:
The code is available at Zenodo: Zheng, W., Lu, S., Yang, Y., Yin, Z., & Yin,
L. (2023). Lightweight Transformer Image Feature Extraction Network. Zenodo.
[Link]
The raw data, ImageNet (ILSVRC2012), is available at [Link]
challenges/LSVRC/2012/[Link]#
The Common Objects in Context (COCO) dataset is available at [Link]
#download.
REFERENCES
Baldi P, Vershynin R. 2023. The quarks of attention: structure and capacity of neural
attention building blocks. Artificial Intelligence 319:103901
DOI 10.1016/[Link].2023.103901.
Chen CFR, Fan Q, Panda R. 2021. CrossViT: cross-attention multi-scale vision transformer
for image classification. In: 2021 IEEE/CVF international conference on computer
vision (ICCV). 347–356 DOI 10.1109/ICCV48922.2021.00041.