Vision Transformer Reliability Evaluation on the C
This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/TNS.2024.3513774
IEEE TRANSACTIONS ON NUCLEAR SCIENCE, VOL. XX, NO. XX, XXXX 2024 1
Abstract—Vision transformers (ViTs) outperform convolutional neural networks (CNNs) in tasks such as image classification, and, despite their high computational complexity, can still be mapped to low-power EdgeAI accelerators, such as the Coral Tensor Processing Unit (TPU). In this paper, through accelerated neutron beam experiments, we study the reliability of six ViTs on the Coral TPU and four micro-benchmarks. According to our data, the internal size of attention heads (the main computational block in ViTs) has negligible impact on the FIT rate of the model compared to increasing the number of heads in the model; furthermore, our results show that employing convolutions in the patch embedding reduces the FIT rate of the model. Additionally, we decompose ViT into four basic computational blocks which represent the main operators of the model, showing that, although the transformer layer (with multi-head self-attention and multi-layer perceptron) presents the highest FIT rate, it is actually the patch embedding that is more likely to cause misclassifications. These results can be leveraged to design hardening techniques that improve the resilience of the critical blocks of a ViT, identified in our evaluation, while minimizing the additional overhead.

I. INTRODUCTION

Processing visual information is a key task in applications such as self-driving cars, airplanes, space probes, and Unmanned Aerial Vehicles (UAVs), where reliable computing is also crucial [1]. Until recently, convolutional neural networks (CNNs) were the main approach to detect or classify objects in an image or video. However, the accuracy of CNN-based detection is bounded by an intrinsic limitation due to the very nature of the convolution operation: being a local operator performed as a sliding window over the input image, the network can only extract information from pixels that are spatially close to each other. Attempts to increase the receptive field of CNNs [2], [3] have shown improvements in global reasoning capabilities at the expense of efficiency. Nonetheless, these approaches either introduce a significant information bottleneck [2] or they enlarge the kernel size [3] without truly achieving a global receptive field. Fortunately, researchers have recently developed a new architecture capable of correlating input information on a global scale: the transformer model.

Transformers are a type of deep learning (DL) model architecture originally introduced in natural language processing (NLP), where they revolutionized the field. More recently, the transformer architecture has been successfully applied to image and video processing, being named vision transformers (ViTs). ViTs leverage the concept of attention, which allows a global processing of information from all over the image, overcoming the spatially local receptive field of CNNs and resulting in a higher accuracy. Interestingly, transformers, despite having a more complex architecture with respect to CNNs, can also be deployed in embedded applications with strict energy, weight, and space constraints. In this paper, we study the reliability of transformer models on low-power and low-cost commercial-off-the-shelf (COTS) accelerators, such as the Coral Edge Tensor Processing Unit (TPU), a device capable of processing neural networks in an extremely cost-effective and energy-efficient manner.

While the effect of radiation on CNNs executed on TPUs has already been studied [4], [5], to the best of our knowledge, this is the first paper investigating the impact of atmospheric neutrons on the reliability of transformers running on TPUs. In order to provide a complete and accurate reliability overview, we consider six different vision transformer models: Compact Convolutional Transformer (CCT) [6], two standard vision transformers (ViT) [7] (one with 8 attention heads and 8x8 patches, and another with 16 heads and 16x16 patches), and three EfficientFormers [8] (with increasing internal sizes, named L1, L3 and L7). Our data shows that CCT has the lowest FIT rate, suggesting a reliability benefit in adopting convolution. Additionally, the FIT rate of the EfficientFormers does not depend on the model size, whereas ViT-16 has a 5x higher FIT rate compared to the smaller ViT-8.

Additionally, to better understand the main reasons for the observed phenomena, we characterize the reliability of four micro-benchmarks: two single attention heads (one from ViT-8, the other from ViT-16), and the transformer encoders from ViT-8 and ViT-16, respectively. As these micro-benchmarks represent the most characteristic atomic operations of ViTs, they provide insight into how the architecture of each model affects the FIT rate. Furthermore, we evaluated eight additional micro-models (four each for ViT-8 and ViT-16) that represent the main operations performed by the ViT model: patch embedding, multi-head self-attention, transformer layer, and multi-layer perceptron classification head.

This work was supported in part by the Italian Ministry for University and Research (MUR) through the “Departments of Excellence 2023-27” program (L.232/2016) awarded to the Department of Industrial Engineering and by the European Union’s 2020 research and innovation programme under grant agreement No 101008126, corresponding to the RADNEXT project.
Pablo Rafael Bodmann is with the Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre, Brazil (e-mail: [email protected]).
Niccolò Cavagnero is with the Department of Control and Computer Engineering of the Politecnico di Torino, Italy (email: [email protected]).
Christopher Frost is with ChipIR, Rutherford Appleton Laboratory, Science and Technology Facility Council, Didcot, UK (email: christo- [email protected]).
Paolo Rech and Bruno Loureiro Coelho are with the Department of Industrial Engineering of the University of Trento, Italy (e-mail: [email protected] and [email protected]).
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in IEEE Transactions on Nuclear Science.
Overall, we present experimental data on 18 configurations of vision transformers tested for more than 266 hours of effective neutron irradiation at the ChipIR facility. When scaled to the natural neutron flux at New York City [9], this accounts for more than 258 billion years of neutron exposure. Our results show that the probability of radiation-induced errors affecting the output of a model increases with the model size. Furthermore, the probability of these errors is significantly affected by the complexity of the architecture, with more complex architectures such as the ones used in the EfficientFormers [8] being more susceptible to radiation effects. In addition to characterizing the reliability of different ViTs on the Coral Edge TPU, we identify the most critical blocks of the transformer architecture. Specifically, our experimental data shows that radiation-induced errors on the patch embedding layer of the transformer model are more likely to lead to misclassifications than errors on other layers of the model. Fortunately, our analysis has also shown that employing convolution operations on the patch embedding layer improves the resilience of the ViT model. These results can be used to design effective selective hardening techniques that improve the overall reliability of the model or to tune existing reliability solutions specifically designed for machine learning models [10].

The remainder of the paper is structured as follows. Section II presents background information and related work. Next, we describe the experimental methodology in Section III. The results of our experiments are discussed in Section IV, where we characterize the reliability of transformer models and micro-benchmarks. Finally, Section V concludes the paper with our final remarks.

II. BACKGROUND & RELATED WORK

In this section, we present background information on the effects of radiation on neural networks and discuss related work. Additionally, we provide details on the vision transformer architecture, and on the Coral Edge TPU.

A. Effects of radiation on neural networks

Radiation-induced transient faults have three possible outcomes: (1) the fault propagates into an error that causes a detected unrecoverable error (DUE): a program crash or hang, thus requiring a restart of the application or the device, (2) the fault propagates through the stack of system layers and leads to a silent data corruption (SDC), affecting the application output, or (3) the application is unaffected (i.e., the fault is masked, or the corrupted data is not used) [11]. The probability of radiation causing SDCs or DUEs depends on a combination of factors, including the hardware architecture (such as the memory/logic sensitivity [12], [13]), and the application [14]. As such, there is a need to study the reliability of a given application implemented on the selected hardware in order to safely deploy the system.

Neural networks (NNs) are being applied to solve various tasks in several fields, such as computer vision and robotics. A neural network is based on artificial neurons, which receive weighted inputs and apply an activation function to produce an output. While each neuron is relatively simple, a large number of neurons in parallel, called a layer, is able to process complex information. Deep learning stacks several of these layers in sequence to build powerful models capable of achieving super-human performance in specific tasks [11].

While NNs can achieve high accuracy in image classification tasks, radiation-induced faults can negatively affect the model by causing SDCs. However, considering the output of a neural network is probabilistic, the corrupted output can still allow for a correct classification. This can happen, for instance, when the corruption modifies classification probabilities without changing the class with the highest probability. Therefore, SDCs that do not affect the final classification are considered tolerable SDCs. In contrast, some SDCs do change the classification, thus being considered critical SDCs.

B. Vision transformers

Transformers are the current state of the art in machine learning (ML) models, being able to outperform previous architectures in multiple tasks across several fields, such as computer vision and robotics. Vision transformers (ViT) [7] were shown to outperform convolutional neural networks (CNNs), the previously most commonly adopted architecture for image processing. The improvement in accuracy achieved by ViT is in large part due to its ability to process the entire image at once. In contrast, CNNs are intrinsically limited due to convolution being a local operation, thus bounding the maximum achievable accuracy [7].

Figure 1 illustrates a simplified architecture of the standard Vision Transformer [7], which adapts an architecture initially developed for natural language processing tasks to be able to process images. The basic ViT architecture splits the input picture into non-overlapping patches, which are then encoded with information about their spatial position in the image. After this initial encoding, the data is processed by a series of transformer layers, responsible for the extraction of information from the input. Each transformer layer processes the input through a combination of self-attention heads and a multi-layer perceptron (MLP). Each attention head leverages the concept of self-attention to capture both global and local dependencies in the input data. More specifically, the self-attention mechanism weighs the importance of different patches in an image with respect to every other patch. This allows the model to identify and focus on the more complex and relevant relationships in the image. Therefore, the attention head is one of the core components of the ViT architecture, being responsible for identifying the main informative features of the input, thus affecting the final classification. After computing the attention scores, the transformer block leverages an MLP to increase the non-linear fitting capability of the model. This process is repeated for each of the transformer layers in the model, with the output of one layer being used as the input of the next layer. Finally, the output of the last transformer layer is forwarded to a classifier MLP responsible for outputting a prediction (class of an object). For a more in-depth understanding of the vision transformer architecture, we refer the reader to the original Vision Transformer paper [7].
[Figure 4 panels: (a) Patch Embedding micro-model; (b) Classifier Multi-Layer Perceptron micro-model; (c) Multi-Head Self-Attention micro-model; (d) Transformer Layer micro-model.]
Fig. 4: Selected micro-models for our ablation experiments. Each micro-model allows us to gain insight into vital blocks of ViT.

ViT-8 and ViT-16 are chosen to understand the impact of the number of heads on the ViT sensitivity. CCT is tested in order to measure the effect of convolution on the error rate of a transformer, and the three EfficientFormers are chosen to compare a more efficient transformer block with increasing complexity.

Due to limitations on the TPU, the attention heads were implemented from scratch rather than using the ones available in TensorFlow. This was done because the GELU [29] activation function used by the MLPs is not supported by the TPU compiler. Thus, to overcome this limitation, we use the Tanh approximation [30], which can be mapped to the TPU without impacting the model accuracy.

Besides the transformer models, we also evaluated micro-benchmarks, which are characteristic atomic operations executed in the ViT models. The micro-benchmarks comprise two different single attention heads with the same sizes as the ones in ViT-8 (Attention 1) and ViT-16 (Attention 2), and two transformer encoders, which also follow the sizes of the ones in ViT-8 and ViT-16 (listed as Transformer Encoder 1 and 2, respectively).

Additionally, in order to analyze how radiation affects different parts of the ViT model, we selected four micro-models, as shown in Figure 4. The idea is to propose an ablation study, incrementally adding parts of the ViT model to understand the contribution of each part to the overall framework error rate. These micro-models were selected to evaluate several aspects of the ViT model: first, (a) patch embedding is responsible for creating an efficient representation of the image (which has a high dimensionality) while retaining the necessary information for the subsequent blocks of the model. Next, we wanted to isolate the (b) MLP classifier, the block responsible for outputting the classification scores. After evaluating the very first and last blocks, we selected the (c) multi-head self-attention block, which computes the attention scores that indicate which parts of the image contain different kinds of relevant information. This information is then passed through an MLP to increase the non-linear fitting capability of the model, which completes a single (d) transformer layer. In other words, the (d) transformer layer includes the (c) multi-head self-attention block and an additional MLP (along with residual and normalization layers, as previously described and shown in Figure 1).

The micro-models selected allow us to obtain and analyze the intermediate outputs of the model, which would be an otherwise impossible or highly inefficient process due to the way the Edge TPU functions. Particularly, the Edge TPU does not allow us to obtain intermediate results of a neural network without executing part of the model on the host device. Therefore, obtaining intermediate results without micro-models requires synchronizing the TPU and host after every single layer to exchange the outputs and inputs of each layer (which also requires quantization of values). Instead, our approach allows us to obtain the output from each micro-model, which can then be used for further analysis.

The experiment consists of each TPU running one of the models, micro-benchmarks, or micro-models listed above. Additionally, each host device (Raspberry Pi 4) only has one TPU connected via USB. Therefore, each host only runs one benchmark at a time. Figure 5 shows an iteration of the experiment, which starts with the TPU being initialized with the model parameters, the test images, and the expected (golden) output for each image. After the initialization, the main loop starts: the image is fed as input to the TPU, which will then apply the model over that input. When the TPU completes its computations, it returns the output to the host device, which in turn compares the obtained result with the respective fault-free golden (expected) output. If there is any discrepancy between the computed output and the golden output, the erroneous data is logged for later analysis. After all the images of the batch have been tested, the main loop starts again from the first image. Considering that only the layers of the neural network are executed on the TPU, whereas the comparison is executed on the Raspberry Pi (not irradiated), one can assume that all observed errors come from the TPU.

IV. EXPERIMENTAL RESULTS

In this section, we present the results of neutron experiments with several transformer models and micro-benchmarks. Section IV-A shows how radiation-induced errors affect transformers on the Edge TPU, while Section IV-B analyzes how errors in different layers affect the correct application output.

A. Radiation-induced errors on Edge TPU transformers

In order to better understand how radiation affects vision transformers running on Edge TPUs, we evaluated both entire transformer models and micro-benchmarks that compose the core computations of transformers. While evaluating entire models allows us to characterize the reliability of different configurations and architectures, evaluating micro-benchmarks
[Bar chart. Y-axis: FIT. Legend: SDC, Critical SDC, Crashes. Models: CCT, ViT 8, ViT 16, L1, L3, L7.]
Fig. 7: FIT rate for the tested models. CCT uses convolution, ViT 8 and 16 are the classical transformers, and L1, L3, and L7 are EfficientFormers with increasing complexity.

[Flowchart on the Raspberry Pi 4 host: Initialization; Set image as first layer; Run Transformer Model inference; Compare with template; if No SDC, continue; if SDC detected, Log SDC.]
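The experiment iteration described in Section III (feed each image to the accelerator, compare the result with the golden output, and log any mismatch) can be sketched as follows. This is a simplified host-side illustration under stated assumptions, not the authors' test harness: `run_inference` is a hypothetical callable standing in for the accelerator invocation, and the `faulty` function only emulates a corrupted output.

```python
def run_iteration(run_inference, images, golden_outputs, log):
    """One pass over the test images, logging every mismatch with the golden output.

    run_inference: hypothetical callable that returns the model output for one image.
    images / golden_outputs: test inputs and their fault-free expected outputs.
    log: list that collects one record per observed discrepancy.
    """
    for index, (image, golden) in enumerate(zip(images, golden_outputs)):
        output = run_inference(image)
        if output != golden:  # any discrepancy counts as an SDC
            log.append({"image": index, "observed": output, "expected": golden})
    return log


def fault_free(image):
    """Stand-in model: doubles every input value (illustrative only)."""
    return [2 * pixel for pixel in image]


def faulty(image):
    """Same stand-in model, with one emulated corrupted value for one image."""
    output = fault_free(image)
    if image == [3, 4]:
        output[0] += 1
    return output


images = [[1, 2], [3, 4], [5, 6]]
golden = [fault_free(image) for image in images]
sdc_log = run_iteration(faulty, images, golden, [])
```

In the experiment this pass is repeated from the first image once the whole batch has been tested; since the comparison runs on the non-irradiated host, every logged record is attributed to the accelerator.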
Based on the results shown in Figure 6, despite the different sizes, the FIT rates of Attention 1 and Attention 2 are similar. However, this is not true for the transformer encoders:

[Bar chart. Y-axis: FIT. Legend entry: Crashes.]
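The FIT rates plotted in these figures express failures per 10^9 hours of operation at the natural neutron flux. The formula used by the authors is not shown in this excerpt; the sketch below assumes the usual beam-test convention, in which the cross section (observed errors divided by fluence) is multiplied by a reference flux of 13 n/(cm²·h) for New York City at sea level. The function name and the example numbers are hypothetical and are not measurements from this paper.

```python
# Assumed reference flux at sea level in New York City, in n/(cm^2 * h).
NYC_FLUX_N_PER_CM2_H = 13.0


def fit_rate(num_errors, fluence_n_per_cm2, flux=NYC_FLUX_N_PER_CM2_H):
    """Failures In Time (errors per 10^9 hours) from beam-test counts.

    num_errors: number of observed events (for example SDCs or crashes).
    fluence_n_per_cm2: total neutron fluence received by the device.
    """
    cross_section = num_errors / fluence_n_per_cm2  # cm^2
    return cross_section * flux * 1e9


# Hypothetical numbers: 10 errors observed over a fluence of 1e11 n/cm^2.
example_fit = fit_rate(10, 1e11)
```

With these made-up inputs the cross section is 1e-10 cm² and the resulting rate is about 1.3 FIT.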
[Bar chart. Legend: Tolerable Error, Critical Error. Blocks: Patch Embedding, Multi-Head Self-Attention, Transformer Layer. Panels: ViT-8, ViT-16.]
Fig. 9: Average error magnitude (relative to correct value) for ViT-8 and ViT-16 blocks (micro-benchmarks).

[Bar chart. Y-axis: FIT. Legend: SDC, Critical SDC, Crashes. Blocks: Patch Embedding, Multi-Head Self-Attention, Transformer Layer, Final MLP Classifier. Panels: ViT-8, ViT-16.]
Fig. 11: SDC (tolerable and critical) and DUE FIT rates of the main parts of the ViT-8 and ViT-16 transformer models.
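Fig. 11 separates tolerable from critical SDCs, following the definitions in Section II-A: an SDC is tolerable when the corrupted scores still select the same class as the golden output, and critical otherwise. A minimal sketch of that distinction is given below; the function names, labels, and score vectors are illustrative assumptions rather than the authors' post-processing code.

```python
def predicted_class(scores):
    """Index of the highest classification score."""
    return max(range(len(scores)), key=scores.__getitem__)


def classify_outcome(golden_scores, observed_scores):
    """Label one inference as 'masked', 'tolerable SDC', or 'critical SDC'."""
    if observed_scores == golden_scores:
        return "masked"  # output identical to the fault-free run
    if predicted_class(observed_scores) == predicted_class(golden_scores):
        return "tolerable SDC"  # scores changed, classification did not
    return "critical SDC"  # the corruption changed the predicted class


def critical_ratio(outcomes):
    """Fraction of SDCs that are critical (0.0 when no SDC was observed)."""
    sdcs = [outcome for outcome in outcomes if outcome != "masked"]
    if not sdcs:
        return 0.0
    return sum(outcome == "critical SDC" for outcome in sdcs) / len(sdcs)


golden = [0.1, 0.7, 0.2]  # hypothetical fault-free scores for three classes
outcomes = [
    classify_outcome(golden, [0.1, 0.7, 0.2]),
    classify_outcome(golden, [0.2, 0.6, 0.2]),
    classify_outcome(golden, [0.8, 0.1, 0.1]),
]
ratio = critical_ratio(outcomes)
```

Here the second run perturbs the scores without changing the winning class, while the third run flips the prediction, so half of the observed SDCs in this toy example are critical.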
[Diagram. Neutron beam experiment: input data, patch embedding micro-model, radiation-induced fault, corrupted micro output, log corrupted micro output. Post-processing (without beam): inject the observed corrupted micro output into the full model (patch embedding, transformer encoder, classifier MLP) to obtain the corrupted final output.]
Fig. 10: We utilize the corrupted outputs observed and collected during beam experiments to inject (real) errors in the full ViT models.

ments) and inject them into the full model to obtain what would be the final corrupted output (i.e., the classification probabilities). While the figure shows an example for the patch embedding micro-model, the process is analogous for the other three micro-models, with the corrupted outputs collected during the experiment being injected into the appropriate part of the full model. This process allows us to observe intermediate errors while also being able to realistically simulate how the error would have affected the final output.

Figure 11 combines the SDC FIT rate of each evaluated micro-model with the ratio of critical SDCs. Based on these results, while the transformer layer is the most likely to suffer radiation-induced faults, these SDCs rarely affect the final classification, i.e., they are tolerable SDCs (meaning they are not critical). In contrast, errors in the patch embedding often lead to misclassifications (critical SDCs): around 15% of all SDCs in patch embedding lead to misclassification in ViT-8 and 20% in ViT-16. These results show that errors while processing the initial image are more critical than errors affecting the computation in the internal parts of the transformer encoder. Because the patch embedding reduces the dimension of the initial image with a large number of pixels (over 12,000 values for a 64x64 image) to a small number of patches (e.g., 192 patches for ViT-8), errors in this procedure may significantly affect the representation of patches. Additionally, errors in the first layer of the model will cascade into every subsequent layer, which is further aggravated by the nature of the patch embedding layer. Finally, patch embedding is the only non-residual layer, meaning that radiation-induced errors are not smoothed out by re-using previously (correctly) computed data.

C. Impact of experimental results

Based on the results shown above, we are able to identify the patch embedding as the most critical operation in the ViT model, meaning that a radiation-induced error in this block has a high probability of resulting in a misclassification (critical SDC). Additionally, while transformer layers have the highest FIT rate of the evaluated blocks, most of these SDCs are tolerable, i.e., they do not change the final classification. This is an important result considering the transformer encoder comprises the vast majority of the computations performed by ViT, as this block includes several transformer layers (3 in the evaluated models). In contrast, the patch embedding requires orders-of-magnitude fewer operations and is only executed once in the ViT model. Thus, it may be possible to improve the reliability of ViT by implementing selective hardening techniques on this block. Because it is a lightweight operation, replicating this block would introduce negligible overhead while protecting the most critical operation of the model. Alternatively, as shown in the comparison of vision transformer architectures, employing convolutions in the patch embedding increases the reliability of the model.

V. CONCLUSION

In this work, we reported the results collected after irradiating Coral Edge TPUs with neutrons for over 266 effective hours. We considered six different transformer models, four micro-benchmarks, and eight parts of the Vision Transformer (ViT) model (micro-models), for a total of eighteen configurations evaluated. Data showed that the size of the head has a negligible influence on the model FIT rate, while the number of heads impacts the SDC FIT rate significantly. When comparing different models, the results indicate that
the underlying architecture of the transformer has a large influence on the SDC FIT rate. By evaluating both micro-benchmarks and full models, our experimental data provided valuable insights on the sensitivity of each part of vision transformers to radiation-induced faults. Additionally, after observing real SDCs in different parts of the ViT model, we injected the errors collected during the experiment in order to determine what parts of the model are more likely to cause misclassifications. Based on this analysis, we identified the patch embedding as the most critical component of ViT. Interestingly, our experimental results have also shown that employing convolutions during patch embedding considerably improves the reliability of the model.

REFERENCES

[1] Road vehicles — Functional safety, International Organization for Standardization ISO 26262, Dec. 2018.
[2] J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu, “Squeeze-and-excitation networks,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, Jun. 2018, pp. 7132–7141.
[3] Z. Liu, H. Mao, C. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, “A ConvNet for the 2020s,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, CA, USA, Jun. 2022, pp. 11966–11976.
[4] M. C. Casey, J. S. Goodwill, E. J. Wyrwas, R. A. Austin, C. M. Wilson, S. D. Stansberry, N. Gorius, and S. Aslam, “Single-event effects on commercial-off-the-shelf edge-processing artificial intelligence ASICs,” IEEE Transactions on Nuclear Science, vol. 70, no. 8, pp. 1716–1723, Aug. 2023.
[5] R. L. Rech Junior, S. Malde, C. Cazzaniga, M. Kastriotou, M. Letiche, C. Frost, and P. Rech, “High energy and thermal neutron sensitivity of Google tensor processing units,” IEEE Transactions on Nuclear Science, vol. 69, no. 3, pp. 567–575, Mar. 2022.
[6] A. Hassani, S. Walton, N. Shah, A. Abuduweili, J. Li, and H. Shi, “Escaping the big data paradigm with compact transformers,” arXiv preprint arXiv:2104.05704, Apr. 2021.
[7] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, Vienna, Austria, May 2021.
[8] Y. Li, G. Yuan, Y. Wen, J. Hu, G. Evangelidis, S. Tulyakov, Y. Wang, and J. Ren, “EfficientFormer: Vision transformers at MobileNet speed,” in Advances in Neural Information Processing Systems, vol. 35, New Orleans, Louisiana, USA, Nov. 2022, pp. 12934–12949.
[15] M. Casey, E. Wyrwas, and R. Austin, “Recent radiation test results on COTS AI edge processing ASICs,” in NEPP Electronics Technology Workshop (ETW), Greenbelt, Maryland, USA, Jun. 2022.
[16] N. P. Jouppi, D. Hyun Yoon, M. Ashcraft, M. Gottscho, T. B. Jablin, G. Kurian, J. Laudon, S. Li, P. Ma, X. Ma, T. Norrie, N. Patil, S. Prasad, C. Young, Z. Zhou, and D. Patterson, “Ten lessons from three generations shaped Google’s TPUv4i: Industrial product,” in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), Virtual Event, Spain, Jun. 2021, pp. 1–14.
[17] Q-Engineering. Google Coral Edge TPU explained in depth. Accessed: 2023-02-01. [Online]. Available: https://qengineering.eu/google-corals-tpu-explained.html
[18] R. L. R. Junior and P. Rech, “Reliability of Google’s tensor processing units for convolutional neural networks,” in 2022 52nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks - Supplemental Volume (DSN-S), Baltimore, MD, USA, Jun. 2022, pp. 25–27.
[19] R. L. Rech and P. Rech, “Reliability of Google’s tensor processing units for embedded applications,” in 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE), Antwerp, Belgium, Mar. 2022, pp. 376–381.
[20] P. R. Bodmann and P. Rech, “Tensor processing unit reliability dependence on temperature and radiation source,” IEEE Transactions on Nuclear Science, vol. 71, no. 4, pp. 854–860, Apr. 2024.
[21] P. R. Bodmann, M. Saveriano, A. Kritikakou, and P. Rech, “Neutrons sensitivity of deep reinforcement learning policies on EdgeAI accelerators,” IEEE Transactions on Nuclear Science, vol. 71, no. 8, pp. 1480–1486, Aug. 2024.
[22] G. Lentaris, V. Leon, C. Sakos, D. Soudris, A. Tavoularis, A. Costantino, and C. B. Polo, “Performance and radiation testing of the Coral TPU co-processor for AI onboard satellites,” in 2023 European Data Handling & Data Processing Conference (EDHPC), Juan Les Pins, France, Oct. 2023, pp. 1–4.
[23] K. Ma, C. Amarnath, and A. Chatterjee, “Error resilient transformers: A novel soft error vulnerability guided approach to error checking and suppression,” in 2023 IEEE European Test Symposium (ETS), Venezia, Italy, May 2023, pp. 32–37.
[24] X. Xue, C. Liu, Y. Wang, B. Yang, T. Luo, L. Zhang, H. Li, and X. Li, “Soft error reliability analysis of vision transformers,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 31, no. 12, pp. 2126–2136, Dec. 2023.
[25] L. Roquet, F. Fernandes dos Santos, P. Rech, M. Traiola, O. Sentieys, and A. Kritikakou, “Cross-layer reliability evaluation and efficient hardening of large vision transformers models,” in Design Automation Conference (DAC) preprint, San Francisco, CA, USA, Jun. 2024.
[26] G. Gavarini, A. Ruospo, and E. Sanchez, “Evaluation and mitigation of faults affecting Swin transformers,” in 2023 IEEE 29th International Symposium on On-Line Testing and Robust System Design (IOLTS), Crete, Greece, Sep. 2023, pp. 168–174.
[27] L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, Z.-H. Jiang, F. E. Tay, J. Feng, and S. Yan, “Tokens-to-token ViT: Training vision transformers from scratch on ImageNet,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, Guangzhou, China, Nov. 2021, pp. 538–
[9] C. Slayman, “Jedec standards on measurement and reporting of alpha 547.
particle and terrestrial cosmic ray induced soft errors,” in Soft Errors in [28] C. Cazzaniga and C. D. Frost, “Progress of the scientific commissioning
Modern Electronic Systems. Boston, MA, USA: Springer US, 2011, of a fast neutron beamline for chip irradiation,” Journal of Physics, vol.
vol. 41, ch. 3, pp. 55–76. 1021, pp. 12 037–12 041, May 2018.
[10] N. Cavagnero, F. Dos Santos, M. Ciccone, G. Averta, T. Tommasi, [29] D. Hendrycks and K. Gimpel, “Gaussian error linear units (gelus),” arXiv
and P. Rech, “Transient-fault-aware design and training to enhance preprint 1606:08415, Jun. 2023.
dnns reliability with zero-overhead,” in 2022 IEEE 28th International [30] ——, “Bridging nonlinearities and stochastic regularizers with gaussian
Symposium on On-Line Testing and Robust System Design (IOLTS), error linear units,” arXiV preprint 1606:08415, Nov. 2016.
Torino, Italy, Sep. 2022, pp. 181–187. [31] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand,
[11] P. Rech, “Artificial neural networks for space and safety-critical appli- M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural
cations: Reliability issues and potential solutions,” IEEE Transactions networks for mobile vision applications,” arXiv preprint 1704:04861,
on Nuclear Science, vol. 71, no. 4, pp. 377–404, Apr. 2024. Apr. 2017.
[12] R. Baumann, “Radiation-induced soft errors in advanced semiconductor
technologies,” IEEE Transactions on Device and Materials Reliability,
vol. 5, no. 3, pp. 305–316, Sep. 2005.
[13] J. Noh, V. Correas, S. Lee, J. Jeon, I. Nofal, J. Cerba, H. Belhaddad,
D. Alexandrescu, Y. Lee, and S. Kwon, “Study of neutron soft error
rate (ser) sensitivity: Investigation of upset mechanisms by comparative
simulation of finfet and planar mosfet srams,” IEEE Transactions on
Nuclear Science, vol. 62, no. 4, pp. 1642–1649, Aug. 2015.
[14] V. Sridharan and D. R. Kaeli, “Using hardware vulnerability factors to
enhance AVF analysis,” in Proceedings of the 37th annual international
symposium on Computer architecture, New York, NY, USA, Jun. 2010,
pp. 461–472.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/