Performance Analysis of Vision Transformer Based Architecture For Cursive Handwritten Text Recognition
Abstract—Researchers across the world have been working for many years to give computers vision, and in today's era of digitization, recognizing handwritten text is an essential capability of a computer vision system. Due to the variation and complexity of the cursive writing style, the holistic approach is mostly used for the recognition of cursive scripts. Although Convolutional Neural Network (CNN) based models have been employed in the literature for holistic Handwritten Text Recognition (HTR) in different domains, recent breakthroughs in image classification with Vision Transformer (ViT) based models have not been utilized for HTR so far. In this research, we have designed a ViT-based model for the HTR of various cursive scripts. To validate the performance of the model, various handwritten datasets of cursive scripts have been used. A notable finding is that the accuracy of the designed model increased by up to 26% after applying image data augmentation techniques.

Index Terms—Vision Transformer, Augmentation, Cursive Handwritten Text Recognition (HTR)

I. INTRODUCTION

Cursive script is a style of writing in which letters are connected within words, resulting in a flowing and visually more complex representation than the standard printed format. Recognizing and transcribing handwritten texts written in cursive script is challenging due to the variation in individual handwriting styles, the complexity of the cursive script, and the ambiguity in character shapes. The holistic approach is effective when dealing with cursive scripts, as it treats each word as a single entity and focuses on recognizing the entire word as a whole.

The cursive handwritten text recognition process generally starts with the digitization of a handwritten document, such as a scanned page. Several image preprocessing techniques are often applied to enhance the quality of the handwritten document. For the holistic approach, segmentation techniques are used to isolate words from the input image. After segmenting the words, the recognition process starts. Machine learning models, such as deep neural networks, support vector machines, hidden Markov models, etc., are commonly used for cursive word recognition [1]–[10]. Labeled datasets, which contain examples of handwritten cursive words with their corresponding classes, are used to train these models. The trained model uses the features extracted from a segmented word to predict which class it belongs to. The final transcribed output of the recognition system can be helpful in applications such as historical document digitization, automated form processing, and handwritten text analysis.

Alexey Dosovitskiy et al. [11] introduced the use of Transformers for image recognition. They showed that the reliance on CNNs is not necessary and that a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks while requiring substantially fewer computational resources to train. Research on handwritten word recognition using ViT-based models has been limited in the literature. So, in this research, we design a ViT-based image recognition model for cursive handwritten word recognition. ViT has weaker inductive biases than a CNN. Inductive bias refers to the assumptions a model makes in order to generalize from the training data and learn the target function. As a result, ViT needs more data to learn and perform well. Hence, we also apply image data augmentation techniques to increase the performance of the designed model; an illustrative sketch of such a pipeline is given at the end of this section.

The objective of this study is therefore to design a ViT-based image recognition model for cursive handwritten word recognition and to analyze the performance of the model on different datasets. We also apply image data augmentation to check its effect on the ViT-based model.

The rest of the paper is organized into six sections. Section II describes previous works related to our task. Section III presents details of the research methodology. Section IV
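As a concrete illustration of the augmentation strategy motivated above, the following is a minimal sketch of a typical image data augmentation pipeline for word images, written with Keras preprocessing layers. The choice of transforms and all parameter values are illustrative assumptions for exposition, not the exact configuration used in this work.

# Illustrative augmentation pipeline (assumed transforms and values,
# not this paper's exact configuration). Small geometric and photometric
# perturbations imitate natural variation in handwriting without
# destroying the identity of the word.
import tensorflow as tf
from tensorflow.keras import layers

data_augmentation = tf.keras.Sequential([
    layers.RandomRotation(0.02),            # slight slant/skew variation
    layers.RandomZoom(0.1),                 # writer-dependent size variation
    layers.RandomTranslation(0.05, 0.05),   # jitter of word position
    layers.RandomContrast(0.1),             # ink and scan intensity variation
])

# Applied on the fly during training:
# augmented_batch = data_augmentation(word_image_batch, training=True)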
time with the convergence of the training process and overall performance. The multi-head attention layer is implemented as described in [20]. Two attention heads are used, each with a size of 128 for the query, key, and value. The Gaussian Error Linear Unit (GELU) [21] is used as the activation function of the first MLP layer. The GELU nonlinearity weights inputs by their value, GELU(x) = x · Φ(x), where Φ(x) is the standard Gaussian cumulative distribution function, whereas ReLU gates inputs by their sign. A list of all the hyperparameters used in the transformer encoder is given in Table II.

[Table: layer-wise summary of the transformer encoder stack and the classification head of the designed model]

Layer (Type)                                  Connected to                                              Output Shape      Param #
-- Transformer Encoder (repeated × n, n = 3; 1 <= i <= 3) --
Normalization 1 i (LayerNormalization)        Position Encoded Projection or MLP Output (i-1)           (None, 300, 128)  256
Multi-head Attention i (MultiHeadAttention)   Normalization 1 i, Normalization 1 i                      (None, 300, 128)  131,968
Attention Output i (Add)                      Multi-head Attention i, Position Encoded Projection
                                              or MLP Output (i-1)                                       (None, 300, 128)  0
MLP Dense 1 i (Dense)                         Attention Output i                                        (None, 300, 256)  33,024
MLP Dense 2 i (Dense)                         MLP Dense 1 i                                             (None, 300, 128)  32,896
MLP Output i (Add)                            MLP Dense 2 i, Attention Output i                         (None, 300, 128)  0
-- Output of the stack of n Transformer Encoders (i = n = 3) --
Normalization 2 (LayerNormalization)          MLP Output n                                              (None, 300, 128)  256
Flatten 1 (Flatten)                           Normalization 2                                           (None, 38400)     0
Dropout 1 (Dropout)                           Flatten 1                                                 (None, 38400)     0
Output (Dense)                                Dropout 1                                                 (None, C_out*)    38,400 × C_out + C_out
Total Trainable Parameters: 603,008 + 38,400 × C_out + C_out
*C_out = number of output classes
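To make the layer summary above concrete, the following is a minimal Keras sketch of the encoder stack and classification head with the listed dimensions (sequence length 300, projection dimension 128, 2 attention heads of size 128, MLP widths 256 and 128). It reproduces the shapes and parameter counts in the table; variable names are ours, and details the text does not state (e.g., the dropout rate, the number of classes) are placeholder assumptions.

# Minimal sketch of the described transformer encoder block; assumed
# details are noted inline. Parameter counts match the summary table.
import tensorflow as tf
from tensorflow.keras import layers

def transformer_encoder(x):
    # x: Position Encoded Projection or MLP Output (i-1), shape (None, 300, 128)
    norm1 = layers.LayerNormalization(epsilon=1e-6)(x)        # 2*128 = 256 params
    # 2 heads, each of size 128 for query, key, and value -> 131,968 params
    attn = layers.MultiHeadAttention(num_heads=2, key_dim=128)(norm1, norm1)
    attn_out = layers.Add()([attn, x])                        # Attention Output (residual)
    mlp = layers.Dense(256, activation="gelu")(attn_out)      # MLP Dense 1: 33,024 params
    mlp = layers.Dense(128)(mlp)                              # MLP Dense 2: 32,896 params
    return layers.Add()([mlp, attn_out])                      # MLP Output (residual)

# Stack of n = 3 encoders followed by the classification head
inputs = tf.keras.Input(shape=(300, 128))   # position-encoded patch projection
x = inputs
for _ in range(3):
    x = transformer_encoder(x)
x = layers.LayerNormalization(epsilon=1e-6)(x)   # Normalization 2
x = layers.Flatten()(x)                          # (None, 38400)
x = layers.Dropout(0.5)(x)                       # rate is an assumed value
num_classes = 10                                 # C_out, dataset-dependent placeholder
outputs = layers.Dense(num_classes)(x)           # 38,400*C_out + C_out params (logits)
model = tf.keras.Model(inputs, outputs)

Running model.summary() on this sketch yields 603,008 parameters for the shared layers plus 38,400 × C_out + C_out for the output layer, in agreement with the total in the table.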