
2023 26th International Conference on Computer and Information Technology (ICCIT)

13-15 December, Cox’s Bazar, Bangladesh

Performance Analysis of Vision Transformer Based Architecture for Cursive Handwritten Text Recognition
Avishek Chowdhury, Md. Ahasan Hossen, Mohammad Hasan
Dept. of Computer Science & Engineering
Premier University
Chattogram, Bangladesh
[email protected], [email protected], [email protected]

Sheikh Md Rukunuddin Osmani


Dept. of Computer Science & Engineering
Premier University
Chattogram, Bangladesh
[email protected]

Abstract—Computers have been given vision by researchers across the world for many years, and this is now the era of digitization. Recognizing handwritten text is a must for a computer vision system. Due to the variation and complexity of the cursive writing style, the holistic approach is mostly used for the recognition of cursive scripts. Though Convolutional Neural Network (CNN) based models have been employed in the literature for Holistic Handwritten Text Recognition (HTR) in different domains, the recent breakthrough Vision Transformer (ViT) based image classification models have not been utilized for HTR so far. In this research, we have designed a ViT-based model for the HTR of various cursive scripts. To validate the performance of the model, various handwritten datasets of cursive scripts have been used. The notable finding is that the accuracy of the designed model increased by up to 26% after applying image data augmentation techniques.

Index Terms—Vision Transformer, Augmentation, Cursive Handwritten Text Recognition (HTR)

I. INTRODUCTION

Cursive script is a style of writing in which letters are connected within words, resulting in a flowing and more complex visual representation than the standard printed format. Recognizing and transcribing handwritten texts written in cursive script is challenging due to the variation in individual handwriting styles, the complexity of the cursive script, and the ambiguity in character shapes. The holistic approach is effective when dealing with cursive scripts, as it treats each word as a single entity and focuses on recognizing the entire word as a whole.

The cursive handwritten text recognition process generally starts with the digitization of a handwritten document, such as a scanned page. Several image preprocessing techniques are often applied to enhance the quality of the handwritten document. For the holistic approach, segmentation techniques are used to isolate words from the input image. After segmenting the words, the recognition process starts. Machine learning models such as deep neural networks, support vector machines, and hidden Markov models are commonly used for cursive word recognition [1]–[10]. Labeled datasets, which contain examples of handwritten cursive words with their corresponding classes, are used to train these models. The trained model uses the features extracted from a segmented word to predict which class it belongs to. The final transcribed output of the recognition system can be helpful in applications such as historical document digitization, automated form processing, and handwritten text analysis.

Alexey Dosovitskiy et al. [11] introduced the use of Transformers for image recognition. They showed that the reliance on CNNs is not necessary, and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks while requiring substantially fewer computational resources to train. Research on handwritten word recognition using ViT-based models has been limited in the literature. So, in this research, we design a ViT-based image recognition model for cursive handwritten word recognition. ViT has weaker inductive biases compared to CNN; inductive bias refers to any assumptions that a model makes to generalize from the training data and learn the target function. As a result, ViT needs more data to perform well. Hence, we also apply image data augmentation techniques to increase the performance of the designed model.

So, the objective of this study is to design a ViT-based image recognition model for cursive handwritten word recognition and to analyze the performance of the model using different datasets. We also apply image data augmentation to check the effect of augmentation on the ViT-based model.



The rest of the paper is organized into six sections. Section II describes some previous works related to our task. Section III presents details of the research methodology. Section IV describes the conducted experiments and analyses of the results. Section V concludes our study by remarking on the outcomes.

II. RELATED WORK

An extensive research history has been devoted to the holistic recognition of handwritten cursive scripts. Though there is limited research on handwritten word recognition using Vision Transformer-based models, several researchers from different regions of the world have applied various convolution-based models for holistic recognition of handwritten words in different domains. In this section, we review some of these existing works.

Bhowmik et al. [3] developed a dataset named "CMATERdb2.1.2" containing 18,000 word images of Bengali (West Bengal) handwritten city names, having 120 word classes. Several studies have been conducted on this dataset. The authors of the dataset proposed a method using a feature descriptor that combines different Elliptical, Tetragonal, and Vertical pixel density histogram-based features. Their proposed method performs comparably better with a support vector machine (SVM) classifier than with a multi-layer perceptron (MLP) classifier. They obtained 83.64% accuracy in the best case and 79.38% accuracy on average using five-fold cross validation.

A deep CNN model called H-WordNet was proposed by the researchers of [4]. The proposed model is evaluated using the same Bengali word database developed in [3]. The H-WordNet model achieved a recognition accuracy of 96.17%.

The authors of [5] employed five different CNN-based architectures to test the performance of holistic Bengali word recognition. They achieved an accuracy of 98.86% with ResNet50 using the dataset "CMATERdb 2.1.2", which is significantly higher compared to other state-of-the-art models.

Md Ali Azad et al. [6] conducted a comprehensive study on Bengali handwritten word recognition, focusing on the holistic approach and dataset development. They introduced the "Zilla-64" dataset, comprising 4,480 handwritten word samples of 64 Bangladeshi district names. The authors employed the H-WordNet model, a Deep Convolutional Neural Network (DCNN), which achieved an accuracy of 93.30%.

The authors of [7] utilized a dataset of 40,000 samples named "HWR-Gurmukhi Postal 1.0" with 100 unique place names of handwritten Gurmukhi words collected from different writers in the Punjab state of India. Their experimental results show that the proposed ensemble classifier using KNN, Decision Tree, Random Forest, and CNN achieved an accuracy of 96.98% based on a majority voting scheme with a bagging methodology.

Pengyuan Lyu et al. [12] introduced MaskOCR, an approach for text recognition using a sequence-based pipeline and an encoder-decoder Transformer. Based on the Vision Transformer (ViT) architecture, the encoder extracts patch representations from text images using self-attention and FFN blocks. The decoder, inspired by the DETR style, maps the patch representations to text using self-attention, cross-attention, and FFN blocks. The proposed approach utilizes pretraining for both the encoder and the decoder, leveraging large-scale real-text images and synthetic text images with character-level annotation. Ablation studies on the BCTR dataset demonstrated the effectiveness of pretraining, with fine-tuning the pretrained encoder leading to a 4.0% improvement in accuracy compared to training from scratch. MaskOCR shows promise in text recognition tasks, benefiting from encoder and decoder pretraining. Further experiments validate the effectiveness of the proposed encoder-decoder pretraining approach.

Saleh Momeni et al. [13] conducted a study on Arabic offline handwritten text recognition using Transformer models. They generated a synthetic dataset of half a million printed Arabic text-line images with corresponding ground truth using various open-source fonts. The proposed Transformer-based pipeline achieved state-of-the-art results on the KHATT benchmark dataset of 6,742 text lines. Their research explored different model configurations and analyzed the impact of the number of encoder and decoder layers on accuracy. In their study, the Transformer-Transducer and Transformer with Cross-Attention models achieved a Character Error Rate (CER) of 19.76% and 18.45%, respectively. The study demonstrated the effectiveness of non-recurrent Transformer architectures for handwritten text recognition.

III. RESEARCH METHODOLOGY

This section describes the detailed research methodology, including the preparation of the datasets and the model. At first, we collected four different datasets of handwritten words. We applied some preprocessing techniques to make the word images of all the datasets usable as input data. After the dataset preparation, we built our transformer-based (ViT) image recognition model. We validated our model's performance using both augmented and non-augmented images. Figure 1 illustrates the flow chart of our research methodology.

Fig. 1. Flow Chart of Research Methodology
A. Dataset Preparation

In spite of handwritten text recognition being more crucial than ever, there is a lack of publicly available benchmark datasets. We were able to gather only four publicly available handwritten datasets of cursive words. The first dataset, "CMATERdb 2.1.2" [14], contains Bengali handwritten word images of different city names of West Bengal. The second and third datasets, "Zilla-64" [15] and "BN-HW-DSNd" [16], contain Bengali handwritten word images of different district names of Bangladesh. The fourth dataset, "HWR-Gurmukhi Postal 1.0" [17], contains Gurmukhi handwritten word images of different place names of Punjab. Table I represents the amount of data and the number of classes for each dataset.

TABLE I
DETAILED LIST OF USED DATASETS

Name of the Dataset          Amount of Data    No. of Classes
CMATERdb 2.1.2               18,000            120
Zilla-64                     4,480             64
BN-HW-DSNd                   7,040             64
HWR-Gurmukhi Postal 1.0      40,000            100

Each of the image samples has been resized to 240 × 80, where 240 is the width and 80 is the height of the image. The 240 × 80 size was selected to maintain a reasonable aspect ratio and keep the characteristics of the images unchanged. All the image samples were converted to binary, as color has no role in handwritten word recognition. To do this, we first converted the RGB image samples to grayscale images using the weighted method. To convert the grayscale images to binary images, a threshold value is calculated using Otsu's binarization algorithm. The threshold separates the pixels into two classes, background and foreground. Some image samples from the datasets after preprocessing are given in Figure 2.

Fig. 2. Sample Images from (a) CMATERdb, (b) Zilla-64, (c) BN-HW-DSNd and (d) HWR-Gurmukhi Postal Dataset
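As an illustration, the preprocessing above can be sketched with OpenCV along the following lines; the function and variable names are ours, and the snippet is not taken from the authors' code:

```python
import cv2

def preprocess_word_image(path, width=240, height=80):
    """Resize a word image to 240 x 80 and binarize it with Otsu's method (sketch)."""
    img = cv2.imread(path)                         # image loaded in BGR order by OpenCV
    img = cv2.resize(img, (width, height))         # target size (width, height) = (240, 80)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)   # weighted RGB-to-grayscale conversion
    # Otsu's algorithm picks the threshold that separates background from foreground
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary
```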

We split all the datasets into three sets: train (70%), test (15%), and validation (15%). The datasets are split using the stratified method so that every set has an equal ratio for each class. The train set was fed to the model to learn the patterns from the data, and the validation set was used to validate the model after each epoch during training. The test set was used to test the performance of our trained model. There was no common data between the train, validation, and test sets.
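A minimal sketch of the 70/15/15 stratified split using scikit-learn; the two-stage call and the variable names are our own illustration:

```python
from sklearn.model_selection import train_test_split

# X: array of preprocessed word images, y: corresponding class labels
# First take 70% of the data for training, stratified on the class labels.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.70, stratify=y, random_state=42)

# Split the remaining 30% evenly into validation and test sets (15% each overall).
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=42)
```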
On-the-fly image augmentation with a batch size of 32 was applied to generate augmented images. Parameters including a rotation of 8 degrees, a zoom of 0.1%, and a shear rate of 0.5% were used to execute the augmentation process. The ImageDataGenerator API from Keras [18] has been used to perform this. Only the train sets of the datasets were used for generating augmented data; no augmented data was given to the model as the test set.
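The augmentation stage could be configured roughly as below; mapping the stated values onto the fractional zoom_range and shear_range arguments of Keras is our reading of the settings rather than a verified reproduction:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# On-the-fly augmentation applied only to the training images.
train_datagen = ImageDataGenerator(
    rotation_range=8,   # rotate up to 8 degrees
    zoom_range=0.1,     # assumed reading of "zoom of 0.1%"
    shear_range=0.5)    # assumed reading of "shear rate of 0.5%"

# x_train: (num_samples, 80, 240, 1) preprocessed word images, y_train: one-hot labels
train_flow = train_datagen.flow(x_train, y_train, batch_size=32)
```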
B. Model Preparation

Our transformer-based model consists of one patch embedding layer, one positional encoding layer, a transformer encoder of three layers, one normalization layer, one dropout layer, and a fully connected layer connected to the output layer. Figure 3 represents the overall architecture of our designed ViT model.

Fig. 3. Architecture of the ViT Model

For the self-attention mechanism, each pixel in an image needs to attend and be compared to every other pixel of the image, which is not feasible for a high-resolution digital image. To overcome this, ViT introduces the concept of splitting the input image. As a first step, our ViT model splits the input image into a fixed-size sequence of patches, also called visual tokens. For an input of size Hi × Wi × Ci, where (Hi, Wi) is the resolution of the original image and Ci is the number of channels, 'N' patches are generated and the image is reshaped into a sequence of flattened 2D image patches with a size of 'R', which also serves as the effective input sequence length for the transformer encoder. For each image patch of resolution (P, P), 'N' and 'R' can be calculated using equations 1 and 2.

N = (Hi × Wi) / P²   (1)

R = N × (P² · Ci)   (2)

All of our input images were resized to shape (80, 240, 1), as discussed in the "Dataset Preparation" section, and we have used a patch size of (8, 8). So, we have (80 × 240) ÷ 8² = 300 patches. After splitting, the flattened patches are projected through a linear layer of dimension 128 to get the embedding of each patch. Positional encoding is added to the sequence of patches in the "Position Encoded Projection" layer. The positional embeddings are added so that the model can learn the sequence order of the patches.
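A minimal Keras sketch of this patch embedding and position-encoded projection stage under the stated settings (80 × 240 × 1 input, 8 × 8 patches, 128-dimensional projection); the layer and variable names are ours, and the patch sequence is produced with a simple reshape, mirroring the layer list reported for the model:

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_PATCHES, PATCH_DIM, PROJ_DIM = 300, 64, 128    # 8x8x1 patches from an 80x240x1 image

class PositionEncodedProjection(layers.Layer):
    """Project flattened patches to 128-D and add learnable positional embeddings."""
    def __init__(self, num_patches=NUM_PATCHES, proj_dim=PROJ_DIM, **kwargs):
        super().__init__(**kwargs)
        self.num_patches = num_patches
        self.projection = layers.Dense(proj_dim)                          # patch embedding
        self.position_embedding = layers.Embedding(num_patches, proj_dim)

    def call(self, patches):
        positions = tf.range(start=0, limit=self.num_patches, delta=1)
        return self.projection(patches) + self.position_embedding(positions)

inputs = layers.Input(shape=(80, 240, 1))
# Reshape the input image into a sequence of 300 flattened 64-value patches.
patches = layers.Reshape((NUM_PATCHES, PATCH_DIM))(inputs)
encoded_patches = PositionEncodedProjection()(patches)                    # (None, 300, 128)
```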
The sequence of patch embedding vectors, including the position embedding, is given as input to the transformer encoder, which consists of a stack of 3 identical layers. Figure 4 represents the structure of the transformer encoder used in the model. The transformer encoder consists of a normalization layer followed by a multi-head attention layer with two heads and two multi-layer perceptron (MLP) layers. Residual connections are included after every block to allow the gradients to flow through the network directly without passing through the non-linear activations. The objective of the normalization layer [19] is to normalize the outputs of the previous layer to have zero mean and unit variance, which helps improve the training time through faster convergence of the training process and overall performance. The multi-head attention layer is implemented as described in [20]. Two attention heads, each with a head size of 128 for query, key, and value, are used. The Gaussian Error Linear Unit, also known as GELU [21], is used as the activation function of the first MLP layer. The GELU nonlinearity weights inputs by their value, whereas ReLU gates inputs by their sign. A list of all the hyperparameters used in the transformer encoder is given in Table II.

TABLE II
HYPERPARAMETER LIST OF THE ENCODER

Number of Layers                       3
Hidden Size (Projection Dimension)     128
MLP Size                               256
Number of Heads                        2

Fig. 4. Structure of the Transformer Encoder Used in the ViT Model
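Continuing the sketch above, one possible Keras rendering of the three-layer encoder stack and of the classification head described in the next paragraph; num_classes and the intermediate variable names are our illustrative choices:

```python
num_classes = 120                       # e.g., 120 word classes for CMATERdb 2.1.2

x = encoded_patches
for _ in range(3):                      # Number of Layers = 3
    # LayerNorm -> 2-head self-attention (head size 128) -> residual connection
    norm1 = layers.LayerNormalization()(x)
    attention = layers.MultiHeadAttention(num_heads=2, key_dim=128)(norm1, norm1)
    attention_output = layers.Add()([attention, x])
    # Two-layer MLP (256 -> 128) with GELU on the first layer -> residual connection
    mlp = layers.Dense(256, activation="gelu")(attention_output)
    mlp = layers.Dense(128)(mlp)
    x = layers.Add()([mlp, attention_output])

# Head: LayerNorm -> Flatten -> Dropout(0.5) -> softmax output layer
representation = layers.LayerNormalization()(x)
representation = layers.Flatten()(representation)
representation = layers.Dropout(0.5)(representation)
outputs = layers.Dense(num_classes, activation="softmax")(representation)

model = tf.keras.Model(inputs=inputs, outputs=outputs)
```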

The normalization layer used after the transformer block works the same as the "Norm" layer of the transformer encoder. After this normalization layer, a dropout layer with a dropout rate of 0.5 is used between the flatten layer and the output layer. The Softmax activation function [22] is used in the output layer for the classification of the inputs. We have used Keras [18] for building the model with the TensorFlow [23] library. Table III represents the details of all the layers of our ViT model, including the output shape and the number of parameters of each layer. The input shape of every layer (except the "Input" layer) is equal to the output shape of the layer it is connected to.

TABLE III
LAYERS OF THE VIT MODEL

Layer Name (Type)                           | Connected to                                        | Output Shape       | Parameters
Input (Input)                               | ×                                                   | (None, 80, 240, 1) | 0
Reshape (Reshape)                           | Input                                               | (None, 300, 64)    | 0
Patch Embedding (Dense)                     | Reshape                                             | (None, 300, 128)   | 8,320
Position Encoded Projection (Add)           | Patch Embedding, Positions                          | (None, 300, 128)   | 0
Transformer Encoder × n (Number of Layers, n = 3), repeated for 1 <= i <= 3:
Normalization 1_i (LayerNormalization)      | Position Encoded Projection or MLP Output_(i-1)     | (None, 300, 128)   | 256
Multi-head Attention_i (MultiHeadAttention) | Normalization 1_i, Normalization 1_i                | (None, 300, 128)   | 131,968
Attention Output_i (Add)                    | Multi-head Attention_i, Position Encoded Projection or MLP Output_(i-1) | (None, 300, 128) | 0
MLP Dense 1_i (Dense)                       | Attention Output_i                                  | (None, 300, 256)   | 33,024
MLP Dense 2_i (Dense)                       | MLP Dense 1_i                                       | (None, 300, 128)   | 32,896
MLP Output_i (Add)                          | MLP Dense 2_i, Attention Output_i                   | (None, 300, 128)   | 0
Output of the stack of n Transformer Encoders (i = 3):
Normalization 2 (LayerNormalization)        | MLP Output_n                                        | (None, 300, 128)   | 256
Flatten 1 (Flatten)                         | Normalization 2                                     | (None, 38400)      | 0
Dropout 1 (Dropout)                         | Flatten 1                                           | (None, 38400)      | 0
Output (Dense)                              | Dropout 1                                           | (None, Cout)       | 38,400 × Cout + Cout
Total Trainable Parameters: 603,008 + 38,400 × Cout + Cout   (*Cout = number of output classes)
IV. EXPERIMENT AND RESULT ANALYSIS

The details of our experimental outcomes are elaborated in the following sections. We have used Python as the programming language for implementation and Jupyter Notebook to execute all of our modules. We have also used some libraries and packages, including Pandas [24] for extracting information from the dataset, OpenCV [25] for performing all the image processing tasks, and Scikit-learn [26] for binarizing the output, splitting the dataset, and generating the classification report.

A. Evaluation Metrics

To evaluate our model, we have calculated True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN) from the predicted output for each class, which have been used to calculate various performance metrics such as Accuracy, Precision, Recall, and F1-Score. Equations 3, 4, 5, and 6 show the mathematical expressions for calculating accuracy, precision, recall, and F1-score.

Accuracy = (TP + TN) / (TP + TN + FP + FN)   (3)

Precision = TP / (TP + FP)   (4)

Recall = TP / (TP + FN)   (5)

F1-Score = 2 × (Precision × Recall) / (Precision + Recall)   (6)
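In practice, these per-class metrics come directly out of the scikit-learn classification report mentioned above; a small sketch with assumed variable names, given a trained model and a held-out test set:

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report

# y_test: one-hot ground-truth labels of the test set; model: the trained ViT sketch above
y_true = np.argmax(y_test, axis=1)
y_pred = np.argmax(model.predict(x_test), axis=1)

print("Accuracy:", accuracy_score(y_true, y_pred))
# Per-class precision, recall, and F1-score derived from the TP/FP/FN counts
print(classification_report(y_true, y_pred, digits=4))
```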
B. Model Training
During training, we have used Adam [27] stochastic gradi-
ent descent method with a learning rate of 0.001 to optimize
the parameters of the model and Categorical cross-entropy loss
function to compute loss. A batch size of 32 was used per
gradient update. The epoch number is determined using the
early stopping technique to reduce overfitting.
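A compact sketch of this training configuration (Adam with learning rate 0.001, categorical cross-entropy, batch size 32, early stopping), continuing the earlier snippets; the patience value, monitored quantity, and epoch ceiling are our assumptions:

```python
import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="categorical_crossentropy",
    metrics=["accuracy"])

# Stop training once the validation loss stops improving; early stopping fixes the epoch count.
early_stop = EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)

history = model.fit(
    train_flow,                          # augmented training batches of size 32
    validation_data=(x_val, y_val),
    epochs=100,                          # upper bound only; early stopping decides when to stop
    callbacks=[early_stop])
```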
C. Performance Evaluation

1) Before Augmentation: Initially, we fed the raw training sets and the validation sets of our datasets to the model. Then we evaluated the performance using the test sets of the datasets. Figure 5 represents the classification results of our designed ViT model before augmentation.

Fig. 5. Classification Result Before Augmentation

From Figure 5, it is seen that the model achieved the lowest accuracy with the "Zilla-64" dataset and the highest accuracy with the "HWR-Gurmukhi Postal 1.0" dataset. The "Zilla-64" dataset has the least amount of data and "HWR-Gurmukhi Postal 1.0" has the highest amount of data. Therefore, it is evident that datasets with a small amount of data are not suitable for a ViT-based recognition model. In order to expand the training set, we have used data augmentation.

2) After Augmentation: As a second step, we focused on utilizing the augmentation described in the "Dataset Preparation" section. We trained our model using the augmented training sets of our datasets and evaluated the performance using the same test sets as before the augmentation. Figure 6 represents the classification results after applying data augmentation.

Fig. 6. Classification Result After Augmentation

Figure 6 shows that the model achieved significantly higher performance with all the datasets after augmentation. Accuracy with "CMATERdb 2.1.2" increased by 14%, with "Zilla-64" by 26%, and with "BN-HW-DSNd" by 16%. The accuracy with the "HWR-Gurmukhi Postal 1.0" dataset, which has the highest amount of data, also increased by 2%. Precision, Recall, and F1-Score also increased for all the datasets. These results indicate the effectiveness of augmentation in improving the ability of a ViT-based model to classify images accurately. Therefore, datasets enriched with more augmented data points tend to perform better.

Our research adopts a novel ViT-based approach for cursive handwritten word recognition, marking a departure from conventional methods. By incorporating augmentation techniques, our research highlights the significance of data augmentation strategies in optimizing the performance of ViT-based systems on challenging recognition tasks. Though our proposed architecture has not surpassed the performance of existing works, it underscores the crucial role of augmentation in enhancing the performance of a ViT-based cursive handwritten text recognition system.

V. CONCLUSION

This research embarked on a journey to harness the potential of vision transformer models for word recognition using a holistic approach, systematically exploring the impact of data augmentation on model performance. The vision transformer, originally designed for image classification tasks, adapted successfully to the domain of word recognition, demonstrating its versatility and efficacy. Through a comprehensive comparative analysis of various datasets before and after augmentation, we shed light on the profound influence of data augmentation techniques on model generalization and robustness. The augmented datasets exhibited a significant enhancement in terms of model accuracy and consistency, validating the hypothesis that an enriched training set leads to improved recognition performance. Furthermore, our findings underline the importance of dataset quality and diversity in training deep-learning models. The augmentation process introduced variations to the dataset that closely mimic real-world scenarios, equipping the model with the ability to handle the various writing styles, orientations, and noise levels commonly encountered in practical applications. It is noteworthy that the accuracy of the model increased by up to 26% after the augmentation, which is one of the major findings. Therefore, the amount of data significantly affects the generalization and robustness of the vision transformer in image recognition.
REFERENCES

[1] S. Barua, S. Malakar, S. Bhowmik, R. Sarkar, and M. Nasipuri, "Bangla handwritten city name recognition using gradient-based feature," in Proceedings of the 5th International Conference on Frontiers in Intelligent Computing: Theory and Applications: FICTA 2016, Volume 1. Springer, 2017, pp. 343–352.
[2] K. Balakrishnan et al., "Offline handwritten recognition of malayalam district name - a holistic approach," arXiv preprint arXiv:1705.00794, 2017.
[3] S. Bhowmik, S. Malakar, R. Sarkar, S. Basu, M. Kundu, and M. Nasipuri, "Off-line bangla handwritten word recognition: a holistic approach," Neural Computing and Applications, vol. 31, pp. 5783–5798, 2019.
[4] D. Das, D. R. Nayak, R. Dash, B. Majhi, and Y.-D. Zhang, "H-wordnet: a holistic convolutional neural network approach for handwritten word recognition," IET Image Processing, vol. 14, no. 9, pp. 1794–1805, 2020.
[5] R. Pramanik and S. Bag, "Handwritten bangla city name word recognition using cnn-based transfer learning and fcn," Neural Computing and Applications, vol. 33, pp. 9329–9341, 2021.
[6] M. A. Azad, H. S. Singha, and M. M. H. Nahid, "Zilla-64: A bangla handwritten word dataset of 64 districts name of bangladesh and recognition using holistic approach," in 2021 International Conference on Science & Contemporary Technologies (ICSCT). IEEE, 2021, pp. 1–6.
[7] H. Kaur, M. Kumar, A. Gupta, M. Sachdeva, A. Mittal, and K. Kumar, "Bagging: An ensemble approach for recognition of handwritten place names in gurumukhi script," ACM Transactions on Asian and Low-Resource Language Information Processing, 2023.
[8] D. Nurseitov, K. Bostanbekov, M. Kanatov, A. Alimova, A. Abdallah, and G. Abdimanap, "Classification of handwritten names of cities and handwritten text recognition using various deep learning models," arXiv preprint arXiv:2102.04816, 2021.
[9] R. K. Roy, H. Mukherjee, K. Roy, and U. Pal, "Recognition of handwritten indian trilingual city names," in Recent Trends in Image Processing and Pattern Recognition: Third International Conference, RTIP2R 2020, Aurangabad, India, January 3–4, 2020, Revised Selected Papers, Part I 3. Springer, 2021, pp. 488–498.
[10] S. Sharma, S. Gupta, D. Gupta, S. Juneja, H. Turabieh, L. Sharma, and Z. Kiros Bitsue, "Optimized cnn-based recognition of district names of punjab state in gurmukhi script," Journal of Mathematics, vol. 2022, pp. 1–10, 2022.
[11] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
[12] P. Lyu, C. Zhang, S. Liu, M. Qiao, Y. Xu, L. Wu, K. Yao, J. Han, E. Ding, and J. Wang, "Maskocr: Text recognition with masked encoder-decoder pretraining," arXiv preprint arXiv:2206.00311, 2022.
[13] S. Momeni and B. BabaAli, "A transformer-based approach for arabic offline handwritten text recognition," arXiv preprint arXiv:2307.15045, 2023.
[14] S. Bhowmik, "CMATERdb 2.1.2-Dataset," 2018. [Online]. Available: https://www.kaggle.com/datasets/showmikbhowmik/cmaterdb212-a-handwritten-bangla-word-database
[15] M. M. H. Nahid, "Zilla-64-Dataset," 2021. [Online]. Available: https://github.com/MahadiHasanNahid/Zilla-64-Dataset
[16] M. H. Zisad, "BN-HW-DSNd," 2022. [Online]. Available: https://www.kaggle.com/datasets/mdhosenzisad/bnhwdsnd
[17] H. Kaur and M. Kumar, "Benchmark dataset: offline handwritten gurmukhi city names for postal automation," in Document Analysis and Recognition: 4th Workshop, DAR 2018, Held in Conjunction with ICVGIP 2018, Hyderabad, India, December 18, 2018, Revised Selected Papers 4. Springer, 2019, pp. 152–159.
[18] F. Chollet et al., "Keras," https://keras.io, 2015.
[19] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," arXiv preprint arXiv:1607.06450, 2016.
[20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[21] D. Hendrycks and K. Gimpel, "Gaussian error linear units (gelus)," arXiv preprint arXiv:1606.08415, 2016.
[22] J. S. Bridle, "Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition," in Neurocomputing: Algorithms, Architectures and Applications. Springer, 1990, pp. 227–236.
[23] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/
[24] W. McKinney, "Data structures for statistical computing in Python," in Proceedings of the 9th Python in Science Conference, S. van der Walt and J. Millman, Eds., 2010, pp. 56–61.
[25] G. Bradski, "The OpenCV Library," Dr. Dobb's Journal of Software Tools, 2000.
[26] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[27] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
