Handwritten OCR For Word in Indic Language Using Deep Networks
Abstract—A large number of Indian documents are handwritten, and India is a diverse nation with many languages. These handwritten documents contain important historical and cultural information which needs to be preserved by converting it to a digital format. The major problem is that everyone has a unique handwriting and writing style. To address this problem, we have trained Handwritten Optical Character Recognition (HOCR) models for eight Indian languages, i.e. Bangla, Gujarati, Gurumukhi, Hindi, Kannada, Odia, Telugu, and Urdu. The datasets IIIT-HW-Dev and IIIT-HW-Telugu refer to a Devanagari dataset and a Telugu dataset respectively, comprising 95K and 120K handwritten words. The IIIT-INDIC-HW-WORDS dataset consists of 872K handwritten words written in 8 Indic scripts by 135 writers. Tamil and Malayalam are excluded due to issues in the IIIT-INDIC-HW-WORDS dataset. The paper describes how the CNN-Transformer architecture leverages visual and textual features to perform OCR in different languages. The model takes word images as input; the CNN generates visual features and feeds them to the transformer decoder for text generation. A ResNet18 encoder and a transformer decoder have been used for all eight languages to evaluate the performance of this architecture. The architecture performed best on Kannada with a character error rate of just 1.5%.
Index Terms—Offline Handwritten Character Recognition, OHCR, Convolution Neural Networks, CNN, Transformers, encoder-decoder
I. INTRODUCTION

HOCR is used to convert handwritten text into a digital, editable, and searchable format which can then be used for analysis and easy information retrieval over large collections. In this paper, HOCR models are trained for eight Indian regional scripts: Bangla, Gujarati, Gurumukhi, Hindi, Kannada, Odia, Telugu, and Urdu. India is diverse and has many languages, and HOCR can be used to preserve the knowledge, history, and heritage contained in these handwritten manuscripts.

There is a great deal of inconsistency in handwriting styles, and various state-of-the-art models fail to give satisfactory performance across different types of handwriting samples [1]. Handwritten data has a large amount of variation within the same sequence. This ranges from easier-to-handle issues such as image noise, blur, and scale, to more complex issues such as random skew in a direction that depends on the writing style, shifts in character structure due to human error, character skew, etc. These issues make handwritten OCR a massive challenge.

We trained an image-to-sequence architecture inherited from [2]. This model has two main parts, an encoder and a decoder. The encoder is used to extract visual features from the input image, and this image embedding is added to a 2D positional encoding. The embedding is then given to the decoder, which is the same as the Transformer decoder from [3], with non-causal attention over the encoder output and causal attention over the decoder input [2]. Teacher forcing is used to make the model converge faster and to reduce training time. The model uses character-level token embeddings.

Computer Vision (CV) can be used to analyze, interpret, and understand meaningful information from two-dimensional images; with its help, machines can approach the capability of the human eye. Compared to humans, machine learning and deep learning methods require a massive image dataset to learn the orientations and patterns of scenes. Similar challenges are observed in the detection and recognition of UIDAI and PAN numbers from images [4].

The rest of the paper is organized as follows. Section II describes OCR-related work done in the past, including our own. Section III explains the data preparation process for the eight Indian languages. Section IV explains the overall methodology of the experiments in detail. Section V describes the training configuration used for the experimental setup. Section VI presents the experimental results. Section VII concludes the paper and outlines future work.

II. RELATED WORK

OCR research has been going on for more than three decades, and many conventional algorithms such as SVMs and HMMs have been used; these rely on character-level segmentation, which is very difficult to annotate [5]. The visual features of an image can be extracted using a Convolution Neural Network (CNN), and problems such as object classification and object detection can be solved using these visual features. CNNs alone, however, cannot generate text output, since text generation is a sequential problem common to most Natural Language Processing (NLP) tasks. The sequence-to-sequence learning task is solved using an encoder-decoder architecture with a CNN backbone and RNN decoders. A combination of CNN-BiLSTM-CTC [6] performs exceptionally well on the training data but poorly on unseen and noisy data. A more complex architecture, MDLSTM-CTC [7] with dropout [8], was the next evolution and slightly improved accuracy over the BiLSTM.
The compute-intensive CTC loss was later replaced by cross-entropy loss and greedy-search decoding for text generation [9]. Transformers are currently the state-of-the-art models for language tasks [3], and the CNN-Transformer architecture resolves this problem [2], [10]. Our HOCR is a word-level OCR based on an encoder-decoder architecture consisting of a CNN with a transformer as the decoder. This type of architecture is more accurate on unseen data than a purely CNN-RNN-based model and can handle noisy data, because using a transformer as the decoder gives the same character-level context as an RNN but with long-range context, which helps to increase accuracy significantly [4], [11], [12], [13].

In this paper, we trained CNN- and transformer-based models on different regional languages from the IIIT dataset [14]. The main purpose of this paper is to see how this architecture performs for different Indian languages.

III. DATA PREPARATION

IIIT-INDIC-HW-WORDS dataset: this is a large dataset of annotated words for 10 scripts, i.e. Hindi, Telugu, Bengali, Gurumukhi, Gujarati, Odia, Urdu, Kannada, Tamil, and Malayalam [14]. The images are augmented with skew, shift, scaling, and Gaussian blur. Eight languages are selected for training, validation, and testing as shown in Table I.

IV. METHODOLOGY

We use an image-to-sequence architecture [2] consisting of a CNN encoder and a transformer decoder. This type of architecture belongs to the sequence-to-sequence and tensor-to-tensor family [15]. The labels are obtained from the ground truth as a set of characters; in addition, three tokens are added for padding, end of sequence, and start of sequence.

The encoder is a ResNet [16] used for 2D feature extraction. Instead of pooling and classifying, we only take the feature map generated by the last block. We use a 2D positional encoding which is added to the feature map before it is flattened into a sequence. The 2D positional encoding [3] is a fixed sinusoidal encoding: half of the d_model channels encode the Y coordinate, while the other half encode the X coordinate, as shown in equation (1). After the 2D positional encoding is added, the resulting output is sent to the decoder layer stack.

Positional encoding:

PE(y, 2i)               = sin(y / 10000^(2i/d_model))
PE(y, 2i+1)             = cos(y / 10000^(2i/d_model))
PE(x, d_model/2 + 2i)   = sin(x / 10000^(2i/d_model))
PE(x, d_model/2 + 2i+1) = cos(x / 10000^(2i/d_model)),   i ∈ [0, d_model/4)    (1)
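Equation (1) maps directly onto a few tensor operations. The following is a minimal PyTorch sketch (PyTorch being the framework used in this work); the function name and layout are our assumptions, not the authors' code.

```python
import math
import torch

def positional_encoding_2d(d_model: int, height: int, width: int) -> torch.Tensor:
    """Fixed 2D sinusoidal encoding (equation 1): the first d_model/2 channels
    encode the Y coordinate, the remaining channels encode the X coordinate."""
    assert d_model % 4 == 0, "d_model must be divisible by 4"
    pe = torch.zeros(d_model, height, width)
    half = d_model // 2
    # frequencies 1 / 10000^(2i / d_model) for i in [0, d_model/4)
    div = torch.exp(torch.arange(0, half, 2).float() * (-math.log(10000.0) / d_model))
    y = torch.arange(height).float().unsqueeze(1)          # (H, 1)
    x = torch.arange(width).float().unsqueeze(1)           # (W, 1)
    pe[0:half:2]    = torch.sin(y * div).t().unsqueeze(2).expand(-1, -1, width)   # PE(y, 2i)
    pe[1:half:2]    = torch.cos(y * div).t().unsqueeze(2).expand(-1, -1, width)   # PE(y, 2i+1)
    pe[half::2]     = torch.sin(x * div).t().unsqueeze(1).expand(-1, height, -1)  # PE(x, d/2 + 2i)
    pe[half + 1::2] = torch.cos(x * div).t().unsqueeze(1).expand(-1, height, -1)  # PE(x, d/2 + 2i+1)
    return pe  # (d_model, H, W); broadcast-added to the encoder feature map
```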
The decoder is a transformer decoder stack with non-causal attention over the encoder output and causal self-attention [17]. The input sequence is sent with a 1D positional encoding. Training is done using teacher forcing, i.e. the decoder can only use the previous part of the sequence to predict the next token. This is now common practice; it helps the transformer learn faster by providing the shifted sequence as input while the loss is calculated against the ground truth. The decoding method is greedy, which can suffer from repetition and junk generation, but this has not been observed on single-word sequences. The decoder generates a probability distribution over tokens at each step, conditioned on the probabilities generated so far, as shown in equation (2).

Probability function:

p_t : {1, ..., V} → [0, 1],   Y_t ~ p_t    (2)

L_seq   = -(1/τ) Σ_t ln p_t(y_t^GT),          τ ≡ sequence length
L_batch = -(1/n) Σ_batch Σ_t ln p_t(y_t^GT),  n ≡ number of tokens in the batch    (3)

This loss function is modified for mini-batches, as in equation (3). The final layer is a 1x1 convolution which produces the logits; these can be normalised with a softmax to give the prediction.
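A sketch of the training step implied by equations (2) and (3): the target sequence is right-shifted for teacher forcing, a causal mask hides future tokens, and the loss averages -ln p_t(y_t^GT) over all non-padding tokens in the batch. The token ids and the model's calling convention are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

PAD, SOS, EOS = 0, 1, 2   # assumed ids for the padding, start and end tokens

def training_step(model, images, targets):
    """images: (B, 3, H, W); targets: (B, T) ground-truth ids ending in EOS, padded with PAD."""
    B, T = targets.shape
    sos = torch.full((B, 1), SOS, dtype=torch.long, device=targets.device)
    decoder_in = torch.cat([sos, targets[:, :-1]], dim=1)        # right-shifted input (teacher forcing)
    causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool,
                                        device=targets.device), diagonal=1)  # True = blocked position
    logits = model(images, decoder_in, tgt_mask=causal_mask)     # (B, T, V)
    # Cross-entropy is the mean of -ln p_t(y_t^GT) over non-padding tokens (equation 3).
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1), ignore_index=PAD)
```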
V. TRAINING CONFIGURATION

The base configuration was as follows:
Encoder: ResNet18
Transformer stack:
  Number of layers = 6
  d_model = 260
  h (number of heads) = 4
  d_ff = 1024
  Dropout = 0.1
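The base configuration maps onto standard PyTorch modules. The sketch below assumes one reasonable wiring (a truncated ResNet18, a 1x1 projection from its 512 output channels to d_model, and the built-in transformer decoder); it is not the authors' exact implementation.

```python
import torch.nn as nn
import torchvision

class WordOCR(nn.Module):
    def __init__(self, vocab_size, d_model=260, nhead=4, num_layers=6, d_ff=1024, dropout=0.1):
        super().__init__()
        resnet = torchvision.models.resnet18(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])   # drop avgpool and fc
        self.proj = nn.Conv2d(512, d_model, kernel_size=1)             # 512 -> d_model channels
        self.embed = nn.Embedding(vocab_size, d_model)                 # character-level tokens
        layer = nn.TransformerDecoderLayer(d_model, nhead, dim_feedforward=d_ff, dropout=dropout)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, vocab_size)                     # plays the role of the 1x1 conv producing logits

    def forward(self, images, tokens, tgt_mask=None):
        feat = self.proj(self.backbone(images))        # (B, d_model, H', W')
        # the 2D positional encoding of equation (1) would be added to `feat` here
        memory = feat.flatten(2).permute(2, 0, 1)      # (H'*W', B, d_model) encoder sequence
        tgt = self.embed(tokens).permute(1, 0, 2)      # (T, B, d_model); 1D positional encoding omitted
        out = self.decoder(tgt, memory, tgt_mask=tgt_mask)
        return self.head(out).permute(1, 0, 2)         # (B, T, vocab_size) logits
```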
Figure 1: Overall encoder-decoder architecture
The model is implemented in PyTorch. The batch size could go up to 256, but depending on the augmentations we chose a lower value, especially when random rotation is involved. The optimizer is Adam with a fixed learning rate (α) of 2e-4, β1 = 0.9, and β2 = 0.999.
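These optimizer settings translate directly into PyTorch; `model` here stands for the network sketched above, and the vocabulary size is illustrative.

```python
import torch

model = WordOCR(vocab_size=128)   # sketch class from Section V; vocabulary size is an assumption
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.9, 0.999))
```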
System specification
The training is done on PARAM SHAVAK with a 2.6 GHz Intel Xeon Gold 6145, 96 GB of RAM, and an NVIDIA Quadro® GV100 GPU.
Duration
On average, training takes around 2-3 days including validation. The training time depends heavily on the available system resources and the model size, as is standard. The only other observation is that global padding increases processing time, since it pads all images to the size of the largest image in the dataset; hence, a batch-wise padding method is used.
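Batch-wise padding is typically implemented in the DataLoader collate function, so each batch is padded only to its own largest image and longest label rather than to a global maximum. A minimal sketch, with the function name, padding values, and dataset interface assumed for illustration:

```python
import torch
import torch.nn.functional as F

def collate_batchwise(batch):
    """batch: list of (image (C, H, W) in [0, 1], target_ids (T,)) pairs."""
    max_h = max(img.shape[1] for img, _ in batch)
    max_w = max(img.shape[2] for img, _ in batch)
    max_t = max(len(tgt) for _, tgt in batch)
    images, targets = [], []
    for img, tgt in batch:
        # pad right/bottom only up to the largest image in *this* batch
        images.append(F.pad(img, (0, max_w - img.shape[2], 0, max_h - img.shape[1]), value=1.0))
        targets.append(F.pad(tgt, (0, max_t - len(tgt)), value=0))   # 0 = assumed PAD id
    return torch.stack(images), torch.stack(targets)

# loader = torch.utils.data.DataLoader(dataset, batch_size=64, collate_fn=collate_batchwise)
```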
A. OCR Engine
1) Encoder architecture: As shown in Figure 1, a CNN is used in the encoder to extract features from the input images. The encoder extracts a feature map from the given image which, when combined with the 2D positional encoding, contains both the features and the information about their positions. This feature map, when flattened, forms our encoding. In this model, the encoder is a ResNet18 architecture without the final global pooling and fully connected classification layers.
2) Decoder architecture: The decoder is a six-block transformer decoder stack. It applies non-causal attention over the entire encoder output, i.e. it is independent of the previous step in the encoder output, and causal attention over the sequence input of the decoder. The input is right-shifted and then added to the 1D positional encoding of the input sequence; here the input sequence is formed by performing token embedding on the ground truth. This is followed by masking for teacher forcing. The decoder layers apply attention to the encoder output and self-attention to the previous layer's output; the result is then passed through a position-wise feed-forward network. The output of the decoder stack goes through a linear (1x1 convolution) layer to produce the final logits, which may then be normalised with a softmax layer.

VI. RESULTS

The input images for all 8 languages with the respective model predictions are shown in Figure 2. The results are compiled after testing the trained models on the test set of the data. The metrics are the character error rate (CER) and the word error rate (WER), as shown in equation (4). The model performs well on many of the languages and gives reasonably good output in most cases.

CER = (levenshtein_distance / total_characters) × 100
WER = (wrong_words / total_words) × 100    (4)
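Equation (4) can be computed as follows. The Levenshtein distance is written out explicitly so the sketch has no external dependencies; the function names are ours.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer_wer(predictions, references):
    """predictions, references: parallel lists of word strings; returns (CER %, WER %) as in eq. (4)."""
    edits = sum(levenshtein(p, r) for p, r in zip(predictions, references))
    chars = sum(len(r) for r in references)
    wrong = sum(p != r for p, r in zip(predictions, references))
    return 100.0 * edits / chars, 100.0 * wrong / len(references)
```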
Language | Val loss | CER | WER | Model
Bangla | 0.3419 | 2.484 | 6.08 | dmodel
Bangla | 0.3944 | 2.8 | 6.94 | ResNet18
Bangla | 6.394 | 134.29 | 99.43 | ResNet50
Bangla | 0.2701 | 2.27 | 6.08 | ResNet34
Bangla | 0.346453 | 0.0249 | 0.072688 | EfficientNet
Bangla | 0.49669 | 0.04244 | 0.10001 | MobileNet

Figure 5: WER for all 8 languages. WER is lowest for Kannada and Telugu and highest for Odia and Urdu (lower is better).
Figure 6: CER for all 8 languages. It shows similar trends as the WER; the CER value is lower than the WER for all languages.
Figure 7: Validation loss, CER and WER for all languages.
VII. CONCLUSION AND FUTURE WORK

The CNN-Transformer architecture is applied to Indian regional languages for the first time for word-level handwritten OCR. The advantage of visual features from the CNN and language features from the transformer is exploited in this unified framework. The proposed architecture achieves the lowest character error rate on Kannada, 1.5% on the validation set, followed by Bangla with 2.8% after 999 epochs. Additional experiments on Bangla, replacing ResNet18 with ResNet34, gave a performance boost over the default architecture: the character error rate went down to 2.2%. This could help improve the CER of the other languages and is left for future research. In the future, we will apply the normalization rules we used for Hindi to all other languages, which can significantly increase model accuracy. All handwritten OCR models are trained separately, and each model took around two to three days to train. Since the model architecture and its parameters do not change across languages, we can try transfer learning, reusing the low-level features learned by the CNN to reduce training time and cost. Transformers require a large amount of data to learn; to address that, we can try generating synthetic data using GANs to expose the model to different handwriting styles, which can help generalize the model. Further experiments can be done to build a single OCR model for all the languages, which would be beneficial in addressing these requirements. We can also try replacing our architecture with a Visual Transformer.
ACKNOWLEDGEMENT
The authors would like to thank the Technology
Development for Indian Languages (TDIL) Programme of the
Ministry of Electronics and Information Technology (MeitY),
Government of India for funding the consortium project for
Marathi OCR.
REFERENCES
[1] A. Obaid, H. El-Bakry, M. Eldosuky, and A. Shehab, "Handwritten text recognition system based on neural network," International Journal of Advanced Research in Computer Science and Technology, vol. 4, pp. 72–77, 01 2016.
[2] S. S. Singh and S. Karayev, “Full page handwriting recognition via
image to sequence extraction,” CoRR, vol. abs/2103.06450, 2021.
[Online]. Available: https://arxiv.org/abs/2103.06450
[3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones,
A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all
you need,” CoRR, vol. abs/1706.03762, 2017. [Online]. Available:
http://arxiv.org/abs/1706.03762
[4] M. K. Gupta, R. Shah, J. Rathod, and A. Kumar, “Smartidocr: Automatic
detection and recognition of identity card number using deep networks,”
in 2021 Sixth International Conference on Image Information Processing
(ICIIP), vol. 6, 2021, pp. 267–272.
[5] V. Bansal and R. M. K. Sinha, “A complete ocr for printed hindi text
in devanagari script,” in Proceedings of Sixth International Conference
on Document Analysis and Recognition, Sep. 2001, pp. 800–804.
[6] C. Biswas, P. S. Mukherjee, K. Ghosh, U. Bhattacharya, and S. K.
Parui, “A hybrid deep architecture for robust recognition of text lines
of degraded printed documents,” in 2018 24th International Conference
on Pattern Recognition (ICPR), 2018, pp. 3174–3179.
[7] A. Graves and J. Schmidhuber, “Offline handwriting recognition
with multidimensional recurrent neural networks,” in Advances in
Neural Information Processing Systems, D. Koller, D. Schuurmans,
Y. Bengio, and L. Bottou, Eds., vol. 21. Curran Associates, Inc., 2008. [Online]. Available: https://proceedings.neurips.cc/paper/2008/file/66368270ffd51418ec58bd793f2d9b1b-Paper.pdf
[8] V. Pham, C. Kermorvant, and J. Louradour, "Dropout improves recurrent neural networks for handwriting recognition," CoRR, vol. abs/1312.4569, 2013. [Online]. Available: http://arxiv.org/abs/1312.4569
[9] N. Ly, C. Nguyen, and M. Nakagawa, “An attention-based end-to-
end model for multiple text lines recognition in japanese historical
documents,” 09 2019, pp. 629–634.
[10] L. Kang, P. Riba, M. Rusiñol, A. Fornés, and M. Villegas, “Pay
attention to what you read: Non-recurrent handwritten text-line
recognition,” CoRR, vol. abs/2005.13044, 2020. [Online]. Available:
https://arxiv.org/abs/2005.13044
[11] B. Su and S. Lu, “Accurate scene text recognition based on recurrent
neural network,” in Computer Vision – ACCV 2014, D. Cremers,
I. Reid, H. Saito, and M.-H. Yang, Eds. Cham: Springer International
Publishing, 2015, pp. 35–48.
[12] B. Shi, X. Bai, and C. Yao, “An end-to-end trainable neural network
for image-based sequence recognition and its application to scene text
recognition,” vol. 39, no. 11, Nov 2017, pp. 2298–2304.
[13] K. Mehrotra, M. K. Gupta, and K. Khajuria, “Collaborative deep neural
network for printed text recognition of indian languages,” in 2019 Fifth
International Conference on Image Information Processing (ICIIP),
2019, pp. 252–256.
[14] S. Gongidi and C. V. Jawahar, “Iiit-indic-hw-words: A dataset
for indic handwritten text recognition,” in Document Analysis and
Recognition, ICDAR 2021, 16th International Conference, Lausanne,
Switzerland, September 5 to 10, 2021, Proceedings, Part IV. Berlin,
Heidelberg: Springer Verlag, 2021, pp. 444–459. [Online]. Available:
https://doi.org/10.1007/978-3-030-86337-1_30
[15] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov,
R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image
caption generation with visual attention,” 2015. [Online]. Available:
https://arxiv.org/abs/1502.03044
[16] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” 2015. [Online]. Available: https://arxiv.org/abs/1512.03385
[17] N. Moritz, T. Hori, and J. L. Roux, “Dual causal/non-causal self-
attention for streaming end-to-end speech recognition,” 2021. [Online].
Available: https://arxiv.org/abs/2107.01269