
Optical Character Recognition using Convolutional Neural Network

Sakshi Shreya, Electronics and Communication Engineering, Bharati Vidyapeeth's College of Engineering, New Delhi, India, [email protected]
Yash Upadhyay, Computer Science Engineering, Bharati Vidyapeeth's College of Engineering, New Delhi, India, [email protected]
Mohit Manchanda, Electronics and Communication Engineering, Bharati Vidyapeeth's College of Engineering, New Delhi, India, [email protected]
Rubeena Vohra, Electronics and Communication Engineering, Bharati Vidyapeeth's College of Engineering, New Delhi, India, [email protected]
Gagan Deep Singh, Electronics and Communication Engineering, Bharati Vidyapeeth's College of Engineering, New Delhi, India, [email protected]

Abstract - Optical Character Recognition is the process of translating images of handwritten, typewritten, or printed text into a format understood by machines. The purposes of Optical Character Recognition are editing, indexing/searching, and reduction in storage size. This is achieved by first scanning the photo of the text character-by-character, followed by analysis of the scanned image, and finally translation of each character image into a character code, such as ASCII. In this paper, we have used a segmentation algorithm to divide the image into lines, words and then characters. The characters are recognized using a Convolutional Neural Network. The results obtained by Convolutional Neural Networks seem promising in recognizing the characters in comparison to the results obtained through other machine learning algorithms such as Support Vector Machines and Artificial Neural Networks.

Keywords - Optical Character Recognition; Convolutional Neural Network; Artificial Neural Network; Support Vector Machines; Segmentation; Image Characters

I. INTRODUCTION

Optical Character Recognition (OCR) is a technique for the automatic identification of optical characters [1]. One of the methods used for OCR is Convolutional Neural Networks (CNN). In classifying images, neural networks have eclipsed conventional machine learning methods: at the expense of training time, neural networks tend to achieve much higher accuracy and precision than SVMs and KNN. In computer vision, CNNs have been an area of interest for researchers due to their human-level generalization abilities; the pooling layer keeps them from overfitting to the training data. CNN is a deep learning architecture inspired by the visual system [2]. It finds applications in areas like Computer Vision and Natural Language Processing (NLP) [3]. In Computer Vision, CNN is used for face recognition, scene labelling, image classification, action recognition, etc., and in NLP, CNN is used for applications like speech recognition and text classification.

C. Yi et al. [4] have employed a motion-based algorithm to identify text in a cluttered environment. This algorithm detects an object of interest by asking the user to shake the hand-held object. A Background Subtraction (BGS) method is used to isolate the moving object from the stationary background. They have used the AdaBoost learning model [5] for training the text classifier.

J. Bai et al. [6] have introduced the Shared-Hidden-Layer Deep Convolutional Neural Network (SHL-CNN) in their work. This model reduces character recognition errors by 16-30% compared to a model trained on characters of a single language, and it can also be used for more than one language.

V. V. Patil et al. [7] and V. Shrivastava et al. [8] have outlined the phases that are used in Optical Character Recognition. They have divided the process into two major parts: Image Processing and Recognition.

P. Satyanarayana et al. [9] researched assistance for blind people in reading text from documents and papers. They divided the process into three main stages: training the data, testing the data (i.e., classification using KNN), and then text-to-speech conversion of the recognised data. Raspberry Pi was chosen as the hardware interface.

Y. Li et al. [10] have experimented with a method to detect text lines in a handwritten document. They blurred the image in the horizontal direction multiple times and then thresholded the image so that the words get connected with each other. They then took the thresholded image and grouped the connected components; each group is a line. In some cases the text has more separation between words, and in those cases the line gets segmented. They have used thresholding to determine whether two segments belong to the same line or to different lines.

CNN has very high accuracy in image recognition problems if trained well, which makes it an ideal choice for image recognition applications. Apart from their high accuracy, CNNs rely on spatial features. This can be a strength when the context of a feature is local (a bunch of pixels) and a weakness when the context is distributed (a sum of several one-hot encodings is difficult to handle with CNNs, but very easy for a decision tree). Also, CNNs store much more information as parameters than other methods.

ConvNets make better use of their enormous number of parameters, and they also learn by gradient descent; each layer feeds from the layer below, building up hierarchical features adapted to the task at hand. Support Vector Machines and other classical machine learning methods usually require features in the form of a real-valued vector. ConvNets, in contrast, are normally trained end-to-end, which means they adapt to the problem they are trying to solve.

A character can be handwritten in a vast variety of ways. Every person writes the text with a different tilt, shape, size, etc. Yet a human can easily recognize and read each character. The goal of this paper is to create an OCR program that inputs an image and returns characters in the form of text that can be stored in .txt format. This is accomplished by passing the image through various steps: pre-processing, segmentation and recognition. Pre-processing includes various image processing techniques like noise removal, thresholding and rotation. Similarly, segmentation covers three steps: line segmentation, word segmentation and character segmentation. After the segmentation step, the image is converted into various small images, each containing a different character. These images are passed to an R-CNN model which recognizes the characters and returns the output in ASCII format.

II. DATA DESCRIPTION

For training the image classifier we have used the EMNIST [11][12] dataset, a set of handwritten characters and digits derived from the NIST [13][14] Special Database. It is an extended version of the famous MNIST dataset: MNIST only contained handwritten digits, whereas EMNIST also includes handwritten letters. EMNIST ByClass contains 814,255 characters in 62 unbalanced classes: 52 classes for letters (26 uppercase and 26 lowercase) and 10 classes for digits. All the images are given in 28 by 28 pixel format.
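As a minimal sketch of how this split could be loaded, the snippet below uses the tensorflow_datasets package, whose emnist/byclass configuration matches the 814,255 images and 62 classes described above; the loading tooling is our assumption, since the paper does not state which was used.

```python
# Sketch: loading EMNIST ByClass via tensorflow_datasets (an assumed
# tool; the paper does not say how the data was loaded). The
# 'emnist/byclass' config has 814,255 28x28 grayscale images in 62 classes.
import tensorflow_datasets as tfds

(train_ds, test_ds), info = tfds.load(
    "emnist/byclass",
    split=["train", "test"],
    as_supervised=True,  # yields (image, label) pairs
    with_info=True,
)
print(info.features["label"].num_classes)  # 62
# Note: raw EMNIST images are stored rotated/flipped relative to the
# natural orientation and may need a transpose before display.
```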
III. IMPLEMENTATION

The process is divided into two major phases: preprocessing and segmentation.

A. Preprocessing

An image is given as an input to the system. The image can be in .jpg or .png format. The image is first converted to grayscale because a grayscale image is faster to process than an RGB image. Preprocessing is further divided into two steps:

1) Noise Removal: Users usually capture the image using a smartphone, which introduces a lot of noise. Even scanners are not completely noise-free. This leads to poor quality images. To remove this noise, we used Gaussian and median blur.

Gaussian blur is used to blur the grayscale image using a Gaussian kernel. This method reduces Gaussian noise, which is the most common type of noise in an image. A two-dimensional Gaussian function is used to generate a convolutional matrix, and convolving the image with this matrix produces a blurred effect. The Gaussian function in two dimensions is

G(x, y) = \frac{1}{2\pi\sigma^{2}} \, e^{-\frac{x^{2} + y^{2}}{2\sigma^{2}}}    (1)

where x is the distance from the origin along the horizontal axis, y is the distance from the origin along the vertical axis, and \sigma is the standard deviation of the Gaussian distribution.

Median blur reduces the salt-and-pepper noise in an image by replacing every pixel with the median of its surrounding pixels. Median blur gives its best results on a binary image, so before applying the median blur we convert the image to a binary image by thresholding. We have used Otsu's thresholding [15], which automatically calculates the threshold value using a clustering-based method. We have inverted the image so that the background becomes black and the letters become white in colour.
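A minimal OpenCV sketch of this preprocessing chain follows; the file name and kernel sizes are illustrative assumptions, not values taken from the paper.

```python
# Sketch of the preprocessing described above, using OpenCV.
import cv2

img = cv2.imread("document.jpg", cv2.IMREAD_GRAYSCALE)  # grayscale is cheaper than RGB

# Gaussian blur suppresses Gaussian noise; sigma is derived from the kernel size.
blurred = cv2.GaussianBlur(img, (5, 5), 0)

# Otsu's method picks the threshold automatically for a bimodal image.
# THRESH_BINARY_INV inverts the result: background black, letters white.
_, binary = cv2.threshold(blurred, 0, 255,
                          cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Median blur removes salt-and-pepper noise; it works best on the binary image.
clean = cv2.medianBlur(binary, 3)
```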
2) Rotation: It is not possible to take exactly horizontal images with a smartphone or even a scanner; in most cases the image is slightly tilted. We detected the tilt angle and then performed the necessary rotation to correct the image.
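The paper does not name its tilt-detection method; the sketch below shows one common approach, fitting a minimum-area rectangle around the white text pixels of the inverted binary image and rotating by the detected angle.

```python
# Sketch: tilt detection and correction (one possible method, not
# necessarily the paper's).
import cv2
import numpy as np

def deskew(binary):
    # Coordinates of all text (white) pixels in the inverted binary image.
    coords = np.column_stack(np.where(binary > 0)).astype(np.float32)
    # Angle of the minimum-area rectangle around the text block.
    # The angle convention differs between OpenCV versions; the
    # normalization below maps it into [-45, 45] degrees.
    angle = cv2.minAreaRect(coords)[-1]
    if angle < -45:
        angle += 90
    elif angle > 45:
        angle -= 90
    h, w = binary.shape
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(binary, m, (w, h), flags=cv2.INTER_NEAREST)
```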

B. Segmentation

For properly detecting the sequence in which the characters appear, we divided the process into three steps.

1) Line Segmentation: The most common algorithm for detecting lines takes a projection of the image in the horizontal direction. The whole projection matrix is divided into two categories: lines and spaces. Wherever a character is present, that part falls into the category of lines; the space between two lines falls into the category of spaces. In our case, since we have inverted the image, a pixel value of 0 means space and any value above 0 means a line. A threshold space value is decided to figure out whether a space is due to the separation between lines or due to separation within characters caused by noise. This method works for printed text because it has proper spacing between the lines. But in handwritten text, there may be very little spacing between two lines. That is why we have reduced the threshold value to 0.
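A sketch of this projection-based line segmentation, with the space threshold already reduced to 0 (any run of empty rows ends a line), might look like this:

```python
# Sketch: line segmentation by horizontal projection on the inverted
# binary image (text pixels > 0).
import numpy as np

def segment_lines(binary):
    profile = binary.sum(axis=1)           # horizontal projection, one value per row
    rows = profile > 0                     # True where the row contains text
    lines, start = [], None
    for y, has_text in enumerate(rows):
        if has_text and start is None:
            start = y                      # a new line begins
        elif not has_text and start is not None:
            lines.append(binary[start:y])  # the line ended on the previous row
            start = None
    if start is not None:                  # text runs to the bottom edge
        lines.append(binary[start:])
    return lines
```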
2) Word Segmentation: Word segmentation is similar to line segmentation. Instead of the whole image, the program takes one line at a time. For each line, the 2D image is converted to a 1D matrix by taking the vertical projection of the image. A value of 0 means a space, between either characters or words, and any other value means text. This information is used to find a threshold space in the image. If a streak of 0's is longer than the threshold, it is a space between two words; if the streak is shorter than the threshold, it is the space between two letters of the same word. This 1D matrix is processed to find the x coordinates of all the words. The total number of words is counted and returned to the user.
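A sketch of this word segmentation step follows; the gap threshold separating inter-word from inter-letter spaces is an illustrative assumption.

```python
# Sketch: word segmentation of one line by vertical projection. Runs of
# empty columns longer than gap_thresh are spaces between words; shorter
# runs are gaps between letters of the same word.
import numpy as np

def segment_words(line, gap_thresh=8):
    profile = line.sum(axis=0)                  # vertical projection per column
    cols = profile > 0
    words, start, gap = [], None, 0
    for x, has_text in enumerate(cols):
        if has_text:
            if start is None:
                start = x                       # a word begins
            gap = 0
        elif start is not None:
            gap += 1
            if gap > gap_thresh:                # long streak of 0's: word boundary
                words.append(line[:, start:x - gap + 1])
                start, gap = None, 0
    if start is not None:                       # word runs to the right edge
        words.append(line[:, start:])
    return words
```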
3) Character Segmentation: The program takes one word at a time and finds the contours present in that word. A contour is defined as an outline that bounds the shape of something; in our case, the contours represent the characters. We find the area of each contour, which is then used to determine a threshold area. Every contour is compared with the threshold: if the contour area is less than the threshold area, it is treated as noise and ignored. The remaining contours are taken and rectangles are drawn around them.

Bounding boxes of some characters need to be corrected. These characters are i, j, . (full stop) and , (comma), all of which have small parts. First, we detect whether the small part lies in the lower half of the line; in that case it may be a full stop or a comma, and full stops and commas are zero-padded on top. Otherwise, if the dot lies in the upper half, it may be part of an 'i' or 'j'; its second part is found by comparing the x coordinate values, and the two parts are merged. These characters are then converted to 28x28 matrices so that they can be fed into a Neural Network.
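A contour-based sketch of this step is given below (the dot-merging correction for 'i', 'j', full stops and commas is omitted for brevity); the minimum-area threshold is an illustrative assumption.

```python
# Sketch: character segmentation with contours. Contours smaller than
# min_area are discarded as noise; the rest get bounding rectangles.
import cv2

def segment_characters(word, min_area=30):
    contours, _ = cv2.findContours(word, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for c in contours:
        if cv2.contourArea(c) < min_area:       # too small: treat as noise
            continue
        boxes.append(cv2.boundingRect(c))       # (x, y, w, h)
    boxes.sort(key=lambda b: b[0])              # left-to-right reading order
    chars = [word[y:y + h, x:x + w] for x, y, w, h in boxes]
    return [cv2.resize(ch, (28, 28)) for ch in chars]  # 28x28 for the CNN
```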
IV. DISCUSSION OF RESULTS

A. Pre-processing

Fig. 1 A handwritten document image for testing

1) Noise Removal: After the process of noise removal, most of the noise was removed except for large patches in the document. These patches are of similar or larger size than the characters, which is why they are hard to remove. We have thresholded the input image because it helps in removing noise and also makes the image easier to read. Otsu's thresholding automatically finds the threshold for bimodal images using a clustering-based algorithm.

Fig. 2 Resulting image after noise removal and thresholding

2) Rotation: The tilt angle was perfectly detected for horizontal images if the tilt lies in the range (-180°, 180°]. If the tilt angle lies outside this range, the image was rotated upside-down and the letters were not recognized properly.

Fig. 3 Tilt detection in the thresholded image

Fig. 4 Image after rotation

B. Segmentation

1) Line Segmentation: In Fig. 4, we can see that characters like 'g' and 'p' come very close to the neighbouring lines. Since we removed the threshold value in our algorithm, all the lines are properly detected. But if the user writes in such a manner that two lines overlap each other, our algorithm will consider those two lines as one single line. In such cases, we will not get the expected results.
TABLE I COMPARISON OF DIFFERENT TECHNIQUES USED FOR OPTICAL CHARACTER RECOGNITION

KNN
- Advantages: Simplest classification algorithm; calculation time is small.
- Disadvantages: Accuracy depends on the value of k and the type of distance chosen.
- Applications: Text classification, climate forecasting, stock market forecasting, medicine, image classification.
- Accuracy in OCR: 60-70%
- Accuracy achieved by our experiments: 65.9%

ANN
- Advantages: Has fault tolerance (corruption of one or more cells does not prevent it from generating output); stores information on the entire network.
- Disadvantages: Hard to interpret the model (a black box once trained); data required for training is large.
- Applications: Speech recognition, stock market forecasting, image classification, text classification, medicine.
- Accuracy in OCR: 80-95%
- Accuracy achieved by our experiments: 90%

CNN
- Advantages: Produces state-of-the-art results for spatial data like images; stores more information as parameters than other methods.
- Disadvantages: Data required for training is large; very high computational cost.
- Applications: Face recognition, scene labelling, image classification, speech recognition, text classification.
- Accuracy in OCR: 85-97%
- Accuracy achieved by our experiments: 92.4%
Fig. 5 Line segmentation in the document

2) Word Segmentation: Fig. 6 shows the output after word segmentation of the first line. Since humans provide enough space between two words, detecting words is relatively easier than detecting lines or characters. If the lines are detected properly, the words are also detected properly. But if two lines are detected as a single line during line segmentation, that causes problems in properly detecting the words.

Fig. 6 Word segmentation in the first line

3) Character Segmentation: Users are required to write in block letters for all the characters to be detected. We find each contour and draw a bounding rectangle around it. Fig. 7 shows the bounding rectangles for all the characters in the first word.

Fig. 7 Character segmentation for the first word
V. CHARACTER RECOGNITION

Character recognition is done using a convolutional network. After the character segmentation step, we give the segmented images to a multistack Convolutional Neural Network. Our neural network's architecture is made up of 2 two-dimensional convolutional layers, 1 max pooling layer, 2 dropout layers, and 2 densely connected layers with 512 neurons each. This neural network is trained on the EMNIST dataset. The kernel size was taken as 3 by 3, 32 convolutional filters were used in the network, and the pooling size was 2 by 2. The neural network was trained for 30 epochs and ended with an accuracy of 0.9353 on the training set and 0.9241 on the validation set; the loss was 0.0647 on the training set and 0.0759 on the validation set.
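A Keras sketch consistent with this description is shown below; the dropout rates, activation functions and exact layer ordering are assumptions where the paper is silent.

```python
# Sketch of the described architecture: two 3x3 convolutional layers
# with 32 filters, one 2x2 max-pooling layer, two dropout layers, and
# two 512-neuron dense layers before a 62-way softmax.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),                    # rate is an assumption
    layers.Flatten(),
    layers.Dense(512, activation="relu"),
    layers.Dense(512, activation="relu"),
    layers.Dropout(0.5),                     # rate is an assumption
    layers.Dense(62, activation="softmax"),  # EMNIST ByClass has 62 classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=30, validation_data=(x_val, y_val))
```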

VI. CONCLUSION

In this paper, we focused on converting an image to text format. We explored the use of various image processing algorithms like thresholding, Gaussian blur, median blur, contour detection, etc., and a machine learning algorithm, Convolutional Neural Networks (CNN). The image processing methods were combined to extract all the characters from the image. The characters were recognised using a CNN with an accuracy of 0.9241.

We have compared this result with other well-known classification techniques: K-nearest neighbours (KNN) and Artificial Neural Networks (ANN), as shown in Table I. On comparing all of the methods, we conclude that our solution gives the highest accuracy.

This work can be applied to reading the number plates of vehicles, text on a card, etc., so that these can be easily stored in a machine or a database. It can also be applied to helping blind people by converting printed or handwritten text to a machine-readable form, which can then be further converted to speech.

The recognition can be further improved by properly segmenting the lines and characters. Humans cannot write paragraphs in an exactly horizontal fashion; some lines always get tilted. Also, not all characters are separate in a handwritten document. Both of these cause problems in segmenting the lines and characters. We can also improve the recognition by using more advanced neural networks like ResNet, LeNet and VGG16, and by using boosting techniques like XGBoost and AdaBoost.

Fig. 8 The output of OCR in text format
VII. FUTURE WORK

The user should write in block letters for the letters to be identified; otherwise, our algorithm is not able to detect them. We will improve our algorithm to also detect connected letters. Our model is not yet trained to detect special characters like the question mark, colon, apostrophe, exclamation mark, etc.; we will train it on these characters in the future. Our algorithm also assumes that the user has written the text with a dark coloured pen on light coloured paper. If the background is dark and the text is light, the code will not be able to detect the letters. We will make the code flexible enough to detect light coloured text on a dark coloured background.

REFERENCES

[1] L. Eikvil, "OCR - Optical Character Recognition", Oslo, 1993, pp. 5-7.
[2] D. H. Hubel and T. N. Wiesel, "Receptive fields, binocular interaction and functional architecture in the cat's visual cortex", The Journal of Physiology, vol. 160, no. 1, pp. 106-154, 1962.
[3] A. Bhandare, M. Bhide, P. Gokhale and R. Chandavarkar, "Applications of Convolutional Neural Networks", International Journal of Computer Science and Information Technologies, vol. 7, no. 5, pp. 2206-2215, 2016.
[4] C. Yi, Y. Tian and A. Arditi, "Portable Camera-Based Assistive Text and Product Label Reading From Hand-Held Objects for Blind Persons", IEEE/ASME Transactions on Mechatronics, vol. 19, no. 3, pp. 808-817, 2014.
[5] Y. Freund and R. Schapire, "Experiments with a new boosting algorithm", in Proc. Int. Conf. Machine Learning, 1996, pp. 148-156.
[6] J. Bai, Z. Chen, B. Feng and B. Xu, "Image character recognition using deep convolutional neural network learned from different languages", IEEE International Conference on Image Processing (ICIP), pp. 2560-2564, 2014.
[7] V. V. Patil, R. V. Sanap and R. B. Kharate, "Optical Character Recognition using Artificial Neural Network", Dr. J. J. Magdum College of Engg., Jaysingpur, India, ISSN 2091-2730, Jan.-Feb. 2015.
[8] V. Shrivastava and N. Sharma, "Artificial Neural Network Based Optical Character Recognition", Signal & Image Processing: An International Journal, vol. 3, no. 5, pp. 73-80, 2012.
[9] P. Satyanarayana, K. Sujitha, V. Sai Anitha Kiron, P. Ajitha Reddy and M. Ganesh, "Assistance Vision for Blind People Using k-nn Algorithm and Raspberry Pi", Proceedings of the 2nd International Conference on Micro-Electronics, Electromagnetics and Telecommunications, pp. 113-122.
[10] Y. Li, Y. Zheng and D. Doermann, "Detecting Text Lines in Handwritten Documents", 18th International Conference on Pattern Recognition (ICPR'06), 2006.
[11] G. Cohen, S. Afshar, J. Tapson and A. van Schaik, "The EMNIST Dataset", NIST, 2016. [Online]. Available: https://www.nist.gov/itl/iad/image-group/emnist-dataset. [Accessed: 15-Aug-2018].
[12] G. Cohen, S. Afshar, J. Tapson and A. van Schaik, "EMNIST: an extension of MNIST to handwritten letters", arXiv preprint arXiv:1702.05373, 2017.
[13] P. Grother, "NIST Special Database 19", NIST, 2017. [Online]. Available: https://www.nist.gov/srd/nist-special-database-19. [Accessed: 15-Aug-2018].
[14] P. Grother and K. Hanaoka, "NIST Special Database 19: handprinted forms and characters, 2nd Edition", National Institute of Standards and Technology, Tech. Rep., 2016.
[15] H. Vala and A. Baxi, "A Review on Otsu Image Segmentation Algorithm", International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), vol. 2, no. 2, pp. 387-389, 2013.

