An Algorithm For Japanese Character Recognition
2 authors, including:
Sreeparna Banerjee
West Bengal University of Technology
All content following this page was uploaded by Sreeparna Banerjee on 06 January 2016.
Abstract—In this paper we propose a geometry- and topology-based algorithm for Japanese Hiragana character recognition. This algorithm is based on center of gravity identification and is size, translation and rotation invariant. In addition to the center of gravity, topology-based landmarks such as conjunction points, which mark the intersection of closed loops and multiple strokes, as well as end points have been used to compute the centers of gravity of these points located in the individual quadrants of the circles enclosing the characters. After initial pre-processing steps such as binarization, resizing, cropping, noise removal and skeletonization, the total number of conjunction points and the total number of end points are computed and stored. The character is then encircled and divided into four quadrants. The center of gravity (COG) of the entire character as well as the COGs of each of the four quadrants are computed, and the Euclidean distances of the conjunction and end points in each of the quadrants from the COGs are computed and stored. Values of these quantities are computed for both target and template images, and a match is made with the character having the minimum Euclidean distance. The average accuracy obtained is 94.1%.

Index Terms—Japanese Optical Character Recognition, geometry, topology, image processing.

I. INTRODUCTION

Optical character recognition (OCR) for both handwritten and printed text is a crucial step towards document analysis and retrieval for the purpose of storing and transmitting text in digital form via computers and networks. Different languages have alphabets with very different characteristics, which form the basis of the written text. In Indian languages, for instance, the written script can be broadly divided into Devanagari script for the North Indian languages and the Tamil script for the South Indian languages. OCR for both printed Hindi characters [1] and handwritten Devanagari characters [2] constitutes a complex task. South Indian languages like Malayalam [3] also possess a complex written character system. Japanese and Chinese languages also possess very complex character sets, as these include both syllabic character sets and ideograms. In this paper we focus on Japanese, which has over three thousand characters comprising syllabic characters called Kana and ideographic characters called Kanji. Furthermore, the Japanese language is agglutinative and mora-timed. It has a relatively small sound inventory and a lexically significant pitch-accent system, and is distinguished by a complex system of honorifics. Japanese text does not have delimiters, such as spaces, separating different words. Also, several characters in the Japanese alphabet could be homeomorphic, i.e. have similar shape definitions, which could add to the complexity of the recognition process. Thus, Japanese OCR is a very challenging task and many research efforts have been devoted to it. A survey of some of the approaches to OCR for the Japanese language is given in [4].

This paper proposes a geometry- and topology-based algorithm for Japanese character recognition by combining the Size Translation Rotation Invariant Character Recognition and Feature vector Based (STRICR-FB) algorithm originally proposed by Barnes and Manic [5] with some topological features of the individual characters.

The remainder of the paper is organized as follows. The next section describes the Japanese language model, followed by a description of allied work in the following section. In Section IV, a review of the original STRICR-FB is presented. The proposed algorithm is described in Section V. After that, application of the proposed algorithm is discussed in Section VI, and our conclusions are given in Section VII.

II. JAPANESE LANGUAGE MODEL

A. Japanese Character Sets

The Japanese language is written in a mixture of three scripts: two kana syllabaries and kanji. The two kana are called hiragana and katakana, shown in Figures 1 and 2, respectively. For Japanese words, Hiragana is used, mostly for grammatical morphemes. Katakana is used for transcribing foreign words, mostly Western borrowings, and for non-standard uses. In addition, diacritic signs like dakuten and handakuten are used (see Figs. 3 and 4). Dakuten are used for syllables with a voiced consonant phoneme. The dakuten glyph (゛) resembles a quotation mark and is directly attached to a character.

Several thousand kanji are in regular use, while the two syllabaries each contain 48 basic characters. Handakuten are used for syllables with a /p/ phoneme. The glyph for a 'maru' is a little circle (゜)
Copyright © 2015 MECS I.J. Image, Graphics and Signal Processing, 2015, 1, 9-15
that is directly attached to a character. Kanji is derived from Chinese characters and represents logographic or morphological units.

Thus Hiragana is used primarily for grammatical elements: particles, inflectional endings and auxiliary verbs. Katakana is used for writing loan words and onomatopoeic words, to give emphasis, to suggest a conversational tone, to indicate irony or a euphemism, and for transcribing foreign words, mostly Western.

... only between their grammatical modifiers and postpositions. The second way is to consider the modifiers and postpositions as a part of the modified word. Based on the study conducted by Sainio et al. [6] using 16 subjects reading 60-word Japanese texts excerpted from newspapers and internet columns, it was concluded that in pure Katakana text inter-word spacing is an effective segmentation method, in contrast to Kanji-Hiragana text, since visually salient kanji characters serve as effective segmentation cues by themselves.
Fig 1: Hiragana script

C. Character Feature Vector Identification

Every character has its own features and identity. By identifying features we can recognize characters from a textual image document. Feature extraction isolates the critical characteristics of the characters, which reduces the complexity of the pattern. After classification, the candidate is compared with known patterns and then matched with the character that has the same characteristics. The characters can be further subdivided into segments, and the strokes in each of these segments exhibit certain characteristics in terms of shape and angle of inclination. In addition, the presence of dakuten and handakuten also changes the character. All these aspects need to be taken into account in devising feature vectors for identification.
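As a minimal sketch of how the topological landmarks used later in the paper (end points and conjunction points) could be located, the example below applies the crossing-number heuristic to a one-pixel-wide skeleton. This is an assumption for illustration, not the authors' stated procedure: the function names, the `(row, column)` coordinate convention and the thresholds (crossing number 1 for an end point, 3 or more for a conjunction point) are all hypothetical choices.

```python
# Hypothetical sketch: landmark detection on a skeletonised binary image.
# img[r][c] == 1 marks a stroke pixel, 0 marks background.

# The 8 neighbours in clockwise order, starting from the pixel above.
NEIGHBOURS = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]


def pixel(img, r, c):
    """Return the pixel value, treating everything outside the image as background."""
    if 0 <= r < len(img) and 0 <= c < len(img[0]):
        return img[r][c]
    return 0


def crossing_number(img, r, c):
    """Count 0 -> 1 transitions while walking once around the 8 neighbours."""
    ring = [pixel(img, r + dr, c + dc) for dr, dc in NEIGHBOURS]
    return sum(1 for k in range(8) if ring[k] == 0 and ring[(k + 1) % 8] == 1)


def find_landmarks(img):
    """Return (end_points, conjunction_points) as lists of (row, column) tuples."""
    end_points, conjunction_points = [], []
    for r, row in enumerate(img):
        for c, value in enumerate(row):
            if not value:
                continue
            cn = crossing_number(img, r, c)
            if cn == 1:          # exactly one stroke branch leaves this pixel
                end_points.append((r, c))
            elif cn >= 3:        # three or more branches meet here
                conjunction_points.append((r, c))
    return end_points, conjunction_points


# A plus-shaped skeleton: two strokes crossing in the middle.
plus = [
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]
ends, conjunctions = find_landmarks(plus)
```

The crossing number is used instead of a plain neighbour count because pixels next to a crossing have many stroke neighbours without being crossings themselves.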
Some of the major efforts are stated in the following. Convergence of the shortest path has been used as a criterion for segmentation in [11]. An algorithm that takes into account the variations of angles due to pen position trajectories has been presented in [12]. Adaptive context processing [13] is another technique that has been used. An efficient indexing scheme for image storage and recognition has been reported in [14]. Improvement strategies for template matching have been discussed in [15]. Reference [16] proposes a multistage pre-candidate selection procedure in handwritten Chinese OCR. Recognition enhancement by linear tournament verification is suggested in [17]. Korean OCR is described in [18]. Hull [19] used multiple distortion-invariant descriptors for document image matching. Senda et al. [20] generated character candidates for document image retrieval. In [21] Niwa, Kayashima and Shimeki used keyword information for post-processing. Linguistic knowledge was found to improve OCR accuracy by Dahl, Norton and Taylor [22]. However, an exhaustive list of all the approaches is beyond the scope of the present work.

In all these efforts, the reported accuracy is between 80 and 90%. In both soft computing and image shape based methods the candidate character, which happens to be a handwritten sample, is matched with template characters. However, extracted features of the characters should be local, stable and discriminative [10]. The problem with the methods described above is that any change in size, translation or rotation affects the feature extraction process. Horizontal and vertical lines might be longer than in a standard character. Swirls and loops might be tight or open. Besides, the slanting orientation of left- and right-handed writers can be different. These orientations result in variations of the angles of the individual strokes of the characters. Calligraphy styles might be plain or ornate. Handwritten characters are affected by all of these changes. The presence of dakuten and handakuten characters in Japanese Katakana script, i.e. characters with double tick marks or small circles at the top right of the characters, respectively, poses additional problems. The STRICR-FB algorithm [5], based on Kohonen's Winner Take All (WTA) type of unsupervised learning, addresses all these problems. The original method has been tested on the MS-Mincho font character set. The extension of the STRICR-FB algorithm proposed in this paper has been tested on several handwritten samples, and a high accuracy has been obtained by including topological features of the individual characters of the character sets. These topological features include conjunction points as well as end points. Also, dividing the image into four quadrants and finding the COG and Euclidean distances within the four quadrants enhanced the accuracy.

IV. OVERVIEW OF STRICR-FB

The original STRICR-FB algorithm is performed in two phases: (i) the construction of Character Unique Feature Vectors (CUFV) and (ii) passing of these CUFVs through a neural network for identification. After locating the center of gravity of each of the characters, the four CUFVs, namely mean, variance, character density and decentricity, are calculated. In the second phase, an unsupervised clustering algorithm is used, where clusters are built with real data.

V. PROPOSED ALGORITHM

The proposed algorithm is based on STRICR-FB [5] in the first phase, where the center of gravity (COG) is first determined. Then a normalization of the (bitonal) character image is performed. Subsequently, the extraction of conjunction points is carried out. The features of the candidate images are compared by template matching with the prototype/template images using a Euclidean distance measure defined below.

The entire character image is enclosed by a circle whose center and radius are determined from the image. This circle is subsequently divided into four quadrants. The COG of the bitonal image of the character is computed using the number of (black) pixels and their locations (x, y). Once the COG is obtained, we try to locate the COG position among the four quadrants of the encircled character.

Due to the presence of multiple stroke types and orders, there are characters with intersection points and characters without them. A conjunction point is a point in the image where multiple strokes intersect. The characters with multiple strokes and conjunction point(s) create a pattern of relationship between COG location and the number of conjunction points for each of the individual characters of the Hiragana script. Moreover, the stroke orders and curvature types are comparatively less complex than those of Kanji characters. Thus, a template matching scheme based on the distances of the conjunction points from the COGs in each of the quadrants provides a robust method.

Japanese characters are homeomorphic, and so the computation of the COG alone might not be sufficient to recognize individual characters. Furthermore, handwritten characters vary widely from person to person. Thus distinctive landmarks are required to impart a unique identity to the character. In particular, conjunction points, which mark the intersection of closed loops and also of multiple strokes, play an important role in identifying the topology of the character. Thus, after normalizing a handwritten character for size, shape and orientation and subsequently fitting it into a circle, the conjunction points in each quadrant are located and COGs are measured for each character with respect to the conjunction points of that character. Euclidean distances of test and template images are calculated for these conjunction points. The outline of the process is given in Figure 2.

After pre-processing and normalization, the matching process involves (a) COG identification, (b) location of conjunction points and end points, (c) measuring the Euclidean distance, (d) comparison of template and target images and (e) check matching. These steps are described as follows:
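The COG identification step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the text does not say how the circle's center and radius are derived, so the sketch takes the center of the bounding box and the largest distance to any black pixel; the quadrant numbering (1 = upper right, 2 = upper left, 3 = lower left, 4 = lower right) and all function names are likewise hypothetical.

```python
import math

# Hypothetical sketch: img[y][x] == 1 marks a black (stroke) pixel.


def stroke_pixels(img):
    """List the (x, y) locations of all black pixels."""
    return [(x, y) for y, row in enumerate(img) for x, v in enumerate(row) if v]


def center_of_gravity(img):
    """COG = mean x and mean y over all black pixels."""
    pts = stroke_pixels(img)
    if not pts:
        raise ValueError("image contains no stroke pixels")
    n = len(pts)
    return sum(x for x, _ in pts) / n, sum(y for _, y in pts) / n


def enclosing_circle(img):
    """Assumed construction: bounding-box center, radius reaching the farthest pixel."""
    pts = stroke_pixels(img)
    if not pts:
        raise ValueError("image contains no stroke pixels")
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    cx = (min(xs) + max(xs)) / 2
    cy = (min(ys) + max(ys)) / 2
    radius = max(math.hypot(x - cx, y - cy) for x, y in pts)
    return cx, cy, radius


def quadrant(point, center):
    """Quadrant of `point` relative to `center`; image rows grow downward."""
    dx = point[0] - center[0]
    dy = center[1] - point[1]  # flip so that "up" is positive
    if dx >= 0 and dy >= 0:
        return 1
    if dx < 0 and dy >= 0:
        return 2
    if dx < 0 and dy < 0:
        return 3
    return 4


# An L-shaped stroke: most of the ink sits in the lower-left part.
l_shape = [
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
]
cog = center_of_gravity(l_shape)
cx, cy, radius = enclosing_circle(l_shape)
cog_quadrant = quadrant(cog, (cx, cy))
```

Points lying exactly on a dividing axis are assigned to the quadrant on the non-negative side, which is an arbitrary tie-breaking choice.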
Where 'ed' represents the Euclidean distance, (x, y) is the coordinate of a conjunction point of the sample character in the image, and (icog, jcog) is the COG obtained.

D. Comparison between Template and Target Images

Once the COG and Euclidean distance are determined for all the points, steps A, B and C are repeated for the target image of the handwritten text. Then the numerical values of the system-generated template image and the handwritten target image(s) are compared. If the character has two conjunction points, then the calculation for the template is:

COG COMPUTATION
1. Encircle character
2. Divide circle into four quadrants
3. Locate the COG
4. Locate the COG in quadrant (n)
5. Check for conjunction points with multiple strokes
6. Calculate Euclidean distance between COG and conjunction point
7. Get average Euclidean distance for all conjunction points
8. Get % match between template and target images
9. End
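Steps 6 to 8 of the outline can be sketched as below. The distance follows the definition in the text (the distance between a conjunction point (x, y) and the COG (icog, jcog)). The percentage-match formula is not given in the surviving text, so `percent_match` is a purely hypothetical relative-difference score, and all names and sample coordinates are invented for illustration.

```python
import math


def euclidean_distance(point, cog):
    """ed between a conjunction point (x, y) and the COG (icog, jcog)."""
    return math.hypot(point[0] - cog[0], point[1] - cog[1])


def mean_distance(points, cog):
    """Average ed over all conjunction points (0.0 if the character has none)."""
    if not points:
        return 0.0
    return sum(euclidean_distance(p, cog) for p in points) / len(points)


def percent_match(template_value, target_value):
    """Hypothetical score: 100 when equal, lower as the relative difference grows."""
    if template_value == target_value:
        return 100.0
    larger = max(abs(template_value), abs(target_value))
    return max(0.0, 100.0 * (1.0 - abs(template_value - target_value) / larger))


# Invented example: a character with two conjunction points.
template_cog = (0.0, 0.0)
template_points = [(3.0, 4.0), (6.0, 8.0)]
target_cog = (0.0, 0.0)
target_points = [(3.0, 4.0), (4.8, 6.4)]

template_mean = mean_distance(template_points, template_cog)
target_mean = mean_distance(target_points, target_cog)
score = percent_match(template_mean, target_mean)
```

Because the character is normalized for size before matching, the averaged distances of template and target are on a comparable scale.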
the characters with multiple strokes and the presence of intersections of the strokes.

The matching similarity of the characters having a single stroke or multiple strokes with zero intersection points could be obtained from equation (7). According to the

VI. RESULTS

Here a comparative study has been performed in which the numeric values are compared with visual inspection. An analysis based on complete visual inspection is made. We applied the approach to 45 Hiragana template images and six different handwritten samples of each of the 45 Hiragana characters, individually. The samples were then checked accordingly. An example of the calculations performed for the Hiragana character "あ" is given in Table II.

common is the Euclidean distance. Table II shows the measurements of six target samples of the handwritten Japanese Hiragana character "あ".

Handwritten characters differ from person to person due to the various ways of writing. The collection of multiple handwritten samples shows that COG location could be one of the factors for identifying the handwritten pictorial character sample. The location of the COG among the four quadrants of the enclosing circle distinguishes handwritten characters from sample to sample.

Table 3. Relationship Between COG In Quadrant And Total Number Of Conjunction Points
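The evaluation described here (each handwritten sample is matched to the template with the minimum Euclidean distance, and the share of correct matches is reported) can be sketched as follows. The feature-vector layout and every number in the example are invented for illustration and are not taken from Table II; the function names are hypothetical.

```python
import math


def feature_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def classify(target, templates):
    """Return the label of the template closest to the target."""
    return min(templates, key=lambda label: feature_distance(target, templates[label]))


def recognition_rate(samples, templates):
    """Percentage of (label, features) samples assigned their true label."""
    correct = sum(1 for label, feats in samples if classify(feats, templates) == label)
    return 100.0 * correct / len(samples)


# Assumed layout: [conjunction points, end points, mean ed, second mean ed].
templates = {
    "あ": [2, 1, 7.5, 3.0],
    "い": [0, 4, 0.0, 5.0],
}
samples = [
    ("あ", [2, 1, 7.0, 3.2]),
    ("い", [0, 4, 0.5, 4.6]),
    ("い", [2, 2, 6.0, 3.5]),  # deliberately closer to the "あ" template
]
rate = recognition_rate(samples, templates)
```

In this toy data the third sample is misclassified, so two of three samples are recognized correctly.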
Our average recognition rate was 94.1%, with the best three results corresponding to 100%, 98.4% and 95%. This is better than the result obtained in [5] using STRICR-FB for the MS Mincho set, with a recognition rate of 90%.

VII. CONCLUSION

In this paper, we have proposed a semi-analytical approach based on physical and linguistic features of the Japanese Hiragana script and worked on six different handwritten samples for each of 45 Hiragana characters for identification by template matching. Due to the homeomorphic nature of Japanese characters, we have combined the original STRICR-FB algorithm for COG identification with topological feature identification, namely conjunction points and end points, for handwritten characters, which is an improvement over the result obtained using STRICR-FB for printed character sets. This is due to the fact that the inclusion of topological features in the identification process has added robustness to the technique.

In the future, the technique will be tested on Katakana and Kanji characters, both printed and handwritten.

REFERENCES

[1] Indira B., Shalini M., Ramana Murthy M.V., Mahaboob Sharrief Shaik, Classification and Recognition of printed Hindi characters using Artificial Neural Networks, Int. J. Image, Graphics, Signal Processing, 4(6), 15-21 (2012).
[2] Dewangan Shailendra Kumar, Real Time Recognition of Handwritten Devanagiri Signatures without segmentation using Artificial Neural Networks, Int. J. Image, Graphics, Signal Processing, 5(4), 30-37 (2013).
[3] John Jomy, Balakrishnan Kannan, Pramode K.V., A system for Offline Recognition of Handwritten characters in Malayalam Script, Int. J. Image, Graphics, Signal Processing, 5(4), 53-59 (2013).
[4] Das S and Banerjee S (2014), Survey of Pattern Recognition Approaches in Japanese Character Recognition, International Journal of Computer Science and Information Technology, Vol. 5(1) 93-99.
[5] D. Barnes and M. Manic, STRICR-FB a Novel Size-Translation Rotation Invariant Character Recognition Method (2010), Proceed. 6th Human System Interaction Conference, Rzeszow, Poland, 163-168.
[6] Miia Sainio, Kazuo Bingushi, Raymond Bertram; Vision Research (2007); "The role of interword spacing in reading Japanese: An eye movement study"; Volume: 47, Issue: 20, Pages: 2577-2586.
[7] Neural Networks which consist of Simple Recurrent Networks for Character Recognition by Template Matching, (2008) Y. Shimozawa, Journal of the Information Processing Society 49(10), 3703-3714.
[8] The post processing of Character Recognition by Genetic Algorithms, Y Shimozawa, S. Okoma, (1999), Journal of the Information Processing Society, 40(3) 1106-1116.
[9] Off-line Hand Printed Character Recognition System using directional HMM based on features of connected pixels, H. Nishimura, M. Tsutsumi, M. Maruyama, H. Miyao and Y. Nakano (2002), Journal of the Information Processing Society 43(12) 4051-4058.
[10] Maruyama, K.-I.; Maruyama, M.; Miyao, H.; Nakano, Y.; "Hand printed Hiragana recognition using support vector machines"; Frontiers in Handwriting Recognition, 2002. Proceedings. Eighth International Workshop on; Digital Object Identifier: 10.1109/IWFHR.2002.1030884; 2002, Pages: 55-60.
[11] Proposal of Character Segmentation using Convergence of the Shortest Path, E. Tanaka (2011), Meeting on Image Recognition and Understanding Proceedings, MIRU 2011, 331-336.
[12] An Online Character Recognition Algorithm RAV (Parametrized Angle Variations), M. Kobayashi, S. Masaki, O. Miyamoto, Y. Nakagawa, Y. Komiya and T. Matsumoto, (2000), Journal of the Information Processing Society, 41(9), 2536-2544.
[13] A Handwriting Character Recognition System Using Adaptive Context Processing, N. Okayama, (1999) IPSJ Technical Report on Information and Media (IM) 85-90.
[14] An Efficient Indexing Scheme for Image Storage and Recognition, IEEE Transactions on Industrial Electronics, M. Al Mohamed, (1999) 46(2).
[15] S. H. Kim, "Performance Improvement Strategies on Template Matching for Large Set Character Recognition", Proc. 17th International Conference on Computer Processing of Oriental Languages, pp. 250-253, Hong Kong, April 1997.
[16] C. H. Tung, H. J. Lee, Y. J. Tsai, "Multi-stage pre-candidate selection in handwritten Chinese character recognition systems", Pattern Recognition, vol. 27, no. 8, pp. 1093-1102, 1994.
[17] H. Takahashi and T.D. Griffin, "Recognition enhancement by linear tournament verification", Proc. 2nd ICDAR, Tsukuba, Japan, 1993, pp. 585-588.
[18] D.H. Kim, Y.S. Hwang, S.T. Park, E.J. Kim, S.H. Park and S.Y. Bang, "Handwritten Korean character image database PE92", Proc. 2nd ICDAR, Tsukuba, Japan, 1993, pp. 470-473.
[19] J. J. Hull, Document image matching and retrieval with multiple distortion-invariant descriptors, Proc. IAPR Workshop on Document Analysis Systems, pp. 383-399 (1994).
[20] S. Senda, M. Minoh and K. Ikeda, Document image retrieval system using character candidates generated by character recognition process, Proc. Second Int. Conf. Document Analysis and Recognition, pp. 541-546 (1993).
[21] N. Niwa, K. Kayashima and Y. Shimeki, Post processing for character recognition using keyword information, IAPR Workshop Machine Vision Applications, pp. 519-522 (1992).
[22] D.A. Dahl, L. M. Norton and S. L. Taylor, Improving OCR accuracy with linguistic knowledge, Proc. Second Ann. Symp. Document Analysis and Information Retrieval, pp. 169-177 (1993).

Authors' Profiles

Soumendu Das, male, was a student of West Bengal University of Technology, Kolkata, India, and he is currently employed at Infosys, India. His research interests are image processing and OCR in Japanese