Japanese Text Classification
Bens Pardamean
Computer Science Department, BINUS Graduate Program - Master of Computer Science Program,
Bina Nusantara University, Jakarta, Indonesia 11480
[email protected]
Abstract—As a subset of Artificial Intelligence, Natural Language Processing (NLP) is a breakthrough in overcoming the language barrier. Japanese language characteristics bring their own challenges to morphological analysis due to the uniqueness of the Japanese grammatical system. With the rapid development of NLP, many Japanese NLP tools have been developed with limited but specialized abilities for running certain preprocessing methods. In this paper, a compilation of methods and newly available tools for preprocessing Japanese text is delivered to help readers decide which Japanese NLP tools should be utilized to run particular preprocessing methods. All of the Japanese preprocessing methods and tools were collected through a literature review. It is concluded that depending on a single NLP tool is not recommended, since a combination of Japanese NLP tools is required to complete the Japanese preprocessing phase.

Keywords—Japanese, Preprocessing, Tools, Methods, Natural Language Processing

I. INTRODUCTION

Japanese is known as one of the most difficult languages to learn. Along with the rapid development of Artificial Intelligence (AI), the subset that learns the patterns of human language in the form of text, called Natural Language Processing (NLP), has become a breakthrough that allows non-Japanese speakers to communicate in Japanese with minimal knowledge of the language. In general, applications of NLP are able to overcome problems in language analysis [1], word segmentation [2], [3], and automated question answering [4], [5].

As a non-alphabetic language, Japanese uses three types of letters, namely Hiragana, Katakana, and Kanji. The three are combined with alphabetic characters, or Romaji as the Japanese call them, and numerals in daily use. Kanji consists of 2,136 characters officially announced by the Japanese Ministry of Education in 2010 [6]; historically, from the 9th century onward, its usage was simplified using the other two types of letters. In modern use, Katakana is used to write words of foreign origin and foreign names [7], [8], while Hiragana is used to write native Japanese words that have no Kanji representation. Hiragana is also used to write grammatical elements such as the particles を (wo), に (ni), へ (he), が (ga), and は (wa). Given all this complexity and uniqueness, Suzuki found communication difficulties when Japanese is used by foreign speakers from China [9]. In general, using grammar to construct transitive/intransitive verb pairs, adverbial clauses, relative clauses, sentences with genitive markers, and sentences with locative particles are the five most difficult tasks for people learning the language.

In the Japanese medical domain, NLP is believed to deliver a solution that bridges the communication gap between medical terms used in Japan and in the rest of the world. For example, Aramaki, Yano, and Wakamiya developed a Japanese NLP model for clinical information retrieval and text classification that matches Japanese terms to global medical terms in a straightforward way [10]. Ito et al. also performed a similar study to help Japanese medical workers prepare globally standardized medical reports by creating a Japanese medical dictionary with NLP [11]. In short, text classification is the first step in developing a highly functional NLP model, and finding the correct preprocessing methods and tools is the root of the whole solution.

To keep up with current trends in NLP, a literature review of current Japanese NLP tools is required as an initial step. This research was conducted as a collaboration between experts in AI and in Japanese language and culture to map the associations between Japanese NLP tools and computational linguistic methods in the scope of Japanese language modelling, as a fundamental step toward Japanese text classification. It puts forward a compilation of usable preliminary methods and tools for processing Japanese text in a text classification use case. Special treatment, such as cleaning the dataset of less useful attributes and handling structural markup, is necessary to prepare the dataset so that an algorithm can learn the language patterns and solve the text classification problem. Therefore, preprocessing methods such as tokenization, stemming, stop-word removal, POS tagging, and lemmatization in Japanese will be matched with Japanese NLP preprocessing tools based on their capabilities.

II. RELATED WORKS

A. Dataset

Morikawa mentioned several sources of text datasets prepared for many NLP and machine learning purposes [12]:

• The website of the National Institute of Informatics provides datasets for informatics purposes. Datasets from Yahoo! Japan, Rakuten, and many other popular Japanese companies and organizations can be found and used.
• LinkData bridges people in supporting the open data community. Users can simply upload and download datasets to enrich the collection on the platform.
• The National Institute of Information and Communications Technology provides published papers and the Japanese–English Bilingual Corpus of Wikipedia's Kyoto Articles. It was originally prepared for translation purposes and contains 500,000 manually translated sentences.
B. Text Preprocessing in English

Preprocessing English text gives a different impression and level of difficulty, since the methods are supported by more powerful tools. NLTK [13] and Stanford NLP [14] are known as high-functionality NLP tools for preprocessing text in many languages, so all methods in a project can be covered if an English dataset is utilized.

For sentiment analysis on English tweets, Javed and Kamal used NLTK to perform stop-word removal, stemming, lemmatization, and POS tagging [15]. To detect irony in English tweets, Marrese-Taylor, Ilic, Balazs, Prendinger, and Matsuo also used NLTK to tokenize the texts without losing their ironic characteristics [16].

On the other hand, the high functionality of Stanford NLP has been proven in many cases. For example, in an information extraction study on the US Securities and Exchange Commission's legal contracts, Bommarito, Katz, and Detterman used Stanford NLP functions for tokenization, stemming, lemmatization, and POS tagging [17]. Compared to other tools, the Stanford NLP sentiment analyser has become a one-stop solution for sentiment analysis, as Jongeling, Datta, and Serebrenik successfully applied it to analyze positive, neutral, and negative texts [18].
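For comparison with the Japanese tools surveyed below, the following sketch shows how the four preprocessing steps mentioned above can be chained in NLTK for an English sentence. It is a minimal illustration, assuming the punkt, stopwords, averaged_perceptron_tagger, and wordnet resources have already been downloaded; it is not taken from the studies cited above.

```python
# Minimal English preprocessing sketch with NLTK (assumes the required
# NLTK corpora/models have been fetched via nltk.download()).
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

text = "The classifiers were trained on thousands of labeled news articles."

tokens = word_tokenize(text)                                   # tokenization
stop_set = set(stopwords.words("english"))
content = [t for t in tokens if t.lower() not in stop_set]     # stop-word removal
stems = [PorterStemmer().stem(t) for t in content]             # stemming
lemmas = [WordNetLemmatizer().lemmatize(t) for t in content]   # lemmatization
tags = pos_tag(content)                                        # POS tagging

print(stems, lemmas, tags, sep="\n")
```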
III. PREPROCESSING METHODS

The preprocessing phase is the starting phase of every AI research and development project, including NLP. It prepares the dataset for the NLP model development and test phases. Specifically, preprocessing tasks in an NLP project cover tokenization, stemming, stop-word removal, POS tagging, and lemmatization [19]. A literature review has been carried out to collect Japanese NLP tools and classify them by the methods they are capable of running. The following points describe the NLP preprocessing methods and the Japanese NLP tools that are able to perform each method.

A. Tokenization

Tokenization, or lexical/morphological analysis, is known as the first step of NLP. To process Japanese text data, the tokens, or single words taken out of a text, must be defined [20], [21]. Several Japanese tokenizers have been developed as open-source programs to solve specific tokenization problems; a short usage sketch follows the list below.

• MeCab, developed at the Nara Institute of Science and Technology, is the most popular Japanese tokenizer of all. Its functionality covers word segmentation and POS tagging. Unfortunately, users must handle pre/post-processing of Japanese text data on their own [22].
• Kuromoji, written in Java, has the same functionality as MeCab. As the tool has been donated to and integrated with Apache Lucene and Solr, it cannot be utilized outside the Apache ecosystem [23].
• Gosen is maintained under Apache Stanbol, which also hosts Kuromoji, and has exactly the same functionality as Kuromoji [24].
• Sudachi offers specific support for tokenizing business-oriented text. Overall, Sudachi ships about 2.6 million tokens, 1.4 million of which carry normalized form, POS, and kana information [23].
• TinySegmenter repackages an algorithm originally written in JavaScript for Python 2.5 and above, so it can be utilized as an NLTK extension [25].
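As an illustration of the kind of output these tokenizers produce, the sketch below segments one sentence with MeCab (through the mecab-python3 binding) and with a Python port of TinySegmenter, which exposes a TinySegmenter class with a tokenize() method. This is a minimal sketch under the assumption that both packages and a MeCab dictionary such as IPADIC are installed; it is not prescribed by the tools' authors.

```python
# Tokenizing a Japanese sentence two ways (assumes mecab-python3 with an
# IPADIC-style dictionary and the tinysegmenter package are installed).
import MeCab
import tinysegmenter

text = "日本語の文章を単語に分割します。"

# MeCab in "wakati" mode returns the surface forms separated by spaces.
wakati = MeCab.Tagger("-Owakati")
mecab_tokens = wakati.parse(text).split()

# TinySegmenter is a compact, dictionary-free segmenter.
segmenter = tinysegmenter.TinySegmenter()
tiny_tokens = segmenter.tokenize(text)

print(mecab_tokens)
print(tiny_tokens)
```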
B. Stemming

Every language has a conjugation system, but Japanese conjugation focuses on adding kana after the original verb stem to form tenses [26], [27]. Stemming is the task of returning the dictionary form from any conjugated form as long as it carries the same meaning [19]. Table 1 shows stemming examples for everyday Japanese words. JapaneseStemmer [28] is a tool for performing this task. It was inspired by the Porter stemming algorithm [29]; however, unlike Porter's algorithm, which removes a word's suffix, JapaneseStemmer conjugates a Japanese word back to its plain form.

TABLE 1. STEMMING EXAMPLES OF JAPANESE DAILY WORDS

Dictionary Form | Inflected Form                | Meaning
食べる Taberu    | 食べた Tabeta                 | Ate
(Eat)           | 食べられる Taberareru          | Be eaten
                | 食べさせられる Tabesaserareru   | Be allowed to eat
                | 食べられない Taberarenai       | Cannot eat
読む Yomu        | 読んだ Yonda                  | Read (past)
(Read)          | 読まれる Yomareru             | Be read
                | 読ませられる Yomaserareru      | Be allowed to read
                | 読ませない Yomasenai          | Cannot read
飲む Nomu        | 飲んだ Nonda                  | Drank
(Drink)         | 飲まれる Nomareru             | Be drunk
                | 飲ませられる Nomaserareru      | Be allowed to drink
                | 飲まれない Nomarenai          | Cannot drink
話す Hanasu      | 話した Hanashita              | Spoke
(Speak)         | 話される Hanasareru           | Be spoken
                | 話させられる Hanasaserareru    | Be allowed to speak
                | 話されない Hanasarenai        | Cannot speak
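JapaneseStemmer's own interface is not documented in this paper, but the effect described above, mapping a conjugated verb back to its plain form, can be approximated with the base-form field that MeCab's IPADIC output already carries. The sketch below is a hypothetical illustration of that idea, assuming mecab-python3 with an IPADIC-style dictionary is installed; it is not the JapaneseStemmer API.

```python
# Approximating Japanese stemming with MeCab's base-form column
# (IPADIC feature layout: the 7th comma-separated field is the base form).
import MeCab

tagger = MeCab.Tagger()

def plain_forms(text):
    """Return (surface, base form) pairs for every token in the text."""
    pairs = []
    for line in tagger.parse(text).splitlines():
        if line == "EOS" or not line.strip():
            continue
        surface, feature = line.split("\t")
        fields = feature.split(",")
        base = fields[6] if len(fields) > 6 and fields[6] != "*" else surface
        pairs.append((surface, base))
    return pairs

# Expected to map the conjugated pieces of 食べられない back toward 食べる.
print(plain_forms("食べられない"))
```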
C. Stop Word Removal

Removing words that do not contribute any meaning is a task known as stop word removal. Grammatical articles and pronouns are the targets of stop word removal, as these words do not bring significance to the text. For Japanese stop word removal, the following tools are available (a removal sketch follows the list):

• Stopwords-ja: distributed as a JSON file, it contains a collection of Japanese stop words [30].
• Many-stop-words: this Python package collects stop words for many different languages, including Japanese [31].
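The sketch below shows how either resource could be applied to an already tokenized sentence. It assumes the stopwords-ja JSON file has been downloaded to a local path and that the many-stop-words package exposes a get_stop_words('ja') helper; both the file name and the helper call are illustrative rather than guaranteed by the projects.

```python
# Filtering Japanese stop words from a token list (the tokens could come
# from any of the tokenizers in Section III.A).
import json

from many_stop_words import get_stop_words  # assumed helper of the package

# Option 1: load the stopwords-ja JSON collection from a local copy.
with open("stopwords-ja.json", encoding="utf-8") as f:  # hypothetical local path
    stop_ja = set(json.load(f))

# Option 2: add the Japanese list bundled with many-stop-words.
stop_ja |= set(get_stop_words("ja"))

tokens = ["これ", "は", "新しい", "携帯", "電話", "です"]
content_tokens = [t for t in tokens if t not in stop_ja]
print(content_tokens)  # grammatical particles such as は are expected to be dropped
```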
D. Part-of-Speech (POS) Tagging

Japanese has grammatical functions like any other language. With the POS tagging method, each word in a text is categorized into a grammatical function such as noun, pronoun, adjective, verb, adverb, preposition, determiner, or conjunction. POS tagging is important because NLP tasks such as sentiment analysis, question answering, and word sense disambiguation need this differentiation to tackle word ambiguity [19]. The NLP tools supporting Japanese POS tagging are listed below, followed by a short tagging sketch:

• Kuromoji: as defined by the Stanbol NLP processing module, the POS tagging method provided by Kuromoji uses LexicalCategories and POS types to map words [32].
• Gosen: exactly the same as Kuromoji, as a modified version of the Stanbol NLP processing module, delivered in Java [24].
• RakutenMA: written in JavaScript, this tool is trained on general and e-commerce corpora and covers both Chinese and Japanese [33].
• MeCab: the Japanese MeCab supports not only POS tagging but also inflection type and form tagging [22], [34].
• KyTea: the Kyoto Text Analysis Toolkit (KyTea) is capable of estimating POS tags in Japanese and Chinese. Its POS tagging performance is easily adaptable, as it has been tested through partial annotation and active learning [35].
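As a concrete example of POS output, the sketch below reads the part-of-speech field from MeCab's default output, again assuming the mecab-python3 binding with an IPADIC-style dictionary; the tag set shown (名詞, 助詞, 動詞, ...) belongs to the dictionary, not to MeCab itself.

```python
# Extracting (surface, POS) pairs from MeCab's default output
# (IPADIC layout: the first comma-separated feature field is the POS).
import MeCab

tagger = MeCab.Tagger()

def pos_tags(text):
    tags = []
    for line in tagger.parse(text).splitlines():
        if line == "EOS" or not line.strip():
            continue
        surface, feature = line.split("\t")
        tags.append((surface, feature.split(",")[0]))
    return tags

print(pos_tags("猫が魚を食べた"))
# e.g. [('猫', '名詞'), ('が', '助詞'), ('魚', '名詞'), ('を', '助詞'), ('食べ', '動詞'), ('た', '助動詞')]
```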
E. Lemmatization

Lemmatization is a morphological analysis task that returns the dictionary form (lemma) of inflected forms. In Japanese, lemmatization is useful to avoid ambiguity [36]. Table 2 shows lemmatization examples from the research of Ogiso, Komachi, Den, and Matsumoto [37]. The following tools support Japanese lemmatization (a short sketch follows Table 2):

• Kuromoji: Japanese lemmatization is supported by Kuromoji in the Java programming language [32].
• Gosen: exactly the same as Kuromoji, Japanese lemmatization is supported in the Java programming language [24].
• Sudachi: Japanese lemmatization is supported, with a focus on business use [23].

TABLE 2. LEMMATIZATION EXAMPLES [37]

Inflected Forms                              | Lemma
ヤハリ Yahari, ヤッパリ Yappari, ヤッパ Yappa  | 矢張り Yahari (too, either, also, still, even so, as expected)
アフ Afu, アワ Awa                            | 会う Au (to meet, to encounter, to see, to unite, to agree with, to fit)
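The Java tools above have no Python interface in this survey, but Sudachi's lemmatization behaviour can be reached from Python through SudachiPy, whose morphemes expose dictionary_form() and normalized_form() accessors. The sketch below is a minimal illustration assuming the sudachipy package and the sudachidict_core dictionary are installed.

```python
# Looking up lemmas (dictionary forms) with SudachiPy
# (assumes the sudachipy and sudachidict_core packages are installed).
from sudachipy import dictionary, tokenizer

tok = dictionary.Dictionary().create()
mode = tokenizer.Tokenizer.SplitMode.C  # longest-unit segmentation

for m in tok.tokenize("買っちゃった", mode):
    # dictionary_form() returns the lemma, normalized_form() the canonical spelling.
    print(m.surface(), m.dictionary_form(), m.normalized_form())
```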
IV. DISCUSSION

Table 3 shows the compilation of preprocessing methods and tools for a Japanese text classification project. Tokenization is the most widely supported method, as most of the tools are capable of performing it. On the other hand, Kuromoji and Gosen are the most versatile tools, since tokenization, POS tagging, and lemmatization are all doable with them.

Compared to English text preprocessing, Japanese text preprocessing is supported by limited resources and tools, yet every tool is able to run a specific method. Whereas preprocessing English text requires only one compact library such as Stanford NLP or NLTK, preprocessing Japanese text requires more, since a compact Japanese-supporting library does not yet exist, and the closest candidates do not cover all methods. While English preprocessing tools are well developed and regularly updated, Japanese preprocessing tools depend on individual customization.

The struggle in preprocessing Japanese text lies in searching for the right tools or developing custom tools to run specific methods. On the other hand, users have the opportunity to develop, or mix and match, Japanese NLP tools to complete a preprocessing phase with particular requirements. For example, to preprocess Japanese e-commerce text data, the combination of Sudachi, JapaneseStemmer, Stopwords-ja or Many-stop-words, and RakutenMA is recommended, as Sudachi and RakutenMA are specified to recognize Japanese business terms. A sketch of such a combined pipeline is given below.
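The sketch below illustrates how such a mix-and-match pipeline could look for e-commerce product reviews, using SudachiPy for tokenization and lemmatization and a stop-word set for filtering. It is a hypothetical arrangement of the recommendation above, not an implementation shipped with any of the tools; RakutenMA and JapaneseStemmer are left out because they live in other ecosystems (JavaScript and a standalone tool, respectively), and the stop-word file path is illustrative.

```python
# Hypothetical e-commerce preprocessing pipeline: tokenize with SudachiPy,
# drop stop words, and keep the normalized dictionary forms as features.
import json

from sudachipy import dictionary, tokenizer

tok = dictionary.Dictionary().create()
mode = tokenizer.Tokenizer.SplitMode.C

with open("stopwords-ja.json", encoding="utf-8") as f:  # hypothetical local path
    stop_ja = set(json.load(f))

def preprocess(review: str) -> list[str]:
    """Turn a raw review into a list of lemmatized, stop-word-free tokens."""
    features = []
    for m in tok.tokenize(review, mode):
        lemma = m.normalized_form()
        if m.surface() in stop_ja or lemma in stop_ja:
            continue
        features.append(lemma)
    return features

print(preprocess("この商品はとても使いやすかったです。"))
```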
V. CONCLUSION

As machine learning projects depend on the readiness of data, there are methods that turn a Japanese text dataset into a preprocessed dataset. In this paper, a compilation of methods and tools for preprocessing Japanese text is delivered. Unlike for English text preprocessing, the tools for Japanese text preprocessing are very limited. To take the first step in developing a highly functional Japanese NLP model, a combination of tools can be selected by choosing from the tools above based on their functionality in running the preprocessing methods. It is recommended to develop a customized framework by combining several tools rather than depending on one tool only to finish the Japanese text preprocessing task. This study will be the starting point for developing Japanese text classification using Japanese news articles, which is useful for classifying people's news interests. Previously, similar studies have been conducted by Binus University's AI R&D Center team to classify Bahasa Indonesia, English, and Arabic texts. Modelling Japanese is a brand-new path to enrich the capability of the NLP systems in the research center.
ACKNOWLEDGMENT

We would like to thank kejepang.co.id for acting as the domain expert in this research under the leadership of Febrian Lubis-sensei. Thank you for the contribution in translating, interpreting, describing, and teaching the fundamentals of Japanese so that this research could be completed well.
TABLE 3. PREPROCESSING METHODS AND THE JAPANESE NLP TOOLS THAT SUPPORT THEM

Method            | MeCab | Kuromoji | Gosen | Sudachi | TinySegmenter | JapaneseStemmer | Stopwords-ja | Many-stop-words | RakutenMA | KyTea
Tokenization      |   x   |    x     |   x   |    x    |       x       |                 |              |                 |           |
Stemming          |       |          |       |         |               |        x        |              |                 |           |
Stop word removal |       |          |       |         |               |                 |      x       |        x        |           |
POS Tagging       |   x   |    x     |   x   |         |               |                 |              |                 |     x     |   x
Lemmatization     |       |    x     |   x   |    x    |               |                 |              |                 |           |