Automatic Generation of Stopwords

The document summarizes a research paper that proposes a method for automatically generating stop words in Amharic text. The method uses an aggregated approach based on word frequency, inverse document frequency, and entropy measures of words in documents. The goal is to make information retrieval in Amharic faster and improve the language's usefulness for information processing by identifying and removing non-informative words. The proposed automatic approach aims to overcome limitations of existing static or dictionary-based stop word identification methods.

Addis Ababa University

School of Information Science


Department of Information Science
IR
Assignment IV
A literature review on Automatic Generation of Stopwords
in the Amharic Text

By:
Name ID No.
Biniam Worku GSE/6722/13

Submission Date: 04/10/2021


Abstract
As an important preprocessing step in information retrieval and information
processing, the accuracy of stop word elimination directly influences the final
result of retrieval and mining. In information retrieval, eliminating stop words
compresses the storage space of the index; in text mining, it greatly reduces the
dimension of the vector space, saves storage, and speeds up calculation. Stop word
lists have been identified for many of the world's languages, such as English,
Chinese, Hindi, Arabic, and Sanskrit, but little literature shows that a stop word
list has been produced for the Amharic language. In the document I have reviewed,
the researchers propose the automatic identification of stop words for Amharic
text using an aggregate methodology based on word frequency, inverse document
frequency, and entropy measures. Available work on stop word identification
relies on static or dictionary-based stop word lists. That approach is inefficient,
expensive, and time-consuming, because the searching process takes a long time.
The proposed work aims to overcome these problems by aggregating both frequency
measures and entropy measures of words in Amharic text for automatic stop word
identification.
1. Introduction
Removal of stopwords is one of the text preprocessing steps in information
retrieval, text classification, document clustering, and similar document analysis.
Stopwords are words that appear frequently in documents and serve only a syntactic
function; they carry no usable information to aid learning tasks, are unlikely to
assist in text classification, retrieval, clustering, or analysis, and hence are
deleted during pre-processing. These words are considered noise in information
systems, so there are research efforts to develop stoplists that are robust enough
to contain these words and that help manage noise efficiently in textual
processing activities and information systems.
Stoplists that are either domain specific or language specific have therefore
emerged, because which words constitute noise depends on the language or
domain. The importance of these "customized" stoplists is well founded on the
differences between languages or domains that have specialized linguistic and
morphological rules. Consequently, a stoplist for one language may be inefficient
for another, depending on the similarities or differences between the languages.
Lately, researchers have also compiled stoplists that are time specific, because
natural languages change over time. However, apart from static or dictionary-based
approaches, automatic generation of stopwords for Amharic text is not available.
In the paper I reviewed, the researchers proposed to identify stopwords
automatically from Amharic text using an aggregate technique: one part is based
on word frequency and the other on entropy measures of words in the given
Amharic documents.
The researchers used Amharic newspapers, Amharic magazines, and well-known
Amharic blogs as their data sources, on the grounds that these are written with
correct language structure. This technique enables them to identify the stopword
list without affecting the information content of the original document when the
non-informative words are removed. Identifying these stopwords enables users of
the language to retrieve information quickly and makes the language more powerful
for information processing.
2. Objective
a. General Objective
The general objective is to identify stopwords automatically from Amharic text
using an aggregate technique: one part is based on word frequency and inverse
document frequency, and the other on entropy measures of words in the given
documents.
b. Specific Objectives
The following are the specific objectives of the research:
- Make the retrieval of Amharic words fast, and
- Make the Amharic language more powerful for information
retrieval.
3. Scope and Limitation
The reviewed research paper aims to create an automated and timely way of
listing Amharic stopwords. Among the methods available for listing stopwords,
the researchers chose the aggregate technique. A limitation of the paper is that
the researchers did not include other local languages such as Tigrigna, which
belongs to the same linguistic family as Amharic.
4. Methodology
To carry out the research, the writers used the methods listed below:
Literature review
The researchers reviewed documents that share the same research objectives and
goals.
Data collection
The researchers used several corpora (Amharic newspapers, Amharic magazines,
and well-known Amharic blogs) as their input for identifying and listing common
stopwords.
The researchers also applied Inverse Document Frequency (IDF) to identify which
words appear frequently across all documents.
The entropy of each word in the dataset was also considered, and the values were
ordered by increasing entropy to expose the words that have a higher probability
of being noise words.
By aggregating term frequency, inverse document frequency, and entropy
measures, the researchers can generate the most important list of Amharic
stopwords.
5. Related Work
In natural language processing and related fields, various studies have been
carried out on identifying and removing stopwords in different languages.
Automated stopword identification is the most efficient and widely used method,
requiring little or no manual intervention. Jaideep Singh et al. used an automatic
stopword identification algorithm for the Sanskrit language, with some manual
intervention by a language expert, and called the method hybrid. They calculated
the frequency of words in the input text and also used some words from a
dictionary to identify the stoplist. Asubiaro, Toluwase V., used an entropy-based
algorithm to identify stopwords for Nigerian Yoruba-language text; a word whose
entropy was greater than 0.6 and that was not a noun was considered a stopword.
Walaa Medhat et al. generated a stopword list for the Egyptian dialect from online
social network data, using the frequency of words in the input text, to investigate
the effect of removing stopwords on the sentiment analysis (SA) task.
Mohammed-Ali Yaghoub-Zadeh-Fard et al. generated a stopword list for a
Persian-language information retrieval system based on a similarity function and
POS information, aggregating part-of-speech and statistical features of stopwords.
Vijayarani S. et al. used Zipf's Law (Z method) to create stopwords.
Rakholia and Saini presented a rule-based approach to dynamically identify
stopwords for the Gujarati language. Vandana Jha et al. developed an algorithm
based on deterministic finite automata to remove stopwords from Hindi text. The
algorithm was tested on 200 documents and achieved 99% accuracy as well as time
efficiency.
Saini and Rakholia presented an in-depth analytic report on statistical measures
for the stopword lists of various international languages, divided by continent
and script. A. Alajmi et al. generated stopwords for the Arabic language using a
statistical approach; 1002 documents with over 700,000 words were tested, and
they achieved about 90% general accuracy. El-Khair et al. studied the
effectiveness of three stopword lists for Arabic information retrieval: a General
Stoplist, a Corpus-Based Stoplist, and a Combined Stoplist. Three popular
weighting schemes were examined: the inverse document frequency weight,
probabilistic weighting, and statistical language modelling. The idea is to combine
statistical approaches with linguistic approaches to reach optimal performance
and to compare their effect on retrieval.
6. The architecture of the automatic generation of Amharic stop words
Researchers use different approaches to generate and remove stopwords from
documents in the world's languages. These include a dictionary-based approach; a
supervised approach using probability distributions; an automated algorithm
based on the frequency of words; deterministic finite automata; an entropy-measure
approach based on the information content of a word in the document; a revised
statistical approach based on term frequency and the distribution of words across
documents; and the study of part of speech. These techniques identify general and
domain-specific stopwords from documents. Amharic is a national/working
language with its own grammar and syntactic structure. However, as far as I
know, there is no general list of stopwords for the Amharic language. Stopwords
in Amharic should have the following properties.
• They are non-informative words when used alone.
• They occur frequently in documents.
• They are important for the structure of the language, not for semantic
purposes.
• Most of the time they are adjectives, pronouns, or articles.
• They are general words of the language and are not domain specific.
In this paper, the researchers tried to identify stopwords automatically from
Amharic text using an aggregate technique: one part is based on word frequency
and inverse document frequency, and the other on entropy measures of words in
the given documents. The data inputs for this research come from magazines,
newspapers, and blogs written with the proper structure of the language.
I. Term frequency
The number of times each term (t) occurs in each document (d) is called its term
frequency. From the lists of words obtained from magazines, newspapers, and
blogs as inputs, we can calculate the frequency of each word in the documents,
which gives a measure of term density in a document. This measure is very
important for determining which document in a set of text documents is most
relevant to the query terms. The simplest way to apply it is to eliminate the
documents that do not contain all the terms we need. To distinguish further
among the remaining documents, we count the number of times each word occurs
in a document and then sum these counts. This sum is what we call “term
frequency”. Terms with high frequency are therefore considered less informative
terms in the document, and most researchers have used this measurement to
identify stopword lists for different world languages.
The term frequency of a term can be defined as
$$\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}$$
where $f_{t,d}$ is the frequency of term $t$ in document $d$, and
$\sum_{t'} f_{t',d}$ is the total number of terms in the document.
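As a minimal sketch of the term frequency definition above, the following Python snippet computes the relative frequency of each term in one document. The function name term_frequencies, the whitespace tokenisation, and the toy sentence are illustrative assumptions and are not taken from the reviewed paper.

```python
from collections import Counter


def term_frequencies(tokens):
    """Relative term frequency f_{t,d} / sum_{t'} f_{t',d} for one document.

    `tokens` is the document already split into words (an assumption: the
    reviewed paper does not describe its tokeniser).
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: count / total for term, count in counts.items()}


# Hypothetical toy document, split on whitespace.
doc = "እና ቤት እና ግን ቤት እና".split()
tf = term_frequencies(doc)
print(tf["እና"])  # 3 occurrences out of 6 tokens
```

The frequencies of all terms in a document sum to 1, so values from documents of different lengths can be compared directly.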
II. Inverse Document Frequency (idf)
Inverse document frequency is a measure of the uniqueness of a term. It shows
whether a term is common or rare across the documents. In computing term
frequency, all terms are treated as equally important. In Amharic text, however,
a few terms such as “እና”, “ነዉ”, and “ግን” appear many times in a document but
carry little importance. Hence, we must lower the weight of frequently occurring
terms and raise the weight of rare ones. The inverse document frequency for any
given term is defined as
$$\mathrm{idf}(t) = \log\left(\frac{N}{n_t}\right)$$
where $N$ is the total number of documents and $n_t$ is the number of documents
containing the term $t$.
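The inverse document frequency formula can be sketched in Python as follows. This is only an illustration: the function name inverse_document_frequency, the use of the natural logarithm (the source does not state a base), and the three toy documents are assumptions.

```python
import math


def inverse_document_frequency(documents):
    """idf(t) = log(N / n_t) for every term in a list of tokenised documents.

    Assumption: natural logarithm; the source formula does not give a base.
    """
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        # Count each term once per document, however often it occurs there.
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


# Hypothetical toy corpus of three whitespace-tokenised documents.
docs = [
    "እና ቤት ነዉ".split(),
    "እና መጽሐፍ ነዉ".split(),
    "እና ከተማ ግን".split(),
]
idf = inverse_document_frequency(docs)
print(idf["እና"])  # occurs in all three documents, so log(3/3) = 0.0
```

A term that occurs in every document gets an idf of zero, which is why a low idf is a signal that a word may be a stopword.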
III. Entropy measures
In information theory, the information-bearing capacity of a word correlates with
its randomness. Shannon [22] calls this measure of randomness entropy. Words
with low randomness, that is, low-entropy words, are considered very informative.
Since stopwords are less informative, they are high-entropy words. Entropy
measures how evenly the frequency of a given word is spread over multiple
documents: a word with very high frequency in some documents but low frequency
in others has low entropy, whereas a word that occurs at similar rates in every
document has high entropy. The entropy H(w) of a given word w with respect to a
given set of n documents is as follows:
$$H(w) = \sum_{i=1}^{n} P_i(w) \cdot \log\left(\frac{1}{P_i(w)}\right)$$
where
$$P_i(w) = \frac{f_i(w)}{\sum_{j=1}^{n} f_j(w)}$$
$f_i(w)$ = frequency of word $w$ in document $i$, and $n$ = number of documents.
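To make the entropy formula concrete, here is a small Python sketch that evaluates H(w) from a word's per-document frequencies. The function name word_entropy, the natural logarithm, and the two frequency lists are illustrative assumptions. Documents in which the word does not occur are skipped, following the usual convention that a zero probability contributes nothing to the sum.

```python
import math


def word_entropy(frequencies):
    """H(w) = sum_i P_i(w) * log(1 / P_i(w)), with P_i(w) = f_i(w) / sum_j f_j(w).

    `frequencies` holds f_i(w), the count of one word in each of n documents.
    Assumption: natural logarithm; zero counts contribute 0 to the sum.
    """
    total = sum(frequencies)
    if total == 0:
        return 0.0
    entropy = 0.0
    for f in frequencies:
        if f > 0:
            p = f / total
            entropy += p * math.log(1 / p)
    return entropy


# Hypothetical counts of two words across four documents.
even = word_entropy([5, 5, 5, 5])     # spread evenly over all documents
peaked = word_entropy([20, 0, 0, 0])  # concentrated in a single document
print(even, peaked)
```

Under this formula the evenly spread word reaches the maximum value log(n), while the word confined to one document has entropy 0, which is the behaviour that lets high entropy flag stopword candidates.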
The entropy of each word in the dataset is considered, and the values are
ordered by increasing entropy to expose the words that have a higher probability
of being noise words. Finally, by aggregating the term frequency, idf, tf-idf, and
entropy measures, we can generate the most important list of Amharic stopwords.
The following block diagram shows the general structure of the research work.
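The review does not say how the individual measures are combined, so the following Python sketch shows only one possible aggregation, chosen here as an assumption: each word is ranked separately by high term frequency, low idf, and high entropy, and the three rank positions are summed (a Borda-style rule). The function name stopword_candidates, the top_k parameter, and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def stopword_candidates(documents, top_k=2):
    """Return the top_k most stopword-like words in tokenised documents.

    Assumed aggregation rule (not from the reviewed paper): sum of the rank
    positions under high corpus tf, low idf, and high entropy.
    """
    n_docs = len(documents)
    per_doc = [Counter(tokens) for tokens in documents]
    corpus = Counter()
    for counts in per_doc:
        corpus.update(counts)
    total = sum(corpus.values())

    tf, idf, entropy = {}, {}, {}
    for word, count in corpus.items():
        tf[word] = count / total
        df = sum(1 for counts in per_doc if word in counts)
        idf[word] = math.log(n_docs / df)
        probs = [counts[word] / count for counts in per_doc if counts[word] > 0]
        entropy[word] = sum(p * math.log(1 / p) for p in probs)

    def rank(scores, descending):
        # Ties are broken by the word itself so the result is deterministic.
        ordered = sorted(scores, key=lambda w: (scores[w], w), reverse=descending)
        return {w: position for position, w in enumerate(ordered)}

    by_tf = rank(tf, descending=True)
    by_idf = rank(idf, descending=False)
    by_entropy = rank(entropy, descending=True)
    combined = {w: by_tf[w] + by_idf[w] + by_entropy[w] for w in corpus}
    return sorted(combined, key=lambda w: (combined[w], w))[:top_k]


# Hypothetical toy corpus of three whitespace-tokenised documents.
docs = [
    "እና ቤት ነዉ እና".split(),
    "እና መጽሐፍ ነዉ".split(),
    "እና ከተማ ግን እና".split(),
]
print(stopword_candidates(docs, top_k=2))  # ['እና', 'ነዉ']
```

Summing rank positions avoids having to put the three measures on a common scale; a weighted sum of normalised scores would be an equally plausible choice.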
7. Conclusion
Stop word lists have been generated for many of the world's natural languages.
Amharic is the largest and most important language of Ethiopia, and because it is
the national language of the country, generating a stop word list for it is an
important task for text processing purposes. The reviewed paper proposed to
generate an Amharic stop word list from Amharic text. The methodology is an
aggregation of a high term frequency measure, a low term weight measure, and a
high entropy measure. This enables educators, researchers, language experts, and
others to build on the idea to enhance the power of the language in various
respects.
References
1. Asubiaro, T. V. (2013). Entropy-Based Generic Stopwords List for Yoruba
Texts. Entropy, 2(05).
2. Puri, R., Bedi, R. P. S., & Goyal, V. (2013). Automated Stopwords
Identification in Punjabi Documents. vol, 8
3. Na, D., & Xu, C. (2015). Automatically generation and evaluation of Stop
words list for Chinese Patents. TELKOMNIKA (Telecommunication
Computing Electronics and Control), 13(4).
4. Alajmi, A., Saad, E. M., & Darwish, R. R. (2012). Toward an ARABIC
stop-words list generation. International Journal of Computer Applications,
5. R. Tsz-Wai, B. He, and I. "Automatically Building a Stopword List for an
Information Retrieval System." 5th Dutch-Belgium Information Retrieval
Workshop (DIR) '05, Utrecht, the Netherlands, 2005.
