The document summarizes a research paper that proposes a method for automatically generating stop words in Amharic text. The method uses an aggregated approach based on word frequency, inverse document frequency, and entropy measures of words in documents. The goal is to make information retrieval in Amharic faster and improve the language's usefulness for information processing by identifying and removing non-informative words. The proposed automatic approach aims to overcome limitations of existing static or dictionary-based stop word identification methods.
Automatic Generation of Stopwords
Addis Ababa University
School of Information Science
Department of Information Science
IR Assignment IV
A literature review on Automatic Generation of Stopwords in the Amharic Text
By: Biniam Worku (ID No. GSE/6722/13)
Submission Date: 04/10/2021
Abstract
Stop word elimination is an important preprocessing step in information retrieval and information processing, and its accuracy directly influences the final result of retrieval and mining. In information retrieval, eliminating stop words compresses the storage space of the index; in text mining, it greatly reduces the dimension of the vector space, saves storage, and speeds up calculation. Stop word lists have been identified for many of the world's languages, such as English, Chinese, Hindi, Arabic, and Sanskrit. Little literature was found showing that a stop word list has been produced for the Amharic language. In the document I reviewed, the researchers propose the automatic identification of stop words in Amharic text using an aggregate methodology based on word frequency, inverse document frequency, and entropy measures. Available stopword identification techniques rely on static or dictionary-based stopword lists. This approach is inefficient, expensive, and time-consuming, because the searching process takes a long time. The proposed work aims to overcome these problems by aggregating frequency measures and entropy measures of words in Amharic text for automatic stopword identification.

1. Introduction
Removal of stopwords is one of the text preprocessing steps in information retrieval, text classification, document clustering, and similar document analysis. Stopwords are words that appear frequently in documents and serve only a syntactic function; they carry no usable information to aid learning tasks, are unlikely to assist in text classification, retrieval, clustering, or analysis, and are therefore deleted during preprocessing.
These words are considered noise in information systems, so there are research efforts to develop stoplists robust enough to contain them and to help manage noise efficiently in text processing activities and information systems. Because the notion of which words constitute noise depends on the language or domain, stoplists that are either domain specific or language specific have emerged. The importance of these customized stoplists is well founded on the differences between languages or domains that have specialized linguistic and morphological rules. Consequently, a stoplist for one language may be inefficient for another, depending on how similar the languages are. More recently, researchers have compiled time-specific stoplists, because natural languages change over time. However, apart from static or dictionary-based approaches, no automatic generation of stopwords for Amharic text is available. In the paper I reviewed, the researchers propose to identify stopwords automatically from Amharic text using an aggregate technique. One part is based on word frequency and the other on entropy measures of words in the given Amharic documents. As data sources, the researchers used Amharic newspapers, Amharic magazines, and well-known Amharic blogs, which are considered to be written in correct language structure. This technique lets them identify the stop word list without affecting the information content of the original document when the non-informative words are removed. Identifying these stopwords enables users of the language to retrieve information quickly and makes the language more powerful for information processing.

2. Objective
a. General Objective
The general objective is to identify stopwords automatically from Amharic text using the aggregate technique. One part is based on word frequency and inverse document frequency; the other is based on entropy measures of words in the given documents.
b. Specific Objectives
The specific objectives of the research are to:
- make the retrieval of Amharic words fast, and
- make the Amharic language more powerful for information retrieval.

3. Scope and Limitation
The reviewed paper sets out to create an automated, timely way of listing Amharic stopwords. Among the available methods for listing stop words, the researchers chose the aggregate technique. A limitation of the paper is that the researchers did not include other local languages, such as Tigrigna, which belongs to the same linguistic family as Amharic.

4. Methodology
The researchers used the following methods:
Literature review
The researchers reviewed documents with the same research objectives and goals.
Data collection
The researchers used different corpora (Amharic newspapers, Amharic magazines, and well-known Amharic blogs) as input for identifying and listing common stopwords. They also applied Inverse Document Frequency (IDF) to identify which words appear frequently across all documents. The entropy of each word in the dataset was also considered, with the values sorted by entropy to expose the words most likely to be noise words. By aggregating term frequency, inverse document frequency, and entropy measures, the researchers can generate a list of the most important Amharic stopwords.

5. Related Work
In natural language processing and related fields, various studies have addressed the identification and removal of stopwords in different languages. Automated stopword identification is the most efficient and widely used method, requiring little or no manual intervention.
Jaideep Singh et al. used an automatic stopword identification algorithm for the Sanskrit language together with some manual intervention by a language expert, and therefore called the method hybrid. They calculated the frequency of words in the input text and also used some words from a dictionary to identify the stop list. Asubiaro, Toluwase V. used an entropy-based algorithm to identify stopwords in Nigerian Yoruba text; a word whose entropy was greater than 0.6 and that was not a noun was considered a stopword. Walaa Medhat et al. generated a stopword list for the Egyptian dialect from online social network data, using the frequency of words in the input text, to investigate the effect of removing stopwords on the sentiment analysis (SA) task. Mohammed-Ali Yaghoub-Zadeh-Fard et al. generated a stopword list for a Persian information retrieval system based on a similarity function and POS information, aggregating part-of-speech and statistical features of stopwords. Vijayarani S. et al. used Zipf's Law (Z method) to create stop words. Rakholia and Saini presented a rule-based approach to dynamically identify stop words for the Gujarati language. Vandana Jha et al. developed an algorithm based on deterministic finite automata to remove stopwords from Hindi text; the algorithm was tested on 200 documents and achieved 99% accuracy as well as time efficiency. Saini and Rakholia presented an in-depth analytic report on statistical measures, divided by continent and script, for the stopword lists of various international languages. A. Alajmi et al. generated stop words for the Arabic language using a statistical approach; 1002 documents with over 700,000 words were tested, and they achieved about 90% general accuracy.
El-Khair et al. investigated the effectiveness of three stop word lists for Arabic information retrieval: a General Stoplist, a Corpus-Based Stoplist, and a Combined Stoplist. Three popular weighting schemes were examined: inverse document frequency weighting, probabilistic weighting, and statistical language modelling. The idea is to combine statistical approaches with linguistic approaches to reach optimal performance, and to compare their effect on retrieval.

6. The architecture of the automatic generation of Amharic stop words
Researchers use different approaches to generate and remove stopwords from documents in different languages. Techniques for identifying general and domain-specific stopwords include the dictionary-based approach, a supervised approach using probability distributions, automated algorithms based on word frequency, deterministic finite automata, entropy measures of the information content of a word in a document, a revised statistical approach based on term frequency and the distribution of words across documents, and the study of parts of speech. Amharic is a national working language with its own grammar and syntax. However, as far as I know, there is no general list of stopwords for the Amharic language. Stopwords in Amharic should have the following properties.
•They are non-informative when used alone.
•They occur frequently in documents.
•They are important for the structure of the language but not for its semantics.
•Most of the time they are adjectives, pronouns, or articles.
•They are general words of the language and are not domain specific.
In this paper, the researchers tried to identify stopwords automatically from Amharic text using the aggregate technique.
The technique has two parts: one is based on word frequency and inverse document frequency, and the other on entropy measures of words in the given documents. The input data for this research come from magazines, newspapers, and blogs written with the proper structure of the language.

I. Term frequency
The number of times a term (t) occurs in a document (d) is called its term frequency. From the lists of words obtained from magazines, newspapers, and blogs, we can calculate the frequency of each word in the documents, which gives a measure of term density in a document. This measure is important for determining which document in a set of text documents is most relevant to the query terms. A simple first step is to eliminate the documents that do not contain all the query terms. To distinguish further among the remaining documents, we count the number of times each term occurs in a document and sum the counts; this sum is what we call “term frequency”. Terms with high frequency are considered less informative terms in the document, and most researchers have used this measure to identify stopword lists for different languages. The term frequency of a term can be defined as
$tf(t, d) = f_{t,d} / \sum_{t'} f_{t',d}$
where $f_{t,d}$ is the frequency of term $t$ in document $d$ and $\sum_{t'} f_{t',d}$ is the total number of terms in the document.

II. Inverse Document Frequency (idf)
Inverse document frequency measures the uniqueness of a term. It shows whether a term is common or rare across the documents. The computation of term frequency treats all terms as equally important. In Amharic text, however, a few terms such as “እና”, “ነዉ”, and “ግን” appear many times in a document but have little importance. Hence, we must lower the weight of frequently occurring terms and raise the weight of rare ones. The inverse document frequency of a given term is defined as
$idf(t) = \log(N / n_t)$
where $N$ is the total number of documents and $n_t$ is the number of documents containing the term.

III. Entropy measures
In information theory, the information-bearing capacity of a word is related to its randomness. Shannon [22] calls the randomness measure of a word its entropy. Here the randomness concerns how a word's occurrences are spread over the documents. Words concentrated in a few documents have low entropy and are considered very informative. Stopwords are less informative and spread across documents, so they are high-entropy words. Entropy thus measures how evenly a word is distributed over multiple documents: a word with very high frequency in some documents but low frequency in others has low entropy, while a word that occurs with similar frequency in all documents has high entropy. The entropy $H(w)$ of a given word $w$ with respect to a set of $n$ documents is
$H(w) = \sum_{i=1}^{n} P_i(w) \log(1 / P_i(w))$
where $P_i(w) = f_i(w) / \sum_{j=1}^{n} f_j(w)$, $f_i(w)$ is the frequency of word $w$ in document $i$, and $n$ is the number of documents. The entropy of each word in the dataset is computed, and the values are sorted by entropy to expose the words most likely to be noise words. Finally, by aggregating term frequency, idf, tf-idf, and entropy measures, we can generate a list of the most important Amharic stopwords. The following block diagram shows the general structure of the research work.
[Figure: block diagram of the general structure of the research work]

7. Conclusion
Stop word lists have been generated for many natural languages of the world. Amharic is a major language of Ethiopia, and because it is the national language of the country, generating a stop word list for it is an important task for text processing purposes. The reviewed paper proposes generating an Amharic stop word list from Amharic text. The methodology aggregates high term frequency, low term weight, and high entropy measures. This enables educators, researchers, language experts, and others to build on the idea and enhance the power of the language in various respects.

References
1. Asubiaro, T. V. (2013). Entropy-Based Generic Stopwords List for Yoruba Texts. Entropy, 2(05).
2. Puri, R., Bedi, R. P. S., & Goyal, V. (2013). Automated Stopwords Identification in Punjabi Documents. vol, 8.
3. Na, D., & Xu, C. (2015). Automatically generation and evaluation of Stop words list for Chinese Patents. TELKOMNIKA (Telecommunication Computing Electronics and Control), 13(4).
4. Alajmi, A., Saad, E. M., & Darwish, R. R. (2012). Toward an ARABIC stop-words list generation. International Journal of Computer Applications.
5. R. Tsz-Wai, B. He, and I. "Automatically Building a Stopword List for an Information Retrieval System." 5th Dutch-Belgium Information Retrieval Workshop (DIR'05), Utrecht, the Netherlands, 2005.
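
Supplementary sketch
As a supplement to Section 6, the three measures can be illustrated with a short Python sketch. It is not the reviewed paper's implementation. Several choices here are assumptions made only for illustration: term frequency is taken over the whole corpus, the three measures are aggregated by summing their rank positions (the paper does not state its aggregation rule in the text reviewed), and the function names `score_terms` and `rank_stopword_candidates` as well as the placeholder tokens `w1` to `w5` are hypothetical.

```python
import math
from collections import Counter


def score_terms(docs):
    """Return {term: (tf, idf, entropy)} for a list of tokenized documents."""
    n_docs = len(docs)
    per_doc = [Counter(doc) for doc in docs]
    total_tokens = sum(len(doc) for doc in docs)

    # Corpus-wide counts (assumption: tf is computed over the whole corpus).
    corpus = Counter()
    for counts in per_doc:
        corpus.update(counts)

    scores = {}
    for term, freq in corpus.items():
        tf = freq / total_tokens
        # idf = log(N / n_t), where n_t is the number of documents with the term.
        n_t = sum(1 for counts in per_doc if term in counts)
        idf = math.log(n_docs / n_t)
        # H(w) = sum_i P_i * log(1 / P_i), with P_i = f_i(w) / sum_j f_j(w).
        entropy = 0.0
        for counts in per_doc:
            if counts[term]:
                p = counts[term] / freq
                entropy += p * math.log(1 / p)
        scores[term] = (tf, idf, entropy)
    return scores


def rank_stopword_candidates(docs):
    """Order terms from most to least stopword-like.

    Assumed aggregation: sum of rank positions under high tf,
    low idf, and high entropy.
    """
    scores = score_terms(docs)
    terms = list(scores)

    def positions(index, descending):
        ordered = sorted(terms, key=lambda t: scores[t][index], reverse=descending)
        return {t: pos for pos, t in enumerate(ordered)}

    by_tf = positions(0, True)        # high frequency first
    by_idf = positions(1, False)      # low idf first
    by_entropy = positions(2, True)   # high entropy first
    return sorted(terms, key=lambda t: by_tf[t] + by_idf[t] + by_entropy[t])


if __name__ == "__main__":
    # Toy corpus: "እና" occurs evenly in every document; w1..w5 are placeholders.
    docs = [
        ["እና", "w1", "እና", "w2"],
        ["እና", "w3", "እና", "w1"],
        ["እና", "w4", "እና", "w5"],
    ]
    print(rank_stopword_candidates(docs)[:2])
```

In the toy corpus the conjunction “እና” appears twice in each of the three documents, so it has the highest term frequency, an idf of zero, and the maximum possible entropy of log 3, and it comes first in the ranking. A real run would need Amharic tokenization and a cutoff for how many top-ranked terms to keep, neither of which is covered here.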