
Afan Oromo Text Keyword Extraction Using Machine Learning

This document discusses keyword extraction from Afaan Oromo text using machine learning. It describes keyword extraction and its applications, as well as different machine learning algorithms that can be used for this task. The document aims to analyze how well term frequency-inverse document frequency (TF-IDF) can identify keywords from Afaan Oromo news texts.

Uploaded by

Bekuma Gudina

Afan Oromo Text Keyword Extraction Using Machine Learning

1. INTRODUCTION
1.1 Background of the Study

• Natural Language Processing (NLP) refers to the use of
computational methods to process free text in spoken or written
form, the mode of communication most commonly used by humans
(Assal et al., 2011). The NLP pipeline involves many processes.
At the syntactic level, statements are segmented into tokens
(words and punctuation), and each token is assigned a label such
as noun, verb, adjective, or adverb (part-of-speech tagging).
At the semantic level, each word is analyzed to obtain a
meaningful representation of the sentence. Hence, the basic task
of NLP is to process unstructured text and produce a
representation of its meaning. Many Automatic Keyword Extraction
(AKE) methods have been proposed in the past. Traditional
unsupervised graph-based approaches consider the central nodes of
a graph-of-words to be the most representative ones (Mihalcea and
Tarau, 2004).
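As a rough illustration of the graph-based idea, the sketch below builds a co-occurrence graph-of-words and ranks its nodes with a PageRank-style iteration, in the spirit of TextRank (Mihalcea and Tarau, 2004). The function names, the window size, the damping factor, and the tokenizer that keeps an apostrophe inside a word are all assumptions made for this example and are not taken from the study.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Lowercase alphabetic tokens; an apostrophe inside a word is kept
    # (an assumption intended to suit Afaan Oromo spelling).
    return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())


def build_graph(tokens, window=2):
    # Undirected graph: two words are linked if they co-occur within
    # `window` consecutive tokens.
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def rank_nodes(graph, damping=0.85, iterations=50):
    # PageRank-style update on the undirected graph; every node in the
    # graph has at least one neighbour, so the division is safe.
    scores = {node: 1.0 for node in graph}
    for _ in range(iterations):
        updated = {}
        for node in graph:
            incoming = sum(scores[nb] / len(graph[nb]) for nb in graph[node])
            updated[node] = (1 - damping) + damping * incoming
        scores = updated
    return scores


if __name__ == "__main__":
    # Toy input for illustration only, not from the author's corpus.
    tokens = tokenize("barnoota qonna barnoota fayyaa barnoota daldala")
    ranking = rank_nodes(build_graph(tokens))
    print(sorted(ranking, key=ranking.get, reverse=True))
```

Words that co-occur with many different neighbours end up with the highest scores, which is what "central node" means in this setting.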
• Moreover, there are strong baselines that use common
statistics (e.g., term frequency-inverse document frequency,
known as TF-IDF) and/or heuristics (e.g., position in the
document) to detect the most significant terms. Furthermore,
state-of-the-art approaches for the task span both classical
supervised machine learning methods (Medelyan et al., 2009) and
deep learning techniques (Meng et al., 2017), which perform
better than unsupervised ones.
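The TF-IDF baseline mentioned above can be sketched in a few lines. This is a minimal version that assumes the plain weighting tf × log(N / df) with no smoothing; the function names and the tokenizer are hypothetical, and a production system might prefer a different normalization.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase alphabetic tokens; apostrophes inside words are kept (assumption).
    return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())


def tf_idf(documents):
    # Returns one {term: score} dictionary per input document.
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def top_keywords(doc_scores, k=3):
    # Highest score first; ties are broken alphabetically.
    return sorted(doc_scores, key=lambda t: (-doc_scores[t], t))[:k]
```

A term that occurs in every document gets an inverse document frequency of log(1) = 0, so it can never rank as a keyword, while a term that is frequent in one document and rare elsewhere scores highest.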
Keyword extraction is the retrieval of keywords or key
phrases from text documents. They are selected from among
the phrases in the text document and characterize the
document's topic. This thesis summarizes the most commonly
used methods for automatic keyword extraction. Automatic
methods use heuristics to select the most frequent and
significant words or phrases from the text document.
Keyword extraction methods belong to the field of natural
language processing, which is an important field in
machine learning and artificial intelligence. Keyword
extractors are used to extract words (keywords) or groups
of two or more words that form a phrase (key phrases).
Here, the term keyword extraction covers both keyword and
key-phrase extraction.
When to use keyword extraction:
 Save time: Based on the keywords, a reader can decide whether the
topic of a text (e.g., an article) is of interest and whether to read it.
Keywords provide the user with a summary of the document.
 Find relevant documents: A huge number of articles are written
today, and it is not possible to read all of them. Keyword extraction
algorithms can help find relevant articles. They also automate the
building of book, publication, or web indexes.
• Keyword extraction as support for machine
learning: Keyword extraction algorithms find the
most relevant words that describe the text. These can
later be used for visualizations or to classify text
automatically.
• Keyword extraction is one of the most important
tasks in Natural Language Processing. It is concerned
with methods for extracting the significant words and
phrases from a collection of texts.
Both keywords and key phrases describe the essence of what a document
concerns. The difference between the two is that keywords are single words,
while key phrases can be either individual words or phrases (i.e., n-grams
with n ≥ 1). Many key phrase extraction methods form and rank the candidate
phrases using the scores that a keyword extractor previously assigned to the
candidate unigrams (Wan and Xiao, 2008a; Hasan and Ng, 2014; Florescu and Caragea).
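To make the step from unigram scores to key phrases concrete, here is a small sketch that generates n-gram candidates and scores each one from its words. Summing the unigram scores is an assumption chosen for illustration; the cited methods differ in how they combine scores, and the function names are hypothetical.

```python
def candidate_ngrams(tokens, max_n=3):
    # All contiguous n-grams with 1 <= n <= max_n, shortest first.
    return [
        tuple(tokens[i : i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def score_phrases(tokens, unigram_scores, max_n=3):
    # Phrase score = sum of the scores of its words (assumed combination rule).
    # Words without a score contribute 0.
    return {
        " ".join(gram): sum(unigram_scores.get(word, 0.0) for word in gram)
        for gram in candidate_ngrams(tokens, max_n)
    }
```

A plain sum favours longer phrases, so a length normalization or a cap on n is usually worth considering when this rule is used.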
Nowadays, with the spread of computer technologies, a great deal of effort
is going into processing natural languages by computer, a field known as
natural language processing.
Afaan Oromo, also called Afaan Oromoo, is a member of the Cushitic
branch of the Afro-Asiatic language family.
• All of this is done to summarize content and to support
its well-organized storage, search, and retrieval. There are
numerous keyword extraction algorithms available, each of
which applies its own set of fundamental and theoretical
methods to this type of problem. There are various types of
NLP algorithms, some of which extract only words and others
which extract both words and phrases. There are also NLP
algorithms that extract keywords based on the entire content
of the texts.
The keyword extraction service is used to extract key words and phrases
from text, such as an email or chat. The algorithm parses the text into
sentences and removes the most frequent but least useful words for
determining meaning (stop-words). It then applies various statistical and
frequency methods to determine the most significant key words and phrases.
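The steps just described (sentence splitting, stop-word removal, frequency scoring) can be sketched as follows. The stop-word set is a tiny hypothetical placeholder rather than a vetted Afaan Oromo stop-word list, and raw frequency stands in for the "various statistical and frequency methods", which the source does not specify.

```python
import re
from collections import Counter

# Hypothetical placeholder; a real system needs a curated stop-word list.
STOP_WORDS = {"fi", "kan", "akka", "keessatti"}


def split_sentences(text):
    # Naive split on sentence-ending punctuation.
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def extract_keywords(text, top_k=5):
    counts = Counter()
    for sentence in split_sentences(text):
        words = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", sentence.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    # Most frequent first; ties are broken alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]
```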
Automatic keyword and key-phrase extraction aims to discover a limited
but concise set of words or phrases that reflect the main topics discussed
within a text document, avoiding the expensive and time-consuming
process of manual annotation by experts (Vega-Oliveros et al., 2019).
Afaan Oromo is the third most widely spoken language in
Africa, after Hausa and Arabic. Its original homeland is an area
that includes much of what today is Ethiopia, Somalia, Sudan,
and northern Kenya, along with parts of other East African
countries. Currently, it is an official language of Oromia Regional
State (the largest of the current federal states in Ethiopia).
It is used by the Oromo people, the largest ethnic group in
Ethiopia, reported as 50% of the total population in 2007
(2015 Census statistic of Ethiopia) (Tesema, n.d.).
The computational processing of such languages is known as
Natural Language Processing (NLP) or Computational Linguistics
(Mandefro, 2010).
With the advent of the big data era, information has been
increasing exponentially.
• Traditionally, people acquired information from books, newspapers,
and magazines. Now they are used to acquiring information via the
Internet. Text is one of the main formats of information.
For textual information, a keyword set consists of several words
that together express the meaning of the text. Keywords help users
quickly understand the topics of a text. In addition, keyword
extraction is the basis of applications such as summarization,
information retrieval, text classification, and clustering.
• In the early stage, keywords were manually extracted from the text
(Wang Zhuohao et al., 2021).
• Manually extracting and tagging keywords is time-consuming and
labor-intensive. The extraction results are subjective, which makes it difficult
for them to reflect the meaning of texts objectively. With the rapid increase of
information, manual keyword extraction is becoming impractical. Therefore, there
is an urgent need to extract keywords from texts automatically.
• The higher-level tasks in NLP are Machine Translation (MT), Information
Extraction (IE), Information Retrieval (IR), Automatic Text Summarization (ATS),
Question Answering, Parsing, Sentiment Analysis, Natural Language
Understanding (NLU), and Natural Language Generation (NLG). Information
Extraction (IE) refers to the use of computational methods to identify relevant
pieces of information in documents generated for human use and to convert this
information into a representation suitable for computer-based storage, processing,
and retrieval (Wimalasuriya and Dua, 2010).
1.2 Statement of the Problem

Afaan Oromo is the language spoken by a large ethnic group in
Ethiopia, and nowadays it is becoming a popular language even
outside Ethiopia. Therefore, it is a vital NLP research area,
although no standard corpus has been developed for it yet.
The identification and classification of keywords in plain text
is of key importance in numerous natural language processing
applications. In Information Extraction (IE) systems, keywords
generally carry important information about the text itself, and
thus are targets for extraction. In Machine Translation (MT),
keywords and other sorts of words have to be handled differently
because of the specific translation rules that apply to them
(Farkas et al.).
1.3 Research Questions/Hypotheses

• Afaan Oromo is one of the local languages spoken in Ethiopia, especially by
the Oromo ethnic group. With respect to its use in current technology, the
language is still under development, which makes it an interesting domain for
language-dependent research. On this basis, the proposed research will answer
the following questions:
 What previous research has been conducted related to the proposed work?
 How can machine learning enable keyword extraction from Afaan Oromo
news, words, phrases, and documents?
 What are YAKE (Yet Another Keyword Extractor) and term frequency?
 How will machine learning perform in extracting and classifying keywords
from Afaan Oromo news texts compared with previously used models?
1.4 Objectives

The objectives of the current study are presented separately as a general objective and
specific objectives.
1.4.1. General Objective
The main objective of this research is to develop Afaan Oromo text keyword extraction using
a machine learning approach.

1.4.2. Specific Objectives

The following specific objectives are identified in order to achieve the general objective:
 Analyze how effectively TF-IDF can be applied to the identification and
classification of keywords in Afaan Oromo news texts.
 Review techniques and methodologies used for Afaan Oromo text keyword extraction.
 Design an architecture for Afaan Oromo text keyword extraction.
 Develop a prototype for keyword extraction.
 Design and train a TF-IDF Afaan Oromo text keyword extraction system with Afaan Oromo news texts.
 Test and evaluate the performance of the system.
 Draw conclusions from the experimental results and recommend directions for further research.
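The objectives call for evaluating the system but do not name a metric. One common choice, assumed here purely for illustration, is to compare the extracted keywords against a manually annotated gold set using precision, recall, and F1. The function name and the case-insensitive exact matching are hypothetical choices.

```python
def evaluate(predicted, gold):
    # Exact, case-insensitive matching between extracted and gold keywords.
    predicted_set = {k.lower() for k in predicted}
    gold_set = {k.lower() for k in gold}
    matches = len(predicted_set & gold_set)

    precision = matches / len(predicted_set) if predicted_set else 0.0
    recall = matches / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Exact matching is strict for a morphologically rich language, so matching on stems rather than surface forms may be a fairer variant if a stemmer is available.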
1.4.3. Scope of the Study
• This research focuses on single-document keyword extraction for Afaan
Oromo texts. It does not employ abstractive keyword extraction, since
that requires deep linguistic analysis and is difficult to implement with
the current state of the art of the field. The work is limited to textual
documents from an Afaan Oromo text corpus. Other data types, such as
image, audio, and video, are outside the scope of this study.
• The current study does not describe the attributes of keywords
extracted from news texts. It also does not describe the relationships
among keywords extracted from a document.
