The 2006 International Arab Conference on Information Technology (ACIT'2006)

Arabic Root Based Stemmer

Mohammed Naji AL-Kabi* and Ronza S. Al-Mustafa**

*Computer Information Systems Dept, Yarmouk University, Irbid, Jordan, [email protected]
**Computer Information Systems Dept, Yarmouk University, Irbid, Jordan, [email protected]

ABSTRACT

This paper presents a new root-based stemming algorithm for the Arabic language. As in other natural languages, not all words used in Arabic have roots; some are borrowed from other languages, e.g. the word " " (television). In such cases the stemmer will fail to find the right root, because these foreign words have no root. The algorithm is based on affix removal combined with knowledge from structural linguistics. The implementation and evaluation of the algorithm show a noticeable improvement in accuracy relative to previous algorithms.

Keywords: Arabic, Stemming, Root, Negative Suffix, Negative Prefix, Light Stemming, NLP.

1. INTRODUCTION

The Arabic language is the fifth most widely spoken language in the world. It belongs to the Semitic family, so it differs from the Indo-European languages morphologically, semantically, and syntactically. The Arabic alphabet contains twenty-eight letters, always written from right to left in cursive form. Diacritical marks (harakat, tashkiil) appear either above or below the letters and in many cases play an essential role in distinguishing, semantically and phonetically, between two words that have the same characters but different diacritics. Diacritical marks are used in holy books, poems, and children's literature; newspapers, journals and other books for adults are usually printed without diacritics, which means that many strings are ambiguous. Most native Arabic words are derived from verbal roots. Arabized words, on the other hand, are mainly nouns borrowed from other languages with a slight phonetic adjustment to suit Arabic pronunciation, and have no roots [8].

All Arabic words belong to three main categories: noun, verb or particle. Around 64% of Arabic words are derived from triliteral verbs (three consonants), but there are also biliteral verbs (two consonants), quadriliteral verbs (four consonants), and pentaliteral verbs (five consonants). Naturally, these verbs represent the roots for which stemming algorithms typically search. This stemming process excludes words derived from nouns and particles [9].

A morpheme is the smallest meaningful linguistic unit that has a semantic interpretation in the grammar of a language. There is a difference between a stem and a root: a stem is a morpheme or a set of concatenated morphemes that can accept an affix, whereas a root is a single morpheme that provides the basic meaning of a word.

Stemming may be useful for information retrieval systems, text classification systems, text clustering systems, dictionary automation, text compression, etc.

A number of authors consider stemming a form of word standardization [12]. A number of writers hold that stemming improves retrieval performance because it reduces variants of the same root word to a common concept, and that it reduces the size of the indexing structure because the number of distinct index terms is reduced [3]. Other writers are not satisfied with the use of stemming in IR and text mining [3]; accordingly, many search engines do not adopt stemming [3]. Frakes discusses several common types of stemming strategies: affix removal, table lookup, successor variety, and n-grams [7]. The affix removal strategy tries to eliminate prefixes and suffixes. The most important part of this strategy is suffix removal, since most variants of terms are generated by suffixes.

In Arabic, as in other natural languages, the stemmer may face the problem of a negative prefix, where the prefix that is eliminated is part of the word and not really a prefix. If a stemmer strips the well-known prefix "ال" from the following examples, the output will definitely be wrong: " " (Allah), " " (Germany), " " (Brigades), " " (Albania), etc. The problem also applies to other prefixes such as "و" (and), which is a frequently used conjunction; e.g., stripping "و" from " " (honesty) leads to a wrong stem.

The negative prefix problem in Arabic stemmers is not restricted to the "ال" and "و" prefixes; it also includes other prefixes such as " ", " ", "لـ", "فالـ", etc. Light stemming of the term "والي" (Governor), for instance, will be wrong if the prefix "والـ" is stripped from the term. Similarly, the stems of the words "كالح" (glum), "الله" (Allah), and " " (successful) will be wrong if we strip from them the prefixes "كالـ", "للـ", and "فالـ" respectively. Arabic stemmers also face the problem of a negative suffix, where the suffix that has been eliminated is part of the word and not really a suffix. If a stemmer tries to strip off
the well-known suffix "ان" from the following examples, the output will definitely be wrong, e.g. "لعمان" (to Amman), " " (Japan), etc. Table 5 in the Appendix illustrates a number of examples.

Table lookup is the simplest of the four strategies; it simply looks up the root of the term in a lookup table. The performance of this strategy depends heavily on the number of words (terms) and roots in the table: the larger the lookup table, the better the performance. Large lookup tables, however, may need considerable storage space. Successor variety is less straightforward than the others; it relies on algorithms based on structural linguistics and attempts to determine morpheme boundaries. N-gram stemming searches for digrams, trigrams or longer sequences of successive letters in a term. This strategy is a term-clustering procedure rather than a stemming procedure.

The two problems above (negative prefix and negative suffix) lead Arabic stemmers to a wrong grammatical root, so the accuracy of IR and text mining systems that rely on these stemmers deteriorates.

The two main problems of stemming have been described by Chris D. Paice [12]. In the first place, pairs of etymologically related words sometimes differ sharply in meaning [12]; consider, for example, "سل" (ask), "سلب" (stole), and "سلام" (peace). In the second place, the transformations involved in adding and removing suffixes involve numerous irregularities and special cases [12]. Stemming errors are of two kinds: understemming errors, in which words that refer to the same concept are not reduced to the same stem, and overstemming errors, in which words are converted to the same stem even though they refer to distinct concepts. In designing a stemming algorithm there is a trade-off between these two kinds of error.

A light stemmer plays safe in order to avoid overstemming errors, but consequently leaves many understemming errors. A heavy stemmer boldly removes all sorts of endings, some of which are decidedly unsafe, and therefore commits many overstemming errors [12].

Shereen Khoja addressed the problems that an Arabic stemmer might face [9]:

"If the root contains a weak letter (e.g. "أ" alif, "و" waw or "ي" yaa), the form of this letter may change during derivation. To deal with this, the stemmer must check to see if the weak letter is in the correct form. If not, the stemmer produces the correct form of this weak letter, which then gives the correct form of the root."

If any of the three root letters of a triliteral-rooted verb is "أ" alif hamza (a), "و" waw (w) or "ي" yaa (y), the verb is defined as a weak verb, e.g. " " (gave), "وَجَدَ" (found), "وَضَعَ" (put), "وَقَفَ" (stood), "وَعَدَ" (promised), "بَاعَ" (bought), "جاء" (came), "قَرَأ" (read). Weak verbs also include triliteral-rooted verbs whose second letter is doubled with a shadda ( ّ ), e.g. "شَمّرَ" (prepared). The shadda (gemination mark, tashdeed) is written above the consonant that is doubled, and it looks like a small "w". A strong verb is a triliteral-rooted verb that does not contain any of the three weak letters above.

"Some words do not have roots. For example the Arabic equivalents of "نحن" we, "بعد" after, "تحت" under and so on. If the stemmer comes across any of these words, it does nothing."

"Sometimes a root letter is deleted during derivation. This is especially true of roots that have duplicate letters (e.g. the last two letters are the same), e.g., "دُجٍجَ" get dressed, "دَلَّلَ" dandle, "خَلَّلَ" souse, "عَلَّلَ" explained, "قَلَّلَ" reduced, "بَلَل" wet, etc. The stemmer can detect this, and return the letter that was removed."

"If a root contains a hamza, this hamza could change form during derivation, e.g., " " talk, " " stand up, etc. The stemmer detects this, and returns the original form of this hamza."

L. S. Larkey and M. E. Connell [11] conducted a good study based on a modified version of Shereen Khoja's stemmer. The modified version includes a few changes to enhance the accuracy of the stemmer. These changes are summarized as follows:

- If a root is not found, the normalized form is returned, rather than the original unmodified word.
- A list of place names is treated as "unbreakable" words exempt from stemming.
- In addition to the Arabic stop word list included in the Khoja stemmer, a script was used to remove stop phrases.
- A light stemmer is used to strip definite articles (الـ, والـ, بالـ, كالـ, فالـ, and و) from the beginnings of normalized words and to strip 10 suffixes from the ends of words (ات, ان, , ي, ة, ه, , , , and ون).

Table 5 in the appendix shows that light stemming leads to wrong results if it is carried out unconditionally, so we record our reservation about the last step. Larkey and Connell's stemmer seems to be better than its parent (the Khoja stemmer).

Morphology is the branch of linguistics concerned with the internal structure of word forms. Semitic languages have a complex morphology, so Arabic is a complex language for stemming. Arabic stemmers have to deal with affixes (prefixes, infixes, and suffixes), in addition to diacritic marks (harakat), in order to get the right root with its appropriate diacritic marks. Furthermore, an Arabic stemmer has to deal with
Arabized words (foreign words), which have no root and in this case have to be excluded from stemming.

This study uses morphological patterns to obtain triliteral and quadriliteral roots. The algorithm simply tries to extract the root when the infix of the pattern matches the infix of the word.

Shereen Khoja is a pioneer in this field, but unfortunately we failed to obtain her original work, "Stemming Arabic Text", written with her colleague Roger Garside. Leah S. Larkey, Margaret E. Connell and others headed a team at the University of Massachusetts, Amherst, that conducted a number of studies building on Khoja's work. Their work [10] [11] represents an improvement on Khoja's work, but it does not solve the problems of negative prefix and negative suffix discussed above. Al-Kharashi et al. [2] present pattern-based stemming for Arabic; Taghva et al. [13] used the same approach, which differs from Khoja's, with equivalent performance. Pattern-based stemming does not use a root dictionary; it is based on matching the word against a number of Arabic patterns to extract the root. Chen et al. [4] conducted a study to find Arabic roots using a Machine Translation (MT) based stemmer. Although this study relies on the Ajeeb machine translation system, stop word removal, clustering, light stemming, and morphological analysis, it does not present a solution to the problems of negative prefix and negative suffix. Kareem Darwish [5] shows how to extract a root from a word by first removing the prefix and suffix of the word to get a stem, and then matching the stem against a number of templates to get the root. That study does not mention how many templates were used in the comparisons, nor does it give an algorithm. Darwish et al. [6] used an approach similar to the previous one [5], but with more details about the prefixes and suffixes being removed. Table 6 shows the patterns used within our algorithm.

2. THE ALGORITHM

The first step of the Arabic Rooter under study is to normalize the text. Afterward, the stem is matched against the verbal and noun patterns in order to obtain the root. To conduct this study, a system (stemmer) was built in Visual Basic 6.0 to find Arabic roots. The stemmer leaves a word unchanged if it fails to find a root; this is the normal case when the stem is an Arabized word or the name of a place, such as a continent, region, country, state, district, city, village, river, mountain, desert, etc.

Note: The gemination mark (tashdeed, "shaddah") ( ّ ) is placed above a consonant letter as a sign that the consonant is doubled.

Let T(i) be any term
Let LenT(i) be the length of each term
Let n be the number of terms within a document
Let chr(i) be the character position within a term
Let LenP(j) be the length of the pattern
Let Infixes_String be a manually generated string consisting of the affix letters of a pattern; e.g., the stem "مسابح" (swimming pools) matches the pattern "مفاعل", so the Infixes_String in this case is the string "ما", where "م" lies in the first position and "ا" lies in the third position.
Let T_String be the string of the word's letters at the positions that correspond to the pattern's Infixes_String. To clarify the idea, suppose we want to find the root of the stem "مسابح" (swimming pools). The system has to check this word against all five-character patterns. One of these patterns is "تفعيل", for which the Infixes_String is "تي" and the T_String is "مب", so the mismatch is obvious. When the stem is matched against the pattern "مفاعل", both the Infixes_String and the T_String are "ما".

Table 1 shows how to get the Infixes_String for each of the patterns used.

Table 1: An example of patterns and their infixes, and the position of each infix

Pattern      Infixes_String    Infix position : Infix
فعال         ا                 3:ا
مفعول        مو                1:م   4:و
يستفعلن      يستن              1:ي   2:س   3:ت   7:ن

1. Stop word removal, depending on a list of 1281 stop words consisting of prepositions, pronouns, articles and conjunctions.
2. Normalization
2.1 Remove the tatweel (kasheeda) symbol ("_")
2.2 Remove punctuation using a list of punctuation characters
2.3 Remove diacritics using a list of diacritic characters
3. If LenT(i) ≥ 6 and T(i) begins with one of (كال، فال، بال) then
      Remove the initial definite article (كال، فال، بال)
   Else if LenT(i) ≥ 5 then
      Remove the initial definite article (ال، لل)
   End if
4. If LenT(i) > 4 and T(i) ends with "اء" then
      Replace the final "اء" with "ي"
   End if
5. Replace initial (إ) or (أ) with bare alif (ا)
6. Replace initial (آ) with bare alif (ا)
7. Replace final (ة) with (ه)
8. Replace final (ى) with (ي)
9. For i = 1 to n do
   9.1 If LenT(i) = 3 then
      9.1.1 If T(i) ends with the gemination mark (tashdeed) ( ّ ) then
               Root(T(i)) = chr(1) & chr(2) & chr(2)
            Else
               Root(T(i)) = T(i)
            End if
       End if
   9.2 If LenT(i) ≥ 4 then
      9.2.1 For j = 1 to number of patterns of length LenT(i) do
         9.2.1.1 If T_String matches Infixes_String then
            9.2.1.1.1 Remove the infix characters from T(i)
            9.2.1.1.2 Replace "ئ" or "ؤ" with "أ"
            9.2.1.1.3 Replace "ء" or "ى" with "ي"
            9.2.1.1.4 Return Root(T(i))
         End if
      Next j
      9.2.2 If no pattern matched, return the normalized term
   End if
Next i

Table 2. Trace of the manual extraction of the correct root.

Original word    Normalized T(i) (stem)    T_String    Root(T(i))    Status
عِلمْ    Right
ان    Wrong
ثَمَرْ    Right
عَلِمَ    Right
الإسترحام    استرحام    استا    رَحِم    Right
تَرَكَ    Right
رَشَدَ    Right
مدّ    مدّ    -    مَدَدَ    Right
ن    ان    مَزَ    Wrong
تسائلوا    تسائلوا    تاوا    سَأَلَ    Right
المدارس    مدارس    ما    دَرَسَ    Right
ي    كَرُمَ    Right
بالمكتبة    كَتَبَ    Right
الطائر    طاأر    ا    طَأرْ    Wrong
جِ    Wrong
مُحِ    Wrong

Table 3. Accuracy of root extraction for three Arabic text files

Number of words    Words not analyzed    Incorrect roots    Roots extracted correctly
147                3 (2%)                16 (10.8%)         130 (87.2%)
244                7 (2.8%)              24 (9.8%)          215 (87.4%)
579                19 (3.3%)             33 (5.7%)          527 (91%)
857                26 (3%)               39 (4.6%)          791 (92.4%)
1827               55 (3%)               112 (6.1%)         1663 (91%)
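
The normalization and pattern-matching steps of the algorithm can be sketched in Python roughly as follows. This is a minimal illustration, not the authors' Visual Basic implementation: the names `PATTERNS`, `normalize` and `extract_root` are hypothetical, the pattern table holds only a handful of hand-written entries in the spirit of Table 1, and stop word removal, punctuation and diacritic stripping, and article removal are left out.

```python
TATWEEL = "\u0640"   # kasheeda
SHADDA = "\u0651"    # gemination mark (tashdeed)

# Hypothetical pattern table in the spirit of Table 1:
# pattern -> {1-based position: infix letter}, filled in by hand.
PATTERNS = {
    "فعال": {3: "ا"},
    "مفعول": {1: "م", 4: "و"},
    "مفاعل": {1: "م", 3: "ا"},
    "تفعيل": {1: "ت", 4: "ي"},
}


def normalize(word):
    """Apply a subset of the normalization steps (2.1, 5, 6, 7, 8)."""
    word = word.replace(TATWEEL, "")
    if word and word[0] in "أإآ":
        word = "ا" + word[1:]
    if word.endswith("ة"):
        word = word[:-1] + "ه"
    if word.endswith("ى"):
        word = word[:-1] + "ي"
    return word


def extract_root(word):
    """Return the root of a normalized stem, or the stem itself if no pattern fits."""
    if len(word) == 3:
        # Step 9.1: a trailing shadda means the second consonant is doubled.
        return word[0] + word[1] * 2 if word.endswith(SHADDA) else word
    for pattern, infixes in PATTERNS.items():
        if len(pattern) != len(word):
            continue
        positions = sorted(infixes)
        infixes_string = "".join(infixes[p] for p in positions)
        t_string = "".join(word[p - 1] for p in positions)
        if t_string == infixes_string:
            # Step 9.2.1.1.1: drop the letters sitting at the infix positions.
            root = "".join(ch for i, ch in enumerate(word, 1) if i not in infixes)
            # Steps 9.2.1.1.2 and 9.2.1.1.3: unify hamza and alif maqsura forms.
            for old, new in (("ئ", "أ"), ("ؤ", "أ"), ("ء", "ي"), ("ى", "ي")):
                root = root.replace(old, new)
            return root
    return word  # no pattern matched: keep the normalized term


for stem in ("مسابح", "مدارس"):
    print(stem, "->", extract_root(normalize(stem)))
```

With this table, the stem "مسابح" fails against "مفعول" (T_String "مب" versus "مو") and succeeds against "مفاعل" (both strings are "ما"), which mirrors the worked example given with the definitions of Infixes_String and T_String.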

3. EVALUATION

In order to test the accuracy of our algorithm, we selected a number of words randomly. Table 2 shows a manual trace of the execution of the above algorithm as it extracts the roots of the selected terms.

Table 3 shows the strengths and weaknesses of the above algorithm on a small data set containing 1,827 words. The system failed to analyze 55 words, since their patterns are unknown. This failure is mostly due to foreign (Arabized) words. The system analyzed the remaining 1,772 words, and the accuracy of extracting the right roots was found to be 91%.

Figure 1: Statistics for root extraction

Table 4 shows the precision, recall and their harmonic mean (F-measure), computed with the following formulas:

$$\text{Precision} = \frac{\text{Correct}}{\text{Correct} + \text{Incorrect}} \qquad (1)$$

$$\text{Recall} = \frac{\text{Correct}}{\text{Correct} + \text{UnAnalyzed}} \qquad (2)$$

$$F = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (3)$$
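
As a small illustration, formulas (1) to (3) translate directly into Python. The function names and the counts in the usage lines are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from Table 3.

```python
def precision(correct, incorrect):
    """Formula (1): share of analyzed words whose extracted root is right."""
    return correct / (correct + incorrect)


def recall(correct, unanalyzed):
    """Formula (2): correct roots relative to correct plus unanalyzed words."""
    return correct / (correct + unanalyzed)


def f_measure(p, r):
    """Formula (3): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Hypothetical counts, for illustration only.
p = precision(correct=90, incorrect=10)
r = recall(correct=90, unanalyzed=30)
print(round(p, 4), round(r, 4), round(f_measure(p, r), 4))
```

Because the F-measure is a harmonic mean, it always lies between precision and recall and is pulled toward the smaller of the two.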

Table 4 shows that the system obtains about 92% overall precision for the analyzed words. Note that words that do not match any of the verbal and noun patterns illustrated in Table 6 have been excluded from the computation of the accuracy measures, because these words are foreign words.

Table 4. Accuracy of root extraction for three Arabic text files

Number of words    Precision (accuracy of analyzed words)    Recall    F-measure
147                0.9771                                    0.8889    0.9309
244                0.9682                                    0.8987    0.9322
579                0.9652                                    0.9411    0.9530
857                0.9682                                    0.9531    0.9606
1827               0.9697                                    0.9204    0.9442

4. CONCLUSIONS

In order to increase the accuracy of the system and to reduce the probability of encountering the negative suffix and negative prefix problems, the system does not remove the prefixes ("فـ", "ب", "لـ", "و", "فـ") or the suffix (" "). Furthermore, the system uses conditional removal: e.g., when the term length is six or more, the system removes the prefixes ("وال", "بال", "كال", "فال"); when the term length is less than six, the term is left unchanged.

As mentioned by Thabet [14], a root-based algorithm increases word ambiguity, since many word variants have different meanings, and this affects the accuracy of IR, text mining and other systems that rely on root-based stemmers. Table 5 presents a number of ambiguous cases. One of these is the term " ", which a reader can interpret as parents, religion, or debt, since the word is bare of diacritics and stands on its own rather than within a sentence. As noted earlier, diacritics are used to distinguish words semantically and phonetically.

Arabic stemmers can be used to enhance the efficiency of a number of systems, such as spell checkers, information retrieval systems, text mining systems, text analysis systems, compression systems, etc.

This algorithm is incapable of extracting the Arabic roots of some imperative verbs (" ") that consist of a single Arabic letter although their root has three letters (triliteral verbs), e.g., "عِ", whose root is "وعِي". In addition, the problem of defective roots (weak roots) is still not solved by this algorithm. Defective roots are roots that contain vowels ("ي", "و", "أ"); they are classified as irregular roots, since some vowels in these roots are altered to other vowels or removed in the derivational process [1]. For example, "رما" and "رمي" have the same meaning (throw), and both represent the same root. As future research, we hope to solve these problems in our next enhancement to this work.

REFERENCES

[1] Aljlayl, M., Frieder, O. "On Arabic Search: Improving the Retrieval Effectiveness via a Light Stemming Approach", CIKM 02, November 4-9, 2002, McLean, Virginia, USA, pages 340-347. ACM 1-58113-492-4/02/0011.
[2] Al-Kharashi, I.A., & Al-Sughaiyer, I.A. (2002e). "Pattern-based Arabic stemmer". In Proceedings of the 2nd Saudi Technical Conference and Exhibition (STCEX2002), Volume II (pp. 238-244), Riyadh, Saudi Arabia.
[3] Baeza-Yates, R., & Ribeiro-Neto, Modern Information Retrieval. Addison Wesley, 1999.
[4] Chen A. and Gey Fredic. 2002. "Building an Arabic stemmer for information retrieval". In Proceedings of the Eleventh Text REtrieval Conference (TREC 2002), National Institute of Standards and Technology, November.
[5] Darwish K. 2002. "Building a shallow Arabic Morphological Analyzer in one day". In Proceedings of the ACL-02 Workshop on Computational Approaches to Semitic Languages, Association for Computational Linguistics, July.
[6] Darwish, K. and D. Oard. "CLIR Experiments at Maryland for TREC 2002: Evidence Combination for Arabic-English Retrieval". In TREC. 2002. Gaithersburg, MD.
[7] Frakes W. B., Introduction to Information Storage and Retrieval Systems, chapter 1, pages 1-12. Prentice-Hall, 1992.
[8] Kanaan, G.; Al-Shalabi, R.; AL-Kabi, M.N.; Jaam, J.M.; Hasnah, A. 2004. "New Approach for Extracting Quadriliteral/Quadrilateral Arabic Roots". In Proceedings of the 1st International Conference on Information & Communication Technologies: from Theory to Applications, ICTTA'04 (Damascus, Syria, April 2004). IEEE-France.
[9] Khoja S., Research Interests, Pacific University, 2043 College Way, Forest Grove, Oregon 97116, http://zeus.cs.pacificu.edu/shereen/research.htm, July 8, 2006.

[10] Larkey L., Ballesteros L., and Connell M., "Improving Stemming for Arabic Information Retrieval: Light Stemming and Co-occurrence Analysis," SIGIR 2002: 275-282, 2002.
[11] Larkey L. S., and Connell M. E., "Arabic information retrieval at UMass in TREC-10". In TREC 2001.
[12] Paice C.D., "An evaluation method for stemming algorithms". In W.B. Croft and C.J. van Rijsbergen, editors, Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 69-90. Springer-Verlag, July 1994.
[13] Taghva, K., Elkoury, R., and Coombs, J. "Arabic Stemming without a root dictionary". 2005. www.isri.unlv.edu/publications/isripub/Taghva2005b.pdf
[14] Thabet, N. (2004). "Stemming the Qur'an". In Proceedings of Arabic Script-Based Languages Workshop, COLING-04, Switzerland, August 2004.

Appendix A:
Table 5: The problem of negative prefixes and
negative suffixes

Full Removing Full Removing Full word Removing Full word Removing
word the suffix word the suffix the suffix the suffix
‫اﻟﺒ ﺮ ﻛﺎ ت‬ ‫اﻟﺒ ﺮ ك‬ ‫ا ﻷ ﻣﺎ ن‬ ‫ا ﻷم‬ ‫ﺑﺎﻟ ﻌ ﻮ ن‬ ‫ﺑﺎﻟ ﻊ‬ ‫ا ﻷم‬
‫ا ﻹﻧ ﺴﺎ ن‬ ‫ا ﻹﻧ ﺲ‬ ‫اﻟﺒﺎﻟ ﻮ ن‬ ‫اﻟﺒﺎ ل‬ ‫اﻟﺘﺎ م‬
‫اﻟﺜ ﻮ را ت‬ ‫اﻟﺜ ﻮ ر‬ ‫ا ﻷ وا ن‬ ‫ا ﻷ و‬ ‫ﺑ ﻄ ﻮ ن‬ ‫ﺑ ﻂ‬ ‫ﺗ ﺤ ﺲ‬
‫اﻟ ﺠ ﻤﺎ ﻋﺎ ت‬ ‫اﻟ ﺠ ﻤﺎ ع‬ ‫ا ﻷ و ﻃﺎ ن‬ ‫ا ﻷ و ط‬ ‫ﺑﻠ ﻮ ن‬ ‫ﺑ ﻞ‬ ‫ﺣ ﻦ‬
‫اﻟ ﺤ ﻤ ﻼ ت‬ ‫اﻟ ﺤ ﻤ ﻞ‬ ‫ﺑ ﺮ ﻛﺎ ن‬ ‫ﺑ ﺮ ك‬ ‫اﻟﺘ ﻌﺎ و ن‬ ‫اﻟﺘ ﻌﺎ‬ ‫اﻟ ﺪ‬
‫اﻟ ﺪ و را ت‬ ‫اﻟ ﺪ و ر‬ ‫ا ﻟ ﺠِ ﻨ ﺎ ن‬ ‫ا ﻟ ﺠِ ﻦ‬ ‫اﻟ ﺤ ﺴ ﻮ ن‬ ‫اﻟ ﺤ ﺲ‬ ‫اﻟ ﺬ‬
‫د و ر ي‬ ‫اﻟ ﺤﻨﺎ ن‬ ‫اﻟ ﺤ ﻦ‬ ‫ﺣﻨ ﻮ ن‬ ‫ﺣ ﻦ‬ ‫ﺳ ﺞ‬
‫اﻟ ﺬا ت‬ ‫اﻟ ﺬ‬ ‫ﺧﻠ ﺠﺎ ن‬ ‫ﺧﻠ ﺞ‬ ‫اﻟ ﺴﺘ ﻮ ن‬ ‫اﻟ ﺴ ﺖ‬ ‫ﺳ ﻚ‬
‫اﻟ ﺴﻠ ﻄﺎ ت‬ ‫اﻟ ﺴﻠ ﻂ‬ ‫اﻟ ﺮ ي‬ ‫ﺳﻜ ﻮ ن‬ ‫ﺳ ﻚ‬ ‫ﺳﻨ ﺖ‬
‫اﻟ ﺴﻨ ﻮا ت‬ ‫اﻟ ﺴﻨ ﻮ‬ ‫ﺻﺎﺑ ﻮ ن‬ ‫ﺻﺎ ب‬ ‫ﺳ ﻦ‬
‫اﻟ ﻀ ﻤﺎ ن‬ ‫اﻟ ﻀ ﻢ‬ ‫اﻟ ﻌ ﻲ‬ ‫ع‬
‫اﻟ ﺸ ﺮ ﻛﺎ ت‬ ‫اﻟ ﺸ ﺮ ك‬ ‫ﻋ ﺠ ﻤﺎ ن‬ ‫ﻋ ﺠﻢ‬ ‫ﻗ ﺮ و ن‬ ‫ﻗ ﺮ‬ ‫ﻗ ﻮا ن‬
‫ﻃﺒﻘﺎ ت‬ ‫ﻃﺒ ﻖ‬ ‫ﻋﻨ ﻮا ن‬ ‫ﻋﻨ ﻮ‬ ‫ﻛﺎﻧ ﻮ ن‬ ‫ﻛﺎ ن‬ ‫ﻛﺪ‬
‫اﻟﻘ ﻮا ت‬ ‫اﻟﻘ ﻮ‬ ‫ﻟﺒﻨﺎ ن‬ ‫ﻟﺒ ﻦ‬ ‫ﻣ ﺮه‬ ‫ل‬
‫ﻟ ﺠﺄ ت‬ ‫ﻟ ﺞ‬ ‫ﻟ ﻌ ﻤﺎ ن‬ ‫ﻟﻌﻢ‬ ‫اﻟ ﻤﻠ ﻲ‬ ‫ﻣ ﺖ‬
‫ﻟ ﺬ وا ت‬ ‫ﻟﺬ و‬ ‫ﻟﻠﺒﻨﺎ ن‬ ‫ﻟﻠﺒ ﻦ‬ ‫ﻣ ﺪﻟ ﻞ‬
‫ﻣ ﺮ ﺟﺎ ن‬ ‫ﻣ ﺮ ج‬ ‫ﻣ ﺴ ﻚ‬
‫ﻟﻨ ﺰ ﻻ ت‬ ‫ﻟﻨ ﺰ ل‬ ‫اﻟ ﻤ ﻌﻠ ﻖ‬
‫ﻣ ﺪا ﺧ ﻼ ت‬ ‫ﻣ ﺪا ﺧ ﻞ‬ ‫ﻣ ﻀﻤ ﻮ ن‬ ‫ﻣ ﻀﻢ‬ ‫ﻣ ﻊ‬
‫اﻟﻨﻘﺎ ﺷﺎ ت‬ ‫اﻟﻨﻘﺎ ش‬ ‫ﻣ ﺴﻜ ﻮ ن‬ ‫ﻣ ﺴ ﻚ‬
‫و ذ را ت‬ ‫وذ ر‬ 企伎企 ‫ﻣﻔﺘ ﻮ ن‬ ‫ﻣﻔ ﺖ‬

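
The negative suffix problem listed in Table 5 can be reproduced with a few lines of Python. This is only an illustrative sketch: `strip_suffix` and its `protected` parameter are hypothetical names, and the protected set stands in for the kind of "unbreakable" word list described in the Introduction.

```python
def strip_suffix(word, suffix="ان", protected=frozenset()):
    """Drop `suffix` from the end of `word` unless the word is protected."""
    if word in protected or not word.endswith(suffix):
        return word
    return word[: -len(suffix)]


# Words from Table 5 whose final "ان" belongs to the word itself.
table5_words = ["لبنان", "بركان", "عنوان"]

for word in table5_words:
    naive = strip_suffix(word)
    guarded = strip_suffix(word, protected={"لبنان"})
    print(word, "->", naive, "| with protected list:", guarded)
```

Unconditional stripping turns "لبنان" into "لبن", as in Table 5, whereas the guarded call leaves the protected word intact; unprotected words such as "بركان" are still truncated, which is why the choice of removal condition matters.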

Table 6: Verbal and noun patterns used within the algorithm

Full word Pattern's used

Length 3 patterns
Length 4 patterns ‫ل ﻓﺘ ﻌ ل‬
Length 5 patterns

‫ﻤﻔﺘ ﻌ ل ﻤﻔ ﻌﻴ ل ﺒﻔ ﻌﺎ ل ﻟﻔ‬

Length 6 patterns

‫ﺒﺘﻔ ﻌﻴ ل ﻟﺘﻔ ﻌﻴ ل ﻓ ﻌ‬

Length 7 patterns
‫ﻤ ﺴﺘ‬

‫ﻟﻔ ﻌ‬

Length 8 patterns

‫ﻟﻔﺎ ﻋﻠ ﻬ ﻤﺎ ﺒﻔ ﻌﻠﺘ ﻜ ﻤﺎ ﻟﻔ ﻌﻠﺘ ﻜ ﻤﺎ ﺒﻔ ﻌﻠﺘ‬

‫ﻓﺎ ﻋﻠﺘ ﻬ ﻤﺎ ﻓﺎ ﻋﻠﺘ ﻜ ﻤﺎ‬
Length 9 patterns ‫ﺴﻴﻔ ﻌ ﻼﻨ ﻬﺎ ﺴﺘﻔ ﻌ ﻼﻨ ﻬﺎ‬
