Finding the Structure of Words
• Human language is a complicated thing.
• We use it to express our thoughts, and through language, we
receive information and infer (understand) its meaning.
• Linguistic expressions may appear unorganized, though they
actually have an underlying organization or structure.
• Trying to understand a language all at once is not a viable
approach.
• Linguists have developed whole disciplines that look at language
from different perspectives and at different levels of detail.
• For example:
• The point of morphology, for instance, is to study the variable
forms and functions of words.
• Syntax is concerned with the arrangement of words into
phrases, clauses, and sentences.
• Phonology describes the rules and constraints that govern how
words are structured and formed with respect to their
pronunciation.
• The conventions for writing constitute the orthography of a
language.
• Etymology and lexicology cover especially the evolution of
words and explain the semantic, morphological, and other links
among them.
• Words are perhaps the most intuitive units of language, yet they
are in general tricky to define.
• Knowing how to work with them allows, in particular, the
development of syntactic and semantic abstractions.
• Hence, finding the structure of words involves two steps:
• First, we need to explore how to identify words of distinct types in
human languages.
• Second, we need to explore how the internal structure of words can be
modelled in connection with their grammatical properties and the
lexical concepts they represent.
• The discovery of word structure is called morphological parsing.
Words and Their Components
• Words are defined in most languages as the smallest linguistic units
that can form a complete utterance by themselves.
• The minimal parts of words that deliver meaning to them are called
morphemes.
• Words in English are delimited in writing mostly by whitespace and
punctuation (full stop, comma, and brackets).
• For Example - Will you read the newspaper? Will you read it? I won’t
read it.
• If we confront this assumption with insights from etymology and
syntax, we notice two words that deserve attention here: newspaper
and won’t.
• The word newspaper has an interesting derivational structure.
• In writing, newspaper and the associated concept are distinguished
from the isolated news and paper.
• Generally, linguists prefer to analyze won’t as two syntactic words,
or tokens, each of which has its independent role and can be
reverted to its normalized form.
• The structure of won’t could be parsed as will followed by not.
• In English, this kind of tokenization and normalization may apply
to just a limited set of cases.
• But in other languages, these phenomena have to be treated in a
less trivial manner.
• In Arabic or Hebrew, certain tokens are concatenated in writing
with the preceding or the following ones and may change
their forms.
• The lexical or syntactic units blur into one compact string of letters
and no longer appear as distinct words.
• Tokens behaving in this way can be found in various languages and
are often called clitics.
• In the writing systems of Chinese, Japanese, and Thai,
whitespace is not used to separate words.
• Tokenization, also known as word segmentation, is the fundamental
step of morphological analysis and a prerequisite for most language
processing applications.
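The tokenization and normalization of won’t described above can be sketched in a few lines of Python. This is only a minimal illustration: the `CONTRACTIONS` table and the `tokenize` function are hypothetical names introduced here, and the table lists just the two contractions that appear in these notes.

```python
import re

# Hypothetical normalization table: contraction -> normalized tokens.
# Only the contractions used as examples in these notes are listed.
CONTRACTIONS = {
    "won't": ["will", "not"],
    "didn't": ["did", "not"],
}

def tokenize(text):
    """Split text into word and punctuation tokens, then expand contractions."""
    # Map the typographic apostrophe to the plain one so table lookups match.
    text = text.replace("\u2019", "'")
    # A word may contain one internal apostrophe; any other
    # non-space, non-letter character becomes a token of its own.
    raw = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?|[^\sA-Za-z]", text)
    tokens = []
    for tok in raw:
        tokens.extend(CONTRACTIONS.get(tok.lower(), [tok]))
    return tokens

print(tokenize("Will you read it? I won't read it."))
```

Languages without whitespace between words, or with clitics that change form, would need a real word segmenter rather than a regular expression like this one.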
1. Lexemes
• By the term word, we often denote not just the one linguistic
form in the given context but also the concept behind the form
and the set of alternative forms that can express it.
• Such sets are called lexemes or lexical items and they constitute
the lexicon of a language.
• Lexemes can be divided by their behavior into the lexical categories
of verbs, nouns, adjectives, conjunctions, articles, or other parts of
speech.
• The citation form of a lexeme, by which it is commonly identified, is
also called its lemma.
• When we transform a lexeme into another one that is
morphologically related, we say we derive the lexeme.
• For instance, the nouns receiver and reception are derived from the
verb to receive.
• For Example - Did you see him? I didn’t see him. I didn’t see anyone.
• The example presents the problem of tokenization of didn’t and the
investigation of the internal structure of anyone.
• In the paraphrase I saw no one (for I didn’t see anyone), the lexeme
to see is inflected into the form saw to reflect its
grammatical function of expressing positive past tense.
• Likewise, him is the oblique case form of he.
• In the paraphrase, no one can be perceived as the minimal word
synonymous with nobody.
2. Morphemes
• Morphology is the study of the structure of words, i.e., the way words
are built up from smaller, minimal units of meaning, which are
termed morphemes.
• For Example
• played = play + ed
• cats = cat + s
• The word played has two morphemes: play ‘stem’ and -ed ‘past tense
marker’.
• The word cats has two morphemes: cat ‘stem’ and -s ‘plural marker’.
• There are two broad types of morphemes :
i. Stem (Root) ii. Affixes
• Root is the main meaning bearing morpheme of the word.
• Example : play, cat, friend etc.
• Affixes add ‘additional’ meanings of different kinds.
• Example : -ed, -s, un-, -ly etc.
• Two main types of affixes:
i. Prefixes precede the stem: un-, in- etc.
ii. Suffixes follow the stem: -ed, -s, -ly, etc.
• Affixes are called bound morphemes as they cannot occur on their own
and must combine with a root/stem.
• There are two basic processes of word formation :
i. Inflection ii. Derivation
• Inflection is a process where affixes are added to a root/stem to perform
some grammatical functions but the category of the word remains the
same.
• Example
Lemma    Singular    Plural
cat      cat         cats
knife    knife       knives
• The process through which new words are formed by adding an affix
to an existing word is called derivation. Example – inter + national =
international.
• Unlike inflection, derivation often leads to a change in the category.
• The simplest morphological process concatenates morphs one by
one.
• For example, in the word dis-agree-ment-s, agree is a free lexical
morpheme and the other elements are bound grammatical morphemes
contributing some partial meaning to the whole word.
• The alternative forms of a morpheme are termed allomorphs.
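The concatenative process illustrated by dis-agree-ment-s can be sketched as a search for a prefix, a stem, and a chain of suffixes. This is a minimal sketch under strong assumptions: the `PREFIXES`, `STEMS`, and `SUFFIXES` inventories are hypothetical toy lists, and spelling changes such as knife/knives are not handled.

```python
# Hypothetical toy inventories; a real analyzer would use a full lexicon.
PREFIXES = ["dis", "un", "in"]
STEMS = ["agree", "play", "cat", "friend"]
SUFFIXES = ["ment", "ed", "ly", "s"]

def split_suffixes(tail):
    """Split the tail into a sequence of known suffixes, or return None."""
    if tail == "":
        return []
    for suffix in SUFFIXES:
        if tail.startswith(suffix):
            remaining = split_suffixes(tail[len(suffix):])
            if remaining is not None:
                return [suffix] + remaining
    return None

def segment(word):
    """Return the morphs [prefix?, stem, suffix...] or None if no split exists."""
    # Try the word without a prefix first, then with each matching prefix.
    prefix_options = [""] + [p for p in PREFIXES if word.startswith(p)]
    for prefix in prefix_options:
        rest = word[len(prefix):]
        for stem in STEMS:
            if rest.startswith(stem):
                suffixes = split_suffixes(rest[len(stem):])
                if suffixes is not None:
                    return ([prefix] if prefix else []) + [stem] + suffixes
    return None

print(segment("disagreements"))
```

Because the morphs are simply glued together one after another, this approach works only for purely concatenative word forms.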
3. Typology
• Morphological typology is a linguistic classification system that
categorizes languages based on the ways they use morphemes, which
are the smallest units of meaning in language.
• It can consider various criteria, and during the history of linguistics,
different classifications have been proposed.
• Based on quantitative relations between words, their morphemes, and
their features, the primary morphological typological categories are :
A. Isolating or Analytical languages
• In isolating languages, words are typically composed of one or more free
morphemes, and there is little or no use of bound morphemes like
prefixes or suffixes to indicate grammatical relationships.
• Each word often carries a single, specific meaning.
• Example - typical isolating languages are Chinese, Vietnamese, and
Thai.
B. Synthetic languages
• Synthetic languages can combine more morphemes in one word
and are further divided into agglutinative and fusional languages.
i. Agglutinative languages
• Agglutinative languages use a high number of bound morphemes,
each of which typically carries a single grammatical meaning.
• Example – Japanese is an agglutinative language in which
suffixes are added to roots to convey aspects of tense, mood, and
case.
ii. Fusional Languages
• Fusional languages use bound morphemes, but these morphemes often
carry multiple grammatical meanings or features simultaneously.
• The morphemes are fused together, making it more challenging to
separate individual meanings.
• Example - Arabic, Latin, Sanskrit, and German are fusional languages.
• In addition to the types mentioned above, languages can also be
classified as concatenative or nonlinear.
• Concatenative languages link morphs and morphemes one after
another, whereas nonlinear languages allow structural components to
merge non-sequentially to apply tonal morphemes or change the
consonantal or vocalic templates of words.
Issues and Challenges
• Words and their components, which encompass morphemes, phonemes,
and other linguistic elements, raise issues that are important
considerations in linguistics and natural language processing.
• Here are some key issues and challenges:
1. Irregularity: word forms are not described by a prototypical
linguistic model.
2. Ambiguity: word forms can be understood in multiple ways out of the
context of their discourse.
3. Productivity: is the inventory of words in a language finite, or is it
unlimited?
1. Irregularity
• Irregularity is the phenomenon where certain words or word forms do not
follow regular patterns or rules in terms of their morphology or syntax.
• By irregularity, we mean existence of such forms and structures
that are not described appropriately by a prototypical linguistic
model.
• Some irregularities can be understood by redesigning the model and
improving its rules, but other lexically dependent irregularities
often cannot be generalized.
• It is a challenge for algorithms that rely on regular patterns.
• Which word forms are irregular?
• Irregular word forms include:
i. Irregular verbs or nouns – forms that do not follow the standard
pattern of inflection.
For example – some common irregular forms are see/saw and catch/caught
(verbs) and goose/geese and child/children (nouns).
ii. Exceptional inflection – found mainly in comparative and
superlative adjectives. For example:
Base    Comparative    Superlative
Big     Bigger         Biggest
Dark    Darker         Darkest
Good    Better         Best
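The adjective table can be read as a regular rule plus a list of lexical exceptions. The sketch below assumes exactly that split: `IRREGULAR` is a hypothetical exception table, and the consonant-doubling rule is a simplification that covers only short adjectives such as those in the table.

```python
# Lexically dependent exceptions cannot be derived by rule, so they are listed.
# This is a hypothetical toy table, not a complete inventory.
IRREGULAR = {"good": ("better", "best"), "bad": ("worse", "worst")}

VOWELS = "aeiou"

def compare(adjective):
    """Return (comparative, superlative) for a short English adjective."""
    if adjective in IRREGULAR:
        return IRREGULAR[adjective]
    stem = adjective
    # Simplified spelling rule: double a final consonant that follows a
    # single vowel (big -> bigg-), but leave dark or slow unchanged.
    if (len(stem) >= 3 and stem[-1] not in VOWELS + "wxy"
            and stem[-2] in VOWELS and stem[-3] not in VOWELS):
        stem += stem[-1]
    return stem + "er", stem + "est"

for word in ["big", "dark", "good"]:
    print(word, compare(word))
```

The rule handles big and dark, while good is found only because it is listed, which is the reason irregular forms are hard for purely pattern-based algorithms.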
2. Ambiguity
• Word forms that can be understood in multiple ways out of context.
• Word forms that look the same but have distinct functions or meanings
are called homonyms.
• Ambiguity arises in morphological processing and language processing.
• Four kinds of ambiguity are:
i. Word sense ambiguity – A particular word has different
meanings depending on the context in which it is used.
• For Example
Bank has ambiguity relating to money or river bank.
Bat has ambiguity relating to a flying mammal or a cricket bat.
ii. Part-of-speech ambiguity – The part of speech of a particular
word changes in different contexts.
For Example - I run (verb)
He went for a run (noun)
iii. Structural ambiguity – A sentence has multiple valid syntactic
structures, and each structure gives a different interpretation.
For example - “I saw the man with the telescope.”
Reading 1 - I used the telescope to see the man.
Reading 2 - I saw a man who had a telescope.
iv. Referential ambiguity – A person or thing is referred to by a pronoun
such as he/she/it/they/this, and the referent has to be resolved from context.
For Example - John went to the store, and he bought some groceries.
I found a book at the library, and it was really interesting.
3. Productivity
• Productivity refers to the ability to generate new words or word forms
using productive rules.
• For example – according to Wikipedia, googol means 1 followed by 100
zeros.
• From googol, the previously unknown word google was coined, and by
applying productive rules, new words such as googling, googlish, and
googleology were generated.
Morphological Models
• There are many possible approaches to designing and
implementing morphological models.
• Over time, computational linguistics has witnessed the
development of a number of formalisms and frameworks.
• The most prominent types of computational approaches to
morphology are:
1. Dictionary Lookup
2. Finite-State Morphology
3. Unification-Based Morphology
4. Functional Morphology
1. Dictionary Lookup
• Dictionary lookup, also known as lexicon-based morphological
analysis, relies on pre-built dictionaries or lexicons to
analyze and process words.
• A dictionary is understood as a data structure that directly
enables obtaining some precomputed results, in our case word
analyses.
• The data structure can be optimized for efficient lookup, and
the results can be shared.
• Lookup operations are relatively simple and are usually quick.
• Dictionaries can be implemented, for instance, as lists, binary
search trees, hash tables, and so on.
• Dictionary lookup as a morphological model works in the
following steps:
i. Dictionary Creation: In this approach, a comprehensive
dictionary or lexicon is compiled for the target language.
ii. Word Analysis: When a word is encountered in text, the
Dictionary Lookup model first attempts to find an exact match
for the word in the dictionary.
iii. Lemma Retrieval: The model then retrieves the lemma (base
form) of the word from the dictionary.
iv. Handling Ambiguity: Dictionary Lookup models may need to handle
cases of word ambiguity, where a single word form can have multiple
possible lemmas or meanings.
• Advantages of Dictionary Lookup are :
a. Accuracy
b. Transparency
• Limitations of Dictionary Lookup are :
a. Limited Coverage
b. Ambiguity Handling
c. Resource-Intensive
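The lookup steps above can be sketched with a hash table that maps each word form to its precomputed analyses. The `LEXICON` entries and the function names `lookup` and `lemmas` are hypothetical and serve only as an illustration.

```python
# A tiny hand-built lexicon (hypothetical entries):
# surface form -> list of (lemma, part of speech, feature) analyses.
LEXICON = {
    "cats": [("cat", "noun", "plural")],
    "knives": [("knife", "noun", "plural")],
    "saw": [("see", "verb", "past"), ("saw", "noun", "singular")],
}

def lookup(word):
    """Return all precomputed analyses for a word form; empty list if unknown."""
    return LEXICON.get(word.lower(), [])

def lemmas(word):
    """Return the distinct lemmas of a word form, in lexicon order."""
    seen = []
    for lemma, _pos, _feature in lookup(word):
        if lemma not in seen:
            seen.append(lemma)
    return seen

print(lookup("Cats"))
print(lemmas("saw"))       # an ambiguous form yields more than one lemma
print(lookup("googling"))  # an unlisted form shows the coverage limitation
```

Returning a list rather than a single analysis keeps ambiguous forms visible, and the empty result for an unlisted word shows why limited coverage is the main weakness of this model.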
2. Finite-State Morphology
• By finite-state morphological models, we mean those in which the
specifications written by human programmers are directly compiled
into finite-state transducers.
• The two most popular tools supporting this approach are XFST (Xerox
Finite-State Tool) and LexTools.
• Finite-state transducers are computational devices extending the
power of finite-state automata.
• An FST consists of a finite set of nodes connected by directed edges
labeled with pairs of input and output symbols.
• In such a network or graph, nodes are also called states, while edges
are called arcs.
• Traversing the network from the set of initial states to the set
of final states along the arcs is equivalent to reading the
sequences of encountered input symbols and writing the
sequences of corresponding output symbols.
• The set of possible sequences accepted by the transducer
defines the input language;
• The set of possible sequences emitted by the transducer defines
the output language.
• The following example shows input words with their corresponding
morphological parses, alongside an FST state diagram.
Input      Morphological parsed output
Cats       cat +N +PL
Cat        cat +N +SG
Cities     city +N +PL
Geese      goose +N +PL
Goose      (goose +N +SG) or (goose +V)
Gooses     goose +V +3SG
Merging    merge +V +PRES-PART
Caught     (catch +V +PAST-PART) or (catch +V +PAST)
• Figure: A schematic finite-state transducer for English number
inflection, Tnoun. The symbols above each arc represent elements of the
morphological parse on the lexical tape.
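A transducer of this kind can be sketched as a table of arcs that is traversed while reading the input and collecting the output. The code below is a toy model and not the XFST formalism: `build_noun_fst` and `transduce` are hypothetical helpers, only regular nouns are covered, and irregular forms such as geese would need arcs of their own.

```python
from collections import defaultdict

def build_noun_fst(stems):
    """Build a toy transducer for regular nouns.

    Returns (arcs, finals), where arcs maps a state to a list of
    (input_symbol, output_string, next_state) triples.
    """
    arcs = defaultdict(list)
    for stem in stems:
        state = "q0"
        for i, ch in enumerate(stem, start=1):
            nxt = f"{stem}:{i}"
            arcs[state].append((ch, ch, nxt))        # copy each stem letter
            state = nxt
        arcs[state].append(("", " +N +SG", "end"))   # empty input: singular
        arcs[state].append(("s", " +N +PL", "end"))  # surface s: plural
    return arcs, {"end"}

def transduce(arcs, finals, word, state="q0", output=""):
    """Return the output of every path that consumes the whole word."""
    results = []
    if word == "" and state in finals:
        results.append(output)
    for inp, out, nxt in arcs.get(state, []):
        if inp == "":
            results += transduce(arcs, finals, word, nxt, output + out)
        elif word.startswith(inp):
            results += transduce(arcs, finals, word[len(inp):], nxt, output + out)
    return results

ARCS, FINALS = build_noun_fst(["cat", "dog"])
for surface in ["cats", "cat", "geese"]:
    print(surface, transduce(ARCS, FINALS, surface))
```

The function returns a list because one input may have several parses or none, so an empty list means the transducer does not accept the word.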
•In finite-state computational morphology, it is common to refer to
the input word forms as surface strings and to the output
descriptions as lexical strings.
•In English, a finite-state transducer could analyze the surface string
children into the lexical string child [+plural].
•Similarly, the surface string women would be analyzed into
woman [+plural].
•Relations on languages can also be viewed as functions.
•Let us have a relation R, and let us denote by [Σ] the set of all
sequences over some set of symbols Σ.
•The domain and the range of R are subsets of [Σ].
• We can then consider R as a function mapping an input string
into a set of output strings, formally denoted by this type
signature, where [Σ] equals String:
R :: [Σ] → {[Σ]}
R :: String → {String}
•A theoretical limitation of finite-state models of morphology is
the problem of capturing reduplication of words or their
elements (e.g., to express plurality) found in several human
languages.
3. Unification-based morphology
• Unification-based morphology is a computational approach to
morphological analysis and generation that uses unification to
combine information about the morphemes and features of a
word.
• Unification is a logical operation that merges two or more sets
of constraints into a single consistent set.
• This approach focuses on the relationships between morphemes
(the smallest units of meaning in a language) and how they
combine to create word forms.
• The key components and concepts of Unification-Based
Morphology are :
1. Feature Structures:
• Unification-Based Morphology represents linguistic information
using feature structures, which are data structures that consist
of attribute-value pairs.
• These feature structures encode information about morphemes,
their grammatical properties, and their relationships within
words.
2. Morphemes and Morphological Rules : Morphemes are the
building blocks of words, representing units of meaning.
• Unification-Based Morphology defines morphological rules that
specify how morphemes can combine and interact.
• These rules are expressed using feature structures.
• For Example :
• In English, the word "cats" consists of two morphemes: "cat"
and "-s" (indicating plural).
• These morphemes can be represented as feature structures:
• Morpheme "cat": {base: "cat", pos: "noun", number: "singular"}
• Morpheme "-s": {base: "", pos: "plural marker", number: "plural"}
3. Lexical Entries:
Each word in a language is associated with a lexical entry, which
includes information about the word's morphemes, their features, and
how they are combined.
Lexical entries are represented as feature structures.
For Example :
Lexical Entry for "cats":
Word: "cats"
Morphemes: {"cat", "-s"}
Feature Structures:
{ { base: "cat", pos: "noun", number: "singular"},
{ base: "", pos: "plural marker", number: "plural"}}
4. Morphological Analysis:
• Unification-Based Morphology performs morphological analysis by
unifying feature structures representing morphemes to generate the
complete word form.
• This process involves unification, which combines feature structures
while preserving shared features and resolving conflicts.
• For Example :
To analyze the word "cats," the feature structures for its
morphemes are unified to generate the complete word form:
• Unified Feature Structure for "cats":
{base: "cat", pos: "noun", number: "plural"}
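Unification over flat feature structures can be sketched with Python dictionaries. One assumption departs from the entries shown above: under strict unification, conflicting atomic values (number "singular" against "plural", or pos "noun" against "plural marker") make unification fail, so this sketch leaves the stem's number unspecified and lets the suffix constrain only pos and number. The `unify` function is a hypothetical helper, not part of any particular toolkit.

```python
def unify(fs1, fs2):
    """Unify two flat feature structures (dicts); return None on a value clash."""
    result = dict(fs1)
    for attr, value in fs2.items():
        if attr in result and result[attr] != value:
            return None          # conflicting atomic values: unification fails
        result[attr] = value
    return result

# Assumed underspecified entries: the stem says nothing about number,
# and the suffix requires a noun and contributes plural number.
cat = {"base": "cat", "pos": "noun"}
plural_s = {"pos": "noun", "number": "plural"}

print(unify(cat, plural_s))
print(unify({"number": "singular"}, {"number": "plural"}))  # clash
```

The first call yields the unified structure for "cats" given above, and the second shows how a clash is reported instead of being silently overwritten.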
5. Morphological Generation:
• Morphological generation involves starting with feature structures
representing the desired grammatical and semantic properties of a word
and generating the word form by applying morphological rules in
reverse.
• For Example:
• Conversely, in morphological generation, we can start with the desired
features and generate a word form:
• Desired Feature Structure: {base: "dog", pos: "noun", number: "plural"}
• Morphological Rules:
Apply "-s" to indicate plural.
• Generated Word Form: "dogs"
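The generation direction can be sketched as a function from a feature structure to a word form. The `generate` function below is a hypothetical illustration that handles only regular noun plurals with a simplified spelling rule. Forms such as cities or geese are outside its scope.

```python
def generate(features):
    """Produce a surface form from a feature structure (regular nouns only)."""
    base = features["base"]
    if features.get("pos") == "noun" and features.get("number") == "plural":
        # Simplified English spelling rule: -es after sibilant endings.
        if base.endswith(("s", "x", "z", "ch", "sh")):
            return base + "es"
        return base + "s"
    return base

print(generate({"base": "dog", "pos": "noun", "number": "plural"}))
```

For the desired feature structure given above, the rule appends -s and the result is dogs.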
6. Ambiguity Handling
• Unification-Based Morphology provides a framework for handling
morphological ambiguity by representing multiple potential
feature structures and allowing for disambiguation based on
context or additional linguistic constraints.
• Consider the word "saw," which could be a past tense verb or a
noun (e.g., a tool).
• Ambiguity Representation:
• Past Tense Verb: {base: "see", pos: "verb", tense: "past"}
• Noun (Tool): {base: "saw", pos: "noun"}
4. Functional Morphology
•Functional morphology defines its models using principles of
functional programming and type theory.
•It treats morphological operations and processes as pure
mathematical functions and organizes the linguistic as well as
abstract elements of a model into distinct types of values
and type classes.
•Although functional morphology is not limited to modelling
particular types of morphologies in human languages, it is
especially useful for fusional morphologies.
•Linguistic notions like paradigms, rules, exceptions,
grammatical categories, parameters, lexemes, morphemes, and
morphs can be represented intuitively and clearly.
•Functional morphology implementations are intended to be
reused as programming libraries capable of handling the
complete morphology of a language.
•We can describe inflection I, derivation D, and lookup L as
functions of these generic types:
• I :: lexeme → parameter → form
• D :: lexeme → parameter → lexeme
• L :: content → lexeme
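The three signatures can be mirrored as curried functions. The sketch below uses Python rather than a typed functional language, so the types are only implied. The paradigm tables `NOUN_SUFFIX` and `DERIVATION` and the function names are hypothetical and cover regular, concatenative forms only.

```python
# Hypothetical toy paradigms for illustration only.
NOUN_SUFFIX = {"singular": "", "plural": "s"}
DERIVATION = {"agent": "er"}   # verb -> agent noun, e.g. play -> player

def inflect(lexeme):
    """I :: lexeme -> parameter -> form"""
    return lambda parameter: lexeme + NOUN_SUFFIX[parameter]

def derive(lexeme):
    """D :: lexeme -> parameter -> lexeme"""
    return lambda parameter: lexeme + DERIVATION[parameter]

def make_lookup(lexemes, parameters):
    """Build L :: content -> lexeme by inflecting every lexeme in advance."""
    table = {inflect(lx)(p): lx for lx in lexemes for p in parameters}
    return lambda content: table.get(content)

player = derive("play")("agent")
lookup = make_lookup(["cat", player], ["singular", "plural"])
print(inflect(player)("plural"))
print(lookup("cats"))
```

Because lookup is built by running inflection over the lexicon, the analysis and generation directions share one description of the paradigm.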
•Many functional morphology implementations are embedded in
a general-purpose programming language.
•This gives programmers more freedom with advanced
programming techniques and allows them to develop full-
featured, real-world applications for their models.
•For instance
•The Zen toolkit for Sanskrit morphology is written in OCaml.
•It influenced the functional morphology framework in Haskell,
with which morphologies of Latin, Swedish, Spanish, Urdu, and
other languages have been implemented.
• Morphological grammars in Grammatical Framework can be
extended with descriptions of the syntax and semantics of a
language.
• Grammatical Framework itself supports multilinguality, and
models of more than a dozen languages are available as
open-source software.
Assignment Questions
1. Define the following terminologies
a. Morphology
b. Phonology
c. Orthography
d. Etymology and lexicology
e. Morphological Parsing
2. Give the steps to find the structure of the words given a sentence.
3. What is the significance of words and their components?
4. Explain the key components of a word with examples.
5. What are the issues and challenges in finding the structure of words? Discuss.
6. Explain dictionary lookup as a morphological model.
7. What is finite-state morphology? Traverse the given words using an FST:
i. cities ii. cats iii. gooses iv. caught v. runs vi. feet
8. What is unification-based morphology? Give its key components with
examples.
9. Define functional morphology. What is the significance of using functional
morphology?