
Module-2

Word Level Analysis


• This module discusses how computers process words in text and explains different ways to recognize, analyze, and categorize words.
1.Understanding Word Sequences:
Computers need to recognize the order of words in a sentence.
•Example: The sentences "I love coffee" and "Coffee love I" have the same words but different meanings because of the sequence.
2.Identifying Morphological Variants:
•Words can change form but still mean the same thing.
•Example: "run", "running", and "ran" are different forms of the same root word.
•Computers need to understand that these words are related.
3. Detecting and Correcting Misspelled Words:
•If you type "recieve", the system should recognize it as a misspelling of "receive" and correct it.
•This is useful for spell checkers.

4.Part-of-Speech Tagging:

• Computers need to know whether a word is a noun, verb, adjective, etc. to understand the sentence.

• Example: In "The fish swims", "fish" is a noun. But in "I fish in the river", "fish" is a verb.

• This is important for language understanding.


5. Methods to Identify Parts of Speech:
Three methods are used:
•Rule-based (Linguistic): Uses fixed grammar rules.
•Stochastic (Data-driven): Uses statistics and probability.
•Hybrid: A mix of both methods.
6. Regular Expressions (Regex):
•A tool for recognizing patterns in text.
•Example: If you want to find all versions of "supernova" (e.g., "supernova", "Supernova", "supernovas"), you
can use a regular expression instead of searching for each word separately.
Regular Expressions
What are Regular Expressions?
• Regular expressions (RE) are special patterns used to find and
manipulate text. Think of them as smart search tools that help
computers recognize specific words, numbers, or formats in a large
amount of text.
• Ex: finding all files of a certain type in a folder, such as .txt or .ppt files.

•RE was first studied in 1956 by a scientist named Stephen Kleene.
•It became popular in computer science when Unix editors like 'ed' and the Perl programming language started using it.
• RE is a special text pattern that helps find or match specific words or characters in a larger body of text.
• For example:
🔹 /a/ → This pattern will find the letter 'a' anywhere in a word or sentence.
🔹 /supernova/ → This pattern will match the exact character sequence "supernova", wherever it appears.

These patterns highlight words inside sentences, making them useful for
searching or text filtering.
Character Classes (Grouping Characters Together)
• Character classes let you define multiple possible matches inside square brackets [ ].
Instead of searching for just one word, we can group multiple choices together.
Example : Finding Multiple Letters
🔹 /[abcd]/ → Finds any one of the letters a, b, c, or d in a word.
• Will match words like "bad", "cab", "dad", etc.
🔹 /[0123456789]/ → Finds any digit (0 to 9).
• Instead of listing each character separately, we can use a dash (-) to define a range.
• This can be shortened to /[0-9]/ for convenience.
• /[5-9]/ → Matches any number from 5 to 9 (so it will match 5, 6, 7, 8, or 9).
• /[m-p]/ → Matches any letter between m and p (so it will match m, n, o, or p).
• If you don't want to match a specific character, you can use a caret (^) inside the square brackets.
Ex: /[^x]/ → This pattern matches any character EXCEPT 'x'.
If applied to "example", it will match "e", "a", "m", "p", "l", and "e", but NOT 'x'.
Case Sensitivity in RE
•Regular expressions are case-sensitive.
•The pattern /s/ only matches lowercase 's', but not uppercase 'S'.
•Ex: /sana/ will match "sana", but NOT "Sana".
•To match both lowercase and uppercase, use /[sS]ana/. This matches either 'sana' or 'Sana’.
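These bracket patterns can be tried directly with Python's standard re module; a quick sketch with illustrative strings:

```python
import re

# /[abcd]/ : any one of the letters a, b, c, or d
print(re.findall(r"[abcd]", "bad cab"))        # ['b', 'a', 'd', 'c', 'a', 'b']

# /[0-9]/ : any digit, written as a range
print(re.findall(r"[0-9]", "room 42"))         # ['4', '2']

# /[^x]/ : any character EXCEPT 'x'
print(re.findall(r"[^x]", "example"))          # ['e', 'a', 'm', 'p', 'l', 'e']

# /[sS]ana/ : matches both 'sana' and 'Sana'
print(re.findall(r"[sS]ana", "sana or Sana"))  # ['sana', 'Sana']
```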
Making Characters Optional (?)
Sometimes, we want to match both singular and plural words like "supernova" and "supernovas".
•Pattern: /supernovas?/
•What does ? do?
•The ? makes the preceding character optional.
•So, "supernova" and "supernovas" both match!
•Doesn't match: "supernovaS" (capital S)

•Repeating Characters (* – Kleene Star)
•The asterisk * means "zero or more repetitions" of the previous character.
•This helps when we want to match any number of a certain letter.
•Example 1: Match any number of 'b'
•Pattern: /b*/
•✅ Matches: "b", "bb", "bbb", "bbbb", "" (the empty string is also valid!)
•Example 2: Match 'b' followed by zero or more 'b'
•Pattern: /bb*/ : Matches: "b", "bb", "bbb", "bbbb"
Anchors (^ and $)
•The caret ^ is used to match the beginning of a line.
•The dollar sign $ is used to match the end of a line.
Example: Match an exact phrase in a line
•Suppose we want to find lines that only contain "The nature."
•Pattern: /^The nature\.$/
• Matches:
•The nature. (if this is the full line)
• Does NOT match:
•The nature is beautiful. (because extra words are present)
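The quantifiers and anchors above can be tried the same way; a small sketch, with illustrative strings:

```python
import re

# /supernovas?/ : '?' makes the preceding 's' optional
print(bool(re.search(r"supernovas?", "supernova")))    # True
print(bool(re.search(r"supernovas?", "supernovas")))   # True

# /bb*/ : 'b' followed by zero or more b's (Kleene star)
print(re.findall(r"bb*", "b bb abbba"))                # ['b', 'bb', 'bbb']

# /^The nature\.$/ : anchors force a full-line match
pattern = re.compile(r"^The nature\.$")
print(bool(pattern.match("The nature.")))              # True
print(bool(pattern.match("The nature is beautifulapproach."[:25])))  # False: extra words
```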
1. Using . to Match Any Character
Pattern: /c.t/
Matches: "cat", "cut", "cot", "c4t" (since the dot . matches any single character)
2. Using \d and \D
Pattern: /\d\d\d/ → This matches exactly three digits.
Matches: "123", "456", "789"
Does NOT match: "12" (only two digits)
Pattern: /\D\D\D/ → This matches exactly three non-digits.
Matches: "abc", "XYZ"
Does NOT match: "12A" (contains a digit)
3. Using the OR (|) Operator: The pipe symbol | means "OR".
• It allows us to match multiple words or phrases.
Example: Matching Multiple Words. Pattern: /blackberry|blackberries/
Matches: "blackberry", "blackberries". Does NOT match: "blueberry"
Wildcard Matching with .*
•The .* pattern means "match anything" (zero or more characters of any kind).
•Example:
•Pattern: /....berry/
This will find any four characters followed by 'berry'.
Matches: "strawberry" , "sugarberry" , "blackberry"

Does NOT match: "bluberry" (because "blu" has only 3 letters before "berry")
Example: Validate an Email Address Using RE
^[A-Za-z0-9_\.\-]+@[A-Za-z0-9_\.\-]+[A-Za-z0-9][A-Za-z0-9_]$

Valid Emails: [email protected], [email protected]
Invalid Emails: @gmail.com (missing username), hello@com (domain extension missing)
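A hedged sketch of using the slide's pattern with Python's re module. The addresses below are illustrative (not from the original slide); note that the simplified pattern does not actually require a dot in the domain, so "hello@com" slips through it, and a stricter variant is shown alongside:

```python
import re

# The simplified pattern from the slide. Real-world email validation
# is considerably more involved (see RFC 5322).
SLIDE_RE = re.compile(r"^[A-Za-z0-9_\.\-]+@[A-Za-z0-9_\.\-]+[A-Za-z0-9][A-Za-z0-9_]$")

# A stricter sketch that requires at least one dotted domain part.
STRICT_RE = re.compile(r"^[A-Za-z0-9_.+-]+@[A-Za-z0-9-]+(\.[A-Za-z]{2,})+$")

for addr in ["alice@example.com", "@gmail.com", "hello@com"]:
    print(addr, bool(SLIDE_RE.match(addr)), bool(STRICT_RE.match(addr)))
```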
Regular Languages?
• Regular expressions are part of "regular languages", which are based on Boolean
logic (true/false statements).
Just like formulas in math help solve equations, RE helps find patterns in
text.
Regular languages can be used in search engines, spam filters, and
programming to validate user input.
• Symbol Pairs in RE
symbol pairs (like /a:b/) represent relationships between two expressions.
• /a:b/ can be used to map two different sets of characters.
• It’s useful in text transformation and machine learning models.
Whenever you see a, replace it with b. Ex: /c:k/ (Convert 'c' to 'k')
Finite State Automata (FSA)
A finite automaton is a simple type of machine used in computing to
process input and decide an outcome based on a set of predefined
rules. It has:
1.A set of states (e.g., "Start," "Middle," "End").
2.A set of input symbols (e.g., letters 'a' and 'b').
3.Transitions (rules that define how the machine moves from one state
to another when it reads an input).
• This is called finite because the number of states is limited.
Deterministic Finite Automaton (DFA)
A DFA is a type of finite automaton where:
• Each state has only one transition per input symbol.
• It is predictable: given an input, the machine always follows the same path.
Example:
• We have an input alphabet Σ = {a, b, c} (i.e., possible inputs).
• There are five states: {q₀, q₁, q₂, q₃, q₄}.
• q₀ is the starting state.
• q₄ is the final (accepting) state (shown as a double circle in the diagram).
• The following transitions occur:
• If at q₀ and input is a, move to q₁.
• If at q₁ and input is b, move to q₂.
• If at q₁ and input is c, move to q₃.
• If at q₂ and input is b, move to q₄ (final state).
• If at q₃ and input is b, move to q₄ (final state).
The finite-state automaton (FSA) is defined as:
M = (Q, Σ, δ, S, F), where
Q → {q₀, q₁, q₂, q₃, q₄} (set of states)
Σ → {a, b, c} (alphabet)
S → q₀ (start state)
F → {q₄} (final/accepting state)
δ → transition function (δ: Q × Σ → Q)
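A minimal Python simulation of this DFA; the dictionary encodes exactly the transitions listed above:

```python
# Transition function δ for the DFA described above.
DELTA = {
    ("q0", "a"): "q1",
    ("q1", "b"): "q2",
    ("q1", "c"): "q3",
    ("q2", "b"): "q4",
    ("q3", "b"): "q4",
}
START, FINAL = "q0", {"q4"}

def accepts(string: str) -> bool:
    """Run the DFA; reject on any undefined transition."""
    state = START
    for symbol in string:
        if (state, symbol) not in DELTA:
            return False
        state = DELTA[(state, symbol)]
    return state in FINAL

print(accepts("abb"))  # True  (q0 -a-> q1 -b-> q2 -b-> q4)
print(accepts("acb"))  # True  (q0 -a-> q1 -c-> q3 -b-> q4)
print(accepts("ab"))   # False (ends in q2, not a final state)
```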
Non-Deterministic Finite-State Automaton (NFA)
• State can have multiple possible transitions for the same input.
NFA with states q₀, q₁, q₂, q₃, q₄, q₅.
• From q₀, if input is 'a', the automaton can go to either q₁ or q₂.
• This is non-deterministic because more than one transition is
possible for the same input.
• If the path reaches q₅, the string is accepted.
δ: Q × (Σ ∪ {ε}) → 2^Q (a state/input pair may map to a set of states)
• The language defined by the automaton is /abb|acb/
• /(a|b)*baa$/ → strings that contain only 'a' and 'b' and end with the substring "baa".
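A sketch of set-based NFA simulation; the transition table is an assumption reconstructed from the description (the two 'a' branches from q₀ spell out "abb" and "acb"):

```python
# NFA transitions: a state/symbol pair may lead to SEVERAL states.
NFA = {
    ("q0", "a"): {"q1", "q2"},
    ("q1", "b"): {"q3"},
    ("q3", "b"): {"q5"},
    ("q2", "c"): {"q4"},
    ("q4", "b"): {"q5"},
}
FINAL = {"q5"}

def nfa_accepts(string: str) -> bool:
    """Track the SET of states the NFA could be in after each symbol."""
    states = {"q0"}
    for symbol in string:
        states = set().union(*(NFA.get((s, symbol), set()) for s in states))
    return bool(states & FINAL)

print(nfa_accepts("abb"))  # True
print(nfa_accepts("acb"))  # True
print(nfa_accepts("abc"))  # False
```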
Morphological Parsing
Morphemes:
• A morpheme is the smallest linguistic unit that carries meaning. It cannot be further divided without losing or changing its meaning.
Morphemes are the building blocks of words and are used to form complex words through affixation, compounding, and other
morphological processes.
Example:
• The word "bread" consists of a single morpheme (cannot be broken down further).
• The word "eggs" consists of two morphemes:
• "egg" (root word or base)
• "-s" (indicating plural)
• Two Broad Types of Morphemes
1. Stems (Root Morphemes): The main morpheme that contains the central meaning of a word.
•Example: In the word "unhappiness", the stem is "happy".

2. Affixes (Modifiers): Modify the meaning of the stem.
•Classified into:
1. Prefix: Appears before the stem. Ex: "un-" in "unhappy".
2. Suffix: Appears after the stem. Ex: "-s" in "cats" (plural form of "cat").
3. Infix: Appears inside the stem (less common in English). Ex: Sanskrit "पिबामि" (pibāmi) → "पिपासामि" (pipāsāmi); root: पिब् (pib, "to drink"); infix: -सा-; new meaning: "to feel thirsty". Here the infix -सा- is inserted within the root to modify the meaning.
4. Circumfix: Appears on both sides of the stem. Ex: Kannada "ಅ-" + ಸಂತೋಷ + "-ತೆ" = ಅಸಂತೋಷತೆ.
Three main processes of word formation in linguistics:
1.Inflection:
1. A root word is combined with a grammatical morpheme to create a new form without changing its word
class.
2. Example: Adding -s to "cat" → "cats" (plural form of the noun).
2.Derivation:
1. A word stem is combined with a morpheme to create a new word belonging to a different grammatical
class.
2. Example: "compute" (verb) → "computation" (noun).
3. This process includes nominalization, where verbs or adjectives become nouns.
3.Compounding:
1. Two or more words are merged to form a new word with a distinct meaning.
2. Example: "desktop" (desk + top), "overlook" (over + look).
Importance of Morphological Analysis in NLP (Natural Language Processing)
• Helps in word formation and understanding new words.
• Used in spelling correction, machine translation, and information retrieval.
• Helps in parsing, which involves breaking a sentence into syntactic, semantic, and pragmatic
structures.
Morphological parsing takes an inflected surface form (the actual word
as it appears in a sentence) and analyzes its structure to produce:
1.The lemma (canonical form) – The base form of the word.
2.Morphological features – Information about tense, number, gender,
person, case, etc.
Ex: Consider the word “running” as an input.
Input (Inflected Surface Form) Word: "running"
Morphological Parsing Output:
•The parser identifies "running" as a verb.
•It extracts "run" as the base form (lemma).
•The suffix "-ing" indicates that the word is in the present continuous tense.
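This analysis can be approximated in code with a lemmatizer; a sketch using NLTK's WordNetLemmatizer (assumes NLTK and its WordNet data are installed, and the POS is supplied by hand here rather than by a tagger):

```python
# pip install nltk; then download the WordNet data once:
# import nltk; nltk.download("wordnet")
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# "running" analysed as a verb yields the lemma "run";
# the "-ing" suffix itself signals the progressive form.
print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("books", pos="n"))    # book
```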
Morphological Parsing Process
• A morphological parser analyzes the surface form of words in a text
and produces a parsed version with:
• Canonical form (lemma) → The base or dictionary form of the word.
• Syntactic and morphological characteristics → Includes properties like part
of speech, gender, number, person, tense, etc.
• The reverse process (morphological generation) creates words from
lemmas using rules.
• The parser relies on two sources of information:
• A dictionary of valid word lemmas in the language.
• A set of inflection paradigms (rules for modifying words).
Information Sources for a Morphological Parser
A morphological parser relies on the following key components:
1. Lexicon
• A lexicon is a dictionary that stores stems and affixes along with their basic meanings and features.
• It helps in retrieving the root form and grammatical details of a word.
2. Morphotactics
• Morphotactics deals with the rules for combining morphemes in a language.
• Certain morpheme combinations are valid, while others are not allowed.
• Example:
• Correct: "rest-less-ness" (valid in English).
• Incorrect: "rest-ness-less" (invalid).
• This shows that morphemes must follow specific ordering rules to form valid words.
3. Orthographic Rules
• These are spelling rules that apply when morphemes combine.
• Example:
• The y → ier rule changes "easy" → "easier", but not "easyer".
• These rules prevent incorrect word formation and ensure proper spelling modifications when morphemes are combined.
Sample Lexicon Entry (Hindi Example)
• The passage provides a sample lexicon entry for the Hindi word
"ghodhaa" (घोड़ा - horse).

• The lexicon allows quick access to the grammatical properties of a word.
• Example: If we look up "ghodhon", we find that it is masculine plural.
Limitations of Lexicon-Based Approaches
•Memory Consumption:
•Storing all word forms in a lexicon requires a lot of memory.
•Example: Instead of storing "ghodhaa", "ghodhi", "ghodhon", and "ghodhe" separately, a more efficient approach is to use morphological rules to generate forms dynamically.
•Redundancy:
•Many lexicon entries repeat information, leading to unnecessary data
storage.
Stemming
Stemming is a simpler way of reducing words to their base form. It
does not consider grammar but simply removes word endings using
rules.
Example of Stemming:
• “Playing” → “Play”
• “Easily” → “Eas” (incorrect reduction)
• “Organization” → “Organ” (incorrect reduction)
• As you can see, stemming sometimes chops off too much and creates
meaningless words (“eas” instead of “easy”).
Major limitations:
• It fails to capture relationships between different word forms
that share the same root.
• It cannot generalize linguistic rules, making it difficult to handle
unknown words.
• For morphologically complex languages (e.g., Turkish), storing all
possible word forms is impractical as the number of forms can be
infinite.
Example
If we create a lexicon for the word "play", it would contain: play,
plays, played, playing
However, this does not indicate that all these forms come from a
single root (play).
Stemming: A Simple Approach
The simplest form of morphological analysis is stemming,
which reduces words to their base form (stem).
•Stemmers do not require a lexicon.
•They apply rewrite rules to remove suffixes and prefixes.
Two major stemming algorithms:
1.Lovins Stemmer (1968)
2.Porter Stemmer (1980) (widely used in NLP)

Examples of Stemming Rules
Porter's stemmer applies rules like:
•ier → y : earlier → early
•ing → ε (empty) : playing → play
•ational → ate : educational → educate
This helps reduce words to a simpler form, making them easier to analyze.
How Stemming Works
A stemming algorithm works in two steps:
1.Suffix Removal → Removes predefined endings from words.
2.Recoding → Adds predefined endings to the output of the first step.
These two steps can be:
•Sequential (as in Lovins’ Stemmer)
•Simultaneous (as in Porter’s Stemmer)
Example: Porter’s Stemmer applies the transformation: educational → educate
This simplifies words by removing unnecessary suffixes.
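A short sketch using NLTK's implementation of the Porter stemmer; exact outputs depend on the stemmer variant, and some stems (like "organ") illustrate the over-reduction errors discussed next:

```python
# pip install nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["playing", "educational", "organization", "easily"]:
    print(word, "->", stemmer.stem(word))
# e.g. playing -> play, organization -> organ; note that some
# outputs (like the stem of "easily") are not real English words.
```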
Challenges in Stemming
•Errors of omission and over-reduction:
•"organization" → "organ" (incorrect), "noise" → "noisy" (incorrect)
•Prefixes and compound words are not reduced properly.
Example of Errors
•"computational" → "comput" (incorrect reduction)
•"noisy" → "noise" (incorrect reduction)
These mistakes show why stemming is not always perfect.
Two-Level Morphological Model
A more advanced morphological parsing approach is the Two-Level
Morphological Model, introduced by Koskenniemi (1983).
How It Works
• Instead of just removing suffixes, it maps words at two levels:
• Lexical Level (Morpheme-based representation)
• Surface Level (Actual word form)
• Finite-State Transducers (FSTs) are used to implement the mapping between these two levels.
Surface vs. Lexical Level Representation
The document first discusses how a word's surface form (as it appears
in text) can be transformed into its lexical form, which consists of a root
word and additional morphological information.
Example 1: "playing"
•Surface form: p l a y i n g
•Lexical form: p l a y +V+PP
•Here, the stem "play" is separated from the suffix "-ing".
•The "+V" indicates that "play" is a verb.
•The "+PP" indicates that the verb is in the present participle form.
Example 2: "books"
•Surface form: b o o k s
•Lexical form: b o o k +N+PL
•The stem "book" is separated from the plural suffix "-s".
•"+N" indicates that it is a noun.
•"+PL" indicates the plural form.
Finite-State Transducer (FST)
• To automate this transformation, Finite-State Transducers (FSTs) are used. These are a
type of finite-state automaton that maps input symbols to output symbols.
An FST consists of six components:
1.Σ₁ (Input Alphabet) → The set of symbols the FST reads.
2.Σ₂ (Output Alphabet) → The set of symbols the FST outputs.
3.Q (Set of States) → The set of states in the transducer.
4.S (Start State) → The initial state.
5.F (Final States) → The states that indicate a valid transformation.
6.δ (Transition Function) → Defines how input symbols are mapped to output symbols.
•Unlike a Finite-State Automaton (FSA), which only recognizes strings, an FST
both reads an input and produces an output.
•It can be compared to a Non-deterministic Finite Automaton (NFA), except
that it also generates output.
FST Mapping Example:
•The FST accepts the words "hot" and "cat" and transforms them into "cot" and "bat", respectively.
State Transitions:
1.The input string "hot" is processed.
•"h" is mapped to "c" → (h : c)
•"o" is unchanged → (o : o)
•"t" is unchanged → (t : t)
•Final output: "cot"
2.The input string "cat" is processed.
•"c" is mapped to "b" → (c : b)
•"a" is unchanged → (a : a)
•"t" is unchanged → (t : t)
•Final output: "bat"
This demonstrates how an FST transforms words by substituting or retaining certain characters.
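A minimal Python sketch of this transducer; the state names are assumptions, and each arc consumes one input symbol while emitting one output symbol:

```python
# Each arc: (state, input_symbol) -> (next_state, output_symbol).
# The machine accepts exactly "hot" and "cat".
ARCS = {
    ("q0", "h"): ("q1", "c"),  # h : c
    ("q1", "o"): ("q2", "o"),  # o : o
    ("q2", "t"): ("qF", "t"),  # t : t
    ("q0", "c"): ("q3", "b"),  # c : b
    ("q3", "a"): ("q4", "a"),  # a : a
    ("q4", "t"): ("qF", "t"),  # t : t
}
FINAL = {"qF"}

def transduce(word: str):
    """Read the word symbol by symbol, emitting the mapped output."""
    state, out = "q0", []
    for ch in word:
        if (state, ch) not in ARCS:
            return None          # input not in the transducer's domain
        state, emitted = ARCS[(state, ch)]
        out.append(emitted)
    return "".join(out) if state in FINAL else None

print(transduce("hot"))  # cot
print(transduce("cat"))  # bat
print(transduce("dog"))  # None (rejected)
```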
• Finite-State Transducers and Regular Relations
• Just as Finite-State Automata (FSA) encode regular languages, FSTs
encode regular relations.
• A regular relation defines how two languages relate to each other.
• The upper language is the one on the upper side of the FST, and the
lower language is the transformed version.
• If T is a transducer and s is a string, then T(s) represents the set of
encoded string pairs (s, t) in the relation.
• FSTs are closed under union, concatenation, and Kleene closure but
not under intersection and complementation.
• Two-Step Morphological Parsing Using FST
Step 1: Splitting Words into Morphemes
• Words are broken into their components (stems and suffixes).
• Example:
• "birds" → "bird + s" (stem + plural suffix)
• "boxes" → Can be split in two ways:
• "box + s" (treats "box" as the stem and "s" as the plural suffix)
• "boxe + s" (assumes the spelling rule inserted "e" before "s")
Step 2: Mapping Morphemes to Stems and Morphological Features
• This step adds grammatical categories to the morphemes.
• Example:
• "bird + s" → "bird + N + PL" (indicating "bird" is a noun (N) in plural (PL) form)
• "goose" → "goose + N + sg" (noun in singular)
• "geese" → "geese + N + PL" (plural form of "goose")
Role of FST in Morphological Parsing
•The FST transducer performs the mapping (translation) of surface
forms to lexical representations.
•Example:
Word: "lesser"
•Surface Form → "lesser"
•Lexical Form → "less + Adj + Comp" (Comparative form)
•The FST represents "er" as a comparative suffix, while "less" is the
adjective root.
This FST diagram represents the transformation of the adjective "less" → "lesser":
•The input ("lesser") is processed step by step:
•"l" → "l"
•"e" → "e"
•"s" → "s"
•"s" → "s"
•"e" → "e"
•"r" → "+Comp" (indicating that the suffix "r" marks the comparative form)
Key Properties of the FST in Figure 3.8:
•Handles spelling variations (like "box + s" vs. "boxe + s").
•Works bidirectionally:
•Upward Application → Converts surface forms into lexical representations.
•Downward Application → Generates surface forms from lexical forms.
• Understanding Morphological Parsing with Finite-State Transducers (FSTs)
Step 1: Using a Lexicon for Categorization
Before applying transformations, the system needs to understand the categories of stems
and the meanings of affixes.
For example:
• bird + s → bird + N + PL
• The noun "bird" is marked with N (Noun) and PL (Plural marker)
• box + s → box + N + PL
• The noun "box" follows the same pattern.
However, the system also needs to determine whether splitting is valid or invalid.
• If we try to split "boxes" as "boxe + s", this is incorrect, because "boxe" is not a valid stem.
• Instead, spelling rules tell us that "e" should be added before "s" when forming plural
words like "dishes" (dish → dishes) or "boxes" (box → boxes).
• Some words, like "spouses" or "passes", follow different spelling rules, so orthographic
rules help in handling such cases.
Step 2: Using Finite-State Transducers (FSTs) for Singular and Plural
Forms
The next step is to map the surface form (actual written word) to an
intermediate form.
• Plural nouns in English typically end in "-s" or "-es".
• However, not all words ending in "s" are plural!
• Example: "miss" and "ass" are singular words.
• Some transformations involve deleting "e" at the morpheme
boundary for words ending in "s", "x", "z", "ch", "sh" (e.g., "dish" →
"dishes", "box" → "boxes").
Understanding the Transducer's Role
A transducer is used to process words and recognize their grammatical structure. In this step, the
transducer takes an intermediate representation and converts it into a lexical representation.
The input to the transducer falls into four different categories:
1. Regular noun stem (base form of a noun)
1. Examples: bird, cat
2. These words are already singular and do not need modification.
2. Regular noun stem + 's' (plural form)
1. Example: bird + s → birds
2. In this case, the transducer detects the "s" and identifies it as a plural marker.
3. Singular irregular noun stem
1. Example: goose
2. Some words do not follow the standard pluralization rules (adding "s" or "es").
3. The word "goose" is singular, and its plural form is "geese", which requires a special rule.
4. Plural irregular noun stem
1. Example: geese
2. This is already a plural noun, so the transducer must recognize it as such.
Reading "s" (State 1 → 1 and State 1 → 2)
•The transducer encounters "s", which is a plural marker.
•The transition diverges into two possible paths:
1.Direct Path (State 1 → 1):
•The "s" is simply output as "s" (Surface Level: "birds").
2.Lexical Path (State 1 → 2 → 6):
•The transducer adds a "+" separator (ε transition) before "s",
indicating the morphological split between "bird" and "s".
•This leads to Output: "bird + s".
Spelling Error Detection and Correction
• Common spelling errors in computer-based information systems, focusing on
single-character mistakes. According to research by Damerau (1964), over 80% of
typing errors fall into four categories:
Typing Errors:
1.Substitution of a single letter: "hello" → "jello" (h → j)
2.Omission of a single letter: "chair" → "char" (missing "i")
3.Insertion of a single letter: "book" → "boook" (extra "o")
4.Transposition of two adjacent letters: "teh" instead of "the" (swapped "e" and "h")

Spelling errors fall into several types: typing errors, Optical Character Recognition (OCR) errors, phonetic errors, and non-word vs. real-word errors.
Optical Character Recognition (OCR) Errors
OCR errors occur when a computer scans a printed or handwritten text and
misinterprets characters due to visual similarity. These errors are different from
typing errors and usually involve incorrect letter recognition.
Common OCR errors:
• Substitution due to visual similarity
Correct: "come"
OCR Error: "eome" (c mistaken for e)
• Multi-substitution (framing errors)
Correct: "m"
OCR Error: "rn" (a single "m" recognized as "rn")
• Space deletion or insertion
Correct: "pen ink"
OCR Error: "penink" (missing space)
• Failure (not recognizing a character at all)
Correct: "lion"
Phonetic Spelling Errors
These errors happen when a word is misspelled in a way that sounds similar
to the correct pronunciation. Unlike typing errors, phonetic errors may not
be immediately obvious because they "sound correct" when spoken.
Examples of Phonetic Errors:
• "rite" instead of "right"
• "no" instead of "know"
• "their" instead of "there"
• These errors are common in speech recognition, transliteration, and voice
typing.
Non-Word vs. Real-Word Errors
Spelling errors can be categorized into two major types:
• Non-word Errors: The misspelled word does not exist in the dictionary.
• Real-word Errors: The misspelled word is a real word, but it is the wrong word in context.
(A) Non-Word Errors
Occurs when a word is misspelled into something that isn’t a valid word.
Correct: "receive"
Non-word Error: "recieve"
Detection Method: Dictionary lookup and N-gram analysis
(B) Real-Word Errors
Occurs when a word is incorrectly replaced by another valid word, leading to semantic errors. These are
harder to detect because the word itself is correct, but it is wrong in the context.
Correct: "I would like a piece of cake." and "The soldiers fought for peace."
Real-Word Errors:
• "I would like a peace of cake." (Incorrect meaning)
• "The soldiers fought for piece." (Incorrect meaning)
Detection Method:
• Context analysis using AI models
Spelling Correction: Detection and Correction Methods
Spelling correction consists of two main processes:
1. Error Detection → Identifying misspelled words.
2. Error Correction → Suggesting the correct words.
These problems are addressed in two ways:
1. Isolated-Word Error Detection and Correction
Each word is checked separately, without considering the surrounding words.
Example Method: Dictionary (Lexicon) Lookup
Challenges of Isolated-Word Error Detection:
• Requires a large lexicon (dictionary), which takes time and storage.
• Some languages have too many words to list in a dictionary.
• Fails when an error results in a real-word mistake (e.g., "theses" instead of "these").
• In a large lexicon, mistakes might go undetected because incorrect words can still exist in the
dictionary.
Example:
• Correct word: "receive"
• Misspelled word: "recieve" → Detected as incorrect because "recieve" is not in the dictionary.
• Real-word error: "peace" instead of "piece" → Not detected because both are valid words.
2. Context-Dependent Error Detection and Correction
Checks the meaning and grammatical context of the word to detect errors.
Example Method: Grammar & Language Processing
Advantages of Context-Dependent Detection:
• Can detect real-word errors that isolated-word detection misses.
• Uses grammatical rules to ensure correct usage.
Example:
"I will meat you tomorrow."
Incorrect (should be "meet")
Context-aware detection catches this error.
"Their going to the store."
Incorrect ("Their" should be "They're")
Context-aware detection fixes this.
Process:
1.First, use isolated-word detection to get a list of possible correct words.
2.Then, choose the candidate that best fits the surrounding context.
Spelling Correction Algorithms:
1. Minimum Edit Distance (Levenshtein Distance)
A way to measure how different two words are by counting:
• Insertions (adding a letter)
• Deletions (removing a letter)
• Substitutions (changing a letter)
Example:
Convert "hte" to "the"
1.Swap adjacent 'h' and 't' → "the" (strictly a transposition, as in Damerau's extension; with only the three operations above, it counts as 2 substitutions)
Edit Distance = 1 (with transposition)
Example:
Convert "speling" to "spelling"
2.Insert 'l' → "spelling" (1 insertion)
Edit Distance = 1
• The lower the edit distance, the more similar the words are.
2. Similarity Key Techniques
This method converts a string into a key so that similar words share the
same key.
Example: SOUNDEX System (Odell and Russell, 1918)
• Used in phonetic spelling correction.
• Groups words that sound alike but may be spelled differently.
Example:
"Robert" and "Rupert" both have Soundex key R163.
Helps match similar-sounding words even with different spellings.
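A compact sketch of one common variant of the Soundex keying scheme (implementations differ in small details, such as the treatment of 'h' and 'w'):

```python
# Letter -> digit groups of the classic Soundex scheme.
CODES = {c: d for d, letters in enumerate(
    ["aeiouyhw", "bfpv", "cgjkqsxz", "dt", "l", "mn", "r"]) for c in letters}

def soundex(name: str) -> str:
    """Return a 4-character Soundex key, e.g. soundex('Robert') == 'R163'."""
    name = name.lower()
    key, prev = name[0].upper(), CODES.get(name[0], 0)
    for ch in name[1:]:
        code = CODES.get(ch, 0)
        if code != 0 and code != prev:   # skip vowels and repeated codes
            key += str(code)
        prev = code
    return (key + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```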
3. N-Gram Based Techniques
N-grams (sequences of N letters) help detect both non-word and real-word errors.
• Some letter combinations never occur or are rare in English.
• If a word contains a rare N-gram, it might be a spelling error.
Example:
• Rare tri-gram: "qst" → Likely incorrect.
• Rare bi-gram: "qd" → Likely incorrect.
• A large dictionary (corpus) is used to check valid combinations of letters.
For real-word errors, N-grams predict which letters should follow others based on
probability.
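A toy sketch of the non-word side of this idea: collect letter bigrams from a small stand-in word list and flag words containing combinations never seen in it:

```python
from collections import Counter

# A stand-in for a large dictionary/corpus.
LEXICON = ["question", "quick", "quote", "squid", "receive", "believe"]

def char_bigrams(word):
    return [word[i:i + 2] for i in range(len(word) - 1)]

known = Counter(bg for w in LEXICON for bg in char_bigrams(w))

def suspicious(word):
    """Return the letter bigrams of `word` never seen in the corpus."""
    return [bg for bg in char_bigrams(word) if known[bg] == 0]

print(suspicious("qd"))        # ['qd'] -> likely a spelling error
print(suspicious("question"))  # []     -> all bigrams attested
```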
4. Neural Networks
AI-based Neural Networks detect and correct spelling errors using
pattern recognition.
• Trained on large datasets to identify common spelling errors.
• Can adapt to noisy or incomplete text.
Drawback: Computationally expensive to train and use.
Example:
Misspelled input: "teh qick brwn fox"
Neural network output: "the quick brown fox"
5. Rule-Based Techniques
Uses predefined spelling rules (heuristics) to correct common errors.
Example:
• If "ue" is often mistyped as "eu", create a rule to swap them.
• Error: "euestion" → Correction: "question"
🔹 Other common rules:
• "ie" vs. "ei": If not after ‘c’, "i" before "e" (e.g., "believe", "receive").
• Silent ‘e’ rules: "make" → "making" (drop ‘e’ before adding ‘-ing’).
Minimum Edit Distance
• The Minimum Edit Distance (MED) is a metric used to determine the similarity
between two strings. It calculates the smallest number of insertions, deletions, and
substitutions required to transform one string into another. This concept is widely
used in Natural Language Processing (NLP), DNA sequence alignment, and spell
checking.
The minimum edit distance between two strings, s (source) and t (target), is the
smallest number of operations needed to convert s into t. The three operations
allowed are:
1.Insertion: Adding a character.
2.Deletion: Removing a character.
3.Substitution: Replacing one character with another.
For example, consider transforming "tutor" into "tumour":
• Substitute 'm' for 't'.
• Insert 'u' before 'r'.
Edit Distance Function
• The edit distance function ed(s, t) is symmetric, meaning that the cost of converting
s → t is the same as t → s:
ed(s,t)=ed(t,s)
String Alignment
The minimum edit distance can be visualized using string alignment.
Example:
Convert "tutor" to "tumour".
•Optimal alignment (cost = 2):
•'t' → 'm' (substitution)
•Insert 'u' before 'r' (insertion)
Another alignment (cost = 3, not optimal):
Delete 't' → cost = 1; Insert 'm' → cost = 1; Insert 'u' → cost = 1
Total cost = 1 + 1 + 1 = 3
Dynamic Programming Approach
To efficiently compute the minimum edit distance, we use dynamic
programming by constructing an edit distance matrix.
Edit Distance Matrix
We create a matrix where:
• Rows represent characters of the source string.
• Columns represent characters of the target string.
• Each cell (i, j) stores the minimum edit distance for the first i
characters of the source and the first j characters of the target.
Recurrence Relation
Each cell in the matrix is computed as the minimum of three moves:
• Deletion: from (i−1, j) → drop a source character (cost 1).
• Insertion: from (i, j−1) → add a target character (cost 1).
• Substitution: from (i−1, j−1) → replace a character (cost 1).
• Special case: if source[i] == target[j], the substitution cost is 0 (no change needed).
dist[i, j] = min( dist[i−1, j] + 1, dist[i, j−1] + 1, dist[i−1, j−1] + cost )
Minimum Edit Distance Algorithm
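A standard dynamic-programming implementation of this recurrence (a sketch with unit costs for all three operations):

```python
def min_edit_distance(source: str, target: str) -> int:
    """Levenshtein distance via the DP matrix described above."""
    m, n = len(source), len(target)
    # dist[i][j] = edit distance between source[:i] and target[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i                      # delete all source chars
    for j in range(1, n + 1):
        dist[0][j] = j                      # insert all target chars
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub_cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,             # deletion
                dist[i][j - 1] + 1,             # insertion
                dist[i - 1][j - 1] + sub_cost,  # substitution (or match)
            )
    return dist[m][n]

print(min_edit_distance("tutor", "tumour"))     # 2
print(min_edit_distance("speling", "spelling")) # 1
```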
Words and Word Classes: Words in a language are grouped into word classes, also known as parts of speech or lexical categories.
Words in a language are classified into categories based on their syntactic (grammatical)
role and morphological behavior (how they change form). These classifications help us
understand the function of words in sentences.
Common Lexical Categories:
• Nouns (NN): Represent people, places, things, or concepts. (Examples: student, chair,
proof, mechanism)
• Verbs (VB): Indicate actions or states. (Examples: study, increase, produce)
• Adjectives (JJ): Describe nouns. (Examples: large, high, tall, few)
• Adverbs (RB): Modify verbs, adjectives, or other adverbs. (Examples: carefully, slowly, uniformly)
• Prepositions (IN): Show relationships between words in a sentence. (Examples: in, on, to,
of)
• Pronouns (PRP): Replace nouns. (Examples: I, me, they)
Open vs. Closed Word Classes
Words are also categorized as open or closed classes:
Open Word Classes (Can Expand)
These categories frequently allow the addition of new words:
• Nouns (new words like "selfie," "blog")
• Verbs (new words like "google," "texting")
• Adjectives (new words like "woke," "viral")
• Adverbs
• Interjections
Closed Word Classes (Rarely Expand)
These categories are more stable and rarely gain new words:
• Prepositions (e.g., "on," "in," "of")
• Auxiliary Verbs (e.g., "is," "was," "have")
• Conjunctions (e.g., "and," "but," "or")
• Pronouns (e.g., "he," "she," "they")
• Determiners (e.g., "the," "a," "some")
Part-of-Speech (POS) tagging
POS tagging is the process of labeling words in a sentence with their
corresponding grammatical category, such as noun, verb, adjective, or
preposition. This process helps in understanding the function of words in a
sentence.
Example:
1.The word "book" can function as:
1. Noun: I am reading a good book.
2. Verb: The police booked the snatcher.
➝ In the first sentence, "book" is a noun (NN).
➝ In the second sentence, "booked" is a verb (VBD - past tense).
2.The word "sona" (Hindi) can mean:
1. Gold (Noun)
2. Sleep (Verb)
• Challenge: Some words can belong to multiple categories, so POS tagging must resolve this ambiguity from context.
POS Tag Sets
A tag set is a predefined collection of tags used by a POS tagger. Different tag sets exist,
such as:
• Penn Treebank (45 tags)
• C7 tagset (164 tags)
• TOSCA-ICE (270 tags)
• TESS (200 tags)
Example:
• The Penn Treebank tag set is widely used because English is not a morphologically rich
language.
• However, languages with more inflections (e.g., Hindi, Tamil) may need bigger tag sets.
• English verbs change forms based on tense, subject, and aspect.

Penn Treebank tag set


• POS Tags for the Word "Eat“

Example of a tagged sentence:


(Penn Treebank tag set is used) Correctly Taged
Speech/NN sounds/NNS were/VBD sampled/VBN by/IN a/DT microphone/NN.
•Speech/NN → "Speech" is a noun (NN).
•sounds/NNS → "Sounds" is a plural noun (NNS).
•were/VBD → "Were" is past tense of 'be' (VBD).
•sampled/VBN → "Sampled" is past participle (VBN).
•by/IN → "By" is a preposition (IN).
•a/DT → "A" is a determiner (DT).
•microphone/NN → "Microphone" is a noun (NN).
Incorrect Tagged Sentence
Speech/NN sounds/VBZ were/VBD sampled/VBN by/IN a/DT microphone/NN.
• Here, "sounds/VBZ" is incorrect.
• "Sounds" is a noun (NNS), not a verb (VBZ).
• This leads to semantic incoherence (meaning confusion).
• The correct tagging helps resolve ambiguity and ensures that the sentence
structure makes sense.
POS Tagging Methods
POS tagging methods fall into three types:
1. Rule-Based Taggers (Linguistic): Uses hand-coded rules to assign POS tags.
Uses a lexicon (a list of words with possible tags).
Example rule: If a word ends in "-ing", it is likely a verb (VBG).
2. Stochastic Taggers (Data-Driven): Uses statistics and probability to assign tags.
Learns from large annotated corpora (text datasets with tags). Assigns the most frequent tag
for a word. Uses Hidden Markov Models (HMM) and Machine Learning. HMM decides the tag
based on surrounding words (context).
The word "bank" can be: Noun (NN) → "I deposited money in the bank."
Verb (VB) → "He banked the ball into the corner."
Example of a stochastic tagger: CLAWS (Constituent Likelihood Automatic Word-Tagging System); by contrast, TAGGIT was an early rule-based tagger.
3. Hybrid Taggers
•Combines Rule-Based and Stochastic methods.
•Uses rules for known words and probabilities for unknown words.
Rule-Based Taggers
a. Rule-based POS taggers are used to assign grammatical categories (such as
noun, verb, adjective) to words in a sentence. These taggers operate using a
two-stage architecture:
1.Dictionary Lookup – The system checks a lexicon (word database) to find
possible POS tags for each word.
2.Hand-Coded Rules – Contextual and morphological rules eliminate incorrect
tags to determine the correct POS tag.
Ex: Consider the sentence:
The show must go on.
Here, the word "show" can be tagged as either:
•VB (Verb) → "I will show you."
•NN (Noun) → "The show was amazing."
To resolve this ambiguity, a rule-based approach is applied:
Rule: IF preceding word is a determiner (DT), THEN eliminate the VB tag.
b. Using Morphological Information
Some taggers use word structure (morphology) to determine the correct tag.
For example, consider this rule:
IF word ends in "-ing" AND the preceding word is a verb, THEN label it a
verb (VB).
Example Sentence:
She is running fast.
His running was impressive.
Here, the word "running" can be:
1.Verb (VB) → "She is running fast."
2.Noun (NN) → "His running was impressive."
Applying the Rule:
•In "She is running fast", the word "is" (a verb) comes before "running".
•Rule applies → "running" is labeled as a verb (VB).
•In "His running was impressive", "His" is a pronoun, not a verb.
•Rule does not apply → "running" is labeled as a noun (NN).
c. Capitalization-Based Disambiguation
Another strategy is using capitalization to identify unknown nouns (such as
proper nouns).
Example:
Paris is a beautiful city.
He went to a paris café.
•In the first sentence, "Paris" is a proper noun (NNP).
•In the second sentence, "paris" (lowercase) might not be a name but a general
noun.
Thus, capitalized words are often labeled as proper nouns (NNP).
d. Early Rule-Based Taggers
Some well-known rule-based POS taggers include:
• TAGGIT (Greene & Rubin, 1971) → Used for Brown Corpus.
• ENGTWOL (Voutilainen, 1995) → Another advanced rule-based tagger.
TAGGIT Example:
• Used 3,300 hand-coded rules to tag 77% of words correctly in the Brown Corpus
dataset
Stochastic POS Taggers
Stochastic POS taggers use probability and statistics to assign parts of speech to words. They rely on
Markov models and n-gram probabilities to determine the most likely tag for a word based on
training data.
1. HMM (Hidden Markov Model) Tagger
•The HMM tagger assumes that the probability of a word's POS tag depends on the previous word's tag (Markov assumption).
•The simplest model is the unigram model, which assigns the most frequent POS tag to each word.
Example: Unigram Model Error
Consider the word "fast" in different sentences:
1.She had a fast. (fast = noun)
2.Hindus fast during Navaratri. (fast = verb)
3.Those who were injured in the accident need to be helped fast. (fast = adverb)
Since "fast" is most commonly used as an adjective, a unigram model might wrongly tag it as JJ (adjective) in all cases.
Incorrect Tagging (Unigram Model Output):
•She had a fast/JJ. (Should be NN - noun)
•Hindus fast/JJ during Navaratri. (Should be VB - verb)
2. Bigram Model for More Accuracy
To improve accuracy, a bigram model considers the previous word’s tag
to predict the correct POS tag.
• If the previous tag is DT (determiner) → "fast" is likely NN (noun).
• If the previous tag is NN (noun) → "fast" is likely VB (verb).
• If the previous tag is VB (verb) → "fast" is likely RB (adverb).
Example: Bigram Model Fixing Errors
• "She had a fast" → “a” (DT) suggests NN (noun) → Correct!
• " Hindus fast during Navaratri " → "Hindus" (NN) suggests VB (verb)
→ Correct!
• "helped fast" → "helped" (VB) suggests RB (adverb) → Correct!
• Thus, the bigram model improves accuracy by considering context.
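A toy illustration of the bigram idea for the word "fast"; the context-to-tag table is invented for illustration:

```python
# Chosen tag for "fast" given the previous tag -- an invented toy table.
FAST_GIVEN_PREV = {
    "DT": "NN",  # "a fast"      -> noun
    "NN": "VB",  # "Hindus fast" -> verb
    "VB": "RB",  # "helped fast" -> adverb
}

def tag_fast(prev_tag: str) -> str:
    # Fall back to the unigram choice (adjective) when context doesn't help.
    return FAST_GIVEN_PREV.get(prev_tag, "JJ")

print(tag_fast("DT"))  # NN  ("She had a fast.")
print(tag_fast("NN"))  # VB  ("Hindus fast during Navaratri.")
print(tag_fast("VB"))  # RB  ("...need to be helped fast.")
```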
Overview of n-gram Models in Tagging
• An n-gram model assigns a part-of-speech (POS) tag to a word based on the previous
(n-1) words and their tags.
• The simplest model is a unigram model, where each word is tagged based on the
most frequent tag in a trained dataset.
• A bigram model improves accuracy by considering the previous word’s tag.
• A trigram model considers the two previous words’ tags, leading to even better
predictions.

In a trigram model, the tag of a word tn is predicted using the tags of the previous two words, tn−2 and tn−1. (In the original figure, the gray-shaded area represents the context used for tagging.)
Hidden Markov Model (HMM) Tagger
• The HMM tagger is a probabilistic model that uses two layers:
• Visible layer – the sequence of words in a sentence.
• Hidden layer – the sequence of POS tags.
• The tags are "hidden" during actual text processing, meaning we only
see words, but the model must infer the correct tag sequence.
Here the Objective is to find the most probable sequence of POS tags T
for a given sentence W
Given: W = w1, w2, ..., wn and T = t1, t2, ..., tn,
we maximize: T' = argmax_T P(T|W)
By Bayes' theorem, P(T|W) = P(W|T) × P(T) / P(W).
Since P(W) remains the same for all tag sequences,
T' = argmax_T P(W|T) × P(T)
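The standard way to compute this argmax efficiently is the Viterbi algorithm; a compact sketch with a toy tag set, where all probabilities are invented for illustration rather than estimated from a real corpus:

```python
TAGS = ["DT", "NN", "MD", "VB"]
START = {"DT": 0.7, "NN": 0.2, "MD": 0.05, "VB": 0.05}   # P(t1)
TRANS = {                                                # P(t_i | t_{i-1})
    "DT": {"DT": 0.01, "NN": 0.89, "MD": 0.05, "VB": 0.05},
    "NN": {"DT": 0.1, "NN": 0.2, "MD": 0.4, "VB": 0.3},
    "MD": {"DT": 0.1, "NN": 0.1, "MD": 0.01, "VB": 0.79},
    "VB": {"DT": 0.4, "NN": 0.2, "MD": 0.1, "VB": 0.3},
}
EMIT = {                                                 # P(word | tag)
    "DT": {"the": 0.7},
    "NN": {"bird": 0.3, "can": 0.02, "fly": 0.02},
    "MD": {"can": 0.7},
    "VB": {"can": 0.05, "fly": 0.3},
}

def viterbi(words):
    """T' = argmax_T P(W|T) * P(T), computed left to right."""
    # Each column maps tag -> (best probability so far, best tag path).
    col = {t: (START[t] * EMIT[t].get(words[0], 1e-8), [t]) for t in TAGS}
    for w in words[1:]:
        nxt = {}
        for t in TAGS:
            prob, path = max(
                (col[p][0] * TRANS[p][t] * EMIT[t].get(w, 1e-8), col[p][1])
                for p in TAGS)
            nxt[t] = (prob, path + [t])
        col = nxt
    return max(col.values())[1]

print(viterbi(["the", "bird", "can", "fly"]))  # ['DT', 'NN', 'MD', 'VB']
```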
1. Accuracy of Stochastic Taggers
• Stochastic taggers (like HMM-based taggers) have an accuracy of 96-97%.
• This accuracy is measured per word, meaning that while individual word tagging is quite accurate, the
probability of making at least one mistake in a full sentence increases.
• For example, if a sentence has 20 words, the probability of a fully correct sentence is:
• 0.96^20 ≈ 44%. This means that about 56% of sentences will have at least one tagging error.
2. Importance of a Tagged Corpus
• One drawback of stochastic taggers is that they need a manually tagged corpus for training.
• However, Kupiec (1992) and Cutting et al. (1992) showed that an HMM tagger can also be trained from
unannotated text.
• Even though it is possible to use an unannotated corpus, a manually tagged corpus improves performance.
• A tagger trained on a hand-coded corpus performs better than one trained on an unannotated text.
Example: The/DT bird/NNP can/MD fly/VB
Where:
• DT = Determiner
• NNP = Proper Noun
• MD = Modal Verb
• VB = Verb (Base Form)
Hybrid Taggers
Hybrid taggers combine rule-based and stochastic approaches for part-of-speech (POS) tagging.
Key Points:
1. Uses Rules for Tagging:
1. Unlike purely statistical models, hybrid taggers use rules to assign POS tags to words.
2. These rules can be automatically learned from data instead of being manually defined.
2. Machine Learning-Based Approach:
1. Hybrid taggers, like stochastic models, rely on machine learning techniques.
2. Instead of manually designing rules, the model learns rules from annotated training data.
3. Transformation-Based Learning (TBL):
1. Brill Tagging, introduced by E. Brill (1995), is an example of a hybrid approach.
2. TBL is an error-driven learning method:
1. The tagger first assigns initial POS tags using a simple rule-based or statistical method.
2. It then iteratively refines these tags by applying transformation rules that correct errors.
4. Applications of TBL:
1. Brill’s method has been successfully used in various NLP tasks, including:
1. POS tagging
2. Speech generation
3. Syntactic parsing
2. It has been studied and improved upon by various researchers, including Brill (1993, 1994) and Huang et al. (1994).
Transformation-Based Learning (TBL) Process
How TBL Works
1.Input:
1. A tagged corpus (training data with correct tags).
2. A lexicon (dictionary with the most frequent word-tag associations).
2.Initial Tagging:
1. Each word is assigned the most likely tag based on the lexicon.
2. This forms the initial state of tagging.
3.Transformation Learning:
1. A set of transformation rules is iteratively applied to improve the tagging.
2. The best transformation rule (one that results in the highest improvement) is selected
and applied.
3. This process continues until a stopping criterion is met (e.g., no further improvement).
4.Final Output:
1. A ranked sequence of transformation rules is learned.
2. These rules can then be applied to new, untagged text for automatic tagging.
TBL tagging algorithm
Transformation-Based Learning (TBL) in POS Tagging
• TBL is a hybrid tagging approach that combines rule-based and statistical methods. The process follows an iterative mechanism where the system starts with an initial tagging and gradually refines it using transformation rules.
Overview of TBL Process
1.Initialization:
1. The input is a tagged corpus and a lexicon.
2. An initial state annotator assigns the most likely Part-of-Speech (POS) tag based on a lexicon.
2.Transformation Rules Application:
1. An ordered set of transformation rules is applied sequentially.
2. The best rule that improves tagging the most is selected at each step.
3. A manually tagged corpus is used as the ground truth for training.
4. The process iterates until a stopping condition is met (e.g., no further improvement).
3.Final Output:
1. A ranked list of learned transformation rules is generated.
2. New text can then be tagged by first assigning the most frequent tag and applying learned
transformations.
Example of TBL Rules and Application
Transformation Rules Table (Table 3.12)
• Rules follow a specific format:
Change tag A to tag B if a certain condition is met.
Example rules:
Change NN (noun) → VB (verb) if the previous word is "to" (tagged TO).
Example: "to/TO fish/NN" → "to/TO fish/VB"
Change JJ (adjective) → RB (adverb) if the previous word is tagged VBZ.
Example: "Runs/VBZ fast/JJ" → "Runs/VBZ fast/RB"
Example of Rule Application
Probability of "fish" being a noun (NN): 0.91
Probability of "fish" being a verb (VB): 0.09
• Initially, fish is tagged as NN (its most frequent tag) in both sentences:
I/PRP like/VB to/TO eat/VB fish/NN
I/PRP like/VB to/TO fish/NN
The second sentence is mis-tagged. After applying TBL, a rule is learned:
• Change NN to VB if the previous tag is TO.
• Corrected tagging: like/VB to/TO fish/NN → like/VB to/TO fish/VB
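A minimal sketch of applying one learned transformation rule of this form to an initially tagged sentence:

```python
# A learned rule: (from_tag, to_tag, required_previous_tag)
RULE = ("NN", "VB", "TO")

def apply_rule(tagged, rule):
    """tagged: list of (word, tag) pairs; retag following the rule."""
    frm, to, prev_req = rule
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == frm and out[i - 1][1] == prev_req:
            out[i] = (word, to)
    return out

initial = [("I", "PRP"), ("like", "VB"), ("to", "TO"), ("fish", "NN")]
print(apply_rule(initial, RULE))
# [('I', 'PRP'), ('like', 'VB'), ('to', 'TO'), ('fish', 'VB')]
```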
Efficiency and Applications of TBL
1.Efficiency Improvements
1. Indexing words in a training corpus can speed up transformations.
2. Finite state transducers have been explored to optimize pattern-action rule compilation.
2.Comparison with Other Taggers
1. Roche & Schabes (1997) applied TBL using finite-state transducers, making the process faster.
2. Their version is larger but significantly faster than Brill’s original tagger.
3.POS Tagging in Different Languages
1. Most POS tagging research focuses on English & European languages.
2. Indian languages face challenges due to limited annotated corpora.
3. Examples of Indian language POS taggers:
1. Bengali tagger (HMM-based, Sandipan et al., 2004)
2. Hindi tagger (Decision tree-based, Smriti et al., 2006)
3. Developed as part of the NLPAI-2006 machine learning contest.
4.Challenges in Urdu Tagging
1. More complex due to:
1. Right-to-left script.
2. Grammatical influence from Arabic & Persian.
2. Hardie (2003) created an Urdu POS tag set under the EMILLE project (Enabling Minority Language Engineering).
Unknown Words
• Unknown words refer to words not present in a dictionary or training corpus. These words create challenges
during Part-of-Speech (POS) tagging, as the model does not have prior information to correctly assign tags.
Solutions to Handle Unknown Words
Several strategies are mentioned in the text to tackle this issue:
1. Assigning the Most Frequent Tag
1. The simplest method is to assign the most common POS tag in the training corpus to the unknown word.
2. For example, if NN (Noun) is the most frequently assigned tag, an unknown word is tagged as NN.
3. Limitation: This approach can lead to misclassifications, especially for words that are not nouns.
2. Assuming Open-Class Tags
1. Open-class words include nouns, verbs, adjectives, and adverbs (as opposed to closed-class words like prepositions and
conjunctions).
2. The unknown word is initially assigned an open-class tag.
3. Later, the tag is disambiguated based on probabilities.
3. Using Morphological Information (Affixes & Prefixes)
1. The model analyzes word structure to predict the POS tag.
2. Example:
1. Words ending in -ing are likely verbs (e.g., "running").
2. Words ending in -ly are likely adverbs (e.g., "quickly").
3. The tag is assigned based on words in the training set with the same suffix or prefix.
4. Brill's Tagger Approach: Brill's tagger learns transformation rules that guess the tags of unknown words from cues such as prefixes, suffixes, and capitalization.
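A toy sketch of the suffix-based guessing described above (the suffix-to-tag table is invented for illustration):

```python
# Guess the tag of an unknown word from its suffix, falling back to
# the most frequent tag (NN), as in strategy 1.
SUFFIX_TAGS = [("ing", "VBG"), ("ly", "RB"), ("ed", "VBD"), ("s", "NNS")]

def guess_tag(word: str, default: str = "NN") -> str:
    for suffix, tag in SUFFIX_TAGS:
        if word.endswith(suffix):
            return tag
    return default

print(guess_tag("blogging"))  # VBG
print(guess_tag("swiftly"))   # RB
print(guess_tag("gizmo"))     # NN
```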
