Module 2 Chap1
4. Part-of-Speech Tagging:
• Computers need to know whether a word is a noun, verb, adjective, etc. to understand the sentence.
• Example: In "The fish swims", "fish" is a noun. But in "I fish in the river", "fish" is a verb.
These patterns highlight words inside sentences, making them useful for
searching or text filtering.
Character Classes (Grouping Characters Together)
• Character classes let you define multiple possible matches inside square brackets [ ].
• Instead of searching for just one word, we can group multiple choices together.
Example: Finding Multiple Letters
🔹 /[abcd]/ → Finds any one of the letters a, b, c, or d in a word.
• Will match words like "bad", "cab", "dad", etc.
🔹 /[0123456789]/ → Finds any digit (0 to 9).
• Instead of listing each character separately, we can use a dash (-) to define a range.
• This can be shortened to /[0-9]/ for convenience.
• /[5-9]/ → Matches any number from 5 to 9 (so it will match 5, 6, 7, 8, or 9).
• /[m-p]/ → Matches any letter between m and p (so it will match m, n, o, or p).
• If we don't want to match a specific character, we can use a caret (^) as the first symbol inside the square brackets.
Ex: /[^x]/ → This pattern matches any character EXCEPT 'x'.
If applied to "example", it will match "e", "a", "m", "p", "l", and "e", but NOT 'x'.
Case Sensitivity in RE
•Regular expressions are case-sensitive.
•The pattern /s/ only matches lowercase 's', but not uppercase 'S'.
•Ex: /sana/ will match "sana", but NOT "Sana".
•To match both lowercase and uppercase, use /[sS]ana/. This matches either 'sana' or 'Sana'.
Making Characters Optional (?)
Sometimes, we want to match both singular and plural words like "supernova" and "supernovas".
•Pattern: /supernovas?/
•What does ? do?
•The ? makes the preceding character optional.
•So, "supernova" and "supernovas" both match!
•Doesn't match: "supernovaS" (capital S)
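These patterns can be tried directly in code. A quick sketch using Python's re module (the patterns are exactly those shown above; re.findall returns every non-overlapping match):

```python
import re

print(re.findall(r"[abcd]", "bad cab"))     # ['b', 'a', 'd', 'c', 'a', 'b']
print(re.findall(r"[0-9]", "room 42"))      # ['4', '2']
print(re.findall(r"[^x]", "ex"))            # ['e'] -- anything except 'x'
print(re.findall(r"[sS]ana", "Sana sana"))  # ['Sana', 'sana']
print(re.findall(r"supernovas?", "supernova supernovas"))
# ['supernova', 'supernovas'] -- the '?' makes the final 's' optional
```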
Example: Validate an Email Address Using RE
^[A-Za-z0-9_\.\-]+@[A-Za-z0-9_\.\-]+[A-Za-z0-9][A-Za-z0-9_]$
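A minimal sketch of using this pattern in Python (this is the simplified classroom pattern from above, not a full RFC 5322 validator):

```python
import re

# The pattern shown above, compiled once for reuse.
EMAIL = re.compile(r"^[A-Za-z0-9_\.\-]+@[A-Za-z0-9_\.\-]+[A-Za-z0-9][A-Za-z0-9_]$")

print(bool(EMAIL.match("user.name@example.com")))   # True
print(bool(EMAIL.match("no-at-sign.example.com")))  # False (no '@')
```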
3. Infix: Appears inside the stem (less common in English).
Ex (Sanskrit): "पिबामि" (pibāmi) → "पिपासामि" (pipāsāmi)
Root: पिब् (pib, meaning "to drink"); Infix: -सा-; New meaning: "to feel thirsty"
Here, the infix -सा- is inserted within the root to modify the meaning.
4. Circumfix: Appears on both sides of the stem.
Ex (Kannada): ಅ- + ಸಂತೋಷ ("happiness") + -ತೆ = ಅಸಂತೋಷತೆ ("unhappiness")
Three main processes of word formation in linguistics:
1. Inflection: A root word is combined with a grammatical morpheme to create a new form without changing its word class.
Example: Adding -s to "cat" → "cats" (plural form of the noun).
2. Derivation: A word stem is combined with a morpheme to create a new word belonging to a different grammatical class. This process includes nominalization, where verbs or adjectives become nouns.
Example: "compute" (verb) → "computation" (noun).
3. Compounding: Two or more words are merged to form a new word with a distinct meaning.
Example: "desktop" (desk + top), "overlook" (over + look).
Importance of Morphological Analysis in NLP (Natural Language Processing)
• Helps in word formation and understanding new words.
• Used in spelling correction, machine translation, and information retrieval.
• Helps in parsing, which involves breaking a sentence into syntactic, semantic, and pragmatic
structures.
Morphological parsing takes an inflected surface form (the actual word
as it appears in a sentence) and analyzes its structure to produce:
1.The lemma (canonical form) – The base form of the word.
2.Morphological features – Information about tense, number, gender,
person, case, etc.
Ex: Consider the word "running" as the input.
Input (inflected surface form): "running"
Morphological parsing output:
• Lemma: "run"
• Morphological features: verb, present participle (-ing form)
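A toy sketch of this idea in Python, using a single hand-written rule for -ing forms (an assumption for illustration; a real analyzer uses a full lexicon and many more rules, often as a finite-state transducer):

```python
def parse(word):
    """Toy morphological parser: strip -ing and undo consonant doubling."""
    if word.endswith("ing"):
        stem = word[:-3]
        if len(stem) > 2 and stem[-1] == stem[-2]:  # "runn" -> "run"
            stem = stem[:-1]
        return stem, {"pos": "verb", "form": "present participle"}
    return word, {}  # no analysis: return the word unchanged

print(parse("running"))
# ('run', {'pos': 'verb', 'form': 'present participle'})
```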
There are different types of spelling errors. They can be categorized into typing errors, Optical Character Recognition (OCR) errors, phonetic errors, and non-word vs. real-word errors.
Optical Character Recognition (OCR) Errors
OCR errors occur when a computer scans a printed or handwritten text and
misinterprets characters due to visual similarity. These errors are different from
typing errors and usually involve incorrect letter recognition.
Common OCR errors:
• Substitution due to visual similarity
Correct: "come"
OCR Error: "eome" (c mistaken for e)
• Multi-substitution (framing errors)
Correct: "m"
OCR Error: "rn" (a single 'm' recognized as the two letters 'r' and 'n')
• Space deletion or insertion
Correct: "pen ink"
OCR Error: "penink" (missing space)
• Failure (not recognizing a character at all)
Correct: "lion"
OCR Error: the unrecognized character is dropped or replaced by a reject marker (e.g., "l?on")
Phonetic Spelling Errors
These errors happen when a word is misspelled in a way that sounds similar
to the correct pronunciation. Unlike typing errors, phonetic errors may not
be immediately obvious because they "sound correct" when spoken.
Examples of Phonetic Errors:
• "rite" instead of "right"
• "no" instead of "know"
• "their" instead of "there"
• These errors are common in speech recognition, transliteration, and voice
typing.
Non-Word vs. Real-Word Errors
Spelling errors can be categorized into two major types:
• Non-word Errors: The misspelled word does not exist in the dictionary.
• Real-word Errors: The misspelled word is a real word, but it is the wrong word in context.
(A) Non-Word Errors
Occurs when a word is misspelled into something that isn’t a valid word.
Correct: "receive"
Non-word Error: "recieve"
Detection Method: Dictionary lookup and N-gram analysis
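A minimal sketch of dictionary lookup in Python, using a tiny toy lexicon as a stand-in for a real dictionary. Note how a real-word error slips through, which motivates the context-based methods below:

```python
# Toy lexicon standing in for a full dictionary.
LEXICON = {"i", "would", "like", "a", "piece", "peace", "of", "cake", "receive"}

def non_word_errors(text):
    """Flag every token that does not appear in the lexicon."""
    return [w for w in text.lower().split() if w not in LEXICON]

print(non_word_errors("recieve"))                      # ['recieve'] -- detected
print(non_word_errors("i would like a peace of cake")) # [] -- real-word error missed
```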
(B) Real-Word Errors
Occurs when a word is incorrectly replaced by another valid word, leading to semantic errors. These are
harder to detect because the word itself is correct, but it is wrong in the context.
Correct: "I would like a piece of cake.“ and "The soldiers fought for peace."
Real-Word Errors:
• "I would like a peace of cake." (Incorrect meaning)
• "The soldiers fought for piece." (Incorrect meaning)
Detection Method:
• Context analysis using AI models
Spelling Correction: Detection and Correction Methods
Spelling correction consists of two main processes:
1. Error Detection → Identifying misspelled words.
2. Error Correction → Suggesting the correct words.
These problems are addressed in two ways:
1. Isolated-Word Error Detection and Correction
Each word is checked separately, without considering the surrounding words.
Example Method: Dictionary (Lexicon) Lookup
Challenges of Isolated-Word Error Detection:
• Requires a large lexicon (dictionary), which takes time and storage.
• Some languages have too many words to list in a dictionary.
• Fails when an error results in a real-word mistake (e.g., "theses" instead of "these").
• In a large lexicon, mistakes might go undetected because incorrect words can still exist in the
dictionary.
Example:
• Correct word: "receive"
• Misspelled word: "recieve" → Detected as incorrect because "recieve" is not in the dictionary.
• Real-word error: "peace" instead of "piece" → Not detected because both are valid words.
2. Context-Dependent Error Detection and Correction
Checks the meaning and grammatical context of the word to detect errors.
Example Method: Grammar & Language Processing
Advantages of Context-Dependent Detection:
• Can detect real-word errors that isolated-word detection misses.
• Uses grammatical rules to ensure correct usage.
Example:
"I will meat you tomorrow."
Incorrect (should be "meet")
Context-aware detection catches this error.
"Their going to the store."
Incorrect ("Their" should be "They're")
Context-aware detection fixes this.
Process:
1. First, use isolated-word detection to generate a list of candidate corrections.
2. Then, use the surrounding context to select the most likely candidate.
Spelling Correction Algorithms:
1. Minimum Edit Distance (Levenshtein Distance)
A way to measure how different two words are by counting:
• Insertions (adding a letter)
• Deletions (removing a letter)
• Substitutions (changing a letter)
Example:
Convert "hte" to "the"
1. Substitute 't' for 'h' and 'h' for 't' → "the" (2 substitutions)
Edit Distance = 2 (a swap of adjacent letters counts as a single operation only if transpositions are allowed, as in Damerau-Levenshtein distance, which would give 1)
Example:
Convert "speling" to "spelling"
1. Insert 'l' → "spelling" (1 insertion)
Edit Distance = 1
• The lower the edit distance, the more similar the words are.
2. Similarity Key Techniques
This method converts a string into a key so that similar words share the
same key.
Example: SOUNDEX System (Odell and Russell, 1918)
• Used in phonetic spelling correction.
• Groups words that sound alike but may be spelled differently.
Example:
"Robert" and "Rupert" both have Soundex key R163.
Helps match similar-sounding words even with different spellings.
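A simplified Soundex sketch in Python (it omits the full algorithm's special treatment of 'h' and 'w', but reproduces the classic digit mapping and the example above):

```python
def soundex(word):
    """Simplified Soundex: keep the first letter, encode the rest as digits,
    collapse adjacent duplicate digits, pad/truncate to 4 characters."""
    codes = {}
    for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")]:
        for ch in letters:
            codes[ch] = digit
    word = word.upper()
    digits = [codes.get(ch, "") for ch in word]  # vowels etc. -> "" (not coded)
    key, prev = word[0], digits[0]
    for d in digits[1:]:
        if d and d != prev:
            key += d
        prev = d
    return (key + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```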
3. N-Gram Based Techniques
N-grams (sequences of N letters) help detect both non-word and real-word errors.
• Some letter combinations never occur or are rare in English.
• If a word contains a rare N-gram, it might be a spelling error.
Example:
• Rare tri-gram: "qst" → Likely incorrect.
• Rare bi-gram: "qd" → Likely incorrect.
• A large dictionary (corpus) is used to check valid combinations of letters.
For real-word errors, N-grams predict which letters should follow others based on
probability.
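A toy sketch of N-gram checking in Python, collecting attested character bigrams from a tiny stand-in lexicon (a real system would use a large corpus):

```python
from collections import Counter

def valid_ngrams(corpus_words, n=2):
    """Collect the character n-grams observed in a corpus of valid words."""
    grams = Counter()
    for w in corpus_words:
        w = w.lower()
        for i in range(len(w) - n + 1):
            grams[w[i:i + n]] += 1
    return grams

def suspicious(word, grams, n=2):
    """Return the n-grams of `word` never seen in the corpus (possible errors)."""
    word = word.lower()
    return [word[i:i + n] for i in range(len(word) - n + 1)
            if word[i:i + n] not in grams]

lexicon = ["question", "quick", "squad"]  # toy stand-in for a large corpus
grams = valid_ngrams(lexicon)
print(suspicious("qd", grams))        # ['qd'] -> likely a spelling error
print(suspicious("question", grams))  # []    -> all bigrams attested
```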
4. Neural Networks
AI-based Neural Networks detect and correct spelling errors using
pattern recognition.
• Trained on large datasets to identify common spelling errors.
• Can adapt to noisy or incomplete text.
Drawback: Computationally expensive to train and use.
Example:
Misspelled input: "teh qick brwn fox"
Neural network output: "the quick brown fox"
5. Rule-Based Techniques
Uses predefined spelling rules (heuristics) to correct common errors.
Example:
• If "ue" is often mistyped as "eu", create a rule to swap them.
• Error: "euestion" → Correction: "question"
🔹 Other common rules:
• "i" before "e" except after "c" (e.g., "believe", but "receive").
• Silent 'e' rules: "make" → "making" (drop 'e' before adding '-ing').
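A minimal sketch of such heuristic rules in Python (the rule list is hypothetical; real systems also verify that a rewrite produces a valid word before accepting it):

```python
# Hypothetical rule list: (typo substring, correction).
RULES = [("eu", "ue"), ("hte", "the")]

def rule_correct(word):
    """Apply each swap rule wherever its typo substring occurs."""
    for wrong, right in RULES:
        word = word.replace(wrong, right)
    return word

print(rule_correct("qeustion"))  # question
print(rule_correct("hte"))       # the
```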
Minimum Edit Distance
• The Minimum Edit Distance (MED) is a metric used to determine the similarity
between two strings. It calculates the smallest number of insertions, deletions, and
substitutions required to transform one string into another. This concept is widely
used in Natural Language Processing (NLP), DNA sequence alignment, and spell
checking.
The minimum edit distance between two strings, s (source) and t (target), is the
smallest number of operations needed to convert s into t. The three operations
allowed are:
1.Insertion: Adding a character.
2.Deletion: Removing a character.
3.Substitution: Replacing one character with another.
For example, consider transforming "tutor" into "tumour":
• Substitute 'm' for the second 't'.
• Insert 'u' before 'r'.
Edit Distance Function
• The edit distance function ed(s, t) is symmetric, meaning that the cost of converting
s → t is the same as t → s:
ed(s,t)=ed(t,s)
String Alignment
The minimum edit distance can be visualized using string alignment.
Example:
Convert "tutor" to "tumour".
•Optimal alignment (cost = 2):
t u t o - r
t u m o u r
(substitute 'm' for the second 't'; insert 'u' before 'r')
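The distance itself is computed with dynamic programming. A compact Python sketch of the standard algorithm, with unit costs for insertion, deletion, and substitution:

```python
def min_edit_distance(s, t):
    """Levenshtein distance between s and t via dynamic programming."""
    m, n = len(s), len(t)
    # dp[i][j] = edit distance between s[:i] and t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of s[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[m][n]

print(min_edit_distance("tutor", "tumour"))     # 2
print(min_edit_distance("speling", "spelling")) # 1
```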
Words in a language are classified into categories based on their syntactic (grammatical)
role and morphological behavior (how they change form). These classifications help us
understand the function of words in sentences.
Common Lexical Categories:
• Nouns (NN): Represent people, places, things, or concepts. (Examples: student, chair,
proof, mechanism)
• Verbs (VB): Indicate actions or states. (Examples: study, increase, produce)
• Adjectives (JJ): Describe nouns. (Examples: large, high, tall, few)
• Adverbs (RB): Modify verbs, adjectives, or other adverbs. (Examples: carefully, slowly, uniformly)
• Prepositions (IN): Show relationships between words in a sentence. (Examples: in, on, to,
of)
• Pronouns (PRP): Replace nouns. (Examples: I, me, they)
Open vs. Closed Word Classes
Words are also categorized as open or closed classes:
Open Word Classes (Can Expand)
These categories frequently allow the addition of new words:
• Nouns (new words like "selfie," "blog")
• Verbs (new words like "google," "texting")
• Adjectives (new words like "woke," "viral")
• Adverbs
• Interjections
Closed Word Classes (Rarely Expand)
These categories are more stable and rarely gain new words:
• Prepositions (e.g., "on," "in," "of")
• Auxiliary Verbs (e.g., "is," "was," "have")
• Conjunctions (e.g., "and," "but," "or")
• Pronouns (e.g., "he," "she," "they")
• Determiners (e.g., "the," "a," "some")
Part-of-Speech (POS) tagging
POS tagging is the process of labeling words in a sentence with their
corresponding grammatical category, such as noun, verb, adjective, or
preposition. This process helps in understanding the function of words in a
sentence.
Example:
1.The word "book" can function as:
1. Noun: I am reading a good book.
2. Verb: The police booked the snatcher.
➝ In the first sentence, "book" is a noun (NN).
➝ In the second sentence, "booked" is a verb (VBD - past tense).
2.The word "sona" (Hindi) can mean:
3. Gold (Noun)
4. Sleep (Verb)
• Challenge: Some words can belong to multiple categories, so POS tagging
POS Tag Sets
A tag set is a predefined collection of tags used by a POS tagger. Different tag sets exist,
such as:
• Penn Treebank (45 tags)
• C7 tagset (164 tags)
• TOSCA-ICE (270 tags)
• TESS (200 tags)
Example:
• The Penn Treebank tag set is widely used because English is not a morphologically rich
language.
• However, languages with more inflections (e.g., Hindi, Tamil) may need bigger tag sets.
• English verbs change forms based on tense, subject, and aspect.
Here, the tag of a word tₙ is predicted using the tags of the two previous words, tₙ₋₂ and tₙ₋₁. (In the figure, the gray-shaded area represents the context used for tagging.)
Hidden Markov Model (HMM) Tagger
• The HMM tagger is a probabilistic model that uses two layers:
• Visible layer – the sequence of words in a sentence.
• Hidden layer – the sequence of POS tags.
• The tags are "hidden" during actual text processing, meaning we only
see words, but the model must infer the correct tag sequence.
Here the objective is to find the most probable sequence of POS tags T for a given sentence W.
Given: W = w₁, w₂, ..., wₙ and T = t₁, t₂, ..., tₙ
we want to find: T' = argmax_T P(T|W)
By Bayes' theorem, P(T|W) can be rewritten as: P(T|W) = P(W|T) × P(T) / P(W)
Since P(W) remains the same for all tag sequences, it can be dropped, giving:
T' = argmax_T P(W|T) × P(T)
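A toy sketch of this maximization with the Viterbi algorithm in Python. All probabilities below are made-up illustrative values, not trained estimates; a bigram transition model stands in for P(T), and log probabilities avoid underflow:

```python
import math

# Toy bigram HMM (illustrative numbers only).
TAGS = ["PRP", "VB", "TO", "NN"]
TRANS = {("<s>", "PRP"): 0.8, ("PRP", "VB"): 0.6, ("VB", "TO"): 0.4,
         ("TO", "VB"): 0.95, ("TO", "NN"): 0.05}
EMIT = {("PRP", "i"): 0.5, ("VB", "like"): 0.2, ("TO", "to"): 0.9,
        ("VB", "fish"): 0.09, ("NN", "fish"): 0.91}

def logp(table, key):
    return math.log(table.get(key, 1e-12))  # tiny floor for unseen events

def viterbi(words):
    """Return argmax_T P(W|T) * P(T) under the toy bigram model."""
    # best[tag] = (log prob of best path ending in tag, that path)
    best = {t: (logp(TRANS, ("<s>", t)) + logp(EMIT, (t, words[0])), [t])
            for t in TAGS}
    for w in words[1:]:
        best = {t: max(((p + logp(TRANS, (prev, t)) + logp(EMIT, (t, w)),
                         path + [t])
                        for prev, (p, path) in best.items()),
                       key=lambda x: x[0])
                for t in TAGS}
    return max(best.values(), key=lambda x: x[0])[1]

print(viterbi(["i", "like", "to", "fish"]))  # ['PRP', 'VB', 'TO', 'VB']
```

Note how the strong transition P(VB|TO) outweighs the emission preference for fish/NN, the same disambiguation the TBL example later performs with an explicit rule.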
1. Accuracy of Stochastic Taggers
• Stochastic taggers (like HMM-based taggers) have an accuracy of 96-97%.
• This accuracy is measured per word, meaning that while individual word tagging is quite accurate, the
probability of making at least one mistake in a full sentence increases.
• For example, if a sentence has 20 words, the probability of a fully correct sentence is:
• 0.96^20 ≈ 44%. This means that about 56% of sentences will have at least one tagging error.
2. Importance of a Tagged Corpus
• One drawback of stochastic taggers is that they need a manually tagged corpus for training.
• However, Kupiec (1992) and Cutting et al. (1992) showed that an HMM tagger can also be trained from
unannotated text.
• Even though it is possible to use an unannotated corpus, a manually tagged corpus improves performance.
• A tagger trained on a hand-coded corpus performs better than one trained on an unannotated text.
Example: The bird can fly. → The/DT bird/NNP can/MD fly/VB
Where:
• DT = Determiner
• NNP = Proper Noun
• MD = Modal Verb
• VB = Verb (Base Form)
Hybrid Taggers
Hybrid taggers combine rule-based and stochastic approaches for part-of-speech (POS) tagging.
Key Points:
1. Uses Rules for Tagging:
1. Unlike purely statistical models, hybrid taggers use rules to assign POS tags to words.
2. These rules can be automatically learned from data instead of being manually defined.
2. Machine Learning-Based Approach:
1. Hybrid taggers, like stochastic models, rely on machine learning techniques.
2. Instead of manually designing rules, the model learns rules from annotated training data.
3. Transformation-Based Learning (TBL):
1. Brill Tagging, introduced by E. Brill (1995), is an example of a hybrid approach.
2. TBL is an error-driven learning method:
1. The tagger first assigns initial POS tags using a simple rule-based or statistical method.
2. It then iteratively refines these tags by applying transformation rules that correct errors.
4. Applications of TBL:
1. Brill’s method has been successfully used in various NLP tasks, including:
1. POS tagging
2. Speech generation
3. Syntactic parsing
2. It has been studied and improved upon by various researchers, including Brill (1993, 1994) and Huang et al. (1994).
Transformation-Based Learning (TBL) Process
How TBL Works
1.Input:
1. A tagged corpus (training data with correct tags).
2. A lexicon (dictionary with the most frequent word-tag associations).
2.Initial Tagging:
1. Each word is assigned the most likely tag based on the lexicon.
2. This forms the initial state of tagging.
3.Transformation Learning:
1. A set of transformation rules is iteratively applied to improve the tagging.
2. The best transformation rule (one that results in the highest improvement) is selected
and applied.
3. This process continues until a stopping criterion is met (e.g., no further improvement).
4.Final Output:
1. A ranked sequence of transformation rules is learned.
2. These rules can then be applied to new, untagged text for automatic tagging.
TBL tagging algorithm
Transformation-Based Learning (TBL) in POS Tagging
• The images describe Transformation-Based Learning (TBL), a hybrid tagging approach that
combines rule-based and statistical methods. The process follows an iterative mechanism
where the system starts with an initial tagging and gradually refines it using
transformation rules.
Overview of TBL Process
1.Initialization:
1. The input is a tagged corpus and a lexicon.
2. An initial state annotator assigns the most likely Part-of-Speech (POS) tag based on a lexicon.
2.Transformation Rules Application:
1. An ordered set of transformation rules is applied sequentially.
2. The best rule that improves tagging the most is selected at each step.
3. A manually tagged corpus is used as the ground truth for training.
4. The process iterates until a stopping condition is met (e.g., no further improvement).
3.Final Output:
1. A ranked list of learned transformation rules is generated.
2. New text can then be tagged by first assigning the most frequent tag and applying learned
transformations.
Example of TBL Rules and Application
Transformation Rules Table (Table 3.12)
• Rules follow a specific format:
Change tag A to tag B if a certain condition is met.
Example rules
Change NN (noun) → VB (verb) if the previous
word is "TO".
Example: "To fish/NN" → "To fish/VB“
Change JJ (adjective) → RB (adverb) if the
previous word is tagged VBZ.
Example: "Runs/VBZ fast/JJ" → "Runs/VBZ fast/RB"
Example of Rule Application
Probability of "fish" being a noun (NN): 0.91
Probability of "fish" being a verb (VB): 0.09
• Initially, fish is tagged as NN in both sentences:
I/PRP like/VB to/TO eat/VB fish/NN
I/PRP like/VB to/TO fish/NN
The second sentence is mis-tagged: there, "fish" is a verb. After applying TBL, a rule is learned:
• Change NN to VB if the previous tag is TO.
• Corrected tagging: like/VB to/TO fish/NN → like/VB to/TO fish/VB
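A minimal sketch of applying one such learned transformation in Python (the (word, tag) pair representation and the rule signature are assumptions for illustration; they follow the "change A to B if the previous tag is C" template above):

```python
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    """Change from_tag to to_tag wherever the previous token's tag is prev_tag."""
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (word, to_tag)
    return out

sent = [("I", "PRP"), ("like", "VB"), ("to", "TO"), ("fish", "NN")]
print(apply_rule(sent, "NN", "VB", "TO"))
# [('I', 'PRP'), ('like', 'VB'), ('to', 'TO'), ('fish', 'VB')]
```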
Efficiency and Applications of TBL
1.Efficiency Improvements
1. Indexing words in a training corpus can speed up transformations.
2. Finite state transducers have been explored to optimize pattern-action rule compilation.
2.Comparison with Other Taggers
1. Roche & Schabes (1997) applied TBL using finite-state transducers, making the process faster.
2. Their version is larger but significantly faster than Brill’s original tagger.
3.POS Tagging in Different Languages
1. Most POS tagging research focuses on English & European languages.
2. Indian languages face challenges due to limited annotated corpora.
3. Examples of Indian language POS taggers:
1. Bengali tagger (HMM-based, Sandipan et al., 2004)
2. Hindi tagger (Decision tree-based, Smriti et al., 2006)
3. Developed as part of the NLPAI-2006 machine learning contest.
4.Challenges in Urdu Tagging
1. More complex due to:
1. Right-to-left script.
2. Grammatical influence from Arabic & Persian.
2. Hardie (2003) created an Urdu POS tag set under the EMILLE project (Enabling Minority Language Engineering).
Unknown Words
• Unknown words refer to words not present in a dictionary or training corpus. These words create challenges
during Part-of-Speech (POS) tagging, as the model does not have prior information to correctly assign tags.
Solutions to Handle Unknown Words
Several strategies are mentioned in the text to tackle this issue:
1. Assigning the Most Frequent Tag
1. The simplest method is to assign the most common POS tag in the training corpus to the unknown word.
2. For example, if NN (Noun) is the most frequently assigned tag, an unknown word is tagged as NN.
3. Limitation: This approach can lead to misclassifications, especially for words that are not nouns.
2. Assuming Open-Class Tags
1. Open-class words include nouns, verbs, adjectives, and adverbs (as opposed to closed-class words like prepositions and
conjunctions).
2. The unknown word is initially assigned an open-class tag.
3. Later, the tag is disambiguated based on probabilities.
3. Using Morphological Information (Affixes & Prefixes)
1. The model analyzes word structure to predict the POS tag.
2. Example:
1. Words ending in -ing are likely verbs (e.g., "running").
2. Words ending in -ly are likely adverbs (e.g., "quickly").
3. The tag is assigned based on words in the training set with the same suffix or prefix (see the sketch at the end of this list).
4. Brill’s Tagger Approach
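To illustrate strategy 3 above, a hedged Python sketch of suffix-based tagging with a fallback to the most frequent tag (the suffix-to-tag list and the lexicon are toy assumptions, not trained data):

```python
# Toy suffix -> tag heuristics (order matters: longer suffixes first).
SUFFIX_TAGS = [("ing", "VBG"), ("ly", "RB"), ("ed", "VBD"), ("s", "NNS")]

def guess_tag(word, lexicon):
    """Lexicon lookup first; then suffix rules; then the most frequent tag."""
    if word in lexicon:
        return lexicon[word]
    for suffix, tag in SUFFIX_TAGS:
        if word.endswith(suffix):
            return tag
    return "NN"  # fall back to the corpus's most frequent tag

lexicon = {"the": "DT", "run": "VB"}
print(guess_tag("running", lexicon))  # VBG (suffix rule)
print(guess_tag("quickly", lexicon))  # RB  (suffix rule)
print(guess_tag("blorp", lexicon))    # NN  (default)
```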