UNIT NO 2

The document is a question bank for an NLP end-semester exam, covering topics such as Named Entity Recognition (NER), Morphology, and Maximum Entropy principle. It includes detailed explanations, key concepts, applications, advantages, and disadvantages of each topic. The document serves as a comprehensive guide for students preparing for their NLP assessments.

Uploaded by

Darshan Tipale
Copyright
© © All Rights Reserved

NLP END SEM QUESTION BANK

UNIT NO: 2
Q. No Question Marks
1 Show the working of Named Entity Recognition (NER) with an appropriate example. 6
ANS  Named Entity Recognition (NER):
 Named Entity Recognition (NER) is a key task in Natural Language Processing (NLP) that
involves the identification and classification of named entities in unstructured text, such as
people, organizations, locations, dates, and other relevant information.
 NER is used in various NLP applications such as information extraction, sentiment analysis,
question-answering, and recommendation systems.

 Key concepts related to NER:


1) Named Entity: Any word or group of words that refers to a specific person, place,
organization, or other object or concept.
2) Corpus: A collection of texts used for language analysis and training of NER models.
3) POS Tagging: A process that involves labeling words in a text with their corresponding parts
of speech, such as nouns, verbs, adjectives, etc.
4) Chunking: A process that involves grouping words together into meaningful phrases based
on their part of speech and syntactic structure.
5) Training and Testing Data: The process of training a model with a set of labeled data (called
the training data) and evaluating its performance on another set of labeled data (called the
testing data).

 Steps involved in NER:


1) Tokenization: The first step in NER involves breaking down the input text into individual
words or tokens.
2) POS Tagging: Next, we need to label each word in the text with its corresponding part of
speech.
3) Chunking: After POS tagging, we can group the words together into meaningful phrases
using a process called chunking.
4) Named Entity Recognition: Once we have identified the chunks, we can apply NER
techniques to identify and classify the named entities in the text.
5) Evaluation: Finally, we can evaluate the performance of our NER model on a set of testing
data to determine its accuracy and effectiveness.

 Working of Named Entity Recognition:


1) Tokenization: Before identifying entities, the text is split into tokens, which can be words,
phrases, or even sentences.
2) Entity identification: Using various linguistic rules or statistical methods, potential named
entities are detected.
3) Entity classification:
 Once entities are identified, they are categorized into predefined classes such as "Person",
"Organization", or "Location".
 This is often achieved using machine learning models trained on labeled datasets.
4) Contextual analysis:
 NER systems often consider the surrounding context to improve accuracy.
5) Post-processing:
 After initial recognition and classification, post-processing might be applied to refine
results.
This could involve resolving ambiguities, merging multi-token entities, or using
knowledge bases to enhance entity data.
 Example:
 Input: "Apple Inc. was founded by Steve Jobs in Cupertino, California, in 1976."
1. Tokenization: ["Apple", "Inc.", "was", "founded", "by", "Steve", "Jobs", "in",
"Cupertino", ",", "California", ",", "in", "1976", "."]
2. Entity Detection:
o Candidate entity spans are located in the token sequence: "Apple Inc.", "Steve Jobs",
"Cupertino", "California", and "1976".
3. Entity Categorization:
o Tokens are assigned appropriate labels.
o Example (token-level labels): Apple (B-ORG), Inc. (I-ORG), Steve (B-PER),
Jobs (I-PER), Cupertino (B-LOC), California (B-LOC), 1976 (B-DATE).
o B stands for the beginning of an entity, and I stands for inside an entity.
4. Output:
 Organization: Apple Inc.
 Person: Steve Jobs
 Location: Cupertino, California
 Date: 1976
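The step from token-level B/I labels to the final entity list can be sketched in a few lines of Python. This is only an illustrative sketch: the function name decode_bio and the "O" tag for tokens outside any entity are assumptions, not part of the answer above.

```python
def decode_bio(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, entity_text) pairs."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        # An I- tag of the same type extends the entity being built.
        if tag.startswith("I-") and tag[2:] == current_type:
            current_tokens.append(token)
            continue
        # Anything else closes the entity being built, if there is one.
        if current_type is not None:
            entities.append((current_type, " ".join(current_tokens)))
        if tag == "O":
            current_type, current_tokens = None, []
        else:
            # A B- tag (or a stray I- tag) starts a new entity.
            current_type, current_tokens = tag[2:], [token]
    if current_type is not None:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


tokens = ["Apple", "Inc.", "was", "founded", "by", "Steve", "Jobs", "in",
          "Cupertino", ",", "California", ",", "in", "1976", "."]
tags = ["B-ORG", "I-ORG", "O", "O", "O", "B-PER", "I-PER", "O",
        "B-LOC", "O", "B-LOC", "O", "O", "B-DATE", "O"]
entities = decode_bio(tokens, tags)
print(entities)
```

With these labels, "Cupertino" and "California" come out as two separate LOC entities, because each carries its own B-LOC tag.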

 Named Entity Recognition Methods:


1) Lexicon Based Method:
 The NER uses a dictionary with a list of words or terms.
 The process involves checking if any of these words are present in a given text.
 However, this approach isn’t commonly used because it requires constant updating and
careful maintenance of the dictionary to stay accurate and effective.
2) Rule Based Method:
 The Rule Based NER method uses a set of predefined rules to guide the extraction of
information.
 These rules are based on patterns and context. Pattern-based rules focus on the structure
and form of words, looking at their morphological patterns.
 On the other hand, context-based rules consider the surrounding words or the context in
which a word appears within the text document.
 This combination of pattern-based and context-based rules enhances the precision of
information extraction in Named Entity Recognition (NER).
3) Machine Learning-Based Method:
 Multi-Class Classification with Machine Learning Algorithms.
 One way is to train the model for multi-class classification using different machine
learning algorithms, but it requires a lot of labelling.
 In addition to labelling, the model also requires a deep understanding of context to deal
with the ambiguity of the sentences.
 This makes it a challenging task for a simple machine learning algorithm.
4) Conditional Random Field (CRF):
 CRF-based taggers are available in common NLP toolkits, for example the Stanford NER tagger and NLTK.
 It is a probabilistic model that can be used to model sequential data such as words.
 The CRF can capture a deep understanding of the context of the sentence.
5) Deep Learning Based Method:
 Deep learning NER systems are generally more accurate than the previous methods because they
can learn representations of words.
 This is because they use word embeddings, which capture the semantic and syntactic
relationships between words.
 They can also learn topic-specific as well as high-level words automatically.
 This makes deep learning NER applicable to multiple tasks.
 Deep learning can do most of the repetitive work itself, so researchers, for example, can
use their time more efficiently.
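The lexicon-based and rule-based methods (1 and 2 above) are simple enough to sketch directly. The dictionary entries, the two regular-expression rules, and the function name find_entities below are all hypothetical, chosen only to fit the Apple example. A practical system would need far larger resources.

```python
import re

# Hypothetical toy lexicon (gazetteer): phrase -> entity class.
LEXICON = {
    "Steve Jobs": "PERSON",
    "Cupertino": "LOCATION",
    "California": "LOCATION",
}

# Hypothetical pattern-based rules: a four-digit year, and capitalized
# words followed by a company suffix such as "Inc.".
RULES = [
    ("DATE", re.compile(r"\b(?:1[89]|20)\d{2}\b")),
    ("ORGANIZATION", re.compile(r"\b(?:[A-Z][a-z]+ )+(?:Inc|Ltd|Corp)\.")),
]


def find_entities(text):
    """Return (entity_text, label) pairs in order of appearance."""
    found = []
    # Lexicon-based step: look up every dictionary phrase in the text.
    for phrase, label in LEXICON.items():
        for match in re.finditer(re.escape(phrase), text):
            found.append((match.start(), match.group(), label))
    # Rule-based step: apply each pattern rule.
    for label, pattern in RULES:
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    return [(span, label) for _, span, label in sorted(found)]


text = "Apple Inc. was founded by Steve Jobs in Cupertino, California, in 1976."
result = find_entities(text)
print(result)
```

The sketch also shows the maintenance problem mentioned above: any name missing from LEXICON, or any company without one of the listed suffixes, is silently missed.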

 Application:
1) Information Extraction:
 NER can be used to extract relevant information from large volumes of unstructured text,
such as news articles, social media posts, and online reviews.
 This information can be used to generate insights and make informed decisions.
2) Sentiment Analysis:
 NER can be used to identify the sentiment expressed in a text towards a particular named
entity, such as a product or service.
 This information can be used to improve customer satisfaction and identify areas for
improvement.
3) Question Answering:
 NER can be used to identify the relevant entities in a text that can be used to answer a
specific question.
 This is particularly useful for chatbots and virtual assistants.
4) Recommendation Systems:
 NER can be used to identify the interests and preferences of users based on the entities
mentioned in their search queries or online interactions.
 This information can be used to provide personalized recommendations and improve user
engagement.

 Advantages of NER:
1) Improved Accuracy: NER can improve the accuracy of NLP applications by identifying and
classifying named entities in a text more accurately and efficiently.
2) Speed and Efficiency: NER can automate the process of identifying and classifying named
entities in a text, saving time and improving efficiency.
3) Scalability: NER can be applied to large volumes of unstructured text, making it a valuable
tool for analyzing big data.
4) Personalization: NER can be used to identify the interests and preferences of users based on
their interactions with a system, allowing for personalized recommendations and improved
user engagement.
 Disadvantages of NER:
1) Ambiguity: NER can be challenging to apply in cases where there is ambiguity in the meaning
of a word or phrase. Example: the word “Apple” can refer to a fruit or a technology company.
2) Limited Scope: NER is limited to identifying and classifying named entities in a text and
cannot capture the full meaning of a text.
3) Data Requirements: NER requires large volumes of labeled data for training, which can be
expensive and time-consuming to collect and annotate.
4) Language Dependency: NER models are language-dependent and may require additional
training for use in different languages.

2 Write a short note on Morphology and its application in NLP. 6


ANS  Definition:
 Morphology is the study of the internal structure of words and how they are formed from
smaller meaningful units called morphemes.
 A morpheme is the smallest linguistic unit with semantic meaning, such as prefixes, suffixes,
roots, and stems.
 Morphology in NLP helps in analyzing and understanding word forms and structures to derive
their meaning and grammatical functions.

 Types of Morphology:
1) Derivational Morphology:
 Derivation is how we create new words from a base word or root word by adding parts
called affixes (prefixes and suffixes).
 We combine affixes with root words.
 Example: adding -ation to the verb "summarize" gives us the noun "summarization."
 Less Productive: Some affixes only work with certain words. For example, you can’t add
-ation to every verb.
 Different Meanings: Suffixes can change the meaning of a word. For instance,
"conformation" and "conformity" both come from "conform," but they mean different
things.
 The new words created can also be used to make even more new words, making our
language richer.

2) Inflectional Morphology:
 Inflection changes the form of the same word without changing its meaning.
 Inflection adds parts (called morphemes) to a word’s stem to show grammatical information
like number (singular/plural), tense (past/present), agreement, or case.
 The inflected word keeps its original meaning and word category.
 Example: a noun stays a noun, and a verb remains a verb but changes to show a different
tense.
 In English, inflection happens mostly with nouns and verbs (and sometimes adjectives).
 English has fewer inflectional morphemes compared to some other languages.

 Application:
1) Machine Translation:
 Helps in translating words with different morphological rules across languages.
 Example: Translating “playing” (English) to “jouant” (French).
2) Spell Checking: Identifies and suggests corrections for misspelled words based on
morphological rules.
3) Information Retrieval:
 Improves search accuracy by matching words with their morphological variations.
 Example: Searching for “run” will also return results for “running” and “ran.”
4) Text-to-Speech Systems: Morphological analysis helps in pronouncing words correctly by
understanding their structure.
5) Sentiment Analysis:
 Extracts meaning from words with prefixes or suffixes indicating positive or negative
sentiments.
 Example: "unhappy" (negative sentiment) → "happy" (positive root).
6) Language Modeling: Helps build NLP models for morphologically rich languages like
Finnish, Turkish, or Hindi, where words have complex structures.

3 Explain the use of Regular expressions in NLP. 6


ANS
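No answer text is given for this question. As a starting point only, the sketch below shows three common uses of regular expressions in NLP pre-processing with Python's standard re module: whitespace normalization, pattern extraction, and simple tokenization. The sample sentence and the patterns are assumptions made for illustration.

```python
import re

text = "The  meeting is on 12/05/2024,   not 1/6/2025."

# 1) Normalization: collapse runs of whitespace into a single space.
normalized = re.sub(r"\s+", " ", text).strip()

# 2) Extraction: find date-like strings of the form d/m/yyyy.
dates = re.findall(r"\b\d{1,2}/\d{1,2}/\d{4}\b", normalized)

# 3) Tokenization: split into word tokens and single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", "Hello, world!")

print(normalized)
print(dates)
print(tokens)
```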
4 Illustrate Maximum Entropy principle formula with example. 6
ANS  Maximum Entropy Principle:
 The Maximum Entropy principle is a statistical approach used to model distributions that make
the least number of assumptions, other than the known constraints.
 In Natural Language Processing (NLP), it is widely used in classification tasks, such as part-
of-speech tagging, text classification, and information retrieval.
 The principle of maximum entropy states that the probability distribution with the highest
entropy should be selected when faced with constraints.
 This is because it leaves the most uncertainty, or the maximum entropy, while still being
consistent with the constraints.

 Formula for Maximum Entropy Principle:


 The probability distribution is expressed as:
$$P(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{i} \lambda_i f_i(x, y) \right)$$
 Where:
P(y∣x): The probability of class y given input x.
Z(x): The normalization factor, also called the partition function, which ensures that
probabilities sum to 1:
$$Z(x) = \sum_{y'} \exp\left( \sum_{i} \lambda_i f_i(x, y') \right)$$
 λi: Weight parameters learned during training.
 fi(x, y): Feature functions, representing characteristics of the input-output pair.
 exp: Exponential function.

 Example of Maximum Entropy in NLP:


 Task: Part-of-Speech (POS) Tagging
 Input: "The cat sits on the mat."
 We want to assign POS tags such as:
o "The" → Determiner (DET)
o "cat" → Noun (NN)
o "sits" → Verb (VB)
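To make the formula concrete for this tagging task, the following sketch computes P(y | x) for one word at a time. The feature functions and the weights are hypothetical values invented for illustration. In a real model the weights λi are learned from training data.

```python
import math

TAGS = ["DET", "NN", "VB"]

# Hypothetical weights lambda_i; real values come from training.
WEIGHTS = {
    "word_is_the_and_DET": 2.0,
    "prev_DET_and_NN": 1.5,
    "ends_in_s_and_VB": 1.0,
    "ends_in_s_and_NN": 0.5,
}


def features(word, prev_tag, tag):
    """Binary feature functions f_i(x, y), with x = (word, previous tag)."""
    return {
        "word_is_the_and_DET": word.lower() == "the" and tag == "DET",
        "prev_DET_and_NN": prev_tag == "DET" and tag == "NN",
        "ends_in_s_and_VB": word.endswith("s") and tag == "VB",
        "ends_in_s_and_NN": word.endswith("s") and tag == "NN",
    }


def maxent_probs(word, prev_tag):
    """Return P(tag | word, prev_tag) for every tag in TAGS."""
    scores = {}
    for tag in TAGS:
        active = features(word, prev_tag, tag)
        total = sum(WEIGHTS[name] for name, on in active.items() if on)
        scores[tag] = math.exp(total)      # exp(sum_i lambda_i * f_i(x, y))
    z = sum(scores.values())               # partition function Z(x)
    return {tag: score / z for tag, score in scores.items()}


for word, prev_tag in [("The", None), ("cat", "DET"), ("sits", "NN")]:
    probs = maxent_probs(word, prev_tag)
    print(word, max(probs, key=probs.get), probs)
```

For "sits", two features fire (one for NN, one for VB). The larger weight on the VB feature gives VB the highest probability, while DET keeps the baseline score exp(0) = 1.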

 Steps:

5 What is Morphological Analysis? Explain Pre-processing steps. 6


ANS  Morphological Analysis:
 Morphology is the branch of linguistics concerned with the structure and form of words in a
language.
 Morphological analysis, in the context of NLP, refers to the computational processing of word
structures.
 It aims to break down words into their constituent parts, such as roots, prefixes, and suffixes,
and understand their roles and meanings.
 This process is essential for various NLP tasks, including language modeling, text analysis,
and machine translation.

 Importance of Morphological Analysis:


1) Understanding Word Formation: It helps in identifying the basic building blocks of words,
which is crucial for language comprehension.
2) Improving Text Analysis: By breaking down words into their roots and affixes, it enhances
the accuracy of text analysis tasks like sentiment analysis and topic modeling.
3) Enhancing Language Models: Morphological analysis provides detailed insights into word
formation, improving the performance of language models used in tasks like speech
recognition and text generation.
4) Facilitating Multilingual Processing: It aids in handling the morphological diversity of
different languages, making NLP systems more robust and versatile.

 Pre-processing Steps:
1) Stemming:
 Stemming reduces words to their base or root form, usually by removing suffixes.
 The resulting stems are not necessarily valid words but are useful for text normalization.
 Common ways to implement stemming:
1. Porter Stemmer: One of the most popular stemming algorithms, known for its
simplicity and efficiency
2. Snowball Stemmer: An improvement over the Porter Stemmer, supporting multiple
languages.
3. Lancaster Stemmer: A more aggressive stemming algorithm, often resulting in
shorter stems.
2) Lemmatization:
 Lemmatization reduces words to their base or dictionary form (lemma).
 It considers the context and part of speech, producing valid words.
 To implement lemmatization in Python, the WordNet Lemmatizer (for example, NLTK's
WordNetLemmatizer) is commonly used; it uses the WordNet lexical database to find the base form of words.
3) Morphological Parsing:
 Morphological parsing involves analyzing the structure of words to identify their
morphemes (roots, prefixes, suffixes).
 It requires knowledge of morphological rules and patterns.
 Finite-State Transducers (FSTs) are used as a tool for morphological parsing.
 FSTs are computational models used to represent and analyze the morphological structure
of words.
 They consist of states and transitions, capturing the rules of word formation.
 Applications:
1. Morphological Analysis: Parsing words into their morphemes.
2. Morphological Generation: Generating word forms from morphemes.
4) Neural Network Models: Neural network models, especially deep learning models, can be
trained to perform morphological analysis by learning patterns from large datasets.
 Types of Neural Network:
1. Recurrent Neural Networks (RNNs): Useful for sequential data like text.
2. Convolutional Neural Networks (CNNs): Can capture local patterns in the text.
3. Transformers: Advanced models like BERT and GPT that understand context and
semantics.
5) Rule-Based Methods:
 Rule-based methods rely on manually defined linguistic rules for morphological analysis.
 These rules can handle specific language patterns and exceptions.
 Applications:
1. Affix Stripping: Removing known prefixes and suffixes to find the root form.
2. Inflectional Analysis: Identifying grammatical variations like tense, number, and
case.
6) Hidden Markov Models (HMMs):
 Hidden Markov Models (HMMs) are probabilistic models that can be used to analyze
sequences of data, such as morphemes in words.
 HMMs consist of a set of hidden states, each representing a possible state of the system,
and observable outputs generated from these states.
In the context of morphological analysis, HMMs can be used to model the probabilistic
relationships between sequences of morphemes, helping to predict the most likely
sequence of morphemes for a given word.
 Components of Hidden Markov Models (HMMs):
1. States: Represent different parts of words (e.g., prefixes, roots, suffixes).
2. Observations: The actual characters or morphemes in the words.
3. Transition Probabilities: Probabilities of moving from one state to another.
4. Emission Probabilities: Probabilities of an observable output being generated from a
state.
 Applications:
1. Morphological Segmentation: Breaking words into morphemes.
2. Part-of-Speech Tagging: Assigning parts of speech to each word in a sentence.
3. Sequence Prediction: Predicting the most likely sequence of morphemes for a given
word.
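Of the techniques listed above, rule-based affix stripping with inflectional analysis is the easiest to sketch. The affix tables, the tiny root lexicon, and the function name analyze below are hypothetical and cover only a handful of English words. The sketch also ignores spelling changes such as "running" → "run".

```python
# Hypothetical affix tables and root lexicon, for illustration only.
PREFIXES = ["un", "re"]
SUFFIXES = [
    ("ing", "progressive"),
    ("ed", "past tense"),
    ("s", "plural or 3rd person singular"),
]
LEXICON = {"play", "happy", "cat", "do"}


def analyze(word):
    """Return (prefix, root, suffix_feature), or None if no rule applies."""
    # Try the word as-is first, then with each known prefix removed.
    candidates = [(None, word)]
    for prefix in PREFIXES:
        if word.startswith(prefix):
            candidates.append((prefix, word[len(prefix):]))
    for prefix, rest in candidates:
        if rest in LEXICON:
            return (prefix, rest, None)
        # Strip one inflectional suffix and check the remaining root.
        for suffix, feature in SUFFIXES:
            if rest.endswith(suffix) and rest[:-len(suffix)] in LEXICON:
                return (prefix, rest[:-len(suffix)], feature)
    return None


for w in ["playing", "unhappy", "cats", "replayed"]:
    print(w, analyze(w))
```

Checking each candidate root against the lexicon is what separates this from a plain stemmer: a word is split only when the remaining root is a known word.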

 Applications of Morphological Analysis:


1) Information Retrieval: Enhances search engines by improving the matching of query terms
with relevant documents, even if they are in different morphological forms.
2) Machine Translation: Facilitates accurate translation by understanding and generating
correct word forms in different languages.
3) Text-to-Speech Systems: Improves pronunciation and intonation by accurately identifying
word structures and their stress patterns.
4) Spell Checkers and Grammar Checkers: Detects and suggests corrections for misspelled
words and grammatical errors by analyzing word forms and their usage.
5) Named Entity Recognition (NER): Helps in identifying and classifying named entities in
text by understanding their morphological variations.

6 What is a Language Model? Write note on N-Gram language model. 6


ANS  Language Model:
 Language modeling is the way of determining the probability of any sequence of words.
 Language modeling is used in various applications such as Speech Recognition, Spam
filtering, etc.
 Language modeling is the key aim behind implementing many state-of-the-art Natural
Language Processing models.

 Methods of Language Modelling:


1) Statistical Language Modeling: Statistical language modeling is the development of
probabilistic models that can predict the next word in a sequence given the words that
precede it. An example is N-gram language modeling.
2) Neural Language Modeling: Neural network methods are achieving better results than
classical methods both on standalone language models and when models are incorporated into
larger models on challenging tasks like speech recognition and machine translation. Neural
language models typically represent words through word embeddings.

 N-Gram Language Model:


 N-gram can be defined as the contiguous sequence of n items from a given sample of text or
speech.
 The items can be letters, words, or base pairs according to the application. The N-grams
typically are collected from a text or speech corpus (a large text dataset).
 An N-gram language model predicts the probability of a given N-gram within any sequence
of words in a language.

 Types of N-grams:
1. Unigrams (1-grams): These are single words.
2. Bigrams (2-grams): These consist of pairs of consecutive words.
3. Trigrams (3-grams): These are sequences of three consecutive words.
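The three types can be extracted from a tokenized sentence with one small helper. This is a minimal sketch: the function name ngrams and the sample sentence are assumptions, and tokenization is just whitespace splitting.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the cat sits on the mat".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

A sentence of m tokens yields m - n + 1 N-grams, so the six-token sentence here gives five bigrams and four trigrams.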

 Example of N-Gram Language Model:

 Math Behind N-Gram:
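The formulas for this heading are missing here. The standard formulation, offered as a reference sketch, is as follows. The symbols w_i for the i-th word, m for the sentence length, and C(·) for a count in the training corpus are notation introduced for this note.

```latex
% Chain rule: exact probability of a word sequence
P(w_1, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})

% N-gram (Markov) approximation: condition only on the previous N-1 words
P(w_1, \dots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-N+1}, \dots, w_{i-1})

% Maximum-likelihood estimate for the bigram case (N = 2)
P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\, w_i)}{C(w_{i-1})}
```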

 Example:

 Uses of N-grams:
1) Language Modeling:
 N-grams are used to estimate the probability of a word given its previous N-1 words.
 This is fundamental in predicting the next word in a sequence of text, making them
essential for applications like auto-completion and text generation.
2) Text Classification:
 N-grams can be used to represent documents or text for classification tasks.
 By counting the occurrences of N-grams in a document, it's possible to create a feature
vector that can be used in machine learning algorithms for text classification.
3) Information Retrieval:
 In search engines, N-grams can be used to index documents and query terms.
 This helps in ranking and retrieving relevant documents.
4) Speech Recognition: N-grams can be used to model sequences of phonemes or words in
speech recognition systems, aiding in accurate transcription of spoken language.
5) Machine Translation: N-grams are used in machine translation to align and translate
sequences of words or phrases in different languages.
6) Spelling Correction: N-grams can help identify and correct misspelled words by comparing
them to correctly spelled N-grams in a language model.

7 Find the probability of test sentence given below: 6


If a Bi-Gram model is used on the following training Sentences:
Training Data:
1. The Arabian Knights.
2. These are the fairy tales of the east.
3. The stories of the Arabian knights are translated in many languages.
Test Data:
1. The Arabian knights are the fairy tales of the east.
ANS  Bi-Gram Model Solution:
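The worked solution is missing here, so the following is one possible calculation and not the original answer. It assumes maximum-likelihood bigram estimates with no smoothing, lowercased words, and sentence boundary markers <s> and </s>. It also takes the last word of sentence 2 and of the test sentence to be "east". The wording of sentence 3 after "translated" is an assumption too, but it does not affect any bigram used by the test sentence.

```python
from collections import Counter
from fractions import Fraction


def tokens(sentence):
    """Lowercase, drop periods, and pad with sentence boundary markers."""
    return ["<s>"] + sentence.lower().replace(".", "").split() + ["</s>"]


train = [
    "The Arabian Knights.",
    "These are the fairy tales of the east.",
    "The stories of the Arabian knights are translated in many languages.",
]

bigram_counts = Counter()    # C(previous word, word)
context_counts = Counter()   # C(previous word)
for sentence in train:
    toks = tokens(sentence)
    for prev, cur in zip(toks, toks[1:]):
        bigram_counts[(prev, cur)] += 1
        context_counts[prev] += 1


def sentence_probability(sentence):
    """Product of MLE bigram probabilities C(prev, cur) / C(prev)."""
    toks = tokens(sentence)
    prob = Fraction(1)
    for prev, cur in zip(toks, toks[1:]):
        if context_counts[prev] == 0:
            return Fraction(0)   # unseen context, no smoothing
        prob *= Fraction(bigram_counts[(prev, cur)], context_counts[prev])
    return prob


test_prob = sentence_probability(
    "The Arabian knights are the fairy tales of the east."
)
print(test_prob, float(test_prob))
```

Under these assumptions the factors are P(the|<s>) = 2/3, P(arabian|the) = 2/5, P(knights|arabian) = 1, P(are|knights) = 1/2, P(the|are) = 1/2, P(fairy|the) = 1/5, P(tales|fairy) = 1, P(of|tales) = 1, P(the|of) = 1, P(east|the) = 1/5 and P(</s>|east) = 1. Their product is 1/375, about 0.0027. A different convention, for example treating "Knights" and "knights" as distinct words, changes the counts and therefore the result.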

8 Explain Vector space Model of information Retrieval. 6


ANS  Vector Space Model of Information Retrieval:
 To address the disadvantages of the Boolean model, Gerard Salton and his colleagues suggested a
model based on Luhn’s similarity criterion.
 The similarity criterion formulated by Luhn states, “the more two representations agreed
in given elements and their distribution, the higher would be the probability of their representing
similar information.”
 Consider the following important points to understand more about the Vector Space Model:
1) The index representations (documents) and the queries are considered as vectors
embedded in a high-dimensional Euclidean space.
2) The similarity measure of a document vector to a query vector is usually the cosine of the
angle between them.
 Cosine Similarity Measure Formula:
 Cosine is a normalized dot product, which can be calculated with the help of the following
formula, where d_k and q_k are the weights of term k in the document and the query and m is the
number of terms:
$$\cos(\vec{d}, \vec{q}) = \frac{\vec{d} \cdot \vec{q}}{\lVert \vec{d} \rVert \, \lVert \vec{q} \rVert} = \frac{\sum_{k=1}^{m} d_k q_k}{\sqrt{\sum_{k=1}^{m} d_k^2} \, \sqrt{\sum_{k=1}^{m} q_k^2}}$$
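A minimal sketch of ranking documents by cosine similarity is given below. It uses raw term counts as vector weights, which is an assumption made for brevity: practical systems usually use TF-IDF weights. The two documents and the query are invented for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Represent a text as a sparse term-count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Normalized dot product of two sparse vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = {
    "d1": "fairy tales of the east",
    "d2": "stories of the arabian knights",
}
query = vectorize("arabian knights stories")

scores = {name: cosine_similarity(vectorize(text), query)
          for name, text in docs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

Here d2 shares all three query terms and scores 3 / (√5 · √3) ≈ 0.77, while d1 shares none and scores 0, so d2 is ranked first.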

9 What do you mean by morpheme? Explain different morphemes with examples. 6


ANS  Morpheme:
 A morpheme is the smallest meaningful unit in a word.
 This smallest unit has its own semantic and grammatical importance in a word.
 A single word may be made of one or more than one morpheme.
 Example: International (Inter + Nation + Al) has three morphemes.
Cats (Cat + s) has two morphemes.
 Types of Morphemes:
1) Free Morphemes:
 The morphemes which can be used as separate words on their own are called free
morphemes.
 They can be independently used in sentences.
 Types of Free Morphemes:
1. Lexical Morphemes:
 Lexical morphemes are words that give us the main meaning of a sentence, text
or conversation.
 They can be nouns, verbs, adjectives etc.
 Example: Pencil, Listen, Happy
2. Grammatical Morphemes:
 Grammatical morphemes are function words such as prepositions, auxiliaries (both primary and modal), and conjunctions.
 Example: is, are, was, can, may, might, in, on, from, and, or, but etc
2) Bound Morphemes:
 Bound morphemes cannot stand alone but must be bound to other morphemes, like –s, un-,
and –y.
 Bound morphemes are often affixes.
 This is a general term that comprises prefixes, which are added to the beginnings of words,
like re– and un-, and suffixes, which are added to the ends of words, like –s, –ly, and –ness.
 Some languages also have infixes, which are added into the middle of words, but these are
rare in Modern English.
 Types of Bound Morphemes:
1. Derivational Morphemes:
 Derivational morphemes change the meaning or the part of speech of a word (i.e.,
they are morphemes by which we “derive” a new word).
 Examples: un-, which gives a negative meaning to the word it is added to, –y,
which turns nouns into adjectives, or –ness, which turns adjectives into nouns.
2. Inflectional Morphemes:
 An inflectional morpheme, which is a type of a bound morpheme, is defined by
linguists as a mere grammatical indicator or marker.
 An inflectional morpheme cannot generate or create new words nor can it affect
the grammatical class of a word.
 Example: adding “-s” to “dog” to make it plural, “dogs” or “-ing” to “play” to
make “playing”

 Example of Morphemes in Action:

10 Explain finite state machine-based morphology. 6
Design a finite-state transducer (FST) with the E-insertion orthographic rule that parses the
surface-level form “foxes” to the lexical-level form “fox + N + PL”.
ANS
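No answer is given for this question. One possible design, offered only as a hedged sketch, is a small deterministic transducer written as a transition table. It reads the surface form one character at a time and emits lexical-level symbols. The state names q0 to q5, the table layout, and the extra +SG output for the bare stem are assumptions. The transducer is hand-built for the single stem "fox", with the E-insertion rule (an e appears between a stem ending in x, s, or z and the plural -s) folded into the q3 → q4 transition.

```python
# (state, surface character) -> (next state, lexical output)
TRANSITIONS = {
    ("q0", "f"): ("q1", "f"),
    ("q1", "o"): ("q2", "o"),
    ("q2", "x"): ("q3", "x"),
    ("q3", "e"): ("q4", ""),         # inserted e has no lexical counterpart
    ("q4", "s"): ("q5", " +N +PL"),  # plural -s maps to the tags +N +PL
}

# Accepting states and the output emitted when the input ends there.
FINAL_OUTPUT = {
    "q3": " +N +SG",  # bare stem "fox"
    "q5": "",         # "foxes", tags already emitted
}


def parse(surface):
    """Map a surface form to its lexical form, or None if it is rejected."""
    state, output = "q0", []
    for ch in surface:
        step = TRANSITIONS.get((state, ch))
        if step is None:
            return None              # no transition: not accepted
        state, emitted = step
        output.append(emitted)
    if state not in FINAL_OUTPUT:
        return None                  # input ended in a non-accepting state
    return "".join(output) + FINAL_OUTPUT[state]


print(parse("foxes"))
print(parse("fox"))
print(parse("foxs"))
```

The surface string "foxes" is mapped to "fox +N +PL". The ill-formed "foxs" is rejected because q3 has no transition on s, which is exactly the constraint the E-insertion rule expresses. A full answer would normally draw the lexicon transducer and the orthographic-rule transducer separately and compose them through the intermediate form "fox^s#".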
