Natural Language Processing - Unit 1
1. Sentiment analysis
Sentiment analysis, also referred to as opinion mining, is an approach to natural
language processing (NLP) that identifies the emotional tone behind a body of text.
This is a popular way for organizations to determine and categorize opinions about a
product, service or idea.
Sentiment analysis systems help organizations gather insights into real-time customer
sentiment, customer experience and brand reputation.
Generally, these tools use text analytics to analyze online sources such as emails, blog
posts, online reviews, news articles, survey responses, case studies, web chats, tweets,
forums and comments.
Sentiment analysis uses machine learning models to perform text analysis of human
language. The metrics used are designed to detect whether the overall sentiment of a
piece of text is positive, negative or neutral.
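As a minimal sketch of the idea, a sentiment system can be reduced to counting words from positive and negative word lists (real systems use trained machine learning models; the word lists below are invented for illustration):

```python
# Toy lexicon-based sentiment scorer. The POSITIVE/NEGATIVE word lists are
# made-up examples; production systems learn these signals from data.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "awful"}

def sentiment(text):
    words = text.lower().split()
    # score = count of positive words minus count of negative words
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this great product"))  # positive
```

The overall sentiment label (positive, negative, or neutral) follows the sign of the score, mirroring the three-way classification described above.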
2. Machine Translation
Machine translation, sometimes referred to by the abbreviation MT, is a sub-field
of computational linguistics that investigates the use of software to translate text or
speech from one language to another.
On a basic level, MT performs mechanical substitution of words in one language for
words in another, but that alone rarely produces a good translation because
recognition of whole phrases and their closest counterparts in the target language is
needed.
Not all words in one language have equivalent words in another language, and many
words have more than one meaning.
Solving this problem with corpus statistical and neural techniques is a rapidly
growing field that is leading to better translations, handling differences in linguistic
typology, translation of idioms, and the isolation of anomalies.
Corpus: A collection of written texts, especially the entire works of a particular
author.
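The mechanical word-for-word substitution described above can be sketched in a few lines; the tiny English-to-Spanish glossary is invented for illustration, and the output shows exactly why phrase-level knowledge is needed:

```python
# Naive word-for-word "translation" (English -> Spanish) using a toy glossary.
# This illustrates mechanical substitution, not a real MT system.
GLOSSARY = {"the": "el", "cat": "gato", "black": "negro", "is": "es"}

def word_for_word(sentence):
    # substitute each word independently; unknown words pass through unchanged
    return " ".join(GLOSSARY.get(w, w) for w in sentence.lower().split())

print(word_for_word("the black cat"))  # "el negro gato"
# Correct Spanish is "el gato negro": adjective order differs, so
# word-by-word substitution produces the wrong result.
```

The failure on adjective order is a small instance of the typology differences mentioned above: a good translation must recognize the whole phrase, not each word in isolation.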
3. Text Extraction
There are a number of natural language processing techniques that can be
used to extract information from text or unstructured data.
These techniques can be used to extract information such as entity names,
locations, quantities, and more.
With the help of natural language processing, computers can make sense
of the vast amount of unstructured text data that is generated every day,
and humans can reap the benefits of having this information readily
available.
Industries such as healthcare, finance, and e-commerce are already using
natural language processing techniques to extract information and
improve business processes.
As machine learning technology continues to develop, more and more information extraction use cases will be covered.
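A very small sketch of extraction is possible with regular expressions: pulling quantities and capitalized entity-like names out of raw text (real systems use trained named-entity recognizers; the sample sentence and patterns here are illustrative only):

```python
import re

# Toy information extraction: regexes pull quantities and capitalized
# name-like spans from unstructured text. Invented example sentence.
text = "Acme Corp shipped 120 units to Berlin and 75 units to Paris."

quantities = re.findall(r"\b\d+\b", text)
# one or more consecutive Capitalized words count as a single "name"
names = re.findall(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\b", text)

print(quantities)  # ['120', '75']
print(names)       # ['Acme Corp', 'Berlin', 'Paris']
```

Even this crude pattern recovers entity names, locations, and quantities of the kind listed above, which is why regex-based extraction is often a first baseline.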
4. Text Classification
5. Speech Recognition
Speech recognition is an interdisciplinary subfield of computer
science and computational linguistics that develops methodologies and technologies
that enable the recognition and translation of spoken language into text by computers.
It is also known as automatic speech recognition (ASR), computer speech
recognition or speech to text (STT).
It incorporates knowledge and research in the computer
science, linguistics and computer engineering fields. The reverse process is speech
synthesis.
6. Chatbots
You’ll come across chatbots on business websites or messengers that give pre-scripted
replies to your questions. As the entire process is automated, bots can provide quick
assistance 24/7 without human intervention.
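A pre-scripted chatbot of the kind described above can be sketched as a keyword-to-reply lookup (the keywords and canned replies below are made up for illustration):

```python
# Minimal rule-based chatbot: match a keyword in the user's message and
# return a pre-scripted reply. All rules here are invented examples.
RULES = {
    "price": "Our plans start at $10/month.",
    "hours": "Support is available 24/7.",
    "refund": "Refunds are processed within 5 business days.",
}
DEFAULT = "Sorry, I didn't understand. A human agent will contact you."

def reply(message):
    for keyword, answer in RULES.items():
        if keyword in message.lower():
            return answer
    return DEFAULT

print(reply("What are your hours?"))  # Support is available 24/7.
```

Because the whole loop is automated, such a bot can answer around the clock; anything outside its script falls through to the default hand-off reply.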
7. Email Filter
One of the most fundamental and essential applications of NLP online is email
filtering. It began with spam filters, which identified specific words or phrases that
indicate a spam message. But, like early NLP adaptations, filtering has been
improved.
Gmail's email categorization is one of the more common, newer implementations of
NLP. Based on the contents of emails, the algorithm determines whether they belong
in one of three categories (main, social, or promotional).
This keeps your inbox manageable for all Gmail users, surfacing the critical, relevant emails you want to see and reply to quickly.
8. Search Autocorrect and Autocomplete
When you type 2-3 letters into Google to search for anything, it displays a list of
probable search keywords. Alternatively, if you search for anything with mistakes, it
corrects them for you while still returning relevant results. Isn't it incredible?
Everyone uses Google Search's autocorrect and autocomplete on a regular basis but seldom gives them any thought. They are a fantastic illustration of how natural language processing touches millions of people across the world, including you and me.
Both search autocomplete and autocorrect make it much easier to locate accurate results.
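Both behaviors can be sketched over a tiny vocabulary: autocomplete as prefix matching, and autocorrect as picking a dictionary word one edit away from the typo (the vocabulary is invented; real search engines also rank candidates by query frequency):

```python
# Toy autocomplete (prefix match) and autocorrect (edit distance 1).
# VOCAB is a made-up five-word dictionary for illustration.
VOCAB = ["weather", "web", "website", "wedding", "week"]

def autocomplete(prefix):
    return [w for w in VOCAB if w.startswith(prefix)]

def edits1(word):
    # every string one deletion, substitution, or insertion away from word
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {l + r[1:] for l, r in splits if r}
    subs = {l + c + r[1:] for l, r in splits if r for c in letters}
    inserts = {l + c + r for l, r in splits for c in letters}
    return deletes | subs | inserts

def autocorrect(word):
    if word in VOCAB:
        return word
    candidates = edits1(word) & set(VOCAB)
    # ties broken alphabetically here; a real system ranks by frequency
    return min(candidates) if candidates else word

print(autocomplete("web"))    # ['web', 'website']
print(autocorrect("wether"))  # 'weather' (one insertion away)
```

Typing "we" would surface every word in this vocabulary, just as two or three letters in Google surface a list of probable queries.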
3. Components of NLP
There are two components of NLP: Natural Language Understanding (NLU) and Natural Language Generation (NLG).
Natural Language Understanding (NLU) involves transforming human language into a machine-readable format. It helps the machine to understand and analyze human language by extracting elements such as keywords, emotions, relations, and semantics from large amounts of text.
Natural Language Generation (NLG) acts as a translator that converts computerized data into a natural language representation.
It mainly involves text planning, sentence planning, and text realization.
NLU is harder than NLG.
4. Steps in NLP
There are five general steps:
1. Lexical Analysis
2. Syntactic Analysis (Parsing)
3. Semantic Analysis
4. Discourse Integration
5. Pragmatic Analysis
Lexical Analysis:
The first phase of NLP is the Lexical Analysis.
This phase scans the source text as a stream of characters and converts it into meaningful lexemes.
It divides the whole text into paragraphs, sentences, and words.
Lexeme: A lexeme is a basic unit of meaning. In linguistics, the abstract unit of morphological analysis that corresponds to the set of forms taken by a single word is called a lexeme.
The way in which a lexeme is used in a sentence is determined by its grammatical
category.
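The division into paragraphs, sentences, and words described above can be sketched with simple string operations (a real tokenizer handles many more edge cases, such as abbreviations and quotes; the sample text is invented):

```python
import re

# Minimal lexical-analysis sketch: split raw text into paragraphs,
# sentences, and words. The sample text is an invented example.
text = "NLP is fun. It has many steps.\n\nLexical analysis comes first."

paragraphs = text.split("\n\n")
# split each paragraph after sentence-ending punctuation
sentences = [s for p in paragraphs for s in re.split(r"(?<=[.!?])\s+", p) if s]
# extract word tokens from each sentence
words = [w for s in sentences for w in re.findall(r"\w+", s)]

print(len(paragraphs), len(sentences), len(words))  # 2 3 11
```

Each word token produced here is the surface form of some lexeme; the later phases then work on these units rather than on raw characters.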
Semantic Analysis
Semantic analysis is concerned with the meaning representation.
It mainly focuses on the literal meaning of words, phrases, and sentences.
The semantic analyzer disregards sentences such as “hot ice-cream”.
Another example: “Manhattan calls out to Dave” passes syntactic analysis because it is a grammatically correct sentence. However, it fails semantic analysis: because Manhattan is a place and can’t literally call out to people, the sentence’s meaning doesn’t make sense.
Discourse Integration
Discourse integration depends upon the sentences that precede a given sentence and also invokes the meaning of the sentences that follow it.
For instance, if one sentence reads, “Manhattan speaks to all its
people,” and the following sentence reads, “It calls out to Dave,”
discourse integration checks the first sentence for context to understand
that “It” in the latter sentence refers to Manhattan.
Pragmatic Analysis
During this phase, what was said is re-interpreted in terms of what it actually meant.
It involves deriving those aspects of language which require real world knowledge.
For instance, a pragmatic analysis can uncover the intended meaning of
“Manhattan speaks to all its people.” Methods like neural networks
assess the context to understand that the sentence isn’t literal, and most
people won’t interpret it as such. A pragmatic analysis deduces that this sentence is a metaphor for how people emotionally connect with a place.
Issues and challenges in morphological modeling and parsing in natural language processing (NLP).
Ambiguity arises when a linguistic expression can have multiple interpretations. This is a fundamental challenge in NLP and computational linguistics. There are two primary types of ambiguity:
Accidental Ambiguity: This happens when a word or phrase has multiple possible meanings
depending on the context. For example, the word bank can refer to a financial institution or
the side of a river.
Ambiguity Due to Lexemes Having Multiple Senses: Some words can have multiple
meanings even within a single grammatical category. For example, light can mean "not
heavy" or "illumination."
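One common way to resolve such ambiguity is to look at nearby context words. A toy sketch for the "bank" example above (the sense-to-keyword lists are invented; real systems use sense inventories and trained models):

```python
# Toy word-sense disambiguation for "bank": choose the sense whose
# context-word list overlaps most with the sentence. Sense lists invented.
SENSES = {
    "financial institution": {"money", "loan", "deposit", "account"},
    "river side": {"river", "water", "fishing", "shore"},
}

def disambiguate(sentence):
    context = set(sentence.lower().split())
    # pick the sense sharing the most words with the sentence
    return max(SENSES, key=lambda s: len(SENSES[s] & context))

print(disambiguate("she opened an account at the bank"))  # financial institution
print(disambiguate("we sat on the bank of the river"))    # river side
```

The same overlap idea extends to words like "light": the surrounding words decide which sense is meant.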
Language is Dynamic: New words are constantly created due to cultural, technological, and
societal changes (e.g., "selfie" or "metaverse").
Morphological Systems Can’t Always Handle New Words: Most NLP models rely on a
predefined lexicon or a set of grammatical rules. When they encounter a new word (a
neologism) or an unfamiliar usage, they may fail to parse or process it correctly.
The Unknown Word Problem: Words that are not present in the model’s vocabulary remain
unprocessed. This problem is especially severe in:
Example: If an NLP model trained only on standard English encounters a new internet slang
term like yeet, it may fail to understand or process it correctly.
Morphological Parsing: This refers to breaking down words into their smallest meaningful
units (morphemes). For example, the word unhappiness can be split into un- (prefix), happy
(root), and -ness (suffix).
o NLP systems aim to create broad rules that can apply to many words.
o However, language has many exceptions (irregular forms), making this difficult.
Challenges of Irregularity:
o Some words don’t follow standard rules (e.g., go → went instead of goed).
Example:
o Irregular English past tense: run → ran (does not follow rule)
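The rule-plus-exceptions picture above can be sketched as affix stripping with a small irregular-form table (the affix lists and exception table are invented for illustration):

```python
# Toy morphological parser: strip one known prefix and one known suffix,
# but let irregular forms bypass the rules via an exception table.
PREFIXES = ["un", "re"]
SUFFIXES = ["ness", "ing", "ed", "s"]
IRREGULAR = {"went": ["go", "PAST"], "ran": ["run", "PAST"]}

def morphemes(word):
    if word in IRREGULAR:          # irregular forms don't follow the rules
        return IRREGULAR[word]
    parts = []
    for p in PREFIXES:
        if word.startswith(p):
            parts.append(p + "-")
            word = word[len(p):]
            break
    suffix = None
    for sfx in SUFFIXES:
        if word.endswith(sfx):
            suffix = "-" + sfx
            word = word[: -len(sfx)]
            break
    parts.append(word)
    if suffix:
        parts.append(suffix)
    return parts

print(morphemes("unhappiness"))  # ['un-', 'happi', '-ness']
print(morphemes("went"))         # ['go', 'PAST']
```

Note that the root comes out as "happi", not "happy": recovering the dictionary form needs extra spelling rules, which is exactly the kind of exception handling that makes morphological parsing hard.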
Topic segmentation is one of the main tasks in finding the structure of documents in NLP. It involves identifying the points in a document where the topic or theme of the text shifts. This task is particularly useful for organizing and summarizing long documents. It relies on the underlying semantic structure and meaning of the text, rather than simply identifying specific markers or patterns. As such, there are several methods and techniques that can be used:
1. Lexical cohesion: This method looks at the patterns of words and phrases in the text; for example, if the frequency of a particular keyword or phrase drops off sharply after a certain point in the text, that point may mark a topic boundary.
2. Discourse markers: This method looks at the use of discourse markers, such as "however", "in contrast", and "furthermore", which are often used to signal a shift in topic.
3. Machine learning: This method trains models to identify patterns and features in a text that are associated with topic shifts.
METHODS:
There are several methods and techniques used in NLP to find the structure of documents:
1. Part-of-speech tagging: assigning a grammatical category (noun, verb, adjective, etc.) to each word in a sentence, which is useful for tasks like parsing.
2. Coreference resolution: identifying expressions that refer to the same entity, which is important for tasks like information extraction.
3. Topic segmentation: identifying the points where the topic or theme of the text shifts, which is useful for organizing and summarizing documents.
The most commonly used generative sequence classification method for topic and sentence segmentation is the hidden Markov model (HMM). An HMM has five components:
1. States (hidden states): the things we want to predict (e.g., POS tags: Noun, Verb, Adjective).
2. Observations (visible outputs): the actual words we see in a sentence (e.g., "dogs", "run", "quickly").
3. Transition probabilities (A): the probability of moving from one hidden state to another (e.g., P(Noun → Verb)).
4. Emission probabilities (B): the probability of a word being generated from a state (e.g., P("run" | Verb)).
5. Initial probabilities (π): the probability of starting in a particular state.
Example: training an HMM for POS tagging on a small dataset and testing it on the new sentence “A dog barks”:
from nltk.tag import hmm

# Toy supervised training data: each sentence is a list of (word, tag) pairs
train_data = [[("a", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
              [("the", "DET"), ("cat", "NOUN"), ("barks", "VERB")]]
trainer = hmm.HiddenMarkovModelTrainer()
hmm_model = trainer.train(train_data)
test_sentence = ["a", "dog", "barks"]
predicted_tags = hmm_model.tag(test_sentence)
print(predicted_tags)
Bayes rule:
P(A | B) = P(B | A) · P(A) / P(B)
In HMM tagging, Bayes rule relates the probability of a tag given a word to the emission probability P(word | tag) and the prior probability of the tag.
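A quick numeric check of Bayes rule, P(A|B) = P(B|A)·P(A)/P(B), with made-up probability values chosen only for illustration:

```python
# Numeric illustration of Bayes rule with invented probabilities.
p_a = 0.3          # P(A): prior probability of A
p_b_given_a = 0.8  # P(B|A): likelihood of B given A
p_b = 0.5          # P(B): total probability of B

p_a_given_b = p_b_given_a * p_a / p_b  # Bayes rule
print(round(p_a_given_b, 2))  # 0.48
```

The posterior 0.48 is higher than the prior 0.3 because observing B (which is much likelier under A) shifts belief toward A.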