

Natural Language Processing (NLP) Unit-I


1. Natural Language Processing – Introduction

 Humans communicate through some form of language, either text or speech.
 To enable interaction between computers and humans, computers need to understand the natural languages humans use.
 Natural language processing is all about making computers learn, understand, analyze, manipulate, and interpret natural (human) languages.
 NLP stands for Natural Language Processing, a field that draws on computer science, linguistics, and artificial intelligence.
 Processing of natural language is required when you want an intelligent system such as a robot to perform as per your instructions, or when you want to hear a decision from a dialogue-based clinical expert system.
 The ability of machines to interpret human language is now at the core of many applications that we use every day: chatbots, email classification and spam filters, search engines, grammar checkers, voice assistants, and social language translators.
 The input and output of an NLP system can be speech or written text.

2. Applications of NLP or Use cases of NLP

1. Sentiment analysis
 Sentiment analysis, also referred to as opinion mining, is an approach to natural
language processing (NLP) that identifies the emotional tone behind a body of text.
 This is a popular way for organizations to determine and categorize opinions about a
product, service or idea.
 Sentiment analysis systems help organizations gather insights into real-time customer
sentiment, customer experience and brand reputation.
 Generally, these tools use text analytics to analyze online sources such as emails, blog
posts, online reviews, news articles, survey responses, case studies, web chats, tweets,
forums and comments.
 Sentiment analysis uses machine learning models to perform text analysis of human
language. The metrics used are designed to detect whether the overall sentiment of a
piece of text is positive, negative or neutral.
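
As a brief illustrative sketch (not part of the original notes), the following Python snippet uses NLTK's VADER analyzer, a rule-based sentiment model, to label a piece of text as positive, negative, or neutral; the example review text is made up.

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-time download of the VADER word lists

sia = SentimentIntensityAnalyzer()
review = "The battery life is great, but the screen is disappointing."
scores = sia.polarity_scores(review)  # neg/neu/pos plus an overall compound score
print(scores)

# A common convention: compound >= 0.05 is positive, <= -0.05 is negative
if scores['compound'] >= 0.05:
    print("positive")
elif scores['compound'] <= -0.05:
    print("negative")
else:
    print("neutral")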
2. Machine Translation
 Machine translation, sometimes referred to by the abbreviation MT, is a sub-field
of computational linguistics that investigates the use of software to translate text or
speech from one language to another.
 On a basic level, MT performs mechanical substitution of words in one language for
words in another, but that alone rarely produces a good translation because
recognition of whole phrases and their closest counterparts in the target language is
needed.
 Not all words in one language have equivalent words in another language, and many
words have more than one meaning.

 Solving this problem with corpus statistical and neural techniques is a rapidly
growing field that is leading to better translations, handling differences in linguistic
typology, translation of idioms, and the isolation of anomalies.
 Corpus: A collection of written texts, especially the entire works of a particular
author.
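
A toy sketch of the mechanical word-for-word substitution described above (the two-entry dictionary is invented), showing why substitution alone rarely produces a good translation:

# Naive word-for-word English-to-Spanish substitution (toy dictionary, illustrative only)
lexicon = {"the": "el", "black": "negro", "cat": "gato"}

def naive_translate(sentence):
    # Replace each word independently; unknown words pass through unchanged
    return " ".join(lexicon.get(word, word) for word in sentence.lower().split())

print(naive_translate("The black cat"))
# Output: "el negro gato" -- the word order is wrong; Spanish requires "el gato negro",
# which is why recognition of whole phrases is needed.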

3. Text Extraction
 There are a number of natural language processing techniques that can be
used to extract information from text or unstructured data.
 These techniques can be used to extract information such as entity names,
locations, quantities, and more.
 With the help of natural language processing, computers can make sense
of the vast amount of unstructured text data that is generated every day,
and humans can reap the benefits of having this information readily
available.
 Industries such as healthcare, finance, and e-commerce are already using
natural language processing techniques to extract information and
improve business processes.
 As machine learning technology continues to develop, we will see more and more information extraction use cases covered.
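
As one possible sketch (not from the notes), NLTK's built-in named-entity chunker can pull entity names and locations out of raw text; the sample sentence is invented.

import nltk
for pkg in ('punkt', 'averaged_perceptron_tagger', 'maxent_ne_chunker', 'words'):
    nltk.download(pkg)  # one-time model downloads

text = "Apple opened a new office in London last year."
tokens = nltk.word_tokenize(text)   # split into words
tagged = nltk.pos_tag(tokens)       # assign part-of-speech tags
tree = nltk.ne_chunk(tagged)        # group tagged tokens into named entities

for subtree in tree:
    if hasattr(subtree, 'label'):   # entity chunks are subtrees with a label
        entity = " ".join(word for word, pos in subtree.leaves())
        print(subtree.label(), "->", entity)  # e.g. GPE -> London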

4. Text Classification

 Unstructured text is everywhere: emails, chat conversations, websites, and social media. Nevertheless, it's hard to extract value from this data unless it's organized in a certain way.
 Text classification, also known as text tagging or text categorization, is the process of categorizing text into organized groups. By using Natural Language Processing (NLP), text classifiers can automatically analyze text and then assign a set of pre-defined tags or categories based on its content.
 Text classification is becoming an increasingly important part of businesses, as it allows them to easily get insights from data and automate business processes.
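
A minimal sketch of such a classifier (the tiny training set below is invented; scikit-learn is one common choice, though the notes do not prescribe a library):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data: short support messages with pre-defined category tags
texts = ["my card was charged twice", "refund has not arrived",
         "the app crashes on startup", "login button does nothing"]
labels = ["billing", "billing", "bug", "bug"]

# Vectorize the text, then fit a Naive Bayes classifier on the vectors
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["app freezes when I open it"]))  # expected: ['bug']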

5. Speech Recognition
 Speech recognition is an interdisciplinary subfield of computer
science and computational linguistics that develops methodologies and technologies
that enable the recognition and translation of spoken language into text by computers.
 It is also known as automatic speech recognition (ASR), computer speech
recognition or speech to text (STT).
 It incorporates knowledge and research in the computer
science, linguistics and computer engineering fields. The reverse process is speech
synthesis.

Speech recognition use cases


 A wide range of industries are utilizing different applications of speech technology today, helping businesses and consumers save time and even lives. Some examples include:
 Automotive: Speech recognizers improve driver safety by enabling voice-activated navigation systems and search capabilities in car radios.
 Technology: Virtual agents are increasingly becoming integrated within our daily lives, particularly on our mobile devices. We use voice commands to access them through our smartphones, such as Google Assistant or Apple’s Siri, for tasks such as voice search, or through our speakers, via Amazon’s Alexa or Microsoft’s Cortana, to play music. They’ll only continue to integrate into the everyday products that we use, fueling the “Internet of Things” movement.
 Healthcare: Doctors and nurses leverage dictation applications to capture and log
patient diagnoses and treatment notes.
 Sales: Speech recognition technology has a couple of applications in sales. It can help
a call center transcribe thousands of phone calls between customers and agents to
identify common call patterns and issues. AI chatbots can also talk to people via a
webpage, answering common queries and solving basic requests without needing to
wait for a contact center agent to be available. In both instances speech recognition
systems help reduce time to resolution for consumer issues.
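
An illustrative sketch of transcribing a recorded call using the third-party SpeechRecognition package (pip install SpeechRecognition); the file name call.wav is hypothetical:

import speech_recognition as sr

recognizer = sr.Recognizer()

# Load a pre-recorded call (hypothetical file) and capture its audio data
with sr.AudioFile("call.wav") as source:
    audio = recognizer.record(source)

try:
    # recognize_google sends the audio to Google's free web API (needs internet)
    print(recognizer.recognize_google(audio))
except sr.UnknownValueError:
    print("Could not understand the audio")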
6. Chatbot
 Chatbots are computer programs that conduct automatic conversations with people.
They are mainly used in customer service for information acquisition. As the name
implies, these are bots designed with the purpose of chatting and are also simply
referred to as “bots.”

 You’ll come across chatbots on business websites or messengers that give pre-scripted
replies to your questions. As the entire process is automated, bots can provide quick
assistance 24/7 without human intervention.
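
A minimal sketch of the pre-scripted behavior described above (keywords and replies are invented):

# Toy rule-based chatbot: matches keywords to canned replies (illustrative only)
REPLIES = {
    "hours": "We are open 9am-5pm, Monday to Friday.",
    "refund": "Refunds are processed within 5 business days.",
}

def bot_reply(message):
    # Return the first canned reply whose keyword appears in the message
    for keyword, reply in REPLIES.items():
        if keyword in message.lower():
            return reply
    return "Sorry, I didn't catch that. A human agent will follow up."

print(bot_reply("What are your opening hours?"))  # matches "hours"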

7. Email Filter
 One of the most fundamental and essential applications of NLP online is email filtering. It began with spam filters, which identified specific words or phrases that indicate a spam message. But, like other early NLP applications, filtering has since improved.
 Gmail's email categorization is one of the more common, newer implementations of NLP. Based on the contents of emails, the algorithm determines whether they belong in one of three categories (Primary, Social, or Promotions).
 This keeps the inbox manageable for all Gmail users, surfacing the critical, relevant emails you want to see and reply to quickly.
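
A toy sketch of the early keyword-based spam filtering described above (the phrase list is invented):

# Keyword-based spam filter of the early kind described above (illustrative only)
SPAM_PHRASES = ["free money", "act now", "you are a winner", "no cost"]

def is_spam(email_body):
    body = email_body.lower()
    # Flag the message if any known spam phrase appears in it
    return any(phrase in body for phrase in SPAM_PHRASES)

print(is_spam("Congratulations, you are a winner! Claim your free money."))  # True
print(is_spam("Meeting moved to 3pm tomorrow."))                             # False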
8. Search Autocorrect and Autocomplete
 When you type 2-3 letters into Google to search for anything, it displays a list of
probable search keywords. Alternatively, if you search for anything with mistakes, it
corrects them for you while still returning relevant results. Isn't it incredible?

 Everyone uses Google search autocorrect and autocomplete on a regular basis but seldom gives them any thought. They are a fantastic illustration of how natural language processing touches millions of people across the world, including you and me.
 Both search autocomplete and autocorrect make it much easier to locate accurate results.
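
A simple sketch of prefix-based autocomplete over a tiny, invented query log:

import bisect

# Toy query log, kept sorted so prefix matches are contiguous
QUERIES = sorted(["naive bayes classifier", "natural gas prices",
                  "natural language processing", "nature documentaries"])

def autocomplete(prefix, limit=3):
    # Binary-search to the first query >= prefix, then collect matches in order
    start = bisect.bisect_left(QUERIES, prefix)
    results = []
    for query in QUERIES[start:]:
        if not query.startswith(prefix) or len(results) == limit:
            break
        results.append(query)
    return results

print(autocomplete("natu"))
# ['natural gas prices', 'natural language processing', 'nature documentaries']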
3. Components of NLP
 There are two components of NLP: Natural Language Understanding (NLU) and Natural Language Generation (NLG).
 Natural Language Understanding (NLU) involves transforming human language into a machine-readable format. It helps the machine to understand and analyze human language by extracting elements from large bodies of text, such as keywords, emotions, relations, and semantics.
 Natural Language Generation (NLG) acts as a translator that converts computerized data into a natural language representation.
 NLG mainly involves text planning, sentence planning, and text realization.
 NLU is harder than NLG.

4. Steps in NLP
There are five general steps:
1. Lexical Analysis
2. Syntactic Analysis (Parsing)
3. Semantic Analysis
4. Discourse Integration
5. Pragmatic Analysis

Lexical Analysis:
 The first phase of NLP is lexical analysis.
 This phase scans the source text as a stream of characters and converts it into meaningful lexemes.
 It divides the whole text into paragraphs, sentences, and words.
 Lexeme: A lexeme is a basic unit of meaning. In linguistics, the abstract unit of morphological analysis that corresponds to a set of forms taken by a single word is called a lexeme.
 The way in which a lexeme is used in a sentence is determined by its grammatical category.

 A lexeme can be an individual word or a multiword expression.
 For example, the word talk is an individual-word lexeme, which may have many grammatical variants such as talks, talked, and talking.
 A multiword lexeme can be made up of more than one orthographic word. For example, speak up and pull through are multiword lexemes.
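
A small sketch of this phase using NLTK's tokenizers (the sample text is invented):

import nltk
nltk.download('punkt')  # one-time download of the sentence tokenizer models

text = "NLP is fun. It lets computers read human language."
sentences = nltk.sent_tokenize(text)                # split the text into sentences
words = [nltk.word_tokenize(s) for s in sentences]  # split each sentence into words

print(sentences)  # ['NLP is fun.', 'It lets computers read human language.']
print(words)      # [['NLP', 'is', 'fun', '.'], ['It', 'lets', ...]]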

Syntax Analysis (Parsing)


 Syntactic analysis is used to check grammar and word arrangement, and it shows the relationship among the words.
 A sentence such as “The school goes to boy” is rejected by an English syntactic analyzer.
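
A sketch of this idea with a toy context-free grammar in NLTK (the grammar is invented for illustration): the parser finds a tree for the well-formed word order and no tree for the ill-formed one.

import nltk

# Toy grammar: every noun phrase needs a determiner
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V PP
PP -> P NP
Det -> 'the'
N -> 'boy' | 'school'
V -> 'goes'
P -> 'to'
""")

parser = nltk.ChartParser(grammar)

good = "the boy goes to the school".split()
bad = "the school goes to boy".split()   # 'boy' lacks a determiner

print(len(list(parser.parse(good))))  # 1 -- one valid parse tree
print(len(list(parser.parse(bad))))   # 0 -- rejected by the grammar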

Semantic Analysis
 Semantic analysis is concerned with meaning representation.
 It mainly focuses on the literal meaning of words, phrases, and sentences.
 The semantic analyzer disregards sentences such as “hot ice-cream”.
 Another example: “Manhattan calls out to Dave” passes syntactic analysis because it is a grammatically correct sentence. However, it fails semantic analysis: because Manhattan is a place (and can’t literally call out to people), the sentence’s meaning doesn’t make sense.

Discourse Integration
 Discourse integration depends upon the sentences that precede a given sentence and also invokes the meaning of the sentences that follow it.
 For instance, if one sentence reads, “Manhattan speaks to all its people,” and the following sentence reads, “It calls out to Dave,” discourse integration checks the first sentence for context to understand that “It” in the latter sentence refers to Manhattan.

Pragmatic Analysis
 During this phase, what was said is re-interpreted as what it actually meant.
 It involves deriving those aspects of language which require real-world knowledge.
 For instance, a pragmatic analysis can uncover the intended meaning of “Manhattan speaks to all its people.” Methods like neural networks assess the context to understand that the sentence isn’t literal, and most people won’t interpret it as such. A pragmatic analysis deduces that this sentence is a metaphor for how people emotionally connect with a place.

Note: for details on the morphemes topic, refer to the notes.



Issues and challenges in morphological modeling and parsing in natural language processing (NLP).

1. Ambiguity in Language and Syncretism

Ambiguity arises when a linguistic expression can have multiple interpretations. This is a fundamental challenge in NLP and computational linguistics. The main cases are:

 Accidental Ambiguity: This happens when a word or phrase has multiple possible meanings depending on the context. For example, the word bank can refer to a financial institution or the side of a river.

 Ambiguity Due to Lexemes Having Multiple Senses: Some words can have multiple meanings even within a single grammatical category. For example, light can mean "not heavy" or "illumination."

 Syncretism (Systematic Ambiguity): This refers to cases where different grammatical categories share the same word form, making interpretation more difficult. For example, in English, the word deer remains the same in both singular and plural forms. In many languages, the same word form can be used for different grammatical cases, making disambiguation complex for NLP models.

2. Productivity and Creativity in Language

 Language is Dynamic: New words are constantly created due to cultural, technological, and
societal changes (e.g., "selfie" or "metaverse").

 Morphological Systems Can’t Always Handle New Words: Most NLP models rely on a
predefined lexicon or a set of grammatical rules. When they encounter a new word (a
neologism) or an unfamiliar usage, they may fail to parse or process it correctly.

 The Unknown Word Problem: Words that are not present in the model’s vocabulary remain
unprocessed. This problem is especially severe in:

o Speech or writing that includes domain-specific terminology.

o Conversations that mix multiple languages (code-switching).

o Cases where foreign names or new slang words are used.

 Example: If an NLP model trained only on standard English encounters a new internet slang
term like yeet, it may fail to understand or process it correctly.

3. Irregularity in Morphological Parsing

 Morphological Parsing: This refers to breaking down words into their smallest meaningful
units (morphemes). For example, the word unhappiness can be split into un- (prefix), happy
(root), and -ness (suffix).

 Generalization and Abstraction:

o NLP systems aim to create broad rules that can apply to many words.

o However, language has many exceptions (irregular forms), making this difficult.

 Challenges of Irregularity:

o Some words don’t follow standard rules (e.g., go → went instead of goed).

o Some words have multiple possible segmentations, leading to ambiguity.

o Some descriptions of linguistic data may be inaccurate or overly complex.

 Example:

o Regular English past tense: walk → walked (follows rule)

o Irregular English past tense: run → ran (does not follow rule)
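
A short sketch of this contrast using NLTK (one possible toolchain; the notes do not prescribe one): a rule-based stemmer handles the regular form but not the irregular one, while a dictionary-backed lemmatizer recovers it.

import nltk
nltk.download('wordnet')  # lexicon used by the lemmatizer (one-time)
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Regular past tense: the suffix-stripping rule works
print(stemmer.stem("walked"))                 # 'walk'

# Irregular past tense: there is no suffix to strip, so the rule fails
print(stemmer.stem("ran"))                    # 'ran' (cannot recover 'run')

# A lemmatizer with an exception dictionary handles the irregular form
print(lemmatizer.lemmatize("ran", pos="v"))   # 'run'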

Topic Boundary Detection:

Topic boundary detection is another important subtask of finding the structure of documents in NLP. It involves identifying the points in a document where the topic or theme of the text shifts. This task is particularly useful for organizing and summarizing large amounts of text, as it allows for the identification of different topics or subtopics within a document.

Topic boundary detection is a challenging task, as it involves understanding the underlying semantic structure and meaning of the text, rather than simply identifying specific markers or patterns. As such, several methods and techniques have been developed to address this challenge, including:

1. Lexical cohesion: This method looks at the patterns of words and phrases that appear in a text, and identifies changes in the frequency or distribution of these patterns as potential topic boundaries. For example, if the frequency of a particular keyword or phrase drops off sharply after a certain point in the text, this could indicate a shift in topic (see the sketch after this list).

2. Discourse markers: This method looks at the use of discourse markers, such as "however", "in contrast", and "furthermore", which are often used to signal a change in topic or subtopic. By identifying these markers in a text, it is possible to locate potential topic boundaries.

3. Machine learning: This method involves training a machine learning model to identify patterns and features in a text that are associated with topic boundaries. This can involve using a variety of linguistic and contextual features, such as sentence length, word frequency, and part-of-speech tags, to identify potential topic boundaries.
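
A toy sketch of the lexical cohesion idea from method 1 (the sentences, stopword list, and 0.1 threshold are all invented for illustration): compare adjacent sentences by word overlap and flag sharp dips in similarity as candidate boundaries.

from collections import Counter
import math

STOPWORDS = {"the", "a", "is", "on", "in", "this", "when"}

def vectorize(sentence):
    # Bag-of-words vector over content words only
    words = [w for w in sentence.lower().replace(".", "").split() if w not in STOPWORDS]
    return Counter(words)

def cosine(a, b):
    # Cosine similarity between two word-count vectors
    num = sum(a[w] * b[w] for w in a)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

sentences = [
    "The cat purrs when the cat is happy.",
    "The cat chases mice around the house.",
    "The cat sleeps on the mat.",
    "The market fell sharply today.",
    "Investors sold shares in the market.",
    "The market closed lower this week.",
]

vectors = [vectorize(s) for s in sentences]
for i in range(len(vectors) - 1):
    sim = cosine(vectors[i], vectors[i + 1])
    boundary = "  <-- possible topic boundary" if sim < 0.1 else ""
    print(f"gap {i}-{i + 1}: similarity {sim:.2f}{boundary}")
# Only the gap between sentence 2 (cats) and sentence 3 (markets) is flagged.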

METHODS:

There are several methods and techniques used in NLP to find the structure of documents, which include:

1. Sentence boundary detection: This involves identifying the boundaries between sentences in a document, which is important for tasks like parsing, machine translation, and text-to-speech synthesis.

2. Part-of-speech tagging: This involves assigning a part of speech (noun, verb, adjective, etc.) to each word in a sentence, which is useful for tasks like parsing, information extraction, and sentiment analysis.

3. Named entity recognition: This involves identifying and classifying named entities (such as people, organizations, and locations) in a document, which is important for tasks like information extraction and text categorization.

4. Coreference resolution: This involves identifying all the expressions in a text that refer to the same entity, which is important for tasks like information extraction and machine translation.

5. Topic boundary detection: This involves identifying the points in a document where the topic or theme of the text shifts, which is useful for organizing and summarizing large amounts of text.

6. Parsing: This involves analyzing the grammatical structure of sentences in a document, which is important for tasks like machine translation, text-to-speech synthesis, and information extraction.

7. Sentiment analysis: This involves identifying the sentiment (positive, negative, or neutral) expressed in a document, which is useful for tasks like brand monitoring, customer feedback analysis, and market research.

Generative Sequence Classification Methods:

The most commonly used generative sequence classification method for topic and sentence boundary detection is the hidden Markov model (HMM).

An HMM has five components:

1. States (hidden states): the things we want to predict (e.g., POS tags: Noun, Verb, Adjective).
2. Observations (visible outputs): the actual words we see in a sentence (e.g., "dogs", "run", "quickly").
3. Transition probabilities (A): the probability of moving from one hidden state to another (e.g., P(Noun → Verb)).
4. Emission probabilities (B): the probability of a word being generated from a state (e.g., P("run" | Verb)).
5. Initial probabilities (π): the probability of starting in a particular state.

HMM for POS tagging: training on a small dataset and testing on a new sentence.

Training data: “The cat runs”, “A dog barks”

Testing data: “A dog runs”

from nltk.tag import hmm

# Tiny hand-tagged training corpus: each sentence is a list of (word, tag) pairs
train_data = [[('The', 'DET'), ('cat', 'NOUN'), ('runs', 'VERB')],
              [('A', 'DET'), ('dog', 'NOUN'), ('barks', 'VERB')]]

# Estimate initial, transition, and emission probabilities from the tagged data
trainer = hmm.HiddenMarkovModelTrainer()
hmm_model = trainer.train(train_data)

# Tag the unseen test sentence; every word here was seen during training
test_sentence = ["A", "dog", "runs"]
predicted_tags = hmm_model.tag(test_sentence)
print(predicted_tags)  # expected: [('A', 'DET'), ('dog', 'NOUN'), ('runs', 'VERB')]

Bayes rule:
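
Bayes' rule relates a conditional probability to its inverse (a standard statement, added here for completeness):

P(A | B) = P(B | A) · P(A) / P(B)

In HMM tagging, Bayes' rule underlies choosing the tag sequence T that maximizes P(T | W) for an observed word sequence W: since P(T | W) ∝ P(W | T) · P(T), the emission probabilities supply P(W | T) and the initial and transition probabilities supply P(T), and the tagger picks the T that maximizes this product.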