Unit 1 NLP

The document provides an overview of Natural Language Processing (NLP), its key areas, applications, and challenges, particularly focusing on ambiguity in language. It also introduces the Natural Language Toolkit (NLTK), detailing various NLP tasks such as tokenization, stemming, lemmatization, and named entity recognition. The document serves as a foundational guide for understanding NLP concepts and tools used in the field.


Noida Institute of Engineering and Technology, Greater Noida

Subject code ACSA10712


Subject name Natural Language Processing

Unit – 1
Overview of Natural Language Processing
Definition, applications and emerging trends in NLP; challenges; ambiguity; NLP
tasks using NLTK: tokenization, stemming, lemmatization, stop-word removal, POS
tagging, parsing, named entity recognition, coreference resolution.
Natural Language Processing

What is NLP?
Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) and computational linguistics focused on enabling computers to understand, interpret, and respond to human language in a way that is both meaningful and useful.
Key Areas of NLP

• Text Analysis:
– Tokenization: Breaking down text into smaller units, such as words
or sentences.
– Part-of-Speech Tagging: Identifying the grammatical parts of speech
(nouns, verbs, adjectives, etc.) in a sentence.
– Named Entity Recognition (NER): Identifying and classifying
named entities (people, organizations, locations, etc.) in text.
– Sentiment Analysis: Determining the sentiment or emotion expressed
in a piece of text, such as positive, negative, or neutral.
• Speech Recognition:
– Converting spoken language into text, enabling voice-activated
systems like virtual assistants (e.g., Siri, Alexa) to understand and
process voice commands.
Key Areas of NLP
• Machine Translation:
– Automatically translating text or speech from one language to another,
as seen in services like Google Translate.
• Text Generation:
– Automatically generating coherent and contextually relevant text, such
as in chatbots, content creation, or language models like GPT.
• Question Answering:
– Building systems that can answer questions posed in natural language
by retrieving and summarizing relevant information from a large
dataset.
• Text Summarization:
– Condensing a large piece of text into a shorter version while
preserving its meaning and key information.
Applications of NLP

• Virtual Assistants: NLP is used in virtual assistants like Siri, Alexa, and
Google Assistant to understand voice commands and respond
appropriately.
• Chatbots: Many businesses use NLP-based chatbots to provide
customer support and answer common queries.
• Search Engines: NLP helps search engines like Google understand user
queries and deliver relevant search results.
• Translation Services: NLP powers translation tools that convert text
from one language to another.
• Sentiment Analysis: Companies use sentiment analysis to monitor
social media, reviews, and other forms of user feedback to gauge public
opinion.
Challenges in NLP

• Ambiguity: Natural language is often ambiguous, with words having multiple meanings or sentences that can be interpreted in different ways.
• Context: Understanding the context of a conversation or text is essential
for accurate processing, which can be challenging.
• Cultural Differences: Language use varies across cultures, making it
difficult to create models that work universally.
• Evolving Language: Language constantly evolves, with new slang,
idioms, and usage patterns emerging regularly.
Challenges in NLP - Ambiguity

Natural language ambiguity refers to situations where a word, phrase, or sentence has multiple meanings, making it challenging to interpret correctly.
Some common forms of ambiguity include
1. Lexical Ambiguity
2. Syntactic Ambiguity
3. Semantic Ambiguity
4. Referential Ambiguity
5. Contextual Ambiguity
Challenges in NLP - Ambiguity

1. Lexical Ambiguity
• Lexical means relating to the words of a language.
• During lexical analysis, the given paragraphs are broken down into words or tokens. Each token has a specific meaning.
There can be instances where a single word can be interpreted in multiple ways. Ambiguity that is caused by the word alone, rather than by the context, is known as lexical ambiguity.

Example: “Give me the bat!”

In this example “bat” has more than one meaning: the animal or a cricket bat.
Challenges in NLP – Ambiguity continue….

Lexical ambiguity is divided into two categories:

1. Polysemy
○ One word has many related meanings.
○ The task is determining the sense of the word in a particular context:
■ He sat on the bank of a river / Withdraw money from the bank
■ Maruti has built a plant to manufacture cars / A man was planted in the audience to raise anti-political slogans
2. Homonymy
○ A single word having multiple but unrelated meanings.
Examples: bear, left, Pole
Challenges in NLP - Ambiguity

A bear (the animal) can bear (tolerate) very cold temperatures.

The driver turned left (opposite of right) and left (departed from) the main
road.

Pole and pole — the first, Pole, refers to a citizen of Poland (who can be called Polish or a Pole); the second, pole, refers to a bamboo pole or any other wooden pole.
Challenges in NLP - Ambiguity

2. Syntactic Ambiguity / Structural Ambiguity

Syntactic meaning refers to the grammatical structure and rules that define how words should be combined to form sentences and phrases.
When a sentence can be interpreted in more than one way because of its structure or syntax, the ambiguity is referred to as syntactic ambiguity.
Example 1: “Old men and women”
There are two possible meanings:
Old men, and women of any age.
Old men and old women.
Example 2: “John saw the boy with a telescope.”
In this example, the two possible meanings are:
John saw the boy through his telescope.
John saw the boy who was holding the telescope.
Challenges in NLP - Ambiguity

3. Semantic Ambiguity – semantics is nothing but “meaning”.

• The semantics of a word or phrase refers to the way it is typically understood or interpreted by people.
• Syntax describes the rules by which words can be combined into sentences, while semantics describes what they mean.
This type of ambiguity occurs when a sentence has more than one interpretation or meaning.
Example 1: “The chicken is ready to eat.”
The chicken (as food) is cooked and ready to be eaten.
The chicken (the animal) is hungry and ready to eat something.
Example 2: “Seema loves her mother and Sriya does too.”
In example 2 there are two interpretations:
Sriya loves Seema’s mother, or Sriya loves her own mother.
Challenges in NLP - Ambiguity

4. Anaphoric (Referential) Ambiguity

A word that gets its meaning from a preceding word or phrase is called an anaphor; the ambiguity arises when the anaphor could refer to more than one antecedent.
Example 1 – “The house is on the longest street. It is very dirty.”
In example 1, does “It” refer to the house or to the long street?
Example 2 – “I went to the hospital, and they told me to go home and rest.”
In example 2, “they” does not explicitly refer to the hospital; instead it refers to the doctor or staff who attended the patient in the hospital.
5. Pragmatic (Contextual) Ambiguity – pragmatics focuses on the real-time usage of language: what the speaker wants to convey and how the listener infers it.
Example 1: “Do you know what time it is?”
One meaning is that someone is asking for the time; another is that someone is showing anger because a due time was missed.
Natural Language ToolKit (NLTK)

The Natural Language Toolkit (NLTK) is a Python programming environment for creating applications for statistical natural language processing (NLP).

1. Tokenization
2. Sentence Segmentation
3. Corpus and Vocabulary
4. Stop words
5. Stemming and Lemmatization
6. Named Entity Recognition
7. Co-referencing Resolution
8. POS tagging
9. Parsing
Natural Language ToolKit (NLTK) – 1.Tokenization

● The tokenization method is used to split a sentence, paragraph, or full-text document into smaller units – tokens.
I love NLP.
['I', 'love', 'NLP', '.']
○ A token is the basic unit of a language.
○ Tokenization helps to interpret the meaning of the text by analysing the words present in it.
○ It lets us count the number and frequency of words in the text.
Natural Language ToolKit (NLTK) – 2. Sentence Segmentation

You first need to break the entire document down into its constituent sentences. You can do this by segmenting the article along its punctuation, such as full stops and question marks.
● Splitting the given input text into sentences.
● Characters used for marking a sentence end – ‘!’, ‘?’, ‘.’
● Ambiguities:
○ The yearly results of Yahoo! are promising.
○ We are using the .NET framework for our project.
○ Susan scored 78.5% marks in her exams.
○ Mr. Mehta is doing a great job.
● Rules like – numbers around the ‘.’
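A quick sketch using NLTK’s sentence tokenizer (assuming the ‘punkt’ models have been downloaded) shows how such ambiguous ‘.’ characters are typically handled:

import nltk
nltk.download('punkt')                       # one-time download of the sentence tokenizer models
from nltk.tokenize import sent_tokenize

text = "Mr. Mehta is doing a great job. Susan scored 78.5% marks in her exams."
print(sent_tokenize(text))
# The abbreviation 'Mr.' and the decimal point in 78.5 should not end a sentence,
# so the expected result is two sentences.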
Natural Language ToolKit (NLTK) – 3. Corpus and Vocabulary

● text corpus
○ The set of text documents used for the model.
○ e.g. For model to analyze movie reviews, corpus is the set of documents each
containing a movie review.
○ Documents set divided into training/testing for the model
● Vocabulary
○ The unique set of words in the entire corpus
○ Usually, the feature vector is based on the vocabulary of the corpus
○ Vocabulary size – number of words in the vocabulary
● Freely available corpus
○ Links to some freely available corpora - www.nltk.org
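As a small sketch (the two-review toy corpus below is only an assumption for illustration), the vocabulary is simply the set of unique tokens across all documents:

import nltk
nltk.download('punkt')                                   # tokenizer models (one-time)

corpus = ["The movie was great", "The plot was weak but the acting was great"]   # toy corpus
tokens = [w.lower() for doc in corpus for w in nltk.word_tokenize(doc)]
vocabulary = sorted(set(tokens))                         # unique words in the corpus
print(vocabulary)
print("Vocabulary size:", len(vocabulary))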
Natural Language ToolKit (NLTK) – 3. Corpus and Vocabulary (continued)
Some popular corpora available:
● Movie review dataset (IMDB dataset)
○ Consisting of 1000 negative and 1000 positive labeled movie reviews
○ http://www.cs.cornell.edu/people/pabo/movie-review-data/
● Amazon product review datasets
○ DVD dataset, Sports and Outdoor datasets
○ Each consisting of 1000 negative and 1000 positive labeled product reviews.
○ http://www.cs.jhu.edu/~mdredze/datasets/sentiment/
Natural Language ToolKit (NLTK) – 4. Stop words

Stop words are very common words in a language that carry little useful information.

● Articles, prepositions, pronouns, conjunctions, etc.
○ e.g. “the”, “in”, “of”, “his”, “and”, etc.
● Removal of stop-word tokens:
○ Keeps the focus on the important information.
○ Reduces the dataset size and training time.
● Stop words may still be needed for:
○ Relational queries – “flights to Tokyo”
○ Phrases like “To be or not to be”
○ Prediction of sentiment – “I told you that she was not happy” → “told, happy” (removing “not” flips the meaning)
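A minimal stop-word removal sketch using NLTK’s built-in English stop-word list (note how it also drops “not”, which is exactly the sentiment problem mentioned above):

import nltk
nltk.download('stopwords')                    # one-time downloads
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
tokens = word_tokenize("I told you that she was not happy".lower())
filtered = [w for w in tokens if w not in stop_words]
print(filtered)                               # 'not' is in the stop-word list, so the negation is lost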
Natural Language ToolKit (NLTK) – 5. Stemming and Lemmatization

● Reduces the form of a word to a common base form.
e.g. (go, going, gone) -> go; (running, ran, runs, run) -> run
● Used to prepare text, words, and documents for further processing.
● When we search for a word on the web, the search also retrieves variations of the word. If we search for, say, ‘kill’, it may also return words like killer, killing, killed, kills.
● Here kill is the stem of killer, killed, killing, kills. It conveys that each of these carries the idea of ‘kill’.
Natural Language ToolKit (NLTK) – 5. Stemming and Lemmatization

● Stemming uses heuristics that:
○ chop off letters from the end of the word;
○ transform the end letters.
● Lemmatization groups together the inflected forms of a word under its lemma.

Stemming (word → suffix removed → stem):
Was → – → was
Cats → s → cat
Changing → ing → chang
Studies → es → studi
Studying → ing → study

Lemmatization (word forms → lemma):
Is, was, were → Be
Cats → Cat
Changing, changed, change → change
Studies, studying → Study
Natural Language ToolKit (NLTK) – 5. Stemming and Lemmatization

Stemming vs. Lemmatization:
● Stemming is fast and simple (pattern based); lemmatization needs POS tagging and dictionaries.
● Example stemmers: Snowball, Porter; example lemmatizers: LemmaGen, Morpha.
● Stemming returns the stem of a word, which may not be in the vocabulary; lemmatization returns a proper word, the lemma.
● Stemming is crude and less useful; lemmatization is more informative.

● Stemming and lemmatization are methods used by search engines and chatbots to analyse the meaning behind a word.
● Stemming uses only the stem of the word, while lemmatization uses the context in which the word is being used.
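A short sketch comparing NLTK’s Porter stemmer with the WordNet lemmatizer (exact outputs depend on the NLTK version, so they are only indicative):

import nltk
nltk.download('wordnet')                                  # lemmatizer dictionary (one-time)
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["cats", "changing", "studies", "studying"]
print([stemmer.stem(w) for w in words])                   # crude stems, may not be real words
print([lemmatizer.lemmatize(w) for w in words])           # lemmas, treating words as nouns by default
print(lemmatizer.lemmatize("was", pos="v"))               # 'be' once the verb POS is supplied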
Natural Language ToolKit (NLTK) – 6. Named Entity Recognition

● Named Entities (NEs) are proper names in texts, i.e. the names of people, organizations, locations, times and quantities.
● NER is the task of processing a text and identifying the named entities in it.
● Applications:
○ Helps identify the key elements in a text.
■ Helps sort unstructured data and detect important information.
○ Useful in question-answering systems.
■ “Where was Mahatma Gandhi born?”
Example (entities tagged): Hi, my name is Shubhangi Deb [PERSON]. I am from Australia [GPE]. I want to work with Amazon [ORG]. Jeff Bezos [PERSON] is my inspiration.
(Figure: named entity recognition with machine learning)
Natural Language ToolKit (NLTK) – 6. Named Entity Recognition

● Applications
○ Processing Resumes
■ Looking for information from resumes formatted differently.
■ Personal information, experience, skills, degree etc,.
○ Gain Insights from Customer Feedback
■ Organize all this customer feedback and pinpoint repeating problem areas
■ Areas of customer likes/dislikes/improvement areas
○ Content Recommendation:
■ If you watch a lot of comedies on Netflix, you’ll get more recommendations
that have been classified as the entity ‘Comedy’.
Natural Language ToolKit (NLTK) – 7. Coreference Resolution

● Identify all expressions that refer to the same entity.
○ Mohan went to McDonald’s to buy a burger. He visits the store very often and loves its food.

Original sentence:
“I voted for Biden because he was most aligned to my principles”, Jenna said.

Sentence with resolved coreferences:
“Jenna voted for Biden because Biden was most aligned to Jenna’s principles”, Jenna said.

● Uses – it is an important step for many higher-level NLP tasks that involve natural language understanding.
Natural Language ToolKit (NLTK) – 7. Coreference Resolution

● Uses
○ Document summarization
○ Question answering
○ Machine translation
● Anaphora (backward references)
○ Refers to any reference that “points backward” to information that was
presented earlier in the text
“The apple on the table was rotten. It had been there for three days.”
● Cataphora (forward references)
○ Refers to any reference that “points forward” to information that will be
presented later in the text
“It has four legs. The cow is a domestic animal.”
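NLTK has no built-in coreference resolver, so the sketch below is only a toy heuristic (an illustrative assumption: replace he/she/it with the most recently seen proper noun). It works on a simple sentence but fails as soon as several candidate antecedents appear, which is exactly why real coreference resolution is hard:

import nltk
nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')   # one-time downloads
from nltk import word_tokenize, pos_tag

def naive_coref(text):
    # Toy heuristic: substitute he/she/it with the last proper noun (NNP) seen so far.
    resolved, last_propn = [], None
    for word, tag in pos_tag(word_tokenize(text)):
        if tag == "NNP":
            last_propn = word
        if word.lower() in {"he", "she", "it"} and last_propn:
            resolved.append(last_propn)
        else:
            resolved.append(word)
    return " ".join(resolved)

print(naive_coref("Mohan went to the store to buy a burger. He visits the store very often."))
# expected: "He" is replaced by "Mohan"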
Natural Language ToolKit (NLTK) – 8. POS tagging

● Assign a part of speech to each word in the text:
○ Nouns: define any object or entity.
○ Verbs: define some action.
○ Pronouns: can replace a noun – she, him.
○ Adjectives: describe a noun/pronoun.
● In a sentence, every word will be associated with a proper POS tag.
(Figure: the parts of speech – noun, pronoun, verb, adjective, adverb, preposition, conjunction, interjection)

Example:
Puja (Proper Noun, NNP)  bought (Verb, VBN)  a (Determiner, DT)  new (Adjective, JJ)  phone (Noun, NN)  from (Preposition, IN)  Samsung (Proper Noun, NNP)  Store (Noun, NN)
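The same sentence can be tagged with NLTK in two lines (a sketch assuming the tagger models are downloaded); the result is a list of (token, Penn Treebank tag) pairs:

import nltk
nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')   # one-time downloads
from nltk import word_tokenize, pos_tag

print(pos_tag(word_tokenize("Puja bought a new phone from Samsung Store")))
# a list such as [('Puja', 'NNP'), ('new', 'JJ'), ('phone', 'NN'), ('Samsung', 'NNP'), ...]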


Natural Language ToolKit (NLTK) – 9. NLTK

NLTK (Natural Language Toolkit) is a library for NLP in Python.


It is a powerful tool to preprocess text data for further analysis, for example as input to machine learning algorithms.
Tokenization, POS tagging, stemming, etc.
It includes many corpora and lexical resources (like WordNet)
Natural Language ToolKit (NLTK) – 9. NLTK

Installation
pip install nltk

Downloading the datasets:


import nltk
nltk.download()

Choose whatever packages you want to download from the screen.

Source: https://thinkinfi.com/how-to-download-nltk-corpus-manually//
Natural Language ToolKit (NLTK) – 9. NLTK

Operations using NLTK:


Tokenization
import nltk
text = "First sentence. Second sentence."
nltk.word_tokenize(text)
Output: ['First', 'sentence', '.', 'Second', 'sentence', '.']

Sentence splitting
nltk.sent_tokenize(text)
Output: ['First sentence.', 'Second sentence.']
Natural Language ToolKit (NLTK) – 9. NLTK

Accessing Text Corpora in NLTK:


● Corpus: A set of text documents usually having some common characteristic.
Corpora is plural of corpus.
set of online books
set of movie reviews
collection of tweets
• The NLTK corpus is a set of natural language datasets stored in the nltk_data directory.
In NLTK some corpora are included:
Gutenberg corpus (online books)
Brown corpus (categorised – news, humor, etc.)
Reuters corpus (news)
WordNet – the most advanced; contains words, synonyms, antonyms, etc.
from nltk.corpus import gutenberg
Natural Language ToolKit (NLTK) – NLTK

Accessing Gutenberg corpus:


● NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, which
contains some 25,000 free electronic books
See website for full list - http://www.gutenberg.org/
To download the corpus – nltk.download('gutenberg')
To find the Gutenberg collections downloaded with the NLTK package – nltk.corpus.gutenberg.fileids()
To find the words in a text file – gutenberg.words(fileid)
To find the sentences in a file – gutenberg.sents(fileid)
To get the text of the file as a string – gutenberg.raw(fileid)
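Putting those commands together, a minimal sketch (the file names shown are examples from the downloaded collection):

import nltk
nltk.download('gutenberg')                        # one-time download
from nltk.corpus import gutenberg

print(gutenberg.fileids()[:3])                    # first few files, e.g. 'austen-emma.txt'
fileid = gutenberg.fileids()[0]
print(len(gutenberg.words(fileid)))               # number of word tokens in the file
print(len(gutenberg.sents(fileid)))               # number of sentences in the file
print(gutenberg.raw(fileid)[:80])                 # first 80 characters of the raw text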
In NLTK some corpora are included
Gutenberg corpus
Brown corpus
To check whether the phone number “0120-4543466” occurs in the string “His contact number is 0120-4543466”, the Python expression "0120-4543466" in "His contact number is 0120-4543466" is enough.

If we only know the format – ####-####### (phone number) or ##/##/#### (date) – we need regular expressions to search for these patterns (a short sketch follows).
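A minimal sketch with Python’s re module (the patterns are assumptions matching the formats above):

import re

text = "His contact number is 0120-4543466 and the exam is on 12/09/2024."
phone = re.search(r"\d{4}-\d{7}", text)           # ####-####### pattern
date = re.search(r"\d{2}/\d{2}/\d{4}", text)      # ##/##/#### pattern
print(phone.group() if phone else None)           # 0120-4543466
print(date.group() if date else None)             # 12/09/2024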
Natural Language ToolKit (NLTK) – NLTK

Accessing Browns corpus:


● Contains text categorized by genre, such as news, humor etc.
● To find the categories
from nltk.corpus import brown
brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor',
'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
To find the files in the category 'humor':
brown.fileids(categories='humor')
['cr01', 'cr02', 'cr03', 'cr04', 'cr05', 'cr06', 'cr07', 'cr08', 'cr09']
To find the words in the file 'cr01':
brown.words(fileids=['cr01'])
['It', 'was', 'among', 'these', 'that', 'Hinkle', ...]
For details refer : http://www.nltk.org/book_1ed/ch02.html
Text Preprocessing in NLP

Text Preprocessing in NLP


● Convert raw text into a set of tokens that the computer can understand and use.
● Ready for feature extraction.
● Data cleaning and pre-processing is critical for the quality of further analysis.
● Pre-processing steps depend on:
a. the data – structured (movie reviews) or unstructured (tweets);
b. the application for which the data needs to be used.
Text Preprocessing in NLP

Steps in Text Preprocessing:

● Convert all characters to lower case
○ e.g. Hello, HELLO, hello, hellO -> hello
● Remove HTML tags, URLs, email ids
○ http://www.example.com/index.html, [email protected]
○ Text between HTML tags
● Convert data to a standard form
○ 2mrw, tmrw -> tomorrow; btwn, btw -> between; b4 -> before
Text Preprocessing in NLP

● Emojis
○ Remove them / replace with a word / use them for sentiment analysis
● Replace characters repeated many times (Twitter)
○ e.g. it was ssssoooo nice -> it was so nice
● Replace contractions (short forms with full words)
○ I’m -> I am, didn’t -> did not
● Removal of punctuation
● Removal of rare/frequent words
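A combined cleaning sketch for the steps on the last two slides (the regular expressions and the small contraction/slang dictionaries are illustrative assumptions, not exhaustive rules):

import re

CONTRACTIONS = {"i'm": "i am", "didn't": "did not"}                      # illustrative only
SLANG = {"2mrw": "tomorrow", "tmrw": "tomorrow", "btw": "between", "b4": "before"}

def clean(text):
    text = text.lower()                                                  # normalise case
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)                   # drop URLs
    text = re.sub(r"\S+@\S+", " ", text)                                 # drop e-mail ids
    text = re.sub(r"<[^>]+>", " ", text)                                 # drop HTML tags
    for short, full in {**CONTRACTIONS, **SLANG}.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)      # expand short forms
    text = re.sub(r"(.)\1{2,}", r"\1", text)                             # ssssoooo -> so
    text = re.sub(r"[^\w\s]", " ", text)                                 # remove punctuation
    return re.sub(r"\s+", " ", text).strip()

print(clean("It was ssssoooo nice!! See http://www.example.com b4 2mrw, I'm happy."))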
Text Preprocessing in NLP

● Tokenization
'the new policy of the government is good'
● Sentence segmentation (if required)
text = "First in class. Last in class."
Output: ['First in class.', 'Last in class.']
● Remove stop words
Tokens as input – ['the', 'new', 'policy', 'of', 'the', 'government', 'is', 'good']
['new', 'policy', 'government', 'good']
Text Preprocessing in NLP

● Parts of Speech (POS) tagging
Tokens as input – ['he', 'loves', 'to', 'play', 'with', 'toys', 'in', 'morning']
[('he', 'PRP'), ('loves', 'VBZ'), ('to', 'TO'), ('play', 'VB'), ('with', 'IN'), ('toys', 'NN'), ('in', 'IN'), ('morning', 'NN')]

● Stemming
Tokens as input – ['Stemming', 'usually', 'tries', 'to', 'convert', 'the', 'word', 'into', 'its', 'root', 'format']
Stemming – ['stem', 'us', 'tri', 'to', 'convert', 'the', 'word', 'into', 'it', 'root', 'form']

● Lemmatization
Tokens as input – ['Stemming', 'usually', 'tries', 'to', 'convert', 'the', 'word', 'into', 'its', 'root', 'format']
Lemmatization – ['Stemming', 'usually', 'try', 'to', 'convert', 'the', 'word', 'into', 'it', 'root', 'format']
Text Preprocessing in NLP

● Named Entity Recognition
String 'text' as input:
text = "Tom is good at playing football and stays in London."
tokens = word_tokenize(text)
pos_text = nltk.pos_tag(tokens)

nes = nltk.ne_chunk(pos_text, binary=True)
OUTPUT:
(NE Tom/NNP) ('is', 'VBZ') ('good', 'JJ') ('at', 'IN') ('playing', 'VBG') ('football', 'NN') ('and', 'CC') ('stays', 'NNS') ('in', 'IN') (NE London/NNP) ('.', '.')

nes = nltk.ne_chunk(pos_text, binary=False)
OUTPUT:
(PERSON Tom/NNP) ('is', 'VBZ') ('good', 'JJ') ('at', 'IN') ('playing', 'VBG') ('football', 'NN') ('and', 'CC') ('stays', 'NNS') ('in', 'IN') (GPE London/NNP) ('.', '.')

With binary=True every detected entity is tagged simply as NE; with binary=False the entity type (PERSON, GPE, ORG, ...) is shown.
Text Preprocessing in NLP

How do we implement these NLP concepts:

Language: Python
● Why Python?
○ Has simple syntax
○ Has extensive collection of NLP tools and libraries
● Python Libraries used in this course
○ Numpy
○ Pandas
○ Matplotlib
○ SciKit-Learn
○ NLTK
Text Preprocessing in NLP

Summary
● Now you have an idea of what NLP is, its applications and the emerging
trends and challenges in using it.
● You also learnt about the basic concepts of NLP like
○ Corpus and Vocabulary
○ Text Normalization (Tokenization , Stemming and Lemmatization,
Stop words, Sentence segmentation )
○ POS tagging, Named Entity Recognition, Co-referencing Resolution
○ Parsing

●Implementation of the above concepts using NLTK
