Unit 1 NLP
Unit – 1
Overview of Natural Language Processing
Definition, applications and emerging trends in NLP, challenges, ambiguity, NLP
tasks using NLTK: tokenization, stemming, lemmatization, stop-word removal, POS
tagging, parsing, named entity recognition, coreference resolution.
Natural Language Processing
What is NLP
Natural Language Processing (NLP) is a subfield of artificial
intelligence (AI) and computational linguistics focused on enabling
computers to understand, interpret, and respond to human language
in a way that is both meaningful and useful.
Key Areas of NLP
• Text Analysis:
– Tokenization: Breaking down text into smaller units, such as words
or sentences.
– Part-of-Speech Tagging: Identifying the grammatical parts of speech
(nouns, verbs, adjectives, etc.) in a sentence.
– Named Entity Recognition (NER): Identifying and classifying
named entities (people, organizations, locations, etc.) in text.
– Sentiment Analysis: Determining the sentiment or emotion expressed
in a piece of text, such as positive, negative, or neutral.
• Speech Recognition:
– Converting spoken language into text, enabling voice-activated
systems like virtual assistants (e.g., Siri, Alexa) to understand and
process voice commands.
• Machine Translation:
– Automatically translating text or speech from one language to another,
as seen in services like Google Translate.
• Text Generation:
– Automatically generating coherent and contextually relevant text, such
as in chatbots, content creation, or language models like GPT.
• Question Answering:
– Building systems that can answer questions posed in natural language
by retrieving and summarizing relevant information from a large
dataset.
• Text Summarization:
– Condensing a large piece of text into a shorter version while
preserving its meaning and key information.
Applications of NLP
• Virtual Assistants: NLP is used in virtual assistants like Siri, Alexa, and
Google Assistant to understand voice commands and respond
appropriately.
• Chatbots: Many businesses use NLP-based chatbots to provide
customer support and answer common queries.
• Search Engines: NLP helps search engines like Google understand user
queries and deliver relevant search results.
• Translation Services: NLP powers translation tools that convert text
from one language to another.
• Sentiment Analysis: Companies use sentiment analysis to monitor
social media, reviews, and other forms of user feedback to gauge public
opinion.
Challenges in NLP
1. Lexical ambiguity
• Lexical means relating to words of a language.
• During lexical analysis, a given paragraph is broken down into words or
tokens. Each token carries a specific meaning.
A single word can sometimes be interpreted in more than one way.
Ambiguity caused by the word itself, rather than by its context, is known as
lexical ambiguity.
Example: The driver turned left (opposite of right) and left (departed from) the main
road.
Example: Pole and pole. The first refers to a citizen of Poland, who can
be called either Polish or a Pole. The second refers to a
bamboo pole or any other wooden pole.
Challenges in NLP - Ambiguity
1. Tokenization
2. Sentence Segmentation
3. Corpus and Vocabulary
4. Stop words
5. Stemming and Lemmatization
6. Named Entity Recognition
7. Coreference Resolution
8. POS tagging
9. Parsing
Natural Language ToolKit (NLTK) – 1. Tokenization
You first need to break the entire document down into its constituent
sentences. You can do this by segmenting the text at sentence-ending
punctuation such as full stops, question marks and exclamation marks.
● Splitting the given input text into sentences
● Characters used for defining sentence end - ‘!’, ‘?’, ‘.’
● Ambiguities:
○ The yearly results of Yahoo! are promising.
○ We are using the .NET framework for our project.
○ Susan scored 78.5% marks in her exams.
○ Mr. Mehta is doing a great job.
● Rules help resolve these cases, e.g. a ‘.’ with digits on both sides does not end a sentence
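The ambiguous cases above can be handled, up to a point, with a few hand-written rules. The following is a minimal sketch in plain Python, not NLTK's own algorithm: it treats '.', '!' or '?' as a sentence end only when whitespace and a capital letter follow, and it skips a small abbreviation list. The names `ABBREVIATIONS` and `split_sentences` are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical, deliberately small abbreviation list (an assumption for this sketch).
ABBREVIATIONS = {"mr.", "mrs.", "ms.", "dr.", "prof."}

def split_sentences(text):
    """Split text at '.', '!' or '?' when whitespace and a capital letter follow."""
    sentences = []
    start = 0
    # Candidate boundary: end punctuation, then whitespace, then an uppercase letter.
    # "78.5" and ".NET" never match because no whitespace follows the '.'.
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        candidate = text[start:match.end()].strip()
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # "Mr. Mehta": the period belongs to the abbreviation
        sentences.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

text = ("The yearly results of Yahoo! are promising. "
        "We are using the .NET framework for our project. "
        "Susan scored 78.5% marks in her exams. "
        "Mr. Mehta is doing a great job.")

for sentence in split_sentences(text):
    print(sentence)
```

Rules like these are brittle: a capitalised word directly after "Yahoo!" would still cause a wrong split. This is one reason trained segmenters, such as the Punkt model behind `nltk.sent_tokenize`, are normally preferred.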
Natural Language ToolKit (NLTK) – 3. Corpus and Vocabulary
● Text corpus
○ The set of text documents used for the model.
○ e.g. for a model that analyzes movie reviews, the corpus is the set of documents, each
containing one movie review.
○ The document set is divided into training and testing subsets for the model
● Vocabulary
○ The unique set of words in the entire corpus
○ Usually, the feature vector is based on the vocabulary of the corpus
○ Vocabulary size – number of words in the vocabulary
● Freely available corpus
○ Links to some freely available corpora - www.nltk.org
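The relationship between corpus, vocabulary and feature vector can be sketched in a few lines of plain Python. The three-document corpus below is made up for illustration, and whitespace splitting stands in for a real tokenizer; `to_feature_vector` is a hypothetical helper name.

```python
from collections import Counter

# Toy corpus (an assumption): each string stands for one document, e.g. one review.
corpus = [
    "the movie was good",
    "the movie was bad",
    "good plot and good acting",
]

# Naive whitespace tokenization; a real pipeline would use a proper tokenizer.
tokenized = [doc.split() for doc in corpus]

# Vocabulary: the unique words across the entire corpus, sorted for a stable order.
vocabulary = sorted({word for doc in tokenized for word in doc})
vocab_size = len(vocabulary)

def to_feature_vector(tokens):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vectors = [to_feature_vector(doc) for doc in tokenized]

print(vocabulary)
print(vocab_size)
for vector in vectors:
    print(vector)
```

Every document gets a vector of the same length (the vocabulary size), which is what makes the vocabulary a natural basis for features.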
Natural Language ToolKit (NLTK) – 3. Corpus and Vocabulary (continued)
Some popular corpora available
● Movie review dataset (IMDB dataset)
○ Consisting of 1000 negative and 1000 positive labeled movie reviews
○ http://www.cs.cornell.edu/people/pabo/movie-review-data/
● Amazon product review datasets
○ DVD dataset, Sports and Outdoor datasets
○ Each consisting of 1000 negative and 1000 positive labeled product reviews.
○ http://www.cs.jhu.edu/~mdredze/datasets/sentiment/
Natural Language ToolKit (NLTK) – 4. Stop words
Natural Language ToolKit (NLTK) – 5. Stemming and Lemmatization
● Stemming
○ Fast and simple, pattern based
○ Examples: Snowball, Porter
● Lemmatization
○ Needs POS tagging and dictionaries
○ Examples: LemmaGen, Morpha
● Stemming and lemmatization are methods used by search engines and chatbots
to reduce the inflected forms of a word to a common base form.
● Stemming strips affixes by rule to reach a stem, which need not be a dictionary
word, while lemmatization uses the word's part of speech and a dictionary to
return its base form (lemma).
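A short sketch of rule-based stemming with NLTK's `PorterStemmer` and `SnowballStemmer`, neither of which needs any downloaded data. The word list is an assumption chosen for illustration.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

# Illustrative words (an assumption, not taken from the slides).
words = ["tries", "stemming", "connected", "connection", "connections"]

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Each stemmer applies suffix-stripping rules; no dictionary lookup is involved.
porter_stems = [porter.stem(word) for word in words]
snowball_stems = [snowball.stem(word) for word in words]

for word, p, s in zip(words, porter_stems, snowball_stems):
    print(f"{word:12} porter={p:10} snowball={s}")
```

A stem such as "tri" is not an English word, which is the practical difference from a lemma. Lemmatization is not shown here because NLTK's `WordNetLemmatizer` needs the WordNet corpus to be downloaded first.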
Natural Language ToolKit (NLTK) – 6. Named Entity Recognition
● Named Entities (NEs) are proper names in texts, i.e. the names of people,
organizations and locations, together with expressions of time and quantity
● NER processes a text and identifies the named entities in it
● Applications:
○ Helps identify the key elements in a text
■ Helps sort unstructured data and detect important information.
○ Useful in question-answering systems
■ “Where was Mahatma Gandhi born?”
Example of text with its named entities labelled:
Hi, my name is Shubhangi Deb [PERSON]. I am from Australia [GPE].
I want to work with Amazon [ORG]. Jeff Bezos [PERSON] is my inspiration.
● Applications (continued)
○ Processing resumes
■ Extracting information from resumes that are formatted differently.
■ Personal information, experience, skills, degree, etc.
○ Gaining insights from customer feedback
■ Organize customer feedback and pinpoint recurring problem areas
■ Identify what customers like, dislike and want improved
○ Content Recommendation:
■ If you watch a lot of comedies on Netflix, you’ll get more recommendations
that have been classified as the entity ‘Comedy’.
Natural Language ToolKit (NLTK) – 7. Coreference Resolution
● Coreference resolution finds all expressions in a text that refer to the same entity.
“Jenna voted for Biden because Biden was most aligned to Jenna’s
principles”, Jenna said
(Sentence with resolved coreferences)
● Uses: it is an important step for many higher-level NLP tasks that involve
natural language understanding, for example:
○ Document summarization
○ Question answering
○ Machine translation
● Anaphora (backward references)
○ Refers to any reference that “points backward” to information that was
presented earlier in the text
“The apple on the table was rotten. It had been there for three days.”
● Cataphora (forward references)
○ Refers to any reference that “points forward” to information that will be
presented later in the text
“It has four legs. The cow is a domestic animal.”
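To make "resolved" concrete, the sketch below rewrites the apple example above once the coreference cluster is known. It is not a resolver: the cluster is annotated by hand, which is the part a real system would have to predict. The `clusters` layout and the `resolve` helper are assumptions for illustration.

```python
tokens = ["The", "apple", "on", "the", "table", "was", "rotten", ".",
          "It", "had", "been", "there", "for", "three", "days", "."]

# Hand-annotated cluster (an assumption, not a model prediction):
# the antecedent is the token span [0, 2) and the mention is the token at index 8.
clusters = [{"antecedent": (0, 2), "mentions": [8]}]

def resolve(tokens, clusters):
    """Replace each mention with the text of its antecedent."""
    resolved = list(tokens)
    for cluster in clusters:
        start, end = cluster["antecedent"]
        antecedent_text = " ".join(tokens[start:end])
        for index in cluster["mentions"]:
            resolved[index] = antecedent_text
    return resolved

resolved_tokens = resolve(tokens, clusters)
print(" ".join(resolved_tokens))
```

For a cataphor the mention index is simply smaller than the antecedent's start, so the same substitution applies.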
Natural Language ToolKit (NLTK) – 8. POS tagging
[Figure: an example sentence with each word labelled by part of speech: Proper Noun, Verb, Determiner, Adjective, Noun, Preposition, Proper Noun, Noun]
Installation
pip install nltk
Source: https://thinkinfi.com/how-to-download-nltk-corpus-manually/
Natural Language ToolKit (NLTK) – 9. NLTK
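Sentence splitting
import nltk  # needs the Punkt data: nltk.download('punkt') ('punkt_tab' in newer releases)
text = 'First sentence. Second sentence'
nltk.sent_tokenize(text)
Output: ['First sentence.', 'Second sentence']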
Natural Language ToolKit (NLTK) – 9. NLTK
● Emojis
○ Remove them, replace them with a word, or use them in sentiment analysis
● Replace characters repeated more often than needed (common on Twitter)
○ e.g. it was ssssoooo nice -> it was so nice
● Replace contractions (short forms to full words)
○ I'm -> I am, didn't -> did not
● Removal of punctuation
● Removal of rare/frequent words
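Several of these clean-up steps can be sketched with the standard library alone. The contraction table below is a tiny illustrative assumption, and `normalize` is a hypothetical helper name. Collapsing every run of three or more identical characters to one follows the "ssssoooo" example above.

```python
import re
import string

# Tiny illustrative contraction table (an assumption); real tables are much longer.
CONTRACTIONS = {"i'm": "I am", "didn't": "did not"}

CONTRACTION_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)

def normalize(text):
    # 1. Collapse any character repeated three or more times to a single occurrence.
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    # 2. Expand contractions first, because they rely on the apostrophe.
    text = CONTRACTION_PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)
    # 3. Remove ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 4. Squeeze any whitespace left behind.
    return " ".join(text.split())

print(normalize("it was ssssoooo nice"))
print(normalize("I'm sure it didn't rain!!!"))
```

The order matters: removing punctuation before expanding contractions would turn "didn't" into "didnt", which the table no longer matches. Collapsing to a single character can also damage rarely repeated letters, so some pipelines keep two.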
Text Preprocessing in NLP
● Tokenization
'the new policy of the government is good'
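A minimal sketch of word tokenization on this sentence, followed by stop-word removal. It uses a regular expression rather than an NLTK call so that it runs without any downloads; NLTK's `wordpunct_tokenize` applies a similar pattern. The stop-word set is a small illustrative assumption; NLTK ships a fuller list in its stopwords corpus.

```python
import re

text = "the new policy of the government is good"

# Word tokenization: runs of word characters, or single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", text)

# Tiny illustrative stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "of", "is", "a", "an", "and", "in", "to"}

# Keep only tokens that carry content.
content_tokens = [token for token in tokens if token.lower() not in STOP_WORDS]

print(tokens)
print(content_tokens)
```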
Stemming
Tokens as input - ['Stemming', 'usually', 'tries', 'to', 'convert', 'the', 'word', 'into', 'its', 'root',
'format']
Stemming - ['stem', 'us', 'tri', 'to', 'convert', 'the', 'word', 'into', 'it', 'root', 'form']
Lemmatization
Tokens as input - ['Stemming', 'usually', 'tries', 'to', 'convert', 'the', 'word', 'into', 'its', 'root',
'format']
Lemmatization - ['Stemming', 'usually', 'try', 'to', 'convert', 'the', 'word', 'into', 'it', 'root', 'format']
Text Preprocessing in NLP
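Input: 'Tom is good at playing football and stays in London.'
OUTPUT (entities labelled NE)    OUTPUT (entities labelled by type)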
(NE Tom/NNP) (PERSON Tom/NNP)
('is', 'VBZ') ('is', 'VBZ')
('good', 'JJ') ('good', 'JJ')
('at', 'IN') ('at', 'IN')
('playing', 'VBG') ('playing', 'VBG')
('football', 'NN') ('football', 'NN')
('and', 'CC') ('and', 'CC')
('stays', 'NNS') ('stays', 'NNS')
('in', 'IN') ('in', 'IN')
(NE London/NNP) (GPE London/NNP)
('.', '.') ('.', '.')
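Output of this shape is a tree: ordinary tokens are (word, tag) tuples and each named entity is a labelled subtree. The sketch below rebuilds the second output column by hand with `nltk.tree.Tree` (no tagger or chunker is run, so no data download is needed) and pulls out the entities. `extract_entities` is a hypothetical helper name.

```python
from nltk.tree import Tree

# Hand-built copy of the typed output shown above (tags reproduced as listed).
chunked = Tree("S", [
    Tree("PERSON", [("Tom", "NNP")]),
    ("is", "VBZ"), ("good", "JJ"), ("at", "IN"), ("playing", "VBG"),
    ("football", "NN"), ("and", "CC"), ("stays", "NNS"), ("in", "IN"),
    Tree("GPE", [("London", "NNP")]),
    (".", "."),
])

def extract_entities(tree):
    """Return (entity text, label) pairs from a chunk tree."""
    entities = []
    for node in tree:
        # Subtrees are entities; plain tuples are ordinary tagged tokens.
        if isinstance(node, Tree):
            entity_text = " ".join(word for word, tag in node.leaves())
            entities.append((entity_text, node.label()))
    return entities

entities = extract_entities(chunked)
print(entities)
```

The same loop works on the first column, where every entity subtree carries the single label NE instead of a type.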
Language: Python
● Why Python?
○ Has simple syntax
○ Has an extensive collection of NLP tools and libraries
● Python Libraries used in this course
○ NumPy
○ pandas
○ Matplotlib
○ scikit-learn
○ NLTK
Summary
● Now you have an idea of what NLP is, its applications, and the emerging
trends and challenges in using it.
● You also learnt about the basic concepts of NLP, such as
○ Corpus and Vocabulary
○ Text Normalization (Tokenization, Stemming and Lemmatization,
Stop words, Sentence segmentation)
○ POS tagging, Named Entity Recognition, Coreference Resolution
○ Parsing