Unit 1 NLP

The document provides an overview of Natural Language Processing (NLP), its key areas, applications, and challenges, particularly focusing on ambiguity in language. It also introduces the Natural Language Toolkit (NLTK), detailing various NLP tasks such as tokenization, stemming, lemmatization, and named entity recognition. The document serves as a foundational guide for understanding NLP concepts and tools used in the field.


Noida Institute of Engineering and Technology, Greater Noida

Subject code ACSA10712


Subject name Natural Language Processing

Unit – 1
Overview of Natural Language Processing
Definition, applications and emerging trends in NLP; challenges; ambiguity; NLP
tasks using NLTK: tokenization, stemming, lemmatization, stop-word removal, POS
tagging, parsing, named entity recognition, coreference resolution.
Natural Language Processing

What is NLP?
Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) and computational linguistics focused on enabling computers to understand, interpret, and respond to human language in a way that is both meaningful and useful.
Key Areas of NLP

• Text Analysis:
– Tokenization: Breaking down text into smaller units, such as words
or sentences.
– Part-of-Speech Tagging: Identifying the grammatical parts of speech
(nouns, verbs, adjectives, etc.) in a sentence.
– Named Entity Recognition (NER): Identifying and classifying
named entities (people, organizations, locations, etc.) in text.
– Sentiment Analysis: Determining the sentiment or emotion expressed
in a piece of text, such as positive, negative, or neutral.
• Speech Recognition:
– Converting spoken language into text, enabling voice-activated
systems like virtual assistants (e.g., Siri, Alexa) to understand and
process voice commands.
Key Areas of NLP
• Machine Translation:
– Automatically translating text or speech from one language to another,
as seen in services like Google Translate.
• Text Generation:
– Automatically generating coherent and contextually relevant text, such
as in chatbots, content creation, or language models like GPT.
• Question Answering:
– Building systems that can answer questions posed in natural language
by retrieving and summarizing relevant information from a large
dataset.
• Text Summarization:
– Condensing a large piece of text into a shorter version while
preserving its meaning and key information.
Applications of NLP

• Virtual Assistants: NLP is used in virtual assistants like Siri, Alexa, and
Google Assistant to understand voice commands and respond
appropriately.
• Chatbots: Many businesses use NLP-based chatbots to provide
customer support and answer common queries.
• Search Engines: NLP helps search engines like Google understand user
queries and deliver relevant search results.
• Translation Services: NLP powers translation tools that convert text
from one language to another.
• Sentiment Analysis: Companies use sentiment analysis to monitor
social media, reviews, and other forms of user feedback to gauge public
opinion.
Challenges in NLP

• Ambiguity: Natural language is often ambiguous, with words having multiple meanings or sentences that can be interpreted in different ways.
• Context: Understanding the context of a conversation or text is essential
for accurate processing, which can be challenging.
• Cultural Differences: Language use varies across cultures, making it
difficult to create models that work universally.
• Evolving Language: Language constantly evolves, with new slang,
idioms, and usage patterns emerging regularly.
Challenges in NLP - Ambiguity

Natural language ambiguity refers to situations where a word, phrase, or sentence has multiple meanings, making it challenging to interpret correctly.
Some common forms of ambiguity include
1. Lexical Ambiguity
2. Syntactic Ambiguity
3. Semantic Ambiguity
4. Referential Ambiguity
5. Contextual Ambiguity
Challenges in NLP - Ambiguity

1. Lexical Ambiguity
• Lexical means relating to the words of a language.
• During lexical analysis, the given paragraphs are broken down into words or tokens. Each token has a specific meaning.
There can be instances where a single word can be interpreted in multiple ways. Ambiguity that is caused by the word alone, rather than by the context, is known as lexical ambiguity.

Example: “Give me the bat!”

In this example “bat” has more than one meaning: the animal or a cricket bat.
Challenges in NLP – Ambiguity continue….

Lexical ambiguity is divided into two categories:

1. Polysemy
○ One word has many related meanings.
○ The task is determining the sense of the word in a particular context:
■ He sat on the bank of a river / Withdraw money from the bank
■ Maruti has built a plant to manufacture cars / A man was planted in the audience to raise anti-political slogans
2. Homonymy
○ A single word having multiple but unrelated meanings.
Examples: bear, left, Pole
Challenges in NLP - Ambiguity

A bear (the animal) can bear (tolerate) very cold temperatures.

The driver turned left (opposite of right) and left (departed from) the main
road.

Pole and pole — the first, Pole, refers to a citizen of Poland (who can be called Polish or a Pole); the second, pole, refers to a bamboo pole or any other wooden pole.
Challenges in NLP - Ambiguity

2. Syntactic Ambiguity / Structural Ambiguity

Syntactic meaning refers to the grammatical structure and rules that define how words should be combined to form sentences and phrases.
When a sentence can be interpreted in more than one way because of its structure or syntax, the ambiguity is referred to as syntactic ambiguity.
Example 1: “Old men and women”
There are two possible meanings:
Old men, and women of any age.
Old men and old women.
Example 2: “John saw the boy with a telescope.”
In this example, the two possible meanings are:
John saw the boy through his telescope.
John saw the boy who was holding the telescope.
Challenges in NLP - Ambiguity

3. Semantic Ambiguity – semantics is nothing but “meaning”.

• The semantics of a word or phrase refers to the way it is typically understood or interpreted by people.
• Syntax describes the rules by which words can be combined into sentences, while semantics describes what they mean.
This type of ambiguity occurs when a sentence has more than one interpretation or meaning.
Example 1: “The chicken is ready to eat.”
The chicken (as food) is cooked and ready to be eaten.
The chicken (the animal) is hungry and ready to eat something.
Example 2: “Seema loves her mother and Sriya does too.”
In example 2 there are two interpretations:
Sriya loves Seema’s mother, or Sriya loves her own mother.
Challenges in NLP - Ambiguity

4. Anaphoric (Referential) Ambiguity

A word that gets its meaning from a preceding word or phrase is called an anaphor; the ambiguity arises when the anaphor could refer to more than one antecedent.
Example 1 – “The house is on the longest street. It is very dirty.”
In example 1, does “It” refer to the house or to the long street?
Example 2 – “I went to the hospital, and they told me to go home and rest.”
In example 2, “they” does not explicitly refer to the hospital; instead it refers to the doctor or staff who attended the patient in the hospital.
5. Pragmatic (Contextual) Ambiguity – pragmatics focuses on the real-time usage of language: what the speaker wants to convey and how the listener infers it.
Example 1: “Do you know what time it is?”
One meaning is that someone is asking for the time; another is that someone is showing anger because a due time was missed.
Natural Language ToolKit (NLTK)

The Natural Language Toolkit (NLTK) is a Python programming environment for creating applications for statistical natural language processing (NLP).

1. Tokenization
2. Sentence Segmentation
3. Corpus and Vocabulary
4. Stop words
5. Stemming and Lemmatization
6. Named Entity Recognition
7. Co-referencing Resolution
8. POS tagging
9. Parsing
Natural Language ToolKit (NLTK) – 1.Tokenization

● The tokenization method is used to split a sentence, paragraph, or full-text document into smaller units – tokens.
I love NLP.
['I', 'love', 'NLP', '.']
○ A token is the basic unit of a language.
○ Tokenization helps to interpret the meaning of the text by analysing the words present in it.
○ It lets us count the number and frequency of words in the text.
Natural Language ToolKit (NLTK) – 2. Sentence Segmentation

You first need to break the entire document down into its constituent sentences. You can do this by segmenting the article along its punctuation, such as full stops and question marks.
● Splitting the given input text into sentences.
● Characters used for marking a sentence end – ‘!’, ‘?’, ‘.’
● Ambiguities:
○ The yearly results of Yahoo! are promising.
○ We are using the .NET framework for our project.
○ Susan scored 78.5% marks in her exams.
○ Mr. Mehta is doing a great job.
● Rules like – numbers around the ‘.’
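A quick sketch using NLTK’s sentence tokenizer (assuming the ‘punkt’ models have been downloaded) shows how such ambiguous ‘.’ characters are typically handled:

import nltk
nltk.download('punkt')                       # one-time download of the sentence tokenizer models
from nltk.tokenize import sent_tokenize

text = "Mr. Mehta is doing a great job. Susan scored 78.5% marks in her exams."
print(sent_tokenize(text))
# The abbreviation 'Mr.' and the decimal point in 78.5 should not end a sentence,
# so the expected result is two sentences.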
Natural Language ToolKit (NLTK) – 3. Corpus and Vocabulary

● text corpus
○ The set of text documents used for the model.
○ e.g. For model to analyze movie reviews, corpus is the set of documents each
containing a movie review.
○ Documents set divided into training/testing for the model
● Vocabulary
○ The unique set of words in the entire corpus
○ Usually, the feature vector is based on the vocabulary of the corpus
○ Vocabulary size – number of words in the vocabulary
● Freely available corpus
○ Links to some freely available corpora - www.nltk.org
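As a small sketch (the two-review toy corpus below is only an assumption for illustration), the vocabulary is simply the set of unique tokens across all documents:

import nltk
nltk.download('punkt')                                   # tokenizer models (one-time)

corpus = ["The movie was great", "The plot was weak but the acting was great"]   # toy corpus
tokens = [w.lower() for doc in corpus for w in nltk.word_tokenize(doc)]
vocabulary = sorted(set(tokens))                         # unique words in the corpus
print(vocabulary)
print("Vocabulary size:", len(vocabulary))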
Natural Language ToolKit (NLTK) – 3. Corpus and Vocabulary (continued)
Some popular corpora available:
● Movie review dataset (IMDB dataset)
○ Consisting of 1000 negative and 1000 positive labeled movie reviews
○ http://www.cs.cornell.edu/people/pabo/movie-review-data/
● Amazon product review datasets
○ DVD dataset, Sports and Outdoor datasets
○ Each consisting of 1000 negative and 1000 positive labeled product reviews.
○ http://www.cs.jhu.edu/~mdredze/datasets/sentiment/
Natural Language ToolKit (NLTK) – 4. Stop words

Stop words are very common words in a language that carry little useful information.

● Articles, prepositions, pronouns, conjunctions, etc.
○ e.g. “the”, “in”, “of”, “his”, “and”, etc.
● Removal of stop-word tokens:
○ Keeps the focus on the important information.
○ Reduces the dataset size and training time.
● Stop words may still be needed for:
○ Relational queries – “flights to Tokyo”
○ Phrases like “To be or not to be”
○ Prediction of sentiment – “I told you that she was not happy” → “told, happy” (removing “not” flips the meaning)
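A minimal stop-word removal sketch using NLTK’s built-in English stop-word list (note how it also drops “not”, which is exactly the sentiment problem mentioned above):

import nltk
nltk.download('stopwords')                    # one-time downloads
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
tokens = word_tokenize("I told you that she was not happy".lower())
filtered = [w for w in tokens if w not in stop_words]
print(filtered)                               # 'not' is in the stop-word list, so the negation is lost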
Natural Language ToolKit (NLTK) – 5. Stemming and Lemmatization

● Reduces the form of a word to a common base form.
e.g. (go, going, gone) -> go; (running, ran, runs, run) -> run
● Used to prepare text, words, and documents for further processing.
● When we search for a word on the web, the search also retrieves variations of the word. If we search for, say, ‘kill’, it may also return words like killer, killing, killed, kills.
● Here kill is the stem of killer, killed, killing, kills. It conveys that each of these carries the idea of ‘kill’.
Natural Language ToolKit (NLTK) – 5. Stemming and Lemmatization

● Stemming uses heuristics that:
○ chop off letters from the end of the word;
○ transform the end letters.
● Lemmatization groups together the inflected forms of a word under its lemma.

Stemming (word → suffix removed → stem):
Was → – → was
Cats → s → cat
Changing → ing → chang
Studies → es → studi
Studying → ing → study

Lemmatization (word forms → lemma):
Is, was, were → Be
Cats → Cat
Changing, changed, change → change
Studies, studying → Study
Natural Language ToolKit (NLTK) – 5. Stemming and Lemmatization

Stemming vs. Lemmatization:
● Stemming is fast and simple (pattern based); lemmatization needs POS tagging and dictionaries.
● Example stemmers: Snowball, Porter; example lemmatizers: LemmaGen, Morpha.
● Stemming returns the stem of a word, which may not be in the vocabulary; lemmatization returns a proper word, the lemma.
● Stemming is crude and less useful; lemmatization is more informative.

● Stemming and lemmatization are methods used by search engines and chatbots to analyse the meaning behind a word.
● Stemming uses only the stem of the word, while lemmatization uses the context in which the word is being used.
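A short sketch comparing NLTK’s Porter stemmer with the WordNet lemmatizer (exact outputs depend on the NLTK version, so they are only indicative):

import nltk
nltk.download('wordnet')                                  # lemmatizer dictionary (one-time)
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["cats", "changing", "studies", "studying"]
print([stemmer.stem(w) for w in words])                   # crude stems, may not be real words
print([lemmatizer.lemmatize(w) for w in words])           # lemmas, treating words as nouns by default
print(lemmatizer.lemmatize("was", pos="v"))               # 'be' once the verb POS is supplied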
Natural Language ToolKit (NLTK) – 6. Named Entity Recognition

● Named Entities (NEs) are proper names in texts, i.e. the names of people, organizations, locations, times and quantities.
● NER is the task of processing a text and identifying the named entities in it.
● Applications:
○ Helps identify the key elements in a text.
■ Helps sort unstructured data and detect important information.
○ Useful in question-answering systems.
■ “Where was Mahatma Gandhi born?”
Example (entities tagged): Hi, my name is Shubhangi Deb [PERSON]. I am from Australia [GPE]. I want to work with Amazon [ORG]. Jeff Bezos [PERSON] is my inspiration.
(Figure: named entity recognition with machine learning)
Natural Language ToolKit (NLTK) – 6. Named Entity Recognition

● Applications
○ Processing Resumes
■ Looking for information from resumes formatted differently.
■ Personal information, experience, skills, degree etc,.
○ Gain Insights from Customer Feedback
■ Organize all this customer feedback and pinpoint repeating problem areas
■ Areas of customer likes/dislikes/improvement areas
○ Content Recommendation:
■ If you watch a lot of comedies on Netflix, you’ll get more recommendations
that have been classified as the entity ‘Comedy’.
Natural Language ToolKit (NLTK) – 7. Coreference Resolution

● Identify all expressions that refer to the same entity.
○ Mohan went to McDonald’s to buy a burger. He visits the store very often and loves its food.

Original sentence:
“I voted for Biden because he was most aligned to my principles”, Jenna said.

Sentence with resolved coreferences:
“Jenna voted for Biden because Biden was most aligned to Jenna’s principles”, Jenna said.

● Uses – it is an important step for many higher-level NLP tasks that involve natural language understanding.
Natural Language ToolKit (NLTK) – 7. Coreference Resolution

● Uses
○ Document summarization
○ Question answering
○ Machine translation
● Anaphora (backward references)
○ Refers to any reference that “points backward” to information that was
presented earlier in the text
“The apple on the table was rotten. It had been there for three days.”
● Cataphora (forward references)
○ Refers to any reference that “points forward” to information that will be
presented later in the text
“It has four legs. The cow is a domestic animal.”
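NLTK has no built-in coreference resolver, so the sketch below is only a toy heuristic (an illustrative assumption: replace he/she/it with the most recently seen proper noun). It works on a simple sentence but fails as soon as several candidate antecedents appear, which is exactly why real coreference resolution is hard:

import nltk
nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')   # one-time downloads
from nltk import word_tokenize, pos_tag

def naive_coref(text):
    # Toy heuristic: substitute he/she/it with the last proper noun (NNP) seen so far.
    resolved, last_propn = [], None
    for word, tag in pos_tag(word_tokenize(text)):
        if tag == "NNP":
            last_propn = word
        if word.lower() in {"he", "she", "it"} and last_propn:
            resolved.append(last_propn)
        else:
            resolved.append(word)
    return " ".join(resolved)

print(naive_coref("Mohan went to the store to buy a burger. He visits the store very often."))
# expected: "He" is replaced by "Mohan"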
Natural Language ToolKit (NLTK) – 8. POS tagging

● Assign a part of speech to each word in the text:
○ Nouns: define any object or entity.
○ Verbs: define some action.
○ Pronouns: can replace a noun – she, him.
○ Adjectives: describe a noun/pronoun.
● In a sentence, every word will be associated with a proper POS tag.
(Figure: the parts of speech – noun, pronoun, verb, adjective, adverb, preposition, conjunction, interjection)

Example:
Puja (Proper Noun, NNP)  bought (Verb, VBN)  a (Determiner, DT)  new (Adjective, JJ)  phone (Noun, NN)  from (Preposition, IN)  Samsung (Proper Noun, NNP)  Store (Noun, NN)
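The same sentence can be tagged with NLTK in two lines (a sketch assuming the tagger models are downloaded); the result is a list of (token, Penn Treebank tag) pairs:

import nltk
nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')   # one-time downloads
from nltk import word_tokenize, pos_tag

print(pos_tag(word_tokenize("Puja bought a new phone from Samsung Store")))
# a list such as [('Puja', 'NNP'), ('new', 'JJ'), ('phone', 'NN'), ('Samsung', 'NNP'), ...]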


Natural Language ToolKit (NLTK) – 9. NLTK

NLTK (Natural Language Toolkit) is a library for NLP in Python.


It is a powerful tool to preprocess text data for further analysis, for example as input to machine learning algorithms.
Tokenization, POS tagging, stemming, etc.
It includes many corpora and lexical resources (like WordNet)
Natural Language ToolKit (NLTK) – 9. NLTK

Installation
pip install nltk

Downloading the datasets:


import nltk
nltk.download()

Choose whatever packages you want to download from the screen.

Source: https://thinkinfi.com/how-to-download-nltk-corpus-manually//
Natural Language ToolKit (NLTK) – 9. NLTK

Operations using NLTK:


Tokenization
import nltk
text = "First sentence. Second sentence."
nltk.word_tokenize(text)
Output: ['First', 'sentence', '.', 'Second', 'sentence', '.']

Sentence splitting
nltk.sent_tokenize(text)
Output: ['First sentence.', 'Second sentence.']
Natural Language ToolKit (NLTK) – 9. NLTK

Accessing Text Corpora in NLTK:


● Corpus: A set of text documents usually having some common characteristic.
Corpora is plural of corpus.
set of online books
set of movie reviews
collection of tweets
• The NLTK corpus is a set of natural language datasets stored in the nltk_data directory.
In NLTK some corpora are included:
Gutenberg corpus (online books)
Brown corpus (categorised – news, humor, etc.)
Reuters corpus (news)
WordNet – the most advanced; contains words, synonyms, antonyms, etc.
from nltk.corpus import gutenberg
Natural Language ToolKit (NLTK) – NLTK

Accessing Gutenberg corpus:


● NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, which
contains some 25,000 free electronic books
See website for full list - http://www.gutenberg.org/
To download the corpus – nltk.download('gutenberg')
To find the Gutenberg collections downloaded with the NLTK package – nltk.corpus.gutenberg.fileids()
To find the words in a text file – gutenberg.words(fileid)
To find the sentences in a file – gutenberg.sents(fileid)
To get the text of the file as a string – gutenberg.raw(fileid)
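Putting those commands together, a minimal sketch (the file names shown are examples from the downloaded collection):

import nltk
nltk.download('gutenberg')                        # one-time download
from nltk.corpus import gutenberg

print(gutenberg.fileids()[:3])                    # first few files, e.g. 'austen-emma.txt'
fileid = gutenberg.fileids()[0]
print(len(gutenberg.words(fileid)))               # number of word tokens in the file
print(len(gutenberg.sents(fileid)))               # number of sentences in the file
print(gutenberg.raw(fileid)[:80])                 # first 80 characters of the raw text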
In NLTK some corpora are included
Gutenberg corpus
Brown corpus
To check whether the phone number “0120-4543466” occurs in the string “His contact number is 0120-4543466”, the Python expression "0120-4543466" in "His contact number is 0120-4543466" is enough.

If we only know the format – ####-####### (phone number) or ##/##/#### (date) – we need regular expressions to search for these patterns (a short sketch follows).
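A minimal sketch with Python’s re module (the patterns are assumptions matching the formats above):

import re

text = "His contact number is 0120-4543466 and the exam is on 12/09/2024."
phone = re.search(r"\d{4}-\d{7}", text)           # ####-####### pattern
date = re.search(r"\d{2}/\d{2}/\d{4}", text)      # ##/##/#### pattern
print(phone.group() if phone else None)           # 0120-4543466
print(date.group() if date else None)             # 12/09/2024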
Natural Language ToolKit (NLTK) – NLTK

Accessing Browns corpus:


● Contains text categorized by genre, such as news, humor etc.
● To find the categories
from nltk.corpus import brown
brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor',
'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
To find the files in the category 'humor':
brown.fileids(categories='humor')
['cr01', 'cr02', 'cr03', 'cr04', 'cr05', 'cr06', 'cr07', 'cr08', 'cr09']
To find the words in the file 'cr01':
brown.words(fileids=['cr01'])
['It', 'was', 'among', 'these', 'that', 'Hinkle', ...]
For details refer : http://www.nltk.org/book_1ed/ch02.html
Text Preprocessing in NLP

Text Preprocessing in NLP


● Convert raw text into a set of tokens that the computer can understand and use.
● Ready for feature extraction.
● Data cleaning and pre-processing is critical for the quality of further analysis.
● Pre-processing steps depend on:
a. the data – structured (movie reviews) or unstructured (tweets);
b. the application for which the data needs to be used.
Text Preprocessing in NLP

Steps in Text Preprocessing:

● Convert all characters to lower case
○ e.g. Hello, HELLO, hello, hellO -> hello
● Remove HTML tags, URLs, email ids
○ http://www.example.com/index.html, [email protected]
○ Text between HTML tags
● Convert data to a standard form
○ 2mrw, tmrw -> tomorrow; btwn, btw -> between; b4 -> before
Text Preprocessing in NLP

● Emojis
○ Remove them / replace with a word / use them for sentiment analysis
● Replace characters repeated many times (Twitter)
○ e.g. it was ssssoooo nice -> it was so nice
● Replace contractions (short forms with full words)
○ I’m -> I am, didn’t -> did not
● Removal of punctuation
● Removal of rare/frequent words
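A combined cleaning sketch for the steps on the last two slides (the regular expressions and the small contraction/slang dictionaries are illustrative assumptions, not exhaustive rules):

import re

CONTRACTIONS = {"i'm": "i am", "didn't": "did not"}                      # illustrative only
SLANG = {"2mrw": "tomorrow", "tmrw": "tomorrow", "btw": "between", "b4": "before"}

def clean(text):
    text = text.lower()                                                  # normalise case
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)                   # drop URLs
    text = re.sub(r"\S+@\S+", " ", text)                                 # drop e-mail ids
    text = re.sub(r"<[^>]+>", " ", text)                                 # drop HTML tags
    for short, full in {**CONTRACTIONS, **SLANG}.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)      # expand short forms
    text = re.sub(r"(.)\1{2,}", r"\1", text)                             # ssssoooo -> so
    text = re.sub(r"[^\w\s]", " ", text)                                 # remove punctuation
    return re.sub(r"\s+", " ", text).strip()

print(clean("It was ssssoooo nice!! See http://www.example.com b4 2mrw, I'm happy."))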
Text Preprocessing in NLP

● Tokenization
'the new policy of the government is good'
● Sentence segmentation (if required)
text = "First in class. Last in class."
Output: ['First in class.', 'Last in class.']
● Remove stop words
Tokens as input – ['the', 'new', 'policy', 'of', 'the', 'government', 'is', 'good']
['new', 'policy', 'government', 'good']
Text Preprocessing in NLP

● Parts of Speech (POS) tagging
Tokens as input – ['he', 'loves', 'to', 'play', 'with', 'toys', 'in', 'morning']
[('he', 'PRP'), ('loves', 'VBZ'), ('to', 'TO'), ('play', 'VB'), ('with', 'IN'), ('toys', 'NN'), ('in', 'IN'), ('morning', 'NN')]

● Stemming
Tokens as input – ['Stemming', 'usually', 'tries', 'to', 'convert', 'the', 'word', 'into', 'its', 'root', 'format']
Stemming – ['stem', 'us', 'tri', 'to', 'convert', 'the', 'word', 'into', 'it', 'root', 'form']

● Lemmatization
Tokens as input – ['Stemming', 'usually', 'tries', 'to', 'convert', 'the', 'word', 'into', 'its', 'root', 'format']
Lemmatization – ['Stemming', 'usually', 'try', 'to', 'convert', 'the', 'word', 'into', 'it', 'root', 'format']
Text Preprocessing in NLP

● Named Entity Recognition
String 'text' as input:
text = "Tom is good at playing football and stays in London."
tokens = word_tokenize(text)
pos_text = nltk.pos_tag(tokens)

nes = nltk.ne_chunk(pos_text, binary=True)
OUTPUT:
(NE Tom/NNP) ('is', 'VBZ') ('good', 'JJ') ('at', 'IN') ('playing', 'VBG') ('football', 'NN') ('and', 'CC') ('stays', 'NNS') ('in', 'IN') (NE London/NNP) ('.', '.')

nes = nltk.ne_chunk(pos_text, binary=False)
OUTPUT:
(PERSON Tom/NNP) ('is', 'VBZ') ('good', 'JJ') ('at', 'IN') ('playing', 'VBG') ('football', 'NN') ('and', 'CC') ('stays', 'NNS') ('in', 'IN') (GPE London/NNP) ('.', '.')

With binary=True every detected entity is tagged simply as NE; with binary=False the entity type (PERSON, GPE, ORG, ...) is shown.
Text Preprocessing in NLP

How do we implement these NLP concepts:

Language: Python
● Why Python?
○ Has simple syntax
○ Has extensive collection of NLP tools and libraries
● Python Libraries used in this course
○ Numpy
○ Pandas
○ Matplotlib
○ SciKit-Learn
○ NLTK
Text Preprocessing in NLP

Summary
● Now you have an idea of what NLP is, its applications and the emerging
trends and challenges in using it.
● You also learnt about the basic concepts of NLP like
○ Corpus and Vocabulary
○ Text Normalization (Tokenization , Stemming and Lemmatization,
Stop words, Sentence segmentation )
○ POS tagging, Named Entity Recognition, Co-referencing Resolution
○ Parsing

●Implementation of the above concepts using NLTK
