NLP Lecture Slides - Part 1

Uploaded by

karthiktej890

Natural Language Processing
UNIT – IV
Contents
▶ Introduction.
▶ Applications.
▶ Chatbots, virtual agents (Alexa, Google Assistant, Siri): importance, applications.
▶ NLP subproblems.
▶ Components of natural language.
▶ Steps to get text data into a workable format.
▶ Term Frequency, Inverse Document Frequency.
▶ Bag of Words.
▶ n-gram.
▶ One-hot encoding.
▶ Notion of corpus. Intro to NLTK.

NLP
▶ NLP is among the hottest topics in the field of data science.
▶ Companies are putting tons of money into research in this field.
▶ Everyone is trying to understand NLP and its applications to make a career around it.
▶ Every business out there wants to integrate it into their business somehow.
Are you using NLP these days?
Search Autocorrect and Autocomplete – Language Translator
Social media monitoring
▶ More people these days use social media to post their thoughts about a particular product, policy, or matter.
▶ These posts can contain useful information about an individual's likes and dislikes.
▶ Analyzing this unstructured data can help in generating valuable insights. NLP comes to the rescue here too.
▶ Companies use various NLP techniques to analyze social media posts and learn what customers think about their products.
▶ Companies also use social media monitoring to understand the issues and problems that customers face when using their products.
Chatbots

Modern conversational agents can
• Answer questions
• Book flights
• Find restaurants
• Perform other functions for which they rely on a much more sophisticated understanding of the user's intent
Survey Analysis
▶ Surveys are an important way of evaluating a company's performance.
▶ They are used to get customers' feedback on various products.
▶ They are useful in understanding flaws and help companies improve their products.
▶ NLP is used to analyze surveys and generate insights from them, such as knowing the sentiments of users and analyzing product reviews to understand the pros and cons.
Targeted Advertising – Hiring and Recruitment
▶ Targeted advertising is a type of online advertising where ads are shown to users based on their online activity.
▶ It saves companies a lot of money.
▶ Relevant ads are shown only to potential customers.
Voice Assistants
Conventional vs. NLP-based search
What is NLP?
▶ Natural language processing is a sub-field of linguistics, computer
science and AI concerned with the interactions between computers
and human language
▶ NLP makes computers understand complex language structure and retrieve
meaningful pieces of information from it
▶ Modern challenges in NLP involve
▶ speech recognition,
▶ natural language understanding and
▶ natural language generation
Applications of NLP

▶ Text Classification
▶ Language Modelling
▶ Information Extraction
▶ Information Retrieval
▶ Conversational Agents
▶ Text Summarization
▶ Question Answering
▶ Machine Translation
▶ Topic Modelling
▶ Speech Recognition
Chatbots and Virtual Assistants
▶ What is a chatbot?
▶ It's all about the conversation.
▶ A chatbot is a piece of software, usually powered by artificial intelligence, with which humans can interact.
▶ They are sometimes called virtual assistants, chatbox, or even chatterbox.
▶ Chatbots can converse with us humans through both text and voice.
▶ Who is using this technology?
▶ Although chatbots are a relatively old technology, businesses have only recently started putting them to effective use.
▶ Companies like Disney use chatbots as a brand and marketing play.
▶ Companies like Google use chatbots to innovate in their field.
▶ Companies like Burberry and Staples use chatbots as a customer service tool.
▶ And many more!
Chatbots and Virtual Assistants

Artificial intelligence and chatbots

▶ Most chatbots are powered by artificial intelligence, or AI.
▶ AI chatbots tend to be more useful, purely because they are smarter and can learn over time. This is, of course, valuable to businesses.
▶ Artificial intelligence in chatbots comes in many forms. The most common are natural language processing (NLP), which powers the language side of the chatbot, and machine learning (ML), which powers data and algorithms.
Some Virtual Assistants

▶ Alexa
▶ Google Assistant
▶ Siri
Benefits/importance/applications of Chatbots and Virtual Assistants
▶ Businesses use chatbots extensively to optimize internal business processes, boost productivity, raise revenue, and improve customer experience.
▶ Website chatbots increase engagement by giving quick and personalized responses. Customers who would otherwise have to wait longer for a response via traditional telephone channels benefit significantly from this.
▶ Because of their ease of use, chatbots have a very high adoption rate. Chatbot adoption opens up a floodgate of possibilities for businesses to improve their client interaction process and operational efficiency, lowering customer service costs.
▶ Virtual assistants make organizing and carrying out our daily chores easier.
▶ Virtual assistants are helpful in supporting our tasks, whether it's setting reminders and alarms, adding appointments to the calendar, making calls, or retrieving information from the internet.
▶ Virtual assistants can now control smart home devices such as lighting, thermostats, and music devices.
NLP Subproblems

• Text Categorization
• Machine Translation
• Text Summarization
• Entity Recognition
• Temporal Event Recognition
• Text Generation
• Natural Language Interface
• Speech Recognition
• Text to Speech
Text Categorization

• Sentiment Analysis
• Spam Detection
• Authorship Attribution

The goal of classification is

• to take a single observation,
• extract some useful features, and
• thereby classify the observation into one of a set of discrete classes.
Text Categorization: Sentiment Analysis

• It is the extraction of sentiment, the positive or negative orientation that a writer expresses toward some object.
• The words in a review provide excellent cues for this task.
• For example, words like great, richly, awesome, pathetic, awful and ridiculously are informative cues.
Examples:
i. “I really like the new design of your website!” – Positive
ii. “The new design is awful” – Negative
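The cue-word idea above can be sketched as a tiny lexicon-based scorer. This is only an illustration under assumptions: the function name `classify_sentiment` and the two cue sets are hypothetical, and the word "like" is added to the positive set (it is not in the slide's list) so that the first example sentence contains a cue.

```python
import re

# Cue words from the slide, plus "like" (an assumption added for the example).
POSITIVE_CUES = {"great", "richly", "awesome", "like"}
NEGATIVE_CUES = {"pathetic", "awful", "ridiculously"}

def classify_sentiment(text):
    """Label text by counting positive and negative cue words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE_CUES for w in words) - sum(w in NEGATIVE_CUES for w in words)
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"

print(classify_sentiment("I really like the new design of your website!"))  # Positive
print(classify_sentiment("The new design is awful"))                        # Negative
```

A practical classifier would learn the weight of each cue from labelled reviews rather than rely on a fixed list.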
Text Categorization: Spam Detection

• It is another important commercial application.
• It is a binary classification task of assigning an email to one of two classes: spam or not-spam.
• Many lexical and other features can be used to perform this classification.
• For example, you might quite reasonably be suspicious of an email containing phrases like “online pharmaceutical” or “WITHOUT ANY COST” or “Dear Winner”.
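As a rough sketch of lexical features for spam detection, the snippet below checks an email for the three suspicious phrases named above. The helper names `spam_features` and `looks_like_spam` and the `threshold` parameter are hypothetical; a real filter would feed such features into a trained binary classifier instead of a fixed threshold.

```python
# Phrases taken from the slide; matching is case-insensitive.
SUSPICIOUS_PHRASES = ("online pharmaceutical", "without any cost", "dear winner")

def spam_features(email_text):
    """Map each suspicious phrase to True if it occurs in the email."""
    lowered = email_text.lower()
    return {phrase: phrase in lowered for phrase in SUSPICIOUS_PHRASES}

def looks_like_spam(email_text, threshold=1):
    """Flag the email when at least `threshold` phrases are present."""
    return sum(spam_features(email_text).values()) >= threshold

print(looks_like_spam("Dear Winner, claim your prize WITHOUT ANY COST"))  # True
print(looks_like_spam("Meeting moved to 3pm"))                            # False
```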
Text Categorization: Authorship Attribution

• Authorship attribution is the task of identifying the author of a given document.

Machine Translation

• Machine Translation (MT) automatically translates one natural language into another.
• Encoder-decoder models are commonly employed to solve machine translation problems.
• Machine translation can be statistical or rule-based.
• Neural Machine Translation relies on neural network models to build statistical translation models.
Text Summarization

• Automatic text summarization aims to transform lengthy documents into shortened versions, something which could be difficult and costly to do manually.
• Machine learning algorithms can be trained to comprehend documents and identify the sections that convey important facts and information before producing the required summarized texts.
• For example, a news article can be fed into a machine learning algorithm to generate a summary.
Entity Recognition

• Named Entity Recognition is one of the key entity detection methods in NLP.
• Named entity recognition is a natural language processing technique that can automatically scan entire articles, pull out fundamental entities in the text, and classify them into predefined categories.
• Entities may be
  • Organizations
  • Quantities
  • Monetary values
  • Percentages
  • People's names
  • Company names
  • Geographic locations (both physical and political)
  • Product names
  • Dates and times
  • and more
Contd..

• In simple words, Named Entity Recognition is the process of detecting named entities such as person names, location names, company names, etc. in the text.
• It is also known as entity identification, entity extraction, or entity chunking.
• With the help of named entity recognition, we can extract key information to understand the text, or simply extract important information to store in a database.
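To make "classify them into predefined categories" concrete, here is a deliberately simple pattern-based sketch covering two of the categories listed above. It is a stand-in technique, not a trained NER model: the `ENTITY_PATTERNS` table, its labels, and the `extract_entities` helper are all hypothetical, and regular expressions cannot recognise names of people or organizations.

```python
import re

# Hypothetical patterns for two entity categories (dollar amounts, percentages).
ENTITY_PATTERNS = {
    "MONETARY_VALUE": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?"),
    "PERCENTAGE": re.compile(r"\d+(?:\.\d+)?%"),
}

def extract_entities(text):
    """Return a dict mapping each category to the substrings that match it."""
    return {label: pattern.findall(text) for label, pattern in ENTITY_PATTERNS.items()}

found = extract_entities("Acme raised $2,500,000 and grew 12.5% in 2023.")
print(found)  # {'MONETARY_VALUE': ['$2,500,000'], 'PERCENTAGE': ['12.5%']}
```

The resulting dictionary is the kind of structured record that could then be stored in a database.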
Text Generation

• Text generation is a subfield of natural language processing (NLP).
• It leverages knowledge in computational linguistics and artificial intelligence to automatically generate natural language texts that satisfy certain communicative requirements.
Natural Language Interface

• A natural language interface is a user interface in which the user and the system communicate via a natural (human) language.
• The user provides input via speech or some other method, and the system generates responses in the form of utterances delivered by speech, text or another suitable modality.
Speech Recognition

• Speech recognition, also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text, is a capability which enables a program to process human speech into a written format.
• Many speech recognition applications and devices are available, but the more advanced solutions use AI and machine learning. They integrate grammar, syntax, structure, and composition of audio and voice signals to understand and process human speech. Ideally, they learn as they go, evolving responses with each interaction.
Words – What counts as a word?

▶ Corpus (plural corpora): a computer-readable collection of text or speech.
▶ For example, the Brown corpus is a million-word collection of samples from 500 written English texts from different genres (newspaper, fiction, non-fiction, academic, etc.).
Contd..

• How many words are in the following Brown sentence?
• Sentence: He stepped out into the hall, was delighted to encounter a water brother.
• This sentence has 13 words if we don't count punctuation marks as words, 15 if we count punctuation.
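The two counts can be checked with a short script. This is a minimal sketch that assumes a word is a run of letters or digits (`\w+`) and that every other non-space character is one punctuation token.

```python
import re

sentence = "He stepped out into the hall, was delighted to encounter a water brother."

# \w+ matches one word; [^\w\s] matches a single punctuation mark.
tokens_with_punct = re.findall(r"\w+|[^\w\s]", sentence)
words_only = [t for t in tokens_with_punct if re.fullmatch(r"\w+", t)]

print(len(words_only))         # 13
print(len(tokens_with_punct))  # 15
```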
Contd..

▶ Are capitalized tokens like They and uncapitalized tokens like they the same word?
▶ How about inflected forms like cats versus cat? These two words have the same lemma, cat, but are different wordforms.
▶ A lemma is a set of lexical forms having the same stem, the same major part-of-speech, and the same word sense.
▶ The wordform is the full inflected or derived form of the word.
Notion of Corpus: Words – Types and Tokens
▶ Word types are the distinct words in a corpus; if the set of words in the vocabulary is V, the number of types is the vocabulary size |V|.
▶ Word tokens are the running words; their total number is N.
▶ Ignore punctuation and find the number of tokens and types in the following sentence:

They picnicked by the pool, then lay back on the grass and looked at the stars

16 tokens, 14 types
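The token and type counts for this sentence can be reproduced in a few lines. The sketch assumes punctuation is ignored and that wordforms are compared case-sensitively, so "They" and "the" remain different types.

```python
import re

sentence = ("They picnicked by the pool, then lay back on the grass "
            "and looked at the stars")

tokens = re.findall(r"\w+", sentence)  # running words (N), punctuation ignored
types = set(tokens)                    # distinct wordforms (|V|)

print(len(tokens))  # 16
print(len(types))   # 14
```

The word "the" occurs three times, which is why there are two fewer types than tokens.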
Notion of Corpus: Corpora
▶ Any particular piece of text that we study is produced by
▶ one or more specific speakers or writers,
▶ in a specific dialect of a specific language,
▶ at a specific time,
▶ in a specific place,
▶ for a specific function.
▶ The most important dimension of variation is the language.
▶ NLP algorithms are most useful when they apply across many languages. The world has 7097 languages.
▶ It is important to test algorithms on more than one language, and particularly on languages with different properties; by contrast, there is an unfortunate current tendency for NLP algorithms to be developed or tested just on English.
▶ Code switching: a phenomenon in which multiple languages are used in a single communicative act.
▶ Other dimensions of variation are genre, the demographic characteristics of the writer, and time.
Getting text to workable format
Approximate order of steps for preprocessing text data:

Raw Text → Noise Removal → Normalization → Standardization → Clean Text

• Noise Removal: removal of stop words and punctuation
• Normalization: tokenization, stemming, lemmatization
Noise Removal

▶ Noise: any piece of text which is not relevant to the context of the data.
▶ Generally, the noisy entities are
▶ stop words,
▶ punctuation marks.
▶ Stop words removal
▶ It is the process of removing common language articles, pronouns and prepositions such as “and”.
Stop words using NLTK
Removing stop words from a sentence using NLTK
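A minimal sketch of stop-word removal follows. To stay self-contained it uses a small hand-picked stop-word set, which is an assumption made for illustration; with NLTK installed and its stopwords corpus downloaded, `nltk.corpus.stopwords.words("english")` supplies a much fuller list that can be dropped in instead. The name `remove_stop_words` is hypothetical.

```python
# A small hand-picked stop-word list (illustrative, not NLTK's list).
STOP_WORDS = {"a", "an", "the", "and", "is", "of", "to", "in", "on", "it"}

def remove_stop_words(sentence):
    """Drop every whitespace-separated word that is a stop word."""
    kept = [w for w in sentence.split() if w.lower() not in STOP_WORDS]
    return " ".join(kept)

print(remove_stop_words("The cat sat on the mat and purred"))  # cat sat mat purred
```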
Write a simple script to remove punctuation.
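One possible version of such a script uses only the standard library. It assumes "punctuation" means the ASCII characters in `string.punctuation`; the function name `remove_punctuation` is illustrative.

```python
import string

def remove_punctuation(text):
    """Delete every ASCII punctuation character from text."""
    # The third argument of str.maketrans lists characters to delete.
    return text.translate(str.maketrans("", "", string.punctuation))

print(remove_punctuation("Hello, world! How's it going?"))  # Hello world Hows it going
```

Note that apostrophes are removed too, so "How's" becomes "Hows"; whether that is desirable depends on the task.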
Text Normalization

• Before almost any natural language processing of a text, the text has to be
normalized.
• At least three tasks are commonly applied as part of any normalization process:
• Tokenizing (segmenting) words
• Normalizing word formats
• Segmenting sentences
Tokenization:

• It is a way of separating a piece of text into smaller units called tokens.
• Most processing of raw text happens at the token level.
• The ultimate goal of tokenization is the creation of a vocabulary: tokenization is performed on the corpus to obtain tokens, and the resulting tokens are then used to prepare a vocabulary.
• Vocabulary refers to the set of unique tokens in the corpus.
• The tokens can be words, characters, or subwords.
Contd..

Example:

• word tokenization for the sentence: "Never give up" - ["Never", "give", "up"]

• Character tokenization for "smarter" is ['s','m','a','r','t','e','r']

• Subword tokenization for "smarter" is ["smart", "er"]
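The three examples above can be reproduced with a short sketch. Word and character tokenization use built-in operations; the subword split is a toy that peels off one known suffix, with `split_suffix` and its `suffixes` parameter being hypothetical names. Real subword tokenizers learn their pieces from a corpus rather than from a fixed suffix list.

```python
sentence = "Never give up"
word_tokens = sentence.split()   # split on whitespace
char_tokens = list("smarter")    # one token per character

def split_suffix(word, suffixes=("er",)):
    """Toy subword split: separate one known suffix from the word."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix):
            return [word[: -len(suffix)], suffix]
    return [word]

subword_tokens = split_suffix("smarter")

print(word_tokens)     # ['Never', 'give', 'up']
print(char_tokens)     # ['s', 'm', 'a', 'r', 't', 'e', 'r']
print(subword_tokens)  # ['smart', 'er']
```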

Word Tokenization

• Word tokenization is the most commonly used tokenization algorithm.
• It splits a piece of text into individual words based on a certain delimiter.
• Depending upon the delimiters, different word-level tokens are formed.
Methods to perform tokenization:

▶ Using Python's split() function
▶ Using regular expressions
▶ Using NLTK
Tokenization using Python’s split() function

• We can use only one separator at a time.
• split() does not treat punctuation as a separate token.
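Both limitations are easy to see on a short example. The sample sentence is an assumption chosen for illustration.

```python
text = "Never give up, never surrender!"

whitespace_tokens = text.split()   # default separator: any whitespace
comma_pieces = text.split(",")     # exactly one explicit separator

print(whitespace_tokens)  # ['Never', 'give', 'up,', 'never', 'surrender!']
print(comma_pieces)       # ['Never give up', ' never surrender!']
```

In the first result the comma and exclamation mark stay glued to "up," and "surrender!"; in the second, splitting on the comma leaves the spaces untouched.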
Tokenization using Regular Expressions (RegEx)
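A minimal sketch with Python's `re` module, using a sample sentence chosen for illustration. The first pattern keeps only words; the second also emits each punctuation mark as its own token.

```python
import re

text = "Never give up, never surrender!"

word_tokens = re.findall(r"\w+", text)                # words only
tokens_with_punct = re.findall(r"\w+|[^\w\s]", text)  # words and punctuation

print(word_tokens)        # ['Never', 'give', 'up', 'never', 'surrender']
print(tokens_with_punct)  # ['Never', 'give', 'up', ',', 'never', 'surrender', '!']
```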
Tokenization using NLTK
Word Normalization, Stemming and Lemmatization
▶ Used to prepare text, words, and documents for further processing.
▶ Stemming and lemmatization help us obtain the root forms of inflected words.
Stemming
• Stemming helps us obtain the root forms of inflected words.
• The stem (root) is the part of the word to which you add inflectional (changing/deriving) affixes such as -ed, -ize, -s, de-, mis-.
• Stemming a word or sentence may result in words that are not actual words. Stems are created by removing the suffixes or prefixes used with a word.
• A computer program that stems words is called a stemming program, or stemmer.
• PorterStemmer is a stemming algorithm available in NLTK that uses suffix stripping.
• It does not follow linguistics, but rather a set of 5 rules for different cases that are applied in phases to generate stems.
Create a function which takes a sentence and returns the stemmed sentence.
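The sketch below shows the shape of such a function with a toy suffix stripper. It is not the Porter algorithm: the `SUFFIXES` tuple, the minimum stem length of 3, and the names `toy_stem` and `stem_sentence` are assumptions for illustration. With NLTK installed, `PorterStemmer().stem(word)` from `nltk.stem` could replace `toy_stem`.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # tried in this order

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered

def stem_sentence(sentence):
    """Stem every whitespace-separated word of the sentence."""
    return " ".join(toy_stem(w) for w in sentence.split())

print(stem_sentence("He studies and played games"))  # he studi and play gam
```

The outputs "studi" and "gam" illustrate the point that stems need not be actual words.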
Lemmatization
• Lemmatization reduces inflected words properly, ensuring that the root word belongs to the language. In lemmatization the root word is called the lemma.
• For example, runs, running, and ran are all forms of the word run; therefore run is the lemma of all these words.
• As lemmatization returns an actual word of the language, it is used where it is necessary to get valid words.
• Python NLTK provides WordNetLemmatizer, which uses the WordNet database to look up lemmas of words.
Create a function which takes a sentence and returns the lemmatized sentence.
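A dictionary lookup captures the idea in miniature. The `LEMMA_TABLE` below is a tiny hand-made stand-in for the WordNet database, and `lemmatize_sentence` is a hypothetical name; with NLTK and its WordNet data available, `WordNetLemmatizer().lemmatize(word, pos="v")` would take the place of the table lookup.

```python
# Hand-made lookup table (illustrative stand-in for WordNet).
LEMMA_TABLE = {"runs": "run", "running": "run", "ran": "run", "cats": "cat"}

def lemmatize_sentence(sentence):
    """Replace each word by its lemma when the table knows it."""
    words = (w.lower() for w in sentence.split())
    return " ".join(LEMMA_TABLE.get(w, w) for w in words)

print(lemmatize_sentence("She runs and ran"))  # she run and run
```

Unlike a suffix stripper, the lookup maps the irregular form "ran" to "run", and every output is a valid word.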
Sentence Segmentation

▶ Sentence segmentation is another important step in text processing.

▶ The most useful cues for segmenting a text into sentences are punctuation, like periods,
question marks, and exclamation points.

• Question marks and exclamation points are relatively unambiguous markers of sentence
boundaries. Periods, on the other hand, are more ambiguous.
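A naive splitter makes the ambiguity of periods visible. The sketch assumes a sentence ends at any `.`, `?` or `!` followed by whitespace; `naive_sentence_split` is a hypothetical name.

```python
import re

def naive_sentence_split(text):
    """Split after '.', '?' or '!' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]

print(naive_sentence_split("Is it raining? Yes! Take an umbrella."))
# ['Is it raining?', 'Yes!', 'Take an umbrella.']

print(naive_sentence_split("Dr. Smith arrived."))
# ['Dr.', 'Smith arrived.']
```

The second call splits wrongly after the abbreviation "Dr.", which is why practical segmenters need abbreviation lists or trained models.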
Standardization of Data
The common operations performed to standardize the data are
