NLP Lecture Slides - Part 1

Uploaded by karthiktej890
Copyright © All Rights Reserved

Natural Language Processing
UNIT – IV
Contents
▶ Introduction.
▶ Applications.
▶ Chatbots and virtual agents (Alexa, Google Assistant, Siri): importance and applications.
▶ NLP subproblems.
▶ Components of natural language.
▶ Steps to get text data into a workable format.
▶ Term Frequency, Inverse Document Frequency.
▶ Bag of Words.
▶ n-gram.
▶ One-hot encoding.
▶ Notion of corpus. Intro to NLTK.


NLP
▶ NLP is among the hottest topics in the field of data science.
▶ Companies are investing heavily in research in this field.
▶ Many people are trying to understand NLP and its applications to build a career around it.
▶ Many businesses want to integrate it into their operations.
Are you using NLP these days?
Search Autocorrect and Autocomplete – Language Translator
Social media monitoring
▶ More people these days have started using social media for posting their
thoughts about a particular product, policy, or matter.
▶ These could contain some useful information about an individual’s likes and
dislikes.
▶ Analyzing this unstructured data can help in generating valuable insights. NLP comes to the rescue here too.
▶ Various NLP techniques are used by companies to analyze social media posts and learn what customers think about their products.
▶ Companies also use social media monitoring to understand the issues and problems that their customers face when using their products.
Chatbots

Modern conversational agents can
• Answer questions
• Book flights
• Find restaurants
• Perform functions for which they rely on a much more sophisticated understanding of the user’s intent
Survey Analysis
▶ Surveys are an important way of evaluating a company’s performance.
▶ Companies use surveys to get customers’ feedback on various products.
▶ This feedback is useful in understanding flaws and helps companies improve their products.
▶ NLP is used to analyze the surveys and generate insights from them, such as knowing the sentiments of users and analyzing product reviews to understand the pros and cons.
Targeted Advertising – Hiring and Recruitment

Targeted advertising is a type of online advertising where ads are shown to the user based on their online activity.

It saves companies a lot of money, because relevant ads are shown only to potential customers.
Voice Assistants
Conventional vs. NLP-based search
What is NLP?
▶ Natural language processing is a sub-field of linguistics, computer
science and AI concerned with the interactions between computers
and human language
▶ NLP makes computers understand complex language structure and retrieve
meaningful pieces of information from it
▶ Modern challenges in NLP involve
▶ speech recognition,
▶ natural language understanding and
▶ natural language generation
Applications of NLP

▶ Text Classification
▶ Language Modelling
▶ Information Extraction
▶ Information Retrieval
▶ Conversational Agents
▶ Text Summarization
▶ Question Answering
▶ Machine Translation
▶ Topic Modelling
▶ Speech Recognition
Chatbots and Virtual Assistants
▶ What is a chatbot?
▶ It's all about the conversation.
▶ A chatbot is a piece of software, usually powered by artificial intelligence, with which humans can interact.
▶ They are sometimes called virtual assistants, chatbox, or even chatterbox.
▶ Chatbots can converse with humans through both text and voice.
▶ Who is using this technology?
▶ Although chatbots are a relatively old technology, businesses have only recently started putting them to effective use.
▶ Companies like Disney use chatbots as a brand and marketing play.
▶ Companies like Google use chatbots to innovate in their field.
▶ Companies like Burberry and Staples use chatbots as a customer service tool.
▶ And many more!
Chatbots and Virtual Assistants

Artificial intelligence and chatbots


▶ Most chatbots are powered by artificial intelligence, or AI.
▶ AI chatbots tend to be more useful, purely because they are smarter and can learn over
time. This is, of course, valuable to businesses.
▶ Artificial intelligence in chatbots comes in many forms. The most common are natural language processing (NLP), which powers the language side of the chatbot, and machine learning (ML), which powers data and algorithms.
Some Virtual Assistants

Alexa, Google Assistant, Siri
Benefits/importance/applications of Chatbots and Virtual Assistants
▶ Businesses use chatbots extensively to optimize internal business processes, boost productivity, raise revenue, and improve customer experience.
▶ Website chatbots increase engagement by giving quick and personalized responses. Customers who would otherwise have to wait longer for a response via traditional telephone channels benefit significantly from this.
▶ Because of their ease of use, chatbots have a very high adoption rate. Chatbot adoption opens up a floodgate of possibilities for businesses to improve their client interaction process and operational efficiency, lowering customer service costs.
▶ Virtual assistants make organizing and carrying out our daily chores easier.
▶ Virtual assistants come in helpful in supporting our tasks, whether it's setting reminders and alarms, adding commissions to the calendar, making calls, or retrieving information from the internet.
▶ Virtual assistants can now control smart home gadgets such as lighting, thermostats, and music devices.
NLP Subproblems

▶ Text Categorization
▶ Machine Translation
▶ Text Summarization
▶ Entity Recognition
▶ Temporal event recognition
▶ Text Generation
▶ Natural Language Interface
▶ Speech Recognition
▶ Text to speech
Text Categorization

▶ Sentiment Analysis
▶ Spam Detection
▶ Authorship Attribution

The goal of classification is
• to take a single observation,
• extract some useful features, and
• thereby classify the observation into one of a set of discrete classes.
Text Categorization: Sentiment Analysis

▶ Sentiment analysis is the extraction of sentiment, the positive or negative orientation that a writer expresses toward some object.
▶ The words in a review provide excellent cues for this task.
▶ For example, words like great, richly, awesome, pathetic, awful and ridiculously are informative cues.
Examples:
i. “I really like the new design of your website!” – Positive
ii. “The new design is awful” – Negative
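The cue-word idea can be sketched in a few lines of Python. This is only an illustration, not a real sentiment model: the split of the slide's cue words into positive and negative sets is an assumption, and "like" is added to the positive set (it is not among the slide's cue words) so that example (i) is covered.

```python
import re

# Assumed polarity for the slide's cue words; "like" is an extra,
# hypothetical cue added only so that example (i) is covered.
POSITIVE_CUES = {"great", "richly", "awesome", "like"}
NEGATIVE_CUES = {"pathetic", "awful", "ridiculously"}


def classify_sentiment(text):
    """Label text by counting positive and negative cue words."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(word in POSITIVE_CUES for word in words)
    negatives = sum(word in NEGATIVE_CUES for word in words)
    if positives > negatives:
        return "Positive"
    if negatives > positives:
        return "Negative"
    return "Neutral"


print(classify_sentiment("I really like the new design of your website!"))  # Positive
print(classify_sentiment("The new design is awful"))                        # Negative
```

A counting rule like this ignores negation ("not great") and context, which is one reason trained classifiers are normally used instead.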
Text Categorization: Spam Detection

▶ It is another important commercial application.
▶ It is a binary classification task of assigning an email to one of the two classes, spam or not-spam.
▶ Many lexical and other features can be used to perform this classification.
▶ For example, you might quite reasonably be suspicious of an email containing phrases like “online pharmaceutical” or “WITHOUT ANY COST” or “Dear Winner”.
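As a rough sketch of how such lexical features might be computed, the Python snippet below turns the three phrases from the slide into binary features. The function names and the threshold rule are hypothetical; a real spam filter would feed features like these into a trained classifier rather than a fixed threshold.

```python
# Suspicious phrases taken from the slide, stored in lower case.
SUSPICIOUS_PHRASES = ("online pharmaceutical", "without any cost", "dear winner")


def spam_features(email_text):
    """Return one binary feature per suspicious phrase (case-insensitive)."""
    lowered = email_text.lower()
    return {phrase: phrase in lowered for phrase in SUSPICIOUS_PHRASES}


def is_spam(email_text, threshold=1):
    """Hypothetical rule: spam if at least `threshold` phrases occur."""
    return sum(spam_features(email_text).values()) >= threshold


print(is_spam("Dear Winner, claim your prize WITHOUT ANY COST"))  # True
print(is_spam("The meeting has moved to 3 pm"))                   # False
```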
Text Categorization: Authorship Attribution

▶ Authorship attribution is the task of identifying the author of a given document.


Machine Translation

• Machine Translation (MT) automatically translates one natural language into another.
• Encoder-decoder models are commonly employed to solve machine translation problems.
• Machine translation can also be statistical or rule-based.
• Neural Machine Translation relies on neural network models to build statistical models.
Text Summarization

▶ Automatic text summarization aims to transform lengthy documents into shortened versions, something which could be difficult and costly to undertake manually.
▶ Machine learning algorithms can be trained to comprehend documents and identify the sections that convey important facts and information before producing the required summarized texts.
▶ For example, a news article can be fed into a machine learning algorithm to generate a summary.
Entity Recognition

▶ Named Entity Recognition is one of the key entity detection methods in NLP.
▶ Named entity recognition is a natural language processing technique that can automatically scan entire articles, pull out some fundamental entities in a text, and classify them into predefined categories.
▶ Entities may be
▶ Organizations,
▶ Quantities,
▶ Monetary values,
▶ Percentages,
▶ People’s names,
▶ Company names,
▶ Geographic locations (both physical and political),
▶ Product names,
▶ Dates and times, and more.
Contd..

▶ In simple words, Named Entity Recognition is the process of detecting named entities such as person names, location names, company names, etc., from the text.
▶ It is also known as entity identification, entity extraction, or entity chunking.
▶ With the help of named entity recognition, we can extract key information to understand the text, or merely use it to extract important information to store in a database.
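For two of the categories listed above (monetary values and percentages), entity detection can be approximated with regular expressions. The sketch below is a rule-based toy under that assumption; the patterns and the `extract_entities` helper are hypothetical, and names of people, companies, or places generally need a trained NER model instead.

```python
import re

# Hypothetical patterns for two entity categories only.
PATTERNS = {
    "MONEY": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?(?:\s(?:million|billion))?"),
    "PERCENT": re.compile(r"\d+(?:\.\d+)?%"),
}


def extract_entities(text):
    """Return (entity_text, label) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    return [(entity, label) for _, entity, label in sorted(found)]


print(extract_entities("Revenue rose 12% to $3.5 million last year."))
# [('12%', 'PERCENT'), ('$3.5 million', 'MONEY')]
```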
Text Generation

▶ Text generation is a subfield of natural language processing (NLP).
▶ It leverages knowledge in computational linguistics and artificial intelligence to automatically generate natural language texts, which can satisfy certain communicative requirements.
Natural Language Interface

▶ A natural language interface is a user interface in which the user and the system communicate via a natural (human) language.
▶ The user provides input via speech or some other method, and the system generates responses in the form of utterances delivered by speech, text or another suitable modality.
Speech Recognition

▶ Speech recognition, also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text, is a capability which enables a program to process human speech into a written format.
▶ Many speech recognition applications and devices are available, but the more advanced solutions use AI and machine learning. They integrate grammar, syntax, structure, and composition of audio and voice signals to understand and process human speech. Ideally, they learn as they go, evolving responses with each interaction.
Words – What counts as a word?

▶ Corpus (plural corpora): a computer-readable collection of text or speech.
▶ For example, the Brown corpus is a million-word collection of samples from 500 written English texts from different genres (newspaper, fiction, non-fiction, academic, etc.).
Contd..

▶ How many words are in the following Brown sentence?
▶ Sentence: He stepped out into the hall, was delighted to encounter a water brother.

This sentence has 13 words if we don’t count punctuation marks as words, 15 if we count punctuation.
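Both counts can be checked with a short Python sketch. The two regular expressions are one possible choice, assumed here for illustration: the first keeps only runs of letters, the second also keeps each punctuation mark as its own token.

```python
import re

sentence = "He stepped out into the hall, was delighted to encounter a water brother."

# Words only: runs of letters (apostrophes allowed inside a word).
words = re.findall(r"[A-Za-z']+", sentence)

# Words plus every punctuation mark as a separate token.
words_and_punct = re.findall(r"[A-Za-z']+|[^\w\s]", sentence)

print(len(words))            # 13
print(len(words_and_punct))  # 15
```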
Contd..

▶ Are capitalized tokens like They and uncapitalized tokens like they the same word?
▶ How about inflected forms like cats versus cat? These two words have the same lemma cat but are different wordforms.
▶ A lemma is a set of lexical forms having the same stem, the same major part-of-speech, and the same word sense.
▶ The wordform is the full inflected or derived form of the word.
Notion of Corpus: Words – Types and Tokens
▶ Word types are the distinct words in a corpus; if the set of words in the vocabulary is V, the number of types is the vocabulary size |V|.
▶ Word tokens are the running words; their total number is N.
▶ Ignore punctuation and find the number of tokens and types in the following sentence:

They picnicked by the pool, then lay back on the grass and looked at the stars

16 tokens, 14 types
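The token and type counts for this sentence can be reproduced with a minimal Python sketch. It assumes a token is any run of word characters and that types are case-sensitive; lowercasing first would give the same counts here, since "They" and "the" remain distinct.

```python
import re

sentence = ("They picnicked by the pool, then lay back on the grass "
            "and looked at the stars")

tokens = re.findall(r"\w+", sentence)  # running words, punctuation ignored
types = set(tokens)                    # distinct words

print(len(tokens))  # N = 16
print(len(types))   # |V| = 14 ("the" occurs three times)
```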
Notion of Corpus: Corpora
▶ Any particular piece of text that we study is produced by
▶ one or more specific speakers or writers,
▶ in a specific dialect of a specific language,
▶ at a specific time,
▶ in a specific place,
▶ for a specific function.
▶ The most important dimension of variation is the language.
▶ NLP algorithms are most useful when they apply across many languages. The world has 7097 languages.
▶ It is important to test algorithms on more than one language, and particularly on languages with different properties; by contrast, there is an unfortunate current tendency for NLP algorithms to be developed or tested just on English.
▶ Code switching: a phenomenon in which multiple languages are used in a single communicative act.
▶ Other dimensions of variation are genre, the demographic characteristics of the writer, and time.
Getting text to workable format
Approximate order of steps for preprocessing text data:

Raw Text → Noise Removal → Normalization → Standardization → Clean Text

• Noise Removal: removal of stop words and punctuation
• Normalization: tokenization, stemming, lemmatization
Noise Removal

▶ Noise: any piece of text which is not relevant to the context of the data.
▶ Generally, the noisy entities are
▶ stop words,
▶ punctuation marks.
▶ Stop words removal
▶ It is the process of removing common language articles, pronouns and prepositions such as “and”.
Stop words using NLTK
Removing stop words from a sentence using NLTK
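The slide's original code is not preserved in this text, so the following is only a sketch of one way to do it. It uses NLTK's English stop word list (`nltk.corpus.stopwords`), which requires a one-time `nltk.download("stopwords")`. The fallback list and the `remove_stop_words` helper are assumptions added so that the sketch still runs where NLTK or its data is missing.

```python
try:
    from nltk.corpus import stopwords
    # Needs the data from a one-time nltk.download("stopwords").
    STOP_WORDS = set(stopwords.words("english"))
except (ImportError, LookupError):
    # Fallback for environments without NLTK or its stopwords data:
    # a tiny hypothetical list, far smaller than NLTK's.
    STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "this", "to"}


def remove_stop_words(sentence):
    """Drop stop words from a whitespace-tokenized sentence."""
    kept = [word for word in sentence.split() if word.lower() not in STOP_WORDS]
    return " ".join(kept)


print(remove_stop_words("This is a sample sentence"))  # sample sentence
```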
Write a simple script to remove punctuations.
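One possible solution, using only the Python standard library, is sketched below. It assumes "punctuation" means the ASCII characters in `string.punctuation`; non-ASCII marks such as curly quotes would need to be added separately.

```python
import string


def remove_punctuation(text):
    """Delete every ASCII punctuation character listed in string.punctuation."""
    return text.translate(str.maketrans("", "", string.punctuation))


print(remove_punctuation("Hello, world! How's it going?"))  # Hello world Hows it going
```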
Text Normalization

• Before almost any natural language processing of a text, the text has to be
normalized.
• At least three tasks are commonly applied as part of any normalization process:
• Tokenizing (segmenting) words
• Normalizing word formats
• Segmenting sentences
Tokenization:

• It is a way of separating a piece of text into smaller units called tokens.
• Raw text is most commonly processed at the token level.
• The ultimate goal of tokenization is the creation of a vocabulary: tokenization is performed on the corpus to obtain tokens, and these tokens are then used to prepare a vocabulary.
• Vocabulary refers to the set of unique tokens in the corpus.
• The tokens can be words, characters, or subwords.
Contd..

Example:

• Word tokenization for the sentence "Never give up" is ["Never", "give", "up"]

• Character tokenization for "smarter" is ['s','m','a','r','t','e','r']

• Subword tokenization for "smarter" is ["smart", "er"]
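The word-level and character-level examples above can be reproduced directly in Python, as sketched below. Subword tokenization is left out because it normally relies on a learned subword vocabulary rather than a one-line rule; the variable names are illustrative.

```python
sentence = "Never give up"

word_tokens = sentence.split()         # ['Never', 'give', 'up']
char_tokens = list("smarter")          # ['s', 'm', 'a', 'r', 't', 'e', 'r']

# A vocabulary is the set of unique tokens; sorted here for a stable order.
vocabulary = sorted(set(char_tokens))  # ['a', 'e', 'm', 'r', 's', 't']

print(word_tokens)
print(char_tokens)
print(vocabulary)
```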


Word Tokenization

• Word tokenization is the most commonly used tokenization algorithm.
• It splits a piece of text into individual words based on a certain delimiter.
• Depending upon delimiters, different word-level tokens are formed.
Methods to perform tokenization:

▶ Using Python’s split() function
▶ Using regular expressions
▶ Using NLTK
Tokenization using Python’s split() function

We can use only one separator at a time.
split() does not treat punctuation as a separate token.
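Both limitations show up in a short sketch (the sample text is made up for illustration):

```python
text = "Hello, world. Tokenization is fun!"

# Default split(): whitespace is the delimiter, punctuation stays attached.
whitespace_tokens = text.split()
print(whitespace_tokens)  # ['Hello,', 'world.', 'Tokenization', 'is', 'fun!']

# An explicit separator: only that one string is used to split.
comma_tokens = text.split(", ")
print(comma_tokens)       # ['Hello', 'world. Tokenization is fun!']
```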
Tokenization using Regular Expressions (RegEx)
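A minimal sketch with Python's `re` module follows. The two patterns are one reasonable choice, assumed here for illustration: the first drops punctuation, the second keeps each punctuation mark as its own token.

```python
import re

text = "Hello, world. Tokenization is fun!"

# Runs of word characters only; punctuation is discarded.
word_tokens = re.findall(r"\w+", text)
print(word_tokens)  # ['Hello', 'world', 'Tokenization', 'is', 'fun']

# Words, plus each punctuation mark as a separate token.
all_tokens = re.findall(r"\w+|[^\w\s]", text)
print(all_tokens)   # ['Hello', ',', 'world', '.', 'Tokenization', 'is', 'fun', '!']
```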
Tokenization using NLTK
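The slide's original code is not preserved here. As a sketch, NLTK's `wordpunct_tokenize` splits text into alphanumeric runs and punctuation runs and needs no extra data download; `nltk.word_tokenize` is the more common choice but requires the Punkt tokenizer data. The regex fallback is an assumption so that the example still runs without NLTK installed.

```python
import re

try:
    from nltk.tokenize import wordpunct_tokenize
except ImportError:
    # Fallback without NLTK: the same regular expression that
    # NLTK's WordPunctTokenizer is built on.
    def wordpunct_tokenize(text):
        return re.findall(r"\w+|[^\w\s]+", text)


tokens = wordpunct_tokenize("Never give up, ever!")
print(tokens)  # ['Never', 'give', 'up', ',', 'ever', '!']
```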
Word Normalization, Stemming and Lemmatization
▶ Used to prepare text, words, and documents for further processing.
▶ Stemming and lemmatization help us obtain the root forms of inflected words.
Stemming
▶ Stemming helps us obtain the root forms of inflected words.
▶ The stem (root) is the part of the word to which you add inflectional (changing/deriving) affixes such as -ed, -ize, -s, de-, mis-.
▶ Stemming a word or sentence may result in words that are not actual words. Stems are created by removing the suffixes or prefixes used with a word.
▶ A computer program that stems words is called a stemming program, or stemmer.
▶ PorterStemmer is a stemming algorithm present in NLTK which uses suffix stripping.
▶ It does not follow linguistics; rather, it applies sets of rules for different cases in five phases to generate stems.
Create a function which takes a sentence and returns the stemmed sentence.
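One way to write such a function with NLTK's `PorterStemmer` is sketched below. The `stem_sentence` name and the whitespace tokenization are assumptions. The fallback branch is a toy suffix stripper, not the Porter algorithm, included only so that the sketch runs without NLTK.

```python
try:
    from nltk.stem import PorterStemmer
    _stem = PorterStemmer().stem
except ImportError:
    # Fallback without NLTK: a toy suffix stripper, NOT the Porter algorithm.
    def _stem(word):
        word = word.lower()
        for suffix in ("ing", "ed", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word


def stem_sentence(sentence):
    """Stem every whitespace-separated word and join the stems back together."""
    return " ".join(_stem(word) for word in sentence.split())


print(stem_sentence("The cats jumped"))  # the cat jump
```

Note that the stemmer lowercases its input, and that stems of other words (for example "flies") need not be valid English words.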
Lemmatization
▶ Lemmatization reduces inflected words properly, ensuring that the root word belongs to the language. In lemmatization, the root word is called the lemma.
▶ For example, runs, running, ran are all forms of the word run, therefore run is the lemma of all these words.
▶ As lemmatization returns an actual word of the language, it is used where it is necessary to get valid words.
▶ Python NLTK provides WordNetLemmatizer, which uses the WordNet database to look up lemmas of words.
Create a function which takes a sentence and returns the lemmatized sentence.
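A possible version with `WordNetLemmatizer` is sketched below; it needs a one-time `nltk.download("wordnet")`. Two simplifications are assumed: every word is treated as a verb (`pos="v"`), whereas a fuller pipeline would pass each word's actual part of speech, and a tiny hypothetical lookup table stands in when NLTK or the WordNet data is unavailable.

```python
def _load_lemmatizer():
    """Return a word -> lemma function, preferring NLTK's WordNetLemmatizer."""
    try:
        from nltk.stem import WordNetLemmatizer
        lemmatizer = WordNetLemmatizer()
        lemmatizer.lemmatize("runs", pos="v")  # forces the WordNet data to load
        # Simplification: treat every word as a verb.
        return lambda word: lemmatizer.lemmatize(word, pos="v")
    except (ImportError, LookupError):
        # Fallback without NLTK or WordNet data: a tiny hypothetical table.
        table = {"runs": "run", "running": "run", "ran": "run"}
        return lambda word: table.get(word, word)


_lemma = _load_lemmatizer()


def lemmatize_sentence(sentence):
    """Lemmatize every whitespace-separated word and join the results."""
    return " ".join(_lemma(word) for word in sentence.split())


print(lemmatize_sentence("runs running ran"))  # run run run
```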
Sentence Segmentation

▶ Sentence segmentation is another important step in text processing.

▶ The most useful cues for segmenting a text into sentences are punctuation, like periods,
question marks, and exclamation points.

• Question marks and exclamation points are relatively unambiguous markers of sentence
boundaries. Periods, on the other hand, are more ambiguous.
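The ambiguity of periods is easy to see with a naive splitter that treats every ".", "?" or "!" followed by whitespace as a sentence boundary. This is a deliberately simple sketch with a made-up example text, not a recommended segmenter:

```python
import re


def naive_sentence_split(text):
    """Split after '.', '?' or '!' whenever whitespace follows."""
    return re.split(r"(?<=[.?!])\s+", text.strip())


print(naive_sentence_split("Is it late? Yes! Dr. Smith has left."))
# ['Is it late?', 'Yes!', 'Dr.', 'Smith has left.']
```

The question mark and exclamation point are handled correctly, but the period in the abbreviation "Dr." is wrongly taken as a boundary.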
Standardization of Data
The common operations performed to standardize the data are