NLP Lecture Slides - Part 1
Natural Language Processing
UNIT – IV
Contents
▶ Introduction
▶ Applications
▶ Chatbots and virtual agents (Alexa, Google Assistant, Siri): importance, applications
▶ NLP Subproblems
▶ Bag of Words
▶ n-gram
Modern Conversational Agents can:
• Answer questions
• Book flights
• Find restaurants
These are functions for which they rely on a much more sophisticated understanding of the user's intent.
Survey Analysis
▶ Surveys are an important way of evaluating a company's performance.
▶ They are used to get customers' feedback on various products.
▶ They are useful in understanding the flaws and help companies improve their products.
▶ Text Classification
▶ Language Modelling
▶ Information Extraction
▶ Information Retrieval
▶ Conversational Agents
▶ Text Summarization
▶ Question Answering
▶ Machine Translation
▶ Topic Modelling
▶ Speech Recognition
Chatbots and Virtual Assistants
▶ What is a chatbot?
▶ It's all about the conversation.
▶ A chatbot is a piece of software, usually powered by artificial intelligence, with which humans can interact.
▶ They are sometimes called virtual assistants, chatbox, or even chatterbox.
▶ Chatbots can converse with us humans through both text and voice.
▶ Who is using this technology?
▶ Though chatbots are relatively old technology, businesses have only recently started putting them to effective use.
▶ Companies like Disney use chatbots as a brand and marketing play.
▶ Companies like Google use chatbots to innovate in their field.
▶ Companies like Burberry and Staples use chatbots as a customer service tool.
▶ And many more!
Chatbots and Virtual Assistants
Examples: Alexa, Google Assistant, Siri
Benefits/Importance/Applications of Chatbots and Virtual Assistants
▶ Businesses use chatbots extensively to optimize internal business processes, boost productivity, raise revenue, and improve customer experience.
▶ Website chatbots increase engagement by giving quick and personalized responses. Customers who would otherwise have to wait longer for a response via traditional telephone channels benefit significantly from this.
▶ Because of their ease of use, chatbots have a very high adoption rate. Chatbot adoption opens up a floodgate of possibilities for businesses to better their client interaction process and operational efficiency, lowering customer service costs.
▶ Virtual assistants make organizing and carrying out our daily chores easier.
▶ Virtual assistants come in handy in supporting our tasks, whether it's setting reminders and alarms, adding appointments to the calendar, making calls, or retrieving information from the internet.
▶ Virtual assistants can now control smart home gadgets such as lighting, thermostats, and music devices.
NLP Subproblems
Text Categorization
Machine Translation
Text Summarization
Entity Recognition
Temporal event recognition
Text Generation
Natural Language Interface
Speech Recognition
Text to speech
Text Categorization
Sentiment Analysis
Spam Detection
Authorship Attribution
Named Entity Recognition is one of the key entity detection methods in NLP.
Named entity recognition is a natural language processing technique that can automatically scan entire
articles and pull out some fundamental entities in a text and classify them into predefined categories.
Entities may be:
• People's names
• Company names
• Organizations
• Geographic locations (both physical and political)
• Product names
• Dates and times
• Quantities
• Monetary values
• Percentages, and more
Contd..
In simple words, Named Entity Recognition is the process of detecting named entities such as person names, location names, company names, etc., in text.
It is also known as entity identification, entity extraction, or entity chunking.
With the help of named entity recognition, we can extract key information to understand the text, or merely use it to extract important information to store in a database.
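As a hedged illustration (the slides do not prescribe a specific library), the sketch below performs named entity recognition with NLTK's ne_chunk; the sample sentence and the printed labels are assumptions for demonstration only.

```python
# A minimal sketch of named entity recognition with NLTK.
# The sample sentence and the exact labels printed are illustrative assumptions.
import nltk

# One-time downloads of the resources this pipeline needs.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("maxent_ne_chunker")
nltk.download("words")

text = "Sundar Pichai is the CEO of Google, headquartered in Mountain View."

tokens = nltk.word_tokenize(text)   # split the sentence into word tokens
tagged = nltk.pos_tag(tokens)       # assign part-of-speech tags
tree = nltk.ne_chunk(tagged)        # group tagged tokens into named-entity chunks

# Walk the chunk tree and print each named entity with its predicted category.
for subtree in tree:
    if hasattr(subtree, "label"):
        entity = " ".join(word for word, tag in subtree.leaves())
        print(entity, "->", subtree.label())   # e.g. Sundar Pichai -> PERSON
```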
Notion of Corpus: Words – Types and Tokens
▶ Are capitalized tokens like They and uncapitalized tokens like they the same word?
▶ How about inflected forms like cats versus cat? These two words have the same lemma cat, but they are different wordforms.
▶ The wordform is the full inflected or derived form of the word.
▶ Word Types are the number of distinct words in a corpus; if the set of words in the vocabulary is V, the number of types is the vocabulary size |V|.
▶ Word Tokens are the total number N of running words.
▶ Ignore punctuation and find the number of tokens and types in the following sentence (a counting sketch is shown below).
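Since the slide's example sentence is not reproduced above, the sketch below uses an illustrative sentence of its own to show how the counts are obtained while ignoring punctuation.

```python
# Counting word tokens and types while ignoring punctuation.
# The sentence below is only an illustrative example.
import string

sentence = ("They picnicked by the pool, then lay back on the grass "
            "and looked at the stars.")

# Strip punctuation, then split on whitespace to get the running words (tokens).
cleaned = sentence.translate(str.maketrans("", "", string.punctuation))
tokens = cleaned.split()

types = set(tokens)   # the distinct wordforms (the vocabulary V)

print("Tokens N  =", len(tokens))   # 16 running words
print("Types |V| =", len(types))    # 14 distinct words ("the" repeats; "They" != "the" here)
```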
Common text preprocessing operations:
• Noise Removal
• Removal of stop words and punctuation (a sketch follows this list)
• Tokenization
• Stemming
• Lemmatization
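As a sketch of the stop-word and punctuation removal step (the slides do not fix a particular method), the function below uses NLTK's English stop-word list; the function name and example sentence are illustrative.

```python
# A sketch of stop-word and punctuation removal using NLTK's English stop-word list.
# The function name and example sentence are illustrative.
import string

import nltk
nltk.download("punkt")
nltk.download("stopwords")

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))

def remove_stopwords_and_punct(sentence):
    # Keep only tokens that are neither stop words nor punctuation marks.
    tokens = word_tokenize(sentence)
    return [t for t in tokens if t.lower() not in stop_words and t not in string.punctuation]

print(remove_stopwords_and_punct("This is a simple example, showing stop word removal."))
# e.g. ['simple', 'example', 'showing', 'stop', 'word', 'removal']
```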
• Before almost any natural language processing of a text, the text has to be
normalized.
• At least three tasks are commonly applied as part of any normalization process:
• Tokenizing (segmenting) words
• Normalizing word formats
• Segmenting sentences
Tokenization
Example: word tokenization for the sentence "Never give up" → ["Never", "give", "up"]
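A minimal sketch of this example using NLTK's word_tokenize (one possible tokenizer; the slide only shows the expected output):

```python
# A minimal sketch of word tokenization with NLTK's word_tokenize.
import nltk
nltk.download("punkt")

from nltk.tokenize import word_tokenize

print(word_tokenize("Never give up"))   # ['Never', 'give', 'up']
```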
Stemming
A stemmer does not follow linguistic rules; rather, it applies a set of 5 rules for different cases, in phases, to generate stems.
Exercise: create a function which takes a sentence and returns the stemmed sentence (a sketch follows below).
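A possible sketch of the requested function, assuming NLTK's PorterStemmer (the slide does not name a specific stemmer); the example sentence is illustrative.

```python
# A sketch of the requested function, assuming NLTK's PorterStemmer.
import nltk
nltk.download("punkt")

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()

def stem_sentence(sentence):
    # Tokenize, stem each token, and join the stems back into a string.
    tokens = word_tokenize(sentence)
    return " ".join(stemmer.stem(token) for token in tokens)

print(stem_sentence("The cats are running quickly"))   # e.g. "the cat are run quickli"
```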
Lemmatization
Lemmatization reduces inflected words properly, ensuring that the root word belongs to the language. In lemmatization, the root word is called the lemma.
For example, runs, running, and ran are all forms of the word run; therefore, run is the lemma of all these words.
As lemmatization returns an actual word of the language, it is used where it is necessary to get valid words.
Python NLTK provides WordNetLemmatizer, which uses the WordNet database to look up the lemmas of words.
Exercise: create a function which takes a sentence and returns the lemmatized sentence (a sketch follows below).
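A possible sketch of the requested function using the WordNetLemmatizer mentioned above; passing pos="v" is an assumption so that verb forms reduce to their base form, since the lemmatizer defaults to treating words as nouns.

```python
# A sketch of the requested function using NLTK's WordNetLemmatizer.
# pos="v" is an assumption so that verb forms reduce to their base form.
import nltk
nltk.download("punkt")
nltk.download("wordnet")

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()

def lemmatize_sentence(sentence):
    # Tokenize, then look up the lemma of each token in WordNet.
    tokens = word_tokenize(sentence)
    return " ".join(lemmatizer.lemmatize(token, pos="v") for token in tokens)

print(lemmatize_sentence("runs running ran"))   # "run run run"
```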
Sentence Segmentation
▶ The most useful cues for segmenting a text into sentences are punctuation marks, like periods, question marks, and exclamation points.
• Question marks and exclamation points are relatively unambiguous markers of sentence boundaries. Periods, on the other hand, are more ambiguous.
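A minimal sketch of sentence segmentation with NLTK's sent_tokenize (illustrative; the slides do not name a tool), using a sample text in which periods are ambiguous:

```python
# A minimal sketch of sentence segmentation with NLTK's sent_tokenize.
# The sample text is illustrative and includes ambiguous periods ("Dr.", "4.5").
import nltk
nltk.download("punkt")

from nltk.tokenize import sent_tokenize

text = "Dr. Smith earned 4.5 million dollars. Did he retire? He did not!"

# The trained Punkt model distinguishes sentence-final periods from abbreviations
# and decimals, while ? and ! are treated as unambiguous boundaries.
for sentence in sent_tokenize(text):
    print(sentence)
```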
Standardization of Data
The common operations performed to standardize the data are noise removal, removal of stop words and punctuation, tokenization, stemming, and lemmatization.