ARTIFICIAL INTELLIGENCE
Unit 6: Natural Language Processing (NLP)
1. Natural Language Processing: Natural Language Processing (NLP) is a branch of artificial intelligence that enables computers to process human language in the form of text or voice data, 'understand' its full meaning and mimic human conversation.
2. Sentiment Analysis: Sentiment Analysis refers to the use of AI-based linguistic analysis to detect emotional and language tones in written text or in speech converted to text.
3. Text Classification: Text Classification is the process of understanding, analysing and categorising
unstructured text into organised groups using NLP and other AI technologies based on predetermined
tags and categories.
4. Virtual Assistants: Virtual Assistants are NLP-based programs that are automated to communicate in a human voice, mimicking human interaction to help ease your day-to-day tasks, such as showing weather reports, creating reminders, making shopping lists etc. Examples include Siri, Cortana, Alexa and Google Assistant.
7. Chatbots: Chatbots are essentially software applications that use AI and NLP to assist humans and communicate through text or voice.
i. Script Chatbots: Script chatbots work around a pre-programmed script and can respond only to the limited set of queries they have been scripted for.
ii. Smart Chatbots: Smart chatbots are based on AI. These bots don't have pre-programmed answers; they learn with time, catching keywords and putting them in context, and help users arrive at the most relevant answers to their queries.
Script Chatbot vs Smart Chatbot:
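To make the contrast concrete, here is a minimal sketch in Python (the questions, replies and keyword list are made up purely for illustration, and the keyword-matching bot is only a toy stand-in for a real AI-based smart chatbot): a script bot can answer only the exact questions it was programmed for, while the keyword-matching bot catches keywords anywhere in the user's message and answers from context.

# Minimal illustration of the two chatbot styles; all questions and replies are hypothetical.

# Script bot: replies only to the exact questions it was programmed for.
SCRIPT = {
    "what are your working hours?": "We are open from 9 am to 6 pm, Monday to Saturday.",
    "where are you located?": "We are located at the City Centre Mall.",
}

def script_bot(message: str) -> str:
    # Look up the exact message; anything unscripted gets a fallback reply.
    return SCRIPT.get(message.lower().strip(),
                      "Sorry, I can only answer the questions I was scripted for.")

# Keyword-matching bot: a very simplified stand-in for a smart (AI-based) chatbot.
KEYWORDS = {
    "hours": "We are open from 9 am to 6 pm, Monday to Saturday.",
    "located": "We are located at the City Centre Mall.",
    "price": "Prices start at Rs. 499.",
}

def keyword_bot(message: str) -> str:
    # Catch keywords anywhere in the message and answer from their context.
    for keyword, reply in KEYWORDS.items():
        if keyword in message.lower():
            return reply
    return "Could you rephrase that?"

print(script_bot("Where are you located?"))               # scripted question -> scripted answer
print(script_bot("Hey, where exactly are you?"))          # unscripted wording -> fallback
print(keyword_bot("Hey, where exactly are you located?")) # keyword match -> relevant answer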
Humans communicate through language, which we process all the time. As a person speaks, the sound travels and enters the listener's eardrum. This sound is then converted into neuron impulses and transported to the brain for processing. After processing, the brain gains an understanding of the meaning of the sound.
The computer understands the language of numbers. Everything that is sent to the machine has to be converted to numbers, and if a single mistake is made while typing, the computer throws an error and does not process that part. The communication made by machines is very basic and simple.
1. Arrangement of the words and meaning: There are rules in human language which provide structure to a language. There are nouns, verbs, adverbs and adjectives, and a word can be a noun at one time and an adjective at another.
2. Multiple meanings of a word: In natural language, a word can have multiple meanings, and the meaning that fits is decided by the context of the statement.
3. Perfect Syntax, no Meaning: Sometimes, a statement can have a perfectly correct syntax but still not mean anything. For example, take a look at this statement: 'Chickens feed extravagantly while the moon drinks tea.' This statement is grammatically correct but does not make any sense.
How does NLP make it possible for machines to understand and speak just like humans?
We all know that the language of computers is numerical, so the very first step that comes to mind is to convert our language to numbers. This conversion happens in various steps, which are given below.
1. Text Normalisation: In Text Normalisation, we undergo several steps to normalise the text to a lower level. Text Normalisation helps in cleaning up the textual data in such a way that it comes down to a level where its complexity is lower than that of the actual data. The steps of Text Normalisation are listed below (a small code sketch of the whole pipeline follows this list):
a) Sentence Segmentation: Sentence Segmentation is the process of dividing the whole text into smaller
components, i.e., individual sentences. This is done to understand the thought or idea of each
individual sentence. For example:
b) Tokenisation: After segmenting the sentences, each sentence is then further divided into tokens. A token is a term used for any word, number or special character occurring in a sentence. Under tokenisation, every word, number and special character is considered separately, and each of them is now a separate token. For example:
c) Removing Stopwords, Special Characters and Numbers: The function of this step is to remove the words that are unimportant to the overall meaning and retain the important words in the text. For example, consider the following two sentences, which convey the same meaning to the computer.
This step, along with removing the stopwords, also removes the redundant special characters and numbers, which do not contribute to the overall meaning.
d) Converting text to a common case: After stopword removal, we convert the whole text into the same case, preferably lower case. This ensures that, if a machine is case sensitive, the overall result is not affected by the same word written in different cases being treated as two different words.
e) Stemming: In this step, the remaining words are reduced to their root words. In other words,
stemming is the process in which the affixes of words are removed and the words are converted to
their base form. Note that in stemming, the stemmed words might not be meaningful.
f) Lemmatization: Stemming and lemmatization are alternative processes to each other, as the role of both processes is the same – removal of affixes. The difference between them is that in lemmatization, the word we get after affix removal (also known as the lemma) is a meaningful one.
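Putting these steps together, here is a minimal sketch of the whole normalisation pipeline in plain Python (the sample paragraph, the stopword list and the crude suffix-stripping 'stemmer' are simplified assumptions made for illustration; real projects would typically rely on an NLP library for proper tokenisation, stopword lists, stemming and lemmatization):

import re

# Hypothetical input text, stopword list and suffix list, purely for illustration.
TEXT = "Aman was stressed. He talked to his friends and his friends suggested going to a therapist!"
STOPWORDS = {"a", "an", "and", "are", "he", "his", "is", "the", "to", "was"}
SUFFIXES = ("ing", "ed", "es", "s")   # very crude stemming rules

# a) Sentence Segmentation: divide the whole text into individual sentences.
sentences = [s for s in re.split(r"[.!?]\s*", TEXT) if s]

# b) Tokenisation: split each sentence into word / number / special-character tokens.
tokens = [tok for s in sentences for tok in re.findall(r"\w+|[^\w\s]", s)]

# c) Removing stopwords, special characters and numbers.
tokens = [t for t in tokens if t.isalpha() and t.lower() not in STOPWORDS]

# d) Converting text to a common (lower) case.
tokens = [t.lower() for t in tokens]

# e) Stemming: chop off common affixes (the result may not be a meaningful word).
def stem(word: str) -> str:
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

stemmed = [stem(t) for t in tokens]

# f) Lemmatization would instead map each word to its dictionary form (lemma),
#    which normally needs a lexical resource, so it is not re-implemented here.

print(sentences)   # segmented sentences
print(tokens)      # cleaned, lower-cased tokens
print(stemmed)     # stemmed tokens, e.g. 'suggested' -> 'suggest', 'stressed' -> 'stress'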
2. Bag of Words: Bag of Words is a Natural Language Processing model which helps in extracting features out of the text that can be used by machine learning algorithms. In Bag of Words, we get the occurrences of each word and construct the vocabulary for the text.
Step 1: Text Normalisation: In the first step, the data is to be collected and then pre-processed. For example, we have
the following text available:
Step 2: Create Dictionary: Now we can make a list of all of the words in our model vocabulary with the unique words
left after pre-processing of all the documents. So, the unique words in our Dictionary for the text will be:
Step 3: Create document vector: In this step, the vocabulary is written in the top row. Now, for each word in the
document, if it matches with the vocabulary, put a 1 under it. If the same word appears again, increment the previous
value by 1. And if the word does not occur in that document, put a 0 under it.
Since the first document contains the words aman, and, anil, are, stressed, all these words get a value of 1 and the rest of the words get a value of 0.
Step 4: Repeat for all documents: The same exercise has to be done for all the documents. Hence, the table becomes:
In this table, the header row contains the vocabulary of the corpus and three rows correspond to three different
documents.
Finally, this gives us the document vector table for our corpus. But the tokens have still not been converted to numbers. This leads us to the final step of our algorithm: TFIDF.
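As a concrete sketch of Steps 2 to 4, the snippet below builds the dictionary and the document vectors for a tiny corpus. Only the first document ('aman and anil are stressed') is taken from the example above; the other two documents are hypothetical stand-ins, since the original corpus is not reproduced here.

# Bag of Words sketch: build the vocabulary and document vectors for a tiny corpus.
# Document 1 matches the example in the text; documents 2 and 3 are hypothetical.
documents = [
    "aman and anil are stressed",
    "aman went to a therapist",
    "anil went to download a health chatbot",
]

# Step 2: Create Dictionary - collect every unique word across all documents.
vocabulary = []
for doc in documents:
    for word in doc.split():
        if word not in vocabulary:
            vocabulary.append(word)

# Steps 3 and 4: Create a document vector for each document -
# the count of each vocabulary word in that document (0 if it does not occur).
document_vectors = [[doc.split().count(word) for word in vocabulary] for doc in documents]

print(vocabulary)
for vector in document_vectors:
    print(vector)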
3. TFIDF: Term Frequency & Inverse Document Frequency: TFIDF stands for Term Frequency and Inverse Document Frequency. TFIDF helps us in identifying the value of each word. Let us understand each term one by one.
Term Frequency: Term frequency is the frequency of a word in one document. Term frequency can easily be found from the document vector table, as that table lists the frequency of each vocabulary word in each document.
Inverse Document Frequency: To understand Inverse Document Frequency, let us first understand what document frequency means. Document Frequency is the number of documents in which a word occurs, irrespective of how many times it has occurred in those documents. The document frequency for the exemplar vocabulary would be:
For inverse document frequency, we put the document frequency in the denominator while the total number of documents is the numerator. Here, the total number of documents is 3, hence the inverse document frequency becomes:
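Continuing the same small example corpus used in the Bag of Words sketch above (documents 2 and 3 are hypothetical), the snippet below computes document frequency, inverse document frequency and the final TFIDF values; the common formulation TFIDF = term frequency × log(total documents / document frequency) is assumed here, with the log taken to base 10.

import math

# Same tiny corpus as in the Bag of Words sketch (documents 2 and 3 are made up).
documents = [
    "aman and anil are stressed",
    "aman went to a therapist",
    "anil went to download a health chatbot",
]
tokenised = [doc.split() for doc in documents]
vocabulary = sorted({word for doc in tokenised for word in doc})

# Document Frequency: the number of documents in which a word occurs.
doc_freq = {word: sum(word in doc for doc in tokenised) for word in vocabulary}

# Inverse Document Frequency: total number of documents / document frequency.
total_docs = len(documents)
inv_doc_freq = {word: total_docs / doc_freq[word] for word in vocabulary}

# TFIDF of a word in a document: term frequency * log(inverse document frequency).
for index, doc in enumerate(tokenised, start=1):
    tfidf = {word: doc.count(word) * math.log10(inv_doc_freq[word]) for word in set(doc)}
    print(f"Document {index}:", {w: round(v, 3) for w, v in tfidf.items()})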