
Class – X

ARTIFICIAL INTELLIGENCE

Unit 6
(Natural Language Processing)
Natural Language Processing: Natural Language Processing (NLP) is a branch of artificial intelligence that
enables computers to process human language in the form of text or voice data, 'understand' its full meaning
and mimic human conversation.

Applications of NLP: Some common NLP applications are:

1. Automatic Text Summarisation: Automatic Text Summarisation is an NLP technique in which a computer
program shortens a longer text and generates a summary that conveys the intended message while retaining
the most important information.

2. Sentiment Analysis: Sentiment Analysis refers to the use of linguistic analysis with AI to detect
emotional and language tones in written text or in speech converted to text.

3. Text Classification: Text Classification is the process of understanding, analysing and categorising
unstructured text into organised groups, based on predetermined tags and categories, using NLP and other
AI technologies.

4. Virtual Assistants: Virtual Assistants are NLP-based programs that are automated to communicate in
human voice, mimicking human interaction to help ease your day-to-day tasks, such as showing
weather reports, creating reminders, making shopping lists, etc. Examples include Siri, Cortana, Alexa and
Google Assistant.

5. Chatbots: Chatbots are software applications that use AI and NLP to assist humans and
communicate through text or voice.

Depending on how they are programmed, they can be categorized as:


i. Script Chatbots: They work on pre-written keywords that they understand. Each of the
commands that they are going to follow must be coded into them by the developer. So,
if a user asks them something outside of their knowledge base, they respond with
"sorry, I did not understand", or something along those lines.

ii. Smart Chatbots: Smart chatbots are based on AI; they do not have pre-programmed
answers. They learn with time, catching keywords and putting them in context, and
help users arrive at the most relevant answers to their queries.
Script Chatbot vs Smart Chatbot:

S.No. | Script Chatbot | Smart Chatbot
1 | These are task-specific chatbots. | These are flexible and versatile chatbots.
2 | Script bots work around a script which is programmed into them. | Smart bots work on bigger databases and other resources directly.
3 | These require simple or very little programming. | These require advanced AI-based programming.
4 | These have limited functionality. | These are flexible and adaptive.

Human Language vs Computer Language:

Humans communicate through language, which we process all the time. As a person speaks, the
sound travels and enters the listener's eardrum. This sound is then converted into neural impulses and
transported to the brain for processing. After processing, the brain understands the meaning of the sound.

The computer understands the language of numbers. Everything that is sent to the machine has to
be converted to numbers. And while typing, if a single mistake is made, the computer throws an error and
does not process that part. The communications made by the machines are very basic and simple.

Difficulties faced by machines in understanding human language:

1. Arrangement of the words and meaning: There are rules in human language which provide structure to a
language. There are nouns, verbs, adverbs, adjectives. A word can be a noun at one time and an adjective
some other time.

2. Multiple meanings of a word: In natural language, a word can have multiple meanings, and the meaning
that applies is decided by the context of the statement.

3. Perfect Syntax, no Meaning: Sometimes, a statement can have a perfectly correct syntax but it does not
mean anything. For example, take a look at this statement:

Chickens feed extravagantly while the moon drinks tea.

This statement is correct grammatically but does not make any sense.

How does NLP make it possible for machines to understand and speak just like humans?
We all know that the language of computers is numerical, so the very first step that comes to our mind
is to convert our language to numbers. This conversion happens in various steps, which are given below.

1. Text Normalisation: In Text Normalisation, we go through several steps to normalise the text to a lower
level. Text Normalisation helps in cleaning up the textual data in such a way that its complexity is reduced
compared to the raw data. The steps of Text Normalisation are given below (a short code sketch of the full
pipeline follows the list):

a) Sentence Segmentation: Sentence Segmentation is the process of dividing the whole text into smaller
components, i.e., individual sentences. This is done to understand the thought or idea of each
individual sentence. For example:

Original text: "Hello world. AI is fun to know. It has started impacting our lives in many ways. Many more
revolutionary technologies will soon evolve out of it."

After sentence segmentation:
- Hello world.
- AI is fun to know.
- It has started impacting our lives in many ways.
- Many more revolutionary technologies will soon evolve out of it.

b) Tokenisation: After segmenting the sentences, each sentence is further divided into tokens. A token is
any word, number or special character occurring in a sentence. Under tokenisation, every word, number
and special character is considered separately and becomes a separate token. For example, the sentence
"AI is fun to know." is tokenised into: [AI], [is], [fun], [to], [know], [.]

c) Removing Stopwords, Special Characters and Numbers: The function of this step is to remove words that
are unimportant to the overall meaning and retain the important words in the text. For example, the
following two versions of a sentence convey the same meaning to the computer:

'Delhi is the capital and the most populous city.'

'Delhi capital most populous city'

This step, along with removing the stop words, also removes the redundant special characters and
numbers, which are not contributing to the overall meaning.
d) Converting text to a common case: After removing the stopwords, we convert the whole text into a
similar case, preferably lower case. This is done so that, even if the machine is case sensitive, the same
word written in different cases is not treated as two different words, which would otherwise affect the
overall result.

e) Stemming: In this step, the remaining words are reduced to their root words. In other words,
stemming is the process in which the affixes of words are removed and the words are converted to
their base form. Note that in stemming, the stemmed words might not be meaningful; for example,
stemming 'studies' gives 'studi'.

f) Lemmatization: Stemming and lemmatization are alternative processes to each other, as the role of
both is the same: removal of affixes. The difference between them is that in lemmatization, the word we
get after affix removal (also known as the lemma) is a meaningful one; for example, lemmatizing 'studies'
gives 'study'.
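
The whole pipeline above can be sketched in Python, for example with the NLTK library. This is only an illustration under assumptions: the unit does not prescribe a library, and NLTK's 'punkt', 'stopwords' and 'wordnet' resources must be downloaded once before running.

# A minimal sketch of the Text Normalisation pipeline, assuming the NLTK library.
# Run once beforehand: nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "Hello world. AI is fun to know. It has started impacting our lives in many ways."

# a) Sentence Segmentation
sentences = sent_tokenize(text)

# b) Tokenisation
tokens = [word_tokenize(sentence) for sentence in sentences]

# c) Removing stopwords, special characters and numbers
stop_words = set(stopwords.words('english'))
cleaned = [[t for t in sentence_tokens
            if t.isalpha() and t.lower() not in stop_words]
           for sentence_tokens in tokens]

# d) Converting text to a common (lower) case
lowered = [[t.lower() for t in sentence_tokens] for sentence_tokens in cleaned]

# e) Stemming (the stem may not be a meaningful word)
stemmer = PorterStemmer()
stemmed = [[stemmer.stem(t) for t in sentence_tokens] for sentence_tokens in lowered]

# f) Lemmatization (the lemma is always a meaningful word)
lemmatizer = WordNetLemmatizer()
lemmas = [[lemmatizer.lemmatize(t) for t in sentence_tokens] for sentence_tokens in lowered]

print(stemmed)
print(lemmas)
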
2. Bag of Words: Bag of Words is a Natural Language Processing model which helps in extracting features
from the text that can then be used by machine learning algorithms. In Bag of Words, we get the occurrences of each
word and construct the vocabulary for the text.

A bag of words gives us two things:


1. A vocabulary of words for the corpus
2. The frequency of these words (the number of times each word has occurred in the whole corpus).
It is called a "bag" of words, because it contains just the collection of words without any information about the order or
structure of words in the document.

Steps to Implement Bag-of-Words Model:

Step 1: Text Normalisation: In the first step, the data is to be collected and then pre-processed. For example, we have
the following text available:

“Aman and Anil are stressed.


Aman went to a therapist.
Anil went to download a health chatbot.”

After text normalisation, the text becomes:

Document 1: [aman, and, anil, are, stressed]


Document 2: [aman, went, to, a, therapist]
Document 3: [anil, went, to, download, a, health, chatbot]

Step 2: Create Dictionary: Now we can make a list of all the words in our model vocabulary, i.e., the unique words
left after pre-processing of all the documents. The unique words in our dictionary for this text are:

aman, and, anil, are, stressed, went, to, a, therapist, download, health, chatbot

Step 3: Create document vector: In this step, the vocabulary is written in the top row. Now, for each word in the
document, if it matches the vocabulary, put a 1 under it. If the same word appears again, increment the previous
value by 1. And if the word does not occur in that document, put a 0 under it.
Since the first document contains the words aman, and, anil, are and stressed, all these words get a value of 1 and
the rest of the words get a value of 0.

Step 4: Repeat for all documents: The same exercise has to be done for all the documents. Hence, the table becomes:

             aman  and  anil  are  stressed  went  to  a  therapist  download  health  chatbot
Document 1:    1    1     1    1      1        0    0  0      0         0         0       0
Document 2:    1    0     0    0      0        1    1  1      1         0         0       0
Document 3:    0    0     1    0      0        1    1  1      0         1         1       1

In this table, the header row contains the vocabulary of the corpus and the three rows correspond to the three
different documents.
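
This document vector table can be reproduced with a short Python sketch. The variable names and the plain-Python approach are illustrative assumptions, not something prescribed by the unit.

# A minimal sketch of the Bag-of-Words model for the example corpus.
documents = [
    ["aman", "and", "anil", "are", "stressed"],
    ["aman", "went", "to", "a", "therapist"],
    ["anil", "went", "to", "download", "a", "health", "chatbot"],
]

# Step 2: build the dictionary (vocabulary) of unique words, keeping first-seen order.
vocabulary = []
for document in documents:
    for word in document:
        if word not in vocabulary:
            vocabulary.append(word)

# Steps 3 and 4: build one document vector per document,
# counting how often each vocabulary word occurs in that document.
document_vectors = [[document.count(word) for word in vocabulary]
                    for document in documents]

print(vocabulary)
for vector in document_vectors:
    print(vector)
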
Finally, this gives us the document vector table for our corpus. But the tokens have still not been converted to
numbers. This leads us to the final step of our algorithm: TFIDF.

3. TFIDF: Term Frequency & Inverse Document Frequency: TFIDF stands for Term Frequency and
Inverse Document Frequency. TFIDF helps us in identifying the value of each word. Let us understand each term one by
one.

Term Frequency: Term frequency is the frequency of a word in one document. Term frequency can easily be found
from the document vector table as in that table we mention the frequency of each word of the vocabulary in each
document.

Inverse Document Frequency: To understand Inverse Document Frequency, let us first understand what
document frequency means. Document Frequency is the number of documents in which the word occurs, irrespective of
how many times it has occurred in those documents. The document frequency for the example vocabulary would be:

aman: 2, and: 1, anil: 2, are: 1, stressed: 1, went: 2, to: 2, a: 2, therapist: 1, download: 1, health: 1, chatbot: 1

In inverse document frequency, we put the document frequency in the denominator while the total number of
documents is in the numerator. Here, the total number of documents is 3, hence the inverse document frequency becomes:

aman: 3/2, and: 3/1, anil: 3/2, are: 3/1, stressed: 3/1, went: 3/2, to: 3/2, a: 3/2, therapist: 3/1, download: 3/1, health: 3/1, chatbot: 3/1

Finally, the formula of TFIDF for any word W becomes:

TFIDF(W) = TF(W) * log( IDF(W) )
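
As an illustration (not part of the original unit), the TFIDF values for the example corpus can be computed with a short, self-contained Python sketch. The base-10 logarithm is an assumption, since the unit does not specify a base.

# A minimal sketch of TFIDF for the example corpus (illustrative, not prescribed by the unit).
import math

documents = [
    ["aman", "and", "anil", "are", "stressed"],
    ["aman", "went", "to", "a", "therapist"],
    ["anil", "went", "to", "download", "a", "health", "chatbot"],
]
vocabulary = ["aman", "and", "anil", "are", "stressed", "went",
              "to", "a", "therapist", "download", "health", "chatbot"]
total_documents = len(documents)

for document in documents:
    scores = []
    for word in vocabulary:
        tf = document.count(word)                      # Term Frequency of the word in this document
        df = sum(1 for d in documents if word in d)    # Document Frequency across the corpus
        idf = total_documents / df                     # Inverse Document Frequency
        scores.append(round(tf * math.log10(idf), 3))  # TFIDF(W) = TF(W) * log(IDF(W))
    print(scores)
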
