Chapter 8: Text Analytics

The document discusses text analytics, focusing on sentiment classification using unstructured text data such as movie reviews. It explains the importance of data pre-processing, feature extraction methods like Bag-of-Words and TF-IDF, and the application of the Naive-Bayes model for sentiment classification. Additionally, it highlights challenges in text analytics, including context-specific language and the limitations of the Bag-of-Words model.


Text Analytics

Overview
• In today’s world, one of the biggest sources of information is text data,
which is unstructured in nature. Finding customer sentiment in product
reviews or feedback and extracting opinions from social media data are a
few examples of text analytics.
• Finding insights from text data is not as straightforward as with structured
data, and it requires extensive pre-processing.
• All the techniques that we have learnt so far require data to be available in
a structured format (matrix form). Algorithms that we have explored so
far, such as regression, classification, or clustering, can be applied to text
data only when the data is cleaned and prepared.
• For example, predicting stock price movements from news articles is an
example of regression in which the features are positive and negative
sentiments about a company.
• Classifying a customer’s sentiment from their review comments as
positive or negative is an example of classification using text data.
SENTIMENT CLASSIFICATION
• The data consists of sentiments expressed by users on various
movies. Here each comment is a record, which is either
classified as positive or negative.
• A few of the texts may appear truncated when printed because
the default column width is limited. This can be changed by
setting the max_colwidth option to increase the display width.
• Each record or example in the column text is called a
document.

The above output summarizes the first five positive sentiment documents; a
sentiment value of 1 denotes positive sentiment. To print the first five negative
sentiment documents, filter on sentiment value 0, as in the sketch below.
The corresponding output summarizes the first five negative sentiment
documents; a sentiment value of 0 denotes negative sentiment.
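A minimal sketch of these steps, assuming the reviews are loaded into a pandas DataFrame named train_ds with columns sentiment and text (the file name and column layout are assumptions, not taken from the original listing):

```python
import pandas as pd

# widen the displayed column so long review texts are not truncated
pd.set_option('display.max_colwidth', 200)

# assumed file name and column layout; adjust to the actual dataset
train_ds = pd.read_csv('sentiment_train.csv')

# first five positive (sentiment == 1) and negative (sentiment == 0) documents
print(train_ds[train_ds.sentiment == 1].head(5))
print(train_ds[train_ds.sentiment == 0].head(5))
```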
Exploring the Dataset:
Exploratory data analysis can be carried out by counting the
number of comments, positive comments, negative comments, etc.
For example, how many reviews are available in the
dataset? Are positive and negative sentiment reviews well
represented? We first print the metadata of the DataFrame
using the info() method.
From the output we can infer that
there are 6918 records available in
the dataset. We create a count plot
to compare the number of positive
and negative sentiments.
• From the figure, we can infer that there are a total of 6918 records
(feedback on movies) in the dataset. Out of these, 2975
records belong to negative sentiments, while 3943 records belong
to positive sentiments. Thus, positive and negative sentiment
documents have fairly equal representation in the dataset.
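A sketch of this exploration, again assuming the DataFrame is named train_ds:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# metadata: number of records, column names and types
train_ds.info()

# compare the number of positive and negative sentiment documents
sns.countplot(x='sentiment', data=train_ds)
plt.show()
```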
• Before building the model, text data needs pre-processing for
feature extraction.
Text Pre-processing:
• Unlike structured data, features (independent variables) are not
explicitly available in text data. Thus, we need to use a process to
extract features from the text data.
• One way is to consider each word as a feature and find a measure
to capture whether a word exists or does not exist in a sentence.
This is called the bag-of-words (BoW) model. That is, each sentence
(comment on a movie or a product) is treated as a bag of words.
• Each sentence (record) is called a document, and the collection of all
documents is called the corpus.
Bag-of-Words (BoW) Model
• The first step in creating a BoW model is to create a dictionary
of all the words used in the corpus. At this stage we do not
worry about grammar; only the occurrence of each word is
captured. We then convert each document into a vector that
represents the words present in that document. There are
three ways to measure the importance of words in a BoW
model:
1. Count Vector Model
2. Term Frequency Vector Model
3. Term Frequency-Inverse Document Frequency (TF-IDF)
Model.
Count Vector Model:
Consider the following two documents:
1. Document 1 (positive sentiment): I really really like IPL.
2. Document 2 (negative sentiment): I never like IPL.
Note: IPL stands for Indian Premier League.
• The complete vocabulary set (aka dictionary) for the above two documents
consists of the words I, really, never, like, and IPL. These five words can be
considered as features (x1 through x5).
• For creating count vectors, we count the occurrence of each word in each
document, as shown in the table below. The y-column indicates the
sentiment of the statement: 1 for positive and 0 for negative sentiment.

Document       I    really   never   like   IPL   y
Document 1     1    2        0       1      1     1
Document 2     1    0        1       1      1     0
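A small sketch that reproduces these count vectors with scikit-learn (variable names are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I really really like IPL.", "I never like IPL."]

# token_pattern with \b\w+\b keeps single-character tokens such as "I"
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
count_vectors = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # ['i' 'ipl' 'like' 'never' 'really']
print(count_vectors.toarray())              # [[1 1 1 0 2], [1 1 1 1 1]]
```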
Term Frequency Vector Model:
Term frequency (TF) vector is calculated for each
document in the corpus and is the frequency of each term in the
document. It is given by

TF_i = (number of occurrences of word i in the document) / (total number of words in the document)

where TF_i is the term frequency for word (token) i. The TF
representation for the two documents is shown in the following
table.

Document       I      really   never   like   IPL
Document 1     0.20   0.40     0.00    0.20   0.20
Document 2     0.25   0.00     0.25    0.25   0.25
Term Frequency-Inverse Document Frequency (TF-IDF):
TF-IDF measures how important a word is to a document
in the corpus. The importance of a word (or token) increases
proportionally to the number of times a word appears in the
document but is reduced by the frequency of the word present
in the corpus. TF-IDF for a word i in a document is given by

TF-IDF_i = TF_i × IDF_i,  where IDF_i = log(N / N_i)

where N is the total number of documents in the corpus and
N_i is the number of documents that contain word i.
The IDF value for each word in the above two documents is
given in the following table (using the natural logarithm):

Word    I              really            never             like           IPL
IDF     log(2/2) = 0   log(2/1) ≈ 0.69   log(2/1) ≈ 0.69   log(2/2) = 0   log(2/2) = 0
The TF-IDF values for the two documents (TF × IDF) are shown in
the following table.

Document       I      really   never   like   IPL
Document 1     0.00   0.28     0.00    0.00   0.00
Document 2     0.00   0.00     0.17    0.00   0.00
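A sketch of the same computation with scikit-learn's TfidfVectorizer; note that scikit-learn's default formulation (smooth_idf=True, idf = ln((1 + N)/(1 + N_i)) + 1, followed by L2 normalization) differs slightly from the plain textbook formula, so the numbers will not match the table exactly:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I really really like IPL.", "I never like IPL."]

tfidf_vectorizer = TfidfVectorizer(token_pattern=r'\b\w+\b')  # use_idf=True by default
tfidf_vectors = tfidf_vectorizer.fit_transform(docs)

print(tfidf_vectorizer.get_feature_names_out())
print(tfidf_vectors.toarray().round(2))
```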

Creating Count Vectors for sentiment_train Dataset


Each document in the dataset needs to be transformed
into TF or TF-IDF vectors. The sklearn.feature_extraction.text module
provides classes for creating both TF and TF-IDF vectors from
text data. We will use CountVectorizer to create count vectors. In
CountVectorizer, each document is represented by the
number of times each word appears in it.
We use the following code to process the documents and create a
dictionary of all words present across them. The
dictionary will contain all unique words across the corpus, and
each word in the dictionary will be treated as a feature.

The total number of features (unique words) in the corpus
is 2132. A random sample of the features can be obtained by
using the random.sample() method.
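A sketch of these steps, assuming train_ds.text holds the documents (names are illustrative):

```python
import random
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
count_vectorizer.fit(train_ds.text)

# the dictionary (feature set) learnt from the corpus
feature_names = count_vectorizer.get_feature_names_out()
print(len(feature_names))                      # number of unique words
print(random.sample(list(feature_names), 10))  # a random sample of features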
Using the above dictionary, we can convert all the
documents in the dataset to count vectors using the transform()
method of the count vectorizer:
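For example, continuing the sketch above:

```python
# transform all documents into a sparse matrix of word counts
feature_vectors = count_vectorizer.transform(train_ds.text)
print(feature_vectors.shape)   # (number of documents, number of features)
```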
After converting the document into a vector, we will have
a sparse matrix with 2132 features or dimensions. Each
document is represented by a count vector of 2132 dimensions
and if a specific word exists in a document, the corresponding
dimension of the vector will be set to the count of that word in
the document.
But most of the documents contain only a few words,
hence most of the dimensions in their vectors will be set
to 0. That is a lot of 0’s in the matrix! So, the matrix is stored as a
sparse matrix.
The sparse matrix representation stores only the non-zero
values and their indices in the vector. This optimizes storage as
well as computation. To know how many non-zero
values are present in the matrix, we can use the getnnz() method on
the sparse matrix.
The proportion of non-zero values in the matrix can be obtained
by dividing the number of non-zero values (i.e., 65398) by the total
number of entries in the matrix (i.e., 6918 × 2132), that is,
65398/(6918 × 2132) ≈ 0.0044.

The matrix has less than 1% non-zero values, that is,
more than 99% of the values are zeros. This is a very sparse
representation.
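A sketch of this check (getnnz() is a method of the SciPy sparse matrix returned by the vectorizer):

```python
# number of stored non-zero values in the sparse matrix
non_zero = feature_vectors.getnnz()
rows, cols = feature_vectors.shape

print(non_zero)
print(non_zero / (rows * cols))   # proportion of non-zero entries
```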
Displaying Document Vectors:
To visualize the count vectors, we will convert this matrix
into a DataFrame and set the column names to the actual
feature names. The following commands are used for displaying
the count vector:
We cannot print the complete vector as it has 2132
dimensions, so let us print only the dimensions (words) from index 150
to 157. This index range contains the word awesome, which
should be encoded as 1 for a document that contains it.
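A sketch of these commands (the index range 150:157 is specific to this dataset's vocabulary and is kept here only to mirror the text):

```python
import pandas as pd

# convert the sparse count matrix into a DataFrame with feature names as columns
feature_names = count_vectorizer.get_feature_names_out()
train_ds_df = pd.DataFrame(feature_vectors.toarray(), columns=feature_names)

# first document, dimensions (words) at index 150 to 157
print(train_ds_df.iloc[0:1, 150:158])
```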

The feature awesome is set to 1, while the other features in this
range are set to 0. Now select all the columns corresponding to
the words in the sentence and print them, as below.
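For example, a minimal sketch (assuming the first document is "The Da Vinci Code book is just awesome" and that all of its words appear in the fitted vocabulary, which lowercases words by default):

```python
sentence_words = ['the', 'da', 'vinci', 'code', 'book', 'is', 'just', 'awesome']
print(train_ds_df.loc[0:0, sentence_words])
```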

Yes, the features in the count vector are appropriately set to 1.
The vector represents the sentence “The Da Vinci Code book is just awesome”.
Removing Low-frequency Words:
One of the challenges of dealing with text is that the number
of words or features in the corpus can be very large, easily
going over tens of thousands. Some words are common and
present across most of the documents, while others are
rare and present in only a few documents.
The frequency of each feature or word can be analyzed using a
histogram. To calculate the total occurrences of each feature or
word, we use the np.sum() method.
To find rare words in the dictionary, for example words
that are present in only one document, we can filter the
features whose total count equals 1:

There are 1228 words which are present only once across all
documents in the corpus. These words can be ignored.
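A sketch of this filtering, built on the count matrix and feature names from the earlier sketches:

```python
import numpy as np
import pandas as pd

# total occurrences of each word across all documents
feature_counts = np.sum(feature_vectors.toarray(), axis=0)
feature_counts_df = pd.DataFrame({'features': feature_names,
                                  'counts': feature_counts})

# words that occur only once in the whole corpus
rare_words = feature_counts_df[feature_counts_df.counts == 1]
print(len(rare_words))
```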
We can restrict the number of features by setting the
max_features parameter to 1000 while creating the count
vectors.
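For example, a sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer

# keep only the 1000 most frequent words as features
count_vectorizer = CountVectorizer(max_features=1000)
feature_vectors = count_vectorizer.fit_transform(train_ds.text)
print(count_vectorizer.get_feature_names_out()[:20])
```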
It can be noticed that the selected list of features
contains words like the, is, was, and, etc. These words are
irrelevant in determining the sentiment of the document. These
words are called stop words and can be removed from the
dictionary. This will reduce the number of features further.
Removing Stop Words:
sklearn.feature_extraction.text provides a list of pre-
defined stop words in English, which can be used as a reference
to remove the stop words from the dictionary, that is, feature
set.
Additional stop words can also be added to this list. In this
case, for example, the movie names and the word “movie”
itself can be treated as stop words and added to the existing
list for removal. For example,
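A sketch of extending the built-in English stop word list (the extra words shown here are illustrative; the actual list depends on the movie names in the dataset):

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# add domain-specific stop words (illustrative examples) to the built-in list
my_stop_words = list(ENGLISH_STOP_WORDS) + ['movie', 'da', 'vinci', 'code',
                                            'harry', 'potter']
```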

Creating Count Vectors:


All vectorizer classes take a list of stop words as a
parameter and remove the stop words while building the
dictionary or feature set. These words will then not appear in the
count vectors representing the documents. We will create new
count vectors by passing my_stop_words as the stop words list.
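For example, a sketch:

```python
# pass the extended stop word list so these words never become features
count_vectorizer = CountVectorizer(stop_words=my_stop_words, max_features=1000)
feature_vectors = count_vectorizer.fit_transform(train_ds.text)
```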
It can be noted that the stop words have been removed.
But we also notice another problem: many words appear
in multiple forms, for example love and loved. The
vectorizer treats the two as separate words
and hence creates two separate features. But if a word
has a similar meaning in all its forms, we can use only the
root word as a feature.
Stemming and Lemmatization are two popular
techniques that are used to convert the words into root words.
Stemming: This removes the differences between inflected
forms of a word to reduce each word to its root form. This is
done by mostly chopping off the end of words (suffix). For
instance, love or loved will be reduced to the root word love.
The root form of a word may not even be a real word. For
example, awesome and awesomeness will be stemmed to
awesom.
One problem with stemming is that chopping off word endings
may result in tokens that are not part of the vocabulary (e.g.,
awesom).
PorterStemmer and LancasterStemmer are two popular
algorithms for stemming, which have rules on how to chop off a
word.
Lemmatization: This takes the morphological analysis of the
words into consideration. It uses a language dictionary (i.e.,
English dictionary) to convert the words to the root word.
For example, stemming would fail to reduce men to man,
while lemmatization can bring both words to the base
form man.
• Natural Language Toolkit (NLTK) is a very popular library in
Python that has an extensive set of features for natural
language processing.
• NLTK supports PorterStemmer, EnglishStemmer, and
LancasterStemmer for stemming, while WordNetLemmatizer
for lemmatization.
• These features can be used in CountVectorizer, while creating
count vectors. We need to create a utility method, which
takes documents, tokenizes it to create words, stems the
words and remove the stop words before returning the final
set of words for creating vectors.
CountVectorizer takes a custom analyzer for stemming and stop word removal,
before creating count vectors. So, the custom function stemmed_words() is passed
as an analyzer.
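A sketch of such a custom analyzer, combining NLTK's PorterStemmer with CountVectorizer's built-in tokenization and stop word removal (the function name follows the text's stemmed_words(), but the exact implementation here is an assumption):

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()

# reuse CountVectorizer's own analyzer for tokenization and stop word removal
base_analyzer = CountVectorizer(stop_words=my_stop_words).build_analyzer()

def stemmed_words(doc):
    # tokenize the document, drop stop words, then stem each remaining word
    return [stemmer.stem(word) for word in base_analyzer(doc)]

count_vectorizer = CountVectorizer(analyzer=stemmed_words, max_features=1000)
feature_vectors = count_vectorizer.fit_transform(train_ds.text)
print(count_vectorizer.get_feature_names_out()[:20])
```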
It can be noted that the words love,
loved, and awesome have all been
stemmed to their root words (love and awesom).
Distribution of Words Across Different Sentiment:
We can examine how words with positive or negative meaning
are distributed across documents of different sentiments. This gives
an initial idea of whether these words can be good features for
predicting the sentiment of a document. For example, let us
consider the word awesome.
As shown in the figure, the word awesom
(the stemmed form of awesome) appears
mostly in positive sentiment
documents.
How about a neutral word like realli?
As shown in the figure, the word realli
(the stemmed form of really) occurs almost
equally across positive and negative
sentiment documents.
How about the word hate?
As shown in the figure, the word hate occurs
more often in negative sentiment documents
than in positive ones.
This absolutely makes sense.

This gives us an initial idea that the words awesom and
hate could be good features for determining the sentiment of a
document.
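A sketch of this kind of check, counting in how many positive and negative documents a stemmed token occurs (built on the count matrix from the earlier sketches):

```python
import pandas as pd

dtm = pd.DataFrame(feature_vectors.toarray(),
                   columns=count_vectorizer.get_feature_names_out())
dtm['sentiment'] = train_ds.sentiment.values

for token in ['awesom', 'realli', 'hate']:
    if token in dtm.columns:
        # number of positive (1) and negative (0) documents containing the token
        counts = dtm[dtm[token] > 0].groupby('sentiment')[token].count()
        print(token, counts.to_dict())
```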
NAIVE–BAYES MODEL FOR SENTIMENT CLASSIFICATION
• We will build a Naive–Bayes model to classify sentiments.
The Naive–Bayes classifier is widely used in natural language
processing and has been shown to give good results. It works on the
concept of Bayes’ theorem.
• Assume that we would like to compute the probability
that a document is positive (or negative) given that the
document contains the word awesome. By Bayes’ theorem, this
is proportional to the probability of the word awesome appearing
in a document given that the document is positive (or negative),
multiplied by the probability of a document being positive
(or negative).
• The posterior probability of the sentiment is computed from
the prior probability of the sentiment and the conditional
probabilities of all the words the document contains. The
assumption is that the occurrences of words in a document are
independent and do not influence each other. So, if the document
contains n words represented as W1, W2, ..., Wn, then

P(sentiment | W1, W2, ..., Wn) ∝ P(sentiment) × P(W1 | sentiment) × P(W2 | sentiment) × ... × P(Wn | sentiment)

• sklearn.naive_bayes provides a class BernoulliNB, which is a
Naive–Bayes classifier for multivariate Bernoulli models.
BernoulliNB is designed for binary/Boolean features (a feature
is either present or absent), which is the case here.
• The steps involved in using Naive–Bayes Model for sentiment classification
are as follows:
1. Split dataset into train and validation sets.
2. Build the Naive–Bayes model
3. Find model accuracy.
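A minimal sketch of these three steps, under the same assumed names as the earlier sketches (the 70:30 split ratio and fixed random_state are assumptions, not taken from the original listing):

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn import metrics

# 1. Split the count vectors and labels into train and validation sets
train_X, test_X, train_y, test_y = train_test_split(
    feature_vectors, train_ds.sentiment, test_size=0.3, random_state=42)

# 2. Build the Naive-Bayes model
nb_clf = BernoulliNB()
nb_clf.fit(train_X, train_y)

# 3. Find model accuracy and the confusion matrix on the validation set
test_pred = nb_clf.predict(test_X)
print(metrics.accuracy_score(test_y, test_pred))
print(metrics.confusion_matrix(test_y, test_pred))
```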
In the confusion matrix, the rows represent the actual number of positive
and negative documents in the test set, whereas the columns represent what the
model has predicted. Label 1 means positive sentiment and label 0 means negative
sentiment. The figure shows that, as per the model prediction, only 13
positive sentiment documents are classified wrongly as negative sentiment documents
(false negatives) and only 28 negative sentiment documents are classified
wrongly as positive sentiment documents (false positives). All the rest have been
classified correctly.
USING TF-IDF VECTORIZER:
• TfidfVectorizer is used to create both TF and TF-IDF
vectors. It takes a parameter use_idf (default True) to
create TF-IDF vectors. If use_idf is set to False, it creates only
TF vectors; if it is set to True, it creates TF-IDF vectors.
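A sketch, reusing the stemmed_words analyzer from the earlier sketch:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# use_idf=True (the default) gives TF-IDF vectors; use_idf=False gives plain TF vectors
tfidf_vectorizer = TfidfVectorizer(analyzer=stemmed_words, max_features=1000)
feature_vectors = tfidf_vectorizer.fit_transform(train_ds.text)
```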
CHALLENGES OF TEXT ANALYTICS
The text can be highly context-specific. The language
people use to describe movies may not be the same as for other
products, say apparel, so the training data needs to come from
a similar context (distribution) for building the model.
The language may also be highly informal. The language people
use on social media may be a mix of languages and emoticons,
and the training data needs to contain similar examples for
learning.
The bag-of-words model completely ignores the structure of
the sentence, that is, the sequence of words in it. This can be
overcome to a certain extent by using n-grams.
Using n-Grams:
• The models we built so far created features out of individual tokens or
words. But the meaning of some words may depend on the
words that precede or succeed them, for example
not happy. Such a phrase should be considered as one feature and not as
two different features.
• An n-gram is a contiguous sequence of n words. When two
consecutive words are treated as one feature, it is called a
bigram; three consecutive words form a trigram, and so on.
• We will write a new custom tokenizer get_stemmed_tokens(),
which splits the sentences and stems the words from them
before creating n-grams. The following code block removes
non-alphabetic characters and then applies stemming.
Now TfidfVectorizer takes the above method as a custom
tokenizer. It also takes an ngram_range parameter, a tuple,
for creating n-grams. A value of (1, 2) means features are created
from single words and from pairs of consecutive words.
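A sketch of this tokenizer and the n-gram TF-IDF vectorizer (the implementation details are assumptions consistent with the description above):

```python
import re
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()

def get_stemmed_tokens(doc):
    # remove non-alphabetic characters, split into words, then stem each word
    words = re.sub(r'[^a-zA-Z]', ' ', doc).split()
    return [stemmer.stem(word) for word in words]

# ngram_range=(1, 2): features from single words and pairs of consecutive words
tfidf_vectorizer = TfidfVectorizer(tokenizer=get_stemmed_tokens,
                                   ngram_range=(1, 2),
                                   max_features=1000)
feature_vectors = tfidf_vectorizer.fit_transform(train_ds.text)
```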
Build the Model Using n-Grams
Split the dataset in a 70:30 ratio to create training and
test datasets, and then apply BernoulliNB for classification. We
then use the model to predict the test set and print the
classification report.
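A sketch of these steps (the split ratio comes from the text; random_state is an assumption):

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import classification_report

train_X, test_X, train_y, test_y = train_test_split(
    feature_vectors, train_ds.sentiment, test_size=0.3, random_state=42)

# BernoulliNB binarizes the TF-IDF values (non-zero -> 1) before fitting
nb_clf = BernoulliNB().fit(train_X, train_y)
test_pred = nb_clf.predict(test_X)
print(classification_report(test_y, test_pred))
```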

The recall for identifying positive sentiment documents (with label 1)
has increased to almost 1.0.
