Chapter 8: Text Analytics

The document discusses text analytics, focusing on sentiment classification using unstructured text data such as movie reviews. It explains the importance of data pre-processing, feature extraction methods like Bag-of-Words and TF-IDF, and the application of the Naive-Bayes model for sentiment classification. Additionally, it highlights challenges in text analytics, including context-specific language and the limitations of the Bag-of-Words model.


Text Analytics

Overview
• In today’s world, one of the biggest sources of information is text data,
which is unstructured in nature. Finding customer sentiment in product
reviews or feedback and extracting opinions from social media data are a
few examples of text analytics.
• Finding insights from text data is not as straightforward as with structured
data, and it requires extensive pre-processing.
• All the techniques that we have learnt so far require data to be available in
a structured format (matrix form). Algorithms that we have explored so
far, such as regression, classification, or clustering, can be applied to text
data only when the data is cleaned and prepared.
• For example, predicting stock price movements from news articles is an
example of regression in which the features are positive and negative
sentiments about a company.
• Classifying a customer’s sentiment from their review comments as
positive or negative is an example of classification using text data.
SENTIMENT CLASSIFICATION
• The data consists of sentiments expressed by users on various
movies. Here each comment is a record, which is either
classified as positive or negative.
• A few of the texts may appear truncated when printed because
the default column width is limited. This can be changed by
setting the max_colwidth option to increase the display width.
• Each record or example in the column text is called a
document.

The above output summarizes the first five positive sentiment documents; a
sentiment value of 1 denotes positive sentiment. To print the first five negative
sentiment documents, filter on sentiment value 0, as in the sketch below.
The corresponding output summarizes the first five negative sentiment
documents; a sentiment value of 0 denotes negative sentiment.
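A minimal sketch of these steps, assuming the reviews are loaded into a pandas DataFrame named train_ds with columns sentiment and text (the file name and column layout are assumptions, not taken from the original listing):

```python
import pandas as pd

# widen the displayed column so long review texts are not truncated
pd.set_option('display.max_colwidth', 200)

# assumed file name and column layout; adjust to the actual dataset
train_ds = pd.read_csv('sentiment_train.csv')

# first five positive (sentiment == 1) and negative (sentiment == 0) documents
print(train_ds[train_ds.sentiment == 1].head(5))
print(train_ds[train_ds.sentiment == 0].head(5))
```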
Exploring the Dataset:
Exploratory data analysis can be carried out by counting the
number of comments, positive comments, negative comments, etc.
For example, how many reviews are available in the
dataset? Are positive and negative sentiment reviews well
represented? We first print the metadata of the DataFrame
using the info() method.
From the output we can infer that
there are 6918 records available in
the dataset. We create a count plot
to compare the number of positive
and negative sentiments.
• From the figure, we can infer that there are a total of 6918 records
(feedback on movies) in the dataset. Out of these, 2975
records belong to negative sentiments, while 3943 records belong
to positive sentiments. Thus, positive and negative sentiment
documents have fairly equal representation in the dataset.
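A sketch of this exploration, again assuming the DataFrame is named train_ds:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# metadata: number of records, column names and types
train_ds.info()

# compare the number of positive and negative sentiment documents
sns.countplot(x='sentiment', data=train_ds)
plt.show()
```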
• Before building the model, text data needs pre-processing for
feature extraction.
Text Pre-processing:
• Unlike structured data, features (independent variables) are not
explicitly available in text data. Thus, we need to use a process to
extract features from the text data.
• One way is to consider each word as a feature and find a measure
to capture whether a word exists or does not exist in a sentence.
This is called the bag-of-words (BoW) model. That is, each sentence
(comment on a movie or a product) is treated as a bag of words.
• Each sentence (record) is called a document, and the collection of all
documents is called the corpus.
Bag-of-Words (BoW) Model
• The first step in creating a BoW model is to create a dictionary
of all the words used in the corpus. At this stage we do not
worry about grammar; only the occurrence of each word is
captured. We then convert each document into a vector that
represents the words present in that document. There are
three ways to measure the importance of words in a BoW
model:
1. Count Vector Model
2. Term Frequency Vector Model
3. Term Frequency-Inverse Document Frequency (TF-IDF)
Model.
Count Vector Model:
Consider the following two documents:
1. Document 1 (positive sentiment): I really really like IPL.
2. Document 2 (negative sentiment): I never like IPL.
Note: IPL stands for Indian Premier League.
• The complete vocabulary set (aka dictionary) for the above two documents
consists of the words I, really, never, like, and IPL. These five words can be
considered as features (x1 through x5).
• For creating count vectors, we count the occurrence of each word in each
document, as shown in the table below. The y-column indicates the
sentiment of the statement: 1 for positive and 0 for negative sentiment.

Document       I    really   never   like   IPL   y
Document 1     1    2        0       1      1     1
Document 2     1    0        1       1      1     0
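A small sketch that reproduces these count vectors with scikit-learn (variable names are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I really really like IPL.", "I never like IPL."]

# token_pattern with \b\w+\b keeps single-character tokens such as "I"
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
count_vectors = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # ['i' 'ipl' 'like' 'never' 'really']
print(count_vectors.toarray())              # [[1 1 1 0 2], [1 1 1 1 1]]
```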
Term Frequency Vector Model:
Term frequency (TF) vector is calculated for each
document in the corpus and is the frequency of each term in the
document. It is given by

TF_i = (number of occurrences of word i in the document) / (total number of words in the document)

where TF_i is the term frequency for word (token) i. The TF
representation for the two documents is shown in the following
table.

Document       I      really   never   like   IPL
Document 1     0.20   0.40     0.00    0.20   0.20
Document 2     0.25   0.00     0.25    0.25   0.25
Term Frequency-Inverse Document Frequency (TF-IDF):
TF-IDF measures how important a word is to a document
in the corpus. The importance of a word (or token) increases
proportionally to the number of times a word appears in the
document but is reduced by the frequency of the word present
in the corpus. TF-IDF for a word i in a document is given by

TF-IDF_i = TF_i × IDF_i,  where IDF_i = log(N / N_i)

where N is the total number of documents in the corpus and
N_i is the number of documents that contain word i.
The IDF value for each word in the above two documents is
given in the following table (using the natural logarithm):

Word    I              really            never             like           IPL
IDF     log(2/2) = 0   log(2/1) ≈ 0.69   log(2/1) ≈ 0.69   log(2/2) = 0   log(2/2) = 0
The TF-IDF values for the two documents (TF × IDF) are shown in
the following table.

Document       I      really   never   like   IPL
Document 1     0.00   0.28     0.00    0.00   0.00
Document 2     0.00   0.00     0.17    0.00   0.00
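A sketch of the same computation with scikit-learn's TfidfVectorizer; note that scikit-learn's default formulation (smooth_idf=True, idf = ln((1 + N)/(1 + N_i)) + 1, followed by L2 normalization) differs slightly from the plain textbook formula, so the numbers will not match the table exactly:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I really really like IPL.", "I never like IPL."]

tfidf_vectorizer = TfidfVectorizer(token_pattern=r'\b\w+\b')  # use_idf=True by default
tfidf_vectors = tfidf_vectorizer.fit_transform(docs)

print(tfidf_vectorizer.get_feature_names_out())
print(tfidf_vectors.toarray().round(2))
```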

Creating Count Vectors for sentiment_train Dataset


Each document in the dataset needs to be transformed
into TF or TF-IDF vectors. The sklearn.feature_extraction.text module
provides classes for creating both TF and TF-IDF vectors from
text data. We will use CountVectorizer to create count vectors. In
CountVectorizer, each document is represented by the
number of times each word appears in it.
We use the following code to process the documents and create a
dictionary of all words present across them. The
dictionary will contain all unique words across the corpus, and
each word in the dictionary will be treated as a feature.

The total number of features (unique words) in the corpus
is 2132. A random sample of the features can be obtained by
using the random.sample() method.
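A sketch of these steps, assuming train_ds.text holds the documents (names are illustrative):

```python
import random
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
count_vectorizer.fit(train_ds.text)

# the dictionary (feature set) learnt from the corpus
feature_names = count_vectorizer.get_feature_names_out()
print(len(feature_names))                      # number of unique words
print(random.sample(list(feature_names), 10))  # a random sample of features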
Using the above dictionary, we can convert all the
documents in the dataset to count vectors using the transform()
method of the count vectorizer:
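For example, continuing the sketch above:

```python
# transform all documents into a sparse matrix of word counts
feature_vectors = count_vectorizer.transform(train_ds.text)
print(feature_vectors.shape)   # (number of documents, number of features)
```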
After converting the document into a vector, we will have
a sparse matrix with 2132 features or dimensions. Each
document is represented by a count vector of 2132 dimensions
and if a specific word exists in a document, the corresponding
dimension of the vector will be set to the count of that word in
the document.
But most of the documents contain only a few words,
hence most of the dimensions in their vectors will be set
to 0. That is a lot of 0’s in the matrix! So, the matrix is stored as a
sparse matrix.
The sparse matrix representation stores only the non-zero
values and their indices in the vector. This optimizes storage as
well as computation. To know how many non-zero
values are present in the matrix, we can use the getnnz() method on
the sparse matrix.
The proportion of non-zero values in the matrix can be obtained
by dividing the number of non-zero values (i.e., 65398) by the total
number of entries in the matrix (i.e., 6918 × 2132), that is,
65398/(6918 × 2132) ≈ 0.0044.

The matrix has less than 1% non-zero values, that is,
more than 99% of the values are zeros. This is a very sparse
representation.
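A sketch of this check (getnnz() is a method of the SciPy sparse matrix returned by the vectorizer):

```python
# number of stored non-zero values in the sparse matrix
non_zero = feature_vectors.getnnz()
rows, cols = feature_vectors.shape

print(non_zero)
print(non_zero / (rows * cols))   # proportion of non-zero entries
```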
Displaying Document Vectors:
To visualize the count vectors, we will convert this matrix
into a DataFrame and set the column names to the actual
feature names. The following commands are used for displaying
the count vector:
We cannot print the complete vector as it has 2132
dimensions, so let us print only the dimensions (words) from index 150
to 157. This index range contains the word awesome, which
should be encoded as 1 for a document that contains it.
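A sketch of these commands (the index range 150:157 is specific to this dataset's vocabulary and is kept here only to mirror the text):

```python
import pandas as pd

# convert the sparse count matrix into a DataFrame with feature names as columns
feature_names = count_vectorizer.get_feature_names_out()
train_ds_df = pd.DataFrame(feature_vectors.toarray(), columns=feature_names)

# first document, dimensions (words) at index 150 to 157
print(train_ds_df.iloc[0:1, 150:158])
```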

The feature awesome is set to 1, while the other features in this
range are set to 0. Now select all the columns corresponding to
the words in the sentence and print them, as below.
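For example, a minimal sketch (assuming the first document is "The Da Vinci Code book is just awesome" and that all of its words appear in the fitted vocabulary, which lowercases words by default):

```python
sentence_words = ['the', 'da', 'vinci', 'code', 'book', 'is', 'just', 'awesome']
print(train_ds_df.loc[0:0, sentence_words])
```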

Yes, the features in the count vector are appropriately set to 1.
The vector represents the sentence “The Da Vinci Code book is just awesome”.
Removing Low-frequency Words:
One of the challenges of dealing with text is that the number
of words or features in the corpus can be very large, easily
going over tens of thousands. Some words are common and
present across most of the documents, while others are
rare and present in only a few documents.
The frequency of each feature or word can be analyzed using a
histogram. To calculate the total occurrences of each feature or
word, we use the np.sum() method.
To find rare words in the dictionary, for example words
that are present in only one document, we can filter the
features whose total count equals 1:

There are 1228 words which are present only once across all
documents in the corpus. These words can be ignored.
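A sketch of this filtering, built on the count matrix and feature names from the earlier sketches:

```python
import numpy as np
import pandas as pd

# total occurrences of each word across all documents
feature_counts = np.sum(feature_vectors.toarray(), axis=0)
feature_counts_df = pd.DataFrame({'features': feature_names,
                                  'counts': feature_counts})

# words that occur only once in the whole corpus
rare_words = feature_counts_df[feature_counts_df.counts == 1]
print(len(rare_words))
```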
We can restrict the number of features by setting the
max_features parameter to 1000 while creating the count
vectors.
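For example, a sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer

# keep only the 1000 most frequent words as features
count_vectorizer = CountVectorizer(max_features=1000)
feature_vectors = count_vectorizer.fit_transform(train_ds.text)
print(count_vectorizer.get_feature_names_out()[:20])
```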
It can be noticed that the selected list of features
contains words like the, is, was, and, etc. These words are
irrelevant in determining the sentiment of the document. These
words are called stop words and can be removed from the
dictionary. This will reduce the number of features further.
Removing Stop Words:
sklearn.feature_extraction.text provides a list of pre-
defined stop words in English, which can be used as a reference
to remove the stop words from the dictionary, that is, feature
set.
Additional stop words can also be added to this list. In this
case, for example, the movie names and the word “movie”
itself can be treated as stop words and added to the existing
list for removal. For example,
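A sketch of extending the built-in English stop word list (the extra words shown here are illustrative; the actual list depends on the movie names in the dataset):

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# add domain-specific stop words (illustrative examples) to the built-in list
my_stop_words = list(ENGLISH_STOP_WORDS) + ['movie', 'da', 'vinci', 'code',
                                            'harry', 'potter']
```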

Creating Count Vectors:


All vectorizer classes take a list of stop words as a
parameter and remove the stop words while building the
dictionary or feature set. These words will then not appear in the
count vectors representing the documents. We will create new
count vectors by passing my_stop_words as the stop words list.
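For example, a sketch:

```python
# pass the extended stop word list so these words never become features
count_vectorizer = CountVectorizer(stop_words=my_stop_words, max_features=1000)
feature_vectors = count_vectorizer.fit_transform(train_ds.text)
```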
It can be noted that the stop words have been removed.
But we also notice another problem: many words appear
in multiple forms, for example love and loved. The
vectorizer treats the two as separate words
and hence creates two separate features. But if a word
has a similar meaning in all its forms, we can use only the
root word as a feature.
Stemming and Lemmatization are two popular
techniques that are used to convert the words into root words.
Stemming: This removes the differences between inflected
forms of a word to reduce each word to its root form. This is
done by mostly chopping off the end of words (suffix). For
instance, love or loved will be reduced to the root word love.
The root form of a word may not even be a real word. For
example, awesome and awesomeness will be stemmed to
awesom.
One problem with stemming is that chopping off word endings
may result in tokens that are not part of the vocabulary (e.g.,
awesom).
PorterStemmer and LancasterStemmer are two popular
algorithms for stemming, which have rules on how to chop off a
word.
Lemmatization: This takes the morphological analysis of the
words into consideration. It uses a language dictionary (i.e.,
English dictionary) to convert the words to the root word.
For example, stemming would fail to reduce men to man,
while lemmatization can bring both words to the base
form man.
• Natural Language Toolkit (NLTK) is a very popular library in
Python that has an extensive set of features for natural
language processing.
• NLTK supports PorterStemmer, EnglishStemmer, and
LancasterStemmer for stemming, while WordNetLemmatizer
for lemmatization.
• These features can be used in CountVectorizer, while creating
count vectors. We need to create a utility method, which
takes documents, tokenizes it to create words, stems the
words and remove the stop words before returning the final
set of words for creating vectors.
CountVectorizer takes a custom analyzer for stemming and stop word removal,
before creating count vectors. So, the custom function stemmed_words() is passed
as an analyzer.
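A sketch of such a custom analyzer, combining NLTK's PorterStemmer with CountVectorizer's built-in tokenization and stop word removal (the function name follows the text's stemmed_words(), but the exact implementation here is an assumption):

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()

# reuse CountVectorizer's own analyzer for tokenization and stop word removal
base_analyzer = CountVectorizer(stop_words=my_stop_words).build_analyzer()

def stemmed_words(doc):
    # tokenize the document, drop stop words, then stem each remaining word
    return [stemmer.stem(word) for word in base_analyzer(doc)]

count_vectorizer = CountVectorizer(analyzer=stemmed_words, max_features=1000)
feature_vectors = count_vectorizer.fit_transform(train_ds.text)
print(count_vectorizer.get_feature_names_out()[:20])
```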
It can be noted that the words love,
loved, and awesome have all been
stemmed to their root words (love and awesom).
Distribution of Words Across Different Sentiment:
We can examine how words with positive or negative meaning
are distributed across documents of different sentiments. This gives
an initial idea of whether these words can be good features for
predicting the sentiment of a document. For example, let us
consider the word awesome.
As shown in the figure, the word awesom
(the stemmed form of awesome) appears
mostly in positive sentiment
documents.
How about a neutral word like realli?
As shown in the figure, the word realli
(the stemmed form of really) occurs almost
equally across positive and negative
sentiment documents.
How about the word hate?
As shown in the figure, the word hate occurs
more often in negative sentiment documents
than in positive ones.
This absolutely makes sense.

This gives us an initial idea that the words awesom and
hate could be good features for determining the sentiment of a
document.
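A sketch of this kind of check, counting in how many positive and negative documents a stemmed token occurs (built on the count matrix from the earlier sketches):

```python
import pandas as pd

dtm = pd.DataFrame(feature_vectors.toarray(),
                   columns=count_vectorizer.get_feature_names_out())
dtm['sentiment'] = train_ds.sentiment.values

for token in ['awesom', 'realli', 'hate']:
    if token in dtm.columns:
        # number of positive (1) and negative (0) documents containing the token
        counts = dtm[dtm[token] > 0].groupby('sentiment')[token].count()
        print(token, counts.to_dict())
```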
NAIVE–BAYES MODEL FOR SENTIMENT CLASSIFICATION
• We will build a Naive–Bayes model to classify sentiments.
The Naive–Bayes classifier is widely used in natural language
processing and has been shown to give good results. It works on the
concept of Bayes’ theorem.
• Assume that we would like to compute the probability
that a document is positive (or negative) given that the
document contains the word awesome. By Bayes’ theorem, this
is proportional to the probability of the word awesome appearing
in a document given that the document is positive (or negative),
multiplied by the probability of a document being positive
(or negative).
• The posterior probability of the sentiment is computed from
the prior probability of the sentiment and the conditional
probabilities of all the words the document contains. The
assumption is that the occurrences of words in a document are
independent and do not influence each other. So, if the document
contains n words represented as W1, W2, ..., Wn, then

P(sentiment | W1, W2, ..., Wn) ∝ P(sentiment) × P(W1 | sentiment) × P(W2 | sentiment) × ... × P(Wn | sentiment)

• sklearn.naive_bayes provides a class BernoulliNB, which is a
Naive–Bayes classifier for multivariate Bernoulli models.
BernoulliNB is designed for binary/Boolean features (a feature
is either present or absent), which is the case here.
• The steps involved in using Naive–Bayes Model for sentiment classification
are as follows:
1. Split dataset into train and validation sets.
2. Build the Naive–Bayes model
3. Find model accuracy.
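A minimal sketch of these three steps, under the same assumed names as the earlier sketches (the 70:30 split ratio and fixed random_state are assumptions, not taken from the original listing):

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn import metrics

# 1. Split the count vectors and labels into train and validation sets
train_X, test_X, train_y, test_y = train_test_split(
    feature_vectors, train_ds.sentiment, test_size=0.3, random_state=42)

# 2. Build the Naive-Bayes model
nb_clf = BernoulliNB()
nb_clf.fit(train_X, train_y)

# 3. Find model accuracy and the confusion matrix on the validation set
test_pred = nb_clf.predict(test_X)
print(metrics.accuracy_score(test_y, test_pred))
print(metrics.confusion_matrix(test_y, test_pred))
```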
In the confusion matrix, the rows represent the actual number of positive
and negative documents in the test set, whereas the columns represent what the
model has predicted. Label 1 means positive sentiment and label 0 means negative
sentiment. The figure shows that, as per the model prediction, only 13
positive sentiment documents are classified wrongly as negative sentiment documents
(false negatives) and only 28 negative sentiment documents are classified
wrongly as positive sentiment documents (false positives). All the rest have been
classified correctly.
USING TF-IDF VECTORIZER:
• TfidfVectorizer is used to create both TF and TF-IDF
vectors. It takes a parameter use_idf (default True) to
create TF-IDF vectors. If use_idf is set to False, it creates only
TF vectors; if it is set to True, it creates TF-IDF vectors.
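A sketch, reusing the stemmed_words analyzer from the earlier sketch:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# use_idf=True (the default) gives TF-IDF vectors; use_idf=False gives plain TF vectors
tfidf_vectorizer = TfidfVectorizer(analyzer=stemmed_words, max_features=1000)
feature_vectors = tfidf_vectorizer.fit_transform(train_ds.text)
```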
CHALLENGES OF TEXT ANALYTICS
The text can be highly context-specific. The language
people use to describe movies may not be the same as for other
products, say apparel, so the training data needs to come from
a similar context (distribution) for building the model.
The language may also be highly informal. The language people
use on social media may be a mix of languages and emoticons,
and the training data needs to contain similar examples for
learning.
The bag-of-words model completely ignores the structure of
the sentence, that is, the sequence of words in it. This can be
overcome to a certain extent by using n-grams.
Using n-Grams:
• The models we built so far created features out of individual tokens or
words. But the meaning of some words may depend on the
words that precede or succeed them, for example
not happy. Such a phrase should be considered as one feature and not as
two different features.
• An n-gram is a contiguous sequence of n words. When two
consecutive words are treated as one feature, it is called a
bigram; three consecutive words form a trigram, and so on.
• We will write a new custom tokenizer get_stemmed_tokens(),
which splits the sentences and stems the words from them
before creating n-grams. The following code block removes
non-alphabetic characters and then applies stemming.
Now TfidfVectorizer takes the above method as a custom
tokenizer. It also takes an ngram_range parameter, a tuple,
for creating n-grams. A value of (1, 2) means features are created
from single words and from pairs of consecutive words.
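A sketch of this tokenizer and the n-gram TF-IDF vectorizer (the implementation details are assumptions consistent with the description above):

```python
import re
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()

def get_stemmed_tokens(doc):
    # remove non-alphabetic characters, split into words, then stem each word
    words = re.sub(r'[^a-zA-Z]', ' ', doc).split()
    return [stemmer.stem(word) for word in words]

# ngram_range=(1, 2): features from single words and pairs of consecutive words
tfidf_vectorizer = TfidfVectorizer(tokenizer=get_stemmed_tokens,
                                   ngram_range=(1, 2),
                                   max_features=1000)
feature_vectors = tfidf_vectorizer.fit_transform(train_ds.text)
```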
Build the Model Using n-Grams
Split the dataset in a 70:30 ratio to create training and
test datasets, and then apply BernoulliNB for classification. We
then use the model to predict the test set and print the
classification report.
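A sketch of these steps (the split ratio comes from the text; random_state is an assumption):

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import classification_report

train_X, test_X, train_y, test_y = train_test_split(
    feature_vectors, train_ds.sentiment, test_size=0.3, random_state=42)

# BernoulliNB binarizes the TF-IDF values (non-zero -> 1) before fitting
nb_clf = BernoulliNB().fit(train_X, train_y)
test_pred = nb_clf.predict(test_X)
print(classification_report(test_y, test_pred))
```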

The recall for identifying positive sentiment documents (with label 1)
has increased to almost 1.0.
