4. Chapter 8 Text Analytics
Overview
• In today’s world, one of the biggest sources of information is text data,
which is unstructured in nature. Finding customer sentiments from
product reviews or feedback and extracting opinions from social media data
are a few examples of text analytics.
• Finding insights from text data is not as straightforward as finding them
from structured data, and it needs extensive data pre-processing.
• All the techniques that we have learnt so far require data to be available in
a structured format (matrix form). Algorithms that we have explored so
far, such as regression, classification, or clustering, can be applied to text
data only when the data is cleaned and prepared.
• Predicting stock price movements from news articles is an
example of regression in which the features are positive and negative
sentiments about a company.
• Classifying a customer’s review comments as expressing positive or
negative sentiment is an example of classification using text data.
SENTIMENT CLASSIFICATION
• The data consists of sentiments expressed by users on various
movies. Here each comment is a record, which is
classified as either positive or negative.
• A few of the texts may have been truncated while printing, as
the default column width is limited. This can be changed by
setting the max_colwidth option to a larger value.
• Each record or example in the column text is called a
document.
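A minimal sketch of widening the printed column, assuming the data is held in a pandas DataFrame. The DataFrame name train_ds, the column name sentiment, the sample records, and the width of 800 are all illustrative assumptions; only the column name text and the max_colwidth option come from the slides.

```python
import pandas as pd

# Hypothetical records standing in for the movie review data
train_ds = pd.DataFrame({
    "sentiment": [1, 0],
    "text": [
        "An unusually long review that would be cut off at the default "
        "column width when the data frame is printed.",
        "Dull plot.",
    ],
})

# Widen the printed column so long documents are not truncated
pd.set_option("display.max_colwidth", 800)
print(train_ds.head(5))
```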
The output above shows the first five positive-sentiment documents. A
sentiment value of 1 denotes positive sentiment.
The first five negative-sentiment documents can be printed in the same way,
by selecting the records whose sentiment is negative.
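One way this selection might look is sketched below. It assumes a pandas DataFrame with a sentiment column in which 0 marks a negative document; the slides only confirm that 1 means positive, so the value 0, the column name sentiment, and the sample rows are assumptions.

```python
import pandas as pd

# Hypothetical records standing in for the movie review data
train_ds = pd.DataFrame({
    "sentiment": [1, 0, 0, 1, 0],
    "text": ["loved it", "hated it", "so boring",
             "brilliant film", "a waste of time"],
})

# Assumption: a sentiment value of 0 marks a negative document
negative_docs = train_ds[train_ds["sentiment"] == 0]
print(negative_docs.head(5))
```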
There are 1228 words that occur only once across all
documents in the corpus. These words can be ignored.
We can restrict the number of features by setting the
max_features parameter to 1000 while creating the count
vectors, so that only the 1000 most frequent words are kept.
It can be noticed that the selected list of features
contains words like the, is, was, and, etc. These words are
irrelevant in determining the sentiment of the document. These
words are called stop words and can be removed from the
dictionary. This will reduce the number of features further.
Removing Stop Words:
sklearn.feature_extraction.text provides a list of pre-defined
stop words in English, which can be used as a reference
to remove the stop words from the dictionary, that is, the feature
set.
Additional stop words can also be added to this list for
removal. In this case, the movie names and the word “movie”
itself say nothing about sentiment, so they can be added
to the existing list of stop words.