
Machine Learning

HDip in DAB
CCT College Dublin

Text Analytics and NLP


Week 9

Lecturer: Dr. Muhammad Iqbal

©CCT College Dublin 2022


Email: [email protected]
Agenda
• Introduction to NLP
• Cleaning Text
• Parsing and Cleaning HTML
• Removing Punctuation
• Tokenizing Text
• Removing Stop Words
• Stemming Words
• Tagging Parts of Speech
• Encoding Text as a Bag of Words
• Weighting Word Importance
Introduction to NLP
• Imagine a hypothetical person, John Doe. He’s the CTO of a fast-growing technology startup. On a busy
day, John wakes up and has this conversation with his digital assistant.
• John: “How is the weather today?”
• Digital assistant: “It is 37 degrees centigrade outside with no rain today.”
• John: “What does my schedule look like?”
• Digital assistant: “You have a strategy meeting at 4 p.m. and an all-hands at 5:30 p.m. Based on today’s
traffic situation, it is recommended you leave for the office by 8:15 a.m.”
• While he’s getting dressed, John probes the assistant on his fashion choices:
• John: “What should I wear today?”
• Digital assistant: “White seems like a good choice.”
• We might have used smart assistants such as Amazon Alexa, Google Home, Apple Siri, or Microsoft Cortana to do similar things.
• We talk to these assistants not in a programming language, but in our natural language.
Introduction to NLP
• In today's era of the internet and online services, data is being generated at incredible speed and volume.

• Generally, data analysts, engineers, and scientists handle relational or tabular data, whose columns contain either numerical or categorical values.

• Generated data comes in a variety of structures such as text, image, audio, and video. Online activities such as articles, website text, blog posts, and social media posts generate unstructured textual data.

• Corporates and businesses need to analyze textual data to understand customer activities, opinions, and feedback in order to successfully drive their business.

• To cope with big textual data, text analytics is evolving at a faster rate than ever before.

• Text analytics has many applications in today's online world. By analyzing tweets on Twitter, we can find trending news and people's reactions to a particular event. Amazon can understand user feedback or reviews on a specific product. BookMyShow can discover people's opinions about a movie. YouTube can also analyze and understand people's viewpoints on a video.
Compare Text Analytics, NLP and Text Mining
• Text mining is often referred to as text analytics. Text mining is the process of exploring sizeable textual data and finding patterns.
• Text mining processes the text itself, while NLP works with the underlying metadata.
• Finding frequency counts of words, the length of sentences, and the presence/absence of specific words is known as text mining.
• Natural language processing (NLP) is one of the components of text mining. NLP helps to identify sentiment, find entities in a sentence, and determine the category of a blog or article.
• Text mining provides pre-processed data for text analytics. In text analytics, statistical and machine learning algorithms are used to classify information.
What is Special About Learning from Text?
• Most machine learning applications in the text domain work with the bag-of-words
representation in which the words are treated as dimensions with values corresponding
to word frequencies.
• A data set corresponds to a collection of documents, which is also referred to as a
corpus. The complete and distinct set of words used to define the corpus is referred to as
the lexicon.
• Dimensions are also referred to as terms or features. Some text applications work with a binary representation in which the presence of a term in a document corresponds to a value of 1, and its absence to 0.
• Other applications use a normalized function of the word frequencies as the values of the dimensions. In each of these cases, the dimensionality of the data is very large, and may be of the order of 10^5 or even 10^6.
• Furthermore, most values of the dimensions are 0s, and only a few dimensions take on positive values. In other words, text is a high-dimensional, sparse, and non-negative representation.
Cleaning Text
• Problem: We have some unstructured text data and want to complete
some basic cleaning.
• Solution
• Most basic text cleaning needs nothing more than Python's core string operations, in particular strip, replace, and split:
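A minimal sketch of such cleaning (the sample strings below are hypothetical):

```python
# Hypothetical messy text data for illustration
text_data = ["   Interrobang. By Aishwarya Henriette    ",
             "Parking And Going. By Karl Gautier",
             "    Today Is The night. By Jarek Prakash   "]

# Strip leading and trailing whitespace
stripped = [s.strip() for s in text_data]

# Remove periods
no_periods = [s.replace(".", "") for s in stripped]

# Split each string into individual words
print([s.split() for s in no_periods])
# [['Interrobang', 'By', 'Aishwarya', 'Henriette'], ...]
```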
Regular Expressions
• A RegEx, or Regular Expression, is a sequence of characters that forms a search
pattern.
• RegEx can be used to check if a string contains the specified search pattern.
• Python has a built-in package called re, which can be used to work with Regular
Expressions. For example, import re

https://www.w3schools.com/python/python_regex.asp
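A short sketch of common re operations (the example string and patterns are illustrative):

```python
import re

text = "The rain in Spain stays mainly in the plain since 2024"

# Check whether the string contains the search pattern
print(bool(re.search(r"Spain", text)))   # True

# Find all runs of digits
print(re.findall(r"\d+", text))          # ['2024']

# Replace runs of whitespace with a single space
print(re.sub(r"\s+", " ", "too   many    spaces"))  # 'too many spaces'
```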
Parsing and Cleaning HTML
• Problem: We have text data with HTML elements and want to extract just the text.
• Solution: Use Beautiful Soup’s extensive set of options to parse and extract from
HTML.

• Despite the strange name, Beautiful Soup is a powerful Python library designed for scraping HTML. Beautiful Soup is usually used to scrape live websites, but we can just as easily use it to extract text data embedded in HTML. The full range of Beautiful Soup operations is much wider, but the few methods used in our solution are all we need to extract the text.
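A minimal sketch, assuming the beautifulsoup4 package is installed and using a made-up HTML snippet:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet for illustration
html = "<div class='full_name'><span style='font-weight:bold'>Masego</span> Azra</div>"

soup = BeautifulSoup(html, "html.parser")

# Find the div with class 'full_name' and extract just the text
print(soup.find("div", {"class": "full_name"}).text)  # 'Masego Azra'
```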
Removing Punctuation
• Problem: You have a feature of text data and want to remove punctuation.
• Solution: Define a function that uses translate with a dictionary of punctuation characters:

• translate is a Python string method popular due to its blazing speed. In this solution, we created a dictionary, punctuation, with all punctuation characters according to Unicode as its keys and None as its values.
• We translated all characters in the string that appear in punctuation into None, effectively removing them. There are more readable ways to remove punctuation.
• It is important to be conscious of the fact that punctuation contains information (e.g., “Right?” versus “Right!”). Removing punctuation is a necessary evil to create features; however, if the punctuation matters for the task at hand, consider keeping it.
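A sketch of the translate-based approach described above; note that building the punctuation dictionary scans the full Unicode range, so it takes a moment:

```python
import sys
import unicodedata

text_data = ["Hi!!!! I. Love. This. Song....",
             "10000% Agree!!!! #LoveIT",
             "Right?!?!"]

# Dictionary mapping every Unicode punctuation code point to None
punctuation = dict.fromkeys(
    (i for i in range(sys.maxunicode)
     if unicodedata.category(chr(i)).startswith("P")),
    None)

# Translate punctuation characters to None, i.e., remove them
print([s.translate(punctuation) for s in text_data])
# ['Hi I Love This Song', '10000 Agree LoveIT', 'Right']
```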
Tokenizing Text
• Problem: You have text and want to break it up into individual words.
• Solution: Natural Language Toolkit for Python (NLTK) has a powerful set of text
manipulation operations, including word tokenizing:
• Tokenization is a common task after cleaning text data because it is the first step in the process of turning the text into data we will use to construct useful features.
• We use the method word_tokenize() to split a sentence into words. The output of word tokenization can be converted to a DataFrame for better text understanding in machine learning applications.
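A minimal sketch (the first run may require downloading the tokenizer models):

```python
from nltk.tokenize import word_tokenize
# import nltk; nltk.download('punkt')  # needed once, if not already installed

string = "The science of today is the technology of tomorrow"
print(word_tokenize(string))
# ['The', 'science', 'of', 'today', 'is', 'the', 'technology', 'of', 'tomorrow']
```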
Removing Stop Words
• Problem: Given tokenized text data, you want to remove extremely common words (e.g., a, is, of, on) that contain little informational value.

• Solution: Use NLTK's stopwords:

• While “stop words” can refer to any set of words that we want to remove before processing, frequently the term refers to extremely common words that themselves contain little information value.

• NLTK has a list of common stop words that we can use to find and remove stop words in our tokenized words.
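A minimal sketch; note that NLTK's stop words are lowercase, so tokens should be lowercased first:

```python
from nltk.corpus import stopwords
# import nltk; nltk.download('stopwords')  # needed once

tokenized_words = ['i', 'am', 'going', 'to', 'go', 'to',
                   'the', 'store', 'and', 'park']

stop_words = set(stopwords.words('english'))

# Keep only the tokens that are not stop words
print([word for word in tokenized_words if word not in stop_words])
# ['going', 'go', 'store', 'park']
```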
Stemming Words
• Problem: You have tokenized words and want to convert them into their root forms.
• Solution: Use NLTK’s PorterStemmer:

• Stemming reduces a word to its stem by identifying and removing affixes (e.g., gerunds)
while keeping the root meaning of the word. For example, both “tradition” and
“traditional” have “tradit” as their stem, indicating that while they are different words
they represent the same general concept.
• By stemming text data, we transform it to something less readable, but closer to its base
meaning and thus more suitable for comparison across observations. NLTK’s
PorterStemmer implements the widely used Porter stemming algorithm to remove or
replace common suffixes to produce the word stem.
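A minimal sketch of Porter stemming on a list of tokens:

```python
from nltk.stem.porter import PorterStemmer

tokenized_words = ['i', 'am', 'humbled', 'by', 'this', 'traditional', 'meeting']

porter = PorterStemmer()
print([porter.stem(word) for word in tokenized_words])
# ['i', 'am', 'humbl', 'by', 'thi', 'tradit', 'meet']
```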
Tagging Parts of Speech
• Problem: You have text data and want to tag each word or character with its part of speech.
• Solution: Use NLTK’s pre-trained parts-of-speech tagger:

• The output is a list of tuples with each word and its part-of-speech tag. NLTK uses the Penn Treebank part-of-speech tags. Some examples of the Penn Treebank tags can be found at:
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
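A minimal sketch (the pre-trained tagger may require a one-time download):

```python
from nltk import pos_tag, word_tokenize
# import nltk; nltk.download('averaged_perceptron_tagger')  # needed once

text = "Chris loved outdoor running"
print(pos_tag(word_tokenize(text)))
# [('Chris', 'NNP'), ('loved', 'VBD'), ('outdoor', 'JJ'), ('running', 'VBG')]
```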
Tagging Parts of Speech
• A more realistic situation would be that we have data where every observation contains a tweet.
• We want to convert those sentences into features for individual parts of speech (e.g., a feature with 1 if a proper noun is present, and 0 otherwise).

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html
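A sketch of this idea using scikit-learn's MultiLabelBinarizer; the tweets below are made up for illustration:

```python
from nltk import pos_tag, word_tokenize
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical tweets for illustration
tweets = ["I am eating a burrito for breakfast",
          "Political science is an amazing field",
          "San Francisco is an awesome city"]

# For each tweet, collect the part-of-speech tags it contains
tagged_tweets = [[tag for _, tag in pos_tag(word_tokenize(tweet))]
                 for tweet in tweets]

# One-hot encode: one feature per tag, 1 if the tag occurs in the tweet
one_hot = MultiLabelBinarizer()
features = one_hot.fit_transform(tagged_tweets)
print(one_hot.classes_)  # the tag vocabulary, e.g. ['DT' 'IN' 'JJ' ...]
print(features)
```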
Tagging Parts of Speech
• If our text is English and not on a specialized topic (e.g., medicine), the simplest
solution is to use NLTK’s pre-trained parts-of-speech tagger.
• However, if pos_tag is not very accurate, NLTK also gives us the ability to train our
own tagger.
• The major downside of training a tagger is that we need a large corpus of text
where the tag of each word is known.
• Constructing this tagged corpus is labor intensive and is probably going to be a
last resort.
• If we had a tagged corpus and wanted to train a tagger, the following is an
example of how we could do it.
• The corpus we are using is the Brown Corpus, one of the most popular sources of
tagged text.
https://en.wikipedia.org/wiki/Brown_Corpus
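A sketch of one common way to do this, training a backoff chain of n-gram taggers on the Brown Corpus; the train/test split is arbitrary and the exact accuracy depends on it:

```python
from nltk.corpus import brown
from nltk.tag import UnigramTagger, BigramTagger, TrigramTagger
# import nltk; nltk.download('brown')  # needed once

# Tagged sentences from the Brown Corpus 'news' category
sentences = brown.tagged_sents(categories='news')
train, test = sentences[:4000], sentences[4000:]

# Trigram tagger that backs off to a bigram, then a unigram tagger
unigram = UnigramTagger(train)
bigram = BigramTagger(train, backoff=unigram)
trigram = TrigramTagger(train, backoff=bigram)

# On older NLTK versions use trigram.evaluate(test) instead
print(trigram.accuracy(test))  # roughly 0.8 on this split
```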
Encoding Text as a Bag of Words

• Problem: You have text data and want to create a set of features indicating the number of times an observation's text contains a particular word.
• Solution: Use scikit-learn's CountVectorizer:
• This output is a sparse array, which is necessary when we have a large amount of text. However, in our toy example we can use toarray to view a matrix of word counts for each observation.
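A minimal sketch with a toy corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

text_data = np.array(['I love Brazil. Brazil!',
                      'Sweden is best',
                      'Germany beats both'])

count = CountVectorizer()
bag_of_words = count.fit_transform(text_data)  # a sparse matrix

print(bag_of_words.toarray())  # dense view, fine for this toy example
print(count.get_feature_names_out())  # use get_feature_names() on scikit-learn < 1.0
# ['beats' 'best' 'both' 'brazil' 'germany' 'is' 'love' 'sweden']
```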
Encoding Text as a Bag of Words
• One of the most common methods of transforming text into features is by using a
bag-of-words model.
• Bag-of-words models output a feature for every unique word in the text data, with
each feature containing a count of occurrences in the observations.
• For example, in our solution the sentence “I love Brazil. Brazil!” has a value of 2 in the “brazil” feature because the word brazil appears two times.
• The text data in our solution was purposely small. In the real world, a single observation of text data could be the contents of an entire book!
• Since our bag-of-words model creates a feature for every unique word in the data, the resulting matrix can contain thousands of features.
• This means that the matrix can take up a lot of memory. However, we can exploit a common characteristic of bag-of-words feature matrices, their sparsity, to reduce the amount of data we need to store.
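Continuing the toy example above, the sparse matrix only stores the non-zero counts:

```python
# Shape of the matrix vs. how many cells are actually non-zero
print(bag_of_words.shape)  # (3, 8): 3 documents, 8 unique words
print(bag_of_words.nnz)    # 8: only the non-zero entries (of 24 cells) are stored
```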
Text Classification
• Text classification is one of the important tasks
of text mining. It is a supervised approach.
• It identifies the category or class of a given text, such as a blog, book, web page, news article, or tweet.
• It has various applications in today's computer world, such as spam detection, task categorization in CRM services, categorizing products on e-retailer websites, classifying the content of websites for a search engine, analyzing the sentiment of customer feedback, etc.
• We will learn how to do text classification in Python.
Feature Generation using Bag of Words
• In text classification, we have a set of texts and their respective labels. But we can't use raw text directly in our model; we need to convert the text into numbers or vectors of numbers.

• The bag-of-words model (BoW) is the simplest way of extracting features from text. BoW converts text into a matrix of the occurrence of words within a document. This model is concerned only with whether given words occur in the document, not with word order.

• Example: There are three documents:

• Doc 1: I love dogs. Doc 2: I hate dogs and knitting. Doc 3: Knitting is my hobby and passion.

• Now, we can create a matrix of documents and words by counting the occurrence of words in each document. This matrix is known as the Document-Term Matrix (DTM).
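A sketch of building this DTM with scikit-learn; note that CountVectorizer's default token pattern drops one-character tokens, so “I” does not appear as a term:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love dogs.",
        "I hate dogs and knitting.",
        "Knitting is my hobby and passion."]

count = CountVectorizer()
dtm = count.fit_transform(docs)

# Document-Term Matrix as a DataFrame, one row per document
print(pd.DataFrame(dtm.toarray(),
                   columns=count.get_feature_names_out(),
                   index=["Doc 1", "Doc 2", "Doc 3"]))
#        and  dogs  hate  hobby  is  knitting  love  my  passion
# Doc 1    0     1     0      0   0         0     1   0        0
# Doc 2    1     1     1      0   0         1     0   0        0
# Doc 3    1     0     0      1   1         1     0   1        1
```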
Feature Generation using TF-IDF
• In Term Frequency (TF), we count the number of times each word occurs in a document. The main issue with term frequency is that it gives more weight to longer documents. Term frequency is basically the output of the BoW model.
• IDF (Inverse Document Frequency) measures the amount of information a given word provides across the documents. IDF is the logarithmically scaled inverse ratio of the number of documents that contain the word to the total number of documents: idf(t) = log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing the term t.

• TF-IDF (Term Frequency-Inverse Document Frequency) normalizes the document-term matrix. It is the product of TF and IDF. A word with a high TF-IDF score occurs frequently in the given document and is largely absent from the other documents, so it acts as a signature word for that document.
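A minimal sketch using scikit-learn's TfidfVectorizer on the three documents above; note that scikit-learn uses a smoothed IDF variant and L2-normalizes each row, so the scores differ slightly from the plain formula:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I love dogs.",
        "I hate dogs and knitting.",
        "Knitting is my hobby and passion."]

# By default scikit-learn computes idf(t) = ln((1 + N) / (1 + df(t))) + 1
tfidf = TfidfVectorizer()
feature_matrix = tfidf.fit_transform(docs)

print(tfidf.get_feature_names_out())
print(feature_matrix.toarray().round(2))
```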
Model Building and Evaluation (TF-IDF)
• Let us split the dataset using the function train_test_split().
• We need to pass three parameters: the features, the target, and the test set size. Additionally, we can set random_state to make the random selection of records reproducible.
• First, import the MultinomialNB module and create the Multinomial Naive Bayes classifier object using the MultinomialNB() function.
• Then, fit the model on the training set using fit() and perform prediction on the test set using predict().
• We got a classification rate of 58.65% using TF-IDF features, which is not considered good accuracy. We need to improve the accuracy by using some other preprocessing or feature engineering. Consider which approaches might improve it.
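A sketch of these steps; it assumes X is a TF-IDF feature matrix and y the corresponding sentiment labels (both hypothetical names, e.g. produced by TfidfVectorizer on the movie-review phrases):

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# X: TF-IDF feature matrix, y: sentiment labels (assumed to exist)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# Create and fit the Multinomial Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train, y_train)

# Predict on the test set and evaluate accuracy
predicted = clf.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, predicted))
```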
Sentiment Analysis
• As data analysts, it is important to understand sentiments and what they really mean.
• There are mainly two approaches to performing sentiment analysis:
• Lexicon-based: count the number of positive and negative words in the given text; the larger count determines the sentiment of the text.
• Machine learning based: develop a classification model, trained on a pre-labeled dataset of positive, negative, and neutral examples.
• In this lecture, we use the second (machine learning based) approach.
• We have learned data pre-processing using NLTK. Now we learn text classification, performing Multinomial Naive Bayes classification using scikit-learn.
• In the model building part, we can use the “Sentiment Analysis of Movie Reviews” dataset available on Kaggle.
• The dataset is a tab-separated file with four columns: PhraseId, SentenceId, Phrase, and Sentiment. The data has 5 sentiment labels, ranging from 0 (negative) to 4 (positive).
• This is how we learn sentiment analysis and text classification.
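For contrast, a toy sketch of the lexicon-based approach; the word lists are tiny, hypothetical stand-ins for a real sentiment lexicon:

```python
# Hypothetical miniature sentiment lexicon
positive_words = {"good", "great", "love", "awesome", "excellent"}
negative_words = {"bad", "terrible", "hate", "awful", "poor"}

def lexicon_sentiment(text):
    # Score = (# positive tokens) - (# negative tokens)
    tokens = text.lower().split()
    score = (sum(t in positive_words for t in tokens)
             - sum(t in negative_words for t in tokens))
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(lexicon_sentiment("I love this great movie"))  # positive
```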
Reference and Resources
• Introduction to Machine Learning with Python: A Guide for Data Scientists, Andreas C. Müller and Sarah Guido, O'Reilly, 2017.
• https://learning.oreilly.com/library/view/machine-learning-with/9781491989371/ch06.html
• https://www.datacamp.com/community/tutorials/text-analytics-beginners-nltk
• Neural Network Projects with Python, James Loy, Packt Publishing, 2019.
