https://bit.ly/2MWyuD7
Text Mining: Social Sentiment Analysis Using WEKA Tools
Sepuluh Nopember Institute of Technology (ITS), 29-31 Oct. 2019
Dr Shuzlina Abdul Rahman (Associate Professor) and Ts Dr Sofianita Mutalib (Senior Lecturer)
Centre for Information Systems Studies, Faculty of Computer & Mathematical Sciences,
Universiti Teknologi MARA (UiTM), Shah Alam, MALAYSIA
Learning Outcomes from Workshop
• At the end of the workshop, participants should be able to use WEKA for Sentiment Analysis in the tasks below:
• pre-process textual data
• develop classification models using machine learning algorithms
INTRODUCTION
• What is Sentiment Analysis (SA)?
• Why is SA interesting?
• How to apply SA in WEKA?
What is Sentiment Analysis (SA)?
SA is the process of analyzing people's opinions, feelings, and attitudes towards a specific product, organization, or service [1].
Why is SA interesting?
[2] https://www.pewresearch.org/fact-tank/2019/04/04/indonesians-optimistic-about-their-countrys-democracy-and-economy-as-elections-near/
Basic Concepts of Text Analytics
• A document can be described by a set of representative keywords called index terms.
• Different index terms have varying relevance when used to describe document contents.
• This effect is captured through the assignment of numerical weights to each index term of a document (e.g., frequency, or term frequency-inverse document frequency, TF-IDF).
• DBMS analogy:
• Index Terms → Attributes
• Weights → Attribute Values
(Han et al., 2011) [3]
Basic Concepts
• Index term (attribute) selection - tokenization:
• Stop list - an irrelevant set of words (a, the, of, …)
• Word stem - variants (drug: drugs, drugged) viewed as different occurrences of the same word
• Index term weighting methods
• Term × Document frequency matrices - measure the number of occurrences of term t in document d
(Han et al., 2011) [3]
Text Categorization
• Pre-given categories and labeled document examples (categories may form a hierarchy)
• Classify new documents; e.g., Google mail apps
• A standard classification (supervised learning) problem
[Diagram: past reviews with known labels (Positive, Negative, Neutral, …) train a categorization system, which then assigns labels to new reviews.]
(Han et al., 2011) [3]
How to apply SA in WEKA Tools?
• Data Acquisition and Preparation
• Text Pre-processing
• Tokenization
• Feature selection
• Feature transformation
• Text Classification using Machine Learning
• Random Forest (RF)
• Naïve Bayes (NB)
• Logistic Regression (LR)
• Support Vector Machine (SVM)
• Analysis of the results
WEKA
Machine Learning Software in Java
Datasets: https://www.kaggle.com/datasets and https://archive.ics.uci.edu/ml/datasets.php
WEKA 3
• Weka is a collection of machine learning algorithms for data mining tasks. It contains tools for data preparation, classification, regression, clustering, association rule mining, and visualization.
• https://www.cs.waikato.ac.nz/ml/weka/
• Two versions:
• Stable ver. (3.8)
• Developer ver. (3.9)
Waikato Environment for Knowledge Analysis (WEKA)
• In 1993, the University of Waikato in New Zealand began development of the original version of Weka.
• It is released as open-source software under the GNU GPL. It is written in Java and provides an API that is well documented and promotes integration into your own applications.
• WEKA provides a collection of data mining and machine learning algorithms and processing tools.
• WEKA is an environment for comparing learning algorithms.
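Since the deck notes that Weka's Java API promotes integration into your own applications, here is a minimal sketch (the file name and class name are our own assumptions) of loading a dataset and training a classifier in code rather than through the GUI:

```java
// Minimal Weka API sketch (Weka 3.8); "reviews.arff" is a hypothetical file.
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaApiDemo {
    public static void main(String[] args) throws Exception {
        // DataSource reads any format Weka has a converter for (.arff, .csv, ...)
        Instances data = DataSource.read("reviews.arff");
        data.setClassIndex(data.numAttributes() - 1);  // assume class is last

        RandomForest rf = new RandomForest();
        rf.buildClassifier(data);   // train on the loaded instances
        System.out.println(rf);     // print a summary of the learned model
    }
}
```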
WEKA GUI Chooser - Data Mining Tool
https://www.cs.waikato.ac.nz/ml/weka/index.html
WEKA Interfaces
E-books: https://www.cs.waikato.ac.nz/ml/weka/documentation.html
The Explorer's tabs:
• Preprocess: load a dataset and manipulate the data into a form that you want to work with.
• Classify: select and run classification and regression algorithms to operate on your data.
• Cluster: select and run clustering algorithms on your dataset.
• Associate: run association algorithms to extract insights from your data.
• Select attributes: run attribute selection algorithms on your data to select those attributes that are relevant to the feature you want to predict.
• Visualize: visualize the relationship between attributes.
WEKA Data Format
• .arff
• .csv
• .dat
WEKA 3 - LET'S BEGIN
Dataset - Product Reviews
• Amazon Kindle
• https://www.kaggle.com/bharadwaj6/kindle-reviews
• Amazon Kindle Store category, May 1996 - July 2014.
• Contains a total of 982,619 entries.
• Each reviewer has at least 5 reviews and each product has at least 5 reviews in this dataset.
Download and open the files, then save as .csv in Excel
You need to save the file in .csv (Comma delimited) format.
Outline
Data Preparation
• Transform
• Merge
• Rename
Text Pre-processing
• Tokenization
• Attribute selection
Text Classification using Machine Learning
• Random Forest (RF)
• Naïve Bayes (NB)
• Logistic Regression (LR)
• Support Vector Machine (SVM)
Analysis of the results
Data Preparation
• For this exercise:
• 10 attributes
• 2500 instances
• Two primary attributes: overall and reviewText
• Delete the other attributes: no, asin, helpful, reviewTime, reviewID, reviewerName, summary, and unixReviewTime
Data Preparation
• Then, load your dataset into Weka
• Click 'Open file...' -> find your '.csv' data file -> click 'Open'
Data Preparation
• "overall" is the overall rating based on the reviews.
• We have ratings in the range [1-5].
• Label rating = 1 or 2 as a "Negative" review.
• Consider rating = 3 as a "Neutral" review.
• And rating = 4 or 5 as a "Positive" review.
• Then, reload the dataset in Weka.
Data Preparation
• Change "overall" from numeric to nominal
• Steps:
• Filter >> Unsupervised >> Attribute >> NumericToNominal
• Change "attributeIndices" to first
• Click OK >> Apply
Data Preparation
• Merge the values of attribute "overall": '1' with '2', and '4' with '5'
• Steps (the MergeTwoValues filter merges two values of one attribute; apply it twice):
• Filter >> Unsupervised >> Attribute >> MergeTwoValues
• Change "attributeIndex" to first
• Change "firstValueIndex" to 1 or first
• Change "secondValueIndex" to 2
• Click OK >> Apply
RECALL our objective is to:
• "overall" is the overall rating based on the reviews, in the range [1-5].
• Label rating = 1 or 2 as "Negative", rating = 3 as "Neutral", and rating = 4 or 5 as "Positive".
Data Preparation
• Rename the new labels
• Steps:
• Filter >> Unsupervised >> Attribute >> RenameNominalValues
• Change "selectedAttributes" to 1
• Type in "valueReplacements": 1_2:neg, 3:neu, 4_5:pos
• Click OK >> Apply
Data Preparation
• To set attribute "overall" as the class label:
• Steps:
• Click Edit, go to the header of column one, right-click, and choose "Attribute as class"
• Save the file as kindle_review_labelled.arff
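For those who prefer scripting over the Explorer, below is a minimal Java sketch of the same preparation steps (Weka 3.8; file names are assumptions, and the pared-down CSV is assumed to have 'overall' as its first column):

```java
// Sketch: load CSV, make "overall" nominal, merge 1&2 and 4&5, rename to
// neg/neu/pos, set the class, and save as ARFF. File names are hypothetical.
import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.MergeTwoValues;
import weka.filters.unsupervised.attribute.NumericToNominal;
import weka.filters.unsupervised.attribute.RenameNominalValues;

public class PrepareKindleReviews {
    public static void main(String[] args) throws Exception {
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("kindle_reviews.csv"));
        Instances data = loader.getDataSet();

        NumericToNominal toNominal = new NumericToNominal();
        toNominal.setAttributeIndices("first");          // "overall"
        toNominal.setInputFormat(data);
        data = Filter.useFilter(data, toNominal);

        data = merge(data, "first", "2");   // "1","2" -> "1_2"
        data = merge(data, "3", "4");       // "4","5" (now 3rd/4th values) -> "4_5"

        RenameNominalValues rename = new RenameNominalValues();
        rename.setSelectedAttributes("1");
        rename.setValueReplacements("1_2:neg,3:neu,4_5:pos");
        rename.setInputFormat(data);
        data = Filter.useFilter(data, rename);

        data.setClassIndex(0);              // "overall" becomes the class

        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("kindle_review_labelled.arff"));
        saver.writeBatch();
    }

    // Merge two values of the first attribute, given their 1-based indices.
    private static Instances merge(Instances data, String v1, String v2) throws Exception {
        MergeTwoValues m = new MergeTwoValues();
        m.setAttributeIndex("first");
        m.setFirstValueIndex(v1);
        m.setSecondValueIndex(v2);
        m.setInputFormat(data);
        return Filter.useFilter(data, m);
    }
}
```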
Data Preparation with Excel (alternative)
Go to Excel and use Find and Replace to change the 'overall' numbers into labels.
Outline
Data Preparation
• Transform
• Merge
• Rename
Text Pre-processing
• Tokenization
• Attribute selection
Text Classification using Machine Learning
• Random Forest (RF)
• Naïve Bayes (NB)
• Logistic Regression (LR)
• Support Vector Machine (SVM)
Analysis of the results
Text Pre-processing (Tokenization)
• Convert the text (review attribute) from Nominal to String
• Steps:
• Choose Filter >> Unsupervised >> Attribute >> NominalToString
• Change "attributeIndexes" to 1
• Click OK, then Apply
Text Pre-processing (Tokenization)
• Process the text (review attribute) to create index terms
• Steps:
• Choose Filter >> Unsupervised >> Attribute >> StringToWordVector
• Change "IDFTransform" to True
• Change "TFTransform" to True
• Change "attributeIndices" to 1
• Change "lowerCaseTokens" to True
• Change "outputWordCounts" to True
• Go to "stemmer" and choose "IteratedLovinsStemmer"
• Go to "stopwordsHandler" and choose "Rainbow"
• Go to "tokenizer" and choose "WordTokenizer", typing in the delimiters as follows:
.,;:'"()?!@#$%^&*(){}[]/|\1234567890-=+_)
• Change "wordsToKeep" to 200, so 200 attributes will be kept for each class
• Click OK, then Apply
• WordTokenizer tokenizes on a set of delimiter characters.
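The same tokenization settings can be applied via the API; a minimal sketch follows (the input file name is an assumption, a leading space is added to the delimiter set so tokens also split on whitespace, and the attribute indices follow the slides, so adjust them to wherever the review text sits in your file):

```java
// Sketch: NominalToString on the review text, then StringToWordVector with
// TF-IDF, lower-casing, Lovins stemming, Rainbow stop words, WordTokenizer.
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.stemmers.IteratedLovinsStemmer;
import weka.core.stopwords.Rainbow;
import weka.core.tokenizers.WordTokenizer;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NominalToString;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class TokenizeReviews {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("kindle_review_labelled.arff");
        data.setClassIndex(data.attribute("overall").index()); // skip the class

        NominalToString toString = new NominalToString();
        toString.setAttributeIndexes("1");        // the review text attribute
        toString.setInputFormat(data);
        data = Filter.useFilter(data, toString);

        StringToWordVector s2wv = new StringToWordVector();
        s2wv.setAttributeIndices("1");
        s2wv.setIDFTransform(true);
        s2wv.setTFTransform(true);
        s2wv.setLowerCaseTokens(true);
        s2wv.setOutputWordCounts(true);
        s2wv.setWordsToKeep(200);
        s2wv.setStemmer(new IteratedLovinsStemmer());
        s2wv.setStopwordsHandler(new Rainbow());

        WordTokenizer tok = new WordTokenizer();
        tok.setDelimiters(" .,;:'\"()?!@#$%^&*(){}[]/|\\1234567890-=+_)");
        s2wv.setTokenizer(tok);

        s2wv.setInputFormat(data);
        data = Filter.useFilter(data, s2wv);
        System.out.println(data.numAttributes() + " attributes after vectorization");
    }
}
```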
Text Pre-processing (Tokenization): Output
• Attribute selection can also be done on the 'Classify' tab:
• On the 'Classify' tab -> 'Classifier' -> 'Meta' -> 'AttributeSelectedClassifier'
• Classifier -> choose 'J48'
• Evaluator -> choose 'GainRatioAttributeEval'
• Search -> choose 'Ranker'
Cont…
• Click on 'Ranker' -> change 'numToSelect' to 100,
which means Weka will select the top 100 attributes by rank.
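The same wrapper can be built in code; a minimal sketch of an AttributeSelectedClassifier with J48, GainRatioAttributeEval, and a Ranker that keeps the top 100 attributes:

```java
// Sketch: J48 wrapped in attribute selection by gain ratio (top 100).
import weka.attributeSelection.GainRatioAttributeEval;
import weka.attributeSelection.Ranker;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.classifiers.trees.J48;

public class SelectAndClassify {
    public static void main(String[] args) throws Exception {
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(100);  // keep the top 100 ranked attributes

        AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
        asc.setClassifier(new J48());
        asc.setEvaluator(new GainRatioAttributeEval());
        asc.setSearch(ranker);
        // asc.buildClassifier(data);  // then train as usual
    }
}
```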
Alternative Approach:
On the 'Select attributes' tab >> choose 'AttributeSelection' >> 'CorrelationAttributeEval'; choose "Ranker".
Click Start; always ensure the class label is the correct one.
CorrelationAttributeEval
• Evaluates the worth of an attribute by
measuring the correlation (Pearson’s)
of the attributes with the class.
• Nominal attributes are considered on
a value by value basis by treating each
value as an indicator. An overall
correlation for a nominal attribute is
arrived at via a weighted average.
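A minimal sketch of the same 'Select attributes' flow via the API (the input file name is an assumption):

```java
// Sketch: rank word attributes by Pearson correlation with the class.
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.CorrelationAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RankByCorrelation {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("kindle_vectorized.arff"); // hypothetical
        data.setClassIndex(data.attribute("overall").index()); // correct class label

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new CorrelationAttributeEval());
        selector.setSearch(new Ranker());
        selector.SelectAttributes(data);
        System.out.println(selector.toResultsString());
    }
}
```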
• 'Test options' - for this exercise we are going to use the default percentage split of 66% -> click 'Start'
Text Pre-processing: Save reduced dataset
In the result list, go to the run result, right-click, choose "Save reduced data", and name the file.
Alternative tokenization technique: NGram
• Open the original 2500-instance dataset (or retrieve the saved file).
• Perform a similar process to get the result below:
Text Pre-processing - NGram Tokenization and Stop Words
• NGramTokenizer is used to split a string into n-grams between a min and max gram size.
• Filter -> Unsupervised -> Attribute -> 'StringToWordVector'
• Change 'IDFTransform', 'TFTransform', 'lowerCaseTokens', and 'outputWordCounts' to 'True'
• stopwordsHandler = 'MultiStopwords'
• tokenizer = 'NGramTokenizer'
• wordsToKeep = 200
• Click OK & Apply
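A minimal sketch of the NGram variant via the API; the 1-3 gram range is our own assumption, since the slides do not fix the min/max sizes:

```java
// Sketch: StringToWordVector with an NGramTokenizer and MultiStopwords.
import weka.core.stopwords.MultiStopwords;
import weka.core.tokenizers.NGramTokenizer;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class NGramSetup {
    public static void main(String[] args) throws Exception {
        NGramTokenizer ngram = new NGramTokenizer();
        ngram.setNGramMinSize(1);  // min grams (assumed)
        ngram.setNGramMaxSize(3);  // max grams (assumed)

        StringToWordVector s2wv = new StringToWordVector();
        s2wv.setIDFTransform(true);
        s2wv.setTFTransform(true);
        s2wv.setLowerCaseTokens(true);
        s2wv.setOutputWordCounts(true);
        s2wv.setWordsToKeep(200);
        s2wv.setStopwordsHandler(new MultiStopwords());
        s2wv.setTokenizer(ngram);
        // then apply with Filter.useFilter(data, s2wv) as before
    }
}
```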
Text Pre-processing - NGram Tokenization and Stop Words
• Check 'Select attributes' and the words ranked in it:
• Attribute evaluator -> choose 'CorrelationAttributeEval' -> click 'Start'
Tokenization:
Use Own Dictionary and Stop Words
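A minimal sketch (both file names are assumptions) of plugging in your own stop-word list via WordsFromFile; for a fixed dictionary, Weka 3.8 ships a FixedDictionaryStringToWordVector filter:

```java
// Sketch: custom stop words from a file, plus a fixed-dictionary filter.
import java.io.File;
import weka.core.stopwords.WordsFromFile;
import weka.filters.unsupervised.attribute.FixedDictionaryStringToWordVector;

public class OwnDictionaryAndStopwords {
    public static void main(String[] args) throws Exception {
        WordsFromFile stops = new WordsFromFile();
        stops.setStopwords(new File("my_stopwords.txt"));   // one word per line

        FixedDictionaryStringToWordVector dict = new FixedDictionaryStringToWordVector();
        dict.setDictionaryFile(new File("my_dictionary.txt")); // terms to keep
        dict.setStopwordsHandler(stops);
        // then apply with Filter.useFilter(data, dict) as before
    }
}
```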
Best Practice - Always save files in .arff format
• Preprocess -> click ‘Save’ -> name your file .arff -> click ‘Save’
Outline
Data Preparation
• Transform
• Merge
• Rename
Text Pre-processing
• Tokenization
• Attribute selection
Text Classification using Machine Learning
• Random Forest (RF)
• Naïve Bayes (NB)
• Logistic Regression (LR)
• Support Vector Machine (SVM)
Analysis of the results
Text Classification
• Text classification based on sentiment
• Sentiment is labelled based on the case study
Text Classification – RandomForest (RF)
Test Options: Cross-validation 10 folds
Save Models
• Classify -> in the 'Result list' -> right-click on 'RandomForest' -> click 'Save model'
• Name your model and save it as ".model"
• Repeat this step for each model.
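A minimal sketch of the 10-fold cross-validation plus model saving via the API (file names and seed are assumptions); NaiveBayes, Logistic, or LibSVM can be swapped in the same way:

```java
// Sketch: 10-fold CV for RandomForest, then save the trained model.
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidateRF {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("kindle_vectorized.arff"); // hypothetical
        data.setClassIndex(data.attribute("overall").index());

        RandomForest rf = new RandomForest();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(rf, data, 10, new Random(1));  // 10 folds
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toClassDetailsString());  // per-class P/R

        rf.buildClassifier(data);                 // final model on all data
        SerializationHelper.write("rf.model", rf);
    }
}
```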
Text Classification – Naïve Bayes (NB)
Test Options: Cross-validation 10 folds
Text Classification – Logistic Regression (LR)
Test Options: Cross-validation 10 folds
Text Classification – Support Vector Machine (SVM)
Test Options: Cross-validation 10 folds
To use SVM classification, you need to install LibSVM first:
Go to the Weka GUI Chooser -> Tools -> click on Package manager.
Download the latest version and install it.
Text Classification – RandomForest (RF)
Test Options: Percentage split by default = 66% training, 34% testing
Text Classification – Naïve Bayes (NB)
Test Options: Percentage split by default = 66% training, 34% testing
Text Classification – Logistic Regression (LR)
Test Options: Percentage split by default = 66% training, 34% testing
Text Classification – Support Vector Machine (SVM)
Test Options: Percentage split by default = 66% training, 34% testing
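A minimal sketch of the 66%/34% percentage split via the API (file name and random seed are assumptions; RandomForest shown):

```java
// Sketch: randomize, split 66%/34%, train on the first part, test on the rest.
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PercentageSplitRF {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("kindle_vectorized.arff"); // hypothetical
        data.setClassIndex(data.attribute("overall").index());
        data.randomize(new Random(1));

        int trainSize = (int) Math.round(data.numInstances() * 0.66);
        Instances train = new Instances(data, 0, trainSize);
        Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);

        RandomForest rf = new RandomForest();
        rf.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(rf, test);
        System.out.println(eval.toClassDetailsString()); // TP/FP rate, precision, recall
    }
}
```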
Outline
Data Preparation
• Transform
• Merge
• Rename
Text Pre-processing
• Tokenization
• Attribute selection
Text Classification using Machine Learning
• Random Forest (RF)
• Naïve Bayes (NB)
• Logistic Regression (LR)
• Support Vector Machine (SVM)
Analysis of the results
Analysis of the Results
• These results are based on the weighted average.

Methods                      | 10-fold cross-validation              | Percentage split (66% train, 34% test)
                             | TP Rate | FP Rate | Precision | Recall | TP Rate | FP Rate | Precision | Recall
Random Forest (RF)           | 0.611   | 0.256   | 0.566     | 0.611  | 0.586   | 0.265   | 0.535     | 0.586
Naïve Bayes (NB)             | 0.505   | 0.271   | 0.530     | 0.505  | 0.502   | 0.265   | 0.531     | 0.502
Logistic Regression (LR)     | 0.612   | 0.222   | 0.593     | 0.612  | 0.576   | 0.235   | 0.563     | 0.576
Support Vector Machine (SVM) | 0.640   | 0.235   | 0.614     | 0.640  | 0.638   | 0.236   | 0.637     | 0.638
Classifier Evaluation Metrics: Precision and Recall
• Precision: exactness - what % of tuples that the classifier labeled as positive are actually positive?
$precision = \frac{TP}{TP + FP}$
• Recall: completeness - what % of positive tuples did the classifier label as positive?
$recall = \frac{TP}{TP + FN}$
• The perfect score is 1.0
• There is an inverse relationship between precision and recall

Confusion matrix:
Actual class \ Predicted class | C1                   | ~C1
C1                             | True Positives (TP)  | False Negatives (FN)
~C1                            | False Positives (FP) | True Negatives (TN)
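A quick worked example with hypothetical counts (TP = 90, FP = 60, FN = 10):

```latex
\[
precision = \frac{TP}{TP+FP} = \frac{90}{90+60} = 0.60,
\qquad
recall = \frac{TP}{TP+FN} = \frac{90}{90+10} = 0.90
\]
```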
Basic Measures for Text Retrieval
[Diagram: within all documents, the sets {Relevant} and {Retrieved} overlap in {Relevant & Retrieved}.]
• Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses)
$precision = \frac{|\{Relevant\} \cap \{Retrieved\}|}{|\{Retrieved\}|}$
• Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved
$recall = \frac{|\{Relevant\} \cap \{Retrieved\}|}{|\{Relevant\}|}$
References
• [1] Sentiment Analysis, https://devopedia.org/sentiment-analysis
• [2] Indonesians optimistic about their country's democracy and economy as elections near, April 4, 2019, https://www.pewresearch.org/fact-tank/2019/04/04/indonesians-optimistic-about-their-countrys-democracy-and-economy-as-elections-near/
• [3] Jiawei Han, Micheline Kamber, and Jian Pei, Data Mining: Concepts and Techniques, 3rd edition, Morgan Kaufmann, 2011.
• [4] An Overview of Sentiment Analysis, https://jgateplus.com/home/2019/01/16/an-overview-of-sentiment-analysis/
How to Assign Weights
• Two-fold heuristics based on frequency:
• TF (Term Frequency)
• More frequent within a document → more relevant to its semantics
• e.g., "query" vs. "commercial"
• IDF (Inverse Document Frequency)
• Less frequent among documents → more discriminative
• e.g., "algebra" vs. "science"
TF Weighting
• Weighting: more frequent => more relevant to the topic
• e.g., "query" vs. "commercial"
• Raw TF = f(t,d): how many times term t appears in doc d
• Normalization: document length varies => relative frequency preferred
• e.g., maximum frequency normalization (see below)
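The slides do not spell out the normalization; one common form of maximum frequency normalization (an assumption here, not necessarily the variant the authors intended) is:

```latex
\[
TF(t,d) = 0.5 + 0.5 \times \frac{f(t,d)}{\max_{t' \in d} f(t',d)}
\]
```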
IDF Weighting
• Idea: less frequent among documents → more discriminative
• Formula (a standard form):
$IDF(t) = \log\frac{n}{k}$
• n - total number of docs
• k - number of docs in which term t appears (the document frequency, DF)
TF-IDF Weighting
• TF-IDF weighting: weight(t, d) = TF(t, d) × IDF(t)
• Frequent within a doc → high TF → high weight
• Selective among docs → high IDF → high weight
• Recall the vector space (VS) model:
• Each selected term represents one dimension
• Each doc is represented by a feature vector
• The t-term coordinate of document d is the TF-IDF weight
• This is more reasonable
• Just for illustration…
• Many more complex and effective weighting variants exist in practice
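A quick worked example with hypothetical numbers: term t appears 3 times in doc d and occurs in k = 10 of n = 1000 documents (base-10 log):

```latex
\[
weight(t,d) = TF(t,d) \times IDF(t) = 3 \times \log_{10}\frac{1000}{10} = 3 \times 2 = 6
\]
```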
Regex
• https://www.w3schools.com/python/python_regex.asp
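For comparison with the Python tutorial linked above, a minimal regex tokenization sketch in Java (the pattern is our own simplification of the delimiter set used earlier):

```java
// Sketch: split review text into lower-case word tokens with a regex.
import java.util.Arrays;

public class RegexTokenize {
    public static void main(String[] args) {
        String review = "Great read!! 5 stars (loved it)...";
        // Split on runs of anything that is not a lower-case letter
        String[] tokens = review.toLowerCase().split("[^a-z]+");
        System.out.println(Arrays.toString(tokens)); // [great, read, stars, loved, it]
    }
}
```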