Pre-Processing
[Pipeline diagram: Tokenization → Filtering → Stemming → Conversion → pre-processed Tokens]
Data Pre-Processing (cont..)
Example of a pre-processed tweet (after tokenization, filtering and stemming):
great win Fabulous Innings another adopted home ground Excellent role played take India over the line
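Below is a minimal sketch of such a pre-processing pipeline in Python, assuming NLTK is available; the regular expression, stop-word list and PorterStemmer are illustrative stand-ins for whatever tools the project actually uses:

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# One-time downloads for the tokenizer and stop-word list
nltk.download("punkt")
nltk.download("stopwords")

def preprocess(tweet):
    # Conversion: lowercase and strip user mentions, URLs and punctuation
    tweet = tweet.lower()
    tweet = re.sub(r"@\w+|https?://\S+|[^a-z\s]", " ", tweet)
    # Tokenization: split the cleaned text into tokens
    tokens = nltk.word_tokenize(tweet)
    # Filtering: drop stop words
    tokens = [t for t in tokens if t not in stopwords.words("english")]
    # Stemming: reduce each remaining word to its stem
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]

print(preprocess("Fabulous innings! A great win at another adopted home ground."))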
Feature Extraction
Feature extraction is the selection of useful words from the tweets in the pre-processed data set. In the feature extraction step, we extract the aspects (features) from the pre-processed Twitter dataset.
There are different ways of extracting features: unigrams, bigrams and, in general, n-grams.
For example: "she is not bad."
If the word 'bad' occurs, the sentiment is not necessarily negative. If we consider 2-grams, the feature 'not bad' is also taken into account, i.e. the statement is most likely a positive one. Therefore, using n-grams as features in classification can improve the result.
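A small sketch of n-gram feature extraction in plain Python; the helper name and the example sentence are illustrative, not part of the original pipeline:

def ngrams(tokens, n):
    # Return the list of n-grams (as tuples) over a token sequence
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "she is not bad".split()
print(ngrams(tokens, 1))   # unigrams: ('she',), ('is',), ('not',), ('bad',)
print(ngrams(tokens, 2))   # bigrams:  ('she', 'is'), ('is', 'not'), ('not', 'bad')
# The bigram ('not', 'bad') lets the classifier see the negated phrase
# as a single feature instead of the negative-looking unigram 'bad'.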
Part-of-Speech (POS) tags such as adjectives, adverbs, verbs and nouns are good indicators of subjectivity and sentiment, which helps determine the polarity of the tweet.
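A small POS-tagging sketch using NLTK's default tagger (the sentence is illustrative; any tagger that emits adjective/adverb/verb/noun tags would serve):

import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("Fabulous innings, what a great win!")
print(nltk.pos_tag(tokens))
# Adjectives (tagged JJ) such as 'Fabulous' and 'great' are strong sentiment cues.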
Negation is a very important and difficult feature to interpret. The presence of a negation in a tweet can flip the polarity of the sentiment.
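One common way to encode negation (an assumed technique, not necessarily the one used in these slides) is to mark the words that follow a negation term:

NEGATIONS = {"not", "no", "never", "n't"}

def mark_negation(tokens):
    # Prefix tokens that follow a negation word with 'NOT_' so the
    # classifier sees 'NOT_bad' as a different feature from 'bad'.
    marked, negate = [], False
    for tok in tokens:
        if tok in NEGATIONS:
            negate = True
            marked.append(tok)
        else:
            marked.append("NOT_" + tok if negate else tok)
    return marked

print(mark_negation("she is not bad".split()))
# ['she', 'is', 'not', 'NOT_bad']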
Naïve Bayes
Naïve Bayes is a machine-learning-based probabilistic approach to sentiment analysis. It is based on Bayes' theorem with an assumption of independence among the predictors (features).
[Frequency table of weather conditions (e.g. Sunny, Rainy) against the Yes / No label; the relevant counts are quoted below.]
If we want to calculate the probability of Yes given that it is a sunny day, then:
=> P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny).
Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes) = 9/14 = 0.64.
Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 ≈ 0.60, which is higher than P(No | Sunny), so the prediction for a sunny day is Yes.
In our case, we have training and test data. The training data is used to generate the features that train the algorithm. Assume we have n tweets, of which k are positive and (n - k) are negative.
We are classifying the tweets into two classes (positive / negative).
Example: @xyz when you are happy , you look beautiful !!!
@xyz I am sad .
A = P(positive | tweet) = P(tweet | positive) * P(positive) / P(tweet)
B = P(negative | tweet) = P(tweet | negative) * P(negative) / P(tweet)
Dropping the common denominator P(tweet) and assuming the words are independent:
P(positive | tweet) ∝ P(happy | positive) * P(beautiful | positive) * P(positive)
Similarly,
P(negative | tweet) ∝ P(sad | negative) * P(negative)
Feature counts from the training data:
Features     Positive    Negative
beautiful        4           1
sad              2           5
happy            3           0
total            9           6
If (A > B) then the tweet is positive; otherwise, the tweet is negative.
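A minimal sketch of this Naive Bayes decision in Python, using the counts from the table above. The equal class priors and the add-one (Laplace) smoothing are assumptions for illustration; the slides themselves do not specify either:

# Feature counts copied from the table above
counts = {
    "beautiful": {"positive": 4, "negative": 1},
    "sad":       {"positive": 2, "negative": 5},
    "happy":     {"positive": 3, "negative": 0},
}
totals = {"positive": 9, "negative": 6}
priors = {"positive": 0.5, "negative": 0.5}   # assumed equal priors (k/n in general)

def class_score(tweet_words, label):
    # P(tweet | label) * P(label) with add-one smoothing, so a zero count
    # (e.g. 'happy' never seen as negative) does not wipe out the class.
    vocab_size = len(counts)
    score = priors[label]
    for word in tweet_words:
        if word in counts:
            score *= (counts[word][label] + 1) / (totals[label] + vocab_size)
    return score

tweet = ["happy", "beautiful"]            # "@xyz when you are happy, you look beautiful !!!"
A = class_score(tweet, "positive")
B = class_score(tweet, "negative")
print("positive" if A > B else "negative")   # -> positive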
SUPPORT VECTOR MACHINE
1. "Support Vector Machine" (SVM) is a supervised machine learning algorithm. It is used for both classification and regression.
2. SVM is a supervised learning method that sorts data into two categories.
3. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples.
HOW DOES IT WORK?
SCENARIO ONE
Here we have three hyperplanes (A, B and C). We need to identify the right hyperplane to classify star and circle.
HOW DOES IT WORK?
SCENARIO TWO
Here we have three hyperplanes (A, B and C) and all of them segregate the classes well. So how can we identify the right hyperplane? The right hyperplane is the one that maximizes the margin, i.e. the distance to the nearest data point of either class.
Sentiment Analysis using SVM
Sentiment analysis is treated as a classification task, as it classifies the orientation of a text into positive or negative.
             POSITIVE    NEGATIVE
POSITIVE       3000          0
NEGATIVE        900          0
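A minimal sketch of sentiment classification with a linear SVM, assuming scikit-learn and a tiny illustrative training set (the tweets and labels here are placeholders, not the project's corpus):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder training tweets; a real run would use the labelled corpus.
train_tweets = [
    "what a great win, fabulous innings",
    "so happy, you look beautiful",
    "this is a sad day",
    "terrible performance, very disappointing",
]
train_labels = ["positive", "positive", "negative", "negative"]

# TF-IDF features (uni- and bigrams) feeding a linear SVM classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(train_tweets, train_labels)

print(model.predict(["fabulous win today", "such a sad innings"]))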
CALCULATING ACCURACY
Accuracy is calculated as the number of correctly predicted reviews divided by the total number of reviews present in the corpus. The formula for calculating accuracy is given as:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Confusion Matrix
TP=TRUE POSITIVE
TN=TRUE NEGATIVE
FP=FALSE POSITIVE
FN=FALSE NEGATIVE
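As a worked example, assuming the matrix above reads actual class by row and predicted class by column (i.e. TP = 3000, FN = 0, FP = 900, TN = 0):
Accuracy = (3000 + 0) / (3000 + 0 + 900 + 0) = 3000 / 3900 ≈ 0.77, i.e. about 77% of the reviews are predicted correctly.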
A study by Twitter in 2015 showed that 15% of tweets during TV prime time contain at least one emoji, which is why emojis are a major factor to consider. The polarity of an emoticon is based on the score it carries. The polarity of a tweet is the sum of the polarity of the textual part and the emoticon part. Following is a list of some of the emoticons along with their scores:
As we can see, the scores of the negative emoticons are already negative, so when they are added to the polarity of the textual part of the tweet, the polarity of the tweet changes accordingly.
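A minimal sketch of combining the two parts; the emoticon scores below are made up purely for illustration, the actual values are the ones listed on the slide:

# Illustrative emoticon scores; the real values come from the slide's score table.
EMOTICON_SCORES = {":)": 1.0, ":D": 2.0, ":(": -1.0, ":'(": -2.0}

def tweet_polarity(text_polarity, tweet):
    # Total polarity = polarity of the textual part + sum of emoticon scores.
    emoticon_part = sum(score for emo, score in EMOTICON_SCORES.items() if emo in tweet)
    return text_polarity + emoticon_part

# A mildly positive text pulled negative by a crying emoticon:
print(tweet_polarity(0.5, "great game :'("))   # 0.5 + (-2.0) = -1.5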