
NLP Presentation


Bilingual Sentiment Analysis

Sakshi Goel PES1201700148
Suhail Rahman PES1201701420

UE17CS333 Project Submission
ABOUT THE PROJECT
- The main aim of the project is to develop a sentiment analyzer that classifies
Twitter data as positive or negative.

- The project addresses the challenge of bilingual tweets, in which people mix
two languages (here, Hindi and English) and write both in the English alphabet.

UE17CS333-PROJECT_2020 2
UNIQUENESS AND ANALYSIS
- We created an aggregated (ensemble) model consisting of all the classifiers
used during the process. The ensemble worked to our advantage: as the accuracy
results show, it achieved one of the highest accuracies among the classifiers.

- When a sentence is in Hindi, we use Google Translate to convert it directly
to English. If the sentence is a combination of Hindi and English, we use
TextBlob to identify that.

- Combining the two tools increased our accuracy significantly compared with
using either of them individually.
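
The routing between pure-Hindi, mixed, and English text can be sketched as follows. This is only an illustration under stated assumptions, not the project's code: `HINDI_WORDS`, `hindi_ratio`, `route`, and `toy_translate` are hypothetical names, the toy lexicon stands in for the language identification done with TextBlob, and the `translate` callable stands in for Google Translate. Translating only the Hindi tokens of a mixed sentence is also an assumption.

```python
from typing import Callable

# Hypothetical toy lexicon of romanized Hindi words (illustration only).
HINDI_WORDS = {"bahut", "accha", "nahi", "hai", "kya", "mujhe", "pasand"}

# Hypothetical stand-in for a real translation service.
TOY_DICT = {"bahut": "very", "accha": "good", "hai": "is"}


def toy_translate(text: str) -> str:
    """Word-by-word lookup; unknown words are passed through unchanged."""
    return " ".join(TOY_DICT.get(w.lower(), w) for w in text.split())


def hindi_ratio(text: str) -> float:
    """Fraction of tokens that appear in the toy Hindi lexicon."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in HINDI_WORDS for t in tokens) / len(tokens)


def route(text: str, translate: Callable[[str], str]) -> str:
    """Return English text that is ready to be classified."""
    ratio = hindi_ratio(text)
    if ratio == 0.0:
        return text  # already English, nothing to translate
    if ratio == 1.0:
        return translate(text)  # whole sentence is Hindi
    # Mixed sentence: translate only the tokens recognized as Hindi.
    return " ".join(
        translate(tok) if tok.lower() in HINDI_WORDS else tok
        for tok in text.split()
    )
```

For example, `route("movie bahut accha", toy_translate)` leaves the English word in place and translates the two Hindi words.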

DATASET SOURCE
- We used the Sentiment140 dataset, obtained from Kaggle.

- It contains 1,600,000 tweets extracted using the Twitter API. The tweets
have been annotated (0 = negative, 4 = positive) and can be used to
detect sentiment.

- The two columns that we mainly need are:
  - The Label
  - The Tweet
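
Keeping only the label and the tweet can be sketched as below. This is a minimal sketch, assuming the commonly distributed Sentiment140 CSV layout (target, id, date, flag, user, text, with no header row). The two sample rows are made up for illustration, and `load_label_and_tweet` and `LABEL_NAMES` are hypothetical names.

```python
import csv
import io

# Two made-up rows in the assumed column order:
# target, id, date, flag, user, text
SAMPLE = (
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","I am so sad today"\n'
    '"4","2","Mon Apr 06 22:20:00 PDT 2009","NO_QUERY","user_b","I love this song"\n'
)

# Annotation scheme stated in the slides: 0 = negative, 4 = positive.
LABEL_NAMES = {"0": "negative", "4": "positive"}


def load_label_and_tweet(handle):
    """Yield (label, tweet) pairs, dropping the columns that are not needed."""
    for row in csv.reader(handle):
        target, text = row[0], row[5]
        yield LABEL_NAMES[target], text


pairs = list(load_label_and_tweet(io.StringIO(SAMPLE)))
```

With the real file, an open file handle would replace the `io.StringIO` object.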

DATASET SOURCE
- The raw text in the Tweet column was not directly usable and had to be
cleaned and tokenized. We also limited the number of tweets to 40,000.

DATASET PREPROCESSING
- We chose the columns relevant to our study: the tweet and its associated
sentiment.

- Emoticons were converted into the emotion they signify, while emojis were
removed.

- We also expanded contractions; for example, “Can’t” was changed to
“Can not”.
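
The emoticon and contraction steps might look like the sketch below. The lookup tables are hypothetical (the slides do not list the project's actual mappings), and the emoji pattern covers only some common code point ranges, so it is an approximation.

```python
import re

# Hypothetical lookup tables; the project's actual lists are not given.
EMOTICONS = {":)": "happy", ":-)": "happy", ":(": "sad", ":-(": "sad", ":D": "laugh"}
CONTRACTIONS = {"can't": "can not", "won't": "will not", "don't": "do not", "i'm": "i am"}

# Matches some common emoji code point ranges (not exhaustive).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def normalize(tweet: str) -> str:
    """Replace emoticons with words, drop emojis, and expand contractions."""
    tweet = tweet.replace("’", "'")  # treat curly apostrophes like straight ones
    tweet = EMOJI_RE.sub("", tweet)
    words = []
    for token in tweet.split():
        if token in EMOTICONS:
            words.append(EMOTICONS[token])
        else:
            # Expanded forms come out lowercase; other tokens are unchanged.
            words.append(CONTRACTIONS.get(token.lower(), token))
    return " ".join(words)
```

Expanding contractions before stripping punctuation matters, because removing the apostrophe first would turn "can't" into "cant" and lose the negation.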

DATASET PREPROCESSING
- We removed numbers, URLs, HTML tags, symbols, and mentions (the “@” symbol
followed by the account handle).

- These data-cleaning steps were important for the study to function
effectively. Finally, the cleaned tweets were converted to lowercase for
simplicity.

- We focused on certain features (adjectives, abstract nouns and adverbs) and
removed the remaining words, as they did not add any value to the sentiment.
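
The cleaning steps listed above can be sketched with regular expressions. This is one possible implementation under assumptions, not the project's code: the patterns and the name `clean` are illustrative, and the exact rules used in the project are not given in the slides.

```python
import html
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")  # web links
TAG_RE = re.compile(r"<[^>]+>")                # HTML tags
MENTION_RE = re.compile(r"@\w+")               # "@" followed by the account handle
NON_LETTER_RE = re.compile(r"[^a-z\s]")        # digits and symbols (after lowercasing)


def clean(tweet: str) -> str:
    """Strip URLs, HTML tags, mentions, numbers and symbols, then lowercase."""
    text = html.unescape(tweet)       # turn entities such as &amp; into characters
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.lower()
    text = NON_LETTER_RE.sub(" ", text)
    return " ".join(text.split())     # collapse repeated whitespace
```

URLs and mentions are removed before the symbol filter so that fragments such as "http" or the handle itself do not survive as ordinary words.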

LITERATURE REVIEW - TABLE 1
Paper 1
  Title: Machine translation of bi-lingual Hindi-English (Hinglish) text
  Authors: R. Mahesh K. Sinha, Anil Thakur
  Methodology Used: Makes use of a system designed specifically to separate
  out the Hindi and English parts of a word that has a combination of the two.

Paper 2
  Title: Towards Sub-Word Level Compositions for Sentiment Analysis of
  Hindi-English Code Mixed Text
  Authors: Aditya Joshi, Ameya Prabhu Pandurang, Manish Shrivastava and
  Vasudeva Varma
  Methodology Used: Introduces a constantly learning sub-word level
  representation in an LSTM (Subword-LSTM) architecture instead of
  character-level or word-level representations.

LITERATURE REVIEW - TABLE 1
Paper 3
  Title: A Dataset of Hindi-English Code-Mixed Social Media Text for Hate
  Speech Detection
  Authors: Aditya Bohra, Deepanshu Vijay, Vinay Singh, Syed S. Akhtar and
  Manish Shrivastava
  Methodology Used: Makes use of a system that classifies a tweet containing
  a combination of Hindi and English as negative or not.

Paper 4
  Title: Resource Creation for Hindi-English Code Mixed Social Media Text
  Authors: Sakshi Gupta, Piyush Bansal and Radhika Mamidi
  Methodology Used: Proposes a method to aggregate data into a dataset of
  words that have a multilingual characteristic.

Paper 5
  Title: Sentiment classification of Hinglish text
  Authors: Kumar Ravi and Vadlamani Ravi
  Methodology Used: Made use of different combinations of feature selection
  methods and a host of classifiers, using a term frequency-inverse document
  frequency feature representation.

LITERATURE REVIEW - TABLE 2
Paper 1
  Accuracy: 90%
  Benefits: The strategy described is equally applicable to all Indian
  languages, as these are verb-ending languages and have a similar mixture of
  lexicons as in the case of Hindi.
  Drawbacks: Elaborate testing is not possible, as these languages are used
  in verbal communication.

Paper 2
  Accuracy: 69.7%
  Benefits: Sub-Word LSTM interprets sentiment based on morpheme-like
  structures, and the results produced are significantly better than the
  baselines.
  Drawbacks: The lexicon lookup approach did not perform well owing to the
  heavily misspelt words in the text, which led to incorrect transliterations.

LITERATURE REVIEW - TABLE 2
Paper 3
  Accuracy: 71.7%
  Benefits: The features used in the classification system are character
  n-grams, word n-grams, punctuation, negation words and a hate lexicon,
  which are integrated in an SVM as the classification system.
  Drawbacks: The corpus was not annotated with part-of-speech tags at word
  level, which would have yielded better results.

Paper 4
  Accuracy: 89.94%
  Benefits: They used an existing language identification system and improved
  a normalisation system, achieving a higher accuracy than the base system.
  Drawbacks: Did not take into consideration the sentence-level context for
  word disambiguation.

Paper 5
  Accuracy: AUC = 0.8601
  Benefits: Proposed a triumvirate of TF-IDF, GR, and RBFNN, which was found
  to be the best combination for classifying sentiment expressed in Hinglish
  text.
  Drawbacks: Did not employ a sentence parser to consider the relations
  between different parts of speech in a sentence.

BLOCK DIAGRAM FOR IMPLEMENTATION

QUANTITY OF WORK – THE MAIN CODE MODULES
Sl. No.  Code Module Description     Status (% completed)  Comments
1.       func(test_text)             100%                  The master module
2.       hinglish(test_text)         100%                  Takes care of text translation
3.       text_classify(text)         100%                  Classifies text using all 8 models
4.       hybrid(test_set_formatted)  100%                  Builds the hybrid model classifier
5.       features(test_text)         100%                  Filters features from the text
6.       start(text)                 100%                  Preprocessing module
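
The feature-filtering step (adjectives, abstract nouns and adverbs) could be sketched as below. This is not the project's `features` module: the function `select_features` is hypothetical, it assumes tokens already carry Penn Treebank part-of-speech tags (for example from a tagger such as NLTK's `pos_tag`; the slides do not say which tagger was used), and because a POS tag alone cannot tell abstract nouns apart, a small hypothetical word list stands in for that check.

```python
# Penn Treebank tag prefixes: JJ* = adjectives, RB* = adverbs.
KEEP_PREFIXES = ("JJ", "RB")

# Hypothetical list of abstract nouns (illustration only).
ABSTRACT_NOUNS = {"love", "hate", "joy", "fear", "hope"}


def select_features(tagged_tokens):
    """Keep adjectives, adverbs, and nouns found in the abstract-noun list."""
    features = []
    for word, tag in tagged_tokens:
        word = word.lower()
        is_adj_or_adv = tag.startswith(KEEP_PREFIXES)
        is_abstract_noun = tag.startswith("NN") and word in ABSTRACT_NOUNS
        if is_adj_or_adv or is_abstract_noun:
            features.append(word)
    return features


# Hand-tagged example sentence: "the movie was really good love".
tagged = [("the", "DT"), ("movie", "NN"), ("was", "VBD"),
          ("really", "RB"), ("good", "JJ"), ("love", "NN")]
kept = select_features(tagged)
```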

QUALITY OF WORK – MILESTONES THAT ARE DONE AND WORKING
Serial no  Milestone description         Status (% complete)  Comments
1.         Dataset Selection             100%                 A better dataset can be used.
2.         Preprocessing                 100%                 Cleaning done efficiently.
3.         Feature Selection             100%                 Adjectives, Abstract Nouns, Adverbs
4.         Choice of Classifiers         100%                 7 Classifiers chosen.
5.         Building Classifiers          100%                 Successfully built
6.         Training Classifiers          100%                 Trained on 85% data.
7.         Creation of Hybrid Model      100%                 Voting Based Ensemble Model.
8.         Translation Challenge         100%                 Google Translate, TextBlob
9.         Creating a controller module  100%                 func module combines all functionality.
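
A voting-based ensemble of the kind named in milestone 7 can be sketched as a simple majority vote. The names below (`majority_vote`, `stubs`) are hypothetical, and the stand-in classifiers are plain functions rather than the trained models used in the project.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# A classifier is modelled here as any function mapping text to a label.
Classifier = Callable[[str], str]


def majority_vote(classifiers: Sequence[Classifier], text: str) -> Tuple[str, float]:
    """Return the most common label and the fraction of classifiers that chose it."""
    votes: List[str] = [clf(text) for clf in classifiers]
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


# Seven stand-in classifiers: five vote "positive", two vote "negative".
stubs = [lambda text: "positive"] * 5 + [lambda text: "negative"] * 2
label, agreement = majority_vote(stubs, "what a lovely day")
```

With seven voters and two labels a tie cannot occur, which is one practical reason to use an odd number of classifiers in a binary voting ensemble.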

RESULTS OBTAINED - Accuracy
Classifier Used          Accuracy (%)
Naive Bayes              62.0729
Multinomial Naive Bayes  62.2062
Bernoulli Naive Bayes    62.2062
Logistic Regression      62.2562
SGD                      61.2397
SVC Classifier           61.3897
Max Entropy              61.3897
Hybrid Model             62.2563

Figure: Comparison of Accuracies (axes: Classifier, Accuracy)

RESULTS OBTAINED - Confusion Matrix
For Hybrid Model:
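
A confusion matrix for a binary sentiment classifier can be tallied as in the sketch below. The function names and the example labels are illustrative assumptions; they are not the project's code or results.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive="positive"):
    """Count true/false positives and negatives for a binary task."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            counts["tp" if pred == positive else "fn"] += 1
        else:
            counts["fp" if pred == positive else "tn"] += 1
    return counts


def accuracy(counts):
    """Share of predictions that match the true label."""
    total = counts["tp"] + counts["tn"] + counts["fp"] + counts["fn"]
    return (counts["tp"] + counts["tn"]) / total if total else 0.0


# Made-up labels for illustration only.
y_true = ["positive", "positive", "negative", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "positive", "positive"]
cm = confusion_counts(y_true, y_pred)
acc = accuracy(cm)
```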

RESULTS OBTAINED - F1 Score
Naive Bayes Classifier:

Bernoulli Naive Bayes Classifier:

RESULTS OBTAINED - F1 Score
Multinomial Naive Bayes Classifier:

Logistic Regression Classifier:

RESULTS OBTAINED - F1 Score
Stochastic Gradient Descent Classifier:

Support Vector Machines Classifier:

RESULTS OBTAINED - F1 Score
Maximum Entropy Classifier:

Hybrid Model:
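
For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes all three from confusion-matrix counts; the function name and the example counts are illustrative and are not the project's reported figures.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts: 2 true positives, 1 false positive, 1 false negative.
p, r, f1 = precision_recall_f1(tp=2, fp=1, fn=1)
```

Returning 0.0 when a denominator is zero is a convention chosen here to avoid division errors; other conventions exist.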

OUR TOP THREE LEARNINGS IN THIS PROJECT
1. We became familiar with the usage and implementation of different
classifiers.

2. We learned which classifiers work well on a certain type of data, and the
advantages and drawbacks of the classification models used.

3. We had the opportunity to create an ensemble model, which gave us the best
results we obtained.

TOP CHALLENGES UNRESOLVED SO FAR
1. Test accuracy of the models remained around 60%, even after several
efforts to increase it.

2. Translation uses two separate modules instead of one.

3. The dataset used for training could be better.

OUR GOING FORWARD PLAN (IF ANY)
1. Find a better dataset to work with.

2. Try more complex machine learning models for the classification of text.

3. Use better translation techniques.
