NLP Presentation
- Our project addresses the challenge of bilingual comments, where people
tweet in a mix of two languages, in this case Hindi and English, written in
the English alphabet.
UE17CS333-PROJECT_2020 2
UNIQUENESS AND ANALYSIS
- We created an aggregated (ensemble) model that combines all the classifiers
used during the process. The ensemble worked to our advantage: it achieved
one of the highest accuracies among the classifiers we compared.
- Using both platforms together increased our accuracy significantly
compared with using either one individually.
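A minimal sketch of how a voting-based ensemble can combine its members, assuming hard (majority) voting and assuming each classifier is a plain callable that maps a tweet to a label. The names `majority_vote` and `ensemble_predict` are hypothetical and are not taken from the project code.

```python
from collections import Counter
from typing import Callable, Sequence

# Assumption for this sketch: a classifier is any callable mapping a tweet to a label.
Classifier = Callable[[str], str]


def majority_vote(labels: Sequence[str]) -> str:
    """Return the most common label among the individual predictions."""
    if not labels:
        raise ValueError("need at least one prediction")
    # most_common(1) returns [(label, count)]; on a tie the label seen first wins.
    return Counter(labels).most_common(1)[0][0]


def ensemble_predict(classifiers: Sequence[Classifier], tweet: str) -> str:
    """Ask every classifier for a label and return the majority decision."""
    return majority_vote([clf(tweet) for clf in classifiers])


# Usage with three toy classifiers standing in for trained models.
toy_classifiers = [
    lambda tweet: "positive",
    lambda tweet: "negative",
    lambda tweet: "positive" if "good" in tweet else "negative",
]
prediction = ensemble_predict(toy_classifiers, "movie was good")
```

With an odd number of binary classifiers, a majority always exists, so the tie rule never comes into play.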
DATASET SOURCE
- We used the Sentiment140 dataset, obtained from Kaggle.
- It contains 1,600,000 tweets extracted using the Twitter API. The tweets
have been annotated (0 = negative, 4 = positive) and can be used to
detect sentiment.
- The raw text in the tweet column was not usable as-is and had to be cleaned
and tokenized. We also limited the dataset to 40,000 tweets.
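As an illustration, the loading step could look roughly like the sketch below. It relies on the six-field Sentiment140 row layout (target, ids, date, flag, user, text) and keeps only the tweet and its sentiment. How the 40,000 tweets were selected is not stated here, so taking the first usable rows is purely an assumption; `read_tweets` is a hypothetical helper name.

```python
import csv
import io
from itertools import islice
from typing import Iterable, Iterator, Tuple

# Sentiment140 rows have six fields: target, ids, date, flag, user, text.
TARGET_COL, TEXT_COL = 0, 5
LABELS = {"0": "negative", "4": "positive"}


def read_tweets(lines: Iterable[str], limit: int = 40_000) -> Iterator[Tuple[str, str]]:
    """Yield (text, label) pairs, keeping only the tweet and its sentiment."""
    rows = csv.reader(lines)
    pairs = (
        (row[TEXT_COL], LABELS[row[TARGET_COL]])
        for row in rows
        if len(row) > TEXT_COL and row[TARGET_COL] in LABELS
    )
    # Assumption: keep the first `limit` usable rows; the selection rule is not given.
    return islice(pairs, limit)


# Usage on an in-memory sample; a real run would pass an open file object instead.
sample = io.StringIO(
    '"0","1","Mon Apr 06","NO_QUERY","user1","so sad today"\n'
    '"4","2","Mon Apr 06","NO_QUERY","user2","love this"\n'
    '"4","3","Mon Apr 06","NO_QUERY","user3","great day"\n'
)
tweets = list(read_tweets(sample, limit=2))
```

A real run would probably shuffle the rows or sample per class before applying the limit, so that both labels are represented in the subset.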
DATASET PREPROCESSING
- We kept only the columns relevant to our study: the tweet text and its
associated sentiment.
- Emoticons were converted into the emotion they signify, while emojis were
removed.
- We also expanded contractions; for example, “Can’t” was changed to
“Can not”.
- We removed numbers, URLs, HTML tags, symbols, and mentions (the “@” symbol
followed by the account handle).
- These cleaning steps were important for the study to work effectively.
Finally, we converted the cleaned tweets to lowercase for simplicity.
- We focused on certain features (adjectives, abstract nouns, and adverbs)
and removed the remaining words, as they added little value for sentiment.
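The preprocessing steps above can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: the emoticon and contraction tables are tiny hypothetical examples, the function names are made up, and the feature filter expects tokens that already carry Penn Treebank part-of-speech tags. Standard tag sets do not mark abstract nouns, so the sketch keeps all common nouns as an approximation.

```python
import html
import re
from typing import Iterable, List, Tuple

# Small illustrative lookup tables (assumptions; the project's real lists are not shown).
EMOTICONS = {":)": "happy", ":-)": "happy", ":(": "sad", ":-(": "sad"}
CONTRACTIONS = {"can't": "can not", "won't": "will not", "don't": "do not"}
CONTRACTION_RES = [
    (re.compile(r"\b" + re.escape(short) + r"\b"), full)
    for short, full in CONTRACTIONS.items()
]

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HTML_RE = re.compile(r"<[^>]+>")
MENTION_RE = re.compile(r"@\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")  # digits, symbols and emojis once lowercased

# Penn Treebank tags for adjectives, adverbs and common nouns.
KEEP_TAGS = {"JJ", "JJR", "JJS", "RB", "RBR", "RBS", "NN", "NNS"}


def clean_tweet(text: str) -> str:
    """Apply the cleaning steps in order and return a lowercase, space-joined tweet."""
    text = html.unescape(text)            # turn entities such as &amp; into characters
    text = URL_RE.sub(" ", text)
    text = HTML_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    # Emoticons are replaced before symbols are stripped, otherwise they would be lost.
    for emoticon, meaning in EMOTICONS.items():
        text = text.replace(emoticon, f" {meaning} ")
    text = text.lower().replace("’", "'")  # normalise curly apostrophes
    for pattern, full in CONTRACTION_RES:
        text = pattern.sub(full, text)
    text = NON_LETTER_RE.sub(" ", text)
    return " ".join(text.split())


def keep_sentiment_words(tagged: Iterable[Tuple[str, str]]) -> List[str]:
    """Keep only words whose POS tag marks an adjective, adverb or common noun."""
    return [word for word, tag in tagged if tag in KEEP_TAGS]


cleaned = clean_tweet("@user I Can’t wait :) http://t.co/x 123 <b>WOW</b>!!")
kept = keep_sentiment_words(
    [("movie", "NN"), ("was", "VBD"), ("really", "RB"), ("good", "JJ")]
)
```

The order of operations matters: URLs and mentions are removed first so that their punctuation is not mistaken for emoticons, and lowercasing happens before the contraction lookup so that “Can’t” and “can't” are handled identically.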
LITERATURE REVIEW - TABLE 1
Paper | Title | Authors | Methodology Used
Paper 3 | A Dataset of Hindi-English Code-Mixed Social Media Text for Hate Speech Detection | Aditya Bohra, Deepanshu Vijay, Vinay Singh, Syed S. Akhtar and Manish Shrivastava | Builds a system that classifies a tweet containing a combination of Hindi and English as negative or not.
Paper 4 | Resource Creation for Hindi-English Code Mixed Social Media Text | Sakshi Gupta, Piyush Bansal and Radhika Mamidi | Proposes a method to aggregate data into a dataset of words that have a multilingual characteristic.
Paper 5 | Sentiment classification of Hinglish text | Kumar Ravi and Vadlamani Ravi | Uses different combinations of feature selection methods and a host of classifiers with a term frequency-inverse document frequency (TF-IDF) feature representation.
LITERATURE REVIEW - TABLE 2
Paper | Accuracy | Benefits | Drawbacks
Paper 1 | 90% | The strategy described is equally applicable to all Indian languages, as these are verb-ending languages and have a similar mixture of lexicons, as in the case of Hindi. | Elaborate testing is not possible as these languages are used in verbal communication.
Paper 2 | 69.7% | The Sub-Word LSTM interprets sentiment based on morpheme-like structures, and the results it produces are significantly better than the baselines. | The lexicon lookup approach did not perform well owing to the heavily misspelt words in the text, which led to incorrect transliterations.
Paper 3 | 71.7% | The features used in the classification system are character n-grams, word n-grams, punctuation, negation words and a hate lexicon, which are integrated in an SVM as the classification system. | The corpus was not annotated with part-of-speech tags at word level, which would have yielded better results.
Paper 4 | 89.94% | Uses an existing language identification system and improves a normalisation system, achieving a higher accuracy than the base system. | Does not take into consideration the sentence-level context for word disambiguation.
Paper 5 | AUC = 0.8601 | Proposes a triumvirate of TF-IDF, GR, and RBFNN, which is found to be the best combination for classifying sentiment expressed in Hinglish text. | Does not employ a sentence parser to consider the relation between different parts of speech in a sentence.
BLOCK DIAGRAM FOR IMPLEMENTATION
QUANTITY OF WORK – THE MAIN CODE MODULES
Sl. No. Code Module Description Status (% completed) Comments
QUALITY OF WORK – MILESTONES THAT ARE DONE AND WORKING
Serial no. | Milestone description | Status (% complete) | Comments
1. | Dataset Selection | 100% | A better dataset can be used.
2. | Preprocessing | 100% | Cleaning done efficiently.
3. | Feature Selection | 100% | Adjectives, Abstract Nouns, Adverbs
4. | Choice of Classifiers | 100% | 7 Classifiers chosen.
5. | Building Classifiers | 100% | Successfully built
6. | Training Classifiers | 100% | Trained on 85% data.
7. | Creation of Hybrid Model | 100% | Voting Based Ensemble Model.
8. | Translation Challenge | 100% | Google Translate Machine, TextBlob
9. | Creating a controller module | 100% | func module combines all functionality.
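The 85% training share listed for milestone 6 implies a 15% hold-out for testing. A minimal sketch of such a split is shown below, assuming a simple seeded random shuffle; whether the project shuffled or stratified the data is not stated, and `train_test_split` here is a hypothetical stdlib-only helper, not a library call.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_test_split(
    items: Sequence[T], train_fraction: float = 0.85, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of the data and cut it into train and test portions."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(items)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Usage: 100 dummy examples give 85 for training and 15 for testing.
train, test = train_test_split(list(range(100)))
```

Applied to 40,000 tweets, the same call would yield 34,000 training and 6,000 test examples.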
RESULTS OBTAINED - Accuracy
Comparison of Accuracies
Classifier Used | Accuracy (%)
SGD | 61.2397
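For reference, the three kinds of results reported in this section (accuracy, confusion matrix, and F1 score) can all be derived from the true and predicted labels. The sketch below assumes binary labels with "positive" as the positive class; the function names are hypothetical and the sample labels are invented for illustration, not project results.

```python
from typing import Dict, Sequence


def confusion_counts(
    y_true: Sequence[str], y_pred: Sequence[str], positive: str = "positive"
) -> Dict[str, int]:
    """Count true/false positives and negatives for a binary task."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def accuracy(c: Dict[str, int]) -> float:
    """Fraction of predictions that match the true label."""
    total = sum(c.values())
    return (c["tp"] + c["tn"]) / total if total else 0.0


def f1_score(c: Dict[str, int]) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    predicted_pos = c["tp"] + c["fp"]
    actual_pos = c["tp"] + c["fn"]
    precision = c["tp"] / predicted_pos if predicted_pos else 0.0
    recall = c["tp"] / actual_pos if actual_pos else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented example labels, only to show the calls.
y_true = ["positive", "positive", "negative", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "positive", "positive"]
counts = confusion_counts(y_true, y_pred)
```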
RESULTS OBTAINED - Confusion Matrix
For Hybrid Model:
RESULTS OBTAINED - F1 Score
Naive Bayes’ Classifier:
RESULTS OBTAINED - F1 Score
Multinomial Naive Bayes’ Classifier:
RESULTS OBTAINED - F1 Score
Stochastic Gradient Descent Classifier:
RESULTS OBTAINED - F1 Score
Maximum Entropy Classifier:
Hybrid Model:
OUR TOP THREE LEARNINGS IN THIS PROJECT
1. We became familiar with the usage and implementation of different
classifiers.
TOP CHALLENGES UNRESOLVED SO FAR
1. Test accuracy of the models stayed around 60%, even after several
attempts to improve it.
OUR GOING FORWARD PLAN (IF ANY)
1. Find a better dataset to work with.
2. Try more complex machine learning models for the classification of text.