NLP Presentation

Bilingual Sentiment Analysis
UE17CS333 Project Submission

Sakshi Goel PES1201700148
Suhail Rahman PES1201701420

ABOUT THE PROJECT
- The main aim of the project is to develop a sentiment analyzer that classifies
Twitter data as positive or negative.

- Our project handles the challenge of bilingual comments, where people tweet in
two languages, in this case Hindi and English, both written in the English alphabet.

UE17CS333-PROJECT_2020 2
UNIQUENESS AND ANALYSIS
- We created an aggregated (ensemble) model that combines all the classifiers
used during the process. The ensemble worked to our advantage: as the accuracy
results show, it achieved one of the highest accuracies of all the classifiers.

- When a sentence is in Hindi, we use Google Translate to convert it directly
to English. If the sentence is a mix of Hindi and English, we use TextBlob to
identify that.

- Using the two tools together increased our accuracy significantly compared
with using either one individually.
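
The voting idea behind the aggregated model can be sketched as follows. This is a minimal illustration and not the project's code: the three toy classifiers and the `majority_vote` helper are hypothetical names introduced here, and any callable that maps a text to a label could stand in for a trained model.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple

# A classifier here is any callable mapping a text to a label ("pos" or "neg").
Classifier = Callable[[str], str]


def majority_vote(classifiers: Sequence[Classifier], text: str) -> Tuple[str, float]:
    """Return the most common label and the fraction of classifiers that chose it."""
    votes = Counter(clf(text) for clf in classifiers)
    label, count = votes.most_common(1)[0]
    return label, count / len(classifiers)


# Toy stand-ins for trained models (hypothetical, for illustration only).
def always_pos(text: str) -> str:
    return "pos"


def always_neg(text: str) -> str:
    return "neg"


def keyword(text: str) -> str:
    return "pos" if "good" in text.lower() else "neg"


label, agreement = majority_vote([always_pos, always_neg, keyword], "a good day")
print(label, round(agreement, 2))  # pos 0.67
```

With two labels and an odd number of voters, a tie cannot occur, which is one reason to combine an odd number of classifiers.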

DATASET SOURCE
- We used the Sentiment140 dataset, obtained from Kaggle.

- It contains 1,600,000 tweets extracted using the Twitter API. The tweets
have been annotated (0 = negative, 4 = positive) and can be used to
detect sentiment.

- The two columns that we mainly need are as follows:

- The Label
- The Tweet
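
Selecting the two columns and mapping the numeric labels might look like the sketch below. It assumes the usual six-field Sentiment140 CSV layout (target, id, date, flag, user, text) with no header row. The helper name `load_label_and_tweet`, the optional `limit` parameter, and the two sample rows are invented for illustration.

```python
import csv
import io

# Sentiment140 annotates tweets with 0 (negative) or 4 (positive).
LABEL_NAMES = {"0": "negative", "4": "positive"}


def load_label_and_tweet(handle, limit=None):
    """Read CSV rows and keep only (label, tweet) pairs with a known label."""
    rows = []
    for record in csv.reader(handle):
        target, text = record[0], record[5]  # first field = label, last = tweet
        if target not in LABEL_NAMES:
            continue  # skip any row whose label is neither 0 nor 4
        rows.append((LABEL_NAMES[target], text))
        if limit is not None and len(rows) >= limit:
            break
    return rows


# Two made-up rows in the assumed layout, standing in for the real file.
sample = io.StringIO(
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","so tired today"\n'
    '"4","2","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","user_b","loving this weather"\n'
)
print(load_label_and_tweet(sample))
```

For the real file, the same function could be given an open file handle in place of the in-memory sample.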

DATASET SOURCE
- The raw Tweet column was not directly usable, so it had to be cleaned and
tokenized. We also limited the dataset to 40,000 tweets.

DATASET PREPROCESSING
- We kept only the columns relevant to our study: the tweet and its
associated sentiment.

- Emoticons were converted into words for the emotion they signify, while
emojis were removed.

- We also expanded contractions; for example, “Can’t” was changed to
“Can not”.
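
The emoticon and contraction steps above can be sketched with small lookup tables. The tables, the emoji character ranges, and the function names below are illustrative assumptions; the project's actual lists are not shown in the slides. Note that this sketch lowercases the expanded contraction.

```python
import re

# Illustrative lookup tables (assumed, not taken from the project).
EMOTICONS = {":)": "happy", ":-)": "happy", ":(": "sad", ":-(": "sad", ":D": "laugh"}
CONTRACTIONS = {"can't": "can not", "won't": "will not", "don't": "do not", "i'm": "i am"}

CONTRACTION_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)
# Rough emoji ranges; real emoji coverage is wider than this.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def expand_contractions(text: str) -> str:
    # Normalise the curly apostrophe so "Can’t" and "Can't" are treated alike.
    text = text.replace("\u2019", "'")
    return CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


def replace_emoticons(text: str) -> str:
    # Replace longer emoticons first, then collapse the extra spaces.
    for emoticon in sorted(EMOTICONS, key=len, reverse=True):
        text = text.replace(emoticon, f" {EMOTICONS[emoticon]} ")
    return " ".join(text.split())


def remove_emojis(text: str) -> str:
    return " ".join(EMOJI_RE.sub(" ", text).split())


print(replace_emoticons(expand_contractions("Can\u2019t wait for the weekend :)")))
# can not wait for the weekend happy
```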

DATASET PREPROCESSING
- We removed numbers, URLs, HTML tags, symbols, and user mentions (the “@”
symbol followed by the account handle).

- These cleaning steps were important for the study to work effectively.
Finally, the cleaned tweets were converted to lowercase for simplicity.

- We focused on certain features, namely adjectives, abstract nouns, and
adverbs; the remaining words were removed because they added little to
the sentiment.
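
A minimal sketch of the cleaning steps listed above, using only regular expressions. The patterns and the `clean_tweet` name are assumptions made for illustration. The part-of-speech filtering step is not shown, since it needs a trained tagger.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HTML_TAG_RE = re.compile(r"<[^>]+>")
MENTION_RE = re.compile(r"@\w+")
NON_LETTER_RE = re.compile(r"[^a-zA-Z\s]")  # numbers and symbols


def clean_tweet(text: str) -> str:
    """Strip URLs, HTML tags, mentions, numbers and symbols, then lowercase."""
    text = URL_RE.sub(" ", text)       # URLs first, before symbols are stripped
    text = HTML_TAG_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)   # "@" followed by the account handle
    text = NON_LETTER_RE.sub(" ", text)
    return " ".join(text.lower().split())


print(clean_tweet("@ravi Check <b>this</b> out: https://t.co/abc123 100% awesome!!!"))
# check this out awesome
```

The order matters: URLs and mentions are removed before the symbol pass, otherwise fragments such as "https" or the handle itself would survive as ordinary words.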

LITERATURE REVIEW - TABLE 1
Paper 1
- Title: Machine translation of bi-lingual Hindi-English (Hinglish) text
- Authors: R. Mahesh K. Sinha, Anil Thakur
- Methodology Used: Uses a system designed specifically to separate out the
Hindi and English parts of a word that combines the two.

Paper 2
- Title: Towards Sub-Word Level Compositions for Sentiment Analysis of
Hindi-English Code Mixed Text
- Authors: Aditya Joshi, Ameya Prabhu Pandurang, Manish Shrivastava and
Vasudeva Varma
- Methodology Used: Introduces a constantly learning sub-word level
representation in the LSTM (Subword-LSTM) architecture, instead of
character-level or word-level representations.

Paper 3
- Title: A Dataset of Hindi-English Code-Mixed Social Media Text for Hate
Speech Detection
- Authors: Aditya Bohra, Deepanshu Vijay, Vinay Singh, Syed S. Akhtar and
Manish Shrivastava
- Methodology Used: Uses a system that classifies a tweet combining Hindi
and English as negative or not.

Paper 4
- Title: Resource Creation for Hindi-English Code Mixed Social Media Text
- Authors: Sakshi Gupta, Piyush Bansal and Radhika Mamidi
- Methodology Used: Proposes a method to aggregate data into a dataset of
words that have a multilingual characteristic.

Paper 5
- Title: Sentiment classification of Hinglish text
- Authors: Kumar Ravi and Vadlamani Ravi
- Methodology Used: Uses different combinations of feature selection
methods and a host of classifiers, with a term frequency-inverse document
frequency (TF-IDF) feature representation.

LITERATURE REVIEW - TABLE 2
Paper 1
- Accuracy: 90%
- Benefits: The strategy described is equally applicable to all Indian
languages, as these are verb-ending languages and have a similar mixture of
lexicons to Hindi.
- Drawbacks: Elaborate testing is not possible, as these languages are used
in verbal communication.

Paper 2
- Accuracy: 69.7%
- Benefits: Sub-Word LSTM interprets sentiment based on morpheme-like
structures, and the results it produces are significantly better than the
baselines.
- Drawbacks: The lexicon lookup approach did not perform well owing to the
heavily misspelt words in the text, which led to incorrect transliterations.

Paper 3
- Accuracy: 71.7%
- Benefits: The features used in the classification system are character
n-grams, word n-grams, punctuation, negation words and a hate lexicon, which
are integrated in an SVM as the classifier.
- Drawbacks: The corpus was not annotated with part-of-speech tags at word
level, which would have yielded better results.

Paper 4
- Accuracy: 89.94%
- Benefits: They used an existing language identification system and
improved a normalisation system, achieving a higher accuracy than the base
system.
- Drawbacks: Did not take the sentence-level context into consideration for
word disambiguation.

Paper 5
- Accuracy: AUC = 0.8601
- Benefits: Proposed a triumvirate of TF-IDF, GR (gain ratio), and RBFNN
(radial basis function neural network), which was found to be the best
combination for classifying sentiment expressed in Hinglish text.
- Drawbacks: Did not employ a sentence parser to consider the relation
between different parts of speech in a sentence.

BLOCK DIAGRAM FOR IMPLEMENTATION

QUANTITY OF WORK – THE MAIN CODE MODULES
Sl. No.  Code Module Description     Status (% completed)  Comments
1.       func(test_text)             100%                  The master module
2.       hinglish(test_text)         100%                  Takes care of text translation
3.       text_classify(text)         100%                  Classifies text using all 8 models
4.       hybrid(test_set_formatted)  100%                  Builds the hybrid model classifier
5.       features(test_text)         100%                  Filters features from the text
6.       start(text)                 100%                  Preprocessing module

QUALITY OF WORK – MILESTONES THAT ARE DONE AND WORKING
Serial no  Milestone description         Status (% complete)  Comments
1.         Dataset Selection             100%                 A better dataset can be used.
2.         Preprocessing                 100%                 Cleaning done efficiently.
3.         Feature Selection             100%                 Adjectives, Abstract Nouns, Adverbs
4.         Choice of Classifiers         100%                 7 Classifiers chosen.
5.         Building Classifiers          100%                 Successfully built
6.         Training Classifiers          100%                 Trained on 85% data.
7.         Creation of Hybrid Model      100%                 Voting Based Ensemble Model.
8.         Translation Challenge         100%                 Google Translate Machine, TextBlob
9.         Creating a controller module  100%                 func module combines all functionality.
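
The 85% training split noted in the milestones can be illustrated with a simple shuffle-and-cut. This is a sketch under assumptions: the `train_test_split` helper, the fixed seed, and the use of a shuffle are illustrative choices, and the slides do not say how the split was actually drawn.

```python
import random


def train_test_split(items, train_fraction=0.85, seed=0):
    """Shuffle a copy of the items and cut it into train and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# 40,000 placeholder items standing in for the cleaned tweets.
train, test = train_test_split(range(40000))
print(len(train), len(test))  # 34000 6000
```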

RESULTS OBTAINED - Accuracy
Comparison of Accuracies

Classifier Used          Accuracy (%)
Naive Bayes              62.0729
Multinomial Naive Bayes  62.2062
Bernoulli Naive Bayes    62.2062
Logistic Regression      62.2562
SGD                      61.2397
SVC Classifier           61.3897
Max Entropy              61.3897
Hybrid Model             62.2563

[Figure: Comparison of Accuracies; axes: Classifier, Accuracy]

RESULTS OBTAINED - Confusion Matrix
For Hybrid Model:

RESULTS OBTAINED - F1 Score
[Figures: F1 score for each of the following models]
- Naive Bayes Classifier
- Bernoulli Naive Bayes Classifier
- Multinomial Naive Bayes Classifier
- Logistic Regression Classifier
- Stochastic Gradient Descent Classifier
- Support Vector Machines Classifier
- Maximum Entropy Classifier
- Hybrid Model
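
For reference, accuracy, precision, recall, and F1 all follow from the four cells of a binary confusion matrix. The sketch below shows the standard formulas. The counts are made-up illustrative numbers, not the project's results, and the function names are chosen here for the example.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision_recall_f1(tp: int, fp: int, fn: int):
    """Standard precision, recall and F1 for the positive class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts only (NOT taken from the project's confusion matrix).
tp, fp, fn, tn = 60, 40, 20, 80
p, r, f1 = precision_recall_f1(tp, fp, fn)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.6 0.75 0.667
print(accuracy(tp, tn, fp, fn))  # 0.7
```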

OUR TOP THREE LEARNINGS IN THIS PROJECT
1. We became familiar with the usage and implementation of different
classifiers.

2. We understood which classifiers work on a certain type of data, and
learned the advantages and drawbacks of the classification models we used.

3. We got the opportunity to create an ensemble model to give us optimal
results.

TOP CHALLENGES UNRESOLVED SO FAR
1. Test accuracy of the models stayed around 60%, even after several
efforts to increase it.

2. Two separate modules, instead of one, are used for translation.

3. The dataset used for training could be better.

OUR PLAN GOING FORWARD
1. Find a better dataset to work with.

2. Try more complex machine learning models for text classification.

3. Use better translation techniques.
