
CYBER BULLYING DETECTION SYSTEM

A Project Report
Submitted by

ROHINI K R JEC16CS097
SREEHARI TANIL JEC16CS113
SREEJITH P M JEC16CS114
YEDUMOHAN P M JEC16CS126
to
APJ Abdul Kalam Technological University
in partial fulfillment of the requirements for the award of the Degree of
Bachelor of Technology (B.Tech)
in
COMPUTER SCIENCE & ENGINEERING

Under the guidance of


DR. VINITH R

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

April 2020
DECLARATION

We, the undersigned, hereby declare that the project report “Cyber Bullying Detection
System”, submitted in partial fulfillment of the requirements for the award of the degree of
Bachelor of Technology of the APJ Abdul Kalam Technological University, Kerala, is a
bonafide work done by us under the supervision of Dr. Vinith R. This submission represents
our ideas in our own words, and where ideas or words of others have been included, we have
adequately and accurately cited and referenced the original sources. We also declare that
we have adhered to the ethics of academic honesty and integrity and have not misrepresented
or fabricated any data, idea, fact, or source in this submission. We understand that any
violation of the above will be cause for disciplinary action by the institute and/or the
University and can also invoke penal action from the sources which have thus not been
properly cited or from whom proper permission has not been obtained. This report has not
been previously used by anybody as a basis for the award of any degree, diploma, or similar
title of any other University.

Name of Students Signature


ROHINI K R (JEC16CS097)
SREEHARI TANIL (JEC16CS113)
SREEJITH P M (JEC16CS114)
YEDUMOHAN P M (JEC16CS126)
Place:
Date:
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

CERTIFICATE

This is to certify that the report entitled “CYBER BULLYING DETECTION SYSTEM”,
submitted by ROHINI K R (JEC16CS097), SREEHARI TANIL (JEC16CS113),
SREEJITH P M (JEC16CS114), and YEDUMOHAN P M (JEC16CS126) to the
APJ Abdul Kalam Technological University in partial fulfillment of the requirements
for the award of the Degree of Bachelor of Technology in Computer Science &
Engineering, is a bonafide record of the project work carried out by them under my/our
guidance and supervision. This report in any form has not been submitted to any other
University or Institute for any purpose.

Dr. Vinith R Fr. Dr. A K George


Associate Professor Professor
Internal Supervisor Head of the Department
ACKNOWLEDGEMENT

We take this opportunity to thank everyone who helped us profusely in the successful
completion of our project work. With prayers, we thank God Almighty for his grace and
blessings, for without his unseen guidance, this project would have remained only in our
dreams.

We thank the Management of Jyothi Engineering College and our Principal, Fr. Dr. Jaison
Paul Mulerikkal CMI, for providing all the facilities to carry out this project work. We are
grateful to the Head of the Department, Fr. Dr. A K George, for his valuable suggestions and
encouragement to carry out this project work.

We would like to express our wholehearted gratitude to our project guide, Dr. Vinith R,
for his encouragement, support, and guidance in the right direction during the entire project
work.

We thank our Project Coordinators, Mr. Anil Antony and Mr. Unnikrishnan P, for their
constant encouragement during the entire project work. We extend our gratefulness to all
teaching and non-teaching staff members who were directly or indirectly involved in the
successful completion of this project work.

Finally, we take this opportunity to express our gratitude to our parents for their love, care,
and support, and also to our friends, who have been constant sources of support and
inspiration for completing this project work.

ROHINI K R (JEC16CS097)
SREEHARI TANIL (JEC16CS113)
SREEJITH P M (JEC16CS114)
YEDUMOHAN P M (JEC16CS126)

VISION OF THE INSTITUTE
Creating eminent and ethical leaders through quality professional education with
emphasis on holistic excellence.

MISSION OF THE INSTITUTE


• To emerge as an institution par excellence of global standards by imparting quality
Engineering and other professional programmes with state-of-the-art facilities.

• To equip the students with appropriate skills for a meaningful career in the global
scenario.

• To inculcate ethical values among students and ignite their passion for holistic
excellence through social initiatives.

• To participate in the development of society through technology incubation,
entrepreneurship and industry interaction.

VISION OF THE DEPARTMENT
Creating eminent and ethical leaders in the domain of computational sciences through
quality professional education with a focus on holistic learning and excellence.

MISSION OF THE DEPARTMENT


• To create technically competent and ethically conscious graduates in the field
of Computer Science Engineering by encouraging holistic learning and excellence.

• To prepare students for careers in Industry, Academia and the Government.

• To instill Entrepreneurial Orientation and research motivation among the students
of the department.

• To emerge as a leader in education in the region by encouraging teaching, learning,
industry and societal connect.

PROGRAMME EDUCATIONAL OBJECTIVES

PEO 1: Graduates shall have a good foundation in the fundamental and practical
aspects of Mathematics and Engineering Sciences so as to build successful
and enriching careers in the field of Computer Science & Engineering and allied areas.

PEO 2: Graduates shall learn and adapt themselves to the latest technological
developments in the field of Computer Science & Engineering, which will in
turn motivate them to excel in their domains and to pursue higher education
and research.

PEO 3: Graduates shall have professional ethics and good communication ability along
with entrepreneurial and leadership skills, so that they can succeed in
multidisciplinary and diverse fields.

PROGRAMME SPECIFIC OUTCOMES

Graduates possess:

PSO 1: Ability to apply their knowledge and technical competence to solve real-world
problems related to electrical power systems, control systems, power electronics
and industrial drives.

PSO 2: Ability to use technical and computational skills for the design and development
of electrical and electronic systems.

PSO 3: Ability to become technically competent professionals for the testing,
maintenance and installation of electrical equipment and systems.

PROGRAMME OUTCOMES

1. Engineering knowledge: Apply the knowledge of mathematics, science, engineering
fundamentals, and an engineering specialization to the solution of complex engineering
problems.
2. Problem analysis: Identify, formulate, review research literature, and analyze complex
engineering problems reaching substantiated conclusions using first principles of
mathematics, natural sciences, and engineering sciences.
3. Design/development of solutions: Design solutions for complex engineering problems
and design system components or processes that meet the specified needs with appropriate
consideration for the public health and safety, and the cultural, societal, and environmental
considerations.
4. Conduct investigations of complex problems: Use research-based knowledge and
research methods including design of experiments, analysis and interpretation of data, and
synthesis of the information to provide valid conclusions.
5. Modern tool usage: Create, select, and apply appropriate techniques, resources, and
modern engineering and IT tools including prediction and modeling to complex engineering
activities with an understanding of the limitations.
6. The engineer and society: Apply reasoning informed by the contextual knowledge to
assess societal, health, safety, legal and cultural issues and the consequent responsibilities
relevant to the professional engineering practice.
7. Environment and sustainability: Understand the impact of the professional engineering
solutions in societal and environmental contexts, and demonstrate the knowledge of, and
need for sustainable development.
8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities
and norms of the engineering practice.
9. Individual and team work: Function effectively as an individual, and as a member or
leader in diverse teams, and in multidisciplinary settings.
10. Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as, being able to comprehend and
write effective reports and design documentation, make effective presentations, and give
and receive clear instructions.
11. Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one’s own work, as a member
and leader in a team, to manage projects and in multidisciplinary environments.
12. Life-long learning: Recognize the need for, and have the preparation and ability to engage
in independent and life-long learning in the broadest context of technological change.

COURSE OUTCOMES

COs Description
C3O7.1 The students will be able to think innovatively on the development of components, products, processes or technologies in the engineering field.
C3O7.2 The students will be able to analyse the problem requirements and arrive at workable design solutions.
C3O7.3 The students will be able to understand the concept of reverse engineering.
C3O7.4 The students will be able to familiarise themselves with the modern tools used in the process of design and development.

CO MAPPING TO POs
POs
COs PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
C3O7.1
C3O7.2
C3O7.3
C3O7.4
Average

CO MAPPING TO PSOs
PSOs
COs PSO1 PSO2 PSO3
C3O7.1
C3O7.2
C3O7.3
C3O7.4
Average

ABSTRACT

The advancement of social media plays an important role in the growing population of
youngsters on the web, and it has become the biggest medium for expressing one's thoughts
and emotions. Recent studies report that cyber bullying constitutes a growing problem
among youngsters on the web. These kinds of attacks have a major influence on the current
generation's personal and social life, because youngsters are ready to adopt an online life
instead of a real one, which leads them into an imaginary world. We therefore propose a
system for the early detection of cyber bullying on the web using sentiment analysis and
machine learning techniques.
Our system first checks whether a text is bullying or not using sentiment analysis, and from
that we try to identify public bullies, so that their further activities can be monitored. The
system will help save many youngsters from this kind of attack by issuing warnings.
We use four different machine learning algorithms, namely Naive Bayes, Decision Tree,
Logistic Regression and Support Vector Machine, with four different data pre-processing
variants of the Bag of Words algorithm: uni-gram, bi-gram, tri-gram and n-gram. The best
prediction is obtained with the Naive Bayes algorithm using the bi-gram Bag of Words
pre-processing technique, with an accuracy of 79% and an F-measure of 38%.

CONTENTS

List of Tables xiii


List of Figures xiv
List of Abbreviations xvi
1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Cyber Bullying on SM Websites . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.1 Rise of aggressive behaviour on SM . . . . . . . . . . . . . . . . 4
1.4 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4.1 Specific Objectives . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Literature Survey 7
3 Challenges in Bullying Detection 12
3.1 Issues related to Cyber Bullying Definition . . . . . . . . . . . . . . . . . 12
3.2 Human data Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3 Culture Effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.4 Language Dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.5 Prediction of Cyber Bullying Severity . . . . . . . . . . . . . . . . . . . . 14
3.6 Unsupervised Machine Learning . . . . . . . . . . . . . . . . . . . . . . . 14
3.7 Challenges in Machine Learning Algorithm Selection . . . . . . . . . . . 15
3.8 Imbalanced Class Distribution . . . . . . . . . . . . . . . . . . . . . . . . 15
3.9 Issues in Selection of Evaluation Matrices . . . . . . . . . . . . . . . . . . 16
3.10 Issues Related to Feature Engineering . . . . . . . . . . . . . . . . . . . . 16
4 Project Management 18
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.1.1 Initiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.1.2 Planning and Design . . . . . . . . . . . . . . . . . . . . . . . . 19
4.1.3 Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.1.4 Monitoring and Controlling . . . . . . . . . . . . . . . . . . . . . 19

4.2 System Development Life Cycle . . . . . . . . . . . . . . . . . . . . . . . 19
4.2.1 Iterative Modelling . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.2.2 Advantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5 Methodology 23
5.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.2 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.2.1 Bag of Words Algorithm: . . . . . . . . . . . . . . . . . . . . . . 24
5.3 Machine Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . 28
5.3.1 Naive Bayes Algorithm . . . . . . . . . . . . . . . . . . . . . . . 28
5.3.2 Decision Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.3.3 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.3.4 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . 30
5.4 Performance Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.4.1 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.4.2 Precision, Recall, and F-Measure . . . . . . . . . . . . . . . . . . 33
5.4.3 Area Under the Curve (AUC) . . . . . . . . . . . . . . . . . . . . 33
5.5 Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
6 Design 41
6.1 Data flow Diagrams, Architecture Diagram and Conceptual Diagram . . . 41
6.2 Module Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.2.1 Admin Module . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.3 Hardware and Software Requirements . . . . . . . . . . . . . . . . . . . . 45
6.3.1 7th Gen Intel Core i3-7100U Processor . . . . . . . . . . . . . . . . 45
6.3.2 4GB DDR4 RAM, Intel HD Graphics 620 . . . . . . . . . . . . . . 46
6.3.3 1TB HDD Storage . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6.3.4 Python 3.7.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6.3.5 Spyder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.3.6 Windows Operating System . . . . . . . . . . . . . . . . . . . . 47
7 Results & Discussion 48
7.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
8 Conclusion & Future Scope 52
8.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
8.2 Future Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

References 54
Appendices 56
A Insta Crawler 57
B Prediction Code 61

LIST OF TABLES

Table No. Title Page No.


7.1 Naive Bayes with Uni-gram . . . . . . . . . . . . . . . . . . . . . . . . . . 48
7.2 Naive Bayes with Bi-gram . . . . . . . . . . . . . . . . . . . . . . . . . . 49
7.3 Naive Bayes with Tri-gram . . . . . . . . . . . . . . . . . . . . . . . . . . 49
7.4 Naive Bayes with n-gram . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
7.5 Decision Tree with Uni-gram . . . . . . . . . . . . . . . . . . . . . . . . . 49
7.6 Decision Tree with Bi-gram . . . . . . . . . . . . . . . . . . . . . . . . . . 49
7.7 Decision Tree with Tri-gram . . . . . . . . . . . . . . . . . . . . . . . . . 49
7.8 Decision Tree with n-gram . . . . . . . . . . . . . . . . . . . . . . . . . . 49
7.9 Logistic Regression with Uni-gram . . . . . . . . . . . . . . . . . . . . . . 50
7.10 Logistic Regression with Bi-gram . . . . . . . . . . . . . . . . . . . . . . 50
7.11 Logistic Regression with Tri-gram . . . . . . . . . . . . . . . . . . . . . . 50
7.12 Logistic Regression with n-gram . . . . . . . . . . . . . . . . . . . . . . . 50
7.13 SVM with Uni-gram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
7.14 SVM with Bi-gram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
7.15 SVM with Tri-gram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
7.16 SVM with n-gram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

LIST OF FIGURES

Figure No. Title Page No.


4.1 Iterative Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5.1 Code Insta crawler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.2 Code Insta crawler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.3 Code Insta crawler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.4 Labelled Data set sample . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.5 Data set sample . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.6 Bow Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.7 Implementation Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.8 Implementation Decision Tree . . . . . . . . . . . . . . . . . . . . . . . . 30
5.9 Implementation Logistic Regression . . . . . . . . . . . . . . . . . . . . . . 31
5.10 Implementation SVM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.11 Naive Bayes Unigram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.12 Naive Bayes bigram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.13 Naive Bayes Trigram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.14 Naive Bayes Ngram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.15 Naive Bayes after data split . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.16 Decision Tree Unigram . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.17 Decision Tree Bigram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.18 Decision Tree Trigram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.19 Decision Tree Ngram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.20 Decision Tree after Data split . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.21 Logistic Regression Unigram . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.22 Logistic Regression Bigram . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.23 Logistic Regression Trigram . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.24 Logistic Regression Ngram . . . . . . . . . . . . . . . . . . . . . . . . . . 38

5.25 Logistic Regression after data split . . . . . . . . . . . . . . . . . . . . . . 38
5.26 SVM Unigram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.27 SVM Bigram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.28 SVM Trigram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.29 SVM Ngram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.30 SVM after data split . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
6.1 Level 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
6.2 Level 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
6.3 Level 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6.4 Level 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6.5 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.6 Conceptual Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

LIST OF ABBREVIATIONS

SM Social Media
OSN Online Social Networking
SVM Support Vector Machine
LSA Latent Semantic Analysis
SVD Singular-Value Decomposition
HCM Human Composition Matrix
URL Uniform Resource Locator
SMOTE Synthetic Minority Over-sampling Technique
AUC Area Under the Curve
API Application Programming Interface
SDLC Systems Development Life Cycle


CHAPTER 1

INTRODUCTION

The advancement of social media has played an important role in the growing population of
youngsters on the web. They utilize such platforms mainly for communication and entertainment.
The visible trend nowadays is communicating sentiments through social media, and most social
media profiles look like a portrait of the user's life. The tendency to share every second of
life through different forms of social networks has grown, and Instagram holds the top rank
among youngsters. Hence, this work chose the Instagram platform, where the necessary data
collection is easier than on other platforms. The text-based analysis method used in this
research is facilitated by the availability of numerous public accounts and the public
comments related to them.

When we take social media into account, there are many safety issues, including bullying,
grooming, phishing, etc. The after-effects of these issues cover a huge area and may lead
to social, mental and physical problems in the current generation. In this research, we mainly
focus on detecting bullying in social networks, because the suicidal tendency among youngsters
is an increasing issue in the current scenario. Such people express their feelings either
as extreme depression or extreme anger, which we can identify through their posts by
considering the sentiment in the captions, the hashtags used in the posts, etc. Different studies
show that India has the highest occurrence of bullying through social media, so it is
necessary to control it.

Nowadays, a great deal of research is taking place on bullying detection, avoidance, etc., and
preliminary studies show that bullies target people who post content closely related to
religious activities, sexual exposure, political activities and so on. Hence, our first step was
to detect this kind of post on the Instagram network using some typical keywords related to
such posts, and we extracted the meta information and the comments as well. We compare four
different classifiers under four different feature extraction criteria, compare their results,
and identify the classifier best suited to our problem. The four classifiers are Naive Bayes,
Decision Tree, Logistic Regression and the Support Vector Machine algorithm.
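The classifier comparison described above can be sketched in Python. This is a minimal illustration assuming a scikit-learn toolchain; the six toy comments and their labels are invented placeholders, not the report's actual Instagram data:

```python
# Hedged sketch: four classifiers compared over uni-, bi-, tri- and n-gram
# Bag of Words features. The tiny inline dataset is a stand-in for the
# real labelled Instagram comments.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

texts = [
    "you are awesome and kind",
    "nobody likes you loser",
    "what a great photo today",
    "you are so ugly and stupid",
    "have a wonderful nice day",
    "shut up you stupid idiot",
]
labels = [0, 1, 0, 1, 0, 1]  # 1 = bullying, 0 = non-bullying

classifiers = {
    "Naive Bayes": MultinomialNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(),
    "SVM": LinearSVC(),
}
# ngram_range settings mirroring the report's four pre-processing variants.
grams = {
    "uni-gram": (1, 1),
    "bi-gram": (2, 2),
    "tri-gram": (3, 3),
    "n-gram": (1, 3),
}

results = {}
for gram_name, gram_range in grams.items():
    bow = CountVectorizer(ngram_range=gram_range)  # Bag of Words features
    X = bow.fit_transform(texts)
    for clf_name, clf in classifiers.items():
        clf.fit(X, labels)
        # Training accuracy only, for illustration; the report scores
        # accuracy and F-measure on held-out data.
        results[(clf_name, gram_name)] = clf.score(X, labels)

for key, acc in sorted(results.items()):
    print(key, round(acc, 2))
```

In the actual experiments, of course, the models are evaluated on data not seen during training, which is how the reported 79% accuracy and 38% F-measure figures arise.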

1.1 Background
Many studies have been conducted on the contribution of machine learning algorithms to
OSN content analysis in the last few years. Machine learning research has become crucial in
numerous areas and has successfully produced many models, tools, and algorithms for handling
large amounts of data to solve real-world problems. Machine learning algorithms have been
used extensively to analyze SM website content for spam, phishing, and cyber bullying
prediction. Aggressive behaviour includes spam propagation, phishing, malware spread, and
cyber bullying. Textual cyber bullying has become the dominant aggressive behaviour on SM
websites because these websites give users full freedom to post on their platforms.

SM websites contain massive amounts of text and/or non-text content and various information
associated with aggressive behaviour. In this work, a content analysis of SM websites is
performed to predict aggressive behaviour. This analysis is limited to textual OSN content
for predicting cyber bullying behaviour. Given that cyber bullying can be easily committed,
it is considered a dangerous and fast-spreading aggressive behaviour. Bullies only require
willingness and a laptop or cell phone with an Internet connection to misbehave without
confronting victims. The popularity and proliferation of SM websites have increased online
bullying activities. Cyber bullying is rampant on SM websites due to their structural
characteristics. Cyber bullying on traditional platforms, such as emails or phone text
messages, is performed on a limited number of people. SM websites allow users to create
profiles for establishing friendships and communicating with other users regardless of
geographic location, thus expanding cyber bullying beyond physical location. Anonymous
users may also exist on SM websites, which has been confirmed to be a primary cause of
increased aggressive user behaviour. Developing an efficient model for predicting cyber
bullying is thus of practical significance. With all these considerations, this work performs
a content-based analysis for predicting textual cyber bullying on SM websites.
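A first-pass textual check of the kind described here can be sketched as a simple lexicon lookup. The word lists below are hypothetical placeholders rather than a real sentiment lexicon, and an actual system would rely on the trained classifiers described later in the report:

```python
# Hedged sketch of a first-pass content check: flag a comment as
# potentially bullying when it contains more abusive than positive terms.
# Both word sets are invented placeholders, not a real lexicon.
ABUSIVE = {"loser", "idiot", "ugly", "stupid", "hate"}
POSITIVE = {"awesome", "great", "nice", "love", "beautiful"}

def looks_like_bullying(text):
    """Return True when abusive terms outnumber positive ones."""
    tokens = [t.strip(".,!?") for t in text.lower().split()]
    neg = sum(t in ABUSIVE for t in tokens)
    pos = sum(t in POSITIVE for t in tokens)
    return neg > pos

print(looks_like_bullying("you are such a loser"))   # True
print(looks_like_bullying("great photo, love it!"))  # False
```

Such a lookup only illustrates the idea of content-based filtering; it cannot capture context, sarcasm, or misspellings, which is why the report turns to machine learning models.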

1.2 Problem Statement


The motivations for carrying out this work on predicting cyber bullying on SM websites
are as follows. Cyber bullying is a major problem and has been documented as a serious
national health problem due to the recent growth of online communication and SM websites.
Research has shown that cyber bullying exerts negative effects on people's psychological
and physical health and academic performance. Studies have also shown that cyber bullying
victims incur a high risk of suicidal ideation, and other studies report an association between
cyber bullying victimization and the risk of suicidal ideation. Consequently, developing a
cyber bullying prediction model that detects aggressive behaviour related to the security of
human beings is more important than developing a prediction model for aggressive behaviour
related to the security of machines.


Cyber bullying can be committed anywhere and at any time. Escaping from cyber bullying is
difficult because it can reach victims anywhere and at any time. It can be committed by
posting comments and statuses to a large potential audience, and the victims cannot stop the
spread of such activities. Although SM websites have become an integral part of users' lives, a
study found that SM websites are the most common platforms for cyber bullying victimization.
[1] A well-known characteristic of SM websites, such as Twitter, is that they allow users to
publicly express and spread their posts to a large audience while remaining anonymous. The
effects of public cyber bullying are worse than those of private cases, and anonymous scenarios
of cyber bullying are worse than non-anonymous ones. Consequently, the severity of cyber
bullying has increased on SM websites, which support public and anonymous scenarios of
cyber bullying. These characteristics make SM websites, such as Twitter, dangerous
platforms for committing cyber bullying.

Recent research has indicated that most experts favour the automatic monitoring of
cyber bullying. A study that examined fourteen groups of adolescents confirmed the pressing
need for automatic monitoring and prediction models for cyber bullying, because traditional
ways of dealing with cyber bullying do not work well in the era of big data and networks.
Moreover, analyzing large amounts of complex data requires machine learning-based
automatic monitoring.

1.3 Cyber Bullying on SM Websites


Most researchers define cyber bullying as using electronic communication technologies to
bully people. Cyber bullying may exist in different forms, such as writing aggressive
posts, harassing or bullying a victim, making hateful posts, or insulting the victim. Given
that cyber bullying can be easily committed, it is considered a dangerous and fast-spreading
aggressive behaviour. [2] Bullies only require willingness and a laptop or cell phone connected
to the Internet to misbehave without confronting the victims. The popularity and
proliferation of SM websites have increased online bullying activities. Cyber bullying on
SM websites is performed on a large number of users due to the structural characteristics of
these websites, whereas cyber bullying on traditional platforms, such as emails or phone text
messages, is committed on a limited number of people. SM websites allow users to create
profiles for establishing friendships and interacting with other online users regardless of
geographic location, thus [3] expanding cyber bullying beyond physical location. Moreover,
anonymous users may exist on SM websites, and this has been confirmed to be a primary cause
of increased aggressive user behaviour. The nature of SM websites allows cyber bullying to
occur secretly, spread rapidly, and continue easily. Consequently, developing an effective
model for predicting cyber bullying is of practical significance. SM websites contain large
amounts of text and/or non-text content and information related to aggressive behaviour.

1.3.1 Rise of aggressive behaviour on SM


[4] Prior to the innovation of communication technologies, social interaction evolved
within small cultural boundaries, such as locations and families. The recent development of
communication technologies exceptionally transcends the temporal and spatial limitations
of traditional communication. In the last few years, online communication has shifted toward
user-driven technologies, such as SM websites, blogs, online virtual communities, and online
sharing platforms. New forms of aggression and violence emerge exclusively online. The
dramatic increase in negative human behaviour on SM, with high increments in aggressive
behaviour, presents a new challenge. [5] The advancement of Web 2.0 technologies, including
SM websites that are often accessed through mobile devices, has completely transformed
functionality on the side of users. SM characteristics, such as flexibility, being free, having
well-connected social networks, and accessibility, provide users with the liberty and
flexibility to post and write on their accounts privately or publicly. Therefore, users can
easily [6] demonstrate aggressive behaviour. SM websites have become dynamic social
communication websites for millions of users worldwide. Data in the form of ideas, opinions,
preferences, views, and discussions spread among users rapidly through online social
communication. The online interactions of SM users generate a huge volume of data that can
be utilized to study human behavioural patterns. SM websites also provide an exceptional [7]
opportunity to analyze patterns of social interaction among populations at a scale much
larger than before.

Aside from transforming the means through which people are influenced, SM websites
provide a place for severe forms of misbehaviour among users. Online complex networks,
such as SM websites, changed substantially in the last decade, a change stimulated
by the popularity of online communication through SM websites. Online communication
has become an entertainment tool, rather than serving only to communicate and interact
with known and unknown users. Although SM websites provide many benefits to users,[8]
cybercriminals can use these websites to commit different types of misbehaviour and/or
aggressive behaviour. The common forms of misbehaviour and/or aggressive behaviour on
OSN sites include cyber bullying, phishing, spam distribution, and malware spreading.

Users utilize SM websites to demonstrate different types of aggressive behaviour. The main
involvement of SM websites in aggressive behaviour can be summarized in two points:


1. [9] OSN communication is a revolutionary trend that exploits Web 2.0. Web 2.0 has
new features that allow users to create profiles and pages, which, in turn, make users
active. Unlike Web 1.0, which limited users to being passive readers of content, Web
2.0 has expanded capabilities that allow users to be active as they post and write their
thoughts. SM websites have four explicit characteristics, namely, collaboration, participation,
authorisation, and timeliness. These characteristics enable criminals to use SM websites
as a platform to commit aggressive behaviour without confronting victims. Examples of
aggressive behaviour are committing cyber bullying and financial fraud, using malicious
applications, and implementing social engineering and phishing.[10]

2. SM websites are structures that enable information exchange and dissemination. They
allow users to effortlessly share information, such as messages, links, photos, and videos.
However, because SM websites connect billions of users, they have become delivery
mechanisms for different forms of aggressive behaviour at an extraordinary scale. SM
websites help cyber criminals reach many users.[11]

1.4 Objectives
In India, the rate of cyber bullying is rising. The Social Media platform, which
was once used solely for communication, resource sharing, and other helpful services,
has now become a prime location for carrying out cyber bullying activities. This has caused
disturbance and insecurity for the clients and users of Social Media. Cyber predators
focus on either a particular user or a group of random users and then go after anything
posted by the victim, causing mental breakdown and insecurity and eventually preventing
the victim from using the SM platform again. Many of these attacks may even trigger suicidal
tendencies among users with low emotional stability. Preventing these types of cyber
bullying attacks has therefore become an urgent and unavoidable necessity. The main objectives of
the project are:

1. Finding accurate machine learning algorithms: The method we propose to detect and
prevent cyber bullying incidents across social media is to extract the textual and
other data from the SM platform and analyze the data for bullying occurrences.
We run through the data and match it against a manually created set of bullying words,
thereby detecting cyber bullying incidents. We use Machine Learning algorithms to
classify the data into Bullying and Non-bullying. The performance of the algorithms differs
according to their classifying logic, so we compare the algorithms based on their accuracy and other
classification parameters and arrive at the best-suited classification algorithm for the data.

2. Making social networks secure and transparent: With a proper implementation of the
project, we will be able to detect the cyber bullying incidents that occur across the SM
platform. This in turn improves the security and trustworthiness of such SM platforms.


3. Identifying public bullies: The data comprise not only the bullying expressions but
also who posted the bullying content and when. When an attacker purposefully
targets a user multiple times, or attacks many users, he is categorized
as a public bully. Such public bullies are more likely to repeat these bullying
actions. Users are alerted about such public bullies and advised to take
necessary precautions.

4. Identifying challenges in detecting cyber bullying: Machine Learning algorithms do not
tend to yield absolute accuracy; some cyber bullying expressions may even be
classified as non-bullying. Studying such misclassifications will help us train the algorithms
better and achieve increased accuracy.
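The word-matching and algorithm-comparison approach in objective 1 can be sketched as follows. This is an illustrative fragment only: the lexicon and the hand-labelled sample are placeholders, not the project's actual word list or data.

```python
# Illustrative sketch only: flag posts by matching tokens against a
# manually created bullying lexicon, then score the rule on a tiny
# hand-labelled sample. BULLYING_WORDS and the sample are placeholders.
BULLYING_WORDS = {"idiot", "loser", "ugly", "stupid"}

def is_bullying(post: str) -> bool:
    """A post is flagged if any of its tokens appears in the lexicon."""
    tokens = {t.strip(".,!?").lower() for t in post.split()}
    return bool(tokens & BULLYING_WORDS)

def accuracy(posts, labels) -> float:
    """Fraction of posts whose prediction matches the manual label."""
    return sum(is_bullying(p) == y for p, y in zip(posts, labels)) / len(posts)

# Hand-labelled toy sample: True = bullying, False = non-bullying.
posts = ["You are such a loser!", "Great match yesterday",
         "stupid idea, stupid person", "See you at class"]
labels = [True, False, True, False]
print(accuracy(posts, labels))  # 1.0 on this toy sample
```

Comparing machine learning classifiers, as the objective describes, amounts to computing such a score for each candidate algorithm on held-out data and keeping the best performer.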

1.4.1 Specific Objectives


• To increase the accuracy of the detection process.

• Finding accurate machine learning algorithms.

• Making social networks secure and transparent.

• Identifying public bullies.

• Identifying challenges in detecting cyber bullying.


CHAPTER 2

LITERATURE SURVEY

A considerable amount of research has already been completed in this field. Taken together,
these studies show that most existing systems use the SVM algorithm for classification, which
at best provides an accuracy rate of 73-76%. Another point common to all the papers is that
they are based on individual posts, with no history-based analysis, and they consider only
the comments and their labels rather than a wider range of features. When we consider the
Instagram platform as a source, the network already offers several defences against this kind
of attack: it warns users who search for terms such as self-harm or suicide, which are keywords
closely related to these issues, and it has introduced new stickers that can be used to say
things like “stop bullying” or “don’t bully”.[12] Most existing systems also include a
content-warning feature and a parental guide for monitoring the activities of the logged-in
user. However, there are still problems with such systems; for example, on the Instagram
platform there is no warning for posts that include the keywords mentioned above, nor for
comments that appear to be bullying.

[13] F. Toriumi, T. Nakanishi, M. and K. Eguchi describe the clear difference between cyber
bullying and cyber aggression in terms of frequency, negativity, and the imbalance of power,
applied in large-scale labelling. The system also uses images and their corresponding comments,
giving a multi-modal classification result for cyber bullying detection. The main
features considered for the analysis are the content of the image, the comments, and the metadata
of the profile; the authors found that cyber bullying occurs in posts involving religion,
death, appearance, and sexual hints. Other findings are that posts facing these attacks are
most likely to carry negative emotions and to relate to drugs, tattoos, etc. The system uses Latent
Semantic Analysis (LSA) based on Singular-Value Decomposition (SVD) and a linear support
vector machine (SVM) classifier that uses n-gram texts with normalization. It
achieves at best 79% recall and 71% precision. At the same time, the data set
is limited and the data were labelled through a survey, so no media-based social network
can be included in this system. Moreover, no image recognition algorithms are
used, and the decisions are taken only on the basis of the survey.
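As a rough illustration of the n-gram-with-normalization feature step mentioned above (the LSA/SVD and SVM stages are omitted), word bigrams can be extracted and their counts L1-normalized as below; the tokenization rules here are assumptions, not the paper's exact procedure.

```python
# Rough sketch of word-bigram extraction with simple normalization
# (lowercasing, punctuation stripping) and L1-normalized counts.
from collections import Counter

def ngrams(text: str, n: int = 2):
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def normalized_counts(text: str, n: int = 2):
    counts = Counter(ngrams(text, n))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

vec = normalized_counts("Stop posting, stop bullying")
print(vec)  # each of the three bigrams gets weight 1/3
```

Vectors of this kind would then be reduced with SVD and fed to the linear SVM classifier described in the paper.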


[14] R. Badonnel, R. State, I. Chrisment and O. Festor explain the tracking of cyber
predators in a peer-to-peer network. This system mainly aims at detecting network attacks
against vulnerable services and host-based attacks such as unauthorized logins. The system
is capable of tracking and reporting cyber predators and hence protects normal
users from coming into contact with these pathological users. The system is defined using
two criteria, tracking the deployment and tracking the target, and it is composed of a set of
configurable honeypot agents and a central platform manager. The main managerial activities
in this system are the generation of fake files, the capture of file requests, and local
statistical analysis. Advantages of the system include full compatibility with the management
architecture and management protocol, full independence from the file directory service, and
generic applicability to peer-to-peer clients. At the same time, the lack of central management and
control over the available resources, and the problem of centrally backing up files and folders,
hold the system back.

Considering the findings of [15] M. Di Capua, E. Di Nardo and A. Petrosino, they
attempted to build a system that follows unsupervised learning methods for finding bullying
activities in social media. The system proposes a method to detect cyber bullying with a
hybrid set of features combining classical textual features and social features. They adopt natural
language processing algorithms and semantic as well as syntactic methods to filter the
data. The main feature of this system is that it considers emotional traces: it
makes use of sentiment analysis with a set of features closely related to the
social platform, and the sentiment polarity of the sentences is calculated based on ranking.
Emojis are considered and classified as extremely negative, negative, neutral,
positive, or extremely positive. Neural networks are used for clustering.
The performance of the classifier is calculated based on precision, recall, and F-measure.
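The emoji polarity idea can be sketched as below; the five-level mapping and its scores are assumed placeholders, since the paper's actual lexicon is not reproduced here.

```python
# Assumed five-level emoji polarity mapping (placeholder scores);
# sentence polarity is the sum over known emojis found in the text,
# each counted once.
EMOJI_POLARITY = {
    "😡": -2,  # extremely negative
    "😢": -1,  # negative
    "😐": 0,   # neutral
    "🙂": 1,   # positive
    "😍": 2,   # extremely positive
}

def sentence_polarity(text: str) -> int:
    return sum(score for emoji, score in EMOJI_POLARITY.items() if emoji in text)

print(sentence_polarity("nobody likes you 😡😢"))  # -3
```

In the system described above, such polarity scores would be combined with textual and social features before clustering.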

Noviantho, S. M. Isa and L. Ashianti [16] explain how bullying can be detected using text
mining techniques. The bullying conversations are identified using the naive Bayes
method and an SVM with a polynomial kernel. They used a data set from formspring.me in the form
of textual conversations and filtered it by discarding conversations that contain fewer
than 15 words or that consist of meaningless words. As an initial stage, they classified
the data into two classes, Yes and No; they then developed a 4-class model (No, Low, Medium, High)
and an 11-class model. They found the 4-class classification to be the most
effective and proceeded with it using n-grams. Because the system operates only on textual
conversations, it can identify only text-based cyber bullying behaviour, which is not
enough: conversations contain other elements, such as keywords and
abbreviations, and include a lot of emoji content, which is not considered here.
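The pre-filtering step described above (dropping conversations of fewer than 15 words or made up of meaningless words) might look roughly like the following; the stop set and the 50% cut-off are illustrative assumptions, not values from the paper.

```python
# Sketch of the conversation filter: drop chats shorter than 15 words
# or dominated by meaningless tokens. MEANINGLESS and the 50% cut-off
# are illustrative assumptions.
MEANINGLESS = {"asdf", "lol", "hmm"}

def keep(conversation: str, min_words: int = 15) -> bool:
    words = conversation.lower().split()
    if len(words) < min_words:
        return False
    return sum(w in MEANINGLESS for w in words) / len(words) < 0.5

conversations = [
    "short reply",              # dropped: fewer than 15 words
    " ".join(["word"] * 20),    # kept: 20 meaningful words
    " ".join(["lol"] * 20),     # dropped: all tokens meaningless
]
kept = [c for c in conversations if keep(c)]
print(len(kept))  # 1
```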


[17] Cyber Security Risks for Minors: A Taxonomy and a Software Architecture (IEEE)
discusses distinctive situations that cause risks for minors, including online risks that are
extensions of real-life problems, such as pornography:

1. Risks that arise from the interaction of two under-agers, such as cyber bullying.

2. Risks that arise from the interaction between a child and an adult, such as cyber
grooming.

3. Risks that arise from the collection of data, against the protection of privacy, such as
viruses and other malware.

[18] It also describes different kinds of risks, such as content risk, contact risk, children
targeted as consumers, economic risks, and online privacy risk. The article concludes that
the main issues faced by children are cyber bullying and pornography.
An interesting paper, Facebook Watchdog: A Research Agenda For Detecting Online Grooming
and Bullying Activities (IEEE), aims to protect adolescents against
bullying and grooming attacks. It gives a clear picture of the difference between the
issues of cyber bullying and online grooming and includes a study of related work,
which helps in going through further details. It also describes Facebook and its
information pool, which includes albums, applications, check-ins, photos, links, etc., giving
a good idea of what kind of data is available for analysis. The paper focuses on
image/video analysis, social media analytics, and text analysis.

[19]A study of young people's views on shaming strangers via social media, described in
The Use of Social Media for Shaming Strangers: Young People's Views (IEEE), produced
appalling results: the number of predators in social media is increasing day by day, because
the youth do not consider this a huge problem when the truth is just the opposite. The study
focuses on the circumstances in which young people consider it acceptable or appropriate to
conduct online shaming, and on how they conceive of it. Shaming can take place in different
ways: it may target a single person or address general topics, and it can be both legal and
illegal. The proposed model contains the following processes: an individual captures or
records public behaviour; the material is uploaded to and shared on social media; through the
emotional and behavioural responses of users, two classes are produced, content that is
viewed but not shared and content that undergoes dissemination; and finally the media take it
up. The approaches of the study are what is happening, how, and why. Through this study, the
authors distinguish cyber shaming from cyber bullying. The paper was prepared through
question-and-answer (interview) sessions and identified six themes with different contents:
the concept of shaming, the difference between cyber bullying and cyber shaming, the use of
smartphones for recording public behaviour, uploading and sharing on social media, managing
online presence, and the context of behaviour. The study suggests that this kind of activity
is common today because people are not aware of the issues and future problems; this makes
such behaviour difficult to manage and also increases the complexity of detecting predators.

[20]Optimal Online Cyber Bullying Detection (IEEE) is an implementation paper in which
cyber bullying detection is treated as a sequential hypothesis testing problem. It proposes a
novel algorithm designed to reduce the time to raise a cyber bullying alert by drastically
reducing the number of feature evaluations necessary for a decision to be made, and it
demonstrates the approach using real-world datasets from Twitter. The strategies used are a
classification strategy and a stopping strategy, and the analysis divides messages into two
classes: cyber bullying and normal.
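The sequential idea can be illustrated with the following simplified sketch, which is not the paper's actual algorithm: per-feature evidence is accumulated as a log-likelihood ratio, and evaluation stops as soon as a decision threshold is crossed, so fewer features are examined on clear-cut messages. The thresholds and scores are assumed values.

```python
# Simplified sequential test (not the paper's algorithm): accumulate
# per-feature log-likelihood ratios for "cyber bullying vs normal" and
# stop as soon as a threshold is crossed.
def sequential_decision(feature_llrs, upper=2.0, lower=-2.0):
    """Return (decision, number of features actually evaluated)."""
    s = 0.0
    for i, llr in enumerate(feature_llrs, start=1):
        s += llr
        if s >= upper:
            return "cyberbullying", i
        if s <= lower:
            return "normal", i
    return ("cyberbullying" if s > 0 else "normal"), len(feature_llrs)

# Strong early evidence: only 2 of the 5 features are evaluated.
print(sequential_decision([1.2, 1.1, 0.1, -0.2, 0.3]))  # ('cyberbullying', 2)
```

A full implementation would learn the per-feature likelihoods from labelled data rather than assuming them.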

[21] A Web Pornography Patrol System by Content-based Analysis: In Particular Text and
Image (IEEE) suggests an idea for filtering pornographic websites based on text and image analysis.
The text-based analysis is built on the support vector machine (SVM) algorithm and uses an
N-gram model based on Bayes' theorem to improve the efficiency of the SVM algorithm. The
image-based analysis uses a normalized R/G ratio and a human composition matrix (HCM) based
on skin detection. The results from the text and image analyses are integrated and analyzed
together with a Boolean model. If a URL is analyzed as a pornographic website, it is stored
in a blacklist, and these URLs are then used for blocking inappropriate material. The URLs
are divided into two lists, a blacklist and a whitelist; the blacklist consists of URLs
that must be blocked. The problem with URL blocking is that new sites emerge quickly and
continually. Keyword filtering uses a list of keywords to identify undesirable web pages: if
a page contains a certain number of keywords found in the list, it is considered an
undesirable web page.
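The keyword-filtering and blacklisting mechanism can be sketched as follows; the keyword set and the hit threshold are placeholders, not values from the paper.

```python
# Sketch of keyword filtering plus blacklisting; the keyword set and
# threshold are placeholders, not values from the paper.
KEYWORDS = {"badword1", "badword2", "badword3"}
THRESHOLD = 2  # minimum number of distinct keyword hits

blacklist = set()

def check_page(url: str, page_text: str) -> bool:
    """Blacklist and report a URL whose page matches enough keywords."""
    hits = set(page_text.lower().split()) & KEYWORDS
    if len(hits) >= THRESHOLD:
        blacklist.add(url)
        return True
    return False

check_page("http://example.com/a", "badword1 filler badword2")
print(sorted(blacklist))  # ['http://example.com/a']
```

In the system described above, the blacklist built this way feeds the URL-blocking stage, complementing the SVM- and image-based analyses.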

[22]Using Machine Learning to Detect Cyber Bullying (IEEE) discusses different techniques
and uses a data set from the website Formspring.me, because this site is populated mostly by
teens and college students and there is a high percentage of bullying content in the data.
After collection, the data are labelled. The development procedure includes developing input
features, learning the model, class weighting, and evaluation. Feature development adds the
SUM and TOTAL features to the NUM and NORM versions of the dataset: the NUM set indicates
the bad words, and the density of bad words is featured as NORM. For learning the model,
different algorithms such as JRIP, IBK, and SMO are used. Detection therefore takes place by
recording the percentage of curse and insult words within a post.
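The NUM (bad-word count) and NORM (bad-word density) features described above can be approximated as below; the lexicon is a stand-in, since the paper's actual curse/insult list is not reproduced here.

```python
# Approximation of the NUM (count) and NORM (density) bad-word features;
# BAD_WORDS is a stand-in for the paper's curse/insult list.
BAD_WORDS = {"curse1", "insult1"}

def num_feature(post: str) -> int:
    return sum(w.lower() in BAD_WORDS for w in post.split())

def norm_feature(post: str) -> float:
    words = post.split()
    return num_feature(post) / len(words) if words else 0.0

post = "curse1 you and your insult1 friend"
print(num_feature(post), round(norm_feature(post), 3))  # 2 0.333
```

Feature vectors built this way would then be fed to the learners (e.g. JRIP, IBK, SMO) mentioned above.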

[23]Moving to a higher level of implementation, Identifying cyber predators
through forensic authorship analysis of chat logs (IEEE) is one of the best papers. It states
that two approaches can be applied to authorship attribution: statistical techniques, which
involve statistical authorial invariants, and machine learning techniques, which provide
greater scalability for handling more features of a data set. The paper discusses the role
of authorship analysis of chat logs from a forensic perspective as a means to detect cyber
predators. It also argues that the pattern-matching problem is highly suited to
machine learning, because machine learning makes it possible for Digital Forensic Experts
to effectively classify unseen data by producing a model based on the knowledge learned
from previously seen data. [24]In terms of methodology, both data and metadata are
important in this kind of analysis, so the approach includes both the semantic analysis of
the text and the statistical information that is resident within the sequence of text. The
methodology comprises data collection and preparation, followed by feature extraction, which
covers lexical features, syntactic features, n-gram features, structural features, and
content-specific features. The next step is a comparison of classification techniques, whose
steps are the generation of a word frequency list, corpus linguistics techniques, and
function words; the methods of authorship attribution are stylistic measures, syntactic
cues, and word-based document features. The limitation of this paper lies in the field of
chat-log authorship attribution from a computer forensics perspective.
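A few of the lexical and structural feature families named above can be computed as in the following sketch; the specific features chosen here are illustrative examples, not the paper's full feature set.

```python
# Illustrative lexical and structural feature extraction; a small,
# assumed subset of the feature families the paper lists.
def lexical_features(text: str) -> dict:
    words = text.split()
    return {
        "word_count": len(words),
        "avg_word_len": sum(len(w) for w in words) / len(words) if words else 0.0,
        "uppercase_ratio": sum(c.isupper() for c in text) / max(len(text), 1),
    }

def structural_features(lines: list) -> dict:
    return {
        "line_count": len(lines),
        "avg_line_words": sum(len(l.split()) for l in lines) / max(len(lines), 1),
    }

feats = lexical_features("Hey THERE how are you")
print(feats["word_count"], feats["avg_word_len"])  # 5 3.4
```

In an authorship-attribution pipeline, vectors of such features per author would be passed to the classifier-comparison step described above.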


CHAPTER 3

CHALLENGES IN BULLYING DETECTION

Many issues related to this kind of prediction are non-technical, yet they also affect the
technical stages involved in model selection. This chapter describes some of them.

3.1 Issues Related to Cyber Bullying Definition


Traditional bullying is generally defined as “intentional behaviour to harm another, repeatedly,
where it is difficult for the victim to defend himself or herself”. By extending this definition,
we can define cyber bullying as “aggressive behaviour that is carried out using electronic
platforms by a group or an individual, repeatedly and over time, against a victim who cannot
easily defend him or herself.” Applying such a definition makes it difficult, when manually
labelling data (the instances from which machine learning algorithms learn), to decide whether
a post is cyber bullying or not. Two main issues make the above definition difficult to apply
in online environments. The first is how to measure “repeated and over-time aggressive
behaviour” on SM, and the second is how to measure power imbalance and “a victim who cannot
easily defend himself or herself” on SM. These issues have been discussed by researchers in
order to simplify the concept of cyber bullying in the online context. First, the concept of
the repetitive act is not as straightforward in SM as it is in traditional bullying. [25]For
example, SM websites can provide cyberbullies with a medium to propagate cyber bullying posts
to a large population; consequently, a single act by one committer may become repetitive over
time. Second, the power imbalance takes different forms in online communication. Researchers
have suggested that content in online environments is difficult to eliminate or avoid, thus
making a victim powerless.

These definitions are under intense debate; to simplify the definition of cyber bullying
and make it applicable to a wide range of applications, researchers have defined cyber
bullying as “the use of electronic communication technologies to bully others.” Proposing
a simplified and clear definition of cyber bullying is a crucial step toward building machine
learning models that can satisfy the definition criteria of cyber bullying engagement.

Jyothi Engineering College, Cheruthuruthy Dept. of CSE, April 2020


Chapter 3. Challenges in Bullying Detection 13

3.2 Human Data Characteristics


Although SM big data provide insights into human behaviour at a large scale, in reality the
analysis of such big data remains subjective. Building human prediction systems involves steps
in which subjectivity about human behaviour exists. For example, when creating a manually
labelled data set to train a machine learning algorithm to predict cyber bullying posts, human
bias may exist in how cyber bullying is defined and in the criteria used to categorize the
text as cyber bullying. Moreover, subjectivity may arise during the creation of the set of
features (learning factors) in the feature engineering process. For example, the pre-processing
stage involves a “data cleaning and filtering” process in which choices are made about which
features will be considered and which will be discarded. This process is inherently
subjective.

[26] Predicting human behaviour is crucial but complex. To achieve an effective prediction of
human behaviour, the patterns that exist and are used for constructing the prediction model
should also exist in the future input data. The patterns should clearly represent features that
occur in current and future data to retain the context of the model. Given the non-generic
and dynamic nature of big data, their context is difficult to understand in terms of
scale and even more difficult to maintain when data are reduced to fit into a machine learning
model. Handling the context of big data is challenging and has been presented as an important
future direction. Furthermore, human behaviour is dynamic. Knowing when online users
change the way of committing cyber bullying is an important component in updating the
prediction model with such changes. Therefore, dynamically updating the prediction model is
necessary to meet human behavioural changes.

3.3 Culture Effect


What was considered cyber bullying yesterday might not be considered cyber bullying today,
particularly since the introduction of OSNs, which have a globalized culture. However, machine
learning always learns from the examples provided. Consequently, designing examples that
represent different cultures remains an open problem, and robust work from different disciplines
is required. For this purpose, cross-disciplinary coordination is highly desirable.

3.4 Language Dynamics


Language changes quickly, particularly among the young generation. New slang is
regularly integrated into the language culture. Therefore, researchers are invited to propose
dynamic algorithms that detect new slang and abbreviations related to cyber bullying behaviour
on SM websites, and to keep updating the training processes of machine learning algorithms
with newly introduced words.

3.5 Prediction of Cyber Bullying Severity


[23]The level of cyber bullying severity should also be determined. The effect of cyber
bullying is proportional to its severity and spread. Predicting different levels of cyber
bullying severity requires not only machine learning understanding but also a comprehensive
investigation to define and categorize the levels of cyber bullying severity from social and
psychological perspectives. Efforts from different disciplines are required to define and
identify the levels of severity and then introduce related factors that can be converted into
features for building a multi-class machine learning classifier that grades cyber bullying
severity into different levels, as opposed to a binary classifier that only detects whether
an instance is cyber bullying or not.

3.6 Unsupervised Machine Learning


Human learning is essentially unsupervised: the structure of the world is discovered by
observing it, not by being told the name of every object. Nevertheless, unsupervised
machine learning has been overshadowed by the success of supervised learning. This gap in the
literature may be caused by the fact that nearly all current studies rely on manually labelled
data as the input to supervised classification algorithms. Thus, finding patterns between the
two classes by using unsupervised grouping remains difficult, and intensive research is
required to develop unsupervised algorithms that can detect effective patterns from data.
Traditional machine learning algorithms also lack the capability to handle cyber bullying big
data. Deep learning has recently attracted the attention of many researchers in different
fields, and natural language understanding is a new area in which deep learning is poised to
make a large impact over the next few years. The traditional machine learning algorithms
pointed out in this survey lack the capability to process big data in a standalone format;
big data have rendered them inadequate. The cyber bullying big data generated from SM require
advanced technology to process the generated data, gain insights, and help in making
intelligent decisions.

Big data are generated with very high velocity, variety, volume, value, veracity,
complexity, etc. Researchers need to leverage various deep learning techniques for processing
social media big data for cyber bullying behaviours. The deep learning techniques and
architectures with the potential to explore the cyber bullying big data generated from SM
include the generative adversarial network, deep belief network, convolutional neural network,
stacked autoencoder, deep echo state network, and deep recurrent neural network. These deep
learning architectures remain unexplored for cyber bullying detection in SM.

3.7 Challenges in Machine Learning Algorithm Selection


A machine learning algorithm is selected to be trained on the proposed features. However,
deciding which classifier performs best for a specific data set is difficult, so more than one
machine learning algorithm should be tested to determine the best one for that data set.
Three points may be used as a guide to narrow the selection of machine learning algorithms to
be tested. First, the specific literature on machine learning for cyber bullying detection is
important in selecting a classifier; the superiority of a classifier may be confined to a
given domain, so general previous research and findings on machine learning can be used as a
guide to select a machine learning algorithm. Second, the literature on text mining can be
used as a guide. Third, performance comparisons across comprehensive data sets can be used as
a basis for selecting machine learning algorithms. Although these three points help narrow
the selection, researchers still need to test many machine learning algorithms to identify
the optimal classifier for an accurate predictive model.

3.8 Imbalanced Class Distribution


In many real data sets, classes are naturally imbalanced: the normal class has a large number
of instances, whereas the abnormal class has few. Abnormal class instances are rare and
difficult to collect from real-world applications; examples of imbalanced-data applications
are fraud detection, intrusion detection, and medical diagnosis. Similarly, the number of
cyber bullying posts is expected to be much smaller than the number of non-cyber bullying
posts. This assumption generates an imbalanced class distribution in the data set, in which
the non-cyber bullying instances contain many more posts than the cyber bullying ones. Such
cases can prevent the model from correctly classifying the examples. Many methods have been
proposed to solve this issue, including SMOTE and weight adjustment (the cost-sensitive
technique).

The SMOTE technique is applied to avoid the overfitting that occurs when exact replicas of
minority-class instances are added to the main data set. A subset of the minority class is
taken as a seed, and new, synthetically similar instances are generated; these synthetic
instances are then added to the original data set, and the resulting data set is used to train
the machine learning methods. The cost-sensitive technique is utilized to counteract class
imbalance. It is based on creating a cost matrix, which defines the costs incurred by false
positives and false negatives.
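The core interpolation step of SMOTE described above can be sketched in a few lines; this is a bare-bones illustration (with no nearest-neighbour search), not a full SMOTE implementation.

```python
# Bare-bones illustration of SMOTE's interpolation step (no k-NN
# search): synthesize a point on the segment between two minority
# samples.
import random

def smote_point(a, b, rng):
    gap = rng.random()  # random position along the segment from a to b
    return tuple(x + gap * (y - x) for x, y in zip(a, b))

rng = random.Random(0)
minority = [(1.0, 1.0), (2.0, 2.0)]
synthetic = smote_point(minority[0], minority[1], rng)
print(synthetic)  # coordinates lie between 1.0 and 2.0
```

A full SMOTE implementation would pick each seed's neighbours from among its k nearest minority-class samples before interpolating.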

3.9 Issues in Selection of Evaluation Metrics


Accuracy, precision, recall, and AUC are commonly used as evaluation metrics, and the
choice among them matters. The selection should be based on the nature of the manually
labelled data. An inappropriate evaluation metric can make performance appear better than
it is: the researcher may report significantly improved results, while a closer look at how
the machine learning model is evaluated produces contradictory findings that do not reflect
any true improvement. For example, cyber bullying posts are commonly the abnormal cases,
whereas non-cyber bullying posts are the normal cases, and the ratio between the two is
normally large, with non-cyber bullying posts comprising the larger portion. Suppose 1000
posts are manually labelled: 900 are non-cyber bullying and the remaining 100 are cyber
bullying. If a machine learning classifier labels all 1000 posts as non-cyber bullying and is
unable to identify any post (0) as cyber bullying, the classifier is practically useless. Yet if
researchers use accuracy as the main evaluation metric, the accuracy equation yields a high
score of 90% (900/1000) for this classifier.
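Plugging these example counts into the accuracy and recall formulas makes the failure concrete (a small illustrative check, not part of the project code):

```python
# All-negative classifier on the 1000-post example: it labels every post
# non-cyber bullying, so all 100 bullying posts are missed.
tp, fn = 0, 100
tn, fp = 900, 0

accuracy = (tp + tn) / (tp + tn + fp + fn)  # 0.9, i.e. 90%
recall = tp / (tp + fn)                     # 0.0: no bullying post found
```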

[3] In this example, the classifier fails to detect any cyber bullying post but still obtains a high
accuracy percentage. Knowing the nature of the manually labelled data is therefore important
when selecting an evaluation metric. When the data are imbalanced, researchers may need to
select AUC as the main evaluation metric; in class-imbalance situations, AUC is more robust
than other performance metrics. Cyber bullying and non-cyber bullying data are commonly
imbalanced (non-cyber bullying posts outnumber the cyber bullying ones), closely matching
the real-life data that machine learning algorithms must train on. Accordingly, the learning
performance of these algorithms cannot be assumed to be independent of data skewness.
Special care should be taken in selecting the main evaluation metric to avoid misleading
results and to appropriately evaluate the performance of machine learning algorithms.

3.10 Issues Related to Feature Engineering


[27] Many cyber bullying prediction studies extracted their data sets by using specific keywords
or profile IDs. However, by tracking only posts that contain particular keywords, these
studies may have introduced sampling bias, limited the prediction to posts
that contain the predefined keywords, and overlooked many other posts relevant to cyber
bullying. Such data collection methods limit the prediction model of cyber bullying to
specified keywords. The identification of keywords for extracting posts is also subject
to the authors' own understanding of cyber bullying [19]. An effective method should use a
complete range of posts indicating cyber bullying to train the machine learning classifier and
ensure the generalization capability of the cyber bullying prediction model. An important
objective of machine learning is to generalize beyond the examples in the training data
set. Researchers should investigate whether the sampled data are drawn from data that
effectively represent all possible activities on SM websites. Extracting well-representative
data from SM is the first step toward building effective machine learning prediction models.
However, SM websites' public application program interfaces (APIs) only allow the extraction
of a small sample of all relevant data and thus pose a risk of sampling bias. For
example, one study asked whether data extracted from Twitter's streaming API sufficiently
represent the activity of the Twitter network as a whole; the authors compared keyword (words,
phrases, or hashtags), user ID, and geo-coded sampling. Twitter's streaming API returns a data
set with some bias when keyword or user ID sampling is used, whereas geo-tagged filtering
provides good data representation. With these points in mind, researchers should keep
bias to a minimum when extracting data, to guarantee that the examples selected for the
training data generalize and yield an effective model
when applied to test data. Bias in data collection, such as selecting the training data
set based on specific keywords or users, introduces overfitting
issues that affect the capability of a machine learning model to make reliable predictions on
untrained data.

CHAPTER 4

PROJECT MANAGEMENT

4.1 Introduction
Project management is the discipline of planning, organizing, managing, securing,
leading, and controlling resources to attain a specific aim. A project is a temporary endeavour
with a defined beginning and end (usually time-constrained, and frequently constrained by
funding or deliverables), undertaken to meet specific aims and objectives, typically to
bring about beneficial change or added value. The temporary nature of projects stands in
contrast to business as usual (or operations): recurring, constant, or semi-permanent
functional activities that produce products or services. In practice, the administration of the two
systems is often quite distinct, and as such demands the development of different technical skills
and management tactics.
In our project, we followed the typical development phases of an engineering project:

• Initiation

• Planning and Design

• Execution and Construction

• Monitoring and Controlling Systems

• Completion

4.1.1 Initiation
The initiating procedures determine the nature and scope of the project. The initiation stage
must include a plan that covers the following areas:

1. Analyzing the business requirements into quantifiable goals

2. Evaluating the current operations

3. Financial assessment of costs and benefits, including a budget study and identification
of users and support personnel for the project

4. A stakeholder project charter including costs, tasks, deliverables, and schedules

4.1.2 Planning and Design


After the initiation stage, the project is planned to a suitable level of detail. The main
aim is to plan time, cost, and resources adequately, to estimate the work required, and to
effectively control risk during project execution. As with the initiation process group, a failure
to plan efficiently greatly reduces the project's chances of successfully achieving its goals.

1. Deciding how to plan

2. Developing the scope statement

3. Choosing the planning team

4. Identifying deliverables and forming the work breakdown structure

5. Recognizing the activities required to complete those deliverables

6. Forming the schedule

7. Risk planning

4.1.3 Execution
Execution comprises the processes used to complete the work defined in the project plan to
satisfy the project's requirements. The executing process involves coordinating people and
resources, as well as organizing and executing the activities of the project in accordance with
the project management plan. The deliverables are produced as outputs from the processes
performed as defined in the project management plan and other frameworks relevant to the
type of project at hand.

4.1.4 Monitoring and Controlling


Monitoring and controlling comprises the processes performed to observe project execution
so that potential problems can be identified in a timely manner and corrective action
taken, when required, to keep the project on track. The main benefit is that
project performance is analyzed and measured regularly to identify variances from the project
management plan.

4.2 System Development Life Cycle


The systems development life cycle (SDLC), or software development process, in systems
engineering, information systems, and software engineering, is a process of creating or
altering information systems, together with the models and methodologies used
to develop these systems. In software engineering, the SDLC concept underpins different
varieties of software development methodologies. These methodologies form the framework for
planning and controlling the creation of an information system.

The SDLC phases serve as a programmatic guide to project activity and supply a flexible
but consistent way to execute projects to a depth matching the scope of the project. Each of
the SDLC phase objectives is described in this section with key deliverables, a description of
recommended tasks, and a summary of related control objectives for effective management. It is
crucial for the project manager to establish and monitor control objectives during each
SDLC phase while executing projects. Control objectives provide a clear
statement of the expected result or purpose and can be used throughout the entire SDLC
process.

4.2.1 Iterative Modelling


This model drives the software development process in iterations. It projects the
process of development cyclically, repeating every step after every cycle of the SDLC
process. The software is first developed on a very small scale, following all the steps
under consideration. Then, in every subsequent iteration, more features and modules
are designed, coded, tested, and added to the software. Every cycle produces software that
is complete in itself and has more features and capabilities than the previous one. After
each iteration, the management team can work on risk management and prepare for the
next iteration. Because a cycle covers a small portion of the whole software process, the
development process is easier to manage, but it consumes more resources.

Requirement gathering and analysis:


In this phase, requirements are gathered from customers, and an analyst checks whether the
requirements can be fulfilled and whether they can be achieved within budget. After this, the
software team moves to the next phase.

Design:
In the design phase, the team designs the software using diagrams such as data flow
diagrams, activity diagrams, class diagrams, and state transition diagrams.

Implementation:
In the implementation phase, the requirements are expressed in a programming language and
transformed into computer programs, which are called software.

Figure 4.1: Iterative Model

Testing:
After the coding phase is complete, software testing starts using different test methods. There
are many test methods, but the most common are white-box, black-box, and grey-box
testing.

Deployment:
After completing all the phases, software is deployed to its work environment.

Review:
In this phase, after product deployment, a review is performed to check the behaviour
and validity of the developed product. If any errors are found, the process starts
again from requirement gathering.

Maintenance:
In the maintenance phase, after deployment of the software in the working environment,
bugs may surface or new updates may be required. Maintenance involves debugging
and adding new features.

4.2.2 Advantages
• Testing and debugging during smaller iterations is easy.

• Parallel development can be planned.

• It is easily adaptable to the ever-changing needs of the project.

• Risks are identified and resolved during each iteration.

• Limited time is spent on documentation and more time on design.

CHAPTER 5

METHODOLOGY

We analyzed Instagram profiles and their associated posts for bullying detection. This
section discusses data collection followed by filtering of relevant data, feature extraction,
implementation and results.

5.1 Data Collection


We used Instagram data for analysis, collected based on a set of keywords that includes
Muslim, god, tattoo, pray, Hindu, bjp, etc. We chose these keywords by analyzing different
profiles and finding that these are the most commonly used. The fields extracted for a profile
include all of the profile's posts along with captions, hashtags, creation time, the user's
biography, and the comments associated with each post. The biography plays an important
role in easing the filtering of data: if a bio contains words like quotes, memes, awareness,
motivation, etc., we can discard those profiles at the initial stage. Such public pages also
tend to have far more followers than accounts they follow. These points were used to filter
out public pages such as motivation or awareness pages. We then took the comments on each
post and labelled them manually as either bullying or non-bullying. A total of 1065 comments
were considered, of which 636 were non-bullying and the remaining 429 were bullying. We
divided the data set into training and test sets of 746 and 319 comments, respectively. Before
each algorithm run, for every n-gram combination, we shuffled the data set.
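The shuffle-and-split step can be sketched as follows (the helper name and fixed seed are illustrative; the counts match our 746/319 split):

```python
import random

def shuffle_split(comments, labels, n_train=746, seed=42):
    # Shuffle the labelled comments, then take the first n_train pairs
    # for training and the remainder for testing.
    pairs = list(zip(comments, labels))
    random.Random(seed).shuffle(pairs)
    return pairs[:n_train], pairs[n_train:]
```

With the 1065 labelled comments, this yields 746 training and 319 test examples.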

For data collection, we connected to Instagram with the help of the Python library
igramscraper. This module provides an Instagram subpackage; importing it makes
crawling straightforward. From the extracted data set we selected the comments and
manually labelled them as bullying or non-bullying.

5.2 Feature Extraction


Here we take each comment and first preprocess the data to remove stop
words, special characters, etc. The comments are then vectorized, i.e., tokenized into
different n-gram forms using the bag-of-words algorithm. We first tried unigrams, bigrams,
and trigrams, followed by the combination of all three as a general n-gram model.

Figure 5.1: Code Insta crawler

For preprocessing we used the nltk.corpus library, which contains dictionary
collections such as stop-word lists for Hindi, English, and other languages. Initially, the
comments are tokenized, and then the cleaning process takes place. We took the English stop
words, which include and, or, he, have, is, etc., and removed them from our data set. As part
of preprocessing, we also removed punctuation, special characters, and so on. The main
algorithm we used for this feature-extraction stage was the bag-of-words algorithm.
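The cleaning step can be sketched as follows (a tiny inline stop-word set stands in for nltk's full English list so the example is self-contained; names are illustrative):

```python
import re

# Stand-in for nltk.corpus.stopwords.words('english'); the real list is larger
STOP_WORDS = {"and", "or", "he", "she", "is", "are", "have", "the", "a", "an"}

def preprocess(comment):
    # Lowercase, keep only alphabetic tokens (dropping punctuation and
    # special characters), then remove stop words.
    tokens = re.findall(r"[a-z']+", comment.lower())
    return [t for t in tokens if t not in STOP_WORDS]
```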

5.2.1 Bag of Words Algorithm:


The bag-of-words model is a way of representing text data when modelling text with machine
learning algorithms. The bag-of-words model is simple to understand and implement and has
seen great success in problems such as language modelling and document classification.

A bag-of-words model, or BoW for short, is a way of extracting features from the text for
use in modelling, such as with machine learning algorithms. The approach is very simple
and flexible and can be used in a myriad of ways for extracting features from documents.
A bag-of-words is a representation of text that describes the occurrence of words within a
document. It involves two things:

1. A vocabulary of known words.

2. A measure of the presence of known words.

Figure 5.2: Code Insta crawler

Figure 5.3: Code Insta crawler

Figure 5.4: Labelled Data set sample

Figure 5.5: Data set sample

Here we also make use of inbuilt functions for the BoW algorithm, via nltk's collocation
finder utilities. It is called a "bag" of words because any information about the order or
structure of words in the document is discarded; the model is only concerned with whether
known words occur in the document, not where in the document they occur. We implemented
the bag of words as unigram, bigram, trigram, and n-gram models with the help of the nltk
library. These represent the n consecutive terms of a text, and as the selection of n changes,
the performance also changes.
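In the {feature: True} dictionary shape that nltk classifiers consume, the unigram/bigram/trigram features can be sketched as (helper names are illustrative):

```python
def ngrams(tokens, n):
    # All contiguous n-token sequences
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bow_features(tokens, n_values=(1, 2, 3)):
    # Binary bag-of-words feature dict over the chosen n-gram sizes,
    # in the {feature: True} form expected by nltk classifiers.
    feats = {}
    for n in n_values:
        for gram in ngrams(tokens, n):
            feats[" ".join(gram)] = True
    return feats
```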

Figure 5.6: Bow Implementation

5.3 Machine Learning Algorithms


We use four machine learning algorithms: Naive Bayes, decision tree, logistic regression,
and SVM. All are implemented using the nltk tool, and we compare their results using the
performance measures accuracy, precision, recall, and F-measure.

5.3.1 Naive Bayes Algorithm


NB has been used to construct cyber bullying prediction models in several studies. NB
classifiers are constructed by applying Bayes' theorem with independence assumptions between
features. Bayesian learning is commonly used for text classification. This model assumes that
the text is generated by a parametric model and utilizes training data to compute Bayes-optimal
estimates of the model parameters. It then categorizes test data with these estimates.

NB classifiers can deal with an arbitrary number of continuous or categorical independent
features. Under the assumption that the features are independent, a high-dimensional
density estimation task is reduced to one-dimensional kernel density estimation. The NB
algorithm is a learning algorithm grounded in the use of Bayes' theorem with strong
(naive) independence assumptions. It is one of the most commonly used
machine learning algorithms and has been employed as a classifier in
numerous social media-based studies.
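Our implementation uses nltk's NaiveBayesClassifier; the following from-scratch sketch only illustrates the underlying computation, with add-one (Laplace) smoothing as an assumed choice:

```python
import math
from collections import Counter, defaultdict

def train_nb(samples):
    # samples: list of (token_list, label) pairs
    labels = Counter(lab for _, lab in samples)   # class frequencies (priors)
    counts = defaultdict(Counter)                 # per-class word counts
    for tokens, lab in samples:
        counts[lab].update(tokens)
    vocab = {t for tokens, _ in samples for t in tokens}
    return labels, counts, vocab

def predict_nb(model, tokens):
    labels, counts, vocab = model
    total = sum(labels.values())
    best, best_lp = None, -math.inf
    for lab, n in labels.items():
        lp = math.log(n / total)                          # log prior
        denom = sum(counts[lab].values()) + len(vocab)    # Laplace denominator
        for t in tokens:
            lp += math.log((counts[lab][t] + 1) / denom)  # log likelihood
        if lp > best_lp:
            best, best_lp = lab, lp
    return best
```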

Figure 5.7: Implementation Naive Bayes

5.3.2 Decision Tree


Decision tree classifiers have been used in the construction of cyber bullying prediction models
in many studies. Decision trees are easy to understand and interpret; hence, the decision tree
algorithm can be used to analyze data and build a graphic model for classification. The most
commonly used improved decision tree algorithm for cyber bullying prediction is C4.5, which
can be explained as follows. Given N examples, C4.5 first produces an initial tree through a
divide-and-conquer strategy. If all examples in N belong to the same class or N is small, the
tree is a leaf labelled with the most frequent class in N. Otherwise, a test is selected, typically
the highest information gain test on a single attribute with two or more outcomes. Taking this
test as the root of the tree creates a partition of N into subsets N1, N2, N3, ... according to the
outcome for each example, and the same procedure is applied recursively to each subset. The
parameters used in our system include binary=True, entropy_cutoff=0.8, depth_cutoff=5, and
support_cutoff=30.
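The entropy-based splitting criterion can be illustrated directly (helper names are ours; nltk's decision tree performs the equivalent computation internally):

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy (bits) of a list of class labels; C4.5 favours the
    # attribute test whose outcome partition most reduces this quantity.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, subsets):
    # Entropy of the parent minus the size-weighted entropy of the subsets.
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in subsets)
```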

Figure 5.8: Implementation Decision Tree

5.3.3 Logistic Regression


Logistic regression is one of the common techniques that machine learning imported from
statistics. It is an algorithm that builds a separating hyperplane between two data sets using
the logistic function. The logistic regression algorithm takes inputs (features) and generates a
prediction according to the probability of the input belonging to a class. For example, if the
probability is >0.5, the instance is classified as the positive class; otherwise, the prediction
is the other (negative) class. The parameters we use include algorithm='gis', trace=0,
max_iter=10, and min_lldelta=0.5.
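The decision rule described above amounts to thresholding the logistic (sigmoid) function; a hand-rolled sketch with illustrative weights, not nltk's trained model:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_logreg(weights, bias, features, threshold=0.5):
    # Positive class when P(y=1|x) = sigmoid(w.x + b) exceeds the threshold
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if sigmoid(z) > threshold else 0
```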

5.3.4 Support Vector Machine


Support vector machine (SVM) is a supervised machine learning classifier that is commonly
used in text classification. An SVM is constructed by generating a separating hyperplane in
the feature space of two classes such that the distance between the hyperplane and the
nearest data point of each class is maximized. Theoretically, SVM was developed from
statistical learning theory. In the SVM algorithm, the optimal separating hyperplane is the
one that minimizes the misclassification achieved in the training step; the approach is based
on minimizing classification risk. SVM was initially established to classify linearly separable
classes: a 2D plane comprises linearly separable objects from different classes (e.g., positive
or negative), and SVM aims to separate the two classes effectively by identifying the unique
hyperplane that provides the maximum margin, maximizing the distance between the
hyperplane and the nearest data point of each class.

Figure 5.9: Implementation Logistic Regression
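The two quantities involved, the side of the hyperplane a point falls on and the margin 2/||w|| that training maximizes, can be written out directly (an illustrative sketch with hand-picked parameters, not a trained SVM):

```python
import math

def svm_decision(w, b, x):
    # Side of the separating hyperplane: sign of w.x + b
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

def margin_width(w):
    # Distance between the two supporting hyperplanes, 2/||w||;
    # SVM training chooses w, b to maximize this while separating classes.
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))
```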

In real-world applications, precisely determining the separating hyperplane is difficult and
in several cases nearly impossible. SVM was extended to adapt to these cases and can now
be used as a classifier for non-separable classes. SVM is a capable classification algorithm:
it can effectively separate non-linearly separable features by converting them to a
high-dimensional space using the kernel trick. The advantages of SVM are its high speed,
scalability, capability to make predictions in real time, and ability to update training
patterns dynamically.

SVM has been used to develop cyber bullying prediction models and has been found effective
and efficient. For example, Chen et al. [9] applied SVM to construct a cyber bullying
prediction model for the detection of offensive content in SM. SM content with potential
cyber bullying was extracted, and the SVM prediction model was applied to detect offensive
content. The results showed that SVM is more accurate in detecting user offensiveness than
Naïve Bayes (NB), although NB is faster than SVM. Chavan and Shylaja [1] proposed the
use of SVM to build a classifier for the detection of cyber bullying in social networking sites.

Figure 5.10: Implementation SVM

Data containing offensive words were extracted from social networking sites and utilized to
build a cyber bullying SVM prediction model. The SVM classifier detected cyber bullying
more accurately than LR did. Others used SVM to build a gender-specific cyber bullying
prediction model: an SVM text classifier was created with gender-specific characteristics,
and the resulting model enhanced the detection of cyber bullying in SM. One study developed
an SVM-based cyber bullying detection model for a social network site, trained on data
containing cyber bullying extracted from that site; the researchers found that the SVM-based
model effectively detected cyber bullying. Some constructed an SVM-based cyber bullying
detection model for YouTube. Data were collected from YouTube comments on videos posted
on the site and used to train SVM and construct a cyber bullying detection model, which was
then used to detect cyber bullying. The results suggested that the SVM-based model is more
reliable, though not as accurate as, a rule-based model; however, it is more accurate than NB
and the tree-based J48. SVM has also been proposed for the detection of cyber bullying on
Twitter: an SVM-based cyber bullying model constructed from data extracted from Twitter
detected cyber bullying better than NB- and LR-based detection models did.

5.4 Performance Measures


Researchers measure the effectiveness of a proposed model, i.e., how successfully the model
can distinguish cyber bullying from non-cyber bullying, by using various evaluation measures.
Reviewing common evaluation metrics in the research community is important for understanding
the performance of competing models. The most commonly used metrics for evaluating cyber
bullying classifiers for SM websites are as follows.

5.4.1 Accuracy
Accuracy is a description of systematic errors, a measure of statistical bias; low accuracy
causes a difference between a result and a "true" value (ISO calls this trueness). The term is
also used to describe a combination of both types of observational error (random and
systematic), so high accuracy requires both high precision and high trueness.

Accuracy = (tp + tn) / (tp + tn + fp + fn)    (5.1)

5.4.2 Precision, Recall, and F-Measure


Precision: the fraction of relevant instances among the retrieved instances, i.e., the fraction
of retrieved instances that are actually relevant. Precision is calculated using the formula
below.

precision = tp / (tp + fp)    (5.2)

Recall: the fraction of the relevant instances that are successfully retrieved. It is given by
the formula below.

recall = tp / (tp + fn)    (5.3)

F-measure: the F-measure is the harmonic mean of precision and recall and is used to assess
the test's accuracy. It is defined as:

F-measure = (2 × precision × recall) / (precision + recall)    (5.4)
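Equations (5.2)-(5.4) applied to raw confusion-matrix counts can be packaged as a small helper (an illustrative sketch, not the project's evaluation code):

```python
def precision_recall_f(tp, fp, fn):
    # Equations (5.2)-(5.4): precision, recall, and their harmonic mean
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```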

5.4.3 Area Under the Curve (AUC)


AUC measures the discriminative ability of the classifier across operating points. The main
benefit of using AUC as an evaluation metric is that it gives a more robust measurement than
the accuracy metric in class-imbalance situations.

The AUC-ROC curve is a performance measurement for classification problems at various
threshold settings. The ROC is a probability curve, and the AUC represents the degree or
measure of separability: it tells how capable the model is of distinguishing between classes.
The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s.
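AUC can also be computed without plotting the ROC curve, via its rank (Mann-Whitney) interpretation: the probability that a randomly chosen positive example is scored above a randomly chosen negative one (an illustrative sketch):

```python
def auc(scores, labels):
    # Mann-Whitney form of AUC: fraction of (positive, negative) pairs in
    # which the positive is scored higher; ties count as half a win.
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```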

5.5 Outputs

Figure 5.11: Naive Bayes Unigram

Figure 5.12: Naive Bayes bigram

Figure 5.13: Naive Bayes Trigram

Figure 5.14: Naive Bayes Ngram

Figure 5.15: Naive Bayes after data split

Figure 5.16: Decision Tree Unigram

Figure 5.17: Decision Tree Bigram

Figure 5.18: Decision Tree Trigram

Figure 5.19: Decision Tree Ngram

Figure 5.20: Decision Tree after Data split

Figure 5.21: Logistic Regression Unigram

Figure 5.22: Logistic Regression Bigram

Figure 5.23: Logistic Regression Trigram

Figure 5.24: Logistic Regression Ngram

Figure 5.25: Logistic Regression after data split

Figure 5.26: SVM Unigram

Figure 5.27: SVM Bigram

Figure 5.28: SVM Trigram

Figure 5.29: SVM Ngram

Figure 5.30: SVM after data split

CHAPTER 6

DESIGN

6.1 Data flow Diagrams, Architecture Diagram and Conceptual Diagram

Figure 6.1: Level 0

Figure 6.2: Level 1

Figure 6.3: Level 2

Figure 6.4: Level 3

Figure 6.5: Architecture

Figure 6.6: Conceptual Diagram

6.2 Module Description


6.2.1 Admin Module
This module supervises and sets up the back end of the whole system. Management of the
bits and pieces of the entire system comes under this module. The module starts at the
primary level, where the data is gathered with the help of data scraping tools. This data
consists of useful and redundant items, and the administrator keeps only those items which
are useful for the classification process. Prior to the processing steps, the data is
pre-processed, one of the decisive steps carried out by the administrator.

The pre-processing mainly deals with turning the data into a form more suitable for
processing. This includes steps such as stop-word removal, repeated-letter removal, and
noise removal. The output of pre-processing makes the data more suitable for the main
processing stage, to which it is then sent.

The secondary purpose of the admin module is to monitor bullies and bullying activity. Once
the output of the classification is obtained, the admin can keep an eye on the cyber bully:
for every bullying incident, the attacker is marked and their bullying count is incremented.
A user is declared a public bully if he or she (the attacker) continues to bully a targeted
user (the victim) or any other user multiple times. In subsequent data gathering, the data of
these public bullies is studied most closely.

Based on the classification output, the administrator can also contact the victim and other
unmarked users to warn them about cyber bullying incidents that might occur. This in turn
helps reduce the chances of further cyber bullying activity and helps prevent attacks at an
earlier stage.

6.3 Hardware and Software Requirements


6.3.1 7th Gen Intel Core i3-7100U Processor
The Intel Core i3-7100U is a dual-core processor of the Kaby Lake architecture. It offers two
CPU cores clocked at 2.4 GHz (without Turbo Boost) and supports Hyper-Threading to work
with up to 4 threads at once. The architectural differences are fairly small compared with
the Skylake generation, so per-MHz performance should be essentially the same. The SoC
incorporates a dual-channel DDR4 memory controller and an Intel HD Graphics 620 graphics
unit (clocked at 300-1000 MHz). It is fabricated in an improved 14nm FinFET process at
Intel. Compared with the older Skylake Core i3-6100U, the i3-7100U offers a 100 MHz higher
clock speed. Intel essentially uses the same microarchitecture as Skylake, so per-MHz
performance does not differ; the manufacturer mainly reworked the Speed Shift technology for
faster dynamic adjustment of voltages and clocks, and the improved 14nm process permits
much higher frequencies combined with better efficiency than before.

6.3.2 4GB DDR4 RAM Intel HD Graphics 620


For graphics, the system uses Intel HD Graphics 620, integrated into the seventh-generation
Kaby Lake CPU. It draws from system memory (the 4 GB of RAM), and this allocation is
dynamic rather than a fixed amount. Since Intel HD Graphics 620 is an integrated graphics
card, its memory is dynamic: it uses the system's memory, which could go up to 32 GB.

6.3.3 1TB HDD:Storage


A 1 TB hard drive provides roughly 1,000 GB of storage space. The larger the hard drive, the
more data and files it can hold; larger drives are measured in terabytes. 1 TB is twice as
large as 500 GB, and also roughly twice as expensive. If you are buying a hard drive, 1 TB is
entirely reasonable; if you are buying an SSD, even 500 GB is likely more than enough. Storage
is characterized by mass-storage capacity, accessible storage space, data-access performance,
device form factor and connection type. An HDD is an electromechanical storage device that
stores and retrieves digital data using magnetic storage on one or more rigid, rapidly
rotating disks (platters) coated with a magnetic material.

6.3.4 Python 3.7.3


Python 3.7.3 is fast and can be used for web development. Python 3.7 introduced the built-in
breakpoint() function, which makes debugging more flexible and interactive. The new
dataclasses module makes it more convenient to write your own classes, since the special
methods are generated automatically. Python 3.7 brings several improvements to the table:
better performance, improved core support and forward references. Context variables are
variables that can have different values depending on their context. They are similar to
thread-local storage, in which each execution thread may hold a different value for a
variable; with context variables, however, there may be several contexts within a single
execution thread. Their primary use case is keeping track of variables in concurrent
asynchronous tasks. Python 3.7.3 also includes several features aimed at developers, along
with a number of more specialized improvements: there is less overhead in calling many
methods of the standard library, method calls are up to 20 percent faster overall, the
startup time of Python itself is reduced by 10-30 percent, and importing the typing module
is several times faster.
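Two of the 3.7 additions mentioned above, data classes and context variables, can be illustrated with a minimal self-contained sketch (the class and variable names are ours, for illustration only):

```python
from dataclasses import dataclass
import contextvars

# dataclass: special methods (__init__, __repr__, __eq__) are generated.
@dataclass
class Comment:
    user: str
    text: str

c1 = Comment("alice", "hello")
c2 = Comment("alice", "hello")
print(c1 == c2)  # the generated __eq__ compares field values

# Context variable: each context can hold its own value for the variable.
request_id = contextvars.ContextVar("request_id", default="none")

def show():
    return request_id.get()

ctx = contextvars.copy_context()
ctx.run(request_id.set, "req-42")  # set only inside the copied context
print(show(), ctx.run(show))       # the outer context keeps its default
```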

6.3.5 Spyder
Spyder is an open source cross-platform integrated development environment (IDE) for
scientific programming in the Python language. Spyder integrates with a number of prominent
packages in the scientific Python stack, including NumPy, SciPy, Matplotlib, pandas, IPython,
SymPy and Cython, as well as other open source software. It is released under the MIT license.

Spyder is extensible with first- and third-party plugins, includes support for interactive
tools for data inspection and embeds Python-specific code quality assurance and introspection
instruments, such as Pyflakes, Pylint and Rope. It is available cross-platform through
Anaconda, on Windows, on macOS through MacPorts, and on major Linux distributions such
as Arch Linux, Debian, Fedora, Gentoo Linux, openSUSE and Ubuntu.

6.3.6 Windows Operating System


Microsoft Windows, commonly referred to as Windows, is a group of proprietary graphical
operating system families, all of which are developed and marketed by Microsoft. Each family
caters to a certain sector of the computing industry. Active Microsoft Windows families
include Windows NT and Windows IoT; these may encompass subfamilies, e.g. Windows
Server or Windows Embedded Compact (Windows CE). Defunct Microsoft Windows families
include Windows 9x, Windows Mobile and Windows Phone.

As of February 2020, the most recent version of Windows for PCs, tablets and embedded
devices is Windows 10. The most recent version for server computers is Windows Server,
version 1909. Other commonly used versions of Windows for PCs are Windows 8 and 8.1,
Windows 7 and Windows XP.


CHAPTER 7

RESULTS & DISCUSSION

We compared four different machine learning algorithms for detecting bullying: Naive Bayes,
Decision Tree, Logistic Regression and Support Vector Machine. As a first stage, we
pre-processed the data by removing punctuation, stop words, etc., and then performed feature
extraction using the Bag-of-Words (BoW) algorithm. For more precise features we tried four
different combinations of consecutive words: uni-gram, bi-gram, tri-gram and n-gram. To
compare the performance of these algorithms, the main evaluation metrics we used are
accuracy, precision, recall and F-measure. From our analysis, we found that the Naive Bayes
algorithm with bi-grams has the highest performance across all measures.
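The uni-gram and bi-gram bag-of-words features can be sketched as below; this mirrors the helper functions listed in Appendix B, except that the appendix selects bigrams with NLTK's chi-squared collocation scorer rather than taking all consecutive pairs. The sample tokens are illustrative.

```python
# Bag-of-words features: each token (or token pair) becomes a boolean feature.
def bag_of_words(words):
    return {word: True for word in words}

def bag_of_bigrams(words):
    # All consecutive word pairs; Appendix B instead picks the top-scoring
    # bigrams via NLTK's BigramCollocationFinder with chi-squared scoring.
    return {pair: True for pair in zip(words, words[1:])}

tokens = ["you", "are", "so", "dumb"]
print(bag_of_words(tokens))
print(bag_of_bigrams(tokens))
```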

We also found that the performance can be improved by changing the parameters of each
algorithm and by changing the data split. We therefore tried different data splits, parameter
changes and an increased quantity of data, and observed that the performance varied in every
case. From these results, we conclude that the Naive Bayes algorithm is the best suited for
cyber bullying detection on the data set we used; it also gave the best performance under
parameter and data-split changes. To confirm the results we performed cross-validation with
k = 5, and the results were verified again.
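The k = 5 cross-validation splits the data into five folds and holds out one fold per round. A dependency-free sketch of the fold logic is shown below; the actual runs used the NLTK classifiers listed in Appendix B.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for k roughly equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples)
                     if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, k=5))
print(len(folds))   # 5 rounds
print(folds[0])     # first round holds out samples 0 and 1
```

Each classifier is trained on the four training folds and evaluated on the held-out fold, and the five scores are averaged.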

7.1 Results

Accuracy Precision Recall F-measure


Bullying .65 .74 .83 .25
Non-Bullying .65 .65 .71 .76

Table 7.1: Naive Bayes with Uni-gram


Accuracy Precision Recall F-measure


Bullying .79 .77 .77 .35
Non-Bullying .79 .67 .67 .79

Table 7.2: Naive Bayes with Bi-gram

Accuracy Precision Recall F-measure


Bullying .83 .78 .44 .56
Non-Bullying .83 .71 .91 .79

Table 7.3: Naive Bayes with Tri-gram

Accuracy Precision Recall F-measure


Bullying .75 .78 .14 .24
Non-Bullying .75 .63 .97 .76

Table 7.4: Naive Bayes with n-gram

Accuracy Precision Recall F-measure


Bullying .64 .57 .61 .42
Non-Bullying .64 .68 .72 .73

Table 7.5: Decision Tree with Uni-gram

Accuracy Precision Recall F-measure


Bullying .66 .56 .56 .52
Non-Bullying .66 .7 .7 .74

Table 7.6: Decision Tree with Bi-gram

Accuracy Precision Recall F-measure


Bullying .72 .63 .63 .57
Non-Bullying .72 .71 .71 .75

Table 7.7: Decision Tree with Tri-gram

Accuracy Precision Recall F-measure


Bullying .64 .57 .33 .42
Non-Bullying .64 .65 .83 .73

Table 7.8: Decision Tree with n-gram


Accuracy Precision Recall F-measure


Bullying .6 .57 .57 .42
Non-Bullying .6 .61 .61 .75

Table 7.9: Logistic Regression with Uni-gram

Accuracy Precision Recall F-measure


Bullying .62 .57 .57 .52
Non-Bullying .62 .62 .62 .76

Table 7.10: Logistic Regression with Bi-gram

Accuracy Precision Recall F-measure


Bullying .64 .65 .65 .61
Non-Bullying .64 .61 .61 .75

Table 7.11: Logistic Regression with Tri-gram

Accuracy Precision Recall F-measure


Bullying .73 .57 .33 .42
Non-Bullying .73 .6 1 .75

Table 7.12: Logistic Regression with n-gram

Accuracy Precision Recall F-measure


Bullying .64 .58 .58 .43
Non-Bullying .64 .6 .6 .75

Table 7.13: SVM with Uni-gram

Accuracy Precision Recall F-measure


Bullying .61 .57 .57 .52
Non-Bullying .61 .61 .61 .76

Table 7.14: SVM with Bi-gram

Accuracy Precision Recall F-measure


Bullying .69 .65 .65 .61
Non-Bullying .69 .59 .59 .74

Table 7.15: SVM with Tri-gram


Accuracy Precision Recall F-measure


Bullying .76 .58 .34 .43
Non-Bullying .76 .6 .95 .75

Table 7.16: SVM with n-gram


CHAPTER 8

CONCLUSION & FUTURE SCOPE

8.1 Conclusion
Cyber bullying is an increasingly frequent problem among adolescents, and it produces
considerable social concern. India has a growing number of social media users, and many
of them have reported being victims of cyber bullying. Such cases have a high probability
of triggering mental insecurity and problems among users. If this harmful behaviour can be
curbed at an early age, perpetrators are unlikely to continue down that path. The method
we propose to detect and prevent cyber bullying incidents across social media is to extract
textual and other data from the social media platform and analyze that data for such bullying
occurrences.

Our system initially accepts the input dataset and pre-processes it to obtain only the
wanted data. After that, the system checks whether a selected group of text is bullying
or not using sentiment analysis. Users carrying out bullying activities are categorized
as cyber bullies. If a cyber bully continues such activities against multiple users, he is
categorized as a public bully, so that his further activities can be monitored. To achieve
this, we used four different machine learning algorithms: Naive Bayes, Decision Tree,
Logistic Regression and Support Vector Machine. The performance of each algorithm was
recorded separately and also through ensemble methods. We then compared the performance
parameters and concluded that the best prediction is obtained with the Naive Bayes
algorithm using a bi-gram bag-of-words pre-processing technique, with an accuracy of 79%
and an F-measure of 38%. The other algorithms performed in roughly the same ballpark, with
accuracies ranging from 65% up to 79%.

From the analysis of the obtained results, it is evident that many users are victims of cyber
bullying incidents. These results also provide insight into the importance of preventing and
eradicating the growing problem of cyber bullying, and this system can help save many
youngsters from this kind of attack by issuing warnings.

8.2 Future Scope


These results provide insight into the importance of preventing and eradicating the growing
problem of cyber bullying; the statistical study regarding the same gave us an accuracy of
79%. The relevance of the project and the implications it has on the current generation


is visible from many aspects. We often come across sensational news stories around the
world in which cyber bullying plays a major role.

Considering the present scenario, the importance of our project to detect and monitor cyber
bullying is clear. So far we have worked on the core part of the system and provided results
that help in the prevention and eradication of the growing problem of cyber bullying; the
system will also help protect youngsters from such attacks by issuing warnings. When
deployed on any social media platform, it can protect that platform's users from the adverse
effects of cyber bullying. We could even implement the same behind a user interface that
could assist any of the popular social media platforms, and thus also serve our purpose of
detecting and preventing cyber bullying.

Detection of cyber bullies alone is not a sufficient breakthrough given the seriousness of
the topic. Detection combined with continuous monitoring of the detected bullies would be
an outstanding contribution to the problem of cyber bullying prevalent in our society.
Hence the practice of monitoring detected bullies will be one of the major influential
innovations that can be achieved.

It is common in the social media behaviour of many people to express their emotions and
feelings through a vast number of emojis; the use of emojis in comments and chats is quite
common across various social media platforms. Therefore it is equally important to identify
bullying from the usage of emojis as well. Detecting and monitoring cyber bullying from
emoji usage is another refinement that, if implemented, would add more impact and prove
more effective in the detection and monitoring of cyber bullies.
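As a starting point for the proposed emoji extension, emojis can be extracted from a comment and treated as extra bag-of-words features. This is only a sketch: the Unicode ranges below cover a few common emoji blocks and are an assumption, not an exhaustive emoji definition.

```python
import re

# Rough pattern for a few common emoji code-point blocks (not exhaustive).
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001F6FF"   # pictographs, emoticons, transport symbols
    "\U0001F900-\U0001F9FF]"   # supplemental symbols (e.g. clown face)
)

def emoji_features(text):
    """Return bag-of-words style boolean features for each emoji found."""
    return {f"EMOJI_{ch}": True for ch in EMOJI_RE.findall(text)}

feats = emoji_features("you are a clown \U0001F921\U0001F602")
print(feats)
```

These features could be merged into the same feature dictionaries the classifiers in Appendix B already consume.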


REFERENCES

[1] M. Dadvar, D. Trieschnigg, R. Ordelman, and F. de Jong, “Improving cyberbullying


detection with user context,” in European Conference on Information Retrieval,
pp. 693–696, Springer, 2013.

[2] Y. Chen, Y. Zhou, S. Zhu, and H. Xu, “Detecting offensive language in social media to
protect adolescent online safety,” in 2012 International Conference on Privacy, Security,
Risk and Trust and 2012 International Confernece on Social Computing, pp. 71–80,
IEEE, 2012.

[3] J. Ratkiewicz, M. D. Conover, M. Meiss, B. Gonçalves, A. Flammini, and F. M. Menczer,


“Detecting and tracking political abuse in social media,” in Fifth international AAAI
conference on weblogs and social media, 2011.

[4] W. Dong, S. S. Liao, Y. Xu, and X. Feng, “Leading effect of social media for financial
fraud disclosure: A text mining based analytics,” in AMCIS, 2016.

[5] V. S. Chavan and Shylaja S S, “Machine learning approach for detection of


cyber-aggressive comments by peers on social media network,” in 2015 International
Conference on Advances in Computing, Communications and Informatics (ICACCI),
pp. 2354–2358, 2015.

[6] A. Aggarwal, A. Rajadesingan, and P. Kumaraguru, “Phishari: Automatic realtime


phishing detection on twitter,” in 2012 eCrime Researchers Summit, pp. 1–12, IEEE,
2012.

[7] M. S. Rahman, T.-K. Huang, H. V. Madhyastha, and M. Faloutsos, “Frappe: detecting


malicious facebook applications,” in Proceedings of the 8th international conference on
Emerging networking experiments and technologies, pp. 313–324, ACM, 2012.

[8] S. Yardi, D. Romero, G. Schoenebeck, et al., “Detecting spam in a twitter network,”


First Monday, vol. 15, no. 1, 2010.

[9] R. Sugandhi, A. Pande, S. Chawla, A. Agrawal, and H. Bhagat, “Methods for detection
of cyberbullying: A survey,” in 2015 15th International Conference on Intelligent
Systems Design and Applications (ISDA), pp. 173–177, 2015.

[10] C. Yang, R. Harkreader, J. Zhang, S. Shin, and G. Gu, “Analyzing spammers’ social
networks for fun and profit: A case study of cyber criminal ecosystem on twitter,” in


Proceedings of the 21st International Conference on World Wide Web, WWW ’12, (New
York, NY, USA), pp. 71–80, ACM, 2012.

[11] S. Abu-Nimeh, T. Chen, and O. Alzubi, “Malicious and spam posts in online social
networks,” Computer, vol. 44, pp. 23–28, Sep. 2011.

[12] H. Hosseinmardi, S. A. Mattson, R. I. Rafiq, R. Han, Q. Lv, and S. Mishra, “Prediction


of cyberbullying incidents on the instagram social network,” CoRR, vol. abs/1508.06257,
2015.

[13] F. Toriumi, T. Nakanishi, M. Tashiro, and K. Eguchi, “Analysis of user behavior


on private chat system,” in 2015 IEEE/WIC/ACM International Conference on Web
Intelligence and Intelligent Agent Technology (WI-IAT), vol. 3, pp. 1–4, 2015.

[14] R. Badonnel, R. State, I. Chrisment, and O. Festor, “A management platform for


tracking cyber predators in peer-to-peer networks,” in Second International Conference
on Internet Monitoring and Protection (ICIMP 2007), pp. 11–11, 2007.

[16] S. M. Isa, L. Ashianti, et al., “Cyberbullying classification using text mining,” in 2017
1st International Conference on Informatics and Computational Sciences (ICICoS),
pp. 241–246, IEEE, 2017.

[17] V. Subrahmanian and S. Kumar, “Predicting human behavior: The next frontiers,”
Science, vol. 355, no. 6324, pp. 489–489, 2017.

[18] H. Lauw, J. C. Shafer, R. Agrawal, and A. Ntoulas, “Homophily in the digital world: A
livejournal case study,” IEEE Internet Computing, vol. 14, pp. 15–23, March 2010.

[19] M. Al-garadi, K. Varathan, and S. D. Ravana, “Cybercrime detection in online


communications: The experimental case of cyberbullying detection in the twitter
network,” Computers in Human Behavior, vol. 63, pp. 433–443, Oct. 2016.

[20] L. Phillips, C. Dowling, K. Shaffer, N. Hodas, and S. Volkova, “Using social media
to predict the future: a systematic literature review,” arXiv preprint arXiv:1706.06134,
2017.

[21] J. Heidemann, M. Klier, and F. Probst, “Online social networks: A survey of a global
phenomenon,” Computer networks, vol. 56, no. 18, pp. 3866–3878, 2012.

[22] J. K. Peterson and J. Densley, “Is social media a gang? toward a selection,
facilitation, orenhancement explanation of cyber violence,” Aggression Violent Behav.,
pp. 3866–3878, 2016.


[23] N. M. Shekokar and K. B. Kansara, “Security against sybil attack in social network,” in
2016 International Conference on Information Communication and Embedded Systems
(ICICES), pp. 1–5, IEEE, 2016.

[24] M. Rybnicek, R. Poisel, and S. Tjoa, “Facebook watchdog: A research agenda


for detecting online grooming and bullying activities,” in 2013 IEEE International
Conference on Systems, Man, and Cybernetics, pp. 2854–2859, 2013.

[25] A. Upadhyay, A. Chaudhari, S. Ghale, S. Pawar, et al., “Detection and prevention


measures for cyberbullying and online grooming,” in 2017 International Conference on
Inventive Systems and Control (ICISC), pp. 1–4, IEEE, 2017.

[26] M. Fire, R. Goldschmidt, and Y. Elovici, “Online social networks: threats and solutions,”
IEEE Communications Surveys & Tutorials, vol. 16, no. 4, pp. 2019–2036, 2014.

[27] G. R. Weir, F. Toolan, and D. Smeed, “The threats of social networking: Old wine in
new bottles?,” Information Security Technical Report, vol. 16, no. 2, pp. 38 – 43, 2011.
Social Networking Threats.


APPENDIX A
INSTA CRAWLER
A.1 insta dump.py
from random import randint
from time import sleep
import shelve
import json

from igramscraper.instagram import Instagram

#hashtags = [’adhd’,’afraid’,’alone’,’antibullying’,’anxiety’,’anxietyat

max_pages = 10
output_shelf = ’newoutput.shelf’
output_json = ’newoutput.json’
hash_tags_file = ’hashtags.txt’

hashtags = [line.rstrip(’\n’).strip() for line in open(hash_tags_file)]


instagram = Instagram(sleep_between_requests=3)

with shelve.open(output_shelf) as db:


for hashtag in hashtags:
print(’Dumping results for hashtag "{}"!’.format(hashtag))

has_next_page = True
next_page_id = ’’
page_num = 1

while(has_next_page):
print(’Dumping page {} [{}] for #{}’.format(page_num, next_p

try:
results = instagram.get_paginate_medias_by_tag(hashtag,
except Exception as err:
nap_time = randint(10,100)
print(’Failed to dump page {} [{}] for #{} (nap: {})’.fo


print(err)
sleep(nap_time)
continue

if page_num == 1:
result_count = results[’count’]
print(’There are "{}" posts for #{}!’.format(result_coun

for media in results[’medias’]:


try:
output = {}

output[’id’] = _id = media.identifier


output[’short_code’] = media.short_code
output[’owner’] = media.owner.identifier
output[’created_time’] = media.created_time
output[’modified_time’] = media.modified

output[’caption’] = _caption = media.caption

if ’#’ in _caption:
_hashtags = list({ tag.strip("#") for tag in _ca
if not _hashtags: _hashtags = None
else:
_hashtags = None
output[’hashtags’] = _hashtags

output[’comm_count’] = _com_count = media.comments_


if _com_count == 0:
output[’comments’] = media.comments
else:
_comments = []
try:
_comments_result = instagram.get_media_comme
except Exception as err:
nap_time = randint(10,100)
print(’Failed to dump comments for post id {
print(err)


sleep(nap_time)
else:
for _c in _comments_result[’comments’]:
_com = {}
_com[’id’] = _c.identifier
_com[’created_time’] = _c.created_at
_com[’modified_time’] = _c.modified
_com[’owner’] = _c.owner.identifier
_com[’comment’] = _comment = _c.text

if ’#’ in _comment:
_hashtags = list({ tag.strip("#") fo
if not _hashtags: _hashtags = None
else:
_hashtags = None
_com[’hashtags’] = _hashtags

_comments.append(_com)

output[’comments’] = _comments

output[’likes_count’] = media.likes_count if hasattr

output[’is_ad’] = media.is_ad if hasattr(media, ’is_


output[’type’] = _type = media.type
if _type == ’image’ :
output[’media’] = [media.image_high_resolution_u
elif _type == ’video’:
output[’media’] = media.video_url
elif _type == ’carousel’:
output[’media’] = media.carousel_media
else:
output[’media’] = None

output[’loc_id’] = media.location_id if hasattr(medi


output[’loc_name’] = media.location_name if hasattr(
output[’loc_slug’] = media.location_slug if hasattr(


#dumping output
db[_id] = output
db.sync()

with open(output_json, ’a’) as jf:


json.dump(output, jf)
jf.write(’\n’)
#done dumping output

except Exception as err:


print(’Failed to save result..’)
print(media)
print(err)

has_next_page = results[’hasNextPage’]
if has_next_page and page_num <= max_pages:
page_num+=1
next_page_id = results[’maxId’]
continue
else:
break

A.2 Hashtags Used


god
prayer
muslim
hindu
christian
gay
lesbian
couple
fatty
body
sexy
thin
socialdrinker
tattoo


APPENDIX B
PREDICTION CODE
B.1 Prediction Code
import pandas as pd
#import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

data = pd.read_csv("test_data.csv")
#print(data)
#%%
#preprocessing
Tweet = []
Labels = []

for row in data["Tweet"]:


#tokenize words
words = word_tokenize(row)
#remove punctuations
clean_words = [word.lower() for word in words if word not in set(str
#remove stop words
english_stops = set(stopwords.words(’english’))
characters_to_remove = ["’’",’‘‘’,"rt","https","’","\",""","\u200b",
clean_words = [word for word in clean_words if word not in english_s
clean_words = [word for word in clean_words if word not in set(chara

#print(clean_words)

for row in data["Text Label"]:


Labels.append(row)

#%%
combined = zip(data["Tweet"], data["Text Label"])


def bag_of_words(words):
return dict([(word, True) for word in words])

Final_Data = []
for r, v in combined:
bag_of_words(r)
Final_Data.append((bag_of_words(r),v))
#%%
import random
random.shuffle(Final_Data)
print(len(Final_Data))
#%%
#Split the data into training and test
train_set, test_set = Final_Data[0:800], Final_Data[800:]
#%%

#Naive Bayes for Unigrams, check accuracy


import nltk
import collections
from nltk.metrics.scores import (accuracy, precision, recall, f_measure)
from nltk import metrics

#Naive Bayes for Unigrams, Recall Measure


nb_classifier = nltk.NaiveBayesClassifier.train(train_set)

nbrefset = collections.defaultdict(set)
nbtestset = collections.defaultdict(set)

for i, (feats, label) in enumerate(test_set):


nbrefset[label].add(i)
observed = nb_classifier.classify(feats)
nbtestset[observed].add(i)

print("Naive Bayes Performance with Unigrams ")


print("Accuracy:",nltk.classify.accuracy(nb_classifier, test_set))
#print("UnigramNB Recall")


print(’Bullying precision:’, precision(nbrefset[’Bullying’], nbtestset[’

print(’Bullying recall:’, recall(nbtestset[’Bullying’], nbrefset[’Bullyi


print("")
print("Bullying F measure:",f_measure(nbtestset[’Bullying’], nbrefset[’B

print(’Non bullying precision:’, precision(nbrefset[’Non-Bullying’], nbt

print(’Non Bullying recall:’, recall(nbtestset[’Non-Bullying’], nbrefset


print("")
print("Non bullying F measure:",f_measure(nbtestset['Non-Bullying'], nbre
#Find most informative features

#classifier.show_most_informative_features(n=10)
print("*********")

#%%

#Decision Tree for Unigrams


from nltk.classify import DecisionTreeClassifier

dt_classifier = DecisionTreeClassifier.train(train_set,
binary=True,
entropy_cutoff=0.8,
depth_cutoff=5,
support_cutoff=30)
refset = collections.defaultdict(set)
testset = collections.defaultdict(set)

for i, (feats, label) in enumerate(test_set):


refset[label].add(i)
observed = dt_classifier.classify(feats)
testset[observed].add(i)
print("Decision Tree with Unigrams ")

print("Accuracy:",nltk.classify.accuracy(dt_classifier, test_set))

print(’Bullying precision:’, precision(refset[’Bullying’], testset[’Bull


print(’Bullying recall:’, recall(testset[’Bullying’], refset[’Bullying’]

print("Bullying F measure:",f_measure(testset[’Bullying’], refset[’Bully

print("")
print(’Non bullying precision:’, precision(refset[’Non-Bullying’], tests
print(’Non Bullying recall:’, recall(testset[’Non-Bullying’], refset[’No
print("Non- Bullying F measure:",f_measure(testset[’Non-Bullying’], refs
print("*********************")
# In[13]:

#Logisitic Regression for Unigrams


from nltk.classify import MaxentClassifier

logit_classifier = MaxentClassifier.train(train_set, algorithm=’gis’, tr

for i, (feats, label) in enumerate(test_set):


refset[label].add(i)
observed = logit_classifier.classify(feats)
testset[observed].add(i)

print("Logistic regression with Unigrams ")

print("Accuracy:",nltk.classify.accuracy(logit_classifier, test_set))
print(’bullying precision:’, precision(refset[’Bullying’], testset[’Bull

print(’Bullying recall:’, recall(testset[’Bullying’], refset[’Bullying’]


print("Bullying F Measure:",f_measure(testset[’Bullying’], refset[’Bully
print("")

print(’Non bullying precision:’, precision(refset[’Non-Bullying’], tests

print(’Non Bullying recall:’, recall(testset[’Non-Bullying’], refset[’No


print("Non Bullying F measure:",f_measure(testset['Non-Bullying'], refse


print("*********************")
# In[14]:

#Support Vector Machine for Unigrams


from nltk.classify import SklearnClassifier
from sklearn.svm import SVC
SVM_classifier = SklearnClassifier(SVC(), sparse=False).train(train_set)

for i, (feats, label) in enumerate(test_set):


refset[label].add(i)
observed = SVM_classifier.classify(feats)
testset[observed].add(i)

print("Support Vector Machine with Unigrams ")

print("Accuracy:",nltk.classify.accuracy(SVM_classifier, test_set))
print(’Bullying precision:’, precision(refset[’Bullying’], testset[’Bull

print(’Bullying recall:’, recall(testset[’Bullying’], refset[’Bullying’]


print('Bullying F measure:', f_measure(testset['Bullying'], refset['Bully

print("")
print(’Non bullying precision:’, precision(refset[’Non-Bullying’], tests

print(’Non Bullying recall:’, recall(testset[’Non-Bullying’], refset[’No


print(’Non bullying f measure:’,f_measure(testset[’Non-Bullying’], refse
print("******************************************************")
print("#####################")
# In[15]:

#Same thing with Bigrams


from nltk import bigrams, trigrams
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

combined = zip(data["Tweet"],data["Text Label"])


#Bag of Words of Bigrams


def bag_of_bigrams_words(words, score_fn=BigramAssocMeasures.chi_sq, n=2
bigram_finder = BigramCollocationFinder.from_words(words)
bigrams = bigram_finder.nbest(score_fn, n)
return bag_of_words(bigrams)

#In[18]:

Final_Data2 =[]

for z, e in combined:
bag_of_bigrams_words(z)
Final_Data2.append((bag_of_bigrams_words(z),e))
# In[19]:

import random
random.shuffle(Final_Data2)
print(len(Final_Data2))

train_set, test_set = Final_Data2[0:800], Final_Data2[800:]

import nltk
import collections
from nltk.metrics.scores import (accuracy, precision, recall, f_measure)
from nltk import metrics
#%%
#Naive Bayes for Bigrams

refsets = collections.defaultdict(set)


testsets = collections.defaultdict(set)

classifier = nltk.NaiveBayesClassifier.train(train_set)

for i, (feats, label) in enumerate(test_set):


refsets[label].add(i)


observed = classifier.classify(feats)
testsets[observed].add(i)

print("Naive Bayes Performance with Bigrams ")


print("Accuracy:",nltk.classify.accuracy(classifier, test_set))
print(’Bullying precision:’, precision(refsets[’Bullying’], testsets[’Bu
#classifier.show_most_informative_features(n=10)

print(’Bullying recall:’, recall(testsets[’Bullying’], refsets[’Bullying

print(’Bullying F measure:’,f_measure(testsets[’Bullying’], refsets[’Bul


print("")

print(’Non bullying precision:’, precision(refsets[’Non-Bullying’], test

print(’Non Bullying recall:’, recall(testsets[’Non-Bullying’], refsets[’

print('Non bullying F measure:',f_measure(testsets['Non-Bullying'], ref


print("*******************")
# In[22]:

#Decision Tree for Bigrams


from nltk.classify import DecisionTreeClassifier

dt_classifier = DecisionTreeClassifier.train(train_set,
binary=True,
entropy_cutoff=0.8,
depth_cutoff=5,
support_cutoff=30)
refset = collections.defaultdict(set)
testset = collections.defaultdict(set)

for i, (feats, label) in enumerate(test_set):


refset[label].add(i)
observed = dt_classifier.classify(feats)
testset[observed].add(i)


print("Decision Tree Performance with Bigrams ")


print("Accuracy:",nltk.classify.accuracy(dt_classifier, test_set))
print(’bullying precision:’, precision(refset[’Bullying’], testset[’Bull
print(’Bullying recall:’, recall(testset[’Bullying’], refset[’Bullying’]

print(’Bullying F measure:’,f_measure(testset[’Bullying’], refset[’Bully


print("")

print(’Nonbullying precision:’, precision(refset[’Non-Bullying’], testse

print(’NonBullying recall:’, recall(testset[’Non-Bullying’], refset[’Non

print('Non bullying f measure:',f_measure(testset['Non-Bullying'], refse


print("*******************")
# In[23]:

#Logistic Regression for Bigrams


from nltk.classify import MaxentClassifier

logit_classifier = MaxentClassifier.train(train_set, algorithm=’gis’, tr

for i, (feats, label) in enumerate(test_set):


refset[label].add(i)
observed = logit_classifier.classify(feats)
testset[observed].add(i)

print("Logistic Regression Performance with Bigrams ")


print("Accuracy:",nltk.classify.accuracy(logit_classifier, test_set))
print(’Bullying precision:’, precision(refset[’Bullying’], testset[’Bull

print(’Bullying recall:’, recall(testset[’Bullying’], refset[’Bullying’]


print(’Bullying F measure:’,f_measure(testset[’Bullying’], refset[’Bully
print("")

print(’Non bullying precision:’, precision(refset[’Non-Bullying’], tests

print(’Non Bullying recall:’, recall(testset[’Non-Bullying’], refset[’No


print(’Non bullying f measure:’,f_measure(testset[’Non-Bullying’], refse


print("*******************")
# In[24]:

#Support Vector Machine for Bigrams


from nltk.classify import SklearnClassifier
from sklearn.svm import SVC
SVM_classifier = SklearnClassifier(SVC(), sparse=False).train(train_set)

for i, (feats, label) in enumerate(test_set):


refset[label].add(i)
observed = SVM_classifier.classify(feats)
testset[observed].add(i)

print("Support Vector Machine Performance with Bigrams ")


print("Accuracy:", nltk.classify.accuracy(SVM_classifier, test_set))
print('Bullying precision:', precision(refset['Bullying'], testset['Bullying']))
print('Bullying recall:', recall(refset['Bullying'], testset['Bullying']))
print('Bullying F measure:', f_measure(refset['Bullying'], testset['Bullying']))
print("")

print('Non-bullying precision:', precision(refset['Non-Bullying'], testset['Non-Bullying']))

print('Non-bullying recall:', recall(refset['Non-Bullying'], testset['Non-Bullying']))

print('Non-bullying F measure:', f_measure(refset['Non-Bullying'], testset['Non-Bullying']))
print("*******************************")
print("##########################")

# In[25]:

combined = zip(data["Tweet"],data["Text Label"])

# In[26]:

#Same thing with Trigrams


from nltk import bigrams, trigrams


from nltk.collocations import TrigramCollocationFinder
from nltk.metrics import TrigramAssocMeasures

def bag_of_trigrams_words(words, score_fn=TrigramAssocMeasures.chi_sq, n=200):
    trigram_finder = TrigramCollocationFinder.from_words(words)
    trigrams = trigram_finder.nbest(score_fn, n)
    return bag_of_words(trigrams)

# In[27]:

Final_Data3 =[]

for z, e in combined:
    Final_Data3.append((bag_of_trigrams_words(z), e))

import random
random.shuffle(Final_Data3)
print(len(Final_Data3))

train_set, test_set = Final_Data3[0:800], Final_Data3[800:]


#%%

#Naive Bayes for Trigrams


import nltk
import collections
from nltk.metrics.scores import (accuracy, precision, recall, f_measure)
from nltk import metrics

refsets = collections.defaultdict(set)


testsets = collections.defaultdict(set)

classifier = nltk.NaiveBayesClassifier.train(train_set)


for i, (feats, label) in enumerate(test_set):
    refsets[label].add(i)
    observed = classifier.classify(feats)
    testsets[observed].add(i)

print("Naive Bayes Performance with Trigrams ")


print("Accuracy:",nltk.classify.accuracy(classifier, test_set))

print('Bullying precision:', precision(refsets['Bullying'], testsets['Bullying']))

print('Bullying recall:', recall(refsets['Bullying'], testsets['Bullying']))
print('Bullying F measure:', f_measure(refsets['Bullying'], testsets['Bullying']))
print("")
#classifier.show_most_informative_features(n=10)

print('Non-bullying precision:', precision(refsets['Non-Bullying'], testsets['Non-Bullying']))

print('Non-bullying recall:', recall(refsets['Non-Bullying'], testsets['Non-Bullying']))
print('Non-bullying F measure:', f_measure(refsets['Non-Bullying'], testsets['Non-Bullying']))
print("*******************")
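For reference, the precision, recall and F-measure values printed above reduce to simple set arithmetic over the reference and prediction index sets that the loops build. A minimal stdlib-only sketch with made-up index sets (no NLTK required):

```python
# Minimal sketch (plain Python, no NLTK) of what the refsets/testsets
# bookkeeping above computes; the index sets below are made up.

demo_ref = {0, 1, 2, 3}    # indices whose gold label is Bullying
demo_test = {1, 2, 3, 5}   # indices the classifier labelled Bullying

demo_tp = len(demo_ref & demo_test)          # true positives = 3
demo_precision = demo_tp / len(demo_test)    # 3/4 = 0.75
demo_recall = demo_tp / len(demo_ref)        # 3/4 = 0.75
demo_f = 2 * demo_precision * demo_recall / (demo_precision + demo_recall)

print(demo_precision, demo_recall, demo_f)   # 0.75 0.75 0.75
```

NLTK's `precision(reference, test)` and `recall(reference, test)` expect exactly these two sets, reference first, which is why the argument order in the calls above matters.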
# In[30]:

#Decision Tree for Trigrams


from nltk.classify import DecisionTreeClassifier

dt_classifier = DecisionTreeClassifier.train(train_set,
                                             binary=True,
                                             entropy_cutoff=0.8,
                                             depth_cutoff=5,
                                             support_cutoff=30)
refset = collections.defaultdict(set)
testset = collections.defaultdict(set)

for i, (feats, label) in enumerate(test_set):
    refset[label].add(i)
    observed = dt_classifier.classify(feats)
    testset[observed].add(i)
print("Decision Tree Performance with Trigrams ")


print("Accuracy:",nltk.classify.accuracy(dt_classifier, test_set))
print('Bullying precision:', precision(refset['Bullying'], testset['Bullying']))

print('Bullying recall:', recall(refset['Bullying'], testset['Bullying']))

print('Bullying F measure:', f_measure(refset['Bullying'], testset['Bullying']))
print("")

print('Non-bullying precision:', precision(refset['Non-Bullying'], testset['Non-Bullying']))

print('Non-bullying recall:', recall(refset['Non-Bullying'], testset['Non-Bullying']))

print('Non-bullying F measure:', f_measure(refset['Non-Bullying'], testset['Non-Bullying']))


print("*********************")
# In[48]:

#Logistic Regression for Trigrams


from nltk.classify import MaxentClassifier

logit_classifier = MaxentClassifier.train(train_set, algorithm='gis', trace=0)

refset = collections.defaultdict(set)
testset = collections.defaultdict(set)

for i, (feats, label) in enumerate(test_set):
    refset[label].add(i)
    observed = logit_classifier.classify(feats)
    testset[observed].add(i)
print("Logistic Regression Performance with Trigrams ")
print("Accuracy:",nltk.classify.accuracy(logit_classifier, test_set))
print('Bullying precision:', precision(refset['Bullying'], testset['Bullying']))

print('Bullying recall:', recall(refset['Bullying'], testset['Bullying']))

print('Bullying F measure:', f_measure(refset['Bullying'], testset['Bullying']))
print("")
print('Non-bullying precision:', precision(refset['Non-Bullying'], testset['Non-Bullying']))

print('Non-bullying recall:', recall(refset['Non-Bullying'], testset['Non-Bullying']))

print('Non-bullying F measure:', f_measure(refset['Non-Bullying'], testset['Non-Bullying']))
print("*************************")
# In[31]:

#Support Vector Machine for Trigrams


from nltk.classify import SklearnClassifier


from sklearn.svm import SVC
SVM_classifier = SklearnClassifier(SVC(), sparse=False).train(train_set)

refset = collections.defaultdict(set)
testset = collections.defaultdict(set)

for i, (feats, label) in enumerate(test_set):
    refset[label].add(i)
    observed = SVM_classifier.classify(feats)
    testset[observed].add(i)
print("SVM Performance with Trigrams ")
print("Accuracy:",nltk.classify.accuracy(SVM_classifier, test_set))
print('Bullying precision:', precision(refset['Bullying'], testset['Bullying']))

print('Bullying recall:', recall(refset['Bullying'], testset['Bullying']))

print('Bullying F measure:', f_measure(refset['Bullying'], testset['Bullying']))
print("")
print('Non-bullying precision:', precision(refset['Non-Bullying'], testset['Non-Bullying']))

print('Non-bullying recall:', recall(refset['Non-Bullying'], testset['Non-Bullying']))

print('Non-bullying F measure:', f_measure(refset['Non-Bullying'], testset['Non-Bullying']))
print("***********************************")
print("#########################")
#%%
combined = zip(data["Tweet"],data["Text Label"])

# In[32]:

#Combine both unigrams, bigrams, and trigrams

# Import Bigram metrics - we will use these to identify the top 200 bigrams
def bigrams_words(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    return bigrams

from nltk.collocations import TrigramCollocationFinder

# Import Trigram metrics - we will use these to identify the top 200 trigrams


from nltk.metrics import TrigramAssocMeasures

def trigrams_words(words, score_fn=TrigramAssocMeasures.chi_sq, n=200):
    trigram_finder = TrigramCollocationFinder.from_words(words)
    trigrams = trigram_finder.nbest(score_fn, n)
    return trigrams

# bag of ngrams
def bag_of_Ngrams_words(words):
    bigramBag = bigrams_words(words)

    # The following two for loops convert each tuple into a string
    for b in range(0, len(bigramBag)):
        bigramBag[b] = ' '.join(bigramBag[b])

    trigramBag = trigrams_words(words)
    for t in range(0, len(trigramBag)):
        trigramBag[t] = ' '.join(trigramBag[t])

    return bag_of_words(trigramBag + bigramBag + list(words))
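The helper above joins each n-gram tuple into a single space-separated string before it becomes a feature key. A stdlib-only sketch of that construction, without NLTK's chi-squared scoring (the sample tokens are made up):

```python
# Stdlib-only sketch of how n-gram tuples become space-joined feature keys,
# mirroring the ' '.join step in bag_of_Ngrams_words; tokens are made up.

demo_tokens = ["you", "are", "so", "dumb"]

demo_bigrams = [" ".join(p) for p in zip(demo_tokens, demo_tokens[1:])]
demo_trigrams = [" ".join(t) for t in zip(demo_tokens, demo_tokens[1:], demo_tokens[2:])]

print(demo_bigrams)   # ['you are', 'are so', 'so dumb']
print(demo_trigrams)  # ['you are so', 'are so dumb']

# Combined unigram + bigram + trigram feature dictionary, as the
# return statement above assembles it
demo_features = {w: True for w in demo_tokens + demo_bigrams + demo_trigrams}
```

The difference in the real code is that `nbest(score_fn, n)` keeps only the n highest-scoring collocations rather than every adjacent pair or triple.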

# In[34]:

Final_Data4 = []

for z, e in combined:
    Final_Data4.append((bag_of_Ngrams_words(z), e))

# In[35]:

import random
random.shuffle(Final_Data4)


print(len(Final_Data4))

train_set, test_set = Final_Data4[0:800], Final_Data4[800:]

import nltk
import collections
from nltk.metrics.scores import (accuracy, precision, recall, f_measure)
from nltk import metrics
#%%
#Naive Bayes for Ngrams

refsets = collections.defaultdict(set)


testsets = collections.defaultdict(set)

classifier = nltk.NaiveBayesClassifier.train(train_set)

for i, (feats, label) in enumerate(test_set):
    refsets[label].add(i)
    observed = classifier.classify(feats)
    testsets[observed].add(i)

print("Naive Bayes Performance with Ngrams ")


print("Accuracy:",nltk.classify.accuracy(classifier, test_set))

#classifier.show_most_informative_features(n=10)

print('Bullying precision:', precision(refsets['Bullying'], testsets['Bullying']))

print('Bullying recall:', recall(refsets['Bullying'], testsets['Bullying']))
print('Bullying F measure:', f_measure(refsets['Bullying'], testsets['Bullying']))
print("")
print('Non-bullying precision:', precision(refsets['Non-Bullying'], testsets['Non-Bullying']))
print('Non-bullying recall:', recall(refsets['Non-Bullying'], testsets['Non-Bullying']))
print('Non-bullying F measure:', f_measure(refsets['Non-Bullying'], testsets['Non-Bullying']))
print("********")


# In[32]:

#Decision tree with n gram


import collections
from nltk import metrics
from nltk.metrics.scores import (accuracy, precision, recall, f_measure)
from nltk.classify import DecisionTreeClassifier
from nltk.classify.util import accuracy
dt_classifier = DecisionTreeClassifier.train(train_set,
                                             binary=True,
                                             entropy_cutoff=0.8,
                                             depth_cutoff=5,
                                             support_cutoff=30)
print("Performance of Decision Tree with Ngram")
from nltk.classify.util import accuracy
print(’Accuracy:’,accuracy(dt_classifier, test_set))

refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)

for i, (Final_Data, label) in enumerate(test_set):
    refsets[label].add(i)
    observed = dt_classifier.classify(Final_Data)
    testsets[observed].add(i)

print('Bullying precision:', precision(refsets['Bullying'], testsets['Bullying']))

print('Bullying recall:', recall(refsets['Bullying'], testsets['Bullying']))
print('Bullying F-measure:', f_measure(refsets['Bullying'], testsets['Bullying']))
print('Non-bullying precision:', precision(refsets['Non-Bullying'], testsets['Non-Bullying']))
print('Non-bullying recall:', recall(refsets['Non-Bullying'], testsets['Non-Bullying']))
print('Non-bullying F-measure:', f_measure(refsets['Non-Bullying'], testsets['Non-Bullying']))

print("***************")
# In[33]:

#Create Logistic Regression with ngram


from nltk.classify import MaxentClassifier


import collections
from nltk.metrics.scores import (accuracy, precision, recall, f_measure)

logit_classifier = MaxentClassifier.train(train_set, algorithm='gis', trace=0)

from nltk.classify.util import accuracy
print("Performance of Logistic Regression with Ngrams", "\n", "Accuracy:", accuracy(logit_classifier, test_set))
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
for i, (Final_Data, label) in enumerate(test_set):
    refsets[label].add(i)
    observed = logit_classifier.classify(Final_Data)
    testsets[observed].add(i)

print('Bullying precision:', precision(refsets['Bullying'], testsets['Bullying']))

print('Bullying recall:', recall(refsets['Bullying'], testsets['Bullying']))
print('Bullying F-measure:', f_measure(refsets['Bullying'], testsets['Bullying']))
print('Non-bullying precision:', precision(refsets['Non-Bullying'], testsets['Non-Bullying']))
print('Non-bullying recall:', recall(refsets['Non-Bullying'], testsets['Non-Bullying']))
print('Non-bullying F-measure:', f_measure(refsets['Non-Bullying'], testsets['Non-Bullying']))
print("****************")

# In[34]:

# SVM model

from nltk.classify import SklearnClassifier


from sklearn.svm import SVC

SVM_classifier = SklearnClassifier(SVC(), sparse=False).train(train_set)


from nltk.classify.util import accuracy
print("Performance of SVM with Ngrams", "\n", "Accuracy:", accuracy(SVM_classifier, test_set))
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
for i, (Final_Data, label) in enumerate(test_set):
    refsets[label].add(i)
    observed = SVM_classifier.classify(Final_Data)
    testsets[observed].add(i)


print('Bullying precision:', precision(refsets['Bullying'], testsets['Bullying']))

print('Bullying recall:', recall(refsets['Bullying'], testsets['Bullying']))
print('Bullying F-measure:', f_measure(refsets['Bullying'], testsets['Bullying']))
print('Non-bullying precision:', precision(refsets['Non-Bullying'], testsets['Non-Bullying']))
print('Non-bullying recall:', recall(refsets['Non-Bullying'], testsets['Non-Bullying']))
print('Non-bullying F-measure:', f_measure(refsets['Non-Bullying'], testsets['Non-Bullying']))
print("*****")
print("#####")

# In[42]:

zl = zip(data["Tweet"],data["Text Label"])

#define a bag_of_words function to return word, True.

def bag_of_words(words):
    return dict([(word, True) for word in words])

# Define another function that will return words that are in words, but not in badwords

def bag_of_words_not_in_set(words, badwords):
    return bag_of_words(set(words) - set(badwords))

from nltk.corpus import stopwords

# define a bag_of_non_stopwords function to return word, True.

def bag_of_non_stopwords(words, stopfile='english'):
    badwords = stopwords.words(stopfile)
    return bag_of_words_not_in_set(words, badwords)
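`bag_of_non_stopwords` simply subtracts the stopword list before building the feature dictionary. A self-contained sketch using a tiny hardcoded stopword set in place of `nltk.corpus.stopwords` (the real English list is much longer):

```python
# Stdlib-only sketch of the stopword filtering above; the stopword set here
# is a tiny illustrative subset, not the real NLTK English list.

def demo_bag_of_words(words):
    return {word: True for word in words}

def demo_bag_of_words_not_in_set(words, badwords):
    return demo_bag_of_words(set(words) - set(badwords))

demo_stopwords = {"you", "are", "a", "the"}   # illustrative subset only
demo_features = demo_bag_of_words_not_in_set(["you", "are", "a", "loser"],
                                             demo_stopwords)
print(demo_features)  # {'loser': True}
```

Dropping stopwords this way shrinks the feature space, at the cost of losing word order inside the removed function words.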

from nltk.collocations import BigramCollocationFinder

# Import Bigram metrics - we will use these to identify the top 200 bigrams
from nltk.metrics import BigramAssocMeasures


def bag_of_bigrams_words(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    return bag_of_words(bigrams)

#Creating our unigram featureset dictionary for modeling

Final_Data = []

for k, v in zl:
    Final_Data.append((bag_of_bigrams_words(k), v))

import random
random.shuffle(Final_Data)

# splits the data: roughly 70% for training and the remainder for testing

train_set, test_set = Final_Data[0:860], Final_Data[860:]
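The hard-coded index 860 above encodes the intended 70/30 split for this dataset size. A sketch of deriving the cut point from a ratio instead, on synthetic labelled data (the counts are illustrative):

```python
# Stdlib-only sketch: derive the train/test boundary from a ratio rather
# than a hard-coded index; the labelled data here is synthetic.

import random

demo_data = [("tweet %d" % i, "Bullying" if i % 3 == 0 else "Non-Bullying")
             for i in range(1000)]
random.shuffle(demo_data)          # shuffle before slicing, as above

demo_split = int(len(demo_data) * 0.7)   # 70% train, 30% test
demo_train, demo_test = demo_data[:demo_split], demo_data[demo_split:]

print(len(demo_train), len(demo_test))   # 700 300
```

Computing the boundary from `len(...)` keeps the split ratio stable if the dataset grows or shrinks.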


#%%
# Now we will calculate accuracy, precision, recall, and f-measure using Naive Bayes

import nltk
import collections
from nltk.metrics.scores import (accuracy, precision, recall, f_measure)
nb_classifier = nltk.NaiveBayesClassifier.train(train_set)
nb_classifier.show_most_informative_features(10)

from nltk.classify.util import accuracy


print("Performance when data split changes NB", "\n accuracy:", accuracy(nb_classifier, test_set))

refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)

for i, (Final_Data, label) in enumerate(test_set):
    refsets[label].add(i)
    observed = nb_classifier.classify(Final_Data)
    testsets[observed].add(i)

print('Bullying precision:', precision(refsets['Bullying'], testsets['Bullying']))

print('Bullying recall:', recall(refsets['Bullying'], testsets['Bullying']))
print('Bullying F-measure:', f_measure(refsets['Bullying'], testsets['Bullying']))
print('Non-bullying precision:', precision(refsets['Non-Bullying'], testsets['Non-Bullying']))
print('Non-bullying recall:', recall(refsets['Non-Bullying'], testsets['Non-Bullying']))
print('Non-bullying F-measure:', f_measure(refsets['Non-Bullying'], testsets['Non-Bullying']))

print("*************")

# In[48]:

import collections
from nltk import metrics
from nltk.metrics.scores import (accuracy, precision, recall, f_measure)
from nltk.classify import DecisionTreeClassifier
from nltk.classify.util import accuracy
dt_classifier = DecisionTreeClassifier.train(train_set,
                                             binary=True,
                                             entropy_cutoff=0.8,
                                             depth_cutoff=5,
                                             support_cutoff=30)
print("Performance of Decision Tree", "\n accuracy:", accuracy(dt_classifier, test_set))

refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)

for i, (Final_Data, label) in enumerate(test_set):
    refsets[label].add(i)
    observed = dt_classifier.classify(Final_Data)
    testsets[observed].add(i)

print('Bullying precision:', precision(refsets['Bullying'], testsets['Bullying']))


print('Bullying recall:', recall(refsets['Bullying'], testsets['Bullying']))

print('Bullying F-measure:', f_measure(refsets['Bullying'], testsets['Bullying']))
print('Non-bullying precision:', precision(refsets['Non-Bullying'], testsets['Non-Bullying']))
print('Non-bullying recall:', recall(refsets['Non-Bullying'], testsets['Non-Bullying']))
print('Non-bullying F-measure:', f_measure(refsets['Non-Bullying'], testsets['Non-Bullying']))

print("**************")
# In[49]:

#Create Logistic Regression model to compare


from nltk.classify import MaxentClassifier
import collections
from nltk.metrics.scores import (accuracy, precision, recall, f_measure)

logit_classifier = MaxentClassifier.train(train_set, algorithm='gis', trace=0)

from nltk.classify.util import accuracy
print("Performance of Logistic Regression", "\n accuracy:", accuracy(logit_classifier, test_set))
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
for i, (Final_Data, label) in enumerate(test_set):
    refsets[label].add(i)
    observed = logit_classifier.classify(Final_Data)
    testsets[observed].add(i)

print('Bullying precision:', precision(refsets['Bullying'], testsets['Bullying']))

print('Bullying recall:', recall(refsets['Bullying'], testsets['Bullying']))
print('Bullying F-measure:', f_measure(refsets['Bullying'], testsets['Bullying']))
print('Non-bullying precision:', precision(refsets['Non-Bullying'], testsets['Non-Bullying']))
print('Non-bullying recall:', recall(refsets['Non-Bullying'], testsets['Non-Bullying']))
print('Non-bullying F-measure:', f_measure(refsets['Non-Bullying'], testsets['Non-Bullying']))
print("*******")

# In[50]:

# SVM model

Jyothi Engineering College, Cheruthuruthy Dept. of CSE, April 2020


82

from nltk.classify import SklearnClassifier


from sklearn.svm import SVC

SVM_classifier = SklearnClassifier(SVC(), sparse=False).train(train_set)


from nltk.classify.util import accuracy
print("Performance of SVM", "\n accuracy:", accuracy(SVM_classifier, test_set))
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
for i, (Final_Data, label) in enumerate(test_set):
    refsets[label].add(i)
    observed = SVM_classifier.classify(Final_Data)
    testsets[observed].add(i)

print('Bullying precision:', precision(refsets['Bullying'], testsets['Bullying']))

print('Bullying recall:', recall(refsets['Bullying'], testsets['Bullying']))
print('Bullying F-measure:', f_measure(refsets['Bullying'], testsets['Bullying']))
print('Non-bullying precision:', precision(refsets['Non-Bullying'], testsets['Non-Bullying']))
print('Non-bullying recall:', recall(refsets['Non-Bullying'], testsets['Non-Bullying']))
print('Non-bullying F-measure:', f_measure(refsets['Non-Bullying'], testsets['Non-Bullying']))
print("******")
#%%

'''#Data split and ngram

zl = zip(data["Tweet"],data["Text Label"])

#define a bag_of_words function to return word, True.

def bag_of_words(words):
    return dict([(word, True) for word in words])

# Define another function that will return words that are in words, but not in badwords

def bag_of_words_not_in_set(words, badwords):
    return bag_of_words(set(words) - set(badwords))

from nltk.corpus import stopwords


#define a bag_of_non_stopwords function to return word, True.

def bag_of_non_stopwords(words, stopfile='english'):
    badwords = stopwords.words(stopfile)
    return bag_of_words_not_in_set(words, badwords)

from nltk.collocations import BigramCollocationFinder

# Import Bigram metrics - we will use these to identify the top 200 bigrams
from nltk.metrics import BigramAssocMeasures

def bigrams_words(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    return bigrams

from nltk.collocations import TrigramCollocationFinder

# Import Trigram metrics - we will use these to identify the top 200 trigrams
from nltk.metrics import TrigramAssocMeasures

def trigrams_words(words, score_fn=TrigramAssocMeasures.chi_sq, n=200):
    trigram_finder = TrigramCollocationFinder.from_words(words)
    trigrams = trigram_finder.nbest(score_fn, n)
    return trigrams

def bag_of_Ngrams_words(words):
    bigramBag = bigrams_words(words)

    # The following two for loops convert each tuple into a string
    for b in range(0, len(bigramBag)):
        bigramBag[b] = ' '.join(bigramBag[b])

    trigramBag = trigrams_words(words)
    for t in range(0, len(trigramBag)):
        trigramBag[t] = ' '.join(trigramBag[t])

    return bag_of_words(trigramBag + bigramBag + list(words))

#Creating our unigram featureset dictionary for modeling

Final_Data = []

for k, v in zl:
    Final_Data.append((bag_of_Ngrams_words(k), v))

import random
random.shuffle(Final_Data)

# splits the data: roughly 70% for training and the remainder for testing

train_set, test_set = Final_Data[0:860], Final_Data[860:]


#%%
# Now we will calculate accuracy, precision, recall, and f-measure using Naive Bayes

import nltk
import collections
from nltk.metrics.scores import (accuracy, precision, recall, f_measure)
nb_classifier = nltk.NaiveBayesClassifier.train(train_set)
nb_classifier.show_most_informative_features(10)

from nltk.classify.util import accuracy


print("Performance when data split changes", "\n accuracy:", accuracy(nb_classifier, test_set))

refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)

for i, (Final_Data, label) in enumerate(test_set):
    refsets[label].add(i)
    observed = nb_classifier.classify(Final_Data)
    testsets[observed].add(i)

print('Bullying precision:', precision(refsets['Bullying'], testsets['Bullying']))

print('Bullying recall:', recall(refsets['Bullying'], testsets['Bullying']))
print('Bullying F-measure:', f_measure(refsets['Bullying'], testsets['Bullying']))
print('Non-bullying precision:', precision(refsets['Non-Bullying'], testsets['Non-Bullying']))
print('Non-bullying recall:', recall(refsets['Non-Bullying'], testsets['Non-Bullying']))
print('Non-bullying F-measure:', f_measure(refsets['Non-Bullying'], testsets['Non-Bullying']))

# In[44]:

# In[48]:

import collections
from nltk import metrics
from nltk.metrics.scores import (accuracy, precision, recall, f_measure)
from nltk.classify import DecisionTreeClassifier
from nltk.classify.util import accuracy
dt_classifier = DecisionTreeClassifier.train(train_set,
                                             binary=True,
                                             entropy_cutoff=0.8,
                                             depth_cutoff=5,
                                             support_cutoff=30)
print("Performance of Decision Tree", "\n accuracy:", accuracy(dt_classifier, test_set))

refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)

for i, (Final_Data, label) in enumerate(test_set):
    refsets[label].add(i)
    observed = dt_classifier.classify(Final_Data)
    testsets[observed].add(i)

print('Bullying precision:', precision(refsets['Bullying'], testsets['Bullying']))

print('Bullying recall:', recall(refsets['Bullying'], testsets['Bullying']))
print('Bullying F-measure:', f_measure(refsets['Bullying'], testsets['Bullying']))
print('Non-bullying precision:', precision(refsets['Non-Bullying'], testsets['Non-Bullying']))
print('Non-bullying recall:', recall(refsets['Non-Bullying'], testsets['Non-Bullying']))
print('Non-bullying F-measure:', f_measure(refsets['Non-Bullying'], testsets['Non-Bullying']))

# In[49]:

#Create Logistic Regression model to compare


from nltk.classify import MaxentClassifier
import collections
from nltk.metrics.scores import (accuracy, precision, recall, f_measure)

logit_classifier = MaxentClassifier.train(train_set, algorithm='gis', trace=0)

from nltk.classify.util import accuracy
print("Performance of Logistic Regression", "\n accuracy:", accuracy(logit_classifier, test_set))
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
for i, (Final_Data, label) in enumerate(test_set):
    refsets[label].add(i)
    observed = logit_classifier.classify(Final_Data)
    testsets[observed].add(i)

print('Bullying precision:', precision(refsets['Bullying'], testsets['Bullying']))

print('Bullying recall:', recall(refsets['Bullying'], testsets['Bullying']))
print('Bullying F-measure:', f_measure(refsets['Bullying'], testsets['Bullying']))
print('Non-bullying precision:', precision(refsets['Non-Bullying'], testsets['Non-Bullying']))
print('Non-bullying recall:', recall(refsets['Non-Bullying'], testsets['Non-Bullying']))
print('Non-bullying F-measure:', f_measure(refsets['Non-Bullying'], testsets['Non-Bullying']))


# In[50]:

# SVM model

from nltk.classify import SklearnClassifier


from sklearn.svm import SVC

SVM_classifier = SklearnClassifier(SVC(), sparse=False).train(train_set)


from nltk.classify.util import accuracy
print("Performance of SVM", "\n accuracy:", accuracy(SVM_classifier, test_set))
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
for i, (Final_Data, label) in enumerate(test_set):
    refsets[label].add(i)
    observed = SVM_classifier.classify(Final_Data)
    testsets[observed].add(i)

print('Bullying precision:', precision(refsets['Bullying'], testsets['Bullying']))

print('Bullying recall:', recall(refsets['Bullying'], testsets['Bullying']))
print('Bullying F-measure:', f_measure(refsets['Bullying'], testsets['Bullying']))
print('Non-bullying precision:', precision(refsets['Non-Bullying'], testsets['Non-Bullying']))
print('Non-bullying recall:', recall(refsets['Non-Bullying'], testsets['Non-Bullying']))
print('Non-bullying F-measure:', f_measure(refsets['Non-Bullying'], testsets['Non-Bullying']))

#%%
'''#Data split and tri

zl = zip(data["Tweet"],data["Text Label"])

#define a bag_of_words function to return word, True.

# Define another function that will return words that are in words, but

from nltk.corpus import stopwords


#define a bag_of_non_stopwords function to return word, True.

from nltk.collocations import TrigramCollocationFinder

# Import Trigram metrics - we will use these to identify the top 200 trigrams
from nltk.metrics import TrigramAssocMeasures

def trigrams_words(words, score_fn=TrigramAssocMeasures.chi_sq, n=200):
    trigram_finder = TrigramCollocationFinder.from_words(words)
    trigrams = trigram_finder.nbest(score_fn, n)
    return trigrams


# Creating our trigram featureset dictionary for modeling

Final_Data = []

for k, v in zl:
    Final_Data.append((bag_of_trigrams_words(k), v))

import random
random.shuffle(Final_Data)

# splits the data: roughly 70% for training and the remainder for testing

train_set, test_set = Final_Data[0:860], Final_Data[860:]


#%%
# Now we will calculate accuracy, precision, recall, and f-measure using Naive Bayes


import nltk
import collections
from nltk.metrics.scores import (accuracy, precision, recall, f_measure)
nb_classifier = nltk.NaiveBayesClassifier.train(train_set)
nb_classifier.show_most_informative_features(10)

from nltk.classify.util import accuracy


print("Performance when data split changes", "\n accuracy:", accuracy(nb_classifier, test_set))

refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)

for i, (Final_Data, label) in enumerate(test_set):
    refsets[label].add(i)
    observed = nb_classifier.classify(Final_Data)
    testsets[observed].add(i)

print('Bullying precision:', precision(refsets['Bullying'], testsets['Bullying']))

print('Bullying recall:', recall(refsets['Bullying'], testsets['Bullying']))
print('Bullying F-measure:', f_measure(refsets['Bullying'], testsets['Bullying']))
print('Non-bullying precision:', precision(refsets['Non-Bullying'], testsets['Non-Bullying']))
print('Non-bullying recall:', recall(refsets['Non-Bullying'], testsets['Non-Bullying']))
print('Non-bullying F-measure:', f_measure(refsets['Non-Bullying'], testsets['Non-Bullying']))

# In[44]:

# In[48]:

import collections
from nltk import metrics
from nltk.metrics.scores import (accuracy, precision, recall, f_measure)
from nltk.classify import DecisionTreeClassifier


from nltk.classify.util import accuracy


dt_classifier = DecisionTreeClassifier.train(train_set,
                                             binary=True,
                                             entropy_cutoff=0.8,
                                             depth_cutoff=5,
                                             support_cutoff=30)
print("Performance of Decision Tree", "\n accuracy:", accuracy(dt_classifier, test_set))

refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)

for i, (Final_Data, label) in enumerate(test_set):
    refsets[label].add(i)
    observed = dt_classifier.classify(Final_Data)
    testsets[observed].add(i)

print('Bullying precision:', precision(refsets['Bullying'], testsets['Bullying']))

print('Bullying recall:', recall(refsets['Bullying'], testsets['Bullying']))
print('Bullying F-measure:', f_measure(refsets['Bullying'], testsets['Bullying']))
print('Non-bullying precision:', precision(refsets['Non-Bullying'], testsets['Non-Bullying']))
print('Non-bullying recall:', recall(refsets['Non-Bullying'], testsets['Non-Bullying']))
print('Non-bullying F-measure:', f_measure(refsets['Non-Bullying'], testsets['Non-Bullying']))

# In[49]:

#Create Logistic Regression model to compare


from nltk.classify import MaxentClassifier
import collections
from nltk.metrics.scores import (accuracy, precision, recall, f_measure)

logit_classifier = MaxentClassifier.train(train_set, algorithm='gis', trace=0)

from nltk.classify.util import accuracy
print("Performance of Logistic Regression", "\n accuracy:", accuracy(logit_classifier, test_set))
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)


for i, (Final_Data, label) in enumerate(test_set):
    refsets[label].add(i)
    observed = logit_classifier.classify(Final_Data)
    testsets[observed].add(i)

print('Bullying precision:', precision(refsets['Bullying'], testsets['Bullying']))

print('Bullying recall:', recall(refsets['Bullying'], testsets['Bullying']))
print('Bullying F-measure:', f_measure(refsets['Bullying'], testsets['Bullying']))
print('Non-bullying precision:', precision(refsets['Non-Bullying'], testsets['Non-Bullying']))
print('Non-bullying recall:', recall(refsets['Non-Bullying'], testsets['Non-Bullying']))
print('Non-bullying F-measure:', f_measure(refsets['Non-Bullying'], testsets['Non-Bullying']))

# In[50]:

# SVM model

from nltk.classify import SklearnClassifier


from sklearn.svm import SVC

SVM_classifier = SklearnClassifier(SVC(), sparse=False).train(train_set)


from nltk.classify.util import accuracy
print("Performance of SVM", "\n accuracy:", accuracy(SVM_classifier, test_set))
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
for i, (Final_Data, label) in enumerate(test_set):
    refsets[label].add(i)
    observed = SVM_classifier.classify(Final_Data)
    testsets[observed].add(i)

print('Bullying precision:', precision(refsets['Bullying'], testsets['Bullying']))

print('Bullying recall:', recall(refsets['Bullying'], testsets['Bullying']))
print('Bullying F-measure:', f_measure(refsets['Bullying'], testsets['Bullying']))
print('Non-bullying precision:', precision(refsets['Non-Bullying'], testsets['Non-Bullying']))
print('Non-bullying recall:', recall(refsets['Non-Bullying'], testsets['Non-Bullying']))
print('Non-bullying F-measure:', f_measure(refsets['Non-Bullying'], testsets['Non-Bullying']))
