CYBER BULLYING DETECTION SYSTEM
A Project Report
Submitted by
ROHINI K R JEC16CS097
SREEHARI T ANIL JEC16CS113
SREEJITH P M JEC16CS114
YEDUMOHAN P M JEC16CS126
to
APJ Abdul Kalam Technological University
in partial fulfillment of the requirements for the award of the Degree of
Bachelor of Technology (B.Tech)
in
COMPUTER SCIENCE & ENGINEERING
April 2020
DECLARATION
We, the undersigned, hereby declare that the project report “Cyber Bullying Detection
System”, submitted in partial fulfillment of the requirements for the award of the degree of
Bachelor of Technology of the APJ Abdul Kalam Technological University, Kerala, is a
bonafide work done by us under the supervision of Dr. Vinith R. This submission represents
our ideas in our own words and where ideas or words of others have been included, we have
adequately and accurately cited and referenced the original sources. We also declare that
we have adhered to ethics of academic honesty and integrity and have not misrepresented
or fabricated any data or idea or fact or source in this submission. We understand that
any violation of the above will be a cause for disciplinary action by the institute and/or
the University and can also evoke penal action from the sources which have thus not been
properly cited or from whom proper permission has not been obtained. This report has not
been previously used by anybody as a basis for the award of any degree, diploma or similar
title of any other University.
ACKNOWLEDGEMENT
We take this opportunity to thank everyone who helped us profusely, for the successful
completion of our project work. With prayers, we thank God Almighty for his grace and
blessings, for without his unseen guidance, this project would have remained only in our
dreams.
We thank the Management of Jyothi Engineering College and our Principal, Fr. Dr. Jaison
Paul Mulerikkal CMI for providing all the facilities to carry out this project work. We are
grateful to the Head of the Department Fr. Dr. A K George for his valuable suggestions and
encouragement to carry out this project work.
We would like to express our whole hearted gratitude to the project guide Dr. Vinith R
for his encouragement, support and guidance in the right direction during the entire project
work.
We thank our Project Coordinators Mr. Anil Antony & Mr. Unnikrishnan P for their
constant encouragement during the entire project work. We extend our gratitude to all
teaching and non-teaching staff members who were directly or indirectly involved in the successful
completion of this project work.
Finally, we take this opportunity to express our gratitude to our parents for their love, care
and support and also to our friends who have been constant sources of support and inspiration
for completing this project work.
ROHINI K R (JEC16CS097)
SREEHARI T ANIL (JEC16CS113)
SREEJITH P M (JEC16CS114)
YEDUMOHAN P M (JEC16CS126)
VISION OF THE INSTITUTE
Creating eminent and ethical leaders through quality professional education with
emphasis on holistic excellence.
MISSION OF THE INSTITUTE
• To equip the students with appropriate skills for a meaningful career in the global
scenario.
• To inculcate ethical values among students and ignite their passion for holistic
excellence through social initiatives.
VISION OF THE DEPARTMENT
Creating eminent and ethical leaders in the domain of computational sciences through
quality professional education with a focus on holistic learning and excellence.
PROGRAMME EDUCATIONAL OBJECTIVES
PEO 1: Graduates shall have a good foundation in the fundamental and practical
aspects of Mathematics and Engineering Sciences so as to build successful
and enriching careers in the field of Electrical Engineering and allied areas.
PEO 2: Graduates shall learn and adapt themselves to the latest technological
developments in the field of Electrical & Electronics Engineering which will in
turn motivate them to excel in their domains and shall pursue higher education
and research.
PEO 3: Graduates shall have professional ethics and good communication ability along
with entrepreneurial skills and leadership skills, so that they can succeed in
multidisciplinary and diverse fields.
PROGRAMME SPECIFIC OUTCOMES
Graduates possess:
PSO 1: Ability to apply their knowledge and technical competence to solve real world
problems related to electrical power system, control system, power electronics
and industrial drives.
PSO 2: Ability to use technical and computational skills for design and development
of electrical and electronic systems.
PROGRAMME OUTCOMES
COURSE OUTCOMES
COs and their descriptions:
C3O7.1: The students will be able to think innovatively on the development of components, products, processes or technologies in the engineering field.
C3O7.2: The students will be able to analyse the problem requirements and arrive at workable design solutions.
C3O7.3: The students will be able to understand the concept of reverse engineering.
C3O7.4: The students will be able to familiarise themselves with the modern tools used in the process of design and development.
CO MAPPING TO POs
POs
COs PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
C3O7.1
C3O7.2
C3O7.3
C3O7.4
Average
CO MAPPING TO PSOs
PSOs
COs PSO1 PSO2 PSO3
C3O7.1
C3O7.2
C3O7.3
C3O7.4
Average
ABSTRACT
The advancement of social media has played an important role in increasing the population of
youngsters on the web, and it has become the biggest medium for expressing one's thoughts
and emotions. Recent studies report that cyber bullying constitutes a growing problem
among youngsters on the web. These kinds of attacks have a major influence on the current
generation's personal and social life, because youngsters tend to adopt an online life instead
of a real one, which leads them into an imaginary world. We therefore propose a system for
early detection of cyber bullying on the web using sentiment analysis and machine learning
techniques.
Our system initially checks whether a text is bullying or not using sentiment analysis, and
from that it attempts to identify public bullies so that their further activities can be monitored.
By issuing warnings, this system can help protect many youngsters from this kind of attack.
We use four different machine learning algorithms, namely Naive Bayes, Decision Tree,
Logistic Regression and Support Vector Machine, with four different ways of pre-processing
the data using uni-gram, bi-gram, tri-gram and n-gram Bag of Words features. The best
prediction is obtained with the Naive Bayes algorithm using the bi-gram bag-of-words
pre-processing technique, with an accuracy of 79% and an F-measure of 38%.
CONTENTS
4.2 System Development Life Cycle . . . . . . . . . . . . . . . . . . . . . . . 19
4.2.1 Iterative Modelling . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.2.2 Advantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5 Methodology 23
5.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.2 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.2.1 Bag of Words Algorithm: . . . . . . . . . . . . . . . . . . . . . . 24
5.3 Machine Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . 28
5.3.1 Naive Bayes Algorithm . . . . . . . . . . . . . . . . . . . . . . . 28
5.3.2 Decision Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.3.3 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.3.4 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . 30
5.4 Performance Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.4.1 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.4.2 Precision, Recall, and F-Measure . . . . . . . . . . . . . . . . . . 33
5.4.3 Area Under the Curve (AUC) . . . . . . . . . . . . . . . . . . . . 33
5.5 Outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
6 Design 41
6.1 Data flow Diagrams, Architecture Diagram and Conceptual Diagram . . . 41
6.2 Module Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.2.1 Admin Module . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.3 Hardware and Software Requirements . . . . . . . . . . . . . . . . . . . . 45
6.3.1 7th Gen intelcore i3-7100U PROCESSOR . . . . . . . . . . . . . 45
6.3.2 4GB DDR4 RAM Intel HD Graphics 620 . . . . . . . . . . . . . 46
6.3.3 1TB HDD:Storage . . . . . . . . . . . . . . . . . . . . . . . . . 46
6.3.4 Python 3.7.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6.3.5 Spyder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.3.6 Windows Operating System . . . . . . . . . . . . . . . . . . . . 47
7 Results & Discussion 48
7.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
8 Conclusion & Future Scope 52
8.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
8.2 Future Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
References 54
Appendices 56
A Insta Crawler 57
B Prediction Code 61
LIST OF TABLES
LIST OF FIGURES
5.25 Logistic Regression after data split . . . . . . . . . . . . . . . . . . . . . . 38
5.26 SVM Unigram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.27 SVM Bigram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.28 SVM Trigram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.29 SVM Ngram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.30 SVM after data split . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
6.1 Level 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
6.2 Level 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
6.3 Level 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6.4 Level 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6.5 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.6 Conceptual Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
LIST OF ABBREVIATIONS
SM Social Media
OSN Online Social Networking
SVM Support Vector Machine
LSA Latent Semantic Analysis
SVD Singular-Value Decomposition
HCM Human Composition Matrix
URL Uniform Resource Locator
SMOTE Synthetic Minority Over-sampling Technique
AUC Area Under the Curve
API Application Program Interface
SDLC Systems Development Life Cycle
CHAPTER 1
INTRODUCTION
The advancement of social media has played an important role in the growing population of
youngsters on the web. They utilize such platforms mainly for communication and entertainment.
The visible trend nowadays is communicating sentiments through social media, and most social
media profiles look like a portrait of the user's life. The tendency to share every second of
life through different forms of social networks has grown, and Instagram holds the top rank
among youngsters. Hence this work has chosen the Instagram platform, for which the necessary
data collection is easier than for other platforms. The text-based analysis method we used for this research
is facilitated by the availability of numerous public accounts and the public comments related to
them.
When we take social media into account, there are many safety issues, including bullying,
grooming, phishing etc. The aftereffects of these issues form a huge area, and they may lead
to social, mental and physical problems in the current generation. In this research, we are mainly
focusing on detecting bullying in social networks, because the suicidal tendency among youngsters
is an increasing issue in the current scenario. Such people express their feelings either
as extreme depression or extreme anger, which we are able to identify through their posts
by considering the sentiment in the captions, the hashtags used in the posts, etc. Different studies
show that India has the highest occurrence of bullying through social media, so it is
necessary to control it.
Nowadays a lot of research is carried out on bullying detection, avoidance etc., and
preliminary studies show that bullies target people who post something that is
closely related to religious activities, sexual exposure, political activities and so on. Hence the
first step we took was the detection of this kind of post from the Instagram network using
some typical keywords related to such posts, and we extracted the meta information
and the comments too. We compare four different classifiers under four different feature
extraction criteria, compare their results and identify the classifier best suited for our
problem. The four classifiers are Naive Bayes, Decision Tree, Logistic Regression and the
Support Vector Machine algorithm.
1.1 Background
Many studies have been conducted on the contribution of machine learning algorithms to
OSN content analysis in the last few years. Machine learning research has become crucial in
numerous areas and successfully produced many models, tools, and algorithms for handling
large amounts of data to solve real-world problems. Machine learning algorithms have been
used extensively to analyze SM website content for spam, phishing, and cyber bullying
prediction. Aggressive behaviour includes spam propagation, phishing, malware spread, and
cyber bullying. Textual cyber bullying has become the dominant aggressive behaviour in SM
websites because these websites give users full freedom to post on their platforms.
SM websites contain massive amounts of text and/or non-text content and different info
associated with aggressive behaviour. In this work, a content analysis of SM websites is
performed to predict aggressive behaviour. Such an analysis is limited to textual OSN content
for predicting cyber bullying behaviour. Given that cyber bullying may be simply committed,
it is considered a dangerous and fast-spreading aggressive behaviour. Bullies only require
willingness and a laptop or cell phone with an Internet connection to perform misbehaviour
without confronting victims. The popularity and proliferation of SM websites have increased
online bullying activities. Cyber bullying in SM websites is rampant due to the structural
characteristics of SM websites. Cyber bullying in traditional platforms, such as emails or
phone text messages, is performed on a limited number of people. SM websites allow users
to create profiles for establishing friendships and communicating with other users regardless
of geographic location, thus expanding cyber bullying beyond physical location. Anonymous
users may also exist on SM websites, and this has been confirmed to be a primary cause for
increased aggressive user behaviour. Developing an efficient prediction model for predicting
cyber bullying is thus of practical significance. With all these considerations, this work
performs a content-based analysis for predicting textual cyber bullying on SM websites.
Cyber bullying can be committed anywhere and anytime. Escaping from cyber bullying is
difficult because cyber bullying can reach victims anywhere and anytime. It can be committed
by posting comments and statuses for a large potential audience. The victims cannot stop the
spread of such activities. Although SM websites have become an integral part of users’ lives, a
study found that SM websites are the most common platforms for cyber bullying victimization.
[1]A well-known characteristic of SM websites, such as Twitter, is that they allow users to
publicly express and spread their posts to a large audience while remaining anonymous. The
effects of public cyber bullying are worse than those of private ones, and anonymous scenarios
of cyber bullying are worse than non-anonymous cases. Consequently, the severity of cyber
bullying has increased on SM websites, which support public and anonymous scenarios of
cyber bullying. These characteristics make SM websites, such as Twitter, a
dangerous platform for committing cyber bullying.
Recent research has indicated that most experts favour the automatic monitoring of
cyber bullying. A study that examined fourteen groups of adolescents confirmed the pressing
need for automatic monitoring and prediction models for cyber bullying, because
traditional ways of dealing with cyber bullying do not work well in the era of big data and
networks. Moreover, analyzing large amounts of complex data requires machine learning-based
automatic monitoring.
Aside from renovating the means through which people are influenced, SM websites
provide a place for a severe form of misbehaviour among users. Online complex networks,
such as SM websites, changed substantially in the last decade, and this change was stimulated
by the popularity of online communication through SM websites. Online communication
has become an entertainment tool, rather than serving only to communicate and interact
with known and unknown users. Although SM websites provide many benefits to users,[8]
cybercriminals can use these websites to commit different types of misbehaviour and/or
aggressive behaviour. The common forms of misbehaviour and/or aggressive behaviour on
OSN sites include cyber bullying, phishing, spam distribution, and malware spreading.
Users utilize SM websites to demonstrate different types of aggressive behaviour. The main
involvement of SM websites in aggressive behaviour can be summarized in two points:
1. [9] OSN communication is a revolutionary trend that exploits Web 2.0. Web 2.0 has
new features that allow users to create profiles and pages, which, in turn, make users
active. Unlike Web 1.0 that limits users to being passive readers of content only, Web
2.0 has expanded capabilities that allow users to be active as they post and write their
thoughts. SM websites have four explicit options, namely, collaboration, participation,
authorisation, and timeliness. These characteristics enable criminals to use SM websites
as a platform to commit aggressive behaviour without confronting victims. Examples of
aggressive behaviour are committing cyber bullying and financial fraud, using malicious
applications, and implementing social engineering and phishing.[10]
2. SM websites are structures that enable information exchange and dissemination. They
allow users to effortlessly share information, such as messages, links, photos, and videos.
However, because SM websites connect billions of users, they have become delivery
mechanisms for different forms of aggressive behaviour at an extraordinary scale. SM
websites help cyber criminals reach many users.[11]
1.4 Objectives
In India, the rate of cyber bullying is rising. The social media platform, which
was once solely used for communication, resource sharing and other helpful services,
has now become a prime location for carrying out cyber bullying activities. This has caused
disturbance and insecurity for the clients and users of social media. Cyber predators
focus on either a particular user or a group of random users and then go after anything
posted by the victim, causing mental breakdown and insecurity and eventually preventing
them from using the SM platform again. Many of these attacks may even trigger suicidal
tendencies among users with low emotional stability. So the prevention of these types of cyber
bullying attacks has become unavoidable. The main objectives of
the project are:
1. Finding accurate machine learning algorithms: The method we propose to detect and
prevent cyber bullying incidents across social media is to extract the textual and
other data from the SM platform and analyze the data for such bullying occurrences.
We run through the data and match it against a manually created set of bullying words,
thereby detecting cyber bullying incidents (a minimal keyword-matching sketch is given after this list).
We use machine learning algorithms to classify the data into bullying and non-bullying.
The performance of the algorithms differs with their classifying logic, so we compare the
algorithms based on their accuracy and other classification parameters and arrive at the
classification algorithm best suited for the data.
2. Making social networks secure and transparent: By the proper implementation of the
project, we will be able to detect the cyber bullying incidents that occur across the SM
platform. This in turn elevates the security and competence of such SMs.
3. Identifying public bullies: The data comprises not only the bullying expressions but
also who posted the bullying content and when. When an attacker purposefully
concentrates on a user multiple times, or when he attacks many users, he is categorized
as a public bully. Such public bullies have a greater tendency to repeat such bullying
actions. The users are alerted about such public bullies and are advised to take the
necessary precautions.
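To make the keyword-matching step of objective 1 concrete, the sketch below shows it in Python. The word list, comments and helper name are hypothetical placeholders rather than the actual lists used in this project, and the real classification is done by the machine learning models described later.

BULLYING_WORDS = {"stupid", "ugly", "loser", "idiot"}   # hypothetical sample list

def contains_bullying_words(comment):
    # Flag a comment if any token matches the manually created word list.
    tokens = comment.lower().split()
    return any(token.strip(".,!?") in BULLYING_WORDS for token in tokens)

comments = ["you are such a loser", "great picture, congrats!"]
print([contains_bullying_words(c) for c in comments])   # [True, False]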
CHAPTER 2
LITERATURE SURVEY
A lot of research has already been completed in this field. Taken together, we can find that most of
the existing systems use the SVM algorithm for classification, which at best provides an accuracy
rate of 73-76%. Another thing common to all the papers is that they are based on an individual post
and there is no history-based analysis. Also, there is no scope for considering many features;
instead, they consider only the comments and their labels. Beyond the papers, when we consider the
Instagram platform as a source, the current network provides many safeguards against this kind of
attack, but there are still problems: for example, warnings are shown only to users who search for
terms like self-harm, suicide etc., which are keywords closely related to these issues; in addition,
the Instagram network introduced new stickers which can be used to say things like "stop bullying",
"don't bully" etc.[12] Most of the existing systems include a content warning feature and a parental
guide for monitoring the activities of the logged-in user. But there are still some problems with
these kinds of systems; for example, when we consider the features of the Instagram platform, we
find that there is no warning for posts which include the keywords mentioned above, and there are
no warnings for comments which appear to be bullying.
[13] F. Toriumi, T. Nakanishi, M. and K. Eguchi describe the clear difference between cyber
bullying and cyber aggression in terms of frequency, negativity and imbalance of power,
applied in large-scale labelling. They also use images and their corresponding comments in the
system. This was a multi-modal classification result for cyber bullying detection. The main
features they considered for the analysis are the content of the image, the comments and the
metadata of the profile, and they found that cyber bullying occurs in posts which involve religion,
death, appearance and sexual hints. Other findings include that posts which face these attacks are
most likely to carry negative emotions and to relate to drugs, tattoos etc. They use Latent
Semantic Analysis (LSA) based on Singular-Value Decomposition (SVD) and a linear support
vector machine (SVM) classifier which uses n-gram texts with normalization. This system
provides at best 79% recall and 71% precision. At the same time, the data set
is limited and the data were labelled using a survey, hence no media-based social networking
can be included in this system. Moreover, no image recognition algorithms are
used, and the decisions are taken only on the basis of the survey.
[14] R. Badonnel, R. State, I. Chrisment and O. Festor explain the tracking of cyber
predators in a peer-to-peer network. This system mainly aims at detecting network attacks
against vulnerable services and host-based attacks like unauthorized logins. They made
the system capable of tracking and reporting cyber predators, and hence it protects normal
users from coming into contact with these pathological users. The system is defined using
two criteria, tracking the deployment and tracking the target, and it is composed of a set of
configurable honeypot agents and a central platform manager. The main managerial activities
taking place in this system are the generation of fake files, the capturing of file requests and local
statistical analysis. Advantages of the system include full compatibility with the management
architecture and management protocol, full independence from the file directory service and
generic applicability to peer-to-peer clients. At the same time, the lack of central management and
control over the available resources and the problem of central backup of files and folders
weigh the system down.
Considering the findings of [15] M. Di Capua, E. Di Nardo and A. Petrosino, they
tried to build a system which follows unsupervised learning methods for finding bullying
activities in social media. This system proposes a method to detect cyber bullying with a
hybrid set of features combining classical textual features and social features. They adopt natural
language processing algorithms and semantic as well as syntactic methods for filtering the
data. The main feature of this system is that it considers emotional traces, making
use of sentiment analysis with a set of features which are closely related to the
social platform; the sentiment polarity of the sentences is calculated based on ranking.
They considered emojis and classified them as extremely negative, negative, neutral,
positive and extremely positive, and they use neural networks for the clustering purpose.
The performance of the classifier is calculated based on precision, recall and F-measure.
Noviantho, S. M. Isa and L. Ashianti [16] explain how bullying can be detected using text
mining techniques. Here the bullying conversations are identified based on the naive Bayes
method and SVM using a poly kernel. They used a data set from Formspring.me in the form
of textual conversations and filtered it by discarding conversations which contain fewer
than 15 words as well as those which include meaningless words. As an initial stage, they classify
the data into two classes, Yes and No; they then develop a 4-class model (No, Low, Medium, High)
and an 11-class classification. Finally, they found that the 4-class classification is the most
optimal one and proceeded with it using n-grams. This system operates on textual
conversations only and hence can identify only part of the cyber bullying behaviour, which is not
enough, because conversations contain other elements such as keywords and
abbreviations, and conversations include a lot of emoji content that is not considered here.
1. Risks that arise from the interaction of two under-age users, such as cyber bullying.
2. Risks which arise from the interaction between a child and an adult, such as cyber
grooming.
3. Risks that arise from the collection of data, against the protection of privacy, such as viruses
and other malware.
[18] It also describes different kinds of risks such as content risk, contact risk, children
being targeted as consumers, economic risks, and online privacy risk. The article concludes that
the main issues faced by children are cyber bullying and pornography.
An interesting paper, Facebook Watchdog: A Research Agenda for Detecting Online Grooming
and Bullying Activities (IEEE), aims to protect adolescents against
bullying and grooming attacks. It gives a clear picture of the difference between the
issues of cyber bullying and online grooming. There is also a study of related work,
which helps to go through some more details. It also helps to understand Facebook and its
information pool, which includes albums, applications, check-ins, photos, links, etc., and so gives
a good idea of what kind of data is available for analysis. The paper focuses on the
analysis of images/video, social media analytics, and text analysis.
[19] A study about the views of young people on shaming strangers by using social media was
conducted, and the result was alarming: the number of predators on social media
is increasing day by day, because the youth do not consider this a serious problem, while the
truth is just the opposite. The study mainly focuses on the circumstances under which it is
accepted or appropriate to conduct online shaming and on how young people perceive this.
Shaming can take place in different ways: it can be targeted at a single person or be about
general topics, and it can be both legal and illegal. In the proposed model, an individual
captures or records public behaviour, the material is uploaded to and shared on social media,
and, through the emotional and behavioural responses of users, two outcomes are produced:
the material is either viewed but not shared, or it goes through dissemination and is finally
taken up by the media. The approach of the study is to ask what, how and why this is happening.
Through this study, the authors distinguish cyber shaming from cyber bullying. The paper was
prepared through a question-and-answer (interview) process. Finally, it identified six themes
with different contents, which are the concept of shaming, the difference between cyber bullying
and cyber shaming, the use of smartphones for recording public behaviour, uploading and sharing
on social media, managing online presence and, finally, the context of behaviour. This study
gives the idea that today this kind of activity is common because people are not aware of the
issues and future problems, so it is difficult to manage this kind of behaviour in people, and it
also increases the complexity of detecting predators, as explained in The Use of Social
Media for Shaming Strangers: Young People's Views (IEEE).
[21] A Web Pornography Patrol System by Content-based Analysis: In Particular Text and
Image (IEEE) suggests an idea to filter pornographic websites based on text and image analysis.
The text-based analysis is constructed with the support vector machine (SVM) algorithm. The
image-based analysis uses a normalized R/G ratio and a human composition matrix (HCM) based
on skin detection. The results from the text and image analysis are integrated and analyzed
together with a Boolean model. If a URL is analyzed as a pornographic website, it is stored
in a blacklist. Finally, these URLs are used for blocking inappropriate material. The URLs
can be divided into two lists, a blacklist and a whitelist; the blacklist consists of URLs
that must be blocked. The problem with URL blocking is that new sites emerge quickly and
continually. Keyword filtering uses a list of keywords to identify undesirable web pages. If
a page contains a certain number of keywords found in the list, then it is considered an
undesirable web page. The system uses an N-gram model based on Bayes' theorem to improve the
efficiency of the SVM algorithm.
[22] Using Machine Learning to Detect Cyberbullying (IEEE) discusses different techniques
and uses a data set from the website Formspring.me, because this site is populated mostly by
teens and college students and there is a high percentage of bullying content in the data. After
the collection of data, it is labelled. The development procedure includes developing
features for input, learning the model, class weighting, and evaluation. Developing features
for input involves adding the SUM and TOTAL features to the NUM and NORM versions of the
dataset: the NUM set indicates the bad words, and the density of bad words is featured as NORM.
In learning the model, different algorithms like JRIP, IBK, and SMO are used. So here the
detection takes place by recording the percentage of curse and insult words within a post.
CHAPTER 3
There are a lot of issues related to this kind of prediction in non-technical terms, which will
also affect the technical stages included in the model selection. This section will describe
some of them.
These definitions are under intense debate, but to simplify the definition of cyber bullying
and make it apply to a wide range of applications, researchers have defined cyber
bullying as “the use of electronic communication technologies to bully others.” Proposing
a simplified and clear definition of cyber bullying is a crucial step toward building machine
learning models that can satisfy the definition criteria of cyber bullying engagement.
[26] Predicting human behaviour is crucial but complex. To achieve an effective prediction of
human behaviour, the patterns that exist and are used for constructing the prediction model
should also exist in the future input data. The patterns should clearly represent features that
occur in current and future data to retain the context of the model. Given that big data are
non-generic and dynamic in nature, the context of these data is difficult to understand in terms of
scale and even more difficult to maintain when data are reduced to fit into a machine learning
model. Handling context of big data is challenging and has been presented as an important
future direction. Furthermore, human behaviour is dynamic. Knowing when online users
change the way of committing cyber bullying is an important component in updating the
prediction model with such changes. Therefore, dynamically updating the prediction model is
necessary to meet human behavioural changes.
Researchers should therefore develop dynamic algorithms to detect new slang and abbreviations
related to cyber bullying behaviour on SM websites and keep updating the training processes of
machine learning algorithms by using newly-introduced words.
Big data are generated with very high velocity, variety, volume, value, veracity,
complexity, etc. Researchers need to leverage various deep learning techniques for processing
social media big data for cyber bullying behaviours. The deep learning techniques and
architectures with the potential to explore the cyber bullying big data generated from SM
include the generative adversarial network, deep belief network, convolutional neural network,
stacked autoencoder, deep echo state network, and deep recurrent neural network. These deep
learning architectures remain unexplored in cyber bullying detection in SM.
The SMOTE technique is applied to avoid the overfitting which occurs when exact replicas
of minority-class examples are added to the main data set. A subset of data is taken from the
minority class, and new, similar synthetic examples are generated. These synthetic
examples are then added to the original data set, and the resulting data set is used to train the
machine learning methods. The cost-sensitive technique is utilized to control class imbalance;
it is based on creating a cost matrix, which defines the costs incurred by false positives and
false negatives.
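A minimal sketch of SMOTE oversampling is given below, using the imbalanced-learn library as a stand-in; the text above does not state which implementation was used, and the synthetic data set is only an illustration.

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic stand-in for an imbalanced bullying / non-bullying feature matrix.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))

# SMOTE generates new synthetic minority-class examples instead of plain copies.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))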
[3] In the example, the classifier fails to classify any cyber bullying posts but obtains a high
accuracy percentage. Knowing the nature of the manually labelled data is important in selecting
an evaluation metric. In cases where data are imbalanced, researchers may need to select AUC
as the main evaluation metric; in class-imbalance situations, AUC is more robust than other
performance metrics. Cyber bullying and non-cyber bullying data commonly form imbalanced
datasets (non-cyber bullying posts outnumber the cyber bullying ones) that closely represent
the real-life data that machine learning algorithms need to train on. Accordingly, the learning
performance of these algorithms is affected by data skewness. Special care should be
taken in selecting the main evaluation metric to avoid misleading results and to appropriately
evaluate the performance of machine learning algorithms.
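The point can be illustrated with a small, hypothetical example: a degenerate classifier that never flags bullying scores high on accuracy but no better than chance on AUC.

from sklearn.metrics import accuracy_score, roc_auc_score

y_true = [0] * 95 + [1] * 5        # 5% bullying posts (hypothetical imbalance)
y_score = [0.0] * 100              # classifier that never flags bullying

print(accuracy_score(y_true, [0] * 100))   # 0.95 -- looks good but is misleading
print(roc_auc_score(y_true, y_score))      # 0.5  -- no better than chance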
Most studies collect posts that contain specified keywords. The identification of keywords for
extracting posts is also subject to the author's understanding of [19] cyber bullying. An effective
method should use a complete range of posts indicating cyber bullying to train the machine learning
classifier and ensure the generalization capability of the cyber bullying prediction model. An
important objective of machine learning is to generalize beyond the examples in a training data
set. Researchers should investigate whether the sampled data are extracted from data that
effectively represent all possible activities on SM websites. Extracting well-representative
data from SM is the first step toward building effective machine learning prediction models.
However, SM websites' public application program interface (API) only allows the extraction
of a small sample of all relevant data and thus poses a potential for sampling bias. For
example, one study examined whether data extracted from Twitter's streaming API sufficiently
represent the activities of the Twitter network as a whole; the author compared keyword (words,
phrases, or hashtags), user ID, and geo-coded sampling. Twitter's streaming API returns a data set
with some bias when keyword or user ID sampling is used. By contrast, geo-tagged filtering
provides good data representation. With these points in mind, researchers should ensure
minimum bias as much as possible when they extract data, to guarantee that the examples
selected for the training data generalize and provide an effective model
when applied to test data. Bias in data collection can impose bias on a training data
set selected by specific keywords or users, and such a bias consequently introduces overfitting
issues that affect the capability of a machine learning model to make reliable predictions on
untrained data.
CHAPTER 4
PROJECT MANAGEMENT
4.1 Introduction
Project management is the discipline of planning, organizing, managing, securing,
leading, and controlling resources to attain a specific aim. A project is a temporary endeavour
with a defined beginning and end (usually time-constrained, and frequently constrained by
funding or deliverables), undertaken to meet specific aims and objectives, basically to
bring about useful change or added value. The temporary nature of projects stands in
contrast to business as usual (or operations), which consists of recurring, constant, or
semi-permanent functional activities to produce products or services. In practice, the
administration of the two systems is often quite distinct, and as such demands the development
of different technical skills and management tactics.
In our project we followed the typical development phases of an engineering project:
• Initiation
• Completion
4.1.1 Initiation
The initiating processes determine the nature and scope of the project. The beginning stage
must include a plan that covers the following areas:
3. Financial assessment of the costs and benefits, including a budget study and an inspection
covering the users and support personnel for the project
4.1.3 Execution
Execution includes the processes used to complete the work described in the project plan to
achieve the project's needs. The executing process involves arranging people and resources, as
well as organizing and executing the activities of the project in accordance with the project
management plan. The deliverables are produced as outputs from the processes performed as
defined in the project management plan and other frameworks that are relevant to the
type of project at hand.
4.2 System Development Life Cycle
The systems development life cycle (SDLC) describes the process of creating and
recasting information systems, and the models and methodologies that are used
to introduce these systems. In software engineering, the SDLC concept supports different
varieties of software development methodologies. These methodologies form the skeleton for
planning and controlling the creation of an information system.
The SDLC phases serve as a programmatic guide to project activity and supply a flexible
but consistent way to execute projects to a depth matching the scope of the project. Each of
the SDLC phase objectives is explained in this section with key deliverables, an elucidation of
recommended tasks, and a summary of related control objectives for effective management. It is
very important for the project manager to establish and observe control objectives during each
SDLC phase while executing projects. Control objectives are very useful for providing a clear
statement of the expected result or purpose and can be used throughout the entire SDLC
process.
Design:
In the design phase, the team designs the software using different diagrams such as the data flow
diagram, activity diagram, class diagram, state transition diagram, etc.
Implementation:
In the implementation phase, the requirements are written in the chosen programming language and
transformed into a working program.
Testing:
After completing the coding phase, software testing starts using different test methods. There
are many test methods, but the most common are white box, black box, and grey box test
methods.
Deployment:
After completing all the phases, software is deployed to its work environment.
Review:
In this phase, after the product deployment, a review is performed to check the behaviour
and validity of the developed product. If any errors are found, then the process starts
again from requirement gathering.
Maintenance:
In the maintenance phase, after the deployment of the software in the working environment, there
may be bugs or errors, or new updates may be required. Maintenance involves debugging
and adding new options.
4.2.2 Advantages
• Testing and debugging during smaller iterations is easy.
CHAPTER 5
METHODOLOGY
We analyzed Instagram profiles and their associated posts for bullying detection. This
section discusses data collection followed by filtering of relevant data, feature extraction,
implementation and results.
5.1 Data Collection
For the data collection, we connect to the Instagram API with the help of the Python
library igramscraper. In this module a subpackage, the Instagram package, is available;
importing it makes the crawling easy. From the extracted data set we chose
the comments and manually labelled them as bullying comments and non-bullying comments.
For preprocessing we used the library nltk.corpus, which consists of different dictionary
collections such as stop words in Hindi, English and other languages. Initially the comments are
tokenized and then the cleaning process takes place. Here we took the stop words in English,
which include and, or, he, have, is, etc., and removed them from our data set. As part
of preprocessing we also remove punctuation, special characters and so on. The main
algorithm which we used for the preprocessing part was the bag-of-words algorithm.
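A minimal pre-processing sketch along these lines is given below, assuming NLTK's standard tokenizer and English stop-word list are available; the sample comment is hypothetical.

import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)       # tokenizer model
nltk.download("stopwords", quiet=True)   # stop-word lists

STOP_WORDS = set(stopwords.words("english"))

def clean_comment(comment):
    # Tokenize, lowercase, and drop stop words and punctuation.
    tokens = word_tokenize(comment.lower())
    return [t for t in tokens if t not in STOP_WORDS and t not in string.punctuation]

print(clean_comment("He is such an ugly loser!"))   # ['ugly', 'loser']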
A bag-of-words model, or BoW for short, is a way of extracting features from the text for
use in modelling, such as with machine learning algorithms. The approach is very simple
and flexible and can be used in a myriad of ways for extracting features from documents.
A bag-of-words is a representation of text that describes the occurrence of words within a
document. It involves two things: a vocabulary of known words, and a measure of the presence of
those known words. Here we also make use of an inbuilt function for the BoW algorithm using the
CollocationFinder package. It is called a “bag” of words because any information about the order
or structure of words in the document is discarded; the model is only concerned with whether known
words occur in the document, not where in the document they occur. We implemented the bag of
words as uni-gram, bi-gram, tri-gram and n-gram features with the help of the nltk library. These
represent n consecutive terms of the text; as the selection of n changes, the performance also
changes.
5.3.4 Support Vector Machine
The approach is based on minimizing classification risk. SVM was initially established
to classify linearly separable classes. A 2D plane comprises linearly separable objects from
different classes (e.g., positive or negative), and SVM aims to separate the two classes effectively.
SVM identifies the optimal hyperplane that provides the maximum margin by maximizing
the distance between the hyperplane and the nearest data point of each class.

In real-time applications, precisely determining the separating hyperplane is difficult and
nearly impossible in several cases. SVM was developed to adapt to these cases and can now
be used as a classifier for non-separable classes. SVM is a capable classification algorithm
because of its characteristics. Specifically, SVM can powerfully separate non-linearly divisible
features by converting them to a high-dimensional space using the kernel model. The
advantages of SVM are its high speed, scalability, and capability to make predictions in real time
and to update training patterns dynamically.
SVM has been used to develop cyber bullying prediction models and found to be effective and
efficient. For example, Chen et al. [9] applied SVM to construct a cyber bullying prediction
model for the detection of offensive content in SM. SM content with potential cyber bullying
were extracted, and the SVM cyber bullying prediction model was applied to detect offensive
content. The result showed that SVM is more accurate in detecting user offensiveness than
Naı̈ve Bayes (NB). However, NB is faster than SVM. Chavan and Shylaja [1] proposed the
use of SVM to build a classifier for the detection of cyber bullying in social networking sites.
Data containing offensive words were extracted from social networking sites and utilized to
build a cyber bullying SVM prediction model. The SVM classifier detected cyber bullying
more accurately than LR did. Others used SVM to build a gender specific cyber bullying
prediction model. An SVM text classifier was created with gender specific characteristics.
The SVM cyber bullying prediction model enhanced the detection of cyber bullying in SM. A
study developed an SVM-based cyber bullying detection model to detect cyber- bullying in
a social network site. The SVM-based model was trained using data containing cyberbully-
ing extracted from the social network site. The researchers found that that the SVM-based
cyber bullying model effectively detected cyber- bullying. Some constructed an SVM-based
cyber bullying detection model for YouTube. Data were collected from YouTube comments
on videos posted on the site. The data were used to train SVM and construct a cyberbully- ing
detection model, which was then used to detect cyber bullying. The results suggested that the
SVM-based cyber bullying model is more reliable but not as accurate as a rule-based model.
However, the SVM-based cyber bullying model is more accurate than NB and tree-based
J48. They also proposed the use of SVM for the detection of cyber bullying on Twitter. An
SVM- based cyber bullying model was constructed from data extracted from Twitter. The
SVM-based cyber bullying prediction model was applied to detect cyber bullying in Twitter.
SVM detected cyber bullying better than NB- and LR-based cyber bullying detection models
did.
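In the same spirit as the models surveyed above, a bag-of-words SVM text classifier can be assembled in a few lines; the comments, labels and predicted output below are toy illustrations, not data or results from this project.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

comments = ["you are pathetic", "lovely photo", "nobody likes you", "great work"]
labels = [1, 0, 1, 0]            # 1 = bullying, 0 = non-bullying (toy data)

# Linear SVM on uni-gram + bi-gram counts, wrapped in a single pipeline.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(comments, labels)

print(model.predict(["you are so pathetic", "great photo"]))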
5.4.1 Accuracy
Accuracy describes systematic error, i.e., a measure of statistical bias; low accuracy causes a
difference between a result and the “true” value (ISO calls this trueness). It can also describe a
combination of both types of observational error (random and systematic), so high accuracy
requires both high precision and high trueness. For a classifier, accuracy is the fraction of
instances that are correctly classified:

Accuracy = (tp + tn) / (tp + tn + fp + fn)    (5.1)

5.4.2 Precision, Recall, and F-Measure
Precision: It is defined as the fraction of the retrieved documents that are relevant, and it can
be formulated as

precision = tp / (tp + fp)    (5.2)
Recall: It is defined as the fraction of the relevant documents that are successfully retrieved,
and it can be formulated as

recall = tp / (tp + fn)    (5.3)
F-measure: The F-measure is a combination of recall and precision that is used to measure a
test's accuracy, and it can be defined as

F-measure = (2 × precision × recall) / (precision + recall)    (5.4)
5.5 Outputs
CHAPTER 6
DESIGN
The pre-processing mainly deals with turning the data into a form more suitable for processing.
This includes steps such as stop-word removal, repeated-letter removal, noise removal,
etc. The output of the pre-processing makes the data more suitable for the main processing
part. After the data is pre-processed, it is sent to the further processing steps.
The secondary purpose of the admin is to monitor the bully and the bullying activities. Once
the output of the classification is obtained, the admin can keep an eye on the cyber bully.
For every bullying approach, the attacker is marked and the bullying count is incremented.
The user is also declared a public bully if he (the attacker) continues to bully a targeted user
(the victim) or any other user multiple times. For the subsequent data gathering, the data of
these public bullies are studied most closely.
Based on the output classification, the administrator can also make contact with the victim
and other unmarked users to warn them about cyber bullying incidents that might
occur. This in turn helps to reduce the chances of further cyber bullying activities and
also helps prevent the attacks at an earlier stage.
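A minimal sketch of this bookkeeping is shown below; the threshold value and data structure are assumptions for illustration, not values taken from the implementation.

from collections import Counter

PUBLIC_BULLY_THRESHOLD = 3     # hypothetical number of bullying comments

bully_counts = Counter()

def record_bullying(attacker_id):
    # Increment the attacker's bullying count; return True once they qualify
    # as a public bully.
    bully_counts[attacker_id] += 1
    return bully_counts[attacker_id] >= PUBLIC_BULLY_THRESHOLD

for attacker in ["user42", "user42", "user7", "user42"]:
    if record_bullying(attacker):
        print(attacker, "flagged as a public bully")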
Compared with its predecessor, the processor offers a 100 MHz higher clock speed. Intel
essentially uses the same microarchitecture as Skylake, so the per-MHz performance does not
differ. The manufacturer mainly reworked the Speed Shift technology for faster dynamic
adjustment of voltages and clocks, and the improved 14 nm process permits much higher
frequencies combined with better efficiency than before.
6.3.5 Spyder
Spyder is an open source cross-platform integrated development environment (IDE) for
scientific programming in the Python language. Spyder integrates with a number of prominent
packages in the scientific Python stack, including NumPy, SciPy, Matplotlib, pandas, IPython,
SymPy and Cython, as well as other open source software. It is released under the MIT license.
Spyder is extensible with first- and third party plugins,[6] includes support for interactive
tools for data inspection and embeds Python-specific code quality assurance and introspection
instruments, such as Pyflakes, Pylint and Rope. It is available cross-platform through
Anaconda, on Windows, on macOS through MacPorts, and on major Linux distributions such
as Arch Linux, Debian, Fedora, Gentoo Linux, openSUSE and Ubuntu.
6.3.6 Windows Operating System
As of February 2020, the most recent version of Windows for PCs, tablets and embedded
devices is Windows 10. The most recent version for server computers is Windows Server,
version 1909. Other commonly used versions of Windows for PCs are Windows 8 and 8.1,
Windows 7 and Windows XP.
CHAPTER 7
RESULTS & DISCUSSION
We have compared 4 different machine learning algorithms to detect bullying: Naive Bayes,
Decision Tree, Logistic Regression and Support Vector Machine. As a first
stage, we pre-processed the data by removing punctuation, stop words etc., and then the feature
extraction was done using BoW algorithms. For more precise features we tried 4 different
combinations of consecutive words: uni-gram, bi-gram, tri-gram and n-gram. For
comparing the performance of these algorithms, the main evaluation measures which
we used are accuracy, precision, recall and F-measure. From our analysis, we found that the
Naive Bayes algorithm with bi-grams has the highest performance on all measures.
We also found that by changing the parameters of each algorithm and by changing the
data split we can improve the performance. Hence we tried different data splits, parameter
changes and an increased quantity of data, and we saw that the performance varied in every
case. From these results, we conclude that the Naive Bayes algorithm is the most suitable one
for cyber bullying detection on the data set which we used; under parameter changes, data
split changes etc. it also gave the best performance. To confirm the results we performed
cross-validation with a k value of 5, and the results were verified again.
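A hedged sketch of this comparison is given below using scikit-learn's 5-fold cross-validation; the toy comments and labels stand in for the labelled Instagram comments, so the printed scores are illustrative and not the reported results.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

comments = ["you are dumb", "nice shot", "ugly loser", "well done",
            "nobody likes you", "congrats", "you idiot", "beautiful view",
            "go away loser", "amazing work"]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

classifiers = {
    "Naive Bayes": MultinomialNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": LinearSVC(),
}

for name, clf in classifiers.items():
    # Bi-gram bag-of-words features feed each classifier; k = 5 folds.
    pipe = make_pipeline(CountVectorizer(ngram_range=(2, 2)), clf)
    scores = cross_val_score(pipe, comments, labels, cv=5, scoring="accuracy")
    print(name, "mean accuracy =", round(scores.mean(), 2))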
7.1 Results
CHAPTER 8
CONCLUSION & FUTURE SCOPE
8.1 Conclusion
Cyber bullying is an increasingly frequent problem among adolescents, and it produces
considerable social concern. India has an increasing number of social media users, and many
of them have reported being victims of cyber bullying activities. Such cases have a high
probability of triggering mental insecurity and problems among users. If we can curb this
awful behaviour early in their lives, they are unlikely to continue down that path. The method
we propose to detect and prevent cyber bullying incidents across social media is to extract
the textual and other data from the SM platform and analyze the data for such bullying
occurrences.
Our system initially accepts the input dataset and pre-processes it to obtain only the
wanted data. After that, the system checks whether a selected group of text is bullying
or not using sentiment analysis. The users who carry out the bullying activities are
categorized as cyber bullies. If a cyber bully continues to carry out such activities against
multiple users, then he is categorized as a public bully, so that his further activities can
be monitored. To achieve this, we have used four different machine
learning algorithms: Naive Bayes, Decision Tree, Logistic Regression and Support Vector
Machine. The performances of all these algorithms were recorded separately and also through
ensemble methods. We then compared the performance parameters and concluded that the
best prediction takes place with the Naive Bayes algorithm using a bi-gram bag
of words pre-processing technique, with an accuracy of 79% and an F-measure of 38%. The
other algorithms were also recorded to perform in roughly the same ballpark, with
accuracies ranging from 65% to 79%.
From the analysis of the obtained results, it is evident that most of the users are victims of cyber
bullying incidents. These results also provide insight into the importance of the prevention
and eradication of the growing problem of cyber bullying, and this system will be helpful in
saving many youngsters from this kind of attack by giving them warnings.
8.2 Future Scope
The seriousness of cyber bullying is visible from many aspects. We often come across
sensational news around the world in which cyber bullying plays a major role.
Considering the present scenario, the importance of our project to detect and monitor cyber
bullying is evident. So far we have worked on the core part of the system and provided results
that help in the prevention and eradication of the growing problem of cyber bullying, and this
system will also be helpful in protecting youngsters from these kinds of attacks by giving them
warnings. When deployed on any social media platform, it will help protect the users of
that particular platform from the adverse effects of cyber bullying; we can even implement
the same as a user interface that could assist any of the popular social media platforms and
would hence also be able to serve our purpose of detection and prevention of cyber bullying.
Detection of cyber bullies alone will not be a real breakthrough when the serious
background of the topic is considered. Detection, together with continuous monitoring of the
detected bullies, will be an outstanding contribution to this problem of cyber bullying, which
is prevalent in the society to which we belong. Hence the practice of monitoring the detected
bullies will be one of the major influential innovations that can be achieved.
It is a common practice, noticed in the social media behaviour of many people, that most
people express their emotions and feelings through a vast number of emojis. The addition of
emojis to comments and chats by the user is quite a common thing found in various
social media platforms. Therefore it is equally important to identify bullying from
the usage of emojis as well. The detection and monitoring of cyber bullying from the usage
of emojis is another refinement that would be very helpful for this problem of cyber bullying;
if implemented, it can add more impact and will prove to be more effective in the
detection and monitoring of cyber bullies.
REFERENCES
[2] Y. Chen, Y. Zhou, S. Zhu, and H. Xu, “Detecting offensive language in social media to
protect adolescent online safety,” in 2012 International Conference on Privacy, Security,
Risk and Trust and 2012 International Conference on Social Computing, pp. 71–80,
IEEE, 2012.
[4] W. Dong, S. S. Liao, Y. Xu, and X. Feng, “Leading effect of social media for financial
fraud disclosure: A text mining based analytics,” in AMCIS, 2016.
[9] R. Sugandhi, A. Pande, S. Chawla, A. Agrawal, and H. Bhagat, “Methods for detection
of cyberbullying: A survey,” in 2015 15th International Conference on Intelligent
Systems Design and Applications (ISDA), pp. 173–177, 2015.
[10] C. Yang, R. Harkreader, J. Zhang, S. Shin, and G. Gu, “Analyzing spammers’ social
networks for fun and profit: A case study of cyber criminal ecosystem on twitter,” in
Proceedings of the 21st International Conference on World Wide Web, WWW ’12, (New
York, NY, USA), pp. 71–80, ACM, 2012.
[11] S. Abu-Nimeh, T. Chen, and O. Alzubi, “Malicious and spam posts in online social
networks,” Computer, vol. 44, pp. 23–28, Sep. 2011.
[15]
[16] S. M. Isa, L. Ashianti, et al., “Cyberbullying classification using text mining,” in 2017
1st International Conference on Informatics and Computational Sciences (ICICoS),
pp. 241–246, IEEE, 2017.
[17] V. Subrahmanian and S. Kumar, “Predicting human behavior: The next frontiers,”
Science, vol. 355, no. 6324, pp. 489–489, 2017.
[18] H. Lauw, J. C. Shafer, R. Agrawal, and A. Ntoulas, “Homophily in the digital world: A
livejournal case study,” IEEE Internet Computing, vol. 14, pp. 15–23, March 2010.
[20] L. Phillips, C. Dowling, K. Shaffer, N. Hodas, and S. Volkova, “Using social media
to predict the future: a systematic literature review,” arXiv preprint arXiv:1706.06134,
2017.
[21] J. Heidemann, M. Klier, and F. Probst, “Online social networks: A survey of a global
phenomenon,” Computer networks, vol. 56, no. 18, pp. 3866–3878, 2012.
[22] J. K. Peterson and J. Densley, “Is social media a gang? toward a selection,
facilitation, orenhancement explanation of cyber violence,” Aggression Violent Behav.,
pp. 3866–3878, 2016.
[23] N. M. Shekokar and K. B. Kansara, “Security against sybil attack in social network,” in
2016 International Conference on Information Communication and Embedded Systems
(ICICES), pp. 1–5, IEEE, 2016.
[26] M. Fire, R. Goldschmidt, and Y. Elovici, “Online social networks: threats and solutions,”
IEEE Communications Surveys & Tutorials, vol. 16, no. 4, pp. 2019–2036, 2014.
[27] G. R. Weir, F. Toolan, and D. Smeed, “The threats of social networking: Old wine in
new bottles?,” Information Security Technical Report, vol. 16, no. 2, pp. 38 – 43, 2011.
Social Networking Threats.
APPENDIX A
INSTA CRAWLER
A.1 insta_dump.py
from random import randint
from time import sleep
import shelve
import json

# hashtags = ['adhd', 'afraid', 'alone', 'antibullying', 'anxiety', ...]   (list continues)
max_pages = 10
output_shelf = 'newoutput.shelf'
output_json = 'newoutput.json'
hash_tags_file = 'hashtags.txt'

# 'instagram' (the scraper client) and 'hashtag' (the tag currently being crawled,
# read from hash_tags_file) are set up in a part of the script not shown here
db = shelve.open(output_shelf)          # persistent store for the dumped posts

has_next_page = True
next_page_id = ''
page_num = 1
while has_next_page:
    print('Dumping page {} [{}] for #{}'.format(page_num, next_page_id, hashtag))
    try:
        results = instagram.get_paginate_medias_by_tag(hashtag, ...)   # remaining arguments not shown
    except Exception as err:
        nap_time = randint(10, 100)
        print('Failed to dump page {} [{}] for #{} (nap: {})'.format(
            page_num, next_page_id, hashtag, nap_time))
        print(err)
        sleep(nap_time)
        continue
    if page_num == 1:
        result_count = results['count']
        print('There are "{}" posts for #{}!'.format(result_count, hashtag))

    # ... per-post handling not shown: '_id', 'output' and '_caption' are the post id,
    # the record being assembled and the caption of the current media item
    if '#' in _caption:
        _hashtags = list({tag.strip("#") for tag in _caption.split() if tag.startswith("#")})
        if not _hashtags:
            _hashtags = None
    else:
        _hashtags = None
    output['hashtags'] = _hashtags
    sleep(randint(10, 100))             # polite pause between requests

    # ... comment retrieval not shown: '_comments_result' is the API response holding
    # the comments of the current post
    _comments = []
    for _c in _comments_result['comments']:
        _com = {}
        _com['id'] = _c.identifier
        _com['created_time'] = _c.created_at
        _com['modified_time'] = _c.modified
        _com['owner'] = _c.owner.identifier
        _com['comment'] = _comment = _c.text
        if '#' in _comment:
            _hashtags = list({tag.strip("#") for tag in _comment.split() if tag.startswith("#")})
            if not _hashtags:
                _hashtags = None
        else:
            _hashtags = None
        _com['hashtags'] = _hashtags
        _comments.append(_com)
    output['comments'] = _comments

    # dump the assembled record for this post
    db[_id] = output
    db.sync()

    has_next_page = results['hasNextPage']
    if has_next_page and page_num <= max_pages:
        page_num += 1
        next_page_id = results['maxId']
        continue
    else:
        break
APPENDIX B
PREDICTION CODE
B.1 Prediction Code
import pandas as pd
#import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
data = pd.read_csv("test_data.csv")
#print(data)
#%%
# preprocessing: the cleaning cell that tokenises the tweets, removes stop words
# and fills these lists (producing 'clean_words') is omitted from this listing
Tweet = []
Labels = []
#%%
combined = zip(data["Tweet"], data["Text Label"])

def bag_of_words(words):
    # presence-of-word (unigram) features
    return dict([(word, True) for word in words])

Final_Data = []
for r, v in combined:
    Final_Data.append((bag_of_words(r), v))
#%%
import random
random.shuffle(Final_Data)
print(len(Final_Data))
#%%
# Split the data into training and test sets
train_set, test_set = Final_Data[0:800], Final_Data[800:]
#%%
import nltk
import collections
from nltk.classify import DecisionTreeClassifier
from nltk.metrics.scores import precision, recall, f_measure

# Naive Bayes on the unigram features: the cell that trains 'classifier' and fills
# nbrefset / nbtestset with reference and predicted labels is omitted from this listing
nbrefset = collections.defaultdict(set)
nbtestset = collections.defaultdict(set)
#classifier.show_most_informative_features(n=10)
print("*********")
#%%
# Decision Tree on the unigram features
dt_classifier = DecisionTreeClassifier.train(train_set,
                                             binary=True,
                                             entropy_cutoff=0.8,
                                             depth_cutoff=5,
                                             support_cutoff=30)
refset = collections.defaultdict(set)
testset = collections.defaultdict(set)
for i, (feats, label) in enumerate(test_set):
    refset[label].add(i)                              # gold labels
    testset[dt_classifier.classify(feats)].add(i)     # predicted labels
print("Accuracy:", nltk.classify.accuracy(dt_classifier, test_set))
print("")
print('Non bullying precision:', precision(refset['Non-Bullying'], testset['Non-Bullying']))
print('Non bullying recall:', recall(refset['Non-Bullying'], testset['Non-Bullying']))
print("Non bullying F measure:", f_measure(refset['Non-Bullying'], testset['Non-Bullying']))
print("*********************")
#%%
# Logistic Regression ('logit_classifier' is trained in a cell omitted from this listing)
print("Accuracy:", nltk.classify.accuracy(logit_classifier, test_set))
print('Bullying precision:', precision(refset['Bullying'], testset['Bullying']))
print("*********************")
#%%
# Support Vector Machine ('SVM_classifier' is trained in a cell omitted from this listing)
print("Accuracy:", nltk.classify.accuracy(SVM_classifier, test_set))
print('Bullying precision:', precision(refset['Bullying'], testset['Bullying']))
print("")
print('Non bullying precision:', precision(refset['Non-Bullying'], testset['Non-Bullying']))
#%%
# Bigram bag-of-words features: 'bag_of_bigrams_words' is defined in a cell omitted
# from this listing, and the zip over the data is rebuilt because the earlier one
# has already been consumed
combined = zip(data["Tweet"], data["Text Label"])
Final_Data2 = []
for z, e in combined:
    Final_Data2.append((bag_of_bigrams_words(z), e))

import random
random.shuffle(Final_Data2)
print(len(Final_Data2))

import nltk
import collections
from nltk.metrics.scores import (accuracy, precision, recall, f_measure)
from nltk import metrics
#%%
# Naive Bayes for bigrams (the train/test split of Final_Data2 is omitted from the listing)
classifier = nltk.NaiveBayesClassifier.train(train_set)
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
for i, (feats, label) in enumerate(test_set):
    refsets[label].add(i)
    observed = classifier.classify(feats)
    testsets[observed].add(i)

# Decision Tree for bigrams
dt_classifier = DecisionTreeClassifier.train(train_set,
                                             binary=True,
                                             entropy_cutoff=0.8,
                                             depth_cutoff=5,
                                             support_cutoff=30)
refset = collections.defaultdict(set)
testset = collections.defaultdict(set)
#%%
# Trigram bag-of-words features: 'bag_of_trigrams_words' is defined in a cell
# omitted from this listing
combined = zip(data["Tweet"], data["Text Label"])
Final_Data3 = []
for z, e in combined:
    Final_Data3.append((bag_of_trigrams_words(z), e))

import random
random.shuffle(Final_Data3)
print(len(Final_Data3))

# Naive Bayes and Decision Tree for trigrams
classifier = nltk.NaiveBayesClassifier.train(train_set)
dt_classifier = DecisionTreeClassifier.train(train_set,
                                             binary=True,
                                             entropy_cutoff=0.8,
                                             depth_cutoff=5,
                                             support_cutoff=30)
refset = collections.defaultdict(set)
testset = collections.defaultdict(set)
for i, (feats, label) in enumerate(test_set):
    refset[label].add(i)
    testset[dt_classifier.classify(feats)].add(i)
print("Accuracy:", nltk.classify.accuracy(dt_classifier, test_set))
print('Bullying precision:', precision(refset['Bullying'], testset['Bullying']))
#%%
# Bigram collocation metrics - used to identify the top 200 bigrams
from nltk.metrics import BigramAssocMeasures
from nltk.collocations import BigramCollocationFinder

def bigrams_words(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    return bigrams

# Trigram collocation metrics - used to identify the top 200 trigrams
# ('trigrams_words' is the analogous helper; its definition is omitted from the listing)
from nltk.metrics import TrigramAssocMeasures

# bag of n-grams: joins the selected bigrams and trigrams into string features
def bag_of_Ngrams_words(words):
    bigramBag = bigrams_words(words)
    for b in range(0, len(bigramBag)):
        bigramBag[b] = ' '.join(bigramBag[b])
    trigramBag = trigrams_words(words)
    for t in range(0, len(trigramBag)):
        trigramBag[t] = ' '.join(trigramBag[t])
    # ... (the rest of the function, which builds the feature dict from bigramBag
    # and trigramBag, is not shown in the listing)
#%%
# Combined n-gram (bigram + trigram) features
combined = zip(data["Tweet"], data["Text Label"])
Final_Data4 = []
for z, e in combined:
    Final_Data4.append((bag_of_Ngrams_words(z), e))

import random
random.shuffle(Final_Data4)
print(len(Final_Data4))

import nltk
import collections
from nltk.metrics.scores import (accuracy, precision, recall, f_measure)
from nltk import metrics
#%%
# Naive Bayes for n-grams
classifier = nltk.NaiveBayesClassifier.train(train_set)
#classifier.show_most_informative_features(n=10)
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
print("***************")
#%%
# SVM model (the cell training the SVM on the n-gram features is omitted from the listing)
zl = zip(data["Tweet"], data["Text Label"])

def bag_of_words(words):
    return dict([(word, True) for word in words])

# Another feature function, 'bag_of_bigrams_words', returns bag-of-words features
# extended with the top 200 bigrams selected with BigramAssocMeasures; its
# definition is omitted from this listing
from nltk.metrics import BigramAssocMeasures

Final_Data = []
for k, v in zl:
    Final_Data.append((bag_of_bigrams_words(k), v))

import random
random.shuffle(Final_Data)
# split the data (roughly 70% for training, the rest for testing)

import nltk
import collections
from nltk.metrics.scores import (accuracy, precision, recall, f_measure)

nb_classifier = nltk.NaiveBayesClassifier.train(train_set)
nb_classifier.show_most_informative_features(10)
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
for i, (feats, label) in enumerate(test_set):
    refsets[label].add(i)
    observed = nb_classifier.classify(feats)
    testsets[observed].add(i)
print("*************")
#%%
import collections
from nltk import metrics
from nltk.metrics.scores import (accuracy, precision, recall, f_measure)
from nltk.classify import DecisionTreeClassifier
from nltk.classify.util import accuracy

dt_classifier = DecisionTreeClassifier.train(train_set,
                                             binary=True,
                                             entropy_cutoff=0.8,
                                             depth_cutoff=5,
                                             support_cutoff=30)
print("Performance of Decision Tree", "\n accuracy:", accuracy(dt_classifier, test_set))
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
print("**************")
#%%
# SVM model (the SVM training cell itself does not appear in this listing)
#%%
'''
# Data split and trigram variant of the same pipeline (commented out):
zl = zip(data["Tweet"], data["Text Label"])
from nltk.metrics import TrigramAssocMeasures
trigrams = trigrams_words(words)

Final_Data = []
for k, v in zl:
    Final_Data.append((bag_of_bigrams_words(k), v))

import random
random.shuffle(Final_Data)
# split the data (roughly 70% of the 500 records, i.e. about 350, for training)

import nltk
import collections
from nltk.metrics.scores import (accuracy, precision, recall, f_measure)
nb_classifier = nltk.NaiveBayesClassifier.train(train_set)
nb_classifier.show_most_informative_features(10)
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)

import collections
from nltk import metrics
from nltk.metrics.scores import (accuracy, precision, recall, f_measure)
from nltk.classify import DecisionTreeClassifier
refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)

# SVM model
'''