CLICKBAIT classifier
You Won’t Believe What This ClickBait Classifier Does!
TABLE OF CONTENTS
01 Introduction
02 Data Preprocessing
03 Feature Engineering
04 Training the Model
Clickbait
A YouTuber by the name of Veritasium uploaded an informative video demonstrating the
Magnus effect by dropping a basketball from the top of a dam. Titled “Strange
Applications of Magnus Effect”, it received a few thousand views on YouTube. Later,
the same video was uploaded to a different website under the title “Basketball
dropped from a dam” and received tens of millions of views! This simple example
illustrates just how powerful clickbait titles can be, and why they have become
almost unavoidable in today’s fast-paced media world for anyone trying to attract
viewers or visitors to a website.
What Is Clickbait?
01
Clickbait
Clickbait is a text or thumbnail link designed to attract attention and entice
users to follow it and read or view the linked online content. It is typically
deceptive, sensationalized, or otherwise misleading.
The teasing title aims to exploit the “curiosity gap” by providing just enough
information to make readers curious, but not enough to satisfy that curiosity
without clicking through to the linked content.
Clickbait headlines add an element of dishonesty, using enticements that do not
accurately reflect the content being delivered.
Data Collection
Non-clickbait headlines were scraped from multiple sources such as Twitter, Reuters, The
Washington Post, The Guardian, Bloomberg, The Hindu and WikiNews. These are treated as
trusted, reliable sources whose news largely consists of facts reported from around the world.
Clickbait headlines were collected from sources such as Buzzfeed, Examiner, TheOdyssey,
Thatscoop, Viralstories, PoliticalInsider, Upworthy, ViralNova and BoredPanda, which tend
to be more clickbaity than factual.
These two types of sources are used to train the model and build a classifier that can detect
whether a title is trustworthy. Each headline in the final data is labeled clickbait or
not-clickbait depending on its source.
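The source-based labeling described above can be sketched in a few lines. This is only an illustration: the source names come from the slides, but the function name, the record layout and the sample headlines are hypothetical.

```python
# Source names are taken from the slides; everything else here is an assumption.
NON_CLICKBAIT_SOURCES = {
    "Twitter", "Reuters", "The Washington Post", "The Guardian",
    "Bloomberg", "The Hindu", "WikiNews",
}
CLICKBAIT_SOURCES = {
    "Buzzfeed", "Examiner", "TheOdyssey", "Thatscoop", "Viralstories",
    "PoliticalInsider", "Upworthy", "ViralNova", "BoredPanda",
}


def label_for_source(source: str) -> str:
    """Return the label implied by where a headline was collected."""
    if source in CLICKBAIT_SOURCES:
        return "clickbait"
    if source in NON_CLICKBAIT_SOURCES:
        return "not-clickbait"
    raise ValueError(f"Unknown source: {source!r}")


# Hypothetical scraped records as (source, headline) pairs.
records = [
    ("Reuters", "Central bank holds interest rates steady"),
    ("Buzzfeed", "17 Things You Need To See Today"),
]
labeled = [(headline, label_for_source(source)) for source, headline in records]
```

Labeling by source is cheap, but it also means every headline from a given outlet gets the same label regardless of how it is actually worded.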
Data Preprocessing
The headline data contains punctuation and other non-alphanumeric characters. These
were removed using regular expressions, as they would not contribute to training
the model.
Using the NLTK library, stop words are removed, as they add noise and take the
focus away from the keywords.
All letters are converted to lowercase, and the text is tokenized initially into
unigrams for EDA and later into unigrams and bigrams for modeling.
A word-frequency vector is created for visualization, for text classification, and
for understanding the data distribution.
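A minimal sketch of these preprocessing steps, using only the standard library, might look like the following. The slides use NLTK's stop-word list; the tiny `STOP_WORDS` set here is a stand-in assumption so the example runs without downloads, and the function names are hypothetical. Lowercasing is done first so that stop-word matching is case-insensitive.

```python
import re
from collections import Counter

# Stand-in stop-word list (assumption); the slides use NLTK's English list.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "in", "on",
              "and", "this", "that", "what", "you"}


def clean_and_tokenize(headline):
    """Lowercase, strip non-alphanumeric characters, drop stop words."""
    text = headline.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)  # removes punctuation, e.g. "won't" -> "wont"
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = clean_and_tokenize("You Won't Believe What This Classifier Does!")
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
frequencies = Counter(unigrams + bigrams)  # word-frequency vector as a Counter
```

Stripping the apostrophe is why tokens such as ‘heres’ and ‘wont’ can show up in the vocabulary.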
Feature Engineering
Clickbait headlines tend to contain more exaggerated words (seen below), along
with numbers, exclamation marks and question marks. These characteristics help us
classify headline text as clickbait or non-clickbait. To capture them, we assign
each headline a few features. Each is marked 1 if the headline has the feature
and 0 if it doesn’t, except the last, which is a count:
● Starts with or contains exaggerated words
● Starts with or contains question words
● Ends with question mark
● Ends with exclamation mark
● Starts with number
● Headline word count
Feature Engineering
Exaggerated words and phrases:
‘Insane’, ‘awesome’, ‘amazing’, ‘won’t believe’, ‘must’, ‘secret’, ‘facts’,
‘ultimate guide’, ‘ways to improve’, ‘list of the best’, ‘why we love’,
‘you’ll never guess’, ‘strategies’, ‘ingredients’, ‘click here to learn more’,
‘what happened next’, ‘see’, ‘live’, ‘you won’t believe’, ‘the last’,
‘you can now’, ‘this is how’, ‘this is the’, ‘this is what’,
‘things you need’, ‘reasons why’
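The hand-crafted features listed above could be computed along these lines. This is a sketch under a few assumptions: the features are taken from the raw headline (before punctuation is stripped, otherwise the question-mark and exclamation-mark checks could never fire), `EXAGGERATED` is only a subset of the slide's list, and `QUESTION_WORDS` and the function name are hypothetical.

```python
import re

# Subset of the exaggerated phrases from the slides (shortened for the example).
EXAGGERATED = ("insane", "awesome", "amazing", "won't believe", "secret",
               "you'll never guess", "what happened next", "reasons why")
# Assumed list of question words; the slides do not enumerate them.
QUESTION_WORDS = {"who", "what", "when", "where", "why", "how", "which"}


def headline_features(headline):
    """Return the binary flags and word count for one raw headline."""
    text = headline.strip()
    lower = text.lower().replace("’", "'")  # normalize curly apostrophes
    words = lower.split()
    has_exaggerated = any(
        re.search(r"\b" + re.escape(phrase) + r"\b", lower) for phrase in EXAGGERATED
    )
    has_question_word = any(w.strip("?!.,:;\"'") in QUESTION_WORDS for w in words)
    return {
        "has_exaggerated": int(has_exaggerated),
        "has_question_word": int(has_question_word),
        "ends_with_question": int(text.endswith("?")),
        "ends_with_exclamation": int(text.endswith("!")),
        "starts_with_number": int(bool(re.match(r"\d", text))),
        "word_count": len(words),
    }


features = headline_features("17 Reasons Why You Won't Believe This!")
```

Matching on word boundaries rather than plain substrings avoids short entries such as ‘see’ firing on unrelated words like "seems".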
Exploratory Data Analysis
We analyze word frequencies to find patterns within clickbait and non-clickbait
headlines, and visualize them using word clouds. There is a clear contrast in the
type of words between the two categories. The clickbait word cloud features
numbers and vague words such as ‘actually’, ‘like’, ‘heres’, ‘need’ and ‘best’.
Exploratory Data Analysis
The non-clickbait word cloud features news- and fact-related words such as
‘president’, ‘election’, ‘coronavirus’ and ‘australian’. These tend to be less
catchy words.
Exploratory Data Analysis
We then analyze the word count feature and find that clickbait headlines tend to
be longer than non-clickbait headlines.
WORD FREQUENCY
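The word-frequency and word-count comparisons from the EDA slides can be approximated with the standard library alone. The helper names and the sample headlines below are hypothetical, and the word clouds themselves are not reproduced here; the frequency counts are simply the input such a visualization would need.

```python
from collections import Counter
from statistics import mean


def top_words(token_lists, k=5):
    """Return the k most frequent tokens across a list of tokenized headlines."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return counts.most_common(k)


def mean_word_count(headlines):
    """Return the average number of whitespace-separated words per headline."""
    return mean(len(h.split()) for h in headlines)


# Hypothetical examples of each class, for illustration only.
clickbait = ["17 Things You Actually Need This Summer", "Heres What Happened Next"]
news = ["President Wins Election", "Election Results Announced"]

print(top_words([h.lower().split() for h in news], k=1))
print(mean_word_count(clickbait), mean_word_count(news))
```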
Train the model
Naive Bayes, Random Forest, SVM and Logistic Regression classifiers are trained and
tested, and accuracy and recall are measured for each to evaluate performance.
To limit false negatives, recall is given more weight than accuracy. Note that a
non-clickbait headline classified as clickbait counts as a false negative only when
non-clickbait is treated as the positive class; with clickbait as the positive class,
it is a false positive and is measured by precision instead.
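For reference, accuracy and recall can be computed from the four confusion-matrix counts as sketched below. The function names are hypothetical, and the `positive` argument makes explicit which class recall is measured for, since the slides do not state it.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) with respect to the chosen positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of all predictions that are correct."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted positive: tp / (tp + fn)."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred, positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy labels (1 = clickbait, 0 = non-clickbait), for illustration only.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1]
print(accuracy(y_true, y_pred), recall(y_true, y_pred, positive=1))
```

Calling `recall(y_true, y_pred, positive=0)` gives the recall of the non-clickbait class, which is the quantity that penalizes non-clickbait headlines being flagged as clickbait.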
Train the model
From the tabulated results we can see that Naive Bayes performs best on this
dataset in terms of both accuracy and recall. The other models perform nearly as
well, but we choose Naive Bayes because it also runs faster than the others,
which is especially handy when the data scales up.
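To make the chosen model concrete, here is a from-scratch sketch of a multinomial Naive Bayes classifier with add-one smoothing. It is not the implementation behind the reported results (the slides do not name the library or settings used); the class name and the toy training data are hypothetical.

```python
import math
from collections import Counter


class TinyNaiveBayes:
    """Multinomial Naive Bayes over token lists, with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.classes for w in self.word_counts[c]}
        return self

    def _log_likelihood(self, word, c):
        total = sum(self.word_counts[c].values())
        return math.log((self.word_counts[c][word] + 1) / (total + len(self.vocab)))

    def predict(self, tokens):
        n_docs = sum(self.doc_counts.values())
        scores = {}
        for c in self.classes:
            score = math.log(self.doc_counts[c] / n_docs)  # class prior
            for w in tokens:
                if w in self.vocab:  # ignore words never seen in training
                    score += self._log_likelihood(w, c)
            scores[c] = score
        return max(scores, key=scores.get)


# Toy training set (1 = clickbait, 0 = non-clickbait), for illustration only.
docs = [
    ["you", "wont", "believe", "this"],
    ["reasons", "why", "you", "need", "this"],
    ["president", "wins", "election"],
    ["election", "results", "announced"],
]
labels = [1, 1, 0, 0]
model = TinyNaiveBayes().fit(docs, labels)
print(model.predict(["you", "wont", "believe"]))
```

Training only requires counting words per class, which is why this family of models stays fast as the dataset grows.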
Train the model
The top 15 coefficients for clickbait are as follows:
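The slides do not say how these coefficients were computed. One common way to rank words by how strongly they indicate clickbait is the log-odds of their smoothed per-class probabilities, sketched below as an assumption rather than as the method used here. The function name and the toy data are hypothetical.

```python
import math
from collections import Counter


def top_log_odds(clickbait_docs, other_docs, k=15):
    """Rank words by log P(word | clickbait) - log P(word | other), add-one smoothed."""
    cb = Counter(tok for doc in clickbait_docs for tok in doc)
    ot = Counter(tok for doc in other_docs for tok in doc)
    vocab = set(cb) | set(ot)
    cb_total = sum(cb.values()) + len(vocab)
    ot_total = sum(ot.values()) + len(vocab)
    scores = {
        w: math.log((cb[w] + 1) / cb_total) - math.log((ot[w] + 1) / ot_total)
        for w in vocab
    }
    # Sort by descending score, breaking ties alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Toy tokenized headlines, for illustration only.
clickbait_docs = [["you", "wont", "believe", "this"],
                  ["reasons", "why", "you", "need", "this"]]
other_docs = [["president", "wins", "election"],
              ["election", "results", "announced"]]
ranking = top_log_odds(clickbait_docs, other_docs, k=3)
```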
TAKEAWAY
Using machine learning algorithms, one can train a model to detect clickbait. As
the type of data online changes and grows, new data can be added to the training
dataset to build a better classifier.
This POC scored in the range of 90–93% on both accuracy and recall. Given these
results, the approach is a promising candidate for filtering out clickbait
headlines on larger datasets, and the model could be deployed on a web platform
to help weed out misleading headlines.
CREDITS: This presentation template was created by
Slidesgo, including icons by Flaticon, infographics &
images by Freepik and illustrations by Storyset
THANK
You.

Ppt Presentation on Clickbait Classifier - Anupama Kurudi
