Mahatma Jyotiba Phule Rohilkhand
University, Bareilly
Institute of Engineering and Technology
Department of Computer Science and Information Technology
(2023-24)
Project Report On
EMAIL SPAM DETECTION.
UNDER THE GUIDANCE OF –
Dr. BRAJESH KUMAR
Submitted By –
Aditya Gupta(220089020026)
Ramanand Kumar Gupt(220089020062)
Akshay Pratap Singh(220089020029)
Surendra(220089020068)
ACKNOWLEDGEMENT
We extend our sincere and heartfelt thanks to our esteemed guide, Dr. Brajesh Kumar, for
his exemplary guidance, monitoring, and constant encouragement throughout the course of
the project, and for showing us the right way at crucial junctures.
We would like to extend thanks to our respected Head of the division, [Link] Rishiwal for
allowing us to use the facilities available. We would like to thank other faculty members also.
Last but not least, we would like to thank our friends and family for the support and
encouragement they have given us during the course of our work.
We wish to express our thanks to all Teaching and Non-teaching staff members of the
Department of Computer Science and Information Technology who were helpful in many ways
for the completion of the project.
Aditya Gupta (21CS23)
Ramanand Kumar Gupt(21CS28)
Akshay Pratap Singh(21CS30)
Surendra(21CS59)
INDEX
1. INTRODUCTION
1.1 CONTEXT
1.2 MOTIVATION
1.3 OBJECTIVE
2. LITERATURE REVIEW
2.1 INTRODUCTION
2.2 RELATED WORK
3. METHODOLOGY
3.1 METHODOLOGY
4. ALGORITHM
4.1 ALGORITHM
5. FLOW CHART
5.1 FLOWCHART
6. DESCRIPTION OF DATASET
6.1 SPAM DETECTION USING MACHINE LEARNING
7. RESULT
8. CONCLUSION
9. REFERENCES
ABSTRACT
Communication plays a major role in both professional and personal life. Email
is used extensively because it is free or low-cost, accessible, and popular.
Email has one major security weakness: anyone can send a message to anyone
else simply by knowing their email address. Some businesses and ill-motivated
individuals exploit this weakness for advertising, phishing, other malicious
purposes, and fraud. The resulting category of email is called spam.
Spam refers to unsolicited email, such as advertisements and unrelated or
repetitive messages. The volume of such email is increasing day by day. Studies
show that around 55 percent of all emails are some kind of spam. Service
providers put a lot of effort into filtering it, but spam keeps evolving by
changing the obvious markers used for detection. Moreover, service providers
cannot classify too aggressively, because a misclassified legitimate email
means potential information loss.
To tackle this problem, we present a new and efficient method to detect spam
using machine learning and natural language processing, implemented as a tool
that can detect and classify spam. In addition, the tool provides information
about the supplied text in a quick-view format for user convenience.
1. INTRODUCTION
1.1 CONTEXT
Email spam detection is a crucial aspect of ensuring the security and efficiency of
email communication. Spam refers to unsolicited and often irrelevant or inappropriate
messages sent over the internet, typically to a large number of users, for advertising,
phishing, spreading malware, or other malicious purposes. Detecting and filtering out
spam emails is essential to protect users from potential security threats and to maintain
the integrity of email communication.
1.2 MOTIVATION
The motivation behind email spam detection projects lies in addressing
the concerns and challenges that come with the increasing volume of
spam email. It stems from a combination of user-centric, security,
resource optimization, compliance, and business continuity
considerations. The ongoing evolution of spamming techniques
reinforces the need for continuous improvement and innovation in spam
detection technologies.
1.3 OBJECTIVE
The primary objectives of email spam detection projects revolve around
enhancing the security, efficiency, and user experience associated with
email communication. These objectives are multifaceted, encompassing
security, user experience, resource optimization, compliance, and
adaptability to emerging threats. Together they aim to create a secure
and efficient email communication environment for both individuals and
organizations.
2. LITERATURE REVIEW
2.1 Introduction
This chapter reviews the machine learning classifiers used in previous research and
projects. It is not merely information gathering; it summarizes the prior research
related to this project. The review involved searching, reading, analyzing,
summarizing, and evaluating the reading materials relevant to the project.
A lot of research has been done on spam detection using machine learning. However,
because spam keeps evolving and technologies keep developing, the
proposed methods do not remain dependable. Natural language processing is one of
the less explored fields in machine learning, and this is reflected in the
comparatively small amount of work available.
2.2 Related work
Spam classification is a problem that is neither new nor simple. A lot of research has
been done and several effective methods have been proposed.
M. Raza, N. D. Jayasinghe, and M. M. A. Muslam analyzed various techniques
for spam classification and concluded that naïve Bayes and support vector machines
have higher accuracy than the rest, consistently around 91% [1].
S. Gadde, A. Lakshmanarao, and S. Satyanarayana, in their paper on spam detection,
concluded that an LSTM system achieved a higher accuracy of 98% [2].
P. Sethi, V. Bhandari, and B. Kohli concluded that machine learning algorithms perform
differently depending on the presence of different attributes [3].
H. Karamollaoglu, İ. A. Dogru, and M. Dorterler performed spam classification on
Turkish messages and emails using both naïve Bayes classification algorithms and
support vector machines and concluded that the accuracies of both models measured
around 90% [4].
P. Navaney, G. Dubey, and A. Rana compared the efficiency of the SVM, naïve
Bayes, and entropy methods; the SVM had the highest accuracy (97.5%) of the
three models [5].
S. Nandhini and J. Marseline K.S., in their paper on the best model for spam detection,
concluded that the random forest algorithm beats the others in accuracy and that KNN
beats them in building time [6].
S. O. Olatunji concluded that while SVM outperforms ELM in terms of
accuracy, ELM beats SVM in terms of speed [7].
M. Gupta, A. Bakliwal, S. Agarwal, and P. Mehndiratta studied classical machine learning
classifiers and concluded that a convolutional neural network outperforms the classical
machine learning methods by a small margin but takes more time for classification [8].
N. Kumar, S. Sonowal, and Nishant reported in their paper that the naïve Bayes
algorithm performs best but has class-conditional limitations [9].
T. Toma, S. Hassan, and M. Arifuzzaman studied various types of naïve Bayes
algorithms and proved that the multinomial naïve Bayes classification
algorithm has better accuracy than the rest with an accuracy of 98% [10].
F. Hossain, M. N. Uddin, and R. K. Halder in their study concluded that
machine learning models outperform deep learning models when it comes to
spam classification and ensemble models outperform individual models in
terms of accuracy and precision [11].
3. METHODOLOGY
Data cleaning
Data cleaning, also known as data preprocessing or data wrangling, is a crucial
step in the machine learning pipeline. It involves the identification and correction
of errors, inconsistencies, and missing values in the dataset to ensure that the data
is suitable for training machine learning models. Common data cleaning tasks
include handling missing values, removing duplicate records, and dropping or
renaming columns.
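The cleaning tasks described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the project's actual code: the sample rows, the `clean_rows` helper, and the choice to normalize labels to lowercase are all assumptions made for the example.

```python
# Hypothetical raw records: (label, text) pairs as they might come from a CSV export.
RAW_ROWS = [
    ("ham", "Are we still meeting at 5?"),
    ("spam", "WIN a FREE prize now!!!"),
    ("spam", "WIN a FREE prize now!!!"),   # exact duplicate
    ("ham", ""),                           # missing text
    (None, "Call me back"),                # missing label
    ("ham", "  See you tomorrow  "),       # stray whitespace
]


def clean_rows(rows):
    """Drop rows with missing fields, trim whitespace, and remove duplicates."""
    seen = set()
    cleaned = []
    for label, text in rows:
        if not label or not text or not text.strip():
            continue  # skip rows with a missing label or empty text
        record = (label.strip().lower(), text.strip())
        if record in seen:
            continue  # skip exact duplicates
        seen.add(record)
        cleaned.append(record)
    return cleaned


CLEANED = clean_rows(RAW_ROWS)
```

In a pandas-based workflow the same steps would typically be done on a DataFrame; the plain-Python version is used here only to keep the sketch dependency-free.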
Exploratory Data Analysis
EDA stands for Exploratory Data Analysis, and it is a critical step in the process
of understanding and preparing data for machine learning. EDA involves
examining and visualizing the dataset to gather insights, discover patterns, and
identify relationships among variables. This process helps in making informed
decisions about data preprocessing, feature engineering, and model selection.
Typical techniques include summary statistics and plots such as histograms, bar
plots, and word clouds.
Text preprocessing
Text preprocessing is a crucial step in preparing text data for machine learning
applications. It involves cleaning and transforming raw text into a format that is
suitable for analysis and model training. Common steps include converting the
text to lowercase, tokenization, and removing punctuation.
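A minimal sketch of such a preprocessing step is shown below, using only the Python standard library. The `preprocess` function and the tiny stop-word list are assumptions for illustration; stop-word removal in particular is an optional extra step, and a real project would more likely rely on the tokenizer and stop-word list provided by NLTK.

```python
import string

# A tiny illustrative stop-word list (an assumption for this sketch);
# a fuller list such as the one shipped with NLTK would normally be used.
STOP_WORDS = {"a", "an", "the", "is", "to", "you", "and", "of"}


def preprocess(text):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    return [tok for tok in tokens if tok not in STOP_WORDS]


TOKENS = preprocess("Congratulations! You have WON a FREE ticket to the finals.")
```

The resulting token list is what later steps (word counts, TF-IDF, or a classifier) would consume.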
Model building
Building a machine learning model involves selecting an appropriate algorithm,
preparing the data, training the model, and evaluating its performance.
Evaluation
Evaluation in machine learning involves assessing the performance of a model on
a dataset. The goal is to understand how well the model generalizes to new, unseen
data.
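To estimate how a model behaves on unseen data, part of the dataset is usually held out for testing. The sketch below shows one simple way to do this; the `train_test_split` helper written here, the 80/20 ratio, and the fixed seed are illustrative assumptions, not the project's actual settings.

```python
import random


def train_test_split(items, test_fraction=0.2, seed=42):
    """Shuffle a copy of the data and split it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical labelled messages used only to demonstrate the split.
MESSAGES = [("ham", f"message {i}") for i in range(8)] + [
    ("spam", f"offer {i}") for i in range(2)
]
TRAIN, TEST = train_test_split(MESSAGES, test_fraction=0.2)
```

The model is fitted on `TRAIN` only, and metrics are computed on `TEST`, which the model has never seen.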
Improvement
Improving machine learning models involves optimizing various aspects of the
model, the data, and the training process to enhance performance and
generalization.
Deployment
Deploying a machine learning model involves making the trained model
available for use in a real-world environment, where it can make predictions on
new, unseen data. The deployment process can vary depending on the type of
application (web service, mobile app, edge device) and the specific requirements
of the project.
4. ALGORITHM
The following algorithms are used in combination for classification.
K-Nearest Neighbors
KNN is a supervised classification algorithm. All the data points are assumed to
lie in an n-dimensional space, and the category of a new point is determined by the
majority category among its nearest neighbors. Euclidean distance is used to
measure the distance between points.
The distance between two points (x_1, y_1) and (x_2, y_2) is calculated as
$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$
The distances between the unknown point and all the other points are calculated.
The k closest neighbors are then selected, where k is the value provided, and the
category to which the majority of these neighbors belong is assigned to the
unknown point.
If the data contains up to 3 features, the points can be plotted and visualized.
Prediction is fairly slow compared to other algorithms such as SVM, because the
distance to every point must be computed to find the closest neighbors of the
given point.
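The procedure described above can be sketched as follows. This is a minimal illustration rather than the project's implementation: the two-dimensional feature values, the labels, and the function names are hypothetical.

```python
import math
from collections import Counter


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_predict(train_points, train_labels, query, k=3):
    """Label the query by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical 2-D features, e.g. (count of "free", count of "meeting").
POINTS = [(5, 0), (4, 1), (6, 0), (0, 3), (1, 4), (0, 5)]
LABELS = ["spam", "spam", "spam", "ham", "ham", "ham"]

PREDICTION = knn_predict(POINTS, LABELS, query=(4, 0), k=3)
```

Note that every prediction sorts all training points by distance, which is the source of the slowness mentioned above.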
Naïve Bayes Classifier
A naïve Bayes classifier is a supervised probabilistic machine learning model that is used
for classification tasks. The main principle behind this model is Bayes' theorem.
Bayes Theorem:
$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$
Naive Bayes is a classification technique based on Bayes' theorem with the
assumption that all the features used to predict the target value are independent of
each other. It calculates the probability of each class and then picks the one with the
highest probability.
Although the independence assumption rarely holds in real-world data, the classifier
often works well in practice. This assumption is the reason it is called "naive" [14].
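As an illustration of how Bayes' theorem is applied to text, the sketch below implements a small multinomial naive Bayes classifier over word counts. It is a simplified example under stated assumptions: the training documents are made up, the function names are hypothetical, and add-one (Laplace) smoothing is one common choice that the report itself does not specify.

```python
import math
from collections import Counter


def train_naive_bayes(docs):
    """docs: list of (label, tokens). Returns (priors, word_counts, vocab)."""
    class_counts = Counter(label for label, _ in docs)
    word_counts = {label: Counter() for label in class_counts}
    for label, tokens in docs:
        word_counts[label].update(tokens)
    vocab = {tok for _, tokens in docs for tok in tokens}
    priors = {label: n / len(docs) for label, n in class_counts.items()}
    return priors, word_counts, vocab


def predict(tokens, priors, word_counts, vocab):
    """Pick the class with the highest log posterior (up to a constant)."""
    scores = {}
    for label, prior in priors.items():
        total = sum(word_counts[label].values())
        score = math.log(prior)
        for tok in tokens:
            # Add-one smoothing avoids a zero probability for unseen words.
            score += math.log((word_counts[label][tok] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical, already tokenized training messages.
DOCS = [
    ("spam", ["win", "free", "prize"]),
    ("spam", ["free", "offer", "now"]),
    ("ham", ["meeting", "at", "noon"]),
    ("ham", ["see", "you", "at", "lunch"]),
]
MODEL = train_naive_bayes(DOCS)
```

Calling `predict(["free", "prize"], *MODEL)` scores each class by its prior times the smoothed word likelihoods and returns the more probable label. Log probabilities are summed instead of multiplying raw probabilities to avoid numerical underflow on long messages.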
Extra Trees Classifier(ETC)
Extra Trees is an ensemble learning method that is similar to Random Forests. It builds
multiple decision trees and merges their predictions. The main difference lies in the way
the trees are constructed.
In Extra Trees, each decision tree is built from the entire dataset using random thresholds
for feature splits. This randomness helps to make the algorithm more robust and less
prone to overfitting.
Random Forest Classifier
Random Forest classifier is a supervised ensemble algorithm. A random forest consists
of multiple random decision trees. Two types of randomness are built into the trees.
First, each tree is built on a random sample from the original data. Second, at each tree
node, a subset of features is randomly selected to generate the best split [16].
Several decision trees are created on subsets of the data, and the result given by
the majority of the trees is taken as the final result. The number of trees to
create is determined iteratively, based on accuracy and other metrics. Random
forest classifiers are mainly used on condition-based data, but they also work for
text once the text is converted into numerical form.
Decision Tree
The decision tree is a classification algorithm based completely on features. The
tree repeatedly splits the data on the feature with the best information gain. This
process continues until further splits no longer improve the information gain.
Unknown data is then evaluated feature by feature until it is categorized. Tree
pruning techniques are used to improve accuracy and reduce overfitting.
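The information-gain criterion that a decision tree uses to choose a split can be sketched as below. The example assumes binary (0/1) features and entropy as the impurity measure; the feature meanings and sample rows are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature_index):
    """Entropy reduction obtained by splitting on a binary (0/1) feature."""
    left = [lab for row, lab in zip(rows, labels) if row[feature_index] == 0]
    right = [lab for row, lab in zip(rows, labels) if row[feature_index] == 1]
    weighted = sum(
        len(part) / len(labels) * entropy(part) for part in (left, right) if part
    )
    return entropy(labels) - weighted


# Hypothetical binary features per message: (contains "free", contains "hello").
ROWS = [(1, 0), (1, 1), (1, 0), (0, 1), (0, 0), (0, 1)]
LABELS = ["spam", "spam", "spam", "ham", "ham", "ham"]

GAIN_FREE = information_gain(ROWS, LABELS, 0)
GAIN_HELLO = information_gain(ROWS, LABELS, 1)
```

In this toy data the first feature separates the two classes perfectly, so it has the larger gain and would be chosen for the first split.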
Support Vector Machines (SVM)
SVM is a supervised machine learning algorithm for classification. It draws
decision boundaries between the categories, and the category of a point is
determined by the side of the boundary on which the point falls.
AdaBoost
AdaBoost, short for Adaptive Boosting, is an ensemble learning algorithm used in
machine learning for classification and regression tasks. It was introduced by Yoav
Freund and Robert Schapire in 1996. The primary idea behind AdaBoost is to
combine the predictions of weak learners (usually simple models) to create a
strong learner that performs well on the overall task.
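How the weak learners are combined can be illustrated with a single boosting round. The sketch below follows the standard discrete AdaBoost update for labels in {-1, +1}; the labels, predictions, and function name are hypothetical, and the code assumes the weighted error lies strictly between 0 and 1.

```python
import math


def adaboost_round(weights, predictions, labels):
    """One AdaBoost round for labels in {-1, +1}: returns (alpha, new_weights)."""
    # Weighted error of the weak learner under the current sample weights.
    error = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    # Learner weight: larger when the weighted error is smaller.
    # Assumes 0 < error < 1; otherwise the logarithm is undefined.
    alpha = 0.5 * math.log((1 - error) / error)
    # Raise the weights of misclassified samples, lower the rest, then normalize.
    updated = [
        w * math.exp(-alpha * y * p) for w, p, y in zip(weights, predictions, labels)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]


LABELS = [+1, +1, -1, -1, +1]        # +1 = spam, -1 = ham (hypothetical)
PREDICTIONS = [+1, +1, -1, -1, -1]   # the weak learner misses the last sample
WEIGHTS = [0.2] * 5                  # uniform starting weights

ALPHA, NEW_WEIGHTS = adaboost_round(WEIGHTS, PREDICTIONS, LABELS)
```

After the update the misclassified sample carries more weight, so the next weak learner is pushed to focus on it. The final strong learner takes the sign of the alpha-weighted sum of all weak predictions.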
Logistic Regression
Logistic Regression is a supervised machine learning algorithm that can be used
to model the probability of a certain class or event. It is used when the data is
linearly separable and the outcome is binary or dichotomous [17]. The
probabilities are calculated using a sigmoid function.
For example, consider a problem where the data has n features. We need to fit a
linear function to the given data, which can be represented by the equation
$z = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_n x_n$
Here z is the logarithm of the odds (the log-odds), and the odds are calculated as
$\text{odds} = \frac{p(\text{event occurring})}{p(\text{event not occurring})}$
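The relationship between the linear score z, the sigmoid function, and the odds can be checked numerically. In the sketch below the coefficients and feature values are made up purely for illustration; it shows that applying the sigmoid to z gives a probability whose log-odds is z again.

```python
import math


def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of the positive class for one feature vector."""
    z = bias + sum(b * x for b, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical coefficients b_1..b_n and intercept b_0, chosen only for illustration.
WEIGHTS = [1.5, -2.0]
BIAS = -0.5

P = predict_proba([2.0, 0.5], WEIGHTS, BIAS)  # z = -0.5 + 3.0 - 1.0 = 1.5
ODDS = P / (1.0 - P)
LOG_ODDS = math.log(ODDS)  # recovers z, up to floating-point error
```

A message would typically be classified as spam when `P` exceeds a chosen threshold such as 0.5, which corresponds to z greater than 0.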
Gradient Boosting Decision Trees (GBDT)
Gradient Boosting Decision Trees (GBDT) is an ensemble learning algorithm used
for both classification and regression tasks. It is a popular and powerful machine
learning technique that builds a strong predictive model by combining the
predictions of multiple weak learners, typically decision trees.
5. FLOW CHART
5.1 FLOWCHART
6. DESCRIPTION OF DATASET
Modules:
NumPy
NumPy is a powerful, open-source library for the Python programming language that
provides support for large, multi-dimensional arrays and matrices of numerical data, as
well as a large collection of mathematical functions to operate on these arrays. It is
widely used in scientific computing, data analysis, machine learning, and other related
fields. One of the main features of NumPy is its n-dimensional array object, which is
used to store and manipulate large arrays of numerical data.
Pandas
Pandas allows us to analyze large datasets and draw conclusions based on statistical
theory. It can clean messy data sets and make them readable and relevant, and
relevant data is very important in data science.
Scikit-Learn
Scikit-learn, also known as sklearn, is a Python library for implementing machine
learning models and statistical modelling. Through scikit-learn, we can implement
various machine learning models for regression, classification, and clustering, as
well as statistical tools for analyzing these models.
Matplotlib
Matplotlib is a comprehensive library for creating static, animated, and interactive
visualizations in Python. Matplotlib makes easy things easy and hard things possible.
It can create publication-quality plots and interactive figures that can zoom, pan,
and update.
Pyplot
pyplot is a collection of command-style functions that make Matplotlib work like
MATLAB. Each pyplot function makes some change to a figure: e.g., creates a figure,
creates a plotting area in a figure, plots some lines in a plotting area, decorates the plot
with labels, etc.
NLTK (Natural Language Toolkit)
NLTK (Natural Language Toolkit) is a widely used library for NLP (Natural Language
Processing) with Python. It is a powerful tool for preprocessing text data before
further analysis, for instance with ML models. It helps prepare text so that it can be
converted into numbers, which the model can then easily work with.
6.1 SPAM DETECTION USING MACHINE LEARNING
• A dataset from Kaggle is used to train the algorithms.
[Figure: the dataset]
• The dataset has many fields, and some of its columns are not required. We
remove the columns that are not required and rename the remaining
columns.
[Figure: ... of dataset]
• NLTK (Natural Language Toolkit) is used for text processing, Matplotlib for
plotting graphs such as histograms and bar plots, Word Cloud for presenting
text data, pandas for data manipulation and analysis, and NumPy for
mathematical and scientific operations. These are the packages used in the
proposed model.
[Figure: packages used in the proposed model]
• Split the data into training and testing sets. Some percentage of the
dataset is used as the training set and the rest as the test set.
[Figure: splitting the dataset]
• Reset the train and test indices, as shown in Fig. 6.
Fig. 6. Reset train and test index
• We need to find the most repeated words in the spam and ham
messages. The Word Cloud library is used for this.
• Every incoming message must first be preprocessed. All the input
characters are converted to lowercase.
• The text is then split into small pieces and the punctuation is removed.
Tokenization is used both to split the messages and to remove
punctuation.
• We need to find the probability of each word in spam and ham messages.
Fig. 10. Ham and spam probability
• Plot the histogram graphs.
[Figure: histogram graph]
[Figure: histogram graph]
• Exploratory data analysis (EDA)
[Figure: ... and Visualization]
When a message is received in the inbox, it is exported to the dataset, and the
message is then classified as spam or not spam.
Accuracy:
Accuracy is a metric that measures how often a machine learning model
correctly predicts the outcome.
Precision:
Precision is one indicator of a machine learning model's
performance: the quality of the positive predictions made by the
model, that is, the fraction of predicted positives that are actually positive.
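Both metrics can be computed directly from the true and predicted labels. The sketch below is illustrative only: the label lists are made up, "spam" is assumed to be the positive class, and returning 0.0 when nothing is predicted positive is a convention chosen for this example.

```python
def accuracy(y_true, y_pred):
    """Fraction of all predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def precision(y_true, y_pred, positive="spam"):
    """Fraction of predicted positives that are truly positive."""
    predicted_pos = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not predicted_pos:
        return 0.0  # convention chosen here when nothing is predicted positive
    return sum(1 for t in predicted_pos if t == positive) / len(predicted_pos)


# Hypothetical true labels and model predictions for eight messages.
Y_TRUE = ["spam", "ham", "ham", "spam", "ham", "ham", "spam", "ham"]
Y_PRED = ["spam", "ham", "spam", "spam", "ham", "ham", "ham", "ham"]

ACC = accuracy(Y_TRUE, Y_PRED)    # 6 of 8 predictions are correct
PREC = precision(Y_TRUE, Y_PRED)  # 2 of 3 predicted spam are really spam
```

For spam filtering, precision matters because a low value means legitimate (ham) messages are being flagged as spam.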
7. RESULT
8. CONCLUSION
From the results obtained, we can conclude that an ensemble machine learning model is
more effective in the detection and classification of spam than any individual algorithm.
We can also conclude that the TF-IDF (term frequency-inverse document frequency)
language model is more effective than the bag-of-words model in the classification of
spam when combined with several algorithms. Finally, we can say that spam detection
can improve if machine learning algorithms are combined and tuned to the needs at hand.
8.1 Future work
Machine learning and natural language processing have numerous applications, and
when combined they can solve some of the most troubling problems concerning
text. This application can be scaled to take in text in bulk so that classification can be
done more effectively on public sites. Other categories, such as negative, phishing,
and malicious content, can be used to train the model to filter things such as public
comments on various social sites. The application can also be converted into an online
machine learning system that is easily updated with the latest trends in spam and other
mail, so that the system can adapt to new types of spam emails and texts.
9. REFERENCES
[1] T. Toma, S. Hassan, and M. Arifuzzaman, "An Analysis of Supervised Machine Learning
Algorithms for Spam Email Detection," in International Conference on Automation,
Control and Mechatronics for Industry 4.0 (ACMI), 2021.
[2] S. Nandhini and J. Marseline K.S., "Performance Evaluation of Machine Learning
Algorithms for Email Spam Detection," in International Conference on Emerging Trends
in Information Technology and Engineering (ic-ETITE), 2020.
[3] S. Gadde, A. Lakshmanarao, and S. Satyanarayana, "SMS Spam Detection using Machine
Learning and Deep Learning Techniques," in 7th International Conference on Advanced
Computing and Communication Systems (ICACCS), 2021.
[4] P. Sethi, V. Bhandari, and B. Kohli, "SMS spam detection and comparison of various
machine learning algorithms," in International Conference on Computing and
Communication Technologies for Smart Nation (IC3TSN), 2017.
[5] P. Navaney, G. Dubey, and A. Rana, "SMS Spam Filtering Using Supervised Machine
Learning Algorithms," in 8th International Conference on Cloud Computing, Data Science &
Engineering (Confluence), 2018.
[6] S. O. Olatunji, "Extreme Learning Machines and Support Vector Machines models
for email spam detection," in IEEE 30th Canadian Conference on Electrical and
Computer Engineering (CCECE), 2017.
[7] N. Kumar, S. Sonowal, and Nishant, "Email Spam Detection Using Machine Learning
Algorithms," in Second International Conference on Inventive Research in Computing
Applications (ICIRCA), 2020.
[8] R. Madan, "[Link]," [Online]. Available: [Link]
vidhya/tf-idf-term-frequency-technique-easiest-explanation-for-text-classification-in-nlp-with-code-8ca3912e58c3.
[9] M. Raza, N. D. Jayasinghe, and M. M. A. Muslam, "A Comprehensive Review on Email
Spam Classification using Machine Learning Algorithms," in International Conference on
Information Networking (ICOIN), 2021.
[10] M. Gupta, A. Bakliwal, S. Agarwal, and P. Mehndiratta, "A Comparative Study of Spam
SMS Detection Using Machine Learning Classifiers," in Eleventh International Conference
on Contemporary Computing (IC3), 2018.