
This full-text paper was peer-reviewed and accepted to be presented at the IEEE ICCSP 2015 conference.

Content Based Spam Detection in Email using Bayesian Classifier

Sunil B. Rathod, Tareek M. Pattewar

Sunil B. Rathod is a PG student with the Department of Computer Engineering, North Maharashtra University, SES's R. C. Patel Institute of Technology, Shirpur, India (e-mail: sunilrathod.rathodO [email protected]).
Tareek M. Pattewar is an Assistant Professor with the Department of Computer Engineering, North Maharashtra University, SES's R. C. Patel Institute of Technology, Shirpur, India (e-mail: [email protected]).

978-1-4799-8081-9/15/$31.00 © 2015 IEEE

Abstract- The Internet provides email as a means of data communication, and email messaging has become an essential service. Hackers, phishers and other malicious attackers frequently use email services to attempt fraud and deception. They use emails to obtain the personal credentials of users for financial gain. Emails with seemingly genuine content may include phishing URLs intended to steal useful data; such emails are nothing but spam. To detect and filter such emails, a Bayesian classifier is used for email classification and detection of spam mails, and its performance is evaluated in terms of Accuracy, Error, Time, Precision and Recall.

Index Terms- Bayesian Classifier, Email Classification, Spam Filtering.

I. INTRODUCTION

Email services are becoming popular as a means of information communication over the Internet. This facility has also created trouble for users through electronic junk mails, called spam mails. Spam mails are sent to many users in bulk as advertising mails or claim mails. Some mails contain genuine content together with malicious URLs; these are called phishing mails. Viruses and malicious attacks are also spread through spam mails. Spam mails are also referred to as Unsolicited Bulk Emails; another category is Unsolicited Commercial Emails. Hackers, phishers and malicious attackers frequently use email services to send false messages through which targeted users can lose their money and social reputation. Such attacks result in the attackers gaining personal credentials such as credit card numbers, passwords and other confidential data.
To prevent this, users should follow these rules: do not reply to spam mails, do not click links or URLs in emails, and do not post your email IDs on unrecognized web sites.
Content Based Spam Filter:
A content based filter checks the text in the body of the email, then the URLs, and also considers mail header fields such as the subject for classification of text. It performs the text classification task by preprocessing the text (HTML tag removal, stopword removal, tokenizing and word frequency calculation) to determine word probabilities and decide whether a given mail is spam or not.
The rest of the paper is organized as follows. Section II describes related work on spam classification and filtering methods. Section III discusses the algorithm study: the Bayesian classifier algorithm and the evaluation criteria used in our work. Section IV elaborates the experiment, which consists of the training dataset, preprocessing, application of the Bayesian classifier, the testing dataset and performance evaluation. Section V discusses the results we derived from the performance measurement parameters. Finally, Section VI concludes the paper.

II. RELATED WORK

Existing work includes an implementation for detecting malicious URLs in email by Dhanalakshmi R. and Chellappan C. (2013). They considered age of domain, host based features, lexical features and page rank for analysis of URLs, classifying them as malicious or legitimate. They used a Bayesian classifier to improve accuracy with reduced feature sets and considered the PhishTank dataset. The work was restricted to URLs in email only [1].
Sahami et al. (1998) gave a spam classification method using a Bayesian approach. A Bayesian classifier is a statistical classifier that computes probabilities under an independence assumption. They considered the content of e-mail together with domain-specific features and showed that accuracy can be increased [2].
V. Christina et al. showed that the need for effective spam filters is increasing. They discussed spam, spam filtering methods and their associated problems [3].
Sadeghian A. et al. presented spam detection based on interval type-2 fuzzy sets. This system gives the user more control over categories of spam and permits personalization of the spam filter [4].
Congfu Xu et al. derived a feature extraction method based on Base64 encoding of images with an n-gram technique. These features are shown to be effective and efficient in distinguishing spam images from legitimate images by training an SVM.
Experimental results show that it has prominent performance for classification of spam images in terms of accuracy, precision and recall [5].
Man Qi et al. explored two main semantic methods: Bayesian algorithms and Support Vector Machines (SVM). The paper discusses recent spam filters that utilize semantic analysis information to identify spam messages [6].
Zhan Chuan, Lu Xian-liang et al. presented an application to anti-spam email using a new improved Bayesian-based e-mail filter. They used vector weights to represent word frequency, adopted attribute selection based on word entropy, and deduced the corresponding formula. They show that their filter clearly improves overall performance [7].
Holly Esquivel et al. focused on the pre-acceptance filtering mechanism IP reputation. They first classify SMTP senders into three main categories (legitimate servers, end-hosts and spam gangs) and empirically study the limits of effectiveness of IP reputation filtering for each category [8].
Georgios Paliouras et al. presented "Learning to Filter Spam E-Mail". They investigated the performance of two machine learning algorithms in the context of anti-spam filtering by comparing a Naive Bayesian and a memory-based approach. They determined the performance of naive Bayes on a publicly available corpus and compared it with the alternative memory-based learning approach; both methods improved spam filtering accuracy over the keyword based filters that are widely used for email [9].
Gary Robinson proposed a computation of the probability of spam mail and legitimate mail [10].

III. ALGORITHM STUDY

A. Bayesian Classifier
The Naive Bayes classifier is a popular statistical classifier known for email filtering. It uses a text classification method to identify spam mails: Naive Bayes uses the tokens (words) of spam and ham mails to calculate the probabilities that determine whether a mail is spam or not.
Mathematical Formulation:
The Bayesian classifier is based on Bayes' theorem. Despite its simplicity, Naive Bayes can often outperform more sophisticated classification methods. To demonstrate the concept, consider the following equations [11]:

Prior probability of legitimate mail = Number of legitimate mails / Total number of mails. (1)

Prior probability of spam mail = Number of spam mails / Total number of mails. (2)

Likelihood of X-mail given legitimate = Number of legitimate mails in the vicinity of X-mail / Total number of legitimate mails. (3)

Likelihood of X-mail given spam = Number of spam mails in the vicinity of X-mail / Total number of spam mails. (4)

Posterior probability of X-mail being legitimate = Prior probability of legitimate mail × Likelihood of X-mail given legitimate. (5)

Posterior probability of X-mail being spam = Prior probability of spam mail × Likelihood of X-mail given spam. (6)

Finally, we classify X-mail as spam when that class membership achieves the largest posterior probability.

B. Evaluation Criteria
Since we formulate spam detection as a Bayesian classification problem, each mail falls into one of four possible scenarios: true positive (TP), true negative (TN), false positive (FP) or false negative (FN). Accuracy is the fraction of correctly classified instances (TP and TN), and Error is the fraction of incorrectly classified instances (FP and FN). The error rate may be of limited interest in our context, where data sets are unbalanced. We therefore additionally report standard measures such as precision and recall.

IV. EXPERIMENT

Existing systems mainly work on mail headers such as the subject, the body of the mail and the mailing address, but we deal only with the body of the mail, which is evaluated based on its content. The content based filter checks the information in the body of the mail, considering subjects and URLs, for acceptance or rejection and for classification of the mail by content into spam and legitimate mail. The method is shown in Fig. 1.
A. Training:
We use a mail dataset collected from Gmail, which consists of spam mails and legitimate mails. These mails are taken as input in HTML format for preprocessing.

B. Preprocessing:
1) HTML Tag Removal:
The input emails are in HTML format and therefore contain tags; to obtain pure text, the tags must be removed.
2) Stopword Removal:
Words on a stopword list are removed. The list consists of terms including articles, prepositions, conjunctions and certain high frequency words (such as some verbs and adverbs).
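The HTML tag removal and stopword removal steps described above can be illustrated with a short Python sketch. This is not the authors' implementation: the stopword list `STOPWORDS`, the helper names `remove_html_tags` and `remove_stopwords`, and the sample mail are assumptions made for illustration, and a plain whitespace split stands in for the tokenizer.

```python
from html.parser import HTMLParser

# Hypothetical, deliberately tiny stopword list; the paper does not give its list.
STOPWORDS = {"a", "an", "the", "and", "or", "of", "in", "on", "to", "is", "are"}


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML document, dropping all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def remove_html_tags(html_mail):
    """Step 1: strip HTML tags and return the remaining text, whitespace-normalized."""
    extractor = _TextExtractor()
    extractor.feed(html_mail)
    extractor.close()
    # Join the text fragments and collapse runs of whitespace into single spaces.
    return " ".join(" ".join(extractor.parts).split())


def remove_stopwords(text):
    """Step 2: drop stopwords (case-insensitive) after a plain whitespace split."""
    return [word for word in text.split() if word.lower() not in STOPWORDS]


mail = "<html><body><p>Claim the <b>free</b> prize in an hour</p></body></html>"
plain = remove_html_tags(mail)
print(plain)
print(remove_stopwords(plain))
```

A real mail corpus would need more care than this sketch takes, for example skipping the contents of script and style elements and using a fuller stopword list.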


Fig. 1. Content Based Spam Detection in Email using Bayesian Classifier

3) Tokenization:
Lexical analysis, also called tokenizing, divides the text content into strings of characters called tokens. Tokenizing removes white space (blanks) and punctuation symbols.
4) Word Frequency:
This step counts the frequency of each word according to its occurrences. It helps in deriving the word probabilities for spam and legitimate mails.

C. Bayesian Classifier:
This is a method for classification of text that gives an efficient learning algorithm for data mining. It uses Bayes' theorem together with the conditional independence assumption:

P(spam | word) = [P(word | spam) P(spam)] / P(word)

Using the spam probability of words, it evaluates spam and legitimate mails for classification and then gives the performance measurement.
D. Testing Dataset:
The testing dataset is also derived from Gmail and consists of spam and legitimate mails. It also needs to be preprocessed to give pure text; classification is then done using the Bayesian classifier. Finally, correctly classified instances (mails) and incorrectly classified instances (mails) are evaluated.
E. Performance Measurement:
Once the classification model is built, it is essential to derive its performance on the basis of parameters such as Accuracy (correctly classified instances), Error (incorrectly classified instances), Precision and Recall.

Accuracy = (TN + TP) / (TN + TP + FN + FP)

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

Where,
TN: True Negative, legitimate predicted as legitimate
TP: True Positive, spam predicted as spam
FP: False Positive, legitimate predicted as spam
FN: False Negative, spam predicted as legitimate.

V. RESULT

A. Computation of Bayesian classifier efficiency under different data volume:
We perform experiments on different volumes of training and testing datasets from Gmail. The training and testing datasets are of 1000 mails, 1500 mails and 2100 mails, and we measure the performance in terms of Accuracy, Error, Time, Precision and Recall. Fig. 2 shows the Accuracy for the system architecture defined in Fig. 1 under different volumes of dataset. Similarly, the Error rate is shown in Fig. 3, and the time needed to perform the classification and filtering is shown in Fig. 4. Precision and Recall are shown in Fig. 5 and Fig. 6 respectively. Finally, the performance measurement is summarized in Table I.

Fig. 2. Derived Accuracy on different volume of dataset
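To make the classification step of Sections III.A and IV.C concrete, the following Python sketch trains a word-based naive Bayes model on a toy corpus and picks the class with the largest posterior. It is a minimal illustration under stated assumptions, not the authors' implementation: the names `train` and `classify`, the toy mails, the use of log probabilities, and add-one (Laplace) smoothing are all assumptions, and the likelihood is taken as a product of per-word probabilities under the conditional independence assumption rather than the "vicinity" counts of equations (3) and (4).

```python
import math
from collections import Counter


def train(mails):
    """Build a model from (tokens, label) pairs, where label is 'spam' or 'legit'."""
    class_counts = Counter(label for _, label in mails)
    word_counts = {"spam": Counter(), "legit": Counter()}
    for tokens, label in mails:
        word_counts[label].update(tokens)  # word frequency per class
    vocabulary = set(word_counts["spam"]) | set(word_counts["legit"])
    return class_counts, word_counts, vocabulary


def classify(tokens, model):
    """Return the class with the largest (log) posterior for the given tokens."""
    class_counts, word_counts, vocabulary = model
    total_mails = sum(class_counts.values())
    scores = {}
    for label in ("spam", "legit"):
        # Prior: number of mails of this class / total number of mails.
        score = math.log(class_counts[label] / total_mails)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            # Assumed add-one smoothing so unseen words do not zero the product.
            likelihood = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        scores[label] = score  # log(prior x likelihood)
    return max(scores, key=scores.get)


# Toy training data, invented for illustration only.
training_mails = [
    (["win", "money", "now"], "spam"),
    (["free", "money", "offer"], "spam"),
    (["meeting", "schedule", "today"], "legit"),
    (["project", "report", "attached"], "legit"),
]
model = train(training_mails)
print(classify(["free", "money"], model))
print(classify(["project", "meeting"], model))
```

Working in log space avoids numerical underflow when a mail contains many tokens; the decision is unchanged because the logarithm is monotonic.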


Fig. 3. Derived Error Rate on different volume of dataset

Fig. 4. Time taken on different volume of dataset

Fig. 5. Precision on different volume of dataset

Fig. 6. Recall on different volume of dataset

TABLE I
PERFORMANCE MEASUREMENT
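The performance measures used in this paper follow directly from the four confusion counts TP, TN, FP and FN. The sketch below shows one way to compute them in Python; the function names `confusion_counts` and `performance`, the label strings and the example predictions are assumptions for illustration. Error is computed as (FP + FN) / total, following the paper's description of it as the fraction of wrongly classified instances.

```python
def confusion_counts(actual, predicted, positive="spam"):
    """Count TP, TN, FP and FN, treating `positive` (spam) as the positive class."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1  # spam predicted as spam
        elif a != positive and p != positive:
            tn += 1  # legitimate predicted as legitimate
        elif a != positive and p == positive:
            fp += 1  # legitimate predicted as spam
        else:
            fn += 1  # spam predicted as legitimate
    return tp, tn, fp, fn


def performance(tp, tn, fp, fn):
    """Accuracy, Error, Precision and Recall from the four confusion counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "error": (fp + fn) / total,
        # Guard against division by zero when no mail is predicted or labeled spam.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Invented labels for five mails, for illustration only.
actual = ["spam", "spam", "spam", "legit", "legit"]
predicted = ["spam", "spam", "legit", "legit", "spam"]
counts = confusion_counts(actual, predicted)
print(counts)
print(performance(*counts))
```

Because spam corpora are often unbalanced, precision and recall are usually more informative than accuracy alone, which matches the remark in Section III.B.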


VI. CONCLUSION

We have emphasized a Bayesian approach for classifying spam and legitimate mails using supervised learning over the extracted features. Applying the Bayesian classifier, we experimentally demonstrated that spam mails can be detected with an accuracy of more than 96.46% on real-world Gmail data sets. Once trained on the mail dataset, the classifier effectively detects potential spam mails and thus helps Internet users avoid them.
As future work, we will integrate this content based spam detection system with malicious URL detection to improve the accuracy of the system in detecting spam mails and malicious URLs.

ACKNOWLEDGMENT

We are sincerely grateful to all the persons who helped us through this work to make it successful.

REFERENCES

[1] Dhanalakshmi Ranganayakulu and Chellappan C., "Detecting malicious URLs in E-Mail - An implementation", AASRI Conference on Intelligent Systems and Control, Vol. 4, 2013, pp. 125-131.
[2] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, "A Bayesian approach to filtering junk e-mail", AAAI Tech. Rep. WS-98-05, pp. 55-62, 1998.
[3] V. Christina, "A study on email spam filtering techniques", International Journal of Computer Applications, Vol. 12, No. 1, 2010.
[4] Sadeghian, A. and Ariaeinejad, R., "Spam detection system: A new approach based on interval type-2 fuzzy sets", IEEE CCECE, 000379, 2011.
[5] Congfu Xu, Yafang Chen, Kevin Chiew, "An approach to image spam filtering based on base64 encoding and N-Gram feature extraction", IEEE International Conference on Tools with Artificial Intelligence, DOI 10.1109/ICTAI.2010.31, 2010.
[6] Man Qi, Mousoli, R., "Semantic analysis for spam filtering", International Conference on Fuzzy Systems and Knowledge Discovery, Vol. 6, pp. 2914-2917, 2010.
[7] Zhan Chuan, Lu Xian-liang, Zhou Xu, Hou Meng-shu, "An Improved Bayesian with Application to Anti-Spam Email", Journal of Electronic Science and Technology of China, Mar. 2005, Vol. 3, No. 1.
[8] Holly Esquivel and Aditya Akella, "On the effectiveness of IP reputation for spam filtering", IEEE International Conference on Communication Systems and Networks, DOI: 10.1109/COMSNETS.2010.5431981, pp. 1-10, 2010.
[9] I. Androutsopoulos, G. Paliouras, V. Karkaletsis, G. Sakkis, C.D. Spyropoulos, and P. Stamatopoulos, "Learning to filter spam e-mail: A comparison of a naive Bayesian and a memory-based approach", Proceedings of the Workshop on Machine Learning and Textual Information Access, 4th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2000), pp. 1-13, 2000.
[10] G. Robinson. (2014, Oct.). "A statistical approach to the spam problem", 2003. [Online]. Available: http://www.linuxjoumal.comlarticle. php?sid=6467
[11] Naive Bayes Classifier. (2014, Dec.). [Online]. Available: http://www.statsoft. comltextbooklnaive-bayes-classifier
